# Part Failure Prediction: Data Exploration

**Objective:** To understand the dataset, identify patterns, and check for issues like missing values or class imbalance before building the model.

### 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_style('whitegrid')

# Load the dataset
df = pd.read_csv('../data/service_history.csv')

### 2. Initial Data Inspection

In [None]:
print("First 5 rows of the dataset:")
display(df.head())

print("\nDataset Information:")
df.info()

print("\nNumerical Summary:")
display(df.describe())
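
The objective mentions missing values, and `df.info()` only reports non-null counts. A per-column report makes the gaps easier to see. The sketch below is illustrative: `missing_value_report` is a hypothetical helper, and the toy frame with made-up part IDs stands in for `df` so the example runs on its own.

```python
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    report = pd.DataFrame({
        "missing_count": frame.isna().sum(),
        "missing_pct": frame.isna().mean() * 100,
    })
    return report.sort_values("missing_count", ascending=False)


# Toy data for illustration only; in the notebook, pass `df` instead.
toy = pd.DataFrame({
    "part_id": ["P-A", "P-B", None, "P-A"],
    "time_in_service_days": [120.0, None, 300.0, None],
    "part_failed": [0, 1, 0, 1],
})

report = missing_value_report(toy)
print(report)

# Fully duplicated rows are another common data-quality issue worth counting.
print(f"Duplicate rows: {toy.duplicated().sum()}")
```

In the notebook itself, `missing_value_report(df)` would give the same table for the real service history.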

### 3. Check Target Variable Distribution (Class Imbalance)

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='part_failed', data=df)
plt.title('Distribution of Target Variable (part_failed)')
plt.xlabel('Part Failed (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

# Percentage of records in each class (0 = no failure, 1 = failure)
class_pct = df['part_failed'].value_counts(normalize=True) * 100
print(f"Class distribution (%):\n{class_pct.round(1)}")

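**Observation:** The dataset is imbalanced: failures (class 1) make up about 25% of the records. Model training should account for this, for example with a stratified train/test split, class weights, and metrics such as precision and recall rather than accuracy alone.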
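
One common way to account for imbalance during training is to weight each class inversely to its frequency. The sketch below computes such weights with pandas only, using the formula `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn applies for `class_weight='balanced'`. The helper name `balanced_class_weights` and the toy target are assumptions made for illustration; they are not part of this notebook.

```python
import pandas as pd


def balanced_class_weights(target: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = target.value_counts()   # missing labels are ignored
    n_samples = counts.sum()
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy target with a 75/25 split; in the notebook, pass df['part_failed'].
toy_target = pd.Series([0] * 6 + [1] * 2, name="part_failed")

weights = balanced_class_weights(toy_target)
print(weights)
```

The minority class receives the larger weight, so misclassified failures cost more during training. Whether weighting, resampling, or neither works best is something to check during modeling.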

### 4. Analyze Feature Relationships

In [None]:
# Time in Service vs. Failure
plt.figure(figsize=(10, 6))
sns.boxplot(x='part_failed', y='time_in_service_days', data=df)
plt.title('Time in Service vs. Part Failure')
plt.xlabel('Part Failed (0 = No, 1 = Yes)')
plt.ylabel('Time in Service (Days)')
plt.show()

**Observation:** As expected, parts that failed have a noticeably higher median `time_in_service_days` than parts that did not.
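
The boxplot can be backed with numbers by summarizing `time_in_service_days` per class. This is a minimal sketch on made-up values; the toy frame is an assumption so the example runs on its own, and its numbers say nothing about the real data.

```python
import pandas as pd

# Toy data for illustration only; in the notebook, use `df` instead.
toy = pd.DataFrame({
    "part_failed": [0, 0, 0, 1, 1, 1],
    "time_in_service_days": [100, 150, 200, 400, 500, 900],
})

# Count, median, and mean of time in service for each class
summary = (
    toy.groupby("part_failed")["time_in_service_days"]
    .agg(["count", "median", "mean"])
)
print(summary)

# Difference in medians between failed and non-failed parts
median_gap = summary.loc[1, "median"] - summary.loc[0, "median"]
print(f"Median gap (days): {median_gap}")
```

Reporting the median alongside the mean helps show whether a few very old parts are driving the difference.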

In [None]:
# Service Type vs. Failure
plt.figure(figsize=(12, 6))
sns.countplot(x='service_type', hue='part_failed', data=df)
plt.title('Service Type Distribution by Failure Status')
plt.xlabel('Type of Service')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

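**Observation:** 'Replacement' and 'Repair' service types are associated exclusively with part failures, while 'Routine Check' is associated only with non-failures. This makes `service_type` a very strong predictor, but a perfect separation like this can also indicate target leakage: if the service type is recorded only after the failure is known, it would not be available at prediction time.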
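
A row-normalized cross-tabulation is a quick way to confirm whether a service type maps to a single outcome. The sketch below uses a toy frame; the 'Inspection' category and all counts are invented for illustration so that one mixed category appears alongside the pure ones.

```python
import pandas as pd

# Toy data for illustration only; 'Inspection' is a hypothetical category.
toy = pd.DataFrame({
    "service_type": [
        "Routine Check", "Routine Check", "Repair",
        "Replacement", "Repair", "Inspection", "Inspection",
    ],
    "part_failed": [0, 0, 1, 1, 1, 0, 1],
})

# Share of each outcome within every service type (each row sums to 1)
rates = pd.crosstab(toy["service_type"], toy["part_failed"], normalize="index")
print(rates)

# Service types where every record has the same outcome
pure = rates.index[rates.max(axis=1) == 1.0].tolist()
print(f"Perfectly separating service types: {pure}")
```

Any category listed in `pure` is worth a closer look before it is used as a model feature.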

In [None]:
# Part ID vs. Failure
plt.figure(figsize=(12, 6))
sns.countplot(x='part_id', hue='part_failed', data=df, order=df['part_id'].value_counts().index)
plt.title('Part ID Distribution by Failure Status')
plt.xlabel('Part ID')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

**Observation:** Some parts, such as `P-BRG-02`, appear to fail more often than others. Because the plot shows raw counts, this should be confirmed by comparing failure rates per part.
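
Since a count plot mixes how often a part appears with how often it fails, a per-part failure rate is a useful companion. The sketch below is a minimal example on toy data; the part IDs `P-AAA-01` and `P-BBB-02` are hypothetical and the numbers are made up.

```python
import pandas as pd

# Toy data for illustration only; in the notebook, use `df` instead.
toy = pd.DataFrame({
    "part_id": ["P-AAA-01"] * 4 + ["P-BBB-02"] * 4,
    "part_failed": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Number of records and share of failures for each part, highest rate first
rate_by_part = (
    toy.groupby("part_id")["part_failed"]
    .agg(records="count", failure_rate="mean")
    .sort_values("failure_rate", ascending=False)
)
print(rate_by_part)
```

Keeping the `records` column next to the rate matters: a high failure rate based on only a handful of records is weak evidence.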