# Big G Express: Predicting Derates
In this project, you will work with J1939 fault code data and vehicle onboard diagnostic data to predict an upcoming full derate.

J1939 is a communications protocol used in heavy-duty vehicles (like trucks, buses, and construction equipment) to allow different electronic control units (ECUs), like the engine, transmission, and brake systems, to talk to each other. Fault codes in this system follow a standard format so that mechanics and diagnostic tools can understand what's wrong, no matter the make or model.

These fault codes have two parts. First, an SPN (Suspect Parameter Number), which identifies what system or component is having the issue. Second, an FMI (Failure Mode Identifier), which explains how the system is failing (too high, too low, short circuit, etc.).

A derate refers to the truck's computer intentionally reducing engine power or speed to protect itself or force the driver to get it serviced. This is a built-in safety measure. A full derate, the main target in this project, means the vehicle is severely limited, requiring a tow for repairs. Full derates are indicated by an SPN of 5246. 
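
As a small illustration of the two-part fault code and the full-derate rule described above, the sketch below models a fault as an (SPN, FMI) pair. The `FaultCode` class and `is_full_derate` helper are hypothetical names, and the non-derate SPN/FMI values are arbitrary placeholders; only SPN 5246 comes from the project description.

```python
from typing import NamedTuple

FULL_DERATE_SPN = 5246  # from the project description


class FaultCode(NamedTuple):
    spn: int  # Suspect Parameter Number: which system or component
    fmi: int  # Failure Mode Identifier: how it is failing


def is_full_derate(code: FaultCode) -> bool:
    """Return True when the fault's SPN marks a full derate."""
    return code.spn == FULL_DERATE_SPN


# Placeholder codes for illustration; the first one is arbitrary.
codes = [FaultCode(spn=111, fmi=17), FaultCode(spn=5246, fmi=0)]
flags = [is_full_derate(c) for c in codes]
print(flags)
```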

You have been provided with two files containing the data you will use to make these predictions (J1939Faults.csv and VehicleDiagnosticOnboardData.csv) as well as two files describing some of the contents (DataInfo.docx and Service Fault Codes_1_0_0_167.xlsx).

Note that in its raw form the data does not have "labels", so you must define what labels you are going to use and create those labels in your dataset. Also, you will likely need to perform some significant feature engineering in order to build an accurate predictor.

There are service locations at (36.0666667, -86.4347222), (35.5883333, -86.4438888), and (36.1950, -83.174722), so you should remove any records in the vicinity of these locations, as fault codes may be tripped when working on the vehicles.

When evaluating the performance of your model, assume that the cost associated with a missed full derate is approximately $4000 in towing and repairs, and the cost of a false positive prediction is about $500 due to having the truck off the road and serviced unnecessarily. While high accuracy or F1 is nice, we are most interested here in saving the company money, so the final metric to evaluate your model should be the cost savings.
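
One way to turn these two costs into a single number is sketched below. It assumes that savings are measured against running no model at all, and that every caught derate avoids the full $4000 (the cost of servicing a correctly flagged truck is ignored). The function name `net_savings` is hypothetical.

```python
COST_MISSED_DERATE = 4000  # towing and repairs, from the project description
COST_FALSE_POSITIVE = 500  # unnecessary service stop, from the project description


def net_savings(true_positives: int, false_positives: int) -> int:
    """Savings relative to no model: each caught derate avoids a $4000 tow
    (assumption), and each false alarm costs $500."""
    return (true_positives * COST_MISSED_DERATE
            - false_positives * COST_FALSE_POSITIVE)


example = net_savings(true_positives=10, false_positives=40)
print(example)  # 10 * 4000 - 40 * 500 = 20000
```

A negative result means the false alarms cost more than the avoided tows.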

**Project Timeline:**

* Thursday, May 8: Present preliminary findings to instructors.
* Tuesday, May 13: Present final findings to class.

Your presentation should use slides, not code in a notebook. Your final presentation should include at least the following points:
* What features did you use to predict? Report some of the more impactful features using some kind of feature importance metric.
* If you had used the data prior to 2019 to train your model and had been using it from January 1, 2019 onwards, how many full derates would you have caught? How many false positives? What is the net savings or cost of using your model for that time span? Report your estimate here, even if the model would have been a net negative.
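
The second question asks for counts of derates, not of individual fault rows, so several positive rows before the same derate should count once. A minimal sketch of that tally follows. It assumes a 24-hour look-back window and a 0/1 prediction column named `pred`; the function `count_caught_derates`, the `DerateTime` column, and the toy data are all hypothetical.

```python
import pandas as pd


def count_caught_derates(events, derates, window_hours=24):
    """Count derates with at least one positive prediction in the preceding window.

    events:  columns EquipmentID, EventTimeStamp, pred (0/1 model output)
    derates: columns EquipmentID, DerateTime (one row per full derate)
    """
    caught = 0
    for derate in derates.itertuples(index=False):
        window_start = derate.DerateTime - pd.Timedelta(hours=window_hours)
        in_window = (
            (events["EquipmentID"] == derate.EquipmentID)
            & (events["EventTimeStamp"] >= window_start)
            & (events["EventTimeStamp"] < derate.DerateTime)
        )
        # any() collapses multiple alerts for the same derate into one catch
        caught += int((events.loc[in_window, "pred"] == 1).any())
    return caught


events = pd.DataFrame({
    "EquipmentID": [1, 1, 2, 2],
    "EventTimeStamp": pd.to_datetime([
        "2019-01-02 01:00", "2019-01-02 05:00",  # truck 1: two alerts before its derate
        "2019-01-01 00:00", "2019-01-04 12:00",  # truck 2: alert too early, then no alert
    ]),
    "pred": [1, 1, 1, 0],
})
derates = pd.DataFrame({
    "EquipmentID": [1, 2],
    "DerateTime": pd.to_datetime(["2019-01-02 12:00", "2019-01-05 00:00"]),
})

caught = count_caught_derates(events, derates)
missed = len(derates) - caught
print(caught, missed)
```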

In [2]:
#pip install geopy

In [4]:
import pandas as pd
from geopy.distance import geodesic
from scipy.spatial import cKDTree
import numpy as np

**Read in data**

In [7]:
faults = pd.read_csv("../data/J1939Faults.csv")
diagnostics = pd.read_csv("../data/VehicleDiagnosticOnboardData.csv")


In [8]:
#faults.info()

**Faults : Cleaning Data**

In [10]:
drop_list = ['ESS_Id', 
             'actionDescription', 
             'ecuSoftwareVersion', 
             'ecuSerialNumber', 
             'ecuModel', 
             'ecuMake', 
             'ecuSource', 
             'faultValue',
             'LocationTimeStamp',
             'MCTNumber']

faults = faults.drop(columns=drop_list)

In [11]:
# service_stations = [
#     (36.0666667, -86.4347222),
#     (35.5883333, -86.4438888),
#     (36.1950, -83.174722)
# ]

# threshold_distance = 1.0  


# def is_near_service_station(lat, lon):
#     point = (lat, lon)
#     for station in service_stations:
#         distance = geodesic(point, station).kilometers
#         if distance <= threshold_distance:
#             return True
#     return False


# faults['IsServiceStation'] = faults.apply(lambda row: is_near_service_station(row['Latitude'], row['Longitude']), axis=1)

In [16]:
service_stations = [
    (36.0666667, -86.4347222),
    (35.5883333, -86.4438888),
    (36.1950, -83.174722)
]
def is_near_service_station_kdtree(df, service_stations, threshold_distance=1.0):
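    # convert threshold from km to approximate degrees
    # rough approximation: 1 degree of latitude ≈ 111 km; a degree of longitude
    # is shorter (about 111 * cos(latitude) km), so the radius is only approximate
    degree_threshold = threshold_distance / 111.0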
    
    # KDTree for service stations
    stations_array = np.array(service_stations)
    tree = cKDTree(stations_array)
    
    # query points
    points = np.vstack([df['Latitude'], df['Longitude']]).T
    
    # find points within threshold distance
    # Returns indices of points within degree_threshold of any service station
    indices = tree.query_ball_point(points, degree_threshold)
    
    is_near = np.array([len(idx) > 0 for idx in indices])
    
    return is_near

# Define threshold distance
threshold_distance = 1.0 

# apply function
faults['IsServiceStation'] = is_near_service_station_kdtree(
    faults, service_stations, threshold_distance)

In [18]:
faults['IsServiceStation'].value_counts(normalize = True)

IsServiceStation
False    0.889795
True     0.110205
Name: proportion, dtype: float64
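
The KDTree approach above measures distance in degrees, which treats a degree of longitude like a degree of latitude. If a distance in kilometres is preferred, a vectorized haversine check is one alternative. This is a sketch, not the method used in this notebook; `haversine_km` and `near_any_station` are hypothetical helper names, and the Earth is treated as a sphere of radius 6371 km.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def near_any_station(lats, lons, stations, threshold_km=1.0):
    """Boolean array: True where a point lies within threshold_km of any station."""
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    near = np.zeros(lats.shape, dtype=bool)
    for s_lat, s_lon in stations:
        near |= haversine_km(lats, lons, s_lat, s_lon) <= threshold_km
    return near


stations = [(36.0666667, -86.4347222), (35.5883333, -86.4438888), (36.1950, -83.174722)]
# Test points: on a station, 0.01 deg east of it, 0.01 deg north of it, far away
lats = [36.0666667, 36.0666667, 36.0766667, 40.0]
lons = [-86.4347222, -86.4247222, -86.4347222, -80.0]
result = near_any_station(lats, lons, stations)
print(result)
```

In this toy example the point 0.01 degrees to the east falls inside the 1 km radius while the point 0.01 degrees to the north does not, because longitude degrees are shorter at this latitude.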

**Diagnostics (Features) : Cleaning Data**

In [21]:
diagnostics['Value'] = diagnostics['Value'].replace({'FALSE': 0, 'TRUE': 1})

In [22]:
diagnostics_w = diagnostics.pivot(index='FaultId', columns='Name', values='Value')
features = diagnostics_w.reset_index()
features.columns.name = None

**Merged Faults and Features**

In [25]:
combined = pd.merge(faults, 
                    features, 
                    left_on='RecordID', 
                    right_on='FaultId',
                    how = 'left')

**Classifying a Full Derate**

In [28]:
combined['IsDerateFull'] = (combined['spn'] == 5246) & (combined['active'] == True)
combined['IsDerateFull'].value_counts()

IsDerateFull
False    1186728
True         607
Name: count, dtype: int64

**Filter Area**

In [31]:
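combined_filtered = combined[combined['IsServiceStation'] == False]
# NOTE: this line filters `combined` again rather than `combined_filtered`, so the
# service-station filter above is overwritten; only inactive 5246 rows are dropped
combined_filtered = combined[~((combined['spn'] == 5246) & (combined['active'] == False))]
combined_filtered['IsServiceStation'].value_counts(normalize = True)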

IsServiceStation
False    0.889867
True     0.110133
Name: proportion, dtype: float64
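
The output above still shows about 11% of rows near a service station, which suggests the second assignment, starting again from `combined`, replaced the first filter. A minimal sketch of applying both conditions in one step, on toy data, is below. It assumes `active` and `IsServiceStation` are boolean columns; the `toy` frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "spn": [5246, 5246, 111, 111],
    "active": [True, False, True, True],
    "IsServiceStation": [False, False, True, False],
})

# Keep rows that are away from service stations AND are not inactive 5246 records
inactive_derate = (toy["spn"] == 5246) & ~toy["active"]
keep = ~toy["IsServiceStation"] & ~inactive_derate

# .copy() gives an independent frame, so later column assignments
# do not raise SettingWithCopyWarning
toy_filtered = toy[keep].copy()
print(toy_filtered["IsServiceStation"].value_counts(normalize=True))
```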

**Inspect derate vehicle with highest n faults**

In [33]:
all_derate_vehicles = combined_filtered[(combined_filtered['IsDerateFull'] == True)]
grouped_derate_vehicles = all_derate_vehicles.groupby('EquipmentID').size().reset_index(name='row_count')
grouped_derate_vehicles = grouped_derate_vehicles.sort_values(by='row_count', ascending=False)
print('EquipmentID 1524 has the most rows to inspect')
grouped_derate_vehicles.head()

EquipmentID 1524 has the most rows to inspect


     EquipmentID  row_count
41          1524         36
46          1535         27
42          1525         22
48          1539         16
113         1749         15


In [35]:
#Truck_1524 = combined_filtered[combined_filtered['EquipmentID'] == 1524]
#Truck_1524.head(31)
#Truck_1524.to_csv('Truck_1524.csv', index=False)

**Add a New Feature: Severity Level**

In [37]:
import re

def extract_severity(text):

    if pd.isna(text):
        return np.nan
        
    # "Severity" followed by "Low", "Medium", or "High"
    pattern = r'Severity\s+(Low|Medium|High)'
    
    # Search for the pattern 
    match = re.search(pattern, text)
    
    if match:
        # Return "Severity" plus the matched level
        return f"Severity {match.group(1)}"
    else:
        return np.nan

# Apply the function to create the new column
combined_filtered['SeverityLevel'] = combined_filtered['eventDescription'].apply(extract_severity)


In [39]:
severity_map = {
    'Severity Low': 1,
    'Severity Medium': 2,
    'Severity High': 3
}

combined_filtered['SeverityLevelFeature'] = combined_filtered['SeverityLevel'].map(severity_map)


combined_filtered.loc[combined_filtered['spn'] == 5246, 'SeverityLevelFeature'] = 4


In [41]:
inspect_column = combined_filtered[['eventDescription', 'SeverityLevel','SeverityLevelFeature']].dropna(subset=['SeverityLevel'])
inspect_column.head()

                              eventDescription  SeverityLevel  SeverityLevelFeature
0      Low (Severity Low) Engine Coolant Level   Severity Low                   1.0
5      Low (Severity Low) Engine Coolant Level   Severity Low                   1.0
6      Low (Severity Low) Engine Coolant Level   Severity Low                   1.0
7      Low (Severity Low) Engine Coolant Level   Severity Low                   1.0
8  High (Severity Low) Water In Fuel Indicator   Severity Low                   1.0


**Convert Features to_numeric**

In [43]:
feature_numeric_columns = [
    'AcceleratorPedal', 'BarometricPressure', 'CruiseControlSetSpeed', 
    'DistanceLtd', 'EngineCoolantTemperature', 'EngineLoad', 'EngineOilPressure', 'EngineOilTemperature',
    'EngineRpm', 'EngineTimeLtd', 'FuelLevel', 'FuelLtd', 'FuelRate', 'FuelTemperature', 
    'IntakeManifoldTemperature', 'ParkingBrake', 'ServiceDistance', 'Speed', 
    'SwitchedBatteryVoltage', 'Throttle', 'TurboBoostPressure'
]

combined_filtered[feature_numeric_columns] = combined_filtered[feature_numeric_columns].apply(pd.to_numeric, errors='coerce')

#combined_filtered[feature_numeric_columns].dtypes


**Identify derate rows before creating derate window**

In [45]:
inspect_derate_rows = combined_filtered[combined_filtered['IsDerateFull'] == True][['IsDerateFull', 'active']].value_counts()
inspect_derate_rows

IsDerateFull  active
True          True      607
Name: count, dtype: int64

**Identify False Derates : any subsequent derate within the same calendar day**

In [47]:
combined_filtered['EventTimeStamp'] = pd.to_datetime(combined_filtered['EventTimeStamp'])

# new column for actual derates
combined_filtered['IsDerateActual'] = (combined_filtered['spn'] == 5246)

# helper columns for date only (without time)
combined_filtered['DateOnly'] = combined_filtered['EventTimeStamp'].dt.date

# process each truck separately
for equipment_id, group in combined_filtered.groupby('EquipmentID'):
    # For each date with derates for this truck
    derate_condition = (group['spn'] == 5246)
    for date, date_group in group[derate_condition].groupby('DateOnly'):
        if len(date_group) > 1:
            # Sort by timestamp
            date_group = date_group.sort_values('EventTimeStamp')
            
            # Get indices of all derates except the first one for this day
            subsequent_derate_indices = date_group.index[1:]
            
            # Mark these as not actual derates (keep only the first one)
            combined_filtered.loc[subsequent_derate_indices, 'IsDerateActual'] = False

# clean helper columns
combined_filtered.drop(['DateOnly'], axis=1, inplace=True)

print(combined_filtered[combined_filtered['spn'] == 5246][['EventTimeStamp', 'spn', 'IsDerateFull', 'IsDerateActual']])

             EventTimeStamp   spn  IsDerateFull  IsDerateActual
45      2015-02-21 12:10:51  5246          True            True
1918    2015-02-22 19:44:55  5246          True            True
2089    2015-02-23 05:05:44  5246          True            True
2971    2015-02-23 15:54:22  5246          True            True
5713    2015-02-25 13:53:08  5246          True            True
...                     ...   ...           ...             ...
1181700 2020-02-13 13:32:39  5246          True            True
1181996 2020-02-14 11:21:54  5246          True            True
1183032 2020-02-19 07:02:33  5246          True            True
1183684 2020-02-21 07:23:44  5246          True            True
1184330 2020-02-24 15:27:26  5246          True            True

[607 rows x 4 columns]



**Filter out False Derates**

In [49]:
combined_filtered = combined_filtered[~((combined_filtered['spn'] == 5246) & (combined_filtered['IsDerateActual'] == False))]

**Derate Window**

In [51]:
# datetime format
combined_filtered['EventTimeStamp'] = pd.to_datetime(combined_filtered['EventTimeStamp'])

# sort
combined_filtered = combined_filtered.sort_values(['EquipmentID', 'EventTimeStamp'])

# initial target column
combined_filtered['DeratePredictionTarget'] = 0

# helper dataframe with just the derate events
derate_events = combined_filtered[combined_filtered['IsDerateActual']].copy()

# group by EquipmentID to process each truck separately
for equipment_id, group in combined_filtered.groupby('EquipmentID'):
    # Get derate events for this truck only
    truck_derates = derate_events[derate_events['EquipmentID'] == equipment_id]
    
    if len(truck_derates) > 0:
        # get indices and timestamps for this truck's rows
        truck_indices = group.index
        truck_timestamps = group['EventTimeStamp'].values
        
        # for each derate event in this truck
        for _, derate_row in truck_derates.iterrows():
            derate_time = derate_row['EventTimeStamp']
            
            # window: from 24 hours before the derate up to 0.01 hours (36 seconds) before it
            window_start = derate_time - pd.Timedelta(hours=24)
            window_end = derate_time - pd.Timedelta(hours=.01)
            
            # all events in the prediction window
            in_window = (truck_timestamps >= window_start) & (truck_timestamps <= window_end)
            indices_to_mark = truck_indices[in_window]
            
            # marked as predicting a derate
            combined_filtered.loc[indices_to_mark, 'DeratePredictionTarget'] = 1

# results
print(f"Total events: {len(combined_filtered)}")
print(f"Events predicting a derate: {combined_filtered['DeratePredictionTarget'].sum()}")

Total events: 1186601
Events predicting a derate: 3062


In [52]:
# Truck_1524_test_v5 = combined_filtered[combined_filtered['EquipmentID'] == 1524]
# Truck_1524_test_v5.to_csv('Truck_1524_test_v5.csv', index=False)

In [53]:
# n = 5  # rolling avg size

# # List of numeric feature columns to impute
# feature_numeric_columns = [
#     'AcceleratorPedal', 'BarometricPressure', 'CruiseControlSetSpeed', 
#     'DistanceLtd', 'EngineCoolantTemperature', 'EngineLoad', 'EngineOilPressure', 'EngineOilTemperature',
#     'EngineRpm', 'EngineTimeLtd', 'FuelLevel', 'FuelLtd', 'FuelRate', 'FuelTemperature', 
#     'IntakeManifoldTemperature', 'ParkingBrake', 'ServiceDistance', 'Speed', 
#     'SwitchedBatteryVoltage', 'Throttle', 'TurboBoostPressure'
# ]

# # Apply rolling average for rows where spn == 5246 and group by 'equipment_id'
# for column in feature_numeric_columns:
#     # Create a mask for the rows where 'spn' == 5246 and the column value is NaN
#     derate_nan = (combined_filtered['spn'] == 5246) & combined_filtered[column].isna()
    
#     # Group by 'equipment_id' and apply rolling mean for each group
#     combined_filtered.loc[derate_nan, column] = combined_filtered.groupby('EquipmentID')[column] \
#         .apply(lambda x: x.rolling(window=n, min_periods=1, axis=0).mean())

# # Check the number of NaN values only for rows where spn == 5246
# na_count_spn_5246 = combined_filtered[combined_filtered['spn'] == 5246][feature_numeric_columns].isna().sum()

# # Print the result
# na_count_spn_5246

**Investigate NaN values in both outcomes of response variable:**

*Is there a difference in available data between the two outcomes: IsDerateFull T/F*

In [55]:
from scipy.stats import chi2_contingency

In [57]:
feature_columns = [
    'AcceleratorPedal', 'BarometricPressure', 'CruiseControlActive', 'CruiseControlSetSpeed', 
    'DistanceLtd', 'EngineCoolantTemperature', 'EngineLoad', 'EngineOilPressure', 'EngineOilTemperature',
    'EngineRpm', 'EngineTimeLtd', 'FuelLevel', 'FuelLtd', 'FuelRate', 'FuelTemperature', 'IgnStatus', 
    'IntakeManifoldTemperature', 'LampStatus', 'ParkingBrake', 'ServiceDistance', 'Speed', 
    'SwitchedBatteryVoltage', 'Throttle', 'TurboBoostPressure'
]

derate_true = combined[combined['IsDerateFull'] == True]
derate_false = combined[combined['IsDerateFull'] == False]

derate_true = derate_true[feature_columns]
derate_false = derate_false[feature_columns]

nan_count_true = derate_true.isna().sum()
non_nan_count_true = derate_true.count()

nan_count_false = derate_false.isna().sum()
non_nan_count_false = derate_false.count()

nan_percent_true = (nan_count_true / len(derate_true)) * 100
nan_percent_false = (nan_count_false / len(derate_false)) * 100

nan_summary_true = pd.DataFrame({
    'Feature': nan_count_true.index, 
    'NaN Count': nan_count_true.values,
    'Non-NaN Count': non_nan_count_true.values,
    'NaN Percentage (%)_IsDerateFull = True': nan_percent_true.values
})

nan_summary_false = pd.DataFrame({
    'Feature': nan_count_false.index, 
    'NaN Count': nan_count_false.values,
    'Non-NaN Count': non_nan_count_false.values,
    'NaN Percentage (%)_IsDerateFull = False': nan_percent_false.values
})

In [59]:
true_pct = nan_summary_true[['Feature','NaN Percentage (%)_IsDerateFull = True']]
false_pct = nan_summary_false[['Feature','NaN Percentage (%)_IsDerateFull = False']]
nan_comparison = pd.merge(true_pct, false_pct, on = 'Feature')

In [60]:
nan_comparison['Difference'] =  nan_comparison['NaN Percentage (%)_IsDerateFull = False'] - nan_comparison['NaN Percentage (%)_IsDerateFull = True']
nan_comparison_sorted = nan_comparison.sort_values(by='Difference', ascending = False)
nan_comparison_sorted

                      Feature  NaN Percentage (%)_IsDerateFull = True  NaN Percentage (%)_IsDerateFull = False  Difference
15                  IgnStatus                                     0.0                                48.779586   48.779586
11                  FuelLevel                               11.037891                                57.677328   46.639437
16  IntakeManifoldTemperature                                7.578254                                50.643281   43.065028
9                   EngineRpm                                7.578254                                50.590194   43.011941
1          BarometricPressure                                7.742998                                50.669741   42.926742
5    EngineCoolantTemperature                                7.742998                                50.661735   42.918737
7           EngineOilPressure                                8.072488                                50.646989   42.574501
12                    FuelLtd                                8.401977                                50.735215   42.333238
6                  EngineLoad                                8.896211                                50.699065   41.802854
13                   FuelRate                                  9.2257                                50.731254   41.505554
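
`chi2_contingency` is imported above but not applied in these cells. As a sketch of how it could test whether missingness differs between the two outcomes, the example below runs it on a 2x2 table of missing versus present counts for one feature. The counts are toy values for illustration only, not taken from the dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: IsDerateFull True / False; columns: feature missing / feature present.
# Toy counts for illustration only.
table = np.array([
    [10, 90],
    [50, 50],
])

# Returns the test statistic, p-value, degrees of freedom, and expected counts
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}")
```

A small p-value would indicate that the share of missing values is unlikely to be the same in both groups. With more than a million rows in the real data, even small differences are likely to test as significant, so the size of the difference matters more than the p-value.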


**Identify features for model:**

*Are there features that show trends over time?*

In [62]:
features = [
    'DistanceLtd',
    'EngineOilTemperature',
    'TurboBoostPressure',
    'FuelRate',
    'EngineLoad',
    'EngineOilPressure',
    'EngineCoolantTemperature',
    'BarometricPressure',
    'EngineRpm',
    'IntakeManifoldTemperature',
    'FuelTemperature',
    'SwitchedBatteryVoltage',
    'SeverityLevelFeature'
]

**Statistical Models**

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

In [65]:
combined_filtered[features].dtypes

DistanceLtd                  float64
EngineOilTemperature         float64
TurboBoostPressure           float64
FuelRate                     float64
EngineLoad                   float64
EngineOilPressure            float64
EngineCoolantTemperature     float64
BarometricPressure           float64
EngineRpm                    float64
IntakeManifoldTemperature    float64
FuelTemperature              float64
SwitchedBatteryVoltage       float64
SeverityLevelFeature         float64
dtype: object

**Logistic Regression**

Version_1

*Logistic regression model with 80/20 test split*

In [74]:
# X = combined[features]
# y = combined['IsDerateFull']

# X = X.apply(pd.to_numeric, errors='coerce')
# X = X.fillna(X.mean())

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321)

# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)  
# X_test_scaled = scaler.transform(X_test)  

# model = LogisticRegression(class_weight='balanced')
# model.fit(X_train_scaled, y_train)

# y_pred = model.predict(X_test_scaled)

# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy: {accuracy:.4f}")
# print("Classification Report:\n", classification_report(y_test, y_pred))
# print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

True Negatives (TN): 190,358 — Correctly predicted False.

False Positives (FP): 46,991 — Incorrectly predicted True.

False Negatives (FN): 24 — Incorrectly predicted False.

True Positives (TP): 94 — Correctly predicted True.

In [77]:
combined_filtered['EventTimeStamp'] = pd.to_datetime(combined_filtered['EventTimeStamp'], errors='coerce')
combined_filtered['IsTestData'] = combined_filtered['EventTimeStamp'] >= '2019-01-01'
combined_filtered['IsTestData'].value_counts(normalize=True)

IsTestData
False    0.891157
True     0.108843
Name: proportion, dtype: float64

**Logistic Regression**

Version_2

*Logistic regression split Test >= 2019*

In [80]:
random_state = 42

# Training Data < 2019 
# Test Data >= 2019
train_data = combined_filtered[combined_filtered['IsTestData'] == False]
test_data = combined_filtered[combined_filtered['IsTestData'] == True]

X_train = train_data[features]
y_train = train_data['DeratePredictionTarget']
X_test = test_data[features]
y_test = test_data['DeratePredictionTarget']

X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')

X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_test.mean())

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  
X_test_scaled = scaler.transform(X_test)  

model = LogisticRegression(class_weight='balanced', random_state=random_state)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.9449
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.95      0.97    128654
           1       0.01      0.14      0.02       499

    accuracy                           0.94    129153
   macro avg       0.50      0.55      0.50    129153
weighted avg       0.99      0.94      0.97    129153

Confusion Matrix:
 [[121969   6685]
 [   427     72]]
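
The cell above fills missing test values with the test set's own column means, while the scaler is fit on the training data only. If the 2019 onwards period is meant to stand in for unseen data, one alternative is to reuse the training means for both sets. The sketch below shows this on toy frames; `X_train_toy` and `X_test_toy` are made-up stand-ins, not the notebook's data.

```python
import numpy as np
import pandas as pd

X_train_toy = pd.DataFrame({
    "EngineLoad": [10.0, 30.0, np.nan],
    "FuelRate": [1.0, np.nan, 3.0],
})
X_test_toy = pd.DataFrame({
    "EngineLoad": [np.nan, 50.0],
    "FuelRate": [np.nan, np.nan],
})

# Column means learned from the training rows only
train_means = X_train_toy.mean()

X_train_imputed = X_train_toy.fillna(train_means)
X_test_imputed = X_test_toy.fillna(train_means)  # reuse training means, not test means
print(X_test_imputed)
```

This also covers a test column that is entirely missing, where the test mean would itself be NaN.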


**Confusion Matrix**

In [82]:
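# Counts entered by hand; they do not match the confusion matrix printed above
# (TN=121969, FP=6685, FN=427, TP=72)
TN_m1 = 120665
FP_m1 = 7894
FN_m1 = 441
TP_m1 = 153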


Costs = (FP_m1 * 500)
Savings = (TP_m1 * 4000)

Net = Savings - Costs
print(Net)

-3335000


**Random Forest**

Version_1

In [84]:
# split data into train and test sets
train_data = combined_filtered[combined_filtered['IsTestData'] == False]
test_data = combined_filtered[combined_filtered['IsTestData'] == True]

# prepare features and target
X_train = train_data[features]
y_train = train_data['DeratePredictionTarget']
X_test = test_data[features]
y_test = test_data['DeratePredictionTarget']

# numeric conversion and missing values
X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')
X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_test.mean())

# scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# OPTIMIZATION 1
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train_scaled, y_train)

# OPTIMIZATION 2
random_forest_model = RandomForestClassifier(
    n_estimators=50,     
    max_depth=15,        
    min_samples_split=20, 
    n_jobs=-1,           
    random_state=42,
    class_weight='balanced'
)

# train the model
random_forest_model.fit(X_train_resampled, y_train_resampled)

# predictions
y_pred = random_forest_model.predict(X_test_scaled)

# evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# feature importance
feature_importances = pd.DataFrame({
    'Feature': features,
    'Importance': random_forest_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nTop 10 Important Features:")
print(feature_importances.head(10))

Accuracy: 0.9959
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    128654
           1       0.14      0.01      0.02       499

    accuracy                           1.00    129153
   macro avg       0.57      0.50      0.51    129153
weighted avg       0.99      1.00      0.99    129153

Confusion Matrix:
 [[128623     31]
 [   494      5]]

Top 10 Important Features:
                      Feature  Importance
0                 DistanceLtd    0.180428
1        EngineOilTemperature    0.107261
12       SeverityLevelFeature    0.096604
10            FuelTemperature    0.083038
9   IntakeManifoldTemperature    0.079819
8                   EngineRpm    0.076914
6    EngineCoolantTemperature    0.076282
7          BarometricPressure    0.071720
5           EngineOilPressure    0.066700
3                    FuelRate    0.046864


**Confusion Matrix**

In [88]:
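# Counts entered by hand; they do not match the confusion matrix printed above
# (TN=128623, FP=31, FN=494, TP=5)
TN_m2 = 128448
FP_m2 = 111
FN_m2 = 508
TP_m2 = 86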