# Big G Express: Predicting Derates
In this project, you will work with J1939 fault code data and vehicle onboard diagnostic data to predict an upcoming full derate.

J1939 is a communications protocol used in heavy-duty vehicles (like trucks, buses, and construction equipment) to allow different electronic control units (ECUs), like the engine, transmission, and brake systems, to talk to each other. Fault codes in this system follow a standard format so that mechanics and diagnostic tools can understand what's wrong, no matter the make or model.

These fault codes have two parts. First, an SPN (Suspect Parameter Number), which identifies what system or component is having the issue. Second, an FMI (Failure Mode Identifier), which explains how the system is failing (too high, too low, short circuit, etc.).

A derate refers to the truck's computer intentionally reducing engine power or speed to protect itself or force the driver to get it serviced. This is a built-in safety measure. A full derate, the main target in this project, means the vehicle is severely limited, requiring a tow for repairs. Full derates are indicated by an SPN of 5246. 
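
As a minimal sketch of the two-part code described above, a fault can be modeled as an (SPN, FMI) pair with a helper that flags the full-derate SPN. The `FaultCode` class and its field names are hypothetical and exist only for illustration; the FMI values in the example are illustrative as well. Only the SPN value 5246 comes from the project description.

```python
from dataclasses import dataclass

FULL_DERATE_SPN = 5246  # full-derate SPN, as stated in the project description


@dataclass(frozen=True)
class FaultCode:
    """Hypothetical container for one J1939 fault code (illustration only)."""
    spn: int  # Suspect Parameter Number: which system or component is affected
    fmi: int  # Failure Mode Identifier: how that component is failing

    @property
    def is_full_derate(self) -> bool:
        # A full derate is identified by its SPN alone
        return self.spn == FULL_DERATE_SPN


# Illustrative codes; the FMI values are placeholders, not documented meanings
codes = [FaultCode(spn=3556, fmi=5), FaultCode(spn=5246, fmi=0)]
full_derate_flags = [code.is_full_derate for code in codes]
```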

You have been provided with two files containing the data you will use to make these predictions (J1939Faults.csv and VehicleDiagnosticOnboardData.csv), as well as two files describing some of the contents (DataInfo.docx and Service Fault Codes_1_0_0_167.xlsx).

Note that in its raw form the data does not have "labels", so you must define what labels you are going to use and create those labels in your dataset. Also, you will likely need to perform some significant feature engineering in order to build an accurate predictor.

There are service locations at (36.0666667, -86.4347222), (35.5883333, -86.4438888), and (36.1950, -83.174722), so you should remove any records in the vicinity of these locations, as fault codes may be tripped when working on the vehicles.
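
One possible way to flag records near these locations is a great-circle (haversine) distance check, sketched below. The 1 km radius, the function names, and the constant names are assumptions made for illustration; the project description does not define "vicinity", so the radius should be tuned.

```python
import numpy as np

# Service locations from the project description
SERVICE_STATIONS = [
    (36.0666667, -86.4347222),
    (35.5883333, -86.4438888),
    (36.1950, -83.174722),
]
EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def near_any_station(lat, lon, stations=SERVICE_STATIONS, radius_km=1.0):
    """Boolean array: True where a point lies within radius_km of any station.

    radius_km=1.0 is an assumed default, not a value from the project brief.
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    near = np.zeros(lat.shape, dtype=bool)
    for station_lat, station_lon in stations:
        near |= haversine_km(lat, lon, station_lat, station_lon) <= radius_km
    return near


# A point at a station, a point about 0.56 km north of it, and a point far away
flags = near_any_station(
    [36.0666667, 36.0716667, 36.5],
    [-86.4347222, -86.4347222, -86.4347222],
)
```

With a DataFrame, the same function could be applied as `near_any_station(df['Latitude'], df['Longitude'])`, assuming those column names.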

When evaluating the performance of your model, assume that the cost associated with a missed full derate is approximately $4000 in towing and repairs, and the cost of a false positive prediction is about $500 due to having the truck off the road and serviced unnecessarily. While high accuracy or F1 is nice, we are most interested here in saving the company money, so the final metric to evaluate your model should be the cost savings.
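
The cost metric can be sketched as a small function. This assumes that every correctly predicted derate avoids the full $4000 and that every false positive costs $500; the function and parameter names are hypothetical. Under those assumptions the model only pays off when its precision exceeds 500 / (4000 + 500), roughly 11%.

```python
def net_savings(tp, fp, saved_per_catch=4000, cost_per_false_alarm=500):
    """Net dollars saved: avoided derate costs minus unnecessary service costs.

    Assumes each true positive avoids the whole cost of a missed derate.
    """
    return tp * saved_per_catch - fp * cost_per_false_alarm


def break_even_precision(saved_per_catch=4000, cost_per_false_alarm=500):
    """Precision above which net_savings becomes positive.

    tp * S > fp * C  is equivalent to  tp / (tp + fp) > C / (S + C).
    """
    return cost_per_false_alarm / (saved_per_catch + cost_per_false_alarm)


# Example: 19 caught derates against 319 false alarms is a net loss
example_net = net_savings(tp=19, fp=319)  # 76,000 - 159,500 = -83,500
```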

**Project Timeline:**

* Thursday, May 8: Present preliminary findings to instructors.
* Tuesday, May 13: Present final findings to class.

Your presentation should use slides, not code in a notebook. Your final presentation should include at least the following points:
* What features did you use to predict? Report some of the more impactful features using some kind of feature importance metric.
* If you had used the data prior to 2019 to train your model and had been using it from January 1, 2019 onwards, how many full derates would you have caught? How many false positives? What is the net savings or cost of using your model for that time span? Report your estimate here, even if the model would have been a net negative.
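
The time-based evaluation in the last point could be set up with a split like the sketch below. The function name and the toy data are hypothetical; the column name `EventTimeStamp` is assumed to hold the event time.

```python
import pandas as pd


def split_by_date(df, cutoff="2019-01-01", time_col="EventTimeStamp"):
    """Return (train, test): rows strictly before the cutoff, and rows on or after it."""
    timestamps = pd.to_datetime(df[time_col])
    cutoff = pd.Timestamp(cutoff)
    train = df[timestamps < cutoff].copy()
    test = df[timestamps >= cutoff].copy()
    return train, test


# Toy data: two events before the cutoff, two on or after it
toy_events = pd.DataFrame({
    "EventTimeStamp": ["2018-06-01 08:00:00", "2018-12-31 23:59:59",
                       "2019-01-01 00:00:00", "2019-03-15 12:30:00"],
    "spn": [111, 5246, 3556, 5246],
})
train_df, test_df = split_by_date(toy_events)
```

Splitting on time rather than at random keeps the evaluation honest: the model never sees events that occur after the date from which it is supposed to be in use.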

In [2]:
#pip install geopy

In [3]:
import pandas as pd
from geopy.distance import geodesic
from scipy.spatial import cKDTree
import numpy as np

**Read in data**

In [7]:
faults = pd.read_csv("../data/J1939Faults.csv")
diagnostics = pd.read_csv("../data/VehicleDiagnosticOnboardData.csv")


In [8]:
#faults.info()

**Faults : Cleaning Data**

In [10]:
drop_list = ['ESS_Id', 
             'actionDescription', 
             'ecuSoftwareVersion', 
             'ecuSerialNumber', 
             'ecuModel', 
             'ecuMake', 
             'ecuSource', 
             'faultValue',
             'LocationTimeStamp',
             'MCTNumber']

faults = faults.drop(columns=drop_list)

In [11]:
service_stations = [
    (36.0666667, -86.4347222),
    (35.5883333, -86.4438888),
    (36.1950, -83.174722)
]
def is_near_service_station_kdtree(df, service_stations, threshold_distance=1.0):
    # convert threshold from km to approximate degrees
    # rough approximation: 1 degree ≈ 111 km
    degree_threshold = threshold_distance / 111.0
    
    # KDTree for service stations
    stations_array = np.array(service_stations)
    tree = cKDTree(stations_array)
    
    # query points
    points = np.vstack([df['Latitude'], df['Longitude']]).T
    
    # find points within threshold distance
    # Returns indices of points within degree_threshold of any service station
    indices = tree.query_ball_point(points, degree_threshold)
    
    is_near = np.array([len(idx) > 0 for idx in indices])
    
    return is_near

# Define threshold distance
threshold_distance = 1.0 

# apply function
faults['IsServiceStation'] = is_near_service_station_kdtree(
    faults, service_stations, threshold_distance)

In [12]:
threshold_miles = 0.5
threshold_meters = threshold_miles * 1609.34

In [13]:
faults['IsServiceStation'].value_counts(normalize = True)

IsServiceStation
False    0.889795
True     0.110205
Name: proportion, dtype: float64

**Diagnostics (Features) : Cleaning Data**

In [21]:
diagnostics['Value'] = diagnostics['Value'].replace({'FALSE': 0, 'TRUE': 1})

In [22]:
diagnostics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12821626 entries, 0 to 12821625
Data columns (total 4 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   Id       int64 
 1   Name     object
 2   Value    object
 3   FaultId  int64 
dtypes: int64(2), object(2)
memory usage: 391.3+ MB


In [26]:
diagnostics[diagnostics['Name'].isin(['FuelTemperature', 'IgnStatus', 'IntakeManifoldTemperature', 'LampStatus', 'ParkingBrake', 'ServiceDistance', 'Speed'])].head()

Unnamed: 0,Id,Name,Value,FaultId
0,1,IgnStatus,False,1
6,7,IntakeManifoldTemperature,78.8,1
10,11,LampStatus,1023,1
13,14,Speed,0,1
18,19,ParkingBrake,True,1


In [28]:
# `features` is only created by the pivot in the next cell, so run this preview after it.
#features[['IgnStatus', 'IntakeManifoldTemperature', 'LampStatus','Speed','ParkingBrake']].head()

In [30]:
diagnostics_w = diagnostics.pivot(index='FaultId', columns='Name', values='Value')
features = diagnostics_w.reset_index()
features.columns.name = None

**Merged Faults and Features**

In [32]:
combined = pd.merge(faults, 
                    features, 
                    left_on='RecordID', 
                    right_on='FaultId',
                    how = 'left')

**Classifying a Full Derate**

In [35]:
combined['IsDerateFull'] = (combined['spn'] == 5246) & (combined['active'] == True)
combined['IsDerateFull'].value_counts()

IsDerateFull
False    1186728
True         607
Name: count, dtype: int64

In [None]:
faults['IsFullDerate'] = (faults['spn'] == 5246) & (faults['active'] == True)

**Filter Area**

In [38]:
combined_filtered = combined[combined['IsServiceStation'] == False]
combined_filtered = combined_filtered[~((combined_filtered['spn'] == 5246) & (combined_filtered['active'] == False))]
combined_filtered['IsServiceStation'].value_counts(normalize = True)

IsServiceStation
False    1.0
Name: proportion, dtype: float64

**Inspect derate vehicle with highest n faults**

In [40]:
all_derate_vehicles = combined_filtered[(combined_filtered['IsDerateFull'] == True)]
grouped_derate_vehicles = all_derate_vehicles.groupby('EquipmentID').size().reset_index(name='row_count')
grouped_derate_vehicles = grouped_derate_vehicles.sort_values(by='row_count', ascending=False)
print('EquipmentID 1524 has the most rows to inspect')
grouped_derate_vehicles.head()

EquipmentID 1524 has the most rows to inspect


Unnamed: 0,EquipmentID,row_count
38,1524,31
43,1535,23
39,1525,15
45,1539,14
3,305,13


In [41]:
#Truck_1524 = combined_filtered[combined_filtered['EquipmentID'] == 1524]
#Truck_1524.head(31)
#Truck_1524.to_csv('Truck_1524.csv', index=False)

**Add a New Feature: Severity Level**

In [44]:
import re

def extract_severity(text):

    if pd.isna(text):
        return np.nan
        
    # "Severity" followed by "Low", "Medium", or "High"
    pattern = r'Severity\s+(Low|Medium|High)'
    
    # Search for the pattern 
    match = re.search(pattern, text)
    
    if match:
        # Return "Severity" plus the matched level
        return f"Severity {match.group(1)}"
    else:
        return np.nan

combined_filtered['SeverityLevel'] = combined_filtered['eventDescription'].apply(extract_severity)

In [47]:
severity_map = {
    'Severity Low': 1,
    'Severity Medium': 2,
    'Severity High': 3
}

combined_filtered['SeverityLevelFeature'] = combined_filtered['SeverityLevel'].map(severity_map)

combined_filtered.loc[combined_filtered['spn'] == 1569, 'SeverityLevelFeature'] = 4
#combined_filtered.loc[combined_filtered['spn'] == 5246, 'SeverityLevelFeature'] = 5

In [48]:
inspect_column = combined_filtered[['eventDescription', 'SeverityLevel','SeverityLevelFeature']].dropna(subset=['SeverityLevel'])
inspect_column.head()

Unnamed: 0,eventDescription,SeverityLevel,SeverityLevelFeature
0,Low (Severity Low) Engine Coolant Level,Severity Low,1.0
5,Low (Severity Low) Engine Coolant Level,Severity Low,1.0
6,Low (Severity Low) Engine Coolant Level,Severity Low,1.0
7,Low (Severity Low) Engine Coolant Level,Severity Low,1.0
8,High (Severity Low) Water In Fuel Indicator,Severity Low,1.0


**Convert Features to_numeric**

In [50]:
feature_numeric_columns = [
    'AcceleratorPedal', 'BarometricPressure', 'CruiseControlSetSpeed', 
    'DistanceLtd', 'EngineCoolantTemperature', 'EngineLoad', 'EngineOilPressure', 'EngineOilTemperature',
    'EngineRpm', 'EngineTimeLtd', 'FuelLevel', 'FuelLtd', 'FuelRate', 'FuelTemperature', 
    'IntakeManifoldTemperature', 'ParkingBrake', 'ServiceDistance', 'Speed', 
    'SwitchedBatteryVoltage', 'Throttle', 'TurboBoostPressure'
]

combined_filtered[feature_numeric_columns] = combined_filtered[feature_numeric_columns].apply(pd.to_numeric, errors='coerce')

#combined_filtered[feature_numeric_columns].dtypes

**Identify derate rows before creating derate window**

In [55]:
inspect_derate_rows = combined_filtered[combined_filtered['IsDerateFull'] == True][['IsDerateFull', 'active']].value_counts()
inspect_derate_rows

IsDerateFull  active
True          True      498
Name: count, dtype: int64

**Identify False Derates : any subsequent derate within 24hr period**

In [57]:
combined_filtered['EventTimeStamp'] = pd.to_datetime(combined_filtered['EventTimeStamp'])
combined_filtered['IsDerateActual'] = (combined_filtered['spn'] == 5246)

# filter to only derate events and sort by equipment ID and timestamp
derate_events = combined_filtered[combined_filtered['spn'] == 5246].copy()
derate_events = derate_events.sort_values(['EquipmentID', 'EventTimeStamp'])

# empty list to store indices of duplicate derates (within 24 hours)
duplicate_indices = []

# group by equipment id
for equipment_id, group in derate_events.groupby('EquipmentID'):
    # Reset the index for easier iteration
    group = group.reset_index()
    
    # keep track of the last valid derate timestamp
    last_valid_timestamp = None
    
    for i, row in group.iterrows():
        current_timestamp = row['EventTimeStamp']
        
        if last_valid_timestamp is None:
            # First derate for this equipment - keep it
            last_valid_timestamp = current_timestamp
        elif (current_timestamp - last_valid_timestamp).total_seconds() < 24 * 3600:
            # This derate is within 24 hours of the last valid one - mark as duplicate
            duplicate_indices.append(row['index'])
        else:
            # This derate is more than 24 hours after the last valid one - keep it
            last_valid_timestamp = current_timestamp

# mark duplicates as not actual derates
if duplicate_indices:
    combined_filtered.loc[duplicate_indices, 'IsDerateActual'] = False

# results
verification_df = combined_filtered[combined_filtered['spn'] == 5246][['EventTimeStamp', 'EquipmentID', 'spn', 'IsDerateActual']].sort_values(['EquipmentID', 'EventTimeStamp'])
print(verification_df)

# verify
verification_df['TimeSincePrevDerate'] = verification_df.groupby('EquipmentID')['EventTimeStamp'].diff()
print(verification_df)

             EventTimeStamp EquipmentID   spn  IsDerateActual
516208  2016-07-12 19:11:07         301  5246            True
1171245 2020-01-06 10:13:57         302  5246            True
1173036 2020-01-13 13:18:31         302  5246            True
1181996 2020-02-14 11:21:54         302  5246            True
376483  2016-02-15 10:59:28         304  5246            True
...                     ...         ...   ...             ...
998062  2018-07-10 13:37:00        1942  5246           False
1006240 2018-08-06 10:37:53        1946  5246            True
990328  2018-06-14 14:43:49        1978  5246            True
993600  2018-06-25 14:48:16         305  5246            True
1015659 2018-09-07 11:22:40         306  5246            True

[498 rows x 4 columns]
             EventTimeStamp EquipmentID   spn  IsDerateActual  \
516208  2016-07-12 19:11:07         301  5246            True   
1171245 2020-01-06 10:13:57         302  5246            True   
1173036 2020-01-13 13:18:31         3

In [124]:
combined_filtered['IsDerateActual'].value_counts()

IsDerateActual
False    1055549
True         355
Name: count, dtype: int64
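
The 24-hour rule used above can be illustrated on a plain list of timestamps for a single truck. This is a simplified sketch with a hypothetical function name, not a replacement for the cell above: a derate counts as "actual" only if it falls at least 24 hours after the last derate that was kept, not merely after the previous row.

```python
import pandas as pd


def flag_actual_derates(timestamps, gap_hours=24):
    """For one truck's sorted derate timestamps, mark which ones start a new event."""
    gap = pd.Timedelta(hours=gap_hours)
    flags = []
    last_kept = None
    for ts in timestamps:
        if last_kept is None or ts - last_kept >= gap:
            flags.append(True)   # first derate, or far enough from the last kept one
            last_kept = ts
        else:
            flags.append(False)  # repeat within the window of the last kept derate
    return flags


derate_times = pd.to_datetime([
    "2020-01-01 00:00", "2020-01-01 10:00", "2020-01-01 23:00",
    "2020-01-02 01:00", "2020-01-02 20:00",
])
actual_flags = flag_actual_derates(derate_times)
```

Here the fourth derate is kept because it is 25 hours after the first one, even though it is only 2 hours after the third.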

**Filter out False Derates**

In [59]:
combined_filtered = combined_filtered[~((combined_filtered['spn'] == 5246) & (combined_filtered['IsDerateActual'] == False))]

**Derate Window**

In [61]:
# datetime format
combined_filtered['EventTimeStamp'] = pd.to_datetime(combined_filtered['EventTimeStamp'])

# sort
combined_filtered = combined_filtered.sort_values(['EquipmentID', 'EventTimeStamp'])

# initial target column
combined_filtered['DeratePredictionTarget'] = 0

# helper dataframe with just the derate events
derate_events = combined_filtered[combined_filtered['IsDerateActual']].copy()

# group by EquipmentID 
for equipment_id, group in combined_filtered.groupby('EquipmentID'):
    # Get derate events for this truck only
    truck_derates = derate_events[derate_events['EquipmentID'] == equipment_id]
    
    if len(truck_derates) > 0:
        # get indices and timestamps for this truck's rows
        truck_indices = group.index
        truck_timestamps = group['EventTimeStamp'].values
        
        # for each derate event in this truck
        for _, derate_row in truck_derates.iterrows():
            derate_time = derate_row['EventTimeStamp']
            
            # window: 
            window_start = derate_time - pd.Timedelta(hours=8)
            window_end = derate_time - pd.Timedelta(hours=.001)
            
            # all events in the prediction window
            in_window = (truck_timestamps >= window_start) & (truck_timestamps <= window_end)
            indices_to_mark = truck_indices[in_window]
            
            # marked as predicting a derate
            combined_filtered.loc[indices_to_mark, 'DeratePredictionTarget'] = 1

# results
print(f"Total events: {len(combined_filtered)}")
print(f"Events predicting a derate: {combined_filtered['DeratePredictionTarget'].sum()}")

Total events: 1055904
Events predicting a derate: 1090
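
A vectorized alternative to the nested loops above might use `pd.merge_asof` to attach, to every event, the next derate on the same truck within 8 hours. This is a sketch under assumptions: the function name and the `NextDerateTime` column are hypothetical, and it excludes only events with exactly the derate's timestamp, whereas the loop above also excludes the final 3.6 seconds before the derate.

```python
import pandas as pd


def label_pre_derate_window(events, derates, hours=8):
    """Label events that occur within `hours` before a derate on the same truck."""
    # merge_asof requires both frames to be sorted by their time keys
    left = events.sort_values("EventTimeStamp")
    right = (
        derates[["EquipmentID", "EventTimeStamp"]]
        .rename(columns={"EventTimeStamp": "NextDerateTime"})
        .sort_values("NextDerateTime")
    )
    merged = pd.merge_asof(
        left, right,
        left_on="EventTimeStamp", right_on="NextDerateTime",
        by="EquipmentID",                     # match within the same truck only
        direction="forward",                  # look ahead to the next derate
        tolerance=pd.Timedelta(hours=hours),  # ignore derates further away than this
        allow_exact_matches=False,            # the derate row itself is not a precursor
    )
    merged["DeratePredictionTarget"] = merged["NextDerateTime"].notna().astype(int)
    return merged


toy_events = pd.DataFrame({
    "EquipmentID": [1, 1, 1, 2],
    "EventTimeStamp": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 05:00",
        "2020-01-01 09:00", "2020-01-01 09:30",
    ]),
})
toy_derates = pd.DataFrame({
    "EquipmentID": [1],
    "EventTimeStamp": pd.to_datetime(["2020-01-01 10:00"]),
})
labeled = label_pre_derate_window(toy_events, toy_derates)
```

The first event is 10 hours before the derate and stays unlabeled, and the event on truck 2 is unlabeled because that truck has no derate.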


In [62]:
# Truck_1524_test_v5 = combined_filtered[combined_filtered['EquipmentID'] == 1524]
# Truck_1524_test_v5.to_csv('Truck_1524_test_v5.csv', index=False)

**Investigate NaN values in both outcomes of response variable:**

*Is there a difference in available data between the two outcomes: IsDerateFull T/F*

In [64]:
from scipy.stats import chi2_contingency

In [65]:
feature_columns = [
    'AcceleratorPedal', 'BarometricPressure', 'CruiseControlActive', 'CruiseControlSetSpeed', 
    'DistanceLtd', 'EngineCoolantTemperature', 'EngineLoad', 'EngineOilPressure', 'EngineOilTemperature',
    'EngineRpm', 'EngineTimeLtd', 'FuelLevel', 'FuelLtd', 'FuelRate', 'FuelTemperature', 'IgnStatus', 
    'IntakeManifoldTemperature', 'LampStatus', 'ParkingBrake', 'ServiceDistance', 'Speed', 
    'SwitchedBatteryVoltage', 'Throttle', 'TurboBoostPressure'
]

derate_true = combined[combined['IsDerateFull'] == True]
derate_false = combined[combined['IsDerateFull'] == False]

derate_true = derate_true[feature_columns]
derate_false = derate_false[feature_columns]

nan_count_true = derate_true.isna().sum()
non_nan_count_true = derate_true.count()

nan_count_false = derate_false.isna().sum()
non_nan_count_false = derate_false.count()

nan_percent_true = (nan_count_true / len(derate_true)) * 100
nan_percent_false = (nan_count_false / len(derate_false)) * 100

nan_summary_true = pd.DataFrame({
    'Feature': nan_count_true.index, 
    'NaN Count': nan_count_true.values,
    'Non-NaN Count': non_nan_count_true.values,
    'NaN Percentage (%)_IsDerateFull = True': nan_percent_true.values
})

nan_summary_false = pd.DataFrame({
    'Feature': nan_count_false.index, 
    'NaN Count': nan_count_false.values,
    'Non-NaN Count': non_nan_count_false.values,
    'NaN Percentage (%)_IsDerateFull = False': nan_percent_false.values
})

In [66]:
true_pct = nan_summary_true[['Feature','NaN Percentage (%)_IsDerateFull = True']]
false_pct = nan_summary_false[['Feature','NaN Percentage (%)_IsDerateFull = False']]
nan_comparison = pd.merge(true_pct, false_pct, on = 'Feature')

In [67]:
nan_comparison['Difference'] =  nan_comparison['NaN Percentage (%)_IsDerateFull = False'] - nan_comparison['NaN Percentage (%)_IsDerateFull = True']
nan_comparison_sorted = nan_comparison.sort_values(by='Difference', ascending = False)
nan_comparison_sorted

Unnamed: 0,Feature,NaN Percentage (%)_IsDerateFull = True,NaN Percentage (%)_IsDerateFull = False,Difference
15,IgnStatus,0.0,48.779586,48.779586
11,FuelLevel,11.037891,57.677328,46.639437
16,IntakeManifoldTemperature,7.578254,50.643281,43.065028
9,EngineRpm,7.578254,50.590194,43.011941
1,BarometricPressure,7.742998,50.669741,42.926742
5,EngineCoolantTemperature,7.742998,50.661735,42.918737
7,EngineOilPressure,8.072488,50.646989,42.574501
12,FuelLtd,8.401977,50.735215,42.333238
6,EngineLoad,8.896211,50.699065,41.802854
13,FuelRate,9.2257,50.731254,41.505554
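
Whether a gap in missingness like the ones above is larger than chance could be checked with a chi-square test of independence, which is presumably why `chi2_contingency` was imported. The helper name and the toy data below are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def missingness_chi2(df, feature, outcome):
    """Chi-square test: is `feature` being NaN independent of `outcome`?"""
    # 2x2 table of outcome (rows) against is-missing (columns)
    table = pd.crosstab(df[outcome], df[feature].isna())
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return chi2, p_value, dof


# Toy data: 5% missing when the outcome is True, 50% missing when it is False
toy = pd.DataFrame({
    "IsDerateFull": [True] * 100 + [False] * 100,
    "FuelLevel": [np.nan] * 5 + [1.0] * 95 + [np.nan] * 50 + [1.0] * 50,
})
chi2, p_value, dof = missingness_chi2(toy, "FuelLevel", "IsDerateFull")
```

With more than a million rows, even tiny differences tend to come out as statistically significant, so the percentage gap itself remains the more informative number.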


**Identify features for model:**

*Are there features that show trends over time?*

In [69]:
features = [
    'DistanceLtd',
    'EngineOilTemperature',
    'TurboBoostPressure',
    'FuelRate',
    'EngineLoad',
    'EngineOilPressure',
    'EngineCoolantTemperature',
    'BarometricPressure',
    'EngineRpm',
    'IntakeManifoldTemperature',
    'FuelTemperature',
    'SwitchedBatteryVoltage',
    'SeverityLevelFeature'
]

**Statistical Models**

In [73]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

In [74]:
combined_filtered[features].dtypes

DistanceLtd                  float64
EngineOilTemperature         float64
TurboBoostPressure           float64
FuelRate                     float64
EngineLoad                   float64
EngineOilPressure            float64
EngineCoolantTemperature     float64
BarometricPressure           float64
EngineRpm                    float64
IntakeManifoldTemperature    float64
FuelTemperature              float64
SwitchedBatteryVoltage       float64
SeverityLevelFeature         float64
dtype: object

**Separate Train and Test Data**

In [77]:
combined_filtered['EventTimeStamp'] = pd.to_datetime(combined_filtered['EventTimeStamp'], errors='coerce')
combined_filtered['IsTestData'] = combined_filtered['EventTimeStamp'] >= '2019-01-01'
combined_filtered['IsTestData'].value_counts(normalize=True)

IsTestData
False    0.894563
True     0.105437
Name: proportion, dtype: float64

**Logistic Regression**

Version_2

*Logistic regression split Test >= 2019*

In [80]:
random_state = 42

# Training Data < 2019 
# Test Data >= 2019
train_data = combined_filtered[combined_filtered['IsTestData'] == False]
test_data = combined_filtered[combined_filtered['IsTestData'] == True]

X_train = train_data[features]
y_train = train_data['DeratePredictionTarget']
X_test = test_data[features]
y_test = test_data['DeratePredictionTarget']

X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')

X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_test.mean())

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  
X_test_scaled = scaler.transform(X_test)  

model = LogisticRegression(class_weight='balanced', random_state=random_state)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.1345
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.13      0.24    111148
           1       0.00      0.92      0.00       183

    accuracy                           0.13    111331
   macro avg       0.50      0.53      0.12    111331
weighted avg       1.00      0.13      0.23    111331

Confusion Matrix:
 [[14810 96338]
 [   14   169]]


**Confusion Matrix**

In [82]:
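# NOTE: these counts are hard-coded from an earlier run and do not match the
# confusion matrix printed above (TN=14810, FP=96338, FN=14, TP=169).
TN = 118219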
FP = 10435
FN = 316
TP = 183


Costs = (FP * 500)
Savings = (TP * 4000)

Net = Savings - Costs
print(Net)

-4485500


**Random Forest**

Version_1

In [86]:
# split data into train and test sets
train_data = combined_filtered[combined_filtered['IsTestData'] == False]
test_data = combined_filtered[combined_filtered['IsTestData'] == True]

# prepare features and target
X_train = train_data[features]
y_train = train_data['DeratePredictionTarget']
X_test = test_data[features]
y_test = test_data['DeratePredictionTarget']

# numeric conversion and missing values
X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')
X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_test.mean())

# scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# OPTIMIZATION 1
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train_scaled, y_train)

# OPTIMIZATION 2
random_forest_model = RandomForestClassifier(
    n_estimators=50,     
    max_depth=15,        
    min_samples_split=20, 
    n_jobs=-1,           
    random_state=42,
    class_weight='balanced'
)

# train the model
random_forest_model.fit(X_train_resampled, y_train_resampled)

# predictions
y_pred = random_forest_model.predict(X_test_scaled)

# evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# feature importance
feature_importances = pd.DataFrame({
    'Feature': features,
    'Importance': random_forest_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nTop 10 Important Features:")
print(feature_importances.head(10))

Accuracy: 0.9957
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    111148
           1       0.06      0.10      0.07       183

    accuracy                           1.00    111331
   macro avg       0.53      0.55      0.54    111331
weighted avg       1.00      1.00      1.00    111331

Confusion Matrix:
 [[110829    319]
 [   164     19]]

Top 10 Important Features:
                      Feature  Importance
12       SeverityLevelFeature    0.214960
0                 DistanceLtd    0.127416
10            FuelTemperature    0.108458
1        EngineOilTemperature    0.077515
9   IntakeManifoldTemperature    0.071135
5           EngineOilPressure    0.067904
8                   EngineRpm    0.066278
7          BarometricPressure    0.055489
6    EngineCoolantTemperature    0.052904
4                  EngineLoad    0.045936


**Confusion Matrix**

In [89]:
cm = confusion_matrix(y_test, y_pred)

TN = cm[0, 0]  
FP = cm[0, 1] 
FN = cm[1, 0] 
TP = cm[1, 1]  

In [90]:
Costs = (FP * 500)
Savings = (TP * 4000)



Net = Savings - Costs
print("Random Forest Model")
print(f"True Negatives: {TN}")
print(f"False Positives: {FP}")
print(f"False Negatives: {FN}")
print(f"True Positives: {TP}")
print(f"Money Saved: ${Net:,.2f}")

Random Forest Model
True Negatives: 110829
False Positives: 319
False Negatives: 164
True Positives: 19
Money Saved: $-83,500.00


In [91]:
test_derate_count = test_data[test_data['spn'] == 5246]
test_derate_count.shape

(43, 42)

In [122]:
test_data['IsDerateActual'].value_counts()

IsDerateActual
False    111288
True         43
Name: count, dtype: int64

In [92]:
# work on an explicit copy so adding columns does not raise SettingWithCopyWarning
test_data = test_data.copy()

# add predictions to the original test dataframe
test_data['Predicted'] = y_pred

# show if prediction was correct
test_data['Correct_Prediction'] = test_data['DeratePredictionTarget'] == test_data['Predicted']


test_data.head()

# If you want to focus on a specific subset of columns
columns_to_view = ['DeratePredictionTarget', 'Predicted', 'Correct_Prediction'] + features[:5]  
test_data[columns_to_view].head()

# To examine just the misclassified cases
misclassified = test_data[test_data['DeratePredictionTarget'] != test_data['Predicted']]
misclassified.head()


Unnamed: 0,RecordID,EventTimeStamp,eventDescription,spn,fmi,active,activeTransitionCount,EquipmentID,Latitude,Longitude,...,Throttle,TurboBoostPressure,IsDerateFull,SeverityLevel,SeverityLevelFeature,IsDerateActual,DeratePredictionTarget,IsTestData,Predicted,Correct_Prediction
1143740,1198979,2019-09-19 06:45:54,High (Severity Medium) Aftertreatment 1 Partic...,3251,16,True,1,301,35.804212,-86.389351,...,24.8,5.8,False,Severity Medium,2.0,False,0,True,1,False
1152407,1209631,2019-10-26 06:58:08,Incorrect Data Engine Exhaust Back Pressure Re...,649,2,True,19,301,36.37875,-86.713472,...,98.0,15.08,False,,,False,0,True,1,False
1171187,1230351,2020-01-06 07:20:54,Error in System Engine Torque Limit Request - ...,1787,11,True,1,302,38.247407,-85.788009,...,0.0,0.58,False,,,False,1,True,0,False
1171188,1230352,2020-01-06 07:20:54,Low Current Aftertreatment Fuel Injector 1,3556,5,True,1,302,38.247407,-85.788009,...,0.0,0.58,False,,,False,1,True,0,False
1171184,1230348,2020-01-06 07:21:34,Abnormal Update Rate Engine Exhaust Gas Recirc...,2791,9,False,1,302,38.247407,-85.788009,...,,,False,,,False,1,True,0,False


In [114]:
import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr

features = [
    'DistanceLtd',
    'EngineOilTemperature',
    'TurboBoostPressure',
    'FuelRate',
    'EngineLoad',
    'EngineOilPressure',
    'EngineCoolantTemperature',
    'BarometricPressure',
    'EngineRpm',
    'IntakeManifoldTemperature',
    'FuelTemperature',
    'SwitchedBatteryVoltage',
    'SeverityLevelFeature'
]

# Create a new DataFrame to preserve the original
combined_imputed = combined_filtered.copy()

# Treat inf values as missing, then impute missing values with each column's maximum
for feature in features:
    # Replace positive and negative infinity with NaN
    combined_imputed[feature] = combined_imputed[feature].replace([np.inf, -np.inf], np.nan)
    
    # Replace NaN values with the column's max value (ignoring NaNs)
    max_value = combined_imputed[feature].max()  # Now max will ignore NaNs
    combined_imputed[feature] = combined_imputed[feature].fillna(max_value)

# Now, handle NaN or Inf in the target column
combined_imputed['DeratePredictionTarget'] = combined_imputed['DeratePredictionTarget'].replace([np.inf, -np.inf], np.nan)
combined_imputed['DeratePredictionTarget'] = combined_imputed['DeratePredictionTarget'].fillna(combined_imputed['DeratePredictionTarget'].mode()[0])

# Drop any remaining rows with NaN values (if any) after replacements
combined_imputed = combined_imputed.dropna(subset=features + ['DeratePredictionTarget'])

# Re-run the correlation calculation
correlations = {}
for feature in features:
    feature_data = combined_imputed[feature]
    correlation, _ = pointbiserialr(feature_data, combined_imputed['DeratePredictionTarget'])
    correlations[feature] = correlation

# Create a DataFrame to display correlations
correlation_df = pd.DataFrame(list(correlations.items()), columns=['Feature', 'Point-Biserial Correlation'])

# Add a column with absolute correlation values to sort easily
correlation_df['Abs Correlation'] = correlation_df['Point-Biserial Correlation'].abs()

# Sort by absolute correlation to see which features are most related to the target
correlation_df = correlation_df.sort_values(by='Abs Correlation', ascending=False)

# Display the sorted correlation DataFrame
print(correlation_df)


                      Feature  Point-Biserial Correlation  Abs Correlation
11     SwitchedBatteryVoltage                   -0.021373         0.021373
12       SeverityLevelFeature                    0.013634         0.013634
9   IntakeManifoldTemperature                   -0.012105         0.012105
8                   EngineRpm                   -0.011511         0.011511
6    EngineCoolantTemperature                   -0.011321         0.011321
1        EngineOilTemperature                   -0.010810         0.010810
5           EngineOilPressure                   -0.010611         0.010611
3                    FuelRate                   -0.010598         0.010598
2          TurboBoostPressure                   -0.010509         0.010509
4                  EngineLoad                   -0.010384         0.010384
0                 DistanceLtd                   -0.010204         0.010204
7          BarometricPressure                   -0.002488         0.002488
10            FuelTempera

**For Slides**

In [140]:

value_counts = combined_filtered['IsDerateActual'].value_counts()


value_counts_df = value_counts.reset_index()
value_counts_df.columns = ['IsFullDerate', 'Count']

print('Full Dataset: Count of Derate Rows')
value_counts_df.head()

Full Dataset: Count of Derate Rows


Unnamed: 0,IsFullDerate,Count
0,False,1055549
1,True,355


In [138]:

value_counts_test_data = test_data['IsDerateActual'].value_counts()


value_counts_test_data_df = value_counts_test_data.reset_index()
value_counts_test_data_df.columns = ['IsFullDerate', 'Count']


print('Test Data: Count of Derate Rows')
value_counts_test_data_df.head()


Test Data: Count of Derate Rows


Unnamed: 0,IsFullDerate,Count
0,False,111288
1,True,43


In [142]:

display(value_counts_df)
display(value_counts_test_data_df)

Unnamed: 0,IsFullDerate,Count
0,False,1055549
1,True,355


Unnamed: 0,IsFullDerate,Count
0,False,111288
1,True,43
