# Big G Express: Predicting Derates
In this project, you will be working with J1939 fault code data and vehicle onboard diagnostic data to try to predict an upcoming full derate.

J1939 is a communications protocol used in heavy-duty vehicles (like trucks, buses, and construction equipment) to allow different electronic control units (ECUs), like the engine, transmission, and brake systems, to talk to each other. Fault codes in this system follow a standard format so that mechanics and diagnostic tools can understand what's wrong, no matter the make or model.

These fault codes have two parts. First, an SPN (Suspect Parameter Number), which identifies what system or component is having the issue. Second, an FMI (Failure Mode Identifier), which explains how the system is failing (too high, too low, short circuit, etc.).

A derate refers to the truck's computer intentionally reducing engine power or speed to protect itself or force the driver to get it serviced. This is a built-in safety measure. A full derate, the main target in this project, means the vehicle is severely limited, requiring a tow for repairs. Full derates are indicated by an SPN of 5246. 

You have been provided with two files containing the data you will use to make these predictions (J1939Faults.csv and VehicleDiagnosticOnboardData.csv) as well as two files describing some of the contents (DataInfo.docx and Service Fault Codes_1_0_0_167.xlsx).

Note that in its raw form the data does not have "labels", so you must define what labels you are going to use and create those labels in your dataset. Also, you will likely need to perform some significant feature engineering in order to build an accurate predictor.
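One possible way to create labels is sketched below: a fault record is marked positive when the same truck logs a full derate (SPN 5246) within a fixed window afterwards. The 2-hour window, the helper name `label_upcoming_derate`, and the `DerateSoon` column are assumptions made for illustration; only `EquipmentID`, `EventTimeStamp`, and `spn` come from the fault data.

```python
import pandas as pd


def label_upcoming_derate(faults, window="2h"):
    """Flag rows followed by a full derate (spn 5246) on the same truck within `window`."""
    df = faults.copy()
    df["EventTimeStamp"] = pd.to_datetime(df["EventTimeStamp"])
    # merge_asof requires both sides to be sorted by the time key.
    df = df.sort_values("EventTimeStamp").reset_index(drop=True)

    derates = (
        df.loc[df["spn"] == 5246, ["EquipmentID", "EventTimeStamp"]]
        .rename(columns={"EventTimeStamp": "NextDerateTime"})
    )

    # For each record, look up the next derate at or after it on the same truck.
    out = pd.merge_asof(
        df,
        derates,
        left_on="EventTimeStamp",
        right_on="NextDerateTime",
        by="EquipmentID",
        direction="forward",
    )
    gap = out["NextDerateTime"] - out["EventTimeStamp"]
    # Rows with no later derate have a missing gap and stay False.
    out["DerateSoon"] = gap.notna() & (gap <= pd.Timedelta(window))
    return out


# Tiny made-up example: truck A derates at 02:00, truck B never does.
demo = pd.DataFrame({
    "EquipmentID": ["A", "B", "A", "A"],
    "EventTimeStamp": ["2017-12-31 23:00", "2018-01-01 01:00",
                       "2018-01-01 01:30", "2018-01-01 02:00"],
    "spn": [100, 100, 200, 5246],
})
labeled = label_upcoming_derate(demo)
print(labeled[["EquipmentID", "EventTimeStamp", "spn", "DerateSoon"]])
```

Note that the derate record itself is also flagged (its gap is zero), so it would likely need to be excluded from the training rows to avoid predicting a derate from its own fault code.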

There are service locations at (36.0666667, -86.4347222), (35.5883333, -86.4438888), and (36.1950, -83.174722), so you should remove any records in the vicinity of these locations, as fault codes may be tripped when working on the vehicles.
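As a rough sketch of how "in the vicinity" could be checked for every record at once, the function below uses a vectorized haversine (great-circle) distance. The 1 km radius and the names `haversine_km` and `near_any_station` are assumptions for illustration; the coordinates are the three service locations listed above.

```python
import numpy as np

SERVICE_STATIONS = [
    (36.0666667, -86.4347222),
    (35.5883333, -86.4438888),
    (36.1950, -83.174722),
]
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; inputs in degrees, arrays or scalars."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def near_any_station(lat, lon, radius_km=1.0):
    """Boolean array: True where a point lies within radius_km of any station."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    near = np.zeros(lat.shape, dtype=bool)
    for s_lat, s_lon in SERVICE_STATIONS:
        near |= haversine_km(lat, lon, s_lat, s_lon) <= radius_km
    return near


# First point sits on a station, second is roughly 48 km north of it.
flags = near_any_station([36.0666667, 36.5], [-86.4347222, -86.4347222])
print(flags)
```

With the fault data loaded, the mask could be applied as `faults[~near_any_station(faults['Latitude'], faults['Longitude'])]`.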

When evaluating the performance of your model, assume that the cost associated with a missed full derate is approximately $4000 in towing and repairs, and the cost of a false positive prediction is about $500 due to having the truck off the road and serviced unnecessarily. While high accuracy or F1 is nice, we are most interested here in saving the company money, so the final metric to evaluate your model should be the cost savings.
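A minimal sketch of the cost-savings metric follows. It assumes that, compared with having no model at all, each caught derate saves the full $4000 and each false positive costs $500; the function names are hypothetical.

```python
MISSED_DERATE_COST = 4000   # towing and repairs for a missed full derate
FALSE_POSITIVE_COST = 500   # unnecessary service stop


def net_savings(true_positives, false_positives):
    """Savings relative to having no model, under the assumptions stated above."""
    return true_positives * MISSED_DERATE_COST - false_positives * FALSE_POSITIVE_COST


def break_even_precision():
    """Precision at which savings are exactly zero: TP * 4000 == FP * 500."""
    return FALSE_POSITIVE_COST / (MISSED_DERATE_COST + FALSE_POSITIVE_COST)


# Illustrative counts only, not results from the data.
print(net_savings(true_positives=10, false_positives=20))
print(round(break_even_precision(), 3))
```

Under these assumptions a model needs a precision above 1/9 (about 11%) on the positive class before it saves money.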

**Project Timeline:**

* Thursday, May 8: Present preliminary findings to instructors.
* Tuesday, May 13: Present final findings to class.

Your presentation should use slides, not code in a notebook. Your final presentation should include at least the following points:
* What features did you use to predict? Report some of the more impactful features using some kind of feature importance metric.
* If you had used the data prior to 2019 to train your model and had been using it from January 1, 2019 onwards, how many full derates would you have caught? How many false positives? What is the net savings or cost of using your model for that time span? Report your estimate here, even if the model would have been a net negative.

In [None]:
#pip install geopy

In [2]:
import pandas as pd
from geopy.distance import geodesic
import numpy as np

**Read in data**

In [4]:
faults = pd.read_csv("../data/J1939Faults.csv")
diagnostics = pd.read_csv("../data/VehicleDiagnosticOnboardData.csv")



In [97]:
faults.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187335 entries, 0 to 1187334
Data columns (total 11 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   RecordID               1187335 non-null  int64  
 1   EventTimeStamp         1187335 non-null  object 
 2   eventDescription       1126490 non-null  object 
 3   spn                    1187335 non-null  int64  
 4   fmi                    1187335 non-null  int64  
 5   active                 1187335 non-null  bool   
 6   activeTransitionCount  1187335 non-null  int64  
 7   EquipmentID            1187335 non-null  object 
 8   Latitude               1187335 non-null  float64
 9   Longitude              1187335 non-null  float64
 10  IsServiceStation       1187335 non-null  bool   
dtypes: bool(2), float64(2), int64(4), object(3)
memory usage: 83.8+ MB


**Faults : Cleaning Data**

In [6]:
drop_list = ['ESS_Id', 
             'actionDescription', 
             'ecuSoftwareVersion', 
             'ecuSerialNumber', 
             'ecuModel', 
             'ecuMake', 
             'ecuSource', 
             'faultValue',
             'LocationTimeStamp',
             'MCTNumber']

faults = faults.drop(columns=drop_list)

In [8]:
service_stations = [
    (36.0666667, -86.4347222),
    (35.5883333, -86.4438888),
    (36.1950, -83.174722)
]

threshold_distance = 1.0  # kilometers


def is_near_service_station(lat, lon):
    """Return True if (lat, lon) is within threshold_distance km of any service station."""
    point = (lat, lon)
    for station in service_stations:
        distance = geodesic(point, station).kilometers
        if distance <= threshold_distance:
            return True
    return False


faults['IsServiceStation'] = faults.apply(lambda row: is_near_service_station(row['Latitude'], row['Longitude']), axis=1)

In [10]:
faults['IsServiceStation'].value_counts(normalize = True)

IsServiceStation
False    0.88936
True     0.11064
Name: proportion, dtype: float64

**Diagnostics (Features) : Cleaning Data**

In [14]:
diagnostics['Value'] = diagnostics['Value'].replace({'FALSE': 0, 'TRUE': 1})

In [16]:
diagnostics_w = diagnostics.pivot(index='FaultId', columns='Name', values='Value')
features = diagnostics_w.reset_index()
features.columns.name = None

**Merged Faults and Features**

In [24]:
combined = pd.merge(faults, 
                    features, 
                    left_on='RecordID', 
                    right_on='FaultId',
                    how = 'left')

In [26]:
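# Drop records logged near a service station.
# Note: the cells below continue to use `combined`, so this filter is not applied to the modeling data.
combined_filtered = combined[~combined['IsServiceStation']]
#combined_filtered['IsServiceStation'].value_counts(normalize = True)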

In [103]:
#combined['IsDerateFull'] = combined['spn'] == 5246
# Label a record as a full derate only when SPN 5246 is reported as active.
combined['IsDerateFull'] = (combined['spn'] == 5246) & combined['active']


In [105]:
combined['IsDerateFull'].value_counts(normalize = True)

IsDerateFull
False    0.999489
True     0.000511
Name: proportion, dtype: float64

In [107]:
inspect_derate_rows = combined[combined['IsDerateFull'] == True][['IsDerateFull', 'active']].value_counts()
inspect_derate_rows

IsDerateFull  active
True          True      607
Name: count, dtype: int64

**Investigate NaN values in both outcomes of response variable:**

*Is there a difference in available data between the two outcomes: IsDerateFull T/F*

In [109]:
feature_columns = [
    'AcceleratorPedal', 'BarometricPressure', 'CruiseControlActive', 'CruiseControlSetSpeed', 
    'DistanceLtd', 'EngineCoolantTemperature', 'EngineLoad', 'EngineOilPressure', 'EngineOilTemperature',
    'EngineRpm', 'EngineTimeLtd', 'FuelLevel', 'FuelLtd', 'FuelRate', 'FuelTemperature', 'IgnStatus', 
    'IntakeManifoldTemperature', 'LampStatus', 'ParkingBrake', 'ServiceDistance', 'Speed', 
    'SwitchedBatteryVoltage', 'Throttle', 'TurboBoostPressure'
]

derate_true = combined[combined['IsDerateFull'] == True]
derate_false = combined[combined['IsDerateFull'] == False]

derate_true = derate_true[feature_columns]
derate_false = derate_false[feature_columns]

nan_count_true = derate_true.isna().sum()
non_nan_count_true = derate_true.count()

nan_count_false = derate_false.isna().sum()
non_nan_count_false = derate_false.count()

nan_percent_true = (nan_count_true / len(derate_true)) * 100
nan_percent_false = (nan_count_false / len(derate_false)) * 100

nan_summary_true = pd.DataFrame({
    'Feature': nan_count_true.index, 
    'NaN Count': nan_count_true.values,
    'Non-NaN Count': non_nan_count_true.values,
    'NaN Percentage (%)_IsDerateFull = True': nan_percent_true.values
})

nan_summary_false = pd.DataFrame({
    'Feature': nan_count_false.index, 
    'NaN Count': nan_count_false.values,
    'Non-NaN Count': non_nan_count_false.values,
    'NaN Percentage (%)_IsDerateFull = False': nan_percent_false.values
})

In [111]:
true_pct = nan_summary_true[['Feature','NaN Percentage (%)_IsDerateFull = True']]
false_pct = nan_summary_false[['Feature','NaN Percentage (%)_IsDerateFull = False']]
nan_comparison = pd.merge(true_pct, false_pct, on = 'Feature')

In [117]:
nan_comparison['Difference'] =  nan_comparison['NaN Percentage (%)_IsDerateFull = False'] - nan_comparison['NaN Percentage (%)_IsDerateFull = True']
nan_comparison_sorted = nan_comparison.sort_values(by='Difference', ascending = False)
nan_comparison_sorted

| | Feature | NaN Percentage (%)_IsDerateFull = True | NaN Percentage (%)_IsDerateFull = False | Difference |
|---|---|---|---|---|
| 15 | IgnStatus | 0.0 | 48.779586 | 48.779586 |
| 11 | FuelLevel | 11.037891 | 57.677328 | 46.639437 |
| 16 | IntakeManifoldTemperature | 7.578254 | 50.643281 | 43.065028 |
| 9 | EngineRpm | 7.578254 | 50.590194 | 43.011941 |
| 1 | BarometricPressure | 7.742998 | 50.669741 | 42.926742 |
| 5 | EngineCoolantTemperature | 7.742998 | 50.661735 | 42.918737 |
| 7 | EngineOilPressure | 8.072488 | 50.646989 | 42.574501 |
| 12 | FuelLtd | 8.401977 | 50.735215 | 42.333238 |
| 6 | EngineLoad | 8.896211 | 50.699065 | 41.802854 |
| 13 | FuelRate | 9.2257 | 50.731254 | 41.505554 |

**Identify features for model:**

*Are there features that show trends over time?*

In [51]:
features = [
    'DistanceLtd',
    'EngineOilTemperature',
    'TurboBoostPressure',
    'FuelRate',
    'EngineLoad',
    'EngineOilPressure',
    'EngineCoolantTemperature',
    'BarometricPressure',
    'EngineRpm',
    'IntakeManifoldTemperature',
    'FuelTemperature',
    'SwitchedBatteryVoltage'
]

**Statistical Models**

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

In [55]:
combined[features].dtypes

DistanceLtd                  object
EngineOilTemperature         object
TurboBoostPressure           object
FuelRate                     object
EngineLoad                   object
EngineOilPressure            object
EngineCoolantTemperature     object
BarometricPressure           object
EngineRpm                    object
IntakeManifoldTemperature    object
FuelTemperature              object
SwitchedBatteryVoltage       object
dtype: object

**Logistic Regression**

Version_1

*Logistic regression model with a random 80/20 train/test split*

In [119]:
X = combined[features]
y = combined['IsDerateFull']

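# Sensor values are stored as strings; coerce to numeric (unparseable values become NaN).
X = X.apply(pd.to_numeric, errors='coerce')
# Fill gaps with column means (computed over all rows, before the split).
X = X.fillna(X.mean())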

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  
X_test_scaled = scaler.transform(X_test)  

model = LogisticRegression(class_weight='balanced')
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.8020
Classification Report:
               precision    recall  f1-score   support

       False       1.00      0.80      0.89    237349
        True       0.00      0.80      0.00       118

    accuracy                           0.80    237467
   macro avg       0.50      0.80      0.45    237467
weighted avg       1.00      0.80      0.89    237467

Confusion Matrix:
 [[190358  46991]
 [    24     94]]


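* True Negatives (TN): 190,358 records correctly predicted False.
* False Positives (FP): 46,991 records incorrectly predicted True.
* False Negatives (FN): 24 records incorrectly predicted False.
* True Positives (TP): 94 records correctly predicted True.

Precision for the True class is 94 / (94 + 46,991), about 0.2%, which the report rounds to 0.00.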

In [121]:
combined['EventTimeStamp'] = pd.to_datetime(combined['EventTimeStamp'], errors='coerce')
combined['IsTestData'] = combined['EventTimeStamp'] >= '2019-01-01'
combined['IsTestData'].value_counts(normalize=True)

IsTestData
False    0.891129
True     0.108871
Name: proportion, dtype: float64

**Logistic Regression**

Version_2

*Logistic regression with a time-based split: train on records before 2019, test on records from January 1, 2019 onward*

In [123]:
random_state = 42

# Training Data < 2019 
# Test Data >= 2019
train_data = combined[combined['IsTestData'] == False]
test_data = combined[combined['IsTestData'] == True]

X_train = train_data[features]
y_train = train_data['IsDerateFull']
X_test = test_data[features]
y_test = test_data['IsDerateFull']

X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')

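# Fill gaps with column means; the test set is filled with its own means rather than the training means.
X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_test.mean())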

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  
X_test_scaled = scaler.transform(X_test)  

model = LogisticRegression(class_weight='balanced', random_state=random_state)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.0705
Classification Report:
               precision    recall  f1-score   support

       False       1.00      0.07      0.13    129172
        True       0.00      0.97      0.00        94

    accuracy                           0.07    129266
   macro avg       0.50      0.52      0.07    129266
weighted avg       1.00      0.07      0.13    129266

Confusion Matrix:
 [[  9018 120154]
 [     3     91]]


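* True Negatives (TN): 9,018 records correctly predicted False.
* False Positives (FP): 120,154 records incorrectly predicted True.
* False Negatives (FN): 3 records incorrectly predicted False.
* True Positives (TP): 91 records correctly predicted True.

Precision for the True class is 91 / (91 + 120,154), about 0.08%, which the report rounds to 0.00.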
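`model.predict` applies a fixed 0.5 probability cutoff, which ignores the gap between the $4000 and $500 costs. One option, sketched here on made-up scores, is to sweep the cutoff over `predict_proba` output and keep the value with the highest net savings. The helper name `best_threshold` and the candidate thresholds are assumptions, and the cutoff would need to be chosen on held-out data from the training period, not on the 2019 test set.

```python
import numpy as np


def best_threshold(y_true, scores, thresholds, miss_cost=4000, fp_cost=500):
    """Return (threshold, savings) maximizing TP * miss_cost - FP * fp_cost."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    best_t, best_savings = None, -np.inf
    for t in thresholds:
        pred = scores >= t
        tp = int(np.sum(pred & y_true))
        fp = int(np.sum(pred & ~y_true))
        savings = tp * miss_cost - fp * fp_cost
        if savings > best_savings:
            best_t, best_savings = t, savings
    return best_t, best_savings


# Made-up labels and predicted probabilities for the positive class.
y_demo = [0, 0, 0, 0, 1, 1]
p_demo = [0.1, 0.4, 0.6, 0.7, 0.8, 0.9]
threshold, savings = best_threshold(y_demo, p_demo, thresholds=[0.5, 0.75])
print(threshold, savings)
```

With a fitted classifier, the scores would come from `model.predict_proba(X_val_scaled)[:, 1]`, where `X_val_scaled` is a hypothetical validation set.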

**Random Forest**

Version_1

In [132]:
train_data = combined[combined['IsTestData'] == False]
test_data = combined[combined['IsTestData'] == True]

X_train = train_data[features]
y_train = train_data['IsDerateFull']
X_test = test_data[features]
y_test = test_data['IsDerateFull']

X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')

X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_test.mean())

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

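# Oversample the minority class with SMOTE (training data only).
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# class_weight='balanced' is kept even though SMOTE has already balanced the classes.
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')

random_forest_model.fit(X_train_resampled, y_train_resampled)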

y_pred = random_forest_model.predict(X_test_scaled)


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.4326
Classification Report:
               precision    recall  f1-score   support

       False       1.00      0.43      0.60    129172
        True       0.00      0.20      0.00        94

    accuracy                           0.43    129266
   macro avg       0.50      0.32      0.30    129266
weighted avg       1.00      0.43      0.60    129266

Confusion Matrix:
 [[55897 73275]
 [   75    19]]
