<a href="https://www.kaggle.com/code/gregoriusbayuaji/traffic-prediction-using-performance-data?scriptVersionId=201346153" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# Print file paths
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Traffic Prediction

The main purpose of this portfolio project is to predict the traffic condition from the cars' performance data. Using their speed and engine readings, I can infer the traffic in the area they are driving through. The data comes from a Peugeot 207 and an Opel Corsa recorded under several traffic conditions.

In [None]:
# Read a semicolon-separated CSV file; if parsing fails, retry and skip the malformed lines
def read_csv_error(file_path):
    try:
        return pd.read_csv(file_path, sep=";")
    except pd.errors.ParserError:
        print(f"Error parsing {file_path}. Skipping malformed lines")
        return pd.read_csv(file_path, sep=";", on_bad_lines = 'skip')
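
To see what the fallback in `read_csv_error` does, here is a minimal sketch on an in-memory CSV. The three-column sample and its values are made up for illustration; only `sep` and `on_bad_lines` mirror the cell above.

```python
import io
import pandas as pd

# Hypothetical sample: the second data row has one field too many.
raw = "a;b;c\n1;2;3\n4;5;6;7\n8;9;10\n"

# The default behaviour raises ParserError on the over-long row.
try:
    pd.read_csv(io.StringIO(raw), sep=";")
    strict_ok = True
except pd.errors.ParserError:
    strict_ok = False

# on_bad_lines='skip' drops the malformed row; it does not pad it with nulls.
lenient = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip")
print(strict_ok, len(lenient))
```

Because rows are dropped silently, it can be worth comparing the row count with the number of lines in the file when the fallback is triggered.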

In [None]:
peugeot_01 = read_csv_error('/kaggle/input/traffic-driving-style-road-surface-condition/peugeot_207_01.csv')
peugeot_02 = read_csv_error('/kaggle/input/traffic-driving-style-road-surface-condition/peugeot_207_02.csv')
opel_01 = read_csv_error('/kaggle/input/traffic-driving-style-road-surface-condition/opel_corsa_01.csv')
opel_02 = read_csv_error('/kaggle/input/traffic-driving-style-road-surface-condition/opel_corsa_02.csv')

## Data Preparation

There are 17 columns in each dataset:
1. AltitudeVariation = the altitude variation of the car
2. VehicleSpeedInstantaneous = the speed of the car at a given instant
3. VehicleSpeedAverage = the latest average speed of the car
4. VehicleSpeedVariance = the variance of the car's speed
5. VehicleSpeedVariation = the variation of the car's speed
6. LongitudinalAcceleration = the longitudinal acceleration of the car
7. EngineLoad = the engine load of the car
8. EngineCoolantTemperature = the engine coolant temperature of the car
9. ManifoldAbsolutePressure = the intake manifold absolute pressure of the engine
10. EngineRPM = the engine speed (revolutions per minute) of the car
11. MassAirFlow = the mass air flow of the engine
12. IntakeAirTemperature = the intake air temperature of the car
13. VerticalAcceleration = the vertical acceleration of the car
14. FuelConsumptionAverage = the average fuel consumption of the car
15. roadSurface = the road condition
16. traffic = the traffic condition
17. drivingStyle = the driving style of the driver

We start by checking each dataset.

## Peugeot

In [None]:
# Show the datatype and the number of columns
peugeot_01.info()

In [None]:
# Show the datatype and the number of columns
peugeot_02.info()

Both datasets have the same columns and data types. Since they come from the same car, we can merge them.

In [None]:
# Merge the DataFrames
peugeot = pd.concat([peugeot_01, peugeot_02])

# Drop duplicate rows
peugeot = peugeot.drop_duplicates()

# Display the resulting DataFrame
peugeot.info()

Since many of the columns actually hold numerical values, we should convert their data type (except for the roadSurface, traffic, and drivingStyle columns).

In [None]:
# List the numeric columns that are stored with the 'object' dtype
column_names = ['AltitudeVariation', 'VehicleSpeedInstantaneous', 'VehicleSpeedAverage', 'VehicleSpeedVariation',
                'VehicleSpeedVariance', 'LongitudinalAcceleration', 'EngineLoad',
                'EngineRPM', 'MassAirFlow', 'VerticalAcceleration', 'FuelConsumptionAverage']

for col in column_names:
    # Apply the replacement only on string values that contain a comma
    peugeot[col] = peugeot[col].apply(lambda x: str(x).replace(',', '.') if isinstance(x, str) and ',' in x else x)
    
    # Convert the column to float
    peugeot[col] = peugeot[col].astype(float)

# Add a new 'brand' column filled with the value 'peugeot'
peugeot["brand"] = "peugeot"

# Show the dataset
peugeot.head()
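
The comma-to-dot conversion above can also be done when the file is read. The following is a sketch on made-up values, assuming the numeric columns use a decimal comma as the loop implies; `decimal` is a standard `pd.read_csv` parameter, and the second option is a vectorised version of the same replacement.

```python
import io
import pandas as pd

# Hypothetical two-column sample with decimal commas.
raw = "VehicleSpeedInstantaneous;EngineLoad\n12,5;33,75\n0;41,25\n"

# Option 1: let the parser interpret decimal commas at read time.
parsed = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

# Option 2: read as text, then convert with vectorised string methods.
as_text = pd.read_csv(io.StringIO(raw), sep=";", dtype=str)
converted = as_text.apply(
    lambda s: pd.to_numeric(s.str.replace(",", ".", regex=False))
)

print(parsed.dtypes)
print(converted)
```

Option 1 avoids the per-column loop entirely, but it only helps if every affected file is read with the same setting.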

In [None]:
# Count the null value number of each column
peugeot.isnull().sum()

In [None]:
# Drop the rows with null values in these columns
# because only a few rows are affected
peugeot.dropna(subset=(['EngineLoad','AltitudeVariation', 'VehicleSpeedInstantaneous']), inplace=True)

# Check the null number
peugeot.isnull().sum()

In [None]:
# Fill values for the average columns: the column mean
mean_vehicle_speed = peugeot['VehicleSpeedAverage'].mean()
mean_fuel_consumption = peugeot['FuelConsumptionAverage'].mean()

# Fill value for VehicleSpeedVariance: the variance of that column
var_vehicle_speed = peugeot['VehicleSpeedVariance'].var()

# Fill value for VehicleSpeedVariation: the mode
# mode() returns a Series, so take its first entry to get a scalar
mode_vehicle_speed = peugeot['VehicleSpeedVariation'].mode().iloc[0]

# Fill the null values; assigning the result back avoids chained in-place fillna
peugeot['VehicleSpeedAverage'] = peugeot['VehicleSpeedAverage'].fillna(mean_vehicle_speed)
peugeot['VehicleSpeedVariance'] = peugeot['VehicleSpeedVariance'].fillna(var_vehicle_speed)
peugeot['VehicleSpeedVariation'] = peugeot['VehicleSpeedVariation'].fillna(mode_vehicle_speed)
peugeot['FuelConsumptionAverage'] = peugeot['FuelConsumptionAverage'].fillna(mean_fuel_consumption)

# Check the null value number again
peugeot.isnull().sum()
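
As a small illustration of per-column imputation, the sketch below fills a toy frame using a dictionary of scalars. The values are invented; the point is that `mode()` returns a Series, so one entry has to be picked before it can serve as a fill value.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up).
toy = pd.DataFrame({
    "VehicleSpeedAverage": [10.0, np.nan, 30.0],
    "VehicleSpeedVariation": [0.5, 0.5, np.nan],
})

# mode() can return several values on ties, so take the first one.
fill_values = {
    "VehicleSpeedAverage": toy["VehicleSpeedAverage"].mean(),
    "VehicleSpeedVariation": toy["VehicleSpeedVariation"].mode().iloc[0],
}

# A dict fills each column with its own scalar and returns a new frame.
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```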

In [None]:
# Drop the remaining null values
peugeot.dropna(inplace=True)

In [None]:
# Describe the peugeot datasets
peugeot.info()

## Opel

In [None]:
opel_01.info()

In [None]:
opel_02.info()

It seems the Opel data has the same structure as the Peugeot data, so I can merge the Opel files as well.

In [None]:
# Merge the DataFrames
opel = pd.concat([opel_01, opel_02])

# Drop duplicate rows
opel = opel.drop_duplicates()

# Display the resulting DataFrame
opel.info()

In [None]:
for col in column_names:
    # Apply the replacement only on string values that contain a comma
    opel[col] = opel[col].apply(lambda x: str(x).replace(',', '.') if isinstance(x, str) and ',' in x else x)
    
    # Convert the column to float
    opel[col] = opel[col].astype(float)

opel["brand"] = "opel"

# Display dataframe information
opel.head()

In [None]:
# Check the null value number of each column
opel.isnull().sum()

In [None]:
# Drop the rows with null values in some columns
opel.dropna(subset=(['AltitudeVariation', 'VehicleSpeedInstantaneous', 'VehicleSpeedVariation']), inplace=True)

# Check the null value number
opel.isnull().sum()

In [None]:
# Fill values for the average columns: the column mean
mean_vehicle_speed = opel['VehicleSpeedAverage'].mean()
mean_fuel_consumption = opel['FuelConsumptionAverage'].mean()

# Fill value for VehicleSpeedVariance: the variance of that column
var_vehicle_speed = opel['VehicleSpeedVariance'].var()

# Fill the null values; assigning the result back avoids chained in-place fillna
opel['VehicleSpeedAverage'] = opel['VehicleSpeedAverage'].fillna(mean_vehicle_speed)
opel['VehicleSpeedVariance'] = opel['VehicleSpeedVariance'].fillna(var_vehicle_speed)
opel['FuelConsumptionAverage'] = opel['FuelConsumptionAverage'].fillna(mean_fuel_consumption)

# Check the null value number
opel.isnull().sum()

In [None]:
# Count the 'object' values of 'peugeot' dataset
value_road_peugeot = peugeot['roadSurface'].value_counts()
value_traffic_peugeot = peugeot['traffic'].value_counts()
value_driving_peugeot = peugeot['drivingStyle'].value_counts()

print(value_road_peugeot)
print('-----------')
print(value_traffic_peugeot)
print('-----------')
print(value_driving_peugeot)

In [None]:
# Count the 'object' values of 'opel' dataset
value_road_opel = opel['roadSurface'].value_counts()
value_traffic_opel = opel['traffic'].value_counts()
value_driving_opel = opel['drivingStyle'].value_counts()

print(value_road_opel)
print('-----------')
print(value_traffic_opel)
print('-----------')
print(value_driving_opel)

Since both Peugeot and Opel have the same features, and their values for roadSurface, traffic, and drivingStyle are the same, we can merge the Opel and Peugeot data.

## Merge both Peugeot and Opel

In [None]:
# Concatenate both opel and peugeot
df = pd.concat([peugeot, opel])

# Describe the result
df.info()

# Exploratory Data Analysis (EDA)

Create a heatmap of the correlations between the numeric columns to see which ones are most strongly related.

In [None]:
# Select only numeric columns
numeric_df = df.select_dtypes(include=[np.number])

# Calculate the correlation matrix for numeric columns
corr_matrix = numeric_df.corr()

# Plot the heatmap
plt.figure(figsize=(15,15))
sns.heatmap(corr_matrix, annot=True, cmap="RdYlGn", annot_kws={"size":10})

plt.show()
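
A large heatmap can be hard to read, so the same correlation matrix can also be flattened into a ranked list of pairs. This is a sketch on synthetic data (the column names `speed`, `rpm`, and `noise` are hypothetical). Note that it ranks how features relate to each other, not how useful they are for predicting traffic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
speed = rng.normal(50, 10, 200)
toy = pd.DataFrame({
    "speed": speed,
    "rpm": speed * 40 + rng.normal(0, 50, 200),  # strongly tied to speed
    "noise": rng.normal(0, 1, 200),              # unrelated column
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```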

# Data Splitting

In [None]:
# Split the data into train and test sets, stratified by traffic
train, test = train_test_split(df, test_size=0.25, random_state=21, stratify=df.traffic)

In [None]:
print("train: ", train.shape)
print("------------")
print("test: ", test.shape)

In [None]:
# Reset the index of both sets after the split
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
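
The `stratify` argument keeps the class proportions roughly equal in both sets. A minimal sketch with an invented, imbalanced label column (the short labels here are placeholders, not the dataset's actual values):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 70/20/10 class imbalance.
toy = pd.DataFrame({
    "x": range(100),
    "traffic": ["Low"] * 70 + ["Normal"] * 20 + ["High"] * 10,
})

tr, te = train_test_split(
    toy, test_size=0.25, random_state=21, stratify=toy["traffic"]
)

# Both sets should show approximately the original proportions.
print(tr["traffic"].value_counts(normalize=True).round(2))
print(te["traffic"].value_counts(normalize=True).round(2))
```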

# Data Encoding

Encode the categorical columns in both the train and test data so that all of the data becomes numerical. I use one-hot encoding for road surface, driving style, and brand, because each of these columns has only 2 or 3 distinct values.

In [None]:
# Apply one-hot encoding: fit on the train data only, then reuse the fitted encoder on the test data
encode = OneHotEncoder(sparse_output=False)
encoded_train = encode.fit_transform(train[['roadSurface', 'drivingStyle', 'brand']])
encoded_test = encode.transform(test[['roadSurface', 'drivingStyle', 'brand']])

# Convert the one-hot encoded array to a DataFrame
one_hot_encoded_train = pd.DataFrame(encoded_train, columns=encode.get_feature_names_out(['roadSurface', 'drivingStyle', 'brand']))
one_hot_encoded_test = pd.DataFrame(encoded_test, columns=encode.get_feature_names_out(['roadSurface', 'drivingStyle', 'brand']))

# Combine the original DataFrame (excluding 'roadSurface', 'drivingStyle', and 'brand') with the one-hot encoded columns
train = pd.concat([train.drop(['roadSurface', 'drivingStyle', 'brand'], axis=1), one_hot_encoded_train], axis=1)
test = pd.concat([test.drop(['roadSurface', 'drivingStyle', 'brand'], axis=1), one_hot_encoded_test], axis=1)

train.head()

I use a label encoder for traffic because it is the target and needs to stay in a single column, even though it has several distinct values.

In [None]:
# Apply the label encoder: fit on the train data, then reuse the same mapping on the test data
label_encoder = LabelEncoder()
train['traffic_encoded'] = label_encoder.fit_transform(train['traffic'])
test['traffic_encoded'] = label_encoder.transform(test['traffic'])

# Drop the old 'traffic' column
train = train.drop(['traffic'], axis=1)
test = test.drop(['traffic'], axis=1)

train.head()
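
To check which integer each traffic label received, the fitted encoder's `classes_` attribute can be inspected. A minimal sketch with placeholder labels (the real label strings in the dataset may differ):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Low", "Normal", "High", "Low"])  # hypothetical labels

# classes_ is sorted; each label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# transform() applies the same mapping to new data without refitting.
codes = le.transform(["Normal", "High"])
print(codes)
```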

In [None]:
# Split the data into features (x) and target (y), with traffic as the target
x_train = train.drop('traffic_encoded', axis=1)
y_train = train.traffic_encoded

print("x_train: ", x_train.shape)
print("y_train: ", y_train.shape)

In [None]:
x_test = test.drop('traffic_encoded', axis=1)
y_test = test.traffic_encoded

print("x_test: ", x_test.shape)
print("y_test: ", y_test.shape)

# Feature Selection

In [None]:
# Rank the features by their random forest importance
def tree_based_feature_importance(x_train, y_train):
    # Create the random forest model
    model = RandomForestClassifier()

    # Fit the model on the training data
    model.fit(x_train, y_train)

    # Get the importance of each feature
    importances = model.feature_importances_

    # Create a data frame of the importances, sorted in descending order
    final_df = pd.DataFrame({"Features": x_train.columns, "Importances": importances})
    final_df = final_df.sort_values('Importances', ascending=False)

    # Plot the 6 most important features
    pd.Series(importances, index=x_train.columns).nlargest(6).plot(kind='barh')
    return final_df

In [None]:
# Use the function to rank the features by importance
feature_importance = tree_based_feature_importance(x_train, y_train)

In [None]:
# Display the feature importance score
display(feature_importance)
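
Instead of copying feature names by hand from the importance table, the top-ranked names can be taken from the ranking itself. The sketch below uses synthetic data from `make_classification`; the feature names `f0` to `f7` and the cut-off `k=4` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

# Keep the k highest-ranked feature names.
top_k = ranking.head(4).index.tolist()
print(top_k)
```

Fixing `random_state` makes the ranking reproducible between runs, which matters when the selected list feeds later cells.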

In [None]:
# Making a list of selected features
selected_features = ['VehicleSpeedAverage', 'IntakeAirTemperature', 'FuelConsumptionAverage',
                    'EngineCoolantTemperature', 'VehicleSpeedVariance', 'roadSurface_UnevenCondition',
                    'LongitudinalAcceleration', 'roadSurface_SmoothCondition', 'EngineRPM', 'ManifoldAbsolutePressure']

# Show the selected_features list
x_train[selected_features].head()

In [None]:
# Create new datasets with just the selected features
# to make the machine learning process more efficient
x_train_new = x_train[selected_features]
x_test_new = x_test[selected_features]

In [None]:
# Create a scaler instance
scaler = MinMaxScaler()

# Fit the scaler on the train set only, then transform both sets
x_train_scaled = scaler.fit_transform(x_train_new)
x_test_scaled = scaler.transform(x_test_new)

# Show the x_train_scaled for the first 5 rows
x_train_scaled[:5]
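
Because the scaler learns its minimum and maximum from the training set only, scaled test values are not guaranteed to stay within [0, 1]. A tiny worked example with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[10.0], [20.0], [30.0]])
test_col = np.array([[5.0], [25.0], [40.0]])

# The scaler learns min=10 and max=30 from the training column only.
mm = MinMaxScaler().fit(train_col)

train_scaled = mm.transform(train_col).ravel()
test_scaled = mm.transform(test_col).ravel()  # may fall outside [0, 1]
print(train_scaled)
print(test_scaled)
```

Here each value is mapped with (x - 10) / (30 - 10), so the test values 5 and 40 land at -0.25 and 1.5.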

# Modelling

The next part is modelling. I try several methods to find the most accurate classifier.

## 1. Histogram Gradient Boosting

In [None]:
# Create the model with Gradient Boosting Method
model = HistGradientBoostingClassifier()
model.fit(x_train_scaled, y_train)

In [None]:
# Create the Prediction
y_pred_boost = model.predict(x_test_scaled)

In [None]:
# Show the accuracy score and the classification report of Gradient Boosting Method
print("Accuracy:", accuracy_score(y_test, y_pred_boost))
print(classification_report(y_test, y_pred_boost))

## 2. Logistic Regression

In [None]:
# Create the model with Logistic Regression Method
model = LogisticRegression(max_iter=1000)
model.fit(x_train_scaled, y_train)

In [None]:
# Create the Prediction
y_pred_log = model.predict(x_test_scaled)

# Show the accuracy score and the classification report of Logistic Regression Method
print("Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

## 3. K-Nearest Neighbor

In [None]:
# Create the model using KNN Method
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train_scaled, y_train)

In [None]:
# Create the Prediction
y_pred_knn = knn.predict(x_test_scaled)

# Show the accuracy score and the classification report of KNN
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

## 4. Decision Tree

In [None]:
# Create the model using Decision Tree
tree = DecisionTreeClassifier()
tree.fit(x_train_scaled, y_train)

In [None]:
# Create the Prediction
y_pred_tree = tree.predict(x_test_scaled)

In [None]:
# Show the accuracy score and the classification report of Decision Tree
print("Accuracy:", accuracy_score(y_test, y_pred_tree))
print(classification_report(y_test, y_pred_tree))

## 5. Naive-Bayes

In [None]:
# Create the model using Naive-Bayes Method
nb = GaussianNB()
nb.fit(x_train_scaled, y_train)

In [None]:
# Create the Prediction
y_pred_nb = nb.predict(x_test_scaled)

In [None]:
# Show the accuracy score and the classification report of Naive-Bayes
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))

# Prediction Result

In [None]:
traffic_pred = pd.Series(y_pred_boost)

traffic_pred.shape

In [None]:
# Create the table of the value counts
traffic_count = traffic_pred.value_counts().to_frame()
print(traffic_count)
print("-----------")
print("Legend")
print("0 = High congestion")
print("1 = Low Congestion")
print("2 = Normal Congestion")

From the result above, we can see that, based on the most accurate model, most of the roads travelled by the Opel and the Peugeot are in the low congestion condition.
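
A hard-coded legend can drift out of sync with the encoder, so the integer predictions can instead be decoded back to label names with `inverse_transform`. A sketch with placeholder labels and invented predictions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical label names; the dataset's actual strings may differ.
le = LabelEncoder().fit(["High", "Low", "Normal"])

# Stand-in for a model's integer predictions.
toy_pred = np.array([1, 1, 2, 0, 1])

# Decode the integers back to labels before counting them.
decoded = pd.Series(le.inverse_transform(toy_pred))
decoded_counts = decoded.value_counts()
print(decoded_counts)
```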