# Ensemble Methods Notebook
Welcome to the weekly project on Ensemble Methods. You will be working with a dataset of traffic jams.

## Dataset
The dataset that will be used in this task is `Traffic_Jam.csv`

## Instructions
- Follow the steps outlined below.
- Write your code in the empty code cells.
- Comment on your code to explain your reasoning.

## Dataset Overview
This dataset contains traffic data including various counts of vehicle types across different times and days. Below are samples of these columns:

* `Time`: The timestamp of the traffic count (in intervals).
* `Date`: The day of the month the data was recorded.
* `Day of the Week`: The day of the week for the recorded data.
* `CarCount`: The number of cars counted during the time interval.
* `BikeCount`: The number of bikes counted during the time interval.
* `BusCount`: The number of buses counted during the time interval.
* `TruckCount`: The number of trucks counted during the time interval.
* `Total`: Total vehicles counted during the time interval.
* `Traffic Situation`: Qualitative assessment of the traffic (e.g., normal, congested).

## Goal
The primary goal of this exam is to develop a predictive model capable of determining the `Traffic Situation` based on your choice of features provided in the dataset. Students are expected to apply ensemble methods to build and evaluate their models.

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


# Load the dataset


In [72]:
df= pd.read_csv("Traffic_Jams.csv")

# Exploratory Data Analysis (EDA)

Below are some steps and visualizations to perform EDA on the dataset:

1. **Summary Statistics**: Obtain summary statistics for the dataset to understand the central tendencies and dispersion of numerical features.describe()

2. **Distribution of the Target Variable**: Analyze the distribution of the target variable `Traffic Situation` to understand the class balance.

3. **Correlation Analysis**: Analyze correlations between features.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6324 entries, 0 to 6323
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Time               6324 non-null   object
 1   Date               6324 non-null   int64 
 2   Day of the week    6324 non-null   object
 3   CarCount           6324 non-null   int64 
 4   BikeCount          6324 non-null   int64 
 5   BusCount           6324 non-null   int64 
 6   TruckCount         6324 non-null   int64 
 7   Total              6324 non-null   int64 
 8   Traffic Situation  6324 non-null   object
dtypes: int64(6), object(3)
memory usage: 444.8+ KB


In [4]:
df.shape

(6324, 9)

In [5]:
df.head()

Unnamed: 0,Time,Date,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total,Traffic Situation
0,12:00:00 AM,10,Tuesday,13,2,2,24,41,normal
1,12:15:00 AM,10,Tuesday,14,1,1,36,52,normal
2,12:30:00 AM,10,Tuesday,10,2,2,32,46,normal
3,12:45:00 AM,10,Tuesday,10,2,2,36,50,normal
4,1:00:00 AM,10,Tuesday,11,2,1,34,48,normal


In [6]:
df.tail()

Unnamed: 0,Time,Date,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total,Traffic Situation
6319,10:30:00 AM,9,Thursday,26,16,13,16,71,normal
6320,8:00:00 PM,9,Thursday,72,25,10,27,134,high
6321,9:00:00 PM,9,Thursday,107,13,14,28,162,high
6322,9:30:00 PM,9,Thursday,106,18,13,27,164,high
6323,11:45:00 PM,9,Thursday,14,3,1,15,33,normal


In [7]:
df.sample()

Unnamed: 0,Time,Date,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total,Traffic Situation
865,12:15:00 AM,19,Thursday,8,0,2,34,44,normal


# Preprocess the data (if necessary)

Before building models, it's crucial to preprocess the data to ensure it's clean and suitable for training. Follow these steps to prepare the dataset:

1. **Check for Missing Values**: Determine if there are any missing values in the dataset and handle them appropriately. You can choose to fill them with a mean, median, or mode value, or drop rows with missing values if necessary.

2. **Encode Categorical Variables**: Convert categorical variables into numerical representations. This can be done using techniques such as one-hot encoding and lable-encoder.

3. **Feature Scaling**: Standardize or Normalize numerical features if needed to have a consistent scale.

4. **Remove Unnecessary Columns**: Drop any columns that are not relevant for modeling.

In [8]:
df.isnull().sum()

Time                 0
Date                 0
Day of the week      0
CarCount             0
BikeCount            0
BusCount             0
TruckCount           0
Total                0
Traffic Situation    0
dtype: int64

In [31]:
# there is no nulls values

In [32]:
pip install category_encoders


Collecting category_encoders
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting statsmodels>=0.9.0 (from category_encoders)
  Downloading statsmodels-0.14.2-cp311-cp311-win_amd64.whl.metadata (9.5 kB)
Collecting patsy>=0.5.1 (from category_encoders)
  Downloading patsy-0.5.6-py2.py3-none-any.whl.metadata (3.5 kB)
Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
   ---------------------------------------- 0.0/81.9 kB ? eta -:--:--
   ----- ---------------------------------- 10.2/81.9 kB ? eta -:--:--
   --------------- ------------------------ 30.7/81.9 kB 325.1 kB/s eta 0:00:01
   -------------------- ------------------- 41.0/81.9 kB 279.3 kB/s eta 0:00:01
   ----------------------------------- ---- 71.7/81.9 kB 435.7 kB/s eta 0:00:01
   ---------------------------------------- 81.9/81.9 kB 380.9 kB/s eta 0:00:00
Downloading patsy-0.5.6-py2.py3-none-any.whl (233 kB)
   ---------------------------------------- 0.0/233.9 kB ? eta -:--:


[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [57]:
import category_encoders as ce
encoder=ce.OneHotEncoder(cols='Traffic Situation',handle_unknown='return_nan',return_df=True,use_cat_names=True)

In [73]:
data_encoded = encoder.fit_transform(df)
data_encoded

Unnamed: 0,Time,Date,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total,Traffic Situation_normal,Traffic Situation_low,Traffic Situation_heavy,Traffic Situation_high
0,12:00:00 AM,10,Tuesday,13,2,2,24,41,1.0,0.0,0.0,0.0
1,12:15:00 AM,10,Tuesday,14,1,1,36,52,1.0,0.0,0.0,0.0
2,12:30:00 AM,10,Tuesday,10,2,2,32,46,1.0,0.0,0.0,0.0
3,12:45:00 AM,10,Tuesday,10,2,2,36,50,1.0,0.0,0.0,0.0
4,1:00:00 AM,10,Tuesday,11,2,1,34,48,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
6319,10:30:00 AM,9,Thursday,26,16,13,16,71,1.0,0.0,0.0,0.0
6320,8:00:00 PM,9,Thursday,72,25,10,27,134,0.0,0.0,0.0,1.0
6321,9:00:00 PM,9,Thursday,107,13,14,28,162,0.0,0.0,0.0,1.0
6322,9:30:00 PM,9,Thursday,106,18,13,27,164,0.0,0.0,0.0,1.0


In [59]:
df_encoded

Unnamed: 0,Time,CarCount,BikeCount,BusCount,TruckCount,Total,Date_2,Date_3,Date_4,Date_5,...,Date_31,Day of the week_Monday,Day of the week_Saturday,Day of the week_Sunday,Day of the week_Thursday,Day of the week_Tuesday,Day of the week_Wednesday,Traffic Situation_high,Traffic Situation_low,Traffic Situation_normal
0,12:00:00 AM,13,2,2,24,41,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,12:15:00 AM,14,1,1,36,52,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,12:30:00 AM,10,2,2,32,46,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,12:45:00 AM,10,2,2,36,50,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,1:00:00 AM,11,2,1,34,48,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6319,10:30:00 AM,26,16,13,16,71,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
6320,8:00:00 PM,72,25,10,27,134,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
6321,9:00:00 PM,107,13,14,28,162,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
6322,9:30:00 PM,106,18,13,27,164,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [55]:
df

Unnamed: 0,Date,CarCount,BikeCount,BusCount,TruckCount,Total,Time_10:00:00 PM,Time_10:15:00 AM,Time_10:15:00 PM,Time_10:30:00 AM,...,Time_9:45:00 PM,Day of the week_Monday,Day of the week_Saturday,Day of the week_Sunday,Day of the week_Thursday,Day of the week_Tuesday,Day of the week_Wednesday,Traffic Situation_high,Traffic Situation_low,Traffic Situation_normal
0,10,13,2,2,24,41,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
1,10,14,1,1,36,52,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
2,10,10,2,2,32,46,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
3,10,10,2,2,36,50,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
4,10,11,2,1,34,48,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6319,9,26,16,13,16,71,False,False,False,True,...,False,False,False,False,True,False,False,False,False,True
6320,9,72,25,10,27,134,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False
6321,9,107,13,14,28,162,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False
6322,9,106,18,13,27,164,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False


In [76]:
# Handle missing values
d = df.dropna()

# Encode categorical variables
df = pd.get_dummies(df, drop_first=True)



# Visualize the Data

Visualizing the data helps in understanding the relationships between features and the target variable. Below are some common visualizations that can be used to gain insights into the dataset:

1. **Count Plots for Categorical Features**: Use count plots to visualize the frequency of categorical features such as the `Traffic Situation`.

2. **Correlation Heatmap**: Create a heatmap to visualize the correlation between numerical features and identify any strong relationships.

In [41]:
from sklearn.model_selection import train_test_split

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Split the Dataset

1. **Define Features and Target**: Separate the dataset into features (`X`) and the target variable (`y`).

2. **Train-Test Split**: Use the `train_test_split` function from `sklearn.model_selection` to split the data.

In [74]:
X = df.drop(columns='Traffic Situation') 
y = df['Traffic Situation'] 

# Initialize and Train the Classifiers

## Bagging
Chose the bagging model to go with and initialize and train a the model.

In [78]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [91]:
X_combined = pd.concat([X_train, X_test], axis=0)
X_encoded_combined = pd.get_dummies(X_combined, drop_first=True)

In [None]:

X_train_encoded = X_encoded_combined.iloc[:len(X_train)]
X_test_encoded = X_encoded_combined.iloc[len(X_train):]

In [92]:

base_estimator = KNeighborsClassifier()
bagging_classifier = BaggingClassifier(base_estimator, n_estimators=50, random_state=42)


In [79]:
df

Unnamed: 0,Date,CarCount,BikeCount,BusCount,TruckCount,Total,Time_10:00:00 PM,Time_10:15:00 AM,Time_10:15:00 PM,Time_10:30:00 AM,...,Time_9:45:00 PM,Day of the week_Monday,Day of the week_Saturday,Day of the week_Sunday,Day of the week_Thursday,Day of the week_Tuesday,Day of the week_Wednesday,Traffic Situation_high,Traffic Situation_low,Traffic Situation_normal
0,10,13,2,2,24,41,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
1,10,14,1,1,36,52,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
2,10,10,2,2,32,46,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
3,10,10,2,2,36,50,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
4,10,11,2,1,34,48,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6319,9,26,16,13,16,71,False,False,False,True,...,False,False,False,False,True,False,False,False,False,True
6320,9,72,25,10,27,134,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False
6321,9,107,13,14,28,162,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False
6322,9,106,18,13,27,164,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False


In [89]:
X_train

Unnamed: 0,Time,Date,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total
1170,4:30:00 AM,22,Sunday,72,8,4,18,102
5437,3:15:00 PM,4,Saturday,120,30,28,15,193
5458,8:30:00 PM,4,Saturday,106,12,16,9,143
2948,5:00:00 PM,9,Thursday,132,16,13,11,172
1835,2:45:00 AM,29,Sunday,15,2,1,37,55
...,...,...,...,...,...,...,...,...
3772,7:00:00 AM,18,Wednesday,141,15,21,5,182
5191,1:45:00 AM,2,Thursday,11,0,0,16,27
5226,10:30:00 AM,2,Thursday,77,23,14,20,134
5390,3:30:00 AM,4,Saturday,19,5,1,38,63


In [93]:
bagging_classifier.fit(X_train_encoded, y_train)

In [94]:
predictions = bagging_classifier.predict(X_test_encoded)

### Evaluate the model performance

In [95]:
accuracy = accuracy_score(y_test, predictions)
print(f'Bagging Classifier Model Accuracy: {accuracy * 100:.2f}%')

Bagging Classifier Model Accuracy: 88.46%


## Boosting
Chose the Boosting model to go with and initialize and train a the model.

In [131]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score


In [132]:

gradient_boosting_classifier = GradientBoostingClassifier(n_estimators=100, random_state=42)


In [133]:

gradient_boosting_classifier.fit(X_train_encoded, y_train)


In [135]:

predictions = gradient_boosting_classifier.predict(X_test_encoded)


In [136]:

accuracy = accuracy_score(y_test, predictions)
print(f'Gradient Boosting Classifier Model Accuracy: {accuracy * 100:.2f}%')

Gradient Boosting Classifier Model Accuracy: 91.94%


### Evaluate the model performance

In [104]:
accuracy = accuracy_score(y_test, predictions)
print(f' Accuracy: {accuracy * 100:.2f}%')

 Accuracy: 88.46%


## Stacking Classifier
Combine the previous classifiers as the base models using a Stacking Classifier.

In [108]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

In [140]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


In [138]:

base_classifiers = [
    ('decision_tree', DecisionTreeClassifier(max_depth=3)),
    ('svm', SVC(probability=True))
]


In [139]:

meta_classifier = LogisticRegression()


In [142]:

stacking_classifier = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=meta_classifier,
    cv=5
)


In [143]:

stacking_classifier.fit(X_train_encoded, y_train)

In [144]:

predictions = stacking_classifier.predict(X_test_encoded)

In [145]:

accuracy = accuracy_score(y_test, predictions)
print(f'Stacking Classifier Model Accuracy: {accuracy * 100:.2f}%')

Stacking Classifier Model Accuracy: 90.04%


### Define meta-learner (LogisticRegression)

In [118]:
from sklearn.linear_model import LogisticRegression

In [121]:
meta = LogisticRegression()

In [122]:
meta

### Initialize and Train the Stacking Classifier

Stacking combines multiple models (base learners) using a meta-learner. The meta-learner is trained on the predictions of the base learners to make the final prediction.

In [149]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

     

In [150]:

stacking_classifier.fit(X_train_encoded, y_train)


In [152]:

predictions = stacking_classifier.predict(X_test_encoded)

In [153]:

accuracy = accuracy_score(y_test, predictions)
print(f'Stacking Classifier Model Accuracy: {accuracy * 100:.2f}%')


Stacking Classifier Model Accuracy: 90.04%


### Evaluate the model performance

In [155]:
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy * 100:.2f}%')

Accuracy: 90.04%


# Notebook Questions:

After completing the tasks in this notebook, take some time to reflect on the work you have done and answer the following questions. These questions are designed to help you think critically about the steps you took and the decisions you made.

* **Feature Selection and Engineering**
   - Which features did you find most important for predicting churn, and why do you think they are significant?
   - Did you perform any feature engineering? If so, what new features did you create, and how did they improve the model performance?

* **Model Selection**
   - Why did you choose the specific ensemble methods you implemented? What are the advantages of using ensemble methods over single models?
   - Compare the performance of different models you used. Which model performed the best, and what do you think contributed to its success?

* **Model Evaluation**
   - Which evaluation metrics did you use to assess the model performance, and why? What insights did these metrics provide about the models' strengths and weaknesses?


# Answer here: