<a href="https://colab.research.google.com/github/i1nourax/i1nourax/blob/main/Copy_of_Ensemble_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ensemble Methods Notebook
Welcome to the weekly project on Ensemble Methods. You will be working with a dataset of traffic jams.

## Dataset
The dataset used in this task is `Traffic_Jams.csv`.

## Instructions
- Follow the steps outlined below.
- Write your code in the empty code cells.
- Comment on your code to explain your reasoning.

## Dataset Overview
This dataset contains traffic data: counts of each vehicle type recorded at different times and on different days. The columns are:

* `Time`: The timestamp of the traffic count (15-minute intervals).
* `Date`: The day of the month the data was recorded.
* `Day of the week`: The day of the week for the recorded data.
* `CarCount`: The number of cars counted during the time interval.
* `BikeCount`: The number of bikes counted during the time interval.
* `BusCount`: The number of buses counted during the time interval.
* `TruckCount`: The number of trucks counted during the time interval.
* `Total`: Total vehicles counted during the time interval.
* `Traffic Situation`: Qualitative assessment of the traffic (one of `low`, `normal`, `high`, `heavy`).

## Goal
The primary goal of this exam is to develop a predictive model capable of determining the `Traffic Situation` based on your choice of features provided in the dataset. Students are expected to apply ensemble methods to build and evaluate their models.

# Import Libraries

In [161]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset


In [162]:
df1 = pd.read_csv('/content/Traffic_Jams.csv')

# Exploratory Data Analysis (EDA)

Below are some steps and visualizations to perform EDA on the dataset:

1. **Summary Statistics**: Obtain summary statistics for the dataset (for example with `describe()`) to understand the central tendency and dispersion of the numerical features.

2. **Distribution of the Target Variable**: Analyze the distribution of the target variable `Traffic Situation` to understand the class balance.

3. **Correlation Analysis**: Analyze correlations between features.
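
The three EDA steps above can be sketched as follows. This is a minimal illustration, not the notebook's own solution: `make_toy_traffic` is a hypothetical helper that generates synthetic rows with the same column names as the dataset, and its label thresholds are invented. With the real data, replace `df` with `df1`.

```python
import numpy as np
import pandas as pd


def make_toy_traffic(n=400, seed=0):
    """Synthetic stand-in that only mimics the notebook's column names."""
    rng = np.random.default_rng(seed)
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    toy = pd.DataFrame({
        "Day of the week": rng.choice(days, size=n),
        "CarCount": rng.integers(5, 181, size=n),
        "BikeCount": rng.integers(0, 71, size=n),
        "BusCount": rng.integers(0, 51, size=n),
        "TruckCount": rng.integers(0, 61, size=n),
    })
    toy["Total"] = toy[["CarCount", "BikeCount", "BusCount", "TruckCount"]].sum(axis=1)
    # Hypothetical thresholds, chosen only so that all four labels occur.
    toy["Traffic Situation"] = pd.cut(
        toy["Total"], bins=[-1, 100, 180, 240, 1000],
        labels=["low", "normal", "high", "heavy"],
    ).astype(str)
    return toy


df = make_toy_traffic()

# 1. Summary statistics for the numeric columns.
summary = df.describe()

# 2. Class balance of the target, as fractions that sum to 1.
class_share = df["Traffic Situation"].value_counts(normalize=True)

# 3. Pairwise Pearson correlation between the numeric features.
corr = df.select_dtypes(include="number").corr()

print(summary.loc[["mean", "std"]].round(2))
print(class_share.round(3))
print(corr["Total"].sort_values(ascending=False).round(2))
```

Looking at `class_share` early is useful because an imbalanced target affects both how the data should be split and which metrics are informative.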

In [163]:
df1.describe()

Unnamed: 0,Date,CarCount,BikeCount,BusCount,TruckCount,Total
count,6324.0,6324.0,6324.0,6324.0,6324.0,6324.0
mean,16.043327,64.355629,12.013283,12.557875,18.658128,107.584915
std,8.956907,44.307088,11.363955,12.319831,10.724822,55.850784
min,1.0,5.0,0.0,0.0,0.0,21.0
25%,8.0,18.0,3.0,1.0,10.0,53.0
50%,16.0,61.0,9.0,10.0,18.0,103.0
75%,24.0,101.25,19.0,20.0,27.0,151.0
max,31.0,180.0,70.0,50.0,60.0,279.0


In [164]:
df1.head()

Unnamed: 0,Time,Date,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total,Traffic Situation
0,12:00:00 AM,10,Tuesday,13,2,2,24,41,normal
1,12:15:00 AM,10,Tuesday,14,1,1,36,52,normal
2,12:30:00 AM,10,Tuesday,10,2,2,32,46,normal
3,12:45:00 AM,10,Tuesday,10,2,2,36,50,normal
4,1:00:00 AM,10,Tuesday,11,2,1,34,48,normal


In [165]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6324 entries, 0 to 6323
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Time               6324 non-null   object
 1   Date               6324 non-null   int64 
 2   Day of the week    6324 non-null   object
 3   CarCount           6324 non-null   int64 
 4   BikeCount          6324 non-null   int64 
 5   BusCount           6324 non-null   int64 
 6   TruckCount         6324 non-null   int64 
 7   Total              6324 non-null   int64 
 8   Traffic Situation  6324 non-null   object
dtypes: int64(6), object(3)
memory usage: 444.8+ KB


In [166]:
df1.isnull().sum()

Unnamed: 0,0
Time,0
Date,0
Day of the week,0
CarCount,0
BikeCount,0
BusCount,0
TruckCount,0
Total,0
Traffic Situation,0


In [167]:
df1.duplicated().sum()

0

In [168]:
df1.nunique()

Unnamed: 0,0
Time,96
Date,31
Day of the week,7
CarCount,173
BikeCount,71
BusCount,51
TruckCount,59
Total,239
Traffic Situation,4


In [169]:
df1.shape

(6324, 9)

In [170]:
df1.columns

Index(['Time', 'Date', 'Day of the week', 'CarCount', 'BikeCount', 'BusCount',
       'TruckCount', 'Total', 'Traffic Situation'],
      dtype='object')

In [171]:
df1.dtypes

Unnamed: 0,0
Time,object
Date,int64
Day of the week,object
CarCount,int64
BikeCount,int64
BusCount,int64
TruckCount,int64
Total,int64
Traffic Situation,object


# Preprocess the data (if necessary)

Before building models, it's crucial to preprocess the data to ensure it's clean and suitable for training. Follow these steps to prepare the dataset:

1. **Check for Missing Values**: Determine if there are any missing values in the dataset and handle them appropriately. You can choose to fill them with a mean, median, or mode value, or drop rows with missing values if necessary.

2. **Encode Categorical Variables**: Convert categorical variables into numerical representations, for example with one-hot encoding or label encoding.

3. **Feature Scaling**: Standardize or Normalize numerical features if needed to have a consistent scale.

4. **Remove Unnecessary Columns**: Drop any columns that are not relevant for modeling.
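
One possible way to carry out steps 1 to 3 is sketched here, assuming scikit-learn's `ColumnTransformer`, `OneHotEncoder`, `StandardScaler`, and `LabelEncoder`. The data comes from `make_toy_traffic`, a hypothetical generator of synthetic rows with the dataset's column names (its label thresholds are invented); it is not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler


def make_toy_traffic(n=400, seed=0):
    """Synthetic stand-in that only mimics the notebook's column names."""
    rng = np.random.default_rng(seed)
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    toy = pd.DataFrame({
        "Day of the week": rng.choice(days, size=n),
        "CarCount": rng.integers(5, 181, size=n),
        "BikeCount": rng.integers(0, 71, size=n),
        "BusCount": rng.integers(0, 51, size=n),
        "TruckCount": rng.integers(0, 61, size=n),
    })
    toy["Total"] = toy[["CarCount", "BikeCount", "BusCount", "TruckCount"]].sum(axis=1)
    # Hypothetical thresholds, chosen only so that all four labels occur.
    toy["Traffic Situation"] = pd.cut(
        toy["Total"], bins=[-1, 100, 180, 240, 1000],
        labels=["low", "normal", "high", "heavy"],
    ).astype(str)
    return toy


df = make_toy_traffic()
X = df.drop(columns="Traffic Situation")
y = df["Traffic Situation"]

# 1. Count missing values across the whole frame.
n_missing = int(df.isnull().sum().sum())

num_cols = ["CarCount", "BikeCount", "BusCount", "TruckCount", "Total"]
cat_cols = ["Day of the week"]

# 2 + 3. Scale the numeric columns and one-hot encode the weekday.
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X_enc = preprocess.fit_transform(X)

# Encode the string target as integers (classes are sorted alphabetically).
label_encoder = LabelEncoder()
y_enc = label_encoder.fit_transform(y)

print("missing values:", n_missing)
print("encoded feature matrix:", X_enc.shape)
print("target classes:", list(label_encoder.classes_))
```

In a full workflow the transformer would normally be fitted on the training split only, so that test-set statistics do not leak into the scaling. Tree-based ensembles do not require scaling at all; it matters mainly for the logistic-regression meta-learner.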

In [172]:
# Drop the timestamp and day-of-month columns so they are not used as features
df1.drop(['Time', 'Date'], axis=1, inplace=True)

In [173]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [174]:
# Separate the features (X) from the target (y)
X = df1.drop('Traffic Situation', axis=1)
y = df1['Traffic Situation']

In [175]:
X.head()

Unnamed: 0,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total
0,Tuesday,13,2,2,24,41
1,Tuesday,14,1,1,36,52
2,Tuesday,10,2,2,32,46
3,Tuesday,10,2,2,36,50
4,Tuesday,11,2,1,34,48


In [176]:
y

Unnamed: 0,Traffic Situation
0,normal
1,normal
2,normal
3,normal
4,normal
...,...
6319,normal
6320,high
6321,high
6322,high


# Visualize the Data

Visualizing the data helps in understanding the relationships between features and the target variable. Below are some common visualizations that can be used to gain insights into the dataset:

1. **Count Plots for Categorical Features**: Use count plots to visualize the frequency of categorical features such as the `Traffic Situation`.

2. **Correlation Heatmap**: Create a heatmap to visualize the correlation between numerical features and identify any strong relationships.
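
Both plots can be drawn with seaborn roughly as follows. This is a sketch on synthetic data: `make_toy_traffic` is a hypothetical helper with invented label thresholds, and the figure size, colour map, and titles are arbitrary choices. With the real data, replace `df` with `df1`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def make_toy_traffic(n=400, seed=0):
    """Synthetic stand-in that only mimics the notebook's column names."""
    rng = np.random.default_rng(seed)
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    toy = pd.DataFrame({
        "Day of the week": rng.choice(days, size=n),
        "CarCount": rng.integers(5, 181, size=n),
        "BikeCount": rng.integers(0, 71, size=n),
        "BusCount": rng.integers(0, 51, size=n),
        "TruckCount": rng.integers(0, 61, size=n),
    })
    toy["Total"] = toy[["CarCount", "BikeCount", "BusCount", "TruckCount"]].sum(axis=1)
    # Hypothetical thresholds, chosen only so that all four labels occur.
    toy["Traffic Situation"] = pd.cut(
        toy["Total"], bins=[-1, 100, 180, 240, 1000],
        labels=["low", "normal", "high", "heavy"],
    ).astype(str)
    return toy


df = make_toy_traffic()
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# 1. Count plot of the target, with bars ordered from most to least frequent.
order = df["Traffic Situation"].value_counts().index
sns.countplot(data=df, x="Traffic Situation", order=order, ax=axes[0])
axes[0].set_title("Traffic Situation counts")

# 2. Heatmap of the correlations between the numeric columns.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=axes[1])
axes[1].set_title("Correlation between numeric features")

fig.tight_layout()
```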

In [177]:
import matplotlib.pyplot as plt
import seaborn as sns

In [178]:
df1['Traffic Situation'].value_counts()

Unnamed: 0_level_0,count
Traffic Situation,Unnamed: 1_level_1
normal,3858
heavy,1137
low,834
high,495


In [179]:
df1.columns

Index(['Day of the week', 'CarCount', 'BikeCount', 'BusCount', 'TruckCount',
       'Total', 'Traffic Situation'],
      dtype='object')

# Split the Dataset

1. **Define Features and Target**: Separate the dataset into features (`X`) and the target variable (`y`).

2. **Train-Test Split**: Use the `train_test_split` function from `sklearn.model_selection` to split the data.
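
Because the target classes are imbalanced, one option worth considering is a stratified split, which keeps the class proportions similar in the train and test sets. The sketch below adds `stratify=y`, which the notebook's own split does not use, and runs on synthetic rows from the hypothetical `make_toy_traffic` helper.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def make_toy_traffic(n=400, seed=0):
    """Synthetic stand-in that only mimics the notebook's column names."""
    rng = np.random.default_rng(seed)
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    toy = pd.DataFrame({
        "Day of the week": rng.choice(days, size=n),
        "CarCount": rng.integers(5, 181, size=n),
        "BikeCount": rng.integers(0, 71, size=n),
        "BusCount": rng.integers(0, 51, size=n),
        "TruckCount": rng.integers(0, 61, size=n),
    })
    toy["Total"] = toy[["CarCount", "BikeCount", "BusCount", "TruckCount"]].sum(axis=1)
    # Hypothetical thresholds, chosen only so that all four labels occur.
    toy["Traffic Situation"] = pd.cut(
        toy["Total"], bins=[-1, 100, 180, 240, 1000],
        labels=["low", "normal", "high", "heavy"],
    ).astype(str)
    return toy


df = make_toy_traffic()

# 1. Define features and target.
X = df.drop(columns="Traffic Situation")
y = df["Traffic Situation"]

# 2. Hold out 20% for testing; stratify=y preserves the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare the class shares in both splits.
train_share = y_train.value_counts(normalize=True).sort_index()
test_share = y_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}).round(3))
```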

In [180]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [181]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [182]:
X_train.shape

(5059, 6)

In [183]:
X_test.shape

(1265, 6)

In [184]:
X_train.head()

Unnamed: 0,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total
1170,Sunday,72,8,4,18,102
5437,Saturday,120,30,28,15,193
5458,Saturday,106,12,16,9,143
2948,Thursday,132,16,13,11,172
1835,Sunday,15,2,1,37,55


In [185]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [186]:
train_data, test_data = train_test_split(df1, test_size=0.2, random_state=42)

In [187]:
X_train.shape

(5059, 6)

# Initialize and Train the Classifiers

## Bagging
Choose a bagging model, then initialize and train it.
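
A minimal sketch of one way to do this, assuming a `BaggingClassifier` over decision trees wrapped in a `Pipeline` that one-hot encodes `Day of the week` first (the trees cannot consume the weekday strings directly). The base estimator is passed positionally so the call does not depend on whether the installed scikit-learn names that parameter `base_estimator` or `estimator`. The data is synthetic, from the hypothetical `make_toy_traffic` helper, so the printed accuracy says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier


def make_toy_traffic(n=400, seed=0):
    """Synthetic stand-in that only mimics the notebook's column names."""
    rng = np.random.default_rng(seed)
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    toy = pd.DataFrame({
        "Day of the week": rng.choice(days, size=n),
        "CarCount": rng.integers(5, 181, size=n),
        "BikeCount": rng.integers(0, 71, size=n),
        "BusCount": rng.integers(0, 51, size=n),
        "TruckCount": rng.integers(0, 61, size=n),
    })
    toy["Total"] = toy[["CarCount", "BikeCount", "BusCount", "TruckCount"]].sum(axis=1)
    # Hypothetical thresholds, chosen only so that all four labels occur.
    toy["Traffic Situation"] = pd.cut(
        toy["Total"], bins=[-1, 100, 180, 240, 1000],
        labels=["low", "normal", "high", "heavy"],
    ).astype(str)
    return toy


df = make_toy_traffic()
X = df.drop(columns="Traffic Situation")
y = df["Traffic Situation"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One-hot encode the weekday; pass the numeric counts through unchanged.
encode = ColumnTransformer(
    [("day", OneHotEncoder(handle_unknown="ignore"), ["Day of the week"])],
    remainder="passthrough",
    sparse_threshold=0.0,
)

# Ten decision trees, each trained on a bootstrap sample of the training set.
bagging = Pipeline([
    ("encode", encode),
    ("model", BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=10, random_state=42)),
])
bagging.fit(X_train, y_train)

bagging_pred = bagging.predict(X_test)
bagging_acc = accuracy_score(y_test, bagging_pred)
print(f"Bagging accuracy on the toy data: {bagging_acc:.3f}")
```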

In [188]:
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [189]:
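# Bagging over decision trees. Note: scikit-learn 1.2 renamed `base_estimator` to
# `estimator`, and 1.4 removed the old name; use `estimator=` on those versions.
clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42)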

In [190]:
print(clf)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), random_state=42)


In [192]:
target_outdated = 'Traffic Situation'

### Evaluate the model performance
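
With an imbalanced target, accuracy alone can hide poor performance on the rarer classes, so a per-class report and a confusion matrix are a reasonable complement. The sketch below is illustrative only: it trains a bagging model on the numeric columns of synthetic data from the hypothetical `make_toy_traffic` helper, and its scores do not reflect the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def make_toy_traffic(n=400, seed=0):
    """Synthetic stand-in that only mimics the notebook's column names."""
    rng = np.random.default_rng(seed)
    toy = pd.DataFrame({
        "CarCount": rng.integers(5, 181, size=n),
        "BikeCount": rng.integers(0, 71, size=n),
        "BusCount": rng.integers(0, 51, size=n),
        "TruckCount": rng.integers(0, 61, size=n),
    })
    toy["Total"] = toy.sum(axis=1)
    # Hypothetical thresholds, chosen only so that all four labels occur.
    toy["Traffic Situation"] = pd.cut(
        toy["Total"], bins=[-1, 100, 180, 240, 1000],
        labels=["low", "normal", "high", "heavy"],
    ).astype(str)
    return toy


df = make_toy_traffic()
X = df.drop(columns="Traffic Situation")
y = df["Traffic Situation"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Overall accuracy: the fraction of test rows predicted correctly.
accuracy = accuracy_score(y_test, pred)

# Per-class precision, recall, and F1.
report = classification_report(y_test, pred, zero_division=0)

# Confusion matrix with a fixed label order: rows are true, columns predicted.
labels = sorted(y.unique())
cm = confusion_matrix(y_test, pred, labels=labels)
cm_df = pd.DataFrame(cm,
                     index=[f"true {name}" for name in labels],
                     columns=[f"pred {name}" for name in labels])

print(f"accuracy: {accuracy:.3f}")
print(report)
print(cm_df)
```

The same three calls apply unchanged to the boosting and stacking models, which makes the results directly comparable.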

## Boosting
Choose a boosting model, then initialize and train it.
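
As one possible choice, the sketch below uses scikit-learn's `GradientBoostingClassifier`; `AdaBoostClassifier` or another boosting library would fit the task equally well. The hyperparameters are library defaults written out explicitly rather than tuned values, and the data is synthetic, from the hypothetical `make_toy_traffic` helper.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_toy_traffic(n=400, seed=0):
    """Synthetic stand-in that only mimics the notebook's column names."""
    rng = np.random.default_rng(seed)
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    toy = pd.DataFrame({
        "Day of the week": rng.choice(days, size=n),
        "CarCount": rng.integers(5, 181, size=n),
        "BikeCount": rng.integers(0, 71, size=n),
        "BusCount": rng.integers(0, 51, size=n),
        "TruckCount": rng.integers(0, 61, size=n),
    })
    toy["Total"] = toy[["CarCount", "BikeCount", "BusCount", "TruckCount"]].sum(axis=1)
    # Hypothetical thresholds, chosen only so that all four labels occur.
    toy["Traffic Situation"] = pd.cut(
        toy["Total"], bins=[-1, 100, 180, 240, 1000],
        labels=["low", "normal", "high", "heavy"],
    ).astype(str)
    return toy


df = make_toy_traffic()
X = df.drop(columns="Traffic Situation")
y = df["Traffic Situation"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One-hot encode the weekday; pass the numeric counts through unchanged.
encode = ColumnTransformer(
    [("day", OneHotEncoder(handle_unknown="ignore"), ["Day of the week"])],
    remainder="passthrough",
    sparse_threshold=0.0,
)

# Trees are added one stage at a time, each fitted to the errors of the
# ensemble built so far; learning_rate shrinks every stage's contribution.
boosting = Pipeline([
    ("encode", encode),
    ("model", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                         random_state=42)),
])
boosting.fit(X_train, y_train)

boosting_pred = boosting.predict(X_test)
boosting_acc = accuracy_score(y_test, boosting_pred)
print(f"Boosting accuracy on the toy data: {boosting_acc:.3f}")
```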

In [194]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Evaluate the model performance

In [195]:
X_train.shape

(5059, 6)

## Stacking Classifier
Combine the previous classifiers as the base models using a Stacking Classifier.

In [196]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [197]:
clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42)

### Define meta-learner (LogisticRegression)

In [198]:
df1.head()

Unnamed: 0,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total,Traffic Situation
0,Tuesday,13,2,2,24,41,normal
1,Tuesday,14,1,1,36,52,normal
2,Tuesday,10,2,2,32,46,normal
3,Tuesday,10,2,2,36,50,normal
4,Tuesday,11,2,1,34,48,normal


### Initialize and Train the Stacking Classifier

Stacking combines multiple models (base learners) using a meta-learner. The meta-learner is trained on the predictions of the base learners to make the final prediction.
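
A sketch of this arrangement, assuming scikit-learn's `StackingClassifier` with a bagging model and a gradient-boosting model as base learners and `LogisticRegression` as the meta-learner. The base-learner settings and `cv=5` are illustrative choices, and the data is synthetic, from the hypothetical `make_toy_traffic` helper.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier


def make_toy_traffic(n=400, seed=0):
    """Synthetic stand-in that only mimics the notebook's column names."""
    rng = np.random.default_rng(seed)
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    toy = pd.DataFrame({
        "Day of the week": rng.choice(days, size=n),
        "CarCount": rng.integers(5, 181, size=n),
        "BikeCount": rng.integers(0, 71, size=n),
        "BusCount": rng.integers(0, 51, size=n),
        "TruckCount": rng.integers(0, 61, size=n),
    })
    toy["Total"] = toy[["CarCount", "BikeCount", "BusCount", "TruckCount"]].sum(axis=1)
    # Hypothetical thresholds, chosen only so that all four labels occur.
    toy["Traffic Situation"] = pd.cut(
        toy["Total"], bins=[-1, 100, 180, 240, 1000],
        labels=["low", "normal", "high", "heavy"],
    ).astype(str)
    return toy


df = make_toy_traffic()
X = df.drop(columns="Traffic Situation")
y = df["Traffic Situation"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

encode = ColumnTransformer(
    [("day", OneHotEncoder(handle_unknown="ignore"), ["Day of the week"])],
    remainder="passthrough",
    sparse_threshold=0.0,
)

# Base learners: their cross-validated class probabilities become the
# input features of the meta-learner.
base_learners = [
    ("bagging", BaggingClassifier(DecisionTreeClassifier(),
                                  n_estimators=10, random_state=42)),
    ("boosting", GradientBoostingClassifier(n_estimators=50, random_state=42)),
]

stack = Pipeline([
    ("encode", encode),
    ("model", StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
        cv=5,
    )),
])
stack.fit(X_train, y_train)

stack_pred = stack.predict(X_test)
stack_acc = accuracy_score(y_test, stack_pred)
print(f"Stacking accuracy on the toy data: {stack_acc:.3f}")
```

Training the meta-learner on cross-validated predictions rather than on in-sample predictions keeps it from simply trusting whichever base learner overfits the training set most.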

In [199]:
clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42)

### Evaluate the model performance

In [201]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [202]:
X

Unnamed: 0,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total
0,Tuesday,13,2,2,24,41
1,Tuesday,14,1,1,36,52
2,Tuesday,10,2,2,32,46
3,Tuesday,10,2,2,36,50
4,Tuesday,11,2,1,34,48
...,...,...,...,...,...,...
6319,Thursday,26,16,13,16,71
6320,Thursday,72,25,10,27,134
6321,Thursday,107,13,14,28,162
6322,Thursday,106,18,13,27,164


In [203]:
y

Unnamed: 0,Traffic Situation
0,normal
1,normal
2,normal
3,normal
4,normal
...,...
6319,normal
6320,high
6321,high
6322,high


In [204]:
X_train.shape

(5059, 6)

In [205]:
X_test.shape

(1265, 6)

In [206]:
X_train.head()

Unnamed: 0,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total
1170,Sunday,72,8,4,18,102
5437,Saturday,120,30,28,15,193
5458,Saturday,106,12,16,9,143
2948,Thursday,132,16,13,11,172
1835,Sunday,15,2,1,37,55


In [207]:
X_test.head()

Unnamed: 0,Day of the week,CarCount,BikeCount,BusCount,TruckCount,Total
3090,Wednesday,109,25,3,8,145
198,Thursday,14,2,0,35,51
3934,Thursday,15,4,0,17,36
1611,Thursday,71,2,17,10,100
5435,Saturday,68,24,15,14,121


In [208]:
X_train.shape

(5059, 6)

# Notebook Questions:

After completing the tasks in this notebook, take some time to reflect on the work you have done and answer the following questions. These questions are designed to help you think critically about the steps you took and the decisions you made.

* **Feature Selection and Engineering**
   - Which features did you find most important for predicting the traffic situation, and why do you think they are significant?
   - Did you perform any feature engineering? If so, what new features did you create, and how did they improve the model performance?

* **Model Selection**
   - Why did you choose the specific ensemble methods you implemented? What are the advantages of using ensemble methods over single models?
   - Compare the performance of different models you used. Which model performed the best, and what do you think contributed to its success?

* **Model Evaluation**
   - Which evaluation metrics did you use to assess the model performance, and why? What insights did these metrics provide about the models' strengths and weaknesses?


# Answer here: