# Exploratory Data Analysis of Metro Interstate Traffic Volume

This notebook explores the Metro Interstate Traffic Volume dataset, which includes hourly traffic volume data along with weather and holiday information from 2012 to 2018 on Interstate 94 Westbound in Minneapolis-St Paul, MN. The goal is to perform exploratory data analysis to understand traffic patterns and their relationship with environmental conditions and holidays.


**Data Source:** https://archive.ics.uci.edu/dataset/492/metro+interstate+traffic+volume

![Dataset metadata description](data/metadata_desc.JPG)

## Setup and Data Loading

### Import libraries and load the dataset

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('data/modified_Metro_traffic_data.csv')
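# Replace NaN values in the 'holiday' column with the string 'None'.
# The dict form is used because data['holiday'].fillna('None', inplace=True)
# is a chained in-place call that recent pandas versions warn about.
data.fillna({'holiday': 'None'}, inplace=True)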

data.head()


## Initial Data Exploration

### Examining the data structure, types of data, unique values, and basic statistics.

In [None]:
# General overview of the data structure
print("Data Shape:", data.shape)  # Showing the number of rows and columns/features
print("\nData Info:")
data.info()

### Overview of numerical features

In [None]:
# Overview of numerical features
print("\nStatistics for Numerical Features:")
print(data.describe())

### Overview of categorical features

In [None]:
# Overview of categorical features
categorical_cols = data.select_dtypes(include=['object']).columns
print("\nCategorical Features:")
for col in categorical_cols:
    print(f"--- {col} ---")
    print(data[col].value_counts())
    print("\n")

### Visual overview using histograms for numerical data

In [None]:
# Visual overview using histograms for numerical data
data.hist(figsize=(12, 10))
plt.suptitle('Histograms of Numerical Features')
plt.show()

### Boxplots for each numerical feature to spot outliers

In [None]:
# Boxplots for each numerical feature to spot outliers
fig, axs = plt.subplots(nrows=len(data.select_dtypes(include=['number']).columns), figsize=(8, 20))
for i, col in enumerate(data.select_dtypes(include=['number']).columns):
    sns.boxplot(x=data[col], ax=axs[i])
    axs[i].set_title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

## Data Cleaning

### Handle missing values, remove duplicates, and correct data types if necessary.

### Check for missing values

In [None]:
# Check for missing values
print(data.isnull().sum())

### Removing Duplicates

In [None]:
print("Data Shape-Original:", data.shape)  

# Remove duplicates
data.drop_duplicates(inplace=True)

print("Data Shape-After removing duplicates:", data.shape)

### Correcting data types: making sure date-time is represented correctly

In [22]:
# Correct data types
data['date_time'] = pd.to_datetime(data['date_time'])

In [None]:
data.head(10)

## Exploratory Data Analysis

### Summarize statistics and visualize data distributions

In [None]:
# Statistical summary
print(data.describe())

# Distribution of traffic volumes
sns.histplot(data['traffic_volume'], kde=True)
plt.title('Distribution of Traffic Volume')
plt.show()

# Boxplot for hourly traffic volume
sns.boxplot(x=data['date_time'].dt.hour, y='traffic_volume', data=data)
plt.title('Hourly Traffic Volume')
plt.xlabel('Hour of the Day')
plt.ylabel('Traffic Volume')
plt.show()


## Identifying Noise and Anomalies

**Pairplot:** This plot helps in quickly spotting distributions, anomalies, and relationships between multiple numerical variables. Skewed distributions or unusual scatter patterns can suggest outliers or anomalies.

**Correlation Heatmap:** Useful for identifying relationships between variables. Highly correlated variables or unexpected correlations can suggest underlying patterns or errors in data collection.

**Count Plots for Categorical Data:** These plots are excellent for visualizing the frequency distribution of categorical variables. Anomalies might be very rare categories that could actually be data entry errors.

**Boxplots for Each Numerical Feature:** Boxplots are particularly useful for spotting outliers. They provide a clear visualization of the quartile ranges and any points that fall outside these ranges.

In [None]:
# Boxplots for numerical features to identify outliers
fig, axs = plt.subplots(nrows=len(data.select_dtypes(include=['number']).columns), figsize=(8, 20))
for i, col in enumerate(data.select_dtypes(include=['number']).columns):
    sns.boxplot(x=data[col], ax=axs[i])
    axs[i].set_title(f'Boxplot of {col} - Check for Outliers')
plt.tight_layout()
plt.show()


**Removing Negative Values:** This step filters out entries with negative traffic volumes, which do not make sense in this context and should be considered noise or errors.

**Handling Outliers:** Traffic volumes are treated as outliers when they fall more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile, and those rows are removed. This limits the influence of extreme values on the distribution and on any model trained on the data later.
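
As a small worked example of both rules, the sketch below applies them to a made-up series. `iqr_bounds` is a hypothetical helper name (not part of pandas), and 1.5 is the conventional multiplier, which you may want to tune for your data.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up volumes: one negative reading and one implausibly large reading
volumes = pd.Series([-5, 10, 12, 13, 15, 16, 18, 20, 95])

# Step 1: drop negative values, which cannot be real traffic counts
non_negative = volumes[volumes >= 0]

# Step 2: keep only the values that lie inside the IQR fences
lower, upper = iqr_bounds(non_negative)
cleaned = non_negative[(non_negative >= lower) & (non_negative <= upper)]

print(f"fences: [{lower}, {upper}]")
print(cleaned.tolist())
```

The lower fence can fall below zero when the data is widely spread, which is why negative values are filtered in a separate step instead of relying on the IQR rule alone.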

In [26]:
# Removing rows with negative traffic volume
data = data[data['traffic_volume'] >= 0]

# Removing extreme outliers in traffic volume
Q1 = data['traffic_volume'].quantile(0.25)
Q3 = data['traffic_volume'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
data = data[(data['traffic_volume'] >= lower_bound) & (data['traffic_volume'] <= upper_bound)]

In [None]:
# New Distribution of traffic volumes
sns.histplot(data['traffic_volume'], kde=True)
plt.title('Distribution of Traffic Volume (After Cleaning)')
plt.show()

In [None]:
data.head(10)

This raises a question: how do we transform the dataset into a format that an ML model can read efficiently, and how do we avoid repeating that transformation by hand every time new data is ingested?

Using a **transformation pipeline** is the answer, especially when you plan to scale your project to handle new incoming data for prediction. A transformation pipeline automates the steps of data preprocessing, such as scaling, encoding, and handling date-time variables. This not only ensures consistency in how data is treated both during model training and prediction but also simplifies the process of applying the same transformations to new data.

## Setting Up a Transformation Pipeline



In [None]:
from sklearn.base import TransformerMixin, BaseEstimator
from scipy.sparse import issparse
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer

# Custom transformer for dense conversion
class DenseTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        if issparse(X):
            return X.toarray()
        return X

# Custom transformer for date-time feature extraction
class DateFeaturesExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Build a new DataFrame so the caller's data is not modified in place
        dt = X['date_time'].dt
        return pd.DataFrame({
            'hour': dt.hour,
            'day': dt.day,  # day of the month
            'month': dt.month,
            'year': dt.year,
        }, index=X.index)


# Separate features and target
features = data.drop('traffic_volume', axis=1)
target = data['traffic_volume']

# Define columns for transformations
numerical_cols = ['temp', 'rain_1h', 'snow_1h', 'clouds_all']
categorical_cols = ['holiday', 'weather_main', 'weather_description']

# Column transformer: scale the numerical features to [0, 1], one-hot encode
# the categorical features, and split date_time into separate features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), numerical_cols),
        # handle_unknown='ignore' lets the fitted encoder accept categories it did not see during fit
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('date', DateFeaturesExtractor(), ['date_time'])
    ], remainder='drop')

# Create the preprocessing pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('to_dense', DenseTransformer())
])

# Applying the pipeline to the feature data
transformed_features = pipeline.fit_transform(features)

# Fetch feature names from the OneHotEncoder and concatenate with other feature names
feature_names = np.concatenate([
    numerical_cols,
    pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_cols),
    ['hour', 'day', 'month', 'year']
])

# Create DataFrame from the processed features
transformed_df = pd.DataFrame(transformed_features, columns=feature_names)
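# Assign the raw values: target keeps the original row labels after filtering,
# while transformed_df has a fresh 0..n-1 index, so a plain assignment would misalign rows
transformed_df['traffic_volume'] = target.to_numpy()  # Adding the target variable back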

# Display the transformed data
transformed_df.head()

For the **date_time variable**, you typically extract features that could have predictive power, such as the hour of the day, day of the week, month, or, if the dataset spans several years, the year. The `DateFeaturesExtractor` transformer above extracts the hour, day of the month, month, and year; the extracted features can then be treated as either categorical or numerical data.
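
The `DateFeaturesExtractor` above keeps the day of the month. As a minimal sketch of a day-of-week variant, the snippet below works on two made-up timestamps; the column names `day_of_week` and `is_weekend` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Two made-up timestamps in the same format as the date_time column
stamps = pd.DataFrame({
    'date_time': pd.to_datetime(['2012-10-02 09:00:00', '2018-09-30 23:00:00'])
})

date_features = pd.DataFrame({
    'hour': stamps['date_time'].dt.hour,
    'day_of_week': stamps['date_time'].dt.dayofweek,  # Monday=0 ... Sunday=6
    'month': stamps['date_time'].dt.month,
    'year': stamps['date_time'].dt.year,
})

# Simple derived flag: Saturday (5) and Sunday (6) count as weekend
date_features['is_weekend'] = date_features['day_of_week'] >= 5

print(date_features)
```

Day of the week tends to be more informative than day of the month for commuter traffic, but whether it helps on this dataset is something to check empirically.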

By using this pipeline, when new data comes in, you simply pass it through the fitted pipeline, which applies all the preprocessing steps in one go. If a model is later appended as the final step, the same call can also produce predictions. This approach not only maintains data integrity but also simplifies deployment and maintenance of your machine learning model.
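
The sketch below illustrates the "fit once, reuse on new data" idea on made-up rows. The names `train`, `new_batch`, and `toy_preprocessor` are hypothetical, and `handle_unknown='ignore'` is an assumption added here so that a category absent from the training data does not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up training rows and a later batch of "new" rows
train = pd.DataFrame({'temp': [270.0, 280.0, 290.0],
                      'weather_main': ['Clear', 'Clouds', 'Rain']})
new_batch = pd.DataFrame({'temp': [285.0, 275.0],
                          'weather_main': ['Clouds', 'Fog']})  # 'Fog' was never seen

toy_preprocessor = ColumnTransformer([
    ('num', MinMaxScaler(), ['temp']),
    # Unseen categories are encoded as all zeros instead of raising an error
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['weather_main']),
], sparse_threshold=0)  # sparse_threshold=0 always returns a dense array

# Learn the min/max and the category list from the training data only
toy_preprocessor.fit(train)

# Reuse the fitted transformer on new data: call transform, not fit_transform
new_features = toy_preprocessor.transform(new_batch)
print(new_features)
```

Calling `fit_transform` on the new batch instead would re-learn the scaling range and categories from that batch, so the columns would no longer mean the same thing as during training.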

In [None]:
transformed_df.info()

### Saving the Transformed Data for Machine Learning Application (Next Module's Focus)

In [31]:
transformed_df.to_csv('data/transformed_Metro_traffic_data.csv', index=False)

# Assignment

#### 1- Download the Traffic Data from the link mentioned in this Notebook.
#### 2- Try to add custom noise and outliers to the data.
#### 3- Repeat the data preparation and cleaning steps on this Notebook.
#### 4- Create **Pair Plots** and **Correlation Heatmaps** from the features and the target to better understand the underlying patterns in the dataset.
#### 5- Create a data transformation pipeline and save your dataset.
#### 6- Read and watch videos about Supervised Machine Learning - Regression to have a base understanding of how we could make predictions on this dataset
#### 7- Bonus: Try creating Training, Validation, and Test Datasets from the Transformed dataset for the Machine Learning Pipeline (To be discussed in the next module) 