# Title
Finding Heavy Traffic Indicators on I-94

# Project Description
I'm going to analyze a dataset about the westbound traffic on the I-94 Interstate highway.

My goal is to find out what factors affect heavy traffic on I-94. These factors can be weather type, time of the day, time of the week, etc. 

## Importation
Here is the section to import all the packages/libraries that will be used throughout this notebook.

In [2]:
# Data handling
import pandas as pd
import numpy as np

# Visualization (Matplotlib, Plotly, Seaborn, etc. )
import seaborn as sns
import matplotlib.pyplot as plt

# EDA (pandas-profiling, etc. )
...

# Feature Processing (Scikit-learn processing, etc. )
...

# Machine Learning (Scikit-learn Estimators, Catboost, LightGBM, etc. )
...

# Hyperparameters Fine-tuning (Scikit-learn hp search, cross-validation, etc. )
...

# Other packages


Ellipsis

# Data Loading
Here is the section to load the datasets (train, eval, test) and any additional files.

In [3]:
traffic = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")

# Exploratory Data Analysis: EDA
Here is the section to **inspect** the dataset in depth, **present** it, make **hypotheses**, and **think through** the *cleaning, processing, and feature creation*.

In [8]:
traffic.shape

(48204, 9)

In [6]:
traffic.head(20)

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918
5,,291.72,0.0,0.0,1,Clear,sky is clear,2012-10-02 14:00:00,5181
6,,293.17,0.0,0.0,1,Clear,sky is clear,2012-10-02 15:00:00,5584
7,,293.86,0.0,0.0,1,Clear,sky is clear,2012-10-02 16:00:00,6015
8,,294.14,0.0,0.0,20,Clouds,few clouds,2012-10-02 17:00:00,5791
9,,293.1,0.0,0.0,20,Clouds,few clouds,2012-10-02 18:00:00,4770


## Hypothesis
#### Null Hypothesis, H0
Time of day is the factor that affects traffic the most.

#### Alternative Hypothesis, H1
Time of day is not the factor that affects traffic the most.

## Questions
1. Which holidays have the most traffic?
2. Which weekdays have the most traffic?
3. What time of day has the most traffic? (Is it morning, afternoon, evening, midnight, or midday?)

MORNING
This is the time from midnight to midday.

AFTERNOON
This is the time from midday (noon) to evening.
From 12:00 hours to approximately 18:00 hours.

EVENING
This is the time from the end of the afternoon to midnight.
From approximately 18:00 hours to 00:00 hours.

MIDNIGHT
This is the middle of the night (00:00 hours).

MIDDAY
This is the middle of the day, also called "NOON" (12:00 hours).

4. Which factor affects traffic the most?
5. Compare rain, snow, and temperature based on how they affect traffic.
6. What's the highest recorded traffic?
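
One possible way to approach questions 2, 3, and 6 is sketched below. It assumes `date_time` can be parsed as a timestamp, and it uses a hypothetical helper, `part_of_day`, that follows the period definitions above with hard cut-offs at 12:00 and 18:00 (the text only says "approximately 18:00", so that boundary is an assumption). The `sample` frame holds a few rows copied from the `traffic.head(20)` output so the snippet runs on its own; on the real data, `traffic` would take its place.

```python
import pandas as pd

# A few rows copied from the traffic.head(20) output, used as a stand-in
# for the full DataFrame so this sketch is self-contained.
sample = pd.DataFrame({
    "date_time": [
        "2012-10-02 09:00:00",
        "2012-10-02 11:00:00",
        "2012-10-02 13:00:00",
        "2012-10-02 16:00:00",
        "2012-10-02 18:00:00",
    ],
    "traffic_volume": [5545, 4767, 4918, 6015, 4770],
})

# date_time is loaded as text, so parse it before using the .dt accessor.
sample["date_time"] = pd.to_datetime(sample["date_time"])


def part_of_day(hour):
    """Hypothetical helper: map an hour (0-23) to the periods defined above."""
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


sample["part_of_day"] = sample["date_time"].dt.hour.map(part_of_day)
sample["weekday"] = sample["date_time"].dt.day_name()

# Question 3: average traffic per period of the day.
mean_by_period = sample.groupby("part_of_day")["traffic_volume"].mean()

# Question 2: average traffic per weekday.
mean_by_weekday = sample.groupby("weekday")["traffic_volume"].mean()

# Question 6: highest recorded hourly traffic volume.
highest_volume = sample["traffic_volume"].max()
```

Midnight and midday are single instants in the definitions above, so this sketch folds them into morning and afternoon respectively; treating them as separate categories would be an equally valid choice.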


In [7]:
traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB


## Dataset overview

Have a look at the loaded dataset using the following methods: `.head()`, `.info()`.
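
As a complement to those two methods, a compact per-column summary can make the overview easier to scan. The snippet below is only a sketch: `overview` is a hypothetical helper (not part of pandas), and `sample` reuses three rows from the `.head(20)` output in place of the full `traffic` DataFrame.

```python
import pandas as pd


def overview(df):
    """Hypothetical helper: dtype, missing-value count and unique count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


# Three rows copied from the traffic.head(20) output, as a stand-in for `traffic`.
sample = pd.DataFrame({
    "temp": [288.28, 289.36, 289.58],
    "weather_main": ["Clouds", "Clouds", "Clouds"],
    "date_time": [
        "2012-10-02 09:00:00",
        "2012-10-02 10:00:00",
        "2012-10-02 11:00:00",
    ],
    "traffic_volume": [5545, 4516, 4767],
})

summary = overview(sample)
print(summary)
```

Calling `overview(traffic)` on the loaded data would give the same kind of table for all nine columns.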

## Issues with the data
1. column names of sheet1 are different from the others
2. sheet3 has an unnecessary column ('Unnamed: 9')
3. sheet1 doesn't have the columns Founders, Investor and Founded
4. sheet1 amount column has rupees, dollars, commas and is a string
5. amount column of the other sheets has dollars, commas and is a string
6. there are null values in the founded, headquarter, sector and stage columns
7. headquarter column for sheet1 has more information
8. for the sector column, the values are different in all sheets
9. amount column has null values
10. datatypes are mostly object

## How we intend to handle each issue identified
1. change column names of sheet1 to match the others
2. drop unnecessary column in sheet3
3. add missing columns to sheet1 with null values
4. convert sheet1 amount column to dollars in float
5. convert amount of other sheets to dollars in float
6. we'll leave the null values since they are not numbers
7. separate by commas and keep only the first word
8. make the values similar, for example: "Ecommerce" and "E-Commerce Platforms" should be "E-commerce"
9. replace null values in amount column by calculating the mean or median
10. change datatypes accordingly for each column
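
Steps 4, 5, and 9 could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the helper name `amount_to_usd`, the `₹` and `$` prefixes, the example strings, and above all the placeholder exchange rate `INR_PER_USD`, which would need to be replaced with a rate appropriate for the data.

```python
import numpy as np
import pandas as pd

# Assumption: placeholder exchange rate, not taken from the data or the notebook.
INR_PER_USD = 80.0


def amount_to_usd(value):
    """Hypothetical helper: turn strings like '$1,200,000' or '₹40,000,000' into USD floats."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    is_rupees = text.startswith("₹")
    digits = text.replace("₹", "").replace("$", "").replace(",", "")
    try:
        amount = float(digits)
    except ValueError:
        # Non-numeric entries (for example 'Undisclosed') become missing values.
        return np.nan
    return amount / INR_PER_USD if is_rupees else amount


# Made-up example values, for illustration only.
amounts = pd.Series(["$1,200,000", "₹40,000,000", None, "Undisclosed"])

# Steps 4 and 5: strip symbols and commas, convert to float dollars.
cleaned = amounts.map(amount_to_usd)

# Step 9: fill missing amounts with the median of the known values.
filled = cleaned.fillna(cleaned.median())
```

The median is used here because it is less sensitive to a few very large amounts than the mean; the list above leaves that choice open.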

## Verify Data Quality
The data is dirty and needs a lot of cleaning before we can use it. It also has a lot of null values. So the quality is low.

## Data Cleaning

1. change column names of sheet1 to match the others

2. drop unnecessary column in sheet3

3. add missing columns to sheet1 with null values

In [11]:
sheet1["Founded"] = np.nan
sheet1["Founders"] = np.nan
sheet1["Investor"] = np.nan

4. convert sheet1 amount column to dollars in float