### Dataset Visualizations

Dataset used: [Metro dataset](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume)

##### Features:
* holiday: US national holidays plus Minnesota state holidays
* temp: average temperature in Kelvin
* rain_1h: mm of rain
* snow_1h: mm of snow
* clouds_all: percentage of cloud cover
* weather_main: short text description of the weather
* weather_description: longer text description of the weather
* date_time: datetime
* traffic_volume: westbound traffic volume (ground truth)

In [60]:
# All imports needed
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import datetime

In [61]:
# Read data from file
df = pd.read_csv("../data/metro/metro_raw.csv")
df.head()

| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| 1 | | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| 2 | | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| 3 | | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| 4 | | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |


In [62]:
# Select rows with at least one missing value (the metro dataset has none)
null_vals = df[df.isnull().any(axis=1)]
null_vals.shape

(0, 9)
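
The row-level check above can be complemented by a per-column count, which shows where missing values sit when there are any. A minimal sketch on a small made-up frame; `toy` and its values are hypothetical and are not taken from the metro data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the metro data (values are made up)
toy = pd.DataFrame({
    "temp": [288.28, np.nan, 289.58],
    "rain_1h": [0.0, 0.0, 0.0],
    "weather_main": ["Clouds", "Clouds", None],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Rows that contain at least one missing value
rows_with_nulls = toy[toy.isnull().any(axis=1)]

print(null_counts)
print(rows_with_nulls.shape)
```

On the real data the same two calls would be applied to `df` instead of `toy`.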

In [63]:
# separate data from ground truth
X = df.drop('traffic_volume', axis=1)
Y = df['traffic_volume']

In [64]:
X.head()

In [65]:
# Show categorical (object-dtype) features and their number of unique values
for col_name in X.columns:
    if X[col_name].dtype == 'object':
        unique_cat = X[col_name].nunique(dropna=False)
        print(f"Feature '{col_name}' has {unique_cat} unique categories")

Feature 'holiday' has 12 unique categories
Feature 'weather_main' has 11 unique categories
Feature 'weather_description' has 38 unique categories
Feature 'date_time' has 40575 unique categories


In [66]:
# Binarize the holiday feature: 0 for 'None', 1 for any named holiday
X['holiday'] = [0 if x == 'None' else 1 for x in X['holiday']]
print(X['holiday'].value_counts())
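
The same binarization can be written as a vectorized comparison instead of a list comprehension. A small sketch, assuming the column holds the string `'None'` for non-holidays; the `holiday` series below is a hypothetical sample, not an excerpt of the data:

```python
import pandas as pd

# Hypothetical sample of the raw holiday column ('None' marks a non-holiday)
holiday = pd.Series(["None", "Columbus Day", "None", "Christmas Day"])

# Vectorized equivalent of the list comprehension: 1 for a holiday, 0 otherwise
is_holiday = (holiday != "None").astype(int)

print(is_holiday.tolist())
```

If the column had been read with missing values rather than the string `'None'`, `holiday.notna().astype(int)` would be the analogous check.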

In [67]:
X.head()

In [68]:
# weather_main feature
X['weather_main'] = X['weather_main'].str.lower()
print(X['weather_main'].value_counts(dropna=False))

In [69]:
# One-hot encode weather_main and replace the original column
dummies = pd.get_dummies(X['weather_main'], prefix='wmain')
X = X.drop('weather_main', axis=1)
X = pd.concat([X, dummies], axis=1)
X.head()
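
To make the effect of the encoding step easier to see, here is a sketch on a three-row frame. The frame `toy` is hypothetical; only the column name `weather_main` and the `wmain` prefix come from the cell above:

```python
import pandas as pd

# Hypothetical three-row frame with one categorical column
toy = pd.DataFrame({
    "temp": [288.28, 289.36, 291.14],
    "weather_main": ["clouds", "rain", "clouds"],
})

# One indicator column per category, named wmain_<category>
dummies = pd.get_dummies(toy["weather_main"], prefix="wmain")

# Replace the original column with its indicator columns
encoded = pd.concat([toy.drop("weather_main", axis=1), dummies], axis=1)

print(encoded.columns.tolist())
```

Each category becomes its own column, so a feature with many categories (such as `weather_description`) widens the frame considerably.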

In [70]:
# clean up weather_description feature
X['weather_description'] = X['weather_description'].str.lower()
print(X['weather_description'].value_counts(dropna=False))

In [71]:
# one-hot dummies
# pd.get_dummies(X['weather_description']).head()

In [72]:
# Split date_time into weekday (Monday=0) and hour features.
# The parsed series is named `timestamps` so it does not shadow the datetime module.
timestamps = pd.to_datetime(X['date_time'])
X['weekday'] = timestamps.dt.dayofweek
X['hour'] = timestamps.dt.hour
X = X.drop('date_time', axis=1)

print(X['weekday'].value_counts(dropna=False))
print(X['hour'].value_counts(dropna=False))
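
As a quick check of what the `.dt` accessor returns, the sketch below parses two timestamps. The first appears in the `df.head()` output; the second is a made-up value added for illustration:

```python
import pandas as pd

# First value is from the head() output; the second is a hypothetical example
stamps = pd.to_datetime(pd.Series(["2012-10-02 09:00:00", "2012-10-06 23:00:00"]))

weekday = stamps.dt.dayofweek  # Monday=0 ... Sunday=6
hour = stamps.dt.hour          # 0 ... 23

print(weekday.tolist())
print(hour.tolist())
```

October 2, 2012 was a Tuesday, so its weekday value is 1; the Saturday timestamp yields 5.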

In [73]:
X.head()

In [15]:
# Y.value_counts().plot(kind='bar', rot=0)
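
Because `traffic_volume` is numeric, a bar chart of `value_counts()` would draw one bar per distinct value. A histogram is one possible alternative. The sketch below uses only the five target values visible in the `df.head()` output; the bin count of 5 is an arbitrary choice for this tiny sample:

```python
import matplotlib.pyplot as plt
import pandas as pd

# The five traffic_volume values shown in the df.head() output
sample = pd.Series([5545, 4516, 4767, 5026, 4918], name="traffic_volume")

fig, ax = plt.subplots()
# Bin the numeric target instead of drawing one bar per distinct value
counts, bin_edges, _ = ax.hist(sample, bins=5)
ax.set_xlabel("traffic_volume")
ax.set_ylabel("number of rows")
```

With the full data, `sample` would be replaced by `Y` and a larger bin count would likely be more informative.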