# Data exploration and preparation for Machine Learning

## Hands-on case study
Explore Hamburg data to ...
@TODO Jan

### Useful shortcuts in Jupyter
(Vimium users: exclude `http://localhost:8888/*` in the Vimium options, otherwise it intercepts these keys)
  - run the current cell
    - `Shift + Enter`
  - enter a cell to modify it
    - `Enter` (the frame around the cell turns **green**)
  - deselect the cell
    - `Esc` (the frame around the cell turns **blue**)
  - new cell **above** the current cell
    - `Esc` to deselect, followed by `A`
  - new cell **below** the current cell
    - `Esc` to deselect, followed by `B`

## Introduce dataset
- Booking requests in Hamburg, excluding rehails, created from 2018-11-10 (inclusive) to 2019-11-11 (exclusive)
- The goal is to find patterns in the data
- Columns:
  - `booking_id`: index column
  - `date_created` 
  - `origin_lat`
  - `origin_lon`

## Let's get started

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
import seaborn as sns

from plotting import plot_boxplot

In [None]:
# increase the font size of the plots
FONT_SIZE = 14
mpl.rcParams['xtick.labelsize'] = FONT_SIZE 
mpl.rcParams['ytick.labelsize'] = FONT_SIZE
mpl.rcParams['legend.fontsize'] = FONT_SIZE
mpl.rcParams['axes.labelsize'] = FONT_SIZE
mpl.rcParams['figure.figsize'] = (10, 10)

In [None]:
file_path = '../data/hamburg_data.csv'
index_col = 'booking_id'
date_columns = ['date_created']
delimiter = ';'

In [None]:
df = pd.read_csv(file_path, index_col=index_col, parse_dates=date_columns, sep=delimiter)
df.head()

## TASK 1 - Exploring column values
- How are the values in each column distributed?
- Which columns include missing values?
- Did you find any outliers? Filter them out.

In [None]:
df.describe(percentiles=[0.1, 0.25, 0.75, 0.9])

In [None]:
df.isna().sum()
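The two cells above are related: the `count` row of `describe()` is the number of non-missing values per column, so it can be cross-checked against `isna().sum()`. The sketch below illustrates this on a small toy frame; the values are made up and only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'origin_lat': [53.55, 53.56, np.nan, 53.58],
    'origin_lon': [9.99, np.nan, np.nan, 10.02],
})

summary = toy.describe()
missing = toy.isna().sum()

# `count` is the number of non-missing values per column ...
non_missing = summary.loc['count']
# ... so the number of rows minus `count` is the number of missing values.
derived_missing = len(toy) - non_missing

# `mean` is computed over the non-missing values only.
print(summary.loc[['count', 'mean']])
print(missing)
```

A column whose `count` is lower than `len(df)` therefore contains missing values, even before looking at `isna()`.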

In [None]:
plot_boxplot(df, 1, 2)

### Discuss:
- What can you see here?
- Do we have outliers and unclean data?
- Why are not all columns shown?
- What do `count` and `mean` mean?
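If the box plots reveal coordinates far away from the city, one possible way to filter them out is a simple bounding box. The following is only a sketch on toy data: the bounds `LAT_MIN`, `LAT_MAX`, `LON_MIN` and `LON_MAX` are assumptions chosen loosely around Hamburg (roughly 53.55 N, 9.99 E) and should be adapted after looking at the real data.

```python
import pandas as pd

# Hypothetical bounding box around Hamburg; adjust after inspecting the data.
LAT_MIN, LAT_MAX = 53.3, 53.8
LON_MIN, LON_MAX = 9.6, 10.4

# Toy data with made-up coordinates: two plausible pickups and three outliers.
toy = pd.DataFrame(
    {'origin_lat': [53.55, 53.60, 0.0, 53.52, 48.14],
     'origin_lon': [9.99, 10.05, 0.0, 25.00, 11.58]},
    index=pd.Index([1, 2, 3, 4, 5], name='booking_id'),
)

# A row is kept only if both coordinates fall inside the box.
# `between` is inclusive and evaluates to False for missing values.
inside = (
    toy['origin_lat'].between(LAT_MIN, LAT_MAX)
    & toy['origin_lon'].between(LON_MIN, LON_MAX)
)
toy_clean = toy[inside]

print(f'kept {inside.sum()} of {len(toy)} rows')
```

Keeping the boolean mask in its own variable makes it easy to inspect the rejected rows with `toy[~inside]` before dropping them.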

In [None]:
fig = plt.figure(figsize=(16, 12))
ax = fig.gca()
s = ax.scatter(df['origin_lon'], df['origin_lat'], marker='.')
ax.set_title('pickup locations', fontsize=20)
ax.set_xlabel('longitude')
ax.set_ylabel('latitude')
ax.xaxis.set_major_locator(MultipleLocator(0.2))
ax.yaxis.set_major_locator(MultipleLocator(0.2))
ax.grid()
# no colorbar: the scatter call maps no values to colours (no `c=` argument)

### Discuss:
- Does the density of the pickup locations look plausible?
- Can you spot the Alster or Elbe? TODO
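In a scatter plot many points overlap, so density is hard to judge by eye. One option is to bin the coordinates into a 2D histogram and colour each bin by its count. The sketch below uses synthetic points around an assumed centre, not the Hamburg data; the bin count of 50 is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 5000
# Synthetic pickups clustered around a hypothetical centre.
lon = rng.normal(9.99, 0.05, n)
lat = rng.normal(53.55, 0.03, n)

# Count the points that fall into each cell of a 50 x 50 grid.
counts, lon_edges, lat_edges = np.histogram2d(lon, lat, bins=50)

fig, ax = plt.subplots(figsize=(8, 6))
# histogram2d returns counts[lon_bin, lat_bin]; pcolormesh expects
# rows to run along the y axis, hence the transpose.
mesh = ax.pcolormesh(lon_edges, lat_edges, counts.T)
ax.set_title('pickup density')
ax.set_xlabel('longitude')
ax.set_ylabel('latitude')
fig.colorbar(mesh, ax=ax, label='pickups per bin');
```

Here the colorbar is meaningful because every bin carries a count. On the real data, bins with a count of zero inside the city may hint at water or other areas without pickups.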

## TASK 2 - EDA (Exploratory Data Analysis)

In [None]:
df['year'] = df['date_created'].dt.year
df['month'] = df['date_created'].dt.month
df['day'] = df['date_created'].dt.day
df['doy'] = df['date_created'].dt.dayofyear
df['hour'] = df['date_created'].dt.hour
df['minute'] = df['date_created'].dt.minute
df['second'] = df['date_created'].dt.second

In [None]:
# count the bookings per day of year, ordered by day
count_per_day = df['doy'].value_counts().sort_index()
# recent seaborn versions require x and y as keyword arguments
sns.lineplot(x=count_per_day.index, y=count_per_day.values, alpha=0.8)
# add a title and axis labels
plt.title('Rides per day of year')
plt.ylabel('Tours', fontsize=12)
plt.xlabel('day of year', fontsize=12);
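One caveat when counting by `doy`: if the date range stated in the dataset introduction is accurate, the data touches 10 November in both 2018 and 2019, and both dates map to the same day of year. The sketch below, on made-up timestamps, contrasts counting per day of year with counting per calendar date.

```python
import pandas as pd

# Hypothetical timestamps: the same calendar day in two different years.
toy = pd.DataFrame({'date_created': pd.to_datetime([
    '2018-11-10 08:00', '2018-11-10 09:30', '2018-11-11 10:00',
    '2019-11-10 07:15',
])})

# Counting per day of year merges 2018-11-10 and 2019-11-10.
per_doy = toy['date_created'].dt.dayofyear.value_counts().sort_index()

# Counting per calendar date (timestamps floored to midnight) keeps them apart.
per_date = toy['date_created'].dt.floor('D').value_counts().sort_index()

print(per_doy)
print(per_date)
```

Counting per calendar date also keeps the x axis in chronological order, which the day of year does not when a series starts in November.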

### Discuss:
- Patterns?
- ...
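A possible starting point for spotting weekly and daily patterns is a weekday by hour table of counts. This is only a sketch on made-up timestamps; the column name `date_created` is taken from the dataset, everything else is illustrative.

```python
import pandas as pd

# Hypothetical booking timestamps.
toy = pd.DataFrame({'date_created': pd.to_datetime([
    '2019-03-01 08:10',  # Friday
    '2019-03-01 08:45',  # Friday
    '2019-03-02 23:05',  # Saturday
    '2019-03-04 08:20',  # Monday
])})

weekday = toy['date_created'].dt.dayofweek  # Monday=0 ... Sunday=6
hour = toy['date_created'].dt.hour

# Rows are weekdays, columns are hours, cells are booking counts.
counts = pd.crosstab(weekday, hour, rownames=['weekday'], colnames=['hour'])
print(counts)
```

A table like this could be passed to `sns.heatmap(counts)` to compare, for example, weekday mornings with weekend nights.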

## TASK 3 - Feature Engineering

## TASK 4 - Predict and evaluate