# ADS Midterm 2019

## Theoretical part

### Question 1 (5pts).
Imagine training a model that takes multiple satellite images of urban traffic and tries to find groups of typical
scenarios (recurring with minor deviations). How would you classify this problem from a Machine Learning perspective?

A. Supervised learning;

B. Unsupervised learning;

C. Semi-supervised learning;

D. Reinforcement learning.

Explain your choice:

### Question 2 (7pts).
Which of the following statements about the overfitting problem in linear regression are true (select all that apply)?

A. Overfitting can be detected by R-squared if the in-sample R-squared is very low.

B. Overfitting often happens when we do not have enough features but have a large number of observations.

C. Overfitting can be detected by R-squared if the out-of-sample R-squared is very low.

D. Overfitting can happen when we have many noisy features but a small number of observations.

E. Overfitting can be detected by R-squared if the out-of-sample R-squared is considerably lower than the in-sample R-squared.

Explain how you understand the concept of overfitting, both in general and in the context of your answer.

### Question 3 (8pts).
Please explain why you would need separate training, validation and test samples to learn a model. In which cases might you need all three, including a validation sample?

### Question 4 (10pts). 

The 95% confidence interval for a linear regression coefficient estimate is [500, 1000]. Which of
the intervals below can **not** be the 99% confidence interval (select all that apply):

a) [650, 850], b) [0, 900], c) [300, 1200], d) [300, 1100], e) [-1000, -500].

Given the 95% interval and the remaining possible choices for the 99% confidence interval from above, which of the following p-values are **not** possible:

a) 0.01, b) 0.05, c) 0.1, d) 0.001?

Please explain your choice. Note: you need to prove that the excluded options are impossible. You do not need to prove that the remaining options are possible, but you do need to list every option that can be proven impossible.

In [1]:
# import packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error, r2_score
import datetime as dt
from sklearn import linear_model
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans

# suppress warning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Midterm: FHV Traffic Modeling for Real-Time Autonomous Vehicle Solutions in JFK

Transportation network models are essential to transportation operations and planning. A simple yet well-designed linear model can give us insight into traffic demand. We are going to model the outgoing traffic around JFK, one of the busiest transportation hubs in NYC.
In this test, you'll be asked to:
* Find possible correlations from observations
* Incorporate time patterns using dummy variables
* Run and diagnose linear models, in-sample and out-of-sample. Perform feature selection
* Cluster the days based on their ridership patterns to see if we can detect any outliers

We will be importing the dataset `JFK60.csv`, which provides FHV ridership and flight arrivals at the airport aggregated on an hourly basis:
* `fhv`: Number of FHVs (For-Hire Vehicles) departing from JFK. This is our target variable.
* `arrival`: Number of incoming domestic flights arriving at JFK, which is assumed to provide a basis for future FHV demand

In [2]:
# import and curate the dataset
dataset = pd.read_csv("JFK60.csv")

In [3]:
dataset.head()

|   | date | arrival | fhv |
|---|------|---------|-----|
| 0 | 18/1/1 0:00 | 6 | 263 |
| 1 | 18/1/1 1:00 | 6 | 138 |
| 2 | 18/1/1 2:00 | 2 | 50 |
| 3 | 18/1/1 3:00 | 0 | 24 |
| 4 | 18/1/1 4:00 | 2 | 45 |


In [4]:
# convert the `date` feature into `dt.datetime` format. This is for later creating dummy variables
dataset.date = pd.to_datetime(dataset.date, format='%y/%m/%d %H:%M')

In [5]:
# get hour of the day from datetime
dataset['hour']=pd.DatetimeIndex(dataset.date).hour

In [6]:
#get day of the week; monday - 0, sunday - 6
dataset['dow']=pd.DatetimeIndex(dataset.date).weekday

In [7]:
#get day from beginning of the year
dataset['day']=((dataset.date-dt.datetime(2018,1,1))/dt.timedelta(days = 1)).astype(int)

In [8]:
dataset.head()

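|   | date | arrival | fhv | hour | dow | day |
|---|------|---------|-----|------|-----|-----|
| 0 | 2018-01-01 00:00:00 | 6 | 263 | 0 | 0 | 0 |
| 1 | 2018-01-01 01:00:00 | 6 | 138 | 1 | 0 | 0 |
| 2 | 2018-01-01 02:00:00 | 2 | 50 | 2 | 0 | 0 |
| 3 | 2018-01-01 03:00:00 | 0 | 24 | 3 | 0 | 0 |
| 4 | 2018-01-01 04:00:00 | 2 | 45 | 4 | 0 | 0 |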


In [9]:
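# add time-lagged arrivals (1, 2, ..., maxlag hours before)
maxlag = 12
lagdata = pd.DataFrame([])
for lag in range(1, maxlag + 1):
    varname = 'lag' + str(lag)
    # slice `arrival` so that row i holds the value observed `lag` hours before row maxlag+i of `dataset`
    lagdata[varname] = dataset['arrival'].iloc[maxlag-lag:len(dataset)-lag].reset_index(drop = True)
# drop the first maxlag rows (incomplete lag history) and attach the lag columns
datasetL = pd.concat([dataset.loc[maxlag:].reset_index(drop = True), lagdata.reset_index(drop = True)], axis = 1, sort = False)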

In [10]:
datasetL.head()

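|   | date | arrival | fhv | hour | dow | day | lag1 | lag2 | lag3 | lag4 | lag5 | lag6 | lag7 | lag8 | lag9 | lag10 | lag11 | lag12 |
|---|------|---------|-----|------|-----|-----|------|------|------|------|------|------|------|------|------|-------|-------|-------|
| 0 | 2018-01-01 12:00:00 | 10 | 357 | 12 | 0 | 0 | 7 | 17 | 13 | 11 | 19 | 11 | 16 | 2 | 0 | 2 | 6 | 6 |
| 1 | 2018-01-01 13:00:00 | 18 | 390 | 13 | 0 | 0 | 10 | 7 | 17 | 13 | 11 | 19 | 11 | 16 | 2 | 0 | 2 | 6 |
| 2 | 2018-01-01 14:00:00 | 19 | 606 | 14 | 0 | 0 | 18 | 10 | 7 | 17 | 13 | 11 | 19 | 11 | 16 | 2 | 0 | 2 |
| 3 | 2018-01-01 15:00:00 | 28 | 601 | 15 | 0 | 0 | 19 | 18 | 10 | 7 | 17 | 13 | 11 | 19 | 11 | 16 | 2 | 0 |
| 4 | 2018-01-01 16:00:00 | 15 | 676 | 16 | 0 | 0 | 28 | 19 | 18 | 10 | 7 | 17 | 13 | 11 | 19 | 11 | 16 | 2 |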
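
As a cross-check of the lag construction, the same kind of columns can be built with `Series.shift`. This is a minimal sketch on a toy series; the values and the names `toy` and `toy_lagged` are hypothetical and not taken from `JFK60.csv`.

```python
import pandas as pd

# Toy hourly arrivals (hypothetical values, not from JFK60.csv)
toy = pd.DataFrame({"arrival": [6, 6, 2, 0, 2, 5, 9, 4]})
maxlag = 3

# shift(k) moves the series down by k rows, so row t holds the value from hour t-k
for lag in range(1, maxlag + 1):
    toy[f"lag{lag}"] = toy["arrival"].shift(lag)

# The first maxlag rows have incomplete history: drop them and restore integer dtype
toy_lagged = toy.iloc[maxlag:].reset_index(drop=True).astype(int)
print(toy_lagged)
```

In the first remaining row, `lag1`, `lag2` and `lag3` hold the arrivals from one, two and three hours earlier, which is the same alignment the slicing loop produces.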


## Task 1. Data Exploration
### Q1 (5pts). Print some dataset characteristics: number of records, total number of FHV trips, total number of arriving flights

### Q2 (10pts). Visualize the timeline of FHV rides and arriving flights over the first month (January, 2018)

### Q3 (5pts). Report correlation between FHV rides and arriving flights

## Task 2: Build Linear Regression Model of FHV vs Arrival data

### Q1 (7pts). Build an OLS model with intercept (you may want to use `smf.ols`) over `train` using `arrival` as the sole predictor for `fhv`
Check the p-value for `arrival`. What does it indicate? Report the 99% confidence interval for the coefficient of `arrival`.
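
A minimal sketch of where these quantities live in a fitted `statsmodels` result, run on synthetic data. The frame `train_demo` and its coefficients are assumptions made for illustration only; they stand in for the real training sample.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the training sample (hypothetical, not JFK60.csv)
rng = np.random.default_rng(0)
train_demo = pd.DataFrame({"arrival": rng.integers(0, 30, size=200)})
train_demo["fhv"] = 50 + 20 * train_demo["arrival"] + rng.normal(0, 40, size=200)

# The formula interface adds the intercept by default
fit = smf.ols("fhv ~ arrival", data=train_demo).fit()

print(fit.pvalues["arrival"])                    # p-value of the slope
ci95 = fit.conf_int(alpha=0.05).loc["arrival"]   # 95% interval (lower, upper)
ci99 = fit.conf_int(alpha=0.01).loc["arrival"]   # 99% interval (lower, upper)
print(ci95.tolist(), ci99.tolist())
```

`conf_int` takes the significance level `alpha`, so the 99% interval corresponds to `alpha=0.01`.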

### Q2 (8pts): Consider Historical Impact
Add time lags: include all 12 lag variables in the regression above.

There is always some delay between passengers' arrival and departure (e.g. passing customs, picking up luggage, etc.). `fhv` might be more related to historical values of flight arrivals (lags) than to the immediate `arrival`. Build a formula with all the following variables and run the regression:
* arrival, lag1, ..., lagN: arrivals in the current hour and 1 hr, ..., N = 12 hr earlier.

Which of the variables have a statistically significant impact according to their p-values?
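
Typing thirteen terms by hand is error-prone, so the formula string can be assembled programmatically. A small sketch, assuming the column names `arrival` and `lag1` ... `lag12` created earlier; the variable names `lag_terms` and `formula` are illustrative.

```python
# Build "fhv ~ arrival + lag1 + ... + lag12" without typing every term
maxlag = 12
lag_terms = [f"lag{k}" for k in range(1, maxlag + 1)]
formula = "fhv ~ " + " + ".join(["arrival"] + lag_terms)
print(formula)
```

The resulting string can be passed as the first argument of `smf.ols`, together with the data frame that holds these columns.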

### Q3 (5pts): Incorporate Temporal Patterns
Add categorical variables for day of week and hour.

From the visualization in Task 1 Q2 you may see that both `fhv` and `arrival` follow a somewhat periodic temporal pattern. Intuitively, this is true for most traffic flows, which follow daily rhythms including rush hours and also vary over the course of the week. Usually we add dummy/categorical variables (Boolean variables: 1 for True and 0 for False) to capture people's travel patterns during different time periods.

Note that it would not make sense to add `hour` and `dow` as regular regressors, as we cannot expect their impact to be linear in their numeric value. Instead, the expression `C(.)` can be used in the regression formula to treat those variables as categorical, adding a dummy variable for each of their possible discrete values.

Perform the regression of `fhv` against `arrival`, the lags and the temporal categorical variables. Which of the variables have a statistically significant impact according to their p-values?
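
To illustrate what `C(.)` does, here is a minimal sketch on synthetic data. The frame `demo`, the evening bump at hours 17 to 19 and all coefficients are assumptions made for the example, not properties of the JFK data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic two-week hourly sample (hypothetical, not JFK60.csv)
rng = np.random.default_rng(1)
n = 24 * 14
demo = pd.DataFrame({
    "hour": np.arange(n) % 24,
    "dow": (np.arange(n) // 24) % 7,
    "arrival": rng.integers(0, 30, size=n),
})
# Assumed ground truth: an evening bump that a linear `hour` term could not capture
demo["fhv"] = (10 * demo["arrival"] + 100 * demo["hour"].isin([17, 18, 19])
               + rng.normal(0, 5, size=n))

fit = smf.ols("fhv ~ arrival + C(hour) + C(dow)", data=demo).fit()

# One dummy per level except the baseline (hour 0, dow 0), which the intercept absorbs
hour_terms = [name for name in fit.params.index if name.startswith("C(hour)")]
dow_terms = [name for name in fit.params.index if name.startswith("C(dow)")]
print(len(hour_terms), len(dow_terms))
print(fit.params["C(hour)[T.18]"])   # estimated shift of hour 18 relative to hour 0
```

Each categorical coefficient is read relative to the omitted baseline level, which matters when comparing hours or weekdays.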

### Q4 (10pts). Perform feature selection for lag variables
As you may see, not all the lag variables have a statistically significant impact on the regression. Maybe some of them are not really relevant?
Try different numbers of lag variables m=0,1,...,12: in a loop, train the above regression over the training sample, report the out-of-sample R2 over the validation sample, and pick the m that maximizes it. Evaluate the final regression over the test sample.
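
The selection loop can be sketched as follows on synthetic data. Everything here is an assumption for illustration: the series is generated so that demand depends on arrivals two hours earlier, the split is a chronological 60/20/20, only five lags are tried, and the temporal categorical terms are left out to keep the example short.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import r2_score

# Synthetic series (hypothetical): demand driven by arrivals two hours earlier
rng = np.random.default_rng(2)
demo = pd.DataFrame({"arrival": rng.integers(0, 30, size=600).astype(float)})
max_m = 5
for k in range(1, max_m + 1):
    demo[f"lag{k}"] = demo["arrival"].shift(k)
demo = demo.dropna().reset_index(drop=True)
demo["fhv"] = 15 * demo["lag2"] + rng.normal(0, 20, size=len(demo))

# Assumed chronological 60/20/20 split into training, validation and test samples
n = len(demo)
train_d = demo.iloc[: int(0.6 * n)]
valid_d = demo.iloc[int(0.6 * n): int(0.8 * n)]
test_d = demo.iloc[int(0.8 * n):]

def formula_for(m):
    """Formula with the immediate arrivals plus the first m lags."""
    return "fhv ~ " + " + ".join(["arrival"] + [f"lag{k}" for k in range(1, m + 1)])

# Fit on training data, score on validation data, for each candidate m
valid_r2 = {}
for m in range(max_m + 1):
    fit = smf.ols(formula_for(m), data=train_d).fit()
    valid_r2[m] = r2_score(valid_d["fhv"], fit.predict(valid_d))

best_m = max(valid_r2, key=valid_r2.get)
final = smf.ols(formula_for(best_m), data=train_d).fit()
print(best_m, r2_score(test_d["fhv"], final.predict(test_d)))
```

The test sample is touched only once, after m has been chosen, so its R2 remains an honest estimate of out-of-sample performance.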

### Q5 (10pts). Visualize temporal patterns and lag impacts through bar plots
For the best regression above visualize:
- bar plot of hour of the day vs its impact coefficient
- bar plot of day of the week vs its impact coefficient
- bar plot of the lag (0 for immediate arrivals, 1,2,... for lags) vs its impact coefficient

Please find the optimal choice for Lasso regression. What lag feature should we use here?

## Task 3. Cluster the days of the year based on the relative timeline of their FHV departures from the airport

### Q1 (8pts). From the entire `dataset`, create a dataframe with days as rows, hours as columns and FHV ridership as values (feel free to use `pd.pivot_table`). Normalize by the total daily ridership
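
A minimal sketch of the reshaping and normalization on toy data. The frame `toy` and its values are hypothetical; only the column names `day`, `hour` and `fhv` follow the dataset above.

```python
import numpy as np
import pandas as pd

# Toy long-format data: 3 days x 24 hours (hypothetical values)
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "day": np.repeat(np.arange(3), 24),
    "hour": np.tile(np.arange(24), 3),
    "fhv": rng.integers(20, 700, size=72),
})

# Rows = days, columns = hours, values = rides in that hour
daily = pd.pivot_table(toy, index="day", columns="hour", values="fhv", aggfunc="sum")

# Divide each row by its total so every day becomes a share-of-day profile
daily_share = daily.div(daily.sum(axis=1), axis=0)
print(daily_share.shape)
```

After this step each row sums to 1, so days are compared by the shape of their timeline rather than by their overall volume.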

### Q2 (12pts). Try K-means with different numbers of clusters k=2..7, reporting the average silhouette score for each. Which k is "optimal" from the silhouette standpoint?
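
The loop over k can be sketched like this. The matrix `X` is a synthetic stand-in with three well-separated groups, an assumption made so the example has a clear answer; in the task it would be the normalized day-by-hour table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the day-by-hour matrix: three well-separated groups (hypothetical)
rng = np.random.default_rng(4)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(0, 0.3, size=(40, 2)) for c in centers])

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

Fixing `random_state` makes the comparison reproducible, since K-means depends on its random initialization.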

### Q3 (15pts). Perform K-means with the optimal k from above. Report the number of occurrences of each day of the week within each of the clusters. How would you interpret the clusters based on that? Visualize the aggregated hourly timeline over all the days within each cluster.
Create a dictionary of the cluster numbers corresponding to each day of the year, apply it to add a column `cluster` to the dataframe, and use a pivot table with the aggregation function `count` to collect the numbers above. Also use a pivot table to collect the total riders per hour of the day within each cluster for further visualization (after appropriate normalization by the grand total).
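
The dictionary-and-pivot step can be sketched as below. The cluster labels are invented for illustration (weekends placed in cluster 1); in the task they would come from the fitted K-means model, and the names `days`, `cluster_of_day` and `counts` are hypothetical.

```python
import pandas as pd

# Hypothetical frame: 14 consecutive days starting on a Monday
days = pd.DataFrame({"day": range(14)})
days["dow"] = days["day"] % 7                      # 0 = Monday ... 6 = Sunday

# Assumed labels for illustration: weekends -> cluster 1, weekdays -> cluster 0
cluster_of_day = {d: int(d % 7 >= 5) for d in range(14)}
days["cluster"] = days["day"].map(cluster_of_day)

# Count how many times each day of the week falls into each cluster
counts = pd.pivot_table(days, index="cluster", columns="dow",
                        values="day", aggfunc="count", fill_value=0)
print(counts)
```

Reading the table row by row shows which weekdays dominate each cluster, which is the basis for interpreting them.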