1. [__CRQ1__]: *Does the fare per mile change across NY's boroughs?* We want to discover whether the expenses of a user who takes a taxi in one zone differ from those of a user who takes one in another zone.
    * Considering the fare amount, we want to compute the price per mile ![equation](https://latex.codecogs.com/gif.latex?P) for each trip:
        - Compute the mean and the standard deviation of the variable. Then plot the distribution. What do you see?
        - Run a statistical test that checks whether the average price per mile in each borough is significantly different from the average price in New York.
        - Can you say that statistically significant differences in the averages hold among zones? In other words, are taxi trips in some boroughs, on average, more expensive than in others?
    * The price per mile might depend on the traffic the taxi finds on its way, so we try to mitigate this effect:
        - The duration of the trip likely says something about the city's congestion, especially when combined with the distance. It might be a good idea to weight the price per mile using the average time ![equation](https://latex.codecogs.com/gif.latex?T) needed to travel one mile. Thus, instead of ![equation](https://latex.codecogs.com/gif.latex?P), you can use ![equation](https://latex.codecogs.com/gif.latex?P^\prime&space;=&space;P&space;\cdot&space;T)
        - Compute the mean and the standard deviation of the new variable. Then plot the distribution. What do you see?
        - Run a statistical test that checks whether the average *weighted* price per mile in each borough is significantly different from the average price in New York.
        - Can you say that statistically significant differences in the averages hold among zones? In other words, are taxi trips in some boroughs, on average, more expensive than in others?
    * Compare the results obtained for the price per mile and the weighted price per mile. What do you think about that?
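
To make the two quantities concrete, here is a minimal sketch on three made-up trips (the numbers are invented for illustration and do not come from the taxi data). It reads T as the trip duration divided by the trip distance, which is one possible interpretation of "the average time needed to travel one mile"; the column names `fare`, `distance` and `duration_min` are hypothetical.

```python
import pandas as pd

# Made-up trips, for illustration only (not taken from the taxi data)
trips = pd.DataFrame({
    'fare': [10.0, 12.0, 30.0],          # dollars
    'distance': [2.0, 2.0, 10.0],        # miles
    'duration_min': [10.0, 20.0, 25.0],  # minutes
})

# P: price per mile
trips['P'] = trips['fare'] / trips['distance']
# T: minutes needed to travel one mile (assumed reading of T)
trips['T'] = trips['duration_min'] / trips['distance']
# P' = P * T: price per mile weighted by the time per mile
trips['P_prime'] = trips['P'] * trips['T']

print(trips[['P', 'T', 'P_prime']])
```

The second trip costs about the same per mile as the first but is twice as slow, so its weighted price is noticeably higher; the third trip is long and fast, so its weighted price is the lowest.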
   

## References
https://en.wikipedia.org/wiki/Taxicabs_of_New_York_City
http://cs229.stanford.edu/proj2016/report/AntoniadesFadaviFobaAmonJuniorNewYorkCityCabPricing-report.pdf
https://www.kaggle.com/selfishgene/yellow-cabs-tell-the-story-of-new-york-city

In [None]:
#Libraries
import pandas as pd
import scipy.stats as stat
from datetime import datetime
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

## Loading and cleanup

In [None]:
#columns needed from the raw trip files
usecols = ['trip_distance', 'fare_amount', 'PULocationID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime']
buffer = 100 #chunk size, to keep the memory usage low

chunks = []
#for every month
for month in ['01', '02', '03', '04', '05', '06']:
    #for every slice of the dataset (only the first 10000 rows of each month are read)
    for chunk in pd.read_csv("yellow_tripdata_2018-"+month+".csv", chunksize=buffer, nrows=10000, usecols=usecols, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']):
        chunks.append(chunk)

#concatenate once at the end: DataFrame.append was removed in pandas 2.0 and is slow inside a loop
df = pd.concat(chunks, ignore_index=True)
df = df.rename(columns={'trip_distance': 'Distance', 'fare_amount': 'Fare', 'PULocationID': 'LocationID', 'tpep_pickup_datetime': 'Pickup', 'tpep_dropoff_datetime': 'Dropoff'})

zone_lookup = pd.read_csv("taxi_zone_lookup.csv") #this dataset maps each LocationID to its Borough
df = df.merge(zone_lookup[['LocationID', 'Borough']], how='inner', on='LocationID') #inner join on the columns of interest
df['Borough'] = df['Borough'].fillna('Unknown') #label missing boroughs explicitly

#keep only trips with positive distance and fare, dated after 2017, that end after they start;
#the times are filtered together with the other columns so that every row stays aligned
valid = (df['Distance'] > 0) & (df['Fare'] > 0) & (df['Pickup'].dt.year > 2017) & (df['Dropoff'].dt.year > 2017) & (df['Dropoff'] > df['Pickup'])
df = df.loc[valid].drop(columns=['LocationID']).reset_index(drop=True)

#pickup and dropoff times, kept as separate series that share the index of df
putime = df.pop('Pickup')
dotime = df.pop('Dropoff')

df['$/mile'] = df['Fare'] / df['Distance'] #price per mile
df['Deltatime'] = (dotime - putime).dt.total_seconds() / 60 #trip duration in minutes (total_seconds also counts whole days)
df['$*minutes/mile'] = df['$/mile'] * df['Deltatime'] #price per mile weighted by the trip duration

In [None]:
#this guy paid 14 dollars to stay 24h inside a taxi
display(putime.loc[6611])
display(dotime.loc[6611])
display(df.loc[6611])
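
One possible source of durations that look like a full day is the way a timedelta is converted to minutes. The sketch below uses made-up timestamps (not records from the dataset) to show that `.dt.seconds` returns only the seconds component of the difference, so it ignores whole days and turns a slightly negative duration into almost 24 hours, while `.dt.total_seconds()` returns the full signed duration. Whether this explains the record above is an assumption that would need checking against the raw data.

```python
import pandas as pd

# Made-up pickup and dropoff times, for illustration only
pickup = pd.to_datetime(pd.Series([
    '2018-01-01 10:00:00',   # ordinary 30 minute trip
    '2018-01-01 10:00:00',   # trip lasting 1 day and 10 minutes
    '2018-01-01 10:05:00',   # dropoff recorded 5 minutes BEFORE pickup
]))
dropoff = pd.to_datetime(pd.Series([
    '2018-01-01 10:30:00',
    '2018-01-02 10:10:00',
    '2018-01-01 10:00:00',
]))
delta = dropoff - pickup

# only the seconds component (always between 0 and 86399)
minutes_component = delta.dt.seconds / 60
# the full signed duration
minutes_total = delta.dt.total_seconds() / 60

print(pd.DataFrame({'dt.seconds': minutes_component, 'total_seconds': minutes_total}))
```

With `.dt.seconds` the third trip appears to last 1435 minutes, which is just under 24 hours, even though the timestamps are only five minutes apart.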

## $\mu$ and $\sigma$ for the entire city

In [1]:
new_york_dollar_per_mile_mean = df['$/mile'].mean()
new_york_dollar_per_mile_std = df['$/mile'].std()


## Plotting the distribution of the random variable P = Fare_amount / mile

In [None]:
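#count how many trips fall on each price, rounded to the nearest dollar
counts = Counter(df['$/mile'].round(decimals=0))
labels, values = zip(*sorted(counts.items())) #sort by price so that the x axis is ordered

#place every bar at its own price instead of at its insertion order
plt.bar(labels, values, width=1)
plt.xlabel('$/mile (rounded)')
plt.ylabel('number of trips')
plt.show()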

## $\mu$ and $\sigma$ for every borough

In [None]:
mean = df.groupby('Borough')['$/mile'].mean() #group by borough and compute the mean of the column that I want to plot
std = df.groupby('Borough')['$/mile'].std() #same as above, but for the standard deviation

In [None]:
df2 = pd.DataFrame({'mean': mean, 'std': std}).T #combine the previous results: one row per statistic, one column per borough

In [None]:
df2

In [None]:
df2.plot(kind='bar')
df2.T.plot(kind='bar')

## t-test and p-value

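The samples are related: the trips of each borough are a subset of the trips of the whole city, so the independence assumption of `ttest_ind` does not strictly hold, and `ttest_rel` is not an option because it requires arrays of the same length. The cell below uses `ttest_ind` anyway, so its p-values should be read with caution.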
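
Two alternatives that avoid comparing a borough with a sample that contains it can be sketched as follows. This is not what the notebook does, only an illustration on synthetic data: `toy`, the boroughs `A` and `B`, and their prices are all made up. The first test (`scipy.stats.ttest_1samp`) compares each borough with the city mean treated as a fixed number, which ignores the sampling error of that mean; the second (Welch's t-test, `ttest_ind` with `equal_var=False`) compares each borough with the rest of the city, so the two samples do not overlap.

```python
import numpy as np
import pandas as pd
import scipy.stats as stat

# Synthetic prices per mile for two hypothetical boroughs
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Borough': ['A'] * 200 + ['B'] * 200,
    '$/mile': np.concatenate([rng.normal(5.0, 1.0, 200),
                              rng.normal(7.0, 1.0, 200)]),
})
city_mean = toy['$/mile'].mean()

results = {}
for borough, prices in toy.groupby('Borough')['$/mile']:
    # trips that do NOT start in this borough
    rest = toy.loc[toy['Borough'] != borough, '$/mile']
    # one-sample test against the city mean, taken as a constant
    one_sample = stat.ttest_1samp(prices, popmean=city_mean)
    # Welch's test: borough vs. rest of the city, unequal variances allowed
    welch = stat.ttest_ind(prices, rest, equal_var=False)
    results[borough] = {'one_sample_p': one_sample.pvalue,
                        'welch_p': welch.pvalue}

summary = pd.DataFrame(results)
print(summary)
```

On the real data the same loop could run over `df` instead of `toy`, since both use the `Borough` and `$/mile` column names.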

In [None]:
df3 = pd.DataFrame(columns=[el for el in set(df['Borough'])],  index=['ttest', 'pval']) #one column per borough, one row per result
for borough in set(df['Borough']):
    #compare the trips that start in this borough with all the trips in the city
    ttest_result = stat.ttest_ind(df.loc[df['Borough'] == borough, '$/mile'], df['$/mile'])
    df3[borough] = pd.Series([ttest_result[0], ttest_result[1]], index=['ttest', 'pval'])

In [None]:
x_labels = list(df3.columns) #fix the borough order once, so that ticks and values match
x = range(len(x_labels))
y = [df3.loc['ttest', i] for i in x_labels] #access by label instead of by position

markerline, stemlines, baseline = plt.stem(x,y, '-.')
plt.setp(baseline, color='orchid', linewidth=1)
plt.title('t statistic')
plt.xticks(x, x_labels, rotation='vertical')
plt.show()

y = [df3.loc['pval', i] for i in x_labels]

markerline, stemlines, baseline = plt.stem(x,y, '-.')
plt.setp(baseline, color='orchid', linewidth=1)
plt.title('p-value')
plt.xticks(x, x_labels, rotation='vertical')
plt.show()

In [None]:
df3
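
The per-borough tests above compare one borough at a time with the city. To ask whether the averages differ among boroughs taken together, one option is a one-way ANOVA (`scipy.stats.f_oneway`), optionally paired with the rank-based Kruskal-Wallis test (`scipy.stats.kruskal`), which does not assume normality. Neither test is used in this notebook; the sketch below runs them on synthetic groups `A`, `B` and `C` that are made up for illustration.

```python
import numpy as np
import scipy.stats as stat

# Synthetic prices per mile for three hypothetical boroughs;
# only C has a different mean
rng = np.random.default_rng(1)
groups = {
    'A': rng.normal(5.0, 1.0, 150),
    'B': rng.normal(5.0, 1.0, 150),
    'C': rng.normal(6.0, 1.0, 150),
}

# one-way ANOVA: null hypothesis = all group means are equal
f_stat, p_value = stat.f_oneway(*groups.values())
# Kruskal-Wallis: rank-based alternative to the ANOVA
h_stat, p_kruskal = stat.kruskal(*groups.values())

print('ANOVA   F = %.2f, p = %.3g' % (f_stat, p_value))
print('Kruskal H = %.2f, p = %.3g' % (h_stat, p_kruskal))
```

A small p-value only says that at least one group differs from the others; it does not say which one, so the per-borough results are still needed to read the outcome.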

# Second request

In [None]:
df = df.loc[df['Deltatime'] < 120] #keep only the trips shorter than two hours (Deltatime is in minutes)

In [None]:
new_york_dollar_per_mile_mean = df['$*minutes/mile'].mean()
new_york_dollar_per_mile_std = df['$*minutes/mile'].std()

In [None]:
mean = df.groupby('Borough')['$*minutes/mile'].mean() #group by borough and compute the mean of the column that I want to plot
std = df.groupby('Borough')['$*minutes/mile'].std() #same as above, but for the standard deviation

In [None]:
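#count how many trips fall on each weighted price, rounded to one decimal
counts = Counter(df['$*minutes/mile'].round(decimals=1))
labels, values = zip(*sorted(counts.items())) #sort by value so that the x axis is ordered

#place every bar at its own value instead of at its insertion order
plt.bar(labels, values, width=0.1)
plt.xlabel('$*minutes/mile (rounded)')
plt.ylabel('number of trips')
plt.show()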

In [None]:
df2 = pd.DataFrame({'mean': mean, 'std': std}).T #combine the previous results: one row per statistic, one column per borough

In [None]:
df2

In [None]:
df2.plot(kind='bar')
df2.T.plot(kind='bar')

In [None]:
df3 = pd.DataFrame(columns=[el for el in set(df['Borough'])],  index=['ttest', 'pval']) #one column per borough, one row per result
for borough in set(df['Borough']):
    #compare the trips that start in this borough with all the trips in the city
    ttest_result = stat.ttest_ind(df.loc[df['Borough'] == borough, '$*minutes/mile'], df['$*minutes/mile'])
    df3[borough] = pd.Series([ttest_result[0], ttest_result[1]], index=['ttest', 'pval'])

In [None]:
x_labels = list(df3.columns) #fix the borough order once, so that ticks and values match
x = range(len(x_labels))
y = [df3.loc['ttest', i] for i in x_labels] #access by label instead of by position

markerline, stemlines, baseline = plt.stem(x,y, '-.')
plt.setp(baseline, color='orchid', linewidth=1)
plt.title('t statistic')
plt.xticks(x, x_labels, rotation='vertical')
plt.show()

y = [df3.loc['pval', i] for i in x_labels]

markerline, stemlines, baseline = plt.stem(x,y, '-.')
plt.setp(baseline, color='orchid', linewidth=1)
plt.title('p-value')
plt.xticks(x, x_labels, rotation='vertical')
plt.show()