# Project Description: 

This project examines a dataset of taxi trips in New York. First, we describe the dataset itself: descriptive statistics, data types, shape, and so on. Next, we derive some interesting information from it, for example the most common districts for pickup and drop-off. Finally, we answer two questions with the help of statistics:
1. Does passenger group size affect the distance?
2. Do trip distances increase in weekends?

### Import Statements
These imports are required for the rest of the notebook to run.

In [None]:
import numpy as np
import pandas as pd
from collections import Counter
import reverse_geocoder as rg
from geopy.distance import geodesic
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import datetime


### Information about the data:

Descriptive statistics, correlations, covariances, shape, and data types are printed here.

In [None]:
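database = pd.read_csv("taxi-trips.csv")
# correlation and covariance are only defined for numeric columns, so select those first
numeric_columns = database.select_dtypes(include="number")
print("Descriptive statistics of the database: ", database.describe(include = "all"))
print("Correlations: ", numeric_columns.corr())
print("Covariance matrix: ", numeric_columns.cov())
print("Shape of the Database: ", database.shape)
print("Datatypes: ", database.dtypes)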

### Extracting the pickup and drop-off latitude and longitudes

The coordinate columns are copied into two small pandas DataFrames, one for pickup and one for drop-off, with latitude first and longitude second.

In [None]:

# col no 6-7 pickup longitude latitude
# col no 8-9 dropoff longitude latitude

pickup_lon = database['pickup_longitude']  # extract pickup longitude
pickup_lat = database['pickup_latitude']  # extract pickup latitude
pick_coor = pd.DataFrame(pickup_lat)  # create a new dataframe for latitude
pick_coor['pickup_longitude'] = pickup_lon  # add longitude to that dataframe, now we have pickup coordinates

# the same steps as for the pickup locations (above), now for the drop-off locations (below)
drop_lon = database['dropoff_longitude']
drop_lat = database['dropoff_latitude']
drop_coor = pd.DataFrame(drop_lat)
drop_coor['dropoff_longitude'] = drop_lon
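
As a possible shortcut, each coordinate table can also be selected in one step by passing a list of column names. The sketch below uses a tiny made-up DataFrame (the coordinate values are illustrative, not taken from taxi-trips.csv) and keeps latitude first, since the later steps work with (latitude, longitude) pairs.

```python
import pandas as pd

# Tiny stand-in for the taxi data; the values are made up for illustration.
sample = pd.DataFrame({
    "pickup_latitude": [40.75, 40.64],
    "pickup_longitude": [-73.99, -73.78],
    "dropoff_latitude": [40.76, 40.71],
    "dropoff_longitude": [-73.98, -74.01],
})

# Selecting with a list of column names returns a new two-column DataFrame.
pick_coor_sample = sample[["pickup_latitude", "pickup_longitude"]]
drop_coor_sample = sample[["dropoff_latitude", "dropoff_longitude"]]

# itertuples(index=False, name=None) yields plain (latitude, longitude) tuples.
pick_tuples = list(pick_coor_sample.itertuples(index=False, name=None))
print(pick_tuples)
```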


### Getting the name of locations for pickup/drop-off

Longitude and latitude values are given as coordinates; they need to be converted into place names using reverse geocoding.

### Calculating distances and adding them into Database

With the coordinates, we can calculate the distance in miles between each pickup and drop-off location and store it in the database.

In [None]:
# turning the dataframe rows into (latitude, longitude) tuples, since reverse_geocoder expects tuples
pick_coor = [tuple(x) for x in pick_coor.values]
drop_coor = [tuple(x) for x in drop_coor.values]

# running Reverse-Geocode and getting the results
pick_res = rg.search(pick_coor)
drop_res = rg.search(drop_coor)

# each result is a dictionary; the place name is stored under the 'name' key
pick_locs = [x['name'] for x in pick_res]
drop_locs = [x['name'] for x in drop_res]

# adding new columns to the database (pickup and dropoff locations)
database['pickup_district'] = pick_locs
database['dropoff_district'] = drop_locs

# getting the distances in miles via geopy and adding them to the dataframe
distances = [geodesic(pick, drop).miles for pick, drop in zip(pick_coor, drop_coor)]
database['distance'] = distances
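
geodesic computes each distance on an ellipsoidal model of the Earth. As a rough cross-check, the great-circle (haversine) distance can be computed with the standard library alone. This is a different, simpler technique than the one used in the notebook: it assumes a spherical Earth with a mean radius of 3958.8 miles (an assumption of this sketch), so its results can differ slightly from geodesic. The function name haversine_miles is hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # haversine formula
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# One degree of latitude is roughly 69 miles on this spherical model.
one_degree = haversine_miles(0.0, 0.0, 1.0, 0.0)
print(round(one_degree, 2))
```

If the two methods disagree by more than a fraction of a percent for a short city trip, the coordinate order (latitude first, then longitude) is the first thing to check.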


### Getting the 5 most popular districts for pickup/drop-off locations

This uses the most_common method of Counter from the collections module.

In [None]:


# getting the 5 most common pickup and drop-off districts from the database and storing them
c = Counter(database['pickup_district'])
most_common_pickup_Locations = c.most_common(5)
c = Counter(database['dropoff_district'])
most_common_dropoff_Locations = c.most_common(5)
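
To show what most_common returns, here is a minimal sketch on a short hand-written list. The district names are placeholders chosen for illustration, not results from the dataset.

```python
from collections import Counter

# Placeholder district names; not taken from the taxi data.
sample_districts = [
    "Manhattan", "Long Island City", "Manhattan",
    "Brooklyn", "Long Island City", "Manhattan",
]

# most_common(n) returns a list of (value, count) pairs, highest count first.
top_two = Counter(sample_districts).most_common(2)
print(top_two)
```

Each entry is a (district, count) tuple, so the district names alone can be read with a list comprehension such as [name for name, count in top_two].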


### Adding time_of_day column, also calculating average trip distances and durations on the way

The hour is extracted from the pickup_datetime column, and each trip is assigned to one of five time intervals (rush_hour_morning, afternoon, rush_hour_evening, evening, late_night). While looping over the rows, the distance and duration of each trip are also accumulated per interval. The average trip distance and duration for every interval are then computed with groupby and plotted.

In [None]:
# extracting travel durations
durations = database['trip_duration'].values

# extracting pickup times from the database and selecting the hour part only
time = database['pickup_datetime']
time = [int(t[11:13]) for t in time]


# assigning string values to specific time intervals, then adding those to the main data as a new column
# also accumulating total distances and trip durations for these time intervals
# (the lists below hold running sums; the averages are computed with groupby afterwards)
avg_distances = [0,0,0,0,0]
avg_durations = [0,0,0,0,0]
for x in range(len(time)):
    hour = time[x]
    if 7 <= hour <= 9:
        time[x] = "rush_hour_morning"
        avg_distances[0] += distances[x]
        avg_durations[0] += durations[x]
    elif 9 < hour <= 16:
        time[x] = "afternoon"
        avg_distances[1] += distances[x]
        avg_durations[1] += durations[x]
    elif 16 < hour <= 18:
        time[x] = "rush_hour_evening"
        avg_distances[2] += distances[x]
        avg_durations[2] += durations[x]
    elif 18 < hour <= 23:
        time[x] = "evening"
        avg_distances[3] += distances[x]
        avg_durations[3] += durations[x]
    else:
        time[x] = "late_night"
        avg_distances[4] += distances[x]
        avg_durations[4] += durations[x]

database['time_of_day'] = time


# select the column before taking the mean so that non-numeric columns are not averaged
avg_time = database.groupby(by='time_of_day')["distance"].mean()
print("X-axis symbolizes distance in miles")
plt.figure()
avg_time.plot(kind="barh")
plt.show()

avg_time = database.groupby(by='time_of_day')["trip_duration"].mean()
print("X-axis symbolizes duration")
plt.figure()
avg_time.plot(kind="barh")
plt.show()
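
The interval boundaries used in the loop can also be written as a small standalone function, which makes the edge cases (for example hour 9 or hour 16) easy to check. This is only a sketch: the function name time_of_day and the sample timestamp are made up, and the timestamp is assumed to follow the YYYY-MM-DD HH:MM:SS layout implied by the slicing in the notebook.

```python
def time_of_day(hour):
    """Map an hour (0-23) to the interval labels used in the notebook."""
    if 7 <= hour <= 9:
        return "rush_hour_morning"
    if 9 < hour <= 16:
        return "afternoon"
    if 16 < hour <= 18:
        return "rush_hour_evening"
    if 18 < hour <= 23:
        return "evening"
    # hours 0 to 6 fall through to here
    return "late_night"


sample_timestamp = "2016-03-14 17:24:55"  # made-up value for illustration
sample_hour = int(sample_timestamp[11:13])  # characters 11-12 hold the hour
print(sample_hour, time_of_day(sample_hour))
```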

### Answering the question "Does passenger group size affect the distance?"

The null hypothesis is that passenger group size does not affect the trip distance.
First, let's compare the mean distances of single-passenger and multi-passenger trips to get an idea; afterwards, a t-test will be applied to the two groups.

In [None]:
print("Mean distance when there is only 1 passenger in miles:",database['distance'][database['passenger_count']==1].mean())
print("Mean distance when there are more than 1 passenger in miles:",database['distance'][database['passenger_count']>1].mean())

one_pass_distance = database['distance'][database['passenger_count']==1]
multiple_pass_distance =database['distance'][database['passenger_count']>1]

ax = sns.kdeplot(one_pass_distance, shade=True)
sns.kdeplot(multiple_pass_distance, ax=ax, shade=True)
plt.show()

# equal_var=False runs Welch's t-test; the result holds both the t statistic and the p-value
t_result = stats.ttest_ind(a=one_pass_distance, b=multiple_pass_distance, equal_var=False)
print(t_result)
print("The p-value is smaller than 0.05, so obtaining this result by chance is very unlikely and there is a significant difference between the two groups; we therefore reject the null hypothesis.")


### Result
As explained in the print statement in the code, the p-value was lower than 0.05. This means there is a difference between the two groups (single-passenger vs. multi-passenger travel distance), and the probability of observing such a difference by chance is very low, so the number of passengers most likely affects the trip distance.
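
Because equal_var=False is passed, ttest_ind performs Welch's t-test, which does not assume equal variances in the two groups. The statistic behind it can be sketched with the standard library to make the calculation concrete. The function name welch_t and the two short samples are made up for illustration; the sketch returns only the t statistic and the approximate degrees of freedom, not the p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # squared standard error of each sample mean (sample variance / sample size)
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Two tiny made-up samples, only to exercise the formula.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, dof)
```

A t statistic far from zero, relative to the degrees of freedom, corresponds to a small p-value.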

### Answering the question "Do trip distances increase in weekends?"

The null hypothesis is that trip distances do not differ between weekdays and weekends.
First, let's get the mean travel distances for weekdays and for weekends to have an idea; afterwards, a t-test will be applied to the two groups.

To inspect this further, a function is needed that converts the weekday number of each pickup date into the name of the day, for example 1 : Monday.

In [None]:

def timefunc(timestamp):
    # ISO weekday numbers: 1 is Monday, 7 is Sunday
    day_names = { 1 : 'Monday' ,2 : 'Tuesday' ,3 : 'Wednesday' ,4 : 'Thursday' ,5 : 'Friday' ,6 : 'Saturday' ,7 : 'Sunday' }
    str_input = str(timestamp)
    # slice year, month and day out of the date part of the timestamp
    year = int(str_input[0:4])
    month = int(str_input[5:7])
    day = int(str_input[8:10])
    return day_names[datetime.datetime(year, month, day).isoweekday()]

# the new column holds the name of the pickup day
database['date_time'] = database['pickup_datetime'].apply(timefunc)
weekend=database['distance'][(database['date_time']=="Saturday") | (database['date_time']=="Sunday") ]
weekday=database['distance'][(database['date_time']!="Saturday") & (database['date_time']!="Sunday") ]

print("Mean distance when day is weekend in miles:", weekend.mean())
print("Mean distance when day is weekday in miles:", weekday.mean())

ax = sns.kdeplot(weekend.rename("Weekend"), shade=True)
sns.kdeplot(weekday.rename("Weekday"), ax=ax, shade=True)
plt.show()

p_value = stats.ttest_ind(a=weekend, b=weekday, equal_var=False)
print(p_value)
result2="It can be seen that the p-value is 2.092414433069292e-08, which is smaller than 0.05. There is a significant difference between weekend and weekday trip distances, so we reject the null hypothesis and accept the alternative hypothesis."
print()
print(result2)
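
As an alternative to slicing the string by hand, the weekend check can be sketched with datetime.strptime from the standard library. The function name is_weekend and the sample timestamps are made up, and the format string assumes the YYYY-MM-DD HH:MM:SS layout implied by the slicing in timefunc.

```python
from datetime import datetime


def is_weekend(timestamp):
    """Return True if the timestamp falls on a Saturday or Sunday."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # isoweekday() numbers Monday as 1 and Sunday as 7
    return parsed.isoweekday() >= 6


# Made-up timestamps: a Saturday and the following Monday.
saturday_trip = is_weekend("2016-03-12 09:00:00")
monday_trip = is_weekend("2016-03-14 17:24:55")
print(saturday_trip, monday_trip)
```

A boolean column built this way could replace the two string comparisons used to split the trips into weekend and weekday groups.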


### Result
As explained in the print statement in the code, the p-value was lower than 0.05. This means there is a difference between the two groups (weekend vs. weekday travel distance), and the probability of observing such a difference by chance is very low, so whether a trip takes place on a weekend most likely affects the trip distance.

### Saving the new database into a csv file
###### Because that's the right thing to do :)

In [None]:


database.to_csv("newTaxiDatabase.csv")
