## EDA 
___
Exploratory data analysis for taxi rides in Chicago.

An examination of cab rides in Chicago from November 2017. Does weather play a role in ride duration?

Steps:<br>
* Load the data files
* import necessary libraries:<br>
    datetime simplifies the handling of datetime format values<br>
    pandas is a powerful tool for working with tabular data<br>
    plotly express is used to create customizable plots<br>
    scipy stats contains the statistical functions<br>
    
* The data should be cleaned, but you never know! Examine the data for necessary edits: look for missing data, incorrect data types, etc.
* Explore the data: develop a grasp of the contents, look for patterns, and make observations

Specific requests:<br>
* identify the top 10 neighborhoods in terms of drop-offs
* make graphs: taxi companies and number of rides, top 10 neighborhoods by number of drop-offs

* Hypothesis testing: <br>
Does the average duration of rides from the Loop to O'Hare International Airport change on rainy Saturdays?<br>
Null hypothesis: There is no significant difference in ride duration between rainy and non-rainy days.<br>
Alternative hypothesis: Inclement weather will increase ride durations.

Steps:
* Filter for trips on Saturdays.
* Create separate samples for each weather condition to compare.
* Determine whether or not the samples can be considered normal.
* Determine if the samples have different variances with the Levene test, part of scipy stats.
* Compare the samples with a t-test and interpret the results.

Information about the data files<br>
***company_trips***<br>
company_name: taxi company name<br>
trips_amount: the number of rides for each taxi company on November 15-16, 2017.<br>

***drop_off***<br>
dropoff_location_name: Chicago neighborhoods where rides ended<br>
average_trips: the average number of rides that ended in each neighborhood in November 2017.<br>

***weather***<br>
start_ts: pickup date and time<br>
weather_conditions: weather conditions at the moment the ride started<br>
duration_seconds: ride duration in seconds<br>


In [127]:
# load libraries
from datetime import datetime
import pandas as pd
import plotly.express as px
from scipy import stats as st

In [None]:
# import data
company_trips = pd.read_csv('../company.csv')
drop_off = pd.read_csv('../drop_off.csv')
weather = pd.read_csv('../weather.csv')

### Explore Data
___
#### company_trips data

In [None]:
# take a look at the company_trips data
print(company_trips.info())
company_trips.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB
None


Unnamed: 0,company_name,trips_amount
0,Flash Cab,19558
1,Taxi Affiliation Services,11422
2,Medallion Leasin,10367
3,Yellow Cab,9888
4,Taxi Affiliation Service Yellow,9299


In [45]:
# explore company names
print(f"There are {len(company_trips)} rows in the table, and "
      f"{company_trips.company_name.nunique()} unique company names")
company_trips.company_name.unique()

There are 64 rows in the table, and 64 unique company names


array(['Flash Cab', 'Taxi Affiliation Services', 'Medallion Leasin',
       'Yellow Cab', 'Taxi Affiliation Service Yellow',
       'Chicago Carriage Cab Corp', 'City Service', 'Sun Taxi',
       'Star North Management LLC', 'Blue Ribbon Taxi Association Inc.',
       'Choice Taxi Association', 'Globe Taxi',
       'Dispatch Taxi Affiliation', 'Nova Taxi Affiliation Llc',
       'Patriot Taxi Dba Peace Taxi Associat', 'Checker Taxi Affiliation',
       'Blue Diamond', 'Chicago Medallion Management', '24 Seven Taxi',
       'Chicago Medallion Leasing INC', 'Checker Taxi', 'American United',
       'Chicago Independents', 'KOAM Taxi Association', 'Chicago Taxicab',
       'Top Cab Affiliation', 'Gold Coast Taxi',
       'Service Taxi Association', '5 Star Taxi', '303 Taxi',
       'Setare Inc', 'American United Taxi Affiliation', 'Leonard Cab Co',
       'Metro Jet Taxi A', 'Norshore Cab', '6742 - 83735 Tasha ride inc',
       '3591 - 63480 Chuks Cab', '1469 - 64126 Omar Jada',
       '6

In [None]:
# explore trip counts
company_trips.describe()

Unnamed: 0,trips_amount
count,64.0
mean,2145.484375
std,3812.310186
min,2.0
25%,20.75
50%,178.5
75%,2106.5
max,19558.0


thoughts
___
We were informed that the data contains two columns: the company name and the number of trips each company made on Nov. 15-16, 2017.
* No missing values
* Data types look good: string=object and numeric=int64
* The naming convention for company names isn't my favorite, but it might not matter
* Each entry is unique
* Trip amounts range from 2 to 19,558. The standard deviation (about 3,812) is greater than the mean (about 2,145), so there are definitely outliers.
* The mean is actually larger than the 75th percentile (2,106.5), which means the distribution is seriously right-skewed. Good for them, though! So good at business that you become an outlier.
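
The skew observation above can be quantified. The following is a minimal sketch, assuming a hypothetical helper `skew_summary` and a made-up `toy_trips` series (not the real `company_trips` data); it compares the mean with the 75th percentile, reports pandas' sample skewness, and counts values above Tukey's upper fence.

```python
import pandas as pd


def skew_summary(values: pd.Series) -> dict:
    """Summarize how right-skewed a numeric column is (hypothetical helper)."""
    q1, q3 = values.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)  # Tukey's rule for high outliers
    return {
        "mean": values.mean(),
        "median": values.median(),
        "q3": q3,
        "mean_above_q3": bool(values.mean() > q3),
        "skew": values.skew(),
        "n_high_outliers": int((values > upper_fence).sum()),
    }


# Made-up trip counts for illustration only, not the real company_trips data
toy_trips = pd.Series([2, 5, 10, 20, 50, 200, 2000, 20000])
summary = skew_summary(toy_trips)
print(summary)
```

In the notebook, the same helper could be called as `skew_summary(company_trips['trips_amount'])`.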


In [49]:
# what does the top end of trips look like?
company_trips[(company_trips['trips_amount'])
              > (company_trips['trips_amount'].mean())]

Unnamed: 0,company_name,trips_amount
0,Flash Cab,19558
1,Taxi Affiliation Services,11422
2,Medallion Leasin,10367
3,Yellow Cab,9888
4,Taxi Affiliation Service Yellow,9299
5,Chicago Carriage Cab Corp,9181
6,City Service,8448
7,Sun Taxi,7701
8,Star North Management LLC,7455
9,Blue Ribbon Taxi Association Inc.,5953


#### Drop off data
___

In [None]:
# explore drop_off
print(drop_off.info())
drop_off.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.6+ KB
None


Unnamed: 0,dropoff_location_name,average_trips
0,Loop,10727.466667
1,River North,9523.666667
2,Streeterville,6664.666667
3,West Loop,5163.666667
4,O'Hare,2546.9


In [50]:
drop_off.describe()

Unnamed: 0,average_trips
count,94.0
mean,599.953728
std,1714.591098
min,1.8
25%,14.266667
50%,52.016667
75%,298.858333
max,10727.466667


In [51]:
print(drop_off.dropoff_location_name.nunique())
drop_off.dropoff_location_name.unique()

94


array(['Loop', 'River North', 'Streeterville', 'West Loop', "O'Hare",
       'Lake View', 'Grant Park', 'Museum Campus', 'Gold Coast',
       'Sheffield & DePaul', 'Lincoln Park', 'East Village',
       'Little Italy, UIC', 'Uptown', 'Near South Side', 'Garfield Ridge',
       'Logan Square', 'Edgewater', 'West Town', 'Old Town',
       'Rush & Division', 'North Center', 'Lincoln Square', 'Rogers Park',
       'West Ridge', 'Irving Park', 'Hyde Park', 'Avondale',
       'Wicker Park', 'Albany Park', 'United Center', 'Lower West Side',
       'Douglas', 'Portage Park', 'Humboldt Park', 'Norwood Park',
       'Kenwood', 'Bridgeport', 'Armour Square', 'Jefferson Park',
       'Bucktown', 'North Park', 'Garfield Park', 'Mckinley Park',
       'Belmont Cragin', 'Boystown', 'Chinatown', 'Grand Boulevard',
       'Austin', 'Sauganash,Forest Glen', 'South Shore', 'Woodlawn',
       'Little Village', 'Jackson Park', 'North Lawndale', 'Dunning',
       'Ukrainian Village', 'Hermosa', 'Englewood'

thoughts
___
* zero missing
* data types look fine
* why is trip amount an average?
* all 94 are different, that's great
* neighborhood trip data is even more skewed than the company trip data, mean: 599, 75%: 298

In [None]:
# identify the top 10 neighborhoods in terms of drop-offs

# order by average_trips, descending, and keep the sorted result
drop_off = drop_off.sort_values(by='average_trips', ascending=False)
drop_off.head(10)

Unnamed: 0,dropoff_location_name,average_trips
0,Loop,10727.466667
1,River North,9523.666667
2,Streeterville,6664.666667
3,West Loop,5163.666667
4,O'Hare,2546.9
5,Lake View,2420.966667
6,Grant Park,2068.533333
7,Museum Campus,1510.0
8,Gold Coast,1364.233333
9,Sheffield & DePaul,1259.766667
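
As an alternative to sorting and then taking the head, pandas offers `DataFrame.nlargest`, which does not depend on the file already being ordered. A minimal sketch, assuming a small frame named `drop_off_sample` built from six of the rows shown above and deliberately shuffled:

```python
import pandas as pd

# Six rows copied from the output above, shuffled on purpose
drop_off_sample = pd.DataFrame({
    "dropoff_location_name": ["O'Hare", "Loop", "Lake View",
                              "Streeterville", "River North", "West Loop"],
    "average_trips": [2546.9, 10727.466667, 2420.966667,
                      6664.666667, 9523.666667, 5163.666667],
})

# nlargest returns the n rows with the highest values, already ordered
top_3 = drop_off_sample.nlargest(3, "average_trips")
print(top_3)
```

On the full table, `drop_off.nlargest(10, 'average_trips')` would give the top 10 regardless of the input order.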


#### Weather data
___

In [79]:
print(weather.info())
weather.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_ts            1068 non-null   object 
 1   weather_conditions  1068 non-null   object 
 2   duration_seconds    1068 non-null   float64
dtypes: float64(1), object(2)
memory usage: 25.2+ KB
None


Unnamed: 0,start_ts,weather_conditions,duration_seconds
0,2017-11-25 16:00:00,Good,2410.0
1,2017-11-25 14:00:00,Good,1920.0
2,2017-11-25 12:00:00,Good,1543.0
3,2017-11-04 10:00:00,Good,2512.0
4,2017-11-11 07:00:00,Good,1440.0


In [None]:
# what are the weather options?
weather.weather_conditions.unique()

array(['Good', 'Bad'], dtype=object)

In [None]:
# ride durations
weather.describe()

Unnamed: 0,duration_seconds
count,1068.0
mean,2071.731273
std,769.461125
min,0.0
25%,1438.25
50%,1980.0
75%,2580.0
max,7440.0


thoughts
___
* date isn't datetime
* 0 second rides?
* need to know which timestamps fall on a Saturday

In [84]:
# what about those 0 sec rides?
weather[weather['duration_seconds'] == 0]

Unnamed: 0,start_ts,weather_conditions,duration_seconds
163,2017-11-11 09:00:00,Good,0.0
168,2017-11-11 07:00:00,Good,0.0
204,2017-11-18 19:00:00,Good,0.0
552,2017-11-04 01:00:00,Good,0.0
801,2017-11-04 09:00:00,Good,0.0
1063,2017-11-25 11:00:00,Good,0.0


In [93]:
# less than 10min rides?
weather[weather['duration_seconds'] < 600]

Unnamed: 0,start_ts,weather_conditions,duration_seconds
15,2017-11-25 13:00:00,Good,60.0
203,2017-11-18 00:00:00,Bad,480.0
424,2017-11-11 13:00:00,Good,420.0
860,2017-11-04 18:00:00,Bad,480.0


I am going to drop the zero second rides. The rest seem real.

In [92]:
# drop the zero-second rides (copy so later column assignments don't touch a view)
weather = weather[weather['duration_seconds'] > 0].copy()

In [94]:
# which timestamps are Saturdays?

# convert 'start_ts' to datetime objects
weather['start_ts'] = pd.to_datetime(weather['start_ts'])

# flag Saturdays (weekday 5, since Monday is 0)
weather['is_saturday'] = weather['start_ts'].dt.weekday == 5

In [None]:
# how much saturday data do we have?
weather['is_saturday'].sum()

1062

thoughts
___
Oh, maybe I missed something: all of the remaining 1,062 rows are Saturdays.
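
A quick way to double-check that conclusion is to count rides per weekday name. This is a small sketch that uses only the five `start_ts` values printed by `weather.head()` earlier; the name `sample_ts` is an assumption for illustration.

```python
import pandas as pd

# The five start_ts values shown in the weather.head() output
sample_ts = pd.Series([
    "2017-11-25 16:00:00",
    "2017-11-25 14:00:00",
    "2017-11-25 12:00:00",
    "2017-11-04 10:00:00",
    "2017-11-11 07:00:00",
])

# day_name() turns each timestamp into its weekday name
day_counts = pd.to_datetime(sample_ts).dt.day_name().value_counts()
print(day_counts)
```

Running `weather['start_ts'].dt.day_name().value_counts()` on the full table would show whether any other weekday is present.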

In [None]:
# sample sizes
weather.groupby('weather_conditions')['duration_seconds'].count()

weather_conditions
Bad     180
Good    882
Name: duration_seconds, dtype: int64

### Graphs
___
#### Taxi companies and their number of rides
* We know there is a large disparity in the distribution.

In [56]:
# review the data
company_trips.head()

Unnamed: 0,company_name,trips_amount
0,Flash Cab,19558
1,Taxi Affiliation Services,11422
2,Medallion Leasin,10367
3,Yellow Cab,9888
4,Taxi Affiliation Service Yellow,9299


In [59]:
# All 64 companies
fig = px.bar(company_trips,
             x='company_name',
             y='trips_amount',
             title='Ride totals for Chicago taxi companies (Nov.15-16th 2017)')

fig.update_layout(
    xaxis_title="Taxi Company",
    yaxis_title="Ride Total"
)
fig.show()

We knew that was going to be a tad messy, so here it is in two parts

In [None]:
# top 32 companies
fig = px.bar(company_trips.iloc[:32,],
             x='company_name',
             y='trips_amount',
             title='Ride totals for the top Chicago taxi companies (Nov.15-16th 2017)')

fig.update_layout(
    xaxis_title="Taxi Company",
    yaxis_title="Ride Total"
)
fig.show()

In [None]:
# bottom 32 companies
fig = px.bar(company_trips.iloc[32:,],
             x='company_name',
             y='trips_amount',
             title='Ride totals for the bottom Chicago taxi companies (Nov.15-16th 2017)')

fig.update_layout(
    xaxis_title="Taxi Company",
    yaxis_title="Ride Total"
)
fig.show()

In [72]:
# how many bottom-ranked companies does it take to exceed Flash Cab's ride total?
flash_total = company_trips.iloc[0]['trips_amount']
for company in range(len(company_trips), 0, -1):
    if flash_total < company_trips.iloc[company:]['trips_amount'].sum():
        n_bottom = len(company_trips) - company
        print(f"It takes the bottom {n_bottom} companies to make up Flash Cab production")
        break

It takes the bottom 49 companies to make up Flash Cab production
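
The same question can be answered without a loop by taking a running total from the smallest company upward. The sketch below assumes a hypothetical helper `bottom_n_to_exceed_top` and made-up numbers in `toy_trips`; it is one possible vectorized version, not the method used in the cell above.

```python
from typing import Optional

import numpy as np
import pandas as pd


def bottom_n_to_exceed_top(trips: pd.Series) -> Optional[int]:
    """Smallest number of bottom-ranked values whose sum exceeds the top value."""
    ordered = trips.sort_values(ascending=False).to_numpy()
    top = ordered[0]
    # Running total from the smallest value upward, never including the top one
    running_total = np.cumsum(ordered[:0:-1])
    exceeding = np.flatnonzero(running_total > top)
    return int(exceeding[0]) + 1 if exceeding.size else None


# Made-up ride totals for illustration only
toy_trips = pd.Series([40, 30, 20, 15, 10, 5])
n_needed = bottom_n_to_exceed_top(toy_trips)
print(n_needed)
```

The helper returns `None` when even all of the other values together do not exceed the top one.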


thoughts
___
* Flash Cab has such a large gap over the next company that I wouldn't be surprised if they have exclusive airport rights or something. They might simply have the most cabs.

#### What are the top drop-off locations?

In [73]:
# top 10 locations bar chart
fig = px.bar(drop_off.iloc[:10,],
             x='dropoff_location_name',
             y='average_trips',
             title='Top 10 drop off locations in Chicago (November 2017 average)')
fig.update_layout(
    xaxis_title="Drop off neighborhood",
    yaxis_title="Average rides"
)
fig.show()

thoughts
___
* From the internet: 'Loop', 'River North', 'Streeterville', and 'West Loop' are all right next to each other downtown.
* O'Hare is an airport inside city limits. (Midway is inside city limits too, on the southwest side.)
* Downtown Chicago (from an outsider's viewpoint) is by far the most popular destination

In [126]:
# comparing the distributions of both populations

# good weather plot
good = px.histogram(weather[weather['weather_conditions'] == 'Good'],
                    x='duration_seconds',
                    nbins=80,
                    color_discrete_sequence=['green'])

# bad weather plot
bad = px.histogram(weather[weather['weather_conditions'] == 'Bad'],
                   x='duration_seconds',
                   nbins=40,
                   color_discrete_sequence=['blue'])

# combine the histograms into one figure
good.add_traces(bad.data)

# add the title, axis labels, legend title, and overlay mode
good.update_layout(
    title="Trip duration distribution by weather conditions",
    xaxis_title="Trip duration (Seconds)",
    yaxis_title="Frequency",
    legend_title="Weather Conditions",
    barmode='overlay'
)


good.show()

Both samples are large enough that, by the central limit theorem, the sample means can be treated as approximately normal even though the raw durations are not. The ride durations appear to be bimodal: people ride either about 20 minutes (roughly 1,300 seconds) or about 40 minutes.

#### Hypothesis testing
____
Does rainy weather have an effect on the duration of taxi rides on Saturdays in November?

Null hypothesis: There is no significant difference in ride duration between rainy and non-rainy days.<br>
Alternative hypothesis: Rainy weather will affect the duration of rides.

Alpha (the significance level the p-value is compared against) will be set to 0.05 due to imbalanced samples.

The samples are large enough for their means to be treated as normally distributed, and the relative shapes of the histograms match.
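
The "large enough" argument leans on the central limit theorem. One way to see it in action is to bootstrap the sample mean and check that its distribution is close to symmetric even when the raw values have two peaks. This is only a sketch: `durations` is synthetic data generated here, not the real `weather` table, and the peak locations and spreads are assumptions.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(42)

# Synthetic two-peaked durations in seconds; NOT the real weather data
durations = np.concatenate([
    rng.normal(1300, 200, size=90),
    rng.normal(2500, 400, size=90),
])

# Bootstrap: resample with replacement and record each resample's mean
boot_means = np.array([
    rng.choice(durations, size=durations.size, replace=True).mean()
    for _ in range(2000)
])

print("skew of raw durations:", st.skew(durations))
print("skew of bootstrap means:", st.skew(boot_means))
```

A skewness near zero for the bootstrap means is consistent with treating the mean as approximately normal; it is a visual and numeric sanity check rather than a formal normality test.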

In [131]:
# set up samples
good_weather = weather[weather['weather_conditions'] == 'Good']
bad_weather = weather[weather['weather_conditions'] == 'Bad']

In [134]:
good_weather.head()

Unnamed: 0,start_ts,weather_conditions,duration_seconds,is_saturday
0,2017-11-25 16:00:00,Good,2410.0,True
1,2017-11-25 14:00:00,Good,1920.0,True
2,2017-11-25 12:00:00,Good,1543.0,True
3,2017-11-04 10:00:00,Good,2512.0,True
4,2017-11-11 07:00:00,Good,1440.0,True


In [135]:
# check for difference in variance

p_value_levene = st.levene(
    good_weather['duration_seconds'], bad_weather['duration_seconds']).pvalue
print(p_value_levene)

0.6687312920630069


The Levene p-value is greater than the alpha of 0.05, so we cannot reject the null hypothesis of equal variances. We therefore treat the variances as equal.
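
The decision can also be wired directly into the t-test call so that `equal_var` follows from the Levene result. The sketch below assumes a hypothetical helper `compare_means` and synthetic samples that only borrow the group sizes (882 and 180) from the notebook; the values themselves are made up.

```python
import numpy as np
from scipy import stats as st

ALPHA = 0.05


def compare_means(sample_a, sample_b, alpha=ALPHA):
    """Run Levene's test, then a t-test with equal_var chosen from its result."""
    levene_p = st.levene(sample_a, sample_b).pvalue
    # Failing to reject Levene's null means we treat the variances as equal
    equal_var = bool(levene_p >= alpha)
    # With equal_var=False, ttest_ind performs Welch's t-test
    result = st.ttest_ind(sample_a, sample_b, equal_var=equal_var)
    return {"levene_p": levene_p, "equal_var": equal_var,
            "t": result.statistic, "p": result.pvalue}


rng = np.random.default_rng(0)
# Synthetic durations in seconds; only the sample sizes mirror the notebook
good = rng.normal(2000, 750, size=882)
bad = rng.normal(2400, 750, size=180)
outcome = compare_means(good, bad)
print(outcome)
```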

In [138]:
# compare the sample means with ttest_ind

# find the p-value for the difference in means
t_statistic, p_value = st.ttest_ind(good_weather['duration_seconds'],
                                    bad_weather['duration_seconds'], equal_var=True)

print("p value:", p_value)
print("t statistic:", t_statistic)

p value: 1.3318772977743245e-11
t statistic: -6.8404589322166425


Oh, that is tiny! We can confidently reject the null hypothesis. The t statistic is negative because the good-weather sample was passed first, which means the bad-weather trips lasted longer on average.
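
If the question is specifically whether bad weather makes rides longer, a one-sided test is an option. The sketch below uses synthetic samples (made-up values, with only the group sizes borrowed from the notebook) and the `alternative` parameter of `ttest_ind`, which assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(1)
# Synthetic durations in seconds; NOT the real weather data
good = rng.normal(2000, 750, size=882)
bad = rng.normal(2400, 750, size=180)

# Two-sided test: H1 is that the means differ in either direction
two_sided = st.ttest_ind(good, bad, equal_var=True)

# One-sided test: H1 is mean(good) < mean(bad), i.e. bad-weather rides are longer
one_sided = st.ttest_ind(good, bad, equal_var=True, alternative="less")

print("two-sided p:", two_sided.pvalue)
print("one-sided p:", one_sided.pvalue)
```

When the t statistic already points in the hypothesized direction, the one-sided p-value is half the two-sided one, so the conclusion here would not change.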

In [141]:
# what are the actual averages
print(f"good weather trip average: {good_weather['duration_seconds'].mean()}")
print(f"bad weather trip average: {bad_weather['duration_seconds'].mean()}")

good weather trip average: 2013.2789115646258
bad weather trip average: 2427.2055555555557


### Conclusion
___
Weather conditions have an impact on average taxi ride durations in the city of Chicago on Saturdays in November. That is a rather specific situation; however, that was the data we were able to use.

The heart of 'downtown' is the most popular drop off location and the distribution of rides per company is unbalanced.

It CANNOT be determined from the data why the rides are longer, only that there is a difference that is highly unlikely to come from chance.

Bad weather trips are on average ~414 seconds longer, that's nearly 7 mins. 

There appears to be some bimodality in the ride durations. Both good weather and bad weather had similar peaks. People ride either about 20 minutes or about 40 minutes.

Our p-value was: 1.3318772977743245e-11<br>
That is tiny! So tiny that the difference in means can confidently be assessed as real. We reject the null hypothesis in favor of the alternative. If there were truly no difference, a result at least this extreme would be about 1000 times less likely than me winning the lottery (1 in 100,000,000).

notes:<br>
Alpha (the significance level) was set to 0.05 due to imbalanced samples across weather conditions. Good job, Chicago weather!<br>
The samples are large enough for their means to be treated as normally distributed, and the relative shapes of the histograms match.