# Zuber Rides Analysis

This project uses data from Zuber, a new ride share company launching in Chicago that wants to understand the local market. The analysis is based on three datasets: 1) 64 ride share companies and their total number of trips, 2) 94 neighborhoods and their average number of trips, and 3) 1,068 ride share trips with their start time, weather, and duration. The data were initially retrieved from the web and a large database using HTML parsing and SQL queries, respectively, saved as CSV files, and then loaded into Python for analysis. The project contains 4 parts:

1. Import the HTML-parsed and SQL-queried CSV files into Python
2. Exploratory data analysis and data visualizations
3. Hypothesis tests
4. Conclusion and business application

## Import libraries and CSVs

The data are read from CSV files and inspected for data types and missing values.

In [32]:
# Import libraries
import pandas as pd
import plotly.express as px
import scipy.stats as st
import researchpy as rp
import kaleido 

In [18]:
# Read in data from files
url_company = 'https://raw.githubusercontent.com/kellyshreeve/Zuber_Rides_Analysis/main/data/sql_result_01.csv'
url_neighborhood = 'https://raw.githubusercontent.com/kellyshreeve/Zuber_Rides_Analysis/main/data/sql_result_04.csv'
url_weather = 'https://raw.githubusercontent.com/kellyshreeve/Zuber_Rides_Analysis/main/data/sql_result_07.csv'

company = pd.read_csv(url_company)
neighborhood = pd.read_csv(url_neighborhood)
weather = pd.read_csv(url_weather)

In [19]:
# Print info on each csv
display(company.info())
display(neighborhood.info())
display(weather.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.6+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_ts            1068 non-null   object 
 1   weather_conditions  1068 non-null   object 
 2   duration_seconds    1068 non-null   float64
dtypes: float64(1), object(2)
memory usage: 25.2+ KB


None

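The data types are suitable for this analysis. Note that `start_ts` is read as a string (`object`) and would need to be converted to a datetime before any date arithmetic. There are no missing values.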
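The missing-value and data type checks can also be made explicit. The following is a minimal sketch, assuming a small hand-built frame (`weather_sample`, a hypothetical name) that copies three rows from the weather data rather than the full file. It counts missing values per column and shows one way to convert `start_ts` to a datetime if date operations are needed later.

```python
import pandas as pd

# Three rows copied from the weather data; weather_sample is a
# hypothetical name used only for this illustration
weather_sample = pd.DataFrame({
    'start_ts': ['2017-11-25 16:00:00', '2017-11-04 16:00:00', '2017-11-11 07:00:00'],
    'weather_conditions': ['Good', 'Bad', 'Good'],
    'duration_seconds': [2410.0, 2969.0, 1440.0],
})

# Count missing values in each column
missing_per_column = weather_sample.isna().sum()
print(missing_per_column)

# Optional: parse the timestamp strings into datetimes
weather_sample['start_ts'] = pd.to_datetime(weather_sample['start_ts'])

# With a datetime column, the day of the week can be checked directly
day_names = weather_sample['start_ts'].dt.day_name().unique().tolist()
print(day_names)
```

The same two calls (`isna().sum()` and `pd.to_datetime`) should apply unchanged to the full `weather` data frame.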

In [20]:
# Print sample of rows for each data frame
display(company.head(10))
display(neighborhood.head(10))
display(weather.head(10))

| | company_name | trips_amount |
|---|---|---|
| 0 | Flash Cab | 19558 |
| 1 | Taxi Affiliation Services | 11422 |
| 2 | Medallion Leasin | 10367 |
| 3 | Yellow Cab | 9888 |
| 4 | Taxi Affiliation Service Yellow | 9299 |
| 5 | Chicago Carriage Cab Corp | 9181 |
| 6 | City Service | 8448 |
| 7 | Sun Taxi | 7701 |
| 8 | Star North Management LLC | 7455 |
| 9 | Blue Ribbon Taxi Association Inc. | 5953 |


| | dropoff_location_name | average_trips |
|---|---|---|
| 0 | Loop | 10727.466667 |
| 1 | River North | 9523.666667 |
| 2 | Streeterville | 6664.666667 |
| 3 | West Loop | 5163.666667 |
| 4 | O'Hare | 2546.9 |
| 5 | Lake View | 2420.966667 |
| 6 | Grant Park | 2068.533333 |
| 7 | Museum Campus | 1510.0 |
| 8 | Gold Coast | 1364.233333 |
| 9 | Sheffield & DePaul | 1259.766667 |


| | start_ts | weather_conditions | duration_seconds |
|---|---|---|---|
| 0 | 2017-11-25 16:00:00 | Good | 2410.0 |
| 1 | 2017-11-25 14:00:00 | Good | 1920.0 |
| 2 | 2017-11-25 12:00:00 | Good | 1543.0 |
| 3 | 2017-11-04 10:00:00 | Good | 2512.0 |
| 4 | 2017-11-11 07:00:00 | Good | 1440.0 |
| 5 | 2017-11-11 04:00:00 | Good | 1320.0 |
| 6 | 2017-11-04 16:00:00 | Bad | 2969.0 |
| 7 | 2017-11-18 11:00:00 | Good | 2280.0 |
| 8 | 2017-11-11 14:00:00 | Good | 2460.0 |
| 9 | 2017-11-11 12:00:00 | Good | 2040.0 |


### Import Data Conclusion

All three data files were read into the Jupyter notebook. The data types are suitable for this analysis and there are no missing values, so the data are ready for analysis.

## Exploratory Data Analysis and Data Visualizations

This section uses the above data frames to find the top 10 neighborhoods for drop offs, display the number of rides by taxi company, and display the average drop offs by neighborhood.

### Top 10 neighborhoods for drop offs

In [21]:
# Sort neighborhoods by drop off and extract top 10
neighborhood_sorted = neighborhood.sort_values(by=['average_trips'], ascending=False) \
                                  .reset_index(drop=True) \
                                  .rename(columns={'dropoff_location_name':'dropoff_neighborhood'}) \
                                  .round(2)
        
neighborhood_top_10 = neighborhood_sorted[0:10]

print('Top 10 neighborhoods for drop offs:')
display(neighborhood_top_10)

Top 10 neighborhoods for drop offs:


| | dropoff_neighborhood | average_trips |
|---|---|---|
| 0 | Loop | 10727.47 |
| 1 | River North | 9523.67 |
| 2 | Streeterville | 6664.67 |
| 3 | West Loop | 5163.67 |
| 4 | O'Hare | 2546.9 |
| 5 | Lake View | 2420.97 |
| 6 | Grant Park | 2068.53 |
| 7 | Museum Campus | 1510.0 |
| 8 | Gold Coast | 1364.23 |
| 9 | Sheffield & DePaul | 1259.77 |


### Top 10 Taxi Companies by Number of Rides

In [33]:
# Find top 10 taxi companies by number of rides
company_sorted = company.sort_values(by='trips_amount', ascending=False) \
                        .reset_index(drop=True)

company_top_10 = company_sorted[0:10]

print('The top 10 companies by number of trips:')
display(company_top_10)


The top 10 companies by number of trips:


| | company_name | trips_amount |
|---|---|---|
| 0 | Flash Cab | 19558 |
| 1 | Taxi Affiliation Services | 11422 |
| 2 | Medallion Leasin | 10367 |
| 3 | Yellow Cab | 9888 |
| 4 | Taxi Affiliation Service Yellow | 9299 |
| 5 | Chicago Carriage Cab Corp | 9181 |
| 6 | City Service | 8448 |
| 7 | Sun Taxi | 7701 |
| 8 | Star North Management LLC | 7455 |
| 9 | Blue Ribbon Taxi Association Inc. | 5953 |


In [40]:
# Plot bar chart of number of rides by taxi company
company_top_10_bar = px.bar(company_top_10, x='company_name', y='trips_amount',
                            title='Total Number of Trips by Company',
                            labels={'company_name':'Company Name', 
                            'trips_amount':'Total Number of Trips'})

company_top_10_bar.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
})

company_top_10_bar.show()

The company with the largest number of total trips is Flash Cab, with 19,558 trips. That is about 1.7 times as many as the next leading company, Taxi Affiliation Services, which has 11,422 trips. The 3rd through 10th companies have between roughly 6,000 and 10,400 trips each.
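The comparison between the leading company and the rest can be checked with a few lines of arithmetic. This is a minimal sketch that copies the ten trip counts from the table above into a plain dictionary (`trips`, a hypothetical name) instead of using the `company_top_10` data frame.

```python
# Trip counts for the top 10 companies, copied from the table above
trips = {
    'Flash Cab': 19558,
    'Taxi Affiliation Services': 11422,
    'Medallion Leasin': 10367,
    'Yellow Cab': 9888,
    'Taxi Affiliation Service Yellow': 9299,
    'Chicago Carriage Cab Corp': 9181,
    'City Service': 8448,
    'Sun Taxi': 7701,
    'Star North Management LLC': 7455,
    'Blue Ribbon Taxi Association Inc.': 5953,
}

# Rank companies from most to fewest trips
ranked = sorted(trips.items(), key=lambda item: item[1], reverse=True)
leader, runner_up = ranked[0], ranked[1]

# How many times more trips the leader has than the runner-up
lead_ratio = leader[1] / runner_up[1]

# Range of trip counts for the 3rd through 10th companies
rest = [count for _, count in ranked[2:]]

print(f'{leader[0]} has {lead_ratio:.2f}x the trips of {runner_up[0]}')
print(f'3rd-10th companies range from {min(rest)} to {max(rest)} trips')
```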

### Number of Drop Offs by Top 10 Neighborhoods

In [38]:
# Plot bar chart of average number of drop offs for top 10 neighborhoods
neighborhood_top_10_bar = px.bar(neighborhood_top_10, x='dropoff_neighborhood', y='average_trips',
                                 title='Average Number of Drop Offs by Neighborhood',
                                 labels={'dropoff_neighborhood':'Drop Off Neighborhood', 
                                 'average_trips':'Average Number of Drop Offs'})

neighborhood_top_10_bar.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
})

neighborhood_top_10_bar.show()

The Loop has the highest average number of drop offs at 10,727.47, about 8.5 times as many as the tenth-highest neighborhood, Sheffield & DePaul, which has 1,259.77. There is a clear distinction between the four most popular neighborhoods (Loop, River North, Streeterville, and West Loop) and the rest. Each of these four has more than twice as many average drop offs as the next leading neighborhood, O'Hare, and therefore more than twice as many as any neighborhood outside the top four.
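One way to locate the break between the leading neighborhoods and the rest is to compare each neighborhood with the one ranked directly below it. The sketch below is an illustration rather than part of the original analysis. It copies the top 10 averages from the table above into a list (`top_10`, a hypothetical name) and finds the largest relative drop between consecutive ranks.

```python
# Average drop offs for the top 10 neighborhoods, copied from the table above
top_10 = [
    ('Loop', 10727.47),
    ('River North', 9523.67),
    ('Streeterville', 6664.67),
    ('West Loop', 5163.67),
    ("O'Hare", 2546.9),
    ('Lake View', 2420.97),
    ('Grant Park', 2068.53),
    ('Museum Campus', 1510.0),
    ('Gold Coast', 1364.23),
    ('Sheffield & DePaul', 1259.77),
]

# Ratio of each neighborhood's average to the next one down the ranking
ratios = [top_10[i][1] / top_10[i + 1][1] for i in range(len(top_10) - 1)]

# Position of the largest relative drop between consecutive ranks
largest_gap = max(range(len(ratios)), key=ratios.__getitem__)
above, below = top_10[largest_gap][0], top_10[largest_gap + 1][0]

print(f'Largest relative drop: {above} -> {below} ({ratios[largest_gap]:.2f}x)')
```

For these values the largest drop falls between West Loop and O'Hare, which is consistent with treating the top four neighborhoods as a separate group.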

### Exploratory Analysis Conclusions

The company with the highest number of total rides was Flash Cab, followed by Taxi Affiliation Services, Medallion Leasin, and Yellow Cab. There is a large gap between Flash Cab and the other companies: Flash Cab gives about 1.7 times as many rides as the next leading service. The 2nd through 10th companies give between roughly 6,000 and 11,400 rides each. The neighborhoods with the most drop offs were Loop, River North, Streeterville, and West Loop. There is also a large gap between West Loop and the next leading neighborhood, O'Hare, meaning Loop, River North, Streeterville, and West Loop are the most important neighborhoods by far.

## Testing Hypotheses

This section uses an independent two-sample t-test to test the hypothesis that the average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.

<div style="padding-left: 30px;">
H<sub>0</sub>: µ<sub>bad</sub> = µ<sub>good</sub>  The average trip duration is the same on days with bad weather and days with good weather.
</div>
<div style="padding-left: 30px;">
H<sub>a</sub>: µ<sub>bad</sub> ≠ µ<sub>good</sub>  The average trip duration is different on days with bad weather than on days with good weather.
</div>

alpha = 0.05

In [None]:
# Check for equality of variances
weather_bad = weather[weather['weather_conditions'] == 'Bad']['duration_seconds']
weather_good = weather[weather['weather_conditions'] == 'Good']['duration_seconds']

levene = st.levene(weather_bad, weather_good, center='mean') # Levene's Test

W = levene[0].round(2)
p_value = levene[1].round(4)

print(f'The test statistic is: W = {W}')
print(f'The p value is: p = {p_value}')
print()

if p_value < 0.05:
    print(f'The p value of {p_value} is less than alpha.') 
    print()
    print('Reject the null hypothesis. The groups have different variances.')
else:
    print(f'The p value of {p_value} is greater than alpha.') 
    print()
    print('Do not reject the null hypothesis. The groups do not have different variances.')

The test statistic is: W = 0.72
The p value is: p = 0.3969

The p value of 0.3969 is greater than alpha.

Do not reject the null hypothesis. The groups do not have different variances.
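The outcome of Levene's test determines which form of the t-test is appropriate: the pooled-variance (Student) test when the variances do not differ, or Welch's test when they do. The following is a minimal sketch of that decision using `scipy.stats.ttest_ind` instead of `researchpy`. The `good` durations are copied from the sample rows shown earlier; the `bad` durations other than 2969.0 are made-up values for illustration only, so the printed numbers will not match the analysis.

```python
from scipy import stats

# 'good' values are copied from the sample rows shown earlier.
# 'bad' values other than 2969.0 are hypothetical, for illustration only.
bad = [2969.0, 2400.0, 2100.0, 2750.0, 2300.0]
good = [2410.0, 1920.0, 1543.0, 2512.0, 1440.0, 1320.0, 2280.0, 2460.0, 2040.0]

alpha = 0.05

# Levene's test: the null hypothesis is that the groups have equal variances
levene_stat, levene_p = stats.levene(bad, good, center='mean')

# Pool the variances only if Levene's test does not reject equality
equal_var = bool(levene_p >= alpha)

# equal_var=True gives Student's t-test, equal_var=False gives Welch's t-test
t_stat, p_value = stats.ttest_ind(bad, good, equal_var=equal_var)

test_name = "Student's" if equal_var else "Welch's"
print(f"Levene's test: W = {levene_stat:.2f}, p = {levene_p:.4f}")
print(f"{test_name} t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```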


In [None]:
# Two Independent Samples t-test, equal variances assumed
descriptives, results = rp.ttest(weather_bad, weather_good, 
                                 group1_name='bad', group2_name='good')

display(descriptives)
display(results)

t = results.iloc[2,1]
p = results.iloc[3,1]

print(f'The test statistic is: t = {t:.2f}')
print(f'The p value is: p = {p:.4f}')
print()

if p < 0.05:
    print(f'The p value of {p:.4f} is less than alpha.') 
    print()
    print('Reject the null hypothesis. The groups have different averages.')
else:
    print(f'The p value of {p:.4f} is greater than alpha.') 
    print()
    print('Do not reject the null hypothesis. The groups do not have different averages.')





| | Variable | N | Mean | SD | SE | 95% Conf. | Interval |
|---|---|---|---|---|---|---|---|
| 0 | bad | 180.0 | 2427.205556 | 721.314138 | 53.763582 | 2321.113588 | 2533.297523 |
| 1 | good | 888.0 | 1999.675676 | 759.198268 | 25.477026 | 1949.673393 | 2049.677958 |
| 2 | combined | 1068.0 | 2071.731273 | 769.461125 | 23.545128 | 2025.531264 | 2117.931283 |


| | Independent t-test | results |
|---|---|---|
| 0 | Difference (bad - good) = | 427.5299 |
| 1 | Degrees of freedom = | 1066.0 |
| 2 | t = | 6.9462 |
| 3 | Two side test p value = | 0.0 |
| 4 | Difference < 0 p value = | 1.0 |
| 5 | Difference > 0 p value = | 0.0 |
| 6 | Cohen's d = | 0.5678 |
| 7 | Hedge's g = | 0.5674 |
| 8 | Glass's delta1 = | 0.5927 |
| 9 | Point-Biserial r = | 0.2081 |


The test statistic is: t = 6.95
The p value is: p = 0.0000

The p value of 0.0000 is less than alpha.

Reject the null hypothesis. The groups have different averages.


### Hypothesis Test Conclusions

We find that the average trip duration in bad weather differs significantly from the average trip duration in good weather (t(1066) = 6.95, p < 0.001). The average trip duration during good weather (M = 1999.68 seconds, SD = 759.20) is shorter than during bad weather (M = 2427.21 seconds, SD = 721.31), a difference of about 428 seconds, or roughly 7 minutes (Cohen's d = 0.57). This provides evidence that trips take less time on average when the weather is good than when it is bad.
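The reported test statistic can be cross-checked from the group sizes, means, and standard deviations alone. This is a minimal sketch of the pooled-variance (equal variances assumed) calculation, with the summary statistics copied from the descriptives table above. It is a verification aid, not a description of how `researchpy` computes its results internally.

```python
import math

# Summary statistics copied from the descriptives table above
n_bad, mean_bad, sd_bad = 180, 2427.205556, 721.314138
n_good, mean_good, sd_good = 888, 1999.675676, 759.198268

# Degrees of freedom for the pooled-variance t-test
df = n_bad + n_good - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_bad - 1) * sd_bad**2 + (n_good - 1) * sd_good**2) / df
pooled_sd = math.sqrt(pooled_var)

# Standard error of the difference in means
std_error = pooled_sd * math.sqrt(1 / n_bad + 1 / n_good)

# Test statistic and standardized effect size
diff = mean_bad - mean_good
t_stat = diff / std_error
cohens_d = diff / pooled_sd

print(f'df = {df}, t = {t_stat:.2f}, Cohen\'s d = {cohens_d:.2f}')
```

Up to rounding, the results should agree with the degrees of freedom, t statistic, and Cohen's d in the results table.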

## Conclusion and Business Application

For Zuber, this means that our biggest competitor is Flash Cab, followed by the other companies in the top 10, including Taxi Affiliation Services, Medallion Leasin, and Yellow Cab. Additionally, drop offs are concentrated in the Loop, River North, Streeterville, and West Loop neighborhoods, so many of our potential customers either live in or visit these areas. We should look further into the location and customer profile of these neighborhoods to understand local demand and customer demographics. Trips from the Loop to O'Hare take longer in bad weather, so we should consider adjusting rates or putting more drivers on the road on rainy days.
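For planning purposes, the weather effect may be easier to use when expressed in minutes and as a percentage of a typical good-weather trip. The sketch below derives both from the group means reported in the hypothesis test. How these figures would translate into pricing or driver scheduling is left open, since the analysis does not cover that.

```python
# Group means (in seconds) copied from the hypothesis test descriptives
mean_bad = 2427.205556
mean_good = 1999.675676

# Extra time a bad-weather trip takes on average
extra_seconds = mean_bad - mean_good
extra_minutes = extra_seconds / 60

# Extra time relative to the average good-weather trip
percent_longer = 100 * extra_seconds / mean_good

print(f'Bad-weather trips take {extra_minutes:.1f} minutes longer on average')
print(f'That is {percent_longer:.1f}% longer than a good-weather trip')
```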