# Data Storage and Collection Project
***

For this project, we will be doing work for Zuber, a brand new rideshare company located in the Chicago area. We will be looking for patterns in the data and attempting to analyse passenger preferences and the impact of external factors on the length and quality of rides in the area.

Data on local competitor companies and relevant weather data have been collected for analysis.

Preliminary EDA was already performed via SQL:

1. [Weather](https://practicum-content.s3.us-west-1.amazonaws.com/data-analyst-eng/moved_chicago_weather_2017.html) data was extracted for data enhancement.
2. Calculated the number of rides for each company on November 15-16, 2017, which will be referred to as `df_rides`.
3. Calculated the average number of rides that ended in each Chicago neighborhood in November 2017, which will be referred to as `df_dropoff`.
4. Identified target neighborhoods, and calculated weather conditions and ride lengths for all Saturdays in the dataset, which will be referred to as `df_test`.

Using this accumulated data, we will generate distribution diagrams in order to make statistical observations and test the following hypothesis.

**Hypothesis:**

- The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays

In [4]:
# Importing libraries
import pandas as pd
import plotly.express as px
from scipy import stats as st

Saving data into pandas DataFrames

In [5]:
# Loading data into DataFrames
df_rides = pd.read_csv('project_sql_result_01.csv')
df_dropoff = pd.read_csv('project_sql_result_04.csv')
df_test = pd.read_csv('project_sql_result_07.csv')

## `df_rides`

In [6]:
# Generating a brief description and a sample of the table
df_rides.info()
df_rides.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB


| | company_name | trips_amount |
|---|---|---|
| 0 | Flash Cab | 19558 |
| 1 | Taxi Affiliation Services | 11422 |
| 2 | Medallion Leasin | 10367 |
| 3 | Yellow Cab | 9888 |
| 4 | Taxi Affiliation Service Yellow | 9299 |


Because of the preliminary EDA, the data already looks clean. It will still be checked for missing values and duplicate rows to confirm this.

In [7]:
# Checking for duplicates
df_rides.duplicated().sum()

0

In [8]:
# Checking for missing data
df_rides.isna().sum()

company_name    0
trips_amount    0
dtype: int64
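The same two checks are repeated for each table, so they could be wrapped in a small helper. The following is only a sketch: the function name `quality_report` and the toy table are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Return the row count, fully duplicated rows, and missing values per column."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
    }


# Hypothetical toy table with one repeated row and one missing value
toy = pd.DataFrame({
    "company_name": ["A Cab", "B Cab", "B Cab", None],
    "trips_amount": [10, 5, 5, 3],
})

report = quality_report(toy)
print(report)
```

Calling `quality_report(df_rides)` on the real table would be expected to report the same zero counts shown above.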

After ensuring the data is clean, it will be used to generate a graph.

In [9]:
px.bar(df_rides[df_rides['trips_amount'] > 100],
       'company_name',
       'trips_amount',
       title='Total Rides on November 15-16, 2017',
       labels={'company_name':'Company Name',
               'trips_amount':'Total Rides'}).update_xaxes(tickangle = 45)

From the data, *Flash Cab* completed about 70% more rides than its nearest competitor (19,558 versus 11,422). The distribution is heavily right-skewed: a few companies account for most of the rides, which may indicate that customers in the Chicago area favor a small set of established companies over most others.
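To put a number on how far the leading company is ahead, the ratio between the top two ride counts can be computed directly. This sketch uses only the five rows shown in the `df_rides.head()` output, and the variable names `top5`, `ranked` and `lead_ratio` are illustrative.

```python
import pandas as pd

# The five rows shown in the df_rides.head() output above
top5 = pd.DataFrame({
    "company_name": ["Flash Cab", "Taxi Affiliation Services", "Medallion Leasin",
                     "Yellow Cab", "Taxi Affiliation Service Yellow"],
    "trips_amount": [19558, 11422, 10367, 9888, 9299],
})

# Rank companies from most to fewest rides
ranked = top5.sort_values("trips_amount", ascending=False).reset_index(drop=True)

# Ratio of the leader's rides to the runner-up's rides
lead_ratio = ranked.loc[0, "trips_amount"] / ranked.loc[1, "trips_amount"]
print(f"Leader-to-runner-up ratio: {lead_ratio:.2f}")
```

On the full `df_rides` table, the same approach could be extended to each company's share of all rides, assuming the table covers every company of interest.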

## `df_dropoff`

In [10]:
# Generating a brief description and a sample of the table
df_dropoff.info()
df_dropoff.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.6+ KB


| | dropoff_location_name | average_trips |
|---|---|---|
| 0 | Loop | 10727.466667 |
| 1 | River North | 9523.666667 |
| 2 | Streeterville | 6664.666667 |
| 3 | West Loop | 5163.666667 |
| 4 | O'Hare | 2546.9 |


As before, the data is checked for duplicates and missing values.

In [11]:
# Checking for duplicates
display(df_dropoff.duplicated().sum())
df_dropoff['dropoff_location_name'].duplicated().sum()

0

0

In [12]:
# Checking for missing data
df_dropoff.isna().sum()

dropoff_location_name    0
average_trips            0
dtype: int64

Now we will look at the top 10 neighborhoods for November. 

In [13]:
# Choosing the top 10 neighborhoods
top_ten = df_dropoff.sort_values('average_trips').tail(10)

# Graph
neighborhood = px.bar(top_ten,
                      'average_trips',
                      'dropoff_location_name',
                      title='Average Daily Dropoffs for November',
                      labels={'average_trips':'Average Number of Trips',
                              'dropoff_location_name':'Dropoff Neighborhood'})

neighborhood.show()

Over the course of November, the Loop averaged more than 10,000 dropoffs per day, followed by River North (about 9,500), Streeterville (about 6,700) and West Loop (about 5,200). These high counts may reflect travelers and tourists coming to the city, which would also be consistent with the next highest count belonging to O'Hare, the neighborhood where the airport is located.
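For a tabular view of the ranking, `DataFrame.nlargest` is an alternative to sorting and taking the tail. The sketch below uses only the five rows shown in the `df_dropoff.head()` output, and the `relative_to_top` column is an illustrative addition, not part of the original analysis.

```python
import pandas as pd

# The five rows shown in the df_dropoff.head() output above
sample = pd.DataFrame({
    "dropoff_location_name": ["Loop", "River North", "Streeterville", "West Loop", "O'Hare"],
    "average_trips": [10727.466667, 9523.666667, 6664.666667, 5163.666667, 2546.9],
})

# nlargest returns rows in descending order of the chosen column
top_three = sample.nlargest(3, "average_trips")

# Each neighborhood's average as a fraction of the leader's average
top_three = top_three.assign(
    relative_to_top=top_three["average_trips"] / top_three["average_trips"].max()
)
print(top_three)
```

The ascending sort used for the chart is still a reasonable choice there, since plotly draws the first row of a horizontal bar chart at the bottom.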

## `df_test`

In [14]:
# Generating a brief description and a sample of the table
df_test.info()
df_test.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_ts            1068 non-null   object 
 1   weather_conditions  1068 non-null   object 
 2   duration_seconds    1068 non-null   float64
dtypes: float64(1), object(2)
memory usage: 25.2+ KB


| | start_ts | weather_conditions | duration_seconds |
|---|---|---|---|
| 0 | 2017-11-25 16:00:00 | Good | 2410.0 |
| 1 | 2017-11-25 14:00:00 | Good | 1920.0 |
| 2 | 2017-11-25 12:00:00 | Good | 1543.0 |
| 3 | 2017-11-04 10:00:00 | Good | 2512.0 |
| 4 | 2017-11-11 07:00:00 | Good | 1440.0 |
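The `start_ts` column is stored as `object`, so one optional step is to parse it as a datetime and confirm that every ride falls on a Saturday. This is a sketch using only the five rows shown above; the format string is an assumption based on how those rows are printed.

```python
import pandas as pd

# The five rows shown in the df_test.head() output above
sample = pd.DataFrame({
    "start_ts": ["2017-11-25 16:00:00", "2017-11-25 14:00:00", "2017-11-25 12:00:00",
                 "2017-11-04 10:00:00", "2017-11-11 07:00:00"],
    "weather_conditions": ["Good"] * 5,
    "duration_seconds": [2410.0, 1920.0, 1543.0, 2512.0, 1440.0],
})

# Parse the timestamp strings into datetime64 values
sample["start_ts"] = pd.to_datetime(sample["start_ts"], format="%Y-%m-%d %H:%M:%S")

# dayofweek counts from Monday=0, so Saturday is 5
all_saturdays = bool((sample["start_ts"].dt.dayofweek == 5).all())
print("All rides start on a Saturday:", all_saturdays)
```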


In [15]:
# Checking for duplicates
df_test.duplicated().sum()

197

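These duplicates are kept. The table has no trip identifier and `start_ts` appears to be recorded to the hour, so identical rows most likely represent distinct rides with the same start hour and duration rather than repeated records. Dropping them would also shrink an already small sample.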
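Before deciding whether to keep duplicated rows, it can help to look at them directly. The sketch below uses hypothetical rows in the shape of `df_test`, and the names `toy`, `repeated` and `counts` are illustrative.

```python
import pandas as pd

# Hypothetical rows in the shape of df_test; two rides share every field
toy = pd.DataFrame({
    "start_ts": ["2017-11-11 06:00:00", "2017-11-11 06:00:00", "2017-11-11 08:00:00"],
    "weather_conditions": ["Good", "Good", "Good"],
    "duration_seconds": [1260.0, 1260.0, 1380.0],
})

# keep=False flags every member of a duplicated group, not just the later copies
repeated = toy[toy.duplicated(keep=False)]

# Count how many times each distinct row occurs
counts = toy.groupby(list(toy.columns)).size().reset_index(name="occurrences")
print(repeated)
print(counts)
```

Note that `duplicated()` with its default `keep='first'` counts only the later copies, which is how the total of 197 above should be read.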

In [16]:
# Checking for missing data
df_test.isna().sum()

start_ts              0
weather_conditions    0
duration_seconds      0
dtype: int64

In [17]:
weather = px.histogram(df_test, 
                       'duration_seconds', 
                       color='weather_conditions',
                       title="Ride Lengths Between the Loop and O’Hare for Saturdays in November",
                       labels={'duration_seconds':'Ride Length in Seconds',
                               'weather_conditions':'Weather Conditions'},
                       barmode='overlay')
weather.update_layout(yaxis_title='Total Rides')
weather.show()

On good-weather Saturdays, ride times tend to cluster around 20-25 minutes, while on bad-weather Saturdays they tend to cluster around 40-45 minutes. Both distributions are slightly right-skewed, likely because of factors not tested here, such as traffic from Saturday events or other extraneous circumstances. A few rides lasted less than 15 minutes; these may have been cancelled early, although the data does not confirm this.
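Reading typical ride times off a histogram is approximate, so a grouped summary in minutes can back up the visual impression. The durations below are hypothetical and only illustrate the calculation; they are not values from `df_test`.

```python
import pandas as pd

# Hypothetical durations in seconds, for illustration only
toy = pd.DataFrame({
    "weather_conditions": ["Good", "Good", "Good", "Bad", "Bad", "Bad"],
    "duration_seconds": [1200.0, 1500.0, 1800.0, 2100.0, 2400.0, 3000.0],
})

# Convert to minutes, then summarize each weather group
summary = (
    toy.assign(duration_minutes=toy["duration_seconds"] / 60)
       .groupby("weather_conditions")["duration_minutes"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

Applied to `df_test`, the same chain would give the sample mean and median for each weather condition.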

## Testing Hypothesis

**Null Hypothesis:**

H<sub>0</sub>: μ<sub>Good weather</sub> = μ<sub>Bad weather</sub>

**Alternate Hypothesis:**

H<sub>1</sub>: μ<sub>Good weather</sub> ≠ μ<sub>Bad weather</sub>

- μ<sub>Good weather</sub> = Population mean time of trips with good weather
- μ<sub>Bad weather</sub> = Population mean time of trips with bad weather

The null hypothesis assumes the mean ride times from the Loop to O'Hare International Airport are the same in good and bad weather, while the alternate hypothesis assumes they are different. The significance level (alpha) set for this test is 0.01, or 1%.

In [18]:
# Setting alpha value
alpha = 0.01

# Grouping data for test
good = df_test[df_test['weather_conditions']=='Good']
bad = df_test[df_test['weather_conditions']=='Bad']

# Performing test
results = st.ttest_ind(good['duration_seconds'], bad['duration_seconds'])

# Printing Results
print('p-value:', results.pvalue)

if (results.pvalue < alpha):
    print("We have sufficient evidence to reject the null hypothesis")
else:
    print("We do not have sufficient evidence to reject the null hypothesis")

p-value: 6.517970327099473e-12
We have sufficient evidence to reject the null hypothesis


The data presents enough evidence to reject the null hypothesis, reinforcing the idea that ride times differ depending on weather conditions.
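By default, `st.ttest_ind` assumes that both groups have equal variances. One way to check that assumption, sketched here on synthetic data, is to run Levene's test first and fall back to Welch's t-test (`equal_var=False`) if the variances appear to differ. The sample sizes, means and spreads below are hypothetical, not values from `df_test`, and this is not the procedure the notebook ran.

```python
import numpy as np
from scipy import stats as st

# Synthetic durations in seconds (hypothetical, seeded for reproducibility)
rng = np.random.default_rng(0)
good = rng.normal(loc=2000, scale=700, size=300)
bad = rng.normal(loc=2600, scale=700, size=100)

alpha = 0.01

# Levene's test: the null hypothesis is that both groups have equal variances
levene = st.levene(good, bad)
equal_var = bool(levene.pvalue >= alpha)

# equal_var=False switches ttest_ind to Welch's t-test
results = st.ttest_ind(good, bad, equal_var=equal_var)
print("Levene p-value:", levene.pvalue)
print("t-test p-value:", results.pvalue)
```

With real data, the same pattern would use `good['duration_seconds']` and `bad['duration_seconds']` in place of the synthetic arrays.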

# Conclusion
***

From the data collected, customers in the Chicago area appear to favor a few select companies over most others, which may reflect strong customer loyalty. Most competitor companies did not complete more than 5,000 rides in the chosen time period, which suggests that most are smaller operations or retain fewer customers.

The Loop, River North, Streeterville and West Loop neighborhoods had the highest numbers of dropoffs, which may be attributable to higher concentrations of hotels and tourism in those areas. This interpretation is consistent with the next highest number of dropoffs being in the O'Hare neighborhood, where Chicago's major airport is located.

By testing ride times from the Loop, the neighborhood with the most dropoffs on average, to O'Hare International Airport, we determined that weather has a statistically significant effect on ride times for customers. Based on the histogram, rides on bad-weather Saturdays tended to run roughly 20 minutes longer than rides on good-weather Saturdays.