<div style="border:solid blue 2px; padding: 20px">

 **Overall Summary of the Project**

Dear Paul,

Congratulations on completing your data analysis project for Zuber! Your notebook is well-organized and provides a comprehensive analysis of taxi ride patterns in Chicago. You've effectively structured your project with clear sections, detailed explanations, and appropriate visualizations. Your work demonstrates a strong understanding of data manipulation, visualization, and statistical hypothesis testing. Below is some feedback to help you refine your project according to the assessment criteria.

---

<div style="border-left: 7px solid green; padding: 10px;">
<b>✅ Strengths:</b>
<ul>
  <li><b>Data Loading and Preparation:</b> You efficiently loaded the datasets and performed necessary data cleaning, including handling company names with extra numbers. Your use of regular expressions to clean the data shows proficiency in data preprocessing.</li>
  <li><b>Exploratory Data Analysis:</b> You conducted thorough EDA by analyzing the distribution of rides among taxi companies and identifying the top neighborhoods by drop-offs. Your visualizations effectively highlight key insights, such as the dominance of certain companies and neighborhoods.</li>
  <li><b>Visualization Skills:</b> The bar charts and pie charts are well-designed, with appropriate labels and titles. Filtering companies with fewer than 200 rides to reduce visual clutter demonstrates good analytical judgment.</li>
  <li><b>Hypothesis Formulation:</b> You clearly stated both the null and alternative hypotheses, which is essential for proper hypothesis testing.</li>
  <li><b>Conclusion:</b> Your summary effectively ties together the findings from your analysis, providing valuable insights into taxi company performance, neighborhood popularity, and the impact of weather on ride durations.</li>
</ul>
</div>

<div style="border-left: 7px solid gold; padding: 10px;">
<b>⚠️ Areas for Improvement:</b>
<ul>
  <li><b>Code Comments:</b> Including comments within your code cells to explain the purpose of each code block can enhance readability and help others understand your thought process.</li>
  <li><b>Data Type Conversion:</b> While you mentioned that data types appear correct, explicitly verifying and documenting the data types of your columns can strengthen your data validation process.</li>
  <li><b>Enhanced Visualizations:</b> Consider adding more descriptive titles or annotations to your graphs to provide additional context and make your visualizations more informative.</li>
  <li><b>Formatting:</b> Ensure consistent formatting throughout your notebook, such as consistent use of headings and spacing, to improve overall readability.</li>
</ul>
</div>

<div style="border-left: 7px solid red; padding: 10px;">
<b>⛔️ Critical Changes Required:</b>
<ul>
  <li><b>Incorrect Day in Hypothesis Testing:</b>
    <ul>
      <li><b>Issue:</b> The project specifies testing the hypothesis on <b>rainy Sundays</b>, but your analysis focuses on <b>rainy Saturdays</b>.</li>
      <li><b>How to Fix:</b> Modify your dataset filtering to select rides that occurred on Sundays instead of Saturdays. Adjust your analysis accordingly to test the hypothesis for rainy Sundays.</li>
      <li><b>Example:</b> Change your code to filter for Sunday rides:
        <pre>
# Modify this line to filter for Sundays (6 represents Sunday)
loop_to_ohare['start_ts'] = pd.to_datetime(loop_to_ohare['start_ts'])
loop_to_ohare['day_of_week'] = loop_to_ohare['start_ts'].dt.dayofweek
sundays_df = loop_to_ohare[loop_to_ohare['day_of_week'] == 6]
        </pre>
      </li>
    </ul>
  </li>
  <li><b>Hypothesis Alignment with Project Instructions:</b>
    <ul>
      <li><b>Issue:</b> The hypothesis should focus on whether the average ride duration changes on rainy Sundays, not Saturdays.</li>
      <li><b>How to Fix:</b> Update your null and alternative hypotheses to reflect the correct day as per the project requirements.</li>
      <li><b>Example:</b>
        <ul>
          <li><b>Null Hypothesis (H₀):</b> The average duration of rides from the Loop to O'Hare International Airport does not change on rainy Sundays.</li>
          <li><b>Alternative Hypothesis (H₁):</b> The average duration of rides from the Loop to O'Hare International Airport changes on rainy Sundays.</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
</div>

---

**Conclusion**

Your project demonstrates a solid understanding of data analysis techniques and provides valuable insights into the taxi industry in Chicago. By addressing the critical changes regarding the correct day for hypothesis testing, you will align your analysis with the project requirements and strengthen your findings. Enhancing code documentation and refining visualizations will further improve the clarity and impact of your work.

**Next Steps**

- **Adjust Hypothesis Testing:**
  - Modify your data filtering to focus on rides that occurred on rainy Sundays.
  - Update your hypotheses and re-run the statistical tests accordingly.
  - Interpret the new results and discuss their implications.

- **Enhance Code Documentation:**
  - Add comments within your code cells to explain the purpose and functionality of each code block.
  - This will make your code more readable and easier to follow.

- **Refine Visualizations:**
  - Add more context to your graphs with descriptive titles and annotations.
  - Ensure that all labels are clear and that the visualizations effectively communicate your findings.

- **Verify Data Types Explicitly:**
  - Include code to check and, if necessary, convert data types to ensure consistency and correctness.

If you have any questions or need further assistance, please feel free to reach out. We look forward to your updated submission!

</div>

**(Response)** Thanks for the feedback! I believe this project should have been accepted as is, for the reasons below:

- I provided a good amount of code documentation and am not entirely sure what else can be added, so if you could provide some examples that would be great.
- I would say the same thing about formatting - what about it specifically isn't consistent? I'm using headings, subheadings, bullet lists, and so forth whenever possible to improve readability. 
- Also, there isn't really much more that can be added to label my graphs - my labels are very complete and contain all necessary details (dates, places, etc.) with proper sizing and readability.
- Finally, the instructions say to use Saturdays for my hypothesis. I can still make the change to Sunday, but that isn't what I was originally instructed to do. I double-checked, and every set of instructions says Saturday (within JupyterHub as well as at the start of the sprint).

Please let me know how to proceed. Thanks for reviewing my project.

# Sprint 6: Data Collection and Storage (SQL)

## Exploratory Data Analysis

I work as an analyst for Zuber, a new ride-sharing company launching in Chicago. I want to understand passenger preferences and the impact of external factors on taxi ridership. The data was parsed and retrieved from an online source available at https://practicum-content.s3.us-west-1.amazonaws.com/data-analyst-eng/moved_chicago_weather_2017.html, and initial exploratory data analysis has already been performed.

In the previous analysis, I calculated new values from existing data along with some insights about the two most popular taxi companies, Flash Cab and Taxi Affiliation Services. I prepared the data to test the hypothesis that the duration of rides from the Loop to O'Hare International Airport is affected by weather conditions. 

This is a continuation of the previous analysis of Chicago taxi rides, which will involve data visualization and statistical testing. The following result files are the starting point for this section:

- moved_project_sql_result_01.csv:
    - Taxi company names
    - Number of rides each company completed between 11-15-2017 and 11-16-2017
- moved_project_sql_result_04.csv:
    - Chicago neighborhood where rider was dropped off
    - Average number of rides ending in each Chicago neighborhood in 2017
- moved_project_sql_result_07.csv:
    - Starting timestamp of rides from Loop to O'Hare International Airport
    - Weather conditions when the ride started
    - Ride duration

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from datetime import datetime
from scipy import stats as st
import seaborn as sns
import re

In [None]:
# Import data as dataframes
companies = pd.read_csv('moved_project_sql_result_01.csv')
neighborhoods = pd.read_csv('moved_project_sql_result_04.csv')

In [None]:
# Overview of the two dataframes to check data formatting and structure
companies.info()
print(companies.head())
print(companies.sample(5))
print()
neighborhoods.info()
print(neighborhoods.head())
print(neighborhoods.sample(5))

The two dataframes appear to have no missing values (each column's non-null count matches the dataframe's row count), and all the data types appear to be correct and don't need any changes for analysis. See below for explanation:

- Company Name and Dropoff Location Name are objects. This is correct because these are names of companies and locations.
- Trip Amount is an integer. This is correct because these are counted in whole numbers and not decimals.
- Average Trips is a float. This is correct because an average number of trips may have a decimal.

However, some of the company names have additional numbers at the start of their names, such as "2809 - 95474 C & D Cab Co Inc." I will strip the numbers off the names for visual clarity during the analysis, being careful not to remove numbers that are actually part of the company's name, such as "303 Taxi" or "5 Star Taxi."

In [None]:
# Removing the extra numbers at the beginning of some company names, in the '4 digits - 5 digits' and '4 digits -' formats.
# The '^' anchor restricts each match to the start of the name
companies['company_name'] = companies['company_name'].replace(r'^\d{4} - \d{5}', '', regex=True)
companies['company_name'] = companies['company_name'].replace(r'^\d{4} - ', '', regex=True)

# Trim the leading space left behind by the first pattern so exact matches below work
companies['company_name'] = companies['company_name'].str.strip()

# Manually replacing a leftover that still starts with a dash after the steps above
companies['company_name'] = companies['company_name'].replace('- Felman Corp, Manuel Alonso', 'Felman Corp, Manuel Alonso')

# Checking for final replacement
companies.head(60)
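As a quick sanity check, the cleaning rule can be tried on a handful of names before it is applied to the full column. The sketch below is illustrative only: it uses a single anchored pattern instead of separate replacement steps, and the name `1234 - Example Cab` is a made-up placeholder, not a value from the dataset.

```python
import pandas as pd

# Sample names: the first three are quoted in the text above; the last is a hypothetical placeholder
sample_names = pd.Series([
    '2809 - 95474 C & D Cab Co Inc.',
    '303 Taxi',
    '5 Star Taxi',
    '1234 - Example Cab',
])

# '^' anchors the match to the start of the name, so only a leading
# '4 digits - ' prefix (optionally followed by '5 digits ') is removed
prefix_pattern = r'^\d{4} - (?:\d{5} )?'
cleaned_names = sample_names.str.replace(prefix_pattern, '', regex=True).str.strip()

print(cleaned_names.tolist())
```

Names such as "303 Taxi" and "5 Star Taxi" pass through unchanged because they do not start with four digits followed by " - ".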

### Taxi Companies vs. Number of Rides Performed

In [None]:
# Font size adjustment (set before plotting so it applies to this figure)
plt.rcParams.update({'font.size': 14})

# Plotting taxi companies vs. number of rides as a bar chart
# legend=False hides the legend (not required here)
companies.plot(x='company_name',
               y='trips_amount',
               title='Number of Taxi Rides by Company in Chicago on November 15-16, 2017',
               xlabel='Taxi Company',
               ylabel='Number of Rides',
               kind='bar',
               rot=90,
               figsize=(25, 10),
               legend=False)

# Show bar chart
plt.show()

As shown in the plot above, there are a large number of companies, many of which completed fewer than 200 rides. To reduce visual clutter on the graph, let's remake the plot with those companies filtered out.

In [None]:
# Create filtered companies dataframe keeping only companies with more than 200 rides
companies_filtered = companies[companies['trips_amount'] > 200]

# Font size adjustment (set before plotting so it applies to this figure)
plt.rcParams.update({'font.size': 14})

# Plotting taxi companies vs. number of rides as a bar chart
# legend=False hides the legend (not required here)
companies_filtered.plot(x='company_name',
                        y='trips_amount',
                        title='Number of Taxi Rides (>200) by Company in Chicago on November 15-16, 2017',
                        xlabel='Taxi Company',
                        ylabel='Number of Rides',
                        kind='bar',
                        rot=90,
                        figsize=(25, 10),
                        legend=False)

# Show bar chart
plt.show()

In [None]:
# Checking proportion of all rides fulfilled by Flash Cab, Taxi Affiliation Services, and their sum
companies['proportion'] = companies['trips_amount'] / companies['trips_amount'].sum()
fc_percent = companies.loc[0, 'proportion'] * 100
tas_percent = companies.loc[1, 'proportion'] * 100
sum_companies = fc_percent + tas_percent
print(fc_percent, tas_percent, sum_companies)

As shown in the above plot, Flash Cab is the dominant taxi company on November 15-16, 2017, providing nearly 20,000 rides in Chicago. This accounts for 14.2% of all rides in the dataset. They are followed by Taxi Affiliation Services, providing around 12,000 rides, or 8.3% of the total. Together, these two companies account for 22.6% of all Chicago taxi rides in this time period.

Several competitors are providing around 10,000 rides, while many more companies provide lower amounts.
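The shares quoted above come from dividing each company's ride count by the total. A minimal sketch of the same arithmetic follows, looking companies up by name rather than by row position. The ride counts are invented round numbers for illustration, not the real data.

```python
import pandas as pd

# Hypothetical ride counts, chosen only to make the arithmetic easy to follow
example = pd.DataFrame({
    'company_name': ['Flash Cab', 'Taxi Affiliation Services', 'Other Company'],
    'trips_amount': [200, 100, 700],
})

# Index by name so the lookup does not depend on how the rows are sorted
shares = example.set_index('company_name')['trips_amount']
shares = shares / shares.sum() * 100

fc_share = shares['Flash Cab']
tas_share = shares['Taxi Affiliation Services']
print(f'{fc_share:.1f}% + {tas_share:.1f}% = {fc_share + tas_share:.1f}%')
```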

### Top 10 Chicago Neighborhoods by Ride Drop-Offs

In [None]:
# Identify top 10 neighborhoods in terms of drop-offs and save as a new dataframe
neighborhoods_top10 = neighborhoods.sort_values(by='average_trips', ascending=False).head(10).copy()
neighborhoods_top10

In [None]:
# Calculate proportion of rides in the top 10 neighborhoods to set up for pie chart
# Add this percentage as a new column in neighborhoods_top10 dataframe
neighborhoods_top10['percent'] = neighborhoods_top10['average_trips'] / neighborhoods_top10['average_trips'].sum()
neighborhoods_top10

The top 10 Chicago neighborhoods in terms of drop-offs are:
1. Loop
2. River North
3. Streeterville
4. West Loop
5. O'Hare
6. Lake View
7. Grant Park
8. Museum Campus
9. Gold Coast
10. Sheffield & DePaul

In [None]:
# Font size adjustment (set before plotting so it applies to this figure)
plt.rcParams.update({'font.size': 10})

# Create pie chart for average dropoff amount among the top 10 neighborhoods
# (no legend is drawn; the wedge labels carry the neighborhood names)
plt.pie(neighborhoods_top10['percent'], labels=neighborhoods_top10['dropoff_location_name'], autopct='%1.1f%%')

# Add title
plt.title('Taxi Dropoff Frequency among Top 10 Chicago Neighborhoods')

# Ensure circular plot
plt.axis('equal')

# Show pie chart
plt.show()

In [None]:
# Share of the top 3 most popular neighborhoods among the full dataset and among the top 10 only
# Add a percent column to the original dataframe and sort it
# (the sorted result is assigned back so the sort actually takes effect)
neighborhoods['percent'] = neighborhoods['average_trips'] / neighborhoods['average_trips'].sum()
neighborhoods = neighborhoods.sort_values(by='average_trips', ascending=False)

sum_neighborhoods = neighborhoods['percent'].head(3).sum() * 100
print(sum_neighborhoods)

sum_neighborhoods_top10 = neighborhoods_top10['percent'].head(3).sum() * 100
print(sum_neighborhoods_top10)

As shown in the pie chart above, the three most popular drop-off neighborhoods are Loop, River North, and Streeterville, which together account for 62.2% of taxi dropoffs among the top 10 neighborhoods. Across all Chicago neighborhoods in the dataset, these three destinations account for 47.7% of taxi dropoffs.
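The two percentages differ only in the denominator: the top 10 total versus the total over all neighborhoods. A small sketch with made-up averages (the neighborhood letters and values are placeholders) shows the calculation using `nlargest`, which does not rely on the dataframe already being sorted.

```python
import pandas as pd

# Placeholder data: 12 hypothetical neighborhoods with invented average trip counts
demo = pd.DataFrame({
    'dropoff_location_name': list('ABCDEFGHIJKL'),
    'average_trips': [500, 300, 200, 100, 100, 100, 100, 100, 100, 100, 50, 50],
})

# Select the top 10 rows and the combined trips of the top 3
top10 = demo.nlargest(10, 'average_trips')
top3_total = demo.nlargest(3, 'average_trips')['average_trips'].sum()

# Same numerator, two different denominators
share_of_top10 = top3_total / top10['average_trips'].sum() * 100
share_of_all = top3_total / demo['average_trips'].sum() * 100
print(f'{share_of_top10:.1f}% of top 10, {share_of_all:.1f}% of all')
```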

In [None]:
# Import data as dataframe
loop_to_ohare = pd.read_csv('moved_project_sql_result_07.csv')

In [None]:
# Overview of the dataframe to check data formatting and structure
loop_to_ohare.info()
print(loop_to_ohare.head())
print(loop_to_ohare.sample(5))

This dataframe is the result of previous SQL analysis and contains timestamps of Saturday taxi rides, weather condition rating (good or bad), and the duration in seconds of the trip. There are no missing values and all data types appear correct for my analysis:

- Timestamp should be converted to datetime but I won't be using it in this analysis
- Weather Conditions is an object since it is just a rating of 'good' or 'bad'
- Trip Duration is a float because there can be decimals of seconds, but in this data set it could also be an integer since decimal seconds aren't recorded by the taxi companies
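Although the timestamp is not used in the test itself, converting it makes it easy to confirm which weekday the rides fall on. The sketch below uses three invented rows with the same column names as `loop_to_ohare`. In pandas, `dt.dayofweek` numbers Monday as 0, so Saturday is 5.

```python
import pandas as pd

# Invented rows for illustration; the column names match loop_to_ohare
rides = pd.DataFrame({
    'start_ts': ['2017-11-04 10:00:00', '2017-11-11 12:00:00', '2017-11-25 16:00:00'],
    'weather_conditions': ['Good', 'Bad', 'Good'],
    'duration_seconds': [2000.0, 2500.0, 1900.0],
})

# Convert the text timestamps to datetime and confirm the resulting dtypes
rides['start_ts'] = pd.to_datetime(rides['start_ts'])
print(rides.dtypes)

# Count rides per weekday name to see which days are present
weekday_counts = rides['start_ts'].dt.day_name().value_counts()
print(weekday_counts)
```

Running the same two steps on the real dataframe would document both the data types and the weekday the rides actually cover.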

### Testing the Hypothesis

Our previously mentioned hypothesis to statistically test is "The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays." The steps for this analysis are below.

In [None]:
# Null Hypothesis: "The average duration of rides from the Loop to O'Hare International Airport does not change on rainy Saturdays."
# Alternative Hypothesis: "The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays."

# First, we have to divide the data into two populations based on the comparison in the hypotheses above
# We will create separate lists of ride durations for Good and Bad weather conditions and compare them with a t-test below
loop_to_ohare_good = loop_to_ohare[loop_to_ohare['weather_conditions'] == 'Good']['duration_seconds']
loop_to_ohare_bad = loop_to_ohare[loop_to_ohare['weather_conditions'] == 'Bad']['duration_seconds']
loop_to_ohare_good_avg = np.mean(loop_to_ohare_good) / 60
loop_to_ohare_bad_avg = np.mean(loop_to_ohare_bad) / 60
print(f'Average ride duration in good weather: {loop_to_ohare_good_avg:.1f} minutes.')
print(f'Average ride duration in bad weather: {loop_to_ohare_bad_avg:.1f} minutes.')

# Critical statistical significance level
# If the p-value is less than alpha, we reject the null hypothesis
alpha = 0.05

# In order to test the hypothesis that the means of the two statistical populations are equal based on samples taken from them, apply the independent t-test 
results = st.ttest_ind(loop_to_ohare_good, loop_to_ohare_bad)

print('p-value: ', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis - The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.")
else:
    print("We can't reject the null hypothesis - The average duration of rides from the Loop to O'Hare International Airport does not change on rainy Saturdays.") 

As demonstrated above, the average duration of taxi rides from the Loop to O'Hare International Airport in Chicago changes on rainy Saturdays in November 2017. 
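`st.ttest_ind` assumes equal variances in the two groups by default (`equal_var=True`). One way to check that assumption, sketched here on invented durations rather than the real data, is Levene's test. If it suggests unequal variances, Welch's t-test (`equal_var=False`) is a common alternative. The 0.05 threshold for the variance check is an assumption made for this example.

```python
from scipy import stats as st

# Invented ride durations in seconds, for illustration only
good_weather = [1800, 2000, 2100, 1900, 2200, 2050]
bad_weather = [2400, 2600, 2300, 2500, 2700]

# Levene's test: the null hypothesis is that both groups have equal variance
levene_result = st.levene(good_weather, bad_weather)
equal_var = bool(levene_result.pvalue >= 0.05)

# Use the pooled-variance t-test if variances look equal, otherwise Welch's t-test
ttest_result = st.ttest_ind(good_weather, bad_weather, equal_var=equal_var)

print('Levene p-value:', levene_result.pvalue)
print('equal_var used:', equal_var)
print('t-test p-value:', ttest_result.pvalue)
```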

For further analysis, one could hypothesize that the duration of these rides specifically *increases* and perform a one-tailed t-test in that direction. When the observed difference lies in the hypothesized direction, the one-tailed p-value is half the two-tailed p-value, which gives the test more power to detect an increase. This would be a good next step since it is unlikely that rainy conditions would *decrease* a taxi ride's duration.
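SciPy exposes the one-tailed version through the `alternative` parameter of `st.ttest_ind` (available in SciPy 1.6 and later). A minimal sketch on invented durations, testing whether bad-weather rides are longer on average:

```python
from scipy import stats as st

# Invented ride durations in seconds, for illustration only
good_weather = [1800, 2000, 2100, 1900, 2200, 2050]
bad_weather = [2400, 2600, 2300, 2500, 2700]

# Two-tailed test: are the means different in either direction?
two_tailed = st.ttest_ind(bad_weather, good_weather)

# One-tailed test: is the bad-weather mean greater than the good-weather mean?
one_tailed = st.ttest_ind(bad_weather, good_weather, alternative='greater')

print('two-tailed p-value:', two_tailed.pvalue)
print('one-tailed p-value:', one_tailed.pvalue)
```

If the bad-weather mean had instead been lower than the good-weather mean, the one-tailed p-value would be above 0.5, so the direction should be chosen before looking at the data.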

# Conclusion

From my analysis, I have demonstrated a few key takeaways regarding Chicago taxi rides on November 15-16, 2017:
- **Taxi Companies**
    - Flash Cab is the dominant taxi company with a 14.2% market share.
    - Taxi Affiliation Services fulfilled the second most rides, with an 8.3% share.
    - About 10 other competitors provided at least 5,000 rides in this time period.
    - There are a total of 64 taxi companies in the dataset.
    
- **Chicago Neighborhood Popularity**
    - The top 3 destination neighborhoods are Loop, River North, and Streeterville.
    - These 3 neighborhoods account for 47.7% of all Chicago dropoffs, and 62.2% among the top 10 neighborhoods.
    
- **Effect of Weather**
    - For rides from Loop to O'Hare International Airport, weather rated 'bad' is associated with a statistically significant change in ride duration (α = 0.05).
    - The average change in ride duration in 'bad' weather was an increase of over 7 minutes.