<a href="https://colab.research.google.com/github/nmagee/ds1002/blob/main/notebooks/data-cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pandas Data Cleaning Practice

You'll be working with `ride_sharing.csv`, a dataset of bicycle ride sharing trips in San Francisco.

It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

In [None]:
# import dependencies
import pandas as pd

In [None]:
# import csv
rides = pd.read_csv('https://ds1002-resources.s3.amazonaws.com/data/ride_sharing.csv')

The `user_type` column indicates whether a user is taking a free ride and takes on the following values:

`1` for free riders.  
`2` for pay per ride.  
`3` for monthly subscribers.

**1. Provide summary statistics for the `user_type` column**

In [None]:
rides.dtypes

In [None]:
# create a new column that has the correct data type for `user_type`
rides['user_type_cat'] = rides['user_type'].astype('category')

In [None]:
# use `assert` to confirm change
assert rides['user_type_cat'].dtype == 'category'

In [None]:
# run summary stats on new column
rides['user_type_cat'].describe()
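
To see why the conversion matters, here is a minimal sketch on a made-up Series (the values are invented for illustration and are not taken from `ride_sharing.csv`). On an integer column, `describe()` reports numeric statistics such as the mean, which say little about coded labels. On a categorical column it reports the count, the number of unique values, the most frequent value, and its frequency.

```python
import pandas as pd

# Hypothetical user_type codes, invented for illustration
user_type = pd.Series([1, 2, 2, 3, 2, 1], name='user_type')

# Numeric summary: count, mean, std, min, quartiles, max
numeric_summary = user_type.describe()

# Categorical summary: count, unique, top, freq
categorical_summary = user_type.astype('category').describe()

print(numeric_summary)
print(categorical_summary)
```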

**2. Find the average ride `duration`.**

In [None]:
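# `duration` column: separate the units from the numerical value and
# store the result in a column called `duration_trim`
# (`str.strip` removes the characters m, i, n, u, t, e, s from both ends,
# not the literal word "minutes")

rides['duration_trim'] = rides['duration'].str.strip('minutes')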

In [None]:
# convert `duration_trim` to `int` and store it as 'duration_min'
rides['duration_min'] = rides['duration_trim'].astype('int')

In [None]:
# confirm the change
assert pd.api.types.is_integer_dtype(rides['duration_min'])

In [None]:
# print the average ride duration
print(rides['duration_min'].mean().round(2))
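
The trimming step deserves a closer look, because `str.strip` takes a set of characters rather than a literal suffix. The sketch below uses made-up strings (an assumption about the format: an integer followed by the word "minutes") and shows `str.extract` as one possible alternative that captures the digits directly.

```python
import pandas as pd

# Hypothetical duration strings, invented for illustration
duration = pd.Series(['12 minutes', '7 minutes', '31 minutes'])

# strip() removes any of the characters m, i, n, u, t, e, s from both ends;
# the space before "minutes" is not in that set, so it is left behind
stripped = duration.str.strip('minutes')

# Alternative: capture the leading digits with a regular expression
extracted = duration.str.extract(r'(\d+)', expand=False)

# Convert the captured digits to integers
duration_min = extracted.astype('int64')

print(stripped.tolist())
print(duration_min.tolist())
```

The trailing space left by `strip` is harmless here because integer conversion ignores surrounding whitespace, but the regular expression makes the intent explicit.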

Bicycle tire sizes can be 26″, 27″, or 29″ and are best treated as a categorical value. To cut maintenance costs, the ride sharing provider decided to cap the tire size at 27″.

**3. Set the maximum tire size to 27″ in the dataset**

In [None]:
rides.dtypes

In [None]:
# Set all values above 27 to 27
rides.loc[rides['tire_size'] > 27, 'tire_size'] = 27

In [None]:
# Convert tire_size to categorical
rides['tire_size'] = rides['tire_size'].astype('category')

In [None]:
# Print tire size description
print(rides['tire_size'].describe())
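
As a rough sketch of the capping step, the example below applies it to a made-up column (the values are invented for illustration) and compares the boolean-mask assignment used above with `Series.clip`, which gives the same result in one call.

```python
import pandas as pd

# Hypothetical tire sizes in inches, invented for illustration
bikes = pd.DataFrame({'tire_size': [26, 27, 29, 26, 29]})

# Boolean-mask assignment, as in the cell above
capped_loc = bikes.copy()
capped_loc.loc[capped_loc['tire_size'] > 27, 'tire_size'] = 27

# Equivalent: clip() caps every value at the given upper bound
capped_clip = bikes['tire_size'].clip(upper=27)

# Convert to categorical only after the numeric comparison is done
tire_cat = capped_clip.astype('category')

print(capped_loc['tire_size'].tolist())
print(tire_cat.cat.categories.tolist())
```

The order matters: the `>` comparison needs a numeric column, since an unordered categorical column supports only equality comparisons.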

A new update to the data pipeline feeding into the ride sharing data has added the `ride_id` column, which represents a unique identifier for each ride.

The update, however, coincided with radically shorter average ride durations and irregular user birth dates set in the future. Most importantly, the number of rides taken increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the `rides_updated` DataFrame.

**4. Remove complete and incomplete duplicates**
* Drop complete duplicates in `rides_updated` and store the results in `ride_dup`.
* Create the `statistics` dictionary, which holds the minimum aggregation for `user_birth_year` and the median aggregation for `duration`.
* Drop incomplete duplicates by grouping by `ride_id` and applying the aggregations in `statistics`.
* Find duplicates again and run the `assert` statement to verify de-duplication.


In [None]:
# import `ride_sharing_updated.csv`
rides_updated = pd.read_csv('https://ds1002-resources.s3.amazonaws.com/data/ride_sharing_updated.csv')

In [None]:
# Find duplicates
duplicates = rides_updated.duplicated(subset = 'ride_id', keep = False)

In [None]:
# Sort your duplicated rides
duplicated_rides = rides_updated[duplicates].sort_values('ride_id')

In [None]:
# Print relevant columns
print(duplicated_rides[['ride_id','duration','user_birth_year']])
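
The `keep = False` argument is what makes every copy of a repeated `ride_id` show up in the output. A small sketch on a made-up DataFrame (values invented for illustration) contrasts it with the default `keep='first'`:

```python
import pandas as pd

# Hypothetical rides, invented for illustration
toy = pd.DataFrame({
    'ride_id': [1, 2, 2, 3],
    'duration': [10, 15, 15, 8],
})

# Default keep='first': only the later occurrence is flagged
first_only = toy.duplicated(subset='ride_id')

# keep=False: every row that shares a ride_id is flagged
all_copies = toy.duplicated(subset='ride_id', keep=False)

print(first_only.tolist())   # [False, False, True, False]
print(all_copies.tolist())   # [False, True, True, False]
```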

In [None]:
# Drop complete duplicates from rides_updated
ride_dup = rides_updated.drop_duplicates()

In [None]:
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'median'}

In [None]:
# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

In [None]:
# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates]

In [None]:
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
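
The two-stage de-duplication above can be sketched end to end on a made-up DataFrame (all values are invented for illustration). Rows that match in every column are complete duplicates and are removed by `drop_duplicates()`. Rows that share a `ride_id` but differ elsewhere are incomplete duplicates and are collapsed with `groupby` and `agg`.

```python
import pandas as pd

# Hypothetical rides, invented for illustration
toy = pd.DataFrame({
    'ride_id':         [1, 2, 2, 3, 3],
    'duration':        [10, 15, 15, 8, 12],
    'user_birth_year': [1990, 1985, 1985, 1992, 1994],
})

# The two ride_id 2 rows are identical in every column: a complete duplicate
no_complete = toy.drop_duplicates()

# ride_id 3 still appears twice with different values: an incomplete duplicate.
# Collapse each ride_id to one row using the same rules as above.
statistics = {'user_birth_year': 'min', 'duration': 'median'}
unique_rides = no_complete.groupby('ride_id').agg(statistics).reset_index()

print(unique_rides)
```

Note that `agg` with a dictionary returns only the columns named in it, so any column left out of `statistics` is dropped from the result.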