# Cleaning data

In [1]:
# Import the course packages
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import missingno as msno
import fuzzywuzzy
import recordlinkage 

# Import the course datasets
ride_sharing = pd.read_csv('data/ride_sharing_new.csv', index_col = 'Unnamed: 0')
airlines = pd.read_csv('data/airlines_final.csv',  index_col = 'Unnamed: 0')
banking = pd.read_csv('data/banking_dirty.csv', index_col = 'Unnamed: 0')
restaurants = pd.read_csv('data/restaurants_L2.csv', index_col = 'Unnamed: 0')
restaurants_new = pd.read_csv('data/restaurants_L2_dirty.csv', index_col = 'Unnamed: 0')

# Common data problems

## Data Types constraints

The `user_type` column contains information on whether a user is taking a free ride and takes on the following values:

- `1` for free riders.
- `2` for pay per ride.
- `3` for monthly subscribers.

In this instance, you will print the information of `ride_sharing` using `.info()` and see an example of how an incorrect data type can flaw your analysis of the dataset. Consider the table `ride_sharing`:

In [18]:
display(ride_sharing.head(2))

Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender
0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male
1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male


In [10]:
# Print the information of ride_sharing
print(ride_sharing.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25760 entries, 0 to 25759
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   duration         25760 non-null  object  
 1   station_A_id     25760 non-null  int64   
 2   station_A_name   25760 non-null  object  
 3   station_B_id     25760 non-null  int64   
 4   station_B_name   25760 non-null  object  
 5   bike_id          25760 non-null  int64   
 6   user_type        25760 non-null  int64   
 7   user_birth_year  25760 non-null  int64   
 8   user_gender      25760 non-null  object  
 9   user_type_cat    25760 non-null  category
dtypes: category(1), int64(5), object(4)
memory usage: 2.0+ MB
None


In [11]:
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64


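The summary above treats `user_type` as a number, so it reports a mean of about 2.01 and quartiles that have no meaning for a label. The `.astype()` method casts a pandas object to a specified dtype; casting to `category` makes pandas summarize the column as a set of labels instead.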

In [12]:
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype("category")

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics 
print(ride_sharing['user_type_cat'].describe())

count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dtype: int64
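
The same contrast can be reproduced on a small, made-up Series. This is only a sketch: the six values below are invented for illustration and are not taken from `ride_sharing`.

```python
import pandas as pd

# Made-up codes standing in for user_type
# (1 = free rider, 2 = pay per ride, 3 = monthly subscriber)
user_type = pd.Series([1, 2, 2, 3, 2, 1], name="user_type")

# As integers, describe() reports mean, std and quartiles,
# none of which are meaningful for label codes
numeric_summary = user_type.describe()

# As a category, describe() reports count, unique, top and freq instead
user_type_cat = user_type.astype("category")
category_summary = user_type_cat.describe()

print(numeric_summary)
print(category_summary)
```

Storing the converted values in a new column, as the cell above does, keeps the original integer codes available in case they are needed later.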


## Uniqueness constraints

The `.duplicated()` method returns a boolean Series that indicates which rows are duplicates of previous rows. By default, it considers all columns in the DataFrame, but it can also be used to check for duplicates in a subset of columns. 

The `.drop_duplicates()` method is used to remove duplicate rows from a DataFrame. By default, it removes all rows that are duplicates of previous rows, but it can also be used to remove duplicates based on a subset of columns.

`.duplicated()` and `.drop_duplicates()` take the following arguments:

- **subset**: list of column names to check for duplication
- **keep**: which occurrence to treat as the original: the first (`'first'`, the default), the last (`'last'`), or none (`False`). With `keep=False`, `.duplicated()` flags every row in a duplicate group and `.drop_duplicates()` drops all of them.
- **inplace** (`.drop_duplicates()` only): drop duplicated rows directly inside the DataFrame without creating a new object (`True`)
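
To make the effect of these arguments concrete, here is a minimal sketch on a toy DataFrame. The `ride_id` column and all values are hypothetical, chosen only so that each setting of `keep` gives a different result.

```python
import pandas as pd

# Hypothetical rides: rows 0 and 1 are identical,
# row 2 shares only its ride_id with them
rides = pd.DataFrame({
    "ride_id": [10, 10, 10, 11],
    "duration": [12, 12, 15, 8],
})

# Default: compare all columns, flag every repeat after the first occurrence
full_row_dupes = rides.duplicated()

# subset: compare on ride_id only; keep decides which occurrence is left unflagged
dupes_keep_first = rides.duplicated(subset=["ride_id"], keep="first")
dupes_keep_last = rides.duplicated(subset=["ride_id"], keep="last")
dupes_keep_none = rides.duplicated(subset=["ride_id"], keep=False)

# drop_duplicates() returns a new DataFrame unless inplace=True is passed
deduped = rides.drop_duplicates()
only_unique_ids = rides.drop_duplicates(subset=["ride_id"], keep=False)

print(dupes_keep_none)
print(deduped)
```

Passing `keep=False` to `.duplicated()` is useful for inspection, because it shows every row involved in a duplicate group rather than only the repeats.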


**Finding duplicates**



The update, however, coincided with radically shorter average ride durations and irregular user birth dates set in the future. Most importantly, the number of rides taken increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the `ride_sharing` DataFrame.
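
One way to tell the two kinds of duplicates apart is sketched below. It assumes the data has an identifier column, here a hypothetical `ride_id`, and uses invented rows: complete duplicates match on every column, while incomplete duplicates share the identifier but disagree elsewhere.

```python
import pandas as pd

# Invented rows; ride_id is a hypothetical identifier column
rides = pd.DataFrame({
    "ride_id": [1, 2, 2, 3, 3, 4],
    "duration": [12, 24, 24, 8, 9, 4],
    "user_birth_year": [1959, 1965, 1965, 1993, 1993, 1979],
})

# Complete duplicates: every column matches another row
is_complete = rides.duplicated(keep=False)

# Rows whose ride_id appears more than once
shares_id = rides.duplicated(subset=["ride_id"], keep=False)

# Incomplete duplicates: same ride_id, but at least one other column differs
complete = rides[is_complete].sort_values("ride_id")
incomplete = rides[shares_id & ~is_complete].sort_values("ride_id")

print(complete)
print(incomplete)
```

Sorting by the identifier places the rows of each duplicate group next to each other, which makes the conflicting values easier to compare by eye.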

In [56]:
# pandas truncates long DataFrames when displaying them; max_rows sets how many rows are shown before truncation
pd.options.display.max_rows = 20
ride_sharing.head(5)

Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender
0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male
1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male
2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male
3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male
4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male


In [20]:
list(ride_sharing.columns)

['duration',
 'station_A_id',
 'station_A_name',
 'station_B_id',
 'station_B_name',
 'bike_id',
 'user_type',
 'user_birth_year',
 'user_gender']

In [43]:
# Group by birth year and duration; .sum() on the string-valued duration column concatenates the strings
birth_duration = ride_sharing.groupby(['user_birth_year', 'duration'])['duration'].sum()
# pandas truncates long output; setting max_rows to None shows every row
pd.set_option('display.max_rows', None)
print(birth_duration)

user_birth_year  duration    
1901             5 minutes                                               5 minutes
1902             1 minutes                                               1 minutes
                 10 minutes                                             10 minutes
                 6 minutes                                               6 minutes
                 7 minutes                             7 minutes7 minutes7 minutes
1927             11 minutes                                             11 minutes
                 17 minutes                                             17 minutes
                 4 minutes                                               4 minutes
1936             28 minutes                                             28 minutes
1939             18 minutes                                             18 minutes
                 22 minutes                                             22 minutes
1942             11 minutes                              