# Cleaning Data in Python

👋 Welcome to your workspace! Here, you can write and run Python code and add text in [Markdown](https://www.markdownguide.org/basic-syntax/). Below, we've imported the datasets from the course _Cleaning Data in Python_ as DataFrames as well as the packages used in the course. This is your sandbox environment: analyze the course datasets further, take notes, or experiment with code!

In [1]:
%%capture
# Install fuzzywuzzy
!pip install fuzzywuzzy

In [3]:
# Importing course packages; you can add more too!
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import missingno as msno
import fuzzywuzzy
#import recordlinkage 

# Importing course datasets as DataFrames
ride_sharing = pd.read_csv('datasets/ride_sharing_new.csv', index_col = 'Unnamed: 0')
airlines = pd.read_csv('datasets/airlines_final.csv',  index_col = 'Unnamed: 0')
banking = pd.read_csv('datasets/banking_dirty.csv', index_col = 'Unnamed: 0')
restaurants = pd.read_csv('datasets/restaurants_L2.csv', index_col = 'Unnamed: 0')
restaurants_new = pd.read_csv('datasets/restaurants_L2_dirty.csv', index_col = 'Unnamed: 0')

ride_sharing.head() # Display the first five rows of this DataFrame

Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender
0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male
1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male
2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male
3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male
4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male


### Don't know where to start?

Try completing these tasks:
- For each DataFrame, inspect the data types of each column and, where needed, clean and convert columns into the correct data type. You should also rename any columns to have more descriptive titles.
- Identify and remove all the duplicate rows in `ride_sharing`.
- Inspect the unique values of all the columns in `airlines` and clean any inconsistencies.
- For the `airlines` DataFrame, create a new column called `International` from `dest_region`, where values representing US regions map to `False` and all other regions map to `True`.
- The `banking` DataFrame contains out-of-date ages. Update the `Age` column using today's date and the `birth_date` column.
- Clean the `restaurants_new` DataFrame so that it better matches the categories in the `city` and `type` columns of the `restaurants` DataFrame. Afterward, given typos in restaurant names, use record linkage to generate possible pairs of rows between `restaurants` and `restaurants_new` using the criteria you think are best.


In [16]:
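# For each DataFrame, inspect the data types of each column and, where needed,
# clean and convert columns into the correct data type. You should also rename any columns to have more descriptive titles.

# A notebook cell displays only its last bare expression, so the output below
# shows restaurants_new.dtypes alone; wrap each line in print() to see all five.
ride_sharing.dtypes
airlines.dtypes
banking.dtypes
restaurants.dtypes
restaurants_new.dtypes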



name     object
addr     object
city     object
phone     int64
type     object
dtype: object
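
The `head()` output earlier shows `duration` stored as text such as `12 minutes`, which is the kind of column this task asks to convert. A minimal sketch of that conversion follows, run on a few toy rows rather than the real dataset; the new column name `duration_min` and the choice to treat `user_type` as a category are assumptions made for illustration.

```python
import pandas as pd

# Toy rows shaped like ride_sharing.head(); not the real dataset.
rides = pd.DataFrame({
    "duration": ["12 minutes", "24 minutes", "8 minutes"],
    "user_type": [2, 2, 3],
})

# Strip the unit text, trim whitespace, and convert what remains to integers.
# "duration_min" is a hypothetical, more descriptive column name.
rides["duration_min"] = (
    rides["duration"]
    .str.replace("minutes", "", regex=False)
    .str.strip()
    .astype("int64")
)

# user_type looks like a label rather than a quantity (an assumption),
# so store it as a category instead of an integer.
rides["user_type"] = rides["user_type"].astype("category")

print(rides.dtypes)
```

Keeping the original `duration` column until the converted one has been checked makes it easy to spot rows that failed to parse.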

In [8]:
#Identify and remove all the duplicate rows in `ride_sharing`.

# Identifying duplicate rows
duplicate_rows = ride_sharing.duplicated()

# Counting the number of duplicate rows
num_duplicates = duplicate_rows.sum()
print("Number of duplicate rows: ", num_duplicates)

# Dropping duplicate rows
ride_sharing_cleaned = ride_sharing.drop_duplicates()

# Verifying the result
print("Shape of the original DataFrame: ", ride_sharing.shape)
print("Shape of the cleaned DataFrame: ", ride_sharing_cleaned.shape)


Number of duplicate rows:  4
Shape of the original DataFrame:  (25760, 9)
Shape of the cleaned DataFrame:  (25756, 9)
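
Before dropping rows, it can help to look at every copy in a duplicate group, and to decide whether rows that agree only on some columns also count as duplicates. The sketch below uses toy data; treating `bike_id` as a key column is a hypothetical choice for illustration, not something the dataset documents.

```python
import pandas as pd

# Toy data; not the real ride_sharing rows.
rides = pd.DataFrame({
    "bike_id": [5480, 5480, 5193, 3652, 3652],
    "duration": [12, 12, 24, 8, 9],
})

# keep=False flags every member of a duplicate group, not only the repeats,
# which makes the groups easy to inspect side by side.
all_copies = rides[rides.duplicated(keep=False)].sort_values("bike_id")
print(all_copies)

# Exact duplicates: rows that match on every column collapse to one row.
deduped = rides.drop_duplicates()

# Partial duplicates: keep the first row per value of a chosen key column.
by_key = rides.drop_duplicates(subset="bike_id", keep="first")

print(len(rides), len(deduped), len(by_key))
```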


In [9]:
#Inspect the unique values of all the columns in `airlines` and clean any inconsistencies.
# Inspect unique values in each column
for column in airlines.columns:
    unique_values = airlines[column].unique()
    print(f"Unique values in column '{column}':")
    print(unique_values)
    print()


Unique values in column 'id':
[1351  373 2820 ... 2684 2549 2162]

Unique values in column 'day':
['Tuesday' 'Friday' 'Thursday' 'Wednesday' 'Saturday' 'Sunday' 'Monday']

Unique values in column 'airline':
['UNITED INTL' 'ALASKA' 'DELTA' 'SOUTHWEST' 'AMERICAN' 'JETBLUE'
 'AEROMEXICO' 'AIR CANADA' 'UNITED' 'INTERJET' 'TURKISH AIRLINES'
 'AIR FRANCE/KLM' 'HAWAIIAN AIR' 'COPA' 'WOW' 'KOREAN AIR' 'EMIRATES'
 'AVIANCA' 'AER LINGUS' 'CATHAY PACIFIC' 'BRITISH AIRWAYS'
 'PHILIPPINE AIRLINES' 'LUFTHANSA' 'QANTAS' 'FRONTIER' 'CHINA EASTERN'
 'EVA AIR' 'VIRGIN ATLANTIC' 'AIR NEW ZEALAND' 'SINGAPORE AIRLINES'
 'AIR CHINA' 'CHINA SOUTHERN' 'ANA ALL NIPPON']

Unique values in column 'destination':
['KANSAI' 'SAN JOSE DEL CABO' 'LOS ANGELES' 'MIAMI' 'NEWARK' 'LONG BEACH'
 'MEXICO CITY' 'TORONTO' 'PORTLAND' 'SAN DIEGO' 'BOSTON' 'SPOKANE'
 'GUADALAJARA' 'MINNEAPOLIS-ST. PAUL' 'NEW YORK-JFK' 'ISTANBUL'
 'BALTIMORE' 'LAS VEGAS' 'SHANGHAI' 'TOKYO-NARITA' 'PARIS-DE GAULLE'
 'HONOLULU' 'DALLAS-FT. WORTH' '
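
Once the unique values are listed, inconsistencies usually come down to casing, stray whitespace, and alternate spellings. A minimal sketch of normalizing them is shown here; the values in the toy column are hypothetical examples of such problems, not values taken from the `airlines` output.

```python
import pandas as pd

# Hypothetical messy labels; not copied from the airlines dataset.
flights = pd.DataFrame({
    "dest_region": ["Asia", "EAST US", "East US", "eur", "Europe", " West US "],
})

# Normalize whitespace and casing first so equal labels compare equal.
cleaned = flights["dest_region"].str.strip().str.lower()

# Then collapse known alternate spellings onto one canonical label.
cleaned = cleaned.replace({"eur": "europe"})

flights["dest_region"] = cleaned
print(sorted(flights["dest_region"].unique()))
```

Re-running `unique()` after each step is a quick way to confirm that the number of categories shrinks as expected.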

In [10]:
#For the `airlines` DataFrame, create a new column called `International` from `dest_region`, where values representing US regions map 
# to `False` and all other regions map to `True`.


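# dest_region holds labels such as 'West US', 'East US', 'Asia' and 'Canada/Mexico',
# so a dictionary keyed on codes like 'US' or 'EU' matches nothing and map() returns NaN.
# Assumption: every US region label ends in ' US' once case and whitespace are normalized.
region = airlines['dest_region'].str.strip().str.upper()

# US regions map to False, everything else to True
airlines['International'] = ~region.str.endswith(' US')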

# Display the updated DataFrame
print(airlines)


        id        day        airline        destination    dest_region  \
0     1351    Tuesday    UNITED INTL             KANSAI           Asia   
1      373     Friday         ALASKA  SAN JOSE DEL CABO  Canada/Mexico   
2     2820   Thursday          DELTA        LOS ANGELES        West US   
3     1157    Tuesday      SOUTHWEST        LOS ANGELES        West US   
4     2992  Wednesday       AMERICAN              MIAMI        East US   
...    ...        ...            ...                ...            ...   
2804  1475    Tuesday         ALASKA       NEW YORK-JFK        East US   
2805  2222   Thursday      SOUTHWEST            PHOENIX        West US   
2806  2684     Friday         UNITED            ORLANDO        East US   
2807  2549    Tuesday        JETBLUE         LONG BEACH        West US   
2808  2162   Saturday  CHINA EASTERN            QINGDAO           Asia   

     dest_size boarding_area   dept_time  wait_min     cleanliness  \
0          Hub  Gates 91-102  2018-12-31 

In [13]:
# The `banking` DataFrame contains out-of-date ages. Update the `Age` column using today's date and the `birth_date` column.

# Convert birth_date column to datetime type
banking['birth_date'] = pd.to_datetime(banking['birth_date'])

# Calculate the age in completed years as of today.
# Dividing by 365.25 days per year is an approximation, so a result can be off by one near a birthday.
today = pd.to_datetime('today')
banking['Age'] = (today - banking['birth_date']).dt.days // 365.25

# Display the updated DataFrame
print(banking)


     cust_id birth_date   Age  acct_amount  inv_amount   fund_A   fund_B  \
0   870A9281 1962-06-09  60.0     63523.31       51295  30105.0   4138.0   
1   166B05B0 1962-12-16  60.0     38175.46       15050   4995.0    938.0   
2   BFC13E88 1990-09-12  32.0     59863.77       24567  10323.0   4590.0   
3   F2158F66 1985-11-03  37.0     84132.10       23712   3908.0    492.0   
4   7A73F334 1990-05-17  33.0    120512.00       93230  12158.4  51281.0   
..       ...        ...   ...          ...         ...      ...      ...   
95  CA507BA1 1974-08-10  48.0     12209.84        7515    190.0    931.0   
96  B99CD662 1989-12-12  33.0     92838.44       49089   2453.0   7892.0   
97  13770971 1984-11-29  38.0     92750.87       27962   3352.0   7547.0   
98  93E78DA3 1969-12-14  53.0     41942.23       29662   1758.0  11174.0   
99  AC91D689 1993-05-18  30.0     99490.61       32149   2184.0  17918.0   

     fund_C   fund_D account_opened last_transaction  
0    1420.0  15632.0       02-09
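
If ages need to be exact on the birthday itself, one option is to subtract the years and then take one off when the birthday has not yet occurred in the reference year. The sketch below assumes a hypothetical helper `age_on` and a fixed reference date so the result is reproducible; the three birth dates are rows 0, 2 and 99 from the output above.

```python
import pandas as pd

def age_on(birth_dates: pd.Series, as_of: pd.Timestamp) -> pd.Series:
    """Whole years completed between each birth date and as_of."""
    # True where the birthday falls later in the year than as_of.
    birthday_pending = (birth_dates.dt.month > as_of.month) | (
        (birth_dates.dt.month == as_of.month) & (birth_dates.dt.day > as_of.day)
    )
    return as_of.year - birth_dates.dt.year - birthday_pending.astype(int)

births = pd.to_datetime(pd.Series(["1962-06-09", "1990-09-12", "1993-05-18"]))

# A fixed, hypothetical reference date; pass pd.Timestamp.today() for live data.
ages = age_on(births, pd.Timestamp("2023-06-01"))
print(ages.tolist())  # [60, 32, 30]
```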

In [14]:
# Clean the `restaurants_new` DataFrame so that it better matches the categories in the `city` and `type` columns of the
# `restaurants` DataFrame. Afterward, given typos in restaurant names, use record linkage to generate possible pairs of
# rows between `restaurants` and `restaurants_new` using the criteria you think are best.


from fuzzywuzzy import fuzz

# Minimum similarity score (0-100) for a name or city to count as a match
threshold = 80  # Adjust the threshold as needed

# Compare every row of restaurants against every row of restaurants_new
potential_matches = []
for idx, row in restaurants.iterrows():
    name_score = restaurants_new['name'].apply(lambda x: fuzz.token_set_ratio(x, row['name']))
    city_score = restaurants_new['city'].apply(lambda x: fuzz.token_set_ratio(x, row['city']))
    # Keep the index labels of rows where both scores clear the threshold
    matched = restaurants_new.index[(name_score > threshold) & (city_score > threshold)]
    potential_matches.extend((idx, new_idx) for new_idx in matched)

# Display the potential matching pairs
for left_idx, right_idx in potential_matches:
    print(f"Potential match: restaurants[{left_idx}] - restaurants_new[{right_idx}]")








Potential match: restaurants[0] - restaurants_new[40]
Potential match: restaurants[1] - restaurants_new[28]
Potential match: restaurants[2] - restaurants_new[74]
Potential match: restaurants[3] - restaurants_new[1]
Potential match: restaurants[4] - restaurants_new[53]
Potential match: restaurants[5] - restaurants_new[65]
Potential match: restaurants[6] - restaurants_new[73]
Potential match: restaurants[7] - restaurants_new[79]
Potential match: restaurants[8] - restaurants_new[43]
Potential match: restaurants[9] - restaurants_new[50]
Potential match: restaurants[10] - restaurants_new[75]
Potential match: restaurants[11] - restaurants_new[21]
Potential match: restaurants[12] - restaurants_new[26]
Potential match: restaurants[13] - restaurants_new[7]
Potential match: restaurants[14] - restaurants_new[67]
Potential match: restaurants[15] - restaurants_new[55]
Potential match: restaurants[16] - restaurants_new[57]
Potential match: restaurants[17] - restaurants_new[12]
Potential match: resta
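
The task also asks for the `city` and `type` categories in `restaurants_new` to be aligned with those in `restaurants` before pairing rows, a step the cell above does not perform. A minimal sketch of that alignment follows. It swaps in the standard library's `difflib.get_close_matches` in place of fuzzywuzzy so it runs without extra installs; the category values, the helper `closest`, and the 0.8 cutoff are all assumptions for illustration.

```python
import difflib

import pandas as pd

# Hypothetical canonical categories, standing in for restaurants['type'].unique().
canonical_types = ["american", "italian", "asian"]

# Hypothetical dirty values, standing in for restaurants_new['type'].
dirty = pd.DataFrame({"type": ["amerrican", "Italian ", "asiann", "steakhouse"]})

def closest(value, choices, cutoff=0.8):
    """Return the most similar canonical label, or the value unchanged if none is close."""
    matches = difflib.get_close_matches(value.strip().lower(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value

dirty["type_clean"] = dirty["type"].apply(lambda v: closest(v, canonical_types))
print(dirty)
```

Values with no close canonical label, such as `steakhouse` here, pass through unchanged so they can be reviewed by hand instead of being forced into the wrong category.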