## Data Cleaning II

This notebook continues the Data Cleaning I project and covers further data cleaning techniques.

#### Text & Categorical Data Problems
Categorical and text data are often among the messiest parts of a dataset because of their unstructured nature. In this notebook, I’ll try to:
- fix whitespace and capitalization inconsistencies in category labels 
- collapse multiple categories into one 
- and reformat strings for consistency

The dataset used here contains airline customers' answers to survey questions about San Francisco Airport. The DataFrame contains flight metadata such as the airline, the destination, and waiting times, as well as answers to key questions regarding cleanliness, safety, and satisfaction. A second DataFrame named `categories`, created below, contains all the correct possible values for the survey columns.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#Loading airlines data 
airlines = pd.read_csv('../Datasets/airlines_final.csv')
airlines.head(3)

Unnamed: 0.1,Unnamed: 0,id,day,airline,destination,dest_region,dest_size,boarding_area,dept_time,wait_min,cleanliness,safety,satisfaction
0,0,1351,Tuesday,UNITED INTL,KANSAI,Asia,Hub,Gates 91-102,2018-12-31,115.0,Clean,Neutral,Very satisfied
1,1,373,Friday,ALASKA,SAN JOSE DEL CABO,Canada/Mexico,Small,Gates 50-59,2018-12-31,135.0,Clean,Very safe,Very satisfied
2,2,2820,Thursday,DELTA,LOS ANGELES,West US,Hub,Gates 40-48,2018-12-31,70.0,Average,Somewhat safe,Neutral


In [3]:
# Creating the categories dataframe

#Creating the dictionary that will be used to create the dataframe
data = {'cleanliness' : ['Clean', 'Average', 'Somewhat clean', 'Somewhat dirty', 'Dirty'],
        'safety': ['Neutral', 'Very safe', 'Somewhat safe', 'Very unsafe', 'Somewhat unsafe'],
        'satisfaction': ['Very satisfied', 'neutral', 'Somewhat satisfied', 'Somewhat unsatisfied', 'Very unsatisfied']
       }

#Creating the dataframe
categories = pd.DataFrame(data)
# Print categories DataFrame
print(categories)

      cleanliness           safety          satisfaction
0           Clean          Neutral        Very satisfied
1         Average        Very safe               neutral
2  Somewhat clean    Somewhat safe    Somewhat satisfied
3  Somewhat dirty      Very unsafe  Somewhat unsatisfied
4           Dirty  Somewhat unsafe      Very unsatisfied


In the airlines dataframe above, the entries in the columns `cleanliness`, `safety`, and `satisfaction` should match those in the categories dataframe.

Take a look at the output. Out of the cleanliness, safety and satisfaction columns, which one has an inconsistent category and what is it?

In [4]:
airlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2477 entries, 0 to 2476
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     2477 non-null   int64  
 1   id             2477 non-null   int64  
 2   day            2477 non-null   object 
 3   airline        2477 non-null   object 
 4   destination    2477 non-null   object 
 5   dest_region    2477 non-null   object 
 6   dest_size      2477 non-null   object 
 7   boarding_area  2477 non-null   object 
 8   dept_time      2477 non-null   object 
 9   wait_min       2477 non-null   float64
 10  cleanliness    2477 non-null   object 
 11  safety         2477 non-null   object 
 12  satisfaction   2477 non-null   object 
dtypes: float64(1), int64(2), object(10)
memory usage: 251.7+ KB


In [5]:
#Checking the inconsistencies across the two dataframes categories and airlines 
inconsistent_categories = set(airlines['cleanliness']).difference(categories['cleanliness'])
print(inconsistent_categories)

inconsistent_categories1 = set(airlines['safety']).difference(categories['safety'])
print(inconsistent_categories1)

inconsistent_categories2 = set(airlines['satisfaction']).difference(categories['satisfaction'])
print(inconsistent_categories2)

set()
set()
{'Neutral', 'Somewhat satsified'}


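Two inconsistent entries were found in the `satisfaction` column: 'Neutral' and 'Somewhat satsified'. 'Neutral' is flagged only because the categories dataframe lists it in lowercase as 'neutral', whereas 'Somewhat satsified' is a misspelling of 'Somewhat satisfied'.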
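One way to separate a capitalization mismatch from a genuine misspelling is to repeat the comparison on lowercased values. The sketch below uses small stand-in Series rather than the airlines data; `observed` and `allowed` are hypothetical names, and it assumes that differences in case should be ignored.

```python
import pandas as pd

# Stand-in values for illustration only; this is not the airlines dataset
observed = pd.Series(['Very satisfied', 'Neutral', 'Somewhat satsified', 'Neutral'])
allowed = pd.Series(['Very satisfied', 'neutral', 'Somewhat satisfied'])

# Exact comparison flags both the case mismatch and the misspelling
exact_diff = set(observed).difference(allowed)

# Comparing lowercased values leaves only the genuine misspelling
lower_diff = set(observed.str.lower()).difference(allowed.str.lower())

print(exact_diff)
print(lower_diff)
```

Anything that survives the lowercased comparison needs a manual correction, while the rest can be handled by normalizing case on one side.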

In [6]:
# Find the satisfaction categories in airlines that are not in categories
cat_satisfaction = set(airlines['satisfaction']).difference(categories['satisfaction'])
print(cat_satisfaction)

# Find rows with that category
cat_satisfaction_rows = airlines['satisfaction'].isin(cat_satisfaction)
print(airlines[cat_satisfaction_rows].head())
# Length of the boolean mask (one entry per row in airlines)
len(cat_satisfaction_rows)

{'Neutral', 'Somewhat satsified'}
   Unnamed: 0    id        day     airline  destination    dest_region  \
2           2  2820   Thursday       DELTA  LOS ANGELES        West US   
3           3  1157    Tuesday   SOUTHWEST  LOS ANGELES        West US   
4           4  2992  Wednesday    AMERICAN        MIAMI        East US   
6           6  2578   Saturday     JETBLUE   LONG BEACH        West US   
7           8  2592   Saturday  AEROMEXICO  MEXICO CITY  Canada/Mexico   

  dest_size boarding_area   dept_time  wait_min     cleanliness  \
2       Hub   Gates 40-48  2018-12-31      70.0         Average   
3       Hub   Gates 20-39  2018-12-31     190.0           Clean   
4       Hub   Gates 50-59  2018-12-31     559.0  Somewhat clean   
6     Small    Gates 1-12  2018-12-31      63.0           Clean   
7       Hub    Gates 1-12  2018-12-31     215.0  Somewhat clean   

          safety        satisfaction  
2  Somewhat safe             Neutral  
3      Very safe  Somewhat satsified  
4

2477
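The `2477` printed above is the length of the whole boolean mask, i.e. every row in `airlines`, not the number of rows that were flagged. A minimal sketch of how the flagged rows could be counted and then either dropped or corrected, using a made-up toy DataFrame (`df`, `bad_values`, and `mask` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy data for illustration only; values are made up
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'satisfaction': ['Very satisfied', 'Somewhat satsified',
                     'Neutral', 'Somewhat satsified'],
})
bad_values = {'Somewhat satsified'}

# Boolean mask: True where the value is one of the inconsistent categories
mask = df['satisfaction'].isin(bad_values)

# Summing a boolean mask counts the True entries
n_flagged = int(mask.sum())
print(n_flagged, 'of', len(mask), 'rows flagged')

# Option 1: drop the flagged rows
dropped = df[~mask]

# Option 2: keep the rows and correct the misspelled label
fixed = df.copy()
fixed['satisfaction'] = fixed['satisfaction'].replace(
    {'Somewhat satsified': 'Somewhat satisfied'}
)
```

Correcting is usually preferable when the intended value is obvious, as with a simple typo; dropping is a fallback when it is not.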

#### Examining Dest Region (dest_region) and Dest size (dest_size) columns

In [7]:
#Looking at all the different destination regions in the data 
airlines['dest_region'].unique()

array(['Asia', 'Canada/Mexico', 'West US', 'East US', 'Midwest US',
       'EAST US', 'Middle East', 'Europe', 'eur', 'Central/South America',
       'Australia/New Zealand', 'middle east'], dtype=object)

The output above shows that some regions are coded inconsistently. 'East US' appears twice, once in title case and once in all caps ('EAST US'), and the same is true of 'Middle East' and 'middle east'. Europe is coded both as 'Europe' and as 'eur'. Each pair refers to the same region, so these issues are addressed in the next cells.

In [8]:
#Converting the strings in dest_region to lower
airlines['dest_region'] = airlines['dest_region'].str.lower()
print(airlines['dest_region'].unique())
print('')
#Number of different dest_regions
print(airlines['dest_region'].nunique(), 'different regions')

['asia' 'canada/mexico' 'west us' 'east us' 'midwest us' 'middle east'
 'europe' 'eur' 'central/south america' 'australia/new zealand']

10 different regions


In [9]:
#Changing eur to europe
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})
print(airlines['dest_region'].unique())
print('')
#Number of different dest_regions
print(airlines['dest_region'].nunique(), 'different regions after cleaning')

['asia' 'canada/mexico' 'west us' 'east us' 'midwest us' 'middle east'
 'europe' 'central/south america' 'australia/new zealand']

9 different regions after cleaning


In [10]:
# Examining dest size column

#Looking at the different destination sizes in the dest size column
airlines['dest_size'].unique()

array(['Hub', 'Small', '    Hub', 'Medium', 'Large', 'Hub     ',
       '    Small', 'Medium     ', '    Medium', 'Small     ',
       '    Large', 'Large     '], dtype=object)

Many of the values carry needless leading or trailing whitespace, e.g. '    Hub' and 'Medium     ', so the same size is counted as several distinct categories.
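The padding could be removed with the pandas `.str.strip()` method. A minimal sketch on a stand-in Series (`sizes` is a hypothetical name holding a few of the values shown above, not the full column):

```python
import pandas as pd

# A few of the padded values from the output above, for illustration only
sizes = pd.Series(['Hub', '    Hub', 'Medium     ', '    Small', 'Large     '])

# Remove leading and trailing whitespace from every string
cleaned = sizes.str.strip()

print(cleaned.unique())
print(cleaned.nunique(), 'different sizes after stripping')
```

On the notebook's data, the equivalent step would presumably be `airlines['dest_size'] = airlines['dest_size'].str.strip()`, followed by another call to `.unique()` to confirm that only the four sizes remain.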

In [None]:
airlines