## Data Cleaning II

This notebook continues the Data Cleaning I project and covers additional data cleaning techniques.

#### Text & Categorical Data Problems
Categorical and text data are often among the messiest parts of a dataset because they are unstructured. In this notebook, I'll try to:
- fix whitespace and capitalization inconsistencies in category labels
- collapse multiple categories into one
- reformat strings for consistency
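
As a rough illustration of the three goals listed above, here is a minimal sketch on a small made-up DataFrame. The `toy` frame, its values, and the `day_type` mapping are hypothetical and are not taken from the airlines data; only standard pandas string methods are used.

```python
import pandas as pd

# Hypothetical toy data (assumption: not the real airlines dataset)
toy = pd.DataFrame({
    "dest_size": ["  Hub", "Small", "Medium  ", "hub "],
    "day": ["Monday", "Saturday", "Sunday", "Tuesday"],
    "boarding_area": ["Gates 91-102", "Gates 50-59", "Gates  40-48", "Gates 1-12"],
})

# 1. Fix whitespace and capitalization inconsistencies in category labels
toy["dest_size"] = toy["dest_size"].str.strip().str.capitalize()

# 2. Collapse multiple categories into one using a mapping;
#    days missing from the mapping fall back to "weekday"
day_type = {"Saturday": "weekend", "Sunday": "weekend"}
toy["day_type"] = toy["day"].map(day_type).fillna("weekday")

# 3. Reformat strings for consistency: drop the "Gates" prefix
#    and any whitespace that follows it
toy["gate_range"] = toy["boarding_area"].str.replace(r"^Gates\s+", "", regex=True)

print(toy)
```

Whether `capitalize()`, `lower()`, or `title()` is the right normalization depends on how the reference labels are written, so that choice is an assumption here.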

The dataset used here contains airline customers' answers to survey questions about San Francisco Airport. The DataFrame includes flight metadata such as the airline, the destination, and waiting times, as well as answers to key questions about cleanliness, safety, and satisfaction. A second DataFrame named `categories` is created below; it holds the valid values for each survey column.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
#Loading airlines data 
airlines = pd.read_csv('../Datasets/airlines_final.csv')
airlines.head(3)

Unnamed: 0.1,Unnamed: 0,id,day,airline,destination,dest_region,dest_size,boarding_area,dept_time,wait_min,cleanliness,safety,satisfaction
0,0,1351,Tuesday,UNITED INTL,KANSAI,Asia,Hub,Gates 91-102,2018-12-31,115.0,Clean,Neutral,Very satisfied
1,1,373,Friday,ALASKA,SAN JOSE DEL CABO,Canada/Mexico,Small,Gates 50-59,2018-12-31,135.0,Clean,Very safe,Very satisfied
2,2,2820,Thursday,DELTA,LOS ANGELES,West US,Hub,Gates 40-48,2018-12-31,70.0,Average,Somewhat safe,Neutral


In [7]:
# Creating the categories dataframe

#Creating the dictionary that will be used to create the dataframe
data = {'cleanliness' : ['Clean', 'Average', 'Somewhat clean', 'Somewhat dirty', 'Dirty'],
        'safety': ['Neutral', 'Very Safe', 'Somewhat safe', 'Very unsafe', 'Somewhat unsafe'],
        'satisfaction': ['Very satisfied', 'neutral', 'Somewhat satisfied', 'Somewhat unsatisfied', 'Very unsatisfied']
       }

#Creating the dataframe
categories = pd.DataFrame(data)
# Print categories DataFrame
print(categories)

      cleanliness           safety          satisfaction
0           Clean          Neutral        Very satisfied
1         Average        Very Safe               neutral
2  Somewhat clean    Somewhat safe    Somewhat satisfied
3  Somewhat dirty      Very unsafe  Somewhat unsatisfied
4           Dirty  Somewhat unsafe      Very unsatisfied


In the airlines dataframe above, the entries in the columns `cleanliness`, `safety`, and `satisfaction` should match those in the categories dataframe.

Take a look at the output. Out of the cleanliness, safety and satisfaction columns, which one has an inconsistent category and what is it?

In [8]:
airlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2477 entries, 0 to 2476
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     2477 non-null   int64  
 1   id             2477 non-null   int64  
 2   day            2477 non-null   object 
 3   airline        2477 non-null   object 
 4   destination    2477 non-null   object 
 5   dest_region    2477 non-null   object 
 6   dest_size      2477 non-null   object 
 7   boarding_area  2477 non-null   object 
 8   dept_time      2477 non-null   object 
 9   wait_min       2477 non-null   float64
 10  cleanliness    2477 non-null   object 
 11  safety         2477 non-null   object 
 12  satisfaction   2477 non-null   object 
dtypes: float64(1), int64(2), object(10)
memory usage: 251.7+ KB


In [9]:
#Checking the inconsistencies across the two dataframes categories and airlines 
inconsistent_categories = set(airlines['cleanliness']).difference(categories['cleanliness'])
print(inconsistent_categories)

inconsistent_categories1 = set(airlines['safety']).difference(categories['safety'])
print(inconsistent_categories1)

inconsistent_categories2 = set(airlines['satisfaction']).difference(categories['satisfaction'])
print(inconsistent_categories2)

set()
{'Very safe'}
{'Neutral', 'Somewhat satsified'}


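The output shows that the `cleanliness` entries in the airlines dataframe all match those in the categories dataframe. The `safety` column has one inconsistent entry, 'Very safe', and the `satisfaction` column has two, 'Neutral' and 'Somewhat satsified'. Note that 'Very safe' and 'Neutral' differ from the categories entries 'Very Safe' and 'neutral' only in capitalization, whereas 'Somewhat satsified' is a misspelling of 'Somewhat satisfied'.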
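
The set differences above list the offending values but not the rows that contain them. One possible next step, sketched here on small stand-in frames (`airlines_demo` and `categories_demo` are hypothetical names with made-up rows, not the real data), is to build a boolean mask with `isin` and to repeat the comparison while ignoring case, which separates capitalization mismatches from genuine misspellings.

```python
import pandas as pd

# Hypothetical stand-ins for the airlines and categories DataFrames
airlines_demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "satisfaction": ["Very satisfied", "Neutral",
                     "Somewhat satsified", "Very unsatisfied"],
})
categories_demo = pd.DataFrame({
    "satisfaction": ["Very satisfied", "neutral", "Somewhat satisfied",
                     "Somewhat unsatisfied", "Very unsatisfied"],
})

# Values present in the survey data but absent from the reference list
inconsistent = set(airlines_demo["satisfaction"]).difference(
    categories_demo["satisfaction"]
)

# Boolean mask marking the rows that hold an inconsistent value
mask = airlines_demo["satisfaction"].isin(list(inconsistent))
inconsistent_rows = airlines_demo[mask]
consistent_rows = airlines_demo[~mask]

# Repeat the comparison ignoring case: anything left is not just
# a capitalization difference
allowed_lower = set(categories_demo["satisfaction"].str.lower())
still_inconsistent = {v for v in inconsistent if v.lower() not in allowed_lower}

print(inconsistent_rows)
print(still_inconsistent)
```

In this sketch only the misspelled value survives the case-insensitive check, which suggests handling capitalization first and then mapping the remaining values by hand.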