## Cleaning Data

Cleaning data can involve any of the following tasks:

- rename columns
- rename the index
- remove irrelevant columns
- split one column into two
- combine two or more columns into one
- remove non-data rows
- remove repeated rows
- remove rows with missing data (aka NaN)
- replace NaN data with a single value
- replace NaN data via interpolation
- standardize strings
- fix typos in strings
- remove whitespace from strings
- correct the types used for columns
- identify and remove outliers
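
Several of the tasks listed above can be chained together in pandas. The following is a minimal sketch on a small, made-up data frame; the column names and values are hypothetical and are not taken from either data set used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with typical problems: an awkward column name,
# a repeated row, missing values, stray whitespace, and numbers stored as strings.
df = pd.DataFrame({
    'Full Name': [' Ann Lee', 'Bob Ray ', 'Bob Ray ', 'Cy Dunn'],
    'qty': ['1', '2', '2', None],
    'price': [10.0, np.nan, np.nan, 30.0],
    'notes': ['x', 'y', 'y', 'z'],
})

cleaned = (
    df
    .rename(columns={'Full Name': 'name'})          # rename columns
    .drop(columns=['notes'])                        # remove irrelevant columns
    .drop_duplicates()                              # remove repeated rows
    .assign(name=lambda d: d['name'].str.strip())   # remove whitespace from strings
    .dropna(subset=['qty'])                         # remove rows missing a required value
    .astype({'qty': 'int64'})                       # correct the type used for a column
)

# Replace the remaining NaN data with a single value
cleaned['price'] = cleaned['price'].fillna(0.0)

# Split one column into two
cleaned[['first', 'last']] = cleaned['name'].str.split(' ', expand=True)

print(cleaned)
```

Which steps apply, and in what order, depends on the data; this only illustrates the method calls involved.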

In [17]:
import pandas as pd

# Import the nyc violations data
nyc_violations_df = pd.read_csv(
    r'data\nyc_violations_2020.csv',
    usecols=['Plate ID', 'Registration State', 'Vehicle Make',
             'Vehicle Color', 'Violation Time', 'Street Name'],
)

# Get Data info
nyc_violations_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12495734 entries, 0 to 12495733
Data columns (total 6 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   Plate ID            object
 1   Registration State  object
 2   Vehicle Make        object
 3   Violation Time      object
 4   Street Name         object
 5   Vehicle Color       object
dtypes: object(6)
memory usage: 572.0+ MB


Remove rows with any missing data (i.e., a NaN value). How many rows remain after
doing this pruning? If each parking ticket brings $100 into the city, and missing data
means that the ticket can be successfully contested, how much money might New York
City lose as a result of such missing data?

> Removing rows with missing data

In [None]:
# First, count the missing values in each column
nyc_violations_df.isnull().sum()

In [None]:
# Drop every row that contains at least one NaN
no_na_violations_df = nyc_violations_df.dropna()

no_na_violations_df.info() 

In [None]:
# How many rows have been removed?
del_rows = nyc_violations_df.shape[0] - no_na_violations_df.shape[0]

print(f"Money lost by New York due to missing data is ${(del_rows * 100):,.2f}")

Let’s instead assume that a ticket can only be dismissed if the license plate, state, car
make, and/or street name are missing. Remove rows that are missing one or more of
these. How many rows remain? Assuming $100/ticket, how much money would the city
lose as a result of this missing data?

In [None]:
# Drop rows if these columns contain an NaN
new_violations_df = nyc_violations_df.dropna(subset=['Plate ID', 'Registration State', 'Vehicle Make', 'Street Name'])


# How many rows have been removed?
del_rows = nyc_violations_df.shape[0] - new_violations_df.shape[0]

print(f"Money lost by New York due to missing data is ${(del_rows * 100):,.2f}")

Now let’s assume that tickets can be dismissed only if the license plate, state, and/or street
name are missing. That is the same as the previous question, but without requiring the
make of car. Remove rows that are missing one or more of these. How many rows
remain? Assuming $100/ticket, how much money would the city lose as a result of this
missing data?

In [None]:
# Drop rows if these columns contain an NaN
new_violations_df = nyc_violations_df.dropna(subset=['Plate ID', 'Registration State', 'Street Name'])


# How many rows have been removed?
del_rows = nyc_violations_df.shape[0] - new_violations_df.shape[0]

print(f"Money lost by New York due to missing data is ${(del_rows * 100):,.2f}")
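
The three scenarios above differ only in which columns are passed to `dropna` via `subset`. As a rough sketch, the calculation could be wrapped in a small helper. The function name `lost_revenue`, the constant `FINE`, and the toy data frame are all hypothetical, introduced here only for illustration.

```python
import pandas as pd

FINE = 100  # dollars per ticket, as assumed in the exercise

def lost_revenue(df, required=None):
    """Return the dollars lost if rows missing any `required` column are dismissed.

    With required=None, a missing value in any column dismisses the ticket.
    """
    kept = df.dropna(subset=required)
    return (len(df) - len(kept)) * FINE

# Toy stand-in for the violations data (hypothetical values)
toy = pd.DataFrame({
    'Plate ID': ['ABC123', None, 'XYZ789', 'JKL456'],
    'Registration State': ['NY', 'NJ', 'NY', 'CT'],
    'Vehicle Make': ['FORD', 'HONDA', None, 'BMW'],
    'Street Name': ['Broadway', '5th Ave', 'Canal St', 'Wall St'],
    'Vehicle Color': ['WH', 'BK', 'GY', None],
})

any_missing = lost_revenue(toy)
four_cols = lost_revenue(toy, ['Plate ID', 'Registration State', 'Vehicle Make', 'Street Name'])
three_cols = lost_revenue(toy, ['Plate ID', 'Registration State', 'Street Name'])

print(any_missing, four_cols, three_cols)
```

The fewer columns that are required, the fewer rows are dropped, so the estimated loss can only stay the same or shrink from one scenario to the next.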

When we have NaN values, we have a few options:
- remove them
- leave them
- replace them with something else

What is the right choice? The answer, of course, is "it depends." If you’re getting your data ready
to feed into a machine-learning model, then you’ll likely need to get rid of the NaN values, either
by removing those rows or by replacing them with something else. If you’re calculating basic
sales information, then you might be OK with null values, since they aren’t going to affect your
numbers too much. And of course, there are many variations on these.
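
The three options can be compared side by side on a small series. This is a sketch with made-up values; `interpolate` is shown as one possible way of "replacing with something else," not as a recommendation.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

removed = s.dropna()            # option 1: remove the NaN values
mean_with_nan = s.mean()        # option 2: leave them; mean() skips NaN by default
filled = s.fillna(0)            # option 3a: replace NaN with a single value
interpolated = s.interpolate()  # option 3b: replace NaN via linear interpolation

print(removed.tolist())
print(mean_with_nan)
print(filled.tolist())
print(interpolated.tolist())
```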

In this exercise, we are going to fill in missing data from the famous Titanic data set—a table of
all passengers on that famous, doomed ship. Many of the columns in this file are complete, but
several are missing data. It’ll be up to you to decide whether and how to fill in that missing data.

In [2]:
import pandas as pd

# Load the Titanic dataset

titanic_df = pd.read_csv(r"data\titanic.csv")

titanic_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [8]:
# Which columns have null values

titanic_df.columns[titanic_df.isnull().sum() > 1]

Index(['age', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'], dtype='object')
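
Knowing which columns contain nulls is a start, but the decision usually depends on how many values are missing. A minimal sketch of a per-column summary follows, using a toy data frame with hypothetical values. Note that a threshold of `> 0` also catches columns with exactly one missing value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values)
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, np.nan],
    'fare': [211.3, 151.6, np.nan, 151.6],
    'sex': ['female', 'male', 'female', 'male'],
})

summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),                       # count of NaN per column
    'pct_missing': (toy.isnull().mean() * 100).round(1),   # share of NaN per column
})

# Keep only the columns that have at least one missing value
with_nulls = summary[summary['n_missing'] > 0]
print(with_nulls)
```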

For each column containing null values, decide whether you will fill it with a value—and if so, then with what value, whether it’s calculated or otherwise.

Deciding what we should do with each NaN-containing column depends on a variety of factors,
including the type of data that the column contains. Another factor is just how many rows have
null values. In two cases, fare and embarked, we have one and two null rows, respectively.
Given that our data frame has more than 1,300 rows, missing 1 or 2 of them won’t make any
significant difference. I thus suggest that we remove those rows from the data frame:

In [9]:
# Removing Null values in Fare and Embarked

titanic_df = titanic_df.dropna(subset=['fare', 'embarked'])

# Check for Columns Containing Nulls
titanic_df.columns[titanic_df.isnull().sum() > 1]

Index(['age', 'cabin', 'boat', 'body', 'home.dest'], dtype='object')

When it comes to the age column, though, we might want to consider our steps carefully. I’m
inclined to use the mean here. But you could use the mode. You could also use a more sophisticated technique, using the mean from within a particular cabin. You could even try to get
the complete set of ages on the Titanic, and choose from a random distribution built from that.
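
The "mean within a group" idea can be sketched with `groupby` and `transform`. Grouping by `pclass` here is an assumption made for illustration, and the values are made up; the same pattern should work with any grouping column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'pclass': [1, 1, 1, 3, 3, 3],
    'age': [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# For every row, the mean age of that row's group (NaN values are skipped)
group_means = toy.groupby('pclass')['age'].transform('mean')

# Fill each missing age with the mean of its own group
toy['age'] = toy['age'].fillna(group_means)

print(toy)
```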

Using the mean age has some advantages: It won’t affect the mean age, although it will reduce
the standard deviation. It’s not necessarily wrong, even though we know that it’s not totally right,
either. In another context, such as sales of a particular product in an online store, replacing
missing values with the mean can sometimes work, especially if you have similar products with a
similar sales history.
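
The effect on the mean and the standard deviation can be checked on a tiny made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])
filled = s.fillna(s.mean())

# The mean is unchanged, because every filled value equals the mean
print(s.mean(), filled.mean())

# The standard deviation shrinks, because the filled values add no spread
print(s.std(), filled.std())
```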

In [None]:
# Fill the NaN values in the age column with the mean age

titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())

- First, we calculate titanic_df['age'].mean(). pandas ignores NaN values by default, which means that this calculation is based on the non-null numeric values in that column.

- Next, we run fillna on titanic_df['age']. And what value do we want to put instead of NaN? What we just calculated, the mean of titanic_df['age']. And yes, it looks a bit confusing to use titanic_df['age'] twice.

- The result of titanic_df['age'].fillna is a new series, which we then assign back to
titanic_df['age'], replacing the original values.

> In the end, we’ve replaced any NaN values in titanic_df['age'] with the mean of the existing values.

Finally, I want to set the home.dest column similarly to what I did with the age column—but
instead of using the mean, I’ll use the mode (i.e., the most common value). I’ll do this for two
reasons: First, because you can only calculate the mean from a numeric value, and the destination
is a categorical/textual value. Secondly, because this means that given no other information, we
might be able to assume that a passenger is going where most others are going. We might be
wrong, but this is the least wrong choice that we can make. We could, of course, be a bit more
sophisticated than this, choosing the mode of home.dest for all passengers who embarked at the
same place, but we’ll ignore that for now.

In [10]:
# mode() returns a Series (there can be ties), so take its first element as the fill value
titanic_df['home.dest'] = titanic_df['home.dest'].fillna(titanic_df['home.dest'].mode()[0])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  titanic_df['home.dest'] = titanic_df['home.dest'].fillna(titanic_df['home.dest'].mode()[0])
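
One detail worth keeping in mind when filling with the mode: `Series.mode()` returns a Series rather than a single value, because several values can tie for most common. Passing that Series to `fillna` aligns on index labels instead of filling every NaN. The sketch below uses made-up destinations to show the difference.

```python
import numpy as np
import pandas as pd

dest = pd.Series(['New York, NY', np.nan, 'London', 'New York, NY', np.nan])

modes = dest.mode()                    # a Series holding the most common value(s)

# Filling with the Series aligns on index labels: only a NaN at label 0
# could be filled, so both NaN values here are left in place.
aligned = dest.fillna(modes)

# Filling with the first mode as a scalar replaces every NaN.
scalar = dest.fillna(modes.iloc[0])

print(aligned.isnull().sum(), scalar.isnull().sum())
```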


## Dealing with Inconsistent Data

In [19]:
# Displaying the Vehicle Color Column
nyc_violations_df['Vehicle Color'].unique()[1:30]

array(['BLK', 'BLACK', nan, 'GREY', 'GY', 'WH', 'RED', 'GRAY', 'TAN',
       'SILVE', 'BN', 'WHT', 'WHITE', 'GRN', 'Y', 'BL', 'GRY', 'BLUE',
       'BROWN', 'BR', 'BLU', 'RD', 'YW', 'SL', 'GN', 'GR', 'BW', 'GREEN',
       'W'], dtype=object)

In [20]:
# Count how often each vehicle color value appears

nyc_violations_df['Vehicle Color'].value_counts().head(50)

WH       2344858
GY       2307704
BK       2066374
WHITE    1061234
BL        775124
RD        483298
BLACK     465110
GREY      306787
BROWN     292348
SILVE     191477
GR        182929
BLUE      178298
RED       161693
TN        120576
BR        102204
YW         98700
BLK        91539
OTHER      60245
GREEN      58765
GL         54851
GRY        46527
MR         42812
GRAY       40854
WHT        35433
YELLO      32792
WHI        29760
OR         28100
BK.        27830
WT         25583
WT.        24593
GY.        22460
GOLD       21687
TAN        21091
SIL        20116
BLU        15240
SL.        13145
LTGY       13055
ORANG      11506
SL         10343
LTG        10093
BL.         9649
LT/         8976
PR          7518
DK/         7498
W           7367
RD.         7128
DKGY        6004
GYGY        5039
BLK.        4853
GRN         4829
Name: Vehicle Color, dtype: int64

In [15]:
# Translation table for cleaning the vehicle color column
colormap = {
    'WH': 'WHITE', 'GY': 'GRAY', 'BK': 'BLACK',
    'BL': 'BLUE', 'RD': 'RED', 'SILVE': 'SILVER', 'GR': 'GRAY',
    'TN': 'TAN', 'BR': 'BROWN', 'YW': 'YELLO', 'BLK': 'BLACK',
    'GRY': 'GRAY', 'WHT': 'WHITE', 'WHI': 'WHITE', 'OR': 'ORANGE',
    'BK.': 'BLACK', 'WT': 'WHITE', 'WT.': 'WHITE',
}

nyc_violations_df['Vehicle Color'] = nyc_violations_df['Vehicle Color'].replace(colormap)

nyc_violations_df['Vehicle Color'].value_counts().shape[0]

1880

In [16]:
# Count how often each vehicle color value appears after the replacement
nyc_violations_df['Vehicle Color'].value_counts().head(50)

WHITE     3521461
BLACK     2650853
GRAY      2578014
BLUE       953422
RED        644991
BROWN      394552
GREY       306787
SILVER     191477
TAN        141667
YELLO      131492
OTHER       60245
GREEN       58765
GL          54851
MR          42812
ORANGE      28100
GY.         22460
GOLD        21687
SIL         20116
BLU         15240
SL.         13145
LTGY        13055
ORANG       11506
SL          10343
LTG         10093
BL.          9649
LT/          8976
PR           7518
DK/          7498
W            7367
RD.          7128
DKGY         6004
GYGY         5039
BLK.         4853
GRN          4829
B            4145
WH.          3811
BRO          3802
DKG          3702
PURPL        3635
BRN          3582
BKGY         3504
WHBL         3489
DKBL         2912
GN           2883
WHT.         2796
BN           2787
BLUE.        2638
WHGY         2381
UNKNO        2205
RED.         2141
Name: Vehicle Color, dtype: int64
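
Many of the remaining variants differ only by a trailing period (for example `GY.`, `SL.`, `RD.`). One possible next step, sketched here on a small made-up series, is to normalize the strings before applying a translation table, so that the table needs fewer entries. The mapping below is hypothetical, and treating `GREY` and `GRAY` as the same color is an assumption.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['WH', 'WH.', ' bk', 'BK.', 'GREY', 'GRAY', np.nan, 'BLK.'])

normalized = (
    colors
    .str.strip()        # remove surrounding whitespace
    .str.upper()        # standardize case
    .str.rstrip('.')    # drop trailing periods, so 'BK.' becomes 'BK'
)

# Hypothetical translation table applied after normalization
colormap = {'WH': 'WHITE', 'BK': 'BLACK', 'BLK': 'BLACK', 'GREY': 'GRAY'}
cleaned = normalized.replace(colormap)

print(cleaned.value_counts())
```

The `.str` methods pass NaN through unchanged, so missing colors stay missing and can be handled separately.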