# 3.4 Cleaning Data

Cleaning data is one of the most common uses of pandas. It is often said that data scientists spend 40 to 50 percent of their time cleaning data, a necessary step before analysis can begin.

### What is "dirty data"?

Rows of data are often unusable for analysis because they (1) *contain null values*, (2) *do not match the datatype* of other values in the column, or (3) *use a format that is inconsistent with other, similar values*. Data is called "dirty" when values like these appear in the dataset.

##### Contains null values
Null values can be problematic because they can skew analysis. If you have a dataset that is 100 rows long, for example, but only three of those rows have a value for the "Age" column, the `.mean()` calculation will likely not be representative of the entire population. There are several ways to deal with null values, although we will not go over them extensively in this notebook. Many data scientists use **imputation** (calculating a substitute value and assigning it to each missing entry) to fill in the blanks, and others simply **drop** the null rows or the entire column. The decision of "what to do" depends on the situation and the context of the data.
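
As a rough illustration of the 100-row example above, the sketch below builds a hypothetical "Age" Series (the three ages are invented for the demonstration) and shows that `.mean()` skips null values by default, so the result rests on only three observations.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows, only three of which have an age recorded
ages = pd.Series([25.0, 31.0, 64.0] + [np.nan] * 97, name="Age")

non_null = ages.count()  # number of non-null values
mean_age = ages.mean()   # NaN values are skipped by default

print(non_null, mean_age)
```

Checking `.count()` alongside `.mean()` is one simple way to see how much data an aggregate is actually based on.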

##### Data type doesn't match
If the data types in a column don't match, there could be problems with analysis. For example, in a dataset that records the number of bathrooms in a house, one row could list a house with an integer (1) and another house with a string ("two"). The values `1` and `"two"` are different data types. Either both need to be converted to the same data type, or one needs to be removed from the dataset.
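
One possible way to handle a column like this is `pd.to_numeric` with `errors="coerce"`, which converts what it can and turns everything else into `NaN`. The sketch below uses a made-up "bathrooms" Series; it is only one option, and a word like "two" would still need to be mapped or dropped by hand.

```python
import pandas as pd

# Hypothetical column mixing integers and strings
bathrooms = pd.Series([1, "two", 3, "2"])
print(bathrooms.dtype)  # mixed values are stored with dtype `object`

# Values that cannot be parsed as numbers become NaN instead of raising an error
as_numbers = pd.to_numeric(bathrooms, errors="coerce")
print(as_numbers)
```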

##### Inconsistent format
Imagine that you are performing aggregations on a dataset and grouping by U.S. state. However, some of the values for "state" are recorded using the two-letter code (e.g., UT) and others are recorded with the full name of the state (Utah). Pandas won't know that these two values refer to the same place, so each one will become its own group.
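
The grouping problem described above can be sketched with a small, invented table. The column names and the `name_to_code` mapping are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical data: the same state is recorded in two different ways
sales = pd.DataFrame({
    "state": ["UT", "Utah", "UT", "NV", "Nevada"],
    "amount": [10, 20, 30, 40, 50],
})

# Without cleaning, "UT" and "Utah" end up in separate groups
raw_totals = sales.groupby("state")["amount"].sum()

# Map full names to two-letter codes; values not in the mapping are left unchanged
name_to_code = {"Utah": "UT", "Nevada": "NV"}
sales["state_clean"] = sales["state"].replace(name_to_code)
clean_totals = sales.groupby("state_clean")["amount"].sum()

print(raw_totals)
print(clean_totals)
```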

In [None]:
import pandas as pd
df = pd.read_csv("./data/titanic.csv")

### Data contains null values
Pandas has several methods for looking at null values. Note that in Pandas, null values are usually displayed as `NaN` (not a number).

##### `.isna()`
The `.isna()` method can be applied to a Series or a dataframe. It returns an object of the same shape in which each value is a boolean: `True` if the original value was missing (null) and `False` otherwise.

In [None]:
df.isna().head()

We can count up the number of null values in each column by adding the `.sum()` aggregation method at the end.

In [None]:
df.isna().sum()
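
Summing works because each `True` counts as 1. By the same logic, `.mean()` gives the share of null values per column. The sketch below uses a small invented dataframe rather than the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with a few missing values
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

null_counts = toy.isna().sum()   # number of nulls in each column
null_share = toy.isna().mean()   # fraction of nulls in each column

print(null_counts)
print(null_share)
```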

When it comes to null values, there are several things we could do. If the data is categorical and there are few null values, we might simply assign the most common category to each null row. If the data is quantitative, we might assign each null value the average or median of the column. This is called *imputation*.
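
A minimal sketch of both kinds of imputation, assuming a small invented dataframe (the "Fare" and "Deck" values here are made up for illustration): the quantitative column is filled with its median and the categorical column with its most common category.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
toy = pd.DataFrame({
    "Fare": [7.25, np.nan, 8.05, 71.28],
    "Deck": ["C", None, "C", "E"],
})

# Quantitative column: fill with the median of the non-null values
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

# Categorical column: fill with the most common category
toy["Deck"] = toy["Deck"].fillna(toy["Deck"].mode().iloc[0])

print(toy)
```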

Another approach would be to drop rows with null values. This would reduce the overall number of observations but could avoid some of the inaccuracies that imputation can introduce.

Let's **impute** the two missing values for the "Embarked" column by assigning them the most commonly occurring value. We can do this by using the `.fillna()` method.

In [None]:
# `.mode()` returns a Series, so take its first item (the most common value)
most_common_embarked = df['Embarked'].mode().iloc[0]
# Assign the result back rather than calling `fillna(..., inplace=True)` on the selected column
df['Embarked'] = df['Embarked'].fillna(most_common_embarked)

In [None]:
df.isna().sum()

It looks like there are 177 null values in the "Age" column. That seems like a lot to impute, since there are fewer than 900 rows in total. Let's **drop** them instead, with the `.dropna()` method. This method accepts a parameter `how`, which can be "any" or "all": with "any", a row is dropped if *any* of the columns in the subset are null, and with "all", a row is dropped only if *all* columns in the subset are null. The `subset` parameter takes the list of columns to check for null values.

In [None]:
df.dropna(how='any', subset=['Age'], inplace=True)
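
To make the difference between `how='any'` and `how='all'` concrete, here is a small sketch on an invented dataframe (not the Titanic data), checking two columns in the subset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 is missing Age only,
# and row 2 is missing both Age and Cabin
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": ["C85", "E46", None],
    "Name": ["A", "B", "C"],
})

# Drop a row if ANY of the subset columns is null
dropped_any = toy.dropna(how="any", subset=["Age", "Cabin"])

# Drop a row only if ALL of the subset columns are null
dropped_all = toy.dropna(how="all", subset=["Age", "Cabin"])

print(len(dropped_any), len(dropped_all))
```

Neither call uses `inplace=True`, so `toy` itself is left unchanged and the two results can be compared side by side.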

In [None]:
df.isna().sum()

### Data type doesn't match

We can use the `.dtypes` property to see the data types of each column as automatically interpreted by Pandas. Note that data type `object` usually means "string".

In [None]:
df.dtypes

Observe that the "Age" column has the data type `float64`. That might be intentional, but if we wanted to group passengers by age, we might prefer to store each age as an integer.

We can convert the values in the column to an integer type using the `.astype()` method. Note that this conversion discards any fractional part, and it only works here because the rows with a null age have already been dropped.

In [None]:
df['Age'] = df['Age'].astype('int64')
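
One thing to keep in mind is that `.astype('int64')` truncates toward zero rather than rounding. The sketch below uses three invented ages to show the difference; whether truncating or rounding is appropriate depends on the analysis.

```python
import pandas as pd

# Hypothetical ages, including fractional values
ages = pd.Series([0.42, 28.7, 35.0])

# Converting directly drops the fractional part
truncated = ages.astype("int64")

# Rounding first gives the nearest whole number instead
rounded = ages.round().astype("int64")

print(truncated.tolist(), rounded.tolist())
```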

In [None]:
df.dtypes

### Inconsistent format

In this dataset, it appears that tickets have no standardization. Some of them start with digits and some of them start with letters. Most of them, however, seem to end with a string of digits. We can trim off the leading letters and preserve just the string of digits by applying a function to the column.

In [None]:
df.head()

In [None]:
def getTicketNumber(rawTicket):
    # Split the ticket on spaces and keep the last piece. A ticket with no
    # spaces yields a one-item list, so the last piece is the whole ticket.
    return rawTicket.split(" ")[-1]

df['TicketNumber'] = df['Ticket'].apply(getTicketNumber)
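
The same result can also be sketched with pandas' vectorized `.str` accessor instead of `.apply()`. The ticket values below are invented for illustration, including one with no digits at all, as a hypothetical edge case that is worth checking for after cleaning.

```python
import pandas as pd

# Hypothetical ticket values in a few different formats
tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282", "LINE"])

# Split each ticket on spaces and keep the last piece
ticket_numbers = tickets.str.split(" ").str[-1]

# Flag results that are not purely digits so they can be reviewed
is_numeric = ticket_numbers.str.isdigit()

print(ticket_numbers.tolist())
print(is_numeric.tolist())
```

A check like `is_numeric` is a cheap way to confirm that a cleaning rule did what was expected before relying on the new column.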

In [None]:
df.head()

There are other situations in which your data may need cleaning, and there are many more Pandas methods and parameters available to use. The more familiar you become with Pandas, the better you will be able to clean your data.