# 3.5 Cleaning Data

Cleaning data is one of the primary uses of pandas. It is often said that data scientists spend 40-50% of their time cleaning data, a necessary step before analysis can occur.

### What is "dirty data"?

Rows of data are often unusable in an analysis because they (1) *contain null values*, (2) *do not match the data type* of other values in the column, (3) are *formatted inconsistently with other, similar values*, or (4) *contain extreme or unexpected values*. Data is called "dirty" when values like these appear in the dataset.

##### Contains null values
Null values can be problematic because they can skew analysis. If you have a dataset that is 100 rows long, for example, but only three of those rows have a value for the "Age" column, the `.mean()` calculation will likely not be representative of the entire population. There are several ways to deal with null values, although we will not go over them extensively in this notebook. Many data scientists use **imputation** (calculating and assigning a value for each missing entry) to fill in the blanks, and others simply **drop** the null rows or the entire column. The decision of "what to do" really depends on the situation and the context of the data.

In the event that very few values are available in the data set, it may be best to just drop the column, since there may be no way to know what the values should be. If only a few rows are missing values, you might decide to fill them in with either the most frequently occurring value or the average/median value.

##### Data type doesn't match
If the data types in a column don't match, there could be problems with analysis. For example, in a dataset that records the number of bathrooms in a house, one row could list a house with an integer (1) and another house with a string ("two"). The values `1` and `"two"` are different data types. They either both need to be converted to be the same data type, or one needs to be removed from the data set.
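
As a minimal sketch of the bathrooms example, the snippet below builds a small, made-up Series that mixes integers with the string "two" and converts everything to integers. The data and the `word_to_number` lookup are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: number of bathrooms recorded inconsistently
bathrooms = pd.Series([1, "two", 3, 2])

# A column that mixes integers and strings is stored with the generic `object` dtype
print(bathrooms.dtype)

# Hypothetical lookup for the word values we expect to see
word_to_number = {"two": 2}

# Swap known words for numbers (leaving other values alone), then convert to integers
bathrooms_clean = bathrooms.map(lambda value: word_to_number.get(value, value)).astype("int64")
print(bathrooms_clean.tolist())  # [1, 2, 3, 2]
```

If a word appears that is not in the lookup, `.astype("int64")` raises an error, which is a useful signal that the column still needs attention.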

##### Inconsistent format
Imagine that you are performing aggregations on a dataset and grouping by U.S. state. However, some of the values for "state" are recorded using the two-letter code (e.g., "UT") and others are recorded with the full name of the state ("Utah"). Pandas wouldn't know that these two values refer to the same place, so each one will become its own group. Thus, these values need to be changed so that they are uniform.
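
The grouping problem can be sketched with a tiny, made-up `sales` dataframe. The column names and the `state_codes` lookup are hypothetical; a real lookup would need an entry for every state.

```python
import pandas as pd

# Hypothetical sales data with the state recorded in two different formats
sales = pd.DataFrame({
    "state": ["UT", "Utah", "ID", "Idaho", "UT"],
    "amount": [10, 20, 30, 40, 50],
})

# Before cleaning, "UT" and "Utah" form separate groups (four groups in total)
print(sales.groupby("state")["amount"].sum())

# Map full names to two-letter codes (hypothetical, incomplete lookup)
state_codes = {"Utah": "UT", "Idaho": "ID"}
sales["state"] = sales["state"].replace(state_codes)

# After cleaning, each state forms a single group
totals = sales.groupby("state")["amount"].sum()
print(totals)
```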

##### Extreme/unexpected/incorrect values
Sometimes, due to errors in the data collection process, values can be saved incorrectly to a dataset. These values may *technically* meet the requirements of the database; however, they logically don't make sense. 

For example, in a dataset about personal income where many rows describe yearly income in amounts greater than \$1,000, it would be unexpected to see a value of `1`. It might also be unexpected to see a negative number. As data analysts, we need to decide how to interpret and fix these rows. For example, we might simply be able to change the negative number to a positive number and be done, but because we can't know if `1` means \$1,000 or \$1,000,000, we might decide to simply drop this row.

Other examples of extreme/unexpected values would be a birth date that is in the future; letter grades that are not A, B, C, D, or F; and inventory amounts that are negative.

Values that are simply incorrect are difficult to detect. However, finding and fixing them is an essential step in the data cleaning process!
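
One way to hunt for values like these is to write each logical expectation as a boolean filter. The sketch below uses a made-up `people` dataframe; the column names and rules are assumptions chosen to mirror the examples above.

```python
import pandas as pd

# Hypothetical records whose values have valid types but make no logical sense
people = pd.DataFrame({
    "income": [52000, -48000, 1, 61000],
    "birth_date": pd.to_datetime(["1985-03-02", "1990-07-15", "2150-01-01", "1978-11-30"]),
    "grade": ["A", "B", "Z", "C"],
})

# Each rule flags the rows that break one logical expectation
negative_income = people["income"] < 0
future_birth = people["birth_date"] > pd.Timestamp.today()
bad_grade = ~people["grade"].isin(["A", "B", "C", "D", "F"])

# Rows that violate at least one rule deserve a closer look
suspicious = people.loc[negative_income | future_birth | bad_grade]
print(suspicious)
```

Rules like these only catch what you think to check for. An income of `1` would pass all three filters if it appeared in an otherwise ordinary row.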

### About the data

The data used in this notebook describes passengers on the *Titanic*, an ocean liner that set out from Southampton, U.K. to cross the Atlantic Ocean and tragically sank after colliding with an iceberg. The dataset contains each passenger's class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, ticket fare, cabin number, and port of embarkation. It also records whether each passenger survived. This data set is extremely popular among data scientists and will facilitate demonstrations of Pandas concepts.

In [34]:
import pandas as pd
df = pd.read_csv("./data/titanic_dirty.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Data contains null values
Pandas has several methods for looking at null values. Note that in Pandas, null values are called `NaN` (not a number).

##### `.isna()`
The `.isna()` method can be applied to a Series or a dataframe. It returns an object of the same shape in which each value is a boolean indicating whether the original value was missing (null).

In [35]:
df.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


We can count up the number of null values in each column by adding the `.sum()` aggregation method at the end.

In [36]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We can also see the rows that have null values by passing a filter that uses the `.isna()` method to the `.loc` property. Note that in the code below, we check only the `Embarked` column.

In [37]:
df.loc[df['Embarked'].isna()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


When it comes to null values, there are several things we could do. If the data is categorical and there are few null values, we might simply assign the most common category to each null row. If the data is quantitative, we might assign each null value the average or median of the column. This is called *imputation*.

Another approach would be to drop rows with null values. This would reduce the overall number of observations but could avoid some of the inaccuracies that imputation can introduce.
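
The trade-off between the two approaches can be sketched on a small, made-up Series of ages (the values are assumptions for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical quantitative column with two missing values
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

# Option 1: impute with the median of the observed values (30.0)
ages_imputed = ages.fillna(ages.median())
print(ages_imputed.tolist())  # [20.0, 30.0, 30.0, 40.0, 30.0]

# Option 2: drop the missing values, leaving fewer observations
ages_dropped = ages.dropna()
print(len(ages_dropped))  # 3
```

Imputation keeps all five observations but makes the data look more concentrated around the median than it may really be, while dropping keeps only values that were actually recorded.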

Let's **impute** the two missing values for the "Embarked" column by assigning those two values the most commonly occurring value. We can do this by getting the most common value with the `.mode()` method and then using the `.fillna()` method to fill in each null value with the mode.

Note that the `.mode()` method always returns a Series (since there can be multiple modes). In this case, we'll just get the first mode by adding `[0]` after `.mode()`.

In [38]:
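most_common_embarked = df['Embarked'].mode()[0] # `.mode()` returns a Series; take its first item
# Assign the result back to the column; calling `.fillna(..., inplace=True)` on a selected column may not update `df` in recent pandas versions
df['Embarked'] = df['Embarked'].fillna(most_common_embarked)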

In [39]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

It looks like there are 177 null values in the "Age" column. That seems like a lot to impute since there are fewer than 900 rows in total. Let's **drop** them instead, with the `.dropna()` method. This method accepts a parameter `how`, which can be "any" or "all": with "any", a row is dropped if *any* of the columns in the subset are null, and with "all", a row is dropped only if *all* columns in the subset are null. The `subset` parameter takes a list of columns in which to look for null values.

We will also use the `inplace` argument to update the original dataframe.

In [40]:
df.dropna(how='any', subset=['Age'], inplace=True)
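
With a single column in `subset`, "any" and "all" behave identically. The difference only shows up with two or more columns, as in this sketch on a made-up dataframe (`toy` and its values are assumptions for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical frame: row 1 is missing one value, row 2 is missing both
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": ["C85", "B28", np.nan],
})

# how="any": drop a row if at least one of the subset columns is null
any_dropped = toy.dropna(how="any", subset=["Age", "Cabin"])
print(any_dropped.index.tolist())  # [0]

# how="all": drop a row only if every subset column is null
all_dropped = toy.dropna(how="all", subset=["Age", "Cabin"])
print(all_dropped.index.tolist())  # [0, 1]
```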

Now let's look at the total number of null values again.

In [41]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          529
Embarked         0
dtype: int64

### Data type doesn't match

We can use the `.dtypes` property to see the data types of each column as automatically interpreted by Pandas. Note that data type `object` usually means "string".

In [42]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Observe that the "Age" column has the data type `float64`, meaning that each person's age is represented as a decimal number. That might be intentional, but if we wanted to group passengers by age, we might prefer to express each age as an integer instead.

We can convert the values in the column to an integer type using the `.astype()` method. To use this method, we simply select a Series, use the `.astype()` method on it, and pass in a new data type as a string. The most common data types are `float64`, `int64`, and `object`. The `.astype()` method returns a new Series, so we need to save it on top of the existing `Age` column.

In [43]:
df['Age'] = df['Age'].astype('int64')

Changing the data type "cleans up" the data a little bit by truncating the decimal numbers to whole numbers (`.astype('int64')` drops the fractional part rather than rounding, so `28.5` becomes `28`). That said, in this example, changing `Age` from a `float64` to an `int64` type probably won't make a huge difference in our analysis. However, there may be instances when changing the data type makes sense for aggregation and for processing efficiency.

For example, above we changed the `Age` column to the `int64` data type. By definition, each `int64` value takes up 64 bits of memory (that's 64 0's and 1's to store each number). If you had a large data set, you might prefer to store the data in fewer bits so that it takes up less space and can be processed more efficiently, using data types like `int32`, `int16`, or even `int8`. Ages fit comfortably in 8 bits: the signed `int8` type stores values from -128 to 127, and the unsigned `uint8` type stores values from 0 to 255. In other words, you don't need very many bits to store numbers like ages, so you can store them in smaller memory chunks without drawbacks.

In [44]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age              int64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
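
The memory savings from smaller integer types can be checked directly. The sketch below uses a made-up Series of 1,000 ages rather than the Titanic data, and compares the memory used by `int64` and `int8` versions of the same values.

```python
import pandas as pd
import numpy as np

# Hypothetical column of 1,000 ages between 0 and 99
ages = pd.Series(np.arange(1000) % 100, dtype="int64")

# Downcast to 8-bit integers; int8 holds values from -128 to 127
ages_small = ages.astype("int8")

# Compare the memory used by the values themselves (index excluded)
print(ages.memory_usage(index=False))        # 8000 bytes
print(ages_small.memory_usage(index=False))  # 1000 bytes
```

The downcast is only safe because every value fits in the smaller type. Values outside the range of `int8` would not be stored correctly.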

### Inconsistent format
Sometimes information contained in the columns of a data set is correct, but the values are formatted differently. For example, the `Survived` column might have used 1's to indicate that a passenger survived, but also may have used the word "True" and the letter "T" to indicate the same thing. "True", "T", and 1 would all be accurate and correct descriptions of the passengers' survival, but the format would not allow the data to be processed and aggregated correctly.

For example, it looks like the embarked column has just three possible values, which include "S", "C", and "Q". We can check to see all of the unique values in the column by using the `.unique()` method.

In [45]:
df['Embarked'].unique()

array(['S', 'C', 'Q', 'Cherbourg', 'X'], dtype=object)

It looks like there are some unexpected values. One of them is "X", which we will deal with in the next section since we can't know for sure what information this is trying to communicate. The other unexpected value is "Cherbourg", which refers to the embarkation location denoted as "C". Thus, "Cherbourg" and "C" communicate the same information and have the same data type, but their formats are inconsistent. They need to be changed to have the same format.

In this case, the best idea would probably be to change "Cherbourg" to "C". We can accomplish this by using the `.replace()` method on the `Embarked` column, passing in a dictionary to the method with the key "Cherbourg" and the value "C". Don't forget to overwrite the old `Embarked` column with the Series that `.replace()` returns!

In [48]:
df['Embarked'] = df['Embarked'].replace({'Cherbourg': 'C'})

We can check the unique values again afterwards:

In [49]:
df['Embarked'].unique()

array(['S', 'C', 'Q', 'X'], dtype=object)

We can also see that ticket values in this data set are not standardized. Some of them start with numbers and some of them start with letters. All of them, however, seem to end with a string of digits. We can trim off the leading letters and keep just the digits by using the `.str.split()` method and then selecting the last piece of each result.

In [13]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


To begin, let's use the `.str.split()` method to split the `Ticket` column on spaces. This will create a new Series whose values are lists. If the ticket value for that row *did* contain a space, the list will have at least two items, with the last item being the ticket number. If the row *did not* contain a space, the list will have only one item, which is the ticket number.

In [14]:
df['Ticket'].str.split(" ")

0             [A/5, 21171]
1              [PC, 17599]
2      [STON/O2., 3101282]
3                 [113803]
4                 [373450]
              ...         
885               [382652]
886               [211536]
887               [112053]
889               [111369]
890               [370376]
Name: Ticket, Length: 714, dtype: object

We can use the `.str` accessor again with square brackets `[]` to extract the last item from each row's list. The last item will always be the ticket number, since lists with more than one item have the ticket number at the end and lists with just one item contain only the ticket number. We can access the last item by using an index of `-1`.

In [15]:
df['Ticket'].str.split(" ").str[-1]

0        21171
1        17599
2      3101282
3       113803
4       373450
        ...   
885     382652
886     211536
887     112053
889     111369
890     370376
Name: Ticket, Length: 714, dtype: object

We can then save this new series as a new column to the dataframe, `TicketNumber`.

In [16]:
df['TicketNumber'] = df['Ticket'].str.split(" ").str[-1]

Let's check to make sure that our code worked, and that the column was added correctly.

In [17]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TicketNumber
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,21171
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,17599
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,3101282
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,113803
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,373450
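
An alternative that this notebook does not use is `.str.extract()` with a regular expression that captures the digits at the end of each value. The sketch below runs on a few ticket values copied from the rows above; the variable names are hypothetical.

```python
import pandas as pd

# A few ticket values copied from the rows shown above
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"])

# Capture the run of digits at the very end of each string
ticket_numbers = tickets.str.extract(r"(\d+)$", expand=False)
print(ticket_numbers.tolist())  # ['21171', '17599', '3101282', '113803']
```

One difference from the split-based approach is that a ticket value with no trailing digits becomes `NaN` here, which makes such rows easy to find with `.isna()`.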


Quantitative fields like `Age` or `Fare` probably won't have different formats since numbers can only take on one format. However, if the data is input in a different format to a quantitative field (as a string, perhaps), likely the only way you will be able to detect it is by looking at the data type that Pandas automatically assigns to the column. If you expected `int64` or `float64` (a number) but Pandas assigned `object`, there's probably a string mixed in there somewhere and it should be fixed.
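
Once a numeric column shows up as `object`, one possible way to locate the offending values is `pd.to_numeric()` with `errors="coerce"`, which turns anything unparseable into `NaN`. The `fares` Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical fare column where one value was typed in as text
fares = pd.Series([7.25, "eight", 53.1])
print(fares.dtype)

# Values that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(fares, errors="coerce")

# Values that were present originally but failed to parse are the culprits
offenders = fares[as_numbers.isna() & fares.notna()]
print(offenders.tolist())  # ['eight']
```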

### Extreme/unexpected/incorrect values
Finding data with extreme or unexpected values can be challenging. This is because software (like Pandas) is only conscious of whether or not data types are consistent in each column, but doesn't know the logic behind the column. Thus, for example, while paying a negative Fare amount to get on the Titanic might be obviously suspicious to us humans, a computer will just assume that it is normal. As a result, it's up to the data analyst to find mistakes in the data and correct them.

There isn't a specific Pandas function for fixing unexpected values. That is because the solution for fixing unexpected values differs across many situations. In some instances, you might just need to fix a single cell in a specific row, which you could do using the `.loc` property. In other cases, you might need to fix an entire column of values, which you could do by using either the `.loc` property or the `.apply()` method in combination with a named function or a lambda function.
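
Both kinds of fix can be sketched on a small, made-up dataframe. The `toy` data and the decision to flip negative fares to positive are assumptions for illustration, not a rule that applies to every dataset.

```python
import pandas as pd

# Hypothetical fares where two values were recorded with the wrong sign
toy = pd.DataFrame({"Fare": [7.25, -71.28, 8.05, -53.10]})

# Fix a single cell: row label 1, column "Fare"
toy.loc[1, "Fare"] = 71.28

# Fix a whole column: apply a function to every value
toy["Fare"] = toy["Fare"].apply(lambda fare: abs(fare))
print(toy["Fare"].tolist())  # [7.25, 71.28, 8.05, 53.1]
```

For a simple operation like this one, the built-in `.abs()` method would do the same job. `.apply()` becomes more useful when the fix needs custom logic.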

In the example below, I want to show you how you might locate extreme or incorrect data while doing analysis. Then, I'll show you in an example how I might fix the data using the `.loc` property.

#### Finding unexpected values
If we conduct a quick `df.info()`, we can see how many non-null rows we have and what the data types of each column are. It might be useful to run this function sometimes to look for anomalies. In this case, however, it doesn't tell us very much, especially since we've already cleaned the data somewhat.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   714 non-null    int64  
 1   Survived      714 non-null    int64  
 2   Pclass        714 non-null    int64  
 3   Name          714 non-null    object 
 4   Sex           714 non-null    object 
 5   Age           714 non-null    int64  
 6   SibSp         714 non-null    int64  
 7   Parch         714 non-null    int64  
 8   Ticket        714 non-null    object 
 9   Fare          714 non-null    float64
 10  Cabin         185 non-null    object 
 11  Embarked      714 non-null    object 
 12  TicketNumber  714 non-null    object 
dtypes: float64(1), int64(6), object(6)
memory usage: 78.1+ KB


We can also use `df.describe()` to see summary statistics for our data. What do you notice about the fields below? Do any of the summary statistics jump out at you?

In [17]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,714.0,714.0,714.0,714.0,714.0,714.0,714.0
mean,448.582633,0.406162,2.236695,29.679272,0.509804,0.431373,34.694514
std,259.119524,0.49146,0.83825,14.536483,0.931324,0.853289,52.91893
min,1.0,0.0,1.0,0.0,-1.0,0.0,0.0
25%,222.25,0.0,1.0,20.0,0.0,0.0,8.05
50%,445.0,0.0,2.0,28.0,0.0,0.0,15.7417
75%,677.75,1.0,3.0,38.0,1.0,1.0,33.375
max,891.0,1.0,3.0,80.0,5.0,6.0,512.3292


Notice the minimum value in the `SibSp` column. It's `-1`! If we don't know what data `SibSp` is supposed to hold, we might not recognize the problem right away. If we were to investigate a little more, we would find out that `SibSp` shows the number of siblings and spouses that each passenger had on board the *Titanic*. Knowing that, it's obvious that `-1` is an impossible value, even though Pandas didn't throw an error or complain.

Let's assume that the value should actually just be positive 1 and turn each -1 into positive 1. We can find each occurrence where `SibSp = -1` by using a filter. 

In [18]:
df.loc[df['SibSp'] == -1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TicketNumber
73,74,0,3,"Chronopoulos, Mr. Apostolos",male,26,-1,0,2680,14.4542,,C,2680


#### Fixing the unexpected values

Now that we've found all of the rows where `SibSp` is `-1`, we can simply add `= 1` to the filter to turn each occurrence of `-1` into `1`. Notice that in the `.loc` property, we first select the rows where `SibSp` is `-1` and then the `SibSp` column, setting all of the selected values to 1.

In [19]:
# Turn each occurrence of -1 into 1
df.loc[df['SibSp'] == -1, 'SibSp'] = 1

Now we can describe the data again to check if the problem was corrected.

In [20]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,714.0,714.0,714.0,714.0,714.0,714.0,714.0
mean,448.582633,0.406162,2.236695,29.679272,0.512605,0.431373,34.694514
std,259.119524,0.49146,0.83825,14.536483,0.929783,0.853289,52.91893
min,1.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,222.25,0.0,1.0,20.0,0.0,0.0,8.05
50%,445.0,0.0,2.0,28.0,0.0,0.0,15.7417
75%,677.75,1.0,3.0,38.0,1.0,1.0,33.375
max,891.0,1.0,3.0,80.0,5.0,6.0,512.3292


#### Checking if value in list
Occasionally, we may want to create a list of allowed values and make sure that every value in a column is one of them and not something else. For example, in our Titanic example, we might check to make sure that all values in the `Embarked` column are either `'S'`, `'C'`, or `'Q'`. We can do so by creating a list and then using the `.isin()` method on the column, passing the list inside the parentheses.

In [21]:
locations = ['S', 'C', 'Q']
df['Embarked'].isin(locations)

0      True
1      True
2      True
3      True
4      True
       ... 
885    True
886    True
887    True
889    True
890    True
Name: Embarked, Length: 714, dtype: bool

By summing this Series of boolean values, we can count how many `True` values there are, since each `True` counts as 1 and each `False` as 0.

In [22]:
df['Embarked'].isin(locations).sum()

713

Notice that the Series above has 714 rows, but only 713 of them are `True`.

Let's use the `.isin()` method to create a negative filter that returns rows where the value is *not* in the list. Remember that we can negate a condition in NumPy by placing the tilde `~` in front of it, and it's the same in Pandas.

In [23]:
~df['Embarked'].isin(locations)

0      False
1      False
2      False
3      False
4      False
       ...  
885    False
886    False
887    False
889    False
890    False
Name: Embarked, Length: 714, dtype: bool

In [24]:
df.loc[ ~df['Embarked'].isin(locations) ]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TicketNumber
221,222,0,2,"Bracken, Mr. James H",male,27,0,0,220367,13.0,,X,220367


We can then grab the index of this row, and drop it, since we have no way of knowing what the `Embarked` value actually was. We can get the index by adding `.index` to the end of the previous line of code. We can save this index to a variable `rows_to_drop`.

In [25]:
rows_to_drop = df.loc[ ~df['Embarked'].isin(locations) ].index

We can then drop the row by its index using the `.drop()` method with the `index` and `inplace` arguments.

In [26]:
df.drop(index=rows_to_drop, inplace=True)

In the end, we expect to get 713 rows with values that are in the list `locations`, since we dropped the one row that did not meet that criterion.

In [27]:
df['Embarked'].isin(locations).sum()

713

#### Dropping Duplicates
You can use the dataframe method `.drop_duplicates()` to drop rows that are duplicates of each other. This method returns a new dataframe in which each row is unique across all of its columns. If any value in a row differs from another row, the two rows are not duplicates and neither is dropped.

In [28]:
df.drop_duplicates()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TicketNumber
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.2500,,S,21171
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,17599
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.9250,,S,3101282
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1000,C123,S,113803
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.0500,,S,373450
...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39,0,5,382652,29.1250,,Q,382652
886,887,0,2,"Montvila, Rev. Juozas",male,27,0,0,211536,13.0000,,S,211536
887,888,1,1,"Graham, Miss. Margaret Edith",female,19,0,0,112053,30.0000,B42,S,112053
889,890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,111369,30.0000,C148,C,111369


Notice that this dataset does not contain duplicate rows, so no rows were dropped (we still have 713 rows). However, we can also pass in a list of columns and Pandas will look for duplicates **only among those columns**, keeping the first row of each set of duplicates and dropping the rest. For example, if we wanted to have a dataframe with just one family member per cabin, we could pass in `Cabin` inside a list to the `.drop_duplicates()` method.

In [29]:
df.drop_duplicates(['Cabin'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,TicketNumber
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.2500,,S,21171
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,17599
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1000,C123,S,113803
6,7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S,17463
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4,1,1,PP 9549,16.7000,G6,S,9549
...,...,...,...,...,...,...,...,...,...,...,...,...,...
857,858,1,1,"Daly, Mr. Peter Denis",male,51,0,0,113055,26.5500,E17,S,113055
867,868,0,1,"Roebling, Mr. Washington Augustus II",male,31,0,0,PC 17590,50.4958,A24,S,17590
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56,0,1,11767,83.1583,C50,C,11767
887,888,1,1,"Graham, Miss. Margaret Edith",female,19,0,0,112053,30.0000,B42,S,112053


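After doing this, we only get 135 rows back. Keep in mind that rows with a missing `Cabin` are treated as duplicates of one another, so only the first of them (row 0 above) is kept. Now we could analyze the `Fare` column and get a more realistic idea of the average fare per cabin, rather than per passenger.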
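
How missing values interact with `.drop_duplicates()` is easy to overlook, so here is a minimal sketch on a made-up dataframe (the names and cabins are hypothetical). It also shows one possible way to keep only rows with a known cabin by calling `.dropna()` first.

```python
import pandas as pd
import numpy as np

# Hypothetical passengers: two share cabin C85, two have no cabin recorded
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan", "Eve"],
    "Cabin": ["C85", "C85", np.nan, np.nan, "E46"],
})

# Missing cabins count as duplicates of each other, so "Dan" is dropped
one_per_cabin = toy.drop_duplicates(["Cabin"])
print(one_per_cabin["Name"].tolist())  # ['Ann', 'Cat', 'Eve']

# Dropping rows without a cabin first keeps only real cabins
known_cabins = toy.dropna(subset=["Cabin"]).drop_duplicates(["Cabin"])
print(known_cabins["Name"].tolist())  # ['Ann', 'Eve']
```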

### Final thoughts

There are other situations in which your data may need cleaning, and there are many more Pandas methods and parameters available to use. The more experienced you become with data analysis and Pandas, the better you will be able to clean your data.