# 6.1.1 Duplicates

Duplicate values in data analytics refer to identical or nearly identical records or entries within a dataset. These repetitions can arise from various sources such as data entry errors, system glitches, or merging multiple data sources. Addressing duplicate values is crucial as they can distort analysis results, lead to inaccurate insights, and waste computational resources.

Treating duplicate values can involve several steps, but it mainly comes down to identifying the duplicate entries and then deciding what to do with each one.

Let's start by looking at a modified version of the *Titanic* data.

## About the data
This data set is a modified version of the classic *Titanic* data set. It contains many kinds of errors, which will be used to demonstrate cleaning concepts in this reading.

⚠ Warning: Remember to upload the Titanic Dirty data set to Google Colab before running the cells below.

In [None]:
import pandas as pd
df = pd.read_csv('titanic_dirty.csv')
df.head()

In [None]:
df.shape

## Identifying duplicates

Observe the data set above. In the first five rows of the data set, things look fairly normal (for the most part). Finding duplicates in any data set is difficult to do by just looking at the data, since there could be hundreds of thousands of rows.

Pandas makes identifying duplicate rows easy using the `.duplicated()` method.

In [None]:
df.duplicated()

The `.duplicated()` method goes through each row of the dataframe and checks whether an earlier row matches it **exactly**, meaning every column holds the same value. By default, the first occurrence of a row is marked as False and each later exact copy is marked as True.
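
To make the marking rule concrete, here is a minimal sketch on a small made-up dataframe (the `toy` data and its values are hypothetical and are not taken from the Titanic file). It compares the default `keep='first'` with the other two options the `keep` parameter accepts.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Fare": [7.25, 8.05, 7.25, 7.25],
})

first = toy.duplicated()             # default keep='first': later copies are True
last = toy.duplicated(keep="last")   # earlier copies are True, the last one is False
both = toy.duplicated(keep=False)    # every copy is True

print(first.tolist())  # [False, False, True, True]
print(last.tolist())   # [True, False, True, False]
print(both.tolist())   # [True, False, True, True]
```

Which option fits depends on the goal: the default is convenient for dropping extra copies, while `keep=False` is convenient for inspecting every copy together.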

We can use `.duplicated()` to create a filter that shows all duplicated rows.

Note: You can add the parameter `keep=False` to the `.duplicated()` method to mark every copy of a duplicated row, including the first occurrence. This shows the matching rows together so you can verify that they really are duplicates.

In [None]:
df[ df.duplicated() ].sort_values(by='Name')

In the output above, we can see that there were about 24 passengers whose rows were duplicated exactly. Great work!
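
Rather than estimating the count from the filtered output, the boolean result of `.duplicated()` can be summed, since True counts as 1. The sketch below uses a hypothetical `toy` dataframe, not the Titanic file, so its numbers say nothing about the real data set.

```python
import pandas as pd

# Hypothetical toy data: "Ann" and "Bob" each appear twice.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cy", "Bob"],
    "Age": [22, 35, 22, 41, 35],
})

# Rows that repeat an earlier row (the extra copies only).
n_extra = int(toy.duplicated().sum())

# Rows that belong to any duplicated group (originals plus copies).
n_involved = int(toy.duplicated(keep=False).sum())

print(n_extra)     # 2
print(n_involved)  # 4
```

On the real data, `df.duplicated().sum()` would give the exact number of extra copies in the same way.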

## Now what?
Now that the duplicate entries have been identified, we have to decide what to do with them. Should we drop the duplicate passengers? What if there are two passengers who happen to have the same name and the same gender and paid the same fare and have the same age?

The answer to these questions will determine your course of action. Most of the time, however, duplicate rows are simply dropped.

## Dropping duplicates

To drop duplicates, use the `.drop_duplicates()` method on a pandas dataframe. Note that calling the method by itself only returns a new dataframe without the duplicates and leaves `df` unchanged. To keep the result, either assign it back to the variable `df` or pass the parameter `inplace=True`.
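
The difference between previewing and saving can be sketched on a small made-up dataframe (the `toy` data is hypothetical). The variable names `preview`, `rows_after_preview`, and `rows_after_assign` are only there for illustration.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical.
toy = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob"],
    "Fare": [7.25, 7.25, 8.05],
})

# Calling the method alone returns a new dataframe; toy is untouched.
preview = toy.drop_duplicates()
rows_after_preview = len(toy)

# Assigning the result back is what actually saves the change.
# toy.drop_duplicates(inplace=True) would have the same effect.
toy = toy.drop_duplicates()
rows_after_assign = len(toy)

print(len(preview), rows_after_preview, rows_after_assign)  # 2 3 2
print(toy.index.tolist())  # [0, 2]
```

Notice that the surviving rows keep their original index labels, so the index can have gaps after dropping. If a clean 0-to-n index is preferred, `.reset_index(drop=True)` is one way to get it.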

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df[ df.duplicated() ].sort_values(by='Name')

That's it! The duplicates have been dropped from the data set.

## Checking for duplicates by subset
Dropping duplicates may have seemed easy, but remember that by default, the `.duplicated()` and `.drop_duplicates()` methods look for rows that are **exactly the same**. But what happens if the same passenger was entered into the data set twice, but with a different `Fare` each time? Because the rows aren't exactly the same, they won't be marked as duplicates.

For this reason, we can use the `subset` argument to search for duplicates using one or a few columns instead of all the columns.
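
As a minimal sketch of what `subset` changes, the made-up dataframe below (hypothetical names, cabins, and fares) contains one passenger entered twice with different fares, plus two same-named passengers in different cabins.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
toy = pd.DataFrame({
    "Name": ["Ann Lee", "Ann Lee", "Bob Ray", "Bob Ray"],
    "Cabin": ["C85", "C85", "E46", "B20"],
    "Fare": [71.28, 17.28, 8.05, 8.05],
})

# Comparing all columns: no two rows match exactly.
exact = toy.duplicated()

# Comparing only Name and Cabin: the second "Ann Lee" row is flagged,
# while the two "Bob Ray" rows differ in Cabin and are not.
by_subset = toy.duplicated(subset=["Name", "Cabin"])

print(exact.tolist())      # [False, False, False, False]
print(by_subset.tolist())  # [False, True, False, False]
```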

Let's use some logic to determine which columns to look for duplicates in. We couldn't use the `Fare` or `Age` columns because it is possible for two different passengers to have paid the same fare and have the same age. We *could* probably use the `Name` column, but technically, two passengers *could* have the exact same name. However, what are the chances that two passengers have the same `Name` and stayed in the same `Cabin`? Let's take a look at the passengers who meet those criteria using the `subset` parameter.

In [None]:
df.loc[ df.duplicated(subset=['Name', 'Cabin'])]

By using the `subset` parameter and searching for duplicates using only the `Name` and `Cabin` columns, we were able to identify an additional 63 duplicates in the data set. Now, we can use the same `subset` argument to drop these passengers.

In [None]:
df.drop_duplicates(subset=['Name', 'Cabin'], inplace=True)
df.shape
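
One caveat is worth checking before trusting a subset that includes a column with missing values. To the best of my knowledge, pandas treats missing values as equal to each other when looking for duplicates, so two same-named passengers with no recorded cabin would be flagged as duplicates. The sketch below uses hypothetical data, and whether this matters for the Titanic Dirty file depends on how many `Cabin` values are missing. The `known_cabin` filter is just one possible, more conservative rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two different "John Smith" passengers, neither with a recorded cabin.
toy = pd.DataFrame({
    "Name": ["John Smith", "John Smith", "Mary Jones"],
    "Cabin": [np.nan, np.nan, "C85"],
    "Age": [30, 52, 28],
})

# Missing cabins compare as equal, so the second John Smith is flagged.
flagged = toy.duplicated(subset=["Name", "Cabin"])

# A more conservative rule: only flag rows whose Cabin is actually recorded.
known_cabin = toy["Cabin"].notna()
safer = flagged & known_cabin

print(flagged.tolist())  # [False, True, False]
print(safer.tolist())    # [False, False, False]
```

Whether the stricter rule is appropriate is a judgment call, since it also leaves genuine duplicates with a missing cabin in place.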

Deciding which rows are duplicates is subjective, so there isn't a reliable way to confirm that all the duplicates were removed. It is likely impossible to remove every duplicate entry from a data set, but reducing the number of duplicate entries will make the analysis more accurate than it would otherwise be.