The duplicated() method in Pandas helps us find duplicate rows in our data quickly: it returns True for rows that repeat another row and False for the rest. It is typically used to clean a dataset before analysis. In this article, we'll see how the duplicated() method works with some examples.

In [1]:
import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 32, 25, 37]
})
duplicates = df[df.duplicated()]
print(duplicates)

    Name  Age
2  Alice   25


#### Syntax:

DataFrame.duplicated(subset=None, keep='first')

Parameters:    

1. subset: (Optional) Specifies which columns to check for duplicates. By default, it checks all columns.

2. keep: Determines which duplicates to mark as True:  

* 'first' (default): Marks every occurrence except the first as True. 
* 'last': Marks every occurrence except the last as True.  
* False: Marks all occurrences of duplicates as True.   

Returns: A Boolean Series in which each value indicates whether the corresponding row is a duplicate (True) or not (False).  
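
To make the parameters concrete, the sketch below applies each keep option to the small Name/Age DataFrame from the opening example. The second DataFrame, `people`, is a hypothetical addition used only to show how subset changes the result.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 32, 25, 37],
})

# Rows 0 and 2 are identical, so only they are affected by `keep`
first_mask = df.duplicated()             # keep='first' is the default
last_mask = df.duplicated(keep="last")
all_mask = df.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, False]
print(last_mask.tolist())   # [True, False, False, False]
print(all_mask.tolist())    # [True, False, True, False]

# Hypothetical data: same name, different ages
people = pd.DataFrame({"Name": ["Alice", "Alice"], "Age": [25, 30]})

whole_row = people.duplicated()                # compares every column
name_only = people.duplicated(subset=["Name"])  # compares only 'Name'

print(whole_row.tolist())  # [False, False]
print(name_only.tolist())  # [False, True]
```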

Let's look at some examples of the duplicated() method in the Pandas library, which identifies duplicated rows in a DataFrame. Here we will be using a custom dataset.  

You can download the dataset from Here.  

### Example 1: Returning a Boolean Series
In this example we will identify duplicate values in the First Name column using the default keep='first' parameter. The first occurrence of each name is treated as unique, and every later occurrence is marked as a duplicate.

In [2]:
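# Load the dataset and sort by first name so repeated names sit together
data = pd.read_csv("employees.csv")
data.sort_values("First Name", inplace=True)
# True for every occurrence of a name after its first
bool_series = data["First Name"].duplicated()
# Display only the rows flagged as duplicates
data[bool_series]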

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
327,Aaron,Male,1/29/1994,6:48 PM,58755,5.097,True,Marketing
440,Aaron,Male,7/22/1990,2:53 PM,52119,11.343,True,Client Services
937,Aaron,,1/22/1986,7:39 PM,63126,18.424,False,Client Services
141,Adam,Male,12/24/1990,8:57 PM,110194,14.727,True,Product
302,Adam,Male,7/5/2007,11:59 AM,71276,5.027,True,Human Resources
...,...,...,...,...,...,...,...,...
902,,Male,5/23/2001,7:52 PM,103877,6.322,,Distribution
925,,Female,8/23/2000,4:19 PM,95866,19.388,,Sales
946,,Female,9/15/1985,1:50 AM,133472,16.941,,Distribution
947,,Male,7/30/2012,3:07 PM,107351,5.329,,Marketing
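
The last rows of the output above have an empty First Name. That is consistent with duplicated() treating missing values as equal to one another, so every missing name after the first is flagged. A minimal sketch with a hypothetical Series (not the employees data):

```python
import numpy as np
import pandas as pd

# Hypothetical names, including two missing values
names = pd.Series(["Aaron", "Adam", "Aaron", np.nan, np.nan], name="First Name")

mask = names.duplicated()  # keep='first' by default

# The second 'Aaron' and the second NaN are both marked as duplicates
print(mask.tolist())  # [False, False, True, False, True]
```

If missing names should not count as duplicates of each other, one option is to filter them out with dropna() before calling duplicated().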


### Example 2: Removing duplicates 
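In this example we'll remove every row whose First Name appears more than once. Setting keep=False marks all occurrences of a repeated name as True, and filtering with ~bool_series keeps only the names that occur exactly once.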

In [3]:
data = pd.read_csv("employees.csv")
data.sort_values("First Name", inplace=True)
# keep=False flags every occurrence of a repeated name
bool_series = data["First Name"].duplicated(keep=False)
# Keep only the rows whose first name occurs exactly once
data = data[~bool_series]
data.info()
data

<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, 8 to 291
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         9 non-null      object 
 1   Gender             9 non-null      object 
 2   Start Date         9 non-null      object 
 3   Last Login Time    9 non-null      object 
 4   Salary             9 non-null      int64  
 5   Bonus %            9 non-null      float64
 6   Senior Management  9 non-null      object 
 7   Team               9 non-null      object 
dtypes: float64(1), int64(1), object(6)
memory usage: 648.0+ bytes


Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
8,Angela,Female,11/22/2005,6:29 AM,95570,18.523,True,Engineering
688,Brian,Male,4/7/2007,10:47 PM,93901,17.821,True,Legal
190,Carol,Female,3/19/1996,3:39 AM,57783,9.129,False,Finance
887,David,Male,12/5/2009,8:48 AM,92242,15.407,False,Legal
5,Dennis,Male,4/18/1987,1:35 AM,115163,10.125,False,Legal
495,Eugene,Male,5/24/1984,10:54 AM,81077,2.117,False,Sales
33,Jean,Female,12/18/1993,9:07 AM,119082,16.18,False,Business Development
832,Keith,Male,2/12/2003,3:02 PM,120672,19.467,False,Legal
291,Tammy,Female,11/11/1984,10:30 AM,132839,17.463,True,Client Services
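
Because the result of duplicated() is a Boolean Series, it can be counted before anything is filtered out. The sketch below uses a small hypothetical DataFrame (`staff`) in place of employees.csv to show the count and the negated mask together.

```python
import pandas as pd

# Hypothetical stand-in for the employees data
staff = pd.DataFrame({
    "First Name": ["Angela", "Aaron", "Aaron", "Brian", "Aaron"],
    "Team": ["Engineering", "Marketing", "Client Services", "Legal", "Client Services"],
})

# keep=False flags every row whose first name is repeated
mask = staff["First Name"].duplicated(keep=False)
flagged = int(mask.sum())  # True counts as 1

# ~mask inverts the flags, leaving names that occur exactly once
unique_only = staff[~mask]

print(flagged)                               # 3
print(unique_only["First Name"].tolist())    # ['Angela', 'Brian']
```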


### Example 3: Keeping the Last Occurrence of Duplicates
In this example, we will keep the last occurrence of each duplicate and mark the earlier ones as duplicates. This is done using the keep='last' argument.

In [4]:
data = pd.read_csv("employees.csv")
data.sort_values("First Name", inplace=True)
bool_series_last = data["First Name"].duplicated(keep='last')
data_last = data[~bool_series_last]
data_last.info()
print(data_last)

<class 'pandas.core.frame.DataFrame'>
Index: 201 entries, 937 to 951
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         200 non-null    object 
 1   Gender             165 non-null    object 
 2   Start Date         201 non-null    object 
 3   Last Login Time    201 non-null    object 
 4   Salary             201 non-null    int64  
 5   Bonus %            201 non-null    float64
 6   Senior Management  200 non-null    object 
 7   Team               189 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 14.1+ KB
    First Name  Gender  Start Date Last Login Time  Salary  Bonus %  \
937      Aaron     NaN   1/22/1986         7:39 PM   63126   18.424   
538       Adam    Male   10/8/2010         9:53 PM   45181    3.491   
610       Alan    Male   2/17/2012        12:26 AM   41453   10.084   
959     Albert    Male   9/19/1992         2:35 AM   45094    5.850   

The drop_duplicates() method in Pandas removes duplicate rows from a DataFrame based on all columns or on specific ones. By default, it compares entire rows, retains the first occurrence of each row, and removes any duplicates that follow. In this article, we will see how to use the drop_duplicates() method with examples.

Let's start with a basic example to see how drop_duplicates() works.

In [5]:
data = {
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"]
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

df_cleaned = df.drop_duplicates()

print("\nModified DataFrame (no duplicates)")
print(df_cleaned)

Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago

Modified DataFrame (no duplicates)
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
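
With default arguments, drop_duplicates() should give the same rows as filtering with the negated duplicated() mask. The sketch below checks this on the DataFrame from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
})

via_method = df.drop_duplicates()   # removes repeated rows directly
via_mask = df[~df.duplicated()]     # keeps rows not flagged as duplicates

# Both approaches keep the rows with index 0, 1 and 3
print(via_method.equals(via_mask))  # True
```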


#### Syntax:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters:  

1. subset: Specifies the columns to check for duplicates. If not provided all columns are considered.

2. keep: Determines which duplicate to keep:

* 'first' (default): Keeps the first occurrence, removes subsequent duplicates.
* 'last': Keeps the last occurrence and removes previous duplicates.
* False: Removes all occurrences of duplicates.

3. inplace: If True it modifies the original DataFrame directly. If False (default), returns a new DataFrame.  

Return type: Returns a new DataFrame with duplicates removed, or None when inplace=True.
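
The rows that survive drop_duplicates() keep their original index labels, which can leave gaps. As a small sketch, assuming pandas 1.0 or later, the ignore_index parameter (not listed in the signature above) renumbers the result from 0.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
})

kept = df.drop_duplicates()                         # original labels preserved
renumbered = df.drop_duplicates(ignore_index=True)  # labels reset to 0..n-1

print(kept.index.tolist())        # [0, 1, 3]
print(renumbered.index.tolist())  # [0, 1, 2]
```

On older versions, calling reset_index(drop=True) on the result has the same effect.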

### Examples
Below are some examples of the DataFrame.drop_duplicates() method:

#### 1. Dropping Duplicates Based on Specific Columns
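We can target duplicates in specific columns using the subset parameter. This is useful when only some columns identify a record. In the example below, the two 'Alice' rows differ in City, so they count as duplicates only because the comparison is restricted to Name.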

In [6]:

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'SF', 'Chicago']
})

df_cleaned = df.drop_duplicates(subset=["Name"])

print(df_cleaned)

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
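
The subset parameter also accepts several columns, in which case rows are duplicates only if they match on all of them. A brief sketch using the same DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "SF", "Chicago"],
})

# Rows 0 and 2 share Name and Age, so row 2 is dropped
by_name_age = df.drop_duplicates(subset=["Name", "Age"])

# Rows 0 and 2 differ in City, so nothing is dropped
by_name_city = df.drop_duplicates(subset=["Name", "City"])

print(by_name_age.index.tolist())   # [0, 1, 3]
print(by_name_city.index.tolist())  # [0, 1, 2, 3]
```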


#### 2. Keeping the Last Occurrence of Duplicates
By default drop_duplicates() retains the first occurrence of duplicates. If we want to keep the last occurrence we can use keep='last'.

In [7]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

df_cleaned = df.drop_duplicates(keep='last')
print(df_cleaned)

    Name  Age     City
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
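
One common use of keep='last' is to sort first and then keep the final row per key. The sketch below is only an illustration with hypothetical data, in which 'Alice' appears with two different ages and the row with the higher age is the one retained.

```python
import pandas as pd

# Hypothetical data: 'Alice' appears twice with different ages
records = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 27, 40],
})

# Sort so the row to keep comes last within each name, then drop the earlier ones
latest = records.sort_values("Age").drop_duplicates(subset="Name", keep="last")

print(latest["Name"].tolist())  # ['Alice', 'Bob', 'David']
print(latest["Age"].tolist())   # [27, 30, 40]
```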


#### 3. Dropping All Duplicates
If we want to remove every row that has a duplicate, including its first occurrence, we can set keep=False.

In [8]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})
df_cleaned = df.drop_duplicates(keep=False)
print(df_cleaned)

    Name  Age     City
1    Bob   30       LA
3  David   40  Chicago


#### 4. Modifying the Original DataFrame Directly
If we want to modify the DataFrame in place without creating a new DataFrame, we can set inplace=True.

In [9]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})
df.drop_duplicates(inplace=True)
print(df)

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
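
One thing to watch with inplace=True is that the call returns None, so its result should not be assigned back to the variable. A short sketch of the behaviour:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
})

# The DataFrame is modified directly; the call itself returns None
result = df.drop_duplicates(inplace=True)

print(result)   # None
print(len(df))  # 3
```

Writing `df = df.drop_duplicates(inplace=True)` would therefore replace the DataFrame with None.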
