Checking for duplicate values is simple, but it behaves slightly differently for Series and DataFrames. We will start with Series. As an example, suppose we are inviting ambassadors from different countries, but we can invite only one ambassador per country. We will create a Series in which two countries have more than one ambassador.

In [52]:
import numpy as np
import pandas as pd

# Series

In [53]:
ambassadors = pd.Series([
    "France",
    "United Kingdom",
    "United Kingdom",
    "Italy",
    "Germany",
    "Germany",
    "Germany"
],
index=[
    "Bipin Tamang",
    "Yogesh Chamling",
    "Buddha Tamang",
    "Bhupendra Pradhan",
    "Pasang Gyalzen Sherpa",
    "Lakpa Sherpa",
    "Dajangbu Sherpa"
])

In [54]:
ambassadors

Bipin Tamang                     France
Yogesh Chamling          United Kingdom
Buddha Tamang            United Kingdom
Bhupendra Pradhan                 Italy
Pasang Gyalzen Sherpa           Germany
Lakpa Sherpa                    Germany
Dajangbu Sherpa                 Germany
dtype: object

In [55]:
ambassadors.duplicated()

Bipin Tamang             False
Yogesh Chamling          False
Buddha Tamang             True
Bhupendra Pradhan        False
Pasang Gyalzen Sherpa    False
Lakpa Sherpa              True
Dajangbu Sherpa           True
dtype: bool

Here we can see that duplicated() did not flag the first occurrence of each value: "Yogesh Chamling" (the first "United Kingdom") and "Pasang Gyalzen Sherpa" (the first "Germany") are both False. We can change this behaviour with the keep parameter.

# DataFrame.duplicated(subset=None, keep='first')

The subset parameter specifies which columns to check for duplicates. It applies only to DataFrames; Series.duplicated() accepts only keep.


In [56]:
ambassadors.duplicated(keep="last")

Bipin Tamang             False
Yogesh Chamling           True
Buddha Tamang            False
Bhupendra Pradhan        False
Pasang Gyalzen Sherpa     True
Lakpa Sherpa              True
Dajangbu Sherpa          False
dtype: bool

What does keep do?

• keep="first" (default)

Marks all duplicates except the first occurrence as True.

Example: [1, 1, 1] → [False, True, True].

• keep="last"

Marks all duplicates except the last occurrence as True.

Example: [1, 1, 1] → [True, True, False].

• keep=False

Marks all duplicates as True (none are kept).

Example: [1, 1, 1] → [True, True, True].
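
The three settings can be checked directly on the [1, 1, 1] example. This is a minimal sketch; the variable names (`ones`, `first`, `last`, `none_kept`) are only for illustration.

```python
import pandas as pd

# A Series in which every value is the same
ones = pd.Series([1, 1, 1])

# keep="first": the first occurrence is not flagged
first = ones.duplicated(keep="first").tolist()

# keep="last": the last occurrence is not flagged
last = ones.duplicated(keep="last").tolist()

# keep=False: every occurrence is flagged
none_kept = ones.duplicated(keep=False).tolist()

print(first, last, none_kept)
```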

# By default, DataFrame.drop_duplicates() compares every column of every row

In [57]:
ambassadors.duplicated(keep=False)

# With keep=False, no occurrence is exempted: every repeated value is marked True

Bipin Tamang             False
Yogesh Chamling           True
Buddha Tamang             True
Bhupendra Pradhan        False
Pasang Gyalzen Sherpa     True
Lakpa Sherpa              True
Dajangbu Sherpa           True
dtype: bool

In [58]:
ambassadors.drop_duplicates()

# keep defaults to "first", so this keeps the first occurrence of each value and drops the later ones

Bipin Tamang                     France
Yogesh Chamling          United Kingdom
Bhupendra Pradhan                 Italy
Pasang Gyalzen Sherpa           Germany
dtype: object

In [59]:
ambassadors.drop_duplicates(keep="last")

# This keeps the last occurrence of each value and drops the earlier ones

Bipin Tamang                 France
Buddha Tamang        United Kingdom
Bhupendra Pradhan             Italy
Dajangbu Sherpa             Germany
dtype: object

In [60]:
ambassadors.drop_duplicates(keep=False)

# This drops every value that appears more than once, keeping only the values that are unique

Bipin Tamang         France
Bhupendra Pradhan     Italy
dtype: object
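
One way to think about drop_duplicates is as a filter that uses the negated duplicated() mask. The sketch below rebuilds a shorter version of the ambassadors Series (the names `short`, `by_mask` and `by_method` are illustrative) and checks that both routes give the same result.

```python
import pandas as pd

short = pd.Series(
    ["France", "United Kingdom", "United Kingdom", "Italy"],
    index=["Bipin Tamang", "Yogesh Chamling", "Buddha Tamang", "Bhupendra Pradhan"],
)

# Keep only the rows that duplicated() does NOT flag
by_mask = short[~short.duplicated(keep="first")]

# The built-in shortcut for the same operation
by_method = short.drop_duplicates(keep="first")

print(by_mask.equals(by_method))
```

The mask form is handy when the flagged rows need to be inspected before they are removed.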

# DataFrames

In [61]:
players = pd.DataFrame({
    "Name"  : [
        "Luka Doncic",
        "Allen Iverson",
        "Luka Doncic",
        "Kyrie Irving",
        "Luka Doncic"
    ],
    "Pos" : [
        "SG",
        "SF",
        "SG",
        "SF",
        "SF"
    ]
})

In [62]:
players.duplicated(keep=False)

0     True
1    False
2     True
3    False
4    False
dtype: bool

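By default, a row counts as a duplicate only if every column matches another row. That is why only rows 0 and 2 (both "Luka Doncic", "SG") are flagged, while row 4 has the same name but a different position and is not. To check for duplicates in specific columns only, we use the subset parameter.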

In [63]:
players.duplicated(subset=["Name"])

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [64]:
players.duplicated(subset="Pos", keep=False)

0    True
1    True
2    True
3    True
4    True
dtype: bool

In [65]:
players.duplicated(subset="Name", keep="last")

0     True
1    False
2     True
3    False
4    False
dtype: bool

In [66]:
players.drop_duplicates(subset="Name")

            Name Pos
0    Luka Doncic  SG
1  Allen Iverson  SF
3   Kyrie Irving  SF


In [67]:
players.drop_duplicates()

# Without subset, a row is considered a duplicate only if all of its column values match another row.
# So this removes rows that share both Name and Pos. It keeps the first such row and removes the later ones.

            Name Pos
0    Luka Doncic  SG
1  Allen Iverson  SF
3   Kyrie Irving  SF
4    Luka Doncic  SF


In [68]:
players.drop_duplicates(subset="Pos")

# This is what we get if we look for duplicates in the "Pos" column only. Only the first occurrence of each position is kept.

            Name Pos
0    Luka Doncic  SG
1  Allen Iverson  SF


In [69]:
players.drop_duplicates(subset="Name", keep="last")

            Name Pos
1  Allen Iverson  SF
3   Kyrie Irving  SF
4    Luka Doncic  SF


In [70]:
players

            Name Pos
0    Luka Doncic  SG
1  Allen Iverson  SF
2    Luka Doncic  SG
3   Kyrie Irving  SF
4    Luka Doncic  SF
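
As the last output suggests, drop_duplicates returns a new DataFrame and leaves players untouched. To keep the result, assign it to a variable. The sketch below also passes a list of columns to subset; `roster` is a stand-in copy of the players data and `unique_players` is an illustrative name.

```python
import pandas as pd

roster = pd.DataFrame({
    "Name": ["Luka Doncic", "Allen Iverson", "Luka Doncic", "Kyrie Irving", "Luka Doncic"],
    "Pos": ["SG", "SF", "SG", "SF", "SF"],
})

# drop_duplicates returns a new object; the original is not modified
unique_players = roster.drop_duplicates(subset=["Name", "Pos"])

print(len(roster))          # the original still has all of its rows
print(len(unique_players))  # row 2 repeats row 0, so it is dropped

# Optionally renumber the index after rows have been dropped
unique_players = unique_players.reset_index(drop=True)
```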


# Text Handling

Cleaning text values can be hard. Invalid text values are, 99% of the time, the result of mistyping, which is completely unpredictable and doesn't follow any pattern. Thankfully, this is less common these days, as many data entry tasks have been taken over by machines. Still, let's explore some common cases.

# Splitting Columns

For this, we will create a DataFrame.

In [103]:
df = pd.DataFrame({
    "Data" : [
        "1987_M_US _1",
        "1990?_M_UK_1",
        "1992_F_US_2",
        "1070?_M_   IT_1",
        "1985_F_I  T_2" 
    ]
})

In [104]:
df 

              Data
0     1987_M_US _1
1     1990?_M_UK_1
2      1992_F_US_2
3  1070?_M_   IT_1
4    1985_F_I  T_2


We can see that this single column holds four values: year, sex, country, and number of children. The problem is that they are all packed into one column, separated by underscores. Pandas has a convenient method named split for this.

# The split() method is used to break a string into multiple parts (substrings) based on a given separator (like a space, comma, or underscore).


Pandas exposes type-specific methods through accessor attributes: str for strings, dt for datetimes, and cat for categoricals. The split method lives on the str accessor.

In [105]:
df["Data"].str.split("_") 

0       [1987, M, US , 1]
1       [1990?, M, UK, 1]
2        [1992, F, US, 2]
3    [1070?, M,    IT, 1]
4      [1985, F, I  T, 2]
Name: Data, dtype: object

In [106]:
df = df["Data"].str.split("_", expand=True)

1. The split() function (when used as df["col"].str.split("_"))

• Splits each string in the column into a list of substrings based on the delimiter (e.g., _).

• The column now contains lists instead of strings.

2. The expand=True argument

• Takes each list (from step 1) and “expands” it into separate columns.

• Each element of the list at a specific index is placed into its own column (0, 1, 2, …).
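
The two steps can be compared side by side. This is a small sketch with made-up records; the Series `raw` and the result names are illustrative and not part of the data used in this section.

```python
import pandas as pd

raw = pd.Series(["1987_M_US", "1992_F_UK"])

# Step 1: each element becomes a list of substrings
as_lists = raw.str.split("_")

# Step 2: expand=True spreads the list items across columns 0, 1, 2
as_columns = raw.str.split("_", expand=True)

print(as_lists.iloc[0])
print(as_columns.shape)
```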

In [107]:
df

       0  1      2  3
0   1987  M    US   1
1  1990?  M     UK  1
2   1992  F     US  2
3  1070?  M     IT  1
4   1985  F   I  T  2


In [108]:
# Name the columns
df.columns = ["Year", "Sex", "Country", "No Children"]

# Note that the attribute is columns (plural), not column

In [109]:
df

    Year Sex Country No Children
0   1987   M     US            1
1  1990?   M      UK           1
2   1992   F      US           2
3  1070?   M      IT           1
4   1985   F    I  T           2


In [110]:
df["Year"].str.contains(r"\?")

0    False
1     True
2    False
3     True
4    False
Name: Year, dtype: bool

In [111]:
df

    Year Sex Country No Children
0   1987   M     US            1
1  1990?   M      UK           1
2   1992   F      US           2
3  1070?   M      IT           1
4   1985   F    I  T           2


A question mark (?) has a special meaning in search patterns (regex).

• It normally means: "the thing before me can appear 0 or 1 time."

• For example, (\+91)? means the "+91" country code may or may not be present, so a pattern built around it matches both:

	• "+919876543210"

	• "9876543210"

• So if you write just "?", pandas treats it as this special rule, not as the actual ? symbol (and raises an error, because there is nothing before it to repeat). To match a literal question mark, escape it as \?.
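
There are two common ways to look for a literal question mark: escape it in the pattern, or turn regex matching off. A minimal sketch, using an illustrative Series named `years`:

```python
import pandas as pd

years = pd.Series(["1987", "1990?", "1992"])

# Option 1: escape the question mark inside a raw-string regex
escaped = years.str.contains(r"\?")

# Option 2: treat the pattern as plain text instead of a regex
literal = years.str.contains("?", regex=False)

print(escaped.tolist())
print(literal.tolist())
```

Using a raw string (the r prefix) keeps Python from trying to interpret the backslash itself.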

In [112]:
df["Country"].str.contains("U")

0     True
1     True
2     True
3    False
4    False
Name: Country, dtype: bool

In [113]:
df["Country"].str.strip()

# strip removes unwanted whitespace from the beginning and the end of each string. Spaces inside the string, as in "I  T", are left alone.

0      US
1      UK
2      US
3      IT
4    I  T
Name: Country, dtype: object

In [114]:
df["Country"].str.replace(" ", "")

# This replaces every space with an empty string, so the spaces inside "I  T" are removed as well.

0    US
1    UK
2    US
3    IT
4    IT
Name: Country, dtype: object

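As mentioned, replace and contains can take regex patterns, which makes it easier to replace values in bulk. Note that in recent pandas versions (2.0 and later) str.replace treats the pattern as plain text unless regex=True is passed, while str.contains uses regex by default.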

# Instead of only matching exact words, they can match patterns of text – which lets you handle many variations at once.
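
For example, a whitespace pattern can clean country codes regardless of where the spaces sit. This is a sketch under the assumption that any whitespace inside a code is unwanted; `countries` is an illustrative Series that mirrors the Country values.

```python
import pandas as pd

countries = pd.Series(["US ", "UK", "   IT", "I  T"])

# strip() only trims the ends, so the inner spaces of "I  T" survive
trimmed = countries.str.strip()

# \s+ matches any run of whitespace, wherever it appears
cleaned = countries.str.replace(r"\s+", "", regex=True)

print(trimmed.tolist())
print(cleaned.tolist())
```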

In [115]:
# Without regex (the "City" column is hypothetical, for illustration only)
# df["City"].str.replace("NYC", "New York")
# This replaces only the exact text "NYC".

# With regex
# df["City"].str.replace("N[Yy][Cc]", "New York", regex=True)
# This replaces "NYC", "NyC", "NYc" and "Nyc" all at once, because [Yy] means "Y or y" and [Cc] means "C or c".
# The leading N is not in brackets, so it must be uppercase.


In [116]:
df["Year"].str.replace(r'(?P<year>\d{4})\?', lambda m: m.group('year'), regex=True)

0    1987
1    1990
2    1992
3    1070
4    1985
Name: Year, dtype: object
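
The same cleanup can also be written without a callback. The sketch below shows two alternatives, str.extract and a plain regex replace, and then converts the result to integers. The Series `year` is a stand-in for df["Year"], and the other names are illustrative.

```python
import pandas as pd

year = pd.Series(["1987", "1990?", "1992", "1070?", "1985"])

# Pull out the four digits; expand=False returns a Series
extracted = year.str.extract(r"(\d{4})", expand=False)

# Or simply delete the literal question mark
replaced = year.str.replace(r"\?", "", regex=True)

# Once the text is clean, the values can be converted to numbers
as_int = extracted.astype(int)

print(as_int.tolist())
```

Which form to use is mostly a matter of taste; the callback version becomes useful when the replacement has to be computed from the match.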