### Handling Duplicate Data

Duplicate data refers to rows in a dataset that are exact copies of other rows. Handling duplicates is essential to ensure data quality and prevent bias in analysis or model training.

---

#### Process:

1. **Detecting Duplicates**:
   - Use the `duplicated()` method to identify duplicate rows in the dataset.
   - This method returns a Boolean Series in which `True` marks a row that repeats an earlier row and `False` marks every other row, including the first occurrence of a repeated row.

2. **Removing Duplicates**:
   - Use the `drop_duplicates()` method to remove duplicate rows from the dataset.
   - By default, it keeps the first occurrence of each duplicated row and removes the later copies. It returns a new DataFrame and leaves the original unchanged unless `inplace=True` is passed.
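
The two steps above can be sketched on a small made-up DataFrame. The name `df` and its values are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
df = pd.DataFrame({"id": [1, 2, 1], "score": [10, 20, 10]})

# Step 1: flag duplicates. With the default keep="first",
# only the later copy (row 2) is marked True.
mask = df.duplicated()

# Step 2: remove duplicates. A new DataFrame is returned; df is unchanged.
deduped = df.drop_duplicates()

# keep="last" retains the final occurrence instead of the first.
deduped_last = df.drop_duplicates(keep="last")
```

Both methods accept the `keep` argument, so the choice of which copy counts as the original can be made consistently across detection and removal.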

---

#### Example:

- The dataset `marks_data` contains information about students' names and their scores in Math and Physics.
- The `duplicated()` method is applied to identify rows where the `Name`, `Math`, and `Phy` values all match those of an earlier row.
- The `drop_duplicates()` method is then used to create a new dataset with duplicates removed.

---

#### Benefits:
- **Improves Data Quality**: Ensures the dataset represents unique records, reducing redundancy.
- **Prevents Bias**: Avoids giving undue weight to duplicate entries during analysis or model training.
- **Enhances Performance**: Reduces the dataset size, improving computational efficiency.
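
As a rough illustration of the bias point, the sketch below compares a column mean before and after removing duplicates. It reuses the values of the marks example from this section; the variable names are assumptions made for this snippet.

```python
import pandas as pd

# Same values as the marks example, rebuilt so the snippet runs on its own.
scores = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "a", "c"],
    "Math": [10, 8, 9, 7, 10, 5],
    "Phy": [5, 6, 7, 8, 5, 7],
})

# The repeated row for student "a" counts twice in the first mean.
mean_with_dups = scores["Math"].mean()                       # 49 / 6
mean_without_dups = scores.drop_duplicates()["Math"].mean()  # 39 / 5

# Number of rows removed by de-duplication.
rows_removed = len(scores) - len(scores.drop_duplicates())
```

The duplicated score of 10 pulls the mean upward; after de-duplication each distinct record contributes once.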

---

#### Note:
- When handling duplicates, ensure that removing duplicates aligns with the dataset's context and analysis requirements.  
- The `subset` parameter in `drop_duplicates()` can be used to specify columns for checking duplicates if only specific columns matter.
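
A minimal sketch of the `subset` parameter, again using the values of the marks example (the variable names here are chosen for illustration only):

```python
import pandas as pd

# Same values as the marks example in this section, rebuilt here so the
# snippet runs on its own.
marks_example = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "a", "c"],
    "Math": [10, 8, 9, 7, 10, 5],
    "Phy": [5, 6, 7, 8, 5, 7],
})

# Compare whole rows: only index 4 repeats an earlier row.
full_row_dups = marks_example.duplicated()

# Compare the Name column only: indexes 4 and 5 repeat earlier names.
name_dups = marks_example.duplicated(subset=["Name"])

# Keep one row per name (the first occurrence by default).
one_per_name = marks_example.drop_duplicates(subset=["Name"])
```

Dropping by `Name` alone also discards row 5, whose `Math` score differs from the earlier row for `c`. That is only appropriate when the name by itself identifies a record, which is the kind of context check the first note describes.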


In [29]:
import pandas as pd
    

In [30]:
marks = {'Name':['a', 'b', 'c', 'd', 'a', 'c'], 'Math':[10, 8, 9, 7, 10, 5], 'Phy':[5, 6, 7, 8, 5, 7]}

In [31]:
marks_data = pd.DataFrame(marks)
marks_data

|   | Name | Math | Phy |
|---|------|------|-----|
| 0 | a | 10 | 5 |
| 1 | b | 8 | 6 |
| 2 | c | 9 | 7 |
| 3 | d | 7 | 8 |
| 4 | a | 10 | 5 |
| 5 | c | 5 | 7 |


In [32]:
marks_data.duplicated()

0    False
1    False
2    False
3    False
4     True
5    False
dtype: bool
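
Before dropping anything, it can help to count the flagged rows and look at each duplicate group in full. The sketch below does this on a standalone copy of the marks data; `marks_check` is a name assumed for this snippet.

```python
import pandas as pd

# Standalone copy of the marks data shown above.
marks_check = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "a", "c"],
    "Math": [10, 8, 9, 7, 10, 5],
    "Phy": [5, 6, 7, 8, 5, 7],
})

# Summing the Boolean mask counts the rows flagged as repeats (True counts as 1).
n_duplicates = marks_check.duplicated().sum()

# keep=False flags every member of a duplicate group, so both copies
# can be viewed side by side before anything is dropped.
duplicate_groups = marks_check[marks_check.duplicated(keep=False)]
```

Here `duplicate_groups` holds rows 0 and 4, the two identical records for student `a`.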

In [33]:
marks_data.drop_duplicates()


|   | Name | Math | Phy |
|---|------|------|-----|
| 0 | a | 10 | 5 |
| 1 | b | 8 | 6 |
| 2 | c | 9 | 7 |
| 3 | d | 7 | 8 |
| 5 | c | 5 | 7 |


In [34]:
loan_data = pd.read_csv('loan.csv')
loan_data.head(3)

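|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---------|--------|---------|------------|-----------|---------------|-----------------|-------------------|------------|------------------|----------------|---------------|-------------|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |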


In [35]:
loan_data.shape

(614, 13)

In [38]:
# drop_duplicates() returns a new DataFrame, so check the shape of its result
loan_data.drop_duplicates().shape

(614, 13)

**The row count is unchanged, so the dataset contains no duplicate rows.**
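
A more direct way to confirm the absence of duplicates is to test the Boolean mask itself. The sketch below uses a small hypothetical stand-in for the loan table (`loans`, with made-up IDs) instead of reading `loan.csv`.

```python
import pandas as pd

# Hypothetical stand-in for a loan table; loan.csv is not read here.
loans = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003"],
    "ApplicantIncome": [5849, 4583, 3000],
})

# Direct check: True if any row repeats an earlier one.
has_duplicates = loans.duplicated().any()

# drop_duplicates() leaves `loans` untouched, so capture the result
# before comparing row counts.
deduped_loans = loans.drop_duplicates()
same_row_count = len(deduped_loans) == len(loans)

# An identifier column can also be checked on its own.
ids_are_unique = loans["Loan_ID"].is_unique
```

When a table has an identifier column such as `Loan_ID`, checking that column for uniqueness is often a stricter test than comparing whole rows, since two records can share an ID while differing elsewhere.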