<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **30** minutes


## Introduction


In this lab, you will focus on data wrangling, an important step in preparing data for analysis. Data wrangling involves cleaning and organizing data so that it is suitable for analysis. One key task in this process is removing duplicate entries: repeated records that can distort results and lead to inaccurate conclusions.


## Objectives


In this lab you will perform the following:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Install the Required Libraries


In [1]:
!pip install pandas



### Step 1: Import Required Libraries


In [3]:
import pandas as pd

### Step 2: Load the Dataset into a DataFrame



Load the dataset using pd.read_csv().


In [4]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

**Note: If you are working in a local Jupyter environment, you can use the URL directly in the <code>pandas.read_csv()</code> function as shown below:**



#df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [6]:
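## Write your code here
# duplicated() flags each row that repeats an earlier row; sum() counts the flags
num_duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicate_rows}")

# Select only the flagged rows so their structure can be inspected
duplicate_rows = df[df.duplicated()]
print("First few duplicate rows:")
print(duplicate_rows.head())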

Number of duplicate rows: 0
First few duplicate rows:
Empty DataFrame
Columns: [ResponseId, MainBranch, Age, Employment, RemoteWork, Check, CodingActivities, EdLevel, LearnCode, LearnCodeOnline, TechDoc, YearsCode, YearsCodePro, DevType, OrgSize, PurchaseInfluence, BuyNewTool, BuildvsBuy, TechEndorse, Country, Currency, CompTotal, LanguageHaveWorkedWith, LanguageWantToWorkWith, LanguageAdmired, DatabaseHaveWorkedWith, DatabaseWantToWorkWith, DatabaseAdmired, PlatformHaveWorkedWith, PlatformWantToWorkWith, PlatformAdmired, WebframeHaveWorkedWith, WebframeWantToWorkWith, WebframeAdmired, EmbeddedHaveWorkedWith, EmbeddedWantToWorkWith, EmbeddedAdmired, MiscTechHaveWorkedWith, MiscTechWantToWorkWith, MiscTechAdmired, ToolsTechHaveWorkedWith, ToolsTechWantToWorkWith, ToolsTechAdmired, NEWCollabToolsHaveWorkedWith, NEWCollabToolsWantToWorkWith, NEWCollabToolsAdmired, OpSysPersonal use, OpSysProfessional use, OfficeStackAsyncHaveWorkedWith, OfficeStackAsyncWantToWorkWith, OfficeStackAsyncAdmi
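
Because the check above reports zero fully identical rows, it can help to see how `duplicated()` behaves on data that does contain repeats. The following is a minimal sketch on a small made-up DataFrame: the `toy` frame and its values are illustrative assumptions, not rows from the survey. It shows the `keep` and `subset` parameters. Note that if an identifier column such as ResponseId is unique for every row, no two rows can be fully identical, so a check on a subset of columns may be more informative.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey dataset
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 4],
    "Country": ["India", "Germany", "Germany", "India", "Brazil"],
})

# Default keep="first": only the second and later copies are flagged
later_copies = toy.duplicated().sum()

# keep=False flags every member of a duplicated group
all_copies = toy[toy.duplicated(keep=False)]

# subset restricts the comparison to the listed columns
same_country = toy.duplicated(subset=["Country"]).sum()

print(later_copies, len(all_copies), same_country)
```

Which columns belong in `subset` is a judgment call that depends on what counts as "the same respondent" for a given analysis.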

### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the drop_duplicates() function.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [7]:
## Write your code here

# Step 1: Remove full duplicate rows (if any)
df_no_duplicates = df.drop_duplicates()

# Step 2: Verify removal by checking for duplicates again
remaining_duplicates = df_no_duplicates.duplicated().sum()

print("✅ Duplicate rows removed.")
print(f"Number of duplicate rows after removal: {remaining_duplicates}")


✅ Duplicate rows removed.
Number of duplicate rows after removal: 0
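
A count of zero after removal does not show how many rows were actually dropped. As a rough sketch, the row counts before and after `drop_duplicates()` can be compared. The `toy` frame below is made up for illustration and is not survey data, and resetting the index is an optional choice rather than something the lab requires.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey dataset
toy = pd.DataFrame({
    "MainBranch": ["Developer", "Developer", "Learner", "Developer"],
    "Employment": ["Full-time", "Full-time", "Student", "Part-time"],
})

rows_before = len(toy)

# Drop repeated rows, keep the first occurrence, and renumber the index
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

rows_removed = rows_before - len(deduped)
print(f"Rows before: {rows_before}, after: {len(deduped)}, removed: {rows_removed}")

# The same check used in the lab should now report zero
remaining = deduped.duplicated().sum()
print(f"Remaining duplicates: {remaining}")
```

Note that `drop_duplicates()` returns a new DataFrame, so the original stays unchanged unless the result is assigned back.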


### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [14]:
## Write your code here
# Count missing values per column and keep only the columns that have any
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
print("Missing values per column:")
print(missing_values)

print(f"\nMissing values in 'EdLevel': {df['EdLevel'].isnull().sum()}")
# mode() returns a Series; its first entry is the most frequent value
most_frequent = df['EdLevel'].mode()[0]
print(f"Most frequent value in EdLevel is:{most_frequent}")

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['EdLevel'] = df['EdLevel'].fillna(most_frequent)

print(f"Missing value  in 'EdLevel' after imputation: {df['EdLevel'].isnull().sum()}")

Missing values per column:
AINextMuch less integrated    64289
AINextLess integrated         63082
AINextNo change               52939
AINextMuch more integrated    51999
EmbeddedAdmired               48704
                              ...  
LanguageHaveWorkedWith         5692
YearsCode                      5568
NEWSOSites                     5151
LearnCode                      4949
AISelect                       4530
Length: 108, dtype: int64

Missing values in 'EdLevel': 0
Most frequent value in EdLevel is:Bachelor’s degree (B.A., B.S., B.Eng., etc.)
Missing value  in 'EdLevel' after imputation: 0
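
The mode-imputation step can also be tried in isolation on a small example. This is a minimal sketch under stated assumptions: the `toy` frame and its category labels are invented for illustration, and reporting the share of missing values as a percentage is an optional extra that the lab does not ask for.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the survey dataset
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", "Master", np.nan, "Bachelor", np.nan],
})

missing_before = toy["EdLevel"].isnull().sum()

# Share of missing values per column, expressed as a percentage
missing_pct = toy.isnull().mean() * 100

# mode() skips NaN by default and returns a Series; take the first entry
most_frequent = toy["EdLevel"].mode()[0]

# Assign back rather than modifying the column selection in place
toy["EdLevel"] = toy["EdLevel"].fillna(most_frequent)
missing_after = toy["EdLevel"].isnull().sum()

print(f"Before: {missing_before}, after: {missing_after}, filled with: {most_frequent}")
```

Mode imputation keeps every row but inflates the most common category, so it is worth checking how large the missing share is before relying on it.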


### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [15]:
## Write your code here
missing_comp = df['ConvertedCompYearly'].isnull().sum()
print(f"Missing value in ConvertedCompYearly : {missing_comp}")

Missing value in ConvertedCompYearly : 42002
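
Task 4 asks for missing values to be handled if necessary, while the cell above only counts them. Two common options are sketched below on a made-up `toy` frame (the values are illustrative assumptions, not survey data): dropping rows without a compensation value, or filling the gaps with the median. Which option is appropriate depends on the analysis, and neither is presented here as the lab's required answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the survey dataset
toy = pd.DataFrame({
    "ConvertedCompYearly": [50000.0, np.nan, 70000.0, np.nan, 120000.0],
})

# Option A: drop rows that have no compensation value
comp_dropped = toy.dropna(subset=["ConvertedCompYearly"])

# Option B: fill the gaps with the median, which is less sensitive
# to extreme salaries than the mean
median_comp = toy["ConvertedCompYearly"].median()
comp_filled = toy.assign(
    ConvertedCompYearly=toy["ConvertedCompYearly"].fillna(median_comp)
)

print(f"Rows after dropping: {len(comp_dropped)}")
print(f"Median used for filling: {median_comp}")
print(f"Missing after filling: {comp_filled['ConvertedCompYearly'].isnull().sum()}")
```

When a large share of the column is missing, as in the count above, filling with a single value concentrates many rows at the median, so dropping the rows for compensation-specific analysis may be the safer choice.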


### Step 7: Summary and Next Steps


**In this lab, you focused on identifying and removing duplicate rows.**

- You handled missing values by imputing the most frequent value in a chosen column.

- You used ConvertedCompYearly for compensation analysis and checked it for missing values.

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.



## 🧼 Lab Summary: Data Cleaning Process

### 🔁 1. Identified and Removed Duplicate Rows
- Checked for fully duplicate rows across all columns using `df.duplicated()`.
- Removed duplicates using `df.drop_duplicates()`.
- Verified removal by confirming no duplicates remained.

---

### 🧩 2. Handled Missing Values
- Identified missing values in all columns using `df.isnull().sum()`.
- Selected the `EdLevel` column, as suggested in the task.
- Imputed missing values with the **most frequent (mode)** value using `fillna()`.

---

### 💰 3. Normalized Compensation Data
- Used the `ConvertedCompYearly` column for analyzing annual compensation, as it was already normalized.
- Checked for missing values. Options for handling them include:
  - Dropping rows with missing compensation.
  - Filling them with the **median** value.

---

### 📈 Next Steps
For further analysis, consider:
- Exploring trends across countries, roles, or education levels.
- Visualizing distributions using plots (histograms, box plots, etc.).
- Identifying outliers or grouping compensation by job type.

---

✅ Your dataset is now **clean and ready** for meaningful analysis!


<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


Copyright © IBM Corporation. All rights reserved.
