<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **45 to 60** minutes


## Introduction


In this lab, you will focus on data wrangling, the step that cleans and organizes raw data so that it is suitable for analysis. One key task in this process is removing duplicate entries: repeated records that can distort statistics and lead to inaccurate conclusions.
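
To make the distortion concrete, here is a minimal sketch on made-up data (the column names and salary values are illustrative assumptions, not taken from the survey). One respondent recorded twice is enough to shift the mean:

```python
import pandas as pd

# Toy data, invented for illustration: respondent 3 was recorded twice.
salaries = pd.DataFrame({
    "respondent_id": [1, 2, 3, 3],
    "salary": [50_000, 60_000, 130_000, 130_000],
})

# The repeated high salary pulls the average up.
mean_with_duplicates = salaries["salary"].mean()

# drop_duplicates() keeps the first copy of each fully identical row.
mean_deduplicated = salaries.drop_duplicates()["salary"].mean()

print(mean_with_duplicates)  # 92500.0
print(mean_deduplicated)     # 80000.0
```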


## Objectives


In this lab you will perform the following:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Install the Required Libraries


In [None]:
!pip install pandas

### Step 1: Import Required Libraries


In [1]:
import pandas as pd

### Step 2: Load the Dataset into a DataFrame


#### **Read Data**


If you are using JupyterLite, use the code below to download the dataset into your environment. If you are using a local environment, you can use the direct URL with <code>pd.read_csv()</code>.


In [None]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

# URL of the survey dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Download the dataset
await download(file_path, "survey_data.csv")
file_name = "survey_data.csv"


**Load the data into a pandas dataframe:**


In [38]:
# Path to the downloaded CSV; adjust it if the file lives elsewhere on your machine
file_name = "survey_data.csv"
df = pd.read_csv(file_name)

**Note: If you are working in a local Jupyter environment, you can pass the URL directly to the <code>pandas.read_csv()</code> function as shown below:**

<code>df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")</code>
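
If the same notebook should run both after a JupyterLite download and in a local environment, one option is a small helper that prefers the downloaded file and falls back to the URL. This is only a sketch: `load_survey`, `DATA_URL`, and the default path are hypothetical names introduced here, not part of the lab.

```python
from pathlib import Path

import pandas as pd

# URL taken from the lab text; the constant name is an assumption.
DATA_URL = (
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "n01PQ9pSmiRX6520flujwQ/survey-data.csv"
)


def load_survey(local_path="survey_data.csv", url=DATA_URL):
    """Read the local CSV if it exists; otherwise read straight from the URL."""
    source = local_path if Path(local_path).exists() else url
    return pd.read_csv(source)
```

Calling `df = load_survey()` then works without a machine-specific absolute path in the notebook.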


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [39]:
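# Columns used to decide whether two responses describe the same respondent
cols = ["MainBranch", "Employment", "RemoteWork", "Country", "YearsCode", "DevType", "CompTotal", "WorkExp", "EdLevel"]
# Count how often each combination of values occurs; the most repeated combinations come first
df[cols].value_counts().reset_index(name="count").head(10)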

Unnamed: 0,MainBranch,Employment,RemoteWork,Country,YearsCode,DevType,CompTotal,WorkExp,EdLevel,count
0,I am a developer by profession,"Employed, full-time",Remote,United States of America,12,"Developer, full-stack",110000.0,7.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",3
1,I am a developer by profession,"Employed, full-time",In-person,India,4,"Developer, full-stack",500000.0,1.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",2
2,I am a developer by profession,"Employed, full-time",In-person,India,2,"Developer, full-stack",483000.0,2.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",2
3,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Poland,9,"Developer, full-stack",250000.0,17.0,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",2
4,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Canada,12,"Developer, back-end",90000.0,6.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",2
5,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Philippines,6,DevOps specialist,300000.0,2.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",2
6,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",United States of America,14,"Developer, back-end",200000.0,10.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",2
7,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",India,15,"Developer, mobile",3500000.0,11.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",2
8,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Canada,8,"Developer, full-stack",80000.0,4.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",2
9,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",United Kingdom of Great Britain and Northern I...,5,"Developer, full-stack",30000.0,1.0,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",2
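
One caveat when reading a table like this: by default `DataFrame.value_counts()` skips any row that has a missing value in the counted columns, while `DataFrame.duplicated()` treats missing values as equal to each other. The two approaches can therefore report different numbers. A minimal sketch on made-up data (the values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy data: two identical complete rows and two identical rows with a missing value.
toy = pd.DataFrame({
    "Country": ["India", "India", "Poland", "Poland"],
    "WorkExp": [2.0, 2.0, np.nan, np.nan],
})

# Default dropna=True: rows containing NaN are left out of the count.
default_counts = toy.value_counts().reset_index(name="count")

# dropna=False: combinations containing NaN are counted as well.
with_missing = toy.value_counts(dropna=False).reset_index(name="count")

print(len(default_counts))     # 1: only the India / 2.0 combination
print(len(with_missing))       # 2: both combinations
print(toy.duplicated().sum())  # 2: duplicated() flags the NaN rows too
```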


In [40]:
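# Flag every row whose values in `cols` repeat an earlier row (the first occurrence is not flagged)
cols = ["MainBranch", "Employment", "RemoteWork", "Country", "YearsCode", "DevType", "CompTotal", "WorkExp", "EdLevel"]
dup_rows = df[df.duplicated(subset=cols)]
print(f"There are {len(dup_rows)} duplicate rows.")
# Display the first few duplicate rows
dup_rows.head()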

There are 7240 duplicate rows.


Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
366,367,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Professional development or self-paced l...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Online Courses or Certification,,...,60.0,30.0,10.0,0.0,0.0,0.0,Appropriate in length,Easy,,5.0
911,912,I am a developer by profession,45-54 years old,"Independent contractor, freelancer, or self-em...",In-person,Apples,Hobby;Contribute to open-source projects;Profe...,"Secondary school (e.g. American high school, G...",Books / Physical media;Colleague;Other online ...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,76307.0,
970,971,I code primarily as a hobby,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
1072,1073,I am a developer by profession,55-64 years old,"Independent contractor, freelancer, or self-em...",Remote,Apples,Professional development or self-paced learnin...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;On the job training;Oth...,Technical documentation;Books;Written Tutorial...,...,,,,,,,Appropriate in length,Easy,,
1428,1429,I am learning to code,Under 18 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Books;Stack Overflow;How-to videos;Video-based...,...,,,,,,,Appropriate in length,Easy,,
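
The count above depends on the `keep` argument of `duplicated()`. With the default `keep="first"`, only the repeats after the first occurrence are flagged; with `keep=False`, every member of a duplicate group is flagged, which is handy for inspecting groups side by side. A small sketch on made-up data:

```python
import pandas as pd

# Toy data: one combination appears three times, another once.
toy = pd.DataFrame({
    "Country": ["India", "India", "India", "Canada"],
    "YearsCode": [4, 4, 4, 12],
})

# Default keep="first": the first occurrence is not counted as a duplicate.
repeats_only = toy.duplicated().sum()

# keep=False: all rows that belong to a duplicate group are flagged.
all_members = toy.duplicated(keep=False).sum()

print(repeats_only)  # 2
print(all_members)   # 3

# Sorting by the compared columns places the members of each group next to each other.
print(toy[toy.duplicated(keep=False)].sort_values(list(toy.columns)))
```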


### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the <code>drop_duplicates()</code> function.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [41]:
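# Record the row count, drop repeated rows (keeping the first of each group), then compare
rows_before = len(df)
df.drop_duplicates(subset=cols, inplace=True)
rows_after = len(df)
print("Number of duplicate rows removed:", rows_before - rows_after)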

Number of duplicate rows removed: 7240
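
The second part of the task asks for a re-count after removal. A minimal sketch of that check, using made-up data and hypothetical key columns (`key_cols` is an illustrative name, not part of the lab):

```python
import pandas as pd

# Hypothetical key columns for the toy data.
key_cols = ["Country", "YearsCode"]

toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Country": ["India", "India", "Canada", "India"],
    "YearsCode": [4, 4, 12, 4],
})

# Keep the first row of each duplicate group and renumber the index.
deduped = toy.drop_duplicates(subset=key_cols, keep="first").reset_index(drop=True)

# Re-run the same duplicate check on the cleaned frame.
remaining = deduped.duplicated(subset=key_cols).sum()

print(len(toy) - len(deduped))  # 2 rows removed
print(remaining)                # 0 duplicates left
```

On the lab's DataFrame, the equivalent check would be `df.duplicated(subset=cols).sum()`, which should be 0 after the removal.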


### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [46]:
# your code goes here
# Print the number of missing values in each column
for col in df.columns:
    print(f"column: {col}\nmissing values: {df[col].isna().sum()}\n\n")

column: ResponseId
missing values: 0


column: MainBranch
missing values: 0


column: Age
missing values: 0


column: Employment
missing values: 0


column: RemoteWork
missing values: 8052


column: Check
missing values: 0


column: CodingActivities
missing values: 8058


column: EdLevel
missing values: 256


column: LearnCode
missing values: 477


column: LearnCodeOnline
missing values: 11163


column: TechDoc
missing values: 18876


column: YearsCode
missing values: 792


column: YearsCodePro
missing values: 7882


column: DevType
missing values: 1188


column: OrgSize
missing values: 11569


column: PurchaseInfluence
missing values: 11632


column: BuyNewTool
missing values: 13775


column: BuildvsBuy
missing values: 15496


column: TechEndorse
missing values: 15149


column: Country
missing values: 1508


column: Currency
missing values: 12171


column: CompTotal
missing values: 24561


column: LanguageHaveWorkedWith
missing values: 2271


column: LanguageWantToWorkWith
missing val

In [None]:
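# Impute missing EdLevel values with the most frequent value (the mode)
mode = df["EdLevel"].value_counts().idxmax()
# Assign the result back: fillna(..., inplace=True) on a selected column warns in recent pandas
# and does not update df when Copy-on-Write is enabled
df["EdLevel"] = df["EdLevel"].fillna(mode)
# Count the remaining missing values
df["EdLevel"].isna().sum()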

np.int64(0)
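
Both sub-tasks can also be written in a more compact form. The sketch below uses made-up data: `isna().sum()` on the whole DataFrame gives one count per column, and `Series.mode()` returns the most frequent value while ignoring missing entries. Taking `.iloc[0]` is an assumption about how to break ties (it picks the first of the sorted modes).

```python
import numpy as np
import pandas as pd

# Toy data with gaps in both columns.
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", "Master", "Bachelor", np.nan, np.nan],
    "Country": ["India", np.nan, "Canada", "Poland", "India"],
})

# Missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)

# mode() skips NaN and returns every tied mode in sorted order; take the first.
most_frequent = toy["EdLevel"].mode().iloc[0]

# Assign the filled column back instead of using inplace=True on a selection.
toy["EdLevel"] = toy["EdLevel"].fillna(most_frequent)

print(most_frequent)                 # Bachelor
print(toy["EdLevel"].isna().sum())   # 0
```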

### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [61]:
df["ConvertedCompYearly"].describe()

count    2.338700e+04
mean     8.613105e+04
std      1.869297e+05
min      1.000000e+00
25%      3.269600e+04
50%      6.500000e+04
75%      1.079370e+05
max      1.625660e+07
Name: ConvertedCompYearly, dtype: float64

In [64]:
print("missing values:",df["ConvertedCompYearly"].isna().sum())

missing values: 34810


In [92]:
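# Fill missing compensation with the median; assign back rather than using inplace=True on a selected column
df["ConvertedCompYearly"] = df["ConvertedCompYearly"].fillna(df["ConvertedCompYearly"].median())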
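
When many values are imputed, it can help to record which rows were originally missing so that later analysis can include or exclude them. The sketch below shows that pattern on made-up numbers; the indicator column `CompWasMissing` is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy compensation column with two missing entries.
comp = pd.DataFrame({
    "ConvertedCompYearly": [30_000.0, np.nan, 65_000.0, np.nan, 100_000.0],
})

# Remember which rows had no reported compensation (hypothetical column name).
comp["CompWasMissing"] = comp["ConvertedCompYearly"].isna()

# median() skips NaN, so it is computed from the reported values only.
median_comp = comp["ConvertedCompYearly"].median()
comp["ConvertedCompYearly"] = comp["ConvertedCompYearly"].fillna(median_comp)

print(median_comp)                    # 65000.0
print(comp["CompWasMissing"].sum())   # 2
```

In the lab data, 34810 values were missing against 23387 reported, so the imputed rows outnumber the observed ones. A flag like this lets you restrict compensation statistics to `comp[~comp["CompWasMissing"]]` when needed.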

### Step 7: Summary and Next Steps


**In this lab, you focused on identifying and removing duplicate rows.**

- You handled missing values by imputing the most frequent value in a chosen column (EdLevel).

- You used ConvertedCompYearly, the already normalized annual compensation, and filled its missing values with the median.

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.


<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


## <h3 align="center"> © IBM Corporation. All rights reserved. </h3>
