<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **30** minutes


## Introduction


In this lab, you will focus on data wrangling: cleaning and organizing raw data so that it is suitable for analysis. One key task in this process is removing duplicate entries, which can distort results and lead to inaccurate conclusions.


## Objectives


In this lab you will perform the following:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Install the Required Libraries


In [1]:
!pip install pandas



### Step 1: Import Required Libraries


In [10]:
import pandas as pd
import numpy as np

### Step 2: Load the Dataset into a DataFrame



Load the dataset using pd.read_csv().


In [3]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

**Note: If you are working in a local Jupyter environment, you can use the URL directly in the <code>pandas.read_csv()</code> function as shown below:**



#df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [4]:
## Write your code here
# Select every row that repeats an earlier row in all columns
dup = df[df.duplicated()]
# len() gives the number of duplicate rows; count() would instead report non-null values per column
print('The number of duplicate rows is', len(dup))
print(dup.head())

The number of duplicate rows is 0
Empty DataFrame
Columns: [ResponseId, MainBranch, Age, Employment, RemoteWork, Check, CodingActivities, EdLevel, LearnCode, LearnCodeOnline, TechDoc, YearsCode, YearsCodePro, DevType, OrgSize, PurchaseInfluence, BuyNewTool, BuildvsBuy, TechEndorse, Country, Currency, CompTotal, LanguageHaveWorkedWith, LanguageWantToWorkWith, LanguageAdmired, DatabaseHaveWorkedWith, DatabaseWantToWorkWith, DatabaseAdmired, PlatformHaveWorkedWith, PlatformWantToWorkWith, PlatformAdmired, WebframeHaveWorkedWith, WebframeWantToWorkWith, WebframeAdmired, EmbeddedHaveWorkedWith, EmbeddedWantToWorkWith, EmbeddedAdmired, MiscTechHaveWorkedWith, MiscTechWantToWorkWith, MiscTechAdmired, T
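
Since the result above is empty, it can help to see `duplicated()` on data that does contain repeats. The following is a minimal sketch on a small hypothetical DataFrame (the `toy` data and its values are invented for illustration and are not taken from the survey file). It shows how the `keep` and `subset` parameters change which rows are flagged:

```python
import pandas as pd

# Hypothetical toy data, not from the survey file
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Country": ["India", "Brazil", "Brazil", "India"],
})

# Rows that repeat an earlier row in every column (the first occurrence is not flagged)
full_dup_count = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, including the first occurrence
all_members = toy[toy.duplicated(keep=False)]

# subset restricts the comparison to the listed columns
country_dup_count = int(toy.duplicated(subset=["Country"]).sum())

print(full_dup_count, len(all_members), country_dup_count)
```

Counting with `subset` is useful when a row should be treated as a repeat even though an identifier column differs.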

### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the drop_duplicates() function.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [5]:
## Write your code here
df.drop_duplicates(keep='first', inplace=True)
remaining_dups = df.duplicated(keep=False).sum()
print(f"The remaining duplicates are {remaining_dups}")

The remaining duplicates are 0
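
As an alternative to `inplace=True`, `drop_duplicates()` can return a new DataFrame, which leaves the original available for comparison. The sketch below assumes a small hypothetical `toy` DataFrame (invented for illustration) and verifies the removal by comparing row counts:

```python
import pandas as pd

# Hypothetical toy data, not from the survey file
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Country": ["India", "Brazil", "Brazil", "India"],
})

# Returns a new DataFrame; toy itself is left unchanged.
# reset_index(drop=True) renumbers the rows after the removal.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

rows_removed = len(toy) - len(deduped)
remaining = int(deduped.duplicated().sum())
print(f"Removed {rows_removed} row(s); {remaining} duplicates remain")
```

Comparing `len()` before and after gives a second check alongside the duplicate count.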


### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [16]:
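## Write your code here
# Report missing (True) versus present (False) counts for every column
missing_dat = df.isnull()
for column in missing_dat.columns.values.tolist():
    print(f"The column {missing_dat[column].value_counts()}")

# Impute every column that contains missing values
for col_in in df.columns.values.tolist():
    if df[col_in].isnull().any():
        if df[col_in].dtype == 'object':
            # Categorical column: fill with the most frequent value
            fill_val = df[col_in].value_counts().idxmax()
        elif pd.api.types.is_numeric_dtype(df[col_in]):
            # Numeric column: fill with the mean. A column holding NaN is stored
            # as float, so a check for 'int64' would never match here.
            fill_val = df[col_in].mean()
        else:
            continue
        # Assign the result back; calling replace(..., inplace=True) on df[col_in]
        # acts on an intermediate object and triggers a pandas FutureWarning
        df[col_in] = df[col_in].fillna(fill_val)

# Re-check missing values after imputation
missing_dat = df.isnull()
for column in missing_dat.columns.values.tolist():
    print(f"The column {missing_dat[column].value_counts()}")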

The column ResponseId
False    65437
Name: count, dtype: int64
The column MainBranch
False    65437
Name: count, dtype: int64
The column Age
False    65437
Name: count, dtype: int64
The column Employment
False    65437
Name: count, dtype: int64
The column RemoteWork
False    54806
True     10631
Name: count, dtype: int64
The column Check
False    65437
Name: count, dtype: int64
The column CodingActivities
False    54466
True     10971
Name: count, dtype: int64
The column EdLevel
False    60784
True      4653
Name: count, dtype: int64
The column LearnCode
False    60488
True      4949
Name: count, dtype: int64
The column LearnCodeOnline
False    49237
True     16200
Name: count, dtype: int64
The column TechDoc
False    40897
True     24540
Name: count, dtype: int64
The column YearsCode
False    59869
True      5568
Name: count, dtype: int64
The column YearsCodePro
False    51610
True     13827
Name: count, dtype: int64
The column DevType
False    59445
True      5992
Name: count, dtype:


The column ResponseId
False    65437
Name: count, dtype: int64
The column MainBranch
False    65437
Name: count, dtype: int64
The column Age
False    65437
Name: count, dtype: int64
The column Employment
False    65437
Name: count, dtype: int64
The column RemoteWork
False    65437
Name: count, dtype: int64
The column Check
False    65437
Name: count, dtype: int64
The column CodingActivities
False    65437
Name: count, dtype: int64
The column EdLevel
False    65437
Name: count, dtype: int64
The column LearnCode
False    65437
Name: count, dtype: int64
The column LearnCodeOnline
False    65437
Name: count, dtype: int64
The column TechDoc
False    65437
Name: count, dtype: int64
The column YearsCode
False    65437
Name: count, dtype: int64
The column YearsCodePro
False    65437
Name: count, dtype: int64
The column DevType
False    65437
Name: count, dtype: int64
The column OrgSize
False    65437
Name: count, dtype: int64
The column PurchaseInfluence
False    65437
Name: count, dtype: int6
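
The task asks for imputing a single chosen column with its most frequent value. A minimal sketch of that narrower approach follows, using a small hypothetical `toy` DataFrame (the values are invented for illustration). It also shows `isnull().sum()` as a more compact way to list missing counts per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the survey file
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", np.nan, "Master", "Bachelor", np.nan],
    "YearsCode": [5.0, np.nan, 10.0, 3.0, 2.0],
})

# One missing-value count per column, with the most incomplete columns first
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)

# mode() ignores NaN and returns a Series, so take its first entry
most_frequent = toy["EdLevel"].mode()[0]

# Assign the result back to the column instead of using inplace=True on a selection
toy["EdLevel"] = toy["EdLevel"].fillna(most_frequent)
print(toy["EdLevel"].isnull().sum())
```

Imputing one column at a time makes it easier to judge whether the most frequent value is a sensible substitute for that particular field.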

### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [17]:
## Write your code here
df['ConvertedCompYearly'] = df['ConvertedCompYearly']/df['ConvertedCompYearly'].max()
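
The cell above scales compensation by dividing by the column maximum, overwriting the original values. The following sketch shows one possible variation under stated assumptions: it counts missing values first, fills them with the median (an illustrative choice, not something the lab prescribes), and writes a min-max scaled copy to a hypothetical new column named `ConvertedCompYearly_MinMax` so the original figures remain available. The `toy` data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the survey file
toy = pd.DataFrame({"ConvertedCompYearly": [20000.0, np.nan, 60000.0, 100000.0]})

# Count missing values before deciding how to handle them
missing = int(toy["ConvertedCompYearly"].isnull().sum())

# Assumption: fill gaps with the median, which is less sensitive to extreme salaries than the mean
median_comp = toy["ConvertedCompYearly"].median()
toy["ConvertedCompYearly"] = toy["ConvertedCompYearly"].fillna(median_comp)

# Min-max scaling into a separate (hypothetical) column: smallest value maps to 0, largest to 1
comp = toy["ConvertedCompYearly"]
toy["ConvertedCompYearly_MinMax"] = (comp - comp.min()) / (comp.max() - comp.min())
print(toy)
```

Keeping the scaled values in a separate column means later steps can still report compensation in its original units.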

### Step 7: Summary and Next Steps


**In this lab, you focused on identifying and removing duplicate rows.**

- You handled missing values by imputing the most frequent value in a chosen column.

- You used ConvertedCompYearly for compensation normalization and handled missing values.

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.


In [18]:
## Write your code here
for column in df.columns.values.tolist():
    print(df[column].head(2))

0    1
1    2
Name: ResponseId, dtype: int64
0    I am a developer by profession
1    I am a developer by profession
Name: MainBranch, dtype: object
0    Under 18 years old
1       35-44 years old
Name: Age, dtype: object
0    Employed, full-time
1    Employed, full-time
Name: Employment, dtype: object
0    Remote
1    Remote
Name: RemoteWork, dtype: object
0    Apples
1    Apples
Name: Check, dtype: object
0                                                Hobby
1    Hobby;Contribute to open-source projects;Other...
Name: CodingActivities, dtype: object
0                       Primary/elementary school
1    Bachelor’s degree (B.A., B.S., B.Eng., etc.)
Name: EdLevel, dtype: object
0                               Books / Physical media
1    Books / Physical media;Colleague;On the job tr...
Name: LearnCode, dtype: object
0    Technical documentation;Blogs;Written Tutorial...
1    Technical documentation;Blogs;Books;Written Tu...
Name: LearnCodeOnline, dtype: object
0    API document(s) and

<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


Copyright © IBM Corporation. All rights reserved.
