<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **30** minutes


## Introduction


In this lab, you will focus on data wrangling, an important step in preparing data for analysis. Data wrangling involves cleaning and organizing data to make it suitable for analysis. One key task in this process is removing duplicate entries: repeated records that can distort analysis and lead to inaccurate conclusions.


## Objectives


In this lab you will perform the following:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Install the Required Libraries


In [1]:
!pip install pandas



### Step 1: Import Required Libraries


In [2]:
import pandas as pd
import numpy as np

### Step 2: Load the Dataset into a DataFrame



Load the dataset using pd.read_csv().


In [3]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the last few rows to ensure it loaded correctly
print(df.tail())


       ResponseId                      MainBranch              Age  \
65432       65433  I am a developer by profession  18-24 years old   
65433       65434  I am a developer by profession  25-34 years old   
65434       65435  I am a developer by profession  25-34 years old   
65435       65436  I am a developer by profession  18-24 years old   
65436       65437     I code primarily as a hobby  18-24 years old   

                Employment                            RemoteWork   Check  \
65432  Employed, full-time                                Remote  Apples   
65433  Employed, full-time                                Remote  Apples   
65434  Employed, full-time                             In-person  Apples   
65435  Employed, full-time  Hybrid (some remote, some in-person)  Apples   
65436   Student, full-time                                   NaN  Apples   

                                        CodingActivities  \
65432                      Hobby;School or academic work   
65

**Note: If you are working in a local Jupyter environment, you can pass the URL directly to the <code>pandas.read_csv()</code> function, as shown below:**



#df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [4]:
## Write your code here
duplicate_rows = df[df.duplicated()]
num_duplicates = duplicate_rows.shape[0]

num_duplicates

0

In [5]:
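# Count repeated values within each column
# (every occurrence after the first one counts as a duplicate)
duplicate_counts = {}
for col in df.columns:
    duplicate_counts[col] = df[col].duplicated(keep='first').sum()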

df_column_dupe = pd.DataFrame(list(duplicate_counts.items()), columns=['Column', 'Count Dupe'])
#max_value = df_column_dupe['Count Dupe'].idxmax()
#df_column_dupe.loc[max_value]
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
df_column_dupe

| |Column|Count Dupe|
|-|-|-|
|0|ResponseId|0|
|1|MainBranch|65432|
|2|Age|65429|
|3|Employment|65327|
|4|RemoteWork|65433|
|5|Check|65436|
|6|CodingActivities|65318|
|7|EdLevel|65428|
|8|LearnCode|65018|
|9|LearnCodeOnline|54583|
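
Per-column counts like these mostly reflect how few distinct values each column has, not repeated survey responses. As a minimal sketch, the example below shows how `duplicated()` can instead be restricted to a chosen set of columns through its `subset` parameter. The toy DataFrame and the choice of `Age` and `Employment` as the subset are assumptions made purely for illustration; they are not taken from the survey data.

```python
import pandas as pd

# Toy data (hypothetical, not the survey dataset)
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Age": ["18-24", "25-34", "18-24", "25-34"],
    "Employment": ["Full-time", "Full-time", "Full-time", "Student"],
})

# Whole-row duplicates: none here, because ResponseId is unique per row
full_row_dupes = int(toy.duplicated().sum())

# Duplicates judged only on the selected columns; the first occurrence is kept
subset_mask = toy.duplicated(subset=["Age", "Employment"], keep="first")
subset_dupes = int(subset_mask.sum())

print("full-row duplicates:", full_row_dupes)
print("subset duplicates:", subset_dupes)
print(toy[subset_mask])
```

Whether a subset-based definition of "duplicate" is appropriate depends on the analysis; for survey data with a unique identifier, whole-row comparison is usually the safer default.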


### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the drop_duplicates() function.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [7]:
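## Write your code here
# Step 3 found no fully duplicated rows, so drop_duplicates() leaves the data unchanged
df = df.drop_duplicates()
print(df.duplicated().sum())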
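
Because this dataset contains no fully duplicated rows, the effect of `drop_duplicates()` is easier to see on data that does. The following is a small sketch on a made-up DataFrame (the `toy` data is hypothetical) that removes duplicates and verifies the removal by counting again:

```python
import pandas as pd

# Toy data (hypothetical) with one repeated row
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Age": ["18-24", "25-34", "25-34", "35-44"],
})

# Count fully duplicated rows before removal
before = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

# Verify: no duplicates should remain
after = int(deduped.duplicated().sum())

print(f"duplicates before: {before}, after: {after}, rows kept: {len(deduped)}")
```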

### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [5]:
## Write your code here
missing_data = df.isnull()
print(missing_data.head())
#for column in missing_data.columns.values.tolist():
#    print(column)
#    print (missing_data[column].value_counts())
#    print("")

   ResponseId  MainBranch    Age  Employment  RemoteWork  Check  \
0       False       False  False       False       False  False   
1       False       False  False       False       False  False   
2       False       False  False       False       False  False   
3       False       False  False       False        True  False   
4       False       False  False       False        True  False   

   CodingActivities  EdLevel  LearnCode  LearnCodeOnline  ...  JobSatPoints_6  \
0             False    False      False             True  ...            True   
1             False    False      False            False  ...           False   
2             False    False      False            False  ...            True   
3              True    False      False            False  ...            True   
4              True    False      False            False  ...            True   

   JobSatPoints_7  JobSatPoints_8  JobSatPoints_9  JobSatPoints_10  \
0            True            True       
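
A Boolean mask like the one above is hard to scan when there are many columns. One common alternative, sketched here on a small made-up DataFrame (the `toy` data is an assumption for illustration), is to aggregate the mask into a count and a percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the survey dataset)
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", np.nan, "Master", np.nan],
    "RemoteWork": ["Remote", "Hybrid", np.nan, "Remote"],
    "ResponseId": [1, 2, 3, 4],
})

# Number of missing values per column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Share of missing values per column, as a percentage
missing_share = (toy.isnull().mean() * 100).round(1)

print(missing_counts)
print(missing_share)
```

Sorting the counts makes it straightforward to pick a column with a significant share of missing values for imputation.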

In [6]:
# Impute missing EdLevel values with the most frequent category.
# Assigning the result back avoids the chained inplace call, which pandas 3.0 will no longer support.
most_EdLevel = df['EdLevel'].value_counts().idxmax()
df['EdLevel'] = df['EdLevel'].fillna(most_EdLevel)


In [7]:
df['EdLevel'].isnull().value_counts()  # check for remaining null values

EdLevel
False    65437
Name: count, dtype: int64

In [8]:
df['EdLevel']

0                                Primary/elementary school
1             Bachelor’s degree (B.A., B.S., B.Eng., etc.)
2          Master’s degree (M.A., M.S., M.Eng., MBA, etc.)
3        Some college/university study without earning ...
4        Secondary school (e.g. American high school, G...
                               ...                        
65432         Bachelor’s degree (B.A., B.S., B.Eng., etc.)
65433         Bachelor’s degree (B.A., B.S., B.Eng., etc.)
65434         Bachelor’s degree (B.A., B.S., B.Eng., etc.)
65435    Secondary school (e.g. American high school, G...
65436         Bachelor’s degree (B.A., B.S., B.Eng., etc.)
Name: EdLevel, Length: 65437, dtype: object
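
If several categorical columns need the same treatment, the most-frequent-value imputation can be wrapped in a small helper. The sketch below is one possible way to do this; the function name `impute_with_mode` and the toy data are hypothetical and not part of the lab.

```python
import numpy as np
import pandas as pd

def impute_with_mode(frame, columns):
    """Return a copy of frame with NaNs in the given columns replaced by each column's mode."""
    result = frame.copy()
    for col in columns:
        modes = result[col].mode(dropna=True)  # a Series, since ties are possible
        if not modes.empty:  # an all-NaN column has no mode; leave it untouched
            result[col] = result[col].fillna(modes.iloc[0])
    return result

# Toy data (hypothetical, not the survey dataset)
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", np.nan, "Master", "Bachelor", np.nan],
    "RemoteWork": ["Remote", "Remote", np.nan, "Hybrid", "Remote"],
})

imputed = impute_with_mode(toy, ["EdLevel", "RemoteWork"])
print(imputed)
print(imputed.isnull().sum())
```

Working on a copy keeps the original DataFrame available for comparing value distributions before and after imputation.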

### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [9]:
## Write your code here
df['ConvertedCompYearly'].isnull().value_counts()
#df['ConvertedCompYearly'].describe()

ConvertedCompYearly
True     42002
False    23435
Name: count, dtype: int64

In [10]:
# Impute missing compensation with the column mean (mean() must be called, not passed as a method object)
df['ConvertedCompYearly'] = df['ConvertedCompYearly'].fillna(df['ConvertedCompYearly'].mean())
df['ConvertedCompYearly'].isnull().value_counts()  # check for remaining null values


ConvertedCompYearly
False    65437
Name: count, dtype: int64

In [17]:
# Optional scaling by the column maximum (note the call parentheses on max())
#df['ConvertedCompYearly'] = df['ConvertedCompYearly'] / df['ConvertedCompYearly'].max()

df['ConvertedCompYearly'].describe()
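
Task 4 refers to normalized compensation. As a hedged illustration, the sketch below applies min-max scaling to a few made-up compensation values after filling the gap. Two assumptions are worth flagging: the toy values are hypothetical, and the median is used for imputation here as an alternative to the mean because it is less sensitive to extreme salaries. The derived column name `CompMinMax` is also hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the survey dataset)
toy = pd.DataFrame({"ConvertedCompYearly": [50000.0, np.nan, 100000.0, 150000.0]})

# Fill missing values with the median (assumed alternative to the mean)
comp = toy["ConvertedCompYearly"]
comp = comp.fillna(comp.median())

# Min-max scaling: map the smallest value to 0 and the largest to 1
col_min, col_max = comp.min(), comp.max()
toy["CompMinMax"] = (comp - col_min) / (col_max - col_min)

print(toy)
```

Note that both `min()` and `max()` are called with parentheses; passing the bare method instead of its result would store a function object rather than a number.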

### Step 7: Summary and Next Steps


**In this lab, you focused on identifying and removing duplicate rows.**

- You handled missing values by imputing the most frequent value in a chosen column.

- You used ConvertedCompYearly for compensation normalization and handled missing values.

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.


In [None]:
## Write your code here

<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


Copyright © IBM Corporation. All rights reserved.
