# **Removing Duplicates**


## Introduction


In this lab, you will focus on data wrangling, an important step in preparing data for analysis. Data wrangling involves cleaning and organizing data so that it is suitable for analysis. One key task in this process is removing duplicate entries: repeated records that can distort analysis and lead to inaccurate conclusions.


## Objectives


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Step 1: Import Required Libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Step 2: Load the Dataset into a DataFrame


#### **Read Data**


If you are using JupyterLite, use the code below to download the dataset into your environment. If you are using a local environment, you can use the direct URL with <code>pd.read_csv()</code>.


**Load the data into a pandas dataframe:**


In [None]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")

### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
   1. Count the number of duplicate rows in the dataset.
   2. Display the first few duplicate rows to understand their structure.


In [None]:
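# your code goes here
# Count rows that are exact copies of an earlier row
duplicate_count = df.duplicated().sum()
print('duplicated are:', duplicate_count)

# Show the first few duplicate rows
# (the variable is not called `display`, which would shadow IPython's display())
duplicate_rows = df[df.duplicated()].head()
print(duplicate_rows)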

duplicated are: 0
Empty DataFrame
Columns: [ResponseId, MainBranch, Age, Employment, RemoteWork, Check, CodingActivities, EdLevel, LearnCode, LearnCodeOnline, TechDoc, YearsCode, YearsCodePro, DevType, OrgSize, PurchaseInfluence, BuyNewTool, BuildvsBuy, TechEndorse, Country, Currency, CompTotal, LanguageHaveWorkedWith, LanguageWantToWorkWith, LanguageAdmired, DatabaseHaveWorkedWith, DatabaseWantToWorkWith, DatabaseAdmired, PlatformHaveWorkedWith, PlatformWantToWorkWith, PlatformAdmired, WebframeHaveWorkedWith, WebframeWantToWorkWith, WebframeAdmired, EmbeddedHaveWorkedWith, EmbeddedWantToWorkWith, EmbeddedAdmired, MiscTechHaveWorkedWith, MiscTechWantToWorkWith, MiscTechAdmired, ToolsTechHaveWorkedWith, ToolsTechWantToWorkWith, ToolsTechAdmired, NEWCollabToolsHaveWorkedWith, NEWCollabToolsWantToWorkWith, NEWCollabToolsAdmired, OpSysPersonal use, OpSysProfessional use, OfficeStackAsyncHaveWorkedWith, OfficeStackAsyncWantToWorkWith, OfficeStackAsyncAdmired, OfficeStackSyncHaveWorkedWith, 
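
A count of 0 is what a full-row comparison returns whenever some column, such as an identifier, is unique for every row. The sketch below uses a small made-up DataFrame (not the survey data) to show how `duplicated()` behaves with its `subset` and `keep` parameters. The column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 are exact copies,
# rows 1 and 3 share an id but differ in country.
toy = pd.DataFrame({
    "id": [1, 2, 1, 2],
    "country": ["US", "DE", "US", "FR"],
})

# Exact duplicates: every column must match an earlier row.
full_dupes = toy.duplicated()

# Key-based duplicates: only the "id" column is compared.
id_dupes = toy.duplicated(subset=["id"])

# keep=False flags every member of a duplicate group, not only the repeats.
all_members = toy.duplicated(keep=False)

print(full_dupes.sum(), id_dupes.sum(), all_members.sum())  # 1 2 2
```

Checking a subset of columns can reveal repeated records that a full-row check misses. Which columns identify a record is a judgment call that depends on the dataset.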

### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the drop_duplicates() function.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [None]:
# your code goes here
#1
# inplace=True modifies df directly and returns None, so the result is not assigned
df.drop_duplicates(inplace=True)
count = df.duplicated().sum()
print('Duplicated rows --->', count)

Duplicated rows ---> 0


In [None]:
#2
df_cleaned = df.drop_duplicates()

num_duplicates_after = df_cleaned.duplicated().sum()

print(f"Duplicate rows remaining after removal: {num_duplicates_after}")
print(f"New dataset shape: {df_cleaned.shape}")


Duplicate rows remaining after removal: 0
New dataset shape: (65437, 114)
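
The two cells above use the two calling styles of `drop_duplicates()`. As a minimal sketch on made-up data (the frame and its column names are assumptions), the difference between them looks like this:

```python
import pandas as pd

# Toy data (hypothetical): row 2 is an exact copy of row 0.
toy = pd.DataFrame({"id": [1, 2, 1], "score": [10, 20, 10]})

# Default call: returns a new DataFrame and leaves `toy` unchanged.
deduped = toy.drop_duplicates()

# inplace=True modifies the frame itself and returns None.
work = toy.copy()
result = work.drop_duplicates(inplace=True)

# reset_index(drop=True) renumbers the remaining rows 0..n-1.
deduped = deduped.reset_index(drop=True)

print(len(toy), len(deduped), len(work), result)  # 3 2 2 None
```

Assigning the returned copy to a new name, as in the second cell, keeps the original frame available for comparison.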


### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [None]:
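# your code goes here
#1
# Boolean mask: True where a value is missing, False where it is present
missing_values = df.isnull()

# For each column, count the missing (True) and present (False) entries
for column in missing_values.columns:
    print(missing_values[column].value_counts())
    print('')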


ResponseId
False    65437
Name: count, dtype: int64

MainBranch
False    65437
Name: count, dtype: int64

Age
False    65437
Name: count, dtype: int64

Employment
False    65437
Name: count, dtype: int64

RemoteWork
False    54806
True     10631
Name: count, dtype: int64

Check
False    65437
Name: count, dtype: int64

CodingActivities
False    54466
True     10971
Name: count, dtype: int64

EdLevel
False    60784
True      4653
Name: count, dtype: int64

LearnCode
False    60488
True      4949
Name: count, dtype: int64

LearnCodeOnline
False    49237
True     16200
Name: count, dtype: int64

TechDoc
False    40897
True     24540
Name: count, dtype: int64

YearsCode
False    59869
True      5568
Name: count, dtype: int64

YearsCodePro
False    51610
True     13827
Name: count, dtype: int64

DevType
False    59445
True      5992
Name: count, dtype: int64

OrgSize
False    47480
True     17957
Name: count, dtype: int64

PurchaseInfluence
False    47406
True     18031
Name: count, dtype: 
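
Per-column `value_counts()` output gets long for a wide dataset. A more compact alternative, sketched here on made-up data (the frame below is an assumption, not the survey), is to sum the boolean mask and sort the result:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with gaps in two of three columns.
toy = pd.DataFrame({
    "EdLevel": ["BSc", None, "MSc", "BSc"],
    "WorkExp": [1.0, np.nan, np.nan, 4.0],
    "Age": ["18-24", "25-34", "25-34", "35-44"],
})

# True counts as 1, so sum() gives the number of missing values per column
# and mean() gives the missing fraction.
missing_count = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(1)

summary = (
    pd.DataFrame({"missing": missing_count, "percent": missing_pct})
    .sort_values("missing", ascending=False)
)
print(summary)
```

Sorting puts the columns with the most gaps first, which helps when choosing a column to impute.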

In [None]:
print(df['EdLevel'].dtype)
mode_val = df['EdLevel'].mode()[0]
df['EdLevel'] = df['EdLevel'].fillna(mode_val)

object


In [None]:
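#2
# Impute every column according to its type
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:
        # Numeric columns: fill gaps with the column mean
        mean_val = df[col].mean()
        df[col] = df[col].fillna(mean_val)
    elif df[col].dtype == 'object':
        # Text columns: fill gaps with the most frequent value, if one exists
        if not df[col].mode().empty:
            mode_val = df[col].mode()[0]
            df[col] = df[col].fillna(mode_val)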
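
The loop above fills numeric gaps with the mean, which extreme values can pull upward. The sketch below shows the same pattern on made-up data with the median swapped in for the mean. That swap is an alternative chosen for illustration, not the method used in this lab, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): WorkExp contains one extreme value (100.0).
toy = pd.DataFrame({
    "EdLevel": ["BSc", None, "BSc", "MSc"],
    "WorkExp": [1.0, np.nan, 3.0, 100.0],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Median instead of mean: less sensitive to the outlier.
        fill = imputed[col].median()
    else:
        modes = imputed[col].mode()
        if modes.empty:
            # Column has no non-missing values, so there is nothing to fill with.
            continue
        fill = modes[0]
    imputed[col] = imputed[col].fillna(fill)

print(imputed)
print("mean would have been:", round(toy["WorkExp"].mean(), 2))
```

Here the median fill is 3.0, while the mean would have been about 34.67, so the choice of statistic can change the imputed values considerably.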

### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [None]:
# your code goes here
numeric_df = df.select_dtypes(include=['number'])
print("Integer and Float Columns:")
for col in numeric_df.columns:
    print(col)

print("\nContent of Integer and Float Columns:")
print(numeric_df)
print(df['CompTotal'].value_counts())
print(df['ConvertedCompYearly'].value_counts())

Integer and Float Columns:
ResponseId
CompTotal
WorkExp
JobSatPoints_1
JobSatPoints_4
JobSatPoints_5
JobSatPoints_6
JobSatPoints_7
JobSatPoints_8
JobSatPoints_9
JobSatPoints_10
JobSatPoints_11
ConvertedCompYearly
JobSat

Content of Integer and Float Columns:
       ResponseId  CompTotal  WorkExp  JobSatPoints_1  JobSatPoints_4  \
0               1        NaN      NaN             NaN             NaN   
1               2        NaN     17.0             0.0             0.0   
2               3        NaN      NaN             NaN             NaN   
3               4        NaN      NaN             NaN             NaN   
4               5        NaN      NaN             NaN             NaN   
...           ...        ...      ...             ...             ...   
65432       65433        NaN      NaN             NaN             NaN   
65433       65434        NaN      NaN             NaN             NaN   
65434       65435        NaN      NaN             NaN             NaN   
65435      

In [None]:

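missing_count = df['ConvertedCompYearly'].isnull().sum()
print(f"Missing values in 'ConvertedCompYearly': {missing_count}")

# Drop rows without compensation data; .copy() makes df_cleaned an independent
# frame so the assignment below does not raise SettingWithCopyWarning
df_cleaned = df.dropna(subset=['ConvertedCompYearly']).copy()
# Ensure the column is numeric; values that cannot be parsed become NaN
df_cleaned['ConvertedCompYearly'] = pd.to_numeric(df_cleaned['ConvertedCompYearly'], errors='coerce')
print("\nSummary of Normalized Compensation:")
print(df_cleaned['ConvertedCompYearly'].describe())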

Missing values in 'ConvertedCompYearly': 0

Summary of Normalized Compensation:
count    2.343500e+04
mean     8.615529e+04
std      1.867570e+05
min      1.000000e+00
25%      3.271200e+04
50%      6.500000e+04
75%      1.079715e+05
max      1.625660e+07
Name: ConvertedCompYearly, dtype: float64
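
ConvertedCompYearly already expresses compensation on a common annual scale. If a later analysis also needs the values rescaled, two common options are min-max scaling and z-score standardization. The sketch below applies both to a made-up series. The values are assumptions, and this rescaling is an optional further step rather than part of the task above.

```python
import pandas as pd

# Toy compensation values (hypothetical), named after the survey column.
comp = pd.Series([20000.0, 50000.0, 80000.0], name="ConvertedCompYearly")

# Min-max scaling maps the smallest value to 0 and the largest to 1.
min_max = (comp - comp.min()) / (comp.max() - comp.min())

# Z-score standardization centers on the mean and divides by the
# sample standard deviation (pandas uses ddof=1 by default).
z_score = (comp - comp.mean()) / comp.std()

print(min_max.tolist())
print(z_score.round(2).tolist())
```

Both transformations are sensitive to extreme values, such as the maximum shown in the summary above, so outliers are usually examined first.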


In [None]:
len(df_cleaned['ConvertedCompYearly'])

23435

In [None]:
df_cleaned.head()

| | ResponseId | MainBranch | Age | Employment | RemoteWork | Check | CodingActivities | EdLevel | LearnCode | LearnCodeOnline | ... | JobSatPoints_6 | JobSatPoints_7 | JobSatPoints_8 | JobSatPoints_9 | JobSatPoints_10 | JobSatPoints_11 | SurveyLength | SurveyEase | ConvertedCompYearly | JobSat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72 | 73 | I am a developer by profession | 18-24 years old | Employed, full-time;Student, full-time;Indepen... | Hybrid (some remote, some in-person) | Apples | Hobby;School or academic work;Professional dev... | Secondary school (e.g. American high school, G... | On the job training;Other online resources (e.... | Technical documentation;Blogs;Written Tutorial... | ... | 65.0 | 100.0 | 100.0 | 100.0 | 50.0 | 90.0 | Too long | Easy | 7322.0 | 10.0 |
| 374 | 375 | I am not primarily a developer, but I write co... | 25-34 years old | Employed, full-time | Hybrid (some remote, some in-person) | Apples | Hobby;School or academic work;Professional dev... | Professional degree (JD, MD, Ph.D, Ed.D, etc.) | Books / Physical media;Colleague;On the job tr... | Written Tutorials;Stack Overflow;Written-based... | ... | NaN | NaN | NaN | NaN | NaN | NaN | Appropriate in length | Neither easy nor difficult | 30074.0 | NaN |
| 379 | 380 | I am a developer by profession | 35-44 years old | Employed, full-time | Remote | Apples | Hobby;Bootstrapping a business | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Books / Physical media;Other online resources ... | Technical documentation;Books;Social Media;Wri... | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Too long | Difficult | 91295.0 | 10.0 |
| 385 | 386 | I am a developer by profession | 35-44 years old | Independent contractor, freelancer, or self-em... | Remote | Apples | Hobby | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Books / Physical media;On the job training;Oth... | Technical documentation;Blogs;Written Tutorial... | ... | NaN | NaN | NaN | NaN | NaN | NaN | Too short | Easy | 53703.0 | NaN |
| 389 | 390 | I am a developer by profession | 25-34 years old | Employed, full-time;Student, part-time | Remote | Apples | Hobby;School or academic work | Some college/university study without earning ... | Books / Physical media;Colleague;On the job tr... | Written Tutorials;Stack Overflow;Coding sessio... | ... | 20.0 | 30.0 | 5.0 | 20.0 | 10.0 | 5.0 | Too long | Easy | 110000.0 | 10.0 |
