# **Impute Missing Values**


- **Load the Data:** Import the dataset into a DataFrame using the pandas library.

- **Clean the Data:** Identify and remove duplicate entries to maintain data integrity.

- **Handle Missing Values:** Detect missing values, impute them with appropriate strategies, and verify the imputation to create a complete and reliable dataset for analysis.


## Objectives


-   Identify missing values in the dataset.

-   Apply techniques to impute missing values in the dataset.
  
-   Use suitable techniques to normalize data in the dataset.


-----


#### Install the Required Library


In [None]:
!pip install pandas

### Step 1: Import Required Libraries


In [None]:
import pandas as pd

### Step 2: Load the Dataset Into a DataFrame


#### **Read Data**
<p>
The code below reads the survey dataset from its URL directly into a pandas DataFrame:
</p>


In [None]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")

# Display the first few rows to ensure it loaded correctly
print(df.head())

   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

### Step 3: Finding and Removing Duplicates
##### Task 1: Identify duplicate rows in the dataset.


In [None]:
## Write your code here
# Count rows that exactly repeat an earlier row
duplicated_rows = df.duplicated().sum()
print('Number of duplicate rows:', duplicated_rows)

Number of duplicate rows: 0


##### Task 2: Remove the duplicate rows from the dataframe.



In [None]:
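## Write your code here
# Show any duplicate rows (keep=False marks every copy, not just the repeats)
duplicate_rows = df[df.duplicated(keep=False)]
print("First few duplicate rows:")
print(duplicate_rows.head())

# Remove the duplicates, keeping the first occurrence of each row
df = df.drop_duplicates()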

First few duplicate rows:
Empty DataFrame
Columns: [ResponseId, MainBranch, Age, Employment, RemoteWork, Check, CodingActivities, EdLevel, LearnCode, LearnCodeOnline, TechDoc, YearsCode, YearsCodePro, DevType, OrgSize, PurchaseInfluence, BuyNewTool, BuildvsBuy, TechEndorse, Country, Currency, CompTotal, LanguageHaveWorkedWith, LanguageWantToWorkWith, LanguageAdmired, DatabaseHaveWorkedWith, DatabaseWantToWorkWith, DatabaseAdmired, PlatformHaveWorkedWith, PlatformWantToWorkWith, PlatformAdmired, WebframeHaveWorkedWith, WebframeWantToWorkWith, WebframeAdmired, EmbeddedHaveWorkedWith, EmbeddedWantToWorkWith, EmbeddedAdmired, MiscTechHaveWorkedWith, MiscTechWantToWorkWith, MiscTechAdmired, ToolsTechHaveWorkedWith, ToolsTechWantToWorkWith, ToolsTechAdmired, NEWCollabToolsHaveWorkedWith, NEWCollabToolsWantToWorkWith, NEWCollabToolsAdmired, OpSysPersonal use, OpSysProfessional use, OfficeStackAsyncHaveWorkedWith, OfficeStackAsyncWantToWorkWith, OfficeStackAsyncAdmired, OfficeStackSyncHaveWork
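
Because the survey data contains no duplicates, the removal step has nothing visible to do. As a rough illustration of how `duplicated()` and `drop_duplicates()` behave when repeats do exist, here is a minimal sketch on a small made-up DataFrame (the `toy` data and the variable names are assumptions for illustration, not part of the survey):

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first row exactly.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 1, 3],
    "Employment": [
        "Employed, full-time",
        "Student, full-time",
        "Employed, full-time",
        "Student, full-time",
    ],
})

# duplicated() flags rows that repeat an earlier row; summing counts them.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicates found:", n_duplicates)
print("Rows before/after:", len(toy), len(deduped))
```

Resetting the index afterwards is optional; it simply avoids gaps in the row labels once rows have been dropped.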

### Step 4: Finding Missing Values
##### Task 3: Find the missing values for all columns.


In [None]:
## Write your code here
# Boolean mask: True where a value is missing
missing_values = df.isnull()

# Count missing (True) and present (False) values in each column
for column in missing_values.columns:
    print(missing_values[column].value_counts())
    print('')

ResponseId
False    65437
Name: count, dtype: int64

MainBranch
False    65437
Name: count, dtype: int64

Age
False    65437
Name: count, dtype: int64

Employment
False    65437
Name: count, dtype: int64

RemoteWork
False    54806
True     10631
Name: count, dtype: int64

Check
False    65437
Name: count, dtype: int64

CodingActivities
False    54466
True     10971
Name: count, dtype: int64

EdLevel
False    60784
True      4653
Name: count, dtype: int64

LearnCode
False    60488
True      4949
Name: count, dtype: int64

LearnCodeOnline
False    49237
True     16200
Name: count, dtype: int64

TechDoc
False    40897
True     24540
Name: count, dtype: int64

YearsCode
False    59869
True      5568
Name: count, dtype: int64

YearsCodePro
False    51610
True     13827
Name: count, dtype: int64

DevType
False    59445
True      5992
Name: count, dtype: int64

OrgSize
False    47480
True     17957
Name: count, dtype: int64

PurchaseInfluence
False    47406
True     18031
Name: count, dtype: 
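
Printing `value_counts()` for every column produces long output. One possible alternative is a single table of missing counts and percentages, sorted so the worst columns come first. The sketch below uses a small made-up DataFrame (the `toy` values and the `summary` name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries.
toy = pd.DataFrame({
    "RemoteWork": ["Remote", np.nan, "Hybrid", np.nan],
    "Age": ["18-24 years old", "25-34 years old",
            "35-44 years old", "45-54 years old"],
    "YearsCode": [3.0, np.nan, 10.0, 7.0],
})

# isnull().sum() counts missing values per column;
# isnull().mean() gives the missing fraction, scaled here to a percentage.
missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100

summary = (
    pd.DataFrame({"missing": missing_count, "percent": missing_pct})
    .sort_values("missing", ascending=False)
)
print(summary)
```

The same two calls should work unchanged on the full survey DataFrame by replacing `toy` with `df`.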

##### Task 4: Find out how many rows are missing in the column RemoteWork.


In [None]:
## Write your code here
# missing_values is already a boolean mask, so summing it counts the True (missing) entries
print(missing_values['RemoteWork'].sum())

10631


### Step 5: Imputing Missing Values


In [None]:
## Write your code here
import numpy as np

# Impute numeric columns with the column mean and text columns with the most frequent value
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:
        mean_val = df[col].mean()
        df[col] = df[col].fillna(mean_val)
    elif df[col].dtype == 'object':
        if not df[col].mode().empty:
            mode_val = df[col].mode()[0]
            df[col] = df[col].fillna(mode_val)
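
The overview at the top of this notebook asks for the imputation to be verified. A minimal sketch of that check follows, on made-up data. It also swaps in the median for numeric columns, which is an assumption on my part rather than the approach used in the cell above; the median tends to be less affected by extreme values than the mean. The `toy` values and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with an extreme value,
# one text column, both with missing entries.
toy = pd.DataFrame({
    "CompTotal": [60000.0, np.nan, 110000.0, 250000.0, 1e150],
    "RemoteWork": ["Remote", np.nan, "Remote", "In-person", np.nan],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Median instead of mean: an assumption, chosen for robustness to outliers.
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        modes = imputed[col].mode()
        if not modes.empty:
            # Fill with the most frequent value.
            imputed[col] = imputed[col].fillna(modes[0])

# Verification: total number of missing values left in the whole frame.
remaining = int(imputed.isnull().sum().sum())
print("Missing values remaining:", remaining)
```

A result of zero indicates every gap was filled. A column that is entirely empty has no mode or median, so it would still show up in this count.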

##### Task 8: Check for any compensation-related columns and describe their distribution.



In [None]:
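## Write your code here
# Substring match on 'comp' or 'salary'; note that 'comp' also catches unrelated names such as 'AIComplex'
comp_cols = [col for col in df.columns if 'comp' in col.lower() or 'salary' in col.lower()]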

print("Compensation-related columns found:")
print(comp_cols)

# Describe each matched column: summary statistics if numeric, sample values otherwise
for col in comp_cols:
    if pd.api.types.is_numeric_dtype(df[col]):
        print(f"\nSummary for '{col}':")
        print(df[col].describe())
    else:
        print(f"\nColumn '{col}' is not numeric — sample values:")
        print(df[col].dropna().unique()[:5])

Compensation-related columns found:
['CompTotal', 'AIComplex', 'ConvertedCompYearly']

Summary for 'CompTotal':
count     3.374000e+04
mean     2.963841e+145
std      5.444117e+147
min       0.000000e+00
25%       6.000000e+04
50%       1.100000e+05
75%       2.500000e+05
max      1.000000e+150
Name: CompTotal, dtype: float64

Column 'AIComplex' is not numeric — sample values:
['Bad at handling complex tasks'
 'Good, but not great at handling complex tasks'
 'Neither good or bad at handling complex tasks'
 'Very well at handling complex tasks'
 'Very poor at handling complex tasks']

Summary for 'ConvertedCompYearly':
count    2.343500e+04
mean     8.615529e+04
std      1.867570e+05
min      1.000000e+00
25%      3.271200e+04
50%      6.500000e+04
75%      1.079715e+05
max      1.625660e+07
Name: ConvertedCompYearly, dtype: float64
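
The `CompTotal` summary above has a maximum of 1e150, which drags the mean far above the 75th percentile of 250,000. The sketch below illustrates that effect on a made-up series and shows one common way to set such values aside, the 1.5 × IQR rule. That rule is an assumption chosen for illustration, not something this notebook prescribes, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values: four plausible amounts plus one extreme entry.
comp = pd.Series([0.0, 60000.0, 110000.0, 250000.0, 1e150], name="CompTotal")

# The mean is dominated by the extreme entry; the median is not.
print("mean:  ", comp.mean())
print("median:", comp.median())

# 1.5 * IQR rule: treat values above Q3 + 1.5 * (Q3 - Q1) as outliers.
q1 = comp.quantile(0.25)
q3 = comp.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

trimmed = comp[comp <= upper]
print(trimmed.describe())
```

Describing the trimmed series gives statistics that reflect typical responses, which may be a more sensible basis for imputing compensation than the raw mean.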
