<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Impute Missing Values**


Estimated time needed: **30** minutes


In this lab, you will practice essential data wrangling techniques using the Stack Overflow survey dataset. The primary focus is on handling missing data and ensuring data quality. You will:

- **Load the Data:** Import the dataset into a DataFrame using the pandas library.

- **Clean the Data:** Identify and remove duplicate entries to maintain data integrity.

- **Handle Missing Values:** Detect missing values, impute them with appropriate strategies, and verify the imputation to create a complete and reliable dataset for analysis.

This lab equips you with the skills to effectively preprocess and clean real-world datasets, a crucial step in any data analysis project.


## Objectives


In this lab, you will perform the following:


-   Identify missing values in the dataset.

-   Apply techniques to impute missing values in the dataset.
  
-   Use suitable techniques to normalize data in the dataset.


-----


#### Install the Required Library


In [1]:
!pip install pandas



### Step 1: Import Required Libraries


In [18]:
import pandas as pd
import numpy as np

### Step 2: Load the Dataset Into a DataFrame


#### **Read Data**
<p>
The code below reads the survey CSV file from its URL directly into a pandas DataFrame:
</p>


In [3]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())

   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 
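
Beyond `head()`, a few quick checks help confirm that a file loaded as expected. The sketch below uses a small in-memory CSV as a stand-in for the survey file; the three rows and the name `sample` are hypothetical and only illustrate the checks.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the survey CSV.
csv_text = """ResponseId,MainBranch,RemoteWork
1,I am a developer by profession,Remote
2,I am a developer by profession,Remote
3,I am learning to code,
"""

# read_csv accepts a URL, a local path, or a file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # inferred type of each column
print(sample.isna().sum())  # empty fields are read as NaN
```

The same three calls can be run on `df` after loading the real dataset.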

### Step 3: Finding and Removing Duplicates
##### Task 1: Identify duplicate rows in the dataset.


In [4]:
## Write your code here
duplicate_rows = df[df.duplicated()]
num_duplicates = duplicate_rows.shape[0]

num_duplicates

0
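
Because the survey data contains no duplicates, the behavior of `duplicated()` is easier to see on a toy frame. The sketch below is illustrative only: the frame `toy` and its values are hypothetical, and it shows how the `keep` and `subset` parameters change what counts as a duplicate.

```python
import pandas as pd

# Hypothetical frame: rows 1 and 2 are identical.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Employment": [
        "Employed, full-time",
        "Student, full-time",
        "Student, full-time",
        "Employed, full-time",
    ],
})

# Default (keep="first"): flag each repeat after the first occurrence.
repeats = toy.duplicated()

# keep=False: flag every member of a duplicate group, including the first.
all_members = toy.duplicated(keep=False)

# subset: compare only the listed columns instead of whole rows.
same_employment = toy.duplicated(subset=["Employment"])

print(repeats.sum(), all_members.sum(), same_employment.sum())
```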

##### Task 2: Remove the duplicate rows from the dataframe.



In [5]:
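## Write your code here
# Task 1 found no duplicate rows, so this call leaves the DataFrame unchanged.
# It is kept so the step still works if the data changes.
df = df.drop_duplicates()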
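
Removing duplicates can be sketched on a small hypothetical frame, since the survey data has none to remove. The names `toy` and `deduped` are assumptions for illustration; the sketch also counts how many rows were dropped and confirms that none remain.

```python
import pandas as pd

# Hypothetical frame with one repeated row.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Age": ["18-24 years old", "35-44 years old",
            "35-44 years old", "45-54 years old"],
})

rows_before = len(toy)

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

rows_removed = rows_before - len(deduped)
print(f"Removed {rows_removed} duplicate row(s)")

# Verify that no duplicates remain.
print(deduped.duplicated().any())
```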

### Step 4: Finding Missing Values
##### Task 3: Find the missing values for all columns.


In [6]:
## Write your code here
missing_data = df.isnull()
#print(missing_data.head())
# For each column, count present (False) and missing (True) values
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")

ResponseId
ResponseId
False    65437
Name: count, dtype: int64

MainBranch
MainBranch
False    65437
Name: count, dtype: int64

Age
Age
False    65437
Name: count, dtype: int64

Employment
Employment
False    65437
Name: count, dtype: int64

RemoteWork
RemoteWork
False    54806
True     10631
Name: count, dtype: int64

Check
Check
False    65437
Name: count, dtype: int64

CodingActivities
CodingActivities
False    54466
True     10971
Name: count, dtype: int64

EdLevel
EdLevel
False    60784
True      4653
Name: count, dtype: int64

LearnCode
LearnCode
False    60488
True      4949
Name: count, dtype: int64

LearnCodeOnline
LearnCodeOnline
False    49237
True     16200
Name: count, dtype: int64

TechDoc
TechDoc
False    40897
True     24540
Name: count, dtype: int64

YearsCode
YearsCode
False    59869
True      5568
Name: count, dtype: int64

YearsCodePro
YearsCodePro
False    51610
True     13827
Name: count, dtype: int64

DevType
DevType
False    59445
True      5992
Name: count, dty
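
The per-column printout above gets long for a wide dataset. A more compact alternative, sketched here on a hypothetical frame (the values in `toy` are made up), is to sum the boolean mask and build a single summary table of missing counts and percentages.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "RemoteWork": ["Remote", np.nan, "In-person", np.nan],
    "EdLevel": ["Bachelor", "Master", np.nan, "Bachelor"],
})

mask = toy.isnull()

# Summing a boolean mask counts the True values in each column.
missing_count = mask.sum()

# The mean of a boolean mask is the fraction of True values.
missing_pct = (mask.mean() * 100).round(1)

summary = (
    pd.DataFrame({"missing": missing_count, "percent": missing_pct})
    .sort_values("missing", ascending=False)
)
print(summary)
```

Sorting by the missing count puts the columns that need the most attention at the top.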

##### Task 4: Find out how many values are missing in the RemoteWork column.


In [10]:
## Write your code here
df['RemoteWork'].isnull().value_counts()

RemoteWork
False    54806
True     10631
Name: count, dtype: int64
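
When only the number of missing values is needed, summing the mask returns it as a single number. The sketch below shows this on a hypothetical series, then works out the share of missing RemoteWork responses from the two counts printed above.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
remote = pd.Series(["Remote", np.nan, "In-person", np.nan], name="RemoteWork")

# True counts as 1 when summed, so this is the number of missing values.
n_missing_toy = int(remote.isna().sum())
print(n_missing_toy)

# Counts copied from the Task 4 output above.
n_present, n_missing = 54806, 10631
total = n_present + n_missing
share_missing = n_missing / total
print(total, f"{share_missing:.1%}")
```

With the counts above, roughly 16.2% of the 65437 responses have no RemoteWork value.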

### Step 5: Imputing Missing Values
##### Task 5: Find the value counts for the column RemoteWork.


In [12]:
## Write your code here
# Count each category (missing values are excluded), sorted by category name
count_RW = df['RemoteWork'].value_counts().sort_index()
count_RW

RemoteWork
Hybrid (some remote, some in-person)    23015
In-person                               10960
Remote                                  20831
Name: count, dtype: int64
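
By default, `value_counts()` leaves missing values out of the tally. The sketch below, on a hypothetical series, shows two optional parameters that are often useful before imputing: `dropna=False` to count missing values as their own category, and `normalize=True` to report shares instead of counts.

```python
import numpy as np
import pandas as pd

# Hypothetical responses, including one missing value.
remote = pd.Series(
    ["Remote", "Hybrid (some remote, some in-person)", "Remote",
     np.nan, "In-person"],
    name="RemoteWork",
)

# Default: missing values are excluded.
counts = remote.value_counts()

# dropna=False adds a row for NaN.
counts_with_nan = remote.value_counts(dropna=False)

# normalize=True returns the fraction of non-missing values per category.
shares = remote.value_counts(normalize=True)

print(counts_with_nan)
print(shares)
```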

##### Task 6: Identify the most frequent (majority) value in the RemoteWork column.



In [15]:
## Write your code here
print(count_RW.idxmax(),count_RW.max())

Hybrid (some remote, some in-person) 23015
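
An alternative way to get the majority value is `Series.mode()`, which ignores missing values by default. The sketch below compares it with the `idxmax()` approach on a hypothetical series; note that `mode()` returns a Series because several values can tie for most frequent.

```python
import numpy as np
import pandas as pd

# Hypothetical responses with a clear majority.
s = pd.Series(["Remote", "Hybrid", "Hybrid", np.nan, "In-person"])

# mode() returns all most-frequent values; take the first one.
most_frequent = s.mode()[0]

# Same answer via value_counts, as in the cell above.
via_counts = s.value_counts().idxmax()
print(most_frequent, via_counts)

# With a tie, mode() returns every tied value.
tied = pd.Series(["b", "a", "a", "b"])
print(tied.mode().tolist())
```

When a tie occurs, picking `mode()[0]` is an arbitrary choice, so it is worth checking how many values `mode()` returns before imputing.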


##### Task 7: Impute (replace) all missing values in the RemoteWork column with the majority value.



In [19]:
## Write your code here
most_RW = count_RW.idxmax()
# Assign the filled column back to the DataFrame. Calling
# replace(..., inplace=True) on df['RemoteWork'] acts on a copy and
# raises a FutureWarning in recent pandas versions.
df['RemoteWork'] = df['RemoteWork'].fillna(most_RW)

# Verify that no missing values remain
df['RemoteWork'].isnull().value_counts()


RemoteWork
False    65437
Name: count, dtype: int64
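
Mode imputation can be wrapped in a small reusable helper. The function `impute_with_mode` below is a hypothetical name, not part of pandas; this is a minimal sketch that assumes the column has at least one non-missing value, and it returns a new frame rather than modifying its input.

```python
import numpy as np
import pandas as pd

def impute_with_mode(frame, column):
    """Return a copy of frame with NaN in column replaced by the mode."""
    most_frequent = frame[column].mode()[0]  # assumes a non-empty mode
    result = frame.copy()
    result[column] = result[column].fillna(most_frequent)
    return result

# Hypothetical frame with two missing RemoteWork values.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4, 5],
    "RemoteWork": ["Remote", np.nan, "Remote", "In-person", np.nan],
})

filled = impute_with_mode(toy, "RemoteWork")

# Verify: no missing values remain and the input is unchanged.
print(filled["RemoteWork"].isna().sum())
print(toy["RemoteWork"].isna().sum())
```

Returning a copy keeps the original data available, which makes it easy to compare distributions before and after imputation.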

##### Task 8: Check for any compensation-related columns and describe their distribution.



In [20]:
## Write your code here
# Count respondents per industry (missing values excluded), sorted by name
count_Industry = df['Industry'].value_counts().sort_index()
count_Industry

Industry
Banking/Financial Services                     1371
Computer Systems Design and Services            844
Energy                                          578
Fintech                                        1641
Government                                      962
Healthcare                                     1277
Higher Education                                890
Insurance                                       389
Internet, Telecomm or Information Services     1629
Manufacturing                                  1265
Media & Advertising Services                    894
Other:                                         3077
Retail and Consumer Services                   1264
Software Development                          11918
Transportation, or Supply Chain                 859
Name: count, dtype: int64
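
The cell above tabulates Industry, which is not a compensation column. One way to look for compensation-related columns is to search the column names for a pattern and then describe the matches. The sketch below runs on a hypothetical frame: the column name `CompTotal` and all values are assumptions for illustration, and matching on "comp" is only a heuristic that may need adjusting for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "CompTotal" is an assumed column name.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4, 5],
    "Industry": ["Fintech", "Energy", "Fintech", "Healthcare", "Energy"],
    "CompTotal": [50000, np.nan, 72000, 65000, 1_000_000],
})

# Heuristic: keep columns whose name contains "comp" (case-insensitive).
comp_cols = [c for c in toy.columns if "comp" in c.lower()]
print(comp_cols)

# describe() reports count, mean, std, min, quartiles, and max,
# ignoring missing values.
stats = toy[comp_cols].describe()
print(stats)

# Missing values per compensation column.
missing = toy[comp_cols].isna().sum()
print(missing)
```

Comparing the median (the `50%` row) with the mean is a quick way to spot skew from very large values, which matters when choosing between mean and median imputation for a numeric column.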

### Summary 


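**In this lab, you focused on imputing missing values in the dataset.**

- Use the <code>pandas.read_csv()</code> function to load a dataset from a CSV file or URL into a DataFrame.

- If the dataset cannot be read directly from its URL, download it and specify the correct local file path.

- Use <code>duplicated()</code> to check the DataFrame for duplicate rows.

- Use <code>isnull()</code> with <code>value_counts()</code> to find the missing values in each column.

- Replace missing values in a categorical column such as RemoteWork with its most frequent value, then verify that none remain.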



<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.3|Madhusudhan Moole|Updated lab|
|2024-10-29|1.2|Madhusudhan Moole|Updated lab|
|2024-09-27|1.1|Madhusudhan Moole|Updated lab|
|2024-09-26|1.0|Raghul Ramesh|Created lab|
-->


Copyright © IBM Corporation. All rights reserved.
