<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Data Wrangling Lab**


Estimated time needed: **45** minutes


In this lab, you will perform data wrangling tasks to prepare raw data for analysis. Data wrangling involves cleaning, transforming, and organizing data into a structured format suitable for analysis. This lab covers identifying inconsistencies, encoding categorical variables, imputing missing values, and scaling and transforming features.


## Objectives


After completing this lab, you will be able to:


- Identify and remove inconsistent data entries.

- Encode categorical variables for analysis.

- Handle missing values using multiple imputation strategies.

- Apply feature scaling and transformation techniques.


#### Install the required libraries


In [3]:
!pip install pandas
!pip install matplotlib
!pip install numpy



## Tasks


#### Step 1: Import the necessary module.


### 1. Load the Dataset


<h5>1.1 Import necessary libraries and load the dataset.</h5>


Ensure the dataset is loaded correctly by displaying the first few rows.


In [4]:
# Import necessary libraries
import pandas as pd
import numpy as np
# Load the Stack Overflow survey data
dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(dataset_url)

# Display the first few rows
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

### 2. Explore the Dataset


<h5>2.1 Summarize the dataset by displaying the column data types, counts, and missing values.</h5>


In [5]:
summary = pd.DataFrame({
    "dtype": df.dtypes,
    "non_null": df.notna().sum(),
    "missing": df.isna().sum(),
    "missing_%": (df.isna().mean()*100).round(1)
}).sort_values("missing_%", ascending=False)

print(df.shape)  # (rows, cols)
summary.head(25)  # show top missing columns

(65437, 114)


| | dtype | non_null | missing | missing_% |
|---|---|---|---|---|
| AINextMuch less integrated | object | 1148 | 64289 | 98.2 |
| AINextLess integrated | object | 2355 | 63082 | 96.4 |
| AINextNo change | object | 12498 | 52939 | 80.9 |
| AINextMuch more integrated | object | 13438 | 51999 | 79.5 |
| EmbeddedAdmired | object | 16733 | 48704 | 74.4 |
| EmbeddedWantToWorkWith | object | 17600 | 47837 | 73.1 |
| EmbeddedHaveWorkedWith | object | 22214 | 43223 | 66.1 |
| ConvertedCompYearly | float64 | 23435 | 42002 | 64.2 |
| AIToolNot interested in Using | object | 24414 | 41023 | 62.7 |
| AINextMore integrated | object | 24428 | 41009 | 62.7 |


<h5>2.2 Generate basic statistics for numerical columns.</h5>


In [6]:
# describe() has no numeric_only argument; on a mixed-type DataFrame
# it already summarizes only the numeric columns by default
df.describe().T

### 3. Identifying and Removing Inconsistencies


<h5>3.1 Identify inconsistent or irrelevant entries in specific columns (e.g., Country).</h5>


In [None]:
col = "Country"
# normalize trivial formatting issues
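# astype("string") keeps missing values as <NA>; astype(str) would turn them into the text "nan"
df[col+"_norm"] = (
    df[col].astype("string").str.strip().str.replace(r"\s+", " ", regex=True)
)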

# inspect top values to spot variants like "US", "USA", "United States of America"
df[col+"_norm"].value_counts().head(30)
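
Scanning `value_counts()` by eye can miss spellings that differ only in case or punctuation. One possible way to surface them, sketched here on made-up values rather than the survey data, is to group the raw entries by a simplified comparison key and list every key that has more than one spelling. The sample values and the variable names `key`, `variants`, and `suspects` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample values, not taken from the survey data
countries = pd.Series(["USA", "usa", " U.S.A. ", "Germany", "germany", "France", None])

# Comparison key: trim whitespace, lowercase, and drop periods
key = countries.str.strip().str.lower().str.replace(".", "", regex=False)

# For each key, collect the distinct raw spellings that share it
# (rows whose key is missing are left out of the groups)
variants = countries.groupby(key).unique()

# Keys with more than one raw spelling are candidates for standardization
suspects = variants[variants.map(len) > 1]
print(suspects)
```

Any key that appears in `suspects` points to a group of raw values that probably needs an entry in a mapping dictionary.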

<h5>3.2 Standardize entries in columns like Country or EdLevel by mapping inconsistent values to a consistent format.</h5>


In [None]:
country_map = {
    "US": "United States",
    "USA": "United States",
    "United States of America": "United States",
    "U.S.": "United States",
    "UK": "United Kingdom",
    "U.K.": "United Kingdom",
    "England": "United Kingdom",
    "Scotland": "United Kingdom",
    "Wales": "United Kingdom",
    "Viet Nam": "Vietnam",
    "Russian Federation": "Russia",
    "Czech Republic": "Czechia",
}
df["Country_std"] = df["Country_norm"].replace(country_map)
df[["Country","Country_norm","Country_std"]].head()
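
The mapping above relies on `Series.replace`, which leaves values that are not in the dictionary unchanged. `Series.map` with the same dictionary would turn every unmapped value into `NaN`. The small sketch below, using a hypothetical subset of the mapping and made-up values, illustrates the difference and counts how many entries were actually changed.

```python
import pandas as pd

raw = pd.Series(["USA", "Germany", "UK", "United States"])
demo_map = {"USA": "United States", "UK": "United Kingdom"}  # hypothetical subset

replaced = raw.replace(demo_map)  # values missing from the dict pass through unchanged
mapped = raw.map(demo_map)        # values missing from the dict become NaN

changed = (raw != replaced).sum()  # number of entries the mapping rewrote
print(replaced.tolist())
print(mapped.tolist())
print("changed:", changed)
```

Checking `changed` after a standardization step is a quick sanity test that the dictionary keys match the spellings present in the data.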

### 4. Encoding Categorical Variables


<h5>4.1 Encode the Employment column using one-hot encoding.</h5>


In [None]:
emp_dummies = pd.get_dummies(df["Employment"], prefix="Employment", dummy_na=True)
df = pd.concat([df, emp_dummies], axis=1)
df.filter(like="Employment_").head()
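
The preview in section 1 shows that `CodingActivities` stores several answers in one cell, separated by semicolons. If `Employment` follows the same pattern (an assumption, not verified here), `pd.get_dummies` would treat each distinct combination as its own category. A possible alternative is `Series.str.get_dummies(sep=";")`, which creates one indicator column per individual answer. The sample values below are invented for illustration.

```python
import pandas as pd

# Hypothetical multi-select answers separated by semicolons
emp = pd.Series([
    "Employed, full-time",
    "Student, full-time;Employed, part-time",
    None,
])

# One 0/1 column per individual answer; a missing row becomes all zeros
multi_hot = emp.str.get_dummies(sep=";").add_prefix("Employment_")
print(multi_hot)
```

Unlike `dummy_na=True`, this approach does not add a separate column for missing values, so a missing row is indistinguishable from a row with no selected options unless you add your own indicator.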

### 5. Handling Missing Values


<h5>5.1 Identify columns with the highest number of missing values.</h5>


In [None]:
df.isna().sum().sort_values(ascending=False).head(15)

<h5>5.2 Impute missing values in numerical columns (e.g., `ConvertedCompYearly`) with the mean or median.</h5>


In [None]:
col = "ConvertedCompYearly"
df[col] = pd.to_numeric(df[col], errors="coerce")
med = df[col].median()
df[col] = df[col].fillna(med)
print(f"{col} imputed with median={med:.0f}. Remaining NaNs:", df[col].isna().sum())
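
The cell above uses the median rather than the mean. A toy example with one very large value, not drawn from the survey, suggests why: the mean is pulled toward the outlier, while the median stays close to the typical values.

```python
import numpy as np
import pandas as pd

# Made-up compensation values with one extreme outlier and one missing entry
pay = pd.Series([40_000, 50_000, 60_000, np.nan, 1_000_000])

mean_fill = pay.fillna(pay.mean())      # fills with 287500.0
median_fill = pay.fillna(pay.median())  # fills with 55000.0

print("mean fill:  ", mean_fill[3])
print("median fill:", median_fill[3])
```

For right-skewed columns such as compensation, the median is usually the safer default. The mean may be reasonable when the distribution is roughly symmetric.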

<h5>5.3 Impute missing values in categorical columns (e.g., `RemoteWork`) with the most frequent value.</h5>


In [None]:
col = "RemoteWork"
mode_val = df[col].mode(dropna=True).iloc[0]
df[col] = df[col].fillna(mode_val)
print(f"Missing in {col} after impute:", df[col].isna().sum())
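
One detail worth knowing about the cell above: `Series.mode` returns every value tied for most frequent, in sorted order, so `.iloc[0]` picks the first tied value alphabetically. The sketch below uses invented values to show that behavior.

```python
import pandas as pd

# Made-up answers where two categories are tied for most frequent
remote = pd.Series(["Remote", "Hybrid", "Remote", "Hybrid", None])

modes = remote.mode(dropna=True)       # both tied values, sorted
filled = remote.fillna(modes.iloc[0])  # the first tied value is used

print(modes.tolist())
print(filled.tolist())
```

If ties matter for your analysis, consider checking `len(modes)` before imputing.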

### 6. Feature Scaling and Transformation


<h5>6.1 Apply Min-Max Scaling to normalize the `ConvertedCompYearly` column.</h5>


In [None]:
col = "ConvertedCompYearly"
df[col] = pd.to_numeric(df[col], errors="coerce")

# fill NaNs first (median is common)
df[col] = df[col].fillna(df[col].median())

min_v, max_v = df[col].min(), df[col].max()
df["ConvertedCompYearly_MinMax"] = (df[col] - min_v) / (max_v - min_v)
df[["ConvertedCompYearly", "ConvertedCompYearly_MinMax"]].head()
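
Min-Max scaling computes `(x - min) / (max - min)`, which divides by zero when every value in the column is identical. A minimal sketch of a reusable helper with a guard for that case follows. The function name `min_max_scale` and the choice to return zeros for a constant column are assumptions made for illustration.

```python
import pandas as pd

def min_max_scale(s: pd.Series) -> pd.Series:
    """Scale values to the [0, 1] range (hypothetical helper)."""
    lo, hi = s.min(), s.max()
    if hi == lo:
        # Constant column: avoid dividing by zero and return all zeros
        return pd.Series(0.0, index=s.index)
    return (s - lo) / (hi - lo)

scaled = min_max_scale(pd.Series([10.0, 20.0, 40.0]))
constant = min_max_scale(pd.Series([5.0, 5.0]))
print(scaled.tolist())
print(constant.tolist())
```

The smallest value maps to 0 and the largest to 1, so a single extreme outlier can compress all other values toward 0.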

<h5>6.2 Log-transform the ConvertedCompYearly column to reduce skewness.</h5>


In [None]:
df["ConvertedCompYearly_Log"] = np.log1p(df["ConvertedCompYearly"])  # log(1+x) handles 0 safely
df[["ConvertedCompYearly", "ConvertedCompYearly_Log"]].head()
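
To see the effect of the log transform, you can compare skewness before and after. The sketch below uses synthetic values that double at each step, not survey data. `np.expm1` is the inverse of `np.log1p` and recovers the original scale.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values: each one is double the previous
pay = pd.Series([10_000, 20_000, 40_000, 80_000, 160_000, 320_000, 640_000], dtype=float)

logged = np.log1p(pay)       # log(1 + x)
restored = np.expm1(logged)  # inverse transform: exp(x) - 1

print("skew before:", pay.skew())
print("skew after: ", logged.skew())
```

On this sample the skewness drops from well above 1 to approximately 0, because the logged values are almost evenly spaced.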

### 7. Feature Engineering


<h5>7.1 Create a new column `ExperienceLevel` based on the `YearsCodePro` column:</h5>


In [8]:
import re
def years_to_num(x):
    if pd.isna(x): return np.nan
    s = str(x).strip()
    if s.lower().startswith("less than"):  # "Less than 1 year"
        return 0.5
    if s.lower().startswith("more than"):  # "More than 50 years"
        m = re.search(r"\d+", s)
        return float(m.group()) if m else np.nan
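    try:
        return float(s)  # plain numbers as strings
    except ValueError:  # catch only failed conversions, not every exception
        return np.nan

# index the column directly; df.get() returns None if the column is missing
df["YearsCodePro_num"] = df["YearsCodePro"].apply(years_to_num)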

bins   = [-np.inf, 2, 5, 10, 20, np.inf]
labels = ["Beginner", "Junior", "Mid", "Senior", "Expert"]
df["ExperienceLevel"] = pd.cut(df["YearsCodePro_num"], bins=bins, labels=labels)

df[["YearsCodePro", "YearsCodePro_num", "ExperienceLevel"]].head()

| | YearsCodePro | YearsCodePro_num | ExperienceLevel |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | 17.0 | 17.0 | Senior |
| 2 | 27.0 | 27.0 | Expert |
| 3 | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN |
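
By default `pd.cut` uses right-inclusive intervals, so with the bins above a respondent with exactly 2 years falls in `Beginner` (the interval up to and including 2) and one with exactly 20 years falls in `Senior`. The sketch below checks the boundary behavior on a few hand-picked values and shows how `right=False` shifts the edges.

```python
import numpy as np
import pandas as pd

years = pd.Series([0.5, 2, 2.5, 5, 10, 20, 21, np.nan])
bins = [-np.inf, 2, 5, 10, 20, np.inf]
labels = ["Beginner", "Junior", "Mid", "Senior", "Expert"]

# Default: intervals are closed on the right, e.g. (2, 5]
levels = pd.cut(years, bins=bins, labels=labels)

# right=False: intervals are closed on the left, e.g. [2, 5)
levels_left = pd.cut(years, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"years": years, "default": levels, "right=False": levels_left}))
```

Missing values stay missing in both cases, which matches the `NaN` rows in the output above.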


### Summary


In this lab, you:

- Explored the dataset to identify inconsistencies and missing values.

- Encoded categorical variables for analysis.

- Handled missing values using imputation techniques.

- Normalized and transformed numerical data to prepare it for analysis.

- Engineered a new feature to enhance data interpretation.


Copyright © IBM Corporation. All rights reserved.
