<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding How The Data Is Distributed**


Estimated time needed: **30** minutes


In this lab, you will work with a cleaned dataset to perform Exploratory Data Analysis (EDA). You will examine the structure of the data, visualize key variables, and analyze trends related to developer experience, tools, job satisfaction, and other important aspects.


## Objectives


In this lab you will perform the following:


- Understand the structure of the dataset.

- Perform summary statistics and data visualization.

- Identify trends in developer experience, tools, job satisfaction, and other key variables.


### Install the required libraries


In [1]:
!pip install pandas
!pip install matplotlib
!pip install seaborn




### Step 1: Import Libraries and Load Data


- Import the `pandas`, `matplotlib.pyplot`, and `seaborn` libraries.


- Begin by loading the dataset. If you are working in JupyterLite, you can use the `pyfetch` method; otherwise, you can call pandas' `read_csv()` function directly on your local machine or in a cloud environment.


In [2]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Stack Overflow survey dataset
data_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv'
df = pd.read_csv(data_url)

# Display the first few rows of the dataset
df.head()


Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,
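
In JupyterLite, `pd.read_csv()` may not be able to fetch a URL directly, so the usual pattern is to download the file with `pyfetch` and hand the bytes to pandas as an in-memory buffer. The sketch below shows only that second half. The helper name `frame_from_bytes` and the rows in `sample_bytes` are made up for illustration, and the `pyfetch` calls appear as comments because they run only inside a Pyodide kernel.

```python
import io

import pandas as pd


def frame_from_bytes(raw: bytes) -> pd.DataFrame:
    """Parse CSV content that is already in memory (hypothetical helper)."""
    return pd.read_csv(io.BytesIO(raw))


# In a JupyterLite (Pyodide) notebook the bytes would come from something like:
#     from pyodide.http import pyfetch
#     response = await pyfetch(data_url)
#     raw = await response.bytes()
# Here a small hand-made sample stands in for the downloaded file.
sample_bytes = (
    b"ResponseId,MainBranch,Age\n"
    b"1,I am a developer by profession,Under 18 years old\n"
    b"2,I am learning to code,18-24 years old\n"
)

sample_df = frame_from_bytes(sample_bytes)
print(sample_df.shape)
```

Outside JupyterLite none of this is needed, since `pd.read_csv(data_url)` accepts the URL as-is.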


### Step 2: Examine the Structure of the Data


- Display the column names, data types, and summary information to understand the data structure.

- Objective: Gain insights into the dataset's shape and available variables.


In [None]:
print("Shape:", df.shape)
df.info()

# summary stats (numeric + categorical)
display(df.describe(numeric_only=True).T)
display(df.describe(include="object").T.head(20))
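
`df.info()` prints its report rather than returning it, so the result cannot be sorted or filtered. One possible alternative, sketched here on a toy frame, is to collect the same facts into a DataFrame. The helper name `structure_summary` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def structure_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, non-null count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Toy data standing in for the survey DataFrame
toy = pd.DataFrame({
    "Employment": ["Employed, full-time", "Student, full-time", None],
    "JobSat": [8.0, None, None],
})

summary = structure_summary(toy)
print(summary)
```

Applied to the lab's data, `structure_summary(df).sort_values("non_null")` would list the sparsest columns first.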

### Step 3: Handle Missing Data


- Identify missing values in the dataset.

- Impute or remove missing values as necessary to ensure data completeness.



In [None]:
# 1) count missing values in the key columns
cols = ["Employment", "JobSat", "YearsCodePro"]
print(df[cols].isna().sum().sort_values(ascending=False))

# 2) simple imputation:
#    - categorical → mode (most frequent)
for c in ["Employment", "JobSat"]:
    if c in df.columns and df[c].isna().any():
        df[c] = df[c].fillna(df[c].mode(dropna=True).iloc[0])

#    - YearsCodePro → numeric then median
import re
import numpy as np

def years_to_num(x):
    """Convert a YearsCodePro answer to a float (NaN if it cannot be parsed)."""
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    if s.startswith("less than"):    # e.g. "Less than 1 year"
        return 0.5
    if s.startswith("more than"):    # e.g. "More than 50 years"
        m = re.search(r"\d+", s)
        return float(m.group()) if m else np.nan
    try:
        return float(s)
    except ValueError:               # any other non-numeric text
        return np.nan

if "YearsCodePro" in df.columns:
    df["YearsCodePro_num"] = df["YearsCodePro"].apply(years_to_num)
    df["YearsCodePro_num"] = df["YearsCodePro_num"].fillna(df["YearsCodePro_num"].median())

# verify
print(df[["Employment","JobSat"]].isna().sum())
if "YearsCodePro_num" in df.columns:
    print("YearsCodePro_num missing:", df["YearsCodePro_num"].isna().sum())
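
Whether to impute or drop often depends on how much of a column is missing, so it can help to look at percentages as well as raw counts. The following is a minimal sketch on toy data; the helper name `missing_report` and the toy values are illustrative assumptions. It then repeats the mode and median imputation used above so the effect is easy to verify by eye.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    report = pd.DataFrame({
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)


# Toy data standing in for the survey DataFrame
toy = pd.DataFrame({
    "Employment": ["Employed, full-time", None, "Student, full-time", "Employed, full-time"],
    "YearsCodePro_num": [2.0, np.nan, np.nan, 10.0],
})

report = missing_report(toy)
print(report)

# Mode for the categorical column, median for the numeric one
filled = toy.copy()
filled["Employment"] = filled["Employment"].fillna(filled["Employment"].mode().iloc[0])
filled["YearsCodePro_num"] = filled["YearsCodePro_num"].fillna(filled["YearsCodePro_num"].median())
print(filled)
```

A column that is mostly empty gains little from imputation, because the filled values would dominate its distribution; dropping it or analysing only the answered rows may be the better choice.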

### Step 4: Analyze Key Columns


- Examine key columns such as `Employment`, `JobSat` (Job Satisfaction), and `YearsCodePro` (Professional Coding Experience).

- **Instruction**: Calculate the value counts for each column to understand the distribution of responses.



In [None]:
for c in ["Employment", "JobSat", "YearsCodePro"]:
    if c in df.columns:
        print(f"\nValue counts for {c}:")
        display(df[c].value_counts(dropna=False).head(20))
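
Raw value counts for a column like `YearsCodePro` produce one row per distinct number of years, which is hard to read as a distribution. A possible refinement, sketched below on toy values, is to group the numeric years into bands with `pd.cut` and report percentages. The band edges and labels are illustrative choices, not part of the lab.

```python
import pandas as pd

# Toy values standing in for the numeric YearsCodePro_num column
years = pd.Series([0.5, 1, 3, 4, 8, 12, 25, 30], name="YearsCodePro_num")

# Illustrative band edges; intervals are closed on the right, e.g. (2, 5]
bins = [0, 2, 5, 10, 20, 60]
labels = ["0-2", "3-5", "6-10", "11-20", "21+"]
bands = pd.cut(years, bins=bins, labels=labels, include_lowest=True)

# normalize=True returns proportions; sort=False avoids reordering by frequency
share = bands.value_counts(normalize=True, sort=False) * 100
print(share.round(1))
```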

### Step 5: Visualize Job Satisfaction (Focus on JobSat)


- Create a pie chart or KDE plot to visualize the distribution of `JobSat`.

- Provide an interpretation of the plot, highlighting key trends in job satisfaction.


In [None]:
counts = df["JobSat"].value_counts()
plt.figure(figsize=(6,6))
counts.plot(kind="pie", autopct="%1.1f%%")
plt.title("Job Satisfaction (JobSat) distribution")
plt.ylabel("")
plt.tight_layout(); plt.show()
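
The step also mentions a KDE plot, which only makes sense when `JobSat` holds numeric scores. As a rough sketch of what a kernel density estimate computes, the block below builds a Gaussian KDE by hand with NumPy on toy scores. The scores and the rule-of-thumb bandwidth are assumptions for illustration; in the lab, `sns.kdeplot` would normally do this work for you.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy satisfaction scores standing in for a numeric JobSat column
scores = np.array([2, 4, 5, 6, 7, 7, 8, 8, 8, 9, 9, 10], dtype=float)

# Normal-reference rule-of-thumb bandwidth: 1.06 * sigma * n^(-1/5)
bandwidth = 1.06 * scores.std(ddof=1) * len(scores) ** (-1 / 5)

# The estimate is the average of Gaussian bumps centred on each observation
grid = np.linspace(scores.min() - 4 * bandwidth, scores.max() + 4 * bandwidth, 400)
z = (grid[:, None] - scores[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(scores) * bandwidth * np.sqrt(2 * np.pi))

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(grid, density)
ax.set_title("KDE of job satisfaction (toy data)")
ax.set_xlabel("JobSat")
ax.set_ylabel("Density")
fig.tight_layout()
plt.show()
```

Note that the curve extends beyond the smallest and largest observed scores; that is a property of the smoothing, not evidence of answers outside the scale.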

### Step 6: Programming Languages Analysis


- Compare the frequency of programming languages in `LanguageHaveWorkedWith` and `LanguageWantToWorkWith`.
  
- Visualize the overlap or differences using a Venn diagram or a grouped bar chart.


In [None]:
sep = ";"  # multi-select answers in the survey are semicolon-separated

def split_explode(s):
    return (s.dropna()
             .str.split(sep)
             .explode()
             .str.strip()
             .replace("", pd.NA)
             .dropna())

have = split_explode(df["LanguageHaveWorkedWith"]).value_counts().rename("Have").to_frame()
want = split_explode(df["LanguageWantToWorkWith"]).value_counts().rename("Want").to_frame()

lang_counts = have.join(want, how="outer").fillna(0).astype(int)
top = (lang_counts.assign(Total=lambda x: x["Have"]+x["Want"])
                 .sort_values("Total", ascending=False)
                 .head(12)
                 .drop(columns="Total"))

print(top.head())  # quick look

# grouped bar
plot_df = top.stack().rename_axis(index=["Language","Type"]).reset_index(name="Count")
plt.figure(figsize=(10,6))
sns.barplot(data=plot_df, x="Count", y="Language", hue="Type")
plt.title("Languages: Have Worked With vs Want To Work With (Top)")
plt.xlabel("Count"); plt.ylabel("")
plt.tight_layout(); plt.show()
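
A grouped bar chart compares totals, but it does not show how many respondents both use a language and want to keep using it. The numbers behind a Venn diagram can be computed per respondent with Python sets, as in this minimal sketch. The toy answers and the helper name `to_sets` are illustrative assumptions.

```python
import pandas as pd

# Toy answers standing in for the two survey columns
toy = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python;SQL", "Python;JavaScript", "SQL", None],
    "LanguageWantToWorkWith": ["Python;Rust", "Rust", "SQL;Python", "Go"],
})


def to_sets(column: pd.Series) -> pd.Series:
    """Turn 'A;B;C' strings into Python sets; missing answers become empty sets."""
    return column.fillna("").apply(lambda s: {p.strip() for p in s.split(";") if p.strip()})


have_sets = to_sets(toy["LanguageHaveWorkedWith"])
want_sets = to_sets(toy["LanguageWantToWorkWith"])

rows = []
for lang in sorted(set().union(*have_sets, *want_sets)):
    has = have_sets.apply(lambda s: lang in s)
    wants = want_sets.apply(lambda s: lang in s)
    rows.append({
        "Language": lang,
        "HaveOnly": int((has & ~wants).sum()),   # used it, does not want to continue
        "Both": int((has & wants).sum()),        # used it and wants to continue
        "WantOnly": int((~has & wants).sum()),   # has not used it but wants to
    })

overlap = pd.DataFrame(rows).set_index("Language")
print(overlap)
```

On the full survey this loop runs once per language, so it is best applied to a short list such as the top 12 languages selected above.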

### Step 7: Analyze Remote Work Trends


- Visualize the distribution of RemoteWork by region using a grouped bar chart or heatmap.


In [None]:
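# Keep the 15 countries with the most respondents so the heatmap stays readable
top_countries = df["Country"].value_counts().head(15).index
subset = df[df["Country"].isin(top_countries)]
ct = pd.crosstab(subset["Country"], subset["RemoteWork"], normalize="index")*100

plt.figure(figsize=(11,7))
sns.heatmap(ct.round(1), cmap="Blues", annot=True, fmt=".1f")
plt.title("Remote Work by Country (%, top 15 countries by respondents)")
plt.xlabel("RemoteWork"); plt.ylabel("Country")
plt.tight_layout(); plt.show()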
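
The `normalize` argument of `pd.crosstab` decides what each percentage is a share of, which changes how the heatmap should be read. The sketch below contrasts the options on toy data; the country names and answers are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the Country and RemoteWork columns
toy = pd.DataFrame({
    "Country": ["India", "India", "India", "Germany", "Germany", "Brazil"],
    "RemoteWork": ["Remote", "In-person", "Remote", "Remote", "In-person", "Remote"],
})

counts = pd.crosstab(toy["Country"], toy["RemoteWork"])

# normalize="index": each row sums to 100 (share of work styles within a country)
row_pct = pd.crosstab(toy["Country"], toy["RemoteWork"], normalize="index") * 100

# normalize="columns": each column sums to 100 (share of countries within a work style)
col_pct = pd.crosstab(toy["Country"], toy["RemoteWork"], normalize="columns") * 100

print(counts)
print(row_pct.round(1))
print(col_pct.round(1))
```

Row percentages are used in this step because they make countries with very different respondent counts comparable.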

### Step 8: Correlation between Job Satisfaction and Experience


- Analyze the correlation between overall job satisfaction (`JobSat`) and `YearsCodePro`.
  
- Calculate the Pearson or Spearman correlation coefficient.


In [None]:
# Reuse years_to_num() from Step 3 and recompute the numeric column from the raw
# answers (this overwrites the median-imputed values created in Step 3)
df["YearsCodePro_num"] = df["YearsCodePro"].apply(years_to_num)

# JobSat → numeric: use the column as-is if it is already numeric,
# otherwise map the text labels onto a 1–5 scale
sat_map = {
    "Very dissatisfied": 1,
    "Slightly dissatisfied": 2,
    "Neither satisfied nor dissatisfied": 3,
    "Slightly satisfied": 4,
    "Very satisfied": 5
}
if pd.api.types.is_numeric_dtype(df["JobSat"]):
    df["JobSat_num"] = df["JobSat"]
else:
    df["JobSat_num"] = df["JobSat"].map(sat_map)

# scatter + line
data = df.dropna(subset=["YearsCodePro_num","JobSat_num"])
plt.figure(figsize=(7,5))
sns.scatterplot(data=data, x="YearsCodePro_num", y="JobSat_num", alpha=0.35)
sns.regplot(data=data, x="YearsCodePro_num", y="JobSat_num", scatter=False, color="red")
plt.title("Experience vs Job Satisfaction")
plt.xlabel("Years of professional coding"); plt.ylabel("Job satisfaction")
plt.tight_layout(); plt.show()

# Pearson & Spearman
pearson = data[["YearsCodePro_num","JobSat_num"]].corr(method="pearson").iloc[0,1]
spearman = data[["YearsCodePro_num","JobSat_num"]].corr(method="spearman").iloc[0,1]
print(f"Pearson r = {pearson:.3f}  |  Spearman ρ = {spearman:.3f}")
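
Pearson's r measures linear association, while Spearman's ρ measures monotonic association and is usually the safer choice for an ordinal variable such as a satisfaction score. A small sketch on toy numbers (made up for illustration) shows how the two relate: Spearman's ρ is Pearson's r computed on the ranks of the data.

```python
import pandas as pd

# Toy data standing in for the two numeric columns
toy = pd.DataFrame({
    "YearsCodePro_num": [0.5, 2, 3, 5, 8, 12, 20, 30],
    "JobSat_num": [3, 2, 4, 4, 3, 5, 4, 5],
})

pearson = toy.corr(method="pearson").loc["YearsCodePro_num", "JobSat_num"]
spearman = toy.corr(method="spearman").loc["YearsCodePro_num", "JobSat_num"]

# Replace each value by its rank (ties share the average rank), then apply Pearson
ranks = toy.rank(method="average")
spearman_by_hand = ranks.corr(method="pearson").loc["YearsCodePro_num", "JobSat_num"]

print(f"Pearson r = {pearson:.3f} | Spearman rho = {spearman:.3f} | via ranks = {spearman_by_hand:.3f}")
```

Because ranks ignore how far apart values are, a few respondents with very long careers influence Spearman's ρ less than they influence Pearson's r.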

### Step 9: Cross-tabulation Analysis (Employment vs. Education Level)


- Analyze the relationship between employment status (`Employment`) and education level (`EdLevel`).

- **Instruction**: Create a cross-tabulation using `pd.crosstab()` and visualize it with a stacked bar plot if possible.


In [None]:
ct = pd.crosstab(df["EdLevel"], df["Employment"], normalize="index")*100

plt.figure(figsize=(11,7))
sns.heatmap(ct.round(1), cmap="Greens", annot=True, fmt=".1f")
plt.title("Employment type by Education Level (%)")
plt.xlabel("Employment"); plt.ylabel("EdLevel")
plt.tight_layout(); plt.show()
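
The instruction above asks for a stacked bar plot, whereas the cell draws a heatmap. The same cross-tabulation can be passed to pandas' plotting interface to get the stacked version, as in this minimal sketch on toy data. The category labels are shortened and made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for the EdLevel and Employment columns
toy = pd.DataFrame({
    "EdLevel": ["Bachelor's", "Bachelor's", "Master's", "Master's", "Secondary", "Secondary"],
    "Employment": ["Employed, full-time", "Student, full-time", "Employed, full-time",
                   "Employed, full-time", "Student, full-time", "Student, full-time"],
})

# Row percentages: each education level sums to 100
ct = pd.crosstab(toy["EdLevel"], toy["Employment"], normalize="index") * 100

fig, ax = plt.subplots(figsize=(8, 5))
ct.plot(kind="barh", stacked=True, ax=ax)  # horizontal bars keep long labels readable
ax.set_xlabel("Share of respondents (%)")
ax.set_ylabel("EdLevel")
ax.set_title("Employment by education level (toy data)")
ax.legend(title="Employment", bbox_to_anchor=(1.02, 1), loc="upper left")
fig.tight_layout()
plt.show()
```

With the real data, the many `Employment` combinations can crowd the legend, so keeping only the most frequent categories before plotting may help.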

### Step 10: Export Cleaned Data


- Save the cleaned dataset to a new CSV file for further use or sharing.


In [None]:
df.to_csv("cleaned_dataset.csv", index=False)
print("Saved cleaned_dataset.csv")
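
After exporting, it is worth reading the file back once to confirm that nothing was lost or added. The following is a minimal round-trip sketch on a toy frame; the toy values are illustrative, and a temporary folder is used so the real output file is not touched.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the cleaned survey DataFrame
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "Employment": ["Employed, full-time", "Student, full-time", "Employed, full-time"],
    "YearsCodePro_num": [0.5, 3.0, 12.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_dataset.csv")
    toy.to_csv(path, index=False)   # index=False: do not write the row index
    reloaded = pd.read_csv(path)

print("Same shape:", reloaded.shape == toy.shape)
print("Same columns:", list(reloaded.columns) == list(toy.columns))
```

Writing with `index=False` means no extra unnamed index column appears when the file is read back.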

### Summary:


In this lab, you practiced key skills in exploratory data analysis, including:


- Examining the structure and content of the Stack Overflow survey dataset to understand its variables and data types.

- Identifying and addressing missing data to ensure the dataset's quality and completeness.

- Summarizing and visualizing key variables such as job satisfaction, programming languages, and remote work trends.

- Analyzing relationships in the data using techniques like:
    - Comparing programming languages respondents have worked with versus those they want to work with.
      
    - Exploring remote work preferences by region.

- Investigating correlations between professional coding experience and job satisfaction.

- Performing cross-tabulations to analyze relationships between employment status and education levels.


## Authors:
Ayushi Jain


### Other Contributors:
Rav Ahuja
Lakshmi Holla
Malika


Copyright © IBM Corporation. All rights reserved.
