# The datasets

## Women in STEM (Kaggle)

This dataset explores the representation of women in STEM education globally over two decades. It includes data on female enrollment, graduation rates, and fields of study within STEM.

Columns: Country, Year, Female Enrollment (%), Female Graduation Rate (%), STEM Fields (e.g., Engineering, Computer Science), Gender Gap Index.

Source: https://www.kaggle.com/datasets/bismasajjad/womens-representation-in-global-stem-education

## Women graduating from STEM programs (Our World In Data)

The female share of graduates in a given field of tertiary education is the number of female graduates expressed as a percentage of the total number of graduates in that field.
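As a quick illustration of that definition, here is a minimal sketch of the calculation. The two counts are hypothetical numbers chosen for the example, not values from the Our World In Data dataset.

```python
# Hypothetical counts, for illustration only (not taken from the dataset)
female_graduates = 1_200
total_graduates = 4_800

# Female share = female graduates as a percentage of all graduates in the field
female_share = female_graduates / total_graduates * 100
print(f"Female share of graduates: {female_share:.1f}%")
```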


Source: https://ourworldindata.org/grapher/share-graduates-stem-female

# Load datasets using pandas

In [None]:
import pandas as pd

kaggle_data = pd.read_csv("/Users/mereltheisen/Downloads/women_in_stem.csv")
kaggle_data.head()

In [None]:
# Let's look at just one country, for example China
china_data = kaggle_data[kaggle_data["Country"] == "China"]
china_data.head()

In [None]:
# Let's filter it down to only data for Engineering students in China
china_engineering = china_data[china_data["STEM Fields"] == "Engineering"]
# print(china_engineering)
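The two filtering steps above can also be written as one combined boolean mask. The sketch below uses a small made-up DataFrame (`toy`) as a stand-in for `kaggle_data`, so the values are invented; only the column names mirror the real dataset.

```python
import pandas as pd

# Toy stand-in for kaggle_data; all values are invented for illustration
toy = pd.DataFrame({
    "Country": ["China", "China", "India", "China"],
    "Year": [2005, 2006, 2005, 2005],
    "STEM Fields": ["Engineering", "Biology", "Engineering", "Engineering"],
    "Female Graduation Rate (%)": [30.0, 45.0, 28.0, 34.0],
})

# Combine both conditions with & (each condition needs its own parentheses)
mask = (toy["Country"] == "China") & (toy["STEM Fields"] == "Engineering")
toy_china_engineering = toy[mask]
print(toy_china_engineering)
```

Whether you filter in two steps or one is mostly a matter of taste; the single mask avoids creating an intermediate DataFrame.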

# Create visualisations using plotly

## What is Plotly?
Plotly is a Python (and JavaScript) graphing library that enables the creation of interactive, publication-quality charts and dashboards, supporting a wide variety of chart types like scatter plots, bar charts, 3D graphs, and maps. It's commonly used in data science and analytics for visual exploration and presentation, with support for integration in Jupyter notebooks, web apps, and dashboards.

**Helpful links:**
- 📚 [Plotly Python Docs](https://plotly.com/python/)
- 🚀 [Dash (Plotly’s Web App Framework)](https://dash.plotly.com/)
- 🧑‍💻 [Plotly GitHub Repository](https://github.com/plotly/plotly.py)

In [None]:
import plotly.express as px

In [None]:
# Suppose we'd like to create a bar chart showing the Female Graduation Rate for Engineering in China.
# Can we use the data as is?
# Let's try!

fig = px.bar(
    china_engineering,
    x="Year",
    y="Female Graduation Rate (%)",
    title="Female Graduation Rate by Year",
    text="Female Graduation Rate (%)"  # show values on bars
)
fig.show()

In [None]:
# Hmmm... this looks a bit odd. It looks like we have multiple values for the same year.
# Let's double check. Do we have multiple entries for 2005?
print(china_engineering[china_engineering["Year"]==2005])
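Rather than checking one year at a time, you could count the rows per year and list every year that appears more than once. This is a sketch on an invented DataFrame (`toy_engineering`) standing in for `china_engineering`.

```python
import pandas as pd

# Toy stand-in for china_engineering; values are invented for illustration
toy_engineering = pd.DataFrame({
    "Year": [2005, 2005, 2006, 2007, 2007, 2007],
    "Female Graduation Rate (%)": [30.0, 34.0, 31.0, 29.0, 33.0, 35.0],
})

# Count how many rows exist for each year
rows_per_year = toy_engineering.groupby("Year").size()

# Keep only the years that occur more than once
repeated_years = rows_per_year[rows_per_year > 1]
print(repeated_years)
```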

In [None]:
# To get rid of this duplication, let's group the rows by STEM field and year and take the mean of the female graduation rate.
china_agg = (
    kaggle_data[kaggle_data["Country"] == "China"]
    .groupby(["STEM Fields", "Year"], as_index=False)
    .agg({"Female Graduation Rate (%)": "mean"})
)
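To see what this aggregation does, it can help to keep a row count next to the mean. The sketch below uses pandas named aggregation on an invented DataFrame; the output column names `mean_rate` and `n_rows` are made up for this example.

```python
import pandas as pd

# Toy data with two Engineering rows for 2005; values are invented
toy = pd.DataFrame({
    "STEM Fields": ["Engineering", "Engineering", "Engineering", "Biology"],
    "Year": [2005, 2005, 2006, 2005],
    "Female Graduation Rate (%)": [30.0, 34.0, 31.0, 45.0],
})

toy_agg = (
    toy
    .groupby(["STEM Fields", "Year"], as_index=False)
    .agg(
        # Average of the duplicated rows
        mean_rate=("Female Graduation Rate (%)", "mean"),
        # Number of non-missing values that went into each average
        n_rows=("Female Graduation Rate (%)", "count"),
    )
)
print(toy_agg)
```

Taking the mean is one possible choice here; depending on why the rows are duplicated, another aggregation (or no aggregation at all) might be more appropriate.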

In [None]:
china_engineering = china_agg[china_agg["STEM Fields"] == "Engineering"]
fig = px.bar(
    china_engineering,
    x="Year",
    y="Female Graduation Rate (%)",
    title="Female Graduation Rate by Year",
    text="Female Graduation Rate (%)"  # show values on bars
)
fig.show()

In [None]:
# We can also create a line chart and show the graduation rates for all STEM fields:
fig = px.line(
    china_agg,
    x="Year",
    y="Female Graduation Rate (%)",
    color="STEM Fields",
    markers=True,
    title="Female Graduation Rate in STEM Fields Over Time (China)"
)
fig.show()

In [None]:
# Now let's create some more plots that show graduation rates for engineering in multiple countries:
engineering = kaggle_data[kaggle_data["STEM Fields"] == "Engineering"]

engineering_agg = (
    engineering
    .groupby(["Country", "Year"], as_index=False)
    .agg({"Female Graduation Rate (%)": "mean"})
)

fig = px.line(
    engineering_agg,
    x="Year",
    y="Female Graduation Rate (%)",
    color="Country",
    markers=True,
    title="Female Graduation Rate in Engineering Over Time (Multiple Countries)"
)

fig.update_layout(yaxis_title="Graduation Rate (%)", xaxis_title="Year")
fig.show()


## A more advanced dataset


In [None]:
gender_stats = pd.read_csv("/Users/mereltheisen/Downloads/P_Data_Extract_From_Gender_Statistics/065bc21d-9ad2-4004-b442-31f67904e33b_Data.csv", na_values=["NA", ".."])

### 🔍 Here's what happens step-by-step:

1. **`pd.read_csv(...)`**  
   Loads the CSV file located at the provided path into a DataFrame called `gender_stats`.

2. **`na_values=["NA", ".."]`**  
   Specifies **additional strings to be treated as missing values** (`NaN`) besides the default ones.  
   In this case:
   - Any cell with `"NA"` or `".."` will be converted to `NaN`.

---

### 🧠 Why this is useful:
Many real-world datasets (especially those from organizations like the UN or World Bank) use various codes to represent missing data. Using `na_values`, you ensure:
- You can work with true `NaN` values
- You avoid errors or unexpected behavior in numerical operations or plots
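A small sketch of the effect, using an in-memory CSV instead of the real file. The country names and numbers are invented. Note that `"NA"` is already on pandas' default list of missing-value strings, so in this example it is `".."` that makes the difference.

```python
import io
import pandas as pd

# Tiny made-up CSV that mimics the layout of the Gender Statistics export
csv_text = """Country Name,2016 [YR2016],2017 [YR2017]
Country A,31.5,..
Country B,NA,28.0
"""

# Without na_values, ".." is read as text, so the 2017 column is not numeric
without_na = pd.read_csv(io.StringIO(csv_text))
print(without_na.dtypes)

# With na_values, ".." becomes NaN and both year columns are floats
with_na = pd.read_csv(io.StringIO(csv_text), na_values=["NA", ".."])
print(with_na.dtypes)
print(with_na.isna().sum())
```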


In [None]:
# Let's explore the data a bit
gender_stats.head()

In [None]:
# There's a lot more data in this dataset, but also many NaN values. Let's try to create some meaningful charts.

# Create a bar chart that shows the Women in STEM numbers for 2017 for all countries in the dataset:
fig = px.bar(gender_stats, x="Country Name", y="2017 [YR2017]", title="Women in STEM 2017")
fig.show()
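Because many countries have no value for 2017, one option is to drop those rows and sort the rest before plotting. The sketch below shows only the pandas preparation step on an invented DataFrame (`toy_stats`); the result could then be passed to `px.bar` in place of `gender_stats`.

```python
import pandas as pd

# Toy stand-in for gender_stats; country names and values are invented
toy_stats = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C", "Country D"],
    "2017 [YR2017]": [31.5, None, 42.0, 28.3],
})

plot_ready = (
    toy_stats
    .dropna(subset=["2017 [YR2017]"])               # remove countries with no 2017 value
    .sort_values("2017 [YR2017]", ascending=False)  # largest value first
)
print(plot_ready)
```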


In [None]:
# Let's zoom in on a single country:
country = "Netherlands"  # Change this to the country you want to plot

# Filter the DataFrame for the specific country
df_filtered = gender_stats[gender_stats["Country Name"] == country]

# Melt the DataFrame to convert wide format to long format
df_long = df_filtered.melt(
    id_vars=["Country Name"], 
    value_vars=[col for col in gender_stats.columns if "YR" in col], 
    var_name="Year", 
    value_name="Value"
)

# Clean the 'Year' column (extract just the year number)
df_long["Year"] = df_long["Year"].str.extract(r"(\d{4})").astype(int)

fig = px.bar(df_long, x="Year", y="Value", title=f"Trends for {country}")
fig.show()
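If the melt step feels abstract, this sketch runs the same reshaping on a one-row DataFrame so you can compare the wide and long versions directly. The data is invented; `expand=False` is used so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# One row in wide format: one column per year (values are invented)
toy_wide = pd.DataFrame({
    "Country Name": ["Country A"],
    "2016 [YR2016]": [30.1],
    "2017 [YR2017]": [31.5],
})

# Wide to long: one row per (country, year) pair
toy_long = toy_wide.melt(
    id_vars=["Country Name"],
    value_vars=[col for col in toy_wide.columns if "YR" in col],
    var_name="Year",
    value_name="Value",
)

# "2016 [YR2016]" -> 2016: take the first four-digit run and convert to int
toy_long["Year"] = toy_long["Year"].str.extract(r"(\d{4})", expand=False).astype(int)
print(toy_long)
```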

# Exercises for the Kaggle dataset

### 🧠 Exercise 1: Line Plot of Female Enrollment Over Time

Create a line plot using Plotly to show how **Female Enrollment (%)** has changed over the years for a specific country (e.g., 'Canada').

In [None]:
# ✅ Solution
import plotly.express as px
df = kaggle_data
df_canada = df[df['Country'] == 'Canada']

canada_agg = (
    df_canada
    .groupby(["STEM Fields", "Year"], as_index=False)
    .agg({"Female Enrollment (%)": "mean"})
)
fig = px.line(canada_agg, x='Year', y='Female Enrollment (%)', color="STEM Fields", markers=True, title='Female Enrollment Over Time in Canada')

fig.show()

### 🧠 Exercise 2: Compare STEM Fields Enrollment

Create a grouped bar chart comparing **Female Enrollment (%)** per **STEM Field** across different countries for the most recent year in the dataset.

In [None]:
# ✅ Solution
df = kaggle_data
latest_year = df['Year'].max()
df_latest = df[df['Year'] == latest_year]

df_agg = (
    df_latest
    .groupby(["Country", "Year", "STEM Fields"], as_index=False)
    .agg({"Female Enrollment (%)": "mean"})
    )
fig = px.bar(
    df_agg,
    x="Country",
    y="Female Enrollment (%)",
    color="STEM Fields",
    barmode="group",
    title=f"Female Enrollment (%) per STEM Field by Country ({latest_year})",
    text="Female Enrollment (%)"
)

fig.show()

### 🧠 Exercise 3: Explore Relationship Between Enrollment and Graduation

Create a scatter plot to explore the relationship between **Female Enrollment (%)** and **Female Graduation Rate (%)** for **Biology** across all countries for a given year.

In [None]:
# ✅ Solution
selected_year = 2020  # change this if needed
df_year = df[df['Year'] == selected_year]
biology = df_year[df_year['STEM Fields'] == 'Biology']

biology_agg = (
    biology.groupby("Country", as_index=False)[
        ["Female Enrollment (%)", "Female Graduation Rate (%)"]
    ].mean()
)


fig = px.scatter(biology_agg, x='Female Enrollment (%)', y='Female Graduation Rate (%)', color='Country',
                 title=f'Enrollment vs Graduation Rate in Biology ({selected_year})')
fig.show()

### 🌟 Bonus Challenge: Your Own Plot

Choose any two variables from the dataset and create your own custom visualization using Plotly.

You can use bar, line, scatter, pie, or any other type of plot!

In [None]:
# ✅ Example (customize this!)
# fig = px.scatter(df, x='Female Enrollment (%)', y='STEM Fields', color='Country')
# fig.show()