# The datasets

## Women in STEM (Kaggle)

This dataset explores the representation of women in STEM education globally over two decades. It includes data on female enrollment, graduation rates, and fields of study within STEM.
Columns: Country, Year, Female Enrollment (%), Female Graduation Rate (%), STEM Fields (e.g., Engineering, Computer Science), Gender Gap Index.

Source: https://www.kaggle.com/datasets/bismasajjad/womens-representation-in-global-stem-education

## Women graduating from STEM programs (Our World In Data)

The female share of graduates in a given field of tertiary education is the number of female graduates expressed as a percentage of all graduates in that field.

Source: https://ourworldindata.org/grapher/share-graduates-stem-female
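
To make the definition above concrete, the share is a simple ratio. The sketch below uses made-up counts, not values from the dataset, and `female_share` is a hypothetical helper name introduced only for illustration.

```python
def female_share(female_graduates: int, total_graduates: int) -> float:
    """Female graduates as a percentage of all graduates in a field."""
    if total_graduates <= 0:
        raise ValueError("total_graduates must be positive")
    return 100 * female_graduates / total_graduates


# Made-up example: 300 female graduates out of 1,200 graduates in total
share = female_share(300, 1200)
print(f"{share:.1f}%")  # 25.0%
```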

# Load datasets using pandas

In [None]:
import pandas as pd

# Adjust this path to wherever you saved the Kaggle CSV
kaggle_data = pd.read_csv("/Users/mereltheisen/Downloads/women_in_stem.csv")
kaggle_data.head()

In [None]:
# Let's look at just one country
china_data = kaggle_data[kaggle_data["Country"] == "China"]
china_engineering = china_data[china_data["STEM Fields"] == "Engineering"]

# Create visualisations using plotly

## What is Plotly?
Plotly is a Python (and JavaScript) graphing library that enables the creation of interactive, publication-quality charts and dashboards, supporting a wide variety of chart types like scatter plots, bar charts, 3D graphs, and maps. It's commonly used in data science and analytics for visual exploration and presentation, with support for integration in Jupyter notebooks, web apps, and dashboards.

**Helpful links:**
- 📚 [Plotly Python Docs](https://plotly.com/python/)
- 🚀 [Dash (Plotly’s Web App Framework)](https://dash.plotly.com/)
- 🧑‍💻 [Plotly GitHub Repository](https://github.com/plotly/plotly.py)

In [None]:
import plotly.express as px

In [None]:
print(china_engineering[china_engineering["Year"]==2017])

In [None]:
china_agg = (
    kaggle_data[kaggle_data["Country"] == "China"]
    .groupby(["STEM Fields", "Year"], as_index=False)
    .agg({"Female Graduation Rate (%)": "mean"})
)
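
The aggregation above can be easier to follow on a tiny example. The sketch below assumes the data may hold more than one row for the same field and year. In that case, taking the mean leaves a single value per line-chart point. The rows are made up and are not taken from the Kaggle file.

```python
import pandas as pd

# Made-up rows: two records for Engineering in 2017, one for Biology
toy = pd.DataFrame({
    "STEM Fields": ["Engineering", "Engineering", "Biology"],
    "Year": [2017, 2017, 2017],
    "Female Graduation Rate (%)": [30.0, 40.0, 55.0],
})

# Same pattern as above: one row per (field, year) holding the mean rate
toy_agg = (
    toy
    .groupby(["STEM Fields", "Year"], as_index=False)
    .agg({"Female Graduation Rate (%)": "mean"})
)
print(toy_agg)  # Engineering collapses to a single row with a mean of 35.0
```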

In [None]:
# Create the line chart
fig = px.line(
    china_agg,
    x="Year",
    y="Female Graduation Rate (%)",
    color="STEM Fields",
    markers=True,
    title="Female Graduation Rate in STEM Fields Over Time (China)"
)
fig.show()

In [None]:
engineering = kaggle_data[kaggle_data["STEM Fields"] == "Engineering"]
engineering.head()

In [None]:
engineering_agg = (
    engineering
    .groupby(["Country", "Year"], as_index=False)
    .agg({"Female Graduation Rate (%)": "mean"})
)


In [None]:
import plotly.express as px

fig = px.line(
    engineering_agg,
    x="Year",
    y="Female Graduation Rate (%)",
    color="Country",
    markers=True,
    title="Female Graduation Rate in Engineering Over Time (Multiple Countries)"
)

fig.update_layout(yaxis_title="Graduation Rate (%)", xaxis_title="Year")
fig.show()


## A more advanced dataset


In [None]:
gender_stats = pd.read_csv("/Users/mereltheisen/Downloads/P_Data_Extract_From_Gender_Statistics/065bc21d-9ad2-4004-b442-31f67904e33b_Data.csv", na_values=["NA", ".."])
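
The `na_values` argument in the cell above tells pandas which strings to treat as missing. The sketch below shows why that matters, using a made-up two-row CSV in a similar wide layout. It assumes `..` is the placeholder for missing values, as the `na_values` list suggests.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up CSV text, with ".." marking a missing value
csv_text = """Country Name,2016 [YR2016],2017 [YR2017]
Netherlands,28.1,..
Germany,..,27.6
"""

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values=["NA", ".."])

# Without na_values the ".." strings keep the column from being numeric
print(is_numeric_dtype(raw["2017 [YR2017]"]))    # False
print(is_numeric_dtype(clean["2017 [YR2017]"]))  # True
```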

In [None]:
# Create bar chart
fig = px.bar(gender_stats, x="Country Name", y="2017 [YR2017]", title="Women in STEM 2017")
fig.show()


In [None]:
# Plot the yearly values for a single country from gender_stats
country = "Netherlands"  # Change this to the country you want to plot

# Filter the DataFrame for the specific country
df_filtered = gender_stats[gender_stats["Country Name"] == country]

# Melt the DataFrame to convert wide format to long format
df_long = df_filtered.melt(
    id_vars=["Country Name"], 
    value_vars=[col for col in gender_stats.columns if "YR" in col], 
    var_name="Year", 
    value_name="Value"
)

# Clean the 'Year' column (extract just the year number)
df_long["Year"] = df_long["Year"].str.extract(r"(\d{4})").astype(int)

fig = px.bar(df_long, x="Year", y="Value", title=f"Trends for {country}")
fig.show()
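
To see what the wide-to-long reshape above does, here is a minimal sketch on a made-up one-row table with the same kind of year columns. The values are invented for illustration, and `long_df` is a hypothetical name.

```python
import pandas as pd

# Made-up wide table: one row per country, one column per year
wide = pd.DataFrame({
    "Country Name": ["Netherlands"],
    "2016 [YR2016]": [28.1],
    "2017 [YR2017]": [29.3],
})

year_cols = [col for col in wide.columns if "YR" in col]

# melt turns each year column into its own row
long_df = wide.melt(
    id_vars=["Country Name"],
    value_vars=year_cols,
    var_name="Year",
    value_name="Value",
)

# Keep only the four-digit year; expand=False returns a Series
long_df["Year"] = long_df["Year"].str.extract(r"(\d{4})", expand=False).astype(int)
print(long_df)
```

The long format has one row per country and year, which is the layout Plotly Express expects for its `x` and `y` arguments.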

### 🧠 Exercise 1: Line Plot of Female Enrollment Over Time

Create a line plot using Plotly to show how **Female Enrollment (%)** has changed over the years for a specific country (e.g., 'Canada').

In [None]:
# ✅ Solution
import plotly.express as px
df_canada = kaggle_data[kaggle_data['Country'] == 'Canada']
fig = px.line(df_canada, x='Year', y='Female Enrollment (%)', title='Female Enrollment Over Time in Canada')
fig.show()

### 🧠 Exercise 2: Compare STEM Fields Participation

Create a bar chart comparing the number of **STEM Fields** across different countries for the most recent year in the dataset.

In [None]:
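# ✅ Solution
latest_year = kaggle_data['Year'].max()
df_latest = kaggle_data[kaggle_data['Year'] == latest_year]
# Count the distinct STEM fields recorded for each country in that year
field_counts = df_latest.groupby('Country')['STEM Fields'].nunique().reset_index()
fig = px.bar(field_counts, x='Country', y='STEM Fields', title=f'Number of STEM Fields per Country in {latest_year}')
fig.show()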

### 🧠 Exercise 3: Explore Relationship Between Enrollment and Graduation

Create a scatter plot to explore the relationship between **Female Enrollment (%)** and **Female Graduation Rate (%)** across all countries for a given year.

In [None]:
# ✅ Solution
selected_year = 2020  # change this if needed
df_year = kaggle_data[kaggle_data['Year'] == selected_year]
fig = px.scatter(df_year, x='Female Enrollment (%)', y='Female Graduation Rate (%)', color='Country',
                 title=f'Enrollment vs Graduation Rate ({selected_year})')
fig.show()

### 🧠 Exercise 4: Animated Plot of Gender Gap Index

Use Plotly to create an animated scatter plot that shows the **Gender Gap Index** over time for each country.

In [None]:
# ✅ Solution
# Sort by year so the animation frames play in chronological order
fig = px.scatter(kaggle_data.sort_values('Year'), x='Year', y='Gender Gap Index', animation_frame='Year', animation_group='Country',
                 color='Country', size='Gender Gap Index',
                 title='Gender Gap Index Over Time')
fig.show()

### 🌟 Bonus Challenge: Your Own Plot

Choose any two variables from the dataset and create your own custom visualization using Plotly.

You can use bar, line, scatter, pie, or any other type of plot!

In [None]:
# ✅ Example (customize this!)
# fig = px.scatter(kaggle_data, x='Female Enrollment (%)', y='STEM Fields', color='Country')
# fig.show()
