### Project 2

In [1]:
import plotly.io as pio

pio.renderers.default = "vscode+jupyterlab+notebook_connected"

## Do countries with lower female educational attainment rates tend to have higher cases of stunting among children?

### Project Overview

Stunting is a big challenge for a lot of developing countries. It basically means kids aren’t growing as they should because they’re not getting enough proper nutrition. It’s such a serious problem that tackling it is part of the [Sustainable Development Goals (SDGs)](https://www.un.org/sustainabledevelopment/hunger/).

Different countries are trying all kinds of strategies to fight stunting, such as providing nutrition, clean water access, and adequate health services. But here’s what I’m curious about: could stunting also have something to do with how educated parents, especially moms, are? If mothers haven’t had the chance to get a solid education, they might not know as much about nutrition and how to make sure their kids are getting what they need to grow healthy.

To dig into this idea, I decided to look at whether there’s a link between stunting rates and how many women have finished high school (or upper secondary education). For this, I pulled stunting data from the World Health Organization (WHO) and education data from the World Bank. The goal is to see if improving education for women could be part of addressing the issue.

### Data Sources

1. The stunting data was pulled from [World Health Organization](https://data.who.int/indicators/i/A5A7413/5F8A486#:~:text=Worldwide%2C%20the%20prevalence%20of%20stunting,%25%20%2D%2022.9%25%5D%20in%202022.)
2. The female education data was pulled from [World Bank](https://genderdata.worldbank.org/en/indicator/se-cuat-zs#data-table-section)


### Working with the Datasets

First, I loaded the stunting dataset using pandas to read the CSV file. The dataset contains information on stunting prevalence by country over several years.

In [2]:
import pandas as pd

stunting_rate = pd.read_csv("API_SH.STA.STNT.ZS_DS2_en_csv_v2_2030.csv")
stunting_rate

,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Aruba,ABW,"Prevalence of stunting, height for age (% of c...",SH.STA.STNT.ZS,,,,,,,...,,,,,,,,,,
1,Africa Eastern and Southern,AFE,"Prevalence of stunting, height for age (% of c...",SH.STA.STNT.ZS,,,,,,,...,,,,,,,,,,
2,Afghanistan,AFG,"Prevalence of stunting, height for age (% of c...",SH.STA.STNT.ZS,,,,,,,...,,,,,38.2,,,,44.6,
3,Africa Western and Central,AFW,"Prevalence of stunting, height for age (% of c...",SH.STA.STNT.ZS,,,,,,,...,,,,,,,,,,
4,Angola,AGO,"Prevalence of stunting, height for age (% of c...",SH.STA.STNT.ZS,,,,,,,...,,37.6,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261,Kosovo,XKX,"Prevalence of stunting, height for age (% of c...",SH.STA.STNT.ZS,,,,,,,...,,,,,,,,,,
262,"Yemen, Rep.",YEM,"Prevalence of stunting, height for age (% of c...",SH.STA.STNT.ZS,,,,,,,...,,,,,,,,,,
263,South Africa,ZAF,"Prevalence of stunting, height for age (% of c...",SH.STA.STNT.ZS,,,,,,,...,,21.2,27.4,21.3,,,,,,
264,Zambia,ZMB,"Prevalence of stunting, height for age (% of c...",SH.STA.STNT.ZS,,,,,,,...,,,,,34.6,,,,,


Here, I used the `info()` method to get a quick overview of the dataset and check which year has the most complete data for stunting rates. Focusing on a year with fewer gaps makes the analysis more robust. From the output, 2019 has enough data available (43 non-null entries) while still being recent, so I’ll go with that for the analysis.

In [3]:
stunting_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 68 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country Name    266 non-null    object 
 1   Country Code    266 non-null    object 
 2   Indicator Name  266 non-null    object 
 3   Indicator Code  266 non-null    object 
 4   1960            0 non-null      float64
 5   1961            0 non-null      float64
 6   1962            0 non-null      float64
 7   1963            0 non-null      float64
 8   1964            0 non-null      float64
 9   1965            0 non-null      float64
 10  1966            0 non-null      float64
 11  1967            0 non-null      float64
 12  1968            0 non-null      float64
 13  1969            0 non-null      float64
 14  1970            0 non-null      float64
 15  1971            0 non-null      float64
 16  1972            0 non-null      float64
 17  1973            0 non-null      flo
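
As a more direct alternative to scanning the `info()` output, the non-null count for each year column can be computed and sorted. Here is a minimal sketch on a tiny made-up frame in the same wide layout. The countries, values, and the three year columns are placeholders, not the real data; with the real file, the same two lines should work on `stunting_rate`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide layout (one column per year); values are made up.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D"],
    "Country Code": ["AAA", "BBB", "CCC", "DDD"],
    "2017": [np.nan, 21.3, np.nan, np.nan],
    "2018": [38.2, np.nan, 34.6, np.nan],
    "2019": [30.1, 25.0, np.nan, 12.4],
})

# Keep only the columns whose names are all digits (the year columns),
# then count the non-null values in each and sort from most to least complete.
year_cols = [col for col in toy.columns if col.isdigit()]
coverage = toy[year_cols].count().sort_values(ascending=False)
print(coverage)
```

`DataFrame.count()` ignores missing values, so the year at the top of `coverage` is the one with the most countries reporting.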

Next, I narrowed down the dataset to focus only on the country names and their stunting rates for 2019. This keeps things simple and directly relevant to the analysis by getting rid of unnecessary columns and using the key data I need.

In [4]:
stunting_rate_2019 = stunting_rate[["Country Name", "2019"]]
stunting_rate_2019

,Country Name,2019
0,Aruba,
1,Africa Eastern and Southern,
2,Afghanistan,
3,Africa Western and Central,
4,Angola,
...,...,...
261,Kosovo,
262,"Yemen, Rep.",
263,South Africa,
264,Zambia,


I repeated a similar step to load the dataset for female education levels from World Bank for the analysis.

In [5]:
female_education_level = pd.read_csv("Educational attainment by level of education, cumulative (% population 25+).csv")
female_education_level

,Indicator Name,Indicator Code,Country Name,Country Code,Year,Value,Disaggregation
0,"Educational attainment, at least completed pri...",SE.PRM.CUAT.FE.ZS,Afghanistan,AFG,2022,10.220,"At least completed primary, female"
1,"Educational attainment, at least completed pri...",SE.PRM.CUAT.FE.ZS,Afghanistan,AFG,2021,8.626,"At least completed primary, female"
2,"Educational attainment, at least completed pri...",SE.PRM.CUAT.FE.ZS,Afghanistan,AFG,2020,9.939,"At least completed primary, female"
3,"Educational attainment, at least completed pri...",SE.PRM.CUAT.FE.ZS,Afghanistan,AFG,2017,9.983,"At least completed primary, female"
4,"Educational attainment, at least completed pri...",SE.PRM.CUAT.FE.ZS,Afghanistan,AFG,2015,6.848,"At least completed primary, female"
...,...,...,...,...,...,...,...
40684,"Educational attainment, Doctoral or equivalent...",SE.TER.CUAT.DO.ZS,Zambia,ZMB,2018,0.000,"Doctoral or equivalent, total"
40685,"Educational attainment, Doctoral or equivalent...",SE.TER.CUAT.DO.ZS,Zambia,ZMB,2017,0.000,"Doctoral or equivalent, total"
40686,"Educational attainment, Doctoral or equivalent...",SE.TER.CUAT.DO.ZS,Zimbabwe,ZWE,2021,0.155,"Doctoral or equivalent, total"
40687,"Educational attainment, Doctoral or equivalent...",SE.TER.CUAT.DO.ZS,Zimbabwe,ZWE,2019,0.000,"Doctoral or equivalent, total"


The dataset from the World Bank includes education levels for both males and females across all stages of schooling. After loading the data, I wanted to narrow it down to focus specifically on females who completed upper secondary education. To do this, I used `.unique()` on the 'Disaggregation' column to see all the available categories and make sure the one I needed was there.

In [6]:
female_education_level["Disaggregation"].unique()

array(['At least completed primary, female',
       'At least completed primary, male',
       'At least completed primary, total',
       'At least completed lower secondary, female',
       'At least completed lower secondary, male',
       'At least completed lower secondary, total',
       'At least completed upper secondary, female',
       'At least completed upper secondary, male',
       'At least completed upper secondary, total',
       'At least completed post-secondary, female',
       'At least completed post-secondary, male',
       'At least completed post-secondary, total',
       'At least completed short-cycle tertiary, female',
       'At least completed short-cycle tertiary, male',
       'At least completed short-cycle tertiary, total',
       "At least Bachelor's or equivalent, female",
       "At least Bachelor's or equivalent, male",
       "At least Bachelor's or equivalent, total",
       "At least Master's or equivalent, female",
       "At least Master's or 

Looking at the unique values in the 'Disaggregation' column, the category I need for this analysis is 'At least completed upper secondary, female', which shows the rate for female education attainment at upper secondary level.

Next, I wanted to check which year had the most data for the category I’m analyzing. So, I filtered the 'Disaggregation' column to the category I needed, grouped the data by the 'Year' column, and sorted it in descending order to see which years had the most entries. I also looked at the top 10 years with the most data available. It turns out that 2019 has the highest number of entries, with 125 data points, which lines up perfectly with the year I’m using for the stunting rate analysis from WHO.

In [7]:
check1 = female_education_level[
    female_education_level["Disaggregation"] == "At least completed upper secondary, female"
].groupby("Year").size().to_frame("check").sort_values(by="check", ascending=False)

check1.head(10)

Year,check
2019,125
2018,117
2017,115
2014,105
2015,104
2021,103
2020,98
2016,98
2010,92
2011,91
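
Since the two tables get combined later, it can also help to count, for each year, how many countries have a value in both of them, rather than checking each table separately. Here is a rough sketch with made-up toy tables: the names `stunting_wide`, `education_long`, and `overlap_by_year` are just for illustration, and the numbers are not real data. The wide table is reshaped to long format with `melt` so both tables share the keys `Country Name` and `Year`.

```python
import numpy as np
import pandas as pd

# Toy wide table (one column per year), standing in for the stunting data.
stunting_wide = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "2018": [38.2, np.nan, 34.6],
    "2019": [np.nan, 25.0, 30.0],
})

# Toy long table (one row per country-year), standing in for the education data.
education_long = pd.DataFrame({
    "Country Name": ["A", "A", "B", "C"],
    "Year": [2018, 2019, 2019, 2019],
    "Value": [10.0, 11.0, 40.0, 55.0],
})

# Reshape wide -> long, drop missing values, and make Year an integer
# so it matches the dtype of the education table.
stunting_long = (
    stunting_wide
    .melt(id_vars="Country Name", var_name="Year", value_name="Stunting")
    .dropna(subset=["Stunting"])
    .assign(Year=lambda d: d["Year"].astype(int))
)

# An inner merge keeps only the country-years present in both tables.
both = stunting_long.merge(education_long, on=["Country Name", "Year"])
overlap_by_year = both.groupby("Year").size().sort_values(ascending=False)
print(overlap_by_year)
```

The year at the top of `overlap_by_year` is the one that would leave the most rows after merging and dropping missing values.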


Again, I filtered the dataset to focus on the data I need for the analysis: females who completed upper secondary education in 2019. I selected only the relevant columns, which are 'Country Name' and 'Value', to keep things simple.

In [8]:
female_upsecondary_2019 = female_education_level[
    (female_education_level["Disaggregation"] == "At least completed upper secondary, female") &
    (female_education_level["Year"] == 2019)
][["Country Name", "Value"]]

female_upsecondary_2019


,Country Name,Value
12267,Albania,46.062
12280,Algeria,24.827
12296,Angola,14.582
12306,Armenia,95.958
12323,Australia,77.800
...,...,...
14215,Vanuatu,17.210
14240,Viet Nam,29.527
14246,West Bank and Gaza,48.984
14270,Zambia,28.142


Next, I merged the two datasets: stunting rates for 2019 and female upper secondary education data for the same year. I matched them up using the 'Country Name' column so that each country has both data points in one place. I also renamed the columns to make them more intuitive in the chart.

In [9]:
merged_data = pd.merge(stunting_rate_2019, female_upsecondary_2019, on="Country Name")
merged_data

,Country Name,2019,Value
0,Angola,,14.582
1,Albania,,46.062
2,United Arab Emirates,,71.313
3,Armenia,,95.958
4,Australia,,77.800
...,...,...,...
120,Vanuatu,,17.210
121,Samoa,7.3,42.782
122,South Africa,,54.323
123,Zambia,,28.142


In [10]:
merged_data = merged_data.rename(columns = {"2019": "Stunting Rate (%)", "Value": "Female Completing Uppersecondary Education (%)"})
merged_data

,Country Name,Stunting Rate (%),Female Completing Uppersecondary Education (%)
0,Angola,,14.582
1,Albania,,46.062
2,United Arab Emirates,,71.313
3,Armenia,,95.958
4,Australia,,77.800
...,...,...,...
120,Vanuatu,,17.210
121,Samoa,7.3,42.782
122,South Africa,,54.323
123,Zambia,,28.142
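
Merging on country names silently drops any country whose name is spelled differently in the two tables. One way to check for that is an outer merge with `indicator=True`, which adds a `_merge` column saying where each row came from. Here is a small sketch with hypothetical mini tables: the mismatched spelling and the values are made up to show the idea, not taken from the real files.

```python
import pandas as pd

# Hypothetical mini tables; the spelling mismatch and values are invented.
left = pd.DataFrame({
    "Country Name": ["Angola", "Viet Nam", "Zambia"],
    "2019": [10.0, 20.0, 30.0],
})
right = pd.DataFrame({
    "Country Name": ["Angola", "Vietnam", "Zambia"],
    "Value": [15.0, 25.0, 35.0],
})

# how="outer" keeps every row from both sides; indicator=True adds a
# "_merge" column with the values "left_only", "right_only", or "both".
check = left.merge(right, on="Country Name", how="outer", indicator=True)

# Rows that did not find a partner in the other table.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Country Name", "_merge"]])
```

Both source tables also carry a 'Country Code' column, so keeping it and merging on the code instead of the name might be a sturdier choice.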


In this step, I cleaned up the merged dataset by removing rows with missing values in either the stunting rate or the female upper secondary education columns. This ensures I’m working with complete and reliable data for both variables. The dataset is now good to go for analysis without any gaps that could affect the results.

In [11]:
merged_data = merged_data.dropna(subset=["Stunting Rate (%)", "Female Completing Uppersecondary Education (%)"]).reset_index()
merged_data

,index,Country Name,Stunting Rate (%),Female Completing Uppersecondary Education (%)
0,9,Burkina Faso,23.8,4.994
1,10,Bangladesh,28.0,26.839
2,16,Brazil,8.0,52.41
3,20,Central African Republic,39.8,3.26
4,27,Cuba,7.1,55.831
5,32,Dominican Republic,6.7,45.49
6,33,Algeria,9.8,24.827
7,34,Ecuador,23.0,42.61
8,37,Ethiopia,36.8,3.456
9,47,Guyana,9.5,36.345


In [12]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 4 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   index                                           29 non-null     int64  
 1   Country Name                                    29 non-null     object 
 2   Stunting Rate (%)                               29 non-null     float64
 3   Female Completing Uppersecondary Education (%)  29 non-null     float64
dtypes: float64(2), int64(1), object(1)
memory usage: 1.0+ KB
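
The extra `index` column in the output comes from `reset_index()`, which keeps the old row labels as a new column. If that column isn't needed, passing `drop=True` discards it. Here is a minimal sketch on a toy frame with made-up values, using the same column names as the merged table:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both columns; the values are made up.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D"],
    "Stunting Rate (%)": [np.nan, 23.8, 28.0, np.nan],
    "Female Completing Uppersecondary Education (%)": [14.6, 5.0, np.nan, 46.1],
})

# Drop rows missing either value, then renumber the rows from 0
# without keeping the old index as a column.
cleaned = toy.dropna(
    subset=["Stunting Rate (%)", "Female Completing Uppersecondary Education (%)"]
).reset_index(drop=True)
print(cleaned)
```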


### Creating the Scatter Plot

Here, I created a scatter plot using Plotly Express to visualize the relationship between stunting rates and female upper secondary education completion rates for 2019. Each point represents a country, with the stunting rate on the x-axis and the education rate on the y-axis. I also added a trendline (using "OLS" for linear regression) to help show the overall pattern or correlation between the two variables. The goal is to see if there's a noticeable relationship between these factors.

In [13]:
import plotly.express as px

fig = px.scatter(merged_data,
    x = "Stunting Rate (%)",
    y = "Female Completing Uppersecondary Education (%)",
    title = "Relationship between Stunting Rate and Female Completing Upper Secondary Education Rate (2019)",
    trendline = "ols"
)
fig.show()
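
To put a number on what the trendline shows, the Pearson correlation and the slope of a least-squares line can be computed directly. The sketch below uses only the ten rows visible in the cleaned output above, not all 29, so the result is illustrative and will differ from the full-sample figures. `np.polyfit` with `deg=1` is used here as a simple stand-in for the OLS fit that Plotly draws.

```python
import numpy as np
import pandas as pd

# The ten rows shown in the cleaned table's output (the full table has 29 rows).
sample = pd.DataFrame({
    "Country Name": [
        "Burkina Faso", "Bangladesh", "Brazil", "Central African Republic",
        "Cuba", "Dominican Republic", "Algeria", "Ecuador", "Ethiopia", "Guyana",
    ],
    "Stunting Rate (%)": [23.8, 28.0, 8.0, 39.8, 7.1, 6.7, 9.8, 23.0, 36.8, 9.5],
    "Female Completing Uppersecondary Education (%)": [
        4.994, 26.839, 52.41, 3.26, 55.831, 45.49, 24.827, 42.61, 3.456, 36.345,
    ],
})

x = sample["Stunting Rate (%)"]
y = sample["Female Completing Uppersecondary Education (%)"]

# Pearson correlation coefficient (the default method of Series.corr).
r = x.corr(y)

# Least-squares straight line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A negative `r` and a negative `slope` both point the same way as the downward-sloping trendline, but with so few points neither says much about statistical significance.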

### Analysis and Conclusion

This scatterplot shows the relationship between stunting rates and the percentage of females completing upper secondary education in 2019. Each dot represents a country, with the stunting rate on the x-axis and female education on the y-axis.

Looking at the plot, we can see a general trend: **countries with higher female education rates seem to have lower stunting rates**. For example, the dots higher up on the y-axis (representing better female education rates) are mostly toward the left, where stunting rates are lower. Meanwhile, countries further to the right, with higher stunting rates, tend to have lower education rates.

**This suggests that there might be a negative relationship between the two variables—basically, as female education goes up, stunting rates seem to go down.** It lines up with the idea that better-educated mothers might be more aware of how to provide proper nutrition for their kids. 

To really confirm anything here, we’d need to dig deeper into the trendline or the correlation. This analysis has a bunch of limitations, like the small number of observations (29 countries), possible confounding factors, and other things we didn’t account for. Plus, **correlation doesn’t mean causation**, so this is definitely not strong enough to give solid recommendations to policymakers. But honestly, the goal of this project for me was more about learning how to combine two datasets and visualize them using Plotly. And for that, I’d say it was super useful! ☺️