# Fertility rate - project for Data Analysis course

## Problem formulation

In recent decades, fertility rates have exhibited notable fluctuations globally, raising concerns about their implications for population dynamics, economic development, and social welfare. Understanding the intricate interplay between socioeconomic variables and fertility rates is essential for policymakers, economists, and social scientists alike. This project aims to delve into this complex relationship, focusing on the impact of Gross Domestic Product (GDP), education, women's labor force participation, and contraception prevalence on fertility rates.

The goal of the model is to simulate or predict how the selected socioeconomic variables relate to fertility rates. Such a model gives insight into how changes in these factors might influence fertility, which can help economists and social scientists make informed decisions.

The dataset used for this project is sourced from [Our World in Data](https://ourworldindata.org/). Data on labor force participation, GDP, and education can be found [here](https://ourworldindata.org/fertility-rate); data on contraceptive prevalence can be found [here](https://ourworldindata.org/grapher/fertility-vs-contraception).

* GDP data contains information about: ['Entity', 'Code for entity', 'Year', 'Fertility rate', 'GDP per capita', 'Population (historical estimates)', 'Continent']
* Labor force data contains information about: ['Entity', 'Code for entity', 'Year', 'Labor force participation rate, female (% of female population ages 15+)', 'Fertility rate', 'Population (historical estimates)', 'Continent']
* Education data contains information about: ['Entity', 'Code for entity', 'Year', 'Fertility rate', 'Combined - average years of education for 15-64 years female youth and adults', 'Population (historical estimates)', 'Continent']
* Contraception data contains information about: ['Entity', 'Code for entity', 'Year', 'Fertility rate', 'Contraceptive prevalence, any method (% of married women ages 15-49)', 'Continent']

In [75]:
# TODO: DAG model (is it a directed acyclic graph?)
# TODO: Confounding (pipe, fork, collider) -> probably related to the DAG

### Importing libraries

In [76]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from cmdstanpy import CmdStanModel  # has to be used inside the container

### Reading data from files

In [77]:
gdp_df = pd.read_csv('children-per-woman-fertility-rate-vs-level-of-prosperity.csv')
labor_df = pd.read_csv('fertility-and-female-labor-force-participation.csv')
education_df = pd.read_csv('womens-educational-attainment-vs-fertility.csv')
contraception_df = pd.read_csv('fertility-vs-contraception.csv')

## Data preprocessing

All dataframes were cleaned by removing unnecessary columns with `.drop`. To make the merged data easier to read, some columns were renamed with `.rename`. The same steps were repeated for each dataframe. The removed columns were either not needed to identify the rows or not useful for the analysis.
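The drop-and-rename step described above can be sketched as one small helper. This is only an illustration: `clean_frame` and `toy_df` are hypothetical names that do not appear in the notebook, and the toy values are made up. Unlike the notebook cells, the helper returns a new dataframe instead of modifying the input in place.

```python
import pandas as pd


def clean_frame(df, drop_cols, rename_map):
    """Return a copy of df without drop_cols and with columns renamed."""
    return df.drop(columns=drop_cols).rename(columns=rename_map)


# Toy stand-in for one of the CSV files (values are invented for illustration).
toy_df = pd.DataFrame({
    "Entity": ["A", "B"],
    "Code": ["AAA", "BBB"],
    "Year": [2000, 2000],
    "GDP per capita (output, multiple price benchmarks)": [1000.0, 2000.0],
    "Continent": ["X", "Y"],
})

cleaned = clean_frame(
    toy_df,
    drop_cols=["Code", "Continent"],
    rename_map={
        "GDP per capita (output, multiple price benchmarks)": "GDP per capita"
    },
)
print(list(cleaned.columns))
```

Returning a copy keeps the raw dataframe available for later checks, at the cost of some extra memory.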

Because the fertility rate indicator is the same across the dataframes, the column holding it was dropped from every dataframe except one.
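Before dropping the duplicated fertility column, the assumption that the values agree could be checked. The sketch below is not part of the original notebook: `max_abs_difference` is a hypothetical helper, the toy frames are invented, and it assumes the fertility column has already been given the same name in both frames. The contraception file names its fertility column differently in the raw data, so it may be the most useful place to run such a check.

```python
import pandas as pd


def max_abs_difference(left, right, column, keys=("Entity", "Year")):
    """Largest absolute gap in `column` between two frames on shared keys."""
    both = left.merge(right, on=list(keys), suffixes=("_left", "_right"))
    diff = (both[f"{column}_left"] - both[f"{column}_right"]).abs()
    return diff.max()


# Toy frames (invented values) sharing two (Entity, Year) pairs.
frame_a = pd.DataFrame({
    "Entity": ["A", "A", "B"],
    "Year": [2000, 2001, 2000],
    "Fertility rate": [2.1, 2.0, 3.5],
})
frame_b = pd.DataFrame({
    "Entity": ["A", "B"],
    "Year": [2000, 2000],
    "Fertility rate": [2.1, 3.5],
})

gap = max_abs_difference(frame_a, frame_b, "Fertility rate")
print(gap)
```

A gap of zero (or close to it) on the shared rows would support keeping only one copy of the column.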

### Cleaning up GDP dataframe

In [78]:
gdp_df.columns

Index(['Entity', 'Code', 'Year',
       'Fertility rate - Sex: all - Age: all - Variant: estimates',
       'GDP per capita (output, multiple price benchmarks)',
       'Population (historical estimates)', 'Continent'],
      dtype='object')

In [79]:
gdp_df.drop(columns=['Code', 'Population (historical estimates)', 'Continent'], inplace=True)
gdp_df.rename(columns={'Fertility rate - Sex: all - Age: all - Variant: estimates':'Fertility rate', 'GDP per capita (output, multiple price benchmarks)':'GDP per capita'}, inplace=True)
gdp_df.head()

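| | Entity | Year | Fertility rate | GDP per capita |
|---|---|---|---|---|
| 0 | Abkhazia | 2015 | NaN | NaN |
| 1 | Afghanistan | 1950 | 7.2484 | NaN |
| 2 | Afghanistan | 1951 | 7.2596 | NaN |
| 3 | Afghanistan | 1952 | 7.2601 | NaN |
| 4 | Afghanistan | 1953 | 7.2662 | NaN |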


### Cleaning up Labor dataframe

In [80]:
labor_df.columns

Index(['Entity', 'Code', 'Year',
       'Labor force participation rate, female (% of female population ages 15+) (national estimate)',
       'Fertility rate - Sex: all - Age: all - Variant: estimates',
       'Population (historical estimates)', 'Continent'],
      dtype='object')

In [81]:
labor_df.drop(columns=['Code','Population (historical estimates)', 'Continent', 'Fertility rate - Sex: all - Age: all - Variant: estimates'], inplace=True)
labor_df.rename(columns={'Labor force participation rate, female (% of female population ages 15+) (national estimate)':'Labor force rate'}, inplace=True)
labor_df.head()

| | Entity | Year | Labor force rate |
|---|---|---|---|
| 0 | Abkhazia | 2015 | NaN |
| 1 | Afghanistan | 1979 | 6.83 |
| 2 | Afghanistan | 2008 | 43.79 |
| 3 | Afghanistan | 2012 | 16.015 |
| 4 | Afghanistan | 2014 | 25.784 |


### Cleaning up Education dataframe

In [82]:
education_df.columns

Index(['Entity', 'Code', 'Year',
       'Fertility rate - Sex: all - Age: all - Variant: estimates',
       'Combined - average years of education for 15-64 years female youth and adults',
       'Population (historical estimates)', 'Continent'],
      dtype='object')

In [83]:
education_df.drop(columns=['Code','Population (historical estimates)', 'Continent', 'Fertility rate - Sex: all - Age: all - Variant: estimates'], inplace=True)
education_df.rename(columns={'Combined - average years of education for 15-64 years female youth and adults':'Education years'}, inplace=True)
education_df.head()

| | Entity | Year | Education years |
|---|---|---|---|
| 0 | Abkhazia | 2015 | NaN |
| 1 | Afghanistan | 1950 | 0.08 |
| 2 | Afghanistan | 1951 | NaN |
| 3 | Afghanistan | 1952 | NaN |
| 4 | Afghanistan | 1953 | NaN |


### Cleaning up Contraception dataframe

In [84]:
contraception_df.columns

Index(['Entity', 'Code', 'Year', 'Fertility rate, total (births per woman)',
       'Contraceptive prevalence, any method (% of married women ages 15-49)',
       'Continent'],
      dtype='object')

In [85]:
contraception_df.drop(columns=['Code','Continent','Fertility rate, total (births per woman)'], inplace=True)
contraception_df.rename(columns={'Contraceptive prevalence, any method (% of married women ages 15-49)': 'Contraceptive prevalence'}, inplace=True)
contraception_df.head()

| | Entity | Year | Contraceptive prevalence |
|---|---|---|---|
| 0 | Abkhazia | 2015 | NaN |
| 1 | Afghanistan | 1960 | NaN |
| 2 | Afghanistan | 1961 | NaN |
| 3 | Afghanistan | 1962 | NaN |
| 4 | Afghanistan | 1963 | NaN |


### Merging and preparing data for further analysis

To analyze the variables together, the four dataframes are merged into a single dataframe on the `Entity` and `Year` columns. The merged dataframe contains many NaN values, which are removed in a later step.

In [86]:
merged_df = (
    gdp_df.merge(labor_df, on=['Entity', 'Year'])
          .merge(education_df, on=['Entity', 'Year'])
          .merge(contraception_df, on=['Entity', 'Year'])
)
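The effect of this merge can be illustrated on toy data. `DataFrame.merge` defaults to `how="inner"`, so only `(Entity, Year)` pairs present in every frame survive, and rows can still carry NaN in individual columns. The frames below are invented, and the use of `dropna()` is an assumption about how the NaN values might be removed later, not necessarily the notebook's actual approach.

```python
import pandas as pd

# Toy frames (invented values); names ending in _toy are hypothetical.
gdp_toy = pd.DataFrame({
    "Entity": ["A", "A", "B"],
    "Year": [2000, 2001, 2000],
    "GDP per capita": [1000.0, None, 2000.0],
})
labor_toy = pd.DataFrame({
    "Entity": ["A", "A", "C"],
    "Year": [2000, 2001, 2000],
    "Labor force rate": [40.0, 42.0, 55.0],
})

# Inner merge: entities B and C drop out because they lack a match.
merged_toy = gdp_toy.merge(labor_toy, on=["Entity", "Year"])

# Assumed cleanup step: keep only rows with no missing values.
complete_toy = merged_toy.dropna()

print(len(merged_toy), len(complete_toy))
```

If keeping unmatched rows matters, `how="outer"` would retain them at the cost of more NaN values to handle afterwards.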