# Analysis of Fertility Rate and Its Various Factors Across Countries In The Past Decade 

Jason Lee and Jay Lee

## Introduction 

Fertility rates in most, if not all, countries fluctuate over time. Whether those changes stem from a single factor or from intricate relationships among several, we want to compare and identify the factors that may play a significant role in how fertility rates change across different countries. We collected publicly available fertility-rate statistics for OECD (Organisation for Economic Co-operation and Development) countries over the last decade, along with other notable statistics about those countries: GDP, Health Spending, Employment Rate, Adult Education Level, and Internet Access. We analyze this data to identify patterns in the relationships these attributes may have with fertility rates across countries over time.

The total fertility rate in a specific year is defined as the total number of children that would be born to each woman if she were to live to the end of her child-bearing years and give birth in accordance with that year's age-specific fertility rates. A total fertility rate of 2.1 children per woman ensures a broadly stable population, assuming no net migration and unchanged mortality. Together with mortality and migration, the fertility rate reflects a country's multifaceted development, whether economic, social, or otherwise.
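To make the definition concrete, here is a minimal sketch of how a total fertility rate can be derived from age-specific fertility rates. The rates below are made-up illustrative numbers, not OECD data, and the function name `total_fertility_rate` is our own. The sketch assumes rates are given per 1,000 women per year for five-year age groups, so each rate is applied for five years.

```python
# Hypothetical age-specific fertility rates (births per 1,000 women per year)
# for seven five-year age groups. Illustrative numbers only, not OECD data.
asfr_per_1000 = {
    "15-19": 10.0,
    "20-24": 50.0,
    "25-29": 100.0,
    "30-34": 110.0,
    "35-39": 60.0,
    "40-44": 12.0,
    "45-49": 1.0,
}
AGE_GROUP_WIDTH_YEARS = 5


def total_fertility_rate(rates_per_1000, width_years=AGE_GROUP_WIDTH_YEARS):
    """Children per woman if she experienced every age group's rate in turn."""
    # Each rate applies for `width_years` years; divide by 1,000 to get a
    # per-woman figure instead of a per-1,000-women figure.
    return width_years * sum(rates_per_1000.values()) / 1000


tfr = total_fertility_rate(asfr_per_1000)
print(round(tfr, 3))  # 1.715 children per woman for these illustrative rates
```

With these illustrative rates the result falls below the 2.1 replacement level mentioned above.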

## Data Collection and Cleaning 

Our data comes from the OECD website. We collected each country's Fertility Rate, GDP (Gross Domestic Product), Health Spending, Employment Rate, Adult Education Level, and Internet Access. The data is also hosted on Google Drive [here](https://drive.google.com/drive/folders/100_ZZScW3yAXBZ9k2wZW5fegNIw2HC6G?usp=sharing).

In [2]:
import numpy as np 
import seaborn as sns 
import pandas as pd 
import duckdb, sqlalchemy 

%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:

## Fertility Rates

These are the fertility rates of OECD countries over the past decade; the rows shown below begin in 2011. In the cell below, the data is cleaned by removing rows that are not needed. For example, rows labeled "OECD Average" (`OAVG`) are unnecessary because this research looks at each country individually rather than at an average across countries. In addition, some columns are removed and the rest are renamed to make the data simpler and more intuitive. The same procedure is followed for the other data sets.

In [118]:
fertility_rates_df = pd.read_csv('/Users/jasonlee/Downloads/INFO2950_Project_Data/Fertility_Rates.csv')

# Drop the metadata columns we do not need, then rename the remaining three
fertility_rates_df.drop(['INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1, inplace=True)

fertility_rates_df.columns = ['Country','Year','FertilityRate']

# Note: the shape is printed before the 'OAVG' rows are dropped
print(fertility_rates_df.shape)

# Remove the rows for the OECD average (Country == 'OAVG')
fertility_rates_df = fertility_rates_df.drop(fertility_rates_df.index[fertility_rates_df['Country'].isin(['OAVG'])])

fertility_rates_df.head()

(380, 3)


| | Country | Year | FertilityRate |
|---|---|---|---|
| 0 | AUS | 2011 | 1.92 |
| 1 | AUS | 2012 | 1.93 |
| 2 | AUS | 2013 | 1.88 |
| 3 | AUS | 2014 | 1.79 |
| 4 | AUS | 2015 | 1.79 |
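Since the same cleaning steps are repeated for every data set, they could be wrapped in one helper. The sketch below is not part of the original notebook: `clean_oecd_csv` and `METADATA_COLUMNS` are hypothetical names, the raw column names `LOCATION`, `TIME`, and `Value` are an assumption about the downloaded file layout, and the `OAVG` value in the sample is invented for illustration. Unlike the cell above, the helper also resets the index after filtering.

```python
import io

import pandas as pd

# Metadata columns dropped in every cleaning cell of this notebook.
METADATA_COLUMNS = ['INDICATOR', 'SUBJECT', 'MEASURE', 'FREQUENCY', 'Flag Codes']


def clean_oecd_csv(source, value_name, aggregate_codes=('OECD', 'OAVG')):
    """Hypothetical helper: drop metadata, rename columns, remove aggregates."""
    df = pd.read_csv(source)
    df = df.drop(columns=METADATA_COLUMNS)
    # Assumes exactly three columns remain: country code, year, value.
    df.columns = ['Country', 'Year', value_name]
    # Keep only rows for individual countries.
    df = df[~df['Country'].isin(list(aggregate_codes))]
    return df.reset_index(drop=True)


# Tiny in-memory stand-in for a downloaded file. The column layout is assumed
# and the OAVG figure is made up for illustration.
sample_csv = io.StringIO(
    "LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes\n"
    "AUS,FERTILITY,TOT,CHD_WOMAN,A,2011,1.92,\n"
    "AUS,FERTILITY,TOT,CHD_WOMAN,A,2012,1.93,\n"
    "OAVG,FERTILITY,TOT,CHD_WOMAN,A,2011,1.70,\n"
)

cleaned = clean_oecd_csv(sample_csv, 'FertilityRate')
print(cleaned)
```

Passing a file path instead of the in-memory sample, for example `clean_oecd_csv('GDP.csv', 'GDP (US Dollar/Capita)')`, would apply the same steps to another data set, provided the file has the assumed layout.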


## GDP (Gross Domestic Product) 

The GDP data is cleaned with the same procedure as the fertility rates. To make sure the number of rows matches the previous data set, we went through the data and removed excess rows, such as the aggregate "OECD" entries.

In [117]:
GDP_df = pd.read_csv('/Users/jasonlee/Downloads/INFO2950_Project_Data/GDP.csv')

# Drop the metadata columns we do not need, then rename the remaining three
GDP_df.drop(['INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1, inplace=True)

GDP_df.columns = ['Country','Year','GDP (US Dollar/Capita)']


# Remove the aggregate rows (Country == 'OECD')
GDP_df = GDP_df.drop(GDP_df.index[GDP_df['Country'].isin(['OECD'])])

print(GDP_df.shape)


GDP_df.head()


(380, 3)


| | Country | Year | GDP (US Dollar/Capita) |
|---|---|---|---|
| 0 | AUS | 2011 | 44429.559507 |
| 1 | AUS | 2012 | 43883.378891 |
| 2 | AUS | 2013 | 47761.901259 |
| 3 | AUS | 2014 | 47603.880878 |
| 4 | AUS | 2015 | 47232.62912 |


## Health Spending 

In [116]:
Health_spending_df = pd.read_csv('/Users/jasonlee/Downloads/INFO2950_Project_Data/Health_Spending.csv')

# Drop the metadata columns we do not need, then rename the remaining three
Health_spending_df.drop(['INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1, inplace=True)

Health_spending_df.columns = ['Country','Year','Health Spending (US Dollar/Capita)']


# Remove the aggregate rows (Country == 'OECD')
Health_spending_df = Health_spending_df.drop(Health_spending_df.index[Health_spending_df['Country'].isin(['OECD'])])
print(Health_spending_df.shape)
Health_spending_df.head()

(380, 3)


| | Country | Year | Health Spending (US Dollar/Capita) |
|---|---|---|---|
| 0 | AUS | 2011 | 3809.112 |
| 1 | AUS | 2012 | 3854.19 |
| 2 | AUS | 2013 | 4087.849 |
| 3 | AUS | 2014 | 4562.73 |
| 4 | AUS | 2015 | 4777.388 |


## Employment Rate

The original Employment Rate data set contained 389 rows. As with the data sets above, we cleaned it by removing the 'OECD' rows. However, this data set was missing one employment-rate value: Mexico in 2020. We therefore edited the data manually to include NaN for that entry, which keeps the number of rows consistent with the other data sets.




In [115]:
Employment_rate_df = pd.read_csv('/Users/jasonlee/Downloads/INFO2950_Project_Data/Employment_Rates.csv')

# Drop the metadata columns we do not need, then rename the remaining three
Employment_rate_df.drop(['INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1, inplace=True)

Employment_rate_df.columns = ['Country','Year','EmploymentRate']


# Remove the aggregate rows (Country == 'OECD')
Employment_rate_df = Employment_rate_df.drop(Employment_rate_df.index[Employment_rate_df['Country'].isin(['OECD'])])
print(Employment_rate_df.shape)
Employment_rate_df.head()

(380, 3)


| | Country | Year | EmploymentRate |
|---|---|---|---|
| 0 | AUS | 2011 | 72.65694 |
| 1 | AUS | 2012 | 72.34558 |
| 2 | AUS | 2013 | 71.97044 |
| 3 | AUS | 2014 | 71.56635 |
| 4 | AUS | 2015 | 72.15796 |
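As an alternative to editing the file by hand, a missing country-year entry such as Mexico in 2020 could be filled with NaN programmatically. The sketch below is one possible approach, not what the notebook actually did: it reindexes a small illustrative frame (placeholder rates, not the OECD figures) onto the full Country × Year grid, so any absent combination appears as a NaN row.

```python
import pandas as pd

# Illustrative rows only: two countries, two years, with (MEX, 2020) absent.
# The rates are placeholders, not the actual OECD figures.
employment = pd.DataFrame({
    'Country': ['AUS', 'AUS', 'MEX'],
    'Year': [2019, 2020, 2019],
    'EmploymentRate': [74.0, 72.0, 62.0],
})

# Every (Country, Year) combination that should exist in the cleaned data.
full_index = pd.MultiIndex.from_product(
    [employment['Country'].unique(), sorted(employment['Year'].unique())],
    names=['Country', 'Year'],
)

# Reindexing inserts a row of NaN for each combination that was missing.
completed = (
    employment.set_index(['Country', 'Year'])
    .reindex(full_index)
    .reset_index()
)
print(completed)
```

This keeps the row count equal to (number of countries) × (number of years) without touching the source file.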


## Percentage of Tertiary Education Received 

In [114]:
Tertiary_education_df = pd.read_csv('/Users/jasonlee/Downloads/INFO2950_Project_Data/Tertiary_Education.csv')

# Drop the metadata columns we do not need, then rename the remaining three
Tertiary_education_df.drop(['INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1, inplace=True)

Tertiary_education_df.columns = ['Country','Year','Percentage (25-64yrs)']


# Remove the aggregate rows (Country in 'OECD', 'OAVG', 'G20')
Tertiary_education_df = Tertiary_education_df.drop(Tertiary_education_df.index[Tertiary_education_df['Country'].isin(['OECD','OAVG','G20'])])
print(Tertiary_education_df.shape)
Tertiary_education_df.head()

(380, 3)


| | Country | Year | Percentage (25-64yrs) |
|---|---|---|---|
| 0 | AUS | 2011 | 38.342072 |
| 1 | AUS | 2012 | 41.282364 |
| 2 | AUS | 2013 | 39.539928 |
| 3 | AUS | 2014 | 41.901855 |
| 4 | AUS | 2015 | 42.888756 |


## Internet Access

In [119]:
Internet_access_df = pd.read_csv('/Users/jasonlee/Downloads/INFO2950_Project_Data/Internet_Access.csv')

# Drop the metadata columns we do not need, then rename the remaining three
Internet_access_df.drop(['INDICATOR','SUBJECT','MEASURE','FREQUENCY','Flag Codes'],axis=1, inplace=True)

Internet_access_df.columns = ['Country','Year','Internet Access']


# Remove the aggregate rows (Country == 'OECD')
Internet_access_df = Internet_access_df.drop(Internet_access_df.index[Internet_access_df['Country'].isin(['OECD'])])
print(Internet_access_df.shape)
Internet_access_df.head()

(380, 3)


| | Country | Year | Internet Access |
|---|---|---|---|
| 0 | AUS | 2011 | NaN |
| 1 | AUS | 2012 | 83.0 |
| 2 | AUS | 2013 | NaN |
| 3 | AUS | 2014 | 85.89 |
| 4 | AUS | 2015 | NaN |


## Comprehensive View 


In [120]:
Comp_view = fertility_rates_df.merge(GDP_df, on=['Country','Year']).merge(Health_spending_df, on=['Country','Year']).merge(Employment_rate_df, on=['Country','Year']).merge(Tertiary_education_df, on=['Country','Year']).merge(Internet_access_df, on= ['Country','Year'])

print(Comp_view.shape)

Comp_view.head()



(370, 8)


| | Country | Year | FertilityRate | GDP (US Dollar/Capita) | Health Spending (US Dollar/Capita) | EmploymentRate | Percentage (25-64yrs) | Internet Access |
|---|---|---|---|---|---|---|---|---|
| 0 | AUS | 2011 | 1.92 | 44429.559507 | 3809.112 | 72.65694 | 38.342072 | NaN |
| 1 | AUS | 2012 | 1.93 | 43883.378891 | 3854.19 | 72.34558 | 41.282364 | 83.0 |
| 2 | AUS | 2013 | 1.88 | 47761.901259 | 4087.849 | 71.97044 | 39.539928 | NaN |
| 3 | AUS | 2014 | 1.79 | 47603.880878 | 4562.73 | 71.56635 | 41.901855 | 85.89 |
| 4 | AUS | 2015 | 1.79 | 47232.62912 | 4777.388 | 72.15796 | 42.888756 | NaN |
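The merged frame has 370 rows, fewer than the 380 printed for the individual tables, because `merge` defaults to an inner join and keeps only the (Country, Year) pairs present in every table. The sketch below uses small made-up frames (`left`, `right`, and `third` are hypothetical stand-ins, not the notebook's data) to show the same chain of merges written with `functools.reduce`, plus one way to list the keys an inner join would drop, using an outer merge with `indicator=True`.

```python
from functools import reduce

import pandas as pd

# Small illustrative frames with hypothetical values.
# 'right' has no row for (BBB, 2011).
left = pd.DataFrame({'Country': ['AAA', 'BBB'], 'Year': [2011, 2011], 'x': [1.0, 2.0]})
right = pd.DataFrame({'Country': ['AAA'], 'Year': [2011], 'y': [10.0]})
third = pd.DataFrame({'Country': ['AAA', 'BBB'], 'Year': [2011, 2011], 'z': [5.0, 6.0]})

keys = ['Country', 'Year']

# Chain of inner merges: only keys present in every frame survive.
inner = reduce(lambda a, b: a.merge(b, on=keys), [left, right, third])
print(inner.shape)  # (1, 5)

# An outer merge with indicator=True labels each row with where it came from,
# which exposes the keys the inner merge silently dropped.
check = left.merge(right, on=keys, how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[keys + ['_merge']])
```

Running the same check between `fertility_rates_df` and each of the other tables would show which country-year pairs are missing from the comprehensive view.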
