### World Bank Group Project - Maria & August 

**Our hypothesis: Gross National Income decreases as the percentage of children out of school increases.**

Data being used: the World Bank 'WB_more_data.csv' file

Breakdown:
- WBdata_group.ipynb notebook - has the initial analysis of the dataset, to understand the data better, and includes the conclusion at the end
- challenge.ipynb notebook - has the linear regression analysis

In [229]:
## import needed packages
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt

In [211]:
## get current working directory
cwd = os.getcwd()

In [213]:
## read in csv file
df = pd.read_csv(os.path.join(cwd, "data", "WB_more_data.csv"))

In [214]:
df 

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,2001,2002,2003,2011,2012,2013,2021,2022,2023
0,"Population, total",SP.POP.TOTL,Afghanistan,AFG,19688632,21000256,22645130,29249157,30466479,31541209,40099462,41128771,42239854
1,"Population, total",SP.POP.TOTL,Albania,ALB,3060173,3051010,3039616,2905195,2900401,2895092,2811666,2777689,2745972
2,"Population, total",SP.POP.TOTL,Algeria,DZA,31200985,31624696,32055883,36543541,37260563,38000626,44177969,44903225,45606480
3,"Population, total",SP.POP.TOTL,American Samoa,ASM,58324,58177,57941,54310,53691,52995,45035,44273,43914
4,"Population, total",SP.POP.TOTL,Andorra,AND,67820,70849,73907,70567,71013,71367,79034,79824,80088
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1080,"GNI, Atlas method (current US$)",NY.GNP.ATLS.CD,Virgin Islands (U.S.),VIR,..,..,..,..,..,..,..,..,..
1081,"GNI, Atlas method (current US$)",NY.GNP.ATLS.CD,West Bank and Gaza,PSE,4093011325.58684,3665299323.09048,4328972370.19645,11175431624.7764,12759178234.8466,14150619727.8472,21000435319.6176,23810735951.7193,21800691653.9541
1082,"GNI, Atlas method (current US$)",NY.GNP.ATLS.CD,"Yemen, Rep.",YEM,8683416697.21847,9543316523.25537,10362910322.8485,25694362119.0205,30836220079.2874,36138877868.4285,..,..,..
1083,"GNI, Atlas method (current US$)",NY.GNP.ATLS.CD,Zambia,ZMB,3691193326.08369,3945626806.80281,4596361145.52769,19888819424.0619,24445149303.3303,26269107774.6518,20349541094.6806,23489215309.9948,27196324765.3154


### Initial Analysis

In [215]:
## shape gives the dimensions of the dataset: 1085 rows by 13 columns
## describe summarizes each column: the count of values, the number of unique values, the most frequent value (top) and how often it occurs (freq)
print(df.shape)
print(df.describe())

(1085, 13)
              Series Name  Series Code Country Name Country Code  2001  2002  \
count                1085         1085         1085         1085  1085  1085   
unique                  5            5          217          217   797   813   
top     Population, total  SP.POP.TOTL  Afghanistan          AFG    ..    ..   
freq                  217          217            5            5   252   240   

        2003  2011  2012  2013  2021  2022  2023  
count   1085  1085  1085  1085  1085  1085  1085  
unique   810   875   882   872   869   792   597  
top       ..    ..    ..    ..    ..    ..    ..  
freq     242   200   190   206   202   287   477  


In [228]:
## info shows the number of non-null values in each column and each column's dtype
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1085 entries, 0 to 1084
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Series Name   1085 non-null   object
 1   Series Code   1085 non-null   object
 2   Country Name  1085 non-null   object
 3   Country Code  1085 non-null   object
 4   2001          1085 non-null   object
 5   2002          1085 non-null   object
 6   2003          1085 non-null   object
 7   2011          1085 non-null   object
 8   2012          1085 non-null   object
 9   2013          1085 non-null   object
 10  2021          1085 non-null   object
 11  2022          1085 non-null   object
 12  2023          1085 non-null   object
dtypes: object(13)
memory usage: 110.3+ KB


In [216]:
## finding the unique strings in the Series Name column helps us see how the dataset is sectioned off and which sections we will be needing
df["Series Name"].unique()

array(['Population, total',
       'Children out of school (% of primary school age)',
       'Children out of school, primary',
       'GNI per capita, Atlas method (current US$)',
       'GNI, Atlas method (current US$)'], dtype=object)

In [220]:
## look at maximum/minimum values, most important for the year columns
## note: the year columns are still strings here, so min/max compare text rather than numbers
print(df.min())
print(df.max())

Series Name     Children out of school (% of primary school age)
Series Code                                       NY.GNP.ATLS.CD
Country Name                                         Afghanistan
Country Code                                                 ABW
2001                                                          ..
2002                                                          ..
2003                                                          ..
2011                                                          ..
2012                                                          ..
2013                                                          ..
2021                                                          ..
2022                                                          ..
2023                                                          ..
dtype: object
Series Name     Population, total
Series Code           SP.POP.TOTL
Country Name             Zimbabwe
Country Code                  ZWE
2001 

In [246]:
## dtypes shows that the year columns are stored as object (strings), so we need to convert them to numeric before continuing
df.dtypes

Series Name     object
Series Code     object
Country Name    object
Country Code    object
2001            object
2002            object
2003            object
2011            object
2012            object
2013            object
2021            object
2022            object
2023            object
dtype: object
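
The conversion itself is not shown in this notebook. A minimal sketch of one way to do it, assuming the `..` placeholder should become a missing value, is below. The small `sample` frame is a stand-in built from a few values displayed above, not the full file.

```python
import pandas as pd

# Tiny stand-in for the real file; values copied from rows displayed above.
sample = pd.DataFrame({
    "Series Name": ["GNI per capita, Atlas method (current US$)"] * 2,
    "Country Name": ["Yemen, Rep.", "Zambia"],
    "2001": ["450", "360"],
    "2023": ["..", "1320"],
})

# Year columns are the ones whose names are all digits.
year_cols = [c for c in sample.columns if c.isdigit()]

# errors="coerce" turns anything that cannot be parsed (here "..") into NaN.
sample[year_cols] = sample[year_cols].apply(pd.to_numeric, errors="coerce")

print(sample.dtypes)
print(sample[year_cols].isna().sum())
```

An alternative is to pass `na_values=".."` to `pd.read_csv`, so the placeholder is read as missing from the start.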

### Subset of df

In [221]:
## creating a subset
df2 = df.copy()

In [222]:
## keep only the rows whose Series Name is one of these two series, then reset the index
df2 = df2[df2["Series Name"].isin(["Children out of school (% of primary school age)", "GNI per capita, Atlas method (current US$)"])].reset_index(drop=True)

In [223]:
df2

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,2001,2002,2003,2011,2012,2013,2021,2022,2023
0,Children out of school (% of primary school age),SE.PRM.UNER.ZS,Afghanistan,AFG,..,..,..,..,..,..,..,..,..
1,Children out of school (% of primary school age),SE.PRM.UNER.ZS,Albania,ALB,1.43961000442505,..,1.93194997310638,0.169939994812012,0.216979995369911,0.371250003576279,6.18711996078491,9.20401954650879,..
2,Children out of school (% of primary school age),SE.PRM.UNER.ZS,Algeria,DZA,2.9766800403595,1.37446999549866,0.937269985675812,0.789669990539551,0.6214200258255,0.427540004253387,0.196449995040894,0.628679990768433,0.975709974765778
3,Children out of school (% of primary school age),SE.PRM.UNER.ZS,American Samoa,ASM,..,..,..,..,..,..,..,..,..
4,Children out of school (% of primary school age),SE.PRM.UNER.ZS,Andorra,AND,..,..,..,56.8440017700195,55.4199981689453,54.3419990539551,8.0449800491333,7.08159017562866,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...
429,"GNI per capita, Atlas method (current US$)",NY.GNP.PCAP.CD,Virgin Islands (U.S.),VIR,..,..,..,..,..,..,..,..,..
430,"GNI per capita, Atlas method (current US$)",NY.GNP.PCAP.CD,West Bank and Gaza,PSE,1370,1190,1370,2880,3210,3470,4270,4720,4220
431,"GNI per capita, Atlas method (current US$)",NY.GNP.PCAP.CD,"Yemen, Rep.",YEM,450,490,510,1010,1180,1340,..,..,..
432,"GNI per capita, Atlas method (current US$)",NY.GNP.PCAP.CD,Zambia,ZMB,360,380,420,1390,1660,1720,1050,1170,1320


In [143]:
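## drop every column that contains at least one NaN
## note: this works on columns, not rows, and the '..' placeholders are strings, not NaN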
df2.dropna(axis="columns", how="any")

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,2001,2023
0,Children out of school (% of primary school age),SE.PRM.UNER.ZS,Afghanistan,AFG,..,..
1,Children out of school (% of primary school age),SE.PRM.UNER.ZS,Albania,ALB,1.43961000442505,..
2,Children out of school (% of primary school age),SE.PRM.UNER.ZS,Algeria,DZA,2.9766800403595,0.975709974765778
3,Children out of school (% of primary school age),SE.PRM.UNER.ZS,American Samoa,ASM,..,..
4,Children out of school (% of primary school age),SE.PRM.UNER.ZS,Andorra,AND,..,..
...,...,...,...,...,...,...
429,"GNI per capita, Atlas method (current US$)",NY.GNP.PCAP.CD,Virgin Islands (U.S.),VIR,..,..
430,"GNI per capita, Atlas method (current US$)",NY.GNP.PCAP.CD,West Bank and Gaza,PSE,1370,4220
431,"GNI per capita, Atlas method (current US$)",NY.GNP.PCAP.CD,"Yemen, Rep.",YEM,450,..
432,"GNI per capita, Atlas method (current US$)",NY.GNP.PCAP.CD,Zambia,ZMB,360,1320
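
Because `..` is an ordinary string, `dropna` does not treat it as missing. A minimal sketch of counting the placeholders directly is below. The `sample` frame reuses three rows displayed above and is only a stand-in for `df2`.

```python
import pandas as pd

# Stand-in rows copied from the "Children out of school (%)" output above.
sample = pd.DataFrame({
    "Country Name": ["Afghanistan", "Albania", "Algeria"],
    "2001": ["..", "1.43961000442505", "2.9766800403595"],
    "2023": ["..", "..", "0.975709974765778"],
})
year_cols = ["2001", "2023"]

# dropna finds no NaN in this frame, so no column is removed.
unchanged = sample.dropna(axis="columns", how="any")

# Count the ".." placeholders in each year column instead.
missing_per_year = sample[year_cols].eq("..").sum()
print(missing_per_year)
```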


After the initial analysis, we coded the linear regression in the challenge.ipynb notebook, focusing on the rows for 'Children out of school (% of primary school age)' and 'GNI per capita, Atlas method (current US$)' to see whether there is a relationship between the two. We created two subsets, named X and Y, then pivoted and transposed them so that the years form the index and the countries form the columns, which turns the data into a time-series dataset. The estimation summary gives a p-value of 0.2798. Because this is greater than the significance level, we fail to reject the null hypothesis. We had two main limitations: other variables than the two we chose might have provided a more accurate test, and some data are missing, so the dataset is not completely accurate.
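
The reshaping and regression live in challenge.ipynb and are not reproduced here. The sketch below is only an illustration of the idea. It makes several assumptions: the values are toy numbers for two made-up countries (not taken from WB_more_data.csv), `set_index` plus `.T` stands in for the pivot-and-transpose step, the observations are pooled across years and countries, and `scipy.stats.linregress` is swapped in for whatever estimator the notebook uses.

```python
import pandas as pd
from scipy import stats

PCT = "Children out of school (% of primary school age)"
GNI = "GNI per capita, Atlas method (current US$)"

# Toy values for two hypothetical countries; NOT real World Bank data.
toy = pd.DataFrame({
    "Series Name": [PCT, PCT, GNI, GNI],
    "Country Name": ["Country A", "Country B", "Country A", "Country B"],
    "2001": [10.0, 4.0, 500.0, 2000.0],
    "2011": [8.0, 3.0, 900.0, 2600.0],
    "2021": [5.0, None, 1500.0, 3100.0],
})
year_cols = ["2001", "2011", "2021"]


def years_by_country(frame, series_name):
    """Return one series as a table with years as index, countries as columns."""
    subset = frame[frame["Series Name"] == series_name]
    return subset.set_index("Country Name")[year_cols].T


X = years_by_country(toy, PCT)  # explanatory variable
Y = years_by_country(toy, GNI)  # response variable

# One (year, country) observation per row; keep only complete pairs.
pairs = pd.concat([X.stack(), Y.stack()], axis=1, keys=["pct_out", "gni_pc"]).dropna()

# Simple linear regression of GNI per capita on the out-of-school percentage.
result = stats.linregress(pairs["pct_out"], pairs["gni_pc"])
print(f"slope={result.slope:.2f}, p-value={result.pvalue:.4f}")
```

A p-value above the chosen significance level, as reported above, would mean failing to reject the null hypothesis of no linear relationship.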