# Share of education in government expenditure

## Plan of Action

- Explore the data 
- Clean the data
- Get the mean share of education in government expenditure for each selected country over a given time period (2000-2021)

## Data source

- The raw data is available in a CSV file titled 'share-of-education-in-government-expenditure.csv'
- The file comes from Our World in Data (https://ourworldindata.org/grapher/share-of-education-in-government-expenditure), which derives it from World Bank data on government spending across the world
- The original dataset spans a time period from 1980 to 2021
- The World Bank describes the data as "General government expenditure on education (current, capital, and transfers) is expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.)" https://ourworldindata.org/government-spending
- They also add "General government usually refers to local, regional and central governments."

In [1]:
import pandas as pd

In [2]:
df2 = pd.read_csv("share-of-education-in-government-expenditure.csv")
df2

| | Entity | Code | Year | Government expenditure on education, total (% of government expenditure) |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2010 | 17.067560 |
| 1 | Afghanistan | AFG | 2011 | 16.048429 |
| 2 | Afghanistan | AFG | 2013 | 14.102800 |
| 3 | Afghanistan | AFG | 2014 | 14.465930 |
| 4 | Afghanistan | AFG | 2015 | 12.509000 |
| ... | ... | ... | ... | ... |
| 3529 | Zimbabwe | ZWE | 2014 | 30.015150 |
| 3530 | Zimbabwe | ZWE | 2015 | 29.470831 |
| 3531 | Zimbabwe | ZWE | 2016 | 23.527081 |
| 3532 | Zimbabwe | ZWE | 2017 | 20.874201 |


## Let's see what this data looks like 

In [3]:
df2.shape

(3534, 4)

This dataframe has 3534 rows and 4 columns

In [4]:
df2['Year'].min()

1980

The earliest year in this dataset is 1980
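The minimum only confirms the start of the 1980 to 2021 span mentioned under "Data source". A minimal sketch for getting both bounds in one call is shown here; `sample` is a stand-in built from a few rows of the output above, and in the notebook the same call would be made on `df2['Year']`.

```python
import pandas as pd

# Stand-in sample: rows copied from the notebook output above, not the full CSV.
sample = pd.DataFrame({
    "Entity": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
    "Code": ["AFG", "AFG", "ZWE", "ZWE"],
    "Year": [2010, 2011, 2016, 2017],
})

# agg with a list of function names returns a Series indexed by those names
year_range = sample["Year"].agg(["min", "max"])
print(year_range)
```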

In [5]:
df2.dtypes

Entity                                                                       object
Code                                                                         object
Year                                                                          int64
Government expenditure on education, total (% of government expenditure)    float64
dtype: object
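The plan also lists a cleaning step, so alongside the dtypes it may be worth checking for missing values and duplicate country-year rows. This is only a sketch: `sample` is a stand-in built from rows shown in the output above, and whether the full file contains any gaps or duplicates is not something this example establishes.

```python
import pandas as pd

value_col = "Government expenditure on education, total (% of government expenditure)"

# Stand-in sample: rows copied from the notebook output above, not the full CSV.
sample = pd.DataFrame({
    "Entity": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
    "Code": ["AFG", "AFG", "ZWE", "ZWE"],
    "Year": [2010, 2011, 2016, 2017],
    value_col: [17.067560, 16.048429, 23.527081, 20.874201],
})

# Number of missing values per column
n_missing = sample.isna().sum()
print(n_missing)

# Number of rows that repeat an earlier (Entity, Year) pair
n_dupes = sample.duplicated(subset=["Entity", "Year"]).sum()
print("Duplicate country-year rows:", n_dupes)
```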

In [6]:
df2.columns

Index(['Entity', 'Code', 'Year',
       'Government expenditure on education, total (% of government expenditure)'],
      dtype='object')

## Renaming columns 
- All the other names are simple, but the government expenditure column name is long and awkward to type, which invites mistakes. So let's rename this column
- The 'Entity' column refers to countries, so let's rename this column as well

In [7]:
df2.rename(columns={'Government expenditure on education, total (% of government expenditure)': 'Gvt_Exp_on_Education',
                    'Entity': 'Country'}, inplace=True)

Let's check the column names now

In [8]:
df2.columns

Index(['Country', 'Code', 'Year', 'Gvt_Exp_on_Education'], dtype='object')

Let's work on a copy of the dataframe so that the original stays available in case we want to go back to it

In [9]:
filtered_df = df2.copy()

Since we have already identified the countries with the 5 highest and 5 lowest literacy rates, we can filter the data down to those countries

In [10]:
countries = ['Korea', 'Latvia', 'Estonia', 'Lithuania', 'Cuba', 'Chad', 'Afghanistan', 'Mali', 'Niger', 'Guinea']
filtered_df = df2[df2["Country"].isin(countries)]
filtered_df = filtered_df.reset_index(drop=True)
filtered_df

| | Country | Code | Year | Gvt_Exp_on_Education |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2010 | 17.067560 |
| 1 | Afghanistan | AFG | 2011 | 16.048429 |
| 2 | Afghanistan | AFG | 2013 | 14.102800 |
| 3 | Afghanistan | AFG | 2014 | 14.465930 |
| 4 | Afghanistan | AFG | 2015 | 12.509000 |
| ... | ... | ... | ... | ... |
| 157 | Niger | NER | 2017 | 13.215160 |
| 158 | Niger | NER | 2018 | 16.339970 |
| 159 | Niger | NER | 2019 | 13.012810 |
| 160 | Niger | NER | 2020 | 13.332540 |


Let's check to see if our data has all of the countries that we are looking for

In [32]:
unique_countries = filtered_df['Country'].unique()
print("Unique countries:", unique_countries)

We can see that Cuba and Korea are missing from the dataset, so we'll have to add those in later
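Before adding countries by hand, it may be worth checking whether they appear in the data under a different spelling, since `isin` only matches exact names. The sketch below lists the requested names with no exact match and then searches for them as substrings. The `available` series is a stand-in for `df2['Country']`, and "South Korea" is an assumed spelling used for illustration only; the source does not establish which names the full file actually contains.

```python
import pandas as pd

# Stand-in for df2["Country"]; "South Korea" is an assumed spelling, for illustration only.
available = pd.Series(["Afghanistan", "Chad", "Niger", "South Korea"], name="Country")

wanted = ["Korea", "Cuba", "Chad", "Afghanistan", "Niger"]

# Requested names that have no exact match in the data
missing = sorted(set(wanted) - set(available))
print("Missing:", missing)

# Case-insensitive substring search for possible alternative spellings
pattern = "|".join(missing)
candidates = available[available.str.contains(pattern, case=False, regex=True)].unique()
print("Possible matches:", candidates)
```

If a candidate turns out to be the intended country, its exact spelling can replace the original entry in the `countries` list before filtering.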

### Looking at our filtered dataset

By looking at the .min() of the data we can see that the data starts from 1991

In [14]:
filtered_df['Year'].min()

1991

Since we are only looking at data from 2000 - 2021, we'll need to get rid of any rows from before 2000 (here, 1991 to 1999)

In [15]:
filtered_df = filtered_df[filtered_df['Year'] > 1999]
filtered_df = filtered_df.reset_index(drop=True)
filtered_df

| | Country | Code | Year | Gvt_Exp_on_Education | Country_Presence |
|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2010 | 17.067560 | True |
| 1 | Afghanistan | AFG | 2011 | 16.048429 | True |
| 2 | Afghanistan | AFG | 2013 | 14.102800 | True |
| 3 | Afghanistan | AFG | 2014 | 14.465930 | True |
| 4 | Afghanistan | AFG | 2015 | 12.509000 | True |
| ... | ... | ... | ... | ... | ... |
| 136 | Niger | NER | 2017 | 13.215160 | True |
| 137 | Niger | NER | 2018 | 16.339970 | True |
| 138 | Niger | NER | 2019 | 13.012810 | True |
| 139 | Niger | NER | 2020 | 13.332540 | True |


Let's check to see what year the dataset now starts from

In [16]:
filtered_df['Year'].min()

2000
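The filter above only sets a lower bound, which is enough here because the data ends in 2021. If we wanted the 2000 - 2021 window to be explicit at both ends, one option is `Series.between`. This is a sketch only: `filter_years` is a hypothetical helper, not part of the notebook, and `sample` is built from rows shown in the outputs above.

```python
import pandas as pd

def filter_years(df, start=2000, end=2021, year_col="Year"):
    """Keep rows whose year lies in [start, end] and reset the index."""
    mask = df[year_col].between(start, end)  # inclusive on both ends by default
    return df[mask].reset_index(drop=True)

# Stand-in sample: rows copied from the notebook output above, not the full data.
sample = pd.DataFrame({
    "Country": ["Afghanistan"] * 3 + ["Niger"] * 2,
    "Year": [2010, 2011, 2013, 2019, 2020],
    "Gvt_Exp_on_Education": [17.067560, 16.048429, 14.102800, 13.012810, 13.332540],
})

windowed = filter_years(sample)
print(windowed["Year"].agg(["min", "max"]))
```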

Not all of the countries in our chosen 10 have the same year range, so we'll have to see what the time span is for each country 

In [19]:
country_count = filtered_df.groupby('Country')['Year'].count()
country_count

Country
Afghanistan    11
Chad           16
Estonia        18
Guinea         20
Latvia         18
Lithuania      18
Mali           20
Niger          20
Name: Year, dtype: int64

The number of years with data ranges from 11 (Afghanistan) to 20 (Guinea, Mali and Niger). Coverage is uneven, but the range is not too wide, so we can proceed
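A count of rows does not show which years each country covers or whether there are gaps inside that range. A minimal sketch that reports the first year, last year, number of observations and number of missing years per country is given below. It assumes one row per country and year; `sample` uses rows copied from the outputs above, where Afghanistan has no row for 2012.

```python
import pandas as pd

# Stand-in sample: rows copied from the notebook output above, not the full data.
sample = pd.DataFrame({
    "Country": ["Afghanistan"] * 5 + ["Niger"] * 4,
    "Year": [2010, 2011, 2013, 2014, 2015, 2017, 2018, 2019, 2020],
})

# First year, last year and number of observations for each country
coverage = sample.groupby("Country")["Year"].agg(["min", "max", "count"])

# Years between first and last observation (inclusive), and how many of them have no row
coverage["span"] = coverage["max"] - coverage["min"] + 1
coverage["gaps"] = coverage["span"] - coverage["count"]
print(coverage)
```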

### Finding the mean of each country 

In [21]:
# group the data by country, then calculate the mean share of education in government expenditure for each country
countries_mean = filtered_df.groupby('Country')['Gvt_Exp_on_Education'].mean()
countries_mean

Country
Afghanistan    13.044702
Chad           11.933025
Estonia        13.668718
Guinea         12.861944
Latvia         14.311544
Lithuania      13.705299
Mali           16.659471
Niger          16.770930
Name: Gvt_Exp_on_Education, dtype: float64
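Because the countries do not have the same number of observations, it can help to show the mean next to the number of years behind it and to sort the result. The sketch below uses named aggregation for that. `sample` contains only the rows shown in the outputs above, so its means will not match the full results listed here; the names `summary` and `n_years` are illustrative.

```python
import pandas as pd

# Stand-in sample: rows copied from the notebook output above, not the full data.
sample = pd.DataFrame({
    "Country": ["Afghanistan"] * 5 + ["Niger"] * 4,
    "Year": [2010, 2011, 2013, 2014, 2015, 2017, 2018, 2019, 2020],
    "Gvt_Exp_on_Education": [
        17.067560, 16.048429, 14.102800, 14.465930, 12.509000,
        13.215160, 16.339970, 13.012810, 13.332540,
    ],
})

# Mean share and number of observations per country, highest mean first
summary = (
    sample.groupby("Country")["Gvt_Exp_on_Education"]
    .agg(mean="mean", n_years="count")
    .sort_values("mean", ascending=False)
    .round(2)
)
print(summary)
```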