# Education in Brazil: an overview of PISA performance

Author: Marcel Cortapasso

## 1. Introduction

The education system in Brazil has been a constant topic of discussion over the last two decades. There is a consensus among Brazilians that primary and secondary education does not reach the minimum standard needed to develop in young people the ability to read critically and to solve simple problems in math and science. As a consequence, several problems have worsened, such as the shortage of qualified labor, weak technological development, and low economic mobility.
<br>
<br>
The following study explores education indicators collected around the world and highlights Brazil's position on an international standardized test, PISA. The aim is to show how severe the situation is by comparing Brazil with other regions and country income groups. Finally, it examines which education indicators have a significant correlation with PISA performance, so that they could guide the government in allocating education resources wisely.

## 2. Data Preparation

### 2.1 Importing Python libraries

In [58]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt 
import plotly.plotly as py # Using the plotly API requires signing up; in plotly >= 4 this module moved to the chart_studio package.
import plotly.graph_objs as go
from sklearn import datasets, linear_model # To calculate the regressions
import plotly
from plotly import tools

import IPython.core.display as di

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)

The cell above imports all the Python libraries required for the data visualization and the statistics.

In [57]:
# Personal credentials, stored one per line as key=value
with open("../pyplot_credentials.txt") as cred:
    for row in cred:
        if not row.strip():
            continue  # skip blank lines
        key, _, value = row.partition("=")
        if key.strip() == "username":
            username = value.strip()
        else:
            api_key = value.strip()

# Using the plotly API requires signing up to obtain a username and an api_key.
plotly.tools.set_credentials_file(username = username, api_key = api_key)

### 2.2 The data
The data was collected from the World Bank Open Data; the dataset is named ["Education Statistics"](http://datatopics.worldbank.org/education/).
<br>

### 2.3 Reading in the data 
The data was downloaded in .csv format and loaded into pandas DataFrames.

In [56]:
# World Bank Data Base to get stats per year
data = pd.read_csv('../EdStatsData.csv')

# Data base to obtain extra details from the countries (region and income group)
countries = pd.read_csv('EdStatsCountry.csv')
data.head(3)

Following are the target indicators to be analyzed in this study:
<br>
<br>
__PISA: Mean performance on the mathematics scale__: Average score of 15-year-old students on the PISA mathematics scale. The metric for the overall mathematics scale is based on a mean for OECD countries of 500 points and a standard deviation of 100 points. Data reflects country performance in the stated year according to PISA reports
<br>
<br>
__PISA: Mean performance on the reading scale__: Average score of 15-year-old students on the PISA reading scale. The metric for the overall reading scale is based on a mean for OECD countries of 500 points and a standard deviation of 100 points. Data reflects country performance in the stated year according to PISA reports
<br>
<br>
__PISA: Mean performance on the science scale__: Average score of 15-year-old students on the PISA science scale. The metric for the overall science scale is based on a mean for OECD countries of 500 points and a standard deviation of 100 points. Data reflects country performance in the stated year according to PISA reports
<br>
<br>
__Government expenditure on education as % of GDP (%)__: Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of GDP.
<br>
<br>
__Government expenditure in primary institutions as % of GDP (%)__: Total general (local, regional and central) government expenditure in primary educational institutions (current and capital) at a given level of education, expressed as a percentage of GDP.
<br>
<br>
__Government expenditure in secondary institutions education as % of GDP (%)__: Total general (local, regional and central) government expenditure in secondary educational institutions (current and capital) at a given level of education, expressed as a percentage of GDP. 
<br>
<br>
__Pupil-teacher ratio in primary education (headcount basis)__: Average number of pupils per teacher at a given level of education, based on headcounts of both pupils and teachers. Divide the total number of pupils enrolled at the specified level of education by the number of teachers at the same level. In computing and interpreting this indicator, one should take into account the existence of part-time teaching, school-shifts, multi-grade classes and other practices that may affect the precision and meaningfulness of pupil-teacher ratios.
<br>
<br>
__Pupil-teacher ratio in secondary education (headcount basis)__: Average number of pupils per teacher at a given level of education, based on headcounts of both pupils and teachers. Divide the total number of pupils enrolled at the specified level of education by the number of teachers at the same level. In computing and interpreting this indicator, one should take into account the existence of part-time teaching, school-shifts, multi-grade classes and other practices that may affect the precision and meaningfulness of pupil-teacher ratios.
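
Because the PISA scales are defined around an OECD mean of 500 points and a standard deviation of 100 points, a score can be read as a distance from that reference mean. The sketch below is only an illustration of that reading; the function name `standardized_distance` is hypothetical, and the input is Japan's 2015 mathematics score as it appears in the results table of Section 2.4.

```python
OECD_MEAN = 500.0  # reference mean of the PISA scale, per the indicator definition
OECD_SD = 100.0    # reference standard deviation of the PISA scale


def standardized_distance(score, mean=OECD_MEAN, sd=OECD_SD):
    """Return how many reference standard deviations a score lies from the OECD mean."""
    return (score - mean) / sd


# Japan's 2015 mathematics score, taken from the results table of this study
japan_math = 532.4399
japan_distance = standardized_distance(japan_math)
print(round(japan_distance, 3))  # 0.324
```

A positive value means the score is above the reference mean and a negative value means it is below.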

### 2.4 Reshaping the data
The data must be reshaped so that the variables (the indicators listed above) are placed in columns and only the target years are kept.
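
As a minimal sketch of this reshaping step, the example below pivots a tiny long-format table into one column per indicator. The frame `long_df`, the country names "A" and "B", and all scores are hypothetical; only the column labels mirror the ones used in this study.

```python
import pandas as pd

# Hypothetical long-format sample: one row per country and indicator, one column per year
long_df = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Indicator Name": ["Performance on the Reading", "Performance on the Science"] * 2,
    "2015": [500.0, 510.0, 400.0, 420.0],
})

# Move each indicator into its own column, keeping one row per country
wide_df = pd.pivot_table(long_df,
                         values="2015",
                         index="Country Name",
                         columns="Indicator Name").reset_index()
print(wide_df)
```

Note that `pd.pivot_table` aggregates with the mean by default, so duplicated country and indicator pairs would be averaged rather than raising an error.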

In [33]:
# List to select the columns to be used
columns = ["Country Name", 
           "Country Code", 
           "Indicator Name", 
           "Indicator Code",
           "2008",
           "2009",
           "2010",
           "2011", 
           "2012", 
           "2013", 
           "2014", 
           "2015", 
           "2016", 
           "2017"]

# Indicators 
indicators = {
            "Government expenditure in primary institutions as % of GDP (%)":"Expenditure in primary institutions",
            "Government expenditure in secondary institutions education as % of GDP (%)": "Expenditure in secondary institutions",
            "Government expenditure on education as % of GDP (%)":"Expenditure on education",
            "PISA: Mean performance on the mathematics scale": "Performance on the Mathematics",
            "PISA: Mean performance on the reading scale": "Performance on the Reading",
            "PISA: Mean performance on the science scale": "Performance on the Science",
            "Pupil-teacher ratio in primary education (headcount basis)" : "Pupil-teacher ratio in primary education",
            "Pupil-teacher ratio in secondary education (headcount basis)": "Pupil-teacher ratio in secondary education",
             }

data.reset_index(drop = True, inplace = True)

# Selecting the columns and rows according to the list and dict above.
data = data.loc[data["Indicator Name"].isin(list(indicators.keys())),columns]

# Merging the 2 data base to get region and income group of each country
data = pd.merge(data, countries.loc[:,["Country Code","Region","Income Group" ]], how = "left", on = "Country Code")
data = data[data["Income Group"].notnull()]

# Replace the names to be concise
data["Indicator Name"] = data["Indicator Name"].map(indicators)

# List to produce the pivot table easily.
columnsPISA = ["Region",
               "Income Group",
               "Country Code",
               "Country Name",
               "Performance on the Mathematics",
               "Performance on the Reading", 
               "Performance on the Science"]

# List to produce the pivot table easily.
columns_set = ["Region",
              "Income Group",
              "Country Name", 
              "Country Code"]

# The pivot table with PISA results. The year 2015 was selected because it was the last exam.
mapPISA = pd.pivot_table(data, values = "2015",  
                         columns = ["Indicator Name"],
                         index = columns_set).reset_index(level = columns_set)

# It was used as PISA performance the average of the 3 results: Math, Science and Reading.
mapPISA["PISA Performance"] = (mapPISA["Performance on the Mathematics"] 
                               + mapPISA["Performance on the Science"] 
                               + mapPISA["Performance on the Reading"])/3

# Columns not needed for the PISA map (each label listed once)
columns_drop = ["Expenditure in secondary institutions",
                "Expenditure on education",
                "Pupil-teacher ratio in primary education",
                "Pupil-teacher ratio in secondary education",
               ]

mapPISA.drop(columns_drop, axis = 1, inplace = True)
mapPISA.head()

| | Region | Income Group | Country Name | Country Code | Performance on the Mathematics | Performance on the Reading | Performance on the Science | PISA Performance |
|---|---|---|---|---|---|---|---|---|
| 0 | East Asia & Pacific | High income: OECD | Australia | AUS | 493.8962 | 502.9006 | 509.9939 | 502.263567 |
| 1 | East Asia & Pacific | High income: OECD | Japan | JPN | 532.4399 | 515.9585 | 538.3948 | 528.931067 |
| 2 | East Asia & Pacific | High income: OECD | Korea, Rep. | KOR | 524.1062 | 517.4367 | 515.8099 | 519.1176 |
| 3 | East Asia & Pacific | High income: OECD | New Zealand | NZL | 495.2233 | 509.2707 | 513.3035 | 505.9325 |
| 4 | East Asia & Pacific | High income: nonOECD | Brunei Darussalam | BRN | NaN | NaN | NaN | NaN |


Table with all PISA scores plus the average performance in 2015.

In [34]:
# Dimensions to produce the radar chart
theta = ["Performance on the Mathematics",
        "Performance on the Reading", 
        "Performance on the Science"]

column_countries = data["Country Name"]

# Data from the last records of teacher-pupil ratio 2012
pivot_2012 = pd.pivot_table(data, 
                            values = ["2012"],  
                            columns = data["Indicator Name"],
                            index = column_countries)
pivot_2012 = pd.DataFrame(pivot_2012)

pivot_2012.columns = pivot_2012.columns.droplevel()
pivot_2012.drop(columns = theta, inplace = True)

# Merging to PISA results table
pivot_2012_and_PISA2015 = pd.merge(pivot_2012, mapPISA, how = "right", on = "Country Name")


# Average expenditure from 2008 to 2012

data["2008-2012"] = (data["2008"] + data["2009"] + data["2010"] + data["2011"] + data["2012"]) / 5
mean_var = ["Expenditure on education",
            "Expenditure in primary institutions",
            "Expenditure in secondary institutions",
            ]

pivot_2008_2012 = pd.pivot_table(data, 
                                 values = ["2008-2012"],
                                 columns = data.loc[data["Indicator Name"].isin(mean_var),
                                                   "Indicator Name"],
                                 index = column_countries)

pivot_2008_2012 = pd.DataFrame(pivot_2008_2012)
pivot_2008_2012.columns = pivot_2008_2012.columns.droplevel()
pivot_2008_2012.reset_index(inplace = True)

pivot_2012_and_PISA2015.drop(mean_var, axis = 1, inplace = True) # removing expenditure columns


# Consolidate the 2 data base. 
data_edu = pd.merge(pivot_2012_and_PISA2015, pivot_2008_2012, how = "left", on = "Country Name")
data_edu.reset_index(inplace = True)
data_edu.rename(columns = {"Country Name": "Country"}, inplace = True)
data_edu.drop(columns = ["index"], inplace = True)
columns_order = ["PISA Performance",
                 "Expenditure on education",
                 "Expenditure in primary institutions",
                 "Expenditure in secondary institutions",
                 "Pupil-teacher ratio in primary education",
                 "Pupil-teacher ratio in secondary education",
                ]
data_edu.head()

| | Country | Pupil-teacher ratio in primary education | Pupil-teacher ratio in secondary education | Region | Income Group | Country Code | Performance on the Mathematics | Performance on the Reading | Performance on the Science | PISA Performance | Expenditure in primary institutions | Expenditure in secondary institutions | Expenditure on education |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 44.677429 | NaN | South Asia | Low income | AFG | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Albania | 19.482981 | 14.89293 | Europe & Central Asia | Upper middle income | ALB | 413.157 | 405.2588 | 427.225 | 415.2136 | NaN | NaN | NaN |
| 2 | Algeria | 23.15942 | NaN | Middle East & North Africa | Upper middle income | DZA | 359.6062 | 349.8593 | 375.7451 | 361.736867 | NaN | NaN | NaN |
| 3 | Andorra | 9.58237 | NaN | Europe & Central Asia | High income: nonOECD | AND | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Antigua and Barbuda | 13.84503 | 11.59474 | Latin America & Caribbean | High income: nonOECD | ATG | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


### 2.5 Handling missing data
The analysis targets only the countries that took the exam in 2015, so all others are out of scope and are dropped.

In [35]:
# Dropping the null records.                                                               
mapPISA = mapPISA.loc[mapPISA["PISA Performance"].notnull(), :] # base to radar map
data_edu = data_edu.loc[data_edu["PISA Performance"].notnull(),:] # base to the correlations

## 3. Exploratory Data Analysis (EDA)

This section provides insights into patterns and trends in Brazil's PISA performance and educational indicators.
<br>
<br>
Before exploring the data, it is necessary to explain what PISA is, since this exam is the basis of the whole analysis. The Programme for International Student Assessment (PISA) was designed to evaluate the scholastic performance of 15-year-old students in mathematics, science, and reading, in both member and non-member countries of the Organisation for Economic Co-operation and Development (OECD). The programme provides comparative data across nations and enables countries to analyze education policies and outcomes.
<br>
<br>
All the following analyses use the PISA scores and their relationship to other educational indicators.
<br>
<br>
The map depicts the overall performance (the average of the mathematics, science, and reading scores) of each country in 2015.

In [36]:
# Parameter to plot the graph
graphPISA = [ dict(
        type = 'choropleth',
        locations = mapPISA['Country Code'],
        z = mapPISA['PISA Performance'],
        text = mapPISA['Country Name'],
        colorscale = [[0.3,"rgb(7, 106, 4)"],[0.8,"rgb(226, 0, 15)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            tickprefix = '',
            title = 'PISA Performance'),
      ) ]

# Layout parameter
layout = dict(
    title = 'PISA Performance',
    autosize = True,
    hovermode='closest',
    geo = dict(
        showframe = False,
        showcoastlines = True,
        projection = dict(
            type = 'Mercator'
        ),
        
    ),
    height = 500, 
    width = 900,
)


fig = dict( data = graphPISA, layout = layout)
py.iplot( fig, validate = False, filename = 'd3-world-map')

[Figure: choropleth world map of PISA Performance in 2015. Interactive version: https://plot.ly/~macortapasso/0 (plot name 'd3-world-map').]


The map shows that most northern-hemisphere countries scored above 450, while no country in Latin America scored more than 450. Looking closely at South America, Brazil is still at a disadvantage in terms of performance.
<br>
<br>
Overall, Brazil ranked 63rd out of the 71 countries that took the exam in 2015.

In [48]:
# Brazil PISA results
Brazil_PISA = mapPISA.loc[mapPISA["Country Name"] == "Brazil", theta].values[0].tolist()

# Store the other region chart objects
radarData = []

# By Region
PISA_by_region = mapPISA.groupby(["Region"])[theta].mean().reset_index()
region_list = mapPISA.Region.unique().tolist()

for region in region_list:
    radarData.append(
        go.Scatterpolar(
      r = PISA_by_region.loc[PISA_by_region["Region"] == region, 
                                        theta].values[0].tolist(),
      theta = theta,
      name = region,
      subplot = "polar",)
    )

# Brazilian radar chart
brazilRadar = go.Scatterpolar(
      r = Brazil_PISA,
      theta = theta,
      fill = 'toself',
      name = 'Brazil',
      subplot = 'polar',
    )

# Add Brazil radar chart to the other regions
radarData.append(brazilRadar)


# Layout
layout = go.Layout(
    title = 'Performance by Region',
    height = 500, 
    width = 900,
    polar = dict(
      domain = dict(
        x = [0, 1],
        y = [0, 1]
      ),
        radialaxis = dict(
            visible = True,
            range = [320,550]
        ), 
    ), 
)

fig = go.Figure(data = radarData, layout = layout)
py.iplot(fig, filename = "Radar by Region")

The performance per subject and region shows that Brazil is below average even when compared with all the other regions.

In [49]:
### By Income
PISA_by_income = mapPISA.groupby(["Income Group"])[theta].mean().reset_index()
income_list = mapPISA["Income Group"].unique().tolist()

radarData = []
for income_group in income_list:
    radarData.append(
        go.Scatterpolar(
      r = PISA_by_income.loc[PISA_by_income["Income Group"] == income_group, 
                                        theta].values[0].tolist(),
      theta = theta,
      name = income_group,
      )
    )

# Add Brazil radar chart to the other income groups
radarData.append(brazilRadar)

# Layout
layout = go.Layout(
    title = 'Performance by Income Group',
    height = 500, 
    width = 900,
    polar = dict(
      domain = dict(
        x = [0, 1],
        y = [0, 1]
      ),
        radialaxis = dict(
            visible = True,
            range = [320,550]
        ), 
    ), 
)

fig = go.Figure(data = radarData, layout = layout)
py.iplot(fig, filename = "Radar by Income")

According to the World Bank income classification used in the dataset, Brazil belongs to the upper middle-income group. This chart also shows that Brazil is below the average PISA performance of all income groups.
<br>
<br>
To analyze the factors that might affect PISA performance, five education indicators were selected and compared against PISA performance to check for statistical correlations. The five chosen indicators are:
<br>
1. __Expenditure on education__: Government expenditure on education as % of GDP (%), mean from 2008 to 2012.
<br>
2. __Expenditure in primary institutions__: Government expenditure in primary institutions as % of GDP (%), mean from 2008 to 2012.
<br>
3. __Expenditure in secondary institutions__: Government expenditure in secondary institutions education as % of GDP (%), mean from 2008 to 2012.
<br>
4. __Pupil-teacher ratio in primary education__: Pupil-teacher ratio in primary education (headcount basis) from 2012.
<br>
5. __Pupil-teacher ratio in secondary education__: Pupil-teacher ratio in secondary education (headcount basis) from 2012.
<br>

Although the PISA results are from 2015, the underlying data for most countries is not available from 2013 onward. The year 2012 had more records for the countries that also participated in PISA 2015, so it was chosen for the analysis.
<br>
<br>
Since expenditure on education does not produce short-term outcomes, the mean of each expenditure indicator from 2008 to 2012 was used.
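
The five-year mean is computed during data preparation by adding the yearly columns and dividing by five. The sketch below, with hypothetical values (the frame `toy` and its numbers are made up for illustration), shows a property of that form worth keeping in mind: a country missing any single year gets a missing mean, whereas `DataFrame.mean(axis=1)` averages the years that are available.

```python
import numpy as np
import pandas as pd

years = ["2008", "2009", "2010", "2011", "2012"]

# Hypothetical expenditure values (% of GDP); country "B" has no record for 2010
toy = pd.DataFrame(
    [[5.0, 5.2, 5.4, 5.6, 5.8],
     [4.0, 4.0, np.nan, 4.0, 4.0]],
    index=["A", "B"], columns=years)

# Add-and-divide form: a single missing year makes the whole mean NaN
strict_mean = (toy["2008"] + toy["2009"] + toy["2010"] + toy["2011"] + toy["2012"]) / 5

# Alternative: average only over the years that are actually available
available_mean = toy[years].mean(axis=1)

print(pd.DataFrame({"strict": strict_mean, "available": available_mean}))
```

Which behavior is preferable depends on the analysis: the strict form keeps only countries with complete records, while the second form keeps more countries at the cost of averaging over fewer years for some of them.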
<br>
<br>
The table below shows the descriptive statistics of the dependent variable (PISA Performance) and the independent variables.

In [39]:
data_edu[columns_order].describe()

| | PISA Performance | Expenditure on education | Expenditure in primary institutions | Expenditure in secondary institutions | Pupil-teacher ratio in primary education | Pupil-teacher ratio in secondary education |
|---|---|---|---|---|---|---|
| count | 71.0 | 43.0 | 36.0 | 33.0 | 56.0 | 43.0 |
| mean | 461.059427 | 5.107699 | 1.319591 | 1.778603 | 15.453741 | 12.766935 |
| std | 50.781211 | 1.390371 | 0.476773 | 0.579043 | 4.32525 | 4.593749 |
| min | 339.026167 | 1.856002 | 0.596452 | 0.628248 | 8.37204 | 7.80615 |
| 25% | 420.0038 | 4.140779 | 0.95884 | 1.461286 | 11.671885 | 9.42649 |
| 50% | 475.399633 | 5.152044 | 1.35157 | 1.831222 | 15.045085 | 11.67616 |
| 75% | 503.185 | 5.582716 | 1.601966 | 2.024286 | 18.214748 | 14.778325 |
| max | 551.621533 | 8.753572 | 2.54575 | 3.14942 | 28.016359 | 29.17577 |


The sample comprises 71 countries. The descriptive statistics show the dispersion of each variable and make it possible to place Brazil within the quartiles. Brazil is in the first quartile for PISA Performance and in the fourth quartile for all the other variables. Importantly, the pupil-teacher ratio is the number of pupils per teacher, so a lower ratio is more desirable. In other words, Brazil is among the 25% of countries that spent the most on education relative to GDP; however, it has more pupils per teacher than 75% of the countries in the analysis.
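
The quartile placement can be checked by hand against the cut points in the table. The sketch below is illustrative only: the helper `quartile_of` is hypothetical, the cut points are copied from the 25%, 50% and 75% rows above, and Brazil's values are the three figures quoted elsewhere in this study (1.71% and 2.45% of GDP, and 20.51 pupils per teacher).

```python
def quartile_of(value, q1, q2, q3):
    """Return the quartile (1 to 4) a value falls into, given the three cut points."""
    if value <= q1:
        return 1
    if value <= q2:
        return 2
    if value <= q3:
        return 3
    return 4


# Cut points (25%, 50%, 75%) copied from the descriptive statistics table
cut_points = {
    "Expenditure in primary institutions": (0.95884, 1.35157, 1.601966),
    "Expenditure in secondary institutions": (1.461286, 1.831222, 2.024286),
    "Pupil-teacher ratio in primary education": (11.671885, 15.045085, 18.214748),
}

# Brazil's values as quoted in the text of this study
brazil = {
    "Expenditure in primary institutions": 1.71,
    "Expenditure in secondary institutions": 2.45,
    "Pupil-teacher ratio in primary education": 20.51,
}

brazil_quartiles = {name: quartile_of(brazil[name], *cut_points[name]) for name in brazil}
print(brazil_quartiles)
```

All three values fall above the 75% cut point, which is consistent with the fourth-quartile placement described in the text.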
<br>
<br>
There is a misconception in Brazil that not enough money is invested in education. The descriptive statistics show that this is not the whole truth, and the following chart shows that Brazil spent about the same share of GDP as most other countries. Indeed, Brazil spent more than Germany, the Netherlands, and Australia, nations that achieved high scores in PISA 2015.
<br>
<br>
Moreover, even while spending less, Colombia, Chile, and Thailand attained slightly better scores than Brazil.

In [40]:
x_name = "Expenditure on education"
y_name = "PISA Performance"

general_exp = go.Scatter(
    x = data_edu[x_name],
    y = data_edu[y_name],
    mode='markers+text',
    text = data_edu["Country"],
    marker= dict(size= 14,
                    line= dict(width=1),
                    color= np.random.randn(500),
                    opacity= 0.9
                   ),
    textfont = dict( size = 9),
    textposition = 'bottom center'
)

layout = go.Layout(showlegend = False,
                   title = 'Expenditure on Education - 2008 - 2012',
                   xaxis= dict(
                                title = "Expenditure on education %GDP"
                               ),
                    yaxis=dict(
                                title = 'PISA Performance - 2015',
                               ),
                   height = 500, 
                   width = 900,
                  )

fig = go.Figure(data = [general_exp], layout = layout)

py.iplot(fig, filename = "Expenditure on Education")

In general, it is not possible to identify a clear relationship between expenditure on education and PISA scores. Most countries are concentrated between 4% and 6% of GDP; however, the PISA scores in this range also vary significantly.

In [51]:
x1_name = "Expenditure in primary institutions"
x2_name = "Expenditure in secondary institutions"

pri_exp = go.Scatter(
    x = data_edu[x1_name],
    y = data_edu[y_name],
    mode='markers+text',
    text = data_edu["Country"],
    marker= dict(size= 8,
                    line= dict(width=1),
                    color= np.random.randn(500),
                    opacity= 0.9
                   ),
    textfont = dict( size = 6),
    textposition = 'bottom center'
)
sec_exp = go.Scatter(
    x = data_edu[x2_name],
    y = data_edu[y_name],
    mode='markers+text',
    text = data_edu["Country"],
     marker= dict(size= 8,
                    line= dict(width=1),
                    color= np.random.randn(500),
                    opacity= 0.9
                   ),
    textfont = dict( size = 6),
    textposition='bottom center'
)

fig = tools.make_subplots(rows = 1, 
                          cols = 2,
                         subplot_titles = ('Primary Education', 'Secondary Education'))

fig.append_trace(pri_exp, 1, 1)
fig.append_trace(sec_exp, 1, 2)

fig['layout']['xaxis1'].update(title = "Expenditure in primary education %GDP")
fig['layout']['xaxis2'].update(title = 'Expenditure in secondary education %GDP')

fig['layout']['yaxis1'].update(title = 'PISA Performance - 2015')
fig['layout']['yaxis2'].update(title = 'PISA Performance - 2015')

fig['layout'].update(height = 500, 
                     width = 900, 
                     title='Expenditure in Primary & Secondary 2008-2012',
                     showlegend = False)

py.iplot(fig, filename='Expenditure in Primary and Secondary')




As in the previous chart, there is no clear correlation between expenditure in primary or secondary education and the PISA scores.
<br>
<br>
Notably, Brazil spent more than the median: 1.71% and 2.45% of GDP in primary and secondary education, respectively.

In [42]:
def exclude_null(x_name, y_name, df):
    ''' Exclude the rows in which either of the 2 columns is null.
    Parameters
    ----------
    x_name, y_name : string with column name
    df: DataFrame in pandas
    Returns
    -------
    outcome: new DataFrame with the 2 columns and no null values.

    '''
    outcome = df.loc[(df[x_name].notnull() & df[y_name].notnull()), [x_name, y_name]]
    return outcome

In [52]:
x_name = "Pupil-teacher ratio in primary education"
y_name = "PISA Performance"


primary = go.Scatter(
    x = data_edu[x_name],
    y = data_edu[y_name],
    mode='markers+text',
    text = data_edu["Country"],
    marker= dict(size= 14,
                    line= dict(width=1),
                    color= np.random.randn(500),
                    opacity= 0.9
                   ),
    textfont = dict( size = 9),
    textposition = 'bottom center'
)


regr_dbase = exclude_null(x_name, y_name, data_edu)

#   Creating linear regression object
regr = linear_model.LinearRegression()
regr.fit(regr_dbase[x_name].values.reshape(-1, 1), regr_dbase[y_name].values.reshape(-1, 1))


regr_chart = go.Scatter(x = regr_dbase[x_name].values.reshape(-1, 1), 
                        y = regr.predict(regr_dbase[x_name].values.reshape(-1, 1)), 
                        mode = 'lines', 
                        line = dict(color = 'grey', 
                                    width = 2
                                   )
                       )


layout = go.Layout(showlegend = False,
                   title = 'Primary Education - Pupil-teacher Ratio',
                   xaxis= dict(
                                title= "Pupil-teacher ratio in primary education"
                               ),
                    yaxis=dict(
                                title= 'PISA Performance - 2015',
                               ),
                   height = 500, 
                   width = 900,
                  )

fig = go.Figure(data = [primary, regr_chart], layout = layout)

py.iplot(fig, filename = "Pupil-teacher ratio in primary education")

Although Brazil spent a reasonable amount on education, its pupil-teacher ratio remains high: 20.51 pupils per teacher in 2012. This suggests that Brazil has more students per teacher in primary school than most other nations.
<br>
<br>
This chart also shows a negative relationship between the pupil-teacher ratio and PISA performance, since the countries with lower ratios tended to achieve higher scores.
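
The grey line in the chart is an ordinary least-squares fit, and for a single predictor its slope can be computed in closed form. The sketch below uses hypothetical (ratio, score) pairs, not values from the dataset, and a hypothetical helper `least_squares_line`, simply to show how a negative slope arises when higher ratios go with lower scores.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (pupil-teacher ratio, PISA score) pairs, for illustration only
ratios = [10.0, 12.0, 15.0, 18.0, 20.0]
scores = [520.0, 500.0, 480.0, 450.0, 430.0]

slope, intercept = least_squares_line(ratios, scores)
print(round(slope, 2))  # -8.82
```

In this toy sample, each additional pupil per teacher is associated with a score about 8.8 points lower; the slope describes an association in the sample and does not by itself establish cause.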

In [53]:
y_name = "PISA Performance"
x_name = "Pupil-teacher ratio in secondary education"

secundary = go.Scatter(
    x = data_edu[x_name],
    y = data_edu[y_name],
    mode='markers+text',
    text = data_edu["Country"],
    marker= dict(size= 14,
                    line= dict(width=1),
                    color= np.random.randn(500),
                    opacity= 0.9
                   ),
    textfont = dict( size = 9),
    textposition = 'bottom center'
)


regr_dbase = exclude_null(x_name, y_name, data_edu)

#   Create linear regression object
regr = linear_model.LinearRegression()
regr.fit(regr_dbase[x_name].values.reshape(-1, 1), regr_dbase[y_name].values.reshape(-1, 1))


regr_chart = go.Scatter(x = regr_dbase[x_name].values.reshape(-1, 1), 
                        y = regr.predict(regr_dbase[x_name].values.reshape(-1, 1)), 
                        mode = 'lines', 
                        line = dict(color = 'grey', 
                                    width = 2
                                   )
                       )

layout = go.Layout(showlegend = False,
                   title = 'Secondary Education - Pupil-teacher Ratio',
                   xaxis= dict(
                                title = "Pupil-teacher ratio in secondary education"
                               ),
                    yaxis=dict(
                                title = 'PISA Performance - 2015' ,
                               ),
                   height = 500, 
                   width = 900,
                  )

fig = go.Figure(data = [secundary, regr_chart], layout = layout)

py.iplot(fig, filename = "Pupil-teacher ratio in secondary education")

The same tendency is also observed in secondary education: the lower the pupil-teacher ratio, the higher the performance in PISA.
<br>
<br>
Brazil has one of the highest pupil-teacher ratios in both indicators, which might contribute to lowering its scores in the test.

## 4. Inferential Statistics
<br>
In this section, hypothesis tests are carried out to verify whether the correlations portrayed previously are statistically significant. The five hypotheses are as follows:
<br>
<br>
Hypothesis Testing 1: Expenditure on education
<br>
$H_0$: Expenditure on education does not increase PISA performance
<br>
$H_1$: Expenditure on education does increase PISA performance
<br>
<br>
Hypothesis Testing 2: Expenditure in primary institutions
<br>
$H_0$: Expenditure in primary institutions does not increase PISA performance
<br>
$H_1$: Expenditure in primary institutions does increase PISA performance
<br>
<br>
Hypothesis Testing 3: Expenditure in secondary institutions
<br>
$H_0$: Expenditure in secondary institutions does not increase PISA performance
<br>
$H_1$: Expenditure in secondary institutions does increase PISA performance
<br>
<br>
Hypothesis Testing 4: Pupil-teacher ratio in primary education
<br>
$H_0$: Low pupil-teacher ratio in primary education does not increase PISA performance
<br>
$H_1$: Low pupil-teacher ratio in primary education does increase PISA performance
<br>
<br>
Hypothesis Testing 5: Pupil-teacher ratio in secondary education
<br>
$H_0$: Low pupil-teacher ratio in secondary education does not increase PISA performance
<br>
$H_1$: Low pupil-teacher ratio in secondary education does increase PISA performance
<br>
<br>
Pearson's r is used to measure the correlation, with a 95% confidence interval ($\alpha$ = 5%).

$$r = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}
{\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \overline{y})^2}}$$
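
As a minimal sketch, Pearson's r can be computed directly from its definition: the sum of cross-deviations divided by the square root of the product of the two sums of squared deviations. The function `pearson_r` and the sample lists below are hypothetical and serve only to check the two extreme cases.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of the products of the deviations from each mean
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squares
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical samples: a perfectly increasing and a perfectly decreasing relation
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

r_perfect = pearson_r(xs, ys)
r_inverse = pearson_r(xs, [-y for y in ys])
print(r_perfect, r_inverse)
```

The study itself relies on `scipy.stats.pearsonr`, which also returns the P-value; the hand-written version is only meant to make the formula concrete.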

In [45]:
def pearsonr_ci(x, y, alpha=0.05):
    # Collected from GitHubGist //gist.github.com/zhiyzuo/d38159a7c48b575af3e3de7501462e04
    # Explanation CI for Pearson's R: 
    # http://faculty.washington.edu/gloftus/P317-318/Useful_Information/r_to_z/PearsonrCIs.pdf
    ''' calculate Pearson correlation along with the confidence interval using scipy and numpy
    Parameters
    ----------
    x, y : iterable object such as a list or np.array
      Input for correlation calculation
    alpha : float
      Significance level. 0.05 by default
    Returns
    -------
    r : float
      Pearson's correlation coefficient
    pval : float
      The corresponding p value
    lo, hi : float
      The lower and upper bound of confidence intervals

    '''

    r, p = stats.pearsonr(x,y)
    r_z = np.arctanh(r)
    se = 1 / np.sqrt(x.size-3)
    z = stats.norm.ppf(1-alpha/2)
    lo_z, hi_z = r_z - z * se, r_z + z * se
    lo, hi = np.tanh((lo_z, hi_z))
    
    return r, p, lo, hi

In [46]:
y_var = "PISA Performance"
x_var = data_edu.loc[:, columns_order]
x_var = x_var.loc[:, x_var.columns != "PISA Performance"]

outcome = pd.DataFrame(index = x_var.columns)
for x in x_var:
    db_statis = exclude_null(x, y_var, data_edu)
    records, _ = db_statis.shape
    r, p, lo, hi = pearsonr_ci(db_statis[x], db_statis[y_var])
    outcome.loc[x, "Pearson's R"] = r
    outcome.loc[x, "P-value"] = p
    outcome.loc[x, "95% CI Lower"] = lo
    outcome.loc[x, "95% CI Upper"] = hi
    outcome.loc[x, "N"] = records
outcome

| Indicator Name | Pearson's R | P-value | 95% CI Lower | 95% CI Upper | N |
|---|---|---|---|---|---|
| Expenditure on education | 0.215368 | 0.165446 | -0.090853 | 0.48438 | 43.0 |
| Expenditure in primary institutions | -0.17935 | 0.295266 | -0.479625 | 0.158527 | 36.0 |
| Expenditure in secondary institutions | 0.331176 | 0.059751 | -0.013689 | 0.605628 | 33.0 |
| Pupil-teacher ratio in primary education | -0.465964 | 0.000295 | -0.649322 | -0.231412 | 56.0 |
| Pupil-teacher ratio in secondary education | -0.497722 | 0.000684 | -0.69428 | -0.23207 | 43.0 |


The P-values for expenditure on education, in primary institutions, and in secondary institutions are all above 0.05, so the null hypothesis cannot be rejected; therefore, there is no evidence that expenditure is correlated with performance in PISA.
<br>
<br>
On the other hand, the pupil-teacher ratios in primary and secondary education show a significant correlation with PISA scores. The null hypotheses for both indicators are rejected, so there is evidence that these indicators are associated with performance.
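
The confidence interval reported for the primary pupil-teacher ratio can be reproduced by hand with the Fisher z-transformation, the same approach used by `pearsonr_ci`. The sketch below uses only the standard library; the function name `fisher_ci` is hypothetical, and the inputs (r = -0.465964, N = 56) are taken from the results table.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, alpha=0.05):
    """Confidence interval for Pearson's r via the Fisher z-transformation."""
    z_r = math.atanh(r)                               # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)                       # standard error on the z scale
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # two-sided critical value
    lo_z, hi_z = z_r - z_crit * se, z_r + z_crit * se
    return math.tanh(lo_z), math.tanh(hi_z)           # transform back to the r scale


# Pearson's r and sample size for the primary pupil-teacher ratio, from the table
lo, hi = fisher_ci(-0.465964, 56)
print(round(lo, 3), round(hi, 3))  # -0.649 -0.231
```

The interval lies entirely below zero, which agrees with the small P-value reported for this indicator.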

## 5. Conclusion

The exploratory data analysis contrasts Brazil's PISA results with those of other regions and country income groups. From both angles, Brazil lags behind its peers. Next, the study analyzed how expenditure on education, expenditure in primary institutions, expenditure in secondary institutions, and the pupil-teacher ratios in primary and secondary education relate to PISA performance. The scatter charts provided insights to identify patterns and to define the hypotheses. The null hypotheses for the pupil-teacher ratio in primary and secondary education were rejected, which indicates that these two independent variables correlate with PISA scores. For the expenditure variables, no significant relationship with PISA performance was found, so there is no evidence that increasing expenditure improves the outcomes in PISA.
<br>
<br>
Overall, although Brazil spent a reasonable proportion of GDP on education between 2008 and 2012, its schools did not seem to improve, even compared with other countries that spent similar or lower amounts. In addition, the nation still ranks among the last positions in PISA performance. The hypothesis tests indicate that further investigation could be carried out to identify how reducing the pupil-teacher ratio would boost education in Brazil, since these indicators appear to be critical to attaining higher scores in PISA.