# <font color = 'green'>What explains English children's academic progress between the ages of 11 and 16, relative to their peers?</font>

### <font color='grey'>*Final Project for Data Science Immersive Course, General Assembly London*</font>


Jack Tyler-Whittle<br />
September 2019<br />
jackwhittle@gmail.com

<font color='red'>**Note:** *you should be able to click on the arrows on the left-hand side of this document to expand and contract the contents of each section.  If you are unable to do this, you may want to head to https://ndres.me/post/best-jupyter-notebook-extensions/ and install the Jupyter extensions there.  You don't need them to read this document, but they will help.*</font>

This project attempts to understand what drives the progress (or lack of it) that children make in England's secondary schools between the ages of 11 and 16, by using performance data on a progress measure called Progress 8 that is available for each school.  The analysis has been done using regression modeling, and the main metric for evaluation has been the coefficient of determination, or R2.

This is a technical report.  You can read a presentation of the results [here](link to ppt)

## Index

  <a href='#exec_sum'>Executive Summary</a><br />
1)  <a href='#background'>Why is this analysis important?</a><br />
2)  <a href='#eda'>Exploratory Data Analysis (EDA) and Feature Engineering</a><br />
3)  <a href='#models'>Modeling of the data</a><br />
4)  <a href='#findings'>Findings from the analysis and modeling</a><br />
5)  <a href='#risks'>Key risks, limitations and assumptions</a><br />
6)  <a href='#recommendations'>Recommendations</a><br />
7)  <a href='#next_steps'>Next Steps</a><br />

<br />


In [1]:
# import the libraries and modules that will be needed

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import read_csv
import csv
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline


<a id='exec_sum'></a>

# <font color='blue'>Executive Summary</font>

## <font color='green'>Goal: to explain what accounts for English children's academic progress in key subjects</font>

Formal academic assessment of children in England takes place at key points ('Key Stages') during their time at school.  Until 2016, schools were judged by the proportion of their children who received Grades A-C in English and Maths.  But this didn't take account of children's starting points, so the UK Government sought to define a measure to assess how much value a school added to a child's education.  That measure is called Progress 8, and since 2016 it has been the key measure for assessing the academic performance of schools.

It measures how far children have progressed academically between 11 and 16 years old in up to 8 key subjects relative to their academic peers when they were 11, and it is designed to encourage good quality teaching across a broad curriculum.  A score of +1 indicates that a school's pupils are achieving on average one grade higher at GCSE in those 8 subjects than their counterparts who scored similarly to them when they were 11.  Similarly, a score of -1 indicates performance one grade below their counterparts. 

(You can see a brief 3 minute video from the Department for Education explaining the measure [here](https://www.youtube.com/watch?v=4IAEgFMSGDY), or read a [comprehensive description](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/561021/Progress_8_and_Attainment_8_how_measures_are_calculated.pdf) of how it is calculated.)


***Why is this important?***

Over the past 10 years there has been reduced funding for schools.  Recently, the government announced a £7.1bn [increase in spending on education over the next three years](https://www.bbc.co.uk/news/education-49515002).  Knowing which areas to invest this money in is crucial to ensure the best possible benefit for children, especially since GCSE results are a key factor in predicting University grades.


Whilst academic results are by no means the sole factor in a child's development, with experience in sports and arts activities and overall health being three other crucial factors of many, they are nevertheless important to future achievement in adult life.

## <font color='green'>Key findings and recommendations</font>

There are five key factors that explain the progress made by children in GCSEs taken in 2018.  These factors came up in all the different analyses and eight models that were used (more details on them below), and explained approximately half (50%) of the variation in the progress that schools helped children to make between 11 and 16 years old.

The best performing model was Linear Regression with Lasso Regularisation, which had a mean R2 score of 0.51 on a 5-fold cross-validated training set.  Decision Tree Regression and its variants (the Ensemble methods of Bagging, Random Forest and AdaBoost) consistently highlighted the same 5 features as being most important (accounting for between 48% and 75% of all Features' Importance depending on the model used).  When I ran just those 5 factors in Linear Regression models, I got very similar scores to when I had all data included, except that the 'vanilla' OLS without regularisation scored as highly as the regularised models.

Note: I excluded all data that I termed 'Academic' since it had circular causality with the Progress 8 measure: for example, doing better than others in Maths will make you progress more than them.

**Five key non-academic factors explain half of the progress that children make academically between 11 and 16 in England:**<br />
**1)** Coming from a disadvantaged background hampers a child's academic progress<br />
**2)** Achieving at least one GCSE is important<br />
**3)** Taking a(ny) language at GCSE is a strong indicator of a child progressing more than his or her peers<br />
**4)** Girls progress faster than boys from 11 to 16 years old<br />
**5)** The more GCSEs (and equivalent) qualifications that children do, the more they progress<br />

#### **There were also some surprise findings (because they turned out to be unimportant when I had hypothesised that they would be)**

**1)** Class size (the ratio of teachers to pupils) does not have a significant impact on overall academic progress.<br />
**2)** Children in Selective Schools make on average two thirds of a grade more progress per GCSE than children in Non-Selective ones.<br />
Note: this is different from 'raw' academic performance, which one would expect to be higher in Selective Schools since they select their pupils.  However, this is progress of students vs (clever) peers in other schools.<br />
**3)** The most recent Government Inspection (Ofsted) rating of the school was not a big predictor of progress<br />
Whilst there was a trend for schools with higher ratings to progress more, there were also very varied results within each rating: some 'Outstanding' schools' children performed one grade lower in all subjects than peers, whilst some schools in 'Special Measures' added up to 1.5 grades per child per subject.<br />
**4)** Educational establishments (Academies) did not progress more than other schools<br />
**5)** Geographic location in the UK was not itself a significant factor in explaining progress<br />
I had expected that it would explain it in part, and that in turn would be driven by socioeconomic factors.  But there was no discernible pattern across the UK.


## <font color='green'>Summary of recommendations: to explain what accounts for UK children's academic progress in key subjects</font>

There are a number of recommendations based on the findings, but the four key ones, in order of priority, are:

**1) Commission research into determining whether learning languages strengthens a child's overall cognitive development, and if so, return to making languages compulsory at GCSE**

**2) Review the effectiveness of the 'Pupil Premium' funding that was introduced in 2011, and either boost that funding or change the policy for improving the progress that children make**

**3) Examine why children in Selective Schools make more progress than their peers in non-Selective schools.**

**4) Encourage schools to offer as broad an education as possible** since children who take more GCSEs progress more.

Given that the models explained approximately 50% of the variation in progress that children made, there is **one further hypothesis that I believe needs to be explored: that teacher and Head Teacher quality in schools is the primary driver of the amount of progress that children make.**


However, before these recommendations are implemented, the data for other school years needs to be analysed, to ascertain whether the explanatory factors for those years are similar to 2017-18.

## <font color='green'>Risks and Limitations of the findings, and assumptions that have been made</font>

There are three key risks to the models that I have built:<br />
**1)** The 50% of progress that is unexplained by the model could have more important actions that need to be taken.  I believe that the most important factor for determining progress is teacher quality (driven itself by Head Teacher quality) but I could not find data on that<br />
**2)** Schools may focus on influencing the results for their school overall to the detriment of children's progress<br />
**3)** The smallest 30% of schools (accounting for only 5% of children) are excluded from data for reasons of anonymity, but they may affect the overall findings

<a id='background'></a>

# <font color='blue'>1) Why is it important to understand the factors that help children to progress academically?</font>

## <font color='green'>Goal: to explain what accounts for English children's academic progress in key subjects</font>

Formal academic assessment of children in England takes place at key points ('Key Stages') during their time at school.  Until 2016, schools were judged by the proportion of their children who received Grades A-C in English and Maths.  But this didn't take account of children's starting points, so the UK Government sought to define a measure to assess how much value a school added to a child's education.  That measure is called Progress 8, and since 2016 it has been the key measure for assessing the academic performance of schools at Key Stage 4 (GCSEs).

***What is 'Progress 8'?***<br />
It is a composite that measures how far children have progressed academically between 11 and 16 years old in up to 8 key subjects relative to their peers.  Their peers are defined as those who perform similarly to them in Key Stage 2 exams in Maths and English when they were 11 years old. It is designed to encourage good quality teaching across a broad curriculum.

The measure consists of up to eight subjects, including Maths and English (both of which are double-weighted due to their importance).  A score of +1 indicates that a school's pupils are achieving on average one grade higher at GCSE in those 8 subjects than their counterparts who scored similarly to them when they were 11.  Similarly, a score of -1 indicates performance one grade below their counterparts. 
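As a rough illustration of the calculation described above, the sketch below derives school-level scores from pupil-level data. It is a simplified sketch rather than the DfE's actual procedure: the pupil data is made up, the column names (`ks2_band`, `attainment8`) are hypothetical, and it assumes that a pupil's progress is their Attainment 8 score minus the average for pupils with the same prior attainment, divided by the 10 weighted slots (8 subjects, with Maths and English counted twice).

```python
import pandas as pd

# Made-up pupil-level data: two schools, two prior-attainment bands (hypothetical columns)
pupils = pd.DataFrame({
    'school':      ['A', 'A', 'A', 'B', 'B', 'B'],
    'ks2_band':    ['low', 'high', 'high', 'low', 'low', 'high'],
    'attainment8': [40.0, 60.0, 70.0, 30.0, 35.0, 50.0],
})

# Estimated Attainment 8 = average score of all pupils with the same prior attainment
pupils['estimated_a8'] = pupils.groupby('ks2_band')['attainment8'].transform('mean')

# Pupil progress = difference from the estimate, spread over the 10 weighted slots (assumption)
pupils['progress8'] = (pupils['attainment8'] - pupils['estimated_a8']) / 10

# School score = mean of its pupils' progress scores
school_p8 = pupils.groupby('school')['progress8'].mean()
print(school_p8)
```

Because every pupil is compared with the average of similar pupils, the pupil-level scores average out to zero overall, which is why Progress 8 is a relative measure.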

(You can see a brief 3 minute video from the Department for Education explaining the measure [here](https://www.youtube.com/watch?v=4IAEgFMSGDY), or read a [comprehensive description](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/561021/Progress_8_and_Attainment_8_how_measures_are_calculated.pdf) of how it is calculated.)


***Why is this important?***

There is extensive data and research on academic performance, but given the relative novelty of the measure, there is limited assessment of the Progress 8 measure.
Over the past 10 years there has been reduced funding for schools.  Recently, the government announced a £7.1bn [increase in spending on education over the next three years](https://www.bbc.co.uk/news/education-49515002).  Knowing which areas to invest this money in is crucial to ensure the best possible benefit for children.

In 2016 UCAS found that GCSE results and the mix of A levels studied are the key factors determining whether university applicants meet their predicted grades.

This project attempts to understand what drives the progress (or lack of it) that children make in England's secondary schools between the ages of 11 and 16.

Note: In England, over 93% of children attend state-funded schools (the [ISC](https://www.isc.co.uk/research/) covers data for private schools, which is not included in this analysis).  

## <font color='green'>How to measure the explanatory factors</font>

The main **target metric(s)** for this project is the Coefficient of Determination (R-squared or <font color='red'>**R2**</font>); the Mean Squared Error was also used as a supplementary indicator.  In addition, the amount that individual factors contribute towards the Progress 8 Measure is measured in different ways, depending on the models that are used to assess them.  These are explained in more detail in later sections.
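For readers less familiar with these two metrics, the sketch below computes them by hand with NumPy. The actual and predicted scores are made up purely for illustration and are not taken from the project's models.

```python
import numpy as np

# Made-up actual and predicted Progress 8 scores for five schools
y_true = np.array([0.5, -0.2, 0.1, -0.6, 0.2])
y_pred = np.array([0.4, -0.1, 0.0, -0.3, 0.1])

# Mean Squared Error: average squared gap between actual and predicted scores
mse = np.mean((y_true - y_pred) ** 2)

# R2 = 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'MSE: {mse:.3f}, R2: {r2:.3f}')
```

An R2 of 1 would mean the predictions explain all of the variation in the scores, whilst an R2 of 0 would mean they do no better than always predicting the mean.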

However, the ultimate impact of improving the drivers of progress will not be an improvement in the overall score (since it is a relative score), but an increase in the GCSE grades of pupils, and increased progress compared with other nations - all else being equal - as measured in the OECD's PISA tests held every three years.

## <font color='green'>The target audience for this work</font>

The main **target audience** for this work is <font color='red'>**The Department for Education (DfE)**</font>:
    a) Having introduced the Progress 8 measure, it could be a good time to validate their hypotheses about the value of Progress 8 scores now that they have been used for four full years (although the fourth year - 2018-19 - will not have data released for another few months); and
    b) This could provide direction on where future government investment should be targeted for maximum effect.

In addition, <font color='red'>**secondary school Head Teachers**</font> may benefit from the report's findings, to help inform them on what they need to do in future to improve the chances of their children making even more progress than their predecessors.

A future phase of this project will target <font color='red'>**Parents**</font> who are considering which secondary school to send their child(ren) to, and would like more insight into the data than is provided by the DfE's website.  That will focus on predicting which Progress 8 Band (1 being the lowest, and 5 the highest) any given school will be in for the following year.

Whilst academic results are by no means the sole factor in a child's development, with experience in sports and arts activities and overall health being three other crucial factors of many, they are nevertheless important to future achievement in adult life.

<a id='eda'></a>

# <font color='blue'>2) Exploratory Data Analysis and Feature Engineering</font>

The initial dataset consisted of 5,673 rows, and 496 columns of features.  This dataset was downloaded from gov.uk's [compare schools' performance site](https://www.compare-school-performance.service.gov.uk/download-data).

**The dependent (target) y variable** was each school's Progress 8 score (the column in the dataset is 'P8MEA').  This measure was approximately normally distributed, which is consistent with the Central Limit Theorem (since each datapoint is an average Progress 8 measure across a number of students).

**There were 10 steps taken to process the data so it could be modeled:**

The detailed data cleaning can be found [here](2_GCSEs_Data_Cleaning_Sep_19.ipynb), and the Feature Engineering can be found [here](3_GCSEs_Feature_Engineering.ipynb)<br />
In addition, I created a **data dictionary** (see 'GCSEs_Key_meta_Sep_19.xlsx' in the home folder)

**1) Created a DataFrame with 2017-18 (all base) data**
- Excluded certain columns that duplicated data in other columns (e.g. only including % of children for a measure, and not both % and number of children)<br />
- Dataset went from 496 to 279 columns<br />

**2) Removed Special Schools and rows of summary data**
- Special Schools focus on children with special educational needs.  By definition, those children will progress differently than their peers of similar age, and parents would know that their children need to go to a Special School before they are 11<br />
- Some rows consisted of summary data of the rest of the dataset; whilst informative, they needed to be excluded<br />
- Dataset went from 5,673 entries (rows) to 4,328 schools to model (rows)<br />

**3) Removed invalid values for target y**
- Where data was unavailable, or unable to be published, there were certain text values used throughout the dataset.  For example 'SUPP' meant that data had been suppressed because there were fewer than 5 children in the cohort<br />
- Although I imputed values for many of these text values (see below), when data was missing for the target variable I decided to exclude the relevant school from the model.<br />
- The dataset reduced from 4,328 to 3,165 rows (schools), which was 70% of all schools in England (covering 95% of children in all secondary schools taking GCSEs)

**4) Examined number of Nulls / NaNs per column and row, and formatted the target variable correctly**
- Dataset stayed at 3,165 rows (schools) due to imputation of missing data and interpretation of NaNs<br />
- For example, where percentage of boys in a school was NaN, this turned out to be because it was an all girls school, so the NaNs were replaced with zeroes<br />
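The interpretation of NaNs described in step 4 can be sketched as follows. This is a minimal illustration with made-up data, not the cleaning code itself; the column names `pct_boys` and `pct_girls` are hypothetical. It only replaces a missing value with zero when the other column confirms that the school is all girls, and leaves genuinely unknown values untouched.

```python
import numpy as np
import pandas as pd

# Made-up example; 'pct_boys' and 'pct_girls' are hypothetical column names
schools = pd.DataFrame({
    'school':    ['Mixed School', 'Girls School', 'Unknown School'],
    'pct_boys':  [48.0, np.nan, np.nan],
    'pct_girls': [52.0, 100.0, np.nan],
})

# A missing boys percentage alongside 100% girls means "no boys", so zero is the right value
all_girls = schools['pct_boys'].isna() & (schools['pct_girls'] == 100)
schools.loc[all_girls, 'pct_boys'] = 0.0

print(schools)
```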

**5) Created table to show which data in which columns need to be cleaned**
- I wrote a function to create a table to see which columns and rows had non-numeric values (see function below)<br />
- I subsequently added a description of each of the column names, so that it was easy to look up what the data was when deciding what to do with the exceptions

In [2]:
# write a function to create a DataFrame containing a list of all the columns in the base dataset, and how many
# of each text type abbreviations are being used in place of values

def item_freq_in_col(df, list_words, sort_by='total_freq'):
    '''Input: dataframe and the 'items' you want to count (in a list)
    Returns new Dataframe with:
    1) Columns from original dataframe as rows
    2) Number of NaNs per column ('# nulls')
    3) Frequency of 'items' from each column listed in a separate column per 'item'
    4) Total (sum) of NaNs and all text items for that column
    5) % of rows that the NaNs and text values represent
    '''

    df_item_freq = pd.DataFrame({'column_name': df.columns})

    df_item_freq['# nulls'] = [df[col].isnull().sum() for col in df.columns]

    # count how often each text abbreviation appears in every column
    for word in list_words:
        df_item_freq['# "' + word + '"'] = [(df[col] == word).sum() for col in df.columns]

    # sum only the numeric count columns (this skips the text 'column_name' column)
    df_item_freq['total_freq'] = df_item_freq.sum(axis=1, numeric_only=True)
    df_item_freq['pct_all_rows'] = df_item_freq['total_freq'].apply(lambda x: round(x/len(df)*100))
    df_item_freq.sort_values(sort_by, ascending=False, inplace=True)
    return df_item_freq


**6) Decided how to handle remaining non-numeric data**
- Aim was to maximise number of schools and features being modeled, without compromising on quality of data
- For example, any numeric column with more than 5% of its numeric values missing was dropped
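The 5% rule mentioned above can be sketched as follows. The data and column names are made up for illustration, and the snippet is a sketch of the idea rather than the project's actual cleaning code.

```python
import numpy as np
import pandas as pd

# Made-up data: 'col_b' is missing in 4 of 40 rows (10%), 'col_c' in 1 of 40 (2.5%)
df = pd.DataFrame({
    'col_a': np.arange(40, dtype=float),
    'col_b': [np.nan] * 4 + [1.0] * 36,
    'col_c': [np.nan] + [2.0] * 39,
})

threshold = 0.05  # the 5% cut-off described above

# Share of missing values per column
missing_share = df.isnull().mean()

# Keep only the columns whose missing share does not exceed the threshold
df_clean = df.loc[:, missing_share <= threshold]
print(list(df_clean.columns))
```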

**7) Actioned the decisions made in 6) above**

**8) Converted all relevant columns to floats**

Note: step 9 was done as part of the Feature Engineering file<br />
**9) Imported 4 new features and renamed the columns to make them more intuitive**
- Including Teacher / Pupil ratio, which I had originally hypothesised would be a key driver<br />
- I did try adding additional features, for example the most recent Ofsted rating for the schools, but each time I did so the number of schools in the dataset reduced, and I believed this was not worth the trade-off given that most of the data that I had hypothesised would be important was already included in the dataset.

**10) Exported final (base) dataset ready for modeling**
- This consisted of 31 non-Academic Features for 3,067 schools


<a id='models'></a>

# <font color='blue'>3) Modeling of the data</font>

Given that the dependent variable to be explained is a continuous score (Progress 8 measure), the models needed to be based on Regression (rather than Classification).

I used eight models (or variations of models) to investigate and model the dataset, broadly split into 4 categories:<br />

a) Ordinary Least Squares (OLS) Regression with regularisation<br />
b) Decision Tree<br />
c) Ensemble methods on Decision Trees (Bagging, Random Forest, and AdaBoost)<br />
d) Principal Component Analysis (PCA)<br />

The results are in the graphic below, and details of my approach for each model are below that.

The detailed modeling, with all parameters and results, can be [found here](4_GCSEs_Modeling_Sep_19_Final.ipynb)

<img src="pics/models_results.png" alt="Drawing" style="width: 600px;" align="left"/>

<img src="pics/model_results_table.png" alt="Drawing" style="width: 350px;" align="left"/>

## <font color='green'>Models 1-3) Ordinary Least Squares (OLS) regression</font>

Initially, I ran a simple OLS Linear Regression analysis, to see how much of the variation in scores could be explained by the variables that I had included in the model (the R2 score).<br />
I took a number of actions to ensure that the scores were as statistically sound as possible and the model did not overfit the data:<br /><br />
  **a) Split the data into a training set and a testing set**.  In order to test the model on unseen data, I split the data into 70% for training purposes, and 30% for validation (testing) purposes.<br />
  **b) Cross-validated the data set that was used for training the model** to ensure that any problems with overfitting or selection bias were highlighted; the model was trained on five-fold cross-validation<br />
  **c) Used regularisation to remove the effects of multicollinearity in the data** (both Lasso and Ridge were used)<br /><br />

**Additional elements to note**:
- Feature scaling: the data was also standardised using Scikit-Learn's StandardScaler.  MinMaxScaler was also tested, with very similar results except that the intercept for MinMax was almost -2, implying that without any factors being taken into account a school's score was -2 grades.  StandardScaler's -0.014 made more sense, so I used the results for Lasso with StandardScaling.
- One-hot encoding (or dummification) was applied to the 10 categorical variables to ensure that they were modeled correctly (leaving them with numerical values would imply that higher numbers were more important, and a certain distance, from lower numbers)
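- The residuals (the differences left between actual and predicted scores after modeling) were approximately normally distributed, which is consistent with the normality assumption of linear regression; independence of the errors would need to be checked separately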
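The steps listed above (train/test split, five-fold cross-validation, Lasso regularisation, standardisation and one-hot encoding) can be combined in a single scikit-learn pipeline. The sketch below is illustrative only: it runs on synthetic data, the column names are hypothetical, and the use of `LassoCV` to choose the penalty strength is an assumption rather than a record of how the project tuned it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the schools dataset (all column names are hypothetical)
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    'pct_languages': rng.uniform(0, 100, n),
    'pct_disadvantaged': rng.uniform(0, 60, n),
    'school_type': rng.choice(['academy', 'maintained', 'free'], n),
})
y = (0.01 * X['pct_languages'] - 0.015 * X['pct_disadvantaged']
     + rng.normal(0, 0.3, n))

# a) hold back 30% of schools as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Standardise the numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['pct_languages', 'pct_disadvantaged']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['school_type']),
])

# c) Lasso regularisation, with the penalty strength chosen by internal cross-validation
model = make_pipeline(preprocess, LassoCV(cv=5, random_state=42))

# b) five-fold cross-validated R2, computed on the training set only
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')

# Fit on the full training set and score once on the held-out test set
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f'Mean CV R2: {cv_r2.mean():.3f}, test R2: {test_r2:.3f}')
```

Putting the scaler and encoder inside the pipeline means they are re-fitted on each training fold, so no information from the validation fold leaks into the preprocessing.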

<font color='red'>**Result:**</font> the best model was the regularised Lasso model:<br />
**0.514** for the mean cross-validated training R2, and<br />
**0.444** for the R2 of the model applied to the test data set


<br /><br />
The resulting equation was as follows (all values are for standardised data and only the five key factors are shown):

**$\hat{y}$ = -0.0143<br /> + 0.086 (% children taking any language at GCSE)<br /> + 0.086 (% children with any qualifications)<br /> + 0.075 (average number of GCSEs and equivalents taken)<br /> - 0.065 (% disadvantaged children)<br /> - 0.048 (% boys in the school)<br /> + ...**

**Note:** since the features are standardised, each coefficient is the change in the predicted Progress 8 score for a one standard deviation increase in that factor.
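To make the equation above concrete, the sketch below applies its coefficients to a hypothetical school. The dictionary keys are illustrative names, the school's values are made up, and the terms hidden behind the "+ ..." are treated as zero, so the result is only a partial prediction.

```python
# Coefficients taken from the Lasso equation above (standardised features)
intercept = -0.0143
coefs = {
    'pct_any_language': 0.086,
    'pct_any_qualification': 0.086,
    'avg_num_gcses': 0.075,
    'pct_disadvantaged': -0.065,
    'pct_boys': -0.048,
}

# Hypothetical school, expressed in standard deviations from the mean of each factor
school_z = {
    'pct_any_language': 1.0,        # one SD more pupils taking a language
    'pct_any_qualification': 0.0,
    'avg_num_gcses': 0.0,
    'pct_disadvantaged': -1.0,      # one SD fewer disadvantaged pupils
    'pct_boys': 0.0,
}

# Intercept plus each coefficient multiplied by the school's standardised value
partial_prediction = intercept + sum(coefs[k] * school_z[k] for k in coefs)
print(round(partial_prediction, 4))
```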

## <font color='green'>Model 4) Decision Tree Regressor</font>

Next I tried a Decision Tree Regressor, to understand what the most important features were in determining drivers of academic progress.  I used GridSearch to find the optimum parameters.  These were found to be:

- 'max_depth': 14
- 'max_features': 0.25
- 'min_samples_split': 200
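A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The example below uses synthetic data, and the grid values are illustrative assumptions: the full set of values searched in the project is not listed in this report.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the schools dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.3, 400)

# Illustrative grid over the three parameters named above (values are assumptions)
param_grid = {
    'max_depth': [2, 4, 8, 14],
    'max_features': [0.25, 0.5, 1.0],
    'min_samples_split': [2, 20, 200],
}

# Try every combination, scoring each by five-fold cross-validated R2
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring='r2',
)
search.fit(X, y)

print(search.best_params_)
print(f'Best mean CV R2: {search.best_score_:.3f}')

# Feature importances of the best tree, which sum to 1
importances = search.best_estimator_.feature_importances_
```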


<font color='red'>**Result:**</font>
**0.339** for the mean cross-validated training R2 score.
Whilst the Decision Tree was useful to highlight important features (the top 5 features accounted for 81% of the overall importance), the R2 was considerably lower than the OLS model.

The top 5 features turned out to be consistent across other models, and are shown below.

## <font color='green'>Models 5-8) Ensemble Methods</font>

I tried a number of Ensemble techniques to improve the predictive accuracy by combining several base models. 
I used both **Bagging** - to reduce overfitting - and also **boosting**.
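The ensemble approaches covered in this section can be compared side by side using cross-validation. The sketch below does this on synthetic data with illustrative settings (for example `n_estimators=50` and `max_depth=4`), which are assumptions and not the tuned parameters from the project.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the schools dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.3, 400)

# A single tree plus three ensembles built on Decision Trees
models = {
    'Decision Tree': DecisionTreeRegressor(max_depth=4, random_state=1),
    'Bagging': BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=1),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=1),
    'AdaBoost': AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                                  n_estimators=50, random_state=1),
}

# Mean five-fold cross-validated R2 for each model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```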

### <font color='rebeccapurple'>Model 5) Decision Tree Regressor with Bagging</font>

In order to reduce overfitting, I tried Bagging, which trains multiple copies of a model on random subsets of the training data and then averages their predictions.

<font color='red'>**Result:**</font>
**0.436** for the mean cross-validated training R2 score (using GridSearch to find the optimal max_samples and max_features).<br />
Whilst Bagging did help improve the Decision Tree Regressor considerably, the score did not surpass the regularised (Lasso) Linear Regression score.


The top 5 features turned out to be consistent across other models, and are shown below.


### <font color='rebeccapurple'>Model 6) Random Forest</font>

Similar in approach to Bagging, Random Forest is an ensemble of Decision Trees trained using the Bagging method, with the addition that each split considers only a random subset of the features.

<font color='red'>**Result:**</font>
**0.498** for the mean cross-validated training R2 score (using GridSearch to find the optimal n_estimators).<br />
Random Forest further improved on the Bagging results, but was still not quite at the regularised (Lasso) Linear Regression score.

The top 5 features turned out to be consistent across other models, and are shown below.


### <font color='rebeccapurple'>Model 7) Decision Tree Regressor with AdaBoost</font>

The other family of ensemble methods, besides Bagging methods, are boosting methods, where base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

<font color='red'>**Result:**</font>
**0.502** for the mean cross-validated training R2 score.<br />
This was the best Decision Tree result of all, but still slightly behind the regularised (Lasso) Linear Regression score.

The top 5 features turned out to be consistent across other models, and are shown below.

**Potential next step**: try Gradient Boosting to see whether that gives an even better result

## <font color='green'>Model 8) Principal Component Analysis (PCA)</font>

Given the abundance of features in the dataset, I attempted to reduce their number through PCA, which combines correlated features into a smaller set of components and is well suited to dimensionality reduction.

<font color='red'>**Result:**</font>
**just under 80%** explained variance for 100 components (there were 800 components following one-hot encoding)<br />
A further analysis of all components showed a linear progression of explained variance with number of components, indicating that PCA would not be helpful in this particular model.
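The check on cumulative explained variance can be sketched as follows. The example uses synthetic, largely independent features (an assumption chosen to mimic the roughly linear progression described above), so the explained variance builds up gradually and many components are needed to reach 80%.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 largely independent features, so no component dominates
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))

# PCA is sensitive to scale, so standardise first
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Number of components needed to explain at least 80% of the variance
n_for_80 = int(np.argmax(cumulative >= 0.8)) + 1
print(f'Components needed for 80% of variance: {n_for_80} of {X.shape[1]}')
```

When features are strongly correlated, the cumulative curve rises steeply at first and a handful of components is enough; a near-straight line is a sign that PCA offers little compression.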

**Potential next step**: an additional step would be to analyse sub-groups of the dataset that may be inter-related, to see whether PCA could reduce each group to a smaller set of variables that could then be added together with other groups' results to form a smaller subset of features.

<a id='findings'></a>

# <font color='blue'>4) Findings from the analysis and modeling</font>

There are five key factors that explain the progress made by children in GCSEs taken in 2018.  These factors came up in all the different analyses and models that were used, and explained approximately half (50%) of the variation in the progress that schools helped children to make between 11 and 16 years old.

Below is a summary of each key factor, along with proposed next steps.

The results were gleaned from the models that were run (explained in more detail above and in [4_GCSEs_Modeling_Sep_19.ipynb](4_GCSEs_Modeling_Sep_19.ipynb)), whose results showed relative stability, both in scores and in indicators of progress.

### <font color='rebeccapurple'>**1) Taking a(ny) language at GCSE is a strong indicator of a child progressing more than his or her peers**</font>

- ***Measure***: Percentage of pupils entering the EBacc Language subject area (EBacc is the English Baccalaureate, a collection of core subjects that children can do, such as Science, Geography and French).


- This was the most important feature in most models, and was significant in the best performing model too
- Approximately 47% (234k) of all pupils in the dataset took at least one language in 2018.  This is down from 70% in 2004, the last year that taking a language at GCSE was compulsory.
- Only 28 schools (1%) in 2018 had no children taking any languages at GCSE


***Recommendation: further research is needed before recommending that taking a language become compulsory again***:<br />
Why the driver was so strong, especially given that only 7% of GCSE EBacc entries were for languages (the lowest of the EBacc categories), needs to be explored.  Some hypotheses are that:<br />

i) **Only the most able, in schools with strong language teachers, are steered towards taking a languages GCSE**<br />
    Modern Languages are known to be more harshly marked than other GCSEs ([see FFT article](https://ffteducationdatalab.org.uk/2019/09/will-anything-ever-be-done-about-grading-in-modern-foreign-languages-gcses/)), something that has been the case since at least the 1970s.  Also, there is a shortage of languages teachers.  Given these factors, and the pressure on schools to meet Progress 8 measures, only the most able children will be steered towards taking a language, and specifically those in schools with good language teachers. <br />
    **Measure**: The proportion of children in High, Medium and Low attainment categories at 11 (KS2) who take at least one language at GCSE.  <br />
    However, this does not necessarily explain why taking a language is a good predictor of **progress** rather than raw academic attainment.
    
ii) **The type of person who studies a language is someone who will achieve particularly well**, more so than someone who scored equally well in Maths and English at 11 years old but who chooses not to study a language. <br />
    Whilst this accounts for the link between the two factors, it does not explain **why** studying a language increases someone's chances of making more progress than their peers.
    
iii) **The act of learning a language enables the development of more advanced cognitive functioning than learning other subjects, and that cognitive functioning aids progress in other subjects**.  This would explain why there is a difference in overall progress between those taking a language at 16 and those who do not, but it is a difficult hypothesis to test.
    
iv) **The quality of teaching has the biggest impact, and high quality teachers work with other high quality teachers**.  Teaching quality is a factor not included in the models I have created, because data on it is very difficult to obtain.  However, given that there is a shortage of teachers in languages, and the pressure on schools to achieve as high a Progress 8 score as possible, it is likely that children steered towards taking a language at GCSE will have a good quality teacher.  In addition, high quality professionals in any field will want to work alongside people of a similar ability, which would help to explain why children make better than average progress in other subjects too.<br />
    This would be a difficult hypothesis to test, but is supported by looking at the progress that children make in selective schools.<br />
    
v) **There are one or more other confounding factors that are yet to be discovered**
<br /><br />
Additionally, it would be interesting to examine whether the number of languages studied makes a difference, and whether it matters which languages are studied.
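The measure proposed under hypothesis i) above could be computed along the following lines. This is only a minimal sketch: the column names (`prior_band`, `pupils`, `language_entries`) and every number in the table are made-up placeholders, not fields or values from the dataset used in this project.

```python
import pandas as pd

# Hypothetical counts, one row per school and KS2 prior-attainment band.
# All column names and numbers are illustrative placeholders, not real data.
df = pd.DataFrame({
    "school": ["A", "A", "A", "B", "B", "B"],
    "prior_band": ["Low", "Medium", "High"] * 2,
    "pupils": [40, 100, 60, 50, 90, 30],
    "language_entries": [4, 40, 45, 5, 27, 18],
})

def language_entry_rate(frame: pd.DataFrame) -> pd.Series:
    """Share of pupils in each prior-attainment band entered for at least one language."""
    totals = frame.groupby("prior_band")[["pupils", "language_entries"]].sum()
    rate = totals["language_entries"] / totals["pupils"]
    # Order the bands from lowest to highest prior attainment for readability.
    return rate.reindex(["Low", "Medium", "High"])

rates = language_entry_rate(df)
print(rates.round(3))
```

If the real data showed a pattern like this one, with entry rates rising sharply across the bands, it would be consistent with able children being steered towards languages, although it would not by itself explain the link with progress.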

### <font color='rebeccapurple'>**2) Achieving at least one GCSE is important**</font>

- ***Measure***: Percentage of pupils achieving any qualifications
- 99% of children achieved at least one GCSE, so this measure filters very few
- The proportion of pupils achieving **any** GCSE or equivalent qualification is a significant predictor of how much progress a school will make.  However, 99% of children achieved at least one qualification, which leaves approximately 4,400 children (outside of Special Schools) who did not achieve any qualifications, spread across 1,600 schools.

***Recommendation: investigate the schools that have children who did not receive any qualifications, and determine individual strategies for each school to improve their scores in future.  Prioritise the 8 schools with more than 10% of pupils achieving no qualifications.***
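One way to produce the priority list in this recommendation is sketched below. The table, its column names (`pupils_ks4`, `pct_any_qualification`) and the helper `flag_priority_schools` are hypothetical stand-ins for illustration; only the 10% threshold comes from the recommendation itself.

```python
import pandas as pd

# Hypothetical school-level table; names and values are placeholders only.
schools = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "pupils_ks4": [200, 150, 80, 120],
    "pct_any_qualification": [100.0, 98.0, 85.0, 89.5],
})

def flag_priority_schools(frame: pd.DataFrame, threshold_pct: float = 10.0) -> pd.DataFrame:
    """Return schools where more than threshold_pct of pupils achieved no qualifications."""
    out = frame.copy()
    out["pct_no_qualification"] = 100.0 - out["pct_any_qualification"]
    # Approximate head count, rounded because the source figure is a percentage.
    out["pupils_no_qualification"] = (
        out["pupils_ks4"] * out["pct_no_qualification"] / 100
    ).round().astype(int)
    priority = out[out["pct_no_qualification"] > threshold_pct]
    return priority.sort_values("pct_no_qualification", ascending=False)

priority = flag_priority_schools(schools)
print(priority[["school", "pct_no_qualification", "pupils_no_qualification"]])
```

Sorting by the percentage rather than the head count keeps small schools with a severe problem at the top of the list; sorting by head count instead would be a reasonable alternative if the aim were to reach the most children.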

### <font color='rebeccapurple'>**3) The more GCSE (and equivalent) qualifications that children take, the more they progress**</font>

-  ***Measure***: Average number of GCSE and equivalents entries per pupil (in any given school)
- In one sense, this is an intuitive result: the more exams that a child takes, the more able they are likely to be.  Since only the best eight exams are included in the overall Progress 8 measure (albeit with some restrictions on categories), children who take more than eight GCSEs or equivalents are likely to get a higher Progress 8 score.
- However, since this is explaining the **progress** that children make rather than raw academic achievement, there remains the question of why some children take more exams than others.

<img src="pics/avgqual.png" alt="Drawing" style="width: 400px;" align="left"/>

***Recommendation: investigate why some children of similar abilities at 11 years old take different numbers of GCSEs and equivalent qualifications.***
Some hypotheses to explore are:<br /><br />
i) **Funding per pupil and facilities available have a significant impact on the breadth of education on offer** <br />
ii) **Teaching quality** is a confounding factor, which has a large impact on a student's performance at GCSEs<br />
iii) **Head Teacher quality** is a confounding factor, which has a large impact on a school's performance over time through the performance management of teachers and the setting of the curriculum within a school

### <font color='rebeccapurple'>**4) Coming from a disadvantaged background hampers academic progress**</font>

- ***Measure***: Percentage of pupils at the end of key stage 4 who are disadvantaged
- 26% (130k) of all pupils in the dataset were disadvantaged when they took their GCSEs (at the end of key stage 4)
- Disadvantaged pupils are those who currently claim free school meals, have claimed free school meals at any time during the last 6 years, are looked after (in the care of the local authority for a day or more), or have been adopted from care.

<img src="pics/disadvan.png" alt="Drawing" style="width: 400px;" align="left"/>

- The government recognised this disparity, and in 2011 introduced the **['Pupil Premium'](https://www.gov.uk/guidance/pupil-premium-information-for-schools-and-alternative-provision-settings)** grant, which gives between £935 and £3,825 per secondary school child who is eligible for Free School Meals to the child's school.  This is an annual payment, and is specifically to 'improve the academic outcomes of disadvantaged pupils of all abilities'.  Schools are required to account for its use and how it has helped disadvantaged pupils.

- Given that the average funding per secondary pupil in the UK is £6,000, this is a significant investment.
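To put the size of the grant in context, the short calculation below expresses the two Pupil Premium figures quoted above as a percentage of the £6,000 average. It assumes, purely for illustration, that the premium is paid on top of that average; the function name `share_of_average` is my own.

```python
# Figures quoted in the text above (pounds per pupil per year).
PREMIUM_MIN = 935
PREMIUM_MAX = 3825
AVERAGE_FUNDING = 6000

def share_of_average(premium: float, average: float = AVERAGE_FUNDING) -> float:
    """Return the premium as a percentage of average per-pupil funding."""
    return 100 * premium / average

low = share_of_average(PREMIUM_MIN)
high = share_of_average(PREMIUM_MAX)
print(f"Pupil Premium is between {low:.1f}% and {high:.1f}% of average per-pupil funding")
```

On these figures the grant is worth roughly 16% to 64% of the average funding for a secondary pupil.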

***Recommendation: review the effectiveness of Pupil Premium to date***:
- I have not yet found a review of the efficacy of Pupil Premium across England since its introduction.
- Assuming that there is no review, one should be undertaken to determine what impact the funding has had, and whether funding should be increased or whether the policy should be replaced with a different mechanism for supporting the development of disadvantaged pupils.

### <font color='rebeccapurple'>**5) Girls progress faster than boys from 11 to 16 years old**</font>

- ***Measure***: Percentage of pupils at the end of key stage 4 who are boys

- Just over 50% of the pupils in schools within the dataset are boys
- Although the proportion of boys in a school is a significant predictor of how much progress a school will make, there is already extensive research which shows that girls progress more than boys between 11 and 16 years old (not isolated to England).  Whether this is due to differing cognitive development biologically, or whether it is due to the conditions in the learning environment is not entirely clear.
- Note: this disparity between boys' and girls' progress is evident when one looks at Boys-only, Girls-only and Mixed schools data, and when the data is subset by Selective vs non-Selective schools.

The difference between boys' and girls' progress can be seen in the data when the distributions of the Progress 8 scores for girls and boys are plotted on the same graph.

<img src="pics/girlsboys.png" alt="Drawing" style="width: 400px;" align="left"/>
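A chart of this kind can be drawn by overlaying two histograms on shared bins. The sketch below uses synthetic, randomly generated scores with arbitrarily chosen means, purely to make the code runnable; it does not reproduce the real distributions, and the function name `plot_p8_by_gender` is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic placeholder scores, NOT the real data: in practice these would be
# the school-level Progress 8 scores for girls and for boys.
rng = np.random.default_rng(0)
p8_girls = rng.normal(loc=0.2, scale=0.5, size=500)
p8_boys = rng.normal(loc=-0.2, scale=0.5, size=500)

def plot_p8_by_gender(girls, boys, bins=30):
    """Overlay the two Progress 8 distributions on shared bins and axes."""
    fig, ax = plt.subplots(figsize=(8, 5))
    # Shared bin edges make the two histograms directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([girls, boys]), bins=bins)
    ax.hist(girls, bins=edges, alpha=0.5, label="Girls")
    ax.hist(boys, bins=edges, alpha=0.5, label="Boys")
    ax.axvline(0, color="black", linestyle="--", linewidth=1)  # reference line at zero
    ax.set_xlabel("Progress 8 score")
    ax.set_ylabel("Number of schools")
    ax.legend()
    return ax

ax = plot_p8_by_gender(p8_girls, p8_boys)
```

Using the same bin edges for both groups matters here: if each histogram chose its own bins, an apparent shift between the two distributions could partly be an artefact of the binning.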

***Recommendation: conduct research to confirm whether the difference between girls' and boys' progress in this age group is due to biological or environmental factors (or a combination of the two).***

### <font color='rebeccapurple'>**Some surprise findings (because they turned out to be unimportant)**</font>

There were also five surprise findings in the research: things which I expected to have a significant impact but which, on close analysis, did not.  Each of these is covered in more detail in [GCSEs_Findings_Data_Sep_19.ipynb](GCSEs_Findings_Data_Sep_19.ipynb).

**1) The ratio of teachers to pupils does not have a significant impact on overall progress.**<br />

**2) Children in Selective Schools make on average two thirds of a grade more progress per GCSE than children in Non-Selective ones.**<br />
- Whilst it is not a surprise that Selective Schools' 'raw' academic performance is better than other secondary schools, it is somewhat surprising that children at those schools progress more.
- This could be due to the teacher quality (as mentioned in finding 1 about learning a language), or perhaps due to children learning more from each other, or both.
- Selectivity itself was not significant among the drivers of progress, likely because only 162 of the 3,067 schools are Selective.

**3) The most recent Government Inspection (Ofsted) rating of the school was not a big predictor of progress**<br />
- Whilst there was a trend for schools with higher ratings to progress more, there were also very varied results within each rating: for example, those rated 'Outstanding' ranged from over +1.5 grades' progress to -1.0, and those under Special Measures (when external intervention takes place) ranged from +1.5 to more than -2.0 grades' progress.

**4) Educational establishments (Academies) did not progress more than other schools**<br />

**5) Geographic location in the UK was not itself a significant factor in explaining progress**<br />
- I had expected that it would explain it in part, and that in turn would be driven by socioeconomic factors.  But there was no discernible pattern across the UK (see diagram below)


<img src="pics/geog.png" alt="Drawing" style="width: 250px;" align="left"/>

<a id='risks'></a>

# <font color='blue'>5) Key risks and limitations of the findings, and assumptions made</font>

### <font color='rebeccapurple'>There are three key risks</font>

**1) The 50% of progress that is unexplained by the model could have more important actions that need to be taken**<br />
- Although certain factors were excluded because data is unavailable, they could be more important to investigate than the ones which have been included.  These include:
- Head Teacher and Teacher quality, both of which I expect have the biggest impact on children's progress.
- The difference that taking exams with each of the four Exam boards in England can have: the exam boards do not have even distributions of grades awarded, and anecdotally some are considered by teachers to be easier than others to get higher grades.
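The '50% unexplained' figure follows directly from the coefficient of determination: if a model has an R2 of about 0.5, the unexplained share of the variation is 1 - R2. A minimal sketch of that calculation is below; the numbers are toy values chosen to give a round result, not outputs of the models in this project.

```python
import numpy as np

def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy Progress 8 scores and predictions (placeholders, not real results).
actual = [-0.6, -0.2, 0.0, 0.3, 0.5]
predicted = [-0.2, -0.5, 0.2, 0.1, 0.3]

r2 = r_squared(actual, predicted)
unexplained = 1.0 - r2
print(f"R2 = {r2:.2f}, unexplained share = {unexplained:.2f}")
```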


**2) Schools may focus on influencing the results for their school overall to the detriment of children's progress**.
- There is a risk of schools 'gaming' the system to increase their scores in at least two ways:<br /><br />
i) [**'Off-rolling'**](https://www.theguardian.com/education/2018/jun/26/300-schools-picked-out-in-gcse-off-rolling-investigation) is a practice that has been identified; it involves schools removing children from their school rolls who they believe will underperform so that their results do not adversely affect the school's overall results.  [One investigation](https://www.theguardian.com/education/2019/apr/18/more-than-49000-pupils-disappeared-from-schools-study) put the figure as high as 8% of children being off-rolled.  It is difficult to see off-rolling in the data, although the number of schools who had enough children to report Progress 8 the percent of children.<br /><br />
ii) **Entering children for GCSEs or equivalents that are easy to get high grades in**.<br />
    - The *European Computer Driving Licence (ECDL)* qualification is one such example, where pupils who took it received on average the equivalent of an A grade, whereas in their GCSEs they achieved an average of below grade C.
    - [Research](https://dataeducator.wordpress.com/2018/04/20/progress-8-and-ecdl/) found that 209 schools entered more than 95% of their cohort for the ECDL, and 2,240 schools used this qualification to some extent in 2017.  It has since been removed from the list of approved qualifications that are included in Progress 8.


**3) The smallest 30% of schools (accounting for only 5% of children) are excluded from data for reasons of anonymity, but they may affect the overall findings**



### <font color='rebeccapurple'>Limitations</font>

The key limitation is that I have been unable to source data on what are potentially the most important factors, such as Head Teacher and Teacher quality.

### <font color='rebeccapurple'>Assumptions</font>

In line with the Risks outlined above, there are a number of assumptions underlying the model, namely:

a) **Similarity of behaviour across schools**: the 30% of schools (covering only 5% of pupils) not included in the dataset behave in a similar way to the ones that have been analysed and modeled

b) **The drivers of progress are constant across school years and pupil cohorts**
- The 2017-18 school year has been analysed in this project
- Future phases of this project will analyse data for other years (2015-16 and 2016-17, and 2018-19 when it is published) to determine whether the explanatory factors are consistent over time

c) **Children move schools in a sufficiently small number that they do not affect the overall value add attributed to schools** (the progress measure shows the average for the children on the school roll when they take their GCSEs, so assumes that their progress is attributable to the school they are at when they take their GCSEs).

<a id='recommendations'></a>

# <font color='blue'>6) Recommendations</font>

There are a number of recommendations based on the findings, but the four key ones, in order of priority, are:

**1) Commission research into determining whether learning languages strengthens a child's overall cognitive development, and if so, return to making languages compulsory at GCSE**

**2) Review the effectiveness of the 'Pupil Premium' funding that was introduced in 2011, and either boost that funding or change the policy for improving the progress that children make**

**3) Examine why children in Selective Schools make more progress than their peers in non-Selective schools.**

**4) Encourage schools to offer as broad an education as possible** since children who take more GCSEs progress more.

Given that the models explained approximately 50% of the variation in progress that children made, there is one further hypothesis that I believe needs to be explored further: that teacher and Head Teacher quality in schools is the primary driver of the amount of progress that children make.


However, before these recommendations are implemented, the data for other school years needs to be analysed, to ascertain whether the explanatory factors for those years are similar to 2017-18.

<a id='next_steps'></a>

# <font color='blue'>7) Next steps</font>

**1)** Analyse 2017 and 2016 data to see whether the same factors are highlighted.  This would add weight to the findings, as one would expect the non-academic explanatory factors in schools to remain relatively constant over such a short period of time.<br />
**2)** Add new features to the dataset (without reducing the number of schools in it), to see whether they can provide more explanation of the variation in progress scores<br />
**3)** Try additional models to see whether they provide different results (SVM and Bayesian)<br />
**4)** Try a classification approach (logistic regression) to see how accurately schools’ P8 can be predicted<br />
**5)** PCA: try sub-groupings of variables to see whether they can be reduced to fewer dimensions<br /><br />


<img src="pics/next_steps.png" alt="Drawing" style="width: 800px;" align="left"/>
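As an indication of what the PCA step might involve, the sketch below computes the share of variance captured by each principal component for a small group of variables, using NumPy's SVD. The data is synthetic and the function name `pca_explained_variance_ratio` is hypothetical; a library implementation could equally be used in the real analysis.

```python
import numpy as np

def pca_explained_variance_ratio(X) -> np.ndarray:
    """Return the share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    # Standardise each feature so that differences in units do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Squared singular values of the standardised matrix are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Synthetic placeholder features, NOT real data: two strongly related columns
# plus one independent column, standing in for a sub-group of school variables.
rng = np.random.default_rng(42)
base = rng.normal(size=200)
X = np.column_stack([
    base,
    base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

ratios = pca_explained_variance_ratio(X)
print(ratios.round(3))
```

When a sub-group of variables is largely redundant, as the first two synthetic columns are here, the last component carries almost no variance, which is the signal that the group could be reduced to fewer dimensions.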

END