# EMA Project Notebook
__Name:__ Daniel Smith

__PI:__ A7603242

_Please note:_

This notebook records all the steps I took in the investigation.  To run, it requires the provided KS2 and KS4 data to be unzipped into the `data/2015-2016/` folder.  When run, it cleans the required CSV files and stores them in a MongoDB database.

Running all the steps may take a little time.

In [None]:
# import the required libraries
import pandas as pd
import scipy.stats
import numpy as np
import pymongo
import bson
import collections
import matplotlib.pyplot as plt
import seaborn as sns

# import the needed machine learning libraries
from sklearn import cluster
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples


# Contents
Use these links to jump to a section.

[Initial look at the ks4 dataset](#initial_look)

[Choosing MongoDB](#mongo)

---
[Data preparation](#preparation)
   - [Importing the KS2 data](#importing_ks2)
   - [Importing the KS4 data](#importing_ks4)
      - [Importing the abbreviations file](#abbr)
      - [Importing the ks4 meta file](#meta)
      - [Importing the data](#data)
---
[Q1, KS4 Investigation](#q1)
   - [Choosing performance measures](#measures)
   - [Additional cleaning](#add_clean)
   - [Does the type of school impact the results students achieve at keystage 4?](#Q1_a)
      - [summary stats](#ks4_summary_stats)
      - [Grouped school type plots](#school_type_plots)
      - [Whole dataset kMeans cluster analysis](#machine_learning)
      - [Grouped data kMeans cluster analysis](#grouped_cluster)
      - [Silhouette plots](#silhouette)
      - [School Scatter](#school_scatter)
      - [Findings](#q1_findings)
---
[Q2, KS2 - KS4 Investigation](#q2)
   - [Do schools that perform well at KS2 deliver as good or better results at KS4.](#q2)
   - [Joining the datasets](#joining)
   - [Summary stats](#stats)
   - [Plotting](#plotting)
   - [Pearson R^2](#pearson)
   - [Findings](#q2_findings)


[Cleanup - remove the database](#cleanup)

In [None]:
# make a folder for storing my working files as I go along.
# make a folder for plot image png's generated
!mkdir -p plot_images

In [None]:
!ls

In [None]:
# set up basic plot styles
sns.set_style('ticks')
sns.set_palette(palette='Paired', n_colors=12)

# set up plots for nice exporting
#https://matplotlib.org/users/customizing.html
plt.rcParams.update({'axes.titlesize': 35,
                     'axes.labelsize': 30,
                     'lines.linewidth': 5,
                     'lines.markersize': 12,
                     'legend.loc': 'best',
                     'legend.fontsize': 20,
                     'xtick.labelsize': 20,
                     'ytick.labelsize': 20,
                     'figure.figsize': [20, 16]})

<a name="initial_look"></a>

# Initial look at the KS4 results dataset
Let's have a quick look at the data we will be looking at for the EMA.

In [None]:
!head -5 'data/2015-2016/england_ks4final.csv'

In [None]:
!wc -l 'data/2015-2016/england_ks4final.csv'

The dataset has 5489 rows of data.  There appear to be a large number of columns and a lot of codes that I'll need to look up.  There are also a number of `NA` and `NP` values that could be missing data.  As well as this results data I will need to find and import the relevant metadata file.

Before importing the dataset I will need to decide which storage method to use.

<a name="mongo"></a>

# Choosing MongoDB

With so many columns to investigate I am leaning towards using a DBMS to make querying the data more efficient than in a pandas dataframe.  Therefore, I will import the data into MongoDB.  I chose a document database because its schema is more flexible than that of a relational database.  In this investigation it may become necessary to add fields to certain documents, for example.

In [None]:
# set up a connection to mongodb server
client = pymongo.MongoClient('mongodb://localhost:27351')

In [None]:
# uncomment to remove the database if needed
# client.drop_database('schools_db')
# client.database_names()

In [None]:
# setup a schools_db database on mongo
db = client.schools_db

<a name="preparation"></a>

# Data preparation

Before we can investigate the data we will need to have a quick look at it, determine what cleaning, if any, is needed, carry out the cleaning and store it for access in an appropriate form.

However, before doing anything I will import the KS2 data in the same way as was done in `dcs283_TMA02_Question2b-pd`.  I will then store the resultant dataframe in Mongo for analysis later on.

<a name="importing_ks2"></a>

# Importing the KS2 data


All of this section is the same as in the `TMA02_Question2b-pd` notebook.

***
__ ----------- Beginning of TMA02 code -----------  __

### Import the LEA data

In [None]:
leas_df = pd.read_csv('data/2015-2016/la_and_region_codes_meta.csv')
leas_df.head()

### Import the KS2 data
Most of the field names are given in the `ks2_meta` file, so we'll use that to keep track of the types of various columns.

In [None]:
ks2cols = pd.read_csv('data/2015-2016/ks2_meta.csv')
ks2cols['Field Name'] = ks2cols['Field Name'].apply(lambda r: r.strip(),)
ks2cols

Some columns contain integers, but _**pandas**_ will treat any numeric column with `na` values as `float64`, due to NumPy's number type hierarchy. 

In [None]:
int_cols = [c for c in ks2cols['Field Name'] 
            if c.startswith('T')
            if c not in ['TOWN', 'TELNUM', 'TKS1AVERAGE']]
int_cols += ['RECTYPE', 'ALPHAIND', 'LEA', 'ESTAB', 'URN', 'URN_AC', 'ICLOSE']
int_cols += ['READ_AVERAGE', 'GPS_AVERAGE', 'MAT_AVERAGE']
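As a small illustration of the `float64` point above, here is a sketch on a made-up three-value series (not the KS2 data).  It also shows pandas' nullable `Int64` dtype, which is one possible way to keep integer columns as integers; the notebook itself does not use it.

```python
import pandas as pd

# hypothetical integer column with one missing value
s = pd.Series([1, 2, None])

# the missing value forces the whole column to float64
print(s.dtype)

# the nullable integer dtype keeps whole numbers and marks the gap as <NA>
nullable = s.astype('Int64')
print(nullable)
```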

Some columns contain percentages. We'll convert these to floating point numbers on import.

Note that we also need to handle the case of `SUPP` and `NEW` in the data.

In [None]:
def p2f(x):
    if x.strip('%').isnumeric():
        return float(x.strip('%'))/100
    elif x in ['SUPP', 'NEW', 'LOWCOV', 'NA', '']:
        return 0.0
    else:
        return x
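Two things are worth bearing in mind about `p2f`: `str.isnumeric()` is false for strings containing a decimal point, so a value such as `'12.5%'` would be returned unchanged, and missing codes become `0.0`, which is indistinguishable from a genuine zero.  The sketch below is a hypothetical variant (`p2f_nan` is my own name, not part of the TMA02 code and not used in this notebook) that returns `NaN` for the missing codes and accepts decimals.

```python
# the same missing-value codes that p2f checks for
MISSING_CODES = {'SUPP', 'NEW', 'LOWCOV', 'NA', ''}


def p2f_nan(x):
    """Hypothetical variant of p2f: percentages become fractions, missing codes become NaN."""
    text = x.strip()
    if text in MISSING_CODES:
        return float('nan')
    try:
        # float() accepts decimals such as '12.5'
        return float(text.rstrip('%')) / 100
    except ValueError:
        # not a percentage (e.g. a postcode), so hand back the original value
        return x


print(p2f_nan('45%'), p2f_nan('12.5%'), p2f_nan('SUPP'), p2f_nan('AB1 2CD'))
```

Using a variant like this would change the later zero counts and means, so it is shown only for comparison.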

These are the columns to try to convert from percentages. Note that we can be generous here, as columns like PCODE (postcode) will return the original value if the conversion fails.

In [None]:
percent_cols = [f for f in ks2cols['Field Name'] if f.startswith('P')]
percent_cols += ['WRITCOV', 'MATCOV', 'READCOV'] 
percent_cols += ['PTMAT_HIGH', 'PTREAD_HIGH', 'PSENELSAPK', 'PSENELK', 'PTGPS_HIGH']
percent_converters = {c: p2f for c in percent_cols}

In [None]:
ks2_df = pd.read_csv('data/2015-2016/england_ks2final.csv', 
                   na_values=['SUPP', 'NEW', 'LOWCOV', 'NA', '', ' '],
                   converters=percent_converters)

Drop the summary rows, keeping just the rows for mainstream and special schools.

In [None]:
ks2_df = ks2_df[(ks2_df['RECTYPE'] == 1) | (ks2_df['RECTYPE'] == 2)]

Convert everything to numbers, if possible.

In [None]:
ks2_df = ks2_df.apply(pd.to_numeric, errors='ignore')

Merge the LEA data into the school data

In [None]:
ks2_df = pd.merge(ks2_df, leas_df, on=['LEA'])
ks2_df.head().T

__ ----------- END of TMA02 code -----------  __
***

# Convert and store the KS2 dataframe into Mongo for use later

In [None]:
# set up a collection on the database for the ks2 results data
ks2 = db.ks2

In [None]:
# convert the dataframe into a list of dicts and store in Mongo

# the 'records' argument is needed to get a list of dicts
ks2.insert_many(ks2_df.to_dict('records'))

# snippet reference is from:
# https://stackoverflow.com/questions/33979983/insert-rows-from-pandas-dataframe-into-mongodb-collection-as-individual-document
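For reference, this is roughly what the `'records'` orientation produces: one dict per row, keyed by column name, which is the shape `insert_many` expects.  The two-row dataframe is made up for illustration.

```python
import pandas as pd

# hypothetical miniature of the schools dataframe
df = pd.DataFrame({'URN': [100, 101], 'NFTYPE': ['CY', 'AC']})

# one dict per row, keyed by column name
records = df.to_dict('records')
print(records)
```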

In [None]:
# check we got them all
ks2.find().count(), len(ks2_df)

Great all present and correct.  Let's look at one.

In [None]:
ks2.find_one()

In [None]:
ks2.find_one()['GPS_AVERAGE_L']

In [None]:
ks2.find({'GPS_AVERAGE_L': np.nan}).count()

Looks like everything is set up.  We will need to bear the NaN values and missing values that the `p2f` function made into `0.0` in mind throughout the analysis.

In [None]:
ks2.find_one()

Now we have finished with it we can get rid of the ks2 dataframe.

In [None]:
del ks2_df

<a name="importing_ks4"></a>

# Importing the KS4 results dataset

Before we can investigate the data we will need to have a look at it, determine what cleaning if any needs to be done, and store it for access in an appropriate form.

### Look at the KS4 results dataset
Let's have a quick look at the data we will be looking at for the EMA.

In [None]:
!head -5 'data/2015-2016/england_ks4final.csv'

In [None]:
!wc -l 'data/2015-2016/england_ks4final.csv'

The dataset has 5489 rows of data.  There appear to be a large number of columns and a lot of codes that I'll need to look up.  There are also a number of `NA` and `NP` values that could be missing data.  As well as this results data I will need to find and import the relevant metadata file.

Looking through the data/2015-2016 folder there are a number of files that have information on these codes.

In [None]:
!ls data/2015-2016/

There is an abbreviations file, stored as an xlsx file.  I'll have a quick glance at it in Excel.  Having looked the abbreviations up in the abbreviations file, we can see that they have the following meanings:

- _NA_: Not applicable
- _NP_: Not Published
- _NE_: No entries
- _SUPP_: Suppressed (5 or fewer in cohort)
- _LOWCOV_: Low coverage (less than 50% of the cohort)
- _NEW_: New institution

The abbreviations file also has listings of all the school types (NFTYPE) that I will need.  I'll grab that for use later on.

<a name='abbr'></a>

## Importing the abbreviations file

In [None]:
# read in the abbreviations file
abbr_df = pd.read_excel('data/2015-2016/abbreviations.xlsx')
abbr_df

We can see that the school types are rows 2-25, I'll store them as a dict for reference later on.

In [None]:
# relabel the columns
abbr_df.columns = ['label', 'expanded', 'not_needed']

In [None]:
# make a dictionary to easily look up the school types
nftypes = {}
for index, row in abbr_df[3:26].iterrows():
    nftypes[row['label'].strip()] = row['expanded'].strip()
    
nftypes

And, while we have the abbreviations available I'll store the missing value types for reference later on if needed.

In [None]:
# make a dictionary to refer to later of the missing types
missing_types = {}
for i, r in abbr_df[45:51].iterrows():
    missing_types[r['label']] = r['expanded']

missing_types

I can now delete the abbr_df as it won't be needed.

In [None]:
del abbr_df

<a name='meta'></a>

## Importing the KS4 Metadata file

In order to analyse the data we need to be able to reference the columns and the codes they represent.  I'll import the KS4_meta.csv file into the database and use it to help me understand the data in the KS4 results dataset.

In [None]:
!head -5 data/2015-2016/ks4_meta.csv

In [None]:
!wc -l data/2015-2016/ks4_meta.csv

0 lines.. I'll try loading directly into Mongo

In [None]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks4_meta \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks4_meta.csv

Clearly there is an issue with the import.  I'll try importing it into a dataframe.

In [None]:
ks4_meta_df = pd.read_csv('data/2015-2016/ks4_meta.csv')
ks4_meta_df.head()

That imported OK, but there are more columns than I need (I only need it to look up the description for a given term).

In [None]:
# reduce the dataframe to the columns of interest
ks4_meta_df = ks4_meta_df[['Metafile heading', 'Metafile description']]
# relabel them to match my target format
ks4_meta_df.columns = ['label', 'expanded']


In [None]:
# check it looks ok
ks4_meta_df.head()

In [None]:
# set up a reference to the db.collection 
ks4_meta = db.ks4_meta

In [None]:
ks4_meta.insert_many(ks4_meta_df.to_dict('records'))
# snippet reference is from:
# https://stackoverflow.com/questions/33979983/insert-rows-from-pandas-dataframe-into-mongodb-collection-as-individual-document

In [None]:
ks4_meta.find_one({'label': 'NFTYPE'})

I want to add the codes from the abbreviations dictionary to this document since it is one of the backbones to my investigation.

In [None]:
ks4_meta.update_one({'label': 'NFTYPE'}, 
                    {'$set': {'codes': nftypes}})

ks4_meta.find_one({'label': 'NFTYPE'})

I'll do the same for the `RECTYPE` label by splitting the description.

In [None]:
# select the correct document
r = ks4_meta.find_one({'label': 'RECTYPE'})

# checks that we haven't already updated the document
# then if not splits the description string, adding a code key
# to reference each school type
if 'codes' not in r.keys():
    expanded = r['expanded']
    e = expanded[:11]
    codelist = expanded[13:-1].split('; ')
    keys = [c[:1] for c in codelist]
    values = [c[2:] for c in codelist]
    codes = dict(zip(keys, values))
    ks4_meta.update_one({'_id': r['_id']},
                        {'$set': {'expanded': e,
                                  'codes': codes}})

# check that it was processed correctly
ks4_meta.find_one({'label': 'RECTYPE'})

Great.  That is most of the cleaning I need to do for the ks4_meta file.  If I were doing a different investigation I would consider merging in the LEA data here, but for the investigations I plan to do I don't think we need it, and we already have it stored from earlier on (importing ks2) as the `leas_df` dataframe, which we can reference if needed.

Now, in the tm351 module materials we had some handy collections provided by the module team that enabled us to quickly look up the labels and codes of a given accident.  I'll borrow that idea here for my purposes.  Because I will need to do the same for the KS2 dataset, I'll wrap them in a function.

In [None]:
# code adapted from the p14 accidents dataset notebooks

def expanded_label(meta):
    # Load the expanded names of keys and human-readable codes into memory
    expanded_name = collections.defaultdict(str)
    for e in meta.find({'expanded': {"$exists": True}}):
        expanded_name[e['label']] = e['expanded']

    label_of = collections.defaultdict(str)
    for l in meta.find({'codes': {"$exists": True}}):
        for c in l['codes']:
            try:
                label_of[l['label'], int(c)] = l['codes'][c]
            except ValueError: 
                label_of[l['label'], c] = l['codes'][c]
    # return both as a tuple
    return (expanded_name, label_of)
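To see what the two lookups hold without a database connection, the same logic can be run over plain dicts.  This is only a sketch: `build_lookups` is a hypothetical name, and the two sample documents (including the 'School type' and 'Academy' strings) are invented stand-ins for what is stored in the meta collections.

```python
import collections


def build_lookups(docs):
    """Build the two lookup tables from an iterable of meta documents."""
    expanded_name = collections.defaultdict(str)
    label_of = collections.defaultdict(str)
    for doc in docs:
        if 'expanded' in doc:
            expanded_name[doc['label']] = doc['expanded']
        for code, meaning in doc.get('codes', {}).items():
            # numeric codes are keyed as ints, everything else as strings
            try:
                key = int(code)
            except ValueError:
                key = code
            label_of[doc['label'], key] = meaning
    return expanded_name, label_of


# hypothetical documents in the same shape as the meta collections
docs = [
    {'label': 'RECTYPE', 'expanded': 'Record type',
     'codes': {'1': 'Mainstream school'}},
    {'label': 'NFTYPE', 'expanded': 'School type',
     'codes': {'AC': 'Academy'}},
]
names, labels = build_lookups(docs)
print(labels['RECTYPE', 1], '|', labels['NFTYPE', 'AC'])
```

Because both tables are `defaultdict(str)`, looking up an unknown key returns an empty string rather than raising a `KeyError`, and it also adds that key to the table.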

In [None]:
# Set up the expanded_name and label_of for ks4_meta
ks4_expanded_name, ks4_label_of = expanded_label(ks4_meta)

In [None]:
# test it works
[(c, ks4_label_of['RECTYPE', c]) for k, c in ks4_label_of if k == 'RECTYPE']

In [None]:
ks4_expanded_name['NFTYPE']

In [None]:
ks4_label_of['NFTYPE', 'AC']

Great, that is all working.  I can now delete the ks4_meta_df, as the information is stored.

In [None]:
del ks4_meta_df

I'll quickly repeat the same steps for KS2_meta data to include the codes.

In [None]:
# relabel the columns of ks2cols
ks2cols.columns = ['not_needed', 'label', 'expanded']

# create a collection in the database
ks2_meta = db.ks2_meta

In [None]:
# store them into the database
ks2_meta.insert_many(ks2cols[['label', 'expanded']].to_dict('records'))

In [None]:
ks2_meta.find_one()

In [None]:
# repeat the splitting of the `RECTYPE`
# select the correct document
r = ks2_meta.find_one({'label': 'RECTYPE'})

# checks that we haven't already updated the document
# then if not splits the description string, adding a code key
# to reference each school type
if 'codes' not in r.keys():
    expanded = r['expanded']
    e = expanded[:11]
    codelist = expanded[13:-1].split('; ')
    keys = [c[:1] for c in codelist]
    values = [c[2:] for c in codelist]
    codes = dict(zip(keys, values))
    ks2_meta.update_one({'_id': r['_id']},
                        {'$set': {'expanded': e,
                                  'codes': codes}})

# check that it was processed correctly
ks2_meta.find_one({'label': 'RECTYPE'})
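Since the same splitting is applied to both meta collections, it could be pulled out into a small helper.  The sketch below assumes the description has the form `Label (code=meaning; code=meaning)`, which is what the slicing above implies; `split_codes`, the sample string and the `=` separator are my assumptions, not taken from the meta files.

```python
def split_codes(description):
    """Split 'Label (1=first; 2=second)' into the label and a dict of codes."""
    label, _, rest = description.partition(' (')
    codes = {}
    for item in rest.rstrip(')').split('; '):
        key, _, value = item.partition('=')
        codes[key.strip()] = value.strip()
    return label, codes


# hypothetical description string, for illustration only
sample = 'Record type (1=Mainstream school; 2=Special school)'
label, codes = split_codes(sample)
print(label, codes)
```

Splitting on the separators rather than on fixed character positions means the helper would still work if a label had a different length or a code had more than one character.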

In [None]:
# And add the nftype to the meta collection
ks2_meta.update_one({'label': 'NFTYPE'}, 
                    {'$set': {'codes': nftypes}})

ks2_meta.find_one({'label': 'NFTYPE'})

In [None]:
# finally, set up the expanded_name and label_of for ks2_meta
ks2_expanded_name, ks2_label_of = expanded_label(ks2_meta)

check they work ok

In [None]:
# test it works
[(c, ks2_label_of['RECTYPE', c]) for k, c in ks2_label_of if k == 'RECTYPE']

In [None]:
ks2_label_of['NFTYPE', 'IND']

In [None]:
ks2_expanded_name['TELIG']

Great, that is all the metadata handled, and we can now go about importing the KS4 data into the database and cleaning it.

In [None]:
# delete the ks2cols dataframe as we don't need it anymore
del ks2cols

<a name='data'></a>

## Importing the KS4 dataset

Before I import the data I will have another quick look at the file.

In [None]:
! head -5 'data/2015-2016/england_ks4final.csv'

To restate what was noted earlier there appears to be a great number of columns, and a large number of missing values.  How many rows are there?

In [None]:
!wc -l 'data/2015-2016/england_ks4final.csv'

Let's carry out similar steps to those we carried out in importing the ks2 data.  Again, this is going to be adapted from TMA02-Q2.

In [None]:
ks4_df = pd.read_csv('data/2015-2016/england_ks4final.csv')
ks4_df.head()

A straight import gives a warning (`DtypeWarning`) about mixed types in some columns.  Let's look at the file using the tools learned in p2 of the tm351 materials.

In [None]:
# let's quickly look at the file using command line
!file 'data/2015-2016/england_ks4final.csv'

In [None]:
# and check it using chardet
import chardet

# open the file and read the contents in as a byte object
with open('data/2015-2016/england_ks4final.csv', 'rb') as f:
    testfile = f.read()

# detect the file encoding
chardet.detect(testfile)

ks4_df = pd.read_csv('data/2015-2016/england_ks4final.csv', encoding='UTF-8-SIG')
ks4_df.head()
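The `UTF-8-SIG` encoding differs from plain UTF-8 only in that it strips a leading byte order mark (BOM).  A minimal sketch of that, using a made-up header line rather than the real file:

```python
import codecs


def has_utf8_bom(raw):
    """Return True if the byte string starts with the UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)


# hypothetical file contents: a BOM followed by a header row
raw = codecs.BOM_UTF8 + 'URN,NFTYPE\n'.encode('utf-8')

plain = raw.decode('utf-8')       # keeps the BOM as '\ufeff'
clean = raw.decode('utf-8-sig')   # drops the BOM
print(has_utf8_bom(raw), repr(plain[:4]), repr(clean[:3]))
```

If the BOM is left in, it ends up as an invisible character at the front of the first column name, which makes that column awkward to reference by label.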

In [None]:
ks4_df.info()

In [None]:
ks4_df.dtypes

Most of the columns are mixed with an 'object' datatype.

In [None]:
ks4_dt_df = pd.DataFrame()
for col in ks4_df.columns:
    ks4_dt_df[col] = pd.to_numeric(ks4_df[col], errors='ignore')

ks4_dt_df.head()

In [None]:
ks4_dt_df.dtypes

I'm not getting very far here.  I'll try and follow the steps from Q2. If after that I have still made no progress I think that the most efficient way to get to the bottom of it will be to take a look at the file in OpenRefine to clean the mixed datatypes and determine what to do with the missing data.

__--- steps adapted from tma02 cleaning ---__

I'll find out which columns have percentages in them.

In [None]:
# Look through the meta file and get the columns that are percentages.
percent_cols_list = [(l, ks4_expanded_name[l]) 
                     for l in ks4_expanded_name 
                     if 'percent' in ks4_expanded_name[l].lower()]
percent_cols_list

In [None]:
# Save the column headings to a list
percent_cols = [p[0] for p in percent_cols_list]
percent_cols

In [None]:
# int columns
int_col_list = [(l, ks4_expanded_name[l])
                for l in ks4_expanded_name 
                if 'number' in ks4_expanded_name[l].lower()]
int_col_list

In [None]:
# again, save out just the column headings
int_cols = [i[0] for i in int_col_list]
int_cols

In [None]:
# remind myself of the missing type codes
missing_types


In [None]:
percent_converters = {c: p2f for c in percent_cols}

Read in the file to a dataframe

In [None]:
ks4_df = pd.read_csv('data/2015-2016/england_ks4final.csv',
                     na_values=['SUPP', 'NEW', 'LOWCOV', 'NA', ''],
                     converters=percent_converters)

The import is still showing the warning about the data types.  I will continue walking through the cleaning steps from tma02-q2.  Since my questions focus only on mainstream schools, we can drop the rows where `RECTYPE` is not 1.

In [None]:
ks4_df = ks4_df[ks4_df['RECTYPE'] == 1]

Convert everything to numbers, if possible.


In [None]:
ks4_df = ks4_df.apply(pd.to_numeric, errors='ignore')

Merge the LEA data into the school data.

In [None]:
ks4_df = pd.merge(ks4_df, leas_df, on=['LEA'])
ks4_df.head().T

That is looking better.  I'll now import these into MongoDB.

In [None]:
# create a collection in the database
ks4 = db.ks4

In [None]:
# insert the cleaned dataframe to the database
ks4.insert_many(ks4_df.to_dict('records'))

In [None]:
# check that the correct number of documents were included
len(ks4_df), ks4.find().count()

In [None]:
ks4.find_one()

This is an independent school.  Since independent schools don't need to publish their data, there are a lot of missing values.  This is something we will need to be mindful of when carrying out the analysis.  Although the percentages have been handled, there are a number of other measures that are still showing 'NP'.  Since the majority of the measures I will be looking at are percentages, instead of working through every single measure I will determine those I want to use in my investigation and then clean those as needed.

In [None]:
ks4_expanded_name['P8MEA_AV']

In [None]:
# how many independent schools are in the dataset?
ks4.find({'NFTYPE': 'IND'}).count()

In [None]:
# look at another school type and find out how many there are
nftypes

In [None]:
# how many Community schools are there?
ks4.find({'NFTYPE': 'CY'}).count()

In [None]:
# delete the dataframes I don't need
del ks4_df, ks4_dt_df

Good, things look clean enough to start working on the investigation.
<a name='q1'></a>

# Keystage 4 Investigation.  

## Q -  Does the type of school impact the results students achieve at keystage 4?

How big is the dataset?

In [None]:
ks4.find().count()

So there are a large number of documents in the dataset (after taking out the non-mainstream schools)

<a name="measures"></a>

# Choosing performance measures

Before I can analyse the data I need to decide what I mean by 'good performance' and, once that is settled, which of the many data points I will use as measures for comparing school types.

For a long time the standard measure of a successful school was the percentage of pupils achieving grades A*-C in Maths and English.  This has changed recently, with the government introducing the new 'Progress 8' and 'Attainment 8' metrics and the English Baccalaureate, which covers English, Maths, the sciences (including computer science), history or geography, and a modern or ancient foreign language.  So, I will try to look at these as the success measures of a school and, if possible, combine them.

So the first step I need to take is to identify the keys for the data I want to query.

In [None]:
# print all the keys and values of the meta data
# to help choose the columns I will use
for d in ks4_meta.find():
    print(d['label'], ':', d['expanded'], '\n')

Looking through these, it is clear that I will need to be selective in choosing measures.  There are a great many ways to subdivide this dataset and investigate it.  I will focus on the average figures for the whole school, covering every student.  There will of course be cases where this skews the results.

For instance, at schools with many disadvantaged students the average scores could be affected, and without including measures of disadvantage the results cannot be fully comprehensive.  That said, it is beyond the scope of this project to examine every possible facet of the dataset.

In [None]:
# how many scores are there for the attainment 8 measure?
test_df = pd.DataFrame(list(ks4.find({}, {'ATT8SCR':1, '_id': 0})))
test_df.count()

In [None]:
# and how many for attainment 8 in 2015?
test_df = pd.DataFrame(list(ks4.find({}, {'ATT8SCR_15':1, '_id': 0})))
test_df.count()

In [None]:
del test_df

I will look at the following basic performance measures to compare KS4 school types.

- `PTEBACC_PTQ_EE` : Percentage of key stage 4 pupils achieving the English Baccalaureate 
- `PTAC5EM_PTQ_EE` : Percentage of pupils achieving 5+ A*-C or equivalents including A*-C in both English and mathematics GCSEs 
- `ATT8SCR` : Average Attainment 8 score per pupil
- `P8MEA` : Progress 8 measure 
- `URN`: To keep track of which school we want

In [None]:
# Create a dataframe of just the measures I will be investigating.
ks4_results_df = pd.DataFrame(list(ks4.find({}, 
                                {'NFTYPE':1, 
                                'PTEBACC_PTQ_EE':1,
                                'PTAC5EM_PTQ_EE':1,
                                'ATT8SCR':1,
                                'P8MEA':1,
                                'URN': 1,
                                 '_id': 0
                               })))
ks4_results_df.head()

It looks like there is still some cleaning to do.  In particular the `NP` values.  Also I will need to decide what to do with the independent schools.

<a name='add_clean'></a>

# Additional cleaning

In [None]:
# look at the missing types
missing_types

In [None]:
# function that replaces missing-value codes (e.g. 'NP') with NaN
# and leaves every other value unchanged
def clean(value):
    if isinstance(value, str) and value.strip() in missing_types:
        return np.nan
    return value


In [None]:
# make a list of the measures I'll use to clean on
measures = [c for c in ks4_results_df.columns if c != 'URN']

In [None]:
# update the database with cleaned values
for d in ks4.find():
    for k in d.keys():
        if k in measures:
            # update the value on the database
            ks4.update_one({'_id': d['_id']},
                           {'$set': {k: clean(d[k])}})

In [None]:
# recreate the dataframe
ks4_results_df = pd.DataFrame(list(ks4.find({}, 
                                {'NFTYPE':1,
                                 'PTEBACC_PTQ_EE':1,
                                 'PTAC5EM_PTQ_EE':1,
                                 'ATT8SCR':1,
                                 'P8MEA':1,
                                 'URN':1,
                                 '_id':0
                                })))
ks4_results_df.head()

That is better.  Next I'll drop the missing values.

In [None]:
# preview what dropping the missing values will do
ks4_results_df.dropna()

There will still be a few rows with missing data, particularly those with an `NFTYPE` of `F`.  What is that school type anyway?

In [None]:
ks4_label_of['NFTYPE','F']

Let's have a quick look at one file to see if we can see what is going on.

In [None]:
ks4.find_one({'NFTYPE': 'F'})

Ah, there is a lot of blank space there (`' '`).  I'll clean that up by removing those keys from the database completely.

In [None]:
# loop through the documents and remove keys that have ' ' as the value
# this step could take some time

for d in ks4.find({}):
    for k in d.keys():
        if d[k] == ' ':
            ks4.update_one({'_id': d['_id']},
                           {'$unset': {k: ''}})
        

In [None]:
# make the dataframe again
ks4_results_df = pd.DataFrame(list(ks4.find({}, 
                                {'NFTYPE':1, 
                                'PTEBACC_PTQ_EE':1,
                                'PTAC5EM_PTQ_EE':1,
                                'ATT8SCR':1,
                                'P8MEA':1,
                                'URN':1,
                                '_id':0
                               })))
ks4_results_df.head(10)            

Since independent schools do not publish their data when I drop the missing data I will also lose these.  That is ok for the scope of this investigation but it should be noted.

Drop the missing data.

In [None]:
# drop the na. missing data
ks4_results_df.dropna(inplace=True)
# reset the index
ks4_results_df.reset_index(inplace=True, drop=True)
#preview the dataframe
ks4_results_df.head(10)


Great that is looking better.

In [None]:
ks4_results_df.info()

There are still two problem columns (`ATT8SCR` and `P8MEA`)

In [None]:
# let's peek at the values
ks4_results_df['ATT8SCR'].unique()[:15]

Right, so they are all strings.  Is it the same for P8MEA?

In [None]:
# let's peek at the values
ks4_results_df['P8MEA'].unique()[:15]

ok so let's clean those up too!

In [None]:
# update the database documents
for d in ks4.find():
    for k in d.keys():
        if k in ['P8MEA', 'ATT8SCR']:
            ks4.update_one({'_id': d['_id']},
                           {'$set': {k: float(d[k])}}
                          )


In [None]:
# check the values have been updated
ks4.find_one({'NFTYPE':'CY'},{'P8MEA':1})

Great, now I can recreate the dataframe.

In [None]:
# make the dataframe again
ks4_results_df = pd.DataFrame(list(ks4.find({}, 
                                {'NFTYPE':1, 
                                'PTEBACC_PTQ_EE':1,
                                'PTAC5EM_PTQ_EE':1,
                                'ATT8SCR':1,
                                'P8MEA':1,
                                'URN':1,
                                '_id':0
                               })))   

In [None]:
# drop the na. missing data
ks4_results_df.dropna(inplace=True)
# reset the index
ks4_results_df.reset_index(inplace=True, drop=True)
#preview the dataframe
ks4_results_df.head(10)

<a name="Q1_a"></a>

# Q: Does the type of school impact the results students achieve at keystage 4?

<a name="ks4_summary_stats"></a>

### Summary stats

In [None]:
## Cleaned data stats
ks4_results_df.info()

In [None]:
ks4_results_df.describe()

How many values are 0.0?  (Some were possibly added by the `p2f` function.)

In [None]:
# GCSE A*-C
ks4_results_df[ks4_results_df['PTAC5EM_PTQ_EE']==0]['PTAC5EM_PTQ_EE'].count()

In [None]:
# EBacc
ks4_results_df[ks4_results_df['PTEBACC_PTQ_EE']==0]['PTEBACC_PTQ_EE'].count()

Both are relatively small numbers, but I should still keep them in mind when calculating values.

<a name='school_type_plots'></a>

# School type grouped plots

In [None]:
# Group the mean results by school type
grouped_res = ks4_results_df[['NFTYPE', 'ATT8SCR', 'P8MEA', 'PTAC5EM_PTQ_EE', 'PTEBACC_PTQ_EE']].groupby(by='NFTYPE').mean()
grouped_df = pd.DataFrame(grouped_res)
grouped_df

In [None]:
# visualise them quickly
grouped_df.plot(kind='bar', subplots=True)

Interesting, there appears to be something of a pattern in these groupings.

In [None]:
# sort the values to compare
grouped_df.sort_values('PTAC5EM_PTQ_EE').plot(kind='bar', subplots=True)

In [None]:
ks4_expanded_name['P8MEA']

Let's tidy up these plots a little by adding human-readable codes and changing the percentages into 0-100 values.

In [None]:
# provide human readable codes
grouped_df.index = [nftypes[code] for code in grouped_df.index]

In [None]:
# make the percentages range from 0-100
grouped_df['PTAC5EM_PTQ_EE'] = grouped_df['PTAC5EM_PTQ_EE']*100
grouped_df['PTEBACC_PTQ_EE'] = grouped_df['PTEBACC_PTQ_EE']*100

In [None]:
grouped_df.sort_values('PTAC5EM_PTQ_EE').plot(kind='bar')

There does appear to be a relationship between school type and the results achieved, using the three most common measures.  The mean value for a school type on one measure appears to track its means on the other measures.

Let's see them in pairs to see how they compare.

## Plot of mean EBACC to GCSE performance by school type (%)

In [None]:
grouped_df[['PTEBACC_PTQ_EE', 'PTAC5EM_PTQ_EE']].sort_values('PTEBACC_PTQ_EE').plot(kind="bar")

In [None]:
grouped_df[['PTEBACC_PTQ_EE', 'PTAC5EM_PTQ_EE']].sort_values('PTEBACC_PTQ_EE').plot(kind="bar", subplots=True)

There appears to be a link between EBacc and GCSE performance and the school type.

## Plot of mean Attainment8 to Progress8 performance by school type

In [None]:
grouped_df[['ATT8SCR', 'P8MEA']].sort_values('ATT8SCR').plot(kind="bar")

This isn't that clear, so I'll use subplots to show it more clearly.

In [None]:
grouped_df[['ATT8SCR', 'P8MEA']].sort_values('ATT8SCR').plot(kind="bar", subplots=True)

In all of these cases there appears to be a pattern.  Let's get the top performers and low performers for each measure.

In [None]:
for c in grouped_df.columns:
    p = grouped_df[c].idxmax(), grouped_df[c].idxmin()
    print(c, ': ', ks4_expanded_name[c])
    print('Top school type: ', round(grouped_df.loc[p[0]][c], 2), p[0])
    print('Bottom school type', round(grouped_df.loc[p[1]][c], 2), p[1], '\n')

Clearly, by every measure considered, _City Technology Colleges_ are top and _Further Education Sector Institutions_ are bottom.

## Double checking - accounting for the potential extra 0.0 values

In the data preparation phase we used the p2f function.  Some of the values may have been set to 0 by that step and could be skewing this analysis, although only a small number of values are involved.  One option is to calculate the mean of the column without them and replace the 0 values with it; the simpler check used here is to leave those rows out and see whether that changes the earlier findings.

I'll repeat the same steps without the 0.0 values that could have been introduced by the import, to compare.
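
As a rough sketch of the replacement idea mentioned above (the cells below take the simpler route of filtering the rows out), the 0.0 values could be swapped for the mean of the remaining values like this.  The toy dataframe and its numbers are made up purely for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for ks4_results_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    'NFTYPE': ['AC', 'AC', 'CY', 'CY'],
    'PTAC5EM_PTQ_EE': [0.6, 0.0, 0.5, 0.7],
})

col = 'PTAC5EM_PTQ_EE'

# mean of the column calculated without the suspect 0.0 values
non_zero_mean = toy_df.loc[toy_df[col] != 0, col].mean()

# replace the 0.0 values with that mean, leaving every other value untouched
imputed_df = toy_df.copy()
imputed_df[col] = imputed_df[col].where(imputed_df[col] != 0, non_zero_mean)

# grouped means before and after, so the effect on each school type is visible
before = toy_df.groupby('NFTYPE')[col].mean()
after = imputed_df.groupby('NFTYPE')[col].mean()
print(pd.DataFrame({'before': before, 'after': after}))
```

Comparing the `before` and `after` columns shows how much a handful of 0.0 values can pull a group mean down when the group is small.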

In [None]:
# filter out the rows that could be affected
res = ks4_results_df[(ks4_results_df['PTAC5EM_PTQ_EE']!=0) & (ks4_results_df['PTEBACC_PTQ_EE']!=0)]

# group the data by school type
grouped_res = res[['NFTYPE', 'ATT8SCR', 'P8MEA', 'PTAC5EM_PTQ_EE', 'PTEBACC_PTQ_EE']].groupby(by='NFTYPE').mean()
grouped_gt0_df = pd.DataFrame(grouped_res)
grouped_gt0_df

In [None]:
# make the percentages prettier (out of 100)
grouped_gt0_df['PTAC5EM_PTQ_EE'] = grouped_gt0_df['PTAC5EM_PTQ_EE']*100
grouped_gt0_df['PTEBACC_PTQ_EE'] = grouped_gt0_df['PTEBACC_PTQ_EE']*100

In [None]:
# provide human readable codes
grouped_gt0_df.index = [nftypes[code] for code in grouped_gt0_df.index]

In [None]:
grouped_gt0_df.sort_values('P8MEA').plot(kind='bar')

## Plot of EBACC to GCSE performance by school type (%) non-0


In [None]:
grouped_gt0_df[['PTEBACC_PTQ_EE', 'PTAC5EM_PTQ_EE']].sort_values('PTEBACC_PTQ_EE').plot(kind="bar")

In [None]:
grouped_gt0_df[['PTEBACC_PTQ_EE', 'PTAC5EM_PTQ_EE']].sort_values('PTEBACC_PTQ_EE').plot(kind="bar", subplots=True)

There appears to be a link between school type and both EBacc and GCSE performance.

## Plot of Attainment8 to Progress8 performance by school type ( non 0 )

In [None]:
grouped_gt0_df[['ATT8SCR', 'P8MEA']].sort_values('ATT8SCR').plot(kind="bar", subplots=True)

In all of these cases there appears to be a pattern.  Let's get the top performers and low performers for each measure.

In [None]:
for c in grouped_gt0_df.columns:
    p = grouped_gt0_df[c].idxmax(), grouped_gt0_df[c].idxmin()
    print(c, ': ', ks4_expanded_name[c])
    print('Top school type: ', round(grouped_gt0_df.loc[p[0]][c], 2), p[0])
    print('Bottom school type', round(grouped_gt0_df.loc[p[1]][c], 2), p[1], '\n')

So even after leaving out the small number of 0 values that were potentially added during the import, the underlying results have not been affected.

However, these findings are based on mean grouped values.  To get more clarity, the next step is to use machine learning to cluster the ungrouped data and then look at each cluster group to see the distribution of school types in each group.

There does appear to be some kind of link between the type of school and the results for both the English Baccalaureate and the older 5+A*-C GCSEs.

The correlation between 5+A*-C and the EBacc makes sense because the EBacc is achieved across a variety of subjects including Maths, English, Sciences, a language and history or geography, and the 5+A*-C measure also includes Maths and English.  Naturally, there will be a correlation between the two.
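
To put a number on that relationship rather than relying on the bar charts, a Pearson correlation could be calculated between the two columns.  This is only a sketch on invented values; the column names match the dataset but the numbers do not come from it.

```python
import pandas as pd
import scipy.stats

# Invented values standing in for the two percentage measures.
toy_df = pd.DataFrame({
    'PTAC5EM_PTQ_EE': [0.35, 0.50, 0.55, 0.62, 0.80],
    'PTEBACC_PTQ_EE': [0.10, 0.18, 0.22, 0.25, 0.45],
})

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = scipy.stats.pearsonr(toy_df['PTAC5EM_PTQ_EE'],
                                  toy_df['PTEBACC_PTQ_EE'])
print('r =', round(r, 3), ' p =', round(p_value, 4))
```

A value of r close to 1 would support the idea that the two measures move together; with so few school types in the grouped data the p-value should be read cautiously.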

<a name='machine_learning'></a>

# Machine Learning

<a name="grouped_cluster"></a>

# Grouped data Cluster analysis

Let's start by looking at the scatter of some of these different measures.

In [None]:
# Progress 8 and GCSE results (5+ A*-C)
grouped_df.plot(kind='scatter', x='P8MEA', y='PTAC5EM_PTQ_EE')

In [None]:
# Progress 8 and ATT8SCR
grouped_df.plot(kind='scatter', x='P8MEA', y='ATT8SCR')

In [None]:
# GCSE and ATT8SCR
grouped_df.plot(kind='scatter', x='PTAC5EM_PTQ_EE', y='ATT8SCR')

In [None]:
# GCSE and EBACC results
grouped_df.plot(kind='scatter', x='PTAC5EM_PTQ_EE', y='PTEBACC_PTQ_EE')

In [None]:
# Attainment 8 and EBACC results
grouped_df.plot(kind='scatter', x='ATT8SCR', y='PTEBACC_PTQ_EE')

In [None]:
# Progress 8 and EBACC results
grouped_df.plot(kind='scatter', x='P8MEA', y='PTEBACC_PTQ_EE')

For the rest of this investigation I will narrow down the focus of performance measures to a combination of P8 measure and the GCSE.  I have chosen these two because:
 - Progress 8 - students progress is measured in comparison to other students across the country of similar starting ability, thus it is a good measure for comparing schools
 - 5+A*-C GCSE inc. Math&Eng - despite being an older measure it is still widely used and understood by everyone.

To start with I'll follow the steps in the module materials p21.1

### initial values for k = 2


In [None]:
initialCentroids_df = pd.DataFrame({'P8MEA': [-0.5, 0], 
                                    'PTAC5EM_PTQ_EE': [18, 30]}, 
                                   columns=['P8MEA', 'PTAC5EM_PTQ_EE'])

initialCentroids_df

and plot these on a scatter plot with the data points:

In [None]:
plt.scatter(grouped_df['P8MEA'], grouped_df['PTAC5EM_PTQ_EE'])

plt.xlabel('Progress 8 measure')
plt.ylabel('5+A*-C GCSE')

plt.title('School Type KS4 performance 2015-2016 with initial cluster centroids')

# Plot the initial centroids:
for i in initialCentroids_df.index:
    plt.plot(initialCentroids_df.iloc[i]['P8MEA'],
             initialCentroids_df.iloc[i]['PTAC5EM_PTQ_EE'],
             color='black', marker='x', mew=2)

In [None]:
# initialise the clustering object
kmeans2 = cluster.KMeans(n_clusters=2,
                         init=initialCentroids_df)

In [None]:
# fit the object to the data
assignedClusters_clust = kmeans2.fit(grouped_df[['P8MEA', 'PTAC5EM_PTQ_EE']])
assignedClusters_clust.labels_

... and plot the clustered data along with the final centroids:

In [None]:
# Plot the data points that are in the cluster labelled '0'
plt.scatter(grouped_df['P8MEA'][assignedClusters_clust.labels_==0],
            grouped_df['PTAC5EM_PTQ_EE'][assignedClusters_clust.labels_==0],
            color='red', marker='o', label='cluster 0')

# Plot the data points that are in the cluster labelled '1'
plt.scatter(grouped_df['P8MEA'][assignedClusters_clust.labels_==1],
            grouped_df['PTAC5EM_PTQ_EE'][assignedClusters_clust.labels_==1],
            color='blue', marker='o', label='cluster 1')

# plot each of the centroids:
for (cx, cy) in assignedClusters_clust.cluster_centers_:
    plt.plot(cx, cy, color='black', marker='x', mew=2)
    
plt.legend()

plt.xlabel('Progress 8 measure')
plt.ylabel('5 A*-C GCSE - inc. Math&English')

plt.title('School Type KS4 performance 2015-2016 2-means clustering with centroids')

plt.show()


## K-means = 4

To make generating centroids quicker I'll make a quick function to speed up the process for me

In [None]:
import random
# set the seed so that interpretation of the analysis is consistent on each run.
random.seed(283)

In [None]:
# helper function to quickly generate a dataframe
# of a given number of random centroids to speed things up
def random_centroids(df, x_col, y_col, num_centroids):
    
    # determine the ranges of the data (each axis uses its own column)
    x_range = [min(df[x_col]), max(df[x_col])]
    y_range = [min(df[y_col]), max(df[y_col])]
    
    # make a collection to store the centroids
    centroids = collections.defaultdict(list)
    
    # generate the random values 
    for i in range(num_centroids):
        centroids['X'].append(random.uniform(min(x_range), max(x_range)))
        centroids['Y'].append(random.uniform(min(y_range), max(y_range)))
    
    # return as a dataframe
    return pd.DataFrame(centroids)

In [None]:
initial_centroids = random_centroids(grouped_df, 'P8MEA', 'PTAC5EM_PTQ_EE', 4)

In [None]:
initial_centroids

In [None]:
# create a k-means 4 cluster
kmeans4 = cluster.KMeans(n_clusters=4, init=initial_centroids)

In [None]:
# fit the cluster object to the data
assigned_clust = kmeans4.fit(grouped_df[['P8MEA', 'PTAC5EM_PTQ_EE']])

In [None]:
# Helper function to plot clustering results, 
# colors do not need to be set unless num is greater than 6
def plot_cluster(data_x, data_y, assigned_clust, k, 
                 cluster_labels=None, plt_labels=None, 
                 plt_title=None, colors=None, save=False):
    # set default colors
    if colors is None:
        colors = sns.palettes.color_palette(n_colors=k)
    # set default labels
    if cluster_labels is None:
        cluster_labels = range(k)
    
    plt.figure()
    # plot the cluster group    
    for c in range(k):
        plt.scatter(data_x[assigned_clust.labels_==c],
                    data_y[assigned_clust.labels_==c],
                    color=colors[c], marker='o', label=cluster_labels[c]
                   )
        
    # plot the centroids
    for (cx, cy) in assigned_clust.cluster_centers_:
        plt.plot(cx, cy, color='black', marker='x', mew=1)
    
    # prettify it
    plt.legend()
    
    plt.xlabel(plt_labels[0])
    plt.ylabel(plt_labels[1])
        
    plt.title(plt_title)
    
    # save it out if wanted
    if save:
        plt.savefig('plot_images/cluster '+ plt_title + ' k='+str(k))

In [None]:
# plot the k-means 4
plot_cluster(grouped_df['P8MEA'], grouped_df['PTAC5EM_PTQ_EE'], assigned_clust, 4,
             cluster_labels=['a', 'b', 'c', 'd'],
             plt_labels=['Progress 8 (Average)', '5+A*-C GCSE (%)'],
             plt_title='School Type KS4 performance 2015-2016 \n4-means clustering with centroids', save=True)

## Run it again to check for variation in clusters

In [None]:
# make another run to see if the groupings vary 
initial_centroids = random_centroids(grouped_df, 'P8MEA', 'PTAC5EM_PTQ_EE', 4)

In [None]:
# create a k-means 4 cluster
kmeans4 = cluster.KMeans(n_clusters=4, init=initial_centroids)

In [None]:
# fit the cluster object to the data
assigned_clust = kmeans4.fit(grouped_df[['P8MEA', 'PTAC5EM_PTQ_EE']])

In [None]:
# plot the k-means 4
plot_cluster(grouped_df['P8MEA'], grouped_df['PTAC5EM_PTQ_EE'], assigned_clust, 4,
             cluster_labels=['a', 'b', 'c', 'd'],
             plt_labels=['Progress 8', '5+A*-C GCSE'],
             plt_title='School Type KS4 performance 2015-2016 \n4-means clustering with centroids')

Interesting: the warning message indicates that, because explicit initial centroids were passed, only a single initialisation is run, and that we don't need to declare the centroids at all.  A quick search online ([stack overflow link](https://stackoverflow.com/questions/28862334/k-means-with-selected-initial-centers)) also explains this.  If we don't pass initial_centroids in, then the method will use the default of 10 random initialisations.
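
A minimal sketch of the two ways of starting k-means, on made-up data rather than the school results: passing explicit starting centroids with `n_init=1`, or leaving `init` at its default and letting scikit-learn try several random starts.  The array names (`X`, `start`) and the blob positions are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy two-column data: two well separated blobs (invented values).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=[-1.0, 20.0], scale=0.1, size=(20, 2)),
               rng.normal(loc=[0.5, 60.0], scale=0.1, size=(20, 2))])

# explicit starting centroids: one run is all that makes sense,
# so n_init=1 is stated to avoid the warning
start = np.array([[-0.5, 18.0], [0.0, 30.0]])
fixed = KMeans(n_clusters=2, init=start, n_init=1).fit(X)

# no starting centroids: several random starts are tried
# and the one with the lowest inertia is kept
default = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(fixed.cluster_centers_)
print(default.cluster_centers_)
```

With explicit centroids the cluster numbering follows the order of the rows in `start`, which is handy when the labels need to mean the same thing on every run.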

## K-means 4, third trial.

In [None]:
# initialise the cluster object
kmeans4 = cluster.KMeans(n_clusters=4)

# fit the cluster object to the data
assigned_clust = kmeans4.fit(grouped_df[['P8MEA', 'PTAC5EM_PTQ_EE']])

In [None]:
# plot the k-means 4
plot_cluster(grouped_df['P8MEA'], grouped_df['PTAC5EM_PTQ_EE'], assigned_clust, 4,
             cluster_labels=['a', 'b', 'c', 'd'],
             plt_labels=['Progress 8', '5+A*-C GCSE'],
             plt_title='School Type KS4 performance 2015-2016 \n4-means clustering with centroids')

The variety in the clusterings shows that they are unstable.  I'll try a couple of higher k values to see how it performs.
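
One way to put a number on that instability, offered as a sketch rather than as part of the original analysis, is to fit k-means several times with different seeds and compare the labellings with the adjusted Rand index (1.0 means the groupings are identical).  The data below is random toy data with about as few points as there are school types.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# A small random dataset (invented) standing in for the grouped school types.
rng = np.random.RandomState(1)
X = rng.uniform(size=(12, 2))

# fit the same model with five different random seeds, one start each
labels_by_seed = [
    KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X).labels_
    for seed in range(5)
]

# compare every later run against the first one
scores = [adjusted_rand_score(labels_by_seed[0], labels)
          for labels in labels_by_seed[1:]]
print([round(s, 2) for s in scores])
```

Scores well below 1.0 would confirm what the plots suggest: with this few points the grouping depends heavily on where the centroids start.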

## K-means 5

In [None]:
# initialise the cluster object
kmeans = cluster.KMeans(n_clusters=5)

# fit the cluster object to the data
assigned_clust = kmeans.fit(grouped_df[['P8MEA', 'PTAC5EM_PTQ_EE']])

In [None]:
# plot the k-means 5
plot_cluster(grouped_df['P8MEA'], grouped_df['PTAC5EM_PTQ_EE'], assigned_clust, 5,
             plt_labels=['Progress 8', '5+A*-C GCSE'],
             plt_title='School Type KS4 performance 2015-2016 \n5-means clustering with centroids')

## K-means 6

In [None]:
# initialise the cluster object
kmeans = cluster.KMeans(n_clusters=6)

# fit the cluster object to the data
assigned_clust = kmeans.fit(grouped_df[['P8MEA', 'PTAC5EM_PTQ_EE']])

In [None]:
# plot the k-means 6
plot_cluster(grouped_df['P8MEA'], grouped_df['PTAC5EM_PTQ_EE'], assigned_clust, 6,
             plt_labels=['Progress 8', '5+A*-C GCSE'],
             plt_title='School Type KS4 performance 2015-2016 \n6-means clustering with centroids')

There are clearly a few ways that this dataset can be clustered.  It is perhaps worth noting that most of the school types are around the national average for Progress 8 (0), and above 50% for the GCSE measure.  There are two clear outliers: one at the top right (best score for both GCSE and Progress 8) and one at the bottom (second-worst GCSE and by far the worst Progress 8).  Let's identify which school types they are.

In [None]:
grouped_df['P8MEA'].idxmax()

In [None]:
grouped_df['P8MEA'].idxmin()

In [None]:
grouped_df['PTAC5EM_PTQ_EE'].idxmax()

In [None]:
grouped_df['PTAC5EM_PTQ_EE'].idxmin()

So in both cases the top school type is 'City technology college' and the bottom performer is 'Further Education Sector Institution' 

How many of each are there in the dataset?

In [None]:
ks4_results_df[ks4_results_df['NFTYPE']=='CTC']['NFTYPE'].count()

So there are only 3 schools of that type with results recorded in our cleaned dataset.  What about the 'Further Education Sector Institution'?

In [None]:
ks4_results_df[ks4_results_df['NFTYPE']=='FESI']['NFTYPE'].count()

12 is a few more but still not that many.

In [None]:
(round(3/len(ks4_results_df)*100,4), round(12/len(ks4_results_df)*100, 4))

Both are a fraction of 1 percent of the whole dataset.
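
Rather than counting one school type at a time, the counts and percentage shares for every type could be produced in one go.  This is a sketch on an invented series of school type codes, not the real data.

```python
import pandas as pd

# Invented school type codes standing in for ks4_results_df['NFTYPE'].
toy_types = pd.Series(['AC'] * 6 + ['CY'] * 3 + ['CTC'] * 1, name='NFTYPE')

# number of schools of each type
counts = toy_types.value_counts()

# share of the whole dataset, as a percentage
share = (toy_types.value_counts(normalize=True) * 100).round(2)

summary = pd.DataFrame({'count': counts, 'percent': share})
print(summary)
```

A table like this makes it easy to spot which group means rest on only a handful of schools.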

<a name="machine_learning"></a>

# Machine Learning

## k-means cluster analysis of the ungrouped dataset.

To get a better understanding of the school type distribution in each cluster, I need to cluster across the whole dataset on those performance measures.

In [None]:
ks4_results_df.head()

In [None]:
ks4_results_df.plot(kind='scatter', x='P8MEA', y='PTAC5EM_PTQ_EE')

With there being a range of different measures to look at in combination I will create another helper function to make the process more efficient.

In [None]:
# helper function to fill out boilerplate code
# initialises a kmeans cluster object and fits it to the data
# then plots it using the plot_cluster method defined earlier.
def kmeans_plot(df, x_column, y_column, k,
                cluster_labels=None, plt_labels=None, plt_title=None,
                colors=None, initial_centroids=None, save=False
                ):
    
    # create k-means cluster object
    # (use `is None`: comparing a DataFrame to None with == is ambiguous in an if)
    if initial_centroids is None:
        kmeans_clust = cluster.KMeans(n_clusters=k)
    else:
        kmeans_clust = cluster.KMeans(n_clusters=k, init=initial_centroids)
    # fit the object to the data
    assigned_clust = kmeans_clust.fit(df[[x_column, y_column]])

    # plot the kmeans cluster
    plot_cluster(df[x_column], df[y_column], assigned_clust, k,
                 cluster_labels=cluster_labels, plt_labels=plt_labels,
                 plt_title=plt_title, colors=colors, save=save
                 )

Now I have a handy function I can iterate through some different k-values and see which k value fits the data the best.

## Cluster Groups of Progress 8 and 5+A*-C measures

Remind myself of the column names so I can look them up

In [None]:
ks4_results_df.columns


Iterate with different k values (2 - 11) and plot each one.

In [None]:
for k in range(2,12):
    title = 'KS4 results cluster groups k=' + str(k)
    cluster_labels = ['Group ' + str(i) for i in range(k)]
    kmeans_plot(ks4_results_df, 'P8MEA', 'PTAC5EM_PTQ_EE', k=k,
                plt_title=title, cluster_labels=cluster_labels,
                plt_labels=['Progress 8', '5+A*-C GCSE'], save=True)

The additional clusters seem to break the data into narrower and narrower segments, but they do appear to be quite spread.  For our needs I think k=4 is good to look at in more detail, as I am trying to identify groups that perform well.
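
Judging k by eye is subjective.  A common complementary check, not used elsewhere in this notebook, is to look at how the within-cluster sum of squares (`inertia_`) falls as k increases and look for the point where the improvement levels off.  The sketch below uses invented blob data rather than the KS4 results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three invented blobs standing in for the two performance measures.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(-1, 0), (0, 1), (1, 0)]])

# record the within-cluster sum of squares for each k
inertias = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Inertia always tends to fall as k grows, so it is the size of each drop, not the lowest value, that hints at a sensible k.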

## KS4 results data cluster group plot k=4

I need a little more control over the plotting so I can set the legend position if needed.

In [None]:
title = 'KS4 results cluster groups k=4'

kmeans_plot(ks4_results_df, 'P8MEA', 'PTAC5EM_PTQ_EE', k=4,
            plt_title=title, cluster_labels=['Group ' + str(i) for i in range(4)],
            plt_labels=['Progress 8', '5+A*-C GCSE'])

Now, to use the clusters as a filter to separate the data, I will need to have access to the cluster object, or supply it to the function to be plotted.

While editing I'll allow for a little more customisation of the plotting.

In [None]:
# Helper function to plot clustering results
# allows some plot visualisations to be specified
def plot_cluster_2(data_x, data_y, assigned_clust, k, 
                 cluster_labels=None, plt_labels=None, 
                 plt_title=None, colors=None, legend_loc=None,
                 opacity=1, save=False 
                ):
    # set default colors
    if colors is None:
        colors = sns.palettes.color_palette(n_colors=k)
    # set default labels
    if cluster_labels is None:
        cluster_labels = ['Group ' + str(i) for i in range(k)]
    
    plt.figure()
    # plot the cluster group    
    for c in range(k):
        plt.scatter(data_x[assigned_clust.labels_==c],
                    data_y[assigned_clust.labels_==c],
                    color=colors[c], marker='o', 
                    label=cluster_labels[c],
                    alpha=opacity
                   )
    # plot the centroids
    for (cx, cy) in assigned_clust.cluster_centers_:
        plt.plot(cx, cy, color='black', marker='x', mew=1)
    
    # add the legend
    plt.legend(loc=legend_loc)
        
    plt.xlabel(plt_labels[0])
    plt.ylabel(plt_labels[1])
        
    plt.title(plt_title)
    
    if save:
        plt.savefig('plot_images/cluster '+ plt_title + ' k='+str(k))


Now I can create a cluster group and then use it to plot the groups and then filter the dataframe.

In [None]:
# set the title I want to use
title = 'KS4 results kMeans4 - PMEA GCSE'

# so that it always runs the same I need to initialise centroids in this case
init_centroids = pd.DataFrame({'P8MEA': [-1.5, -0.5, 0, 0.75],
                               'PTAC5EM_PTQ_EE': [0.2, 0.4, 0.6, 0.8]},
                              columns=['P8MEA', 'PTAC5EM_PTQ_EE'])

# initialise the kmeans cluster with the fixed centroids
# (n_init=1 because the starting centroids are supplied explicitly)
kmeans4 = cluster.KMeans(n_clusters=4, init=init_centroids, n_init=1)

# fit it to the data
assigned_clusters = kmeans4.fit(ks4_results_df[['P8MEA', 'PTAC5EM_PTQ_EE']])

# plot the results
plot_cluster_2(ks4_results_df['P8MEA'], ks4_results_df['PTAC5EM_PTQ_EE'], 
               assigned_clust=assigned_clusters, k=4,
               plt_title=title,
               plt_labels=['Progress 8', 'Students achieving 5+A*-C GCSE (%)'],
               legend_loc=(1.05, 0.5),
               opacity=0.7,
               save=True)

The clusters appear to be separated into groups around the result bands, with the progress measure having a strong impact on the groupings.  However, they look quite wide ranging.  I'll run a silhouette analysis on them to see whether they are suitable.
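
One possible reason for a single measure dominating, offered as an assumption rather than something tested in this notebook, is that k-means works on raw distances, so the column with the wider numeric range carries more weight.  A sketch of standardising both columns before clustering, on invented data with one wide and one narrow column:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented data: column 0 spreads over several units, column 1 stays within 0-1.
rng = np.random.RandomState(0)
X = np.column_stack([rng.normal(0.0, 0.5, 200),
                     rng.uniform(0.2, 0.9, 200)])

# rescale each column to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)

# cluster the raw and the rescaled data with the same settings
raw_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X).labels_
scaled_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled).labels_

# how many points end up in each cluster under the two treatments
print(np.bincount(raw_labels), np.bincount(scaled_labels))
```

Plotting the scaled clustering back on the original axes would show whether the banding along one measure is a feature of the data or of the units.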

<a name='silhouette'></a>

# Silhouette coefficients analysis

## Kmeans=4 

In [None]:
# create a column on the results data holding each row's cluster label
# (assign the array directly so the labels follow row order, not index alignment)
ks4_results_df['cluster'] = assigned_clusters.labels_

In [None]:
# check it looks ok.
ks4_results_df.head()

Calculate the silhouette coefficients

In [None]:
# from notebook 21.3
# Add the silhouette coefficients as a new column in the
# ks4_results_df:
ks4_results_df['silhouette'] = silhouette_samples(ks4_results_df[['P8MEA', 'PTAC5EM_PTQ_EE']],
                                                             np.array(ks4_results_df['cluster']))

ks4_results_df.head()

In [None]:
# sort the dataframe so we can see a curve
silhouette_plot_data_df = ks4_results_df.sort_values(['cluster', 'silhouette'])
silhouette_plot_data_df.index = list(range(len(silhouette_plot_data_df)))

# set the colours
colours = sns.palettes.color_palette(n_colors=len(set(ks4_results_df['cluster'])))

for clust in set(silhouette_plot_data_df['cluster']):
    plt.bar(silhouette_plot_data_df[silhouette_plot_data_df['cluster']==clust].index,
            silhouette_plot_data_df[silhouette_plot_data_df['cluster']==clust]['silhouette'],
            color=colours[clust], alpha=0.7, label='Cluster ' + str(clust))
    
plt.title('kMeans=4 Silhouette plot of Progress 8 and GCSE A*-C')
plt.legend()

plt.xlabel('Number of data point')
plt.ylabel('Silhouette coefficient')

plt.savefig('plot_images/silhouette_k4_P8_AC5.png')



This silhouette plot shows us clearly that the clusters are uneven in size, with a wide range of coefficients.  There are also a few overlapping points showing.
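
The same reading can be backed up with a small table: the size of each cluster, the range of its silhouette coefficients, and how many points have a negative coefficient (a sign they sit closer to a neighbouring cluster).  This is a sketch on random toy data with hypothetical column names `x` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Random toy data standing in for the two performance measures.
rng = np.random.RandomState(0)
toy_df = pd.DataFrame(rng.uniform(size=(100, 2)), columns=['x', 'y'])

# cluster, then attach the label and silhouette coefficient to each row
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(toy_df[['x', 'y']])
toy_df['cluster'] = model.labels_
toy_df['silhouette'] = silhouette_samples(toy_df[['x', 'y']], toy_df['cluster'])

# per-cluster size and spread of the coefficients
summary = toy_df.groupby('cluster')['silhouette'].agg(['count', 'mean', 'min', 'max'])

# number of points in each cluster with a negative coefficient
summary['negative'] = (toy_df['silhouette'] < 0).groupby(toy_df['cluster']).sum()
print(summary)
```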

I'll repeat the above steps using a higher k value.

## KS4 results data cluster group plot k=9

First I'll drop the extra columns created earlier.

In [None]:
# drop the added columns
if 'cluster' in ks4_results_df.columns:
    ks4_results_df.drop(['cluster', 'silhouette'], axis=1 , inplace=True)

In [None]:
# set the title I want to use
title = 'KS4 results cluster groups k=9'

# so that it always runs the same I need to initialise centroids in this case
init_centroids_9 = pd.DataFrame({'P8MEA': [-2, -1.5, -1, -0.75, -0.5, -0.25, 0, 0.25,  0.75],
                                'PTAC5EM_PTQ_EE': [0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9]},
                              columns=['P8MEA', 'PTAC5EM_PTQ_EE'])

# initialise the kmeans cluster
kmeans9 = cluster.KMeans(n_clusters=9, init=init_centroids_9, n_init=1)

# fit it to the data
assigned_clusters_9 = kmeans9.fit(ks4_results_df[['P8MEA', 'PTAC5EM_PTQ_EE']])

# plot the results
plot_cluster_2(ks4_results_df['P8MEA'], ks4_results_df['PTAC5EM_PTQ_EE'], 
               assigned_clust=assigned_clusters_9, k=9,
            plt_title=title,
            plt_labels=['Progress 8', 'Students achieving 5+A*-C GCSE (%)'],
            legend_loc=(1.05, 0.5),
            opacity=0.7,
            save=True)

The clusters still look quite wide ranging.  I'll run a silhouette analysis on them to see if the higher k value has evened things out.

## Silhouette analysis of kMeans 9 clustering.

In [None]:
# create a column on the results data holding each row's cluster label
# (assign the array directly so the labels follow row order, not index alignment)
ks4_results_df['cluster'] = assigned_clusters_9.labels_

Calculate the silhouette coefficients

In [None]:
# from notebook 21.3
# Add the silhouette coefficients as a new column in the
# ks4_results_df:
ks4_results_df['silhouette'] = silhouette_samples(ks4_results_df[['P8MEA', 'PTAC5EM_PTQ_EE']],
                                                             np.array(ks4_results_df['cluster']))

In [None]:
# sort the dataframe so we can see a curve
silhouette_plot_data_df = ks4_results_df.sort_values(['cluster', 'silhouette'])
silhouette_plot_data_df.index = list(range(len(silhouette_plot_data_df)))

colours = sns.palettes.color_palette(n_colors=len(set(ks4_results_df['cluster'])))

for clust in set(silhouette_plot_data_df['cluster']):
    plt.bar(silhouette_plot_data_df[silhouette_plot_data_df['cluster']==clust].index,
            silhouette_plot_data_df[silhouette_plot_data_df['cluster']==clust]['silhouette'],
            color=colours[clust], alpha=0.7, label='Cluster ' + str(clust))
    
plt.title('kMeans=9 Silhouette plot of Progress 8 and GCSE A*-C')
plt.legend(loc=(1.05,0.8))

plt.xlabel('Number of data point')
plt.ylabel('Silhouette coefficient')

plt.savefig('plot_images/silhouette_k9_P8_AC5.png')


Still no real improvement in the silhouette coefficients, and there are still some overlapping groups.  This isn't the grouping I was hoping for to help me clarify the performance measures.
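
Comparing silhouette plots by eye across k values is slow.  A sketch of a quicker comparison is to calculate the mean silhouette coefficient for each k with `silhouette_score` and see which k scores highest.  The data here is three invented, well separated blobs, so it behaves far more cleanly than the school results would.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three invented, well separated blobs.
rng = np.random.RandomState(0)
centres = [(0, 0), (5, 5), (10, 0)]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in centres])

# mean silhouette coefficient for each candidate k
mean_silhouette = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    mean_silhouette[k] = silhouette_score(X, labels)

# the k with the highest mean coefficient
best_k = max(mean_silhouette, key=mean_silhouette.get)
print(best_k, {k: round(v, 2) for k, v in mean_silhouette.items()})
```

If no k stands out on the real data, that in itself suggests the measures form a continuous spread rather than distinct groups.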

But it has enabled me to gain a better insight into the spread of the data.  I'll take a look at the Attainment 8 and GCSE 5+A*-C now in case it is better.

In [None]:
plt.scatter(x=ks4_results_df['ATT8SCR'], y=ks4_results_df['PTAC5EM_PTQ_EE'])

This scatter is a little less spread.  It may give a better range of values and is a good measure of the final results KS4 students achieve.

So our new measures will be:
 - Attainment 8
 - GCSE 5+ A*-C

## Cluster Groups of Attainment 8 and the GCSE 5+ A*-C measures

In [None]:
# Drop the added columns
if 'cluster' in ks4_results_df.columns:
    ks4_results_df.drop(['cluster', 'silhouette'], inplace=True, axis=1)

In [None]:
ks4_results_df.columns


Iterate with different k values (2 - 11) and plot each one.

In [None]:
for k in range(2,12):
    title = 'KS4 results cluster groups k=' + str(k)
    cluster_labels = ['Group ' + str(i) for i in range(k)]
    kmeans_plot(ks4_results_df, 'ATT8SCR', 'PTAC5EM_PTQ_EE', k=k,
                plt_title=title, cluster_labels=cluster_labels,
                plt_labels=['Attainment 8', 'English GCSE 5+ A*-C'])

Again these are all quite spread out, with values ranging quite a lot from one cluster to another, making it hard to use as a measure for school types.  I'll look at k=5 in more detail and quickly run the silhouette tests again.

## KMeans clustering of Attainment 8 and GCSE 5+ A*-C results

In [None]:
colour_map = sns.palettes.color_palette(n_colors=5)

In [None]:
# set the title I want to use
title = 'KS4 (ATT8 GCSE) cluster groups k=5'

# so that it always runs the same I need to initialise centroids in this case
# (Attainment 8 is a points score; the GCSE measure is a 0-1 proportion here)
init_centroids = pd.DataFrame({'ATT8SCR': [30, 45, 50, 55, 60],
                               'PTAC5EM_PTQ_EE': [0.2, 0.4, 0.5, 0.6, 0.8]},
                              columns=['ATT8SCR', 'PTAC5EM_PTQ_EE'])

# initialise the kmeans cluster
kmeans5 = cluster.KMeans(n_clusters=5, init=init_centroids, n_init=1)

# fit it to the data
assigned_clusters = kmeans5.fit(ks4_results_df[['ATT8SCR', 'PTAC5EM_PTQ_EE']])

# plot the results
plot_cluster_2(ks4_results_df['ATT8SCR'], ks4_results_df['PTAC5EM_PTQ_EE'], 
               assigned_clust=assigned_clusters, k=5,
               plt_title=title,
               plt_labels=['Attainment 8', '5+A*-C GCSE'],
               legend_loc=(0.02, 0.8),
               colors=colour_map,
               opacity=0.7,
               save=True)

Again the clusters appear to be separated into groups around the result bands.  The range of values looks slightly smaller than with the Progress 8 measure we looked at earlier.  I'll run a silhouette analysis on them to see whether they are suitable.

## Silhouette analysis of kMeans 5 clustering.

In [None]:
# create a column on the results data holding each row's cluster label
# (assign the array directly so the labels follow row order, not index alignment)
ks4_results_df['cluster'] = assigned_clusters.labels_

In [None]:
# check it looks ok.
ks4_results_df.head()

Calculate the silhouette coefficients

In [None]:
# from notebook 21.3
# Add the silhouette coefficients as a new column in the
# ks4_results_df:
ks4_results_df['silhouette'] = silhouette_samples(ks4_results_df[['ATT8SCR', 'PTAC5EM_PTQ_EE']],
                                                             np.array(ks4_results_df['cluster']))

ks4_results_df.head()

In [None]:
# sort the dataframe so we can see a curve
silhouette_plot_data_df = ks4_results_df.sort_values(['cluster', 'silhouette'])
silhouette_plot_data_df.index = list(range(len(silhouette_plot_data_df)))


for clust in set(silhouette_plot_data_df['cluster']):
    plt.bar(silhouette_plot_data_df[silhouette_plot_data_df['cluster']==clust].index,
            silhouette_plot_data_df[silhouette_plot_data_df['cluster']==clust]['silhouette'],
            color=colour_map[clust], alpha=0.7, label='Cluster ' + str(clust))
    
plt.title('Silhouette plot of Attainment 8 and GCSE dataset')
plt.legend(loc=(1, 0.5))

plt.xlabel('Number of data point')
plt.ylabel('Silhouette coefficient')

plt.savefig('plot_images/silhouette_k5_A8_AC5.png')


These clusters look to have less crossover, which is good, as I can't see any negative values.  However, the range of silhouette values is still extremely wide, which makes it hard to see this as a useful grouping for dividing the dataset by performance.

I will use it to double check the school types quickly though.
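
A quick way to make that check, sketched here on invented rows, is a cross-tabulation of school type against cluster label.  The `cluster` column is assumed to hold the k-means labels added earlier; the toy values below are made up.

```python
import pandas as pd

# Invented rows standing in for ks4_results_df with its 'cluster' column.
toy_df = pd.DataFrame({
    'NFTYPE': ['AC', 'AC', 'CY', 'CY', 'CTC', 'AC'],
    'cluster': [0, 1, 1, 2, 2, 0],
})

# number of schools of each type in each cluster
counts = pd.crosstab(toy_df['NFTYPE'], toy_df['cluster'])

# the same table as a share of each school type (rows sum to 1)
row_share = pd.crosstab(toy_df['NFTYPE'], toy_df['cluster'], normalize='index')

print(counts)
print(row_share.round(2))
```

If a school type were strongly linked to performance, most of its row share would fall in a single cluster column.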


These cluster groupings indicate that there is probably some spread in the data.  I'm going to go back to the dataframe and plot out the performance measures and colour code it by school type to see how they lie.

<a name="school_scatter"></a>

## School Scatter Analysis

In [None]:
# get a list of the school types and assign each a colour
school_code = list(set(ks4_results_df['NFTYPE']))

# # Make a colour map for each school type
# colour_map = sns.palettes.color_palette('paired', n_colors=len(school_code))

colour_map = sns.palettes.color_palette(palette='Paired', n_colors=12)

# add the colour column to the dataframe 
ks4_results_df['colour'] = ks4_results_df['NFTYPE'].apply(lambda x:colour_map[school_code.index(x)])


In [None]:
# plot them all to the same plot
ks4_results_df.plot.scatter(x='ATT8SCR', y='PTAC5EM_PTQ_EE',
                            s=30,
                            c=ks4_results_df['colour']
                           )

# add the title and axis labels
plt.title('Scatter plot of pupils KS4 results per school')
plt.xlabel('Average Attainment 8 Score')
plt.ylabel('5+A*-C GCSE including English and Maths (%)')

# add a legend to the scatter plot
import matplotlib.patches as mpatches
# make the legend handles, one patch per school type
legend_handles = [mpatches.Patch(color=colour_map[i], label=nftypes[code])
                  for i, code in enumerate(school_code)]
                
plt.legend(handles=legend_handles, loc=(0.05, 0.4))

plt.savefig('plot_images/scatter_schools_distribution_2.png')

I want to plot each school type to see its distribution.

In [None]:
# snippet from https://stackoverflow.com/questions/3899980/how-to-change-the-font-size-on-a-matplotlib-plot
for school in school_code:
    
    subset = ks4_results_df[ks4_results_df['NFTYPE']==school]
    # plot only that school type
    subset.plot.scatter(x='ATT8SCR', y='PTAC5EM_PTQ_EE',
                        c=subset['colour'],
                        s=30
                       )
    # add the title and axis labels
    plt.title('School type: ' + nftypes[school])
    plt.xlabel('Average Attainment 8 Score')
    plt.ylabel('5+A*-C GCSE including English and Maths (%)')
    
    
    plt.savefig('plot_images/scatter_' + school + '_dist.png')

<a name='q1_findings'></a>

# Q1: Findings

Observations: the mean value of the grouped schools' performance can be misleading.  The spread within each school type's results, and the variation in the number of schools per type, make it hard to be confident in saying there is a link between school type and performance in the measures investigated.

When the data is grouped by school type with mean values for each measure there are two clear outliers: CTC as the top performer in all measures and FESI as the worst performer in all measures.  However, when we run cluster analysis and silhouette plots we see how spread the data is for each type.  Looking at each scatter on a by-school basis revealed that CTC had 2 schools at the top and 1 at the bottom (of 3 schools), whereas FESI had a more even spread of data points.

On balance I would say that although the means of the performance measures suggest there could be a link, further investigation suggests that the spread of the data does not reflect such a narrow grouping.
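
The point about means hiding the spread can be illustrated by reporting the count, standard deviation and range alongside each group mean.  The numbers below are invented to mimic the pattern described (a small group with two high scores and one low one); they are not the real school results.

```python
import pandas as pd

# Invented Attainment 8 scores for two school types.
toy_df = pd.DataFrame({
    'NFTYPE': ['CTC', 'CTC', 'CTC', 'FESI', 'FESI', 'FESI', 'FESI'],
    'ATT8SCR': [62.0, 60.0, 35.0, 30.0, 33.0, 36.0, 39.0],
})

# mean together with the size and spread of each group
spread = toy_df.groupby('NFTYPE')['ATT8SCR'].agg(['count', 'mean', 'std', 'min', 'max'])
print(spread.round(2))
```

In this toy case the group with the higher mean also has by far the larger standard deviation, which is exactly the situation where a bar chart of means overstates the difference.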


<a name="q2"></a>

#  Q2 - Keystage 2 and 4 Investigation.

# Do schools that perform well at KS2 deliver as good or better results at KS4?

How many mainstream schools are there?

In [None]:
# count_documents replaces the deprecated Cursor.count()
ks2.count_documents({'RECTYPE': 1})

So there are a very large number of documents in the dataset (after taking out the non-mainstream schools).  Which schools are in both ks2 and ks4?  I seem to remember there being a flag in the KS4 database if the school had published KS2 results.

<a name="joining"></a>

# Joining the two datasets

In [None]:
# print keys with key stage 2 in the description
for k in ks4_expanded_name.keys():
    if 'key stage 2' in ks4_expanded_name[k].lower():
        print(k, ks4_expanded_name[k])

Found it: the flag is `TABKS2`.  Let me make a dataframe of just those schools from KS4, then I can use the URN to look up the KS2 ones.

In [None]:
ks4_schools_df = pd.DataFrame(list(ks4.find({'TABKS2': 1},
                                            {'URN': 1,
                                             'NFTYPE': 1,
                                             'PTAC5EM_PTQ_EE': 1,
                                             'PTEBACC_PTQ_EE': 1,
                                             'P8MEA': 1,
                                             'ATT8SCR': 1,
                                             '_id': 0
                                            })))
len(ks4_schools_df)

In [None]:
ks4_schools_df.head()

In [None]:
# drop the missing values
ks4_schools_df.dropna(inplace=True)

ks4_schools_df.head()

Now is there a similar key in the ks2 dataset?

In [None]:
for k in ks2_expanded_name.keys():
    if 'key stage 4' in ks2_expanded_name[k].lower():
        print(k, ks2_expanded_name[k])

Awesome, there is!  I can then use that to grab the schools from ks2.

In [None]:
ks2_schools_df = pd.DataFrame(list(ks2.find({'TAB15': 1})))
len(ks2_schools_df.head())

5 isn't really that many, and looking at the style of the label code it probably means included in 2015 only.  Perhaps I can look up by the URN and match to those from the ks4 schools dataframe.
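
One caveat on that count: `head()` returns only the first five rows by default, so `len(df.head())` can never be more than 5, however many documents the query matched. A small sketch with a made-up dataframe (`demo_df` is an assumption for illustration, not the KS2 data) shows the difference:

```python
import pandas as pd

# Made-up dataframe with 100 rows, standing in for a query result
demo_df = pd.DataFrame({'URN': range(100)})

rows_in_head = len(demo_df.head())   # length of the preview only
rows_in_full = len(demo_df)          # length of the whole dataframe

print(rows_in_head, rows_in_full)
```

Taking `len()` of the full dataframe would show how many schools actually carry the flag.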

In [None]:
ks4_schools_df.dropna(inplace=True)

In [None]:
ks4_schools_df.head()

In [None]:
ks2_schools_2_df = pd.DataFrame()


for index, row in ks4_schools_df.iterrows():
    doc = ks2.find_one({'URN': row['URN']},
                       {'URN': 1,
                        'PTGPS_HIGH_H': 1,
                        'PTMAT_HIGH': 1,
                        'PTREAD_HIGH': 1,
                        'PTRWM_HIGH': 1,
                        'PTGPS_EXP': 1,
                        'PTMAT_EXP':1,
                        'PTREAD_EXP':1,
                        'PTRWM_EXP':1,
                        '_id': 0 })
    ks2_schools_2_df.append(list(doc))
    
#     print(row['URN'], doc['URN'])
    
ks2_schools_2_df.head()
    

That didn't really work that well.  Let's try another way.
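
A likely reason the attempt above leaves the dataframe empty is that `DataFrame.append` returns a new dataframe instead of changing the existing one (and it was removed in pandas 2.0), while `list(doc)` keeps only the dictionary's keys. `find_one` also returns `None` when nothing matches. A minimal sketch of the same idea, collecting the documents in a list first, is below; `fake_ks2` and `fake_find_one` are hypothetical stand-ins for the real collection, with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the ks2 collection, keyed by URN (values are made up)
fake_ks2 = {
    100001: {'URN': 100001, 'PTRWM_EXP': 60, 'PTRWM_HIGH': 8},
    100002: {'URN': 100002, 'PTRWM_EXP': 45, 'PTRWM_HIGH': 3},
}

def fake_find_one(urn):
    """Mimic find_one: return the matching document, or None if there is no match."""
    return fake_ks2.get(urn)

urns = [100001, 100002, 100003]   # 100003 has no KS2 record

rows = []
for urn in urns:
    doc = fake_find_one(urn)
    if doc is not None:           # skip schools with no KS2 document
        rows.append(doc)          # keep the whole dict, not list(doc), which is only the keys

ks2_demo_df = pd.DataFrame(rows)  # build the dataframe once, after the loop
print(ks2_demo_df)
```

Building the dataframe once from a list of dictionaries also avoids copying the dataframe on every iteration.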

In [None]:
# make a list of all the URN
urn_list = list(ks4_schools_df['URN'])

In [None]:
# tag every ks2 document whose URN also appears in the ks4 list
urn_set = set(urn_list)  # set membership is faster than scanning a list
for doc in ks2.find():
    # skip documents without a URN, then check against our list
    if 'URN' in doc and int(doc['URN']) in urn_set:
        # add a new key to tag the schools that are in ks4 too
        ks2.update_one({'_id': doc['_id']},
                       {'$set': {'school_in_ks4': 1}})

ks2.find_one({'school_in_ks4':1})
                

Great, that worked.

Now we can grab those data and put them into a dataframe.

In [None]:
ks2_schools_df = pd.DataFrame(list(ks2.find({'school_in_ks4': 1},
                                            {'URN': 1,
                                             'PTGPS_HIGH': 1,
                                             'PTMAT_HIGH': 1,
                                             'PTREAD_HIGH': 1,
                                             'PTRWM_HIGH': 1,
                                             'PTGPS_EXP': 1,
                                             'PTMAT_EXP': 1,
                                             'PTREAD_EXP': 1,
                                             'PTRWM_EXP': 1,
                                             
                                             '_id': 0 })))

In [None]:
len(ks2_schools_df)

In [None]:
ks2_schools_df.info()

In [None]:
# convert the URN to int so it matches the type used in the ks4 dataframe
ks2_schools_df['URN'] = ks2_schools_df['URN'].astype(int)
ks2_schools_df.head()

In [None]:
# let's get an idea of the data
ks2_schools_df.describe()

In [None]:
ks2_schools_df.drop('URN', axis=1).plot(kind='bar', subplots=True)

In [None]:
# merge the two dataframes
ks2_ks4_df = pd.merge(ks4_schools_df, ks2_schools_df, on='URN')
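
`pd.merge` uses an inner join by default, so any school whose URN is missing from either dataframe is dropped without warning. A minimal sketch of how the match could be checked with `indicator=True` follows; the two small dataframes and their values are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; the URNs and scores are not from the real data
ks4_demo = pd.DataFrame({'URN': [1, 2, 3], 'ATT8SCR': [48.5, 52.1, 45.0]})
ks2_demo = pd.DataFrame({'URN': [2, 3, 4], 'PTRWM_EXP': [55, 61, 40]})

# how='outer' keeps every school; indicator=True adds a '_merge' column
# saying whether each URN was found in the left frame, the right frame, or both
checked = pd.merge(ks4_demo, ks2_demo, on='URN', how='outer', indicator=True)
print(checked['_merge'].value_counts())

# the default inner join keeps only URNs present in both frames
inner = pd.merge(ks4_demo, ks2_demo, on='URN')
print(len(inner))
```

Comparing the number of `both` rows with the length of each input shows how many schools were lost in the join.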

In [None]:
ks2_ks4_df.describe()

In [None]:
ks2_ks4_df.info()

In [None]:
grouped_by_mean = ks2_ks4_df.drop('URN', axis=1).groupby('NFTYPE').mean()

In [None]:
grouped_by_mean

In [None]:
grouped_by_mean.sort_values('PTAC5EM_PTQ_EE').plot(kind='bar', subplots=True)

In [None]:
grouped_by_mean.plot.scatter(x='ATT8SCR', y='PTRWM_HIGH')

In [None]:
grouped_by_mean.plot.scatter(x='ATT8SCR', y='PTRWM_EXP')

In [None]:
# look at the values of high grades of PTRWM and attainment 8
ks2_ks4_df[['ATT8SCR', 'PTRWM_HIGH', ]].plot.scatter(x='ATT8SCR',y='PTRWM_HIGH')

In [None]:
# look at the values of expected PTRWM and attainment 8
ks2_ks4_df[['ATT8SCR', 'PTRWM_EXP']].plot.scatter(x='ATT8SCR',y='PTRWM_EXP')

<a name='plotting'></a>

# Plotting KS2 - KS4

## Plot the expected level performance vs Attainment 8 score

In [None]:
# subset the data (copy it so a colour column can be added without a SettingWithCopyWarning)
scatter_df = ks2_ks4_df[['ATT8SCR', 'PTRWM_EXP', 'NFTYPE']].copy()

In [None]:
# set up the color map and school codes
# get a list of the school types and assign each a colour
school_code = list(set(scatter_df['NFTYPE']))

colour_map = sns.palettes.color_palette(palette='Paired', n_colors=len(school_code))

school_code.sort()

In [None]:
# add the colour column to the dataframe 
scatter_df['colour'] = scatter_df['NFTYPE'].apply(lambda x:colour_map[school_code.index(x)])


# plot them all to the same plot
scatter_df.plot.scatter(x='ATT8SCR', y='PTRWM_EXP',
                            s=30,
                            c=scatter_df['colour']
                           )


# add the title and labels
plt.title('Schools student performance KS2 and KS4')
plt.xlabel('Average Attainment 8 Score at KS4')
plt.ylabel('% Expected Level in Reading,\nWriting and Mathematics at KS2')

# make the legend handles
legend_handles = [mpatches.Patch(color=colour_map[i], label=nftypes[code])
                  for i, code in enumerate(school_code)]
                
plt.legend(handles=legend_handles, loc=(0.05, 0.75))

plt.savefig('plot_images/KS2_KS4_EXP_ATT8.png')

In [None]:
scatter_df.describe()

## Plot the High level performance vs Attainment 8 score

In [None]:
# subset the data (copy it so a colour column can be added without a SettingWithCopyWarning)
scatter_df = ks2_ks4_df[['ATT8SCR', 'PTRWM_HIGH', 'NFTYPE']].copy()

In [None]:
# set up the color map and school codes
# get a list of the school types and assign each a colour
school_code = list(set(scatter_df['NFTYPE']))

colour_map = sns.palettes.color_palette(palette='Paired', n_colors=len(school_code))

school_code.sort()

In [None]:
# add the colour column to the dataframe 
scatter_df['colour'] = scatter_df['NFTYPE'].apply(lambda x:colour_map[school_code.index(x)])


# plot them all to the same plot
scatter_df.plot.scatter(x='ATT8SCR', y='PTRWM_HIGH',
                            s=30,
                            c=scatter_df['colour']
                           )

# add the title and axis labels
plt.title('Schools student performance KS2 and KS4')
plt.xlabel('Average Attainment 8 Score at KS4')
plt.ylabel('% High Level in Reading,\nWriting and Mathematics at KS2')

# make the legend handles
legend_handles = [mpatches.Patch(color=colour_map[i], label=nftypes[code])
                  for i, code in enumerate(school_code)]
                
plt.legend(handles=legend_handles, loc=(0.05, 0.75))

plt.savefig('plot_images/KS2_KS4_HIGH_ATT8.png')

In [None]:
scatter_df.describe()

## Plot the High level performance vs GCSE 5 A*-C

In [None]:
# subset the data (copy it so a colour column can be added without a SettingWithCopyWarning)
scatter_df = ks2_ks4_df[['PTAC5EM_PTQ_EE', 'PTRWM_HIGH', 'NFTYPE']].copy()

In [None]:
# set up the color map and school codes
# get a list of the school types and assign each a colour
school_code = list(set(scatter_df['NFTYPE']))

colour_map = sns.palettes.color_palette(palette='Paired', n_colors=len(school_code))

school_code.sort()

In [None]:
# add the colour column to the dataframe 
scatter_df['colour'] = scatter_df['NFTYPE'].apply(lambda x:colour_map[school_code.index(x)])


# plot them all to the same plot
scatter_df.plot.scatter(x='PTAC5EM_PTQ_EE', y='PTRWM_HIGH',
                            s=30,
                            c=scatter_df['colour']
                           )

# add the title and axis labels
plt.title('Schools student performance KS2 and KS4')
plt.xlabel('% Students to achieve 5+ A*-C at GCSE')
plt.ylabel('% High Level in Reading,\nWriting and Mathematics at KS2')

# make the legend handles
legend_handles = [mpatches.Patch(color=colour_map[i], label=nftypes[code])
                  for i, code in enumerate(school_code)]
                
plt.legend(handles=legend_handles, loc=(0.05, 0.75))

plt.savefig('plot_images/KS2_KS4_HIGH_AC5.png')

In [None]:
scatter_df.describe()

## Plot the expected level performance vs GCSE 5 A*-C

In [None]:
# subset the data (copy it so a colour column can be added without a SettingWithCopyWarning)
scatter_df = ks2_ks4_df[['PTAC5EM_PTQ_EE', 'PTRWM_EXP', 'NFTYPE']].copy()

In [None]:
# set up the color map and school codes
# get a list of the school types and assign each a colour
school_code = list(set(scatter_df['NFTYPE']))

colour_map = sns.palettes.color_palette(palette='Paired', n_colors=len(school_code))

school_code.sort()

In [None]:
# add the colour column to the dataframe 
scatter_df['colour'] = scatter_df['NFTYPE'].apply(lambda x:colour_map[school_code.index(x)])


# plot them all to the same plot
scatter_df.plot.scatter(x='PTAC5EM_PTQ_EE', y='PTRWM_EXP',
                            s=30,
                            c=scatter_df['colour']
                           )

# add the title and axis labels
plt.title('Schools student performance KS2 and KS4')
plt.xlabel('% Students to achieve 5+ A*-C at GCSE')
plt.ylabel('% Expected Level in Reading,\nWriting and Mathematics at KS2')

# make the legend handles
legend_handles = [mpatches.Patch(color=colour_map[i], label=nftypes[code])
                  for i, code in enumerate(school_code)]
                
plt.legend(handles=legend_handles, loc=(0.05, 0.75))

plt.savefig('plot_images/KS2_KS4_EXP_AC5.png')

In [None]:
scatter_df.describe()

<a name='pearson'></a>

# Pearson's *r* correlation test

In [None]:
# Attainment 8 and high KS2
a = scipy.stats.pearsonr(ks2_ks4_df['PTRWM_HIGH'],
                     ks2_ks4_df['ATT8SCR'])

In [None]:
# Attainment 8 and Expected KS2
b = scipy.stats.pearsonr(ks2_ks4_df['ATT8SCR'],
                         ks2_ks4_df['PTRWM_EXP'])

In [None]:
# 5+ A*-C and high KS2
c = scipy.stats.pearsonr(ks2_ks4_df['PTAC5EM_PTQ_EE'],
                     ks2_ks4_df['PTRWM_HIGH'])

In [None]:
# 5+ A*-C and Expected KS2
d = scipy.stats.pearsonr(ks2_ks4_df['PTAC5EM_PTQ_EE'],
                         ks2_ks4_df['PTRWM_EXP'])

In [None]:
# pearsonr returns (r, p): the correlation coefficient and its p value
results = pd.DataFrame({'r value': [a[0], b[0], c[0], d[0]],
                        'P value': [a[1], b[1], c[1], d[1]],
                        'Comparison': ['A8 to High KS2', 'A8 to Exp KS2', 'AC5 to High KS2', 'AC5 to EXP KS2']})
# results.set_index('Comparison')
# results[['r value', 'P value']]

In [None]:
results.set_index('Comparison')

Observations: these statistics suggest a slight positive correlation between the percentage of students achieving high scores at KS2 in Reading, Writing and Mathematics and better results at KS4, in both the Attainment 8 score and the percentage achieving 5+ A*-C GCSE grades.

Both have a very small _p_ value, so we can say that the result is significant.  We can therefore reject the null hypothesis that schools performing highly at KS2 perform the same as any other at KS4 (for the measures tested at least).

We can also see that achieving only the expected level at KS2 in Reading, Writing and Mathematics was not as strongly correlated: the R value is smaller (between 15.5 and 17.3) and the _p_ value is too high for this result to be deemed significant (it should be under 0.05).
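
For reference, the first value returned by `scipy.stats.pearsonr` is the correlation coefficient *r* (between -1 and 1), and the second is the *p* value; *r*² is the square of the first value. The sketch below computes *r* by hand to show what is being measured. The helper `pearson_r` and the sample values are assumptions made up for illustration, not the notebook's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: % high level at KS2 and Attainment 8 score for five schools
ks2_high = [2, 5, 8, 10, 4]
att8 = [44.0, 47.5, 52.0, 50.5, 49.0]

r = pearson_r(ks2_high, att8)
r_squared = r ** 2   # share of variance explained by a straight-line fit

print(r, r_squared)
```

Because *r* is at most 1 in magnitude, *r*² is never larger than |*r*|, so it matters which of the two a results table reports.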

<a name='q2_findings'></a>

# Q2: Findings

This analysis suggests that schools whose students achieve high scores at KS2 in Reading, Writing and Maths are also schools that achieve better results at KS4.  However, the difference is not large, and the dataset did not contain many schools with published results at both KS2 and KS4.  There may therefore be a number of confounding factors, which will be elucidated in the accompanying report.

# Cleanup/remove the database
<a name="cleanup"></a>

Uncomment the lines below to remove the MongoDB created in the investigation.

In [None]:
# uncomment to remove the database if needed
# client.drop_database('schools_db')
# client.list_database_names()