In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

# Loading Data

The data on schools in England come from the UK government, specifically from the school performance comparison service:
https://bit.ly/2YKztfa

The current dataset includes only records for the latest available school year, 2018-2019, and only for primary schools (key stages 1 and 2, for children of age 5-11). 

In [2]:
%%time
df = pd.read_excel('england_ks2final.xlsx', na_values="SUPP")

Wall time: 57.1 s


The file is quite large and takes about a minute to load.

In [3]:
df.memory_usage().sum() # bytes of memory consumed by the whole dataframe

41996480
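One caveat on the figure above: by default, `memory_usage()` counts only the pointers held in object columns, not the Python strings they point to. Passing `deep=True` includes them. A minimal sketch on a toy frame (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
# dtype="object" is forced so the example mirrors the object columns seen here.
toy = pd.DataFrame({
    "SCHNAME": pd.Series(
        ["Bringhurst Primary School", "Great Dalby School"], dtype="object"
    ),
    "PTRWM_EXP": [0.68, 0.80],
})

shallow = toy.memory_usage().sum()          # object columns counted as pointers only
deep = toy.memory_usage(deep=True).sum()    # also counts the string objects themselves

print(shallow, deep)
```

For a dataframe with many text columns, the `deep=True` figure is likely to be noticeably larger than the default one.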

Let's take a look at the dataframe dimensionality, feature names and types

In [4]:
df.shape

(16508, 318)

In [5]:
df.columns

Index(['RECTYPE', 'ALPHAIND', 'LEA', 'ESTAB', 'URN', 'SCHNAME', 'ADDRESS1',
       'ADDRESS2', 'ADDRESS3', 'TOWN',
       ...
       'MATPROG_UNADJUSTED', 'READPROG_DESCR_17', 'WRITPROG_DESCR_17',
       'MATPROG_DESCR_17', 'READPROG_DESCR_18', 'WRITPROG_DESCR_18',
       'MATPROG_DESCR_18', 'READPROG_DESCR', 'WRITPROG_DESCR',
       'MATPROG_DESCR'],
      dtype='object', length=318)

In [6]:
df.index # uses pandas default index, for now

RangeIndex(start=0, stop=16508, step=1)

In [7]:
df.index.is_unique

True

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16508 entries, 0 to 16507
Columns: 318 entries, RECTYPE to MATPROG_DESCR
dtypes: float64(200), int64(1), object(117)
memory usage: 40.1+ MB


# Data Cleaning and Preprocessing

## Feature Selection

The total number of features (columns) in the dataframe is quite large, at 318. Investigating all of these is unlikely to be feasible, and we can instead focus only on a subset of these, with the meanings as in the table below.

To a large extent, feature selection for the purpose of this project is a judgement call, since nearly all of the features contain useful information. Some of the reasons are quite straightforward: certain columns (like school phone number) are not relevant for this type of analysis, and many columns contain duplicated information and/or are unambiguously redundant (e.g. the *number* of pupils achieving high proficiency in mathematics vs the *percentage* of pupils achieving the same, or the percentages of *disadvantaged* pupils in the same category vs the percentage of *non-disadvantaged* pupils in the same category). Since the focus will be on school performance as a whole, features pertaining specifically to boys or girls will not be included either (moreover, proper accounting for them would be complicated by the fact that a significant proportion of schools in England are single-sex, and thus a girls-only school will have N/A for all features pertaining to boys).

The decision was also made *not* to include any features containing raw *scores* in any subject or category, for now, and focus instead on *percentages*. One of the reasons for this is the intention to apply unsupervised learning methods to explore the data further, and some of these methods (e.g. K-Means clustering) require features to be on the same scale. While it is generally easy to rescale them (using, for example, sklearn.preprocessing.StandardScaler) doing so might introduce additional inaccuracy to data, and will be left aside, for now. Retained continuous features are all expected to have the same scale (percentage), but this will need to be checked later. 
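The scale check mentioned above could be sketched as follows. The helper `columns_outside_unit_interval` and the `RAW_SCORE` column are hypothetical names introduced for illustration, and the sketch assumes percentages are stored as fractions between 0 and 1:

```python
import pandas as pd

def columns_outside_unit_interval(frame):
    """Return names of numeric columns with any value outside [0, 1]."""
    numeric = frame.select_dtypes(include="number")
    # Comparisons with NaN evaluate to False, so missing values are ignored here
    out_of_range = (numeric < 0) | (numeric > 1)
    flags = out_of_range.any()          # one boolean per column
    return flags[flags].index.tolist()

toy = pd.DataFrame({
    "PTRWM_EXP": [0.68, 0.79, None],
    "RAW_SCORE": [104.0, 99.5, 101.2],  # hypothetical raw-score column
    "TOWN": ["Grantham", "Coalville", "Loughborough"],
})

offending = columns_outside_unit_interval(toy)
print(offending)
```

An empty list from such a check would suggest that all retained numeric features share the same 0 to 1 scale.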

Finally, several categorical features will be included, among them religious denomination and school type. While some of these columns may contain similar information (specifically postcodes, town names, and parliamentary constituencies), they will be retained for now. They are relevant for analysis (it will be important to understand, for example, whether schools with religious affiliation tend to perform better or worse than more 'secular' ones, and whether this difference, if any, is statistically significant). Since the main goal of the project is to identify locations with better schools, several approaches to 'slicing' the data (e.g. by towns and post districts) will be attempted later.

|Field Name|Description|
|:--|:--|
|URN|School unique reference number|
|SCHNAME|School/Local authority name|
|TOWN|School town|
|PCODE|School postcode|
|PCON_NAME|School parliamentary constituency name|
|NFTYPE|School type. Rectype 1: AC = Academy Sponsor Led (NFTYPE 20), CY = Community school (21), VA = Voluntary Aided school (22), VC = Voluntary Controlled school (23), FD = Foundation school (24), CTC = City Technology College (25), ACC = Academy converter (51), F = Free school (52). Rectype 2: CYS = Community special (26), FDS = Foundation special (27), ACS = Academy special (50), FS = Free Special School (53), ACCS = Academy Converter Special (55)|
|RELDENOM|Religious denomination|
|AGERANGE|Age range|
|PTKS1GROUP_L|Percentage of pupils in cohort with low KS1 attainment|
|PTKS1GROUP_M|Percentage of pupils in cohort with medium KS1 attainment|
|PTKS1GROUP_H|Percentage of pupils in cohort with high KS1 attainment|
|PTNotFSM6CLA1A|Percentage of key stage 2 pupils who are not disadvantaged|
|PTRWM_EXP|Percentage of pupils reaching the expected standard in reading, writing and maths|
|PTRWM_HIGH|Percentage of pupils achieving a high score in reading and maths and working at greater depth in writing|
|PTREAD_EXP|Percentage of pupils reaching the expected standard in reading|
|PTREAD_HIGH|Percentage of pupils achieving a high score in reading|
|PTGPS_EXP|Percentage of pupils reaching the expected standard in grammar, punctuation and spelling|
|PTGPS_HIGH|Percentage of pupils achieving a high score in grammar, punctuation and spelling|
|PTMAT_EXP|Percentage of pupils reaching the expected standard in maths|
|PTMAT_HIGH|Percentage of pupils achieving a high score in maths|
|PTWRITTA_EXP|Percentage of pupils reaching the expected standard in writing|
|PTWRITTA_HIGH|Percentage of pupils working at greater depth within the expected standard in writing|
|PTSCITA_EXP|Percentage of pupils reaching the expected standard in science TA|
|PTEALGRP1|Percentage of eligible pupils with English as first language|
|PSENELE|Percentage of eligible pupils with EHC plan|
|PSENELK|Percentage of eligible pupils with SEN support|
|PTNOTFSM6CLA1A_18|Percentage of key stage 2 pupils who are not disadvantaged one year prior|
|PTRWM_EXP_18|Percentage of pupils reaching the expected standard in reading, writing and maths one year prior|
|PTRWM_HIGH_18|Percentage of pupils achieving a high score in reading and maths and working at greater depth in writing  one year prior|
|PTNOTFSM6CLA1A_17|Percentage of key stage 2 pupils who are not disadvantaged two years prior|
|PTRWM_EXP_17|Percentage of pupils reaching the expected standard in reading, writing and maths two years prior|
|PTRWM_HIGH_17|Percentage of pupils achieving a high score in reading and maths and working at greater depth in writing two years prior|
|PTRWM_EXP_3YR|Percentage of pupils reaching the expected standard in reading, writing and maths  - 3 year total|
|PTRWM_HIGH_3YR|Percentage of pupils achieving a high score in reading and maths and working at greater depth in writing  - 3 year total|


In [9]:
selected_columns = [
    "URN", "SCHNAME", "TOWN", "PCODE", "PCON_NAME", "NFTYPE", "RELDENOM", "AGERANGE", 
    "PTKS1GROUP_L", "PTKS1GROUP_M", "PTKS1GROUP_H", "PTNotFSM6CLA1A",
    "PTRWM_EXP", "PTRWM_HIGH", "PTREAD_EXP", "PTREAD_HIGH", "PTGPS_EXP", 
    "PTGPS_HIGH", "PTMAT_EXP", "PTMAT_HIGH", "PTWRITTA_EXP", "PTWRITTA_HIGH", 
    "PTSCITA_EXP", "PTEALGRP1", "PSENELE", "PSENELK", "PTNOTFSM6CLA1A_18", 
    "PTRWM_EXP_18", "PTRWM_HIGH_18", "PTNOTFSM6CLA1A_17", "PTRWM_EXP_17", 
    "PTRWM_HIGH_17", "PTRWM_EXP_3YR", "PTRWM_HIGH_3YR"
    ]
df = df[selected_columns]

In [10]:
df.shape

(16508, 34)

In [11]:
df.columns

Index(['URN', 'SCHNAME', 'TOWN', 'PCODE', 'PCON_NAME', 'NFTYPE', 'RELDENOM',
       'AGERANGE', 'PTKS1GROUP_L', 'PTKS1GROUP_M', 'PTKS1GROUP_H',
       'PTNotFSM6CLA1A', 'PTRWM_EXP', 'PTRWM_HIGH', 'PTREAD_EXP',
       'PTREAD_HIGH', 'PTGPS_EXP', 'PTGPS_HIGH', 'PTMAT_EXP', 'PTMAT_HIGH',
       'PTWRITTA_EXP', 'PTWRITTA_HIGH', 'PTSCITA_EXP', 'PTEALGRP1', 'PSENELE',
       'PSENELK', 'PTNOTFSM6CLA1A_18', 'PTRWM_EXP_18', 'PTRWM_HIGH_18',
       'PTNOTFSM6CLA1A_17', 'PTRWM_EXP_17', 'PTRWM_HIGH_17', 'PTRWM_EXP_3YR',
       'PTRWM_HIGH_3YR'],
      dtype='object')

In [12]:
df.memory_usage().sum()

4490304

## Duplicated or Missing Values

In [13]:
df.duplicated().values.any()

False

All records (rows) in the dataframe are unique

In [14]:
df.isnull().values.any()

True

However, some of these records contain missing values
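To see where the missing values are concentrated, one option is to count them per column. A minimal sketch on a toy frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SCHNAME": ["A", "B", "C", "D"],
    "PTRWM_EXP": [0.68, np.nan, 0.80, 0.91],
    "PTRWM_EXP_3YR": [np.nan, np.nan, 0.82, 0.90],
})

missing = toy.isna().sum()                                    # missing values per column
missing = missing[missing > 0].sort_values(ascending=False)   # keep affected columns only
print(missing)
```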

In [15]:
df.describe()

Unnamed: 0,PTKS1GROUP_L,PTKS1GROUP_M,PTKS1GROUP_H,PTNotFSM6CLA1A,PTRWM_EXP,PTRWM_HIGH,PTREAD_EXP,PTREAD_HIGH,PTGPS_EXP,PTGPS_HIGH,...,PSENELE,PSENELK,PTNOTFSM6CLA1A_18,PTRWM_EXP_18,PTRWM_HIGH_18,PTNOTFSM6CLA1A_17,PTRWM_EXP_17,PTRWM_HIGH_17,PTRWM_EXP_3YR,PTRWM_HIGH_3YR
count,15633.0,15633.0,15633.0,15633.0,15632.0,15632.0,15633.0,15633.0,15632.0,15632.0,...,15633.0,15633.0,15242.0,15240.0,15240.0,14863.0,14862.0,14862.0,14606.0,14083.0
mean,0.107795,0.565103,0.327413,0.693196,0.634661,0.100817,0.722765,0.26674,0.761829,0.337836,...,0.057641,0.14964,0.692965,0.63241,0.096105,0.683169,0.603305,0.086108,0.645959,0.101475
std,0.167547,0.147198,0.144141,0.197305,0.193203,0.081609,0.188728,0.139234,0.192186,0.172047,...,0.18673,0.095381,0.199625,0.196303,0.080042,0.20531,0.195663,0.075479,0.135041,0.062035
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01
25%,0.03,0.5,0.24,0.56,0.55,0.04,0.66,0.17,0.71,0.22,...,0.0,0.08,0.56,0.55,0.03,0.54,0.5,0.03,0.57,0.06
50%,0.07,0.58,0.33,0.73,0.67,0.09,0.76,0.26,0.8,0.33,...,0.01,0.14,0.73,0.67,0.08,0.72,0.63,0.07,0.65,0.09
75%,0.12,0.65,0.42,0.85,0.76,0.14,0.84,0.35,0.88,0.45,...,0.04,0.2,0.85,0.76,0.14,0.85,0.73,0.13,0.74,0.13
max,1.0,1.0,0.89,1.0,1.0,0.71,1.0,0.94,1.0,1.0,...,1.0,0.83,1.0,1.0,0.67,1.0,1.0,0.81,1.0,0.56


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16508 entries, 0 to 16507
Data columns (total 34 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   URN                16508 non-null  object 
 1   SCHNAME            16508 non-null  object 
 2   TOWN               16507 non-null  object 
 3   PCODE              16508 non-null  object 
 4   PCON_NAME          16506 non-null  object 
 5   NFTYPE             16508 non-null  object 
 6   RELDENOM           16508 non-null  object 
 7   AGERANGE           16508 non-null  object 
 8   PTKS1GROUP_L       15633 non-null  float64
 9   PTKS1GROUP_M       15633 non-null  float64
 10  PTKS1GROUP_H       15633 non-null  float64
 11  PTNotFSM6CLA1A     15633 non-null  float64
 12  PTRWM_EXP          15632 non-null  float64
 13  PTRWM_HIGH         15632 non-null  float64
 14  PTREAD_EXP         15633 non-null  float64
 15  PTREAD_HIGH        15633 non-null  float64
 16  PTGPS_EXP          156

Now only 34 features are used, instead of the previous 318. Memory usage also went down from more than 40 MB to around 4 MB.

The data types deserve a closer look. The continuous features are all of type float64, consistent with the fact that they are all supposed to be expressed as percentages. The categorical ones (e.g. the names of parliamentary constituencies) are of type object, which is how pandas stores strings by default, so this alone does not indicate a problem. However, an object column can also hold values of different types mixed together (e.g. integers and strings), and URN, which we would expect to be an integer, is also of type object. This is something we will need to investigate further, before moving to exploratory data analysis.
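One way to investigate an object column is to count the Python types it actually stores. A minimal sketch on a toy series that mimics a column mixing integers and whitespace strings (the values are illustrative, not read from the file):

```python
import pandas as pd

# Toy column: integers for schools, whitespace-only strings for other rows
urn = pd.Series([141279, 119910, " ", " "], dtype="object")

# Map every element to the name of its Python type, then count occurrences
type_counts = urn.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

More than one type in the result would indicate that the column mixes data of different kinds.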

Also we can see that the number of non-null values is *not* the same across features. This may also suggest a problem with missing data, which has to be dealt with accordingly.

Let's take a look at the first 10 and last 10 rows of the dataframe.

In [17]:
df.head(10)

Unnamed: 0,URN,SCHNAME,TOWN,PCODE,PCON_NAME,NFTYPE,RELDENOM,AGERANGE,PTKS1GROUP_L,PTKS1GROUP_M,...,PSENELE,PSENELK,PTNOTFSM6CLA1A_18,PTRWM_EXP_18,PTRWM_HIGH_18,PTNOTFSM6CLA1A_17,PTRWM_EXP_17,PTRWM_HIGH_17,PTRWM_EXP_3YR,PTRWM_HIGH_3YR
0,141279,Bringhurst Primary School,Market Harborough,LE16 8RH,Rutland and Melton,ACC,Does not apply,4-11,0.0,0.5,...,0.0,0.12,0.92,0.68,0.2,0.88,0.88,0.13,0.84,0.2
1,119910,Buckminster Primary School,Grantham,NG33 5RZ,Rutland and Melton,CY,Does not apply,4-11,0.06,0.71,...,0.0,0.06,0.84,0.79,0.11,0.87,0.6,0.0,0.69,0.12
2,139342,Great Dalby School,Melton Mowbray,LE14 2HA,Rutland and Melton,ACC,Does not apply,5-11,0.0,0.57,...,0.0,0.24,0.9,0.8,0.2,0.9,0.8,0.25,0.82,0.2
3,119912,Burton-on-the-Wolds Primary School,Loughborough,LE12 5TB,Loughborough,CY,Does not apply,4-11,0.04,0.46,...,0.0,0.16,0.86,0.91,0.41,0.92,0.88,0.16,0.9,0.33
4,119913,Belvoirdale Community Primary School,Coalville,LE67 3RD,North West Leicestershire,CY,Does not apply,4-11,0.15,0.56,...,0.02,0.28,0.66,0.61,0.03,0.65,0.5,0.05,0.53,0.04
5,141222,Christ Church & Saint Peter's Cofe Primary School,Loughborough,LE12 7JU,Charnwood,AC,Church of England,5-11,0.13,0.55,...,0.03,0.37,0.73,0.68,0.13,0.67,0.63,0.09,0.67,0.14
6,119914,Ellistown Community Primary School,Coalville,LE67 1EN,North West Leicestershire,CY,Does not apply,4-11,0.0,0.58,...,0.0,0.0,0.82,0.71,0.18,0.87,0.84,0.16,0.73,0.18
7,119915,Hugglescote Community Primary School,Coalville,LE67 2HA,North West Leicestershire,CY,Does not apply,4-11,0.18,0.58,...,0.11,0.21,0.88,0.56,0.08,0.9,0.65,0.05,0.59,0.06
8,119916,Woodstone Community Primary School,Coalville,LE67 2AH,North West Leicestershire,CY,Does not apply,4-11,0.03,0.55,...,0.0,0.13,0.9,0.87,0.1,0.9,0.55,0.06,0.71,0.09
9,119917,New Swannington Primary School,Coalville,LE67 5DQ,North West Leicestershire,CY,Does not apply,4-11,0.04,0.58,...,0.04,0.22,0.88,0.88,0.17,1.0,0.81,0.08,0.75,0.14


In [18]:
df.tail(10)

Unnamed: 0,URN,SCHNAME,TOWN,PCODE,PCON_NAME,NFTYPE,RELDENOM,AGERANGE,PTKS1GROUP_L,PTKS1GROUP_M,...,PSENELE,PSENELK,PTNOTFSM6CLA1A_18,PTRWM_EXP_18,PTRWM_HIGH_18,PTNOTFSM6CLA1A_17,PTRWM_EXP_17,PTRWM_HIGH_17,PTRWM_EXP_3YR,PTRWM_HIGH_3YR
16498,,Northamptonshire,,,,,,,0.09,0.57,...,0.03,0.13,0.75,0.61,0.08,0.74,0.57,0.07,0.6,0.08
16499,,Northumberland,,,,,,,0.07,0.6,...,0.04,0.13,0.71,0.65,0.1,0.7,0.61,0.09,0.64,0.1
16500,,Oxfordshire,,,,,,,0.08,0.56,...,0.03,0.17,0.8,0.63,0.1,0.78,0.61,0.09,0.63,0.1
16501,,Somerset,,,,,,,0.08,0.56,...,0.02,0.15,0.77,0.62,0.08,0.75,0.59,0.08,0.61,0.09
16502,,Suffolk,,,,,,,0.09,0.57,...,0.03,0.12,0.73,0.61,0.09,0.72,0.57,0.08,0.6,0.08
16503,,Surrey,,,,,,,0.06,0.49,...,0.04,0.13,0.83,0.7,0.13,0.82,0.67,0.13,0.69,0.13
16504,,Warwickshire,,,,,,,0.07,0.56,...,0.03,0.14,0.78,0.67,0.11,0.77,0.62,0.1,0.65,0.11
16505,,West Sussex,,,,,,,0.08,0.62,...,0.03,0.16,0.8,0.62,0.07,0.8,0.55,0.05,0.6,0.06
16506,,,,,,,,,0.09,0.57,...,0.03,0.15,0.69,0.64,0.1,0.69,0.61,0.09,0.63,0.1
16507,,,,,,,,,0.09,0.57,...,0.03,0.15,0.69,0.64,0.1,0.68,0.61,0.09,0.63,0.1


We can see an inconsistency here. While most *records* (or rows) in the dataframe refer to individual schools, the bottom ones appear to be aggregates by boroughs or counties, as well as totals. 

Let's move the records which do *not* have URN (unique reference number), and which do *not* correspond to individual schools, into a separate dataframe. 

Inconveniently, the missing URNs are not always blank (these are usually automatically recognised as N/A), but are more often specified as strings containing only whitespace (e.g. " ").
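One possible way to handle both kinds of missing URN in a single mask is sketched below. This is an alternative to the `str.contains` approach used in this notebook, not the method applied here; the toy values and the names `is_blank`, `has_urn`, `schools` and `aggregates` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "URN": pd.Series([141279, " ", np.nan, 119910], dtype="object"),
    "SCHNAME": ["Bringhurst Primary School", "Camden", "Surrey",
                "Buckminster Primary School"],
})

# Flag strings that consist only of whitespace; non-strings are never blank
is_blank = toy["URN"].map(
    lambda v: isinstance(v, str) and v.strip() == ""
).astype(bool)
has_urn = toy["URN"].notna() & ~is_blank

schools = toy[has_urn].copy()        # rows for individual schools
aggregates = toy[~has_urn].copy()    # rows without a usable URN
print(schools)
print(aggregates)
```

Taking explicit copies keeps later assignments on either part from raising pandas' SettingWithCopyWarning.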

In [19]:
df = df.dropna(subset=['SCHNAME'])
df = df[df['SCHNAME'] != " "] # removing the last row with empty school name
# .copy() makes df_bor an independent dataframe, so the assignment below does not raise SettingWithCopyWarning
df_bor = df[(df['URN'].str.contains(' ') == True) | (df['URN'].isna())].copy()
df_bor['URN'] = np.nan # populating all school URNs for boroughs with N/A for consistency


In [20]:
df = df[df['URN'].notna()]
df = df[~(df['URN'].str.contains(' ') == True)] # df should now contain only individual schools

In [21]:
df_bor.head(10)

Unnamed: 0,URN,SCHNAME,TOWN,PCODE,PCON_NAME,NFTYPE,RELDENOM,AGERANGE,PTKS1GROUP_L,PTKS1GROUP_M,...,PSENELE,PSENELK,PTNOTFSM6CLA1A_18,PTRWM_EXP_18,PTRWM_HIGH_18,PTNOTFSM6CLA1A_17,PTRWM_EXP_17,PTRWM_HIGH_17,PTRWM_EXP_3YR,PTRWM_HIGH_3YR
16355,,City of London,,,,,,,0.07,0.52,...,0.0,0.32,0.59,0.72,0.21,0.65,0.88,0.15,0.81,0.22
16356,,Camden,,,,,,,0.1,0.61,...,0.05,0.17,0.48,0.72,0.14,0.47,0.67,0.12,0.71,0.14
16357,,Greenwich,,,,,,,0.07,0.5,...,0.04,0.17,0.56,0.69,0.13,0.56,0.71,0.13,0.71,0.14
16358,,Hackney,,,,,,,0.08,0.58,...,0.04,0.18,0.49,0.71,0.13,0.46,0.72,0.13,0.69,0.13
16359,,Hammersmith and Fulham,,,,,,,0.08,0.56,...,0.04,0.16,0.54,0.74,0.16,0.5,0.74,0.14,0.74,0.15
16360,,Islington,,,,,,,0.12,0.6,...,0.05,0.17,0.4,0.69,0.16,0.36,0.66,0.14,0.68,0.15
16361,,Kensington and Chelsea,,,,,,,0.07,0.54,...,0.05,0.14,0.54,0.76,0.21,0.52,0.76,0.18,0.76,0.2
16362,,Lambeth,,,,,,,0.09,0.62,...,0.06,0.17,0.51,0.7,0.12,0.49,0.68,0.11,0.7,0.12
16363,,Lewisham,,,,,,,0.08,0.52,...,0.05,0.17,0.57,0.68,0.11,0.57,0.62,0.09,0.66,0.1
16364,,Southwark,,,,,,,0.1,0.61,...,0.04,0.19,0.51,0.69,0.12,0.43,0.64,0.09,0.67,0.11


In [22]:
df_bor.tail(10)

Unnamed: 0,URN,SCHNAME,TOWN,PCODE,PCON_NAME,NFTYPE,RELDENOM,AGERANGE,PTKS1GROUP_L,PTKS1GROUP_M,...,PSENELE,PSENELK,PTNOTFSM6CLA1A_18,PTRWM_EXP_18,PTRWM_HIGH_18,PTNOTFSM6CLA1A_17,PTRWM_EXP_17,PTRWM_HIGH_17,PTRWM_EXP_3YR,PTRWM_HIGH_3YR
16496,,Lincolnshire,,,,,,,0.1,0.58,...,0.04,0.17,0.72,0.6,0.08,0.72,0.57,0.07,0.59,0.08
16497,,Norfolk,,,,,,,0.09,0.58,...,0.03,0.15,0.72,0.59,0.07,0.7,0.57,0.07,0.59,0.07
16498,,Northamptonshire,,,,,,,0.09,0.57,...,0.03,0.13,0.75,0.61,0.08,0.74,0.57,0.07,0.6,0.08
16499,,Northumberland,,,,,,,0.07,0.6,...,0.04,0.13,0.71,0.65,0.1,0.7,0.61,0.09,0.64,0.1
16500,,Oxfordshire,,,,,,,0.08,0.56,...,0.03,0.17,0.8,0.63,0.1,0.78,0.61,0.09,0.63,0.1
16501,,Somerset,,,,,,,0.08,0.56,...,0.02,0.15,0.77,0.62,0.08,0.75,0.59,0.08,0.61,0.09
16502,,Suffolk,,,,,,,0.09,0.57,...,0.03,0.12,0.73,0.61,0.09,0.72,0.57,0.08,0.6,0.08
16503,,Surrey,,,,,,,0.06,0.49,...,0.04,0.13,0.83,0.7,0.13,0.82,0.67,0.13,0.69,0.13
16504,,Warwickshire,,,,,,,0.07,0.56,...,0.03,0.14,0.78,0.67,0.11,0.77,0.62,0.1,0.65,0.11
16505,,West Sussex,,,,,,,0.08,0.62,...,0.03,0.16,0.8,0.62,0.07,0.8,0.55,0.05,0.6,0.06


In [23]:
df.head(10)

Unnamed: 0,URN,SCHNAME,TOWN,PCODE,PCON_NAME,NFTYPE,RELDENOM,AGERANGE,PTKS1GROUP_L,PTKS1GROUP_M,...,PSENELE,PSENELK,PTNOTFSM6CLA1A_18,PTRWM_EXP_18,PTRWM_HIGH_18,PTNOTFSM6CLA1A_17,PTRWM_EXP_17,PTRWM_HIGH_17,PTRWM_EXP_3YR,PTRWM_HIGH_3YR
0,141279,Bringhurst Primary School,Market Harborough,LE16 8RH,Rutland and Melton,ACC,Does not apply,4-11,0.0,0.5,...,0.0,0.12,0.92,0.68,0.2,0.88,0.88,0.13,0.84,0.2
1,119910,Buckminster Primary School,Grantham,NG33 5RZ,Rutland and Melton,CY,Does not apply,4-11,0.06,0.71,...,0.0,0.06,0.84,0.79,0.11,0.87,0.6,0.0,0.69,0.12
2,139342,Great Dalby School,Melton Mowbray,LE14 2HA,Rutland and Melton,ACC,Does not apply,5-11,0.0,0.57,...,0.0,0.24,0.9,0.8,0.2,0.9,0.8,0.25,0.82,0.2
3,119912,Burton-on-the-Wolds Primary School,Loughborough,LE12 5TB,Loughborough,CY,Does not apply,4-11,0.04,0.46,...,0.0,0.16,0.86,0.91,0.41,0.92,0.88,0.16,0.9,0.33
4,119913,Belvoirdale Community Primary School,Coalville,LE67 3RD,North West Leicestershire,CY,Does not apply,4-11,0.15,0.56,...,0.02,0.28,0.66,0.61,0.03,0.65,0.5,0.05,0.53,0.04
5,141222,Christ Church & Saint Peter's Cofe Primary School,Loughborough,LE12 7JU,Charnwood,AC,Church of England,5-11,0.13,0.55,...,0.03,0.37,0.73,0.68,0.13,0.67,0.63,0.09,0.67,0.14
6,119914,Ellistown Community Primary School,Coalville,LE67 1EN,North West Leicestershire,CY,Does not apply,4-11,0.0,0.58,...,0.0,0.0,0.82,0.71,0.18,0.87,0.84,0.16,0.73,0.18
7,119915,Hugglescote Community Primary School,Coalville,LE67 2HA,North West Leicestershire,CY,Does not apply,4-11,0.18,0.58,...,0.11,0.21,0.88,0.56,0.08,0.9,0.65,0.05,0.59,0.06
8,119916,Woodstone Community Primary School,Coalville,LE67 2AH,North West Leicestershire,CY,Does not apply,4-11,0.03,0.55,...,0.0,0.13,0.9,0.87,0.1,0.9,0.55,0.06,0.71,0.09
9,119917,New Swannington Primary School,Coalville,LE67 5DQ,North West Leicestershire,CY,Does not apply,4-11,0.04,0.58,...,0.04,0.22,0.88,0.88,0.17,1.0,0.81,0.08,0.75,0.14


In [24]:
df.tail(10)

Unnamed: 0,URN,SCHNAME,TOWN,PCODE,PCON_NAME,NFTYPE,RELDENOM,AGERANGE,PTKS1GROUP_L,PTKS1GROUP_M,...,PSENELE,PSENELK,PTNOTFSM6CLA1A_18,PTRWM_EXP_18,PTRWM_HIGH_18,PTNOTFSM6CLA1A_17,PTRWM_EXP_17,PTRWM_HIGH_17,PTRWM_EXP_3YR,PTRWM_HIGH_3YR
16345,126155,St Anthony's School,Chichester,PO19 5PA,Chichester,CYS,Does not apply,4-16,1.0,0.0,...,1.0,0.0,0.5,0.0,0.0,0.67,0.0,0.0,,
16346,126156,"Littlegreen School, Compton",Chichester,PO18 9NW,Chichester,CYS,Does not apply,7-16,0.29,0.71,...,1.0,0.0,0.42,0.0,0.0,0.36,0.0,0.0,,
16347,126159,Palatine Primary School,Worthing,BN12 6JP,Worthing West,CYS,Does not apply,4-11,1.0,0.0,...,1.0,0.0,0.44,0.0,0.0,0.59,0.0,0.0,,
16348,126160,"Queen Elizabeth II Silver Jubilee School, Horsham",Horsham,RH13 5NW,Horsham,CYS,Does not apply,2-19,,,...,,,,,,0.5,0.0,0.0,,
16349,126162,Manor Green Primary School,Crawley,RH11 0DU,Crawley,CYS,Does not apply,2-11,0.93,0.07,...,1.0,0.0,0.61,0.0,0.0,0.65,0.0,0.0,,
16350,126163,"Fordwater School, Chichester",Chichester,PO19 6PP,Chichester,CYS,Does not apply,2-19,1.0,0.0,...,1.0,0.0,,,,,,,,
16351,136114,Woodlands Meed,Burgess Hill,RH15 9EY,Mid Sussex,FDS,Does not apply,2-19,0.65,0.3,...,0.95,0.05,0.65,0.0,0.0,0.53,0.0,0.0,,
16352,145394,Brantridge School,Haywards Heath,RH17 6EQ,Mid Sussex,ACCS,Does not apply,6-13,0.4,0.6,...,1.0,0.0,,,,,,,,
16353,126169,Herons Dale School,Shoreham-by-Sea,BN43 6TN,East Worthing and Shoreham,CYS,Does not apply,4-11,1.0,0.0,...,1.0,0.0,0.38,0.0,0.0,0.58,0.0,0.0,,
16354,126170,"Cornfield School, Littlehampton",Littlehampton,BN17 6HY,Bognor Regis and Littlehampton,CYS,Does not apply,9-16,,,...,,,,,,,,,,


In [25]:
df['URN'].isna().any()

False

No N/A in the main dataframe, for unique reference numbers

In [26]:
df_bor['URN'].isna().all()

True

All unique reference numbers are N/A in the boroughs dataframe

In [27]:
df.shape

(16355, 34)

In [28]:
df.describe()

Unnamed: 0,PTKS1GROUP_L,PTKS1GROUP_M,PTKS1GROUP_H,PTNotFSM6CLA1A,PTRWM_EXP,PTRWM_HIGH,PTREAD_EXP,PTREAD_HIGH,PTGPS_EXP,PTGPS_HIGH,...,PSENELE,PSENELK,PTNOTFSM6CLA1A_18,PTRWM_EXP_18,PTRWM_HIGH_18,PTNOTFSM6CLA1A_17,PTRWM_EXP_17,PTRWM_HIGH_17,PTRWM_EXP_3YR,PTRWM_HIGH_3YR
count,15480.0,15480.0,15480.0,15480.0,15479.0,15479.0,15480.0,15480.0,15479.0,15479.0,...,15480.0,15480.0,15091.0,15089.0,15089.0,14712.0,14711.0,14711.0,14455.0,13932.0
mean,0.107987,0.565017,0.327312,0.69334,0.634419,0.100734,0.722557,0.266632,0.761545,0.337526,...,0.057862,0.14963,0.693144,0.632197,0.096043,0.683345,0.603117,0.086075,0.645966,0.101484
std,0.168352,0.147885,0.144776,0.198043,0.194084,0.081943,0.189606,0.139836,0.193057,0.172724,...,0.187635,0.095816,0.200373,0.197215,0.080385,0.206087,0.19657,0.075822,0.135657,0.062298
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01
25%,0.03,0.5,0.24,0.56,0.55,0.04,0.65,0.17,0.7,0.22,...,0.0,0.08,0.56,0.55,0.03,0.54,0.5,0.03,0.56,0.06
50%,0.07,0.58,0.33,0.73,0.67,0.09,0.76,0.26,0.8,0.33,...,0.01,0.14,0.73,0.67,0.08,0.73,0.63,0.07,0.65,0.09
75%,0.12,0.65,0.42,0.85,0.76,0.15,0.84,0.35,0.88,0.45,...,0.04,0.2,0.85,0.76,0.14,0.85,0.73,0.13,0.74,0.13
max,1.0,1.0,0.89,1.0,1.0,0.71,1.0,0.94,1.0,1.0,...,1.0,0.83,1.0,1.0,0.67,1.0,1.0,0.81,1.0,0.56


We can see that the scale of each continuous feature is indeed from 0 to 1, as expected (since they are all percentages expressed as fractions), so there are no obvious errors in the data, at least at first glance.

Note that the number of columns in the table above is 26. The describe method of the pandas.DataFrame class includes, by default, only numeric features. The total number of features in the dataframe is 34. This is expected, since the first 8 features (URN through AGERANGE) are identifiers or distinctly categorical (e.g. religious denomination) and are stored as type object.
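The non-numeric columns can be summarised separately by passing `include="object"` to `describe`. A minimal sketch on a toy frame (illustrative values only):

```python
import pandas as pd

toy = pd.DataFrame({
    "TOWN": pd.Series(["Coalville", "Coalville", "Grantham"], dtype="object"),
    "PTRWM_EXP": [0.61, 0.71, 0.79],
})

numeric_summary = toy.describe()                       # numeric columns only (default)
categorical_summary = toy.describe(include="object")   # count, unique, top, freq
print(categorical_summary)
```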

Nevertheless, the fact that descriptive statistics were calculated for all features where we expected them does *not* mean on its own that everything is fine with these columns. Specifically we will need to check for N/A values, which are generally ignored in calculations. 

In [29]:
df.index # still the default pandas index

Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            16345, 16346, 16347, 16348, 16349, 16350, 16351, 16352, 16353,
            16354],
           dtype='int64', length=16355)

It will be more convenient to replace the default index with one of the columns, specifically URN. This applies only to the dataframe of individual school records, each of which has its own unique reference number.

In [30]:
df.set_index('URN', inplace=True)
df.index

Int64Index([141279, 119910, 139342, 119912, 119913, 141222, 119914, 119915,
            119916, 119917,
            ...
            126155, 126156, 126159, 126160, 126162, 126163, 136114, 145394,
            126169, 126170],
           dtype='int64', name='URN', length=16355)

In [31]:
df.index.is_unique

True

This worked fine. The index is now a series of URNs, each recognised as an integer as expected, and every value in the index is unique.
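As a side note, uniqueness can also be enforced at the moment the index is set, by passing `verify_integrity=True` to `set_index`. A minimal sketch with a deliberately duplicated identifier (toy values, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "URN": [141279, 119910, 119910],   # deliberately duplicated identifier
    "SCHNAME": ["A", "B", "C"],
})

try:
    toy.set_index("URN", verify_integrity=True)
    duplicate_detected = False
except ValueError:
    # pandas raises ValueError when the new index contains duplicate keys
    duplicate_detected = True

print(duplicate_detected)
```

With this option, a duplicated URN would stop the cell with an error instead of being found by a later `is_unique` check.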