# EMA Project Diary

## Jump to section header

 - [Initial look at the KS4 dataset](#looking_ks4)
 - [Importing KS4 data](#import_ks4) 
     - [Importing the KS4 MetaData](#ks4_metafile)

<a name="looking_ks4"></a>
# Initial look at the KS4 dataset
Let's have a quick look at the data we will be working with for the EMA.


In [57]:
!head -5 'data/2015-2016/england_ks4final.csv'

﻿RECTYPE,ALPHAIND,LEA,ESTAB,URN,SCHNAME,SCHNAME_AC,ADDRESS1,ADDRESS2,ADDRESS3,TOWN,PCODE,TELNUM,CONTFLAG,ICLOSE,NFTYPE,RELDENOM,ADMPOL,EGENDER,FEEDER,TABKS2,TAB1618,AGERANGE,CONFEXAM,TOTPUPS,NUMBOYS,NUMGIRLS,TPUP,BPUP,PBPUP,GPUP,PGPUP,KS2APS,TPRIORLO,PTPRIORLO,TPRIORAV,PTPRIORAV,TPRIORHI,PTPRIORHI,TFSM6CLA1A,PTFSM6CLA1A,TNOTFSM6CLA1A,PTNOTFSM6CLA1A,TEALGRP2,PTEALGRP2,TEALGRP1,PTEALGRP1,TEALGRP3,PTEALGRP3,TNMOB,PTNMOB,SENSE4,PSENSE4,SENAPK4,PSENAPK4,TOTATT8,ATT8SCR,TOTATT8ENG,ATT8SCRENG,TOTATT8MAT,ATT8SCRMAT,TOTATT8EBAC,ATT8SCREBAC,TOTATT8OPEN,ATT8SCROPEN,TOTATT8OPENG,ATT8SCROPENG,TOTATT8OPENNG,ATT8SCROPENNG,P8PUP,P8MEACOV,P8MEA,P8CILOW,P8CIUPP,P8MEAENG,P8MEAENG_CILOW,P8MEAENG_CIUPP,P8MEAMAT,P8MEAMAT_CILOW,P8MEAMAT_CIUPP,P8MEAEBAC,P8MEAEBAC_CILOW,P8MEAEBAC_CIUPP,P8MEAOPEN,P8MEAOPEN_CILOW,P8MEAOPEN_CIUPP,PTL2BASICS_LL_PTQ_EE,PTL2BASICS_3YR_PTQ_EE,TEBACC_E_PTQ_EE,PTEBACC_E_PTQ_EE,PTEBACC_PTQ_EE,TEBACENG_E_PTQ_EE,PTEBACENG_E_PTQ_EE,TEBACMAT_E_PTQ_EE,PTEBACMAT_E_PTQ_EE,TEBAC2SCI_E_PTQ_EE,PT

In [59]:
!wc -l 'data/2015-2016/england_ks4final.csv'

5489 data/2015-2016/england_ks4final.csv


The dataset has 5489 rows of data. There are a large number of columns, and many of the headings are codes that I'll need to look up.  There are also a number of `NA` and `NP` values that could indicate missing data.  I want a quick look at the dataset to determine which columns will be most relevant to my investigation, so I will import it into MongoDB to explore further.
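Before committing to an import, it can help to see how common the `NA` and `NP` markers are in each column. The sketch below is only an illustration: the `sample` frame and its values are invented, and treating both strings as "missing" is an assumption that still needs checking against the abbreviations file.

```python
import pandas as pd

# Hypothetical miniature of the KS4 table; the values are invented for illustration.
sample = pd.DataFrame({
    "URN": [100001, 100002, 100003],
    "ATT8SCR": ["42.1", "NP", "51.3"],
    "ATT8SCR_GIRLS": ["NA", "NP", "52.0"],
})

# Strings assumed (not yet confirmed) to mark missing or suppressed values.
SENTINELS = ["NA", "NP"]

# For every column, count the cells that hold one of the sentinel strings.
sentinel_counts = sample.isin(SENTINELS).sum()
print(sentinel_counts)
```

Columns with a high count are candidates for closer inspection before they are used in any analysis.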

<a name="import_ks4"></a>
# Importing the KS4 dataset
Let's import the KS4, KS5 and original KS2 results into MongoDB so we have the information we need to hand.  I chose MongoDB for this investigation as there are a lot of columns and data to investigate.  Using a document database system will enable me to have all the data available for easy querying.

In [73]:
# import the required libraries
import pandas as pd
import scipy.stats
import pymongo
import bson

In [3]:
!/usr/bin/mongoimport --port 27351 --drop --db schools --collection ks4results \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/england_ks4final.csv

2018-05-16T07:19:15.842+0000	connected to: localhost:27351
2018-05-16T07:19:15.842+0000	dropping: schools.ks4results
2018-05-16T07:19:18.289+0000	imported 5489 documents


In [61]:
# open a connection to the mongo server
client = pymongo.MongoClient('mongodb://localhost:27351/')

In [62]:
# open the imported database and collection
db = client.schools
ks4results = db.ks4results

In [64]:
# check the number of imported documents matches the line count of the csv
ks4results.find().count()

5489

In [65]:
# let's take a look at one document
ks4results.find_one()

{'AC5EM13': '0%',
 'AC5EM14_PTQ': '0%',
 'AC5EM15_PTQ_EE': '0%',
 'AC5EM16_PTQ_EE': '0%',
 'ADDRESS1': 'Queen Victoria Street',
 'AGERANGE': '10-18',
 'ALPHAIND': 11828,
 'ATT8SCR': 42.1,
 'ATT8SCREBAC': 22.2,
 'ATT8SCREBAC_FSM6CLA1A': 'NP',
 'ATT8SCREBAC_NFSM6CLA1A': 'NP',
 'ATT8SCRENG': 7.3,
 'ATT8SCRENG_FSM6CLA1A': 'NP',
 'ATT8SCRENG_NFSM6CLA1A': 'NP',
 'ATT8SCRMAT': 0,
 'ATT8SCRMAT_FSM6CLA1A': 'NP',
 'ATT8SCRMAT_NFSM6CLA1A': 'NP',
 'ATT8SCROPEN': 12.6,
 'ATT8SCROPENG': 10.4,
 'ATT8SCROPENG_FSM6CLA1A': 'NP',
 'ATT8SCROPENG_NFSM6CLA1A': 'NP',
 'ATT8SCROPENNG': 2.2,
 'ATT8SCROPENNG_FSM6CLA1A': 'NP',
 'ATT8SCROPENNG_NFSM6CLA1A': 'NP',
 'ATT8SCROPEN_FSM6CLA1A': 'NP',
 'ATT8SCROPEN_NFSM6CLA1A': 'NP',
 'ATT8SCR_15': 'NA',
 'ATT8SCR_AV': 'NP',
 'ATT8SCR_BOYS': 42.1,
 'ATT8SCR_EAL': 'NP',
 'ATT8SCR_FSM6CLA1A': 'NP',
 'ATT8SCR_GIRLS': 'NA',
 'ATT8SCR_HI': 'NP',
 'ATT8SCR_LO': 'NP',
 'ATT8SCR_NFSM6CLA1A': 'NP',
 'ATT8SCR_NMOB': 'NP',
 'BPUP': 139,
 'CONTFLAG': 0,
 'DIFFN_ATT8': 'NP',
 'DIFFN_

This document contains a large number of missing values.  The keys are for the most part codes, so I'll need to look in the dataset for their meaning.  Let's have a look in the data folder for an abbreviations file.

<a name='ks4_metafile'></a>
## Importing the KS4 Metafile data

In [66]:
!ls data/2015-2016/


abbreviations.xlsx	    england_swf.csv
abs_meta.csv		    england_vaqual.csv
census_meta.csv		    england_vasubj.csv
england_abs.csv		    ks2_meta.csv
england_census.csv	    ks4_labels.csv
england_cfrfull.xlsx	    ks4_meta.csv
england_ks2final.csv	    ks4_meta_methodology.csv
england_ks4final.csv	    ks4-pupdest_meta.csv
england_ks4-pupdest.csv     ks5_meta.csv
england_ks4underlying.xlsx  ks5-studest_meta.csv
england_ks5final.csv	    la_and_region_codes_meta.csv
england_ks5-studest.csv     sixth_form_centres_and_consortia_meta.xlsx
england_ks5underlying.xlsx  spine_meta.csv
england_spine.csv	    swf_meta.csv


There is an abbreviations file, stored as an xlsx file.  I'll have a quick glance at it in Excel.

The abbreviations file is not the file I was looking for.  However, it does hold the general abbreviations I'll need to refer to, particularly for school types and the missing data points.  I looked at the abbreviations file as part of TMA02, where I created a simple dict to enable easy lookup of the school types.  Let's create that again.

In [80]:
school_type_dict = {'VA': 'Voluntary aided school',
             'AC': 'Sponsored Academy',
             'F': 'Free school - mainstream',
             'CY': 'Community school',
             'FS': 'Free school - special',
             'CYS': 'Community special school',
             'FD': 'Foundation school',
             'ACC': 'Academy converter - mainstream',
             'ACCS': 'Academy converter - special school',
             'FDS': 'Foundation special school',
             'ACS': 'Sponsored special academy',
             'VC': 'Voluntary controlled school'}
len(school_type_dict)


12
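As a rough illustration of how such a dict can be applied, the sketch below maps a column of school type codes to their full names. The `codes` series, the `XX` code and the `"unknown type"` fallback are made up for the example, and only three entries of the dict are repeated to keep it short.

```python
import pandas as pd

# A subset of the lookup dict, repeated here so the example is self-contained.
school_type_subset = {
    "VA": "Voluntary aided school",
    "CY": "Community school",
    "AC": "Sponsored Academy",
}

# Hypothetical column of school type codes; 'XX' is deliberately not in the dict.
codes = pd.Series(["CY", "VA", "XX"])

# Series.map returns NaN for codes missing from the dict, so fill those explicitly.
names = codes.map(school_type_subset).fillna("unknown type")
print(names.tolist())
```

Filling unmatched codes with an explicit marker makes gaps in the lookup visible instead of leaving silent NaN values.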

To make the code lookup easier I want to make a labels collection similar to the one provided by the OU team for the accidents dataset.  Having briefly looked at the files, it looks as though `ks4_meta.csv` is the file we want to import and have access to.

In [81]:
!head -5 data/2015-2016/ks4_meta.csv

Column,Metafile heading,Metafile description,Methodology changes,Null field for special schools,Null field for local authority records,Null field for National (all schools) records,Null field for National (maintained schools) records,,1,RECTYPE,Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools)),,,,,,,2,ALPHAIND,Alphabetic sorting index,,,Yes,Yes,Yes,,3,LEA,Local authority code (see separate list of local authorities and their codes),,,,Yes,Yes,,4,ESTAB,Establishment number,,,Yes,Yes,Yes,,5,URN,School Unique Reference Number,,,Yes,Yes,Yes,,6,SCHNAME,School name,,,Yes,Yes,Yes,,7,SCHNAME_AC,School now known as (used if the school has converted to an academy on or after 12 Sept 2015),,,Yes,Yes,Yes,,8,ADDRESS1,School address (1),,,Yes,Yes,Yes,,9,ADDRESS2,School address (2),,,Yes,Yes,Yes,,10,ADDRESS3,School address (3),,,Yes,Yes,Yes,,11,TOWN,School town,,,Yes,Yes,Yes,,12,PCODE,School postcode,,,Yes,Yes

In [82]:
# let's try to import the KS4 metadata straight into MongoDB
!/usr/bin/mongoimport --port 27351 --drop --db schools --collection ks4_labels \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks4_meta.csv

2018-05-16T14:46:10.060+0000	Failed: fields cannot be identical: '' and ''
2018-05-16T14:46:10.060+0000	imported 0 documents


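The import fails because the header row ends with two empty field names (the trailing commas), and mongoimport will not accept identical field names.  Let's load the file into a dataframe and organise it there instead.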
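A quick way to confirm this kind of header problem is to read just the first row and look for repeated names. This is a minimal sketch using an invented two-line sample in place of the real file; the variable names are my own.

```python
import csv
import io
from collections import Counter

# Invented stand-in for the start of the metadata file: note the trailing commas.
sample = (
    "Column,Metafile heading,Metafile description,,\n"
    "1,RECTYPE,Record type,,\n"
)

# Read only the header row and count how often each field name appears.
header = next(csv.reader(io.StringIO(sample)))
counts = Counter(header)

# Any name seen more than once (including the empty string) would clash on import.
duplicates = [name for name, n in counts.items() if n > 1]
print(duplicates)
```

An empty string in `duplicates` points to unnamed trailing columns, which matches the `'' and ''` in the error message.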

In [128]:
ks4_meta_df = pd.read_csv('data/2015-2016/ks4_meta.csv')
ks4_meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 10 columns):
Column                                                  372 non-null int64
Metafile heading                                        372 non-null object
Metafile description                                    372 non-null object
Methodology changes                                     25 non-null object
Null field for special schools                          4 non-null object
Null field for local authority records                  79 non-null object
Null field for National (all schools) records           284 non-null object
Null field for National (maintained schools) records    77 non-null object
Unnamed: 8                                              0 non-null float64
Unnamed: 9                                              0 non-null float64
dtypes: float64(2), int64(1), object(7)
memory usage: 29.1+ KB


In [129]:
ks4_meta_df.head()

| | Column | Metafile heading | Metafile description | Methodology changes | Null field for special schools | Null field for local authority records | Null field for National (all schools) records | Null field for National (maintained schools) records | Unnamed: 8 | Unnamed: 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | RECTYPE | Record type (1=mainstream school; 2=special sc... | | | | | | | |
| 1 | 2 | ALPHAIND | Alphabetic sorting index | | | Yes | Yes | Yes | | |
| 2 | 3 | LEA | Local authority code (see separate list of loc... | | | | Yes | Yes | | |
| 3 | 4 | ESTAB | Establishment number | | | Yes | Yes | Yes | | |
| 4 | 5 | URN | School Unique Reference Number | | | Yes | Yes | Yes | | |


In order to expand the codes, all we need are the `Metafile heading` and `Metafile description` columns.  I'll store `Column` as well, as it may be useful later on if we choose to merge with the KS2 metadata.

In [130]:
ks4_meta_labels = ks4_meta_df[['Column','Metafile heading', 'Metafile description']]
# relabel the columns to match the target
ks4_meta_labels.columns = ['column','label', 'expanded']
ks4_meta_labels.head()

| | column | label | expanded |
|---|---|---|---|
| 0 | 1 | RECTYPE | Record type (1=mainstream school; 2=special sc... |
| 1 | 2 | ALPHAIND | Alphabetic sorting index |
| 2 | 3 | LEA | Local authority code (see separate list of loc... |
| 3 | 4 | ESTAB | Establishment number |
| 4 | 5 | URN | School Unique Reference Number |


In [131]:
ks4_labels = db.ks4_labels

In [132]:
# iterate through each row and add it to the database
for index, row in ks4_meta_labels.iterrows():
    ks4_labels.insert_one({'column': row['column'],
                           'label': row['label'],
                           'expanded': row['expanded']
                          })
ks4_labels.find_one()

{'_id': ObjectId('5afbdb78b70b0769c01d5585'),
 'codes': {'1': 'mainstream school',
  '2': 'special school',
  '4': 'local authority',
  '5': 'National (all schools)',
  '7': 'National (maintained schools)'},
 'expanded': 'Record type',
 'label': 'RECTYPE'}
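The row-by-row loop could also be collapsed into one bulk insert. The sketch below builds the list of documents from a hypothetical two-row frame; the `insert_many` call is left commented out because it needs a live MongoDB connection, and the `int(...)` conversion is a precaution so that plain Python integers are sent.

```python
import pandas as pd

# Hypothetical two-row stand-in for the relabelled metadata frame.
labels = pd.DataFrame({
    "column": [1, 2],
    "label": ["RECTYPE", "ALPHAIND"],
    "expanded": ["Record type", "Alphabetic sorting index"],
})

# Build one plain dict per row; int() turns the numpy integer into a Python int.
records = [
    {"column": int(row.column), "label": row.label, "expanded": row.expanded}
    for row in labels.itertuples(index=False)
]

# With an open collection the whole list could be sent in a single call:
# ks4_labels.insert_many(records)
print(records)
```

Because inserts append to whatever is already in the collection, it is worth checking the document count afterwards.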

In [13]:
!head -5 data/2015-2016/ks2_meta.csv

Column,Field Name,Label/Description
1,RECTYPE,Record type (1=mainstream school; 2=special school; 3=Local Authority; 4=National (all schools); 5=National (maintained schools))
2,ALPHAIND,Alphabetic index
3,LEA,Local authority number
4,ESTAB,Establishment number


In [14]:
!wc -l data/2015-2016/ks2_meta.csv

265 data/2015-2016/ks2_meta.csv


In [15]:
!head -5 data/2015-2016/ks4_meta.csv

Column,Metafile heading,Metafile description,Methodology changes,Null field for special schools,Null field for local authority records,Null field for National (all schools) records,Null field for National (maintained schools) records,,1,RECTYPE,Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools)),,,,,,,2,ALPHAIND,Alphabetic sorting index,,,Yes,Yes,Yes,,3,LEA,Local authority code (see separate list of local authorities and their codes),,,,Yes,Yes,,4,ESTAB,Establishment number,,,Yes,Yes,Yes,,5,URN,School Unique Reference Number,,,Yes,Yes,Yes,,6,SCHNAME,School name,,,Yes,Yes,Yes,,7,SCHNAME_AC,School now known as (used if the school has converted to an academy on or after 12 Sept 2015),,,Yes,Yes,Yes,,8,ADDRESS1,School address (1),,,Yes,Yes,Yes,,9,ADDRESS2,School address (2),,,Yes,Yes,Yes,,10,ADDRESS3,School address (3),,,Yes,Yes,Yes,,11,TOWN,School town,,,Yes,Yes,Yes,,12,PCODE,School postcode,,,Yes,Yes

In [16]:
!wc -l data/2015-2016/ks4_meta.csv

0 data/2015-2016/ks4_meta.csv


Strange, the line count is showing as 0.  `wc -l` counts newline characters, so the file may be using a different line terminator.  Let's check the word count.

In [72]:
!wc data/2015-2016/ks4_meta.csv

    0  4564 42255 data/2015-2016/ks4_meta.csv
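A line count of 0 alongside a non-zero byte count suggests the file contains no `\n` characters at all. One way to check is to count the different line endings in the raw bytes. This is only a sketch: the sample bytes are invented, and carriage-return-only endings are my assumption about the real file, not something confirmed here.

```python
def count_line_endings(raw: bytes) -> dict:
    """Count CRLF, bare LF and bare CR line endings in a byte string."""
    crlf = raw.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf_only": raw.count(b"\n") - crlf,
        "cr_only": raw.count(b"\r") - crlf,
    }

# Invented sample that uses carriage returns only, as older Mac exports do.
sample = b"Column,Metafile heading\r1,RECTYPE\r2,ALPHAIND\r"
endings = count_line_endings(sample)
print(endings)
```

For the real file, the same function could be applied to the result of `open(path, 'rb').read()`.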


I'll open it in OpenRefine.

In [18]:
# let's try a straight import of the ks2 data first
!/usr/bin/mongoimport --port 27351 --drop --db schools --collection ks2_labels \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks2_meta.csv

2018-05-16T07:19:19.506+0000	connected to: localhost:27351
2018-05-16T07:19:19.506+0000	dropping: schools.ks2_labels
2018-05-16T07:19:19.542+0000	imported 259 documents


In [19]:
!/usr/bin/mongoimport --port 27351 --drop --db schools --collection ks4_labels \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks4_meta.csv

2018-05-16T07:19:19.683+0000	Failed: fields cannot be identical: '' and ''
2018-05-16T07:19:19.683+0000	imported 0 documents


In [20]:
ks2_labels = db.ks2_labels

In [21]:
list(ks2_labels.find())

[{'Column': 1,
  'Field Name': 'RECTYPE',
  'Label/Description': 'Record type (1=mainstream school; 2=special school; 3=Local Authority; 4=National (all schools); 5=National (maintained schools))',
  '_id': ObjectId('5afbdb77b70b0769c01d546e')},
 {'Column': 2,
  'Field Name': 'ALPHAIND',
  'Label/Description': 'Alphabetic index',
  '_id': ObjectId('5afbdb77b70b0769c01d546f')},
 {'Column': 3,
  'Field Name': 'LEA',
  'Label/Description': 'Local authority number',
  '_id': ObjectId('5afbdb77b70b0769c01d5470')},
 {'Column': 4,
  'Field Name': 'ESTAB',
  'Label/Description': 'Establishment number',
  '_id': ObjectId('5afbdb77b70b0769c01d5471')},
 {'Column': 5,
  'Field Name': 'URN',
  'Label/Description': 'School unique reference number',
  '_id': ObjectId('5afbdb77b70b0769c01d5472')},
 {'Column': 6,
  'Field Name': 'SCHNAME',
  'Label/Description': 'School/Local authority name',
  '_id': ObjectId('5afbdb77b70b0769c01d5473')},
 {'Column': 7,
  'Field Name': 'ADDRESS1',
  'Label/Description

In [22]:
# update the keys

In [23]:
ks2_labels.update_many({}, {'$rename': {'Field Name': 'label', 
                            'Label/Description': 'expanded'}})

list(ks2_labels.find())

[{'Column': 1,
  '_id': ObjectId('5afbdb77b70b0769c01d546e'),
  'expanded': 'Record type (1=mainstream school; 2=special school; 3=Local Authority; 4=National (all schools); 5=National (maintained schools))',
  'label': 'RECTYPE'},
 {'Column': 2,
  '_id': ObjectId('5afbdb77b70b0769c01d546f'),
  'expanded': 'Alphabetic index',
  'label': 'ALPHAIND'},
 {'Column': 3,
  '_id': ObjectId('5afbdb77b70b0769c01d5470'),
  'expanded': 'Local authority number',
  'label': 'LEA'},
 {'Column': 4,
  '_id': ObjectId('5afbdb77b70b0769c01d5471'),
  'expanded': 'Establishment number',
  'label': 'ESTAB'},
 {'Column': 5,
  '_id': ObjectId('5afbdb77b70b0769c01d5472'),
  'expanded': 'School unique reference number',
  'label': 'URN'},
 {'Column': 6,
  '_id': ObjectId('5afbdb77b70b0769c01d5473'),
  'expanded': 'School/Local authority name',
  'label': 'SCHNAME'},
 {'Column': 7,
  '_id': ObjectId('5afbdb77b70b0769c01d5474'),
  'expanded': 'School address (1)',
  'label': 'ADDRESS1'},
 {'Column': 8,
  '_id': O

Looking through the meta file, we can see that only `RECTYPE` has codes that need to be extracted.

In [24]:
import re

In [25]:
ks2_labels.update_many({}, {'$unset': {'Column': ''}})

<pymongo.results.UpdateResult at 0x7f06b8a5df78>

In [26]:
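for l in ks2_labels.find():
    if l['label'] == 'RECTYPE':
        # the first 11 characters are the name itself: 'Record type'
        expanded = l['expanded'][:11]
        # skip the name and the opening ' (', drop the closing ')', then split the code list
        codes = l['expanded'][13:-1].split('; ')
        # each entry looks like '1=mainstream school': one-character code, '=', description
        key = [c[:1] for c in codes]
        value = [c[2:] for c in codes]
        codes = dict(zip(key, value))
        ks2_labels.update_one({'_id': l['_id']}, {'$set': {'expanded': expanded, 
                                                          'codes': codes}})
ks2_labels.find_one({'label': 'RECTYPE'})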

{'_id': ObjectId('5afbdb77b70b0769c01d546e'),
 'codes': {'1': 'mainstream school',
  '2': 'special school',
  '3': 'Local Authority',
  '4': 'National (all schools)',
  '5': 'National (maintained schools)'},
 'expanded': 'Record type',
 'label': 'RECTYPE'}
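The slicing above relies on the exact length of `Record type`. A regular expression is one way to make the same split less dependent on character positions. This is a sketch rather than the approach used in this diary; `split_coded_description` is a name I have made up.

```python
import re

def split_coded_description(text):
    """Split 'Name (1=a; 2=b)' into ('Name', {'1': 'a', '2': 'b'}).

    Descriptions without 'code=label' pairs are returned unchanged with no codes.
    """
    match = re.match(r"^(?P<name>[^(]+?)\s*\((?P<body>.*)\)$", text)
    if match is None:
        return text, {}
    # Each pair is 'code=label'; labels run up to the next ';' or the end.
    pairs = re.findall(r"(\w+)=([^;]+)", match.group("body"))
    if not pairs:
        return text, {}
    return match.group("name"), {code: label.strip() for code, label in pairs}

ks2_text = ("Record type (1=mainstream school; 2=special school; 3=Local Authority; "
            "4=National (all schools); 5=National (maintained schools))")
name, codes = split_coded_description(ks2_text)
print(name, codes)
```

Descriptions such as `School address (1)` contain brackets but no `=` pairs, so they come back untouched.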

In [27]:
# use the functions provided by the TM351 course team
# for exploring the accidents dataset
import collections
# Load the expanded names of keys and human-readable codes into memory
expanded_name = collections.defaultdict(str)
for e in ks2_labels.find({'expanded': {"$exists": True}}):
    expanded_name[e['label']] = e['expanded']
    
label_of = collections.defaultdict(str)
for l in ks2_labels.find({'codes': {"$exists": True}}):
    for c in l['codes']:
        try:
            label_of[l['label'], int(c)] = l['codes'][c]
        except ValueError: 
            label_of[l['label'], c] = l['codes'][c]

In [28]:
[(c, label_of['RECTYPE', c]) for k, c in label_of if k == 'RECTYPE']

[(5, 'National (maintained schools)'),
 (3, 'Local Authority'),
 (4, 'National (all schools)'),
 (1, 'mainstream school'),
 (2, 'special school')]

In [29]:
expanded_name['RECTYPE']

'Record type'

In [30]:
expanded_name['TOTPUPS']


'Total number of pupils (including part-time pupils)'

In [31]:
expanded_name['ICLOSE']

'Closed Flag'

Now let's take a look at the KS4 data.

The KS4 data will need a little more manipulation before we can merge it into the labels.

In [32]:
import pandas as pd
ks4_meta = pd.read_csv('data/2015-2016/ks4_meta.csv')
ks4_meta

| | Column | Metafile heading | Metafile description | Methodology changes | Null field for special schools | Null field for local authority records | Null field for National (all schools) records | Null field for National (maintained schools) records | Unnamed: 8 | Unnamed: 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | RECTYPE | Record type (1=mainstream school; 2=special sc... | | | | | | | |
| 1 | 2 | ALPHAIND | Alphabetic sorting index | | | Yes | Yes | Yes | | |
| 2 | 3 | LEA | Local authority code (see separate list of loc... | | | | Yes | Yes | | |
| 3 | 4 | ESTAB | Establishment number | | | Yes | Yes | Yes | | |
| 4 | 5 | URN | School Unique Reference Number | | | Yes | Yes | Yes | | |
| 5 | 6 | SCHNAME | School name | | | Yes | Yes | Yes | | |
| 6 | 7 | SCHNAME_AC | School now known as (used if the school has co... | | | Yes | Yes | Yes | | |
| 7 | 8 | ADDRESS1 | School address (1) | | | Yes | Yes | Yes | | |
| 8 | 9 | ADDRESS2 | School address (2) | | | Yes | Yes | Yes | | |
| 9 | 10 | ADDRESS3 | School address (3) | | | Yes | Yes | Yes | | |


In [33]:
# for the label information we don't need the other columns
ks4_labels = ks4_meta[['Metafile heading', 'Metafile description']]
ks4_labels.index.name = 'index'

For the labels collection we only need two columns: `Metafile heading` and `Metafile description`.

In [34]:
ks4_labels.to_csv('data/2015-2016/ks4_labels.csv')

In [35]:
!ls data/2015-2016/

abbreviations.xlsx	    england_swf.csv
abs_meta.csv		    england_vaqual.csv
census_meta.csv		    england_vasubj.csv
england_abs.csv		    ks2_meta.csv
england_census.csv	    ks4_labels.csv
england_cfrfull.xlsx	    ks4_meta.csv
england_ks2final.csv	    ks4_meta_methodology.csv
england_ks4final.csv	    ks4-pupdest_meta.csv
england_ks4-pupdest.csv     ks5_meta.csv
england_ks4underlying.xlsx  ks5-studest_meta.csv
england_ks5final.csv	    la_and_region_codes_meta.csv
england_ks5-studest.csv     sixth_form_centres_and_consortia_meta.xlsx
england_ks5underlying.xlsx  spine_meta.csv
england_spine.csv	    swf_meta.csv


In [36]:
!/usr/bin/mongoimport --port 27351 --drop --db schools --collection ks4_labels \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks4_labels.csv

2018-05-16T07:19:20.473+0000	connected to: localhost:27351
2018-05-16T07:19:20.473+0000	dropping: schools.ks4_labels
2018-05-16T07:19:20.499+0000	imported 372 documents


In [37]:
ks4_labels = db.ks4_labels

In [38]:
ks4_labels.find_one()

{'Metafile description': 'Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools))',
 'Metafile heading': 'RECTYPE',
 '_id': ObjectId('5afbdb78b70b0769c01d5585'),
 'index': 0}

Let's drop the `index` key.

In [39]:
ks4_labels.update_many({}, {'$unset':{'index':''}})
ks4_labels.find_one()

{'Metafile description': 'Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools))',
 'Metafile heading': 'RECTYPE',
 '_id': ObjectId('5afbdb78b70b0769c01d5585')}
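The `index` key came from the DataFrame index that `to_csv` writes by default. An alternative to unsetting it afterwards would be to leave it out at export time. A minimal sketch with a one-row stand-in frame, writing to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# One-row stand-in for the two label columns.
labels = pd.DataFrame({
    "Metafile heading": ["RECTYPE"],
    "Metafile description": ["Record type"],
})

# index=False stops pandas writing the row index as an extra first column.
buffer = io.StringIO()
labels.to_csv(buffer, index=False)

header_line = buffer.getvalue().splitlines()[0]
print(header_line)
```

With the index omitted, the imported documents would only carry the two metadata fields and `_id`.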

Now relabel the keys as before.


In [40]:
[(c, label_of['RECTYPE', c]) for k, c in label_of if k == 'RECTYPE']

[(5, 'National (maintained schools)'),
 (3, 'Local Authority'),
 (4, 'National (all schools)'),
 (1, 'mainstream school'),
 (2, 'special school')]

In [41]:
ks4_labels.update_many({}, {'$rename': {'Metafile description': 'expanded',
                                        'Metafile heading': 'label'}})
ks4_labels.find_one()

{'_id': ObjectId('5afbdb78b70b0769c01d5585'),
 'expanded': 'Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools))',
 'label': 'RECTYPE'}

Let's codify the record type again:


In [42]:
for l in ks4_labels.find():
    if l['label'] == 'RECTYPE':
        expanded = l['expanded'][:11]
        codes = l['expanded'][13:-1].split('; ')
        key = [c[:1] for c in codes]
        print(key)
        value = [c[2:] for c in codes]
        codes = (dict(list(zip(key, value))))
        print(codes)
        ks4_labels.update_one({'_id': l['_id']}, {'$set': {'expanded': expanded, 
                                                          'codes': codes}})
#         print(l['expanded'])
ks4_labels.find_one({'label': 'RECTYPE'})

['1', '2', '4', '5', '7']
{'1': 'mainstream school', '5': 'National (all schools)', '4': 'local authority', '7': 'National (maintained schools)', '2': 'special school'}


{'_id': ObjectId('5afbdb78b70b0769c01d5585'),
 'codes': {'1': 'mainstream school',
  '2': 'special school',
  '4': 'local authority',
  '5': 'National (all schools)',
  '7': 'National (maintained schools)'},
 'expanded': 'Record type',
 'label': 'RECTYPE'}

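Looking at both datasets, we can see that they differ, for the record types at least: KS2 uses codes 3, 4 and 5 for the local authority and the two national records, while KS4 uses 4, 5 and 7.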
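The mismatch between the two code tables can be listed programmatically. The sketch below copies the two `codes` dicts from the outputs above and reports every code whose meaning differs; `compare_codes` is a helper name of my own, and ignoring letter case is an assumption about what counts as "the same".

```python
ks2_rectype = {
    "1": "mainstream school",
    "2": "special school",
    "3": "Local Authority",
    "4": "National (all schools)",
    "5": "National (maintained schools)",
}
ks4_rectype = {
    "1": "mainstream school",
    "2": "special school",
    "4": "local authority",
    "5": "National (all schools)",
    "7": "National (maintained schools)",
}

def compare_codes(a, b):
    """Return {code: (meaning_in_a, meaning_in_b)} for codes that disagree."""
    result = {}
    for code in sorted(set(a) | set(b)):
        left, right = a.get(code), b.get(code)
        # A code differs if it is missing on one side or the text does not match.
        if left is None or right is None or left.lower() != right.lower():
            result[code] = (left, right)
    return result

differences = compare_codes(ks2_rectype, ks4_rectype)
for code, (ks2_meaning, ks4_meaning) in differences.items():
    print(code, ks2_meaning, ks4_meaning)
```

Only codes 1 and 2 agree, so any query that filters on `RECTYPE` needs the lookup for the right key stage.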

In [100]:
for l in ks4_labels.find({'label':'NFTYPE'}):
    print(l)

{'expanded': 'School type (see separate list of abbreviations used in the tables)', '_id': ObjectId('5afbdb78b70b0769c01d5594'), 'label': 'NFTYPE'}


In [99]:
label_of['NFTYPE', '0']

''
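The empty string here is what a `defaultdict(str)` returns for any key it has never seen, and `label_of` was built from the KS2 labels, where only `RECTYPE` has codes. The sketch below shows that behaviour on a small stand-in dict; `label_of_demo` and the `"<no label>"` fallback are names invented for the example.

```python
import collections

# Stand-in for label_of with a single known entry.
label_of_demo = collections.defaultdict(str)
label_of_demo["RECTYPE", 1] = "mainstream school"

# Indexing an unknown key returns '' and also stores that key in the dict.
missing = label_of_demo["NFTYPE", "0"]
now_present = ("NFTYPE", "0") in label_of_demo

# .get() does not trigger the default factory, so a fallback can be supplied.
safe = label_of_demo.get(("NFTYPE", "1"), "<no label>")
print(repr(missing), now_present, safe)
```

Using `.get()` with an explicit fallback makes a missing label easier to spot than a silent empty string.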

In [46]:
df = pd.DataFrame(list(ks4results.find({'RECTYPE': 1})))

In [53]:
df.describe()

| | ALPHAIND | CONTFLAG | ESTAB | FEEDER | ICLOSE | LEA | RECTYPE | TAB1618 | TABKS2 | TPUP | URN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4196.0 | 4196.0 | 4196.0 | 4196.0 | 4196.0 | 4196.0 | 4196.0 | 4196.0 | 4196.0 | 3995.0 | 4196.0 |
| mean | 31238.209247 | 0.0 | 4930.61368 | 0.012393 | 0.015491 | 671.13632 | 1.0 | 0.683508 | 0.037178 | 143.872591 | 127727.017159 |
| std | 18570.836976 | 0.0 | 983.241566 | 0.110644 | 0.12351 | 273.915396 | 0.0 | 0.465163 | 0.189221 | 77.973029 | 13366.193672 |
| min | 20.0 | 0.0 | 2006.0 | 0.0 | 0.0 | 201.0 | 1.0 | 0.0 | 0.0 | 1.0 | 100001.0 |
| 25% | 15005.5 | 0.0 | 4041.0 | 0.0 | 0.0 | 350.75 | 1.0 | 0.0 | 0.0 | 93.0 | 116462.75 |
| 50% | 30271.0 | 0.0 | 4510.0 | 0.0 | 0.0 | 841.0 | 1.0 | 1.0 | 0.0 | 147.0 | 135982.0 |
| 75% | 46395.0 | 0.0 | 6003.0 | 0.0 | 0.0 | 888.0 | 1.0 | 1.0 | 0.0 | 196.0 | 137922.25 |
| max | 64354.0 | 0.0 | 8601.0 | 1.0 | 1.0 | 938.0 | 1.0 | 1.0 | 1.0 | 550.0 | 143428.0 |


In [133]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4196 entries, 0 to 4195
Columns: 373 entries, AC5EM13 to _id
dtypes: float64(1), int64(10), object(362)
memory usage: 11.9+ MB
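With 362 of the 373 columns typed as `object`, most measures are being held as strings, presumably because of values such as `'NP'`, `'NA'` and `'0%'` seen in the sample document. The sketch below shows one possible way to coerce such columns to numbers. The `sample` frame is invented, `to_number` is a helper name of my own, and turning every unparseable value into NaN is an assumption that discards the difference between `NA` and `NP`.

```python
import pandas as pd

# Invented sample mixing numeric strings, percentages and sentinel codes.
sample = pd.DataFrame({
    "ATT8SCR": ["42.1", "NP", "51.3"],
    "AC5EM13": ["0%", "57%", "NA"],
})

def to_number(series):
    """Strip a trailing '%' and convert to float; anything unparseable becomes NaN."""
    cleaned = series.astype(str).str.rstrip("%")
    return pd.to_numeric(cleaned, errors="coerce")

numeric = sample.apply(to_number)
print(numeric.dtypes)
```

Percentages are kept as percentage points (57 rather than 0.57), which would need stating wherever the converted columns are used.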
