# Cleaning a dataset

This notebook details the process of cleaning an especially dirty dataset - you can find it [here](https://github.com/paulbradshaw/cleaning/blob/master/dirtydata/Disposals%20by%20region%202012-13%20Table.xls) in [my folder of dirty data](https://github.com/paulbradshaw/cleaning/tree/master/dirtydata).

You'll need to download the data from that link, and then upload it to the files folder on the left before proceeding.

Let's list the problems:

* Data spread across multiple sheets
* Information in the sheet name
* Headings in row 2
* Empty rows
* Empty columns
* Mixed data in column 1 (region, category, sub-category)
* Non-string column headings

What are the potential solutions?

* Import whole Excel file using `pd.ExcelFile` and loop through sheets to combine them by appending to a dataframe
* Add a column as we append each dataframe containing the sheet name
* Either `skiprows=` or `header=` when importing to specify headings in row 2
* `.drop()` to get rid of empty columns
* `dropna` to get rid of empty rows
* Look for patterns that distinguish different types of data: empty cells next to them? Particular strings/values?
* Loop through the column headings and apply `str()` function to convert to string
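
As a minimal sketch of the two "empty" fixes in the list above, assuming a small made-up dataframe (`toy` is a stand-in, not the real spreadsheet): `dropna(how="all")` removes only the rows, or with `axis=1` the columns, in which every cell is missing.

```python
import numpy as np
import pandas as pd

# a made-up dataframe with one fully empty row and one fully empty column
toy = pd.DataFrame({
    "label": ["Reprimand", np.nan, "Final Warning"],
    "blank": [np.nan, np.nan, np.nan],
    "TOTAL": [13055.0, np.nan, 10949.0],
})

# drop rows where every cell is missing
no_empty_rows = toy.dropna(how="all")
# drop columns where every cell is missing
cleaned = no_empty_rows.dropna(axis=1, how="all")
print(cleaned)
```

Using `how="all"` matters here: a row that has a label but no figures (such as a category heading) is kept, because not every cell in it is empty.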

Let's start by importing the `pandas` library and the data itself. (We could import directly from GitHub but it's not straightforward).
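
If you do want to try importing directly, part of the difficulty is that the link above points to a GitHub web page rather than the file itself. The sketch below uses a hypothetical helper, `blob_to_raw`, to rewrite the page link into the `raw.githubusercontent.com` form; it assumes the file is still at the same path and that your environment can reach GitHub.

```python
from urllib.parse import urlsplit


def blob_to_raw(blob_url):
    """Convert a github.com 'blob' page URL into a raw.githubusercontent.com URL."""
    parts = urlsplit(blob_url)
    # the path looks like /user/repo/blob/branch/path/to/file
    user, repo, marker, rest = parts.path.lstrip("/").split("/", 3)
    if marker != "blob":
        raise ValueError("expected a GitHub 'blob' URL")
    return f"https://raw.githubusercontent.com/{user}/{repo}/{rest}"


page_url = "https://github.com/paulbradshaw/cleaning/blob/master/dirtydata/Disposals%20by%20region%202012-13%20Table.xls"
raw_url = blob_to_raw(page_url)
print(raw_url)
# pd.read_excel(raw_url, skiprows=1) could then be tried in place of the local filename
```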

In [2]:
#import the pandas library
import pandas as pd

In [5]:
#store the filename 
xls = "Disposals by region 2012-13 Table.xls"
#read that file, skipping the first row so we capture the headings in row 2
xlsdf = pd.read_excel(xls, skiprows=1)
#show the first few rows
xlsdf.head(3)

Unnamed: 0,These figures do not match the data published in Chapter 5 as they are taken from a different data source.,10 - 14,15,16,17+,Unnamed: 5,Female,Male,Not Known,Unnamed: 9,White,Mixed,Asian or Asian British,Black or Black British,Chinese or Other Ethnic Group,Not Known.1,TOTAL
0,National,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,
2,Pre-court,,,,,,,,,,,,,,,,
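
The cell above uses `skiprows=1`; passing `header=1` instead should give the same result, because both tell pandas that the headings are in the second row. Here is a small sketch of that equivalence. It uses `pd.read_csv` on made-up text as a stand-in, because it accepts the same two parameters and needs no file.

```python
from io import StringIO

import pandas as pd

# made-up data: a note in row 1, the real headings in row 2
raw = "A note that is not a heading,,\nlabel,15,16\nReprimand,2795,2814\n"

# option 1: skip the first row, so the next row becomes the header
skipped = pd.read_csv(StringIO(raw), skiprows=1)
# option 2: say that the header is in row index 1 (the second row)
headed = pd.read_csv(StringIO(raw), header=1)

print(skipped.equals(headed))
```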


## Dealing with numeric column headings

pandas can store column headings that aren't strings, but a check like `'Unnamed' in colname` will fail on them. Two of our column headings (15 and 16) were imported as numbers, so we can fix that by using the `str()` function to convert them to strings.

In [None]:
#put column names in a list
dfcols = list(xlsdf.columns)

#replace that list with those items as strings to fix the numbers
dfcols = [str(i) for i in dfcols]
dfcols
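
A shorter route to the same result, sketched on a made-up dataframe (`toy`) rather than the real data, is to convert the column index itself with `astype(str)` and assign it straight back:

```python
import pandas as pd

# made-up frame whose headings mix strings and integers, as an Excel import can
toy = pd.DataFrame([[1, 2, 3]], columns=["10 - 14", 15, 16])

# convert every heading to a string and assign the result back to the dataframe
toy.columns = toy.columns.astype(str)
print(list(toy.columns))
```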

## Drop two empty columns using `drop()`

To drop the empty columns it's useful to have their names. Below we loop through the column names and look for any that match a pattern. Those that do, we add to a list to use for dropping them.

In [22]:
#create an empty list to store unnamed column names
unnamedlist = []

#loop through the column names
for colname in dfcols:
  print(colname)
  #T/F - is 'Unnamed' in the column name?
  if 'Unnamed' in colname:
    #add it to the list of unnamed columns
    unnamedlist.append(colname)
  
unnamedlist

These figures do not match the data published in Chapter 5 as they are taken from a different data source.
10 - 14
15
16
17+
Unnamed: 5
Female
Male
Not Known
Unnamed: 9
White
Mixed
Asian or Asian British
Black or Black British
Chinese or Other Ethnic Group
Not Known.1
TOTAL


['Unnamed: 5', 'Unnamed: 9']

In [None]:
#remove those columns
xlsdf = xlsdf.drop(labels=unnamedlist, axis=1)
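
The loop-and-drop steps above can also be written more compactly. This sketch uses a list comprehension on a made-up dataframe (`toy`); wrapping each name in `str()` means it still works if some headings are numbers.

```python
import pandas as pd

# made-up frame with two heading-less columns, named the way pandas names them
toy = pd.DataFrame(
    [[1, None, 2, None]],
    columns=["Female", "Unnamed: 5", "Male", "Unnamed: 9"],
)

# collect the placeholder names pandas gave to the heading-less columns
unnamed = [name for name in toy.columns if "Unnamed" in str(name)]
# drop them by name
trimmed = toy.drop(columns=unnamed)
print(list(trimmed.columns))
```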

In [28]:
#show the first 5 rows
xlsdf.head(5)

Unnamed: 0,These figures do not match the data published in Chapter 5 as they are taken from a different data source.,10 - 14,15,16,17+,Female,Male,Not Known,White,Mixed,Asian or Asian British,Black or Black British,Chinese or Other Ethnic Group,Not Known.1,TOTAL
0,National,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,
2,Pre-court,,,,,,,,,,,,,,
3,Reprimand,4726.0,2795.0,2814.0,2720.0,3524.0,9530.0,1.0,11302.0,232.0,458.0,498.0,54.0,511.0,13055.0
4,Final Warning,3467.0,2404.0,2491.0,2587.0,2350.0,8596.0,3.0,9562.0,243.0,360.0,450.0,39.0,295.0,10949.0


## Fetch all the sheets

Before we continue we should probably grab all the data instead of cleaning just the one sheet. We can do that with `ExcelFile()` which will store an entire workbook, not just one sheet. 

In [31]:
#fetch the excel file and store in an object
xls = pd.ExcelFile("Disposals by region 2012-13 Table.xls")
#fetch the sheet names of that object
xls.sheet_names

['National',
 'East Midlands',
 'Eastern',
 'London',
 'North East',
 'North West',
 'South East',
 'South West',
 'Wales',
 'West Midlands',
 'Yorkshire']

In [34]:
#create a dataframe to hold all the sheets in one
allthesheets = pd.DataFrame()

#loop through the sheet names
for sheet in xls.sheet_names:
  print(sheet)
  #grab that sheet from the file and store in a dataframe
  currentsheet = pd.read_excel("Disposals by region 2012-13 Table.xls", 
                               sheet_name=sheet,
                               skiprows=1)
  #add a column to this dataframe containing the name of the sheet
  currentsheet['region'] = sheet
  #append the current sheet to the 'all sheets' dataframe using pd.concat
  #(DataFrame.append is not available in pandas 2.0 and later)
  allthesheets = pd.concat([allthesheets, currentsheet])

National
East Midlands
Eastern
London
North East
North West
South East
South West
Wales
West Midlands
Yorkshire
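
A variation on the loop above is to collect the sheets in a list and combine them once at the end. The sketch below uses a made-up dictionary of two tiny dataframes as a stand-in for the workbook; `pd.read_excel()` with `sheet_name=None` returns a dictionary of that shape, keyed by sheet name. The figures are invented for illustration.

```python
import pandas as pd

# stand-ins for two sheets, with invented figures
sheets = {
    "National": pd.DataFrame({"label": ["Reprimand"], "TOTAL": [10]}),
    "London": pd.DataFrame({"label": ["Reprimand"], "TOTAL": [20]}),
}

frames = []
for name, frame in sheets.items():
    # assign() returns a copy with the extra column, leaving the original alone
    frames.append(frame.assign(region=name))

# one concat at the end; ignore_index gives a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Combining once avoids copying the growing dataframe on every pass through the loop, and `ignore_index=True` avoids the same row numbers repeating for each sheet.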


In [35]:
allthesheets.head(3)

Unnamed: 0.1,These figures do not match the data published in Chapter 5 as they are taken from a different data source.,10 - 14,15,16,17+,Unnamed: 5,Female,Male,Not Known,Unnamed: 9,White,Mixed,Asian or Asian British,Black or Black British,Chinese or Other Ethnic Group,Not Known.1,TOTAL,region,Unnamed: 0
0,National,,,,,,,,,,,,,,,,,National,
1,,,,,,,,,,,,,,,,,,National,
2,Pre-court,,,,,,,,,,,,,,,,,National,


## Clean the column names

Now we can repeat what we did earlier and convert all the column headings to strings.

In [36]:
#put column names in a list
dfcols = list(allthesheets.columns)

#replace that list with those items as strings to fix the numbers
dfcols = [str(i) for i in dfcols]
dfcols

['These figures do not match the data published in Chapter 5 as they are taken from a different data source.',
 '10 - 14',
 '15',
 '16',
 '17+',
 'Unnamed: 5',
 'Female',
 'Male',
 'Not Known',
 'Unnamed: 9',
 'White',
 'Mixed',
 'Asian or Asian British',
 'Black or Black British',
 'Chinese or Other Ethnic Group',
 'Not Known.1',
 'TOTAL',
 'region',
 'Unnamed: 0']

In [37]:
#replace the column names in the dataframe with the string versions
allthesheets.columns = dfcols

## Drop the empty columns

And we can drop the empty columns again too.

In [38]:
#create an empty list to store unnamed column names
unnamedlist = []

#loop through the column names
for colname in dfcols:
  print(colname)
  #T/F - is 'Unnamed' in the column name?
  if 'Unnamed' in colname:
    #add it to the list of unnamed columns
    unnamedlist.append(colname)
  
unnamedlist

These figures do not match the data published in Chapter 5 as they are taken from a different data source.
10 - 14
15
16
17+
Unnamed: 5
Female
Male
Not Known
Unnamed: 9
White
Mixed
Asian or Asian British
Black or Black British
Chinese or Other Ethnic Group
Not Known.1
TOTAL
region
Unnamed: 0


['Unnamed: 5', 'Unnamed: 9', 'Unnamed: 0']

In [None]:
#remove those columns
allthesheets = allthesheets.drop(labels=unnamedlist, axis=1)

In [41]:
allthesheets.head()

Unnamed: 0,These figures do not match the data published in Chapter 5 as they are taken from a different data source.,10 - 14,15,16,17+,Female,Male,Not Known,White,Mixed,Asian or Asian British,Black or Black British,Chinese or Other Ethnic Group,Not Known.1,TOTAL,region
0,National,,,,,,,,,,,,,,,National
1,,,,,,,,,,,,,,,,National
2,Pre-court,,,,,,,,,,,,,,,National
3,Reprimand,4726.0,2795.0,2814.0,2720.0,3524.0,9530.0,1.0,11302.0,232.0,458.0,498.0,54.0,511.0,13055.0,National
4,Final Warning,3467.0,2404.0,2491.0,2587.0,2350.0,8596.0,3.0,9562.0,243.0,360.0,450.0,39.0,295.0,10949.0,National


## Move the categories into another column

The first column is a problem because it contains three types of data: the region, the category of disposal, and the type of disposal.

In [None]:
allthesheets.head(3)

A key technique here is to look for a pattern which distinguishes one type of data from the others.

In this case, the disposal category is always in a row where the other cells are all empty. 
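
One way to test for that pattern, sketched on a made-up dataframe (`toy`, with a `value_cols` list chosen for illustration), is to build a True/False mask of rows that have a label but nothing in any of the value columns:

```python
import numpy as np
import pandas as pd

# made-up rows: a category heading followed by two disposals
toy = pd.DataFrame({
    "label": ["Pre-court", "Reprimand", "Final Warning"],
    "Female": [np.nan, 3524.0, 2350.0],
    "Male": [np.nan, 9530.0, 8596.0],
})

value_cols = ["Female", "Male"]
# True where the label is present but every value column is empty
is_heading_row = toy["label"].notna() & toy[value_cols].isna().all(axis=1)

print(toy.loc[is_heading_row, "label"].tolist())
```

In the real sheets the region name in the first row matches the same pattern, so a mask like this would flag it along with the categories.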

We can use `ffill` to fill *across* where there are empty cells, but on its own this will fill across the entire row. To stop that happening we can create an empty column for the label to fill into, followed by a 'buffer' column of text that stops the label being copied any further.
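
Here is that idea in miniature, on a made-up two-row dataframe (`toy`) rather than the real data, so the effect of each column is easy to see:

```python
import numpy as np
import pandas as pd

# made-up rows: a category heading with no figures, then a disposal with one
toy = pd.DataFrame({
    "label": ["Pre-court", "Reprimand"],
    "Female": [np.nan, 3524.0],
})

# an empty column for the label to fill into
toy.insert(1, "EMPTY", np.nan)
# a text-filled column that stops the label travelling any further right
toy.insert(2, "STOPHERE", "STOPHERE")

# fill left to right along each row
filled = toy.ffill(axis=1)
print(filled)
```

The label is copied into `EMPTY` on every row, while any empty cells to the right of the buffer pick up the buffer text instead. Filling across columns of mixed types may also leave numeric columns stored as generic objects, so they may need converting back to numbers later.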

In [None]:
#create a new second column called EMPTY, which holds empty strings for now
allthesheets.insert(1, 'EMPTY', '')

In [57]:
#import a library we will need to create NaN values
import numpy as np

In [60]:
#Fill that EMPTY column with NaN values (np.nan; the np.NaN alias is not available in NumPy 2.0 and later)
allthesheets['EMPTY'] = np.nan

In [50]:
#insert a new column 3 called 'STOPHERE' which has text to stop our fill going all the way across
allthesheets.insert(2, 'STOPHERE', 'STOPHERE')

In [63]:
#fill across (axis=1) from left to right where a cell is NaN 
allthesheets = allthesheets.ffill(axis=1)

In [64]:
allthesheets.head(5)

Unnamed: 0,These figures do not match the data published in Chapter 5 as they are taken from a different data source.,EMPTY,STOPHERE,10 - 14,15,16,17+,Female,Male,Not Known,White,Mixed,Asian or Asian British,Black or Black British,Chinese or Other Ethnic Group,Not Known.1,TOTAL,region,BUFFER
0,National,National,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,National,STOPHERE
1,,,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,National,STOPHERE
2,Pre-court,Pre-court,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,National,STOPHERE
3,Reprimand,Reprimand,STOPHERE,4726,2795,2814,2720,3524,9530,1,11302,232,458,498,54,511,13055,National,STOPHERE
4,Final Warning,Final Warning,STOPHERE,3467,2404,2491,2587,2350,8596,3,9562,243,360,450,39,295,10949,National,STOPHERE


## Rename the first two columns

We can also rename the first two columns so they make more sense.

In [70]:
#fetch the column names as a list
dfcols = list(allthesheets.columns)
#rename the second column
dfcols[1] = 'category'
#rename the first column
dfcols[0] = 'disposal'
#reassign this list to the column names
allthesheets.columns = dfcols
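
An alternative that does not depend on column positions is `rename()`, which takes a dictionary of old and new names. This is a sketch on a made-up dataframe (`toy`); `"long heading"` stands in for the long first heading in the real data.

```python
import pandas as pd

# made-up frame; "long heading" stands in for the real first column name
toy = pd.DataFrame(
    [["Reprimand", "Pre-court", 13055]],
    columns=["long heading", "EMPTY", "TOTAL"],
)

# map each old name to its new name; columns not mentioned keep their names
renamed = toy.rename(columns={"long heading": "disposal", "EMPTY": "category"})
print(list(renamed.columns))
```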

In [71]:
allthesheets.head(5)

Unnamed: 0,disposal,category,STOPHERE,10 - 14,15,16,17+,Female,Male,Not Known,White,Mixed,Asian or Asian British,Black or Black British,Chinese or Other Ethnic Group,Not Known.1,TOTAL,region,BUFFER
0,National,National,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,National,STOPHERE
1,,,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,National,STOPHERE
2,Pre-court,Pre-court,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,STOPHERE,National,STOPHERE
3,Reprimand,Reprimand,STOPHERE,4726,2795,2814,2720,3524,9530,1,11302,232,458,498,54,511,13055,National,STOPHERE
4,Final Warning,Final Warning,STOPHERE,3467,2404,2491,2587,2350,8596,3,9562,243,360,450,39,295,10949,National,STOPHERE


## Using frequency to identify data types

Another pattern we could identify is **frequency**: it may be that certain types of data appear more or less frequently than the others. You can see in the output below, for example, that regions only appear once. 

This could be used to generate a list of values that we want to match in order to copy them into a separate column and/or remove.

In [78]:
#count how many times each value appears
col1counts = allthesheets['disposal'].value_counts()
col1counts

Custody                                                                                                                                                                 11
First-tier                                                                                                                                                              11
Conditional Discharge                                                                                                                                                   11
Sentence Deferred                                                                                                                                                       11
Reparation Order                                                                                                                                                        11
Section 90-92 Detention                                                                                                                          
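
To turn counts like these into a list of values to match, one option is to keep only the values that appear exactly once and then flag the rows containing them. This is a sketch on a made-up series (`labels`), not the real column:

```python
import pandas as pd

# made-up labels: the two region names appear once, the disposals repeat
labels = pd.Series(
    ["National", "Custody", "Custody", "London", "Reprimand", "Reprimand"]
)

counts = labels.value_counts()
# keep only the values that appear exactly once
once_only = counts[counts == 1].index.tolist()
# True for each row whose value is in that list
appears_once = labels.isin(once_only)
print(sorted(once_only))
```

The `appears_once` mask could then be used to copy those values into their own column, or to filter those rows out.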

## Next: Melting wide to long