# Analysing MPs' expenses data with Python

This notebook details a number of techniques for using Python to analyse a dataset. Firstly, we import the `pandas` library to be able to read the CSV of data.

In [None]:
#import the pandas library, rename it as pd
import pandas as pd

In [None]:
#store the URL - this suggests that it uses an API and we can generate other URLs for other years
exes2122 = "https://www.theipsa.org.uk/api/download?type=individualExpenses&year=21_22"
#read a CSV from that URL
exes2122df = pd.read_csv(exes2122)
#or you can combine both lines into one, as in this example for the previous year's data
exes2021df = pd.read_csv("https://www.theipsa.org.uk/api/download?type=individualExpenses&year=20_21")
#or read the CSV and export it straight to a file, all in one line
pd.read_csv("https://www.theipsa.org.uk/api/download?type=individualExpenses&year=18_19").to_csv("exes1819.csv")
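
If the URL pattern does hold for other years, the addresses can be generated rather than typed out. The sketch below is an assumption based only on the three URLs used above (`18_19`, `20_21` and `21_22`): `BASE_URL` and `expenses_url()` are names invented for this illustration, and other years may not follow the same pattern or may not be available.

```python
# Illustrative helper: build a download URL for a financial year.
# Assumes the 'year=YY_YY' pattern seen in the URLs above holds for other years.
BASE_URL = "https://www.theipsa.org.uk/api/download?type=individualExpenses&year="

def expenses_url(start_year):
    """Return the URL for the financial year beginning in start_year (e.g. 2021 gives 21_22)."""
    first = start_year % 100
    second = (start_year + 1) % 100
    return f"{BASE_URL}{first:02d}_{second:02d}"

# build a dictionary of URLs, one for each starting year
urls = {year: expenses_url(year) for year in range(2018, 2022)}

for year, url in urls.items():
    print(year, url)
```

Each URL could then be passed to `pd.read_csv()` in the same way as before.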

In [None]:
#export the second dataframe (2020-21) as a CSV
exes2021df.to_csv("exes2021df.csv")

In [None]:
#show the first few rows
exes2122df.head(3)

Unnamed: 0,memberId,year,date,claimNumber,mpName,mpConstituency,category,expenseType,shortDescription,details,journeyType,journeyFrom,journeyTo,travel,nights,mileage,amountClaimed,amountPaid,amountNotPaid,amountRepaid,status,reasonIfNotPaid,supplyMonth,supplyPeriod
0,4671,21_22,25/02/2021,60077153-1,Afzal Khan,"Manchester, Gorton BC",Office Costs,Software & applications,,,,,,,0,0.0,2.79,2.79,0.0,0.0,paid,,,0
1,4671,21_22,04/03/2021,60079789-16,Afzal Khan,"Manchester, Gorton BC",Office Costs,Software & applications,,,,,,,0,0.0,2.79,2.79,0.0,0.0,paid,,,0
2,1522,21_22,11/05/2021,60085075-2,Adam Holloway,Gravesham CC,MP Travel,Rail,,,London-constituency MP & Staff,,,Standard Return,0,0.0,29.0,29.0,0.0,0.0,paid,,,0


When we check the data types you can see that the 'year' and 'date' columns are not numeric or datetime columns, but just text (`object`).

In [None]:
#show the data types
exes2122df.dtypes

memberId              int64
year                 object
date                 object
claimNumber          object
mpName               object
mpConstituency       object
category             object
expenseType          object
shortDescription     object
details             float64
journeyType          object
journeyFrom         float64
journeyTo           float64
travel               object
nights                int64
mileage             float64
amountClaimed       float64
amountPaid          float64
amountNotPaid       float64
amountRepaid        float64
status               object
reasonIfNotPaid      object
supplyMonth         float64
supplyPeriod          int64
dtype: object

## Dealing with the dates

Let's take a look at the first few dates.

In [None]:
#show the first 5 items in the 'date' column
exes2122df['date'][:5]

0    25/02/2021
1    04/03/2021
2    11/05/2021
3    08/03/2021
4    24/03/2021
Name: date, dtype: object

To convert this into something we can work with *as* dates, we need a library of functions for that - and `datetime` is the best known.

In [None]:
#import the datetime class from the datetime module
from datetime import datetime

By way of illustration of what a datetime **object** looks like, we can use the `now()` function to show the current time. Note that it is displayed as a sequence of values representing the year, month, day, hour, minute, second, and microsecond.

In [None]:
#show the current time
datetime.now()

datetime.datetime(2021, 11, 11, 15, 37, 11, 647962)

## Converting a string to a date

Our first problem is that the dates are stored as strings. 

The function `strptime()` converts strings to dates - it needs two ingredients:

* The string that you want to convert
* A string indicating the *pattern* of the date in question.

Here's an example:

In [None]:
#convert the given string into a date, based on the pattern supplied
datetime.strptime('24/03/2021', '%d/%m/%Y')

datetime.datetime(2021, 3, 24, 0, 0)

You can see that the pattern is written in a particular way. The slashes are literal (they literally represent the slashes between the day, month and year) but between those are these letters with the `%` symbol:

* `%d` means a two-character 'day'
* `%m` means a two-character 'month'
* `%Y` means a **four**-character 'year'

You can [find a full list of codes in the documentation for datetime](https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior), including ways to indicate time zone and microseconds.

Broadly speaking, lowercase letters mean 'short' and uppercase letters mean 'long' - so you would use `%y` to indicate a date string where the year takes up two characters.

As well as day, month and year, you can also have `%w` for weekday as a number (from 0-6), `%a` or `%A` for weekday as a name (e.g. Mon or Monday); and `%b` or `%B` for the month as a name (e.g. Jan or January). Hours and seconds are indicated by `%H` and `%S` but a capital `%M` distinguishes 'minutes' from month (`%m`).
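
To see several of these codes at work, here is a short sketch using only the standard `datetime` module. It parses a two-character year with `%y`, then uses `strftime()` (the reverse of `strptime()`) to turn the date back into text with day and month names. The names come out in English on a default setup, but they depend on the computer's locale settings.

```python
from datetime import datetime

# %y reads a two-character year: '21' is interpreted as 2021
shortyear = datetime.strptime('24/03/21', '%d/%m/%y')
print(shortyear)

# strftime() goes the other way: from a datetime object to a string
# %A = full weekday name, %B = full month name
longform = shortyear.strftime('%A %d %B %Y')
print(longform)

# %a and %b give the abbreviated versions
shortform = shortyear.strftime('%a %d %b %y')
print(shortform)
```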

Here's a more complex example, which also uses `%I` for the hour on a 12-hour clock and `%p` for AM or PM:


In [None]:
#Here's a complex example, based on how a Twitter scraper records timestamps
datetime.strptime('September 18, 2020 at 11:05AM','%B %d, %Y at %I:%M%p')

datetime.datetime(2020, 9, 18, 11, 5)


Note also that in the absence of any information on time, the resulting datetime object simply sets hours and minutes to zero by default. Seconds and microseconds are set to zero too, but they are left out of the display when they are zero.

## Parsing dates rather than describing them

Instead of using special characters to describe the pattern of the date, you can use a **parser**:

In [None]:
#import parse
from dateutil.parser import parse

The `parse` function from `dateutil` takes one ingredient - a string - and will 'parse' it (guess the pattern based on certain algorithms) to return a datetime.

In [None]:
#use the parse function to interpret a string as a datetime object
parse('September 18, 2020 at 11:05AM')

datetime.datetime(2020, 9, 18, 11, 5)
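
Guessing has a cost with ambiguous dates. A string like `04/03/2021` could be 4 March or 3 April, and `parse` assumes the month comes first unless told otherwise. As the dates in this dataset are written day first, here is a sketch of how to handle that with the `dayfirst` parameter:

```python
from datetime import datetime
from dateutil.parser import parse

# by default an ambiguous date is read month first: this gives 3 April
monthfirst = parse('04/03/2021')
print(monthfirst)

# dayfirst=True tells the parser to read it as 4 March instead
dayfirst = parse('04/03/2021', dayfirst=True)
print(dayfirst)
```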

## Applying to a whole column

Converting a single string is straightforward - but what if we want to convert a whole column?

In [None]:
#try to convert the 'date' column
datetime.strptime(exes2122df['date'], '%d/%m/%Y')

TypeError: ignored

Nope - we get an error telling us that 'argument 1' (the first item in brackets) must be a string (`str`), not Series, which is what a dataframe column is.

Instead, then, we'll need to loop through them, which we can do with a `for` loop like this:

In [None]:
#create list to store our clean dates
datesclean = []

#loop through the dirty dates
for i in exes2021df['date']:
  print(i)
  #clean the date
  cleandate = datetime.strptime(i, '%d/%m/%Y')
  #add it to the list
  datesclean.append(cleandate)

datesclean

In [None]:
exes2021df['datesclean'] = datesclean

Here's another way of writing that code in fewer lines, using a list comprehension.

In [None]:
#apply the same code as before, but to each item in the column when looped through
datesclean = [datetime.strptime(i, '%d/%m/%Y') for i in exes2122df['date']]
#show the first 5
datesclean[:5]

[datetime.datetime(2021, 2, 25, 0, 0),
 datetime.datetime(2021, 3, 4, 0, 0),
 datetime.datetime(2021, 5, 11, 0, 0),
 datetime.datetime(2021, 3, 8, 0, 0),
 datetime.datetime(2021, 3, 24, 0, 0)]

And then add that list back into the dataframe as a new column.

In [None]:
exes2122df['dateclean'] = datesclean
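
pandas also has its own function for this job, `pd.to_datetime()`, which converts a whole column at once without a loop. The sketch below uses a small stand-in dataframe (`sampledf`, made up for this example from the first five dates shown earlier) rather than the full dataset.

```python
import pandas as pd

# a stand-in dataframe holding the first five date strings from the data
sampledf = pd.DataFrame({'date': ['25/02/2021', '04/03/2021', '11/05/2021',
                                  '08/03/2021', '24/03/2021']})

# convert the whole column in one go, using the same pattern as strptime()
sampledf['dateclean'] = pd.to_datetime(sampledf['date'], format='%d/%m/%Y')

# show the result and its data type
print(sampledf['dateclean'])
```

Supplying `format=` avoids any guessing about whether the day or the month comes first.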

## Extracting months or years

Now that we have the dates stored as datetime objects, it is easy to extract months or years, etc.

In [None]:
datetime.now().month

11

In [None]:
#show the first date
print(exes2122df['dateclean'][0])
#show the month of the first date in the 'dateclean' column
print(exes2122df['dateclean'][0].month)
#and year
print(exes2122df['dateclean'][0].year)
#and day
print(exes2122df['dateclean'][0].day)

2021-02-25 00:00:00
2
2021
25


Again we can create new columns with those.

In [None]:
#create a column and fill it with the years extracted from each date
exes2122df['dateyear'] = [i.year for i in exes2122df['dateclean']]
#create a column and fill with months
exes2122df['datemonth'] = [i.month for i in exes2122df['dateclean']]
#and repeat with days
exes2122df['dateday'] = [i.day for i in exes2122df['dateclean']]

Note that these are integers, not datetime objects.

In [None]:
#show the data types of the last 4 columns
exes2122df.dtypes[-4:]

dateclean      datetime64[ns]
dateyear                int64
datemonth               int64
dateday                 int64
dtype: object

The `.weekday()` method is similar, but because it is a method rather than an attribute it needs parentheses. It [returns the weekday as a number between 0 and 6](https://pythontic.com/datetime/date/weekday), where 0 is Monday.

In [None]:
#create a column of weekdays - note the brackets
exes2122df['dateweekday'] = [i.weekday() for i in exes2122df['dateclean']]

In [None]:
#show the first 5
exes2122df['dateweekday'][:5]

0    3
1    3
2    1
3    0
4    2
Name: dateweekday, dtype: int64
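
For columns that pandas recognises as dates (`datetime64`), the `.dt` accessor offers the same information without a loop. This is an alternative sketch, not the approach used above; `sampledf` is a made-up stand-in for the full dataframe.

```python
import pandas as pd

# a stand-in dataframe with the first five dates from the data
sampledf = pd.DataFrame({'dateclean': pd.to_datetime(
    ['2021-02-25', '2021-03-04', '2021-05-11', '2021-03-08', '2021-03-24'])})

# .dt gives access to the date parts of every row at once
sampledf['dateyear'] = sampledf['dateclean'].dt.year
sampledf['datemonth'] = sampledf['dateclean'].dt.month
sampledf['dateday'] = sampledf['dateclean'].dt.day
# weekday is an attribute here, so no brackets are needed (Monday is 0)
sampledf['dateweekday'] = sampledf['dateclean'].dt.weekday

print(sampledf)
```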

## Extracting days of the week

To extract those days as words, we need [the `calendar` library](https://docs.python.org/3/library/calendar.html).

In [None]:
#bring in the calendar library to convert numbers to strings
import calendar

...specifically `day_name`, which is a sort-of-list that corresponds to the numbers `datetime` uses to indicate weekdays.

In [None]:
#Monday is represented by a zero
calendar.day_name[0:]

['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

So if we use the output of the `weekday()` function as the *index* for `day_name` we get:

In [None]:
#store the first date
firstdate = exes2122df['dateclean'][0]
#print it
print(firstdate)
#extract the weekday, and then convert that to the day name
calendar.day_name[datetime.weekday(firstdate)]

2021-02-25 00:00:00


'Thursday'
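
The `calendar` library has similar sort-of-lists for abbreviated day names (`day_abbr`) and for months (`month_name` and `month_abbr`). A short sketch, using the same first date typed in by hand: note that the month lists have an empty string at position 0, so that January is 1, matching the numbers from `.month`.

```python
import calendar
from datetime import datetime

# the first date in the data, typed in directly for this example
firstdate = datetime(2021, 2, 25)

# abbreviated weekday name, using weekday() as the index
dayshort = calendar.day_abbr[firstdate.weekday()]
print(dayshort)

# full and abbreviated month names, using month as the index
monthlong = calendar.month_name[firstdate.month]
monthshort = calendar.month_abbr[firstdate.month]
print(monthlong, monthshort)
```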

We've already got a column of weekday integers, which we can use to create another column:

In [None]:
#create a column to store the results of using day_name on the dateweekday column
exes2122df['weekdayname'] = [calendar.day_name[i] for i in exes2122df['dateweekday']]
#show the first 5
exes2122df['weekdayname'][:5]

0     Thursday
1     Thursday
2      Tuesday
3       Monday
4    Wednesday
Name: weekdayname, dtype: object

Turns out Tuesday is the most popular day for submitting expenses.

In [None]:
exes2122df['weekdayname'].value_counts()

Tuesday      3716
Thursday     3158
Wednesday    3152
Monday       2941
Friday       2464
Sunday        870
Saturday      757
Name: weekdayname, dtype: int64
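
`value_counts()` orders the days by how often they appear. To see them in calendar order instead, one option is to `reindex()` the results using `calendar.day_name`. In this sketch the counts are typed in from the output above so that it runs on its own; `daycounts`, `inorder` and `weekendshare` are names made up for the example.

```python
import calendar
import pandas as pd

# the counts from the output above, entered by hand
daycounts = pd.Series({'Tuesday': 3716, 'Thursday': 3158, 'Wednesday': 3152,
                       'Monday': 2941, 'Friday': 2464, 'Sunday': 870,
                       'Saturday': 757})

# reorder the rows to follow Monday to Sunday
inorder = daycounts.reindex(list(calendar.day_name))
print(inorder)

# what share of claims are dated at the weekend?
weekendshare = (inorder['Saturday'] + inorder['Sunday']) / inorder.sum()
print(round(weekendshare * 100, 1))
```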

## Extracting week of the year

What about the most common time of the year for submitting expenses?

We can use the `.isocalendar()` function to extract that. 

In [None]:
#print the first date
print(exes2122df['dateclean'][0])
#now print the result of using the isocalendar function on it
print(exes2122df['dateclean'][0].isocalendar())

2021-02-25 00:00:00
(2021, 8, 4)


The output here contains 3 pieces of information: the year, the week of the year, and the day of the week. Unlike `weekday()`, `isocalendar()` starts counting days from 1, so 4 means Thursday here.

We can access the week by specifying an index after `isocalendar()` like so:

In [None]:
#print just the second - index 1 - item from isocalendar's output
print(exes2122df['dateclean'][0].isocalendar()[1])

8
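
One thing to watch with ISO week numbers is that they follow ISO years, which do not always match calendar years: the first days of January can belong to the last week of the previous year. A quick sketch with the standard library follows. The result is wrapped in `tuple()` so that it displays the same way on newer versions of Python, which return a named tuple.

```python
from datetime import datetime

# the first date in the data: week 8 of 2021, day 4 (Thursday)
firstiso = tuple(datetime(2021, 2, 25).isocalendar())
print(firstiso)

# 1 January 2021 was a Friday, but it falls in week 53 of ISO year 2020
newyear = tuple(datetime(2021, 1, 1).isocalendar())
print(newyear)

# compare the day numbering: weekday() counts from 0, isocalendar() from 1
print(datetime(2021, 1, 1).weekday(), newyear[2])
```

Because this dataset contains dates from more than one year, claims with the same week number in different years are counted together when the week number is used on its own.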


In [None]:
#extract the week numbers and create a new column with it
exes2122df['weeknum'] = [i.isocalendar()[1] for i in exes2122df['dateclean']]
#show the first 5
exes2122df['weeknum'][:5]

0     8
1     9
2    19
3    10
4    12
Name: weeknum, dtype: int64

In [None]:
#count how many times each value appears and show the first few results
exes2122df['weeknum'].value_counts().head()

15    1877
17    1783
16    1780
14    1693
13    1501
Name: weeknum, dtype: int64

Week 15 was the most popular week - not by much, but all of the top five are from the same time of the year: weeks 13 to 17 run from late March to the start of May, around the turn of the financial year.

What about the least popular weeks? Perhaps these are when Parliament is in recess?

In [None]:
#count how many times each value appears and show the last 10 results
exes2122df['weeknum'].value_counts().tail(10)

46    2
25    2
37    1
28    1
31    1
35    1
23    1
27    1
40    1
26    1
Name: weeknum, dtype: int64

In [None]:
#Get an overview of one column
exes2122df['category'].value_counts()

Office Costs        9217
MP Travel           4432
Accommodation       1723
Staff Travel        1038
Staffing             498
Dependant Travel     112
Winding Up            22
Miscellaneous         16
Name: category, dtype: int64

In [None]:
#filter to one category in that column, then count the claims in each week and sort from most to fewest
#https://stackoverflow.com/questions/41119623/pandas-pivot-table-sort-values-by-columns
exes2122df[exes2122df['category']=='MP Travel'].pivot_table(index='weeknum', 
                       values='memberId', 
                       aggfunc='count').sort_values(by='memberId', 
                                                    ascending=False)

weeknum,memberId
15,628
16,626
17,494
19,393
20,325
12,285
11,270
14,253
10,227
9,208
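
The same count can be reached in other ways. As a sketch of one alternative, filtering and then calling `value_counts()` on the `weeknum` column counts the rows for each week and sorts them from most to least common in one step. The dataframe here (`sampledf`) is a handful of made-up rows, purely to show the mechanics.

```python
import pandas as pd

# made-up rows for illustration only
sampledf = pd.DataFrame({
    'category': ['MP Travel', 'MP Travel', 'Office Costs',
                 'MP Travel', 'Staff Travel', 'MP Travel'],
    'weeknum': [15, 16, 15, 15, 16, 9],
    'memberId': [4671, 1522, 4671, 87, 4818, 529],
})

# keep only the MP Travel rows
mptravel = sampledf[sampledf['category'] == 'MP Travel']

# option 1: the pivot table approach used above
bypivot = mptravel.pivot_table(index='weeknum', values='memberId',
                               aggfunc='count').sort_values(by='memberId',
                                                            ascending=False)
# option 2: count the values in the weeknum column directly
bycounts = mptravel['weeknum'].value_counts()

print(bypivot)
print(bycounts)
```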


## Calculating time elapsed with `timedelta`

One of the reasons we want dates to be stored as dates rather than strings is to be able to perform calculations with them, like calculating the time elapsed between two dates. 

Let's create a column for that. First, let's test the idea with one date and today's date.

In [None]:
#print today's date
print(datetime.now())
#print the first date in the column
print(exes2122df['dateclean'][0])
#subtract one from the other
timesincethen = datetime.now() - exes2122df['dateclean'][0]
#print the result of the subtraction
print(timesincethen)

2021-10-20 18:47:04.137400
2021-02-25 00:00:00
237 days 18:47:04.142209


Here's what that object looks like without a print command:

In [None]:
timesincethen

Timedelta('237 days 18:47:04.142209')

This is a `Timedelta` object, the pandas equivalent of the `timedelta` object in Python's `datetime` module. It is used to represent a **period of time** - it is different to a `datetime` object, which is used to represent a specific **point in time**.
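
A short sketch of what can be done with a `timedelta`, using the standard library and two fixed dates (so the numbers do not change each time it is run): its size can be read off with `.days` and `.total_seconds()`, and it can be added to a date to produce a new date.

```python
from datetime import datetime, timedelta

# two fixed points in time
claimdate = datetime(2021, 2, 25)
checkdate = datetime(2021, 10, 20)

# subtracting one datetime from another gives a timedelta
gap = checkdate - claimdate
print(gap.days)
print(gap.total_seconds())

# a timedelta can also be created directly and added to a date
thirtydayslater = claimdate + timedelta(days=30)
print(thirtydayslater)
```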

Now let's create that column by repeating the calculation for all claim dates.

In [None]:
#calculate the time elapsed between each claim and now and create a column for that data
exes2122df['ageofclaim'] = [datetime.now() - i for i in exes2122df['dateclean']]
#check the first few results
exes2122df['ageofclaim'][:5]

0   237 days 18:50:17.601550
1   230 days 18:50:17.601659
2   162 days 18:50:17.601679
3   226 days 18:50:17.601694
4   210 days 18:50:17.601722
Name: ageofclaim, dtype: timedelta64[ns]
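
In the output above the microseconds differ slightly from row to row, because `datetime.now()` is called afresh for each claim. A possible refinement, sketched here on a made-up five-row dataframe, is to store a single reference time first and subtract the whole column from it. A fixed `pd.Timestamp` stands in for `datetime.now()` so that the example gives the same result each time.

```python
import pandas as pd

# a stand-in dataframe with the first five dates from the data
sampledf = pd.DataFrame({'dateclean': pd.to_datetime(
    ['2021-02-25', '2021-03-04', '2021-05-11', '2021-03-08', '2021-03-24'])})

# store one reference time (in a live analysis this could be datetime.now())
reference = pd.Timestamp(2021, 10, 20)

# subtract the whole column from it in one step
sampledf['ageofclaim'] = reference - sampledf['dateclean']
# .dt.days gives the age as a whole number of days
sampledf['agedays'] = sampledf['ageofclaim'].dt.days

print(sampledf)
```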

Once we've created those objects we can identify the oldest and most recent claims.

In [None]:
#what is the oldest claim?
exes2122df['ageofclaim'].max()

Timedelta('933 days 18:50:17.792636')

In [None]:
#what is the most recent claim?
exes2122df['ageofclaim'].min()

Timedelta('126 days 18:50:17.697271')

We can also do that with the dates themselves - but notice that `max` and `min` give the opposite results: the oldest date is the smallest and the most recent date is the largest. That is because pandas stores each date as the amount of time elapsed since a fixed point (midnight on 1 January 1970), so later dates are bigger numbers.

In [None]:
#what is the oldest claim?
exes2122df['dateclean'].min()

Timestamp('2019-04-01 00:00:00')

In [None]:
#what is the most recent claim?
exes2122df['dateclean'].max()

Timestamp('2021-06-16 00:00:00')
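
`min()` and `max()` return only the date itself. To see the whole row that goes with it, `idxmin()` and `idxmax()` return the index label of the earliest and latest dates, which can then be passed to `.loc`. A sketch with three rows based on the data (`sampledf` is a made-up stand-in for the full dataframe):

```python
import pandas as pd

# three rows based on the data, for illustration
sampledf = pd.DataFrame({
    'mpName': ['Jacob Young', 'Neil Coyle', 'Roger Gale'],
    'amountClaimed': [41.42, 2000.00, 53.15],
    'dateclean': pd.to_datetime(['2021-06-16', '2019-09-29', '2019-04-01']),
})

# find the index labels of the earliest and latest dates
earliest = sampledf['dateclean'].idxmin()
latest = sampledf['dateclean'].idxmax()

# use those labels to show the full rows
print(sampledf.loc[earliest])
print(sampledf.loc[latest])
```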

## Sorting by time

We can sort by these columns too, using the `pandas` function `sort_values()`. By default this sorts ascending (from smallest to largest), so with the age of claim this will bring the most recent (i.e. those with the least time elapsed) to the top.

In [None]:
#sort dataframe by age of claim
exes2122df = exes2122df.sort_values('ageofclaim')
#show it
exes2122df

Unnamed: 0,memberId,year,date,claimNumber,mpName,mpConstituency,category,expenseType,shortDescription,details,journeyType,journeyFrom,journeyTo,travel,nights,mileage,amountClaimed,amountPaid,amountNotPaid,amountRepaid,status,reasonIfNotPaid,supplyMonth,supplyPeriod,dateclean,ageofclaim
7466,4825,21_22,16/06/2021,60084088-4,Jacob Young,Redcar BC,Office Costs,Utilities,Water,,,,,,0,0.0,41.42,41.42,0.0,0.0,paid,,,0,2021-06-16,126 days 18:50:17.697271
14801,4818,21_22,08/06/2021,60085152-1,Saqib Bhatti,Meriden CC,Staff Travel,Rail,,,London-constituency MP & Staff,,,Standard Single,0,0.0,25.05,25.05,0.0,0.0,paid,,,0,2021-06-08,134 days 18:50:17.796828
488,529,21_22,05/06/2021,60084754-1,Alan Campbell,Tynemouth BC,Accommodation,Council tax,,,,,,,0,0.0,195.00,195.00,0.0,0.0,paid,,,0,2021-06-05,137 days 18:50:17.607964
224,529,21_22,01/06/2021,60084754-2,Alan Campbell,Tynemouth BC,Accommodation,Utilities,Electricity,,,,,,0,0.0,40.00,40.00,0.0,0.0,paid,,,0,2021-06-01,141 days 18:50:17.604546
7576,261,21_22,01/06/2021,60084241-2,James Gray,North Wiltshire CC,Office Costs,Rent,,,,,,,0,0.0,400.00,400.00,0.0,0.0,paid,,,0,2021-06-01,141 days 18:50:17.698611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12151,4368,21_22,29/09/2019,60078418-1,Neil Coyle,Bermondsey and Old Southwark BC,Office Costs,Rent,,,,,,,0,0.0,2000.00,2000.00,0.0,0.0,paid,,,0,2019-09-29,752 days 18:50:17.762929
14310,87,21_22,01/04/2019,60081854-4,Roger Gale,North Thanet CC,Office Costs,Utilities,Water,,,,,,0,0.0,53.15,53.15,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.790697
14349,87,21_22,01/04/2019,60081854-1,Roger Gale,North Thanet CC,Office Costs,Utilities,Water,,,,,,0,0.0,59.87,59.87,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.791188
14363,87,21_22,01/04/2019,60081854-2,Roger Gale,North Thanet CC,Office Costs,Utilities,Electricity,,,,,,0,0.0,150.38,150.38,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.791361


And again, if you're doing this with the date column, the default ascending order means it will start with the earliest dates.

In [None]:
#sort dataframe by date of claim, earliest first
exes2122df = exes2122df.sort_values('dateclean')
#show it
exes2122df

Unnamed: 0,memberId,year,date,claimNumber,mpName,mpConstituency,category,expenseType,shortDescription,details,journeyType,journeyFrom,journeyTo,travel,nights,mileage,amountClaimed,amountPaid,amountNotPaid,amountRepaid,status,reasonIfNotPaid,supplyMonth,supplyPeriod,dateclean,ageofclaim
14464,87,21_22,01/04/2019,60081854-3,Roger Gale,North Thanet CC,Office Costs,Utilities,Gas,,,,,,0,0.0,176.55,176.55,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.792636
14310,87,21_22,01/04/2019,60081854-4,Roger Gale,North Thanet CC,Office Costs,Utilities,Water,,,,,,0,0.0,53.15,53.15,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.790697
14363,87,21_22,01/04/2019,60081854-2,Roger Gale,North Thanet CC,Office Costs,Utilities,Electricity,,,,,,0,0.0,150.38,150.38,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.791361
14349,87,21_22,01/04/2019,60081854-1,Roger Gale,North Thanet CC,Office Costs,Utilities,Water,,,,,,0,0.0,59.87,59.87,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.791188
12151,4368,21_22,29/09/2019,60078418-1,Neil Coyle,Bermondsey and Old Southwark BC,Office Costs,Rent,,,,,,,0,0.0,2000.00,2000.00,0.0,0.0,paid,,,0,2019-09-29,752 days 18:50:17.762929
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7576,261,21_22,01/06/2021,60084241-2,James Gray,North Wiltshire CC,Office Costs,Rent,,,,,,,0,0.0,400.00,400.00,0.0,0.0,paid,,,0,2021-06-01,141 days 18:50:17.698611
224,529,21_22,01/06/2021,60084754-2,Alan Campbell,Tynemouth BC,Accommodation,Utilities,Electricity,,,,,,0,0.0,40.00,40.00,0.0,0.0,paid,,,0,2021-06-01,141 days 18:50:17.604546
488,529,21_22,05/06/2021,60084754-1,Alan Campbell,Tynemouth BC,Accommodation,Council tax,,,,,,,0,0.0,195.00,195.00,0.0,0.0,paid,,,0,2021-06-05,137 days 18:50:17.607964
14801,4818,21_22,08/06/2021,60085152-1,Saqib Bhatti,Meriden CC,Staff Travel,Rail,,,London-constituency MP & Staff,,,Standard Single,0,0.0,25.05,25.05,0.0,0.0,paid,,,0,2021-06-08,134 days 18:50:17.796828


You can specify you want to order it from largest to smallest number by adding the `ascending=` parameter and setting it to `False`.

In [None]:
#sort dataframe by date of claim, most recent first
exes2122df = exes2122df.sort_values('dateclean', ascending=False)
#show it
exes2122df

Unnamed: 0,memberId,year,date,claimNumber,mpName,mpConstituency,category,expenseType,shortDescription,details,journeyType,journeyFrom,journeyTo,travel,nights,mileage,amountClaimed,amountPaid,amountNotPaid,amountRepaid,status,reasonIfNotPaid,supplyMonth,supplyPeriod,dateclean,ageofclaim
7466,4825,21_22,16/06/2021,60084088-4,Jacob Young,Redcar BC,Office Costs,Utilities,Water,,,,,,0,0.0,41.42,41.42,0.0,0.0,paid,,,0,2021-06-16,126 days 18:50:17.697271
14801,4818,21_22,08/06/2021,60085152-1,Saqib Bhatti,Meriden CC,Staff Travel,Rail,,,London-constituency MP & Staff,,,Standard Single,0,0.0,25.05,25.05,0.0,0.0,paid,,,0,2021-06-08,134 days 18:50:17.796828
488,529,21_22,05/06/2021,60084754-1,Alan Campbell,Tynemouth BC,Accommodation,Council tax,,,,,,,0,0.0,195.00,195.00,0.0,0.0,paid,,,0,2021-06-05,137 days 18:50:17.607964
224,529,21_22,01/06/2021,60084754-2,Alan Campbell,Tynemouth BC,Accommodation,Utilities,Electricity,,,,,,0,0.0,40.00,40.00,0.0,0.0,paid,,,0,2021-06-01,141 days 18:50:17.604546
7576,261,21_22,01/06/2021,60084241-2,James Gray,North Wiltshire CC,Office Costs,Rent,,,,,,,0,0.0,400.00,400.00,0.0,0.0,paid,,,0,2021-06-01,141 days 18:50:17.698611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12151,4368,21_22,29/09/2019,60078418-1,Neil Coyle,Bermondsey and Old Southwark BC,Office Costs,Rent,,,,,,,0,0.0,2000.00,2000.00,0.0,0.0,paid,,,0,2019-09-29,752 days 18:50:17.762929
14363,87,21_22,01/04/2019,60081854-2,Roger Gale,North Thanet CC,Office Costs,Utilities,Electricity,,,,,,0,0.0,150.38,150.38,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.791361
14310,87,21_22,01/04/2019,60081854-4,Roger Gale,North Thanet CC,Office Costs,Utilities,Water,,,,,,0,0.0,53.15,53.15,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.790697
14349,87,21_22,01/04/2019,60081854-1,Roger Gale,North Thanet CC,Office Costs,Utilities,Water,,,,,,0,0.0,59.87,59.87,0.0,0.0,paid,,,0,2019-04-01,933 days 18:50:17.791188
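
`sort_values()` also accepts a list of columns, with a matching list for `ascending`, which helps when many rows share the same date. This sketch (again on a few rows based on the data, in a made-up `sampledf`) sorts by date from earliest to latest and, within each date, by amount claimed from largest to smallest.

```python
import pandas as pd

# four rows based on the data, for illustration
sampledf = pd.DataFrame({
    'mpName': ['Roger Gale', 'Roger Gale', 'James Gray', 'Alan Campbell'],
    'expenseType': ['Utilities', 'Utilities', 'Rent', 'Utilities'],
    'amountClaimed': [53.15, 176.55, 400.00, 40.00],
    'dateclean': pd.to_datetime(['2019-04-01', '2019-04-01',
                                 '2021-06-01', '2021-06-01']),
})

# sort by date (ascending), then by amount claimed (descending) within each date
sampledf = sampledf.sort_values(['dateclean', 'amountClaimed'],
                                ascending=[True, False])
print(sampledf)
```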
