# Pivot tables and more questioning data with `pandas`

This notebook details how to use the Python library `pandas` to ask more advanced questions of data. First, let's import `pandas` and some data to work with.

Note that when reading data using `pandas`, the results are stored in a **dataframe** which can be used for further analysis.

In [1]:
import pandas as pd
#read in some JSON from the UK police API - this should show stops near a particular location during January 2021
policestops = pd.read_json("https://data.police.uk/api/stops-street?lat=52.629729&lng=-1.131592&date=2021-01")
#show a summary of the columns, non-null counts and data types
policestops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106 entries, 0 to 105
Data columns (total 16 columns):
 #   Column                               Non-Null Count  Dtype              
---  ------                               --------------  -----              
 0   age_range                            102 non-null    object             
 1   outcome                              106 non-null    object             
 2   involved_person                      106 non-null    bool               
 3   self_defined_ethnicity               106 non-null    object             
 4   gender                               106 non-null    object             
 5   legislation                          106 non-null    object             
 6   outcome_linked_to_object_of_search   13 non-null     float64            
 7   datetime                             106 non-null    datetime64[ns, UTC]
 8   removal_of_more_than_outer_clothing  106 non-null    bool               
 9   outcome_object                  

## Using the pivot_table function

The `pivot_table` function allows you to create an Excel-like pivot table. It can be attached to the data frame with a period like so: 

`policestops.pivot_table()` 

...or you can name the data frame as a parameter like so:

`pd.pivot_table(data = policestops)` 

Then, inside the parentheses, you specify the rows, columns, values and calculations you want to perform. 
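
As a rough sketch of those two calling styles, the code below builds a small made-up data frame (`toy` and its `stop_id` column are invented for illustration and are not part of the police data) and creates the same pivot table both ways.

```python
import pandas as pd

# hypothetical toy data, invented purely for illustration
toy = pd.DataFrame({
    "age_range": ["18-24", "18-24", "25-34", "25-34", "25-34"],
    "gender": ["Male", "Female", "Male", "Male", "Female"],
    "stop_id": [1, 2, 3, 4, 5],
})

# method style: attached to the data frame with a period
method_style = toy.pivot_table(index="age_range", columns="gender",
                               values="stop_id", aggfunc="count")

# function style: the data frame is passed as the data= parameter
function_style = pd.pivot_table(data=toy, index="age_range", columns="gender",
                                values="stop_id", aggfunc="count")

print(method_style)
# check whether the two tables are identical
print(method_style.equals(function_style))
```

Both calls should produce the same table, so which style to use is largely a matter of preference.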

In [6]:
policestops.pivot_table(index="age_range", columns="gender", aggfunc="count")

| | datetime | datetime | involved_person | involved_person | legislation | legislation | location | location | object_of_search | object_of_search | officer_defined_ethnicity | officer_defined_ethnicity | operation | operation | operation_name | operation_name | outcome | outcome | outcome_linked_to_object_of_search | outcome_linked_to_object_of_search | outcome_object | outcome_object | removal_of_more_than_outer_clothing | removal_of_more_than_outer_clothing | self_defined_ethnicity | self_defined_ethnicity | type | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **gender** | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male | Female | Male |
| **age_range** | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| 10-17 | NaN | 4.0 | NaN | 4.0 | NaN | 4.0 | NaN | 4.0 | NaN | 4.0 | NaN | 4.0 | NaN | 0.0 | NaN | 0.0 | NaN | 4.0 | NaN | 3.0 | NaN | 4.0 | NaN | 4.0 | NaN | 4.0 | NaN | 4.0 |
| 18-24 | 5.0 | 33.0 | 5.0 | 33.0 | 5.0 | 33.0 | 5.0 | 33.0 | 5.0 | 33.0 | 5.0 | 33.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 33.0 | 0.0 | 3.0 | 5.0 | 33.0 | 5.0 | 33.0 | 5.0 | 33.0 | 5.0 | 33.0 |
| 25-34 | 8.0 | 27.0 | 8.0 | 27.0 | 8.0 | 27.0 | 8.0 | 27.0 | 8.0 | 27.0 | 8.0 | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 27.0 | 3.0 | 0.0 | 8.0 | 27.0 | 8.0 | 27.0 | 8.0 | 27.0 | 8.0 | 27.0 |
| over 34 | 3.0 | 22.0 | 3.0 | 22.0 | 3.0 | 22.0 | 3.0 | 22.0 | 3.0 | 22.0 | 3.0 | 22.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 22.0 | 0.0 | 4.0 | 3.0 | 22.0 | 3.0 | 22.0 | 3.0 | 22.0 | 3.0 | 22.0 |


Note that it repeats this for each column - datetime, involved_person and so on. 

To stop this, you need to specify the `values=` like so:

In [9]:
policestops.pivot_table(index="age_range", 
                        values="involved_person",
                        columns="gender", 
                        aggfunc="count")

| gender | Female | Male |
|---|---|---|
| **age_range** | | |
| 10-17 | NaN | 4.0 |
| 18-24 | 5.0 | 33.0 |
| 25-34 | 8.0 | 27.0 |
| over 34 | 3.0 | 22.0 |


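Note that it doesn't much matter which column you pick for `values`, as long as it has no missing entries: `count` only counts non-null values, so a column with gaps (such as `outcome_linked_to_object_of_search`) gives lower counts. Specifying `values` simply stops the table repeating for each field.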

However, if you pick the same column for values that you picked for index or columns, you'll get an error.
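
One way to check what `count` is doing is a small sketch like the one below, which uses an invented data frame (`toy`, `stop_id` and `note` are hypothetical names). Because `count` only counts non-missing entries, a `values` column with gaps gives smaller numbers, which is also why the `outcome_linked_to_object_of_search` figures in the first pivot table above are lower than the rest.

```python
import pandas as pd

# hypothetical toy data: 'note' has one missing value, 'stop_id' has none
toy = pd.DataFrame({
    "age_range": ["18-24", "18-24", "25-34"],
    "gender": ["Male", "Male", "Male"],
    "stop_id": [1, 2, 3],
    "note": ["a", None, "c"],
})

# counting a column with no gaps gives the number of rows in each group
complete = toy.pivot_table(index="age_range", columns="gender",
                           values="stop_id", aggfunc="count")

# counting a column with a gap skips the missing entry
gappy = toy.pivot_table(index="age_range", columns="gender",
                        values="note", aggfunc="count")

print(complete)
print(gappy)
```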

## Replace `NaN` with zeroes

We can also [add extra functions](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) on the end to replace the `NaN` with a zero.

The code below takes the results of the `pivot_table()` function and applies `.fillna()` to it.

In [13]:
policestops.pivot_table(index="age_range", 
                        values="involved_person",
                        columns="gender", 
                        aggfunc="count").fillna(0)

| gender | Female | Male |
|---|---|---|
| **age_range** | | |
| 10-17 | 0.0 | 4.0 |
| 18-24 | 5.0 | 33.0 |
| 25-34 | 8.0 | 27.0 |
| over 34 | 3.0 | 22.0 |


An alternative is to use the `fill_value=` parameter in the `pivot_table()` function, which specifies what to use to replace missing values.

In [None]:
policestops.pivot_table(index="age_range", 
                        values="involved_person",
                        columns="gender", 
                        aggfunc="count",
                        fill_value=0)

| gender | Female | Male |
|---|---|---|
| **age_range** | | |
| 10-17 | 0 | 4 |
| 18-24 | 5 | 33 |
| 25-34 | 8 | 27 |
| over 34 | 3 | 22 |


## Format numbers as integers, not floats

And because we expect whole numbers here, we can add `.astype(int)` on the end to convert the floats.

In [None]:
policestops.pivot_table(index="age_range", 
                        values="involved_person",
                        columns="gender", 
                        aggfunc="count").fillna(0).astype(int)

| gender | Female | Male |
|---|---|---|
| **age_range** | | |
| 10-17 | 0 | 4 |
| 18-24 | 5 | 33 |
| 25-34 | 8 | 27 |
| over 34 | 3 | 22 |
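
The `.fillna(0)` step matters here: a column that still contains `NaN` cannot be converted to integers. The sketch below, using a made-up series of counts rather than the real pivot table, shows the conversion failing and then succeeding once the gaps are filled.

```python
import numpy as np
import pandas as pd

# hypothetical counts with one missing value, standing in for a pivot table column
counts = pd.Series([np.nan, 5.0, 8.0, 3.0])

try:
    # NaN has no integer equivalent, so this raises an error
    counts.astype(int)
    converted_directly = True
except ValueError as err:
    converted_directly = False
    print("astype(int) failed:", err)

# filling the gaps first makes the conversion safe
filled = counts.fillna(0).astype(int)
print(filled.tolist())
```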


## Adding row and column totals

The `margins=` parameter allows us to add totals if we want.

In [None]:
policestops.pivot_table(index="age_range", 
                        values="involved_person",
                        columns="gender",
                        margins=True,
                        aggfunc="count").fillna(0).astype(int)

| gender | Female | Male | All |
|---|---|---|---|
| **age_range** | | | |
| 10-17 | 0 | 4 | 4 |
| 18-24 | 5 | 33 | 38 |
| 25-34 | 8 | 27 | 35 |
| over 34 | 3 | 22 | 25 |
| All | 16 | 86 | 102 |


We can also specify the name of that new column/row with `margins_name=`:

In [None]:
policestops.pivot_table(index="age_range", 
                        values="involved_person",
                        columns="gender",
                        margins=True,
                        margins_name="Total",
                        aggfunc="count").fillna(0).astype(int)

| gender | Female | Male | Total |
|---|---|---|---|
| **age_range** | | | |
| 10-17 | 0 | 4 | 4 |
| 18-24 | 5 | 33 | 38 |
| 25-34 | 8 | 27 | 35 |
| over 34 | 3 | 22 | 25 |
| Total | 16 | 86 | 102 |


## Calculating averages

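A mean average can be calculated by using the `.mean()` function with a specified column. The examples below use a second data frame, `calhousing`, which has a numeric `population` column; it is not created in the cells above, so it needs to be loaded first: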

In [None]:
#calculate the mean average of the populations
calhousing['population'].mean()

1402.7986666666666

A median average can be calculated using `.median()`:

In [None]:
#calculate the median average of the populations
calhousing['population'].median()

1155.0
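
The mean here (about 1,403) is noticeably higher than the median (1,155), which is what tends to happen when a few very large values pull the mean upwards. A minimal sketch with invented figures (not taken from `calhousing`) shows the effect:

```python
import pandas as pd

# made-up population figures with one very large value
population = pd.Series([800, 900, 1000, 1100, 9000])

mean_value = population.mean()      # pulled upwards by the 9000
median_value = population.median()  # the middle value once sorted

print(mean_value)
print(median_value)
```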

## Pivot table-like grouping

You can also make pivot tables by adding the `.groupby()` function to your code and making sure your code follows this structure:

* You name the data frame followed by **double square brackets** containing the names of the columns as strings, separated by commas e.g. `calhousing[['longitude','population']]`. (One to group by, the other to calculate on - equivalent to the 'rows' and 'values' boxes in a pivot table) 
* This is followed by the `.groupby()` function with the column you want to group by inside the parentheses, as a string
* Then you add `.mean()` or another calculation onto the end of the code. If you want to make clear which column is being averaged, you can select it after grouping, i.e. `calhousing.groupby('longitude')['population'].mean()`. Avoid writing `.mean('population')`: the first argument of `.mean()` on grouped data is `numeric_only`, not a column name, so it does not select a column.

In [None]:
calhousing[['longitude','population']].groupby('longitude').mean()

| | population |
|---|---|
| **longitude** | |
| -124.18 | 788.000000 |
| -124.17 | 1259.000000 |
| -124.16 | 1002.000000 |
| -124.15 | 911.000000 |
| -124.14 | 922.666667 |
| ... | ... |
| -114.98 | 374.000000 |
| -114.62 | 5.000000 |
| -114.61 | 1115.000000 |
| -114.55 | 1431.000000 |
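
To see how this compares with `pivot_table()`, here is a minimal sketch on an invented stand-in for `calhousing` (the values are made up). It also shows selecting the column after grouping, which makes explicit which column is being averaged.

```python
import pandas as pd

# hypothetical stand-in for calhousing, with invented values
toy = pd.DataFrame({
    "longitude": [-124.18, -124.18, -124.17],
    "population": [700, 900, 1259],
})

# the structure described above: double square brackets, then groupby, then mean
grouped = toy[["longitude", "population"]].groupby("longitude").mean()

# selecting the column after grouping returns a Series of means
grouped_series = toy.groupby("longitude")["population"].mean()

# the pivot_table equivalent: index = rows, values = column to calculate on
pivoted = toy.pivot_table(index="longitude", values="population", aggfunc="mean")

print(grouped)
print(pivoted)
```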


To count rather than calculate mean averages or sums, etc. use `.value_counts()`.

Note that this doesn't need double square brackets as before.



In [14]:
policestops['age_range'].value_counts()

18-24      38
25-34      35
over 34    25
10-17       4
Name: age_range, dtype: int64

In [None]:
calhousing['longitude'].value_counts()

-118.26    26
-118.21    26
-118.28    25
-118.27    25
-118.29    25
           ..
-123.39     1
-116.66     1
-120.59     1
-119.61     1
-120.10     1
Name: longitude, Length: 607, dtype: int64

## More calculations

The [`pandas` documentation](https://pandas.pydata.org/docs/reference/frame.html) provides details on other functions that can be used for calculations, such as `.std()` for standard deviations, `.quantile()` to show quantiles, and `.nunique()` to count the number of distinct elements in a column.
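
As a quick sketch of those three functions, using a made-up series of numbers rather than real data:

```python
import pandas as pd

# invented values, purely for illustration
values = pd.Series([2, 4, 4, 6, 9])

spread = values.std()            # sample standard deviation
middle = values.quantile(0.5)    # the 0.5 quantile is the median
distinct = values.nunique()      # number of distinct values

print(spread)
print(middle)
print(distinct)
```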

## Creating new columns from calculations

You can create new columns by using *whole* columns for calculations - for example subtracting one year's total from another to create a 'change' column.

A classic example would be dividing events by population to get a 'per capita' amount. 

First, we create a data frame from scratch.

In [None]:
#create a dictionary containing lists (column values) against each key (heading)
mydict = {"Police force":["West Midlands","Avon & Somerset"],"crimes":[30,40],"population": [3000,8000]}
#convert to a data frame
mydf = pd.DataFrame(data = mydict, index=None)
#print
print(mydf)

      Police force  crimes  population
0    West Midlands      30        3000
1  Avon & Somerset      40        8000


Now we write a calculation that pulls two different columns from the data frame (by specifying the keys/column headings), and divides one by the other.

The result is another column of numbers (a `pandas` Series), which is stored as a new column under the key 'crimespercapita'.

In [None]:
mydf['crimespercapita'] = mydf['crimes']/mydf['population']
mydf

| | Police force | crimes | population | crimespercapita |
|---|---|---|---|---|
| 0 | West Midlands | 30 | 3000 | 0.01 |
| 1 | Avon & Somerset | 40 | 8000 | 0.005 |


Equally we can multiply values in one column to get new values. A classic example would be multiplying our per capita figures by 1000 to get a 'per thousand people' amount instead.

In [None]:
mydf['crimesperthou'] = mydf['crimespercapita']*1000
mydf

| | Police force | crimes | population | crimespercapita | crimesperthou |
|---|---|---|---|---|---|
| 0 | West Midlands | 30 | 3000 | 0.01 | 10.0 |
| 1 | Avon & Somerset | 40 | 8000 | 0.005 | 5.0 |


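New columns can be created from text values too. The loop below checks whether each police force name contains the string "West" and stores the results in a list, which then becomes a new column.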

In [None]:
#create an empty list to store our true/false values for 'west'
west = []
#loop through the column 'Police force'
for i in mydf['Police force']:
  #check if the string "West" is in each item, store the result (True/False) in 'tf'
  tf = "West" in i
  #add 'tf' to our previously empty list, which fills up as this loops
  west.append(tf)

#print the list once the loop has finished
print(west)

[True, False]


In [None]:
#create a new column in the data frame with the key 'West' and the values we stored in that list
mydf['West'] = west
#print the new data frame
print(mydf)

      Police force  crimes  population  crimespercapita  crimesperthou   West
0    West Midlands      30        3000            0.010           10.0   True
1  Avon & Somerset      40        8000            0.005            5.0  False
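
The loop above spells out each step. As an alternative sketch, `pandas` string methods can do the same check on the whole column at once; the block below rebuilds the small data frame so that it runs on its own.

```python
import pandas as pd

# rebuild the small example data frame so this block is self-contained
mydf = pd.DataFrame({"Police force": ["West Midlands", "Avon & Somerset"],
                     "crimes": [30, 40],
                     "population": [3000, 8000]})

# .str.contains() checks every value in the column in one go
# regex=False treats "West" as plain text rather than a pattern
mydf["West"] = mydf["Police force"].str.contains("West", regex=False)

print(mydf)
```

The loop is easier to follow step by step, while the string method is shorter; both should give the same True/False column.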


## Filtering data

Here's how to filter data stored in a dataframe. First, let's see how many rows there are in our dataset:

In [None]:
policestops.shape

(106, 16)

So 106 rows.

Next, here's how to filter that to just those rows where the 'gender' column contains the string 'Male':

In [None]:
maleonly = policestops[policestops['gender'] == "Male"]
print(maleonly)

    age_range  ...          object_of_search
1     over 34  ...              Stolen goods
2        None  ...              Stolen goods
4     over 34  ...              Stolen goods
6     over 34  ...              Stolen goods
7     over 34  ...          Controlled drugs
..        ...  ...                       ...
100     18-24  ...  Article for use in theft
101     10-17  ...  Article for use in theft
102     10-17  ...          Controlled drugs
104   over 34  ...          Controlled drugs
105     10-17  ...          Controlled drugs

[90 rows x 16 columns]


This new dataframe has only 90 rows - although the index column retains its original values, so the last row still has 105 in that column.
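
If you would rather the filtered dataframe were renumbered from 0, `.reset_index(drop=True)` is one option. A minimal sketch on a made-up dataframe (not the police data):

```python
import pandas as pd

# invented data, purely for illustration
toy = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female", "Male"]})

# filtering keeps the original row labels
maleonly = toy[toy["gender"] == "Male"]
original_labels = maleonly.index.tolist()
print(original_labels)

# drop=True discards the old labels instead of keeping them as a new column
renumbered = maleonly.reset_index(drop=True)
new_labels = renumbered.index.tolist()
print(new_labels)
```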

How does this work? Well, first of all you are generating a list of `True/False` values with this code:

In [None]:
policestops['gender'] == "Male"

0      False
1       True
2       True
3      False
4       True
       ...  
101     True
102     True
103    False
104     True
105     True
Name: gender, Length: 106, dtype: bool

When placed inside square brackets after the name of the dataframe, this True/False list acts as a filter: rows where it is `True` are selected, and rows where it is `False` are left out.
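
To make that concrete, here is a small sketch on an invented dataframe, storing the True/False values in a variable (called `mask` here, a name chosen just for this example) before using it as a filter.

```python
import pandas as pd

# invented data, purely for illustration
toy = pd.DataFrame({"gender": ["Female", "Male", "Male"],
                    "age_range": ["25-34", "over 34", "18-24"]})

# the comparison produces one True/False value per row
mask = toy["gender"] == "Male"
print(mask.tolist())

# True counts as 1, so summing the mask gives the number of matching rows
matches = int(mask.sum())
print(matches)

# using the mask inside square brackets keeps only the True rows
selected = toy[mask]
print(selected)
```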

Let's do the opposite, using `!=` (not equal to) to return `True` where the suspect is not male:

In [None]:
notmale = policestops[policestops['gender'] != "Male"]
print(notmale)

    age_range                       outcome  ...  operation_name   object_of_search
0       25-34  A no further action disposal  ...             NaN       Stolen goods
3       25-34  A no further action disposal  ...             NaN       Stolen goods
5       25-34  A no further action disposal  ...             NaN       Stolen goods
9     over 34  A no further action disposal  ...             NaN   Controlled drugs
25    over 34  A no further action disposal  ...             NaN   Controlled drugs
40      18-24  A no further action disposal  ...             NaN   Controlled drugs
50      18-24          Community resolution  ...             NaN   Controlled drugs
52      25-34                        Arrest  ...             NaN   Controlled drugs
61      18-24  A no further action disposal  ...             NaN   Controlled drugs
62      18-24  A no further action disposal  ...             NaN   Controlled drugs
82      25-34  A no further action disposal  ...             NaN   Controlle

In [None]:
print(notmale['gender'])

0      Female
3      Female
5      Female
9      Female
25     Female
40     Female
50     Female
52     Female
61     Female
62     Female
82     Female
85     Female
87     Female
95     Female
96     Female
103    Female
Name: gender, dtype: object
