# Exploring data with `pandas`

This notebook details how to use the Python library `pandas` to explore data once you've imported it. First, let's import `pandas` - and some data to work with.

Note that when reading data using `pandas`, the results are stored in a **dataframe** which can be used for further analysis.

In [None]:
import pandas as pd
#read in some JSON from the UK police API - this should show stops near a particular location during January 2021
policestops = pd.read_json("https://data.police.uk/api/stops-street?lat=52.629729&lng=-1.131592&date=2021-01")
#show the new variable
print(policestops)

    age_range  ...          object_of_search
0       25-34  ...              Stolen goods
1     over 34  ...              Stolen goods
2        None  ...              Stolen goods
3       25-34  ...              Stolen goods
4     over 34  ...              Stolen goods
..        ...  ...                       ...
101     10-17  ...  Article for use in theft
102     10-17  ...          Controlled drugs
103     25-34  ...          Controlled drugs
104   over 34  ...          Controlled drugs
105     10-17  ...          Controlled drugs

[106 rows x 16 columns]


Note that `pandas` adds a label to the front of each row, numbered from 0. This is the **index**. By default each label is unique, and the index is not counted among the dataframe's columns.

## Showing the head and tail

As well as just printing the object (which shows the first and last 5 rows) we can ask to show a certain number of rows at the top or bottom.

The functions here are `.head()` and `.tail()` - they need to be attached to the dataframe object, and if you don't want the default 5 rows to be shown, put the number of rows in the parentheses, e.g. `.head(9)`.

In [None]:
#print the first 10 rows of policestops
print(policestops.head(10))

  age_range                       outcome  ...  operation_name  object_of_search
0     18-24  A no further action disposal  ...             NaN  Controlled drugs
1     25-34  A no further action disposal  ...             NaN  Controlled drugs
2     25-34  A no further action disposal  ...             NaN  Controlled drugs
3     25-34  A no further action disposal  ...             NaN      Stolen goods
4   over 34  A no further action disposal  ...             NaN      Stolen goods
5      None  A no further action disposal  ...             NaN      Stolen goods
6     25-34  A no further action disposal  ...             NaN      Stolen goods
7   over 34  A no further action disposal  ...             NaN      Stolen goods
8     25-34  A no further action disposal  ...             NaN      Stolen goods
9   over 34  A no further action disposal  ...             NaN      Stolen goods

[10 rows x 16 columns]


In [None]:
#print the last 10 rows of policestops
print(policestops.tail(10))

    age_range  ...          object_of_search
96      25-34  ...         Offensive weapons
97    over 34  ...         Offensive weapons
98    over 34  ...  Article for use in theft
99      25-34  ...  Article for use in theft
100     18-24  ...  Article for use in theft
101     10-17  ...  Article for use in theft
102     10-17  ...          Controlled drugs
103     25-34  ...          Controlled drugs
104   over 34  ...          Controlled drugs
105     10-17  ...          Controlled drugs

[10 rows x 16 columns]


## Showing the columns

You can show the columns in a dataset by simply adding `.columns` like so:

In [None]:
policestops.columns

Index(['age_range', 'outcome', 'involved_person', 'self_defined_ethnicity',
       'gender', 'legislation', 'outcome_linked_to_object_of_search',
       'datetime', 'removal_of_more_than_outer_clothing', 'outcome_object',
       'location', 'operation', 'officer_defined_ethnicity', 'type',
       'operation_name', 'object_of_search'],
      dtype='object')

The `.dtypes` attribute can be used instead to show both the column names *and* the data type of each column (boolean, float, etc.).

In [None]:
policestops.dtypes

age_range                                           object
outcome                                             object
involved_person                                       bool
self_defined_ethnicity                              object
gender                                              object
legislation                                         object
outcome_linked_to_object_of_search                 float64
datetime                               datetime64[ns, UTC]
removal_of_more_than_outer_clothing                   bool
outcome_object                                      object
location                                            object
operation                                          float64
officer_defined_ethnicity                           object
type                                                object
operation_name                                     float64
object_of_search                                    object
dtype: object

Alternatively, the `.info()` function not only lists the columns and the type of data in each, but also provides a 'Non-Null Count' (how many values aren't `null`).

In [None]:
policestops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106 entries, 0 to 105
Data columns (total 16 columns):
 #   Column                               Non-Null Count  Dtype              
---  ------                               --------------  -----              
 0   age_range                            102 non-null    object             
 1   outcome                              106 non-null    object             
 2   involved_person                      106 non-null    bool               
 3   self_defined_ethnicity               106 non-null    object             
 4   gender                               106 non-null    object             
 5   legislation                          106 non-null    object             
 6   outcome_linked_to_object_of_search   13 non-null     float64            
 7   datetime                             106 non-null    datetime64[ns, UTC]
 8   removal_of_more_than_outer_clothing  106 non-null    bool               
 9   outcome_object                  

You can change the type of a column by using `astype()`:

In [None]:
policestops['datetime'] = policestops['datetime'].astype('str')
policestops['datetime'][:4]

0    2021-01-11 07:05:00+00:00
1    2021-01-30 16:12:00+00:00
2    2021-01-25 13:10:00+00:00
3    2021-01-15 08:55:00+00:00
Name: datetime, dtype: object
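
Type conversion matters in the other direction too. As a minimal sketch, assuming a made-up column of numbers that were read in as text (`counts` is hypothetical and not part of the police data), converting with `astype()` is what makes arithmetic possible:

```python
import pandas as pd

# hypothetical column: numbers that have been stored as text
counts = pd.Series(["3", "5", "8"])

# convert the text to whole numbers so they can be added up
as_numbers = counts.astype("int64")
total = as_numbers.sum()

print(as_numbers.dtype)
print(total)
```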

This might be useful if you want to do something with the column which requires it to be a particular data type. Below, for example, we create a function to extract the first four characters from a string, and then apply that to a column of dates. In order for that to work the dates need to be strings to begin with.

In [None]:
#define a function we call 'firstfour'. It has one parameter we name 'date'
def firstfour(date):
  #grab the first four characters and store in 'chars0to3'
  chars0to3 = date[:4]
  #return to whatever called the function
  return(chars0to3)

#apply that function to every item in the specified column (a pandas Series)
years = policestops['datetime'].apply(firstfour)
print(years)
#assign back to the dataframe
policestops['year'] = years

0      2021
1      2021
2      2021
3      2021
4      2021
       ... 
101    2021
102    2021
103    2021
104    2021
105    2021
Name: datetime, Length: 106, dtype: object
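
The same result can usually be reached without writing a function, because `pandas` has built-in string and datetime accessors. The sketch below uses a small made-up Series (`sample_dates` is a hypothetical stand-in for the `datetime` column) rather than the live API data:

```python
import pandas as pd

# hypothetical stand-in for the 'datetime' column, stored as strings
sample_dates = pd.Series(["2021-01-11 07:05:00+00:00",
                          "2021-01-30 16:12:00+00:00"])

# .str[:4] slices every string in the column, much like applying firstfour
years_as_text = sample_dates.str[:4]

# converting to datetimes first lets .dt.year return the year as a number
years_as_numbers = pd.to_datetime(sample_dates).dt.year

print(years_as_text.tolist())
print(years_as_numbers.tolist())
```

The `.str` route keeps the year as text, while the `.dt` route gives a number you can sort or calculate with.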


The `shape` attribute can be used to see the dimensions (how many rows and columns) of the dataframe. Because it is an attribute rather than a method, it is written without parentheses.

In [None]:
policestops.shape

(106, 16)

Similarly, `size` shows you how many cells it has.

In [None]:
policestops.size

1696
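
To see how `shape` and `size` relate, here is a small sketch on a made-up dataframe (`small` and its values are hypothetical):

```python
import pandas as pd

# hypothetical three-row, two-column dataframe
small = pd.DataFrame({"age_range": ["18-24", "25-34", None],
                      "gender": ["Male", "Female", "Male"]})

# shape is a (rows, columns) tuple, so it can be unpacked into two variables
rows, columns = small.shape
print(rows, columns)

# size is the number of rows multiplied by the number of columns
print(small.size)

# len() gives the number of rows only
print(len(small))
```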

The `count()` function is slightly different: it tells you how many non-NA values (in other words, non-blank) each column has. The results below, for example, tell you that 'outcome' and 'involved_person' both have 106 values, meaning there are no `NA` cells - but 'age_range' only has 102 values, meaning there are 4 cells which contain `NA`. Some columns, such as 'operation_name', are entirely empty.

In [None]:
policestops.count()

age_range                              102
outcome                                106
involved_person                        106
self_defined_ethnicity                 106
gender                                 106
legislation                            106
outcome_linked_to_object_of_search      13
datetime                               106
removal_of_more_than_outer_clothing    106
outcome_object                         106
location                               106
operation                                0
officer_defined_ethnicity              106
type                                   106
operation_name                           0
object_of_search                       106
dtype: int64
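
If you would rather count the gaps directly, one option is to combine `isna()` with `sum()`. This is a sketch on a made-up dataframe (`small` is hypothetical), not the police data:

```python
import pandas as pd

# hypothetical dataframe with two missing age ranges
small = pd.DataFrame({"age_range": ["18-24", None, "25-34", None],
                      "outcome": ["Arrest", "Arrest", "Caution", "Arrest"]})

# count() gives the number of non-missing values in each column
present = small.count()

# isna() marks each missing cell as True; sum() adds up the Trues per column
missing = small.isna().sum()

print(present)
print(missing)
```

For every column, the two numbers add up to the number of rows.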

The `index` attribute describes the index that `pandas` adds when data is imported into a dataframe. In this case, it's a **range** which starts at 0 and stops before 106 (so the last label is 105), with a 'step' of 1 (meaning a gap of 1 between each number).

In [None]:
policestops.index

RangeIndex(start=0, stop=106, step=1)
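
A short sketch on a made-up three-row dataframe (`small` is hypothetical) shows how the `stop` value relates to the labels actually used:

```python
import pandas as pd

# hypothetical dataframe with three rows
small = pd.DataFrame({"gender": ["Male", "Female", "Male"]})

# the range stops at 3, but the labels themselves are 0, 1 and 2
print(small.index.stop)
print(list(small.index))

# the last label is always one less than 'stop' when the step is 1
print(small.index[-1])
```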

## Accessing columns

Columns can be accessed by naming the dataframe variable and then, in square brackets, naming the column as a string like so:

In [None]:
policestops['age_range']

0        25-34
1      over 34
2         None
3        25-34
4      over 34
        ...   
101      10-17
102      10-17
103      25-34
104    over 34
105      10-17
Name: age_range, Length: 106, dtype: object

A different approach which produces the same result is to name the dataframe, followed by a period, then the name of the field you want to see. This only works if the field name contains no spaces and does not clash with an existing dataframe attribute or method (such as `count` or `shape`):

In [None]:
policestops.age_range

0        25-34
1      over 34
2         None
3        25-34
4      over 34
        ...   
101      10-17
102      10-17
103      25-34
104    over 34
105      10-17
Name: age_range, Length: 106, dtype: object
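
The limits of the dot approach are easier to see in a sketch. The dataframe below is made up for illustration (`tricky` and its column names are hypothetical): one column name contains a space and the other shares its name with a dataframe method.

```python
import pandas as pd

# hypothetical dataframe with two awkward column names
tricky = pd.DataFrame({"age range": ["18-24", "25-34"],
                       "count": [3, 5]})

# square brackets work for any column name
print(tricky["age range"].tolist())
print(tricky["count"].tolist())

# dot notation finds the dataframe's own count() method, not the column
print(callable(tricky.count))
```

Square brackets are the safer habit when column names come from someone else's data.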

You can also use `[]` to access rows rather than columns, by giving a range of row positions instead of a column name. The end position is not included:

In [None]:
policestops[5:8]

Unnamed: 0,age_range,outcome,involved_person,self_defined_ethnicity,gender,legislation,outcome_linked_to_object_of_search,datetime,removal_of_more_than_outer_clothing,outcome_object,location,operation,officer_defined_ethnicity,type,operation_name,object_of_search
5,,A no further action disposal,True,Other ethnic group - Not stated,Male,Police and Criminal Evidence Act 1984 (section 1),,2021-01-25 13:10:00+00:00,False,"{'id': 'bu-no-further-action', 'name': 'A no f...","{'latitude': '52.634407', 'street': {'id': 883...",,White,Person search,,Stolen goods
6,25-34,A no further action disposal,True,Other ethnic group - Any other ethnic group,Female,Police and Criminal Evidence Act 1984 (section 1),,2021-01-15 08:55:00+00:00,False,"{'id': 'bu-no-further-action', 'name': 'A no f...","{'latitude': '52.642971', 'street': {'id': 884...",,White,Person search,,Stolen goods
7,over 34,A no further action disposal,True,Other ethnic group - Not stated,Male,Police and Criminal Evidence Act 1984 (section 1),,2021-01-21 11:43:00+00:00,False,"{'id': 'bu-no-further-action', 'name': 'A no f...","{'latitude': '52.641197', 'street': {'id': 884...",,White,Person search,,Stolen goods
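
The two uses of square brackets can be compared side by side. This is a sketch on a made-up dataframe (`small` is hypothetical):

```python
import pandas as pd

# hypothetical four-row dataframe
small = pd.DataFrame({"age_range": ["10-17", "18-24", "25-34", "over 34"],
                      "gender": ["Male", "Female", "Male", "Female"]})

# a string inside the brackets selects one column (a Series)
one_column = small["age_range"]

# a slice inside the brackets selects rows: here positions 1 and 2
some_rows = small[1:3]

print(type(one_column).__name__)
print(some_rows)
```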


## Accessing rows and individual cells with `iloc[]`

Rows, and individual cells identified by row and column position, can be accessed using `iloc[]`. Positions are counted from 0.

In [None]:
#first row (position 0)
policestops.iloc[0]

age_range                                                                          25-34
outcome                                                     A no further action disposal
involved_person                                                                     True
self_defined_ethnicity                 White - English/Welsh/Scottish/Northern Irish/...
gender                                                                            Female
legislation                            Police and Criminal Evidence Act 1984 (section 1)
outcome_linked_to_object_of_search                                                   NaN
datetime                                                       2021-01-11 07:05:00+00:00
removal_of_more_than_outer_clothing                                                False
outcome_object                         {'id': 'bu-no-further-action', 'name': 'A no f...
location                               {'latitude': '52.631954', 'street': {'id': 131...
operation            

In [None]:
#first row, first column (position 0, 0)
policestops.iloc[0,0]

'25-34'

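An alternative to `iloc` is `loc`. The main difference is that `iloc` selects by position while `loc` selects by index label, and a label range in `loc` includes the last label. With the default index the labels and positions are the same numbers, so only the end of the range behaves differently. For example: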

In [None]:
#show rows with index 0 to 3 - but not including 4
print(policestops.iloc[0:4])
#show rows with index 0 to 4 inclusive
print(policestops.loc[0:4])

  age_range                       outcome  ...  operation_name object_of_search
0     25-34  A no further action disposal  ...             NaN     Stolen goods
1   over 34  A no further action disposal  ...             NaN     Stolen goods
2      None  A no further action disposal  ...             NaN     Stolen goods
3     25-34  A no further action disposal  ...             NaN     Stolen goods

[4 rows x 16 columns]
  age_range                       outcome  ...  operation_name object_of_search
0     25-34  A no further action disposal  ...             NaN     Stolen goods
1   over 34  A no further action disposal  ...             NaN     Stolen goods
2      None  A no further action disposal  ...             NaN     Stolen goods
3     25-34  A no further action disposal  ...             NaN     Stolen goods
4   over 34  A no further action disposal  ...             NaN     Stolen goods

[5 rows x 16 columns]
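
The difference between positions and labels is clearer when the index is not the default run of numbers. The sketch below assumes a made-up dataframe whose index labels are letters (`small` and its labels are hypothetical):

```python
import pandas as pd

# hypothetical dataframe whose index labels are letters, not 0, 1, 2...
small = pd.DataFrame({"age_range": ["10-17", "18-24", "25-34", "over 34"]},
                     index=["a", "b", "c", "d"])

# iloc counts positions: 0 and 1 are returned, position 2 is excluded
by_position = small.iloc[0:2]

# loc uses labels: 'a' to 'c' are returned, and 'c' is included
by_label = small.loc["a":"c"]

print(by_position)
print(by_label)

# a single cell can be reached either way
print(small.iloc[1, 0], small.loc["b", "age_range"])
```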


## Summarise the data with descriptive statistics

The `.describe()` function can be used to provide an overview of *numerical* columns in the dataframe (note below that it only provides results for 3 of the 16 columns in our dataframe).

That overview shows how many numbers there are (count), the mean average, standard deviation, minimum and maximum values, and quartiles (25, 50 and 75% dividing points). 

In [None]:
policestops.describe()

       outcome_linked_to_object_of_search  operation  operation_name
count                                13.0        0.0             0.0
mean                             0.076923        NaN             NaN
std                               0.27735        NaN             NaN
min                                   0.0        NaN             NaN
25%                                   0.0        NaN             NaN
50%                                   0.0        NaN             NaN
75%                                   0.0        NaN             NaN
max                                   1.0        NaN             NaN


The `.describe()` function has [a number of parameters](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) which allow you to, for example, specify which percentiles you want to see (the default is `[.25, .5, .75]`).

One of those parameters is `include=`. If you want to look at non-numeric data as well then you can set this to `"all"` like so: 

In [None]:
policestops.describe(include = "all")



Unnamed: 0,age_range,outcome,involved_person,self_defined_ethnicity,gender,legislation,outcome_linked_to_object_of_search,datetime,removal_of_more_than_outer_clothing,outcome_object,location,operation,officer_defined_ethnicity,type,operation_name,object_of_search
count,102,106,106,106,106,106,13.0,106,106,106,106,0.0,106,106,0.0,106
unique,4,3,1,13,2,2,,92,2,3,61,,4,2,,4
top,18-24,A no further action disposal,True,White - English/Welsh/Scottish/Northern Irish/...,Male,Misuse of Drugs Act 1971 (section 23),,2021-01-09 18:10:00+00:00,False,"{'id': 'bu-no-further-action', 'name': 'A no f...","{'latitude': '52.637354', 'street': {'id': 883...",,White,Person search,,Controlled drugs
freq,38,78,106,30,90,83,,4,100,78,7,,51,83,,84
first,,,,,,,,2021-01-01 01:30:00+00:00,,,,,,,,
last,,,,,,,,2021-01-31 23:49:00+00:00,,,,,,,,
mean,,,,,,,0.076923,,,,,,,,,
std,,,,,,,0.27735,,,,,,,,,
min,,,,,,,0.0,,,,,,,,,
25%,,,,,,,0.0,,,,,,,,,


Note that info on non-numeric data adds extra rows such as the first and last value, the top value and its frequency, and how many unique values there are (e.g. the `gender` column has two unique values: male and female).
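
The `percentiles` parameter of `.describe()` can be sketched in the same way. The dataframe here is made up (`scores` is hypothetical) and simply holds the numbers 1 to 10:

```python
import pandas as pd

# hypothetical column holding the numbers 1 to 10
scores = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# ask for the 10th and 90th percentiles instead of the default quartiles
summary = scores.describe(percentiles=[0.1, 0.9])

print(summary)
```

The rows labelled `25%` and `75%` are replaced by `10%` and `90%`.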

## Storing and exporting results

Note that these summaries can be saved in a separate variable and exported like any other data.

In [None]:
first50 = policestops.head(50)
last50 = policestops.tail(50)
agerangeonly = policestops['age_range']
rows5to10 = policestops[4:10]
datainfo = policestops.describe(include = "all")

datainfo.to_csv("datainfo.csv")
rows5to10.to_csv("rows5to10.csv")
agerangeonly.to_csv("agerangeonly.csv")
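
By default `to_csv()` writes the index as an unnamed first column. If you don't want it in the file, the `index=False` parameter leaves it out. A minimal sketch, assuming a made-up dataframe (`small` is hypothetical) and using strings in place of files so that nothing is written to disk:

```python
import io
import pandas as pd

# hypothetical two-row dataframe
small = pd.DataFrame({"age_range": ["18-24", "25-34"],
                      "gender": ["Male", "Female"]})

# with no filename, to_csv() returns the CSV text instead of saving a file
with_index = small.to_csv()
without_index = small.to_csv(index=False)

# compare the header rows: the first starts with an empty, unnamed column
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# reading the text back in gives the same columns we started with
restored = pd.read_csv(io.StringIO(without_index))
print(restored.columns.tolist())
```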

