# Intro to `pandas`

This notebook introduces basic techniques in using `pandas` for data analysis.

First, we need to import the `pandas` library. It is pre-installed in Colab notebooks, so there is nothing to install: it only needs bringing in with the `import` command.

It's also common to give the library the shorter alias `pd` when it's imported, like so:

In [None]:
import pandas as pd

The inverted pyramid of data journalism outlines 5 stages:

1. Compile
2. Clean
3. Combine
4. Context
5. Communicate

And running throughout it: **question**.

Let's start with compiling in `pandas`. 

## Compiling data: importing a CSV

The easiest way to compile data in a Colab notebook is to upload the data to the Files area on the left hand side of Colab. Once in the Files view, it can be brought into the notebook with the `read_csv()` function.

Colab already has a 'sample_data' folder in Files with 4 CSV files and a JSON file. We can import one of those to demonstrate:

In [None]:
#import the CSV from the Files in Colab
caldata = pd.read_csv("sample_data/california_housing_test.csv")
#print the results
print(caldata)

      longitude  latitude  ...  median_income  median_house_value
0       -122.05     37.37  ...         6.6085            344700.0
1       -118.30     34.26  ...         3.5990            176500.0
2       -117.81     33.78  ...         5.7934            270500.0
3       -118.36     33.82  ...         6.1359            330000.0
4       -119.67     36.33  ...         2.9375             81700.0
...         ...       ...  ...            ...                 ...
2995    -119.86     34.42  ...         1.1790            225000.0
2996    -118.14     34.06  ...         3.3906            237200.0
2997    -119.70     36.30  ...         2.2895             62000.0
2998    -117.12     34.10  ...         3.2708            162500.0
2999    -119.63     34.42  ...         8.5608            500001.0

[3000 rows x 9 columns]
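
Printing a whole dataframe truncates it, so it can help to check its size and column names directly. The sketch below uses the first three rows shown above (and only four of the nine columns), held in a string rather than a file so that it runs anywhere. The names `csv_text` and `sample` are made up for this illustration.

```python
import io
import pandas as pd

# A few rows copied from the output above, stored as text instead of a file
csv_text = """longitude,latitude,median_income,median_house_value
-122.05,37.37,6.6085,344700.0
-118.30,34.26,3.5990,176500.0
-117.81,33.78,5.7934,270500.0
"""

# io.StringIO makes the string behave like a file for read_csv()
sample = pd.read_csv(io.StringIO(csv_text))

# (rows, columns)
print(sample.shape)
# the column names as a list
print(sample.columns.tolist())
# just the first 2 rows
print(sample.head(2))
```

The same three checks work on `caldata` itself.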


### Importing Excel files

If your data is an Excel spreadsheet in .xlsx format you will need [pandas's `read_excel` function](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html).

I've downloaded an Excel spreadsheet on [*Operation of police powers under the Terrorism Act 2000, financial year ending March 2021*](https://www.gov.uk/government/statistics/operation-of-police-powers-under-the-terrorism-act-2000-financial-year-ending-march-2021) and then uploaded it to the Files area in Colab.

In [None]:
terrdata = pd.read_excel("operation-police-powers-terrorism-mar2021-annual-tables.xlsx")
print(terrdata)

    Unnamed: 0                                         Unnamed: 1
0          NaN                                                NaN
1          NaN                                                NaN
2          NaN                                                NaN
3          NaN                                                NaN
4          NaN                                                NaN
5          NaN  Statistics on the operation of police powers u...
6          NaN             Year to March 2021: Annual Data Tables
7          NaN                                                NaN
8          NaN                                                NaN
9          NaN                                                NaN
10         NaN                                                NaN
11         NaN                                                NaN
12         NaN              Responsible Statistician: Daniel Shaw
13         NaN       Enquiries: CTAI_Statistics@homeoffice.gov.uk
14        

Note that the spreadsheet has a bunch of `NaN` cells and unnamed columns. It's also imported the first sheet by default. 

You can control these by adding extra parameters to the `read_excel()` function like so:

In [None]:
terrdata = pd.read_excel("operation-police-powers-terrorism-mar2021-annual-tables.xlsx", sheet_name = 3, skiprows = 5)
print(terrdata)

                                  Period of detention Charged  ... Other.20 Total.20
0                                                 NaN     NaN  ...      NaN      NaN
1                                         Under 1 day       4  ...     63.0    757.0
2                               1 to less than 2 days       3  ...     26.0    367.0
3                               2 to less than 3 days       1  ...      1.0     57.0
4                               3 to less than 4 days       9  ...     16.0    131.0
5                               4 to less than 5 days       9  ...      9.0    115.0
6                               5 to less than 6 days       1  ...      6.0    139.0
7                               6 to less than 7 days       7  ...      8.0    260.0
8                               7 to less than 8 days       0  ...      5.0     24.0
9                               8 to less than 9 days       0  ...      1.0     24.0
10                             9 to less than 10 days       0  ..

Note that: 

* The first ingredient (argument) for `pd.read_excel()` is a string with the name of the spreadsheet, including the `.xlsx` extension.
* The second argument is `sheet_name =`, which is set to `3`, meaning the fourth sheet (counting begins at 0 in Python).
* The `skiprows =` argument is set to `5`, meaning that it will skip 5 rows and use row 6 for column headings.

Other arguments are [listed in the documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html). Key ones to note are:

* `header = ` - which row to use for column headings
* `usecols = ` - which columns to keep. This can be column letter ranges as a string, e.g. `"A:E"` or `"A,C,E:F"`, or a list of integer positions, e.g. `[0, 2, 5]`
* `nrows = ` - the number of rows to import. For example you might only want to import the first 100 rows to begin with in a large dataset, or all the rows before any footnotes
* `skipfooter = ` is a similar argument which allows you to skip the last few rows by specifying how many rows at the end you want to leave out
* `parse_dates = ` - specify which columns you want to import as dates, e.g. `[2,3]`. Check the documentation for more information on how to combine columns (e.g. day, month, year) as a date

Here's an example of using more of those with our spreadsheet:

In [None]:
#store the file name first so the arguments below are easier to read
theurlwewant = "operation-police-powers-terrorism-mar2021-annual-tables.xlsx"
#read that file and specify a sheet, a header row and how many footer rows to skip
terrdata = pd.read_excel(theurlwewant, sheet_name = 3, header = 5, skipfooter=9)
print(terrdata)

        Period of detention Charged Released  ... Released.20 Other.20  Total.20
0                       NaN     NaN      NaN  ...         NaN      NaN       NaN
1               Under 1 day       4       22  ...       556.0     63.0     757.0
2     1 to less than 2 days       3       13  ...       257.0     26.0     367.0
3     2 to less than 3 days       1        0  ...        30.0      1.0      57.0
4     3 to less than 4 days       9        9  ...        56.0     16.0     131.0
5     4 to less than 5 days       9        3  ...        55.0      9.0     115.0
6     5 to less than 6 days       1        0  ...        52.0      6.0     139.0
7     6 to less than 7 days       7        4  ...        78.0      8.0     260.0
8     7 to less than 8 days       0        0  ...         8.0      5.0      24.0
9     8 to less than 9 days       0        0  ...         7.0      1.0      24.0
10   9 to less than 10 days       0        0  ...         9.0      2.0      35.0
11  10 to less than 11 days 

In [None]:
terrdata = pd.read_excel("operation-police-powers-terrorism-mar2021-annual-tables.xlsx", sheet_name = 3, header = 5)
print(terrdata)

                                  Period of detention Charged  ... Other.20 Total.20
0                                                 NaN     NaN  ...      NaN      NaN
1                                         Under 1 day       4  ...     63.0    757.0
2                               1 to less than 2 days       3  ...     26.0    367.0
3                               2 to less than 3 days       1  ...      1.0     57.0
4                               3 to less than 4 days       9  ...     16.0    131.0
5                               4 to less than 5 days       9  ...      9.0    115.0
6                               5 to less than 6 days       1  ...      6.0    139.0
7                               6 to less than 7 days       7  ...      8.0    260.0
8                               7 to less than 8 days       0  ...      5.0     24.0
9                               8 to less than 9 days       0  ...      1.0     24.0
10                             9 to less than 10 days       0  ..
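
Several of these arguments exist under the same names in `read_csv()`. The sketch below uses a small made-up table held in memory (not the Home Office spreadsheet) to show `skiprows`, `usecols`, `skipfooter` and `parse_dates` working together. Two differences to be aware of: in `read_csv()`, `usecols` takes column names or positions rather than Excel letter ranges, and `skipfooter` needs `engine="python"`.

```python
import io
import pandas as pd

# Invented data: two title lines at the top and a footnote at the bottom
raw = """Example table (hypothetical data)
Source: made up for illustration
date,force,arrests,notes
2021-01-01,A,10,x
2021-02-01,B,12,y
2021-03-01,C,7,z
Footnote: figures are invented
"""

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,                           # skip the two title lines
    usecols=["date", "force", "arrests"], # leave out the 'notes' column
    skipfooter=1,                         # drop the footnote line
    parse_dates=["date"],                 # import 'date' as a date, not text
    engine="python",                      # required for skipfooter in read_csv
)

print(df)
print(df.dtypes)
```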

### Importing JSON

Data in the JSON format can be imported using `read_json`. Below we import another piece of data in Colab's 'sample_data' folder:

In [None]:
anscombe = pd.read_json("sample_data/anscombe.json")
print(anscombe)

   Series   X      Y
0       I  10   8.04
1       I   8   6.95
2       I  13   7.58
3       I   9   8.81
4       I  11   8.33
5       I  14   9.96
6       I   6   7.24
7       I   4   4.26
8       I  12  10.84
9       I   7   4.81
10      I   5   5.68
11     II  10   9.14
12     II   8   8.14
13     II  13   8.74
14     II   9   8.77
15     II  11   9.26
16     II  14   8.10
17     II   6   6.13
18     II   4   3.10
19     II  12   9.13
20     II   7   7.26
21     II   5   4.74
22    III  10   7.46
23    III   8   6.77
24    III  13  12.74
25    III   9   7.11
26    III  11   7.81
27    III  14   8.84
28    III   6   6.08
29    III   4   5.39
30    III  12   8.15
31    III   7   6.42
32    III   5   5.73
33     IV   8   6.58
34     IV   8   5.76
35     IV   8   7.71
36     IV   8   8.84
37     IV   8   8.47
38     IV   8   7.04
39     IV   8   5.25
40     IV  19  12.50
41     IV   8   5.56
42     IV   8   7.91
43     IV   8   6.89
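
To see what `read_json()` is working with, here is a minimal sketch using three of the rows above written out as JSON text. It assumes the file stores one object per row (a list of records), which is one of several layouts `read_json()` can handle. The names `json_text` and `sample` are made up for this illustration.

```python
import io
import pandas as pd

# Three rows from the output above, written as a JSON list of records
json_text = """[
  {"Series": "I", "X": 10, "Y": 8.04},
  {"Series": "I", "X": 8, "Y": 6.95},
  {"Series": "II", "X": 10, "Y": 9.14}
]"""

# Wrap the text in io.StringIO so read_json() treats it like a file
sample = pd.read_json(io.StringIO(json_text))
print(sample)
```

Each key becomes a column and each object becomes a row.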


### Importing from an online source

The same functions can also be used to import data from an online source - you just need to use the URL of the file.

Below we import a CSV [from a GitHub repo](https://github.com/BBC-Data-Unit/stalking_protection_orders). GitHub displays CSVs nicely as tables - but note that in order to get the link to the actual CSV *data* you need to click on the CSV link in GitHub and *then* click on **Raw**. The URL should start `raw.githubusercontent.com`.

In [None]:
stalkingdata = pd.read_csv("https://raw.githubusercontent.com/BBC-Data-Unit/stalking_protection_orders/main/forsharing_stalking_protection_orders%20-%20Main_dataset.csv")
print(stalkingdata)

             Police force  ... charge_rate_apr20_dec20
0      Avon and Somerset   ...                      4%
1           Bedfordshire   ...                      4%
2         Cambridgeshire   ...                      9%
3               Cheshire   ...                      6%
4              Cleveland   ...                      8%
5                Cumbria   ...                      9%
6             Derbyshire   ...                      6%
7       Devon & Cornwall   ...                      8%
8                 Dorset   ...                      7%
9                 Durham   ...                      6%
10            Dyfed Powys  ...                      7%
11                 Essex   ...                      8%
12       Gloucestershire   ...                     11%
13    Greater Manchester   ...                 #DIV/0!
14                  Gwent  ...                     11%
15             Hampshire   ...                      6%
16         Hertfordshire   ...                      8%
17        

The same process can be used to import Excel spreadsheets:

In [None]:
terrdata = pd.read_excel("https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/991988/operation-police-powers-terrorism-mar2021-annual-tables.xlsx", sheet_name=3, header=5)
print(terrdata)

                                  Period of detention Charged  ... Other.20 Total.20
0                                                 NaN     NaN  ...      NaN      NaN
1                                         Under 1 day       4  ...     63.0    757.0
2                               1 to less than 2 days       3  ...     26.0    367.0
3                               2 to less than 3 days       1  ...      1.0     57.0
4                               3 to less than 4 days       9  ...     16.0    131.0
5                               4 to less than 5 days       9  ...      9.0    115.0
6                               5 to less than 6 days       1  ...      6.0    139.0
7                               6 to less than 7 days       7  ...      8.0    260.0
8                               7 to less than 8 days       0  ...      5.0     24.0
9                               8 to less than 9 days       0  ...      1.0     24.0
10                             9 to less than 10 days       0  ..
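
The stalking CSV imported earlier contains a spreadsheet error value, `#DIV/0!`, in the Greater Manchester row. One way to deal with values like that at import time is the `na_values` argument of `read_csv()`, which lists extra strings to treat as missing. The sketch below uses two rows from that output held in a string. The shortened column name `charge_rate` is made up for the illustration.

```python
import io
import pandas as pd

# Two rows based on the stalking data output, stored as text
raw = """Police force,charge_rate
Cheshire,6%
Greater Manchester,#DIV/0!
"""

# Treat the spreadsheet error as a missing value (NaN) on import
sample = pd.read_csv(io.StringIO(raw), na_values=["#DIV/0!"])
print(sample)

# count the missing values in that column
print(sample["charge_rate"].isna().sum())
```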

## Importing all sheets within an Excel file

You can also read an entire Excel file first in order to see what sheets it contains and select more than one sheet.

In [None]:
#read in an Excel file
xlfile = pd.ExcelFile("https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/991988/operation-police-powers-terrorism-mar2021-annual-tables.xlsx")
#what are the sheet names?
print(xlfile.sheet_names)
#how many sheets
print(len(xlfile.sheet_names))

['Front Page', 'A - Index', 'A - A.01', 'A - A.02', 'A - A.03', 'A - A.04', 'A - A.05a', 'A - A.05b', 'A - A.05c', 'A - A.06a', 'A - A.06b', 'A - A.06c', 'A - A.07', 'A - A.08a', 'A - A.08b', 'A - A.08c', 'A - A.09', 'A - A.10', 'A - A.11', 'A - A.12a', 'A - A.12b', 'A - A.12c', 'A C.01', 'A C.02', 'A C.03', 'A C.04', 'A C.05', 'A P.01', 'A P.02', 'A P.03', 'A P.04', 'A P.05', 'A P.06', 'A S.01', 'A S.02', 'A S.03', 'A S.04']
37


If sheets contain the same data (e.g. a different sheet for each region, but the same columns) then this approach can be used to merge them, by looping through each sheet name you want to use.

## Importing from GitHub

We can import some data from GitHub using the 'Raw' link on [the data file page](https://github.com/paulbradshaw/cleaning/blob/master/dirtydata/Disposals%20by%20region%202012-13%20Table.xls) - but we get an error.

In [None]:
githublink = "https://github.com/paulbradshaw/cleaning/blob/master/dirtydata/Disposals%20by%20region%202012-13%20Table.xls?raw=true"
disposals = pd.ExcelFile(githublink)

XLRDError: ignored

Some googling finds a [solution](https://stackoverflow.com/questions/66648775/how-to-get-link-of-xlsx-file-in-github-to-be-opened-as-a-pandas-dataframe) involving a couple of other libraries: `requests` to download the file, and `BytesIO` to hand the downloaded data to `pandas`.

In [None]:
#https://stackoverflow.com/questions/66648775/how-to-get-link-of-xlsx-file-in-github-to-be-opened-as-a-pandas-dataframe
import requests as rq
from io import BytesIO

url = "https://github.com/paulbradshaw/cleaning/blob/master/dirtydata/Disposals%20by%20region%202012-13%20Table.xls?raw=true"
data = rq.get(url).content
disposals = pd.ExcelFile(BytesIO(data))

#what are the sheet names?
print(disposals.sheet_names)


['National', 'East Midlands', 'Eastern', 'London', 'North East', 'North West', 'South East', 'South West', 'Wales', 'West Midlands', 'Yorkshire']
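
The role of `BytesIO` in that workaround is to wrap raw bytes so that they behave like an open file. A minimal sketch of that idea is below, with hard-coded CSV bytes standing in for what `rq.get(url).content` would return. Nothing is downloaded, and the two rows are taken from the National sheet shown later in this notebook.

```python
from io import BytesIO
import pandas as pd

# Stand-in for rq.get(url).content: raw bytes, not a file on disk
data = b"Disposal,TOTAL\nReprimand,13055\nFinal Warning,10949\n"

# BytesIO turns the bytes into a file-like object that pandas can read
buffer = BytesIO(data)
sample = pd.read_csv(buffer)
print(sample)
```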


## Looping through sheets to import them and combine

This particular spreadsheet has a different sheet for each area. Here's how we might combine them all into one dataframe:

First, we use `read_excel()` with that variable containing the Excel spreadsheet, and specify the first sheet (index 0).

In [None]:
#import the first sheet
dis1 = pd.read_excel(disposals, sheet_name=0, skiprows=1)
dis1.head()

Unnamed: 0,These figures do not match the data published in Chapter 5 as they are taken from a different data source.,10 - 14,15,16,17+,Unnamed: 5,Female,Male,Not Known,Unnamed: 9,White,Mixed,Asian or Asian British,Black or Black British,Chinese or Other Ethnic Group,Not Known.1,TOTAL
0,National,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,
2,Pre-court,,,,,,,,,,,,,,,,
3,Reprimand,4726.0,2795.0,2814.0,2720.0,,3524.0,9530.0,1.0,,11302.0,232.0,458.0,498.0,54.0,511.0,13055.0
4,Final Warning,3467.0,2404.0,2491.0,2587.0,,2350.0,8596.0,3.0,,9562.0,243.0,360.0,450.0,39.0,295.0,10949.0


Then we loop through the list of sheet names and use those to do the same for every other sheet - appending it to the dataframe containing the data from sheet 1.

In [None]:
#add a column for the sheet it came from
dis1['sheet'] = "National"
#create a dataframe that's a copy of sheet index 0
disposalsall = dis1.copy()

#loop through the sheet names from index 1 onwards
for i in disposals.sheet_names[1:]:
  print(i)
  #grab the sheet with that name
  currentsheet = pd.read_excel(disposals, sheet_name=i, skiprows=1)
  #add a column for the sheet it came from
  currentsheet['sheet'] = i
  #add to the ongoing dataframe (pd.concat replaces .append(), removed in pandas 2.0)
  disposalsall = pd.concat([disposalsall, currentsheet])

East Midlands
Eastern
London
North East
North West
South East
South West
Wales
West Midlands
Yorkshire


In [None]:
#show how many rows and cols the one-sheet dataframe has
print(dis1.shape)
#and the combined dataframe
print(disposalsall.shape)

(352, 18)
(3872, 18)


An alternative approach would be to measure the *length* of the sheet list and use that to generate indices for `sheet_name=` instead of the actual sheet name.
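
A sketch of that index-based loop is below. So that it runs without the spreadsheet, small dataframes stand in for the sheets: the National figures come from the output above, while the two regional sets of figures are invented. In the notebook, the marked line would call `pd.read_excel()` instead.

```python
import pandas as pd

# The first three sheet names from the list above
sheet_names = ["National", "East Midlands", "Eastern"]

# Hypothetical stand-ins for the sheets (regional figures are invented)
sheet_frames = [
    pd.DataFrame({"Disposal": ["Reprimand", "Final Warning"], "TOTAL": [13055, 10949]}),
    pd.DataFrame({"Disposal": ["Reprimand", "Final Warning"], "TOTAL": [100, 80]}),
    pd.DataFrame({"Disposal": ["Reprimand", "Final Warning"], "TOTAL": [120, 90]}),
]

collected = []
# generate the indices 0, 1, 2 from the length of the sheet list
for i in range(len(sheet_names)):
    # in the notebook: pd.read_excel(disposals, sheet_name=i, skiprows=1)
    currentsheet = sheet_frames[i].copy()
    # the index also looks up the matching sheet name
    currentsheet["sheet"] = sheet_names[i]
    collected.append(currentsheet)

# combine everything in one step at the end
combined = pd.concat(collected, ignore_index=True)
print(combined.shape)
```

Collecting the sheets in a list and combining them once at the end avoids rebuilding the combined dataframe on every pass through the loop.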

In [None]:
disposalsall.dtypes

These figures do not match the data published in Chapter 5 as they are taken from a different data source.     object
10 - 14                                                                                                       float64
15                                                                                                            float64
16                                                                                                            float64
17+                                                                                                           float64
Unnamed: 5                                                                                                    float64
Female                                                                                                        float64
Male                                                                                                          float64
Not Known                                               

More functions for importing data are detailed in pandas's [documentation on import/export](https://pandas.pydata.org/docs/reference/io.html).

## Exporting data

The same group of import/export functions can also be used to export data once you've finished doing analysis. These include:

* `.to_csv()`
* `.to_excel()`
* `.to_json()`
* `.to_html()`
* `.to_xml()`

To use these, you need to put the name of a data frame *before* the period, and the name you want to give to the exported file as a **string** inside the parentheses. Like this:

In [None]:
anscombe.to_csv("anscombe.csv")
anscombe.to_excel("anscombe.xlsx")
anscombe.to_json("anscombe.json")
anscombe.to_html("anscombe.html")

Once you run any of those commands you should see the resulting exported file in the Files view on the left in Colab. You can then download that file by hovering over it, clicking the three dots to the right, and selecting *Download*.
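
By default `.to_csv()` also writes the dataframe's index (the row numbers) as an extra first column. If you don't want that, add `index=False`. The sketch below shows the difference on two rows of the Anscombe data. It leaves out the file name, in which case `.to_csv()` returns the CSV as a string instead of saving it.

```python
import pandas as pd

# Two rows from the Anscombe data imported earlier
sample = pd.DataFrame({"Series": ["I", "I"], "X": [10, 8], "Y": [8.04, 6.95]})

# With no file name, to_csv() returns the CSV text instead of writing a file
with_index = sample.to_csv()
without_index = sample.to_csv(index=False)

# Compare the header rows: the first has an empty extra column for the index
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

To save the file without the index you would write, for example, `anscombe.to_csv("anscombe.csv", index=False)`.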