# TMA01, question 1 (40 marks)

**Name**: \[Enter your name here\]

**PI**: \[Enter your student ID here\]

In this question, you will investigate a dataset of common land in England. Common land is registered land to which the public generally have access, as well as potentially other rights (for example, some commons allow the public to graze certain types of animal on that land).

You are interested in the following two questions:
1. What proportion of land in each English county is common land?
2. How much common land is in each English county per person in the county?

<img src="Beverley_Westwood_Common_Land_-_geograph.org.uk_-_514559.jpg" alt="Beverley Westwood Common Land, Beverley, East Riding of Yorkshire, England. Beverley Minster can be seen in the distance." style="width: 400px;"/>

<p style="text-align: center;">Beverley Westwood Common Land / Andy Beecroft / <a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a></p>


The tasks in this notebook can be addressed using the techniques discussed in parts 2-6 of the module materials, and the associated notebooks.

The question has three parts, looking at different parts of the data analysis pipeline.

Record all your activity and observations in this notebook. Insert additional notebook cells as required. Remember to run each cell in sequence and to rerun cells if you make any changes in earlier cells. 

Before you submit your notebook make sure you run all cells in order and check that you get the results you expect. (It is not unknown to receive notebooks which don't work when the cells are run in order. The most reliable way of checking your results is usually to use the menu option *Kernel $\rightarrow$ Restart & Run All*.)

Note that in this question you are required to use Python and the `pandas` library; this is to give you experience with using `pandas` and `DataFrame`s to manipulate data. If you wish, you may use the `pandasql` library as part of "the `pandas` library", as described in the notebooks for Part 3.

In [None]:
# This cell imports some standard libraries you may need 
# for this question.

import pandas as pd

import matplotlib.pyplot as plt

In [None]:
# If you require additional libraries to answer any questions 
# then import them in this code cell.

#### Contents

[Data provenance and importing the data](#provenance)

[Cleaning, reshaping and combining the data sets](#combining)

[Visualising the data](#visualising)

## <a id='provenance'></a>Data provenance, importing and shaping the data

In this notebook, you will use two datasets. You can find these in the `data` directory. Although we have provided both for you here, even when someone passes you a dataset, you need to be able to confirm your usage rights for that data.

#### 1. Licensing for the common land dataset

The common land dataset is stored as a csv file called `common_register_file1v3_2015.csv`. This dataset was obtained from the [UK government data portal](http://data.gov.uk) on 10th June, 2020 from:

https://data.gov.uk/dataset/database-of-registered-common-land-in-england

The file contains Government information licensed under the [Open Government Licence v3.0](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).

Find the terms of this licence, and state:
1. the specific clauses of the licence which allow the OU to distribute the data to you, and
2. the obligations that the licence places upon the OU when distributing the data.

*(2 marks)*


**Write your answer in this markdown cell**

#### 2. Licensing for the county land and population dataset

The information about the size and populations of counties is stored as a csv file called `counties.csv`. This dataset was obtained from the Wikipedia page:

https://en.wikipedia.org/wiki/Ceremonial_counties_of_England

on 10th June, 2020.

The `counties.csv` file is governed by the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) licence.


Find the terms of this licence, and explain why this licence was chosen for the `counties.csv` file. Again, you should identify the specific clauses which explain why the Creative Commons Attribution-ShareAlike 3.0 licence is the appropriate one.

*(2 marks)*


**Write your answer in this markdown cell**

#### 3. Importing the datasets

Each row of the commons dataset represents a distinct piece of common land. The name of the county which controls the land is in the `County` column, and the area of each common is in the `Registered AREA in hectares` column.

Import the file `common_register_file1v3_2015.csv`, and create a dataframe named `commons_df` which contains only the columns in `common_register_file1v3_2015.csv` entitled `County` and `Registered AREA in hectares`.

Display a preview of the first five rows of the dataframe. *(2 marks)*
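
As a rough sketch of selecting columns at import time, the example below reads a small in-memory CSV rather than the assignment file. The column names `Name`, `County` and `Area`, and all the values, are invented for illustration. It uses the `usecols` parameter of `pd.read_csv`; selecting the columns after loading the whole file would work equally well.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file on disk.
csv_text = """Name,County,Area
Common A,Derbyshire,0.95
Common B,Wiltshire,0.80
Common C,Derbyshire,8.40
"""

# usecols keeps only the listed columns while the file is parsed.
toy_df = pd.read_csv(io.StringIO(csv_text), usecols=["County", "Area"])

# head() previews up to the first five rows (here there are only three).
print(toy_df.head())
```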

In [None]:
# Write your answer in this code cell

Next, import the file `counties.csv`, and create a dataframe named `counties_df` which contains only the columns in `counties.csv` titled `County for the purposes of the lieutenancies`, `Population (2018)` and `Area (km2)`.

Display a preview of the first five rows of the dataframe. *(4 marks)*

In [None]:
# Write your answer in this code cell

## <a id='combining'></a>Cleaning, reshaping and combining the data sets

You want to compare the information on common land with the information on county area and population.

The column `County` in the commons dataset gives the list of counties covered by that dataset, and the column `County for the purposes of the lieutenancies` gives the list of counties covered by the county information dataset.

You notice that there are several discrepancies between the values in the two columns which could lead to errors in such an analysis.

#### <a id='identifying_discrepancies'></a>4. Identifying discrepancies between the datasets

When combining datasets, it is important to investigate systematically the differences between the columns that are supposed to contain equivalent values and that you intend to use to join the datasets.

Write a statement to generate a set of the county names in your dataframe `commons_df` which do not appear in the county names in your dataframe `counties_df`. *(1 mark)*
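
One possible way to find values that appear in one column but not in another is Python's set difference. The sketch below uses two invented dataframes with fictional region names (`left_df`, `right_df`, `Region` and `Name` are all hypothetical); it illustrates the idea only.

```python
import pandas as pd

# Invented toy data: the names are fictional and purely illustrative.
left_df = pd.DataFrame({"Region": ["Alphashire", "Oldham Vale", "Alphashire", "Betaford"]})
right_df = pd.DataFrame({"Name": ["Alphashire", "Betaford", "Gammaton"]})

# Converting a column to a set removes duplicates; '-' is set difference,
# giving the values present on the left but absent on the right.
only_in_left = set(left_df["Region"]) - set(right_df["Name"])
print(only_in_left)
```

Swapping the operands gives the difference in the other direction.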

In [None]:
# Write your answer in this code cell

Similarly, write a statement to generate a set of the county names in your dataframe `counties_df` which do not appear in the county names in your dataframe `commons_df`. *(1 mark)*

In [None]:
# Write your answer in this code cell

#### 5. Correcting the discrepancies between the datasets: county names

To combine the dataframes `counties_df` and `commons_df` into a single dataframe, you need to accommodate the variations in the county names in these two dataframes. After investigating English geography, you decide to make the following changes:


1. "Avon" and "Bristol" seem to refer to the same area, so convert the occurrences of `Avon` to refer to `Bristol`.

2. The City of London is part of Greater London, so your `counties_df` dataframe should combine the data for `City of London` into that of `Greater London`.

3. Worcestershire and Herefordshire are often considered together, so your `counties_df` dataframe should combine the data for `Worcestershire` and `Herefordshire` into a single (appropriately named) entry.

4. Humberside was abolished as a county in 1996. The East Riding of Yorkshire is the nearest equivalent, so your `commons_df` dataframe should convert the occurrences of `Humberside` to refer to `East Riding of Yorkshire`.

5. The data for Cleveland should be folded into North Yorkshire, so your `commons_df` dataframe should convert the occurrences of `Cleveland` to refer to `North Yorkshire`.

6. Rutland should be considered part of Leicestershire, so your `counties_df` dataframe should combine the data for `Rutland` into that of `Leicestershire`.

Use this information, and the results of [part 4](#identifying_discrepancies), to ensure that the information about the county names in the two dataframes is consistent. *(14 marks)*
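
The two kinds of change described above (relabelling values, and combining several rows into one) can be sketched on toy data as follows. Everything here is invented: the dataframe `toy_df`, the `Region` labels and the figures are hypothetical. The sketch assumes that summing is the right way to combine the numeric columns, which holds for populations and areas but not for every kind of data.

```python
import pandas as pd

# Invented toy data; names and figures are illustrative only.
toy_df = pd.DataFrame({
    "Region": ["Oldshire", "Northtown", "Northtown City", "Southby"],
    "Population": [100, 500, 20, 300],
    "Area": [10.0, 40.0, 1.0, 25.0],
})

# Map each old label to the label it should carry after clean-up.
renames = {"Oldshire": "Newshire", "Northtown City": "Northtown"}
toy_df["Region"] = toy_df["Region"].replace(renames)

# Rows that now share a label are combined by summing their numeric columns.
combined_df = toy_df.groupby("Region", as_index=False).sum()
print(combined_df)
```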

In [None]:
# Write your answer in this code cell

#### 6. Correcting the discrepancies between the datasets: units of area

Looking at the two datasets, you realise that the area in the common land dataset is given in hectares, whereas the area in the counties information set is given in square kilometres. You decide to convert all the data into hectares.

Add a column to the `counties_df` dataframe called `Area (Ha)` which contains the area of each county in hectares.

Display a preview of the first five rows of the dataframe. *(1 mark)*

Note: 1 square kilometre is equal to 100 hectares.
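
As an illustration of the conversion factor above, the sketch below derives a hectares column from a square-kilometres column on invented data. The dataframe `toy_df`, its column names and its values are hypothetical.

```python
import pandas as pd

# Invented toy data with areas in square kilometres.
toy_df = pd.DataFrame({"Region": ["A", "B"], "Area (km2)": [12.5, 3.0]})

# 1 square kilometre = 100 hectares.
HECTARES_PER_KM2 = 100

# Arithmetic on a column is applied element-wise to every row.
toy_df["Area (ha)"] = toy_df["Area (km2)"] * HECTARES_PER_KM2
print(toy_df)
```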

In [None]:
# Write your answer in this code cell

#### 7. Reshaping the commons dataset

Using the `commons_df` dataframe, create a pandas dataframe `common_area_df` which contains:
- a column containing the county names taken from the `commons_df['County']` column, and
- a column containing the total area of common land in that county, taken from the `commons_df['Registered AREA in hectares']` column.


For example, if the `commons_df` dataframe contained the data:

| County | Registered AREA in hectares |
|----|----|
| Derbyshire | 0.95 |
| Derbyshire | 8.40 |
| Derbyshire | 27.30 |
| Wiltshire | 0.80 |
| Wiltshire | 69.32 |

then your `common_area_df` might look something like this:

| County | Registered AREA in hectares |
|----|----|
| Derbyshire | 36.65 |
| Wiltshire | 70.12 |

If you wish, you might prefer to hold the county names in the index of your dataframe, rather than in a column; either is fine.
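
The worked example in the tables above can be reproduced with a group-and-sum, as in the sketch below. The names `example_df` and `totals_df` are hypothetical, and this is one possible approach rather than the required one. Passing `as_index=False` keeps the county names in a column; omitting it would place them in the index instead.

```python
import pandas as pd

# The five example rows from the table above.
example_df = pd.DataFrame({
    "County": ["Derbyshire", "Derbyshire", "Derbyshire", "Wiltshire", "Wiltshire"],
    "Registered AREA in hectares": [0.95, 8.40, 27.30, 0.80, 69.32],
})

# Group the rows by county and add up the areas within each group.
totals_df = (
    example_df
    .groupby("County", as_index=False)["Registered AREA in hectares"]
    .sum()
)
print(totals_df)
```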

When you have constructed your dataframe, display the first 5 rows. *(3 marks)*

In [None]:
# Write your answer in this code cell

## <a id='visualising'></a>Visualising the data

You have been asked to answer the two questions:

1. What proportion of land in each English county is common land?
2. How much common land is in each English county per person living in the county?

Your task is to plot two bar charts, where each bar represents one county, and the size of each bar represents:
1. The proportion of common land of all land in the county, and
2. The amount of common land per person living in the county.


#### 8. Create a dataframe containing plotting data

To create your plots, you should create a new dataframe that combines the data from your dataframes `counties_df` and `common_area_df`. Your dataframe should have a row for each county, containing the name of the county, the ratio of common land to all land in the county, and the amount of common land per person in the county.
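
A minimal sketch of combining two dataframes on a shared key and deriving ratio columns is given below. All names (`regions_df`, `commons_toy_df`, `merged_df`, the column labels) and all figures are invented for illustration. It assumes both dataframes hold the key in a column of the same name and that the areas are already in the same units.

```python
import pandas as pd

# Invented toy data; both areas are assumed to be in hectares.
regions_df = pd.DataFrame({
    "Region": ["A", "B"],
    "Population": [1000, 4000],
    "Area (ha)": [500.0, 2000.0],
})
commons_toy_df = pd.DataFrame({
    "Region": ["A", "B"],
    "Common area (ha)": [50.0, 100.0],
})

# An inner merge keeps only the regions present in both dataframes.
merged_df = regions_df.merge(commons_toy_df, on="Region", how="inner")

# Derived columns: share of land that is common, and common land per person.
merged_df["Common land ratio"] = merged_df["Common area (ha)"] / merged_df["Area (ha)"]
merged_df["Common land per person (ha)"] = (
    merged_df["Common area (ha)"] / merged_df["Population"]
)
print(merged_df)
```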

When you have constructed your dataframe, display the first 5 rows. *(4 marks)*

In [None]:
# Write your answer in this code cell

#### 9. Plot the data

Having created an appropriate dataframe, you should now plot the data as bar charts.

As stated, your task is to plot two bar charts, where each bar represents one county, and the size of each bar represents:

1. The ratio of common land to all land in the county, and
2. The amount of common land per person living in the county.

for the two graphs respectively.

The plots should also be sorted, so that the bars are in order of increasing height.

Your graphs should have appropriate titles, axis labels, and legends.

Remember that for this question, your solution must be carried out using the `pandas` library: you must provide Python commands which generate the required plot from your plotting dataframe.
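
As a sketch of a sorted, labelled bar chart drawn through `pandas`, the example below uses an invented dataframe (`toy_df`, with hypothetical `Region` and `Score` columns). It relies on `DataFrame.plot.bar` returning a matplotlib `Axes` object, on which the title and axis labels are then set.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented toy data, deliberately out of order.
toy_df = pd.DataFrame({
    "Region": ["A", "B", "C"],
    "Score": [0.30, 0.10, 0.20],
})

# Sorting by the plotted column puts the bars in order of increasing height.
sorted_df = toy_df.sort_values("Score")

# plot.bar returns the matplotlib Axes, which is used to add the labelling.
ax = sorted_df.plot.bar(x="Region", y="Score", legend=True)
ax.set_title("Score by region (toy data)")
ax.set_xlabel("Region")
ax.set_ylabel("Score")
```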

Finally, comment on your charts: What do they tell you (or fail to tell you) about the amount of common land in England?

*(6 marks)*



In [None]:
# Write your answer in this code cell

**Comment on your charts in this markdown cell**