# TMA01, question 2 (65 marks)

**Name**: \[Anna Duncan\]

**PI**: \[K8748389\]

This TMA question gives you the opportunity to demonstrate your mastery of the techniques involved in carrying out a small-scale data analysis. Specifically, this question requires you to clean two datasets, combine and reshape them, and graphically present the cleaned data. All the techniques required to answer this question can be found in Parts 2-6 and are illustrated in the associated notebooks.

**Note**: Although you can carry out all the necessary analyses using the techniques shown in the notebooks, there may be more effective ways of carrying out the various operations. If you find that a particular task seems hard to carry out with the techniques shown in the teaching materials, check the *pandas* documentation. There may well be a suitable technique available.

## The Task

[Tuberculosis](https://en.wikipedia.org/wiki/Tuberculosis) is an extremely widespread and contagious disease, occurring in all the world's continents. Although the rate of successful treatment is improving, there are several groups who are at particular risk from the disease.

One of these groups is people living with [HIV](https://en.wikipedia.org/wiki/HIV), who are more likely to develop active tuberculosis. In this part of the TMA, you will investigate health datasets to visualise possible correlations between different countries' rates of tuberculosis, the co-occurrence of tuberculosis with HIV, and some economic factors of the countries in question. The economic factors to be considered are countries' [GINI coefficient](https://en.wikipedia.org/wiki/Gini_coefficient), which is a measure of the wealth inequality in a country, and the amount of a country's wealth held by its poorest citizens.

You will be considering the questions:

- The rate of tuberculosis among people living with HIV is higher than among those without HIV infection.
    1. Is there a relationship between these relative rates, and a country's GINI index? 
    2. Is there a relationship between these relative rates, and the amount of a country's wealth held by the poorest 20% of citizens? 


To address these questions, you will investigate two datasets. The first details the number and rate of tuberculosis cases in each country over a number of years, both for tuberculosis overall and for cases which co-occur with HIV. The data was obtained from the World Health Organisation's [Global Health Observatory data repository](https://apps.who.int/gho/data/), on 28th June 2021, under the licence: [CC BY-NC-SA 3.0 IGO](https://creativecommons.org/licenses/by-nc-sa/3.0/igo/).

You can find the tuberculosis data in the `data` directory as:

    data/WHO_TB_data.csv
    

To obtain the various countries' GINI coefficient and wealth ownership data, you should use the data shared by the World Bank. The data was obtained from the [World Bank Open Data portal](https://data.worldbank.org/) on 27th April 2021, under the Creative Commons Attribution 4.0 International license ([CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)).

You can find the economic data in the subfolder:

    data/Data_Extract_From_Poverty_and_Equity

You must produce two graphical illustrations of the available data:

1. The first graph should show the relationship between different countries' rate of tuberculosis, the rate of tuberculosis in people who are HIV-positive, and the GINI coefficient.

1. The second graph should show the relationship between different countries' rate of tuberculosis, the rate of tuberculosis in people who are HIV-positive, and the income share held by that country's poorest 20%.

You should then discuss what you believe your representations show.

(65 marks)

### Some Guidance

There are many ways you could approach this task, but one way might be to produce a *pandas* dataframe, containing values so that the appropriate tuberculosis and economic data is contained in the dataframe. The final dataframe might then look something like this:

|Country|TB infections per 100,000|TB infections with HIV per 100,000|GINI coefficient|Income share of poorest 20%|
|---|---|---|---|---|
|Canada | 0.013 | 0.002 |23.6 | 8.7 |
|Japan | 0.001 | 0.000 | 34.1 | 6.2|
| $\vdots$ | $\vdots$ | $\vdots$ |$\vdots$ | $\vdots$ |

(although note that the entries 0.013, 0.002, 23.6, ... are just for illustration; they are not necessarily the correct values for the question).
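
As a minimal sketch of how two cleaned tables could be combined into a dataframe of this shape, the cell below joins them on a shared country column. The dataframe names `tb_rates` and `economic` and all of the values are illustrative assumptions; they are not taken from the supplied datasets, and an inner join is only one of several reasonable choices.

```python
import pandas as pd

# Toy values for illustration only; they are not taken from the real datasets.
tb_rates = pd.DataFrame({
    "Country": ["Canada", "Japan"],
    "TB infections per 100,000": [0.013, 0.001],
    "TB infections with HIV per 100,000": [0.002, 0.000],
})
economic = pd.DataFrame({
    "Country": ["Canada", "Japan"],
    "GINI coefficient": [23.6, 34.1],
    "Income share of poorest 20%": [8.7, 6.2],
})

# An inner join keeps only the countries that appear in both tables.
combined = pd.merge(tb_rates, economic, on="Country", how="inner")
print(combined)
```

An inner join silently drops countries that appear in only one table, so any such loss would need to be noted and justified in the accompanying narrative.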

You should then construct plots showing the relationships between the rates of infection with and without HIV, and the economic data indicator for the country.

You should also give an explanation of what you believe your plots show.

**Note**: Both datasets contain data covering several years. For this task, you are not expected to provide a time series. Instead you should focus on the data for 2018, which we will take as being a representative year. **However, you may need to consider, and possibly use, data from other years to provide as representative a view of TB worldwide as possible**.
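
One possible way of drawing on other years, sketched below under stated assumptions, is to take each country's most recent non-missing value up to and including 2018. The long-format layout, the column names (`Country`, `Year`, `GINI`) and the values are all hypothetical, and using the latest earlier year is a modelling choice that would need to be justified, not a prescribed method.

```python
import pandas as pd

# Toy long-format data; column names and values are hypothetical.
gini = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "Year": [2017, 2018, 2016, 2017, 2018],
    "GINI": [30.1, 30.4, 41.0, 40.5, None],
})

# Discard missing values and any years after 2018.
available = gini.dropna(subset=["GINI"])
available = available[available["Year"] <= 2018]

# Sort by year, then keep the last (most recent) row for each country.
latest = (
    available.sort_values("Year")
    .drop_duplicates("Country", keep="last")
    .reset_index(drop=True)
)
print(latest)
```

Keeping the `Year` column in the result makes it easy to report which countries are represented by a year other than 2018.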

## Presenting Your Work<a id='presentation_tips'></a>

This TMA question is designed to develop your skills in carrying out and presenting an independent data investigation. As such, your final notebook should have a clear narrative explaining *what* you are doing at each stage, and *why*. This means that you must:

* explain how you are handling the data, including explaining the code that you have written,

* clearly explain any assumptions or simplifications that you have made about the data, and

* interpret your final results in the context of these assumptions and simplifications.


Each operation should be presented in its own code cell (or cells, if it is clearer to break the code up a bit) and be preceded by at least one markdown cell explaining what the code is intended to do. You should also use the markdown to justify your decisions.

For example, if you were carrying out the data investigation from question 1 of this TMA, and you had noticed that one dataset contains entries for Glyndebourne, and the other contains entries for Lewes, you might write a markdown cell along the lines of:
> The first dataset contains entries for `Glyndebourne` which are not in the second dataset. The second dataset contains entries for `Lewes` which are not in the first dataset. As Glyndebourne represents an opera house near the town of Lewes, I will use Lewes as the nearest town to represent operas which took place at Glyndebourne. The occurrences of `Glyndebourne` in the first dataset should therefore be replaced with `Lewes`.

before writing a code cell to replace occurrences of `Glyndebourne` with `Lewes` in the appropriate dataset.

In this case, the markdown describes:

- what the problem is (inconsistent data),
- what I'm doing about it (replacing `Glyndebourne` with `Lewes` in one of the datasets), and
- why I'm doing it (it's reasonable to treat Glyndebourne and Lewes as being in the same location).
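
The code cell following such a markdown cell might look something like the sketch below. The dataframe name `operas`, the column name `town` and the rows are hypothetical, chosen only to illustrate the replacement.

```python
import pandas as pd

# Hypothetical column name and rows, for illustration only.
operas = pd.DataFrame({"town": ["Glyndebourne", "London", "Glyndebourne"]})

# Replace every exact occurrence of 'Glyndebourne' with 'Lewes'.
operas["town"] = operas["town"].replace("Glyndebourne", "Lewes")
print(operas)
```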

Note that you are *not* expected to have extensive domain specific knowledge for TM351. In this case, if you don't know much about British opera, you might have felt that a better solution would be to have rows for Glyndebourne with null values in the final dataset. That would be fine too, provided that you explained why you had made that decision. In TM351 you will be rewarded for making reasonable decisions, clearly justifying them, correctly implementing them and explaining what you have done.

Some general considerations on presentation:

* You must present your answer in this notebook.
    
* Do not put too much text or code into each notebook cell. Text or markdown cells should contain one or two paragraphs at most. Code cells should contain around ten lines of code (or fewer).

* Ensure that your code is clear enough for a reader (in this case your tutor) to understand. For example, you should use meaningful variable names, comments where appropriate and so on.

* You should have a specific cell whose return value is the dataframe described above, or an equivalent that you will use to generate your final plots.

* You should have two specific cells which give the plots of the data that you require.

### Using Tools Outside the Notebooks

Unlike question 1 of this TMA, you do not have to use Python or *pandas* to carry out all of your analyses for this question; you may use any analysis tool which has been covered so far in the module (but not tools which have not been covered). However, if you do not use Python, you must provide screenshots of the tools you did use in order to show your working, and to enable your reader to replicate your analysis. To display an image in your notebook, you should use the appropriate HTML code in a markdown cell:
```html
<img src="myScreenshot.jpg"> 
```
or
```html
<img src="myScreenshot.png"> 
```
(where `myScreenshot.jpg` or `myScreenshot.png` is the name of your screenshot image; the image will be shown when you run the cell, and the form of the image is determined from the extension). It is good practice to keep your screenshots in a separate folder; if you do so, then modify the syntax to be:
```html
<img src="images/myScreenshot.jpg">
```
or
```html
<img src="images/myScreenshot.png">
```
where `images` is the name of the folder containing your screenshot files.

Remember to include all your images in the TMA zip file that you submit to the eTMA system.

## Structuring Your Answer

This question requires that you complete a number of tasks:

1. You will need to identify the licences governing the data you use, and identify and quote the specific clauses which show that you are permitted to carry out your chosen analysis.

2. You may decide to use OpenRefine to carry out some preprocessing on the datasets before importing them.

3. You need to import the two datasets into Python.

4. You will need to restructure the data into a dataframe in the form described above. You will need to reshape the data, and possibly carry out further cleaning.

5. Finally, you should select a visualisation method for the data in the dataset, and present two visualisations of the data, with a description of how you think they should be interpreted. We are not prescribing a particular choice of visualisation: you should choose one that you think is appropriate and clear.
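
The reshaping in stage 4 can be sketched as follows, assuming a wide layout with one column per year and one row per country and indicator. That layout, the column names and the values are assumptions made for illustration; the supplied files may be organised differently.

```python
import pandas as pd

# Toy wide-format data; layout, names and values are hypothetical.
wide = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Indicator": ["GINI", "Poorest 20% share", "GINI", "Poorest 20% share"],
    "2017": [30.1, 7.9, 41.0, 5.0],
    "2018": [30.4, 8.0, 40.5, 5.2],
})

# Wide to long: one row per country, indicator and year.
long = wide.melt(id_vars=["Country", "Indicator"],
                 var_name="Year", value_name="Value")
long["Year"] = long["Year"].astype(int)

# Long to one row per country for a single year, one column per indicator.
tidy = (
    long[long["Year"] == 2018]
    .pivot(index="Country", columns="Indicator", values="Value")
    .reset_index()
)
print(tidy)
```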

For stages 2 and 4 you must examine the datasets and process them in a way that will enable you to create the desired visualisation. You should consider questions such as:

- Is there ambiguity in the dataset? (That is, are there aspects of the data which are unclear, and/or not documented?)
    
- Is any data missing from the datasets?
    
- Is there any dirtiness in the datasets, or inconsistency in how the data is represented between the two datasets?
    
In each case, you should describe the problem you have found with the data. You should then clearly explain how you have handled it, [as described above](#presentation_tips).
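
As one hedged illustration of how inconsistencies between the two datasets might be surfaced, the sketch below uses an outer join with the `indicator` option of `merge` to list country names that appear in only one table. The dataframe names and the mismatched spellings are invented for the example and are not claims about the actual data.

```python
import pandas as pd

# Toy country lists; the spellings are hypothetical examples of a mismatch.
tb = pd.DataFrame({"Country": ["Canada", "Viet Nam", "Japan"]})
econ = pd.DataFrame({"Country": ["Canada", "Vietnam", "Japan"]})

# An outer join with indicator=True labels each row by where it was found:
# 'both', 'left_only' or 'right_only'.
check = tb.merge(econ, on="Country", how="outer", indicator=True)

# Rows not found in both tables are candidates for cleaning.
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```

Each unmatched name would then need a documented decision, such as renaming it to match the other dataset or leaving it out of the analysis.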



We have provided a structure for your answer. The headings do not represent equal amounts of work, because different datasets and different tasks require the effort to be spent in different places. Also, you may need to use several cells to address a particular heading. For example, you would expect to present substantially more work on reshaping the data than on importing the datasets.

#### Approximate breakdown of marks

The following table gives an *approximate* breakdown of how marks might be awarded. Be aware that your tutor has considerable discretion about where to award marks, and the final allocation might vary depending upon how you choose to approach the various subtasks in the investigation. In particular, bear in mind that Level 3 is about critical reflection, so your overall approach and analysis of your results are at least as important as the code that you write.

| Category | Number of marks |
|-----------------|-------|
| 1. Identify and Explain the Relevant Licensing Terms and Conditions | 5 |
| 2. Preprocess the Data (if applicable) | (\*) |
| 3. Import the Datasets | 5 |
| 4. Clean and Reshape the Data | (\*)25 |
| 5. Put the data into an appropriate form for plotting | 5 |
| 6. Visualise the data | 10 |
| 7. Interpret the plots | 5 |
| Presentation (not explicit in the structure) | 10 |


(\*) Taken together categories 2 and 4 are worth 25 marks. If you choose to do much of the cleaning and reshaping of the data in OpenRefine many of those marks will be awarded in category 2 for preprocessing the data before it is imported. If you do all your work in *pandas* there will be no preprocessing and all the marks will be awarded in category 4. If you use a mix of OpenRefine and *pandas* the marks will be distributed appropriately.

-----------------------------------------------------------------------


## Your Answer

#### 1. Identify and Explain the Relevant Licensing Terms and Conditions

#### 2. Preprocess the Data (if applicable)

#### 3. Import the Datasets

#### 4. Clean and Reshape the Data

#### 5. Put the data into an appropriate form for plotting

#### 6. Visualise the data

#### 7. Interpret the plots