# Project 1: Deaths by tuberculosis

by Gonzalo Gomez Millan, 22 November 2017, based on Michel Wermelinger's version of 14 July 2015

This is the personal project notebook for Week 1 of The Open University's [_Learn to code for Data Analysis_](http://futurelearn.com/courses/learn-to-code) course.

In 2000, the United Nations set eight Millennium Development Goals (MDGs) to reduce poverty and diseases, improve gender equality and environmental sustainability, etc. Each goal is quantified and time-bound, to be achieved by the end of 2015. Goal 6 is to have halted and started reversing the spread of HIV, malaria and tuberculosis (TB).
TB doesn't make headlines like Ebola, SARS (severe acute respiratory syndrome) and other epidemics, but is far deadlier. For more information, see the World Health Organisation (WHO) page <http://www.who.int/gho/tb/en/>.

Given the population and number of deaths due to TB in some countries during one year, the following questions will be answered: 

- What is the total, maximum, minimum and average number of deaths in that year?
- Which countries have the most and the least deaths?
- What is the death rate (deaths per 100,000 inhabitants) for each country?
- Which countries have the lowest and highest death rate?
- How different or similar are the BRICS and PA countries?

The death rate allows for a better comparison of countries with widely different population sizes.
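As a minimal sketch of that calculation, the helper below assumes that the population is given in thousands, as it is in the tables of this notebook. The function name is illustrative and not part of the course material. Dividing the deaths by the actual population and multiplying by 100,000 then reduces to multiplying by 100 and dividing by the population in thousands.

```python
def death_rate_per_100k(deaths, population_thousands):
    """Return deaths per 100,000 inhabitants.

    population_thousands is the population expressed in thousands, so
    deaths / (population_thousands * 1000) * 100000
    reduces to deaths * 100 / population_thousands.
    """
    return deaths * 100 / population_thousands


# Brazil in 2013: 4,400 deaths in a population of 200,362 thousand.
brazil_rate = death_rate_per_100k(4400, 200362)
print(round(brazil_rate, 2))  # about 2.2 deaths per 100,000 inhabitants
```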

## The data

The data consists of the total population and the total number of deaths due to TB (excluding HIV) in 2013 in each of the BRICS countries (Brazil, Russia, India, China and South Africa) and the PA countries (Pacific Alliance: Mexico, Colombia, Peru and Chile).

The data was taken in July 2015 from <http://apps.who.int/gho/data/node.main.POP107?lang=en> (population) and <http://apps.who.int/gho/data/node.main.593?lang=en> (deaths). The uncertainty bounds of the number of deaths were ignored.

The data was collected into two Excel files (one for BRICS and one for PA), which should be in the same folder as this notebook.

## The BRICS countries

Let's see the data of the BRICS countries.

In [52]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

from pandas import *
data = read_excel('WHO POP TB brics.xls')
data

Unnamed: 0,Country,Population (1000s),TB deaths
0,Brazil,200362,4400
1,Russian Federation,142834,17000
2,India,1252140,240000
3,China,1393337,41000
4,South Africa,52776,25000


## The range of the problem

The column of interest is the last one.

In [53]:
tbColumn = data['TB deaths']

The total number of deaths in 2013 for BRICS countries is:

In [54]:
tbColumn.sum()

327400

The largest and smallest number of deaths in a single country are:

In [55]:
tbColumn.max()

240000

In [56]:
tbColumn.min()

4400

From 4400 to almost a quarter of a million deaths is a huge range. The average number of deaths, over all countries in the data, can give a better idea of the seriousness of the problem in each country.
The average can be computed as the mean or the median. Given the wide range of deaths, the median is probably a more sensible average measure.

In [57]:
tbColumn.mean()

65480.0

In [58]:
tbColumn.median()

25000.0

The median is far lower than the mean. This indicates that some of the BRICS countries had a very high number of TB deaths in 2013, pushing the value of the mean up.
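To illustrate how a single extreme value drives this gap, the sketch below recomputes both averages with Python's standard `statistics` module, first for all five BRICS values from the table above and then, purely as a thought experiment, without India. The variable names are illustrative.

```python
from statistics import mean, median

# TB deaths in 2013, copied from the BRICS table above.
deaths = {
    'Brazil': 4400,
    'Russian Federation': 17000,
    'India': 240000,
    'China': 41000,
    'South Africa': 25000,
}

all_values = list(deaths.values())
# Thought experiment: drop the largest value to see its effect on the mean.
without_india = [n for country, n in deaths.items() if country != 'India']

print(mean(all_values), median(all_values))        # mean 65,480 and median 25,000
print(mean(without_india), median(without_india))  # mean 21,850 and median 21,000
```

Without India the two averages nearly coincide, which supports reading the gap as the effect of one very large value.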

## The most affected

To see the most affected countries, the table is sorted in ascending order by the last column, which puts those countries in the last rows.

In [59]:
data.sort_values('TB deaths')

Unnamed: 0,Country,Population (1000s),TB deaths
0,Brazil,200362,4400
1,Russian Federation,142834,17000
4,South Africa,52776,25000
3,China,1393337,41000
2,India,1252140,240000


The table raises the possibility that a large number of deaths may be partly due to a large population. To compare the countries on an equal footing, the death rate per 100,000 inhabitants is computed.

In [60]:
populationColumn = data['Population (1000s)']
data['TB deaths (per 100,000)'] = tbColumn * 100 / populationColumn
data

Unnamed: 0,Country,Population (1000s),TB deaths,"TB deaths (per 100,000)"
0,Brazil,200362,4400,2.196025
1,Russian Federation,142834,17000,11.901928
2,India,1252140,240000,19.167186
3,China,1393337,41000,2.942576
4,South Africa,52776,25000,47.370017


To see the least and most affected countries, the table is sorted in ascending order by the death rate. 

In [61]:
data.sort_values('TB deaths (per 100,000)')

Unnamed: 0,Country,Population (1000s),TB deaths,"TB deaths (per 100,000)"
0,Brazil,200362,4400,2.196025
3,China,1393337,41000,2.942576
1,Russian Federation,142834,17000,11.901928
2,India,1252140,240000,19.167186
4,South Africa,52776,25000,47.370017
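As an alternative view, the sketch below rebuilds the BRICS table inline, so that it does not depend on the Excel file, and uses `sort_values` with `ascending=False`, which puts the most affected country in the first row. The variable names are illustrative.

```python
import pandas as pd

# Values copied from the BRICS table above.
brics = pd.DataFrame({
    'Country': ['Brazil', 'Russian Federation', 'India', 'China', 'South Africa'],
    'Population (1000s)': [200362, 142834, 1252140, 1393337, 52776],
    'TB deaths': [4400, 17000, 240000, 41000, 25000],
})
brics['TB deaths (per 100,000)'] = brics['TB deaths'] * 100 / brics['Population (1000s)']

# Descending order: highest death rate first, lowest last.
by_rate = brics.sort_values('TB deaths (per 100,000)', ascending=False)
most_affected = by_rate.iloc[0]['Country']
least_affected = by_rate.iloc[-1]['Country']
print(most_affected, least_affected)
```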


## The PA countries

Let's see the data of the PA countries.

In [62]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

from pandas import *
data = read_excel('WHO POP TB pa.xls')
data

Unnamed: 0,Country,Population (1000s),TB deaths
0,Mexico,122332,2200
1,Colombia,48321,770
2,Peru,30376,2300
3,Chile,17620,220


## The range of the problem

The column of interest is the last one.

In [63]:
tbColumn = data['TB deaths']

The total number of deaths in 2013 for PA countries is:

In [64]:
tbColumn.sum()

5490

The largest and smallest number of deaths in a single country are:

In [65]:
tbColumn.max()

2300

In [66]:
tbColumn.min()

220

From 220 to 2300 deaths is a large range. The average number of deaths, over all countries in the data, can give a better idea of the seriousness of the problem in each country. The average can be computed as the mean or the median. Given the wide range of deaths, the median is probably a more sensible average measure.

In [67]:
tbColumn.mean()

1372.5

In [68]:
tbColumn.median()

1485.0

The median is a little higher than the mean. This indicates that some of the PA countries had a very low number of TB deaths in 2013, pulling the value of the mean down.

## The most affected

To see the most affected countries, the table is sorted in ascending order by the last column, which puts those countries in the last rows.

In [69]:
data.sort_values('TB deaths')

Unnamed: 0,Country,Population (1000s),TB deaths
3,Chile,17620,220
1,Colombia,48321,770
0,Mexico,122332,2200
2,Peru,30376,2300


The table raises the possibility that a large number of deaths may be partly due to a large population. To compare the countries on an equal footing, the death rate per 100,000 inhabitants is computed.

In [70]:
populationColumn = data['Population (1000s)']
data['TB deaths (per 100,000)'] = tbColumn * 100 / populationColumn
data

Unnamed: 0,Country,Population (1000s),TB deaths,"TB deaths (per 100,000)"
0,Mexico,122332,2200,1.798385
1,Colombia,48321,770,1.59351
2,Peru,30376,2300,7.571767
3,Chile,17620,220,1.248581


To see the least and most affected countries, the table is sorted in ascending order by the death rate. 

In [71]:
data.sort_values('TB deaths (per 100,000)')

Unnamed: 0,Country,Population (1000s),TB deaths,"TB deaths (per 100,000)"
3,Chile,17620,220,1.248581
1,Colombia,48321,770,1.59351
0,Mexico,122332,2200,1.798385
2,Peru,30376,2300,7.571767


## Conclusions

The BRICS countries had almost 328 thousand deaths due to TB in 2013. The median shows that at least half of these countries had 25,000 deaths or fewer. The much higher mean (over 65,000) indicates that some countries had a very high number. The least affected country was Brazil, with 4,400 deaths, and the most affected was India, with 240 thousand deaths in a single year. However, taking the population size into account, the most affected was South Africa, with over 47 deaths per 100,000 inhabitants.

On the other hand, the PA countries had 5,490 deaths due to TB in 2013. The median shows that half of these countries had fewer than 1,485 deaths. The slightly lower mean (just over 1,372) indicates that some countries had a low number. The least affected country was Chile, with 220 deaths, and the most affected was Peru, with 2,300 deaths in a single year. Taking the population size into account, the most affected was again Peru, with over 7 deaths per 100,000 inhabitants, a markedly worse situation than in the other PA countries.

Looking at the numbers, we can see a wide gap between the mean and the median for the BRICS countries, unlike the PA countries, so we conclude that the TB problem in the PA countries is more homogeneous than in the BRICS countries. That could be because they are geographically close to each other and share the same language and a similar culture.
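One rough way to put a number on that gap is the ratio of the mean to the median, which is close to 1 when the values are spread evenly around the middle. This is only an illustrative measure chosen here, not one used in the course, and the figures are copied from the tables above.

```python
from statistics import mean, median

# TB deaths in 2013, copied from the two tables above.
brics_deaths = [4400, 17000, 240000, 41000, 25000]
pa_deaths = [2200, 770, 2300, 220]


def mean_to_median_ratio(values):
    """Return mean / median; values far from 1 suggest a skewed group."""
    return mean(values) / median(values)


brics_ratio = mean_to_median_ratio(brics_deaths)
pa_ratio = mean_to_median_ratio(pa_deaths)
print(round(brics_ratio, 2), round(pa_ratio, 2))
```

With these figures the ratio is about 2.6 for BRICS and about 0.9 for PA.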

One should not forget that most values are estimates. Nevertheless, they convey the message that TB is still a major cause of fatalities, that a larger population does not necessarily mean a higher death rate, and that there is a huge disparity between countries, with several being highly affected.