# Coronavirus World Data Analysis

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

First of all, run the following cell to:

- import `pandas` with an alias of `pd`
- read a CSV containing the data to work with
- convert the `date` column to the `datetime` format
- create a DataFrame `df` containing the data for only **1st July 2020**
- take a look at the first few rows of the DataFrame


In [2]:
import pandas as pd

data = pd.read_csv('data/owid-covid-data.csv')
data['date'] = pd.to_datetime(data['date'])
df = data[data['date'] == '2020-07-01']

df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,...,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
173,AFG,Asia,Afghanistan,2020-07-01,31517.0,279.0,746.0,13.0,809.616,7.167,...,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83
300,ALB,Europe,Albania,2020-07-01,2535.0,69.0,62.0,4.0,880.881,23.977,...,8.643,11803.431,1.1,304.195,10.08,7.1,51.2,,2.89,78.57
491,DZA,Africa,Algeria,2020-07-01,13907.0,336.0,912.0,7.0,317.142,7.662,...,3.857,13913.839,0.5,278.364,6.73,0.7,30.4,83.741,1.9,76.88
613,AND,Europe,Andorra,2020-07-01,855.0,0.0,52.0,0.0,11065.812,0.0,...,,,,109.135,7.97,29.0,37.8,,,83.73
727,AGO,Africa,Angola,2020-07-01,284.0,8.0,13.0,2.0,8.641,0.243,...,1.362,5819.495,,276.045,3.94,,,26.664,,61.15


- The `df` DataFrame now has one row for each country with data recorded for **1st July 2020**
- It also has a row with a `location` of `World`, which contains aggregated values for all countries
- `df.tail()`, `df.info()` and `df.shape` let you explore the structure of the DataFrame further

In [None]:
#df.tail()

In [None]:
#df.info()

In [None]:
#df.shape

**Q1. Create a new DataFrame called `countries` which is the same as `df` but with the `World` row removed.**

- Use the `.copy()` method to ensure you have a distinct DataFrame in memory
- Assign this new DataFrame to the variable `countries`; do not modify `df`


See below code syntax for some guidance:
```python
countries['location'] != 'World'
```

In [4]:
#add your code below
countries = df.copy()
countries = countries.drop(countries[countries['location'] == 'World'].index)
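As an illustration of the filter-and-copy idea, here is a minimal sketch on a made-up three-row DataFrame (`toy` and `toy_countries` are hypothetical names, not part of the assignment). It keeps every row whose `location` is not `World` and shows that editing the copy leaves the original unchanged.

```python
import pandas as pd

# Toy stand-in for `df`; the World value is invented for illustration.
toy = pd.DataFrame({
    'location': ['Albania', 'Angola', 'World'],
    'total_cases': [2535.0, 284.0, 10000.0],
})

# Keep rows whose location is not 'World', then take an independent copy.
toy_countries = toy[toy['location'] != 'World'].copy()

# Modifying the copy does not touch the original DataFrame.
toy_countries['total_cases'] = 0.0

print(toy.shape, toy_countries.shape)  # (3, 2) (2, 2)
```

Filtering with a boolean mask and dropping rows by index should give the same rows here; the mask version is simply shorter.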


**Q2. Check the shape of your DataFrame to confirm that `countries` has one row fewer than `df`:**

Please note you have been provided with the code for this question to carry out the necessary analysis. Simply uncomment the line of code and run the code cell to produce the desired results.

In [5]:
print(df.shape, countries.shape)

(211, 34) (210, 34)


**Q3. Define a DataFrame based on the `countries` DataFrame, but which only contains the columns in `cols` (defined below) and assign this to a variable called `countries_dr`**

- Order this DataFrame by `'total_deaths_per_million'`, with the highest numbers at the top.

See below code syntax for some guidance:
```python
DataFrame_name[column_names].sort_values(by=..., ascending=False)
```

In [6]:
cols = ['continent', 'location', 'total_deaths_per_million']

#add your code below
countries_dr = countries[cols].sort_values('total_deaths_per_million', ascending=False)
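A small sketch of column selection followed by a descending sort, using invented data (`toy`, `toy_cols` and `toy_sorted` are hypothetical names). It also shows where a missing value ends up: by default pandas places `NaN` last regardless of sort direction.

```python
import pandas as pd

# Made-up rows; one country has no death-rate value recorded.
toy = pd.DataFrame({
    'continent': ['Europe', 'Asia', 'Africa', 'Europe'],
    'location': ['A', 'B', 'C', 'D'],
    'total_deaths_per_million': [12.5, None, 40.0, 3.1],
    'new_cases': [1, 2, 3, 4],
})

toy_cols = ['continent', 'location', 'total_deaths_per_million']

# Select the columns first, then sort with the highest values at the top.
toy_sorted = toy[toy_cols].sort_values(by='total_deaths_per_million', ascending=False)

print(toy_sorted['location'].tolist())  # ['C', 'A', 'D', 'B']
```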


**Q4. Using the `countries` DataFrame we created earlier, find the sum of `total_tests` for countries in `Africa`, assigning the result, *as an integer*, to `africa_tests`.**

- Use the `.sum()` method to calculate the sum of the `total_tests` column
- Use the `.astype(int)` method or the `int()` function to convert the result to an integer


See below code syntax for some guidance:
```python
countries['continent'] == 'Africa'
```

In [7]:
#add your code below
africa_tests = int(countries[countries['continent'] == 'Africa']['total_tests'].sum())
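The sketch below, on invented numbers, shows two details that matter for this kind of calculation: `.sum()` skips missing values by default, and the result is a float until it is converted. The names `toy`, `africa_total` and `toy_africa_tests` are hypothetical.

```python
import pandas as pd

# Made-up data; one African row has no test count.
toy = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Asia', 'Africa'],
    'total_tests': [1000.0, None, 5000.0, 250.0],
})

# Filter to Africa, then sum; NaN is ignored by default (skipna=True).
africa_total = toy[toy['continent'] == 'Africa']['total_tests'].sum()

# The sum of a float column is a float, so convert explicitly.
toy_africa_tests = int(africa_total)

print(africa_total, toy_africa_tests)  # 1250.0 1250
```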


**Q5. How many countries in Africa have no value recorded in the `total_tests` column? Assign the result to `africa_missing_test_data`.**

- You may find the pandas `.isna()` method and python `len()` function useful

See below code syntax for some guidance:
```python
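# .isna() gives a True/False mask; filter with it first, then count the remaining rows
len(DataFrame_name[DataFrame_name[column_name].isna()])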
```

In [8]:
#add your code below
africa_missing_test_data = len(countries[(countries['continent'] == 'Africa') & (countries['total_tests'].isna())])
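A common slip when counting missing values is to call `len()` directly on the result of `.isna()`, which returns the number of rows in the mask rather than the number of `True` entries. The sketch below uses invented data (all names are hypothetical) to contrast that with two ways of counting that should agree.

```python
import pandas as pd

# Made-up data: three African rows, one of them with no test count.
toy = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Asia', 'Africa'],
    'total_tests': [1000.0, None, None, 250.0],
})
africa = toy[toy['continent'] == 'Africa']

# Length of the mask itself: counts every row, missing or not.
mask_length = len(africa['total_tests'].isna())

# Counting True values in the mask.
missing_by_sum = int(africa['total_tests'].isna().sum())

# Filtering with the mask, then counting the rows that remain.
missing_by_len = len(africa[africa['total_tests'].isna()])

print(mask_length, missing_by_sum, missing_by_len)  # 3 1 1
```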


**Q6. How many countries have a higher value for `total_tests` than the `United Kingdom`? Assign your answer to a variable called `countries_more_tests`.**

Remember to work from the `countries` DataFrame rather than `df`. You should avoid modifying any existing DataFrames. 

In [9]:
#add your code below
uk_no_tests = countries[countries['location'] == 'United Kingdom']['total_tests'].values[0]
countries_more_tests = len(countries[countries['total_tests'] > uk_no_tests])
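One point worth keeping in mind for comparisons like this: any comparison against `NaN` evaluates to `False`, so countries with no `total_tests` value are never counted as having more tests. A minimal sketch with invented numbers (all names are hypothetical):

```python
import pandas as pd

# Made-up data; country 'B' has no test count.
toy = pd.DataFrame({
    'location': ['United Kingdom', 'A', 'B', 'C'],
    'total_tests': [500.0, 900.0, None, 100.0],
})

# Pull out the single reference value for the United Kingdom.
uk_tests = toy.loc[toy['location'] == 'United Kingdom', 'total_tests'].iloc[0]

# NaN > 500.0 is False, so 'B' is excluded automatically.
more_than_uk = toy['total_tests'] > uk_tests
toy_more_tests = int(more_than_uk.sum())

print(more_than_uk.tolist(), toy_more_tests)  # [False, True, False, False] 1
```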


**Q7. Create a DataFrame called `beds_dr` which is based on the `countries` DataFrame, but contains only the columns `hospital_beds_per_thousand` and `total_deaths_per_million`.**

- Your answer should only include rows where there are values present in both of these columns
- You may find the `.dropna()` method useful

See below code syntax for some guidance:
```python
DataFrame_name.dropna()
```

In [10]:
#add your code below
beds_dr = countries[['hospital_beds_per_thousand', 'total_deaths_per_million']].dropna()
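To see what `.dropna()` does after selecting two columns, here is a small sketch on invented data (`toy` and `toy_beds` are hypothetical names). With default arguments, a row is dropped if either selected column is missing, and the surviving rows keep their original index labels.

```python
import pandas as pd

# Made-up data: row 1 lacks a bed ratio, row 2 lacks a death rate.
toy = pd.DataFrame({
    'location': ['A', 'B', 'C', 'D'],
    'hospital_beds_per_thousand': [2.9, None, 0.5, 3.8],
    'total_deaths_per_million': [10.0, 5.0, None, 20.0],
})

# Select the two columns first so that gaps in other columns are irrelevant.
toy_beds = toy[['hospital_beds_per_thousand', 'total_deaths_per_million']].dropna()

print(toy_beds.shape, toy_beds.index.tolist())  # (2, 2) [0, 3]
```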


**Q8. Refer to the `beds_dr` DataFrame. What is the average `total_deaths_per_million` for entries in `beds_dr` where `hospital_beds_per_thousand` is greater than the mean?**

- Save the results to a new variable called `dr_high_bed_ratio`

See below code syntax for some guidance:
```python
beds_dr['hospital_beds_per_thousand'] > beds_dr['hospital_beds_per_thousand'].mean()
```

In [11]:
#add your code below
mask = beds_dr['hospital_beds_per_thousand'] > beds_dr['hospital_beds_per_thousand'].mean()
dr_high_bed_ratio = beds_dr[mask]['total_deaths_per_million'].mean()


**Q9. Refer to the `beds_dr` DataFrame. What is the average `total_deaths_per_million` for entries in `beds_dr` where `hospital_beds_per_thousand` is less than the mean?**

- Save the results to a new variable called `dr_low_bed_ratio`

See below code syntax for some guidance:
```python
beds_dr['hospital_beds_per_thousand'] < beds_dr['hospital_beds_per_thousand'].mean()
```

In [12]:
#add your code below
mask = beds_dr['hospital_beds_per_thousand'] < beds_dr['hospital_beds_per_thousand'].mean()
dr_low_bed_ratio = beds_dr[mask]['total_deaths_per_million'].mean()
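The two mean-based filters can be compared side by side on a toy table (values invented, names hypothetical). Note that a row whose bed ratio equals the mean exactly satisfies neither `>` nor `<`, so it belongs to neither group.

```python
import pandas as pd

# Made-up data; the mean bed ratio is exactly 3.0.
toy_beds = pd.DataFrame({
    'hospital_beds_per_thousand': [1.0, 2.0, 3.0, 6.0],
    'total_deaths_per_million': [10.0, 20.0, 30.0, 40.0],
})

mean_beds = toy_beds['hospital_beds_per_thousand'].mean()

# Strict comparisons: the row at exactly 3.0 is in neither group.
high = toy_beds[toy_beds['hospital_beds_per_thousand'] > mean_beds]
low = toy_beds[toy_beds['hospital_beds_per_thousand'] < mean_beds]

toy_high_ratio = high['total_deaths_per_million'].mean()
toy_low_ratio = low['total_deaths_per_million'].mean()

print(mean_beds, toy_high_ratio, toy_low_ratio)  # 3.0 40.0 15.0
```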


**Q10. Refer to the `countries` DataFrame. Create a new DataFrame called `no_new_cases` which contains only rows from `countries` with zero `new_cases`.**

Please note you have been provided with the code for this question to carry out the necessary analysis. Simply uncomment the lines of code and run the code cell to produce the desired results.

In [13]:
#add your code below
no_new_cases = countries[countries['new_cases'] == 0]
no_new_cases.head()



Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,...,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
613,AND,Europe,Andorra,2020-07-01,855.0,0.0,52.0,0.0,11065.812,0.0,...,,,,109.135,7.97,29.0,37.8,,,83.73
836,AIA,North America,Anguilla,2020-07-01,3.0,0.0,0.0,0.0,199.973,0.0,...,,,,,,,,,,81.88
952,ATG,North America,Antigua and Barbuda,2020-07-01,66.0,0.0,3.0,0.0,673.965,0.0,...,4.631,21490.943,,191.511,13.17,,,,3.8,77.02
1381,ABW,North America,Aruba,2020-07-01,103.0,0.0,3.0,0.0,964.727,0.0,...,7.452,35973.781,,,11.62,,,,,76.29
2080,BHS,North America,Bahamas,2020-07-01,104.0,0.0,11.0,0.0,264.464,0.0,...,5.2,27717.847,,235.954,13.17,3.1,20.4,,2.9,73.92


**Q11. Refer to the `no_new_cases` DataFrame. Which country in the `no_new_cases` DataFrame has the highest number of `total_cases`?**

- Save the results to a new variable called `highest_no_new`

See below code syntax for some guidance:
```python
no_new_cases['total_cases'] == no_new_cases['total_cases'].max()
```

In [14]:
#add your code below
highest_no_new = no_new_cases[no_new_cases['total_cases'] == no_new_cases['total_cases'].max()]['location'].values[0]
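As an alternative to comparing against `.max()`, `.idxmax()` returns the index label of the largest value, which can be passed to `.loc`. The sketch below uses three rows copied from the output displayed earlier, so the result only holds for this subset and not necessarily for the full data; `toy`, `by_mask` and `by_idxmax` are hypothetical names.

```python
import pandas as pd

# Three rows taken from the displayed head of `no_new_cases`, original index labels kept.
toy = pd.DataFrame(
    {
        'location': ['Andorra', 'Aruba', 'Bahamas'],
        'total_cases': [855.0, 103.0, 104.0],
    },
    index=[613, 1381, 2080],
)

# Approach 1: boolean mask against the maximum, then take the first match.
by_mask = toy[toy['total_cases'] == toy['total_cases'].max()]['location'].values[0]

# Approach 2: look up the row label of the maximum directly.
by_idxmax = toy.loc[toy['total_cases'].idxmax(), 'location']

print(by_mask, by_idxmax)  # Andorra Andorra
```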


**Q12. Refer to the `countries` DataFrame. What is the sum of the `population` of all countries which have had zero `total_deaths`?**

- Assign your answer to the `sum_populations_no_deaths` variable
- Your answer should be in millions, rounded to the nearest whole number, and converted to an integer

In [16]:
#add your code below
sum_populations_no_deaths = int(round(countries[countries['total_deaths'] == 0]['population'].sum() / 1e6))
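The order of operations (filter, sum, scale to millions, round, convert) can be traced on invented numbers. All names below are hypothetical. The row with a missing `total_deaths` value is not selected, because `NaN == 0` is `False`.

```python
import pandas as pd

# Made-up data; the last row has no death count recorded.
toy = pd.DataFrame({
    'total_deaths': [0.0, 3.0, 0.0, None],
    'population': [1_200_000.0, 9_000_000.0, 2_400_000.0, 700_000.0],
})

# Keep only rows with exactly zero deaths.
no_deaths = toy[toy['total_deaths'] == 0]

# Sum the populations, convert to millions, round, then cast to int.
total_population = no_deaths['population'].sum()
toy_sum_millions = int(round(total_population / 1e6))

print(total_population, toy_sum_millions)  # 3600000.0 4
```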


**Q13. Create a function called `country_metric` which accepts the following three parameters:**

- a DataFrame (which can be assumed to be of a similar format to `countries`)
- a location (i.e. a string which will be found in the `location` column of the DataFrame)
- a metric (i.e. a string naming any column other than `location` in the DataFrame)

The function should return only the value from the first row for a given `location` and `metric`. *You may find `.iloc[]` useful.*

See below code syntax for some guidance:
```python
def country_metric(df, location, metric):
    
    return df[df['location'] == location].iloc[0][metric]
```

In [17]:
#add your code below
def country_metric(df, location, metric):
    return df[df['location'] == location].iloc[0][metric]
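A quick check of the function's behaviour on a toy DataFrame (the second Vietnam row and the `Atlantis` lookup are invented for illustration). It shows that only the first matching row is used, and that a location absent from the DataFrame raises an `IndexError`, because `.iloc[0]` has no row to return.

```python
import pandas as pd

def country_metric(df, location, metric):
    # First row matching the location, then the requested column.
    return df[df['location'] == location].iloc[0][metric]

toy = pd.DataFrame({
    'location': ['Vietnam', 'Vietnam', 'Angola'],
    'aged_70_older': [4.718, 9.9, 1.362],
})

first_value = country_metric(toy, 'Vietnam', 'aged_70_older')

# A location with no matching rows leaves nothing for .iloc[0] to select.
try:
    country_metric(toy, 'Atlantis', 'aged_70_older')
    missing_raises = False
except IndexError:
    missing_raises = True

print(first_value, missing_raises)  # 4.718 True
```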


**Q14. Use your function to collect the value for `Vietnam` for the metric `aged_70_older`, assigning the result to `vietnam_older_70`.**

Please note you have been provided with the code for this question to carry out the necessary analysis. Simply uncomment the lines of code and run the code cell to produce the desired results.

In [18]:
#add your code below
vietnam_older_70 = country_metric(countries, 'Vietnam', 'aged_70_older')
vietnam_older_70



4.718

**Q15. Create another function called `countries_average`, which accepts the following three parameters:**

- a DataFrame "df" (which can be assumed to be similar in format to `countries`)
- a list of countries "countries" (which can be assumed to all be found in the `location` column of the DataFrame)
- a string "metric" (which can be assumed to name a column, other than `location`, in the DataFrame). For instance, this string can be `life_expectancy`.

Note that for the test on KATE for this question to pass, the function must accept the three parameters in the following order: `countries_average(df, countries, metric)`. (You can name your parameters however you like, as long as their types match the descriptions above.)

The function should return the average value for the given metric for the given list of countries.

You may find the `.isin()` method useful when filtering for a list of countries.

In [19]:
#add your code below
def countries_average(df, countries, metric):
    return df[df['location'].isin(countries)][metric].mean()
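A toy run of the function, with invented data, shows how `.isin()` and `.mean()` treat edge cases: names in the list that do not appear in the DataFrame are simply not matched, and missing metric values are skipped by `.mean()` by default.

```python
import pandas as pd

def countries_average(df, countries, metric):
    # Keep rows whose location is in the list, then average the metric.
    return df[df['location'].isin(countries)][metric].mean()

# Made-up data; country 'C' has no life expectancy recorded.
toy = pd.DataFrame({
    'location': ['A', 'B', 'C', 'D'],
    'life_expectancy': [80.0, 70.0, None, 60.0],
})

# 'Z' is not in the DataFrame and 'C' is NaN, so only A and B contribute.
toy_average = countries_average(toy, ['A', 'B', 'C', 'Z'], 'life_expectancy')

print(toy_average)  # 75.0
```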


**Q16. Use your `countries_average` function to find out the average `life_expectancy` of countries in the `g7` list defined below. Assign the result to the variable `g7_avg_life_expectancy`.**

Please note you have been provided with the code for this question to carry out the necessary analysis. Simply uncomment the lines of code and run the code cell to produce the desired results.

In [20]:
g7 = ['United States', 'Italy', 'Canada', 'Japan', 'United Kingdom', 'Germany', 'France']

In [21]:
#add your code below
g7 = ['United States', 'Italy', 'Canada', 'Japan', 'United Kingdom', 'Germany', 'France']
g7_avg_life_expectancy = countries_average(df, g7, 'life_expectancy')
g7_avg_life_expectancy


82.10571428571428

**Q17. Refer to the `countries` DataFrame. Find the country with the lowest value for `life_expectancy` in the `countries` DataFrame, and create a string which is formatted as follows:**

'{country} has a life expectancy of {diff} years lower than the G7 average.'
    
Assign your string to the variable `headline` and ensure it is formatted exactly as above. Use an f-string to build it, with:
- {country} replaced by the value in the `location` column of the DataFrame
- {diff} replaced by a float **rounded to one decimal place**: `g7_avg_life_expectancy` minus the value from the `life_expectancy` column. Note that {diff} should be a positive value
```python
diff = <G7 countries average life expectancy> - <value of the lowest life expectancy country>
```
    
    
    
See below code syntax for some guidance:
```python
lowest = countries[countries['life_expectancy'] == countries['life_expectancy'].min()].iloc[0]
country = lowest['location']
life_exp = lowest['life_expectancy']
```

In [22]:
#add your code below
min_life_exp = countries['life_expectancy'].min()
country = countries[countries['life_expectancy'] == min_life_exp]['location'].values[0]
diff = round(g7_avg_life_expectancy - min_life_exp, 1)

headline = f'{country} has a life expectancy of {diff} years lower than the G7 average.'
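The string-building step can be checked on invented values. In this sketch `toy_g7_avg` and the two toy rows are hypothetical, and `.idxmin()` is used as one possible way to locate the lowest row. Wrapping the difference in `float()` before rounding is an extra precaution so that the f-string prints a plain Python float.

```python
import pandas as pd

# Hypothetical G7 average and made-up country rows.
toy_g7_avg = 80.0
toy = pd.DataFrame({
    'location': ['X', 'Y'],
    'life_expectancy': [50.5, 61.15],
})

# Row with the lowest life expectancy.
lowest = toy.loc[toy['life_expectancy'].idxmin()]
toy_country = lowest['location']

# G7 average minus the lowest value, so the difference is positive.
toy_diff = round(float(toy_g7_avg - lowest['life_expectancy']), 1)

toy_headline = f'{toy_country} has a life expectancy of {toy_diff} years lower than the G7 average.'
print(toy_headline)  # X has a life expectancy of 29.5 years lower than the G7 average.
```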
