## Assignment 2 Data Analysis using Pandas

This assignment contains 14 questions, detailed below. It is due on Sunday, October 8, 2023, at 23:59. Each late day results in a 20% loss of total points.

The 'Daily reports (csse_covid_19_daily_reports)' file contains the daily case report for 01-01-2023 (MM-DD-YYYY). All timestamps are in UTC (GMT+0). A more detailed description can be found in the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19).

References:

- Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1
- Additional Information about the Visual Dashboard: https://systems.jhu.edu/research/public-health/ncov/
- Miller, Meg. "2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository: Johns Hopkins University Center for Systems Science and Engineering." Bulletin-Association of Canadian Map Libraries and Archives (ACMLA) 164 (2020): 47-51.

Field/Feature/Column name descriptions are listed as follows:

- FIPS: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.

- Admin2: County name. US only.

- Province_State: Province, state or dependency name.

- Country_Region: Country, region or sovereignty name. The names of locations included on the Website correspond with the official designations used by the U.S. Department of State.

- Last Update: MM/DD/YYYY HH:mm:ss (24 hour format, in UTC).

- Lat and Long: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.

- Confirmed: Counts include confirmed and probable (where reported).

- Deaths: Counts include confirmed and probable (where reported).

- Recovered: Recovered cases are estimates based on local media reports, and state and local reporting when available, and therefore may be substantially lower than the true number. US state-level recovered cases are from the COVID Tracking Project. We have stopped maintaining the recovered cases.

- Active: Active cases = total cases - total recovered - total deaths. This value is for reference only, since we have stopped reporting the recovered cases.

- Incident_Rate: Incidence Rate = cases per 100,000 persons.

- Case_Fatality_Ratio (%): Case-Fatality Ratio (%) = 100 * Number recorded deaths / Number cases.

- All cases, deaths, and recoveries reported are based on the date of initial report.
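
As a quick illustration of the two formulas in the list above, here is a minimal sketch. The function names are made up for this example. The deaths and confirmed counts are the Autauga, Alabama values that appear later in this notebook, while the population figure is purely hypothetical because the dataset does not include population.

```python
def case_fatality_ratio(deaths, confirmed):
    """Case-Fatality Ratio in percent: 100 * recorded deaths / cases."""
    return 100 * deaths / confirmed


def incident_rate(confirmed, population):
    """Incidence rate: cases per 100,000 persons."""
    return 100_000 * confirmed / population


# Autauga, Alabama: 230 deaths and 18961 confirmed cases (from the dataset).
cfr = case_fatality_ratio(deaths=230, confirmed=18961)

# Hypothetical county: 1,500 cases in a population of 50,000 (made-up numbers).
rate = incident_rate(confirmed=1_500, population=50_000)

print(round(cfr, 6), rate)
```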


Note: Please download the dataset "01-01-2023.csv" from Moodle to your local path to perform the analysis, as some modifications were made to the original data to suit the needs of this assignment.

In [1]:
import pandas as pd
import numpy as np

## Question 1 (5 points)

Now you need to use ```pandas``` to read the downloaded file from your local path.

**Print the column names, and also print a general description of the data using the ```.describe()``` function.**

In [2]:
### Q1
covid_data = pd.read_csv('01-01-2023.csv')
print(covid_data.columns)
display(covid_data.describe())      # use display instead of print for output readability

Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
       'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'Combined_Key', 'Incident_Rate', 'Case_Fatality_Ratio'],
      dtype='object')


Unnamed: 0,FIPS,Lat,Long_,Confirmed,Deaths,Recovered,Active,Incident_Rate,Case_Fatality_Ratio
count,3268.0,3925.0,3925.0,4016.0,4016.0,0.0,0.0,3922.0,3974.0
mean,32405.94339,35.736183,-71.109728,164536.4,1666.937749,,,27690.256958,3.396189
std,18056.381177,13.441327,55.36148,1045288.0,8702.992446,,,10386.943044,93.482132
min,60.0,-71.9499,-178.1165,0.0,0.0,,,0.0,0.0
25%,19048.5,33.191535,-96.595639,3721.25,46.0,,,23340.816452,0.892777
50%,30068.0,37.8957,-86.717326,10506.0,130.5,,,28611.368832,1.287045
75%,47041.5,42.176955,-77.3579,45770.75,465.25,,,33162.63832,1.739872
max,99999.0,71.7069,178.065,38267000.0,183247.0,,,218343.195266,5651.724138


## Question 2  (10 points)

Meanwhile, the data contains a few errors that need to be resolved:

- the ```Long``` column is mistakenly encoded as ```Long_```
- the ```Recovered``` column contains mostly missing values and needs to be deleted
- the ```Active``` column contains mostly missing values and needs to be deleted
- the ```Incident_Rate``` column was mistakenly multiplied by 100 and needs to be divided by 100 to recover its original value

In [3]:
### Q2
covid_data.rename(columns={'Long_': 'Long'}, inplace=True)
covid_data.drop(columns=['Recovered', 'Active'], inplace=True)
covid_data['Incident_Rate'] = covid_data['Incident_Rate']/100
display(covid_data.head())

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2023-01-02 04:20:57,33.93911,67.709953,207616.0,7849.0,Afghanistan,5.333287,3.780537
1,,,,Albania,2023-01-02 04:20:57,41.1533,20.1683,333811.0,3595.0,Albania,115.995205,1.076957
2,,,,Algeria,2023-01-02 04:20:57,28.0339,1.6596,271229.0,6881.0,Algeria,6.185235,2.536971
3,,,,Andorra,2023-01-02 04:20:57,42.5063,1.5218,47751.0,165.0,Andorra,618.015919,0.345543
4,,,,Angola,2023-01-02 04:20:57,-11.2027,17.8739,105095.0,1930.0,Angola,3.197655,1.836434
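
The same clean-up can also be written without `inplace=True`, as a method chain that leaves the input untouched. The sketch below runs on a small made-up frame (the `raw` and `cleaned` names and all values are assumptions for illustration, not the assignment data).

```python
import pandas as pd

# Toy frame that mimics the four problems listed above (values are made up).
raw = pd.DataFrame({
    'Long_': [67.7, 20.2],
    'Recovered': [None, None],
    'Active': [None, None],
    'Incident_Rate': [533.3, 11599.5],
})

cleaned = (
    raw.rename(columns={'Long_': 'Long'})           # fix the column name
       .drop(columns=['Recovered', 'Active'])       # drop the empty columns
       .assign(Incident_Rate=lambda df: df['Incident_Rate'] / 100)  # undo the x100
)
print(cleaned)
```

Because every step returns a new dataframe, `raw` still has its original columns afterwards, which makes the cell safe to re-run.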


## Question 3  (5 points)

The column ```Last_Update``` contains some timestamps that are not in the year 2023. Find them and delete those rows.

**The updated dataframe should have only rows with timestamp in 2023.**

Hint: use value_counts() to count unique values first.

In [4]:
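### Q3
# value_counts() shows one dominant 2023 timestamp; keep only the rows that carry it.
# Note: this assumes that every 2023 row shares that single timestamp.
most_common_timestamp = covid_data['Last_Update'].value_counts().index[0]
covid_data = covid_data[covid_data['Last_Update'] == most_common_timestamp]
display(covid_data)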

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2023-01-02 04:20:57,33.939110,67.709953,207616.0,7849.0,Afghanistan,5.333287,3.780537
1,,,,Albania,2023-01-02 04:20:57,41.153300,20.168300,333811.0,3595.0,Albania,115.995205,1.076957
2,,,,Algeria,2023-01-02 04:20:57,28.033900,1.659600,271229.0,6881.0,Algeria,6.185235,2.536971
3,,,,Andorra,2023-01-02 04:20:57,42.506300,1.521800,47751.0,165.0,Andorra,618.015919,0.345543
4,,,,Angola,2023-01-02 04:20:57,-11.202700,17.873900,105095.0,1930.0,Angola,3.197655,1.836434
...,...,...,...,...,...,...,...,...,...,...,...,...
4011,,,,West Bank and Gaza,2023-01-02 04:20:57,31.952200,35.233200,703228.0,5708.0,West Bank and Gaza,137.849570,0.811686
4012,,,,Winter Olympics 2022,2023-01-02 04:20:57,39.904200,116.407400,535.0,0.0,Winter Olympics 2022,,0.000000
4013,,,,Yemen,2023-01-02 04:20:57,15.552727,48.516388,11945.0,2159.0,Yemen,0.400490,18.074508
4014,,,,Zambia,2023-01-02 04:20:57,-13.133897,27.849332,334629.0,4024.0,Zambia,18.202230,1.202526
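
Keeping only the most frequent timestamp works when all 2023 rows share it. A more general sketch, shown here on a made-up frame (the `toy` data is an assumption, not the assignment file), parses the column with `pd.to_datetime` and filters on the year directly.

```python
import pandas as pd

# Made-up rows: two timestamps in 2023, two in other years.
toy = pd.DataFrame({
    'Country_Region': ['A', 'B', 'C', 'D'],
    'Last_Update': ['2023-01-02 04:20:57', '2022-12-31 23:59:59',
                    '2023-01-01 00:00:00', '2020-08-04 02:27:56'],
})

timestamps = pd.to_datetime(toy['Last_Update'])   # parse strings into datetimes
toy_2023 = toy[timestamps.dt.year == 2023]        # keep rows whose year is 2023
print(toy_2023)
```

This keeps every 2023 row even if several distinct 2023 timestamps are present.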


## Question 4  (5 points)

There are two provinces/states that have the same latitude (```Lat```) 52.939900. Print out these two provinces/states.

In [6]:
### Q4
print(covid_data[covid_data['Lat']==52.939900]['Province_State'])

89          Quebec
91    Saskatchewan
Name: Province_State, dtype: object
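
Exact `==` on floats works here because both rows store exactly the same value. If the values could differ by rounding noise, a tolerance-based comparison such as `np.isclose` is a safer alternative. The frame below is made up to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up latitudes: the second value differs from 52.9399 by a tiny amount.
toy = pd.DataFrame({
    'Province_State': ['P1', 'P2', 'P3'],
    'Lat': [52.9399, 52.93990000000001, 53.7609],
})

exact = toy[toy['Lat'] == 52.9399]            # strict equality
close = toy[np.isclose(toy['Lat'], 52.9399)]  # equality within a tolerance

print(len(exact), len(close))
```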


## Question 5  (5 points)

Show the average ```Confirmed``` number of all regions. Show also the median ```Deaths``` number per county of the US.

In [7]:
### Q5
print(round(covid_data[['Country_Region', 'Confirmed']].groupby('Country_Region').sum().mean()))  # use round() to avoid scientific notation
print(covid_data[covid_data['Country_Region']=='US'][['FIPS', 'Deaths']].groupby('FIPS').sum().median())

Confirmed    3287340.0
dtype: float64
Deaths    103.0
dtype: float64


## Question 6 (5 points)

Show the difference in the average ```Deaths``` number between Alabama in the US and Wyoming in the US.

In [8]:
### Q6
df_deaths_usstate = covid_data[covid_data['Country_Region']=='US'][['Province_State', 'Deaths']].groupby('Province_State').mean()
print(df_deaths_usstate.loc['Alabama'] - df_deaths_usstate.loc['Wyoming'])

Deaths    227.924129
dtype: float64


## Question 7 (10 points)

Find the ```Province_State``` and ```Country_Region``` values for which the ```Deaths``` number reaches its maximum and its second maximum.

In [9]:
### Q7
print(covid_data[['Province_State', 'Country_Region', 'Deaths']].nlargest(2,'Deaths'))

     Province_State  Country_Region    Deaths
3992        England  United Kingdom  183247.0
66        Sao Paulo          Brazil  177411.0


## Question 8 (10 points)

Build a subset dataframe for samples collected from the US. **Use the values in column ```Combined_Key``` to create a new column** ```Province_State_recovered``` containing only the province, state or dependency name. The county name and the country, region or sovereignty name should be omitted.

# ***Note: From this question onward, please complete ALL the following data curation tasks with the U.S. subset dataframe.***

In [10]:
### Q8
us_data = covid_data[covid_data['Country_Region']=='US'].copy()
# Index from the end: entries such as Guam (FIPS 66) have no county name,
# so the province would be at index [0] instead of [1].
us_data['Province_State_recovered'] = us_data['Combined_Key'].map(lambda x: x.split(',')[-2].strip())
display(us_data)

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio,Province_State_recovered
678,1001.0,Autauga,Alabama,US,2023-01-02 04:20:57,32.539527,-86.644082,18961.0,230.0,"Autauga, Alabama, US",339.383200,1.213016,Alabama
679,1003.0,Baldwin,Alabama,US,2023-01-02 04:20:57,30.727750,-87.722071,67496.0,719.0,"Baldwin, Alabama, US",302.355376,1.065248,Alabama
680,1005.0,Barbour,Alabama,US,2023-01-02 04:20:57,31.868263,-85.387129,7027.0,103.0,"Barbour, Alabama, US",284.655270,1.465775,Alabama
681,1007.0,Bibb,Alabama,US,2023-01-02 04:20:57,32.996421,-87.125115,7692.0,108.0,"Bibb, Alabama, US",343.484862,1.404056,Alabama
682,1009.0,Blount,Alabama,US,2023-01-02 04:20:57,33.982109,-86.567906,17731.0,260.0,"Blount, Alabama, US",306.626777,1.466358,Alabama
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3952,56039.0,Teton,Wyoming,US,2023-01-02 04:20:57,43.935225,-110.589080,12010.0,16.0,"Teton, Wyoming, US",511.847937,0.133222,Wyoming
3953,56041.0,Uinta,Wyoming,US,2023-01-02 04:20:57,41.287818,-110.547578,6305.0,43.0,"Uinta, Wyoming, US",311.727479,0.681998,Wyoming
3954,90056.0,Unassigned,Wyoming,US,2023-01-02 04:20:57,,,0.0,0.0,"Unassigned, Wyoming, US",,,Wyoming
3955,56043.0,Washakie,Wyoming,US,2023-01-02 04:20:57,43.904516,-107.680187,2722.0,47.0,"Washakie, Wyoming, US",348.750801,1.726672,Wyoming
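
The same extraction can be sketched with pandas' vectorized string accessor instead of `map` with a lambda. The sample keys below are illustrative. The two-part `'Guam, US'` form is assumed from the code comment above about entries without a county name.

```python
import pandas as pd

# Illustrative Combined_Key values: with and without a county name.
keys = pd.Series(['Autauga, Alabama, US', 'Guam, US', 'Teton, Wyoming, US'])

# Split on commas, take the second-to-last piece, strip surrounding spaces.
state = keys.str.split(',').str[-2].str.strip()
print(state.tolist())
```

Indexing from the end (`[-2]`) is what makes both key shapes resolve to the province or state.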


## Question 9 (5 points)

Compute the correlation between ```Confirmed, Deaths, Incident_Rate, Case_Fatality_Ratio```. What do you observe?

In [11]:
### Q9
correlation_matrix = us_data[['Confirmed', 'Deaths', 'Incident_Rate', 'Case_Fatality_Ratio']].corr()
display(correlation_matrix)

Unnamed: 0,Confirmed,Deaths,Incident_Rate,Case_Fatality_Ratio
Confirmed,1.0,0.960176,0.054209,-0.007169
Deaths,0.960176,1.0,0.048942,0.077919
Incident_Rate,0.054209,0.048942,1.0,-0.256701
Case_Fatality_Ratio,-0.007169,0.077919,-0.256701,1.0


### Observations:
For these observations, keep in mind that they apply to the US subset dataframe only, where we have records for each individual county.

There is a very high correlation between the number of confirmed cases and the number of deaths. This is to be expected since the more cases there are, the more the disease spreads, and therefore more vulnerable segments of the population are infected and consequently die. Also, a high number of cases means that local health care systems are overworked and preventive public health measures are less effective.

The correlation between confirmed cases and the incident rate is positive but still relatively weak, which is surprising since the incident rate is calculated by dividing the number of confirmed cases by the population size. In principle, the more cases there are, the higher the incident rate should be, and counties with high populations should have more difficulty controlling outbreaks, since the flow of people is a lot higher within counties than across the country or between countries. However, since incident rate only takes into account population size and not population density, the weak correlation is plausible.

The correlation between the number of deaths and the case fatality ratio is also positive but very low, which is counter-intuitive, since more deaths should increase the fatality of a disease. However, because the fatality ratio is calculated by dividing the number of deaths by the number of confirmed cases, and because the confirmed cases and deaths increase at similar rates, the ratio is not very sensitive to changes in the number of deaths. So the very weak correlation makes sense.

The final noteworthy correlation is between the incident rate and the case fatality ratio, which is a weak-to-moderate negative correlation (about -0.26). This is also a counter-intuitive result: one might expect that the higher the incident rate, the more cases per population we have, and therefore the bigger the spread of the disease, particularly to the vulnerable segments of the population, causing a higher death rate. But this is not necessarily the case, since theoretically, having a low incident rate means that only a small subset of the population would be infected, and by consequence, the death rate could range from very low to very high, depending on whether the vulnerable segments were infected. So, if the spread is random, we would expect a weak correlation.
There are also a number of additional factors that would contribute to a negative relationship. The main one is the fact that high incident rates happen in places with high population density that have more healthcare resources to tackle the disease and therefore would have a lower death rate. Public policy for pandemic response was also not uniform across the US. The response to the pandemic was much more effective in US blue states, where there are typically more urban and densely packed areas and therefore higher incident rates, than in red states, which are typically more rural and less densely populated. The blue/red divide comes to mind particularly in regard to the support for vaccination campaigns, which were much more effective in blue states than in red states.


## Question 10 (5 points)

Find the number of miscalculated samples, i.e. those where the ```Case_Fatality_Ratio``` (%) is not equal to 100 * Deaths / Confirmed. Note that you also need to make sure that ```Confirmed```, as the denominator, is not zero.

In [12]:
### Q10
print(us_data[(us_data['Confirmed']!=0) & (us_data['Case_Fatality_Ratio']!=(100*(us_data['Deaths']/us_data['Confirmed'])))].shape[0])

1370
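
Exact inequality also flags rows whose stored ratio is only a rounded version of the computed one, which may explain part of the large count. The sketch below, on made-up rows, contrasts the exact count with a tolerance-based count using `np.isclose`. The tolerance of `1e-6` is an assumption chosen to match six stored decimals.

```python
import numpy as np
import pandas as pd

# Made-up rows: a rounded-but-correct ratio, an exact one, a zero denominator,
# and a genuinely wrong ratio.
toy = pd.DataFrame({
    'Confirmed': [18961.0, 200.0, 0.0, 1000.0],
    'Deaths': [230.0, 4.0, 0.0, 10.0],
    'Case_Fatality_Ratio': [1.213016, 2.0, np.nan, 1.5],
})

nonzero = toy[toy['Confirmed'] != 0]                      # avoid division by zero
expected = 100 * nonzero['Deaths'] / nonzero['Confirmed']

exact_mismatch = int((nonzero['Case_Fatality_Ratio'] != expected).sum())
tolerant_mismatch = int(
    (~np.isclose(nonzero['Case_Fatality_Ratio'], expected, rtol=0, atol=1e-6)).sum()
)
print(exact_mismatch, tolerant_mismatch)
```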


## Question 11 (5 points)

Create a new column ```Case_Fatality_Ratio_short``` to extract and store the first three digits of the original values.
Create a new column ```Case_Fatality_Ratio_calculated``` and compute Case-Fatality Ratio(%) by yourself. Store the first three digits of the computed values as well.

Note that Case-Fatality Ratio(%) = 100 * Number recorded deaths / Number cases.

In [13]:
### Q11
us_data['Case_Fatality_Ratio_short'] = us_data['Case_Fatality_Ratio'].map(lambda x: float(str(x)[:4]))
us_data['Case_Fatality_Ratio_calculated'] = (100*(us_data['Deaths']/us_data['Confirmed'])).map(lambda x: float(str(x)[:4]))
display(us_data)

"""
Two entries have a Case Fatality Ratio above 100%, so we would need to include conditionals to slice only the first three characters (str(x)[:3]) to avoid float('100.')
However, both those entries are above 1000, so slicing the first four characters for all x's still works.
Checked using the following code: display(us_data[us_data['Case_Fatality_Ratio']>=100])

If there were any values above 100 and below 1000, we would need to include conditionals: lambda x: float(str(x)[:3]) if 100<x<1000 else float(str(x)[:4]).

In case the last digit is a 0 in a decimal place that is not the first, the output will only have two digits, since python omits zeros in the final decimal place. Test with:
display(us_data[~(us_data['Case_Fatality_Ratio_short'].map(lambda x: True if len(str(x).replace('.', '')) == 3 else False))])
float('1.0')
float('1.20')

In case we get a NaN, if we convert to string, slice the first four characters and reconvert to float, we still get NaN. We would only get an error if we sliced part of nan and tried to convert to float.
Test with:
float('nan'[:2])
"""

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio,Province_State_recovered,Case_Fatality_Ratio_short,Case_Fatality_Ratio_calculated
678,1001.0,Autauga,Alabama,US,2023-01-02 04:20:57,32.539527,-86.644082,18961.0,230.0,"Autauga, Alabama, US",339.383200,1.213016,Alabama,1.21,1.21
679,1003.0,Baldwin,Alabama,US,2023-01-02 04:20:57,30.727750,-87.722071,67496.0,719.0,"Baldwin, Alabama, US",302.355376,1.065248,Alabama,1.06,1.06
680,1005.0,Barbour,Alabama,US,2023-01-02 04:20:57,31.868263,-85.387129,7027.0,103.0,"Barbour, Alabama, US",284.655270,1.465775,Alabama,1.46,1.46
681,1007.0,Bibb,Alabama,US,2023-01-02 04:20:57,32.996421,-87.125115,7692.0,108.0,"Bibb, Alabama, US",343.484862,1.404056,Alabama,1.40,1.40
682,1009.0,Blount,Alabama,US,2023-01-02 04:20:57,33.982109,-86.567906,17731.0,260.0,"Blount, Alabama, US",306.626777,1.466358,Alabama,1.46,1.46
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3952,56039.0,Teton,Wyoming,US,2023-01-02 04:20:57,43.935225,-110.589080,12010.0,16.0,"Teton, Wyoming, US",511.847937,0.133222,Wyoming,0.13,0.13
3953,56041.0,Uinta,Wyoming,US,2023-01-02 04:20:57,41.287818,-110.547578,6305.0,43.0,"Uinta, Wyoming, US",311.727479,0.681998,Wyoming,0.68,0.68
3954,90056.0,Unassigned,Wyoming,US,2023-01-02 04:20:57,,,0.0,0.0,"Unassigned, Wyoming, US",,,Wyoming,,
3955,56043.0,Washakie,Wyoming,US,2023-01-02 04:20:57,43.904516,-107.680187,2722.0,47.0,"Washakie, Wyoming, US",348.750801,1.726672,Wyoming,1.72,1.72
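
String slicing works for this data, but the caveats noted in the cell above suggest a numeric alternative may be easier to reason about. The helper below is a hypothetical sketch (the name `truncate_three_digits` is not part of the assignment) that truncates, rather than rounds, using the standard `decimal` module. It assumes non-negative inputs and keeps the integer part whole when it already has three or more digits.

```python
import math
from decimal import Decimal, ROUND_DOWN


def truncate_three_digits(x):
    """Truncate x to its first three digits (e.g. 1.213016 -> 1.21)."""
    if math.isnan(x) or math.isinf(x):
        return x                      # leave missing/infinite values untouched
    d = Decimal(repr(float(x)))       # exact decimal view of the printed value
    if abs(d) < 10:
        step = Decimal('0.01')        # one integer digit: keep two decimals
    elif abs(d) < 100:
        step = Decimal('0.1')         # two integer digits: keep one decimal
    else:
        step = Decimal('1')           # three or more: keep the integer part
    return float(d.quantize(step, rounding=ROUND_DOWN))


samples = [1.213016, 0.133222, 12.56, 5651.724138]
print([truncate_three_digits(v) for v in samples])
```

It could be applied with `us_data['Case_Fatality_Ratio'].map(truncate_three_digits)` in place of the lambda.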


## Question 12 (10 points)

Find the number of samples where ```Case_Fatality_Ratio_short``` is not equal to ```Case_Fatality_Ratio_calculated```. Remember to drop the missing values that appear in these two columns before counting the sample size.

In [19]:
### Q12
us_data.dropna(subset=['Case_Fatality_Ratio_short', 'Case_Fatality_Ratio_calculated'], inplace=True)
mismatch = us_data['Case_Fatality_Ratio_short'] != us_data['Case_Fatality_Ratio_calculated']
us_data_error = us_data[mismatch].copy()  # Store the errors in a separate dataframe to use for Q13 and Q14

print(us_data_error.shape[0])


202


## Question 13 (10 points)

Here we define a new concept, ```acceptable percentage error```, to measure how large the error is. It is computed as the absolute value of the difference between the calculated value and the originally stored value (in three digits), divided by the calculated value, as a percent, i.e., 100 * abs(original - calculated)/calculated.

Compute this acceptable percentage error, add it as a new column of the data frame, and group this continuous acceptable percentage error into discrete bins ([0,0.5], (0.5,1], (1,50], (50,100]) to generate a new categorical object. Note that the lowest number 0 is included in the first bin. Check the resulting distribution, i.e., how many samples fall into each bin, using the ```value_counts()``` method.

In [22]:
### Q13
us_data_error['Acceptable_Percent_Error'] = np.where(us_data_error['Case_Fatality_Ratio_calculated']!=0, 100*(abs(us_data_error['Case_Fatality_Ratio_short'] - us_data_error['Case_Fatality_Ratio_calculated'])/us_data_error['Case_Fatality_Ratio_calculated']), 0)

us_data_error['categorical'] = pd.cut(us_data_error['Acceptable_Percent_Error'], bins=[0,0.5,1,50,100], include_lowest=True, labels=['[0, 0.5]', ']0.5, 1]', ']1, 50]', ']50, 100]'])
# include_lowest=True to include 0 in the first bin
# Chose to store the catg. value in the dataframe, since Q14 asks to use map() specifically on the generated catg. variable

display(us_data_error['categorical'].value_counts(sort=False))


categorical
[0, 0.5]       9
]0.5, 1]     122
]1, 50]       71
]50, 100]      0
Name: count, dtype: int64
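
To see what `include_lowest=True` changes, here is a small sketch of `pd.cut` on made-up error values (the `errors` series is an assumption for illustration). Without the flag, a value of exactly 0 falls outside the first bin and becomes missing.

```python
import pandas as pd

# Made-up percentage errors, including the boundary values 0, 0.5 and 1.
errors = pd.Series([0.0, 0.5, 0.7, 1.0, 12.0, 75.0])
edges = [0, 0.5, 1, 50, 100]

binned = pd.cut(errors, bins=edges, include_lowest=True)
counts = binned.value_counts(sort=False)          # counts in bin order

# Same call without include_lowest: the exact 0 is not assigned to any bin.
missing_without_flag = int(pd.cut(errors, bins=edges).isna().sum())

print(counts.tolist(), missing_without_flag)
```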

## Question 14 (10 points)

Use the ```map()``` method to perform element-wise transformation on the generated categorical object and create a new series, according to the following rules:

- if error is in range [0, 0.5] or (0.5, 1], transform as 'Accept'
- if error is in range (1, 50] or (50, 100], transform as 'Reject'
- if error is missing, transform as 'Missing'

Use ```value_counts()``` to check the counts for these three types.

In [21]:
### Q14
def categorical_transform(x):
    if x in ['[0, 0.5]', ']0.5, 1]']:
        return 'Accept'
    elif x in [']1, 50]', ']50, 100]']:
        return 'Reject'
    elif pd.isna(x):   # pd.isna() is safe for any input type; np.isnan() raises TypeError on strings
        return 'Missing'
error_series = us_data_error['categorical'].map(categorical_transform)
display(error_series.value_counts())

categorical
Accept    131
Reject     71
Name: count, dtype: int64
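
An alternative sketch of the same rules uses a dictionary with `map()` and then fills unmatched entries with `'Missing'`. The data and the `(0.5, 1]`-style labels below are made up for this example and differ from the labels used in the cells above.

```python
import numpy as np
import pandas as pd

# Made-up errors, including one missing value.
errors = pd.Series([0.2, 0.8, 30.0, np.nan])
binned = pd.cut(errors, bins=[0, 0.5, 1, 50, 100], include_lowest=True,
                labels=['[0, 0.5]', '(0.5, 1]', '(1, 50]', '(50, 100]'])

rules = {'[0, 0.5]': 'Accept', '(0.5, 1]': 'Accept',
         '(1, 50]': 'Reject', '(50, 100]': 'Reject'}

# Convert to plain objects first so that fillna can add a label
# that is not one of the original categories.
verdict = binned.astype(object).map(rules).fillna('Missing')
print(verdict.value_counts())
```

With this approach the `'Missing'` case is handled by `fillna` rather than inside the mapping function.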