## Assignment 2 Data Analysis using Pandas

This assignment contains 14 questions, detailed below. The due date is Sunday, October 8, 2023, 23:59. Each late day results in a loss of 20% of the total points.

The 'Daily reports (csse_covid_19_daily_reports)' file contains the daily case report for 01-01-2023 (MM-DD-YYYY). All timestamps are in UTC (GMT+0). A more detailed description can be found in the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19).

References:

- Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1
- Additional Information about the Visual Dashboard: https://systems.jhu.edu/research/public-health/ncov/
- Miller, Meg. "2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository: Johns Hopkins University Center for Systems Science and Engineering." Bulletin-Association of Canadian Map Libraries and Archives (ACMLA) 164 (2020): 47-51.

The field/feature/column names are described as follows:

- FIPS: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.

- Admin2: County name. US only.

- Province_State: Province, state or dependency name.

- Country_Region: Country, region or sovereignty name. The names of locations included on the Website correspond with the official designations used by the U.S. Department of State.

- Last Update: MM/DD/YYYY HH:mm:ss (24 hour format, in UTC).

- Lat and Long: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.

- Confirmed: Counts include confirmed and probable (where reported).

- Deaths: Counts include confirmed and probable (where reported).

- Recovered: Recovered cases are estimates based on local media reports, and state and local reporting when available, and therefore may be substantially lower than the true number. US state-level recovered cases are from the COVID Tracking Project. The recovered cases are no longer maintained.

- Active: Active cases = total cases - total recovered - total deaths. This value is for reference only, since recovered cases are no longer reported.

- Incident_Rate: Incidence Rate = cases per 100,000 persons.

- Case_Fatality_Ratio (%): Case-Fatality Ratio (%) = 100 * Number recorded deaths / Number cases.

- All cases, deaths, and recoveries reported are based on the date of initial report.


Note: Please download the dataset "01-01-2023.csv" from Moodle to your local path before performing the analysis, as the original data has been modified to suit the needs of this assignment.

In [1]:
import pandas as pd
import numpy as np

## Question 1 (5 points)

Now you need to use ```pandas``` to read the downloaded file from your local path.

**Print the column names, and also print a general description of it by using ```.describe()``` function.**

In [2]:
### Q1

# Reading the file
reportDF = pd.read_csv("01-01-2023.csv")

# Printing out dataframe
reportDF

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2023-01-02 04:20:57,33.939110,67.709953,207616,7849,,,Afghanistan,533.328662,3.780537
1,,,,Albania,2023-01-02 04:20:57,41.153300,20.168300,333811,3595,,,Albania,11599.520467,1.076957
2,,,,Algeria,2023-01-02 04:20:57,28.033900,1.659600,271229,6881,,,Algeria,618.523486,2.536971
3,,,,Andorra,2023-01-02 04:20:57,42.506300,1.521800,47751,165,,,Andorra,61801.591924,0.345543
4,,,,Angola,2023-01-02 04:20:57,-11.202700,17.873900,105095,1930,,,Angola,319.765542,1.836434
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4011,,,,West Bank and Gaza,2023-01-02 04:20:57,31.952200,35.233200,703228,5708,,,West Bank and Gaza,13784.956961,0.811686
4012,,,,Winter Olympics 2022,2023-01-02 04:20:57,39.904200,116.407400,535,0,,,Winter Olympics 2022,,0.000000
4013,,,,Yemen,2023-01-02 04:20:57,15.552727,48.516388,11945,2159,,,Yemen,40.048994,18.074508
4014,,,,Zambia,2023-01-02 04:20:57,-13.133897,27.849332,334629,4024,,,Zambia,1820.223025,1.202526


In [3]:
# Printing the column names
print(reportDF.columns)

Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
       'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'Combined_Key', 'Incident_Rate', 'Case_Fatality_Ratio'],
      dtype='object')


In [4]:
# Printing general description
print(reportDF.describe())

               FIPS          Lat        Long_     Confirmed         Deaths  \
count   3268.000000  3925.000000  3925.000000  4.016000e+03    4016.000000   
mean   32405.943390    35.736183   -71.109728  1.645364e+05    1666.937749   
std    18056.381177    13.441327    55.361480  1.045288e+06    8702.992446   
min       60.000000   -71.949900  -178.116500  0.000000e+00       0.000000   
25%    19048.500000    33.191535   -96.595639  3.721250e+03      46.000000   
50%    30068.000000    37.895700   -86.717326  1.050600e+04     130.500000   
75%    47041.500000    42.176955   -77.357900  4.577075e+04     465.250000   
max    99999.000000    71.706900   178.065000  3.826700e+07  183247.000000   

       Recovered  Active  Incident_Rate  Case_Fatality_Ratio  
count        0.0     0.0    3922.000000          3974.000000  
mean         NaN     NaN   27690.256958             3.396189  
std          NaN     NaN   10386.943044            93.482132  
min          NaN     NaN       0.000000      

## Question 2  (10 points)

Meanwhile, the data contains a few errors that need to be resolved:

- the ```Long``` column is mistakenly encoded as ```Long_```
- the ```Recovered``` column contains mostly missing values and needs to be deleted
- the ```Active``` column contains mostly missing values and needs to be deleted
- the ```Incident_Rate``` column is miscalculated: its original value was multiplied by 100

In [5]:
### Q2
# Renaming "Long_" to "Long"
reportDF.rename(columns={"Long_": "Long"}, inplace = True)

# Deleting the columns "Recovered" and "Active"
reportDF.drop(["Recovered", "Active"], axis = 1, inplace = True)

# dividing "Incident_Rate" by 100
reportDF["Incident_Rate"] = reportDF["Incident_Rate"] / 100

# printing transformed dataframe
reportDF

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2023-01-02 04:20:57,33.939110,67.709953,207616,7849,Afghanistan,5.333287,3.780537
1,,,,Albania,2023-01-02 04:20:57,41.153300,20.168300,333811,3595,Albania,115.995205,1.076957
2,,,,Algeria,2023-01-02 04:20:57,28.033900,1.659600,271229,6881,Algeria,6.185235,2.536971
3,,,,Andorra,2023-01-02 04:20:57,42.506300,1.521800,47751,165,Andorra,618.015919,0.345543
4,,,,Angola,2023-01-02 04:20:57,-11.202700,17.873900,105095,1930,Angola,3.197655,1.836434
...,...,...,...,...,...,...,...,...,...,...,...,...
4011,,,,West Bank and Gaza,2023-01-02 04:20:57,31.952200,35.233200,703228,5708,West Bank and Gaza,137.849570,0.811686
4012,,,,Winter Olympics 2022,2023-01-02 04:20:57,39.904200,116.407400,535,0,Winter Olympics 2022,,0.000000
4013,,,,Yemen,2023-01-02 04:20:57,15.552727,48.516388,11945,2159,Yemen,0.400490,18.074508
4014,,,,Zambia,2023-01-02 04:20:57,-13.133897,27.849332,334629,4024,Zambia,18.202230,1.202526


## Question 3  (5 points)

The column ```Last_Update``` contains some timestamps that are not from the year 2023. Find those rows and delete them.

**The updated dataframe should have only rows with timestamp in 2023.**

Hint: use value_counts() to count unique values first.

In [6]:
# Looking at the information in the "Last_Update" column
reportDF["Last_Update"].value_counts()

2023-01-02 04:20:57    4002
2020-12-21 13:27:30       5
2022-11-22 23:21:06       2
2020-08-04 02:27:56       2
2022-10-21 23:21:56       1
2022-09-12 23:21:04       1
2020-08-07 22:34:20       1
2021-10-10 23:21:42       1
2021-07-31 23:21:38       1
Name: Last_Update, dtype: int64

In [7]:
### Q3
# Keeping only the rows whose "Last_Update" string starts with the year "2023"
reports_2023 = reportDF[reportDF["Last_Update"].str[:4] == "2023"]

# Printing out the transformed dataframe
reports_2023

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2023-01-02 04:20:57,33.939110,67.709953,207616,7849,Afghanistan,5.333287,3.780537
1,,,,Albania,2023-01-02 04:20:57,41.153300,20.168300,333811,3595,Albania,115.995205,1.076957
2,,,,Algeria,2023-01-02 04:20:57,28.033900,1.659600,271229,6881,Algeria,6.185235,2.536971
3,,,,Andorra,2023-01-02 04:20:57,42.506300,1.521800,47751,165,Andorra,618.015919,0.345543
4,,,,Angola,2023-01-02 04:20:57,-11.202700,17.873900,105095,1930,Angola,3.197655,1.836434
...,...,...,...,...,...,...,...,...,...,...,...,...
4011,,,,West Bank and Gaza,2023-01-02 04:20:57,31.952200,35.233200,703228,5708,West Bank and Gaza,137.849570,0.811686
4012,,,,Winter Olympics 2022,2023-01-02 04:20:57,39.904200,116.407400,535,0,Winter Olympics 2022,,0.000000
4013,,,,Yemen,2023-01-02 04:20:57,15.552727,48.516388,11945,2159,Yemen,0.400490,18.074508
4014,,,,Zambia,2023-01-02 04:20:57,-13.133897,27.849332,334629,4024,Zambia,18.202230,1.202526


In [8]:
# Checking if everything worked
reports_2023["Last_Update"].value_counts()

2023-01-02 04:20:57    4002
Name: Last_Update, dtype: int64
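As an alternative to slicing the timestamp string, the year can be read from parsed datetimes. The following is only a minimal sketch on made-up rows (the frame `toy` and its values are assumptions for illustration, not part of the dataset); it relies on `pd.to_datetime` and the `.dt.year` accessor.

```python
import pandas as pd

# Hypothetical miniature frame; timestamps copy the format seen in "Last_Update"
toy = pd.DataFrame({
    "Country_Region": ["A", "B", "C"],
    "Last_Update": [
        "2023-01-02 04:20:57",
        "2020-12-21 13:27:30",
        "2022-11-22 23:21:06",
    ],
})

# Parse the strings into datetimes, then compare the year component
parsed = pd.to_datetime(toy["Last_Update"])
toy_2023 = toy[parsed.dt.year == 2023]
print(toy_2023)
```

Parsing makes the filter independent of the exact string layout, at the cost of one extra conversion step.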

**Note**: From here onwards, I work only with the 2023 dataframe (```reports_2023```).

## Question 4  (5 points)

There are two provinces/states that have the same latitude (```Lat```) 52.939900. Print out these two provinces/states.

In [9]:
### Q4
# Printing all the data where "Lat" = 52.939900
reports_2023[reports_2023["Lat"] == 52.939900]

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio
89,,,Quebec,Canada,2023-01-02 04:20:57,52.9399,-73.5491,1284873,17692,"Quebec, Canada",150.494502,1.376945
91,,,Saskatchewan,Canada,2023-01-02 04:20:57,52.9399,-106.4509,150942,1781,"Saskatchewan, Canada",127.736602,1.179923


In [10]:
# Printing out only the Provinces/States
reports_2023[reports_2023["Lat"] == 52.939900]["Province_State"]

89          Quebec
91    Saskatchewan
Name: Province_State, dtype: object

## Question 5  (5 points)

Show the average ```Confirmed``` number of all regions. Show also the median ```Deaths``` number per county of the US.

**Note**: The solution depends on how the question is interpreted.

- Option 1: Aggregate directly over all entries (the entire population).
- Option 2: First sum within each subsample (region or county), then apply the aggregation function to those sums.

As a toy example, suppose region A has the entries 10, 20, 80 and region B has the entries 40, 50, 100. The mean over the entire population (option 1) is (10 + 20 + 80 + 40 + 50 + 100) / 6 = 50. If we instead build the subsamples first (option 2), the mean changes: the total number of confirmed cases in region A is 10 + 20 + 80 = 110 and in region B is 40 + 50 + 100 = 190, giving a mean of (110 + 190) / 2 = 150.

The same logic applies to the median: option 1 gives 45; option 2 gives 150.

As the question can be read both ways, I implemented both. The comment above each code cell states which version is computed.
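The toy numbers in this note can be reproduced with a few lines of pandas. This is a sketch only; the column names `region` and `value` are made up for the illustration.

```python
import pandas as pd

# Six toy entries: 10, 20, 80 belong to region A and 40, 50, 100 to region B
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 20, 80, 40, 50, 100],
})

# Option 1: aggregate over all entries
mean_all = toy["value"].mean()
median_all = toy["value"].median()

# Option 2: sum per region first, then aggregate the region totals
totals = toy.groupby("region")["value"].sum()
mean_of_totals = totals.mean()
median_of_totals = totals.median()

print(mean_all, mean_of_totals)      # 50.0 150.0
print(median_all, median_of_totals)  # 45.0 150.0
```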

In [11]:
### Q5
# Mean Option1: Average over all entries
reports_2023["Confirmed"].mean()

165106.29985007495

In [12]:
# Mean Option2: Average of all regions
reports_2023.groupby("Country_Region")["Confirmed"].sum().mean()

3287340.3582089553

In [13]:
# Median Option1: Median over all entries within US
reports_2023[reports_2023["Country_Region"] == "US"]["Deaths"].median()

103.0

In [14]:
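# Median Option2: Median of the per-county death totals
# (caution: county names such as "Washington" repeat across states, so grouping by "Admin2" alone merges same-named counties)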
reports_2023[reports_2023["Country_Region"] == "US"].groupby("Admin2")["Deaths"].sum().median()

139.0
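Grouping by county name alone can merge counties that share a name in different states. The sketch below uses invented rows and death counts (assumptions for illustration only) to show how grouping by state and county together keeps them apart.

```python
import pandas as pd

# Invented rows: two different counties happen to share the name "Washington"
toy = pd.DataFrame({
    "Province_State": ["Alabama", "Ohio", "Alabama"],
    "Admin2": ["Washington", "Washington", "Autauga"],
    "Deaths": [10, 30, 20],
})

# Grouping by name only merges the two "Washington" rows into one total of 40
by_name = toy.groupby("Admin2")["Deaths"].sum()
median_by_name = by_name.median()

# Grouping by state and county keeps three separate counties: 10, 20, 30
by_state_and_county = toy.groupby(["Province_State", "Admin2"])["Deaths"].sum()
median_by_state_and_county = by_state_and_county.median()

print(median_by_name, median_by_state_and_county)  # 30.0 20.0
```

Whether this changes the result on the real data depends on how many county names repeat; I did not rerun the cell with this grouping.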

## Question 6 (5 points)

Show the difference in the average ```Deaths``` number between Alabama (US) and Wyoming (US).

In [15]:
### Q6
# Filtering for Alabama and computing the Deaths mean
Avg_death_Alabama = reports_2023[reports_2023["Province_State"] == "Alabama"]["Deaths"].mean()
print(Avg_death_Alabama)

309.5074626865672


In [16]:
# Filtering for Wyoming and computing the Deaths mean
Avg_death_Wyoming = reports_2023[reports_2023["Province_State"] == "Wyoming"]["Deaths"].mean()
print(Avg_death_Wyoming)

81.58333333333333


In [17]:
# Computing and printing out the difference in Deaths mean
print(Avg_death_Alabama - Avg_death_Wyoming)

227.92412935323387


## Question 7 (10 points)

Find the ```Province_State``` and ```Country_Region``` where the ```Deaths``` number reaches its maximum and its second maximum.

In [18]:
### Q7
# getting the two largest values of Deaths and putting it into a list
two_largest_Death = list(reports_2023["Deaths"].nlargest(2))

# Printing out the data where "Deaths" are either maximum or second maximum
reports_2023[reports_2023["Deaths"].isin(two_largest_Death)]

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio
66,,,Sao Paulo,Brazil,2023-01-02 04:20:57,-23.5505,-46.6333,6315333,177411,"Sao Paulo, Brazil",137.531877,2.809211
3992,,,England,United Kingdom,2023-01-02 04:20:57,52.3555,-1.1743,20392339,183247,"England, United Kingdom",364.297232,0.885833


In [19]:
# Printing out only the "Province_State" and "Country_Region" where "Deaths" are either maximum or second maximum
reports_2023[reports_2023["Deaths"].isin(two_largest_Death)][["Province_State", "Country_Region", "Deaths"]]

Unnamed: 0,Province_State,Country_Region,Deaths
66,Sao Paulo,Brazil,177411
3992,England,United Kingdom,183247
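The same rows can also be selected in one step with `DataFrame.nlargest`, which returns them already sorted from largest to smallest. A minimal sketch on invented data (names and numbers are placeholders):

```python
import pandas as pd

# Invented rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "Province_State": ["P1", "P2", "P3", "P4"],
    "Country_Region": ["X", "X", "Y", "Z"],
    "Deaths": [5, 40, 12, 31],
})

# Two rows with the highest "Deaths", largest first
top_two = toy.nlargest(2, "Deaths")[["Province_State", "Country_Region", "Deaths"]]
print(top_two)
```

Unlike the `isin` approach, this returns exactly two rows even if several rows tie for the second-largest value; which behaviour is preferable depends on how ties should be handled.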


## Question 8 (10 points)

Build a subset dataframe for samples collected from the US. **Use the values in column ```Combined_Key``` to create a new column** ```Province_State_recovered``` that contains only the province, state or dependency name. The county name and the country, region or sovereignty name should be omitted.

# ***Note: From this question, please complete ALL the following data curation tasks with the U.S. subset dataframe.***

In [20]:
### Q8
# Filtering "Country_Region" on only the US
reports_2023_US = reports_2023[reports_2023["Country_Region"] == "US"]

# Looking at the dataset
print(reports_2023_US.head(5))

       FIPS   Admin2 Province_State Country_Region          Last_Update  \
678  1001.0  Autauga        Alabama             US  2023-01-02 04:20:57   
679  1003.0  Baldwin        Alabama             US  2023-01-02 04:20:57   
680  1005.0  Barbour        Alabama             US  2023-01-02 04:20:57   
681  1007.0     Bibb        Alabama             US  2023-01-02 04:20:57   
682  1009.0   Blount        Alabama             US  2023-01-02 04:20:57   

           Lat       Long  Confirmed  Deaths          Combined_Key  \
678  32.539527 -86.644082      18961     230  Autauga, Alabama, US   
679  30.727750 -87.722071      67496     719  Baldwin, Alabama, US   
680  31.868263 -85.387129       7027     103  Barbour, Alabama, US   
681  32.996421 -87.125115       7692     108     Bibb, Alabama, US   
682  33.982109 -86.567906      17731     260   Blount, Alabama, US   

     Incident_Rate  Case_Fatality_Ratio  
678     339.383200             1.213016  
679     302.355376             1.065248  
68

In [21]:
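# Creating a copy so that the new column is added to an independent dataframe, not to a slice of reports_2023
US_2023_DF = reports_2023_US.copy() # Moritz gave me the hint of copying the dataset

# Defining "Province_State_recovered": the second comma-separated part of "Combined_Key",
# stripped of the leading space that remains after splitting "County, State, US"
US_2023_DF["Province_State_recovered"] = US_2023_DF["Combined_Key"].str.split(',').str[1].str.strip()
US_2023_DF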

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio,Province_State_recovered
678,1001.0,Autauga,Alabama,US,2023-01-02 04:20:57,32.539527,-86.644082,18961,230,"Autauga, Alabama, US",339.383200,1.213016,Alabama
679,1003.0,Baldwin,Alabama,US,2023-01-02 04:20:57,30.727750,-87.722071,67496,719,"Baldwin, Alabama, US",302.355376,1.065248,Alabama
680,1005.0,Barbour,Alabama,US,2023-01-02 04:20:57,31.868263,-85.387129,7027,103,"Barbour, Alabama, US",284.655270,1.465775,Alabama
681,1007.0,Bibb,Alabama,US,2023-01-02 04:20:57,32.996421,-87.125115,7692,108,"Bibb, Alabama, US",343.484862,1.404056,Alabama
682,1009.0,Blount,Alabama,US,2023-01-02 04:20:57,33.982109,-86.567906,17731,260,"Blount, Alabama, US",306.626777,1.466358,Alabama
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3952,56039.0,Teton,Wyoming,US,2023-01-02 04:20:57,43.935225,-110.589080,12010,16,"Teton, Wyoming, US",511.847937,0.133222,Wyoming
3953,56041.0,Uinta,Wyoming,US,2023-01-02 04:20:57,41.287818,-110.547578,6305,43,"Uinta, Wyoming, US",311.727479,0.681998,Wyoming
3954,90056.0,Unassigned,Wyoming,US,2023-01-02 04:20:57,,,0,0,"Unassigned, Wyoming, US",,,Wyoming
3955,56043.0,Washakie,Wyoming,US,2023-01-02 04:20:57,43.904516,-107.680187,2722,47,"Washakie, Wyoming, US",348.750801,1.726672,Wyoming
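Splitting ```Combined_Key``` on "," alone leaves a leading space in every part after the first, and taking the part at position 1 assumes that every key has three parts. The sketch below shows a variant that counts from the right and strips whitespace. The two-part key "Grand Princess, US" is an assumption added to illustrate the edge case; I did not check whether such keys occur in the file.

```python
import pandas as pd

# The first two keys follow the "County, State, US" pattern shown above;
# the last one is a hypothetical key without a county part
keys = pd.Series([
    "Autauga, Alabama, US",
    "Teton, Wyoming, US",
    "Grand Princess, US",
])

parts = keys.str.split(",")

# Position 1 from the left: keeps the leading space and returns " US" for the two-part key
naive = parts.str[1]

# Second part from the right, with surrounding whitespace removed
state = parts.str[-2].str.strip()

print(list(naive))  # [' Alabama', ' Wyoming', ' US']
print(list(state))  # ['Alabama', 'Wyoming', 'Grand Princess']
```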


## Question 9 (5 points)

Compute the correlation between ```Confirmed, Deaths, Incident_Rate, Case_Fatality_Ratio```. What do you observe?

In [22]:
### Q9
US_2023_DF[["Confirmed", "Deaths", "Incident_Rate", "Case_Fatality_Ratio"]].corr()

| | Confirmed | Deaths | Incident_Rate | Case_Fatality_Ratio |
|---|---|---|---|---|
| Confirmed | 1.0 | 0.960176 | 0.054209 | -0.007169 |
| Deaths | 0.960176 | 1.0 | 0.048942 | 0.077919 |
| Incident_Rate | 0.054209 | 0.048942 | 1.0 | -0.256701 |
| Case_Fatality_Ratio | -0.007169 | 0.077919 | -0.256701 | 1.0 |


**What do I observe** 

General interpretation scheme:
- abs(correlation) in [0, 0.2) --> very weakly correlated
- abs(correlation) in [0.2, 0.4) --> weakly correlated
- abs(correlation) in [0.4, 0.6) --> moderately correlated
- abs(correlation) in [0.6, 0.8) --> strongly correlated
- abs(correlation) in [0.8, 1] --> very strongly correlated

Correlation between a series and itself is always equal to 1. Thus, those entries do not need any additional interpretation.

Important note: Correlation does *not* imply causation!

/-----------------------------------------------------------------------------------------------------------------/

Correlation between:
- Confirmed and Deaths:
    The correlation is approx. 0.96, meaning that they are *very strongly* correlated.
    This implies that if # Confirmed goes up, there is a *very strong* tendency that # Deaths also goes up,
    and vice versa.
- Confirmed and Incident_Rate:
    The correlation is approx. 0.05, meaning that they are *very weakly* correlated.
    This implies that if # Confirmed goes up, there is a *very weak* tendency that the Incident_Rate also goes up,
    and vice versa.
- Confirmed and Case_Fatality_Ratio:
    The correlation is approx. -0.01, meaning that they are *very weakly* negatively correlated.
    This implies that if # Confirmed goes up, there is a *very weak* tendency that the Case_Fatality_Ratio goes
    down, and vice versa.
- Deaths and Incident_Rate:
    The correlation is approx. 0.05, meaning that they are *very weakly* correlated.
    This implies that if # Deaths goes up, there is a *very weak* tendency that the Incident_Rate also goes up,
    and vice versa.
- Deaths and Case_Fatality_Ratio:
    The correlation is approx. 0.08, meaning that they are *very weakly* correlated.
    This implies that if # Deaths goes up, there is a *very weak* tendency that the Case_Fatality_Ratio also
    goes up, and vice versa.
- Incident_Rate and Case_Fatality_Ratio:
    The correlation is approx. -0.26, meaning that they are *weakly* negatively correlated.
    This implies that if the Incident_Rate goes up, there is a *weak* tendency that the Case_Fatality_Ratio goes
    down, and vice versa.

*In conclusion*: The only notable correlation here is between "Confirmed" and "Deaths". The others are weak at best.
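The interpretation scheme can also be written as a small helper, so that the labels are assigned consistently. This is a sketch; the function name `correlation_strength` is my own, and the thresholds are the ones from the scheme (with the top bin ending at 1).

```python
def correlation_strength(r):
    """Map a correlation coefficient to the verbal scale used in the scheme."""
    a = abs(r)
    if a < 0.2:
        return "very weak"
    if a < 0.4:
        return "weak"
    if a < 0.6:
        return "moderate"
    if a < 0.8:
        return "strong"
    return "very strong"


# Coefficients copied from the correlation output of Q9
observed = {
    ("Confirmed", "Deaths"): 0.960176,
    ("Confirmed", "Incident_Rate"): 0.054209,
    ("Incident_Rate", "Case_Fatality_Ratio"): -0.256701,
}

for pair, r in observed.items():
    print(pair, correlation_strength(r))
```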


## Question 10 (5 points)

Find the number of miscalculated samples, i.e., samples where the ```Case_Fatality_Ratio``` (%) is not equal to 100 * Deaths / Confirmed. Note that you also need to make sure that ```Confirmed```, as the denominator, is not zero.

In [23]:
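### Q10
# Counting the rows with non-zero "Confirmed" whose stored ratio differs from 100 * Deaths / Confirmed.
# Note: this is an exact floating-point comparison, and NaN != x is True, so rows with a missing ratio are counted too.
US_2023_DF[(US_2023_DF["Case_Fatality_Ratio"] != (100 * (US_2023_DF["Deaths"] / US_2023_DF["Confirmed"]))) \
           & (US_2023_DF["Confirmed"] != 0)].shape[0]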

1370
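An exact `!=` between floats also flags values that differ only in the last decimals. The sketch below contrasts the exact comparison with a tolerance-based one using `np.isclose`. The first row uses the Autauga numbers shown earlier; the other rows and the tolerance `atol=1e-6` are assumptions for illustration. I did not rerun the count on the real data with a tolerance.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Confirmed": [18961, 0, 3],
    "Deaths": [230, 0, 1],
    "Case_Fatality_Ratio": [1.213016, np.nan, 33.333333],
})

# Exclude rows with a zero denominator before dividing
valid = toy[toy["Confirmed"] != 0]
computed = 100 * valid["Deaths"] / valid["Confirmed"]

# Exact comparison: stored values are rounded to 6 decimals, so both rows "differ"
exact_mismatch = int((valid["Case_Fatality_Ratio"] != computed).sum())

# Tolerant comparison: differences below the tolerance count as equal
close = np.isclose(valid["Case_Fatality_Ratio"], computed, atol=1e-6)
tolerant_mismatch = int((~close).sum())

print(exact_mismatch, tolerant_mismatch)  # 2 0
```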

## Question 11 (5 points)

Create a new column ```Case_Fatality_Ratio_short``` to extract and store the first three digits of the original values.
Create a new column ```Case_Fatality_Ratio_calculated``` and compute Case-Fatality Ratio(%) by yourself. Store the first three digits of the computed values as well.

Note that Case-Fatality Ratio(%) = 100 * Number recorded deaths / Number cases.

In [24]:
### Q11
# Part 1: creating "Case_Fatality_Ratio_short"

# Converting "Case_Fatality_Ratio" to string and keeping the first 4 characters.
US_2023_DF["Case_Fatality_Ratio_short"] = US_2023_DF["Case_Fatality_Ratio"].astype(str).str[:4]

# Converting "Case_Fatality_Ratio_short" back to type float
US_2023_DF["Case_Fatality_Ratio_short"] = US_2023_DF["Case_Fatality_Ratio_short"].astype(float)

# Looking at the output
US_2023_DF["Case_Fatality_Ratio_short"]

678     1.21
679     1.06
680     1.46
681     1.40
682     1.46
        ... 
3952    0.13
3953    0.68
3954     NaN
3955    1.72
3956    1.17
Name: Case_Fatality_Ratio_short, Length: 3269, dtype: float64

**Note**: To store the first three digits of the original values, the column is converted to type "string" and the first 4 characters of each entry are extracted. Four characters are needed because "." also counts as a character, so this yields the first three digits. As discussed, the two cases with more than one digit before the "." can be ignored.
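For values with more than one digit before the ".", taking the first 4 characters keeps fewer decimals. The sketch below compares the string slicing with a numeric truncation to two decimals. Note that the numeric variant is a different reading of "first three digits" (it always keeps two decimals) and is shown only for comparison; the value 12.345678 is made up.

```python
import numpy as np
import pandas as pd

vals = pd.Series([1.213016, 0.133222, 12.345678])

# String slicing as in the cell above: first 4 characters, then back to float
by_string = vals.astype(str).str[:4].astype(float)

# Numeric truncation to two decimals (can be off by 0.01 for values that are
# not exactly representable as binary floats)
by_floor = np.floor(vals * 100) / 100

print(list(by_string))  # [1.21, 0.13, 12.3]
print(list(by_floor))   # [1.21, 0.13, 12.34]
```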

In [25]:
# Part 2: creating "Case_Fatality_Ratio_calculated"

# Calculating the correct ratio and storing it in a new column
US_2023_DF["Case_Fatality_Ratio_calculated"] = (100 * US_2023_DF["Deaths"] / US_2023_DF["Confirmed"])

# Converting "Case_Fatality_Ratio_calculated" to string and keeping the first 4 characters.
US_2023_DF["Case_Fatality_Ratio_calculated_short"] = US_2023_DF["Case_Fatality_Ratio_calculated"].astype(str).str[:4]

# Converting "Case_Fatality_Ratio_calculated_short" back to type float
US_2023_DF["Case_Fatality_Ratio_calculated_short"] = US_2023_DF["Case_Fatality_Ratio_calculated_short"].astype(float)

# Looking at the output
US_2023_DF["Case_Fatality_Ratio_calculated_short"]

678     1.21
679     1.06
680     1.46
681     1.40
682     1.46
        ... 
3952    0.13
3953    0.68
3954     NaN
3955    1.72
3956    1.17
Name: Case_Fatality_Ratio_calculated_short, Length: 3269, dtype: float64

## Question 12 (10 points)

Find the number of samples where ```Case_Fatality_Ratio_short``` is not equal to ```Case_Fatality_Ratio_calculated```. Remember to drop the missing values in these two columns before counting the samples.

In [26]:
### Q12
# Dropping the missing values
US_2023_DF.dropna(subset=["Case_Fatality_Ratio_calculated_short", "Case_Fatality_Ratio_short"], inplace = True)

# Finding the number of samples when the Case_Fatality_Ratio_short is not equal to Case_Fatality_Ratio_calculated_short
US_2023_DF[US_2023_DF["Case_Fatality_Ratio_short"] != US_2023_DF["Case_Fatality_Ratio_calculated_short"]].shape[0]

202

## Question 13 (10 points)

Here we define a new concept, ```acceptable percentage error```, to measure how large the error is. It is computed as the absolute value of the difference between the calculated value and the originally stored value (in three digits), divided by the calculated value, as a percent, i.e., 100 * abs(original - calculated)/calculated.

Compute this acceptable percentage error, add it as a new column of the data frame, and group this continuous acceptable percentage error into discrete bins ([0,0.5], (0.5,1], (1,50], (50,100]) to generate a new categorical object. Note that the lowest number 0 is included in the first bin. Check the resulting distribution, i.e., how many samples fall into each bin, by ```value_counts()``` method.

In [27]:
### Q13
print(US_2023_DF.shape[0])
# Calculating the acceptable percentage error and storing it in a new column
US_2023_DF["acceptable_prc_error"] = (100 * \
                                      (abs(US_2023_DF["Case_Fatality_Ratio_calculated_short"] - US_2023_DF["Case_Fatality_Ratio_short"])\
                                      / US_2023_DF["Case_Fatality_Ratio_calculated_short"])).fillna(0)

# creating bins
bins = [0, 0.5, 1, 50, 100]

# discretization: include_lowest = True includes the leftmost bin edge, here: 0
cats = pd.cut(US_2023_DF["acceptable_prc_error"], bins, include_lowest = True)
cats.value_counts()

3244


(-0.001, 0.5]    3051
(0.5, 1.0]        122
(1.0, 50.0]        71
(50.0, 100.0]       0
Name: acceptable_prc_error, dtype: int64

**Note** It is also possible to rename the labels (using labels = within the pd.cut statement), such that the output shows [0,0.5], (0.5,1], (1,50], (50,100], instead of (-0.001,0.5], (0.5,1], (1,50], (50,100]. This, however, does not make a difference, because the "include_lowest = True" statement is the important part!
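Renaming the bins as described in this note could look as follows. This is a sketch on invented error values; the label strings are my own choice.

```python
import pandas as pd

# Invented error values, one or two per bin
errors = pd.Series([0.0, 0.4, 0.75, 20.0, 99.0])

bins = [0, 0.5, 1, 50, 100]
labels = ["[0,0.5]", "(0.5,1]", "(1,50]", "(50,100]"]

# include_lowest=True still decides that 0 falls into the first bin;
# labels only changes how the bins are displayed
cats_labeled = pd.cut(errors, bins, include_lowest=True, labels=labels)
counts = cats_labeled.value_counts(sort=False)
print(counts)
```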

## Question 14 (10 points)

Use ```map()``` method to perform element-wise transformation on the generated categorical object and create a new series, according to the following rules:

- if error is in range [0, 0.5] or (0.5, 1], transform as 'Accept'
- if error is in range (1, 50] or (50, 100], transform as 'Reject'
- if error is missing, transform as 'Missing'

Use ```value_counts()``` to check the counts for these three types.

In [28]:
### Q14
# Define a mapping function for element-wise transformation
def map_error_category(error):
    Int1 = pd.Interval(left=0, right=0.5, closed="both")
    Int2 = pd.Interval(left=0.5, right=1, closed="right")
    Int3 = pd.Interval(left=1, right=50, closed="right")
    Int4 = pd.Interval(left=50, right=100, closed="right")

    if Int1.overlaps(error) or Int2.overlaps(error):
        return "Accept"
    elif Int3.overlaps(error) or Int4.overlaps(error):
        return "Reject"
    else:
        return "Missing"

# Apply the mapping function to create a new series
error_category = cats.map(map_error_category)

# Count the samples in each category
print(error_category.value_counts())

Accept    3173
Reject      71
Name: acceptable_prc_error, dtype: int64
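No 'Missing' category appears in this output, which is consistent with the errors having been filled with 0 via `fillna(0)` in Q13 before binning. As a sketch of how the three rules could be expressed with a plain dictionary, including the 'Missing' case, consider the following; the error values and label strings are invented for the illustration.

```python
import numpy as np
import pandas as pd

# Invented errors, including one missing value
errors = pd.Series([0.0, 0.3, 0.8, 12.0, 75.0, np.nan])

labels = ["[0,0.5]", "(0.5,1]", "(1,50]", "(50,100]"]
cats_toy = pd.cut(errors, [0, 0.5, 1, 50, 100], include_lowest=True, labels=labels)

# One entry per bin label
rule = {
    "[0,0.5]": "Accept",
    "(0.5,1]": "Accept",
    "(1,50]": "Reject",
    "(50,100]": "Reject",
}

# astype(object) turns the categorical into plain strings/NaN first;
# NaN has no entry in the dictionary, stays NaN, and is then labelled "Missing"
decision = cats_toy.astype(object).map(rule).fillna("Missing")
decision_counts = decision.value_counts()
print(decision_counts)
```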
