### Import libraries

In [747]:
import numpy as np
import pandas as pd

## Data exploration
1. How many rows (observations) and columns (attributes) does the data have? 
2. Are there any duplicate rows in the data?
3. What do the columns in the data mean?
4. What data type does each column currently have (`.dtypes`), and which columns should be numeric?


### 1. How many rows (observations) and columns (attributes) does the data have for each day?

In [748]:
# Read
df1 = pd.read_csv('./data/covid_3_2_2023.csv')
df1.shape

(239, 21)

#### Because we need data for several days (2023-03-02 to 2023-03-06), we combine the daily files into one dataframe.
- Read in the data for each day
- Add a column to the dataframe that holds the date of the snapshot
- Combine the data into one dataframe

In [749]:
# Read all data
df1 = pd.read_csv('./data/covid_3_2_2023.csv')
df2 = pd.read_csv('./data/covid_3_3_2023.csv')
df3 = pd.read_csv('./data/covid_3_4_2023.csv')
df4 = pd.read_csv('./data/covid_3_5_2023.csv')
df5 = pd.read_csv('./data/covid_3_6_2023.csv')

In [750]:
# Add Date column 
def add_date(df, date):
    df['Date'] = pd.to_datetime(date, format='%Y-%m-%d')
    
    # move the date column to second position
    cols = df.columns.tolist()
    cols = cols[:1] + cols[-1:] + cols[1:-1]
    return df[cols]

df1 = add_date(df1, '2023-03-02')
df2 = add_date(df2, '2023-03-03')
df3 = add_date(df3, '2023-03-04')
df4 = add_date(df4, '2023-03-05')
df5 = add_date(df5, '2023-03-06')
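
The read and tag steps above can also be written as one loop. This is a minimal sketch, not the notebook's code: `load_snapshots` and `files_by_date` are hypothetical names, and it assumes every daily file has the same layout.

```python
import pandas as pd

def load_snapshots(files_by_date):
    """Read one CSV per date and tag every row with its snapshot date."""
    frames = []
    for date, path in files_by_date.items():
        frame = pd.read_csv(path)
        # insert() puts Date in the second position, like add_date does
        frame.insert(1, 'Date', pd.to_datetime(date, format='%Y-%m-%d'))
        frames.append(frame)
    return frames

# Hypothetical usage with the file names from this notebook:
# files = {f'2023-03-0{d}': f'./data/covid_3_{d}_2023.csv' for d in range(2, 7)}
# df1, df2, df3, df4, df5 = load_snapshots(files)
```

Using `insert` avoids reordering the column list by hand, at the cost of modifying each frame in place.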

In [751]:
# Concatenate all dataframes
new_df = pd.concat([df1, df2, df3, df4, df5], axis=0, ignore_index=True)

# Sort rows by country name and date
new_df = new_df.sort_values(by=['Country,Other', 'Date'])

In [752]:
# Shape of new dataframe 
new_df.shape

(1195, 22)

In [753]:
# save the new dataframe
new_df.to_csv('./data/combined_data.csv', index=False)
new_df.head(10)

Unnamed: 0,"Country,Other",Date,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",...,TotalTests,Tests/\r\n1M pop,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,New Cases/1M pop,New Deaths/1M pop,Active Cases/1M pop
121,Afghanistan,2023-03-02,209359,19.0,7896,,191166,10.0,10297,45,...,1201475.0,29481.0,40754388.0,Asia,195.0,5161.0,34.0,0.5,,253.0
360,Afghanistan,2023-03-03,209362,3.0,7896,,191170,4.0,10296,45,...,1201744.0,29487.0,40754388.0,Asia,195.0,5161.0,34.0,0.07,,253.0
599,Afghanistan,2023-03-04,209370,4.0,7896,,191181,7.0,10293,45,...,1202018.0,29494.0,40754388.0,Asia,195.0,5161.0,34.0,0.1,,253.0
838,Afghanistan,2023-03-05,209390,20.0,7896,,191212,31.0,10282,45,...,1202290.0,29501.0,40754388.0,Asia,195.0,5161.0,34.0,0.5,,252.0
1076,Afghanistan,2023-03-06,209390,,7896,,191212,,10282,45,...,1202290.0,29501.0,40754388.0,Asia,195.0,5161.0,34.0,,,252.0
5,Africa,2023-03-02,12795801,732.0,258590,,12076247,132.0,460964,548,...,,,,Africa,,,,,,
244,Africa,2023-03-03,12796326,525.0,258590,,12076332,85.0,461404,548,...,,,,Africa,,,,,,
483,Africa,2023-03-04,12796404,78.0,258590,,12076449,117.0,461365,548,...,,,,Africa,,,,,,
722,Africa,2023-03-05,12796571,167.0,258593,3.0,12076616,167.0,461362,548,...,,,,Africa,,,,,,
961,Africa,2023-03-06,12796571,,258593,,12076624,8.0,461354,548,...,,,,Africa,,,,,,


### 2. Are there any duplicate rows in the data?

In [754]:
new_df[new_df.duplicated()].shape

(0, 22)

- The data has no duplicate rows
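
`duplicated()` only flags rows that match in every column. Because each country should appear once per date, it may also be worth checking the key columns alone. A small sketch on made-up data (the column names here are illustrative; at this point in the notebook the country column is still called `Country,Other`):

```python
import pandas as pd

demo = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2023-03-02', '2023-03-03', '2023-03-02', '2023-03-02']),
    'TotalCases': [10, 12, 5, 6],
})

# No row is an exact copy of another row
full_duplicates = int(demo.duplicated().sum())

# Country B still appears twice for the same date
key_duplicates = demo.duplicated(subset=['Country', 'Date'], keep=False)
repeated_keys = demo[key_duplicates]
```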

### 3. What do the columns in the data mean?
- The columns in the data are the following:
    - **Country**: Name of the country or territory (the raw data also contains aggregate rows such as Asia and Africa)
    - **Date**: Date of the daily snapshot, added when the file is loaded
    - **TotalCases**: Cumulative number of confirmed cases
    - **NewCases**: Cases newly reported on that day
    - **TotalDeaths**: Cumulative number of deaths
    - **NewDeaths**: Deaths newly reported on that day
    - **TotalRecovered**: Cumulative number of recovered patients
    - **NewRecovered**: Recoveries newly reported on that day
    - **ActiveCases**: Cases that are currently active (neither recovered nor deceased)
    - **Serious,Critical**: Active cases in serious or critical condition
    - **TotalCases/1M pop**: Total cases per one million people
    - **Deaths/1M pop**: Total deaths per one million people
    - **TotalTests**: Cumulative number of tests performed
    - **Tests/1M pop**: Total tests per one million people
    - **Population**: Population of the country
    - **Continent**: Continent the country belongs to
    - **1 Caseevery X ppl**: Number of people per confirmed case (one case for every X people)
    - **1 Deathevery X ppl**: Number of people per death (one death for every X people)
    - **1 Testevery X ppl**: Number of people per test (one test for every X people)
    - **New Cases/1M pop**: New cases per one million people
    - **New Deaths/1M pop**: New deaths per one million people
    - **Active Cases/1M pop**: Active cases per one million people
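
The three "every X ppl" columns appear to be the population divided by the matching total. This relationship is inferred from the numbers rather than documented in the notebook; a quick check with the Afghanistan row for 2023-03-02 from the table above:

```python
# Figures taken from the Afghanistan row for 2023-03-02
population = 40_754_388
total_cases = 209_359
total_deaths = 7_896
total_tests = 1_201_475

people_per_case = round(population / total_cases)    # compare with 1 Caseevery X ppl
people_per_death = round(population / total_deaths)  # compare with 1 Deathevery X ppl
people_per_test = round(population / total_tests)    # compare with 1 Testevery X ppl
```

The results match the 195, 5161 and 34 shown for that row.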

### 4. What data type does each column currently have? 

In [755]:
new_df.dtypes

Country,Other                  object
Date                   datetime64[ns]
TotalCases                     object
NewCases                       object
TotalDeaths                    object
NewDeaths                     float64
TotalRecovered                 object
NewRecovered                   object
ActiveCases                    object
Serious,Critical               object
Tot Cases/1M pop               object
Deaths/1M pop                  object
TotalTests                     object
Tests/\r\n1M pop               object
Population                     object
Continent                      object
1 Caseevery X ppl              object
1 Deathevery X ppl             object
1 Testevery X ppl             float64
New Cases/1M pop               object
New Deaths/1M pop             float64
Active Cases/1M pop            object
dtype: object

**Issues to address in preprocessing:**
- Rename the columns to be more descriptive
- Change the data type of the count columns from object to numeric
- Handle missing values
- Strip formatting characters from values such as +19 or 1,234 so that they parse as numbers
- Delete columns that are dependent attributes.
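
The type conversion and the "+19" issue can be handled in one step. The helper below is a hedged sketch: `to_number` is a hypothetical name, and it assumes the only formatting characters are `+`, `,` and whitespace.

```python
import pandas as pd

def to_number(series):
    """Turn strings such as '+1,234' into numbers; unparseable values become NaN."""
    if pd.api.types.is_numeric_dtype(series):
        return series  # e.g. NewDeaths is already float64
    cleaned = series.str.replace(r'[+,\s]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')

raw = pd.Series(['+19', '1,234', None, ' 7 '])
parsed = to_number(raw)
```

`errors='coerce'` hides bad values as NaN, so it is worth counting the missing values before and after the conversion.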

## Preprocessing
### 1. Rename the columns to be more descriptive

In [756]:
# Rename columns; two raw headers contain a non-breaking space (\xa0) or a line break (\r\n)
new_df = new_df.rename(columns={'Country,Other': 'Country'})
new_df = new_df.rename(columns={'Tot\xa0Cases/1M pop': 'TotalCases/1M pop'})
new_df = new_df.rename(columns={'Tests/\r\n1M pop': 'Tests/1M pop'})
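
Invisible characters such as `\xa0` are easy to miss because they print like an ordinary space. One option, sketched here on an empty frame with the three raw headers, is to normalise all whitespace in the column names before renaming:

```python
import pandas as pd

demo = pd.DataFrame(columns=['Country,Other', 'Tot\xa0Cases/1M pop', 'Tests/\r\n1M pop'])

# Collapse every run of whitespace (including \xa0 and \r\n) into one space
demo.columns = demo.columns.str.replace(r'\s+', ' ', regex=True).str.strip()

demo = demo.rename(columns={
    'Country,Other': 'Country',
    'Tot Cases/1M pop': 'TotalCases/1M pop',
    'Tests/ 1M pop': 'Tests/1M pop',
})
```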

### 2. Handle missing values

In [757]:
new_df.isnull().sum()

Country                   5
Date                      0
TotalCases                0
NewCases                955
TotalDeaths              30
NewDeaths              1063
TotalRecovered          105
NewRecovered            946
ActiveCases              95
Serious,Critical        524
TotalCases/1M pop        45
Deaths/1M pop            75
TotalTests              130
Tests/1M pop            130
Population               50
Continent                15
1 Caseevery X ppl        50
1 Deathevery X ppl       80
1 Testevery X ppl       130
New Cases/1M pop        983
New Deaths/1M pop      1088
Active Cases/1M pop      91
dtype: int64
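
Raw counts are easier to compare as a share of all rows. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'TotalCases': ['10', '12', '5', '6'],
    'NewCases': ['+2', np.nan, np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of missing values per column
missing_share = demo.isnull().mean().sort_values(ascending=False)
```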

#### 2.1. Delete rows with a missing Country value and rows that do not belong to any continent

In [758]:
# delete the rows with null country names
new_df = new_df.dropna(subset=['Country'])
new_df.shape

(1190, 22)

In [759]:
new_df[new_df['Continent'].isnull()]

Unnamed: 0,Country,Date,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",...,TotalTests,Tests/1M pop,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,New Cases/1M pop,New Deaths/1M pop,Active Cases/1M pop
234,Diamond Princess,2023-03-02,712,,13,,699,,0,,...,,,,,,,,,,
473,Diamond Princess,2023-03-03,712,,13,,699,,0,,...,,,,,,,,,,
712,Diamond Princess,2023-03-04,712,,13,,699,,0,,...,,,,,,,,,,
951,Diamond Princess,2023-03-05,712,,13,,699,,0,,...,,,,,,,,,,
1189,Diamond Princess,2023-03-06,712,,13,,699,,0,,...,,,,,,,,,,
237,MS Zaandam,2023-03-02,9,,2,,7,,0,,...,,,,,,,,,,
476,MS Zaandam,2023-03-03,9,,2,,7,,0,,...,,,,,,,,,,
715,MS Zaandam,2023-03-04,9,,2,,7,,0,,...,,,,,,,,,,
954,MS Zaandam,2023-03-05,9,,2,,7,,0,,...,,,,,,,,,,
1192,MS Zaandam,2023-03-06,9,,2,,7,,0,,...,,,,,,,,,,


In [760]:
# 'Diamond Princess' and 'MS Zaandam' are cruise ships, not countries, so remove them
new_df = new_df[~new_df['Country'].str.contains('Diamond Princess')]
new_df = new_df[~new_df['Country'].str.contains('MS Zaandam')]
new_df.shape

(1180, 22)
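
`str.contains` matches substrings, so it would also remove any country whose name merely includes one of these strings. An exact-match variant with `isin`, shown on a toy frame:

```python
import pandas as pd

demo = pd.DataFrame({'Country': ['Japan', 'Diamond Princess', 'MS Zaandam', 'Peru']})

ships = ['Diamond Princess', 'MS Zaandam']
countries_only = demo[~demo['Country'].isin(ships)]
```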

In [761]:
new_df.isnull().sum()

Country                   0
Date                      0
TotalCases                0
NewCases                940
TotalDeaths              30
NewDeaths              1048
TotalRecovered          105
NewRecovered            931
ActiveCases              95
Serious,Critical        514
TotalCases/1M pop        30
Deaths/1M pop            60
TotalTests              115
Tests/1M pop            115
Population               35
Continent                 0
1 Caseevery X ppl        35
1 Deathevery X ppl       65
1 Testevery X ppl       115
New Cases/1M pop        968
New Deaths/1M pop      1073
Active Cases/1M pop      76
dtype: int64

#### 2.2. Delete rows missing population values 

In [762]:
new_df[new_df['Population'].isnull()]

Unnamed: 0,Country,Date,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",...,TotalTests,Tests/1M pop,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,New Cases/1M pop,New Deaths/1M pop,Active Cases/1M pop
5,Africa,2023-03-02,12795801,732.0,258590,,12076247,132.0,460964,548,...,,,,Africa,,,,,,
244,Africa,2023-03-03,12796326,525.0,258590,,12076332,85.0,461404,548,...,,,,Africa,,,,,,
483,Africa,2023-03-04,12796404,78.0,258590,,12076449,117.0,461365,548,...,,,,Africa,,,,,,
722,Africa,2023-03-05,12796571,167.0,258593,3.0,12076616,167.0,461362,548,...,,,,Africa,,,,,,
961,Africa,2023-03-06,12796571,,258593,,12076624,8.0,461354,548,...,,,,Africa,,,,,,
0,Asia,2023-03-02,214400666,36069.0,1536954,177.0,199025596,36820.0,13838116,15470,...,,,,Asia,,,,,,
239,Asia,2023-03-03,214439390,38724.0,1537144,190.0,199055199,29603.0,13847047,15462,...,,,,Asia,,,,,,
478,Asia,2023-03-04,214474748,35043.0,1537331,184.0,199108904,53531.0,13828513,15464,...,,,,Asia,,,,,,
717,Asia,2023-03-05,214505850,31102.0,1537487,156.0,199139502,30598.0,13828861,15506,...,,,,Asia,,,,,,
957,Asia,2023-03-06,214510150,4300.0,1537496,9.0,199139436,,13833218,15511,...,,,,Asia,,,,,,


- These rows are aggregates that total the cases for each region (Asia, Europe, North America, etc.), so we can remove them.

In [763]:
# remove the rows with null population
new_df = new_df[~new_df['Population'].isnull()]
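
Before dropping, it can help to list which names are affected, in case a real country is missing its population. A sketch on a toy frame (the names and the population value are taken from the tables above):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Country': ['Asia', 'Asia', 'Africa', 'Afghanistan'],
    'Population': [np.nan, np.nan, np.nan, '40,754,388'],
})

no_population = demo['Population'].isnull()
dropped_names = sorted(demo.loc[no_population, 'Country'].unique())

# Equivalent to demo[~demo['Population'].isnull()]
kept = demo.dropna(subset=['Population'])
```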

#### 2.3. Delete dependent attributes such as TotalCases/1M pop, Deaths/1M pop, etc., because they can be calculated from TotalCases, TotalDeaths and Population.

In [764]:
new_df = new_df.drop(columns=['TotalCases/1M pop', 'Tests/1M pop', 'Deaths/1M pop', 'New Cases/1M pop', 'New Deaths/1M pop'])
new_df.head(5)

Unnamed: 0,Country,Date,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",TotalTests,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,Active Cases/1M pop
121,Afghanistan,2023-03-02,209359,19.0,7896,,191166,10.0,10297,45,1201475,40754388,Asia,195,5161,34.0,253
360,Afghanistan,2023-03-03,209362,3.0,7896,,191170,4.0,10296,45,1201744,40754388,Asia,195,5161,34.0,253
599,Afghanistan,2023-03-04,209370,4.0,7896,,191181,7.0,10293,45,1202018,40754388,Asia,195,5161,34.0,253
838,Afghanistan,2023-03-05,209390,20.0,7896,,191212,31.0,10282,45,1202290,40754388,Asia,195,5161,34.0,252
1076,Afghanistan,2023-03-06,209390,,7896,,191212,,10282,45,1202290,40754388,Asia,195,5161,34.0,252
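
The per-million columns can be rebuilt from a total and the population when needed. A quick check with the Afghanistan figures for 2023-03-02 (the rounding rules are an assumption chosen to match the displayed values):

```python
# Figures taken from the Afghanistan row for 2023-03-02
population = 40_754_388
active_cases = 10_297
total_tests = 1_201_475
new_cases = 19

active_per_million = round(active_cases / population * 1_000_000)
tests_per_million = round(total_tests / population * 1_000_000)
new_cases_per_million = round(new_cases / population * 1_000_000, 1)
```

These give 253, 29481 and 0.5, the values shown for that row in the combined table.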


In [765]:
new_df.isnull().sum()

Country                   0
Date                      0
TotalCases                0
NewCases                933
TotalDeaths              30
NewDeaths              1038
TotalRecovered          105
NewRecovered            926
ActiveCases              95
Serious,Critical        514
TotalTests               80
Population                0
Continent                 0
1 Caseevery X ppl         0
1 Deathevery X ppl       30
1 Testevery X ppl        80
Active Cases/1M pop      41
dtype: int64

In [767]:
# Remove thousands separators from the string columns, and the leading "+" from the daily counts.
# regex=True is required for the [+,] character class: pandas 2.0+ treats the pattern as a literal string by default.
new_df['TotalCases'] = new_df['TotalCases'].str.replace(',', '')
new_df['NewCases'] = new_df['NewCases'].str.replace(r'[+,]', '', regex=True)
new_df['TotalDeaths'] = new_df['TotalDeaths'].str.replace(',', '')
new_df['TotalRecovered'] = new_df['TotalRecovered'].str.replace(',', '')
new_df['NewRecovered'] = new_df['NewRecovered'].str.replace(r'[+,]', '', regex=True)
new_df['ActiveCases'] = new_df['ActiveCases'].str.replace(',', '')
new_df['Serious,Critical'] = new_df['Serious,Critical'].str.replace(',', '')
new_df['TotalTests'] = new_df['TotalTests'].str.replace(',', '')
new_df['Population'] = new_df['Population'].str.replace(',', '')
new_df['Active Cases/1M pop'] = new_df['Active Cases/1M pop'].str.replace(',', '')


In [768]:
# Fill missing daily counts and rates with 0 (TotalRecovered and TotalDeaths are handled later)
new_df[['NewCases', 'NewDeaths', 'NewRecovered','Serious,Critical', 'Active Cases/1M pop']] = new_df[['NewCases', 'NewDeaths','NewRecovered','Serious,Critical', 'Active Cases/1M pop']].fillna(0)
new_df.isnull().sum()

Country                  0
Date                     0
TotalCases               0
NewCases                 0
TotalDeaths             30
NewDeaths                0
TotalRecovered         105
NewRecovered             0
ActiveCases             95
Serious,Critical         0
TotalTests              80
Population               0
Continent                0
1 Caseevery X ppl        0
1 Deathevery X ppl      30
1 Testevery X ppl       80
Active Cases/1M pop      0
dtype: int64

In [769]:
# convert all columns to numeric
new_df['ActiveCases'] = pd.to_numeric(new_df['ActiveCases'])
new_df['Active Cases/1M pop'] = pd.to_numeric(new_df['Active Cases/1M pop'])
new_df['Population'] = pd.to_numeric(new_df['Population'])
new_df['TotalCases'] = pd.to_numeric(new_df['TotalCases'])
new_df['TotalDeaths'] = pd.to_numeric(new_df['TotalDeaths'])
new_df['TotalRecovered'] = pd.to_numeric(new_df['TotalRecovered'])
new_df['NewCases'] = pd.to_numeric(new_df['NewCases'])
new_df['NewDeaths'] = pd.to_numeric(new_df['NewDeaths'])
new_df['NewRecovered'] = pd.to_numeric(new_df['NewRecovered'])
new_df['Serious,Critical'] = pd.to_numeric(new_df['Serious,Critical'])
new_df['TotalTests'] = pd.to_numeric(new_df['TotalTests'])
    
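# Recompute ActiveCases from the per-million rate to fill the missing values.
# Note: this overwrites the reported counts with rounded estimates (e.g. 10297 becomes 10311 for Afghanistan).
new_df['ActiveCases'] = new_df['Active Cases/1M pop'] * new_df['Population'] / 1000000
new_df['ActiveCases'] = new_df['ActiveCases'].round(0)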
new_df


Unnamed: 0,Country,Date,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",TotalTests,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,Active Cases/1M pop
121,Afghanistan,2023-03-02,209359,19,7896.0,0.0,191166.0,10,10311.0,45,1201475.0,40754388,Asia,195,5161,34.0,253.0
360,Afghanistan,2023-03-03,209362,3,7896.0,0.0,191170.0,4,10311.0,45,1201744.0,40754388,Asia,195,5161,34.0,253.0
599,Afghanistan,2023-03-04,209370,4,7896.0,0.0,191181.0,7,10311.0,45,1202018.0,40754388,Asia,195,5161,34.0,253.0
838,Afghanistan,2023-03-05,209390,20,7896.0,0.0,191212.0,31,10270.0,45,1202290.0,40754388,Asia,195,5161,34.0,252.0
1076,Afghanistan,2023-03-06,209390,0,7896.0,0.0,191212.0,0,10270.0,45,1202290.0,40754388,Asia,195,5161,34.0,252.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,Zimbabwe,2023-03-02,264127,0,5668.0,0.0,257839.0,0,613.0,12,2525756.0,15331428,Africa,58,2705,6.0,40.0
355,Zimbabwe,2023-03-03,264127,0,5668.0,0.0,257839.0,0,613.0,12,2525756.0,15331428,Africa,58,2705,6.0,40.0
594,Zimbabwe,2023-03-04,264127,0,5668.0,0.0,257839.0,0,613.0,12,2525756.0,15331428,Africa,58,2705,6.0,40.0
833,Zimbabwe,2023-03-05,264127,0,5668.0,0.0,257839.0,0,613.0,12,2525756.0,15331428,Africa,58,2705,6.0,40.0
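
The recomputation replaces every ActiveCases value, including the ones that were reported. An alternative, which is not what this notebook does, is to keep the reported counts and estimate only the missing ones. A minimal sketch on two toy rows built from the Afghanistan figures:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'ActiveCases': [10297.0, np.nan],
    'Active Cases/1M pop': [253.0, 253.0],
    'Population': [40754388, 40754388],
})

# Estimate from the per-million rate, then use it only where the count is missing
estimate = (demo['Active Cases/1M pop'] * demo['Population'] / 1_000_000).round(0)
demo['ActiveCases'] = demo['ActiveCases'].fillna(estimate)
```

The first row keeps its reported 10297, while the second row receives the estimate.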


In [770]:
new_df.isnull().sum()

Country                  0
Date                     0
TotalCases               0
NewCases                 0
TotalDeaths             30
NewDeaths                0
TotalRecovered         105
NewRecovered             0
ActiveCases              0
Serious,Critical         0
TotalTests              80
Population               0
Continent                0
1 Caseevery X ppl        0
1 Deathevery X ppl      30
1 Testevery X ppl       80
Active Cases/1M pop      0
dtype: int64

In [771]:
new_df[new_df['TotalDeaths'].isnull()]

Unnamed: 0,Country,Date,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",TotalTests,Population,Continent,1 Caseevery X ppl,1 Deathevery X ppl,1 Testevery X ppl,Active Cases/1M pop
231,Falkland Islands,2023-03-02,1930,0,,0.0,1930.0,0,0.0,0,8632.0,3539,South America,2,,0.0,0.0
470,Falkland Islands,2023-03-03,1930,0,,0.0,1930.0,0,0.0,0,8632.0,3539,South America,2,,0.0,0.0
709,Falkland Islands,2023-03-04,1930,0,,0.0,1930.0,0,0.0,0,8632.0,3539,South America,2,,0.0,0.0
948,Falkland Islands,2023-03-05,1930,0,,0.0,1930.0,0,0.0,0,8632.0,3539,South America,2,,0.0,0.0
1186,Falkland Islands,2023-03-06,1930,0,,0.0,1930.0,0,0.0,0,8632.0,3539,South America,2,,0.0,0.0
233,Niue,2023-03-02,790,0,,0.0,784.0,1,6.0,0,,1622,Australia/Oceania,2,,,3699.0
472,Niue,2023-03-03,790,0,,0.0,784.0,0,6.0,0,,1622,Australia/Oceania,2,,,3699.0
711,Niue,2023-03-04,790,0,,0.0,784.0,0,6.0,0,,1622,Australia/Oceania,2,,,3699.0
950,Niue,2023-03-05,790,0,,0.0,784.0,0,6.0,0,,1622,Australia/Oceania,2,,,3699.0
1188,Niue,2023-03-06,790,0,,0.0,784.0,0,6.0,0,,1622,Australia/Oceania,2,,,3699.0


In [772]:
# Remove rows with missing TotalDeaths, and countries with a population below 1,000,000
new_df = new_df[~new_df['TotalDeaths'].isnull()]
new_df = new_df[~(new_df['Population'] < 1000000)]
new_df.isnull().sum()

Country                 0
Date                    0
TotalCases              0
NewCases                0
TotalDeaths             0
NewDeaths               0
TotalRecovered         35
NewRecovered            0
ActiveCases             0
Serious,Critical        0
TotalTests             20
Population              0
Continent               0
1 Caseevery X ppl       0
1 Deathevery X ppl      0
1 Testevery X ppl      20
Active Cases/1M pop     0
dtype: int64

In [773]:
new_df.shape

(790, 17)

In [774]:
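# Derive TotalRecovered from TotalCases = TotalDeaths + TotalRecovered + ActiveCases to fill the missing values
new_df['TotalRecovered'] = new_df['TotalCases'] - new_df['TotalDeaths'] - new_df['ActiveCases']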
new_df['TotalRecovered'] = new_df['TotalRecovered'].round(0)
new_df.isnull().sum()

Country                 0
Date                    0
TotalCases              0
NewCases                0
TotalDeaths             0
NewDeaths               0
TotalRecovered          0
NewRecovered            0
ActiveCases             0
Serious,Critical        0
TotalTests             20
Population              0
Continent               0
1 Caseevery X ppl       0
1 Deathevery X ppl      0
1 Testevery X ppl      20
Active Cases/1M pop     0
dtype: int64
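
This step relies on the identity TotalCases = TotalDeaths + TotalRecovered + ActiveCases. A quick check that the originally reported Afghanistan figures for 2023-03-02 satisfy it (a sketch on one row, not a validation of the whole dataset):

```python
import pandas as pd

demo = pd.DataFrame({
    'TotalCases': [209359],
    'TotalDeaths': [7896],
    'TotalRecovered': [191166],
    'ActiveCases': [10297],
})

# A gap of 0 means the four reported figures are consistent with each other
gap = (demo['TotalCases'] - demo['TotalDeaths']
       - demo['TotalRecovered'] - demo['ActiveCases'])
```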

In [775]:
# save to csv
new_df.to_csv('./data/final_data.csv', index=False)
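
CSV files do not store column types, so the Date column comes back as plain strings unless it is parsed on load. A round-trip sketch that uses an in-memory buffer in place of `./data/final_data.csv`:

```python
import io
import pandas as pd

demo = pd.DataFrame({
    'Country': ['Afghanistan'],
    'Date': pd.to_datetime(['2023-03-02']),
    'TotalCases': [209359],
})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime type when reading the file back
restored = pd.read_csv(buffer, parse_dates=['Date'])
```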