# DTSC 691 PROJECT

### COVID-19 Dataset: Number of Confirmed, Death and Recovered cases every day across the globe from Kaggle.com

### Project Overview and Goal

In this project we analyze several COVID-19 datasets and highlight the details relevant to our study topic. The project tracks key variables from January 22, 2020, to July 27, 2020, including the number of cases, the mortality rate, and the recovery rate by geographic location. Rather than discussing causes, it gives a general account of the effects of COVID-19 across World Health Organization (WHO) regions. The questions we attempt to address are:


- How has the COVID-19 pandemic changed over time in each geographical area (globe, continent, and WHO region)?
- How many COVID-19 cases were recorded per geographical area (country, continent, and WHO region)?
- What is the overall death rate by geographical area (country, continent, and WHO region)?
- How many individuals recovered from COVID-19 in each country, continent, and WHO region?
- How effective were treatment modalities by geographical area (country, continent, and WHO region)?
- How did the efficacy of COVID-19 treatment differ between the US and other nations?
- Where was the COVID-19 pandemic least disruptive (country, continent, and WHO region)?
- Where was the COVID-19 pandemic most prevalent (country, continent, and WHO region)?

You may view our stories in our Tableau file, which attempts to address each of the questions above. After reading this notebook, you should have a clear understanding of our goals and a solid overview of the project.

This notebook examines several COVID-19 datasets. We inspect each dataset independently to check for missing values and to confirm that each column has an appropriate data type. We also remove columns that consist mostly of missing values or carry lower-quality information. Once this is done, we consider which join method best combines the tables to get the information we need, and how to build a relational schema from them.
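One way to compare join methods is an outer merge with pandas' `indicator` option, which marks the rows that found no partner in the other table. The sketch below uses two small made-up tables (`cases` and `demographics` are hypothetical names, not files from this project) and assumes the country name is the join key.

```python
import pandas as pd

# Hypothetical toy tables standing in for two of the cleaned datasets.
cases = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria"],
    "Confirmed": [36263, 4880, 27973],
})
demographics = pd.DataFrame({
    "Country": ["Albania", "Algeria", "USA"],
    "Continent": ["Europe", "Africa", "North America"],
})

# An outer join keeps every row from both sides; `_merge` records
# whether a row came from the left table, the right table, or both.
merged = cases.merge(demographics, on="Country", how="outer", indicator=True)

# Rows that did not match point to missing or differently spelled keys.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["Country", "_merge"]])
```

If `unmatched` is empty, an inner join loses nothing; otherwise the listed keys show what an inner join would silently drop.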

In [1]:
import numpy as np
import pandas as pd

##### 1. country_wise_latest

In [2]:
df1 = pd.read_csv('Datasets/country_wise_latest.csv')
df1.head()

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.5,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.0,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
3,Andorra,907,52,803,52,10,0,0,5.73,88.53,6.48,884,23,2.6,Europe
4,Angola,950,41,242,667,18,1,0,4.32,25.47,16.94,749,201,26.84,Africa


In [3]:
list(df1.columns)

['Country/Region',
 'Confirmed',
 'Deaths',
 'Recovered',
 'Active',
 'New cases',
 'New deaths',
 'New recovered',
 'Deaths / 100 Cases',
 'Recovered / 100 Cases',
 'Deaths / 100 Recovered',
 'Confirmed last week',
 '1 week change',
 '1 week % increase',
 'WHO Region']

In [4]:
#df1.describe()
df1.shape

(187, 15)

In [5]:
df1.isnull().sum()  # Check for missing values; this dataset has none

Country/Region            0
Confirmed                 0
Deaths                    0
Recovered                 0
Active                    0
New cases                 0
New deaths                0
New recovered             0
Deaths / 100 Cases        0
Recovered / 100 Cases     0
Deaths / 100 Recovered    0
Confirmed last week       0
1 week change             0
1 week % increase         0
WHO Region                0
dtype: int64
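The same checks recur for every dataset in this notebook. A small helper could bundle them and also report each column's data type, which the overview mentions but the cells do not print. The function name `audit` and the sample frame are assumptions made for illustration.

```python
import pandas as pd

def audit(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing percentage per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(2),
    })

# Hypothetical sample; in the notebook this would be df1, df2, and so on.
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria", "Andorra"],
    "Confirmed": [36263, 4880, 27973, 907],
    "Province": [None, None, None, "placeholder"],
})

report = audit(sample)
print(report)
```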

In [6]:
df1_modified = df1.drop(['Confirmed last week', '1 week change', '1 week % increase', 'Deaths / 100 Cases', 'Recovered / 100 Cases',
                         'Deaths / 100 Recovered' ], axis=1)

df1_modified.rename(columns={'Country/Region':'Country',
                            'New cases':'New_cases',
                            'New deaths':'New_deaths',
                            'New recovered':'New_recovered',
                            'WHO Region':'WHO_Region'}, inplace=True)
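Most of the renames in this notebook replace spaces with underscores. As a hedged alternative to listing each column by hand, the mapping can be generated with a function passed to `rename`. The helper name `underscore_columns` is hypothetical, and names such as `Country/Region` would still need an explicit mapping.

```python
import pandas as pd

def underscore_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names use underscores instead of spaces."""
    return df.rename(columns=lambda name: name.strip().replace(" ", "_"))

# Hypothetical one-row frame with column names taken from the dataset above.
sample = pd.DataFrame({"New cases": [106], "WHO Region": ["Europe"], "Confirmed": [4880]})

renamed = underscore_columns(sample)
print(list(renamed.columns))
```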


In [7]:
df1_modified = df1_modified.set_index('Country') 
df1_modified.head(3)

Country,Confirmed,Deaths,Recovered,Active,New_cases,New_deaths,New_recovered,WHO_Region
Afghanistan,36263,1269,25198,9796,106,10,18,Eastern Mediterranean
Albania,4880,144,2745,1991,117,6,63,Europe
Algeria,27973,1163,18837,7973,616,8,749,Africa


In [8]:
df1_modified.to_csv('Country_wise.csv', sep=',', encoding='utf-8')

##### 2. covid_19_clean_complete

In [9]:
df2 = pd.read_csv('Datasets/covid_19_clean_complete.csv')
df2.head()

Unnamed: 0,Province_State,Country_Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO_Region
0,,Afghanistan,33.93911,67.709953,1/22/2020,0,0,0,0,Eastern Mediterranean
1,,Albania,41.1533,20.1683,1/22/2020,0,0,0,0,Europe
2,,Algeria,28.0339,1.6596,1/22/2020,0,0,0,0,Africa
3,,Andorra,42.5063,1.5218,1/22/2020,0,0,0,0,Europe
4,,Angola,-11.2027,17.8739,1/22/2020,0,0,0,0,Africa


In [10]:
list(df2.columns)

['Province_State',
 'Country_Region',
 'Lat',
 'Long',
 'Date',
 'Confirmed',
 'Deaths',
 'Recovered',
 'Active',
 'WHO_Region']

In [11]:
#df2.describe()
df2.shape

(49068, 10)

We check for missing values again. The "Province_State" column is missing 34,404 of its 49,068 values (70.11%). Because of this gap, and because our study focuses on countries, we drop this column, along with the "Lat" and "Long" coordinates, which the questions above do not use.

In [12]:
df2.isnull().sum()

Province_State    34404
Country_Region        0
Lat                   0
Long                  0
Date                  0
Confirmed             0
Deaths                0
Recovered             0
Active                0
WHO_Region            0
dtype: int64
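The 70.11% figure follows directly from the counts above. A minimal sketch, using a stand-in Series with the same number of missing and present values rather than the real column:

```python
import pandas as pd

# Stand-in column with the same counts as the output above:
# 34404 missing values out of 49068 rows.
province = pd.Series([None] * 34404 + ["placeholder"] * (49068 - 34404))

# The mean of a boolean mask is the share of True values.
missing_pct = province.isnull().mean() * 100
print(round(missing_pct, 2))
```

On the real data, the equivalent expression would be `df2.isnull().mean() * 100`, which gives the percentage for every column at once.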

In [13]:
df2_modified = df2.drop(['Province_State', 'Lat', 'Long'], axis=1)
df2_modified.rename(columns={'Country_Region':'Country'}, inplace=True)

In [14]:
df2_modified = df2_modified.set_index('Country') 
df2_modified.head(3)

Country,Date,Confirmed,Deaths,Recovered,Active,WHO_Region
Afghanistan,1/22/2020,0,0,0,0,Eastern Mediterranean
Albania,1/22/2020,0,0,0,0,Europe
Algeria,1/22/2020,0,0,0,0,Africa


In [15]:
df2_modified.to_csv('Covid19_clan.csv', sep=',', encoding='utf-8')

##### 3. day_wise.csv

In [16]:
df3 = pd.read_csv('Datasets/day_wise.csv')
df3.head()

Unnamed: 0,Date,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,No. of countries
0,2020-01-22,555,17,28,510,0,0,0,3.06,5.05,60.71,6
1,2020-01-23,654,18,30,606,99,1,2,2.75,4.59,60.0,8
2,2020-01-24,941,26,36,879,287,8,6,2.76,3.83,72.22,9
3,2020-01-25,1434,42,39,1353,493,16,3,2.93,2.72,107.69,11
4,2020-01-26,2118,56,52,2010,684,14,13,2.64,2.46,107.69,13


In [17]:
list(df3.columns)

['Date',
 'Confirmed',
 'Deaths',
 'Recovered',
 'Active',
 'New cases',
 'New deaths',
 'New recovered',
 'Deaths / 100 Cases',
 'Recovered / 100 Cases',
 'Deaths / 100 Recovered',
 'No. of countries']

In [18]:
#df3.describe()
df3.shape

(188, 12)

In [19]:
df3.isnull().sum()  # Check for missing values; this dataset has none

Date                      0
Confirmed                 0
Deaths                    0
Recovered                 0
Active                    0
New cases                 0
New deaths                0
New recovered             0
Deaths / 100 Cases        0
Recovered / 100 Cases     0
Deaths / 100 Recovered    0
No. of countries          0
dtype: int64

In [62]:
df3_modified = df3.drop(['Deaths / 100 Cases','Recovered / 100 Cases','Deaths / 100 Recovered'], axis=1)

df3_modified.rename(columns={'New cases':'New_cases',
                            'New deaths':'New_deaths',
                            'New recovered':'New_recovered',
                            'No. of countries':'Number_of_countries'}, inplace=True)

In [63]:
df3_modified = df3_modified.set_index('Date') 
df3_modified.head(3)

Date,Confirmed,Deaths,Recovered,Active,New_cases,New_deaths,New_recovered,Number_of_countries
2020-01-22,555,17,28,510,0,0,0,6
2020-01-23,654,18,30,606,99,1,2,8
2020-01-24,941,26,36,879,287,8,6,9


In [64]:
df3_modified.to_csv('Day_wise.csv', sep=',', encoding='utf-8')

##### 4. full_grouped

In [23]:
df4 = pd.read_csv('Datasets/full_grouped.csv')
df4.head()

Unnamed: 0,Date,Country_Region,Confirmed,Deaths,Recovered,Active,New_cases,New_deaths,New_recovered,WHO_Region
0,1/22/2020,Afghanistan,0,0,0,0,0,0,0,Eastern Mediterranean
1,1/22/2020,Albania,0,0,0,0,0,0,0,Europe
2,1/22/2020,Algeria,0,0,0,0,0,0,0,Africa
3,1/22/2020,Andorra,0,0,0,0,0,0,0,Europe
4,1/22/2020,Angola,0,0,0,0,0,0,0,Africa


In [24]:
list(df4.columns)

['Date',
 'Country_Region',
 'Confirmed',
 'Deaths',
 'Recovered',
 'Active',
 'New_cases',
 'New_deaths',
 'New_recovered',
 'WHO_Region']

In [25]:
df4.shape

(35156, 10)
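The row count equals 187 × 188, which matches the 187 countries in country_wise_latest and the 188 dates in day_wise. That suggests one row per (Date, Country) pair, which would make the pair a candidate composite key for the relational schema. The sketch below shows how such a key could be checked; the toy `records` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of full_grouped: two dates, two countries.
records = pd.DataFrame({
    "Date": ["1/22/2020", "1/22/2020", "1/23/2020", "1/23/2020"],
    "Country": ["Afghanistan", "Albania", "Afghanistan", "Albania"],
    "Confirmed": [0, 0, 0, 0],
})

key = ["Date", "Country"]

# A valid composite key never repeats.
has_duplicates = records.duplicated(subset=key).any()

# A complete panel has exactly one row for every date/country combination.
expected_rows = records["Date"].nunique() * records["Country"].nunique()
is_complete = len(records) == expected_rows

print(has_duplicates, is_complete)
```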

In [26]:
df4.isnull().sum()  # Check for missing values; this dataset has none

Date              0
Country_Region    0
Confirmed         0
Deaths            0
Recovered         0
Active            0
New_cases         0
New_deaths        0
New_recovered     0
WHO_Region        0
dtype: int64

In [27]:
df4_modified = df4.rename(columns={'Country_Region':'Country'})

In [28]:
df4_modified = df4_modified.set_index('Country') 
df4_modified.head(3)

Country,Date,Confirmed,Deaths,Recovered,Active,New_cases,New_deaths,New_recovered,WHO_Region
Afghanistan,1/22/2020,0,0,0,0,0,0,0,Eastern Mediterranean
Albania,1/22/2020,0,0,0,0,0,0,0,Europe
Algeria,1/22/2020,0,0,0,0,0,0,0,Africa


In [29]:
df4_modified.to_csv('Full_detail.csv', sep=',', encoding='utf-8')

##### 5. usa_county_wise

In [30]:
df5 = pd.read_csv('Datasets/usa_county_wise.csv')
df5.head()

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Date,Confirmed,Deaths
0,16,AS,ASM,16,60.0,,American Samoa,US,-14.271,-170.132,"American Samoa, US",1/22/20,0,0
1,316,GU,GUM,316,66.0,,Guam,US,13.4443,144.7937,"Guam, US",1/22/20,0,0
2,580,MP,MNP,580,69.0,,Northern Mariana Islands,US,15.0979,145.6739,"Northern Mariana Islands, US",1/22/20,0,0
3,63072001,PR,PRI,630,72001.0,Adjuntas,Puerto Rico,US,18.180117,-66.754367,"Adjuntas, Puerto Rico, US",1/22/20,0,0
4,63072003,PR,PRI,630,72003.0,Aguada,Puerto Rico,US,18.360255,-67.175131,"Aguada, Puerto Rico, US",1/22/20,0,0


In [31]:
list(df5.columns)

['UID',
 'iso2',
 'iso3',
 'code3',
 'FIPS',
 'Admin2',
 'Province_State',
 'Country_Region',
 'Lat',
 'Long_',
 'Combined_Key',
 'Date',
 'Confirmed',
 'Deaths']

In [32]:
df5.shape

(627920, 14)

In [33]:
df5.isnull().sum() #Always checking for missing values. 

UID                  0
iso2                 0
iso3                 0
code3                0
FIPS              1880
Admin2            1128
Province_State       0
Country_Region       0
Lat                  0
Long_                0
Combined_Key         0
Date                 0
Confirmed            0
Deaths               0
dtype: int64

Once again, we drop the columns ['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Long_', 'Lat', 'Combined_Key', 'Country_Region'] because they do not support our objective or the questions we have to answer.


In [34]:
df5_modified = df5.drop(['UID','iso2','iso3','code3','FIPS','Admin2','Long_','Lat','Combined_Key','Country_Region'], axis=1)

In [35]:
df5_modified = df5_modified.set_index('Province_State') 
df5_modified.head(3)

Province_State,Date,Confirmed,Deaths
American Samoa,1/22/20,0,0
Guam,1/22/20,0,0
Northern Mariana Islands,1/22/20,0,0


In [36]:
df5_modified.to_csv('USA_wise.csv', sep=',', encoding='utf-8')

##### 6. worldometer_data

In [37]:
df6 = pd.read_csv('Datasets/worldometer_data.csv')
df6.head()

Unnamed: 0,Country/Region,Continent,Population,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,TotalTests,Tests/1M pop,WHO Region
0,USA,North America,331198100.0,5032179,,162804.0,,2576668.0,,2292707.0,18296.0,15194.0,492.0,63139605.0,190640.0,Americas
1,Brazil,South America,212710700.0,2917562,,98644.0,,2047660.0,,771258.0,8318.0,13716.0,464.0,13206188.0,62085.0,Americas
2,India,Asia,1381345000.0,2025409,,41638.0,,1377384.0,,606387.0,8944.0,1466.0,30.0,22149351.0,16035.0,South-EastAsia
3,Russia,Europe,145940900.0,871894,,14606.0,,676357.0,,180931.0,2300.0,5974.0,100.0,29716907.0,203623.0,Europe
4,South Africa,Africa,59381570.0,538184,,9604.0,,387316.0,,141264.0,539.0,9063.0,162.0,3149807.0,53044.0,Africa


In [38]:
list(df6.columns)

['Country/Region',
 'Continent',
 'Population',
 'TotalCases',
 'NewCases',
 'TotalDeaths',
 'NewDeaths',
 'TotalRecovered',
 'NewRecovered',
 'ActiveCases',
 'Serious,Critical',
 'Tot Cases/1M pop',
 'Deaths/1M pop',
 'TotalTests',
 'Tests/1M pop',
 'WHO Region']

In [39]:
df6.shape

(209, 16)

In [40]:
df6.isnull().sum() #Always checking for missing values. 

Country/Region        0
Continent             1
Population            1
TotalCases            0
NewCases            205
TotalDeaths          21
NewDeaths           206
TotalRecovered        4
NewRecovered        206
ActiveCases           4
Serious,Critical     87
Tot Cases/1M pop      1
Deaths/1M pop        22
TotalTests           18
Tests/1M pop         18
WHO Region           25
dtype: int64

We can remove the columns "NewCases," "NewRecovered," and "NewDeaths" without losing a significant amount of data because they consist almost entirely of missing values (205 or 206 of 209 rows). We also drop the per-million columns "Tot Cases/1M pop," "Deaths/1M pop," and "Tests/1M pop."

In [41]:
df6_modified = df6.drop(['NewCases', 'NewRecovered', 'NewDeaths','Tot Cases/1M pop','Deaths/1M pop','Tests/1M pop'], axis=1)

df6_modified.rename(columns={'Country/Region':'Country',
                            'Serious,Critical':'Serious_Critical',
                            'WHO Region':'WHO_Region'}, inplace=True)
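Columns that are almost entirely missing can also be found by rule rather than by eye. The following is a sketch under assumptions: the helper name `drop_sparse_columns` and the 90% cutoff are illustrative choices, not part of the original analysis.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.9) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at most max_missing."""
    missing_share = df.isnull().mean()
    keep = missing_share[missing_share <= max_missing].index
    return df[keep]

# Hypothetical 20-row frame: NewCases is 95% missing, TotalDeaths 10%.
sample = pd.DataFrame({
    "TotalCases": list(range(20)),
    "NewCases": [None] * 19 + [1.0],
    "TotalDeaths": [None] * 2 + [5.0] * 18,
})

cleaned = drop_sparse_columns(sample)
print(list(cleaned.columns))
```

A threshold keeps the decision reproducible if the source files are refreshed, though the cutoff itself remains a judgment call.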

In [42]:
df6_modified = df6_modified.set_index('Country') 
df6_modified.head(3)

Country,Continent,Population,TotalCases,TotalDeaths,TotalRecovered,ActiveCases,Serious_Critical,TotalTests,WHO_Region
USA,North America,331198100.0,5032179,162804.0,2576668.0,2292707.0,18296.0,63139605.0,Americas
Brazil,South America,212710700.0,2917562,98644.0,2047660.0,771258.0,8318.0,13206188.0,Americas
India,Asia,1381345000.0,2025409,41638.0,1377384.0,606387.0,8944.0,22149351.0,South-EastAsia


In [43]:
df6_modified.to_csv('Worldometer.csv', sep=',', encoding='utf-8')

##### 7. owid-covid-data

In [44]:
df7 = pd.read_csv('Datasets/owid-covid-data.csv')
df7.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
0,AFG,Asia,Afghanistan,2020-02-24,1.0,1.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511
1,AFG,Asia,Afghanistan,2020-02-25,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511
2,AFG,Asia,Afghanistan,2020-02-26,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511
3,AFG,Asia,Afghanistan,2020-02-27,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511
4,AFG,Asia,Afghanistan,2020-02-28,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511


This dataset has 59 columns. Rather than dropping the ones we do not need, we select the few that are relevant to our research objective.

In [45]:
list(df7.columns)

['iso_code',
 'continent',
 'location',
 'date',
 'total_cases',
 'new_cases',
 'new_cases_smoothed',
 'total_deaths',
 'new_deaths',
 'new_deaths_smoothed',
 'total_cases_per_million',
 'new_cases_per_million',
 'new_cases_smoothed_per_million',
 'total_deaths_per_million',
 'new_deaths_per_million',
 'new_deaths_smoothed_per_million',
 'reproduction_rate',
 'icu_patients',
 'icu_patients_per_million',
 'hosp_patients',
 'hosp_patients_per_million',
 'weekly_icu_admissions',
 'weekly_icu_admissions_per_million',
 'weekly_hosp_admissions',
 'weekly_hosp_admissions_per_million',
 'new_tests',
 'total_tests',
 'total_tests_per_thousand',
 'new_tests_per_thousand',
 'new_tests_smoothed',
 'new_tests_smoothed_per_thousand',
 'positive_rate',
 'tests_per_case',
 'tests_units',
 'total_vaccinations',
 'people_vaccinated',
 'people_fully_vaccinated',
 'new_vaccinations',
 'new_vaccinations_smoothed',
 'total_vaccinations_per_hundred',
 'people_vaccinated_per_hundred',
 'people_fully_vaccinate

In [46]:
df7.shape

(91026, 59)

In [47]:
df7_modified = df7[['continent','date','total_cases','total_deaths','reproduction_rate','total_tests','positive_rate','total_vaccinations','population','population_density',
                    'median_age','aged_65_older','life_expectancy']]

In [48]:
df7_modified.isnull().sum()

continent              4327
date                      0
total_cases            2690
total_deaths          12542
reproduction_rate     17659
total_tests           50153
positive_rate         46582
total_vaccinations    78920
population              604
population_density     6364
median_age             9273
aged_65_older         10197
life_expectancy        4594
dtype: int64

In [49]:
# Drop the 'total_vaccinations' column because 86.7% of its values are missing.
df7_modif = df7_modified.drop(['total_vaccinations'], axis=1)

Our study covers a fixed time window, from January 22, 2020, to July 27, 2020, so we must ensure that df7 and df8 fall within that range.

In [50]:
df7_mod = df7_modif[df7_modif['date']<'2020-07-28']
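The filter above relies on ISO `YYYY-MM-DD` strings, which sort in date order, and it applies only the upper bound. If the file contains rows before January 22, 2020, both bounds are needed. A minimal sketch with a made-up frame, assuming the dates parse with the `%Y-%m-%d` format:

```python
import pandas as pd

# Hypothetical rows on either side of the study window.
frame = pd.DataFrame({
    "date": ["2020-01-21", "2020-01-22", "2020-07-27", "2020-07-28"],
    "total_cases": [1.0, 2.0, 3.0, 4.0],
})

# Parse once so later comparisons work on real dates, not text.
frame["date"] = pd.to_datetime(frame["date"], format="%Y-%m-%d")

start = pd.Timestamp("2020-01-22")
end = pd.Timestamp("2020-07-27")

# Series.between is inclusive on both ends by default.
in_window = frame[frame["date"].between(start, end)]
print(in_window)
```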

In [51]:
df7_mod = df7_mod.set_index('date') 
df7_mod.head(3)

date,continent,total_cases,total_deaths,reproduction_rate,total_tests,positive_rate,population,population_density,median_age,aged_65_older,life_expectancy
2020-02-24,Asia,1.0,,,,,38928341.0,54.422,18.6,2.581,64.83
2020-02-25,Asia,1.0,,,,,38928341.0,54.422,18.6,2.581,64.83
2020-02-26,Asia,1.0,,,,,38928341.0,54.422,18.6,2.581,64.83


In [52]:
df7_mod.to_csv('Covid-data.csv', sep=',', encoding='utf-8')

##### 8. covid_19_usa_city

In [53]:
df8 = pd.read_csv('Datasets/covid_19_usa_city.csv')
df8.head()

Unnamed: 0,City,Total Cases,New Cases,Total Deaths,New Deaths,Active Cases,Total Cases /1M pop,Deaths /1M pop,Total Tests,Tests /1M pop,Date
0,New York,188694,7550.0,9385.0,758.0,162220,9618.0,478.0,461601.0,23529,04-12-2020
1,New Jersey,61850,3699.0,2350.0,167.0,58818,6964.0,265.0,126735.0,14269,04-12-2020
2,Michigan,23993,,1392.0,,22158,2410.0,140.0,76014.0,7634,04-12-2020
3,Massachusetts,22860,,686.0,,21445,3347.0,100.0,108776.0,15926,04-12-2020
4,Pennsylvania,22833,1029.0,507.0,6.0,21676,1785.0,40.0,124890.0,9764,04-12-2020


In [54]:
list(df8.columns)

['City',
 'Total Cases',
 'New Cases',
 'Total Deaths',
 'New Deaths',
 'Active Cases',
 'Total Cases /1M pop',
 'Deaths /1M pop',
 'Total Tests',
 'Tests /1M pop',
 'Date']

In [55]:
df8.shape

(9660, 11)

In [56]:
df8.isnull().sum()

City                      0
Total Cases               0
New Cases              4990
Total Deaths             20
New Deaths                0
Active Cases              0
Total Cases /1M pop    1288
Deaths /1M pop         1290
Total Tests             387
Tests /1M pop          1288
Date                      0
dtype: int64

In [57]:
df8_modified = df8.drop(['New Cases', 'Total Cases /1M pop','Deaths /1M pop','Tests /1M pop'], axis=1)

df8_modified.rename(columns={'Total Cases':'Total_cases',
                            'Total Deaths':'Total_deaths',
                            'New Deaths':'New_deaths',
                            'Active Cases':'Active_cases',
                            'Total Tests':'Total_tests'}, inplace=True)

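# Parse the MM-DD-YYYY strings before comparing; as plain text they would be ordered by month, not by year.
df8_modified = df8_modified[pd.to_datetime(df8_modified['Date'], format='%m-%d-%Y') < pd.Timestamp('2020-07-28')]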
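Dates written as `MM-DD-YYYY` text are ordered by month before year, so a text comparison against `'07-28-2020'` is only reliable while every row falls in 2020. The sketch below uses three made-up dates to show where text and parsed comparisons disagree; the values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical dates: one inside the window, one later in 2020, one in 2021.
dates = pd.Series(["04-12-2020", "12-31-2020", "01-15-2021"])

# Text comparison: "01-15-2021" sorts before "07-28-2020" because "01" < "07".
as_text = dates < "07-28-2020"

# Date comparison after parsing with an explicit format.
parsed = pd.to_datetime(dates, format="%m-%d-%Y")
as_dates = parsed < pd.Timestamp("2020-07-28")

print(as_text.tolist())
print(as_dates.tolist())
```

The two masks differ only for the 2021 row, which the text comparison wrongly keeps.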

In [58]:
df8_modified = df8_modified.set_index('City') 
df8_modified.head(3)

City,Total_cases,Total_deaths,New_deaths,Active_cases,Total_tests,Date
New York,188694,9385.0,758.0,162220,461601.0,04-12-2020
New Jersey,61850,2350.0,167.0,58818,126735.0,04-12-2020
Michigan,23993,1392.0,,22158,76014.0,04-12-2020


In [59]:
df8_modified.to_csv('USA_city_Covid19.csv', sep=',', encoding='utf-8')