# Covid Cases Worldwide

![title](image.png)

This project reviews worldwide Covid-19 case data from January 20, 2020 through June 1, 2020.

Original dataset:
https://www.kaggle.com/datasets/lin0li/covid19testing

## Importing data to notebook

In [205]:
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

In [142]:
covid_df = pd.read_csv('covid19.csv')

Checking for the number of columns and rows

## Data Cleaning Process

In [143]:
covid_df.head()

Unnamed: 0,Date,Continent_Name,Two_Letter_Country_Code,Country_Region,Province_State,positive,hospitalized,recovered,death,total_tested,active,hospitalizedCurr,daily_tested,daily_positive
0,2020-01-20,Asia,KR,South Korea,All States,1,0,0,0,4,0,0,0,0
1,2020-01-22,North America,US,United States,All States,1,0,0,0,1,0,0,0,0
2,2020-01-22,North America,US,United States,Washington,1,0,0,0,1,0,0,0,0
3,2020-01-23,North America,US,United States,All States,1,0,0,0,1,0,0,0,0
4,2020-01-23,North America,US,United States,Washington,1,0,0,0,1,0,0,0,0


In [144]:
covid_df.tail(2)

Unnamed: 0,Date,Continent_Name,Two_Letter_Country_Code,Country_Region,Province_State,positive,hospitalized,recovered,death,total_tested,active,hospitalizedCurr,daily_tested,daily_positive
10901,2020-06-01,Asia,TW,Taiwan,All States,0,0,0,0,72319,0,0,237,0
10902,2020-06-01,Asia,VN,Vietnam,All States,0,0,0,0,261004,0,0,0,0


In [145]:
covid_df.shape

(10903, 14)

The dataset has 10,903 rows and 14 columns.

Using the .info() method to check whether any values are missing

In [146]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10903 entries, 0 to 10902
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Date                     10903 non-null  object
 1   Continent_Name           10903 non-null  object
 2   Two_Letter_Country_Code  10903 non-null  object
 3   Country_Region           10903 non-null  object
 4   Province_State           10903 non-null  object
 5   positive                 10903 non-null  int64 
 6   hospitalized             10903 non-null  int64 
 7   recovered                10903 non-null  int64 
 8   death                    10903 non-null  int64 
 9   total_tested             10903 non-null  int64 
 10  active                   10903 non-null  int64 
 11  hospitalizedCurr         10903 non-null  int64 
 12  daily_tested             10903 non-null  int64 
 13  daily_positive           10903 non-null  int64 
dtypes: int64(9), object(5)
memory usage: 1

Listing all columns to determine whether any names need to be changed

In [147]:
covid_df.columns

Index(['Date', 'Continent_Name', 'Two_Letter_Country_Code', 'Country_Region',
       'Province_State', 'positive', 'hospitalized', 'recovered', 'death',
       'total_tested', 'active', 'hospitalizedCurr', 'daily_tested',
       'daily_positive'],
      dtype='object')

Renaming three columns with a dictionary passed to .rename()

In [148]:
covid_df.rename(columns={
    'Continent_Name': 'Continent',
    'Two_Letter_Country_Code': 'Country_Code',
    'Province_State': 'State'
}, inplace=True)

In [149]:
covid_df.columns

Index(['Date', 'Continent', 'Country_Code', 'Country_Region', 'State',
       'positive', 'hospitalized', 'recovered', 'death', 'total_tested',
       'active', 'hospitalizedCurr', 'daily_tested', 'daily_positive'],
      dtype='object')

This list was created to capitalize the first letter of each word

In [150]:
covid_df.columns = ['Date', 'Continent', 'Country_Code', 'Country_Region', 'State',
       'Positive', 'Hospitalized', 'Recovered', 'Death', 'Total_Tested',
       'Active', 'HospitalizedCurr', 'Daily_Tested', 'Daily_Positive']

In [151]:
covid_df.columns

Index(['Date', 'Continent', 'Country_Code', 'Country_Region', 'State',
       'Positive', 'Hospitalized', 'Recovered', 'Death', 'Total_Tested',
       'Active', 'HospitalizedCurr', 'Daily_Tested', 'Daily_Positive'],
      dtype='object')
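
The hand-written list above works, but it has to be kept in sync with the columns manually. As a minimal sketch of the same capitalization done programmatically, the helper below (`capitalize_parts`, a hypothetical name, not part of the original notebook) upper-cases the first letter of each underscore-separated part and leaves the remaining letters alone, so `hospitalizedCurr` becomes `HospitalizedCurr` rather than `Hospitalizedcurr`. It runs on a small stand-in frame rather than on `covid_df`.

```python
import pandas as pd


def capitalize_parts(name):
    """Upper-case the first letter of each underscore-separated part; keep the rest as-is."""
    return "_".join(part[:1].upper() + part[1:] for part in name.split("_"))


# Stand-in frame that reuses a few of the original column names (no data needed)
sample_df = pd.DataFrame(
    columns=["Date", "Country_Region", "positive", "total_tested", "hospitalizedCurr"]
)

# rename() accepts a function and applies it to every column label
sample_df = sample_df.rename(columns=capitalize_parts)

print(list(sample_df.columns))
```

Passing a function to `rename` avoids retyping every name, which also removes the risk of listing the columns in the wrong order.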

After going through this cleaning, I realized I wanted to drop some columns to make analysis easier

In [152]:
covid_df.drop(['Country_Code', 'State', 'Hospitalized', 'Total_Tested', 'Active', 'HospitalizedCurr', 'Daily_Tested', 'Daily_Positive'], axis=1, inplace=True)


Since the DataFrame has many rows per date, I created a new column that groups the rows by month for easier analysis.

In [187]:
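# Caution: `data` holds one row per calendar day (134 rows), and pandas aligns on the
# index when assigning, so only rows 0-133 of covid_df receive a month; the rest are NaN.
data = {'Date': pd.date_range(start='2020-01-20', end='2020-06-01')}
data = pd.DataFrame(data)
covid_df['Month'] = data['Date'].apply(lambda x: datetime.strftime(x, '%B'))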

In [188]:
#def extract_month(date):
    #try:
        #last_date = covid_df[covid_df['Date']==date]['Date'].last()
        #return datetime.datetime.strftime(last_date, '%B')
    #except:
        #return None

#covid_df['Month'] = covid_df['Date'].apply(extract_month)

In [189]:
#The above code is supposed to add the month to the last 2020-06-01 row, but instead

Here I am checking that the new "Month" column was created

In [190]:
covid_df.head()

Unnamed: 0,Date,Continent,Country_Region,Positive,Recovered,Death,Month
0,2020-01-20,Asia,South Korea,1,0,0,January
1,2020-01-22,North America,United States,1,0,0,January
2,2020-01-22,North America,United States,1,0,0,January
3,2020-01-23,North America,United States,1,0,0,January
4,2020-01-23,North America,United States,1,0,0,January


In [191]:
covid_df.tail(2)

Unnamed: 0,Date,Continent,Country_Region,Positive,Recovered,Death,Month
10901,2020-06-01,Asia,Taiwan,0,0,0,
10902,2020-06-01,Asia,Vietnam,0,0,0,


In [192]:
#Need help with this: the last rows of the dataset are not showing the month in the Month column. My best guess is that the code only fills the first row of 2020-06-01

In [195]:
covid_df.isnull()
covid_df.isnull().sum()


Date                  0
Continent             0
Country_Region        0
Positive              0
Recovered             0
Death                 0
Month             10769
dtype: int64
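
The 10,769 missing values equal the 10,903 rows minus the 134 calendar days between 2020-01-20 and 2020-06-01. That points to index alignment: a helper frame built from `pd.date_range` has one row per day, so assigning from it fills only the first 134 row positions. One possible fix, sketched here on a toy frame (`toy_df` is a stand-in, not the real dataset), is to derive the month from each row's own `Date` value.

```python
import pandas as pd

# Toy stand-in for covid_df: several rows can share the same date
toy_df = pd.DataFrame({
    "Date": ["2020-01-20", "2020-01-22", "2020-01-22", "2020-06-01", "2020-06-01"],
    "Positive": [1, 1, 1, 0, 0],
})

# Parse the date strings, then take the month name from each row's own date
toy_df["Month"] = pd.to_datetime(toy_df["Date"]).dt.month_name()

print(toy_df["Month"].tolist())
print(toy_df["Month"].isnull().sum())
```

Because the new column is computed from the frame's own `Date` column, every row gets a value regardless of how many rows share a date.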

## Data Analysis

In [166]:
covid_df.describe()

Unnamed: 0,Positive,Recovered,Death
count,10903.0,10903.0,10903.0
mean,17768.02,2409.22,947.59
std,93143.46,11254.21,5507.24
min,0.0,0.0,0.0
25%,44.0,0.0,0.0
50%,1026.0,0.0,6.0
75%,7440.0,500.5,136.0
max,1783570.0,171883.0,98536.0


In [168]:
covid_df['Continent'].describe()

count             10903
unique                6
top       North America
freq               6452
Name: Continent, dtype: object

The Continent column has 6 unique values. The most frequent is North America, which appears in 6,452 of the 10,903 rows.

In [169]:
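# Restrict to the numeric columns; pandas 2.0 and later raise an error when corr() meets text columns
covid_df[['Positive', 'Recovered', 'Death']].corr()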

Unnamed: 0,Positive,Recovered,Death
Positive,1.0,0.27,0.93
Recovered,0.27,1.0,0.33
Death,0.93,0.33,1.0


In [234]:
#Creating Mask to show which records match 'United States'
us_mask = (covid_df['Country_Region'] == 'United States')

#Slicing DataFrame to sum "Positive" column (a case count, so no currency symbol)
"{:,}".format(covid_df[us_mask]['Positive'].sum())

#Adding "US_Category" column to set value to "US" or "Other"
covid_df['US_Category'] = np.where(us_mask, 'US', 'Other')

#Using value_counts() function to get count and percentage of "US"
us_counts = covid_df['US_Category'].value_counts().apply(lambda x: "{:,}".format(x))

us_percents = covid_df['US_Category'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'

pd.concat([us_counts, us_percents], axis=1, keys=['US','Percentage'])



Unnamed: 0,US,Percentage
Other,5871,53.8%
US,5032,46.2%


In [231]:
#Check how many deaths happened in US vs rest of the world

In [None]:
#Check how many recovered in US vs rest of the world
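
The two open questions above could be approached along these lines. This is only a sketch under stated assumptions: the numbers in `toy_df` are made up for illustration, and `Recovered` and `Death` are assumed to be cumulative counts, so the per-country maximum is used instead of a sum over dates. On the real data, the country-level and state-level rows share the same `Country_Region` once `State` is dropped, so the maximum is assumed to come from the country-level row.

```python
import numpy as np
import pandas as pd

# Toy cumulative figures (made-up numbers, not taken from the dataset)
toy_df = pd.DataFrame({
    "Date": ["2020-05-31", "2020-06-01", "2020-05-31", "2020-06-01", "2020-06-01"],
    "Country_Region": ["United States", "United States", "Italy", "Italy", "Taiwan"],
    "Recovered": [90, 100, 40, 50, 5],
    "Death": [18, 20, 9, 10, 1],
})

# Cumulative columns should not be summed across dates; keep each country's peak value
latest = toy_df.groupby("Country_Region")[["Recovered", "Death"]].max()

# Label each country as "US" or "Other", then add up the per-country totals
category = np.where(latest.index == "United States", "US", "Other")
summary = latest.groupby(category).sum()

print(summary)
```

Grouping by country first keeps repeated dates from inflating the totals before the US and non-US figures are compared.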