<a href="https://colab.research.google.com/github/aarsanjani/meansquares/blob/master/weekly_CovidCases_CA_NY.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic: Investigate Covid-19 NY and CA data

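This Colab notebook loads county-level Covid-19 case and death data for California (CA) and New York (NY) from the Johns Hopkins University (JHU) repository and analyzes the trend in new cases with a 7-day moving average.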

In [1]:
!pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=174294e2ed655f344bf23e83b97c6e0cfff096c4b5add0b8598ff7e12857685a
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


## Import Libraries

In [70]:
import pandas as pd
from tqdm import tqdm

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import os
import wget
from pandas import Series
from datetime import datetime  # pandas.datetime is deprecated; use the standard library
from pandas.plotting import scatter_matrix, autocorrelation_plot
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV, TimeSeriesSplit
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import roc_curve, auc
import random
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from xgboost import XGBClassifier
from sklearn.mixture import GaussianMixture

## Mount Google Drive


In [3]:
os.environ["TF_CPP_MIN_LOG_LEVEL"]="2"
import warnings; warnings.simplefilter('ignore')

In [4]:
from google.colab import drive
# drive.mount('/content/drive')

In [5]:
location = "drive/Shared drives/the-mean-sqaures/the-mean-squares/Colab_Dataset/Dataset/"

In [6]:
!ls /content/drive/My\ Drive/MasterProject-Personal/data

ls: cannot access '/content/drive/My Drive/MasterProject-Personal/data': No such file or directory


# Data Load

## 1 Load County Population


In [7]:
county_population_US = pd.read_csv('https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_county_population_usafacts.csv',low_memory=False)
print(county_population_US.shape)

(3195, 4)


In [8]:
wget.download('https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_county_population_usafacts.csv')
county_population_US = pd.read_csv('covid_county_population_usafacts.csv',low_memory=False)
print(county_population_US.shape)




(3195, 4)


In [9]:
county_population_US.head(2)

Unnamed: 0,countyFIPS,County Name,State,population
0,0,Statewide Unallocated,AL,0
1,1001,Autauga County,AL,55869


## 2 Load Covid-19 case details (through August 21 in this run)

In [10]:
!ls '/content/drive/My Drive/MasterProject-Personal/data/'

ls: cannot access '/content/drive/My Drive/MasterProject-Personal/data/': No such file or directory


### Note about data:
Johns Hopkins University updates the data every day, so we pull it directly from the repository.

**US Confirmed url** :https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv

**US deaths url**: https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv

In [11]:
urls = ['https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv',
        'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv']

[wget.download(url) for url in urls]        

['time_series_covid19_confirmed_US.csv', 'time_series_covid19_deaths_US.csv']

In [12]:
confirmed_US = pd.read_csv('time_series_covid19_confirmed_US.csv',low_memory=False)
death_US = pd.read_csv('time_series_covid19_deaths_US.csv',low_memory=False)
print(confirmed_US.shape)
print(death_US.shape)
print(confirmed_US.head(2))
death_US.head(2)

(3340, 224)
(3340, 225)
        UID iso2 iso3  code3  ...  8/18/20 8/19/20 8/20/20 8/21/20
0  84001001   US  USA    840  ...     1235    1241    1240    1255
1  84001003   US  USA    840  ...     3906    3931    3957    3997

[2 rows x 224 columns]


Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Population,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,...,7/13/20,7/14/20,7/15/20,7/16/20,7/17/20,7/18/20,7/19/20,7/20/20,7/21/20,7/22/20,7/23/20,7/24/20,7/25/20,7/26/20,7/27/20,7/28/20,7/29/20,7/30/20,7/31/20,8/1/20,8/2/20,8/3/20,8/4/20,8/5/20,8/6/20,8/7/20,8/8/20,8/9/20,8/10/20,8/11/20,8/12/20,8/13/20,8/14/20,8/15/20,8/16/20,8/17/20,8/18/20,8/19/20,8/20/20,8/21/20
0,84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.539527,-86.644082,"Autauga, Alabama, US",55869,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,15,17,18,19,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,22,22
1,84001003,US,USA,840,1003.0,Baldwin,Alabama,US,30.72775,-87.722071,"Baldwin, Alabama, US",223234,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,11,11,12,13,13,14,14,14,15,15,16,17,17,17,17,17,20,20,21,21,22,23,23,23,23,23,24,25,25,29,29,29,29,29,29,29,30,30,31,32


In [13]:
#print(len(mask_data['state_name'].unique()))
print(len(confirmed_US['Province_State'].unique()))
confirmed_US['Province_State'].unique()

58


array(['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas',
       'California', 'Colorado', 'Connecticut', 'Delaware',
       'Diamond Princess', 'District of Columbia', 'Florida', 'Georgia',
       'Grand Princess', 'Guam', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Northern Mariana Islands', 'Ohio', 'Oklahoma',
       'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island',
       'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah',
       'Vermont', 'Virgin Islands', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

## Data cleaning

In [14]:
confirmed_US.columns[:11]

# Note: the first 11 columns are UID, iso2, iso3, code3, FIPS, Admin2, Province_State,
      # Country_Region, Lat, Long_, Combined_Key (confirmed_US has no Population column)


Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
       'Country_Region', 'Lat', 'Long_', 'Combined_Key'],
      dtype='object')

In [15]:
# date columns start at index 11 (the 12th column)
confirmed_dates = confirmed_US.columns[11:]
confirmed_dates

Index(['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       '1/28/20', '1/29/20', '1/30/20', '1/31/20',
       ...
       '8/12/20', '8/13/20', '8/14/20', '8/15/20', '8/16/20', '8/17/20',
       '8/18/20', '8/19/20', '8/20/20', '8/21/20'],
      dtype='object', length=213)

In [16]:
death_US.columns[:12]

Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
       'Country_Region', 'Lat', 'Long_', 'Combined_Key', 'Population'],
      dtype='object')

In [17]:
death_US.columns[10:]

Index(['Combined_Key', 'Population', '1/22/20', '1/23/20', '1/24/20',
       '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
       ...
       '8/12/20', '8/13/20', '8/14/20', '8/15/20', '8/16/20', '8/17/20',
       '8/18/20', '8/19/20', '8/20/20', '8/21/20'],
      dtype='object', length=215)

In [18]:
death_dates = death_US.columns[12:]
death_dates

Index(['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       '1/28/20', '1/29/20', '1/30/20', '1/31/20',
       ...
       '8/12/20', '8/13/20', '8/14/20', '8/15/20', '8/16/20', '8/17/20',
       '8/18/20', '8/19/20', '8/20/20', '8/21/20'],
      dtype='object', length=213)

#### Note: both sets of date columns start on 1/22/20 and end on 8/21/20 (213 dates each), so either one can be used
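The two files differ by one metadata column (`Population` appears only in the deaths file), so the positional slices `columns[11:]` and `columns[12:]` have to be kept in sync by hand. A minimal sketch of selecting the date columns by parsing the labels instead, on made-up frames; `date_columns` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd

def date_columns(frame):
    """Return the column labels that parse as M/D/YY dates (hypothetical helper)."""
    labels = pd.Series(frame.columns.astype(str))
    parsed = pd.to_datetime(labels, format="%m/%d/%y", errors="coerce")
    return frame.columns[parsed.notna().to_numpy()]

# Toy stand-ins for confirmed_US and death_US; the values are invented.
confirmed_toy = pd.DataFrame(
    {"UID": [1], "Combined_Key": ["A"], "1/22/20": [0], "1/23/20": [2]}
)
death_toy = pd.DataFrame(
    {"UID": [1], "Combined_Key": ["A"], "Population": [10], "1/22/20": [0], "1/23/20": [0]}
)

confirmed_toy_dates = date_columns(confirmed_toy)
death_toy_dates = date_columns(death_toy)

# True when both files cover exactly the same dates in the same order.
same_dates = confirmed_toy_dates.equals(death_toy_dates)
```

Selecting by label keeps working if the upstream files gain or lose a metadata column, which a fixed offset would not.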

In [19]:
confirmed_df_long = confirmed_US.melt(
    id_vars=['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
       'Country_Region', 'Lat', 'Long_', 'Combined_Key'],
       value_vars=confirmed_dates,
       var_name = 'Date',
       value_name = 'Confirmed'
)

death_df_long = death_US.melt(
    id_vars=['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
       'Country_Region', 'Lat', 'Long_', 'Combined_Key', 'Population'],
       value_vars=death_dates,
       var_name = 'Date',
       value_name = 'Deaths'
)
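To illustrate what `melt` does in the cell above, here is a minimal sketch on a made-up two-county table (the counts are invented). Each date column becomes one row per county, so 3,340 rows × 213 dates gives 711,420 long rows, consistent with the last index 711419 in the output that follows.

```python
import pandas as pd

# Toy wide table: one row per county, one column per date (values invented).
wide = pd.DataFrame({
    "FIPS": [1001, 1003],
    "Admin2": ["Autauga", "Baldwin"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 3],
})

# melt keeps the id columns and stacks every date column into (Date, Confirmed) pairs.
long = wide.melt(
    id_vars=["FIPS", "Admin2"],
    value_vars=["1/22/20", "1/23/20"],
    var_name="Date",
    value_name="Confirmed",
)
```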

In [20]:
confirmed_df_long.tail(10)

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Date,Confirmed
711410,84056029,US,USA,840,56029.0,Park,Wyoming,US,44.521575,-109.585283,"Park, Wyoming, US",8/21/20,156
711411,84056031,US,USA,840,56031.0,Platte,Wyoming,US,42.132991,-104.966331,"Platte, Wyoming, US",8/21/20,7
711412,84056033,US,USA,840,56033.0,Sheridan,Wyoming,US,44.790489,-106.886239,"Sheridan, Wyoming, US",8/21/20,112
711413,84056035,US,USA,840,56035.0,Sublette,Wyoming,US,42.765583,-109.913092,"Sublette, Wyoming, US",8/21/20,44
711414,84056037,US,USA,840,56037.0,Sweetwater,Wyoming,US,41.659439,-108.882788,"Sweetwater, Wyoming, US",8/21/20,288
711415,84056039,US,USA,840,56039.0,Teton,Wyoming,US,43.935225,-110.58908,"Teton, Wyoming, US",8/21/20,399
711416,84056041,US,USA,840,56041.0,Uinta,Wyoming,US,41.287818,-110.547578,"Uinta, Wyoming, US",8/21/20,283
711417,84090056,US,USA,840,90056.0,Unassigned,Wyoming,US,0.0,0.0,"Unassigned, Wyoming, US",8/21/20,0
711418,84056043,US,USA,840,56043.0,Washakie,Wyoming,US,43.904516,-107.680187,"Washakie, Wyoming, US",8/21/20,106
711419,84056045,US,USA,840,56045.0,Weston,Wyoming,US,43.839612,-104.567488,"Weston, Wyoming, US",8/21/20,11


In [21]:
confirmed_df_long[confirmed_df_long['FIPS'] == 36081].tail(30)

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Date,Confirmed
613161,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",7/23/20,0
616501,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",7/24/20,0
619841,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",7/25/20,0
623181,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",7/26/20,0
626521,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",7/27/20,0
629861,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",7/28/20,0
633201,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",7/29/20,0
636541,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",7/30/20,0
639881,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",7/31/20,0
643221,84036081,US,USA,840,36081.0,Queens,New York,US,40.710881,-73.816847,"Queens, New York, US",8/1/20,0


In [22]:
death_df_long.tail(10)

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Population,Date,Deaths
711410,84056029,US,USA,840,56029.0,Park,Wyoming,US,44.521575,-109.585283,"Park, Wyoming, US",29194,8/21/20,0
711411,84056031,US,USA,840,56031.0,Platte,Wyoming,US,42.132991,-104.966331,"Platte, Wyoming, US",8393,8/21/20,0
711412,84056033,US,USA,840,56033.0,Sheridan,Wyoming,US,44.790489,-106.886239,"Sheridan, Wyoming, US",30485,8/21/20,0
711413,84056035,US,USA,840,56035.0,Sublette,Wyoming,US,42.765583,-109.913092,"Sublette, Wyoming, US",9831,8/21/20,0
711414,84056037,US,USA,840,56037.0,Sweetwater,Wyoming,US,41.659439,-108.882788,"Sweetwater, Wyoming, US",42343,8/21/20,0
711415,84056039,US,USA,840,56039.0,Teton,Wyoming,US,43.935225,-110.58908,"Teton, Wyoming, US",23464,8/21/20,0
711416,84056041,US,USA,840,56041.0,Uinta,Wyoming,US,41.287818,-110.547578,"Uinta, Wyoming, US",20226,8/21/20,0
711417,84090056,US,USA,840,90056.0,Unassigned,Wyoming,US,0.0,0.0,"Unassigned, Wyoming, US",0,8/21/20,36
711418,84056043,US,USA,840,56043.0,Washakie,Wyoming,US,43.904516,-107.680187,"Washakie, Wyoming, US",7805,8/21/20,0
711419,84056045,US,USA,840,56045.0,Weston,Wyoming,US,43.839612,-104.567488,"Weston, Wyoming, US",6927,8/21/20,0


## Check California-New York data

In [23]:
state= ['California','New York']
confirmed_CA_df = confirmed_df_long[confirmed_df_long['Province_State'].isin(state)]
confirmed_CA_df.tail(5)

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Date,Confirmed
710039,84036115,US,USA,840,36115.0,Washington,New York,US,43.311538,-73.430434,"Washington, New York, US",8/21/20,264
710040,84036117,US,USA,840,36117.0,Wayne,New York,US,43.154944,-77.029765,"Wayne, New York, US",8/21/20,272
710041,84036119,US,USA,840,36119.0,Westchester,New York,US,41.162784,-73.757417,"Westchester, New York, US",8/21/20,36621
710042,84036121,US,USA,840,36121.0,Wyoming,New York,US,42.701451,-78.221996,"Wyoming, New York, US",8/21/20,121
710043,84036123,US,USA,840,36123.0,Yates,New York,US,42.635055,-77.103699,"Yates, New York, US",8/21/20,59


In [24]:
confirmed_CA_df.Province_State.unique()

array(['California', 'New York'], dtype=object)

## Merging Confirmed and Death data

In [25]:
full_table = confirmed_df_long.merge(
    right=death_df_long,
    how='left',
    on=[ 'UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
       'Country_Region', 'Lat', 'Long_', 'Combined_Key','Date']
)

full_table.head(10)

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Date,Confirmed,Population,Deaths
0,84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.539527,-86.644082,"Autauga, Alabama, US",1/22/20,0,55869,0
1,84001003,US,USA,840,1003.0,Baldwin,Alabama,US,30.72775,-87.722071,"Baldwin, Alabama, US",1/22/20,0,223234,0
2,84001005,US,USA,840,1005.0,Barbour,Alabama,US,31.868263,-85.387129,"Barbour, Alabama, US",1/22/20,0,24686,0
3,84001007,US,USA,840,1007.0,Bibb,Alabama,US,32.996421,-87.125115,"Bibb, Alabama, US",1/22/20,0,22394,0
4,84001009,US,USA,840,1009.0,Blount,Alabama,US,33.982109,-86.567906,"Blount, Alabama, US",1/22/20,0,57826,0
5,84001011,US,USA,840,1011.0,Bullock,Alabama,US,32.100305,-85.712655,"Bullock, Alabama, US",1/22/20,0,10101,0
6,84001013,US,USA,840,1013.0,Butler,Alabama,US,31.753001,-86.680575,"Butler, Alabama, US",1/22/20,0,19448,0
7,84001015,US,USA,840,1015.0,Calhoun,Alabama,US,33.774837,-85.826304,"Calhoun, Alabama, US",1/22/20,0,113605,0
8,84001017,US,USA,840,1017.0,Chambers,Alabama,US,32.913601,-85.390727,"Chambers, Alabama, US",1/22/20,0,33254,0
9,84001019,US,USA,840,1019.0,Cherokee,Alabama,US,34.17806,-85.60639,"Cherokee, Alabama, US",1/22/20,0,26196,0
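The merge above joins on twelve columns, including the float columns `Lat` and `Long_`. A hedged sketch of a stricter variant on toy data: it assumes that `UID` plus `Date` already identifies a row (an assumption, not verified against the full files) and uses the `validate` argument of `DataFrame.merge` so that duplicated keys raise an error rather than silently multiplying rows.

```python
import pandas as pd

keys = ["UID", "Date"]
confirmed_toy = pd.DataFrame(
    {"UID": [1, 1], "Date": ["1/22/20", "1/23/20"], "Confirmed": [0, 2]}
)
deaths_toy = pd.DataFrame(
    {"UID": [1, 1], "Date": ["1/22/20", "1/23/20"], "Deaths": [0, 1]}
)

# validate="one_to_one" makes pandas raise MergeError if either side repeats a key.
merged_toy = confirmed_toy.merge(deaths_toy, how="left", on=keys, validate="one_to_one")

# A duplicated key on the right side is rejected instead of adding extra rows.
deaths_dup = pd.concat([deaths_toy, deaths_toy.iloc[[0]]], ignore_index=True)
try:
    confirmed_toy.merge(deaths_dup, how="left", on=keys, validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```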


In [26]:
#full_table['Date'] = pd.to_datetime(full_table['Date'])


In [27]:
excluded = ['Grand Princess', 'Diamond Princess', 'Northern Mariana Islands',
            'American Samoa', 'Guam', 'Virgin Islands']
ship_data = full_table['Province_State'].isin(excluded)

full_ship = full_table[ship_data]


In [28]:
# Removing cruise-ship and territory rows from the state data

full_table = full_table[~(ship_data)]

## Group data

In [29]:
full_grouped = full_table.groupby(['Date', 'Province_State','FIPS'])[['Confirmed', 'Deaths']].sum().reset_index()

full_grouped.tail(5)

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths
708007,8/9/20,Wyoming,56041.0,278,0
708008,8/9/20,Wyoming,56043.0,77,0
708009,8/9/20,Wyoming,56045.0,5,0
708010,8/9/20,Wyoming,80056.0,0,0
708011,8/9/20,Wyoming,90056.0,0,27


In [30]:
NY_full_grouped = full_grouped[full_grouped['Province_State'] == 'New York']
NY_full_grouped[NY_full_grouped['FIPS'] == 36081.0]

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths
1932,1/22/20,New York,36081.0,0,0
5256,1/23/20,New York,36081.0,0,0
8580,1/24/20,New York,36081.0,0,0
11904,1/25/20,New York,36081.0,0,0
15228,1/26/20,New York,36081.0,0,0
...,...,...,...,...,...
693324,8/5/20,New York,36081.0,0,0
696648,8/6/20,New York,36081.0,0,0
699972,8/7/20,New York,36081.0,0,0
703296,8/8/20,New York,36081.0,0,0


In [31]:
full_grouped.shape

(708012, 5)

#### Add new cases and new deaths by subtracting the previous day's cumulative counts

In [32]:
full_grouped.dtypes

Date               object
Province_State     object
FIPS              float64
Confirmed           int64
Deaths              int64
dtype: object

In [33]:
full_grouped_ = full_grouped.copy()

In [63]:
full_grouped = full_grouped_.copy()

In [36]:
def fixDate(x):
  # Convert 'M/D/YY' (e.g. '1/22/20') to ISO 'YYYY-MM-DD' so that
  # sorting the strings also sorts them chronologically
  m, d, y = x.split('/')
  return '20' + y + '-' + m.zfill(2) + '-' + d.zfill(2)

In [37]:
full_grouped['Date'] = full_grouped['Date'].apply(fixDate)

full_grouped.head(2)

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths
0,2020-01-22,Alabama,1001.0,0,0
1,2020-01-22,Alabama,1003.0,0,0
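As an alternative to the hand-written `fixDate`, pandas can parse the same strings directly. A minimal sketch on three sample labels, assuming every value follows the `M/D/YY` pattern seen in the JHU column names:

```python
import pandas as pd

raw = pd.Series(["1/22/20", "8/9/20", "7/19/20"])

# Parse M/D/YY explicitly, then format back to ISO strings like fixDate returns.
parsed = pd.to_datetime(raw, format="%m/%d/%y")
iso = parsed.dt.strftime("%Y-%m-%d")
```

Keeping the parsed `datetime64` values, rather than converting back to strings, would also let later steps sort and resample by date directly.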


In [86]:
def computeNewCases(FIPS):
  # Return one county's rows with daily 'New cases' and 'New deaths' added,
  # or None if the county never reported a confirmed case.
  # Assumes 'Date' sorts chronologically (ISO strings from fixDate).
  countyData = full_grouped[full_grouped['FIPS'] == FIPS]
  countyData = countyData.sort_values(by=['Date'])

  if countyData.Confirmed.max() > 0:
    # Day-over-day difference of the cumulative counts
    temp = countyData.groupby(['Date'])[['Confirmed', 'Deaths']]
    temp = temp.sum().diff().reset_index()
    temp.columns = ['Date', 'New cases', 'New deaths']
    print(temp)
    countyData = pd.merge(countyData, temp, on=['Date'])
    print(countyData)
    # The first day has no previous record, so its NaN is filled with 0
    countyData = countyData.fillna(0)
    # fixing data types
    cols = ['New cases', 'New deaths']
    countyData[cols] = countyData[cols].astype('int')
    return countyData


In [87]:
computeNewCases(36041)

        Date  New cases  New deaths
0    1/22/20        NaN         NaN
1    1/23/20        0.0         0.0
2    1/24/20        0.0         0.0
3    1/25/20        0.0         0.0
4    1/26/20        0.0         0.0
..       ...        ...         ...
208   8/5/20        0.0         0.0
209   8/6/20        1.0         0.0
210   8/7/20        0.0         0.0
211   8/8/20        0.0         0.0
212   8/9/20        0.0         0.0

[213 rows x 3 columns]
        Date Province_State     FIPS  Confirmed  Deaths  New cases  New deaths
0    1/22/20       New York  36041.0          0       0        NaN         NaN
1    1/23/20       New York  36041.0          0       0        0.0         0.0
2    1/24/20       New York  36041.0          0       0        0.0         0.0
3    1/25/20       New York  36041.0          0       0        0.0         0.0
4    1/26/20       New York  36041.0          0       0        0.0         0.0
..       ...            ...      ...        ...     ...        ...    

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths,New cases,New deaths
0,1/22/20,New York,36041.0,0,0,0,0
1,1/23/20,New York,36041.0,0,0,0,0
2,1/24/20,New York,36041.0,0,0,0,0
3,1/25/20,New York,36041.0,0,0,0,0
4,1/26/20,New York,36041.0,0,0,0,0
...,...,...,...,...,...,...,...
208,8/5/20,New York,36041.0,7,0,0,0
209,8/6/20,New York,36041.0,8,0,1,0
210,8/7/20,New York,36041.0,8,0,0,0
211,8/8/20,New York,36041.0,8,0,0,0


The code below runs for every FIPS code; the estimated runtime is about **10 minutes** (the run recorded below took under 5).

In [88]:
frames = []

for fips in tqdm(full_grouped.FIPS.unique()):
  countyData = full_grouped[full_grouped['FIPS'] == fips]
  countyData = countyData.sort_values(by=['Date'])

  # Skip counties that never reported a confirmed case
  if countyData.Confirmed.max() > 0:
    # Day-over-day difference of the cumulative counts
    temp = countyData.groupby(['Date'])[['Confirmed', 'Deaths']]
    temp = temp.sum().diff().reset_index()
    temp.columns = ['Date', 'New cases', 'New deaths']
    countyData = pd.merge(countyData, temp, on=['Date'])
    # filling na with 0
    countyData = countyData.fillna(0)
    # fixing data types
    cols = ['New cases', 'New deaths']
    countyData[cols] = countyData[cols].astype('int')
    frames.append(countyData)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames)

100%|██████████| 3324/3324 [04:36<00:00, 12.03it/s]
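The per-county loop can also be expressed as a single grouped `diff`, which avoids filtering the full table once per FIPS code. This is a sketch on invented toy data, not the notebook's method: it assumes ISO-formatted dates, and unlike the loop it keeps counties that never reported a case.

```python
import pandas as pd

# Toy cumulative counts for two counties with ISO dates (values invented).
toy = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-23", "2020-01-24"] * 2,
    "FIPS": [1001] * 3 + [1003] * 3,
    "Confirmed": [0, 2, 5, 1, 1, 4],
    "Deaths": [0, 0, 1, 0, 1, 1],
})

toy = toy.sort_values(["FIPS", "Date"]).reset_index(drop=True)

# diff() within each FIPS group gives the day-over-day change; the first day of
# each county has no predecessor, so its NaN is filled with 0 as in the loop.
daily = toy.groupby("FIPS")[["Confirmed", "Deaths"]].diff().fillna(0).astype(int)
toy["New cases"] = daily["Confirmed"]
toy["New deaths"] = daily["Deaths"]
```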


In [90]:
print(df.shape)
df.tail(30)

(691185, 7)


Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths,New cases,New deaths
183,7/3/20,Wyoming,90056.0,0,19,0,-6
184,7/30/20,Wyoming,90056.0,0,25,0,6
185,7/31/20,Wyoming,90056.0,0,25,0,0
186,7/4/20,Wyoming,90056.0,0,19,0,-6
187,7/5/20,Wyoming,90056.0,0,19,0,0
188,7/6/20,Wyoming,90056.0,0,19,0,0
189,7/7/20,Wyoming,90056.0,0,20,0,1
190,7/8/20,Wyoming,90056.0,0,20,0,0
191,7/9/20,Wyoming,90056.0,0,20,0,0
192,8/1/20,Wyoming,90056.0,0,25,0,5


In [69]:
df.shape

(691824, 7)

In [91]:
df[df['FIPS'] == 56041.0].tail(50)

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths,New cases,New deaths
163,7/11/20,Wyoming,56041.0,201,0,1,0
164,7/12/20,Wyoming,56041.0,202,0,1,0
165,7/13/20,Wyoming,56041.0,205,0,3,0
166,7/14/20,Wyoming,56041.0,208,0,3,0
167,7/15/20,Wyoming,56041.0,208,0,0,0
168,7/16/20,Wyoming,56041.0,217,0,9,0
169,7/17/20,Wyoming,56041.0,219,0,2,0
170,7/18/20,Wyoming,56041.0,221,0,2,0
171,7/19/20,Wyoming,56041.0,221,0,0,0
172,7/2/20,Wyoming,56041.0,180,0,-41,0
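The value of -41 new cases on 7/2/20 in the output above is what a lexicographic sort of raw `M/D/YY` strings produces: `'7/2/20'` sorts after `'7/19/20'`, so the difference is taken against the wrong day (180 - 221 = -41). A minimal sketch reproducing this with three of the rows shown above and comparing it with a sort on parsed dates:

```python
import pandas as pd

# Cumulative counts copied from the FIPS 56041 rows shown above.
rows = pd.DataFrame({
    "Date": ["7/2/20", "7/18/20", "7/19/20"],
    "Confirmed": [180, 221, 221],
})

# Sorting the raw strings puts "7/2/20" after "7/19/20".
by_string = rows.sort_values("Date")
string_diff = by_string["Confirmed"].diff().fillna(0).astype(int).tolist()

# Sorting on parsed dates restores chronological order.
rows["Parsed"] = pd.to_datetime(rows["Date"], format="%m/%d/%y")
by_date = rows.sort_values("Parsed")
date_diff = by_date["Confirmed"].diff().fillna(0).astype(int).tolist()
```

Only three dates are included here, so the 41 in `date_diff` spans 7/2 to 7/18 rather than a single day; the point is the sign and the ordering.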


In [92]:
# merging new values
full_grouped = df.copy()

In [93]:
full_grouped.tail(5)

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths,New cases,New deaths
208,8/5/20,Wyoming,90056.0,0,26,0,0
209,8/6/20,Wyoming,90056.0,0,26,0,0
210,8/7/20,Wyoming,90056.0,0,27,0,1
211,8/8/20,Wyoming,90056.0,0,27,0,0
212,8/9/20,Wyoming,90056.0,0,27,0,0


In [94]:
state= ['California','New York']
ca_df = full_grouped[full_grouped['Province_State'].isin(state)]
ca_df.tail(10)

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths,New cases,New deaths
203,8/2/20,New York,90036.0,0,0,0,0
204,8/20/20,New York,90036.0,0,0,0,0
205,8/21/20,New York,90036.0,0,0,0,0
206,8/3/20,New York,90036.0,0,0,0,0
207,8/4/20,New York,90036.0,0,0,0,0
208,8/5/20,New York,90036.0,0,0,0,0
209,8/6/20,New York,90036.0,0,0,0,0
210,8/7/20,New York,90036.0,0,0,0,0
211,8/8/20,New York,90036.0,0,0,0,0
212,8/9/20,New York,90036.0,0,0,0,0


In [95]:
county_population_US.head(2)

Unnamed: 0,countyFIPS,County Name,State,population,FIPS
0,0,Statewide Unallocated,AL,0,0
1,1001,Autauga County,AL,55869,1001


In [96]:
county_population_US['FIPS'] = county_population_US['countyFIPS']
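The JHU tables store `FIPS` as a float (for example `1001.0`), while `countyFIPS` is an integer. A hedged sketch of normalizing both sides to 5-character strings before joining; `normalize_fips` is a hypothetical helper and assumes there are no missing values.

```python
import pandas as pd

def normalize_fips(values):
    """Convert integer-like FIPS codes to 5-character zero-padded strings (hypothetical helper)."""
    return pd.Series(values).astype(float).astype(int).astype(str).str.zfill(5)

jhu_fips = normalize_fips([1001.0, 36081.0])   # float codes as in the JHU files
usafacts_fips = normalize_fips([1001, 36081])  # integer codes as in countyFIPS
```

String keys also preserve the leading zero of codes such as Alabama's `01001`, which integers and floats drop.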

In [97]:
ca_df.shape

(25134, 7)

# Merge with county population on FIPS

In [98]:
merged = pd.merge(ca_df,county_population_US,how='inner' ,on=['FIPS'])
print(merged.shape)

(24708, 11)
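The inner merge reduces the table from 25,134 to 24,708 rows, a loss of 426 rows (two FIPS codes × 213 dates would account for exactly that many). One way to see which codes have no population record is a left merge with `indicator=True`. The sketch below uses invented toy frames; only the FIPS values are taken from the outputs above.

```python
import pandas as pd

# Toy case and population tables (counts and populations are invented).
cases_toy = pd.DataFrame({"FIPS": [36081, 36123, 90036], "Confirmed": [10, 57, 0]})
population_toy = pd.DataFrame({"FIPS": [36081, 36123], "population": [1000, 2000]})

# indicator=True adds a _merge column saying which side each row came from.
check = cases_toy.merge(population_toy, how="left", on="FIPS", indicator=True)

# Rows marked left_only have case data but no matching population row.
unmatched = check.loc[check["_merge"] == "left_only", "FIPS"].tolist()
```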


In [99]:
merged.tail(2)

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths,New cases,New deaths,countyFIPS,County Name,State,population
24706,8/8/20,New York,36123.0,56,7,0,0,36123,Yates County,NY,24913
24707,8/9/20,New York,36123.0,57,7,1,0,36123,Yates County,NY,24913


## Visualization

In [100]:
full_grouped.head(3)

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths,New cases,New deaths
0,1/22/20,Alabama,1001.0,0,0,0,0
1,1/23/20,Alabama,1001.0,0,0,0,0
2,1/24/20,Alabama,1001.0,0,0,0,0


In [52]:
merged.head(3)


Unnamed: 0,Date,Province_State,Confirmed,Deaths,New cases,New deaths,countyFIPS,County Name,State,population,FIPS


In [101]:
import pandas as pd
import altair as alt
#full_grouped = merged
ca_df = full_grouped[full_grouped['Province_State'] == 'California']
ny_df = full_grouped[full_grouped['Province_State'] == 'New York']
queens_df = full_grouped[(full_grouped['Province_State'] == 'New York') & (full_grouped['FIPS'] == 36081)]

ca_df.shape

(12567, 7)

In [102]:
def total_new_cases(frame):
  # Sum 'New cases' over all counties for each date and state
  return (
      frame
      .groupby(['Date', 'Province_State'])['New cases']
      .sum()
      .reset_index()
  )

ca_total = total_new_cases(ca_df)
ny_total = total_new_cases(ny_df)
queens_county_total = total_new_cases(queens_df)

In [103]:
base_ca = alt.Chart(ca_total).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)

base_ny = alt.Chart(ny_total).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)

base_queens = alt.Chart(queens_county_total).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)

In [104]:
red = alt.value("#f54242")
##Ca data
#base_ca.encode(y='Confirmed').properties(title='Total Confirmed') | base_ca.encode(y='Deaths',color = red).properties(title='Total deaths')
base_ca.encode(y='New cases').properties(title='CA State- New cases')

In [57]:
#base_ny.encode(y='Confirmed').properties(title='Total Confirmed') | base_ny.encode(y='Deaths',color = red).properties(title='Total deaths')

base_ny.encode(y='New cases').properties(title='NY state - New cases')


In [58]:
base_queens.encode(y="New cases").properties(title='Queens County New Cases')

# Rolling Average

In [59]:
queens_county_total.head()

Unnamed: 0,index,New cases


In [60]:
queens_county_total.iloc[:,2]

IndexError: ignored

In [None]:
base_queens.encode(y="New cases").properties(title='Queens County New Cases')

In [None]:
queens_county_total['rolling_average'] = queens_county_total['New cases'].rolling(window=7).mean()

base_queens = alt.Chart(queens_county_total).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)


In [None]:
queens_county_total.tail(10)

In [None]:
base_queens.encode(y="rolling_average").properties(title='Queens County- 7 day rolling average (New Cases)')


In [None]:
bar = base_queens.mark_bar().encode(y="New cases")

line =  base_queens.mark_line(color='red').encode(
    y='rolling_average'
)

(bar + line).properties(title='Queens County - New Cases and rolling average ',width=600)
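A small worked example of the 7-day rolling mean used above, on an invented series of daily counts. It shows why the first six values are missing and how `min_periods` changes that; the data is assumed to be in chronological order.

```python
import pandas as pd

# Eight invented daily counts in chronological order.
new_cases = pd.Series([0, 7, 14, 7, 0, 7, 14, 21])

# window=7 needs seven observations, so the first six results are NaN.
rolling_average = new_cases.rolling(window=7).mean()

# min_periods=1 averages whatever is available instead of returning NaN.
rolling_partial = new_cases.rolling(window=7, min_periods=1).mean()
```

Here the seventh value is the mean of the first seven counts (49 / 7 = 7.0) and the eighth is the mean of counts two through eight (70 / 7 = 10.0).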

# New Section

In [None]:
fractalData_combo1

In [None]:
test = fractalData_combo1[fractalData_combo1['County Name'] == 'San Diego County']
test.shape
test

In [None]:
cluster_perf_df.head()

In [None]:
cluster_perf_df1.head() # min

In [None]:
cluster_perf_df1_2.head() #max

In [None]:
test1 = pd.merge(test,week_df, how='inner', on ='WeekNumber')
test1.tail(10)

In [None]:
red = alt.value("#f54242")
base_orange.encode(y='Confirmed').properties(title='Total Confirmed') 

# Result Comparison

In [None]:
# considering for Final comparison
combo1_cluster2

In [None]:
combo2_cluster1

In [None]:
combo2_cluster2

In [None]:
combo3_cluster1

In [None]:
combo3_cluster2

## Data Exploration

In [None]:
import altair as alt

In [None]:
week_df

In [None]:
merged.dtypes

In [None]:
LA_county = merged[merged['County Name'] == 'Los Angeles County']
LA_county.shape

In [None]:
# Note: despite the variable name, this filter selects San Diego County
orange_county = merged[merged['County Name'] == 'San Diego County']


In [None]:
base_la = alt.Chart(LA_county).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)

base_orange = alt.Chart(orange_county).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)

In [None]:
red = alt.value("#f54242")
base_orange.encode(y='Confirmed').properties(title='Total Confirmed') 


In [105]:
ca_df = full_grouped[full_grouped['Province_State'] == 'California']
ny_df = full_grouped[full_grouped['Province_State'] == 'New York']
tx_df = full_grouped[full_grouped['Province_State'] == 'Texas']

In [106]:
ny_df.head()

Unnamed: 0,Date,Province_State,FIPS,Confirmed,Deaths,New cases,New deaths
0,1/22/20,New York,36001.0,0,0,0,0
1,1/23/20,New York,36001.0,0,0,0,0
2,1/24/20,New York,36001.0,0,0,0,0
3,1/25/20,New York,36001.0,0,0,0,0
4,1/26/20,New York,36001.0,0,0,0,0


In [107]:
ny_df.to_csv('NY-CovidAug22.csv',index=False)

In [None]:
base_ca = alt.Chart(ca_df).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)

base_ny = alt.Chart(ny_df).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)

base_tx = alt.Chart(tx_df).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)

In [None]:
red = alt.value("#f54242")
base_ca.encode(y='Confirmed').properties(title='Total Confirmed') | base_ca.encode(y='Deaths',color = red).properties(title='Total deaths')


In [None]:
base_ny.encode(y='Confirmed').properties(title='Total Confirmed') | base_ny.encode(y='Deaths',color = red).properties(title='Total deaths')


In [None]:
base_tx.encode(y='Confirmed').properties(title='Total Confirmed') | base_tx.encode(y='Deaths',color = red).properties(title='Total deaths')


In [None]:
base_ca.encode(y='New cases').properties(title='Total New Cases') | base_ca.encode(y='New deaths',color = red).properties(title='Total New deaths')


## Steps to approach the problem

* Predict how the 'New cases' count changed in each state after its mask mandate
* Check how the type of **rule** affects the 'New cases' count:
  1. Entire State
  2. Entire Territory
  3. Parts of State
  4. Entire State (Employees Only)
  5. Parts of State (Employees Only)
  6. No
  7. Masks strongly recommended, provides masks for free
  8. Entire State (Some Employees)
* Include or identify latent variables, such as the dates of long weekends, rallies, and BLM protests, and their impact on the 'New cases' count
* Identify a data source for how well people comply with the rule
* Political party governing the state
* Population
* Epicenter city of each state
* Type of mask (?)



## Reference

* https://towardsdatascience.com/covid-19-data-processing-58aaa3663f6