### Imports

In [None]:
!gdown 128UP6X4kbWVjjOKt4vB9bqqPVeT16cwL -O "covid.csv"
!gdown 1yj5Pa_Zck6VNf1JgkdCuErUKl5FLuoAd -O "hatecrime.csv"
!gdown 1yigT-1eM5Ki-uJA4FGpnt5bQDM0PtlKr -O "15m_cleaned_tweets.csv"
!gdown 19WLK_YzFvPnaEko-WllwClS0ZMVRdjHk -O  "stringency.csv"

Downloading...
From: https://drive.google.com/uc?id=128UP6X4kbWVjjOKt4vB9bqqPVeT16cwL
To: /content/covid.csv
100% 5.10M/5.10M [00:00<00:00, 154MB/s]
Downloading...
From: https://drive.google.com/uc?id=1yj5Pa_Zck6VNf1JgkdCuErUKl5FLuoAd
To: /content/hatecrime.csv
100% 54.6M/54.6M [00:01<00:00, 53.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1yigT-1eM5Ki-uJA4FGpnt5bQDM0PtlKr
To: /content/15m_cleaned_tweets.csv
100% 86.5M/86.5M [00:01<00:00, 60.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=19WLK_YzFvPnaEko-WllwClS0ZMVRdjHk
To: /content/stringency.csv
100% 43.4k/43.4k [00:00<00:00, 23.6MB/s]


In [None]:
import pandas as pd

### COVID

This dataset is "United States COVID-19 Cases and Deaths by State over Time", obtained from the Centers for Disease Control and Prevention (CDC).

The cells below clean and preprocess the dataframe named "covid". Rows with a negative "new_case" value are removed first. The "submission_date" column is then converted to a datetime type so that rows can be sorted and filtered by date. The dataframe is sorted by date and state, only the columns "submission_date", "state", and "new_case" are kept, and "submission_date" is renamed to "date". Next, the rows are restricted to the range "2020-01-01" to "2021-03-31", because this is the period that overlaps with the other datasets and so allows them to be combined and analyzed together.

The dataframe is then pivoted with "date" as the index and "state" as the columns, missing values are filled with 0, and "date" is set as the index. A check of the columns shows 60 locations rather than 50, so every location that is not one of the 50 US states is dropped. The daily values are then resampled to monthly sums. Finally, the dataframe is melted with id_vars="date", var_name="state", and value_name="covid_cases", which gives one row per month and state and makes it easy to aggregate and filter by date and state. The last step checks the shape of the dataframe to confirm that it contains the expected number of states and dates.
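To make the pivot, monthly sum, and melt steps concrete, here is a minimal sketch on a hypothetical four-row frame. The names `toy`, `wide`, `monthly`, and `tidy` are illustrative only, and the monthly sum uses a `groupby` on calendar-month periods as a stand-in for the `resample` call used in the notebook.

```python
import pandas as pd

# Hypothetical toy data with the same three columns kept from the CDC file.
toy = pd.DataFrame({
    "submission_date": ["01/30/2020", "01/31/2020", "02/01/2020", "01/31/2020"],
    "state": ["AK", "AK", "AK", "AL"],
    "new_case": [1, 2, 3, 4],
})
toy["date"] = pd.to_datetime(toy["submission_date"], format="%m/%d/%Y")

# Wide layout: one row per date, one column per state; missing reports become 0.
wide = toy.pivot(index="date", columns="state", values="new_case").fillna(0)

# Monthly totals: group the daily rows by calendar month and sum them.
monthly = wide.groupby(wide.index.to_period("M")).sum()
monthly.index = monthly.index.to_timestamp(how="end").normalize()
monthly.index.name = "date"

# Long layout: one row per (month, state) pair.
tidy = monthly.reset_index().melt(
    id_vars="date", var_name="state", value_name="covid_cases"
)
print(tidy)
```

The result has one row for each combination of month and state, which is the layout the remaining cells in this section produce for the full dataset.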




In [None]:
covid = pd.read_csv("covid.csv")
covid.head()

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
0,03/11/2021,KS,297229,241035.0,56194.0,0,0.0,4851,,,0,0.0,03/12/2021 03:20:13 PM,Agree,
1,12/01/2021,ND,163565,135705.0,27860.0,589,220.0,1907,,,9,0.0,12/02/2021 02:35:20 PM,Agree,Not agree
2,01/02/2022,AS,11,,,0,0.0,0,,,0,0.0,01/03/2022 03:18:16 PM,,
3,11/22/2021,AL,841461,620483.0,220978.0,703,357.0,16377,12727.0,3650.0,7,3.0,11/22/2021 12:00:00 AM,Agree,Agree
4,05/30/2022,AK,251425,,,0,0.0,1252,,,0,0.0,05/31/2022 01:20:20 PM,,


In [None]:
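# Drop rows with negative daily counts; .copy() avoids a SettingWithCopyWarning on later assignments.
covid = covid[covid["new_case"]>=0].copy()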

In [None]:
# Dates in the file are MM/DD/YYYY, so parse them with an explicit format.
covid["submission_date"] = pd.to_datetime(covid["submission_date"], format="%m/%d/%Y")

In [None]:
covid = covid.sort_values(by=["submission_date","state"])[["submission_date","state","new_case"]]

In [None]:
covid = covid.reset_index(drop=True)

In [None]:
covid.rename(columns={"submission_date":"date"},inplace=True)

In [None]:
covid = covid[(covid["date"]>='2020-01-01') & (covid["date"]<="2021-03-31")]

In [None]:
covid = covid.pivot(index='date', columns='state')['new_case'].reset_index().rename_axis(None,axis=1).fillna(0)

In [None]:
covid.set_index('date',inplace=True) 

In [None]:
covid.shape

(435, 60)

In [None]:
us_abbreviations = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
]
for state in covid.columns:
  if state not in us_abbreviations:
    print(state)

AS
DC
FSM
GU
MP
NYC
PR
PW
RMI
VI
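
The membership test above can also build the list of columns to drop, which avoids typing the labels by hand. A minimal sketch on a hypothetical four-column frame follows; `wide`, `non_states`, and `states_only` are illustrative names, and the two-entry `us_abbreviations` here is a shortened stand-in for the full list.

```python
import pandas as pd

# Hypothetical wide frame with two states and two non-state jurisdictions.
wide = pd.DataFrame(
    {"NY": [5, 6], "NYC": [7, 8], "TX": [1, 2], "PR": [3, 4]},
    index=pd.to_datetime(["2020-03-01", "2020-03-02"]),
)

# Shortened stand-in for the 50-entry us_abbreviations list.
us_abbreviations = ["NY", "TX"]

# Collect every column label that is not a state abbreviation, then drop them.
non_states = [c for c in wide.columns if c not in us_abbreviations]
states_only = wide.drop(columns=non_states)

print(non_states)
print(list(states_only.columns))
```

One thing to keep in mind: the CDC file appears to report NYC separately from the rest of NY, so dropping the NYC column leaves New York City's cases out of the NY total. If that is not intended, the NYC column could be added to NY before the drop.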


In [None]:
# Drop every location that is not one of the 50 US states.
covid.drop(
    columns=["AS", "DC", "FSM", "GU", "MP", "NYC", "PR", "PW", "RMI", "VI"],
    inplace=True,
)

In [None]:
covid.head()

date,AK,AL,AR,AZ,CA,CO,CT,DE,FL,GA,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
2020-01-22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2020-01-24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-26,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# "M" resamples to month-end bins; pandas 2.2+ prefers the alias "ME" and warns on "M".
covid = covid.resample("M").sum()

In [None]:
covid.reset_index(inplace=True)

In [None]:
covid = covid.melt(id_vars="date",var_name="state",value_name="covid_cases")

In [None]:
covid.head()

Unnamed: 0,date,state,covid_cases
0,2020-01-31,AK,0.0
1,2020-02-29,AK,0.0
2,2020-03-31,AK,128.0
3,2020-04-30,AK,227.0
4,2020-05-31,AK,108.0


In [None]:
covid.shape

(750, 3)
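
The expected row count can be worked out independently as a sanity check: the filtered window runs from January 2020 to March 2021, and each month should contribute one row per state. A minimal sketch, where `months`, `n_states`, and `expected_rows` are illustrative names:

```python
import pandas as pd

# Calendar months covered by the filtered window (2020-01 through 2021-03).
months = pd.period_range("2020-01", "2021-03", freq="M")

# One row per (month, state) pair after melting.
n_states = 50
expected_rows = len(months) * n_states

print(len(months), expected_rows)
```

This gives 15 months and 750 rows, which matches the shape reported above.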