### Imports

In [None]:
!gdown 128UP6X4kbWVjjOKt4vB9bqqPVeT16cwL -O "covid.csv"
!gdown 1yj5Pa_Zck6VNf1JgkdCuErUKl5FLuoAd -O "hatecrime.csv"
!gdown 1yigT-1eM5Ki-uJA4FGpnt5bQDM0PtlKr -O "15m_cleaned_tweets.csv"
!gdown 19WLK_YzFvPnaEko-WllwClS0ZMVRdjHk -O "stringency.csv"

Downloading...
From: https://drive.google.com/uc?id=128UP6X4kbWVjjOKt4vB9bqqPVeT16cwL
To: /content/covid.csv
100% 5.10M/5.10M [00:00<00:00, 154MB/s]
Downloading...
From: https://drive.google.com/uc?id=1yj5Pa_Zck6VNf1JgkdCuErUKl5FLuoAd
To: /content/hatecrime.csv
100% 54.6M/54.6M [00:01<00:00, 53.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1yigT-1eM5Ki-uJA4FGpnt5bQDM0PtlKr
To: /content/15m_cleaned_tweets.csv
100% 86.5M/86.5M [00:01<00:00, 60.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=19WLK_YzFvPnaEko-WllwClS0ZMVRdjHk
To: /content/stringency.csv
100% 43.4k/43.4k [00:00<00:00, 23.6MB/s]


In [None]:
import pandas as pd

### Tweets

This section processes the anti-Asian hate tweets. The file had already been hydrated and cleaned once in order to recover each user's location. The original dataset from the paper contained 200M tweets; it was reduced to 15M to speed up processing.

The dataframe was filtered to keep only rows where "BERT_label" equals 1, which marks a tweet as hateful. Tweets labelled 0 (no hate) were discarded, which considerably reduced the dataset. The data was then grouped by date and location, the rows in each group were counted, and the index was reset. The "created_at" column was converted to datetime and the data was sorted by date and location. Finally, "created_at" and "BERT_label" were renamed to "date" and "hate_tweets" respectively: after counting, the column no longer flags an individual hateful tweet but holds the number of hate tweets per location and date.

The next step was to reduce "user_location" to US states. A list of US state abbreviations (the 50 states plus DC) was created and each location in "user_location" was checked against it. The first abbreviation found in a location was appended to a new list called "sts". Once all locations had been checked, "user_location" was replaced with "sts" and the column was renamed to "state". Rows for DC were then dropped, leaving 50 states. The dataframe was grouped by "date" and "state" and summed, then pivoted with "date" as index and "state" as columns, with missing values filled with 0. Finally, it was melted back to long format with id_vars="date", var_name="state", and value_name="hate_tweets".

In [None]:
tweets = pd.read_csv("/content/15m_cleaned_tweets.csv")
tweets.head()

Unnamed: 0.1,Unnamed: 0,id,user_location,created_at,BERT_label
0,4,1326918184126074886,"Phoenix, AZ",Nov 2020,0
1,5,1321164103721885697,"Chicago, IL",Oct 2020,0
2,7,1280645004596252672,"San Antonio, TX",Jul 2020,0
3,16,1337951731519549442,"Orlando, FL",Dec 2020,0
4,18,1226174047169458181,"Perrysburg, OH",Feb 2020,0


In [None]:
tweets.drop(columns=["Unnamed: 0"],inplace=True)

In [None]:
tweets.drop(columns=["id"],inplace=True)

In [None]:
tweets = tweets[tweets["BERT_label"]==1]

In [None]:
tweets = tweets.groupby(["created_at","user_location"]).count().reset_index()

In [None]:
tweets["created_at"] = pd.to_datetime(tweets["created_at"])
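
The "created_at" values are month-year strings such as "Nov 2020". As a minimal sketch on toy values (not the notebook's data), passing an explicit format makes the parsing rule visible instead of relying on format inference. The `raw` series and the choice of `format="%b %Y"` are assumptions made for illustration.

```python
import pandas as pd

# Toy month-year strings in the same "Mon YYYY" shape as the created_at column
raw = pd.Series(["Nov 2020", "Oct 2020", "Feb 2020"])

# An explicit format avoids relying on format inference; each value maps to
# the first day of its month (assumes English month abbreviations)
parsed = pd.to_datetime(raw, format="%b %Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2020-11-01', '2020-10-01', '2020-02-01']
```

Because every tweet in a month collapses to the first day of that month, the resulting "date" column is effectively a monthly period label.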

In [None]:
tweets = tweets.sort_values(by=["created_at", "user_location"])

In [None]:
tweets.rename(columns={"created_at":"date","BERT_label":"hate_tweets"},inplace=True)

In [None]:
tweets = tweets.reset_index(drop=True)

In [None]:
us_abbreviations = ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI',
       'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN',
       'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH',
       'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA',
       'WI', 'WV', 'WY']

In [None]:
# Create empty list to store the state abbreviation found in each location
sts = []
# Iterate through each location in user_location column
for loc in tweets.user_location:
  # Iterate through each state abbreviation and keep the first one found
  # (case-sensitive substring test; assumes every location contains a match)
  for state in us_abbreviations:
    if state in loc:
      sts.append(state)
      break
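
A plain substring test can pick up an abbreviation that is only part of a longer word, for example "AL" inside "DALLAS, TX". A stricter alternative, sketched here on a shortened, hypothetical list of codes, is to accept a code only when it is the final token of the location string. The names `codes`, `pattern` and `extract_state` are illustrative and do not appear in the notebook.

```python
import re

# Shortened, hypothetical list for the example; the notebook uses all 51 codes
codes = ["AL", "CA", "IL", "TX"]

# Match a code only when it stands alone as the final token, e.g. "Chicago, IL"
pattern = re.compile(r"\b(" + "|".join(codes) + r")\s*$")

def extract_state(location):
    """Return the trailing state code in `location`, or None if there is none."""
    match = pattern.search(location)
    return match.group(1) if match else None

print(extract_state("Chicago, IL"))  # IL
print(extract_state("DALLAS, TX"))   # TX (a substring test would hit "AL" first)
print(extract_state("California"))   # None
```

Returning `None` for unmatched locations keeps the result the same length as the input, so it could be assigned to a column directly and the unmatched rows dropped afterwards. Whether this rule suits the real location strings depends on how they were cleaned upstream.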


In [None]:
# Replace user_location column with sts list containing only state abbreviations
# Rename user_location column to state

tweets["user_location"]=sts
tweets = tweets.rename(columns={"user_location":"state"})

In [None]:
tweets = tweets[tweets["state"]!="DC"]
tweets["state"].nunique()

50

In [None]:
tweets = tweets.reset_index(drop=True)

In [None]:
tweets = tweets.groupby(["date","state"]).sum().reset_index()

In [None]:
tweets = tweets.pivot(index='date', columns='state')['hate_tweets'].reset_index().rename_axis(None,axis=1).fillna(0)

In [None]:
tweets = tweets.melt(id_vars="date",var_name="state",value_name="hate_tweets")
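
The pivot followed by a melt gives every date and state combination a row, with a count of 0 where no hate tweets were observed. The effect is easiest to see on a toy frame; the data below is made up for illustration.

```python
import pandas as pd

# Toy counts: AK has no row for February, TX has no row for January
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "state": ["AK", "TX"],
    "hate_tweets": [1, 3],
})

# Wide: one row per date, one column per state; gaps become NaN, then 0
wide = toy.pivot(index="date", columns="state", values="hate_tweets").fillna(0)

# Long again: every (date, state) pair now has a row
long = (
    wide.reset_index()
        .rename_axis(None, axis=1)
        .melt(id_vars="date", var_name="state", value_name="hate_tweets")
)
print(long)
```

The two input rows become four output rows, two of them with a count of 0. The counts also become floats, because the intermediate NaN values force a float column.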

In [None]:
tweets.head()

Unnamed: 0,date,state,hate_tweets
0,2020-01-01,AK,1.0
1,2020-02-01,AK,0.0
2,2020-03-01,AK,2.0
3,2020-04-01,AK,0.0
4,2020-05-01,AK,2.0


In [None]:
tweets.shape

(750, 3)
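
With 50 states, 750 rows corresponds to 15 distinct dates per state, which is what a balanced panel would give. A small check along these lines, shown on made-up data with a hypothetical helper `is_balanced`, can confirm that each state appears exactly once per date.

```python
import pandas as pd

def is_balanced(panel, time_col="date", unit_col="state"):
    """True if every unit appears exactly once for every time value."""
    expected = panel[time_col].nunique() * panel[unit_col].nunique()
    no_duplicates = not panel.duplicated([time_col, unit_col]).any()
    return len(panel) == expected and no_duplicates

toy_panel = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"]),
    "state": ["AK", "TX", "AK", "TX"],
    "hate_tweets": [1.0, 0.0, 0.0, 3.0],
})

print(is_balanced(toy_panel))           # True
print(is_balanced(toy_panel.iloc[:3]))  # False
```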