<a href="https://colab.research.google.com/github/joyjixu/qm2_resources/blob/main/data_preprocessing/clean_tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

How do I get the tweets in csv?

First you need to set up a twitter developer account

go to the Github repo with all the tweets and scroll down in the README.md file. There is a section on "hydrating" the files, which basically to replace the tweet id with the actual tweet + location, time etc... (every file on github is basically a list of tweet ids for each hour of every day)

Download the Hydrator app, and the txt file of the tweet ids you want to "hydrate". The app is easy to use, just enter your twitter developer keys, click on the "add tab", then select the correct file and start the process. Then you can click on the 'csv' button which converts the json into a csv.

### also upload this
https://www.kaggle.com/giodev11/usstates-dataset



Importing libraries for data wrangling

Upload the csv file on the bar on the left

In [61]:
import pandas as pd
import numpy as np
import re

In [62]:
pd.__version__


'1.1.4'

Replace the path with the path of the file

You can get it by right-clicking on your uploaded file and selecting 'Copy path'

In [63]:
path = '/content/coronavirus-tweet-id-2020-04-04-10.csv'

# reading the csv as a pandas dataframe
tweets = pd.read_csv(path)

# printing the first row to get a better idea of the structure
print(tweets.loc[0])
print(tweets.loc[5])


coordinates                                                                 NaN
created_at                                       Sat Apr 04 10:57:08 +0000 2020
hashtags                                                                    NaN
media                                                                       NaN
urls                                                                        NaN
favorite_count                                                                0
id                                                          1246391348007014400
in_reply_to_screen_name                                                     NaN
in_reply_to_status_id                                                       NaN
in_reply_to_user_id                                                         NaN
lang                                                                         hi
place                                                                       NaN
possibly_sensitive                      

Since most tweets actually have no location attached, we first remove all tweets from the dataset that has a "NaN" (Not a number) value in the "place" column. This significantly reduces the length of the dataframe.

In [64]:
# removing NaN rows in tweets.places
loc_tweets = tweets.dropna(subset = ['user_location'])

print(loc_tweets.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23902 entries, 0 to 35565
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   coordinates                 24 non-null     object 
 1   created_at                  23902 non-null  object 
 2   hashtags                    6512 non-null   object 
 3   media                       2064 non-null   object 
 4   urls                        5884 non-null   object 
 5   favorite_count              23902 non-null  int64  
 6   id                          23902 non-null  int64  
 7   in_reply_to_screen_name     1761 non-null   object 
 8   in_reply_to_status_id       1547 non-null   float64
 9   in_reply_to_user_id         1761 non-null   float64
 10  lang                        23902 non-null  object 
 11  place                       309 non-null    object 
 12  possibly_sensitive          7394 non-null   object 
 13  retweet_count               239

In [65]:
# just to see what locations there are, we are printing the unique values in the form of a list
locations = loc_tweets.user_location.unique()
print(locations)

['Shahpura, India' 'Catalunya' 'occasional nsfw content' ...
 'Sarcelles, Ile-de-France' 'Catalunya, Europa' 'Warwickshire, England']


Now we need to select the tweets that are only from US states. To do this, we first have to get a dataset on all the state names and their abbreviations.

There is a csv file on Kaggle that does just that: https://www.kaggle.com/giodev11/usstates-dataset

Download that csv and upload it here

In [66]:
states_path = "/content/state-abbrevs.csv"

states = pd.read_csv(states_path)
print(states)

                   state abbreviation
0                Alabama           AL
1                 Alaska           AK
2                Arizona           AZ
3               Arkansas           AR
4             California           CA
5               Colorado           CO
6            Connecticut           CT
7               Delaware           DE
8   District of Columbia           DC
9                Florida           FL
10               Georgia           GA
11                Hawaii           HI
12                 Idaho           ID
13              Illinois           IL
14               Indiana           IN
15                  Iowa           IA
16                Kansas           KS
17              Kentucky           KY
18             Louisiana           LA
19                 Maine           ME
20               Montana           MT
21              Nebraska           NE
22                Nevada           NV
23         New Hampshire           NH
24            New Jersey           NJ
25          

In [67]:

abbr_list = states.abbreviation.to_list()


states_list = (states.state.to_list())


Now that we have a list of US states, we have to check each row in our tweets dataframe and see if the abbreviation is included in the location.

To do that we can define a function that takes a string as the input (the 'place' value of our tweets dataframe for each row), and returns True if the string contains a US state abbreviation.

In [68]:
def in_US(loc, states, abbr):
  if any(state in loc for state in states):
    return True
  for abbreviation in abbr:
    if re.search(r'\bis\b', abbreviation) is not None:
      return True
  return False
# returns True if place contains state, else False

In [69]:
# adding a new column is_state, either True or False
us_state = loc_tweets.user_location.apply(lambda x: in_US(x, states_list, abbr_list))
us_state = us_state.rename('is_state')

loc_tweets_isstate = loc_tweets.merge(us_state, left_index=True, right_index=True)
print(loc_tweets_isstate.head())

  coordinates                      created_at  ... user_verified is_state
0         NaN  Sat Apr 04 10:57:08 +0000 2020  ...         False    False
3         NaN  Sat Apr 04 10:57:07 +0000 2020  ...         False    False
6         NaN  Sat Apr 04 10:57:08 +0000 2020  ...         False    False
7         NaN  Sat Apr 04 10:57:08 +0000 2020  ...         False    False
8         NaN  Sat Apr 04 10:57:08 +0000 2020  ...         False    False

[5 rows x 35 columns]


Now we can only keep the rows where the location is in a US state

In [70]:
us_tweets_temp = loc_tweets_isstate[loc_tweets_isstate['is_state'] == True]
print(us_tweets_temp.head())

    coordinates                      created_at  ... user_verified is_state
55          NaN  Sat Apr 04 10:57:07 +0000 2020  ...         False     True
57          NaN  Sat Apr 04 10:57:08 +0000 2020  ...         False     True
74          NaN  Sat Apr 04 10:57:08 +0000 2020  ...         False     True
159         NaN  Sat Apr 04 10:57:13 +0000 2020  ...         False     True
213         NaN  Sat Apr 04 10:57:12 +0000 2020  ...         False     True

[5 rows x 35 columns]


Finally, we then reset the indices (becuse we removed a lot of rows and we want the number order to be correct)

In [71]:
us_tweets = us_tweets_temp.reset_index(drop = True)

In [72]:
print(us_tweets)

     coordinates                      created_at  ... user_verified is_state
0            NaN  Sat Apr 04 10:57:07 +0000 2020  ...         False     True
1            NaN  Sat Apr 04 10:57:08 +0000 2020  ...         False     True
2            NaN  Sat Apr 04 10:57:08 +0000 2020  ...         False     True
3            NaN  Sat Apr 04 10:57:13 +0000 2020  ...         False     True
4            NaN  Sat Apr 04 10:57:12 +0000 2020  ...         False     True
...          ...                             ...  ...           ...      ...
1032         NaN  Sat Apr 04 10:57:03 +0000 2020  ...         False     True
1033         NaN  Sat Apr 04 10:57:04 +0000 2020  ...          True     True
1034         NaN  Sat Apr 04 10:57:06 +0000 2020  ...         False     True
1035         NaN  Sat Apr 04 10:57:06 +0000 2020  ...         False     True
1036         NaN  Sat Apr 04 10:57:06 +0000 2020  ...         False     True

[1037 rows x 35 columns]


In [73]:
def write_state(location):
  for i in range(len(states_list)):
    res = re.search(states_list[i], location)
    if res is not None:
      return abbr_list[i]
  for i in range(len(abbr_list)):
    res = re.search(abbr_list[i], location)
    if res is not None:
      return abbr_list[i]

In [74]:
us_tweets['state'] = us_tweets.apply(lambda row: write_state(row['user_location']), axis=1)

for i in range(90):
  print(us_tweets['text'].loc[i])

RT @BreitbartNews: Texas Judge Stops Harris County from Releasing Inmates over Coronavirus Fears https://t.co/qqFIxEfo6U
RT @FLOTUS: As the weekend approaches I ask that everyone take social distancing &amp; wearing a mask/face covering seriously. #COVID19 is a vi…
RT @HKrassenstein: BREAKING:  The WSJ is reporting that Trump's businesses are now losing $1 MILLION a day because of the Coronavirus.

Doe…
RT @no_silenced: We currently have close to 50K illegal aliens in our Federal Prison system costing American Tax Payers Billions of dollars…
RT @nytimes: In Opinion

Tweed Roosevelt writes: "Captain Crozier joins a growing list of heroic men and women who have risked their career…
RT @KamVTV: Is it just me or has anyone else noticed the media has pushed Coronavirus hysteria, fear, and paranoia up by a 1000% in the las…
RT @B52Malmet: Lest you forget for a sec what a tremendously vindictive vengeful monster Trump is, he reminds you on a Friday night in the…
RT @MAERomania: #Romania stron

We can also save the dataframe as a new csv file (make sure to download it because it will be deleted once runtime ends)


In [75]:
us_tweets.to_csv('coronavirus_tweets_us_04_04_10.csv', index = False)

To do

Replace place by only state name