# **Part I: Twitter Data Collection and Sentiment Analysis**

This Jupyter Notebook contains the code for collecting (via web scraping), cleaning, wrangling, and sentiment-analysing the Twitter data for our project. It is the first part of our Twitter data processing; the Part II notebook covers the subsequent steps. The code scrapes Tweets for four cities with the `snscrape` Python library, merges the resulting dataframes, pre-processes the Tweet text, and then runs sentiment analysis with the pre-trained RoBERTa-base model.

## Libraries

In [None]:
# installing libraries if required (uncomment the next two lines)

# !pip install snscrape 
# !pip install transformers

# importing required functionalities 
import numpy as np              
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification       # imports two classes from the transformers module of the Hugging Face NLP library
from scipy.special import softmax                                                 # importing the final 'activation function'   
import time
import snscrape.modules.twitter as sntwitter                                        # import the module for web-scraping Tweets from Twitter
from datetime import datetime

## Data Collection

### 1. Scraping the Twitter data

In [None]:
# define a function to collect tweets with specified location, time and limit
def get_tweet_per_month(loc, start, end, limit, tweets):
    '''
    Scrapes Tweets containing the keyword "abortion" for one location and one time period,
    and appends them to the given list.
    Inputs:
        loc: location string for the geocode filter, in the form 'latitude, longitude, radius'
        start: start date of the time period for which we are scraping Tweets ('YYYY-MM-DD')
        end: end date of the time period for which we are scraping Tweets ('YYYY-MM-DD')
        limit: approximate cap on the number of Tweets scraped (the loop stops once the index exceeds limit)
        tweets: list to which the scraped Tweets are appended
    Output:
        tweets: the same list, with the scraped Tweets appended
    '''
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper('abortion geocode:"{}" since:"{}" until:"{}"'.format(loc, start, end)).get_items()):
        if i > limit:
            break
        else:
            tweets.append([tweet.date, tweet.user.username, tweet.content])
    return tweets
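
The capping logic in `get_tweet_per_month` can be tried without network access. The sketch below is illustrative only: `build_query` and `take_capped` are hypothetical helper names, and a plain Python generator stands in for the scraper's `get_items()` stream. It uses `itertools.islice` to stop after exactly `limit` items, whereas the `i > limit` test above keeps `limit + 1` items because `enumerate` starts counting at 0.

```python
from itertools import islice

def build_query(keyword, loc, start, end):
    """Format a search string in the same shape the function above passes to the scraper."""
    return '{} geocode:"{}" since:"{}" until:"{}"'.format(keyword, loc, start, end)

def take_capped(items, limit):
    """Return at most `limit` items from any iterable, without consuming the rest."""
    return list(islice(items, limit))

# stand-in for a stream of scraped tweets (hypothetical data, not real tweets)
fake_stream = ("tweet {}".format(n) for n in range(1000))

query = build_query("abortion", "40.7128, -74.0060, 50km", "2022-01-01", "2022-02-01")
sample = take_capped(fake_stream, 250)
print(query)
print(len(sample))
```

Whether an exact cap or the original `limit + 1` behaviour is preferable is a design choice; for sample sizes in the hundreds the difference is negligible.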

To collect representative data, we scrape a capped number of Tweets (about 300) containing the keyword "abortion" for each month.
We do this by iterating over the start and end dates of each month from 2022-01-01 to 2023-01-01. We use the same keyword to scrape text from the New York Times over the same period, which is the project's other data source.

We then set the coordinates to those of New York, Chicago, Los Angeles and Seattle to collect data for each city in turn.
The function above is used for New York City and Chicago; Los Angeles and Seattle are scraped with a separate loop further below.

We cap the number of Tweets per region and per month so that the sample is not heavily skewed towards any one region or time period. The cap also lets us group the Tweets by region and by month and visualize changes in their number and type in a representative manner.
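
One way to check that a capped sample is balanced is to count rows per city and per month. The following is a minimal sketch using only the standard library; the rows and the cap value are made up for illustration and are not taken from the scraped data.

```python
from collections import Counter

# made-up rows in a (date, city) shape; not real data
rows = [
    ("2022-01-15", "New York City"),
    ("2022-01-20", "New York City"),
    ("2022-02-03", "New York City"),
    ("2022-01-09", "Chicago"),
    ("2022-02-11", "Chicago"),
    ("2022-02-27", "Chicago"),
]

# key each row by (city, "YYYY-MM") and count how many rows fall in each cell
counts = Counter((city, day[:7]) for day, city in rows)

cap = 2  # hypothetical per-city, per-month cap for this toy example
over_cap = {key: n for key, n in counts.items() if n > cap}
print(counts)
print(over_cap)
```

An empty `over_cap` dictionary means no city-month cell exceeds the cap.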

In [None]:
# initialize the list
tweets = []

# initialize the list of time periods (months) for which we wanted the data to be scraped
lst_date = [('2022-01-01', '2022-02-01'),
            ('2022-02-01', '2022-03-01'),
            ('2022-03-01', '2022-04-01'),
            ('2022-04-01', '2022-05-01'),
            ('2022-05-01', '2022-06-01'),
            ('2022-06-01', '2022-07-01'),
            ('2022-07-01', '2022-08-01'),
            ('2022-08-01', '2022-09-01'),
            ('2022-09-01', '2022-10-01'),
            ('2022-10-01', '2022-11-01'),
            ('2022-11-01', '2022-12-01'),
            ('2022-12-01', '2023-01-01')]
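
The list of monthly windows above could also be generated rather than typed out. This is a sketch under the assumption that every window runs from the first day of one month to the first day of the next; `month_windows` is a hypothetical helper name, not part of the original notebook.

```python
from datetime import date

def month_windows(first_year, first_month, n_months):
    """Return (start, end) ISO date strings for n consecutive calendar months."""
    windows = []
    year, month = first_year, first_month
    for _ in range(n_months):
        start = date(year, month, 1)
        # roll over to January of the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        end = date(year, month, 1)
        windows.append((start.isoformat(), end.isoformat()))
    return windows

generated = month_windows(2022, 1, 12)
print(generated[0], generated[-1])
```

Generating the windows avoids copy-paste slips when the study period changes.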

* Scrape Tweets from New York City

In [None]:
# scrape tweets in New York City
loc = '40.7128, -74.0060, 50km'
for start, end in lst_date:
    get_tweet_per_month(loc, start, end, 250, tweets)

# convert the list to a pandas dataframe
df_nyc = pd.DataFrame(tweets, columns = ['Date', 'User', 'Tweet'])

# add city column and the coordinates
df_nyc['city'] = 'New York City'
df_nyc['latitude'] = '40.7128'
df_nyc['longitude'] = '-74.0060'

# save the dataframe to a csv file
df_nyc.to_csv('tweet_nyc.csv')

* Scrape Tweets from Chicago

In [None]:
# get tweets in Chicago
loc = '41.8781, -87.6298, 50km'
tweets = []
for start, end in lst_date:
    get_tweet_per_month(loc, start, end, 350, tweets)

# convert the list to a pandas dataframe
df_chi = pd.DataFrame(tweets, columns = ['Date', 'User', 'Tweet'])

# add city column and the coordinates
df_chi['city'] = 'Chicago'
df_chi['latitude'] = '41.8781'
df_chi['longitude'] = '-87.6298'

# save the dataframe to a csv file
df_chi.to_csv('tweet_chicago.csv')

* Scrape Tweets from Los Angeles

The Los Angeles and Seattle data are scraped with slightly different code, because Bhavya and Xiaowei worked on this task in parallel and this part was merged in afterwards. The logic is the same.

In [None]:
inputs_list = ['abortion geocode:"{}" since:2022-01-01 until:2022-02-01', 'abortion geocode:"{}" since:2022-02-01 until:2022-03-01', 'abortion geocode:"{}" since:2022-03-01 until:2022-04-01',
                'abortion geocode:"{}" since:2022-04-01 until:2022-05-01', 'abortion geocode:"{}" since:2022-05-01 until:2022-06-01', 'abortion geocode:"{}" since:2022-06-01 until:2022-07-01',
               'abortion geocode:"{}" since:2022-07-01 until:2022-08-01', 'abortion geocode:"{}" since:2022-08-01 until:2022-09-01', 'abortion geocode:"{}" since:2022-09-01 until:2022-10-01', 
               'abortion geocode:"{}" since:2022-10-01 until:2022-11-01', 'abortion geocode:"{}" since:2022-11-01 until:2022-12-01', 'abortion geocode:"{}" since:2022-12-01 until:2023-01-01', 
                'abortion geocode:"{}" since:2023-01-01 until:2023-02-01']
    
tweets_list = []

loc = '34.052235, -118.243683, 50km' #LA
# Using TwitterSearchScraper to scrape data and append tweets to list
for item in inputs_list:
  for i,tweet in enumerate(sntwitter.TwitterSearchScraper(item.format(loc)).get_items()):
      if i>250:
          break
      tweets_list.append([tweet.date, tweet.content, tweet.user.location])
print(len(tweets_list))

# Creating a dataframe from the tweets list above
tweets_df = pd.DataFrame(tweets_list, columns=['Date', 'Text', 'Location'])
tweets_df.head()


2328


Unnamed: 0,Date,Text,Location
0,2022-01-29 22:44:44+00:00,@Billfisher1219 @MarTheResister @RepSwalwell Y...,"West Hollywood, CA"
1,2022-01-28 13:30:00+00:00,@EmeGeDice That's why I believe in abortion.,"Los Angeles, CA"
2,2022-01-27 22:14:12+00:00,"Ahh, and here Lilly links a pro-choice abortio...","Studio City, CA"
3,2022-01-27 21:05:32+00:00,@aonetwothreefor @ghowell69 @covie_93 I was wo...,
4,2022-01-27 20:29:40+00:00,Honduras’ first female President Xiomara Castr...,"Los Angeles, CA"


* Scrape Tweets from Seattle

In [None]:
inputs_list = ['abortion geocode:"{}" since:2022-01-01 until:2022-02-01', 'abortion geocode:"{}" since:2022-02-01 until:2022-03-01', 'abortion geocode:"{}" since:2022-03-01 until:2022-04-01',
                'abortion geocode:"{}" since:2022-04-01 until:2022-05-01', 'abortion geocode:"{}" since:2022-05-01 until:2022-06-01', 'abortion geocode:"{}" since:2022-06-01 until:2022-07-01',
               'abortion geocode:"{}" since:2022-07-01 until:2022-08-01', 'abortion geocode:"{}" since:2022-08-01 until:2022-09-01', 'abortion geocode:"{}" since:2022-09-01 until:2022-10-01', 
               'abortion geocode:"{}" since:2022-10-01 until:2022-11-01', 'abortion geocode:"{}" since:2022-11-01 until:2022-12-01', 'abortion geocode:"{}" since:2022-12-01 until:2023-01-01', 
                'abortion geocode:"{}" since:2023-01-01 until:2023-02-01']
    
tweets_list2 = []

loc = '47.608013, -122.335167, 50km' #Seattle
# Using TwitterSearchScraper to scrape data and append tweets to list
for item in inputs_list:
  for i,tweet in enumerate(sntwitter.TwitterSearchScraper(item.format(loc)).get_items()):
      if i>250:
          break
      tweets_list2.append([tweet.date, tweet.content, tweet.user.location])
print(len(tweets_list2))

# Creating a dataframe from the tweets list above
tweets_df2 = pd.DataFrame(tweets_list2, columns=['Date', 'Text', 'Location'])
tweets_df2.head()


1655


Unnamed: 0,Date,Text,Location
0,2022-01-28 00:18:30+00:00,@HeartlandSignal Want government over-reach an...,
1,2022-01-25 14:33:33+00:00,@SenatorCantwell Could you get Democrat suppor...,WA
2,2022-01-23 18:35:15+00:00,@MaryMargOlohan @johnbeagle @DailySignal Prett...,"Puyallup, WA"
3,2022-01-20 04:30:26+00:00,Correct. The filibuster can’t be permanently s...,"Seattle, WA"
4,2022-01-16 17:42:00+00:00,@RebeccaforWA Codifying Roe would make it even...,"Tacoma, WA"


In [None]:
# adding latitude and longitude columns to the dataframes
tweets_df['Latitude'] = '34.052235'
tweets_df['Longitude'] = '-118.243683'
tweets_df.head()

Unnamed: 0,Date,Text,Location,Latitude,Longitude
0,2022-01-29 22:44:44+00:00,@Billfisher1219 @MarTheResister @RepSwalwell Y...,"West Hollywood, CA",34.052235,-118.243683
1,2022-01-28 13:30:00+00:00,@EmeGeDice That's why I believe in abortion.,"Los Angeles, CA",34.052235,-118.243683
2,2022-01-27 22:14:12+00:00,"Ahh, and here Lilly links a pro-choice abortio...","Studio City, CA",34.052235,-118.243683
3,2022-01-27 21:05:32+00:00,@aonetwothreefor @ghowell69 @covie_93 I was wo...,,34.052235,-118.243683
4,2022-01-27 20:29:40+00:00,Honduras’ first female President Xiomara Castr...,"Los Angeles, CA",34.052235,-118.243683


In [None]:
tweets_df2['Latitude'] = '47.608013'
tweets_df2['Longitude'] = '-122.335167'
tweets_df2.head()

Unnamed: 0,Date,Text,Location,Latitude,Longitude
0,2022-01-28 00:18:30+00:00,@HeartlandSignal Want government over-reach an...,,47.608013,-122.335167
1,2022-01-25 14:33:33+00:00,@SenatorCantwell Could you get Democrat suppor...,WA,47.608013,-122.335167
2,2022-01-23 18:35:15+00:00,@MaryMargOlohan @johnbeagle @DailySignal Prett...,"Puyallup, WA",47.608013,-122.335167
3,2022-01-20 04:30:26+00:00,Correct. The filibuster can’t be permanently s...,"Seattle, WA",47.608013,-122.335167
4,2022-01-16 17:42:00+00:00,@RebeccaforWA Codifying Roe would make it even...,"Tacoma, WA",47.608013,-122.335167


### 2. Merge the collected data of four cities

Merging the Tweet data from four cities and cleaning the dataset 

In [None]:
# load in the la and seattle data
df_la = pd.read_csv('twitter_LA_data.csv')
df_sea = pd.read_csv('twitter_seattle_data.csv')

# make the datasets consistent with the chicago and new york data
df_sea.drop(columns=['Unnamed: 0', 'Location'], inplace=True)
df_la.drop(columns=['Unnamed: 0', 'Location'], inplace=True)

# add city column 
df_sea['city'] = 'Seattle'
df_la['city'] = 'Los Angeles'

# change the columns of the nyc and chicago data to make them consistent
df_nyc.drop(columns=['User', 'city'], inplace=True)
df_nyc['city'] = 'New York City'

df_chi.drop(columns=['User', 'city'], inplace=True)
df_chi['city'] = 'Chicago'

# concatenate the dataframes (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
df_1 = pd.concat([df_nyc, df_chi], ignore_index=True)
df_2 = pd.concat([df_la, df_sea], ignore_index=True)
# rename the columns of df_2 to match df_1
df_2.columns = ['Date', 'Tweet', 'latitude', 'longitude', 'city']

# concatenate df_1 and df_2 to get the merged dataframe
df = pd.concat([df_1, df_2], ignore_index=True)

In [None]:
# extract the date in the expected format using strptime
df['Date'] = df["Date"].apply(lambda x : datetime.strptime(str(x)[:10], "%Y-%m-%d"))

# save the merged data to a csv file
df.to_csv('tweets.csv')

## Sentiment Analysis: Pre-Processing, Analysis and Labeling

After loading the collected data, we pre-process it. Usernames (such as @xyz) are replaced with the placeholder `@user` and external links/URLs with `http`, so that these tokens do not interfere with the sentiment analysis model.

### 1. Pre-processing the Tweets

In [None]:
# save the Seattle data to a csv file (this is the file read back in the merge step)
tweets_df2.to_csv("twitter_seattle_data.csv")

In [None]:
# preprocessing tweets 
def preprocess(sentence):           #this function has been called in cells after this definition as well
    '''
    Takes a blurb of text (such as a Tweet), replaces usernames with '@user' and external links with 'http',
    and returns the processed text
    Inputs:
        sentence: a piece of text to be cleaned (such as a single Tweet)
    Output:
        tweets_processed: the processed/cleaned Tweet with usernames and external links replaced by placeholders
    '''
    words = []
    for word in sentence.split():
        if word.startswith('@') and len(word) > 1:
            word = '@user'
        elif word.startswith('http'):
            word = 'http'
        words.append(word)
    tweets_processed = " ".join(words)
    return tweets_processed

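# apply the pre-processing to the Los Angeles and Seattle dataframes
# (the sentiment analysis below expects a tweet_processed column in both)
tweets_df['tweet_processed'] = tweets_df['Text'].apply(preprocess)
tweets_df2['tweet_processed'] = tweets_df2['Text'].apply(preprocess)
tweets_df2.head()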
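
To see what the pre-processing does to a single Tweet, here is a small self-contained sketch. `mask_tweet` is a hypothetical name for a copy of the same word-by-word logic, and the input string is written for this example rather than taken from the data.

```python
def mask_tweet(sentence):
    """Replace @mentions with '@user' and links with 'http', word by word."""
    words = []
    for word in sentence.split():
        if word.startswith('@') and len(word) > 1:
            word = '@user'
        elif word.startswith('http'):
            word = 'http'
        words.append(word)
    return " ".join(words)

# hypothetical input, written for this example
example = "@some_account read this https://example.com/page before replying"
print(mask_tweet(example))
# prints: @user read this http before replying
```

Because the check uses `startswith`, a mention wrapped in punctuation, such as `(@name)`, is left unchanged, and runs of whitespace are collapsed to single spaces by `split()`.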

### 2. Calling Model Functionalities

In [None]:
# storing variables and loading the model from the Hugging Face hub
tweets = tweets_df        # Los Angeles Tweets
tweets2 = tweets_df2      # Seattle Tweets
roberta = "cardiffnlp/twitter-roberta-base-sentiment"           # name of the pre-trained model on the Hugging Face hub
model = AutoModelForSequenceClassification.from_pretrained(roberta)     # loading the pre-trained model

### 3. Sentiment Analysis: Tokenizing, Classifying, and Labelling Tweets using RoBERTa-Base Model

In [None]:
# tokenizing and running the sentiment analysis on the Los Angeles dataframe
tokenizer = AutoTokenizer.from_pretrained(roberta)
labels = ['Negative', 'Neutral', "Positive"]
def encode(tweet):                          #this function has been called in cells after this definition as well
    ''' 
    Classifies the sentiment of a given Tweet using the pre-trained RoBERTa model
    Inputs:
        tweet: pre-processed text of the Tweet
    Output: 
        the label ('Negative', 'Neutral' or 'Positive') with the highest predicted probability
    '''
    out = model(**tokenizer(tweet, return_tensors='pt'))     # out[0] holds the logits for the three classes
    return labels[np.argmax(softmax(out[0][0].detach().numpy()))]

tweets['sentiment'] = tweets.apply(lambda x: encode(x.tweet_processed), axis = 1)
tweets['sentiment'].value_counts()

Negative    1699
Neutral      554
Positive      75
Name: sentiment, dtype: int64

The first line of the **encode** function tokenizes the input tweet and passes the result to the model. The tokenizer returns a dictionary-like object holding the model inputs (such as the token ids and the attention mask); the `return_tensors='pt'` argument makes these PyTorch tensors, and `**` unpacks them as keyword arguments to the model.

The model returns an output object whose first element, `out[0]`, contains the logits (raw scores) for the three sentiment classes. The output is assigned to the variable `out`.

The next line converts the logits into probabilities with the softmax function. `out[0][0]` selects the logits for the single input tweet, and `.detach()` detaches them from the computation graph. The resulting probabilities are passed to `np.argmax`, which returns the index of the highest probability.

Finally, the labels variable is used to map the index to the corresponding label, which is then returned by the function.

The **lambda** function in the following lines of code allows us to apply this **encode** function to all the tweets we have collected and pre-processed.
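
The softmax and argmax step described above can be illustrated without loading the model. The sketch below uses only the standard library; `softmax_plain` is a hypothetical stand-in for `scipy.special.softmax`, and the logits are made up rather than produced by RoBERTa.

```python
import math

labels = ['Negative', 'Neutral', 'Positive']

def softmax_plain(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# made-up logits for one tweet; a real model would produce these
logits = [2.0, 0.5, -1.0]
probs = softmax_plain(logits)
best = max(range(len(probs)), key=probs.__getitem__)   # same role as np.argmax
print(labels[best])
```

Since softmax preserves the ordering of the scores, taking the argmax of the logits directly would give the same label; the probabilities are useful mainly when a confidence value is also needed.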

In [None]:
# running the sentiment analysis on the Seattle dataframe,
# reusing the tokenizer, labels and encode function defined above

tweets2['sentiment'] = tweets2.apply(lambda x: encode(x.tweet_processed), axis = 1)
tweets2['sentiment'].value_counts()

Negative    1176
Neutral      422
Positive      57
Name: sentiment, dtype: int64

In [None]:
merged_df = pd.read_csv("tweets.csv")  # all tweets for all main cities

In [None]:
# preprocessing the merged tweets, reusing the preprocess function defined above

merged_df['tweet_processed'] = merged_df.apply(lambda x: preprocess(x.Tweet), axis = 1)
merged_df.count()

Unnamed: 0         8459
Date               8459
Tweet              8459
latitude           8459
longitude          8459
city               8459
tweet_processed    8459
dtype: int64

In [None]:
# running the sentiment analysis on the merged dataframe,
# reusing the tokenizer, labels and encode function defined above

merged_df['sentiment'] = merged_df.apply(lambda x: encode(x.tweet_processed), axis = 1)
merged_df['sentiment'].value_counts()

Negative    5835
Neutral     2272
Positive     352
Name: sentiment, dtype: int64
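
The counts above are easier to compare as shares of the total. This short sketch copies the three counts from the output above and converts them to percentages; nothing else is assumed.

```python
# counts copied from the value_counts() output above
counts = {'Negative': 5835, 'Neutral': 2272, 'Positive': 352}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print("{:<9}{:6.1%}".format(label, share))
```

The total equals the 8459 rows reported by `merged_df.count()`, so every Tweet in the merged dataset received a label.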

In [None]:
merged_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Tweet,latitude,longitude,city,tweet_processed,sentiment
0,0,2022-01-31,"Nazis, banned books, suppressed voting rights,...",40.7128,-74.006,New York City,"Nazis, banned books, suppressed voting rights,...",Negative
1,1,2022-01-31,@SenatorLankford In case you haven't noticed a...,40.7128,-74.006,New York City,@user In case you haven't noticed abortion is ...,Neutral
2,2,2022-01-31,@SenatorLankford So you support pushing aborti...,40.7128,-74.006,New York City,@user So you support pushing abortion undergro...,Negative
3,3,2022-01-31,@RayRiosy @Gdad1 @ltwlauren @AngelMHart417 @Ji...,40.7128,-74.006,New York City,@user @user @user @user @user @user @user @use...,Neutral
4,4,2022-01-31,@alicee_pll @miaana_14 People who support the ...,40.7128,-74.006,New York City,@user @user People who support the pro abortio...,Negative


### 4. Storing the Analyzed data into a .csv file

In [None]:
# storing the final merged, sentiment-analyzed dataset in a .csv file for further visualization and analysis
merged2 = merged_df.drop('tweet_processed', axis = 1)
merged2.to_csv('analyzed_tweet_merged.csv', index = False)