<div align="center"><h1><b>Step 1: Preprocess Russian Tweets</b></h1></div>


**Outline**

1. Merge Russian Information Operations datasets from Twitter's public information operations datasets (https://transparency.twitter.com/en/reports/information-operations.html)

2. Format datasets and ensure consistency between the users in the user and tweet data
3. Perform lemmatization and preprocessing on the tweets to create a Bag of Words (BoW), a list of formatted words, as an additional feature for every user and tweet. Each user's BoW is the aggregation of the BoWs of all their tweets.
4. Create a single list of all BoW words, used to query legitimate users in Step 2

In [None]:
# Import necessary libraries

# For accessing Google Drive Files
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth, drive
from oauth2client.client import GoogleCredentials

# Connect and authenticate Google Drive with Google Colab
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive.mount('/drive')
drive = GoogleDrive(gauth)

# For NLP, Exploratory Data Analysis, Twitter API access
import CS3315Project.tweetProcessing as tweetProcessing
import pandas as pd
import re
import itertools
import numpy as np 

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
# Grab Russian disinformation tweet and user datasets from CSVs on Google Drive, convert to dataframes

print('Accessing shared file links...')
rustweetssep = drive.CreateFile({'id':'insert file id'}) 
rususerssep = drive.CreateFile({'id': 'insert file id'})
rustweetsmay1 = drive.CreateFile({'id': 'insert file id'}) 
rustweetsmay2 = drive.CreateFile({'id': 'insert file id'}) 
rususerssmay = drive.CreateFile({'id': 'insert file id'}) 
rustweetsjun2019 = drive.CreateFile({'id': 'insert file id'})
rususersjun2019 = drive.CreateFile({'id': 'insert file id'})
rustweetsjan2019 = drive.CreateFile({'id': 'insert file id'})
rususersjan2019 = drive.CreateFile({'id':'insert file id'})
rustweetsoct2018 = drive.CreateFile({'id':'insert file id'})
rususersoct2018 = drive.CreateFile({'id':'insert file id'})
print('Links accessed')

print('Getting the file contents...')
rustweetssep.GetContentFile('ira_092020_tweets_csv_hashed.csv')
rususerssep.GetContentFile('ira_092020_users_csv_hashed.csv')
rustweetsmay1.GetContentFile('russia_052020_tweets_csv_hashed_1.csv')
rustweetsmay2.GetContentFile('russia_052020_tweets_csv_hashed_2.csv')
rususerssmay.GetContentFile('russia_052020_users_csv_hashed.csv')
rustweetsjun2019.GetContentFile('russia_201906_1_tweets_csv_hashed.csv')
rususersjun2019.GetContentFile('russia_201906_1_users_csv_hashed.csv')
rustweetsjan2019.GetContentFile('russia_201901_linked_tweets_csv_hashed_201901_1.csv')
rususersjan2019.GetContentFile('russia_201901_1_users_csv_hashed.csv')
rustweetsoct2018.GetContentFile('ira_tweets_csv_hashed.csv')
rususersoct2018.GetContentFile('ira_users_csv_hashed.csv')
print('Files retrieved')

print('Creating dataframes...')
rtweets_0920_df1 = pd.read_csv('ira_092020_tweets_csv_hashed.csv')
rtweets_0520_df2 = pd.read_csv('russia_052020_tweets_csv_hashed_1.csv')
rtweets_0520_df3 = pd.read_csv('russia_052020_tweets_csv_hashed_2.csv')
rtweets_0619_df4 = pd.read_csv('russia_201906_1_tweets_csv_hashed.csv')
rtweets_0119_df5 = pd.read_csv('russia_201901_linked_tweets_csv_hashed_201901_1.csv')
rtweets_1018_df6 = pd.read_csv('ira_tweets_csv_hashed.csv')

rusers_0920_df1 = pd.read_csv('ira_092020_users_csv_hashed.csv')
rusers_0520_df2 = pd.read_csv('russia_052020_users_csv_hashed.csv')
rusers_0619_df3 = pd.read_csv('russia_201906_1_users_csv_hashed.csv')
rusers_0119_df4 = pd.read_csv('russia_201901_1_users_csv_hashed.csv')
rusers_1018_df5 = pd.read_csv('ira_users_csv_hashed.csv')
print('Dataframes created')


Accessing shared file links...
Links accessed
Getting the file contents...
Files retrieved
Creating dataframes...




Dataframes created


In [None]:
# Look at shapes of the tweet datasets, ensure consistency

print(rtweets_0920_df1.shape)
print(rtweets_0520_df2.shape)
print(rtweets_0520_df3.shape)
print(rtweets_0619_df4.shape)
print(rtweets_0119_df5.shape)
print(rtweets_1018_df6.shape)

(1368, 30)
(3128489, 30)
(306303, 30)
(3, 31)
(920761, 31)
(8768633, 31)


In [None]:
# Compare a 30-column dataset with a 31-column dataset to identify the extra column

print(rtweets_0920_df1.info())
print(rtweets_1018_df6.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1368 entries, 0 to 1367
Data columns (total 30 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   tweetid                   1368 non-null   int64  
 1   userid                    1368 non-null   object 
 2   user_display_name         1368 non-null   object 
 3   user_screen_name          1368 non-null   object 
 4   user_reported_location    919 non-null    object 
 5   user_profile_description  1368 non-null   object 
 6   user_profile_url          1368 non-null   object 
 7   follower_count            1368 non-null   int64  
 8   following_count           1368 non-null   int64  
 9   account_creation_date     1368 non-null   object 
 10  account_language          1368 non-null   object 
 11  tweet_language            1368 non-null   object 
 12  tweet_text                1368 non-null   object 
 13  tweet_time                1368 non-null   object 
 14  tweet_cl

The earlier tweet datasets contain an extra column, 'poll_choices', that is not present in the later datasets, so we will drop it. This column lists the choices available when someone posted a poll, so it is not a necessary feature for our purposes.
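As an aside, the extra column can also be located programmatically rather than by reading the `info()` output. The following is a minimal sketch on made-up frames; the helper name `column_differences` is hypothetical and not part of this project.

```python
import pandas as pd

def column_differences(frames):
    """Map each frame name to the columns it has that are not shared by every frame."""
    shared = set.intersection(*(set(df.columns) for df in frames.values()))
    return {name: sorted(set(df.columns) - shared) for name, df in frames.items()}

# Toy stand-ins for a newer and an older tweet dataset (illustrative values only)
newer = pd.DataFrame({"tweetid": [1], "tweet_text": ["a"]})
older = pd.DataFrame({"tweetid": [2], "tweet_text": ["b"], "poll_choices": [None]})

extra = column_differences({"newer": newer, "older": older})
print(extra)
```

With the notebook's frames, the dictionary passed in would hold the six tweet dataframes keyed by their dataset labels.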

In [None]:
# Drop poll_choices from earliest datasets

rtweets_0619_df4 = rtweets_0619_df4.drop(['poll_choices'], axis=1)
rtweets_0119_df5 = rtweets_0119_df5.drop(['poll_choices'], axis=1)
rtweets_1018_df6 = rtweets_1018_df6.drop(['poll_choices'], axis=1)

print(rtweets_0619_df4.shape)
print(rtweets_0119_df5.shape)
print(rtweets_1018_df6.shape)

(3, 30)
(920761, 30)
(8768633, 30)


In [None]:
# Look at shapes of the user datasets, ensure consistency

print(rusers_0920_df1.shape)
print(rusers_0520_df2.shape)
print(rusers_0619_df3.shape)
print(rusers_0119_df4.shape)
print(rusers_1018_df5.shape)

(5, 10)
(1153, 10)
(3, 10)
(416, 10)
(3608, 10)


In [None]:
# Add columns annotating each dataset in order to differentiate them

rtweets_0920_df1['dataset'] = '0920'
rusers_0920_df1['dataset'] = '0920'

rtweets_0520_df2['dataset'] = '0520'
rtweets_0520_df3['dataset'] = '0520'
rusers_0520_df2['dataset'] = '0520'

rtweets_0619_df4['dataset'] = '0619'
rusers_0619_df3['dataset'] = '0619'

rtweets_0119_df5['dataset'] = '0119'
rusers_0119_df4['dataset'] = '0119'

rtweets_1018_df6['dataset'] = '1018'
rusers_1018_df5['dataset'] = '1018'

In [None]:
# Verify column has been added

print(rtweets_0920_df1.shape)
print(rtweets_0520_df2.shape)
print(rtweets_0520_df3.shape)
print(rtweets_0619_df4.shape)
print(rtweets_0119_df5.shape)
print(rtweets_1018_df6.shape)
print(rusers_0920_df1.shape)
print(rusers_0520_df2.shape)
print(rusers_0619_df3.shape)
print(rusers_0119_df4.shape)
print(rusers_1018_df5.shape)

print(rtweets_0920_df1.head())
print(rusers_0920_df1.head())

(1368, 31)
(3128489, 31)
(306303, 31)
(3, 31)
(920761, 31)
(8768633, 31)
(5, 11)
(1153, 11)
(3, 11)
(416, 11)
(3608, 11)
               tweetid  ... dataset
0  1290351045160448005  ...    0920
1  1268235122131771392  ...    0920
2  1283019246503694336  ...    0920
3  1273537153893629952  ...    0920
4  1273195539383889921  ...    0920

[5 rows x 31 columns]
                                         userid  ... dataset
0  CqW9bECdw2Jjk9DDU7UyE6P59TukYFISNE8J6sN66u4=  ...    0920
1    uOrf1TDmM7vP4YEhOJDXORoqvpDlsJt03AyOfhrZo=  ...    0920
2   LXW4uuq2JWx4So6ycDFanp4qYQxNvj0ftiuyUe3tZo=  ...    0920
3   oqEFFiOrA+QVN8mEK0wweRTMmY2FQNB6XE5baB1Wik=  ...    0920
4   KjTkk0ZTF6mmlwxdxA13V1UVlB+NAeaWoH9YqBCFEE=  ...    0920

[5 rows x 11 columns]


In [None]:
# Merge Russian tweet datasets

rustw_dflist = [rtweets_0920_df1, rtweets_0520_df2, rtweets_0520_df3, rtweets_0619_df4, rtweets_0119_df5, rtweets_1018_df6]
rus_tweetsdf_mrg = pd.concat(rustw_dflist)

del rtweets_0920_df1, rtweets_0520_df2, rtweets_0520_df3, rtweets_0619_df4, rtweets_0119_df5, rtweets_1018_df6
del rustw_dflist

rus_tweetsdf_mrg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13125557 entries, 0 to 8768632
Data columns (total 31 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   tweetid                   int64  
 1   userid                    object 
 2   user_display_name         object 
 3   user_screen_name          object 
 4   user_reported_location    object 
 5   user_profile_description  object 
 6   user_profile_url          object 
 7   follower_count            int64  
 8   following_count           int64  
 9   account_creation_date     object 
 10  account_language          object 
 11  tweet_language            object 
 12  tweet_text                object 
 13  tweet_time                object 
 14  tweet_client_name         object 
 15  in_reply_to_userid        object 
 16  in_reply_to_tweetid       float64
 17  quoted_tweet_tweetid      float64
 18  is_retweet                bool   
 19  retweet_userid            object 
 20  retweet_tweetid        

In [None]:
# Merge Russian user datasets

rusus_dflist = [rusers_0920_df1, rusers_0520_df2, rusers_0619_df3, rusers_0119_df4, rusers_1018_df5]
rus_usersdf_mrg = pd.concat(rusus_dflist)

del rusers_0920_df1, rusers_0520_df2, rusers_0619_df3, rusers_0119_df4, rusers_1018_df5
del rusus_dflist

rus_usersdf_mrg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5185 entries, 0 to 3607
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   userid                    5185 non-null   object
 1   user_display_name         5185 non-null   object
 2   user_screen_name          5185 non-null   object
 3   user_reported_location    3790 non-null   object
 4   user_profile_description  3476 non-null   object
 5   user_profile_url          376 non-null    object
 6   follower_count            5152 non-null   object
 7   following_count           5153 non-null   object
 8   account_creation_date     5185 non-null   object
 9   account_language          5185 non-null   object
 10  dataset                   5185 non-null   object
dtypes: object(11)
memory usage: 486.1+ KB


There are 5,185 user records in the merged Russian user dataset and 13,125,557 total tweets.

Now that the data is merged, we need to drop rows with null values for user data or tweet text so that the tweets can be grouped by user ID and preprocessed. We also need to ensure the two datasets are consistent, namely that every user in the user data has tweets and every tweeting user appears in the user data.

In [None]:
# View null values in the user dataset to see if there are any unusable data points

display(rus_usersdf_mrg.isnull().sum()) 

userid                         0
user_display_name              0
user_screen_name               0
user_reported_location      1395
user_profile_description    1709
user_profile_url            4809
follower_count                33
following_count               32
account_creation_date          0
account_language               0
dataset                        0
dtype: int64

In [None]:
# View null values in the tweet dataset to see if there are any unusable data points

display(rus_tweetsdf_mrg.isnull().sum()) 

tweetid                            0
userid                             0
user_display_name                  0
user_screen_name                   0
user_reported_location       2894797
user_profile_description     1859060
user_profile_url             9916540
follower_count                     0
following_count                    0
account_creation_date              0
account_language                   0
tweet_language                804083
tweet_text                         2
tweet_time                         0
tweet_client_name              40341
in_reply_to_userid          12031159
in_reply_to_tweetid         12317236
quoted_tweet_tweetid        12806654
is_retweet                         0
retweet_userid               9044607
retweet_tweetid              8663898
latitude                           0
longitude                          0
quote_count                     8708
reply_count                     8708
like_count                      8708
retweet_count                   8708
h

There are two rows where the tweet_text value is null. These rows cannot be preprocessed, so we will drop them. There was nothing of concern in the user dataframe.

In [None]:
# Drop null rows for tweet_text column

rus_tweetsdf_mrg = rus_tweetsdf_mrg.dropna(subset=['tweet_text'])

display(rus_tweetsdf_mrg.isnull().sum()) 

tweetid                            0
userid                             0
user_display_name                  0
user_screen_name                   0
user_reported_location       2894797
user_profile_description     1859060
user_profile_url             9916538
follower_count                     0
following_count                    0
account_creation_date              0
account_language                   0
tweet_language                804083
tweet_text                         0
tweet_time                         0
tweet_client_name              40341
in_reply_to_userid          12031157
in_reply_to_tweetid         12317234
quoted_tweet_tweetid        12806652
is_retweet                         0
retweet_userid               9044605
retweet_tweetid              8663896
latitude                           0
longitude                          0
quote_count                     8708
reply_count                     8708
like_count                      8708
retweet_count                   8708
h

Now we can see there are no null values for tweet_text. 

In [None]:
# Find number of users in each dataset
tweetusers = rus_tweetsdf_mrg['userid'].unique()
print('The number of users that tweeted in the Russian tweet dataset is ', len(tweetusers))
print('The number of Russian users in the user dataset is ', len(rus_usersdf_mrg.index))

The number of users that tweeted in the Russian tweet dataset is  4861
The number of Russian users in the user dataset is  5185


The number of unique users in the tweet dataset does not match the number of rows in the user dataset, so we will drop the user rows that have user info but no corresponding tweets. We cannot create word vectors for these users, and word vectors are the basis of our classification model.
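Before dropping anything, it can help to look at the mismatch in both directions, since a difference in counts alone does not show which side is missing users. A minimal sketch using plain Python sets follows; the function name `compare_user_ids` and the sample IDs are made up for illustration.

```python
def compare_user_ids(user_table_ids, tweet_author_ids):
    """Return (ids only in the user table, ids only among tweet authors)."""
    user_ids = set(user_table_ids)
    author_ids = set(tweet_author_ids)
    return user_ids - author_ids, author_ids - user_ids

# Illustrative IDs: "u1" has a profile but no tweets, "u4" has tweets but no profile
users_without_tweets, authors_without_profile = compare_user_ids(
    ["u1", "u2", "u3"], ["u2", "u3", "u4"]
)
print(users_without_tweets)
print(authors_without_profile)
```

With the notebook's data, the two arguments would be `rus_usersdf_mrg['userid']` and `tweetusers`.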

In [None]:
# Drop users that have no recorded tweets in the dataset

# Keep only the rows of the user dataset whose userid also appears in the tweet dataset
rus_usersdf_mrg = rus_usersdf_mrg[rus_usersdf_mrg.userid.isin(tweetusers)]

print('New number of users in the user dataset')
print(len(rus_usersdf_mrg.index))
print('Number of users in the tweet dataset')
print(len(tweetusers))

New number of users in the user dataset
4859
Number of users in the tweet dataset
4861


It looks like there are two users that have tweets but are not in the user dataset. Since every field in the user dataset also appears in the tweet dataset, we can rebuild these users' rows from their tweets and add them to the user dataset.

In [None]:
# Identify two tweet users that aren't in the user dataset, grab one tweet from each

extra_tweet_users = rus_tweetsdf_mrg[~rus_tweetsdf_mrg.userid.isin(rus_usersdf_mrg.userid)]
extra_tweet_users = extra_tweet_users.drop_duplicates(subset = ["userid"])

extra_tweet_users.head()

Unnamed: 0,tweetid,userid,user_display_name,user_screen_name,user_reported_location,user_profile_description,user_profile_url,follower_count,following_count,account_creation_date,account_language,tweet_language,tweet_text,tweet_time,tweet_client_name,in_reply_to_userid,in_reply_to_tweetid,quoted_tweet_tweetid,is_retweet,retweet_userid,retweet_tweetid,latitude,longitude,quote_count,reply_count,like_count,retweet_count,hashtags,urls,user_mentions,dataset
32,820382233420701697,iZ328VglWrG25qPym1bifLoiwXD9v1+A3G4WU5AThso=,iZ328VglWrG25qPym1bifLoiwXD9v1+A3G4WU5AThso=,iZ328VglWrG25qPym1bifLoiwXD9v1+A3G4WU5AThso=,United States,No more #HappyHolidays shit!!!\nIt's #MerryChr...,https://t.co/XFnhCqCWBy,2718,264,2016-06-15,en,und,#RosieODonnellIsTrash #RosieThePig #Disgusting...,2017-01-14 21:29,Twitter for Android,,,,False,,,absent,absent,0.0,0.0,0.0,0.0,"['RosieODonnellIsTrash', 'RosieThePig', 'Disgu...",[],[25203361],119
215,452439880874725376,YonB7sDqf9+ts0T3nZcFTCwI+9xx3nCoR7APykRtAE=,YonB7sDqf9+ts0T3nZcFTCwI+9xx3nCoR7APykRtAE=,YonB7sDqf9+ts0T3nZcFTCwI+9xx3nCoR7APykRtAE=,Новосибирск,Добавляю взаимно для общения #followback #rufo...,,1673,283,2012-08-03,ru,ru,Volvo планирует продавать 1 млн автомобилей еж...,2014-04-05 13:37,twitterfeed,,,,False,,,absent,absent,0.0,0.0,0.0,0.0,['авто'],,,119


In [None]:
# Add the relevant fields from the tweet users to the user dataset

extra_tweet_users = extra_tweet_users[['userid', 'user_display_name', 'user_screen_name', 'user_reported_location', 'user_profile_description', 'user_profile_url', 'follower_count', 'following_count', 'account_creation_date', 'account_language', 'dataset']]

rus_usersdf_mrg = pd.concat([rus_usersdf_mrg, extra_tweet_users], ignore_index=True)

del extra_tweet_users

# Verify that the number of users in both datasets are now equal and that those users are in the user dataset
print('Number of users in the user dataset')
print(len(rus_usersdf_mrg.index))
print('Number of users in the tweet dataset')
print(len(tweetusers))

rus_usersdf_mrg.tail(2)

Number of users in the user dataset
4861
Number of users in the tweet dataset
4861


Unnamed: 0,userid,user_display_name,user_screen_name,user_reported_location,user_profile_description,user_profile_url,follower_count,following_count,account_creation_date,account_language,dataset
4859,iZ328VglWrG25qPym1bifLoiwXD9v1+A3G4WU5AThso=,iZ328VglWrG25qPym1bifLoiwXD9v1+A3G4WU5AThso=,iZ328VglWrG25qPym1bifLoiwXD9v1+A3G4WU5AThso=,United States,No more #HappyHolidays shit!!!\nIt's #MerryChr...,https://t.co/XFnhCqCWBy,2718,264,2016-06-15,en,119
4860,YonB7sDqf9+ts0T3nZcFTCwI+9xx3nCoR7APykRtAE=,YonB7sDqf9+ts0T3nZcFTCwI+9xx3nCoR7APykRtAE=,YonB7sDqf9+ts0T3nZcFTCwI+9xx3nCoR7APykRtAE=,Новосибирск,Добавляю взаимно для общения #followback #rufo...,,1673,283,2012-08-03,ru,119


Now we know that the users are consistent between the two datasets.

Because we cannot lemmatize non-English text during preprocessing, we will delete the data for users that have no English-language tweets: their rows in the user dataset and all of their tweets. However, we want to keep the non-English tweets of users that also tweet in English, as these will be relevant for feature generation later on.

In [None]:
# Identify the users that have no English tweets

# Grab relevant columns
twtlang_analysis_df = rus_tweetsdf_mrg[['userid','account_language', 'tweet_language']]
usrlang_analysis_df = rus_usersdf_mrg[['userid', 'account_language']]

# Tweets that are in English
twtdata_tweetlang_en = twtlang_analysis_df[twtlang_analysis_df['tweet_language'] == 'en']

# Users with tweets that are in English
usertwtcount_bylang = twtlang_analysis_df.groupby(['userid', 'tweet_language']).size().reset_index()
users_entweets = usertwtcount_bylang[usertwtcount_bylang['tweet_language'] == 'en']

# Users with no English tweets
userlang_noen = usertwtcount_bylang[~usertwtcount_bylang.userid.isin(users_entweets.userid)]
noen_users = userlang_noen.groupby('userid').size().reset_index()


print('The total number of tweets')
print(len(twtlang_analysis_df.index))
print('\n')

print('The number of tweets in the English language')
print(len(twtdata_tweetlang_en.index))
print('English tweets account for ' + str((len(twtdata_tweetlang_en.index)/len(twtlang_analysis_df.index))*100) + '% of all tweets')
print('\n')

print('The total number of users with English tweets')
print(len(users_entweets))
print('\n')

print('The total number of users with no English tweets')
print(len(noen_users))

The total number of tweets
13125555


The number of tweets in the English language
3768347
English tweets account for 28.710001215186708% of all tweets


The total number of users with English tweets
4158


The total number of users with no English tweets
694


In [None]:
# Delete users with no English tweets from merged user and tweet datasets.
rus_usersdf_mrg = rus_usersdf_mrg[~rus_usersdf_mrg.userid.isin(noen_users.userid)]
rus_tweetsdf_mrg = rus_tweetsdf_mrg[~rus_tweetsdf_mrg.userid.isin(noen_users.userid)]

# Ensure the number of users is still consistent between the two datasets
print('Number of users in the user dataset')
print(len(rus_usersdf_mrg.index))
print('Number of users in the tweet dataset')
print(len(rus_tweetsdf_mrg['userid'].unique()))

Number of users in the user dataset
4167
Number of users in the tweet dataset
4167
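The counts printed above do not quite add up: 4158 users with English tweets plus 694 without gives 4852, while the tweet dataset held 4861 users before this step and 4167 (that is, 4861 - 694) after it. One plausible explanation, not verified here, is that pandas `groupby` drops rows whose key is null by default, so a user whose tweets all have a null `tweet_language` appears in neither list and is silently kept. The sketch below reproduces that effect on made-up data and shows a per-user boolean reduction that does not lose such users; the frame and its values are illustrative only.

```python
import pandas as pd

# Toy data: user "c" has tweets, but none with a detected language.
tweets = pd.DataFrame({
    "userid": ["a", "a", "b", "c", "c"],
    "tweet_language": ["en", "ru", "ru", None, None],
})

# Counting by (userid, tweet_language) skips null keys, so "c" disappears.
counts = tweets.groupby(["userid", "tweet_language"]).size().reset_index(name="n")
users_seen = set(counts["userid"])

# Alternative: flag English tweets first, then ask whether each user has any.
has_english = tweets["tweet_language"].eq("en").groupby(tweets["userid"]).any()
no_english_users = sorted(has_english[~has_english].index)

print(users_seen)
print(no_english_users)
```

Whether users with only null-language tweets should be kept or dropped is a modelling decision; the point of the sketch is only that the choice should be explicit.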


Now that we have gotten rid of the data that we cannot use, we can generate the Bag of Words (BoW) list from the tweets of every user and append it to their row in the user dataset. We will also generate a BoW for every tweet and add it as a column in the tweet dataset. The BoW will only consist of words from English tweets, even though there are tweets in multiple languages.
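To make the idea concrete, here is a heavily simplified stand-in for this step. It is not the project's `tweetProcessing.preprocess_frame`, whose implementation is not shown in this notebook: it skips lemmatization and language filtering, uses a tiny made-up stop-word list, and the names `tweet_to_bow` and `user_bows` are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "did", "i", "is", "it", "of", "rt", "that", "the", "to", "we"}

def tweet_to_bow(text):
    """Lower-case a tweet, strip URLs and @mentions, keep alphabetic non-stop-words."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return [tok for tok in re.findall(r"[a-z]+", text) if tok not in STOPWORDS]

def user_bows(rows):
    """Concatenate per-tweet word lists into one list per user id."""
    bows = defaultdict(list)
    for userid, text in rows:
        bows[userid].extend(tweet_to_bow(text))
    return dict(bows)

# Made-up (userid, tweet_text) pairs
rows = [
    ("u1", "RT @someone: Fear and the election https://t.co/x"),
    ("u1", "High time that we retire"),
    ("u2", "Police officers walk"),
]
bows = user_bows(rows)
print(bows)
```

The per-user list is simply the concatenation of that user's per-tweet lists, which mirrors the aggregation described in the outline.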

In [None]:
test_tweetdf = rus_tweetsdf_mrg.iloc[0:3]
test_userdf = rus_usersdf_mrg.iloc[0:3]

print(test_tweetdf.head())
print(test_userdf.head())

               tweetid  ... dataset
0  1290351045160448005  ...    0920
1  1268235122131771392  ...    0920
2  1283019246503694336  ...    0920

[3 rows x 31 columns]
                                         userid  ... dataset
0  CqW9bECdw2Jjk9DDU7UyE6P59TukYFISNE8J6sN66u4=  ...    0920
1    uOrf1TDmM7vP4YEhOJDXORoqvpDlsJt03AyOfhrZo=  ...    0920
2   LXW4uuq2JWx4So6ycDFanp4qYQxNvj0ftiuyUe3tZo=  ...    0920

[3 rows x 11 columns]


In [None]:
# Create BoW feature in user and tweet datasets

tweetProcessing.preprocess_frame(rus_tweetsdf_mrg, rus_usersdf_mrg)

In [None]:
# Verify BoW feature was created

print(rus_tweetsdf_mrg.head(10)['tweet_text'])
print(rus_tweetsdf_mrg.head(10)['BoW'])
print(rus_usersdf_mrg['BoW'].head(10))

0    RT @Claudia90291: Never did I ever:\n\nsee Sch...
1    RT @mlk_institute: "It is high time that we re...
2    RT @davidsirota: Fear: Trump wins reelection.\...
3    RT @curbstompchloe: lemme introduce the tl to ...
4    RT @papichulomin: Oh you “love Obama”? Name 7 ...
5    RT @MelanieMoore: The police officers walking ...
6    RT @LeftistFun: Do you think that people are f...
7    RT @JoyAnnReid: What year ... what century are...
8    RT @ABC: BREAKING: All four responding officer...
9    RT @norvergence: #WaterSecurity in #Jordan is ...
Name: tweet_text, dtype: object
0                   [schumer, fight, hard, healthcare]
1    [high, time, retire, white, racist, congress, ...
2    [fear, trump, win, reelection, fear, democrati...
3            [lemme, introduce, tl, favorite, graphic]
4                   [oh, love, obama, country, bombed]
5    [police, officer, walk, charge, commit, murder...
6            [think, people, fundamentally, good, bad]
7                                

In [None]:
# Flatten the per-user BoW lists into a single, comprehensive list of words
# to use for the user query and to compare word usage across datasets

BoW_prelist = rus_usersdf_mrg.BoW.tolist()
BoW_list = pd.Series([item for sublist in BoW_prelist for item in sublist])

# Determine how many unique words there are and the most frequently used words
BoW_unique = BoW_list.unique()
BoW_counts = BoW_list.value_counts()

print('There are a total of ' + str(len(BoW_list)) + ' preprocessed words in the BoW dataset')
print('There are a total of ' + str(len(BoW_unique)) + ' unique preprocessed words in the BoW dataset\n')
print('The top 50 most used words in the dataset:')
print(BoW_counts.head(50))

There are a total of 25741529 preprocessed words in the BoW dataset
There are a total of 563378 unique preprocessed words in the BoW dataset

The top 50 most used words in the dataset:
news         282821
trump        224713
amp          128947
new          116600
people       108153
like         104972
sport        101087
love          96441
man           96100
obama         95977
politics      85767
know          80440
police        79699
want          78363
time          76075
world         74676
http          73831
year          72708
day           71152
today         70433
good          69146
need          67363
life          66313
woman         64269
america       62622
hillary       62560
local         62192
win           60015
think         58676
president     58424
black         56718
kill          54794
islam         54523
right         53456
look          53442
clinton       53322
state         48215
come          46995
white         46585
great         46501
let           4

In [None]:
# Analyze the words by dataset

BoW_df = rus_usersdf_mrg[['BoW', 'dataset']]

BoW_df_0920 = BoW_df[BoW_df['dataset'] == '0920']
BoW_df_0520 = BoW_df[BoW_df['dataset'] == '0520']
BoW_df_0619 = BoW_df[BoW_df['dataset'] == '0619']
BoW_df_0119 = BoW_df[BoW_df['dataset'] == '0119']
BoW_df_1018 = BoW_df[BoW_df['dataset'] == '1018']

print('Number of rows and columns in each dataset. The number of rows correspond to the number of users\n')
print('0920 Dataset')
print(BoW_df_0920.shape)
print('\n0520 Dataset')
print(BoW_df_0520.shape)
print('\n0619 Dataset')
print(BoW_df_0619.shape)
print('\n0119 Dataset')
print(BoW_df_0119.shape)
print('\n1018 Dataset')
print(BoW_df_1018.shape)

# Take words from BoW
BoW_list_0920 = []
BoW_prelist_0920 = BoW_df_0920.BoW.tolist()
BoW_list_0920 = [item for sublist in BoW_prelist_0920 for item in sublist]
BoW_list_0920 = pd.Series(BoW_list_0920)

BoW_list_0520 = []
BoW_prelist_0520 = BoW_df_0520.BoW.tolist()
BoW_list_0520 = [item for sublist in BoW_prelist_0520 for item in sublist]
BoW_list_0520 = pd.Series(BoW_list_0520)

BoW_list_0619 = []
BoW_prelist_0619 = BoW_df_0619.BoW.tolist()
BoW_list_0619 = [item for sublist in BoW_prelist_0619 for item in sublist]
BoW_list_0619 = pd.Series(BoW_list_0619)

BoW_list_0119 = []
BoW_prelist_0119 = BoW_df_0119.BoW.tolist()
BoW_list_0119 = [item for sublist in BoW_prelist_0119 for item in sublist]
BoW_list_0119 = pd.Series(BoW_list_0119)

BoW_list_1018 = []
BoW_prelist_1018 = BoW_df_1018.BoW.tolist()
BoW_list_1018 = [item for sublist in BoW_prelist_1018 for item in sublist]
BoW_list_1018 = pd.Series(BoW_list_1018)

BoW_list_0920_unique = BoW_list_0920.unique()
BoW_list_0920_counts = BoW_list_0920.value_counts()

BoW_list_0520_unique = BoW_list_0520.unique()
BoW_list_0520_counts = BoW_list_0520.value_counts()

BoW_list_0619_unique = BoW_list_0619.unique()
BoW_list_0619_counts = BoW_list_0619.value_counts()

BoW_list_0119_unique = BoW_list_0119.unique()
BoW_list_0119_counts = BoW_list_0119.value_counts()

BoW_list_1018_unique = BoW_list_1018.unique()
BoW_list_1018_counts = BoW_list_1018.value_counts()


print('The top most used words and number of unique words by dataset\n')

print('0920 Dataset')
print(len(BoW_list_0920_unique), ' unique words')
print(BoW_list_0920_counts.head(50))
print('\n')

print('0520 Dataset')
print(len(BoW_list_0520_unique), ' unique words')
print(BoW_list_0520_counts.head(50))
print('\n')

print('0619 Dataset')
print(len(BoW_list_0619_unique), ' unique words')
print(BoW_list_0619_counts.head(50))
print('\n')

print('0119 Dataset')
print(len(BoW_list_0119_unique), ' unique words')
print(BoW_list_0119_counts.head(50))
print('\n')

print('1018 Dataset')
print(len(BoW_list_1018_unique), ' unique words')
print(BoW_list_1018_counts.head(50))

Number of rows and columns in each dataset. The number of rows correspond to the number of users

0920 Dataset
(4, 2)

0520 Dataset
(730, 2)

0619 Dataset
(2, 2)

0119 Dataset
(349, 2)

1018 Dataset
(3082, 2)
The top most used words and number of unique words by dataset

0920 Dataset
2934  unique words
message        124
direct         122
new            100
war             85
like            81
government      74
people          72
non             68
america         67
trump           67
hello           66
send            65
disable         65
following       65
discuss         63
proposal        62
right           61
police          52
coronavirus     51
military        50
report          48
crisis          47
covid           46
american        43
yemen           42
country         41
pandemic        40
saudi           40
history         38
year            37
way             36
piece           36
continue        36
nuclear         35
human           35
uk              34
security    

We will save further analysis for the Exploratory Data Analysis step later; this check was only meant to confirm that the data processed so far will work for our intended purposes. Although some of the datasets are much smaller than others, they all appear to cover broadly similar topics, so we will keep all of them.
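The per-dataset tallies above repeat the same flatten-and-count steps for each dataset label. As an aside, the same kind of result could be produced by a single loop; the sketch below uses `collections.Counter` on made-up `(dataset, word list)` pairs, and the helper name `word_counts_by_group` is hypothetical rather than part of the project.

```python
from collections import Counter

def word_counts_by_group(records):
    """records: iterable of (group_label, word_list). Returns {label: Counter}."""
    counts = {}
    for label, words in records:
        counts.setdefault(label, Counter()).update(words)
    return counts

# Made-up per-user BoW lists tagged with their dataset label
records = [
    ("0920", ["war", "trump", "war"]),
    ("1018", ["news", "trump"]),
    ("0920", ["news"]),
]
by_dataset = word_counts_by_group(records)
for label, counter in by_dataset.items():
    print(label, len(counter), "unique words", counter.most_common(2))
```

With the notebook's data, the records could come from iterating over the `dataset` and `BoW` columns of `rus_usersdf_mrg`.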

Now that we have formatted and generated our data, we will save it to CSV files and continue with the next step in another notebook.
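One thing to keep in mind when saving: CSV has no list type, so a list-valued column such as `BoW` is written as its text representation and comes back as a string unless it is parsed on load. The sketch below shows one way to restore the lists with `ast.literal_eval`, using a small made-up frame and an in-memory buffer in place of the Drive paths.

```python
import ast
import io

import pandas as pd

# Made-up user frame with a list-valued BoW column
users = pd.DataFrame({"userid": ["u1", "u2"], "BoW": [["fear", "trump"], ["news"]]})

# Write to an in-memory CSV (stands in for a file on Drive)
buffer = io.StringIO()
users.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# A plain read returns the lists as strings such as "['fear', 'trump']"
raw = pd.read_csv(io.StringIO(csv_text))

# Parsing the column on load turns the strings back into Python lists
restored = pd.read_csv(io.StringIO(csv_text), converters={"BoW": ast.literal_eval})

print(type(raw["BoW"][0]), type(restored["BoW"][0]))
```

Whether Step 3 parses the column this way is not shown here; this is only one option for reading the saved files back.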

In [None]:
# Send user data with BoW to csv for Step 3

rus_usersdf_mrg.to_csv('/mypath/Step 3 - Feature Generation/Input_Data_Step3/rus_users_bow.csv')

In [None]:
# Store BoW List in a csv to reference for the legitimate tweet dataset query in Step 2

BoW_list.to_csv('/mypath/Step 2 - Query Legitimate Tweets Dataset/Input_Data_Step2/BoW_list.csv')

In [None]:
# Delete some variables to make space in RAM

del BoW_list, rus_usersdf_mrg, BoW_list_0920, BoW_list_0520, BoW_list_0619, BoW_list_0119, BoW_list_1018

In [None]:
# The tweet dataset is pretty large, so we will have to save it as multiple csv files for Step 3

# Save 0920 dataset
rus_tweetsdf_mrg[rus_tweetsdf_mrg['dataset'] == '0920'].to_csv('/mypath/Step 3 - Feature Generation/Input_Data_Step3/rus_tweets_0920.csv')

In [None]:
# Save 0520 dataset

rus_tweetsdf_mrg[rus_tweetsdf_mrg['dataset'] == '0520'].to_csv('/mypath/Step 3 - Feature Generation/Input_Data_Step3/rus_tweets_0520.csv')

In [None]:
# Save 0119 dataset

rus_tweetsdf_mrg[rus_tweetsdf_mrg['dataset'] == '0119'].to_csv('/mypath/Step 3 - Feature Generation/Input_Data_Step3/rus_tweets_0119.csv')

In [None]:
# Save 0619 dataset

rus_tweetsdf_mrg[rus_tweetsdf_mrg['dataset'] == '0619'].to_csv('/mypath/Step 3 - Feature Generation/Input_Data_Step3/rus_tweets_0619.csv')

In [None]:
# Save 1018 dataset

rus_tweetsdf_mrg[rus_tweetsdf_mrg['dataset'] == '1018'].to_csv('/mypath/Step 3 - Feature Generation/Input_Data_Step3/rus_tweets_1018.csv')

Now that the Russian dataset has been preprocessed and formatted and we have a list of search terms, we will conduct our query via the Twitter API to generate the legitimate user tweet dataset.