# Data Collection, Anonymization and Initial Data Cleaning

This notebook is part of a series of notebooks accompanying the MSc. thesis: **Sentiment analysis on generative language models based on Social Media commentary of industry participants**

The MSc. thesis research is based on tweets about ChatGPT. These were collected, processed and analyzed with the aim of answering the following research question:

**How are generative language models perceived by participants of different industries based on social media commentary?**

As part of the data handling process, this notebook covers the data collection, the anonymization of the data and its initial cleaning.

In [1]:
#Import the necessary packages
import pandas as pd
import numpy as np
from string import digits

#Set pandas options: display up to 500 rows for ease of cleaning
#and silence the chained-assignment (SettingWithCopy) warning
pd.set_option('display.max_rows', 500)
pd.options.mode.chained_assignment = None

#The sweetviz package is used for initial Exploratory Data Analysis (EDA)
import sweetviz as sv
#Suppress all warnings, including long FutureWarning messages
import warnings
warnings.simplefilter(action='ignore')

#Scraper module to get twitter data
import snscrape.modules.twitter as sntwitter

## Data Collection

This section describes the methods used to collect the tweets.
The data was collected with the *snscrape* package and its module for handling tweets, *snscrape.modules.twitter*.

A function, *data_collection*, was created to ease the scraping process. Due to computational limitations, the function was run multiple times until all the desired tweets were collected.

In [2]:
#Search query used for scraping the tweets
query="ChatGpt lang:en since:2022-11-30" 

In [3]:
#Scraping function 
def data_collection(query, max_tweets_int, name_csv):
    """Scrape up to max_tweets_int tweets matching query and save them to '<name_csv>.csv'."""
    #Collect one dictionary per tweet; the dataframe is built once at the end
    rows = []
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        #Stop once max_tweets_int tweets have been collected
        if i >= max_tweets_int:
            break

        #Store the tweet and user attributes of interest
        rows.append({'tweet_id': tweet.id,
                     'tweet_date': tweet.date,
                     'tweet_content': tweet.rawContent,
                     'tweet_source': tweet.source,
                     'tweet_replycount': tweet.replyCount,
                     'tweet_retweetcount': tweet.retweetCount,
                     'tweet_likecount': tweet.likeCount,
                     'tweet_inreplytotweetid': tweet.inReplyToTweetId,
                     'tweet_inreplytouser': tweet.inReplyToUser,
                     'tweet_hashtags': tweet.hashtags,
                     'tweet_media': tweet.media,
                     'usr_description': tweet.user.description,
                     'usr_username': tweet.user.username,
                     'tweet_vibe': tweet.vibe,
                     'usr_verified': tweet.user.verified,
                     'usr_follower_count': tweet.user.followersCount,
                     'usr_location': tweet.user.location,
                     'usr_userid': tweet.user.id,
                     'usr_created': tweet.user.created
                     })

    #Build the dataframe and write it to a csv file
    tweets_df = pd.DataFrame(rows)
    tweets_df.to_csv(f'{name_csv}.csv', index=False)

#TODO: uncomment the line below to run the scraper
#data_collection(query, 1000000, '2023Feb25_toNA')

Tweet 1622448321863487488 contains an app icon medium key '4_1644860781903900672' on app 'android_app'/'com.leh.app', but the corresponding medium is missing; dropping
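
Since the function was run several times, one possible way to split the work is to restrict each run to a date window with Twitter's `since:` and `until:` search operators. The sketch below only builds the query strings. The helper name `build_windowed_queries`, the 30-day window and the end date are assumptions made for illustration and do not describe the batching actually used for the thesis.

```python
from datetime import date, timedelta

def build_windowed_queries(base_query, start, end, step_days=30):
    """Return one search query per date window between start and end."""
    queries = []
    window_start = start
    while window_start < end:
        # The last window is clipped so that it never runs past the end date
        window_end = min(window_start + timedelta(days=step_days), end)
        queries.append(
            f"{base_query} since:{window_start.isoformat()} until:{window_end.isoformat()}"
        )
        window_start = window_end
    return queries

# Hypothetical date range, chosen only for the example
queries = build_windowed_queries("ChatGpt lang:en", date(2022, 11, 30), date(2023, 4, 8))
for q in queries:
    print(q)
```

Each query string could then be passed to `data_collection` together with its own output file name.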


## Data Cleaning

In this section, cleaning considerations are discussed and the data is processed accordingly.

In [28]:
#TODO: load raw data
df_tweet_25FebOnwards= pd.read_csv("data_files/scraped_RAW_data/2023Feb25_toNA.csv")
df_tweet_Prior25Feb = pd.read_csv("data_files/scraped_RAW_data/data_prior25Feb.csv")

In [5]:
#Inspect what the raw data csv files look like
df_tweet_25FebOnwards.head(3)

Unnamed: 0,tweet_id,tweet_date,tweet_content,tweet_source,tweet_replycount,tweet_retweetcount,tweet_likecount,tweet_inreplytotweetid,tweet_inreplytouser,tweet_hashtags,tweet_media,usr_description,usr_username,tweet_vibe,usr_verified,usr_follower_count,usr_location,usr_userid,usr_created
0,1644449184060252160,2023-04-07 21:16:52+00:00,I am thinking of writing a plugin for ChatGPT ...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,0,,,['ChatGPTto'],,I say things. Things so important that they ma...,jscix,,False,70,A small outpost on Pluto,17632837,2008-11-25 21:50:18+00:00
1,1644449176942530561,2023-04-07 21:16:50+00:00,$20 a month for ChatGPT? https://t.co/acRmNWghBl,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,0,,,,[Photo(previewUrl='https://pbs.twimg.com/media...,CTO @mach_49,mdscaff,,False,1541,"Redwood City, CA",14456867,2008-04-21 01:11:15+00:00
2,1644449166775529474,2023-04-07 21:16:48+00:00,COVID : TREATMENT IDEAS\n\n#CHATGPT,"<a href=""http://twitter.com/download/android"" ...",0,0,0,,,['CHATGPT'],,"Oxford DPhil in Zoology, Environmental Rating ...",mattprescott,,False,24743,Little England,21855179,2009-02-25 11:07:48+00:00


In [4]:
df_tweet_Prior25Feb.head(3)

Unnamed: 0,tweet_id,tweet_date,tweet_content,tweet_source,tweet_replycount,tweet_retweetcount,tweet_likecount,tweet_inreplytotweetid,tweet_inreplytouser,tweet_hashtags,tweet_media,usr_description,usr_username,tweet_vibe,usr_verified,usr_follower_count,usr_location,usr_userid,usr_created
0,1609685677528023043,2023-01-01 22:59:06+00:00,Playing with ChatGPT: Very Fun,"<a href=""https://about.twitter.com/products/tw...",0,0,1,,,,,"Participant, #NBATwitter; Dean, #PistonsTwitte...",lazchance,,False,5645,"Raleigh, NC",25218618,2009-03-19 02:33:31+00:00
1,1609685668292366337,2023-01-01 22:59:04+00:00,"#power #dax What the F*K, ChatGPT know Power-B...","<a href=""https://dlvrit.com/"" rel=""nofollow"">d...",0,0,0,,,"['power', 'dax']",,🕵🏻‍♂️Consultant |👨🏻‍💼Managing Director |✍🏼Auth...,PhilippeJB_PJB,,False,1015,World,1167839853184258049,2019-08-31 16:41:45+00:00
2,1609685651762589697,2023-01-01 22:59:00+00:00,@gdb chatGPT was a flop in 2022.,"<a href=""http://twitter.com/download/iphone"" r...",1,0,3,1.609245e+18,https://twitter.com/gdb,,,Interface Manager - Midstream projects | Energ...,amitarunk,,False,652,USA/India,907077838351847426,2017-09-11 03:06:24+00:00


In [5]:
#Concatenate the two dataframes of raw data into one and inspect how they look 
df_tweet = pd.concat([df_tweet_25FebOnwards, df_tweet_Prior25Feb], ignore_index=True)
df_tweet.head(3)

Unnamed: 0,tweet_id,tweet_date,tweet_content,tweet_source,tweet_replycount,tweet_retweetcount,tweet_likecount,tweet_inreplytotweetid,tweet_inreplytouser,tweet_hashtags,tweet_media,usr_description,usr_username,tweet_vibe,usr_verified,usr_follower_count,usr_location,usr_userid,usr_created
0,1644449184060252160,2023-04-07 21:16:52+00:00,I am thinking of writing a plugin for ChatGPT ...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,0,,,['ChatGPTto'],,I say things. Things so important that they ma...,jscix,,False,70,A small outpost on Pluto,17632837,2008-11-25 21:50:18+00:00
1,1644449176942530561,2023-04-07 21:16:50+00:00,$20 a month for ChatGPT? https://t.co/acRmNWghBl,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,0,,,,[Photo(previewUrl='https://pbs.twimg.com/media...,CTO @mach_49,mdscaff,,False,1541,"Redwood City, CA",14456867,2008-04-21 01:11:15+00:00
2,1644449166775529474,2023-04-07 21:16:48+00:00,COVID : TREATMENT IDEAS\n\n#CHATGPT,"<a href=""http://twitter.com/download/android"" ...",0,0,0,,,['CHATGPT'],,"Oxford DPhil in Zoology, Environmental Rating ...",mattprescott,,False,24743,Little England,21855179,2009-02-25 11:07:48+00:00


In [6]:
#Check how many tweets were scraped and how many columns the dataframe has
df_tweet.shape

(2147019, 19)

In [7]:
#TODO: if the data changes, uncomment the line below to save the concatenated data again
#df_tweet.to_csv("raw_data_30Nov_07Apr.csv",index=False)

To make sure the data was saved correctly, it is loaded again from the newly saved file.
<br>It is also worth noting that, because the scraping was carried out in multiple phases, some tweets were expected to have been scraped more than once, so duplicates may exist in the raw data.
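
As a minimal sketch of how such duplicates could be counted and removed, the example below uses two small made-up dataframes in place of two overlapping scraping runs. The variable names `run_a`, `run_b` and `combined` are illustrative only.

```python
import pandas as pd

# Toy data standing in for two overlapping scraping runs (values are made up)
run_a = pd.DataFrame({"tweet_id": [1, 2, 3], "tweet_content": ["a", "b", "c"]})
run_b = pd.DataFrame({"tweet_id": [3, 4], "tweet_content": ["c", "d"]})
combined = pd.concat([run_a, run_b], ignore_index=True)

# Count rows whose tweet_id already appeared earlier in the dataframe
n_duplicates = int(combined.duplicated(subset="tweet_id").sum())

# Keep only the first occurrence of every tweet_id
deduplicated = combined.drop_duplicates(subset="tweet_id")
print(n_duplicates, len(deduplicated))
```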

In [2]:
#Loading raw data
df_tweet= pd.read_csv("data_files/scraped_RAW_data/raw_data_30Nov_07Apr.csv", index_col=False)

In [9]:
#Inspect the first 3 rows of the dataframe
df_tweet.head(3)

Unnamed: 0,tweet_id,tweet_date,tweet_content,tweet_source,tweet_replycount,tweet_retweetcount,tweet_likecount,tweet_inreplytotweetid,tweet_inreplytouser,tweet_hashtags,tweet_media,usr_description,usr_username,tweet_vibe,usr_verified,usr_follower_count,usr_location,usr_userid,usr_created
0,1644449184060252160,2023-04-07 21:16:52+00:00,I am thinking of writing a plugin for ChatGPT ...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,0,,,['ChatGPTto'],,I say things. Things so important that they ma...,jscix,,False,70,A small outpost on Pluto,17632837,2008-11-25 21:50:18+00:00
1,1644449176942530561,2023-04-07 21:16:50+00:00,$20 a month for ChatGPT? https://t.co/acRmNWghBl,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,0,,,,[Photo(previewUrl='https://pbs.twimg.com/media...,CTO @mach_49,mdscaff,,False,1541,"Redwood City, CA",14456867,2008-04-21 01:11:15+00:00
2,1644449166775529474,2023-04-07 21:16:48+00:00,COVID : TREATMENT IDEAS\n\n#CHATGPT,"<a href=""http://twitter.com/download/android"" ...",0,0,0,,,['CHATGPT'],,"Oxford DPhil in Zoology, Environmental Rating ...",mattprescott,,False,24743,Little England,21855179,2009-02-25 11:07:48+00:00


In [12]:
#Generate a sweetviz report of the raw data. This provides column-level information on datatypes, duplicates and more
my_report = sv.analyze(df_tweet)
my_report.show_html()


Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [10]:
#Check how keeping only the tweets without media would affect the dataframe
df_tweet[df_tweet['tweet_media'].isna()].nunique()

tweet_id                  1640146
tweet_date                1477065
tweet_content             1578182
tweet_source                 3068
tweet_replycount              482
tweet_retweetcount            879
tweet_likecount              2147
tweet_inreplytotweetid     512393
tweet_inreplytouser        237494
tweet_hashtags             160644
tweet_media                     0
usr_description            638981
usr_username               700578
tweet_vibe                     27
usr_verified                    2
usr_follower_count          64664
usr_location               145630
usr_userid                 698722
usr_created                800752
dtype: int64

In [4]:
#Check how many tweets have a missing value in the tweet_vibe column
df_tweet[df_tweet['tweet_vibe'].isna()]['tweet_id'].nunique()

2113264

From the previous steps it can be seen that approximately 20% of the tweets include media attachments. These tweets were outside the scope of the MSc. thesis, so they were discarded.<br>
Additionally, since the *tweet_vibe* column consists mainly of null values, it is discarded later on.
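
Shares like these can be read directly from boolean masks. The sketch below uses a small made-up dataframe with the same column names; the values, including the `"[Photo(...)]"` placeholder string, are invented for illustration.

```python
import pandas as pd

# Toy data: one of five tweets has media, one of five has a vibe (values are made up)
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "tweet_media": [None, "[Photo(...)]", None, None, None],
    "tweet_vibe": [None, None, None, None, "vibe"],
})

# The mean of a boolean mask is the fraction of True values
media_share = toy["tweet_media"].notna().mean()

# Fraction of missing values per column
null_share = toy.isna().mean()
print(media_share)
print(null_share)
```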

In [3]:
df_tweet=df_tweet[df_tweet['tweet_media'].isna()].drop_duplicates(subset='tweet_id')

In [58]:
df_tweet.shape

(1640146, 19)

### Data Anonymization

The following section anonymizes the data by identifying users through *usr_userid* instead of *usr_username*. The aim is to obtain a dataframe in which the usernames of Twitter users are not stored.

In [59]:
#Transform usr_userid column into string type
df_tweet['usr_userid']= df_tweet['usr_userid'].astype(str)

In [65]:
#Handle usernames appearing in the column that shows which user a tweet replies to
#Create a new dataframe consisting only of user ids and usernames
twitter_users = df_tweet[['usr_userid','usr_username']].drop_duplicates(subset='usr_userid')


In [67]:
#TODO: uncomment the line below to save the user mapping data
#twitter_users.to_csv('Users_Map.csv', index=False)

In [68]:
twitter_users.shape

(698722, 2)

In [69]:
#Rename columns so that it will be easier to merge the ids of users in the main dataframe
twitter_users.rename(columns={"usr_userid":"usr_inreply_id","usr_username":"tweet_inreplytouser"},inplace=True)

In [23]:
#Inspect new users only dataframe
twitter_users.head(3)

Unnamed: 0,usr_inreply_id,tweet_inreplytouser
0,17632837,jscix
1,14456867,mdscaff
2,21855179,mattprescott


In [24]:
#Strip the profile URL prefix from tweet_inreplytouser so that only the username remains
df_tweet['tweet_inreplytouser']=df_tweet['tweet_inreplytouser'].str.replace("https://twitter.com/","",regex=False)

In [25]:
#Add to the main dataframe the user id of users in reply to whom the tweets were made 
df_tweet = pd.merge(df_tweet, twitter_users, on='tweet_inreplytouser', how='left')
#Inspect the column names of the main dataframe after the merge
df_tweet.columns

Index(['tweet_id', 'tweet_date', 'tweet_content', 'tweet_source',
       'tweet_replycount', 'tweet_retweetcount', 'tweet_likecount',
       'tweet_inreplytotweetid', 'tweet_inreplytouser', 'tweet_hashtags',
       'tweet_media', 'usr_description', 'usr_username', 'tweet_vibe',
       'usr_verified', 'usr_follower_count', 'usr_location', 'usr_userid',
       'usr_created', 'usr_inreply_id'],
      dtype='object')
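
A left merge like this keeps every tweet, but it would duplicate rows if one username appeared more than once in the lookup table. As a hedged sketch on made-up data, the `validate` argument of `pd.merge` can guard against that. The toy frames `tweets`, `lookup` and `dup_lookup` are assumptions for illustration only.

```python
import pandas as pd

# Toy tweets: the second tweet is not a reply, the third replies to an unknown user
tweets = pd.DataFrame({
    "tweet_id": [10, 11, 12],
    "tweet_inreplytouser": ["alice", None, "bob"],
})
# Toy lookup table from username to user id
lookup = pd.DataFrame({
    "usr_inreply_id": ["1", "2"],
    "tweet_inreplytouser": ["alice", "carol"],
})

# validate="many_to_one" raises if a username occurs twice in the lookup table
merged = pd.merge(tweets, lookup, on="tweet_inreplytouser",
                  how="left", validate="many_to_one")

# A lookup table with a repeated username is rejected instead of duplicating rows
dup_lookup = pd.concat(
    [lookup, pd.DataFrame({"usr_inreply_id": ["3"], "tweet_inreplytouser": ["alice"]})],
    ignore_index=True,
)
try:
    pd.merge(tweets, dup_lookup, on="tweet_inreplytouser",
             how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(len(merged), raised)
```

Replies to users who never appear as authors in the dataset simply get a missing *usr_inreply_id*.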

### Initial Data Cleaning

The next section of the notebook handles the initial data cleaning of the raw data. This involves removing redundant columns and rows. 

In [30]:
#Function to clean raw data
def clean_dataframe(df_tweets):
    #Remove all tweets with media
    df_tweets = df_tweets[df_tweets['tweet_media'].isna()]
    #Sort dataframe from newest to oldest tweet
    df_tweets = df_tweets.sort_values(by=['tweet_date'], ascending=False)
    #Remove duplicated tweets, keeping the first occurrence
    df_tweets = df_tweets.drop_duplicates(subset=['tweet_id'], keep='first')
    #Remove unused columns, including the ones holding usernames
    df_tweets = df_tweets.drop(['tweet_vibe', 'tweet_media','usr_username','tweet_inreplytouser'], axis=1)
    #Reset the index; the old index is kept as an 'index' column
    df_tweets = df_tweets.reset_index()

    return df_tweets

In [31]:
#Clean dataframe with the clean_dataframe function
df_tweet=clean_dataframe(df_tweet)

In [33]:
#Check shape of the resulting dataframe
df_tweet.shape

(1640146, 17)

In [34]:
#TODO: uncomment the line below to save the cleaned dataframe
#df_tweet.to_csv("1_raw_data_30Nov_07Apr_CLEAN.csv",index=False)

In [4]:
#Load cleaned data from saved file
df_tweet= pd.read_csv("data_files/raw_data_30Nov_07Apr_CLEAN.csv", index_col=False).drop(columns=['index'])
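
The extra `index` column dropped here comes from `clean_dataframe`, which calls `reset_index()` without `drop=True`. A small illustration on made-up data, with variable names chosen only for this example:

```python
import pandas as pd

# Toy dataframe with a non-default index, as left behind by filtering and sorting
toy = pd.DataFrame({"tweet_id": [3, 1, 2]}, index=[7, 8, 9])

# Default behaviour: the old index is kept as a new column named 'index'
with_index_col = toy.reset_index()

# With drop=True the old index is discarded instead
without_index_col = toy.reset_index(drop=True)

print(list(with_index_col.columns))
print(list(without_index_col.columns))
```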


In [45]:
#Inspect first 3 rows of dataframe
df_tweet.head(3)

Unnamed: 0,tweet_id,tweet_date,tweet_content,tweet_source,tweet_replycount,tweet_retweetcount,tweet_likecount,tweet_inreplytotweetid,tweet_hashtags,usr_description,usr_verified,usr_follower_count,usr_location,usr_userid,usr_created,usr_inreply_id
0,1644449184060252160,2023-04-07 21:16:52+00:00,I am thinking of writing a plugin for ChatGPT ...,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,0,,['ChatGPTto'],I say things. Things so important that they ma...,False,70,A small outpost on Pluto,17632837,2008-11-25 21:50:18+00:00,
1,1644449166775529474,2023-04-07 21:16:48+00:00,COVID : TREATMENT IDEAS\n\n#CHATGPT,"<a href=""http://twitter.com/download/android"" ...",0,0,0,,['CHATGPT'],"Oxford DPhil in Zoology, Environmental Rating ...",False,24743,Little England,21855179,2009-02-25 11:07:48+00:00,
2,1644449134425022470,2023-04-07 21:16:40+00:00,any cool viruses to give the chat gpt AI?,"<a href=""https://mobile.twitter.com"" rel=""nofo...",0,0,0,,,enjoyer of long walks in the dungeon.,False,228,home.,1207075000164864000,2019-12-17 23:08:33+00:00,


In [46]:
#Inspect last 3 rows of the dataframe
df_tweet.tail(3)

Unnamed: 0,tweet_id,tweet_date,tweet_content,tweet_source,tweet_replycount,tweet_retweetcount,tweet_likecount,tweet_inreplytotweetid,tweet_hashtags,usr_description,usr_verified,usr_follower_count,usr_location,usr_userid,usr_created,usr_inreply_id
1640143,1598015627540635648,2022-11-30 18:06:29+00:00,"Just launched ChatGPT, our new AI system which...","<a href=""https://mobile.twitter.com"" rel=""nofo...",84,369,2370,,,President & Co-Founder @OpenAI,True,215048,,162124540,2010-07-02 19:38:09+00:00,
1640144,1598014522098208769,2022-11-30 18:02:06+00:00,"Try talking with ChatGPT, our new AI system wh...","<a href=""https://mobile.twitter.com"" rel=""nofo...",1320,3518,13703,,,OpenAI’s mission is to ensure that artificial ...,True,1575814,,4398626122,2015-12-06 22:51:08+00:00,
1640145,1598014056790622225,2022-11-30 18:00:15+00:00,ChatGPT: Optimizing Language Models for Dialog...,"<a href=""http://www.evyware.com/"" rel=""nofollo...",0,0,2,,,using A.I. to propel the real estate industry ...,False,10482,"Cleveland, OH",354863991,2011-08-14 12:40:43+00:00,


To ease further data handling, which involves applying models such as LDA (for topic modelling) and VADER (for sentiment analysis), two new dataframes were created and saved separately: one with the users' descriptions and one with the tweets' content.

In [51]:
df_tweet['usr_userid']=df_tweet['usr_userid'].astype(str)

In [52]:
#Create dataframe of user descriptions (one row per user)
df_users = df_tweet[['usr_userid','usr_description']].drop_duplicates(subset='usr_userid')
df_users.shape

(698722, 2)

In [6]:
#Create dataframe of tweets contents
df_tweets_content = df_tweet[['tweet_id','tweet_content']].drop_duplicates()
df_tweets_content.shape

(1640146, 2)

In [7]:
#Save the two new files
#df_users.reset_index().drop(columns="index").to_csv("UsrData_PreLDA.csv", index=False)
#df_tweets_content.reset_index().drop(columns="index").to_csv("TweetsContent_CLEAN.csv", index=False)
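
Because both files keep their key column (*usr_userid* and *tweet_id*), results computed on them later can be joined back onto the main dataframe. The sketch below shows this on made-up data. The columns `topic` and `sentiment` and all values are hypothetical placeholders, not outputs of the thesis.

```python
import pandas as pd

# Toy main dataframe: two tweets by user 'a', one by user 'b'
main = pd.DataFrame({"tweet_id": [1, 2, 3], "usr_userid": ["a", "a", "b"]})

# Hypothetical per-user and per-tweet results (placeholder values)
user_results = pd.DataFrame({"usr_userid": ["a", "b"], "topic": [0, 1]})
tweet_results = pd.DataFrame({"tweet_id": [1, 2, 3], "sentiment": [0.5, -0.2, 0.0]})

# Join the results back through the shared key columns
enriched = (
    main.merge(user_results, on="usr_userid", how="left")
        .merge(tweet_results, on="tweet_id", how="left")
)
print(enriched)
```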