# Data Preprocessing

This notebook is part of a series of notebooks accompanying the MSc thesis: **Sentiment analysis on generative language models based on Social Media commentary of industry participants**

The thesis research is based on tweets about ChatGPT, which were collected, processed and analyzed with the aim of answering the following research question:

**How are generative language models perceived by participants of different industries based on social media commentary?**

To answer the research question, the tweet data is further processed with topic modelling (Latent Dirichlet Allocation, LDA) and sentiment analysis (VADER). 
Before these techniques can be applied, the data must be cleaned and transformed into the input format they require. 

These transformations matter most for LDA, whose input must take the form of tokens. 
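
To make the required input shape concrete, the sketch below turns a few tokenized documents into the bag-of-words counts that LDA implementations typically consume. It uses only the standard library. The sample documents and the helper names `build_vocabulary` and `to_bag_of_words` are illustrative assumptions, not part of the thesis code.

```python
from collections import Counter

# Hypothetical tokenized documents (illustrative only, not thesis data)
docs = [
    ["chatgpt", "language", "model"],
    ["chatgpt", "ai", "ai", "system"],
]

def build_vocabulary(documents):
    """Map every distinct token to an integer id, in sorted order."""
    vocab = sorted({token for doc in documents for token in doc})
    return {token: idx for idx, token in enumerate(vocab)}

def to_bag_of_words(doc, vocabulary):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocabulary[token] for token in doc)
    return sorted(counts.items())

vocabulary = build_vocabulary(docs)
bow_corpus = [to_bag_of_words(doc, vocabulary) for doc in docs]
print(bow_corpus)
```

Each document becomes a list of (token id, count) pairs, which is why the cleaning steps in this notebook end with a tokens column rather than free text.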

In [1]:
#Import cell of necessary packages

import pandas as pd
import numpy as np
import re, string #Imported for regex data cleaning 
from string import digits # Imported for digits handling
import glob
import datetime
import os

#Set pandas options for ease of cleaning
pd.set_option('display.max_rows', 500)
pd.options.mode.chained_assignment = None

# Import sweet_viz for initial exploratory data analysis (EDA)
import sweetviz as sv
#Package imported to suppress long future warnings
import warnings
warnings.simplefilter(action='ignore')

#Library imported for text cleaning
import spacy
nlp = spacy.load("en_core_web_sm")

#Libraries used to clean text (stem, lemmatize words)
from lemminflect import getLemma
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer 
nltk.download('stopwords') #Download stop words list from nltk package

#Library to extract urls
from urlextract import URLExtract
#Library to handle emoji
import emoji
#Tweet cleaning package
import preprocessor as p

#Used to clean output screen
from IPython.display import clear_output


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\oanaa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
#Load user data and tweets data
df_users=pd.read_csv("data_files/UniqueUsers_CLEAN.csv")
df_tweets_content=pd.read_csv("data_files/TweetsContent_CLEAN.csv")

## Text Pre-Processing Methods

This section introduces the functions developed to carry out the data transformations required for LDA topic modeling. 
The cells that were re-run to transform the columns needed for LDA (user descriptions, tweet content) can be found at the end of this section.  

In [3]:
#Load stop words to be removed from text
stop_words_list = set(nltk.corpus.stopwords.words('english'))
#Inspect stop words 
print(stop_words_list)

{'these', 'theirs', 'doesn', "isn't", 'she', 'who', 'its', 'out', 'before', "should've", 'no', 'is', 'again', 'down', 'same', 'ours', 'won', "you'd", 'should', 'shan', 'that', 'against', 'don', 'had', 'haven', 'this', 'not', 'only', 'if', 'you', 'each', 'were', 'at', 'now', 'do', "mustn't", 'it', 'through', 'be', "wasn't", "haven't", "won't", 'how', 'having', 'here', 'themselves', 'under', 'an', 'about', "weren't", 'below', "it's", 'most', 'him', 'there', 'some', "you'll", 'he', 'has', 'between', 'further', 'hers', 'those', 'a', 'nor', 'doing', 'mustn', 'just', 'did', 're', 'ma', 'needn', 'very', "didn't", 'because', 'them', 'then', 'over', 'they', 'wouldn', 'our', 'any', "aren't", "hasn't", 'in', 'hasn', 'been', 'your', "mightn't", 'yours', 'other', 'i', 'with', 'we', 'himself', 'as', 'was', 'have', 'aren', "wouldn't", 'didn', 'than', 'his', 'shouldn', 'can', 'into', 'to', 'more', 'weren', 'until', 'on', "doesn't", 'd', 'are', 'couldn', 'm', "couldn't", 'isn', 'above', 'the', 'my', 'a

In [4]:
#Define lemmatization function
def lemmatize_text(text):
    return [getLemma(str(w), upos= w.pos_)[0] if len(getLemma(str(w), upos= w.pos_))>0 else "" for w in text]


In [5]:
#Define text cleaning function
def df_text_prepro(df_tweets, text_column):
    stop_words_list = set(nltk.corpus.stopwords.words('english'))
    #Extract hashtags
    df_tweets[f'{text_column}_hashtags'] = df_tweets[f'{text_column}'].apply(lambda x: re.findall(r"#(\w+)", str(x)))
    #Extract mentions
    df_tweets[f'{text_column}_mentions'] = df_tweets[f'{text_column}'].apply(lambda x: re.findall(r"@(\w+)", str(x)))
    #Extract URL
    extractor = URLExtract()
    df_tweets[f'{text_column}_URLs'] = df_tweets[f'{text_column}'].apply(lambda x: extractor.find_urls(str(x)))
    #Extract Emoji's
    pattern = re.compile(r"|".join(map(re.escape, emoji.EMOJI_DATA)))
    df_tweets[f'{text_column}_emoji'] = df_tweets[f'{text_column}'].apply(lambda x: "".join(pattern.findall(str(x))))
    #Apply tweet cleaning such as removal of urls, hashtags, mentions, smileys 
    df_tweets[f'{text_column}_clean'] = df_tweets[f'{text_column}'].apply(lambda x: p.clean(str(x)))
    #Remove punctuation
    df_tweets[f'{text_column}_clean'] = df_tweets[f'{text_column}_clean'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation.replace("-",""))))
    #remove digits 
    df_tweets[f'{text_column}_clean'] = df_tweets[f'{text_column}_clean'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
    #Lowercase
    df_tweets[f'{text_column}_clean'] = df_tweets[f'{text_column}_clean'].apply(lambda x: x.lower())
    #Create tokens objects so that part of speech of words can be identified
    df_tweets[f'{text_column}_tokens'] = df_tweets[f'{text_column}_clean'].apply(lambda x: nlp(x))
    #Lemamatize words from tokens column 
    df_tweets[f'{text_column}_tokens']= df_tweets[f'{text_column}_tokens'].apply(lemmatize_text)
    #Remove stopwords from tokens list
    df_tweets[f'{text_column}_tokens'] = df_tweets[f'{text_column}_tokens'].apply(lambda x: x if type(x) == type(None) else list(set(x).difference(stop_words_list) ))
    df_tweets[f'{text_column}_tokens'] = df_tweets[f'{text_column}_tokens'].apply(lambda x: x if type(x) == type(None) else list(set(x).difference(['']) ))
    #return df
    return df_tweets
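
As a quick illustration of the string-level steps in `df_text_prepro`, the sketch below applies the same regular expressions and punctuation, digit and lowercase handling to a single made-up sentence. It is a simplified sketch: it deliberately leaves out the `p.clean`, `URLExtract`, emoji and spaCy steps, so hashtag and mention words remain in the cleaned text here, whereas in the full function they are removed beforehand.

```python
import re
import string

# Made-up example text (illustrative only)
sample = "Loving #ChatGPT by @OpenAI, state-of-the-art in 2023!"

# Same patterns as in df_text_prepro: capture the word after '#' or '@'
hashtags = re.findall(r"#(\w+)", sample)
mentions = re.findall(r"@(\w+)", sample)

# Remove all punctuation except the hyphen, so compounds stay intact
keep_hyphen = string.punctuation.replace("-", "")
no_punct = sample.translate(str.maketrans("", "", keep_hyphen))

# Drop digits, then lowercase
no_digits = "".join(ch for ch in no_punct if not ch.isdigit())
cleaned = no_digits.lower()

print(hashtags, mentions, repr(cleaned))
```

Keeping the hyphen explains why tokens such as `-` and `co` can later appear for a description like "Co-Founder".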

In [6]:
#Define function to preprocess data in chunks, as the process is time-consuming. The function saves its progress to disk after every 200 rows
def preprocessing_chuncks(df_tweets, col_to_tokanize , typeofdata, lower_limit=0):
    
    #store the total rows of the dataframe to use in checking whether the entire dataframe was cleaned 
    total_df_rows= df_tweets.shape[0]
    
    #dates stored as strings to be used in naming convention of files
    low_string = str(df_tweets.index.min())
    up_string = str(df_tweets.index.max())

    #Loop to clean data
    while True:
        #Assign upper limit based on lower limit, this incrementation helps to move to next iteration
        upper_limit = lower_limit + 200

        #Condition to cap upper_limit at the total number of rows (the slice end is exclusive, so no -1 is needed)
        if upper_limit > df_tweets.shape[0]:
            upper_limit = df_tweets.shape[0]
        #Condition to break from the loop    
        if lower_limit >= total_df_rows:
            break

        #Create a dataframe of the rows processed in current iteration     
        dataframe = df_tweets[lower_limit:upper_limit]
        #Print status update message
        print(f"{lower_limit}: {upper_limit} >> started")
        clear_output(wait=True)
        #Clean the current rows in the iteration 
        dataframe = df_text_prepro(dataframe, col_to_tokanize )
        #Since df_text_prepro prints warnings and errors that are not necessary whenever the tokenizer does not recognize the part of speech of a word, the output is cleared
        clear_output(wait=False)
        print(f"Completed {col_to_tokanize}")
        clear_output(wait=True)

        if lower_limit != 0: 
            #append the preprocessed rows to the csv without repeating the header
            dataframe.to_csv(f'{low_string}_to_{up_string}_processed_{typeofdata}.csv', mode='a', index=False, header=False)
        else: 
            dataframe.to_csv(f'{low_string}_to_{up_string}_processed_{typeofdata}.csv', mode='a', index=False) 
        #Print status update
        print(f"{lower_limit}: {upper_limit} >> completed")
        #Assign upper_limit to lower_limit
        lower_limit = upper_limit
    
    return "COMPLETED"
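
The chunk boundaries used by such a loop can be isolated in a small generator, which makes the end-of-data handling easier to check. This is a minimal sketch under the assumption that rows are addressed by position with an exclusive upper bound, as in `df_tweets[lower_limit:upper_limit]`; the name `chunk_bounds` is hypothetical.

```python
def chunk_bounds(total_rows, chunk_size=200, start=0):
    """Yield (lower, upper) slice bounds covering rows start..total_rows-1."""
    lower = start
    while lower < total_rows:
        # The upper bound is exclusive, so it may equal total_rows
        upper = min(lower + chunk_size, total_rows)
        yield lower, upper
        lower = upper

# Example: 450 rows in chunks of 200
bounds = list(chunk_bounds(total_rows=450, chunk_size=200))
print(bounds)
```

Each pair can be used directly as a positional slice, and passing a non-zero `start` resumes an interrupted run in the same way as the `lower_limit` argument.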

The cells below were re-run multiple times, once per slice of the data, so that the *usr_description* and *tweet_content* columns of *df_users* and *df_tweets_content*, respectively, were fully preprocessed.

In [8]:
#Assign part of df_tweets_content for ease of preprocessing
df_prepo=df_tweets_content[800000:1000000]
df_prepo.shape

(66999, 2)

In [9]:
#Preprocess df_prepo
preprocessing_chuncks(df_prepo,"tweet_content","tweets")

0: 200 >> started


## Users Data

This section shows how the user data was handled so that all user descriptions were processed.

In [None]:
#Load users data 
unique_users= pd.read_csv("data_files/UniqueUsers_CLEAN.csv")
#Cast user_id column into string as it is unique id and not int
unique_users['usr_userid']= unique_users['usr_userid'].apply(str)

In [None]:
#Check if both columns are now object
#Check shape of dataframe
unique_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731084 entries, 0 to 731083
Data columns (total 2 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   usr_userid       731084 non-null  object
 1   usr_description  650350 non-null  object
dtypes: object(2)
memory usage: 11.2+ MB


In [None]:
#Read preprocessed user descriptions files
preprocessed_df1 = pd.read_csv("data_files/0_to_199999_processed_usr.csv")
preprocessed_df2 = pd.read_csv("data_files/200000_to_399999_processed_usr.csv")
preprocessed_df3 = pd.read_csv("data_files/400000_to_599999_processed_usr.csv")
preprocessed_df4 = pd.read_csv("data_files/600000_to_731083_processed_usr.csv")

In [None]:
#Merge all preprocessed data frames into one, drop duplicates for sanity check 
user_prepro_data = pd.concat([preprocessed_df1,preprocessed_df2,preprocessed_df3,preprocessed_df4]).drop_duplicates(subset=['usr_userid'],keep='last')

In [None]:
#Merge the preprocessed data to the unique users data based on usr_userid
user_prepro_data = unique_users.merge( user_prepro_data, on='usr_userid', how='left')
user_prepro_data.shape

(731084, 9)
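
The `_x` and `_y` suffixes that appear after this merge come from pandas itself: when both frames contain a non-key column with the same name, a merge keeps both and renames them. The sketch below reproduces this on two tiny, made-up frames; the values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames (illustrative values)
unique_users = pd.DataFrame({
    "usr_userid": ["1", "2", "3"],
    "usr_description": ["AI founder", "coder", None],
})
preprocessed = pd.DataFrame({
    "usr_userid": ["1", "2"],
    "usr_description": ["AI founder", "coder"],
    "usr_description_clean": ["ai founder", "coder"],
})

# A left merge keeps every user, even those without a preprocessed row
merged = unique_users.merge(preprocessed, on="usr_userid", how="left")
print(merged.columns.tolist())
print(merged)
```

Users without a match on the right-hand side keep their row but receive missing values in the preprocessed columns, which is how unprocessed descriptions can be located afterwards.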

In [None]:
user_prepro_data.head(3)

Unnamed: 0,usr_userid,usr_description_x,usr_description_y,usr_description_hashtags,usr_description_mentions,usr_description_URLs,usr_description_emoji,usr_description_clean,usr_description_tokens
0,354863991,using A.I. to propel the real estate industry ...,using A.I. to propel the real estate industry ...,"['PropTech', 'AI']",[],[],,using ai to propel the real estate industry fo...,"['enjoy', 'propel', 'family', 'read', 'industr..."
1,4398626122,OpenAI’s mission is to ensure that artificial ...,OpenAI’s mission is to ensure that artificial ...,[],[],['openai.com/jobs'],,openais mission is to ensure that artificial g...,"['general', 'benefit', 'humanity', 'intelligen..."
2,162124540,President & Co-Founder @OpenAI,President & Co-Founder @OpenAI,[],['OpenAI'],[],,president co-founder,"['-', 'founder', 'president', 'co']"


In [None]:
#Rename columns for better readability
user_prepro_data.rename(columns={'usr_description_x':'usr_description_origin','usr_description_y':'usr_description_prepro'}, inplace=True)
user_prepro_data.shape

(731084, 9)

In [9]:
import ast

In [22]:
#TODO: the commented-out line below was used to save the data. The data is subsequently read back from the file where it was saved
#user_prepro_data.to_csv("UsrData_PreLDA.csv",index=False)
user_prepro_data=pd.read_csv("data_files/UsrData_PreLDA.csv")
user_prepro_data['usr_description_tokens'] = user_prepro_data['usr_description_tokens'].apply(ast.literal_eval)

In [11]:
user_prepro_data.head(3)

Unnamed: 0,usr_userid,usr_description_origin,usr_description_prepro,usr_description_hashtags,usr_description_mentions,usr_description_URLs,usr_description_emoji,usr_description_clean,usr_description_tokens
0,354863991,using A.I. to propel the real estate industry ...,using A.I. to propel the real estate industry ...,"['PropTech', 'AI']",[],[],,using ai to propel the real estate industry fo...,"[enjoy, propel, family, read, industry, forwar..."
1,4398626122,OpenAI’s mission is to ensure that artificial ...,OpenAI’s mission is to ensure that artificial ...,[],[],['openai.com/jobs'],,openais mission is to ensure that artificial g...,"[general, benefit, humanity, intelligence, mis..."
2,162124540,President & Co-Founder @OpenAI,President & Co-Founder @OpenAI,[],['OpenAI'],[],,president co-founder,"[-, founder, president, co]"


In [41]:
user_prepro_data['check']=user_prepro_data['usr_description_tokens'].apply(lambda x: 0 if len(x)<=0 or x==['nan'] else 1)

In [42]:
user_prepro_data[user_prepro_data['check']==0].shape

(103338, 10)

In [None]:
#During the preprocessing phase a few rows generated errors, so a few users' descriptions were not cleaned properly. 
#The following section therefore works on these descriptions
user_prepro_data.tail(3)

Unnamed: 0,usr_userid,usr_description_origin,usr_description_prepro,usr_description_hashtags,usr_description_mentions,usr_description_URLs,usr_description_emoji,usr_description_clean,usr_description_tokens
731081,832624291724144640,coder interested in #ai #ML #sentience #health...,,,,,,,
731082,327337127,Founder @PplPolicyProj: patreon.com/peoplespol...,,,,,,,
731083,1207075000164864000,enjoyer of long walks in the dungeon.,,,,,,,


In [None]:
#Inspect rows that have null data in the usr_description_prepro. As a result it can be seen that some users are missing descriptions entirely 
user_prepro_data[user_prepro_data['usr_description_prepro'].isnull()].head(3)

Unnamed: 0,usr_userid,usr_description_origin,usr_description_prepro,usr_description_hashtags,usr_description_mentions,usr_description_URLs,usr_description_emoji,usr_description_clean,usr_description_tokens
23,2496666240,,,[],[],[],,,['nan']
40,825495264270114816,,,[],[],[],,,['nan']
45,1128159740599656448,"somehow, I keep ending up in the AGI timeline",,[],[],[],,,['nan']


In [None]:
#Preprocess solely the rows that were problematic 
prepro_missing_rows = df_text_prepro(user_prepro_data[user_prepro_data['usr_description_prepro'].isnull()], 'usr_description_origin')

In [None]:
#Save the newly preprocessed rows in case anything goes wrong
prepro_missing_rows.to_csv("data_files/usr_prepro_missing_rows.csv", index=False)

In [None]:
#Read missing rows
prepro_missing_rows=pd.read_csv("data_files/usr_prepro_missing_rows.csv")
#Extra columns were added when preprocessing 
prepro_missing_rows.drop(columns=['usr_description_hashtags','usr_description_mentions','usr_description_URLs','usr_description_emoji','usr_description_clean','usr_description_tokens'],inplace=True)
#Rename remaining columns to prepare for merging with the rest of the user description dataset
prepro_missing_rows.rename(columns= {'usr_description_origin_hashtags':'usr_description_hashtags', 'usr_description_origin_mentions':'usr_description_mentions', 'usr_description_origin_URLs':'usr_description_URLs', 'usr_description_origin_emoji':'usr_description_emoji','usr_description_origin_clean':'usr_description_clean','usr_description_origin_tokens':'usr_description_tokens'}, inplace = True)
prepro_missing_rows.tail()

Unnamed: 0,usr_userid,usr_description_origin,usr_description_prepro,usr_description_hashtags,usr_description_mentions,usr_description_URLs,usr_description_emoji,usr_description_clean,usr_description_tokens
83391,711830875256717312,Amazon Wholesale,,[],[],[],,amazon wholesale,"['amazon', 'wholesale']"
83392,3698407940,Love life - live life. Politicians need to get...,,[],[],[],,love life - live life politicians need to get ...,"['world', 'politician', 'fragile', 'need', 'ma..."
83393,832624291724144640,coder interested in #ai #ML #sentience #health...,,"['ai', 'ML', 'sentience', 'health']",[],[],,coder interested in of extensant citizen initi...,"['extensant', 'initiative', 'interested', 'cit..."
83394,327337127,Founder @PplPolicyProj: patreon.com/peoplespol...,,[],"['PplPolicyProj', 'ebruenig']","['patreon.com/peoplespolicyp….', 'patreon.com/...",,founder co-host of the bruenigs with,"['host', '-', 'co', 'founder', 'bruenig']"
83395,1207075000164864000,enjoyer of long walks in the dungeon.,,[],[],[],,enjoyer of long walks in the dungeon,"['enjoyer', 'dungeon', 'walk', 'long']"


In [None]:
#Inspect shape of the dataframe of newly preprocessed rows
prepro_missing_rows.shape

(83396, 9)

In [None]:
#Concatenate the original data with the missing rows that are now preprocessed
user_prepro_final = pd.concat([user_prepro_data,prepro_missing_rows])
user_prepro_final.tail()

Unnamed: 0,usr_userid,usr_description_origin,usr_description_prepro,usr_description_hashtags,usr_description_mentions,usr_description_URLs,usr_description_emoji,usr_description_clean,usr_description_tokens
83391,711830875256717312,Amazon Wholesale,,[],[],[],,amazon wholesale,"['amazon', 'wholesale']"
83392,3698407940,Love life - live life. Politicians need to get...,,[],[],[],,love life - live life politicians need to get ...,"['world', 'politician', 'fragile', 'need', 'ma..."
83393,832624291724144640,coder interested in #ai #ML #sentience #health...,,"['ai', 'ML', 'sentience', 'health']",[],[],,coder interested in of extensant citizen initi...,"['extensant', 'initiative', 'interested', 'cit..."
83394,327337127,Founder @PplPolicyProj: patreon.com/peoplespol...,,[],"['PplPolicyProj', 'ebruenig']","['patreon.com/peoplespolicyp….', 'patreon.com/...",,founder co-host of the bruenigs with,"['host', '-', 'co', 'founder', 'bruenig']"
83395,1207075000164864000,enjoyer of long walks in the dungeon.,,[],[],[],,enjoyer of long walks in the dungeon,"['enjoyer', 'dungeon', 'walk', 'long']"


In [None]:
#Drop duplicates of the file and keep just the last ones
user_prepro_final.drop_duplicates(subset='usr_userid',keep='last').shape

(698722, 9)
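
The effect of concatenating and then calling `drop_duplicates(..., keep='last')` can be seen on a small made-up example: when the same `usr_userid` appears in both frames, the row from the frame concatenated last is the one that survives. The frame names and values below are illustrative assumptions.

```python
import pandas as pd

# First pass: two users were not cleaned (illustrative values)
first_pass = pd.DataFrame({
    "usr_userid": [1, 2, 3],
    "usr_description_clean": ["ai founder", None, None],
})
# Second pass: only the previously missing users, now cleaned
second_pass = pd.DataFrame({
    "usr_userid": [2, 3],
    "usr_description_clean": ["coder", "long walks"],
})

combined = pd.concat([first_pass, second_pass])
# keep='last' prefers the re-processed rows over the earlier empty ones
final = combined.drop_duplicates(subset="usr_userid", keep="last")
print(final)
```

The order of the frames passed to `pd.concat` therefore matters: the re-processed rows have to come last for this pattern to replace the earlier ones.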

In [None]:
#TODO: uncommenting the below line will save the preprocessed data 
#user_prepro_final.drop_duplicates(subset='usr_userid',keep='last').to_csv("UsrData_PreLDA.csv", index=False)

In [2]:
#Read data
user_prepro_df= pd.read_csv("UsrData_PreLDA.csv").drop(columns="usr_description_origin").rename(columns={'usr_description_prepro':'usr_description'})
user_prepro_df.head(3)

Unnamed: 0,usr_userid,usr_description,usr_description_hashtags,usr_description_mentions,usr_description_URLs,usr_description_emoji,usr_description_clean,usr_description_tokens
0,354863991,using A.I. to propel the real estate industry ...,"['PropTech', 'AI']",[],[],,using ai to propel the real estate industry fo...,"['enjoy', 'propel', 'family', 'read', 'industr..."
1,4398626122,OpenAI’s mission is to ensure that artificial ...,[],[],['openai.com/jobs'],,openais mission is to ensure that artificial g...,"['general', 'benefit', 'humanity', 'intelligen..."
2,162124540,President & Co-Founder @OpenAI,[],['OpenAI'],[],,president co-founder,"['-', 'founder', 'president', 'co']"


In [4]:
#Loop below casts the usr_description_tokens from string to list of strings 
for col in ['usr_description_tokens']:
    user_prepro_df[f'{col}'] = user_prepro_df[f'{col}'].apply(lambda x: x.strip("']['").split("', '"))
#Inspect what the dataset looks like
user_prepro_df.head(10)

Unnamed: 0,usr_userid,usr_description,usr_description_hashtags,usr_description_mentions,usr_description_URLs,usr_description_emoji,usr_description_clean,usr_description_tokens
0,354863991,using A.I. to propel the real estate industry ...,"['PropTech', 'AI']",[],[],,using ai to propel the real estate industry fo...,"[enjoy, propel, family, read, industry, forwar..."
1,4398626122,OpenAI’s mission is to ensure that artificial ...,[],[],['openai.com/jobs'],,openais mission is to ensure that artificial g...,"[general, benefit, humanity, intelligence, mis..."
2,162124540,President & Co-Founder @OpenAI,[],['OpenAI'],[],,president co-founder,"[-, founder, president, co]"
3,1573710710852489216,The latest developments in the world of artifi...,[],[],[],,the latest developments in the world of artifi...,"[world, late, intelligence, development, artif..."
4,4617024083,I'm a bot. I post articles from the Hacker New...,[],['c17r_'],[],,im a bot i post articles from the hacker news ...,"[may, news, post, dayby, article, bot, bounce,..."
5,1181481640532406272,An average #radonc with extraordinary interest...,"['radonc', 'radiating', 'Indian', 'advocate', ...",['vivaldibrowser'],['radoncnotes.com'],,an average with extraordinary interests common...,"[extraordinary, radoncnotescom, average, proud..."
6,1128321032413155329,For the latest #Hack to incorporate Machine Le...,"['Hack', 'AI', 'ECommerce', 'Everyday']",['KhareemSudlow'],[],,for the latest to incorporate machine learning...,"[lifestyle, business, late, learn, machine, in..."
7,944364177581260800,Aspiring #Philanthropist | Machine Learning | ...,['Philanthropist'],[],['KhareemSudlow.com'],,aspiring machine learning blockchain web v...,"[web, blockchain, khareemsudlowcom, learn, asp..."
8,373030639,All things SEO + Ai - Random Facts - cofounder...,[],"['draftngoal', 'data']",[],,all things seo ai - random facts - cofounder ...,"[random, fact, seo, cofounder, ai, thing]"
9,23367384,Technology Expert - Advisory Board Member - Co...,[],[],['knelsonvsi.com'],,technology expert - advisory board member - co...,"[expert, speaker, knelsonvsicom, tv, community..."
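
One detail of the `strip`/`split` conversion above is worth noting: a serialized empty list, `"[]"`, becomes a list containing one empty string rather than an empty list. The sketch below compares that approach with `ast.literal_eval`, which this notebook also uses in an earlier cell. The helper names are hypothetical, and `literal_eval` assumes every value is a string holding a valid Python literal, so missing values would need separate handling.

```python
import ast

serialized_tokens = "['founder', 'president', 'co']"
serialized_empty = "[]"

def parse_by_split(text):
    """String-splitting approach, as in the cell above."""
    return text.strip("']['").split("', '")

def parse_by_literal_eval(text):
    """Parse the string as a Python literal instead."""
    return ast.literal_eval(text)

print(parse_by_split(serialized_tokens))
print(parse_by_split(serialized_empty))         # a list holding one empty string
print(parse_by_literal_eval(serialized_empty))  # an empty list
```

Both approaches agree on non-empty lists of simple tokens and differ only in how they treat the empty case.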


In [5]:
#Inspect the original usr_description and usr_description_tokens side by side
user_prepro_df[['usr_description','usr_description_tokens']].head(3)

Unnamed: 0,usr_description,usr_description_tokens
0,using A.I. to propel the real estate industry ...,"[enjoy, propel, family, read, industry, forwar..."
1,OpenAI’s mission is to ensure that artificial ...,"[general, benefit, humanity, intelligence, mis..."
2,President & Co-Founder @OpenAI,"[-, founder, president, co]"


In [7]:
#When implementing LDA it was observed that one token was just an empty string, so we check how many user descriptions are affected by the problem
problem_rows= 0 
rows_list_just_empty = []
rows_list_additional_tokens = []
for i in range(user_prepro_df.shape[0]):
    if "" in user_prepro_df['usr_description_tokens'].iloc[i]:

        if user_prepro_df['usr_description_tokens'].iloc[i] != [""]: 
            rows_list_additional_tokens.append(i)
            problem_rows = problem_rows +1
        else: 
            rows_list_just_empty.append(i)
            problem_rows = problem_rows +1
        
print(f"List just empty: {rows_list_just_empty} \nList other tokens: {rows_list_additional_tokens} \nRows : {problem_rows}")

List just empty: [12, 41, 92, 127, 182, 188, 348, 353, 385, 399, 446, 462, 473, 486, 491, 504, 597, 634, 639, 646, 655, 675, 681, 700, 718, 733, 734, 780, 785, 841, 883, 889, 905, 915, 918, 968, 970, 978, 990, 1004, 1007, 1037, 1047, 1070, 1075, 1122, 1236, 1240, 1258, 1275, 1277, 1287, 1354, 1365, 1402, 1468, 1495, 1503, 1517, 1544, 1556, 1674, 1707, 1757, 1842, 1908, 1914, 1941, 1955, 1967, 2064, 2074, 2140, 2187, 2213, 2226, 2350, 2353, 2354, 2454, 2481, 2537, 2542, 2591, 2596, 2607, 2608, 2629, 2708, 2748, 2789, 2801, 2816, 2823, 2839, 2952, 2993, 3145, 3311, 3395, 3453, 3462, 3477, 3487, 3495, 3514, 3626, 3636, 3658, 3663, 3698, 3705, 3782, 3800, 3861, 3902, 3919, 3939, 3945, 3989, 4010, 4094, 4108, 4145, 4237, 4308, 4311, 4326, 4427, 4428, 4465, 4469, 4474, 4564, 4573, 4611, 4617, 4622, 4624, 4649, 4658, 4692, 4792, 4835, 4883, 4974, 5084, 5097, 5136, 5165, 5174, 5200, 5223, 5226, 5373, 5393, 5477, 5494, 5556, 5676, 5680, 5718, 5825, 5827, 5936, 5953, 5959, 5972, 6033, 6077, 6111
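
The same tally can be written more compactly with list comprehensions. The sketch below works on a small made-up list of token lists. With the DataFrame, the iterable would be `user_prepro_df['usr_description_tokens']`, assuming every cell holds a Python list.

```python
# Made-up token lists standing in for the tokens column (illustrative only)
token_lists = [["ai", "founder"], [""], ["", "coder"], []]

# Rows whose only token is the empty string
only_empty = [i for i, tokens in enumerate(token_lists) if tokens == [""]]
# Rows that contain the empty string alongside real tokens
mixed = [i for i, tokens in enumerate(token_lists)
         if "" in tokens and tokens != [""]]

problem_rows = len(only_empty) + len(mixed)
print(only_empty, mixed, problem_rows)
```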

In [8]:
user_prepro_df.iloc[12]

usr_userid                                                           38374100
usr_description             #Cloudtechnology / #web3 / #softwareengineerin...
usr_description_hashtags    ['Cloudtechnology', 'web3', 'softwareengineeri...
usr_description_mentions                                                   []
usr_description_URLs                                                       []
usr_description_emoji                                                     NaN
usr_description_clean                                                        
usr_description_tokens                                                     []
Name: 12, dtype: object

In [None]:
#After inspecting these rows it can be concluded that there is no need to further dig into the problem as many of these rows became empty due to unrecognizable symbols, emojis or hashtags being used in the user description

## Tweets Data

The current section showcases how the preprocessed tweets data was concatenated and stored as one. 

In [6]:
#Setting the path for joining multiple files
files = os.path.join("D:/MSC/tweets_tokanized/", "*.csv")

#List of files to be merged
files = glob.glob(files)
print(files)

#Joining the files
tweets_tokanized_all = pd.concat(map(pd.read_csv, files), ignore_index=True)
tweets_tokanized_all.head(3)

['D:/MSC/tweets_tokanized\\0_to_399999_processed_tweets.csv', 'D:/MSC/tweets_tokanized\\1100000_to_1640145_processed_tweets (1).csv', 'D:/MSC/tweets_tokanized\\1100000_to_1640145_processed_tweets.csv', 'D:/MSC/tweets_tokanized\\330000_to_330999_processed_tweets (1).csv', 'D:/MSC/tweets_tokanized\\330000_to_330999_processed_tweets.csv', 'D:/MSC/tweets_tokanized\\333000_to_333179_processed_tweets (1).csv', 'D:/MSC/tweets_tokanized\\333000_to_333179_processed_tweets.csv', 'D:/MSC/tweets_tokanized\\333070_to_333179_processed_tweets.csv', 'D:/MSC/tweets_tokanized\\333200_to_399999_processed_tweets(3).csv', 'D:/MSC/tweets_tokanized\\333200_to_399999_processed_tweets.csv', 'D:/MSC/tweets_tokanized\\333200_to_399999_processed_tweetscontent (2).csv', 'D:/MSC/tweets_tokanized\\400000_to_799999_processed_tweets.csv', 'D:/MSC/tweets_tokanized\\800000_to_1099999_processed_tweets (2).csv']


Unnamed: 0,tweet_id,tweet_content,tweet_content_hashtags,tweet_content_mentions,tweet_content_URLs,tweet_content_emoji,tweet_content_clean,tweet_content_tokens
0,1598014056790622225,ChatGPT: Optimizing Language Models for Dialog...,[],['OpenAI'],['https://t.co/K9rKRygYyn'],,chatgpt optimizing language models for dialogue,"['chatgpt', 'optimize', 'model', 'language', '..."
1,1598014522098208769,"Try talking with ChatGPT, our new AI system wh...",[],[],['https://t.co/sHDm57g3Kr'],,try talking with chatgpt our new ai system whi...,"['ai', 'optimize', 'feedback', 'talk', 'improv..."
2,1598015627540635648,"Just launched ChatGPT, our new AI system which...",[],[],"['https://t.co/ArX6m0FfLE.', 'https://t.co/YM1...",,just launched chatgpt our new ai system which ...,"['ai', 'optimize', 'launch', 'chatgpt', 'new',..."


In [None]:
#Define function for verb removal
def verb_removal(tokens_list):
    nlp_object = nlp(' '.join(tokens_list))
    for element in nlp_object:
        if element.pos_ == 'VERB' and str(element) in tokens_list:
            tokens_list.remove(str(element))
                
    return tokens_list
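
Because `tokens_list.remove(...)` modifies the list it is given and the function returns that same list, a column of Python lists passed through this pattern loses its verbs in place, so the source column and the new column end up sharing the same objects. The sketch below shows the difference from a copying variant. The `VERBS` set is a hypothetical stand-in for spaCy's part-of-speech check, used only so the example runs without a language model.

```python
# Hypothetical stand-in for spaCy's part-of-speech check (illustrative only)
VERBS = {"optimize", "launch"}

def remove_verbs_in_place(tokens):
    """Mirrors the pattern above: mutates and returns the same list."""
    for token in list(tokens):  # iterate over a copy while removing
        if token in VERBS:
            tokens.remove(token)
    return tokens

def remove_verbs_copy(tokens):
    """Builds a new list and leaves the input untouched."""
    return [token for token in tokens if token not in VERBS]

original = ["chatgpt", "optimize", "model"]
copied = remove_verbs_copy(original)
snapshot_after_copy = list(original)   # input is unchanged at this point
mutated = remove_verbs_in_place(original)
print(copied, snapshot_after_copy, original, mutated is original)
```

If the tokens with verbs should remain available, a copying variant along these lines would keep both versions distinct.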

In [None]:
#Remove verbs
tweets_tokanized_all['tweet_content_tokens_no_verbs'] = tweets_tokanized_all['tweet_content_tokens'].apply(lambda x: verb_removal(x))

In [7]:
#In the preprocessing phase, some tweets may have been cleaned and stored more than once, so these duplicates are dropped
#TODO:Remove comment below to save the tweet data as one
#tweets_tokanized_all.drop_duplicates(subset="tweet_id").to_csv("Tweets_PreLDA.csv", index=False)

In [2]:
#Read data from recently saved file
tweets_tokanized = pd.read_csv("Tweets_PreLDA.csv")
tweets_tokanized.shape

(1640046, 10)

In [3]:
tweets_tokanized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1640046 entries, 0 to 1640045
Data columns (total 10 columns):
 #   Column                         Non-Null Count    Dtype 
---  ------                         --------------    ----- 
 0   Unnamed: 0                     1640046 non-null  int64 
 1   tweet_id                       1640046 non-null  int64 
 2   tweet_content                  1640046 non-null  object
 3   tweet_content_hashtags         1640046 non-null  object
 4   tweet_content_mentions         1640039 non-null  object
 5   tweet_content_URLs             1640039 non-null  object
 6   tweet_content_emoji            211254 non-null   object
 7   tweet_content_clean            1639947 non-null  object
 8   tweet_content_tokens           1640039 non-null  object
 9   tweet_content_tokens_no_verbs  1640046 non-null  object
dtypes: int64(2), object(8)
memory usage: 125.1+ MB
