# Preprocessing

## Data Cleaning
- parse date - DONE

## Text Preprocessing

- case folding (lowercase)
- remove stopwords / create stopword list
- remove punctuation? but social media data might need it
- change emoji to words
- remove urls

### MICROTEXT NORMALISATION ?? (not done here)
- substitution/thesauri lists/define equivalence classes
- spell correction?? -> abbreviations in social media data
  - e.g. c u l8r -> see you later
- normalise date forms
- lemmatisation - to be done during indexing
- stemming - to be done during indexing
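
A minimal sketch of the substitution-list idea, assuming a small hand-made lookup table. `SLANG` and `normalise_microtext` are hypothetical names introduced here for illustration; only the "c u l8r" example comes from the notes above, and this step is not part of the pipeline in this notebook.

```python
# Hypothetical equivalence classes: each slang token maps to its expansion.
# Only the "c u l8r" example comes from the notes; a real list would be far larger.
SLANG = {
    "c": "see",
    "u": "you",
    "l8r": "later",
}


def normalise_microtext(text, table=SLANG):
    """Replace whole whitespace-delimited tokens that appear in `table`."""
    tokens = text.split()
    # Look up the lowercased token; unknown tokens are returned unchanged.
    return " ".join(table.get(tok.lower(), tok) for tok in tokens)


print(normalise_microtext("c u l8r"))
```

Matching whole tokens avoids rewriting letters inside longer words, but single-letter entries such as "c" can still misfire on legitimate text (for example "vitamin c"), so a real list would probably need context rules or a more conservative vocabulary.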

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("Raw datasets/tweets_raw.csv")

df

Unnamed: 0,url,datetime,text,tweet_id,username,retweet_count,like_count
0,https://twitter.com/nickytwoeyes/status/163928...,2023-03-24 23:28:42+08:00,@Cryptowizardd77 @crypto_rand @elonmusk @Hobbe...,1639288136076328960,nickytwoeyes,0,0
1,https://twitter.com/Dr_Bed_Dr/status/163928813...,2023-03-24 23:28:42+08:00,@MarkusWoat @elonmusk Hurrah,1639288133232590852,Dr_Bed_Dr,0,0
2,https://twitter.com/lill63416788/status/163928...,2023-03-24 23:28:41+08:00,@cb_doge @elonmusk wow that's so amazing 2look...,1639288131819302913,lill63416788,0,0
3,https://twitter.com/starflower1959/status/1639...,2023-03-24 23:28:41+08:00,@elonmusk @BillyM2k Hmmm …how about Australia?,1639288128962715648,starflower1959,0,0
4,https://twitter.com/DBrubaker13/status/1639288...,2023-03-24 23:28:40+08:00,@jayinneveh @williamlegate @elonmusk The only ...,1639288127079546880,DBrubaker13,0,0
...,...,...,...,...,...,...,...
9995,https://twitter.com/alexandre_lores/status/163...,2023-03-24 21:24:15+08:00,We are damaging the environment. But not by us...,1639256815354404865,alexandre_lores,3,14
9996,https://twitter.com/ashutos07601960/status/163...,2023-03-24 21:24:14+08:00,@elonmusk Yas😊😊 very good,1639256812741357571,ashutos07601960,0,0
9997,https://twitter.com/pates_karbo/status/1639256...,2023-03-24 21:24:13+08:00,"@runews @elonmusk ""became aware"" you, nasty pu...",1639256808979337216,pates_karbo,0,0
9998,https://twitter.com/IIuffy/status/163925680567...,2023-03-24 21:24:13+08:00,@luffysmayie Elon Musk sucks so bad,1639256805673975808,IIuffy,0,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   url            10000 non-null  object
 1   datetime       10000 non-null  object
 2   text           10000 non-null  object
 3   tweet_id       10000 non-null  int64 
 4   username       10000 non-null  object
 5   retweet_count  10000 non-null  int64 
 6   like_count     10000 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 547.0+ KB


## Data Cleaning

### Parse date

`datetime` was already recorded in Singapore time (UTC+08:00) during scraping, but we want to remove the "+08:00" offset from the timestamp.

In [4]:
# convert "datetime" values from object to datetime64[ns, UTC]
df["datetime"] = pd.to_datetime(df["datetime"], utc=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   url            10000 non-null  object             
 1   datetime       10000 non-null  datetime64[ns, UTC]
 2   text           10000 non-null  object             
 3   tweet_id       10000 non-null  int64              
 4   username       10000 non-null  object             
 5   retweet_count  10000 non-null  int64              
 6   like_count     10000 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(3), object(3)
memory usage: 547.0+ KB


In [5]:
# after utc=True the values are stored in UTC, i.e. 8 hours behind Singapore time,
# so convert back to Singapore time first, then drop the timezone info.
# this removes "+08:00" while keeping the Singapore wall-clock time
df["datetime"] = df["datetime"].dt.tz_convert("Asia/Singapore").dt.tz_localize(None)

df.head()

Unnamed: 0,url,datetime,text,tweet_id,username,retweet_count,like_count
0,https://twitter.com/nickytwoeyes/status/163928...,2023-03-24 23:28:42,@Cryptowizardd77 @crypto_rand @elonmusk @Hobbe...,1639288136076328960,nickytwoeyes,0,0
1,https://twitter.com/Dr_Bed_Dr/status/163928813...,2023-03-24 23:28:42,@MarkusWoat @elonmusk Hurrah,1639288133232590852,Dr_Bed_Dr,0,0
2,https://twitter.com/lill63416788/status/163928...,2023-03-24 23:28:41,@cb_doge @elonmusk wow that's so amazing 2look...,1639288131819302913,lill63416788,0,0
3,https://twitter.com/starflower1959/status/1639...,2023-03-24 23:28:41,@elonmusk @BillyM2k Hmmm …how about Australia?,1639288128962715648,starflower1959,0,0
4,https://twitter.com/DBrubaker13/status/1639288...,2023-03-24 23:28:40,@jayinneveh @williamlegate @elonmusk The only ...,1639288127079546880,DBrubaker13,0,0
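
The timezone handling above hinges on the difference between an instant and its wall-clock reading. The sketch below uses only the standard library `datetime` module (not pandas) on the first timestamp from the raw data to show why stripping the timezone after a UTC conversion loses 8 hours, while stripping it directly keeps Singapore time. The variable names are illustrative.

```python
from datetime import datetime, timezone

raw = "2023-03-24 23:28:42+08:00"  # first timestamp in the raw data

aware = datetime.fromisoformat(raw)      # timezone-aware, keeps the +08:00 offset
as_utc = aware.astimezone(timezone.utc)  # same instant, expressed in UTC

# Dropping tzinfo keeps whatever wall-clock time the object currently shows.
naive_sgt = aware.replace(tzinfo=None)
naive_utc = as_utc.replace(tzinfo=None)

print(naive_sgt)  # 2023-03-24 23:28:42
print(naive_utc)  # 2023-03-24 15:28:42
```

The two naive values differ by exactly the 8-hour offset, which is why the timezone should only be dropped while the values are expressed in Singapore time.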


## Text Preprocessing

From raw twitter data:
- number of records = 10000
- number of words = 242098
- number of unique words = 24523

In [6]:
# # download missing resource
# import nltk
# nltk.download("punkt")

from nltk.tokenize import word_tokenize

In [7]:
# combine all records in "text" column
text_combined = " ".join(df["text"])

# tokenize combined text
tokens = word_tokenize(text_combined)

# no. of words
len(tokens)

242098

In [8]:
# no. of unique words
unique_tokens = set(tokens)

len(unique_tokens)

24523
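
Beyond the two totals, a frequency table can show which tokens dominate the raw text. The sketch below is a rough illustration on three tweets copied from the preview above. `simple_tokenize` is a hypothetical regex stand-in so the example runs without NLTK resources, and it will not reproduce the `word_tokenize` counts exactly.

```python
import re
from collections import Counter

# Toy stand-in for df["text"]; the strings are taken from the preview above.
sample_texts = [
    "@MarkusWoat @elonmusk Hurrah",
    "@elonmusk @BillyM2k Hmmm …how about Australia?",
    "@luffysmayie Elon Musk sucks so bad",
]


def simple_tokenize(text):
    """Rough tokenizer: runs of word characters, or single non-space symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = [tok for text in sample_texts for tok in simple_tokenize(text)]
counts = Counter(tokens)

print(len(tokens), len(counts))  # total tokens vs unique tokens
print(counts.most_common(3))     # most frequent tokens first
```

With this tokenizer, symbols such as "@" are counted as tokens in their own right and case variants stay distinct, which is one reason raw counts can be much higher than counts taken after cleaning.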

### Clean Text
Steps, in the order they are applied in `preprocess_text`:
1. case folding (lowercase)
2. change emoji and emoticons into words
3. remove urls
4. remove punctuation
5. remove stopwords

In [9]:
import re
import string

# # download missing resource
# import nltk
# nltk.download("stopwords")

from nltk.corpus import stopwords
from emot.emo_unicode import UNICODE_EMOJI  # for emojis
from emot.emo_unicode import EMOTICONS_EMO  # for emoticons

In [10]:
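# Converting emojis to words
def convert_emojis(text):
    # checks one character at a time, so emoji made of several code points
    # are not matched as a single unit
    for char in text:
        if char in UNICODE_EMOJI:
            # e.g. ":smiling_face:" -> "smiling_face"; drop ":" and ",", join words with "_"
            name = "_".join(UNICODE_EMOJI[char].replace(",", "").replace(":", "").split())
            text = text.replace(char, f" {name}")

    return text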


# Converting emoticons to words    
def convert_emoticons(text):
    for i in text.split():
        if i in EMOTICONS_EMO.keys():
            text = text.replace(i, "_".join(EMOTICONS_EMO[i].replace(",","").split()))
    return text


# Function for removing urls
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)


# custom punctuation
punctuations = string.punctuation + "´‘’“”…–€"
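
One thing to watch with emoticon lookup is case: in this notebook the text is lowercased before the emoticon step, so an emoticon such as `:D` becomes `:d` and would no longer match an upper-case entry. The sketch below illustrates this with a hypothetical two-entry table standing in for `EMOTICONS_EMO` (its real contents are not shown here, so the entries are assumptions), and a hypothetical token-wise variant of the converter.

```python
# Hypothetical two-entry stand-in for emot's EMOTICONS_EMO mapping.
EMOTICONS = {
    ":D": "Laughing, big grin",
    ":)": "Happy face or smiley",
}


def convert_emoticons_tokenwise(text, table=EMOTICONS):
    """Replace only whole tokens that exactly match an emoticon."""
    out = []
    for tok in text.split():
        if tok in table:
            # Drop commas and join the description words with underscores.
            tok = "_".join(table[tok].replace(",", "").split())
        out.append(tok)
    return " ".join(out)


text = "Great launch :D"
print(convert_emoticons_tokenwise(text).lower())  # emoticons first, then lowercase
print(convert_emoticons_tokenwise(text.lower()))  # lowercase first: ":d" is missed
```

Rebuilding the string token by token also means a match can never rewrite part of a longer token, at the cost of collapsing runs of whitespace into single spaces.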

In [11]:
# build the stopword set once; set membership is much faster than
# calling stopwords.words('english') again for every word
stop_words = set(stopwords.words('english'))


# making a text-cleaning function
def preprocess_text(text):

    # convert to lowercase
    cleaned_text = text.lower()

    # convert emoji into words
    cleaned_text = convert_emojis(cleaned_text)

    # convert emoticons into words
    cleaned_text = convert_emoticons(cleaned_text)

    # remove urls
    cleaned_text = remove_urls(cleaned_text)

    # remove punctuation
    nopunc = ''.join(char for char in cleaned_text if char not in punctuations)

    # remove stopwords (the text is already lowercase at this point)
    clean_words = [word for word in nopunc.split() if word not in stop_words]

    # return cleaned text
    return ' '.join(clean_words)

In [12]:
df["cleaned_text"] = df["text"].apply(preprocess_text)

df

Unnamed: 0,url,datetime,text,tweet_id,username,retweet_count,like_count,cleaned_text
0,https://twitter.com/nickytwoeyes/status/163928...,2023-03-24 23:28:42,@Cryptowizardd77 @crypto_rand @elonmusk @Hobbe...,1639288136076328960,nickytwoeyes,0,0,cryptowizardd77 cryptorand elonmusk hobbeseth ...
1,https://twitter.com/Dr_Bed_Dr/status/163928813...,2023-03-24 23:28:42,@MarkusWoat @elonmusk Hurrah,1639288133232590852,Dr_Bed_Dr,0,0,markuswoat elonmusk hurrah
2,https://twitter.com/lill63416788/status/163928...,2023-03-24 23:28:41,@cb_doge @elonmusk wow that's so amazing 2look...,1639288131819302913,lill63416788,0,0,cbdoge elonmusk wow thats amazing 2look highly...
3,https://twitter.com/starflower1959/status/1639...,2023-03-24 23:28:41,@elonmusk @BillyM2k Hmmm …how about Australia?,1639288128962715648,starflower1959,0,0,elonmusk billym2k hmmm australia
4,https://twitter.com/DBrubaker13/status/1639288...,2023-03-24 23:28:40,@jayinneveh @williamlegate @elonmusk The only ...,1639288127079546880,DBrubaker13,0,0,jayinneveh williamlegate elonmusk one appears mad
...,...,...,...,...,...,...,...,...
9995,https://twitter.com/alexandre_lores/status/163...,2023-03-24 21:24:15,We are damaging the environment. But not by us...,1639256815354404865,alexandre_lores,3,14,damaging environment using much energy using e...
9996,https://twitter.com/ashutos07601960/status/163...,2023-03-24 21:24:14,@elonmusk Yas😊😊 very good,1639256812741357571,ashutos07601960,0,0,elonmusk yas smilingfacewithsmilingeyes smilin...
9997,https://twitter.com/pates_karbo/status/1639256...,2023-03-24 21:24:13,"@runews @elonmusk ""became aware"" you, nasty pu...",1639256808979337216,pates_karbo,0,0,runews elonmusk became aware nasty pulitzer
9998,https://twitter.com/IIuffy/status/163925680567...,2023-03-24 21:24:13,@luffysmayie Elon Musk sucks so bad,1639256805673975808,IIuffy,0,0,luffysmayie elon musk sucks bad
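
In the output above, the emoji name appears as one fused word (`smilingfacewithsmilingeyes`) because `_` is part of `string.punctuation` and is removed along with everything else. The sketch below shows, on an illustrative string, how `str.translate` can strip punctuation in a single pass and how the underscore could be kept if readable emoji names are preferred. The table names are hypothetical and this is an optional variation, not what the notebook does.

```python
import string

# Same custom punctuation set as in the notebook.
punctuations = string.punctuation + "´‘’“”…–€"

# Translation tables that delete the listed characters in one pass.
strip_all = str.maketrans("", "", punctuations)
strip_keep_underscore = str.maketrans("", "", punctuations.replace("_", ""))

# Illustrative input: text as it might look after emoji conversion.
text = "yas smiling_face_with_smiling_eyes, very good!"

print(text.translate(strip_all))
print(text.translate(strip_keep_underscore))
```

Keeping the underscore would also keep it inside usernames such as `cb_doge`, so the choice affects more than emoji names.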


In [13]:
# # export cleaned data to csv
# df.to_csv("Cleaned datasets/tweets_cleaned.csv", index=False)

### Count tokens in cleaned_text

From cleaned twitter data:
- number of records = 10000
- number of words = 115094
- number of unique words = 19139

In [14]:
# combine all records in "cleaned_text" column
text_combined = " ".join(df["cleaned_text"])

# tokenize combined text
tokens = word_tokenize(text_combined)

# no. of words
len(tokens)

115094

In [15]:
# no. of unique words
unique_tokens = set(tokens)

len(unique_tokens)

19139
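
To put the before and after counts side by side, the short calculation below works out the relative reduction from the figures reported in this notebook. `pct_reduction` is a helper introduced here for illustration.

```python
# Counts reported in this notebook (raw text vs cleaned_text).
raw_tokens, raw_unique = 242098, 24523
clean_tokens, clean_unique = 115094, 19139


def pct_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100 * (before - after) / before


print(f"words:        {pct_reduction(raw_tokens, clean_tokens):.1f}% fewer")
print(f"unique words: {pct_reduction(raw_unique, clean_unique):.1f}% fewer")
```

Cleaning removes a little over half of all tokens but only roughly a fifth of the vocabulary, which is consistent with stopwords and punctuation being frequent but few in kind.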