# Project: Perform Text Pre-processing for Tweets

You are given a dataset with about 1000 English tweets on Singapore (`singapore.csv`) obtained for the period of Jan 2022 to Mar 2022. Your task is to **produce a function to perform text preprocessing of the text in the tweets**.

As mentioned in class, text pre-processing focuses on preparing the data in a more general form so that patterns can be derived from it. Doing so often improves the performance of machine learning (ML) models significantly.

This step is especially important for tweets, which tend to contain many *out-of-vocabulary* tokens such as *Twitter user handles*, *numbers*, *URLs*, *emoticons*, etc. Without text pre-processing, ML models have a much harder time generalizing from the data.
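
As a rough illustration of why generalizing such tokens helps, the sketch below counts distinct whitespace-separated tokens before and after replacing handles and URLs with placeholder tokens. The two tweets are made up for this example, and `generalize` is a hypothetical helper, not part of the assignment.

```python
import re

# Two made-up tweets that differ only in the handle and the URL.
sample_tweets = [
    "@alice check https://t.co/abc",
    "@bob check https://t.co/xyz",
]

def generalize(tweet):
    """Replace handles and URLs with placeholder tokens (illustrative only)."""
    tweet = re.sub(r"@\w+", "TWITTER_USERNAME", tweet)
    return re.sub(r"https?://\S+", "TWITTER_URL", tweet)

# Vocabulary = set of distinct whitespace-separated tokens.
raw_vocab = {tok for t in sample_tweets for tok in t.split()}
gen_vocab = {tok for t in sample_tweets for tok in generalize(t).split()}

print(len(raw_vocab), len(gen_vocab))
```

After generalization the two tweets become identical, so a model sees one pattern instead of two unrelated ones.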

One useful NLP application on tweets is a text classifier that decides whether a tweet belongs under a category such as *singapore*, *sports*, or *business*.

In [1]:
#load in the necessary packages
import pandas as pd
import re
import unicodedata
import nltk
import spacy
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

In [2]:
#read the csv file into a pandas dataframe
data = pd.read_csv("singapore.csv")
data

Unnamed: 0,created_at,id,id_str,full_text,truncated,source,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,...,supplemental_language,self_thread,ext_edit_control,ext,extended_entities,quoted_status_id,quoted_status_id_str,quoted_status_permalink,conversation_control,limited_actions
0,Wed Mar 30 23:51:10 +0000 2022,1509317341863035000,1509317341863034882,https://t.co/Fy7GnSdvsr\nXu said that #HongKon...,False,"<a href=""https://mobile.twitter.com"" rel=""nofo...",1.509196e+18,1.509196e+18,4.244077e+07,4.244077e+07,...,,"{""id"":1509192725148549000,""id_str"":""1509192725...","{""initial"":{""edit_tweet_ids"":[""150931734186303...","{""editControl"":{""r"":{""ok"":{""initial"":{""editTwe...",,,,,,
1,Wed Mar 30 23:56:21 +0000 2022,1509318644194582500,1509318644194582532,U.S. President Joe Biden is highlighting what ...,False,"<a href=""http://twitter.com/download/iphone"" r...",,,,,...,,,"{""initial"":{""edit_tweet_ids"":[""150931864419458...","{""unmentionInfo"":{""r"":{""ok"":{}},""ttl"":-1},""sup...",,,,,,
2,Wed Mar 30 23:52:44 +0000 2022,1509317736786448400,1509317736786448386,Spot-on. But dare you spot the real Catriona.\...,False,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,,,...,,,"{""initial"":{""edit_tweet_ids"":[""150931773678644...","{""unmentionInfo"":{""r"":{""ok"":{}},""ttl"":-1},""edi...",,,,,,
3,Wed Mar 30 23:51:59 +0000 2022,1509317546071437300,1509317546071437317,@Phuket_Sammy At least you can tell that Singa...,False,"<a href=""https://mobile.twitter.com"" rel=""nofo...",1.509315e+18,1.509315e+18,2.607660e+09,2.607660e+09,...,,,"{""initial"":{""edit_tweet_ids"":[""150931754607143...","{""unmentionInfo"":{""r"":{""ok"":{}},""ttl"":-1},""sup...",,,,,,
4,Wed Mar 30 23:58:02 +0000 2022,1509319069996302300,1509319069996302336,The Economist 12 March 2022 rated the top 3 'C...,False,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,,,...,,,"{""initial"":{""edit_tweet_ids"":[""150931906999630...","{""editControl"":{""r"":{""ok"":{""initial"":{""editTwe...",,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1008,Wed Mar 30 16:28:05 +0000 2022,1509205836278775800,1509205836278775819,@AlYap73961573 @US_FDA @DrCaliff_FDA @ashishkj...,False,"<a href=""http://twitter.com/download/iphone"" r...",1.509201e+18,1.509201e+18,1.299724e+18,1.299724e+18,...,,,"{""initial"":{""edit_tweet_ids"":[""150920583627877...","{""unmentionInfo"":{""r"":{""ok"":{}},""ttl"":-1},""sup...",,,,,,
1009,Wed Mar 30 15:35:27 +0000 2022,1509192591249752000,1509192591249752072,"Immigration is America's superpower, so this s...",False,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,,,,...,,"{""id"":1509192591249752000,""id_str"":""1509192591...","{""initial"":{""edit_tweet_ids"":[""150919259124975...","{""unmentionInfo"":{""r"":{""ok"":{}},""ttl"":-1},""edi...","{""media"":[{""id"":1509190425369796600,""id_str"":""...",,,,,
1010,Wed Mar 30 16:25:02 +0000 2022,1509205067609518000,1509205067609518083,JANUARY 2022:\nBLACK LIZARD (1968)\nBRAINDEAD ...,False,"<a href=""https://mobile.twitter.com"" rel=""nofo...",1.509188e+18,1.509188e+18,1.115341e+18,1.115341e+18,...,,"{""id"":1509187536408653800,""id_str"":""1509187536...","{""initial"":{""edit_tweet_ids"":[""150920506760951...","{""unmentionInfo"":{""r"":{""ok"":{}},""ttl"":-1},""sup...","{""media"":[{""id"":1509204994708308000,""id_str"":""...",,,,,
1011,Wed Mar 30 16:27:06 +0000 2022,1509205589712597000,1509205589712596992,210929 - NINGNING aespa for Vogue Singapore O...,False,"<a href=""http://twitter.com/download/android"" ...",,,,,...,,,"{""initial"":{""edit_tweet_ids"":[""150920558971259...","{""superFollowMetadata"":{""r"":{""ok"":{}},""ttl"":-1...","{""media"":[{""id"":1509205576970309600,""id_str"":""...",,,,,


In [3]:
#access the tweets by accessing the full_text column in the pandas dataframe
tweets = data["full_text"]
tweets

0       https://t.co/Fy7GnSdvsr\nXu said that #HongKon...
1       U.S. President Joe Biden is highlighting what ...
2       Spot-on. But dare you spot the real Catriona.\...
3       @Phuket_Sammy At least you can tell that Singa...
4       The Economist 12 March 2022 rated the top 3 'C...
                              ...                        
1008    @AlYap73961573 @US_FDA @DrCaliff_FDA @ashishkj...
1009    Immigration is America's superpower, so this s...
1010    JANUARY 2022:\nBLACK LIZARD (1968)\nBRAINDEAD ...
1011    210929 -  NINGNING aespa for Vogue Singapore O...
1012    🧍🏻‍♀️ going back to singapore i’m so excited t...
Name: full_text, Length: 1013, dtype: object

## Helper Functions

In this section, we will provide some helper functions that we will use later.

In [4]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
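
A quick check of what this helper does, as a self-contained sketch with a made-up sample string: NFKD normalization splits an accented letter into a base letter plus a combining mark, and the ASCII encoding step then drops every character that is not ASCII. That includes emoji and typographic punctuation such as the curly apostrophe, not only accents.

```python
import unicodedata

def remove_accented_chars(text):
    # Same logic as the helper above, repeated so the sketch runs on its own.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")

sample = "café 🤣 Singapore’s"  # made-up sample text
cleaned = remove_accented_chars(sample)
print(repr(cleaned))
```

Note that the emoji disappears without a replacement, so the two spaces around it remain next to each other.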

In [5]:
#stemming using Porter stemmer
def stem_sentence_porter(sentence):
    stemmer = PorterStemmer()
    words = nltk.word_tokenize(sentence)
    stemmed_sentence = " ".join(stemmer.stem(w) for w in words)
    return stemmed_sentence

In [6]:
#lemmatization using spacy
nlp = spacy.load("en_core_web_sm")
def lemmatize_sentence(sentence):
    doc = nlp(sentence)
    lemmatized_sentence = " ".join([token.lemma_ for token in doc])
    return lemmatized_sentence

In [7]:
def replace_num(tweet):
    tweet = re.sub(r"\b\d+\.\d+\b", "NUM", tweet)
    tweet = re.sub(r"\b\d+", "NUM", tweet)
    return tweet

In [8]:
print(tweets[39])
replace_num(tweets[39])

#SIA256 9V-SHT Airbus A350 941 Singapore Airlines: 3.7 km away @ 861.1 m and 13.1° frm hrzn, heading SW @ 390.2km/h 09:47:19. #OverBrisbane #BNE #BRISBANE #ADSB #dump1090 https://t.co/mY4oPiqxHd


'#SIA256 NUMV-SHT Airbus A350 NUM Singapore Airlines: NUM km away @ NUM m and NUM° frm hrzn, heading SW @ NUM.NUMkm/h NUM:NUM:NUM. #OverBrisbane #BNE #BRISBANE #ADSB #dump1090 https://t.co/mY4oPiqxHd'
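
The output above shows two side effects of the patterns in `replace_num()`: the registration `9V-SHT` becomes `NUMV-SHT`, and `390.2km/h` becomes `NUM.NUMkm/h`. Whether this matters depends on the downstream task. If identifiers and values glued to a unit should be left untouched, one possible variant uses lookarounds so that only free-standing numbers are replaced. This is a sketch under that assumption; `replace_num_strict` is a hypothetical name, not part of the assignment.

```python
import re

def replace_num_strict(text):
    """Replace only free-standing integers and decimals with NUM.

    (?<![\\w.])  : not preceded by a word character or a dot
    (?!\\.?\\w)  : not followed by a word character, with or without a dot in between
    """
    return re.sub(r"(?<![\w.])\d+(?:\.\d+)?(?!\.?\w)", "NUM", text)

print(replace_num_strict("9V-SHT Airbus A350 941"))
print(replace_num_strict("3.7 km away @ 390.2km/h"))
```

The trade-off is that values such as `390.2km/h` are no longer normalized at all, so this variant suits cases where such tokens are handled by a separate rule.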

#### ❓Task 1. Provide the code for `get_hashtags()`, which takes in a tweet and returns a tuple `(tweet_without_hash_tags, list_of_hash_tags_in_lowercase)`

In [20]:
def get_hashtags(tweet):
    patt = r"(#\w+)" #to fill in (raw string, so that \w is not treated as a string escape)
    hash_tags = re.findall(patt, tweet.lower())
    tweet_without_hashtag = re.sub(patt, "", tweet).strip()
    return (tweet_without_hashtag, hash_tags)

In [21]:
print(tweets[39])
get_hashtags(tweets[39])

#SIA256 9V-SHT Airbus A350 941 Singapore Airlines: 3.7 km away @ 861.1 m and 13.1° frm hrzn, heading SW @ 390.2km/h 09:47:19. #OverBrisbane #BNE #BRISBANE #ADSB #dump1090 https://t.co/mY4oPiqxHd


('9V-SHT Airbus A350 941 Singapore Airlines: 3.7 km away @ 861.1 m and 13.1° frm hrzn, heading SW @ 390.2km/h 09:47:19.      https://t.co/mY4oPiqxHd',
 ['#sia256', '#overbrisbane', '#bne', '#brisbane', '#adsb', '#dump1090'])
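
Removing each hashtag leaves its surrounding spaces behind, which is why the output above contains a run of spaces before the URL. If that is unwanted, the runs can be collapsed afterwards. The sketch below assumes this is desired; `collapse_spaces` is a hypothetical helper name.

```python
import re

def collapse_spaces(text):
    """Collapse runs of two or more spaces into one and trim both ends."""
    return re.sub(r" {2,}", " ", text).strip()

# Made-up example: strip the hashtags first, then tidy the spacing.
without_tags = re.sub(r"#\w+", "", "a #x #y b")
print(repr(without_tags))
print(repr(collapse_spaces(without_tags)))
```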

#### ❓Task 2. Provide the code for `remove_newlines()`, which takes in a tweet and replaces each run of one or more newlines with a single space

In [27]:
def remove_newlines(tweet):
    return re.sub("\n+", " ", tweet) #to fill in pattern

In [23]:
print(tweets[8])
tweets[8]

Spot-on. But dare you spot the real Catriona.

Two years after it was announced, the wax figure of Miss Universe 2018 Catriona Gray was formally unveiled in Madame Tussauds Singapore.

Read more:
https://t.co/5PP3jIO5U4 

#CatrionaGray 
#DailyTribune


'Spot-on. But dare you spot the real Catriona.\n\nTwo years after it was announced, the wax figure of Miss Universe 2018 Catriona Gray was formally unveiled in Madame Tussauds Singapore.\n\nRead more:\nhttps://t.co/5PP3jIO5U4 \n\n#CatrionaGray \n#DailyTribune'

In [26]:
remove_newlines(tweets[8])

'Spot-on. But dare you spot the real Catriona. Two years after it was announced, the wax figure of Miss Universe 2018 Catriona Gray was formally unveiled in Madame Tussauds Singapore. Read more: https://t.co/5PP3jIO5U4 #CatrionaGray #DailyTribune'
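
A pattern that matches only newline characters keeps any space that sits directly before or after the newline, so a line ending in a trailing space would produce two consecutive spaces. One possible variant also consumes the whitespace around the line break. This is a sketch, and `remove_newlines_trim` is a hypothetical name.

```python
import re

def remove_newlines_trim(tweet):
    """Replace each line break, plus any whitespace around it, with one space."""
    return re.sub(r"\s*\n\s*", " ", tweet)

sample = "Read more:\nhttps://t.co/x \n\n#Tag"  # made-up sample with a trailing space
print(repr(remove_newlines_trim(sample)))
```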

====

#### ❓Task 3. Provide the code for `replace_year()`, which takes in a tweet and replaces all year tokens (e.g. `2018`) with a special token (`YEAR`)

In [30]:
def replace_year(tweet):
    # \b is a word boundary, so four digits inside a longer number (e.g. 2014 in 20145) are not matched
    return re.sub(r"\b\d{4}\b", "YEAR", tweet) #to fill in pattern

In [31]:
print(tweets[8])
replace_year(tweets[8])

Spot-on. But dare you spot the real Catriona.

Two years after it was announced, the wax figure of Miss Universe 2018 Catriona Gray was formally unveiled in Madame Tussauds Singapore.

Read more:
https://t.co/5PP3jIO5U4 

#CatrionaGray 
#DailyTribune


'Spot-on. But dare you spot the real Catriona.\n\nTwo years after it was announced, the wax figure of Miss Universe YEAR Catriona Gray was formally unveiled in Madame Tussauds Singapore.\n\nRead more:\nhttps://t.co/5PP3jIO5U4 \n\n#CatrionaGray \n#DailyTribune'
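
The pattern `\b\d{4}\b` treats every free-standing four-digit number as a year, including values such as a flight number. If that turns out to be too broad for the data, one option is to restrict the match to a plausible range. The sketch below assumes years from 1900 to 2099; both the range and the name `replace_year_plausible` are assumptions made for illustration.

```python
import re

def replace_year_plausible(tweet):
    """Replace four-digit tokens starting with 19 or 20 with YEAR."""
    return re.sub(r"\b(?:19|20)\d{2}\b", "YEAR", tweet)

print(replace_year_plausible("Miss Universe 2018, flight 8612"))
```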

====

#### ❓Task 4. Provide the code for `replace_time()`, which takes in a tweet and replaces all times (e.g. `09:47:19`) with a special token (`TIME`)

In [32]:
def replace_time(tweet):
    return re.sub(r"\d{2}:\d{2}:\d{2}", "TIME", tweet) #to fill in pattern

In [33]:
print(tweets[39])
replace_time(tweets[39])

#SIA256 9V-SHT Airbus A350 941 Singapore Airlines: 3.7 km away @ 861.1 m and 13.1° frm hrzn, heading SW @ 390.2km/h 09:47:19. #OverBrisbane #BNE #BRISBANE #ADSB #dump1090 https://t.co/mY4oPiqxHd


'#SIA256 9V-SHT Airbus A350 941 Singapore Airlines: 3.7 km away @ 861.1 m and 13.1° frm hrzn, heading SW @ 390.2km/h TIME. #OverBrisbane #BNE #BRISBANE #ADSB #dump1090 https://t.co/mY4oPiqxHd'
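
The pattern above covers times written as `HH:MM:SS`. Tweets may also contain times without seconds or with a single-digit hour (e.g. `7:30`). A possible extension is sketched below, assuming those formats should also map to `TIME`; the name `replace_time_flexible` is hypothetical.

```python
import re

def replace_time_flexible(tweet):
    """Replace H:MM, HH:MM and HH:MM:SS with TIME."""
    return re.sub(r"\b\d{1,2}:\d{2}(?::\d{2})?\b", "TIME", tweet)

print(replace_time_flexible("landed at 09:47:19, dinner at 7:30"))
```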

#### ❓Task 5. Provide the code for `replace_username()`, which takes in a tweet and replaces all Twitter handles (e.g. `@mrbrown`) with a special token (`TWITTER_USERNAME`)

In [34]:
def replace_username(tweet):
    return re.sub(r"@\w+", "TWITTER_USERNAME", tweet) #to fill in pattern

In [35]:
print(tweets[13])
replace_username(tweets[13])

@GWHomeTeam @jessi_asli Not a problem……who wants to fly 11-12 hours to Singapore……then 8-9 hours to Sydney to be picked on 🤣🤣🤣


'TWITTER_USERNAME TWITTER_USERNAME Not a problem……who wants to fly 11-12 hours to Singapore……then 8-9 hours to Sydney to be picked on 🤣🤣🤣'
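
The pattern `@\w+` matches an `@` anywhere in the text, so it would also rewrite part of an e-mail address. If the data may contain e-mail addresses, a negative lookbehind can require that the `@` does not follow a word character. This is a sketch under that assumption, with the hypothetical name `replace_username_strict`.

```python
import re

def replace_username_strict(tweet):
    """Replace @handles, skipping an @ that directly follows a word character."""
    return re.sub(r"(?<!\w)@\w+", "TWITTER_USERNAME", tweet)

print(replace_username_strict("mail me@example.com or ping @mrbrown"))
```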

====

#### ❓Task 6. Provide the code for `replace_url()`, which takes in a tweet and replaces all URLs with a special token (`TWITTER_URL`)

In [60]:
def replace_url(tweet):
    return re.sub(r"https?://\S+", "TWITTER_URL", tweet) #to fill in pattern (matches http and https, stops at any whitespace)

In [55]:
print(tweets[6])
replace_url(tweets[6])

@AngryFalconNFT Love Singapore! https://t.co/cPSVKiksC8


'@AngryFalconNFT Love Singapore! TWITTER_URL'

====

## Combining everything together...

In [56]:
def preprocess(tweet):
    tweet = tweet.lower()
    tweet = remove_accented_chars(tweet) #this also drops other non-ASCII characters such as emoji
    
    #lemmatization or stemming
    tweet = lemmatize_sentence(tweet)    
    #tweet = stem_sentence_porter(tweet)
    
    
    #keep these as the last steps so that the special tokens stay in upper case while
    #the rest of the tweet text is in lowercase
    #convert newline to space
    tweet = remove_newlines(tweet)
    
    #handle year
    tweet = replace_year(tweet)
    
    #handle time
    tweet = replace_time(tweet)
    
    #handle numbers (note that replace_num() comes after replace_year() and replace_time())
    tweet = replace_num(tweet)
    
    #handle username handle
    tweet = replace_username(tweet)
    
    #handle url
    tweet = replace_url(tweet)
    
    return tweet
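
The order of the year, time and number steps matters because the number pattern matches any run of digits, including the digits inside a year or a time. The sketch below re-declares simplified copies of the three helpers so that it runs on its own, and compares the two orderings on a made-up sentence.

```python
import re

def replace_year(t):
    return re.sub(r"\b\d{4}\b", "YEAR", t)

def replace_time(t):
    return re.sub(r"\d{2}:\d{2}:\d{2}", "TIME", t)

def replace_num(t):
    t = re.sub(r"\b\d+\.\d+\b", "NUM", t)
    return re.sub(r"\b\d+", "NUM", t)

text = "seen at 09:47:19 on 12 march 2022"

# Specific patterns first, generic number pattern last.
specific_first = replace_num(replace_time(replace_year(text)))
# Generic number pattern first: years and times are already gone by the time
# replace_year() and replace_time() run.
generic_first = replace_time(replace_year(replace_num(text)))

print(specific_first)
print(generic_first)
```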

In [57]:
preprocess(tweets[1])

'u.s . president joe biden be highlight what country can gain if they stand against russian president vladimir putin . the white house be host singapore prime minister lee hsien loong on tuesday after the city - state back sanction against russia . TWITTER_URL'

In [58]:
processed_tweets = [preprocess(tweet) for tweet in tweets]

#### Execute the code above (modifying `preprocess()` as needed) to see whether `preprocess()` runs faster with lemmatization or with stemming. Which text normalization technique runs faster?
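
One way to compare the two variants is to time the whole loop with `time.perf_counter()`. The sketch below uses a stand-in function and made-up data so that it runs on its own; `time_it` is a hypothetical helper, and in the notebook you would pass `preprocess` and `tweets` instead.

```python
import time

def time_it(fn, items):
    """Apply fn to every item and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = [fn(item) for item in items]
    elapsed = time.perf_counter() - start
    return results, elapsed

# Stand-in function and data; replace with preprocess and tweets in the notebook.
results, elapsed = time_it(str.lower, ["Hello Singapore", "MERLION"])
print(f"{len(results)} items in {elapsed:.6f} s")
```

Running it once with the lemmatization line active and once with the stemming line active gives two timings to compare. Timings vary between runs and machines, so repeating each measurement a few times gives a more reliable picture.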

In [59]:
tweets_df = pd.DataFrame({"raw_tweets": tweets, "processed_tweets":  processed_tweets})

with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    display(tweets_df[1:100])

Unnamed: 0,raw_tweets,processed_tweets
1,U.S. President Joe Biden is highlighting what countries can gain if they stand against Russian President Vladimir Putin. The White House is hosting Singapore’s Prime Minister Lee Hsien Loong on Tuesday after the city-state backed sanctions against Russia. https://t.co/tqiUTkxigB,u.s . president joe biden be highlight what country can gain if they stand against russian president vladimir putin . the white house be host singapore prime minister lee hsien loong on tuesday after the city - state back sanction against russia . TWITTER_URL
2,"Spot-on. But dare you spot the real Catriona.\n\nTwo years after it was announced, the wax figure of Miss Universe 2018 Catriona Gray was formally unveiled in Madame Tussauds Singapore.\n\nRead more:\nhttps://t.co/BB5pjePiCq \n\n#CatrionaGray \n#DailyTribune","spot - on . but dare you spot the real catriona . two year after it be announce , the wax figure of miss universe YEAR catriona gray be formally unveil in madame tussaud singapore . read more : TWITTER_URL"
3,@Phuket_Sammy At least you can tell that Singapore is not as corrupted as Thailand.,TWITTER_USERNAME at least you can tell that singapore be not as corrupt as thailand .
4,The Economist 12 March 2022 rated the top 3 'Crony Capitalist' countries as Russia (1) Malaysia (2) Singapore (3). Seems big land-taxing Singapore needs to extract more rent from her parasites? 🤔,the economist NUM march YEAR rate the top NUM ' crony capitalist ' country as russia ( NUM ) malaysia ( NUM ) singapore ( NUM ) . seem big land - tax singapore need to extract more rent from her parasite ?
5,Singapore Airlines’ most luxurious first-class suites are now available from the US — take a look inside\n\nhttps://t.co/8pAUkSrKcF,singapore airline most luxurious first - class suite be now available from the us take a look inside TWITTER_URL
6,@AngryFalconNFT Love Singapore! https://t.co/cPSVKiksC8,TWITTER_USERNAME love singapore ! TWITTER_URL
7,"Singapore, a secular republic, have a better islamic religious council compared to Malaysia which has JAKIM plus a whole bunch of state muftis who are not homogenous in their interpretation of Islamic jurisprudence.","singapore , a secular republic , have a well islamic religious council compare to malaysia which have jakim plus a whole bunch of state muftis who be not homogenous in their interpretation of islamic jurisprudence ."
8,"Spot-on. But dare you spot the real Catriona.\n\nTwo years after it was announced, the wax figure of Miss Universe 2018 Catriona Gray was formally unveiled in Madame Tussauds Singapore.\n\nRead more:\nhttps://t.co/5PP3jIO5U4 \n\n#CatrionaGray \n#DailyTribune","spot - on . but dare you spot the real catriona . two year after it be announce , the wax figure of miss universe YEAR catriona gray be formally unveil in madame tussaud singapore . read more : TWITTER_URL"
9,60% of food delivery riders in Singapore signed up during Covid-19 pandemic: Survey https://t.co/vMmV1jcENG,NUM % of food delivery rider in singapore sign up during covid-NUM pandemic : survey TWITTER_URL
10,"John found ways that cannot call themselves from Singapore than 24 hour about the end and metrics, and their spinal column.","john find way that can not call themselves from singapore than NUM hour about the end and metric , and their spinal column ."
