<a href="https://colab.research.google.com/github/glitch-y/CE888-Project/blob/main/Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install contractions
!pip install emot



# Import Modules

In [2]:
#Import modules
import html #'html' module to unescape HTML entities such as '&amp;', '&lt;' etc.
import numpy as np
import pandas as pd
import contractions #'contractions' module to expand linguistic contractions (e.g. it's = it is)
from emot import UNICODE_EMO #import emoji dictionary to transform emojis into text
import re

# Import Data

In [3]:
#Import files for the 'Emotion' task
data_emotion_test = pd.read_csv("https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/test_text.txt",
                                delimiter='\t', dtype=str, header=None)
data_emotion_test_labels = pd.read_csv("https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/test_labels.txt",
                                       delimiter='\t', dtype=str, header=None)
data_emotion_mapping = pd.read_csv("https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/mapping.txt",
                                   delimiter='\t', dtype=str, header=None)
data_emotion_train = pd.read_csv("https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/train_text.txt",
                                 delimiter='\t', dtype=str, header=None)
data_emotion_train_labels = pd.read_csv("https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/train_labels.txt",
                                        delimiter='\t', dtype=str, header=None)
data_emotion_val = pd.read_csv("https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/val_text.txt",
                               delimiter='\t', dtype=str, header=None)
data_emotion_val_labels = pd.read_csv("https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emotion/val_labels.txt",
                                      delimiter='\t', dtype=str, header=None)

#Assign column names for Emotion datasets
data_emotion_test.columns =['content']
data_emotion_test_labels.columns =['labels']
data_emotion_mapping.columns =['labels','mapping']
data_emotion_train.columns =['content']
data_emotion_train_labels.columns =['labels']
data_emotion_val.columns =['content']
data_emotion_val_labels.columns =['labels']
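
The `read_csv` calls above differ only in the split name and the file kind. As a minimal sketch, the URLs could be generated from the task name instead of being written out by hand. The helper name `tweeteval_urls` and the dictionary keys are hypothetical and not part of the notebook; the URL pattern is the one used in the cell above.

```python
# Base path taken from the URLs used in the import cell.
BASE_URL = "https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets"


def tweeteval_urls(task, splits=("train", "val", "test")):
    """Return a dict mapping a short name to the raw file URL for one task.

    Hypothetical helper: the key names ('train_text', 'val_labels', ...)
    are an assumption made for this sketch.
    """
    urls = {}
    for split in splits:
        urls[f"{split}_text"] = f"{BASE_URL}/{task}/{split}_text.txt"
        urls[f"{split}_labels"] = f"{BASE_URL}/{task}/{split}_labels.txt"
    urls["mapping"] = f"{BASE_URL}/{task}/mapping.txt"
    return urls


emotion_urls = tweeteval_urls("emotion")
for name, url in emotion_urls.items():
    print(name, url)
```

Each URL could then be passed to `pd.read_csv` with the same `delimiter`, `dtype` and `header` arguments as above, for example inside a dictionary comprehension.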

# Preprocessing

## Replace misspelled words

Typos are common in text data. The scripts below build a dictionary of commonly misspelled words and apply it to the three text data sets (test, train and validation).

In [4]:
#Import misspelling data as dictionary
misspell_data = pd.read_csv("https://raw.githubusercontent.com/glitch-y/CE888-Project/main/Misspelling.txt",sep=":",names=["correction","misspell"])
misspell_data.misspell = misspell_data.misspell.str.strip()
misspell_data.misspell = misspell_data.misspell.str.split(" ")
misspell_data = misspell_data.explode("misspell").reset_index(drop=True)
misspell_data.drop_duplicates("misspell",inplace=True)
miss_corr = dict(zip(misspell_data.misspell, misspell_data.correction))

#Preview misspelling dictionary
dict(list(miss_corr.items())[:10])


{'Steffen': 'Stephen',
 'abilitey': 'ability',
 'abouy': 'about',
 'absorbtion': 'absorption',
 'accidently': 'accidentally',
 'accomodate': 'accommodate',
 'nevade': 'Nevada',
 'presbyterian': 'Presbyterian',
 'rsx': 'RSX',
 'susan': 'Susan'}

In [5]:
#Create misspelling correction function
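def misspelled_correction(x):
    #Replace whole whitespace-delimited tokens only; calling str.replace on the
    #full string would also rewrite longer words that contain a misspelled token
    return re.sub(r'\S+', lambda m: miss_corr.get(m.group(), m.group()), x)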

#Apply misspelling correction to text dataframes as new column
data_emotion_test['content_clean'] = data_emotion_test.content.apply(lambda x : misspelled_correction(x).lower())
data_emotion_train['content_clean'] = data_emotion_train.content.apply(lambda x : misspelled_correction(x).lower())
data_emotion_val['content_clean'] = data_emotion_val.content.apply(lambda x : misspelled_correction(x).lower())
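
One detail worth illustrating is the difference between replacing a misspelled token wherever it occurs as a substring and replacing it only when it is a whole token. The sketch below uses a hypothetical two-entry dictionary (`toy_corrections`) and an invented sample sentence; it is an illustration of the two strategies, not the notebook's data.

```python
import re

# Hypothetical dictionary for illustration only ('abouy' also appears in the preview above).
toy_corrections = {"teh": "the", "abouy": "about"}


def correct_substrings(text, corrections):
    """Substring-based replacement: may also change longer words."""
    for token in text.split():
        if token in corrections:
            text = text.replace(token, corrections[token])
    return text


def correct_tokens(text, corrections):
    """Token-based replacement: only whole whitespace-delimited tokens change."""
    return re.sub(r"\S+", lambda m: corrections.get(m.group(), m.group()), text)


sample = "teh tehran trip was abouy right"
print(correct_substrings(sample, toy_corrections))  # 'tehran' is altered as a side effect
print(correct_tokens(sample, toy_corrections))      # 'tehran' is left untouched
```

The token-based variant also preserves the original spacing, because only the non-whitespace runs are rewritten.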

## Replace abbreviated words

Social media users often abbreviate words, both because posts are written quickly and because of character limits (a well-known aspect of Twitter).

The script below creates a dictionary of commonly known internet abbreviations and applies it to the three text data sets.

In [6]:
#Abbreviated chat words conversion
#Create Dictionary
chat_dictionary = pd.read_csv("https://raw.githubusercontent.com/glitch-y/CE888-Project/main/SlangDictionary.csv",dtype=str, names=["Slang", "Translation"])
chat_dictionary=chat_dictionary.apply(lambda x: x.str.lower())
slang_corr = dict(zip(chat_dictionary.Slang, chat_dictionary.Translation))

#Preview abbreviation dictionary
dict(list(slang_corr.items())[:10])

{'a.s.a.p.': 'as soon as possible',
 'ama': 'ask me anything',
 'asap': 'as soon as possible',
 'atk': 'at the keyboard',
 'atm': 'at the moment',
 'bbl': 'be back later',
 'bbs': 'be back soon',
 'bc': 'because',
 'bcs': 'because',
 'bfn': 'bye for now'}

In [7]:
#Create abbreviation replacement function
def abbrev_replace(x):
    #Replace whole whitespace-delimited tokens only, so that short slang keys
    #(e.g. 'bc') do not rewrite longer words that contain them
    return re.sub(r'\S+', lambda m: slang_corr.get(m.group(), m.group()), x)

#Apply abbreviation replacement to the existing 'content_clean' column
data_emotion_test.content_clean = data_emotion_test.content_clean.apply(abbrev_replace)
data_emotion_train.content_clean = data_emotion_train.content_clean.apply(abbrev_replace)
data_emotion_val.content_clean = data_emotion_val.content_clean.apply(abbrev_replace)

#Check
data_emotion_train.head()

| | content | content_clean |
|---|---|---|
| 0 | “Worry is a down payment on a problem you may ... | “worry is a down payment on a problem you may ... |
| 1 | My roommate: it's okay that we can't spell bec... | my roommate: it's okay that we can't spell bec... |
| 2 | No but that's so cute. Atsu was probably shy a... | no but that's so cute. atsu was probably shy a... |
| 3 | Rooneys fucking untouchable isn't he? Been fuc... | rooneys fucking untouchable isn't he? been fuc... |
| 4 | it's pretty depressing when u hit pan on ur fa... | it's pretty depressing when u hit pan on your... |
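
Splitting on whitespace means an abbreviation followed directly by punctuation (for example `asap!`) is not found in the dictionary. One possible alternative, sketched here under stated assumptions, is a single compiled regular expression built from the dictionary keys. The names `toy_slang`, `build_slang_pattern` and `expand_slang` are hypothetical; the three entries are taken from the preview above.

```python
import re

# Small excerpt of the slang dictionary, for illustration only.
toy_slang = {
    "asap": "as soon as possible",
    "bc": "because",
    "atm": "at the moment",
}


def build_slang_pattern(slang):
    """Compile one pattern that matches any key as a standalone word."""
    # Longest keys first so that longer abbreviations win over their prefixes.
    keys = sorted(slang, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    # The lookarounds require that no word character touches the match.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")


def expand_slang(text, slang, pattern):
    """Replace every standalone abbreviation with its expansion."""
    return pattern.sub(lambda m: slang[m.group()], text)


slang_pattern = build_slang_pattern(toy_slang)
print(expand_slang("reply asap! bc i am busy atm.", toy_slang, slang_pattern))
print(expand_slang("abc", toy_slang, slang_pattern))  # unchanged: 'bc' is inside a word
```

Because the text is scanned once regardless of dictionary size, this approach may also be faster on large data sets, although that has not been measured here.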


## Remove HTML elements

Text scraped from websites often contains HTML character entities, such as '&amp;' in place of '&'.

The script below uses the 'html' module to clean the data of any such occurrences.

In [8]:
#Clean HTML character entities such as &amp;, &lt; etc. using the 'html' module
data_emotion_test.content_clean = data_emotion_test.content_clean.apply(lambda x: html.unescape(x))
data_emotion_train.content_clean = data_emotion_train.content_clean.apply(lambda x: html.unescape(x))
data_emotion_val.content_clean = data_emotion_val.content_clean.apply(lambda x: html.unescape(x))

#Check
print(data_emotion_test.loc[[12]])

                                              content                                      content_clean
12  Yes #depression &amp; #anxiety are real but so...  yes #depression & #anxiety are real but so is ...
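
As a small standalone illustration of what `html.unescape` does, the sketch below runs it on a few invented sample strings. The first is modelled on the tweet shown above.

```python
import html

# Invented sample strings, for illustration only.
samples = [
    "#depression &amp; #anxiety",
    "1 &lt; 2 &gt; 0",
    "plain text stays as it is",
    "&amp;amp;",  # double-escaped: one pass removes only one level
]

unescaped = [html.unescape(s) for s in samples]
for before, after in zip(samples, unescaped):
    print(f"{before!r} -> {after!r}")
```

Text that was escaped twice at the source needs a second pass, since each call decodes only one level of escaping.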


## Fix language contractions

The script below uses the 'contractions' module to expand any language contractions such as 'let's' into 'let us' or 'it's' into 'it is'

In [9]:
#fix contractions; i.e. 'It's' transforms into 'it is'
data_emotion_test.content_clean = data_emotion_test.content_clean.apply(lambda x: contractions.fix(x))
data_emotion_train.content_clean = data_emotion_train.content_clean.apply(lambda x: contractions.fix(x))
data_emotion_val.content_clean = data_emotion_val.content_clean.apply(lambda x: contractions.fix(x))

#Check
print(data_emotion_test.loc[[54]])

                                              content                                      content_clean
54  Let's start all over again.....\n#feels #lover...  let us start all over again.....\n#feels #love...


## Remove 'newlines' and replace '&' with 'and'


In [10]:
#Remove newlines from data and replace '&' with 'and'
data_emotion_test.content_clean = data_emotion_test.content_clean.replace(r'\\n',' ', regex=True)
data_emotion_test.content_clean = data_emotion_test.content_clean.replace(r'&','and', regex=True)

data_emotion_train.content_clean = data_emotion_train.content_clean.replace(r'\\n',' ', regex=True)
data_emotion_train.content_clean = data_emotion_train.content_clean.replace(r'&','and', regex=True)

data_emotion_val.content_clean = data_emotion_val.content_clean.replace(r'\\n',' ', regex=True)
data_emotion_val.content_clean = data_emotion_val.content_clean.replace(r'&','and', regex=True)

#Check
print(data_emotion_test.loc[[34]])
print(data_emotion_test.loc[[12]])

                                              content                                      content_clean
34  @user -- can handle myself.\n[Carl yelled back...  @user -- can handle myself. [carl yelled back ...
                                              content                                      content_clean
12  Yes #depression &amp; #anxiety are real but so...  yes #depression and #anxiety are real but so i...
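
The pattern `r'\\n'` used above matches the two-character sequence backslash + `n`, which is how line breaks appear to be stored in these one-tweet-per-line files. It does not match a real newline character. The sketch below shows the difference on two invented strings; the helper name `strip_newline_markers` is hypothetical.

```python
import re

literal_marker = "all over again.....\\n#feels"  # backslash followed by 'n' (two characters)
real_newline = "all over again.....\n#feels"     # a single newline character


def strip_newline_markers(text):
    """Replace both literal '\\n' markers and real newline characters with a space."""
    text = re.sub(r"\\n", " ", text)  # literal backslash + 'n'
    return text.replace("\n", " ")    # actual newline character


# The regex alone leaves a real newline untouched.
print(re.sub(r"\\n", " ", real_newline) == real_newline)
print(strip_newline_markers(literal_marker))
print(strip_newline_markers(real_newline))
```

Handling both forms is only a precaution for data from other sources; whether real newline characters occur in this data set is not established here.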


## Convert emojis into text

Emojis describe a variety of emotions or objects, so keeping their meaning as text can help increase the accuracy of a model trained on this data.

The script below uses the 'emot' module to look up emojis in the module dictionary and translate them into text.


In [11]:
#convert emojis into text
def convert_emojis(x):
    for emot in UNICODE_EMO:
        x = x.replace(emot, "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()))
    return x

data_emotion_test.content_clean = data_emotion_test.content_clean.apply(lambda x: convert_emojis(x))
data_emotion_train.content_clean = data_emotion_train.content_clean.apply(lambda x: convert_emojis(x))
data_emotion_val.content_clean = data_emotion_val.content_clean.apply(lambda x: convert_emojis(x))

#Check
print(data_emotion_test.loc[[105]])

                     content                                      content_clean
105  @user Wise you mean? 😅   @user wise you mean? smiling_face_with_open_mo...
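
For readers without the 'emot' package, a rough standard-library approximation is sketched below. It is a different technique from the one used in the notebook: it relies on Unicode character names from `unicodedata` rather than the 'emot' dictionary, so the resulting labels may differ for some emojis. It also converts every character in the Unicode category `So` (which includes non-emoji symbols such as the copyright sign). The function name `emoji_to_text` is hypothetical.

```python
import unicodedata


def emoji_to_text(text):
    """Replace 'Symbol, other' characters with their lowercased Unicode name."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            # Join the words of the name with underscores; keep the
            # character as-is if this Python build has no name for it.
            pieces.append("_".join(name.lower().split()) if name else ch)
        else:
            pieces.append(ch)
    return "".join(pieces)


converted = emoji_to_text("wise you mean? \U0001F605")
print(converted)
```

Multi-codepoint emojis (skin-tone modifiers, flags, joined sequences) are not handled as a unit by this sketch.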


## Remove unnecessary punctuation

Certain types of punctuation are not of particular use and are removed using the script below.

However, commas, periods, exclamation marks, question marks and apostrophes have not been taken out, as they help set the tone or define the relationships between words.


In [13]:
#Remove unnecessary punctuation
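def punctuation(x):
    #Raw string so that the backslash is a literal character, not an escape sequence
    punctuations = r'()-[]{};:\<>/#$%^&_~'

    #Replace each unwanted punctuation character with a space
    for i in punctuations:
        x = x.replace(i, " ")
    return x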

data_emotion_test.content_clean = data_emotion_test.content_clean.apply(lambda x: punctuation(x))
data_emotion_train.content_clean = data_emotion_train.content_clean.apply(lambda x: punctuation(x))
data_emotion_val.content_clean = data_emotion_val.content_clean.apply(lambda x: punctuation(x))
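
The same character-by-character replacement can also be expressed with a translation table, which processes the string in a single pass. This is a minimal sketch of that alternative, using the same set of characters as the function above; the names `PUNCT_TO_STRIP`, `PUNCT_TABLE` and `strip_punctuation` are hypothetical.

```python
# Same characters as in the notebook's function; raw string keeps the backslash literal.
PUNCT_TO_STRIP = r"()-[]{};:\<>/#$%^&_~"

# Map every unwanted character to a single space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCT_TO_STRIP})


def strip_punctuation(text):
    """Replace each unwanted punctuation character with a space."""
    return text.translate(PUNCT_TABLE)


example = "#joyful (really): yes/no!"
cleaned = strip_punctuation(example)
print(repr(cleaned))
print(cleaned.split())  # commas, periods, '!', '?' and apostrophes are kept
```

Since every removed character becomes a space, the output has the same length as the input and may contain runs of consecutive spaces.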

## Remove '@user' mentions


In [14]:
#Remove @user mentions
data_emotion_test.content_clean = data_emotion_test.content_clean.str.replace('@user','', regex=False)
data_emotion_train.content_clean = data_emotion_train.content_clean.str.replace('@user','', regex=False)
data_emotion_val.content_clean = data_emotion_val.content_clean.str.replace('@user','', regex=False)


data_emotion_test.head()

| | content | content_clean |
|---|---|---|
| 0 | #Deppression is real. Partners w/ #depressed p... | deppression is real. partners with depresse... |
| 1 | @user Interesting choice of words... Are you c... | interesting choice of words... are you confir... |
| 2 | My visit to hospital for care triggered #traum... | my visit to hospital for care triggered traum... |
| 3 | @user Welcome to #MPSVT! We are delighted to h... | welcome to mpsvt! we are delighted to have y... |
| 4 | What makes you feel #joyful? | what makes you feel joyful? |
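
Replacing punctuation and mentions with spaces or empty strings can leave leading, trailing or repeated spaces in the cleaned text. The notebook does not include a whitespace step; the sketch below is one possible follow-up, with the hypothetical helper name `remove_mentions`, shown on an invented string.

```python
import re


def remove_mentions(text, placeholder="@user"):
    """Drop the anonymised mention placeholder and tidy up the spacing."""
    text = text.replace(placeholder, " ")
    # Collapse any run of whitespace to one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


tidied = remove_mentions("@user welcome to  mpsvt! we are delighted @user")
print(repr(tidied))
```

Whether the extra spaces matter depends on the tokenizer used afterwards; many tokenizers ignore them.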
