## Mini Project 03 - NLP Emotions: Text Preparation

Dataset:
- A. Tripathi, "Emotion Classification NLP", Kaggle.com, 2021. [Online]. Available: https://www.kaggle.com/datasets/anjaneyatripathi/emotion-classification-nlp. [Accessed: 16- Jul- 2022].

Sources:
- WASSA-2017 Shared Task on Emotion Intensity. Saif M. Mohammad and Felipe Bravo-Marquez. In Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA), September 2017, Copenhagen, Denmark.

Emotion Labels:
- joy: 1
- sadness: 2
- anger: 3
- fear: 4
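
For readable reports later on, the numeric labels above can be mapped back to emotion names. This is a minimal sketch; the dictionary and helper names are illustrative and are not part of the dataset or the notebook.

```python
# Mapping copied from the label list above; the names are illustrative.
LABEL_TO_EMOTION = {1: "joy", 2: "sadness", 3: "anger", 4: "fear"}
# Reverse lookup, useful when a model or report works with emotion names.
EMOTION_TO_LABEL = {name: code for code, name in LABEL_TO_EMOTION.items()}

def label_name(code):
    """Return the emotion name for a numeric label, or 'unknown' if unmapped."""
    return LABEL_TO_EMOTION.get(int(code), "unknown")

print(label_name(1))
```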

### Install Libraries

In [61]:
# install packages with pip
# ! pip install simple-colors
# ! pip install neattext
# ! pip install emoji 

### Import Libraries

In [62]:
## Import Libraries
import numpy as np
import pandas as pd
import re as regex
import spacy
from pathlib import Path
import time


import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

import string
from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

import neattext.functions as nfx
import nltk

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

### Load Data

In [63]:
dfData = pd.read_csv("textDataset.csv")

In [64]:
dfData.head()

| | Unnamed: 0 | text | label |
|---|---|---|---|
| 0 | 0 | Just got back from seeing @GaryDelaney in Burs... | 1 |
| 1 | 1 | Oh dear an evening of absolute hilarity I don'... | 1 |
| 2 | 2 | Been waiting all week for this game ❤️❤️❤️ #ch... | 1 |
| 3 | 3 | @gardiner_love : Thank you so much, Gloria! Yo... | 1 |
| 4 | 4 | I feel so blessed to work with the family that... | 1 |


In [65]:
dfData.drop(['Unnamed: 0'], axis= 1, inplace= True)

In [66]:
dfData.shape

(7102, 2)

### Prepare the Stage

In [67]:
nlp = spacy.load('en_core_web_md')

### Prepare the text
Text preparation covers every change that turns the raw source text into the format used for the actual processing, for example:
- handling encoding
- handling extraneous and international characters
- handling symbols
- handling metadata and embedded information
- handling repetitions (such as multiple spaces or newlines)

The cells below clean the text step by step.
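
As an illustration of the encoding and symbol handling listed above, the sketch below decodes HTML entities (the tweets in this dataset contain sequences such as `&amp;`) and normalizes Unicode compatibility characters. `normalize_raw_text` is a hypothetical helper and is not part of the pipeline in this notebook; it uses only the Python standard library.

```python
import html
import unicodedata

def normalize_raw_text(text):
    """Decode HTML entities and normalize Unicode before further cleaning."""
    # "&amp;" becomes "&", "&lt;" becomes "<", and so on
    text = html.unescape(text)
    # NFKC folds compatibility characters (e.g. full-width letters) into their plain forms
    text = unicodedata.normalize("NFKC", text)
    return text

print(normalize_raw_text("happy &amp; cheery"))
```

Running a step like this before punctuation removal would keep a leftover `amp` token from appearing in the cleaned text, assuming that token is unwanted.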

### Text Handling on a Single Cell

In [68]:
def clean_text(text):
    # collapse runs of two or more whitespace characters (spaces, newlines) into one space
    text = regex.sub(r'\s{2,}', ' ', text)
    # remove double quotes
    text = regex.sub(r'"', '', text)
    # replace literal "\n" sequences (a backslash followed by "n") with a space
    text = regex.sub(r'\\n', ' ', text)

    return text
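
The demo tweet used in the following cells contains no repeated whitespace, so the effect of whitespace collapsing is not visible there. Here is a minimal standalone sketch on a made-up string; `collapse_whitespace` is a hypothetical helper, not the notebook's `clean_text`.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = "so  excited\n\n\nto see   Nat"
print(collapse_whitespace(sample))
```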

In [69]:
demoText= dfData.iloc[4122, 0]
# demoText= dfData.iloc[4321, 0]
# demoText= dfData.iloc[3085, 0]
# demoText= dfData.iloc[3023, 0]
demoText

"I'm so excited to see Nat tonight 😍😍.. And how happy and cheery she is! &amp; then I'm even more excited for her to get on social media 😍 #BB18"

In [70]:
cleanStg01= clean_text(demoText)
cleanStg01

"I'm so excited to see Nat tonight 😍😍.. And how happy and cheery she is! &amp; then I'm even more excited for her to get on social media 😍 #BB18"

In [71]:
# Check for duplicate entries
dfData.duplicated().sum()

0

In [72]:
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

In [73]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from emoji import demojize

In [74]:
# Remove userhandles
text_userHandles= nfx.remove_userhandles(cleanStg01)
text_userHandles

"I'm so excited to see Nat tonight 😍😍.. And how happy and cheery she is! &amp; then I'm even more excited for her to get on social media 😍 #BB18"

In [75]:
# Lower casing
text_lower= str.lower(text_userHandles)
text_lower

"i'm so excited to see nat tonight 😍😍.. and how happy and cheery she is! &amp; then i'm even more excited for her to get on social media 😍 #bb18"

In [76]:
# handle emojis
text_emoji= demojize(text_lower)
text_emoji

"i'm so excited to see nat tonight :smiling_face_with_heart-eyes::smiling_face_with_heart-eyes:.. and how happy and cheery she is! &amp; then i'm even more excited for her to get on social media :smiling_face_with_heart-eyes: #bb18"

In [77]:
# Remove punctuation
# punc_to_remove = string.punctuation

# def remove_punctuation(text):
#     return text.translate(str.maketrans('','', punc_to_remove))
# text_punc= remove_punctuation(text_emoji)
# text_punc

In [78]:
# Removal of punctuation
# def remove_punc(text):
#     pattern= regex.compile('[^\w]+')
#     init= regex.split(pattern, text)
    
#     output02= ' '.join(regex.split(pattern, text))
# #     print(init)
# #     print(output02)
#     return output02

# text_punc= remove_punc(text_emoji)
# text_punc

In [79]:
# Remove punctuation
def remove_punc(text):
    # replace punctuation, underscores and whitespace characters with a space
    text = regex.sub(r'[^\w\s]+|[_\s]', ' ', text)
    # collapse the resulting runs of spaces into a single space
    text = regex.sub(r'\s+', ' ', text)
    return text

text_punc= remove_punc(text_emoji)
text_punc

'i m so excited to see nat tonight smiling face with heart eyes smiling face with heart eyes and how happy and cheery she is amp then i m even more excited for her to get on social media smiling face with heart eyes bb18'

In [80]:
# remove stopwords
STOPWORDS = set(stopwords.words("english"))

def remove_stopword(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
text_stop= remove_stopword(text_punc)
text_stop

'excited see nat tonight smiling face heart eyes smiling face heart eyes happy cheery amp even excited get social media smiling face heart eyes bb18'

In [81]:
def stopWord_remove(cell):
    # tokenize with spaCy and keep tokens whose lexeme is not a stop word
    doc = nlp(cell)
    filtered_text = [token.text for token in doc if not nlp.vocab[token.text].is_stop]
    joinFiltered = ' '.join(filtered_text)

    # replace any remaining run of non-word characters with a single space
    cleanFiltered = ' '.join(regex.split(r'[^\w]+', joinFiltered))
    return cleanFiltered

text_stop01= stopWord_remove(text_punc)
text_stop01

'm excited nat tonight smiling face heart eyes smiling face heart eyes happy cheery amp m excited social media smiling face heart eyes bb18'

In [82]:
# lemmatization
lemmatizer = WordNetLemmatizer()
wordnet_map={"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}

def lemmatized_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word , wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])
    
text_lemma= lemmatized_words(text_stop01)
text_lemma

'm excite nat tonight smile face heart eye smile face heart eye happy cheery amp m excite social medium smile face heart eye bb18'

In [83]:
# removal of numbers
def remove_num(text):
    # split on runs of digits and rejoin the pieces with a space
    return ' '.join(regex.split(r'\d+', text))

text_num= remove_num(text_lemma)
text_num

'm excite nat tonight smile face heart eye smile face heart eye happy cheery amp m excite social medium smile face heart eye bb '

In [84]:
# Remove one- and two-character tokens, then tidy up the string
# text01= 'i am excited to go to a ocean'
def remove_single(text):
    text = regex.sub(r'(\b\S{1}\b)|(\b\S{2}\b)', '', text)  # drop tokens of length 1 or 2
    text = regex.sub(r'\s+', ' ', text)  # collapse extra whitespace
    text = regex.sub(r'^\s+|\s+$', '', text)  # trim whitespace at the start and end of the string
    text = regex.sub(r'[o]{3,}', 'o', text)  # shorten runs of three or more "o" to a single "o"
    return text

text_letter= remove_single(text_num)
text_letter

'excite nat tonight smile face heart eye smile face heart eye happy cheery amp excite social medium smile face heart eye'
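
Note that the length filter removes every one- and two-character token, not only stray single letters. The sketch below, on made-up input, shows that short but possibly meaningful words such as `no` or `ok` are dropped too; whether that is acceptable for emotion classification is worth checking. `drop_short_tokens` is a hypothetical helper that uses `\w` rather than the `\S` of the function above, which behaves the same on text that has already had its punctuation removed.

```python
import re

# Whole tokens made of one or two word characters
SHORT_TOKEN = re.compile(r"\b\w{1,2}\b")

def drop_short_tokens(text):
    """Remove tokens of length 1 or 2 and tidy the whitespace left behind."""
    kept = SHORT_TOKEN.sub("", text)
    return re.sub(r"\s+", " ", kept).strip()

print(drop_short_tokens("m excite nat tonight bb"))
print(drop_short_tokens("no it is not ok"))
```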

### Text Cleaning

In [85]:
def convert_text(cell):
    cleanStg01= clean_text(cell)
    text_userHandles= nfx.remove_userhandles(cleanStg01)
    text_lower= str.lower(text_userHandles)
    text_emoji= demojize(text_lower)
    text_punc= remove_punc(text_emoji)
    text_stop01= stopWord_remove(text_punc)
    text_lemma= lemmatized_words(text_stop01)
    text_num= remove_num(text_lemma)
    text_letter= remove_single(text_num)
    return text_letter
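
`convert_text` chains the steps by hand, passing each intermediate result to the next function. One alternative is to keep the steps in an ordered list, so that reordering or disabling a step is a one-line change. In this sketch the three stand-in steps are simplified, hypothetical substitutes for the notebook's functions, and `run_pipeline` is an illustrative name.

```python
import re
from functools import reduce

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    # replace anything that is neither a word character nor whitespace with a space
    return re.sub(r"[^\w\s]+", " ", text)

def collapse_spaces(text):
    return re.sub(r"\s+", " ", text).strip()

# Order matters: each step receives the output of the previous one.
STEPS = [lowercase, strip_punctuation, collapse_spaces]

def run_pipeline(text, steps):
    """Apply each step in turn, feeding the result into the next step."""
    return reduce(lambda current, step: step(current), steps, text)

print(run_pipeline("Hello,   World!", STEPS))
```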

In [86]:
%%time
# Apply the full cleaning pipeline to every tweet and store the result in a new column
dfData['short'] = dfData['text'].apply(convert_text)

CPU times: total: 58.9 s
Wall time: 58.9 s


In [87]:
dfData.sample(10)

| | text | label | short |
|---|---|---|---|
| 5806 | Panpiper playing Big River outside The Bridges... | 3 | panpiper play big river outside bridge outrage |
| 868 | @p4pictures it would be great but what if the ... | 4 | great card crash face scream fear happen twice... |
| 1198 | If purging was real, Kenya would be the countr... | 4 | purge real kenya country elite purge chopper l... |
| 55 | What a #lively #lovely #shower 😇 .. | 1 | lively lovely shower smile face halo |
| 6244 | I'd pay good money to watch someone slap that ... | 3 | pay good money watch slap pout candice |
| 5278 | @MLB @JoeyBats19 Sam Dyson is probably having ... | 4 | sam dyson probably have flashback right |
| 5944 | jelly baby is my favourite insult | 3 | jelly baby favourite insult |
| 2930 | i is sad | 2 | sad |
| 40 | @palmtreesarah @WorthingTheatre had more fun t... | 1 | fun funny person funsville hilarity usual than... |
| 6770 | Too many are on their 'yeah, the thing going o... | 2 | yeah thing cop shoot innocent people sad backy... |
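
Several sampled rows begin with `@handles` that no longer appear in `short`; that is the effect of `nfx.remove_userhandles`. The sketch below approximates that step with a plain regular expression on made-up input. `HANDLE_PATTERN` and `strip_handles` are hypothetical names, and the pattern is an assumption: neattext's actual pattern may differ.

```python
import re

# Assumed handle shape: "@" followed by letters, digits or underscores
HANDLE_PATTERN = re.compile(r"@\w+")

def strip_handles(text):
    """Remove @handles and tidy the whitespace left behind."""
    without_handles = HANDLE_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_handles).strip()

print(strip_handles("@p4pictures it would be great"))
```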


In [88]:
dfData.iloc[4321, 0]

"Ok, it seems there is still hope in BSNL 🙂\\nNow, BB speed is ok 🙂\\nBut I won't thank and elate much lest it be gone again like before 😐🙂"

In [89]:
dfData.iloc[4321, 2]

'hope bsnl slightly smile face speed slightly smile face win thank elate like neutral face slightly smile face'
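
The emoji in the tweet above survive in `short` as words (`slightly smile face`, `neutral face`) because `demojize` replaces each emoji with its textual name before punctuation removal and lemmatization. The sketch below illustrates the idea using only the standard library. `describe_symbols` is a hypothetical helper, and its names come from the Unicode character database, so they can differ from the names the `emoji` package produces (for example `heart-shaped_eyes` rather than `heart-eyes`). It does not handle skin-tone modifiers or joined emoji sequences.

```python
import unicodedata

VARIATION_SELECTOR = "\ufe0f"  # invisible modifier that often follows symbols such as the heart

def describe_symbols(text):
    """Replace 'Symbol, other' characters (most emoji) with :unicode_name: tokens."""
    pieces = []
    for char in text:
        if char == VARIATION_SELECTOR:
            continue  # drop the selector, it carries no meaning as text
        name = unicodedata.name(char, "")
        if unicodedata.category(char) == "So" and name:
            pieces.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            pieces.append(char)
    return "".join(pieces)

print(describe_symbols("speed is ok \U0001F642"))
```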

In [90]:
filepath = Path('convertedTextDataset.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
dfData.to_csv(filepath)