[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QWts4Kplj24mYyW8rlTL8Iizy1bwiggY?usp=sharing)

# SemEval-2018 Task 3: Irony Detection

This notebook uses the irony detection data from the TweetEval benchmark [[1]](#section_id), which is available [here](https://github.com/cardiffnlp/tweeteval). <br>

As described in [[1]](#section_id), the tweets were retrieved with the
Twitter API from October 2015 to February 2017 and "geolocalized" in the United States. <br>

In [1]:
import pandas as pd # Allows us to work with CSV files
import emoji # Allows us to print emojis
import numpy as np # Allows us to work with arrays
import re  # Allows us to work with regular expressions
import nltk.data  # Allows us to load the tokenizer punkt/english.pickle
import nltk # Import the nltk package
from nltk.stem.snowball import SnowballStemmer # Import the SnowballStemmer algorithm
import warnings
warnings.filterwarnings('ignore') # Disables Python warnings

# PENDING
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression # Import logistic regression
from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer, f1_score, classification_report
# Import accuracy score, confusion matrix, make_scorer, F1 score and classification report
from sklearn import metrics # Import scikit-learn metrics module for Recall calculation
from sklearn.model_selection import cross_val_score # Import cross validation score
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
# Import train_test_split function, stratified K-Folds cross-validator and GridSearchCV

### Step 1: Preparing the Irony Detection Mapping.
For this task, the **Irony Detection Mapping** proposed in the TweetEval [[1]](#section_id) will be used. <br>

In [9]:
# Print the Irony Detection Mapping dataframe

# Read a TXT file from internet (Github) that does not have a header
df_irony_mapping = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                               'main/datasets/irony/mapping.txt', header=None, sep ='\t', 
                                names=['Label','Output'], index_col=False)
    
print('\n\033[1mIrony Detection Mapping Subset:\033[0m')
df_irony_mapping = df_irony_mapping[['Output', 'Label']]
display(df_irony_mapping) # Output: SemEval-2018 Irony Detection Mapping dataframe (truncated)



Irony Detection Mapping Subset:


Unnamed: 0,Output,Label
0,non_irony,0
1,irony,1


The dataframe above reads as follows: <br>
<br>
• The label 0 corresponds to **Non Irony**, while the label 1 corresponds to **Irony** <br>
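A minimal sketch of how this mapping could be used to turn numeric labels back into readable names. The dictionary `id_to_label` and the helper `decode_labels` are hypothetical names introduced only for illustration; they are not part of the notebook.

```python
# Hypothetical lookup built from the two rows of the mapping dataframe
id_to_label = {0: "non_irony", 1: "irony"}

def decode_labels(labels):
    """Translate a sequence of numeric labels into their readable names."""
    return [id_to_label[label] for label in labels]

print(decode_labels([1, 0, 1]))  # ['irony', 'non_irony', 'irony']
```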

### Step 2: Importing Irony Detection Train and Test subsets
For this task, the **Irony Detection train and test subsets** proposed in TweetEval [[1]](#section_id) will be used. <br>
The train and test subsets contain 2,862 and 784 tweets respectively. They represent the feature variable (X).
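Because these files store one tweet per line, a plain line-by-line reader is one alternative to a delimiter-based parser, which may split or merge rows when a tweet contains a tab or a quote character. The sketch below only illustrates the idea: `read_tweets`, `read_labels` and the in-memory sample strings are hypothetical stand-ins, not the notebook's loading code.

```python
import io

def read_tweets(handle):
    """Return one tweet per line, leaving tabs and quotes inside a tweet untouched."""
    return [line.rstrip("\r\n") for line in handle]

def read_labels(handle):
    """Return one integer label per non-empty line."""
    return [int(line) for line in handle if line.strip()]

# Hypothetical in-memory stand-ins for a text file and its label file
sample_text = io.StringIO('he said "nice"\tgreat day\n#NOT GONNA WIN\n')
sample_labels = io.StringIO("1\n0\n")

tweets = read_tweets(sample_text)
labels = read_labels(sample_labels)

# Every tweet must have exactly one label before the two are combined
assert len(tweets) == len(labels)
print(list(zip(tweets, labels)))
```

With real files, the same functions could be called on an opened file handle instead of `io.StringIO`.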

In [12]:
# Load the Irony Detection Train subset (Feature variable (X))

# Read a TXT file from internet (Github) that does not have a header
df_irony_train_x = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                                 'main/datasets/irony/train_text.txt', header=None, sep ='\t', names=['Tweets'])
    
print('\n\033[1mIrony Detection Train Subset:\033[0m')
display(df_irony_train_x) # Output: SemEval-2018 Irony Detection dataframe for training (truncated)


# Load the Irony Detection Test subset (Feature variable (X))

# Read a TXT file from internet (Github) that does not have a header
df_irony_test_x = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                              'main/datasets/irony/test_text.txt', header=None, sep ='\n', names=['Tweets'])
  
print('\n\033[1mIrony Detection Test Subset:\033[0m')
display(df_irony_test_x) # Output: SemEval-2018 Irony Detection dataframe for testing (truncated)


Irony Detection Train Subset:


Unnamed: 0,Tweets
0,seeing ppl walking w/ crutches makes me really...
1,"look for the girl with the broken smile, ask h..."
2,Now I remember why I buy books online @user #s...
3,@user @user So is he banded from wearing the c...
4,Just found out there are Etch A Sketch apps. ...
...,...
2857,I don't have to respect your beliefs.||I only ...
2858,Women getting hit on by married managers at @u...
2859,@user no but i followed you and i saw you post...
2860,@user I dont know what it is but I'm in love y...



Irony Detection Test Subset:


Unnamed: 0,Tweets
0,@user Can U Help?||More conservatives needed o...
1,"Just walked in to #Starbucks and asked for a ""..."
2,#NOT GONNA WIN
3,@user He is exactly that sort of person. Weirdo!
4,So much #sarcasm at work mate 10/10 #boring 10...
...,...
779,"If you drag yesterday into today, your tomorro..."
780,Congrats to my fav @user & her team & my birth...
781,@user Jessica sheds tears at her fan signing e...
782,#Irony: al jazeera is pro Anti - #GamerGate be...


For this task, the **Irony Detection train and test subsets (labels)** proposed in TweetEval [[1]](#section_id) will be used. <br>
The train and test subsets contain 2,862 and 784 labels respectively, one per tweet. They represent the target variable (Y).

In [13]:
# Load the Irony Detection Train subset (Target Variable (Y))

# Read a TXT file from internet (Github) that does not have a header
df_irony_train_y = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                               'main/datasets/irony/train_labels.txt', header=None, names=['irony_output'])

print('\n\033[1mIrony Detection Train Subset (Labels):\033[0m')
display(df_irony_train_y) # Output: SemEval-2018 Irony Detection dataframe for training (labels)(truncated)



# Load the Irony Detection Test subset (Target Variable (Y))

# Read a TXT file from internet (Github) that does not have a header
df_irony_test_y = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                              'main/datasets/irony/test_labels.txt', header=None, names=['irony_output'])

print('\n\033[1mIrony Detection Test Subset (Labels):\033[0m')
display(df_irony_test_y) # Output: SemEval-2018 Irony Detection dataframe for testing (labels)(truncated)



Irony Detection Train Subset (Labels):


Unnamed: 0,irony_output
0,1
1,0
2,1
3,1
4,1
...,...
2857,0
2858,1
2859,0
2860,0



Irony Detection Test Subset (Labels):


Unnamed: 0,irony_output
0,0
1,1
2,0
3,0
4,1
...,...
779,0
780,0
781,0
782,1


In [15]:
# Merging the X and Y dataframes (Training)
df_irony_train = pd.concat([df_irony_train_x, df_irony_train_y], axis=1)  # Merging the dataframes
print('\n\033[1mIrony Detection Train Subset:\033[0m')
display(df_irony_train) # Output: SemEval-2018 Irony Detection dataframe for training (truncated)


# Merging the X and Y dataframes (Testing)
df_irony_test = pd.concat([df_irony_test_x, df_irony_test_y], axis=1)  # Merging the dataframes
print('\n\033[1mIrony Detection Test Subset:\033[0m')
display(df_irony_test) # Output: SemEval-2018 Irony Detection dataframe for testing (truncated)


# pd.concat()     This function is used to concatenate two different dataframes.
# axis=1          This parameter indicates column-wise concatenation (Merging columns of two different dataframes)


Irony Detection Train Subset:


Unnamed: 0,Tweets,irony_output
0,seeing ppl walking w/ crutches makes me really...,1
1,"look for the girl with the broken smile, ask h...",0
2,Now I remember why I buy books online @user #s...,1
3,@user @user So is he banded from wearing the c...,1
4,Just found out there are Etch A Sketch apps. ...,1
...,...,...
2857,I don't have to respect your beliefs.||I only ...,0
2858,Women getting hit on by married managers at @u...,1
2859,@user no but i followed you and i saw you post...,0
2860,@user I dont know what it is but I'm in love y...,0



Irony Detection Test Subset:


Unnamed: 0,Tweets,irony_output
0,@user Can U Help?||More conservatives needed o...,0
1,"Just walked in to #Starbucks and asked for a ""...",1
2,#NOT GONNA WIN,0
3,@user He is exactly that sort of person. Weirdo!,0
4,So much #sarcasm at work mate 10/10 #boring 10...,1
...,...,...
779,"If you drag yesterday into today, your tomorro...",0
780,Congrats to my fav @user & her team & my birth...,0
781,@user Jessica sheds tears at her fan signing e...,0
782,#Irony: al jazeera is pro Anti - #GamerGate be...,1


In [17]:
# Exploring the first 5 rows of the Irony Detection Train subset

# The for-loop iterates over the indices (i) of the first 5 rows of the dataframe containing all the tweets (train subset)
for i in range(5):
    print(i, '\n', df_irony_train['Tweets'][i],
          df_irony_train['irony_output'][i], '\n')
# Output: First 5 tweets in the Irony Detection train subset, each followed by its label

0 
 seeing ppl walking w/ crutches makes me really excited for the next 3 weeks of my life   1 

1 
 look for the girl with the broken smile, ask her if she wants to stay while, and she will be loved. 💕🎵  0 

2 
 Now I remember why I buy books online @user #servicewithasmile   1 

3 
 @user @user So is he banded from wearing the clothes?  #Karma  1 

4 
 Just found out there are Etch A Sketch apps.  #oldschool #notoldschool  1 



### Step 3: Preprocessing Irony Detection Train and Test subset

#### Cleaning Data (Part 1)
The first preprocessing phase consists of the following actions:

|Action|Examples of the strings that will be removed or modified|
|:--|:-------------------------------|
|Lowercase the column "Tweets" | Can --> can, Texans --> texans, MLB --> mlb, Carly --> carly|
|Remove Stopwords |'a', 'about', 'above', 'after', 'again' .... "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'|

1. The NLTK module provides 179 English stopwords (`stopwords.words('english')`).
2. A further 326 stopwords were added to that list.
3. In total, this project uses 505 stopwords.

In [19]:
# Lowercasing the column "Tweets" (Irony Detection Train and Test subset) 

df_irony_train_c1 = df_irony_train.copy()                        # Create a copy of the Irony Detection Train subset
df_irony_test_c1 = df_irony_test.copy()                          # Create a copy of the Irony Detection Test subset

df_irony_train["Tweets"] = df_irony_train["Tweets"].str.lower()  # Lowercase the whole content of the column "Tweets" (Train)
df_irony_test["Tweets"] = df_irony_test["Tweets"].str.lower()    # Lowercase the whole content of the column "Tweets" (Test)

# Defining the Stopwords

stopwords = ["a's", "a", "about", "above", "according", "accordingly", "across", "actually", "after", "afterwards", "again", "against",
 "ain't", "ain", "all", "allow", "allows", "almost", "along", "already", "also", "although", "am", "among", "amongst", "an", "and", "another", "any", "anybody", "anyhow", "anyone", "anything", "anyway", "anyways", "anywhere", 
 "apart", "appear", "appropriate", "are", "aren't", "aren", "around", "as", "aside", "ask", "asking", "associated", "at", 
 "available", "be", "because", "been", "before", "beforehand", "behind", "being", "believe", "below", "beside", "besides", 
 "between", "beyond", "both", "brief", "but", "by", "c'mon", "c's", "came", "can", "can't", "cannot", "cant", "cause", 
 "causes", "certain", "certainly", "clearly", "co", "com", "come", "comes", "concerning", "consequently", "consider", 
 "considering", "contain", "containing", "contains", "corresponding", "could", "couldn't", "course", "currently", 
 "definitely", "described", "despite", "did", "didn't", "different", "do", "does", "doesn't", "doing", "don't", "done", 
 "down", "downwards", "during", "each", "edu", "eg", "either", "eight", "else", "elsewhere", "enough", "entirely", 
 "especially", "et", "etc", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "exactly", 
 "example", "far", "few", "fifth", "first", "five", "followed", "following", "follows", "for", "former", "formerly", 
 "forth", "four", "from", "further", "furthermore", "get", "gets", "getting", "given", "go", "goes", "going", "gone", 
 "got", "gotten", "had", "hadn't", "happens", "hardly", "has", "hasn't", "have", "haven't", "having", "he", "he's", 
 "hence", "her", "here", "here's", "hereafter","hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", 
 "hither", "hopefully", "how", "howbeit", "however", "i'd", "i'll", "i'm", "i've", "ie", "if", "immediate", "in", 
 "inasmuch", "inc", "indeed", "indicate", "indicated", "indicates", "inner", "insofar", "instead", "into", "inward", "is",
 "isn't", "it", "it'd", "it'll", "it's", "its", "itself", "just", "keep", "keeps", "kept", "know", "known", "knows", 
 "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "let", "let's", "likely", "little", "look", 
 "looking", "looks", "ltd", "mainly", "many", "may", "maybe", "me", "meanwhile", "merely", "might", "more" , "moreover", 
 "most", "mostly", "much", "must", "my", "myself", "name", "namely", "nd", "near", "nearly", "necessary", "need", "needs",
 "neither", "never", "nevertheless","next", "nine", "no", "nobody", "non", "none", "noone", "nor", "normally", "not",
 "nothing", "novel", "now", "nowhere", "obviously", "of", "off", "often", "oh", "ok", "okay", "on", "once", "one", 
 "ones", "only", "onto", "or", "other", "others", "otherwise", "ought", "our", "ours", "ourselves", "out", "outside", 
 "over", "overall", "own", "particular", "particularly", "per", "perhaps", "placed", "plus", "possible", "presumably", 
 "probably", "provides", "que", "quite", "qv", "rather", "rd", "re", "really", "reasonably", "regarding", "regardless", 
 "regards", "relatively", "respectively", "right", "said", "same", "saw", "say", "saying", "says", "second", "secondly", 
 "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sent", "seriously", "seven", "several", 
 "shall", "she", "should", "shouldn't", "since", "six", "so", "some", "somebody", "somehow", "someone", "something", 
 "sometime", "sometimes", "somewhat", "somewhere", "soon", "specified", "specify", "specifying", "still", "sub", "such", 
 "sup", "t's", "take", "taken", "tell", "tends", "th", "than","that", "that's" , "thats", "the", "their", "theirs", "them", 
 "themselves", "then", "thence", "there", "there's", "thereafter", "thereby", "therefore", "therein", "theres", "thereupon",
 "these", "they", "they'd", "they'll", "they're", "they've", "think", "third", "this", "thorough", "thoroughly", "those", 
 "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "took", "toward", "towards", "tried", 
 "tries", "truly", "try", "trying", "twice", "two", "un", "under" , "unless", "unlikely", "until", "unto", "up", "upon", 
 "use", "used", "useful", "uses", "using", "usually", "value", "various", "very", "via", "viz", "vs", "want", "wants", 
 "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't" , "what", "what's", "whatever", "when", 
 "whence", "whenever", "where", "where's", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", 
 "whether", "which", "while", "whither", "who", "who's", "whoever", "whole", "whom", "whose", "why", "will", "willing", 
 "with", "within", "without", "won't", "wonder", "would", "wouldn't", "yes", "yet", "you", "you'd", "you'll", "you're", 
 "you've", "your", "yours", "yourself", "yourselves", "zero", "'s'"]


# .copy()          This function is used to make a copy of one dataframe with indices and data
# df.str.lower()   This function is used to transform the content of one column or dataframe to lowercase

In [21]:
# Removing Stopwords from the Irony Detection Train subset

for i in stopwords:         # For-loop iterates over all the words found in the list "stopwords"
    df_irony_train['Tweets'] = df_irony_train['Tweets'].replace(to_replace=r"\b%s\b"%(i), value='', regex=True)
# Action: Replace the stopwords by an empty character ('') (train)


# Removing Stopwords from the Irony Detection Test subset

for i in stopwords:         # For-loop iterates over all the words found in the list "stopwords"
    df_irony_test['Tweets'] = df_irony_test['Tweets'].replace(to_replace=r"\b%s\b"%(i), value='', regex=True)
# Action: Replace the stopwords by an empty character ('') (test)


# df.replace()   This function is used to replace occurrences of a particular sub-string with another sub-string.
#                In this case, every match of the RegEx built from %s is replaced by an empty string ('')
# to_replace     This parameter indicates the sub-string (or pattern) to replace
# value          This parameter indicates the sub-string to replace with
# regex=True     This parameter indicates that the sub-string to replace is a Regular Expression

# RegEx Explanation
# \bstring\b    This RegEx matches the declared string only when it appears as a whole word (\b is a word boundary).
#               In this case, %s is substituted by each of the stopwords declared.
#               For example, it will match 'about', 'where' or 'has', but not the 'has' inside 'hash'
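The word-boundary pattern can also be compiled once for the whole list instead of once per stopword. This is a minimal sketch of that variant, not the notebook's method: `sample_stopwords` is a hypothetical short stand-in for the full list, and `remove_stopwords` is an illustrative helper name.

```python
import re

# Hypothetical short stand-in for the full stopword list
sample_stopwords = ["for", "the", "of", "my", "me", "really", "seeing", "next"]

# Escape each entry and try longer entries first, then join them into one alternation
alternatives = sorted(map(re.escape, sample_stopwords), key=len, reverse=True)
pattern = re.compile(r"\b(?:%s)\b" % "|".join(alternatives))

def remove_stopwords(text):
    """Replace every whole-word stopword match with an empty string."""
    return pattern.sub("", text)

tweet = "seeing ppl walking w/ crutches makes me really excited for the next 3 weeks of my life"
# Collapse the leftover runs of spaces only for display
print(" ".join(remove_stopwords(tweet).split()))
# ppl walking w/ crutches makes excited 3 weeks life
```

A single compiled pattern scans each tweet once, whereas the loop scans the whole column once per stopword; the result may differ in edge cases where the order of replacements matters.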

In [22]:
# Comparing the original Tweet vs the Tweet after first preprocessing

pd.set_option("display.max_rows", None, "display.max_columns", None, 'display.max_colwidth', None)
# This option is used to print the entire Pandas dataframe (all rows, all columns & all content)

# Comparing the Original Tweet VS the Tweet after first preprocessing (train subset)
print('\n\033[1mIrony Detection Train Subset:\033[0m')
display(df_irony_train_c1[['Tweets']].iloc[0:3]) # Output: Original Tweet 
display(df_irony_train[['Tweets']].iloc[0:3]) # Output: Tweet after preprocessing 


# Comparing the Original Tweet VS the Tweet after first preprocessing (Test subset)
print('\n\033[1mIrony Detection Test Subset:\033[0m')
display(df_irony_test_c1[['Tweets']].iloc[0:3]) # Output: Original Tweet
display(df_irony_test[['Tweets']].iloc[0:3]) # Output: Tweet after preprocessing


Irony Detection Train Subset:


Unnamed: 0,Tweets
0,seeing ppl walking w/ crutches makes me really excited for the next 3 weeks of my life
1,"look for the girl with the broken smile, ask her if she wants to stay while, and she will be loved. 💕🎵"
2,now i remember why i buy books online @user #servicewithasmile


Unnamed: 0,Tweets
0,ppl walking w/ crutches makes excited 3 weeks life
1,"girl broken smile, stay , loved. 💕🎵"
2,i remember i buy books online @user #servicewithasmile



Irony Detection Test Subset:


Unnamed: 0,Tweets
0,@user can u help?||more conservatives needed on #tsu + get paid 4 posting stuff like this!||you $ can go to
1,"just walked in to #starbucks and asked for a ""tall blonde"" hahahaha #irony"
2,#not gonna win


Unnamed: 0,Tweets
0,@user u help?|| conservatives needed #tsu + paid 4 posting stuff like !|| $
1,"walked #starbucks asked ""tall blonde"" hahahaha #irony"
2,# gonna win


As shown above, the first two preprocessing techniques have been applied successfully.

#### Cleaning Data (Part 2)
The second preprocessing phase consists of the following actions:

|Action|Examples of the strings that will be removed or modified|
|:--|:-------------------------------|
|Remove User Objects |@user, @user_1, @user-1, @paulina_100, @WiNer206|
|Remove Hashtags | #friends #bff #celebrate #sandiego #sundayfunday #ObsessedWithMyDog|
|Remove Non-ASCII Characters |랙바, 에이오, ᴬᴺᴼᵀᴴᴱᴿ	ᴰᴿᴵᴺᴷ	ᴴᴬᴾᴾᵞ, บมาแล, добройночи|
|Remove new line characters|\n|
|Remove punctuation marks |!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~|
|Remove Two or more spaces|&ensp;, &ensp;&ensp;, &ensp;&ensp;&ensp;, &ensp;&ensp;&ensp;&ensp;,|
|Remove 1 or more underscores|"____", "_____", "_____________"|
|Remove numerical characters|0007, 0, 12389, 50000|
|Stemming|walking --> walk, excited --> excit, kids --> kid|

1. The **Punkt/english.pickle** Sentence Tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start a sentence. <br>
2. The Punkt Sentence Tokenizer is based on the publication by [Kiss, T. & Strunk, J., 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics, 32(4), pp. 485-525](https://direct.mit.edu/coli/article/32/4/485/1923/Unsupervised-Multilingual-Sentence-Boundary)
3. SnowballStemmer is a stemming algorithm used to remove morphological affixes from words, leaving only the word stem. 
4. SnowballStemmer is part of the NLTK libraries

In [24]:
# Creating a function to clean data

def split_sentences(text):
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') # Get the Punkt Sentence Tokenizer.
    list_df = []                             # Create a new empty list to store all sentences after preprocessing
    mark_sentence = '***'.join(tokenizer.tokenize(text))  
    # Add a sentence delimiter (***) to mark the boundary between sentences
    
    #mark_sentence = re.sub(r"([a-z])\1+", r'\1', mark_sentence)
    # (Disabled) RegEx that collapses repeated letters into a single letter
    
    mark_sentence = re.sub(r"([@#][\w_-]+)", '', mark_sentence) 
    # RegEx that removes User Objects (@...) and Hashtags (#...). (Replace them by nothing (''))
    
    mark_sentence = re.sub(r"([^\x00-\x7F]+)", '', mark_sentence) 
    # RegEx that removes Non-ASCII Characters. (Replace them by nothing (''))
    
    mark_sentence = re.sub(r"(_+)", '', mark_sentence) 
    # RegEx that removes 1 or more underscores. (Replace them by nothing (''))
    
    mark_sentence = re.sub(r"([0-9])", '', mark_sentence) 
    # RegEx that removes numbers. (Replace them by nothing (''))
    
    mark_sentence = re.sub(r"(\n+)", '', mark_sentence)
    # RegEx that removes new line characters (Replace them by nothing (''))
    
    mark_sentence = re.sub(r"([^\w\s])", ' ', mark_sentence) 
    # RegEx that removes punctuation marks (Replace them by space (' '))
    # Note: this step also turns the '***' delimiter into spaces
    
    mark_sentence = re.sub(r"([\s]{2,})", ' ', mark_sentence)
    # RegEx that collapses two or more whitespace characters into one space (' ')
    
    list_df  = mark_sentence.split('***')   # Split on the delimiter; yields a one-element list when no '***' is left
    return list_df

print('Function split_sentences has been succesfully created')


# RegEx Explanation:
# \s    This RegEx matches any whitespace character. In other words, it will find strings such as " ", "\r" or "\n"
# \w    This RegEx matches any alphanumeric character or underscore, such as the characters in "a", "julio", "100" or "julio100"
# ()    The parentheses define a group of characters formed by the combination of 1, 2 or more characters.
#       For example, ([@#][\w_-]+) represents 1 group with the structure "@ or # + one or more alphanumeric characters, _ or -"
# []    The brackets [] define a set of characters.
#       For example, [\w\s] matches any alphanumeric character or any whitespace character.
# ^     The caret symbol (^) at the start of a character set negates it: [^\w\s] matches any character NOT in [\w\s].
# \n    This character represents the new line character
# +     The plus sign (+) declares that the preceding element (e.g. "\n") must appear at least once. (1)
# {2,}  This quantifier declares that the preceding element (e.g. "\s") must appear at least two times. (2)
# [0-9] This RegEx matches any single numerical character (0-9); applied repeatedly, it removes numbers such as "007" or "0"
# _+    This RegEx matches one or more consecutive underscores (_)

# \x00       This means 0 in hexadecimal notation
# \x7F       This means 127 in hexadecimal notation.
# \x00-\x7F  This means a range from 0 to 127, which represents the range of the ASCII characters.

Function split_sentences has been succesfully created
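The chain of substitutions can be tried on a single string without the sentence tokenizer. This is a minimal sketch that mirrors the order of the substitutions in `split_sentences`, with the underscore pattern written as `_+`; the helper `clean_text` and the sample tweet are hypothetical and exist only for illustration.

```python
import re

def clean_text(text):
    """Apply the cleaning substitutions in the same order as split_sentences."""
    text = re.sub(r"[@#][\w-]+", "", text)      # drop @mentions and #hashtags
    text = re.sub(r"[^\x00-\x7F]+", "", text)   # drop non-ASCII characters
    text = re.sub(r"_+", "", text)              # drop runs of underscores
    text = re.sub(r"[0-9]", "", text)           # drop digits
    text = re.sub(r"\n+", "", text)             # drop new line characters
    text = re.sub(r"[^\w\s]", " ", text)        # turn punctuation into spaces
    text = re.sub(r"\s{2,}", " ", text)         # collapse runs of whitespace
    return text.strip()

# Hypothetical sample tweet
sample = "@user so much #sarcasm at work, mate 10/10!! 💕 ___"
print(clean_text(sample))  # so much at work mate
```

The underscores are removed in their own step because `\w` also matches `_`, so the punctuation pattern `[^\w\s]` would leave them in place.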


In [25]:
# Cleaning the data (Part 2)(Train subset)

# Run the function "split_sentences" on the Train subset 
df_irony_train['clean_tweet'] = df_irony_train['Tweets'].apply(split_sentences)
# Action: Create a column 'clean_tweet' that stores the Tweets after preprocessing (training)

df_irony_train['clean_tweet'] = df_irony_train['clean_tweet'].apply(lambda x: ','.join(map(str, x)))
# Action: Transform the list obtained after preprocessing into single strings. (training)

df_irony_train['clean_tweet'] = df_irony_train['clean_tweet'].str.strip()
# Action: Remove extra spaces at the beginning and the end of any cell. (training)


# Cleaning the data (Part 2)(test subset)

# Run the function "split_sentences" on the Test subset
df_irony_test['clean_tweet'] = df_irony_test['Tweets'].apply(split_sentences)
# Action: Create a column 'clean_tweet' that stores the Tweets after preprocessing (test)

df_irony_test['clean_tweet'] = df_irony_test['clean_tweet'].apply(lambda x: ','.join(map(str, x)))
# Action: Transform the list obtained after preprocessing into single strings. (test)

df_irony_test['clean_tweet'] = df_irony_test['clean_tweet'].str.strip()
# Action: Remove extra spaces at the beginning and the end of any cell. (test)

print ('Done')

# df.apply()    This command is used to pass a function and apply it on every single value of the column or dataframe.
# lambda        This keyword defines a small anonymous function, useful when the function is needed only once in the program.
# ','.join()    This function takes all items in an iterable and joins them into one string.
#               In this case, it joins the items of each list with ",". For example: ['irwin','arnstein'] becomes 'irwin,arnstein'
# map()         This built-in function applies a function to every item of an iterable.
#               In this case, map(str, x) converts every item of the list to a string before joining
# str.strip()   This function is used to remove spaces at the beginning and the end of the cells.

Done


In [27]:
# Stemming the data

stemmer = SnowballStemmer('english') # Create a Stemmer object for 'English' language

# Stemming the data (Train subset)
df_irony_train['clean_tweet1'] = df_irony_train['clean_tweet'].apply(lambda x: [stemmer.stem(word) for word in x.split()])
# Action: Stem every word found in every row (train)

df_irony_train['clean_tweet1'] = df_irony_train['clean_tweet1'].apply(lambda x: ' '.join(map(str, x)))
# Action: Transform the list obtained after preprocessing into single strings (train)

df_irony_train = df_irony_train.drop(columns=['clean_tweet'])
# Drop the column "clean_tweet" (unstemmed column)(train)



# Stemming the data (Test subset)
df_irony_test['clean_tweet1'] = df_irony_test['clean_tweet'].apply(lambda x: [stemmer.stem(word) for word in x.split()]) 
# Action: Stem every word found in every row (test)

df_irony_test['clean_tweet1'] = df_irony_test['clean_tweet1'].apply(lambda x: ' '.join(map(str, x)))
# Action: Transform the list obtained after preprocessing into single strings (test)

df_irony_test = df_irony_test.drop(columns=['clean_tweet'])
# Drop the column "clean_tweet" (unstemmed column)(test)

print ('Done')

# df.apply()    This command is used to pass a function and apply it on every single value of the column or dataframe.
# lambda        This keyword defines a small anonymous function, useful when the function is needed only once in the program.
# x.split()     This function is used to split a string into a list using a separator.
#               In this case, no separator is given, so each row is divided into words on any run of whitespace
# ' '.join()    This function takes all items in an iterable and joins them into one string.
#               In this case, it joins the stemmed words with a space (' '). For example: ['irwin','arnstein'] becomes 'irwin arnstein'
# map()         This built-in function applies a function to every item of an iterable.
#               In this case, map(str, x) converts every item of the list to a string before joining
# drop(columns=[...])  This function is used to delete the listed columns
# stem()        This method stems a single word with the stemmer object, which in this case is the NLTK Snowball Stemmer

KeyError: 'clean_tweet'

In [20]:
# Comparing the original Tweet vs the Tweet after second preprocessing

pd.set_option("display.max_rows", None, "display.max_columns", None, 'display.max_colwidth', None)
# This option is used to print the entire Pandas dataframe (all rows, all columns & all content)

# Train subset
print('\n\033[1mEmotion Prediction Train Subset:\033[0m')
display(df_emotion_train[0:5]) # Output: First 5 tweets (train)

# Test subset
print('\n\033[1mEmotion Prediction Test Subset:\033[0m')
display(df_emotion_test[0:5]) # Output: First 5 tweets (test)

# Column 'Tweets' contains the Original tweets
# Column 'clean_tweet1' contains the Tweets after all preprocessing


Emotion Prediction Train Subset:


Unnamed: 0,Tweets,Sent_label,clean_tweet1
0,“worry payment problem '. joyce meyer. #motivation #leadership #worry,2,worri payment problem joyc meyer
1,roommate: 's 't spell autocorrect. #terrible #firstworldprobs,0,roommat s t spell autocorrect
2,'s cute. atsu shy photos cherry helped uwu,1,s cute atsu shi photo cherri help uwu
3,"rooneys fucking untouchable ? fucking dreadful , depay looked decent(ish)tonight",0,rooney fuck untouch fuck dread depay look decent ish tonight
4,'s pretty depressing u hit pan ur favourite highlighter,3,s pretti depress u hit pan ur favourit highlight



Emotion Prediction Test Subset:


Unnamed: 0,Tweets,Sent_label,clean_tweet1
0,#deppression real. partners w/ #depressed people dont understand depth affect us. add #anxiety &amp;makes worse,3,real partner w peopl dont understand depth affect us add amp make wors
1,"@user interesting choice words... confirming governments fund #terrorism? bit open door, ...",0,interest choic word confirm govern fund bit open door
2,visit hospital care triggered #trauma accident 20+yrs ago image dead brother . feeling symptoms #depression,3,visit hospit care trigger accid yrs ago imag dead brother feel symptom
3,@user welcome #mpsvt! delighted ! #grateful #mpsvt #relationships,1,welcom delight
4,makes feel #joyful?,1,make feel


In [24]:
# Creating the final Emotion Prediction subset after preprocessing

pd.reset_option('^display.', silent=True) # Reset the default Pandas Display (Truncated)

# Final emotion prediction Train subset
print('\n\033[1mEmotion Prediction Train Subset:\033[0m')
df_emotion_train_final = df_emotion_train[['clean_tweet1','Sent_label']] 
# Create the final Emotion Prediction dataframe (train)

display(df_emotion_train_final) # Output: Emotion prediction train subset (cleaning)

# Final emotion prediction Test subset
print('\n\033[1mEmotion Prediction test Subset:\033[0m')
df_emotion_test_final = df_emotion_test[['clean_tweet1','Sent_label']]
# Create the final Emotion Prediction dataframe (test)

display(df_emotion_test_final)  # Output: Emoji prediction test subset (cleaning)



[1mEmotion Prediction Train Subset:[0m


Unnamed: 0,clean_tweet1,Sent_label
0,worri payment problem joyc meyer,2
1,roommat s t spell autocorrect,0
2,s cute atsu shi photo cherri help uwu,1
3,rooney fuck untouch fuck dread depay look dece...,0
4,s pretti depress u hit pan ur favourit highlight,3
...,...,...
3252,i discourag i fuck year contact ladi gaga thou...,3
3253,content host nation camden empti,3
3254,fellow grad i shiver shallow argument,0
3255,,0



[1mEmotion Prediction test Subset:[0m


Unnamed: 0,clean_tweet1,Sent_label
0,real partner w peopl dont understand depth aff...,3
1,interest choic word confirm govern fund bit op...,0
2,visit hospit care trigger accid yrs ago imag d...,3
3,welcom delight,1
4,make feel,1
...,...,...
1416,i sparkl bodysuit occas case s emerg sparkl suit,1
1417,finish read simpli mind blog writer continu i ...,3
1418,shaft abras panti shift side n n,0
1419,fake outrag y stop,0


In [26]:
# Transforming the Emotion Prediction subsets to Python lists.

# Train subset
list_emotion_train_final = df_emotion_train_final['clean_tweet1'].tolist()
# Action: Transform the Emotion Prediction Train dataframe into a Python list

# Test subset
list_emotion_test_final = df_emotion_test_final['clean_tweet1'].tolist()
# Action: Transform the Emotion Prediction Test dataframe into a Python list

display(list_emotion_train_final) # Output: Emotion Prediction Train list
#Uncomment the following line if you want to see the Emotion Prediction Test list
#display(list_emotion_test_final) 

# Series.tolist() converts the 'clean_tweet1' column into a plain Python list of strings

['worri payment problem joyc meyer',
 'roommat s t spell autocorrect',
 's cute atsu shi photo cherri help uwu',
 'rooney fuck untouch fuck dread depay look decent ish tonight',
 's pretti depress u hit pan ur favourit highlight',
 'pussi weak i heard stfu bitch threaten pregnant',
 'make year transit excit hope colleg return sick exhaust pessimist',
 'tiller breezi collab album rap sing prolli fire',
 'broadband shock regret sign',
 'teef',
 'usa embarrass watch time guy won game',
 'amp indian activist hold demonstr headquart demand pak stop export india',
 'glee fill normi dri hump recent high profil celebr break pathet amp wrong world today',
 'fuck muppet',
 'autocorrect chang em i resent great',
 'i strateg vote i agre lot clinton vote base fear negat',
 'hater low worth righteous delus cower thought chang chang inevit',
 'i save order risk life i panic stay calm rescu',
 'uggh s horribl bad person stretch imagin i hope person realiz',
 'tamra f swung tamra nkelli piec',
 'love h

### Step 4: Performing Word Embeddings

Word embedding is the process of converting text data into numerical vectors, with the aim of capturing the meaning of the words, including their emotional content. <br>

In this project, the simpler **Bag-of-Words** (BoW) representation is applied, which relies on word counts only:
1. BoW is a representation of text that describes the occurrence of words within a document collection.
2. BoW uses word occurrence frequencies to measure the content of a tweet, i.e. how often each word appears in it.
3. BoW is a method for extracting features (X) from tweets. These features can then be used to train any Machine Learning algorithm.
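
As a minimal sketch of the three points above, the following toy example builds a Bag-of-Words matrix with `CountVectorizer`. The three short documents are made up for illustration (they only imitate the stemmed style of the cleaned tweets), and the `toy_*` names are not part of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the dataset
toy_docs = ["worri payment problem", "dread dread look", "payment look"]

toy_vect = CountVectorizer()
toy_bow = toy_vect.fit_transform(toy_docs)  # sparse document-term matrix

# Vocabulary terms listed in column order
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
# One row per document, one column per term, each cell is a word count
toy_counts = toy_bow.toarray()

print(toy_vocab)
print(toy_counts)
```

Each row of `toy_counts` is the feature vector of one document; word order is not preserved, only how often each term occurs.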

In [32]:
# Vectorizing the Emotion Prediction Train subset

#Uncomment the following line if you want to print all columns and all content of a given dataframe. 
# (Truncated rows, all columns & all content)
#pd.set_option("display.max_columns", None, 'display.max_colwidth', None)

# Keep at most 373 terms, each appearing in at least 5 training tweets
vect_train = CountVectorizer(dtype=np.uint8, max_features=373, min_df=5)
Z_train = vect_train.fit_transform(list_emotion_train_final)
# Dense copy of the sparse matrix; get_feature_names_out() assumes scikit-learn >= 1.0
df_vect_train = pd.DataFrame(Z_train.toarray(), columns=vect_train.get_feature_names_out())
display(df_vect_train)

print("--------------------------------------------------")
display(list_emotion_train_final[0])

Unnamed: 0,absolut,accept,act,afraid,age,alarm,alway,amaz,american,amp,anger,angri,anim,annoy,anxieti,around,ask,ass,attack,aw,away,awe,back,bad,bc,best,better,big,birthday,bit,bitch,bitter,black,blood,blue,bodi,boil,book,bore,boy,break,breezi,bring,broadcast,bulli,burn,burst,busi,buy,call,car,care,chanc,chang,cheer,citi,class,close,concern,contact,countri,cri,custom,damn,dark,day,deal,depress,despair,destroy,die,discourag,dont,dread,dream,drop,dude,dull,eat,email,end,episod,even,exhilar,expect,experi,eye,face,fact,fall,famili,fan,fear,feel,fieri,fight,final,find,fire,flight,follow,found,free,friend,front,frown,fuck,full,fume,fun,funni,furi,furious,futur,game,girl,give,glee,gloomi,god,gonna,good,gotta,great,grim,grudg,gt,guess,gun,guy,hair,half,hand,happen,happi,hard,hate,haunt,head,hear,heard,heart,hell,help,hey,high,hilar,hilari,hit,hold,holiday,home,honest,hope,horribl,horrif,horror,hour,hous,idea,ignor,im,imagin,insult,irrit,issu,job,joke,joy,kid,kill,kind,kinda,king,late,laugh,leav,left,level,lie,life,like,lip,listen,liter,live,ll,lmao,lol,long,lose,lost,lot,love,ly,mad,made,make,man,match,mean,media,meet,men,mind,minut,miss,moment,money,month,morn,mourn,move,movi,music,nervous,new,news,night,nightmar,offend,offens,offic,old,omg,open,order,outrag,pakistan,panic,part,past,pay,peopl,person,phone,pine,place,play,player,pleas,pm,point,polic,polit,post,pout,pretti,problem,protest,provok,put,question,racist,rage,read,real,realiz,reason,rejoic,rememb,resent,rest,reveng,room,rule,run,sad,safe,scare,school,season,send,serious,servic,shake,shi,shit,shock,shoot,shot,show,sick,side,sing,sit,sleep,smile,snap,sober,sorri,sort,sound,speak,st,start,state,stay,sting,stop,store,stori,straight,stuff,stupid,sunk,support,suppos,sure,take,talk,team,tear,tell,terribl,terror,terrorist,test,thank,thing,think,thought,threaten,ticket,time,tire,today,told,tomorrow,tonight,top,true,trump,truth,turn,tv,tweet,twitter,understand,unhappi,ur,us,ve,video,voic,vote,wait,walk,wanna,want,war,watch,way,weak,weari,week,well,white,win,wish,woman,word,work,world,worri,wors,worst,wow,wrath,wrong,yeah,year,yo
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3252,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3253,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3254,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3255,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


--------------------------------------------------


'worri payment problem joyc meyer'

In [31]:
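# Vectorizing the Test subset

pd.set_option("display.max_columns", None, 'display.max_colwidth', None)
# NOTE: this fits a second, independent vocabulary on the test tweets, so the columns of
# df_vect_test do not correspond to the columns of df_vect_train.
# Calling vect_train.transform(list_emotion_test_final) instead would keep them aligned.
vect_test = CountVectorizer(dtype=np.uint8, max_features=900, min_df=5)
Z_test = vect_test.fit_transform(list_emotion_test_final)
# Dense copy of the sparse matrix; get_feature_names_out() assumes scikit-learn >= 1.0
df_vect_test = pd.DataFrame(Z_test.toarray(), columns=vect_test.get_feature_names_out())
display(df_vect_test)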

print("--------------------------------------------------")
display(list_emotion_test_final[0])

Unnamed: 0,abl,action,add,afraid,ago,alarm,alon,alway,amaz,american,amount,amp,anger,angri,annoy,area,around,ask,ass,attack,aw,away,awesom,babi,back,bad,basic,bc,beauti,becom,bed,best,better,big,birthday,bit,bitch,bitter,black,bless,blue,bodi,book,bore,boy,break,bright,bring,brother,bulli,bunch,burn,burst,buy,call,car,care,chang,charact,chat,cheer,choic,choos,class,close,colleg,come,concern,continu,cool,crap,cream,cri,custom,damn,dark,day,dead,death,decid,depress,deserv,die,disappoint,discourag,dog,dont,dread,dream,dress,drink,earli,earth,eat,eclips,end,enjoy,episod,excit,expect,experi,express,eye,face,fact,fake,famili,fan,fear,feed,feel,fill,final,find,folk,follow,food,found,free,friend,frown,fuck,full,fume,fun,funni,furi,furious,game,girl,give,glad,gloomi,god,good,govern,great,grudg,guy,ha,hair,hand,happen,happi,hard,hate,hatr,head,hear,heart,hell,help,hey,hilari,hold,home,hope,horribl,horrid,horrif,hour,hurt,idk,im,insult,interest,intimid,irrit,issu,jame,job,joke,jr,kid,kill,kind,knew,laugh,laughter,lead,learn,leav,left,leg,life,light,like,listen,liter,live,ll,local,lol,longer,look,lost,lot,love,low,lt,mad,made,make,man,matter,mean,meet,mind,miss,mom,moment,money,month,mood,morn,mourn,mouth,move,movi,music,muslim,nervous,new,news,ni,nice,night,nightmar,offend,offens,old,omg,onlin,opinion,order,outrag,panic,park,part,pay,peopl,person,phone,pic,piss,place,play,pleas,point,poor,post,power,ppl,presid,pretti,price,problem,provoc,put,question,rabid,rage,read,readi,real,reason,red,rememb,remind,respons,rest,return,reveng,run,sad,safe,sale,scare,scream,serious,servic,shake,shit,shock,show,sign,sink,sir,sit,sleep,smile,smoke,snap,sober,sorri,soul,sound,speak,spent,stand,star,start,state,stay,stop,stupid,support,sure,take,talk,tantrum,tea,team,tear,terribl,terrifi,terror,terrorist,thank,thing,think,tho,thought,throw,ticket,time,tire,today,told,tomorrow,tonight,train,true,trump,turn,tv,tweet,twitter,type,ugh,understand,upset,ur,us,ve,video,wait,walk,wanna,want,watch,water,way,wear,weather,week,well,went,woke,woman,wonder,word,work,world,worri,worst,wow,wrath,write,wrong,wtf,yeah,year
0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1416,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1417,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1418,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1419,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


--------------------------------------------------


'real partner w peopl dont understand depth affect us add amp make wors'
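
A `CountVectorizer` assigns column indices from the vocabulary it was fitted on, so two independently fitted vectorizers generally do not agree on which word a given column stands for. A common way to keep train and test features aligned is to fit on the training text only and call `transform` on the test text. The sketch below illustrates this on a made-up toy corpus; it is not a re-run of the cells above, the `toy_*`, `shared_*` and `separate_*` names are hypothetical, and the results reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not taken from the dataset
toy_train = ["worri payment problem", "dread look", "payment look"]
toy_test = ["look problem unseen"]

# Fit the vocabulary on the training text only, then reuse it for the test text
shared_vect = CountVectorizer()
toy_x_train = shared_vect.fit_transform(toy_train).toarray()
toy_x_test = shared_vect.transform(toy_test).toarray()  # words unseen in training are dropped

# For comparison: a vectorizer fitted on the test text learns a different vocabulary
separate_vect = CountVectorizer()
separate_vect.fit(toy_test)

shared_terms = sorted(shared_vect.vocabulary_, key=shared_vect.vocabulary_.get)
separate_terms = sorted(separate_vect.vocabulary_, key=separate_vect.vocabulary_.get)

print(shared_terms, toy_x_test)
print(separate_terms)
```

With the shared vectorizer, column *j* refers to the same term in both matrices, which is what a classifier trained on the training matrix relies on at prediction time.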

In [34]:
# Transferring the training data into 2 NumPy arrays

x_train = df_vect_train.loc[:,:].to_numpy() 
# This array contains all the feature variable values.

y_train = df_emotion_train_final.loc[:, df_emotion_train_final.columns == 'Sent_label'].to_numpy()
# This array contains ONLY the target variable values (as a column vector of shape (n, 1)).

print('\n'+'\033[1m'+'Array of feature variables (Train):'+'\033[0m', x_train.shape) 
print(x_train) # Output: The Bag-of-Words feature matrix (everything except the target variable "Sent_label")

print('\n'+'\033[1m'+'Array of target variable (Train):'+'\033[0m', y_train.shape)
print(y_train) # Output: Only the values of the "Sent_label" column of the Emotion Prediction train subset


Array of feature variables (Train): (3257, 373)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Array of target variable (Train): (3257, 1)
[[2]
 [0]
 [1]
 ...
 [0]
 [0]
 [0]]


In [35]:
# Transferring the test data into 2 NumPy arrays

x_test = df_vect_test.loc[:,:].to_numpy() 
# This array contains all the feature variable values.

y_test = df_emotion_test_final.loc[:, df_emotion_test_final.columns == 'Sent_label'].to_numpy()
# This array contains ONLY the target variable values (as a column vector of shape (n, 1)).

print('\n'+'\033[1m'+'Array of feature variables (Test):'+'\033[0m', x_test.shape) 
print(x_test) # Output: The Bag-of-Words feature matrix (everything except the target variable "Sent_label")

print('\n'+'\033[1m'+'Array of target variable (Test):'+'\033[0m', y_test.shape)
print(y_test) # Output: Only the values of the "Sent_label" column of the Emotion Prediction test subset


Array of feature variables (Test): (1421, 373)
[[0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Array of target variable (Test): (1421, 1)
[[3]
 [0]
 [3]
 ...
 [0]
 [0]
 [1]]


### Step 5: Building the Machine Learning model

#### Naïve Bayes Multinomial Classifier

In [36]:
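# Use a Multinomial Naive Bayes model 
from sklearn.naive_bayes import MultinomialNB 

mnb = MultinomialNB() 

# Train the model (ravel() flattens the (n, 1) label column into the 1-D array that fit() expects)
mnb.fit(x_train, y_train.ravel())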

MultinomialNB()

In [37]:
# Take the model that was trained on the x_train data and apply it to the x_test data 
y_pred_cv_mnb = mnb.predict(x_test) 
y_pred_cv_mnb # The output is the array of predicted labels

array([0, 3, 1, ..., 0, 0, 0], dtype=int64)

In [38]:
from sklearn.metrics import classification_report, accuracy_score

print(accuracy_score(y_test, y_pred_cv_mnb))
print(classification_report(y_test, y_pred_cv_mnb))

0.33497536945812806
              precision    recall  f1-score   support

           0       0.40      0.63      0.49       558
           1       0.26      0.18      0.21       358
           2       0.05      0.03      0.04       123
           3       0.25      0.15      0.19       382

    accuracy                           0.33      1421
   macro avg       0.24      0.25      0.23      1421
weighted avg       0.29      0.33      0.30      1421



#### Logistic Regression Classifier

In [39]:
# Create the Logistic Regression object
logr = LogisticRegression(solver='lbfgs')  
# Train the model (ravel() flattens the (n, 1) label column into the 1-D array that fit() expects)
logr.fit(x_train, y_train.ravel())

LogisticRegression()

In [40]:
# Take the model that was trained on the x_train data and apply it to the x_test data 
y_pred_cv_logr = logr.predict(x_test) 
y_pred_cv_logr # The output is the array of predicted labels

array([0, 3, 1, ..., 0, 0, 0], dtype=int64)

In [41]:
print(accuracy_score(y_test, y_pred_cv_logr))
print(classification_report(y_test, y_pred_cv_logr))

0.3216045038705137
              precision    recall  f1-score   support

           0       0.39      0.62      0.48       558
           1       0.23      0.16      0.19       358
           2       0.07      0.03      0.04       123
           3       0.21      0.12      0.16       382

    accuracy                           0.32      1421
   macro avg       0.23      0.24      0.22      1421
weighted avg       0.28      0.32      0.28      1421
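
To put the two accuracy scores in context, it can help to compare them with a majority-class baseline, i.e. always predicting the most frequent label. The sketch below computes that baseline from the `support` column of the two reports above (558, 358, 123 and 382 test tweets for labels 0 to 3) and compares it with the two reported accuracies; the variable names are illustrative only.

```python
# Test-set class counts, copied from the "support" column of the reports above
support = {0: 558, 1: 358, 2: 123, 3: 382}

n_test = sum(support.values())
majority_label = max(support, key=support.get)
# Accuracy obtained by always predicting the most frequent label
baseline_acc = support[majority_label] / n_test

# Accuracies printed by the two cells above
reported = {"MultinomialNB": 0.33497536945812806,
            "LogisticRegression": 0.3216045038705137}

print(f"majority label = {majority_label}, baseline accuracy = {baseline_acc:.3f}")
for name, acc in reported.items():
    print(f"{name}: accuracy = {acc:.3f}, difference vs. baseline = {acc - baseline_acc:+.3f}")
```

Both reported accuracies fall below this baseline of roughly 0.393, which is worth keeping in mind when interpreting the classification reports.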



## References

<a id='section_id'></a>
&emsp;[1] C. Van Hee, E. Lefever, and V. Hoste, “SemEval-2018 Task 3: Irony detection in English tweets,” in Proceedings of the 12th International Workshop<br>
&emsp;&emsp;&ensp;on Semantic Evaluation, 2018, pp. 39–50.<br>