[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NtMbZ1fIkreQeoam1161u6UeVmtw7xhh)

# SemEval-2018 Task 3: Irony Detection

This notebook uses the irony detection dataset distributed as part of TweetEval [[1]](#section_id), which is available [here](https://github.com/cardiffnlp/tweeteval). <br>

As described in [[1]](#section_id), the tweets were retrieved with the
Twitter API from October 2015 to February 2017 and "geolocalized" in the United States. <br>

In [1]:
import pandas as pd  # Work with tabular data (CSV/TXT files)
import emoji  # Print emojis
import numpy as np  # Work with arrays
import re  # Work with regular expressions
import nltk  # Natural Language Toolkit
import nltk.data  # Load the Punkt tokenizer (tokenizers/punkt/english.pickle)
from nltk.stem.snowball import SnowballStemmer  # Snowball stemming algorithm
import warnings
warnings.filterwarnings('ignore')  # Silence Python warnings

# PENDING
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression  # Logistic regression classifier
from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer, f1_score, classification_report
# Accuracy score, confusion matrix, make_scorer, F1 score and classification report
from sklearn import metrics  # Metrics module (used for the recall calculation)
from sklearn.model_selection import cross_val_score, train_test_split, StratifiedKFold, GridSearchCV
# Cross-validation score, train/test split, stratified K-Folds cross-validator and GridSearchCV

### Step 1: Preparing the Irony Detection Mapping.
For this task, the **Irony Detection Mapping** proposed in TweetEval [[1]](#section_id) will be used. <br>

In [2]:
# Print the Irony Detection Mapping dataframe

# Read a TXT file from internet (Github) that does not have a header
df_irony_mapping = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                               'main/datasets/irony/mapping.txt', header=None, sep ='\t', 
                                names=['Label','Output'], index_col=False)
    
print('\n\033[1mIrony Detection Mapping Subset:\033[0m')
df_irony_mapping = df_irony_mapping[['Output', 'Label']]
display(df_irony_mapping) # Output: SemEval-2018 Irony Detection Mapping dataframe (truncated)



Irony Detection Mapping Subset:


Unnamed: 0,Output,Label
0,non_irony,0
1,irony,1


The dataframe above can be read as follows: <br>
<br>
• The label 0 corresponds to **Non Irony**, while the label 1 corresponds to **Irony**. <br>
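
The mapping can also be kept as a plain dictionary, which makes it easy to turn predicted label ids back into readable names. The snippet below is a minimal sketch: `mapping_text` is a hypothetical in-memory copy of the two tab-separated lines shown in the dataframe above, and `parse_mapping` is an illustrative helper, not part of TweetEval.

```python
# Hypothetical in-memory copy of the mapping (tab-separated: label id, label name)
mapping_text = "0\tnon_irony\n1\tirony\n"


def parse_mapping(text):
    """Return {label_id: label_name} built from tab-separated lines."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # Skip empty lines
        label, name = line.split("\t")
        mapping[int(label)] = name
    return mapping


label_names = parse_mapping(mapping_text)

# Decode a few example label ids into their names
decoded = [label_names[y] for y in [1, 0, 1]]
print(decoded)
```
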

### Step 2: Importing Irony Detection Train and Test subsets
For this task, the **Irony Detection train and test subsets** proposed in TweetEval [[1]](#section_id) will be used. <br>
The train and test subsets contain 2,862 and 784 tweets, respectively. They represent the feature variable (X).
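
Because each text file stores one tweet per line, it could also be read without a CSV parser, which avoids any special treatment of quote or separator characters inside a tweet. The following is a minimal sketch under that assumption: `sample_file` is a hypothetical stand-in for a downloaded text file, and `read_tweets` is an illustrative helper name.

```python
import io

# Hypothetical stand-in for a downloaded text file with one tweet per line
sample_file = io.StringIO(
    'first tweet with a "quote\n'
    'second tweet, with a comma \n'
    'third tweet\n'
)


def read_tweets(handle):
    """Return one string per line, dropping only the line terminator."""
    return [line.rstrip("\n") for line in handle]


tweets = read_tweets(sample_file)
print(len(tweets))
```

The resulting list can be passed to `pd.DataFrame({'Tweets': tweets})` if a dataframe is preferred.
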

In [3]:
# Load the Irony Detection Train subset (Feature variable (X))

# Read a TXT file from internet (Github) that does not have a header
df_irony_train_x = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                                 'main/datasets/irony/train_text.txt', header=None, sep ='\t', names=['Tweets'])
    
print('\n\033[1mIrony Detection Train Subset:\033[0m')
display(df_irony_train_x) # Output: SemEval-2018 Irony Detection dataframe for training (truncated)


# Load the Irony Detection Test subset (Feature variable (X))

# Read a TXT file from internet (Github) that does not have a header
# (sep='\t' matches the train subset above; recent pandas versions reject '\n' as a separator)
df_irony_test_x = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                              'main/datasets/irony/test_text.txt', header=None, sep ='\t', names=['Tweets'])
  
print('\n\033[1mIrony Detection Test Subset:\033[0m')
display(df_irony_test_x) # Output: SemEval-2018 Irony Detection dataframe for testing (truncated)


Irony Detection Train Subset:


Unnamed: 0,Tweets
0,seeing ppl walking w/ crutches makes me really...
1,"look for the girl with the broken smile, ask h..."
2,Now I remember why I buy books online @user #s...
3,@user @user So is he banded from wearing the c...
4,Just found out there are Etch A Sketch apps. ...
...,...
2857,I don't have to respect your beliefs.||I only ...
2858,Women getting hit on by married managers at @u...
2859,@user no but i followed you and i saw you post...
2860,@user I dont know what it is but I'm in love y...



Irony Detection Test Subset:


Unnamed: 0,Tweets
0,@user Can U Help?||More conservatives needed o...
1,"Just walked in to #Starbucks and asked for a ""..."
2,#NOT GONNA WIN
3,@user He is exactly that sort of person. Weirdo!
4,So much #sarcasm at work mate 10/10 #boring 10...
...,...
779,"If you drag yesterday into today, your tomorro..."
780,Congrats to my fav @user & her team & my birth...
781,@user Jessica sheds tears at her fan signing e...
782,#Irony: al jazeera is pro Anti - #GamerGate be...


For this task, the **Irony Detection train and test subsets (labels)** proposed in TweetEval [[1]](#section_id) will be used. <br>
The train and test label files contain 2,862 and 784 labels, respectively, one per tweet. They represent the target variable (Y).
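
Since tweets and labels live in separate files, they are only aligned by row position. A minimal sketch of a sanity check before combining them is shown below; the toy data and the helper name `pair_with_labels` are hypothetical and stand in for the real subsets.

```python
from collections import Counter


def pair_with_labels(tweets, labels):
    """Zip tweets and labels after checking that they align row by row."""
    if len(tweets) != len(labels):
        raise ValueError(f"{len(tweets)} tweets but {len(labels)} labels")
    return list(zip(tweets, labels))


# Hypothetical toy data; the real subsets come from the text and label files
toy_tweets = ["great, another monday", "nice weather today", "love being ignored"]
toy_labels = [1, 0, 1]

pairs = pair_with_labels(toy_tweets, toy_labels)

# Count how many examples fall into each class
class_counts = Counter(label for _, label in pairs)
print(class_counts)
```
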

In [4]:
# Load the Irony Detection Train subset (Target Variable (Y))

# Read a TXT file from internet (Github) that does not have a header
df_irony_train_y = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                               'main/datasets/irony/train_labels.txt', header=None, names=['irony_output'])

print('\n\033[1mIrony Detection Train Subset (Labels):\033[0m')
display(df_irony_train_y) # Output: SemEval-2018 Irony Detection dataframe for training (labels)(truncated)



# Load the Irony Detection Test subset (Target Variable (Y))

# Read a TXT file from internet (Github) that does not have a header
df_irony_test_y = pd.read_csv('https://raw.githubusercontent.com/cardiffnlp/tweeteval/'
                              'main/datasets/irony/test_labels.txt', header=None, names=['irony_output'])

print('\n\033[1mIrony Detection Test Subset (Labels):\033[0m')
display(df_irony_test_y) # Output: SemEval-2018 Irony Detection dataframe for testing (labels)(truncated)



Irony Detection Train Subset (Labels):


Unnamed: 0,irony_output
0,1
1,0
2,1
3,1
4,1
...,...
2857,0
2858,1
2859,0
2860,0



Irony Detection Test Subset (Labels):


Unnamed: 0,irony_output
0,0
1,1
2,0
3,0
4,1
...,...
779,0
780,0
781,0
782,1


In [5]:
# Merging the X and Y dataframes (Training)
df_irony_train = pd.concat([df_irony_train_x, df_irony_train_y], axis=1)  # Merging the dataframes
print('\n\033[1mIrony Detection Train Subset:\033[0m')
display(df_irony_train) # Output: SemEval-2018 Irony Detection dataframe for training (truncated)


# Merging the X and Y dataframes (Testing)
df_irony_test = pd.concat([df_irony_test_x, df_irony_test_y], axis=1)  # Merging the dataframes
print('\n\033[1mIrony Detection Test Subset:\033[0m')
display(df_irony_test) # Output: SemEval-2018 Irony Detection dataframe for testing (truncated)


# pd.concat()     This function is used to concatenate two different dataframes.
# axis=1          This parameter indicates column-wise concatenation (placing the columns of two dataframes side by side)


Irony Detection Train Subset:


Unnamed: 0,Tweets,irony_output
0,seeing ppl walking w/ crutches makes me really...,1
1,"look for the girl with the broken smile, ask h...",0
2,Now I remember why I buy books online @user #s...,1
3,@user @user So is he banded from wearing the c...,1
4,Just found out there are Etch A Sketch apps. ...,1
...,...,...
2857,I don't have to respect your beliefs.||I only ...,0
2858,Women getting hit on by married managers at @u...,1
2859,@user no but i followed you and i saw you post...,0
2860,@user I dont know what it is but I'm in love y...,0



Irony Detection Test Subset:


Unnamed: 0,Tweets,irony_output
0,@user Can U Help?||More conservatives needed o...,0
1,"Just walked in to #Starbucks and asked for a ""...",1
2,#NOT GONNA WIN,0
3,@user He is exactly that sort of person. Weirdo!,0
4,So much #sarcasm at work mate 10/10 #boring 10...,1
...,...,...
779,"If you drag yesterday into today, your tomorro...",0
780,Congrats to my fav @user & her team & my birth...,0
781,@user Jessica sheds tears at her fan signing e...,0
782,#Irony: al jazeera is pro Anti - #GamerGate be...,1


In [6]:
# Exploring the first 5 rows of the Irony Detection Train subset

# The for-loop iterates over the first 5 row indices (i) of the train subset and prints each tweet with its label
for i in range(5):
    print(i, '\n', df_irony_train['Tweets'][i],
          df_irony_train['irony_output'][i], '\n')
# Output: First 5 tweets in the Irony Detection train subset

0 
 seeing ppl walking w/ crutches makes me really excited for the next 3 weeks of my life   1 

1 
 look for the girl with the broken smile, ask her if she wants to stay while, and she will be loved. 💕🎵  0 

2 
 Now I remember why I buy books online @user #servicewithasmile   1 

3 
 @user @user So is he banded from wearing the clothes?  #Karma  1 

4 
 Just found out there are Etch A Sketch apps.  #oldschool #notoldschool  1 



### Step 3: Preprocessing Irony Detection Train and Test subset

#### Cleaning Data (Part 1)
The first preprocessing phase consists of the following actions:

|Action|Examples of the strings that will be removed or modified|
|:--|:-------------------------------|
|Lowercase the column "Tweets" | Can --> can, Texans --> texans, MLB --> mlb, Carly --> carly|
|Remove Stopwords |'a', 'about', 'above', 'after', 'again' .... "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'|

1. In total, there are 179 stopwords in the NLTK module (`stopwords.words('english')`).
2. A further 326 stopwords were added to that list.
3. In total, this project uses 505 stopwords.
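
Stopword removal with word-boundary regular expressions can also be done in a single pass by compiling all stopwords into one alternation. The snippet below is a minimal sketch, not the approach used in this notebook: `toy_stopwords`, `build_stopword_pattern` and `remove_stopwords` are hypothetical names, and the short list stands in for the full 505-word list.

```python
import re

# Hypothetical short list; the notebook's full list is much longer
toy_stopwords = ["i", "it", "it's", "the", "for", "me", "really"]


def build_stopword_pattern(words):
    """Compile one alternation, longest words first so "it's" is tried before "it"."""
    ordered = sorted(set(words), key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    return re.compile(r"\b(?:%s)\b" % alternation)


def remove_stopwords(text, pattern):
    """Lowercase the text, then replace every stopword match with ''."""
    return pattern.sub("", text.lower())


pattern = build_stopword_pattern(toy_stopwords)
cleaned = remove_stopwords("It's the best day for me", pattern)
print(repr(cleaned))
```

Ordering the alternation from longest to shortest matters for contractions: if a shorter entry such as "it" is applied before "it's", the fragment "'s" can be left behind in the text.
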

In [7]:
# Lowercasing the column "Tweets" (Irony Detection Train and Test subset) 

df_irony_train_c1 = df_irony_train.copy()                        # Create a copy of the Irony Detection Train subset
df_irony_test_c1 = df_irony_test.copy()                          # Create a copy of the Irony Detection Test subset

df_irony_train["Tweets"] = df_irony_train["Tweets"].str.lower()  # Lowercase the whole content of the column "Tweets" (Train)
df_irony_test["Tweets"] = df_irony_test["Tweets"].str.lower()    # Lowercase the whole content of the column "Tweets" (Test)

# Defining the Stopwords

stopwords = ["a's", "a", "about", "above", "according", "accordingly", "across", "actually", "after", "afterwards", "again", "against",
 "ain't", "ain", "all", "allow", "allows", "almost", "along", "already", "also", "although", "am", "among", "amongst", "an", "and", "another", "any", "anybody", "anyhow", "anyone", "anything", "anyway", "anyways", "anywhere", 
 "apart", "appear", "appropriate", "are", "aren't", "aren", "around", "as", "aside", "ask", "asking", "associated", "at", 
 "available", "be", "because", "been", "before", "beforehand", "behind", "being", "believe", "below", "beside", "besides", 
 "between", "beyond", "both", "brief", "but", "by", "c'mon", "c's", "came", "can", "can't", "cannot", "cant", "cause", 
 "causes", "certain", "certainly", "clearly", "co", "com", "come", "comes", "concerning", "consequently", "consider", 
 "considering", "contain", "containing", "contains", "corresponding", "could", "couldn't", "course", "currently", 
 "definitely", "described", "despite", "did", "didn't", "different", "do", "does", "doesn't", "doing", "don't", "done", 
 "down", "downwards", "during", "each", "edu", "eg", "either", "eight", "else", "elsewhere", "enough", "entirely", 
 "especially", "et", "etc", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "exactly", 
 "example", "far", "few", "fifth", "first", "five", "followed", "following", "follows", "for", "former", "formerly", 
 "forth", "four", "from", "further", "furthermore", "get", "gets", "getting", "given", "go", "goes", "going", "gone", 
 "got", "gotten", "had", "hadn't", "happens", "hardly", "has", "hasn't", "have", "haven't", "having", "he", "he's", 
 "hence", "her", "here", "here's", "hereafter","hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", 
 "hither", "hopefully", "how", "howbeit", "however", "i'd", "i'll", "i'm", "i've", "ie", "if", "immediate", "in", 
 "inasmuch", "inc", "indeed", "indicate", "indicated", "indicates", "inner", "insofar", "instead", "into", "inward", "is",
 "isn't", "it", "it'd", "it'll", "it's", "its", "itself", "just", "keep", "keeps", "kept", "know", "known", "knows", 
 "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "let", "let's", "likely", "little", "look", 
 "looking", "looks", "ltd", "mainly", "many", "may", "maybe", "me", "meanwhile", "merely", "might", "more" , "moreover", 
 "most", "mostly", "much", "must", "my", "myself", "name", "namely", "nd", "near", "nearly", "necessary", "need", "needs",
 "neither", "never", "nevertheless","next", "nine", "no", "nobody", "non", "none", "noone", "nor", "normally", "not",
 "nothing", "novel", "now", "nowhere", "obviously", "of", "off", "often", "oh", "ok", "okay", "on", "once", "one", 
 "ones", "only", "onto", "or", "other", "others", "otherwise", "ought", "our", "ours", "ourselves", "out", "outside", 
 "over", "overall", "own", "particular", "particularly", "per", "perhaps", "placed", "plus", "possible", "presumably", 
 "probably", "provides", "que", "quite", "qv", "rather", "rd", "re", "really", "reasonably", "regarding", "regardless", 
 "regards", "relatively", "respectively", "right", "said", "same", "saw", "say", "saying", "says", "second", "secondly", 
 "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sent", "seriously", "seven", "several", 
 "shall", "she", "should", "shouldn't", "since", "six", "so", "some", "somebody", "somehow", "someone", "something", 
 "sometime", "sometimes", "somewhat", "somewhere", "soon", "specified", "specify", "specifying", "still", "sub", "such", 
 "sup", "t's", "take", "taken", "tell", "tends", "th", "than","that", "that's" , "thats", "the", "their", "theirs", "them", 
 "themselves", "then", "thence", "there", "there's", "thereafter", "thereby", "therefore", "therein", "theres", "thereupon",
 "these", "they", "they'd", "they'll", "they're", "they've", "think", "third", "this", "thorough", "thoroughly", "those", 
 "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "took", "toward", "towards", "tried", 
 "tries", "truly", "try", "trying", "twice", "two", "un", "under" , "unless", "unlikely", "until", "unto", "up", "upon", 
 "use", "used", "useful", "uses", "using", "usually", "value", "various", "very", "via", "viz", "vs", "want", "wants", 
 "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't" , "what", "what's", "whatever", "when", 
 "whence", "whenever", "where", "where's", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", 
 "whether", "which", "while", "whither", "who", "who's", "whoever", "whole", "whom", "whose", "why", "will", "willing", 
 "with", "within", "without", "won't", "wonder", "would", "wouldn't", "yes", "yet", "you", "you'd", "you'll", "you're", 
 "you've", "your", "yours", "yourself", "yourselves", "zero", "'s'"]


# .copy()          This function is used to make a copy of one dataframe with indices and data
# df.str.lower()   This function is used to transform the content of one column or dataframe to lowercase

In [8]:
# Removing Stopwords from the Irony Detection Train subset

for i in stopwords:         # For-loop iterates over all the words found in the list "stopwords"
    df_irony_train['Tweets'] = df_irony_train['Tweets'].replace(to_replace=r"\b%s\b"%(i), value='', regex=True)
# Action: Replace the stopwords by an empty character ('') (train)


# Removing Stopwords from the Irony Detection Test subset

for i in stopwords:         # For-loop iterates over all the words found in the list "stopwords"
    df_irony_test['Tweets'] = df_irony_test['Tweets'].replace(to_replace=r"\b%s\b"%(i), value='', regex=True)
# Action: Replace the stopwords by an empty character ('') (test)


# Series.replace()  This function is used to replace occurrences of a particular sub-string with another sub-string.
#                   In this case, every match of the RegEx built from %s is replaced by an empty string ('')
# to_replace        This parameter indicates the sub-string (or pattern) to replace
# value             This parameter indicates the sub-string to replace it with
# regex=True        This parameter indicates that to_replace is a Regular Expression

# RegEx Explanation
# \bstring\b    This RegEx matches the declared string only when it appears as a whole word (\b is a word boundary).
#               In this case, %s is filled in with each of the stopwords declared above.
#               For example, it will match 'about', 'where' or 'has'

In [9]:
# Comparing the original Tweet vs the Tweet after first preprocessing

pd.set_option("display.max_rows", None, "display.max_columns", None, 'display.max_colwidth', None)
# This option is used to print the entire Pandas dataframe (all rows, all columns & all content)

# Comparing the Original Tweet VS the Tweet after first preprocessing (train subset)
print('\n\033[1mIrony Detection Train Subset:\033[0m')
display(df_irony_train_c1[['Tweets']].iloc[0:3]) # Output: Original Tweet 
display(df_irony_train[['Tweets']].iloc[0:3]) # Output: Tweet after preprocessing 


# Comparing the Original Tweet VS the Tweet after first preprocessing (Test subset)
print('\n\033[1mIrony Detection Test Subset:\033[0m')
display(df_irony_test_c1[['Tweets']].iloc[0:3]) # Output: Original Tweet
display(df_irony_test[['Tweets']].iloc[0:3]) # Output: Tweet after preprocessing


Irony Detection Train Subset:


Unnamed: 0,Tweets
0,seeing ppl walking w/ crutches makes me really excited for the next 3 weeks of my life
1,"look for the girl with the broken smile, ask her if she wants to stay while, and she will be loved. 💕🎵"
2,Now I remember why I buy books online @user #servicewithasmile


Unnamed: 0,Tweets
0,ppl walking w/ crutches makes excited 3 weeks life
1,"girl broken smile, stay , loved. 💕🎵"
2,i remember i buy books online @user #servicewithasmile



Irony Detection Test Subset:


Unnamed: 0,Tweets
0,@user Can U Help?||More conservatives needed on #TSU + get paid 4 posting stuff like this!||YOU $ can go to
1,"Just walked in to #Starbucks and asked for a ""tall blonde"" Hahahaha #irony"
2,#NOT GONNA WIN


Unnamed: 0,Tweets
0,@user u help?|| conservatives needed #tsu + paid 4 posting stuff like !|| $
1,"walked #starbucks asked ""tall blonde"" hahahaha #irony"
2,# gonna win


As shown above, the first 2 preprocessing techniques have been applied successfully.

#### Cleaning Data (Part 2)
The second preprocessing phase consists of the following actions:

|Action|Examples of the strings that will be removed or modified|
|:--|:-------------------------------|
|Remove User Objects |@user, @user_1, @user-1, @paulina_100, @WiNer206|
|Remove Hashtags | #friends #bff #celebrate #sandiego #sundayfunday #ObsessedWithMyDog|
|Remove Non-ASCII Characters |랙바, 에이오, ᴬᴺᴼᵀᴴᴱᴿ	ᴰᴿᴵᴺᴷ	ᴴᴬᴾᴾᵞ, บมาแล, добройночи|
|Remove new line characters|\n|
|Remove punctuation marks |!"#$%&'()*+,-./:;<=>?@[\]^_`{&#124;}~|
|Collapse two or more spaces into one|&ensp;&ensp;, &ensp;&ensp;&ensp;, &ensp;&ensp;&ensp;&ensp;|
|Remove 1 or more underscores|"____", "_____", "_____________"|
|Remove numerical characters|0007, 0, 12389, 50000|
|Stemming|walking --> walk, excited --> excit, kids --> kid|

1. The **Punkt/english.pickle** Sentence Tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start a sentence. <br>
2. The Punkt Sentence Tokenizer is based on the publication by [Kiss, T. & Strunk, J., 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics, 32(4), pp. 485-525](https://direct.mit.edu/coli/article/32/4/485/1923/Unsupervised-Multilingual-Sentence-Boundary)
3. SnowballStemmer is a stemming algorithm that removes morphological affixes from words, leaving only the word stem.
4. SnowballStemmer is part of the NLTK library.
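
To see what each substitution in the table contributes, the regular-expression steps can be applied one at a time to a single example. The snippet below is a minimal sketch: the sample tweet is made up, `trace_cleaning` is a hypothetical helper, and stemming is left out because it requires NLTK.

```python
import re

# (description, pattern, replacement), applied in this order
STEPS = [
    ("user objects and hashtags", r"[@#][\w-]+", ""),
    ("non-ASCII characters", r"[^\x00-\x7F]+", ""),
    ("underscores", r"_+", ""),
    ("digits", r"[0-9]", ""),
    ("new lines", r"\n+", ""),
    ("punctuation marks", r"[^\w\s]", " "),
    ("two or more spaces", r"\s{2,}", " "),
]


def trace_cleaning(text):
    """Apply every step in order, printing the intermediate string after each one."""
    for description, pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
        print(f"{description:28} -> {text!r}")
    return text.strip()


# Made-up tweet containing one example of each kind of noise
sample = "@user loving this weather \u2614 #not 100% fun_times!!\n"
result = trace_cleaning(sample)
print(result)
```

The order matters: hashtags and user objects have to be removed before the punctuation step, otherwise the `@` and `#` markers are stripped first and the words behind them stay in the text.
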

In [10]:
# Creating a function to clean data

def split_sentences(text):
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') # Load the Punkt Sentence Tokenizer
    mark_sentence = '***'.join(tokenizer.tokenize(text))
    # Join the sentences with a delimiter (***) that marks the sentence boundaries

    #mark_sentence = re.sub(r"([a-z])\1+", r'\1', mark_sentence)
    # Disabled: RegEx that collapses repeated letters into a single letter

    mark_sentence = re.sub(r"([@#][\w-]+)", '', mark_sentence)
    # RegEx that removes User Objects (@...) and Hashtags (#...). (Replace them with nothing (''))

    mark_sentence = re.sub(r"([^\x00-\x7F]+)", '', mark_sentence)
    # RegEx that removes Non-ASCII Characters. (Replace them with nothing (''))

    mark_sentence = re.sub(r"(_+)", '', mark_sentence)
    # RegEx that removes 1 or more underscores. (Replace them with nothing (''))

    mark_sentence = re.sub(r"([0-9])", '', mark_sentence)
    # RegEx that removes digits. (Replace them with nothing (''))

    mark_sentence = re.sub(r"(\n+)", '', mark_sentence)
    # RegEx that removes new line characters. (Replace them with nothing (''))

    mark_sentence = re.sub(r"([^\w\s])", ' ', mark_sentence)
    # RegEx that removes punctuation marks. (Replace them with a space (' '))
    # Note: this step also replaces the asterisks of the '***' delimiter

    mark_sentence = re.sub(r"([\s]{2,})", ' ', mark_sentence)
    # RegEx that collapses two or more spaces. (Replace them with a single space (' '))

    list_df = mark_sentence.split('***')
    # Split on the delimiter. Because the punctuation step has already replaced the asterisks,
    # this returns a single-element list containing the whole cleaned tweet
    return list_df

print('Function split_sentences has been successfully created')


# RegEx Explanation:
# \s    Matches any whitespace character, such as " ", "\r" or "\n"
# \w    Matches any alphanumeric character or underscore, so it finds strings such as "a", "julio", "100" or "julio100"
# ()    Parentheses define a group formed by one or more characters.
#       For example, ([@#][\w_-]+) is one group with the structure "@ or # + one or more word characters, _ or -"
# []    Brackets define a character set.
#       For example, [\w\s] matches any alphanumeric character or any whitespace character.
# ^     A caret (^) at the start of a character set negates it: [^\w\s] matches any character
#       that is NOT alphanumeric and NOT whitespace.
# \n    Represents the new line character
# +     The plus sign (+) requires the preceding element to appear at least once (1 or more times).
# {2,}  This quantifier requires the preceding element to appear at least two times (2 or more).
# [0-9] Matches any single digit (0-9), so every digit of numbers such as "007" or "0" is matched
# \b    Outside a character set, \b is a word boundary, so \bstring\b matches only the whole word declared.
#       Inside a character set, \b means the backspace character, so a set such as [\b_\b] matches an underscore (or a backspace).

# \x00       This is 0 in hexadecimal notation
# \x7F       This is 127 in hexadecimal notation
# \x00-\x7F  This is the range from 0 to 127, which is the range of the ASCII characters.

Function split_sentences has been successfully created


In [11]:
# Cleaning the data (Part 2)(Train subset)

# Run the function "split_sentences" on the Train subset 
df_irony_train['clean_tweet'] = df_irony_train['Tweets'].apply(split_sentences)
# Action: Create a column 'clean_tweet' that stores the Tweets after preprocessing (training)

df_irony_train['clean_tweet'] = df_irony_train['clean_tweet'].apply(lambda x: ','.join(map(str, x)))
# Action: Transform the list obtained after preprocessing into single strings. (training)

df_irony_train['clean_tweet'] = df_irony_train['clean_tweet'].str.strip()
# Action: Remove extra spaces at the beginning and the end of any cell. (training)


# Cleaning the data (Part 2)(test subset)

# Run the function "split_sentences" on the Test subset
df_irony_test['clean_tweet'] = df_irony_test['Tweets'].apply(split_sentences)
# Action: Create a column 'clean_tweet' that stores the Tweets after preprocessing (test)

df_irony_test['clean_tweet'] = df_irony_test['clean_tweet'].apply(lambda x: ','.join(map(str, x)))
# Action: Transform the list obtained after preprocessing into single strings. (test)

df_irony_test['clean_tweet'] = df_irony_test['clean_tweet'].str.strip()
# Action: Remove extra spaces at the beginning and the end of any cell. (test)

print ('Done')

# df.apply()    Applies a function to every value of the column or dataframe.
# lambda        Defines a small anonymous function, which is convenient when the function is used only once.
# ','.join()    Takes all items in an iterable and joins them into one string, separated by ",".
#               For example, ['irwin','arnstein','subject'] becomes 'irwin,arnstein,subject'
# map()         Applies a function to every item of an iterable.
#               In this case, map(str, x) converts every item of the list to a string before joining
# str.strip()   Removes whitespace at the beginning and the end of each cell.

Done


In [12]:
# Stemming the data

stemmer = SnowballStemmer('english') # Create a Stemmer object for 'English' language

# Stemming the data (Train subset)
df_irony_train['clean_tweet1'] = df_irony_train['clean_tweet'].apply(lambda x: [stemmer.stem(word) for word in x.split()])
# Action: Stem every word found in every row (train)

df_irony_train['clean_tweet1'] = df_irony_train['clean_tweet1'].apply(lambda x: ' '.join(map(str, x)))
# Action: Transform the list obtained after preprocessing into single strings (train)

df_irony_train = df_irony_train.drop(columns=['clean_tweet'])
# Drop the column "clean_tweet" (unstemmed column)(train)



# Stemming the data (Test subset)
df_irony_test['clean_tweet1'] = df_irony_test['clean_tweet'].apply(lambda x: [stemmer.stem(word) for word in x.split()]) 
# Action: Stem every word found in every row (test)

df_irony_test['clean_tweet1'] = df_irony_test['clean_tweet1'].apply(lambda x: ' '.join(map(str, x)))
# Action: Transform the list obtained after preprocessing into single strings (test)

df_irony_test = df_irony_test.drop(columns=['clean_tweet']) # Get rid of the unstemmed column.
# Drop the column "clean_tweet" (unstemmed column)(test)

print ('Done')

# df.apply()    Applies a function to every value of the column or dataframe.
# lambda        Defines a small anonymous function, which is convenient when the function is used only once.
# x.split()     Splits a string into a list using a separator.
#               In this case, the content of each row is divided into words using whitespace as the separator
# ' '.join()    Takes all items in an iterable and joins them into one string, separated by a space.
#               For example, ['irwin','arnstein','subject'] becomes 'irwin arnstein subject'
# map()         Applies a function to every item of an iterable.
#               In this case, map(str, x) converts every stemmed word to a string before joining
# drop(columns=[...])  Deletes the listed columns
# stem()        Runs the stemmer object on one word; in this case it is the NLTK Snowball Stemmer

Done


In [15]:
# Comparing the original Tweet vs the Tweet after second preprocessing

pd.set_option("display.max_rows", None, "display.max_columns", None, 'display.max_colwidth', None)
# This option is used to print the entire Pandas dataframe (all rows, all columns & all content)

# Train subset
print('\n\033[1mIrony Detection Train Subset:\033[0m')
display(df_irony_train[0:5]) # Output: First 5 tweets (train)

# Test subset
print('\n\033[1mIrony Detection  Test Subset:\033[0m')
display(df_irony_test[0:5]) # Output: First 5 tweets (test)

# Column 'Tweets' contains the tweets after the first preprocessing phase (lowercased, stopwords removed)
# Column 'clean_tweet1' contains the tweets after all preprocessing


Irony Detection Train Subset:


Unnamed: 0,Tweets,irony_output,clean_tweet1
0,ppl walking w/ crutches makes excited 3 weeks life,1,ppl walk w crutch make excit week life
1,"girl broken smile, stay , loved. 💕🎵",0,girl broken smile stay love
2,i remember i buy books online @user #servicewithasmile,1,i rememb i buy book onlin
3,@user @user banded wearing clothes? #karma,1,band wear cloth
4,found etch sketch apps. #oldschool #notoldschool,1,found etch sketch app



Irony Detection  Test Subset:


Unnamed: 0,Tweets,irony_output,clean_tweet1
0,@user u help?|| conservatives needed #tsu + paid 4 posting stuff like !|| $,0,u help conserv need paid post stuff like
1,"walked #starbucks asked ""tall blonde"" hahahaha #irony",1,walk ask tall blond hahahaha
2,# gonna win,0,gonna win
3,@user sort person. weirdo!,0,sort person weirdo
4,#sarcasm work mate 10/10 #boring 100% #dead mate full #shit absolutely #sleeping mate 't handle #sarcasm,1,work mate mate full absolut mate t handl


In [18]:
# Creating the final Irony Detection subset after preprocessing

pd.reset_option('^display.', silent=True) # Reset the default Pandas Display (Truncated)

# Final Irony Detection Train subset
print('\n\033[1mIrony Detection Train Subset:\033[0m')
df_irony_train_final = df_irony_train[['clean_tweet1','irony_output']] 
# Create the final Irony Detection dataframe (train)

display(df_irony_train_final) # Output: Irony Detection train subset (cleaning)

# Final Irony Detection Test subset
print('\n\033[1mIrony Detection test Subset:\033[0m')
df_irony_test_final = df_irony_test[['clean_tweet1','irony_output']]
# Create the final Irony Detection dataframe (test)

display(df_irony_test_final)  # Output: Irony Detection test subset (cleaning)



Irony Detection Train Subset:


Unnamed: 0,clean_tweet1,irony_output
0,ppl walk w crutch make excit week life,1
1,girl broken smile stay love,0
2,i rememb i buy book onlin,1
3,band wear cloth,1
4,found etch sketch app,1
...,...,...
2857,i respect belief i respect,0
2858,women hit marri manag crack,1
2859,i i post i thot i add sorri,0
2860,i dont love product christma heaven,0



Irony Detection test Subset:


Unnamed: 0,clean_tweet1,irony_output
0,u help conserv need paid post stuff like,0
1,walk ask tall blond hahahaha,1
2,gonna win,0
3,sort person weirdo,0
4,work mate mate full absolut mate t handl,1
...,...,...
779,drag yesterday today tomorrow meant,0
780,congrat fav team birthplac team i claim gonna ...,0
781,jessica shed tear fan sign event make weak eve...,0
782,al jazeera pro anti femin,1


In [19]:
# Transforming the Irony Detection subsets to Python lists.

# Train subset
list_irony_train_final = df_irony_train_final['clean_tweet1'].tolist()
# Action: Transform the Irony Detection Train dataframe into a Python list

# Test subset
list_irony_test_final = df_irony_test_final['clean_tweet1'].tolist()
# Action: Transform the Irony Detection Test dataframe into a Python list

display(list_irony_train_final) # Output: Irony Detection Train list
#Uncomment the following line if you want to see the Irony Detection Test list
#display(list_irony_test_final) 

# Series.tolist()    This function is used to convert a dataframe column into a Python list

['ppl walk w crutch make excit week life',
 'girl broken smile stay love',
 'i rememb i buy book onlin',
 'band wear cloth',
 'found etch sketch app',
 'hey wit support darren wilson s stori lie racist mind blown',
 'stage',
 's great day garmin reset spill cinnamon',
 'halfway workday woooo',
 'like thank nephew give horribl cold sore throat appreci',
 'i fork node readi futur s interview',
 'visit great nephew ill fair',
 'twitter account sign book read id read twilight',
 'nichola spark manipul women guy realiti sensit male charact book',
 'u smartphon http t qwbriavk smartphon app pay http t rdlrugni',
 'stop accept crumb love equal appreci silenc solitud festiv love',
 'alert media guy twitter cite proof evolut wrong s nobel prize readi',
 'shameless account firm make vast sum advis rich rip taxpay account chief http t bsdujakxb',
 'bring movi tuesday night',
 'main issu walk dead forget breath watch bloodi good',
 'back colleg',
 'friday today proud present sunday enjoy sunday',


### Step 4: Performing Word Embeddings

Word embedding is the process of converting text data into numerical vectors so that a Machine Learning algorithm can use it as input. Dense embeddings aim to capture the semantics of words and, to some extent, their emotional content. <br>

In this project, the **Bag-of-Words** (BoW) technique is applied. Strictly speaking, BoW is a count-based representation rather than a learned embedding: it records which words occur and how often, not what they mean.
1. BoW is a representation of text that describes the occurrence of words within a document collection.
2. BoW uses word occurrence frequencies to measure the content of a tweet, that is, how often each word appears in it.
3. BoW is a method for extracting features (X) from tweets. These features are then used to train a Machine Learning algorithm.
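
To make the three points above concrete, here is a minimal pure-Python sketch of what a Bag-of-Words matrix contains. The tweets, the helper `bow_vector`, and the variable names are hypothetical and only serve as an illustration; the notebook itself relies on scikit-learn's `CountVectorizer`.

```python
from collections import Counter

# Toy, already-cleaned tweets (hypothetical examples, not taken from the dataset)
toy_tweets = ["love love monday", "hate monday", "love weekend"]

# 1. Vocabulary: every distinct token in the collection, sorted alphabetically
vocabulary = sorted({tok for tweet in toy_tweets for tok in tweet.split()})

# 2. One count vector per tweet, with one position per vocabulary word
def bow_vector(tweet, vocabulary):
    counts = Counter(tweet.split())
    return [counts[word] for word in vocabulary]

# 3. The feature matrix X: one row per tweet, one column per word
bow_matrix = [bow_vector(tweet, vocabulary) for tweet in toy_tweets]

print(vocabulary)   # ['hate', 'love', 'monday', 'weekend']
print(bow_matrix)   # [[0, 2, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
```

Each row is the feature vector of one tweet. Word order is discarded, which is why the representation is called a "bag" of words.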

In [27]:
# Vectorizing the Irony Detection Train subset

#Uncomment the following line if you want to print all columns and all content of a given dataframe. 
# (Truncated rows, all columns & all content)
#pd.set_option("display.max_columns", None, 'display.max_colwidth', None)

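# Keep at most the 143 most frequent words that appear in at least 5 tweets
vect_train = CountVectorizer(dtype=np.uint8, max_features=143, min_df=5)
Z_train = vect_train.fit_transform(list_irony_train_final)  # Sparse matrix of word counts
# .toarray() and get_feature_names_out() replace the deprecated .A and get_feature_names()
df_vect_train = pd.DataFrame(Z_train.toarray(), columns=vect_train.get_feature_names_out())
display(df_vect_train)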

print("--------------------------------------------------")
display(list_irony_train_final[0])

Unnamed: 0,alway,am,amaz,awesom,back,bad,bed,best,better,big,black,book,break,call,car,chang,check,christma,class,day,die,earli,eat,end,enjoy,excit,face,famili,fan,feel,final,find,follow,free,friend,fuck,fun,funni,game,girl,give,glad,god,gonna,good,great,guess,guy,haha,half,happi,hard,hate,help,home,hope,hour,http,ignor,job,kid,leav,life,like,listen,live,lol,long,lot,love,made,make,man,mean,media,minut,miss,money,morn,new,news,nice,night,old,peopl,perfect,person,phone,photo,pick,pictur,play,pleas,post,put,read,readi,real,rt,school,servic,shit,show,sick,sleep,song,sound,start,stop,support,sure,surpris,talk,team,thank,thing,thought,time,today,tomorrow,tonight,train,tweet,twitter,us,wait,wake,watch,way,wear,week,well,went,win,wonder,word,work,world,wow,wrong,yay,yeah,year
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2857,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2858,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2859,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2860,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


--------------------------------------------------


'ppl walk w crutch make excit week life'

In [28]:
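# Vectorizing the Test subset

pd.set_option("display.max_columns", None, 'display.max_colwidth', None)
# NOTE: fitting a second CountVectorizer on the test tweets learns a different vocabulary,
# so the columns of df_vect_test do not line up with those of df_vect_train.
# To score a model trained on x_train, reuse the fitted vectorizer instead:
#     Z_test = vect_train.transform(list_irony_test_final)
vect_test = CountVectorizer(dtype=np.uint8, max_features=143, min_df=5)
Z_test = vect_test.fit_transform(list_irony_test_final)
df_vect_test = pd.DataFrame(Z_test.toarray(), columns=vect_test.get_feature_names_out())
display(df_vect_test)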

print("--------------------------------------------------")
display(list_irony_test_final[0])

Unnamed: 0,absolut,ago,agre,alway,am,back,bad,beauti,bed,best,better,black,book,call,car,check,christma,close,coach,cop,day,drive,earli,email,expect,favorit,feel,fight,final,finish,follow,found,friday,friend,full,fun,game,girl,give,gonna,good,great,guess,guy,hand,happi,hard,hey,high,home,hope,hour,hous,http,human,job,kid,kill,leagu,leav,life,like,lol,long,love,major,make,man,media,men,mind,mom,moment,morn,move,music,new,news,nice,night,old,peopl,perfect,person,play,pleas,post,present,pull,put,racist,read,rememb,royal,rt,run,santa,school,show,sleep,snow,song,speak,start,stop,support,sure,talk,team,thank,thing,thought,till,time,today,tomorrow,top,truth,turn,tweet,ugli,us,ve,video,visit,wait,wake,walk,wanna,want,watch,way,week,well,went,white,win,work,world,write,yay,yeah,year
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
779,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
780,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
781,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
782,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


--------------------------------------------------


'u help conserv need paid post stuff like'
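
The train and test tables above have different column headers (for example, `amaz` appears only in the train table and `absolut` only in the test table), because each vectorizer learned its own vocabulary. A classifier trained on one set of columns cannot interpret the other. The following is a minimal sketch, on hypothetical toy tweets, of the usual alternative: fit the vectorizer once on the training text and only call `transform` on the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy, already-cleaned tweets (hypothetical, not taken from the dataset)
toy_train = ["great day work", "love great book", "work work day"]
toy_test = ["great book night", "day off"]

vect = CountVectorizer()
X_toy_train = vect.fit_transform(toy_train)  # learns the vocabulary from train only
X_toy_test = vect.transform(toy_test)        # reuses it; unseen words are ignored

print(sorted(vect.vocabulary_))              # ['book', 'day', 'great', 'love', 'work']
print(X_toy_train.shape, X_toy_test.shape)   # (3, 5) (2, 5)
print(X_toy_test.toarray())                  # [[1 0 1 0 0], [0 1 0 0 0]]
```

Both matrices now share the same five columns in the same order, so column *j* means the same word at training and prediction time. Words that occur only in the test tweets (`night`, `off`) are dropped.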

In [30]:
# Transferring the Train data into 2 NumPy arrays

x_train = df_vect_train.to_numpy()
# This array contains all the feature variable values (Bag-of-Words counts).

y_train = df_irony_train_final[['irony_output']].to_numpy()
# This array contains ONLY the target variable values, as a column of shape (n, 1).

print('\n'+'\033[1m'+'Array of feature variables (Train):'+'\033[0m', x_train.shape)
print(x_train) # Output: The feature matrix (everything except the target variable "irony_output")

print('\n'+'\033[1m'+'Array of target variable (Train):'+'\033[0m', y_train.shape)
print(y_train) # Output: The target values only (the "irony_output" column)


Array of feature variables (Train): (2862, 143)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Array of target variable (Train): (2862, 1)
[[1]
 [0]
 [1]
 ...
 [0]
 [0]
 [1]]


In [31]:
# Transferring the Test data into 2 NumPy arrays

x_test = df_vect_test.to_numpy()
# This array contains all the feature variable values (Bag-of-Words counts).

y_test = df_irony_test_final[['irony_output']].to_numpy()
# This array contains ONLY the target variable values, as a column of shape (n, 1).

print('\n'+'\033[1m'+'Array of feature variables (Test):'+'\033[0m', x_test.shape)
print(x_test) # Output: The feature matrix (everything except the target variable "irony_output")

print('\n'+'\033[1m'+'Array of target variable (Test):'+'\033[0m', y_test.shape)
print(y_test) # Output: The target values only (the "irony_output" column)


Array of feature variables (Test): (784, 143)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Array of target variable (Test): (784, 1)
[[0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]

### Step 5: Building the Machine Learning model

#### Multinomial Naive Bayes Classifier

In [32]:
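# Use a Multinomial Naive Bayes model (suited to word-count features)
from sklearn.naive_bayes import MultinomialNB 

mnb = MultinomialNB() 

# Train the model; ravel() flattens the (n, 1) target column into the 1-D array scikit-learn expects
mnb.fit(x_train, y_train.ravel())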

MultinomialNB()

In [33]:
# Take the model that was trained on the x_train data and apply it to the x_test data
y_pred_cv_mnb = mnb.predict(x_test) 
y_pred_cv_mnb # The output is all of the predictions

array([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0,

In [34]:
from sklearn.metrics import classification_report, accuracy_score

print(accuracy_score(y_test, y_pred_cv_mnb))
print(classification_report(y_test, y_pred_cv_mnb))

0.4642857142857143
              precision    recall  f1-score   support

           0       0.58      0.39      0.47       473
           1       0.38      0.58      0.46       311

    accuracy                           0.46       784
   macro avg       0.48      0.48      0.46       784
weighted avg       0.50      0.46      0.47       784
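
As a sanity check on the report above, the F1 scores can be recomputed from the rounded precision and recall values, and the accuracy can be compared with a majority-class baseline derived from the support column. This is only a sketch; the helper `f1_from` is a hypothetical name, and the inputs are the rounded figures printed in the report.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the report above
reported = {0: (0.58, 0.39), 1: (0.38, 0.58)}
for label, (p, r) in reported.items():
    print(label, round(f1_from(p, r), 2))   # 0 -> 0.47, 1 -> 0.46

# Accuracy of always predicting the majority class (0), from the support column
majority_baseline = 473 / (473 + 311)
print(round(majority_baseline, 2))          # 0.6
```

The recomputed F1 values match the report. The accuracy of 0.46 is below the 0.60 that a constant "not ironic" prediction would reach on this test set, which suggests that something in the feature pipeline deserves a second look before the model itself is tuned.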



#### Logistic Regression Classifier

In [35]:
# Create the Logistic Regression object
logr = LogisticRegression(solver='lbfgs')  
# Train the model; ravel() flattens the (n, 1) target column into the 1-D array scikit-learn expects
logr.fit(x_train, y_train.ravel())

LogisticRegression()

In [36]:
# Take the model that was trained on the x_train data and apply it to the x_test data
y_pred_cv_logr = logr.predict(x_test) 
y_pred_cv_logr # The output is all of the predictions

array([0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0,

In [37]:
print(accuracy_score(y_test, y_pred_cv_logr))
print(classification_report(y_test, y_pred_cv_logr))

0.5637755102040817
              precision    recall  f1-score   support

           0       0.63      0.68      0.65       473
           1       0.44      0.39      0.41       311

    accuracy                           0.56       784
   macro avg       0.54      0.53      0.53       784
weighted avg       0.55      0.56      0.56       784
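
The vectorizing and classification steps used in this notebook can also be chained with scikit-learn's `Pipeline`, which is imported at the top of the notebook. The sketch below is a hedged illustration on hypothetical toy data, not the configuration used for the results above; the step names `"vect"` and `"clf"` are arbitrary.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy, already-cleaned tweets (hypothetical); 1 = ironic, 0 = not ironic
toy_texts = ["great monday morn", "love wait line", "nice day park",
             "enjoy book home", "great stuck traffic", "walk dog park"]
toy_labels = [1, 1, 0, 0, 1, 0]

pipe = Pipeline([
    ("vect", CountVectorizer()),                  # text -> word counts
    ("clf", LogisticRegression(solver="lbfgs")),  # word counts -> label
])

# fit() fits the vectorizer and the classifier on the training text only
pipe.fit(toy_texts, toy_labels)

# predict() transforms new text with the same vocabulary before classifying
toy_pred = pipe.predict(["great wait traffic", "book park"])
print(toy_pred)
```

Because the pipeline owns the vectorizer, the same vocabulary is applied at training and prediction time, and the whole object can be passed to `cross_val_score` or `GridSearchCV`.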



## References

<a id='section_id'></a>
&emsp;[1] C. Van Hee, E. Lefever, and V. Hoste, “SemEval-2018 Task 3: Irony detection in English tweets,” in Proceedings of the 12th International Workshop<br>
&emsp;&emsp;&ensp;on Semantic Evaluation, 2018, pp. 39–50.<br>