# Classification of humorous text using FastText
The HaHackathon challenge aims at classifying text as humorous.
The dataset, a set of short texts, has been labeled by a heterogeneous group of people (age, sex, race) along 4 features:
* is it humorous? (binary label) 
* humour grading (0 to 5; 0 is not humorous)
* controversy (binary label): when the variance in the humour grading is higher than the average
* offensiveness grading (0 to 5; 0 is not offensive)
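
The controversy rule above can be illustrated with a small sketch. It assumes that "the average" means the average rating variance over all texts, which is only one possible reading of the rule; the ratings below are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, pvariance

# Hypothetical annotator ratings (0 to 5) for three texts; not dataset values.
ratings = {
    "text_a": [2, 2, 3, 2],
    "text_b": [0, 5, 1, 4],
    "text_c": [3, 3, 3, 3],
}

# Variance of the ratings each text received.
variances = {name: pvariance(r) for name, r in ratings.items()}

# Assumed threshold: the average variance over all texts.
threshold = mean(variances.values())

# A text is flagged as controversial when its variance exceeds the threshold.
controversy = {name: int(v > threshold) for name, v in variances.items()}
print(controversy)  # {'text_a': 0, 'text_b': 1, 'text_c': 0}
```

Only `text_b`, where annotators disagree strongly, ends up flagged.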

In this approach, we use the supervised classifier of FastText, built from the official `facebookresearch/fastText` repository and used through its Python bindings.

| Attribute | *Accuracy* baseline | *f1-score* baseline |
| :-------- | ------------------: | ------------------: |
| is_humour | 86%                 | 88%                 |
| controversy | - | - |


## import data

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion 
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer, StandardScaler, MinMaxScaler
from sklearn.utils.class_weight import compute_class_weight

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [3]:
df = pd.read_csv('train.csv').drop(columns = 'id')
df.head()

|   | text | is_humor | humor_rating | humor_controversy | offense_rating |
| -: | :--- | -------: | -----------: | ----------------: | -------------: |
| 0 | TENNESSEE: We're the best state. Nobody even c... | 1 | 2.42 | 1.0 | 0.2 |
| 1 | A man inserted an advertisement in the classif... | 1 | 2.5 | 1.0 | 1.1 |
| 2 | How many men does it take to open a can of bee... | 1 | 1.95 | 0.0 | 2.4 |
| 3 | Told my mom I hit 1200 Twitter followers. She ... | 1 | 2.11 | 1.0 | 0.0 |
| 4 | Roses are dead. Love is fake. Weddings are bas... | 1 | 2.78 | 0.0 | 0.1 |


## split data
* train (76%)
* valid (4%)
* test (20%)

In [4]:
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[['text']], 
                                                    df['is_humor'], 
                                                    test_size = 0.2, 
                                                    random_state = 21)
X_train_raw, X_valid_raw, y_train, y_valid = train_test_split(X_train_raw, 
                                                    y_train, 
                                                    test_size = 0.05, 
                                                    random_state = 21)
X_train_raw.head()

|   | text |
| -: | :--- |
| 2618 | me: so tell me about yourself date: i hate sur... |
| 898 | What do you call a zombified piece of toast? T... |
| 3722 | Today I came out to my parents, and my dad ins... |
| 3386 | A yellow pigment in curry and curcumin can sto... |
| 889 | What's green and not heavy? Light green h |
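
The effective size of each split follows from chaining the two `train_test_split` calls above: the second call takes its `test_size` from the 80% left after the first one, not from the full dataset. A quick arithmetic check, using only the two `test_size` values from the cell:

```python
test_frac = 0.20   # test_size of the first train_test_split call
valid_frac = 0.05  # test_size of the second call, applied to the remaining data

remaining = 1 - test_frac
fractions = {
    "train": remaining * (1 - valid_frac),
    "valid": remaining * valid_frac,
    "test": test_frac,
}

for name, frac in fractions.items():
    print(f"{name}: {frac:.0%}")
# train: 76%
# valid: 4%
# test: 20%
```

These fractions agree with the 6080 training rows and 1600 test rows reported in the outputs of this notebook, which would correspond to 8000 rows in total (an inferred figure, not one printed anywhere in the notebook).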


In [5]:
X_train_raw.describe()

|   | text |
| :- | :--- |
| count | 6080 |
| unique | 6080 |
| top | My wife said I needed to grow up I was speechl... |
| freq | 1 |


In [28]:
total = len(y_train)
print('total labels',total)
print('of which NaNs in y : ', y_train.isna().sum())
print('of which NaNs in text: ', X_train_raw.text.isna().sum())
print('\n% of positives is ', round(y_train.sum()/total, 2))
print('% of negatives is ', round((total - y_train.sum())/total, 2))
class_labels = np.unique(y_train)
class_weights = compute_class_weight(class_weight='balanced', classes=class_labels,y= y_train)
class_weights_dict = dict(zip(class_labels, class_weights))
print('class weights: ',class_weights_dict)

total labels 6080
of which NaNs in y :  0
of which NaNs in text:  0

% of positives is  0.61
% of negatives is  0.39
class weights:  {0: 1.2980358667805294, 1: 0.8132691278758695}
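
The `'balanced'` mode of `compute_class_weight` assigns each class the weight `n_samples / (n_classes * count(class))`, so the rarer class gets the larger weight. The sketch below reimplements that formula with the standard library on toy labels; `balanced_class_weights` is a hypothetical helper name and the labels are not the competition data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight of class c = n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}

# Toy labels (illustrative only): two negatives and four positives.
toy_labels = [0, 0, 1, 1, 1, 1]
print(balanced_class_weights(toy_labels))  # {0: 1.5, 1: 0.75}
```

With roughly 39% negatives and 61% positives, the same formula gives a weight above 1 for class 0 and below 1 for class 1, as in the output above.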


## data preprocessing
* remove punctuation (but don't remove stopwords)
* remove multiple sequential spaces
* everything to lower case
* lemmatization (each token as a noun, then as a verb) with the WordNet lemmatizer
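
As a rough illustration of the first three steps, here is a minimal standard-library sketch. It deliberately leaves out the NLTK-based step, and `normalize` is a hypothetical name that is not used elsewhere in this notebook.

```python
import re
import string

# Map every ASCII punctuation character (and newline) to a space.
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation + "\n"})

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse repeated whitespace."""
    text = text.lower()
    text = text.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("What's green and not heavy? Light green h"))
# what s green and not heavy light green h
```

Replacing punctuation with a space rather than deleting it is why `What's` becomes `what s` instead of `whats`.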




In [10]:
def stemmer(text, stemmer):
    return(' '.join([stemmer.stem(w) for w in word_tokenize(text)]))

def lemmatize_text(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    lemm_words = []
    for w in word_tokenize(text):
      w = lemmatizer.lemmatize(w,pos='n')
      w = lemmatizer.lemmatize(w,pos='v')
      lemm_words.append(w)
    lemmatized = " ".join(lemm_words)
    return lemmatized

def count_words(text):
    """ Returns the number of whitespace-separated words in text """
    return len(text.split())

def remove_punctuation(s_input, include_char = None):
    """ Returns input string without punctuation """
    import string as String
    punct = String.punctuation
    
    if include_char is not None:
        index = String.punctuation.index(include_char)
        punct = String.punctuation[:index] + String.punctuation[(index + 1):]
        
    punct += '\n'
        
    translator = str.maketrans(punct, ' '*len(punct))
    
    return s_input.translate(translator)

def remove_stopwords(text, use_stopwords = None, df = True, exclude_number = True):
    """ Returns input string removing stopwords from it. """

    
    if use_stopwords is None:
        use_stopwords = set(stopwords.words("english"))
        
    if df:
        new_text = word_tokenize(text)
        if exclude_number:
            new_text = [word for word in new_text if not word.isnumeric()]
        new_text = " ".join([word for word in new_text if word not in use_stopwords])
    else:
        new_text = ""
        for word in text:
            if word not in use_stopwords:
                new_text += word + " "

    return new_text

def sep_upper(text):
    """ Take a text as input and insert space before every uppercase letter. """
    
    new_text = ""
    for letter in text:
        if letter.isupper():
            new_text += " " + letter
        else:
            new_text += letter
    
    return new_text

def remove_space(text):
    """ Collapses runs of spaces into a single space. """
    import re
    return re.sub(' +', ' ', text)


def pre_proc(text_col):   
    text_col = text_col.str.lower() # lowercase
    text_col = text_col.apply(remove_punctuation) # removes String.punctuation characters
    #text_col = text_col.apply(remove_stopwords)   # removes english stopwords 
    text_col = text_col.str.replace(r'[^\w\s]', '', regex=True).str.strip() # drops any remaining non-word characters and trims leading/trailing whitespace
    #text_col = text_col.apply(sep_upper) # adds space before an uppercase
    text_col = text_col.apply(lemmatize_text)   
    return text_col

X_train = pre_proc(X_train_raw.text)
X_valid = pre_proc(X_valid_raw.text)
X_test = pre_proc(X_test_raw.text)
X_train.head()
#X_train_raw.head()

2618    me so tell me about yourself date i hate surpr...
898     what do you call a zombified piece of toast th...
3722    today i come out to my parent and my dad insta...
3386    a yellow pigment in curry and curcumin can sto...
889              what s green and not heavy light green h
Name: text, dtype: object

In [72]:
def toFastTextFormat(series_x,series_y,filePath):
  """ 
  transforms the data to a text file following 
  the format required by L{fasttext.train_supervised}, that is, a text file
  where each line is a sentence preceded by the labels with format: 
  C{__label__<label1> <sentence>}
  This function only allows one label


  @param series_x : a pandas series of str, named "text", with the sentences
  @type series_x: pandas.Series
  @param series_y : a pandas series with the labels for "text"
  @type series_y: pandas.Series
  @param filePath: the path for the text file to save
  @type filePath: str

  @return: None 
  """
  xx = pd.DataFrame(series_x)
  xx['label'] = series_y
  xx['ft_label'] = '__label__'+xx['label'].astype(str)
  xx['ft_sent'] = xx['ft_label'] + ' ' + xx['text']
  ft_text = xx.ft_sent.str.cat(sep='\n')

  with open(filePath,"w") as fl:
    fl.write(ft_text)


In [82]:
toFastTextFormat(X_train,y_train,'./train_text.txt')
toFastTextFormat(X_valid,y_valid,'./valid_text.txt')
toFastTextFormat(X_test,y_test,'./test_text.txt')
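
For reference, the lines written to these files look like `__label__<label> <sentence>`, one example per line. The following standard-library sketch builds such lines without pandas; `to_fasttext_lines` is a hypothetical helper, and the two sentences are the quick-test examples used later in this notebook.

```python
def to_fasttext_lines(texts, labels, prefix="__label__"):
    """Build one '<prefix><label> <text>' line per example."""
    lines = []
    for text, label in zip(texts, labels):
        # A line break inside a text would split it into two examples,
        # so all whitespace runs are collapsed into single spaces first.
        clean = " ".join(text.split())
        lines.append(f"{prefix}{label} {clean}")
    return lines

texts = [
    "i invented a new word plagiarism",
    "facebook is an american online social media and social networking service",
]
labels = [1, 0]

for line in to_fasttext_lines(texts, labels):
    print(line)
# __label__1 i invented a new word plagiarism
# __label__0 facebook is an american online social media and social networking service
```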

## FastText install and import

In [8]:
!git clone https://github.com/facebookresearch/fastText.git
%cd fastText
!make
!cp fasttext ../
%cd ..

Cloning into 'fastText'...
remote: Enumerating objects: 3854, done.
remote: Total 3854 (delta 0), reused 0 (delta 0), pack-reused 3854
Receiving objects: 100% (3854/3854), 8.23 MiB | 33.96 MiB/s, done.
Resolving deltas: 100% (2416/2416), done.
/content/fastText
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/args.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/autotune.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/matrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/dictionary.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/loss.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/productquantizer.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/densematrix.cc
c++ -pthread -std=c++11 -march=native -O3 -funroll-loops -DNDEBUG -c src/quantmatrix.cc
c++ -pthread -std=c++11 -march=native -O3 -fun

In [74]:
!cd ./fastText && pip install .

Processing /content/fastText
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3037208 sha256=a9b00358723ed657994e0bb5281db2b5ff2bddf6fdbbf11ed725165c7154a0bd
  Stored in directory: /tmp/pip-ephem-wheel-cache-rbb1a27b/wheels/a1/9f/52/696ce6c5c46325e840c76614ee5051458c0df10306987e7443
Successfully built fasttext
Installing collected packages: fasttext
  Found existing installation: fasttext 0.9.2
    Uninstalling fasttext-0.9.2:
      Successfully uninstalled fasttext-0.9.2
Successfully installed fasttext-0.9.2


In [75]:
import fasttext

In [102]:
model = fasttext.train_supervised(input="./train_text.txt",lr=0.5, epoch=25,wordNgrams=2,ws=10,verbose=5)

In [85]:
model.save_model('./model.bin')

In [105]:
# quick test for humorous text: "I invented a new word: plagiarism!"
print(model.predict("i invented a new word plagiarism")[0])
# quick test for non humorous text: "facebook is an american online social media and social networking service based in menlo park california"
print(model.predict("facebook is an american online social media and social networking service based in menlo park california")[0])

('__label__1',)
('__label__0',)
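
The predictions come back as strings carrying the `__label__` prefix. To compare them with the integer `is_humor` column they need to be converted back; a minimal sketch, where `label_to_int` is a hypothetical helper and the tuple simply mimics the printed output above:

```python
def label_to_int(ft_label, prefix="__label__"):
    """Strip the fastText label prefix and return the class as an int."""
    if not ft_label.startswith(prefix):
        raise ValueError(f"unexpected label format: {ft_label!r}")
    return int(ft_label[len(prefix):])

# Same shape as the printed output above: a tuple holding one label string.
predicted = ("__label__1",)
print(label_to_int(predicted[0]))  # 1
```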


In [107]:
model.test("./test_text.txt")

(1600, 0.856875, 0.856875)
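
`model.test` returns the number of examples followed by precision and recall at one. Since every line of the test file carries exactly one label and one label is predicted per line, both values equal the fraction of correct predictions, which is why the two numbers coincide. An f1-score for the humorous class needs per-class counts instead. The sketch below shows one way to compute it from true and predicted labels; `binary_metrics` is a hypothetical helper and the labels are toy values, not the model's predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On these toy labels, three of the four positives are found and one negative is wrongly flagged, so precision, recall and F1 are all 0.75 while accuracy is 4/6.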