<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center> Assignment 4. Sarcasm detection with logistic regression
    
We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) "A Large Self-Annotated Corpus for Sarcasm", which contains more than 1 million Reddit comments labeled as either sarcastic or not. A processed version is available as a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).

Sarcasm detection is easy. 
<img src="https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg" />

In [1]:
!ls ../input/sarcasm/

In [28]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from string import punctuation

In [3]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')

In [4]:
train_df.head()

In [5]:
train_df.info()

Some comments are missing, so we drop the corresponding rows.

In [6]:
train_df.dropna(subset=['comment'], inplace=True)

We notice that the dataset is indeed balanced.

In [7]:
train_df['label'].value_counts()

We split data into training and validation parts.

In [35]:
train_texts, val_texts, y_train, y_val = train_test_split(train_df["comment"], 
                                                             train_df["label"], 
                                                             random_state=17)

## Tasks:
1. Analyze the dataset, make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.
2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).
3. Plot the words/bigrams which are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).
4. (optionally) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.

## Links:
  - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)
  - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection
  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) "Approaching (Almost) Any NLP Problem on Kaggle"
  - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions

1. Analyze the dataset, make some plots. This Kernel might serve as an example

In [9]:
label_count = train_df["label"].value_counts()
ax = label_count.plot.bar(color=['r', 'g'])
ax.set(xlabel='Labels', ylabel='Number of occurrences')
ax.xaxis.label.set_size(15)
ax.yaxis.label.set_size(15)
ax.set_title('Counts by values', fontsize=20)
# Label the bars with the actual class values rather than the tick positions
ax.set_xticklabels(label_count.index, rotation=0, fontsize=15)
plt.gcf().set(figwidth=14, figheight=8);

In [10]:
plt.figure(figsize=(8, 8))
plt.pie(label_count.values, startangle=90, autopct='%d%%', textprops={"fontsize": 15})
plt.legend(label_count.index, fontsize=15)

In [11]:
def plot_word_cloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24, 16), title=None, title_size=40, image_color=False):
    stopwords = set(STOPWORDS).union({'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'})
    
    wordcloud = WordCloud(background_color='black', stopwords=stopwords, max_words=max_words, max_font_size=max_font_size, random_state=42,
                         width=800, height=400, mask=mask)
    # Join all comments into one string; str(text) would only give the truncated Series repr
    wordcloud.generate(' '.join(text))

    plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask)
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation='bilinear')
        plt.title(title, fontdict={'size': title_size, 'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud)
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 'verticalalignment': 'bottom'})
    
    plt.axis('off')

plot_word_cloud(train_df['comment'])
    
    

In [12]:
train_df.loc[train_df['label']==0, 'comment'].str.len().apply(np.log1p).hist(alpha=0.5)
train_df.loc[train_df['label']==1, 'comment'].str.len().apply(np.log1p).hist(alpha=0.5)

Comments with class labels 0 and 1 have very similar (log-)length distributions.
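The histograms give only a visual impression. A numeric summary per label is one way to check how close the two length distributions really are. The sketch below is not part of the original notebook: `log_length_summary` is a hypothetical helper, and the toy frame stands in for `train_df` so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def log_length_summary(df, text_col="comment", label_col="label"):
    """Summary statistics of log1p(comment length), one row per label."""
    log_len = np.log1p(df[text_col].str.len())
    return log_len.groupby(df[label_col]).describe()


# Toy frame for illustration only; in the notebook, pass train_df instead.
toy_df = pd.DataFrame({
    "comment": ["yeah right", "sure, that will work", "thanks", "good point there"],
    "label": [1, 1, 0, 0],
})
summary = log_length_summary(toy_df)
print(summary[["count", "mean", "50%"]])
```

Calling `log_length_summary(train_df)` would give the same table for the real data; similar means and quartiles across the two rows would support the visual impression from the histograms.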

In [13]:
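# Stop words plus single punctuation characters, built once instead of per token
EXCLUDED_TOKENS = STOPWORDS.union(punctuation)

def generate_ngrams(text, n_gram=1):
    # Lowercase, split on whitespace, drop stop words and stand-alone punctuation
    tokens = [token for token in text.lower().split() if token not in EXCLUDED_TOKENS]
    # Zip shifted copies of the token list to get consecutive n-grams
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]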

def freq_dict_gen(series, n_gram=1):
    freq_dict={}
    for text in series:
        ngrams = generate_ngrams(text, n_gram=n_gram)
        for words in ngrams:
            freq_dict[words] = freq_dict.get(words, 0) + 1 
    # Most frequent n-grams first
    return dict(sorted(freq_dict.items(), key=lambda item: item[1], reverse=True))

def horizontal_bar(freq_dict, figsize=(10, 20), take=50, ax=None, title=None, color=None):

    if ax is None:
        plt.figure(figsize=figsize)
        ax = plt.gca()
        
    y = list(freq_dict.keys())[:take][::-1]
    width = list(freq_dict.values())[:take][::-1]
    ax.barh(y=y, width=width, color=color)
    y_min, y_max = ax.get_ylim()
    ax.set_ylim(y_min + 2 * (take / 50), y_max - 2 * (take / 50))
    ax.set_title(title, fontsize=20)

In [14]:
comments_0 = train_df.loc[train_df['label']==0, 'comment']
comments_1 = train_df.loc[train_df['label']==1, 'comment']

In [15]:
#ngram=1
freq_dict_0 = freq_dict_gen(comments_0, n_gram=1)
freq_dict_1 = freq_dict_gen(comments_1, n_gram=1)

In [16]:
plt.figure(figsize=(20, 20))
horizontal_bar(freq_dict_0, take=50, title='Label 0 for n_gram = 1', color='blue', ax=plt.subplot(121))
horizontal_bar(freq_dict_1, take=50, title='Label 1 for n_gram = 1', color='orange', ax=plt.subplot(122))

In [17]:
#ngram=2
freq_dict_0 = freq_dict_gen(comments_0, n_gram=2)
freq_dict_1 = freq_dict_gen(comments_1, n_gram=2)

In [18]:
plt.figure(figsize=(20, 20))
horizontal_bar(freq_dict_0, take=50, title='Label 0 for n_gram = 2', color='blue', ax=plt.subplot(121))
horizontal_bar(freq_dict_1, take=50, title='Label 1 for n_gram = 2', color='orange', ax=plt.subplot(122))

In [19]:
#ngram=3
freq_dict_0 = freq_dict_gen(comments_0, n_gram=3)
freq_dict_1 = freq_dict_gen(comments_1, n_gram=3)

In [20]:
plt.figure(figsize=(20, 20))
horizontal_bar(freq_dict_0, take=50, title='Label 0 for n_gram = 3', color='blue', ax=plt.subplot(121))
horizontal_bar(freq_dict_1, take=50, title='Label 1 for n_gram = 3', color='orange', ax=plt.subplot(122))

In [21]:
train_df.head()

In [22]:
train_df.columns

In [23]:
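# Which subreddits contain the most sarcastic (label 1) comments?
# size = number of comments, mean = share of sarcastic ones, sum = number of sarcastic ones
subreddit = train_df.groupby(["subreddit"])['label'].agg(['size', 'mean', 'sum'])
subreddit.sort_values(by='sum', ascending=False).head(20)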

In [24]:
# Which authors make the most sarcastic comments?
author = train_df.groupby(by='author')['label'].agg(['size', 'mean', 'sum'])
author.sort_values(by='sum', ascending=False, inplace=True)
author.head(20)
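Sorting by the raw `sum` tends to favour the largest subreddits and the most prolific authors, whatever their share of sarcastic comments. A possible complement is to rank groups by the mean label after dropping groups with too few comments. The helper name `sarcasm_rate_by` and the `min_count` threshold below are assumptions made for illustration, and the toy frame stands in for `train_df`.

```python
import pandas as pd


def sarcasm_rate_by(df, column, min_count=1, label_col="label"):
    """Comment count and share of sarcastic comments per group, small groups dropped."""
    stats = df.groupby(column)[label_col].agg(count="size", sarcasm_rate="mean")
    stats = stats[stats["count"] >= min_count]
    return stats.sort_values("sarcasm_rate", ascending=False)


# Toy frame for illustration only; in the notebook, pass train_df instead.
toy_df = pd.DataFrame({
    "subreddit": ["politics", "politics", "politics", "aww", "aww", "news"],
    "label": [1, 1, 0, 0, 0, 1],
})
rates = sarcasm_rate_by(toy_df, "subreddit", min_count=2)
print(rates)
```

On the real data, something like `sarcasm_rate_by(train_df, "subreddit", min_count=1000)` would be a reasonable starting point; the threshold is a judgment call and should be tuned to the group sizes.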

2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (label) based on the text of a comment on Reddit (comment).

In [184]:
tfidf = TfidfVectorizer(max_features=50000,  ngram_range=(1, 2), min_df=2)
# The pipeline is equivalent to these manual steps (the vectorizer is fit on the training texts only):
# X_train = tfidf.fit_transform(train_texts)
# X_val = tfidf.transform(val_texts)
pipeline = Pipeline([('tfidf', tfidf), ("logreg", LogisticRegression())])
pipeline.fit(train_texts, y_train)

In [185]:
#train accuracy
pipeline.score(train_texts, y_train)

In [186]:
#val accuracy
accuracy_score(y_val, pipeline.predict(val_texts))

In [192]:
dir(pipeline)

In [198]:
pipeline.classes_

In [189]:
import eli5
eli5.show_weights(estimator=pipeline.named_steps['logreg'],
                  vec=pipeline.named_steps['tfidf'])

3. Plot the words/bigrams which are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that)

In [56]:
pred_y_val = pipeline.predict(val_texts)

In [118]:
# Transpose so that rows are predicted labels and columns are actual labels
cm = confusion_matrix(y_val, pred_y_val).T
display(pd.DataFrame(cm, index=pd.Series(['0', '1'], 
                                         name='Predicted'), 
                     columns=pd.Series(['0', '1'], 
                                       name='Actual')))

In [133]:
def plot_confusion_matrix(cm, normalize=False, cmap='Blues', figsize=(12, 8), title='Confusion Matrix'):
    if normalize:
        # keepdims=True divides every row by its own total, so each row sums to 1
        cm = cm / cm.sum(axis=1, keepdims=True)
    sns.heatmap(cm, 
                annot=True, 
                fmt=".2f" if normalize else "d",
               xticklabels=["0", "1"],
               yticklabels=["0", "1"],
               cmap=cmap,
               annot_kws={'fontsize': int(figsize[0] / 12 * 15)})
    plt.gcf().set(figwidth=figsize[0], figheight=figsize[1])
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15, rotation=0)
    plt.xlabel('Actual', fontsize=15)
    plt.ylabel('Predicted', fontsize=15, rotation=0, labelpad=50)
    plt.title(title, fontsize=figsize[0] / 12 * 20, pad=20)

In [134]:
plot_confusion_matrix(cm, figsize=(12, 8))

4. (optionally) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.


In [140]:
subreddit = train_df["subreddit"]

In [172]:
train_subreddit, val_subreddit = train_test_split(subreddit, random_state=17) 

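In [178]:
# Clone the vectorizer so that comments and subreddits each get their own vocabulary;
# refitting the shared `tfidf` would also overwrite the one used inside `pipeline`.
from sklearn.base import clone
tfidf_com, tfidf_sub = clone(tfidf), clone(tfidf)
X_com_train = tfidf_com.fit_transform(train_texts)
X_com_val = tfidf_com.transform(val_texts)
X_sub_train = tfidf_sub.fit_transform(train_subreddit)
X_sub_val = tfidf_sub.transform(val_subreddit)

In [176]:
X_sub_train.shape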

In [155]:
from scipy.sparse import hstack

In [179]:
logreg = LogisticRegression()
logreg.fit(hstack([X_com_train, X_sub_train]), y_train)

In [180]:
logreg.score(hstack([X_com_train, X_sub_train]), y_train)

In [181]:
X_sub_val.shape

In [183]:
accuracy_score(y_val, logreg.predict(hstack([X_com_val, X_sub_val])))

In [None]:
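# confusion_matrix expects predicted labels, not the feature matrix;
# transpose so that rows are predicted labels, as plot_confusion_matrix assumes
cm = confusion_matrix(y_val, logreg.predict(hstack([X_com_val, X_sub_val]))).T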
plot_confusion_matrix(cm)
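The manual `hstack` steps above can also be written as a single pipeline, which keeps the vectorizers and the classifier together and avoids refitting one vectorizer by accident. The sketch below is one possible arrangement using scikit-learn's `ColumnTransformer`, not the notebook's original method. The step names, the `token_pattern` and the toy frame are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# A column given as a plain string is passed to its vectorizer as a 1-D Series of texts.
features = ColumnTransformer([
    ("comment_tfidf", TfidfVectorizer(ngram_range=(1, 2)), "comment"),
    # Whole-string tokens with binary=True give one indicator feature per subreddit.
    ("subreddit_bow", CountVectorizer(token_pattern=r"\S+", binary=True), "subreddit"),
])
combined = Pipeline([("features", features), ("logreg", LogisticRegression())])

# Toy frame for illustration only; in the notebook, use the train/validation rows of train_df.
toy_df = pd.DataFrame({
    "comment": ["yeah right great idea", "oh great another meeting", "yeah sure totally",
                "thanks for the help", "the meeting is at noon", "thanks see you at noon"],
    "subreddit": ["politics", "AskReddit", "politics", "aww", "AskReddit", "aww"],
    "label": [1, 1, 1, 0, 0, 0],
})
X_toy = toy_df[["comment", "subreddit"]]
combined.fit(X_toy, toy_df["label"])
preds = combined.predict(X_toy)
print(preds)
```

On the real data, the Tf-Idf step would presumably take the same parameters as earlier in the notebook (`max_features=50000`, `min_df=2`), and the pipeline would be fit on a training split of `train_df[["comment", "subreddit"]]`.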