In [14]:
from __future__ import print_function  # __future__ imports must come first in a cell

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.sklearn
from src.preprocess_texts import preprocess_texts

# Render pyLDAvis output inline in the notebook
pyLDAvis.enable_notebook()

In [3]:
# loading data
json_file = 'data/processed/data_final.json'
data = pd.read_json(json_file, orient="records")

# Topic modelling (LDA)

The motivation for using LDA is to represent documents (in our case, headlines) as a mixture of underlying topics. It will be interesting to check whether sarcastic headlines share some common topics.

First, I will preprocess the headlines using the preprocess_texts module, which involves:

*   converting to lowercase
*   removing stopwords
*   removing punctuation
*   stemming

In [4]:
# Keep only the headlines labelled as sarcastic
sarcastic_headlines = data.loc[data["is_sarcastic"] == 1, "headline"]

In [7]:
headlines_preprocessed = preprocess_texts(sarcastic_headlines)
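
The source of `preprocess_texts` is not shown in this notebook, so the following is only a rough sketch of the four steps listed above, not the module's actual implementation. The stopword list, the `naive_stem` helper, and the name `preprocess_sketch` are all assumptions made for illustration; a real pipeline would more likely use a full stopword list and a proper stemmer such as Porter.

```python
import string

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "for", "on", "and", "is", "are"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_sketch(texts):
    """Lowercase, strip punctuation, drop stopwords, and stem each text."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = []
    for text in texts:
        tokens = text.lower().translate(table).split()
        tokens = [naive_stem(t) for t in tokens if t not in STOPWORDS]
        cleaned.append(" ".join(tokens))
    return cleaned


print(preprocess_sketch(["Area Man Wins the Lottery, Again!"]))
```

The function returns one space-joined string per headline, which is the input shape that CountVectorizer expects.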

Headlines prepared this way are ready to be vectorized.

I will use CountVectorizer with max_df set to 0.4, which means it will ignore terms that appear in more than 40% of the headlines. Such terms are too common to help distinguish one topic from another. I will also set min_df=2 so that terms appearing in only one headline are ignored.

In [10]:
# Create a vectorized (term-count) representation of the headlines
tf_vectorizer = CountVectorizer(max_df=0.4, min_df=2)
headlines_vectorized = tf_vectorizer.fit_transform(headlines_preprocessed)
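
To see what the two thresholds do, here is a small sketch on a made-up corpus (the six toy documents and the `toy_` names are assumptions, not part of the dataset). Both thresholds are based on document frequency, that is, the number of documents containing a term, not on the total number of occurrences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "area" appears in 4 of 6 documents (above the 40% cap),
# "man" and "wins" appear in 2 documents each, every other term in only 1.
toy_docs = [
    "area man wins",
    "area man loses",
    "area woman wins",
    "area dog barks",
    "cat sleeps",
    "bird sings",
]

toy_vectorizer = CountVectorizer(max_df=0.4, min_df=2)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each retained term to its column index
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_counts.shape)
```

Only the terms that are neither too common nor too rare survive, so the resulting matrix has one column per retained term.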

In [12]:
lda = LatentDirichletAllocation(random_state=52, n_components=10)
lda.fit(headlines_vectorized)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=10, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=52, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
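
Besides the interactive visualization, the topics can also be inspected directly as lists of their highest-weighted terms. The helper below is a minimal sketch; `top_terms_per_topic` is a hypothetical name, and the small matrix is a made-up stand-in for `lda.components_`, whose rows are topics and whose columns are vocabulary terms.

```python
import numpy as np


def top_terms_per_topic(components, feature_names, n_top=5):
    """Return the n_top highest-weighted terms for each topic row."""
    names = np.asarray(feature_names)
    top = []
    for weights in np.asarray(components):
        # argsort is ascending, so reverse it to get the largest weights first
        order = np.argsort(weights)[::-1][:n_top]
        top.append(names[order].tolist())
    return top


# Toy topic-term matrix (2 topics x 4 terms), standing in for lda.components_
toy_components = np.array([[0.1, 3.0, 2.0, 0.5],
                           [4.0, 0.2, 0.3, 1.0]])
toy_terms = ["court", "school", "teacher", "report"]
print(top_terms_per_topic(toy_components, toy_terms, n_top=2))
```

With the fitted model, the same helper could be called with `lda.components_` and the vectorizer's feature names (`tf_vectorizer.get_feature_names_out()` in recent scikit-learn releases, `get_feature_names()` in older ones).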

In [15]:
LDA_vis = pyLDAvis.sklearn.prepare(lda, headlines_vectorized, tf_vectorizer)

In [16]:
LDA_vis

The topics are somewhat disappointing: many terms are repeated across topics, so the topics do not appear to be mutually exclusive. However, some of them could be interpreted as follows:

1st topic: US politicians (Clinton, Bush, Romney)

4th topic: Women

5th topic: Law (court, final, report, supreme)

7th topic: Education (school, teacher, study, program)

8th topic: Family (love, family, christmas, visit, house)

10th topic: History-related (history, archive)
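
The impression that terms repeat across topics could be quantified, for example with the Jaccard overlap between the top-term lists of each pair of topics. The sketch below is one possible way to do this, not something computed in this notebook; the term lists are made up for illustration and are not the actual model output.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of terms common to both lists, relative to all distinct terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(topics):
    """Jaccard overlap for every pair of topics, keyed by (i, j) index pairs."""
    return {
        (i, j): jaccard(topics[i], topics[j])
        for i, j in combinations(range(len(topics)), 2)
    }


# Made-up top-term lists (illustrative only)
toy_topics = [
    ["man", "new", "court", "report"],
    ["man", "new", "school", "teacher"],
    ["love", "family", "christmas", "visit"],
]
overlap = pairwise_overlap(toy_topics)
for pair, score in sorted(overlap.items()):
    print(pair, round(score, 2))
```

Values close to 0 for most pairs would indicate well-separated topics, while many high values would support the observation that the topics share much of their vocabulary.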