# Homework 3

Kevin Kannappan

## Instructions:

Please find a file called apple.tsv in the Files section. It contains 1500 records which are news articles. The first column is a text column containing the actual article. 

Your task is to use the article text to compute a suitable topic model using Latent Dirichlet Allocation (LDA) from the gensim package. You must use the topic coherence metric to determine a suitable number of topics. Your expected outputs are:

a) a list of topics including the top 10 terms in each topic 

b) the top 10 documents related to each topic along with their topic proportions

c) the coherence measure for each run of the model as you determine the suitable number of topics.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gensim
import re

## Pre-processing

In [None]:
# Read data
news_articles = pd.read_csv('./apple.tsv', header=None,sep='\t')
print(news_articles.shape)
news_articles.head()

In [None]:
# Data Cleaning:
news_articles.columns = ['article_text', 'article_id','publish_date','article_title','article_url','extra_1','author','tags','base_url','extra_2','extra_3']
news_articles.drop(['extra_1', 'extra_2','extra_3'], axis=1,inplace=True)
news_articles.head()

In [None]:
## Text manipulation
# Convert to list
news_text = news_articles.article_text.values.tolist()

# Remove unneeded characters (raw strings avoid invalid-escape warnings):
news_text = [re.sub(r"\n", " ", article) for article in news_text]
news_text = [re.sub(r"'", "", article) for article in news_text]
news_text = [re.sub(r"\s+", " ", article) for article in news_text]

In [None]:
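# Import stopwords
from nltk.corpus import stopwords

# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Tokenize and lowercase each article, strip accents, and drop stopwords
    return [[i for i in gensim.utils.simple_preprocess(str(article), deacc=True) if i not in stop_words] for article in text]

article_clean_text = clean_text(news_text)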

In [None]:
len(article_clean_text)==len(news_text)

In [None]:
# Create bigrams:
# Train on the tokenized articles; passing the raw strings would make Phrases treat each character as a token
bigram = gensim.models.Phrases(article_clean_text, min_count=5, threshold=100)
bigram_phrase = gensim.models.phrases.Phraser(bigram)

bigrams_text = [bigram_phrase[article] for article in article_clean_text]

In [None]:
# True means no bigrams were merged
bigrams_text==article_clean_text
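
The `min_count` and `threshold` arguments decide which adjacent token pairs get merged into a single bigram token. As a rough sketch, assuming gensim's default scorer follows the formula (count(a, b) - min_count) * N / (count(a) * count(b)) with N the vocabulary size, the score can be recomputed by hand. The function name `bigram_score` and all counts below are made up for illustration and do not come from this dataset.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram (a, b) from raw counts.

    Assumed to mirror gensim's default Phrases scorer; treat it as an
    illustration rather than the library's exact implementation.
    """
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts: "tim" appears 20 times, "cook" 25 times, "tim cook" 17 times
score = bigram_score(count_a=20, count_b=25, count_ab=17, vocab_size=5000, min_count=5)
is_merged = score > 100  # compared against threshold=100, as used in this notebook
print(score, is_merged)
```

A high threshold such as 100 therefore only merges pairs that occur together far more often than their individual frequencies would suggest.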

In [None]:
# Establish necessary components for LDA:
# Create dict and corpus
id2word = gensim.corpora.Dictionary(bigrams_text)
corpus = [id2word.doc2bow(text) for text in bigrams_text]

In [None]:
# View corpus
print(corpus[:2])
# Format in (word id, frequency)
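
To make the `(word id, frequency)` format concrete, here is a minimal pure-Python sketch of a bag-of-words conversion. The helpers `build_vocab` and `to_bow` and the toy documents are hypothetical and written only for illustration; gensim's `Dictionary` assigns ids in its own way, so the ids in the real corpus will differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy tokenized documents (not taken from apple.tsv)
toy_docs = [["apple", "iphone", "apple"], ["apple", "pie", "recipe"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)
```

Each document becomes a sparse list of pairs, so a token that appears twice in a document shows up once with a count of 2.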

In [None]:
tags_list = news_articles.tags.values.tolist()
tags_list = [re.sub("\{|\}", "", tag) for tag in tags_list]
total_tags = []
for i in tags_list:
    total_tags.extend(i.split(','))

In [None]:
from collections import Counter
tag_count = Counter(total_tags)
tag_count.most_common(50)

Since tags are supposed to represent the notion of a "topic", the tag distribution is a helpful guide to which topic counts to consider for the model. From a glance at the top 50 tags, there seem to be anywhere between 5 and 15 topics, depending on how sparse they are. I will now fit models over a range of topic counts.

## Part c) first: coherence measure for each run of the model

In [None]:
# Test LDA Model
# Build LDA model
num_topics = [5,10,12,14,16,18,20,25,30]
coherence_values = []
model_list = []

for i in num_topics:
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=i, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)
    model_list.append(lda_model)
    coherencemodel = gensim.models.CoherenceModel(model=lda_model, texts=bigrams_text, dictionary=id2word, coherence='c_v') 
    coherence_values.append(coherencemodel.get_coherence())

In [None]:
# Let's figure out the optimal topic value
plt.plot(num_topics, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence Score")
plt.legend(["coherence (c_v)"], loc='best')  # pass a list; a bare string is split into characters
plt.show()

In [None]:
# From the plot, 5 topics looks best, so take the first model in model_list
bst_model = model_list[0]
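
The choice above was read off the plot. The same selection can also be made programmatically; the sketch below is one way to do it, assuming the score list is parallel to the list of topic counts. The function name `best_num_topics` is hypothetical, and the scores are placeholders for illustration only, not results from this notebook.

```python
def best_num_topics(num_topics, coherence_values):
    """Return (k, score) for the run with the highest coherence score."""
    if len(num_topics) != len(coherence_values):
        raise ValueError("num_topics and coherence_values must have the same length")
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return num_topics[best_index], coherence_values[best_index]

# Placeholder values for illustration only
example_k = [5, 10, 15]
example_scores = [0.41, 0.47, 0.39]
k, score = best_num_topics(example_k, example_scores)
print(k, score)
```

In this notebook, `model_list[num_topics.index(k)]` would then retrieve the matching model, since `model_list` is filled in the same order as `num_topics`.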

## Part a) List of topics and top 10 terms

In [None]:
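# Raw (term, weight) pairs for each topic, kept for later inspection
topics = bst_model.show_topics(num_words=10, formatted=False)
# Human-readable listing of the top 10 terms per topic
bst_model.print_topics(num_words=10)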

## Part b) Top 10 documents related to each topic & topic proportion

In [None]:
# Collect the dominant topic and its proportion for each document
# (DataFrame.append was removed in pandas 2.0, so build the rows in a list first)
rows = []

# Iterate through model corpus
for doc_topics, _, _ in bst_model[corpus]:
    # per_word_topics=True makes each result a (doc_topics, word_topics, phi_values) tuple
    num_topic, topic_prop = max(doc_topics, key=lambda x: x[1])
    rows.append((int(num_topic), round(float(topic_prop), 2)))

results_df = pd.DataFrame(rows)

addtl_lookup1 = pd.Series(news_articles.article_id.to_list())
addtl_lookup2 = pd.Series(news_articles.article_title.to_list())
results_df = pd.concat([results_df, addtl_lookup2, addtl_lookup1], axis=1)

results_df.columns = ['topic_number', 'contribution_perc','article_title','article_id']
results_df.head()

In [None]:
# Return top 10 documents per topic:
results_df.sort_values(['topic_number','contribution_perc'],ascending=False).groupby('topic_number').head(10)
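
The sort followed by `groupby(...).head(10)` keeps the 10 highest-proportion rows within each topic. For clarity, the same logic can be sketched in plain Python; the function name `top_n_per_topic` and the toy records below are made up for illustration and are not rows from `results_df`.

```python
from collections import defaultdict

def top_n_per_topic(records, n=10):
    """Group (topic, proportion, title) records by topic and keep the n largest proportions."""
    by_topic = defaultdict(list)
    for topic, proportion, title in records:
        by_topic[topic].append((proportion, title))
    return {
        topic: sorted(items, key=lambda item: item[0], reverse=True)[:n]
        for topic, items in by_topic.items()
    }

# Toy records: (topic_number, contribution_perc, article_title)
toy_records = [(0, 0.91, "a"), (1, 0.55, "b"), (0, 0.97, "c"), (0, 0.60, "d")]
top = top_n_per_topic(toy_records, n=2)
print(top)
```

Note that this only ranks documents by their dominant topic, which matches how `results_df` was built: each article contributes to exactly one topic's list.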

The top articles in the 5 topics show some interesting results. The model clearly separates articles written in different languages. A French article and a Spanish article also landed in the same topic, which may reflect vocabulary shared between the two languages. Subject matter in the traditional sense of a topic is less cleanly separated, but some distinctions are still visible: notably food and the arts, financial news, and tech news. In conclusion, while more tuning (and potentially a different model) might have produced more granular topics in line with the tags above, I believe these 5 topics do an adequate job of separating the articles.