<img src="./images/qinsti.png" align="left" alt="drawing" width="100"/>
<br><br>
<div align="left">
    <h2>Topic Modeling - Primer</h2>
    <h3>Fitting LDA model to Financial news items</h3>
</div>



## Context

This notebook demonstrates a simple example of topic modeling on a set of news items. The sample items come from
Refinitiv's *Machine Readable News* content offering.

The main steps in building the topic model are:
* Load the news items
* Preprocess the text
  * Load spaCy
  * Remove stop words
  * Lemmatize
  * Tokenize
* Create a document-term matrix
* Use `sklearn` to fit an LDA model


## Load spaCy Language Model

The `spaCy` library for Python provides many useful components for NLP tasks, including a range of preprocessing components, and it interoperates with
several popular deep learning frameworks.

In [1]:
import spacy
import pandas as pd
import numpy as np
nlp = spacy.load('en_core_web_lg')


## Load News Data

The dataset comprises sample news items from the financial domain, covering two topics: *commodity arbitrage* and *loans*.

In [2]:
news_items = pd.read_csv("./data/news-body-samples-v1.csv", sep="\t")
# Keep only items whose body is between 300 and 6000 characters long (inclusive)
cond       = news_items.apply(lambda x: 300 <= len(x['body']) <= 6000, axis=1)
news_items = news_items.assign(l_status=cond)
news_items = news_items[news_items.l_status]
# Number of remaining items per topic code
news_items.topic.value_counts()

N2:COMARB    1484
N2:LOA       1368
Name: topic, dtype: int64
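
The length filter above can also be written without a row-wise `apply`. The following is a minimal sketch on a made-up frame (the `toy` data and its values are hypothetical, not taken from the dataset); it assumes pandas is installed and that `body` holds strings.

```python
import pandas as pd

# Hypothetical toy frame standing in for news_items; only the body lengths matter here.
toy = pd.DataFrame({
    "body": ["a" * 100, "b" * 300, "c" * 6000, "d" * 6001],
    "topic": ["N2:LOA", "N2:LOA", "N2:COMARB", "N2:COMARB"],
})

# Series.between is inclusive on both ends by default, matching 300 <= len <= 6000.
keep = toy["body"].str.len().between(300, 6000)
filtered = toy[keep]

print(filtered["body"].str.len().tolist())
```

The vectorized form avoids calling a Python lambda once per row, which may matter on larger corpora.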

## Define custom tokenizer 

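To use a specific tokenizer with `sklearn`, define a tokenizer function based on the library of your choice. The one below uses spaCy to drop stop words, punctuation, whitespace and number-like tokens, and returns lowercased lemmas.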

In [3]:
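def my_tokenizer(text):
    # Keep alphabetic tokens that are not whitespace, punctuation, stop words or number-like
    tokens = [t for t in nlp(text) if t.is_alpha and not (t.is_space or t.is_punct or t.is_stop or t.like_num)]
    # Return lowercased lemmas; "-PRON-" is the placeholder lemma spaCy v2 assigns to pronouns,
    # so fall back to the lowercased token text in that case
    return [t.lemma_.lower().strip() if t.lemma_ != "-PRON-" else t.lower_ for t in tokens]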
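
To make the filtering logic concrete without loading a spaCy model, here is a rough pure-Python analogue. It is only a sketch: `TOY_STOPWORDS` and `toy_tokenizer` are hypothetical names, the stop-word list is a tiny made-up subset, and no lemmatization is performed, so its output will differ from that of `my_tokenizer`.

```python
import re

# Hypothetical, deliberately tiny stop-word list (spaCy's real list is much larger).
TOY_STOPWORDS = {"the", "a", "an", "in", "on", "to", "of", "and", "is", "are"}

def toy_tokenizer(text):
    """Keep lowercased alphabetic tokens that are not stop words."""
    # [A-Za-z]+ skips digits and punctuation, a crude stand-in for the
    # is_alpha / is_punct / like_num checks.
    words = re.findall(r"[A-Za-z]+", text)
    return [w.lower() for w in words if w.lower() not in TOY_STOPWORDS]

tokens = toy_tokenizer("Diesel margins fell to $9 a barrel in Europe.")
print(tokens)
```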


## Create Document Term Matrix

The `sklearn.feature_extraction` module can be used to create a document-term matrix.
The custom tokenizer function is passed to the vectorizer through its `tokenizer` argument.

In [4]:
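from sklearn.feature_extraction.text import CountVectorizer
# Unigrams only; keep terms that occur in at least 20% and at most 90% of the
# documents, capped at the 1000 most frequent terms
ct_vectorizer = CountVectorizer(tokenizer=my_tokenizer, ngram_range=(1, 1),
                                min_df=0.2,
                                max_df=0.9,
                                max_features=1000)

# The first column of news_items is used as the document text
X = ct_vectorizer.fit_transform(news_items.iloc[:, 0].values)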
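
As a rough illustration of what the vectorizer builds, the sketch below constructs a tiny document-term matrix by hand and applies document-frequency bounds similar in spirit to `min_df` and `max_df` when they are given as fractions. The three documents and the helper name `toy_dtm` are made up, and the code is not how `CountVectorizer` is implemented.

```python
from collections import Counter

def toy_dtm(docs, min_df=0.0, max_df=1.0):
    """Return (vocabulary, matrix) for already-tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(
        term for term, df in doc_freq.items()
        if min_df * n_docs <= df <= max_df * n_docs
    )
    # One row per document, one column per retained term; entries are raw counts.
    matrix = [[Counter(doc)[term] for term in vocab] for doc in docs]
    return vocab, matrix

docs = [
    ["oil", "price", "oil", "cargo"],
    ["oil", "loan", "rate"],
    ["oil", "loan", "price"],
]
vocab, matrix = toy_dtm(docs, min_df=0.5, max_df=0.9)
print(vocab)
print(matrix)
```

In this toy corpus `oil` is dropped because it appears in every document, and `cargo` and `rate` are dropped because they appear in only one, which is the kind of pruning the `min_df=0.2` and `max_df=0.9` settings apply to the news items.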

## Use `sklearn.decomposition` module to fit LDA 

The document-term matrix can then be used as input to fit an unsupervised LDA model.

In [5]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=5 , random_state = 123,
                                learning_method='batch')

X_topics = lda.fit_transform(X)
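
`fit_transform` returns one row per document, and in recent scikit-learn versions each row is a normalized topic distribution for that document. A minimal sketch of how such a matrix can be read, using a small made-up matrix (`toy_topics` is hypothetical and stands in for `X_topics`):

```python
# Hypothetical document-topic weights: 3 documents x 3 topics, each row sums to 1.
toy_topics = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.60, 0.10],
]

# Dominant topic per document: index of the largest weight in each row.
dominant = [max(range(len(row)), key=row.__getitem__) for row in toy_topics]

# Documents ranked by their weight on topic index 1, strongest first.
ranked_for_topic_1 = sorted(
    range(len(toy_topics)), key=lambda d: toy_topics[d][1], reverse=True
)

print(dominant)
print(ranked_for_topic_1)
```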

## Explore top 30 words from the topics

In [6]:
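n_top_words = 30

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires 1.0 or later
feature_names = ct_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(topic_idx)
    # argsort is ascending, so terms are listed from weakest to strongest; the slice
    # stops at -1 and therefore omits the single highest-weighted term
    print(" ".join([feature_names[i] for i in topic.argsort()[-n_top_words:-1]]))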


0
price demand click include market margin diesel source refinery high percent trader reuters trade week cargo sulphur barrel brent west asia close east tonne say diff gasoil crack oil
1
close month shell fall oil crack total compare report stock rise crude future refinery cargo say week sell percent refining day reuters margin ara europe fob barge trade barrel
2
guide trader april messaging week total barrel asia premium discount month source new trade fuel swap late offer window report tonne say price o cent reuters brent cargo oil
3
early margin include wednesday monday tuesday source friday expect thursday total month london international new offer bp market company asia price april march rate percent year reuter loan say
4
reuters say refining margin barrel sell double contract o energy price low discount europe fob percent oil fuel cargo future sulphur gasoil diff euro click barge ara trade diesel
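
Because `argsort` sorts in ascending order, the exact slice determines which terms are shown and in what order. The following sketch, on a made-up weight vector (`weights` and `names` are hypothetical), compares the slice used above with two common alternatives; it assumes NumPy is available.

```python
import numpy as np

# Hypothetical topic-term weights and matching vocabulary.
weights = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
names = ["loan", "oil", "rate", "cargo", "price"]
n_top = 3

order = weights.argsort()  # indices from smallest to largest weight

# Slice used in the cell above: ascending order, stops before the largest term.
as_in_cell = [names[i] for i in order[-n_top:-1]]

# Exactly n_top terms, still in ascending order.
ascending = [names[i] for i in order[-n_top:]]

# Exactly n_top terms, strongest first.
descending = [names[i] for i in order[::-1][:n_top]]

print(as_in_cell, ascending, descending)
```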


## Explore sample news items from *topic 4*

Topic 4 (index 3 in the output above) has many terms related to loans. The following retrieves the news items with the highest weight on this topic:

In [7]:
# Rank documents by their weight on topic index 3, highest first
sample_topic = X_topics[:, 3].argsort()[::-1]

for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx, 0])


    HONG KONG, Aug 29 (Reuter) - Asian bankers had plenty to
talk about Thursday despite a dearth of new debt deals.
    Starting with Indonesia, bankers say APP Global Finance, a
subsidiary of Asia Pulp and Paper, now plans to raise funds
solely through a floating rate note issue.
    The company had been expected to raise US$500 million
through a combination of an FRN and a 144a fixed-rate bond.
    Union Bank of Switzerland <SBGZ.S> is the arranger.
    Asian bankers say the FRN might be a hard sell because it is
secured by Asia Pulp and Paper shares. Equity backing is
considered less desirable than a claim on property, for example,
because stock prices are more volatile.
    Asia Pulp and Paper is a member of the Sinar Mas group,
which currently has another deal in the market via Sinar Mas
Finance, which on Wednesday launched a US$50 million floating
rate note issue.
    Another Sinar Mas member, Tjiwi Kima, is seen bringing a
loan issue.
    From South Korea, Kwa

## Explore sample news items from _topic 5_

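Topic 5 (index 4 in the output above) has many terms related to refined oil products, such as diesel, gasoil and refining margins, which points to the commodity arbitrage items. The following retrieves the news items with the highest weight on this topic: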

In [8]:
# Rank documents by their weight on topic index 4, highest first
sample_topic = X_topics[:, 4].argsort()[::-1]

for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx, 0])
    


    LONDON, March 21 (Reuters) - Benchmark diesel refining
margins in northwest Europe weakened on Monday to below $9 a
barrel amid expectations of higher imports flow in April and May
as refineries come out of maintenance.
    
    * Demand in the region remained average for the season.
    * Supplies from the U.S. and Russia have been stronger than
expected in March, with around 1.4 million set for delivery from
the U.S. Gulf Coast. Volumes are expected to rise further in
April, traders said.
    * The Middle East and Asia were also expected to increase
exports to Europe in April as refinery maintenance winds down,
traders said.
    * Russia's Lukoil <LKOH.MM> has started exporting ultra-low
sulphur diesel (ULSD) via a pipeline from its export terminal in
the Baltic Sea port of Vysotsk, industry sources told
Reuters.[nL5N16T3Q1]
    * Independent Italian refiner Saras SRS.MI has issued a
tender to sell 210,000-270,000 tonnes of 0.1 percent sulphur
gasoil between Ap