## Tutorial Notebook
Welcome! This Jupyter notebook is designed to show you how our
Natural Language Processing package works using a sample dataset
from the nltk package.

In [1]:
# Imports
from nltk.corpus import reuters
import pandas as pd

from nlprov.preprocessing import preprocess_text
from nlprov.vectorize import vectorize_text, vectorize_new_text
from nlprov.similarity_calc import similarity_calculation

The code below simply extracts the first ten Reuters articles from the
Reuters Corpus in NLTK.

In [2]:
# Get 10 reuters articles
reuters_df = pd.DataFrame(reuters.fileids()[0:10], columns=['file_ids'])
reuters_df['article'] = [reuters.raw(a_id) for a_id in reuters_df.file_ids]
print(reuters_df)

     file_ids                                            article
0  test/14826  ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...
1  test/14828  CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...
2  test/14829  JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...
3  test/14832  THAI TRADE DEFICIT WIDENS IN FIRST QUARTER\n  ...
4  test/14833  INDONESIA SEES CPO PRICE RISING SHARPLY\n  Ind...
5  test/14839  AUSTRALIAN FOREIGN SHIP BAN ENDS BUT NSW PORTS...
6  test/14840  INDONESIAN COMMODITY EXCHANGE MAY EXPAND\n  Th...
7  test/14841  SRI LANKA GETS USDA APPROVAL FOR WHEAT PRICE\n...
8  test/14842  WESTERN MINING TO OPEN NEW GOLD MINE IN AUSTRA...
9  test/14843  SUMITOMO BANK AIMS AT QUICK RECOVERY FROM MERG...


The `preprocess_text` function standardizes the text in each article
(e.g. removing punctuation, lowercasing words) so it's ready for
vectorization.
You can see the effect on one of the Reuters articles below.

In [3]:
preprocessed_text = preprocess_text(reuters_df.article)
print("Before:", reuters_df.article[1])
print("After:", preprocessed_text[1])

Before: CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STOCKS
  A survey of 19 provinces and seven cities
  showed vermin consume between seven and 12 pct of China's grain
  stocks, the China Daily said.
      It also said that each year 1.575 mln tonnes, or 25 pct, of
  China's fruit output are left to rot, and 2.1 mln tonnes, or up
  to 30 pct, of its vegetables. The paper blamed the waste on
  inadequate storage and bad preservation methods.
      It said the government had launched a national programme to
  reduce waste, calling for improved technology in storage and
  preservation, and greater production of additives. The paper
  gave no further details.
  


After: china daily says vermin eat 7 12 pct grain stocks a survey of 19 provinces and seven cities showed vermin consume between seven and 12 pct of china s grain stocks the china daily said it also said that each year 1 575 mln tonnes or 25 pct of china s fruit output are left to rot and 2 1 mln tonnes or up to 30 pct of its veg

By default, the `preprocess_text` function does the following:
* lowercases the text
* keeps only letters and numbers
* removes NAs/NaNs
* filters the text to English language only

You can also enable the following steps through the
parameters provided:
* a custom find/replace dictionary (`replace_dict`)
* lemmatization or stemming (`lemma`, `stem`)
* returning a list of lists instead of a Series (`token_list`)
* stop word removal (`stop_words`)
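
To make the first two default steps concrete, here is a minimal sketch of lowercasing and keeping only letters and numbers. The helper `simple_preprocess` is hypothetical and written only for illustration; it is not the `nlprov` implementation, and it skips the NA removal and language filtering steps.

```python
import re


def simple_preprocess(text):
    """Lowercase text and keep only letters and digits (illustrative only)."""
    lowered = text.lower()
    # Replace every run of characters other than a-z or 0-9 with one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", lowered)
    return cleaned.strip()


print(simple_preprocess("CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STOCKS"))
# china daily says vermin eat 7 12 pct grain stocks
```

Note how the hyphen in `7-12` becomes a space, which matches the "After" output shown above.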

The `vectorize_text` function creates two objects:
1. `vec_text` - a Document Feature Matrix (DFM), a sparse matrix in which each row is a document from the original dataset and each column is a feature, such as the count of a specific term
2. `vec_obj` - an sklearn Vectorizer object that contains the parameters used to vectorize the text

It takes a parameter `vec_type`, which lets you specify whether to use
the feature counts (`count`) or the TF-IDF weighted feature counts
(`tfidf`). By default, feature counts are used.
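
As a rough illustration of what a count-based DFM contains, the sketch below builds a small dense one with the standard library. The function `build_dfm` and the two sample documents are made up for this example; `vectorize_text` itself returns a sparse matrix and an sklearn Vectorizer, so treat this as a picture of the idea rather than of the package internals.

```python
from collections import Counter


def build_dfm(documents):
    """Return (vocabulary, rows); each row holds the term counts for one document."""
    tokenized = [doc.split() for doc in documents]
    # One column per distinct term, in alphabetical order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


docs = ["china daily says vermin eat grain", "grain stocks rot grain"]
vocabulary, dfm = build_dfm(docs)
print(vocabulary)
# ['china', 'daily', 'eat', 'grain', 'rot', 'says', 'stocks', 'vermin']
print(dfm)
# [[1, 1, 1, 1, 0, 1, 0, 1], [0, 0, 0, 2, 1, 0, 1, 0]]
```

A TF-IDF matrix starts from the same counts and down-weights terms that appear in many documents.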

In [4]:
vec_text, vec_obj = vectorize_text(preprocessed_text)

The following cell shows how these functions can be applied to
a new piece of text, here standing in for a search query against
the first ten Reuters articles.

In [5]:
new_text = pd.Series(data=["Sumitomo Bank got merged on sunday!"])
new_preprocessed_text = preprocess_text(new_text)
new_vec_text = vectorize_new_text(new_preprocessed_text, vec_obj)

In [6]:
new_preprocessed_text

0    sumitomo bank got merged on sunday
dtype: object
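
The reason `vec_obj` is passed along with the new text is that the query has to be expressed in the same columns as the original DFM. The sketch below shows that idea with a hypothetical four-term vocabulary and a hypothetical helper, `vectorize_with_vocabulary`; it assumes, as sklearn vectorizers do when transforming unseen text, that terms outside the fitted vocabulary are simply ignored.

```python
from collections import Counter


def vectorize_with_vocabulary(text, vocabulary):
    """Count only the terms that already exist in the fitted vocabulary."""
    counts = Counter(text.split())
    # Terms missing from the vocabulary get no column and are dropped.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary learned from the original documents.
vocabulary = ["bank", "grain", "merger", "sumitomo"]
query = "sumitomo bank got merged on sunday"
print(vectorize_with_vocabulary(query, vocabulary))
# [1, 0, 0, 1]
```

Here `merged` does not match the vocabulary term `merger`, which is one reason options such as stemming or lemmatization can matter for search-style use.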

The `similarity_calculation` function calculates the similarity
of the new text to the existing documents (using their vectorized
forms) with the similarity metric specified in the `metric` parameter.
Currently, the following similarity metrics are supported:
`cosine`, `jaccard`, `manhattan`, `dice`,
and `hamming`. Cosine similarity is the default.
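
For intuition about the default metric, here is a minimal sketch of cosine similarity between two count vectors, written with the standard library. The function name and the sample vectors are made up for illustration, and this is not the code `similarity_calculation` runs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector shares no terms with anything.
        return 0.0
    return dot / (norm_a * norm_b)


query = [1, 0, 0, 1]
doc_a = [2, 0, 1, 3]  # shares two terms with the query
doc_b = [0, 4, 0, 0]  # shares no terms with the query
print(round(cosine_similarity(query, doc_a), 4))
# 0.9449
print(cosine_similarity(query, doc_b))
# 0.0
```

A score of 0 means the two texts share no features, and higher scores mean more overlap relative to the lengths of both vectors.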

In [7]:
cos_similarity = similarity_calculation(new_vec_text, vec_text)
cos_similarity

array([[0.03624204, 0.03143473, 0.01984189, 0.        , 0.02646281,
        0.05657357, 0.04946194, 0.        , 0.        , 0.24426358]])

While those similarity values are great to have, they're even
easier to understand when we pair
them back with their associated Reuters articles.

In [8]:
reuters_df['cosine_similarity'] = cos_similarity[0]
reuters_df.sort_values(by=['cosine_similarity'], ascending=False)

     file_ids                                            article  cosine_similarity
9  test/14843  SUMITOMO BANK AIMS AT QUICK RECOVERY FROM MERG...           0.244264
5  test/14839  AUSTRALIAN FOREIGN SHIP BAN ENDS BUT NSW PORTS...           0.056574
6  test/14840  INDONESIAN COMMODITY EXCHANGE MAY EXPAND\n  Th...           0.049462
0  test/14826  ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...           0.036242
1  test/14828  CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...           0.031435
4  test/14833  INDONESIA SEES CPO PRICE RISING SHARPLY\n  Ind...           0.026463
2  test/14829  JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...           0.019842
3  test/14832  THAI TRADE DEFICIT WIDENS IN FIRST QUARTER\n  ...           0.000000
7  test/14841  SRI LANKA GETS USDA APPROVAL FOR WHEAT PRICE\n...           0.000000
8  test/14842  WESTERN MINING TO OPEN NEW GOLD MINE IN AUSTRA...           0.000000


You can see that our new piece of text,
"Sumitomo Bank got merged on sunday!", unsurprisingly
scores highest against article `test/14843`, which covers the
Sumitomo Bank merger.
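
If you only need the best few matches rather than the full table, the ranking step can be sketched without pandas. The file IDs and scores below are copied from the outputs above; the helper `top_matches` is a hypothetical name introduced for this example.

```python
file_ids = [
    "test/14826", "test/14828", "test/14829", "test/14832", "test/14833",
    "test/14839", "test/14840", "test/14841", "test/14842", "test/14843",
]
scores = [
    0.03624204, 0.03143473, 0.01984189, 0.0, 0.02646281,
    0.05657357, 0.04946194, 0.0, 0.0, 0.24426358,
]


def top_matches(ids, similarities, k=3):
    """Return the k (id, score) pairs with the highest similarity."""
    ranked = sorted(zip(ids, similarities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


for file_id, score in top_matches(file_ids, scores):
    print(file_id, round(score, 6))
# test/14843 0.244264
# test/14839 0.056574
# test/14840 0.049462
```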