# Week 7 Lab: Text Analytics

<img align="right" style="padding-right:10px;" src="figures_wk7/topic_modeling.png" width=400><br>

This week's assignment will focus on text analysis of BBC News articles.

## Our Dataset: 
**Dataset:** bbc.csv (provided in folder assign_wk7)<br>
Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. <br>
Class Labels: 5 (business, entertainment, politics, sport, tech)

## Text Analytics Lab

**Objective:** 
To demonstrate all of the text analysis techniques covered in this week's lecture material. Your submission needs to include the following:
   - Preparation of the text data for analysis
       * Elimination of stopwords, punctuation, digits, lowercase
   - Identify the 10 most frequently used words in the text
       * How about the ten least frequently used words? 
       * How does lemmatization change the most/least frequent words?
           - Explain and demonstrate this topic
   - Generate a word cloud for the text
   - Demonstrate the generation of n-grams and part of speech tagging
   - Create a Topic model of the text
       * Find the optimal number of topics
       * Test the accuracy of your model
       * Display your results 2 different ways.
           1) Print the topics and explain any insights at this point.
           2) Graph the topics and explain any insights at this point.


### Deliverables:

Upload your notebook's .ipynb file and your topic_model_viz.html page this week.
   
**Important:** Make sure you provide complete and thorough explanations for all of your analysis. You need to defend your thought processes and reasoning.

Reference:
> Graphic comes from https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05

In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

import time
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline


from nltk.tag import pos_tag
from nltk.util import ngrams
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import gensim
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
import gensim.corpora as corpora
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

import seaborn as sns
sns.set()


[nltk_data] Downloading package stopwords to /home/lester/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/lester/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/lester/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Open data file and explore columns

In [2]:
# Read the data from the tab-separated file
df = pd.read_csv('bbc.csv', sep='\t')
df.head()
# Drop the filename and title columns
df = df.drop(['filename', 'title'], axis=1)
# Rename the columns: content -> news, category -> type
df = df.rename(columns={'content': 'news', 'category': 'type'})
df.head()


| Unnamed: 0 | type | news |
|---|---|---|
| 0 | business | Quarterly profits at US media giant TimeWarne... |
| 1 | business | The dollar has hit its highest level against ... |
| 2 | business | The owners of embattled Russian oil giant Yuk... |
| 3 | business | British Airways has blamed high fuel prices f... |
| 4 | business | Shares in UK drinks and food firm Allied Dome... |


## Prepare the text data for analysis
The text is preprocessed by tokenizing each article and removing unwanted tokens. NLTK's English stopword list, extended with a few custom entries (`also`, `first`, `last`), filters out common words that carry little meaning, while the most frequent two-letter acronyms (such as US and UK) are kept in their original case.

### Tokenize the text

In [3]:
df_tok = df.copy()
df_tok['news'] = df_tok['news'].apply(word_tokenize)

In [4]:
# collect two-letter, all-uppercase alphabetic tokens as acronym candidates
two_letters_tokens = df_tok['news'].apply(lambda x: [i for i in x if len(i) == 2 and i.upper() == i and i.isalpha()])
# keep the 10 most frequent candidates as the list of acronyms
acronyms = two_letters_tokens.explode().value_counts().index.tolist()[:10]
acronyms


['US', 'UK', 'TV', 'EU', 'PC', 'BT', 'MP', 'ID', 'GM', 'FA']

### Remove unwanted tokens: punctuation, numbers, and stopwords
- A few words that ranked too high in the frequency counts (`also`, `first`, `last`) were added to the stopword list.

In [5]:
stop_words = stopwords.words('english')
# add custom stop words that ranked too high in the frequency counts
stop_words.extend(['also', 'first', 'last'])

# convert to lower case unless the token is one of the acronyms
df_tok['news'] = df_tok['news'].apply(lambda x: [item.lower() if item not in acronyms else item for item in x])

# remove stop words
df_tok['news'] = df_tok['news'].apply(lambda x: [item for item in x if item not in stop_words])

# remove punctuation and any token containing non-alphabetic characters
df_tok['news'] = df_tok['news'].apply(lambda x: [item for item in x if item.isalpha()])

# remove tokens shorter than 3 characters unless they are acronyms
df_tok['news'] = df_tok['news'].apply(lambda x: [item for item in x if len(item) > 2 or item in acronyms])

# remove digits (already covered by the isalpha() filter above; kept as an explicit safeguard)
df_tok['news'] = df_tok['news'].apply(lambda x: [item for item in x if not item.isdigit()])
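
The chained `apply` calls above can also be read as a single filter applied to each token. The sketch below is a minimal, dependency-free restatement of that logic. `clean_tokens` is a hypothetical helper name, and the small stopword and acronym sets are illustrative stand-ins for the `stop_words` and `acronyms` objects defined earlier.

```python
def clean_tokens(tokens, stop_words, acronyms):
    """Apply the cell's filters in one pass (hypothetical helper, not an NLTK function)."""
    cleaned = []
    for tok in tokens:
        # keep acronyms in their original case, lowercase everything else
        tok = tok if tok in acronyms else tok.lower()
        if tok in stop_words:
            continue  # stop word
        if not tok.isalpha():
            continue  # punctuation, digits, mixed tokens
        if len(tok) <= 2 and tok not in acronyms:
            continue  # very short non-acronym token
        cleaned.append(tok)
    return cleaned


# illustrative stand-ins for the notebook's stop_words and acronyms
demo_stop_words = {"the", "in", "to", "also"}
demo_acronyms = {"US", "UK"}
demo = ["The", "US", "dollar", "rose", "2", "%", "in", "2004", ",", "also", "up", "vs", "UK"]
print(clean_tokens(demo, demo_stop_words, demo_acronyms))
```

In the notebook this could replace the separate steps with `df_tok['news'].apply(lambda x: clean_tokens(x, set(stop_words), set(acronyms)))`; using sets also makes the membership tests faster than list lookups.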

### Lemmatization

In [6]:
lemmatizer = WordNetLemmatizer()

df_lemmatized = df_tok.copy()
df_lemmatized['news'] = df_lemmatized['news'].apply(lambda x: [lemmatizer.lemmatize(item) for item in x])

### Word frequency after lemmatization
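
No cell for this step appears in the export, so the following is only a sketch of how the frequency comparison could be done with `collections.Counter`. `word_frequencies` and `most_and_least` are hypothetical helper names, and the toy token lists stand in for `df_tok['news']` (before lemmatization) and `df_lemmatized['news']` (after).

```python
from collections import Counter


def word_frequencies(token_lists):
    """Count every token across an iterable of token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


def most_and_least(counts, n=10):
    """Return the n most frequent and the n least frequent (word, count) pairs."""
    ranked = counts.most_common()  # sorted by count, highest first
    return ranked[:n], ranked[-n:][::-1]


# toy data: the same two documents before and after lemmatization
before = [["games", "game", "players"], ["game", "player", "markets"]]
after = [["game", "game", "player"], ["game", "player", "market"]]

counts_before = word_frequencies(before)
counts_after = word_frequencies(after)
print(len(counts_before), len(counts_after))  # vocabulary size shrinks after lemmatization
print(most_and_least(counts_after, n=1))
```

Because lemmatization merges inflected forms such as `games` and `game`, the head of the ranking gains counts and many one-off variants disappear from the tail.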

## Topic modeling and Category naming:
The Latent Dirichlet Allocation (LDA) algorithm was applied to identify optimal topic categories based on the Coherence metric. This technique helped to uncover latent patterns and themes within the text data.

OpenAI's GPT-3 model (Davinci) was then used to generate category names from the keywords extracted for each topic. These category names provided further insight into the underlying topics and improved the interpretability of the results.

### Text classification using LDA

In [7]:
# create dictionary and corpus
id2word = Dictionary(df_lemmatized['news'])
print(id2word)
# remove extremes: drop tokens found in fewer than 5 documents or in more than 50% of documents
id2word.filter_extremes(no_below=5, no_above=0.5)
print(id2word)
corpus = [id2word.doc2bow(text) for text in df_lemmatized['news']]
print(len(corpus))


Dictionary<24022 unique tokens: ['US', 'account', 'adjust', 'advert', 'advertising']...>
Dictionary<7737 unique tokens: ['US', 'account', 'adjust', 'advert', 'advertising']...>
2225


In [8]:
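# train one gensim LDA model per candidate topic count
# (range(6, 7) trains only the 6-topic model; widen it, e.g. range(5, 12), to compare 5-11 topics)
lda_models = [gensim.models.LdaMulticore(corpus=corpus,
                                         id2word=id2word,
                                         num_topics=i,
                                         random_state=42,
                                         chunksize=100,
                                         passes=10,
                                         per_word_topics=True) for i in range(6, 7)]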


In [9]:
# calculate the c_v coherence score for each model
coherence_scores = []
for candidate in lda_models:
    coherence_model = CoherenceModel(model=candidate, texts=df_lemmatized['news'], dictionary=id2word, coherence='c_v')
    coherence_scores.append(coherence_model.get_coherence())
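
Since `range(6, 7)` yields a single value, only one model is trained above and the coherence comparison has nothing to choose between. A fuller sweep might look like the sketch below, assuming `corpus`, `id2word` and the lemmatized token lists exist as in the earlier cells. `sweep_topic_counts` and `best_topic_count` are hypothetical helpers, and the single-core `LdaModel` is swapped in for `LdaMulticore` to keep the example simple.

```python
def sweep_topic_counts(corpus, id2word, texts, topic_counts, passes=10, seed=42):
    """Train one LDA model per topic count and return {num_topics: c_v coherence}."""
    # imported here so best_topic_count below can be used without gensim installed
    from gensim.models import CoherenceModel, LdaModel

    scores = {}
    for k in topic_counts:
        model = LdaModel(corpus=corpus, id2word=id2word, num_topics=k,
                         random_state=seed, passes=passes)
        coherence = CoherenceModel(model=model, texts=list(texts),
                                   dictionary=id2word, coherence='c_v')
        scores[k] = coherence.get_coherence()
    return scores


def best_topic_count(scores):
    """Return the topic count with the highest coherence score."""
    return max(scores, key=scores.get)


# usage in the notebook (names assumed from the earlier cells):
# scores = sweep_topic_counts(corpus, id2word, df_lemmatized['news'], range(5, 12))
# print(best_topic_count(scores))
```

Keying the scores by topic count avoids having to reconstruct the count from a list position later.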

In [10]:
# print number of topics, per-word likelihood bound, perplexity and coherence of the LDA models as a table
# (the topic count is read from each model rather than derived from its list position)
bounds = [m.log_perplexity(corpus) for m in lda_models]
df_coherence = pd.DataFrame({'Number of topics': [m.num_topics for m in lda_models],
                             'Log likelihood': bounds,
                             'Perplexity': [np.exp2(-b) for b in bounds],
                             'Coherence': coherence_scores})

print(df_coherence)

# get the best lda model based on coherence score
lda = lda_models[np.argmax(coherence_scores)]
print(f'Best number of topics: {lda.num_topics}')

   Number of topics  Log likelihood  Perplexity  Coherence
0                 6       -7.750877  215.400177   0.514496
Best number of topics: 6
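
The perplexity column is derived from the per-word likelihood bound returned by `log_perplexity` as 2 raised to the negative bound. As a quick check of the arithmetic, using the bound printed in the table:

```python
bound = -7.750877           # per-word likelihood bound, taken from the table above
perplexity = 2 ** (-bound)  # same computation as np.exp2(-bound)
print(round(perplexity, 1))
```

This reproduces the table's perplexity of about 215.4. Lower perplexity is better, but it is most useful for comparing models trained on the same corpus and dictionary.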


In [11]:
# print the top words of each topic in the selected LDA model
for topic_id, words in lda.show_topics(formatted=False):
    print(f'Topic {topic_id}:')
    print([word for word, _ in words])


Topic 0:
['US', 'market', 'company', 'sale', 'bank', 'price', 'firm', 'economy', 'growth', 'share']
Topic 1:
['people', 'game', 'technology', 'phone', 'one', 'user', 'new', 'site', 'could', 'computer']
Topic 2:
['service', 'broadband', 'people', 'UK', 'BT', 'million', 'music', 'phone', 'digital', 'new']
Topic 3:
['film', 'US', 'blog', 'new', 'one', 'company', 'director', 'people', 'star', 'life']
Topic 4:
['government', 'party', 'labour', 'people', 'election', 'minister', 'say', 'blair', 'could', 'new']
Topic 5:
['game', 'best', 'player', 'one', 'win', 'time', 'world', 'two', 'england', 'play']


# Conclusions
- The top 10 most and least frequent words were identified after tokenizing the text and removing unwanted tokens. To achieve this, NLTK's English stopword list was employed along with a few custom exceptions.
- After lemmatization, the most and least frequent words were determined. This process helped reduce data sparsity and influenced the results, with the least frequent words being impacted the most.
- N-gram analysis, specifically of bigrams, proved to be a simple yet effective method for discovering patterns and potential tags in documents. It revealed informative word pairs such as prime-minister, six-nation, mobile-phone, tony-blair, general-election and new-york.
- Part-of-speech (POS) tagging supplied valuable data that can be beneficial for sentiment analysis and language modeling.
- Latent Dirichlet Allocation (LDA) was used for topic modeling, with the optimal number of categories determined based on the Coherence metric.
- To generate category names based on keywords or tokens for each category, additional steps were taken. OpenAI's GPT-3 model (Davinci) automatically produced these category names using the provided keywords. These category names provided deeper insights into the resulting topics.
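
The bigram step summarized above is not shown in this export. A minimal sketch of counting bigrams in plain Python follows. Pairing each token with its successor via `zip` yields the same pairs as `nltk.util.ngrams(tokens, 2)`. The helper name `top_bigrams` and the toy documents are illustrative assumptions, not the notebook's data.

```python
from collections import Counter


def top_bigrams(token_lists, n=5):
    """Count adjacent token pairs within each document and return the n most common."""
    counts = Counter()
    for tokens in token_lists:
        # zip(tokens, tokens[1:]) pairs every token with the one that follows it
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# toy documents standing in for df_lemmatized['news']
docs = [
    ["prime", "minister", "tony", "blair", "said"],
    ["tony", "blair", "met", "prime", "minister"],
    ["mobile", "phone", "sale", "rose"],
]
print(top_bigrams(docs, n=2))
```

Counting pairs per document, rather than over one concatenated token stream, avoids creating spurious bigrams across article boundaries.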

In [12]:
import joblib
model = {
    "stop_words": stop_words,
    "acronyms": acronyms,
    "dictionary": id2word,
    "lda": lda,
    "categories": ['Financial/Economic Market', 'Mobile Technology/Gaming', 'Music Industry', 'Film Industry', 'Political Elections/Parties', 'Sports/Gaming']
}

# save the model to disk
joblib.dump(model, 'bbc.pkl')

['bbc.pkl']
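
To reuse the saved bundle, it can be loaded with `joblib.load('bbc.pkl')` and applied to new text. The sketch below is an assumption about how that might look. `predict_category` is a hypothetical helper, and its regex tokenizer only approximates the notebook's preprocessing, since it skips NLTK tokenization and lemmatization.

```python
import re


def predict_category(text, bundle):
    """Return (category name, probability) of the dominant topic for a new text."""
    acronyms = set(bundle["acronyms"])
    stop_words = set(bundle["stop_words"])

    tokens = []
    for tok in re.findall(r"[A-Za-z]+", text):  # simplified tokenizer (assumption)
        tok = tok if tok in acronyms else tok.lower()
        if tok in stop_words:
            continue
        if len(tok) > 2 or tok in acronyms:
            tokens.append(tok)

    bow = bundle["dictionary"].doc2bow(tokens)
    # get_document_topics returns a list of (topic_id, probability) pairs
    topic_probs = bundle["lda"].get_document_topics(bow)
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    return bundle["categories"][topic_id], prob


# usage (assumes bbc.pkl was written by the cell above):
# import joblib
# bundle = joblib.load('bbc.pkl')
# print(predict_category("Labour and the Tories clashed over the election", bundle))
```

For predictions that match training more closely, the same tokenization and lemmatization steps used earlier in the notebook would need to be applied before `doc2bow`.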