<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers (2024_sem1)</div>

# IFN619 :: C1-UnstructuredAnalytics

For this tutorial, you will use the studio notebook as a guide, and:

1. Use the Guardian API to undertake your own search and obtain a JSON file of documents
2. Create a TF/IDF document-term matrix for your documents
3. Perform topic modelling of your documents using NMF

In [None]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import json
import random

### 1. Accessing the data via The Guardian API

Make a copy of the studio notebook file, and modify it to perform your own search of the Guardian API. **NOTE:** you will need to obtain your own developer API key first and put it in a file in the appropriate folder.

A suggested search term is "ukraine", or come up with another that is of interest to you and will return a fair amount of data.

Save your search results in a JSON file, then read in that data below...
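
As a rough sketch of the search-and-save step (this is not the studio notebook's code), the block below builds a request URL for the Guardian Content API `search` endpoint and converts a response into a dictionary keyed by article id. The helper names `build_search_url` and `results_to_articles`, the choice of `show-fields=bodyText`, and the `{id: body text}` layout are all assumptions made for illustration. The sample payload is made up, so no network call or API key is needed to run it.

```python
import json
from urllib.parse import urlencode

SEARCH_URL = "https://content.guardianapis.com/search"


def build_search_url(term, api_key, page_size=50):
    """Build a search URL (hypothetical helper, not from the studio notebook)."""
    params = {
        "q": term,
        "api-key": api_key,
        "page-size": page_size,
        "show-fields": "bodyText",  # ask for the plain-text article body
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def results_to_articles(payload):
    """Map each result's id to its body text (assumed layout: {id: text})."""
    results = payload["response"]["results"]
    return {r["id"]: r.get("fields", {}).get("bodyText", "") for r in results}


# Made-up, heavily trimmed response used here instead of a live API call
sample_payload = {
    "response": {
        "status": "ok",
        "results": [
            {"id": "world/2024/jan/01/example", "fields": {"bodyText": "Example text."}},
        ],
    }
}

url = build_search_url("ukraine", "YOUR_API_KEY")
articles_sketch = results_to_articles(sample_payload)
print(url)
print(json.dumps(articles_sketch, indent=2))
```

To keep real results, the dictionary can be written out with `json.dump` to a file in your `data/` folder, using the same `encoding='utf-8'` that the loading cell uses.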

In [None]:
# Load the data - articles from The Guardian
file_path = "data/"
file_name = ???

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    articles = json.load(fp)

print(f"Loaded {len(articles)} articles from {file_name}")

#### Create a top-10 terms dataframe

Using the document keys as the index, create a dataframe that can hold the top 10 terms for each document.
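
To illustrate why an empty dataframe works for this purpose, here is a minimal sketch using made-up document ids in place of `articles.keys()`. The names `doc_ids` and `demo_df` are hypothetical. Columns created without data have `object` dtype, so a single cell can later hold a whole list of terms.

```python
import pandas as pd

# Made-up document ids standing in for articles.keys()
doc_ids = ["doc_a", "doc_b"]

demo_df = pd.DataFrame(index=doc_ids, columns=["tfidf", "nmf"])
print(demo_df.dtypes)  # both columns start as object dtype, filled with NaN

# .at addresses one cell by row label and column name,
# so an entire list can be stored in that single cell
demo_df.at["doc_a", "tfidf"] = ["war", "kyiv", "troops"]
print(demo_df)
```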

In [None]:
# Create a dataframe to hold top terms for each analysis type
terms_df = pd.DataFrame(index=articles.keys(),columns=['tfidf','nmf'])
terms_df

### Term Frequency / Inverse Document Frequency (TF/IDF)


In [None]:
# Set parameters appropriate to your data
tfidf_vectorizer = TfidfVectorizer(
    max_df=???, min_df=???, max_features=???, stop_words="english"
)

In [None]:
# Get the document vectors
tfidf_dt_matrix = tfidf_vectorizer.fit_transform(???)

# Display the vector for the first document
tfidf_dt_matrix.toarray()[???]

#### Update the terms matrix

In [None]:
# list of feature names
feature_names = ???.get_feature_names_out()

# create a df to combine matrix with feature names
tfidf_df = pd.DataFrame(tfidf_dt_matrix.toarray(), index=???, columns=???)
tfidf_df

In [None]:
for idx in terms_df.index:
    tfidf = dict(tfidf_df.loc[idx].sort_values(ascending=False).head(???))
    # print(tfidf)
    terms_df.at[idx, 'tfidf'] = list(tfidf.keys())

terms_df

### Topic modelling with Non-negative Matrix Factorisation (NMF)


[NMF](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) is a different algorithm for obtaining *topics* (lists of terms) from a document-term matrix. It also factorises the document-term matrix into two factor matrices: document-topic and topic-term.
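
The factorisation can be sketched on a tiny made-up matrix. Here `V`, `W` and `H` are illustrative names, and `random_state` and `max_iter` are set only to make the toy run repeatable. `W` plays the role of the document-topic matrix and `H` the topic-term matrix, and their product approximates `V`.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up non-negative document-term matrix: 4 documents x 5 terms
V = np.array([
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [2.0, 3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 3.0, 0.0],
    [0.0, 1.0, 3.0, 4.0, 0.0],
])

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)  # document-topic weights, shape (4, 2)
H = model.components_       # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
print(np.round(W @ H, 1))   # approximate reconstruction of V
print(W.argmax(axis=1))     # index of the strongest topic for each document
```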

In [None]:
# Set the number of topics
num_topics = ???

# Create the model
nmf_model = NMF(n_components=???,init='random',beta_loss='frobenius')

# Fit the model to the data and use it to transform the data
doc_topic_nmf = nmf_model.fit_transform(???)

topic_term_nmf = nmf_model.components_

In [None]:
# Get the topics and their terms
nmf_topic_dict = {}
for index, topic in enumerate(???):
    zipped = zip(feature_names, topic)
    top_terms = dict(sorted(zipped, key=lambda t: t[1], reverse=True)[:10])
    # print(top_terms)
    top_terms_rounded = {term: round(weight, 4) for term, weight in top_terms.items()}
    nmf_topic_dict[f"topic_{index}"] = top_terms_rounded

# Print the topics with their terms    
for k,v in nmf_topic_dict.items():
    print(k)
    print(v)
    print()

#### Update the terms matrix

In [None]:
for idx,topic in enumerate(???):
    topic_num = topic.argmax()
    top_topic = nmf_topic_dict[f"topic_{topic_num}"]
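    # Assign via .at on the dataframe itself; chained indexing
    # (terms_df['nmf'].iloc[idx] = ...) may write to a copy instead
    terms_df.at[terms_df.index[idx], 'nmf'] = list(top_topic.keys())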

terms_df

### Check against articles

In [None]:
# Sample 5 random articles
samples = random.sample(range(0,len(terms_df)),???)

for sample in samples:
    doc = terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t- TFIDF:\t",doc['tfidf'])
    print("\t- NMF:\t\t",doc['nmf'])
    print()

## Refine your analysis

Once you have worked through the process, try tweaking the parameters of the TF/IDF vectorizer and of the NMF topic model to obtain better results for your data.
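
One possible way to compare settings, sketched on a made-up corpus, is to fit NMF for several topic counts and look at `reconstruction_err_`. The corpus, the candidate values and the name `errors` are assumptions for illustration. A lower error does not by itself mean better topics, since the error tends to fall as topics are added, so it is still worth reading the top terms for each setting.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus with two loose themes (conflict and sport)
toy_docs = [
    "troops advance near the border city",
    "border troops report heavy shelling",
    "city council debates shelling response",
    "football club wins cup final",
    "cup final draws record football crowd",
    "club signs striker before final",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

errors = {}
for k in (2, 3, 4):
    model = NMF(n_components=k, init="random", random_state=0, max_iter=500)
    model.fit(matrix)
    errors[k] = model.reconstruction_err_
    print(k, round(errors[k], 3))
```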

#### Advanced

You may obtain better results by doing the following:

1. Creating smaller documents (e.g. article paragraphs)
2. Pre-processing the text by stemming or lemmatizing, and by removing additional stop words.