<div style="background:#E9FFF6; color:#440404; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFN619 - Data Analytics for Strategic Decision Makers</div>

# IFN619 :: C1-UnstructuredAnalytics

For this tutorial, you will use the studio notebook as a guide, and:

1. Use the Guardian API to run your own search and obtain a JSON file of documents
2. Create a TF/IDF document-term matrix for your documents
3. Perform topic modelling of your documents using NMF

In [1]:
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import json
import random

### 1. Accessing the data via The Guardian API

Make a copy of the lecture notebook file (Accessing the Guardian API), and modify it to perform your own search of the Guardian API. **NOTE:** you will need to obtain your own developer API key first and put it in a file in the appropriate folder.

A suggested search term is "cyclone alfred", or come up with another that is of interest to you and will return a fair amount of data.

Save your search results in a JSON file, then read that data in below.
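As a rough sketch of what a modified search might look like, the helper below only assembles a request URL for the Guardian Content API search endpoint; it does not send the request. The function name `build_search_url`, the placeholder key and the chosen parameter values are assumptions for illustration, so check them against the lecture notebook and the API documentation before relying on them.

```python
from urllib.parse import urlencode

# Base address of the Guardian Content API search endpoint
GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"


def build_search_url(query, api_key, from_date=None, page=1, page_size=50):
    """Assemble a search URL (hypothetical helper; no request is sent)."""
    params = {
        "q": f'"{query}"',           # double quotes ask for the exact phrase
        "show-fields": "bodyText",   # include the plain-text article body
        "page-size": page_size,
        "page": page,
        "api-key": api_key,
    }
    if from_date:
        params["from-date"] = from_date  # e.g. "2025-03-01"
    return f"{GUARDIAN_SEARCH_URL}?{urlencode(params)}"


# "YOUR-KEY-HERE" is a placeholder, not a working key
url = build_search_url("cyclone alfred", api_key="YOUR-KEY-HERE", from_date="2025-03-01")
print(url)
```

Narrowing the query with an exact phrase and a start date is one way to cut down on unrelated articles before any filtering in the notebook.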

In [2]:
# Load the data - articles from The Guardian
file_path = "data/"
file_name = "cyclone.json"

with open(f"{file_path}{file_name}",'r', encoding='utf-8') as fp:
    articles = json.load(fp)

print(f"Loaded {len(articles)} articles from {file_name}")

Loaded 242 articles from cyclone.json


#### Discussion
Let's take a quick look at the articles we have collected.

In [3]:
# An overview of article titles
for title in articles.keys():
    print(title)

Ex-Tropical Cyclone Alfred: tracking rainfall and wind speeds [2025-03-09T22:42:37Z]
Ex-Tropical Cyclone Alfred: what we know so far [2025-03-09T03:52:12Z]
How does ex-Tropical Cyclone Alfred compare to past storms? [2025-03-08T19:12:06Z]
Send us your photographs and videos of ex-Tropical Cyclone Alfred [2025-03-06T06:32:50Z]
Flood-weary Murwillumbah waits and readies as Tropical Cyclone Alfred looms [2025-03-05T04:51:29Z]
Queensland evacuations begin as Cyclone Alfred storm path tracks towards Brisbane [2025-03-05T02:24:08Z]
Ex-Cyclone Alfred reaches mainland as heavy rain and damaging floods expected [2025-03-08T14:13:35Z]
Is climate change supercharging Tropical Cyclone Alfred as it powers towards Australia? [2025-03-07T01:04:51Z]
‘Born in a cyclone’: couple welcome baby Florence as Alfred rages outside [2025-03-10T14:00:12Z]
Ex-Tropical Cyclone Alfred aftermath: what to do once the storm has passed [2025-03-08T04:45:10Z]
Tropical Cyclone Alfred intensifies as latest forecast predic

Are all the retrieved articles relevant to the topic? If not, what strategies might we use to keep only the articles that are relevant?
- ???
- ???
- ???

In [4]:
# Implement some of your strategies - add more cells if needed

# e.g., refine search terms and apply appropriate filters when building a search URL
# DIY 

# e.g., remove articles whose title contains 'as it happened' - why?
## get a list of article titles
titles = list(articles.keys())
## create an empty list to store filtered titles
filtered_titles = []

for title in titles:
    # compare in lower case so that capitalisation does not matter
    if "as it happened" not in title.lower():
        filtered_titles.append(title)
        
# e.g., include titles that contain 'Cyclone Alfred' - why?
filtered_titles_2 = [title for title in filtered_titles if 'Cyclone Alfred' in title]

filtered_titles_2

['Ex-Tropical Cyclone Alfred: tracking rainfall and wind speeds [2025-03-09T22:42:37Z]',
 'Ex-Tropical Cyclone Alfred: what we know so far [2025-03-09T03:52:12Z]',
 'How does ex-Tropical Cyclone Alfred compare to past storms? [2025-03-08T19:12:06Z]',
 'Send us your photographs and videos of ex-Tropical Cyclone Alfred [2025-03-06T06:32:50Z]',
 'Flood-weary Murwillumbah waits and readies as Tropical Cyclone Alfred looms [2025-03-05T04:51:29Z]',
 'Queensland evacuations begin as Cyclone Alfred storm path tracks towards Brisbane [2025-03-05T02:24:08Z]',
 'Ex-Cyclone Alfred reaches mainland as heavy rain and damaging floods expected [2025-03-08T14:13:35Z]',
 'Is climate change supercharging Tropical Cyclone Alfred as it powers towards Australia? [2025-03-07T01:04:51Z]',
 'Ex-Tropical Cyclone Alfred aftermath: what to do once the storm has passed [2025-03-08T04:45:10Z]',
 'Tropical Cyclone Alfred intensifies as latest forecast predicts landfall just north of Brisbane [2025-03-04T03:58:28Z]',
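
The two filters above can be folded into one reusable function. The sketch below is one possible way to do it, not part of the studio notebook: `filter_titles` is a hypothetical helper, and the example titles are made up to show the behaviour.

```python
def filter_titles(titles, must_contain=(), must_not_contain=()):
    """Keep titles that contain every required phrase and no excluded phrase.

    Matching is case-insensitive. Hypothetical helper for illustration.
    """
    kept = []
    for title in titles:
        lowered = title.lower()
        has_required = all(p.lower() in lowered for p in must_contain)
        has_excluded = any(p.lower() in lowered for p in must_not_contain)
        if has_required and not has_excluded:
            kept.append(title)
    return kept


# Made-up titles, only to show the behaviour
example_titles = [
    "Ex-Tropical Cyclone Alfred: what we know so far",
    "Cyclone Alfred live updates: as it happened",
    "Morning Mail: budget preview",
]
kept = filter_titles(
    example_titles,
    must_contain=["cyclone alfred"],
    must_not_contain=["as it happened"],
)
print(kept)
```

Because the comparison is done in lower case, a title such as "cyclone alfred" written without capitals would still be kept.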

Since we have filtered out some titles, we also need to filter the `articles` dictionary (loaded from the JSON file) so that it holds only the relevant articles. If we skip this step, the analysis below will still include the articles we meant to exclude.

In [5]:
# Filter the JSON data to only include these titles
# advanced
# articles_filtered = {title: content for title, content in articles.items() if title in filtered_titles_2}

# an easier way
## Initialise an empty dictionary
articles_filtered = {}
## Loop through the articles and keep the ones whose title is in filtered_titles_2
for title, content in articles.items():
    if title in filtered_titles_2:
        articles_filtered[title] = content

articles_filtered

 'How does ex-Tropical Cyclone Alfred compare to past storms? [2025-03-08T19:12:06Z]': 'As ex-Tropical Cyclone Alfred crossed the coast , authorities remain concerned about heavy rainfall and the potential for flooding. It is expected to make landfall in Queensland on Saturday as a tropical low. According to the Bureau of Meteorology, over the years, at least 20 cyclones have approached within 300km of south-east Queensland and northern NSW. Only a few have made landfall, but history shows that cyclones, and even ex-cyclones, have the potential to wreak havoc in this corner of the country.    Related: Albanese warns against complacency as Cyclone Alfred weakens to tropical low off Queensland coast    Dr Stephen Turton, an adjunct professor at Central Queensland University who has studied tropical cyclones for more than 30 years, said events in 1954 and 1974 caused extensive damage, in large part because of extensive rainfall and flooding.  Sign up for Guardian Australia’s breaking news

#### Create a top-10 terms dataframe

Using the article titles as the index, create a dataframe that can hold the top 10 terms for each document.

In [6]:
# Create a dataframe to hold top terms for each analysis type
terms_df = pd.DataFrame(index=articles_filtered.keys(),columns=['tfidf','nmf'])
terms_df

| | tfidf | nmf |
|---|---|---|
| Ex-Tropical Cyclone Alfred: tracking rainfall and wind speeds [2025-03-09T22:42:37Z] | NaN | NaN |
| Ex-Tropical Cyclone Alfred: what we know so far [2025-03-09T03:52:12Z] | NaN | NaN |
| How does ex-Tropical Cyclone Alfred compare to past storms? [2025-03-08T19:12:06Z] | NaN | NaN |
| Send us your photographs and videos of ex-Tropical Cyclone Alfred [2025-03-06T06:32:50Z] | NaN | NaN |
| Flood-weary Murwillumbah waits and readies as Tropical Cyclone Alfred looms [2025-03-05T04:51:29Z] | NaN | NaN |
| Queensland evacuations begin as Cyclone Alfred storm path tracks towards Brisbane [2025-03-05T02:24:08Z] | NaN | NaN |
| Ex-Cyclone Alfred reaches mainland as heavy rain and damaging floods expected [2025-03-08T14:13:35Z] | NaN | NaN |
| Is climate change supercharging Tropical Cyclone Alfred as it powers towards Australia? [2025-03-07T01:04:51Z] | NaN | NaN |
| Ex-Tropical Cyclone Alfred aftermath: what to do once the storm has passed [2025-03-08T04:45:10Z] | NaN | NaN |
| Tropical Cyclone Alfred intensifies as latest forecast predicts landfall just north of Brisbane [2025-03-04T03:58:28Z] | NaN | NaN |


### Term Frequency / Inverse Document Frequency (TF/IDF)


In [7]:
# Set parameters appropriate to your data
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.75, min_df=2, max_features=10000, stop_words="english"
)

In [8]:
# Get the document vectors
tfidf_dt_matrix = tfidf_vectorizer.fit_transform(articles_filtered.values())

# Display the vector for the first document
tfidf_dt_matrix.toarray()[0]

array([0., 0., 0., ..., 0., 0., 0.])
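
To see where these weights come from, the sketch below works through TF/IDF by hand on three tiny made-up documents. It follows the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, but it is a simplified illustration rather than the library's code, and it skips tokenising, stop words and the `max_df` / `min_df` cut-offs.

```python
import math
from collections import Counter

# Three tiny made-up documents, already split into terms
docs = [
    ["cyclone", "flood", "flood"],
    ["cyclone", "wind"],
    ["cyclone", "insurance", "flood"],
]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: the number of documents each term appears in
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

tfidf_rows = []
for doc in docs:
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]  # term count * idf
    norm = math.sqrt(sum(w * w for w in raw))           # L2 length of the row
    tfidf_rows.append([w / norm for w in raw])

for row in tfidf_rows:
    print({t: round(w, 3) for t, w in zip(vocab, row) if w > 0})
```

A term that appears in every document ("cyclone" here) gets the lowest idf, which is why rarer terms rise to the top of each document's list. With `max_df=0.75`, as set in the vectorizer above, such a term would be dropped from the vocabulary altogether.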

#### Update the terms dataframe

In [9]:
# list of feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# create a df to combine matrix with feature names
tfidf_df = pd.DataFrame(tfidf_dt_matrix.toarray(), index=articles_filtered.keys(), columns=feature_names)
tfidf_df

| | 000 | 000km | 10 | 100 | 100km | 100mm | 10am | 10km | 11 | 11am | ... | years | yes | yesterday | yetta | young | youth | zealand | zelenskyy | zone | zones |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ex-Tropical Cyclone Alfred: tracking rainfall and wind speeds [2025-03-09T22:42:37Z] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ex-Tropical Cyclone Alfred: what we know so far [2025-03-09T03:52:12Z] | 0.050456 | 0.0 | 0.0 | 0.0 | 0.030221 | 0.0 | 0.066038 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| How does ex-Tropical Cyclone Alfred compare to past storms? [2025-03-08T19:12:06Z] | 0.053964 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.039321 | 0.0 | ... | 0.042303 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Send us your photographs and videos of ex-Tropical Cyclone Alfred [2025-03-06T06:32:50Z] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Flood-weary Murwillumbah waits and readies as Tropical Cyclone Alfred looms [2025-03-05T04:51:29Z] | 0.023938 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.056296 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Queensland evacuations begin as Cyclone Alfred storm path tracks towards Brisbane [2025-03-05T02:24:08Z] | 0.068654 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.026909 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.150074 |
| Ex-Cyclone Alfred reaches mainland as heavy rain and damaging floods expected [2025-03-08T14:13:35Z] | 0.056675 | 0.0 | 0.0 | 0.0 | 0.033947 | 0.047684 | 0.037089 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Is climate change supercharging Tropical Cyclone Alfred as it powers towards Australia? [2025-03-07T01:04:51Z] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ex-Tropical Cyclone Alfred aftermath: what to do once the storm has passed [2025-03-08T04:45:10Z] | 0.016294 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.01916 | 0.0 | 0.0 | 0.041128 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Tropical Cyclone Alfred intensifies as latest forecast predicts landfall just north of Brisbane [2025-03-04T03:58:28Z] | 0.0 | 0.0 | 0.0 | 0.0 | 0.038748 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.025356 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [10]:
# For each document, keep the 10 terms with the highest TF/IDF weight
for idx in terms_df.index:
    tfidf = dict(tfidf_df.loc[idx].sort_values(ascending=False).head(10))
    #print(tfidf)
    terms_df.at[idx,'tfidf'] = list(tfidf.keys())

terms_df

| | tfidf | nmf |
|---|---|---|
| Ex-Tropical Cyclone Alfred: tracking rainfall and wind speeds [2025-03-09T22:42:37Z] | [default, rain, historic, data, calculated, re... | NaN |
| Ex-Tropical Cyclone Alfred: what we know so far [2025-03-09T03:52:12Z] | [sunday, said, nsw, river, lismore, low, warni... | NaN |
| How does ex-Tropical Cyclone Alfred compare to past storms? [2025-03-08T19:12:06Z] | [said, caused, wanda, 1974, cyclones, 1954, me... | NaN |
| Send us your photographs and videos of ex-Tropical Cyclone Alfred [2025-03-06T06:32:50Z] | [share, videos, images, form, don, photos, inf... | NaN |
| Flood-weary Murwillumbah waits and readies as Tropical Cyclone Alfred looms [2025-03-05T04:51:29Z] | [floods, says, got, lot, flood, rain, going, r... | NaN |
| Queensland evacuations begin as Cyclone Alfred storm path tracks towards Brisbane [2025-03-05T02:24:08Z] | [properties, said, storm, modelling, surge, zo... | NaN |
| Ex-Cyclone Alfred reaches mainland as heavy rain and damaging floods expected [2025-03-08T14:13:35Z] | [saturday, said, sunday, nsw, downgraded, afte... | NaN |
| Is climate change supercharging Tropical Cyclone Alfred as it powers towards Australia? [2025-03-07T01:04:51Z] | [cyclones, temperatures, climate, form, change... | NaN |
| Ex-Tropical Cyclone Alfred aftermath: what to do once the storm has passed [2025-03-08T04:45:10Z] | [check, water, insurance, room, toilet, safe, ... | NaN |
| Tropical Cyclone Alfred intensifies as latest forecast predicts landfall just north of Brisbane [2025-03-04T03:58:28Z] | [said, coastal, tuesday, early, oates, gissing... | NaN |


### Topic modelling with Non-negative Matrix Factorisation (NMF)


[NMF](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) is another algorithm for obtaining *topics* (lists of terms) from a document-term matrix. It approximates the document-term matrix as the product of two non-negative factor matrices: document-topic and topic-term.
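
To make the two factor matrices concrete, here is a minimal NumPy sketch on a made-up 4 x 5 document-term matrix. It uses the classic multiplicative update rules, which are one well-known way to fit NMF; this is an illustration of the idea and not how scikit-learn's `NMF` class is implemented. The matrix values, `n_topics` and the iteration count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative document-term matrix: 4 documents x 5 terms (made-up values)
V = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.7, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.9, 0.0],
    [0.1, 0.0, 0.9, 0.7, 0.2],
])
n_topics = 2
eps = 1e-9  # avoids division by zero

W = rng.random((V.shape[0], n_topics))  # document-topic weights
H = rng.random((n_topics, V.shape[1]))  # topic-term weights

initial_error = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Multiplicative updates keep every entry non-negative
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(V - W @ H)

print("document-topic shape:", W.shape)
print("topic-term shape:", H.shape)
print("reconstruction error:", round(float(initial_error), 3), "->", round(float(final_error), 3))
```

Each row of `W` says how strongly a document loads on each topic, and each row of `H` says how strongly a topic loads on each term, which is what `doc_topic_nmf` and `topic_term_nmf` hold in the cells that follow.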

In [11]:
# Set the number of topics
num_topics = 20

# Create the model
# Note: init='random' without a fixed random_state can give different topics on each run
nmf_model = NMF(n_components=num_topics,init='random',beta_loss='frobenius')

# Fit the model to the data and use it to transform the data
doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix)

topic_term_nmf = nmf_model.components_

In [12]:
# Get the topics and their terms
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    zipped = zip(feature_names, topic)
    top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
    #print(top_terms)
    top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
    nmf_topic_dict[f"topic_{index}"] = top_terms_list

# Print the topics with their terms    
for k,v in nmf_topic_dict.items():
    print(k)
    print(v)
    print()

topic_0
{'said': 1.0973, 'friday': 0.6814, 'landfall': 0.6288, 'winds': 0.569, 'coastal': 0.563, 'storm': 0.4893, 'north': 0.4682, 'expected': 0.458, 'flooding': 0.4538, 'category': 0.4527}

topic_1
{'videos': 0.6846, 'share': 0.607, 'images': 0.5883, 'form': 0.4972, 'don': 0.4655, 'photos': 0.4324, 'information': 0.4244, 'active': 0.4055, 'risks': 0.3827, 'video': 0.3794}

topic_2
{'grant': 1.1581, 'eligible': 0.684, 'payment': 0.625, 'income': 0.6101, 'available': 0.5598, 'essential': 0.5426, 'recovery': 0.4777, 'disaster': 0.4484, 'grants': 0.4144, 'services': 0.3867}

topic_3
{'ants': 3.1936, 'ant': 2.2901, 'hay': 1.8909, 'said': 1.5673, 'spread': 1.3319, 'pianta': 1.2261, 'suppression': 1.035, 'invasive': 0.9783, 'rifa': 0.8905, 'pest': 0.7459}

topic_4
{'sunday': 2.0585, 'nsw': 1.4637, 'said': 1.2528, 'services': 1.0326, 'power': 0.9802, 'schools': 0.889, 'river': 0.8841, 'flood': 0.8626, 'saturday': 0.7556, 'ex': 0.7339}

topic_5
{'word': 2.2694, 'starter': 1.202, 'trump': 0.775

#### Update the terms dataframe

In [13]:
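# Give each document the top terms of its highest-weighted topic
for idx,topic in enumerate(doc_topic_nmf):
    topic_num = topic.argmax()
    top_topic = nmf_topic_dict[f"topic_{topic_num}"]
    title = terms_df.index[idx]
    # .at sets a single cell, which is the safe way to store a list in it
    terms_df.at[title, 'nmf'] = list(top_topic.keys())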

terms_df

| | tfidf | nmf |
|---|---|---|
| Ex-Tropical Cyclone Alfred: tracking rainfall and wind speeds [2025-03-09T22:42:37Z] | [default, rain, historic, data, calculated, re... | [default, rain, historic, data, calculated, re... |
| Ex-Tropical Cyclone Alfred: what we know so far [2025-03-09T03:52:12Z] | [sunday, said, nsw, river, lismore, low, warni... | [sunday, nsw, said, services, power, schools, ... |
| How does ex-Tropical Cyclone Alfred compare to past storms? [2025-03-08T19:12:06Z] | [said, caused, wanda, 1974, cyclones, 1954, me... | [said, caused, wanda, 1974, cyclones, 1954, me... |
| Send us your photographs and videos of ex-Tropical Cyclone Alfred [2025-03-06T06:32:50Z] | [share, videos, images, form, don, photos, inf... | [videos, share, images, form, don, photos, inf... |
| Flood-weary Murwillumbah waits and readies as Tropical Cyclone Alfred looms [2025-03-05T04:51:29Z] | [floods, says, got, lot, flood, rain, going, r... | [says, flood, floods, got, 2022, really, lot, ... |
| Queensland evacuations begin as Cyclone Alfred storm path tracks towards Brisbane [2025-03-05T02:24:08Z] | [properties, said, storm, modelling, surge, zo... | [said, friday, landfall, winds, coastal, storm... |
| Ex-Cyclone Alfred reaches mainland as heavy rain and damaging floods expected [2025-03-08T14:13:35Z] | [saturday, said, sunday, nsw, downgraded, afte... | [sunday, nsw, said, services, power, schools, ... |
| Is climate change supercharging Tropical Cyclone Alfred as it powers towards Australia? [2025-03-07T01:04:51Z] | [cyclones, temperatures, climate, form, change... | [climate, cyclones, atmosphere, temperatures, ... |
| Ex-Tropical Cyclone Alfred aftermath: what to do once the storm has passed [2025-03-08T04:45:10Z] | [check, water, insurance, room, toilet, safe, ... | [house, check, water, official, power, disaste... |
| Tropical Cyclone Alfred intensifies as latest forecast predicts landfall just north of Brisbane [2025-03-04T03:58:28Z] | [said, coastal, tuesday, early, oates, gissing... | [said, friday, landfall, winds, coastal, storm... |
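
Taking the `argmax` of a document's topic weights keeps only the single strongest topic and hides how close the runner-up was. The sketch below, on a made-up document-topic array, prints the dominant topic together with its share of the document's total weight; the numbers are invented purely to show the idea.

```python
import numpy as np

# Made-up document-topic weights: 3 documents x 4 topics
doc_topic = np.array([
    [0.02, 0.61, 0.05, 0.00],
    [0.30, 0.28, 0.00, 0.01],
    [0.00, 0.00, 0.00, 0.44],
])

for doc_idx, weights in enumerate(doc_topic):
    dominant = int(weights.argmax())
    share = weights[dominant] / weights.sum()  # fraction of the total weight
    print(f"document {doc_idx}: topic_{dominant} ({share:.0%} of total weight)")

# The same dominant-topic choice for all documents at once
dominant_topics = doc_topic.argmax(axis=1).tolist()
```

In the second made-up document two topics are almost tied, so labelling it with only the top topic would discard about half of what the model found. Checking the share like this can help explain cases where the TF/IDF terms and the NMF terms for an article look unrelated.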


### Check against articles

In [14]:
# Sample 5 random articles
samples = random.sample(range(0,len(terms_df)),5)

for sample in samples:
    doc = terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t- TFIDF:\t",doc['tfidf'])
    print("\t- NMF:\t\t",doc['nmf'])
    print()

[20] ‘A perfect storm’: the dedicated rescuers caring for sodden seabirds blown in by Cyclone Alfred [2025-03-14T23:00:33Z]
	- TFIDF:	 ['birds', 'seabirds', 'bird', 'say', 'events', 'weekend', 'says', 'experience', 'happening', 'course']
	- NMF:		 ['hospital', 'leaf', 'laura', 'wilson', 'care', 'animals', 'wildlife', 'round', 'surgery', 'spare']

[23] Cyclone Alfred brought Brisbane’s fourth major flood in 50 years – can the city be flood-proofed? [2025-03-14T23:00:32Z]
	- TFIDF:	 ['says', 'flood', 'built', 'insurance', 'city', 'areas', 'risk', 'floods', 'years', 'regularly']
	- NMF:		 ['says', 'flood', 'floods', 'got', '2022', 'really', 'lot', 'going', 'years', 'built']

[32] ‘I need to survive’: rower attempting to cross Pacific activates emergency beacon off Queensland near Cyclone Alfred [2025-03-02T04:10:09Z]
	- TFIDF:	 ['mockus', 'sunday', 'row', 'amsa', 'rowing', 'ship', 'said', 'pacific', 'typhoon', 'coral']
	- NMF:		 ['mockus', 'said', 'rowing', 'boat', 'typhoon', 'team', 'rou

## Refine your analysis

Once you have worked through the process, try tweaking the parameters of the TF/IDF vectorizer and the NMF topic model to obtain better results for your data.
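
One way to organise the tweaking is to loop over a few candidate settings and compare them side by side. The sketch below varies the number of topics on a made-up mini corpus and records scikit-learn's `reconstruction_err_` for each fit. The corpus, the candidate values and the choice of `init="nndsvd"` with a fixed `random_state` are assumptions made so that the runs are repeatable.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus standing in for your articles
toy_docs = [
    "cyclone winds landfall coast warning",
    "cyclone landfall winds forecast coast",
    "flood river levee town flood warning",
    "river flood evacuation town levee",
    "insurance claim damage payment grant",
    "grant payment recovery insurance claim",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

errors = {}
for k in (2, 3, 4):
    # A deterministic init and fixed random_state make the runs repeatable
    model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=1000)
    model.fit(matrix)
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"{k} topics: reconstruction error {err:.3f}")
```

Reconstruction error tends to fall as topics are added, so treat it as a rough guide only and still read the top terms of each topic to judge whether they make sense.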

#### Advanced

You may obtain better results by doing the following:

- Creating smaller documents (e.g. article paragraphs)
- Pre-processing the text by stemming or lemmatizing, and by removing additional stop words
- ???
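
As a starting point for the first suggestion, the sketch below splits each article into paragraph-sized documents. `split_into_paragraphs` is a hypothetical helper, and it assumes paragraphs are separated by blank lines; inspect your own article text first, because the separator in your data may be different.

```python
import re


def split_into_paragraphs(articles, min_words=5):
    """Return {"<title> #<n>": paragraph} for each paragraph long enough to keep.

    Hypothetical helper: assumes paragraphs are separated by blank lines.
    """
    chunks = {}
    for title, text in articles.items():
        paragraphs = re.split(r"\n\s*\n", text)
        kept = [p.strip() for p in paragraphs if len(p.split()) >= min_words]
        for number, paragraph in enumerate(kept, start=1):
            chunks[f"{title} #{number}"] = paragraph
    return chunks


# Made-up article, only to show the behaviour
example = {
    "Example title": "Heavy rain fell across the region overnight.\n\nShort one.\n\n"
                     "Residents were told to prepare for further flooding on Sunday."
}
chunks = split_into_paragraphs(example)
print(list(chunks))
```

The resulting dictionary has the same title-to-text shape as `articles_filtered`, so it could be passed through the same TF/IDF and NMF cells, with each paragraph treated as its own document.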