# Lab 6. Natural Language Processing
# Task 6.3 Topic Modelling
## Problem Description
Topic modelling is a type of statistical modelling that uses unsupervised machine learning to discover hidden semantic patterns, or groups of related words, in a collection of text documents. This task builds a Latent Dirichlet Allocation (LDA) model on two short documents, lists the top contributing words for each discovered topic, and then infers the topic mixture of a new, unseen document.


## Implementation and results

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora


In [None]:
documents = [
  """
  Artificial Intelligence (AI) has become a pivotal force
  in the modern technological landscape. AI systems, driven by
  machine learning algorithms, are enhancing capabilities in
  various sectors, from healthcare diagnostics to financial
  analysis. The development of autonomous vehicles is a testament
  to the progress in AI, integrating complex algorithms for
  navigation and safety. Another significant advancement is in
  natural language processing, enabling machines to understand and
  respond to human language more effectively. Ethical considerations
  in AI development, particularly regarding data privacy and algorithmic
  bias, are increasingly becoming topics of critical discourse.
  The future of AI holds immense potential, but it also necessitates
  responsible innovation to address potential societal impacts.
  """,
  """
  Environmental conservation is a global imperative, addressing critical
  issues such as climate change, habitat loss, and biodiversity decline.
  Conservation efforts are focusing on protecting natural ecosystems,
  which are vital for maintaining ecological balance and supporting life.
  Initiatives like reforestation and wildlife protection programs are crucial
  in combating the effects of deforestation and species extinction. Sustainable
  practices in agriculture and industry are being promoted to reduce
  environmental footprints. The rise in renewable energy sources, such
  as solar and wind, is playing a key role in reducing greenhouse gas
  emissions. Public awareness and community involvement in environmental
  conservation are essential for fostering a sustainable relationship between
  humans and the natural world.
  """
]

In [None]:
# Clean the data: lowercase, split on whitespace, remove stop words, then stem
nltk.download('stopwords')
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
texts = [
    [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
    for document in documents
]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Create a dictionary from the words
dictionary = corpora.Dictionary(texts)

# Create a document-term matrix
doc_term_mat = [dictionary.doc2bow(text) for text in texts]

# Generate the LDA model
num_topics = 2
ldamodel = models.ldamodel.LdaModel(doc_term_mat,
        num_topics=num_topics, id2word=dictionary, passes=25)


In [None]:
num_words = 5

# Print each topic as the raw weighted-word string
for i in range(num_topics):
    print(ldamodel.print_topic(i, topn=num_words))

# Print the same words as percentages, one per line
print('\nTop ' + str(num_words) + ' contributing words to each topic:')
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
    print('\nTopic', item[0])
    # item[1] looks like: 0.025*"ai" + 0.018*"algorithm" + ...
    list_of_strings = item[1].split(' + ')
    for text in list_of_strings:
        details = text.split('*')  # [weight, quoted word]
        print("%-12s:%0.2f%%" % (details[1], 100 * float(details[0])))


0.025*"ai" + 0.018*"algorithm" + 0.018*"languag" + 0.018*"becom" + 0.018*"machin"
0.026*"conserv" + 0.026*"environment" + 0.019*"natur" + 0.019*"reduc" + 0.019*"sustain"

Top 5 contributing words to each topic:

Topic 0
"ai"        :2.50%
"algorithm" :1.80%
"languag"   :1.80%
"becom"     :1.80%
"machin"    :1.80%

Topic 1
"conserv"   :2.60%
"environment":2.60%
"natur"     :1.90%
"reduc"     :1.90%
"sustain"   :1.90%


In [None]:
new_docs = [
  """
  The integration of Artificial Intelligence (AI) in environmental conservation
  efforts is opening new avenues for sustainable development. AI-powered systems
  are being employed to analyze vast amounts of environmental data, aiding in
  climate change research and wildlife monitoring. For instance, machine
  learning algorithms can predict deforestation trends and identify areas at
  risk, enabling proactive conservation measures. In wildlife conservation,
  AI is used to analyze camera trap images, facilitating the tracking and study
  of animal populations without human intrusion. This synergy of AI and
  environmental conservation not only enhances efficiency in data processing and
  analysis but also provides innovative solutions to complex environmental
  challenges. The fusion of these fields exemplifies the potential of technology
  in aiding and amplifying conservation efforts.
  """
]

new_texts = [
  [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
  for document in new_docs
  ]
new_doc_term_mat = [dictionary.doc2bow(text) for text in new_texts]

vector = ldamodel[new_doc_term_mat]
print(vector[0])


[(0, 0.5307139), (1, 0.4692861)]


## Discussion
In this task, two text documents, one on "artificial intelligence" and one on "environmental conservation", are fed into the LDA model as training data. After training, the top 5 contributing words for each topic are generated.

Topic 0
* "ai"        :2.50%
* "algorithm" :1.80%
* "languag"   :1.80%
* "becom"     :1.80%
* "machin"    :1.80%

The top words for the first topic are consistent with "artificial intelligence". The stem "ai" itself carries the highest weight, and the stems "algorithm", "languag" (language) and "machin" (machine) are all closely related to the subject.


Topic 1
* "conserv"    :2.60%  
* "environment":2.60%  
* "natur"      :1.90%  
* "reduc"      :1.90%  
* "sustain"    :1.90%  

The two highest-weighted stems, "conserv" and "environment", make it clear that this topic is "environmental conservation". The remaining stems, "natur" (natural), "reduc" (reduce) and "sustain" (sustainable), are closely related to it as well.
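
The word lists for both topics were produced by splitting the formatted string returned by `print_topics`. A minimal sketch of an alternative, assuming the weights are available as (word, probability) pairs (gensim's `LdaModel.show_topic(topic_id, topn=...)` returns a list in that shape). The pairs below are copied from the printed output rather than recomputed, and `format_topic` is a hypothetical helper, not part of the notebook:

```python
# Hypothetical helper: format (word, probability) pairs as aligned percentage rows.
def format_topic(pairs):
    return [f"{word:<12}: {100 * prob:.2f}%" for word, prob in pairs]

# Pairs copied from the notebook output for topic 0. With a trained model they
# could instead come from ldamodel.show_topic(0, topn=5) (assumes gensim's LdaModel).
topic_0 = [("ai", 0.025), ("algorithm", 0.018), ("languag", 0.018),
           ("becom", 0.018), ("machin", 0.018)]

for row in format_topic(topic_0):
    print(row)
```

Working with pairs avoids depending on the exact `weight*"word" + ...` string layout and keeps the weights as numbers for further processing.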

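Projecting the new document onto the LDA topic space gives [(0, 0.5307139), (1, 0.4692861)], i.e. roughly 53% topic 0 and 47% topic 1. This near-even split is reasonable because the new document discusses the use of artificial intelligence in environmental conservation and therefore draws on the vocabulary of both topics. Because LDA training is stochastic and no random seed is fixed here, the exact proportions may differ between runs.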
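
As a rough illustration of how such a topic distribution can be read programmatically, the sketch below picks the dominant topic and reports how far ahead it is. The vector is copied from the output above; the `topic_labels` mapping and the `dominant_topic` helper are hypothetical, and the labels assume that topic 0 is the AI topic, which may swap between training runs:

```python
# Topic distribution copied from the notebook output: (topic id, probability).
vector = [(0, 0.5307139), (1, 0.4692861)]

# Hypothetical labels, assigned by reading the top words of each topic.
topic_labels = {0: "artificial intelligence", 1: "environmental conservation"}

def dominant_topic(distribution):
    """Return the (topic_id, probability) pair with the largest probability."""
    return max(distribution, key=lambda pair: pair[1])

topic_id, prob = dominant_topic(vector)
# Gap between the largest and smallest topic probability.
margin = prob - min(p for _, p in vector)
print(f"Dominant topic: {topic_labels[topic_id]} ({prob:.1%}), margin {margin:.1%}")
```

A small margin, as here, suggests the document is a genuine mixture of both topics rather than a clear member of one.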

In conclusion, the model separates the two topics reliably on this small corpus, but the preprocessing could be improved. The stem "becom" (from "become" and "becoming") appears among the top 5 contributing words of the "artificial intelligence" topic even though it carries little topical meaning. It is not removed by the stop word filter, so extending the stop word list would give cleaner topics.
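
One possible way to make that improvement is sketched below, under stated assumptions: `preprocess` is a hypothetical helper, `extra_stop_words` is a hand-picked set, and the tiny `base_stop_words` set only stands in for the NLTK list so that the example runs on its own. Stripping surrounding punctuation is an extra step that the notebook's pipeline does not perform.

```python
import string

def preprocess(document, stop_words, stem=lambda word: word):
    """Lowercase, strip surrounding punctuation, drop stop words, then stem."""
    tokens = []
    for raw in document.lower().split():
        word = raw.strip(string.punctuation)  # "(ai)" -> "ai", "landscape." -> "landscape"
        if word and word not in stop_words:
            tokens.append(stem(word))
    return tokens

# Small illustrative stop word set; in the notebook this would be the
# NLTK English list, e.g. set(stopwords.words('english')).
base_stop_words = {"has", "a", "in", "the", "are"}
# Assumption: words judged uninformative for this corpus.
extra_stop_words = {"become", "becoming"}
stop_words = base_stop_words | extra_stop_words

sample = ("Artificial Intelligence (AI) has become a pivotal force "
          "in the modern technological landscape.")
print(preprocess(sample, stop_words))
```

In the notebook, passing `stem=stemmer.stem` would apply the Snowball stemmer after filtering. The filter runs before stemming, so the extra entries are listed in their unstemmed forms.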
