<a href="https://colab.research.google.com/github/gyasifred/NLP-Techniques/blob/main/Latent_Semantic_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates Latent Semantic Analysis (LSA) on a collection of book titles, using NLTK for text preprocessing and scikit-learn for vectorization and dimensionality reduction.

In [3]:
# Download the required dataset
!wget -nc https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/nlp_class/all_book_titles.txt

--2023-07-27 17:47:27--  https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/nlp_class/all_book_titles.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127992 (125K) [text/plain]
Saving to: ‘all_book_titles.txt’


2023-07-27 17:47:27 (5.68 MB/s) - ‘all_book_titles.txt’ saved [127992/127992]



Libraries

In [4]:
import numpy as np
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
import nltk

In [9]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# Read the titles into a Python list, one title per line
with open("all_book_titles.txt") as f:
    titles = [line.strip() for line in f]
# Show the first 5 titles
titles[:5]

['Philosophy of Sex and Love A Reader',
 'Readings in Judaism, Christianity, and Islam',
 'Microprocessors Principles and Applications',
 'Bernhard Edouard Fernow: Story of North American Forestry',
 'Encyclopedia of Buddhism']

In [7]:
Wordnet_lemmatizer = WordNetLemmatizer()

In [12]:
# Get stop words
stop_words = set(stopwords.words('english'))

In [13]:
def my_tokenizer(text):
    """
    Tokenizes the input text, removes stopwords, lemmatizes words, and filters out short and digit-containing tokens.

    Parameters:
    text (str): The input text to be tokenized.

    Returns:
    list: A list of processed tokens.
    """
    text = text.strip().lower()
    # Tokenize words
    tokens = nltk.tokenize.word_tokenize(text)
    # Remove one- and two-character tokens, e.g. "a", "of"
    tokens = [word for word in tokens if len(word) > 2]
    # Obtain lemma of words
    tokens = [Wordnet_lemmatizer.lemmatize(token) for token in tokens]
    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    # Remove tokens that contain any digit, e.g. "3rd"
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    return tokens
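
To make the filtering steps concrete, here is a minimal sketch that runs without NLTK data. It is not the tokenizer above: `toy_filter` and `toy_stop_words` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and lemmatization is skipped.

```python
# Hypothetical stand-ins so the sketch runs without NLTK downloads:
# a tiny stop-word set, and str.split() instead of word_tokenize.
toy_stop_words = {"the", "and", "for"}

def toy_filter(text):
    """Apply the same length, stop-word, and digit filters to one title."""
    tokens = text.strip().lower().split()
    # Drop tokens of one or two characters
    tokens = [t for t in tokens if len(t) > 2]
    # Drop stop words
    tokens = [t for t in tokens if t not in toy_stop_words]
    # Drop tokens that contain any digit
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    return tokens

print(toy_filter("Principles of Economics 3rd Edition and Workbook"))
# ['principles', 'economics', 'edition', 'workbook']
```

Here "of" is removed by the length filter, "and" by the stop-word filter, and "3rd" by the digit filter.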

In [14]:
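# Binary bag-of-words: each entry records whether a term occurs in a title
vectorizer = CountVectorizer(binary=True, tokenizer=my_tokenizer)
# Fit the vocabulary and build the document-term matrix (rows = titles, columns = terms)
x = vectorizer.fit_transform(titles)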



In [19]:
# Get the index to word mapping
ind2word = vectorizer.get_feature_names_out()
ind2word

array(["'the", '...', 'a-z', ..., 'zen', 'zionism', 'zurich'],
      dtype=object)

In [27]:
# Transpose x so that rows = terms and columns = documents
X = x.T
X.shape

(2151, 2373)
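
The shape above reads as 2151 terms by 2373 titles. As a rough illustration of what a binary term-document matrix contains, the sketch below builds one by hand for three made-up, already tokenized titles. The names `docs`, `vocab`, `doc_term`, and `term_doc` are hypothetical and not part of the notebook.

```python
# Three made-up, already tokenized "titles"
docs = [
    ["history", "science"],
    ["science", "fiction", "science"],
    ["history", "art"],
]

# Sorted vocabulary plays the role of the index-to-word mapping
vocab = sorted({w for d in docs for w in d})

# Binary document-term matrix: 1 if the term occurs in the title, else 0.
# A repeated term ("science" in the second title) still yields 1.
doc_term = [[1 if w in d else 0 for w in vocab] for d in docs]

# Transpose so that rows = terms and columns = documents
term_doc = [list(col) for col in zip(*doc_term)]

print(vocab)
for word, row in zip(vocab, term_doc):
    print(word, row)
```

Each row of `term_doc` is the vector for one term, which is the orientation the SVD step works on.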

In [22]:
# Reduce to 2 dimensions so the terms can be visualized
# (n_components=2 is also TruncatedSVD's default)
svd = TruncatedSVD(n_components=2)
Z = svd.fit_transform(X)

In [25]:
Z.shape

(2151, 2)
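
Each of the 2151 terms now has a 2-dimensional vector. The sketch below shows the underlying idea with a plain NumPy SVD on a small made-up matrix. It assumes, as the scikit-learn documentation describes, that `fit_transform` returns the left singular vectors scaled by the singular values; `TruncatedSVD` may differ from this exact computation in sign and by small numerical error. The names `A`, `Z_toy`, and `A_k` are hypothetical.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); values are made up
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Full (thin) SVD: A = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Keep the top-k components: one k-dimensional row per term
Z_toy = U[:, :k] * s[:k]

# Rank-k approximation of A rebuilt from the reduced representation
A_k = Z_toy @ Vt[:k, :]

print(Z_toy.shape)               # (5, 2)
print(np.linalg.norm(A - A_k))   # Frobenius error of the rank-2 approximation
```

The reconstruction error is determined by the discarded singular values, so keeping only two components trades accuracy for a representation that can be plotted.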

In [28]:
!pip install plotly



In [32]:
import plotly.express as px
fig = px.scatter(x=Z[:,0], y=Z[:,1], text=ind2word, size_max=60)
fig.update_traces(textposition='top center')
fig.show()
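
With more than two thousand labels the scatter plot gets crowded, so it can help to query neighbours numerically. The following is a minimal sketch, assuming Euclidean distance in the reduced space is a reasonable notion of closeness. `toy_words` and `toy_Z` are made-up stand-ins for `ind2word` and `Z`, and `nearest_terms` is a hypothetical helper.

```python
import numpy as np

# Made-up 2-D coordinates and labels standing in for Z and ind2word
toy_words = np.array(["history", "war", "biology", "cell", "empire"])
toy_Z = np.array([
    [2.0, 0.1],
    [1.8, 0.3],
    [0.2, 1.9],
    [0.1, 2.2],
    [2.5, -0.2],
])

def nearest_terms(word, words, Z, top_n=2):
    """Return the top_n terms closest to `word` by Euclidean distance."""
    idx = int(np.where(words == word)[0][0])
    dists = np.linalg.norm(Z - Z[idx], axis=1)
    order = np.argsort(dists)
    # Skip the query term itself, then keep the closest top_n
    return [str(words[i]) for i in order if i != idx][:top_n]

print(nearest_terms("history", toy_words, toy_Z))
# ['war', 'empire']
```

The same function could be called with `ind2word` and `Z` from the cells above, provided the queried word is in the vocabulary.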
