Use the following list to find open-source data sets for the take-home exercises. You can also apply these exercises to the data set provided for AT1.

[Open Data Sets](https://canvas.uts.edu.au/courses/32341/pages/open-data-sets-for-nlp-and-text-analysis?module_item_id=1878922)

### Pre-processing

1. Calculate word associations in a large data set, trying different association measures (e.g. PMI, chi-square test).
2. Compare lemmatization and stemming results.
3. Add another pre-processing step that removes all numbers/digits from the text.
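
For exercise 3, one possible approach is a regular-expression step applied before tokenization. The sketch below is only an illustration under that assumption: `remove_digits` is a hypothetical helper name, and the sample sentence is made up.

```python
import re

def remove_digits(text):
    """Replace every run of digits with a space, then collapse repeated whitespace."""
    without_digits = re.sub(r"\d+", " ", text)
    return " ".join(without_digits.split())

sample = "In 2023 we processed 1500 documents in 3 batches."
cleaned = remove_digits(sample)
print(cleaned)
```

Note that this simple pattern leaves punctuation behind (for example, `3.5` becomes a lone `.`), so you may want to run it before your punctuation filter.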




In [None]:
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.corpus import stopwords
import re

# Sample text corpus
corpus = """This is a large dataset containing different word associations.
            For example, word associations such as 'text analysis' or 'natural language'
            are commonly found in this dataset."""

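# Download the tokenizer models and stop-word list on first use
# (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
for resource in ('punkt', 'punkt_tab', 'stopwords'):
    nltk.download(resource, quiet=True)

# Tokenize the text
words = nltk.word_tokenize(corpus.lower())

# The tokenizer can leave a quote attached to a token (e.g. "'text"), so strip those first
words = [word.strip("'\"`") for word in words]

# Remove stopwords and keep only purely alphabetic tokens
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words and re.fullmatch(r'[a-z]+', word)]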

# Use NLTK's BigramCollocationFinder to find word pairs
finder = BigramCollocationFinder.from_words(filtered_words)

# PMI calculation
pmi_scores = finder.score_ngrams(BigramAssocMeasures.pmi)

# Print top 5 word pairs with highest PMI
print("Top 5 word pairs by PMI:", pmi_scores[:5])
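
To see what a PMI score measures, it can help to compute it by hand. The sketch below uses plain Python and simple count-based probability estimates, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It is not NLTK's implementation, so its numbers may differ slightly from `BigramAssocMeasures.pmi`. The `pmi_by_hand` helper and the toy token list are assumptions made for illustration.

```python
import math
from collections import Counter

def pmi_by_hand(tokens):
    """Return {(w1, w2): PMI} for adjacent token pairs using relative-frequency estimates."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_xy = count / n_bigrams               # probability of the pair
        p_x = unigram_counts[w1] / n_tokens    # probability of the first word
        p_y = unigram_counts[w2] / n_tokens    # probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = "text analysis uses text data and text analysis tools".split()
scores = pmi_by_hand(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(pair, round(score, 3))
```

On a corpus this small, pairs of words that each occur only once get the highest PMI, which is one reason to try the measure on a large data set and to compare it with other measures such as chi-square.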


### Topic Modeling

Try out topic modeling using the scikit-learn (sklearn) package. There are several algorithms you can read about and experiment with: Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD, NMF

# Example corpus
corpus = [
    "Natural language processing is an interesting field.",
    "Machine learning and deep learning are crucial for NLP tasks.",
    "Text analysis involves extracting meaningful insights from text data.",
    "Deep learning enables language models to learn context.",
    "Latent Dirichlet Allocation is a popular topic modeling technique.",
    "Non-negative Matrix Factorization helps in topic discovery.",
    "Latent Semantic Analysis uncovers hidden structures in text data."
]

# Vectorize the corpus using TF-IDF for LSA and NMF, and Count for LDA
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

count_vectorizer = CountVectorizer(stop_words='english')
X_count = count_vectorizer.fit_transform(corpus)

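# Helper function to display the highest-weighted words of each topic
def display_topics(model, feature_names, num_top_words):
    for idx, topic in enumerate(model.components_):
        # argsort is ascending, so reverse it and take the first num_top_words indices
        top_indices = topic.argsort()[::-1][:num_top_words]
        print(f"Topic {idx + 1}: " + " ".join(feature_names[i] for i in top_indices))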

# 1. Latent Semantic Analysis (LSA)
print("Latent Semantic Analysis (LSA) Topics:")
lsa = TruncatedSVD(n_components=2, random_state=42)
lsa.fit(X_tfidf)
display_topics(lsa, tfidf_vectorizer.get_feature_names_out(), 3)

# 2. Latent Dirichlet Allocation (LDA)
print("\nLatent Dirichlet Allocation (LDA) Topics:")
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X_count)
display_topics(lda, count_vectorizer.get_feature_names_out(), 3)

# 3. Non-negative Matrix Factorization (NMF)
print("\nNon-negative Matrix Factorization (NMF) Topics:")
nmf = NMF(n_components=2, random_state=42)
nmf.fit(X_tfidf)
display_topics(nmf, tfidf_vectorizer.get_feature_names_out(), 3)
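
Besides the top words per topic, it is often useful to see how strongly each document belongs to each topic. A minimal sketch, assuming an LDA model like the one above: `fit_transform` returns one row of topic weights per document. The three-document corpus here is made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Deep learning models learn language context.",
    "Topic modeling uncovers hidden topics in text data.",
    "Text analysis extracts insights from text data.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(X)  # shape: (number of documents, number of topics)

# Each row is a distribution over topics, so its weights sum to 1
for i, weights in enumerate(doc_topics):
    print(f"Document {i + 1}: dominant topic {weights.argmax() + 1}, weights {weights.round(2)}")
```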


### Text Clustering

Try out text clustering with a different dataset and build an optimized model by re-evaluating the number of clusters.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Sample dataset of text documents
corpus = [
    "Natural language processing is an exciting field of study.",
    "Deep learning is essential for tasks in NLP.",
    "Text analysis helps to derive insights from data.",
    "Machine learning enables the creation of predictive models.",
    "Clustering techniques like K-means help to organize data.",
    "Topic modeling discovers hidden structures in data.",
    "Supervised learning involves labeled data for model training."
]

# Step 1: Text Vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

# Step 2: Determine the number of clusters using the elbow method and silhouette scores
inertia = []
silhouette_scores = []
# silhouette_score needs between 2 and n_samples - 1 clusters, so stop below the corpus size
k_values = range(2, len(corpus))

for k in k_values:
    # n_init is set explicitly because its default changed between scikit-learn releases
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
    silhouette_avg = silhouette_score(X, kmeans.labels_)
    silhouette_scores.append(silhouette_avg)

# Plot the Elbow Method
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(k_values, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')

# Plot Silhouette Score
plt.subplot(1, 2, 2)
plt.plot(k_values, silhouette_scores, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.show()

# Step 3: Fit the final model
# Choose k from the plots (e.g., k=3 if the elbow or the highest silhouette score is at 3)
optimal_k = 3
kmeans_optimal = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
kmeans_optimal.fit(X)

# Display clustering results
for i, label in enumerate(kmeans_optimal.labels_):
    print(f"Document {i+1} is in Cluster {label}")
