<a href="https://colab.research.google.com/github/guilhermelaviola/NaturalLanguageProcessing/blob/main/Class07.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Document Clustering & Word Clouds**
Document clustering and word clouds are two NLP techniques for analyzing large collections of text. Document clustering groups similar documents by representing each one as a numerical vector, often with the Bag of Words model, and comparing the vectors with a metric such as cosine similarity. This makes it especially useful in recommendation systems, which can suggest content with similar themes.

Word clouds, in contrast, display word frequency visually: the more often a term occurs, the larger it is drawn, so key themes and trends stand out at a glance. Creating a word cloud involves tokenization, stop word removal, and word frequency counting, often with the help of Python libraries. Together, these techniques help organize, interpret, and extract insights from textual data in applications such as text mining, sentiment analysis, and recommendation systems.
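
As a minimal sketch of the Bag of Words and cosine similarity ideas described above, the example below builds count vectors for three made-up token lists and compares them. The documents, the `bag_of_words` helper, and the `cosine_similarity_score` helper are illustrative assumptions, not part of the notebook's dataset or pipeline.

```python
import math

# Made-up token lists standing in for tokenized documents (assumption)
docs = [
    ["mafia", "family", "crime", "family"],
    ["crime", "mafia", "boss"],
    ["robot", "love", "future"],
]

# Vocabulary: every distinct token, in a fixed (sorted) order
vocabulary = sorted({token for doc in docs for token in doc})

def bag_of_words(tokens):
    # One count per vocabulary entry, in vocabulary order
    return [tokens.count(word) for word in vocabulary]

def cosine_similarity_score(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vectors = [bag_of_words(doc) for doc in docs]
sim_01 = cosine_similarity_score(vectors[0], vectors[1])
sim_02 = cosine_similarity_score(vectors[0], vectors[2])
print(sim_01, sim_02)
```

The first two documents share tokens and score above zero, while the first and third share none and score exactly zero, which is the property a clustering step can rely on.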

In [None]:
# Importing all the necessary libraries and resources:
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
# Reading the dataset:
df = pd.read_csv('dataset_movies.csv')

# Assuming 'df' has a 'tokens' column holding a list of tokens for each movie, we can count how often each token occurs:
def count_frequency(tokens):
    frequencies = {}
    for token in tokens:
        if token in frequencies:
            frequencies[token] += 1
        else:
            frequencies[token] = 1
    return frequencies

# Generating a Word Cloud for each cluster ('clusters' is assumed to be defined elsewhere as a list of lists of 'df' row labels):
for cluster in clusters:
    tokens_cluster = []
    for movie in cluster:
        tokens_cluster.extend(df.loc[movie, 'tokens'])
    frequencies = count_frequency(tokens_cluster)
    wordcloud = WordCloud(background_color='white').generate_from_frequencies(frequencies)

    # Displaying the Word Cloud:
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Cluster Word Cloud {cluster[0]}')
    plt.show()
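
The per-cluster counting step in the cell above can also be sketched with `collections.Counter` from the standard library. The `tokens_by_movie` mapping and the `clusters` list below are hypothetical stand-ins for the `tokens` column and the cluster assignment, which this notebook does not define.

```python
from collections import Counter

# Hypothetical stand-ins: token lists keyed by row label,
# and clusters expressed as lists of row labels
tokens_by_movie = {
    0: ["prison", "friendship", "hope"],
    1: ["mafia", "family", "crime"],
    2: ["mafia", "crime", "boss", "crime"],
}
clusters = [[0], [1, 2]]

cluster_frequencies = []
for cluster in clusters:
    counts = Counter()
    for movie in cluster:
        # Counter.update adds one to the count of every token in the list
        counts.update(tokens_by_movie[movie])
    cluster_frequencies.append(dict(counts))

print(cluster_frequencies[1])
```

Each resulting dictionary maps a token to its count, which is the input shape `WordCloud.generate_from_frequencies` expects.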

In [None]:
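# Processing the dataset:
import ast

dataset = pd.read_csv('https://github.com/alvelvis/data-movies-willianoliveiragibin/raw/main/data-willianoliveiragibin.csv', index_col=0)
dataset
dataset.Description
type(dataset.Description[0])
# Each description appears to be stored as the string form of a Python list; ast.literal_eval parses it without executing arbitrary code:
dataset.Description = dataset.Description.apply(ast.literal_eval)
type(dataset.Description[0])
# Joining with '' (no separator) yields the run-together text shown in the output; if each list item is a whole word, ' '.join(x) would keep the words separated:
dataset.Description = dataset.Description.apply(lambda x: ''.join(x))
dataset.Description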

Unnamed: 0,Description
0,"Overthecourseofseveralyears,twoconvictsformafr..."
1,"DonVitoCorleone,headofamafiafamily,decidestoha..."
2,"AnanimeadaptationoftheHinduepictheRamayana,whe..."
3,"Lazy,uneducatedstudentsshareaveryclosebond.The..."
4,WhenthemenaceknownastheJokerwreakshavocandchao...
...,...
9995,Thegangencounterswithsomespiritualbodiesandfin...
9996,"Afteralifetimeofscams,aself-centeredmillennial..."
9997,Afatherdoesn'twanthisthreedaughterstogetmarrie...
9998,Anintimaterelationshipbetweenahumanandanandroi...
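
The output above has no spaces between words, which suggests that the stored lists hold whole words and that the empty-string join glues them together. A small sketch of the difference, using a made-up value in the assumed storage format (the real column contents may differ):

```python
import ast

# Hypothetical raw cell value: a list of words serialized as a string (assumption)
raw = "['Over', 'the', 'course', 'of', 'several', 'years,']"

# ast.literal_eval only parses Python literals; unlike eval, it does not run code
words = ast.literal_eval(raw)

joined_no_separator = ''.join(words)
joined_with_spaces = ' '.join(words)

print(joined_no_separator)
print(joined_with_spaces)
```

If the lists do hold whole words, joining with a space keeps word boundaries intact, which matters for the tokenization step that follows.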


In [None]:
# Extracting tokens from the dataset:
import string
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

punctuation = string.punctuation
stopwords = set(nltk.corpus.stopwords.words('english'))

def extract_tokens(text):
  # Lowercasing each token and dropping punctuation and English stop words
  tokens = word_tokenize(text)
  tokens = [token.lower() for token in tokens if token not in punctuation and token.lower() not in stopwords]
  return tokens

dataset['tokens'] = dataset.Description.apply(extract_tokens)
dataset.tokens[0]

# Collecting the tokens of every description into a single list:
all_the_tokens = []
for tokens in dataset.tokens:
  all_the_tokens.extend(tokens)

print(len(all_the_tokens))
print(len(set(all_the_tokens)))

# Counting how often each token occurs across the whole dataset:
freq_tokens = {}
for token in all_the_tokens:
  if token not in freq_tokens:
    freq_tokens[token] = 0
  freq_tokens[token] += 1

# Keeping the 1000 most frequent tokens as the vocabulary:
most_frequent_tokens = sorted(freq_tokens.items(), key=lambda x: -x[1])[:1000]
most_frequent_tokens = [x[0] for x in most_frequent_tokens]
print(most_frequent_tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Vectorizing all the movie descriptions:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def vectorize(tokens):
  # Bag of Words: one count per vocabulary token, in vocabulary order
  vector = []
  for token in most_frequent_tokens:
    vector.append(tokens.count(token))
  return np.asarray(vector)

first_movie = dataset.tokens[0]
first_movie_vector = vectorize(first_movie)
print(first_movie_vector)
print(len(first_movie_vector))

dataset['vector'] = dataset['tokens'].apply(vectorize)
dataset.vector
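
Once every description has a count vector, the vectors can be compared pairwise. The sketch below computes a cosine similarity matrix with NumPy on four made-up vectors and picks the closest other movie for each row. The `vectors` array is a hypothetical stand-in for the notebook's vector column, and this is only one possible way to use the similarities.

```python
import numpy as np

# Hypothetical Bag of Words count vectors, one row per movie (assumption)
vectors = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 1, 2, 1],
], dtype=float)

# Scale each row to unit length; dot products between rows are then cosine similarities
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / norms
similarity = unit_vectors @ unit_vectors.T

# Exclude self-similarity before looking for the closest other movie
np.fill_diagonal(similarity, -1.0)
most_similar = similarity.argmax(axis=1)
print(most_similar)
```

The `cosine_similarity` function imported in the cell above should produce the same pairwise matrix when given a 2-D array such as `np.stack(dataset['vector'])`. Note that this NumPy sketch divides by zero for a description containing none of the vocabulary tokens, so all-zero vectors would need to be filtered out first.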