# 3. Text Clustering Algorithm

---

Welcome! In this notebook, we set up the functions needed to use the Text Clustering Algorithm that we developed earlier in [...]. As mentioned previously, the main goal of this algorithm is to categorize the content of a website using its plain text. This helps us understand the kinds of websites that we will visit with **TheWitness**, and it also determines whether a particular website is worth exploring. The latter is very important, since our exploration of the internet grows exponentially with every step.

<img src = "./notebook_assets/nlp_1.png" width = "500px">

In [2]:
# Import libraries.
import pandas as pd

# Import text clustering libraries.
import gensim
from gensim.models.ldamulticore import LdaMulticore
from gensim import corpora, models
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.test.utils import datapath

# Import language processing libraries.
import nltk
from nltk.corpus import stopwords
import string
from nltk.stem.wordnet import WordNetLemmatizer
from itertools import chain

# We will ignore warnings.
import warnings
warnings.simplefilter("ignore")

In [18]:
# Initialize LdaModel.
lda = gensim.models.ldamodel.LdaModel

# Get stop words from english and spanish (we will include more languages
# in the future!)
stop = set(stopwords.words('english')).union(stopwords.words('spanish'))
exclude = set(string.punctuation)

# Initialize Lemmatizer object.
lemma = WordNetLemmatizer()

# We load the text clustering model that we built previously on [...].
temp_file = datapath("dictio")
dictionary = corpora.Dictionary.load(temp_file, mmap='r')
temp_file = datapath("text_clustering_model")
ldamodel = lda.load(temp_file, mmap='r')

def clean(text):
    """
    Clean a given text: filter out stop words, strip punctuation, and
    lemmatize each word. Returns the result as a list of tokens.
    """

    # Filter stop words.
    stop_free = " ".join([word for word in text.lower().split() if word not in stop])

    # Remove punctuation symbols.
    punct_free = "".join(ch for ch in stop_free if ch not in exclude)

    # Lemmatize the text.
    normalized = " ".join([lemma.lemmatize(word) for word in punct_free.split()])

    # Return the document as a list of tokens.
    return normalized.split()

def get_max_category(list_distribution, umbral_param):
    """
    Return the label of the category with the highest probability in the
    distribution, as long as that probability exceeds the threshold
    `umbral_param`; otherwise return -1 (this means that the content is
    not interesting for us).
    """

    # Default label (-1: no category passed the threshold) and running maximum.
    temp_label = -1
    temp_max = 0

    # Iterate over the (category, probability) pairs of the distribution.
    for cluster in list_distribution:
        if cluster[1] > temp_max and cluster[1] > umbral_param:
            temp_max = cluster[1]
            temp_label = cluster[0]

    # Return the category with the maximum probability.
    return temp_label

def cluster_text(text, umbral_param = 0.6):
    """
    Generate a probability distribution over the 5 categories we defined
    previously in the model's development notebook, and return the most
    probable category, as long as its probability passes the threshold
    (otherwise return -1).
    """

    # Clean the text of the website and convert it to a bag of words.
    clean_text = [clean(text)]
    doc_term_matrix_test = [dictionary.doc2bow(doc) for doc in clean_text]

    # Get the probability distribution over the 5 categories.
    distribution = ldamodel[doc_term_matrix_test[0]]

    # Get max category.
    max_category = get_max_category(distribution, umbral_param)
    return max_category

def get_max_probability(text):
    """
    Return the highest probability in the distribution that the model
    assigns to the given text, without applying any threshold.
    """

    # Clean the text of the website and convert it to a bag of words.
    clean_text = [clean(text)]
    doc_term_matrix_test = [dictionary.doc2bow(doc) for doc in clean_text]

    # Get the probability distribution over the 5 categories.
    distribution = ldamodel[doc_term_matrix_test[0]]

    # Get probability values.
    probability_values = [element[1] for element in distribution]

    # Return max probability.
    return max(probability_values)
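
The thresholding logic used by `get_max_category` can be tried out without loading the model. The sketch below is a minimal, self-contained restatement of that logic. The name `pick_category` is hypothetical, and the `(topic_id, probability)` pairs are made-up values for illustration, not real model output.

```python
def pick_category(distribution, threshold):
    """Return the most probable topic id if its probability exceeds
    `threshold`, otherwise -1. Mirrors the logic of get_max_category."""
    label, best = -1, 0.0
    for topic_id, probability in distribution:
        if probability > best and probability > threshold:
            label, best = topic_id, probability
    return label


# Hypothetical distributions in the (topic_id, probability) format.
confident = [(0, 0.05), (2, 0.85), (4, 0.10)]
ambiguous = [(1, 0.40), (3, 0.35), (4, 0.25)]

print(pick_category(confident, 0.6))  # 2
print(pick_category(ambiguous, 0.6))  # -1
```

With the threshold at 0.6, the second distribution is rejected because no single category dominates, which is how a page ends up labeled as not interesting.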

In [10]:
print("Text Clustering Model Ready!")

Text Clustering Model Ready!
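
Once the model is ready, the `-1` label returned by `cluster_text` can serve as the signal for skipping a page. The sketch below shows one way such a filter might look. The function `filter_pages`, the `pages` dictionary, and the example URLs are hypothetical. The classifier is passed in as a parameter so that the example runs with a stand-in function instead of the real model.

```python
def filter_pages(pages, classify):
    """Keep only the pages whose category label is not -1.

    pages    -- dict mapping a URL to its plain text (hypothetical input)
    classify -- function taking a text and returning a category label,
                for example cluster_text from this notebook
    """
    kept = {}
    for url, text in pages.items():
        label = classify(text)
        if label != -1:
            kept[url] = label
    return kept


# Stand-in classifier for illustration only; it is NOT the LDA model.
def fake_classify(text):
    return 0 if "news" in text else -1


pages = {
    "http://example.com/a": "latest news from the city",
    "http://example.com/b": "lorem ipsum",
}
print(filter_pages(pages, fake_classify))
```

In this notebook, the equivalent call would be `filter_pages(pages, cluster_text)`, assuming `pages` holds the plain text already extracted from each website.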
