Loading the dataset

In [22]:
import json

# Load the labelled URL lists; the context manager closes the file even on error.
with open("./data_sets.json") as data_file:
    raw_data = json.load(data_file)

# Each entry is a [url, usefulness] pair, so the fallback is an empty list.
training_sites = raw_data.get("trainingSitesAndTags", [])

Preparing the raw data for model training.
First, we retrieve the content found at each URL.
Then we strip the HTML markup, scripts, and styles from that content, as they are irrelevant to our classification. For this, we use BeautifulSoup's decompose method to drop the script and style elements, and stripped_strings to collect the remaining text.

In [23]:
from sklearn.model_selection import train_test_split
import requests
from bs4 import BeautifulSoup

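def remove_tags(html):
    """Return the visible text of an HTML document as a single string."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop <style> and <script> elements together with their contents.
    for element in soup(['style', 'script']):
        element.decompose()

    # Join the remaining text fragments, each stripped of surrounding whitespace.
    return ' '.join(soup.stripped_strings)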

sites_contents = []
y_sites = []

for url, usefulness in training_sites:
    # Download the page, keep only its visible text, and store it with its label.
    r = requests.get(url)
    filtered_content = remove_tags(r.content)
    sites_contents.append(filtered_content)
    y_sites.append(usefulness)
    print(url)

https://www.magazineluiza.com.br/jogo-rayman-origins-xbox-360-xbox-one-ubisoft/p/daac6j943b/ga/javt/
https://www.magazineluiza.com.br/cyberpunk-2077-para-xbox-one-cd-projekt-red/p/043186100/ga/jaco/
https://www.magazineluiza.com.br/battlefield-2042-para-xbox-one-e-xbox-series-x-electronic-arts/p/231247900/ga/bttf/
https://www.magazineluiza.com.br/mortal-kombat-11-para-ps4-netherrealm-studios/p/221812200/ga/glut/
https://www.magazineluiza.com.br/the-last-of-us-part-ii-para-ps4-naughty-dog/p/043185800/ga/jaco/
https://www.magazineluiza.com.br/playstation-5-825gb-1-controle-branco-sony-com-horizon-forbidden-west/p/235382200/ga/gps5/
https://www.magazineluiza.com.br/playstation-5-825gb-1-controle-branco-sony-com-horizon-forbidden-west-3-jogos/p/229762200/ga/otga/
https://www.magazineluiza.com.br/celular-smartphone-multilaser-f-p9130-tela-55-sensor-digital-32gb-ram-1gbl-preto-3g/p/fdfjc47dfc/te/smuf/
https://www.magazineluiza.com.br/xbox-series-s-rrs-00006-512gb-1-controle-branco-microsoft/

Next, we implement feature extraction. For this kind of data, that is, documents composed of many words, a bag-of-words representation is a sensible approach.
It vectorizes each document by storing the number of times each unique word in the vocabulary appears in it, using a memory-efficient data structure (sparse matrices from scipy).

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_sites = count_vect.fit_transform(sites_contents)
X_sites.shape

(58, 5261)
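
To make the counting behind X_sites concrete, here is a minimal pure-Python sketch of the bag-of-words idea. The toy documents and the count_vector helper are hypothetical, and the sketch only splits on whitespace, whereas CountVectorizer also lowercases the text and applies its own tokenization rules.

```python
from collections import Counter

# Hypothetical toy documents, standing in for the scraped page texts.
docs = ["buy xbox game", "buy game buy now"]

# Vocabulary: every unique word, sorted so that the column order is deterministic.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocabulary):
    """Return how many times each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One row per document, one column per vocabulary word.
matrix = [count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['buy', 'game', 'now', 'xbox']
print(matrix)      # [[1, 1, 0, 1], [2, 1, 1, 0]]
```

The shape (58, 5261) reported above has the same meaning: 58 documents (rows) and 5261 distinct words (columns).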

Now, it's a good idea to normalize these vectors. As it stands, each one holds the absolute number of occurrences of every word. But some documents are longer than others, and this can cause inaccuracies down the line. We will transform the occurrence counts into a frequency measure (term frequency, tf).

In [25]:
from sklearn.feature_extraction.text import TfidfTransformer

# With use_idf=False, only the term-frequency normalization is applied.
# Setting use_idf=True would additionally downscale words that appear in many documents.
tf_transformer = TfidfTransformer(use_idf=False).fit(X_sites)
X_sites_tf = tf_transformer.transform(X_sites)
X_sites_tf.shape

(58, 5261)
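
With use_idf=False and the default norm setting ("l2"), the transformer rescales each row of counts to unit Euclidean length, which is why the shape stays the same. The sketch below illustrates that rescaling in pure Python; l2_normalize and the two count vectors are hypothetical and are not part of the notebook's pipeline.

```python
import math

def l2_normalize(counts):
    """Scale a count vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        return list(counts)
    return [c / norm for c in counts]

short_doc = [1, 2, 0]    # hypothetical counts for a short page
long_doc = [10, 20, 0]   # same word proportions, ten times longer

short_tf = l2_normalize(short_doc)
long_tf = l2_normalize(long_doc)

# After normalization, document length no longer matters: both rows coincide.
print(short_tf)
print(long_tf)
```

Two pages that use words in the same proportions end up with the same vector, regardless of how long each page is.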

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X_sites_tf, y_sites, random_state=1337)

Now we need to choose a model and fit it to the training set.
There are many choices here. We will start with basic Gaussian naive Bayes and later switch to a specific variant, multinomial naive Bayes, which is generally considered better suited to word-count features such as these.

In [27]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train.toarray(), y_train)

GaussianNB()

In [28]:
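def unzip_list(pairs):
    """Split a list of [item1, item2] pairs into two parallel lists."""
    l1 = []
    l2 = []
    for item1, item2 in pairs:
        l1.append(item1)
        l2.append(item2)
    return l1, l2

# Each entry is a [url, usefulness] pair, so the fallback is an empty list.
testing_sites = raw_data.get("testingSitesAndTags", [])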
url_list, tags_list = unzip_list(testing_sites)

We now call transform instead of fit_transform on count_vect, so that the test documents are mapped onto the vocabulary learned from the training data.

In [29]:
def generate_bow_list_from_urls(url_list: list):
    """
    Fetch each URL, strip its markup, and return the tf-normalized
    bag-of-words matrix, built with the already fitted count_vect
    and tf_transformer.
    """
    contents_list = []

    for url in url_list:
        r = requests.get(url)
        filtered_content = remove_tags(r.content)
        contents_list.append(filtered_content)
    
    filtered_contents_bow = count_vect.transform(contents_list)
    filtered_contents_bow_tf = tf_transformer.transform(filtered_contents_bow)
    return filtered_contents_bow_tf

bow_list = generate_bow_list_from_urls(url_list)
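
The reason for reusing the fitted vocabulary can be sketched in pure Python. The fit_vocabulary and transform helpers below are hypothetical, simplified stand-ins for what a fitted vectorizer does: the columns are fixed by the training documents, and words that only appear at prediction time are ignored.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({word for doc in train_docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Count words per document, ignoring words that are not in the vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc.split()).items():
            if word in vocabulary:
                row[vocabulary[word]] = count
        rows.append(row)
    return rows

train_docs = ["buy xbox game", "buy game now"]   # hypothetical training texts
test_docs = ["buy playstation game game"]        # "playstation" was never seen

vocabulary = fit_vocabulary(train_docs)
test_matrix = transform(test_docs, vocabulary)

print(vocabulary)   # {'buy': 0, 'game': 1, 'now': 2, 'xbox': 3}
print(test_matrix)  # [[1, 2, 0, 0]]
```

Fitting again on the test documents would produce a different set of columns, and the trained model would no longer know which column corresponds to which word.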

In [30]:
def classify_usefulness(model, bow_list):
    """
    Predict the usefulness label of each document in bow_list
    (one row per URL) using the given fitted model.
    Returns the array of predictions produced by model.predict.
    """
    predictions_list = model.predict(bow_list.toarray())
    return predictions_list


predictions = classify_usefulness(gnb, bow_list)
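# Compare each prediction with its expected tag.
# "acertou" means the prediction was correct, "errou" means it was wrong.
for prediction, tag in zip(predictions, tags_list):
    if prediction == tag:
        print("acertou")
    else:
        print("errou")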

acertou
errou
acertou
acertou
errou
acertou
acertou
errou
acertou
acertou
acertou
acertou
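
The printed results are easier to compare between models as a single number. As a small sketch, the outcomes below are copied from the output above, and simple_accuracy is a hypothetical helper that computes the same figure directly from predictions and tags.

```python
# Outcomes copied from the printed output ("acertou" = correct, "errou" = wrong).
outcomes = ["acertou", "errou", "acertou", "acertou", "errou", "acertou",
            "acertou", "errou", "acertou", "acertou", "acertou", "acertou"]

correct = outcomes.count("acertou")
accuracy = correct / len(outcomes)
print(f"{correct}/{len(outcomes)} correct, accuracy = {accuracy:.2f}")

def simple_accuracy(predictions, tags):
    """Fraction of predictions that match their expected tag."""
    matches = sum(1 for p, t in zip(predictions, tags) if p == t)
    return matches / len(tags)
```

For this run, that gives 9 correct predictions out of 12, an accuracy of 0.75.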


Now testing with Multinomial Naive Bayes.

In [31]:
from sklearn.naive_bayes import MultinomialNB

mnnb = MultinomialNB()
mnnb.fit(X_train.toarray(), y_train)

MultinomialNB()
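
To give an idea of what the multinomial variant estimates, here is a minimal pure-Python sketch: per class, a prior and a smoothed probability for each word, combined in log space at prediction time. The function names, the toy data, and the smoothing value alpha=1.0 are assumptions made for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import defaultdict

def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_totals = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        totals = feature_totals[label]
        for j, value in enumerate(row):
            totals[j] += value
    model = {}
    for label, n_docs in class_counts.items():
        totals = feature_totals[label]
        # Additive smoothing keeps unseen words from getting probability zero.
        denominator = sum(totals) + alpha * n_features
        log_prior = math.log(n_docs / len(y))
        log_likelihood = [math.log((t + alpha) / denominator) for t in totals]
        model[label] = (log_prior, log_likelihood)
    return model

def predict_multinomial_nb(model, row):
    """Return the label with the highest log score for one count vector."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(v * ll for v, ll in zip(row, log_likelihood))
    return max(model, key=score)

# Hypothetical toy counts over the vocabulary ["game", "price", "login"].
X = [[3, 1, 0], [2, 2, 0], [0, 0, 3], [0, 1, 2]]
y = [True, True, False, False]

model = train_multinomial_nb(X, y)
print(predict_multinomial_nb(model, [2, 1, 0]))  # True
print(predict_multinomial_nb(model, [0, 0, 2]))  # False
```

Unlike the Gaussian variant, which fits a mean and variance per feature, this model works directly with how often each word occurs in each class.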

In [32]:
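# Store the MultinomialNB predictions; without the assignment, the loop below
# would re-check the GaussianNB predictions from the previous cell.
predictions = classify_usefulness(mnnb, bow_list)
# "acertou" means the prediction was correct, "errou" means it was wrong.
for prediction, tag in zip(predictions, tags_list):
    if prediction == tag:
        print("acertou")
    else:
        print("errou")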

acertou
errou
acertou
acertou
errou
acertou
acertou
errou
acertou
acertou
acertou
acertou
