Loading the dataset

In [3]:
import json

# Use a context manager so the file is closed even if parsing fails
with open("./training_data_set.json") as data_file:
    raw_data = json.load(data_file)

sites = raw_data.get("sitesAndTags", [])
print("Total number of sites:", len(sites))

Total number of sites: 200


Preparing the raw data for model training.
First, we retrieve the content served at each URL.
Then we strip the style and script elements and all remaining HTML markup from that content, since they are irrelevant to our classification. For this, we use BeautifulSoup's decompose method and keep only the visible text.

In [6]:
%pip install scikit-learn
%pip install bs4

In [7]:
from sklearn.model_selection import train_test_split
import requests
from bs4 import BeautifulSoup

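def remove_tags(html):
    """Return the visible text of an HTML document as one string."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop <style> and <script> elements together with their contents
    for data in soup(['style', 'script']):
        data.decompose()

    # Join the remaining text fragments, trimmed of surrounding whitespace
    return ' '.join(soup.stripped_strings)

sites_contents = []
y_sites = []

# Each entry of sites is a (url, usefulness) pair
for url, usefulness in sites:
    # The timeout value is an arbitrary choice to avoid hanging forever
    r = requests.get(url, timeout=30)
    filtered_content = remove_tags(r.content)
    sites_contents.append(filtered_content)
    y_sites.append(usefulness)
    print(url)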

https://www.bigboygames.com.br/jogo-fallen-legion-revenants-vanguard-edition-nintendo-switch
https://www.bigboygames.com.br/world-of-warriors-ps4-4955-p993556
https://www.bigboygames.com.br/jogo-monster-hunter-world-seminovo-ps4-11807-br-p1006227
https://www.bigboygames.com.br/jogo-back-4-blood-xbox-one
https://www.bigboygames.com.br/jogo-far-cry-6-xbox-one-p1006787
https://www.bigboygames.com.br/controle-baseus-sem-fio-transparente-seminovo-nintendo-switch
https://www.bigboygames.com.br/cartao-xbox-live-brasil-r200-5598-p995690
https://www.bigboygames.com.br/case-zelda-botw-seminovo-nintendo-switch-lite-15816
https://www.bigboygames.com.br/console-nintendo-switch-lite-amarelo-seminovo-16101
https://www.bigboygames.com.br/case-protetora-nintendo-swtich-lite-cinza-amarelo-14940
https://www.ibyte.com.br/jogo-hades-xbox/p
https://www.ibyte.com.br/marvel-s-spider-man-miles-morales-ps5/p
https://www.ibyte.com.br/game-fifa-18-xbox-one/p
https://www.shockgames.com.br/produto/horizon-zero-dawn
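
To make the effect of the cleaning step concrete, here is a minimal sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup. It is a swapped-in technique for illustration, not the notebook's implementation; the class name, helper name and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text found outside the skipped elements
        if text and self._skip_depth == 0:
            self.parts.append(text)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page, not taken from the dataset
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Far Cry 6</p><script>track()</script><p>Xbox One</p></body></html>"
)
print(visible_text(sample))
```

The style rule and the script call are dropped, and only the two product strings remain, which is the kind of text the classifier is meant to see.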

Lastly, we implement feature extraction. For this kind of data, that is, documents composed of many words, a bag of words is a sensible approach.
It vectorizes each document by storing the number of times each unique word of the vocabulary appears in it, using a memory-efficient data structure (sparse matrices from scipy).
We also remove Portuguese stopwords from the documents. The vocabulary could additionally be capped at 500 terms through the vectorizer's max_features parameter; the cell below does not set it, so the full vocabulary is kept.
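
As a rough illustration of what a bag of words stores, the following sketch builds one by hand with collections.Counter. The toy documents and the two-word stopword set are made up for the example, and the whitespace tokenization is a simplification of what CountVectorizer does.

```python
from collections import Counter

# Hypothetical toy documents and stopword list, not taken from the dataset
docs = ["jogo novo jogo xbox", "jogo usado para ps4 e xbox"]
stopwords_pt = {"para", "e"}

# Tokenize on whitespace and drop stopwords
tokenized = [
    [word for word in doc.lower().split() if word not in stopwords_pt]
    for doc in docs
]

# The vocabulary is the sorted set of all remaining words
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per vocabulary word
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
print(bow)
```

Each row has one column per vocabulary word, so most entries are zero once the vocabulary grows to thousands of words, which is why a sparse matrix is the natural storage format.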

In [8]:
%pip install nltk
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /home/alps2/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(stop_words=stopwords.words('portuguese'))
X_sites = count_vect.fit_transform(sites_contents)
X_sites.shape

(200, 9192)

Now it is a good idea to normalize these vectors. As they stand, they hold the absolute number of occurrences of each word, but some documents are longer than others, and that difference in length can cause inaccuracies down the line. We will transform the occurrence counts into a length-normalized term frequency (tf) measure.

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

# use_idf=False keeps plain normalized term frequencies. Setting use_idf=True
# would also downscale the weight of words that appear in many documents.
tf_transformer = TfidfTransformer(use_idf=False).fit(X_sites)
X_sites_tf = tf_transformer.transform(X_sites)
X_sites_tf.shape

(200, 9192)
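
The normalization step can be sketched by hand. With use_idf=False and its default norm, TfidfTransformer rescales each row to unit Euclidean (L2) length; the helper name and the count vectors below are hypothetical and only illustrate why document length stops mattering.

```python
import math


def l2_normalize(counts):
    """Scale a count vector to unit Euclidean length (zero vectors stay as-is)."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else list(counts)


# Two hypothetical documents with the same word proportions,
# the second one ten times longer than the first
short_doc = [2, 1, 0]
long_doc = [20, 10, 0]

short_tf = l2_normalize(short_doc)
long_tf = l2_normalize(long_doc)

print(short_tf)
print(long_tf)
```

After normalization both documents map to the same vector, so a classifier no longer sees the longer page as having "more" of every word.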

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_sites_tf, y_sites, random_state=1337)
X_train = X_train.toarray()
X_test = X_test.toarray()

print("Number of sites for training:", len(X_train))
print("Number of sites for testing:", len(X_test))

Number of sites for training: 150
Number of sites for testing: 50



If more sites need to be vectorized later, for example to run predictions, we need to call transform instead of fit_transform on count_vect, so that the vocabulary learned from the training data is reused.
The following function should prove useful for this.

In [13]:
def generate_bow_list_from_urls(url_list: list):
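    """
    Download each URL, strip its markup, and vectorize the text for predictions.
    Uses the already fitted count_vect and tf_transformer (transform, not
    fit_transform), so the columns match the training matrix.
    Returns a sparse matrix with one row per URL.
    """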
    contents_list = []

    for url in url_list:
        r = requests.get(url)
        filtered_content = remove_tags(r.content)
        contents_list.append(filtered_content)
    
    filtered_contents_bow = count_vect.transform(contents_list)
    filtered_contents_bow_tf = tf_transformer.transform(filtered_contents_bow)
    return filtered_contents_bow_tf
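
The reason for reusing the fitted vocabulary can be shown with a small hand-rolled sketch. The two helper functions and the toy documents are hypothetical stand-ins for fit_transform and transform, not the scikit-learn implementation.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Learn a word -> column index mapping from the training documents."""
    words = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def transform(docs, vocabulary):
    """Count words using a fixed vocabulary; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows


# Hypothetical training documents and one new document
vocabulary = fit_vocabulary(["jogo xbox", "console switch"])
new_rows = transform(["jogo jogo playstation"], vocabulary)

print(vocabulary)
print(new_rows)
```

The new document yields a row with the same four columns as the training data, and the unseen word "playstation" is simply dropped. Refitting on the new documents would instead produce different columns, which the trained models could not interpret.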

Creating a dict to hold the predictions of every model.

In [14]:
predictions = {}

Now we need to choose a model and fit it with the training set.
There are many choices here. We will start with a basic Gaussian naive Bayes classifier and then switch to different models to see how they fare.

In [15]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
predictions['gaussian_naive_bayes'] = gnb.predict(X_test)

Now testing with Multinomial Naive Bayes.

In [16]:
from sklearn.naive_bayes import MultinomialNB

mnnb = MultinomialNB()
mnnb.fit(X_train, y_train)
predictions['multinomial_naive_bayes'] = mnnb.predict(X_test)

Now testing with decision tree.

In [17]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
predictions['decision_tree'] = dt.predict(X_test)

Now testing with random forest.

In [18]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
predictions['random_forest'] = rf.predict(X_test)

Now testing with a linear support vector machine (SVM), trained with stochastic gradient descent through SGDClassifier with hinge loss.

In [19]:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(
    loss='hinge', penalty='l2',
    alpha=1e-3, random_state=1337,
    max_iter=5, tol=None
)
svm.fit(X_train, y_train)
predictions['support_vector_machine'] = svm.predict(X_test)

Calculating the precision, recall and F1 score of each model, reported per class in the order [True, False].

In [20]:
from sklearn.metrics import precision_recall_fscore_support

for model_name in predictions:
    print("Stats for", model_name)

    pred = predictions[model_name]
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, pred, labels=[True, False])
    print("Precision:", precision)
    print("recall:", recall)
    print("f1:", f1)
    print("===============================================================")

Stats for gaussian_naive_bayes
Precision: [0.68421053 0.91666667]
recall: [0.96296296 0.47826087]
f1: [0.8        0.62857143]
Stats for multinomial_naive_bayes
Precision: [0.63157895 0.75      ]
recall: [0.88888889 0.39130435]
f1: [0.73846154 0.51428571]
Stats for decision_tree
Precision: [0.92 0.84]
recall: [0.85185185 0.91304348]
f1: [0.88461538 0.875     ]
Stats for random_forest
Precision: [0.95454545 0.78571429]
recall: [0.77777778 0.95652174]
f1: [0.85714286 0.8627451 ]
Stats for support_vector_machine
Precision: [0.77419355 0.84210526]
recall: [0.88888889 0.69565217]
f1: [0.82758621 0.76190476]


We consider precision more relevant than recall for our classifier. Precision is the model's ability to avoid marking a negative example as positive, while recall is its ability to correctly identify all positive examples. For a web scraper, failing to mark some correct sites as such is not a great loss, but marking incorrect sites as correct is.
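
The trade-off can be checked with a little arithmetic. The confusion counts below are not printed by the notebook; they are a reconstruction inferred from the gaussian_naive_bayes scores reported above, assuming the 50-site test split holds 27 positive and 23 negative sites.

```python
# Confusion counts for the positive class (label True), reconstructed
# from the reported gaussian_naive_bayes scores (an assumption, see above)
true_positives = 26   # useful sites marked as useful
false_positives = 12  # useless sites wrongly marked as useful
false_negatives = 1   # useful sites that were missed

# Precision: of everything marked positive, how much really is positive
precision = true_positives / (true_positives + false_positives)

# Recall: of everything really positive, how much was found
recall = true_positives / (true_positives + false_negatives)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print("Precision:", round(precision, 8))
print("Recall:", round(recall, 8))
print("F1:", round(f1, 8))
```

Under these counts the model misses only one useful site but lets twelve useless ones through, which is exactly the kind of error the paragraph above wants to avoid. By that criterion, the models with higher precision on the True class in the results above are the better fit.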