Loading the dataset

In [16]:
import json

with open("./training_data_set.json") as data_file:
    raw_data = json.load(data_file)

training_sites = raw_data.get("trainingSitesAndTags", [])
print("Number of training links:", len(training_sites))

with open("./testing_data_set.json") as data_file:
    raw_data = json.load(data_file)

testing_sites = raw_data.get("testingSitesAndTags", [])
print("Number of testing links:", len(testing_sites))

# Split the (url, tag) pairs into two parallel lists.
testing_url_list = []
testing_tags_list = []
for url, tag in testing_sites:
    testing_url_list.append(url)
    testing_tags_list.append(tag)

Number of training links: 58
Number of testing links: 12


Preparing the raw data for training the models
First, we need to retrieve the content found at each URL.
Then we remove the script and style elements along with the remaining HTML markup, since they are irrelevant to our classification. For this, we use BeautifulSoup's decompose method and keep only the visible text.

In [17]:
from sklearn.model_selection import train_test_split
import requests
from bs4 import BeautifulSoup

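# Remove <script> and <style> elements, then return only the visible text.
def remove_tags(html):
    soup = BeautifulSoup(html, "html.parser")

    # decompose() deletes the element and everything inside it from the tree.
    for data in soup(['style', 'script']):
        data.decompose()

    return ' '.join(soup.stripped_strings)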

sites_contents = []
y_sites = []

# Download each training page and keep its cleaned text with its label.
for url, usefulness in training_sites:
    r = requests.get(url)
    filtered_content = remove_tags(r.content)
    sites_contents.append(filtered_content)
    y_sites.append(usefulness)
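
To see what this cleaning step produces without fetching a real page, here is a minimal sketch that mimics it using only the standard library. It is not the BeautifulSoup code used above: the `TextExtractor` class, the `extract_text` helper, and the sample HTML are all hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page, not one of the training URLs.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
visible_text = extract_text(sample)
print(visible_text)
```

Only the text a reader would actually see survives, which is what the classifier is later trained on.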

Lastly, we implement feature extraction. For this kind of data, that is, documents composed of many words, a bag-of-words representation is a sensible approach.
It vectorizes each document by storing the number of times each unique word in the vocabulary appears in it, using a memory-efficient data structure (sparse matrices from scipy).

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_sites = count_vect.fit_transform(sites_contents)
X_sites.shape

(58, 5247)
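
As a rough illustration of what a bag-of-words matrix contains, the sketch below builds one by hand for three made-up documents. It assumes plain whitespace tokenization, which is simpler than what CountVectorizer does, and it uses dense lists rather than sparse matrices; the names `docs`, `vocabulary`, and `count_matrix` are hypothetical.

```python
from collections import Counter

# Toy documents, made up for illustration.
docs = ["buy cheap shoes", "cheap cheap flights", "shoes and flights"]

# Build the vocabulary: every unique word gets a column, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, holding raw counts.
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts.get(word, 0) for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

The matrix has one row per document and one column per vocabulary word, which is why the shape above is (number of pages, vocabulary size).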

Now, it's a good idea to normalize these vectors. As it stands, each row holds the absolute number of occurrences, but some documents are longer than others, which can cause inaccuracies down the line. We will transform the occurrence counts into a frequency measure (term frequency, tf).

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer

# use_idf=False keeps plain (normalized) term frequencies.
# Setting use_idf=True would also downscale the weight of words that appear in many documents.
tf_transformer = TfidfTransformer(use_idf=False).fit(X_sites)
X_sites_tf = tf_transformer.transform(X_sites)
X_sites_tf.shape

(58, 5247)
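
To make the effect of this normalization concrete, here is a small sketch of row-wise L2 normalization, which is what TfidfTransformer applies by default (norm='l2') when use_idf is False. The `l2_normalize` helper and the two count vectors are hypothetical and only meant to show that document length stops mattering.

```python
import math


def l2_normalize(row):
    """Scale a count vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(value * value for value in row))
    if norm == 0:
        return list(row)  # an empty document stays all zeros
    return [value / norm for value in row]


# Two made-up documents with the same word proportions but different lengths.
short_doc = [3, 4, 0]
long_doc = [30, 40, 0]

short_tf = l2_normalize(short_doc)
long_tf = l2_normalize(long_doc)
print(short_tf)
print(long_tf)
```

Both documents end up with the same vector, so a page is not treated differently just because it contains more text.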

In [20]:
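# Hold out part of the labeled data (by default, 25% goes to the test split).
X_train, X_test, y_train, y_test = train_test_split(X_sites_tf, y_sites, random_state=1337)
# GaussianNB needs a dense array, so convert the sparse matrix.
X_train = X_train.toarray()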

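For the testing URLs, we call transform instead of fit_transform on count_vect, so that the vocabulary learned from the training pages is reused rather than rebuilt.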

In [21]:
def generate_bow_list_from_urls(url_list: list):
    """
    Fetch the content of each URL and process it for predictions.
    Uses transform (not fit_transform), so the vocabulary fitted on the
    training data is reused.
    """
    contents_list = []

    for url in url_list:
        r = requests.get(url)
        filtered_content = remove_tags(r.content)
        contents_list.append(filtered_content)
    
    filtered_contents_bow = count_vect.transform(contents_list)
    filtered_contents_bow_tf = tf_transformer.transform(filtered_contents_bow)
    return filtered_contents_bow_tf

bow_list = generate_bow_list_from_urls(testing_url_list)
bow_list = bow_list.toarray()

# Dict holding each model's predictions, keyed by model name.
predictions = {}
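
The reason for reusing the fitted vocabulary can be shown with a small hand-written sketch. The `fit_vocabulary` and `transform` helpers below are hypothetical stand-ins that assume whitespace tokenization; they are not the scikit-learn implementation.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each unique word in the training documents to a column index."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def transform(docs, vocabulary):
    """Count words using a fixed vocabulary; unseen words are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc.split()).items():
            if word in vocabulary:
                row[vocabulary[word]] = count
        rows.append(row)
    return rows


# Made-up documents for illustration.
train_docs = ["cheap shoes", "cheap flights"]
test_docs = ["cheap hotels and shoes"]

vocab = fit_vocabulary(train_docs)       # columns: cheap, flights, shoes
train_rows = transform(train_docs, vocab)
test_rows = transform(test_docs, vocab)  # "hotels" and "and" are dropped
print(train_rows)
print(test_rows)
```

The test row has the same three columns as the training rows, which is what a fitted model expects. Fitting a new vocabulary on the test pages would change the number and meaning of the columns.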

Now we need to choose a model and fit it on the training set.
There are many choices here. We will start with a basic Gaussian naive Bayes classifier and then switch to different models to see how they fare.

In [22]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
predictions['gaussian_naive_bayes'] = gnb.predict(bow_list)

Now testing with Multinomial Naive Bayes.

In [23]:
from sklearn.naive_bayes import MultinomialNB

mnnb = MultinomialNB()
mnnb.fit(X_train, y_train)
predictions['multinomial_naive_bayes'] = mnnb.predict(bow_list)

Now testing with decision tree.

In [24]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
predictions['decision_tree'] = dt.predict(bow_list)

Calculating precision, recall, and F1 score of the models

In [25]:
from sklearn.metrics import precision_recall_fscore_support

for model_name in predictions:
    print("Stats for", model_name)

    pred = predictions[model_name]
    precision, recall, f1, _ = precision_recall_fscore_support(testing_tags_list, pred)
    print("Precision:", precision)
    print("recall:", recall)
    print("f1:", f1)
    print("===============================================================")

Stats for gaussian_naive_bayes
Precision: [0.66666667 0.66666667]
recall: [0.66666667 0.66666667]
f1: [0.66666667 0.66666667]
Stats for multinomial_naive_bayes
Precision: [0.5 0. ]
recall: [1. 0.]
f1: [0.66666667 0.        ]
Stats for decision_tree
Precision: [0.66666667 1.        ]
recall: [1.  0.5]
f1: [0.8        0.66666667]




We consider precision to be more relevant than recall for our classifier. Basically, precision is the model's ability to avoid labeling a negative example as positive, and recall is its ability to correctly classify all of the positive examples. For a web scraper, failing to flag some correct sites as such is not a great loss, but flagging incorrect sites is.
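
The trade-off can be made concrete with a small worked example. The sketch below computes the three metrics by hand for a single positive class; the `precision_recall_f1` helper and the labels are hypothetical and are not taken from the testing set.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # When a denominator is zero the metric is undefined; report 0.0 here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: a cautious classifier that flags only two of four useful sites
# but never flags a useless one.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
```

Here every flagged site is correct (precision 1.0) even though half of the useful sites are missed (recall 0.5), which is the kind of behavior preferred for the scraper.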