# Web spam detection through link-based features

Project for Information Retrieval exam at University of Trieste, January 2021

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import spam_detection as sd #python module containing custom functions
np.random.seed(2) #random seed for reproducibility

Web spam detection is a crucial issue for web search engines. Ranking algorithms such as PageRank cannot explicitly penalize spam websites in favor of trustworthy ones, so users may find highly ranked pages that have no useful content and owe their rank to being part of a link farm, a popular way to fool ranking algorithms.

### The dataset
The [WEBSPAM-UK2006](https://chato.cl/webspam/datasets/uk2006/) dataset contains 11402 hosts in the `.uk` domain, of which 7866 are labeled as *spam* or *normal*. Newer datasets have been released by the same authors, but this 2006 version remains the one with the highest number of manually labeled samples.

The file `new_hostnames.csv` contains the names of the hosts in the dataset, while `webspam-uk2006-labels.txt` assigns a label (*spam*, *normal* or *undecided*) to 8045 host names. For the purposes of this project, hosts labeled *undecided* were treated as unlabeled, leaving 7866 hosts labeled as *spam* or *normal*.

Finally, the file `uk-2006-05.hostgraph_weighted.txt` contains the weighted graph of the hosts, each row containing a host index, the indices of outlinked hosts and, for each of them, the number of outlinks.

The function `read_graph` returns a sparse `csr_matrix` $\mathcal{R}$, with $\mathcal{R}_{i,j}$ equal to $\frac{1}{O(i)}$ if there is an edge from host $i$ to host $j$, where $O(i)$ is the total number of hosts outlinked by $i$, and equal to $0$ otherwise.

The PageRank algorithm requires a stochastic matrix, but there is no guarantee that each host has at least one outlink (the dangling node problem). As proposed in [[1]](#references), an artificial node with a single self-loop was therefore added to the graph, with incoming edges from all dangling nodes.
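The construction described above can be sketched on a toy graph. This is only an illustration under stated assumptions: the function name `build_transition_matrix` and the `outlinks` dictionary are hypothetical and stand in for whatever `sd.read_graph` actually parses from the file, whose implementation is not shown here.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_transition_matrix(outlinks, n_hosts):
    """Row-stochastic matrix with one extra sink node for dangling hosts.

    `outlinks` is a hypothetical dict mapping a host index to the hosts it
    links to (an assumption made for this sketch).
    """
    sink = n_hosts  # index of the artificial node
    rows, cols, vals = [], [], []
    for i in range(n_hosts):
        targets = sorted(set(outlinks.get(i, ())))
        if not targets:
            targets = [sink]  # dangling host: single edge to the sink
        for j in targets:
            rows.append(i)
            cols.append(j)
            vals.append(1.0 / len(targets))  # 1 / O(i)
    # The sink only links to itself.
    rows.append(sink)
    cols.append(sink)
    vals.append(1.0)
    size = n_hosts + 1
    return csr_matrix((vals, (rows, cols)), shape=(size, size))


# Toy graph with 4 hosts; host 3 has no outlinks.
toy_outlinks = {0: [1, 2], 1: [2], 2: [0]}
R_toy = build_transition_matrix(toy_outlinks, 4)
row_sums = np.asarray(R_toy.sum(axis=1)).ravel()
print(R_toy.toarray())
print(row_sums)  # every row sums to 1, so the matrix is stochastic
```

With the sink node in place every row has at least one nonzero entry, which is what makes the matrix stochastic.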

In [2]:
hostnames=sd.hostnames_list('data/new_hostnames.csv')
labels_dict=sd.labels_dictionary('data/webspam-uk2006/webspam-uk2006-labels.txt')
labels, labeled_dataset=sd.make_dataset(labels_dict,hostnames)
R=sd.read_graph('data/uk-2006-05.hostgraph_weighted.txt',len(hostnames))

### PageRank

In the following cell PageRank is computed iteratively, according to the equation:
$$
{rank}_{k+1}=\frac{\alpha}{N}\mathbf{1}+(1-\alpha)R^T\cdot {rank}_k
$$
where ${rank}_k$ is the column vector storing the PageRank scores at step $k$, $\mathbf{1}$ is a column vector of ones, $N$ is the number of nodes (in this case 11403) and $\alpha$ is the teleporting factor. The iterative computation is performed up to a fixed precision of $\epsilon$, i.e. until $\lVert{rank}_k-{rank}_{k-1}\rVert_1 < \epsilon$.
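The update rule and stopping criterion can be written as a short power iteration. This is a minimal sketch on a dense toy matrix, not the code of `sd.compute_PR`; the function name, the uniform starting vector and the `max_iter` safeguard are assumptions made for the example.

```python
import numpy as np


def pagerank_power_iteration(R, alpha, eps, max_iter=100_000):
    """Iterate rank <- alpha/N + (1 - alpha) * R^T rank until the L1 change < eps."""
    n = R.shape[0]
    rank = np.full(n, 1.0 / n)  # uniform starting vector (an assumption)
    for _ in range(max_iter):
        new_rank = alpha / n + (1.0 - alpha) * (R.T @ rank)
        if np.abs(new_rank - rank).sum() < eps:  # L1 distance between steps
            return new_rank
        rank = new_rank
    raise RuntimeError("PageRank did not converge within max_iter steps")


# Toy row-stochastic matrix: 0 -> {1, 2}, 1 -> 2, 2 -> 0.
R_toy = np.array([[0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
rank_toy = pagerank_power_iteration(R_toy, alpha=0.15, eps=1e-10)
print(rank_toy, rank_toy.sum())
```

Because each row of the matrix sums to 1, every step preserves the total score, so the resulting vector still sums to 1.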


In [3]:
alpha=.01
eps=1e-8

print("PageRank computation")
rank=sd.compute_PR(alpha,eps,R)

PageRank computation
PageRank computed


### An approximation for Personalized PageRank

Personalized PageRank is an algorithm directly derived from PageRank, whose result is an $N \times N$ matrix $\mathcal{PRM}$ such that $\mathcal{PRM}_{i,j}$ is the contribution of node $i$ to the PageRank of node $j$. This implies that summing $\mathcal{PRM}$ over its rows, i.e. over all contributing nodes $i$, gives the PageRank vector.\
The contribution vector $cpr(v)$ is defined to be the row vector whose transpose is the $v$-th column of matrix $\mathcal{PRM}$. This vector stores the contribution of all nodes to the PageRank of node $v$ and it is of particular interest when it comes to web spam detection.\
\
However, Personalized PageRank computation is infeasible on large datasets, since it requires an iterative computation (conceptually identical to the one of PageRank) that includes, at each step, a matrix multiplication between two square $N \times N$ matrices, one of which is non-sparse.\
To address this problem, the authors of [[2]](#references) propose a local algorithm for the computation of $\delta$-approximations of contribution vectors.
\
Given a node $v$ and its contribution vector $c:=cpr(v)$, a $\delta$-approximation of $cpr(v)$ is a non-negative vector $c^*$ such that $c(u)-\delta rank(v) \leq c^*(u) \leq c(u)$ for all nodes $u$.
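A local "push" procedure in the spirit of [[2]](#references) can be sketched as follows. It is an illustration only, not the code of `sd.approximate_contributions`: the function name, the dense toy matrix and the choice of scaling contributions by $\frac{1}{N}$ (so that they sum to $rank(v)$) are assumptions. Mass is pushed backwards along incoming edges until every residual is at most the threshold, and the result is then checked against the exact contribution vector obtained by matrix inversion.

```python
import numpy as np


def approximate_contributions_sketch(v, R, alpha, threshold):
    """Lower bound on the contribution of every node to the PageRank of v."""
    n = R.shape[0]
    approx = np.zeros(n)    # c*, never exceeds the exact contributions
    residual = np.zeros(n)  # mass that has not been pushed yet
    residual[v] = 1.0 / n   # scaling chosen so contributions sum to rank(v)
    pending = [v]
    while pending:
        u = pending.pop()
        mass = residual[u]
        if mass <= threshold:
            continue  # already pushed since it was queued
        approx[u] += alpha * mass
        residual[u] = 0.0
        for w in np.nonzero(R[:, u])[0]:  # hosts w that link to u
            residual[w] += (1.0 - alpha) * R[w, u] * mass
            if residual[w] > threshold:
                pending.append(w)
    return approx


alpha, delta, v = 0.15, 1e-3, 2
R_toy = np.array([[0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
n = R_toy.shape[0]
# Exact matrix, scaled by 1/N so that column v sums to rank(v).
prm = (alpha / n) * np.linalg.inv(np.eye(n) - (1.0 - alpha) * R_toy)
exact = prm[:, v]
rank_v = exact.sum()
approx = approximate_contributions_sketch(v, R_toy, alpha, delta * rank_v)
print(exact)
print(approx)
```

The exact computation is only feasible here because the toy graph is tiny; the point of the local procedure is that it touches only the nodes that actually receive residual mass.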

For the following computations, it is useful to store the columns of the $\mathcal{R}$ matrix in a list.


In [4]:
columns=sd.columns_list(R)

In [5]:
delta=1e-3

nl=len(labeled_dataset)
ap=np.zeros((nl,len(rank)))
print("Approximation of Personalized PageRank for labeled hosts")
for v in range(nl):
    print(str(v+1)+'/'+str(nl),end='\r')
    ap[v]=(sd.approximate_contributions(labeled_dataset[v], alpha, delta*rank[labeled_dataset[v]], rank[labeled_dataset[v]], columns))

Approximation of Personalized PageRank for labeled hosts
7866/7866

### Features for link-based web spam detection
Spam detection is particularly relevant for high PageRank hosts, since people tend to click on highly ranked pages, almost always within the first page of search engine results [[3]](#references).\
However, PageRank alone is not able to filter out spam pages. If we restrict our view to the highest ranked 25% of the dataset, 144 of the 2276 labeled hosts in that subset are labeled as spam, as opposed to 773 of 7866 on the entire dataset. The proportion of spam hosts drops from 9.8% to 6.3%, which is not enough to consider PageRank a spam detection or spam-robust algorithm.
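One plausible reading of this restriction is: take the top $n$% of all hosts by PageRank, then keep the labeled hosts that fall in that set. The sketch below follows that reading; the function name `top_percent_labeled` is hypothetical and the actual behavior of `sd.top_n_percent` may differ.

```python
import numpy as np


def top_percent_labeled(n_percent, rank, labeled_hosts):
    """Positions (within labeled_hosts) of hosts in the top n_percent by rank."""
    n_top = int(len(rank) * n_percent / 100)
    top_hosts = np.argsort(rank)[::-1][:n_top]  # highest rank first
    return np.nonzero(np.isin(labeled_hosts, top_hosts))[0]


rank_toy = np.array([0.4, 0.1, 0.3, 0.2])
labeled_toy = np.array([1, 2, 3])  # host 0 is unlabeled
positions = top_percent_labeled(50, rank_toy, labeled_toy)
print(positions, labeled_toy[positions])
```

Returning positions within the labeled set, rather than host indices, makes the result directly usable to slice label and feature arrays that are aligned with the labeled hosts.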

In [15]:
n=25
labeled_top=sd.top_n_percent(n,rank,labeled_dataset)

y=labels[labeled_dataset]
y_top=y[labeled_top]

print("Total spam hosts: ", sum(y))
print("Total labeled hosts: ",len(y))
print("Total spam hosts %: ", 100*sum(y)/len(y),'\n')
print("Spam hosts in top 25%: ", sum(y_top))
print("Labeled hosts in top 25%: ", len(y_top))
print("Spam hosts % in top 25%: ", 100*sum(y_top)/len(y_top))

Total spam hosts:  773
Total labeled hosts:  7866
Total spam hosts %:  9.827103991863718 

Spam hosts in top 25%:  144
Labeled hosts in top 25%:  2276
Spam hosts % in top 25%:  6.3268892794376095


There are two basic approaches to web spam detection: content-based and link-based.\
In a content-based setting the `HTML` code of pages is scanned and some features are computed, usually related to text (for example, average word length or fraction of visible text). On the other hand, in a link-based setting (the one explored in this notebook), information is obtained exclusively from the web graph and computations performed over it (such as ranking algorithms).\
Some very trivial link-based features that can be computed even before performing PageRank are indegree and outdegree, defined as follows:
 - Indegree: total number of incoming links to a host
 - Outdegree: total number of outgoing links from a host
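On a sparse matrix such as $\mathcal{R}$, both degrees can be read off the sparsity pattern. The sketch below is an assumption about how this could be done, not the code of `sd.extract_features`; note that it counts distinct linked hosts, since the entries of $\mathcal{R}$ no longer carry the number of links between two hosts.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy graph: 0 -> {1, 2}, 1 -> 2, 2 -> 0.
R_toy = csr_matrix(np.array([[0.0, 0.5, 0.5],
                             [0.0, 0.0, 1.0],
                             [1.0, 0.0, 0.0]]))

outdegree = R_toy.getnnz(axis=1)  # nonzeros per row: hosts linked by i
indegree = R_toy.getnnz(axis=0)   # nonzeros per column: hosts linking to j
print(outdegree, indegree)
```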

Other useful features can be computed from the contribution vector of a node (or from a $\delta$-approximation):
 - Size of the $\delta$-significant contributing set: for a node $v$, this feature is defined as $cs\_size=|S_{\delta}|=|\{u \mid c^*(u)>\delta \, rank(v)\}|$.
 - Contribution from vertices in the $\delta$-significant contributing set: $cs\_contribution=\sum_{u \in S_{\delta}} \frac{c^*(u)}{rank(v)}$
 - $l_2$ norm of the $\delta$-significant contributing vector, computed here as a sum of squares (that is, the squared norm): $l_2\_norm=\sum_{u \in S_{\delta}} \left(\frac{c^*(u)}{rank(v)}\right)^2$
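The three definitions above translate directly into a few array operations. This is a minimal sketch with made-up numbers; the function name `contribution_features` is hypothetical and is not part of the `spam_detection` module.

```python
import numpy as np


def contribution_features(c_star, rank_v, delta):
    """Return (cs_size, cs_contribution, l2_norm) for one node v."""
    significant = c_star > delta * rank_v   # membership in S_delta
    relative = c_star[significant] / rank_v  # c*(u) / rank(v)
    cs_size = int(significant.sum())
    cs_contribution = float(relative.sum())
    l2_norm = float((relative ** 2).sum())   # sum of squares, as defined above
    return cs_size, cs_contribution, l2_norm


# Illustrative values only: two nodes exceed delta * rank(v) = 0.002.
c_star_toy = np.array([0.10, 0.05, 0.0001, 0.0])
cs_size, cs_contribution, l2_norm = contribution_features(c_star_toy, 0.2, 1e-2)
print(cs_size, cs_contribution, l2_norm)
```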

These features, together with PageRank scores, can be used to train a binary classifier for spam detection.

In [7]:
x=np.zeros((nl,6))
indegree,outdegree,cs_size,cs_contribution,l2_norm=sd.extract_features(R,delta,ap,labeled_dataset,rank)
x[:,0]=rank[labeled_dataset]
x[:,1]=indegree
x[:,2]=outdegree
x[:,3]=cs_size
x[:,4]=cs_contribution
x[:,5]=l2_norm

### Evaluation of classifiers

Machine learning can be used to detect spam websites. The simplest approach is to consider one individual feature, set a threshold and classify each host as *spam* or *normal* depending on whether its score is above or below that threshold.\
Despite being trivial, this approach works quite well in this case: it classifies labeled hosts with an overall accuracy of about 71% and, most importantly, with a recall of about 94% on the *spam* class, for which recall is arguably more important than precision. Its precision on *spam*, however, is only about 25%.\
More complex techniques can be used to increase accuracy and precision, while trying not to compromise recall too much.\
Simple tree-based models seem to work well on these data: a decision tree classifier scores an accuracy of about 91% and a random forest classifier about 93%. However, these models have much lower recall on the *spam* class, about 53% and 55% respectively, meaning that almost 1 spam website out of 2 is classified as *normal*.\
On the set of top 25% highest ranked hosts all three models reach about 96% accuracy, with recall on *spam* of about 99%, 80% and 82% respectively.
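Accuracy, precision and recall of this kind can be estimated with $k$-fold cross-validation. The sketch below is one possible way to do it with scikit-learn on synthetic data; the function name `cross_validated_metrics`, the stratified shuffled folds and the pooling of out-of-fold predictions are assumptions, and `sd.print_prediction_metrics` may compute its scores differently.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def cross_validated_metrics(clf, x, y, k):
    """Score pooled out-of-fold predictions; label 1 is the spam class."""
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    y_pred = cross_val_predict(clf, x, y, cv=cv)
    return {
        "accuracy": accuracy_score(y, y_pred),
        "precision_spam": precision_score(y, y_pred, pos_label=1, zero_division=0),
        "recall_spam": recall_score(y, y_pred, pos_label=1, zero_division=0),
    }


# Synthetic, imbalanced data: the positive class depends on feature 0 only.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 2))
y_demo = (x_demo[:, 0] > 1.0).astype(int)

stump = DecisionTreeClassifier(max_depth=1, class_weight="balanced")
metrics = cross_validated_metrics(stump, x_demo, y_demo, k=5)
print(metrics)
```

A depth-1 tree is exactly the single-feature threshold rule described above, which is why it serves as the baseline here.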

In [8]:
k=10

print("Single split decision tree")
clf=DecisionTreeClassifier(max_depth=1,class_weight='balanced')
sd.print_prediction_metrics(clf,x,y,k)

print("Decision tree")
clf=DecisionTreeClassifier(class_weight='balanced')
sd.print_prediction_metrics(clf,x,y,k)

print("Random forest")
clf=RandomForestClassifier(class_weight='balanced_subsample')
sd.print_prediction_metrics(clf,x,y,k)

Single split decision tree
Accuracy:  0.7116704805491991
Precision on spam:  0.2470389170896785
Recall on spam:  0.944372574385511
Decision tree
Accuracy:  0.9092295957284515
Precision on spam:  0.5390728476821192
Recall on spam:  0.5265200517464425
Random forest
Accuracy:  0.931858632087465
Precision on spam:  0.6908212560386473
Recall on spam:  0.5549805950840879


In [9]:
x_top=x[labeled_top]

print("Single split decision tree")
clf=DecisionTreeClassifier(max_depth=1,class_weight='balanced')
sd.print_prediction_metrics(clf,x_top,y_top,k)

print("Decision tree")
clf=DecisionTreeClassifier(class_weight='balanced')
sd.print_prediction_metrics(clf,x_top,y_top,k)

print("Random forest")
clf=RandomForestClassifier(class_weight='balanced_subsample')
sd.print_prediction_metrics(clf,x_top,y_top,k)

Single split decision tree
Accuracy:  0.9582601054481547
Precision on spam:  0.6033755274261603
Recall on spam:  0.9930555555555556
Decision tree
Accuracy:  0.961335676625659
Precision on spam:  0.6609195402298851
Recall on spam:  0.7986111111111112
Random forest
Accuracy:  0.9630931458699473
Precision on spam:  0.6704545454545454
Recall on spam:  0.8194444444444444


### References
<a id='references'></a>
- [1] R. Andersen, C. Borgs, J. Chayes, J. Hopcroft, K. Jain, V. Mirrokni and S. Teng. Robust PageRank and locally computable spam detection features. In Proceedings of the 4th international workshop on Adversarial information retrieval on the web (AIRWeb '08), 2008.
- [2] R. Andersen, C. Borgs, J. Chayes, J. Hopcroft, V. Mirrokni and S. Teng. Local computation of PageRank contributions. In 5th International Workshop on Algorithms and Models for the Web-Graph, 2007.
- [3] C. Barry, M. Lardner. A Study of First Click Behaviour and User Interaction on the Google SERP. In: Pokorny J. et al. (eds) Information Systems Development. Springer, New York, 2011.