# Wikipedia Traffic

This notebook aims to model user traffic on [Wikipedia](https://wikipedia.org) using a recurrent graph convolutional neural network.

Goal: anomaly detection, which can be used to detect events in the real world. Other applications:
* intrusion detection on telecommunication networks,
* anomaly detection on energy networks,
* accident detection on transportation networks.

Events: Super Bowl, Academy Awards, Grammy, Miss Universe, Golden Globe. Mostly December-February.
Missed: Charlie Hebdo, Ebola

The network is very large: 5M nodes, 300M edges. Downsampling ideas:
* Choose a category, e.g. science.
* Keep only the most active pages.
* Merge pages into modules / communities / super-nodes.
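
The "most active" idea can be sketched as follows. This is a minimal illustration on toy data, not the procedure used in this notebook: the function name `most_active_subgraph`, the `views` matrix (pages x hours) and the `edges` array are all hypothetical. It keeps the `k` pages with the most total views and the hyperlinks between them, relabeling the kept pages from 0 to k-1.

```python
import numpy as np

def most_active_subgraph(views, edges, k):
    """Keep the k pages with the most total views and the edges among them."""
    totals = views.sum(axis=1)
    # Indices of the k most viewed pages, returned in increasing order.
    keep = np.sort(np.argsort(-totals, kind='stable')[:k])
    mask = np.zeros(len(totals), dtype=bool)
    mask[keep] = True
    # Map old page indices to new ones (-1 for dropped pages).
    new_index = np.full(len(totals), -1, dtype=int)
    new_index[keep] = np.arange(len(keep))
    # An edge survives only if both endpoints are kept.
    kept = mask[edges[:, 0]] & mask[edges[:, 1]]
    return keep, new_index[edges[kept]]

# Toy data: 4 pages, 2 hours of views, 4 directed hyperlinks.
views = np.array([[10, 20], [0, 1], [5, 5], [100, 0]])
edges = np.array([[0, 1], [0, 3], [3, 2], [2, 0]])
keep, sub_edges = most_active_subgraph(views, edges, k=2)
print(keep, sub_edges)
```

Ranking by total views is only one possible notion of activity; peak hourly views or the number of active hours would be equally plausible choices.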

Raw data
* [Wikimedia SQL dumps](https://dumps.wikimedia.org/enwiki/), to construct the hyperlink graph.
    * Network size: 5M nodes, 300M edges.
* [Pagecounts](https://dumps.wikimedia.org/other/pagecounts-all-sites/) as activations on the graph.
    * Data from 2014-09-23 0h to 2015-06-05 22h.
    * 6142 hours in total.
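
The stated span can be checked with the standard library. The sketch below only does the date arithmetic; the variable names are illustrative and it does not look at the actual files, some of which may be missing.

```python
import datetime

start = datetime.datetime(2014, 9, 23, 0)
end = datetime.datetime(2015, 6, 5, 22)

# Elapsed hours between the first and the last hourly dump.
hours_between = (end - start) // datetime.timedelta(hours=1)
# Number of hourly timestamps if both endpoints are counted.
n_timestamps = hours_between + 1
print(hours_between, n_timestamps)
```

The elapsed time is 6142 hours, which matches the figure above; counting both endpoints gives one more timestamp.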

In [None]:
%matplotlib inline

import os
import datetime

import IPython.display as ipd
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import graph_tool.all as gt

In [None]:
%load_ext dotenv
%dotenv .env

WIKI_RAW = os.environ.get('WIKI_RAW')  # Downloaded from dumps.wikimedia.org.
WIKI_CLEAN = os.environ.get('WIKI_CLEAN')  # Processed by Kirell Benzi.

In [None]:
sns.set_context("notebook", font_scale=1.5)
plt.rcParams['figure.figsize'] = (17, 5)

## 1 Hyperlink graph

In [None]:
g = gt.load_graph(os.path.join(WIKI_CLEAN, 'enwiki-20150403-graph.gt'))

In [None]:
g.is_directed()
#g.set_directed(False)

In [None]:
print('{:.2e} vertices'.format(g.num_vertices()))
print('{:.2e} edges'.format(g.num_edges()))

g.list_properties()

In [None]:
idx = 42
page_title = g.vertex_properties['page_title'][idx]
page_id = g.vertex_properties['page_id'][idx]
print('{}: {}'.format(page_id, page_title))

In [None]:
hist = gt.vertex_hist(g, 'total')
plt.loglog(hist[1][:-1], hist[0])
plt.xlabel('#edges')
plt.ylabel('#nodes');

In [None]:
# Too large to be drawn in full.
#gt.sfdp_layout
#gt.graph_draw(g)

In [None]:
# Remove uninteresting pages.
#g.set_vertex_filter()
#g.remove_vertex

In [None]:
A = gt.adjacency(g)
A

## 2 Pages

Many pages in `pagecounts` are redirects to actual pages. Their hits need to be merged into the hits of the page they point to.
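
A minimal sketch of such a merge, on toy data. It assumes that `fix_page_id` holds the id of the redirect target (and the page's own id for regular pages), which is how the column appears to be used in this notebook; the helper `merge_redirect_hits` and the `pagecounts` values are hypothetical.

```python
import pandas as pd

# Toy redirect table indexed by page_id: page 11 redirects to page 10.
redirect = pd.DataFrame(
    {'fix_page_id': [10, 10, 30], 'is_redirect': [0, 1, 0]},
    index=pd.Index([10, 11, 30], name='page_id'))

# Hypothetical hits for one hour, indexed by page_id (99 is unknown).
pagecounts = pd.Series([100, 7, 50, 3], index=[10, 11, 30, 99])

def merge_redirect_hits(pagecounts, redirect):
    """Sum the hits of each redirect into its target page."""
    target = redirect['fix_page_id'].reindex(pagecounts.index)
    known = target.notna()  # Drop pages absent from the redirect table.
    return pagecounts[known].groupby(target[known].astype(int)).sum()

merged = merge_redirect_hits(pagecounts, redirect)
print(merged)
```

Here pages absent from the redirect table are dropped; keeping them under their own id would be another reasonable choice.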

In [None]:
filepath = os.path.join(WIKI_CLEAN, 'enwiki-20150403-page-redirect.csv.gz')
redirect = pd.read_csv(filepath, compression='gzip', sep='|', encoding='utf-8', quoting=3, index_col=1)

redirect.head()

In [None]:
#assert len(redirect) == len(redirect['page_id'].unique())
print('{:.2e} unique pages, {:.2e} pages including redirections'.format(
        len(redirect['fix_page_id'].unique()),
        len(redirect)))

In [None]:
redirect.loc[page_id]

In [None]:
def id2title(page_id):
    return redirect.at[page_id, 'fix_page_title']
    #return redirect[redirect['page_id'] == page_id]['fix_page_title'].values[0]
id2title(330)

In [None]:
def find_in_title(string):

    def find(page_title, string):
        try:
            return string.lower() in page_title.lower()
        except AttributeError:  # Non-string titles (e.g. NaN) have no .lower().
            return False

    #b = redirect['fix_page_title'].apply(find, string=string)
    b = redirect['page_title'].apply(find, string=string)
    #return redirect[b]
    return redirect[b & (redirect['is_redirect'] == 0)]

#find_in_title('ebola')
find_in_title('zirka')

## 3 Page views / counts

The graph has 4M nodes, but many pages are rarely viewed. `signal_500.h5` lists only 118k pages.

In [None]:
# Kirell's signal, which only includes view counts greater than 500.
filepath = os.path.join(WIKI_CLEAN, 'signal_500.h5')
signal = pd.read_hdf(filepath, 'data')
signal['count_views'].plot(kind='hist', logy=True)
print(len(signal), len(signal['page_id'].unique()), len(signal['layer'].unique()), signal['count_views'].max())
signal.head()

In [None]:
filepath = '../data/wikipedia/activations_all.h5'

if os.path.exists(filepath):
    activations = pd.read_hdf(filepath, 'activations')

else:
    START = datetime.datetime(2014, 9, 23, 2)
    #END = datetime.datetime(2014, 9, 24, 2)
    END = datetime.datetime(2015, 6, 5, 20)

    activations = pd.DataFrame(columns=pd.date_range(START, END, freq='H'))

    folder = os.path.join(WIKI_CLEAN, 'pagecounts_clean')
    for date in tqdm_notebook(activations.columns):
        filename = 'pagecounts-{:4d}{:02d}{:02d}-{:02d}0000.csv.gz'.format(date.year, date.month, date.day, date.hour)
        filename = os.path.join(folder, filename)
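        # The `squeeze` argument of read_csv was removed in pandas 2.0; squeeze the single column instead.
        pagecounts = pd.read_csv(filename, compression='gzip', index_col=0).squeeze('columns')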
        #print(len(pagecounts), filename)
        print(date)
        activations[date] = pagecounts
        activations[date] = activations[date].fillna(0).astype(np.int32)

    activations.to_hdf(filepath, 'activations')

print(activations.shape)
activations.head()

* Traffic shows predictable fluctuations with unpredictable spikes. The spikes are outliers.
* Anomalies should be outliers persisting for many hours.
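
The distinction between outliers and anomalies can be illustrated with a small sketch. It is not the model this notebook aims for: it flags outliers with a robust z-score (median and median absolute deviation) and calls a run of outliers an anomaly when it lasts at least `min_hours`. The function name, the threshold of 3 and the minimum duration of 3 hours are all assumptions chosen for illustration.

```python
import numpy as np

def persistent_outliers(x, threshold=3.0, min_hours=3):
    """Flag outliers, and anomalies as outliers lasting >= min_hours."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    # 1.4826 * MAD estimates the standard deviation for Gaussian data.
    scale = 1.4826 * mad if mad > 0 else 1.0
    is_outlier = (x - median) / scale > threshold

    anomalies = np.zeros(len(x), dtype=bool)
    run_start = None
    # A trailing False closes a run that reaches the end of the series.
    for t, flag in enumerate(np.append(is_outlier, False)):
        if flag and run_start is None:
            run_start = t
        elif not flag and run_start is not None:
            if t - run_start >= min_hours:
                anomalies[run_start:t] = True
            run_start = None
    return is_outlier, anomalies

# Toy hourly hits: one isolated spike, then a spike lasting four hours.
hits = [10, 12, 11, 90, 10, 11, 80, 85, 95, 88, 12, 10]
is_outlier, anomalies = persistent_outliers(hits)
print(np.flatnonzero(is_outlier), np.flatnonzero(anomalies))
```

The isolated spike at hour 3 is an outlier but not an anomaly, while hours 6 to 9 are both. A global median ignores the predictable daily fluctuations, so a realistic detector would first subtract a prediction of them and apply the test to the residual.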

In [None]:
#page_id = 40817806
page_id = 25
title = '{} ({})'.format(id2title(page_id), page_id)
activations.loc[page_id].plot(title=title)
plt.ylabel('#hits per hour');

In [None]:
#activations.plot(kind='hist', logy=True);

## 4 Cleanup

In [None]:
TO_REMOVE = [15580374, 42727860] # page ids to remove (Main page, Undefined)