# <center> Data Preparation </center>
## <center> 3$^{rd}$ Assignment - 1$^{st}$ Notebook </center>
### <center> By Group 80 for the course Advances in Data Mining of the University of Leiden taught by Wojtek Kowalczyk </center>
<center> Lisa Dombrovskij (s1504819) - dombrovskij@strw.leidenuniv.nl </center>
<center> Margherita Grespan (s2233150) - grespan@strw.leidenuniv.nl </center>

***

The data used for this project was compiled by Cristian Consonni, David Laniado, and Alberto Montresor (2019) and can be found [here](https://zenodo.org/record/2539424#.Xa2dvJMzZ-U). It consists of four columns describing links between pages on Wikipedia: the title of the linking page and of the page it links to, along with a unique ID for each page. The dataset spans over 17 years, with a snapshot of Wikipedia's link network taken each year.

In this notebook the data is preprocessed: the columns needed for the project are extracted and the IDs are converted into consecutive integers.

In [1]:
import numpy as np
import pandas as pd
import pickle

In [3]:
#Read in columns 'page_id_from' and 'page_id_to' from the data.
full_frame = pd.read_csv('enwiki.wikilink_graph.2004-03-01.csv', delimiter='\t', usecols = ['page_id_from','page_id_to'])

As we are only interested in the connections between pages, i.e. the graph, we extract only the IDs from the dataset. This gives two columns: 'page_id_from', the ID of the page on which the link appears, and 'page_id_to', the ID of the page the link points to.
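
The column selection can be tried out on a small in-memory sample. The following is a minimal sketch: the rows are made up, and the two title column names (`page_title_from`, `page_title_to`) are assumptions about the file header, not taken from this notebook.

```python
import io

import pandas as pd

# Made-up tab-separated sample; the title column names are assumptions.
sample_tsv = (
    "page_id_from\tpage_title_from\tpage_id_to\tpage_title_to\n"
    "12\tPage A\t34568\tPage B\n"
    "12\tPage A\t35416\tPage C\n"
)

# usecols keeps only the two ID columns; the title columns are not loaded.
edges = pd.read_csv(io.StringIO(sample_tsv), sep='\t',
                    usecols=['page_id_from', 'page_id_to'])

print(edges.columns.tolist())  # ['page_id_from', 'page_id_to']
```

Restricting the read to the ID columns keeps memory use down, which matters for the full snapshot files.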

In [4]:
full_frame.head()

Unnamed: 0,page_id_from,page_id_to
0,12,34568
1,12,35416
2,12,34569
3,12,34699
4,12,34700


In [5]:
full_frame.page_id_from.unique() #Not consecutive

array([      12,       25,       39, ..., 56612504, 56612815, 56613112])

Above, the array of unique IDs in the 'page_id_from' column is shown. It is clear that there are gaps in the IDs. For consistency and clarity, we convert the IDs to consecutive integers, starting at zero. Essentially, the new ID of each page is its index in the sorted array of all unique IDs that occur in either column.
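
The idea can be illustrated on a toy example. This is a minimal sketch with made-up IDs (`toy_ids` is a hypothetical array, not taken from the dataset): each ID is replaced by its position in the sorted array of unique IDs.

```python
import numpy as np

# Hypothetical, made-up page IDs with gaps (not from the dataset).
toy_ids = np.array([12, 25, 39, 12, 100, 39])

# np.unique returns the sorted unique values.
toy_unique = np.unique(toy_ids)

# The new ID of each value is its position in the sorted unique array.
toy_new = np.searchsorted(toy_unique, toy_ids)

print(toy_unique)  # [ 12  25  39 100]
print(toy_new)     # [0 1 2 0 3 2]
```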

In [6]:
unique_ids = np.unique(full_frame[['page_id_from', 'page_id_to']].values) #Sorted unique original IDs from both columns
n_unique_ids = len(unique_ids) #Number of unique IDs

In [7]:
new_ids = np.arange(n_unique_ids) #New ID's --> Consecutive integers

In [8]:
#Put original IDs in a dictionary as keys with new IDs as values
conversion_dict = dict(zip(unique_ids, new_ids))

In [15]:
full_frame['new_page_id_from'] = full_frame.page_id_from.map(conversion_dict) #Convert IDs

In [16]:
full_frame['new_page_id_to'] = full_frame.page_id_to.map(conversion_dict) #Convert IDs
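
Looking up every row in a Python dictionary can be slow on large edge lists. As a possible alternative (a sketch, not the approach used in this notebook), `np.unique` with `return_inverse=True` returns the consecutive IDs directly. The small `toy_frame` below is a made-up stand-in for `full_frame`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for full_frame (hypothetical values).
toy_frame = pd.DataFrame({'page_id_from': [12, 12, 25],
                          'page_id_to': [39, 25, 12]})

# Flatten both ID columns row by row: from_0, to_0, from_1, to_1, ...
flat_ids = toy_frame[['page_id_from', 'page_id_to']].to_numpy().ravel()

# 'inverse' holds, for every entry of flat_ids, its index in the
# sorted array of unique IDs, i.e. its new consecutive ID.
toy_unique_ids, inverse = np.unique(flat_ids, return_inverse=True)

# Restore the two-column layout and attach the new IDs.
new_cols = inverse.reshape(-1, 2)
toy_frame['new_page_id_from'] = new_cols[:, 0]
toy_frame['new_page_id_to'] = new_cols[:, 1]

print(toy_frame['new_page_id_from'].tolist())  # [0, 0, 1]
print(toy_frame['new_page_id_to'].tolist())    # [2, 1, 0]
```

This yields the same mapping as the dictionary, because both assign each ID its index in the sorted array of unique IDs.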

In [17]:
full_frame.head()

Unnamed: 0,page_id_from,page_id_to,new_page_id_from,new_page_id_to
0,12,34568,0,18381
1,12,35416,0,19179
2,12,34569,0,18382
3,12,34699,0,18501
4,12,34700,0,18502


The new table shows the old IDs (page_id_from and page_id_to) and their new IDs (new_page_id_from and new_page_id_to). Since we only need the new IDs, the last two columns are stored as a pickle file for further use. If necessary, the IDs can be converted back by inverting the conversion dictionary made in this notebook, which maps original IDs to new IDs.

In [19]:
with open('prepared_data', 'wb') as f:
    pickle.dump(full_frame[['new_page_id_from','new_page_id_to']], f)
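
As a sketch of how the stored file might be used later: the pickle can be read back with `pickle.load`, and new IDs can be mapped to original IDs by indexing the sorted array of unique original IDs, since each new ID is an index into that array. The values and the temporary path below are made up. In practice, `unique_ids` would have to be saved as well for this to work in another session.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-ins for unique_ids and the prepared edge list.
toy_unique_ids = np.array([12, 25, 39])
toy_edges = pd.DataFrame({'new_page_id_from': [0, 0, 1],
                          'new_page_id_to': [2, 1, 0]})

# Write and re-read the frame with pickle, using a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'prepared_data')
    with open(path, 'wb') as f:
        pickle.dump(toy_edges, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

# New ID i corresponds to toy_unique_ids[i], so indexing recovers the original IDs.
original_from = toy_unique_ids[loaded['new_page_id_from'].to_numpy()]
original_to = toy_unique_ids[loaded['new_page_id_to'].to_numpy()]

print(original_from.tolist())  # [12, 12, 25]
print(original_to.tolist())    # [39, 25, 12]
```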