# Analysis of Wikipedia Network
Elliot Williams<br>March 24, 2018<br>Web Scraping w. Prof. Oleinikov

## Background

Before this analysis, I ran `extractHTML.sh` on the full set of Wikipedia articles, available from the [Internet Archive](https://archive.org/search.php?query=subject%3A%22enwiki%22%20AND%20subject%3A%22data%20dumps%22%20AND%20collection%3A%22wikimediadownloads%22&and[]=subject%3A%22Wikipedia%22). This produced 54 files that together represent the adjacency list of the directed Wikipedia network, in which articles are nodes and hyperlinks are edges.
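
The parsing code in the next section treats each of these files as a flat text adjacency list: a line containing `>>>>` names an article, and the lines that follow it, up to the next `>>>>` line, are that article's outgoing links. That layout is inferred from the parsing code rather than documented anywhere, and the sample text and the `out_degrees` helper below are made up for illustration (the helper also assumes the marker sits at the start of the line). A minimal sketch of how out-degrees fall out of that format:

```python
# Hypothetical sample in the assumed format: ">>>>" marks an article title,
# and each following line (until the next title) is one outgoing link.
SAMPLE = """>>>>Alan Turing
Computer science
Enigma machine
>>>>Enigma machine
Cryptography
>>>>Stub article
"""

MARKER = ">>>>"

def out_degrees(text):
    """Return {title: out-degree} for an adjacency list in the assumed format."""
    degrees = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(MARKER):
            # A new article begins; start counting its links from zero
            current = line[len(MARKER):]
            degrees[current] = 0
        elif current is not None:
            # Any other line is a link out of the current article
            degrees[current] += 1
    return degrees

print(out_degrees(SAMPLE))
```

An article with no link lines before the next marker, like the last one in the sample, ends up with an out-degree of zero.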

## Analyzing the Results

Now I want to plot the out-degree distribution of the Wikipedia network and look at some examples of high-degree nodes.

Let's start by counting the nodes and edges in each file.

In [32]:
import re
import glob
import ray

file_list = glob.glob("./link_files/*.dat")

@ray.remote
def getOutdegs(filename):
    """Return (title, out-degree) pairs for one link file, highest out-degree first."""
    # Drops HTTP links --> just in case any got through the initial search
    # Also, some of the links contain '#' pointers to specific parts of an article
    # -- let's remove these (both done in one pass for speed purposes)
    with open(filename) as f:
        lines = [re.sub(r"#.*", "", line.strip()) for line in f
                 if not re.search(r"https?://", line)]

    # Title lines are marked with ">>>>"; every other line is an outgoing link
    t_indices = [i for i, line in enumerate(lines) if ">>>>" in line]

    print("There are {} nodes represented in this file".format(len(t_indices)))
    print("There are {} edges in this file".format(len(lines) - len(t_indices)))

    titles = [lines[i].replace(">>>>", "") for i in t_indices]
    # Append len(lines) as a sentinel so the last title also gets an out-degree
    # (without it, zip() silently drops the final article in each file)
    bounds = t_indices + [len(lines)]
    out_degs = [bounds[i + 1] - bounds[i] - 1
                for i in range(len(t_indices))]
    t_outdegs = list(zip(titles, out_degs))
    t_outdegs.sort(key=lambda x: x[1], reverse=True)
    return t_outdegs

ray.init()
file_results = ray.get([getOutdegs.remote(filename) for filename in file_list])

Waiting for redis server at 127.0.0.1:65530 to respond...
Waiting for redis server at 127.0.0.1:54793 to respond...
Starting local scheduler with the following resources: {'CPU': 16, 'GPU': 0}.

View the web UI at http://localhost:8889/notebooks/ray_ui12217.ipynb?token=de8533dfb8e08659f65f164609b13684050b2c0522595c25



In [40]:
from itertools import chain

# Concatenate the per-file lists in one pass, then re-sort globally by out-degree
t_outdegs = list(chain.from_iterable(file_results))
t_outdegs.sort(key=lambda x: x[1], reverse=True)
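
Since each per-file list comes back from `getOutdegs` already sorted by descending out-degree, the global ordering could also be produced by a k-way merge instead of a full re-sort. The sketch below uses the standard library's `heapq.merge` on made-up data; `file_results_demo` is a hypothetical stand-in for `file_results`, and this is an alternative to the cell above, not what the notebook actually ran.

```python
import heapq

# Hypothetical per-file results, each already sorted by descending out-degree
# (the same shape of (title, out-degree) pairs that getOutdegs returns).
file_results_demo = [
    [("A", 9), ("B", 4), ("C", 1)],
    [("D", 7), ("E", 2)],
    [("F", 5)],
]

# heapq.merge lazily merges sorted inputs; reverse=True tells it the inputs
# are ordered from largest to smallest according to the key.
merged = list(heapq.merge(*file_results_demo, key=lambda x: x[1], reverse=True))
print(merged)
```

With only 54 input lists the difference from a plain sort is probably small, so this matters mainly if the merged result is consumed lazily, for example to read off just the top few entries.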

In [46]:
# Python slices are 0-indexed, so [:50] selects the 50 highest out-degree articles
t_outdegs[:50]
len(t_outdegs)

import pandas as pd
titles, outdegs = zip(*t_outdegs)
df = pd.DataFrame({"title": titles, "outdeg": outdegs})

In [47]:
%%R -i df

UsageError: Cell magic `%%R` not found.
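
The `%%R` cell magic comes from the rpy2 package's IPython extension and is only registered after `%load_ext rpy2.ipython` has been run in an earlier cell, which is likely why the cell above fails. As an alternative that stays in Python, the out-degree distribution can be tabulated directly. This is a sketch on made-up data, and `degree_distribution` is a hypothetical helper rather than part of the original analysis.

```python
from collections import Counter

def degree_distribution(outdegs):
    """Map each out-degree k to the fraction of nodes with exactly k out-links."""
    counts = Counter(outdegs)
    n = len(outdegs)
    return {k: counts[k] / n for k in sorted(counts)}

# Made-up out-degrees standing in for the real `outdegs` tuple
demo_outdegs = [0, 1, 1, 2, 2, 2, 2, 8]
dist = degree_distribution(demo_outdegs)
for k, p in dist.items():
    print(k, p)
```

Plotting `k` against its fraction on log-log axes is a common way to inspect heavy-tailed degree distributions. Nodes with an out-degree of zero would need to be dropped or offset first, since zero has no logarithm.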
