In [1]:
import networkx as nx

0a) Download the dataset from SNAP and extract it somewhere.


0b) Use links.tsv to construct a directed NetworkX graph. links.tsv looks like the following:

In [2]:
import pandas as pd

In [3]:
path = '/Users/maxperozek/CP341/Day4/wikispeedia_paths-and-graph/links.tsv'

links_df = pd.read_csv(path, sep='\t', on_bad_lines='skip')

In [4]:
links_df

Unnamed: 0,to,from
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Bede
1,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Columba
2,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,D%C3%A1l_Riata
3,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Great_Britain
4,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Ireland
...,...,...
119877,Zulu,South_Africa
119878,Zulu,Swaziland
119879,Zulu,United_Kingdom
119880,Zulu,Zambia
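A minimal sketch of an alternative way to parse the file, assuming `links.tsv` begins with `#` comment lines and has no header row (the in-memory sample below is a stand-in for the real file, and the column names `source` and `target` are illustrative, not taken from the dataset). Passing `comment='#'` and explicit `names` avoids relying on `on_bad_lines='skip'` to get past the preamble:

```python
import io
import pandas as pd

# Stand-in for links.tsv; the real file is assumed to have the same shape:
# '#' comment lines, then tab-separated rows with no header.
sample = io.StringIO(
    "# a comment line\n"
    "# another comment line\n"
    "\n"
    "Zulu\tSouth_Africa\n"
    "Zulu\tSwaziland\n"
)

# 'source' and 'target' are hypothetical column names chosen for this sketch.
sample_df = pd.read_csv(sample, sep="\t", comment="#", names=["source", "target"])
print(sample_df)
```

With explicit names, the first column is unambiguously the page that contains the link.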


In [5]:
# One directed edge per row: first column -> second column
edges = list(zip(links_df['to'], links_df['from']))

In [6]:
links_gr = nx.DiGraph()

In [7]:
links_gr.add_edges_from(edges)

In [8]:
links_gr.has_edge('%C3%89douard_Manet','Paris')

True

In [9]:
nx.is_strongly_connected(links_gr)

False
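Since the graph is not strongly connected, it can help to see how much of it is, and which pages have no outgoing links at all. The sketch below does this on a small toy graph (`toy_df`, `toy_gr`, and the node names are made up for illustration); it also builds the graph directly from a DataFrame with `nx.from_pandas_edgelist` instead of looping over rows:

```python
import networkx as nx
import pandas as pd

# Toy edge list standing in for the links DataFrame
toy_df = pd.DataFrame({
    "source": ["A", "B", "C", "C"],
    "target": ["B", "A", "A", "D"],
})

toy_gr = nx.from_pandas_edgelist(
    toy_df, source="source", target="target", create_using=nx.DiGraph
)

print(nx.is_strongly_connected(toy_gr))

# Largest set of pages that can all reach each other by following links
largest_scc = max(nx.strongly_connected_components(toy_gr), key=len)

# Dead ends: pages with no outgoing links, where a random walker must jump
dead_ends = [node for node in toy_gr if toy_gr.out_degree(node) == 0]
print(largest_scc, dead_ends)
```

Dead ends matter later: both the random traveler and the matrix approach need a rule for what happens on a page with no links.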

0c) For every article, we want to build a database of search terms - what words show up in what article. You can do this any way you like. My recommendation would be to use a hash map that maps article title to a set of words.

In [10]:
import os

In [11]:
rootdir = '/Users/maxperozek/CP341/Day4/plaintext_articles/'

db = {}
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        # join with subdir (not rootdir) so files in nested folders open correctly
        with open(os.path.join(subdir, file), "r") as f:
            # keep a set: duplicates are dropped and `in` lookups are fast
            db[os.path.splitext(file)[0]] = set(f.read().split())
        

In [12]:
len(db)

4604
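Splitting on whitespace keeps case and punctuation, so a token like `Spacetime,` would not match a search for `spacetime`. A minimal sketch of a stricter tokenizer follows; `tokenize` is a hypothetical helper, and the regex only keeps ASCII letters and digits, which is an assumption that will drop accented characters:

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of ASCII alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

sample_text = "Spacetime, in physics, combines space and time. (See: spacetime.)"

# Whitespace splitting leaves punctuation attached to the words
naive_words = set(sample_text.split())

# The regex tokenizer yields clean, lowercase search terms
clean_words = tokenize(sample_text)
print("spacetime" in naive_words, "spacetime" in clean_words)
```

Search terms would need the same lowercasing before lookup for the two sides to agree.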

0d) Write a line of code that uses your hashmap from 0c to list all of the article titles that contain a specific search term.

In [13]:
search_word = 'spacetime'
[key for key in db if search_word in db[key]]

['Luminiferous_aether',
 'Special_relativity',
 'Introduction_to_special_relativity',
 'Fermi_paradox',
 'Black_hole',
 'Metric_expansion_of_space',
 'Gottfried_Leibniz',
 'Hubble%27s_law',
 'Physics',
 'Euclidean_geometry',
 'Quantum_mechanics',
 'Redshift',
 'Maxwell%27s_equations',
 'Time',
 'Acceleration',
 'Big_Bang',
 'Physical_paradox',
 'String_theory',
 'Gravitation',
 'Phase_%28matter%29',
 'Cosmic_inflation']
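The one-liner above scans every article for each query. If many searches are expected, one option is to invert the map once so that each word points at the titles containing it. This is a sketch on a toy database (`toy_db` and `inverted` are illustrative names, not part of the assignment):

```python
from collections import defaultdict

# Toy version of the title -> words map built in 0c
toy_db = {
    "Black_hole": {"spacetime", "gravity"},
    "Time": {"spacetime", "clock"},
    "Abacus": {"compute", "beads"},
}

# Invert it: word -> set of titles containing that word
inverted = defaultdict(set)
for title, words in toy_db.items():
    for word in words:
        inverted[word].add(title)

# A search is now a single dictionary lookup
print(sorted(inverted.get("spacetime", set())))
```

Using `.get` with a default avoids adding empty entries to the index for words that never occur.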

1a) Write code to compute pageranks using a random traveler clicking links in the graph. At each step, the traveler should either pick a random link on their current page or (with a 15% chance) go to a random page elsewhere in the dataset. Perform a very large number of hops (more than 10^4) and keep track of how many times your traveler visits each page in the graph. This number of visits, divided by the total number of visits, is the probability of our traveler visiting a given page.

In [14]:
import numpy as np

In [15]:
def inc_visits(gr, node):
    # visit counts live in the 'visits' node attribute
    gr.nodes[node]['visits'] = gr.nodes[node].get('visits', 0) + 1

def rand_traveler_rank(gr, iters=int(1e4)):
    gr = gr.copy()  # keep visit counts off the caller's graph
    # build the node list once, from gr itself rather than the global links_gr
    nodes = list(gr.nodes())

    current = nodes[np.random.randint(len(nodes))]
    inc_visits(gr, current)

    # walk iters
    for _ in range(iters):
        neighbors = list(gr[current])
        if np.random.random() > 0.85 or not neighbors:
            # go to a random page (also the way out of pages with no links)
            current = nodes[np.random.randint(len(nodes))]
        else:
            # follow a random outgoing link
            current = neighbors[np.random.randint(len(neighbors))]
        inc_visits(gr, current)
    return gr
        

In [16]:
ranked_gr = rand_traveler_rank(links_gr)
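The prompt defines the rank as visits divided by total visits, while the walker above stores raw counts. A self-contained sketch of that normalization on a toy graph follows; `visit_probabilities`, its parameters, and the four-node graph are hypothetical, and it uses Python's `random` module with a fixed seed rather than NumPy so the run is repeatable:

```python
import random
from collections import Counter

import networkx as nx

def visit_probabilities(gr, hops=50_000, jump_prob=0.15, seed=0):
    """Random-walk the graph and return visits / total visits per node."""
    rng = random.Random(seed)
    nodes = list(gr.nodes())
    visits = Counter()
    current = rng.choice(nodes)
    for _ in range(hops):
        visits[current] += 1
        neighbors = list(gr.successors(current))
        if rng.random() < jump_prob or not neighbors:
            current = rng.choice(nodes)      # random jump (or dead end)
        else:
            current = rng.choice(neighbors)  # follow a random link
    total = sum(visits.values())
    return {node: visits[node] / total for node in nodes}

toy_gr = nx.DiGraph([("A", "B"), ("B", "A"), ("C", "A"), ("C", "D")])
probs = visit_probabilities(toy_gr)
print(probs)
```

Dividing by the total makes runs with different hop counts comparable, and makes the output directly comparable to a probability vector from the matrix approach.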

1b) Use your method from 0d, along with your ranks from 1a, to showcase a few example search results. Do these results look reasonable? Why or why not?

In [17]:
def ranked_search(term, gr, db):
    pages = [key for key in db if term in db[key]]

    ranked = []
    for page in pages:
        # pages missing from the link graph, or never visited, count as 0 visits
        visits = gr.nodes[page].get('visits', 0) if page in gr else 0
        ranked.append((visits, page))
    ranked.sort(key=lambda y: y[0], reverse=True)

    return ranked
        
    

Some titles in `db` do not exist as nodes in the link graph, which causes a lookup error.
The one that keeps coming up is 'Private_Peaceful'.
I checked the original .tsv file, and it does not appear in any of the edges.
#### Error Handling Options:
When we see a page name that is not in the graph:
 * Add it to the graph as a disconnected node with 'visits' = 0
 * Skip the page and ignore it completely
 * Do not add it to the graph, but still return it at the end of the search results with 0 visits (the option `ranked_search` uses)
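For comparison, the first option in the list above could look roughly like the sketch below. The toy graph and database are made up for illustration; only the title 'Private_Peaceful' comes from the notebook:

```python
import networkx as nx

toy_gr = nx.DiGraph([("A", "B"), ("B", "A")])
toy_db = {"A": {"x"}, "B": {"y"}, "Private_Peaceful": {"book"}}

# Titles that have article text but never appear in any edge
missing = [title for title in toy_db if title not in toy_gr]

# Add them as isolated nodes that start with zero visits
toy_gr.add_nodes_from(missing, visits=0)
print(missing, toy_gr.nodes["Private_Peaceful"])
```

One consequence to keep in mind: if the nodes are added before the walk, random jumps can land on them, so they would end up with a small nonzero rank instead of 0.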

In [18]:
# first 15 results
ranked_search('book', ranked_gr, db)[:15]

[(99, 'United_States'),
 (51, 'English_language'),
 (45, 'Japan'),
 (41, 'World_War_II'),
 (34, 'World_War_I'),
 (30, 'Christianity'),
 (29, '19th_century'),
 (27, 'Scientific_classification'),
 (26, 'German_language'),
 (21, 'Judaism'),
 (21, 'Evolution'),
 (19, 'Bible'),
 (16, 'Berlin'),
 (16, 'Aristotle'),
 (14, 'Mathematics')]

In [19]:
ranked_search('compute', ranked_gr, db)[:15]

[(30, 'Time_zone'),
 (8, 'Logic'),
 (7, 'Quantum_mechanics'),
 (7, 'Asteroid'),
 (5, 'Prime_number'),
 (4, 'Time'),
 (4, 'Photon'),
 (3, 'Ordinary_differential_equation'),
 (3, 'Mathematical_analysis'),
 (2, 'Abacus'),
 (2, 'Cryptography'),
 (2, 'Greenhouse_effect'),
 (2, 'Differential_equation'),
 (2, 'Trigonometry'),
 (1, 'Calculus')]

In [20]:
ranked_search('calculus', ranked_gr, db)[:15]

[(14, 'Mathematics'),
 (8, 'Isaac_Newton'),
 (8, 'Science'),
 (8, 'Education'),
 (8, 'Logic'),
 (6, 'Gottfried_Leibniz'),
 (4, 'Algorithm'),
 (4, 'Ren%C3%A9_Descartes'),
 (4, 'Programming_language'),
 (4, 'Geometry'),
 (4, 'Age_of_Enlightenment'),
 (3, 'Archimedes'),
 (3, 'Blaise_Pascal'),
 (3, 'Mathematical_analysis'),
 (2, 'Chemistry')]

These search results look somewhat reasonable, but in many cases broadly linked, 'common' pages rank higher than expected. The results are also very Western-centric, though I imagine this reflects the makeup of this particular Wikipedia subset. The most striking example of both effects is the search for 'book', where the top results are:
'United_States', 'English_language', 'Japan', 'World_War_II', 'World_War_I', 'Christianity'
With the exception of Japan, all of these are Western-oriented topics, and none is specifically about books. This makes sense given the search we designed: any page containing the word counts as a match, and matches are ordered purely by visit count, so pages such as 'United_States' and 'World_War_II', which probably have a large number of incoming links, rise to the top.
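The guess that heavily linked pages dominate can be checked by sorting nodes by in-degree. The sketch below shows the call on a toy graph (the edges are invented for illustration and are not the real link counts):

```python
import networkx as nx

toy_gr = nx.DiGraph([
    ("Japan", "United_States"),
    ("Bible", "United_States"),
    ("Berlin", "United_States"),
    ("Berlin", "Japan"),
])

# (node, number of incoming links), most linked-to first
by_in_degree = sorted(toy_gr.in_degree(), key=lambda pair: pair[1], reverse=True)
print(by_in_degree[:2])
```

Running the same sort on `links_gr` and comparing it with the visit counts would show how closely the random-walk ranking tracks raw incoming links.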

1c) Write code to compute pageranks using the matrix multiplication / Markov chain approach. We should still have a 15% chance of jumping to a random node in the network, rather than following links. I recommend implementing your matrix multiplication with sparse matrices - the transition matrix is probably too big to fit in memory otherwise. 
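Written out, one step of the iteration this question asks for is shown below. The symbols are introduced here for illustration: $A$ is the adjacency matrix with $A_{ji} = 1$ when page $j$ links to page $i$, $d_j$ is the number of outgoing links on page $j$, $N$ is the number of pages, and $r^{(t)}$ is the rank vector after $t$ steps.

$$
S_{ij} = \frac{A_{ji}}{d_j}, \qquad r^{(t+1)} = 0.85\, S\, r^{(t)} + \frac{0.15}{N}\,\mathbf{1}
$$

As long as the entries of $r^{(t)}$ sum to 1, the second term equals $\frac{0.15}{N}\,\mathbf{1}\mathbf{1}^{\top} r^{(t)}$, so this update gives the same result as multiplying by the dense matrix $0.85\,S + \frac{0.15}{N}\,\mathbf{1}\mathbf{1}^{\top}$ while only the sparse $S$ has to be stored.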

In [21]:
from scipy import sparse as spr

In [22]:
# Make sure we are treating source and target nodes correctly 
# adjacency matrix from networkx will probably need to be transposed

In [23]:
adj = nx.to_scipy_sparse_array(links_gr).T

In [24]:
adj.sum(0)

array([11, 12, 10, ...,  2,  4,  6])

In [117]:
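# cast to float: the adjacency matrix is integer-typed, so dividing in place would truncate the weights
adj_dense = np.asarray(adj.todense(), dtype=float)
sums = adj_dense.sum(0)

# normalize each column to sum to 1; all-zero columns (pages with no outgoing links) stay 0
np.divide(adj_dense, sums, out=adj_dense, where=sums != 0)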
    

In [118]:
# add handling for cols that sum to 0 (divide by 0 issue)

# S = (adj / adj.sum(0))
S_85 = spr.coo_matrix(0.85 * adj_dense)
rand_coef = 0.15 * (1/len(adj_dense))

In [119]:
S = adj_dense
goog = .85 * S + .15 * ((1/len(adj_dense)) * np.ones_like(S))
goog = spr.coo_matrix(goog)

In [168]:
n = adj.shape[0]
current = np.ones(n) / n  # uniform starting distribution
print(current)

[0.00021777 0.00021777 0.00021777 ... 0.00021777 0.00021777 0.00021777]


In [169]:
current = spr.csr_array(current).T
current1 = current

In [170]:
for i in range(100):
    # print(i, current.shape, current1.shape, goog.shape)
    current1 = goog.dot(current1)
    current = (S_85 * current) + (rand_coef * np.ones(current.shape))

In [171]:
current

matrix([[3.26655052e-05],
        [3.26655052e-05],
        [3.26655052e-05],
        ...,
        [3.26655052e-05],
        [3.26655052e-05],
        [3.26655052e-05]])
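A minimal sketch of the same iteration that never builds a dense matrix. It differs from the cells above in two ways that are assumptions of this sketch, not part of the notebook: the adjacency matrix is requested as floats up front, and rank sitting on pages with no outgoing links is spread uniformly over all pages so the vector keeps summing to 1. The function name and the toy graph are hypothetical:

```python
import numpy as np
import networkx as nx

def sparse_pagerank(gr, damping=0.85, iters=100):
    """Power iteration on a sparse, transposed adjacency matrix."""
    nodes = list(gr.nodes())
    n = len(nodes)
    # Transposed adjacency: entry (i, j) is 1.0 when page j links to page i
    adj_t = nx.to_scipy_sparse_array(gr, nodelist=nodes, dtype=float).T
    out_deg = np.asarray(adj_t.sum(axis=0)).ravel()  # column sums = out-degrees
    dangling = out_deg == 0
    inv_out = np.zeros(n)
    inv_out[~dangling] = 1.0 / out_deg[~dangling]

    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Scaling rank by 1/out-degree first is the same as column-normalizing
        followed = adj_t @ (rank * inv_out)
        # Assumption: dead-end pages hand their rank to every page equally
        dangling_mass = rank[dangling].sum()
        rank = damping * (followed + dangling_mass / n) + (1 - damping) / n
    return dict(zip(nodes, rank))

toy_gr = nx.DiGraph([("A", "B"), ("B", "A"), ("C", "A"), ("C", "D")])
ranks = sparse_pagerank(toy_gr)
print(ranks)
```

Because only matrix-vector products are used, memory stays proportional to the number of links rather than the square of the number of pages.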

1d) Same as 1b, but with your new ranks from 1c. Do these results look improved?

In [225]:
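def ranked_search_matrix(term, gr, db, vec):
    pages = [key for key in db if term in db[key]]
    # map each title to its row in vec once, instead of scanning the node list per page
    index = {node: i for i, node in enumerate(gr.nodes())}

    ranked = []
    for page in pages:
        try:
            ranked.append((float(vec[index[page]].todense()), page))
        except KeyError:
            # page has article text but does not appear in the link graph
            ranked.append((0, page))

    ranked = sorted(ranked, key=lambda x: x[0], reverse=True)
    return ranked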

In [230]:
search_res = ranked_search_matrix('book', links_gr, db, current1)
for i in range(15):
    print(search_res[i][1])

Functional_programming
Immanuel_Kant
Animal_rights
Japan
John_W._Campbell
Andriyivskyy_Descent
Carl_Jung
Arithmetic
Book_of_Common_Prayer
Wars_of_the_Roses
Intelligence
Martin_Luther_King%2C_Jr.
Nintendo_Entertainment_System
John_Dee
Eifel_Aqueduct


In [231]:
search_res = ranked_search_matrix('compute', links_gr, db, current1)
for i in range(15):
    print(search_res[i][1])

Calculus
Charles_Babbage
Abacus
Pi
Trigonometric_function
Cryptography
Quantum_mechanics
StarCraft
Time
Time_zone
Photon
Actuary
Greenhouse_effect
String_theory
Differential_equation


In [232]:
search_res = ranked_search_matrix('calculus', links_gr, db, current1)
for i in range(15):
    print(search_res[i][1])

Functional_programming
Mathematics
Chemistry
Calculus
Supply_and_demand
Gottfried_Leibniz
Isaac_Newton
Algorithm
Utilitarianism
Pi
Science
Archimedes
Force
Ren%C3%A9_Descartes
Programming_language


The results from the Markov chain strategy look slightly improved over the random walk results, but there is still a great deal of overlap between the two strategies.
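One way to put a number on the overlap between two rankings, and to sanity-check either of them, is to compare top-k lists against NetworkX's built-in `nx.pagerank`. The sketch below uses a toy graph; `top_k_overlap` and `walk_order` are hypothetical names, and `walk_order` is a made-up ordering rather than output from the notebook:

```python
import networkx as nx

def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of titles shared by the top-k entries of two ranked lists."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

toy_gr = nx.DiGraph([("A", "B"), ("B", "A"), ("C", "A"), ("C", "D")])

# Reference ranking from NetworkX, with the same 85% link-following probability
reference = nx.pagerank(toy_gr, alpha=0.85)
reference_order = sorted(reference, key=reference.get, reverse=True)

# Hypothetical ordering standing in for the random-walk results
walk_order = ["A", "B", "C", "D"]
print(top_k_overlap(walk_order, reference_order, k=2))
```

Applied to the real results, the same function would turn "a great deal of overlap" into a fraction for each search term.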

### Problem 2

Search engines are ubiquitous in modern daily life, and they shape how we perceive new information, so the page rank algorithms they use to prioritize some content over other content carry heavy ethical implications. Cansu Canca’s “Did You Find It on the Internet? Ethical Complexities of Search Engine Rankings” gives the example of search results that perpetuate sexist stereotypes, such as the conception that “professor” is a masculine profession: searches for “professor” show men at a disproportionately high rate. Canca cites other studies showing that young girls are interested in professions that have many female role models, an effect that is certainly weakened by search results reflecting dated or stereotypical gender roles for occupations.

The Stanford Encyclopedia of Philosophy’s page “Search Engines and Ethics” provides a comprehensive overview of the topic. Most interestingly, it notes an ethical problem in the commercialization and opacity of page rank algorithms: well-funded corporations and groups can afford to research and optimize their web pages for search, so parties without such resources are pushed to the periphery of influence and accessibility in search engine results.
