# Assignment 3
In this assignment you will implement a non-trivial problem that processes Big Data. To facilitate its processing in a regular computer, the actual amount of data will not be big, but the techniques that you will implement would scale to larger volumes of data.

This assignment is worth 15% of the total assessment of the unit.

This assignment relates to the following Learning Outcomes:
* Apply Map-reduce techniques to a number of problems that involve Big Data.
* Apply Big Data techniques to data mining.


**Submission deadline: Friday Week 12, 11:55pm**

The following code unzips the data stored in tweets.zip. This is the same data you used in Assignment 2.

In [1]:
import zipfile
from pathlib import Path
if not Path('10000 tweets-NEW.json').exists():
    print("Unzipping tweets")
    with zipfile.ZipFile('cleaned-tweets.zip') as myzip:
        myzip.extractall()

The following code implements a [Python generator](https://wiki.python.org/moin/Generators) that simulates a stream of tweets. You will use this iterator in some of the following tasks. The function uses the `yield` statement instead of a `return` statement so that it does not need to read the entire file into memory. By doing this, the function can work with files of  unlimited size.

In [2]:
import json
def stream_tweets():
    with open('10000 tweets-NEW.json', encoding='iso8859-1') as jfile:
        for line in jfile:
            try:
                next_tweet = json.loads(line)
            except:
                continue # yield next tweet instead of returning str
            yield next_tweet

There will be a demonstration of the use of this code in the lectures and workshops. Below is an example of how it can be used in a loop:

In [3]:
counter = 0
for s in stream_tweets():
    if counter > 3:
        break
    counter += 1
    print(s)

{'id': 'tag:search.twitter.com,2005:715690137900941312', 'objectType': 'activity', 'actor': {'objectType': 'person', 'id': 'id:twitter.com:18064228', 'link': 'http://www.twitter.com/Intelledox', 'displayName': 'Intelledox', 'postedTime': '2008-12-11T23:47:55.000Z', 'image': 'https://pbs.twimg.com/profile_images/485981380585603072/inMuMtJ7_normal.png', 'summary': "Intelledox's mobile-ready digitalization software helps over 1 million people to do business faster, smarter & efficiently Digitalize your business process now!", 'links': [{'href': 'http://www.intelledox.com', 'rel': 'me'}], 'friendsCount': 486, 'followersCount': 549, 'listedCount': 24, 'statusesCount': 1188, 'twitterTimeZone': 'Canberra', 'verified': False, 'utcOffset': '39600', 'preferredUsername': 'Intelledox', 'languages': ['en'], 'location': {'objectType': 'place', 'displayName': 'Canberra, Australia'}, 'favoritesCount': 55}, 'verb': 'post', 'postedTime': '2016-04-01T00:00:00.000Z', 'generator': {'displayName': 'HubSpot'

## Task 1 (5 marks)
Fill the gaps in the class below that processes the stream and issues the following standing queries:

* (1 mark) The length of the shortest tweet and the length of the longest tweet so far.
* (2 marks) The twitter ID of the person who has posted most tweets in the last 1000 posts.
* (2 marks) The twitter ID of the most active twitter when we apply an exponentially decaying window with $c=10^{-3}$ and a threshold of 0.5.

In your implementation, make sure that the system scales well to unlimited streams, and answer the following question:

1. How much memory do you need to reserve to keep the information about each of the standing queries?

# TODO: this question needs fixing
part 3, make the algorithm better

In [4]:
import numpy as np
from collections import Counter
from pprint import pprint
class StreamProcessor:
    
    shortest = None
    longest = None
    frequency = []
    active = Counter()
    gi = 0
    
    def _step_rval(self):
        return {'shortest': self.shortest,
               'longest': self.longest,
               'most_frequent': max(Counter(self.frequency)),
               'most_active': max(self.active, key=lambda x: self.active[x])}
    
    # end of class variables
    def step(self, item):
        """Process one item from the stream and return the answers to the
        standing queries as a Python dictionary with the following keys:
          - shortest
          - longest
          - most_frequent
          - most_active
        """
        userid = item['actor']['id']
        post = item['body']
        l = len(post)
        self.frequency.append(userid)
        self.frequency = self.frequency[-1000:] # cut off entries before last 1000.
        
        # apply decay to all users
        # mathematical equivalent of decay in growth
        # so dont have to iterate over whole counter, only increment the growth indicator
        # (IE each time add a larger number, and occasionally when numbers get too big 
        # multiply down to managable size, and reset growth indicator)
        # note: I can prove this mathematically
        if self.gi > 1000:
            for k in self.active:
                self.active[k] = self.active[k]*(.999**self.gi)
                self.gi = 0
        # add this tweet to activity
        self.active[userid] += 1/(.999**self.gi)
        self.gi += 1
        
        # initialize lowest and highest
        if self.shortest is None:
            self.shortest = l
            self.longest = l
            return self._step_rval() # no more calculation
        
        if l<self.shortest:
            self.shortest = l
        if l > self.longest:
            self.longest = l
        
        return self._step_rval()

In [5]:
n = 1000
s = 0
s_ = 0
for i in range(n):
    s_ *= .999
    s_ += 1
    s += 1/(.999**i)
print('maximum value n can get to with growth indicator cutoff = n')
print(s, s_)

maximum value n can get to with growth indicator cutoff = n
1717.9225742264064 632.3045752290355


The following code will apply the stream processor to the first 5 elements of the stream.

In [6]:
counter = 0
stream = StreamProcessor()
for s in stream_tweets():
    if s == 'Tweet error':
        continue
    if counter >= 5:
        break
    counter += 1
    print(stream.step(s))

{'shortest': 140, 'longest': 140, 'most_frequent': 'id:twitter.com:18064228', 'most_active': 'id:twitter.com:18064228'}
{'shortest': 138, 'longest': 140, 'most_frequent': 'id:twitter.com:188921458', 'most_active': 'id:twitter.com:188921458'}
{'shortest': 127, 'longest': 140, 'most_frequent': 'id:twitter.com:97578801', 'most_active': 'id:twitter.com:97578801'}
{'shortest': 127, 'longest': 140, 'most_frequent': 'id:twitter.com:97578801', 'most_active': 'id:twitter.com:3266593548'}
{'shortest': 127, 'longest': 140, 'most_frequent': 'id:twitter.com:97578801', 'most_active': 'id:twitter.com:225568917'}


## Task 2 (5 marks)
Apply the minhashing techniques we have covered in week 7 to determine the set of near-duplicates among the tweet posts. For this exercise use only the first 500 tweet posts (so that you do not need to wait too long). To complete this assignment you can reuse code from the lecture notebooks and from the workshop exercises. Use your judgement to determine the parameters and answer the following questions:

1. What value of $k$ did you use to represent the $k$-shingles and why?
2. Did you hash the $k$-shingles and why?
3. If you hashed the $k$-shingles, how many buckets did you use and why?
4. How many hashes did you use for minhashing, how many buckets, and why?
5. How many bands and rows did you use for locality-sensitive hashing and why?

In [7]:
from itertools import combinations, count
from collections import namedtuple
import pandas as pd
from matplotlib import pyplot as plt
% matplotlib inline
# implementation:
# k_shingles function creates a zip of k copies of the tweet, each
# offset by one more then the previous and returns the columns. EG:
#          >>> k_shingles('testing', 3)
# internally:  v v v v v
#comprehension t e s t i
#   function   e s t i n
#   returns    s t i n g
# zip transposes
# map applies the hashing function
#          <<< {'tes','est','sti','tin','ing'} 
# 
# note: zip only returns values while arguments have elements at the
# current index.
# 
# this method of computing k_shingles is memory intensive, but has
# a lower time complexity (and is very pythonic). The memory shouldnt
# be an issue as tweets are all below 280 characters (so (280-k) * k)
def k_shingles(tweet, k, h=lambda x:hash(''.join(x))):
    if len(tweet) < k:
        return tweet
    else:
        return set(map(h, zip(*[tweet[i:len(tweet)-(k-i-1)] for i in range(k)])))

# print(k_shingles('0123456789', 3))
# print(k_shingles('test', 3, h=hash))
# print(k_shingles('tesa', 3, h=hash))

# for easier debugging
Tweet = namedtuple('Tweet', ['body','id'])

tweetset = {t['actor']['id']:Tweet(t['body'], t['actor']['id']) for _, t in zip(range(500), stream_tweets())}
#tweetset = {i: Tweet(a, i) for i, a in enumerate(['abc','abcd','zeqr', 'abacus', 'c', 'abcde'])}
buckets = 2**64
k = 20
hash_count = 400
# creates functions of the form hash_function(1) = lambda s:hash(s+str(1))%buckets
#                             # hash_function(1)('test') = some integer between 0 and buckets.
hash_function = lambda i: lambda s: hash(str(s)+str(i))%buckets
# print(hash_function(3)('a'))
hash_functions = [hash_function(i) for i in range(hash_count)]
# define signature as a named tuple (instead of an anonymous tuple)
# helps for knowing what objects are (as python is duck typed)
Signature = namedtuple('Signature', ['id', 'hash_values'])

def similarity(signature1, signature2):
    similar = 0
    for v1, v2 in zip(signature1.hash_values, signature2.hash_values):
        if v1 == v2:
            similar += 1
    return similar / hash_count

# basically a constructor for the Signature namedtuple
def signature(tweet, k=k, hash_functions=hash_functions):
    return Signature(tweet.id, tuple(min(map(h, k_shingles(tweet.body, k))) for h in hash_functions))
        # does not return a generator of the signature as it may need to be accessed many times.

# signatures = [signature(v) for k, v in tweetset.items()]

def signature_sensitivity_hashing(signature, row_size=3, hash_function=hash_functions[0]):
    bands = len(signature.hash_values)//row_size
    new_hash_values = tuple([hash_function(signature.hash_values[i*row_size:(i+1)*row_size]) for i in range(bands)] +
                                    [hash_function(signature.hash_values[:-len(signature.hash_values)%bands])])
    # sometimes the second slice will return [:0] meaning the same hash is always returned
    # (when hash_count%row_size==0)
    # this is accounted for below:
    return Signature(signature.id, new_hash_values
                     [:-1 if len(signature.hash_values)%row_size==0 else len(new_hash_values)])

# a = signature_sensitivity_hashing(signatures[0], row_size=10)
# b = signatures[0]
# print(len(a.hash_values), len(b.hash_values)/10)

def similarity_sets(tweetset, k, hash_functions, row_count, similarity_cutoff, precalc_similarities=None):
    # when similarity cutoff < 0; this will essentially s
    hash_count = len(hash_functions)
    signatures = [signature_sensitivity_hashing(signature_sensitivity_hashing(signature(tweet,
                                                                                        k=k,
                                                                                        hash_functions=hash_functions),
                                                                              row_size=row_count,
                                                                              hash_function=hash_functions[0]))
                  for i, tweet in tweetset.items()]

    # create lookup table of similarities (and dont repeat calculation if it was provided)
    if precalc_similarities is None:
        sims = {key1 : {key2 : None for key2 in tweetset} for key1 in tweetset}
        for s1, s2 in combinations(signatures, 2):
            sim = similarity(s1, s2)
            sims[s1.id][s2.id] = sim
            sims[s2.id][s1.id] = sim # it is a symmetric matrix
    else:
        sims = precalc_similarities
        
    if similarity_cutoff < 0:
        return [set(tweetset.keys())], sims
    sets = []
    for tweet in tweetset:
        added=False
        for s in sets:
            for t in s:
                # min clustering
                if sims[tweet][t] > similarity_cutoff:
                    s.add(tweet)
                    added=True # go on to the next tweet
                    break
            if added:
                break
        if not added:
            sets.append(set([tweet]))
    return sets, sims
        
    
    
# for t1, t2 in combinations([signature(v) for k, v in tweetset.items()], 2):
#     sim = similarity(t1, t2)
#     count = 0
#     if sim > 0:
#         print(t1.id, t2.id, sim)

# sets, similarities = similarity_sets(tweetset, k=10, hash_functions=hash_functions,
#                                      row_count=2, similarity_cutoff=0)
# print(max(tuple(map(len, sets))))

# some experimentation
try:
    df = pd.read_csv('q2_experimentation.csv', index_col=0)
except:
    hash_functions = [hash_function(i) for i in range(20)] # 50 hashing functions
    df = pd.DataFrame(columns=['k','row_count','cutoff','groups'])
    for k in range(2, 11):
        print('k={}'.format(k))
        k*= 3
        for row_count in range(1, 5):
            print('\trow_count={}'.format(row_count))
            prev_calc = None
            for similarity_cutoff in range(10):
                similarity_cutoff/=10
                sets, prev_calc = similarity_sets(tweetset, k=k, hash_functions=hash_functions,
                                             row_count=row_count, similarity_cutoff=similarity_cutoff,
                                             precalc_similarities=prev_calc)
                df = df.append({'k':float(k),
                                'row_count':float(row_count),
                                'cutoff':similarity_cutoff,
                                'groups':sets
                               }, ignore_index=True)
    df.to_csv('q2_experimentation.csv')
            

In [8]:
dfb = pd.read_csv('q2_experimentation.csv', index_col=0)
dfb['groups'] = list(map(eval, dfb['groups'].values))
dfb['group_size'] = list(map(lambda x:list(map(len, x)), dfb['groups']))
dfb['group_sim_count'] = list(map(lambda x:len([i for i in x if i>1]), dfb['group_size'].values))
dfb['max_group_size'] = list(map(lambda x:max(map(len, x)), dfb['groups'].values))
dfb = dfb[dfb['max_group_size'] < max(dfb['max_group_size'])]
print('below, we can see all arguments where some items were matched as similar, but not all.')
similars = dfb[dfb['group_sim_count'] > 0]
similars

below, we can see all arguments where some items were matched as similar, but not all.


Unnamed: 0,k,row_count,cutoff,groups,group_size,group_sim_count,max_group_size
0,6.0,1.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",12,5
10,6.0,2.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",9,3
20,6.0,3.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",9,3
40,9.0,1.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",9,4
50,9.0,2.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",8,3
60,9.0,3.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",8,3
80,12.0,1.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",8,3
90,12.0,2.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",7,3
100,12.0,3.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",7,3
120,15.0,1.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",10,3


In [9]:
for i, row in enumerate(similars.values):
    print('\n\n\n')
    print('k: {}\nrow count: {}\ngroup similarity count: {}\nmax group size: {}\n'
          .format(*row[:2], *row[-2:]))
    for s in row[3]:
        if len(s) > 1:
            print('{} similar tweets (if less are printed they were exact duplicates of another tweet)'.format(len(s)))
            tweets = set()
            for tweetid in s:
                tweets.add(tweetset[tweetid].body)
            for t in tweets:
                print(t)
            print()
    if i>3: # dont print them all.
        break





k: 6.0
row count: 1.0
group similarity count: 12
max group size: 5

2 similar tweets (if less are printed they were exact duplicates of another tweet)
Teaching kids to be triple threat https://t.co/PspNVzKiZJ
Obama meets with Asian leaders about North Korea nuke threat --&gt; https://t.co/8ksgGEeEpH

2 similar tweets (if less are printed they were exact duplicates of another tweet)
RT @FamousWomen: Love is always a demonstration - not just a set of words or a feeling.

2 similar tweets (if less are printed they were exact duplicates of another tweet)
Planned weekend trackwork, #AirportLine trains start &amp; end at Central due to City Circle trackwork https://t.co/L5H3pIJnAh
Planned weekend trackwork, #WesternLine trains start &amp; end at Central due to City Circle trackwork https://t.co/kwMSLrcPzb

3 similar tweets (if less are printed they were exact duplicates of another tweet)
RT @GMEAgency: Get the latest #VideoMarketing news-----&gt; https://t.co/IZofzmIBI5

2 similar tweets

In [10]:
#raise Exception # dont run this it will take ~5 minutes
tweetset = {t['actor']['id']:Tweet(t['body'], t['actor']['id']) for _, t in zip(range(50000), stream_tweets())}
k = 6
row_count = 3
sets, prevcalc = similarity_sets(tweetset, k=k, hash_functions=hash_functions, row_count=row_count, similarity_cutoff=0)

In [11]:
for s in sets:
    if len(s) > 1:
        print('{} similar tweets (if less are printed they were exact duplicates of another tweet)'.format(len(s)))
        tweets = set()
        for tweetid in s:
            tweets.add(tweetset[tweetid].body)
        for t in tweets:
            print(t)
        print()

2 similar tweets (if less are printed they were exact duplicates of another tweet)
Just posted a photo https://t.co/LCJ2dntIfM
Just posted a photo https://t.co/YPDEfyYNuo

12 similar tweets (if less are printed they were exact duplicates of another tweet)
RT @weeklystandard: A @greenpeaceusa activist thanked @HillaryClinton, then things went sour. https://t.co/5UNjxpks4d #imsosick  https://t.â¦

2 similar tweets (if less are printed they were exact duplicates of another tweet)
RT @Nero: DEATH THREAT  https://t.co/afoR7txmAS

2 similar tweets (if less are printed they were exact duplicates of another tweet)
RT @FamousWomen: Love is always a demonstration - not just a set of words or a feeling.

8 similar tweets (if less are printed they were exact duplicates of another tweet)
I just checked in at Western Union with #mPLUSPlaces Download today!  https://t.co/BljCYBnGFQ
I just checked in at Western Union with #mPLUSPlaces Download today!  https://t.co/UGOwIGLf5p
I just checked in at West

# answers:
1. What value of $k$ did you use to represent the $k$-shingles and why?

through experimentation, the best results were seen with smaller numbers of k, but greater then about 5. Greater numbers then 5 are used as hashes would have too many matches and the possible number of sets for a tweet gets reduced, meaning the ordering of the text has much less meaning, and the meaning is more in specifically what letters are contained. (imagine k=1, each tweets value set would just be single characters contained in it)


2. Did you hash the $k$-shingles and why?

I did hash the k-shingles as it reduced the runtime of the application by making each following hash on the hash less time consuming.


3. If you hashed the $k$-shingles, how many buckets did you use and why?

I did not limit the standard python hash function with a number of bins as to preserve uniqueness, at no cost.

4. How many hashes did you use for minhashing, how many buckets, and why?

I used 400 hashes for minhashing, and again i did not limit the number of buckets (all hashes in python are < 2**(chipset size of python install IE 32 or 64))


5. How many bands and rows did you use for locality-sensitive hashing and why?

I used bands of size \[1, 3\], as more just removed individuality between tweets. using 3 seemed to minimize runtime while maximizing similarities.

In [12]:
dfb[dfb['max_group_size'] > 1]

Unnamed: 0,k,row_count,cutoff,groups,group_size,group_sim_count,max_group_size
0,6.0,1.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",12,5
10,6.0,2.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",9,3
20,6.0,3.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",9,3
40,9.0,1.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",9,4
50,9.0,2.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",8,3
60,9.0,3.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",8,3
80,12.0,1.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",8,3
90,12.0,2.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",7,3
100,12.0,3.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",7,3
120,15.0,1.0,0.0,"[{id:twitter.com:18064228}, {id:twitter.com:18...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",10,3


## Task 3 (5 marks)
Implement a MapReduce version of PageRank **using combiners** as described in the lectures of week 9. The MapReduce version should incorporate teleporting with $\beta=0.85$. 

For this assignment we will use Python's built-in functions `map` and `reduce`. For example, the following code is a Python version that uses MapReduce to compute the sum of squares of the numbers in a list:

In [13]:
from functools import reduce
def my_square(x):
    return x**2

def my_sum(x,y):
    return x+y

my_list = [1, 2, 3, 4, 5]

def mapreduce(a_list):
    temp = map(my_square, a_list) # Note that map returns an iterator, not a list
    return reduce(my_sum, temp)

mapreduce(my_list)

55

In [14]:
1+2**2+3**2+4**2+5**2

55

The above Python code is not efficient and it does not take advantage of parallel computing units (feel free to search the Web for parallel versions) but it will serve for this assignment. 

Note that `map` returns a Python iterator and not a list and there are operations that cannot be performed on it. For example, you cannot select a slice or compute the length:

In [15]:
temp = map(my_square, my_list)
print(len(temp))

TypeError: object of type 'map' has no len()

In [None]:
temp = map(my_square, my_list)
print(temp[0:2])

For this assignment, use an artificially generated network such as the one used in the workshop of week 9. The code is:

In [None]:
def generate_network(n, sparsity):
    "Return a transition matrix with n nodes"
    # Fill the matrix
    result = np.zeros((n,n))
    for i in range(int(n*n - n*n*sparsity)):
        x = np.random.randint(n)
        y = np.random.randint(n)
        result[x,y] = 1
        
    # Normalise the results
    for c in range(n):
        degree = np.sum(result[:, c])
        if degree > 0:
            result[:, c] /= degree
    return result

In [None]:
generate_network(5,0.7)

* In your demonstration, generate a network with 20 nodes and compute the PageRank of each node. 
* Your solution must include a graph that shows how the PageRank changes at each iteration.
* Do not attempt to remove dead ends (to simplify this exercise).
* What size of blocks did you use for your solution?

In [None]:
M = generate_network(20, 0.7)

The following code does not use MapReduce (it's based on the lecture notebook). Use it for your reference.

In [None]:


epsillon = 0.0001
beta = 0.85
page_count = 20
M = generate_network(page_count, 0.7)
PR = np.ones((page_count, 1)) / page_count
iterations = 0
oldPR = np.zeros((page_count,1))
allPR = [PR]
while max(np.abs(oldPR-PR)) > epsillon:
    oldPR = PR
    PR = beta*(np.dot(M, PR)) + (1-beta)/page_count*np.ones((page_count,1))
    allPR.append(PR)
    iterations += 1
print("PR after %i iterations:" % iterations)
print(PR)
print(sum(PR.T[0]))
for p in range(page_count):
    data = [onePR[p,0] for onePR in allPR]
    plt.plot(data)
plt.xlabel("Iterations")
plt.ylabel("PageRank")
plt.show()

Write the solution with MapReduce below.

In [None]:
Constants = namedtuple('Constants', ['epsillon','beta', 'page_count'])
c = Constants(epsillon, beta, len(M))
from itertools import chain

def tuplify(f):
    def r(*x, **xx):
        return tuple(f(*x, **xx))
    return r

# @tuplify
def map_pagerank(key_value, constants=c):
    key, value = key_value
    # input------------------------------------
    # key: (page, pagerank)
    # value: itterable of outgoing links
    
    # NOTE TO THE MARKER:
    # the time is 11:43 and I have to leave for work in 10 minutes.
    # I did not properly read the question so I am going to leave the dead-end
    # removal in as I have already written it, and I dont have time to change and test again.
    for v in value: 
        yield v, ((key[1] / len(value)) * (1-constants.beta), ()) # transversal 
    yield (key[0],
    (constants.beta / constants.page_count, value)) # teleportation
    # output-----------------------------------
    # key: page
    # value: (pagerank, outgoing links)
    

# @tuplify
def reduce_pagerank(key_values, constants=c):
    key, values = key_values
    # input-----------------------------------
    # key: page
    # values: itterable of (pagerank, outgoing links)s
    outlinks = []
    pagerank = 0
    for pr, ol in values:
        outlinks += list(ol)
        pagerank += pr
    yield (key, pagerank), tuple(outlinks)
    # output-----------------------------------
    # key: (page, pagerank)
    # value: tuple of outgoing links


# @tuplify
def prepare_reduce(key_value_pairs, constants=c):
    # function takes all key value pairs and groups them by key.
    # this returns the format:
    # key : itterable of values that had that key
    rval = {}
    for key, value in key_value_pairs:
        if key not in rval:
            rval[key] = set()
        rval[key].add(value)
    for key in rval:
        yield key, tuple(rval[key])


# @tuplify
def prepare_map(network_matrix):
    for pagenumber, row in enumerate(network_matrix):
        yield (pagenumber, 1/len(network_matrix)), tuple(i for i in range(len(network_matrix)) if row[i] != 0)

        
# @tuplify
def mapchain(function, args):
    return chain(*map(function, args)) # chains all values returned by function together.


@tuplify # left this here to calculate epsillon (otherwise generators will finish
# in calculating epsillon, and wont be available to calculate the next itteration of mapreduce)
def mapreduce(values, map_function=map_pagerank, reduce_function=reduce_pagerank, constants=c):
    d = values
    d = mapchain(lambda x:map_pagerank(x, constants=c), d)
    d = prepare_reduce(d)
    d = mapchain(lambda x:reduce_pagerank(x, constants=c), d)
    return d

In [None]:
M = generate_network(page_count, .7)
d = tuple(prepare_map(M))
old = np.array(tuple(map(lambda x:x[0][1], sorted(d, key=lambda x:x[0][0]))))
maxdiff = 1
count = 0
all_ranks = [old]
c = Constants(.0001, .85, len(M))
while maxdiff > c.epsillon:
    d = mapreduce(d, )
    current = np.array(tuple(map(lambda x:x[0][1], sorted(d, key=lambda x:x[0][0]))))
    maxdiff = max(np.abs(current - old))
    old = current
    all_ranks.append(current)
    count += 1
all_ranks = np.array(all_ranks)
print('PR after {} itterations:'.format(count))
print(np.array([current]).T)
print(sum(all_ranks[0]))
plt.plot(range(len(all_ranks)),all_ranks)
plt.show()
# epsylon beta page_count M 