# Solution to the "Stalker Challenge" - By Juan C. Alvarez

## Problem Statement

Use the dataset here: https://snap.stanford.edu/data/loc-gowalla.html

* Assume a "stalker" is someone who, in this dataset, visits some of the same locations as another person, after the other person goes to that location.
* A "stalker score" for a pair of people, A & B, is the number of locations for which A has visited a location followed by B visiting that same location in the
future.
* Any given location should be counted once in the score, so a stalker score can never be higher than the number of unique locations that A and B have
in common.

Use the datasets from the web page above to answer the following questions:

1. Which friend pair has the highest "stalker score"?
2. Which non-friend pair has the highest "stalker score"?

You can use any tools you want to solve this puzzle, except asking for help from other people. Please feel free to email at any time for any clarifications.

Please give the winning user id pairs and “stalker score” for each question, and please explain your solution methods, including any source code if you wrote any.

**Ps. We give points for "pure" solutions, so no additional libraries. The more complex the additional libraries are the more points we deduct**

## 1. Data Exploration

For starters, we will obtain, uncompress, explore, and describe the contents of the data files:

In [31]:
DATA_FOLDER = './data/'
EDGES_URL = 'https://snap.stanford.edu/data/loc-gowalla_edges.txt.gz'
EDGES_COMP_FILE = 'loc-gowalla_edges.txt.gz'
EDGES_FILE = 'loc-gowalla_edges.txt'
CHECKINS_URL = 'https://snap.stanford.edu/data/loc-gowalla_totalCheckins.txt.gz'
CHECKINS_COMP_FILE = 'loc-gowalla_totalCheckins.txt.gz'
CHECKINS_FILE = 'loc-gowalla_totalCheckins.txt'

TEST_FOLDER = './data/'
TEST_EDGES_FILE = 'edgestest.txt'
TEST_CHECKINS_FILE = 'checkinstest.txt'

In [4]:
!echo 'Obtaining edges file'
!wget $EDGES_URL -P $DATA_FOLDER

Obtaining edges file
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/jcalvarezj/.wget-hsts'. HSTS will be disabled.
--2021-02-25 16:40:26--  https://snap.stanford.edu/data/loc-gowalla_edges.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6351523 (6.1M) [application/x-gzip]
Saving to: ‘./data/loc-gowalla_edges.txt.gz’


2021-02-25 16:40:35 (686 KB/s) - ‘./data/loc-gowalla_edges.txt.gz’ saved [6351523/6351523]



In [5]:
!echo 'Obtaining Check-ins file'
!wget $CHECKINS_URL -P $DATA_FOLDER

Obtaining Check-ins file
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/jcalvarezj/.wget-hsts'. HSTS will be disabled.
--2021-02-25 16:40:36--  https://snap.stanford.edu/data/loc-gowalla_totalCheckins.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 105470044 (101M) [application/x-gzip]
Saving to: ‘./data/loc-gowalla_totalCheckins.txt.gz’


2021-02-25 16:41:35 (1.72 MB/s) - ‘./data/loc-gowalla_totalCheckins.txt.gz’ saved [105470044/105470044]



In [6]:
!echo 'Uncompressing both files'
!gzip -dkv {DATA_FOLDER + EDGES_COMP_FILE}
!gzip -dkv {DATA_FOLDER + CHECKINS_COMP_FILE}

Uncompressing both files
./data/loc-gowalla_edges.txt.gz:	 71.3% -- replaced with ./data/loc-gowalla_edges.txt
./data/loc-gowalla_totalCheckins.txt.gz:	 73.3% -- replaced with ./data/loc-gowalla_totalCheckins.txt


In [7]:
!head -n20 {DATA_FOLDER + EDGES_FILE}
!echo '-----------------------------------------------'
!head -n20 {DATA_FOLDER + CHECKINS_FILE}

0	1
0	2
0	3
0	4
0	5
0	6
0	7
0	8
0	9
0	10
0	11
0	12
0	13
0	14
0	15
0	16
0	17
0	18
0	19
0	20
-----------------------------------------------
0	2010-10-19T23:55:27Z	30.2359091167	-97.7951395833	22847
0	2010-10-18T22:17:43Z	30.2691029532	-97.7493953705	420315
0	2010-10-17T23:42:03Z	30.2557309927	-97.7633857727	316637
0	2010-10-17T19:26:05Z	30.2634181234	-97.7575966669	16516
0	2010-10-16T18:50:42Z	30.2742918584	-97.7405226231	5535878
0	2010-10-12T23:58:03Z	30.261599404	-97.7585805953	15372
0	2010-10-12T22:02:11Z	30.2679095833	-97.7493124167	21714
0	2010-10-12T19:44:40Z	30.2691029532	-97.7493953705	420315
0	2010-10-12T15:57:20Z	30.2811204101	-97.7452111244	153505
0	2010-10-12T15:19:03Z	30.2691029532	-97.7493953705	420315
0	2010-10-12T00:21:28Z	40.6438845363	-73.7828063965	23261
0	2010-10-11T20:21:20Z	40.74137425	-73.9881052167	16907
0	2010-10-11T20:20:42Z	40.741388197	-73.9894545078	12973
0	2010-10-11T00:06:30Z	40.7249103345	-73.9946207517	341255
0	2010-10-10T22:00:37Z	40.729768314	-73.99853

The data looks well-formed and has the following format:

- Edges file: userId (int) and friendId (int), separated by a tab
- Check-ins file: userId (int), check-in time (ISO 8601 timestamp in UTC), latitude (float), longitude (float), and locationId (int), all separated by tabs

To make the timestamps easy to handle, the "Z" suffix is converted to "+00:00" when each record is loaded.
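
As a minimal sketch of why the suffix is replaced: `datetime.fromisoformat` only accepts a trailing "Z" from Python 3.11 onwards, while the explicit "+00:00" offset is accepted by earlier versions too. The helper name `parse_checkin_time` is hypothetical and is not part of the solution code.

```python
from datetime import datetime, timedelta

def parse_checkin_time(raw):
    """Parse a check-in timestamp such as '2010-10-19T23:55:27Z'."""
    # Replace the 'Z' suffix with the equivalent explicit UTC offset,
    # which fromisoformat() also accepts before Python 3.11.
    return datetime.fromisoformat(raw.replace('Z', '+00:00'))

earlier = parse_checkin_time('2010-10-18T22:17:43Z')
later = parse_checkin_time('2010-10-19T23:55:27Z')

print(earlier < later)                      # chronological comparison
print(later.utcoffset() == timedelta(0))    # the value is timezone-aware (UTC)
```

Because every timestamp in the file uses the same fixed-width UTC format, comparing the raw strings gives the same ordering as comparing the parsed values.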

## 2. Data Structures

Adjacency-list graphs are used to solve the problem. One graph keeps track of the stalker associations together with the places involved, and a second graph keeps track of friendships, so that friendship can be queried easily when answering the questions.

The graphs have the following structure:

- A dictionary of associations (friends or stalkers). For example:

`associations = {
    <node1>: [<list of associated nodes>],
    ...,
    <nodeN>: [<list of associated nodes>]
}`

- A dictionary of "weights" for edges. Each key is an edge, represented as a tuple of two nodes, and refers to a stalker association: ('a', 'b') means 'a' is stalked by 'b'. Each pair is mapped to the set of visited locations (a set is used so that every visited location is counted only once)

`weights = {
    <pair of nodes 1>: {<set of weights>},
    ...,
    <pair of nodes M>: {<set of weights>}
}`

Here, the weights are the places that both people visited (for the graph of stalkers). They are stored as a set so that each place is included only once.
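
The two dictionaries can be illustrated with plain Python dicts and sets. This is only a sketch with hypothetical toy ids (users 1 to 3, locations 10 and 20); the helper names `add_association` and `add_weight` are chosen for illustration, and sets are assumed for the associated nodes as well.

```python
# Hypothetical toy data, not taken from the Gowalla files.
associations = {}   # node -> set of associated nodes
weights = {}        # (stalked, stalker) -> set of location ids

def add_association(start, end):
    # setdefault creates the empty set the first time a node is seen
    associations.setdefault(start, set()).add(end)

def add_weight(start, end, location_id):
    # The pair (start, end) reads as "start is stalked by end"
    weights.setdefault((start, end), set()).add(location_id)

add_association(1, 3)
add_association(3, 1)
add_weight(1, 2, 10)
add_weight(1, 2, 10)   # a repeated location is stored only once
add_weight(1, 2, 20)

print(associations)
print(weights)
print(len(weights[(1, 2)]))   # the stalker score of the pair (1, 2)
```

The stalker score of a pair is simply the size of its set, which is why a location can never be counted twice.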

In [15]:
class Graph:
    """
    This class represents the associations between people
    """
    def __init__(self):
        """
        Constructor initializer
        """
        self.relations = {}
        self.weights = {}

    def add_relation(self, start_node, end_node):
        """
        Adds a relation between two nodes
        """
        # setdefault creates the set on first use, then the node is added
        self.relations.setdefault(start_node, set()).add(end_node)

    def add_weight(self, start_node, end_node, weight):
        """
        Adds a weight for a couple of nodes' association
        """
        pair = start_node, end_node
        self.weights.setdefault(pair, set()).add(weight)

    def __str__(self):
        return f'Graph with nodes:\n{self.relations}\nAnd weights:\n{self.weights}'
        

Each line of the edges file represents one association between two people, so each line is split on the tab character and the resulting pair is added to the friendships graph.

In [47]:
def read_friendship_graph(friends_file):
    """
    Reads the edges (friendships) file and returns a graph with the associations
    """
    friendship_graph = Graph()

    with open(friends_file) as edges:
        for line in edges:
            user, friend = line.split('\t')
            user = int(user)
            friend = int(friend.replace('\n', ''))
            friendship_graph.add_relation(user, friend)

    return friendship_graph

As records are read from the check-ins file, the following steps are performed:

1. A dictionary holds the list of visits seen so far for each location id. It serves as the comparison point for each new record.

2. Each new record is compared with the visits already recorded for the same location id, and the result is added to the stalkers graph (the visit with the later timestamp belongs to the stalker). For simplicity, only the weights structure is used.

In [43]:
from datetime import datetime

def read_stalkers_graph(checkins_file):
    """
    Reads the check-ins file and returns a graph with the associations between people
    that mean stalking (weighted as a set of involved location ids)
    """
    stalkers_graph = Graph()
    visit_records = {}  # location_id -> list of (user_id, checkin_time) seen so far

    with open(checkins_file) as checkins:
        for line in checkins:
            user_id, checkin_time, _, _, location_id = line.split('\t')
            user_id = int(user_id)
            # All timestamps share one fixed-width UTC format, so comparing
            # them as strings orders them chronologically
            checkin_time = checkin_time.replace('Z', '+00:00')
            location_id = int(location_id)  # int() ignores the trailing newline

            for visitor, visit_time in visit_records.get(location_id, []):
                if visitor == user_id:
                    continue  # repeated visits by the same person are not stalking
                if checkin_time < visit_time:
                    stalkers_graph.add_weight(user_id, visitor, location_id)
                elif checkin_time > visit_time:
                    stalkers_graph.add_weight(visitor, user_id, location_id)

            visit_records.setdefault(location_id, []).append((user_id, checkin_time))

    return stalkers_graph

Graph with nodes:
{}
And weights:
{(1, 2): {1, 6}, (1, 3): {2}, (2, 3): {4}, (2, 1): {5}, (2, 4): {3}}
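
`read_stalkers_graph` compares every new check-in with all earlier check-ins at the same location, which may become slow for very popular locations. One possible alternative, sketched here and not part of the original solution, keeps only each user's earliest and latest visit per location: A is stalked by B at a location exactly when A's earliest visit is strictly before B's latest visit. The function name `stalker_locations` and the sample check-ins are hypothetical.

```python
def stalker_locations(checkins):
    """
    checkins: iterable of (user_id, iso_timestamp, location_id) tuples.
    Returns {(a, b): set of location ids that a visited strictly before b}.
    """
    # location_id -> user_id -> (earliest, latest) timestamp seen so far
    bounds = {}
    for user_id, timestamp, location_id in checkins:
        per_user = bounds.setdefault(location_id, {})
        if user_id in per_user:
            first, last = per_user[user_id]
            per_user[user_id] = (min(first, timestamp), max(last, timestamp))
        else:
            per_user[user_id] = (timestamp, timestamp)

    weights = {}
    for location_id, per_user in bounds.items():
        for a in per_user:
            for b in per_user:
                if a == b:
                    continue
                # a is stalked by b if a's earliest visit precedes b's latest
                if per_user[a][0] < per_user[b][1]:
                    weights.setdefault((a, b), set()).add(location_id)
    return weights

# Hypothetical check-ins; same-format UTC timestamps compare correctly as strings
sample = [
    (1, '2010-10-01T10:00:00Z', 100),
    (2, '2010-10-02T10:00:00Z', 100),
    (1, '2010-10-03T10:00:00Z', 100),
    (2, '2010-10-01T09:00:00Z', 200),
    (3, '2010-10-05T09:00:00Z', 200),
]
print(stalker_locations(sample))
```

This sketch is still quadratic in the number of distinct users per location, but no longer in the number of check-ins, and it uses only built-in types.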


## 3. Answering questions

### 1. Which friend pair has the highest "stalker score"?

In [49]:
def compute_most_stalking_friend():
    """
    Answers the first question by calculating which stalker pair has the highest score
    for a pair of people who are friends with each other
    """
    highest_friend_stalker = (None, 0)
    stalkers_graph = read_stalkers_graph(f'{DATA_FOLDER}{CHECKINS_FILE}')
    friendship_graph = read_friendship_graph(f'{DATA_FOLDER}{EDGES_FILE}')
    stalking_dict = stalkers_graph.weights

    for pair, locations in stalking_dict.items():
        stalking_score = len(locations)

        # .get(...) avoids a KeyError for users that have no friendship edges
        if stalking_score > highest_friend_stalker[1] and pair[0] in friendship_graph.relations.get(pair[1], set()):
            highest_friend_stalker = (pair, stalking_score)

    return highest_friend_stalker

print(compute_most_stalking_friend())

((1, 3), 1)


According to the output above, the pair of friends with the highest stalker score is (1, 3), with a score of 1 location.

### 2. Which non-friend pair has the highest "stalker score"?

In [52]:
def compute_most_stalking_nonfriend():
    """
    Answers the second question by calculating which stalker pair has the highest score
    for a pair of people who are not friends to each other
    """
    stalkers_graph = read_stalkers_graph(f'{DATA_FOLDER}{CHECKINS_FILE}')
    friendship_graph = read_friendship_graph(f'{DATA_FOLDER}{EDGES_FILE}')

    highest_nonfriend_stalker = (None, 0)
    stalking_dict = stalkers_graph.weights

    for pair, locations in stalking_dict.items():
        stalking_score = len(locations)

        # .get(...) avoids a KeyError for users that have no friendship edges
        if stalking_score > highest_nonfriend_stalker[1] and pair[0] not in friendship_graph.relations.get(pair[1], set()):
            highest_nonfriend_stalker = (pair, stalking_score)

    return highest_nonfriend_stalker

print(compute_most_stalking_nonfriend())

((1, 2), 2)


According to the output above, the pair of non-friends with the highest stalker score is (1, 2), with a score of 2 locations.
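
Both questions scan the same weights dictionary and differ only in the friendship check, so they could be answered in a single pass. The following is a minimal sketch under that assumption; the function name `highest_scores` is hypothetical, and the toy values mirror the expected associations listed in the unit test section.

```python
def highest_scores(weights, relations):
    """
    Returns the best friend pair and the best non-friend pair,
    each as ((stalked, stalker), score).
    """
    best_friend = (None, 0)
    best_nonfriend = (None, 0)

    for (stalked, stalker), locations in weights.items():
        score = len(locations)
        # .get(...) treats users without friendship edges as having no friends
        are_friends = stalked in relations.get(stalker, set())

        if are_friends and score > best_friend[1]:
            best_friend = ((stalked, stalker), score)
        elif not are_friends and score > best_nonfriend[1]:
            best_nonfriend = ((stalked, stalker), score)

    return best_friend, best_nonfriend

# Toy data mirroring the associations expected by the unit tests
weights = {(1, 2): {1, 6}, (1, 3): {2}, (2, 3): {4}, (2, 1): {5}, (2, 4): {3}}
relations = {1: {3}, 2: {4}, 3: {1}, 4: {2}}

print(highest_scores(weights, relations))
# prints (((1, 3), 1), ((1, 2), 2))
```

As in the two functions above, ties are resolved in favour of the first pair encountered, because only a strictly higher score replaces the current best.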

## \[Extra\]. Unit tests

The tests run against the input files in the test directory. The following associations are expected from those files:

Stalking Graph with Pairs **(user_id_i, user_id_j): {location_id_1, ..., location_id_N}**

> (1, 2): {1, 6}  
> (1, 3): {2}  
> (2, 3): {4}  
> (2, 4): {3}  
> (2, 1): {5}  

Friendship Graph with elements **user_id_i: {user_id_j, ..., user_id_M}**

> 1: {3}  
> 2: {4}  
> 3: {1}  
> 4: {2}  

In [53]:
import unittest

class SolutionTest(unittest.TestCase):
    """
    Class to execute unit tests of the solution from test cases at './test'
    """
    def setUp(self):
        """
        Unit test initialization
        """
        self.friends_graph = Graph()
        self.friends_graph.add_relation(1, 3)
        self.friends_graph.add_relation(2, 4)
        self.friends_graph.add_relation(3, 1)
        self.friends_graph.add_relation(4, 2)

        self.stalkers_graph = Graph()
        self.stalkers_graph.add_weight(1, 2, 1)
        self.stalkers_graph.add_weight(1, 2, 6)
        self.stalkers_graph.add_weight(1, 3, 2)
        self.stalkers_graph.add_weight(2, 3, 4)
        self.stalkers_graph.add_weight(2, 4, 3)
        self.stalkers_graph.add_weight(2, 1, 5)

        self.highest_stalker_friend = ((1, 3), 1)
        self.highest_stalker_nonfriend = ((1, 2), 2)

    def tearDown(self):
        """
        Unit test clean-up
        """
        del self.friends_graph
        del self.stalkers_graph

    def test_compute_friends_graph(self):
        """
        Test for the read_friendship_graph(...) method
        """
        computed_friends_graph = read_friendship_graph(f'{TEST_FOLDER}{TEST_EDGES_FILE}')
        self.assertEqual(self.friends_graph.relations, computed_friends_graph.relations)

    def test_compute_stalkers_graph(self):
        """
        Test for the read_stalkers_graph(...) method
        """
        computed_stalkers_graph = read_stalkers_graph(f'{TEST_FOLDER}{TEST_CHECKINS_FILE}')
        self.assertEqual(self.stalkers_graph.weights, computed_stalkers_graph.weights)

    def test_compute_most_stalking_friend(self):
        """
        Test for the compute_most_stalking_friend(...) method
        """
        computed_highest_stalker_friend = compute_most_stalking_friend()
        self.assertEqual(self.highest_stalker_friend, computed_highest_stalker_friend)

    def test_compute_most_stalking_nonfriend(self):
        """
        Test for the compute_most_stalking_nonfriend(...) method
        """
        computed_highest_stalker_nonfriend = compute_most_stalking_nonfriend()
        self.assertEqual(self.highest_stalker_nonfriend, computed_highest_stalker_nonfriend)


unittest.main(argv=[''], exit=False)

....
----------------------------------------------------------------------
Ran 4 tests in 0.020s

OK


<unittest.main.TestProgram at 0x7fa96c896040>