# Solution to the "Stalker Challenge" - By Juan C. Alvarez

**Notice:** This solution uses the _dataclass_ decorator, so Python 3.7 or later is required

## Problem Statement

Use the dataset here: https://snap.stanford.edu/data/loc-gowalla.html

* Assume a "stalker" is someone who, in this dataset, visits some of the same locations as another person, after the other person goes to that location.
* A "stalker score" for a pair of people, A & B, is the number of locations for which A has visited a location followed by B visiting that same location in the
future.
* Any given location should be counted once in the score, so a stalker score can never be higher than the number of unique locations that A and B have
in common.
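
As a minimal sketch of one reading of this definition: a location counts toward the score of the ordered pair (A, B) when at least one visit by B happens strictly after at least one visit by A. That is equivalent to comparing A's earliest visit with B's latest visit at that location. The function name `stalker_score` and the `(location_id, checkin_time)` tuple layout are assumptions made for illustration only; they are not taken from the challenge text.

```python
from datetime import datetime


def stalker_score(checkins_a, checkins_b):
    """Count the locations that A visited before B visited them.

    Each argument is an iterable of (location_id, checkin_time) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # Earliest visit of A at each location
    first_a = {}
    for loc, when in checkins_a:
        if loc not in first_a or when < first_a[loc]:
            first_a[loc] = when

    # Latest visit of B at each location
    last_b = {}
    for loc, when in checkins_b:
        if loc not in last_b or when > last_b[loc]:
            last_b[loc] = when

    # A location counts once if some visit of B is later than some visit of A
    return sum(
        1 for loc, when in first_a.items()
        if loc in last_b and when < last_b[loc]
    )


# Toy data, not taken from the dataset
a = [(1, datetime(2010, 1, 1)), (2, datetime(2010, 1, 5)), (1, datetime(2010, 1, 9))]
b = [(1, datetime(2010, 1, 3)), (2, datetime(2010, 1, 4)), (3, datetime(2010, 1, 6))]

score_ab = stalker_score(a, b)  # B follows A at location 1 only
score_ba = stalker_score(b, a)  # A follows B at locations 1 and 2
```

Under this reading the score depends on the order of the pair, so `stalker_score(a, b)` and `stalker_score(b, a)` can differ.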

Use the datasets from the web page above to answer the following questions:

1. Which friend pair has the highest "stalker score"?
2. Which non-friend pair has the highest "stalker score"?

You can use any tools you want to solve this puzzle, except asking for help from other people. Please feel free to email at any time for any clarifications.

Please give the winning user id pairs and “stalker score” for each question, and please explain your solution methods, including any source code if you wrote any.

**Ps. We give points for "pure" solutions, so no additional libraries. The more complex the additional libraries are the more points we deduct**

## 1. Data Exploration

For starters, we will obtain, uncompress, explore, and describe the contents of the data files:

In [1]:
DATA_FOLDER = './data/'
EDGES_URL = 'https://snap.stanford.edu/data/loc-gowalla_edges.txt.gz'
EDGES_COMP_FILE = 'loc-gowalla_edges.txt.gz'
EDGES_FILE = 'loc-gowalla_edges.txt'
CHECKINS_URL = 'https://snap.stanford.edu/data/loc-gowalla_totalCheckins.txt.gz'
CHECKINS_COMP_FILE = 'loc-gowalla_totalCheckins.txt.gz'
CHECKINS_FILE = 'loc-gowalla_totalCheckins.txt'

In [7]:
!echo 'Obtaining edges file'
!wget $EDGES_URL -P $DATA_FOLDER

Obtaining edges file
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/jcalvarezj/.wget-hsts'. HSTS will be disabled.
--2021-02-24 18:15:44--  https://snap.stanford.edu/data/loc-gowalla_edges.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6351523 (6.1M) [application/x-gzip]
Saving to: ‘./data/loc-gowalla_edges.txt.gz’


2021-02-24 18:15:52 (827 KB/s) - ‘./data/loc-gowalla_edges.txt.gz’ saved [6351523/6351523]



In [8]:
!echo 'Obtaining Check-ins file'
!wget $CHECKINS_URL -P $DATA_FOLDER

Obtaining Check-ins file
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/jcalvarezj/.wget-hsts'. HSTS will be disabled.
--2021-02-24 18:15:55--  https://snap.stanford.edu/data/loc-gowalla_totalCheckins.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 105470044 (101M) [application/x-gzip]
Saving to: ‘./data/loc-gowalla_totalCheckins.txt.gz’


2021-02-24 18:16:49 (1.89 MB/s) - ‘./data/loc-gowalla_totalCheckins.txt.gz’ saved [105470044/105470044]



In [10]:
!echo 'Uncompressing both files'
!gzip -dkv {DATA_FOLDER + EDGES_COMP_FILE}
!gzip -dkv {DATA_FOLDER + CHECKINS_COMP_FILE}

Uncompressing both files
./data/loc-gowalla_edges.txt.gz:	 71.3% -- replaced with ./data/loc-gowalla_edges.txt
./data/loc-gowalla_totalCheckins.txt.gz:	 73.3% -- replaced with ./data/loc-gowalla_totalCheckins.txt
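
Since the challenge rewards solutions without extra tooling, the download and decompression steps could also be done with the standard library alone. The following is only a sketch: the helper names `fetch` and `gunzip_keep` are hypothetical, and `fetch` is defined but not called here.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def fetch(url, folder):
    """Download url into folder and return the local path (hypothetical helper)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / url.rsplit('/', 1)[-1]
    urllib.request.urlretrieve(url, str(target))
    return target


def gunzip_keep(src, dest=None):
    """Decompress a .gz file next to itself, keeping the compressed file."""
    src = Path(src)
    # 'name.txt.gz' -> 'name.txt' when no destination is given
    dest = Path(dest) if dest else src.with_suffix('')
    with gzip.open(src, 'rb') as fin, open(dest, 'wb') as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

For example, `gunzip_keep(fetch(EDGES_URL, DATA_FOLDER))` would mirror the `wget` and `gzip -dk` cells above, assuming network access.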


In [11]:
!head -n20 {DATA_FOLDER + EDGES_FILE}
!echo '-----------------------------------------------'
!head -n20 {DATA_FOLDER + CHECKINS_FILE}

0	1
0	2
0	3
0	4
0	5
0	6
0	7
0	8
0	9
0	10
0	11
0	12
0	13
0	14
0	15
0	16
0	17
0	18
0	19
0	20
-----------------------------------------------
0	2010-10-19T23:55:27Z	30.2359091167	-97.7951395833	22847
0	2010-10-18T22:17:43Z	30.2691029532	-97.7493953705	420315
0	2010-10-17T23:42:03Z	30.2557309927	-97.7633857727	316637
0	2010-10-17T19:26:05Z	30.2634181234	-97.7575966669	16516
0	2010-10-16T18:50:42Z	30.2742918584	-97.7405226231	5535878
0	2010-10-12T23:58:03Z	30.261599404	-97.7585805953	15372
0	2010-10-12T22:02:11Z	30.2679095833	-97.7493124167	21714
0	2010-10-12T19:44:40Z	30.2691029532	-97.7493953705	420315
0	2010-10-12T15:57:20Z	30.2811204101	-97.7452111244	153505
0	2010-10-12T15:19:03Z	30.2691029532	-97.7493953705	420315
0	2010-10-12T00:21:28Z	40.6438845363	-73.7828063965	23261
0	2010-10-11T20:21:20Z	40.74137425	-73.9881052167	16907
0	2010-10-11T20:20:42Z	40.741388197	-73.9894545078	12973
0	2010-10-11T00:06:30Z	40.7249103345	-73.9946207517	341255
0	2010-10-10T22:00:37Z	40.729768314	-73.99853

The data seems to be correct and has the following format:

- Edges file: userId (int) and friendId (int), separated by a tab
- Check-ins file: userId (int), check-in time (ISO 8601 timestamp in UTC), latitude (float), longitude (float), and locationId (int), all separated by tabs

To handle timestamps easily, the "Z" suffix will be converted to "+00:00" when loading, since `datetime.fromisoformat` does not accept the "Z" suffix before Python 3.11
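
A minimal sketch of that conversion, applied to one line of the check-ins file, could look as follows. The function names `parse_checkin_time` and `parse_checkin_line` are assumptions made for illustration, and the sample line is the first check-in shown in the output above.

```python
from datetime import datetime, timezone


def parse_checkin_time(raw):
    """Parse a timestamp such as '2010-10-19T23:55:27Z' into an aware datetime."""
    # Replace the 'Z' (UTC) suffix with an explicit offset that
    # datetime.fromisoformat understands on Python 3.7+
    return datetime.fromisoformat(raw.replace('Z', '+00:00'))


def parse_checkin_line(line):
    """Split one tab-separated check-in line into typed fields."""
    user_id, raw_time, lat, lon, location_id = line.rstrip('\n').split('\t')
    return (int(user_id), parse_checkin_time(raw_time),
            float(lat), float(lon), int(location_id))


sample = '0\t2010-10-19T23:55:27Z\t30.2359091167\t-97.7951395833\t22847\n'
record = parse_checkin_line(sample)
```

The resulting datetime is timezone-aware, so check-in times can be compared directly.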

## 2. Data Structures

Two adjacency-list graphs will be implemented and used to solve the problem. One graph keeps track of the associations among stalkers together with the places they visited; the other keeps track of friendships, so that it can be queried easily when answering the second question.

The graphs have the following structure:

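- A dictionary of associations (the `relations` attribute in the class below), mapping each node to the set of nodes it is associated with. For example:

`associations = {
    <node1>: {<set of associated nodes>},
    ...,
    <nodeN>: {<set of associated nodes>}
}`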

- A dictionary of weights for edges. An edge is an unordered pair of nodes, so {a, b} == {b, a} (the graph is undirected). A plain `set` is not hashable, so the pair needs a hashable type such as `frozenset` to be usable as a dictionary key

`weights = {
    <pair of nodes 1>: {<set of weights>},
    ...,
    <pair of nodes M>: {<set of weights>}
}`

Here a weight is the ID of a place that both people visited. Weights are kept in a set so that each place is included only once
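
The weights dictionary can be sketched in isolation as shown below. The standalone function `add_weight` and the sample IDs are made up for illustration; the point is that `frozenset` keys make (a, b) and (b, a) refer to the same entry.

```python
def add_weight(weights, a, b, location_id):
    """Record that the unordered pair {a, b} shares location_id."""
    # frozenset is hashable, and frozenset((a, b)) == frozenset((b, a))
    pair = frozenset((a, b))
    # setdefault creates the set of locations the first time a pair is seen
    weights.setdefault(pair, set()).add(location_id)


weights = {}
add_weight(weights, 1, 2, 100)
add_weight(weights, 2, 1, 100)  # same pair, same place: nothing new is stored
add_weight(weights, 2, 1, 200)  # same pair, new place
```

After these calls the dictionary holds a single key for the pair, with both location IDs in its set.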

In [45]:
from datetime import datetime
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class CheckinRecord:
    """
    This class represents a Check-In record from the dataset
    """
    user_id: int
    checkin_time: datetime
    latitude: float
    longitude: float
    location_id: int


class Graph:
    """
    This class represents the relations between people
    """

    def __init__(self):
        """
        Initializes empty relation and weight dictionaries
        """
        self.relations = {}
        self.weights = {}

    def add_relation(self, start_node, end_node):
        """
        Adds a relation between two nodes
        """
        # setdefault creates the set on first use, then the node is added to it
        self.relations.setdefault(start_node, set()).add(end_node)

    def add_weight(self, start_node, end_node, weight):
        """
        Adds a weight relating to a couple of nodes
        """
        # A plain set is unhashable and cannot be a dict key, so use a frozenset;
        # frozenset((a, b)) == frozenset((b, a)), which keeps the pair unordered
        pair = frozenset((start_node, end_node))
        self.weights.setdefault(pair, set()).add(weight)

    def __str__(self):
        return f'Graph with nodes:\n{self.relations}\nAnd weights:\n{self.weights}'
        

As each line of the edges file represents one association between two people, each line is split on the tab character and the resulting pair is added to the friendships graph. The cell below reads only the first 20 lines as a quick check.

In [48]:
friendship_graph = Graph()

with open(f'{DATA_FOLDER}{EDGES_FILE}') as edges:
    # Read only the first 20 lines as a quick sanity check
    lines = [next(edges) for _ in range(20)]

    for line in lines:
        # Each line is "<userId>\t<friendId>\n"; IDs are kept as strings here
        user, friend = line.rstrip('\n').split('\t')
        friendship_graph.add_relation(user, friend)

    print(friendship_graph)


Graph with nodes:
{'0': {'16', '8', '15', '12', '3', '9', '19', '13', '18', '14', '5', '7', '4', '20', '2', '10', '6', '17', '1', '11'}}
And weights:
{}
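
With the relations stored as a dictionary of sets, a friendship check is a pair of set lookups. The helper `are_friends` below is a hypothetical sketch that works on a plain dictionary shaped like the `relations` attribute printed above; it checks both directions because the cell above only adds the direction listed on each line.

```python
def are_friends(relations, a, b):
    """Return True if a lists b as a friend, or b lists a."""
    # .get with an empty set avoids a KeyError for users without entries
    return b in relations.get(a, set()) or a in relations.get(b, set())


# Small sample in the same shape as the printed graph (IDs are strings)
relations = {'0': {'1', '2', '3'}}

zero_one = are_friends(relations, '0', '1')
one_zero = are_friends(relations, '1', '0')
one_two = are_friends(relations, '1', '2')
```

Here both `zero_one` and `one_zero` are true, while `one_two` is false because neither user lists the other.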
