%%html
<style>
table {float:left}
</style>

# Finding co-occurrence


### Reading the data from file

* The data is in a `tsv` (tab-separated values) file.
  * Fields are separated by tabs
* The first line is the header, which labels the columns (one per person)
* The first column is the label for the day
* Each cell contains an `X` if that person did not stay at a hotel that day, or the `id` of the hotel where they stayed (H1, H2, H3, H4 or H5)

| Person      	| P1  	| P2  | P3  	| `...`  	|
|-------------	|:-----:|:-----:|:-----:|-----:|
| Hotel Day 1 	| H1  	| H3  	| X   	| `...`  	|
| Hotel Day 2 	| X   	| H1  	| H5  	| `...`  	|
| `...`       	| `...` 	| `...`  	| `...`  	| `...` |
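
As a minimal sketch of the format described above, the snippet below parses a two-line sample held in a string rather than the real file. The sample content and the names `sample` and `stays_by_person` are assumptions made for illustration only.

```python
# Hypothetical two-line sample in the format described above
# (a header line, then one line per day, fields separated by tabs).
sample = "Person\tP1\tP2\tP3\nHotel day 1\tH1\tH3\tX\n"

lines = sample.splitlines()
header = lines[0].split("\t")
people = header[1:]                      # drop the row-label column

row = lines[1].split("\t")
day_label, stays = row[0], row[1:]       # first field is the day label

# Map each person to the hotel they stayed at, skipping "X" (no stay).
stays_by_person = {p: h for p, h in zip(people, stays) if h != "X"}
print(day_label, stays_by_person)
```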


### Python Application

* The goal is to introduce Python in the context of an application
* We will start with a verbose implementation to explore the language 
  * Some constructs are not efficient but are useful to abstract complexity
* We will define a compact version at the end
* We will discuss how to scale this problem using available resources



In [12]:
for line in open("data/hotel_data.tsv"):
    # Each line we read has "\n" at the end
    # If we don't remove it, we should at least tell print not to add 
    # another "\n" when printing
    print(line, end="")

Person		P1	P2	P3	P4	P5	P6	P7	P8	P9	P10
Hotel day 1	H1	H4	X	H2	X	X	H2	X	H5 	H1
Hotel day 2	X	X	X	X	H1	X	X	X	X	X
Hotel day 3	X	X	X	X	X	X	X	H4	X	X
Hotel day 4	X	X	X	X	X	X	X	X	X	H3
Hotel day 5	X	X	H1	X	X	H1	X	X	X	X
Hotel day 6	X	X	X	X	X	H4	X	X	X	X
Hotel day 7	X	X	X	X	X	X	X	X	X	X
Hotel day 8	X	H3	X	X	X	X	X	H5	X	X
Hotel day 9	X	X	X	X	X	X	X	X	X	X
Hotel day 10	X	X	X	X	X	X	X	X	X	X
Hotel day 11	X	X	X	X	X	X	H2	X	X	X
Hotel day 12	H4	X	X	X	X	X	X	X	X	X
Hotel day 13	X	X	X	X	X	X	X	X	X	X
Hotel day 14	X	X	X	H2	X	X	X	X	X	X
Hotel day 15	X	X	X	X	X	X	X	X	X	X
Hotel day 16	X	X	X	X	X	X	X	X	X	X
Hotel day 17	X	X	X	X	X	X	X	X	H3	X
Hotel day 18	X	X	X	X	H3	X	X	X	X	X
Hotel day 19	X	X	H5	X	X	X	X	X	X	X
Hotel day 20	X	X	X	H3	X	X	H3	X	X	X

In [14]:
# Skip the header

hotel_days = []
i = 0
for line in open("data/hotel_data.tsv"):
    if i > 0:
        print(line, end="")
    i+=1 


Hotel day 1	H1	H4	X	H2	X	X	H2	X	H5 	H1
Hotel day 2	X	X	X	X	H1	X	X	X	X	X
Hotel day 3	X	X	X	X	X	X	X	H4	X	X
Hotel day 4	X	X	X	X	X	X	X	X	X	H3
Hotel day 5	X	X	H1	X	X	H1	X	X	X	X
Hotel day 6	X	X	X	X	X	H4	X	X	X	X
Hotel day 7	X	X	X	X	X	X	X	X	X	X
Hotel day 8	X	H3	X	X	X	X	X	H5	X	X
Hotel day 9	X	X	X	X	X	X	X	X	X	X
Hotel day 10	X	X	X	X	X	X	X	X	X	X
Hotel day 11	X	X	X	X	X	X	H2	X	X	X
Hotel day 12	H4	X	X	X	X	X	X	X	X	X
Hotel day 13	X	X	X	X	X	X	X	X	X	X
Hotel day 14	X	X	X	H2	X	X	X	X	X	X
Hotel day 15	X	X	X	X	X	X	X	X	X	X
Hotel day 16	X	X	X	X	X	X	X	X	X	X
Hotel day 17	X	X	X	X	X	X	X	X	H3	X
Hotel day 18	X	X	X	X	H3	X	X	X	X	X
Hotel day 19	X	X	H5	X	X	X	X	X	X	X
Hotel day 20	X	X	X	H3	X	X	H3	X	X	X

In [15]:
# Add the data to a list of lists
# Each day is a list of hotel stays, one entry per person

hotel_days = []
i = 0
for line in open("data/hotel_data.tsv"):
    if i > 0:
        hotel_days.append(line.rstrip().split("\t")[1:])
    i+=1 

# No need to add print. Jupyter automatically prints it
# since it's the last line in the cell
hotel_days

[['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5 ', 'H1'],
 ['X', 'X', 'X', 'X', 'H1', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'H4', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'H3'],
 ['X', 'X', 'H1', 'X', 'X', 'H1', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'H4', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'H3', 'X', 'X', 'X', 'X', 'X', 'H5', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'H2', 'X', 'X', 'X'],
 ['H4', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'H2', 'X', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'H3', 'X'],
 ['X', 'X', 'X', 'X', 'H3', 'X', 'X', 'X', 'X', 'X'],
 ['X', 'X', 'H5', 'X', 'X'
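
Note that one entry in the first row is `'H5 '`, with a trailing space, because `rstrip()` only trims the end of the whole line. One possible way to normalize the fields, shown here as a sketch on a single line copied from the file listing above (the name `raw_line` is an assumption), is to strip each field after splitting:

```python
# One line as it appears in the file listing above; note the space after "H5".
raw_line = "Hotel day 1\tH1\tH4\tX\tH2\tX\tX\tH2\tX\tH5 \tH1\n"

# Split on tabs, drop the day label, then strip whitespace from each field.
fields = [field.strip() for field in raw_line.rstrip("\n").split("\t")[1:]]
print(fields)
```

The rest of this notebook keeps the original, unstripped values, so its outputs still show `'H5 '`.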

In [22]:
# Tracking who stayed at which hotel
# Create an empty list for each hotel
# Add each guest to the list

hotels_to_people = {} 
hotels_to_people["H1"] = []
hotels_to_people["H1"].append("P1")
hotels_to_people["H1"].append("P2")

hotels_to_people["H3"] = []
hotels_to_people["H3"].append("P5")

hotels_to_people


{'H1': ['P1', 'P2'], 'H3': ['P5']}
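
The same bookkeeping can be sketched with `collections.defaultdict` from the standard library, which creates the empty list the first time a key is used. This is an alternative shown for comparison, not the approach used in the rest of the notebook.

```python
from collections import defaultdict

# A missing key is initialized with list(), i.e. an empty list.
hotels_to_people = defaultdict(list)
hotels_to_people["H1"].append("P1")
hotels_to_people["H1"].append("P2")
hotels_to_people["H3"].append("P5")

# Convert back to a plain dict for display.
result = dict(hotels_to_people)
print(result)
```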

In [30]:
def find_groups(day_i_hotels):
    """ 
    Given a day log (list), return an index of which people stayed at which hotel
    Args:
        day_i_hotels: list of hotel stays in a single day, one entry per person
        For example: 
            day_i_hotels = ['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5 ', 'H1']
            P1 stayed at hotel H1, P2 stayed at H4, P3 didn't stay at a hotel, etc.
    Returns:
        a dict mapping each hotel to the list of people who stayed there during that day
        For example: 
            {'H1': [1, 10], 'H4': [2], 'H2': [4, 7], 'H5 ': [9]}
            where 1, 10, 2, 4, 7 and 9 are people P1, P10, P2, P4, P7 and P9 respectively
    """
    hotels_to_people = {}
    i = 1
    # initialize each hotel with an empty list
    # ex. hotels_to_people["H1"] = []
    # this is not necessary but useful to abstract complexity
    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id] = []

    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id].append(i)
        i+=1
    return hotels_to_people

### Documenting functions

* Make sure you document your functions
  * Use PEP 257 or the Google Python Style Guide
      * https://google.github.io/styleguide/pyguide.html
      *  https://www.python.org/dev/peps/pep-0257/
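
As a small illustration of the Google-style layout, here is a sketch of a documented function. The function `count_stays` is hypothetical and is not used elsewhere in this notebook.

```python
def count_stays(day_i_hotels):
    """Count how many people stayed at a hotel on a single day.

    Args:
        day_i_hotels: list of hotel ids, one per person; "X" means no stay.

    Returns:
        The number of entries that are not "X".
    """
    return sum(1 for hotel_id in day_i_hotels if hotel_id != "X")


# The docstring is stored on the function and is what help() displays.
print(count_stays.__doc__)
```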

In [28]:
# This is how the docstring appears
find_groups?

In [29]:
find_groups(['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5 ', 'H1'])

{'H1': [1, 10], 'H4': [2], 'H2': [4, 7], 'H5 ': [9]}

In [48]:
# Trial Version

def find_groups(day_i_hotels):
    """ 
    Given a day log (list), return groups of people (2 or more) who stayed at the same hotel
    Args:
        day_i_hotels: list of hotel stays in a single day, one entry per person
        For example: 
            day_i_hotels = ['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5 ', 'H1']
            P1 stayed at hotel H1, P2 stayed at H4, P3 didn't stay at a hotel, etc.
    Returns:
        a dict mapping each hotel to the group of people (2 or more) who stayed there during that day
        For example: 
            {'H1': [1, 10], 'H2': [4, 7]} where 1, 10, 4 and 7 are people P1, P10, P4 and P7 respectively
    """
    hotels_to_people = {}
    
    i = 1
    # initialize each hotel with an empty list
    # ex. hotels_to_people["H1"] = []
    # this is not necessary but useful to abstract complexity
    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id] = []

    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id].append(i)
        i+=1
    
    days_to_remove = []
    for key, value in hotels_to_people.items():
        if len(value) < 2:
            days_to_remove.append(key)
    
    print(f"need to remove days {days_to_remove}")
    
    return hotels_to_people

In [49]:
find_groups(['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5 ', 'H1'])

need to remove days ['H4', 'H5 ']


{'H1': [1, 10], 'H4': [2], 'H2': [4, 7], 'H5 ': [9]}

In [64]:
# Trial Version

def find_groups(day_i_hotels):
    """ 
    Given a day log (list), return groups of people (2 or more) who stayed at the same hotel
    Args:
        day_i_hotels: list of hotel stays in a single day, one entry per person
        For example: 
            day_i_hotels = ['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5 ', 'H1']
            P1 stayed at hotel H1, P2 stayed at H4, P3 didn't stay at a hotel, etc.
    Returns:
        a dict mapping each hotel to the group of people (2 or more) who stayed there during that day
        For example: 
            {'H1': [1, 10], 'H2': [4, 7]} where 1, 10, 4 and 7 are people P1, P10, P4 and P7 respectively
    """
    hotels_to_people = {}
    
    i = 1
    # initialize each hotel with an empty list
    # ex. hotels_to_people["H1"] = []
    # this is not necessary but useful to abstract complexity
    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id] = []

    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id].append(i)
        i+=1
    
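    # collect hotels with fewer than 2 guests first;
    # deleting keys while iterating over the dict would raise an error
    hotels_to_remove = []
    for key, value in hotels_to_people.items():
        if len(value) < 2:
            hotels_to_remove.append(key)

    for hotel_id in hotels_to_remove:
        del hotels_to_people[hotel_id]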
    
    return hotels_to_people

In [32]:
find_groups(['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5 ', 'H1'])

{'H1': [1, 10], 'H2': [4, 7]}

In [33]:
find_groups(['H2', 'H2', 'X', 'X', 'X', 'X', 'X', 'X', 'H5 ', 'H1'])

{'H2': [1, 2]}

In [53]:
find_groups(['H1', 'X', 'H2', 'X', 'X', 'X', 'H4', 'X', 'X ', 'X'])

{}

In [36]:
test_data_1 = ['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5 ', 'H1']
assert find_groups(test_data_1) == {'H1': [1, 10], 'H2': [4, 7]}

test_data_2 = ['H2', 'H2', 'X', 'X', 'X', 'X', 'X', 'X', 'H5 ', 'H1']
assert find_groups(test_data_2) == {'H2': [1, 2]}

test_data_3 = ['H1', 'X', 'H2', 'X', 'X', 'X', 'H4', 'X', 'X ', 'X']
assert find_groups(test_data_3) == {}
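
For comparison with the verbose version, here is a sketch of a more compact variant. The name `find_groups_compact` is an assumption chosen to avoid overwriting `find_groups`; it relies on `enumerate` for the person index and on `collections.defaultdict` to create the lists on first use.

```python
from collections import defaultdict


def find_groups_compact(day_i_hotels):
    """Return {hotel_id: [person, ...]} for hotels with 2 or more guests."""
    hotels_to_people = defaultdict(list)
    # enumerate(..., start=1) yields (1, first entry), (2, second entry), ...
    for person, hotel_id in enumerate(day_i_hotels, start=1):
        if hotel_id != "X":
            hotels_to_people[hotel_id].append(person)
    # Keep only hotels where at least two people stayed.
    return {h: p for h, p in hotels_to_people.items() if len(p) >= 2}


print(find_groups_compact(['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5 ', 'H1']))
```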


In [38]:
# Generate groups for all the days 
# Store them in the list

groups_per_days = []
for day in hotel_days:
    groups_per_days.append(find_groups(day))
    
groups_per_days    

[{'H1': [1, 10], 'H2': [4, 7]},
 {},
 {},
 {},
 {'H1': [3, 6]},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {},
 {'H3': [4, 7]}]

In [41]:
# given a list (ex. [1,2,3,4,5])
# generate the list of pairwise comparisons of that list

temp_list = [1,2,3,4,5]

for i in range(0, len(temp_list)-1):
    for j in range(i+1, len(temp_list)):
        print(i,j, end="\t") 

0 1	0 2	0 3	0 4	1 2	1 3	1 4	2 3	2 4	3 4	

In [42]:
# generate the list of pairwise comparisons we will do

for i in range(0, len(hotel_days)-1):
    for j in range(i+1, len(hotel_days)):
        print(i,j, end="\t") 

0 1	0 2	0 3	0 4	0 5	0 6	0 7	0 8	0 9	0 10	0 11	0 12	0 13	0 14	0 15	0 16	0 17	0 18	0 19	1 2	1 3	1 4	1 5	1 6	1 7	1 8	1 9	1 10	1 11	1 12	1 13	1 14	1 15	1 16	1 17	1 18	1 19	2 3	2 4	2 5	2 6	2 7	2 8	2 9	2 10	2 11	2 12	2 13	2 14	2 15	2 16	2 17	2 18	2 19	3 4	3 5	3 6	3 7	3 8	3 9	3 10	3 11	3 12	3 13	3 14	3 15	3 16	3 17	3 18	3 19	4 5	4 6	4 7	4 8	4 9	4 10	4 11	4 12	4 13	4 14	4 15	4 16	4 17	4 18	4 19	5 6	5 7	5 8	5 9	5 10	5 11	5 12	5 13	5 14	5 15	5 16	5 17	5 18	5 19	6 7	6 8	6 9	6 10	6 11	6 12	6 13	6 14	6 15	6 16	6 17	6 18	6 19	7 8	7 9	7 10	7 11	7 12	7 13	7 14	7 15	7 16	7 17	7 18	7 19	8 9	8 10	8 11	8 12	8 13	8 14	8 15	8 16	8 17	8 18	8 19	9 10	9 11	9 12	9 13	9 14	9 15	9 16	9 17	9 18	9 19	10 11	10 12	10 13	10 14	10 15	10 16	10 17	10 18	10 19	11 12	11 13	11 14	11 15	11 16	11 17	11 18	11 19	12 13	12 14	12 15	12 16	12 17	12 18	12 19	13 14	13 15	13 16	13 17	13 18	13 19	14 15	14 16	14 17	14 18	14 19	15 16	15 17	15 18	15 19	16 17	16 18	16 19	17 18	17 19	18 19	

In [43]:
import itertools
x = [1, 2, 3, 4, 5]
list(itertools.combinations(x, 2))

[(1, 2),
 (1, 3),
 (1, 4),
 (1, 5),
 (2, 3),
 (2, 4),
 (2, 5),
 (3, 4),
 (3, 5),
 (4, 5)]

In [51]:
# we can do this using itertools
my_days = range(0, len(hotel_days))

pairwise_comps = list(itertools.combinations(my_days, 2))

pairwise_comps[ : 10]


[(0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (0, 5),
 (0, 6),
 (0, 7),
 (0, 8),
 (0, 9),
 (0, 10)]
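
The number of pairwise comparisons grows quadratically with the number of days: for n days there are n(n-1)/2 pairs. The sketch below checks this against `itertools.combinations` for the 20 days used here, and evaluates the count for 1,095 days. It assumes Python 3.8 or later for `math.comb`.

```python
import itertools
import math

nb_days = 20
pairs = list(itertools.combinations(range(nb_days), 2))

# The three ways of counting agree: enumeration, binomial coefficient, closed form.
assert len(pairs) == math.comb(nb_days, 2) == nb_days * (nb_days - 1) // 2

print(len(pairs))           # 190
print(math.comb(1095, 2))   # 598965
```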

In [52]:
# finding group overlap using set intersection

group_1 = [1,3,5]
group_2 = [2,3,5]

set(group_1).intersection(group_2)




{3, 5}

In [54]:
group_1 = [1,3,5]
group_2 = [2,4,6]

set(group_1).intersection(group_2)


set()

In [67]:
def compare_two_days(day_i, day_j):
    """
    Computes which groups of 2 or more individuals stayed together at a hotel on both day_i and day_j
    
    Args:
        day_i: a dict mapping hotels to the groups of people (2+) who stayed there during that day 
        For example: 
            {"H3": [1, 10], "H5": [4, 7]} where 1, 10, 4 and 7 are people P1, P10, P4 and P7 respectively
        day_j: a dict with the same structure as day_i, for the other day
    Returns:
        a list of sets of people who stayed together at a hotel during both days
        For example: 
            [{4, 7}], meaning that P4 and P7 stayed together at the same hotel during day_i and again during day_j
    """
    
    match = []
    for group_i in day_i.values():
        for group_j in day_j.values():
            intersect = set(group_i).intersection(group_j)
            if len(intersect) > 1 :
                match.append(intersect)
    return match

In [58]:
day_i = {"H1": [1,8], "H4": [4,6,7]}
day_j = {"H3":[1,4,5,7], "H5": [2,9]}

compare_two_days(day_i, day_j)

[{4, 7}]

In [60]:
day_i = {"H1": [1,8], "H4": [4,6,7]}
day_j = {"H3":[1,4,5,7], "H5": [2,9]}
assert compare_two_days(day_i, day_j) == [{4, 7}]

day_i = {"H2": [1,3], "H3": [2,4,6]}
day_j = {"H3":[1,7], "H5": [2,3]}
assert compare_two_days(day_i, day_j) == []

In [87]:
for comp in pairwise_comps:
    match = compare_two_days(groups_per_days[comp[0]], groups_per_days[comp[1]])
    print(f"for days {comp}, the overlap was: {match}")

for days (0, 1), the overlap was: []
for days (0, 2), the overlap was: []
for days (0, 3), the overlap was: []
for days (0, 4), the overlap was: []
for days (0, 5), the overlap was: []
for days (0, 6), the overlap was: []
for days (0, 7), the overlap was: []
for days (0, 8), the overlap was: []
for days (0, 9), the overlap was: []
for days (0, 10), the overlap was: []
for days (0, 11), the overlap was: []
for days (0, 12), the overlap was: []
for days (0, 13), the overlap was: []
for days (0, 14), the overlap was: []
for days (0, 15), the overlap was: []
for days (0, 16), the overlap was: []
for days (0, 17), the overlap was: []
for days (0, 18), the overlap was: []
for days (0, 19), the overlap was: [{4, 7}]
for days (1, 2), the overlap was: []
for days (1, 3), the overlap was: []
for days (1, 4), the overlap was: []
for days (1, 5), the overlap was: []
for days (1, 6), the overlap was: []
for days (1, 7), the overlap was: []
for days (1, 8), the overlap was: []
for days (1, 9), the o

In [69]:
# Complete program
import itertools

hotel_days = []
i = 0
for line in open("data/hotel_data.tsv"):
    if i > 0:
        hotel_days.append(line.rstrip().split("\t")[1:])
    i+=1 

groups_per_days = []
for day in hotel_days:    
    groups_per_days.append(find_groups(day))
    

pairwise_comps = list(itertools.combinations(range(0, len(hotel_days)), 2))

nb_matches = 0
for comp in pairwise_comps:
    match = compare_two_days(groups_per_days[comp[0]], groups_per_days[comp[1]])
    if len(match) > 0:
        nb_matches += 1 
print(f"Total matches was: {nb_matches}")

Total matches was: 1
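
The counting loop can also be sketched as a single function that iterates directly over pairs of days instead of pairs of indices. The name `count_matching_day_pairs` and the four-day sample input are assumptions for illustration; the sample uses group dicts in the same shape as those produced earlier.

```python
import itertools


def count_matching_day_pairs(groups_per_days):
    """Count pairs of days on which some group of 2+ people stayed together both days."""
    nb_matches = 0
    # combinations() yields each unordered pair of days exactly once.
    for day_i, day_j in itertools.combinations(groups_per_days, 2):
        shared = any(
            len(set(group_i) & set(group_j)) > 1
            for group_i in day_i.values()
            for group_j in day_j.values()
        )
        if shared:
            nb_matches += 1
    return nb_matches


# Hypothetical four-day sample of per-day groups.
sample_groups = [{'H1': [1, 10], 'H2': [4, 7]}, {}, {'H1': [3, 6]}, {'H3': [4, 7]}]
print(count_matching_day_pairs(sample_groups))
```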


In [92]:
# Try on a randomly generated, much larger dataset
import numpy as np
import pandas as pd

nb_persons = 100_000
nb_days = 1_095
columns = []
for i in range(nb_persons):
    # number of days (out of nb_days) on which this person stays at a hotel
    one_person_nb_days_stay = np.random.binomial(nb_days, 0.01)
    days = np.zeros(nb_days)
    # mark that many distinct, randomly chosen days with a 1
    days[np.random.choice(nb_days, one_person_nb_days_stay, replace=False)] = 1
    columns.append(pd.Series(days, name=f"P{i}"))

    # progress indicator
    if i % 100 == 0:
        print(i)

data = pd.DataFrame({x.name: x for x in columns})

0
100
200
...

In [98]:
data.iloc[0,:].sum()

1007.0
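
A per-person Python loop is slow at this scale. As a sketch of a vectorized alternative, the whole days-by-persons matrix can be drawn in one call: marking each cell independently with probability 0.01 gives each person a binomially distributed number of stay days, as in the loop version. The sizes below are reduced and the seed is arbitrary, both chosen only so the example runs quickly.

```python
import numpy as np
import pandas as pd

nb_persons = 1_000   # reduced from 100_000 for a quick run (assumption)
nb_days = 1_095

rng = np.random.default_rng(seed=0)

# One row per day, one column per person; True where the person stays at a hotel.
stays = rng.random((nb_days, nb_persons)) < 0.01

data = pd.DataFrame(
    stays.astype(np.int8),
    columns=[f"P{i}" for i in range(nb_persons)],
)

print(data.shape)
print(data.iloc[0, :].sum())   # number of people staying at a hotel on the first day
```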

### Distributing the computation - 1

![](https://www.dropbox.com/s/7a2z5q7c1sluyyn/distributed_model_1.png?dl=1)

### Distributing the computation - 2
![](https://www.dropbox.com/s/4qf3hb4ev2v21da/distributed_model_2.png?dl=1)

### Distributing the computation - 3
![](https://www.dropbox.com/s/vnyofxggbv0bpvm/distributed_model_3.png?dl=1)

### Distributing the computation

* Is model 3 the best we can do?
  * The model still requires transferring all the data, albeit in a much smaller format, to all the machines
  * All three machines will need data from days 1 to 1095
  
* Can we minimize the number of days per machine?
    * A naive approach? 
    * This is the topic of an assignment in ICS432, Concurrent and High-Performance Programming
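
To make the data-transfer concern concrete, the sketch below assigns day pairs to workers round-robin and then lists which days each worker would need. The round-robin scheme, the worker count, and all names are assumptions for illustration; this is not necessarily the scheme shown in the figures.

```python
import itertools

nb_days = 20      # small value for illustration
nb_workers = 3    # hypothetical number of machines

pairs = list(itertools.combinations(range(nb_days), 2))

# Round-robin: worker w gets pairs w, w + nb_workers, w + 2 * nb_workers, ...
pairs_per_worker = [pairs[w::nb_workers] for w in range(nb_workers)]

# The set of distinct days each worker must receive to do its comparisons.
days_per_worker = [
    {day for pair in chunk for day in pair} for chunk in pairs_per_worker
]

for w in range(nb_workers):
    print(f"worker {w}: {len(pairs_per_worker[w])} pairs, "
          f"{len(days_per_worker[w])} distinct days needed")
```

Comparing the number of distinct days per worker against `nb_days` shows how much data a given assignment forces each machine to hold, which is the quantity the question above asks to minimize.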


### Naive Solution

![](https://www.dropbox.com/s/v2wvm4zhe6lw5vm/distributed_system_x.png?dl=1)