%%html
<style>
table {float:left}
</style>

### Finding co-occurrence


### Reading the data from file

* The data is in a `tsv` (tab-separated values) file.
  * Fields are separated using tabs
* The first line is the header describing the columns (one per person)
* The first column is the label for the day
* Each cell contains an `X` if person i did not stay at a hotel that day, or the `id` of the hotel where they stayed (H1, H2, H3, H4 or H5)

| Person      	| P1  	| P2  | P3  	| `...`  	|
|-------------	|:-----:|:-----:|:-----:|-----:|
| Hotel Day 1 	| H1  	| H3  	| X   	| `...`  	|
| Hotel Day 2 	| X   	| H1  	| H5  	| `...`  	|
| `...`       	| `...` 	| `...`  	| `...`  	| `...` |


### Python Application

* The goal is to introduce Python in the context of an application
* We will start with a verbose implementation to explore the language 
  * Some constructs are not efficient but are useful to abstract complexity
* We will define a compact version at the end
* We will discuss how to scale this problem using available resources



In [None]:
for line in open("data/hotel_data.tsv"):
    # Each line we read has "\n" at the end
    # If we don't remove it, we should at least tell print not to add
    # another "\n" when printing
    print(line, end="")

In [None]:
# Skip the header

hotel_days = []
i = 0
for line in open("data/hotel_data.tsv"):
    if i > 0:
        print(line, end="")
    i+=1 


In [None]:
# Add the data to a list of lists
# Each inner list holds one day's hotel stays, one entry per person

hotel_days = []
i = 0
for line in open("data/hotel_data.tsv"):
    if i > 0:
        hotel_days.append(line.rstrip().split("\t")[1:])
    i+=1 

# No need to add print. Jupyter automatically prints it
# since it's the last line in the cell
hotel_days
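
The same parsing step can also be sketched with the standard `csv` module, which handles the tab splitting and the trailing newline. This is only an illustration: the in-memory `sample_tsv` string and the names `people` and `hotel_days_sample` are assumptions made so the example runs without `data/hotel_data.tsv`.

```python
import csv
import io

# Hypothetical sample mirroring the layout described above
# (header line, then one row per day)
sample_tsv = (
    "Person\tP1\tP2\tP3\n"
    "Hotel Day 1\tH1\tH3\tX\n"
    "Hotel Day 2\tX\tH1\tH5\n"
)

reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t")
header = next(reader)   # consume the header line instead of counting lines
people = header[1:]     # drop the row-label column

# Keep only the hotel ids for each day (skip the day label)
hotel_days_sample = [row[1:] for row in reader]

print(people)
print(hotel_days_sample)
```

With a real file, `io.StringIO(sample_tsv)` would be replaced by an open file handle.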

In [None]:
# Tracking who stayed at which hotel
# Create an empty list for each hotel
# Add each guest to the list

hotels_to_people = {} 
hotels_to_people["H1"] = []
hotels_to_people["H1"].append("P1")
hotels_to_people["H1"].append("P2")

hotels_to_people["H3"] = []
hotels_to_people["H3"].append("P5")

hotels_to_people


In [None]:
def find_groups(day_i_hotels):
    """ 
    Given a day log (list), return an index of which people stayed at which hotel
    Args:
        day_i_hotels: list of the hotels where each person stayed on a single day
        For example: 
            day_i_hotels = ['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5', 'H1']
            P1 stayed at hotel H1, P2 stayed at H4, P3 didn't stay at a hotel, etc.
    Returns:
        a dict mapping each hotel to the list of people who stayed there that day
        For example: 
            {'H1': [1, 10], 'H4': [2], 'H2': [4, 7], 'H5': [9]}
            where 1, 10, 2, 4, 7 and 9 are people P1, P10, P2, P4, P7 and P9 respectively
    """
    hotels_to_people = {}
    i = 1
    # initialize each hotel with an empty list
    # ex. hotels_to_people["H1"] = []
    # a separate pass is not strictly necessary, but it keeps the next loop simple
    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id] = []

    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id].append(i)
        i+=1
    return hotels_to_people

### Documenting functions

* Make sure you document your functions
  * Use PEP 257 or the Google Python Style Guide
      * https://www.python.org/dev/peps/pep-0257/
      * https://google.github.io/styleguide/pyguide.html
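
As a small illustration of the Google-style layout (summary line, `Args:`, `Returns:`), here is a sketch on a hypothetical helper, `count_guests`, which is not part of the notebook's pipeline. It also shows that the docstring is stored on the function object.

```python
def count_guests(day_i_hotels):
    """Count the people who stayed at any hotel on a given day.

    Args:
        day_i_hotels: list of hotel ids, one per person; "X" means no stay.

    Returns:
        The number of entries that are not "X".
    """
    return sum(1 for hotel_id in day_i_hotels if hotel_id != "X")


# The docstring is available as the __doc__ attribute;
# help(count_guests) displays it together with the signature
print(count_guests.__doc__)
```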

In [None]:
# This is how the docstring appears
help(find_groups)

In [None]:
find_groups(['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5', 'H1'])

In [None]:
# Trial Version

def find_groups(day_i_hotels):
    """ 
    Given a day log (list), return groups of people (2 or more) who stayed at the same hotel
    Args:
        day_i_hotels: list of the hotels where each person stayed on a single day
        For example: 
            day_i_hotels = ['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5', 'H1']
            P1 stayed at hotel H1, P2 stayed at H4, P3 didn't stay at a hotel, etc.
    Returns:
        a dict mapping each hotel to the group of people who stayed there that day
        For example: 
            {'H1': [1, 10], 'H2': [4, 7]} where 1, 10, 4 and 7 are people P1, P10, P4 and P7 respectively
    """
    hotels_to_people = {}
    
    i = 1
    # initialize each hotel with an empty list
    # ex. hotels_to_people["H1"] = []
    # a separate pass is not strictly necessary, but it keeps the next loop simple
    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id] = []

    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id].append(i)
        i+=1
    
    # the keys are hotels, so collect the hotels that hosted fewer than 2 people
    hotels_to_remove = []
    for key, value in hotels_to_people.items():
        if len(value) < 2:
            hotels_to_remove.append(key)
    
    print(f"need to remove hotels {hotels_to_remove}")
    
    return hotels_to_people

In [None]:
find_groups(['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5', 'H1'])

In [3]:
# Trial Version

def find_groups(day_i_hotels):
    """ 
    Given a day log (list), return groups of people (2 or more) who stayed at the same hotel
    Args:
        day_i_hotels: list of the hotels where each person stayed on a single day
        For example: 
            day_i_hotels = ['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5', 'H1']
            P1 stayed at hotel H1, P2 stayed at H4, P3 didn't stay at a hotel, etc.
    Returns:
        a dict mapping each hotel to the group of people who stayed there that day
        For example: 
            {'H1': [1, 10], 'H2': [4, 7]} where 1, 10, 4 and 7 are people P1, P10, P4 and P7 respectively
    """
    hotels_to_people = {}
    
    i = 1
    # initialize each hotel with an empty list
    # ex. hotels_to_people["H1"] = []
    # a separate pass is not strictly necessary, but it keeps the next loop simple
    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id] = []

    for hotel_id in day_i_hotels:
        if hotel_id != "X":
            hotels_to_people[hotel_id].append(i)
        i+=1
    
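    # the keys are hotels, so collect the hotels that hosted fewer than 2 people first;
    # deleting entries while iterating over the dict would raise a RuntimeError
    hotels_to_remove = []
    for key, value in hotels_to_people.items():
        if len(value) < 2:
            hotels_to_remove.append(key)

    for hotel_id in hotels_to_remove:
        del hotels_to_people[hotel_id]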
    
    return hotels_to_people

In [None]:
find_groups(['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5', 'H1'])

In [None]:
find_groups(['H2', 'H2', 'X', 'X', 'X', 'X', 'X', 'X', 'H5', 'H1'])

In [None]:
find_groups(['H1', 'X', 'H2', 'X', 'X', 'X', 'H4', 'X', 'X', 'X'])

In [None]:
test_data_1 = ['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5', 'H1']
assert find_groups(test_data_1) == {'H1': [1, 10], 'H2': [4, 7]}

test_data_2 = ['H2', 'H2', 'X', 'X', 'X', 'X', 'X', 'X', 'H5', 'H1']
assert find_groups(test_data_2) == {'H2': [1, 2]}

test_data_3 = ['H1', 'X', 'H2', 'X', 'X', 'X', 'H4', 'X', 'X', 'X']
assert find_groups(test_data_3) == {}
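
The verbose `find_groups` above can be condensed. The following is one possible compact sketch, not necessarily the compact version the notebook has in mind; the name `find_groups_compact` is hypothetical. It uses `collections.defaultdict` to drop the initialization pass, `enumerate` to replace the manual counter, and a dict comprehension to keep only groups of 2 or more.

```python
from collections import defaultdict


def find_groups_compact(day_i_hotels):
    """Return {hotel_id: [person numbers]} for hotels with 2+ guests."""
    hotels_to_people = defaultdict(list)
    # enumerate(..., start=1) numbers the people P1, P2, ... as 1, 2, ...
    for person, hotel_id in enumerate(day_i_hotels, start=1):
        if hotel_id != "X":
            hotels_to_people[hotel_id].append(person)
    # Build a new plain dict rather than deleting entries in place
    return {
        hotel_id: people
        for hotel_id, people in hotels_to_people.items()
        if len(people) >= 2
    }


day = ['H1', 'H4', 'X', 'H2', 'X', 'X', 'H2', 'X', 'H5', 'H1']
print(find_groups_compact(day))
```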


In [None]:
# Generate groups for all the days
# Store them in a list

groups_per_days = []
for day in hotel_days:
    groups_per_days.append(find_groups(day))

groups_per_days

In [None]:
# given a list (ex. [1,2,3,4,5])
# generate the list of pairwise comparisons of that list

temp_list = [1, 2, 3, 4, 5]

for i in range(0, len(temp_list)-1):
    for j in range(i+1, len(temp_list)):
        print(temp_list[i],temp_list[j], end="\t") 

In [None]:
# generate the list of pairwise comparisons we will do

for i in range(0, len(hotel_days)-1):
    for j in range(i+1, len(hotel_days)):
        print(i,j, end="\t") 

In [None]:
import itertools
x = [1, 2, 3, 4, 5]
list(itertools.combinations(x, 2))

In [None]:
# we can do this using itertools
my_days = range(0, len(hotel_days))

pairwise_comps = list(itertools.combinations(my_days, 2))

pairwise_comps[ : 10]
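
The number of pairwise comparisons grows quadratically with the number of days: for n days there are n(n-1)/2 pairs. A quick sketch of that count, using `math.comb` (Python 3.8+) and assuming the 1095-day figure mentioned later in this notebook:

```python
import itertools
import math

# Sanity check on a small case: enumerating the pairs matches the formula
small = list(itertools.combinations(range(5), 2))
print(len(small), math.comb(5, 2))

# Assumed figure: 1095 days, as mentioned in the distribution discussion
n_days = 1095
n_pairs = math.comb(n_days, 2)   # same as n_days * (n_days - 1) // 2
print(n_pairs)
```

Each of these pairs then triggers a nested loop over the groups of both days, which is why the full run takes a long time on a single machine.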


In [None]:
set([1,2,2,3,1,1,2])


In [None]:
# finding group overlap using set intersection

group_1 = [1,3,5]
group_2 = [2,3,5]

set(group_1).intersection(group_2)




In [None]:
group_1 = [1,3,5]
group_2 = [2,4,6]

set(group_1).intersection(group_2)


In [2]:
def compare_two_days(day_i, day_j):
    """
    Computes which groups of 2 or more individuals stayed together at a hotel on both day_i and day_j
    (the hotel does not have to be the same one on both days)
    
    Args:
        day_i: a dict of hotels and the groups of people (2+) who stayed at each hotel during that day 
        For example: 
            {"H3": [1, 10], "H5": [4, 7]} where 1, 10, 4 and 7 are people P1, P10, P4 and P7 respectively
        day_j: a dict with the same structure as day_i, for the other day
    Returns:
        a list of sets of people who stayed together at a hotel during both days
        For example: 
            [{4, 7}], meaning that P4 and P7 stayed together at a hotel during day_i and again during day_j
    """
    
    match = []
    for group_i in day_i.values():
        for group_j in day_j.values():
            intersect = set(group_i).intersection(group_j)
            if len(intersect) > 1 :
                match.append(intersect)
    return match

In [None]:
day_i = {"H1": [1,8], "H4": [4,6,7]}
day_j = {"H3":[1,4,5,7], "H5": [2,9]}

compare_two_days(day_i, day_j)

In [None]:
day_i = {"H1": [1,8], "H4": [4,6,7]}
day_j = {"H3":[1,4,5,7], "H5": [2,9]}
assert compare_two_days(day_i, day_j) == [{4, 7}]

day_i = {"H2": [1,3], "H3": [2,4,6]}
day_j = {"H3":[1,7], "H5": [2,3]}
assert compare_two_days(day_i, day_j) == []

In [None]:
for comp in pairwise_comps:
    match = compare_two_days(groups_per_days[comp[0]], groups_per_days[comp[1]])
    print(f"for days {comp}, the overlap was: {match}")

In [11]:
# # complete  program
# import itertools

# hotel_days = []
# i = 0
# print("Now reading the file")

# for line in open("data/hotel_data_100k.tsv"):
#     if i > 0:
#         hotel_days.append(line.rstrip().split("\t")[1:])
#     i+=1 

# print("  Finished reading the file")

# print("Now finding the groups")

# groups_per_days = []
# for day in hotel_days: 
#     group = find_groups(day)
#     del(group["0"])
#     groups_per_days.append(group)
    
# print("  Finished finding the groups")

# pairwise_comps = list(itertools.combinations(range(0, len(hotel_days)), 2))
# print(f"There is a total of {len(pairwise_comps)} comparisons to compute")


# print("Now comparing the days")

nb_matches = 0
i = 0
for comp in pairwise_comps:
    i+=1
    match = compare_two_days(groups_per_days[comp[0]], groups_per_days[comp[1]])
    if len(match) > 0:
        nb_matches += 1 
    if i % 500 == 0:
        print(i)
        print(f"Partial number of matches found is: {nb_matches}")
    
print(f"Total matches was: {nb_matches}")

500
Partial number of matches found is: 303
1000
Partial number of matches found is: 607
1500
Partial number of matches found is: 878
2000
Partial number of matches found is: 1173
2500
Partial number of matches found is: 1498
3000
Partial number of matches found is: 1820
3500
Partial number of matches found is: 2165
4000
Partial number of matches found is: 2465
4500
Partial number of matches found is: 2791
5000
Partial number of matches found is: 3127
5500
Partial number of matches found is: 3462
6000
Partial number of matches found is: 3767


KeyboardInterrupt: 

### Distributing the computation - 1
![](https://www.dropbox.com/s/7a2z5q7c1sluyyn/distributed_model_1.png?dl=1)

### Distributing the computation - 2
![](https://www.dropbox.com/s/4qf3hb4ev2v21da/distributed_model_2.png?dl=1)

### Distributing the computation - 3
![](https://www.dropbox.com/s/vnyofxggbv0bpvm/distributed_model_3.png?dl=1)

### Distributing the computation

* Is model 3 the best we can do?
  * The model still requires transferring all the data, albeit in a much smaller format, to all the machines
  * All three machines will need data from days 1 to 1095
  
* Can we minimize the number of days per machine?
    * A naive approach? 
    * This is the topic of an assignment in ICS432, Concurrent and High-Performance Programming
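
To make the question concrete, the sketch below deals the day pairs round-robin to a few workers and reports how many distinct days each worker would need. It is only a way to measure the problem under stated assumptions, not necessarily the approach shown in the figures: `split_pairs`, `days_needed`, the 12-day toy size and the 3 workers are all hypothetical.

```python
import itertools


def split_pairs(n_days, n_workers):
    """Deal all (day_i, day_j) pairs round-robin into n_workers lists."""
    pairs = list(itertools.combinations(range(n_days), 2))
    return [pairs[w::n_workers] for w in range(n_workers)]


def days_needed(pairs):
    """Return the set of distinct days referenced by a list of pairs."""
    return {day for pair in pairs for day in pair}


# Toy sizes chosen for illustration only
n_days, n_workers = 12, 3
for w, pairs in enumerate(split_pairs(n_days, n_workers)):
    needed = days_needed(pairs)
    print(f"worker {w}: {len(pairs)} pairs, needs {len(needed)} of {n_days} days")
```

Comparing the "days needed" count across different ways of assigning pairs is one way to evaluate whether a partitioning reduces the data each machine must receive.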


### Distributing the computation  -- Naive Solution

![](https://www.dropbox.com/s/v2wvm4zhe6lw5vm/distributed_system_x.png?dl=1)