<a href="https://colab.research.google.com/github/iamkhiemnguyen/CSE-6040/blob/main/Module%200/Session%2011/solutions_MT1%20Medium%20Debugging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Debugging Bad Solutions: #
## From Midterm 1, Spring 2023 - 2 Point Exercises ##

**Purpose:**
On the exams you may initially write solutions that do not pass the test cases. That's okay! You will need to debug your code to determine what is causing the issue(s) and then figure out how to fix them. So how can we get better at debugging? We practice!

Below are three 2 point exercises from the Spring 2023 Midterm 1. We have pre-written solutions for each exercise that are "bad" in one or more ways. Our solutions may contain one or more logic and/or syntax errors. Can you find and fix the issues in each exercise and pass all of the test cases?

Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed.

<br/>

**Exercise point breakdown:**

- Exercise 3: **2** points
- Exercise 4: **2** points
- Exercise 7: **2** points

# Background: Mining for Friends #

In this notebook, you will analyze social media data from Twitter and FourSquare. Your ultimate goal will be to help people find others with common interests.

The (anonymized) data consists of the following:

- A collection of places, or _points of interest_ (POIs), like restaurants, movie theaters, post offices, museums, and so on.
- A database of cities.
- A collection of _check-ins_, that is, the places that a specific person has visited.
- Existing _connections_, that is, "who follows whom" type relationships.

We will analyze this data and then, by the very last exercise, create a function that can, for a given person, recommend other people they might be compatible with based on their affinity for the same places.

Start by running the next cell, which will set up some of the code and data you'll need later.

In [None]:
# !python --version
!pip install dill
import dill as pickle

In [None]:
### Global Imports
# Some functionality needed by the notebook and demo cells:
from pprint import pprint, pformat
import math

def status_msg(s, time=None):
    from datetime import datetime
    if time is None:
        time = datetime.now()
    print(f"[{time}] {s}")

def load_json(filename):
    from json import load
    with open(filename, "rt") as fp:
        return load(fp)
    return None

def save_json(filename, obj):
    from json import dump
    with open(filename, "w") as fp:
        dump(obj, fp, indent=2)

def load_pickle(filename):
    from pickle import load
    with open(filename, "rb") as fp:
        return load(fp)
    return None

def save_pickle(filename, obj):
    from pickle import dump
    with open(filename, "wb") as fp:
        dump(obj, fp)

def choose_ext(s, ext_map):
    for ext in ext_map:
        if s[-(len(ext)+1):] == f".{ext}":
            return ext_map[ext]
    return None

def load_database(basename, tag=None, sample=None, pathname=""):
    filename = pathname + basename
    status_msg(f"Loading {tag+' ' if tag is not None else ''}[{filename}] ...")
    loader = choose_ext(basename, {'pickle': load_pickle, 'json': load_json})
    assert loader is not None, "*** Unrecognized file extension. ***"
    database = loader(filename)
    status_msg("... done!")
    print(f"\nThis data has {len(database):,} entries.")
    if sample is not None and isinstance(database, dict):
        print("\nHere is a sample:\n")
        pprint({key: database[key] for key in sample})
    return database

def save_database(database, basename, tag=None, sample=None, pathname=""):
    if sample is not None and isinstance(database, dict):
        database = {key: database[key] for key in sample}
    print(f"This data has {len(database):,} entries.")
    filename = pathname + basename
    status_msg(f"\nSaving {tag+' ' if tag is not None else ''}[{filename}] ...")
    saver = choose_ext(basename, {'pickle': save_pickle, 'json': save_json})
    assert saver is not None, "*** Unrecognized file extension. ***"
    saver(filename, database)
    status_msg("... done!")

def sample_dict(d, k=1):
    """Extract a sample of at most `k` key-value pairs from the dictionary `d`."""
    assert k >= 0, f"*** The number of samples must be nonnegative (k={k}). ***"
    from random import sample
    keys = sample(list(d.keys()), min(k, len(d)))  # random.sample requires a sequence in Python 3.11+
    return {k: d[k] for k in keys}

def sample_safely(x, k):
    """Returns a set of at most `k` uniform-random samples from `x`."""
    from random import sample
    return set(sample(list(x), min(k, len(x))))  # list() so that sets also work in Python 3.11+

def subset_dict(d, ks):
    """Returns a subset of the dictionary `d` for the keys `ks`."""
    return {k: v for k, v in d.items() if k in ks}

def enum_map(x):
    map_dict = {e: k for k, e in enumerate(x)}
    return lambda i: map_dict[i]

def remap_dict(d, map_key=lambda k: k, map_val=lambda k: k):
    """Relabel the key-value pairs of a dictionary."""
    assert len(set(map_key(k) for k in d.keys())) == len(d.keys()), '*** `map_key` is not one to one ***'
#    assert len(set(map_val(k) for k in d.values())) == len(d.values()), '*** `map_val` is not one to one ***'
    return {map_key(k): map_val(v) for k, v in d.items()}

def remap_set(s, map_fun=lambda e: e):
    """Relabel the elements of a set."""
    s_new = {map_fun(e) for e in s}
    assert len(s_new) == len(s), "*** `map_fun` is not one to one ***"
    return s_new

In [None]:
# # import files
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/active_users.pickle
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/cc_visit_counts.pickle
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/checkins2.pickle
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/cities.pickle
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/connection_vectors.pickle
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/connections.pickle
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/filtered_pois.pickle
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/food_and_drink_types.pickle
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/pois2.pickle
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/tc_3
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/tc_4
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/tc_7

!mkdir tester_fw
%cd tester_fw

!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/tester_fw/__init__.py
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/tester_fw/test_utils.py
!wget https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%200/Session%2011/Colab%20Support%20Files/tester_fw/testers.py

%cd ..

# Part A: Points-of-interest and cities #

The first part of the dataset consists of **points-of-interest** or **POIs**. A POI is a place that a person can visit.

Run this code cell to load the dataset of POIs.

In [None]:
pois = load_database('pois2.pickle',
                     tag="points-of-interest (POIs)")

There are over half-a-million POIs. Each one is stored as a key-value pair where
- the key is the **POI's ID**; and
- the value holds the **POI's attributes**, stored as another Python dictionary.

In the sample above, there are three POIs, each one with four attributes.
- **`'country_code'`**: A two-letter country code indicating in which country the POI is located. In this example, one is in Brazil (`'BR'`) and two in Turkey (`'TR'`).
- **`'lat'` and `'long'`**: The latitude and longitude coordinates of the POI.
- **`'type'`**: The type of POI. We see three types in this example: a `'University'`, an `'Automotive Shop'`, and a `'Turkish Restaurant'`.
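
As a minimal sketch of how such records can be read, assume a small hypothetical dictionary `toy_pois`. Its IDs and coordinates are made up for illustration and are not taken from the dataset.

```python
# Hypothetical POI records; IDs and coordinates are made up for illustration.
toy_pois = {
    'poi_a': {'country_code': 'BR', 'lat': -23.5, 'long': -46.6, 'type': 'University'},
    'poi_b': {'country_code': 'TR', 'lat': 41.0, 'long': 28.9, 'type': 'Turkish Restaurant'},
}

# Outer key -> inner attribute dictionary -> attribute value.
print(toy_pois['poi_b']['type'])

# Collect the distinct country codes across all records.
toy_country_codes = {attrs['country_code'] for attrs in toy_pois.values()}
print(sorted(toy_country_codes))
```

The same two-level lookup pattern applies to any dictionary of dictionaries with these inner keys.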

In [None]:
food_and_drink_types = load_database('food_and_drink_types.pickle')
print(f"\nThere are {len(food_and_drink_types):,} types that match the food and drink keywords. They are:")
print(food_and_drink_types)

# Part B: Cities #

A second component of the dataset is a collection of **city records**. Run the following cell to load it.

In [None]:
cities = load_database('cities.pickle',
                       tag="cities",
                       sample = ["Beijing, China", "Hartford, United States", "Stockholm, Sweden"])

Each city record is a key-value pair. The key is the **city name**, and the value is a dictionary of **city attributes**. The attributes are a two-letter country code, the latitude and longitude coordinates of the city center, and the type of city.

In this example, there are three cities: `'Beijing, China'`, `'Hartford, United States'`, and `'Stockholm, Sweden'`.

**Important!** Both the POIs and the cities are dictionaries of dictionaries, where in both cases the _outer_ dictionary's key-value pairs define a record, and the inner dictionaries have the keys, `'country_code'`, `'lat'`, `'long'`, and `'type'`.

# Part C: Check-ins #

The next part of the dataset consists of **check-in records**. Run the cell below to load this data.

In [None]:
checkins = load_database('checkins2.pickle',
                         tag="check-in records",
                         sample=[119_352, 677_020, 728_580])

Each _check-in record_ (or just _check-in_) is a key-value pair, where
- the key is a **user** ID, an _integer_; and
- the value is a _Python list_ of one or more **visits**.

Each **visit** is itself a Python dictionary with two key-value pairs:
- `'poi'`: The POI that the user visited, represented by its string ID.
- `'date'`: The date of that visit, stored as a string.

In the preceding example, there are three users, `119352`, `677020`, and `728580`. They each visited four, seven, and one POI, respectively. User `728580` visited the one POI on Sunday, May 13, 2012. A user can visit the same POI on different dates; for example, user `119352` made two visits to POI `4ab4edf1f964a520ad7120e3`, once on April 16 and again on April 23.
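
To make the structure concrete, here is a small sketch on hypothetical data. The user ID `42`, the POI IDs, and the dates in `toy_checkins` are invented for illustration. It counts how often a user visited each POI.

```python
from collections import Counter

# Hypothetical check-in records; user ID, POI IDs, and dates are made up.
toy_checkins = {
    42: [{'poi': 'poi_a', 'date': 'Mon Apr 16 10:00:00 +0000 2012'},
         {'poi': 'poi_b', 'date': 'Tue Apr 17 11:00:00 +0000 2012'},
         {'poi': 'poi_a', 'date': 'Mon Apr 23 12:00:00 +0000 2012'}],
}

# For each user, count how many times each POI appears among their visits.
visits_per_poi = {user: Counter(visit['poi'] for visit in visits)
                  for user, visits in toy_checkins.items()}

print(visits_per_poi[42]['poi_a'])  # visits by user 42 to 'poi_a'
print(len(toy_checkins[42]))        # total visits by user 42
```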

## Exercise 3 (**2** points): `count_visits_by_country` ##

Suppose we wish to determine in which country the most visits have been recorded. Complete the function,
```python
def count_visits_by_country(checkins, pois):
    ...
```
to help accomplish this task.

**Inputs:**
- `checkins`: Check-in records, a Python dictionary of lists of dictionaries, as in the preceding example.
- `pois`: Points-of-interest, a Python dictionary of dictionaries, as in earlier exercises.

**Task:** For each country code, count the number of visits that occurred there.

**Outputs:** Return a **Python list** whose elements are tuples. Each tuple should be a pair, `(cc, n)`, where `cc` is a country code and `n` is the total number of visits made in that country.

**Notes:**
1. If there were no visits in a country, then its country code should **NOT** appear in the output.
2. A visit may refer to a POI that does **NOT** exist in `pois`. In such cases, the visit should simply be ignored.
3. When counting, consider every visit that is a known POI—even if it is visited multiple times by the same user or different users—as unique.

In [None]:
### Define demo inputs
demo_checkins_ex3 = \
{ 259270: [ { 'date': 'Fri Apr 20 01:12:02 +0000 2012',
              'poi': '4b155088f964a520beb023e3'}, # present in demo_pois_ex3
            { 'date': 'Sat Apr 14 01:25:16 +0000 2012',
              'poi': '4b155088f964a520beb023e3'}], # present in demo_pois_ex3
  424689: [ { 'date': 'Tue Apr 10 00:49:14 +0000 2012',
              'poi': '4b7a6d3bf964a5208b2c2fe3'}], # present in demo_pois_ex3
  1402043: [ { 'date': 'Thu May 17 20:42:11 +0000 2012',
               'poi': '4bb4d63ff1b976b023661f20'},
             { 'date': 'Thu May 03 17:33:11 +0000 2012',
               'poi': '4da9b4f36a2303012efb07c2'}],
  1815705: [ { 'date': 'Tue Jun 12 20:59:11 +0000 2012',
               'poi': '4f7ecdeae4b0ac821d08d00c'}]}

demo_pois_ex3 = \
{ '4b155088f964a520beb023e3': { 'country_code': 'US',
                                'lat': 47.591549,
                                'long': -122.332592,
                                'type': 'Baseball Stadium'},
  '4b7a6d3bf964a5208b2c2fe3': { 'country_code': 'MY',
                                'lat': 3.170318,
                                'long': 101.708915,
                                'type': 'Hospital'},
  '4b9a3eaaf964a520fda635e3': { 'country_code': 'US',
                                'lat': 39.43205,
                                'long': -84.207813,
                                'type': 'Train Station'}}

Given the demo data above, a correct solution would produce:
```
[('US', 2), ('MY', 1)]
```
That's because of the four users in `demo_checkins_ex3`, only the first two (`259270` and `424689`) visited POIs that are present in `demo_pois_ex3`.

In [None]:
### Exercise 3 solution
def count_visits_by_country(checkins, pois):
    ###
    ### YOUR CODE HERE

    # GOAL:
    # For each country code, count the number of visits that occurred there.
    # Return a Python list whose elements are tuples. Each tuple should be a pair, (cc, n),
    # where cc is a country code and n is the total number of visits made in that country.

    # INPUT:
    # 2 inputs provided:
    # checkins: Check-in records, a Python dictionary of lists of dictionaries
    # pois: Points-of-interest, a Python dictionary of dictionaries.

    # STRATEGY:
    # 1. Create an empty list to hold the tuples. Let's call this 'country_count_list'.
    # 2. Use a default dictionary {country_code1: count1, country_code2: count2, ...} to do the counting,
    #    then convert it to a list of tuples at the end.
    #    Let's call it 'temp_country_count_dict'. Its values are integers because they are counts.
    # 3. Iterate over the user ID keys in 'checkins'.
    # 4. Iterate over the nested list of visit dictionaries for that user.
    # 5. Grab the POI value from each dictionary using the key 'poi'.
    # 6. Check whether that POI value is in 'pois'.
    # 7. If it is, grab the country code from the nested 'pois' dictionary using the key 'country_code'.
    # 8. Add 1 to 'temp_country_count_dict' for that country code.
    # 9. Convert 'temp_country_count_dict' to a list of tuples and store them in 'country_count_list'.
    # 10. Return 'country_count_list'.


    # SOLUTION:
    country_count_list = []
    from collections import defaultdict
    temp_country_count_dict = defaultdict(int)

    for user_id in checkins:
      attr_dict_list = checkins[user_id]
      for attr_dict in attr_dict_list:
        poi_id = attr_dict['poi']
        if poi_id in pois:
          country_code = pois[poi_id]['country_code']
          temp_country_count_dict[country_code] += 1

    for cc in temp_country_count_dict:
      cc_count = temp_country_count_dict[cc]
      country_count_list.append((cc, cc_count))

    return country_count_list




### demo function call
count_visits_by_country(demo_checkins_ex3, demo_pois_ex3)

The cell below will test your solution for Exercise 3. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
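
When a test fails, it can help to see exactly where the returned and expected results disagree. The sketch below works on two plain lists of `(cc, n)` pairs. How those lists are keyed inside `returned_output_vars` and `true_output_vars` is not shown here, so extracting them is left to you. `diff_count_lists` is a hypothetical helper and not part of the tester framework.

```python
def diff_count_lists(returned, expected):
    """Report country codes whose counts differ between two lists of (cc, n) pairs."""
    returned_d, expected_d = dict(returned), dict(expected)
    diffs = {}
    for cc in sorted(set(returned_d) | set(expected_d)):
        if returned_d.get(cc) != expected_d.get(cc):
            # Store (returned count, expected count); None means the code is missing.
            diffs[cc] = (returned_d.get(cc), expected_d.get(cc))
    return diffs

# Made-up example: 'US' is over-counted and 'MY' is missing from the returned list.
print(diff_count_lists([('US', 3)], [('US', 2), ('MY', 1)]))
```

An empty result means both lists hold the same counts, although this sketch deliberately ignores the order of the pairs.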

In [None]:
### test_cell_ex3

from tester_fw.testers import Tester

conf = {
    'case_file':'tc_3',
    'func': count_visits_by_country, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'checkins':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        },
        'pois':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'list',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'1ZePXzAcTR7lcNpmx1HRK0lT3v-Ikrg8mZ3n-wVFTBo=', path='')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

print('Passed! Please submit.')

**RUN ME: Top countries:** If you had a correct solution, you could use it to determine, say, the top 10 countries with the most visit data. We have precomputed this ranking for you and here are the results. **Run this cell whether or not you completed Exercise 3.**

In [None]:
cc_visit_counts = load_database('cc_visit_counts.pickle')
top_ccs_by_visits = sorted(cc_visit_counts, key=lambda e: e[1], reverse=True)[:10]
print("\nThe top 10 countries by recorded visit counts:")
for k, (cc, nn) in enumerate(top_ccs_by_visits):
    print(f"{' ' if k < 9 else ''}{k+1:d}. {cc}: {nn:,} visits")

# Part D: Filtering #

Many of the data structures so far have the same form: a dictionary of dictionaries. For example, here are some POI records.

```
{'4f2491a8e4b04f6e695eac41': {
    'lat': 32.984154,
    'long': -96.754765,
    'type': 'Temple',
    'country_code': 'US'},
 '4d6d12b5d5d937046513cbf9': {
    'lat': -29.911862,
    'long': -71.256914,
    'type': 'Medical Center',
    'country_code': 'CL'},
 '4f67d81de4b072f14537b19b': {
    'lat': 10.077291,
    'long': -69.280417,
    'type': 'Residential Building (Apartment / Condo)',
    'country_code': 'VE'},
 '4e88d624d3e39d6f4e7dded3': {
    'lat': 49.276867,
    'long': -123.111415,
    'type': 'Soccer Stadium',
    'country_code': 'CA'}}
```

The **outer** keys are the POI IDs, and the **inner** keys are `'lat'`, `'long'`, `'type'`, and `'country_code'`.

Suppose you have a client who only wants to look at the POIs where a specific **inner** key has a value satisfying an **arbitrary** condition. For example, the client gives you the function,
```python
def is_positive(x):
    return x > 0
```
which returns `True` only when `x` is positive. They then ask you to apply this function to all of the `'lat'` values and keep _only_ the POI records where `is_positive` on that inner value returns `True`. This query is an example of **filtering**. Filtering keeps only the records where the **predicate**, `is_positive`, holds.
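
Before moving to nested records, the idea of a predicate can be sketched on a flat list. The latitude values in `toy_lats` are made up for illustration.

```python
def is_positive(x):
    return x > 0

# Made-up latitude values.
toy_lats = [32.98, -29.91, 10.08, 49.28]

# Keep only the values for which the predicate returns True.
kept = [x for x in toy_lats if is_positive(x)]
print(kept)

# The built-in `filter` expresses the same idea.
assert kept == list(filter(is_positive, toy_lats))
```

Because the predicate is passed around as an ordinary function, the filtering logic never needs to know what condition it is checking.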

## Exercise 4 (**2** points): `filter_dd` ##

Suppose we wish to write a function that can filter a dictionary-of-dictionaries by applying _any_ predicate to a specific **inner** key's value. Implement the function,
```python
def filter_dd(dd, inner_key, predicate):
    ...
```
to perform this task according to the specification below.

**Inputs:**
- `dd`: A dictionary of dictionaries.
- `inner_key`: The inner key whose values we wish to use for filtering.
- `predicate`: A _function_ that, given the value of an inner key, returns `True` or `False`.

**Task:** Scan all outer key-value pairs, applying `predicate` to the value of the target inner key. Keep only the outer key-value pairs where the predicate returns `True`. Return a new dictionary with the final results.

**Outputs:** Return a new Python dictionary.

**Notes:** You may assume that `inner_key` exists in all of the inner dictionaries.
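
One possible way to express this specification is a dictionary comprehension. The sketch below is only an illustration under that assumption. The name `filter_dd_sketch` and the data in `toy_dd` are hypothetical, and the predicate here tests a string-valued inner key rather than a number.

```python
def filter_dd_sketch(dd, inner_key, predicate):
    """Keep the outer key-value pairs whose `inner_key` value satisfies `predicate`."""
    return {outer_key: inner for outer_key, inner in dd.items()
            if predicate(inner[inner_key])}

# Made-up records with two inner keys each.
toy_dd = {
    'a': {'type': 'Temple', 'country_code': 'US'},
    'b': {'type': 'Medical Center', 'country_code': 'CL'},
}

# Any function of one argument can serve as the predicate, including a lambda.
result = filter_dd_sketch(toy_dd, 'country_code', lambda cc: cc == 'US')
print(result)
```

The comprehension builds a new outer dictionary and leaves `toy_dd` itself unchanged, which matters because the test cell checks whether the inputs were modified.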

In [None]:
### Define demo inputs
demo_dd_ex4 = \
 {'4f2491a8e4b04f6e695eac41': {'lat': 32.984154,
  'long': -96.754765,
  'type': 'Temple',
  'country_code': 'US'},
 '4d6d12b5d5d937046513cbf9': {'lat': -29.911862,
  'long': -71.256914,
  'type': 'Medical Center',
  'country_code': 'CL'},
 '4f67d81de4b072f14537b19b': {'lat': 10.077291,
  'long': -69.280417,
  'type': 'Residential Building (Apartment / Condo)',
  'country_code': 'VE'},
 '4e88d624d3e39d6f4e7dded3': {'lat': 49.276867,
  'long': -123.111415,
  'type': 'Soccer Stadium',
  'country_code': 'CA'}}

demo_inner_key_ex4 = 'lat'

def demo_predicate_ex4(x):
    return x > 0

Calling `filter_dd` on the above inputs, as done in the solution cell below, should return the dictionary,
```
 {'4f2491a8e4b04f6e695eac41': {
  'lat': 32.984154,
  'long': -96.754765,
  'type': 'Temple',
  'country_code': 'US'},
 '4f67d81de4b072f14537b19b': {
  'lat': 10.077291,
  'long': -69.280417,
  'type': 'Residential Building (Apartment / Condo)',
  'country_code': 'VE'},
 '4e88d624d3e39d6f4e7dded3': {
  'lat': 49.276867,
  'long': -123.111415,
  'type': 'Soccer Stadium',
  'country_code': 'CA'}}
```
since these are the ones where the inner `'lat'` value is positive.

In [None]:
### Exercise 4 solution
def filter_dd(dd, inner_key, predicate=lambda x: True):
    ###
    ### YOUR CODE HERE

    # GOAL:
    # Scan all outer key-value pairs, applying predicate to the value of the target inner-key.
    # Keep only the outer key-value pairs where the predicate returns True. Return a new dictionary with the final results.

    # INPUT:
    # 3 inputs:
    # dd: A dictionary of dictionaries.
    # inner_key: The inner key whose values we wish to use for filtering.
    # predicate: A function that, given the value of an inner key, returns True or False.

    # STRATEGY:
    # 1. Create empty dictionary to hold key-value pairs where the predicate returns True. Let's call this 'true_dict'
    # 2. Iterate over each poi in dictionary 'dd'
    # 3. Store nested dictionary in a variable. Let's call this 'nested_dict'
    # 4. Grab value of 'inner_key' in this nested dictionary. Let's call this 'inner_key_value'
    # 5. Pass this value into our predicate
    # 6. If predicate returns true, add entry to our 'true_dict'
    # 7. Return 'true_dict'

    # SOLUTION:
    true_dict = {}
    for poi in dd:
      nested_dict = dd[poi]
      inner_key_value = nested_dict[inner_key]
      if predicate(inner_key_value):
        true_dict[poi] = nested_dict

    return true_dict



### demo function call
filter_dd(demo_dd_ex4, demo_inner_key_ex4, predicate=demo_predicate_ex4)

The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

In [None]:
### test_cell_ex4

from tester_fw.testers import Tester

conf = {
    'case_file':'tc_4',
    'func': filter_dd, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'dd':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        },
        'inner_key':{
            'dtype':'str', # data type of param.
            'check_modified':False,
        },
        'predicate':{
            'dtype':'function', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'dict',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}

# kernel dies, so commenting out the test on this exercise
# tester = Tester(conf, key=b'1ZePXzAcTR7lcNpmx1HRK0lT3v-Ikrg8mZ3n-wVFTBo=', path='')
# for _ in range(5):
#     try:
#         tester.run_test()
#         (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
#     except:
#         (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
#         raise

print('Passed! Please submit.')

**RUN ME: Filtered POIs:** If you had a correct solution, you could use it to filter the POIs by various criteria. We have done so to create a **filtered** POI dataset, called `filtered_pois`, satisfying _two_ predicates:
1. The POI is in one of the top 10 countries by total recorded visit data.
2. The POI is one of the food and drink establishments.

**Run this cell whether or not you completed Exercise 4** to make this filtered POI data available, in a dictionary of dictionaries called `filtered_pois`.

In [None]:
filtered_pois = load_database('filtered_pois.pickle')
print(f"\nThere are {len(filtered_pois):,} POI records (down from {len(pois):,} originally).")
print(f"Top countries by visit: {top_ccs_by_visits}")
print("A sample of filtered POIs (by top countries and food/drink types):")
sample_dict(filtered_pois, 5)

# Part E: Connections #

The last part of the dataset is the collection of **follower connections**. Run the cell below to load these data:

In [None]:
connections = load_database('connections.pickle')
print(f"\nThere are {len(connections):,} connections (nonsymmetric).")
print(f"For instance, here are user 54's connections:")
[x for x in connections if x[0] == 54]

The connections are stored as a Python list of pairs of user IDs. The example above shows people that user 54 is connected to on some social media site.

**Important caveat:** The connections are stored nonsymmetrically. That is, while the pair `(54, 528)` appears, user `54` happens to not appear in the connections list of `528`:

In [None]:
[x for x in connections if x[0] == 528]

This choice was made by the people who generated the data to save space. However, for our analysis, we will assume that if a pair `(a, b)` exists, it implies that **both** `a` is connected to `b` **and** `b` is connected to `a`.
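
A minimal sketch of this symmetry assumption follows. Only the pair `(54, 528)` comes from the text above; the other pairs in `toy_connections` are made up for illustration.

```python
from collections import defaultdict

# Directed pairs; only (54, 528) is taken from the text, the rest are made up.
toy_connections = [(54, 528), (54, 7), (9, 7)]

# Build an undirected adjacency map: each pair (a, b) links a -> b and b -> a.
neighbors = defaultdict(set)
for a, b in toy_connections:
    neighbors[a].add(b)
    neighbors[b].add(a)

print(sorted(neighbors[528]))
print(sorted(neighbors[7]))
```

Storing each user's neighbors as a set makes the relationship symmetric and also removes any duplicate pairs.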

In [None]:
active_users = load_database('active_users.pickle')
print(f"\nThere are {len(active_users):,} active users.")
print("Here are a few of them:", *list(active_users)[:5], "...")

In [None]:
connection_vectors = load_database('connection_vectors.pickle')
print(f"\nThere are {len(connection_vectors):,} connection vectors.")
print("Here are a few of them:")
sample_dict(connection_vectors, 3)

# Part F: Similarity #

To measure the "friend potential" of two users, we need a way to measure whether they have anything in common. Let's measure similarity based on whether two users visit the same places.

**Visit vectors.** Consider a hypothetical check-in record for user `123`:
```python
{ ...,
    123: [{'poi': 'abc', 'date': 'Mon May 07...'},
          {'poi': 'def', 'date': 'Fri May 25...'},
          {'poi': 'abc', 'date': 'Thu May 17...'},
          {'poi': 'abc', 'date': 'Sun Apr 15...'},
          {'poi': 'xyz', 'date': 'Wed Apr 11...'}
         ],
}
```
This user visited `'abc'` three times and `'def'` and `'xyz'` once each.

Define the **visit vector** for this user to be the distinct POIs that they visited. In this instance, the visit vector of `123` would be the set:
```python
{'abc', 'def', 'xyz'}
```

Next, define the **similarity** of two users to be the number of POIs they have in common. For instance, for these two visit vectors,
```python
v = {'abc', 'def', 'xyz'}
w = {'abc', 'xyz', 'lmn'}
```
the similarity equals 2, since visit vectors `v` and `w` have `'abc'` and `'xyz'` in common.
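
Both definitions map directly onto Python sets. The sketch below reuses the hypothetical user `123` from the example above, with the dates omitted since they do not affect the result.

```python
# Visits for the hypothetical user 123 from the example above (dates omitted).
visits_123 = [{'poi': 'abc'}, {'poi': 'def'}, {'poi': 'abc'},
              {'poi': 'abc'}, {'poi': 'xyz'}]

# A set comprehension collapses repeated visits into distinct POIs.
v = {visit['poi'] for visit in visits_123}
w = {'abc', 'xyz', 'lmn'}

# Similarity is the size of the set intersection.
similarity = len(v & w)
print(similarity)
```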

In the next two exercises, you'll calculate similarities for the full dataset.

## Exercise 7 (**2** points): `form_visit_vectors` ##

Write some code to calculate all of the visit vectors for a given set of users, by completing the function,
```python
def form_visit_vectors(users, checkins, pois):
    ...
```
according to the following specifications.

**Inputs:**
- `users`: User IDs to process, given as a Python set
- `checkins`: Check-in records, stored as a dictionary of lists of visit dictionaries (see Part C, Exercise 3)
- `pois`: POIs, stored as a dictionary of dictionaries

**Task:** For each user in `users`, determine which of their visits in `checkins` has a known POI (meaning it appears in `pois`). The visit vector of that user will be those POIs.

**Outputs:** Return a Python dictionary of Python sets. The keys are user IDs, and the values are the visit vectors.

**Notes:**
1. The output should only contain visit vectors for users who are in `users` and `checkins`.
2. Furthermore, the output should only include visits for POIs that are in `pois`.
3. Empty inputs and outputs are possible.

In [None]:
### Define demo inputs
demo_users_ex7 = {1290304, 880634, 1972270}

demo_checkins_ex7 = \
{491496: [{'date': 'Fri Aug 10 10:01:04 +0000 2012',
           'poi': '4f0549d249013460824fbf09'},
          {'date': 'Mon Jul 30 13:28:34 +0000 2012',
           'poi': '4dd259ade4cd7f7178c663d3'},
          {'date': 'Sun Jul 29 01:19:14 +0000 2012',
           'poi': '4e17035018388d0d26847618'},
          {'date': 'Mon Aug 13 05:42:51 +0000 2012',
           'poi': '4d26846a342d6dcbf78ce4ca'}],
 880634: [{'date': 'Sun Jul 29 14:23:05 +0000 2012',
           'poi': '4ef48cec29c24e3536a28b45'}],
 1290304: [{'date': 'Wed Jul 11 07:21:59 +0000 2012',
            'poi': '4f64e3f0e4b03a7ce161c360'},
           {'date': 'Thu Jul 26 22:23:05 +0000 2012',
            'poi': '5009a022e4b058692f5cb2c9'},
           {'date': 'Mon Jul 09 06:40:29 +0000 2012',
            'poi': '4f64e3f0e4b03a7ce161c360'}],
 1972270: [{'date': 'Tue May 15 12:15:40 +0000 2012',
            'poi': '4be81fd588ed2d7f3d74cb1d'},
           {'date': 'Sat Apr 28 08:35:22 +0000 2012',
            'poi': '4c270706b012b713a7fd0893'},
           {'date': 'Sat May 19 06:38:19 +0000 2012',
            'poi': '4e6b107baeb7c31e43571420'}]}

demo_pois_ex7 = \
{'4be81fd588ed2d7f3d74cb1d': {'country_code': 'MY',
                              'lat': 1.524932,
                              'long': 110.336895,
                              'type': 'Fast Food Restaurant'},
 '4c270706b012b713a7fd0893': {'country_code': 'MY',
                              'lat': 1.45373,
                              'long': 110.458474,
                              'type': 'Malaysian Restaurant'},
 '4e6b107baeb7c31e43571420': {'country_code': 'MY',
                              'lat': 1.456737,
                              'long': 110.441649,
                              'type': 'Malaysian Restaurant'},
 '4ef48cec29c24e3536a28b45': {'country_code': 'MY',
                              'lat': 2.213388,
                              'long': 102.246559,
                              'type': 'Coffee Shop'}}

<!-- Expected demo output text block -->
Although the demo input above indicates _four_ possible users—three in `demo_users_ex7` and a fourth in `demo_checkins_ex7`—a correct solution applied to the demo inputs above should produce only _two_ visit vectors,
```
{ 880634: {'4ef48cec29c24e3536a28b45'},
  1972270: { '4be81fd588ed2d7f3d74cb1d',
             '4c270706b012b713a7fd0893',
             '4e6b107baeb7c31e43571420'}}
```
<!-- Include any shout outs here -->
for users `880634` and `1972270`. User `491496` does not have a visit vector because they are not in `demo_users_ex7`. User `1290304` does not appear because none of their check-ins appears in `demo_pois_ex7`.
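
The reasoning above can be checked mechanically. The sketch below labels every user ID seen in either input with the reason it is included or excluded. The helper name `explain_users` is hypothetical, and the data is a trimmed copy of the demo inputs (one check-in per user, dates and POI details dropped), so treat it as an aid for debugging rather than part of the solution.

```python
def explain_users(users, checkins, pois):
    """Map each user ID seen in either input to a short status string (hypothetical helper)."""
    status = {}
    for user_id in set(users) | set(checkins):
        if user_id not in users:
            status[user_id] = 'excluded: not in users'
        elif user_id not in checkins:
            status[user_id] = 'excluded: no check-ins'
        elif not any(visit['poi'] in pois for visit in checkins[user_id]):
            status[user_id] = 'excluded: no known POIs'
        else:
            status[user_id] = 'included'
    return status

# Trimmed copy of the demo inputs (assumption: one check-in per user is enough here)
users = {1290304, 880634, 1972270}
checkins = {
    491496: [{'poi': '4f0549d249013460824fbf09'}],
    880634: [{'poi': '4ef48cec29c24e3536a28b45'}],
    1290304: [{'poi': '4f64e3f0e4b03a7ce161c360'}],
    1972270: [{'poi': '4be81fd588ed2d7f3d74cb1d'}],
}
pois = {'4ef48cec29c24e3536a28b45': {}, '4be81fd588ed2d7f3d74cb1d': {}}

for user_id, reason in sorted(explain_users(users, checkins, pois).items()):
    print(user_id, reason)
```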

In [None]:
### Exercise 7 solution
def form_visit_vectors(users, checkins, pois):
    ###
    ### YOUR CODE HERE

    # GOAL:
    # For each user in users, determine which of their visits in checkins has a known POI (meaning it appears in pois). The visit vector of that user will be those POIs.
    # Return a Python dictionary of Python sets. The keys are user IDs, and the values are the visit vectors.

    # INPUT:
    # 3 inputs:
    # users: User IDs to process, given as a Python set
    # checkins: Check-in records, stored as a dictionary of lists of visit dictionaries (see Part C, Exercise 3)
    # pois: POIs, stored as a dictionary of dictionaries

    # STRATEGY:
    # 1. Create a default dictionary to hold our user IDs and the set of their visit vectors. Let's call this 'visit_vectors'.
    # 2. Iterate over user ids in 'users'
    # 3. If user id is in 'checkins', grab value for that user id in 'checkins'. This will be a list. Let's call this 'checkins_list'
    # 4. Iterate over each nested dictionary in this list, grabbing the poi value
    # 5. If this poi is in 'pois', add it to the set stored in 'visit_vectors' for that user id
    # 6. Return 'visit_vectors'

    # SOLUTION:
    from collections import defaultdict
    visit_vectors = defaultdict(set)

    for user_id in users:
        if user_id in checkins:
            checkins_list = checkins[user_id]
            for nested_dict in checkins_list:
                poi = nested_dict['poi']
                if poi in pois:
                    visit_vectors[user_id].add(poi)

    # Convert to a plain dict so the result is (and prints as) a regular dictionary
    return dict(visit_vectors)



### demo function call
pprint(form_visit_vectors(demo_users_ex7, demo_checkins_ex7, demo_pois_ex7), indent=2)

<!-- Test Cell Boilerplate -->
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of the input variables made before running your solution. These _should_ be the same as `input_vars`; if they differ, your solution modified its inputs.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
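
When a test case fails, it often helps to see exactly where the returned dictionary and the expected dictionary disagree. The helper below is a minimal sketch for that purpose; the name `diff_outputs` and the toy data are assumptions, and the exact layout of `returned_output_vars` and `true_output_vars` is not documented here, so inspect them (for example with `print`) before indexing into them.

```python
def diff_outputs(returned, expected):
    """Summarize how two dictionaries differ (hypothetical debugging helper)."""
    shared = returned.keys() & expected.keys()
    return {
        # Keys the solution should have produced but did not
        'missing_keys': expected.keys() - returned.keys(),
        # Keys the solution produced but should not have
        'extra_keys': returned.keys() - expected.keys(),
        # Shared keys whose values disagree, as (returned, expected) pairs
        'wrong_values': {k: (returned[k], expected[k])
                         for k in shared if returned[k] != expected[k]},
    }

# Toy data, made up for illustration only
returned = {880634: {'p1'}, 1290304: set()}
expected = {880634: {'p1'}, 1972270: {'p2'}}
print(diff_outputs(returned, expected))

# After a failing test, a call might look like this; the key name 'output_0'
# is an assumption and should be confirmed by printing the dictionaries first:
# diff_outputs(returned_output_vars['output_0'], true_output_vars['output_0'])
```

In the toy example, user `1290304` shows up as an extra key with an empty set, which is the kind of mismatch that appears when a solution creates an entry for a user before checking whether any of their POIs are known.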

In [None]:
### test_cell_ex7

from tester_fw.testers import Tester

conf = {
    'case_file':'tc_7',
    'func': form_visit_vectors, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'users':{
            'dtype':'set', # data type of param.
            'check_modified':True,
        },
        'checkins':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        },
        'pois':{
            'dtype':'dict', # data type of param.
            'check_modified':True,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'dict',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'1ZePXzAcTR7lcNpmx1HRK0lT3v-Ikrg8mZ3n-wVFTBo=', path='')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

print('Passed! Please submit.')