# Debugging Bad Solutions: Module 1, Part 1 (Solution Version)

From Midterm 2, Spring 2023 - One- and Two-Point Exercises

(This version of this notebook contains the solutions to the exercises. See the "exercises" version for the original exercises.)

[![Click here to view the Solution Version](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gt-cse-6040/bootcamp/blob/main/Module%201/Session%208/m1s8nb1_Data_Debugging_part1%20-%20Solutions.ipynb)

## Introduction

### Purpose

On the exams, you may initially write solutions that do not pass the test cases. That's okay! You will need to debug your code to determine what is causing the issue(s) and then figure out how to fix them. So how can we get better at debugging? We practice!

Below are two 1-point exercises and two 2-point exercises from the Spring 2023 Midterm 2. We have pre-written solutions for each exercise that are "bad" in one or more ways. Our solutions may contain one or more logic and/or syntax errors. Can you find and fix the issues in each exercise and pass all of the test cases?

Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed.

**Exercise point breakdown:**

- Exercise 1: **1** point
- Exercise 5: **2** points
- Exercise 6: **2** points
- Exercise 7: **1** point

### Task Background: Better Reads

[Goodreads](https://www.goodreads.com/) is a website devoted to curating user-generated book reviews. You'll do some elementary data-mining to uncover "communities" of users who like the same books. Such insights might help users find like-minded communities and generate better book recommendations.

### Environment Configuration

#### Required Files

Run the following cells to download the required files.

In [None]:
# import files
!mkdir resources
%cd resources

!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/demo_ex1.db
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/demo_ex6.obj
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/demo_ex7.db
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/ex1-is_read.df
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/ex1-is_reviewed.df
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/goodreads.db
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/tc_1
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/tc_4
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/tc_6
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/tc_7
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/ex1-rating.df
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/ex1-user_id.df
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/ex6-comms.df
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/ex7-means.df
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/ex7-sampler-input.df
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/tc_5
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/demo_ex5.df
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/ex5.df
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/resources/requirements.txt

%cd ..
!mkdir tester_fw
%cd tester_fw

!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/tester_fw/__init__.py
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/tester_fw/test_utils.py
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/tester_fw/testers.py
!wget -nc https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/Module%201/Session%208/tester_fw/db_utils.py

%cd ..

!python -m pip install -r resources/requirements.txt

#### Required Modules and Functions

Run the following cells to configure your environment.

In [2]:
# Standard Python modules
import sys
import numpy as np
import pandas as pd
import sqlite3 as db
import math
import dill as pickle
from pprint import pprint, pformat
from tester_fw.db_utils import *

In case it's helpful, here are the versions of Python and standard modules you are using:

In [3]:
print("* Python version: {}".format(sys.version.replace('\n', ' ')))
print(f"* Numpy version: {np.__version__}")
print(f"* pandas version: {pd.__version__}")
print(f"* sqlite3 version: {db.version}")

* Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
* Numpy version: 1.26.2
* pandas version: 1.5.3
* sqlite3 version: 2.6.0


#### Connecting to the Database

Some of the Goodreads data is stored in a SQLite3 database. The code cell below opens a read-only connection to it named **`grdbconn`**.

For now, don't worry about what's there. We will explain any tables you need in the exercises that use them.

In [4]:
# Goodreads database connection:
# grdbconn = db.connect('file:resource/asnlib/publicdata/goodreads.db?mode=ro', uri=True)
grdbconn = db.connect('file:resources/goodreads.db?mode=ro', uri=True)
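
As a rough illustration of what the `mode=ro` part of the connection URI does, the sketch below builds a throwaway database, reopens it read-only, and shows that a write is rejected. The file name `scratch.db` and the table `Demo` are hypothetical and are not part of the course resources.

```python
import sqlite3
import tempfile
from pathlib import Path

# Build a throwaway database file (hypothetical name and schema).
db_path = Path(tempfile.mkdtemp()) / "scratch.db"
rw_conn = sqlite3.connect(db_path)
rw_conn.execute("CREATE TABLE Demo (x INTEGER)")
rw_conn.execute("INSERT INTO Demo VALUES (1)")
rw_conn.commit()
rw_conn.close()

# Reopen the same file read-only, using a URI like the one for `grdbconn`.
ro_conn = sqlite3.connect(f"{db_path.as_uri()}?mode=ro", uri=True)
rows = ro_conn.execute("SELECT x FROM Demo").fetchall()  # reads still work

write_rejected = False
try:
    ro_conn.execute("INSERT INTO Demo VALUES (2)")  # writes should fail
except sqlite3.OperationalError as err:
    write_rejected = True
    print("Write rejected:", err)
finally:
    ro_conn.close()
```

A read-only connection is a useful safety net while debugging: a buggy query cannot accidentally change the data that the test cases depend on.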

## **Ex. 1 (1 pt)**: `count_interactions_by`

### Background: Analyzing user-book interactions

The Goodreads dataset includes **user-book interactions.** A "user-book interaction" means the user "did something" with the book on the Goodreads website:

- _Viewed_: The user looked at a book description and saved it to their personal library.
- _Read_: The user marked the book as "read."
- _Rated_: The user gave the book a rating, from 1 to 5 stars.
- _Reviewed_: The user wrote a public review of the book on the website.

These interactions are recorded in a SQL table called `Interactions`. Let's have a quick look for one of the users whose integer ID is `874199`:

In [5]:
pd.read_sql(r"SELECT * FROM Interactions WHERE user_id=874199", grdbconn)

Unnamed: 0,user_id,book_id,is_read,rating,is_reviewed
0,874199,591,1,5,1
1,874199,4753,1,0,0
2,874199,4752,1,4,1
3,874199,4751,0,0,0
4,874199,1007,1,3,0
5,874199,8600,1,0,0
6,874199,1248,1,0,0
7,874199,7785,1,0,0
8,874199,7784,1,0,0


Each row shows how this user interacted with some book. Here are some insights:
- This user interacted with nine books.
- They reviewed two of these books.
- They rated three of these books.
- They read all but one of the books they saved.
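
These insights can be checked with a few lines of pandas. The sketch below retypes the nine rows shown above into a small `DataFrame` (the name `user_df` is ours) and assumes that a `rating` of `0` means "not rated", which is consistent with the 1-to-5 star scale.

```python
import pandas as pd

# Rows copied from the query output above for user 874199.
user_df = pd.DataFrame({
    "user_id": [874199] * 9,
    "book_id": [591, 4753, 4752, 4751, 1007, 8600, 1248, 7785, 7784],
    "is_read": [1, 1, 1, 0, 1, 1, 1, 1, 1],
    "rating": [5, 0, 4, 0, 3, 0, 0, 0, 0],
    "is_reviewed": [1, 0, 1, 0, 0, 0, 0, 0, 0],
})

n_books = len(user_df)                         # one row per book
n_reviewed = int(user_df["is_reviewed"].sum())
n_rated = int((user_df["rating"] > 0).sum())   # assumption: 0 means "no rating"
n_read = int(user_df["is_read"].sum())

print(n_books, n_reviewed, n_rated, n_read)    # 9 2 3 8
```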

### Problem Definition

Suppose we want to group the interactions and count the number by group. For example, we might want to know, for each unique user ID, how many interactions there are. Complete the function
```python
def count_interactions_by(col, conn):
    ...
```
so that it does the following.

**Inputs:**
- `col`: The name of a column
- `conn`: A database connection containing a table named `Interactions`

**Your task:** For each unique value in column `col` of the `Interactions` table, count how many interactions (rows) there are.

**Output:** Return a dataframe with two columns:
- `col`: A column with the **same name** as the given input column holding the unique values
- `'count'`: A column with the number of interactions for each unique value

Refer to the demo cell below for an example of this output.

**Additional notes and hints:** You may assume that `col` holds a valid column name. The exact order of rows and columns in your output does not matter.

**Example:**

In [6]:
### Define demo inputs ###

demo_col_ex1 = 'user_id'
demo_conn_ex1 = db.connect('file:resources/demo_ex1.db?mode=ro', uri=True)
display(pd.read_sql("SELECT * FROM Interactions", demo_conn_ex1))

Unnamed: 0,user_id,book_id,is_read,rating,is_reviewed
0,569241,208373,0,0,0
1,569241,47199,1,5,1
2,607817,40293,0,0,0
3,569241,47383,1,5,1
4,607817,7984,0,0,0
5,607817,792,0,0,0
6,604656,2345195,1,5,1
7,607817,128860,1,0,0


Calling `count_interactions_by(demo_col_ex1, demo_conn_ex1)` should produce the following output:

|   user_id |   count |
|----------:|--------:|
|    569241 |       3 |
|    604656 |       1 |
|    607817 |       4 |

However, calling `count_interactions_by('is_read', demo_conn_ex1)` would return a two-row `DataFrame` where the count of `0` and `1` values is `4` each.
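
When debugging a SQL-based solution, it can help to cross-check the expected counts with pandas alone. The sketch below is one possible cross-check, not the reference solution: it retypes the demo rows shown above, and `count_by_pandas` is a hypothetical helper name.

```python
import pandas as pd

# Demo rows copied from the `Interactions` table displayed above.
demo_rows = pd.DataFrame({
    "user_id": [569241, 569241, 607817, 569241, 607817, 607817, 604656, 607817],
    "book_id": [208373, 47199, 40293, 47383, 7984, 792, 2345195, 128860],
    "is_read": [0, 1, 0, 1, 0, 0, 1, 1],
    "rating": [0, 5, 0, 5, 0, 0, 5, 0],
    "is_reviewed": [0, 1, 0, 1, 0, 0, 1, 0],
})

def count_by_pandas(df, col):
    """Hypothetical helper: count rows per unique value of `col`."""
    return df.groupby(col).size().reset_index(name="count")

by_user = count_by_pandas(demo_rows, "user_id")
by_read = count_by_pandas(demo_rows, "is_read")
print(by_user)
print(by_read)
```

If the SQL result and the pandas result disagree, the difference usually points directly at the bug, for example a missing `GROUP BY` or a misnamed output column.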

In [7]:
### Exercise 1 solution
def count_interactions_by(col, conn):
    ### BEGIN SOLUTION
    query = f"SELECT {col}, COUNT(*) AS count FROM Interactions GROUP BY {col}"
    return pd.read_sql(query, conn)
    ### END SOLUTION

### demo function calls ###
display(count_interactions_by(demo_col_ex1, demo_conn_ex1))
display(count_interactions_by('is_read', demo_conn_ex1))

Unnamed: 0,user_id,count
0,569241,3
1,604656,1
2,607817,4


Unnamed: 0,is_read,count
0,0,4
1,1,4


### Testing

<!-- Test Cell Boilerplate -->
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
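
When a test fails, one way to use these variables is to compare the returned and expected frames row by row. The sketch below is a hypothetical helper (`diff_frames` is not part of `tester_fw`) that lists the rows appearing in only one of two `DataFrame`s, ignoring row and column order. The two small frames are made-up stand-ins for a returned and an expected output.

```python
import pandas as pd

def diff_frames(returned, expected):
    """Hypothetical helper: list rows present in only one of two DataFrames."""
    cols = sorted(expected.columns)
    # An outer merge on all columns tags each distinct row with its origin.
    merged = returned[cols].merge(expected[cols], how="outer", on=cols, indicator=True)
    labels = {"left_only": "returned_only", "right_only": "expected_only"}
    mismatched = merged[merged["_merge"] != "both"].copy()
    mismatched["_merge"] = mismatched["_merge"].astype(str).map(labels)
    return mismatched.reset_index(drop=True)

# Made-up example: the returned frame has one wrong count.
returned_demo = pd.DataFrame({"user_id": [569241, 604656, 607817], "count": [3, 1, 3]})
expected_demo = pd.DataFrame({"user_id": [569241, 604656, 607817], "count": [3, 1, 4]})

mismatches = diff_frames(returned_demo, expected_demo)
print(mismatches)
```

After a failed test, a call such as `diff_frames(returned_output_vars['output_0'], true_output_vars['output_0'])` may work, assuming the dictionaries are keyed like the `outputs` entry of the test configuration. Print the dictionaries first to confirm their actual keys.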

In [8]:
### test_cell_ex1

from tester_fw.testers import Tester

conf = {
    'case_file':'tc_1',
    'func': count_interactions_by, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'col': {
            'dtype': 'str', # data type of param.
            'check_modified': False,
        },
        'conn': {
            'dtype': 'db',
            'check_modified': False
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
#            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resources/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

print('Passed! Please submit.')

Passed! Please submit.


**RUN ME:** A correct implementation of `count_interactions_by`, when run on the full Goodreads dataset for the columns `is_read`, `rating`, and `is_reviewed`, would produce the following:

In [9]:
print(f"\n=== `count_interactions_by` on the full dataset ===\n")
for col_ in ['is_read', 'rating', 'is_reviewed']:
    display(load_df_from_file(f"ex1-{col_}.df"))


=== `count_interactions_by` on the full dataset ===



Unnamed: 0,is_read,count
0,0,162117
1,1,208701


Unnamed: 0,rating,count
0,0,176575
1,1,3821
2,2,10545
3,3,40965
4,4,67534
5,5,71378


Unnamed: 0,is_reviewed,count
0,0,347098
1,1,23720


> **Aside (skip if pressed for time)**: From these results, you might observe a _hint_ at a phenomenon known as a [_monotonic behavior chain_](https://dl.acm.org/doi/10.1145/3240323.3240369): the total number of interactions > the number who read > the number who rate > the number who review. Such phenomena have been used to improve automatic generation of item recommendations.
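
The chain in the aside can be verified with simple arithmetic on the counts printed above. This sketch assumes that a `rating` of `0` means the user did not rate the book.

```python
# Counts copied from the full-dataset tables above.
is_read_counts = {0: 162117, 1: 208701}
rating_counts = {0: 176575, 1: 3821, 2: 10545, 3: 40965, 4: 67534, 5: 71378}
is_reviewed_counts = {0: 347098, 1: 23720}

total = sum(is_read_counts.values())
n_read = is_read_counts[1]
# Assumption: rating 0 means "not rated", so only 1-5 stars count as rated.
n_rated = sum(n for stars, n in rating_counts.items() if stars > 0)
n_reviewed = is_reviewed_counts[1]

chain = [total, n_read, n_rated, n_reviewed]
is_monotonic = all(a > b for a, b in zip(chain, chain[1:]))
print(chain, is_monotonic)  # [370818, 208701, 194243, 23720] True
```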

## **Ex. 5 (2 pts)**: `connect_users`

### Problem Definition



Given the analysis sample from Exercise 4, let's "connect" users.

Let's say that two users `a` and `b` are **connected** if they both gave ratings of 4 or higher to the same book. The number of unique books they both rated this way is a measure of how strong their connection is.

Complete the following function to help identify these connections.
```python
def connect_users(ubdf, threshold):
    ...
```

**Inputs:**
- `ubdf`: A "user-book" dataframe having these two columns: `user_id` and `book_id`. Each row indicates that a given user gave a given book a rating of 4 or higher.
- `threshold`: An integer threshold on connection strength.

**Your tasks:** Determine which pairs of users are connected. Count how many books connect them. Drop self-pairs (`user_id_x == user_id_y`), as well as any pairs with fewer than `threshold` connections.

**Outputs:** Return a **new** `DataFrame` with three columns:
1. `user_id_x`: A user ID
2. `user_id_y`: Another user ID
3. `count`: The number of books they both rated in common. Recall that this value should be `>= threshold`.

**Additional notes and hints.**
1. Omit self-pairs, that is, cases where `user_id_x` == `user_id_y`.
1. Return pairs **symmetrically**. That is, if the pair of users (`a`, `b`) have a count `k` at or above the threshold, then **both** (`a`, `b`, `k`) and (`b`, `a`, `k`) should be rows in the output table.
1. If no connections meet the threshold, you should return an empty `DataFrame` _with_ the specified columns.
1. You may assume there are no duplicate rows.

> _Aside:_ For really huge datasets (not what is included in this exam), dropping users with fewer than `threshold` ratings _before_ looking for pairs might be a bit faster.
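
The pre-filtering idea in the aside could look roughly like the sketch below. It is an illustrative variant, not the reference solution: `connect_users_prefiltered` and the toy input are hypothetical, and the shortcut relies on the stated assumption that the input has no duplicate rows.

```python
import pandas as pd

def connect_users_prefiltered(ubdf, threshold):
    """Hypothetical variant: drop low-activity users before pairing."""
    # A user with fewer than `threshold` rows cannot share `threshold` books.
    rows_per_user = ubdf["user_id"].map(ubdf["user_id"].value_counts())
    kept = ubdf[rows_per_user >= threshold]
    # Self-merge on book_id pairs up users who rated the same book.
    pairs = kept.merge(kept, on="book_id")
    pairs = pairs[pairs["user_id_x"] != pairs["user_id_y"]]
    out = pairs.groupby(["user_id_x", "user_id_y"]).size().reset_index(name="count")
    return out[out["count"] >= threshold].reset_index(drop=True)

# Toy input (made up): users 10 and 20 share books 1 and 2; user 30 has one row.
toy_ubdf = pd.DataFrame({"user_id": [10, 10, 20, 20, 30],
                         "book_id": [1, 2, 1, 2, 1]})
connected = connect_users_prefiltered(toy_ubdf, threshold=2)
print(connected)
```

The filter shrinks the self-merge, which is the expensive step, because every user removed up front no longer contributes candidate pairs.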

**Example:** Suppose the inputs are the `DataFrame` shown below with a target connection threshold of `2`:

In [10]:
### Define demo inputs ###

demo_ubdf_ex5 = load_df_from_file("demo_ex5.df").sort_values(['book_id', 'user_id']).reset_index(drop=True)
demo_threshold_ex5 = 2

display(demo_ubdf_ex5)

Unnamed: 0,user_id,book_id
0,1,2
1,0,7
2,2,7
3,0,19
4,2,19
5,2,22
6,1,38
7,0,41
8,3,41
9,0,85


For this input, `connect_users` should produce:

|   user_id_x |   user_id_y |   count |
|------------:|------------:|--------:|
|           0 |           2 |       2 |
|           0 |           3 |       2 |
|           2 |           0 |       2 |
|           3 |           0 |       2 |

Users `0` and `2` both rated books `7` and `19`, so they meet the threshold of 2 books rated in common. User `1` did not rate any books in common with any other user, so they do not appear in any pair of the output.

In [11]:
### Exercise 5 solution
def connect_users(ubdf, threshold):
    ### BEGIN SOLUTION
    uudf = ubdf.merge(ubdf, on='book_id') \
               .groupby(['user_id_x', 'user_id_y']) \
               .size() \
               .reset_index() \
               .rename(columns={0: 'count'})
    uudf = uudf[uudf['user_id_x'] != uudf['user_id_y']]
    uudf = uudf[uudf['count'] >= threshold]
    uudf = uudf.reset_index(drop=True)
    return uudf
    ### END SOLUTION

### demo function call ###
connect_users(demo_ubdf_ex5, demo_threshold_ex5)

Unnamed: 0,user_id_x,user_id_y,count
0,0,2,2
1,0,3,2
2,2,0,2
3,3,0,2


### Testing

<!-- Test Cell Boilerplate -->
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

In [12]:
### test_cell_ex5
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_5',
    'func': connect_users, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'ubdf': {
            'dtype': 'df', # data type of param.
            'check_modified': True
        },
        'threshold': {
            'dtype': 'int',
            'check_modified': False
        }
    },
    'outputs':{
        'output_0':{
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resources/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

print('Passed! Please submit.')

Passed! Please submit.


**RUN ME:** From a correct implementation of `connect_users`, one way we can "draw" the connectivity is to form a sparse matrix where nonzeros represent connections. Here is a picture of this matrix for the full dataset, using a threshold of 2:

In [13]:
uudf = load_df_from_file('ex5.df') # user-user table

print("A sample of connections:")
display(uudf.head())

if False: # Disabled due to NetworkX version incompatibility issue (fix pending)
    uudf_G = cse6040.utils.to_nx(uudf.to_records(index=False))
    ax_ex5 = cse6040.utils.graph_spy(uudf_G, markersize=0.01)
    ax_ex5.set_title('Spy plot: user-user interactions')
    ax_ex5.set_xlabel('user id')
    ax_ex5.set_ylabel('user id', rotation=0, horizontalalignment='right');
else:
#     cse6040.utils.display_image_from_file('demo-user-user-spy.png')
    pass

A sample of connections:


Unnamed: 0,user_id_x,user_id_y,count
0,175,1251,5
1,175,1369,4
2,175,1764,2
3,175,3164,5
4,175,3303,6
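
To make the matrix view of these connections concrete, here is a minimal sketch that turns the five sample rows above into a small adjacency matrix. It uses a dense NumPy array for readability; that choice is ours, and the full dataset would call for a sparse format instead.

```python
import numpy as np
import pandas as pd

# First five rows of the user-user table, copied from the output above.
sample = pd.DataFrame({
    "user_id_x": [175, 175, 175, 175, 175],
    "user_id_y": [1251, 1369, 1764, 3164, 3303],
    "count": [5, 4, 2, 5, 6],
})

# Map the user IDs to compact matrix indices 0..n-1 (sorted by ID).
ids = np.unique(sample[["user_id_x", "user_id_y"]].to_numpy())
rows = np.searchsorted(ids, sample["user_id_x"].to_numpy())
cols = np.searchsorted(ids, sample["user_id_y"].to_numpy())

# Entry (i, j) holds the connection strength between users ids[i] and ids[j].
adjacency = np.zeros((len(ids), len(ids)), dtype=int)
adjacency[rows, cols] = sample["count"].to_numpy()

print(ids)
print(adjacency)
```

Only the rows starting at user `175` appear in this sample, so the matrix above is not symmetric; the full table also contains the mirrored pairs.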


## **Ex. 6 (2 pts)**: `assign_communities`

### Background: Identifying "top reads" by community

> Includes details for Exercise 6 (2 points) and Exercise 7 (1 point).

The NetworkX package contains several algorithms for **detecting communities**, that is, clusters of "strongly interconnected" vertices in a graph (recall Part C).

We ran one of these algorithms on a graph formed from the user-user interactions you calculated in Part D. The algorithm grouped users (graph vertices) into clusters.

It returned these clusters as a **list of sets**, where each set is a "community" of user IDs grouped together. Since users were connected for liking the same books, it's possible users in the same community have similar tastes.

Here is the communities object that NetworkX produced for us:

In [14]:
communities = load_obj_from_file('demo_ex6.obj')

It is a list of sets:

In [15]:
communities[0]

{175,
 1369,
 1773,
 3303,
 3681,
 4851,
 4969,
 5463,
 7803,
 8105,
 8425,
 8447,
 8827,
 9987,
 11078,
 11257,
 11993,
 13269,
 13941,
 15220,
 15815,
 16005,
 17090,
 17983,
 18301,
 18587,
 19831,
 20321,
 22534,
 22610,
 24348,
 25042,
 25492,
 25600,
 26698,
 27757,
 29583,
 31172,
 32125,
 33099,
 33235,
 33594,
 33882,
 36412,
 40778,
 41825,
 42064,
 43367,
 44242,
 45569,
 45638,
 45960,
 47457,
 47480,
 48074,
 48498,
 49204,
 49261,
 50852,
 51390,
 51523,
 53191,
 53343,
 54358,
 56128,
 56968,
 58741,
 59083,
 59534,
 60703,
 61279,
 65558,
 66150,
 66467,
 67122,
 68142,
 68150,
 69530,
 70278,
 70385,
 70440,
 70673,
 71008,
 73005,
 73279,
 73919,
 74229,
 74476,
 77096,
 78758,
 79061,
 79064,
 81907,
 82673,
 82692,
 82830,
 83287,
 83954,
 85225,
 86399,
 86708,
 87238,
 88799,
 88837,
 88889,
 90278,
 90430,
 92846,
 93079,
 93807,
 94481,
 95682,
 96694,
 98487,
 98780,
 99475,
 100830,
 100900,
 101701,
 101729,
 102077,
 102080,
 104161,
 104381,
 104918,
 10602

In [16]:
type(communities), type(communities[0])

(list, set)

Here is how many communities there are:

In [17]:
len(communities)

6

The sizes of the 6 communities are:

In [18]:
[len(c) for c in communities]

[868, 2, 36, 340, 6, 574]

Let's print the two smallest:

In [19]:
print("Community 1:", communities[1])
print("Community 4:", communities[4])

Community 1: {430689, 687415}
Community 4: {154369, 676898, 542723, 677611, 649588, 332535}


The values you see are user IDs.

### Problem Definition

To merge this data with our existing database, we need to convert the Python `communities` data structure into a `DataFrame`. Complete the function below to aid in this task:

```python
def assign_communities(communities):
    ...
```

**Inputs:** The input `communities` is a list of sets of integers, as in the previous example.

**Your task:** Convert this input into a dataframe.

**Returns:** Your function should return a `DataFrame` with these columns:
- `user_id`: A user ID (an integer).
- `comm_id`: The ID of the community it belongs to (also an integer).

The community ID is its index in `communities`. That is, community `0` is stored in `communities[0]`, community `1` is in `communities[1]`, and so on.
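
The index-to-ID rule can be illustrated with `enumerate`, which yields each set together with its position in the list. This is a small sketch on a made-up input (`toy_communities`), shown only to make the mapping concrete.

```python
import pandas as pd

toy_communities = [{1, 3}, {2}]  # hypothetical input: two communities

# enumerate() pairs each set with its list index, which is the community ID.
pairs = [(uid, cid)
         for cid, members in enumerate(toy_communities)
         for uid in members]

toy_df = pd.DataFrame(pairs, columns=["user_id", "comm_id"])
print(sorted(pairs))  # [(1, 0), (2, 1), (3, 0)]
```

Because sets are unordered, the row order within a community is not guaranteed, which is why the test cell does not check row order.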

**Example:** Consider this set of communities:

In [20]:
### Define demo inputs ###

demo_communities_ex6 = [{1, 3, 10, 17}, {2, 6, 13, 15}, {0, 5, 11, 16}, {9, 14}, {4, 7, 8, 12}]

A correct implementation of `assign_communities` would produce this result:

|   user_id |   comm_id |
|----------:|----------:|
|         1 |         0 |
|        10 |         0 |
|         3 |         0 |
|        17 |         0 |
|         2 |         1 |
|        13 |         1 |
|         6 |         1 |
|        15 |         1 |
|         0 |         2 |
|        16 |         2 |
|        11 |         2 |
|         5 |         2 |
|         9 |         3 |
|        14 |         3 |
|         8 |         4 |
|         4 |         4 |
|        12 |         4 |
|         7 |         4 |

In [21]:
### Exercise 6 solution
def assign_communities(communities):
    ### BEGIN SOLUTION
    from pandas import DataFrame
    all_uids = []
    all_cids = []
    for cid, uids in enumerate(communities):
        all_uids += list(uids)
        all_cids += [cid] * len(uids)
    return DataFrame({'user_id': all_uids, 'comm_id': all_cids})
    ### END SOLUTION

### demo function call ###
assign_communities(demo_communities_ex6)

Unnamed: 0,user_id,comm_id
0,1,0
1,10,0
2,3,0
3,17,0
4,2,1
5,13,1
6,6,1
7,15,1
8,0,2
9,16,2


### Testing

<!-- Test Cell Boilerplate -->
The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

In [22]:
### test_cell_ex6
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_6',
    'func': assign_communities, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'communities': {
            'dtype': 'list', # data type of param.
            'check_modified': True,
        }
    },
    'outputs':{
        'output_0':{
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resources/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

print('Passed! Please submit.')

Passed! Please submit.


## **Ex. 7 (1 pt)**: `means_by_community`

### Problem Definition



Suppose we wish to calculate means (averages) of the interaction data _by community._ Implement the function,

```python
def means_by_community(intdf, comdf):
    ...
```

to perform this task.

**Inputs:**
1. `intdf`: An interactions `DataFrame` with columns `user_id`, `book_id`, `is_read`, `rating`, and `is_reviewed`.
1. `comdf`: A communities `DataFrame` with columns `user_id` and `comm_id`.

**Your task:** Join these `DataFrames` and then return a new `DataFrame` with the mean values of the `is_read`, `rating`, and `is_reviewed` columns **by community.**

**Outputs:** Your function should return a new `DataFrame` with these columns:
1. `comm_id`: An integer community ID, one per row.
2. `is_read`, `rating`, `is_reviewed`: The mean value of each column for all rows of `intdf` for all users of the community. These should be stored as `float` values.

**Additional notes:** A user ID might appear in only one of the two inputs. Exclude such users from every mean calculation.
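
When debugging a join, it helps to see which user IDs fail to match. The sketch below uses two made-up frames (`toy_intdf`, `toy_comdf`) and the `indicator=True` option of `DataFrame.merge` to flag unmatched rows, then contrasts that with the default inner join, which silently drops them.

```python
import pandas as pd

# Made-up inputs: user 3 has no community, and user 4 has no interactions.
toy_intdf = pd.DataFrame({"user_id": [1, 2, 3],
                          "book_id": [10, 11, 12],
                          "is_read": [1, 1, 0],
                          "rating": [5, 4, 0],
                          "is_reviewed": [1, 0, 0]})
toy_comdf = pd.DataFrame({"user_id": [1, 2, 4], "comm_id": [0, 0, 1]})

# A left join with an indicator column shows which interactions lack a community.
checked = toy_intdf.merge(toy_comdf, on="user_id", how="left", indicator=True)
unmatched_users = checked.loc[checked["_merge"] == "left_only", "user_id"].tolist()

# The default merge is an inner join: only users present in both inputs survive.
inner = toy_intdf.merge(toy_comdf, on="user_id")
matched_users = sorted(inner["user_id"])

print(unmatched_users, matched_users)  # [3] [1, 2]
```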

**Example:** Consider the following two inputs:

In [23]:
### Define demo inputs ###

demo_intdf_ex7 = load_table_from_db("Interactions", "demo_ex7.db").sort_values(by='user_id')
demo_comdf_ex7 = load_table_from_db("Communities", "demo_ex7.db").sort_values(by='user_id')

display(demo_intdf_ex7)
display(demo_comdf_ex7)

Unnamed: 0,user_id,book_id,is_read,rating,is_reviewed
2,25031,108750,1,5,1
0,25650,118565,1,5,0
4,33786,10062,1,4,0
1,108339,7405,1,4,0
3,683149,91931,1,4,0


Unnamed: 0,user_id,comm_id
0,25650,5
2,33786,5
3,108339,0
1,683149,0


A correct implementation of `means_by_community` will return:

|   comm_id |   is_read |   rating |   is_reviewed |
|----------:|----------:|---------:|--------------:|
|         0 |         1 |      4   |             0 |
|         5 |         1 |      4.5 |             0 |

Observe that user `25031` does not belong to any community. Therefore, none of the final averages should be affected by that user's data.

In [24]:
### Exercise 7 solution
def means_by_community(intdf, comdf):
    ### BEGIN SOLUTION
    VALUES = ['is_read', 'rating', 'is_reviewed']
    df = intdf.merge(comdf, on='user_id')
    df = df.groupby('comm_id')[VALUES].mean().reset_index()
    for c in VALUES:
        df[c] = df[c].astype(float)  # mean() already yields floats; this cast is only a safeguard
    return df
    ### END SOLUTION

### demo function call ###
demo_result_ex7 = means_by_community(demo_intdf_ex7, demo_comdf_ex7)
display(demo_result_ex7)

Unnamed: 0,comm_id,is_read,rating,is_reviewed
0,0,1.0,4.0,0.0
1,5,1.0,4.5,0.0


### Testing

<!-- Test Cell Boilerplate -->
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.

In [25]:
### test_cell_ex7
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_7',
    'func': means_by_community, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        'intdf': {
            'dtype': 'df', # data type of param.
            'check_modified': True,
        },
        'comdf': {
            'dtype': 'df',
            'check_modified': True
        }
    },
    'outputs':{
        'output_0':{
            'index': 0,
            'dtype': 'df',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': False, # Ignored if dtype is not df
            'check_row_order': False, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'jpS7W-CpqAQfuITMEQZL-yVXfhIaCkSaei-emnyRtrI=', path='resources/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

print('Passed! Please submit.')

Passed! Please submit.


**RUN ME:** With a correct `means_by_community`, we can see whether the communities differ in how they read, rate, and review books. Here is what would happen if we ran on the full dataset:

In [26]:
ex7_means = load_df_from_file('ex7-means.df')
print(f"Recall: community sizes: {[(k, len(c)) for k, c in enumerate(communities)]}")
ex7_means

Recall: community sizes: [(0, 868), (1, 2), (2, 36), (3, 340), (4, 6), (5, 574)]


Unnamed: 0,comm_id,is_read,rating,is_reviewed
0,0,1.0,4.481228,0.105068
1,1,1.0,4.945455,0.427273
2,2,1.0,4.418115,0.267155
3,3,1.0,4.505879,0.128845
4,4,1.0,4.366142,0.110236
5,5,1.0,4.560539,0.12193
