In [1]:
%load_ext notexbook

In [2]:
%texify

<span class="badges">

[![myBinder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/leriomaggio/deep-learning-for-data-science/HEAD?filepath=0_Playground/1_playground.ipynb)
    
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/leriomaggio/deep-learning-for-data-science/blob/main/0_Playground/1_playground.ipynb)

[![nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/leriomaggio/deep-learning-for-data-science/blob/main/0_Playground/1_playground.ipynb)
</span>

<span class="fn"><i>[Note]: </i> This notebook has been designed using the [$\text{no}\TeX\text{book}$](https://github.com/leriomaggio/notexbook-jupyter-theme) Jupyter notebook theme. <br />
Please **Trust** the notebook to automatically enable the theme. If you are viewing this notebook in **Google Colab**, these are the [instructions](https://github.com/leriomaggio/notexbook-jupyter-theme/tree/texbook-colab) to enable the theme in Colab.</span>

# 🐍 Python for Data Science 🧪 : Test & Practice your skills 🧑‍💻


This notebook guides you through a series of exercises and simple challenges to assess your proficiency with Python and _some_ of its most common libraries for Data Science (i.e. `pandas` and `numpy`).
We will start by playing with fundamental Python data structures, and finish with basic OOP.

Exercises range from _easy_ to _intermediate_ in difficulty (i.e. nothing _advanced_ or above), and each one includes a section with _Hints_ and references to the documentation. 
If you get stuck on an exercise, **don't worry**: a _Solution_ is always provided. 

<br />

**_Caveat Emptor_**:

This notebook is not intended to provide full coverage of all the topics needed to assess your skills with the Python language. Similarly, the selected topics won't necessarily be assessed in full detail. The level of detail, as well as the very choice of these topics, has been primarily guided by the set of _desired_ skills you _should_ have before taking this course. 

**_Notes on Solutions_**:

Even when a solution to an exercise seems _straightforward_, _your_ solution may well differ from the one **proposed** in the exercise. Please _bear_ in mind that solutions are indeed _proposed_, so they are not to be taken as **the** absolute truth. On the contrary, different solutions are a wonderful opportunity to compare and discuss _pros_ and _cons_ (even if the difference is only in variable names). In some cases, however, one solution may be preferred over another because it is considered **more** [**Pythonic**](https://stackoverflow.com/questions/25011078/what-does-pythonic-mean) (read [here](https://towardsdatascience.com/how-to-be-pythonic-and-why-you-should-care-188d63a5037e) for a more thorough explanation, if you're interested and have the time). 
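
As a small illustration of what "more Pythonic" can mean, consider the sketch below. The list and the names are made up for this example and are not part of the exercises. Both snippets compute the same result, but the second relies on direct iteration and a list comprehension rather than manual indexing:

```python
scores = [3.5, 4.0, 2.25, 5.0]  # made-up values, for illustration only

# Index-based loop: correct, but more verbose than it needs to be
high_scores_loop = []
for i in range(len(scores)):
    if scores[i] >= 3.5:
        high_scores_loop.append(scores[i])

# More Pythonic: iterate over the items directly, in a list comprehension
high_scores = [score for score in scores if score >= 3.5]

print(high_scores)
```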


## 1. Fundamental Python Constructs 🧰

In this Section, some of the main constructs and paradigms of the Python language will be assessed. 

<ins>**Get Ready**:</ins> 

For this section, all you need is the [Official Python Documentation](https://docs.python.org/3.8/library/index.html) open! 
In other words, you won't need (and should not use) any external library to solve these exercises. All code should be pure Python. The goal here is to rely only on the modules included in the **Standard Library**, that is the _batteries included_ with the Python language.

### Basic Data Structures

In this Section we will start by generating the reference dataset to be used throughout the notebook. It is also an opportunity to practice a bit with basic data structures, `random` number generation, and I/O ops.

##### 📝 &nbsp; **Exercise 1.1** 

$\Rightarrow$ A) **Write** a Python **function** called `generate_user_ids` that generates a sequence of `300` **unique** and randomly assigned `UserID`s. Those **ids** will uniquely identify the _samples_ in our dataset. 

$\Rightarrow$ B) **Call** the function to create our iterable sequence of `uid` in our dataset. Store this sequence into an appropriate Python data structure.

--- 

**Note**: For simplicity, and without any loss of generality, let's assume that these `ids` will be 3-digit numbers in the range `[100,999]` (the actual value is not important). 

In [None]:
def generate_user_ids():
    # Your Code Here
    pass 

user_ids = generate_user_ids()

In [None]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# There are many ways to solve this exercise, as the only requirement
# is that the IDs are unique.

# Simplest: Look for the range function
# Smartest: Look for the random.sample function

In [None]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

from random import sample
from random import seed

# set random seed for repeatability and for future runs
seed(123456)

def generate_user_ids():
    # Note: 1000 is the end range, so that 999 can be included
    return sample(range(100, 1000), k=300)

user_ids = generate_user_ids()
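
The two requirements of the exercise (uniqueness and range) can be checked with a couple of assertions. The sketch below repeats the proposed function so that it runs on its own; the checks are an addition for illustration, not part of the proposed solution:

```python
from random import sample

def generate_user_ids():
    # sample draws without replacement, so the IDs are unique by construction
    return sample(range(100, 1000), k=300)

user_ids = generate_user_ids()

# A set drops duplicates: equal lengths mean that all IDs are unique
assert len(set(user_ids)) == len(user_ids) == 300
# Every ID is a 3-digit number in [100, 999]
assert all(100 <= uid <= 999 for uid in user_ids)
```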

##### 📝 &nbsp; **Exercise 1.2**

Each `User` (sample) has a list of `endorsments` (`float` in range `[0:5]`). These endorsements are assigned to a user by the other users in the database. The number of `endorsments` received by a user may vary from a minimum of `1` (we won't have users with no endorsements) to a maximum of `299` (no user can rate themselves, and each user **cannot** endorse any other user more than once).

Store this data into an appropriate Python structure, so that it will always be possible to retrieve the sequence of `endorsments` corresponding to a given `user_id`.

$\Rightarrow$ A) **Write** a function `generate_endorsments` that accepts the sequence of `user_ids` (generated in the previous exercise) and returns the structure matching each `uid` with the corresponding `endorsments`. 

$\Rightarrow$ B) **Print** the first `10` user_ids in the database, along with the corresponding number of endorsements received (those should presumably be different numbers).

--- 

**Note**: `round` each endorsement value to the _second_ decimal. No worries so far about binning the values.

In [None]:
from typing import Sequence

# A) 

def generate_endorsments(user_ids: Sequence[int]):
    # your code here 
    pass

database = generate_endorsments(user_ids=user_ids)

# B) 
# Print the first 10 entries, along with the corresponding # of endorsments


In [None]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# The first step is to (randomly) generate these endorsements, but first we need
# to determine how many we should generate for a given user_id.
# To do so, have a look at random.randrange

# Generate endorsements: have a look at random.uniform and the round function

# Data Structure: What about a Dictionary (for now, at least)?

# Print: Look at enumerate and f-strings

In [None]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

from random import randrange, uniform
from typing import Sequence
# A)
def generate_endorsments(user_ids: Sequence[int]):
    database = {}
    for uid in user_ids: 
        # the upper bound (300 here) is excluded: at most 299 endorsements
        n_endorsments = randrange(1, len(user_ids))
        endorsments = []  # equivalent to list()
        for _ in range(n_endorsments):
            endorsments.append(round(uniform(0, 5), ndigits=2))
        database[uid] = endorsments
    return database

database = generate_endorsments(user_ids=user_ids)

# B)
# Print the first 10 entries, along with the corresponding # of endorsments
for i, uid in enumerate(database):
    if i == 10:
        break
    print(f"{i+1:02}) {uid}: {len(database[uid])}")
    

01) 924: 276
02) 964: 66
03) 913: 190
04) 396: 195
05) 130: 192
06) 919: 56
07) 278: 23
08) 102: 93
09) 779: 231
10) 178: 226
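
The inner loop of the proposed solution can also be written as a list comprehension. The sketch below is one possible variant, shown on a small made-up list of IDs so that it runs on its own; `generate_endorsments_compact` and `toy_db` are hypothetical names used only here:

```python
from random import randrange, uniform
from typing import Dict, List, Sequence

def generate_endorsments_compact(user_ids: Sequence[int]) -> Dict[int, List[float]]:
    database = {}
    for uid in user_ids:
        # between 1 and len(user_ids) - 1 endorsements per user
        n_endorsments = randrange(1, len(user_ids))
        database[uid] = [round(uniform(0, 5), 2) for _ in range(n_endorsments)]
    return database

toy_db = generate_endorsments_compact([101, 202, 303, 404])  # made-up IDs
for uid, values in toy_db.items():
    print(f"{uid}: {len(values)} endorsements")
```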


##### 📝 &nbsp; **Exercise 1.3**

Save the database into a text file named `user_endorsments.txt`. Data should be stored into a row-major format (i.e. one row per user/sample).

Values in each row are comma-separated, with the following structure:

```
uid1, r11, r12, ...r1k
uid2, r21, ..., r2j
...
uidN, rN1, ..., rNz-1, rNz
```

$\Rightarrow$ **Write** a function `save_db()` that accepts the database structure (from the previous exercise) and a path to a folder (default: _current dir_, `./`), and writes the data to the file within the specified folder. 

---

**Note**: By design, each user will have a different number of ratings. Don't worry about that for now. It is fine for the rows in the file to have a variable number of columns.

In [None]:
import os  # We will use os module for path and folders ops.

def save_db(database, data_folder: str = "./"):
    os.makedirs(data_folder, exist_ok=True)
    datafile_path = os.path.join(data_folder, "user_endorsments.txt")
    # your code here

save_db(database)

In [None]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# I/O in Python is pretty simple: just open the file in "w" (write) mode.
# For a more Pythonic solution look for "Context Manager". 
# I/O is the classic example.

# Look for f-strings, and the join string method
# For more, look for the os and os.path modules in the docs

In [None]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

import os  # We will use os module for path and folders ops.

def save_db(database, data_folder: str = "./"):
    os.makedirs(data_folder, exist_ok=True)
    datafile_path = os.path.join(data_folder, "user_endorsments.txt")
    with open(datafile_path, "w") as f:
        for uid, endorsments in database.items():
            # `map` is for a bit of functional flavour
            # but a for loop here is OK too
            end_strs = ",".join(map(str, endorsments))
            f.write(f"{uid},{end_strs}\n")

save_db(database)
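
The Standard Library also ships a `csv` module that takes care of joining the values. The sketch below is a possible alternative, not the proposed solution: `save_db_csv` is a hypothetical name, returning the file path is an addition made only to simplify the demo, and the data is made up. A temporary folder is used so that no file is left behind:

```python
import csv
import os
import tempfile

def save_db_csv(database, data_folder="./"):
    # Hypothetical variant of `save_db` based on the csv module
    os.makedirs(data_folder, exist_ok=True)
    datafile_path = os.path.join(data_folder, "user_endorsments.txt")
    # newline="" is recommended by the csv docs when opening files for writing
    with open(datafile_path, "w", newline="") as f:
        # lineterminator="\n" matches the line endings of the proposed solution
        writer = csv.writer(f, lineterminator="\n")
        for uid, endorsments in database.items():
            writer.writerow([uid, *endorsments])
    return datafile_path

toy_db = {101: [4.5, 3.25], 202: [1.0]}  # made-up data
with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_db_csv(toy_db, data_folder=tmp_dir)
    with open(path, newline="") as f:
        content = f.read()
print(content)
```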

##### 📝 &nbsp; **Exercise 1.4**

Read back the data into `database` from the `user_endorsments.txt` file. Data types should be handled accordingly.

$\Rightarrow$ **Write** a function `read_user_endorsments` that takes the path to the data file, and returns the `database` structure as conceived in the previous exercises.

--- 

**Note**: The only purpose here is to practice with I/O. Overwriting the existing `database` structure is OK. However, please make sure to preserve the data **types** (esp. for `endorsments`, as they will be needed for further processing in the next sections).

In [None]:
def read_user_endorsments(datafile: str = "./user_endorsments.txt"):
    try:
        # your code here
        pass  # remove this
    except FileNotFoundError as e:
        print(str(e))
    else:
        return database

database = read_user_endorsments()

In [None]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# I/O in Python is pretty simple: just open the file in "r" (read) mode this time.
# For a more Pythonic solution look for "Context Manager". 
# I/O is the classic example.

# Data processing: look for the strip and split methods of strings
# I would also recommend having a look at Tuple Unpacking!

In [None]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

def read_user_endorsments(datafile: str = "./user_endorsments.txt"):
    try:
        database = {}
        # open the file passed as argument, rather than a hard-coded path
        with open(datafile, "r") as f:
            for line in f:
                line = line.strip()  # drop surrounding whitespace and the newline
                # For more look for Tuple Unpacking
                uid, *endorsments = line.split(",")
                # convert each endorsement to float and make a list
                endorsments = list(map(float, endorsments))
                database[uid] = endorsments  # uid as str is OK
    except FileNotFoundError as e:
        print(str(e))
    else:
        return database

database = read_user_endorsments()
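
The starred assignment used in the solution is worth a closer look. The sketch below applies it to a single made-up line in the file format of Exercise 1.3; the variable names are chosen only for this example:

```python
line = "924,4.31,0.57,2.9\n"  # made-up line, same format as the data file

# The first field goes to `uid`; the starred name collects the rest in a list
uid, *raw_values = line.strip().split(",")
values = [float(value) for value in raw_values]

print(uid, values)
# Fields read from a text file are strings: convert explicitly when needed
print(type(uid), type(int(uid)))
```

Note that `uid` stays a `str` unless it is converted, which is why the user IDs appear quoted in the outputs of the next exercises.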

### Data Processing

In this section we will start having fun with the data by implementing simple processing operations. First off, let's start to make the data more interesting.

##### 📝 &nbsp; **Exercise 1.5**

Let's re-work our database, by also including a list of `1K` (`1,000`) products, each identified by a unique `ProdID`. 
Each `ProdID` will be an `8`-digit, _zero-padded_ code (e.g. `1` ==> `00000001`, `893` ==> `00000893`).

Each product has received a `rating` score from the users.
Similar to endorsements, those ratings are `float` in `[0:10]` (rounding to the 3rd decimal is OK, this time). We will simulate those by randomly generating the scores.

Unlike endorsements, though, we want to record which user left each rating. Besides, there is no limit on the number of ratings a single user can leave on a single product (i.e. `m2m` relationship with _no constraints_). Nonetheless, each product receives `200` ratings in total.

$\Rightarrow$ **Write** a function `generate_products` which generates this database of products.

---

**Note**: While creating the product database, remember that for each product rating, we want to record the `uid` of the original author. 


In [81]:
def generate_products(user_ids):
    # your code here
    pass

product_db = generate_products(user_ids=user_ids)

In [82]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# Zero-padding: look at string format options 

# First, think about the data structure. Very similar to the user endorsements
# case, but this time we want to record the uid too. Tuple? Dict?
# Look for collections.defaultdict

# Selection: Look for random.choices

In [108]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

from random import uniform, choices, sample
from collections import defaultdict

def generate_products(user_ids):
    # Dictionary of lists again
    prod_db = defaultdict(list)
    pids = sample(range(0, 100000), k=1000)
    # generate zero-padded pid as str
    pids = [f"{pid:08}" for pid in pids]
    # Generate ratings
    for pid in pids:
        raters = choices(user_ids, k=200)
        for uid in raters:
            score = round(uniform(0, 10), ndigits=3)
            prod_db[pid].append((uid, score))
    return prod_db

product_db = generate_products(user_ids=user_ids)
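
Zero-padding can be obtained in more than one way. The short sketch below (with a made-up product number) shows three equivalent options available in the Standard Library:

```python
pid = 893  # made-up product number

# Three equivalent ways to zero-pad an integer to 8 characters
padded_format = "{:08}".format(pid)
padded_fstring = f"{pid:08}"
padded_zfill = str(pid).zfill(8)

print(padded_format, padded_fstring, padded_zfill)
```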

We will work on better Data Abstractions later; let's crack on with some data processing for now!

##### 📝 &nbsp; **Exercise 1.6**

Calculate the list of the top 10 influential users, that is the 10 users who received the highest total endorsement (`sum` of all their endorsments).

$\Rightarrow$ **Write** the `top_10_users` function that returns the list of the top 10 user IDs, along with their corresponding total endorsment values. 


In [87]:
def top_10_users(user_database):
    # your code here
    pass

top_10_users(database)

In [88]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# Look for the sorted function, and its parameters
# Remember, database is a dictionary of lists.

In [91]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

def top_10_users(user_database):
    sorted_users = sorted(user_database, 
                          key=lambda uid: sum(user_database[uid]),
                          reverse=True)
    top_10_uids = sorted_users[:10]
    ranking = list()
    for uid in top_10_uids:
        ranking.append((uid, sum(user_database[uid])))
    return ranking
                          

top_10_users(database)

[('199', 759.7200000000003),
 ('374', 757.7999999999994),
 ('758', 757.6299999999997),
 ('812', 747.6200000000001),
 ('501', 742.8200000000002),
 ('286', 738.1499999999999),
 ('341', 737.8399999999993),
 ('272', 736.4200000000002),
 ('369', 735.65),
 ('838', 732.4599999999997)]
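
When only the first few entries of a ranking are needed, `heapq.nlargest` from the Standard Library is a possible alternative to sorting the whole collection. The sketch below is one such variant; `top_n_users`, its `n` parameter, and the toy data are assumptions made for this example:

```python
from heapq import nlargest

def top_n_users(user_database, n=10):
    # Hypothetical variant: compute each total once, then keep the n largest
    totals = ((uid, sum(values)) for uid, values in user_database.items())
    return nlargest(n, totals, key=lambda item: item[1])

toy_db = {"a": [1.0, 2.0], "b": [5.0], "c": [0.5, 0.5, 0.5]}  # made-up data
print(top_n_users(toy_db, n=2))
```

Compared with the proposed solution, each total is computed only once, since the `(uid, total)` pairs are built before the ranking.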

##### 📝 &nbsp; **Exercise 1.7**

Similarly to the previous exercise, let's now generate the list of the top 10 most popular users, that is those who received the largest number of endorsements.

$\Rightarrow$ **Write** the `top_10_popular_users` function that returns the list of the top 10 user IDs, along with their corresponding number of endorsments. 


In [93]:
def top_10_popular_users(user_database):
    # your code here
    pass

top_10_popular_users(database)

In [94]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# Look for the sorted function, and its parameters - as in prev. exercise
# Remember, database is a dictionary of lists.

In [95]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

def top_10_popular_users(user_database):
    sorted_users = sorted(user_database, 
                          key=lambda uid: len(user_database[uid]),
                          reverse=True)
    top_10_uids = sorted_users[:10]
    ranking = list()
    for uid in top_10_uids:
        ranking.append((uid, len(user_database[uid])))
    return ranking
                          

top_10_popular_users(database)

[('286', 295),
 ('199', 294),
 ('269', 294),
 ('341', 294),
 ('674', 293),
 ('758', 293),
 ('455', 291),
 ('394', 289),
 ('374', 288),
 ('857', 288)]

Interestingly, the two rankings differ. This is to be expected, though: the definition of "influential" applied in Exercise 1.6 was flawed, as it didn't take into account the number of endorsements received. Let's compute a more accurate statistic on the users, then.

##### 📝 &nbsp; **Exercise 1.8**

Generate the list of the top 10 most influential users in the database, considering a normalised version of the score calculated in `1.6`, that is the sum of their scores divided by the number of endorsements received. 

$\Rightarrow$ **Write** the `top_10_users_normalised` function that returns the list of the top 10 user IDs, along with their corresponding influence scores. 


In [96]:
def top_10_users_normalised(user_db):
    # your code here
    pass

top_10_users_normalised(database)

In [97]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# Look for the sorted function, and its parameters - as in prev. exercise
# Remember, database is a dictionary of lists.

In [98]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

def top_10_users_normalised(user_db):
    # different implementation this time
    user_scores = map(lambda uid: (uid, sum(user_db[uid])/len(user_db[uid])), 
                      user_db)
    sorted_users = sorted(user_scores, 
                          key=lambda info: info[1],
                          reverse=True)
    return sorted_users[:10]
                          

top_10_users_normalised(database)

[('508', 3.27),
 ('950', 2.9979999999999998),
 ('941', 2.9466666666666668),
 ('917', 2.9156249999999995),
 ('859', 2.908333333333333),
 ('278', 2.8752173913043477),
 ('388', 2.86),
 ('430', 2.8026153846153843),
 ('246', 2.784634146341464),
 ('188', 2.7615384615384615)]
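
The normalised score is the arithmetic mean of the endorsements, so it can reward users who received very few (but high) endorsements. The toy example below, on made-up data, shows how the three criteria used so far (total, count, mean) can each pick a different kind of user:

```python
from statistics import mean

toy_db = {
    "u1": [2.0] * 100,  # many endorsements, modest values
    "u2": [4.5],        # a single, high endorsement
    "u3": [3.0] * 10,
}  # made-up data

by_total = max(toy_db, key=lambda uid: sum(toy_db[uid]))
by_count = max(toy_db, key=lambda uid: len(toy_db[uid]))
by_mean = max(toy_db, key=lambda uid: mean(toy_db[uid]))

print(by_total, by_count, by_mean)
```

Here `u1` leads by total and by count, while `u2` leads by mean with a single endorsement.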

##### 📝 &nbsp; **Exercise 1.9**

Find the user who left the highest number of ratings on products. 

$\Rightarrow$ **Write** the `top_rater` function that returns the user ID `uid` of the user who left the highest number of rating scores (as in count, not in value), along with the corresponding count.

---

**Note**: We are interested here in counting the ratings per user, not their actual value. Besides, remember that a user may have rated a product more than once. Every rating left counts, here!


In [None]:
def top_rater(product_db):
    # your code here
    pass

top_rater(product_db)

In [99]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# Look for collections.Counter

In [131]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

from collections import Counter

def top_rater(product_db):
    # A Counter maps each uid to its number of ratings (int values)
    user_rating_counts = Counter()
    # iterate values, we're not interested in specific pid
    for ratings in product_db.values():
        # ratings is a list of (uid, score) tuples
        # We need to count the number of votes per ID
        votes_uids = [uid for uid, _ in ratings]
        voters_count = Counter(votes_uids)
        user_rating_counts.update(voters_count)
    return user_rating_counts.most_common(n=1)

top_rater(product_db)

[(590, 758)]
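
Since a `Counter` can be built from any iterable, the per-product loop can also be collapsed into a single generator expression. The sketch below is one possible variant on made-up data; `top_rater_compact` and `toy_products` are hypothetical names:

```python
from collections import Counter

def top_rater_compact(product_db):
    # Hypothetical variant: count every (uid, score) pair across all products
    counts = Counter(uid for ratings in product_db.values() for uid, _ in ratings)
    return counts.most_common(1)

toy_products = {
    "00000001": [(101, 7.5), (101, 3.0), (202, 9.1)],
    "00000002": [(101, 5.0), (303, 6.2)],
}  # made-up data
print(top_rater_compact(toy_products))
```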

##### 📝 &nbsp; **Exercise 1.10**

Find the user who left ratings on the highest number of **DISTINCT** products. 

$\Rightarrow$ **Write** the `top_rater_distinct` function modifying the solution to the previous exercise so that each user is counted only once per product.


In [None]:
def top_rater_distinct(product_db):
    # your code here
    pass

top_rater_distinct(product_db)

In [None]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# Look for collections.Counter, and for the built-in set type

In [132]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

from collections import Counter

def top_rater_distinct(product_db):
    # A Counter maps each uid to its number of rated products (int values)
    user_rating_counts = Counter()
    # iterate values, we're not interested in specific pid
    for ratings in product_db.values():
        # ratings is a list of (uid, score) tuples
        # A set keeps each uid only once per product
        votes_uids = {uid for uid, _ in ratings}
        voters_count = Counter(votes_uids)
        user_rating_counts.update(voters_count)
    return user_rating_counts.most_common(n=1)

top_rater_distinct(product_db)

[(599, 536)]
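
The difference between the two counts is easier to see on a tiny, made-up product database. In the sketch below, user `101` left the most ratings overall, but user `202` rated the most distinct products:

```python
from collections import Counter

toy_products = {
    "00000001": [(101, 7.5), (101, 3.0), (101, 1.0), (202, 9.1)],
    "00000002": [(202, 5.0), (303, 6.2)],
}  # made-up data

all_ratings = Counter()
distinct_products = Counter()
for ratings in toy_products.values():
    uids = [uid for uid, _ in ratings]
    all_ratings.update(uids)             # every rating counts
    distinct_products.update(set(uids))  # one count per user per product

print(all_ratings.most_common(1), distinct_products.most_common(1))
```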

##### 📝 &nbsp; **Exercise 1.11**

Calculate the average rating of each product and the corresponding standard deviation. 

$\Rightarrow$ **Write** the `product_stats` function which returns a dictionary containing the average and the standard deviation of the ratings for each product.

---

**Note**: Keep this exercise in mind, as we will come back to this shortly when we will be working on different data abstractions with OOP

In [None]:
def product_stats(product_db):
    # your code here
    pass

prod_stats = product_stats(product_db)

In [None]:
#@title Hint 🕵️ &nbsp;(_double click on the cell to open_) { display-mode: "form" }

# Look for the statistics module (e.g. mean and stdev)

In [None]:
#@title (proposed) Solution 💡 &nbsp; (Double Click to Open) { display-mode: "form" }

def product_stats(product_db):
    # your code here
    pass

prod_stats = product_stats(product_db)
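
No solution is filled in for this exercise, so the following is only a minimal sketch of one possible approach, not the author's solution. It assumes that each product maps to a list of `(uid, score)` tuples with at least two scores, and it uses the _sample_ standard deviation (`statistics.stdev`); `statistics.pstdev` would give the population version instead. The name `product_stats_sketch`, the `"mean"` and `"std"` keys, and the toy data are all assumptions:

```python
from statistics import mean, stdev

def product_stats_sketch(product_db):
    # Assumption: each value is a list of (uid, score) tuples with >= 2 scores
    stats = {}
    for pid, ratings in product_db.items():
        scores = [score for _, score in ratings]
        stats[pid] = {"mean": mean(scores), "std": stdev(scores)}
    return stats

toy_products = {"00000001": [(101, 2.0), (202, 4.0), (303, 6.0)]}  # made-up data
toy_stats = product_stats_sketch(toy_products)
print(toy_stats)
```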


### Object-Oriented Programming

## 2. PyData Fundamentals



### `numpy`

### `pandas`