<img src="https://drive.google.com/uc?export=view&id=184ugHgH0NZsfKhi6awHLuc2voY2bKY5o" width="180" align="center"/>

Master's degree in Intelligent Systems

Subject: 11754 - Deep Learning

Year: 2022-2023

Professor: Miguel Ángel Calafat Torrens

# LABORATORY 1

## EXERCISE 1


Write a function that implements the Caesar cipher. You do not need to import any library for this exercise. The encryption must work on the ranges 0-9, a-z and A-Z, wrapping around within each range: an offset of +1 on character '9' should result in '0', the same offset on 'z' must result in 'a', and on 'Z' it must result in 'A'.

The algorithm must work in both directions (positive and negative displacements). Spaces and other printable symbols must not be modified.

[https://en.wikipedia.org/wiki/Caesar_cipher](https://en.wikipedia.org/wiki/Caesar_cipher)

[https://www.ascii-code.com/](https://www.ascii-code.com/)

Try to do it without any loops.

The signature of the function should be:

```python
def cesar_cipher(message, displ=1):
    """
    Cipher the message using Cesar cipher.

    Inputs:
    -------
    - message: (string) Message to cipher.
    - displ: (integer) Displacement. Number of positions to move.
    
    Outputs:
    --------
    - ciphered: (string). Ciphered message
    """
    
    # Your code goes here
    #___________________________________________________________________________
    pass
    #___________________________________________________________________________

    return ciphered
```
**Hint:** `map()` applies a function to every character without an explicit loop, and is often faster than an equivalent `for` loop.

In [122]:
def cesar_cipher(message, displ=1):
    """
    Cipher the message using Cesar cipher.

    Inputs:
    -------
    - message: (string) Message to cipher.
    - displ: (integer) Displacement. Number of positions to move.
    
    Outputs:
    --------
    - ciphered: (string). Ciphered message
    """
    # Defining the ranges of characters to cipher
    digit_range = range(48, 58)  # 0-9
    lower_range = range(97, 123)  # a-z
    upper_range = range(65, 91)  # A-Z

    # Defining the functions to cipher each range of characters
    def cipher_digits(c):
        return chr((ord(c) - digit_range[0] + displ) % len(digit_range) + digit_range[0])

    def cipher_lower(c):
        return chr((ord(c) - lower_range[0] + displ) % len(lower_range) + lower_range[0])

    def cipher_upper(c):
        return chr((ord(c) - upper_range[0] + displ) % len(upper_range) + upper_range[0])

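    # Defining the function to cipher a single character
    def cipher_char(c):
        # Compare against explicit ASCII ranges: str.isdigit(), str.islower()
        # and str.isupper() also accept non-ASCII characters (e.g. '²', 'é'),
        # which would be mapped to wrong characters by the functions above.
        if '0' <= c <= '9':
            return cipher_digits(c)
        elif 'a' <= c <= 'z':
            return cipher_lower(c)
        elif 'A' <= c <= 'Z':
            return cipher_upper(c)
        else:
            return c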

    # Using map to cipher all characters in the message
    ciphered_chars = map(cipher_char, message)

    # Joining the ciphered characters into a string
    ciphered = ''.join(ciphered_chars)

    return ciphered

In [123]:
# Check that it works. You should be able to read it perfectly.
msg_m11 = "Zs, zvlj, T dpp. Estd dppxd ez hzcv qtyp. Qlyeldetn! Jzf " + \
    "piapnepo te ezz, ctrse? Gpcj hpww, jzf nly yzh nzyetyfp htes pipcntdp 3."
cesar_cipher(msg_m11, -11)

'Oh, okay, I see. This seems to work fine. Fantastic! You expected it too, right? Very well, you can now continue with exercise 2.'
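
As a cross-check, the same wrap-around behaviour can be sketched with Python's built-in `str.maketrans` and `str.translate`, which avoid any per-character loop. The names `caesar_translate` and `_rotate` are hypothetical helpers introduced only for this illustration; they are not part of the exercise.

```python
import string


def _rotate(alphabet, displ):
    """Rotate an alphabet left by displ positions (hypothetical helper)."""
    k = displ % len(alphabet)  # Python's % keeps k non-negative
    return alphabet[k:] + alphabet[:k]


def caesar_translate(message, displ=1):
    """Illustrative alternative to cesar_cipher built on str.translate."""
    source = string.digits + string.ascii_lowercase + string.ascii_uppercase
    target = (_rotate(string.digits, displ)
              + _rotate(string.ascii_lowercase, displ)
              + _rotate(string.ascii_uppercase, displ))
    # Characters missing from the table (spaces, punctuation) are left as-is.
    return message.translate(str.maketrans(source, target))


print(caesar_translate("az AZ 09!", 1))  # ba BA 10!
```

Because each range is rotated separately, digits stay digits and letters keep their case, which is the behaviour the exercise asks for.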

## EXERCISE 2



In this exercise you are given a fake dataframe. Use it to answer the following questions. All questions must be answered through code. You'll find a more detailed explanation in the cells below.

1) How many patients are in the dataset?

2) Add a new column with the body mass index.

3) Normalize the weight and the height: (x - mean) / std

4) What is the percentage of women in the dataset?

5) Calculate the mean age of women taller than the mean height.

6) Convert to lowercase the data in column 'finding'.

7) Find the number of unique conditions in the column 'finding'.

8) Create a list with all single labels (illnesses or no finding) present in the dataset.

9) Create a new column for each different disease (and also one for 'no finding'), and assign the values 1 or 0 to each patient depending on whether or not they have the condition.

**HINT:** `map()` and `lambda` functions can make code that iterates over a dataframe very concise.

In [124]:
# Execute this cell to get the pandas dataframe in the variable 'df'

import random
import numpy as np
from numpy.random import randint
import pandas as pd

def generate_dataframe(size=100, random_state='rand'):
    """This function generates a random dataframe to interact with"""

    # Random state for reproducibility
    seed = randint(1e6) if random_state == 'rand' else random_state
    random.seed(seed)
    np.random.seed(seed)

    # Column id
    id = np.arange(1, size + 1)

    # Columns age
    age = randint(30, 86, size=size)

    # Column gender
    gender = [(lambda x:"F" if x == 0 else "M")(x) for x in randint(0, 2, size)]

    # Column weight (kg)
    weight = 40 + np.trunc(9000 * np.random.random(size)) / 100

    # Column height (m)
    height = np.trunc(100 * np.random.normal(1.68, 0.12, size)) / 100

    # List of common illnesses
    ill_list = ['Diabetes', 'Depression', 'Anxiety', 'Hemorrhoid', 'Lupus',
            'Psoriasis', 'Bronchitis', 'Lyme Disease', 'Herpes', 'Pneumonia']

    # Lambda function to create samples of various illnesses
    fcn = lambda: '|'.join(random.sample(ill_list, randint(1, len(ill_list)-1)))

    # Array of random values between 0 and 1
    r = np.random.random(size)

    # Findings column
    finding = list(map(lambda x: "No Finding" if x < 0.5 else fcn(), r))

    # Build dataframe
    df = pd.DataFrame({'id': id,
                       'age': age,
                       'gender': gender,
                       'height': height,
                       'weight': weight,
                       'finding': finding})
    
    return df

df = generate_dataframe(4000, 0)
df.head(10)

Unnamed: 0,id,age,gender,height,weight,finding
0,1,74,M,1.73,70.46,Bronchitis|Pneumonia|Diabetes
1,2,77,F,1.67,83.44,No Finding
2,3,83,F,1.79,73.58,No Finding
3,4,30,F,1.53,52.98,Lupus|Herpes|Lyme Disease
4,5,33,M,1.65,57.89,No Finding
5,6,33,F,1.61,47.13,No Finding
6,7,69,F,1.59,82.08,No Finding
7,8,39,F,1.84,116.08,No Finding
8,9,49,F,1.5,48.14,No Finding
9,10,51,M,1.97,99.28,Bronchitis|Lupus|Lyme Disease|Anxiety|Herpes|D...


In [125]:
# Explore the dataframe as you need
df

Unnamed: 0,id,age,gender,height,weight,finding
0,1,74,M,1.73,70.46,Bronchitis|Pneumonia|Diabetes
1,2,77,F,1.67,83.44,No Finding
2,3,83,F,1.79,73.58,No Finding
3,4,30,F,1.53,52.98,Lupus|Herpes|Lyme Disease
4,5,33,M,1.65,57.89,No Finding
...,...,...,...,...,...,...
3995,3996,37,F,1.61,127.87,Diabetes|Psoriasis
3996,3997,31,F,1.73,63.36,No Finding
3997,3998,74,M,1.47,105.16,No Finding
3998,3999,78,M,1.45,53.87,Depression|Bronchitis


### QUESTION 1:

How many patients are in the dataset? Each different _id_ is a different patient.

In [126]:
# 1) How many patients are in the dataset?

n_patients = df['id'].nunique()
print('Number of patients: ' + str(n_patients))

Number of patients: 4000


### QUESTION 2:

Add a new column with the body mass index. The formula to calculate it is: $i_{mass}=\frac{weight[kg]}{height[m]^2}$

In [127]:
# 2) Add a new column with the body mass index.
df['mass'] = df['weight'] / df['height'] ** 2
df['mass']

0       23.542384
1       29.918606
2       22.964327
3       22.632321
4       21.263545
          ...    
3995    49.330659
3996    21.170103
3997    48.664908
3998    25.621879
3999    16.313209
Name: mass, Length: 4000, dtype: float64
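
A quick sanity check of the formula on made-up values (the toy dataframe below is an assumption for illustration, not data from the exercise). Note that the index only makes sense while `weight` and `height` are still expressed in kg and m, so it has to be computed before any normalization.

```python
import pandas as pd

# Hypothetical two-patient dataframe, used only to check the arithmetic.
toy = pd.DataFrame({'weight': [70.0, 90.0], 'height': [1.75, 1.80]})

# Body mass index: weight in kg divided by squared height in m.
toy['mass'] = toy['weight'] / toy['height'] ** 2
print(toy['mass'].round(2).tolist())  # [22.86, 27.78]
```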

### QUESTION 3:

Normalize the weight and the height by subtracting the mean and dividing by the standard deviation. Don't create a new dataframe; modify the current _df_ in place.

In [128]:
# 3) Normalize the weight and the height: (x - mean) / std
df['weight'] = (df['weight'] - df['weight'].mean()) / df['weight'].std()
df['height'] = (df['height'] - df['height'].mean()) / df['height'].std()
df.head(10)

Unnamed: 0,id,age,gender,height,weight,finding,mass
0,1,74,M,0.469233,-0.526043,Bronchitis|Pneumonia|Diabetes,23.542384
1,2,77,F,-0.029818,-0.025755,No Finding,29.918606
2,3,83,F,0.968284,-0.405789,No Finding,22.964327
3,4,30,F,-1.194271,-1.199775,Lupus|Herpes|Lyme Disease,22.632321
4,5,33,M,-0.196169,-1.010529,No Finding,21.263545
5,6,33,F,-0.528869,-1.425252,No Finding,18.182169
6,7,69,F,-0.69522,-0.078173,No Finding,32.46707
7,8,39,F,1.38416,1.232289,No Finding,34.286389
8,9,49,F,-1.443796,-1.386323,No Finding,21.395556
9,10,51,M,2.465437,0.584766,Bronchitis|Lupus|Lyme Disease|Anxiety|Herpes|D...,25.581695
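
The normalization $(x - mean) / std$ can be checked on a small Series. This is a minimal sketch with made-up numbers; `zscore` is a hypothetical helper name. pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, so the result has mean 0 and sample standard deviation 1.

```python
import pandas as pd


def zscore(series):
    """Subtract the mean and divide by the (sample) standard deviation."""
    return (series - series.mean()) / series.std()


# Made-up values, for illustration only.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
normalized = zscore(toy)
print(normalized.mean(), normalized.std())
```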


### QUESTION 4:

What is the percentage of women in the dataset?

In [129]:
# 4) What is the percentage of women in the dataset?
counts = df['gender'].value_counts()

# calculating the percentage of women
percent_w = counts['F'] / counts.sum() * 100

print("Percentage of women is " + str(percent_w) + "%")

Percentage of women is 50.075%
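
An equivalent way to get a percentage is to average a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below uses a made-up Series rather than the exercise data.

```python
import pandas as pd

# Made-up gender column, for illustration only.
genders = pd.Series(['F', 'M', 'F', 'F'])

# The mean of a boolean mask is the fraction of True values.
percent_f = (genders == 'F').mean() * 100
print(percent_f)  # 75.0
```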


### QUESTION 5:

Calculate the mean age of women taller than the mean height.

About half of the women in this dataset are taller than average. Find the mean age of these tall women.

In [130]:
# 5) Calculating the mean of age of women taller than the mean
mean_age_taller = df[(df['gender'] == 'F') & (df['height'] > df['height'].mean())]['age'].mean()

print("The mean of age of women taller than the mean is " + str(mean_age_taller))

The mean of age of women taller than the mean is 56.91935483870968
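
The statement leaves open which mean the heights are compared against: the mean over all patients (as the cell above does) or the mean over women only. The sketch below, on a made-up dataframe, shows that the two readings can give different answers. All values are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up patients, for illustration only.
toy = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M'],
    'height': [1.50, 1.74, 1.90, 1.80, 1.90],
    'age': [30, 40, 50, 60, 70],
})
women = toy[toy['gender'] == 'F']

# Reading 1: taller than the mean height of all patients.
age_vs_all = women.loc[women['height'] > toy['height'].mean(), 'age'].mean()

# Reading 2: taller than the mean height of women only.
age_vs_women = women.loc[women['height'] > women['height'].mean(), 'age'].mean()

print(age_vs_all, age_vs_women)  # 50.0 45.0
```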


### QUESTION 6:

Convert to lowercase the data in column _finding_.

In [131]:
# 6 - Convert to lowercase the data in column 'finding'
df['finding'] = df['finding'].str.lower()
df.head(10)

Unnamed: 0,id,age,gender,height,weight,finding,mass
0,1,74,M,0.469233,-0.526043,bronchitis|pneumonia|diabetes,23.542384
1,2,77,F,-0.029818,-0.025755,no finding,29.918606
2,3,83,F,0.968284,-0.405789,no finding,22.964327
3,4,30,F,-1.194271,-1.199775,lupus|herpes|lyme disease,22.632321
4,5,33,M,-0.196169,-1.010529,no finding,21.263545
5,6,33,F,-0.528869,-1.425252,no finding,18.182169
6,7,69,F,-0.69522,-0.078173,no finding,32.46707
7,8,39,F,1.38416,1.232289,no finding,34.286389
8,9,49,F,-1.443796,-1.386323,no finding,21.395556
9,10,51,M,2.465437,0.584766,bronchitis|lupus|lyme disease|anxiety|herpes|d...,25.581695


### QUESTION 7:

Find the number of unique conditions of the column 'finding'.

Each unique combination of diseases is a unique condition. For example, "Anxiety" is one combination, "Anxiety|Bronchitis" is another combination. "Anxiety|Bronchitis|Herpes" is another different combination, and so on. All of them are unique conditions.

Please note that items may be arranged in a different order; i.e. "Anxiety|Bronchitis" is the same combination as "Bronchitis|Anxiety", so you shouldn't count them as two different conditions.

"No Finding" is also a condition.

In [132]:
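# 7) Find the number of all unique conditions in the 'finding' column.

# Split each entry into a set of labels so the order of the labels is ignored.
# str.split returns a one-element list when there is no '|', so no special
# case is needed. Note: this overwrites 'finding' with sets (used in Q8 and Q9).
df['finding'] = df['finding'].apply(lambda x: set(x.split('|')))
# frozenset is hashable, so equal combinations collapse inside a set.
unique_sets = set(frozenset(x) for x in df['finding'])
count_uniq = len(unique_sets)
print("Number of unique conditions: " + str(count_uniq))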

Number of unique conditions: 750
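
The order-insensitive comparison relies on `frozenset`: a plain `set` ignores order but is not hashable, so it cannot be stored inside another set, whereas a `frozenset` can. A minimal sketch on a made-up list of findings:

```python
# Made-up findings; the first two are the same combination in a different order.
findings = ['anxiety|bronchitis', 'bronchitis|anxiety', 'anxiety', 'no finding']

# frozenset ignores ordering and is hashable, so it can be stored in a set.
unique_conditions = set(map(lambda s: frozenset(s.split('|')), findings))
print(len(unique_conditions))  # 3
```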


### QUESTION 8:

Create a list with all single finding labels (single illness or no finding) present in the dataset.

The expected result is the following:

`['anxiety', 'bronchitis', 'depression', 'diabetes', 'hemorrhoid', 'herpes',
 'lupus', 'lyme disease', 'no finding', 'pneumonia', 'psoriasis']`

You should not copy and paste the list above. Extract it from the labels present in the 'finding' column of the dataframe, without hard-coding the words you are looking for.

**HINT:** `itertools.chain()` may be useful.

In [133]:
# 8) Create a list with all single illnesses or conditions present in the
# dataset.
import itertools

list_ills = list(set(itertools.chain(*df['finding'])))

list_ills

['no finding',
 'anxiety',
 'hemorrhoid',
 'pneumonia',
 'lyme disease',
 'depression',
 'herpes',
 'bronchitis',
 'psoriasis',
 'diabetes',
 'lupus']
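
A set has no guaranteed iteration order, which is why the list above is not alphabetical. The sketch below, on made-up strings, shows how `itertools.chain.from_iterable` flattens the split labels and how `sorted()` gives a reproducible, alphabetical list.

```python
from itertools import chain

# Made-up findings, for illustration only.
findings = ['anxiety|bronchitis', 'no finding', 'bronchitis|lupus']

# Flatten all labels, drop duplicates with set(), then sort alphabetically.
labels = sorted(set(chain.from_iterable(s.split('|') for s in findings)))
print(labels)  # ['anxiety', 'bronchitis', 'lupus', 'no finding']
```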

### QUESTION 9:

Create a new column for each different disease (and also one for 'no finding'), and assign the values 1 or 0 to each patient depending on whether or not they have the condition.

The dataframe will end up having 11 more columns.

In [134]:
# 9) Create a new column for each different disease, and assign the values 1 or
# 0 to each patient depending on whether or not they have the disease.

# Create a new column for each finding and assign 1 or 0 to each patient
for disease in list_ills:
    df[disease] = df['finding'].apply(lambda x: 1 if disease in x else 0)
df.head(10)

Unnamed: 0,id,age,gender,height,weight,finding,mass,no finding,anxiety,hemorrhoid,pneumonia,lyme disease,depression,herpes,bronchitis,psoriasis,diabetes,lupus
0,1,74,M,0.469233,-0.526043,"{bronchitis, diabetes, pneumonia}",23.542384,0,0,0,1,0,0,0,1,0,1,0
1,2,77,F,-0.029818,-0.025755,{no finding},29.918606,1,0,0,0,0,0,0,0,0,0,0
2,3,83,F,0.968284,-0.405789,{no finding},22.964327,1,0,0,0,0,0,0,0,0,0,0
3,4,30,F,-1.194271,-1.199775,"{lyme disease, herpes, lupus}",22.632321,0,0,0,0,1,0,1,0,0,0,1
4,5,33,M,-0.196169,-1.010529,{no finding},21.263545,1,0,0,0,0,0,0,0,0,0,0
5,6,33,F,-0.528869,-1.425252,{no finding},18.182169,1,0,0,0,0,0,0,0,0,0,0
6,7,69,F,-0.69522,-0.078173,{no finding},32.46707,1,0,0,0,0,0,0,0,0,0,0
7,8,39,F,1.38416,1.232289,{no finding},34.286389,1,0,0,0,0,0,0,0,0,0,0
8,9,49,F,-1.443796,-1.386323,{no finding},21.395556,1,0,0,0,0,0,0,0,0,0,0
9,10,51,M,2.465437,0.584766,"{lyme disease, depression, herpes, anxiety, br...",25.581695,0,1,0,0,1,1,1,1,1,0,1
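
pandas also offers `Series.str.get_dummies`, which builds indicator columns directly from delimiter-separated strings. The sketch below assumes the column still holds the original `'|'`-separated strings (in this notebook it was converted to sets in Question 7) and uses made-up values.

```python
import pandas as pd

# Made-up findings stored as '|'-separated strings, for illustration only.
toy = pd.DataFrame({'finding': ['anxiety|bronchitis', 'no finding', 'lupus']})

# One 0/1 column per distinct label found in the strings.
dummies = toy['finding'].str.get_dummies(sep='|')

# Attach the indicator columns to the original dataframe.
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```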
