# Statistics and Data Science: Exercises library

## Data normalization with Euclidean norm

A very common operation is to transform your data by normalization. Imagine you have a list of data points $x=$`[21.4,45.7,38.5,76.4,61.9,43.4,52.6,27.2]` and you want to normalize your data using the [Euclidean norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm), i.e., rescale the (non-negative) values so that each one lies between 0 and 1, with the following operation:

$\hat{x}_{i} = \frac{x_{i}}{||x||}$

where: $||x||=\sqrt{x_1^2+...+x_n^2}$

Normalization is common, and often necessary, when you deal with several variables that have very different scales.

- Using list comprehension, create a new list with normalized $x$ using the Euclidean norm. 

In [1]:
import numpy as np
x = [21.4, 45.7, 38.5, 76.4, 61.9, 43.4, 52.6, 27.2]

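def euclidean_norm(values):
    # Square root of the sum of squares, i.e. ||x||
    return np.sqrt(np.sum(np.square(values)))

# Compute the norm once instead of once per element
norm = euclidean_norm(x)
# float() keeps the printed list free of NumPy scalar wrappers
euc_norm = [round(float(item / norm), 2) for item in x]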

print(euc_norm)
print(f'We can see that the original list is still intact: {x}')


[0.15, 0.33, 0.28, 0.55, 0.45, 0.31, 0.38, 0.2]
We can see that the original list is still intact: [21.4, 45.7, 38.5, 76.4, 61.9, 43.4, 52.6, 27.2]
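
As a quick sanity check on the formula above, dividing every element by $||x||$ should yield a vector whose own Euclidean norm is 1. The sketch below verifies this with the standard library only; the names `norm`, `x_hat` and `norm_of_x_hat` are illustrative and not part of the exercise.

```python
import math

x = [21.4, 45.7, 38.5, 76.4, 61.9, 43.4, 52.6, 27.2]

# Euclidean norm: square root of the sum of squares
norm = math.sqrt(sum(value ** 2 for value in x))

# Normalize without rounding, so the check below only suffers float error
x_hat = [value / norm for value in x]

# The normalized vector should itself have norm 1
norm_of_x_hat = math.sqrt(sum(value ** 2 for value in x_hat))
print(math.isclose(norm_of_x_hat, 1.0))
```

Rounding to two decimals, as in the solution above, is convenient for display but means the rounded values no longer have a norm of exactly 1.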


## Data cleaning with comprehension

Suppose we have the following list: $x=$`[21.4, 'NaN', 45.7,38.5,76.4,61.9, 'NaN', 43.4,52.6,27.2]`. Unfortunately we have some `'NaN'` values (Not a Number).

- Clean your list by dropping the `'NaN'` values, using a list comprehension.
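
Note that the `'NaN'` entries in this list are strings, which is why a plain `!=` comparison is enough here. If the missing values were real floating-point NaNs (an assumption that goes beyond this exercise), the comparison would behave differently, because a float NaN is not equal to anything, including itself. A minimal sketch using `math.isnan`, with a hypothetical list `y`:

```python
import math

# Hypothetical data with real float NaNs instead of the string 'NaN'
y = [21.4, float('nan'), 45.7, float('nan')]

# A float NaN never compares equal, not even to itself
print(float('nan') == float('nan'))  # False

# math.isnan() detects it reliably
y_cleaned = [item for item in y if not math.isnan(item)]
print(y_cleaned)  # [21.4, 45.7]
```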

In [2]:
import time

x = [21.4, 'NaN', 45.7, 38.5, 76.4, 61.9, 'NaN', 43.4, 52.6, 27.2]

# One way of doing it: filter() with a lambda
start_lambda = time.time()
x_filtered = list(filter(lambda a: a != 'NaN', x))
stop_lambda = time.time()
result_lambda = stop_lambda - start_lambda
print(x_filtered, result_lambda)

# Another way of doing it with comprehension
start_compr = time.time()
x_cleaned = [item for item in x if item != 'NaN']
stop_compr = time.time()
result_compr = stop_compr - start_compr
print(x_cleaned, result_compr)

# TODO: Run the same operation 100 times, and compare the different mean time per function.


[21.4, 45.7, 38.5, 76.4, 61.9, 43.4, 52.6, 27.2] 4.696846008300781e-05
[21.4, 45.7, 38.5, 76.4, 61.9, 43.4, 52.6, 27.2] 3.814697265625e-05
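
The TODO above can be approached with the standard `timeit` module, which runs a callable many times and returns the total elapsed time. This is a minimal sketch, assuming 100 runs per approach as the TODO suggests; the helper names `with_filter` and `with_comprehension` are illustrative. Timings vary between machines and runs, so no expected values are shown.

```python
import timeit

x = [21.4, 'NaN', 45.7, 38.5, 76.4, 61.9, 'NaN', 43.4, 52.6, 27.2]

def with_filter():
    # Approach 1: filter() with a lambda
    return list(filter(lambda a: a != 'NaN', x))

def with_comprehension():
    # Approach 2: list comprehension
    return [item for item in x if item != 'NaN']

n_runs = 100
# timeit.timeit returns the total seconds for n_runs calls; divide for the mean
mean_filter = timeit.timeit(with_filter, number=n_runs) / n_runs
mean_compr = timeit.timeit(with_comprehension, number=n_runs) / n_runs

print(f'filter + lambda: {mean_filter:.3e} s per run')
print(f'comprehension:   {mean_compr:.3e} s per run')
```

Averaging over many runs smooths out the noise that dominates a single measurement of a few microseconds.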


## Data manipulation using dictionary comprehension

Comprehensions are not only for lists; they work for dictionaries too! Suppose you have the following dictionary, with the grades of some students on a 0-100 scale:

`{'Adam': 72, 'Elena': 91, 'Xiang': 87, 'Julie': 81, 'Takafumi': 79}`

- Use dictionary comprehension to convert the grade from the 0-100 scale to the Swiss 0-6 scale.
- Use dictionary comprehension to round the converted grades to the nearest 0.25 (for instance, 4.2 should be converted to 4.25).

Tip: you can use the `round()` function.

In [3]:
grades = {'Adam': 72, 'Elena': 91, 'Xiang': 87, 'Julie': 81, 'Takafumi': 79}

# Convert from the 0-100 scale to the Swiss 0-6 scale
grades_adjusted2 = {key: round(value * 6 / 100, 2) for key, value in grades.items()}
# Round to the nearest 0.25: scale by 4, round to an integer, scale back
grades_adjusted = {key: round(value * 24 / 100) / 4 for key, value in grades.items()}

print(grades_adjusted)


{'Adam': 4.25, 'Elena': 5.5, 'Xiang': 5.25, 'Julie': 4.75, 'Takafumi': 4.75}
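
The rounding step works by counting quarter points: multiplying by 4 turns steps of 0.25 into whole numbers, `round()` picks the nearest one, and dividing by 4 scales back. One caveat is that Python's built-in `round()` rounds exact ties to the nearest even integer. The sketch below illustrates both points; `round_to_quarter` and `round_to_quarter_half_up` are hypothetical helper names, and the half-up variant is only one possible convention, not necessarily an official grading rule.

```python
import math

def round_to_quarter(value):
    # Multiply by 4, round to the nearest integer, divide by 4
    return round(value * 4) / 4

def round_to_quarter_half_up(value):
    # Alternative for non-negative values: exact ties always round up
    return math.floor(value * 4 + 0.5) / 4

print(round_to_quarter(4.2))            # 4.25
print(round_to_quarter(4.125))          # 4.0  (16.5 is a tie, rounds to even 16)
print(round_to_quarter_half_up(4.125))  # 4.25
```

None of the five grades in this exercise falls exactly on a tie, so both variants give the same result for them.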


## Green Bonds

You have a list of green bonds identifiers: 
`gb_ID = ['CH843556=S', 'CH843556=', 'CH868037=', 'CH6YT=RR', 'CH30YT=RR', 'CH975519=', 'CH1580323=', 'CH1580323=S', 'CH2452496=S']`

- Create a new list with the elements of `gb_ID`, removing the `'='` sign and everything that follows it. For instance, `'CH843556=S'` should become `'CH843556'`.
- Create a new list containing only the elements of `gb_ID` that have nothing after the `'='` sign, i.e., disregard elements such as `'CH843556=S'`.

Hints: 
- You can use list comprehension inside another list comprehension.
- For the second question, you could use Regular Expressions [RegEx](https://docs.python.org/3/library/re.html). See also this [tutorial](https://www.w3schools.com/python/python_regex.asp)

In [4]:
gb_ID = ['CH843556=S', 'CH843556=', 'CH868037=', 'CH6YT=RR', 'CH30YT=RR', 'CH975519=', 'CH1580323=', 'CH1580323=S', 'CH2452496=S']

# Keep everything before the "=" sign
gb_ID_deep_cleansed = [item.split('=')[0] for item in gb_ID]
# Keep only the IDs that end with "=", i.e. with nothing after it
gb_ID_cleansed = [item for item in gb_ID if item.endswith('=')]

print(f'IDs with the "=" and any trailing characters removed: {gb_ID_deep_cleansed}')
print(f'IDs with nothing after the "=": {gb_ID_cleansed}')


IDs with the "=" and any trailing characters removed: ['CH843556', 'CH843556', 'CH868037', 'CH6YT', 'CH30YT', 'CH975519', 'CH1580323', 'CH1580323', 'CH2452496']
IDs with nothing after the "=": ['CH843556=', 'CH868037=', 'CH975519=', 'CH1580323=']
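
As the hint mentions, both questions can also be answered with regular expressions. This is a minimal sketch using the standard `re` module; the variable names `ids_without_suffix` and `ids_ending_with_equals` are illustrative.

```python
import re

gb_ID = ['CH843556=S', 'CH843556=', 'CH868037=', 'CH6YT=RR', 'CH30YT=RR',
         'CH975519=', 'CH1580323=', 'CH1580323=S', 'CH2452496=S']

# Remove the '=' sign and everything after it
ids_without_suffix = [re.sub(r'=.*', '', item) for item in gb_ID]

# Keep only the IDs where '=' is the last character:
# one or more non-'=' characters followed by a single '='
ids_ending_with_equals = [item for item in gb_ID if re.fullmatch(r'[^=]+=', item)]

print(ids_without_suffix)
print(ids_ending_with_equals)
```

`re.fullmatch` requires the whole string to match the pattern, so any ID with characters after the `'='` is rejected.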


## Optimizing recursive function

During the lecture, we defined a function to calculate Fibonacci numbers:

$F(0)=0$

$F(1)=1$

$F(n)=F(n-1)+F(n-2)$

However, our function was not efficient since we needed to repeat operations. For example, to compute $F(5)$, we needed $F(4)$ and $F(3)$, but to know $F(4)$ we needed to compute $F(3)$ and $F(2)$, and so on. Since Fibonacci numbers were not stored in memory, the function calculated many identical subproblems over and over again.
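
The amount of repeated work can be made visible by counting calls. The sketch below assumes the lecture version was the direct recursive translation of the definition (the lecture code is not shown here, so `fib_naive` is a stand-in) and uses a `Counter` to record how often each argument is evaluated.

```python
from collections import Counter

calls = Counter()  # how many times each F(n) is evaluated

def fib_naive(n):
    calls[n] += 1
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

result = fib_naive(5)
print(result)               # 5
print(sum(calls.values()))  # 15 calls in total
print(calls[3], calls[2])   # F(3) is evaluated 2 times, F(2) 3 times
```

The number of calls grows roughly as fast as the Fibonacci numbers themselves, which is why storing intermediate results pays off.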

- Design a function that calculates Fibonacci numbers and solves the repetition issue.
- Create a list of the first 12 Fibonacci numbers.

Hint: you can use a dictionary

In [5]:
import time

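fib_num = []  # cache of the Fibonacci numbers computed so far

def fib(n):
    # Return the first n Fibonacci numbers, extending the cache as needed
    if len(fib_num) >= n:
        return fib_num[:n]
    elif len(fib_num) == 0:
        fib_num.append(0)
        return fib(n)
    elif len(fib_num) == 1:
        fib_num.append(1)
        return fib(n)
    else:
        last_value = fib_num[-1]
        second_last_value = fib_num[-2]
        fib_num.append(last_value+second_last_value)
        return fib(n)

start_fib = time.time()
# Store the result under a separate name so the function fib is not overwritten
first_12 = fib(12)
stop_fib = time.time()
result_fib = stop_fib - start_fib
print(first_12, result_fib)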


[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89] 3.218650817871094e-05
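
The hint suggests a dictionary. One possible sketch of that approach stores each computed value under its index, so every $F(n)$ is calculated only once; `fib_cache` and `fib_memo` are illustrative names.

```python
# Cache seeded with the two base cases F(0) = 0 and F(1) = 1
fib_cache = {0: 0, 1: 1}

def fib_memo(n):
    # Compute F(n) only if it is not already stored
    if n not in fib_cache:
        fib_cache[n] = fib_memo(n - 1) + fib_memo(n - 2)
    return fib_cache[n]

first_12 = [fib_memo(i) for i in range(12)]
print(first_12)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```

Unlike a list-based cache, this version returns a single Fibonacci number per call, which matches the definition $F(n)$ directly.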


## Book information

We have some information about two books:

`(
Title = 'Sapiens: A Brief History of Humankind', 
Author = 'Yuval Noah Harari',
Year = 2011,
Language = 'Hebrew',
ISBN = '978-0062316097')`

`(
Title = 'Les Racines du ciel',
Author = 'Romain Gary',
Year = 1956,
Publisher = 'Gallimard'
)`

As you can see, the information we have differs.

- Write a function that prints, for each key: 'The (key) is (value).'. The key should be in lowercase, except for `ISBN`.
- Call your function with our two books.

For instance, the output for the second book should look like this:

The title is Les Racines du ciel.
The author is Romain Gary.
The year is 1956.
The publisher is Gallimard.

Hint: Try to use arbitrary keyword arguments (`**kwargs`) and string formatting (for example f-strings or `str.format()`).

In [6]:
book_dict_sapiens = {
    'Title': 'Sapiens: A Brief History of Humankind',
    'Author': 'Yuval Noah Harari',
    'Year': 2011,
    'Language': 'Hebrew',
    'ISBN': '978-0062316097',
}

book_dict_ciel = {
    'Title': 'Les Racines du ciel',
    'Author': 'Romain Gary',
    'Year': 1956,
    'Publisher': 'Gallimard'
}


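def print_book(**books):
    # Build one sentence per keyword argument and return them as a single string
    result = ''
    for key, value in books.items():
        if key == 'ISBN':
            # ISBN is an acronym, so it keeps its capitalization
            result += f'The {key} is {value}. '
        else:
            result += f'The {key.lower()} is {value}. '
    return result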

book_prettified_sapiens = print_book(**book_dict_sapiens)
book_prettified_ciel = print_book(**book_dict_ciel)

print(book_prettified_sapiens)
print(book_prettified_ciel)


The title is Sapiens: A Brief History of Humankind. The author is Yuval Noah Harari. The year is 2011. The language is Hebrew. The ISBN is 978-0062316097. 
The title is Les Racines du ciel. The author is Romain Gary. The year is 1956. The publisher is Gallimard. 
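
The example output in the exercise statement shows one sentence per line. A possible variant that produces this layout is sketched below; `describe_book` is a hypothetical name, and the function returns a list of lines so that the caller decides how to print them.

```python
def describe_book(**book):
    lines = []
    for key, value in book.items():
        # Lowercase every key except the acronym 'ISBN'
        label = key if key == 'ISBN' else key.lower()
        lines.append(f'The {label} is {value}.')
    return lines

book_dict_ciel = {
    'Title': 'Les Racines du ciel',
    'Author': 'Romain Gary',
    'Year': 1956,
    'Publisher': 'Gallimard',
}

# Prints one sentence per line
for line in describe_book(**book_dict_ciel):
    print(line)
```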
