# INFO-F-422 -  Statistical Foundations of Machine Learning 

### Gian Marco Paldino - __[gian.marco.paldino@ulb.be](mailto:gian.marco.paldino@ulb.be)__
### Cédric Simar - __[cedric.simar@ulb.be](mailto:cedric.simar@ulb.be)__

## TP 1 - Introduction to Python

####  February 2023

#### Materials originally developed by *Yann-Aël Le Borgne, Fabrizio Carcillo and Gianluca Bontempi*


### Use Python as a calculator

In [None]:
2 + 2

In [None]:
import math
math.exp(-2)

In [None]:
import numpy as np
# generate random numbers from a standard normal distribution
np.random.randn(15)

### Assign values to variables

In [None]:
x = 2
x

In [None]:
y = x + x
y

### Installing and loading libraries

In Python, you install packages using `pip` or `conda` outside the script. For instance:  
`pip install numpy pandas matplotlib scipy`

Then you can import them:

In [204]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

### Naming conventions

Python naming conventions are guidelines for naming variables, functions, classes, and other identifiers so that code is readable and consistent, especially in collaborative projects. The PEP 8 style guide recommends:

- **snake_case** (lowercase words separated by underscores) for variable and function names;
- **PascalCase** (each word capitalized, no underscores) for class names;
- **ALL_CAPS** (uppercase words separated by underscores) for constants;
- a single leading underscore (`_name`) for private or internal names;
- two leading underscores (`__name`) to trigger name mangling, which helps avoid attribute name collisions in subclasses. Names with double underscores on both sides (`__init__`, `__name__`) are reserved for special methods and attributes.

Names are case-sensitive and cannot start with a digit. Avoid reusing the names of built-ins such as `list` or `dict`, since doing so hides the original object.
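
As a small illustration of these conventions, here is a minimal sketch. All names in it (`MAX_ITERATIONS`, `RunningMean`, `compute_mean`) are made up for the example and are not part of the course material.

```python
# Constant: ALL_CAPS
MAX_ITERATIONS = 100


class RunningMean:
    """Class name in PascalCase."""

    def __init__(self):
        self._total = 0.0   # single leading underscore: internal attribute
        self._count = 0

    def add_value(self, new_value):   # method name in snake_case
        self._total += new_value
        self._count += 1

    def current_mean(self):
        return self._total / self._count


def compute_mean(values):   # function name in snake_case
    running_mean = RunningMean()
    for value in values:
        running_mean.add_value(value)
    return running_mean.current_mean()


sample_mean = compute_mean([1, 2, 3, 4])
print(sample_mean)
```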

---

## Vectors and arrays in Python

### Defining arrays (vectors)

In [None]:
w = np.array([1,2,3,4])
x = np.array([1.5, 2.5, 3.5, 3.5])
y = np.array(["Huey", "Dewey", "Louie"])
z = np.array([True, False, False, True])
wx = w > x
wx

### Creating sequences and repetitions

In [None]:
# Sequence from 1 to 4
x = np.arange(1,5)  
x


In [None]:

# Repeat the number 1 four times
y = np.repeat(1,4)
y

### Exercises

Generate the vectors:  
- `1 3 5`  
- `1 2 2 2 3 3 3 3 3`

In [None]:
seq_135 = np.arange(1,6,2)
seq_135



In [None]:
seq_complex = np.concatenate([np.repeat(1,1), np.repeat(2,3), np.repeat(3,5)])
seq_complex

### Vector arithmetic

In [None]:
weight = np.array([60, 72, 57, 90, 95, 72,60])
height = np.array([1.75, 1.80, 1.65, 1.90, 1.74, 1.91,1.69])
bmi = weight / height**2
bmi

### Exercises

*  Compute the adjusted BMI using the formula $bmi_2=\frac{weight}{height^{2.5}/1.3}$ and store it in a variable  `bmi2` 

#### Solution


In [None]:
bmi2 = weight/(height**2.5/1.3)
bmi2

### Functions for arrays

In [212]:
v = np.arange(5,0,-1)

In [None]:
v.sum()

In [None]:
len(v)

In [None]:
np.sort(v)

In [None]:
np.mean(v)

In [None]:
np.std(v, ddof=1) # sample std

### Indexing arrays

In [None]:
height[4]  # 5th element (0-based indexing)

In [None]:
height[[2,4,6]] # pick indices 2,4,6

In [None]:
height[height > 1.70]  # conditional indexing

In [None]:
height[(height>1.70) & (height<1.90)]

### Matrices

In Python, we use NumPy to handle arrays (matrices and tensors). All elements of a NumPy array share a single data type (numeric, Boolean, string, etc.). <br>
You can reshape a vector into a matrix with the `reshape()` method.

In [None]:
import numpy as np

x = np.arange(1, 13)
x = x.reshape(3, 4)
x

In [None]:
np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]).reshape(3, 4, order='F')


In [None]:
np.full((6, 7), 1)


In [None]:
np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]).reshape(3, 4)


In [None]:
# Adding row names (index labels)
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]).reshape(3, 4)
df = pd.DataFrame(x, index=['A', 'B', 'C'])
df


In [None]:
# Stack 1-D arrays as the columns of a matrix
np.column_stack((np.arange(1, 5), np.arange(5, 9), np.arange(9, 13)))


In [None]:
# Stack 1-D arrays as the rows of a matrix
np.vstack((np.arange(1, 5), np.arange(5, 9), np.arange(9, 13)))


### Matrix functions

In [None]:
a = np.array([[1, 2], [3, 4]])
a

In [None]:
# matrix product
a.dot(a)


In [None]:
# inverse matrix
np.linalg.inv(a)


In [None]:
# transpose of a matrix
a.T


### Exercises

Sort the vector `height` by decreasing order of `weight`

In [None]:
sorted_indices = np.argsort(weight)[::-1]  # descending order indices
height[sorted_indices]

---

## Categories, Lists and data frames

### Factors / Categorical Data

We can convert an array of integer codes into a categorical data type using the `pd.Categorical.from_codes` method from the pandas library. <br>
The from_codes method takes two main arguments:
- codes: An array-like of integers where each integer points to a category.
- categories: A list of category names corresponding to the integer codes. 

In this case, the pain array [0, 3, 2, 2, 1] is mapped to the categories ["none", "mild", "medium", "severe"]: code 0 becomes "none", code 3 becomes "severe", and so on.

In [None]:
pain = np.array([0,3,2,2,1])
categories = pd.Categorical.from_codes(pain, categories=["none","mild","medium","severe"])
categories
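
The mapping between codes and labels can be inspected in both directions. The sketch below rebuilds the same categorical under illustrative names (`pain_codes`, `pain_levels`, `pain_cat`) and counts the observations per category.

```python
import numpy as np
import pandas as pd

pain_codes = np.array([0, 3, 2, 2, 1])
pain_levels = ["none", "mild", "medium", "severe"]
pain_cat = pd.Categorical.from_codes(pain_codes, categories=pain_levels)

# Labels seen by the user
pain_labels = list(pain_cat)
print(pain_labels)

# Integer codes stored internally
print(pain_cat.codes)

# Number of observations in each category
counts = pd.Series(pain_cat).value_counts()
print(counts)
```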

### Lists

Python has a built-in `list` type for ordered collections. When the components need names, as in the example below, a dictionary is the usual choice; for tabular data, we use data frames.

In [None]:
intake_pre = np.array([5260, 5470, 5640, 6180, 6390,6515, 6805, 7515, 7515, 8230, 8770])
intake_post = np.array([3910, 4220, 3885, 5160, 5645,4680, 5265, 5975, 6790, 6900, 7335])

mylist = {"before": intake_pre, "after": intake_post}
mylist["before"]
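
For comparison, here is a minimal sketch of a plain Python list next to a dictionary. The variable names and the shortened data are illustrative only.

```python
# A list is ordered and indexed by position (0-based)
intake_list = [5260, 5470, 5640]
intake_list.append(6180)        # lists can grow in place
first_value = intake_list[0]
last_value = intake_list[-1]    # negative indices count from the end

# A dictionary is indexed by key rather than by position
intake_dict = {"before": [5260, 5470], "after": [3910, 4220]}
keys = list(intake_dict.keys())

print(first_value, last_value, keys)
```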

### Data frames

In Python, data frames are handled by pandas.

In [None]:
d = pd.DataFrame({"intake_pre": intake_pre, "intake_post": intake_post})
d
d["intake_pre"]

---

## Loops & conditions

### Loops

In [None]:
for i in range(5):
    print(i)

### While loop

In [None]:
count = 0
while count < 5:
    print(count)
    count += 1

### Conditions

In [None]:
val = 10
if val > 0:
    print("Positive")
elif val < 0:
    print("Negative")
else:
    print("Zero")

### Apply-like operations

In Python, you often use vectorized operations with NumPy, or `apply`-like methods in pandas.

For the `DataFrame` `d` defined above:

In [None]:
# mean of each column
d.mean()

In [None]:
# apply function to each column
d.apply(np.mean)

In [None]:
# for each row
d.apply(np.mean, axis=1)

### Exercise 

Compute the column means of the data frame `thuesen_df`, returning the result first as a list and then as a vector (NumPy array). <br>
The `thuesen` dataset is part of the R package ISwR and contains data on blood glucose levels and ventricular shortening velocity (the `short_velocity` column). <br>
We create it for you here. <br>

In [None]:
import pandas as pd

# Create a dictionary with the data
data = {
    'blood_glucose': [4.03, 4.14, 4.21, 4.27, 4.35, 4.47, 4.57, 4.60, 4.61, 4.63, 4.65, 4.66, 4.67, 4.68, 4.70, 4.71, 4.72, 4.73, 4.74, 4.75, 4.76, 4.77, 4.78, 4.79, 4.80, 4.81, 4.82, 4.83, 4.84, 4.85],
    'short_velocity': [2.91, 2.99, 3.02, 3.05, 3.08, 3.12, 3.15, 3.17, 3.18, 3.19, 3.20, 3.21, 3.22, 3.23, 3.24, 3.25, 3.26, 3.27, 3.28, 3.29, 3.30, 3.31, 3.32, 3.33, 3.34, 3.35, 3.36, 3.37, 3.38, 3.39]
}

# Create a DataFrame
thuesen_df = pd.DataFrame(data)
thuesen_df


In [None]:
mean_list = thuesen_df.mean().to_list()
mean_list

In [None]:
mean_vector = thuesen_df.mean().values
mean_vector

---

## Functions

### Defining functions

In [None]:
def compute_sum(x):
    cleaned = x[~pd.isnull(x)]
    s = cleaned.sum()
    mean_val = cleaned.mean()
    return {"sum": s, "mean": mean_val}

compute_sum(np.array([1,2,-4,np.nan,6]))

In Python, a set of functions can be saved in a `.py` file (a module) and then imported into your script or notebook with the `import` statement.
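
A minimal sketch of this workflow follows. The module name `my_functions` and the function `add_ten` are hypothetical, and the file is written to a temporary directory only so that the example is self-contained; in practice you would create the `.py` file yourself next to your notebook and simply write `import my_functions`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a tiny module to a temporary directory (stand-in for your own .py file)
module_dir = Path(tempfile.mkdtemp())
(module_dir / "my_functions.py").write_text(
    "def add_ten(value):\n"
    "    return value + 10\n"
)

# Make the directory importable, then import the module by its file name
sys.path.insert(0, str(module_dir))
my_functions = importlib.import_module("my_functions")

result = my_functions.add_ten(5)
print(result)
```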

---

## Plotting

In [247]:
import matplotlib.pyplot as plt

In [None]:
plt.scatter(height, weight)
plt.title("height vs weight")
plt.xlabel("height")
plt.ylabel("weight")
plt.show()

Add a line:

In [None]:
hh = np.array([1.65, 1.70, 1.75, 1.80, 1.85, 1.90])
plt.scatter(height, weight)
plt.plot(hh, 22.5 * hh**2, color='red')
plt.show()

Save plot:

In [250]:
plt.figure()
plt.scatter(height, weight)
plt.plot(hh, 22.5 * hh**2, color='red')
plt.savefig("myplot.png")
plt.close()

---

## Probabilities and distributions


### Sampling

In [251]:
np.random.seed(123456)

In [None]:
# draw 5 integers between 1 and 40
np.random.choice(np.arange(1,41), 5, replace=False)

In [None]:
# coin tosses
np.random.choice(["H","T"], 10, replace=True)

In [None]:
# biased coin toss
np.random.choice(["H","T"], 10, replace=True, p=[0.8,0.2])

### Distributions with scipy.stats
In Python, you can use the `scipy.stats` module to work with probability distributions. The cells below compute the density (`pdf`), the distribution function (`cdf`), the quantiles (`ppf`), and pseudo-random numbers (`rvs`) for a normal distribution, and then plot the normal density.

In [255]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

In [256]:
# Parameters for the normal distribution
mean = 0
std_dev = 1

# Density (Probability Density Function)
x = np.linspace(-5, 5, 100)
density = norm.pdf(x, mean, std_dev)

In [None]:
# Distribution Function (Cumulative Distribution Function)
distribution = norm.cdf(x, mean, std_dev)
distribution

In [None]:
# Quantiles (Inverse of CDF)
quantiles = norm.ppf([0.025, 0.5, 0.975], mean, std_dev)
quantiles

In [None]:
# Random Numbers
random_numbers = norm.rvs(mean, std_dev, size=1000)
random_numbers

In [None]:
plt.plot(x, density, label='Normal Density')
plt.title('Normal Density Function')
plt.xlabel('x')
plt.ylabel('Density')
plt.legend()
plt.show()

#### Exercises

* Plot the probability mass function (the discrete counterpart of the density) of a binomial random variable $B(50, 0.33)$

In [None]:
from scipy.stats import binom

x = np.arange(51)
pmf_vals = binom.pmf(x, n=50, p=0.33)
plt.bar(x, pmf_vals)
plt.title("Binomial B(50,0.33)")
plt.show()

* Compute the probability of a normally distributed variable with mean 132 and standard deviation 13 being smaller than 160


In [None]:
norm.cdf(160, loc=132, scale=13)

### Quantiles

* Definition: the quantile function is the inverse of the distribution function. The $p$-quantile is the value such that the probability of obtaining a value lower than or equal to it is $p$. For example, the median is the 50% quantile.
* Quantiles are used for computing confidence intervals. Let $n$ independent observations be drawn from a normal distribution with mean $\mu$ and known standard deviation $\sigma$. The observed mean $\overline{x}$ then follows a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. A 95% confidence interval for $\mu$ can be obtained by

\begin{equation}
\overline{x} + \sigma/\sqrt{n} \times N_{0.025} \leq \mu \leq  \overline{x} + \sigma/\sqrt{n} \times N_{0.975}
\end{equation}
     
where $N_{0.025}$ and $N_{0.975}$ are the 2.5% and 97.5% quantiles of the standard normal distribution.

In [None]:
# 2.5% and 97.5% quantiles
q025 = norm.ppf(0.025)
q975 = norm.ppf(0.975)
q025, q975

Confidence interval:

In [None]:
xbar = 83
sigma = 12
n = 5
sem = sigma / np.sqrt(n)
lower = xbar + sem * q025
upper = xbar + sem * q975
(lower, upper)
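
The meaning of "95%" can be checked by simulation: if the procedure is repeated on many independent samples, about 95% of the resulting intervals should contain the true mean. The sketch below assumes illustrative "true" parameters ($\mu = 83$, $\sigma = 12$, $n = 5$) and an arbitrary seed.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility
mu, sigma, n = 83, 12, 5            # illustrative "true" parameters
n_repeats = 10_000

# Each row is one sample of n observations
samples = rng.normal(mu, sigma, size=(n_repeats, n))
xbars = samples.mean(axis=1)

sem = sigma / np.sqrt(n)
lower = xbars + sem * norm.ppf(0.025)
upper = xbars + sem * norm.ppf(0.975)

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= mu) & (mu <= upper))
print(coverage)
```

The printed fraction should be close to 0.95, up to simulation noise.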

### Generation of pseudo-random numbers


In [None]:
np.random.randn(10)  # standard normal

In [None]:
np.random.normal(7,5,10) # normal with mean=7, sd=5

In [None]:
np.random.binomial(20,0.5,10)

---

## Descriptive statistics

In [268]:
# Generate 50 random numbers from a standard normal distribution
x = np.random.randn(50)

In [None]:
# Compute the mean of the array
x.mean()

In [None]:
# Compute the sample variance (ddof=1 divides by n - 1 instead of n)
x.var(ddof=1)

In [None]:
# Compute the sample standard deviation (ddof=1 divides by n - 1 instead of n)
x.std(ddof=1)

In [None]:
# Compute the median of the array
np.median(x)

Quantiles:

In [273]:
# Define a vector of probabilities from 0 to 1 in increments of 0.1
pvec = np.arange(0, 1.1, 0.1)

In [None]:
# Compute the quantiles of the array x at the specified probabilities
np.quantile(x, pvec)

### Plotting distributions

* Histogram and empirical cumulative distribution (the empirical distribution function at $x$ is defined as the number of data points smaller than or equal to $x$ divided by the total number of points)


In [None]:
# Parameters for the normal distribution
mean = 0
std_dev = 1

# Generate random numbers
random_numbers = norm.rvs(mean, std_dev, size=1000)

# Sort the random numbers
sorted_numbers = np.sort(random_numbers)

# Compute the ECDF values
ecdf = np.arange(1, len(sorted_numbers) + 1) / len(sorted_numbers)

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Plot Histogram
axes[0].hist(random_numbers, bins=30, density=True, alpha=0.6, color='g')
axes[0].set_title('Histogram of Normal Distribution')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Density')  # density=True normalizes the histogram

# Plot ECDF
axes[1].plot(sorted_numbers, ecdf, marker='.', linestyle='none')
axes[1].set_title('Empirical Cumulative Distribution Function (ECDF)')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('ECDF')

# Show the plots
plt.tight_layout()
plt.show()
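
The definition of the empirical distribution function given above can also be evaluated directly at any point. The helper `ecdf_at` below is a hypothetical name introduced for this sketch, and the data are made up.

```python
import numpy as np

def ecdf_at(data, t):
    """Fraction of observations in `data` that are <= t."""
    data = np.asarray(data)
    return np.mean(data <= t)

observations = np.array([3.0, 1.0, 2.0, 2.0, 5.0])
print(ecdf_at(observations, 2.0))   # 3 of 5 values are <= 2
print(ecdf_at(observations, 0.0))   # no value is <= 0
print(ecdf_at(observations, 5.0))   # all values are <= 5
```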

Boxplot:

* In a boxplot, the box in the middle of the graph spans the first and third quartiles, and the line inside it marks the median. The two whiskers extend to the largest and the smallest observations that lie within 1.5 times the height of the box (the interquartile range) from the box edges. Observations beyond the whiskers are considered "extremes" and are drawn as individual points.

In [None]:
# Parameters for the normal distribution
mean = 0
std_dev = 1

# Generate random numbers
random_numbers = norm.rvs(mean, std_dev, size=1000)

# Create a figure
fig, ax = plt.subplots(figsize=(8, 6))

# Plot Boxplot
ax.boxplot(random_numbers)
ax.set_title('Boxplot of Normal Distribution')
ax.set_ylabel('Value')  # the boxplot is vertical, so values are on the y-axis

# Show the plot
plt.tight_layout()
plt.show()
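
The quantities drawn in a boxplot can be computed by hand. The sketch below uses a small made-up data set and assumes the 1.5 rule described above (which is also Matplotlib's default, `whis=1.5`).

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                       # height of the box

# Whiskers reach the most extreme observations within 1.5 * IQR of the box
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr
inside = values[(values >= low_limit) & (values <= high_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the whiskers are drawn individually
outliers = values[(values < low_limit) | (values > high_limit)]
print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the value 30 lies above the upper limit, so it would appear as an isolated point, while the upper whisker stops at 9.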

---

## Grouped data
This section presents some techniques for comparing distributions between groups.

Consider the following dataset:

In [277]:
energy_data = pd.DataFrame({
    "expend": [9.21,7.53,7.48,8.08,8.09,10.15,8.40,10.88,6.13,7.90,
               11.51,12.79,7.05,11.85,9.97,7.48,8.79,9.69,9.68,7.58,9.19,8.11],
    "stature": ["obese","lean","lean","lean","lean","lean","lean","lean","lean","lean",
                "obese","obese","lean","obese","obese","lean","obese","obese","obese","lean","obese","lean"]
})

Histograms by group:

In [None]:
expend_lean = energy_data.loc[energy_data.stature=="lean","expend"]
expend_obese = energy_data.loc[energy_data.stature=="obese","expend"]

plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
plt.hist(expend_lean, bins=10, color="white", edgecolor='black')
plt.xlim(5,13)
plt.ylim(0,4)
plt.title("Lean")

plt.subplot(1,2,2)
plt.hist(expend_obese, bins=10, color="grey", edgecolor='black')
plt.xlim(5,13)
plt.ylim(0,4)
plt.title("Obese")

plt.tight_layout()
plt.show()

Parallel boxplots:

In [None]:
plt.boxplot([expend_lean, expend_obese])
plt.xticks([1, 2], ["Lean", "Obese"])  # boxes are drawn at positions 1 and 2
plt.title("Expenditure by Stature")
plt.show()

---

## Tables

* A two-dimensional table can be built with `np.array` and labelled with a pandas `DataFrame`. The following example concerns caffeine consumption among women giving birth, broken down by marital status.

In [None]:
caff_marital = np.array([
 [652,1537,598,242],
 [36,46,38,21],
 [218,327,106,67]
])

rows = ["Married","Prev.married","Single"]
cols = ["0","1-150","151-300",">300"]
caff_df = pd.DataFrame(caff_marital, index=rows, columns=cols)
caff_df

Marginal tables:

In [None]:
caff_df.sum(axis=1) # row sums

In [None]:
caff_df.sum(axis=0) # column sums

Relative frequencies:

In [None]:
caff_df.apply(lambda r: r/r.sum(), axis=1) # row-wise proportions

In [None]:
caff_df / caff_df.sum().sum() # overall proportions


### Graphical display of tables

Bar plots:

In [None]:
total_caff = caff_df.sum(axis=1)
total_caff.plot.bar(color='black')
plt.show()

Stacked barplot:

In [None]:
caff_df.plot(kind='bar', stacked=True)
plt.show()

Side-by-side (transpose and plot):

In [None]:
caff_df.T.plot.bar(figsize=(6,4))
plt.show()

Normalized proportions:

In [None]:
(caff_df.T / caff_df.T.sum()).plot.bar(stacked=True)
plt.show()

Create a scatterplot with the same information as the previous bar plots

In [None]:
# Flatten the table and plot it as a dot chart
vals = caff_df.values.flatten()
labels = [f"{r} / {c}" for r in rows for c in cols]  # one label per cell, row by row
ypos = np.arange(len(vals))
plt.scatter(vals, ypos)
plt.yticks(ypos, labels)
plt.title("Dot chart of caff_marital")
plt.show()

Pie charts:

In [None]:
fig, axes = plt.subplots(1,3, figsize=(8,8))
for ax, row in zip(axes.flat, rows):
    ax.pie(caff_df.loc[row], labels=cols, autopct='%1.1f%%')
    ax.set_title(row)
plt.tight_layout()
plt.show()

---

## Common Errors

- Mixing data types: in pandas DataFrames, combining rows or columns of different lengths or data types leads to missing values (`NaN`) or errors.

- Missing data: pandas uses `NaN` for missing data. Accessing a column that does not exist (e.g. `df["c"]`) raises a `KeyError`. `NaN` propagates through NumPy computations such as `np.mean`, whereas pandas reductions such as `.mean()` skip `NaN` by default.
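
A minimal sketch of the behaviour around missing data and missing columns, using made-up values:

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, np.nan, 4.0])

numpy_mean = np.mean(values)             # NaN propagates: the result is NaN
numpy_nanmean = np.nanmean(values)       # explicitly ignores NaN
pandas_mean = pd.Series(values).mean()   # skips NaN by default

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
try:
    df["c"]                              # column "c" does not exist
    missing_column_error = None
except KeyError as err:
    missing_column_error = err

print(numpy_mean, numpy_nanmean, pandas_mean, repr(missing_column_error))
```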

---

## Debugging

For debugging, Python provides `print()` statements, `assert` statements, `pdb` (Python debugger), or IDE features like breakpoints.

In [None]:
def f(a): return g(a)
def g(b): return h(b)
def h(c): return i(c)
def i(d):
    if not isinstance(d, (int,float)):
        raise ValueError("`d` must be numeric")
    return d + 10

try:
    f("a")
except Exception as e:
    print("Error:", e)

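Because the exception is caught by `try`/`except`, only the error message is printed. Without the `try`/`except`, Python prints a full traceback showing the chain of calls (`f`, `g`, `h`, `i`) that led to the error. We can also use `import pdb; pdb.set_trace()` inside functions to debug step-by-step.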
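
The following sketch combines two of the tools mentioned above: an `assert` statement to check an assumption, and the standard `traceback` module to recover the call chain even when the exception is caught. The function names `add_ten` and `outer` are illustrative.

```python
import traceback

def add_ten(d):
    # assert documents an assumption and fails loudly when it is violated
    assert isinstance(d, (int, float)), "`d` must be numeric"
    return d + 10

def outer(a):
    return add_ten(a)

try:
    outer("a")
    trace_text = ""
except AssertionError:
    # format_exc() returns the traceback that Python would have printed
    trace_text = traceback.format_exc()

print(trace_text)
```

Note that assertions are skipped when Python runs with the `-O` flag, so they are suited to debugging rather than to validating user input.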

---

## Session info-like commands in Python

In [None]:
import sys
import platform

print(sys.version)            # Python version
print(platform.platform())    # operating system and architecture

# To list installed packages (IPython/Jupyter shell escape):
!pip list
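
To report only the versions of the packages used in this notebook, one option is the standard `importlib.metadata` module (Python 3.8 and later). This is a sketch; the list of package names is an assumption based on the imports above.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages used in this notebook
package_names = ["numpy", "pandas", "matplotlib", "scipy"]

installed_versions = {}
for name in package_names:
    try:
        installed_versions[name] = version(name)
    except PackageNotFoundError:
        installed_versions[name] = None   # not installed in this environment

for name, ver in installed_versions.items():
    print(f"{name}: {ver}")
```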

---
