### Predictive modelling with Python

*Jure Žabkar*

*Wed, 5 March 2025*

---

On Aug 2nd 2008, abcNEWS published an [article](https://abcnews.go.com/Health/Fitness/story?id=5499878&page=1) on obesity in America. Here's an interesting statement from it:

*if current overweight and obesity trends continue, 86 percent of Americans could be overweight or obese by the year 2030.
Even more troubling, the authors note, "By 2048, all American adults would become overweight or obese."*

"If obesity trends continue..." How did the researchers arrive at these trends, and why should they have been more careful in interpreting them?

![Obesity Apocalypse](attachment:abcNEWS.png)
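To see why such projections need care, here is a minimal sketch that takes only the two figures quoted above (86% in 2030, 100% in 2048) and assumes they lie on a single straight line. That linearity is an assumption made for illustration; the article does not describe the researchers' actual model here. The helper name `projected_percent` is hypothetical.

```python
# Two figures quoted in the article: (year, percent overweight or obese).
year_a, pct_a = 2030, 86.0
year_b, pct_b = 2048, 100.0

# Assumption: the trend is a straight line through these two points.
slope = (pct_b - pct_a) / (year_b - year_a)   # percentage points per year


def projected_percent(year):
    """Linear projection through the two quoted points (illustrative only)."""
    return pct_a + slope * (year - year_a)


for year in (2030, 2048, 2070):
    print(year, round(projected_percent(year), 1))
```

Under this assumption the projection for 2070 exceeds 100%, which is impossible for a percentage. A straight line fitted to a limited range of data says nothing reliable about what happens far outside that range.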

Let's try a similar study, this time on rats (mice might have been cuter, but they don't get that big).

### Obesity Apocalypse in Rats

We will learn how to:
* import libraries
* import data from a csv file
* get to know the data: head, describe, info
* print out a DataFrame
* plot the data
* reshape the data
* fit a linear regression model and predict with it
* plot the predictions of the regression model (regression line)
* measure errors on the training set: mse, mae

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
rats = pd.read_csv("datasets/rats-obesity.csv")
print(rats.head())

     weight       size
0  3.502132  22.271559
1  5.321947  26.007667
2  1.000686   7.232203
3  2.813995  22.490089
4  1.880535  12.433266


In [3]:
print(rats.describe())

          weight       size
count  10.000000  10.000000
mean    2.887776  18.523853
std     1.305650   5.813336
min     1.000686   7.232203
25%     1.939792  14.659924
50%     2.943680  20.930048
75%     3.471750  22.435456
max     5.321947  26.007667


In [4]:
print(rats.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   weight  10 non-null     float64
 1   size    10 non-null     float64
dtypes: float64(2)
memory usage: 292.0 bytes
None


In [5]:
rats

     weight       size
0  3.502132  22.271559
1  5.321947  26.007667
2  1.000686   7.232203
3  2.813995  22.490089
4  1.880535  12.433266
5  1.554032  14.148107
6  2.117561  16.195376
7  3.073364  21.640214
8  3.380605  20.219881
9  4.232900  22.600167
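If the CSV file is not at hand, the ten rows printed above are enough to rebuild the DataFrame by typing them in. This is only a sketch for following along; the values are copied from the printout, and the variable name `rats` matches the notebook.

```python
import pandas as pd

# The ten rows printed above, entered by hand instead of read from CSV.
rats = pd.DataFrame({
    "weight": [3.502132, 5.321947, 1.000686, 2.813995, 1.880535,
               1.554032, 2.117561, 3.073364, 3.380605, 4.2329],
    "size": [22.271559, 26.007667, 7.232203, 22.490089, 12.433266,
             14.148107, 16.195376, 21.640214, 20.219881, 22.600167],
})

print(rats.shape)        # (rows, columns)
print(rats.describe())   # should agree with the summary shown earlier
```

One thing to watch: `rats.size` is a built-in DataFrame attribute (the total number of cells), so the column has to be accessed as `rats["size"]`, as the notebook does.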


In [6]:
rats.plot.scatter(x='weight', y='size');
# or: rats.plot(kind="scatter", x='weight', y='size');
plt.xlabel("weight [dag]");
plt.ylabel("size [cm]");

In [7]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

In [8]:
x = rats["weight"]
y = rats["size"]

In [9]:
# reshape(-1, 1) asks NumPy for a 2D array with exactly 1 column and
# as many rows as needed to hold the data. The result has shape (n, 1),
# where n is the number of elements in the original array.
# scikit-learn expects the feature matrix in this 2D form.
x = x.values.reshape(-1, 1)
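The effect of `reshape(-1, 1)` is easiest to see on a tiny array. The sketch below uses a made-up three-element array (not the rat data) and prints the shape before and after.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])   # made-up 1D array with 3 elements
col = a.reshape(-1, 1)          # -1 lets NumPy work out the number of rows

print(a.shape)    # (3,)
print(col.shape)  # (3, 1)
print(col)
```

The values are unchanged; only the layout differs. Each element now sits in its own row, which is how scikit-learn reads "one sample per row, one feature per column".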

In [10]:
lin_reg.fit(x, y)
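After fitting, the model is just a line, size ≈ intercept + slope · weight. The self-contained sketch below refits the same ten rows (typed in from the printout above) and reads the two parameters from `coef_` and `intercept_`. The query weights 3.0 and 50.0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The ten (weight, size) pairs printed earlier in this notebook.
weight = np.array([3.502132, 5.321947, 1.000686, 2.813995, 1.880535,
                   1.554032, 2.117561, 3.073364, 3.380605, 4.2329])
size = np.array([22.271559, 26.007667, 7.232203, 22.490089, 12.433266,
                 14.148107, 16.195376, 21.640214, 20.219881, 22.600167])

x = weight.reshape(-1, 1)            # 2D feature matrix, shape (10, 1)
lin_reg = LinearRegression().fit(x, size)

slope = lin_reg.coef_[0]             # one coefficient per feature
intercept = lin_reg.intercept_
print(f"size = {intercept:.2f} + {slope:.2f} * weight")

# A weight inside the observed range (about 1.0 to 5.3).
inside = lin_reg.predict(np.array([[3.0]]))[0]
# A weight far outside the observed range: pure extrapolation.
outside = lin_reg.predict(np.array([[50.0]]))[0]
print(f"predicted size at weight 3.0:  {inside:.2f}")
print(f"predicted size at weight 50.0: {outside:.2f}")
```

The second prediction is the rat version of the 2048 headline: the model returns a number for any weight, but nothing in ten observations between roughly 1 and 5.3 supports trusting the line that far out.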

In [11]:
y_predicted = lin_reg.predict(x)   # predictions on the training data
plt.plot(x, y, "b.");              # observed data as blue dots
plt.plot(x, y_predicted, "r-");    # fitted regression line in red

In [12]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

$R^2$ is the proportion of the variance in $y$ that can be predicted from $x$.

(Explained variance score: 1 is perfect prediction)

RMSE vs MAE:
- RMSE measures the standard deviation of the prediction errors.
  The standard deviation is the square root of the *variance* (= the mean of the squared deviations from the mean).
- For example, RMSE = 5000 means that, if the errors are roughly normally distributed around zero:
    - about 68% of predictions fall within 5000 of the true value and
    - about 95% fall within 10000.
- RMSE is more sensitive to outliers (the higher the norm, the greater the sensitivity),
  so with a large number of outliers we prefer MAE.
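The definitions above can be checked by computing the metrics by hand and comparing them with scikit-learn. The arrays below are made-up toy values, not the rat data, and the helper `report` is a hypothetical name used only in this sketch.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


def report(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 directly from their definitions."""
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(errors))
    ss_res = np.sum(errors ** 2)                       # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, mae, r2


# Made-up toy values: every prediction is off by exactly 1.
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0, 19.0])
mse, rmse, mae, r2 = report(y_true, y_pred)
print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.3f}")

# Same predictions, but the last one is now off by 10 (an outlier).
y_pred_out = y_pred.copy()
y_pred_out[-1] = 28.0
mse_o, rmse_o, mae_o, r2_o = report(y_true, y_pred_out)
print(f"MSE={mse_o:.2f} RMSE={rmse_o:.2f} MAE={mae_o:.2f} R2={r2_o:.3f}")
```

With a single large error, RMSE grows much faster than MAE because the error is squared before averaging. That is the outlier sensitivity described in the list above.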

In [13]:
print("Model performance on TRAIN data:")
print(f"Mean squared error: {mean_squared_error(y, y_predicted):.2f}")
print(f"Mean absolute error: {mean_absolute_error(y, y_predicted):.2f}")
print(f'Variance score: {r2_score(y, y_predicted):.2f}')

Model performance on TRAIN data:
Mean squared error: 5.20
Mean absolute error: 1.94
Variance score: 0.83
