# Linear regression: project
**Joris, Shan, André LIMONIER**

<br>

The linear model is one of the most widely used models in statistics for relating a response variable to explanatory variables.\
We study this model in more detail here: I give you a data set, and you answer the questions below.

Our aim is to focus on the prediction for new observations and to quantify its accuracy.

1. Split the dataset at random into two parts: a first one named the training set and a second one named the validation set.
2. Using the training set, estimate the parameters of the linear model, at first with all the explanatory variables.
3. Then consider the validation set.
   1. For each individual in it, compute the prediction given by the model built previously.
   2. Compute the test error associated with this validation set.
4. Repeat the split several times to compare this test error.
   1. Each time, compare it to the training error.
   2. Which error is better and why?
5. Now give the confidence interval for the observation of the response variable associated with a new individual for whom we know the values of the explanatory variables.
6. What assumptions are needed to compute this confidence interval, and how can one check whether they are satisfied in practice?
7. Now try to identify the explanatory variables that are really needed to obtain correct predictions.
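
For reference, the quantities behind these steps can be written as follows. The notation is ours and does not come from the assignment: we assume the standard ordinary least squares setting, where $X$ is the matrix of explanatory variables with an added column of ones for the intercept, $\varepsilon$ is the error term, and $m$ is the number of individuals in the set (training or validation) on which the error is computed.

$$
y = X\beta + \varepsilon, \qquad
\hat{\beta} = (X^\top X)^{-1} X^\top y, \qquad
\hat{y} = X\hat{\beta}, \qquad
\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2
$$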


In [1]:
# data manipulation
import numpy as np
import pandas as pd
from scipy import stats

# plotting
import plotly.express as px
import plotly.figure_factory as ff

# sklearn imports
from sklearn.linear_model import LinearRegression


In [2]:
FILE_NAME1 = "reg1.txt"
FILE_NAME2 = "regm.txt"

df = pd.read_csv(
    filepath_or_buffer=FILE_NAME1, # choose which dataset to use
    header=None,
    sep=" ",
)


def rename_col(df):
    """
    set the last column to "target" and the others to feat0, feat1, etc
    """
    n_col = len(df.columns)
    df.columns = [f"feat{i}" for i in range(n_col - 1)] + ["target"]


rename_col(df)


## Question 1 ─ Train-test split


In [3]:
def my_tt_split(X, y, train_size):
    """
    define a custom and simplistic version of sklearn's `train_test_split`
    """

    assert len(X) == len(y), "X and y should have the same length"

    idx = np.arange(len(X))  # compute indices
    np.random.shuffle(idx)  # shuffle indices
    X_shuffled = X.values[idx]  # apply to X
    y_shuffled = y.values[idx]  # apply to y

    train_thresh = int(train_size * len(X))
    X_train = X_shuffled[:train_thresh]
    X_test = X_shuffled[train_thresh:]
    y_train = y_shuffled[:train_thresh]
    y_test = y_shuffled[train_thresh:]

    X_train = pd.DataFrame(data=X_train, columns=X.columns)
    X_test = pd.DataFrame(data=X_test, columns=X.columns)
    y_train = pd.Series(data=y_train, name="target")
    y_test = pd.Series(data=y_test, name="target")

    return X_train, X_test, y_train, y_test


In [4]:
X = df[df.columns[:-1]]
y = df["target"]

X_train, X_test, y_train, y_test = my_tt_split(X, y, train_size=0.8)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((40, 1), (10, 1), (40,), (10,))
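
The split above relies on NumPy's global random state, so the rows that end up in each set change from one run to the next. A minimal sketch of a reproducible variant is given below. It assumes that passing a seed is acceptable for the project; `split_indices` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def split_indices(n_samples, train_size, seed=None):
    """Return shuffled train and test index arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # local generator, global state untouched
    idx = rng.permutation(n_samples)
    n_train = int(train_size * n_samples)
    return idx[:n_train], idx[n_train:]


# 50 rows and train_size=0.8, as in the cell above
train_idx, test_idx = split_indices(n_samples=50, train_size=0.8, seed=0)
print(len(train_idx), len(test_idx))

# the same seed gives the same split again
again_train, again_test = split_indices(n_samples=50, train_size=0.8, seed=0)
print(np.array_equal(train_idx, again_train))
```

The index arrays can then be applied to the data with `X.iloc[train_idx]` and `y.iloc[train_idx]`.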

We build separate train and test dataframes, because using the test data for anything other than computing and evaluating predictions results in data leakage.

For instance, we will compute the correlation on the train data (between the features and the target variable), because computing it on the test data would be "cheating".

In [5]:
# make train and test dataframes with X and y
df_train = X_train.join(y_train)
df_test = X_test.join(y_test)

## Question 2 ─ Linear model (all variables)


### Perform small EDA (Exploratory Data Analysis)


In [6]:
ff.create_scatterplotmatrix(
    df=df_train,
    diag="histogram",
    # index="target",
    size=3,
    colormap_type="cat",
    width=800,
    height=800,
)


In [7]:
corr = df_train.corr().round(4)

print("Correlation matrix heatmap")
ff.create_annotated_heatmap(
    z=corr.values,
    x=list(corr.index),
    y=list(corr.columns),
    showscale=True,
    colorscale=px.colors.sequential.Bluered_r,
)


Correlation matrix heatmap


In [8]:
lr = LinearRegression()
lr.fit(X_train, y_train)


LinearRegression()
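
To make explicit what `fit` estimates, the sketch below solves the least squares problem directly with NumPy and compares the result with `LinearRegression`. It runs on synthetic data (the `*_demo` names and the coefficients 1.5, 2.0 and -1.0 are made up for illustration), not on `reg1.txt`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = 1.5 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=40)

# add a column of ones so that the first coefficient is the intercept
design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta_hat, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

lr_demo = LinearRegression().fit(X_demo, y_demo)
print(beta_hat)                            # intercept, then slopes
print(lr_demo.intercept_, lr_demo.coef_)   # same values, up to rounding
```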

## Question 3 ─ Predict and evaluate


### Predict


In [9]:
y_pred = lr.predict(X_test)
px.scatter(pd.DataFrame({"true_values": y_test, "pred_values": y_pred}))

### Evaluate


In [10]:
mse = np.mean((y_test - y_pred) ** 2)
mse

0.06209664929647207
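
The MSE is expressed in squared units of the target. As a small sketch on made-up numbers, the same quantity can be obtained with `sklearn.metrics.mean_squared_error`, and its square root (RMSE) is back in the units of the target, which is often easier to interpret.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# toy values, chosen only for illustration
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 4.0])

mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse_manual)  # same units as the target

print(mse_manual, mse_sklearn, rmse)
```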

## Question 4 ─ Make several splittings


### Compare training and test errors


In [11]:
def compute_n_mse(n_reps, X, y):
    """
    Perform `n_reps` splits of `X` and `y` and compute the MSE on each repetition.
    """

    store_mse = []

    for _ in range(n_reps):

        # perform train-test split
        X_train, X_test, y_train, y_test = my_tt_split(X, y, train_size=0.8)

        # fit regressor
        lr = LinearRegression()
        lr.fit(X_train, y_train)

        # predict
        y_pred_train = lr.predict(X_train)
        y_pred_test = lr.predict(X_test)

        # compute MSE
        mse_train = np.mean((y_train - y_pred_train) ** 2)
        mse_test = np.mean((y_test - y_pred_test) ** 2)

        # append the (train, test) pair to the list of all MSEs
        store_mse.append([mse_train, mse_test])

    # one row per repetition, columns: train MSE, test MSE
    return np.array(store_mse).reshape(-1, 2)


n_reps = 1000
store_mse = compute_n_mse(n_reps=n_reps, X=X, y=y)

df_mse = pd.DataFrame(data=store_mse, columns=["train", "test"])
px.histogram(
    data_frame=df_mse,
    opacity=0.6,
    barmode="overlay",
    title="MSE comparison between train and test sets",
)


### Which error is better? Why?

On average, the train error is better (lower), because the parameters of our linear model are chosen precisely to minimize the error on this data. When the model is then applied to previously unseen data, the parameters are still expected to perform well, but usually somewhat worse than on the train data. The test error is nevertheless the more reliable estimate of how the model behaves on new observations.\
We can see however that in a non-negligible fraction of the splits, we are "lucky" and the parameters perform better on test data than on train data.
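
The comparison can also be summarized with numbers rather than read off the histogram. The sketch below does so on synthetic data (the slope 2.0 and noise level 0.25 are arbitrary choices, not properties of `reg1.txt`): it reports the mean of each error and the fraction of "lucky" splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples, n_reps = 50, 200
X_demo = rng.normal(size=(n_samples, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.25, size=n_samples)

errors = np.empty((n_reps, 2))  # columns: train MSE, test MSE
for rep in range(n_reps):
    idx = rng.permutation(n_samples)
    train, test = idx[:40], idx[40:]  # 80% / 20% split
    model = LinearRegression().fit(X_demo[train], y_demo[train])
    errors[rep, 0] = np.mean((y_demo[train] - model.predict(X_demo[train])) ** 2)
    errors[rep, 1] = np.mean((y_demo[test] - model.predict(X_demo[test])) ** 2)

mean_train, mean_test = errors.mean(axis=0)
# fraction of splits where the test error is lower than the train error
lucky_fraction = np.mean(errors[:, 1] < errors[:, 0])
print(mean_train, mean_test, lucky_fraction)
```

On the notebook's own results, the same fraction would be `(df_mse["test"] < df_mse["train"]).mean()`.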


## Question 5 ─ Confidence interval


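One way to obtain the interval is a kind of bootstrap on the predicted value: given a `new_sample`, predict its value with each of the previously fitted linear regressors, sort the predicted values, and take the 2.5th and 97.5th percentiles (for a 95% confidence interval). The function below uses a closed-form shortcut instead: it centers the interval on the prediction of a single fitted regressor and takes as margin a Student $t$ quantile times the standard error of the training targets.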
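
A percentile bootstrap of the predicted value can be sketched as follows. This is only an illustration on synthetic data, and the resampling scheme (rows of the training set drawn with replacement, 500 refits) is our assumption. Note that such an interval reflects the uncertainty of the fitted model at `new_sample`, not the noise of a single new observation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(40, 1))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng.normal(scale=0.25, size=40)
new_sample = np.array([[0.5]])

n_boot = 500
boot_preds = np.empty(n_boot)
for b in range(n_boot):
    # resample the training rows with replacement and refit
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    model = LinearRegression().fit(X_demo[rows], y_demo[rows])
    boot_preds[b] = model.predict(new_sample)[0]

# 2.5th and 97.5th percentiles give a 95% interval
ci_lower, ci_upper = np.percentile(boot_preds, [2.5, 97.5])
print(ci_lower, ci_upper)
```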


In [12]:
new_samples = X_test[:1]  # some example of new sample


def compute_ci(lr, new_data, y_train, alpha=0.05):
    """
    return the lower and upper bounds of the confidence interval
    at confidence level `1 - alpha` (95% for `alpha=0.05`)
    """
    y_pred = lr.predict(new_data)
    # standard error of the mean of the training targets
    # (a rough proxy: it does not use the residuals of the regression)
    std = stats.sem(y_train)

    quantile = 1 - alpha / 2
    n_samples = len(y_train)
    t_score = stats.t.ppf(quantile, n_samples - 1)
    eccentricity = t_score * std  # half-width of the interval

    # compute lower and upper bounds of the interval
    conf_lower = np.round(y_pred - eccentricity, decimals=4)
    conf_upper = np.round(y_pred + eccentricity, decimals=4)

    return conf_lower, conf_upper


df_ci = pd.DataFrame()

# we use X_test as the new data
# but it could be some other unknown data
df_ci["ci_lower"], df_ci["ci_upper"] = compute_ci(
    lr=lr,
    new_data=X_test, 
    y_train=y_train,
)
df_ci["target"] = y_test
df_ci = df_ci.sort_values("target", ignore_index=True)
px.line(df_ci)
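
For comparison, the textbook interval for a new observation under the Gaussian linear model uses the residual variance and the position of the new point: $\hat{y}_0 \pm t_{1-\alpha/2,\,n-k}\,\hat{\sigma}\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}$, where $k$ counts the coefficients including the intercept. The sketch below implements it on synthetic data; `prediction_interval` is a hypothetical helper, not the method used in the cell above.

```python
import numpy as np
from scipy import stats


def prediction_interval(X_tr, y_tr, x_new, alpha=0.05):
    """Textbook OLS prediction interval at level 1 - alpha (hypothetical helper)."""
    design = np.column_stack([np.ones(len(X_tr)), X_tr])
    new_row = np.concatenate([[1.0], np.atleast_1d(x_new)])
    n, k = design.shape  # k counts the intercept

    beta_hat, *_ = np.linalg.lstsq(design, y_tr, rcond=None)
    residuals = y_tr - design @ beta_hat
    sigma2_hat = residuals @ residuals / (n - k)  # unbiased noise variance

    xtx_inv = np.linalg.inv(design.T @ design)
    leverage = new_row @ xtx_inv @ new_row
    t_score = stats.t.ppf(1 - alpha / 2, df=n - k)
    margin = t_score * np.sqrt(sigma2_hat * (1.0 + leverage))

    y_hat = new_row @ beta_hat
    return y_hat - margin, y_hat + margin


rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(40, 1))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng.normal(scale=0.25, size=40)
lower, upper = prediction_interval(X_demo, y_demo, x_new=[0.5])
print(lower, upper)
```

The `1.0 +` term under the square root accounts for the noise of the new observation itself. Dropping it gives the narrower confidence interval for the mean response.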


## Question 6 ─ Assumptions to compute confidence interval


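In order to compute the confidence intervals, we assume that the `new_data` that is passed is drawn from the same unknown distribution as the data used to train our regressor. Since the interval relies on a Student $t$ quantile, it also assumes that the errors of the linear model are independent, normally distributed, and of constant variance.

To see if the first assumption is satisfied in practice, one can verify that each feature falls within a reasonable range when compared to the training data. Checking the value of the z-score or using a $\chi^2$-test is possible. The assumptions on the errors can be checked on the residuals of the fitted model, for example with a normal Q-Q plot and a plot of the residuals against the fitted values.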
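
Both kinds of checks can be sketched in a few lines. The example below uses synthetic data, a z-score threshold of 3 (a common convention, not a rule), and a Shapiro-Wilk test on the residuals as one possible normality check among others. All of these choices are ours.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=2.0, size=(40, 1))
y_demo = 1.0 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.25, size=40)

# 1) does the new sample fall within the range of the training features?
new_sample = np.array([30.0])  # deliberately far from the training data
z_scores = (new_sample - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)
out_of_range = np.abs(z_scores) > 3

# 2) are the residuals of the fit plausibly Gaussian?
design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta_hat, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
residuals = y_demo - design @ beta_hat
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(out_of_range, shapiro_p)
```

A small p-value in the second check would be evidence against normality of the errors. A large one does not prove normality, it only fails to contradict it.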


## Question 7 ─ Remove useless variables


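**Not applicable:** the dataset used here (`reg1.txt`) has a single explanatory variable, so there is no variable to remove.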
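
For a dataset with several explanatory variables, one common approach is to test each coefficient against zero with a Student $t$-test and to examine the variables whose p-value is large. The sketch below illustrates this on synthetic data in which, by construction, only the first two features influence the target. It makes no claim about the content of `regm.txt`, and the feature names are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
X_demo = rng.normal(size=(n, 3))
# only the first two columns influence the target; the third is pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

design = np.column_stack([np.ones(n), X_demo])
k = design.shape[1]  # number of coefficients, intercept included
beta_hat, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
residuals = y_demo - design @ beta_hat
sigma2_hat = residuals @ residuals / (n - k)

# standard errors, t statistics and two-sided p-values for each coefficient
std_err = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(design.T @ design)))
t_stats = beta_hat / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)

for name, p in zip(["intercept", "feat0", "feat1", "feat2"], p_values):
    print(f"{name}: p = {p:.4f}")
```

These tests rely on the same Gaussian assumptions as the confidence interval, and a variable should not be dropped on the basis of a single p-value alone. Comparing the test error with and without the variable, as in Question 4, is a useful complement.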