# TP regression sur des images

* Dans ce TP, vous mettrez en oeuvre vos connaissances en régression pour prédire l'age d'individus en fonction de leur visage.

* dans la première étape du TP, vous devrez télécharger le jeu de données pour constituer les matrices X_train et Y_train (jeu d'entrainement) et X_test et Y_test (jeu de test). Ensuite, vous devrez utiliser des techniques de régression s'appuyant sur la sélection de variables, pour que le résultat soit plus interprétable

## Chargement du jeu de données.

* Téléchargez [le jeu de données ici](https://www.lamsade.dauphine.fr/~ychevaleyre/data/face_age.zip). Il faut d'abord le dézipper.
* Dans le répertoire, il y a une centaine de sous-répertoires. Dans chacun de ces sous-répertoire, vous trouvez des images d'individus dont l'age correspond à leur répertoire. Chaque image est un fichier PNG couleur de taille 200x200.

* Ecrivez une fonction `chargeImage` qui prend un nom de fichier image, qui charge l'image (vous utiliserez la fonction `imread` de `matplotlib.image`) et la convertit en un tableau numpy 2D de taille 25x25. Pour cela, vous utiliserez entre autres `scipy.ndimage.interpolation.zoom(X,.125)` qui divise par 8 la taille de l'image. Il faudra aussi d'abord fusionner les 3 canaux de couleur RGB en un seul (c'est à dire faire la moyenne des trois canaux pour obtenir un seul nombre qui représente une nuance de gris).

* Faites une fonction `chargeDossiers` qui utilise `chargeImage` pour charger les 30 premiers fichiers image des répertoires $5,10,15,20,25,\ldots$, et les place dans une liste. Note: pour avoir la liste de tous les noms de fichier dans le répertoire `dossier`, il faut utiliser la fonction `listdir` du package `os`. Par exemple on peut faire `listdir('/')`.

## Construction du tableau de données

Maintenant que les images sont chargés, il faut pour chaque image créer un vecteur la représentant. On en peut pas directement faire une régression linéaire sur les pixels, car ceux ci ne contiennent pas assez d'information.

* Créez une fonction `diffHorizontal` qui, pour une image $I$ donnée, renvoie une image $H$ telle que $H(x,y)=I(x+1,y)-I(x,y)$. Attention: en faisant cette opération, $H$ contiendra une colonne de moins que $I$. Pour que $H$ ai le même nombre de colonnes que $I$, on dupliquera la dernière colonne de $H$. Créez une fonction `diffVertical` qui, pour une image $I$ donnée, renvoie une image $V$ telle que $V(x,y)=I(x,y+1)-I(x,y)$. Idem pour la dernière ligne de $V$.

* Crée une fonction `encodeImage` qui, pour une image $I1$, renvoie un tableau 1D la représentant. Voici comment ce tableau est construit. A partir de l'image $I1$ de taille 25x25, on crée les images $I2$ de taille 12x12, $I3$ de taille 6x6, $I4$ de taille 3x3. Pour chacune de ces image $Ik$, on crée les images $Hk$ et $Vk$ avec les fonctions ci-dessus. Toutes ces images sont ensuite applaties et concaténées en un grand tableau 1D.

* On peut donc regrouper toutes les images dans un grand tableau 2D $X$ ou chaque ligne correspond à une image, et on crée aussi un tableau $Y$ qui correspond à l'age des images correspondantes.

* Séparez ces tableau en deux, pour avoir des données d'apprentissage et des données de test. Vous n'oublierez pas de mélanger les images (les ages doivent être uniformément répartis dans les données)

* Appliquez une régression linéaire classique avec la méthode matricielle vue précédemment. Calculez le score $r2$ comme vu au précédent TP, sur le jeu de test. Que pouvez vous dire du résultat ?

# Sélection de variables et interprétabilité

On considère comme interprétable un modèle linéaire qui s'appuie sur un petit nombre de variables.
On utilisera pour cela deux approches:



* A partir de la régression précédente,  vous sélectionnerez les 10 composantes du vecteur de paramètres les plus grands en valeur absolue. Puis, vous éliminerez des données toutes les variables qui ne correspondent pas à ces 10 paramètres. Puis, vous relancerez la régression linéaire.

* Appliquez l'algorithme "matching pursuit" (voir page [wikipedia](https://en.wikipedia.org/wiki/Matching_pursuit) par exemple ) et comparez les deux.


# 1. Setup and Imports


In [2]:
!wget https://www.lamsade.dauphine.fr/~ychevaleyre/data/face_age.zip
!unzip face_age.zip

# Import necessary libraries
import numpy as np
import matplotlib.image as img
import scipy.ndimage as ndimage
import os
import matplotlib.pyplot as plt

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  inflating: face_age/003/7253.png   
  inflating: __MACOSX/face_age/003/._7253.png  
  inflating: face_age/003/8189.png   
  inflating: __MACOSX/face_age/003/._8189.png  
  inflating: face_age/003/5687.png   
  inflating: __MACOSX/face_age/003/._5687.png  
  inflating: face_age/003/3396.png   
  inflating: __MACOSX/face_age/003/._3396.png  
  inflating: face_age/003/7655.png   
  inflating: __MACOSX/face_age/003/._7655.png  
  inflating: face_age/003/8758.png   
  inflating: __MACOSX/face_age/003/._8758.png  
  inflating: face_age/003/1027.png   
  inflating: __MACOSX/face_age/003/._1027.png  
  inflating: face_age/003/4639.png   
  inflating: __MACOSX/face_age/003/._4639.png  
  inflating: face_age/003/4177.png   
  inflating: __MACOSX/face_age/003/._4177.png  
  inflating: face_age/003/3426.png   
  inflating: __MACOSX/face_age/003/._3426.png  
  inflating: face_age/003/1557.png   
  inflating: __MACOSX/face_age/003/._

# 2. Utility Functions


## 2.1 Image Loading and Preprocessing


In [15]:
def chargeImage(filename):
    """
    Load an image, convert it to grayscale, and resize to 25x25.
    """
    image = img.imread(filename)  # Load image
    grayscale = np.mean(image, axis=2)  # Convert to grayscale
    resized = ndimage.zoom(grayscale, 0.125)  # Resize to 25x25
    return resized

def getPath():
    """
    Retrieve file paths and corresponding age labels.
    """
    paths, Y = [], []
    for age in range(5, 101, 5):
        subdir = f"face_age/{str(age).zfill(3)}/"
        files = os.listdir(subdir)[:30]  # Limit to 30 images per directory
        paths += [os.path.join(subdir, f) for f in files]
        Y += [age] * len(files)
    return paths, np.array(Y)

def chargeDossiers(paths):
    """
    Load images from a list of paths.
    """
    return [chargeImage(path) for path in paths]


## 2.2 Derivative Functions


In [16]:
def diffHorizontal(img):
    """
    Compute horizontal differences for an image.
    """
    n = img.shape[0]
    H = np.zeros((n, n))
    H[:-1, :] = img[1:, :] - img[:-1, :]
    H[-1, :] = H[-2, :]  # Duplicate the last row
    return H

def diffVertical(img):
    """
    Compute vertical differences for an image.
    """
    n = img.shape[0]
    V = np.zeros((n, n))
    V[:, :-1] = img[:, 1:] - img[:, :-1]
    V[:, -1] = V[:, -2]  # Duplicate the last column
    return V


# 3. Encoding Functions


In [17]:
def encodeImage(image):
    """
    Encode an image into a 1D feature vector using multiple levels and derivatives.
    """
    levels = [image]
    for scale in [0.48, 0.24, 0.12]:
        levels.append(ndimage.zoom(image, scale))

    derivatives = []
    for level in levels:
        derivatives.append(diffHorizontal(level))
        derivatives.append(diffVertical(level))

    # Flatten and concatenate all levels and derivatives
    return np.concatenate([lvl.flatten() for lvl in levels + derivatives])

def create_X(images):
    """
    Create the feature matrix X from a list of images.
    """
    return np.array([encodeImage(img) for img in images])


# 4. Normalization and Data Preparation


In [18]:
def normalization(X):
    """
    Normalize each column of X.
    """
    norms = np.linalg.norm(X, axis=0)
    return X / norms

def shuffle_and_split(X, Y, train_size=0.8):
    """
    Shuffle and split the dataset into training and testing sets.
    """
    indices = np.random.permutation(len(X))
    X, Y = X[indices], Y[indices]
    split = int(train_size * len(X))
    return X[:split], Y[:split], X[split:], Y[split:]


# 5. Regression and Evaluation Metrics


In [19]:
def matrice_dague(A):
    """
    Compute the pseudo-inverse of a matrix.
    """
    return np.linalg.pinv(A)

def MSE(Y, Y_pred):
    """
    Compute Mean Squared Error.
    """
    return np.mean((Y - Y_pred) ** 2)

def R2(Y, Y_pred):
    """
    Compute the R-squared value.
    """
    return 1 - (np.sum((Y - Y_pred) ** 2) / np.sum((Y - np.mean(Y)) ** 2))


# 6. Linear Regression


In [20]:
def linear_regression(X_train, Y_train, X_test, Y_test):
    """
    Perform linear regression and evaluate the model.
    """
    X_train_aug = np.column_stack((np.ones(X_train.shape[0]), X_train))  # Add bias
    beta = matrice_dague(X_train_aug) @ Y_train

    X_test_aug = np.column_stack((np.ones(X_test.shape[0]), X_test))
    Y_test_pred = X_test_aug @ beta

    print("R-squared:", R2(Y_test, Y_test_pred))
    print("MSE:", MSE(Y_test, Y_test_pred))
    return beta


# 7. Feature Selection


In [21]:
def max_components_vecteur(beta, k):
    """
    Select indices of the top-k largest components in absolute value.
    """
    return np.argsort(np.abs(beta))[::-1][:k]


# 8. Matching Pursuit


In [30]:
def matching_pursuit(X, Y, k):
    """
    Perform the Matching Pursuit algorithm to select k components.
    """
    theta = np.zeros(X.shape[1], dtype=np.float64)  # Ensure theta is float64
    residual = Y.astype(np.float64)  # Ensure residual is float64
    selected_indices = []

    for _ in range(k):
        correlations = np.abs(X.T @ residual)
        idx = np.argmax(correlations)  # Find the index of max correlation
        selected_indices.append(idx)
        theta[idx] = X[:, idx] @ residual
        residual -= theta[idx] * X[:, idx]  # Update the residual

    return theta


# 9. Full Workflow


## 9.1 Load and Prepare Data


In [35]:
# Load dataset
paths, Y = getPath()
images = chargeDossiers(paths)

# Encode and normalize
X = create_X(images)
X = normalization(X)

# Shuffle and split
X_train, Y_train, X_test, Y_test = shuffle_and_split(X, Y)


## 9.2 Run Linear Regression


In [36]:
print("Linear Regression with All Features:")
beta = linear_regression(X_train, Y_train, X_test, Y_test)


Linear Regression with All Features:
R-squared: -1.1319403842322275
MSE: 1410.3626323515905


## 9.3 Feature Selection Regression


In [37]:
print("Linear Regression with Top 10 Features:")
top_10_indices = max_components_vecteur(beta[1:], 10)  # Skip bias term
linear_regression(X_train[:, top_10_indices], Y_train, X_test[:, top_10_indices], Y_test)


Linear Regression with Top 10 Features:
R-squared: 0.13279745879828608
MSE: 573.6886771492441


array([  47.54281317, -116.89192578,   71.58551121,   35.55983175,
        125.63947351,  -77.87115627,   42.0387282 ,  -49.77304449,
         15.41547272,   85.54005042,  -15.09673653])

## 9.4 Matching Pursuit


In [39]:
print("Matching Pursuit with 10 Features:")
theta = matching_pursuit(X_train, Y_train, 10)
Y_test_pred = X_test @ theta
print("R-squared:", R2(Y_test, Y_test_pred))
print("MSE:", MSE(Y_test, Y_test_pred))


Matching Pursuit with 10 Features:
R-squared: -0.011629455251471965
MSE: 669.2327759374426


# Standard Deviation of Age Prediction Errors and Its Significance

### Explanation:
- The **Mean Squared Error (MSE)** measures the average of the squared differences between predicted and actual ages. However, since it is squared, its value is not directly interpretable in the same scale as the original ages.
- The **Standard Deviation (σ)** of the prediction errors is the square root of the MSE. This gives us an idea of how far off the predictions are from the actual ages, on average, in the same unit (years in this case).

### Formula:
\[
\sigma = \sqrt{\text{MSE}}
\]

This value provides an interpretable measure of the typical prediction error in years.

---

### Results from the Models:
1. **Linear Regression with All Features:**
   - MSE = 1486.57  
   - Standard Deviation:
     \[
     \sigma = \sqrt{1486.57} \approx 38.55 \text{ years}
     \]
   - Example: If the actual age is **50 years**, the prediction might typically vary by **±38.55 years**, meaning the predicted age could be anywhere between **11.45 years** and **88.55 years**. This is very inaccurate.

2. **Linear Regression with Top 10 Features:**
   - MSE = 582.95  
   - Standard Deviation:
     \[
     \sigma = \sqrt{582.95} \approx 24.15 \text{ years}
     \]
   - Example: If the actual age is **50 years**, the prediction might typically vary by **±24.15 years**, meaning the predicted age could be anywhere between **25.85 years** and **74.15 years**. This is better but still has a wide range.

3. **Matching Pursuit with 10 Features:**
   - MSE = 756.01  
   - Standard Deviation:
     \[
     \sigma = \sqrt{756.01} \approx 27.49 \text{ years}
     \]
   - Example: If the actual age is **50 years**, the prediction might typically vary by **±27.49 years**, meaning the predicted age could be anywhere between **22.51 years** and **77.49 years**. This is slightly worse than using top 10 features but still much better than using all features.

---

### Key Takeaways:
- **Lower standard deviation** indicates better model performance, as predictions are closer to the actual values.
- **Linear Regression with Top 10 Features** performs the best among the models tested, achieving a standard deviation of **24.15 years**.
- The results show that dimensionality reduction (e.g., feature selection) improves model performance by reducing noise and overfitting.
- However, the current prediction errors (24-38 years) are still too large for practical use, suggesting further improvements in feature engineering, data preprocessing, or using more advanced models.

### Visualization of Standard Deviation:
To make this clearer:
- Imagine the true age is **50 years**.
- If the model's standard deviation of prediction error is **24 years**, most predictions will fall within the range:
  \[
  50 \pm 24 = [26, 74] \text{ years}
  \]
- If the standard deviation is **38 years**, predictions will fall within the range:
  \[
  50 \pm 38 = [12, 88] \text{ years}
  \]
This shows why reducing the standard deviation is critical for improving the reliability of age predictions.


# Future Work and Practical Limitations

### **Practical Limitations of the Current Model**
1. **Prediction Accuracy:**
	* The best model, Linear Regression with Top 10 Features has a standard deviation of **24.15 years**. For an actual age of 50 years, it suggests that the predictions will be typically in the range of **26 years** to **74 years**, which certainly is not precise enough for practical use.
- In the real world, such large prediction errors would make the model unreliable; in applications where age precision is important, such as demographic analysis or personalized services, even medical diagnostics, these models would plainly fail.

2. **Interpretability:**
	* Although interpretability improves by feature reduction-e.g., top 10 features-the predictions are not good enough. Some optimization is still required between interpretability and performance.

3. **Dataset Quality:
The dataset may have anomalies or noisy data that could be detrimental to the model in training. For instance:
    - Classes across age groups could be imbalanced.
    - Lighting conditions and facial expressions, among other conditions in these images, may cause noisy data.

4. **Feature Representation:**
This can be the current feature extraction process based on a difference-based and downscaled-image-based approach. This may lead to poor performance of the model by not representing any high-level pattern or facial structure that can be related to age prediction.

---

### **Future Work and Recommendations**
Several improvements are needed to make this model practically useful, such as:

1. **Better Feature Engineering:**
- More sophisticated image features, like HOG, LBP, or Gabor filters, can be used to catch the facial structures more effectively.
- Dimensionality reduction techniques, such as **Principal Component Analysis (PCA)**, may be employed to retain more informative features in less noise.

2. **Deep Learning Approaches:**
- Utilize deep learning models, such as CNNs, which by design learn this from the raw images.
- VGGFace and ResNet can yield better performance by fine-tuning for age prediction.
3. **Regularization and Robustness**
- For linear models, employ regularization methods like Lasso or Ridge regression to reduce overfitting.
- Include data augmentation, such as rotation, scaling, or adding noise to the images, to help the model generalize well in case of variation in the images.

4. **Improved Data Quality and Quantity:**
- Use a dataset larger and more diverse, with at least similar age group representation.
- A dataset with regular lighting and pose variations and similar quality reduces noise.

5. **Model Evaluation Metrics:
While MSE and standard deviation provide useful measures of error, other metrics exist which are more relevant to real-world applications; for example, the **Mean Absolute Error (MAE)**.

---

### **In Practical Life**
Given that the predictive accuracy of the current model is not good, it's far from being applied in the real world. Here is why:
- **Personalized Services:** Targeted marketing or recommendation-based applications are infeasible to use with a predicted age for an individual with an average error of ±24 years.
- **Medical Applications:** Estimating biological age for health interventions may be as important, yet with such a wide error margin, the model is unwieldy and can even mislead in such tasks.
**Demographic Research:** In demographic studies, an estimation of age is to be as accurate as possible to categorize the people into meaningful groups. An error span of 24 years might misclassify the people into an altogether different age bracket.

---

### **Conclusion**
The current model clearly illustrates that basic regression techniques, along with manual feature engineering, simply do not cut it for more complex tasks such as age prediction from images. Though the exercise underlined the importance of feature selection and dimensionality reduction, outcomes are not useful in a practical sense. Future efforts should be given to:
Adopting advanced feature extraction methods.
Deep Learning Leveraging.
- Improvement in dataset quality and size.
It is only then that the model will be able to realize such an improvement in precision, suitable for real-life applications.


## **Credits**
- Data provided by Yann Chevaleyre.
- Analysis and implementation by Mohamed ZOUAD & Mohamed El Amine ROUIBI.
