


We will use a well-known dataset called Labeled Faces in the Wild (LFW), a collection
of face images of famous people, available at http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz.

However, note that it can be easily imported via scikit-learn from the sklearn.datasets module.
With the settings used below (min_faces_per_person=60), the subset contains 1348 images of 8 people, and each image consists of 2914 features: we could proceed by simply using each of them in the model.



Fitting an SVM to non-linear data using the kernel trick produces non-linear decision boundaries.
In particular, we seek to:
* Build an SVM model with a radial basis function (RBF) kernel
* Use grid search cross-validation to explore combinations of parameters.
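
To make the RBF kernel concrete, here is a minimal pure-Python sketch of the similarity it computes, k(x, z) = exp(-gamma * ||x - z||^2). The helper name `rbf_kernel` and the three-pixel vectors are made up for illustration; they are not part of the notebook.

```python
import math

def rbf_kernel(x, z, gamma):
    """Return exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Two toy "images" with three pixel intensities each (made-up values).
x = [0.2, 0.5, 0.9]
z = [0.1, 0.7, 0.4]

identical = rbf_kernel(x, x, gamma=0.001)  # identical inputs: similarity is 1
smooth = rbf_kernel(x, z, gamma=0.001)     # small gamma: similarity decays slowly
sharp = rbf_kernel(x, z, gamma=10.0)       # large gamma: similarity decays quickly
print(identical, smooth, sharp)
```

A larger gamma makes each training point influence only its close neighbours, which is why gamma is one of the two parameters tuned in the grid search.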

### Steps to do:

1. Loading the data from sklearn.datasets:

In [15]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)

In [16]:
# What fields are in the dictionary?
faces.keys()

dict_keys(['data', 'images', 'target', 'target_names', 'DESCR'])

In [17]:
faces['target_names']

array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
       'Gerhard Schroeder', 'Hugo Chavez', 'Junichiro Koizumi',
       'Tony Blair'], dtype='<U17')

In [18]:

print(faces['DESCR'])

.. _labeled_faces_in_the_wild_dataset:

The Labeled Faces in the Wild face recognition dataset
------------------------------------------------------

This dataset is a collection of JPEG pictures of famous people collected
over the internet, all details are available on the official website:

    http://vis-www.cs.umass.edu/lfw/

Each picture is centered on a single face. The typical task is called
Face Verification: given a pair of two pictures, a binary classifier
must predict whether the two images are from the same person.

An alternative task, Face Recognition or Face Identification is:
given the picture of the face of an unknown person, identify the name
of the person by referring to a gallery of previously seen pictures of
identified persons.

Both Face Verification and Face Recognition are tasks that are typically
performed on the output of a model trained to perform Face Detection. The
most popular model for Face Detection is called Viola-Jones and is
implemented in the OpenC

In [19]:
faces['DESCR']



In [20]:

print(faces['data'])

[[0.53333336 0.52418303 0.49673203 ... 0.00653595 0.00653595 0.00130719]
 [0.28627452 0.20784314 0.2535948  ... 0.96993464 0.95032686 0.9346406 ]
 [0.31633988 0.3895425  0.275817   ... 0.4261438  0.7895425  0.9555555 ]
 ...
 [0.11633987 0.11111111 0.10196079 ... 0.5660131  0.579085   0.5542484 ]
 [0.19346406 0.21045752 0.29150328 ... 0.6875817  0.6575164  0.5908497 ]
 [0.12418301 0.09673203 0.10849673 ... 0.12941177 0.16209151 0.29150328]]


In [21]:
# shape of the data?
print(faces['data'].shape)

(1348, 2914)
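
The 2914 columns are the pixels of each image flattened into one row. As a rough sketch, assuming the default resize setting of fetch_lfw_people (which yields 62 x 47 images, also exposed as faces.images), a flat row can be folded back into image rows like this. The helper `to_rows` and the placeholder pixel values are hypothetical.

```python
def to_rows(flat_pixels, height, width):
    """Split a flat pixel list into `height` rows of `width` values."""
    if len(flat_pixels) != height * width:
        raise ValueError("pixel count does not match height * width")
    return [flat_pixels[r * width:(r + 1) * width] for r in range(height)]

HEIGHT, WIDTH = 62, 47   # assumed image size (scikit-learn default resize)
flat = [0.0] * 2914      # placeholder standing in for one row of faces.data
image = to_rows(flat, HEIGHT, WIDTH)
print(len(image), len(image[0]))
```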


In [22]:
#  target names?
print(faces['target_names'])

['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']


4. Dividing the data into features (X) using faces.data and target (y) using faces.target:

In [23]:

X = faces.data
y = faces.target



We train the model with 70% of the samples and test with the remaining 30%.

In [24]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# Print the sizes of the training and test sets to verify that the split worked.
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)








(943, 2914)
(405, 2914)
(943,)
(405,)
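
The sizes printed above follow from simple arithmetic. A minimal sketch, assuming that a fractional train_size is rounded down and the test set receives the remaining samples:

```python
import math

n_samples = 1348
train_size = 0.7

# Assumption: the fractional training size is floored, the rest goes to the test set.
n_train = math.floor(train_size * n_samples)
n_test = n_samples - n_train
print(n_train, n_test)
```

This reproduces the 943 training rows and 405 test rows shown in the output.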


In [25]:
from sklearn.svm import SVC

# Baseline: RBF-kernel SVM with class weights balanced by class frequency
svm_model = SVC(kernel='rbf', class_weight='balanced')
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)  # predictions of the untuned model
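
For intuition about class_weight='balanced': scikit-learn documents the weights as n_samples / (n_classes * count of each class). The sketch below applies that formula by hand. The helper `balanced_class_weights` is hypothetical, and the counts are illustrative only: they are the test-set support values from the classification report further down, whereas the model itself derives its weights from y_train.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Illustrative counts (test-set support), NOT the training counts used by the model.
counts = {
    'Ariel Sharon': 17, 'Colin Powell': 84, 'Donald Rumsfeld': 36,
    'George W Bush': 146, 'Gerhard Schroeder': 28, 'Hugo Chavez': 27,
    'Junichiro Koizumi': 16, 'Tony Blair': 51,
}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label:18s} {weight:.2f}")
```

Rare classes receive weights above 1 and frequent classes below 1, so misclassifying a rarely seen person costs more during training.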

In [26]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [1, 5, 10, 50], 'gamma': [0.001, 0.0005, 0.01, 0.1]}

# 10-fold cross-validated grid search over every (C, gamma) pair
grid_search = GridSearchCV(svm_model, param_grid, cv=10)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_  # refit on the full training set
print("Best hyperparameters:", best_params)

Best hyperparameters: {'C': 50, 'gamma': 0.001}
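
To see how much work the grid search does, the candidate combinations can be enumerated by hand. This is a small sketch using only the standard library; the variable names are chosen for illustration, and it counts only the cross-validation fits, not the final refit of the best model.

```python
from itertools import product

param_grid = {'C': [1, 5, 10, 50], 'gamma': [0.001, 0.0005, 0.01, 0.1]}
cv_folds = 10

# Every combination of one C value with one gamma value.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold
print(n_candidates, n_fits)
```

The search is exhaustive: every pair in the grid is evaluated, so the cost grows with the product of the list lengths.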


In [27]:

y_pred_best = best_model.predict(X_test)  # predictions of the tuned model; the report below reads y_pred instead

Model performance:


In [28]:
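from sklearn.metrics import classification_report
labels = list(faces.target_names)
# Note: y_pred holds the baseline SVC predictions from In [25], not those of the tuned model from In [27].
print(classification_report(y_test, y_pred, target_names=labels))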

                   precision    recall  f1-score   support

     Ariel Sharon       0.62      0.76      0.68        17
     Colin Powell       0.77      0.81      0.79        84
  Donald Rumsfeld       0.70      0.78      0.74        36
    George W Bush       0.84      0.77      0.80       146
Gerhard Schroeder       0.44      0.64      0.52        28
      Hugo Chavez       0.93      0.52      0.67        27
Junichiro Koizumi       0.89      1.00      0.94        16
       Tony Blair       0.68      0.63      0.65        51

         accuracy                           0.75       405
        macro avg       0.73      0.74      0.72       405
     weighted avg       0.76      0.75      0.75       405



- Overall, the model reaches 75% accuracy, but performance varies considerably from person to person, and there is room for improvement, partly because of the class imbalance visible in the support column.

- For example, the Gerhard Schroeder class is predicted with 44% precision and 64% recall, and it has only 28 instances in the support, which is relatively low.

- I believe that increasing the size of the training data could enhance the model's performance.

- With more diverse training examples, the model may learn to generalize better across various scenarios and improve its predictive capabilities.
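
The figures in the report can be checked by hand. As a small sketch, the F1 score is the harmonic mean of precision and recall, and the macro average is the unweighted mean over classes. The function name `f1_from` is made up for this example, and the inputs are the rounded values printed in the report.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Gerhard Schroeder: precision 0.44, recall 0.64 (from the report above)
schroeder_f1 = f1_from(0.44, 0.64)

# Per-class F1 scores in the order they appear in the report
per_class_f1 = [0.68, 0.79, 0.74, 0.80, 0.52, 0.67, 0.94, 0.65]
macro_f1 = sum(per_class_f1) / len(per_class_f1)

print(round(schroeder_f1, 2), round(macro_f1, 2))
```

Both values agree with the report (0.52 and 0.72). Because the macro average treats every person equally, weak classes with little support pull it below the overall accuracy.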