# Logistic regression lab

|                |   |
|:---------------|---|
| **Name**       | Paola A. Figueroa Álvarez |
| **Date**       | 11/Oct/2025 |
| **Student ID** | 751310 |

In machine learning, support vector machines (SVMs) are supervised learning models, with associated learning algorithms, that analyze data for classification and regression analysis. They are mostly used in classification problems. In this algorithm, each data item is plotted as a point in p-dimensional space (where p is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes (or more, in a multi-class problem):

$$ f(x) = w^T \varphi(x) + b $$

where $\varphi: X \rightarrow F $ is a function that makes each input point $x$ correspond to a point in F, where F is a Hilbert space.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces (more specifically, by using the kernel trick with a kernel such as the RBF function).

[1]

Ordinary least squares (OLS) uses the squared residuals to fit the parameters, so large residuals caused by outliers may worsen the accuracy significantly.

Support vector regression counters this with a piecewise linear loss: a hyperparameter $\epsilon$ (referred to here as the margin) sets every error $e$ with $|e| \leq \epsilon$ to 0, and every larger error to $|e| - \epsilon$.
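The difference between the two losses can be sketched in a few lines of plain Python. This is only an illustration: the function names, the residual values, and the choice $\epsilon = 0.1$ are assumptions made for the example, not part of the lab.

```python
def epsilon_insensitive_loss(residual: float, epsilon: float) -> float:
    """Return 0 inside the epsilon tube and |residual| - epsilon outside it."""
    return max(0.0, abs(residual) - epsilon)


def squared_loss(residual: float) -> float:
    """Loss used by OLS: grows quadratically with the residual."""
    return residual ** 2


# Hypothetical residuals; the last one plays the role of an outlier.
residuals = [0.05, -0.1, 0.5, -3.0]
epsilon = 0.1  # illustrative value

for e in residuals:
    print(
        f"e={e:+.2f}  squared={squared_loss(e):.4f}  "
        f"eps-insensitive={epsilon_insensitive_loss(e, epsilon):.4f}"
    )
```

The outlier contributes 9.0 to the squared loss but only about 2.9 to the $\epsilon$-insensitive loss, which is why the latter is less sensitive to large residuals.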

For soft-margin classification, the problem to solve is:

$$
\begin{split}
        \min_{w, b, \xi} \mathcal{P}(w, b, \xi) &= \frac{1}{2} w^T w + c \sum_{k=1}^{N} \xi_k \\
        \text{s.t. } & y_k [w^T \varphi(x_k) + b] \geq 1- \xi_k,\ \ k = 1, ..., N \\
        & \xi_k \geq 0,\ \ k = 1, ..., N
\end{split}
$$
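To make the role of the slack variables concrete, the sketch below evaluates the objective for a fixed, hand-picked $(w, b)$. It assumes that $\varphi$ is the identity map and uses the sign convention $f(x) = w^T \varphi(x) + b$ from the first equation; the data points, `w`, `b`, and `c` are hypothetical values chosen for illustration, and no optimization is performed.

```python
def decision_value(w, b, x):
    """f(x) = w^T x + b, taking phi as the identity map (an assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def primal_objective(w, b, X, y, c):
    """Return (objective, slacks), using the smallest feasible slack per point."""
    # xi_k = max(0, 1 - y_k * f(x_k)) satisfies both constraints with equality
    # whenever the point violates the margin.
    slacks = [max(0.0, 1.0 - yk * decision_value(w, b, xk)) for xk, yk in zip(X, y)]
    objective = 0.5 * sum(wi * wi for wi in w) + c * sum(slacks)
    return objective, slacks


# Hypothetical 2-D data with labels in {+1, -1}.
X = [(2.0, 2.0), (0.5, 0.5), (-2.0, -1.0), (0.2, -0.1)]
y = [1, 1, -1, -1]
w, b, c = (1.0, 1.0), 0.0, 1.0

objective, slacks = primal_objective(w, b, X, y, c)
print("slacks:", slacks)
print("objective:", objective)
```

Only the last point has a non-zero slack. Its slack is larger than 1, which means that it lies on the wrong side of the hyperplane; a slack between 0 and 1 would mean a correctly classified point inside the margin.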


The most important question that arises when using an SVM is how to choose the correct hyperplane. Consider the following scenarios:

### Scenario 1

In this scenario there are three hyperplanes called A, B, and C. Now, the problem is to identify the hyperplane which best differentiates the stars and the circles.

<center><img src="https://media.geeksforgeeks.org/wp-content/uploads/SVM_21-2.png" alt="Three candidate hyperplanes A, B, and C for separating the stars from the circles"></center>

In this case, hyperplane B separates the stars and the circles better than the others, hence it is the correct hyperplane.


### Scenario 2

Now take another scenario where all three hyperplanes are segregating classes well. The question that arises is how to choose the best hyperplane in this situation.

<center><img src="https://media.geeksforgeeks.org/wp-content/uploads/SVM_4-2.png" alt="Three hyperplanes A, B, and C that all separate the two classes, with different margins"></center>

In such scenarios, we calculate the margin, which is the distance between the hyperplane and the nearest data point. The hyperplane with the largest margin is considered the correct hyperplane to classify the dataset.

Here C has the largest margin. Hence, it is considered as the best hyperplane.
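The margin comparison can be sketched numerically. The points and the three vertical hyperplanes below are hypothetical stand-ins for the figure (they are not read from the image); the distance from a point $x$ to the hyperplane $w^T x + b = 0$ is $|w^T x + b| / \lVert w \rVert$.

```python
import math


def signed_value(w, b, x):
    """w^T x + b for a single point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def separates(w, b, X, y):
    """True when every point lies strictly on the side given by its label."""
    return all(yk * signed_value(w, b, xk) > 0 for xk, yk in zip(X, y))


def margin(w, b, X):
    """Distance from the hyperplane to the nearest data point."""
    norm = math.hypot(*w)
    return min(abs(signed_value(w, b, xk)) / norm for xk in X)


# Hypothetical data: +1 for stars, -1 for circles.
X = [(3.0, 1.0), (4.0, 2.0), (4.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
y = [1, 1, 1, -1, -1, -1]

# Hypothetical candidates, each written as (w, b): the lines x1 = 0.5, 2.5, 1.5.
candidates = {
    "A": ((1.0, 0.0), -0.5),
    "B": ((1.0, 0.0), -2.5),
    "C": ((1.0, 0.0), -1.5),
}

margins = {
    name: margin(w, b, X)
    for name, (w, b) in candidates.items()
    if separates(w, b, X, y)
}
best = max(margins, key=margins.get)
print(margins, "-> best:", best)
```

All three candidates separate the classes, so the choice is made on the margin alone, and C is selected in this constructed example.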


### Kernels
Knowing 
$$ w = \sum_{k=1}^{N} \alpha_k y_k \varphi(x_k) $$

And
$$ y_{pred} = w^T \varphi(x) + b $$

Then 
$$ y_{pred} = (\sum_{k=1}^{N} \alpha_k y_k \varphi(x_k))^T \varphi(x) + b $$

where $\varphi$ is a function that maps each input $x$ to a point in $F$ (a Hilbert space). This can be seen as processing and transforming the input features while keeping the model's convexity. [2]

This also allows us to transform the inputs into another space where they might be more easily classified.

<center><img src="https://miro.medium.com/max/838/1*gXvhD4IomaC9Jb37tzDUVg.png" alt="Inputs mapped into another space where the classes are easier to separate"></center>
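Because the prediction only needs the inner products $\varphi(x_k)^T \varphi(x)$, a kernel function $K(x_k, x)$ can compute them without ever building $\varphi(x)$ explicitly. The sketch below checks this for a degree-2 polynomial kernel in two dimensions. The feature map `phi` and the sample points are assumptions chosen for the example, not something taken from the lab.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """K(x, z) = (x^T z)^2, evaluated in the original 2-D space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(x), phi(z))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)   # same quantity, without building phi

print(explicit, implicit)
```

Both values agree up to floating-point rounding. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit form is practical.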

## ROC and AUC

A ROC (Receiver Operating Characteristic) curve is a graph that shows how a classification model performs across all classification thresholds.

ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better. [3]

True Positive Rate (TPR) is a synonym for recall and is defined as:
$$ TPR = \frac{TP}{TP + FN} $$

False Positive Rate (FPR) equals 1 minus the specificity and is defined as:

$$ FPR = \frac{FP}{FP + TN} $$
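As a small worked example, the two rates can be computed directly from the confusion counts. The labels and predictions below are hypothetical, and the helper names are introduced only for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def rates(y_true, y_pred):
    """Return (TPR, FPR) following the two formulas above."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tpr, fpr = rates(y_true, y_pred)
specificity = 1.0 - fpr  # TN / (TN + FP)
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}  specificity={specificity:.3f}")
```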

ROC curves are typically used in binary classification to study the output of a classifier. In order to extend ROC curve and ROC area to multi-label classification, it is necessary to binarize the output. One ROC curve can be drawn per label, but one can also draw a ROC curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).
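Binarizing the output can be sketched as follows. The label values are hypothetical; each column of the resulting indicator matrix is a one-vs-rest binary problem, and flattening the matrix gives the element-wise view that micro-averaging uses.

```python
def binarize(labels, classes):
    """Build the label indicator matrix: one row per sample, one column per class."""
    return [[1 if label == c else 0 for c in classes] for label in labels]


# Hypothetical multi-class labels.
labels = [0, 2, 1, 2]
classes = [0, 1, 2]

indicator = binarize(labels, classes)

# One binary target per class (one ROC curve can be drawn for each column).
per_class = {c: [row[j] for row in indicator] for j, c in enumerate(classes)}

# Micro-averaging treats every element of the matrix as one binary prediction.
flat = [value for row in indicator for value in row]

print(indicator)
print(per_class)
print(flat)
```

The predicted scores would be arranged in the same samples-by-classes layout, so that each score lines up with one entry of the indicator matrix.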

For example, lowering the classification threshold classifies more items as positive, which increases both the false positives and the true positives.

AUC stands for area under the ROC curve.
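The following sketch builds a ROC curve by sweeping the threshold over the observed scores and then integrates it with the trapezoidal rule. It is a plain-Python illustration on hypothetical scores, not the routine used in the exercises below; in practice a library implementation would normally be preferred.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs from sweeping the threshold over every distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (0/1) and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)
print("AUC =", auc(points))
```

Each step down in the threshold moves the curve up (a new true positive) or to the right (a new false positive), which is the trade-off described above.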

## Exercise 1

- Use the `Iris` dataset, fit an SVC, and cross-validate different kernels ('linear', 'poly', 'rbf', 'sigmoid').
- Fit a LogisticRegression model.
- The cross-validation method is K-Folds with $k=10$.
- Use AUC as the cross-validation metric.
- Compare the results.

In [3]:

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


In [4]:
iris = datasets.load_iris()
df_iris = pd.DataFrame(columns=iris.feature_names, data=iris.data)
df_iris["Class"] = iris.target
df_iris["Class"] = df_iris["Class"].astype(str)


In [5]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   Class              150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [6]:
X = df_iris.drop("Class", axis=1)
y = df_iris["Class"]

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

In [22]:
kernels = ['linear', 'poly', 'rbf', 'sigmoid', 'LogisticRegression']
resultados= {}

for kernel_name in kernels:
    if kernel_name == 'LogisticRegression':
        classifier = LogisticRegression(max_iter=200, random_state=42)
    else:
        classifier = SVC(kernel=kernel_name, probability=True, random_state=42)
    
    # Pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', classifier) 
    ])
    
    # Cross-Validation Score
    scores1 = cross_val_score(
        pipeline, 
        X, 
        y, 
        cv=kfold, 
        scoring='roc_auc_ovr'  # Iris has 3 classes, so either 'roc_auc_ovo' or 'roc_auc_ovr' could be used
    )
    resultados[kernel_name] = np.mean(scores1)
    print(f"--- {kernel_name}: AUC medio = {np.mean(scores1):.4f} (std = {np.std(scores1):.4f})")


--- linear: AUC medio = 1.0000 (std = 0.0000)
--- poly: AUC medio = 1.0000 (std = 0.0000)
--- rbf: AUC medio = 0.9959 (std = 0.0124)
--- sigmoid: AUC medio = 0.9819 (std = 0.0333)
--- LogisticRegression: AUC medio = 0.9967 (std = 0.0082)


The linear and poly kernels reached an area under the curve of 1.0000 with a standard deviation of 0.0000. This indicates that the model ranks the three Iris species perfectly and consistently across the cross-validation folds.
The fact that the linear kernel and logistic regression (which by nature produces linear decision boundaries) both perform almost perfectly supports the conclusion I had already noted: the classes of the Iris dataset are linearly separable, or very nearly so, in feature space, meaning that a straight decision boundary (a flat hyperplane) can separate one class from the others, or all classes from each other.
Finally, all the classifiers tested show excellent performance (every AUC is above 0.98), which underlines that the Iris dataset is a relatively easy classification problem because of its high separability.

## Exercise 2
- Repeat exercise 1 with the `Default` dataset. Use `default` as the target.

In [23]:
df_default= pd.read_csv("/Users/paofigueroa/Documents/sem 5/Lab de aprendizaje estadístico/Default.csv")
df_default.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   default  10000 non-null  object 
 1   student  10000 non-null  object 
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB


In [24]:
X2 = df_default.drop("default", axis=1)
y2 = df_default["default"]

# some predictors have object dtype, so they are encoded as dummy variables
X2 = pd.get_dummies(X2, drop_first=True)
X2.info()

kfold = KFold(n_splits=10, shuffle=True, random_state=42)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   balance      10000 non-null  float64
 1   income       10000 non-null  float64
 2   student_Yes  10000 non-null  bool   
dtypes: bool(1), float64(2)
memory usage: 166.1 KB


In [32]:
print(len(X2), len(y2))

10000 10000


In [33]:
for kernel_name in kernels:
    if kernel_name == 'LogisticRegression':
        classifier = LogisticRegression(max_iter=200, random_state=42)
    else:
        classifier = SVC(kernel=kernel_name, probability=True, random_state=42)
    
    # Pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', classifier) 
    ])
    
    # Cross-Validation Score
    scores2 = cross_val_score(
        pipeline, 
        X2, 
        y2, 
        cv=kfold, 
        scoring='roc_auc_ovr'  # the target is binary, so this is equivalent to scoring='roc_auc'
    )
    resultados[kernel_name] = np.mean(scores2)
    print(f"--- {kernel_name}: AUC medio = {np.mean(scores2):.4f} (std = {np.std(scores2):.4f})")

--- linear: AUC medio = 0.9199 (std = 0.0134)
--- poly: AUC medio = 0.8752 (std = 0.0438)
--- rbf: AUC medio = 0.8394 (std = 0.0373)
--- sigmoid: AUC medio = 0.7373 (std = 0.0270)
--- LogisticRegression: AUC medio = 0.9491 (std = 0.0170)


Logistic regression outperformed every SVM classifier tested. This indicates that, for predicting whether a customer will default, the linear model provides the best discriminative ability on this dataset.
Among the SVM models, the linear kernel obtained the best result, reinforcing the idea that the relationship between the predictors (balance, income, student_Yes) and the target variable (default) is predominantly linear.
The weak performance of the rbf and sigmoid kernels suggests that mapping the data to a higher-dimensional feature space is not beneficial for this problem. This may be because the noise or complexity of these transformations obscures the simple linear decision boundary.


# Addendum

Metrics available for classification:
- `accuracy`
- `balanced_accuracy`
- `top_k_accuracy`
- `average_precision`
- `neg_brier_score`
- `f1`
- `f1_micro`
- `f1_macro`
- `f1_weighted`
- `f1_samples`
- `neg_log_loss`
- `precision` etc.
- `recall` etc.
- `jaccard` etc.
- `roc_auc`
- `roc_auc_ovr`
- `roc_auc_ovo`
- `roc_auc_ovr_weighted`
- `roc_auc_ovo_weighted`
- `d2_log_loss_score`

# References

[1] Shigeo Abe. Support Vector Machines for Pattern Classification, 2nd ed. Springer-Verlag London, 2010. ISBN 978-1-84996-097-7. URL https://www.springer.com/gp/book/9781849960977.

[2] Johan A. K. Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002. ISBN 9789812381514. URL https://www.worldscientific.com/worldscibooks/10.1142/5089.

[3] Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7), 1145-1159. URL https://www.researchgate.net/post/how_can_I_interpret_the_ROC_curve_result