# To delete

#### Data and setup

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
# for first and future models
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


In [33]:
# Open the pickle file for reading
with open('data4.pkl', 'rb') as file:
    # Load the data from the pickle file
    data = pickle.load(file)

# Convert the loaded data into a pandas DataFrame
df_binary = pd.DataFrame(data)

In [3]:
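# Keep the full DataFrame plus two reduced views without the trailing columns
pure_df = df
only_one_coff_df = df.iloc[:, :-2]  # drop the last two columns
none_coff = df.iloc[:, :-3]         # drop the last three columns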


##### Pure

In [25]:
# Mapping for relabeling
label_mapping = {
    'fall' :'fall',
    'rfall': 'fall',
    'lfall': 'fall',
    'light': 'fall',
    'sit': 'normal',
    'walk': 'normal',
    'step': 'normal'
}

# Replace existing labels with new labels
df_binary['label'] = df_binary['label'].map(label_mapping)

In [5]:
# labels
y = df["label"]
prova = df.drop("label", axis=1)

# split data 
X_train, X_test, y_train, y_test = train_test_split(prova, y, test_size=0.3)

# scale the features (useful if we later add features on a different scale)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
# reuse the statistics learned on the training set; refitting on the test set would leak information
X_test_scaled = scaler.transform(X_test.astype(np.float64))

#### Logistic vs Naive Bayes

# Binary classification (Naive Bayes vs Logistic regression)

In this section, we compare two models that belong to two different families of machine learning models: discriminative and generative.
In brief, discriminative models either model $p(y|x)$ directly (as logistic regression does) or learn a direct map from the inputs to the class label (as the perceptron does).
Generative models, on the other hand, model $p(x|y)$ and $p(y)$ (the likelihood and the prior) and make predictions by applying Bayes' rule to derive $p(y|x)$.
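
Written out, the Bayes rule used by a generative classifier is the following, where $y'$ ranges over the classes (notation introduced here for illustration):

$$
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
$$

The predicted class is the $y$ that maximizes the numerator $p(x \mid y)\, p(y)$, since the denominator does not depend on $y$.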


In this project, we compare logistic regression (representing the discriminative family) and naive Bayes (representing the generative family).
Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets.
They assume that the predictors are conditionally independent given the class, a very strong assumption that may not hold.
Since naive Bayes models make additional assumptions compared with logistic regression, they should be treated as a quick-and-dirty baseline for a classification problem.
When we have no particular constraints on time, computing resources, or the number of observations, we should avoid this type of model.

There are many flavours of naive Bayes models but, for our purposes, we focus on the Gaussian naive Bayes model.
This classifier assumes that, within each class, every feature is drawn from a Gaussian distribution.
To compare naive Bayes and logistic regression fairly, we use the same procedure for both models: 10-fold cross-validation with 10 repetitions.
Averaging over the resulting 100 scores gives a more trustworthy accuracy estimate than a single split.
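
To make the Gaussian assumption concrete, here is a minimal NumPy sketch of how a Gaussian naive Bayes classifier scores a sample. It is an illustration only, not the scikit-learn implementation used in this notebook; the function names `fit_gnb` and `predict_gnb` and the toy data are made up for this example.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant avoids division by zero for constant features
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gnb(X, classes, priors, means, variances):
    """Pick the class maximizing log p(y) + sum_j log N(x_j | mu_j, sigma_j^2)."""
    # Broadcast to shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Summing over features is the conditional independence assumption
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]

# Toy data: two well-separated classes (hypothetical values)
X_toy = np.array([[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],
                  [3.0, 3.1], [2.9, 3.0], [3.2, 2.8]])
y_toy = np.array(["normal", "normal", "normal", "fall", "fall", "fall"])

params = fit_gnb(X_toy, y_toy)
pred_toy = predict_gnb(np.array([[0.1, 0.0], [3.0, 3.0]]), *params)
```

Here `pred_toy` holds one predicted label per query point. The sum over features inside `predict_gnb` is exactly where the "naive" independence assumption enters.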

In [32]:
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay, classification_report, f1_score
# CV based searching for first and future models
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, RepeatedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import GaussianNB

Since we want to perform binary classification, we have to map our seven classes onto two classes ("fall" or "normal").

In [34]:
# Mapping for relabeling
label_mapping = {
    'fall' :'fall',
    'rfall': 'fall',
    'lfall': 'fall',
    'light': 'fall',
    'sit': 'normal',
    'walk': 'normal',
    'step': 'normal'
}

# Replace existing labels with new labels
df_binary['label'] = df_binary['label'].map(label_mapping)
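
One thing to keep in mind with `Series.map` and a dictionary: any label that is missing from the dictionary becomes `NaN`. The sketch below shows a quick sanity check on a made-up toy column; the label `'jump'` is hypothetical and does not come from the dataset.

```python
import pandas as pd

label_mapping = {
    'fall': 'fall', 'rfall': 'fall', 'lfall': 'fall', 'light': 'fall',
    'sit': 'normal', 'walk': 'normal', 'step': 'normal',
}

# Toy labels; 'jump' is a made-up label that is not in the mapping
toy = pd.Series(['walk', 'rfall', 'jump'])
mapped = toy.map(label_mapping)

# Labels absent from the dictionary are mapped to NaN
unmapped = toy[mapped.isna()].unique().tolist()
```

An empty `unmapped` list on the real column would confirm that every original label was covered by the mapping.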

We store the output vector in its own variable and drop the corresponding column from the dataset.

In [35]:
y = df_binary["label"]
df_binary = df_binary.drop("label", axis=1)

We split the dataset into X_train and X_test and the output vector into y_train and y_test. Then, we scale the input matrix in both the train and test sets.

In [36]:
# split data 
seed = 1218
X_train, X_test, y_train, y_test = train_test_split(df_binary, y, test_size=0.3, random_state=seed)

# scale the features (useful if we later add features on a different scale)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
# reuse the statistics learned on the training set; refitting on the test set would leak information
X_test_scaled = scaler.transform(X_test.astype(np.float64))

We measure the accuracy of the Gaussian naive Bayes classifier using repeated K-fold cross-validation.

In [77]:
# Prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=seed)
model = GaussianNB()
# Evaluate the model using cross-validation and accuracy scoring
scores = cross_val_score(model, X_train_scaled, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
# Calculate the mean of accuracy scores
mean_nb = scores.mean()
sd_nb = scores.std()
mean_nb

0.9544852941176469
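
One caveat of the cell above: the scaler was fitted on the whole training set before cross-validation, so each validation fold has already contributed to the scaling statistics. A possible alternative, sketched here on synthetic data from `make_classification` (a stand-in, not the project's dataset), is to wrap the scaler and the classifier in a pipeline so that the scaling is refitted inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1218)

# The scaler is refitted on the training part of each fold
pipe = make_pipeline(StandardScaler(), GaussianNB())
cv_demo = RepeatedKFold(n_splits=10, n_repeats=2, random_state=1218)
scores_demo = cross_val_score(pipe, X_demo, y_demo, scoring='accuracy', cv=cv_demo)
```

With the real data, the same pattern would take the unscaled `X_train` instead of `X_train_scaled`, and the classifier inside the pipeline could be swapped for logistic regression.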

Now it is the turn of logistic regression. Unlike naive Bayes, logistic regression makes no assumption about the distribution of the inputs: it maximizes the conditional likelihood of $Y$ given $X$, $Pr(Y|X)$, and leaves the distribution of the inputs unspecified.
This implies that if the true class-conditional distributions are Gaussian and the features are conditionally independent given $Y$, logistic regression will be asymptotically less efficient than naive Bayes.
Conversely, by making significantly weaker assumptions, logistic regression is more robust and less sensitive to incorrect modeling assumptions.
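
In symbols, with the labels coded as $y_i \in \{0, 1\}$, a weight vector $w$, an intercept $b$ and $N$ training samples (notation introduced here for illustration), logistic regression models

$$
\Pr(Y = 1 \mid X = x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

and chooses the parameters that maximize the conditional log-likelihood

$$
\ell(w, b) = \sum_{i=1}^{N} \left[ y_i \log \sigma(w^\top x_i + b) + (1 - y_i) \log\bigl(1 - \sigma(w^\top x_i + b)\bigr) \right]
$$

No term for the distribution of $x$ appears in this objective. Note that scikit-learn's `LogisticRegression` adds an L2 penalty to this objective by default.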

In [78]:
# Prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=seed)

# Create a logistic regression model
model = LogisticRegression()
# Evaluate the model using cross-validation and accuracy scoring
scores = cross_val_score(model, X_train_scaled, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# Calculate the mean of accuracy scores
mean_lr = scores.mean()
sd_lr = scores.std()
mean_lr

0.9869852941176471
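
The plotting cell that follows reads `conf_nb` and `conf_lr`, which are not defined in any cell shown here. The sketch below is one possible way to build such intervals, assuming a normal approximation of the form mean ± 1.96 standard deviations; the helper `normal_interval` and the example scores are hypothetical, and the original intervals may have been computed differently.

```python
import numpy as np

def normal_interval(mean, sd, z=1.96):
    """Return (lower, upper) = mean -/+ z * sd (assumed normal approximation)."""
    return (mean - z * sd, mean + z * sd)

# Stand-alone illustration with made-up fold accuracies (not results from this project)
example_scores = np.array([0.95, 0.96, 0.94, 0.97, 0.95])
example_conf = normal_interval(example_scores.mean(), example_scores.std())
```

In this notebook, the corresponding calls would be `conf_nb = normal_interval(mean_nb, sd_nb)` and `conf_lr = normal_interval(mean_lr, sd_lr)`, run before the plot.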

In [117]:
import plotly.graph_objects as go

fig = go.Figure()

# Add scatter traces for mean_nb and mean_lr
fig.add_trace(go.Scatter(
    x=[0, 0.5],
    y=[mean_nb, mean_lr],
    mode='markers',
    name='Means',
    error_y=dict(
        type='data',
        symmetric=False,
        array=[conf_nb[1] - mean_nb, conf_lr[1] - mean_lr],
        arrayminus=[mean_nb - conf_nb[0], mean_lr - conf_lr[0]]
    )
))

# Update x-axis properties
fig.update_xaxes(range=[-0.1, 0.6], tickvals=[0, 0.5], ticktext=['NB', 'LR'])

# Update y-axis properties
fig.update_yaxes(range=[0.8, 1.01])

fig.show()



In [None]:
We can see that logistic regression outperforms naive Bayes in terms of both mean accuracy and the variability of the scores.
This is consistent with a well-known 2001 paper by Andrew Ng and Michael Jordan, which shows that logistic regression has a lower asymptotic error than naive Bayes and therefore wins when the number of training samples is large.
However, the generative model approaches its (higher) asymptotic error faster: it needs on the order of $O(\log n)$ training samples, whereas the discriminative model needs $O(n)$, where $n$ is the number of features.
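
A rough way to explore this trade-off empirically is to train both models on growing subsets of the training data and compare their test accuracy. The sketch below uses synthetic data from `make_classification` and arbitrary subset sizes, both chosen for illustration; it is not a reproduction of the paper's experiments, and the outcome on the fall-detection data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem (a stand-in for the real dataset)
X_syn, y_syn = make_classification(n_samples=2000, n_features=20,
                                   n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn)

train_sizes = [30, 100, 300, 1000]
curve = {}
for m in train_sizes:
    # Fit both models on the first m training samples only
    nb = GaussianNB().fit(X_tr[:m], y_tr[:m])
    lr = LogisticRegression(max_iter=1000).fit(X_tr[:m], y_tr[:m])
    curve[m] = (nb.score(X_te, y_te), lr.score(X_te, y_te))

for m, (acc_nb, acc_lr) in curve.items():
    print(f"m={m:5d}  NB={acc_nb:.3f}  LR={acc_lr:.3f}")
```

Each row of the printout gives the test accuracy of the two models for one training-set size, which makes it easy to see where, if anywhere, the two curves cross.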