# Regression overview

- To get started with our model, we need two components:

   1. The equation describing the model
   2. The data
   
   
- Equations are specified using patsy formula syntax. Important operators are:
    1. `~` : Separates the left-hand side and right-hand side of a formula.
    2. `+` : Creates a union of terms that are included in the model.
    3. `:` : Interaction term.
    4. `*` : `a * b` is shorthand for `a + b + a:b`, and is useful for the common case of wanting to include all interactions between a set of variables.
    
    
- Intercepts are added by default.


- Categorical variables can be included directly by adding a term `C(a)`. More on that soon!


- For (2), we can conveniently use a pandas DataFrame.
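
As a minimal sketch of the operators listed above, the snippet below asks patsy which design-matrix columns each formula produces. The DataFrame `toy` and its columns `a`, `b`, and `g` are hypothetical names chosen only for illustration.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical toy data: two numeric columns and one 0/1 coded column
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.5, 0.1, 0.9, 0.3],
    "g": [0, 1, 0, 1],
})

# `+` includes both terms; the intercept is added by default
main_effects = dmatrix("a + b", toy).design_info.column_names

# `a * b` should expand to the same columns as `a + b + a:b`
full_interaction = dmatrix("a * b", toy).design_info.column_names
explicit = dmatrix("a + b + a:b", toy).design_info.column_names

# C(g) treats g as categorical and dummy-codes it against a reference level
categorical = dmatrix("C(g)", toy).design_info.column_names

print(main_effects)
print(full_interaction)
print(explicit)
print(categorical)
```

Only the right-hand side is needed here, which is why the sketch uses `dmatrix` rather than a full `y ~ x` formula.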

In [7]:
import pandas as pd
import numpy as np

# Create a DataFrame with continuous and categorical features
data = {
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Gender': [1, 0, 0, 1, 0],  # 1: male, 0: female
    'Purchased': [0, 1, 0, 1, 1]
}

df = pd.DataFrame(data)

# Define X (features) and Y (target)
X = df[['Age', 'Salary', 'Gender']]
Y = df['Purchased']

# Display X and Y
print("Features (X):")
print(X)
print("\nTarget (Y):")
print(Y)

# Optional: combine X and Y into a single DataFrame
# df_combined = pd.concat([X, Y], axis=1)
#
# Display the combined DataFrame
# print(df_combined)

Features (X):
   Age  Salary  Gender
0   25   50000       1
1   30   60000       0
2   35   70000       0
3   40   80000       1
4   45   90000       0

Target (Y):
0    0
1    1
2    0
3    1
4    1
Name: Purchased, dtype: int64



In [6]:
# create model 
import statsmodels.formula.api as smf

# Declare the model
mod = smf.ols(formula='Purchased ~ Age + Salary + C(Gender)', data=df)
# Fit the model (find the optimal coefficients); the OLS solution is deterministic, so no random seed is needed
res = mod.fit()
# Print the summary output provided by the library
print(res.summary())

# feature names
variables = res.params.index

# quantifying uncertainty!

# Extract the coefficients from the fitted OLS model
coefficients = res.params.values

# p-values
p_values = res.pvalues

# standard errors
standard_errors = res.bse.values

# 95% confidence intervals (column 0: lower bound, column 1: upper bound)
res.conf_int()


                            OLS Regression Results                            
Dep. Variable:              Purchased   R-squared:                       0.333
Model:                            OLS   Adj. R-squared:                 -0.333
Method:                 Least Squares   F-statistic:                    0.5000
Date:                Mon, 07 Oct 2024   Prob (F-statistic):              0.667
Time:                        17:06:18   Log-Likelihood:                -2.5132
No. Observations:                   5   AIC:                             11.03
Df Residuals:                       2   BIC:                             9.855
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -0.8000      1.575     -0.

  warn("omni_normtest is not valid with less than 8 observations; %i "


                            0             1
Intercept           -7.575824      5.975824
C(Gender)[T.1]      -2.594597      2.594597
Age             -3.493978e-08  5.493979e-08
Salary          -6.987947e-05  0.0001098795
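
The summary reports `Df Model: 2` although the formula lists three predictors. One likely reason, visible in the toy data itself, is that `Salary` is exactly 2000 times `Age`, so the two columns carry the same information. The sketch below checks this with NumPy; the design matrix is rebuilt by hand here as an assumption about how the formula is encoded, using the column order shown in the output above.

```python
import numpy as np

# Same values as the DataFrame defined earlier
age = np.array([25, 30, 35, 40, 45], dtype=float)
salary = np.array([50000, 60000, 70000, 80000, 90000], dtype=float)
gender = np.array([1, 0, 0, 1, 0], dtype=float)

# Salary is an exact multiple of Age in this data
is_exact_multiple = np.array_equal(salary, 2000 * age)

# Hand-built design matrix: intercept, Gender dummy, Age, Salary
design = np.column_stack([np.ones(len(age)), gender, age, salary])
rank = np.linalg.matrix_rank(design)

print("Salary == 2000 * Age:", is_exact_multiple)
print("columns:", design.shape[1], "rank:", rank)
```

A rank below the number of columns means the individual `Age` and `Salary` coefficients are not separately identified, so their values and intervals should be read with caution.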


In [8]:
# Declare the model using all features
mod_all_features = smf.ols(formula='Purchased ~ Age + Salary + C(Gender)', data=df)

# Fit the model
res_all_features = mod_all_features.fit()

# Print the model summary
print(res_all_features.summary())

                            OLS Regression Results                            
Dep. Variable:              Purchased   R-squared:                       0.333
Model:                            OLS   Adj. R-squared:                 -0.333
Method:                 Least Squares   F-statistic:                    0.5000
Date:                Mon, 07 Oct 2024   Prob (F-statistic):              0.667
Time:                        19:43:51   Log-Likelihood:                -2.5132
No. Observations:                   5   AIC:                             11.03
Df Residuals:                       2   BIC:                             9.855
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -0.8000      1.575     -0.

  warn("omni_normtest is not valid with less than 8 observations; %i "


### A lot of useful information is provided by default.

- Dep. Variable: the outcome being modeled (`Purchased`)
- Method: the type of model that was fitted (OLS)
- No. Observations: the number of data points (5 rows here)
- R-squared: the fraction of variance explained
- A list of predictors
- For each predictor: coefficient, standard error of the coefficient, p-value, 95% confidence interval.
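
The same quantities can also be read from the fitted results object instead of the printed table. The following is a minimal sketch that rebuilds the toy DataFrame so it runs on its own; the names `summary_fields` and `per_predictor` are made up for this illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same toy data as above
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Gender': [1, 0, 0, 1, 0],
    'Purchased': [0, 1, 0, 1, 1],
})

res = smf.ols(formula='Purchased ~ Age + Salary + C(Gender)', data=df).fit()

# Model-level fields from the top of the summary
summary_fields = {
    'dependent_variable': res.model.endog_names,
    'n_observations': int(res.nobs),
    'r_squared': res.rsquared,
    'predictors': list(res.params.index),
}

# Per-predictor fields from the coefficient table
ci = res.conf_int()  # column 0: lower bound, column 1: upper bound
per_predictor = pd.DataFrame({
    'coef': res.params,
    'std_err': res.bse,
    'p_value': res.pvalues,
    'ci_low': ci[0],
    'ci_high': ci[1],
})

print(summary_fields)
print(per_predictor)
```

Collecting the values in a DataFrame makes it easier to filter or plot them than parsing the text summary.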

In [14]:
import statsmodels.formula.api as smf

# Declare the logistic regression model
mod = smf.logit(formula='Purchased ~ Age + Salary + C(Gender)', data=df)

# Fit the model (find the optimal coefficients)
res = mod.fit()

# Print the summary of the model
print(res.summary())


         Current function value: 0.484139
         Iterations: 35


ValueError: operands could not be broadcast together with shapes (5,4) (4,4) (5,4) 