# GAM Exercise

[1] Reproduce the GAM fitted earlier, this time using `GLMGam`.

[2] This question relates to the `College.csv` data set.

(a) Split the data into a training set and a test set.

A predictor selection has already been carried out using a linear regression with `out-of-state` (the `Outstate` column) as the response. The best variables found were `['Expend', 'pAlumni', 'Board', 'Private_Yes', 'PhD', 'GradRate', 'SFRatio', 'Terminal']`.

(b) Fit a GAM (using patsy's natural cubic splines, `cr`) on the training data, with `out-of-state` as the response and the selected variables as predictors. Plot the results and explain your findings.

(c) Evaluate the fitted model on the test set and explain the results.

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
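
For exercise [1], the sketch below shows one way `GLMGam` from statsmodels is commonly called. It is not the model from the earlier session: it runs on synthetic data, and the column names `x`, `group`, `y`, the basis size `df=10`, and the penalty `alpha=1.0` are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.gam.api import GLMGam, BSplines

# Synthetic data (assumption): one smooth predictor and one binary predictor.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "x": rng.uniform(0, 10, n),
    "group": rng.integers(0, 2, n),
})
demo["y"] = (
    2.0 * demo["x"] + 3.0 * np.sin(demo["x"])  # non-linear effect of x
    + 1.5 * demo["group"]                      # linear effect of the dummy
    + rng.normal(0, 0.5, n)                    # noise
)

# Penalized B-spline basis for the smooth term (one entry per smooth variable).
bs = BSplines(demo[["x"]], df=[10], degree=[3])

# The formula holds the parametric part; the smoother adds the spline part.
# alpha is the penalty weight for each smooth term (illustrative value).
gam_res = GLMGam.from_formula(
    "y ~ group", data=demo, smoother=bs, alpha=np.array([1.0])
).fit()

print(gam_res.params)
```

With the real data, the smooth terms would be the continuous predictors and `alpha` would normally be tuned rather than fixed by hand.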

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from patsy import dmatrix
from collections import defaultdict
from operator import itemgetter
# from pandas import scatter_matrix 

%matplotlib inline
plt.style.use('ggplot')

## Import College Data

In [2]:
college = pd.read_csv('College.csv', index_col=0)
college.info()

<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Private      777 non-null    object 
 1   Apps         777 non-null    int64  
 2   Accept       777 non-null    int64  
 3   Enroll       777 non-null    int64  
 4   Top10perc    777 non-null    int64  
 5   Top25perc    777 non-null    int64  
 6   F.Undergrad  777 non-null    int64  
 7   P.Undergrad  777 non-null    int64  
 8   Outstate     777 non-null    int64  
 9   Room.Board   777 non-null    int64  
 10  Books        777 non-null    int64  
 11  Personal     777 non-null    int64  
 12  PhD          777 non-null    int64  
 13  Terminal     777 non-null    int64  
 14  S.F.Ratio    777 non-null    float64
 15  perc.alumni  777 non-null    int64  
 16  Expend       777 non-null    int64  
 17  Grad.Rate    777 non-null    int64  
dtypes: 

In [3]:
college.head()

Unnamed: 0,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate
Abilene Christian University,Yes,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60
Adelphi University,Yes,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56
Adrian College,Yes,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54
Agnes Scott College,Yes,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59
Alaska Pacific University,Yes,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15


In [4]:
# Private is a categorical variable; convert it to numeric dummy columns
college = pd.get_dummies(data=college, columns=['Private'])

In [5]:
college.head(1)

Unnamed: 0,Apps,Accept,Enroll,Top10perc,Top25perc,F.Undergrad,P.Undergrad,Outstate,Room.Board,Books,Personal,PhD,Terminal,S.F.Ratio,perc.alumni,Expend,Grad.Rate,Private_No,Private_Yes
Abilene Christian University,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60,0,1


In [6]:
del college['Private_No']
# rename the variables to remove the dots from their names
college.rename(columns={'F.Undergrad': 'FUndergrad', 'P.Undergrad': 'PUndergrad', 'Room.Board':'Board', 
                       'S.F.Ratio':'SFRatio', 'perc.alumni':'pAlumni', 'Grad.Rate':'GradRate'}, inplace=True)
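
For part (a), one possible approach is a random split done with plain pandas. The sketch below uses a small made-up frame named `college_demo` in place of the real `college` data; the 70/30 proportion and the `random_state` value are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the cleaned `college` frame (values are made up).
rng = np.random.default_rng(1)
college_demo = pd.DataFrame({
    "Outstate": rng.integers(2000, 22000, size=100),
    "Expend": rng.integers(3000, 20000, size=100),
    "Private_Yes": rng.integers(0, 2, size=100),
})

# Draw 70% of the rows at random for training; the remaining rows form the test set.
train = college_demo.sample(frac=0.7, random_state=42)
test = college_demo.drop(train.index)

print(train.shape, test.shape)
```

Dropping by index works here because the index labels are unique, which also holds for the college names used as the index of the real data.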

### Helper function for plotting a scatter-matrix-style grid from the GAM model
The following function may be useful when applied to the model (in part b).

In [11]:
def gam_plot(data, design, gam_model, predictors):
    """
    A method to plot fitted basis functions similar to R's gam library and associated plot method.
    
    Parameters:
        data:       Original DataFrame containing all features.
        design:     An n_samples by n_features design dataframe with basis functions created using Patsy.
        gam_model:  A fitted statsmodels OLS results object.
        predictors: A list of string names for each of the predictors used to construct the gam.        

    """
    basis_funcs = {}
    
    # Compute the product of the design df with the coefficients
    # (use the gam_model argument, not a global variable)
    results = design * gam_model.params
    # find the columns that have the predictor in the col name and get the results from these columns
    for predictor in predictors:
        column_names = [col for col in design.columns if predictor in col]
        basis_funcs[predictor] = results[column_names]
        
        
    # Plotting
    num_rows = -(-len(predictors) // 3)  # ceiling division: three panels per row
    # squeeze=False keeps axarr two-dimensional even when there is a single row
    fig, axarr = plt.subplots(num_rows, 3, figsize=(12,8), squeeze=False)
    
    for el, predictor in enumerate(predictors):
        # get the tuple index to plot to
        plt_index = np.unravel_index(el,(num_rows, 3))
        
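        # numeric predictors (int or float) with more than two distinct values;
        # 0/1 dummies fall through to the boxplot branch below
        if pd.api.types.is_numeric_dtype(data[predictor]) and data[predictor].nunique() > 2: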
            # get the order of the predictor and make a grid
            order = np.argsort(data[predictor])
            predictor_grid = data[predictor].values[order]
            # sum the basis functions and order the same as the predictor
            basis_vals = np.sum(basis_funcs[predictor].values, axis=1)[order]
            axarr[plt_index[0], plt_index[1]].plot(predictor_grid, basis_vals)
            
        else:
            # categorical predictors
            # make a temporary data frame
            s = pd.DataFrame(data[predictor])
            # add the sum of the basis functions for this categorical predictor
            s['f-'+predictor] = np.sum(basis_funcs[predictor].values, axis=1)
            # zero-mean
            s['f-'+predictor] = s['f-'+predictor] - s['f-'+predictor].mean()
            # call boxplot grouping by predictor col
            s.boxplot(column=['f-'+predictor], by=predictor, ax = axarr[plt_index[0], plt_index[1]]);
            axarr[plt_index[0], plt_index[1]].set_ylabel('F%d(%s)' %(el,predictor));
            plt.suptitle('')
            
        # Add Labels
        axarr[plt_index[0], plt_index[1]].set_xlabel(predictor);
        axarr[plt_index[0], plt_index[1]].set_ylabel('F%d(%s)' % (el,predictor));
        plt.tight_layout() 
                
    return basis_funcs