# Extended Practice: Importances and Coefficients

- The following practice assignment is much longer than a typical practice assignment.
    - You may skip this assignment if you feel comfortable with what you have learned thus far.
    - Note: while the target grades (G1-G3) are different, all of the features from this data set are the same as those from the Student Performance lessons. They require the same preprocessing steps as you've seen in the previous lessons.
    
## Task
- For this assignment, we will use the Math version of the student performance dataset that we have been exploring in this week's lessons. You will create a model to predict the students' final grades (G3). The features are the same as in the dataset used in the lessons, but the G1, G2, and G3 columns hold the students' grades for Math instead of Portuguese.
- First, preprocess the data.
    1. Perform train-test-split with G3 as the target.
    2. Use a ColumnTransformer with the required preprocessing steps.
        - Drop any unnecessary binary categories using the drop='if_binary' argument for OneHotEncoder.
        - Don't forget to add verbose_feature_names_out=False.
    3. Create DataFrame versions of your X_train and X_test data using the correct feature names.
- Second, fit a tree-based model of your choice (that produces feature importances).
    1. Evaluate its performance on the training and test data.
    2. Extract and visualize the feature importances determined by the model.
    3. Answer: what were the top 5 most important features?
- Third, apply sklearn's permutation_importance.
    1. Visualize the permutation importances.
    2. Answer: are the top 5 most important features the same as the top 5 according to the model's built-in importances?
- Fourth, fit a sklearn LinearRegression model.
    1. Evaluate its performance on the training & test data.
    2. Visualize the model's top 15 largest coefficients (according to absolute value).
    3. Select the 3 largest coefficients (by absolute value) and explain what they mean and what insights they might provide.
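
The second through fourth steps can be tried out on a small synthetic dataset before applying them to the student data. The snippet below is a minimal sketch, not the assignment solution: the toy features `x0` to `x3`, the variable names, and the choice of `RandomForestRegressor` as the tree-based model are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data (assumption): y depends strongly on x0, weakly on x1, not on x2/x3
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(300, 4)), columns=['x0', 'x1', 'x2', 'x3'])
y_toy = 5 * X_toy['x0'] - 2 * X_toy['x1'] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

# Tree-based model with built-in (impurity-based) feature importances
forest = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
print(f'Forest R^2 train: {forest.score(X_tr, y_tr):.3f}')
print(f'Forest R^2 test:  {forest.score(X_te, y_te):.3f}')
builtin_importances = pd.Series(
    forest.feature_importances_, index=X_tr.columns
).sort_values(ascending=False)

# Permutation importances, computed here on the test split
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)
perm_importances = pd.Series(
    perm.importances_mean, index=X_te.columns
).sort_values(ascending=False)

# Linear regression coefficients, ranked by absolute value
linreg = LinearRegression().fit(X_tr, y_tr)
coeffs = pd.Series(linreg.coef_, index=X_tr.columns)
top_coeffs = coeffs.loc[coeffs.abs().sort_values(ascending=False).index]

print(builtin_importances.head(5))
print(perm_importances.head(5))
print(top_coeffs.head(15))
```

Each result is a pandas Series indexed by feature name, so it can be visualized with `.plot(kind='barh')` and the two importance rankings can be compared by inspecting their `.index`.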

In [6]:
import pandas as pd
import numpy as np

# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector 
from sklearn.impute import SimpleImputer
from sklearn import metrics

In [2]:
pd.set_option('display.max_columns',0)

In [3]:
df = pd.read_excel('https://docs.google.com/spreadsheets/d/e/2PACX-1vS6xDKNpWkBBdhZSqepy48bXo55QnRv1Xy6tXTKYzZLMPjZozMfYhHQjAcC8uj9hQ/pub?output=xlsx')
df.info()
df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10


In [4]:
# missing values
df.isna().sum().sum()

0

In [5]:
# duplicated rows
df.duplicated().sum()

0

# Preprocess the Data

In [7]:
target = 'G3'

y = df[target]
X = df.drop(columns = target)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [8]:
# list the unique values of every object (categorical) column
data_types = df.dtypes
obj_data = data_types[data_types == 'object']

for column in obj_data.index:
    print(column)
    print(f'Unique Values: {df[column].unique()}\n')

school
Unique Values: ['GP' 'MS']

sex
Unique Values: ['F' 'M']

address
Unique Values: ['U' 'R']

famsize
Unique Values: ['GT3' 'LE3']

Pstatus
Unique Values: ['A' 'T']

Mjob
Unique Values: ['at_home' 'health' 'other' 'services' 'teacher']

Fjob
Unique Values: ['teacher' 'other' 'services' 'health' 'at_home']

reason
Unique Values: ['course' 'other' 'home' 'reputation']

guardian
Unique Values: ['mother' 'father' 'other']

schoolsup
Unique Values: ['yes' 'no']

famsup
Unique Values: ['no' 'yes']

paid
Unique Values: ['no' 'yes']

activities
Unique Values: ['no' 'yes']

nursery
Unique Values: ['yes' 'no']

higher
Unique Values: ['yes' 'no']

internet
Unique Values: ['no' 'yes']

romantic
Unique Values: ['no' 'yes']

