# Dataset notes:

- [Link to the data dictionary and info](https://www.kaggle.com/uciml/student-alcohol-consumption)
- **There are 2 separated CSVs:** 
    - Both datasets include demographic information regarding a specific student.
    - One pertains to performance in a Math class.
        - The Math dataset contains 395 students.
        - It contains 33 columns.
    - One pertains to performance in a Portugeuse (Por) Language class.
        - The Por dataset contains 649  students.
        - It contains 33 columns.
- **I will be using the combined dataset.**
- **Note from the data publisher:**
    - `Additional note: there are several (382) students that belong to both datasets. These students can be identified by searching for identical attributes that characterize each student, as shown in the annexed R file.`
    - Although the final dataframe will contain more than one entry for 382 students, the repeat entry will contain info on grade for a different class than the first entry.

# Imports

In [175]:
# Tools
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Transformation
from sklearn.preprocessing import StandardScaler

# Modeling
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression, \
                                 LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import VotingClassifier

# Dataframe

In [109]:
df_math = pd.read_csv("../data/student-mat.csv")
df_por = pd.read_csv("../data/student-por.csv")
df = df_math.append(df_por, ignore_index=True)

In [110]:
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [17]:
df.shape

(1044, 33)

## EDA - NaN, Dtypes, Value Counts

In [5]:
# NaN
df.isna().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

In [19]:
# How many numeric, obj, and categ cols?
df.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

In [88]:
# Check value counts:
vc_dict = {}
for col in df:
    vc_dict[col] = df[col].value_counts()

vc_dict["school"]

GP    772
MS    272
Name: school, dtype: int64

## Section Notes:

- A dictionary of the value_counts of each columnn was created to make it easier to reference later on.
- No NaN in DF
- Clean enough to work with, will need a decent amount of feature engineering in order incorporate the current numeric and categorical columns.

# Feature Engineering

## Create Numeric Columns

In [89]:
# Check the non-numeric cols:
obj_list = []
for col in df:
    if df[col].dtype == "object":
        obj_list.append(col)

# The list of columns that are non-numeric:
obj_list

['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']

In [90]:
bin_list = []
dummies_list = []

for k,v in vc_dict.items():  # Iterate thru dict created in previous section of value_counts in the df
    if k in obj_list:  # Check the keys if the cols have object values (available in obj_list)
        if v.count() == 2:  # if the count of object values equals 2
            bin_list.append(v)  # add to a list of variables to be binarized 
        elif v.count() >= 3:  # else if greater than 3
            dummies_list.append(v)  # add to list of cols to be dummied
# Check:
bin_list

[GP    772
 MS    272
 Name: school, dtype: int64,
 F    591
 M    453
 Name: sex, dtype: int64,
 U    759
 R    285
 Name: address, dtype: int64,
 GT3    738
 LE3    306
 Name: famsize, dtype: int64,
 T    923
 A    121
 Name: Pstatus, dtype: int64,
 no     925
 yes    119
 Name: schoolsup, dtype: int64,
 yes    640
 no     404
 Name: famsup, dtype: int64,
 no     824
 yes    220
 Name: paid, dtype: int64,
 no     528
 yes    516
 Name: activities, dtype: int64,
 yes    835
 no     209
 Name: nursery, dtype: int64,
 yes    955
 no      89
 Name: higher, dtype: int64,
 yes    827
 no     217
 Name: internet, dtype: int64,
 no     673
 yes    371
 Name: romantic, dtype: int64]

In [91]:
vc_dict

{'school': GP    772
 MS    272
 Name: school, dtype: int64,
 'sex': F    591
 M    453
 Name: sex, dtype: int64,
 'age': 16    281
 17    277
 18    222
 15    194
 19     56
 20      9
 21      3
 22      2
 Name: age, dtype: int64,
 'address': U    759
 R    285
 Name: address, dtype: int64,
 'famsize': GT3    738
 LE3    306
 Name: famsize, dtype: int64,
 'Pstatus': T    923
 A    121
 Name: Pstatus, dtype: int64,
 'Medu': 4    306
 2    289
 3    238
 1    202
 0      9
 Name: Medu, dtype: int64,
 'Fedu': 2    324
 1    256
 3    231
 4    224
 0      9
 Name: Fedu, dtype: int64,
 'Mjob': other       399
 services    239
 at_home     194
 teacher     130
 health       82
 Name: Mjob, dtype: int64,
 'Fjob': other       584
 services    292
 teacher      65
 at_home      62
 health       41
 Name: Fjob, dtype: int64,
 'reason': course        430
 home          258
 reputation    248
 other         108
 Name: reason, dtype: int64,
 'guardian': mother    728
 father    243
 other     

## Binarize

In [111]:
# Will take a different approach later, don't want to get bogged down on more complex functions

def to_binarize(data, col, value1, value2):
    data[col] = data[col].map({value1: 1, value2: 0})
    return data

In [107]:
bin_list

[GP    772
 MS    272
 Name: school, dtype: int64,
 F    591
 M    453
 Name: sex, dtype: int64,
 U    759
 R    285
 Name: address, dtype: int64,
 GT3    738
 LE3    306
 Name: famsize, dtype: int64,
 T    923
 A    121
 Name: Pstatus, dtype: int64,
 no     925
 yes    119
 Name: schoolsup, dtype: int64,
 yes    640
 no     404
 Name: famsup, dtype: int64,
 no     824
 yes    220
 Name: paid, dtype: int64,
 no     528
 yes    516
 Name: activities, dtype: int64,
 yes    835
 no     209
 Name: nursery, dtype: int64,
 yes    955
 no      89
 Name: higher, dtype: int64,
 yes    827
 no     217
 Name: internet, dtype: int64,
 no     673
 yes    371
 Name: romantic, dtype: int64]

In [112]:
to_binarize(df, "school", "GP", "MS")
to_binarize(df, "sex", "F", "M")
to_binarize(df, "address", "U", "R")
to_binarize(df, "famsize", "GT3", "LE3")
to_binarize(df, "Pstatus", "T", "A")
to_binarize(df, "schoolsup", "yes", "no")  # Went in diff order
to_binarize(df, "famsup", "yes", "no")  # Went in  diff order
to_binarize(df, "paid", "yes", "no")  # went in diff order
to_binarize(df, "activities", "yes", "no") # went in diff order
to_binarize(df, "nursery", "yes", "no")
to_binarize(df, "higher", "yes", "no")
to_binarize(df, "internet", "yes", "no")
to_binarize(df, "romantic", "yes", "no")  # went  in diff order

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,1,1,18,1,1,0,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,1,1,17,1,1,1,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,1,1,15,1,0,1,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,1,1,15,1,1,1,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,1,1,16,1,1,1,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1039,0,1,19,0,1,1,2,3,services,other,...,5,4,2,1,2,5,4,10,11,10
1040,0,1,18,1,0,1,3,1,teacher,services,...,4,3,4,1,1,1,4,15,15,16
1041,0,1,18,1,1,1,1,1,other,other,...,1,1,1,1,1,5,6,11,12,9
1042,0,0,17,1,0,1,3,1,services,services,...,2,4,5,3,4,2,6,10,10,10


## Get Dummies

In [118]:
dummies_list

[other       399
 services    239
 at_home     194
 teacher     130
 health       82
 Name: Mjob, dtype: int64,
 other       584
 services    292
 teacher      65
 at_home      62
 health       41
 Name: Fjob, dtype: int64,
 course        430
 home          258
 reputation    248
 other         108
 Name: reason, dtype: int64,
 mother    728
 father    243
 other      73
 Name: guardian, dtype: int64]

In [123]:
# df = pd.get_dummies(df, columns=["Mjob", "Fjob", "reason", "guardian"])
df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,...,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
0,1,1,18,1,1,0,4,4,2,2,...,0,0,1,1,0,0,0,0,1,0
1,1,1,17,1,1,1,1,1,1,2,...,1,0,0,1,0,0,0,1,0,0
2,1,1,15,1,0,1,1,1,1,2,...,1,0,0,0,0,1,0,0,1,0
3,1,1,15,1,1,1,4,2,1,3,...,0,1,0,0,1,0,0,0,1,0
4,1,1,16,1,1,1,3,3,1,2,...,1,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1039,0,1,19,0,1,1,2,3,1,3,...,1,0,0,1,0,0,0,0,1,0
1040,0,1,18,1,0,1,3,1,1,2,...,0,1,0,1,0,0,0,0,1,0
1041,0,1,18,1,1,1,1,1,2,2,...,1,0,0,1,0,0,0,0,1,0
1042,0,0,17,1,0,1,3,1,2,1,...,0,1,0,1,0,0,0,0,1,0


## Feature Creation

In [187]:
df.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid',
       'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2',
       'G3', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services',
       'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other',
       'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home',
       'reason_other', 'reason_reputation', 'guardian_father',
       'guardian_mother', 'guardian_other'],
      dtype='object')

In [215]:
# studytime, failures, absences, freetime

In [195]:
df["failures"].value_counts()

0    861
1    120
2     33
3     30
Name: failures, dtype: int64

In [200]:
df["have_failed"] = df["failures"] > 0
df["have_failed"].map({True: 1, False: 0})
df["have_failed"].value_counts()  # 183 have failed

False    861
True     183
Name: have_failed, dtype: int64

In [214]:
df["absences"].value_counts()  # 359 have 0 absences
df["absences"].mean()  # Mean is 4.43
(df["absences"] > df["absences"].mean()).sum()  # 334 greater than the mean number of absences
df["high_absences"] = (df["absences"] > df["absences"].mean()).map({True: 1, False: 0})
df["high_absences"].value_counts()

0    710
1    334
Name: high_absences, dtype: int64

In [231]:
df["freetime"].value_counts()
df.loc[df["freetime"] >= 3]
df.groupby("freetime")["famrel"].mean()  # Less free time, lower the family relationship score\
df.groupby("absences")["Dalc"].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
absences,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,359.0,1.401114,0.777188,1.0,1.0,1.0,2.0,5.0
1,15.0,1.2,0.560612,1.0,1.0,1.0,1.0,3.0
2,175.0,1.365714,0.737323,1.0,1.0,1.0,2.0,5.0
3,15.0,1.866667,0.743223,1.0,1.0,2.0,2.0,3.0
4,146.0,1.424658,0.893159,1.0,1.0,1.0,1.75,5.0
5,17.0,1.647059,0.931476,1.0,1.0,1.0,2.0,4.0
6,80.0,1.4125,0.822065,1.0,1.0,1.0,1.25,5.0
7,10.0,1.3,0.483046,1.0,1.0,1.0,1.75,2.0
8,64.0,1.828125,1.175878,1.0,1.0,1.0,2.0,5.0
9,10.0,1.5,1.269296,1.0,1.0,1.0,1.0,5.0


## Section Notes:
- A list of columns containing non-numeric values was created, to check against the v-c dictionary.
- Will need to return to 2 sections above to implement an automated process to binarize and dummy cols

# Classification Problem

## Problem 1: __

### Feature Set

In [137]:
# Check possible Y variable buckets:
# ((df["G3"] > 10).astype(int)).value_counts()

1    661
0    383
Name: G3, dtype: int64

In [160]:
X = df.drop(axis=1, columns=["G1", "G2", "G3"])
y = (df["G3"] > 10).astype(int)

In [161]:
y.value_counts()

1    661
0    383
Name: G3, dtype: int64

### Train/Test Split, Baseline Accuracy

In [162]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, 
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [163]:
y_test.value_counts(normalize=True)

1    0.631884
0    0.368116
Name: G3, dtype: float64

### KNN

In [170]:
ss = StandardScaler()

knn_pipe = Pipeline([
    ("ss", ss),
    ("knn", KNeighborsClassifier())
])

knn_params = {
    "knn__n_neighbors": [7, 9, 11],
    "knn__p": [1, 2]
}

knn_gs = GridSearchCV(
    knn_pipe,
    knn_params,
    cv=5,
    verbose=10
)

In [171]:
knn_gs.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] knn__n_neighbors=7, knn__p=1 ....................................
[CV] ........ knn__n_neighbors=7, knn__p=1, score=0.707, total=   0.0s
[CV] knn__n_neighbors=7, knn__p=1 ....................................
[CV] ........ knn__n_neighbors=7, knn__p=1, score=0.657, total=   0.0s
[CV] knn__n_neighbors=7, knn__p=1 ....................................
[CV] ........ knn__n_neighbors=7, knn__p=1, score=0.700, total=   0.0s
[CV] knn__n_neighbors=7, knn__p=1 ....................................
[CV] ........ knn__n_neighbors=7, knn__p=1, score=0.607, total=   0.0s
[CV] knn__n_neighbors=7, knn__p=1 ....................................
[CV] ........ knn__n_neighbors=7, knn__p=1, score=0.676, total=   0.0s
[CV] knn__n_neighbors=7, knn__p=2 ....................................
[CV] ........ knn__n_neighbors=7, knn__p=2, score=0.693, total=   0.0s
[CV] knn__n_neighbors=7, knn__p=2 ....................................
[CV] ........ knn

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    0.1s remaining:    0.0s


[CV] ........ knn__n_neighbors=9, knn__p=2, score=0.643, total=   0.0s
[CV] knn__n_neighbors=9, knn__p=2 ....................................
[CV] ........ knn__n_neighbors=9, knn__p=2, score=0.698, total=   0.0s
[CV] knn__n_neighbors=11, knn__p=1 ...................................
[CV] ....... knn__n_neighbors=11, knn__p=1, score=0.700, total=   0.0s
[CV] knn__n_neighbors=11, knn__p=1 ...................................
[CV] ....... knn__n_neighbors=11, knn__p=1, score=0.679, total=   0.0s
[CV] knn__n_neighbors=11, knn__p=1 ...................................
[CV] ....... knn__n_neighbors=11, knn__p=1, score=0.700, total=   0.0s
[CV] knn__n_neighbors=11, knn__p=1 ...................................
[CV] ....... knn__n_neighbors=11, knn__p=1, score=0.629, total=   0.0s
[CV] knn__n_neighbors=11, knn__p=1 ...................................
[CV] ....... knn__n_neighbors=11, knn__p=1, score=0.683, total=   0.0s
[CV] knn__n_neighbors=11, knn__p=2 ...................................
[CV] .

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.3s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('ss',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('knn',
                                        KNeighborsClassifier(algorithm='auto',
                                                             leaf_size=30,
                                                             metric='minkowski',
                                                             metric_params=None,
                                                             n_jobs=None,
                                                             n_neighbors=5, p=2,
                                                             weights='uniform'))],
                                verbose=False),
             ii

In [172]:
# Check Scores:
print(f"Train Score: {knn_gs.score(X_train, y_train)}")
print(f"Test Score: {knn_gs.score(X_test, y_test)}\n")
print(f"Best Score: {knn_gs.best_score_}")

Train Score: 0.7424892703862661
Test Score: 0.6869565217391305

Best Score: 0.6867112024665982


In [173]:
knn_gs.best_params_

{'knn__n_neighbors': 9, 'knn__p': 2}

### Logistic Regression

In [183]:
logreg_pipe = Pipeline([
    ("ss", StandardScaler()),
    ("logreg", LogisticRegression())
])

logreg_params = {
    "logreg__penalty": ["l1", "l2"],
    "logreg__solver": ["saga", "liblinear"],
    "logreg__C": [0.5, 1,0, 1.5]
}

logreg_gs = GridSearchCV(
    logreg_pipe,
    logreg_params,
    cv=5,
    verbose=10
)

In [184]:
logreg_gs.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga ..........
[CV]  logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga, score=0.771, total=   0.0s
[CV] logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga ..........
[CV]  logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga, score=0.786, total=   0.0s
[CV] logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga ..........
[CV]  logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga, score=0.736, total=   0.0s
[CV] logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga ..........
[CV]  logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga, score=0.671, total=   0.0s
[CV] logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga ..........
[CV]  logreg__C=0.5, logreg__penalty=l1, logreg__solver=saga, score=0.763, total=   0.0s
[CV] logreg__C=0.5, logreg__penalty=l1, logreg__solver=liblinear .....
[CV]  logreg__C=0.5, logreg__penalty=l1, logreg__solver=liblinear, s

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    0.1s remaining:    0.0s


[CV]  logreg__C=1, logreg__penalty=l1, logreg__solver=liblinear, score=0.793, total=   0.0s
[CV] logreg__C=1, logreg__penalty=l1, logreg__solver=liblinear .......
[CV]  logreg__C=1, logreg__penalty=l1, logreg__solver=liblinear, score=0.721, total=   0.0s
[CV] logreg__C=1, logreg__penalty=l1, logreg__solver=liblinear .......
[CV]  logreg__C=1, logreg__penalty=l1, logreg__solver=liblinear, score=0.679, total=   0.0s
[CV] logreg__C=1, logreg__penalty=l1, logreg__solver=liblinear .......
[CV]  logreg__C=1, logreg__penalty=l1, logreg__solver=liblinear, score=0.777, total=   0.0s
[CV] logreg__C=1, logreg__penalty=l2, logreg__solver=saga ............
[CV]  logreg__C=1, logreg__penalty=l2, logreg__solver=saga, score=0.757, total=   0.0s
[CV] logreg__C=1, logreg__penalty=l2, logreg__solver=saga ............
[CV]  logreg__C=1, logreg__penalty=l2, logreg__solver=saga, score=0.793, total=   0.0s
[CV] logreg__C=1, logreg__penalty=l2, logreg__solver=saga ............
[CV]  logreg__C=1, logreg__penal

ZeroDivisionError: float division by zero

ZeroDivisionError: float division by zero

ZeroDivisionError: float division by zero

ZeroDivisionError: float division by zero

ZeroDivisionError: float division by zero

ValueError: b'C <= 0'

ValueError: b'C <= 0'

ValueError: b'C <= 0'

ValueError: b'C <= 0'

ValueError: b'C <= 0'

ZeroDivisionError: float division by zero

ZeroDivisionError: float division by zero

ZeroDivisionError: float division by zero

ZeroDivisionError: float division by zero

ZeroDivisionError: float division by zero

ValueError: b'C <= 0'

ValueError: b'C <= 0'

ValueError: b'C <= 0'

ValueError: b'C <= 0'

ValueError: b'C <= 0'



[CV]  logreg__C=1.5, logreg__penalty=l1, logreg__solver=liblinear, score=0.793, total=   0.0s
[CV] logreg__C=1.5, logreg__penalty=l1, logreg__solver=liblinear .....
[CV]  logreg__C=1.5, logreg__penalty=l1, logreg__solver=liblinear, score=0.714, total=   0.0s
[CV] logreg__C=1.5, logreg__penalty=l1, logreg__solver=liblinear .....
[CV]  logreg__C=1.5, logreg__penalty=l1, logreg__solver=liblinear, score=0.679, total=   0.0s
[CV] logreg__C=1.5, logreg__penalty=l1, logreg__solver=liblinear .....
[CV]  logreg__C=1.5, logreg__penalty=l1, logreg__solver=liblinear, score=0.777, total=   0.0s
[CV] logreg__C=1.5, logreg__penalty=l2, logreg__solver=saga ..........
[CV]  logreg__C=1.5, logreg__penalty=l2, logreg__solver=saga, score=0.757, total=   0.0s
[CV] logreg__C=1.5, logreg__penalty=l2, logreg__solver=saga ..........
[CV]  logreg__C=1.5, logreg__penalty=l2, logreg__solver=saga, score=0.793, total=   0.0s
[CV] logreg__C=1.5, logreg__penalty=l2, logreg__solver=saga ..........
[CV]  logreg__C=1.5,

[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    0.5s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('ss',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('logreg',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None,
                                                           max_iter=100,
                                                           multi_class='auto',
              

In [185]:
print(f"Train Score: {logreg_gs.score(X_train, y_train)}")
print(f"Test Score: {logreg_gs.score(X_test, y_test)}\n")
print(f"Best Score: {logreg_gs.best_score_}")

Train Score: 0.7696709585121603
Test Score: 0.7565217391304347

Best Score: 0.7482528263103803


In [186]:
logreg_gs.best_params_

{'logreg__C': 1, 'logreg__penalty': 'l1', 'logreg__solver': 'saga'}

## Section Notes:

-  **There are 20 values possible for `G3`, or Final Grade. What are possible ways to classify this?**
    - Can ___ predictors accurately predict if a student will be in the top or bottom 25% of the class?
        - Can the weekend and weekday prepdictors predict if a student will be in the lower 50% of their class?
        

# Regression Problem

##

## Section Notes:

- If all variables are included as predictors, we are introducing overfitting to the models, but worth a check.