# Step 1 - Benchmarking

### Domain and Data

We use the Madelon data set, an artificial dataset, to find the best approach to feature selection.

### Problem Statement

We need a benchmarking pipeline whose results later steps can be compared against.

### Solution Statement

First, we read the data from the database and convert it to a data dictionary using a train/test split. Then we scale the data and feed it to a Logistic Regression model with an L2 penalty and a high C value, which minimizes the regularization effect because C is the inverse of the regularization strength.
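The steps above can be sketched roughly as follows. This is a minimal illustration and not the `pipeline` function from `lib/project_5.py`: the function name `benchmark_sketch` is hypothetical, and synthetic data from `make_classification` stands in for the Madelon table so the example runs without a database connection.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def benchmark_sketch(X, y, random_state=10):
    """Split, scale, and fit a weakly regularized logistic regression.

    Hypothetical helper for illustration only.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )

    # Fit the scaler on the training split only, then apply it to both splits.
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # The default penalty is L2. C is the inverse of the regularization
    # strength, so a large C means a weak penalty.
    model = LogisticRegression(C=1000, solver="liblinear")
    model.fit(X_train_scaled, y_train)

    return {
        "scaler": scaler,
        "model": model,
        "train_score": model.score(X_train_scaled, y_train),
        "test_score": model.score(X_test_scaled, y_test),
    }


# Synthetic stand-in for Madelon (assumption: this is not the real data).
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=5, random_state=0
)
result = benchmark_sketch(X, y)
print(result["train_score"], result["test_score"])
```

Fitting the scaler on the training split alone keeps information from the test split out of the preprocessing step, so the test score stays an honest estimate.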

### Metric

The metric used in this project is the model's mean accuracy. Based on these scores and how they vary with the features selected, we can decide which features are important.
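As a small illustration of the metric, mean accuracy is the fraction of predictions that match the true labels. The labels below are made up for the example; scikit-learn classifiers report the same quantity from their `score` method.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative labels only; these are not taken from the Madelon data.
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1])

# Accuracy is the share of positions where prediction and truth agree.
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```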

### Benchmark

The results from this step serve as a benchmark for later steps, because the model implemented here performs no feature reduction and treats all features as important.

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5.py`.

<img src="assets/benchmarking.png" width="600px">

In [1]:
from lib.project_5 import pipeline

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC, LinearSVC

In [3]:
proj5_conn = {
    "url" : "joshuacook.me",
    "port" : "5432",
    "database" : "dsi",
    "table" : "madelon",
    "user" : "dsi_student",
    "password" : "correct horse battery staple"
}


In [4]:
step1_b_output = (pipeline(proj5_conn, StandardScaler(), model=LogisticRegression(C=1000), random_state=10))

Connected to the database and got the data successfully.
Data dictionary created.
Data is scaled.
Transformer is  not found.
No grid search.


In [5]:
step1_b_output["scaler"]

StandardScaler(copy=True, with_mean=True, with_std=True)

In [6]:
step1_b_output["model"]

LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [7]:
step1_b_output["train_score"], step1_b_output["test_score"]

(0.78466666666666662, 0.56999999999999995)

In [8]:
features = pd.DataFrame(step1_b_output["features"], columns=["Feature", "Coefficient"])

In [9]:
features["abs_coefs"] = abs(features["Coefficient"])

In [10]:
features.sort_values(by="abs_coefs", ascending=False).head()

| | Feature | Coefficient | abs_coefs |
|---|---|---|---|
| 451 | feat_451 | 2.175024 | 2.175024 |
| 318 | feat_318 | -2.160156 | 2.160156 |
| 442 | feat_442 | -2.144897 | 2.144897 |
| 453 | feat_453 | 1.422522 | 1.422522 |
| 153 | feat_153 | -1.364683 | 1.364683 |


In [11]:
high_features = features[features["abs_coefs"] > 0.001]

In [12]:
high_features.shape[0]

496