In [2]:
from lib.project_5 import load_data_from_database, make_data_dict, general_transformer, general_model

# Step 1 - Benchmarking



### Domain and Data
MADELON is an artificial dataset that was part of the NIPS 2003 feature selection challenge. It poses a two-class classification problem with continuous input variables; the difficulty is that the problem is multivariate and highly non-linear. The data points are grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those features were added to form a set of 20 (redundant) informative features, and the examples must be separated into the 2 classes (the ±1 labels) based on those 20 features. A number of distractor features called 'probes', which have no predictive power, were also added. The order of the features and patterns was randomized.
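
To get a feel for this structure without the database, the sketch below generates a MADELON-like dataset with scikit-learn's `make_classification`, whose documentation describes it as adapted from the MADELON generator. It is a stand-in, not the challenge data: the sample count and the total of 500 features are assumptions chosen for illustration, labels come out as 0/1 rather than ±1, and clusters are split evenly between the classes (with a small default fraction of flipped labels) instead of being labeled at random.

```python
from sklearn.datasets import make_classification

# Assumed sizes: 2000 samples and 500 total features are illustrative choices,
# not values stated in the description above.
X, y = make_classification(
    n_samples=2000,
    n_features=500,           # 5 informative + 15 redundant + 480 noise "probes"
    n_informative=5,          # the five hypercube dimensions
    n_redundant=15,           # linear combinations of the informative features
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=16,  # 2 classes x 16 clusters = 32 hypercube vertices
    shuffle=True,             # randomize the order of samples and features
    random_state=0,
)

# Every feature that is neither informative nor redundant is pure noise.
n_probes = X.shape[1] - 5 - 15
print(X.shape, n_probes)
```

Because the columns are shuffled, nothing in `X` marks which 20 features carry signal, which is what makes feature selection the interesting part of the problem.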

### Problem Statement

We will be assessing the data to determine what an appropriate benchmark might be.

### Solution Statement

We will establish a benchmark that uses all features, so that feature selection algorithms can later be judged by whether they significantly outperform it on this binary classification task.

### Metric

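We fit a logistic regression with scikit-learn's default L2 (ridge) penalty and compare its scores on the training and test sets. For a scikit-learn classifier the default `score` is mean accuracy, which is presumably what `general_model` reports.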
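
As a small illustration of what the penalty does, the sketch below fits `LogisticRegression` on synthetic data (an assumption made purely for demonstration, unrelated to MADELON) at three values of `C`. In scikit-learn, `C` is the inverse of the regularization strength, so a larger `C` means a weaker L2 penalty and larger coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Noisy labels, so the classes overlap and the weakly penalized fit stays finite.
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

norms = {}
for C in (0.01, 1.0, 1e7):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:g}  coef norm={norms[C]:.3f}  train accuracy={model.score(X, y):.3f}")
```

The coefficient norm grows as `C` increases, which is the shrinkage effect of the ridge penalty being relaxed.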

### Benchmark

The benchmark is a logistic regression fit on all standardized features. Its test score is the baseline that later feature selection steps should beat.

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5.py`.

<img src="assets/benchmarking.png" width="600px">
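
The helpers in `lib/project_5.py` are not shown in this notebook, so the following is only a hypothetical sketch of how such a pipeline could be written. The `_sketch` function names, the dictionary keys other than `X_train`, `X_test`, `train_score` and `test_score`, and the use of plain arrays instead of a database-backed DataFrame are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def make_data_dict_sketch(X, y, test_size=0.25, random_state=None):
    """Split features and labels, and collect the pieces in a dict."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return {"X_train": X_train, "X_test": X_test,
            "y_train": y_train, "y_test": y_test}


def general_transformer_sketch(transformer, data_dict):
    """Fit the transformer on the training split only, then apply it to both."""
    transformer.fit(data_dict["X_train"])
    data_dict["X_train"] = transformer.transform(data_dict["X_train"])
    data_dict["X_test"] = transformer.transform(data_dict["X_test"])
    data_dict["transformer"] = transformer
    return data_dict


def general_model_sketch(model, data_dict):
    """Fit the model on the training split and record both scores."""
    model.fit(data_dict["X_train"], data_dict["y_train"])
    data_dict["train_score"] = model.score(data_dict["X_train"], data_dict["y_train"])
    data_dict["test_score"] = model.score(data_dict["X_test"], data_dict["y_test"])
    data_dict["model"] = model
    return data_dict


# Demonstration on synthetic data (not MADELON).
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(200, 10))
y = (X[:, 0] + X[:, 1] > 6.0).astype(int)

data_dict = make_data_dict_sketch(X, y, random_state=77)
data_dict = general_transformer_sketch(StandardScaler(), data_dict)
data_dict = general_model_sketch(LogisticRegression(), data_dict)
print(data_dict["train_score"], data_dict["test_score"])
```

Fitting the scaler on the training split alone keeps information from the test split out of the preprocessing step, which matters when the test score is used as a benchmark.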

In [3]:
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [4]:
mandelon_df = load_data_from_database()


In [5]:
data_dict = make_data_dict(mandelon_df, random_state=77)

In [6]:
scaled_mandelon = general_transformer(StandardScaler(), data_dict)


In [7]:
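# C is the inverse of the regularization strength, so this very large value
# makes the default L2 penalty negligible.
modeled_data = general_model(LogisticRegression(C=10000000), data_dict)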
modeled_data


{'X_test': array([[-1.03207743, -0.28192768,  0.23721972, ..., -1.520195  ,
         -0.83912849,  1.11284104],
        [ 0.83013124,  0.85987615,  0.02929943, ..., -0.62010915,
         -0.27222274, -0.08265644],
        [ 0.20939502,  0.53364649, -1.84198313, ..., -1.07015207,
          0.61862915, -0.931074  ],
        ..., 
        [ 2.84752397,  0.85987615, -1.40015253, ...,  0.20496955,
          1.13154387, -0.69968739],
        [-2.11836582, -1.84783007, -0.3605511 , ...,  1.18006256,
         -1.10908361, -0.12122087],
        [ 0.05421096,  0.24003979,  1.79662186, ...,  1.48009117,
          0.37566954,  0.99714773]]),
 'X_train': array([[ -2.56157149e-01,   1.87118811e+00,  -2.04610887e-01, ...,
          -3.95087684e-01,   4.29660565e-01,  -1.00820287e+00],
        [ -1.00973093e-01,   6.31515385e-01,   5.23110111e-01, ...,
          -1.37018069e+00,   2.40691983e-01,  -1.59785308e-01],
        [  2.09395020e-01,  -3.18537169e+00,  -1.71203295e+00, ...,
          -4.700948

In [10]:
print(data_dict['test_score'])

0.544


In [11]:
print(data_dict['train_score'])

0.786
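
Read together, the two scores suggest that the all-features model overfits. The short calculation below makes the gap explicit. It assumes the scores are accuracies and that the two classes are roughly balanced, so chance level is about 0.5; neither assumption is confirmed by the notebook.

```python
train_score = 0.786  # printed above
test_score = 0.544   # printed above
chance = 0.5         # assumption: roughly balanced two-class labels

# Difference between training and test performance.
gap = round(train_score - test_score, 3)
# How far the test score sits above the assumed chance level.
lift = round(test_score - chance, 3)
print(gap, lift)
```

Under these assumptions the model gains only a few points over guessing on unseen data while scoring much higher on the training split, which is the behavior a feature selection step would aim to improve.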
