# STEP 1 - Benchmarking

### Domain and Data

Domain: We will set a starting benchmark with our feature selection pipeline.

Data: Our dataset is MADELON, a synthetic dataset with 500 features and one target variable labeled either +1 or -1. Of the 500 features, 5 are informative, 15 are linear combinations of those 5 informative features, and the remaining 480 are noise (i.e. distractors).
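To make this feature structure concrete, here is a minimal sketch that generates a MADELON-like stand-in with scikit-learn's `make_classification`, whose documentation describes it as adapted from the algorithm designed to generate Madelon. This is an illustration only, not the actual MADELON data we load below; the sample count, the seed, and the relabeling to +1/-1 are our assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification

# Assumption: a synthetic stand-in with the same feature breakdown as MADELON
# (5 informative, 15 redundant linear combinations, 480 noise features).
# It is NOT the real MADELON dataset.
X, y01 = make_classification(
    n_samples=2000,
    n_features=500,
    n_informative=5,
    n_redundant=15,
    n_repeated=0,
    random_state=0,
)

# make_classification labels the classes 0/1; relabel to -1/+1 as in MADELON
y = np.where(y01 == 1, 1, -1)

n_noise = X.shape[1] - 5 - 15
print(X.shape, np.unique(y), n_noise)
```

The columns are shuffled by default, so the informative features are not in any known position, which is the same situation a feature selection pipeline faces on the real data.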

### Problem Statement

Our goal is to establish a baseline performance level that later pipelines will try to improve upon. In other words, we are setting a benchmark.

### Solution Statement

To reach the objective described in the problem statement, we will build our own machine learning pipeline. This general pipeline will do the following, in order:

1) load our data from a database on a remote server

2) create a data dictionary from that data. This step splits our data into a feature matrix X and a target y, and then splits both X and y into training and test sets. The data dictionary holds X_train, X_test, y_train, and y_test, and eventually also records the transformer (added in step 3, see below) and the model (added in step 4, again, see below) that were used on our data

3) transform the data in the data dictionary from step 2, using whichever transformer we wish (e.g. StandardScaler, SelectKBest)

4) model the transformed data from step 3, using any model of our choice (e.g. Logistic Regression, KNN)
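Steps 2 through 4 can be sketched roughly as follows. This is a hypothetical illustration, not the actual code in `lib/project_5.py`: the `_sketch` function names, the `processes` key, the `train_score` and `test_score` keys, and the split parameters are all our assumptions, and step 1 (loading from the database) is replaced by a small synthetic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def make_data_dict_sketch(X, y, test_size=0.25, random_state=42):
    """Step 2 (hypothetical): split X and y and store the pieces in a dict."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return {
        "X_train": X_train, "X_test": X_test,
        "y_train": y_train, "y_test": y_test,
        "processes": [],  # assumed key: records transformers/models applied
    }


def general_transformer_sketch(transformer, data_dict):
    """Step 3 (hypothetical): fit on the training set, transform both sets."""
    out = dict(data_dict)  # shallow copy so the input dict is not modified
    out["X_train"] = transformer.fit_transform(
        data_dict["X_train"], data_dict["y_train"]
    )
    out["X_test"] = transformer.transform(data_dict["X_test"])
    out["processes"] = data_dict["processes"] + [transformer]
    return out


def general_model_sketch(model, data_dict):
    """Step 4 (hypothetical): fit the model and record train/test accuracy."""
    out = dict(data_dict)
    model.fit(data_dict["X_train"], data_dict["y_train"])
    out["train_score"] = model.score(data_dict["X_train"], data_dict["y_train"])
    out["test_score"] = model.score(data_dict["X_test"], data_dict["y_test"])
    out["processes"] = data_dict["processes"] + [model]
    return out


# Stand-in for step 1: synthetic data instead of the remote database
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

data_dictionary = make_data_dict_sketch(X, y)
scaled = general_transformer_sketch(StandardScaler(), data_dictionary)
scored = general_model_sketch(
    LogisticRegression(C=1_000_000, max_iter=1000), scaled
)
print(scored["train_score"], scored["test_score"])
```

Fitting the transformer on the training set only, and then applying it to the test set, keeps information from the test set out of the preprocessing step.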

The first pipeline we run, demonstrated in this Jupyter Notebook, is for benchmarking purposes. In this pipeline, StandardScaler will be our transformer and Logistic Regression will be our model. Because this is a benchmark, we will run a naive Logistic Regression: we set a very high C value to minimize regularization. In scikit-learn, C is the inverse of the regularization strength, so a high value such as 1,000,000 makes the penalty very small (the strength is 1/C).
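To see the effect of C, the following sketch fits two Logistic Regressions on the same synthetic data, one with a very high C and one with a very low C, and compares the size of the learned coefficients. The dataset, the seed, the `max_iter` setting, and the low C value of 0.001 are our assumptions for illustration; only the high value of 1,000,000 comes from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: small noisy synthetic data, used only to illustrate the role of C
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X = StandardScaler().fit_transform(X)

# High C -> weak penalty (strength 1/C is tiny), close to an unregularized fit
weak_penalty = LogisticRegression(C=1_000_000, max_iter=5000).fit(X, y)

# Low C -> strong penalty, coefficients are shrunk toward zero
strong_penalty = LogisticRegression(C=0.001, max_iter=5000).fit(X, y)

weak_norm = float(np.linalg.norm(weak_penalty.coef_))
strong_norm = float(np.linalg.norm(strong_penalty.coef_))
print("coefficient norm with C=1,000,000:", weak_norm)
print("coefficient norm with C=0.001:", strong_norm)
```

The model fitted with the high C should end up with the larger coefficient norm, since almost nothing is pulling its coefficients toward zero.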

Here is a visual of the steps in our pipeline:

<img src="assets/benchmarking.png" width="600px">

##### Load the data from a database:

In [1]:
# Import our wrapper functions from the project_5.py in our lib
from lib.project_5 import load_data_from_database, add_to_process_list, make_data_dict, validate_dictionary, general_model, general_transformer

In [2]:
# Load our data, from the database, into a DataFrame
madelon_df = load_data_from_database()

In [3]:
# Make sure our data was loaded correctly. Our DataFrame should have 2000 rows and 501 columns
madelon_df.shape

(2000, 501)

##### Make a data dictionary (which includes splitting our data into a feature matrix (X), and target (y), and then splitting both our X and y into train and test sets):

In [4]:
# Create a data dictionary from our DataFrame
data_dictionary = make_data_dict(madelon_df)

In [5]:
# Make sure data_dictionary appears correct: it should hold the train/test
# split of our data (X_train, X_test, y_train, y_test)
data_dictionary

{'X_test':        feat_000  feat_001  feat_002  feat_003  feat_004  feat_005  feat_006  \
 index                                                                         
 116         472       463       518       487       549       482       498   
 858         481       449       481       486       483       473       475   
 35          481       467       540       490       503       479       495   
 862         481       504       440       487       506       482       458   
 1168        486       508       537       490       408       474       418   
 349         480       488       498       481       481       476       445   
 1919        483       482       448       485       496       478       497   
 981         466       518       549       477       523       473       449   
 1055        481       490       547       476       515       484       434   
 1530        470       460       542       491       489       478       447   
 516         481       503    

##### Run our data through our general transformer. Here, our transformer will be StandardScaler:

In [6]:
from sklearn.preprocessing import StandardScaler
scaled = general_transformer(StandardScaler(), data_dictionary)
scaled

{'X_test': array([[-1.51840849, -0.70228734,  0.19621132, ...,  1.11292377,
          0.7123275 ,  0.36377799],
        [-0.12699416, -1.1578502 , -0.7515457 , ..., -0.25819254,
         -1.36743266,  0.67730731],
        [-0.12699416, -0.57212652,  0.75974252, ..., -0.76334066,
          0.44222878,  0.67730731],
        ..., 
        [ 0.80061538, -0.50704611,  1.06712318, ...,  1.18508779,
          0.41521891,  1.34355711],
        [ 1.10981857, -1.12531   , -1.69930273, ..., -0.33035656,
         -0.34105751, -1.83092723],
        [-1.05460371, -2.65469961,  0.86220274, ...,  0.10262754,
          0.46923865,  0.75568964]]),
 'X_train': array([[-0.43619735, -1.22293061, -0.85400592, ..., -1.41281681,
         -0.04394892,  0.63811614],
        [-2.75522122, -1.48325225,  0.96466296, ...,  0.46344762,
         -0.36806739,  0.40296915],
        [ 0.95521697, -0.08402346, -1.64807262, ..., -0.18602853,
          0.33418929, -0.73357462],
        ..., 
        [ 0.4914122 , -0.669747

##### Run our transformed data through our general model function. We will use Logistic Regression with a high C (to minimize regularization) as our model:

In [8]:
from sklearn.linear_model import LogisticRegression
scored = general_model(LogisticRegression(C=1000000), scaled)
scored

{'X_test': array([[-1.51840849, -0.70228734,  0.19621132, ...,  1.11292377,
          0.7123275 ,  0.36377799],
        [-0.12699416, -1.1578502 , -0.7515457 , ..., -0.25819254,
         -1.36743266,  0.67730731],
        [-0.12699416, -0.57212652,  0.75974252, ..., -0.76334066,
          0.44222878,  0.67730731],
        ..., 
        [ 0.80061538, -0.50704611,  1.06712318, ...,  1.18508779,
          0.41521891,  1.34355711],
        [ 1.10981857, -1.12531   , -1.69930273, ..., -0.33035656,
         -0.34105751, -1.83092723],
        [-1.05460371, -2.65469961,  0.86220274, ...,  0.10262754,
          0.46923865,  0.75568964]]),
 'X_train': array([[-0.43619735, -1.22293061, -0.85400592, ..., -1.41281681,
         -0.04394892,  0.63811614],
        [-2.75522122, -1.48325225,  0.96466296, ...,  0.46344762,
         -0.36806739,  0.40296915],
        [ 0.95521697, -0.08402346, -1.64807262, ..., -0.18602853,
          0.33418929, -0.73357462],
        ..., 
        [ 0.4914122 , -0.669747

### Metric

Our measure of success will be the accuracy score from our naive Logistic Regression.
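Accuracy is simply the fraction of predictions that match the true labels. As a minimal sketch, with made-up labels that are not taken from our data:

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Hypothetical labels for illustration only
y_true = [1, -1, 1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, 1, -1, 1, 1, -1]

acc = accuracy(y_true, y_pred)
print(acc)  # 6 of 8 predictions are correct -> 0.75
```

scikit-learn's `accuracy_score` computes the same quantity, and the `score` method of its classifiers returns the mean accuracy on the data passed to it.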

### Benchmark

Because our goal in this first pipeline is to obtain a benchmark score from our model, our primary benchmark for success will be a properly functioning pipeline that produces a correctly calculated accuracy score from the Logistic Regression. In terms of our metric (accuracy score), we will aim for an accuracy greater than 50%, which is roughly what random guessing yields on a balanced two-class target.
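One way to sanity-check a threshold like 50% is to compute the accuracy of always predicting the most common training label. The sketch below does this on made-up, perfectly balanced labels; the function name and the labels are hypothetical, and the real class balance should be checked on the actual y_train and y_test.

```python
import numpy as np


def majority_class_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))


# Hypothetical, perfectly balanced labels for illustration only
y_train = np.array([1, -1] * 50)
y_test = np.array([1, -1] * 20)

baseline = majority_class_accuracy(y_train, y_test)
print(baseline)  # half the test labels match the constant guess -> 0.5
```

If the classes were imbalanced, this baseline would be higher than 50%, and the target accuracy would need to be raised accordingly.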

### Results

We aimed for an accuracy greater than 50%. The test score from our naive Logistic Regression was approximately 57%. Thus, our benchmark to improve upon is 57%.