# Homework 7

## Due Thursday, November 29th 2018 at 11:59 PM.

### Be sure to push the final version of your notebook to your GitHub repo.  Follow the instructions on the course website.

### Topics
####  [Part 1](#part_1):  Database schema [15 points]
* [Problem 1](#p1.1). Schema [15 points]

####  [Part 2](#part_2):  Insert records [35 points]
* [Problem 2](#p2.1). Baseline model [15 points]
* [Problem 3](#p2.2). Reduced model [10 points]
* [Problem 4](#p2.3). L1 penalty model [10 points]

####  [Part 3](#part_3):  Queries [20 points]
* [Problem 5](#p3.1). Best model coefficients [10 points]
* [Problem 6](#p3.2). Best model score [10 points]

---

<a id='part_1'></a>
# Part 1:  Database schema

<a id='p1.1'></a>
## Problem 1 (15 points): 

In this problem you will set up a SQL database using the `sqlite3` package in Python. The purpose of the database is to store parameters and model results related to a simple *Logistic Regression* problem. Rather than keeping the results in `NumPy` arrays as we usually do, the idea here is to use a `SQL` database to materialize the results so that they can easily be accessed from disk at a later stage.

The design of the database should be flexible enough to store the results from different model iterations. It should also accommodate a different set of features in each model iteration.

A list of the tables to include in the database and the relevant fields in each table is shown below (tables are in bold):

**model_params**
* id 
* desc 
* param_name
* value

**model_coeffs**
* id 
* desc 
* feature_name
* value

**model_results**
* id 
* desc 
* train_score
* test_score

Create a `SQL` database called `regression.sqlite` containing the three tables shown above.

In [154]:
import sqlite3
import pandas as pd

# table view settings
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

# Create database
tables = {
    'model_params':'id, desc, param_name, value',
    'model_coeffs':'id, desc, feature_name, value',
    'model_results':'id, desc, train_score, test_score'
}
db = sqlite3.connect('regression.sqlite')
cursor = db.cursor()
cursor.execute("PRAGMA foreign_keys=1") # for cross-refs
for table_name, table_fields in tables.items():
    cursor.execute(f"DROP TABLE IF EXISTS {table_name}")
    # create the table with its list of fields
    cursor.execute(f"CREATE TABLE {table_name} ({table_fields})")
db.commit() # Commit changes to the database
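
The cell above creates the three tables without column types. As a minimal sketch of a variant, the snippet below declares explicit types and then checks that the tables and columns exist. The column types, the in-memory database, and the names `demo_db`, `demo_cursor`, and `typed_tables` are assumptions made for illustration; the assignment only specifies the field names.

```python
import sqlite3

# In-memory database so this sketch leaves regression.sqlite untouched.
demo_db = sqlite3.connect(":memory:")
demo_cursor = demo_db.cursor()

# Column types are an assumption; the assignment only names the fields.
# "desc" is quoted because DESC is also an SQL keyword.
typed_tables = {
    "model_params": 'id INTEGER, "desc" TEXT, param_name TEXT, value',
    "model_coeffs": 'id INTEGER, "desc" TEXT, feature_name TEXT, value REAL',
    "model_results": 'id INTEGER, "desc" TEXT, train_score REAL, test_score REAL',
}
for name, fields in typed_tables.items():
    demo_cursor.execute(f"CREATE TABLE {name} ({fields})")
demo_db.commit()

# List the tables that now exist.
table_names = sorted(
    row[0]
    for row in demo_cursor.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
)

# List the column names of one table (second field of each table_info row).
param_columns = [
    row[1] for row in demo_cursor.execute("PRAGMA table_info(model_params)")
]

print(table_names)
print(param_columns)
```

The `value` column of `model_params` is left untyped here because parameter values can be numbers, strings, or missing.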

<a id='part_2'></a>
# Part 2: Insert records

In this section you will populate the database you created in the previous question with some records for a number of different model iterations / scenarios.

<a id='p2.1'></a>
## Problem 2 (15 points): 
Create a baseline Logistic Regression model using the provided code (below).  Insert the relevant arrays into the corresponding tables in the database.

**model_params**
Values from the [`get_params`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.get_params) method.

**model_coeffs**
Coefficient and intercept values of the fitted model (see `coef_` and `intercept_` attributes in the documentation).

**model_results**
Train and validation accuracy obtained from the [`score`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score) method.


#### Remarks
* Refer to the scikit-learn documentation for more detail on the methods / attributes listed above:  
[https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

* Note that *id* and *desc* are just identifier fields used to identify the results from a specific model iteration or scenario. For example, for the baseline model you could set *id = 1* and *desc = "Baseline model"*.


#### Suggestions
You may want to create a function to save data to the database.  You will be able to re-use this function in subsequent sections.
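
One possible shape for such a function is sketched below. It is only an illustration: the name `save_rows`, its signature, and the in-memory demo database are assumptions, not part of the assignment. It uses `?` placeholders so that values are passed to SQLite as parameters, and `executemany` to insert all rows in one call.

```python
import sqlite3


def save_rows(db, table_name, model_id, model_desc, rows):
    """Insert one (id, desc, key, value) record per item of the dict `rows`."""
    # Table names cannot be bound as parameters, so restrict them explicitly.
    allowed = {"model_params", "model_coeffs"}
    if table_name not in allowed:
        raise ValueError(f"unexpected table name: {table_name}")
    records = [(model_id, model_desc, key, value) for key, value in rows.items()]
    db.executemany(f"INSERT INTO {table_name} VALUES (?, ?, ?, ?)", records)
    db.commit()  # persist the inserted rows
    return len(records)


# Demo on an in-memory database with made-up parameter values.
demo_db = sqlite3.connect(":memory:")
demo_db.execute('CREATE TABLE model_params (id, "desc", param_name, value)')

n_saved = save_rows(
    demo_db,
    "model_params",
    1,
    "Baseline model",
    {"C": 1.0, "penalty": "l2", "class_weight": None},
)
stored = demo_db.execute(
    "SELECT param_name, value FROM model_params ORDER BY param_name"
).fetchall()
print(n_saved, stored)
```

The same function can serve both `model_params` and `model_coeffs`, since both tables hold one key and one value per row.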

In [155]:
# Import additional libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

sns.set_style('darkgrid')

%config InlineBackend.figure_format = 'retina'

# holds model names and associated model idx for referencing 
# when adding data to tables later
models = {} 

# Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=87)

# Fit model
clf = LogisticRegression(solver='liblinear') # avoid FutureWarning
clf.fit(X_train, y_train)

# add model to model dictionary 
models['Baseline model'] = 1

# convenience function for viewing tables
def viz_tables(cols, query):
    q = cursor.execute(query).fetchall()
    framelist = dict()
    for i, col_name in enumerate(cols):
        framelist[col_name] = [col[i] for col in q]
    return pd.DataFrame.from_dict(framelist)

# convenience function for inserting data into tables
# data: dict mapping a field name (parameter or feature) to its value
def insert_data(table_name, model_name, data):
    for field, val in data.items():
        cursor.execute('''INSERT INTO ''' + table_name + \
                               ''' (''' + tables[table_name] + ''') ''' + \
                               ''' VALUES (?, ?, ?, ?)''',
                               (models[model_name], model_name, field, val))
        
# model_params data
params = clf.get_params() # data for model_params table

# model_coeffs data
coeffs = dict(zip(data['feature_names'], clf.coef_[0])) 
coeffs['intercept'] = clf.intercept_[0] # adding intercept to dict

# model_results data
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
results = {'train':train_score, 'test':test_score} # model_results

# insert data into relevant tables
insert_data('model_params', 'Baseline model', data=params)

insert_data('model_coeffs', 'Baseline model', data=coeffs)

table_name = 'model_results'
model_name = 'Baseline model'
cursor.execute('''INSERT INTO ''' + table_name + \
                               ''' (''' + tables[table_name] + ''') ''' + \
                               ''' VALUES (?, ?, ?, ?)''',
                               (models[model_name], model_name, train_score, test_score))
db.commit() # persist the inserted rows to disk

# display each table
for table_name, table_fields in tables.items():
    print(table_name)
    query = f'SELECT * FROM {table_name}'
    # split on ', ' so the column names carry no trailing commas
    display(viz_tables(table_fields.split(', '), query))
    print()

model_params

|   | id | desc | param_name | value |
|---|---|---|---|---|
| 0 | 1 | Baseline model | C | 1 |
| 1 | 1 | Baseline model | class_weight | |
| 2 | 1 | Baseline model | dual | 0 |
| 3 | 1 | Baseline model | fit_intercept | 1 |
| 4 | 1 | Baseline model | intercept_scaling | 1 |
| 5 | 1 | Baseline model | max_iter | 100 |
| 6 | 1 | Baseline model | multi_class | warn |
| 7 | 1 | Baseline model | n_jobs | |
| 8 | 1 | Baseline model | penalty | l2 |
| 9 | 1 | Baseline model | random_state | |

model_coeffs

|   | id | desc | feature_name | value |
|---|---|---|---|---|
| 0 | 1 | Baseline model | mean radius | 2.143352 |
| 1 | 1 | Baseline model | mean texture | 0.073687 |
| 2 | 1 | Baseline model | mean perimeter | -0.148922 |
| 3 | 1 | Baseline model | mean area | 0.01565 |
| 4 | 1 | Baseline model | mean smoothness | -0.104633 |
| 5 | 1 | Baseline model | mean compactness | -0.407477 |
| 6 | 1 | Baseline model | mean concavity | -0.594942 |
| 7 | 1 | Baseline model | mean concave points | -0.263499 |
| 8 | 1 | Baseline model | mean symmetry | -0.155283 |
| 9 | 1 | Baseline model | mean fractal dimension | -0.028102 |

model_results

|   | id | desc | train_score | test_score |
|---|---|---|---|---|
| 0 | 1 | Baseline model | 0.96044 | 0.938596 |





<a id='p2.2'></a>
## Problem 3 (10 points): 
Create a second model using only the features included in the list below (in `feature_cols`).  Insert the relevant arrays into the corresponding tables in the database.

Remember to update the `id` and `desc` values for the second iteration.

#### Suggestions
* Name this second model `"Reduced model"`.

In [156]:
feature_cols = ['mean radius',
                'texture error',
                'worst radius',
                'worst compactness',
                'worst concavity']

In [157]:
X_reduced = X[feature_cols]
y = data.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, 
                                                    test_size=0.2, 
                                                    random_state=87)

# Fit model
clf = LogisticRegression(solver='liblinear') # avoid FutureWarning
clf.fit(X_train, y_train)

# add model to model dictionary 
models['Reduced model'] = 2

# model_params data
params = clf.get_params() # data for model_params table

# model_coeffs data (pair each coefficient with its name in the reduced feature list)
coeffs = dict(zip(feature_cols, clf.coef_[0])) 
coeffs['intercept'] = clf.intercept_[0] # adding intercept to dict

# model_results data
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
results = {'train':train_score, 'test':test_score} # model_results

# insert data into relevant tables
insert_data('model_params', 'Reduced model', data=params)

insert_data('model_coeffs', 'Reduced model', data=coeffs)

table_name = 'model_results'
model_name = 'Reduced model'
cursor.execute('''INSERT INTO ''' + table_name + \
                               ''' (''' + tables[table_name] + ''') ''' + \
                               ''' VALUES (?, ?, ?, ?)''',
                               (models[model_name], model_name, train_score, test_score))
db.commit() # persist the inserted rows to disk

# display each table
for table_name, table_fields in tables.items():
    print(table_name)
    query = f'SELECT * FROM {table_name}'
    # split on ', ' so the column names carry no trailing commas
    display(viz_tables(table_fields.split(', '), query))
    print()

model_params

|   | id | desc | param_name | value |
|---|---|---|---|---|
| 0 | 1 | Baseline model | C | 1 |
| 1 | 1 | Baseline model | class_weight | |
| 2 | 1 | Baseline model | dual | 0 |
| 3 | 1 | Baseline model | fit_intercept | 1 |
| 4 | 1 | Baseline model | intercept_scaling | 1 |
| 5 | 1 | Baseline model | max_iter | 100 |
| 6 | 1 | Baseline model | multi_class | warn |
| 7 | 1 | Baseline model | n_jobs | |
| 8 | 1 | Baseline model | penalty | l2 |
| 9 | 1 | Baseline model | random_state | |

model_coeffs

|   | id | desc | feature_name | value |
|---|---|---|---|---|
| 0 | 1 | Baseline model | mean radius | 2.143352 |
| 1 | 1 | Baseline model | mean texture | 0.073687 |
| 2 | 1 | Baseline model | mean perimeter | -0.148922 |
| 3 | 1 | Baseline model | mean area | 0.01565 |
| 4 | 1 | Baseline model | mean smoothness | -0.104633 |
| 5 | 1 | Baseline model | mean compactness | -0.407477 |
| 6 | 1 | Baseline model | mean concavity | -0.594942 |
| 7 | 1 | Baseline model | mean concave points | -0.263499 |
| 8 | 1 | Baseline model | mean symmetry | -0.155283 |
| 9 | 1 | Baseline model | mean fractal dimension | -0.028102 |

model_results

|   | id | desc | train_score | test_score |
|---|---|---|---|---|
| 0 | 1 | Baseline model | 0.96044 | 0.938596 |
| 1 | 2 | Reduced model | 0.945055 | 0.885965 |





<a id='p2.3'></a>
## Problem 4 (10 points): 
Create one last model using an **l1-penalty** ($L_{1}$) term and **all** the features. Insert the relevant arrays into the corresponding tables in the database.

**Hint:** Refer to the `penalty` parameter of the `LogisticRegression` class.

#### Suggestions
Call this model `"L1 penalty model"`.

In [158]:
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=87)

# Fit model
clf = LogisticRegression(solver='liblinear', penalty='l1') # avoid FutureWarning
clf.fit(X_train, y_train)

# add model to model dictionary 
models['L1 penalty model'] = 3

# model_params data
params = clf.get_params() # data for model_params table

# model_coeffs data
coeffs = dict(zip(data['feature_names'], clf.coef_[0])) 
coeffs['intercept'] = clf.intercept_[0] # adding intercept to dict

# model_results data
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
results = {'train':train_score, 'test':test_score} # model_results

# insert data into relevant tables
insert_data('model_params', 'L1 penalty model', data=params)

insert_data('model_coeffs', 'L1 penalty model', data=coeffs)

table_name = 'model_results'
model_name = 'L1 penalty model'
cursor.execute('''INSERT INTO ''' + table_name + \
                               ''' (''' + tables[table_name] + ''') ''' + \
                               ''' VALUES (?, ?, ?, ?)''',
                               (models[model_name], model_name, train_score, test_score))
db.commit() # persist the inserted rows to disk

# display each table
for table_name, table_fields in tables.items():
    print(table_name)
    query = f'SELECT * FROM {table_name}'
    # split on ', ' so the column names carry no trailing commas
    display(viz_tables(table_fields.split(', '), query))
    print()

model_params

|   | id | desc | param_name | value |
|---|---|---|---|---|
| 0 | 1 | Baseline model | C | 1 |
| 1 | 1 | Baseline model | class_weight | |
| 2 | 1 | Baseline model | dual | 0 |
| 3 | 1 | Baseline model | fit_intercept | 1 |
| 4 | 1 | Baseline model | intercept_scaling | 1 |
| 5 | 1 | Baseline model | max_iter | 100 |
| 6 | 1 | Baseline model | multi_class | warn |
| 7 | 1 | Baseline model | n_jobs | |
| 8 | 1 | Baseline model | penalty | l2 |
| 9 | 1 | Baseline model | random_state | |

model_coeffs

|   | id | desc | feature_name | value |
|---|---|---|---|---|
| 0 | 1 | Baseline model | mean radius | 2.143352 |
| 1 | 1 | Baseline model | mean texture | 0.073687 |
| 2 | 1 | Baseline model | mean perimeter | -0.148922 |
| 3 | 1 | Baseline model | mean area | 0.015650 |
| 4 | 1 | Baseline model | mean smoothness | -0.104633 |
| 5 | 1 | Baseline model | mean compactness | -0.407477 |
| 6 | 1 | Baseline model | mean concavity | -0.594942 |
| 7 | 1 | Baseline model | mean concave points | -0.263499 |
| 8 | 1 | Baseline model | mean symmetry | -0.155283 |
| 9 | 1 | Baseline model | mean fractal dimension | -0.028102 |

model_results

|   | id | desc | train_score | test_score |
|---|---|---|---|---|
| 0 | 1 | Baseline model | 0.96044 | 0.938596 |
| 1 | 2 | Reduced model | 0.945055 | 0.885965 |
| 2 | 3 | L1 penalty model | 0.967033 | 0.95614 |





<a id='part_3'></a>

# Part 3:  Queries

<a id='p3.1'></a>
## Problem 5 (10 points): 
Query the database to identify the model with the highest validation score.
* Print the id of the best model and the corresponding validation score.
  ```bash
  Best model id: 
  Best validation score:
  ```
* Print the feature names and corresponding coefficients of that model.
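
The two steps above can be sketched with one query per step. The snippet below is an illustration only: it runs on an in-memory database filled with made-up rows (`model A`, `model B`, and their values are invented), not on the results stored in `regression.sqlite`, and the names `demo_db`, `best_id`, `best_score`, and `best_coeffs` are hypothetical.

```python
import sqlite3

# In-memory database with made-up rows, for illustration only.
demo_db = sqlite3.connect(":memory:")
demo_db.execute('CREATE TABLE model_results (id, "desc", train_score, test_score)')
demo_db.execute('CREATE TABLE model_coeffs (id, "desc", feature_name, value)')
demo_db.executemany(
    "INSERT INTO model_results VALUES (?, ?, ?, ?)",
    [(1, "model A", 0.90, 0.80), (2, "model B", 0.95, 0.85)],
)
demo_db.executemany(
    "INSERT INTO model_coeffs VALUES (?, ?, ?, ?)",
    [
        (1, "model A", "x1", 0.5),
        (2, "model B", "x1", 1.5),
        (2, "model B", "intercept", -0.25),
    ],
)

# Step 1: the row with the highest validation (test) score.
best_id, best_score = demo_db.execute(
    "SELECT id, test_score FROM model_results ORDER BY test_score DESC LIMIT 1"
).fetchone()
print(f"Best model id: {best_id}")
print(f"Best validation score: {best_score}")

# Step 2: the coefficients stored under that id.
best_coeffs = demo_db.execute(
    "SELECT feature_name, value FROM model_coeffs WHERE id = ?", (best_id,)
).fetchall()
for feature_name, value in best_coeffs:
    print(f"{feature_name}: {value}")
```

Sorting by `test_score` with `LIMIT 1` returns a single row even when two models tie; a `MAX(test_score)` subquery would be one alternative if all tied models are wanted.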

<a id='p3.2'></a>
## Problem 6 (10 points): 

Use the coefficients extracted in the previous question to reproduce the validation score (accuracy) of the best performing model (as stored in the database).

**Hint:** You should be able to achieve this by overwriting the relevant variables in the Logistic Regression object, i.e. there is no need to write your own formula to generate individual predictions (you are welcome to do this if you want).

#### Remarks
This problem demonstrates a simple scenario in which someone with access to your database can easily reproduce your results.
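
A minimal sketch of the approach described in the hint follows. It assumes that setting `coef_`, `intercept_`, and `classes_` on an unfitted `LogisticRegression` object is enough for `score` to work, which holds for the scikit-learn versions this was written against but is not a documented guarantee. The in-memory database and the names `demo_db`, `fitted`, and `rebuilt` are hypothetical stand-ins for `regression.sqlite` and the models above.

```python
import sqlite3

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fit a reference model and record its validation accuracy.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=87
)
fitted = LogisticRegression(solver="liblinear").fit(X_train, y_train)
stored_score = fitted.score(X_test, y_test)

# Store the coefficients and the intercept, one row per name.
demo_db = sqlite3.connect(":memory:")
demo_db.execute('CREATE TABLE model_coeffs (id, "desc", feature_name, value)')
rows = [
    (1, "Demo model", str(name), float(coef))
    for name, coef in zip(data.feature_names, fitted.coef_[0])
]
rows.append((1, "Demo model", "intercept", float(fitted.intercept_[0])))
demo_db.executemany("INSERT INTO model_coeffs VALUES (?, ?, ?, ?)", rows)

# Read the values back as a {feature_name: value} dict.
loaded = dict(
    demo_db.execute("SELECT feature_name, value FROM model_coeffs WHERE id = 1")
)
intercept = loaded.pop("intercept")

# Overwrite the learned attributes on a fresh, unfitted object.
rebuilt = LogisticRegression(solver="liblinear")
rebuilt.coef_ = np.array([[loaded[str(name)] for name in data.feature_names]])
rebuilt.intercept_ = np.array([intercept])
rebuilt.classes_ = np.array([0, 1])  # label order used when predicting

reproduced_score = rebuilt.score(X_test, y_test)
print(stored_score, reproduced_score)
```

The coefficients are reordered by `data.feature_names` when rebuilding `coef_`, because the column order of the test matrix, not the row order of the query, determines which coefficient multiplies which feature.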