# Data Programming in Python | BAIS:6040

# Homework 8. Pipelines

## Instructions

To complete the homework, fill in the commands needed to finish all of the exercises below. Program everything inside this notebook. If an exercise asks you to store information in a specific variable, use that exact variable name (names are case sensitive).

## Questions

Suppose you have a dataframe <i>dfdi</i> with the features of diamonds.

In [1]:
import pandas as pd
import numpy as np
from seaborn import load_dataset
dfdi = load_dataset("diamonds")
np.random.seed(0)
for _ in range(20):
    r = np.random.randint(len(dfdi))
    c = np.random.randint(low=1, high=len(dfdi.columns))
    dfdi.iloc[r, c] = np.nan
dfdi.head()

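|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|-------|-----|-------|---------|-------|-------|-------|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326.0 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326.0 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327.0 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334.0 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335.0 | 4.34 | 4.35 | 2.75 |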
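The setup cell above injects missing values at random positions. A minimal sketch of the same loop on a toy frame (the frame `toy` and its column names are hypothetical stand-ins for `dfdi`) shows how to count the resulting NaNs per column. Because the column position is drawn with `low=1`, position 0 is never selected.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfdi: 50 rows, 4 numeric columns.
toy = pd.DataFrame(np.arange(200, dtype=float).reshape(50, 4),
                   columns=["a", "b", "c", "d"])

np.random.seed(0)
for _ in range(20):
    r = np.random.randint(len(toy))
    # low=1 excludes column position 0, so column "a" is never touched.
    c = np.random.randint(low=1, high=len(toy.columns))
    toy.iloc[r, c] = np.nan

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The total can be below 20 if the same cell is drawn more than once.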


For the questions below, the goal is to build a <b>logistic regression</b> model on the diamond dataset that predicts whether the price of a new diamond will be greater than the median price in the existing dataset, based on its carat, cut, color, clarity, depth, table, x, y, and z. This is a __classification__ problem solved with __logistic regression__, not a __regression__ problem.

1\. Build a list called categorical_cols to indicate you intend to use <i>cut</i>, <i>color</i>, and <i>clarity</i> as categorical variables. (5 pts)

In [2]:
# Your answer here
categorical_cols = ["cut", "color", "clarity"]

2\. Build a list called continuous_cols to indicate you intend to use the remaining 6 features as numeric variables. Then concatenate the two lists into a list called predictor_cols. (10 pts)

In [3]:
# Your answer here
continuous_cols = ["carat", "depth", "table", "x", "y", "z"]
predictor_cols = categorical_cols + continuous_cols

3\. Create, and add to the dataframe, the new boolean target variable called price_median_greater. Assign the price_median_greater target variable name to a variable called target_col. (15 pts)

In [4]:
# Your answer here
# Rows whose price is missing (NaN) compare as False here.
dfdi["price_median_greater"] = dfdi["price"] > dfdi["price"].median()
target_col = "price_median_greater"
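Since the setup cell may have placed NaN values in the price column, it helps to see how a median comparison treats them. The sketch below uses a small hypothetical frame (`toy` and the column `above_median` are illustrative names, not part of the assignment): `median()` skips NaN by default, and any comparison against NaN evaluates to False.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, one of them missing.
toy = pd.DataFrame({"price": [100.0, 200.0, np.nan, 400.0]})

# median() ignores NaN by default, so this is the median of 100, 200, 400.
median_price = toy["price"].median()

# NaN > anything is False, so the row with the missing price is labeled False.
toy["above_median"] = toy["price"] > median_price
print(toy)
```

If rows with a missing price should not be labeled at all, one option is to drop them first with `dropna(subset=["price"])`; the homework as written does not ask for this.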

4\. Split <i>X</i> (the predictor columns) and <i>y</i> (the target column) into two training datasets <i>X_train</i> and <i>y_train</i> and two test datasets <i>X_test</i> and <i>y_test</i>. Set the `test_size` to 0.25 and `random_state` to 0. (10 pts)

In [5]:
# Your answer here
from sklearn.model_selection import train_test_split  

X = dfdi[predictor_cols]
y = dfdi[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
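As a quick illustration of what `test_size=0.25` and `random_state=0` do, the sketch below splits a small hypothetical array (`X_toy`, `y_toy` are made-up data) and repeats the split to show that a fixed `random_state` gives the same partition each time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 observations, 2 features.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# 25% of 8 rows go to the test set, the remaining 6 to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

# Same random_state, same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```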

5\. Build a pipeline for the necessary preprocessing and the model. Use the median to replace missing values of numeric variables and the most frequent value to replace missing values for categorical variables.  The numeric variables should be scaled with StandardScaler and the categorical variables should be One Hot Encoded (without dropping any columns). The handle_unknown parameter of OneHotEncoder should be set to 'ignore'. You can use the default parameter values for Logistic Regression. (25 pts)

In [6]:
# Your answer here
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression  

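# Numeric columns: fill missing values with the median, then standardize.
num_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: fill missing values with the most frequent value, then one-hot encode.
# Note: scikit-learn 1.2+ names this argument sparse_output; sparse was removed in 1.4.
cat_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('enc', OneHotEncoder(sparse=False, handle_unknown='ignore', dtype=np.int32)),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, continuous_cols),
        ('cat', cat_transformer, categorical_cols),
    ],
    remainder='passthrough',
)

pipe_logistic = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('rgr', LogisticRegression()),
])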

pipe_logistic.steps

[('preprocess',
  ColumnTransformer(remainder='passthrough',
                    transformers=[('num',
                                   Pipeline(steps=[('impute',
                                                    SimpleImputer(strategy='median')),
                                                   ('scale', StandardScaler())]),
                                   ['carat', 'depth', 'table', 'x', 'y', 'z']),
                                  ('cat',
                                   Pipeline(steps=[('impute',
                                                    SimpleImputer(strategy='most_frequent')),
                                                   ('enc',
                                                    OneHotEncoder(dtype=<class 'numpy.int32'>,
                                                                  handle_unknown='ignore',
                                                                  sparse=False))]),
                                   ['cut', 'color', 'clarit
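The effect of `handle_unknown='ignore'` can be sketched in isolation. In the toy example below (the category values are chosen only for illustration), the encoder is fitted on two categories and then asked to transform a category it has never seen; instead of raising an error, it returns a row of zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories; the encoder sorts them: ['Good', 'Ideal'].
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["Good"], ["Ideal"]]))

# A known category gets a single 1 in its column.
known = enc.transform(np.array([["Ideal"]])).toarray()

# An unseen category is encoded as all zeros rather than raising an error.
unknown = enc.transform(np.array([["Fair"]])).toarray()

print(known)
print(unknown)
```

This matters after a train/test split, where a rare category can end up only in the test set.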

6\. Fit the pipeline to the training dataset. (5 pts)

In [7]:
# Your answer here
pipe_logistic.fit(X_train, y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['carat', 'depth', 'table',
                                                   'x', 'y', 'z']),
                                                 ('cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('enc',
                                              

7\. Get the accuracy score against the training and test datasets and store in the variables train_acc and test_acc respectively. (10 pts)

In [8]:
# Your answer here
train_acc = pipe_logistic.score(X_train, y_train)
test_acc = pipe_logistic.score(X_test, y_test)

In [9]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)

print("Training accuracy ", train_acc)
print("Test accuracy ", test_acc)

Training accuracy  0.9792361883574342
Test accuracy  0.9789395624768261
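For a pipeline that ends in a classifier, `score` reports mean accuracy, that is, the fraction of predictions that match the true labels. The sketch below checks this on hypothetical synthetic data (`X_toy`, `y_toy`, and the step names are illustrative) by comparing `score` with `accuracy_score` and with a direct mean.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the label is True when the two features sum to a positive value.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = X_toy[:, 0] + X_toy[:, 1] > 0

pipe = Pipeline(steps=[("scale", StandardScaler()),
                       ("clf", LogisticRegression())])
pipe.fit(X_toy, y_toy)

# Three ways to compute the same quantity.
score_acc = pipe.score(X_toy, y_toy)
manual_acc = accuracy_score(y_toy, pipe.predict(X_toy))
mean_acc = np.mean(pipe.predict(X_toy) == y_toy)

print(score_acc, manual_acc, mean_acc)
```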


8\. Using the first observation in the test dataset, store the answer to whether the price would be predicted to be above the median in a variable called predict_label. (10 pts)

In [10]:
# Your answer here
predict_label = pipe_logistic.predict(X_test.head(1))

In [11]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)

predict_label

array([ True])

9\. Using the last observation in the test dataset, store the predicted probability the price would be above the median in a variable called predict_prob. (10 pts)

In [12]:
# Your answer here
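# predict() returns the class label; predict_proba() returns one column per class
# (ordered as in classes_), so column 1 is the probability of True.
predict_prob = pipe_logistic.predict_proba(X_test.tail(1))[:, 1]

In [13]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)

predict_prob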
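The difference between a predicted label and a predicted probability can be sketched on toy data. In the example below (hypothetical data and variable names), `predict` returns a class label, while `predict_proba` returns one probability per class, ordered as in `classes_`; the two probabilities in a row sum to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: the label is True when the single feature is positive.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = X_toy[:, 0] > 0

clf = LogisticRegression().fit(X_toy, y_toy)

row = X_toy[-1:]                  # last observation, kept two-dimensional
label = clf.predict(row)          # class label, e.g. array([ True]) or array([False])
proba = clf.predict_proba(row)    # shape (1, 2): one column per class in clf.classes_

# Pick the column that corresponds to the True class.
p_true = proba[0, list(clf.classes_).index(True)]

print(clf.classes_, label, proba, p_true)
```

With the default settings, the predicted label is True when the probability of the True class exceeds 0.5.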