### Module 13-Lab 1: Practice Pipelines

Use a pipeline to scale the data and fit a different algorithm: XGBoost (XGB).

**Data:**
    
The data used in this problem is a simplified gene-expression dataset for predicting cancer in people. It is based on this dataset:
- http://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq

In this data, there are 6 genes, each represented by a floating-point number. The target value is 'cancer_detected', where 0 = false and 1 = true.

The goal is to use a classification algorithm to predict cancer based on the values of the genes.
    
Our method:
1. Load, isolate and split the data
2. Define the steps in the pipeline
3. Create the pipeline
4. Use the pipeline to scale and train your model
5. Evaluate the result
6. Predict on new, unseen data

In [None]:
# We need to install XGBoost every time we restart our instance
%pip install xgboost

In [None]:
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
import boto3
import pandas as pd
import numpy as np

### 1. Load, isolate and split the data

In [None]:
# Load df from S3 .csv
sess = boto3.session.Session()
s3 = sess.client('s3') 
source_bucket = 'machinelearning-read-only'
source_key = 'data/gene-cancer-small.csv'
response = s3.get_object(Bucket=source_bucket, Key=source_key)
df = pd.read_csv(response.get("Body"))
df.head(5)

In [None]:
# Notice the scales of the features
df.describe()
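
To see why the feature scales matter, here is a minimal sketch of what the two scalers imported above do. The array `X_demo` is made up for illustration and is not the lab's gene data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales (NOT the lab's gene data)
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# Standardization: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X_demo)

# Normalization: each column is rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_demo)

print('standardized mean/std:', X_std.mean(axis=0), X_std.std(axis=0))
print('normalized min/max:   ', X_minmax.min(axis=0), X_minmax.max(axis=0))
```

Either scaler puts the columns on a comparable scale; which one suits the gene data is a choice left to you.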

In [None]:
# Features
X = df.drop(columns=['cancer_detected'])
# Target
y = df['cancer_detected']
# Split into train/test
# Reserve 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Verify the sizes of the split datasets
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

### 2. Define the steps in the pipeline

In [None]:
# Pipelines consist of sequential steps. They are technically a 'list of tuples',
# where each tuple is (name, transformer or estimator).
#
#    https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
#
# Step 1: scale the data using a normalization or standardization scaler
# Step 2: fit an XGBoost model
#

scaler = "what scaler will you use?"
xgbc = "how do you create an xgboost classifier model?" # We did this in a previous module

steps = [('Scaler', scaler), ('XGB', xgbc)]
steps
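
One possible way to fill in the two placeholders is sketched below. It assumes `StandardScaler` as the scaler (`MinMaxScaler` would work the same way) and default `XGBClassifier` settings. The `Booster` alias and the fallback to scikit-learn's `GradientBoostingClassifier` are additions for this sketch only, so that it still runs where xgboost is not installed.

```python
from sklearn.preprocessing import StandardScaler

try:
    # The lab's intended model
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Assumption for this sketch: fall back to scikit-learn if xgboost is absent
    from sklearn.ensemble import GradientBoostingClassifier as Booster

scaler = StandardScaler()        # one possible choice of scaler
xgbc = Booster(random_state=42)  # default hyperparameters, fixed seed

# Each step is a (name, object) tuple; order matters
steps = [('Scaler', scaler), ('XGB', xgbc)]
print([name for name, _ in steps])
```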

### 3. Create the pipeline

In [None]:
# your code here
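
A minimal sketch of this step, assuming the `steps` list has the shape shown in section 2. The steps are rebuilt here so the block runs on its own; the `Booster` alias and its scikit-learn fallback are assumptions of the sketch, not part of the lab.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Assumption for this sketch: fall back to scikit-learn if xgboost is absent
    from sklearn.ensemble import GradientBoostingClassifier as Booster

steps = [('Scaler', StandardScaler()), ('XGB', Booster(random_state=42))]

# Pipeline takes the list of (name, object) tuples directly
pipeline = Pipeline(steps)

# Individual steps can be looked up by name
print(list(pipeline.named_steps))
```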

### 4. Use the pipeline to scale and train your model
Hint: use the pipeline's `fit()` method.

In [None]:
# your code here
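
The sketch below shows what a single `fit()` call on a pipeline looks like. Because the S3 data is not available outside the lab, it uses synthetic stand-in data from `make_classification` with made-up column names; in the lab you would pass your own `X_train` and `y_train`. The `Booster` alias and its fallback are assumptions of the sketch.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Assumption for this sketch: fall back to scikit-learn if xgboost is absent
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Stand-in data: 6 synthetic features with made-up names, NOT the lab's gene data
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=42)
X = pd.DataFrame(X_arr, columns=[f'gene_{i}' for i in range(1, 7)])
y = pd.Series(y_arr, name='cancer_detected')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

pipeline = Pipeline([('Scaler', StandardScaler()),
                     ('XGB', Booster(random_state=42))])

# fit() fits the scaler on X_train, transforms X_train, then trains the model
pipeline.fit(X_train, y_train)
print('training accuracy:', pipeline.score(X_train, y_train))
```

Calling `fit()` on the pipeline means the scaler only ever learns its parameters from the training split, which avoids leaking test data into the scaling step.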

### 5. Evaluate the result
- Use your trained model to predict on the X_test dataset
- Calculate the accuracy of the model
- Print the confusion matrix

In [None]:
# Your code here
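
One way to carry out the three bullets above, sketched on synthetic stand-in data (made-up column names, not the lab's gene data). `accuracy_score` is an extra import from `sklearn.metrics`; the `Booster` alias and its fallback are assumptions of the sketch.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Assumption for this sketch: fall back to scikit-learn if xgboost is absent
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Stand-in data: 6 synthetic features with made-up names, NOT the lab's gene data
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=42)
X = pd.DataFrame(X_arr, columns=[f'gene_{i}' for i in range(1, 7)])
y = pd.Series(y_arr, name='cancer_detected')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

pipeline = Pipeline([('Scaler', StandardScaler()),
                     ('XGB', Booster(random_state=42))])
pipeline.fit(X_train, y_train)

# predict() scales X_test with the already-fitted scaler, then classifies
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print('accuracy:', accuracy)
print(cm)
```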

### 6. Predict on new, unseen data


In [None]:
# A new patient had these gene numbers
X_unseen = pd.DataFrame(data=[[2.2, 25.5, 55.55, -20.1, 355.8, 180.0]], columns=X_test.columns)
X_unseen

In [None]:
# Does the model predict that cancer is detected for this patient?
# your code here
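
A sketch of predicting on a single new row. The model here is trained on synthetic stand-in data with made-up column names, so the printed prediction says nothing about the real patient; it only shows the calls. The `Booster` alias and its fallback are assumptions of the sketch.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Assumption for this sketch: fall back to scikit-learn if xgboost is absent
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Stand-in data: 6 synthetic features with made-up names, NOT the lab's gene data
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=42)
X = pd.DataFrame(X_arr, columns=[f'gene_{i}' for i in range(1, 7)])
y = pd.Series(y_arr, name='cancer_detected')

pipeline = Pipeline([('Scaler', StandardScaler()),
                     ('XGB', Booster(random_state=42))])
pipeline.fit(X, y)

# The new row must have the same columns, in the same order, as the training data
X_unseen = pd.DataFrame(data=[[2.2, 25.5, 55.55, -20.1, 355.8, 180.0]],
                        columns=X.columns)

prediction = pipeline.predict(X_unseen)          # array with one class label
probability = pipeline.predict_proba(X_unseen)   # [[P(class 0), P(class 1)]]

print('predicted class:', prediction[0])
print('class probabilities:', probability[0])
```

The pipeline applies the fitted scaler to `X_unseen` automatically, so the new row should be passed in unscaled.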