<a href="https://colab.research.google.com/github/kevinajordan/DS-Training/blob/master/03_Data_Prep_XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preparation Continued

Last lesson we went over data pre-processing. We discovered the importance of scaling and normalizing your data. We also discussed handling missing data and incorrect values. The pre-processing steps you take depend on the model you plan to run.

In this lesson we will continue with data preparation, tied to a specific example: gradient boosting. You will learn about handling categorical data and one-hot encoding. You will also learn how to automatically handle missing data with XGBoost.

Dataset for Today: Iris Flower
http://archive.ics.uci.edu/ml/datasets/Iris

https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html



In [0]:
# Clone DS-Training repo for datasets and skeleton code
!git clone https://github.com/kevinajordan/DS-Training.git

Cloning into 'DS-Training'...
remote: Enumerating objects: 55, done.

In [0]:
# Set our working directory to the dataset folder
import os
os.chdir('DS-Training/datasets')

In [0]:
# import the Iris dataset
import pandas as pd
df = pd.read_csv('iris.csv')

Perform some initial EDA on the Iris dataset below.

Get the following information:


*   First 5 lines of our data
*   Dataset information
*   Descriptive statistics
*   Shape of our data
*   Number of samples in each target class




In [0]:
# Look at the first five lines of our data
df.head(n=5)

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
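
The first data row was promoted to column names in the output above because `read_csv` assumes the first line of a file is a header. A minimal sketch of the difference, using a two-row in-memory CSV in place of `iris.csv` (the sample text and the column names are illustrative assumptions):

```python
import io
import pandas as pd

# Two rows in the same layout as iris.csv, with no header line (illustrative sample)
csv_text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# Default behaviour: the first line is consumed as the header
inferred = pd.read_csv(io.StringIO(csv_text))

# Passing names= tells pandas there is no header, so every line is data
cols = ['sep_l', 'sep_w', 'pet_l', 'pet_w', 'species']
named = pd.read_csv(io.StringIO(csv_text), names=cols)

print(inferred.shape)  # one data row is lost to the header
print(named.shape)
```

Comparing the two shapes is a quick way to spot a file whose first row was silently turned into a header.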


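The file has no header row, so pandas used the first data row as the column names. The four feature columns are:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm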

In [0]:
# column order in the file: sepal length, sepal width, petal length, petal width, species
names = ['sep_l','sep_w','pet_l','pet_w','species']
df = pd.read_csv('iris.csv', names=names)
df.head()

Unnamed: 0,sep_l,sep_w,pet_l,pet_w,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [0]:
# What's the shape?
df.shape

(150, 5)

In [0]:
# Get the statistics
df.describe()

Unnamed: 0,sep_l,sep_w,pet_l,pet_w
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [0]:
# How many of each target label class?
df.groupby('species').size()

species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64

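## Label Encoding

XGBoost treats every problem as a regression-style predictive modeling problem, so it accepts only numerical values as input. String class labels such as `Iris-setosa` must therefore be converted to numbers before training.

Label encoding is one way to do this: each distinct class is mapped to an integer.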



In [0]:
# What function in sklearn.preprocessing does label encoding?
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

In [0]:
# split data into x (features) and y (target variable)
data = df.values
x = data[:, 0:4]
y = data[:,4]
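
Because the species column holds strings, `df.values` returns a single array of dtype `object`, and the feature slice inherits that dtype. A hedged sketch of one alternative, selecting columns by name and converting the features to floats (the three-row frame and its column names are stand-ins for the real data):

```python
import pandas as pd

# Small stand-in for the Iris frame (values are illustrative)
demo = pd.DataFrame({
    'sep_l': [5.1, 7.0, 6.3],
    'sep_w': [3.5, 3.2, 3.3],
    'species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

# Mixed numeric and string columns collapse to a generic object array
mixed = demo.values
print(mixed.dtype)

# Selecting the numeric columns by name keeps a proper float array
features = demo[['sep_l', 'sep_w']].to_numpy(dtype=float)
target = demo['species'].to_numpy()
print(features.dtype, features.shape)
```

Selecting by name also avoids depending on the position of each column in the file.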

In [0]:
print(y)

['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor

In [0]:
# instantiate the label encoder function
le = LabelEncoder()

# fit the label encoder to your data
le = le.fit(y)

# transform your data with the label encoder
le_y = le.transform(y)

# look at your encoded target classes now
print(le_y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
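
The integers above follow the alphabetical order of the class names. As a small check, a fitted encoder keeps that mapping in `classes_` and can reverse it with `inverse_transform` (the short label list below is an illustrative sample, not the full dataset):

```python
from sklearn.preprocessing import LabelEncoder

# A few labels in arbitrary order (illustrative sample)
labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ lists the original labels; position i is the label encoded as i
print(list(encoder.classes_))
print(codes)

# inverse_transform maps integer codes back to the original strings
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

Keeping the fitted encoder around makes it possible to report model predictions as species names rather than integers.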


What's the difference between ordinal and nominal data?
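
As a hint for the question above: nominal categories have no inherent order (species names), while ordinal categories do (for example, sizes). A rough sketch with pandas, where the size values and their ordering are hypothetical examples and not part of either dataset:

```python
import pandas as pd

# Nominal: categories are just names, so the integer codes carry no meaning
species = pd.Categorical(['setosa', 'virginica', 'setosa'])
print(species.ordered)

# Ordinal: declare the order explicitly so codes and comparisons reflect it
sizes = pd.Categorical(
    ['medium', 'small', 'large'],
    categories=['small', 'medium', 'large'],  # hypothetical ordering
    ordered=True,
)
print(list(sizes.codes))
print(sizes.min(), sizes.max())
```

Integer codes are only meaningful as magnitudes in the ordinal case, which is why nominal input features are usually one-hot encoded instead.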

## Running XGBoost

We will cover XGBoost and other classifiers in more depth in the next lesson. For now we will just run through a basic example.

In [0]:
# Import Software dependencies
import xgboost
from sklearn import model_selection
from sklearn.metrics import accuracy_score

In [0]:
# split data into training and testing
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, le_y, test_size=0.33, random_state=0)

# fit model on training data
model = xgboost.XGBClassifier()
model.fit(x_train, y_train)
print(model)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)


In [0]:
# make predictions for test data
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 96.00%
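
For reference, accuracy is the fraction of test samples whose predicted class equals the true class. With 150 rows and `test_size=0.33`, scikit-learn holds out 50 rows, so 96% corresponds to 48 correct predictions. A minimal sketch with made-up labels (`y_true` and `y_hat` are illustrative, not model output):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Size of the hold-out set for 150 rows with test_size=0.33
train_idx, test_idx = train_test_split(np.arange(150), test_size=0.33, random_state=0)
print(len(train_idx), len(test_idx))

# Made-up true and predicted labels: 4 of the 5 predictions match
y_true = np.array([0, 1, 2, 2, 1])
y_hat = np.array([0, 1, 2, 1, 1])

manual = np.mean(y_true == y_hat)  # fraction of matching positions
print(manual, accuracy_score(y_true, y_hat))
```

With a test set this small, each misclassified sample moves the reported accuracy by two percentage points.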


## One-Hot Encoding

For datasets whose input features are categorical, like the breast cancer dataset, label encoding the inputs would impose an artificial order on the categories, so we one-hot encode them instead.

Dataset: breast-cancer

http://archive.ics.uci.edu/ml/datasets/Breast+Cancer

In [0]:
# binary classification, breast cancer dataset, label and one hot encoded
import numpy as np
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# What sklearn.preprocessing class performs one-hot encoding? Import the one-hot encoding and label encoding classes
from sklearn.preprocessing import _______
from sklearn.preprocessing import _______


In [0]:
# load data



### Perform some quick EDA on the breast cancer dataset. Get the same information as above.

In [0]:
# look at the first few lines of the dataset
df.head(n=5)

In [0]:
df.info()

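We need to perform one-hot encoding: each category of a feature becomes its own binary column, and exactly one of those columns is 1 in each row.

**Example:**

| left-up | left-low | right-up | right-low | central |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 |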
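
The example above can be reproduced with a short sketch. This one uses `pandas.get_dummies` rather than the scikit-learn encoder used in the exercise, and the five quadrant values are typed in by hand for illustration:

```python
import pandas as pd

# One categorical feature with the five values from the example (typed by hand)
order = ['left-up', 'left-low', 'right-up', 'right-low', 'central']
quadrant = pd.Series(order)

# One binary column per distinct value; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(quadrant, dtype=int)

# Reorder the columns to match the example
one_hot = one_hot[order]
print(one_hot)
```

Every row contains a single 1, in the column named after that row's category.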

In [0]:
# split data into x and y
data = df.values
x = data[____]
x = x.astype(str)
y = data[____]


In [0]:
# encode string input values as integers, then one-hot encode each column
encoded_x = None
for i in range(0, x.shape[1]):
    # instantiate label encoder
    le = _____()

    # fit and transform the data with the label encoder
    feature = le.fit_transform(x[:, i])
    feature = feature.reshape(x.shape[0], 1)

    # instantiate the one-hot encoder
    # (in scikit-learn 1.2 and later this argument is named sparse_output)
    onehot_encoder = ________(sparse=False)
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = np.concatenate((encoded_x, feature), axis=1)
print("x shape:", encoded_x.shape)
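
The width of `encoded_x` is the total number of distinct values across all input columns, since each column contributes one binary column per category. A minimal NumPy-only sketch of the same loop, using a made-up two-column string array instead of the breast cancer data:

```python
import numpy as np

# Made-up categorical inputs: column 0 has 3 distinct values, column 1 has 2
raw = np.array([
    ['30-39', 'yes'],
    ['40-49', 'no'],
    ['50-59', 'no'],
    ['40-49', 'yes'],
])

blocks = []
for i in range(raw.shape[1]):
    # Integer-encode the column: codes index into the sorted unique values
    categories, codes = np.unique(raw[:, i], return_inverse=True)
    # Row k of the identity matrix is the one-hot vector for code k
    blocks.append(np.eye(len(categories), dtype=int)[codes])

# Place the per-column blocks side by side
encoded = np.concatenate(blocks, axis=1)
print(encoded.shape)  # 3 + 2 = 5 columns
```

Each row of the result contains exactly one 1 per original column, which is a handy sanity check on the real `encoded_x` as well.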

In [0]:
# encode string class values as integers
# instantiate label encoder for your target variable
label_encoder = _____()

# fit and transform your data with the label encoder
label_encoder = label_encoder.fit(y)
label_encoded_y = label_encoder.transform(y)


Running through XGBClassifier

In [0]:
# split data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(encoded_x, label_encoded_y, test_size=0.3, random_state=0)
# fit model on training data
model = XGBClassifier()
model.fit(x_train, y_train)
print(model)
# make predictions for test data
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))