# Template

## Attempt 1

In [1]:
from sklearn.datasets import load_iris

iris = load_iris()

numSamples, numFeatures = iris.data.shape
print(numSamples)
print(numFeatures)
print(list(iris.target_names))

150
4
['setosa', 'versicolor', 'virginica']


Let's reserve 20% of our data for testing the model and use the remaining 80% to train it. By withholding the test data, we can make sure we're evaluating the model on flowers it hasn't seen before. Typically we refer to our features (in this case, the sepal and petal lengths and widths) as X, and the labels (in this case, the species) as y.

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
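
A quick way to confirm the split is to compare array shapes rather than read through the raw values. The sketch below repeats the same split so it can run on its own; with 150 samples and `test_size=0.2`, we would expect 120 training rows and 30 test rows.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Same split as the cell above: 20% held out, fixed seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Features are 2-D (rows, columns); labels are 1-D
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)
```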

In [3]:
X_train, X_test, y_train, y_test

(array([[6.4, 3.1, 5.5, 1.8],
        [5.4, 3. , 4.5, 1.5],
        [5.2, 3.5, 1.5, 0.2],
        [6.1, 3. , 4.9, 1.8],
        [6.4, 2.8, 5.6, 2.2],
        [5.2, 2.7, 3.9, 1.4],
        [5.7, 3.8, 1.7, 0.3],
        [6. , 2.7, 5.1, 1.6],
        [5.9, 3. , 4.2, 1.5],
        [5.8, 2.6, 4. , 1.2],
        [6.8, 3. , 5.5, 2.1],
        [4.7, 3.2, 1.3, 0.2],
        [6.9, 3.1, 5.1, 2.3],
        [5. , 3.5, 1.6, 0.6],
        [5.4, 3.7, 1.5, 0.2],
        [5. , 2. , 3.5, 1. ],
        [6.5, 3. , 5.5, 1.8],
        [6.7, 3.3, 5.7, 2.5],
        [6. , 2.2, 5. , 1.5],
        [6.7, 2.5, 5.8, 1.8],
        [5.6, 2.5, 3.9, 1.1],
        [7.7, 3. , 6.1, 2.3],
        [6.3, 3.3, 4.7, 1.6],
        [5.5, 2.4, 3.8, 1.1],
        [6.3, 2.7, 4.9, 1.8],
        [6.3, 2.8, 5.1, 1.5],
        [4.9, 2.5, 4.5, 1.7],
        [6.3, 2.5, 5. , 1.9],
        [7. , 3.2, 4.7, 1.4],
        [6.5, 3. , 5.2, 2. ],
        [6. , 3.4, 4.5, 1.6],
        [4.8, 3.1, 1.6, 0.2],
        [5.8, 2.7, 5.1, 1.9],
        [5

Now we'll load up XGBoost and convert our data into the DMatrix format it expects: one DMatrix for the training data, and one for the test data.

In [3]:
import xgboost as xgb

train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

Now we'll define our hyperparameters. We're choosing the multi:softmax objective since this is a multi-class classification problem; the other parameters should ideally be tuned through experimentation.

In [4]:
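param = {
    'max_depth': 4,                # maximum depth of each tree
    'eta': 0.3,                    # learning rate
    'objective': 'multi:softmax',  # return the most likely class directly
    'num_class': 3,                # three Iris species
}
epochs = 10  # number of boosting rounds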
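
To make the softmax choice a little more concrete: the model produces one raw score per class, softmax turns those scores into probabilities, and `multi:softmax` reports the class with the highest one (`multi:softprob` would return the probabilities themselves). Here is a minimal sketch of that idea in plain Python. The raw scores are made up for illustration and do not come from the trained model.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() from overflowing on large scores
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one flower, one score per class
raw_scores = [0.5, 2.0, -1.0]

probabilities = softmax(raw_scores)
predicted_class = probabilities.index(max(probabilities))

print(probabilities)
print(predicted_class)  # 1
```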

Let's go ahead and train our model using these parameters as a first guess.

In [5]:
model = xgb.train(param, train, epochs)

Now we'll use the trained model to predict classifications for the data we set aside for testing. Each classification number we get back corresponds to a specific species of Iris.

In [6]:
predictions = model.predict(test)

In [7]:
print(predictions)

[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 1. 0. 1. 1. 0. 0. 2. 1. 0. 0.
 2. 0. 0. 1. 1. 0.]
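
Those numbers are easier to read as species names. A small sketch, using the names printed at the top of this notebook and the first five predictions from the output above; note that the predictions come back as floats, so they need converting to integers before they can be used as list indexes.

```python
# Species names in label order, as printed from iris.target_names earlier
target_names = ['setosa', 'versicolor', 'virginica']

# First five predictions copied from the output above
sample_predictions = [2.0, 1.0, 0.0, 2.0, 0.0]

# Convert each float label to an int and look up its name
species = [target_names[int(p)] for p in sample_predictions]
print(species)  # ['virginica', 'versicolor', 'setosa', 'virginica', 'setosa']
```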


Let's measure the accuracy on the test data...

In [8]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)

1.0
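
For a classification problem like this one, accuracy is simply the fraction of predictions that match the true labels. The sketch below shows that calculation by hand on made-up labels (not our Iris results). Keep in mind that with only 30 test flowers, each mistake would cost about 3.3 percentage points.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels: four of the five predictions are correct
y_true = [0, 1, 2, 2, 1]
y_pred = [0.0, 1.0, 2.0, 1.0, 1.0]

print(accuracy(y_true, y_pred))  # 0.8
```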

Holy crow! It's perfect, and that's just with us guessing as to the best hyperparameters!

Normally I'd have you experiment to find better hyperparameters as an activity, but you can't improve on those results. Instead, see what it takes to make the results worse! How few epochs (iterations) can you get away with? How low can you set max_depth? Basically, try to make the model as simple as possible while keeping the perfect accuracy you already have.