<a href="https://colab.research.google.com/github/mfavaits/YouTube-Series-on-Machine-Learning/blob/master/GB_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The IMDB dataset comes packaged with Keras and is already tokenized, meaning the text has already been converted into sequences of word indices, where each index identifies a unique word. The dataset contains 50,000 movie reviews (25,000 for training and 25,000 for testing). Each set consists of 50% positive and 50% negative reviews (12,500 of each).

In [0]:
import numpy as np
from keras.datasets import imdb
import matplotlib.pyplot as plt

Using TensorFlow backend.


In [0]:
vocabulary=7500 # we will only use the 7500 most frequently used words

The next code block works around a bug in Keras: load_data calls np.load, which in newer NumPy releases no longer allows pickled data by default. Without the bug, this block would be a single line of code: (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=vocabulary)

I suggest simply copying the fix; it does not matter much whether you understand it.



In [0]:
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=vocabulary)

# restore np.load for future normal usage
np.load = np_load_old
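
The same workaround can be wrapped in a context manager so that np.load is restored even if load_data raises. This is only a sketch of an alternative, not part of the original notebook; the name allow_pickle_load is made up for illustration.

```python
import contextlib
import functools

import numpy as np


@contextlib.contextmanager
def allow_pickle_load():
    """Temporarily make np.load default to allow_pickle=True (hypothetical helper)."""
    original_load = np.load
    # partial only sets a default; an explicit allow_pickle argument still wins
    np.load = functools.partial(original_load, allow_pickle=True)
    try:
        yield
    finally:
        # restore np.load even if the wrapped code raises
        np.load = original_load
```

It would be used as `with allow_pickle_load():` followed by the indented imdb.load_data(num_words=vocabulary) call.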

In the next line of code we print one of the lists, which contains a sequence of words represented by their word indices. If the text had not already been converted to sequences of indices, we would need to add a pre-processing step using Tokenizer.

In [0]:
print(train_data[1]) # train_data is a list of word sequences

[1, 194, 1153, 194, 2, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 2, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 2, 2, 349, 2637, 148, 605, 2, 2, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 2, 5, 2, 656, 245, 2350, 5, 4, 2, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]
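
To read such a sequence as text, the indices can be mapped back to words. The sketch below assumes the defaults of imdb.load_data (0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words, and real words shifted by 3). The helper name decode_review, the placeholder tokens and the toy word index are made up for illustration; the real index comes from imdb.get_word_index().

```python
def decode_review(sequence, word_index, index_from=3):
    """Map a sequence of IMDB indices back to words (hypothetical helper)."""
    # word_index maps word -> rank; load_data shifts every rank by index_from
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    # reserved indices, shown with placeholder tokens chosen for this example
    reverse_index.update({0: "<pad>", 1: "<start>", 2: "<unk>"})
    return " ".join(reverse_index.get(i, "<unk>") for i in sequence)


# toy word index for illustration only
toy_word_index = {"the": 1, "and": 2, "movie": 3}
print(decode_review([1, 4, 6, 2, 5], toy_word_index))
# prints: <start> the movie <unk> and
```

This also explains the many 2s in the printed review: they stand for words outside the 7500 most frequent ones.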


Now we will vectorize the training and test data. We create a matrix in which the rows are the reviews and the columns represent the vocabulary (7500 columns). For each review, we set a 1 in every column whose word occurs in that review. This means that the matrix will be rather sparse.

In [0]:
def vectorize_sequences(sequences, dimension=vocabulary):
    results=np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence]=1
    return results
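
A small worked example may make the encoding easier to see. It repeats the function on two toy reviews with a vocabulary of 4 words; the toy data is made up for illustration.

```python
import numpy as np


def vectorize_sequences(sequences, dimension):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # fancy indexing sets every listed column of row i to 1
        results[i, sequence] = 1
    return results


toy_reviews = [[0, 2, 2], [1]]  # word 2 occurs twice in the first review
encoded = vectorize_sequences(toy_reviews, dimension=4)
print(encoded)
# [[1. 0. 1. 0.]
#  [0. 1. 0. 0.]]
```

Note that a repeated word still yields a single 1, so the encoding records presence, not counts.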

Now we apply the function to our training and test data. For the labels we use a different method: we simply use the asarray function to convert the list to an array and cast its items to float32.

In [0]:
x_train=vectorize_sequences(train_data)
x_test=vectorize_sequences(test_data)

y_train=np.asarray(train_labels).astype('float32')
y_test=np.asarray(test_labels).astype('float32')

In [0]:
pip install bayesian-optimization



In [0]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
import bayes_opt
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score

In [0]:
#model_not_tuned = XGBClassifier()
#model_not_tuned.fit(x_train, y_train)

Let us look at the score before parameter tuning, using the defaults.

In [0]:
pbounds = {'n_estimators': (50, 1000), 'eta': (0.01, 3), 'max_depth': (1,32), 'gamma':(0,5), 'min_child_weight': (2,20), 'subsample':(0.5,1), 'colsample_bytree':(0.5,1)}

In [0]:
model_tuning = XGBClassifier(n_jobs=-1)

In [0]:
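def xgboostcv(eta, n_estimators, max_depth, gamma, min_child_weight, subsample, colsample_bytree):
    # apply the sampled values to the model (the optimizer samples floats, so cast the integer parameters)
    model_tuning.set_params(learning_rate=eta, n_estimators=int(n_estimators), max_depth=int(max_depth), gamma=gamma, min_child_weight=min_child_weight, subsample=subsample, colsample_bytree=colsample_bytree)
    return np.mean(cross_val_score(model_tuning, x_train, y_train, cv=5, scoring='accuracy'))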

In [0]:
optimizer = BayesianOptimization(
    f=xgboostcv,
    pbounds=pbounds,
    random_state=1)

In [0]:
optimizer.maximize(
    init_points=2,
    n_iter=4)
print(optimizer.max)

|   iter    |  target   | colsam... |    eta    |   gamma   | max_depth | min_ch... | n_esti... | subsample |
-------------------------------------------------------------------------------------------------------------
|  1        |  0.808    |  0.7085   |  2.164    |  0.000571 |  10.37    |  4.642    |  137.7    |  0.5931   |
|  2        |  0.808    |  0.6728   |  1.196    |  2.694    |  14.0     |  14.33    |  244.2    |  0.9391   |
|  3        |  0.808    |  0.824    |  1.385    |  1.402    |  9.156    |  2.095    |  999.8    |  0.8966   |
|  4        |  0.808    |  0.51     |  1.916    |  1.044    |  31.53    |  19.96    |  998.3    |  0.5124   |
|  5        |  0.808    |  0.609
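
optimizer.max is a dictionary with a 'target' score and a 'params' dictionary of floats. Before those values can be passed to a classifier, the integer-valued parameters need to be rounded. The helper below is a minimal sketch with a made-up name; mapping eta to learning_rate is an assumption based on learning_rate being the scikit-learn wrapper's name for eta.

```python
def best_params_to_kwargs(best):
    """Turn a result shaped like optimizer.max into keyword arguments (hypothetical helper)."""
    params = dict(best["params"])
    # every parameter is sampled as a float; tree count and depth must be integers
    for name in ("n_estimators", "max_depth"):
        params[name] = int(round(params[name]))
    # assumption: pass eta under the wrapper's name, learning_rate
    params["learning_rate"] = params.pop("eta")
    return params


# values copied from iteration 1 of the log above
best = {
    "target": 0.808,
    "params": {
        "colsample_bytree": 0.7085, "eta": 2.164, "gamma": 0.000571,
        "max_depth": 10.37, "min_child_weight": 4.642,
        "n_estimators": 137.7, "subsample": 0.5931,
    },
}
kwargs = best_params_to_kwargs(best)
print(kwargs["n_estimators"], kwargs["max_depth"])  # 138 10
```

The result could then be passed on as XGBClassifier(**kwargs) instead of copying the numbers by hand.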

In [0]:
model = XGBClassifier(eta=2, n_estimators=138, max_depth=10, min_child_weight=5, gamma=0, subsample=0.6, colsample_bytree=0.7 )


In [0]:
model.fit(x_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7, eta=2, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=10,
              min_child_weight=5, missing=None, n_estimators=138, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.6, verbosity=1)

In [0]:
y_pred = model.predict(x_test)
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.85956
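
Accuracy is simply the fraction of predictions that match the true labels. As a sketch on made-up toy labels, the same number can be computed directly with NumPy:

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```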


In [0]:
from sklearn.metrics import confusion_matrix
y_pred=model.predict(x_test)
confusion_matrix(y_test,y_pred)
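
For binary labels, confusion_matrix returns a 2 x 2 array whose rows are the true classes and whose columns are the predicted classes, that is [[TN, FP], [FN, TP]]. The sketch below rebuilds that layout by hand on made-up toy labels; the function name is chosen for illustration.

```python
import numpy as np


def binary_confusion_matrix(y_true, y_pred):
    """Count (true, predicted) pairs for labels 0 and 1."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    matrix = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1  # row = true label, column = predicted label
    return matrix


cm = binary_confusion_matrix([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(cm)
# [[1 1]
#  [1 2]]
```

The diagonal holds the correct predictions, so the accuracy equals the diagonal sum divided by the total count (3 / 5 in this toy case).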
