## Configuring Gradient Boosting

### XGBoost's hyperparameters
XGBoost has many tuning parameters, which fall into three categories:
- General Parameters, which guide the overall functioning
- Booster Parameters, which guide the individual boosters at each step
- Learning Task Parameters, which guide the optimization performed

Not all are important to our project, but we will need to spend some time looking at the tuning parameters to figure out the best ones for our project.

There are too many to list in full, but some of the most common ones, and the ones we will most likely use, are:
- nthread
- seed
- silent
- subsample
- max_delta_step
- missing
- scale_pos_weight
- booster (e.g. gbtree)
- max_depth
- n_estimators
- colsample_bytree
- reg_lambda
- colsample_bylevel
- objective
- learning_rate
- base_score
- min_child_weight
- reg_alpha
- gamma
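
As a rough map of where the names above sit in the three categories, the sketch below groups them in a plain dictionary. The grouping follows our reading of the XGBoost parameter documentation and should be treated as an assumption. n_estimators and missing are arguments of the scikit-learn wrapper rather than core parameters, so they go in a separate, hypothetical `wrapper` group. `PARAM_GROUPS` and `group_of` are illustrative names only.

```python
# Hypothetical grouping of the parameters listed above (an assumption, not an official table).
PARAM_GROUPS = {
    "general": ["booster", "nthread", "silent"],
    "booster": [
        "max_depth", "learning_rate", "subsample", "colsample_bytree",
        "colsample_bylevel", "min_child_weight", "max_delta_step",
        "reg_lambda", "reg_alpha", "gamma", "scale_pos_weight",
    ],
    "learning_task": ["objective", "base_score", "seed"],
    "wrapper": ["n_estimators", "missing"],
}


def group_of(name):
    """Return the category a parameter was filed under, or None if unknown."""
    for group, names in PARAM_GROUPS.items():
        if name in names:
            return group
    return None


for name in ("max_depth", "objective", "nthread"):
    print(name, "->", group_of(name))
```

Keeping the grouping in one place makes it easier to decide which parameters are worth sweeping (mostly the booster group) and which only affect logging or speed.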

In [10]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('/users/kdeuser/Desktop/CMPU366/FinalProject/mbti_1.csv') 

In [11]:
## map each MBTI letter to 0 or 1
mbtipersonalities = {'I':0, 'E':1, 'N':0, 'S':1, 'T':0, 'F':1, 'J':0, 'P':1}
## per-position lookup used to decode a binary vector back into MBTI letters
mbtip_list = [{0:'I', 1:'E'}, {0:'N', 1:'S'}, {0:'T', 1:'F'}, {0:'J', 1:'P'}]

## make mbti personalities to binary vectors
def mbti_to_bin_vec(mbti_pers):
    return [mbtipersonalities[i] for i in mbti_pers]

## change back from binary vectors to the mbti personalities
def bin_vec_to_mbti(mbti_pers):
    ## needs to be a string
    s = ""
    for i, j in enumerate(mbti_pers):
        s += mbtip_list[i][j]
    return s

check = data.head(10)
mbti_binvec  = np.array([mbti_to_bin_vec(i) for i in check.type])
print("Binarized MBTI list:\n%s"% mbti_binvec)

binvec_mbti = np.array([bin_vec_to_mbti(i) for i in mbti_binvec])
print("Back to MBTI:  \n%s" % binvec_mbti)

##check with og data.head
print("Original MBTI: \n%s" % data.head(10).type)

Binarized MBTI list:
[[0 0 1 0]
 [1 0 0 1]
 [0 0 0 1]
 [0 0 0 0]
 [1 0 0 0]
 [0 0 0 0]
 [0 0 1 0]
 [0 0 0 0]
 [0 0 1 0]
 [0 0 0 1]]
Back to MBTI:  
['INFJ' 'ENTP' 'INTP' 'INTJ' 'ENTJ' 'INTJ' 'INFJ' 'INTJ' 'INFJ' 'INTP']
Original MBTI: 
0    INFJ
1    ENTP
2    INTP
3    INTJ
4    ENTJ
5    INTJ
6    INFJ
7    INTJ
8    INFJ
9    INTP
Name: type, dtype: object
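
Printing the first ten rows only exercises the types that happen to appear there. A round-trip check over all 16 types is a cheap complement, since it would flag any axis whose decoding table does not mirror its encoding. The sketch below is a minimal, self-contained version. `AXES`, `encode` and `decode` are hypothetical names, not the notebook's helpers.

```python
from itertools import product

# One (letter for 0, letter for 1) pair per MBTI axis, in position order.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]


def encode(mbti_type):
    """Turn a four-letter type such as 'INFJ' into a list of 0/1 values."""
    return [pair.index(letter) for pair, letter in zip(AXES, mbti_type)]


def decode(bits):
    """Turn a list of four 0/1 values back into a four-letter type."""
    return "".join(pair[bit] for pair, bit in zip(AXES, bits))


# Build every combination of letters and confirm encode/decode are inverses.
all_types = ["".join(letters) for letters in product(*AXES)]
mismatches = [t for t in all_types if decode(encode(t)) != t]
print(len(all_types), "types checked,", len(mismatches), "mismatches")
```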


In [12]:
# to preprocess the data we make everything lowercase, remove
# non-alphabetic characters, urls and overly common words (e.g. a, the, at, etc.)
# and lemmatize the words

### optionally removes stopwords and mbti type names
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk.corpus import stopwords
import re

## list of mbti types
mbti_types_list = ['ESTP', 'ESFP', 'ENFP', 'ENTP', 'ESTJ', 'ESFJ', 'ENFJ', 'ENTJ',
                  'ISTJ', 'ISFJ', 'INFJ', 'INTJ', 'ISTP', 'ISFP', 'INFP', 'INTP']
mbti_types_list = [i.lower() for i in mbti_types_list]

##choose stopwords
stopWOrds = stopwords.words("english")

lemmatizer = WordNetLemmatizer()

def preprocessing(data, remove_mbti=True, remove_stopwords=True):
    personalities = []
    posts = []
    datalength = len(data)
    i = 0
    
    ## iterate through the rows 
    for row in data.iterrows():
        i += 1
        if (i % 500 == 0 or i ==1 or i == datalength):
            print("%s of %s rows" % (i, datalength))
        
        ## go through current row
        comments = row[1].posts
        ## remove urls; a % in a url is followed by two hex digits (percent-encoding)
        removeurls = re.sub('(http|https|ftp)://(?:[A-Za-z]|[0-9]|[$-_&@+]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', comments)
        ## remove non words
        removenonwords = re.sub('[^A-Za-z]', ' ', removeurls)
        removeextraspace = re.sub('  +',' ', removenonwords)
        ## make lowercase 
        changed = removeextraspace.lower()
        
        #lemmatize, optionally dropping stopwords first
        if remove_stopwords:
            changed = " ".join([lemmatizer.lemmatize(w) for w in changed.split(' ') if w not in stopWOrds])
        else:
            ## lemmatize
            changed = " ".join([lemmatizer.lemmatize(w) for w in changed.split(' ')])
        
        ## remove mbti types
        if remove_mbti:
            for m in mbti_types_list:
                changed = changed.replace(m,"")
                
        ## convert the mbti type to a binary vector
        bin_vecs = mbti_to_bin_vec(row[1].type)
        personalities.append(bin_vecs)
        posts.append(changed)
        
    posts = np.array(posts)
    personalities = np.array(personalities)
    return posts, personalities
        

In [13]:
posts, personalities = preprocessing(data)

1 of 8675 rows
500 of 8675 rows
1000 of 8675 rows
1500 of 8675 rows
2000 of 8675 rows
2500 of 8675 rows
3000 of 8675 rows
3500 of 8675 rows
4000 of 8675 rows
4500 of 8675 rows
5000 of 8675 rows
5500 of 8675 rows
6000 of 8675 rows
6500 of 8675 rows
7000 of 8675 rows
7500 of 8675 rows
8000 of 8675 rows
8500 of 8675 rows
8675 of 8675 rows
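
To see what the cleaning steps do to a single post, the sketch below applies a simplified version of them to one string. It is an illustration under stated assumptions, not the notebook's function: the URL pattern is a shorter stand-in, type names are removed only as whole words (the notebook uses plain substring replacement), and lemmatization and stopword removal are left out so the example needs only the standard library. `clean`, `URL_RE` and `TYPE_RE` are hypothetical names.

```python
import re
from itertools import product

# Simplified URL pattern (an assumption): scheme followed by any non-space run.
URL_RE = re.compile(r"(?:https?|ftp)://\S+")
# All 16 type names in lowercase, matched only as whole words.
TYPE_NAMES = ["".join(p).lower() for p in product("IE", "NS", "TF", "JP")]
TYPE_RE = re.compile(r"\b(?:" + "|".join(TYPE_NAMES) + r")\b")


def clean(text, remove_mbti=True):
    """Strip URLs and non-letters, lowercase, optionally drop MBTI type names."""
    text = URL_RE.sub(" ", text)
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()
    if remove_mbti:
        text = TYPE_RE.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


sample = "Check https://example.com/a?b=1 NOW!!! I'm an INTJ, 100%."
print(clean(sample))
print(clean(sample, remove_mbti=False))
```

Removing the type names matters because posts often mention the author's own type, which would otherwise leak the label into the features.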


In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE



count_vec = CountVectorizer(stop_words="english",
                            analyzer = "word",
                            ngram_range=(1,1),
                            max_df=0.9,
                            min_df=0.1,
                            max_features=None)


count_train = count_vec.fit(posts)
print("CountVectorizer...")
# should create and return a count-vectorized output of docs
posts_count = count_vec.fit_transform(posts)


tfidf_izer = TfidfTransformer()

print("tf-idf...")
posts_tfidf =  tfidf_izer.fit_transform(posts_count).toarray()                           
                             
#vectorizer = TfidfVectorizer()
#vectors = vectorizer.fit_transform([documentA, documentB])
#feature_names = vectorizer.get_feature_names()
#dense = vectors.todense()
#denselist = dense.tolist()
#df = pd.DataFrame(denselist, columns=feature_names)

CountVectorizer...
tf-idf...
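
The two steps above (document-frequency filtering, then tf-idf weighting) can be traced by hand on a toy corpus. The sketch below is a pure-Python illustration, not scikit-learn's code. It assumes the smoothed idf formula that scikit-learn documents for its default settings, followed by L2 normalization of each row. The thresholds 0.3 and 0.7 are chosen only so that the filter visibly removes terms from four tiny documents; the notebook itself uses 0.1 and 0.9.

```python
import math
from collections import Counter

docs = ["cats chase mice", "dogs chase cats", "mice fear cats", "dogs fear nothing"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

# Keep terms whose document fraction lies within [min_df, max_df].
min_df, max_df = 0.3, 0.7
vocab = sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)


def tfidf_row(doc):
    """Weight term counts by smoothed idf, then scale the row to unit length."""
    counts = Counter(doc)
    weights = [counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


matrix = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Here "cats" is dropped for appearing in too many documents and "nothing" for appearing in too few, which is the same effect max_df and min_df have on the real vocabulary.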


In [15]:
# the MBTI type indicators
mbti_type_ind = ["IE: Introversion (I) - Extroversion (E)",
                "NS: Intuition (N) - Sensing (S)",
                "FT: Feeling(F) - Thinking (T)",
                "JP: Judging (J) - Percieving (P)"]

for m in range(len(mbti_type_ind)):
    print(mbti_type_ind[m])
    
mbti1row = bin_vec_to_mbti(personalities[0,:])

IE: Introversion (I) - Extroversion (E)
NS: Intuition (N) - Sensing (S)
FT: Feeling(F) - Thinking (T)
JP: Judging (J) - Percieving (P)


In [16]:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')


X = posts_tfidf

#we need to train each of the mbti type indicators
for m in range(len(mbti_type_ind)):
    print("%s ..." % mbti_type_ind[m])
    
    #for each type indicator we train a different Y
    Y = personalities[:,m]
    
    # next we split the data into train and test sets
    #unsure how to choose the random seed or test size
    seed = 50
    test_size = 0.50
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

    #fit model on the training data
    model = XGBClassifier()
    model.fit(X_train, Y_train)
        
    #make predictions for the test data
    Y_prediction = model.predict(X_test)
    predictions = [round(value) for value in Y_prediction]
    
    #check the accuracy
    accuracy = accuracy_score(Y_test, predictions)
    print(" - %s Accuracy: %.2f%%" % (mbti_type_ind[m], accuracy * 100.0))

IE: Introversion (I) - Extroversion (E) ...
 - IE: Introversion (I) - Extroversion (E) Accuracy: 77.41%
NS: Intuition (N) - Sensing (S) ...
 - NS: Intuition (N) - Sensing (S) Accuracy: 86.58%
FT: Feeling(F) - Thinking (T) ...
 - FT: Feeling(F) - Thinking (T) Accuracy: 73.28%
JP: Judging (J) - Percieving (P) ...
 - JP: Judging (J) - Percieving (P) Accuracy: 65.35%
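
Accuracy figures like these are easier to judge next to the accuracy of always predicting the most common label. The sketch below shows that comparison in plain Python. The labels are the third column of the ten-row binarized sample printed earlier, used purely as an illustration, so the number it prints says nothing about the full dataset. `accuracy` and `majority_baseline` are hypothetical helper names.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy of always predicting the most common label."""
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    return accuracy(y_true, [most_common_label] * len(y_true))


# Third column (the axis printed as FT) of the ten-row sample shown earlier.
sample_labels = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]
print("majority baseline: %.2f%%" % (majority_baseline(sample_labels) * 100.0))
```

Running the same helper on each column of the test labels would give one reference number per axis to set beside the model's accuracy.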


In [17]:
#some of the xgbparams we may look at 
default_get_xgb_params = model.get_xgb_params()
print(default_get_xgb_params)

{'objective': 'binary:logistic', 'max_depth': 3, 'learning_rate': 0.1, 'subsample': 1, 'colsample_bylevel': 1, 'booster': 'gbtree', 'scale_pos_weight': 1, 'reg_alpha': 0, 'n_estimators': 100, 'silent': 1, 'seed': 0, 'missing': None, 'reg_lambda': 1, 'base_score': 0.5, 'nthread': 1, 'gamma': 0, 'colsample_bytree': 1, 'max_delta_step': 0, 'min_child_weight': 1}
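
When experimenting it helps to know which settings actually differ from the printed defaults. The sketch below compares a candidate dictionary against a subset of the values printed above and keeps only the entries that change something. `changed_params` and the candidate values are illustrative.

```python
# Subset of the defaults printed by get_xgb_params() above.
printed_defaults = {
    "n_estimators": 100,
    "max_depth": 3,
    "learning_rate": 0.1,
    "nthread": 1,
    "subsample": 1,
}


def changed_params(candidate, defaults):
    """Return only the entries of candidate that differ from defaults."""
    return {k: v for k, v in candidate.items() if defaults.get(k) != v}


candidate = {"n_estimators": 100, "max_depth": 2, "nthread": 8, "learning_rate": 0.2}
print(changed_params(candidate, printed_defaults))
```

In this example n_estimators drops out because 100 is already the default, so any change in accuracy must come from the other three settings.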


We will start with the following parameters:
- n_estimators: number of trees to build
- max_depth: the maximum depth of a tree
    - the larger the value, the more complex the model and the more likely it is to overfit
- nthread: the number of parallel threads used to run XGBoost
- learning_rate (also known as eta): step size shrinkage, used to prevent overfitting
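
The interaction between n_estimators and learning_rate can be seen in a toy version of boosting. In the sketch below every "tree" is a constant that predicts the mean residual, and the loss is squared error. This is a deliberate simplification for illustration, not how XGBoost grows trees. `boost_constant` is a hypothetical name.

```python
def boost_constant(targets, learning_rate, n_estimators):
    """Toy boosting for squared error where every 'tree' predicts the mean residual."""
    prediction = 0.0
    for _ in range(n_estimators):
        residuals = [t - prediction for t in targets]
        step = sum(residuals) / len(residuals)   # what the weak learner fits
        prediction += learning_rate * step       # shrink the step before adding it
    return prediction


targets = [2.0, 4.0, 6.0]  # the best constant prediction is the mean, 4.0
for lr, n in [(0.1, 10), (0.1, 100), (0.3, 10)]:
    print("learning_rate=%.1f n_estimators=%3d -> %.4f" % (lr, n, boost_constant(targets, lr, n)))
```

Each round closes a fraction learning_rate of the remaining gap, so a smaller learning rate needs more rounds to reach the same fit. That is why the two parameters are usually tuned together.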

In [18]:
#set up some of the parameters for xgboost
param={}

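#good ones to choose initially would be n_estimators, max_depth, nthread, learning_rate
param['n_estimators']=100
#max_depth: native XGBoost default is 6; the XGBClassifier defaults printed above use 3
param['max_depth']=2
#number of parallel threads running XGBoost
param['nthread']=8
#learning_rate (eta): native XGBoost default is 0.3; the XGBClassifier defaults printed above use 0.1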
param['learning_rate']=0.2

#training the MBTI type indicators individually
for m in range(len(mbti_type_ind)):
    print("%s ... " % (mbti_type_ind[m]))
    
    Y= personalities[:,m]
    
    #split into train and test sets
    seed = 50
    test_size = 0.50
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

    #fit model on the training data
    model = XGBClassifier(**param)
    model.fit(X_train, Y_train)
        
    #make predictions for the test data
    Y_prediction = model.predict(X_test)
    predictions = [round(value) for value in Y_prediction]
    
    #check the accuracy
    accuracy = accuracy_score(Y_test, predictions)
    print(" - %s Accuracy: %.2f%%" % (mbti_type_ind[m], accuracy * 100.0))
    
    

IE: Introversion (I) - Extroversion (E) ... 
 - IE: Introversion (I) - Extroversion (E) Accuracy: 77.78%
NS: Intuition (N) - Sensing (S) ... 
 - NS: Intuition (N) - Sensing (S) Accuracy: 86.68%
FT: Feeling(F) - Thinking (T) ... 
 - FT: Feeling(F) - Thinking (T) Accuracy: 73.44%
JP: Judging (J) - Percieving (P) ... 
 - JP: Judging (J) - Percieving (P) Accuracy: 64.62%


The first attempt at parameter tuning did little: accuracy did not improve, and the rough estimates are still:
- IE: 77%
- NS: 86%
- FT: 73%
- JP: 65%

Note: we also tried other values, including setting n_estimators to 50 instead of 100, which reduced accuracy by around 1% for each MBTI type indicator.

We can also try:
- subsample: ratio of the training instances sampled before growing each tree; lowering it means each tree sees less of the training data
- colsample_bytree: fraction of features used per tree; the higher it is, the more likely it is to lead to overfitting
- num_parallel_tree: number of parallel trees constructed during each iteration
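
What the two sampling ratios mean can be sketched without XGBoost at all. The code below picks the row and column indices that a single tree would be allowed to see. It is a conceptual illustration only: XGBoost's own sampling differs in detail, and `sample_rows_and_columns` is a hypothetical helper.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, seed):
    """Pick the row and column indices one tree would be allowed to see."""
    rng = random.Random(seed)
    n_keep_rows = max(1, int(n_rows * subsample))
    n_keep_cols = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_keep_rows))
    cols = sorted(rng.sample(range(n_cols), n_keep_cols))
    return rows, cols


rows, cols = sample_rows_and_columns(
    n_rows=8, n_cols=4, subsample=0.75, colsample_bytree=0.75, seed=21
)
print("rows used:", rows)
print("columns used:", cols)
```

Because each tree sees a different random slice of the data, the trees are less correlated with one another, which is the usual argument for setting these ratios below 1.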


#### Playing around with parameters
We test different parameters one at a time to see which give the biggest improvement in accuracy.
The accuracy before any parameters were set, for test_size=0.5 (one run):
- IE: 77.41%
- NS: 86.58%
- FT: 73.28%
- JP: 65.35%

In [20]:
param={}


param_list=['n_estimators', 'max_depth', 'nthread', 'learning_rate', 'subsample',
          'colsample_bytree', 'num_parallel_tree', 'num_feature', 'seed',
          'silent', 'max_delta_step', 'scale_pos_weight',
          'reg_lambda', 'colsample_bylevel', 'objective', 
          'base_score', 'min_child_weight', 'reg_alpha', 'gamma']

#previous value was 100
param['n_estimators']=90 #no real change when set to 100

#max_depth default is 6 , was set at 3
param['max_depth']=4  #not a big improvement

#number of parallel threads running XGBoost, was set at 1
param['nthread']=2 #no real change

#learning rate default is 0.3, was set at 0.1
param['learning_rate']=0.2 #slight changes

#default is 1
param['subsample']=0.75 #a bit of change

#default is 1
param['colsample_bytree']= 0.75 #slight changes

#default is 1
param['num_parallel_tree'] = 2 #no real change

#set automatically by XGBoost
param['num_feature']= 25 #no real change

#default 0
param['seed']= 21

#belongs to verbosity?
param['silent']= 2

#default 0
param['max_delta_step']= 2

#set at None
#param['missing']=

#set at 1
param['scale_pos_weight']= 2

#default is gbtree
#param['booster'] = "gblinear" #no real changes with gblinear or dart

#default is 1
param['reg_lambda']= 2

#default is 1
param['colsample_bylevel']=0.5

#set at "binary:logistic"
param['objective']= "reg:logistic"

#set at 0.5
param['base_score']= 0.7

#default 1
param['min_child_weight']= 2

#set at 0
param['reg_alpha']= 1

#default is 0
param['gamma']= 1


#go through all params
for p in range(len(param_list)):
    params={}  
    params[param_list[p]] = param[param_list[p]]
    print(param_list[p])
    #training the MBTI type indicators individually
    for m in range(len(mbti_type_ind)):
        print("%s ... " % (mbti_type_ind[m]))

        Y= personalities[:,m]

        #split into train and test sets
        seed = 50
        test_size = 0.50

        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

        #fit model on the training data
        model = XGBClassifier(**params)
        model.fit(X_train, Y_train)

        #make predictions for the test data
        Y_prediction = model.predict(X_test)
        predictions = [round(value) for value in Y_prediction]

        #check the accuracy
        accuracy = accuracy_score(Y_test, predictions)
        print(" - %s Accuracy: %.2f%%" % (mbti_type_ind[m], accuracy * 100.0))



n_estimators
IE: Introversion (I) - Extroversion (E) ... 
 - IE: Introversion (I) - Extroversion (E) Accuracy: 77.57%
NS: Intuition (N) - Sensing (S) ... 
 - NS: Intuition (N) - Sensing (S) Accuracy: 86.63%
FT: Feeling(F) - Thinking (T) ... 
 - FT: Feeling(F) - Thinking (T) Accuracy: 73.08%
JP: Judging (J) - Percieving (P) ... 
 - JP: Judging (J) - Percieving (P) Accuracy: 65.31%
max_depth
IE: Introversion (I) - Extroversion (E) ... 
 - IE: Introversion (I) - Extroversion (E) Accuracy: 77.57%
NS: Intuition (N) - Sensing (S) ... 
 - NS: Intuition (N) - Sensing (S) Accuracy: 86.49%
FT: Feeling(F) - Thinking (T) ... 
 - FT: Feeling(F) - Thinking (T) Accuracy: 73.72%
JP: Judging (J) - Percieving (P) ... 
 - JP: Judging (J) - Percieving (P) Accuracy: 65.21%
nthread
IE: Introversion (I) - Extroversion (E) ... 
 - IE: Introversion (I) - Extroversion (E) Accuracy: 77.41%
NS: Intuition (N) - Sensing (S) ... 
 - NS: Intuition (N) - Sensing (S) Accuracy: 86.58%
FT: Feeling(F) - Thinking (T) ... 


n_estimators
 - IE: 77.57%
 - NS: 86.63%
 - FT: 73.08%
 - JP: 65.31%

max_depth
 - IE: 77.57%
 - NS: 86.49%
 - FT: 73.72%
 - JP: 65.21%

nthread
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

learning_rate
 - IE: 77.32%
 - NS: 86.33%
 - FT: 73.91%
 - JP: 65.95%

subsample
 - IE: 77.55%
 - NS: 86.61%
 - FT: 73.72%
 - JP: 65.86%

colsample_bytree
 - IE: 77.69%
 - NS: 86.63%
 - FT: 73.70%
 - JP: 65.44%

num_parallel_tree
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

num_feature
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

seed
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

silent
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

scale_pos_weight
 - IE: 75.86%
 - NS: 86.26%
 - FT: 68.42%
 - JP: 63.12%

reg_lambda
 - IE: 77.80%
 - NS: 86.56%
 - FT: 73.65%
 - JP: 65.05%

colsample_bylevel
 - IE: 77.57%
 - NS: 86.63%
 - FT: 73.35%
 - JP: 65.31%

objective
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

base_score
 - IE: 77.22% 
 - NS: 86.63%
 - FT: 73.17%
 - JP: 65.56%

min_child_weight
 - IE: 77.59%
 - NS: 86.61%
 - FT: 73.86%
 - JP: 64.78%

reg_alpha
 - IE: 77.69%
 - NS: 86.61%
 - FT: 73.72% 
 - JP: 65.15%

gamma
 - IE: 77.69%
 - NS: 86.61%
 - FT: 73.37%
 - JP: 65.63%
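
One way to compare these one-at-a-time runs is to rank each parameter by its average change against the untuned baseline. The sketch below does this for a handful of the results reported above. The numbers are copied from this notebook, while `mean_delta` and the choice of a plain average over the four axes are our own assumptions. With a single train/test split, differences of a fraction of a percentage point may well be noise.

```python
# Untuned accuracy per axis (IE, NS, FT, JP), in percent.
baseline = (77.41, 86.58, 73.28, 65.35)

# A few of the one-at-a-time results reported above, same axis order.
results = {
    "learning_rate": (77.32, 86.33, 73.91, 65.95),
    "subsample": (77.55, 86.61, 73.72, 65.86),
    "colsample_bytree": (77.69, 86.63, 73.70, 65.44),
    "scale_pos_weight": (75.86, 86.26, 68.42, 63.12),
    "reg_lambda": (77.80, 86.56, 73.65, 65.05),
}


def mean_delta(scores):
    """Average change in accuracy (percentage points) relative to the baseline."""
    return sum(s - b for s, b in zip(scores, baseline)) / len(baseline)


# Largest average improvement first.
ranked = sorted(results.items(), key=lambda item: mean_delta(item[1]), reverse=True)
for name, scores in ranked:
    print("%-18s %+.2f" % (name, mean_delta(scores)))
```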

n_estimators = 90
 - IE: 77.57% 
 - NS: 86.63%
 - FT: 73.08% 
 - JP: 65.31%
 
max_depth = 4
 - IE: 77.50%
 - NS: 86.51%
 - FT: 73.58%
 - JP: 65.31%
 
nthread = 2
 - IE: 77.50%
 - NS: 86.51%
 - FT: 73.58%
 - JP: 65.31%
 
learning_rate = 0.2
 - IE: 76.88%
 - NS: 86.17% 
 - FT: 73.03%
 - JP: 64.08%
 
subsample = 0.75
 - IE: 76.23%
 - NS: 86.10%
 - FT: 74.00%
 - JP: 64.25%
 
colsample_bytree = 0.75
 - IE: 77.13%
 - NS: 86.10%
 - FT: 73.74%
 - JP: 62.93%
 
num_parallel_tree = 2
 - IE: 77.48%
 - NS: 86.40%
 - FT: 74.69%
 - JP: 64.75%
 
num_feature = 25
 - IE: 77.48%
 - NS: 86.40%
 - FT: 74.69%
 - JP: 64.75%
 
seed = 21
 - IE: 77.06%
 - NS: 86.24%
 - FT: 75.20%
 - JP: 63.95%
 
silent = 2
 - IE: 77.06%
 - NS: 86.24%
 - FT: 75.20%
 - JP: 63.95%
 
max_delta_step = 2
 - IE: 77.06%
 - NS: 86.24%
 - FT: 75.20%
 - JP: 63.95%
 
scale_pos_weight = 2
 - IE: 75.70% 
 - NS: 85.82%
 - FT: 72.01%
 - JP: 65.28%
 
reg_lambda = 2
 - IE: 75.68% 
 - NS: 85.82%
 - FT: 73.03%
 - JP: 63.99%
 
colsample_bylevel = 0.5
 - IE: 76.56%
 - NS: 85.34%
 - FT: 72.61%
 - JP: 64.04%
 
objective = "reg:logistic"
 - IE: 76.56%
 - NS: 85.34%
 - FT: 72.61%
 - JP: 64.04%
 
base_score = 0.7
 - IE: 76.33%
 - NS: 85.43%
 - FT: 72.57%
 - JP: 65.05%
 
min_child_weight = 2
 - IE: 76.03%
 - NS: 85.34%
 - FT: 72.25%
 - JP: 63.53%
 
reg_alpha = 1
 - IE: 75.96%
 - NS: 85.36%
 - FT: 72.22%
 - JP: 64.59%
 
gamma = 1
 - IE: 76.35%
 - NS: 85.52%
 - FT: 72.43% 
 - JP: 63.95%

In [22]:
param={}

param_list = ['num_feature', 
              'seed', 'silent', 'max_delta_step', 'scale_pos_weight', 
              'reg_lambda', 'min_child_weight', 'reg_alpha', 'gamma']
random_num_list=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]



#max_depth default is 6 , was set at 3
param['max_depth']=4  #not a big improvement

#number of parallel threads running XGBoost, was set at 1
param['nthread']=2 #no real change

#default is 1
param['num_parallel_tree'] = 2 #no real change

#set automatically by XGBoost
param['num_feature']= 25 #no real change

#default 0
param['seed']= 21

#belongs to verbosity?
param['silent']= 2

#default 0
param['max_delta_step']= 2


#set at 1
param['scale_pos_weight']= 2

#default is 1
param['reg_lambda']= 2

#default 1
param['min_child_weight']= 2

#set at 0
param['reg_alpha']= 1

#default is 0
param['gamma']= 1

#go through all params

for p in range(len(param_list)):
    for r in range(len(random_num_list)):
        params={}  
        params[param_list[p]] = random_num_list[r]
        print('%s = %s' % (param_list[p], random_num_list[r]))
        #training the MBTI type indicators individually
        for m in range(len(mbti_type_ind)):
            print("%s ... " % (mbti_type_ind[m]))

            Y= personalities[:,m]

            #split into train and test sets
            seed = 50
            test_size = 0.50

            X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

            #fit model on the training data
            model = XGBClassifier(**params)
            model.fit(X_train, Y_train)

            #make predictions for the test data
            Y_prediction = model.predict(X_test)
            predictions = [round(value) for value in Y_prediction]

            #check the accuracy
            accuracy = accuracy_score(Y_test, predictions)
            print(" - %s Accuracy: %.2f%%" % (mbti_type_ind[m], accuracy * 100.0))



num_feature = 0
IE: Introversion (I) - Extroversion (E) ... 
 - IE: Introversion (I) - Extroversion (E) Accuracy: 77.41%
NS: Intuition (N) - Sensing (S) ... 
 - NS: Intuition (N) - Sensing (S) Accuracy: 86.58%
FT: Feeling(F) - Thinking (T) ... 
 - FT: Feeling(F) - Thinking (T) Accuracy: 73.28%
JP: Judging (J) - Percieving (P) ... 
 - JP: Judging (J) - Percieving (P) Accuracy: 65.35%
num_feature = 1
IE: Introversion (I) - Extroversion (E) ... 


KeyboardInterrupt: 
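
Long sweeps like this one are easy to lose to an interrupt. The sketch below shows one possible shape for a sweep helper that stores every finished trial in a dictionary and returns what it has if interrupted. It is a hypothetical harness, not the notebook's code: `sweep` takes any `evaluate(params)` callable, and `fake_evaluate` is a made-up stand-in for fitting and scoring a model.

```python
def sweep(param_names, values, evaluate):
    """Try each (name, value) pair on its own and collect the scores."""
    results = {}
    try:
        for name in param_names:
            for value in values:
                params = {name: value}  # fresh dict: nothing carries over between trials
                results[(name, value)] = evaluate(params)
    except KeyboardInterrupt:
        print("interrupted; returning %d finished trials" % len(results))
    return results


def fake_evaluate(params):
    """Hypothetical stand-in for fitting a model; returns a made-up score."""
    return 0.5 + 0.01 * sum(params.values())


results = sweep(["gamma", "reg_alpha"], [0, 1, 2], fake_evaluate)
for (name, value), score in results.items():
    print("%s = %s -> %.2f" % (name, value, score))
```

Building a new `params` dictionary inside the inner loop guarantees that each trial changes exactly one setting relative to the defaults.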

num_feature = 0 
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

num_feature = 1
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

num_feature = 2
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

num_feature = 3
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

num_feature = 4
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

num_feature = 5 
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28% 
 - JP: 65.35%

num_feature = 6
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28% 
 - JP: 65.35%

num_feature = 7
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

num_feature = 8
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

num_feature = 9
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

num_feature = 10 
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%


seed = 0
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

seed = 1 
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

seed = 2
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

seed = 3
 - IE: 77.41%
 - NS: 86.58% 
 - FT: 73.28%
 - JP: 65.35%

seed = 4
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

seed = 5
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

seed = 6
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28% 
 - JP: 65.35%

seed = 7
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

seed = 8
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

seed = 9
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

seed = 10
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%


silent = 1
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

silent = 2
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28% 
 - JP: 65.35%

silent = 3
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

silent = 4
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

silent = 5
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

silent = 6
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

silent = 7
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

silent = 8
 - IE: 77.41%
 - NS: 86.58% 
 - FT: 73.28%
 - JP: 65.35%

silent = 9
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

silent = 10 
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%


max_delta_step = 0 
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 1
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 2
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 3
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 4
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 5
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 6
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 7
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 8
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 9
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_delta_step = 10
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%


scale_pos_weight = 0
 - IE: 77.06%
 - NS: 86.63%
 - FT: 46.59%
 - JP: 39.65%

scale_pos_weight = 1
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

scale_pos_weight = 2
 - IE: 75.86%
 - NS: 86.26%
 - FT: 68.42%
 - JP: 63.12%

scale_pos_weight = 3
 - IE: 71.35%
 - NS: 84.39%
 - FT: 64.59%
 - JP: 61.78%

scale_pos_weight = 4
 - IE: 65.35%
 - NS: 81.65%
 - FT: 62.03%
 - JP: 61.27%

scale_pos_weight = 5
 - IE: 58.92%
 - NS: 78.28%
 - FT: 61.02%
 - JP: 61.02%

scale_pos_weight = 6
 - IE: 54.08%
 - NS: 74.83%
 - FT: 59.61%
 - JP: 61.04%

scale_pos_weight = 7
 - IE: 48.76%
 - NS: 71.30%
 - FT: 59.15%
 - JP: 60.88%

scale_pos_weight = 8
 - IE: 45.78%
 - NS: 67.96%
 - FT: 57.98% 
 - JP: 60.79%

scale_pos_weight = 9
 - IE: 43.68%
 - NS: 65.12%
 - FT: 57.61%
 - JP: 60.81%

scale_pos_weight = 10
 - IE: 41.68%
 - NS: 62.61%
 - FT: 57.12%
 - JP: 60.81%
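
Rather than sweeping scale_pos_weight over arbitrary integers, the XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value. The sketch below computes that ratio in plain Python. The labels are the first column of the ten-row binarized sample printed earlier, used only as an illustration, and `suggested_scale_pos_weight` is a hypothetical helper name.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive labels; the ratio is undefined")
    return negatives / positives


# First column (the IE axis) of the ten-row binarized sample printed earlier.
ie_sample = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(suggested_scale_pos_weight(ie_sample))
```

Applied to the training labels of each axis, this gives one candidate value per classifier instead of a single value shared by all four.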


reg_lambda = 0
 - IE: 42.21%
 - NS: 61.71%
 - FT: 57.93%
 - JP: 61.07%

reg_lambda = 1
 - IE: 41.68%
 - NS: 62.61%
 - FT: 57.12%
 - JP: 60.81%

reg_lambda = 2
 - IE: 41.56%
 - NS: 62.22%
 - FT: 57.22%
 - JP: 60.63%

reg_lambda = 3
 - IE: 41.70%
 - NS: 62.17%
 - FT: 57.19%
 - JP: 60.72%

reg_lambda = 4
 - IE: 41.01%
 - NS: 62.77%
 - FT: 56.92%
 - JP: 60.74%

reg_lambda = 5
 - IE: 40.55%
 - NS: 62.24%
 - FT: 56.80%
 - JP: 60.63%

reg_lambda = 6
 - IE: 40.57%
 - NS: 61.96%
 - FT: 56.96%
 - JP: 60.63%

reg_lambda = 7
 - IE: 40.27%
 - NS: 62.06%
 - FT: 56.59%
 - JP: 60.65%

reg_lambda = 8
 - IE: 40.23% 
 - NS: 61.00%
 - FT: 56.55%
 - JP: 60.56%

reg_lambda = 9
 - IE: 39.95%
 - NS: 61.71%
 - FT: 56.59%
 - JP: 60.60%

reg_lambda = 10
 - IE: 39.35%
 - NS: 61.50%
 - FT: 56.39%
 - JP: 60.56%


min_child_weight = 0
 - IE: 39.40%
 - NS: 62.17%
 - FT: 56.57%
 - JP: 60.56%

min_child_weight = 1
 - IE: 39.35% 
 - NS: 61.50%
 - FT: 56.39%
 - JP: 60.56%

min_child_weight = 2
 - IE: 39.67%
 - NS: 61.57%
 - FT: 56.75% 
 - JP: 60.60%

min_child_weight = 3
 - IE: 39.51%
 - NS: 60.86%
 - FT: 56.45%
 - JP: 60.58%

min_child_weight = 4
 - IE: 39.81%
 - NS: 61.62%
 - FT: 56.13%
 - JP: 60.56%

min_child_weight = 5
 - IE: 40.66%
 - NS: 61.25%
 - FT: 56.36%
 - JP: 60.63%

min_child_weight = 6
 - IE: 39.47%
 - NS: 61.76%
 - FT: 56.18%
 - JP: 60.60%

min_child_weight = 7
 - IE: 39.93%
 - NS: 60.63%
 - FT: 56.20%
 - JP: 60.56%

min_child_weight = 8
 - IE: 39.40%
 - NS: 60.77%
 - FT: 56.32%
 - JP: 60.58%

min_child_weight = 9
 - IE: 38.96%
 - NS: 60.88%
 - FT: 56.34%
 - JP: 60.63%

min_child_weight = 10
 - IE: 39.44%
 - NS: 61.23%
 - FT: 56.11%
 - JP: 60.60%


reg_alpha = 0
 - IE: 39.44%
 - NS: 61.23%
 - FT: 56.11% 
 - JP: 60.60%

reg_alpha = 1
 - IE: 39.44%
 - NS: 61.11%
 - FT: 56.27%
 - JP: 60.67%

reg_alpha = 2
 - IE: 39.42%
 - NS: 61.48%
 - FT: 56.11%
 - JP: 60.60%

reg_alpha = 3
 - IE: 39.17%
 - NS: 61.25%
 - FT: 56.06% 
 - JP: 60.56%

reg_alpha = 4
 - IE: 39.58%
 - NS: 61.11%
 - FT: 55.92%
 - JP: 60.58%

reg_alpha = 5
 - IE: 38.93%
 - NS: 61.00%
 - FT: 55.99%
 - JP: 60.63%

reg_alpha = 6
 - IE: 38.13%
 - NS: 60.90%
 - FT: 55.81%
 - JP: 60.56%

reg_alpha = 7
 - IE: 38.38%
 - NS: 60.83%
 - FT: 55.76%
 - JP: 60.67%

reg_alpha = 8
 - IE: 38.66%
 - NS: 60.93%
 - FT: 55.49%
 - JP: 60.63%

reg_alpha = 9
 - IE: 37.30%
 - NS: 60.63% 
 - FT: 55.49%
 - JP: 60.63%

reg_alpha = 10
 - IE: 38.61%
 - NS: 60.33% 
 - FT: 55.56%
 - JP: 60.60%


gamma = 0
 - IE: 38.61%
 - NS: 60.33%
 - FT: 55.56%
 - JP: 60.60%

gamma = 1
 - IE: 37.97%
 - NS: 61.39%
 - FT: 55.56%
 - JP: 60.60%

gamma = 2
 - IE: 38.43%
 - NS: 60.95%
 - FT: 55.58%
 - JP: 60.67%

gamma = 3
 - IE: 37.71%
 - NS: 60.14%
 - FT: 55.58%
 - JP: 60.65%

gamma = 4
 - IE: 38.01%
 - NS: 60.63%
 - FT: 55.37%
 - JP: 60.60%

gamma = 5
 - IE: 38.06%
 - NS: 60.65%
 - FT: 55.46%
 - JP: 60.60%

gamma = 6
 - IE: 38.01%
 - NS: 60.90%
 - FT: 55.60%
 - JP: 60.65%

gamma = 7
 - IE: 37.57%
 - NS: 60.93%
 - FT: 55.44%
 - JP: 60.60%

gamma = 8
 - IE: 38.04%
 - NS: 60.86% 
 - FT: 55.49% 
 - JP: 60.56%

gamma = 9 
 - IE: 37.57% 
 - NS: 60.60%
 - FT: 55.49%
 - JP: 60.58%

gamma = 10
 - IE: 37.46%
 - NS: 60.72%
 - FT: 55.37%
 - JP: 60.58%


max_depth = 0
 - IE: 77.06%
 - NS: 86.63%
 - FT: 53.41%
 - JP: 60.35%

max_depth = 1
 - IE: 77.09%
 - NS: 86.63%
 - FT: 70.91% 
 - JP: 64.64%

max_depth = 2 
 - IE: 77.39%
 - NS: 86.63%
 - FT: 72.18%
 - JP: 65.21%

max_depth = 3
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

max_depth = 4
 - IE: 77.57%
 - NS: 86.49%
 - FT: 73.72%
 - JP: 65.21%

max_depth = 5
 - IE: 77.29%
 - NS: 86.54%
 - FT: 73.44%
 - JP: 64.59%

max_depth = 6
 - IE: 77.18%
 - NS: 86.45%
 - FT: 74.16%
 - JP: 65.17%

max_depth = 7
 - IE: 77.39%
 - NS: 86.54%
 - FT: 73.97%
 - JP: 64.59%

max_depth = 8
 - IE: 77.09%
 - NS: 86.49%
 - FT: 73.61%
 - JP: 64.91%

max_depth = 9
 - IE: 77.13%
 - NS: 86.51%
 - FT: 72.87%
 - JP: 64.87%

max_depth = 10
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%


nthread = 0
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%

nthread = 1
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%

nthread = 2
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%

nthread = 3
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%

nthread = 4
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%

nthread = 5
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%

nthread = 6
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%

nthread = 7
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%

nthread = 8
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86% 
 - JP: 64.73%

nthread = 9
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86% 
 - JP: 64.73%

nthread = 10
 - IE: 77.27%
 - NS: 86.58%
 - FT: 73.86%
 - JP: 64.73%

In [None]:
# Values tried by hand while experimenting. The sweep below does not read
# this dict; it builds its own `params` for each trial.
param = {}

param_list = ['learning_rate', 'subsample', 'colsample_bytree',
              'colsample_bylevel', 'base_score']
random_num_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

# learning rate default is 0.3, was set at 0.1
param['learning_rate'] = 0.2  # slight changes

# default is 1
param['subsample'] = 0.75  # a bit of change

# default is 1
param['colsample_bytree'] = 0.75  # slight changes

# default is 1
param['colsample_bylevel'] = 0.5

# was set at 0.5
param['base_score'] = 0.7

# same train/test split for every trial so the scores are comparable
seed = 50
test_size = 0.50

# go through every parameter and every candidate value
for p in param_list:
    for r in random_num_list:
        # only the swept parameter is set; everything else keeps its default
        params = {p: r}
        print('%s = %s' % (p, r))
        # train the MBTI type indicators individually
        for m in range(len(mbti_type_ind)):
            print("%s ... " % (mbti_type_ind[m]))

            Y = personalities[:, m]

            # split into train and test sets
            X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

            # fit model on the training data
            model = XGBClassifier(**params)
            model.fit(X_train, Y_train)

            # make predictions for the test data
            Y_prediction = model.predict(X_test)
            predictions = [round(value) for value in Y_prediction]

            # check the accuracy
            accuracy = accuracy_score(Y_test, predictions)
            print(" - %s Accuracy: %.2f%%" % (mbti_type_ind[m], accuracy * 100.0))



learning_rate = 0.1
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%
 
learning_rate = 0.2
 - IE: 77.32%
 - NS: 86.33%
 - FT: 73.91%
 - JP: 65.95%

learning_rate = 0.3
 - IE: 77.20%
 - NS: 85.82%
 - FT: 74.14%
 - JP: 63.42%

learning_rate = 0.4
 - IE: 76.37%
 - NS: 85.27%
 - FT: 73.61%
 - JP: 63.21%

learning_rate = 0.5
 - IE: 75.63% 
 - NS: 84.44%
 - FT: 72.50%
 - JP: 62.03%

learning_rate = 0.6 
 - IE: 74.25%
 - NS: 84.97%
 - FT: 71.12% 
 - JP: 62.13%

learning_rate = 0.7
 - IE: 73.81%
 - NS: 84.07% 
 - FT: 71.12%
 - JP: 61.66%

learning_rate = 0.8
 - IE: 72.75%
 - NS: 83.82%
 - FT: 71.23%
 - JP: 60.33%

learning_rate = 0.9
 - IE: 72.54%
 - NS: 83.47%
 - FT: 69.64% 
 - JP: 60.00%

learning_rate = 1
 - IE: 72.87%
 - NS: 83.06%
 - FT: 69.82%
 - JP: 60.63%


subsample = 0.1
 - IE: 65.95%
 - NS: 76.88%
 - FT: 61.64%
 - JP: 56.62%

subsample = 0.2
 - IE: 66.57%
 - NS: 75.33%
 - FT: 62.68%
 - JP: 57.77%

subsample = 0.3
 - IE: 68.12%
 - NS: 75.15%
 - FT: 62.84%
 - JP: 56.87%

subsample = 0.4
 - IE: 66.76%
 - NS: 75.56%
 - FT: 64.41%
 - JP: 57.38%

subsample = 0.5
 - IE: 66.87%
 - NS: 77.73%
 - FT: 66.18% 
 - JP: 56.64%

subsample = 0.6
 - IE: 70.29%
 - NS: 79.85%
 - FT: 67.73%
 - JP: 59.43%

subsample = 0.7 
 - IE: 70.59%
 - NS: 80.71%
 - FT: 69.64%
 - JP: 57.98%

subsample = 0.8
 - IE: 71.42%
 - NS: 81.49%
 - FT: 69.89%
 - JP: 59.89%

subsample = 0.9
 - IE: 71.97%
 - NS: 82.09%
 - FT: 70.31%
 - JP: 60.88%

subsample = 1
 - IE: 72.87%
 - NS: 83.06%
 - FT: 69.82%
 - JP: 60.63%


colsample_bytree = 0.1
 - IE: 72.18%
 - NS: 82.62%
 - FT: 69.41%
 - JP: 59.29%

colsample_bytree = 0.2
 - IE: 71.25%
 - NS: 82.39%
 - FT: 69.39% 
 - JP: 60.24%

colsample_bytree = 0.3
 - IE: 72.52%
 - NS: 82.32%
 - FT: 71.14%
 - JP: 61.30%

colsample_bytree = 0.4
 - IE: 72.13%
 - NS: 82.90%
 - FT: 69.69% 
 - JP: 61.46%

colsample_bytree = 0.5
 - IE: 72.20%
 - NS: 82.80%
 - FT: 69.94%
 - JP: 59.20%

colsample_bytree = 0.6
 - IE: 72.36%
 - NS: 82.76%
 - FT: 69.80%
 - JP: 60.35%

colsample_bytree = 0.7
 - IE: 72.45%
 - NS: 82.80%
 - FT: 70.75%
 - JP: 59.75%

colsample_bytree = 0.8
 - IE: 72.73%
 - NS: 83.31%
 - FT: 70.31%
 - JP: 59.94%

colsample_bytree = 0.9
 - IE: 72.68%
 - NS: 82.78%
 - FT: 70.42%
 - JP: 59.89%

colsample_bytree = 1
 - IE: 72.87%
 - NS: 83.06%
 - FT: 69.82%
 - JP: 60.63%


colsample_bylevel = 0.1
 - IE: 72.41%
 - NS: 82.07%
 - FT: 70.59%
 - JP: 59.06%

colsample_bylevel = 0.2
 - IE: 72.57%
 - NS: 82.64%
 - FT: 70.31%
 - JP: 59.27%

colsample_bylevel = 0.3
 - IE: 72.36%
 - NS: 82.00%
 - FT: 70.26%
 - JP: 59.94%

colsample_bylevel = 0.4
 - IE: 73.49%
 - NS: 82.73%
 - FT: 70.12%
 - JP: 60.42%

colsample_bylevel = 0.5
 - IE: 72.11%
 - NS: 83.03%
 - FT: 70.17%
 - JP: 59.75%

colsample_bylevel = 0.6
 - IE: 72.04%
 - NS: 82.25%
 - FT: 69.29%
 - JP: 59.73%

colsample_bylevel = 0.7
 - IE: 72.87% 
 - NS: 83.47%
 - FT: 70.95%
 - JP: 61.96%

colsample_bylevel = 0.8
 - IE: 72.50%
 - NS: 83.15%
 - FT: 69.69%
 - JP: 60.65%

colsample_bylevel = 0.9
 - IE: 72.84%
 - NS: 83.45%
 - FT: 71.09%
 - JP: 60.26%

colsample_bylevel = 1
 - IE: 72.87%
 - NS: 83.06%
 - FT: 69.82%
 - JP: 60.63%


base_score = 0.1 
 - IE: 73.35%
 - NS: 82.39%
 - FT: 65.35%
 - JP: 39.70%

base_score = 0.2
 - IE: 72.48%
 - NS: 83.36%
 - FT: 69.71%
 - JP: 60.00%

base_score = 0.3 
 - IE: 72.04%
 - NS: 83.31%
 - FT: 70.45%
 - JP: 60.67%

base_score = 0.4
 - IE: 72.61%
 - NS: 82.96%
 - FT: 70.45%
 - JP: 60.74%

base_score = 0.5
 - IE: 72.87%
 - NS: 83.06%
 - FT: 69.82%
 - JP: 60.63%

base_score = 0.6
 - IE: 72.61%
 - NS: 83.47%
 - FT: 69.73%
 - JP: 61.27%

base_score = 0.7
 - IE: 72.71%
 - NS: 82.99%
 - FT: 71.62%
 - JP: 60.10%

base_score = 0.8
 - IE: 72.11% 
 - NS: 82.48%
 - FT: 70.59%
 - JP: 60.51%

base_score = 0.9
 - IE: 29.23% 
 - NS: 76.12%
 - FT: 60.33%
 - JP: 59.87%
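
Reading the best setting off listings like these by eye is error-prone. The sketch below parses text in the same `name = value` / ` - XX: nn.nn%` layout and reports, for each indicator, the value with the highest accuracy. The function names `parse_results` and `best_per_indicator` are hypothetical helpers, not part of the notebook. The sample text is the first three learning_rate blocks copied from above, and the code assumes a single parameter per block of text.

```python
import re

def parse_results(text):
    """Parse 'name = value' headers followed by ' - XX: 12.34%' score lines."""
    results = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.fullmatch(r"(\w+) = (\S+)", line)
        score = re.fullmatch(r"- (\w+): ([\d.]+)%", line)
        if header:
            current = (header.group(1), header.group(2))
            results[current] = {}
        elif score and current is not None:
            results[current][score.group(1)] = float(score.group(2))
    return results

def best_per_indicator(results):
    """Map each indicator to the (value, accuracy) pair with the highest accuracy."""
    best = {}
    for (_, value), scores in results.items():
        for indicator, acc in scores.items():
            if indicator not in best or acc > best[indicator][1]:
                best[indicator] = (value, acc)
    return best

sample = """
learning_rate = 0.1
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

learning_rate = 0.2
 - IE: 77.32%
 - NS: 86.33%
 - FT: 73.91%
 - JP: 65.95%

learning_rate = 0.3
 - IE: 77.20%
 - NS: 85.82%
 - FT: 74.14%
 - JP: 63.42%
"""

best = best_per_indicator(parse_results(sample))
for indicator, (value, acc) in best.items():
    print("%s: best learning_rate = %s (%.2f%%)" % (indicator, value, acc))
```

On this sample, no single learning rate wins on all four indicators, which suggests tuning each indicator's classifier separately.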

In [None]:
# Values tried by hand while experimenting. The sweep below does not read
# this dict; it builds its own `params` for each trial.
param = {}
param_list = ['n_estimators', 'num_feature', 'seed', 'gamma']

random_num_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]

# previous value was 100
param['n_estimators'] = 90  # no real change when set to 100

# set automatically by XGBoost
param['num_feature'] = 25  # no real change

# default is 0
param['seed'] = 21

# default is 0
param['gamma'] = 1

# train/test split settings; this `seed` controls the split only and is
# separate from the XGBoost 'seed' parameter swept below
seed = 50
test_size = 0.50

# go through every parameter and every candidate value
for p in param_list:
    for r in random_num_list:
        # only the swept parameter is set; everything else keeps its default
        params = {p: r}
        print('%s = %s' % (p, r))
        # train the MBTI type indicators individually
        for m in range(len(mbti_type_ind)):
            print("%s ... " % (mbti_type_ind[m]))

            Y = personalities[:, m]

            # split into train and test sets
            X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

            # fit model on the training data
            model = XGBClassifier(**params)
            model.fit(X_train, Y_train)

            # make predictions for the test data
            Y_prediction = model.predict(X_test)
            predictions = [round(value) for value in Y_prediction]

            # check the accuracy
            accuracy = accuracy_score(Y_test, predictions)
            print(" - %s Accuracy: %.2f%%" % (mbti_type_ind[m], accuracy * 100.0))


n_estimators = 10
 - IE: 77.22%
 - NS: 86.65%
 - FT: 66.62%
 - JP: 64.41%

n_estimators = 20
 - IE: 77.09%
 - NS: 86.63%
 - FT: 69.13%
 - JP: 64.38%

n_estimators = 30
 - IE: 77.13%
 - NS: 86.63%
 - FT: 70.52%
 - JP: 64.64%

n_estimators = 40
 - IE: 77.18%
 - NS: 86.63%
 - FT: 71.21%
 - JP: 64.96%

n_estimators = 50
 - IE: 77.36%
 - NS: 86.63%
 - FT: 71.92%
 - JP: 64.78%

n_estimators = 60
 - IE: 77.48%
 - NS: 86.63%
 - FT: 72.13%
 - JP: 65.19%

n_estimators = 70
 - IE: 77.50%
 - NS: 86.63%
 - FT: 72.80%
 - JP: 65.65%

n_estimators = 80
 - IE: 77.32%
 - NS: 86.63%
 - FT: 73.03%
 - JP: 65.21%

n_estimators = 90 
 - IE: 77.57%
 - NS: 86.63%
 - FT: 73.08%
 - JP: 65.31%

n_estimators = 100
 - IE: 77.41%
 - NS: 86.58%
 - FT: 73.28%
 - JP: 65.35%

n_estimators = 110
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%


num_feature = 10 
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 20
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 30
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 40
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 50
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 60
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 70
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 80
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 90
 - IE: 77.39% 
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 100
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

num_feature = 110
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%


seed = 10
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 20
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 30
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 40
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 50
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 60
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 70
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 80
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 90
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 100
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%

seed = 110
 - IE: 77.39%
 - NS: 86.58%
 - FT: 73.77%
 - JP: 65.03%




#### Gamma
gamma = 10
 - IE: 77.62%
 - NS: 86.63%
 - FT: 73.91%
 - JP: 64.96%

gamma = 20
 - IE: 77.09%
 - NS: 86.63%
 - FT: 71.23%
 - JP: 64.43%

gamma = 30
 - IE: 77.06%
 - NS: 86.63%
 - FT: 69.96%
 - JP: 64.27%

gamma = 40
 - IE: 77.06%
 - NS: 86.63%
 - FT: 68.37% 
 - JP: 61.27%

gamma = 50
 - IE: 77.06%
 - NS: 86.63%
 - FT: 66.74%
 - JP: 60.35%

gamma = 60
 - IE: 77.06%
 - NS: 86.63%
 - FT: 66.25%
 - JP: 60.35%

gamma = 70
 - IE: 77.06%
 - NS: 86.63%
 - FT: 65.12%
 - JP: 60.35%

gamma = 80
 - IE: 77.06%
 - NS: 86.63%
 - FT: 64.98%
 - JP: 60.35%

gamma = 90
 - IE: 77.06%
 - NS: 86.63%
 - FT: 64.59% 
 - JP: 60.35%

gamma = 100
 - IE: 77.06%
 - NS: 86.63%
 - FT: 62.93%
 - JP: 60.35%

gamma = 110
 - IE: 77.06%
 - NS: 86.63%
 - FT: 61.60%
 - JP: 60.35%

#### Booster

In [21]:
# Value tried by hand while experimenting; the sweep below does not read it.
param = {}

param_list = ['booster']
random_num_list = ['gbtree', 'gblinear', 'dart']

# default is gbtree
param['booster'] = "gblinear"  # dart matches gbtree in the sweep; gblinear scores lower on FT and JP

# same train/test split for every trial so the scores are comparable
seed = 50
test_size = 0.50

# go through every parameter and every candidate value
for p in param_list:
    for r in random_num_list:
        # only the swept parameter is set; everything else keeps its default
        params = {p: r}
        print('%s = %s' % (p, r))
        # train the MBTI type indicators individually
        for m in range(len(mbti_type_ind)):
            print("%s ... " % (mbti_type_ind[m]))

            Y = personalities[:, m]

            # split into train and test sets
            X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

            # fit model on the training data
            model = XGBClassifier(**params)
            model.fit(X_train, Y_train)

            # make predictions for the test data
            Y_prediction = model.predict(X_test)
            predictions = [round(value) for value in Y_prediction]

            # check the accuracy
            accuracy = accuracy_score(Y_test, predictions)
            print(" - %s Accuracy: %.2f%%" % (mbti_type_ind[m], accuracy * 100.0))



booster = gbtree
 - IE: Introversion (I) - Extroversion (E) Accuracy: 77.41%
 - NS: Intuition (N) - Sensing (S) Accuracy: 86.58%
 - FT: Feeling(F) - Thinking (T) Accuracy: 73.28%
 - JP: Judging (J) - Percieving (P) Accuracy: 65.35%

booster = gblinear
 - IE: Introversion (I) - Extroversion (E) Accuracy: 77.06%
 - NS: Intuition (N) - Sensing (S) Accuracy: 86.63%
 - FT: Feeling(F) - Thinking (T) Accuracy: 53.41%
 - JP: Judging (J) - Percieving (P) Accuracy: 60.35%

booster = dart
 - IE: Introversion (I) - Extroversion (E) Accuracy: 77.41%
 - NS: Intuition (N) - Sensing (S) Accuracy: 86.58%
 - FT: Feeling(F) - Thinking (T) Accuracy: 73.28%
 - JP: Judging (J) - Percieving (P) Accuracy: 65.35%

#### Objective

In [23]:
# Value tried by hand while experimenting; the sweep below does not read it.
param = {}

param_list = ['objective']
# 'aft_loss_distribution' is a setting for survival:aft rather than an
# objective, so it is not swept here; the gamma objective is 'reg:gamma'.
random_num_list = ['reg:squarederror', 'reg:squaredlogerror', 'reg:logistic',
                   'reg:pseudohubererror', 'binary:logistic', 'binary:logitraw',
                   'binary:hinge', 'count:poisson', 'survival:cox', 'survival:aft',
                   'multi:softmax', 'multi:softprob',
                   'rank:pairwise', 'rank:ndcg', 'rank:map', 'reg:gamma', 'reg:tweedie']

# was set at "binary:logistic"
param['objective'] = "reg:logistic"

# same train/test split for every trial so the scores are comparable
seed = 50
test_size = 0.50

# go through every parameter and every candidate value
for p in param_list:
    for r in random_num_list:
        # only the swept parameter is set; everything else keeps its default
        params = {p: r}
        print('%s = %s' % (p, r))
        # train the MBTI type indicators individually
        for m in range(len(mbti_type_ind)):
            print("%s ... " % (mbti_type_ind[m]))

            Y = personalities[:, m]

            # split into train and test sets
            X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

            # fit model on the training data
            model = XGBClassifier(**params)
            model.fit(X_train, Y_train)

            # make predictions for the test data
            Y_prediction = model.predict(X_test)
            predictions = [round(value) for value in Y_prediction]

            # check the accuracy
            accuracy = accuracy_score(Y_test, predictions)
            print(" - %s Accuracy: %.2f%%" % (mbti_type_ind[m], accuracy * 100.0))



objective = reg:squarederror
IE: Introversion (I) - Extroversion (E) ... 
[18:03:08] src/objective/objective.cc:21: Objective candidate: reg:linear
[18:03:08] src/objective/objective.cc:21: Objective candidate: reg:logistic
[18:03:08] src/objective/objective.cc:21: Objective candidate: binary:logistic
[18:03:08] src/objective/objective.cc:21: Objective candidate: binary:logitraw
[18:03:08] src/objective/objective.cc:21: Objective candidate: count:poisson
[18:03:08] src/objective/objective.cc:21: Objective candidate: survival:cox
[18:03:08] src/objective/objective.cc:21: Objective candidate: reg:gamma
[18:03:08] src/objective/objective.cc:21: Objective candidate: reg:tweedie
[18:03:08] src/objective/objective.cc:21: Objective candidate: rank:pairwise
[18:03:08] src/objective/objective.cc:21: Objective candidate: rank:ndcg
[18:03:08] src/objective/objective.cc:21: Objective candidate: rank:map
[18:03:08] src/objective/objective.cc:21: Objective candidate: multi:softmax
[18:03:08] src/obj

XGBoostError: b'[18:03:08] src/objective/objective.cc:23: Unknown objective function reg:squarederror\n\nStack trace returned 8 entries:\n[bt] (0) 0   libxgboost.dylib                    0x000000201a35e3bd dmlc::StackTrace() + 301\n[bt] (1) 1   libxgboost.dylib                    0x000000201a35e15f dmlc::LogMessageFatal::~LogMessageFatal() + 47\n[bt] (2) 2   libxgboost.dylib                    0x000000201a35dc19 dmlc::LogMessageFatal::~LogMessageFatal() + 9\n[bt] (3) 3   libxgboost.dylib                    0x000000201a3e3638 xgboost::ObjFunction::Create(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 1000\n[bt] (4) 4   libxgboost.dylib                    0x000000201a37076e xgboost::LearnerImpl::LazyInitModel() + 718\n[bt] (5) 5   libxgboost.dylib                    0x000000201a37c60b XGBoosterUpdateOneIter + 91\n[bt] (6) 6   libffi.6.dylib                      0x0000000104e22884 ffi_call_unix64 + 76\n[bt] (7) 7   ???                                 0x00007ffeeca33440 0x0 + 140732868539456\n\n'
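
The sweep stops at its first value because this XGBoost build does not recognise `reg:squarederror` (the candidates it lists include `reg:linear` instead), so none of the remaining objectives are tried. One way to keep going is to catch the failure for each trial and record it. The sketch below shows the pattern with a stand-in scoring function. `guarded_sweep` and `fake_evaluate` are hypothetical names, and the returned score is a placeholder rather than a real accuracy. In the notebook, the callable would wrap the fit/predict/accuracy steps, and the exception caught would presumably be XGBoost's own error type.

```python
def guarded_sweep(name, values, evaluate):
    """Run evaluate({name: value}) for each value, recording failures instead of stopping."""
    scores, failures = {}, {}
    for value in values:
        try:
            scores[value] = evaluate({name: value})
        except Exception as err:  # assumption: narrow this to the library's error type in practice
            failures[value] = str(err)
    return scores, failures

# Stand-in for the real fit/predict/score step (hypothetical). It accepts
# only a few of the objectives listed as candidates in the log above.
SUPPORTED = {"reg:linear", "reg:logistic", "binary:logistic"}

def fake_evaluate(params):
    if params["objective"] not in SUPPORTED:
        raise ValueError("Unknown objective function %s" % params["objective"])
    return 0.0  # placeholder score, not a measured accuracy

scores, failures = guarded_sweep(
    "objective",
    ["reg:squarederror", "reg:logistic", "binary:logistic"],
    fake_evaluate,
)
print(sorted(scores))
print(failures)
```

With this pattern the unsupported objective ends up in `failures` with its message, while the supported ones are still scored.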