# A bewitched XGBoost for all the mad data scientists out there...

![](https://i.pinimg.com/originals/1f/8b/ea/1f8bea26e9a5c7f39c3cdf4b3f8ba7d2.jpg)

In this notebook, I plan to experiment with some mystical features from EDA & NLP on the Spooky Author dataset, and obtain the magic formula for the most bewitched XGBoost ever seen!

1. **Exploratory Data Analysis (EDA)** - Thorough data analysis and visualizations to better understand the writings of the three different authors: Edgar Allan Poe (EAP), HP Lovecraft (HPL), and Mary Wollstonecraft Shelley (MWS).

2. **Natural Language Processing (NLP)** - Use NLTK (Natural Language Toolkit) for text processing methods such as tokenization, lemmatization, stemming, and bag-of-words, and get to the bottom of all the mysteries!

3. **Mystical feature engineering**

4. **A bewitched XGBoost**

---

In [None]:
import base64
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.tools as tls
from collections import Counter
# scipy.misc.imread was removed in SciPy 1.2 and is not used in this notebook, so it is no longer imported
import xgboost as xgb
import seaborn as sns
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn import ensemble, metrics, model_selection, naive_bayes
from matplotlib import pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)

color = sns.color_palette()

pd.options.mode.chained_assignment = None

Now...let us take a glimpse right into the soul of this spooky dataset!

In [None]:
# Read training data with Pandas
df_train = pd.read_csv("../input/train.csv")
df_test = pd.read_csv("../input/test.csv")
print("Number of rows in train dataset : ",df_train.shape[0])
print("Number of rows in test dataset : ",df_test.shape[0])

# 1. Exploratory Data Analysis (EDA)

Let us understand some basic facts about the three authors: Edgar Allan Poe (EAP), HP Lovecraft (HPL), and Mary Wollstonecraft Shelley (MWS).

In [None]:
df_train.head()

In [None]:
df_train.describe()

In [None]:
df_train['text_len']=df_train["text"].apply(lambda x: len(str(x)))

plt.figure(figsize=(14,8))
sns.violinplot(x="text_len", y="author", data=df_train, scale="width")
plt.ylabel('Author Name', fontsize=14)
plt.xlabel('Text Length', fontsize=14)
plt.show()

The plot above is hard to read because a few very long blocks of text by Mary Shelley, which also contain a lot of stop words, stretch the x-axis. If we filter out these outliers, we can see that most texts in the training data cluster around 100 characters.
In the filtered plot, Edgar Allan Poe's texts are the shortest on average, while HP Lovecraft's are the longest.

In [None]:
plt.figure(figsize=(14,8))
sns.violinplot(x="text_len", y="author", data=df_train[df_train["text_len"] < 400], scale="width")
plt.ylabel('Author Name', fontsize=14)
plt.xlabel('Text Length', fontsize=14)
plt.show()
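
As a numeric companion to the violin plots, here is a minimal sketch of how a handful of very long texts can distort a per-author summary, and how a length cutoff like the `text_len < 400` filter above removes that effect. The rows are made-up stand-ins for `df_train`, and `median_length_by_author` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict
from statistics import median

# Made-up (author, text) rows standing in for df_train; not real data.
rows = [
    ("EAP", "Short line."),
    ("EAP", "A slightly longer line of text."),
    ("HPL", "An even longer line of text, with a clause."),
    ("MWS", "x" * 500),  # one very long outlier block
    ("MWS", "A medium length line."),
]

def median_length_by_author(rows, max_len=None):
    """Median character length per author, optionally keeping only texts shorter than max_len."""
    lengths = defaultdict(list)
    for author, text in rows:
        n = len(text)
        if max_len is None or n < max_len:
            lengths[author].append(n)
    return {author: median(vals) for author, vals in lengths.items()}

print(median_length_by_author(rows))               # MWS is pulled up by the outlier
print(median_length_by_author(rows, max_len=400))  # outlier filtered out
```

With pandas, the same idea would be a filter on `text_len` followed by a `groupby("author")`.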

In [None]:
auth_cnt = df_train['author'].value_counts()

plt.figure(figsize=(14,8))
# Pass x and y as keywords: recent seaborn versions reject positional data arguments
sns.barplot(x=auth_cnt.index, y=auth_cnt.values)
plt.ylabel('Number of Occurrences', fontsize=14)
plt.xlabel('Author Name', fontsize=14)
plt.show()

## Stopword Removal

Stopwords are words that appear so frequently across the corpus that they contribute little to the learning or predictive process: because they occur in almost every text, a model cannot use them to tell one text from another. Stopwords include terms such as "to" or "the", so it is to our benefit to remove them during the pre-processing phase. Conveniently, NLTK comes with a predefined list of English stopwords (153 words in the version used here).

To filter out stop words from our tokenized list of words, we can simply use a list comprehension as follows:

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Needed to get rid of punctuation
tokenizer = RegexpTokenizer(r'\w+')
# Searching a set is much faster than searching a list
eng_stopwords = set(stopwords.words("english") + 
                ['one','us','yet','could','would','need','even','might','like',
                 'must','every','never','go','thus','may','much','however'])

def cleanup_spooky_text(spooky_text):
    # 1. Convert to lower case, and tokenize (split) into individual words
    spooky_words = tokenizer.tokenize(spooky_text.lower())
    # 2. Remove stop words
    meaningful_spooky_words = [w for w in spooky_words if w not in eng_stopwords]
    # 3. Join the words back into one string separated by space
    return " ".join(meaningful_spooky_words)

print(eng_stopwords)
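
To see the three steps of the cleanup on a single sentence, here is a small self-contained sketch that runs without NLTK. It assumes a tiny hand-picked stop list and uses `re.findall(r"\w+", ...)` as a stand-in for the `RegexpTokenizer`; the sentence and the `remove_stopwords` helper are made up for illustration.

```python
import re

# Tiny hand-picked stop list for illustration only; NLTK's list is much longer.
toy_stopwords = {"it", "was", "a", "and", "the", "would", "not"}

def remove_stopwords(text, stopwords):
    # Lower-case, keep runs of word characters (this drops punctuation),
    # filter out stop words, then join the rest back into one string.
    words = re.findall(r"\w+", text.lower())
    return " ".join(w for w in words if w not in stopwords)

cleaned = remove_stopwords("It was a dark night, and the raven would not leave.", toy_stopwords)
print(cleaned)  # dark night raven leave
```

Storing the stop words in a set keeps each membership check fast, which matters once this runs over every word of every row.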

In [None]:
df_train["text"].head()

In [None]:
df_train["clean_text"] = df_train["text"].apply(lambda x: cleanup_spooky_text(x))

In [None]:
df_train["clean_text"].head()

We can now redraw the plot above to show the text length distribution for each author. With stop words removed, most texts in the training data now cluster around 60 characters in length:

In [None]:
df_train['clean_text_len']=df_train["clean_text"].apply(lambda x: len(str(x)))

plt.figure(figsize=(14,8))
sns.violinplot(x="clean_text_len", y="author", data=df_train[df_train["clean_text_len"] < 300], scale="width")
plt.ylabel('Author Name', fontsize=14)
plt.xlabel('Text Length', fontsize=14)
plt.show()

## Summary statistics of the training set

Here we can visualize some basic statistics of the data, such as the most frequent words in the cleaned training text. For this purpose, I will use the handy Plot.ly visualisation library and draw a simple bar plot. Unhide the cell below if you want to see the Plot.ly code.

In [None]:
all_words = df_train['clean_text'].str.split(expand=True).unstack().value_counts()

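# Note: the [2:40] slice skips the two most frequent words and shows ranks 3 to 40
data = [go.Bar(
            x = all_words.index.values[2:40],
            y = all_words.values[2:40],
            # The colour values use the same slice as x and y so the lengths match
            marker= dict(colorscale='Viridis',color = all_words.values[2:40]),
            text='Word counts'
    )]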

layout = go.Layout(title='Top 40 Word frequencies in the cleansed training dataset')
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Not surprisingly, most of the words in this word frequency plot don't tell us much about the main themes and characters that the authors present to the reader.
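
The pandas chain `str.split(expand=True).unstack().value_counts()` used for the plot simply counts how often each word occurs across all rows. A minimal equivalent with `collections.Counter` is sketched here on made-up cleaned texts (not the real data):

```python
from collections import Counter

# Made-up cleaned texts standing in for df_train['clean_text'].
clean_texts = ["dark night raven", "night dream strange night", "raven dream"]

# Split every text on whitespace and count each word over the whole corpus.
word_counts = Counter(word for text in clean_texts for word in text.split())

top_words = word_counts.most_common(3)
print(top_words)
```

`most_common(n)` returns `(word, count)` pairs sorted by descending count, which is the same ordering `value_counts()` gives.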

## Separate the text of each author

Let's create three arrays to store the texts of Edgar Allan Poe, HP Lovecraft and Mary Shelley:

In [None]:
EAP_text = df_train[df_train.author=="EAP"]["clean_text"].values
HPL_text = df_train[df_train.author=="HPL"]["clean_text"].values
MWS_text = df_train[df_train.author=="MWS"]["clean_text"].values

## WordClouds to visualise the preferred spooky words of each author



In [None]:
from wordcloud import WordCloud

In [None]:
plt.figure(figsize=(20,20))
plt.subplot(211)
wc = WordCloud(background_color="black", max_words=100, 
               stopwords=eng_stopwords, max_font_size= 40)
wc.generate(" ".join(EAP_text))
plt.title("Edgar Allan Poe\n", fontsize=30)
plt.imshow(wc.recolor(colormap= 'viridis', random_state=17))
plt.axis('off')

Among Edgar Allan Poe's preferred words are: little, night, eye, day, thing, thought, matter.

In [None]:
plt.figure(figsize=(20,20))
wc = WordCloud(background_color="black", max_words=100, 
               stopwords=eng_stopwords, max_font_size= 40)
wc.generate(" ".join(HPL_text))
plt.title("HP Lovecraft\n", fontsize=30)
plt.imshow(wc.recolor(colormap= 'viridis', random_state=17))
plt.axis('off')

Similar to Poe, HP Lovecraft favours words such as: night, strange, street, thing, dream, time. These go hand in hand with the idea of the “unnameable”, the “unspeakable”, and the “indescribable”, as well as the ancient cults associated with the mystic Cthulhu creature.

In [None]:
plt.figure(figsize=(20,20))
wc = WordCloud(background_color="black", max_words=100, 
               stopwords=eng_stopwords, max_font_size= 40)
wc.generate(" ".join(MWS_text))
plt.title("Mary Shelley\n", fontsize= 30)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17))
plt.axis('off')

On the other hand, we can see that Mary Shelley prefers more optimistic words: life, hope, friend, love, thought, heart. Also, the character Raymond appears quite a bit in the texts.

Let's plot the distribution of the number of words per text for each author. Given that we're using the texts without stop words, you can see that the majority cluster around 10 words:

In [None]:
## Number of words in the text ##
df_train["num_words"] = df_train["clean_text"].apply(lambda x: len(str(x).split()))

plt.figure(figsize=(14,8))
sns.violinplot(x='num_words', y='author', data=df_train[df_train["num_words"] <= 30])
plt.xlabel('Number of words in text', fontsize=14)
plt.ylabel('Author Name', fontsize=14)
plt.title('Number of words by author', fontsize=14)
plt.show()

In [None]:
df_train["num_puncts"] = df_train['text'].apply(
    lambda x: len([c for c in str(x) if c in string.punctuation]))
plt.figure(figsize=(14,8))
sns.violinplot(x='num_puncts', y='author', data=df_train[df_train['num_puncts'] <= 10])
# x carries the punctuation count and y the author, so label the axes accordingly
plt.xlabel('Number of punctuation marks in text', fontsize=14)
plt.ylabel('Author Name', fontsize=14)
plt.title('Number of punctuation marks by author', fontsize=14)
plt.show()

# 2. Natural Language Processing

**Natural Language Toolkit (NLTK)** is a comprehensive Python library for natural language processing and text analytics. More information can be found at http://www.nltk.org/, and demos of select NLTK functionality and production-ready APIs are available at http://text-processing.com.

The main NLP & text pre-processing steps are as follows:

1. **Tokenization** - is the process of splitting a string into a list of pieces or tokens. A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph.

2. **Stopwords** - are common words that generally do not contribute to the meaning of a sentence or text. Please have another look at the *cleanup_spooky_text* function and the *eng_stopwords* set used in the first section.

3. **Stemming** - is a technique to remove affixes from a word, ending up with the stem. For example, the stem of *thinking* is *think*, and a good stemming algorithm knows that the *ing* suffix can be removed.

4. **Lemmatization** - is very similar to stemming, but is more akin to synonym replacement.

5. **Bag-of-words feature extraction** - Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier.





## Tokenization

I've already used the NLTK *RegexpTokenizer* in the first section when removing stopwords. Below are two examples, one with a simple word tokenizer and the other with the RegexpTokenizer:

In [None]:
from nltk.tokenize import word_tokenize, RegexpTokenizer

sample_text = df_train.text[0]

print('Word Tokenizer output: ' + str(word_tokenize(sample_text)) + "\n")

tokenizer = RegexpTokenizer(r'\w+') # Keep only words by removing punctuation
print('RegexpTokenizer output: ' + str(tokenizer.tokenize(sample_text)))
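
To make the effect of the token pattern more concrete, here is a short sketch that compares three ways of splitting the same sentence using only the standard library. The regular expressions are stand-ins for the NLTK tokenizers (the `\w+` pattern is the one passed to `RegexpTokenizer` above); the sample sentence is made up.

```python
import re

sample = "Hark! The raven's cry, once more."

# 1. Plain whitespace split: punctuation stays glued to the words.
whitespace_tokens = sample.split()

# 2. Runs of word characters only: punctuation is dropped,
#    and the apostrophe splits "raven's" into "raven" and "s".
word_tokens = re.findall(r"\w+", sample)

# 3. Words, plus each punctuation mark as its own token.
word_and_punct_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(word_tokens)
print(word_and_punct_tokens)
```

Which variant is best depends on the features you want: dropping punctuation is handy for word counts, while keeping it preserves stylistic signals such as an author's fondness for commas or semicolons.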

## Stemming and Lemmatization

One of the most common stemming algorithms is the Porter stemming algorithm by Martin Porter. It is designed to remove and replace well-known suffixes of English words.
Below I have created a Porter stemmer instance:

In [None]:
stemmer = nltk.stem.PorterStemmer()
print("The stemmed form of thinking is: {}".format(stemmer.stem("thinking")))

A lemma is a root word, as opposed to the root stem. So unlike stemming, you are always left with a valid word that means the same thing. However, the word you end up with can be completely different.

In [None]:
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()
print("The lemmatized form of believes is: {}".format(lemm.lemmatize("believes")))
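
The difference between a stem and a lemma can be sketched without NLTK. The code here is a toy, not the Porter algorithm and not the WordNet lemmatizer: the suffix list, the rewrite rules and the three-word vocabulary are all assumptions chosen for illustration. It only shows the general idea that a stemmer chops suffixes blindly, while a lemmatizer accepts a rewrite only if the result is a known dictionary word.

```python
def toy_stem(word, suffixes=("ing", "es", "s")):
    """Toy suffix stripper (NOT Porter): remove the first matching suffix."""
    for suffix in suffixes:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

TOY_NOUNS = {"belief", "night", "wolf"}   # hypothetical mini-dictionary
TOY_RULES = [("ves", "f"), ("s", "")]     # hypothetical suffix rewrites

def toy_lemmatize(word, vocabulary=TOY_NOUNS, rules=TOY_RULES):
    """Toy lemmatizer: return the first rewritten form found in the vocabulary."""
    if word in vocabulary:
        return word
    for old, new in rules:
        if word.endswith(old):
            candidate = word[: -len(old)] + new
            if candidate in vocabulary:
                return candidate
    return word  # unknown words are returned unchanged

for w in ["thinking", "believes", "nights"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

Note how the toy stem of *believes* is not a real word, while the toy lemma is a valid word that looks quite different from the input.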

## Bag-of-words feature extraction

The main goal here is to transform lists of words into feature sets. The bag-of-words model is the simplest method: it builds a feature set from all the words of an instance and ignores word order. In its binary form only the presence of a word matters, not how many times it occurs; note that scikit-learn's `CountVectorizer`, used below, stores occurrence counts by default and needs `binary=True` to record presence only.

In [None]:
sentence = ["This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall."]
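vectorizer = CountVectorizer()  # default min_df=1; recent scikit-learn versions may reject min_df=0
sentence_transform = vectorizer.fit_transform(sentence)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("The features are:\n {}".format(vectorizer.get_feature_names_out()))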
print("\nThe vectorized array looks like:\n {}".format(sentence_transform.toarray()))
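
To show what happens under the hood, here is a hand-rolled sketch of the bag-of-words idea, with both the count form and the presence-only form. It is a simplified illustration, not scikit-learn's implementation; the `bag_of_words` helper and the two tiny documents are made up.

```python
import re
from collections import Counter

def bag_of_words(docs, binary=False):
    """Return (vocabulary, rows): one row per document, one column per word."""
    tokenized = [re.findall(r"\w+", d.lower()) for d in docs]
    # The vocabulary is the sorted set of all words seen in any document.
    vocab = sorted({w for toks in tokenized for w in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # binary=True records presence (0/1); otherwise store the raw count.
        rows.append([min(counts[w], 1) if binary else counts[w] for w in vocab])
    return vocab, rows

docs = ["the wall the dungeon", "the fact"]
vocab, counts = bag_of_words(docs)
_, presence = bag_of_words(docs, binary=True)

print(vocab)
print(counts)
print(presence)
```

One difference worth knowing: with its default token pattern, `CountVectorizer` ignores one-letter tokens such as "I", while this sketch keeps them.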

# 3. Mystical Feature Engineering

Kagglers widely agree that feature engineering is key to success in applied machine learning. There are two main types of features:

- **Text-based features** - in our case, directly extracted from the available text: word frequency, specific sentiments, word2vec, TF-IDF etc.
- **Meta features** - numerical statistics extracted from the available text: number of characters, number of words, number of stop words, number of punctuation marks etc.
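
As an illustration of one text-based feature from the list, here is a minimal TF-IDF sketch. It assumes one common textbook variant, term frequency `count / document length` times `log(N / document frequency)`; scikit-learn's `TfidfVectorizer` uses a smoothed and normalised variant by default, so its numbers will differ. The `tf_idf` helper and the documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document (unsmoothed textbook TF-IDF)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        scores.append({
            w: (c / len(toks)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return scores

docs = ["night dream night", "night hope", "hope love"]
scores = tf_idf(docs)
print(scores[0])
```

In the first document *night* occurs twice and *dream* once, yet *dream* gets the higher score because it appears in only one of the three documents. That is exactly why TF-IDF is attractive here: words that every author uses get down-weighted.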



In [None]:
df_test["clean_text"] = df_test["text"].apply(lambda x: cleanup_spooky_text(x))

## TEXT-BASED FEATURES - TODO
## Based on the links from this topic: https://www.kaggle.com/c/spooky-author-identification/discussion/42925

## META-FEATURES
## Number of characters in the text
df_train["nb_chars"] = df_train["text"].apply(lambda x: len(str(x)))
df_test["nb_chars"] = df_test["text"].apply(lambda x: len(str(x)))

## Number of words in the text
df_train["nb_words"] = df_train["text"].apply(lambda x: len(str(x).split()))
df_test["nb_words"] = df_test["text"].apply(lambda x: len(str(x).split()))

## Number of relevant words in the text (stop words removed)
df_train["nb_rel_words"] = df_train["clean_text"].apply(lambda x: len(str(x).split()))
df_test["nb_rel_words"] = df_test["clean_text"].apply(lambda x: len(str(x).split()))

## Number of unique words in the text (stop words removed)
df_train["nb_uniq_words"] = df_train["clean_text"].apply(lambda x: len(set(str(x).split())))
df_test["nb_uniq_words"] = df_test["clean_text"].apply(lambda x: len(set(str(x).split())))

## Number of stopwords in the text
df_train["nb_stopwords"] = df_train["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
df_test["nb_stopwords"] = df_test["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

## Number of punctuations in the text
df_train["nb_punct"] =df_train['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
df_test["nb_punct"] =df_test['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
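
To check what each meta-feature measures, here is a sketch that computes all six for a single string with the standard library. The `meta_features` helper, the sample sentences and the stop lists are made up; the logic mirrors the pandas lambdas above, with `re.findall(r"\w+", ...)` standing in for the cleanup tokenizer.

```python
import re
import string

def meta_features(text, stopwords):
    """Compute the notebook's meta-features for one text (illustrative sketch)."""
    words = re.findall(r"\w+", text.lower())           # punctuation removed
    relevant = [w for w in words if w not in stopwords]
    return {
        "nb_chars": len(text),
        "nb_words": len(text.split()),
        "nb_rel_words": len(relevant),
        "nb_uniq_words": len(set(relevant)),
        # Like the notebook, this counts stop words after a plain whitespace split.
        "nb_stopwords": len([w for w in text.lower().split() if w in stopwords]),
        "nb_punct": len([c for c in text if c in string.punctuation]),
    }

feats = meta_features("The night, the night!", {"the"})
print(feats)

# A stop word with punctuation attached ("the,") is not counted by the split-based rule.
missed = meta_features("Not the, but a", {"not", "the", "but", "a"})
print(missed["nb_stopwords"])
```

The second call shows a subtlety of the whitespace split: *the,* does not match the stop word *the*, so `nb_stopwords` slightly undercounts. Tokenizing before counting would avoid this, at the cost of changing the feature's values.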

## Data pre-processing

In [None]:
## Prepare the data for modeling
## TODO

# 4. A bewitched XGBoost

In [None]:
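## TODO
def bewitched_XGB(train_X, train_y, test_X, test_y):
    
    xgb_params = {
    'seed': 20171106,
    'colsample_bytree': 0.8,
    'verbosity': 0,  # replaces the deprecated 'silent': 1
    'subsample': .85,
    'eta': 0.04,
    'objective': 'multi:softprob',
    'num_parallel_tree': 7,
    'max_depth': 5,
    'min_child_weight': 10,
    'nthread': 22,
    'num_class': 3,
    'eval_metric': 'mlogloss',
    }
    
    num_rounds = 1000
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    # Assumed completion (the original stopped here): train, then predict class probabilities
    if test_y is not None:
        # Labelled hold-out data: report mlogloss on it while training
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [(xgtrain, 'train'), (xgtest, 'test')]
        model = xgb.train(xgb_params, xgtrain, num_rounds, evals=watchlist, verbose_eval=100)
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(xgb_params, xgtrain, num_rounds)

    # One row per text, one probability column per author class
    return model.predict(xgtest), model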
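
The `mlogloss` value set as `eval_metric` in the parameters above is the multiclass logarithmic loss: the average of minus the log of the probability the model assigns to the true class. Here is a minimal sketch of that formula, assuming the authors are encoded as integers 0, 1 and 2 (the encoding and the `multiclass_log_loss` helper are assumptions, not part of the notebook).

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean of -log(p_true_class); probabilities are clipped to avoid log(0)."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# A model that knows nothing and predicts 1/3 for every author.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 4
baseline = multiclass_log_loss([0, 1, 2, 0], uniform)
print(baseline)  # about 1.0986, i.e. ln(3)

# Confident and correct predictions drive the loss towards 0.
confident = multiclass_log_loss([0, 1], [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
print(confident)
```

The uniform guess gives a handy reference point: with three classes, any model scoring above roughly 1.0986 is doing no better than predicting equal probabilities for every author.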

**I will update this after experimenting with text-based features built from the information in this discussion topic: https://www.kaggle.com/c/spooky-author-identification/discussion/42925**

Please upvote if you found this useful!

Cheers, Andrei