### Latent Dirichlet Allocation
LDA is a probabilistic model that finds groups of words that frequently appear together.
Each group is a topic, so every document can be represented as a mixture of topics.
We build a bag-of-words representation of the text and feed it to LDA.
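
As a rough illustration of the idea, the sketch below runs bag of words and LDA on a tiny made-up corpus. The four sentences and the choice of two topics are assumptions for the example only; they do not come from `train.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, not taken from train.csv.
docs = [
    "the raven croaked at midnight",
    "the monster fled across the ice",
    "midnight raven at the chamber door",
    "the creature and the monster on the ice",
]

# Bag of words: one row per document, one column per vocabulary word.
vect = CountVectorizer()
bow = vect.fit_transform(docs)

# LDA turns each row of word counts into a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=42,
                                learning_method='batch')
doc_topics = lda.fit_transform(bow)

print(bow.shape)               # (n_documents, n_vocabulary_words)
print(doc_topics.shape)        # (n_documents, n_topics)
print(doc_topics.sum(axis=1))  # each row sums to approximately 1
```

The topic mixture in `doc_topics` is what the classifier later in this notebook receives as features.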

In [11]:
import pandas as pd
df_train = pd.read_csv('train.csv', encoding='utf-8')
df_test = pd.read_csv('test.csv', encoding='utf-8')

### Preprocess the text
Save some time by cleaning the text up front:
replace punctuation with spaces and convert everything to lowercase.

In [12]:
import re
def preprocessor(text):
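    # Raw string avoids the invalid-escape warning for '\W';
    # each run of non-word characters becomes a single space.
    text = re.sub(r'\W+', ' ', text.lower())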
    return text
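
To see what the preprocessor does, here is a small check on a made-up sentence (the sentence is an assumption, not a row from the data). The function is repeated so the snippet runs on its own.

```python
import re

def preprocessor(text):
    # Lowercase, then collapse every run of non-word characters
    # (punctuation, whitespace) into a single space.
    return re.sub(r'\W+', ' ', text.lower())

sample = "It was a dark, STORMY night..."  # hypothetical sentence
cleaned = preprocessor(sample)
print(repr(cleaned))    # 'it was a dark stormy night '
print(cleaned.split())  # ['it', 'was', 'a', 'dark', 'stormy', 'night']
```

The trailing space is harmless here because the tokenizers below call `text.split()`, which discards it.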

In [13]:
df_train['text'] = df_train['text'].apply(preprocessor)

### Encode class labels

In [14]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df_train['author'] = class_le.fit_transform(df_train['author'].values)
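
A minimal sketch of what `LabelEncoder` does, using placeholder author labels (the label strings are hypothetical). Classes are sorted alphabetically and mapped to integers starting at 0, and `inverse_transform` recovers the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the 'author' column.
authors = np.array(['author_b', 'author_a', 'author_c', 'author_a'])

le = LabelEncoder()
encoded = le.fit_transform(authors)

print(le.classes_)                    # ['author_a' 'author_b' 'author_c']
print(encoded)                        # [1 0 2 0]
print(le.inverse_transform(encoded))  # back to the original strings
```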

### Split data for training and testing

In [15]:
from sklearn.model_selection import train_test_split
X = df_train['text'].values
y = df_train['author'].values
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.3,
                                                   random_state=42,
                                                   stratify=y)
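
The effect of `stratify=y` can be checked on synthetic labels. The 60/30/10 class counts below are an assumption chosen so the 70/30 split divides evenly; they are not the real class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2.
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3,
                                          random_state=42,
                                          stratify=y_demo)

# Both parts keep the 60/30/10 proportions of the full label array.
print(np.bincount(y_tr))  # [42 21  7]
print(np.bincount(y_te))  # [18  9  3]
```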

### Download the necessary packages for tokenization

In [16]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords as sw
stopwords = sw.words('english')

[nltk_data] Downloading package stopwords to /home/luce/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/luce/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Tokenization helper functions

In [17]:
def tokenizer(text):
    return text.split()

In [18]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [None]:
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
def tokenizer_lemma(text):
    return [lemma.lemmatize(word) for word in text.split()]
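
To get a feel for how the tokenizers differ, the sketch below compares plain whitespace splitting with Porter stemming on a made-up sentence. The lemmatizer is left out because it needs the downloaded WordNet data.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
text = 'runners like running and thus they run'  # hypothetical sentence

plain = text.split()
stemmed = [porter.stem(word) for word in text.split()]

print(plain)    # ['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
print(stemmed)  # ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
```

Stemming merges `running` and `run` into one vocabulary entry, at the cost of producing non-words such as `thu`.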

### Start the training
The pipeline chains bag of words, LDA, and logistic regression.
RandomizedSearchCV is used instead of an exhaustive grid search because I don't own an amazing rig :(
Since Kaggle scores submissions on log loss, the search is scored on log loss as well.
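
For reference, log loss is the mean of the negative log of the probability assigned to the true class. The sketch below computes it by hand and with `sklearn.metrics.log_loss` on made-up probabilities (the numbers are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true classes and predicted class probabilities.
y_true = np.array([0, 1, 2])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

# Pick the probability given to the true class of each row, then average -log.
true_class_proba = proba[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(true_class_proba))

sk = log_loss(y_true, proba, labels=[0, 1, 2])

print(manual)  # approximately 0.4987
print(sk)      # same value
```

scikit-learn's `'neg_log_loss'` scorer returns the negative of this number so that higher is better, which means `best_score_` will be negative.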

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

count = CountVectorizer(strip_accents=None,
                       lowercase=False,
                       preprocessor=None)

lda = LatentDirichletAllocation(random_state=42,
                              learning_method='batch')

param_grid = {'vect__ngram_range': [(1,1)],
              'vect__stop_words': [stopwords],
              'vect__tokenizer': [tokenizer,
                                  tokenizer_porter,
                                  tokenizer_lemma],
              'vect__max_df': [0.05, 0.1, 0.15],
              'vect__max_features': [1000, 5000],
              'lda__n_components': [10, 20, 30],
              'clf__penalty': ['l1', 'l2'],
              'clf__C': [0.1, 1.0, 10.0, 100.0]}

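# The default lbfgs solver does not support the l1 penalty in the grid above;
# saga handles both l1 and l2.
pipe = Pipeline([
                 ('vect', count),
                 ('lda', lda),
                 ('clf', LogisticRegression(random_state=42, solver='saga'))
                ])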

model = RandomizedSearchCV(pipe, param_grid, scoring='neg_log_loss', cv=10, n_iter=50, verbose=1, n_jobs=-1)
model.fit(X_train, y_train)

Fitting 10 folds for each of 50 candidates, totalling 500 fits
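
To put the "50 candidates" in context, this sketch counts how many combinations the full grid contains. Placeholder strings stand in for the stop-word list and tokenizer functions, and `random_state=42` on the sampler is an assumption (the search above does not set one).

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same shape as param_grid above, with placeholder strings
# in place of the stop-word list and the tokenizer functions.
grid = {'vect__ngram_range': [(1, 1)],
        'vect__stop_words': ['stopword_list'],
        'vect__tokenizer': ['plain', 'porter', 'lemma'],
        'vect__max_df': [0.05, 0.1, 0.15],
        'vect__max_features': [1000, 5000],
        'lda__n_components': [10, 20, 30],
        'clf__penalty': ['l1', 'l2'],
        'clf__C': [0.1, 1.0, 10.0, 100.0]}

n_total = len(ParameterGrid(grid))
sampled = list(ParameterSampler(grid, n_iter=50, random_state=42))

print(n_total)                          # 432 combinations in the full grid
print(len(sampled))                     # 50 sampled candidates
print(n_total * 10, len(sampled) * 10)  # fits with cv=10: 4320 vs 500
```

With 10-fold cross-validation, the randomized search runs 500 fits instead of the 4320 an exhaustive grid search would need.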


In [None]:
print(model.best_estimator_)

In [None]:
print(model.best_params_)

In [None]:
print(model.best_score_)

In [None]:
model.best_estimator_.steps[1][1].components_.shape

In [None]:
model.best_estimator_.steps[0][1]

In [None]:
n_top_words = 10
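# The fitted search object is `model`; `rs_lr_bag` is not defined in this notebook.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0+.
feature_names = model.best_estimator_.steps[0][1].get_feature_names_out()
for topic_idx, topic in enumerate(model.best_estimator_.steps[1][1].components_):
    print('Topic %d:' % (topic_idx + 1))
    print(' '.join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))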
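
The slice `[:-n_top_words - 1:-1]` can look cryptic, so here is a small sketch of it on a made-up topic row. The weights and words are hypothetical.

```python
import numpy as np

# Hypothetical word weights for one topic and the matching vocabulary.
weights = np.array([0.1, 2.5, 0.7, 1.9, 0.3])
names = ['ice', 'raven', 'door', 'night', 'sea']
n_top = 3

# argsort() gives indices from smallest to largest weight;
# the reversed slice walks back from the end and keeps n_top of them.
top_idx = weights.argsort()[:-n_top - 1:-1]

print(top_idx)                      # [1 3 2]
print([names[i] for i in top_idx])  # ['raven', 'night', 'door']
```

The same result can be obtained with `weights.argsort()[::-1][:n_top]`, which some readers find easier to follow.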