## Load, Train, and Test Baseline MBTI Classification Model

In [38]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import json, os, re, shutil, sys, time
from importlib import reload
import collections, itertools
import unittest
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import pandas as pd
import tensorflow as tf
assert(tf.__version__.startswith("1."))

# Helper libraries
# from w266_common import utils, vocabulary, tf_embed_viz

# Your code
import MBTI16_NN_Baseline; reload(MBTI16_NN_Baseline)

<module 'MBTI16_NN_Baseline' from '/Users/linhtran/Desktop/Berkeley-MIDS/w266/2018-fall-main/assignment/w266_project/MBTI16_NN_Baseline.py'>

## Specifications for Baseline RNN with LSTM for MBTI

This baseline follows [Ma & Liu (2018)](https://web.stanford.edu/class/cs224n/reports/2736946.pdf). The architecture and parameters are defined below. Note that the paper defines its hyperparameters for binary classifiers and does not state values for a 16-class model, so we need to test combinations for the 16-class case.

### Pre-Processing:
* Split each user's combined posts into individual posts
* Convert to lowercase (capital-letter usage is still informative, so we include it as a separate feature)
* Use the NLTK lemmatizer to combine word forms
* Identify special text (URLs, numbers, dates, emojis) with regex and replace it with special escape tokens to standardize
* Separate punctuation from text
* Assign words to numerical indices based on frequency in the training set
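
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in `MBTI16_NN_Baseline`: the token names (`<url>`, `<num>`, `<unk>`), the regular expressions, and the helper names `preprocess`, `build_vocab`, and `to_ids` are all assumptions made for the example. Lemmatization, dates, and emojis are left out; the NLTK lemmatizer would be applied to each token after tokenization.

```python
import re
from collections import Counter

# Hypothetical escape tokens; the notebook does not name the ones it uses.
URL_TOKEN, NUM_TOKEN, UNK_TOKEN = "<url>", "<num>", "<unk>"

URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
# An escape token, a word (optionally with an apostrophe), or one punctuation mark.
TOKEN_RE = re.compile(r"<\w+>|\w+(?:'\w+)?|[^\w\s]")


def preprocess(post):
    """Return (tokens, capital_ratio) for a single post."""
    # Measure capital-letter usage before lowercasing so it survives as a feature.
    letters = [ch for ch in post if ch.isalpha()]
    cap_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    # Replace special text with escape tokens, then lowercase and tokenize.
    text = URL_RE.sub(URL_TOKEN, post)
    text = NUM_RE.sub(NUM_TOKEN, text)
    return TOKEN_RE.findall(text.lower()), cap_ratio


def build_vocab(token_lists, max_size=None):
    """Map words to indices by descending frequency; index 0 is reserved for unknowns."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {UNK_TOKEN: 0}
    for word, _ in counts.most_common(max_size):
        vocab.setdefault(word, len(vocab))
    return vocab


def to_ids(tokens, vocab):
    """Convert tokens to indices, sending out-of-vocabulary words to index 0."""
    return [vocab.get(tok, vocab[UNK_TOKEN]) for tok in tokens]


tokens, cap_ratio = preprocess("See http://a.b/c?d=1 and 42 cats.")
vocab = build_vocab([tokens])
print(tokens, cap_ratio, to_ids(tokens, vocab))
```

The vocabulary should be built from training posts only, so that words seen only at test time map to the unknown index.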

### Architecture:
* Model: RNN encoder-decoder framework per [Wu et al (2016)](https://arxiv.org/pdf/1609.08144.pdf)
* Encoding Network: LSTM
* Decoding Network: 3-layer RNN
* Output: ReLU after each layer, softmax for the final layer

### Hyperparameters
* B = 500
* S = 35? (inferred from the post lengths in the data)
* V = |V|, the vocabulary size of the corpus
* M = [128, 200, 256, 523] (try combinations of these)
* H_i = [256, 300] (try combinations of these)
* C = [2, 16] (the specs above were for the binary (2-class) case, but we need to try combinations for the 16-class case)
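
One way to enumerate the combinations to try is a plain grid over the candidate values. The sketch below copies the values from the list above; the dictionary keys and the helper `iter_configs` are illustrative names, not identifiers from `MBTI16_NN_Baseline`, and `S = 35` is carried over as the tentative value.

```python
import itertools

# Candidate values copied from the hyperparameter list above.
search_space = {
    "M": [128, 200, 256, 523],
    "H": [256, 300],
    "C": [2, 16],
}
# Values held constant across runs (S is still tentative).
fixed = {"B": 500, "S": 35}


def iter_configs(space, fixed):
    """Yield one config dict per combination of the candidate values."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        config = dict(fixed)
        config.update(zip(keys, values))
        yield config


configs = list(iter_configs(search_space, fixed))
print(len(configs))  # 4 * 2 * 2 = 16 combinations
```

At 25 epochs per run, the full grid is 16 training runs, so it may be worth fixing C = 16 first and searching only M and H_i.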

### Training:
* Epochs = 25
* 70% train, 15% test, 15% holdout
* Loss: cross entropy, with Xavier initialization for the weights
* Optimizers: RMSProp for the encoder, Adam for the decoder
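
The 70/15/15 split can be sketched as a seeded shuffle of indices. The helper `split_indices` and the seed are assumptions for illustration. The list above does not say whether the split is made over users or over individual posts; passing the number of users keeps all of one user's posts in the same partition, which avoids the same author appearing in both train and test.

```python
import random


def split_indices(n, train_frac=0.70, test_frac=0.15, seed=0):
    """Shuffle range(n) and cut it into train, test, and holdout index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(n * train_frac)
    n_test = round(n * test_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    holdout = indices[n_train + n_test:]  # whatever remains, about 15%
    return train, test, holdout


train_idx, test_idx, holdout_idx = split_indices(1000)
print(len(train_idx), len(test_idx), len(holdout_idx))
```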

*Expected Results: Train 55%, Test 23%*


![Generic RNN Architecture for MBTI](MBTI_RNN_arch.jpg)

## 1. Pre-Processing

In [3]:
df = pd.read_csv('./mbti_1.csv')
df.head(10)

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...
5,INTJ,'18/37 @.@|||Science is not perfect. No scien...
6,INFJ,"'No, I can't draw on my own nails (haha). Thos..."
7,INTJ,'I tend to build up a collection of things on ...
8,INFJ,"I'm not sure, that's a good question. The dist..."
9,INTP,'https://www.youtube.com/watch?v=w8-egj0y8Qs||...


In [37]:
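# Build a post-level table: one row per individual post,
# tagged with the author's MBTI type and row index in df
post = []
utype = []
user = []

for index, row in df.iterrows():
    # Each user's posts are stored as one string delimited by '|||'
    posts = row['posts'].split('|||')
    post.extend(posts)
    # Repeat the user's type and index once per post
    utype.extend([row['type']] * len(posts))
    user.extend([index] * len(posts))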
    
short_posts = pd.DataFrame({"user": user,"type": utype,"post": post})
print(short_posts.shape)
short_posts.head()

(422845, 3)


Unnamed: 0,post,type,user
0,'http://www.youtube.com/watch?v=qsXHcwe3krw,INFJ,0
1,http://41.media.tumblr.com/tumblr_lfouy03PMA1q...,INFJ,0
2,enfp and intj moments https://www.youtube.com...,INFJ,0
3,What has been the most life-changing experienc...,INFJ,0
4,http://www.youtube.com/watch?v=vXZeYwwRDw8 h...,INFJ,0


"'http://www.youtube.com/watch?v=qsXHcwe3krw"