# Characterizing Horror Stories

This is one of my first explorations into NLP, something that I find quite fascinating. The real challenge for this exercise isn't necessarily the choice of model and parameters (although this is a key decision), but rather the features that the model is built on. I decided to start simple and build more complex features as I go, so for now I will start with the ratio of words to punctuation and the most common parts of speech used.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
from collections import Counter
import string
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
train = pd.read_csv('../input/train.csv')
train.head()

The first step is to simply tokenize the text using the nltk package. This takes the string from the text feature, splits the words and punctuation into individual strings, and places them in a list.

In [None]:
train['tokens'] = [nltk.word_tokenize(i) for i in train['text']]
train.head()
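
To make the tokenization step concrete, here is a small sketch of how a sentence splits into word and punctuation tokens. It uses a regular expression as a stand-in so that it runs without the nltk tokenizer data; `simple_tokenize` is a hypothetical helper, not part of nltk, and `nltk.word_tokenize` treats contractions and quotation marks differently.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for a tokenizer: runs of word characters,
    # or runs of characters that are neither word characters nor spaces
    return re.findall(r"\w+|[^\w\s]+", text)

sample = "It was dark, cold, and silent..."
tokens = simple_tokenize(sample)
print(tokens)
```

Each punctuation mark ends up as its own list element, which is what makes the counting in the next step possible.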

To get the words-to-punctuation ratio, I first need the count of punctuation tokens and the count of words for each record. I created a field for the count of words, the count of punctuation, and the ratio between them. I also created a token list without punctuation, which will be useful later.

In [None]:
# Count of punctuation tokens
# Note: this is a substring test against string.punctuation, so
# multi-character tokens such as '...' or '``' are counted as words
punc = string.punctuation
def count_punc_in_tokens(series):
    # One count per record: tokens found in the punctuation string
    return [sum(t in punc for t in tokens) for tokens in series]
train['count of punc'] = count_punc_in_tokens(train['tokens'])

In [None]:
# Count of non-punctuation tokens
punc = string.punctuation
def count_words_in_tokens(series):
    # One count per record: tokens not found in the punctuation string
    return [sum(t not in punc for t in tokens) for tokens in series]
train['count of not punc'] = count_words_in_tokens(train['tokens'])
# A record with no punctuation tokens yields inf here
train['words_to_punc_ratio'] = train['count of not punc']/train['count of punc']
def tokens_without_punc(series):
    # Same tokens per record with the punctuation tokens removed
    return [[t for t in tokens if t not in punc] for tokens in series]

train['token_without_punc'] = tokens_without_punc(train['tokens'])
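
As a worked example of the ratio, the sketch below computes it for one hand-written token list. It is only an illustration under two assumptions of mine: a token counts as punctuation when every one of its characters is punctuation (so `...` is included), and a record without punctuation returns `None` instead of dividing by zero. The names `is_punct` and `words_to_punct_ratio` are hypothetical.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punct(token):
    # Assumption: a token is punctuation when all of its characters are
    return all(ch in PUNCT_CHARS for ch in token)

def words_to_punct_ratio(tokens):
    # Word tokens divided by punctuation tokens; None when there is no punctuation
    n_punct = sum(is_punct(t) for t in tokens)
    n_words = len(tokens) - n_punct
    return n_words / n_punct if n_punct else None

# Hand-written tokens standing in for tokenizer output
sample_tokens = ["It", "was", "dark", ",", "cold", ",", "and", "silent", "..."]
ratio = words_to_punct_ratio(sample_tokens)
print(ratio)  # 6 words / 3 punctuation tokens = 2.0
```

Whether multi-character tokens should count as punctuation is a feature-design choice, and it changes both the ratio and the token list passed on to the POS step.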

Next, I wanted to see whether the authors had distinct parts of speech (POS) they liked to use. I used nltk for this as well. The nltk function pos_tag returns a list of tuples, each containing a token from the tokens-without-punctuation list and its POS. I then used the Counter() class from collections to build, for each record, a dict-like count where the key is the POS and the value is the count of that POS in the list of tokens. I joined these counts into a single df with each unique POS acting as the index and each column holding the POS counts for one record in the training data. This allows me to easily transpose the df and join it by the index to the rest of the training data, making each POS count a new feature. We probably do not want to use all of these features in our model, so I dropped the POS where the sum of the counts was less than the median sum of the counts.

In [None]:
def count_pos_tags(series):
    # Build one single-column DataFrame of POS counts per record,
    # indexed by POS and named after the record's position
    counts = []
    for itter, s in enumerate(series):
        tags = nltk.pos_tag(s)  # list of (token, POS) tuples
        cnt = Counter(pos for _, pos in tags)
        df = pd.DataFrame(list(cnt.items()), columns=['POS', str(itter)])
        df.set_index('POS', inplace = True)
        counts.append(df)
    # Align on the POS index; a POS missing from a record becomes NaN
    pos_df = pd.concat(counts, axis=1)
    return pos_df

# The function has its own name, so rerunning this cell does not
# try to call the DataFrame it produced
pos_used = count_pos_tags(train['token_without_punc'])

In [None]:
pos_used = pos_used.fillna(0)
pos_sum = pos_used.sum(axis=1)
# Keep only the POS whose total count is at or above the median total
pos_used = pos_used[pos_sum >= pos_sum.quantile(.5)]

In [None]:
#print(pos_used.T.shape) # 19579, 38
#print(train.shape) # 19579, 8
pos_df = pos_used.transpose()
pos_df.reset_index(inplace = True)
pos_df.drop(['index'], inplace = True, axis= 1)

In [None]:
train2 = pd.concat([train,pos_df], axis = 1)
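
The POS feature construction can be sketched on a toy example. This is a compact variant rather than the cells above: it passes the per-record Counters straight to `pd.DataFrame`, which places records in rows and POS in columns, so no transpose is needed. The `(token, POS)` pairs are hand-written stand-ins for `nltk.pos_tag` output, and `tagged_records`, `pos_counts`, and `kept` are illustrative names.

```python
from collections import Counter
import pandas as pd

# Hand-written (token, POS) pairs standing in for nltk.pos_tag output
tagged_records = [
    [("The", "DT"), ("raven", "NN"), ("spoke", "VBD")],
    [("A", "DT"), ("cold", "JJ"), ("wind", "NN"), ("blew", "VBD")],
    [("Nevermore", "RB")],
]

# One Counter per record; the DataFrame aligns the POS into columns
pos_counts = pd.DataFrame(
    [Counter(pos for _, pos in record) for record in tagged_records]
).fillna(0).astype(int)

# Keep only the POS whose total count is at least the median total
totals = pos_counts.sum(axis=0)
kept = pos_counts.loc[:, totals >= totals.median()]
print(kept)
```

Here DT, NN, and VBD each total 2 while JJ and RB total 1, so the median is 2 and only the first three columns survive the filter.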

In [None]:
train = train2.drop(['id', 'text', 'tokens', 'token_without_punc','count of punc', 'count of not punc'], axis = 1)

In [None]:
train.head()

Finally, on to the model. I decided to start with a one-vs-rest SVM model and a random forest classifier. I standardized the features first, then randomly split the rows 60:40 into training and validation sets. I did not do any hyperparameter tuning for these models; each is scored with 3-fold cross-validation on the training set.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
y = train.iloc[:,0]
X = train.iloc[:,1:]
X = StandardScaler().fit_transform(X)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = .4, random_state = 42)

In [None]:
clf = OneVsRestClassifier(SVC(kernel = 'linear'))
clf.fit(X_train,y_train)
cross_val_score(clf, X_train, y_train, cv=3, scoring = 'accuracy')

In [None]:
clf = RandomForestClassifier()
clf.fit(X_train,y_train)
cross_val_score(clf, X_train, y_train, cv=3, scoring = 'accuracy')
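
One possible refinement, sketched here on synthetic data, is to put the scaler inside a scikit-learn pipeline so that it is refit on the training part of every cross-validation fold, and then to score the held-out validation split as well. The data from `make_classification` is only a stand-in for the engineered features, so the printed scores say nothing about the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the engineered features
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Scaling happens inside the pipeline, so each fold fits its own scaler
model = make_pipeline(StandardScaler(),
                      OneVsRestClassifier(SVC(kernel="linear")))
cv_scores = cross_val_score(model, X_train, y_train, cv=3, scoring="accuracy")

# Fit on the full training split and check the untouched validation split
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print(cv_scores.mean(), val_accuracy)
```

Comparing the cross-validation mean with the validation accuracy gives a rough check on whether the cross-validation estimate holds up on unseen rows.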

The one-vs-rest SVM classifier had the better accuracy of the two, but the score was low in general. Both models scored close to each other, however, which points to a lack of feature separation rather than model choice as the limiting factor. So I will brainstorm some more features to experiment with and