# Feature Extraction
In this step, we convert the raw text into numerical features for analysis. Both the keywords and the tweet text have to be converted. Let's start with the keywords.
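As a rough picture of what "numerical features" means for the text part, here is a minimal sketch of a bag-of-words count matrix. The two toy sentences and the names `toy_docs`, `toy_cv`, `toy_counts` and `toy_vocab` are made up for illustration, and the default scikit-learn tokenizer is assumed here rather than a tweet-specific one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from the dataset
toy_docs = ["fire in the forest", "forest fire near the town"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)  # sparse matrix, one row per document

# Vocabulary terms ordered by their column index
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_vocab)
print(toy_counts.toarray())
```

Each column corresponds to one vocabulary term and each cell holds how often that term occurs in the document.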

In [1]:
import pandas as pd
import numpy as np

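train = pd.read_csv('../downloads/train.csv')
# NOTE: this reads train.csv a second time, so `test` is an exact copy of `train`
test = pd.read_csv('../downloads/train.csv')

# Replace missing keywords and locations with empty strings
train['keyword'] = train['keyword'].fillna('')
test['keyword'] = test['keyword'].fillna('')
train['location'] = train['location'].fillna('')
test['location'] = test['location'].fillna('')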

train.sample(5)

|      | id   | keyword               | location             | text                                              | target |
|------|------|-----------------------|----------------------|---------------------------------------------------|--------|
| 3044 | 4368 | earthquake            | Melbourne, Australia | Nepal earthquake 3 months on: Women fear abuse... | 1      |
| 1534 | 2216 | chemical%20emergency  |                      | THE CHEMICAL BROTHERS to play The Armory in SF... | 0      |
| 1260 | 1816 | buildings%20on%20fire | World Wide           | 1943: Poland - work party prisoners in the Naz... | 1      |
| 2042 | 2932 | danger                | Hailing from Dayton  | I wish I could get Victoria's Secret on front.... | 0      |
| 4610 | 6553 | injury                |                      | DAL News: Wednesday's injury report: RB Lance ... | 0      |


In [2]:
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words counts, tokenized with NLTK's tweet-aware tokenizer
tt = TweetTokenizer()
cv = CountVectorizer(tokenizer=tt.tokenize)
cv.fit(train['text'])

# get_feature_names() was deprecated in scikit-learn 1.0
# in favor of get_feature_names_out()
feature_names = cv.get_feature_names()

# Vectorize keywords and tweets
# First the training set
train_keywords = pd.get_dummies(train.keyword, prefix='KW')
train_tweets = pd.DataFrame(cv.transform(train['text']).toarray(), columns=feature_names)
X_train = pd.concat([train_keywords, train_tweets], axis=1)
y_train = train.target

# Then the test set, reusing the vocabulary fitted on the training text
test_keywords = pd.get_dummies(test.keyword, prefix='KW')
test_tweets = pd.DataFrame(cv.transform(test['text']).toarray(), columns=feature_names)
X_test = pd.concat([test_keywords, test_tweets], axis=1)
y_test = test.target
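Calling `pd.get_dummies` separately on two frames only yields matching columns when both contain exactly the same set of keywords. That holds in the cell above, but it would not hold for a genuinely separate test set. A minimal sketch of one way to align the columns, assuming made-up keyword values and hypothetical variable names:

```python
import pandas as pd

# Hypothetical keyword columns; 'quake' never appears in the training data
toy_train_kw = pd.Series(['fire', 'flood', '', 'fire'])
toy_test_kw = pd.Series(['flood', 'quake'])

toy_train_dummies = pd.get_dummies(toy_train_kw, prefix='KW', dtype=int)
toy_test_dummies = pd.get_dummies(toy_test_kw, prefix='KW', dtype=int)

# Keep exactly the training columns: keywords unseen in training are dropped,
# and training keywords missing from the test data become all-zero columns
toy_test_dummies = toy_test_dummies.reindex(
    columns=toy_train_dummies.columns, fill_value=0)

print(toy_test_dummies)
```

After the `reindex`, both frames have the same columns in the same order, so a model fitted on one can be applied to the other.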



# Modelling
## Ridge Classifier
Let's start out with a very simple model: a ridge classifier. How well does it classify the tweets?

In [3]:
from sklearn.linear_model import RidgeClassifier

model = RidgeClassifier()
model.fit(X_train,y_train)

RidgeClassifier()

In [4]:
y_pred = model.predict(X_test)

from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4342
           1       1.00      0.99      1.00      3271

    accuracy                           1.00      7613
   macro avg       1.00      1.00      1.00      7613
weighted avg       1.00      1.00      1.00      7613



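Ahh, nice! Nearly perfect precision and recall! I wasn't expecting that using default parameters. One caveat, though: `test` was read from the same `train.csv` file as `train`, so these scores are measured on the very tweets the model was fitted to. They show that the data contain enough information to separate the two classes in the training set, but they say nothing yet about how well the model classifies unseen tweets. That the text is informative makes sense, of course: the data were labeled by human readers who looked at the same text information. Perhaps they selected tweets that they were confident in classifying.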
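Since `test` and `train` come from the same file here, one common way to estimate performance on unseen tweets is to hold out part of the labeled data before fitting. The following is only a sketch under stated assumptions: the toy texts and labels are invented stand-ins for `train['text']` and `train['target']`, the default tokenizer is used, and names such as `cv_holdout` and `holdout_model` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for train['text'] and train['target']
texts = [
    "forest fire spreading near the town",
    "earthquake damages several buildings",
    "flood waters rising after the storm",
    "bomb explosion reported downtown",
    "this new album is fire",
    "my phone battery is a disaster",
    "traffic was a nightmare today",
    "that concert blew me away",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the labeled tweets, keeping the class balance
text_fit, text_val, y_fit, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Fit the vocabulary on the fitting split only, then reuse it
cv_holdout = CountVectorizer()
X_fit = cv_holdout.fit_transform(text_fit)
X_val = cv_holdout.transform(text_val)

holdout_model = RidgeClassifier()
holdout_model.fit(X_fit, y_fit)
val_pred = holdout_model.predict(X_val)

print(classification_report(y_val, val_pred, zero_division=0))
```

With only a handful of toy tweets the printed scores mean nothing in themselves; the point is that the validation tweets never reach `fit`.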

### Dropping the keyword data

Let's see if the results look quite as good if we drop the keyword data.

In [5]:
model2 = RidgeClassifier()
model2.fit(cv.transform(train['text']).toarray(),y_train)
y_pred = model2.predict(cv.transform(test['text']).toarray())
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      4342
           1       1.00      0.99      0.99      3271

    accuracy                           1.00      7613
   macro avg       1.00      1.00      1.00      7613
weighted avg       1.00      1.00      1.00      7613



The numbers look a little worse (the F1-score for class 1 drops from 1.00 to 0.99), but keywords clearly don't have a drastic effect on the classification performance.

# Unsupervised methods

Now we can ask the question: why does it work so well? To find out, we can look for patterns in the data using unsupervised methods.

Can we turn the problem around and predict the keyword from the tweet text? If we did a topic analysis, would the topics map to keywords?
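One way to check whether topics map to keywords is to assign each tweet its dominant topic and cross-tabulate topics against keywords. This is a minimal sketch on invented toy data, not the analysis run in this notebook; the frame `toy`, the NMF settings (`init='nndsvda'`, `max_iter=500`) and the other names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets with their keywords
toy = pd.DataFrame({
    'keyword': ['fire', 'fire', 'fire', 'flood', 'flood', 'flood'],
    'text': [
        "forest fire burning near homes",
        "fire crews battle forest fire",
        "homes lost in forest fire",
        "flood water covers the road",
        "river flood closes the road",
        "flood water rising in the river",
    ],
})

toy_term_counts = CountVectorizer().fit_transform(toy['text'])

toy_nmf = NMF(n_components=2, init='nndsvda', random_state=1, max_iter=500)
# Rows are tweets, columns are topic weights
doc_topics = toy_nmf.fit_transform(toy_term_counts)

# Dominant topic per tweet, then a keyword-by-topic contingency table
toy['topic'] = doc_topics.argmax(axis=1)
topic_table = pd.crosstab(toy['keyword'], toy['topic'])
print(topic_table)
```

If topics tracked keywords closely, each row of the table would concentrate its counts in a single column.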

In [7]:
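from sklearn.decomposition import NMF

# `alpha` is the regularization strength in older scikit-learn releases;
# it was deprecated in 1.0 in favor of alpha_W and alpha_H
nmf = NMF(n_components=20, random_state=1,
          alpha=.1, l1_ratio=.5)

# NMF is unsupervised, so no labels are passed to fit()
nmf.fit(X_train)

n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted features of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        # argsort() is ascending, so the reversed slice takes the largest weights first
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

print_top_words(nmf, X_train.columns, n_top_words)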



Topic #0: the that by was are with world has about at after still not last first more it's out we years
Topic #1: ? # why follow so la airlines missing aircraft debris reunion _ found no u mh370 crush malaysia KW_crush this
Topic #2: . with s it will not was & no u be p we have they it's o he KW_detonate m
Topic #3: : rt 2015-08- û_ california from as at 05 [ ] pm news s utc train police over 3 (
Topic #4: a like was by at up get this but video with from what just after watch that under @youtube sandstorm
Topic #5: ' from as are families wreckage by KW_wreckage confirmed who conclusively | were those mh370 malaysia ) it's video rescuers
Topic #6: ! be & all out what please we news from wind ass check day KW_loud%20bang bang just loud hey neighbour's
Topic #7: to be with going have get or make want go how as do it out over not we back so
Topic #8: in killed suicide / crash people bomber who police up fire KW_hostages accident land two hostages released as are bomb
Topic #9: i it was so 