# Feature Extraction
In this step, we convert the raw data into numerical features for analysis. Both the keywords and the tweet text have to be converted. Let's start by loading the data and filling in the missing keywords and locations.

In [1]:
import pandas as pd
import numpy as np

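train = pd.read_csv('../downloads/train.csv')
# NOTE: this reads train.csv a second time, so `test` is a copy of the
# training data and every score below is measured on tweets the model has seen.
test = pd.read_csv('../downloads/train.csv')

# Replace missing keywords and locations with empty strings.
# Assigning the result back avoids chained in-place modification.
for df in (train, test):
    df['keyword'] = df['keyword'].fillna('')
    df['location'] = df['location'].fillna('')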

train.sample(5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 3408 | 4878 | explode | Spring Grove, IL | If Schwarber ran into me going that fast I wou... | 0 |
| 1668 | 2410 | collide | | But even if the stars and moon collide I never... | 0 |
| 4992 | 7122 | military | | Online infantryman experimental military train... | 0 |
| 3206 | 4600 | emergency%20services | | @Glenstannard @EssexWeather do you know where ... | 1 |
| 2761 | 3967 | devastation | Washington, DC | 70 Years After Atomic Bombs Japan Still Strugg... | 1 |


In [2]:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()

from sklearn.feature_extraction.text import CountVectorizer
# Build the vocabulary from the training tweets only
cv = CountVectorizer(tokenizer=tt.tokenize)
cv.fit(train['text'])
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# Vectorize keywords and tweets
# First the training set
train_keywords = pd.get_dummies(train.keyword)
train_tweets = pd.DataFrame(cv.transform(train['text']).toarray(),columns=feature_names)
X_train = pd.concat([train_keywords,train_tweets],axis=1)
y_train = train.target

# Then the test set, aligned to the training keyword columns
test_keywords = pd.get_dummies(test.keyword).reindex(columns=train_keywords.columns,fill_value=0)
test_tweets = pd.DataFrame(cv.transform(test['text']).toarray(),columns=feature_names)
X_test = pd.concat([test_keywords,test_tweets],axis=1)
y_test = test.target



# Modelling
## Ridge Classifier
Let's start with a very simple model: a ridge classifier with default parameters. How well does it classify the tweets?

In [3]:
from sklearn.linear_model import RidgeClassifier

model = RidgeClassifier()
model.fit(X_train,y_train)

RidgeClassifier()

In [4]:
y_pred = model.predict(X_test)

from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4342
           1       1.00      0.99      1.00      3271

    accuracy                           1.00      7613
   macro avg       1.00      1.00      1.00      7613
weighted avg       1.00      1.00      1.00      7613



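Nearly perfect precision and recall, which is too good to be true for default parameters. The catch is in the first cell: `test` is read from the same `train.csv` file as `train`, so the model is being scored on the very tweets it was fitted to. With one feature per token and 7613 tweets, a linear model can come close to memorizing the training set, so these numbers measure fit, not how well the classifier generalizes to unseen tweets. The text surely carries useful signal, since the labels came from human readers looking at the same text, but how much of it the model captures cannot be read off this report.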
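
A score on held-out tweets is the usual way to check that a result like this carries over to unseen data. The sketch below shows the pattern with `train_test_split`. The corpus and labels are a made-up stand-in for `train['text']` and `train['target']`, the variable names are hypothetical, and the default tokenizer replaces `TweetTokenizer` to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for train['text'] and train['target']
texts = [
    "forest fire spreading near the highway",
    "earthquake damages buildings downtown",
    "flood waters rising evacuate now",
    "wildfire forces thousands to evacuate",
    "storm destroys homes along the coast",
    "explosion reported at the chemical plant",
    "this new album is fire",
    "my heart exploded with joy today",
    "flooded with emails this morning",
    "that concert was a total blast",
    "traffic was a disaster but we made it",
    "i am dying of laughter right now",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out a quarter of the labeled tweets before fitting anything
texts_fit, texts_val, y_fit, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vocabulary on the fitting split only, then reuse it for validation
vec = CountVectorizer()
X_fit = vec.fit_transform(texts_fit)
X_val = vec.transform(texts_val)

clf = RidgeClassifier().fit(X_fit, y_fit)
print(classification_report(y_val, clf.predict(X_val), zero_division=0))
```

On the real data the same split would be applied to `train` before `cv.fit`, so that neither the vocabulary nor the model ever sees the validation tweets.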

### Dropping the keyword data

Let's see if the results look quite as good if we drop the keyword data.

In [5]:
model2 = RidgeClassifier()
model2.fit(cv.transform(train['text']).toarray(),y_train)
y_pred = model2.predict(cv.transform(test['text']).toarray())
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      4342
           1       1.00      0.99      0.99      3271

    accuracy                           1.00      7613
   macro avg       1.00      1.00      1.00      7613
weighted avg       1.00      1.00      1.00      7613



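The numbers look a little worse, but the difference is tiny. Keep in mind that both models are scored on the tweets they were trained on (`test` is read from `train.csv`), so this comparison only shows that the text alone is enough to fit the training set almost perfectly. Whether the keywords help on unseen tweets is still an open question.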
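
One way to compare the two feature sets more fairly is cross-validation, with the vectorizer and encoder refit inside each fold. The following is a sketch under assumptions, not the notebook's method: the frame `df` is a made-up stand-in for `train`, the pipeline names are hypothetical, and `OneHotEncoder` replaces `pd.get_dummies` so that keywords unseen in a fold are ignored instead of breaking the column layout. The features stay sparse, which `RidgeClassifier` accepts directly.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with the same columns as train
df = pd.DataFrame({
    "keyword": ["fire", "earthquake", "flood", "wildfire", "storm", "explosion",
                "fire", "explosion", "flood", "", "", ""],
    "text": [
        "forest fire spreading near the highway",
        "earthquake damages buildings downtown",
        "flood waters rising evacuate now",
        "wildfire forces thousands to evacuate",
        "storm destroys homes along the coast",
        "explosion reported at the chemical plant",
        "this new album is fire",
        "my heart exploded with joy today",
        "flooded with emails this morning",
        "that concert was a total blast",
        "traffic was slow but we made it",
        "i am dying of laughter right now",
    ],
    "target": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Text-only features: bag-of-words on the 'text' column
text_only = make_pipeline(
    ColumnTransformer([("bow", CountVectorizer(), "text")]),
    RidgeClassifier(),
)

# Text plus one-hot keywords, stacked as one sparse matrix
text_and_keyword = make_pipeline(
    ColumnTransformer([
        ("bow", CountVectorizer(), "text"),
        ("kw", OneHotEncoder(handle_unknown="ignore"), ["keyword"]),
    ]),
    RidgeClassifier(),
)

# Mean F1 over 3 stratified folds for each feature set
results = {}
for name, pipe in [("text only", text_only), ("text + keyword", text_and_keyword)]:
    scores = cross_val_score(pipe, df[["keyword", "text"]], df["target"], cv=3, scoring="f1")
    results[name] = scores.mean()
    print(name, round(results[name], 3))
```

Because each fold builds its own vocabulary and keyword encoding, no information from the validation tweets leaks into the features.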

# Unsupervised methods

Next question: what structure in the tweets is the classifier picking up on? Unsupervised methods let us look for patterns in the data without using the labels.

Can we turn the problem around and predict the keyword from the tweet text? If we did a topic analysis, would the topics map to keywords?
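
As a starting point for the second question, a topic model can be fitted to the token counts and the dominant topic of each tweet cross-tabulated against its keyword. This is only a sketch under assumptions: the rows below are made up, the number of topics is picked arbitrarily, and `LatentDirichletAllocation` is one possible choice of topic model rather than something this notebook has settled on.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets with keywords (not from train.csv)
toy = pd.DataFrame({
    "keyword": ["fire", "fire", "fire", "fire", "flood", "flood", "flood", "flood"],
    "text": [
        "forest fire burning near the town",
        "firefighters battle the forest fire",
        "smoke from the fire covers the town",
        "fire crews contain burning forest",
        "flood waters rising in the river",
        "river flood closes the bridge",
        "heavy rain causes river flood",
        "bridge washed away by flood waters",
    ],
})

# Token counts, dropping common English stop words
counts = CountVectorizer(stop_words="english").fit_transform(toy["text"])

# Fit a 2-topic model (the topic count is an arbitrary choice for this sketch)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per tweet

# Assign each tweet its most probable topic and compare with the keywords
toy["topic"] = doc_topics.argmax(axis=1)
table = pd.crosstab(toy["keyword"], toy["topic"])
print(table)
```

If topics map to keywords, each row of the cross-tabulation would concentrate in one column. On the real data the number of topics would need tuning, for example against the number of distinct keywords.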