## Text Feature Extraction

### What features are appropriate?

Perhaps the list of unique words?



In [21]:
tweets = ['This is the first tweet.', 'This is the second second tweet.', 'And the third one.', 'Is this the first tweet?']
# Lowercase every whitespace-separated token; punctuation stays attached (e.g. 'tweet.')
vocab = [x.lower() for tweet in tweets for x in tweet.split()]
vocab = list(set(vocab))
sorted(vocab)

['and',
 'first',
 'is',
 'one.',
 'second',
 'the',
 'third',
 'this',
 'tweet.',
 'tweet?']

### Vectorization

Vectorization is the general process of turning a collection of text documents into numerical feature vectors.

### CountVectorizer

CountVectorizer both tokenizes text and counts tokens. In scikit-learn, tokenizing splits strings into tokens, for instance by using whitespace and punctuation as token separators, and assigns an integer id to each possible token. CountVectorizer then counts the number of token occurrences, i.e. the number of times each token appears in each document.
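
To make the "integer id for each token" idea concrete, here is a minimal sketch that inspects the fitted vocabulary. The names demo_vectorizer and token_ids are illustrative and not part of the original notebook; vocabulary_ is the fitted attribute that maps each token to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    'This is the first tweet.',
    'This is the second second tweet.',
    'And the third one.',
    'Is this the first tweet?',
]

# Illustrative names; fitting learns the token -> integer id mapping.
demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(tweets)
token_ids = demo_vectorizer.vocabulary_

# Print the tokens ordered by their integer id (the column index).
for token, idx in sorted(token_ids.items(), key=lambda kv: kv[1]):
    print(idx, token)
```

Each id is the column that token occupies in the count matrix produced by fit_transform.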


In [22]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)
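# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
features = vectorizer.get_feature_names_out().tolist()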
features

['and', 'first', 'is', 'one', 'second', 'the', 'third', 'this', 'tweet']

In [23]:
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [24]:
X.toarray()  

array([[0, 1, 1, 0, 0, 1, 0, 1, 1],
       [0, 0, 1, 0, 2, 1, 0, 1, 1],
       [1, 0, 0, 1, 0, 1, 1, 0, 0],
       [0, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)

### transform
.transform converts new text into a feature vector using the vocabulary learned during fitting. Words that are not in that vocabulary are ignored.

In [25]:
vectorizer.transform(['Lets assume this is a new tweet']).toarray()

array([[0, 0, 1, 0, 0, 0, 0, 1, 1]])

### Bi-Grams and Tri-Grams

In [26]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)

X = bigram_vectorizer.fit_transform(tweets)
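# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
features_bi = bigram_vectorizer.get_feature_names_out().tolist()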
print(sorted(features_bi))

['and', 'and the', 'first', 'first tweet', 'is', 'is the', 'is this', 'one', 'second', 'second second', 'second tweet', 'the', 'the first', 'the second', 'the third', 'third', 'third one', 'this', 'this is', 'this the', 'tweet']


In [27]:
trigram_vectorizer = CountVectorizer(ngram_range=(1, 3),  min_df=1)

X = trigram_vectorizer.fit_transform(tweets)
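# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
features_tri = trigram_vectorizer.get_feature_names_out().tolist()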
print(sorted(features_tri))

['and', 'and the', 'and the third', 'first', 'first tweet', 'is', 'is the', 'is the first', 'is the second', 'is this', 'is this the', 'one', 'second', 'second second', 'second second tweet', 'second tweet', 'the', 'the first', 'the first tweet', 'the second', 'the second second', 'the third', 'the third one', 'third', 'third one', 'this', 'this is', 'this is the', 'this the', 'this the first', 'tweet']


#### Problems:
1. Longer documents will have higher average count values than shorter documents.
2. Very common words, e.g. 'the', 'and', 'is', will automatically have higher counts even though they carry little information.

#### Solution:
1. Weight term frequencies by inverse document frequency (TF-IDF).

### TF-IDF
Term frequency (TF) expresses a word's count relative to the length of the document. Instead of saying the word 'kenya' was used 100 times, TF says the word 'kenya' made up 2% of the words. This takes care of the first problem.

To take care of the second problem, we use IDF.
The inverse document frequency (IDF) factor diminishes the weight of terms that occur in many documents of the collection and increases the weight of terms that occur in few of them.
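
As a rough sketch of how this weighting can be computed by hand, the block below rebuilds the scores for the four example tweets and compares them with TfidfTransformer. It assumes scikit-learn's default settings: the term frequency is the raw count, the IDF is smoothed as ln((1 + n) / (1 + df)) + 1, and each row is scaled to unit Euclidean length, which is what makes documents of different lengths comparable. The names count_matrix, manual_tfidf and sklearn_tfidf are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

tweets = [
    'This is the first tweet.',
    'This is the second second tweet.',
    'And the third one.',
    'Is this the first tweet?',
]

count_matrix = CountVectorizer().fit_transform(tweets)
counts = count_matrix.toarray()
n_docs = counts.shape[0]

# Document frequency: the number of tweets that contain each token.
df = (counts > 0).sum(axis=0)

# Smoothed IDF (assumed default: smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the raw counts by IDF, then scale each row to unit L2 norm.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result under its default parameters.
sklearn_tfidf = TfidfTransformer().fit_transform(count_matrix).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

A token such as 'the', which appears in every tweet, gets the smallest possible IDF (exactly 1 under this formula), so it contributes less than rarer tokens like 'third'.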


In [28]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
# X is the trigram count matrix from the previous cell, hence the 31 features
X_train_tfidf = tfidf_transformer.fit_transform(X)
X_train_tfidf.shape

(4, 31)

# Author Attribution

In [29]:
# Imports
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.utils import shuffle
from sklearn import metrics

## Train

#### a.) Load the Data

In [30]:
#Create a dictionary for storing the data
my_data = {'data':[],
          'target':[],
          'target_names':[]}

mytarget = ['UKenyatta','RailaOdinga']

my_data['target_names'] = mytarget
my_data

{'data': [], 'target': [], 'target_names': ['UKenyatta', 'RailaOdinga']}

In [31]:
#Load the data
nrs = pd.read_csv('kenya.csv')
# One shuffle is enough; shuffling twice does not make the order any more random
nrs = shuffle(nrs)
nrs.head()

Unnamed: 0,tweet,class
37877,"b'The roads constructed across the country, pr...",0
25822,b'We were hosted by Mzee Ole Kinayi who was ge...,1
6526,b'Attending Sunday service earlier at ACK Cath...,1
38426,b'Doctors to be paid for days they were on str...,0
38825,"b'RT @WilliamsRuto: We must hold together, pre...",0


In [32]:
my_data['data'] = list(nrs['tweet'])
my_data['target'] = list(nrs['class'])

### b.) Split the data into training and test sets

In [33]:
X_train = my_data['data'][:-1000]
y_train = my_data['target'][:-1000]

X_test = my_data['data'][-1000:]
y_test = my_data['target'][-1000:]
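
The slicing above works because the rows were shuffled beforehand. As an alternative sketch, train_test_split can shuffle and split in one call and, with stratify, keep the class proportions similar in both parts. The toy lists below are hypothetical stand-ins for my_data['data'] and my_data['target'], and random_state=0 is an arbitrary choice made only for reproducibility.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for my_data['data'] and my_data['target'].
toy_tweets = ['tweet %d' % i for i in range(10)]
toy_classes = [0, 1] * 5

# An integer test_size is an absolute number of test samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_tweets,
    toy_classes,
    test_size=3,
    stratify=toy_classes,
    random_state=0,
)
print(len(X_tr), len(X_te))
```

With the real data, test_size=1000 would correspond to the 1000-tweet hold-out set used in this notebook.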

### c.) Create and Train a classifier
#### Feature Extraction

In [34]:
# Occurrences
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(47840, 14780)

In [35]:
# Frequencies (TF-IDF weights)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(47840, 14780)

In [36]:
#Training a classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)
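
The three steps above (counting, TF-IDF weighting, Naive Bayes) can also be chained with scikit-learn's Pipeline so that the same transformations are applied at training and prediction time. This is a minimal sketch rather than the notebook's own method; the toy texts, labels and step names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train and y_train.
toy_texts = [
    'roads built across the country',
    'new roads and hospitals opened',
    'the economy must grow for all',
    'handouts do not grow the economy',
]
toy_labels = [0, 0, 1, 1]

# Step names ('counts', 'tfidf', 'nb') are arbitrary labels.
text_clf = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
text_clf.fit(toy_texts, toy_labels)

# Raw strings go in; the pipeline vectorizes them before predicting.
toy_predictions = text_clf.predict(['roads and hospitals', 'grow the economy'])
print(toy_predictions)
```

With a pipeline there is no need to call transform on the vectorizer and the TF-IDF transformer separately for new text.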

### Test

In [37]:
X_tests_counts = count_vect.transform(X_test)
X_tests_tfidf = tfidf_transformer.transform(X_tests_counts)
expected  = y_test
predicted = clf.predict(X_tests_tfidf)
print("Accuracy of our model is:\n%s" % metrics.accuracy_score(expected, predicted))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

Accuracy of our model is:
0.984
Confusion matrix:
[[306  15]
 [  1 678]]
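
The confusion matrix carries more detail than the single accuracy figure. As a short sketch, the per-class recall and precision can be derived from the numbers printed above, using scikit-learn's convention that rows are the true classes and columns are the predicted classes. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[306, 15],
               [1, 678]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Recall: share of each true class that was predicted correctly.
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: share of each predicted class that was actually correct.
precision = np.diag(cm) / cm.sum(axis=0)

print('accuracy :', accuracy)
print('recall   :', recall)
print('precision:', precision)
```

Most of the errors are the 15 class-0 tweets predicted as class 1, so recall for class 0 is lower than for class 1.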


### Apply

In [38]:
#Predicting Outcome
tweet1 = 'Waived debts for rice farmers in the Mwea Irrigation Scheme as we continue to extend our support to farmers in every part of this country'
tweet2 = 'Kenyans don\'t want handouts. They need a hand up initiative from their government. The economy must grow for all - not the few.'
tweet3 = 'today is a beautiful day'

tweets_new = [tweet1,tweet2,tweet3]
X_new_counts = count_vect.transform(tweets_new)

X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for tw, category in zip(tweets_new, predicted):
    print('\n%r ===> %s' % (tw, my_data['target_names'][category]))


'Waived debts for rice farmers in the Mwea Irrigation Scheme as we continue to extend our support to farmers in every part of this country' ===> UKenyatta

"Kenyans don't want handouts. They need a hand up initiative from their government. The economy must grow for all - not the few." ===> RailaOdinga

'today is a beautiful day' ===> RailaOdinga
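
Note that the classifier always picks one of the two authors, even for a neutral tweet like the last one. One way to see how confident a prediction is, sketched below on hypothetical toy data, is MultinomialNB's predict_proba. In this toy setup none of the words in the neutral tweet appear in the training vocabulary, so the model can only fall back on the class priors.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real notebook trains on the tweets in kenya.csv.
toy_texts = [
    'roads built across the country',
    'new roads and hospitals opened',
    'the economy must grow for all',
    'handouts do not grow the economy',
]
toy_labels = [0, 0, 1, 1]

toy_clf = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
toy_clf.fit(toy_texts, toy_labels)

# No word of this tweet is in the toy vocabulary, so its feature vector
# is all zeros and the probabilities equal the class priors.
proba = toy_clf.predict_proba(['today is a beautiful day'])
print(proba)
```

Checking these probabilities before reporting a label is one way to flag tweets the model has little evidence about.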
