# Supervised Learning
So far we have looked at how to work with numerical data when performing different types of classification.

Today we look at how machine learning can be used to process text. This field of machine learning is generally called Natural Language Processing (NLP).

<img src="images/nlp.png" height="32" width="40%" align="left">

Natural language processing has many applications, such as:
- Translation (e.g. Google Translate)
- Summarization
- Information retrieval, for example:

<img src="images/obama.png" height="32" width="50%" align="left">

- Sentiment analysis

<img src="images/sentiment_analysis.png" height="42" width="50%" align="left">

- Text classification


<img src="images/googlenews.png" height="32" width="40%" align="left">

Before delving into an example, let's look at how text is preprocessed before it is fed into a machine learning model.


## Text Preprocessing
Text preprocessing generally involves a number of steps:
1. Cleaning: removing special characters and similar noise.
2. Tokenizing: splitting the text into tokens. A token is a string of contiguous characters; a word is one example of a token.
3. Vectorizing: moving from words to numerical vectors.


In [73]:
tweets = ['This is the first tweet.', 'This is the second second tweet.', 'And the third one.', 'Is this the first tweet?']

# Split each tweet on whitespace and lowercase every word
vocab = []
for tweet in tweets:
    for word in tweet.split():
        vocab.append(word.lower())

# Remove duplicates, then sort for a stable display order
vocab = list(set(vocab))
sorted(vocab)

['and',
 'first',
 'is',
 'one.',
 'second',
 'the',
 'third',
 'this',
 'tweet.',
 'tweet?']
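
The vocabulary above still treats `tweet.` and `tweet?` as different tokens, because `split()` only breaks on whitespace. A minimal sketch of a cleaning step, assuming it is acceptable to replace everything except letters, digits and whitespace with a space (the helper name `clean` is hypothetical):

```python
import re

tweets = [
    'This is the first tweet.',
    'This is the second second tweet.',
    'And the third one.',
    'Is this the first tweet?',
]

def clean(text):
    """Lowercase the text and replace anything that is not a letter, digit or whitespace."""
    return re.sub(r'[^a-z0-9\s]', ' ', text.lower())

# Build the vocabulary from the cleaned text instead of the raw text
vocab = sorted({word for tweet in tweets for word in clean(tweet).split()})
print(vocab)
# ['and', 'first', 'is', 'one', 'second', 'the', 'third', 'this', 'tweet']
```

With the punctuation gone, the two spellings of "tweet" collapse into a single token.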

### Vectorization

Vectorization is the general process of turning a collection of text documents into numerical feature vectors.

### CountVectorizer

`CountVectorizer` both tokenizes text and counts tokens. In scikit-learn, tokenizing assigns an integer id to each distinct token, by default using whitespace and punctuation as token separators. `CountVectorizer` then counts how many times each token occurs in each document. By default it also lowercases the text and ignores single-character tokens.


In [74]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = vectorizer.get_feature_names_out().tolist()
features

['and', 'first', 'is', 'one', 'second', 'the', 'third', 'this', 'tweet']

In [75]:
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [76]:
X.toarray()  

array([[0, 1, 1, 0, 0, 1, 0, 1, 1],
       [0, 0, 1, 0, 2, 1, 0, 1, 1],
       [1, 0, 0, 1, 0, 1, 1, 0, 0],
       [0, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)
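
To read the count matrix, each column has to be matched with its token. A minimal sketch, using the fitted vectorizer's `vocabulary_` attribute (a dictionary from token to column index); the variable names `column_of` and `counts` are only illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    'This is the first tweet.',
    'This is the second second tweet.',
    'And the third one.',
    'Is this the first tweet?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)

# vocabulary_ maps each token to its column index in X
column_of = vectorizer.vocabulary_

# Pair the non-zero counts of the second tweet with their tokens
row = X.toarray()[1]
counts = {token: int(row[col]) for token, col in column_of.items() if row[col] > 0}
print(counts)
```

For the second tweet this shows that "second" is the only token counted twice.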

### transform
`.transform` converts new text into a count vector using the vocabulary learned during `fit`. Words that are not in that vocabulary are ignored.

In [77]:
vectorizer.transform(['Lets assume this is a new tweet']).toarray()

array([[0, 0, 1, 0, 0, 0, 0, 1, 1]])
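
Only three of the seven words in the new tweet show up in the vector. A small sketch of how one might check which words were kept, using `inverse_transform` to map the non-zero columns back to tokens (the names `kept` and `ignored` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    'This is the first tweet.',
    'This is the second second tweet.',
    'And the third one.',
    'Is this the first tweet?',
]
vectorizer = CountVectorizer().fit(tweets)

new_tweet = 'Lets assume this is a new tweet'
X_new = vectorizer.transform([new_tweet])

# inverse_transform lists the vocabulary tokens that are present in each row
kept = {str(token) for token in vectorizer.inverse_transform(X_new)[0]}
# Everything else was unseen during fit, or too short for the default tokenizer
ignored = [word for word in new_tweet.lower().split() if word not in kept]

print(sorted(kept))   # ['is', 'this', 'tweet']
print(ignored)        # ['lets', 'assume', 'a', 'new']
```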

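### Bi-Grams and Tri-Grams

An n-gram is a sequence of n consecutive tokens. `ngram_range=(1, 2)` extracts single words and bi-grams; `ngram_range=(1, 3)` also adds tri-grams.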

In [78]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)

X = bigram_vectorizer.fit_transform(tweets)
features_bi = bigram_vectorizer.get_feature_names_out().tolist()
print(sorted(features_bi))

['and', 'and the', 'first', 'first tweet', 'is', 'is the', 'is this', 'one', 'second', 'second second', 'second tweet', 'the', 'the first', 'the second', 'the third', 'third', 'third one', 'this', 'this is', 'this the', 'tweet']


In [79]:
trigram_vectorizer = CountVectorizer(ngram_range=(1, 3),  min_df=1)

X = trigram_vectorizer.fit_transform(tweets)
features_tri = trigram_vectorizer.get_feature_names_out().tolist()
print(sorted(features_tri))

['and', 'and the', 'and the third', 'first', 'first tweet', 'is', 'is the', 'is the first', 'is the second', 'is this', 'is this the', 'one', 'second', 'second second', 'second second tweet', 'second tweet', 'the', 'the first', 'the first tweet', 'the second', 'the second second', 'the third', 'the third one', 'third', 'third one', 'this', 'this is', 'this is the', 'this the', 'this the first', 'tweet']
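
To make the feature lists above less mysterious, here is a hand-written sketch of how word n-grams can be enumerated for one tweet. It is not how scikit-learn implements it internally; the function name `word_ngrams` is hypothetical.

```python
def word_ngrams(tokens, n_min, n_max):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = 'and the third one'.split()
print(word_ngrams(tokens, 1, 2))
# ['and', 'the', 'third', 'one', 'and the', 'the third', 'third one']
```

These are the same uni-grams and bi-grams that the third tweet contributes to the bi-gram vocabulary.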


#### Problems:
1. Longer documents will have higher average count values than shorter documents.
2. Some words are very common, e.g. 'the', 'and' and 'is', and will automatically have higher counts.

#### Solution:
1. Weight each count by Term Frequency times Inverse Document Frequency (TF-IDF).

### TF-IDF  
Term frequency (TF) is the number of times a word appears in a document, or in this case a tweet. If the word "kenya" appears twice in a tweet, then the term frequency of "kenya" is 2.

Term frequency can be taken as the raw count of a term, but it is often divided by the total number of words in the document to account for document length. For example, instead of saying that the word "kenya" was used 2 times, TF says that it made up 20% of the words in the tweet (2 out of 10).
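
As a rough sketch of the length-adjusted version, the relative term frequency of each word can be computed by hand. The helper `term_frequencies` is hypothetical; note that scikit-learn's `TfidfTransformer` starts from raw counts and normalizes each row afterwards instead of dividing by the document length.

```python
from collections import Counter

def term_frequencies(text):
    """Return each token's share of all tokens in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}

tf = term_frequencies('this is the second second tweet')
print(round(tf['second'], 3))  # 2 of 6 tokens -> 0.333
print(round(tf['tweet'], 3))   # 1 of 6 tokens -> 0.167
```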


<img src="images/tf.png" height="32" width="40%" align="left">

To take care of the second problem, we use the Inverse Document Frequency (IDF).
The IDF factor diminishes the weight of terms that occur in many documents of the collection and increases the weight of terms that occur in only a few.


<img src="images/idf.png" height="42" width="50%" align="left">

In [80]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X)
X_train_tfidf.shape

(4, 31)
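
The counting and weighting steps can also be done in one go. A short sketch, assuming the default settings of both classes, showing that `TfidfVectorizer` gives the same matrix as `CountVectorizer` followed by `TfidfTransformer`:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

tweets = [
    'This is the first tweet.',
    'This is the second second tweet.',
    'And the third one.',
    'Is this the first tweet?',
]

# Two steps: count uni-, bi- and tri-grams, then reweight with TF-IDF
counts = CountVectorizer(ngram_range=(1, 3)).fit_transform(tweets)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer combines both
one_step = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(tweets)

print(one_step.shape)                                       # (4, 31)
print(np.allclose(two_step.toarray(), one_step.toarray()))  # True

# By default each row is scaled to unit (L2) length
row_norms = np.linalg.norm(one_step.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))                          # True
```

The unit-length rows are what addresses the first problem listed earlier: long and short tweets end up on the same scale.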

We now apply these text processing steps to a real task. The overall goal is to predict who wrote a given tweet.


# Author Attribution


In [138]:
# Imports
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.utils import shuffle
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


## Train

### a.) Load the Data

In [156]:
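# Load each user's tweets and give both frames the same column names
user1 = pd.read_csv('datasets/Miss_Wanza_tweets.csv')
user1.columns = ['id', 'timestamp', 'tweet']
user2 = pd.read_csv('datasets/LewisMunyi_tweets.csv')
user2.columns = ['id', 'timestamp', 'tweet']

# Label the author of each tweet: 0 for user1, 1 for user2
user1['Name'] = 0
user2['Name'] = 1

# Combine both users and shuffle the rows so the authors are mixed
collectiveTweets = pd.concat([user1, user2])
collectiveTweets = shuffle(collectiveTweets)
target_names = ['Rosianah', 'Lewis']  # list index matches the label in 'Name'
collectiveTweets.head(5)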

|      | id | timestamp | tweet | Name |
|------|----|-----------|-------|------|
| 8536 | 1055716406564716546 | 2018-10-26 07:03:00 | `b'@mkmuigai @Migwination Na Selina Gomez'` | 1 |
| 6936 | 1054967240322355201 | 2018-10-24 05:26:05 | `b"@Mwihaki_ @kuirab You're not the only one. I...` | 1 |
| 3276 | 1031474152790736903 | 2018-08-20 09:32:56 | `b"RT @gitweeta: Kenya's geography according to...` | 1 |
| 2449 | 1013369290844770304 | 2018-07-01 10:30:41 | `b'RT @Mr_DrinksOnMe: The pattern on these shoe...` | 1 |
| 9154 | 918136202108198913 | 2017-10-11 15:28:23 | `b"RT @iam_bett: There's life away from Raila O...` | 1 |


### b.) Split the data into training and test sets

In [158]:
X = collectiveTweets['tweet']
y = collectiveTweets['Name']

# Hold out the last 1000 tweets for testing. The rows were already shuffled
# above, so shuffle=False gives the same split as slicing X[:-1000] and X[-1000:].
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, shuffle=False)
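
If one author has far more tweets than the other, a stratified split keeps the author proportions the same in the training and test sets. A minimal sketch on hypothetical toy labels (not the tweet dataset), using the `stratify` parameter of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 20 tweets by author 0 and 80 by author 1
X_toy = np.array([f'tweet {i}' for i in range(100)])
y_toy = np.array([0] * 20 + [1] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=0)

# Count how many tweets of each author landed in each set
print(np.bincount(y_tr), np.bincount(y_te))
```

Both sets keep the original 20/80 ratio between the two authors.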

### c.) Create and Train a classifier
#### Feature Extraction

In [159]:
# Occurrences: count how often each token appears in each tweet
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape



(12951, 8878)

In [160]:
# Frequencies: reweight the raw counts with TF-IDF

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(12951, 8878)

In [164]:
# Train a logistic regression classifier on the TF-IDF features
classifier = LogisticRegression()
clf = classifier.fit(X_train_tfidf, y_train)
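
The three steps (counting, TF-IDF weighting, classification) can be chained so that raw text goes in and predictions come out. A minimal sketch using scikit-learn's `Pipeline`; the toy tweets, labels and step names are hypothetical and only stand in for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: author 0 tweets about data, author 1 about football
toy_tweets = [
    'machine learning is fun', 'i love data science', 'new model trained today',
    'football tonight', 'what a goal', 'great match today',
]
toy_labels = [0, 0, 0, 1, 1, 1]

text_clf = Pipeline([
    ('counts', CountVectorizer()),    # text -> token counts
    ('tfidf', TfidfTransformer()),    # counts -> TF-IDF weights
    ('clf', LogisticRegression()),    # TF-IDF weights -> author label
])

# fit() runs fit_transform on each step in order, then trains the classifier
text_clf.fit(toy_tweets, toy_labels)

# predict() applies transform on each step, so raw strings can be passed in
pred = text_clf.predict(['learning data science'])
print(pred)
```

A pipeline also removes the risk of calling `fit_transform` on the test data by mistake, since the fitted steps are reused automatically.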

## Test

In [165]:
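# Reuse the vocabulary and IDF weights learned on the training set:
# call transform here, not fit_transform
X_tests_counts = count_vect.transform(X_test)
X_tests_tfidf = tfidf_transformer.transform(X_tests_counts)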
expected  = y_test
predicted = clf.predict(X_tests_tfidf)
print("Accuracy of our model is:\n%s" % metrics.accuracy_score(expected, predicted))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

Accuracy of our model is:
0.975
Confusion matrix:
[[144  24]
 [  1 831]]
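
Accuracy alone hides how each author is handled. The sketch below recomputes accuracy from the confusion matrix above and adds per-class recall and precision. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the output above: rows = true label, columns = predicted
cm = np.array([[144, 24],
               [1, 831]])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)        # per true class
precision = np.diag(cm) / cm.sum(axis=0)     # per predicted class

print(accuracy)             # 0.975
print(recall.round(3))      # [0.857 0.999]
print(precision.round(3))   # [0.993 0.972]
```

Class 0 (Rosianah) has only 168 test tweets, and 24 of them are attributed to Lewis, so recall for that class is noticeably lower than the overall accuracy suggests.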


## Apply

In [176]:
#Predicting Outcome
tweet1 = 'Food'
tweet2 = 'Machine Learning'
tweet3 = 'Nigga'
tweet4 = 'Lol'
tweet5 = 'I love you'
tweet6 = 'Go to hell'
tweet7 = 'Yaaay'
tweet8 = 'Nice'

tweets_new = [tweet1, tweet2, tweet3, tweet4,tweet5,tweet6, tweet7, tweet8]
X_new_counts = count_vect.transform(tweets_new)

X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for tw, category in zip(tweets_new, predicted):
    print('\n{} ===> {}'.format(tw, target_names[category]))



Food ===> Rosianah

Machine Learning ===> Rosianah

Nigga ===> Lewis

Lol ===> Lewis

I love you ===> Lewis

Go to hell ===> Lewis

Yaaay ===> Lewis

Nice ===> Lewis
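
Very short inputs deserve some caution. If none of a tweet's words are in the training vocabulary, its feature vector is all zeros and the prediction depends only on the model's intercept. A sketch of that effect on hypothetical toy data (not the tweets used above):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with more tweets from author 1 than from author 0
toy_tweets = ['good morning kenya', 'traffic in nairobi today',
              'nairobi rain again', 'coffee time']
toy_labels = [0, 1, 1, 1]

toy_count_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
X_toy = toy_tfidf.fit_transform(toy_count_vect.fit_transform(toy_tweets))
toy_clf = LogisticRegression().fit(X_toy, toy_labels)

# 'Yaaay' never appeared in training, so every feature is zero
X_unknown = toy_tfidf.transform(toy_count_vect.transform(['Yaaay']))
print(X_unknown.toarray().any())                 # False

# With no features, the decision score is just the intercept
score = toy_clf.decision_function(X_unknown)
print(np.allclose(score, toy_clf.intercept_))    # True
```

Predictions for such inputs say more about the class balance of the training data than about the text itself.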
