# SMS Filtering with linear models
In this exercise you will learn how to classify SMS messages as spam or not spam. In the code below, spam messages carry the label `pos` and non-spam messages carry the label `neg`.

### Libraries

In this task you will need the following libraries:
- [scikit-learn](http://scikit-learn.org/stable/index.html) — a tool for data mining and data analysis.
- [NLTK](http://www.nltk.org) — a platform to work with natural language.

In [46]:
import nltk
import numpy as np

import random
import re

## Data
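The data consists of SMS messages stored in two text files: `not_spam_sms.txt` (labeled `neg`) and `spam_sms.txt` (labeled `pos`). Each line of a file is treated as one message.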

In [47]:
with open("not_spam_sms.txt", "r") as neg_file:
    neg_data = [(line, "neg") for line in neg_file]
with open("spam_sms.txt", "r") as pos_file:
    pos_data = [(line, "pos") for line in pos_file]

data = neg_data + pos_data
random.shuffle(data)

### Text preprocessing
In this section you need to prepare your data. Complete the functions as instructed.

In [48]:
def lower_case_text(in_text):
    out_text = in_text.lower()
    return out_text

Replace the following symbols with a space in the text: '[@,;/(){}\[\]\|]'

In [49]:
def rep_symbols_text(in_text):
    pattern = r"[@,;/(){}\[\]\|]"
    out_text = re.sub(pattern, " ", in_text)
    return out_text

Remove everything except lowercase letters, the digits 0 to 9, and spaces.<br>
Hint:<br>
1- Define a regular expression pattern with a character class that matches any character other than the ones you want to keep.<br>
2- Replace the matched characters with an empty string.

In [50]:
def clean_text(in_text):
    pattern = r"[^\d\s\w]"
    out_text = re.sub(pattern, "", in_text)
    return out_text
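
The pattern in `clean_text` keeps everything that `\w` matches, which in Python 3 also includes underscores and non-ASCII letters. If you want to follow the instruction above literally, a stricter character class is one option. The sketch below compares the two; `clean_text_strict` and the sample string are hypothetical and not part of the exercise.

```python
import re

def clean_text_strict(in_text):
    # Hypothetical stricter variant: keep only a-z, 0-9 and whitespace.
    return re.sub(r"[^a-z0-9\s]", "", in_text)

sample = "café_bar! call 0906"

# Pattern used in clean_text: only the "!" is removed.
loose = re.sub(r"[^\d\s\w]", "", sample)
# Stricter pattern: the underscore and the accented letter are removed too.
strict = clean_text_strict(sample)

print(loose)
print(strict)
```

Both patterns give the same result on plain lowercase ASCII text, so the basic tests below do not distinguish them.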

Remove English stopwords from the text.

In [51]:
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords_from_text(in_text):
    out_text = " ".join(w for w in in_text.split() if w not in STOPWORDS)
    return out_text

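Combine the steps above into a single preprocessing function. Stemming is left commented out here and is applied separately at the end of the notebook.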

In [52]:
def text_preprocess(text):
    text = lower_case_text(text)
    text = rep_symbols_text(text)
    text = clean_text(text)
    text = remove_stopwords_from_text(text)
#     text = stem_text(text)
    return text

In [53]:
text_preprocess("Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out! ")

'congrats 1 year special cinema pass 2 call 09061209465 c suprman v matrix3 starwars3 etc 4 free bx420ip45we 150pm dont miss'

In [54]:
def test_text_prepare():
    examples = ["I'm back &amp; we're packing the CAR now, I'll let you know if there's \"room\"", 
                "I agree that this is the best Michael Myers-based \"Halloween\" movie since 1981's \"Halloween II.\""]
    answers = ["im back amp packing car ill let know theres room", 
               "agree best michael myersbased halloween movie since 1981s halloween ii"]
    for ex, ans in zip(examples, answers):
        if text_preprocess(ex) != ans:
            print(text_preprocess(ex))
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

In [55]:
print(test_text_prepare())

Basic tests are passed.


Normalize all messages in the data list by preprocessing the text part of each (text, label) tuple.

In [56]:
######### YOUR CODE HERE #############

norm_data = [(text_preprocess(t),c) for (t,c) in data]


Split the data into a training set (the first 70%) and a test set (the remaining 30%).

In [57]:
training_set_size = int(len(norm_data)*0.70)
X = [example[0] for example in norm_data]
y = [example[1] for example in norm_data]

X_train = X[:training_set_size]
y_train = y[:training_set_size]

X_test = X[training_set_size:]
y_test = y[training_set_size:]
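
The slice-based split above relies on the earlier `random.shuffle` call to mix the two classes. As an alternative, scikit-learn's `train_test_split` can shuffle and keep the class proportions equal in both parts through its `stratify` argument. The sketch below runs on made-up toy data; the `toy_*` names and the `random_state` value are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Toy data: 4 "pos" and 16 "neg" examples (placeholders, not the SMS corpus).
toy_X = ["message %d" % i for i in range(20)]
toy_y = ["pos"] * 4 + ["neg"] * 16

# stratify=toy_y keeps the pos/neg ratio the same in both parts.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

print(len(toy_X_train), len(toy_X_test))
print(toy_y_train.count("pos"), toy_y_test.count("pos"))
```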

### Transforming text to a vector

Machine Learning algorithms work with numeric data and we cannot use the provided text data "as is". There are many ways to transform text data to numeric vectors. In this task you will try to use two of them.

#### Bag of words

One of the well-known approaches is a *bag-of-words* representation. To create this transformation, you can generate a document-term matrix with the CountVectorizer class from *scikit-learn*. Use the *train* corpus to fit the vectorizer. Don't forget to take a look at the arguments that you can pass to it. We suggest that you use bigrams along with unigrams in your vocabulary.
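
To make the suggestion about bigrams concrete, here is a minimal sketch on two made-up messages. It assumes `ngram_range=(1, 2)` is the argument you want; the `toy_docs` data is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, used only to show the shape of the output.
toy_docs = ["free call now", "call me now"]

# ngram_range=(1, 2) adds word pairs (bigrams) to the single-word vocabulary.
toy_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Each column is one unigram or bigram; each row is one message.
print(sorted(toy_vectorizer.vocabulary_))
print(toy_counts.toarray())
```

The vocabulary here has four unigrams and four bigrams, so the matrix has eight columns. On a real corpus, bigrams can enlarge the vocabulary considerably.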


In [58]:
from sklearn.feature_extraction.text import CountVectorizer

In [59]:
def bow_features(X_train, X_test):
    """
        X_train, X_test: samples
        return the bag-of-words representation of each sample and the vocabulary
    """
    # Create a CountVectorizer (default parameters here, so unigrams only)
    # Fit the vectorizer on the train set
    # Transform the train and test sets and return the result

    ####### YOUR CODE HERE #######
    count_vectorizer = CountVectorizer()
    X_train_Vec = count_vectorizer.fit_transform(X_train)
    X_test_Vec = count_vectorizer.transform(X_test)
    
    return X_train_Vec, X_test_Vec, count_vectorizer.vocabulary_

In [60]:
X_train_bag, X_test_bag, bow_vocab = bow_features(X_train, X_test)

In [61]:
X_train_bag

<3901x7471 sparse matrix of type '<class 'numpy.int64'>'
	with 33083 stored elements in Compressed Sparse Row format>

#### TF-IDF

The second approach extends the bag-of-words framework by taking into account how often each word occurs across the whole corpus. It down-weights words that appear in many messages, which gives a better feature space.

Implement the function *tfidf_features* using the class [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from *scikit-learn*. Use the *train* corpus to fit the vectorizer. Don't forget to take a look at the arguments that you can pass to it. We suggest that you filter out words that are too rare (occurring in fewer than 5 messages) and words that are too frequent (occurring in more than 90% of the messages). Also, use bigrams along with unigrams in your vocabulary.
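
The effect of the `min_df` and `max_df` filters is easiest to see on a tiny corpus. The sketch below uses four made-up messages and thresholds chosen to fit that size (`min_df=2`, `max_df=0.9`); both the data and the exact values are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up messages. "now" occurs in all of them, "win" and "you" in two,
# and every other word in exactly one.
toy_docs = ["win prize now", "win cash now", "see you now", "call you now"]

# min_df=2 drops words seen in fewer than 2 messages;
# max_df=0.9 drops words seen in more than 90% of the messages.
toy_tfidf = TfidfVectorizer(min_df=2, max_df=0.9)
toy_weights = toy_tfidf.fit_transform(toy_docs)

print(sorted(toy_tfidf.vocabulary_))
print(toy_weights.toarray())
```

Only `win` and `you` survive both filters: `now` is too frequent and the remaining words are too rare.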

In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [63]:
def tfidf_features(X_train, X_test):
    """
        X_train, X_test: samples
        return the TF-IDF representation of each sample and the vocabulary
    """
    # Create a TF-IDF vectorizer: drop terms seen in fewer than 2 documents
    # or in more than 50% of the documents, and use unigrams plus bigrams
    # Fit the vectorizer on the train set
    # Transform the train and test sets and return the result

    ####### YOUR CODE HERE #######
    tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
    X_train = tfidf_vectorizer.fit_transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    
    return X_train, X_test, tfidf_vectorizer.vocabulary_

In [76]:
X_train_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_test)

## Classify

In [77]:
from sklearn.linear_model import RidgeClassifier

def train_classifier(X_train, y_train):
    classifier = RidgeClassifier()
    classifier.fit(X_train, y_train)
    return classifier

In [78]:
classifier_bag = train_classifier(X_train_bag, y_train)
classifier_tfidf = train_classifier(X_train_tfidf, y_train)

In [79]:
y_test_predicted_labels_bag = classifier_bag.predict(X_test_bag)
y_test_predicted_scores_bag = classifier_bag.decision_function(X_test_bag)

y_test_predicted_labels_tfidf = classifier_tfidf.predict(X_test_tfidf)
y_test_predicted_scores_tfidf = classifier_tfidf.decision_function(X_test_tfidf)
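
The scores returned by `decision_function` are signed distances from the decision boundary. For a binary problem, `predict` returns the second class in `classes_` whenever the score is positive. One possible use of the raw scores is a threshold-independent metric such as ROC AUC. The sketch below shows both on a one-feature toy problem; the `toy_*` names and data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score

# Toy one-feature problem (placeholder data, not the SMS corpus).
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array(["neg", "neg", "pos", "pos"])

toy_clf = RidgeClassifier().fit(toy_X, toy_y)
toy_scores = toy_clf.decision_function(toy_X)

# classes_ is sorted, so classes_[1] is "pos"; a positive score means "pos".
labels_from_scores = np.where(toy_scores > 0, toy_clf.classes_[1], toy_clf.classes_[0])

# ROC AUC ranks the examples by score instead of using a fixed threshold.
toy_auc = roc_auc_score(toy_y == "pos", toy_scores)

print(toy_scores)
print(labels_from_scores, toy_auc)
```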

In [80]:
for i in range(3):
    print('Review:\t{}\nTrue label:\t{}\nPredicted label BoW:\t{}\nPredicted label Tf_Idf:\t{}\n\n'.format(
        X_test[i][:100] + "...",y_test[i],y_test_predicted_labels_bag[i], y_test_predicted_labels_tfidf[i])
    )

Review:	500 new mobil 2004 must go txt nokia 89545 collect todayfrom 1 www4tcbiz 2optout 08718726270150gbp m...
True label:	pos
Predicted label BoW:	pos
Predicted label Tf_Idf:	pos


Review:	cant pick phone right pl send messag...
True label:	neg
Predicted label BoW:	neg
Predicted label Tf_Idf:	neg


Review:	hey look like wrong one kappa guy number still phone want text see he around...
True label:	neg
Predicted label BoW:	neg
Predicted label Tf_Idf:	neg




In [81]:
from sklearn.metrics import accuracy_score

print('Bag-of-words Accuracy: '+ str(accuracy_score(y_test, y_test_predicted_labels_bag)))
print('Tfidf Accuracy: ' + str(accuracy_score(y_test, y_test_predicted_labels_tfidf)))

Bag-of-words Accuracy: 0.9754931261207412
Tfidf Accuracy: 0.9802749551703527
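
Accuracy on its own can hide mistakes on the less frequent class, and spam is typically the minority in SMS data. A confusion matrix together with precision and recall for the `pos` class gives a fuller picture. The sketch below uses made-up label lists (`toy_true`, `toy_pred`), not this notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only.
toy_true = ["neg", "neg", "neg", "pos", "pos", "neg"]
toy_pred = ["neg", "neg", "pos", "pos", "neg", "neg"]

# Rows are true classes and columns are predicted classes, in the given order.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=["neg", "pos"])

# Precision: share of predicted "pos" that are truly "pos".
toy_precision = precision_score(toy_true, toy_pred, pos_label="pos")
# Recall: share of true "pos" that were found.
toy_recall = recall_score(toy_true, toy_pred, pos_label="pos")

print(toy_cm)
print(toy_precision, toy_recall)
```

To apply the same calls to this notebook, pass `y_test` and one of the predicted label lists in place of the toy lists.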


In [82]:

def print_words_for_tag(classifier, index_to_words):
    top_positive_words = [index_to_words[index] for index in classifier.coef_.argsort().tolist()[0][-5:]]  # top-5 words sorted by the coefficients.
    top_negative_words = [index_to_words[index] for index in classifier.coef_.argsort().tolist()[0][:5]]  # bottom-5 words sorted by the coefficients.
    print('Top Positive Words:\t{}'.format(', '.join(top_positive_words)))
    print('Top Negative Words:\t{}\n'.format(', '.join(top_negative_words)))

print('______________________________\nClassifier: \t {} '.format('Bag of words'))
print_words_for_tag(classifier_bag, {i:word for word,i in bow_vocab.items()})

print('______________________________\nClassifier: \t {} '.format('TF-IDF'))
print_words_for_tag(classifier_tfidf, {i:word for word,i in tfidf_vocab.items()})



______________________________
Classifier: 	 Bag of words 
Top Positive Words:	content, claim, 07090201529, ringtone, voicemail
Top Negative Words:	liked, improved, machan, executive, urgnt

______________________________
Classifier: 	 TF-IDF 
Top Positive Words:	150p, servic, mobil, claim, txt
Top Negative Words:	free call, text your, like new, road, machan



In [92]:
porter = nltk.PorterStemmer()

def stem_text(in_text):
    out_text = " ".join(porter.stem(w) for w in in_text.split())
    return out_text
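
Stemming maps inflected forms to a common stem, which is why truncated tokens such as `mobil` and `servic` appear in the outputs of this notebook. The sketch below shows the effect on a few words; the sample string and the `toy_*` names are chosen for illustration only.

```python
import nltk

toy_porter = nltk.PorterStemmer()

def toy_stem_text(in_text):
    # Same idea as stem_text above: stem every whitespace-separated token.
    return " ".join(toy_porter.stem(w) for w in in_text.split())

# The stems are not always dictionary words.
stemmed = toy_stem_text("mobile message service")
print(stemmed)
```

Because different surface forms collapse to one stem, stemming usually shrinks the vocabulary. Whether that helps accuracy depends on the data.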

In [93]:
new_norm_data = [(stem_text(t),c) for (t,c) in norm_data]


In [94]:
training_set_size = int(len(new_norm_data)*0.70)
new_X = [example[0] for example in new_norm_data]
new_y = [example[1] for example in new_norm_data]

new_X_train = new_X[:training_set_size]
new_y_train = new_y[:training_set_size]

new_X_test = new_X[training_set_size:]
new_y_test = new_y[training_set_size:]

In [95]:
new_X_train_bag, new_X_test_bag, new_bow_vocab = bow_features(new_X_train, new_X_test)

In [96]:
new_X_train_tfidf, new_X_test_tfidf, new_tfidf_vocab = tfidf_features(new_X_train, new_X_test)

In [97]:
classifier_bag = train_classifier(new_X_train_bag, new_y_train)
classifier_tfidf = train_classifier(new_X_train_tfidf, new_y_train)

In [98]:
new_y_test_predicted_labels_bag = classifier_bag.predict(new_X_test_bag)
new_y_test_predicted_scores_bag = classifier_bag.decision_function(new_X_test_bag)

new_y_test_predicted_labels_tfidf = classifier_tfidf.predict(new_X_test_tfidf)
new_y_test_predicted_scores_tfidf = classifier_tfidf.decision_function(new_X_test_tfidf)

In [99]:
from sklearn.metrics import accuracy_score

print('Bag-of-words Accuracy with stem: '+ str(accuracy_score(new_y_test, new_y_test_predicted_labels_bag)))
print('Tfidf Accuracy with stem: ' + str(accuracy_score(new_y_test, new_y_test_predicted_labels_tfidf)))

Bag-of-words Accuracy with stem: 0.9731022115959355
Tfidf Accuracy with stem: 0.9802749551703527
