# `AA Workshop 14` - Coding Challenge

Complete the tasks below to practice text mining techniques from `W14_Textmining.ipynb`.

Guidelines:
- Work in order. Run each cell after editing with Shift+Enter.
- Keep answers short; focus on making things work.
- If a step fails, read the error and fix it.

By the end you will have exercised:
- transforming text documents to numerical vectors
- implementing classification models based on text data

## Task 1 - Text Classification for Movie Data

Let's practice text mining using a dataset of movie reviews. Your goal is to predict the sentiment of text documents, i.e., whether a text has a positive or negative connotation. We will use movie review text data from Rotten Tomatoes (https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data), which can be found in the data folder (`movies.tsv`).

Complete the following steps:
- load and understand the data (what is the target?)
- construct a bag-of-words model (tokenization, preprocessing)
- build and evaluate a classification model

### Load and explore data

In [None]:
# load the data
import pandas as pd

data=pd.read_csv('../data/movies.tsv', sep='\t')
data.head(20)

In [None]:
# explore the dataframe
data.info()

In [None]:
# sentiment distribution
# 0 = negative, 1 = somewhat negative, 2 = neutral, 3 = somewhat positive, 4 = positive
data.Sentiment.value_counts()

In [None]:
# visualize sentiment distribution 

from matplotlib import pyplot as plt

Sentiment_count=data.groupby('Sentiment').count()
plt.bar(Sentiment_count.index.values, Sentiment_count['Phrase'])
plt.xlabel('Review Sentiments')
plt.ylabel('Number of Reviews')
plt.show()

### Bag of Words Model

In [None]:
from sklearn.feature_extraction.text import CountVectorizer # convert a collection of text documents to a matrix of token counts
from nltk.tokenize import word_tokenize

# lowercase=True - convert all characters to lowercase
# stop_words='english' - remove all stopwords based on the english language
# ngram_range = (1,1) - only consider unigrams, i.e. single words
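# tokenizer=word_tokenize - split text with NLTK's word tokenizer
#   (assumes the NLTK Punkt tokenizer data has already been downloaded via nltk.download)
cv = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 1), tokenizer=word_tokenize)
text_counts = cv.fit_transform(data['Phrase'])

# Document-term matrix (rows = documents, columns = terms), stored as a sparse matrix.
# Each printed entry is (document index, term index) followed by the raw count.
print(text_counts)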
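To see what the bag-of-words step produces, it can help to run it on a corpus small enough to read by eye. The sketch below is only an illustration: the three documents are made up (they are not from `movies.tsv`), and it uses scikit-learn's default tokenizer instead of `word_tokenize` so that it runs without NLTK data. The names `toy_docs`, `toy_cv`, `toy_counts`, and `vocab` are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from movies.tsv
toy_docs = [
    "a good movie",
    "a bad movie",
    "good good fun",
]

# Default tokenizer keeps tokens of two or more characters, so "a" is dropped
toy_cv = CountVectorizer(lowercase=True, ngram_range=(1, 1))
toy_counts = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)

# Dense view: one row per document, one column per term, values are counts
print(toy_counts.toarray())
```

Each row is one document and each column one vocabulary term, so the third row has a 2 in the column for "good". Calling `.toarray()` is only practical for small examples; the full review matrix should stay sparse.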

### Model Building and Evaluation

In [None]:
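from sklearn.model_selection import train_test_split

# hold out 30% of the phrases for testing; a fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    text_counts, data['Sentiment'], test_size=0.3, random_state=1)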

The multinomial Naive Bayes classifier is suitable for classification with
discrete features (e.g., word counts for text classification). The
multinomial distribution normally requires integer feature counts. However,
in practice, fractional counts such as tf-idf may also work.
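As a rough illustration of the paragraph above, the sketch below fits a multinomial Naive Bayes model on a hand-made count matrix with two terms and two classes. The data and the names `X_toy`, `y_toy`, `toy_nb`, and `proba` are invented for this example; only `MultinomialNB` and its `alpha` smoothing parameter come from scikit-learn.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: columns are counts of ["bad", "good"]
X_toy = np.array([
    [3, 0],
    [2, 1],
    [0, 3],
    [1, 2],
])
y_toy = np.array([0, 0, 1, 1])  # 0 = negative, 1 = positive

# alpha=1.0 is Laplace smoothing, which avoids zero probabilities for unseen terms
toy_nb = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)

# A new document that contains "good" twice and "bad" not at all
new_doc = np.array([[0, 2]])
print(toy_nb.predict(new_doc))
proba = toy_nb.predict_proba(new_doc)
print(proba.round(3))

# Fractional, non-negative features (as tf-idf would produce) are accepted as well
toy_nb_frac = MultinomialNB().fit(X_toy * 0.5, y_toy)
print(toy_nb_frac.predict(new_doc))
```

With smoothing, the per-class term probabilities are 6/8 and 2/8, so the new document gets a posterior of 0.9 for the positive class.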

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
print("MultinomialNB Accuracy:",metrics.accuracy_score(y_test, predicted))
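Accuracy on its own can look reasonable even when a model mostly predicts the most frequent class, so it may be worth checking a confusion matrix as well. The sketch below uses invented labels that are skewed towards class 2 (neutral); whether the real data is skewed in this way depends on the sentiment distribution computed earlier. The names `y_true_toy`, `y_pred_toy`, `acc`, and `cm` are placeholders.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels, skewed towards class 2 (neutral)
y_true_toy = np.array([2, 2, 2, 2, 2, 2, 1, 3, 0, 4])
# A degenerate "model" that always predicts neutral
y_pred_toy = np.full_like(y_true_toy, 2)

acc = metrics.accuracy_score(y_true_toy, y_pred_toy)
print("Accuracy:", acc)

# Rows are true classes, columns are predicted classes (ordered 0..4)
cm = metrics.confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2, 3, 4])
print(cm)
```

Here the accuracy is 0.6 although every non-neutral example is misclassified, which shows up as a single filled column in the confusion matrix. The same two calls can be applied to `y_test` and `predicted` from the cell above.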

### Try again but use TF-IDF instead of Total Count

In [None]:
# try again but use tf-idf instead of total counts
from sklearn.feature_extraction.text import TfidfVectorizer

tf=TfidfVectorizer(lowercase=True,stop_words='english',ngram_range = (1,1),tokenizer = word_tokenize)
text_tf= tf.fit_transform(data['Phrase'])

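# hold out 30% of the phrases for testing; a fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    text_tf, data['Sentiment'], test_size=0.3, random_state=1)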

clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
print("MultinomialNB Accuracy (TF-IDF):", metrics.accuracy_score(y_test, predicted))
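In the cell above the vectorizer is fitted on all phrases before the split, so the IDF weights also reflect the test documents. One possible alternative, sketched here on invented toy data, is to wrap the vectorizer and classifier in a scikit-learn pipeline so that the vectorizer is fitted on the training documents only. The documents, labels, and the names `train_docs`, `test_docs`, `toy_model`, and `toy_pred` are assumptions made for illustration, and the default tokenizer is used instead of `word_tokenize`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not taken from movies.tsv
train_docs = ["good movie", "great fun", "bad movie", "awful plot"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative
test_docs = ["good fun", "awful movie"]

# fit() learns the vocabulary and IDF weights from the training documents only
toy_model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
toy_model.fit(train_docs, train_labels)

# predict() transforms the test documents with the already fitted vectorizer
toy_pred = toy_model.predict(test_docs)
print(toy_pred)
```

For the workshop data, the same idea would mean splitting `data['Phrase']` and `data['Sentiment']` first and then fitting the pipeline on the training part.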

----