# Logistic Regression-Based Sentiment Analysis
In this notebook, we explore the use of logistic regression (LR) for sentiment analysis. We will not only use an existing LR implementation to perform the classification but, more importantly, also look at how to select features and study how those features influence model performance.

In [None]:
# obtain and split data; the corpus must be available locally
# (run nltk.download('movie_reviews') once if it is not)
from nltk.corpus import movie_reviews
import random
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# 2000 reviews in total: 1200 train / 400 dev / 400 test
train_data = documents[:1200]
dev_data = documents[1200:1600]
test_data = documents[1600:]

In [None]:
# use TfidfVectorizer to represent text and logistic regression as the classifier;
# with use_idf=False the vectors are L2-normalized term frequencies (no idf weighting)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tfidf_vectorizer = TfidfVectorizer(use_idf=False)
train_text = [' '.join(tokens) for (tokens,label) in train_data]
train_labels = [label for (tokens,label) in train_data]
tfidf_vectorizer.fit(train_text)
train_vecs = tfidf_vectorizer.transform(train_text)
clf = LogisticRegression().fit(train_vecs, train_labels)

dev_text = [' '.join(tokens) for (tokens,label) in dev_data]
dev_vecs = tfidf_vectorizer.transform(dev_text)
dev_pred_labels = clf.predict(dev_vecs)
dev_true_labels = [label for (tokens,label) in dev_data]

from sklearn.metrics import accuracy_score, precision_recall_fscore_support
print('acc', accuracy_score(dev_true_labels, dev_pred_labels))
print('precision, recall, f1, support', precision_recall_fscore_support(dev_true_labels, dev_pred_labels, average=None, labels=['pos', 'neg']))
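
The call above passes `average=None`, so `precision_recall_fscore_support` returns four arrays rather than four numbers, with one entry per class in the order given by `labels`. As a minimal sketch on made-up labels (the toy `y_true` and `y_pred` lists are illustrative assumptions, not notebook data), the result can be unpacked and reported per class:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy gold and predicted labels, invented for illustration only.
y_true = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
y_pred = ['pos', 'neg', 'neg', 'neg', 'pos', 'pos']
labels = ['pos', 'neg']

# With average=None, each returned array has one value per entry of `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, labels=labels)

for i, label in enumerate(labels):
    print(f'{label}: precision={precision[i]:.3f} recall={recall[i]:.3f} '
          f'f1={f1[i]:.3f} support={support[i]}')
```

Reporting the scores per class makes it easier to spot a model that does well on one polarity and poorly on the other, which a single accuracy figure can hide.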


## Selecting Features
The code above uses scikit-learn's tf-idf vectorizer to represent texts. The *TfidfVectorizer* class lets you customize the features used in the vectors. Its documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). In particular, consider experimenting with the following options to find which combination yields the best performance on the dev set.
* **lowercase**: bool (default=True). If True, the vectorizer will convert all characters to lowercase before tokenizing.
* **stop_words**: {‘english’}, list, or None (default=None). Whether to remove stop words. Pass ‘english’ to use the built-in English list, or pass your own list of stop words.
* **ngram_range**: tuple (min_n, max_n), default=(1, 1). This option allows you to specify which n-grams will be used to build the vocabulary. For example, if you let ngram_range=(1,2), your vocabulary will include all uni-grams and bi-grams in your training set.
* **max_df**: float in range \[0.0, 1.0\] or int (default=1.0). You can exclude types that appear in too many documents (i.e. types that have too high document-frequency values). High-df words are often common words and hence are less informative.
* **min_df**: float in range \[0.0, 1.0\] or int (default=1). In contrast to the previous option, this one lets you exclude low-df words from your vocabulary. Low-df words are usually not representative enough; removing them reduces the number of features and hence the dimensionality of the vectors.
* **max_features**: int or None (default=None). If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. Again, this option lets you reduce the number of features.
* **vocabulary**: Mapping or iterable, optional (default=None). This option allows you to specify the vocabulary used to build the vectors. It is particularly useful when vectorizing text in the dev and test sets, because you should use the vocabulary built at training time.
* **binary**: bool (default=False). If True, all non-zero term counts are set to 1.
* **use_idf**: bool (default=True). Whether idf is used.
* **smooth_idf**: bool (default=True). Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
* **sublinear_tf**: bool (default=False). Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
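
To make a few of these options concrete, here is a small sketch on a made-up three-sentence corpus. The `toy_corpus` sentences and the helper names `vocab` and `weight_of_good` are assumptions for illustration only. Note that the default tokenizer drops single-character tokens such as "a", so they never enter the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentences, used only to show how the options change the features.
toy_corpus = [
    "a good movie with a good cast",
    "a bad movie with a weak plot",
    "not a good plot",
]

def vocab(**options):
    """Fit a vectorizer with the given options and return its sorted vocabulary."""
    return sorted(TfidfVectorizer(**options).fit(toy_corpus).vocabulary_)

unigrams = vocab()                        # default: ngram_range=(1, 1)
with_bigrams = vocab(ngram_range=(1, 2))  # adds bi-grams to the uni-grams
frequent_only = vocab(min_df=2)           # keeps terms found in >= 2 documents

print(len(unigrams), unigrams)
print(len(with_bigrams))
print(frequent_only)

def weight_of_good(**options):
    """Weight of 'good' in the first sentence, without idf or normalization."""
    vec = TfidfVectorizer(use_idf=False, norm=None, **options)
    matrix = vec.fit_transform(toy_corpus)
    return matrix[0, vec.vocabulary_['good']]

raw_tf = weight_of_good()                      # 'good' occurs twice
binary_tf = weight_of_good(binary=True)        # non-zero counts become 1
sublinear = weight_of_good(sublinear_tf=True)  # 1 + log(tf)
print(raw_tf, binary_tf, sublinear)
```

Switching off idf and normalization in `weight_of_good` is only a device to expose the raw term weight; in the actual experiments these would normally stay at whatever values you are evaluating.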

In [None]:
# write a function that lets you try different option combinations; for one combination
# it returns the dev-set accuracy, the trained model, and the fitted vectorizer.

def get_model_performance(train_text, train_labels, dev_text, dev_labels, options):
    # apply options (a dict of keyword arguments, or None for the defaults) to the vectorizer
    tfidf_vectorizer = TfidfVectorizer(**(options or {}))
    tfidf_vectorizer.fit(train_text)
    train_vecs = tfidf_vectorizer.transform(train_text)
    clf = LogisticRegression().fit(train_vecs, train_labels)
    
    dev_vecs = tfidf_vectorizer.transform(dev_text)
    dev_pred_labels = clf.predict(dev_vecs)

    return accuracy_score(dev_labels, dev_pred_labels), clf, tfidf_vectorizer


train_text = [' '.join(tokens) for (tokens,label) in train_data]
train_labels = [label for (tokens,label) in train_data]
dev_text = [' '.join(tokens) for (tokens,label) in dev_data]
dev_labels = [label for (tokens,label) in dev_data]

option_combos = [None]  # try different combinations of options (None: vectorizer defaults)
best_acc = -1
best_clf = None
best_vectorizer = None
for option in option_combos:
    acc, clf, vectorizer = get_model_performance(train_text, train_labels, dev_text, dev_labels, option)
    if acc > best_acc:
        best_acc = acc
        best_clf = clf
        best_vectorizer = vectorizer
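
One possible way to build the list of combinations is sketched below, assuming each combination is a dict of keyword arguments forwarded to `TfidfVectorizer`. The search space, the helper `evaluate_options`, and the tiny stand-in texts are all hypothetical; the values are illustrative and not tuned for the movie review data.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical search space: option name -> candidate values.
search_space = {
    'ngram_range': [(1, 1), (1, 2)],
    'sublinear_tf': [False, True],
    'min_df': [1, 2],
}

# One dict of keyword arguments per combination (2 * 2 * 2 = 8 here).
option_grid = [dict(zip(search_space, values))
               for values in product(*search_space.values())]

def evaluate_options(train_text, train_labels, dev_text, dev_labels, options):
    """Train with one option combination and return the dev-set accuracy."""
    vectorizer = TfidfVectorizer(**options)
    train_vecs = vectorizer.fit_transform(train_text)
    clf = LogisticRegression().fit(train_vecs, train_labels)
    dev_pred = clf.predict(vectorizer.transform(dev_text))
    return accuracy_score(dev_labels, dev_pred)

# Tiny stand-in data so the sketch runs on its own; replace with the real splits.
toy_train_text = ["a great and moving film", "great acting and a great story",
                  "a dull and boring film", "boring plot and dull acting"]
toy_train_labels = ['pos', 'pos', 'neg', 'neg']
toy_dev_text = ["great film", "dull story"]
toy_dev_labels = ['pos', 'neg']

results = [(evaluate_options(toy_train_text, toy_train_labels,
                             toy_dev_text, toy_dev_labels, options), options)
           for options in option_grid]

# Sort by accuracy only, so ties never fall back to comparing dicts.
for acc, options in sorted(results, key=lambda pair: pair[0], reverse=True):
    print(f'{acc:.3f}', options)
```

The grid grows multiplicatively with every option added, so it is usually worth starting with two or three options and a few values each, then refining around the best combination on the dev set.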

In [None]:
# use the best obtained model and vectorizer to predict test data
test_text = [' '.join(tokens) for (tokens,label) in test_data]
test_vecs = best_vectorizer.transform(test_text)
test_pred_labels = best_clf.predict(test_vecs)
test_true_labels = [label for (tokens, label) in test_data]

print('acc', accuracy_score(test_true_labels, test_pred_labels))
print('precision, recall, f1, support', precision_recall_fscore_support(test_true_labels, test_pred_labels, average=None, labels=['pos', 'neg']))

