<font color = green >

## Home task: Sentiment Analysis

</font>

<font color = green >

### Load data 

</font>

[Sentiment Analysis Dataset](https://www.kaggle.com/sonaam1234/sentimentdata)

Alternative sources:  
[rt-polaritydata](https://github.com/dennybritz/cnn-text-classification-tf/tree/master/data/rt-polaritydata)  
[Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data)

Each line in the two files (`rt-polarity.neg` and `rt-polarity.pos`) is a single snippet, usually about one sentence; all snippets are lower-cased.  
[More info about dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt)



In [1]:
import os

CWD = os.getcwd()  # Current working directory
DATA_DIR = os.path.join(CWD, 'data')
NEG_FILE = os.path.join(DATA_DIR, 'rt-polarity.neg')
POS_FILE = os.path.join(DATA_DIR, 'rt-polarity.pos')

In [2]:
with open(NEG_FILE, 'r', encoding='utf-8', errors='ignore') as f:  # errors='ignore' drops the bytes that are not valid UTF-8
    content = f.read()  
texts_neg = content.splitlines()

print('Length of texts_neg = {:,}'.format(len(texts_neg)), end='\n\n')
for i, review in enumerate(texts_neg[:5]):
    print(f'{i + 1}:', review)

Length of texts_neg = 5,331

1: simplistic , silly and tedious . 
2: it's so laddish and juvenile , only teenage boys could possibly find it funny . 
3: exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . 
4: [garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . 
5: a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . 
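
The `errors='ignore'` argument used above silently drops any byte that is not valid UTF-8 instead of raising an error. A minimal sketch of that behaviour on a made-up byte string; the sample bytes and the Latin-1 alternative are assumptions for illustration and are not taken from the dataset:

```python
# Hypothetical sample: "café society" encoded as Latin-1, so the byte 0xE9 is not valid UTF-8
raw = b'caf\xe9 society'

# errors='ignore' silently discards the undecodable byte
ignored = raw.decode('utf-8', errors='ignore')

# Decoding with the encoding the bytes were written in keeps the character
latin1 = raw.decode('latin-1')

print(repr(ignored))  # 'caf society'
print(repr(latin1))   # 'café society'
```

Dropping bytes changes the affected words, so it may be worth checking whether another encoding fits the files better before ignoring errors.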


In [3]:
with open(POS_FILE, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()
texts_pos = content.splitlines()

print('Length of texts_pos = {:,}'.format(len(texts_pos)), end='\n\n')
for i, review in enumerate(texts_pos[:5]):
    print(f'{i + 1}:', review)

Length of texts_pos = 5,331

1: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 
2: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . 
3: effective but too-tepid biopic
4: if you sometimes like to go to the movies to have fun , wasabi is a good place to start . 
5: emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . 


In [4]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [5]:
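nltk.download('punkt', quiet=True)  # Tokenizer models needed by word_tokenize (newer NLTK releases may require 'punkt_tab' instead)
nltk.download('stopwords', quiet=True)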

True

In [6]:
STOPWORDS = set(stopwords.words('english'))

def preprocess(text):
    '''
    Preprocesses `text` and returns a list of its tokens that contain
    at least one alphabetic character and are not stop words
    :param text: text to preprocess
    :type text: str
    :return: a list of words from `text`
    :rtype: list[str]
    '''
    return [
        word for word in word_tokenize(text)
        if any(c.isalpha() for c in word) and word not in STOPWORDS
    ]

# Preprocess negative and positive reviews to get all tokens
all_tokens = []
for text in (texts_neg + texts_pos):
    all_tokens.extend(preprocess(text))

In [7]:
from nltk import FreqDist

# Build a tuple of tokens sorted by descending frequency across all reviews
vocab = FreqDist(all_tokens)
most_common_words, _ = zip(*vocab.most_common())
print(f'Top 10 most common words in reviews: {most_common_words[:10]}')

Top 10 most common words in reviews: ("'s", 'film', 'movie', "n't", 'one', 'like', 'story', 'much', 'even', 'good')
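
Tokens such as `'s` and `n't` reach the top of the list because they contain an alphabetic character and, judging by the output above, are not in the stop-word list. A sketch of the same filter applied to an already tokenized review; the token list and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins for `word_tokenize` and the NLTK stop words:

```python
from collections import Counter

# Hypothetical stand-in for the NLTK English stop-word list
DEMO_STOPWORDS = {'it', 'so', 'and', 'only', 'could'}

# Hand-written tokens imitating how a tokenizer might split off a clitic such as "'s"
demo_tokens = ['it', "'s", 'so', 'laddish', 'and', 'juvenile', ',',
               'only', 'teenage', 'boys', 'could', 'find', 'it', 'funny', '.']

def keep(token):
    # Same condition as in preprocess(): some alphabetic character and not a stop word
    return any(c.isalpha() for c in token) and token not in DEMO_STOPWORDS

kept = [t for t in demo_tokens if keep(t)]
print(kept)
print(Counter(kept).most_common(3))
```

If such fragments are unwanted as features, one option is to add them to the stop-word set before building the vocabulary.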


In [8]:
def word_features(text, n):
    '''
    Converts a sentence into features accepted by the NLTK classifier
    :param text: a sentence
    :type text: str
    :param n: number of most common words from `most_common_words` to use
    :type n: int
    :return: a dictionary mapping words from `most_common_words` to bool values:
    True if the word occurs in `text`, False otherwise (a so-called featureset)
    :rtype: dict[str, bool]
    '''
    tokens = set(preprocess(text))  # Preprocess the text to get its words (tokens)
    return {word: word in tokens for word in most_common_words[:n]}

# Construct data from negative and positive reviews as featuresets
text_data = [
    (word_features(text, 2000), 'neg') for text in texts_neg
] + [
    (word_features(text, 2000), 'pos') for text in texts_pos
]
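
To make the featureset format concrete, here is a sketch with a four-word vocabulary. `demo_vocab` and `demo_word_features` are hypothetical names that stand in for `most_common_words[:n]` and `word_features`:

```python
# Hypothetical vocabulary standing in for most_common_words[:n]
demo_vocab = ('film', 'movie', 'good', 'boring')

def demo_word_features(tokens, vocab):
    # Map every vocabulary word to True/False depending on whether the review contains it
    token_set = set(tokens)
    return {word: word in token_set for word in vocab}

demo_features = demo_word_features(['boring', 'film', 'plot'], demo_vocab)
print(demo_features)
# {'film': True, 'movie': False, 'good': False, 'boring': True}
```

Every featureset has exactly one key per vocabulary word, so words outside the vocabulary (here `plot`) are ignored.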

In [9]:
import random
random.seed(0)

# Split text_data into training and test sets (80% for training set, 20% for test set)
random.shuffle(text_data)
threshold = int(0.8 * len(text_data))
text_train, text_test = text_data[:threshold], text_data[threshold:]

print('Total number of samples:', len(text_data))
print('Number of training samples:', len(text_train))
print('Number of test samples:', len(text_test))

Total number of samples: 10662
Number of training samples: 8529
Number of test samples: 2133
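
The sizes above follow from truncation: 0.8 × 10,662 = 8,529.6, and `int()` rounds down to 8,529. A sketch of the same shuffle-and-cut split as a reusable function; `shuffle_split` is a hypothetical helper that uses a local `random.Random`, so the global seed is left untouched:

```python
import random

def shuffle_split(items, train_fraction, seed):
    # Shuffle a copy with a local RNG, then cut at the truncated training size
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    threshold = int(train_fraction * len(shuffled))
    return shuffled[:threshold], shuffled[threshold:]

train, test = shuffle_split(range(10662), 0.8, seed=0)
print(len(train), len(test))  # 8529 2133
```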


In [10]:
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy

# Use Naive Bayes to perform review classification 
clf_nb = NaiveBayesClassifier.train(text_train)

print('Naive Bayes:')
print(f'accuracy: {nltk_accuracy(clf_nb, text_test):.4f}', end='\n\n')

# Find top 10 most informative features (words) from most_common_words
clf_nb.show_most_informative_features(10)

Naive Bayes:
accuracy: 0.7431

Most Informative Features
              engrossing = True              pos : neg    =     19.7 : 1.0
                    warm = True              pos : neg    =     19.7 : 1.0
                 generic = True              neg : pos    =     15.6 : 1.0
             examination = True              pos : neg    =     14.4 : 1.0
                 routine = True              neg : pos    =     14.3 : 1.0
               inventive = True              pos : neg    =     13.7 : 1.0
                  boring = True              neg : pos    =     13.0 : 1.0
                    flat = True              neg : pos    =     12.6 : 1.0
              refreshing = True              pos : neg    =     11.7 : 1.0
                mediocre = True              neg : pos    =     11.7 : 1.0
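
Each ratio in the table compares how likely a feature value is under one label versus the other: `engrossing = True` is about 19.7 times more likely in a positive review than in a negative one. A sketch of that computation, assuming the expected-likelihood (add 0.5) smoothing that, to my knowledge, `NaiveBayesClassifier.train` uses by default; the counts below are invented for illustration:

```python
def ele_prob(count, total, bins=2):
    # Expected likelihood estimate: add 0.5 to each of the `bins` possible feature values
    return (count + 0.5) / (total + 0.5 * bins)

# Invented counts: number of reviews per label and how many of them contain the word
n_pos, n_neg = 4000, 4000
present_pos, present_neg = 39, 1

p_true_given_pos = ele_prob(present_pos, n_pos)
p_true_given_neg = ele_prob(present_neg, n_neg)
ratio = p_true_given_pos / p_true_given_neg

print(f'pos : neg = {ratio:.1f} : 1.0')  # pos : neg = 26.3 : 1.0
```

Large ratios often come from words that are rare in one class, so they indicate strong but not necessarily frequent signals.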


In [11]:
import warnings
warnings.filterwarnings('ignore')

In [12]:
import pandas as pd

# Build DataFrame containing reviews and their ratings (0 for negative, 1 for positive) to use with scikit-learn classifiers
texts_df = pd.DataFrame({'review': (texts_neg + texts_pos), 'rating': ([0]*len(texts_neg) + [1]*len(texts_pos))})
texts_df

|  | review | rating |
|---|---|---|
| 0 | simplistic , silly and tedious . | 0 |
| 1 | it's so laddish and juvenile , only teenage bo... | 0 |
| 2 | exploitative and largely devoid of the depth o... | 0 |
| 3 | [garbus] discards the potential for pathologic... | 0 |
| 4 | a visually flashy but narratively opaque and e... | 0 |
| ... | ... | ... |
| 10657 | both exuberantly romantic and serenely melanch... | 1 |
| 10658 | mazel tov to a film about a family's joyous li... | 1 |
| 10659 | standing in the shadows of motown is the best ... | 1 |
| 10660 | it's nice to see piscopo again after all these... | 1 |


In [13]:
from sklearn.model_selection import train_test_split

# Split texts_df into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(texts_df['review'], texts_df['rating'], train_size=0.8, random_state=0)

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# Transform texts into a matrix of unigram and bigram counts using a count vectorizer
vectorizer = CountVectorizer(max_features=50000, ngram_range=(1, 2))
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
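
With `ngram_range=(1, 2)` every review contributes both single tokens and pairs of adjacent tokens. A pure-Python sketch of that counting for one review from the data; the regular expression mirrors the default `token_pattern` of `CountVectorizer`, while `ngram_counts` is a hypothetical helper and not scikit-learn code:

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')  # Tokens of two or more word characters

def ngram_counts(text, n_min=1, n_max=2):
    # Lower-case, tokenize, then count every run of n adjacent tokens
    tokens = TOKEN_PATTERN.findall(text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts('effective but too-tepid biopic')
print(sorted(counts))
```

The hyphen is treated as a separator, so `too-tepid` yields the unigrams `too` and `tepid` plus the bigram `too tepid`. Single-character tokens are dropped by this pattern.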

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Use logistic regression for classification 
clf_logreg = LogisticRegression(max_iter=3000, random_state=0).fit(X_train_vectorized, y_train)
preds = clf_logreg.predict(X_test_vectorized)
scores = clf_logreg.decision_function(X_test_vectorized)

# Evaluate the logistic regression model
print('Logistic Regression:')
print(f'accuracy: {accuracy_score(y_test, preds):.4f}')
print(f'F1 score: {f1_score(y_test, preds):.4f}')
print(f'AUC: {roc_auc_score(y_test, scores):.4f}')

Logistic Regression:
accuracy: 0.7764
F1 score: 0.7827
AUC: 0.8566
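
Accuracy and F1 are computed from the hard predictions, whereas AUC is computed from the `decision_function` scores and measures how well positives are ranked above negatives. A hand-rolled sketch of the three metrics on toy data; the labels, the scores and the `evaluate` helper are invented for illustration, and the helper assumes both classes are present:

```python
def evaluate(y_true, scores, threshold=0.0):
    # Hard predictions: positive class when the score exceeds the threshold
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, y_true))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, y_true))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, y_true))
    accuracy = sum(p == y for p, y in zip(preds, y_true)) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn)
    # AUC: fraction of (positive, negative) pairs ranked correctly, ties count as half
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
    return accuracy, f1, auc

y_demo = [0, 0, 1, 1, 1, 0]
scores_demo = [-1.2, 0.3, 0.8, -0.1, 2.0, -0.5]
accuracy, f1, auc = evaluate(y_demo, scores_demo)
print(f'accuracy: {accuracy:.4f}, F1 score: {f1:.4f}, AUC: {auc:.4f}')
# accuracy: 0.6667, F1 score: 0.6667, AUC: 0.8889
```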


In [16]:
features = vectorizer.get_feature_names_out()
coef = clf_logreg.coef_.squeeze()

# Find the 10 n-grams with the most negative coefficients (strongest indicators of a negative review)
coef_neg_index = (-coef).argsort()  # Ascending order of -coef, so the most negative coefficients come last
print('Top 10 n-grams with the most negative coefficients (tend to indicate a negative rating):')
print(features[coef_neg_index[-10:]])

Top 10 n-grams with the most negative coefficients (tend to indicate a negative rating):
['neither' 'pretentious' 'unfunny' 'bore' 'worst' 'badly' 'too' 'boring'
 'bad' 'dull']


In [17]:
# Find the 10 n-grams with the most positive coefficients (strongest indicators of a positive review)
coef_pos_index = coef.argsort()  # Ascending order of coef, so the most positive coefficients come last
print('Top 10 n-grams with the most positive coefficients (tend to indicate a positive rating):')
print(features[coef_pos_index[-10:]])

Top 10 n-grams with the most positive coefficients (tend to indicate a positive rating):
['remarkable' 'unexpected' 'engrossing' 'solid' 'works' 'fun' 'hilarious'
 'wonderful' 'enjoyable' 'powerful']
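
Both cells rely on `argsort` returning indices in ascending order and then take the last 10 entries, so the strongest n-gram is printed last. A pure-Python sketch of that indexing on invented coefficients; `argsort`, `demo_features` and `demo_coef` are hypothetical:

```python
def argsort(values):
    # Indices that would sort `values` in ascending order (like numpy's argsort)
    return sorted(range(len(values)), key=lambda i: values[i])

demo_features = ['dull', 'fun', 'the', 'bad', 'powerful']
demo_coef = [-1.5, 1.1, 0.0, -1.2, 1.4]

neg_index = argsort([-c for c in demo_coef])  # Most negative coefficients end up last
pos_index = argsort(demo_coef)                # Most positive coefficients end up last

top_neg = [demo_features[i] for i in neg_index[-2:]]
top_pos = [demo_features[i] for i in pos_index[-2:]]
print(top_neg)  # ['bad', 'dull']
print(top_pos)  # ['fun', 'powerful']
```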


In [18]:
from sklearn.svm import LinearSVC

# Use a linear support vector machine (SVM) for classification
clf_svm = LinearSVC(max_iter=3000, random_state=0).fit(X_train_vectorized, y_train)
preds = clf_svm.predict(X_test_vectorized)
scores = clf_svm.decision_function(X_test_vectorized)

# Evaluate the SVM model
print('Linear Support Vector Machines:')
print(f'accuracy: {accuracy_score(y_test, preds):.4f}')
print(f'F1 score: {f1_score(y_test, preds):.4f}')
print(f'AUC: {roc_auc_score(y_test, scores):.4f}')

Linear Support Vector Machines:
accuracy: 0.7632
F1 score: 0.7701
AUC: 0.8455
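
Both linear models score a review with `decision_function` and predict the positive class when the score is above zero; they differ mainly in the loss minimized during training. A sketch comparing the logistic loss with the squared hinge loss (the `LinearSVC` default) as functions of the margin, i.e. the label in {-1, +1} times the score. Regularization is left out and the helper names are hypothetical:

```python
import math

def logistic_loss(margin):
    # Never exactly zero: even confidently correct predictions are still penalized slightly
    return math.log1p(math.exp(-margin))

def squared_hinge_loss(margin):
    # Exactly zero once the margin reaches 1, quadratic penalty below that
    return max(0.0, 1.0 - margin) ** 2

print('margin  logistic  squared hinge')
for margin in (-1.0, 0.0, 1.0, 2.0):
    print(f'{margin:+.1f}    {logistic_loss(margin):.4f}    {squared_hinge_loss(margin):.4f}')
```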


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Transform texts into a matrix of unigrams' and bigrams' TF-IDF values using a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

In [20]:
features = vectorizer.get_feature_names_out()

# Rank n-grams by their maximum TF-IDF value over the training reviews and show the 10 lowest and the 10 highest
tfidf_index = X_train_vectorized.max(axis=0).toarray().squeeze().argsort()
print(f'Top 10 n-grams which have lowest TF-IDF values: {features[tfidf_index[:10]]}', end='\n\n')
print(f'Top 10 n-grams which have highest TF-IDF values: {features[tfidf_index[-10:]]}')

Top 10 n-grams which have lowest TF-IDF values: ['unnamed easily' 'they exist' 'well they' 're coming' 'they they'
 'unnamed' 'they well' 're they' 're back' 'whatever terror']

Top 10 n-grams which have highest TF-IDF values: ['crummy' 'pathetic' 'fantastic' 'bland' 'fun' 'terrible' 'indeed'
 'refreshing' 'slummer' 'and']
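
The values ranked above are per-review weights after L2 normalization, so an n-gram that is the only feature of some review reaches the maximum of 1.0 regardless of how common it is. A pure-Python sketch, assuming the `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization); the `tfidf` helper and the toy documents are hypothetical, and the helper expects non-empty documents:

```python
import math
from collections import Counter

def tfidf(docs):
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        # L2-normalize each document vector
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [['fun'], ['fun', 'movie'], ['dull', 'movie']]
rows = tfidf(docs)
for row in rows:
    print({t: round(w, 4) for t, w in row.items()})
```

In the third toy document the rarer term `dull` gets a larger weight than `movie`, which is the effect IDF is meant to produce.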


In [21]:
# Use logistic regression for classification
clf_logreg = LogisticRegression(max_iter=3000, random_state=0).fit(X_train_vectorized, y_train)
preds = clf_logreg.predict(X_test_vectorized)
scores = clf_logreg.decision_function(X_test_vectorized)

# Evaluate the logistic regression model
print('Logistic Regression:')
print(f'accuracy: {accuracy_score(y_test, preds):.4f}')
print(f'F1 score: {f1_score(y_test, preds):.4f}')
print(f'AUC: {roc_auc_score(y_test, scores):.4f}')

Logistic Regression:
accuracy: 0.7689
F1 score: 0.7774
AUC: 0.8511


In [22]:
# Use a linear support vector machine (SVM) for classification
clf_svm = LinearSVC(max_iter=3000, random_state=0).fit(X_train_vectorized, y_train)
preds = clf_svm.predict(X_test_vectorized)
scores = clf_svm.decision_function(X_test_vectorized)

# Evaluate the SVM model
print('Linear Support Vector Machines:')
print(f'accuracy: {accuracy_score(y_test, preds):.4f}')
print(f'F1 score: {f1_score(y_test, preds):.4f}')
print(f'AUC: {roc_auc_score(y_test, scores):.4f}')

Linear Support Vector Machines:
accuracy: 0.7886
F1 score: 0.7951
AUC: 0.8686
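
To compare the four scikit-learn runs side by side, the scores printed above can be collected in one place. The sketch below only copies those numbers; the `results` dictionary is a hypothetical structure. The Naive Bayes accuracy (0.7431) is left out because it was measured on a different train/test split:

```python
# Test-set scores copied from the outputs above: (accuracy, F1, AUC)
results = {
    ('CountVectorizer', 'LogisticRegression'): (0.7764, 0.7827, 0.8566),
    ('CountVectorizer', 'LinearSVC'): (0.7632, 0.7701, 0.8455),
    ('TfidfVectorizer', 'LogisticRegression'): (0.7689, 0.7774, 0.8511),
    ('TfidfVectorizer', 'LinearSVC'): (0.7886, 0.7951, 0.8686),
}

# Print the runs ordered by accuracy, best first
for (vec, clf), (acc, f1, auc) in sorted(results.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f'{vec:<16} {clf:<19} accuracy={acc:.4f} F1={f1:.4f} AUC={auc:.4f}')

best = max(results, key=lambda k: results[k][0])
print('Best by accuracy:', best)
```

On this split the TF-IDF features help the linear SVM but slightly hurt logistic regression; the differences are small enough that a single split may not settle the ranking.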
