# Logistic Regression and Sentiment analysis 

In this assignment you will implement and experiment with various feature engineering techniques in the context of Logistic Regression models for Sentiment classification of movie reviews. We will use the LR model implemented in sklearn:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

## Write Your Name Here: Shreyas Lokesha
##### UNC ID: 801210964
##### E-mail : slokesha@uncc.edu

# <font color="blue"> Submission Instructions</font>

1. Click the Save button at the top of the Jupyter Notebook.
2. Please make sure to have entered your name above.
3. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of ll cells). 
4. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.
5. Once you've rerun everything, select File -> Download as -> PDF via LaTeX and download a PDF version *LR-SentimentAnalysis.pdf* showing the code and the output of all cells, and save it in the same folder that contains the notebook file *LR-SentimentAnalysis.ipynb*.
6. Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only thing we will see when grading!
7. Submit **both** your PDF and notebook on Canvas.

## From documents to feature vectors
This section illustratess the prototypical components of machine learning pipeline for an NLP task, in this case document classification:

1. Read document examples (train, devel, test) from files with a predefined format:
    - assume one document per line, usign the format "\<label\> \<text\>".

2. Tokenize each document:
    - using a spaCy tokenizer.

3. Feature extractors:
    - so far, just words.

4. Process each document into a feature vector:
    - map document to a dictionary of feature names.
    - map feature names to unique feature IDs.
    - each document is a feature vector, where each feature ID is mapped to a feature value (e.g. word occurences).

In [1]:
import spacy
import re
from spacy.lang.en import English
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

In [2]:
# Create spaCy tokenizer.
spacy_nlp = English()

def spacy_tokenizer(text):
    tokens = spacy_nlp.tokenizer(text)
    
    return [token.text for token in tokens]

In [3]:
def read_examples(filename):
    X = []
    Y = []
    with open(filename, mode = 'r', encoding = 'utf-8') as file:
        for line in file:
            [label, text] = line.rstrip().split(' ', maxsplit = 1)
            X.append(text)
            Y.append(label)
    return X, Y

In [4]:
def word_features(tokens):
    feats = {}
    for word in tokens:
        feat = 'WORD_%s' % word
        if feat in feats:
            feats[feat] +=1
        else:
            feats[feat] = 1
    return feats

In [5]:
def add_features(feats, new_feats):
    for feat in new_feats:
        if feat in feats:
            feats[feat] += new_feats[feat]
        else:
            feats[feat] = new_feats[feat]
    return feats

This function tokenizes the document, runs all the feature extractors on it and assembles the extracted features into a dictionary mapping feature names to feature values. It is important that feature names do not conflict with each other, i.e. different features should have different names. Each document will have its own dictionary of features and their values.

In [6]:
def docs2features(trainX, feature_functions, tokenizer):
    examples = []
    count = 0
    for doc in trainX:
        feats = {}

        tokens = tokenizer(doc)
        
        for func in feature_functions:
            add_features(feats, func(tokens))

        examples.append(feats)
        count +=1
        
        if count % 100 == 0:
            print('Processed %d examples into features' % len(examples))
    
    return examples

In [7]:
# This helper function converts feature names to unique numerical IDs.

def create_vocab(examples):
    feature_vocab = {}
    idx = 0
    for example in examples:
        for feat in example:
            if feat not in feature_vocab:
                feature_vocab[feat] = idx
                idx += 1
                
    return feature_vocab

In [8]:
# This helper function converts a set of examples from a dictionary of feature names to values representation
# to a sparse representation of feature ids to values. This is important because almost all feature values will
# be 0 for most documents and it would be wasteful to save all in memory.

def features_to_ids(examples, feature_vocab):
    new_examples = sparse.lil_matrix((len(examples), len(feature_vocab)))
    for idx, example in enumerate(examples):
        for feat in example:
            if feat in feature_vocab:
                new_examples[idx, feature_vocab[feat]] = example[feat]
                
    return new_examples

In [9]:
# Evaluation pipeline for the Logistic Regression classifier.

def train_and_test(trainX, trainY, devX, devY, feature_functions, tokenizer):
    # Pre-process training documents. 
    trainX_feat = docs2features(trainX, feature_functions, tokenizer)

    # Create vocabulary from features in training examples.
    feature_vocab = create_vocab(trainX_feat)
    print('Vocabulary size: %d' % len(feature_vocab))

    trainX_ids = features_to_ids(trainX_feat, feature_vocab)
    
    # Train LR model.
    lr_model = LogisticRegression(penalty = 'l2', C = 1.0, solver = 'lbfgs', max_iter = 1000)
    lr_model.fit(trainX_ids, trainY)
    
    # Pre-process test documents. 
    devX_feat = docs2features(devX, feature_functions, tokenizer)
    devX_ids = features_to_ids(devX_feat, feature_vocab)
    
    # Test LR model.
    print('Accuracy: %.3f' % lr_model.score(devX_ids, devY))

In [10]:
import os

datapath = '../data'

train_file = os.path.join(datapath, 'imdb_sentiment_train.txt')
trainX, trainY = read_examples(train_file)

dev_file = os.path.join(datapath, 'imdb_sentiment_dev.txt')
devX, devY = read_examples(dev_file)

# Specify features to use.
features = [word_features]

# Evaluate LR model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 28692
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

## Feature engineering

Evaluate LR model performance when adding positive and negative lexicon features. We will be using Bing Liu's sentiment lexicons from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Read the positive and negative sentiment lexicons. There should be 2007 entries in the positive lexicon and 4784 entries in the positive lexicon.

In [11]:
def read_lexicon(filename):
    lexicon = set()
    with open(filename, mode = 'r', encoding = 'ISO-8859-1') as file:
        # YOUR CODE HERE
        for line in file:
            if(line[0] != ';'):
                lexicon.add(line.strip('\n'))
        

    return lexicon

lexicon_path = '../data/bliu'

poslex_file = os.path.join(lexicon_path, 'positive-words.txt')
neglex_file = os.path.join(lexicon_path, 'negative-words.txt')

poslex = read_lexicon(poslex_file)
neglex = read_lexicon(neglex_file)


print(len(poslex), 'entries in the positive lexicon.')
print(len(neglex), 'entries in the negative lexicon.')

2007 entries in the positive lexicon.
4784 entries in the negative lexicon.


Use the lexicons to create two lexicon features:

- A feature 'POSLEX' whose value indicates how many tokens belong to the positive lexicon.
- A feature 'NEGLEX' whose value indicates how many tokens belong to the negative lexicon.

In [12]:
def two_lexicon_features(tokens):
    feats = {'POSLEX': 0, 'NEGLEX': 0}
    # YOUR CODE HERE
    for token in tokens:
        #temp = token.lower()
        if(token in poslex):
            feats['POSLEX'] += 1
        elif(token in neglex):
            feats['NEGLEX'] += 1
    
    return feats

Evaluate the LR model using the two new lexicon features.

In [13]:
# Specify features to use.
features = [word_features, two_lexicon_features]

# Evaluate LR model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 28694
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

This time, we create a separate feature for each word that appears in each lexicon, as follows:

- If a word from the positive lexicon (e.g. 'like') appears N times in the document (e.g. 5 times), add a positive lexicon feature 'POSLEX_word' for that word that is associated that value (e.g. {'POSLEX_like' : 5}.
- Similarly, if a word from the negative lexicon (e.g. 'dislike') appears N times in the document (e.g. 5 times), add a negative lexicon feature 'NEGLEX_word' for that word that is associated that value (e.g. {'NEGLEX_dislike' : 5}.

In [14]:
def lexicon_features(tokens):
    feats = {}
    # YOUR CODE HERE
    # Assume the positive and negative lexicons are available in poslex and neglex, respectively.
    for item in tokens:
        pos_feat = "POSLEX_"+str(item)
        neg_feat = "NEGLEX_"+str(item)
        if(item in poslex and pos_feat not in feats):
            feats[pos_feat] = 1
        elif(item in poslex and pos_feat in feats):
            feats[pos_feat]+=1
        elif(item in neglex and neg_feat not in feats):
            feats[neg_feat] = 1
        elif(item in neglex and neg_feat in feats):
            feats[neg_feat]+=1
            
    
    
    return feats

In [15]:
# Specify features to use.
features = [word_features, lexicon_features]

# Evaluate LR model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 31721
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

Add a feature 'DOC_LEN' whose value is the natural logarithm of the document length (use *math.log* to compute logarithms).

In [16]:
import math
def len_feature(tokens):
    feat = {'DOC_LEN': 0}
    document_length = math.log(len(tokens))
    feat['DOC_LEN'] = document_length
    return feat

In [17]:
# Specify features to use.
features = [word_features, lexicon_features, len_feature]

# Evaluate LR model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 31722
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

Add a feature 'DEICTIC_COUNT' that counts the number of 1st and 2nd person pronouns in the document.

In [18]:
def deictic_feature(tokens):
    pronouns = set(('i', 'my', 'me', 'we', 'us', 'our', 'you', 'your'))
    count = 0
    
    # YOUR CODE HERE
    for item in tokens:
        if (item.lower() in pronouns):
            count += 1
    
    return {'DEICTIC_COUNT': count}

In [19]:
# Specify features to use.
features = [word_features, lexicon_features, len_feature, deictic_feature]

# Evaluate LR model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 31723
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

Let's try without the word features.

In [20]:
# Specify features to use.
features = [lexicon_features, len_feature, deictic_feature]

# Evaluate LR model.
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 3031
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processe

## Bonus points ##
Anything extra goes here. For example:

- Preprocess the tokens to account for negation, as explained in class. For this, define a lexicon of negation words, such as 'not', 'never', etc. Redefine the sentiment lexicon features such that whenever moddified by a netation word, a positive sentiment word is counted as negative, i.e. 'NOT_like' will be a negative sentiment token. Train and evaluate the performance of the new model.
- Evaluate the impact of other features, such as the presence of exclamation points, the number of negations, etc.
- Using binary word features, i.e. does the word appear or not in the document, instead of count features.
- Evaluate the impact of the additional features when trained only on 500 examples. Use the first 250 examples as positive and the last 250 exampels as negative.

In [36]:
def negation_lexicon(tokens):
    negatX = set(('no','not','never','seldom','none','neither','hardly','barely'))
    negToken = []
    negFeat = {}
    for item in tokens:
        startIndex = 0
        endIndex = 0
        if(item.lower() in negatX):
            startIndex = tokens.index(item)
            negMatch = item
            for i in range(startIndex+1,len(tokens)):
                if(tokens[i] == '.' or tokens[i] == ','):
                    endIndex = i
                    break
            for i in range(startIndex+1,endIndex):
                negToken.append(str(negMatch)+"_"+str(tokens[i]))
    
    for x in negToken:
        for i in range(len(x)):
            if(x[i] == '_'):
                if(str(x[i+1:]) in poslex and x not in  negFeat):
                    negFeat[x] = 1
                elif(str(x[i+1:]) in poslex and x in  negFeat):
                    negFeat[x] += 1
            
    return negFeat



features = [word_features,lexicon_features,negation_lexicon,len_feature, deictic_feature]
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)
        

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 32157
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

In [38]:
def numberOfNegations(tokens):
    negat = set(('no','not','never','seldom','none','neither','hardly','barely'))
    cnt = 0
    for x in tokens:
        if(x in negat):
            cnt += 1
    
    return {"Negation":cnt}

features = [word_features,lexicon_features,negation_lexicon,numberOfNegations,len_feature, deictic_feature]
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)
    

Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 32158
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

In [39]:
def exclamation_lexicon(tokens):
    lexicon_present = False
    for item in tokens:
        if(item == '!'):
            lexicon_present = True
    
    return {"exclamation":lexicon_present}

def exclamation_lex_count(tokens):
    cnt = 0
    for item in tokens:
        if('!' in item):
            cnt += 1
            
    return {"exclam_count":cnt}

print("Accuracy for exclamation count feature")
features = [word_features,lexicon_features,exclamation_lex_count,negation_lexicon,numberOfNegations,
            len_feature, deictic_feature]
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)
print("=======================================")
print("Accuracy for exclamation present feature")
features = [word_features,lexicon_features,exclamation_lexicon,negation_lexicon,numberOfNegations,
            len_feature, deictic_feature]
train_and_test(trainX, trainY, devX, devY, features, spacy_tokenizer)

Accuracy for exclamation count feature
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 32159
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Proces

In [46]:
def preProcessingBinaryFeats(training_ds,feature_funcs,tokenizer):
    examples = []
    count = 0
    for item in training_ds:
        wordCnt = {}
        tokenset = tokenizer(item)
        x = word_features(tokenset)
        x_noDups = removeDuplicateFeats(x)
        add_features(wordCnt,x_noDups)
        examples.append(wordCnt)
        count +=1
        if count % 100 == 0:
            print('Processed %d examples into features' % len(examples))
    return examples

def removeDuplicateFeats(tokenDict):
    for key,value in tokenDict.items():
        if(len(tokenDict) > 2):
            if(value > 1):
                tokenDict[key] = 1
    return tokenDict

feature_functions = [word_features,lexicon_features]
trainX_feat = preProcessingBinaryFeats(trainX, feature_functions, spacy_tokenizer)

    # Create vocabulary from features in training examples.
feature_vocab = create_vocab(trainX_feat)
print('Vocabulary size: %d' % len(feature_vocab))

trainX_ids = features_to_ids(trainX_feat, feature_vocab)
    
    # Train LR model.
lr_model = LogisticRegression(penalty = 'l2', C = 1.0, solver = 'lbfgs', max_iter = 1000)
lr_model.fit(trainX_ids, trainY)
    
    # Pre-process test documents. 
devX_feat = preProcessingBinaryFeats(devX, feature_functions, spacy_tokenizer)
devX_ids = features_to_ids(devX_feat, feature_vocab)
    
    # Test LR model.
print('Accuracy: %.3f' % lr_model.score(devX_ids, devY))


Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Processed 1200 examples into features
Processed 1300 examples into features
Processed 1400 examples into features
Processed 1500 examples into features
Vocabulary size: 28692
Processed 100 examples into features
Processed 200 examples into features
Processed 300 examples into features
Processed 400 examples into features
Processed 500 examples into features
Processed 600 examples into features
Processed 700 examples into features
Processed 800 examples into features
Processed 900 examples into features
Processed 1000 examples into features
Processed 1100 examples into features
Process

In [42]:
def train_500_examples(training_ds,feature_funcs,tokenizer):
    examples = []
    count = 0
    for doc in trainX:
        feats = {}

        tokens = tokenizer(doc)
        
        for func in feature_functions:
            if(count <= 250 or count >= 1250):
                add_features(feats, func(tokens))

        examples.append(feats)
        count +=1
        
        if(count > 250):
            count = 1250
    
    return examples

feature_functions = [word_features,lexicon_features,deictic_feature,negation_lexicon,numberOfNegations,
                    exclamation_lex_count,len_feature]
trainX_feat = train_500_examples(trainX, feature_functions, spacy_tokenizer)

    # Create vocabulary from features in training examples.
feature_vocab = create_vocab(trainX_feat)
print('Vocabulary size: %d' % len(feature_vocab))

trainX_ids = features_to_ids(trainX_feat, feature_vocab)
    
    # Train LR model.
lr_model = LogisticRegression(penalty = 'l2', C = 1.0, solver = 'lbfgs', max_iter = 1000)
lr_model.fit(trainX_ids, trainY)
    
    # Pre-process test documents. 
devX_feat = train_500_examples(devX, feature_functions, spacy_tokenizer)
devX_ids = features_to_ids(devX_feat, feature_vocab)
    
    # Test LR model.
print('Accuracy: %.3f' % lr_model.score(devX_ids, devY))

Vocabulary size: 32159
Accuracy: 1.000


## Analysis ##
Include an analysis of the results that you obtained in the experiments above.

A PDF file including the analysis report has been attached in the submission section.