# Sentiment Analysis - NLU

## Statistics
Student: Francesco Laiti

---

This notebook contains the source code to extract statistics for the ``subjectivity`` and ``movie_reviews`` datasets, both available in the NLTK library.

### Prerequisites

In [None]:
import nltk
from nltk.corpus import subjectivity, movie_reviews, stopwords
import numpy as np

nltk.download('movie_reviews')
nltk.download('subjectivity')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

In [2]:
def get_subj_data():
    # Each item is one tokenized sentence (a list of words).
    subj = subjectivity.sents(categories='subj')
    obj = subjectivity.sents(categories='obj')
    return subj + obj

def get_pol_data():
    # Each item is one document: a list of sentences, each a list of words.
    neg = movie_reviews.paras(categories='neg')
    pos = movie_reviews.paras(categories='pos')
    return neg + pos

subj_corpus = get_subj_data()
movie_corpus = get_pol_data()

### Subjectivity dataset

#### Vocabulary size

In [3]:
# Copy of the corpus with stop words removed from every sentence.
subj_sw_corpus = [[w for w in sent if w not in stop_words] for sent in subj_corpus]

print('Subjectivity vocabulary size: ', len(set([w for d in subj_corpus for w in d])), ' words')
print('Subjectivity vocabulary size removing stop words: ', len(set([w for sent in subj_sw_corpus for w in sent])), ' words')

Subjectivity vocabulary size:  23906  words
Subjectivity vocabulary size removing stop words:  23753  words
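
The two counts above can be wrapped in a small helper. What follows is a minimal sketch on toy data, not code from the original notebook: `vocabulary_size`, `toy_sentences` and `toy_stop_words` are hypothetical names introduced only for illustration.

```python
def vocabulary_size(sentences, stop_words=None):
    """Count distinct tokens in an iterable of tokenized sentences.

    `stop_words` is an optional set of tokens to ignore
    (exact, case-sensitive match).
    """
    ignored = stop_words or set()
    return len({w for sent in sentences for w in sent if w not in ignored})


# Toy stand-ins for subj_corpus and stop_words (assumptions, not NLTK data).
toy_sentences = [["the", "movie", "is", "great"], ["the", "plot", "is", "thin"]]
toy_stop_words = {"the", "is"}

full_size = vocabulary_size(toy_sentences)
filtered_size = vocabulary_size(toy_sentences, toy_stop_words)
print(full_size, filtered_size)
```

Removing stop words can only delete word types from the vocabulary, so the gap between the two sizes is the number of distinct stop words that occur in the corpus. For the subjectivity data that gap is 23906 - 23753 = 153.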


#### Words per sentence

In [4]:
# Number of tokens in each sentence.
w_per_sents = [len(sent) for sent in subj_corpus]

print('Maximum words per sentence: ', max(w_per_sents))
print('Minimum words per sentence: ', min(w_per_sents))
print('Average words per sentence: ', round(np.mean(w_per_sents)))

Maximum words per sentence:  120
Minimum words per sentence:  10
Average words per sentence:  24
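
The same three numbers can be produced by one reusable function. This is a sketch under stated assumptions: `length_stats` and the toy sentences are hypothetical, and it uses the standard library's `statistics.mean` in place of `np.mean` so that it runs without NumPy.

```python
from statistics import mean


def length_stats(sentences):
    """Return (max, min, rounded mean) of the token counts per sentence."""
    lengths = [len(sent) for sent in sentences]
    if not lengths:
        raise ValueError("cannot compute statistics of an empty corpus")
    return max(lengths), min(lengths), round(mean(lengths))


# Toy tokenized sentences (an assumption, not NLTK data).
toy_sentences = [["a", "b", "c"], ["a"], ["a", "b", "c", "d", "e"]]
longest, shortest, average = length_stats(toy_sentences)
print(longest, shortest, average)
```

Passing `subj_corpus` instead of the toy list should give the values printed above, assuming the corpus is loaded as in the earlier cells.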


### Movie reviews dataset

#### Vocabulary size

In [5]:
# Copy of the corpus with stop words removed, keeping the document -> sentence -> word nesting.
movie_sw_corpus = [
    [[w for w in sent if w not in stop_words] for sent in doc]
    for doc in movie_corpus
]

# Loop variables are named doc/sent/w to avoid shadowing the built-in `list`.
print('Movie review vocabulary size: ', len(set([w for doc in movie_corpus for sent in doc for w in sent])), ' words')
print('Movie review vocabulary size removing stop words: ', len(set([w for doc in movie_sw_corpus for sent in doc for w in sent])), ' words')

Movie review vocabulary size:  39768  words
Movie review vocabulary size removing stop words:  39617  words


#### Words per document

In [6]:
# Each document is a list of sentences; sum the sentence lengths to get its word count.
w_per_doc = [sum(len(sent) for sent in doc) for doc in movie_corpus]

print('Maximum words per document: ', max(w_per_doc))
print('Minimum words per document: ', min(w_per_doc))
print('Average words per document: ', round(np.mean(w_per_doc)))

Maximum words per document:  2879
Minimum words per document:  19
Average words per document:  792


#### Sentences per document

In [7]:
sent_per_doc = []
for d in movie_corpus:
    sent_per_doc.append(len(d))

print('Maximum sents per document: ', max(sent_per_doc))
print('Minimum sents per document: ', min(sent_per_doc))
print('Average sents per document: ', round(np.mean(sent_per_doc)))

Maximum sents per document:  188
Minimum sents per document:  1
Average sents per document:  36
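
Both per-document measurements for the movie reviews can be gathered in a single pass over the nested corpus. The sketch below runs on toy data; `document_stats` and `toy_docs` are hypothetical names, and the toy corpus only mimics the document -> sentence -> word nesting returned by `movie_reviews.paras`.

```python
def document_stats(documents):
    """For each document (a list of tokenized sentences), return a
    (number of sentences, number of words) tuple."""
    return [(len(doc), sum(len(sent) for sent in doc)) for doc in documents]


# Toy corpus: two documents, nested as document -> sentence -> word.
toy_docs = [
    [["good", "film"], ["great", "cast", "too"]],
    [["boring"]],
]

stats = document_stats(toy_docs)
sents_per_doc = [n_sents for n_sents, _ in stats]
words_per_doc = [n_words for _, n_words in stats]
print(sents_per_doc, words_per_doc)
```

Keeping the two counts paired per document makes it easy to spot outliers, such as a document with a single very long sentence.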


#### Words per document AFTER filtering out objective sentences

In [3]:
import pickle
import os

saved_path = 'weights/transformer/filtered_polarity_sents.pkl'

if not os.path.exists(saved_path):
    raise FileNotFoundError(f'{saved_path} not found. Run the filter task in the transformer or GRU notebook, then come back here.')

print('Using .pkl with filtered sentences from ', saved_path)
with open(saved_path, 'rb') as f:
    dict_pols_filtered = pickle.load(f)

# Each entry of 'corpus' is assumed to be a document, i.e. a list of tokenized sentences.
w_per_doc = [sum(len(sent) for sent in doc) for doc in dict_pols_filtered['corpus']]

print('Maximum words per document: ', max(w_per_doc))
print('Minimum words per document: ', min(w_per_doc))
print('Average words per document: ', round(np.mean(w_per_doc)))

Using .pkl with filtered sentences from  weights/transformer/filtered_polarity_sents.pkl
Maximum words per document:  2348
Minimum words per document:  19
Average words per document:  498
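
To put the effect of the filtering in numbers, the figures printed for the movie reviews before and after filtering can be compared directly. This is a rough sketch: `relative_reduction` is a hypothetical helper, the inputs are the rounded values printed in this notebook, and it assumes the pickled corpus covers the same documents as `movie_corpus`.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# Rounded values copied from the outputs printed in this notebook.
avg_before, avg_after = 792, 498
max_before, max_after = 2879, 2348

avg_drop = relative_reduction(avg_before, avg_after)
max_drop = relative_reduction(max_before, max_after)
print(f"average: {avg_drop:.1%}, maximum: {max_drop:.1%}")
```

With these inputs the average length drops by roughly 37% and the maximum by roughly 18%, while the minimum stays at 19 words.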
