# Construct a Neural Bag-of-Words Framework for Evaluating Sentiments

Movie reviews can be categorized as either positive or negative. Analyzing the sentiment of movie review text is commonly referred to as sentiment analysis. One common approach to build sentiment analysis models is through the bag-of-words method. This method converts documents into vector formats where each word is given a specific score.

In this section, you'll learn:

- The process of formatting review text data for modeling using a limited vocabulary.
- Techniques to employ the bag-of-words model for training and testing datasets.
- Steps to create a Multilayer Perceptron with the bag-of-words model and how to utilize it for predictions on fresh review text data.

## Data preparation

The process of preparing the movie review dataset was initially outlined in the previous section. In this segment, we'll focus on the following tasks:

- Dividing the data into train and test subsets.
- Importing the data and cleansing it by eliminating punctuation and numerals.
- Establishing a defined list of desired words.

### Train and Test Split

>Let's assume we're working on a system designed to predict the sentiment of textual movie reviews, categorizing them as either **``positive``** or **``negative``**.

Once our model is established, we will predict sentiments of new textual reviews. Consequently, the same data preparation steps we applied to the training data will be needed for these new reviews as well.

To build this requirement into our model evaluation, **we'll divide the dataset into training and test sets** before any data preparation takes place. This ensures that any knowledge in the test set that could influence data preparation (such as which words are used) stays unavailable during both data preparation and model training.

To clarify further, we'll utilize the last 100 positive and the last 100 negative reviews (totaling 200 reviews) for testing purposes. The preceding 1,800 reviews will serve as our training dataset.

> This essentially divides our data into a **``90-10 ratio``** for training and testing, respectively.

To facilitate this division, we can rely on the filenames of the reviews. For instance, reviews labeled from 000 to 899 are designated for training, while those labeled 900 and beyond are reserved for testing.
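The filename convention can be checked with a short sketch. The helper name `is_test_file` and the synthetic filenames are assumptions made for illustration; the real files carry a different numeric suffix (for example `cv000_29590.txt`).

```python
# Hypothetical helper: decide the split from the filename alone.
def is_test_file(filename):
    # Reviews numbered 900-999 start with 'cv9' and are held out for testing.
    return filename.startswith('cv9')

# Synthetic filenames standing in for one class directory of 1,000 reviews.
filenames = ['cv%03d_00000.txt' % i for i in range(1000)]
train = [f for f in filenames if not is_test_file(f)]
test = [f for f in filenames if is_test_file(f)]
print(len(train), len(test))
```

Applied to both the positive and the negative directory, this gives 1,800 training reviews and 200 test reviews.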

### Loading and Cleaning Reviews

The text data appears to be quite **``tidy``** already, so there isn't much preparation needed. Without delving too deeply into the specifics, we'll prep the data using the following approach:

- Tokenize the text based on white spaces.
- Purge all punctuation from the words.
- Discard any words that aren't solely made up of alphabetical characters.
- Eliminate all known stop words from the text.
- Exclude any words that are of length 1 character.

We can encapsulate all of these steps into a function named **``clean_doc()``**, which will accept the raw text extracted from a file as an argument and return a list of cleaned tokens. Additionally, we can create a function named **``load_doc()``** to fetch a document from a file, making it ready for use with the **``clean_doc()``** function. Below is an example demonstrating how to clean the first positive review using these functions.

In [None]:
!gdown https://drive.google.com/uc?id=10OsDrN-m2IIKqZJrf-xMbEtg8VlGJ2DQ
!tar -xvf review_polarity.tar.gz

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
import string
import re

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)
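To see what each cleaning step does without downloading anything, here is a minimal sketch that applies the same steps to a made-up sentence. The function name `clean_tokens`, the sample text, and the tiny stop-word set are assumptions for illustration; the real code uses the full NLTK English stop-word list.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption, not the NLTK list).
STOP_WORDS = {'the', 'a', 'is', 'it', 'and', 'of'}

def clean_tokens(text, stop_words=STOP_WORDS):
    # split on white space
    tokens = text.split()
    # strip punctuation characters from every token
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in tokens]
    # drop tokens that are empty or contain non-alphabetic characters
    tokens = [w for w in tokens if w.isalpha()]
    # drop stop words
    tokens = [w for w in tokens if w not in stop_words]
    # drop single-character tokens
    return [w for w in tokens if len(w) > 1]

sample = "the film is a 10/10 , and it doesn't drag !"
print(clean_tokens(sample))  # ['film', 'doesnt', 'drag']
```

Note how `doesn't` becomes `doesnt` once the apostrophe is stripped, and how `10/10` disappears because `1010` is not alphabetic.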

### Establishing a Vocabulary




**``Setting up a vocabulary``** of recognized terms is crucial when employing a bag-of-words model.

A larger number of words results in a more extensive representation of documents, hence it's vital to limit the words to only those presumed to be indicative. Determining this upfront is challenging and it's often necessary to explore various assumptions regarding the formation of a beneficial vocabulary.

As seen earlier, it's possible to exclude punctuation and numbers from the vocabulary. This procedure can be replicated across all documents to generate a collection of all recognized words.

A vocabulary can be structured as a **``Counter``**, which is a dictionary-like mapping of words alongside their frequency, facilitating straightforward updates and inquiries.
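As a quick illustration of why a `Counter` is convenient here, the sketch below updates one with two made-up token lists and then queries it. The token lists and the variable name `toy_vocab` are assumptions for illustration only.

```python
from collections import Counter

toy_vocab = Counter()
# each update() call adds the counts from one tokenized document
toy_vocab.update(['film', 'good', 'film'])
toy_vocab.update(['film', 'good', 'bad'])

print(len(toy_vocab))            # number of distinct words: 3
print(toy_vocab['film'])         # occurrences of one word: 3
print(toy_vocab.most_common(2))  # [('film', 3), ('good', 2)]
```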

Each document can be added to the counter via a new function named **``add_doc_to_vocab()``**. A second new function, **``process_docs()``**, walks through all reviews in a directory and skips those reserved for the test set; we call it first on the positive directory and then on the negative directory. The complete example is provided below.

In [None]:
import string
import re
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/pos', vocab)
process_docs('txt_sentoken/neg', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))

Executing the example reveals a vocabulary comprising **``44,276 words``**. Additionally, we can observe a snippet of the top 50 most frequently utilized words in the movie reviews. It's noteworthy that this vocabulary was assembled solely from the reviews present in the training dataset.


```python
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262),
('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844),
('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703),
('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511),
('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288),
('people', 1269), ('bad', 1248), ('could', 1248), ('scene', 1241), ('movies', 1238),
('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131),
('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024),
('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952),
('director', 948), ('end', 946), ('something', 945), ('still', 936)]
```

We can step through the vocabulary and discard words with a low frequency, such as those appearing only once across all training reviews. For instance, the snippet below retains only the tokens that occur 2 or more times and saves them to **``vocab.txt``**.

In [None]:
import string
import re
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

# save list to file
def save_list(lines, filename):
	# convert lines to a single blob of text
	data = '\n'.join(lines)
	# open file
	file = open(filename, 'w')
	# write text
	file.write(data)
	# close file
	file.close()

# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/pos', vocab)
process_docs('txt_sentoken/neg', vocab)
# print the size of the vocab
print(len(vocab))
# keep tokens with a min occurrence
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

Executing the above example with this modification shows that the vocabulary shrinks to a little more than half of its original size, from around 44,000 to about 25,000 words (roughly 58% of the words are kept).
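The filtering step on its own can be sketched with a toy counter. The words and counts below are made up for illustration; only the `min_occurrence` threshold mirrors the code above.

```python
from collections import Counter

# Made-up word counts standing in for the real vocabulary.
toy_vocab = Counter({'film': 5, 'good': 2, 'zany': 1, 'plot': 3, 'qux': 1})

min_occurrence = 2
# keep only words seen at least min_occurrence times
kept = [word for word, count in toy_vocab.items() if count >= min_occurrence]
print(kept)                        # ['film', 'good', 'plot']
print(len(kept) / len(toy_vocab))  # fraction of words retained: 0.6
```

Raising `min_occurrence` trades vocabulary coverage for a shorter document vector, so it is worth trying a few values.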

## Bag-of-Words Representation

In this section, we'll explore the process of transforming each review into a format suitable for a Multilayer Perceptron model.

> The **``bag-of-words``** model serves as a mechanism for **extracting features from text**, enabling its utilization with machine learning algorithms like neural networks.

Each document, in this case a review, is converted into a vector representation. The number of items in the vector equals the number of words in the vocabulary, so a larger vocabulary means a longer vector, which is why the previous section favored a smaller vocabulary. The bag-of-words model itself was introduced last week.

Within a document, words are evaluated and their scores are positioned in the corresponding spot within the representation. The upcoming section will delve into various word scoring methodologies. In this segment, the focus is on transmuting reviews into vectors, primed for the training of an initial neural network model. This segment unfolds in two phases:

1. Transitioning reviews into tokenized lines.
2. Encoding reviews utilizing a bag-of-words model representation.
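Before bringing in Keras, the idea can be sketched by hand: given a fixed, ordered vocabulary, each review becomes a vector with one slot per vocabulary word. The four-word vocabulary and the review text below are made up for illustration, and a plain count is used as the score.

```python
# Made-up vocabulary; the position of each word fixes its slot in the vector.
vocab = ['bad', 'film', 'good', 'plot']

review = 'good film good plot'
tokens = review.split()

# score each vocabulary word by how often it occurs in the review
vector = [tokens.count(word) for word in vocab]
print(vector)  # [0, 1, 2, 1]
```

Word order in the review is lost; only which words occur, and how often, is kept.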

### Converting Reviews into Lines of Tokens

Before we can transform reviews into vectors for modeling, it's essential to tidy them up first.

This entails loading the reviews, carrying out the cleaning procedure devised earlier, filtering out words not present in the selected vocabulary, and turning the remaining tokens into a single string or line prepped for encoding.

First, we need a function to prepare a single document. The **``doc_to_line()``** function below loads a document, cleans it, filters out tokens that are not in the vocabulary, and returns the document as a string of tokens separated by white space.

```python
# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
  # load the doc
  doc = load_doc(filename)
  # clean doc
  tokens = clean_doc(doc)
  # filter by vocab
  tokens = [w for w in tokens if w in vocab]
  return ' '.join(tokens)
```

Following that, we require a function to process all documents in a directory (like pos and neg) to transform the documents into lines. Below, the **``process_docs()``** function is outlined, which accomplishes this task. It anticipates a directory name and a vocabulary set as input arguments, and returns a list of processed documents.

```python
# load all docs in a directory
def process_docs(directory, vocab):
  lines = list()
  # walk through all files in the folder
  for filename in listdir(directory):
    # skip any reviews in the test set
    if filename.startswith('cv9'):
      continue
    # create the full path of the file to open
    path = directory + '/' + filename
    # load and clean the doc
    line = doc_to_line(path, vocab)
    # add to list
    lines.append(line)
  return lines
```

We can invoke the **``process_docs()``** function consistently for both positive and negative reviews to assemble a dataset comprising review text alongside their corresponding output labels, with **``0``** denoting negative and **``1``** denoting positive. The **``load_clean_dataset()``** function below embodies this behavior.

```python
# load and clean a dataset
def load_clean_dataset(vocab):
  # load documents
  neg = process_docs('txt_sentoken/neg', vocab)
  pos = process_docs('txt_sentoken/pos', vocab)
  docs = neg + pos
  # prepare labels
  labels = [0 for _ in range(len(neg))] + [1 for _ in range(len(pos))]
  return docs, labels
```


In [None]:
import string
import re
from os import listdir
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

# load and clean a dataset
def load_clean_dataset(vocab):
	# load documents
	neg = process_docs('txt_sentoken/neg', vocab)
	pos = process_docs('txt_sentoken/pos', vocab)
	docs = neg + pos
	# prepare labels
	labels = [0 for _ in range(len(neg))] + [1 for _ in range(len(pos))]
	return docs, labels

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# load all training reviews
docs, labels = load_clean_dataset(vocab)
# summarize what we have
print(len(docs), len(labels))

In [None]:
docs[0]

### Transforming Movie Reviews into Bag-of-Words Vectors

We'll use the **``Keras API``** to convert reviews into encoded document vectors. Keras provides the **``Tokenizer``** class, which can handle some of the cleaning and vocabulary definition tasks we took care of in the prior section. Doing those tasks ourselves is useful for understanding exactly what was done and why, but the **``Tokenizer``** class is convenient and readily converts documents into encoded vectors. First, the **``Tokenizer``** must be created and then fit on the text documents in the training dataset. In this case, these documents are the combined lists of negative and positive lines built in the preceding section.

```python
# fit a tokenizer
def create_tokenizer(lines):
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(lines)
  return tokenizer
```

Fitting the tokenizer gives a consistent way to convert any document into a fixed-length vector with roughly **``25,768``** elements, one per word in the vocabulary file **``vocab.txt``** (Keras reserves index 0, so the encoded matrix has one more column than the number of words the tokenizer learned). Documents can then be encoded by calling the tokenizer's **``texts_to_matrix()``** method. This method accepts a list of documents to encode and an encoding mode, which is the technique used to score words in the document. Here we specify **``freq``** to score words by their frequency within the document. The same encoding can be applied to both the loaded training and test data, for instance:

```python
# encode data
x_train = tokenizer.texts_to_matrix(train_docs, mode='freq')
x_test = tokenizer.texts_to_matrix(test_docs, mode='freq')
```

This encodes all of the positive and negative reviews in the training and test datasets. To load the test reviews in the first place, the **``process_docs()``** function from the preceding section must be modified so that it can process reviews from either the train or the test dataset. An **``is_train``** argument is introduced to decide which review filenames to skip.

```python
# load all docs in a directory
def process_docs(directory, vocab, is_train):
  lines = list()
  # walk through all files in the folder
  for filename in listdir(directory):
    # skip any reviews in the test set
    if is_train and filename.startswith('cv9'):
      continue
    if not is_train and not filename.startswith('cv9'):
      continue
    # create the full path of the file to open
    path = directory + '/' + filename
    # load and clean the doc
    line = doc_to_line(path, vocab)
    # add to list
    lines.append(line)
  return lines
```
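The two `continue` checks above amount to a single rule: keep a file when "is a test file" and "we want training data" disagree. The sketch below spells that out; `keep_file` is a hypothetical helper, not part of the original code, and the filenames are made up.

```python
# Hypothetical helper expressing the skip logic as one boolean test.
def keep_file(filename, is_train):
    is_test_file = filename.startswith('cv9')
    # keep train files when is_train is True, test files when it is False
    return is_test_file != is_train

for name in ['cv123_00000.txt', 'cv950_00000.txt']:
    print(name, keep_file(name, True), keep_file(name, False))
```

Whether to fold the two checks into one is a matter of taste; the explicit version in `process_docs()` is arguably easier to read for newcomers.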

Similarly, the **``load_clean_dataset()``** function needs to be updated to load either the training or the test data, and to return the labels as a NumPy array.

```python
# load and clean a dataset
def load_clean_dataset(vocab, is_train):
  # load documents
  neg = process_docs('txt_sentoken/neg', vocab, is_train)
  pos = process_docs('txt_sentoken/pos', vocab, is_train)
  docs = neg + pos
  # prepare labels
  labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
  return docs, labels
```

We can consolidate all these steps into a single example.

In [None]:
import string
import re
from os import listdir
from numpy import array
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

# load and clean a dataset
def load_clean_dataset(vocab, is_train):
	# load documents
	neg = process_docs('txt_sentoken/neg', vocab, is_train)
	pos = process_docs('txt_sentoken/pos', vocab, is_train)
	docs = neg + pos
	# prepare labels
	labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
	return docs, labels

# fit a tokenizer
def create_tokenizer(lines):
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())

# load all reviews
train_docs, y_train = load_clean_dataset(vocab, True)
test_docs, y_test = load_clean_dataset(vocab, False)

# create the tokenizer
tokenizer = create_tokenizer(train_docs)

# encode data
x_train = tokenizer.texts_to_matrix(train_docs, mode='freq')
x_test = tokenizer.texts_to_matrix(test_docs, mode='freq')
print(x_train.shape, x_test.shape)

In [None]:
x_train[0]


Executing the example displays the dimensions of both the encoded training dataset and test dataset, comprising 1,800 and 200 documents respectively, each possessing an encoding vocabulary of identical size (vector length).

## Sentiment Analysis Models


In this segment, we will construct Multilayer Perceptron (MLP) models to categorize encoded documents as either positive or negative. The models will embody straightforward feedforward network architectures, encompassing fully connected layers, termed Dense in the Keras deep learning library. This segment is organized into three sub-sections:

- Initial sentiment analysis model
- Comparison of word scoring methods
- Rendering predictions for new reviews

### First Sentiment Analysis Model

We can construct a basic MLP model to predict the sentiment of encoded reviews. The model has an input layer whose size equals the number of words in the vocabulary, which is also the length of each encoded input vector. We can store this in a new variable named **``n_words``**, as follows:

```python
n_words = x_test.shape[1]
```
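To get a feel for how the vocabulary size drives the size of the network, the sketch below counts the trainable parameters of an MLP with one 50-unit hidden layer and a single output unit, as used in this section. The function name `mlp_param_count` is an assumption, and the input size of 25,768 is taken from the vocabulary size quoted earlier; the actual value of `x_train.shape[1]` may differ slightly.

```python
# Hypothetical helper: parameters of a Dense(n_hidden) + Dense(1) network.
def mlp_param_count(n_inputs, n_hidden=50):
    # hidden layer: one weight per input per unit, plus one bias per unit
    hidden = n_inputs * n_hidden + n_hidden
    # output layer: one weight per hidden unit, plus one bias
    output = n_hidden * 1 + 1
    return hidden + output

print(mlp_param_count(25768))
```

Almost all of the weights sit in the first layer, which is one more reason to keep the vocabulary small.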


We're now set to define the network. It's worth noting that the model configuration was determined with minimal trial and error, and isn't specifically optimized for this challenge.

Our design includes a single hidden layer consisting of **``50 neurons``**, complemented by a rectified linear activation function. For the output layer, we employ a solitary neuron utilizing a sigmoid activation function, which predicts either **``0``** for negative reviews or **``1``** for positive ones.

For the training process, we'll use the efficient **``Adam``** variation of gradient descent and the **``binary cross-entropy``** loss function, which is apt for **``binary classification tasks``**. Additionally, we'll monitor accuracy during the training and assessment phases of the model.

```python
# define the model
def define_model(n_words):
    # define network
    model = Sequential()
    model.add(Dense(50, input_shape=(n_words,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
```
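The output activation and the loss chosen above can be worked through by hand for a single review. This is a minimal sketch of the standard sigmoid and binary cross-entropy formulas, not Keras's internal implementation; the pre-activation value `2.0` is made up.

```python
import math

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p):
    # loss for one example with true label y_true (0 or 1) and prediction p
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)  # about 0.88, i.e. leaning "positive"
print(p)
print(binary_cross_entropy(1, p))  # small loss: prediction agrees with label
print(binary_cross_entropy(0, p))  # large loss: prediction contradicts label
```

The loss grows quickly as a confident prediction turns out to be wrong, which is what pushes the network towards well-calibrated outputs.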

Next, we'll proceed to train the model using the training data. Given the model's compact size, it can be comfortably trained in just 10 epochs.

```python
# Train the model
model.fit(x_train, y_train, epochs=10, verbose=2)
```

After training is complete, we can assess how the model performs by making predictions on the test dataset and displaying the accuracy.

```python
# evaluate
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print('Test Accuracy: %f' % (acc*100))
```
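The accuracy figure itself is simple to reproduce by hand. The sketch below assumes that a sigmoid output above 0.5 counts as a positive prediction; the labels and probabilities are made up, and `accuracy` is a hypothetical helper rather than a Keras function.

```python
# Hypothetical helper: fraction of thresholded predictions that match labels.
def accuracy(y_true, probs, threshold=0.5):
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(1 for pred, label in zip(preds, y_true) if pred == label)
    return correct / len(y_true)

y_true = [0, 0, 1, 1]
probs = [0.1, 0.6, 0.8, 0.4]
print(accuracy(y_true, probs))  # 0.5
```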


In [None]:
import string
import re
from os import listdir
from numpy import array
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import plot_model
from keras.models import Sequential
from keras.layers import Dense

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

# load and clean a dataset
def load_clean_dataset(vocab, is_train):
	# load documents
	neg = process_docs('txt_sentoken/neg', vocab, is_train)
	pos = process_docs('txt_sentoken/pos', vocab, is_train)
	docs = neg + pos
	# prepare labels
	labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
	return docs, labels

# fit a tokenizer
def create_tokenizer(lines):
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# define the model
def define_model(n_words):
	# define network
	model = Sequential()
	model.add(Dense(50, input_shape=(n_words,), activation='relu'))
	model.add(Dense(1, activation='sigmoid'))
	# compile network
	model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
	# summarize defined model
	model.summary()
	plot_model(model, to_file='model.png', show_shapes=True)
	return model

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())

# load all reviews
train_docs, y_train = load_clean_dataset(vocab, True)
test_docs, y_test = load_clean_dataset(vocab, False)

# create the tokenizer
tokenizer = create_tokenizer(train_docs)

# encode data
x_train = tokenizer.texts_to_matrix(train_docs, mode='freq')
x_test = tokenizer.texts_to_matrix(test_docs, mode='freq')

# define the model
n_words = x_test.shape[1]
model = define_model(n_words)

# fit network
model.fit(x_train, y_train, epochs=10, verbose=2)

# evaluate
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print('Test Accuracy: %f' % (acc*100))

The model easily fits the training data within 10 epochs, reaching an accuracy close to 100%. Evaluated on the test dataset, the model does well, with an accuracy of over 87%. This is in line with the low-to-mid 80s accuracy range cited in the original paper. However, it is not a like-for-like comparison: the original paper used 10-fold cross-validation to estimate model skill, rather than a single train/test split.

Next, we'll explore various word scoring techniques for the bag-of-words model.

### Comparing Word Scoring Methods

The **``texts_to_matrix()``** function provided by Keras' Tokenizer offers four distinct methods to score words:

- **binary**: Words are marked as either present (1) or absent (0).
- **count**: Each word's occurrence is represented as an integer count.
- **tfidf**: Words are scored based on frequency, penalizing those that are common across all documents.
- **freq**: Words are scored based on their occurrence frequency within a document.
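The four modes can be sketched in plain Python on a toy corpus. The corpus, the vocabulary, and the `encode` helper are assumptions for illustration; in particular, the `tfidf` branch uses one common smoothed formula and is not guaranteed to match the exact values Keras produces.

```python
import math
from collections import Counter

# Made-up corpus of two tokenized documents and a fixed vocabulary.
docs = [['good', 'film', 'good'], ['bad', 'film']]
vocab = ['bad', 'film', 'good']

def encode(doc, mode):
    if mode not in ('binary', 'count', 'freq', 'tfidf'):
        raise ValueError('unknown mode: %s' % mode)
    counts = Counter(doc)
    row = []
    for word in vocab:
        c = counts[word]
        if c == 0:
            row.append(0.0)                      # absent word scores zero
        elif mode == 'binary':
            row.append(1.0)                      # present or not
        elif mode == 'count':
            row.append(float(c))                 # raw occurrence count
        elif mode == 'freq':
            row.append(c / len(doc))             # count relative to doc length
        else:
            # tfidf (assumed variant): damp the count, down-weight common words
            df = sum(1 for d in docs if word in d)
            row.append((1 + math.log(c)) * math.log(1 + len(docs) / (1 + df)))
    return row

for mode in ['binary', 'count', 'freq', 'tfidf']:
    print(mode, encode(docs[0], mode))
```

In the `tfidf` row, `film` scores lower than `good` because it appears in every document and so says little about any one of them.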

We can assess the performance of the previously developed model using each of these four word scoring methods. This involves creating a function that encodes the loaded documents based on a selected scoring method. This function will create the tokenizer, fit it on the training documents, and then generate the train and test encodings using the selected method. The **``prepare_data()``** function embodies this process, taking in lists of train and test documents.

```python
# prepare bag-of-words encoding of docs
def prepare_data(train_docs, test_docs, mode):
    # create the tokenizer
    tokenizer = Tokenizer()
    # fit the tokenizer on the documents
    tokenizer.fit_on_texts(train_docs)
    # encode training data set
    x_train = tokenizer.texts_to_matrix(train_docs, mode=mode)
    # encode test data set
    x_test = tokenizer.texts_to_matrix(test_docs, mode=mode)

    return x_train, x_test
```


We also need a function to evaluate the MLP on a given encoding of the data. Because neural networks are stochastic, the same model can produce different results when trained on the same data. This variability comes mainly from the random initial weights and the random shuffling of samples into batches during minibatch gradient descent, so a single evaluation of a model is not fully reliable. It is better to estimate model skill as the average over several runs. The function below, **``evaluate_mode()``**, accepts encoded documents and evaluates the MLP: it trains the model on the training set, measures its accuracy on the test set, and repeats this **``n_repeats``** times (30 in the listing below). It then returns a list of the accuracy scores from all runs.

```python
# evaluate a neural network model
def evaluate_mode(x_train, y_train, x_test, y_test):
    scores = list()
    n_repeats = 30
    n_words = x_test.shape[1]
    for i in range(n_repeats):
        # define network
        model = Sequential()
        model.add(Dense(50, input_shape=(n_words,), activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        # compile network
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        # fit network
        model.fit(x_train, y_train, epochs=10, verbose=2)
        # evaluate
        loss, acc = model.evaluate(x_test, y_test, verbose=0)
        scores.append(acc)
        print('%d accuracy: %s' % ((i+1), acc))
    return scores
```
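Once a list of scores has been returned, it can be reduced to a mean and a standard deviation so that encodings are compared on their average behaviour rather than on a single run. The sketch below uses only the standard library; `summarize_scores()` is a hypothetical helper name, and the scores shown are made-up placeholder values, not results from this tutorial.

```python
from statistics import mean, stdev

# Hypothetical helper: summarize repeated evaluation scores.
def summarize_scores(scores):
    # the sample standard deviation needs at least two scores
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread

# placeholder accuracies for illustration only
example_scores = [0.90, 0.92, 0.88]
avg, spread = summarize_scores(example_scores)
print('mean=%.3f std=%.3f' % (avg, spread))
```

A small standard deviation relative to the gap between two encodings suggests that the difference between them is not just run-to-run noise.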

We're set to assess the performance using the four distinct word scoring techniques. Integrating all the components, the comprehensive example is provided below.

```python
import string
import re
from os import listdir
from numpy import array
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense
from pandas import DataFrame
from matplotlib import pyplot

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if w not in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab, is_train):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# when loading training data, skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		# when loading test data, skip any reviews in the training set
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

# load and clean a dataset
def load_clean_dataset(vocab, is_train):
	# load documents
	neg = process_docs('txt_sentoken/neg', vocab, is_train)
	pos = process_docs('txt_sentoken/pos', vocab, is_train)
	docs = neg + pos
	# prepare labels
	labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
	return docs, labels

# define the model
def define_model(n_words):
	# define network
	model = Sequential()
	model.add(Dense(50, input_shape=(n_words,), activation='relu'))
	model.add(Dense(1, activation='sigmoid'))
	# compile network
	model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

# evaluate a neural network model
def evaluate_mode(x_train, y_train, x_test, y_test):
	scores = list()
	n_repeats = 10
	n_words = x_test.shape[1]
	for i in range(n_repeats):
		# define network
		model = define_model(n_words)
		# fit network
		model.fit(x_train, y_train, epochs=10, verbose=0)
		# evaluate
		_, acc = model.evaluate(x_test, y_test, verbose=0)
		scores.append(acc)
		print('%d accuracy: %s' % ((i+1), acc))
	return scores

# prepare bag of words encoding of docs
def prepare_data(train_docs, test_docs, mode):
	# create the tokenizer
	tokenizer = Tokenizer()
	# fit the tokenizer on the documents
	tokenizer.fit_on_texts(train_docs)
	# encode training data set
	x_train = tokenizer.texts_to_matrix(train_docs, mode=mode)
	# encode test data set
	x_test = tokenizer.texts_to_matrix(test_docs, mode=mode)
	return x_train, x_test

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())
# load all reviews
train_docs, y_train = load_clean_dataset(vocab, True)
test_docs, y_test = load_clean_dataset(vocab, False)
# run experiment
modes = ['binary', 'count', 'tfidf', 'freq']
results = DataFrame()
for mode in modes:
	# prepare data for mode
	x_train, x_test = prepare_data(train_docs, test_docs, mode)
	# evaluate model on data for mode
	results[mode] = evaluate_mode(x_train, y_train, x_test, y_test)
# summarize results
print(results.describe())
# plot results
results.boxplot()
pyplot.show()
```

```python
# show the accuracy of each of the 10 runs for every scoring mode
print(results.head(10))
```

### Predicting Sentiment for New Reviews

We're now ready to create and apply a final model to predict the sentiment of new textual reviews, which is our primary objective. First, we'll train the final model on all the available data, using the 'binary' mode of the bag-of-words model, which performed best in our prior assessments.

The prediction process for new reviews mirrors the preparation steps for the test data. This involves loading the text, cleaning the document, filtering tokens by our chosen vocabulary, joining the remaining tokens into a line, encoding the line with the Tokenizer, and finally making a prediction. Calling the **``predict()``** function on the trained model returns a probability between 0 and 1 from the sigmoid output, which we round to a class value: 0 for a negative review and 1 for a positive one. We can consolidate these steps into a new function named **``predict_sentiment()``**. This function takes the review text, vocabulary, tokenizer, and trained model as inputs and returns the predicted sentiment along with a corresponding confidence score.

```python
# classify a review as negative or positive
def predict_sentiment(review, vocab, tokenizer, model):
    # clean
    tokens = clean_doc(review)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    # convert to line
    line = ' '.join(tokens)
    # encode
    encoded = tokenizer.texts_to_matrix([line], mode='binary')
    # predict sentiment
    yhat = model.predict(encoded, verbose=0)
    # retrieve predicted percentage and label
    percent_pos = yhat[0,0]
    if round(percent_pos) == 0:
        return (1-percent_pos), 'NEGATIVE'
    return percent_pos, 'POSITIVE'
```
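The last step of the function, turning the sigmoid output into a label and a confidence value, can be looked at in isolation. The following is a minimal sketch that needs no trained model; `label_from_probability()` and its `threshold` parameter are hypothetical names introduced here for illustration, and the explicit threshold is an assumption that is similar in spirit to the rounding used above.

```python
# Hypothetical helper: map a predicted probability of the positive class
# to a (confidence, label) pair.
def label_from_probability(percent_pos, threshold=0.5):
    if not 0.0 <= percent_pos <= 1.0:
        raise ValueError('probability must be between 0 and 1')
    if percent_pos < threshold:
        # confidence in the negative class is the complement
        return 1.0 - percent_pos, 'NEGATIVE'
    return percent_pos, 'POSITIVE'

for p in [0.05, 0.45, 0.97]:
    confidence, label = label_from_probability(p)
    print('p=%.2f -> %s (%.1f%%)' % (p, label, confidence * 100))
```

Reporting the complement for negative predictions keeps the confidence value at or above 50% for either label, which makes the two cases directly comparable.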

```python
import string
import re
from os import listdir
from numpy import array
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.utils import plot_model
from keras.models import Sequential
from keras.layers import Dense

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if w not in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

# load and clean a dataset
def load_clean_dataset(vocab):
	# load documents
	neg = process_docs('txt_sentoken/neg', vocab)
	pos = process_docs('txt_sentoken/pos', vocab)
	docs = neg + pos
	# prepare labels
	labels = array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
	return docs, labels

# fit a tokenizer
def create_tokenizer(lines):
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# define the model
def define_model(n_words):
	# define network
	model = Sequential()
	model.add(Dense(50, input_shape=(n_words,), activation='relu'))
	model.add(Dense(1, activation='sigmoid'))
	# compile network
	model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
	# summarize defined model
	model.summary()
	plot_model(model, to_file='model.png', show_shapes=True)
	return model

# classify a review as negative or positive
def predict_sentiment(review, vocab, tokenizer, model):
	# clean
	tokens = clean_doc(review)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	# convert to line
	line = ' '.join(tokens)
	# encode
	encoded = tokenizer.texts_to_matrix([line], mode='binary')
	# predict sentiment
	yhat = model.predict(encoded, verbose=0)
	# retrieve predicted percentage and label
	percent_pos = yhat[0,0]
	if round(percent_pos) == 0:
		return (1-percent_pos), 'NEGATIVE'
	return percent_pos, 'POSITIVE'

# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = set(vocab.split())
# load all reviews (the final model is trained on every review, so there is no held-out test set)
train_docs, y_train = load_clean_dataset(vocab)
# create the tokenizer
tokenizer = create_tokenizer(train_docs)
# encode data
x_train = tokenizer.texts_to_matrix(train_docs, mode='binary')
# define network
n_words = x_train.shape[1]
model = define_model(n_words)
# fit network
model.fit(x_train, y_train, epochs=10, verbose=2)
# test positive text
text = 'Best movie ever! It was great, I recommend it.'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))
# test negative text
text = 'This is a bad movie.'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))
```

## Further Exploration



This section lists some ideas for extending the lesson that you may wish to explore:

- **Vocabulary Management**: Experiment with larger or smaller vocabularies. A better-chosen set of words might improve performance.
- **Network Topology Adjustments**: Dive into different network structures, such as deeper or broader architectures. A better-suited network could potentially improve results.
- **Incorporate Regularization**: Examine the benefits of regularization techniques like dropout. This may slow the model's convergence, reduce overfitting, and improve its performance on the test set.
- **Enhanced Data Cleaning**: Delve deeper into the cleaning process of the review text. Adjusting the amount of cleaning might have implications on model accuracy.
- **Training Insights**: Use the test data as a validation set during training and visualize the train-test loss through plots. Employ these insights to fine-tune parameters like batch size and training epochs.
- **Identifying Key Words**: Investigate specific words within reviews that might strongly predict sentiments.
- **Bigram Implementation**: Modify the model to score word bigrams and analyze its efficacy across different scoring methods.
- **Review Truncation**: Examine the effects of truncating movie reviews on model performance. Experiment by shortening reviews from the beginning, end, or middle.
- **Model Ensembles**: Construct models using varied word scoring methods. Check if combining these models - creating ensembles - can boost model accuracy.
- **Evaluate Real-world Reviews**: After training a comprehensive model on all available data, assess its performance on genuine movie reviews sourced from the web.
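As a starting point for the bigram idea in the list above, the sketch below shows one way to turn a cleaned token list into word bigrams and count them. The helper name `to_bigrams()` is hypothetical, and how the resulting pairs are fed into the bag-of-words encoding is left open; this only illustrates the pairing step.

```python
from collections import Counter

# Hypothetical helper: pair each token with the token that follows it.
def to_bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

tokens = ['not', 'good', 'film', 'not', 'good']
bigrams = to_bigrams(tokens)
print(bigrams)
# count how often each bigram occurs in the document
print(Counter(bigrams).most_common(1))
```

Bigrams such as `('not', 'good')` keep a little word-order information that single-word scoring discards, at the cost of a much larger vocabulary.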