# Steps to Process Film Review Data for Sentiment Analysis

> Working with text data varies greatly depending on the specific problem at hand.

Loading the data may seem straightforward, but the process quickly becomes intricate once you get into cleaning a specific dataset. If you are unsure where to start, or in what order to perform the steps that turn raw data into model-ready data, this class segment offers guidance. Today, we will walk through preparing movie review data for sentiment analysis. By the end of this session, students will be able to:

- Load text data and clean it to remove punctuation and other irrelevant elements.
- Build a vocabulary, trim it for relevance, and save it for future use.
- Clean movie reviews, filter them with the chosen vocabulary, and save the results in a format suited for modeling.

This section is organized as follows:

1. Movie Review Dataset
2. Load
3. Clean Text Data
4. Construct a Vocabulary
5. Save Prepared Data

## Movie Review Dataset

The **``Movie Review Data``** is a collection of movie reviews retrieved from [IMDb](https://www.imdb.com/) in the early 2000s by researchers Bo Pang and Lillian Lee.

The reviews were collected and made public as part of their research on natural language processing (NLP). The dataset was first released in 2002, then revised and released as version 2.0 in 2004.

> It contains 1,000 positive and 1,000 negative movie reviews, drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at IMDb.

Pang and Lee refer to this collection as the **``polarity dataset``**.

In [None]:
!gdown https://drive.google.com/uc?id=10OsDrN-m2IIKqZJrf-xMbEtg8VlGJ2DQ

After the archive is decompressed, you'll find a folder named **``txt_sentoken``**. Inside this folder, there are two subfolders, **``neg``** and **``pos``**, which hold the negative and positive reviews, respectively. Each review is stored in its own file, with file names that begin with a prefix running from **``cv000``** to **``cv999``** in both the neg and pos folders. After extracting the archive, we'll look at how to load this text data.

In [None]:
!tar -xvf review_polarity.tar.gz
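
After extraction, a quick sanity check of the folder layout can catch a failed download early. The sketch below is only an illustration: `count_reviews()` is a hypothetical helper, not part of the original notebook, and it assumes the archive was extracted into the current working directory.

```python
import os

# hypothetical helper: count the .txt files in each class subfolder
def count_reviews(root):
    counts = {}
    for label in ('neg', 'pos'):
        folder = os.path.join(root, label)
        # report a missing folder as 0 instead of raising an error
        if not os.path.isdir(folder):
            counts[label] = 0
            continue
        counts[label] = sum(1 for name in os.listdir(folder) if name.endswith('.txt'))
    return counts

# if extraction succeeded, each class should report 1,000 files
print(count_reviews('txt_sentoken'))
```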

## Load

In this section, we'll look at loading individual text files and then iterating over directories of files. We assume the review dataset is stored locally in the **``txt_sentoken``** directory of the current working directory. Loading a file involves opening it, reading its ASCII text, and closing it. As a demonstration, the code below reads the first negative review file, **``cv000_29416.txt``**.

In [None]:
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)

print(text)


This loads the document as ASCII text and preserves all white space, including line breaks. The logic is wrapped in a function named **``load_doc()``**, which takes a filename and returns the file's text.
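
As an optional variation, the same function can be written with a `with` block, which closes the file even if reading fails. This is a sketch rather than the notebook's own version. The `encoding` parameter and its `'utf-8'` default are assumptions added here (ASCII text is valid UTF-8), and the throwaway demo file only exists so the snippet runs without the dataset.

```python
import os
import tempfile

# variant of load_doc() using a context manager; encoding default is an assumption
def load_doc(filename, encoding='utf-8'):
    # the with-block closes the file automatically, even if read() raises
    with open(filename, 'r', encoding=encoding) as file:
        return file.read()

# demonstrate on a throwaway file so the sketch runs without the dataset
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('plot : two teen couples\ngo to a church party\n')

# white space and line breaks are preserved
print(repr(load_doc(demo_path)))
```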

## Clean Text Data

First, we'll load a document and examine the raw tokens divided by white space. We'll utilize the **``load_doc()``** function from the earlier section. The **``split()``** method can be employed to segment the loaded document into tokens based on white space.

In [None]:
# split into tokens by white space
tokens = text.split()
print(tokens)

```python

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive',
'.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend',
'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', "what's", 'the', 'deal',
 '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind-fuck',
'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but',
'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review',
'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt',
'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and','such', '(', 'lost', 'highway', '&',
'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'bad', 'ways', 'of', 'making', 'all', 'types', 'of', 'films',
',', 'and', 'these', 'folks', 'just', "didn't", 'snag', 'this', 'one', 'correctly', '.', 'they', 'seem', 'to',
'have', 'taken', 'this', 'pretty', 'neat', 'concept', ',', 'but', 'executed', 'it', 'terribly', '.', 'so', 'what', 'are',
'the', 'problems', 'with', 'the', 'movie', '?', 'well', ',', 'its', 'main', 'problem', 'is', 'that', "it's", 'simply',
'too', 'jumbled', '.', 'it', 'starts', 'off', '"', 'normal', '"', 'but', 'then', 'downshifts', 'into', 'this', '"', 'fantasy', '"', 'world', 'in', 'which', 'you', ',', 'as', 'an', 'audience',
'member', ',', 'have', 'no', 'idea', "what's", 'going', 'on', '.', 'there', 'are', 'dreams', ',', 'there', 'are',
'characters', 'coming', 'back', 'from', 'the', 'dead', ',', 'there', 'are', 'others', 'who', 'look', 'like', 'the',
'dead', ',', 'there', 'are', 'strange', 'apparitions', ',', 'there', 'are', 'disappearances', ',', 'there', 'are', 'a',
'looooot', 'of', 'chase', 'scenes', ',', 'there', 'are', 'tons', 'of', 'weird', 'things', 'that', 'happen', ',', 'and',
'most', 'of', 'it', 'is', 'simply', 'not', 'explained', '.', 'now', 'i', 'personally', "don't", 'mind', 'trying', 'to',
'unravel', 'a', 'film', 'every', 'now', 'and', 'then', ',', 'but', 'when', 'all', 'it', 'does', 'is', 'give', 'me',
'the', 'same', 'clue', 'over', 'and', 'over', 'again', ',', 'i', 'get', 'kind', 'of', 'fed', 'up', 'after', 'a', 'while', ',',
'which', 'is', 'this', "film's", 'biggest', 'problem', '.', "it's", 'obviously', 'got', 'this', 'big', 'secret',
 'to', 'hide', ',', 'but', 'it', 'seems', 'to', 'want', 'to', 'hide', 'it', 'completely', 'until', 'its', 'final',
 'five', 'minutes', '.', 'and', 'do', 'they', 'make', 'things', 'entertaining', ',', 'thrilling', 'or', 'even', 'engaging',
',', 'in', 'the', 'meantime', '?', 'not', 'really', '.', 'the', 'sad', 'part', 'is', 'that', 'the', 'arrow', 'and', 'i',
'both', 'dig', 'on', 'flicks', 'like', 'this', ',', 'so', 'we', 'actually', 'figured', 'most', 'of', 'it', 'out', 'by',
'the', 'half-way', 'point', ',', 'so', 'all', 'of', 'the', 'strangeness', 'after', 'that', 'did', 'start', 'to', 'make',
'a', 'little', 'bit', 'of', 'sense', ',', 'but', 'it', 'still', "didn't", 'the', 'make', 'the', 'film', 'all', 'that',
'more', 'entertaining', '.', 'i', 'guess', 'the', 'bottom', 'line', 'with', 'movies', 'like', 'this', 'is', 'that', 'you',
'should', 'always', 'make', 'sure', 'that', 'the', 'audience', 'is', '"', 'into', 'it', '"', 'even', 'before', 'they',
'are', 'given', 'the', 'secret', 'password', 'to', 'enter', 'your', 'world', 'of', 'understanding', '.', 'i', 'mean', ',',
'showing', 'melissa', 'sagemiller', 'running', 'away', 'from', 'visions', 'for', 'about', '20', 'minutes', 'throughout',
'the', 'movie', 'is', 'just', 'plain', 'lazy', '!', '!', 'okay', ',', 'we', 'get', 'it', '.', '.', '.', 'there', 'are', 'people', 'chasing', 'her', 'and', 'we', "don't", 'know', 'who', 'they', 'are',
 '.', 'do', 'we', 'really', 'need', 'to', 'see', 'it', 'over', 'and', 'over', 'again', '?', 'how', 'about', 'giving',
 'us', 'different', 'scenes', 'offering', 'further', 'insight', 'into', 'all', 'of', 'the', 'strangeness', 'going', 'down',
'in', 'the', 'movie', '?', 'apparently', ',', 'the', 'studio', 'took', 'this', 'film', 'away', 'from', 'its', 'director',
'and', 'chopped', 'it', 'up', 'themselves', ',', 'and', 'it', 'shows', '.', 'there', "might've", 'been', 'a', 'pretty',
'decent', 'teen', 'mind-fuck', 'movie', 'in', 'here', 'somewhere', ',', 'but', 'i', 'guess', '"', 'the', 'suits', '"', 'decided',
'that', 'turning', 'it', 'into', 'a', 'music', 'video', 'with', 'little', 'edge', ',', 'would', 'make', 'more', 'sense',
'.', 'the', 'actors', 'are', 'pretty', 'good', 'for', 'the', 'most', 'part', ',', 'although', 'wes', 'bentley', 'just', 'seemed',
'to', 'be', 'playing', 'the', 'exact', 'same', 'character', 'that', 'he', 'did', 'in', 'american', 'beauty', ',', 'only',
'in', 'a', 'new', 'neighborhood', '.', 'but', 'my', 'biggest', 'kudos', 'go', 'out', 'to', 'sagemiller', ',', 'who', 'holds',
'her', 'own', 'throughout', 'the', 'entire', 'film', ',', 'and', 'actually', 'has', 'you', 'feeling', 'her', "character's",
'unraveling', '.', 'overall', ',', 'the', 'film', "doesn't", 'stick', 'because', 'it', "doesn't", 'entertain', ',',
"it's", 'confusing', ',', 'it', 'rarely', 'excites', 'and', 'it', 'feels', 'pretty', 'redundant', 'for', 'most', 'of',
'its', 'runtime', ',', 'despite', 'a', 'pretty', 'cool', 'ending', 'and', 'explanation', 'to', 'all', 'of', 'the',
'craziness', 'that', 'came', 'before', 'it', '.', 'oh', ',', 'and', 'by', 'the', 'way', ',', 'this', 'is', 'not',
'a', 'horror', 'or', 'teen', 'slasher', 'flick', '.', '.', '.', "it's", 'just', 'packaged', 'to', 'look', 'that', 'way',
'because', 'someone', 'is', 'apparently', 'assuming', 'that', 'the', 'genre', 'is', 'still', 'hot', 'with', 'the', 'kids',
'.', 'it', 'also', 'wrapped', 'production', 'two', 'years', 'ago', 'and', 'has', 'been', 'sitting', 'on', 'the', 'shelves',
'ever', 'since', '.', 'whatever', '.', '.', '.', 'skip', 'it', '!', "where's", 'joblo', 'coming', 'from', '?', 'a', 'nightmare',
'of', 'elm', 'street', '3', '(', '7/10', ')', '-', 'blair', 'witch', '2', '(', '7/10', ')', '-', 'the', 'crow', '(', '9/10',
')', '-', 'the', 'crow', ':', 'salvation', '(', '4/10', ')', '-', 'lost', 'highway', '(', '10/10', ')', '-', 'memento',
 '(', '10/10', ')', '-', 'the', 'others', '(', '9/10', ')', '-', 'stir', 'of', 'echoes', '(', '8/10', ')']
```

Examining the raw tokens provides numerous insights into potential preprocessing steps, including:

- Stripping words of punctuation (for instance, **``what's``**).
- Discarding tokens composed solely of punctuation (like **``-``**).
- Eliminating tokens with numbers (such as **``10/10``**).
- Omitting single-character tokens (like **``a``**).
- Excluding tokens of little semantic value (like **``and``**).

Here are some strategies:

- To filter out punctuation from tokens, we can use **``regular expressions``**.
- Tokens that are solely punctuation or contain numbers can be removed with an **``isalpha()``** check.
- Common English **``stop words``** can be eliminated using a list from **``NLTK``**.
- Short tokens can be filtered by assessing their length.
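
To see what each strategy does in isolation, here is a small sketch on a handful of tokens taken from the review above. `clean_tokens()` and `TOY_STOP_WORDS` are hypothetical names used only for this illustration, and the tiny hand-written stop list stands in for the NLTK list so the snippet runs without any downloads.

```python
import re
import string

# tiny hand-written stop list, for illustration only (the notebook uses NLTK's list)
TOY_STOP_WORDS = {'the', 'and', 'a', 'to', 'of'}

def clean_tokens(tokens, stop_words=TOY_STOP_WORDS):
    # 1. strip punctuation characters from every token
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', t) for t in tokens]
    # 2. drop tokens that are empty or contain non-alphabetic characters
    tokens = [t for t in tokens if t.isalpha()]
    # 3. drop stop words
    tokens = [t for t in tokens if t not in stop_words]
    # 4. drop single-character tokens
    return [t for t in tokens if len(t) > 1]

sample = ["what's", 'the', 'deal', '?', '10/10', '-', 'a', 'half-way', 'movie', 'and', 'i']
print(clean_tokens(sample))  # ['whats', 'deal', 'halfway', 'movie']
```

Note that `10/10` first becomes `1010` when punctuation is stripped and is then rejected by `isalpha()`, while `?` and `-` become empty strings, for which `isalpha()` also returns `False`.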





In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
import string
import re

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load the document
filename = 'txt_sentoken/neg/cv000_29416.txt'
text = load_doc(filename)

# split into tokens by white space
tokens = text.split()

# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# remove punctuation from each word
tokens = [re_punc.sub('', w) for w in tokens]

# remove remaining tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]

# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]

# filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
print(tokens)

We can wrap these steps in a function named **``clean_doc()``** and try it on a different review, this time a positive one.

In [None]:
from nltk.corpus import stopwords
import string
import re

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):

	# split into tokens by white space
	tokens = doc.split()

	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]

	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]

	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if w not in stop_words]

	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

The cleaning process appears to yield a satisfactory collection of tokens as an initial attempt. There are additional cleaning measures we could consider, and I'll leave those to your creativity. Now, let's explore how we can curate a desired set of tokens.

## Construct a Vocabulary

When creating predictive text models, such as the **``bag-of-words``** model, there's a need to minimize the vocabulary size. A more extensive vocabulary leads to sparser representations of each word or document. Part of the text preparation for sentiment analysis is determining and customizing the word vocabulary that the model will recognize. This can be accomplished by loading all documents in the dataset and creating a word set. We might choose to include all these words or possibly exclude some. The finalized vocabulary can be saved for future use, such as when filtering words in upcoming documents.
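
The link between vocabulary size and sparsity can be illustrated with a toy count vector. The sketch below is not part of the notebook's pipeline: `bag_of_words()` and `sparsity()` are hypothetical helpers, and the two vocabularies are made up from words in the sample review.

```python
# hypothetical helper: one count per vocabulary word, in vocabulary order
def bag_of_words(tokens, vocab):
    index = {word: i for i, word in enumerate(vocab)}
    vector = [0] * len(vocab)
    for token in tokens:
        if token in index:
            vector[index[token]] += 1
    return vector

# hypothetical helper: fraction of entries that are zero
def sparsity(vector):
    return vector.count(0) / len(vector)

doc = ['cool', 'idea', 'bad', 'package', 'bad']
small_vocab = ['bad', 'cool', 'idea', 'package']
large_vocab = small_vocab + ['arrow', 'crow', 'dream', 'echoes', 'highway', 'memento']

small_vector = bag_of_words(doc, small_vocab)
large_vector = bag_of_words(doc, large_vocab)
print(small_vector, sparsity(small_vector))  # [2, 1, 1, 1] 0.0
print(large_vector, sparsity(large_vector))  # [2, 1, 1, 1, 0, 0, 0, 0, 0, 0] 0.6
```

The same document carries the same information in both cases, but the larger vocabulary pads its vector with zeros.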

A useful tool for this task is the **``Counter``**, which acts as a dictionary cataloging words and their frequencies, equipped with some handy additional functions. We should design a function that processes a document and integrates it into the vocabulary. This function should:

1. Load a document using the previously defined **``load_doc()``** function.
2. Clean the document utilizing the **``clean_doc()``** function.
3. Add all the tokens to the **``Counter``** and update their counts. The **``update()``** function on the counter object can achieve this.

The function **``add_doc_to_vocab()``** below accepts a document filename and a **``Counter``** vocabulary as its parameters, and **``process_docs()``** applies it to every review file in a directory.
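
As a quick illustration of the last step, the sketch below shows how repeated calls to `update()` accumulate counts across documents. The token lists are made up for the example.

```python
from collections import Counter

vocab = Counter()

# tokens from two (made-up) cleaned documents
vocab.update(['film', 'cool', 'idea'])
vocab.update(['film', 'bad', 'film'])

print(vocab['film'])         # 3
print(len(vocab))            # 4 distinct words
print(vocab.most_common(1))  # [('film', 3)]
```

One detail worth keeping in mind: `update()` expects an iterable of tokens. Passing a single string would count its individual characters instead of the word.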

In [None]:
import string
import re
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):

	# split into tokens by white space
	tokens = doc.split()

	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]

	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]

	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if w not in stop_words]

	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):

	# walk through all files in the folder
	for filename in listdir(directory):

		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue

		# create the full path of the file to open
		path = directory + '/' + filename

		# add doc to vocab
		add_doc_to_vocab(path, vocab)

# define vocab
vocab = Counter()

# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)

# print the size of the vocab
print(len(vocab))

# print the top words in the vocab
print(vocab.most_common(50))

In [None]:
type(vocab)

In [None]:
# length of vocabulary
len(vocab)

Executing the example generates a vocabulary that encompasses all documents in the dataset, spanning both positive and negative reviews. The data reveals that there are slightly more than 46,000 distinct words throughout all the reviews. The three most prevalent words are "film," "one," and "movie."


> Words that are extremely rare, appearing only once in all reviews, may not be indicative. Similarly, some frequently occurring words might also lack predictive value.

It's essential to validate these assumptions using a specific predictive model. Typically, words that show up once or just a handful of times in 2,000 reviews are unlikely to offer predictive power and can be excluded from our vocabulary. This significantly reduces the number of tokens to be modeled. To achieve this, we can step through the words and their frequencies, retaining only those that meet a set threshold. In this context, **we'll keep words that appear five or more times**.

```python
# keep tokens with at least 5 occurrences
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
```

The vocabulary shrinks drastically from **``46,557``** to **``14,803``** words with this approach.

> Setting a minimum threshold of 5 occurrences might be too stringent; feel free to try different thresholds.
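
One way to explore different thresholds is to report the vocabulary size for several candidate values. The sketch below uses a made-up toy `Counter` so it runs on its own. `filter_vocab()` is a hypothetical helper that applies the same `>=` comparison as the code above.

```python
from collections import Counter

# hypothetical helper: keep words whose count is at least min_occurrence
def filter_vocab(vocab, min_occurrence):
    return [word for word, count in vocab.items() if count >= min_occurrence]

# made-up counts, for illustration only
toy_vocab = Counter({'film': 9, 'movie': 5, 'plot': 4, 'arrow': 1, 'joblo': 1})

for threshold in (1, 2, 5, 6):
    print(threshold, len(filter_vocab(toy_vocab, threshold)))
# 1 5
# 2 3
# 5 2
# 6 1
```

With the real vocabulary, replacing `toy_vocab` by the `vocab` counter built earlier would show how quickly the word list shrinks as the threshold rises.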

After finalizing the vocabulary, it can be saved to a new file. I prefer saving the vocabulary in ASCII format, with each word on a separate line. The following introduces a function named **``save_list()``** which saves a list of items - in this instance, tokens - to a file, with each token on a distinct line.

In [None]:
import string
import re
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):

	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if w not in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

# save list to file
def save_list(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# define vocab
vocab = Counter()

# add all docs to vocab
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)

# print the size of the vocab
print(len(vocab))

# keep tokens with at least 5 occurrences
min_occurrence = 5
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

## Save Prepared Data

We can use the cleaning steps and the chosen vocabulary to process each film review and then save the processed reviews, ready for modeling. Separating data preparation from modeling is advisable, as it lets you concentrate on the modeling phase and return to data preparation when new ideas arise. First, we load the vocabulary from the file **``vocab.txt``**.

In [None]:
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
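
To check that saving and loading are consistent, a round trip through a temporary file can be sketched as follows. `load_vocab()` is a hypothetical helper that combines the load, split, and set steps. The token list is made up, and a temporary path is used so the real `vocab.txt` is not overwritten.

```python
import os
import tempfile

# same behavior as the notebook's save_list(), written with a context manager
def save_list(lines, filename):
    with open(filename, 'w') as file:
        file.write('\n'.join(lines))

# hypothetical helper: read a one-word-per-line file into a set
def load_vocab(filename):
    with open(filename, 'r') as file:
        return set(file.read().split())

tokens = ['film', 'movie', 'plot']
path = os.path.join(tempfile.mkdtemp(), 'vocab.txt')

save_list(tokens, path)
vocab = load_vocab(path)
print(vocab == set(tokens))  # True
```

Converting the list to a set makes the later `w in vocab` membership checks fast, which matters when filtering every token of every review.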

Next, we can process the reviews, use the loaded vocabulary to filter out unwanted tokens, and save the cleaned reviews in a new file. One approach is to save all positive reviews in one file and all negative reviews in another, with the filtered tokens separated by spaces and one review per line. To begin, we need a function that loads a document, cleans it, filters it, and returns it as a single line ready to be saved. The function **``doc_to_line()``** below does this, accepting a filename and a vocabulary (as a set) as parameters. It calls the previously defined **``load_doc()``** to load the document and **``clean_doc()``** to tokenize it.


```python
# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
    # load the doc
    doc = load_doc(filename)
    
    # clean doc
    tokens = clean_doc(doc)

    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    
    return ' '.join(tokens)
```
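
The filtering and joining steps can be tried on their own with a few made-up tokens. In this sketch, `tokens_to_line()` is a hypothetical helper that covers only the last two steps of `doc_to_line()`, so it needs neither the dataset nor NLTK.

```python
# hypothetical helper: keep in-vocabulary tokens and join them with single spaces
def tokens_to_line(tokens, vocab):
    return ' '.join(w for w in tokens if w in vocab)

vocab = {'film', 'cool', 'idea', 'bad'}
tokens = ['film', 'joblo', 'cool', 'idea', 'sagemiller', 'bad']

print(tokens_to_line(tokens, vocab))           # film cool idea bad
print(repr(tokens_to_line(['joblo'], vocab)))  # ''
```

The second call shows an edge case: a review with no in-vocabulary tokens becomes an empty line, which is worth remembering when the saved files are read back.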






Next, we can introduce a revised version of **``process_docs()``** that will iterate through all the reviews in a directory and transform them into lines using the **``doc_to_line()``** function for each file. This will result in a list of lines being produced.

```python
# load all docs in a directory
def process_docs(directory, vocab):
    lines = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith(".txt"):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load and clean the doc
        line = doc_to_line(path, vocab)
        # add to list
        lines.append(line)
    return lines
```
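
One design consideration: `listdir()` returns file names in arbitrary order, so the order of lines in the output file may differ between machines. If a reproducible order matters, the names can be sorted first. The sketch below is an optional variation, with `list_review_files()` as a hypothetical helper and a temporary folder of placeholder files standing in for the dataset.

```python
import os
import tempfile

# hypothetical helper: full paths of .txt files in a directory, sorted by name
def list_review_files(directory):
    names = sorted(n for n in os.listdir(directory) if n.endswith('.txt'))
    return [os.path.join(directory, n) for n in names]

# build a throwaway folder so the sketch runs without the dataset
demo_dir = tempfile.mkdtemp()
for name in ('cv002_b.txt', 'cv000_a.txt', 'cv001_c.txt', 'README'):
    with open(os.path.join(demo_dir, name), 'w') as f:
        f.write('placeholder')

print([os.path.basename(p) for p in list_review_files(demo_dir)])
# ['cv000_a.txt', 'cv001_c.txt', 'cv002_b.txt']
```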

We can subsequently invoke **``process_docs()``** for both the directories containing positive and negative reviews. After that, we can use the **``save_list()``** function, mentioned earlier, to store each list of refined reviews in a file.


In [None]:
import string
import re
from os import listdir
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# prepare regex for char filtering
	re_punc = re.compile('[%s]' % re.escape(string.punctuation))
	# remove punctuation from each word
	tokens = [re_punc.sub('', w) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if w not in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens

# save list to file
def save_list(lines, filename):
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip files that do not have the right extension
		if not filename.endswith(".txt"):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

# load vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# prepare negative reviews
negative_lines = process_docs('txt_sentoken/neg', vocab)
save_list(negative_lines, 'negative.txt')
# prepare positive reviews
positive_lines = process_docs('txt_sentoken/pos', vocab)
save_list(positive_lines, 'positive.txt')
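
When it is time to model, the two saved files can be read back into a list of documents with matching labels. The following is a minimal sketch under stated assumptions: `load_lines()` and `load_labeled()` are hypothetical helpers, the label convention (0 for negative, 1 for positive) is a choice made here, and temporary files with made-up content stand in for `negative.txt` and `positive.txt`.

```python
import os
import tempfile

# hypothetical helper: one review per line, as written by save_list()
def load_lines(filename):
    with open(filename, 'r') as file:
        # split('\n') keeps an empty final review, unlike splitlines()
        return file.read().split('\n')

# hypothetical helper: combine both classes; labels 0 = negative, 1 = positive (assumed)
def load_labeled(negative_file, positive_file):
    negative = load_lines(negative_file)
    positive = load_lines(positive_file)
    docs = negative + positive
    labels = [0] * len(negative) + [1] * len(positive)
    return docs, labels

# stand-in files so the sketch runs without the dataset
folder = tempfile.mkdtemp()
neg_path = os.path.join(folder, 'negative.txt')
pos_path = os.path.join(folder, 'positive.txt')
with open(neg_path, 'w') as f:
    f.write('bad package\nfilm jumbled')
with open(pos_path, 'w') as f:
    f.write('cool idea')

docs, labels = load_labeled(neg_path, pos_path)
print(docs)    # ['bad package', 'film jumbled', 'cool idea']
print(labels)  # [0, 0, 1]
```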