***
# <center>***Creating Custom Corpora***
***

## ***I learned the following natural language processing techniques:***

* [Setting up a custom corpus](#custom-corpus)
* [Creating a wordlist corpus](#wordlist-corpus)
* [Creating a part-of-speech tagged word corpus](#pos-tagged-corpus)
* [Creating a chunked phrase corpus](#chunked-corpus)
* [Creating a categorized text corpus](#categorized-text-corpus)
* [Creating a categorized chunk corpus reader](#categorized-chunk-corpus)
* [Lazy corpus loading](#lazy-loading)
* [Creating a custom corpus view](#custom-corpus-view)
* [Creating a MongoDB-backed corpus reader](#mongodb-corpus)
* [Corpus editing with file locking](#file-locking)


In this notebook, I have covered how to use **corpus** readers and create custom corpora. If you want to train your own model, such as a part-of-speech tagger or text classifier, you will need to create a custom corpus to train on.

***
## ***<a id="custom-corpus"></a>Setting up a custom corpus:***
***


A **corpus** is a collection of text documents, and **corpora** is the plural of corpus. This comes from the Latin word for body; in this case, a body of text. So a custom corpus is really just a bunch of text files in a directory, often alongside many other directories of text files.

**NLTK** defines a list of **data directories**, or **paths**, in `nltk.data.path`. Our custom corpora must be within one of these paths so that NLTK can find them. To avoid conflicts with the official data package, we will create a custom `nltk_data` directory in our home directory. The following Python code creates this directory and verifies that it is in the list of known paths specified by `nltk.data.path`:

In [1]:

import os

#Creating the NLTK Data Path
path = os.path.expanduser('~/nltk_data')

if not os.path.exists(path):
    os.mkdir(path)
os.path.exists(path)


True

In [2]:

#Checking NLTK's Data Path
import nltk.data
path in nltk.data.path


True

In [3]:

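# Register the directory only if NLTK does not already know about it
if path not in nltk.data.path:
    nltk.data.path.append(path)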


Now, we can create a simple wordlist file and make sure it loads.

In [4]:

corpus_path = os.path.join(path, 'corpora/cookbook')
os.makedirs(corpus_path, exist_ok=True)
with open(os.path.join(corpus_path, 'mywords.txt'), 'w') as f:
    f.write('Your content here')


We can then load the file through NLTK. With `format='raw'`, `nltk.data.load()` returns the file's contents as bytes:

In [5]:

nltk.data.load('corpora/cookbook/mywords.txt', format='raw')


b'Your content here'
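As a self-contained variation on the steps above, the sketch below builds a throwaway data directory, registers it, and loads one file both as bytes and as text. The temporary directory, the corpus name `cookbook_demo`, and the variable names are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile

import nltk.data

# Hypothetical throwaway data directory (stands in for ~/nltk_data)
data_dir = tempfile.mkdtemp()
if data_dir not in nltk.data.path:
    nltk.data.path.append(data_dir)

# NLTK looks for corpora under <data dir>/corpora/<corpus name>
corpus_dir = os.path.join(data_dir, 'corpora', 'cookbook_demo')
os.makedirs(corpus_dir, exist_ok=True)
with open(os.path.join(corpus_dir, 'mywords.txt'), 'w', encoding='utf-8') as f:
    f.write('nltk\ncorpus\n')

# format='raw' returns bytes, format='text' returns a decoded string
raw_bytes = nltk.data.load('corpora/cookbook_demo/mywords.txt', format='raw')
as_text = nltk.data.load('corpora/cookbook_demo/mywords.txt', format='text')
print(type(raw_bytes).__name__, type(as_text).__name__)
```

Guarding the `append` call with a membership test keeps `nltk.data.path` free of duplicate entries when the cell is re-run.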


***
## ***<a id="wordlist-corpus"></a>Creating a wordlist corpus:***
***


The `WordListCorpusReader` class is one of the simplest **CorpusReader classes**. It provides access to a file containing a list of words, one word per line. 

We need to start by creating a wordlist file. This could be a single-column CSV file, or just a normal text file with one word per line. Let's create a file named `word_list.txt` that looks like this:

In [6]:

words = ["apple", "banana", "cherry", "date", "elderberry", "fig", "grape"]

file_name = "word_list.txt"

with open(file_name, "w") as file:
    for word in words:
        file.write(word + "\n")

print(f"Word list saved to {file_name}")


Word list saved to word_list.txt


In [7]:

from nltk.corpus.reader import WordListCorpusReader
reader = WordListCorpusReader('.', [file_name])
reader.words()


['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig', 'grape']

In [8]:

reader.fileids()


['word_list.txt']

***How it works...***
- The `WordListCorpusReader` class inherits from `CorpusReader`, which is a common base class for all corpus readers. The CorpusReader class does all the work of identifying which files to read, while `WordListCorpusReader` reads the files and tokenizes each line to produce a list of words.

When you call the `words()` function, it calls **nltk.tokenize.line_tokenize()** on the raw file data, which you can access using the `raw()` function as follows:

In [9]:

reader.raw()


'apple\r\nbanana\r\ncherry\r\ndate\r\nelderberry\r\nfig\r\ngrape\r\n'

In [10]:

from nltk.tokenize import line_tokenize
line_tokenize(reader.raw())


['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig', 'grape']
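To see the tokenization step in isolation, here is a minimal sketch that calls `line_tokenize()` on a hand-written string instead of a file. The sample string is made up for illustration; it mixes Windows and Unix line endings and includes a blank line.

```python
from nltk.tokenize import line_tokenize

# Mixed line endings plus one blank line (illustrative input)
sample = 'apple\r\nbanana\n\ncherry\n'

# Blank lines are discarded by default, and line endings are stripped
tokens = line_tokenize(sample)
print(tokens)
```

This is why the `\r\n` sequences visible in `reader.raw()` do not appear in the list returned by `reader.words()`.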

***Names wordlist corpus:***

Another wordlist corpus that comes with NLTK is the `names corpus` that is shown in the following code. It contains two files: female.txt and male.txt, each containing a list of a few thousand common first names organized by gender as follows:

In [11]:

from nltk.corpus import names
names.fileids()


['female.txt', 'male.txt']

In [12]:

len(names.words('female.txt')),len(names.words('male.txt'))


(5001, 2943)

***English words corpus:***

NLTK also comes with a large list of English words. There's one file with 850 basic words, and another list with over 200,000 known English words, as shown in the following code:

In [13]:

from nltk.corpus import words
words.fileids()


['en', 'en-basic']

In [14]:

len(words.words('en-basic')),len(words.words('en'))


(850, 235886)


***
## ***<a id="pos-tagged-corpus"></a>Creating a part-of-speech tagged word corpus:***
***


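**Part-of-speech tagging** is the process of identifying the **part-of-speech** tag for a word. Most of the time, a tagger must first be trained on a training corpus. The `TaggedCorpusReader` class reads such a corpus. Because `word_list.txt` contains no tags, every tag in the outputs below is `None`.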

In [15]:

from nltk.corpus.reader import TaggedCorpusReader
# Use TaggedCorpusReader to read the word_list.txt file
corpus_root = "."  # Current directory
file_pattern = r"word_list\.txt"

reader = TaggedCorpusReader(corpus_root, file_pattern)

# Access the words (the file has no tags yet)
print("Raw content of file:", reader.words())


Raw content of file: ['apple', 'banana', 'cherry', 'date', 'elderberry', ...]


In [16]:

reader.tagged_words()


[('apple', None), ('banana', None), ('cherry', None), ...]

In [17]:

reader.sents()


[['apple'], ['banana'], ['cherry'], ['date'], ...]

In [18]:

reader.tagged_sents()


[[('apple', None)], [('banana', None)], ...]

In [19]:

reader.paras()


[[['apple'], ['banana'], ['cherry'], ['date'], ['elderberry'], ['fig'], ['grape']]]

In [20]:

reader.tagged_paras()


[[[('apple', None)], [('banana', None)], [('cherry', None)], [('date', None)], [('elderberry', None)], [('fig', None)], [('grape', None)]]]
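For comparison, the sketch below reads a file that does contain tags. It assumes the default `TaggedCorpusReader` conventions, where each token is written as `word/tag` and each line is one sentence. The file name `demo.pos` and its two sentences are invented for this example.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical tagged corpus: one sentence per line, tokens as word/tag
root = tempfile.mkdtemp()
with open(os.path.join(root, 'demo.pos'), 'w', encoding='utf-8') as f:
    f.write('The/DT quick/JJ fox/NN jumps/VBZ ./.\n')
    f.write('It/PRP runs/VBZ ./.\n')

tagged_reader = TaggedCorpusReader(root, r'.*\.pos')

# Each token is now a (word, tag) pair instead of (word, None)
tagged_words = list(tagged_reader.tagged_words())
print(tagged_words[:3])
print(len(tagged_reader.sents()))
```

With real tags in the file, `tagged_sents()` and `tagged_paras()` return the same pairs grouped by sentence and by paragraph.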



***
## ***<a id="chunked-corpus"></a>Creating a chunked phrase corpus:***
***


A **chunk** is a short phrase within a **sentence**. If you remember sentence diagrams from grade school, they were a tree-like representation of phrases within a sentence. This is exactly what chunks are. Chunking, also called shallow parsing, groups words into meaningful phrases like noun phrases (NPs), verb phrases (VPs), and prepositional phrases (PPs). This is useful for named entity recognition (NER), question answering, and text classification.

***Steps to Create a Chunked Phrase Corpus:***

 - `Load the Text` – Define or import textual data.
 - `Tokenization & POS Tagging` – Split text into words and assign POS tags.
 - `Define a Chunk Grammar` – Use Regular Expressions (Regex) patterns to extract phrases.
 - `Chunking` – Apply the grammar to extract noun/verb/prepositional phrases.



In [21]:

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

# Download necessary data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "A chunk is a short phrase within a sentence."

# Tokenization and POS Tagging
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

# Define Chunk Grammar (Noun Phrases: NP, Verb Phrases: VP)
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}      # Noun phrase: Determiner (optional) + Adjective(s) + Noun
    VP: {<VB.*><NP|PP>*}      # Verb phrase: Verb + Noun Phrase or Prepositional Phrase
    PP: {<IN><NP>}            # Prepositional phrase: Preposition + Noun Phrase
"""

# Create a parser
chunk_parser = RegexpParser(chunk_grammar)

# Apply chunking
chunk_tree = chunk_parser.parse(pos_tags)

# Print chunked structure
print(chunk_tree)


(S
  (NP A/DT chunk/NN)
  (VP is/VBZ (NP a/DT short/JJ phrase/NN))
  (PP within/IN (NP a/DT sentence/NN))
  ./.)


[nltk_data] Downloading package punkt to C:\Users\DELL/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DELL/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


***We can extract phrases directly from the chunk tree.***

In [22]:
# Extract noun phrases (NPs) from the chunk tree
def extract_phrases(tree, label):
    phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == label):
        phrase = " ".join(word for word, pos in subtree.leaves())
        phrases.append(phrase)
    return phrases

# Extract different phrase types
noun_phrases = extract_phrases(chunk_tree, "NP")
verb_phrases = extract_phrases(chunk_tree, "VP")
prepositional_phrases = extract_phrases(chunk_tree, "PP")

# Print results
print("Noun Phrases:", noun_phrases)
print("Verb Phrases:", verb_phrases)
print("Prepositional Phrases:", prepositional_phrases)


Noun Phrases: ['A chunk', 'a short phrase', 'a sentence']
Verb Phrases: ['is a short phrase']
Prepositional Phrases: ['within a sentence']
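Once sentences have been chunked, they can be stored in a file and read back as a corpus. The sketch below assumes the bracketed format that NLTK's `ChunkedCorpusReader` reads by default, in which noun phrases are wrapped in square brackets and tokens are written as `word/tag`. The file name `demo.chunk` and its contents are illustrative.

```python
import os
import tempfile

from nltk.corpus.reader import ChunkedCorpusReader

# Hypothetical chunked corpus: [ ... ] marks a noun-phrase chunk
root = tempfile.mkdtemp()
with open(os.path.join(root, 'demo.chunk'), 'w', encoding='utf-8') as f:
    f.write('[A/DT chunk/NN] is/VBZ [a/DT short/JJ phrase/NN] ./.\n')

chunk_reader = ChunkedCorpusReader(root, r'.*\.chunk')

# Each sentence comes back as a Tree whose NP subtrees are the chunks
first_sent = chunk_reader.chunked_sents()[0]
noun_phrases = [
    ' '.join(word for word, tag in subtree.leaves())
    for subtree in first_sent.subtrees(lambda t: t.label() == 'NP')
]
print(first_sent)
print(noun_phrases)
```

Unlike the `RegexpParser` approach, nothing is parsed at read time here: the chunk boundaries are taken from the file as written.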



***
## ***<a id="categorized-text-corpus"></a>Creating a categorized text corpus:***
***


A **categorized text corpus** is a **collection of documents** that are labeled according to their content or subject. It is useful for tasks like text classification, sentiment analysis, and topic modeling.

***Steps to Create a Categorized Text Corpus***
- `Gather Text Data` – Collect a set of documents or text data.
- `Define Categories` – Decide on the categories you want to label your data with.
- `Preprocess the Text` – Clean and prepare the text (e.g., tokenization, lowercasing).
- `Label the Text` – Assign categories to each document.

***We will create a small example with text data manually labeled:***

In [23]:
import pandas as pd

# Sample text data (articles)
data = [
    {"text": "The soccer match ended in a 2-1 victory for the home team.", "category": "Sports"},
    {"text": "The president announced a new economic policy to improve the job market.", "category": "Politics"},
    {"text": "A new smartphone with cutting-edge technology was released today.", "category": "Technology"},
    {"text": "The basketball game was intense, with both teams scoring high.", "category": "Sports"},
    {"text": "The government is focusing on digital infrastructure and public services.", "category": "Politics"},
    {"text": "A major tech company unveiled its latest AI innovations.", "category": "Technology"}
]

# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df


| | text | category |
|---|---|---|
| 0 | The soccer match ended in a 2-1 victory for th... | Sports |
| 1 | The president announced a new economic policy ... | Politics |
| 2 | A new smartphone with cutting-edge technology ... | Technology |
| 3 | The basketball game was intense, with both tea... | Sports |
| 4 | The government is focusing on digital infrastr... | Politics |
| 5 | A major tech company unveiled its latest AI in... | Technology |


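***We can tokenize the text, convert it to lowercase, and drop non-alphanumeric tokens to standardize it for further analysis. Note that the `isalnum()` filter also removes hyphenated tokens such as `2-1` and `cutting-edge`:***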

In [24]:

import nltk
from nltk.tokenize import word_tokenize

# Download necessary data
nltk.download('punkt')

# Function to preprocess text
def preprocess_text(text):
    # Tokenization and conversion to lowercase
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalnum()]
    return ' '.join(tokens)

# Apply preprocessing to the 'text' column
df['processed_text'] = df['text'].apply(preprocess_text)

# Display the preprocessed DataFrame
df[['text', 'processed_text', 'category']]



| | text | processed_text | category |
|---|---|---|---|
| 0 | The soccer match ended in a 2-1 victory for th... | the soccer match ended in a victory for the ho... | Sports |
| 1 | The president announced a new economic policy ... | the president announced a new economic policy ... | Politics |
| 2 | A new smartphone with cutting-edge technology ... | a new smartphone with technology was released ... | Technology |
| 3 | The basketball game was intense, with both tea... | the basketball game was intense with both team... | Sports |
| 4 | The government is focusing on digital infrastr... | the government is focusing on digital infrastr... | Politics |
| 5 | A major tech company unveiled its latest AI in... | a major tech company unveiled its latest ai in... | Technology |


***You can store the categorized corpus in CSV or JSON format for future use:***

In [25]:

#Save as CSV
df.to_csv("categorized_corpus.csv", index=False)

#Save as JSON
df.to_json("categorized_corpus.json", orient="records")
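The same labeled documents can also be stored as plain text files and read with NLTK's `CategorizedPlaintextCorpusReader`, which derives each file's category from its name. The sketch below is one possible layout, not the notebook's method: the `<category>_<n>.txt` naming scheme, the temporary directory, and the `cat_pattern` regular expression are all assumptions made for illustration.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical layout: one file per document, named <category>_<n>.txt
docs = {
    'sports_1.txt': 'The soccer match ended in a 2-1 victory for the home team.',
    'politics_1.txt': 'The president announced a new economic policy.',
    'technology_1.txt': 'A major tech company unveiled its latest AI innovations.',
}
root = tempfile.mkdtemp()
for name, text in docs.items():
    with open(os.path.join(root, name), 'w', encoding='utf-8') as f:
        f.write(text)

# The first regex group in cat_pattern becomes the file's category
cat_reader = CategorizedPlaintextCorpusReader(
    root, r'.*\.txt', cat_pattern=r'(\w+)_\d+\.txt'
)
sports_words = list(cat_reader.words(categories='sports'))
print(cat_reader.categories())
print(cat_reader.fileids(categories='sports'))
print(sports_words[:3])
```

Keeping the category in the file name means no separate label file has to be kept in sync with the documents.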



*** 
## ***<a id="categorized-chunk-corpus"></a>Creating a categorized chunk corpus reader:***
***


A **categorized chunk corpus** refers to a collection of chunked texts that are labeled by category, such as subject, genre, or any custom category. It’s commonly used in NLP tasks like text classification, part-of-speech tagging, and named entity recognition (NER).

We can create a categorized chunk corpus where the text is chunked into phrases and categorized based on a specific label. This corpus can then be read and processed for machine learning or text analysis tasks.

***Steps to Create a Categorized Chunk Corpus Reader:***
 - `Prepare Your Text and Categories:` First, gather the raw text and assign categories (labels) to them.
 - `Preprocess the Text:` Tokenize and apply POS tagging.
 - `Chunk the Text:` Use a chunking grammar to identify phrases.
 - `Label Each Chunk:` Add a label or category to each chunked text.
 - `Create the Corpus Reader:` Write a custom reader to load and process the chunked text data.

In [26]:

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

# Download necessary data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')



True

In [27]:

# Sample text and category
text = "The quick brown fox jumps over the lazy dog."
category = "Sports"  # Label the text as 'Sports'

# Tokenization and POS Tagging
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)


In [28]:

# Define Chunk Grammar (NP, VP, PP)
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}      # Noun Phrase: Determiner (optional) + Adjective(s) + Noun
    VP: {<VB.*><NP|PP>*}       # Verb Phrase: Verb + Noun Phrase or Prepositional Phrase
    PP: {<IN><NP>}             # Prepositional Phrase: Preposition + Noun Phrase
"""


In [29]:

# Create a parser
chunk_parser = RegexpParser(chunk_grammar)

# Apply chunking
chunk_tree = chunk_parser.parse(pos_tags)


In [30]:

# Extract phrases from the chunk tree
def extract_phrases(tree, label):
    phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == label):
        phrase = " ".join(word for word, pos in subtree.leaves())
        phrases.append(phrase)
    return phrases


In [31]:

# Extract phrases
noun_phrases = extract_phrases(chunk_tree, "NP")
verb_phrases = extract_phrases(chunk_tree, "VP")
prepositional_phrases = extract_phrases(chunk_tree, "PP")


In [32]:

# Store chunked text with categories
chunked_data = {
    "category": category,
    "noun_phrases": noun_phrases,
    "verb_phrases": verb_phrases,
    "prepositional_phrases": prepositional_phrases
}


In [33]:

print(chunked_data)


{'category': 'Sports', 'noun_phrases': ['The quick brown', 'fox', 'the lazy dog'], 'verb_phrases': ['jumps'], 'prepositional_phrases': ['over the lazy dog']}
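The cells above collect chunks into a dictionary but stop short of an actual reader class. One way to sketch the final step is to combine NLTK's `CategorizedCorpusReader` and `ChunkedCorpusReader` through multiple inheritance. The class name `CategorizedChunkedCorpusReader`, the `.chunk` files, and the file naming scheme below are assumptions made for this example.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedCorpusReader, ChunkedCorpusReader


class CategorizedChunkedCorpusReader(CategorizedCorpusReader, ChunkedCorpusReader):
    """Hypothetical reader: chunked sentences that can be filtered by category."""

    def __init__(self, *args, **kwargs):
        # CategorizedCorpusReader consumes cat_pattern from kwargs
        CategorizedCorpusReader.__init__(self, kwargs)
        ChunkedCorpusReader.__init__(self, *args, **kwargs)

    def chunked_sents(self, fileids=None, categories=None):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            fileids = self.fileids(categories)
        return ChunkedCorpusReader.chunked_sents(self, fileids)


# Illustrative corpus: one chunked file per category
root = tempfile.mkdtemp()
samples = {
    'sports_1.chunk': '[The/DT home/NN team/NN] won/VBD [the/DT match/NN] ./.\n',
    'technology_1.chunk': '[The/DT company/NN] unveiled/VBD [a/DT new/JJ phone/NN] ./.\n',
}
for name, text in samples.items():
    with open(os.path.join(root, name), 'w', encoding='utf-8') as f:
        f.write(text)

cc_reader = CategorizedChunkedCorpusReader(
    root, r'.*\.chunk', cat_pattern=r'(\w+)_\d+\.chunk'
)
sports_sents = cc_reader.chunked_sents(categories='sports')
print(cc_reader.categories())
print(sports_sents[0])
```

Only `chunked_sents()` is overridden here; the other chunked methods could be given a `categories` argument in the same way.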



*** 
## ***<a id="lazy-loading"></a>Lazy corpus loading:***
***


Lazy corpus loading refers to the technique of reading and processing data on demand, rather than loading everything into memory all at once. This is especially useful when dealing with large datasets that might not fit entirely into memory, allowing for more efficient resource management.

Loading a corpus reader can be an expensive operation due to the number of files, file sizes, and various initialization tasks. And while you'll often want to specify a corpus reader in a common module, you don't always need to access it right away. To speed up module import time when a corpus reader is defined, NLTK provides a LazyCorpusLoader class that can transform itself into your actual corpus reader as soon as you need it. This way, you can define a corpus reader in a common module without it slowing down module loading.

In [34]:

from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import WordListCorpusReader
# 'mywords.txt' is the file created earlier in corpora/cookbook
reader = LazyCorpusLoader('cookbook', WordListCorpusReader, ['mywords.txt'])


In [35]:

isinstance(reader, LazyCorpusLoader)


True

In [36]:

isinstance(reader, WordListCorpusReader)


False
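The transformation happens on first attribute access. The sketch below shows the before and after states using a throwaway corpus so that it does not depend on files created elsewhere. The temporary data directory and the corpus name `lazy_demo` are assumptions made for this example.

```python
import os
import tempfile

import nltk.data
from nltk.corpus.reader import WordListCorpusReader
from nltk.corpus.util import LazyCorpusLoader

# Hypothetical data directory containing corpora/lazy_demo/mywords.txt
data_dir = tempfile.mkdtemp()
nltk.data.path.append(data_dir)
corpus_dir = os.path.join(data_dir, 'corpora', 'lazy_demo')
os.makedirs(corpus_dir)
with open(os.path.join(corpus_dir, 'mywords.txt'), 'w', encoding='utf-8') as f:
    f.write('nltk\ncorpus\n')

lazy_reader = LazyCorpusLoader('lazy_demo', WordListCorpusReader, ['mywords.txt'])

# Nothing has been loaded yet, so this is still a LazyCorpusLoader
before = isinstance(lazy_reader, WordListCorpusReader)

# The first real attribute access triggers the load and swaps the class
loaded_words = lazy_reader.words()
after = isinstance(lazy_reader, WordListCorpusReader)
print(before, loaded_words, after)
```

Because the object replaces its own class, code that holds a reference to the loader does not need to be updated after the load.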


*** 
## ***<a id="custom-corpus-view"></a>Creating a custom corpus view:***
***


A **corpus view** is a class wrapper around a corpus file that reads in blocks of tokens as needed. Its purpose is to provide a view into a file without reading the whole file at once (since corpus files can often be quite large). If the corpus readers included by NLTK already meet all your needs, then you do not have to know anything about corpus views. But, if you have a custom file format that needs special handling, this recipe will show you how to create and use a custom corpus view. The main corpus view class is StreamBackedCorpusView, which opens a single file as a stream, and maintains an internal cache of blocks it has read.

We will start with the simple case of a plain text file with a heading that should be ignored by the corpus reader. Let's make a file called heading_text.txt that looks like this:

    A simple heading.

    Here is the actual text for the corpus.

    Paragraphs are split by blanklines.

    This is the 3rd paragraph.
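The notebook does not include the cell that writes this file, so here is a minimal sketch that creates it in the current directory, which is where the readers below look for it. The variable names are illustrative.

```python
from pathlib import Path

# Four blocks separated by blank lines: one heading and three paragraphs
heading_text = (
    'A simple heading.\n'
    '\n'
    'Here is the actual text for the corpus.\n'
    '\n'
    'Paragraphs are split by blanklines.\n'
    '\n'
    'This is the 3rd paragraph.\n'
)

heading_path = Path('heading_text.txt')
heading_path.write_text(heading_text, encoding='utf-8')

# Quick check of the blank-line structure, independent of NLTK
blocks = [block for block in heading_text.split('\n\n') if block.strip()]
print(len(blocks))
```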

Normally, we would use the `PlaintextCorpusReader` class, but by default it will treat `A simple heading.` as the first paragraph. To ignore this heading, we need to subclass the `PlaintextCorpusReader` class so we can override its `CorpusView` class variable with our own `StreamBackedCorpusView` subclass.

In [41]:

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class IgnoreHeadingCorpusView(StreamBackedCorpusView):
    def __init__(self, *args, **kwargs):
        StreamBackedCorpusView.__init__(self, *args, **kwargs)
        # open self._stream
        self._open()
        # skip the heading block
        self.read_block(self._stream)
        # reset the start position to the current position in the stream
        self._filepos = [self._stream.tell()]

class IgnoreHeadingCorpusReader(PlaintextCorpusReader):
    CorpusView = IgnoreHeadingCorpusView
    

In [42]:

from nltk.corpus.reader import PlaintextCorpusReader
plain = PlaintextCorpusReader('.', ['heading_text.txt'])
len(plain.paras())


2

In [43]:

reader = IgnoreHeadingCorpusReader('.', ['heading_text.txt'])
len(reader.paras())


1


***
## ***<a id="mongodb-corpus"></a>Creating a MongoDB-backed corpus reader:***
***


All the corpus readers we have dealt with so far have been file-based. That is in part due to the design of the CorpusReader base class, and also the assumption that most corpus data will be in text files. However, sometimes you will have a bunch of data stored in a database that you want to access and use just like a text file corpus. In this recipe, we will cover the case where you have documents in **MongoDB**, and you want to use a particular field of each document as your block of text.

MongoDB is a document-oriented database that has become a popular alternative to relational databases such as MySQL. To follow along, you will need access to a running MongoDB instance, and you will also need to install PyMongo, a Python driver for MongoDB.

In [45]:

!pip install pymongo



Defaulting to user installation because normal site-packages is not writeable
Collecting pymongo
  Downloading pymongo-10.10.10.10-cp311-cp311-win_amd64.whl.metadata (22 kB)
Downloading pymongo-10.10.10.10-cp311-cp311-win_amd64.whl (831 kB)
Installing collected packages: pymongo
Successfully installed pymongo-10.10.10.10


In [58]:

import pymongo
from nltk.data import load
from nltk.tokenize import TreebankWordTokenizer
from nltk.util import AbstractLazySequence, LazyMap, LazyConcatenation

class MongoDBLazySequence(AbstractLazySequence):
    
    def __init__(self, host='localhost', port=27017, db='test',  
        collection='corpus', field='text'):
        self.conn = pymongo.MongoClient(host, port)
        self.collection = self.conn[db][collection]
        self.field = field
        
    def __len__(self):
        # count_documents({}) counts every document in the collection
        return self.collection.count_documents({})
        
    def iterate_from(self, start):
        f = lambda d: d.get(self.field, '')
        # Fetch only the wanted field, skipping the first `start` documents
        return iter(LazyMap(f, self.collection.find({}, {self.field: 1}, skip=start)))

class MongoDBCorpusReader:
    def __init__(self, word_tokenizer=TreebankWordTokenizer(),
        sent_tokenizer=load('tokenizers/punkt/english.pickle'), **kwargs):
        self._seq = MongoDBLazySequence(**kwargs)
        self._word_tokenize = word_tokenizer.tokenize
        self._sent_tokenize = sent_tokenizer.tokenize
        
    def text(self):
        return self._seq
        
    def words(self):
        return LazyConcatenation(LazyMap(self._word_tokenize, self.text()))
        
    def sents(self):
        return LazyConcatenation(LazyMap(self._sent_tokenize, self.text()))


The `AbstractLazySequence` class is an abstract class that provides read-only, on-demand iteration. Subclasses must implement the `__len__()` and `iterate_from(start)` methods, while it provides the rest of the list and iterator emulation methods. By creating the `MongoDBLazySequence` subclass as our view, we can iterate over documents in the MongoDB collection on demand, without keeping all the documents in memory. The `LazyMap` class is a lazy version of Python's built-in `map()` function, and is used in `iterate_from()` to transform the document into the specific field that we're interested in. It's also a subclass of `AbstractLazySequence`.

The `MongoDBCorpusReader` class creates an internal instance of `MongoDBLazySequence` for iteration, then defines the word and sentence tokenization methods. The `text()` method simply returns the instance of `MongoDBLazySequence`, which results in a lazily evaluated list of each text field. The `words()` method uses `LazyMap` and `LazyConcatenation` to return a lazily evaluated list of all words, while the `sents()` method does the same for sentences. In the code above, the default `sent_tokenizer` is loaded with `nltk.data.load()` when the class is defined; wrapping it in `nltk.data.LazyLoader`, which is analogous to `LazyCorpusLoader`, would defer that load until first use. The `LazyConcatenation` class is also a subclass of `AbstractLazySequence`, and produces a flat list from a given list of lists (each list may also be lazy). In our case, we're concatenating the results of `LazyMap` to ensure we don't return nested lists.


All of the **parameters** are configurable. For example, if you had a db named `website`, with a collection named `comments`, whose documents had a field called `comment`, you could create a `MongoDBCorpusReader` instance as follows:

In [59]:

reader = MongoDBCorpusReader(db='website',collection='comments', field='comment')


You can also pass in custom instances for **word_tokenizer** and **sent_tokenizer**, as long as the objects implement the nltk.tokenize.TokenizerI interface by providing a tokenize(text) method.
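The lazy classes can be tried without a database. The sketch below swaps MongoDB for an in-memory list of dictionaries; the class `ListBackedLazySequence` and the sample documents are hypothetical stand-ins, and `str.split` plays the role of the word tokenizer.

```python
from nltk.util import AbstractLazySequence, LazyConcatenation, LazyMap


class ListBackedLazySequence(AbstractLazySequence):
    """Hypothetical stand-in for MongoDBLazySequence, backed by a list of dicts."""

    def __init__(self, documents, field='text'):
        self._documents = documents
        self._field = field

    def __len__(self):
        return len(self._documents)

    def iterate_from(self, start):
        # Yield the chosen field of each document, or '' when it is missing
        for doc in self._documents[start:]:
            yield doc.get(self._field, '')


documents = [
    {'comment': 'Great post.'},
    {'comment': 'Thanks for sharing this.'},
    {'other': 'no comment field'},
]
texts = ListBackedLazySequence(documents, field='comment')

# LazyMap tokenizes each text on demand; LazyConcatenation flattens the result
all_words = LazyConcatenation(LazyMap(str.split, texts))
print(len(texts), texts[1])
print(list(all_words))
```

Only `__len__()` and `iterate_from()` had to be written; indexing and iteration come from `AbstractLazySequence`.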


***       
## ***<a id="file-locking"></a>Corpus editing with file locking:***
***

Corpus readers and views are all read-only, but there will be times when you want to add to or edit the corpus files. However, modifying a corpus file while other processes are using it, such as through a corpus reader, can lead to dangerous undefined behavior. This is where file locking comes in handy.

When working with corpus editing, ensuring file integrity is crucial, especially in a multi-threaded or concurrent environment. Using file locking prevents multiple processes from modifying the file simultaneously, reducing data corruption risks.

In [63]:

import os

class LockedCorpusEditor:
    def __init__(self, file_path):
        self.file_path = file_path

    def edit_corpus(self, new_text):
        """Edit the corpus file without explicit locking."""
        with open(self.file_path, 'r+', encoding='utf-8') as file:
            # Read the original content
            original_content = file.read()
            print("Original Content:\n", original_content)

            # Move the pointer to the beginning and overwrite the file
            file.seek(0)
            file.write(new_text)
            file.truncate()  # Remove leftover text from previous content
            
            print("\nUpdated Corpus Successfully!")

# Example Usage
corpus_file = "sample_corpus.txt"

# Creating a sample corpus file
with open(corpus_file, 'w', encoding='utf-8') as f:
    f.write("This is the original corpus content.")

# Editing the corpus safely
editor = LockedCorpusEditor(corpus_file)
editor.edit_corpus("This is the updated corpus content.")

# Loading the edited corpus with NLTK
from nltk.corpus.reader import PlaintextCorpusReader
reader = PlaintextCorpusReader('.', [corpus_file])
print("\nFinal Corpus Words:", reader.words(corpus_file))


Original Content:
 This is the original corpus content.

Updated Corpus Successfully!

Final Corpus Words: ['This', 'is', 'the', 'updated', 'corpus', 'content', ...]
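The editor above does not take a lock itself. As one possible way to add locking using only the standard library, the sketch below uses a sibling lock file created with `O_CREAT | O_EXCL`, which fails if the file already exists. This is an advisory lock: it only protects against other code that follows the same convention. The names `corpus_lock` and `append_line`, the `.lock` suffix, and the timeout values are assumptions made for this example.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def corpus_lock(path, timeout=2.0, poll=0.05):
    """Hold an advisory lock on `path` via a sibling '<path>.lock' file."""
    lock_path = path + '.lock'
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Creation fails with FileExistsError if someone else holds the lock
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f'could not lock {path}')
            time.sleep(poll)
    try:
        yield
    finally:
        # Release the lock by closing and removing the lock file
        os.close(fd)
        os.remove(lock_path)


def append_line(path, line):
    """Append one line to the corpus file while holding the lock."""
    with corpus_lock(path):
        with open(path, 'a', encoding='utf-8') as f:
            f.write(line + '\n')


demo_path = os.path.join(tempfile.mkdtemp(), 'sample_corpus.txt')
append_line(demo_path, 'This is the original corpus content.')
append_line(demo_path, 'This line was added under a lock.')
with open(demo_path, encoding='utf-8') as f:
    demo_lines = f.read().splitlines()
print(demo_lines)
```

A lock file left behind by a crashed process would block later writers until it is removed, so production code would typically also record an owner or an expiry time.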
