### Gutenberg plot analysis - word2vec using tokens - words

In [1]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [2]:
# Download NLTK data (run this once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/h6x/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/h6x/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/h6x/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#### Loading the data

In [5]:
# List of text files (replace with actual file paths)
base_path = "/Users/h6x/ORNL/git/learning/natural language processing/CS-524/project_1/data"
file_paths = [base_path + "/Great_short_stories_V1.txt", base_path + "/The_Memoirs_of_Sherlock_Holmes.txt", base_path + "/The_Return_of_Sherlock_Holmes.txt"]

In [6]:
book_contents=[]

# Read the contents of each book
for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        book_contents.append(file.read())

In [8]:
# Display roughly the first 200 characters (the slice skips index 0)
book_contents[0][1:200]

'The Project Gutenberg eBook of Great short stories, Volume I (of 3)\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost'
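The preview shows that each file opens with Project Gutenberg's licence header, which otherwise ends up in the training tokens. Below is a minimal sketch for trimming it, assuming the files use the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` marker lines. The helper name `strip_gutenberg_boilerplate` and the sample text are hypothetical and not part of the original notebook.

```python
import re

# Assumed marker format; some Project Gutenberg files may word these lines differently.
START_RE = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

# Made-up sample that mimics the layout of a Project Gutenberg file.
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "It was a dark night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Applied to each entry of `book_contents` before cleaning, this would keep words such as "gutenberg", "license", and "ebook" out of the corpus, provided the markers match.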

#### Data Preprocessing

In [9]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

[nltk_data] Downloading package punkt to /Users/h6x/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/h6x/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/h6x/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/h6x/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /Users/h6x/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [10]:
import re

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [11]:
def clean_text(text):
    """Lowercase, strip non-letters, tokenize, remove stop words, and lemmatize."""
    lemmatizer = WordNetLemmatizer()
    sub_pattern = r'[^A-Za-z]'
    # Stop words: NLTK's English list plus a few extras, stored as a set for fast lookup.
    # Entries containing an apostrophe (e.g. "i'm") cannot match, because apostrophes
    # are replaced by spaces before tokenization.
    stop_words = set(stopwords.words('english') + ['never','ever','couldnot','wouldnot','could','would','us',"i'm","you'd"])
    lower_book = text.lower()                                      # Convert all words to lower case.
    filtered_book = re.sub(sub_pattern, ' ', lower_book).strip()   # Replace every non-letter character with a space.
    filtered_book = word_tokenize(filtered_book)                   # Tokenize the whole text into a list of words.
    # Drop stop words, then lemmatize each remaining word using its POS tag.
    filtered_book = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in filtered_book if word not in stop_words]
    return filtered_book
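Because `clean_text` replaces every non-letter with a space before tokenizing, contractions are split into fragments before the stop-word check runs. The sketch below illustrates this with only the `re` module. It uses a hypothetical miniature stop-word set rather than NLTK's list, and a plain whitespace split as a stand-in for `word_tokenize`.

```python
import re

def normalize(text):
    """Lowercase and replace every non-letter with a space, as clean_text does."""
    return re.sub(r"[^A-Za-z]", " ", text.lower()).strip()

# Hypothetical miniature stop-word set, for illustration only.
stop_words = {"i'm", "you'd", "could"}

tokens = normalize("I'm sure you'd say he could go.").split()
kept = [t for t in tokens if t not in stop_words]

print(tokens)  # the contractions are already split into fragments here
print(kept)    # 'could' is removed; "i'm" and "you'd" never had a chance to match
```

In the notebook itself, fragments such as `'m'` and `'d'` are only removed if the stop-word list contains them as separate entries, so it may be worth checking the list for them.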

In [12]:
len(book_contents)

3

In [13]:
cleaned_books_contents=[]
for book in book_contents:
    cleaned_books_contents.append(clean_text(book))
cleaned_books_contents[0][1:100]

['gutenberg',
 'ebook',
 'great',
 'short',
 'story',
 'volume',
 'ebook',
 'use',
 'anyone',
 'anywhere',
 'united',
 'state',
 'part',
 'world',
 'cost',
 'almost',
 'restriction',
 'whatsoever',
 'may',
 'copy',
 'give',
 'away',
 'use',
 'term',
 'project',
 'gutenberg',
 'license',
 'include',
 'ebook',
 'online',
 'www',
 'gutenberg',
 'org',
 'locate',
 'united',
 'state',
 'check',
 'law',
 'country',
 'locate',
 'use',
 'ebook',
 'title',
 'great',
 'short',
 'story',
 'volume',
 'detective',
 'story',
 'author',
 'various',
 'editor',
 'william',
 'patten',
 'release',
 'date',
 'october',
 'ebook',
 'language',
 'english',
 'original',
 'publication',
 'new',
 'york',
 'p',
 'f',
 'collier',
 'credit',
 'al',
 'haines',
 'start',
 'project',
 'gutenberg',
 'ebook',
 'great',
 'short',
 'story',
 'volume',
 'frontispiece',
 'robert',
 'louis',
 'stevenson',
 'great',
 'short',
 'story',
 'edit',
 'william',
 'patten',
 'new',
 'collection',
 'famous',
 'example',
 'literature

In [17]:
# Combine tokenized texts from all novels
corpus_tokens = []
for book in cleaned_books_contents:
    corpus_tokens += book

In [18]:
len(corpus_tokens)

165975

In [19]:
for book in cleaned_books_contents:
    print(len(book))

70810
43196
51969
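
The lengths above count every token occurrence, not distinct words. Here is a small sketch of separating the two with `collections.Counter`, using a made-up toy list in place of an entry of `cleaned_books_contents`:

```python
from collections import Counter

# Hypothetical stand-in for one cleaned book (a list of lemmatized tokens).
tokens = ["holmes", "crime", "holmes", "watson", "crime", "holmes"]

counts = Counter(tokens)
print(len(tokens))            # total number of tokens
print(len(counts))            # number of distinct words
print(counts.most_common(2))  # the most frequent words with their counts
```

The distinct-word count is the size of the vocabulary Word2Vec would build with `min_count=1`.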


The tokens above are not unique words: each count is the total number of tokens in a book, and `corpus_tokens` is the concatenation of all three books.

Combining the tokens from the three books into a single list means that common words (like "holmes" or "crime") appear many times. This repetition is expected and beneficial when training word embeddings such as Word2Vec, because the frequency and context in which words appear across documents help the model learn better representations.

To preserve context at the sentence or paragraph level, feed sentences or paragraphs into the Word2Vec model instead of a single flat list of tokens. This keeps each context window inside a natural unit of text while still training on all three books.

Sentence-level structure can be kept while training on all three books as follows:

# Updated: split the text into sentences, then clean and tokenize each sentence
from nltk.tokenize import sent_tokenize

def tokenize_sentences(text):
    sentences = sent_tokenize(text)
    # Reuse clean_text (defined earlier) for per-sentence normalization and tokenization
    tokenized = [clean_text(sentence) for sentence in sentences]
    return [tokens for tokens in tokenized if tokens]  # drop sentences left empty after cleaning

# Process all files and keep the sentence-level structure
corpus_sentences = []
for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    corpus_sentences.extend(tokenize_sentences(text))

print(corpus_sentences[:2])  # View the first two tokenized sentences


### Training the Word2Vec Model

In [35]:
import gensim
from gensim.models import Word2Vec

In [36]:
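# Note: [corpus_tokens] is a list holding ONE very long "sentence". According to the
# gensim documentation, the optimized training code truncates each text to 10,000 words,
# so most of the 165,975 tokens are not used for training with this call.
model = Word2Vec(sentences=[corpus_tokens], vector_size=100, window=5, min_count=1, workers=4)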

In [39]:
model.wv.most_similar("holmes")

[('chevaux', 0.365239679813385),
 ('incisive', 0.33843156695365906),
 ('elegance', 0.33543017506599426),
 ('menially', 0.32503029704093933),
 ('ornamental', 0.32200267910957336),
 ('laughter', 0.3184606432914734),
 ('cramp', 0.31417787075042725),
 ('alert', 0.3056217432022095),
 ('frenzy', 0.303756445646286),
 ('banker', 0.30292779207229614)]

In [40]:
model.wv.most_similar("crime")

[('cloth', 0.35959240794181824),
 ('consequently', 0.35554665327072144),
 ('impends', 0.3551435172557831),
 ('springfield', 0.34750816226005554),
 ('trice', 0.3390079140663147),
 ('connivance', 0.3385235071182251),
 ('apiece', 0.3337108790874481),
 ('topped', 0.3325099050998688),
 ('repetition', 0.3324532210826874),
 ('worn', 0.3292587697505951)]

In [41]:
# Example: Get the vector for a word
word_vector = model.wv['holmes']
print(word_vector)

[-8.2281232e-03  9.3540605e-03 -1.7690062e-04 -1.9461397e-03
  4.5518545e-03 -4.2219125e-03  2.7862266e-03  7.0896945e-03
  5.9815729e-03 -7.5069158e-03  9.3357479e-03  4.6285745e-03
  3.9950162e-03 -6.2602474e-03  8.4546637e-03 -2.1937068e-03
  8.8001331e-03 -5.4028146e-03 -8.1101870e-03  6.7361891e-03
  1.6690858e-03 -2.1887308e-03  9.4936537e-03  9.4660828e-03
 -9.7360807e-03  2.5700384e-03  6.1281300e-03  3.8645645e-03
  1.9120282e-03  4.0807499e-04  6.6280772e-04 -3.8195455e-03
 -7.1659558e-03 -2.0956169e-03  3.9411574e-03  8.8720527e-03
  9.2263035e-03 -5.9806285e-03 -9.3981372e-03  9.6887825e-03
  3.4136425e-03  5.1251608e-03  6.2787551e-03 -2.8037857e-03
  7.3143332e-03  2.7537565e-03  2.7743480e-03 -2.4296937e-03
 -3.1288890e-03 -2.3007975e-03  4.3372861e-03  2.8376720e-05
 -9.6085146e-03 -9.7574247e-03 -6.1550057e-03 -1.3703447e-04
  2.0731681e-03  9.3621993e-03  5.5806101e-03 -4.2989640e-03
  1.7525473e-04  5.0215465e-03  7.6398398e-03 -1.1878897e-03
  4.2862515e-03 -5.73815

In [42]:
len(word_vector)

100
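
One caveat with passing `[corpus_tokens]` as a single sentence: the gensim documentation notes that its optimized routines truncate any individual text longer than 10,000 words. That may be one reason the neighbours above look unrelated, although this notebook does not verify it. If a flat token list has to be used, a possible workaround is to split it into chunks below that limit. The helper name `chunk_tokens` is hypothetical, and the toy list stands in for `corpus_tokens`.

```python
def chunk_tokens(tokens, chunk_size=10_000):
    """Split a flat token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Toy stand-in for corpus_tokens so the example runs without the notebook's data.
toy_tokens = [f"w{i}" for i in range(25_000)]
chunks = chunk_tokens(toy_tokens)

print(len(chunks))               # number of pseudo-sentences
print([len(c) for c in chunks])  # size of each chunk

# In the notebook this could be passed to gensim as, for example:
# model = Word2Vec(sentences=chunk_tokens(corpus_tokens), vector_size=100,
#                  window=5, min_count=1, workers=4)
```

Chunks still ignore sentence boundaries, so this only addresses truncation. Sentence-level input remains the cleaner option.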

When training a **Word2Vec** model, the way you feed the corpus (either as a flat list of word tokens or as sentences that are tokenized) significantly impacts the training process and the quality of the word embeddings. Let’s break down the differences between feeding the entire corpus as a list of word tokens versus feeding it as tokenized sentences.

### 1. Feeding the Entire Corpus as a List of Word Tokens

When you provide the entire corpus as a **flat list of word tokens**, you're essentially ignoring sentence boundaries. The model sees one continuous stream of words. For example:

```python
corpus_tokens = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'It', 'was', 'a', 'bright', 'cold', 'day', 'in', 'April', ...]
model = Word2Vec(sentences=[corpus_tokens], vector_size=100, window=5)
```

#### How It Affects Training:
- **Context is Unrestricted**: The sliding window will treat words across sentences as being in the same context. For example, if the last word of one sentence and the first word of the next sentence fall within the same window, they will be treated as context for each other, even though they might not be semantically related.

- **Loss of Sentence Structure**: With no sentence boundaries, the model cannot distinguish word pairs that co-occur within a sentence from pairs that merely sit on either side of a boundary.

- **Quality of Embeddings**: Ignoring sentence boundaries could negatively impact the quality of word embeddings, as words might be associated with irrelevant contexts, especially in cases where sentences shift meaning drastically. 

#### Example:
Suppose you have two sentences:
- **Sentence 1**: "The cat sat on the mat."
- **Sentence 2**: "The dog barked loudly."

If you provide the entire corpus as a list of tokens, Word2Vec may incorrectly associate words like `"mat"` with `"barked"` or `"dog"` with `"sat"`, because the model does not know where one sentence ends and the other begins. This can degrade the semantic quality of the embeddings.
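
The effect can be sketched without gensim by listing the context words that fall inside a fixed window around `"mat"` in the flat stream. This is a simplification: the helper `context_words` is hypothetical, and gensim itself varies the effective window size per target word up to `window`.

```python
def context_words(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding the word itself."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

sentence_1 = ["the", "cat", "sat", "on", "the", "mat"]
sentence_2 = ["the", "dog", "barked", "loudly"]

flat = sentence_1 + sentence_2  # one continuous stream with no boundary marker
mat_index = flat.index("mat")
print(context_words(flat, mat_index, window=5))
# The context of "mat" now includes "dog", "barked", and "loudly" from the second sentence.
```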

### 2. Feeding the Corpus as Tokenized Sentences

When you feed the corpus as **separate sentences that are tokenized** into words, Word2Vec maintains the structure of each sentence and trains on word-context pairs only within the boundaries of that sentence. For example:

```python
corpus_sentences = [
    ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'],
    ['It', 'was', 'a', 'bright', 'cold', 'day', 'in', 'April'],
    ...
]
model = Word2Vec(sentences=corpus_sentences, vector_size=100, window=5)
```

#### How It Affects Training:
- **Preserves Sentence Boundaries**: The sliding window now operates only within each sentence, so words at the end of one sentence are never used as context for words at the start of the next.

- **More Relevant Context**: The model will learn word-context pairs that are limited to meaningful sentence-based relationships. For example, it will associate `"cat"` with `"mat"` and `"dog"` with `"barked"`, but will not mix these two sentences together.

- **Improved Embeddings**: This results in better word embeddings because the model is able to maintain a clearer distinction between different contexts, preserving semantic relevance within each sentence. Words that commonly appear together within sentences will have embeddings that capture their co-occurrence more effectively.

#### Example:
If you have the same two sentences:
- **Sentence 1**: "The cat sat on the mat."
- **Sentence 2**: "The dog barked loudly."

When the model is fed sentence by sentence, these two sentences never produce training pairs such as `"cat"` with `"dog"` or `"sat"` with `"barked"`. The embeddings are therefore shaped only by within-sentence co-occurrence, which usually reflects each word's context more faithfully.
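
To make the difference concrete, the following sketch collects (center, context) pairs for the two example sentences, once per sentence and once as a single stream, and then isolates the pairs that exist only because the boundary was ignored. The function `training_pairs` is a hypothetical fixed-window approximation, not gensim's actual sampling procedure.

```python
def training_pairs(sentences, window):
    """Collect (center, context) pairs, never looking outside the current sentence."""
    pairs = set()
    for sentence in sentences:
        for i, center in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.add((center, sentence[j]))
    return pairs

sentence_1 = ["the", "cat", "sat", "on", "the", "mat"]
sentence_2 = ["the", "dog", "barked", "loudly"]

by_sentence = training_pairs([sentence_1, sentence_2], window=5)
as_one_stream = training_pairs([sentence_1 + sentence_2], window=5)

# Pairs that exist only because the sentence boundary was ignored.
cross_boundary = as_one_stream - by_sentence

print(("cat", "mat") in by_sentence)        # True
print(("mat", "barked") in by_sentence)     # False
print(("mat", "barked") in cross_boundary)  # True
```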

### Summary of Differences:

| **Aspect**                | **Flat List of Word Tokens**                          | **Tokenized Sentences**                      |
|---------------------------|------------------------------------------------------|----------------------------------------------|
| **Context Window**         | Spans across sentence boundaries                    | Limited to within each sentence             |
| **Sentence Structure**     | Ignored                                             | Preserved                                   |
| **Word Relationships**     | Words across sentences may be incorrectly related   | Only words within the same sentence are related |
| **Embedding Quality**      | Potentially lower due to irrelevant word contexts   | Higher quality as word relationships are more meaningful |
| **Training Data Format**   | Single list of tokens                               | List of lists (each sublist is a sentence)   |

### When to Use Each Approach:

- **Flat List of Word Tokens**: This approach might be useful in cases where the corpus consists of **very short sentences** or where sentence boundaries are not important (e.g., certain unsupervised tasks). However, it is not recommended for natural language understanding tasks like predicting relationships between words, as context is lost.

- **Tokenized Sentences**: This is the preferred approach for **most text analysis tasks**, especially when the goal is to understand word relationships within the context of a sentence (like for word embeddings, language modeling, etc.). It ensures that the embeddings capture the correct semantic relationships between words based on how they appear in the sentence.

To conclude, feeding the model **tokenized sentences** generally gives better results in terms of learning word relationships and producing higher-quality embeddings, because it respects sentence structure and keeps contexts meaningful.