<a href="https://colab.research.google.com/github/pingstanton/DATA-78000-Large-Language-Models-and-Chat-GPT/blob/main/Word_Vectors_Lab_Stanton_for_78000.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Word Vectors Lab**
LLMs and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

**Public link to this Google Colab Notebook:
https://colab.research.google.com/drive/1B2Qy5AzfZEp_wF34yW4Z8S82lF6LtFbT**

**Matthew Stanton** | pingstanton@gmail.com | mstanton@gradcenter.cuny.edu | [Lab List on CUNY Academic Commons](https://pingstanton.commons.gc.cuny.edu/2023/09/21/labs-for-data-78000-large-language-models-and-chat-gpt/)

**Due:** September 25? Or next class after Yom?



---


**Large Language Models and Chat GPT**
*(Mondays 6:30p, Room 5417, CUNY Graduate Center, New York, NY)*

Instructor: Michelle McSweeney, [michelleamcsweeney.com](https://michelleamcsweeney.com)

Course Site: https://github.com/michellejm/LLMs-fall-23

Importing assigned **wordvectors-lab.ipynb** Jupyter workbook from:
https://github.com/michellejm/LLMs-fall-23/blob/main/week4-tokenization-word%20vectors/word2vec/wordvectors-lab.ipynb

---

This lab is based heavily on the [nltk documentation](https://www.nltk.org/api/nltk.lm.html)

Code annotations copied from OpenAI. (2023). ChatGPT (August 3 Version) [Large language model]. https://chat.openai.com

In [1]:
import re
import nltk
import multiprocessing
from gensim.models import Word2Vec

**Since you are using Google Colab, don't forget to add nltk.download('punkt')**

The line `nltk.download('punkt')` is a Python command that downloads the NLTK (Natural Language Toolkit) data package named "**punkt**." "Punkt" is a German word that means "point," "dot," or "period," and it is pronounced roughly "poongkt." In NLTK it is the name of the data package used for natural language processing tasks like tokenization.

In NLTK, "punkt" refers to the **Punkt sentence tokenizer**, a pre-trained model for detecting sentence boundaries. Tokenization is the process of splitting a text into smaller units (tokens), such as sentences or individual words.

The '**punkt**' package includes the data files and models that NLTK uses for these tasks: `nltk.sent_tokenize` uses it directly to break a paragraph of text into sentences, and `nltk.word_tokenize` relies on it to split text into sentences before breaking each sentence into words.

When you execute `nltk.download('punkt')`, it retrieves and installs this package from the NLTK data repository if you haven't already downloaded it. Downloading it is necessary before using NLTK's tokenization functions, as they rely on the data provided by this package.

Since you're using Google Colab rather than a local Jupyter Notebook installation, you will usually need to run this command the first time you use NLTK in a new session, because a fresh Colab runtime does not keep previously downloaded data. Once you've downloaded it, you don't need to do it again in the same session.
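One way to avoid repeating the download is to check for the data first. The sketch below is a minimal illustration, not part of the original lab. The helper name `ensure_punkt` is hypothetical, and the resource path `tokenizers/punkt` is an assumption based on the unzip path that NLTK prints when it downloads this package. Newer NLTK releases may look for a differently named resource (`punkt_tab`), so treat both names as tied to the installed version.

```python
import nltk


def ensure_punkt(resource: str = "tokenizers/punkt", package: str = "punkt") -> bool:
    """Download the tokenizer data only if NLTK cannot already find it."""
    try:
        nltk.data.find(resource)  # raises LookupError when the data is missing
        return True
    except LookupError:
        # nltk.download returns True on success and False otherwise
        return nltk.download(package, quiet=True)
```

Calling `ensure_punkt()` at the top of the notebook should be a no-op when the data is already present and a normal download otherwise.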

In [2]:
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In Python, **import requests** is used to import the requests library, which is a popular third-party library for making HTTP requests. The requests library simplifies the process of sending HTTP requests to web services, APIs, or websites and handling the responses.

In this case, since the data file is hosted on the chimaboo.com server, you need to fetch it with `requests` and assign its text to `file`, the variable Prof. McSweeney uses for the remainder of the token lab script...

In [3]:
import requests
result = requests.get('https://chimaboo.com/coursework/DATA78000/hunger_games.txt')
file = result.text
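The fetch above assumes the request succeeds. As a hedged sketch (not part of the original lab), the same download can be wrapped so that a failed request raises an error instead of silently storing an error page in `file`. The helper name `fetch_text` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# usage (needs network access):
# file = fetch_text("https://chimaboo.com/coursework/DATA78000/hunger_games.txt")
```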

In [4]:
# check
print(file[:100])

The Second Book of THE HUNGER GAMES 



New York Times Bestsel ling Author 

SUZHNNE 
COLLINS 



PA


...and now back to Prof. McSweeney's script as posted on GitHub:

In [5]:
# first, remove unwanted new line and tab characters from the text
for char in ["\n", "\r", "\d", "\t"]:
    file = file.replace(char, " ")

# check again
print(file[:100])



The Second Book of THE HUNGER GAMES     New York Times Bestsel ling Author   SUZHNNE  COLLINS     PA


The script above is intended to clean a text stored in the file variable by replacing certain characters with spaces.

```
for char in ["\n", "\r", "\d", "\t"]:
```

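This line starts a loop that iterates over a list of strings: "\n" (newline), "\r" (carriage return), "\d", and "\t" (tab). Note that "\d" is not a Python string escape: Python keeps it as the two literal characters backslash and "d" (recent Python versions also warn about the invalid escape). It only means "any digit" inside a regular expression, and `str.replace` does not use regular expressions, so this entry matches nothing in ordinary text and leaves the digits (and the letter "d") untouched.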

```
file = file.replace(char, " ")
```

For each character in the list, this line replaces all occurrences of that character in the file string with a space character " ". It assigns the modified string back to the file variable.
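To make the behavior of the backslash-d entry concrete, here is a small standalone sketch. It uses a made-up `sample` string rather than the real book text, and the digit-removal step at the end is only a guess at what that entry may have been meant to do; the lab script itself does not remove digits.

```python
import re

# made-up sample with a newline, a carriage return, a tab, and some digits
sample = 'PART I\n"THE SPARK"\r\n2 | P a g e\tCatching Fire 2009'

# r"\d" is the two-character string backslash + "d"; str.replace treats it literally
literal = r"\d"
print(len(literal))                            # 2
print(sample.replace(literal, " ") == sample)  # True: nothing matched

# collapse every run of whitespace (newlines, tabs, spaces) into one space
cleaned = re.sub(r"\s+", " ", sample).strip()
print(cleaned)

# hypothetical extra step: strip digits with a real regular expression
no_digits = re.sub(r"\d+", "", cleaned)
print(no_digits)
```

Unlike the loop, `re.sub(r"\s+", " ", ...)` does not leave runs of several spaces behind, which is why the cleaned sample has single spaces between words.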

In [6]:
# this is simplified for demonstration
def sample_clean_text(text: str):
    # step 1: tokenize the text into sentences
    sentences = nltk.sent_tokenize(text)

    # step 2: tokenize each sentence into words
    tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

    # step 3: convert each word to lowercase
    tokenized_text = [[word.lower() for word in sent] for sent in tokenized_sentences]

    # return your tokens
    return tokenized_text

# call the function
tokens = sample_clean_text(text = file)

# check
print(tokens[:50])

[['the', 'second', 'book', 'of', 'the', 'hunger', 'games', 'new', 'york', 'times', 'bestsel', 'ling', 'author', 'suzhnne', 'collins', 'parti', '``', 'the', 'spark', "''", '2', '|', 'p', 'a', 'g', 'e', 'catching', 'fire', '-', 'suzanne', 'collins', 'i', 'clasp', 'the', 'flask', 'between', 'my', 'hands', 'even', 'though', 'the', 'warmth', 'from', 'the', 'tea', 'has', 'long', 'since', 'leached', 'into', 'the', 'frozen', 'air', '.'], ['my', 'muscles', 'are', 'clenched', 'tight', 'against', 'the', 'cold', '.'], ['if', 'a', 'pack', 'of', 'wild', 'dogs', 'were', 'to', 'appear', 'at', 'this', 'moment', ',', 'the', 'odds', 'of', 'scaling', 'a', 'tree', 'before', 'they', 'attacked', 'are', 'not', 'in', 'my', 'favor', '.'], ['i', 'should', 'get', 'up', ',', 'move', 'around', ',', 'and', 'work', 'the', 'stiffness', 'from', 'my', 'limbs', '.'], ['but', 'instead', 'i', 'sit', ',', 'as', 'motionless', 'as', 'the', 'rock', 'beneath', 'me', ',', 'while', 'the', 'dawn', 'begins', 'to', 'lighten', 'the',


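The function above, `sample_clean_text`, turns the cleaned text in the file variable into lists of lowercase tokens. Here's what each part of the function does:

**Step 1:** `nltk.sent_tokenize(text)` splits the text into a list of sentences using the Punkt model downloaded earlier.

**Step 2:** `nltk.word_tokenize(sent)` splits each sentence into word and punctuation tokens, giving one list of tokens per sentence.

**Step 3:** `word.lower()` lowercases every token, so that "The" and "the" are counted as the same word.

The result, stored in tokens, is a list of sentences in which each sentence is a list of tokens. Note that punctuation marks and page-header residue (such as '|', 'p', 'a', 'g', 'e') are still present in the output.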
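The tokens printed by `sample_clean_text` still include quote marks, punctuation, and digits. As a minimal sketch of one possible follow-up step (not part of the original lab), the snippet below keeps only purely alphabetic tokens. The name `keep_alpha` is hypothetical, and `tokens_sample` is a short excerpt shaped like the printed output rather than the full token list.

```python
# a few token lists shaped like the output of sample_clean_text (abridged)
tokens_sample = [
    ["parti", "``", "the", "spark", "''", "2", "|", "p", "a", "g", "e"],
    ["my", "muscles", "are", "clenched", "tight", "against", "the", "cold", "."],
]


def keep_alpha(sentences):
    """Drop tokens that are not purely alphabetic (punctuation, digits, quote marks)."""
    return [[tok for tok in sent if tok.isalpha()] for sent in sentences]


filtered = keep_alpha(tokens_sample)
print(filtered[0])  # ['parti', 'the', 'spark', 'p', 'a', 'g', 'e']
print(filtered[1])  # ['my', 'muscles', 'are', 'clenched', 'tight', 'against', 'the', 'cold']
```

Single-letter residue such as 'p', 'a', 'g', 'e' survives this filter because those tokens are alphabetic; removing them would need a separate rule, and a length filter would also discard real words like "a" and "i".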







In [7]:
model = Word2Vec(tokens,vector_size=50)
model.wv["the"]

array([-0.19487983,  0.2700533 , -1.5583338 ,  0.36597925, -0.3371598 ,
       -0.9551392 ,  0.9198467 ,  0.5818802 , -0.71432555, -0.13867426,
        0.79693824, -1.4810411 ,  0.15073562,  0.8626713 , -0.14752384,
        0.70137936,  1.3576615 , -0.29934433, -0.80562323, -0.74933475,
        0.5721515 , -0.12363323,  0.04184139,  0.15163137,  0.7772465 ,
        0.18112125,  0.00838861,  0.6172922 , -0.6277335 ,  0.7451277 ,
        0.7468352 ,  0.29427582, -0.4461629 ,  0.05755919, -0.89721066,
       -0.00350799,  1.0337962 ,  0.5536823 , -0.08614762, -0.10726095,
       -0.18539508,  0.3544069 ,  0.5748329 , -0.0603789 , -0.14828049,
        0.4158328 ,  0.00622626, -0.50676584, -0.72197056,  0.65061384],
      dtype=float32)

The code above creates a **Word2Vec model** using the Gensim library for natural language processing, then looks up one of its vectors. The model is created with two arguments:

**tokens:** The training corpus, a list of sentences in which each sentence is itself a list of word tokens (here, the output of `sample_clean_text`). The Word2Vec model learns word embeddings (vector representations) from these tokens.

**vector_size:** The dimensionality of the word vectors that the model will produce, here 50. Gensim's default is 100, and values from 100 to 300 are common choices for larger corpora.

After creating the Word2Vec model, you can use it to perform various natural language processing tasks, such as finding similar words, word analogies, or generating word embeddings for downstream machine learning tasks like text classification or sentiment analysis. The model will have learned vector representations for the words in the input data that capture semantic relationships between words based on their co-occurrence patterns in the training data.

Now, taking both lines together...

```
model = Word2Vec(tokens,vector_size=50)
model.wv["the"]
```

**model = Word2Vec(tokens, vector_size=50):** This line creates a Word2Vec model using the Gensim library. It takes a list of sentences or documents represented as lists of words or tokens (tokens) and trains a Word2Vec model with word vectors of 50 dimensions (vector_size=50).

**model.wv["the"]:** After creating the Word2Vec model, this line accesses the word vector for the word "the." Specifically, it uses the wv attribute of the Word2Vec model to access the word vectors, and ["the"] is used to specify the word for which you want to retrieve the vector.

So, model.wv["the"] retrieves the word vector for the word "the" as learned by the Word2Vec model. This vector is a numerical representation of the word's meaning in the context of the training data. The resulting vector is a 1D NumPy array with 50 elements since we specified a vector size of 50 when creating the model.
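The individual numbers in a word vector are hard to interpret on their own; vectors are usually compared with each other, most often by cosine similarity. The sketch below computes it with NumPy on made-up 3-dimensional vectors (`a`, `b`, and `c` are hypothetical, not taken from the model). Gensim also exposes this comparison directly as `model.wv.similarity(word1, word2)`.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 same direction, 0.0 orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# hypothetical 3-dimensional vectors, for illustration only
a = np.array([1.0, 2.0, 0.0], dtype=np.float32)
b = np.array([2.0, 4.0, 0.0], dtype=np.float32)  # same direction as a
c = np.array([0.0, 0.0, 3.0], dtype=np.float32)  # orthogonal to a

print(round(cosine_similarity(a, b), 3))  # 1.0
print(round(cosine_similarity(a, c), 3))  # 0.0
```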

You can use these word vectors for various natural language processing tasks, such as finding similar words, word analogies, or as input features for machine learning models.


In [8]:
model = Word2Vec(tokens,vector_size=50)
model.wv["run"]

array([ 0.00828638,  0.12144296, -0.16573092,  0.10531649,  0.0018127 ,
       -0.30760288,  0.5414007 ,  0.48497337, -0.34711483, -0.14518443,
        0.11741909, -0.67270654,  0.18725881,  0.26036954, -0.08415695,
        0.30421323,  0.1973696 ,  0.10552846, -0.526346  , -0.26913697,
        0.02794474,  0.26113355,  0.4140011 , -0.02001527,  0.23870245,
        0.05185207, -0.14531691,  0.09307087, -0.37051168,  0.1734845 ,
        0.10637498, -0.03884476, -0.25154987, -0.18385673, -0.39669213,
        0.21242139,  0.35123512,  0.07836144,  0.02506723, -0.03897581,
        0.27363548, -0.02549281, -0.11196419, -0.0692548 ,  0.43600735,
        0.23855945, -0.15036625, -0.2789507 ,  0.13013215,  0.19248243],
      dtype=float32)

In [10]:
model = Word2Vec(tokens,vector_size=50)
model.wv["katniss"]

array([ 0.03973578,  0.08627009,  0.27648053,  0.10672029, -0.03928645,
       -0.26195595,  0.67079884,  0.7193839 , -0.34723112, -0.3836738 ,
       -0.00602278, -0.6395669 ,  0.22696626,  0.2341876 , -0.20746003,
        0.5408073 , -0.00947533,  0.24765918, -0.6678093 , -0.324585  ,
       -0.01951198,  0.52252084,  0.7232437 , -0.22652604,  0.14861163,
        0.06890932, -0.34551346, -0.10802484, -0.48439983,  0.05078671,
        0.04133791, -0.22455572, -0.25076136, -0.1977682 , -0.3584602 ,
        0.24895051,  0.3140884 ,  0.00397272,  0.0087159 , -0.05739469,
        0.6595537 , -0.26881585, -0.3792112 , -0.06300899,  0.74578094,
        0.19023395, -0.05808637, -0.50393695,  0.41929024,  0.16718782],
      dtype=float32)

In [11]:
model = Word2Vec(tokens,vector_size=50)
model.wv["snow"]

array([-2.2528978e-02,  1.8991829e-01, -4.2667651e-01,  2.8951898e-01,
       -7.2996458e-04, -5.8234596e-01,  8.8482010e-01,  7.8967333e-01,
       -5.9726334e-01, -9.6648030e-02,  2.8540772e-01, -1.2258208e+00,
        4.0395173e-01,  4.5815641e-01, -1.7876726e-01,  4.9953991e-01,
        5.4074347e-01,  1.2455404e-01, -8.2913208e-01, -4.9192691e-01,
        7.7119984e-02,  4.0942207e-01,  5.3532690e-01, -4.8470806e-02,
        3.6675662e-01,  1.1739686e-01, -2.8019133e-01,  1.8026008e-01,
       -5.5397397e-01,  2.2549400e-01,  1.2846416e-01,  7.3842322e-03,
       -4.8193315e-01, -1.5609476e-01, -6.4691317e-01,  2.1228524e-01,
        7.4888295e-01,  3.1401318e-01, -4.9478617e-02,  5.8001623e-02,
        2.9869145e-01, -1.0381807e-01,  2.1671252e-02, -7.7230014e-02,
        6.9380474e-01,  3.6063766e-01, -2.2480157e-01, -4.5635536e-01,
        1.9147922e-01,  4.3084514e-01], dtype=float32)