![Alt Text](https://raw.githubusercontent.com/msfasha/307304-Data-Mining/main/20242/images/header.png)

<div style="display: flex; justify-content: flex-start; align-items: center;">
   <a href="https://colab.research.google.com/github/msfasha/307307-BI-Methods/blob/main/20242-NLP-LLM/lecture%20notes/Part%202%20-%20Introduction%20to%20Large%20Language%20Models/introduction_to_llm_python.ipynb" target="_parent"><img 
   src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

## The Perceptron

### Implement the Perceptron using the scikit-learn library

In [1]:
from sklearn.linear_model import Perceptron
import numpy as np

# Training data for AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

# Initialize and train Perceptron
model = Perceptron(max_iter=100, eta0=0.1, random_state=42)
model.fit(X, y)

# Results
print("Weights:", model.coef_)
print("Bias:", model.intercept_)
print("Predictions:", model.predict(X))

Weights: [[0.2 0.2]]
Bias: [-0.2]
Predictions: [0 0 0 1]
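
The predictions above can be reproduced by hand: compute the weighted sum `w · x + b` for each input and apply a hard threshold. The following is a minimal sketch in plain Python that hard-codes the weights and bias printed above. The helper name `predict` is only illustrative, and the sketch assumes that class 1 is assigned when the sum is strictly positive.

```python
# Weights and bias copied from the printed output above
weights = [0.2, 0.2]
bias = -0.2

def predict(x):
    """Weighted sum followed by a hard threshold (step function)."""
    output = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if output > 0 else 0

inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
predictions = [predict(x) for x in inputs]
print("Manual predictions:", predictions)
```

Only the input `[1, 1]` pushes the sum above zero, which is exactly the behaviour of an AND gate.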


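Note: In the scikit-learn Perceptron, the step function (also called the activation function) is a built-in hard threshold applied to the weighted sum `output = w · x + b`. The model predicts class 1 only when this sum is strictly positive, which is why the input `[0, 1]` (whose sum is 0 with the weights printed above) is classified as 0:

```Python
prediction = 1 if output > 0 else 0
```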

<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/msfasha/307307-BI-Methods/main/images/perceptron.png" alt="Simple Perceptron" width="500"/>
</div>

---

## The Multi-Layer Perceptron - MLP

### Solving the XOR Problem with a Neural Network

This code demonstrates how to build and train a small neural network with scikit-learn's `MLPClassifier` to learn the XOR logic gate, which a single perceptron cannot represent because XOR is not linearly separable.

In [2]:
from sklearn.neural_network import MLPClassifier
import numpy as np

# XOR input and output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Define MLP with 1 hidden layer of 2 neurons (minimal config for XOR)
mlp = MLPClassifier(hidden_layer_sizes=(2,), activation='tanh',
                    solver='adam', learning_rate_init=0.01,
                    max_iter=10000, random_state=42)


# Train the model
mlp.fit(X, y)

# Make predictions
predictions = mlp.predict(X)

print("Predictions:\n", predictions)
print("\nWeights (input to hidden):\n", "[ w11 , w12 ]\n[ w21 , w22 ]\n", mlp.coefs_[0])
print("\nBias hidden:\n", mlp.intercepts_[0])
print("\nWeights (hidden to output):\n", mlp.coefs_[1])
print("\nBias output:\n", mlp.intercepts_[1])


Predictions:
 [0 1 1 0]

Weights (input to hidden):
 [ w11 , w12 ]
[ w21 , w22 ]
 [[ 2.7144501   3.27401218]
 [-2.73418453 -3.17014048]]

Bias hidden:
 [ 1.21994174 -1.63451199]

Weights (hidden to output):
 [[-4.37775211]
 [ 4.46553876]]

Bias output:
 [3.61855675]
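
To see how these numbers produce the XOR predictions, the forward pass can be recomputed by hand. The sketch below hard-codes the parameters printed above and uses plain Python so that each step is visible. It assumes a `tanh` hidden layer (as configured above), a logistic (sigmoid) output unit, which `MLPClassifier` uses for binary classification, and a decision threshold of 0.5. The function name `forward` is illustrative.

```python
import math

# Parameters copied from the printed output above
W1 = [[2.7144501, 3.27401218],      # weights from input 1 to the two hidden neurons
      [-2.73418453, -3.17014048]]   # weights from input 2 to the two hidden neurons
b1 = [1.21994174, -1.63451199]      # hidden-layer biases
W2 = [-4.37775211, 4.46553876]      # weights from the hidden neurons to the output
b2 = 3.61855675                     # output bias

def forward(x):
    """Return the predicted probability of class 1 for a 2-element input."""
    # Hidden layer: tanh of the weighted sum for each hidden neuron
    hidden = [
        math.tanh(sum(x[i] * W1[i][j] for i in range(2)) + b1[j])
        for j in range(2)
    ]
    # Output layer: logistic (sigmoid) of the weighted sum
    z = sum(h * w for h, w in zip(hidden, W2)) + b2
    return 1.0 / (1.0 + math.exp(-z))

inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
probabilities = [forward(x) for x in inputs]
manual_predictions = [1 if p > 0.5 else 0 for p in probabilities]
print("Probabilities:", [round(p, 3) for p in probabilities])
print("Manual predictions:", manual_predictions)
```

The hidden layer maps the four inputs into a space where a single threshold on the output neuron separates the two classes.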


<div style="text-align: center;">
    <img src="https://raw.githubusercontent.com/msfasha/307307-BI-Methods/main/images/mlp.png" alt="Multi Layer Perceptron" width="600"/>
</div>

---

## Building Word Embeddings from Scratch
We can build word embeddings from scratch by training Word2Vec on a corpus of our own with the gensim library.

### Example 1 - Simple Tokenized Corpus

In [1]:
%pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp310-cp310-win_amd64.whl (24.0 MB)
Collecting smart-open>=1.8.1
  Using cached smart_open-7.1.0-py3-none-any.whl (61 kB)
Collecting numpy<2.0,>=1.18.5
  Downloading numpy-1.26.4-cp310-cp310-win_amd64.whl (15.8 MB)
Collecting scipy<1.14.0,>=1.7.0
  Downloading scipy-1.13.1-cp310-cp310-win_amd64.whl (46.2 MB)
Collecting wrapt
  Using cached wrapt-1.17.2-cp310-cp310-win_amd64.whl (38 kB)
Installing collected packages: wrapt, numpy, smart-open, scipy, gensim
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.5
    Uninstalling numpy-2.2.5:
      Successfully uninstalled numpy-2.2.5
  Attempting uninstall: scipy
    Found existing installation: scipy 1.15.2
    Uninstalling scipy-1.15.2:
      Successfully uninstalled scipy-1.15.2
Successfully installed gensim-4.3.3 numpy-1.26.4 scipy-1.13.1 smart-open-7.1.0 wrapt-1.17.2
Note: you may need to restart the kernel to use updated packages.




In [None]:
from gensim.models import Word2Vec

# Sample corpus
sentences = [
    ['data', 'science', 'is', 'fun'],
    ['machine', 'learning', 'is', 'powerful'],
    ['data', 'and', 'learning', 'are', 'related']
]

# Train the model
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=2)

# Access the embedding for a word
print("Vector for 'data':\n", model.wv['data'])

# Find similar words
print("Words similar to 'data':", model.wv.most_similar('data'))

Vector for 'data':
 [-0.01723938  0.00733148  0.01037977  0.01148388  0.01493384 -0.01233535
  0.00221123  0.01209456 -0.0056801  -0.01234705 -0.00082045 -0.0167379
 -0.01120002  0.01420908  0.00670508  0.01445134  0.01360049  0.01506148
 -0.00757831 -0.00112361  0.00469675 -0.00903806  0.01677746 -0.01971633
  0.01352928  0.00582883 -0.00986566  0.00879638 -0.00347915  0.01342277
  0.0199297  -0.00872489 -0.00119868 -0.01139127  0.00770164  0.00557325
  0.01378215  0.01220219  0.01907699  0.01854683  0.01579614 -0.01397901
 -0.01831173 -0.00071151 -0.00619968  0.01578863  0.01187715 -0.00309133
  0.00302193  0.00358008]
Words similar to 'data': [('are', 0.16563551127910614), ('fun', 0.13940520584583282), ('learning', 0.1267007291316986), ('powerful', 0.08872982114553452), ('is', 0.011071977205574512), ('and', -0.027849990874528885), ('science', -0.0372748002409935), ('related', -0.15515568852424622), ('machine', -0.2187294214963913)]
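
The `window` parameter controls which neighbouring words count as context for a given word. The sketch below lists the (center, context) pairs that a window of 2 yields for the first sentence of the corpus. It is a simplified illustration, not gensim's training loop, which differs in several details; the helper name `context_pairs` is hypothetical.

```python
def context_pairs(sentence, window):
    """Return (center, context) pairs within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

pairs = context_pairs(['data', 'science', 'is', 'fun'], window=2)
for center, context in pairs:
    print(center, "->", context)
```

With only three short sentences there are very few such pairs to learn from, which is likely why the similarity scores above are close to zero and should not be read as meaningful.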


### Example 2 - Simple Untokenized Corpus

The code below uses the gensim library to build word embeddings from scratch with a Word2Vec model.<br>
It tokenizes a raw text corpus with NLTK and then learns word similarities from it.

In [None]:
%pip install nltk

import nltk
nltk.download('punkt_tab')

Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package punkt to
[nltk_data]     C:/Users/me/AppData/Roaming/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Sample corpus
sentences = [
    "Large language models are transforming business applications",
    "Natural language processing helps computers understand human language",
    "Word embeddings capture semantic relationships between words",
    "Neural networks learn distributed representations of words",
    "Businesses use language models for various applications",
    "Customer service can be improved with language technology",
    "Modern language models require significant computing resources",
    "Language models can generate human-like text for businesses"
]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train Word2Vec model
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,    # Embedding dimension
    window=5,           # Context window size
    min_count=1,        # Minimum word frequency
    workers=4           # Number of threads
)

# Save the model
model.save("word2vec.model")

# Find the most similar words to "language"
similar_words = model.wv.most_similar("language", topn=5)
print("Words most similar to 'language':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

# Vector for a specific word
word_vector = model.wv["business"]
print(f"\nVector for 'business' (first 10 dimensions):\n{word_vector[:10]}")

# Word analogies
analogy_result = model.wv.most_similar(
    positive=["business", "language"],
    negative=["models"],
    topn=3
)
print("\nAnalogy results:")
for word, similarity in analogy_result:
    print(f"{word}: {similarity:.4f}")

Words most similar to 'language':
natural: 0.2196
between: 0.2167
resources: 0.1955
distributed: 0.1696
significant: 0.1522

Vector for 'business' (first 10 dimensions):
[ 0.00816812 -0.00444303  0.00898543  0.00825366 -0.00443522  0.00030311
  0.00427449 -0.00392632 -0.00555997 -0.00651232]

Analogy results:
neural: 0.2595
natural: 0.2004
resources: 0.1899


---

### Word Similarity Examples

#### Use Fake Embeddings

In [15]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Fake word vectors (3D for simplicity)
word_vectors = {
    "king": np.array([0.8, 0.65, 0.1]),
    "queen": np.array([0.78, 0.66, 0.12]),
    "man": np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.88, 0.12, 0.12]),
    "apple": np.array([0.1, 0.8, 0.9]),
}

def similarity(w1, w2):
    return cosine_similarity([word_vectors[w1]], [word_vectors[w2]])[0][0]

print("Similarity(king, queen):", similarity("king", "queen"))
print("Similarity(man, woman):", similarity("man", "woman"))
print("Similarity(king, apple):", similarity("king", "apple"))

Similarity(king, queen): 0.9995995265529728
Similarity(man, woman): 0.999399810286
Similarity(king, apple): 0.5514092058274782
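
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. The sketch below recomputes two of the values above without scikit-learn; the function name `cosine` is illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same toy vectors as in the cell above
king = [0.8, 0.65, 0.1]
queen = [0.78, 0.66, 0.12]
apple = [0.1, 0.8, 0.9]

sim_king_queen = cosine(king, queen)
sim_king_apple = cosine(king, apple)
print(f"cosine(king, queen) = {sim_king_queen:.4f}")
print(f"cosine(king, apple) = {sim_king_apple:.4f}")
```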


#### Use Real Embeddings - Gensim library

In [16]:
import gensim.downloader as api
from gensim.models import Word2Vec

# Load pre-trained Word2Vec model
word2vec_model = api.load("word2vec-google-news-300")

# Find similar words
similar_words = word2vec_model.most_similar('computer', topn=5)
print("Words similar to 'computer':", similar_words)

# Word analogies
result = word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print("king - man + woman =", result)

# Train your own Word2Vec model
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get vector for a word
cat_vector = model.wv['cat']
print("Vector for 'cat':", cat_vector[:5])  # Show first 5 dimensions

Words similar to 'computer': [('computers', 0.7979382276535034), ('laptop', 0.6640493869781494), ('laptop_computer', 0.6548869013786316), ('Computer', 0.647333562374115), ('com_puter', 0.6082078814506531)]
king - man + woman = [('queen', 0.7118191123008728)]
Vector for 'cat': [-0.00713902  0.00124103 -0.00717672 -0.00224462  0.0037193 ]
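
The analogy query `king - man + woman` is vector arithmetic followed by a nearest-neighbour search. The sketch below shows the idea on the toy 3-D vectors from the fake-embeddings example. It is a simplified illustration and not gensim's exact procedure, which works on normalized vectors over the full vocabulary; the names `vectors`, `target`, and `candidates` are illustrative.

```python
import math

# Toy 3-D vectors, same values as the fake embeddings used earlier
vectors = {
    "king": [0.8, 0.65, 0.1],
    "queen": [0.78, 0.66, 0.12],
    "man": [0.9, 0.1, 0.1],
    "woman": [0.88, 0.12, 0.12],
    "apple": [0.1, 0.8, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the remaining words by cosine similarity to the target vector,
# excluding the query words themselves
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print("king - man + woman is closest to:", best)
```

Excluding the query words matters: without that step the nearest vector to the result is often one of the inputs.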


#### Use Real Embeddings - spaCy library

Install spaCy and download the `en_core_web_md` model, which includes word vectors.

In [22]:
%pip install spacy
!python -m spacy download en_core_web_md

Note: you may need to restart the kernel to use updated packages.




Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')




In [23]:
import spacy
nlp = spacy.load("en_core_web_md")

word1 = nlp("king")
word2 = nlp("queen")
print("Similarity:", word1.similarity(word2))

Similarity: 0.38253095611315674


---

### Context-Aware Word Embeddings - BERT

In [4]:
%pip install transformers

Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl (10.4 MB)
Collecting huggingface-hub<1.0,>=0.30.0
  Downloading huggingface_hub-0.30.2-py3-none-any.whl (481 kB)
Collecting safetensors>=0.4.3
  Downloading safetensors-0.5.3-cp38-abi3-win_amd64.whl (308 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0.2-cp310-cp310-win_amd64.whl (161 kB)
Collecting tokenizers<0.22,>=0.21
  Downloading tokenizers-0.21.1-cp39-abi3-win_amd64.whl (2.4 MB)
Installing collected packages: pyyaml, huggingface-hub, tokenizers, safetensors, transformers
Successfully installed huggingface-hub-0.30.2 pyyaml-6.0.2 safetensors-0.5.3 tokenizers-0.21.1 transformers-4.51.3
Note: you may need to restart the kernel to use updated packages.




In [10]:
from transformers import BertTokenizer, BertModel
import torch
import torch.nn.functional as F

# Load pretrained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to extract contextual embedding for a word (handles subwords)
def get_token_embedding(sentence, target_word):
    inputs = tokenizer(sentence, return_tensors='pt')
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    embeddings = outputs.last_hidden_state.squeeze(0)

    # Tokenize the target word the same way BERT does
    target_tokens = tokenizer.tokenize(target_word)

    # Search for the position of the target word (handling subwords)
    matches = []
    for i in range(len(tokens) - len(target_tokens) + 1):
        if tokens[i:i + len(target_tokens)] == target_tokens:
            matches = list(range(i, i + len(target_tokens)))
            break

    if not matches:
        raise ValueError(f"'{target_word}' not found in tokens: {tokens}")

    # Average the embeddings over all subword tokens
    return embeddings[matches].mean(dim=0)

# Contextual sentences
sentence_fruit = "He ate a fresh apple and enjoyed the fruit."
sentence_company = "Apple released a new product in the computer market."
sentence_orange = "An orange is a juicy fruit."
sentence_microsoft = "Microsoft computer was running the latest software."

# Get embeddings
apple_fruit = get_token_embedding(sentence_fruit, "apple")
apple_company = get_token_embedding(sentence_company, "apple")
orange = get_token_embedding(sentence_orange, "orange")
microsoft = get_token_embedding(sentence_microsoft, "Microsoft")

# Cosine similarity comparisons
sim_fruit = F.cosine_similarity(apple_fruit, orange, dim=0)
sim_company = F.cosine_similarity(apple_company, microsoft, dim=0)

# Results
print(f"Similarity between 'apple' (fruit) and 'orange': {sim_fruit.item():.4f}")
print(f"Similarity between 'apple' (company) and 'Microsoft': {sim_company.item():.4f}")


Similarity between 'apple' (fruit) and 'orange': 0.5839
Similarity between 'apple' (company) and 'Microsoft': 0.8549
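
The subword handling in `get_token_embedding` has two steps: locate the run of tokens that spells the target word, then average the embeddings at those positions. The sketch below isolates that logic in plain Python. The token list and the 2-D embeddings are hypothetical values made up for illustration (not real BERT output), and the helper names `find_span` and `mean_vector` are not part of any library.

```python
def find_span(tokens, target_tokens):
    """Return the indices of the first occurrence of target_tokens in tokens."""
    n = len(target_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target_tokens:
            return list(range(i, i + n))
    return []

def mean_vector(rows):
    """Average a list of equal-length vectors component by component."""
    return [sum(col) / len(rows) for col in zip(*rows)]

# Hypothetical WordPiece-style tokens and 2-D embeddings, for illustration only
tokens = ["[CLS]", "play", "##ing", "chess", "[SEP]"]
embeddings = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [2.0, 2.0], [0.0, 0.0]]

span = find_span(tokens, ["play", "##ing"])
pooled = mean_vector([embeddings[i] for i in span])
print("Span:", span)
print("Pooled embedding:", pooled)
```

Averaging gives one vector per word regardless of how many pieces the tokenizer splits it into, so words of different lengths can still be compared with cosine similarity.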
