**Logistic Regression**

**Mechanics of Logistic Regression**

Logistic Regression is a popular statistical method used for binary classification. It models the probability that an input **X** belongs to a particular category (usually 0 or 1).

**Linear Combination of Features**

z = W^T · X + b

**Sigmoid**

Maps the linear value z to a number between 0 and 1.

Sigmoid(z) = σ(z) = 1 / (1 + e^(-z))

The output of the sigmoid function is interpreted as the probability of the input belonging to the positive class (usually labeled as 1). If σ(z) is greater than 0.5, the input is classified as 1, otherwise as 0.
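
As a small illustration of the sigmoid and the 0.5 decision threshold, here is a minimal sketch. The input values in `z` are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative linear scores (assumed values, not from a trained model)
z = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(z)

# Classify as 1 only when the probability is strictly greater than 0.5
labels = (probs > 0.5).astype(int)

print(probs)
print(labels)  # [0 0 1]
```

Note that z = 0 gives a probability of exactly 0.5, which falls on the "otherwise" side of the rule and is labeled 0.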

Cost Function (Binary Cross-Entropy)
Gradient Descent
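
The cost function and gradient descent named above can be sketched as follows. This is a minimal illustration, assuming the standard mean binary cross-entropy loss. The helper names `binary_cross_entropy` and `gradient_step`, the tiny dataset, and the learning rate are all choices made for this example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; clipping avoids taking log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def predict_proba(X, w, b):
    """Sigmoid of the linear combination of features."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def gradient_step(X, y, w, b, lr):
    """One gradient-descent update of weights and bias on the BCE loss."""
    m = X.shape[0]
    error = predict_proba(X, w, b) - y
    dw = X.T @ error / m
    db = np.sum(error) / m
    return w - lr * dw, b - lr * db

# Tiny made-up dataset: one feature, labels switch from 0 to 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

w, b = np.zeros(1), 0.0
loss_before = binary_cross_entropy(y, predict_proba(X, w, b))
for _ in range(100):
    w, b = gradient_step(X, y, w, b, lr=0.1)
loss_after = binary_cross_entropy(y, predict_proba(X, w, b))

print(loss_before, loss_after)
```

With zero-initialized weights every prediction is 0.5, so the starting loss equals ln 2. The loss should then shrink as the updates proceed.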

Probabilistic Approach:

Logistic Regression provides a probabilistic framework: it outputs the probability of each prediction rather than only a class label, which can be crucial for decision-making.

Efficiency:

It is computationally inexpensive to train and to apply, making it a good choice when the number of features is moderate.

Interpretability:

Logistic regression models are easy to interpret. Each weight gives the change in the log odds of the positive class per unit increase in its feature, so weight magnitudes indicate feature importance only when the features are on comparable scales.

Performance:

It often performs well in binary classification tasks, especially when the classes are approximately linearly separable.

**Assumptions of Logistic Regression**


Binary Output:

- The dependent variable should be binary (0/1, True/False).

Linearity of Log Odds:

- Logistic regression does not require a linear relationship between the independent and dependent variables, but it does require a linear relationship between the independent variables and the log odds of the outcome.

No Multicollinearity:

- Logistic regression assumes little or no multicollinearity among the independent variables.

Independent Observations:

- Observations should be independent of one another; they should not come from repeated measurements or matched data.

Large Sample Size:

- Logistic regression needs a reasonably large sample to produce stable, accurate estimates.

**Advantages of Logistic Regression**

Simplicity and Interpretability

Efficient Computation

Good Performance on Linearly Separable Data

Probabilistic Approach

Scalability

Less Prone to Overfitting


**Disadvantages of Logistic Regression**

Assumption of Linearity

Not Suitable for Large Number of Categorical Features

Multicollinearity

Limited to Binary or Multinomial Outcomes



In [None]:
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None

    def _sigmoid(self, z):
        """Private method for the sigmoid function."""
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        """Fit the model to the data."""
        m, n = X.shape

        # Initialize weights and bias
        self.weights = np.zeros(n)
        self.bias = 0

        # Gradient descent
        for _ in range(self.iterations):
            model = np.dot(X, self.weights) + self.bias
            predictions = self._sigmoid(model)

            # Compute gradients
            dw = (1 / m) * np.dot(X.T, (predictions - y))
            db = (1 / m) * np.sum(predictions - y)

            # Update weights and bias
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        """Predict binary labels for a set of inputs."""
        model = np.dot(X, self.weights) + self.bias
        predictions = self._sigmoid(model)
        return [1 if i > 0.5 else 0 for i in predictions]



Predictions: [1, 1, 0, 1, 0]


**Embeddings**

**Word2Vec**

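Skip-gram: predicts the context words from the center word; CBOW: predicts the center word from its context words.
Trained with a shallow neural network with no nonlinear activation in the hidden layer. The weights start random and, after training, the hidden-layer weight matrix serves as the embedding.

Fixed vocabulary; out-of-vocabulary (OOV) words get no embedding.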
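
To make the skip-gram and CBOW distinction concrete, here is a minimal sketch of how training examples could be built from a token list. The helper names `skipgram_pairs` and `cbow_examples` and the window size are assumptions for illustration, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is used to predict each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the neighbors are used to predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)

print(pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cbow[1])    # (['the', 'sat'], 'cat')
```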

**Fasttext**

Concept: character n-grams.
Each word is represented by its character-level n-grams.

Word Representation: Each word is represented as the sum of these character n-gram vectors. This means that the word vector for "apple" is composed of the vectors of all its n-grams.



Generating Embeddings for New Words: When encountering a word not seen during training, FastText computes its embedding by summing the vectors of its character n-grams. This allows the model to produce a meaningful representation for the OOV word based on its subword units.
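
The subword idea can be sketched as below. This is a simplified illustration, not the FastText implementation: it uses only trigrams, stores n-gram vectors in a plain dictionary with random values, and skips n-grams it has never seen. The names `char_ngrams`, `ngram_vectors`, and `embed` are hypothetical.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

rng = np.random.default_rng(0)
dim = 4

# Stand-in for learned n-gram vectors (random here, purely for illustration)
train_words = ["apple", "apply"]
ngram_vectors = {
    gram: rng.normal(size=dim)
    for word in train_words
    for gram in char_ngrams(word)
}

def embed(word):
    """Sum the vectors of the word's known n-grams; unseen n-grams are skipped."""
    vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']

# "apples" was never in train_words, but it shares n-grams with "apple"
oov_vector = embed("apples")
print(oov_vector.shape)  # (4,)
```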

**Negative Sampling**

Efficiency: The primary advantage of negative sampling is computational efficiency. The original Word2Vec model used a softmax function to calculate probabilities for every word in the vocabulary, which is computationally expensive. Negative sampling instead updates the output weights for only the true context word and a small number of randomly sampled "negative" words at each training step, drastically reducing the work per step.
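
A minimal sketch of the per-pair negative-sampling loss follows. It makes some simplifying assumptions: negatives are drawn uniformly (Word2Vec draws them from a smoothed unigram distribution), the vectors are random rather than trained, and the names `negative_sampling_loss`, `in_vecs`, and `out_vecs` are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one (center, context) pair with k sampled negative words."""
    # Pull the true context word toward the center word
    positive_term = -np.log(sigmoid(context_vec @ center_vec))
    # Push each sampled negative word away from the center word
    negative_term = -np.sum(np.log(sigmoid(-(negative_vecs @ center_vec))))
    return positive_term + negative_term

rng = np.random.default_rng(0)
vocab_size, dim, k = 1000, 8, 5

in_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))   # center-word vectors
out_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors

center_id, context_id = 3, 7
negative_ids = rng.choice(vocab_size, size=k, replace=False)  # uniform sampling (assumption)

loss = negative_sampling_loss(
    in_vecs[center_id], out_vecs[context_id], out_vecs[negative_ids]
)
print(loss)
```

Only k + 1 output vectors take part in this loss, compared with all 1000 for a full softmax over the vocabulary.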





In [None]:
#simple embeddings coding implementations

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example dataset: list of sentences (documents)
documents = ["the cat sat on the mat", "the dog barked at the cat"]

# Basic Preprocessing: tokenization (splitting text into words)
tokenized_docs = [doc.split() for doc in documents]
# Sort the vocabulary so that word ids are the same on every run
vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
word_to_id = {word: i for i, word in enumerate(vocabulary)}

In [None]:
def one_hot_encode(doc):
    encoding = np.zeros(len(vocabulary))
    for word in doc:
        encoding[word_to_id[word]] = 1
    return encoding

one_hot_encoded_docs = [one_hot_encode(doc) for doc in tokenized_docs]

In [None]:
def term_frequency(doc):
    tf = np.zeros(len(vocabulary))
    for word in doc:
        tf[word_to_id[word]] += 1
    return tf / len(doc)

tf_encoded_docs = [term_frequency(doc) for doc in tokenized_docs]

In [None]:
import numpy as np

# Sample documents
documents = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barked"
]

# Tokenize documents
tokenized_documents = [doc.lower().split() for doc in documents]

# Create a vocabulary
vocabulary = list(set(word for doc in tokenized_documents for word in doc))

# Calculate TF
def calculate_tf(document, vocabulary):
    tf_dict = dict.fromkeys(vocabulary, 0)
    for word in document:
        tf_dict[word] += 1
    tf_dict = {word: count / len(document) for word, count in tf_dict.items()}
    return list(tf_dict.values())

tf = np.array([calculate_tf(doc, vocabulary) for doc in tokenized_documents])

# Calculate IDF
def calculate_idf(documents, vocabulary):
    N = len(documents)
    idf_dict = dict.fromkeys(vocabulary, 0)
    for doc in documents:
        for word in set(doc):
            idf_dict[word] += 1
    idf_dict = {word: np.log(N / float(count)) for word, count in idf_dict.items()}
    return list(idf_dict.values())

idf = np.array(calculate_idf(tokenized_documents, vocabulary))

# Calculate TF-IDF
tfidf = tf * idf

print(tfidf)


'''

TF-IDF(t, d) = (no. of times term t appears in doc d / total no. of terms in doc d) * log(no. of docs / no. of docs containing t)

'''

[[0.07701635 0.         0.07701635 0.07701635 0.07701635 0.
  0.07701635 0.         0.07701635]
 [0.         0.23104906 0.         0.         0.         0.
  0.         0.         0.        ]]



In [None]:
def binary_bag_of_words(doc):
    return np.where(term_frequency(doc) > 0, 1, 0)

binary_bow_docs = [binary_bag_of_words(doc) for doc in tokenized_docs]

In [None]:
def count_bag_of_words(doc):
    bow = np.zeros(len(vocabulary))
    for word in doc:
        bow[word_to_id[word]] += 1
    return bow

count_bow_docs = [count_bag_of_words(doc) for doc in tokenized_docs]

In [None]:
import numpy as np

def generate_ngrams(words, n):
    # Convert list of words to a NumPy array
    words_array = np.array(words)

    # Create an array of n-gram tuples
    ngrams = np.lib.stride_tricks.sliding_window_view(words_array, window_shape=n)

    # Convert n-grams to a list of tuples
    ngrams = [tuple(ng) for ng in ngrams]

    return ngrams

# Example usage
words = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
n = 3  # For trigrams
trigrams = generate_ngrams(words, n)

print(trigrams)

[('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]


**Naive Bayes**


posterior = likelihood * prior / evidence

Bayes' theorem is used to calculate the probability of a hypothesis (in this case, a class label) given some observed evidence (features in the data).

P(H|E) = P(E|H) * P(H) / P(E)

In the context of Naive Bayes for classification:

- H is the class label (e.g., "spam" or "not spam").
- E is the observed data (features).

The "naive" part is the assumption that the features are conditionally independent given the class, so P(E|H) is the product of the per-feature likelihoods.
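
A worked instance of the formula may help. The probabilities below are invented for illustration: a single feature (a word being present) and two classes.

```python
# Made-up numbers for a one-feature spam example
p_spam = 0.4                 # prior P(H = spam)
p_not_spam = 1 - p_spam      # prior P(H = not spam)
p_word_given_spam = 0.5      # likelihood P(E | spam)
p_word_given_not_spam = 0.1  # likelihood P(E | not spam)

# Evidence P(E): total probability of seeing the word under either class
evidence = p_word_given_spam * p_spam + p_word_given_not_spam * p_not_spam

# Posterior P(spam | E) from Bayes' theorem
posterior_spam = p_word_given_spam * p_spam / evidence

print(round(evidence, 2))        # 0.26
print(round(posterior_spam, 3))  # 0.769
```

Because the evidence term is the same for every class, a classifier can skip the division and just compare likelihood * prior across classes.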



In [None]:
import numpy as np

# Example dataset
# Features (binary features for simplicity)
X = np.array([
    [1, 0],  # Document 1 features
    [1, 1],  # Document 2 features
    [0, 1],  # Document 3 features
    [0, 0],  # Document 4 features
    [0, 1],  # Document 5 features
    [1, 1]   # Document 6 features
])

# Labels (1 for positive class, 0 for negative class)
y = np.array([1, 1, 1, 0, 0, 0])  # Corresponding labels for each document

# Function to calculate the prior probability of each class
def calculate_prior(y):
    classes = np.unique(y)  # Unique classes in the labels
    prior = np.zeros(len(classes))  # Prior probability for each class
    for index, cls in enumerate(classes):
        prior[index] = np.mean(y == cls)  # Fraction of documents of each class
    return prior

prior = calculate_prior(y)  # Calculate prior probabilities
# Function to calculate likelihood of features given a class
def calculate_likelihood(X, y):
    classes = np.unique(y)  # Unique classes
    n_features = X.shape[1]  # Number of features
    # Likelihood of each feature value (0 or 1) given a class
    likelihood = np.zeros((len(classes), n_features, 2))

    for cls in classes:
        for i in range(n_features):
            # Probability of feature i being 0 given class cls
            likelihood[cls, i, 0] = np.mean(X[y==cls, i] == 0)
            # Probability of feature i being 1 given class cls
            likelihood[cls, i, 1] = np.mean(X[y==cls, i] == 1)

    return likelihood

likelihood = calculate_likelihood(X, y)  # Calculate likelihood

# Naive Bayes prediction function
def naive_bayes_predict(X, prior, likelihood):
    classes = np.arange(len(prior))  # Class indices, taken from the prior instead of the global y
    n_features = X.shape[1]  # Number of features
    y_pred = np.zeros(X.shape[0])  # Array to hold predictions

    for index, x in enumerate(X):
        posteriors = np.zeros(len(classes))  # Posterior probability for each class

        for cls in classes:
            posterior = prior[cls]  # Start with the prior probability
            for i in range(n_features):
                # Multiply by likelihood of each feature
                posterior *= likelihood[cls, i, x[i]]
            posteriors[cls] = posterior  # Total posterior for this class

        y_pred[index] = np.argmax(posteriors)  # Class with the highest posterior

    return y_pred

predictions = naive_bayes_predict(X, prior, likelihood)  # Predict using the model


[0 1]


**Reservoir Sampling**

Reservoir Sampling is a technique for drawing a fixed-size random sample from a data stream of unknown or very large size. It is particularly useful when the dataset is too large to fit into memory or when the total number of items is not known in advance.

How it Works:

- Initialize the Reservoir: Create a reservoir (array) of size k, the desired sample size, and fill it with the first k items of the stream.
- Process Each Item in the Stream: For each subsequent item i (counting from 1, so starting at i = k + 1), include it with probability k / i.
- Replace on Inclusion: If the item is included, it replaces an item chosen uniformly at random from the reservoir.

With these probabilities, every item in the stream has the same chance, k / n, of being in the final sample after n items have been seen.

Key Features:

- Efficient for large or streaming datasets: one pass over the data and memory proportional to k.
- Ensures a random sample without knowing the total size of the dataset.
- Every item in the stream has an equal chance of being included in the sample.

**Bootstrap Sampling**

Bootstrap Sampling is a statistical method for estimating the sampling distribution of a statistic (such as the mean, median, or variance) by resampling with replacement from the data. It is widely used for assessing the variability of a statistic.

How it Works:

- Draw Samples with Replacement: From the original dataset, draw a sample with replacement (meaning the same item can be chosen more than once) whose size equals the original dataset size.
- Repeat the Process: Repeat this many times, each time computing the statistic of interest from the bootstrap sample.
- Estimate the Distribution: Use the collection of computed statistics from all the bootstrap samples to estimate the distribution of the statistic. This can include calculating the standard error, confidence intervals, or other measures of statistical variability.

Key Features:

- Does not require parametric assumptions about the distribution of the data, though it does assume the sample is representative of the population.
- Suitable for small datasets, as it can help understand the variability of a statistic without needing new data.
- Widely used for non-parametric statistical inference.
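
The three bootstrap steps can be sketched as follows, here estimating the standard error and a 95% percentile interval for the mean. The data, the number of resamples, and the seed are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
n_resamples = 2000

# Steps 1 and 2: resample with replacement many times and record the statistic
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_resamples)
])

# Step 3: summarize the distribution of the statistic
std_error = boot_means.std(ddof=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print("Sample mean:", data.mean())
print("Bootstrap standard error:", std_error)
print("95% percentile interval:", (ci_low, ci_high))
```

The percentile interval used here is the simplest bootstrap interval; other variants exist and may behave better on small or skewed samples.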


In [None]:
#Sampling





# Random Sampling
# Random sampling is straightforward with NumPy, using the np.random.choice function.

import numpy as np

def random_sampling(data, sample_size):
    sampled_data = np.random.choice(data, size=sample_size, replace=False)
    return sampled_data

# Example usage
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
sample_size = 5
random_sample = random_sampling(data, sample_size)
print("Random Sample:", random_sample)


# Stratified Sampling
# Divide the data into different "strata", then perform random sampling within each stratum.

def stratified_sampling(data, labels, sample_size_per_stratum):
    unique_labels = np.unique(labels)
    sampled_data = []

    for label in unique_labels:
        stratum = data[labels == label]
        stratum_sample = np.random.choice(stratum, size=sample_size_per_stratum, replace=False)
        sampled_data.extend(stratum_sample)

    return np.array(sampled_data)

# Example usage
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # Binary labels for simplicity
sample_size_per_stratum = 2
stratified_sample = stratified_sampling(data, labels, sample_size_per_stratum)
print("Stratified Sample:", stratified_sample)



# Reservoir Sampling
# Reservoir sampling is used for sampling from a stream of data.

def reservoir_sampling(stream, sample_size):
    reservoir = np.zeros(sample_size)
    for i, element in enumerate(stream):
        if i < sample_size:
            reservoir[i] = element
        else:
            j = np.random.randint(0, i+1)
            if j < sample_size:
                reservoir[j] = element
    return reservoir

# Example usage (assuming a stream of data)
stream = np.arange(100)  # Stream of data
sample_size = 10
reservoir_sample = reservoir_sampling(stream, sample_size)
print("Reservoir Sample:", reservoir_sample)

# Bootstrap Sampling
# Bootstrap sampling involves sampling with replacement.



def bootstrap_sampling(data, sample_size):
    sampled_data = np.random.choice(data, size=sample_size, replace=True)
    return sampled_data

# Example usage
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
sample_size = 10
bootstrap_sample = bootstrap_sampling(data, sample_size)
print("Bootstrap Sample:", bootstrap_sample)




**Beam Search**

In [None]:
import numpy as np

def beam_search(decoder, beam_width, start_token, end_token):
    """
    Perform beam search with a given decoder model.

    :param decoder: A decoder model with a method get_next_tokens that takes a sequence and returns
                    a list of tuples (next_token, probability).
    :param beam_width: The number of beams to keep at each step.
    :param start_token: The token that denotes the start of a sequence.
    :param end_token: The token that denotes the end of a sequence.
    :return: The best sequence found by beam search.
    """
    beams = [([start_token], 0)]  # Initialize beams with the start token and zero score

    while True:
        new_beams = []

        for seq, score in beams:
            if seq[-1] == end_token:
                # If the sequence ends with the end token, add it to the new beams
                new_beams.append((seq, score))
                continue

            # Get possible next tokens and their probabilities
            next_tokens_with_probs = decoder.get_next_tokens(seq)

            # Extend the sequence with these tokens and update the score
            for token, prob in next_tokens_with_probs:
                new_seq = seq + [token]
                new_score = score + np.log(prob)
                new_beams.append((new_seq, new_score))

        # Keep only the top beam_width sequences
        beams = sorted(new_beams, key=lambda x: x[1], reverse=True)[:beam_width]

        # Check if all beams end with the end token
        if all(seq[-1] == end_token for seq, _ in beams):
            break

    # Return the sequence with the highest score
    return max(beams, key=lambda x: x[1])[0]


# Example Decoder Model
class ExampleDecoder:
    def __init__(self):
        # This is a placeholder for an actual model. In a real scenario, this would be a trained model.
        # The example here is simplistic and for illustration purposes only.
        self.vocab = {'<start>': 0, '<end>': 1, 'hello': 2, 'world': 3}
        self.probabilities = {
            '<start>': [('hello', 0.6), ('<end>', 0.4)],
            'hello': [('world', 0.8), ('<end>', 0.2)],
            'world': [('<end>', 1.0)]
        }

    def get_next_tokens(self, sequence):
        last_token = sequence[-1]
        return self.probabilities.get(last_token, [('<end>', 1.0)])


# Usage
decoder = ExampleDecoder()
beam_width = 2
start_token = '<start>'
end_token = '<end>'
best_sequence = beam_search(decoder, beam_width, start_token, end_token)
best_sequence



In [None]:
'''
tokenizer - own
simple lang model
self attention
'''

In [None]:
import numpy as np

class SelfAttentionWithCache:
    def __init__(self, embedding_dim):
        # Simplified: Using random matrices for query, key, and value weights
        self.W_q = np.random.randn(embedding_dim, embedding_dim)
        self.W_k = np.random.randn(embedding_dim, embedding_dim)
        self.W_v = np.random.randn(embedding_dim, embedding_dim)
        self.cache = {'K': [], 'V': []}  # Initialize cache for keys and values

    def attention(self, Q, K, V):
        # Calculate attention scores (scaled dot-product attention)
        scores = np.dot(Q, K.T) / np.sqrt(K.shape[-1])
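        # Subtract the row-wise max before exponentiating for numerical stability
        exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)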
        return np.dot(weights, V)

    def update_cache(self, K, V):
        # Append new keys and values to the cache
        self.cache['K'].append(K)
        self.cache['V'].append(V)

    def forward(self, x):
        # Compute Query, Key, and Value matrices
        Q = np.dot(x, self.W_q)
        K = np.dot(x, self.W_k)
        V = np.dot(x, self.W_v)

        # Update cache with the new Key and Value
        self.update_cache(K, V)

        # Use cached Keys and Values for attention calculation
        cached_K = np.concatenate(self.cache['K'], axis=0)
        cached_V = np.concatenate(self.cache['V'], axis=0)

        # Calculate attention output
        return self.attention(Q, cached_K, cached_V)

# Example Usage
embedding_dim = 64  # Example embedding dimension
attention_layer = SelfAttentionWithCache(embedding_dim)

# Example input (token embeddings)
input_embedding = np.random.randn(1, embedding_dim)  # Single token for simplicity

# Forward pass (assuming autoregressive generation, one token at a time)
output_1 = attention_layer.forward(input_embedding)
# ... process more tokens in a similar manner ...
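
To illustrate what processing more tokens with a cache buys, the sketch below feeds several tokens one at a time and checks that the result matches full causal (masked) self-attention computed in one pass. It is a standalone simplification with its own small random weights; the variable names and sizes are assumptions for this example.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
tokens = rng.normal(size=(4, d))  # four token embeddings, fed one at a time

# Incremental decoding: each step computes K and V for the new token only
K_cache, V_cache, step_outputs = [], [], []
for x in tokens:
    x = x[None, :]                      # shape (1, d)
    K_cache.append(x @ W_k)
    V_cache.append(x @ W_v)
    K = np.concatenate(K_cache, axis=0)
    V = np.concatenate(V_cache, axis=0)
    weights = softmax((x @ W_q) @ K.T / np.sqrt(d))
    step_outputs.append(weights @ V)
cached = np.concatenate(step_outputs, axis=0)

# Reference: full attention with a causal mask (no looking at future tokens)
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
full = softmax(scores) @ V

print(np.allclose(cached, full))  # True
```

The cache avoids recomputing keys and values for earlier tokens at every step, which is why it is used in autoregressive generation.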
