<a href="https://colab.research.google.com/github/naica922/uek_259_MachieneLearning/blob/main/demos/01_NumPyIntro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NumPy UltraQuick Tutorial

NumPy is a Python library for creating and manipulating vectors and matrices. This Colab is not an exhaustive tutorial on NumPy.  Rather, this Colab teaches you just enough to use NumPy in the Colab exercises of Machine Learning Crash Course.



## About Jupyter

Jupyter Notebooks consist of two kinds of components:

  * **Text cells**, which contain explanations. You are currently reading a text cell.
  * **Code cells**, which contain Python code for you to run. Code cells have a light gray background.

You *read* the text cells and *run* the code cells.






### Running code cells

You must run code cells in order. In other words, you may only run a code cell once all the code cells preceding it have already been run.

To run a code cell:

  1. Place the cursor anywhere inside the [ ] area at the top left of a code cell. The area inside the [ ] will display an arrow.
  2. Click the arrow.

Alternatively, you may invoke **Run->Run all**.

### If you see errors...

The most common reasons for seeing code cell errors are as follows:

  * You didn't run *all* of the code cells preceding the current code cell.
  * If the code cell is labeled as a **Task**, then:
    *  You haven't yet written the code that implements the task.
    *  You did write the code, but the code contained errors.

## Import NumPy module

Run the following code cell to import the NumPy module:

In [None]:
import numpy as np

## Populate arrays with specific numbers

Call `np.array` to create a NumPy matrix with your own hand-picked values. For example, the following call to `np.array` creates an 8-element vector:

In [None]:
one_dimensional_array = np.array([1.2, 2.4, 3.5, 4.7, 6.1, 7.2, 8.3, 9.5])
print(one_dimensional_array)

You can also use `np.array` to create a two-dimensional matrix. To create a two-dimensional matrix, specify an extra layer of square brackets. For example, the following call creates a 3x2 matrix:

In [None]:
two_dimensional_array = np.array([[6, 5], [11, 7], [4, 8]])
print(two_dimensional_array)

To populate a matrix with all zeroes, call `np.zeros`. To populate a matrix with all ones, call `np.ones`.
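As a minimal sketch of these two calls (the variable names and shapes are illustrative and not part of the original exercises), note that the shape is passed as a tuple for a matrix and as a single integer for a vector:

```python
import numpy as np

# A 2x3 matrix filled with zeros; the shape is passed as a tuple.
zeros_matrix = np.zeros((2, 3))
print(zeros_matrix)

# A 4-element vector filled with ones.
ones_vector = np.ones(4)
print(ones_vector)
```

Both functions return floating-point arrays by default.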

## Populate arrays with sequences of numbers

You can populate an array with a sequence of numbers:

In [None]:
sequence_of_integers = np.arange(5, 12)
print(sequence_of_integers)

Notice that `np.arange` generates a sequence that includes the lower bound (5) but not the upper bound (12).
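The short sketch below illustrates the exclusive upper bound and the optional third argument of `np.arange`, the step size. The variable names are illustrative only:

```python
import numpy as np

# Third argument is the step: start at 5, stop before 12, step by 2.
odd_steps = np.arange(5, 12, 2)
print(odd_steps)  # contains 5, 7, 9, 11

# To include an upper bound of 12, pass 13 as the stop value.
inclusive_sequence = np.arange(5, 13)
print(inclusive_sequence)
```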

## Populate arrays with random numbers

NumPy provides various functions to populate matrices with random numbers across certain ranges. For example, `np.random.randint` generates random integers between a low and high value. The following call populates a 6-element vector with random integers between 50 and 100.




In [None]:
random_integers_between_50_and_100 = np.random.randint(low=50, high=101, size=(6))
print(random_integers_between_50_and_100)

Note that the highest integer generated by `np.random.randint` is one less than the `high` argument.
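One way to convince yourself of this is to draw many samples and inspect the extremes. This is only a sketch; the sample size of 10,000 is an arbitrary choice made here so that both ends of the range are very likely to appear:

```python
import numpy as np

# Draw many samples so both ends of the range are very likely to appear.
samples = np.random.randint(low=50, high=101, size=10_000)

# The smallest possible value is `low`; the largest is `high - 1`.
print(samples.min(), samples.max())
```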

To create random floating-point values between 0.0 and 1.0, call `np.random.random`. For example:

In [None]:
random_floats_between_0_and_1 = np.random.random([6])
print(random_floats_between_0_and_1)

## Mathematical Operations on NumPy Operands

If you want to add or subtract two vectors or matrices, linear algebra requires that the two operands have the same dimensions. Furthermore, if you want to multiply two vectors or matrices, linear algebra imposes strict rules on the dimensional compatibility of operands. Fortunately, NumPy uses a trick called [**broadcasting**](https://developers.google.com/machine-learning/glossary/#broadcasting) to virtually expand the smaller operand to dimensions compatible for linear algebra. For example, the following operation uses broadcasting to add 2.0 to the value of every item in the vector created in the previous code cell:

In [None]:
random_floats_between_2_and_3 = random_floats_between_0_and_1 + 2.0
print(random_floats_between_2_and_3)

The following operation also relies on broadcasting to multiply each cell in a vector by 3:

In [None]:
random_integers_between_150_and_300 = random_integers_between_50_and_100 * 3
print(random_integers_between_150_and_300)
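Broadcasting is not limited to scalars. As an illustrative sketch (the arrays below are made up for this example), adding a 3-element vector to a 2x3 matrix virtually repeats the vector for every row of the matrix:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)

# The row is virtually repeated for each of the two matrix rows.
broadcast_sum = matrix + row
print(broadcast_sum)
```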

## Task 1: Create a Linear Dataset

Your goal is to create a simple dataset consisting of a single feature and a label as follows:

1. Assign a sequence of 15 integers from 6 to 20 (inclusive) to a NumPy array named `feature`.
2. Assign 15 values to a NumPy array named `label` such that:

```
   label = 3*(feature) + 4
```
For example, the first value for `label` could be:

```
  label = 3*(6) + 4 = 22
 ```

In [None]:
feature = ? # write your code here
print(feature)
label = ?   # write your code here
print(label)
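If you get stuck, here is one possible solution, offered as a sketch and not as the only correct answer. It assumes `np.arange` for the sequence and relies on broadcasting for the arithmetic:

```python
import numpy as np

# np.arange excludes the stop value, so stop at 21 to include 20.
feature = np.arange(6, 21)
print(feature)

# Broadcasting applies the multiplication and addition to every element.
label = (3 * feature) + 4
print(label)
```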

## Task 2: Add Some Noise to the Dataset

To make your dataset a little more realistic, insert a little random noise into each element of the `label` array you already created. To be more precise, modify each value assigned to `label` by adding a *different* random floating-point value between -2 and +2.

Don't rely on broadcasting. Instead, create a `noise` array having the same dimension as `label`.

In [None]:
noise = ?    # write your code here
print(noise)
label = ?    # write your code here
print(label)
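One possible solution is sketched below. It recreates `feature` and `label` so that the block runs on its own, and it assumes that scaling the output of `np.random.random` is an acceptable way to reach the range from -2 to +2:

```python
import numpy as np

feature = np.arange(6, 21)
label = (3 * feature) + 4

# np.random.random yields floats in [0.0, 1.0); scale and shift to [-2.0, 2.0).
noise = (np.random.random([15]) * 4) - 2
print(noise)

# Both arrays have 15 elements, so the addition is element-wise.
label = label + noise
print(label)
```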

# Project

## Learning the `pipeline` function

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(["I've been waiting for a HuggingFace course my whole life.",
           "I hate this."])


In [None]:
classifier = pipeline("zero-shot-classification")
classifier("This is a course about the Transformers library",
           candidate_labels=["education", "politics", "business"])

In [None]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to cook a ")

In [None]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to go for a",
    max_length=30,
    num_return_sequences=2,
)

In [None]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

In [None]:
# aggregation_strategy replaces the deprecated grouped_entities=True argument
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

In [None]:
question_answering = pipeline("question-answering")
question_answering(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

In [None]:
summarizer = pipeline("summarization")
summarizer(''' America has changed dramatically''')

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

## Cosine similarity

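Cosine similarity measures how similar two non-zero vectors are by computing the cosine of the angle between them. <br>
A value close to 1 means the vectors point in nearly the same direction (high similarity), and a value close to -1 means they point in opposite directions. Equivalently, a smaller cosine distance (1 minus the cosine similarity) indicates higher similarity, and a larger distance indicates lower similarity.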
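To make the definition concrete, the sketch below computes the cosine similarity for three hand-picked pairs of 2-D vectors: one pair pointing in the same direction, one orthogonal pair, and one pair pointing in opposite directions. The helper name `cosine_similarity` and the vectors are chosen here purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(same_direction, orthogonal, opposite)  # close to 1, 0, and -1
```

Note that the length of the vectors does not matter: `[1, 2]` and `[2, 4]` differ in magnitude but still score approximately 1.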

In [None]:
# Load data

import pandas as pd

df = pd.read_parquet("hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/train-00000-of-00001.parquet")

In [None]:
df.head(3)

In [None]:
df.embedding[0]

In [None]:
import numpy as np
def array_cosine_similarity(arr1, arr2):
  dot_product = np.dot(arr1, arr2)
  norm_a = np.linalg.norm(arr1)
  norm_b = np.linalg.norm(arr2)
  return dot_product / (norm_a * norm_b)

linux_embedding = df[df['act'] == 'Linux Terminal']['embedding'].iloc[0]

df['similarity'] = df.apply(lambda row: array_cosine_similarity(row['embedding'], linux_embedding), axis=1)

sorted_df = df.sort_values(by='similarity', ascending=False)

print(sorted_df[['act', 'prompt', 'similarity']].head(3))
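For larger tables, a row-by-row `apply` can be slow. A possible alternative, sketched here on a tiny made-up DataFrame (`toy_df` and its values are hypothetical stand-ins for the real dataset), stacks the embeddings into one matrix and computes every similarity with a single matrix-vector product:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy_df = pd.DataFrame({
    "act": ["Linux Terminal", "Shell Tutor", "Poet"],
    "embedding": [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])],
})

query = toy_df.loc[toy_df["act"] == "Linux Terminal", "embedding"].iloc[0]

# Stack the per-row vectors into one (n_rows, dim) matrix.
matrix = np.stack(toy_df["embedding"].tolist())

# One matrix-vector product gives all dot products at once.
toy_df["similarity"] = (matrix @ query) / (
    np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
)

print(toy_df.sort_values(by="similarity", ascending=False)[["act", "similarity"]])
```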

## Project


In [None]:
import pandas as pd
from transformers import pipeline

df = pd.read_parquet("hf://datasets/philschmid/amazon-product-descriptions-vlm/data/train-00000-of-00001.parquet")
df

In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

### Subtask: Generate and add embeddings

Generate embeddings for the `description` column using the loaded model and add them as a new column to the DataFrame.

In [None]:
# Description embedding
embeddings = model.encode(df['description'].tolist())

df['description_embedding'] = list(embeddings)

display(df['description_embedding'])

In [None]:
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util

# Load dataset
df = pd.read_parquet("hf://datasets/philschmid/amazon-product-descriptions-vlm/data/train-00000-of-00001.parquet")

# Sentence embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embeddings for descriptions
embeddings = model.encode(df['description'].tolist(), convert_to_tensor=True)
df['description_embedding'] = embeddings.tolist()

for i, emb in enumerate(df['description_embedding'].head()):
    print(f"Embedding for description {i}: {emb[:10]}...")

# Function to find similar descriptions
def find_similar_descriptions(input_text, dataframe, model, top_k=1):

    # Embedding for input text
    input_embedding = model.encode(input_text, convert_to_tensor=True)

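    # Cosine similarity against every stored embedding (kept on the same device as the input)
    stored_embeddings = torch.tensor(dataframe['description_embedding'].tolist(), device=input_embedding.device)
    cosine_scores = util.cos_sim(input_embedding, stored_embeddings)[0]

    # Top similar descriptions (k cannot exceed the number of rows)
    top_results = torch.topk(cosine_scores, k=min(top_k, len(dataframe)))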

    similar_descriptions = []
    for score, idx in zip(top_results[0], top_results[1]):
        similar_descriptions.append({
            # topk returns positions, so index by position and not by label
            'description': dataframe['description'].iloc[idx.item()],
            'similarity_score': score.item()
        })

    return pd.DataFrame(similar_descriptions)

# Example usage (dedented so it runs at module level, not as dead code after `return`)
user_input = "Everlast professional hand wraps."
print(f"\nFinding descriptions similar to: '{user_input}'")
most_similar = find_similar_descriptions(user_input, df, model, top_k=3)
print(most_similar)

user_input_2 = "This item is bad quality and easily broken."
print(f"\nFinding descriptions similar to: '{user_input_2}'")
most_similar_2 = find_similar_descriptions(user_input_2, df, model, top_k=2)
print(most_similar_2)

In [None]:
import pandas as pd
import torch  # Used to rebuild tensors from the stored embedding lists
from sentence_transformers import SentenceTransformer, util

# 1. Load the dataset
# Ensure you have access to the dataset. 'hf://datasets/' implies a Hugging Face dataset.
# If you're running this locally, make sure the dataset is downloaded or accessible.
try:
    df = pd.read_parquet("hf://datasets/philschmid/amazon-product-descriptions-vlm/data/train-00000-of-00001.parquet")
    print("Dataset loaded successfully!")
except Exception as e:
    print(f"Error loading dataset: {e}")
    print("Please ensure you have internet connectivity or the dataset is available locally.")
    # Create a dummy DataFrame for demonstration if the real one can't be loaded
    data = {'description': [
        "This is a fantastic product, highly recommend it for daily use.",
        "A very durable and long-lasting item, perfect for outdoor activities.",
        "Not what I expected, quite flimsy and broke quickly.",
        "Excellent quality and great value for money, will buy again.",
        "Good for the price, but could be more robust."
    ]}
    df = pd.DataFrame(data)
    print("Using a dummy DataFrame for demonstration.")


# 2. Initialize the Sentence Transformer Model
# 'all-MiniLM-L6-v2' is a good general-purpose model for sentence embeddings.
# You can explore other models on the Hugging Face Model Hub based on your needs.
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Sentence Transformer model loaded.")

# 3. Generate embeddings for all descriptions
print("Generating embeddings for descriptions...")
embeddings = model.encode(df['description'].tolist(), convert_to_tensor=True) # convert_to_tensor=True for GPU optimization if available
df['description_embedding'] = embeddings.tolist() # Store as list if you prefer, or keep as tensor if you're doing more tensor operations

print("Embeddings generated and added to DataFrame.")
# Display the first few embeddings (optional)
# for i, emb in enumerate(df['description_embedding'].head()):
#     print(f"Embedding for description {i}: {emb[:10]}...") # Print first 10 dimensions

# 4. Function to find similar descriptions
def find_similar_descriptions(input_text, dataframe, model, top_k=1):
    """
    Finds descriptions in the DataFrame most similar to the input text.

    Args:
        input_text (str): The text to compare against the descriptions.
        dataframe (pd.DataFrame): The DataFrame containing 'description' and 'description_embedding' columns.
        model (SentenceTransformer): The pre-trained sentence transformer model.
        top_k (int): The number of most similar descriptions to return.

    Returns:
        pd.DataFrame: A DataFrame containing the top_k most similar descriptions and their similarity scores.
    """
    # Generate embedding for the input text
    input_embedding = model.encode(input_text, convert_to_tensor=True)

    # Calculate cosine similarities between input_embedding and all description_embeddings.
    # Rebuild the stored embeddings on the same device as the input to avoid a CPU/GPU mismatch.
    stored_embeddings = torch.tensor(dataframe['description_embedding'].tolist(), device=input_embedding.device)
    cosine_scores = util.cos_sim(input_embedding, stored_embeddings)[0]

    # Get the top_k indices with the highest similarity scores (k cannot exceed the number of rows)
    top_results = torch.topk(cosine_scores, k=min(top_k, len(dataframe)))

    similar_descriptions = []
    for score, idx in zip(top_results[0], top_results[1]):
        similar_descriptions.append({
            # topk returns positions, so index by position and not by label
            'description': dataframe['description'].iloc[idx.item()],
            'similarity_score': score.item()
        })

    return pd.DataFrame(similar_descriptions)

# 5. Example Usage
if not df.empty:
    user_input = "Looking for a very durable product that lasts long."
    print(f"\nFinding descriptions similar to: '{user_input}'")
    most_similar = find_similar_descriptions(user_input, df, model, top_k=3)
    print(most_similar)

    user_input_2 = "This item is bad quality and easily broken."
    print(f"\nFinding descriptions similar to: '{user_input_2}'")
    most_similar_2 = find_similar_descriptions(user_input_2, df, model, top_k=2)
    print(most_similar_2)
else:
    print("\nDataFrame is empty, cannot perform similarity search.")