# NLP Tasks Evaluation

In [1]:
import uuid
import json
from pprint import pprint

import chromadb
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, HTML

## Sentiment Analysis

### Method 1: Embeddings, Vector Databases and Cosine Similarity Searches

This method rests on the idea that texts with similar meanings are mapped to nearby vectors in an embedding space. To find the stored texts closest to a query, we embed the query and compare its vector with every stored vector using cosine similarity. For example, to find words similar to "good", we compute the cosine similarity between the vector for "good" and the vector of every other word in the vocabulary; the words with the highest scores are the closest matches. Below, the same search is applied to commit messages stored in a vector database.
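To make the similarity search concrete, here is a minimal sketch of cosine similarity and a brute-force nearest-neighbor lookup. The three-dimensional vectors and the texts in `toy_embeddings` are invented for illustration; real embeddings come from a model and have hundreds of dimensions. The helper names `cosine_similarity` and `most_similar` are not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration only.
toy_embeddings = {
    "fix buffer overflow in parser": [0.9, 0.1, 0.2],
    "patch out-of-bounds read": [0.8, 0.2, 0.1],
    "update README badges": [0.1, 0.9, 0.3],
}


def most_similar(query_vector, embeddings, k=2):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


query = [0.85, 0.15, 0.15]  # hypothetical embedding of a query message
for score, text in most_similar(query, toy_embeddings):
    print(f"{score:.3f}  {text}")
```

A vector database performs the same ranking, usually with an approximate index instead of comparing the query against every stored vector.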

In [2]:
import ast

client = chromadb.Client()
commits_collection = client.create_collection("commits")
df = pd.read_csv("data/processed/commits.csv")

labels = df["label"].tolist()
messages = df["message"].tolist()
# Parse each metadata string as a Python literal; unlike eval, this cannot run arbitrary code.
metadatas = [ast.literal_eval(metadata) for metadata in df["metadata"].tolist()]

commits_collection.add(
    ids=[str(uuid.uuid4()) for _ in range(len(df))],
    documents=messages,
    metadatas=metadatas,
)

Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


In [3]:
textarea = widgets.Textarea()
button = widgets.Button(description="Submit")
output = widgets.Output()

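def query_commits(x):
    # Button callback: look up the three stored commits closest to the entered message.
    with output:
        output.clear_output()
        results = commits_collection.query(query_texts=[textarea.value], n_results=3)
        print(type(results))
        # Render the raw query result as indented JSON.
        display(HTML(f"<pre>{json.dumps(results, indent=2)}</pre>"))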


button.on_click(query_commits)

display(HTML("<b>Enter a commit message:</b>"))
display(textarea)
display(button)
display(output)

Textarea(value='')

Button(description='Submit', style=ButtonStyle())

Output()
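The query above only lists the nearest stored commits. One way to turn those neighbors into a prediction is a majority vote over their labels. The sketch below assumes each stored commit's metadata carries a `"label"` key, which the notebook does not confirm (labels are loaded into a separate list). The `example_results` dictionary is hypothetical and only mimics the nested-list layout of a Chroma query result, and `vote_label` is an illustrative helper, not a Chroma function.

```python
from collections import Counter


def vote_label(results, label_key="label"):
    """Majority label among the neighbors returned for the first query text."""
    neighbor_metadatas = results["metadatas"][0]
    labels = [meta[label_key] for meta in neighbor_metadatas if label_key in meta]
    if not labels:
        return None  # no labeled neighbors to vote with
    # On a tie, Counter keeps the label that appeared first, i.e. the nearer neighbor.
    return Counter(labels).most_common(1)[0][0]


# Hypothetical query result, invented for illustration only.
example_results = {
    "documents": [["fix heap overflow", "sanitize user input", "bump version"]],
    "metadatas": [[
        {"label": "vulnerability_fix"},
        {"label": "vulnerability_fix"},
        {"label": "other"},
    ]],
    "distances": [[0.21, 0.34, 0.80]],
}

print(vote_label(example_results))
```

With `n_results=3` the vote is decided by at least two agreeing neighbors whenever there are only two classes.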

### Method 2: Supervised Learning

This method trains a model on labeled examples so that it can predict the label of text it has not seen. For sentiment analysis, the label is the sentiment of a sentence. For git commit classification, the model predicts whether a commit is a vulnerability fix or not.
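As a rough illustration of the supervised approach, the sketch below trains a small multinomial Naive Bayes classifier on word counts. This is only one possible model, chosen because it fits in a few lines; the notebook does not say which model it uses. The training messages and the labels (1 for a vulnerability fix, 0 otherwise) are invented, and `tokenize`, `train_naive_bayes`, and `predict` are illustrative helpers.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train_naive_bayes(messages, labels):
    """Collect per-class document and word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for message, label in zip(messages, labels):
        tokens = tokenize(message)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, message):
    """Return the class with the highest log-probability for the message."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(message):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: 1 = vulnerability fix, 0 = other commit.
train_messages = [
    "fix buffer overflow in parser",
    "patch sql injection in login",
    "update readme badges",
    "bump version to 1.2",
]
train_labels = [1, 1, 0, 0]

model = train_naive_bayes(train_messages, train_labels)
print(predict(model, "fix overflow in decoder"))
```

In practice the model would be trained on the `messages` and `labels` columns of the dataset and evaluated on a held-out split rather than on a handful of examples.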