### Relevance

**Relevance** refers to the importance or pertinence of certain pieces of information within a text to a specific task or query.

- `Information Retrieval (IR)`: In tasks like question answering and fact verification, relevance determines which documents or pieces of information are most pertinent to the query. An IR system retrieves relevant documents, which an NLP system then processes to extract the needed information.
- `Text Classification`: When categorizing text documents, relevance helps identify which words or phrases are most significant for determining the category. Techniques like Layer-wise Relevance Propagation (LRP) can trace classification decisions back to individual words, highlighting their importance.
- `Sentiment Analysis`: Relevance is used to focus on words or phrases that carry sentiment, such as positive or negative emotions, to accurately gauge the overall sentiment of a text.
- `Named Entity Recognition (NER)`: In NER, relevance helps in identifying and classifying entities (like names, dates, and locations) within a text. The relevance of a word or phrase to a particular entity type is crucial for accurate recognition.
- `Contextual Understanding`: Relevance is essential for understanding the context in which words are used. This helps in tasks like machine translation, where the meaning of a word can change based on its context.

**Models and Techniques**

- `TF-IDF` (Term Frequency-Inverse Document Frequency): This is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It helps in identifying the most relevant terms in a document.
- `BM25`: A ranking function that builds on TF-IDF by adding term-frequency saturation and document-length normalization. Search engines use it to estimate the relevance of documents to a given search query.
- `Word2Vec` and `GloVe`: These are word embedding models that capture semantic relationships between words. They can be used to measure the relevance of words in a context by comparing their vector representations.
- `BERT` (Bidirectional Encoder Representations from Transformers): BERT is a transformer-based model that understands the context of a word in a sentence by looking at the words before and after it. It is highly effective in tasks requiring contextual relevance, such as question answering and text classification.
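
To make the BM25 entry above concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. The function name `bm25_scores`, the whitespace tokenization, the toy documents, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions for illustration; the IDF term uses one common smoothed variant, and production search engines differ in the details.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25 variant."""
    # Simplifying assumption: lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            denominator = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denominator
        scores.append(score)
    return scores


# Hypothetical toy corpus, not taken from chat_data.csv.
docs = [
    "odysseus returned home to ithaca after the war",
    "generative ai is a rapidly evolving field",
    "a data warehouse is a centralized repository",
]
scores = bm25_scores("odysseus home war", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the first document shares any term with the query, so it is the only one with a non-zero score.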

**Why is it important?**

1. Improved Accuracy
2. Efficiency / Resource Optimization
3. Contextual Understanding
4. User Satisfaction
5. Enhanced Decision-Making

In [39]:
import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [40]:
df = pd.read_csv(r"C:\Users\nene0\Desktop\Projects\greenflash\chat_data.csv", encoding_errors='ignore')

df.head()

Unnamed: 0,Chat_ID,Message_ID,Sender,Message
0,data_science_trend,0,user,What is the latest trend in data science?
1,data_science_trend,1,copilot,"Data science is evolving rapidly, and several ..."
2,data_science_trend,2,user,Can you tell me more about generative AI?
3,data_science_trend,3,copilot,Generative AI is a fascinating and rapidly evo...
4,data_science_trend,4,user,can you explain more about how the generative ...


In [41]:
df['Chat_ID'].unique()

array(['data_science_trend', 'food_history_companies', 'gaming',
       'greek_myth', 'job_market', 'jokes', 'music_kpop', 'pets',
       'philoshophy', 'rich_poor_countries',
       'tech_product_recommendation', 'travel', 'largest_adj_product',
       'jarritos_flavors'], dtype=object)

In [42]:
# Split the data into one DataFrame per chat, keyed by Chat_ID.
dataframes = {}

for chat_id in df['Chat_ID'].unique():
    dataframes[chat_id] = df[df['Chat_ID'] == chat_id]

In [43]:
# ds_trend = dataframes['data_science_trend'].reset_index(drop=True)
# tech_rec = dataframes['tech_product_recommendation'].reset_index(drop=True)
# pets = dataframes['pets'].reset_index(drop=True)
# game = dataframes['gaming'].reset_index(drop=True)
myth = dataframes['greek_myth'].reset_index(drop=True)

In [44]:
# tech_rec.head()

In [45]:
myth.iloc[8]['Message']

'can you tell me what happen to odysseus after he returned to his home after the war?'

In [46]:
myth.iloc[9]['Message']

"Certainly! After enduring numerous adventures and challenges on his journey home from the Trojan War, Odysseus finally returned to his homeland of Ithaca. Here's a summary of what happened next:\n- Disguised Return: Upon arriving in Ithaca, Odysseus disguised himself as a beggar with the help of the goddess Athena. This allowed him to assess the situation at home without being recognized.\n- Reunion with Telemachus: Odysseus first revealed his true identity to his son, Telemachus. Together, they devised a plan to deal with the suitors who had overrun their home.\n- Penelope's Loyalty: Penelope, Odysseus's wife, had remained faithful to him despite being pressured by numerous suitors. She devised clever ways to delay remarrying, such as weaving and then secretly unraveling a shroud.\n- The Contest of the Bow: Penelope announced a contest to the suitors: whoever could string Odysseus's great bow and shoot an arrow through twelve axe heads would win her hand. None of the suitors could ac

In [47]:
message_8 = word_tokenize(myth.iloc[8]['Message'])
message_9 = word_tokenize(myth.iloc[9]['Message'])

In [48]:
stop_words = stopwords.words('english')

# Add custom stop words for contraction fragments (e.g. the "'s" left after tokenizing "Odysseus's").
stop_words.extend(["'s", "'t", "'ll","'d","'re"])

stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [49]:
symbols = ["'", ',', '.', '?', '!', '/', '-', '@', '#', '$', '%', '^', '&', '*', '(', ')', '[', ']', '{', '}', '_', '+', '=', ';', ':', '~', '`', '<', '>', '|']

In [50]:
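def remove_stop_words(list_words):
    """Return the tokens that are neither stop words nor punctuation symbols."""
    new_text = []
    for word in list_words:
        # Compare case-insensitively, but keep the token's original casing.
        if word.lower() not in stop_words and word.lower() not in symbols:
            new_text.append(word)
    return new_text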

In [51]:
question_cleaned = remove_stop_words(message_8)
response_cleaned = remove_stop_words(message_9)

In [52]:
print(len(message_9), len(response_cleaned))

345 166


In [53]:
question_cleaned = ' '.join(question_cleaned)
question_cleaned

'tell happen odysseus returned home war'

In [54]:
response_cleaned = ' '.join(response_cleaned)
response_cleaned

'Certainly enduring numerous adventures challenges journey home Trojan War Odysseus finally returned homeland Ithaca summary happened next Disguised Return Upon arriving Ithaca Odysseus disguised beggar help goddess Athena allowed assess situation home without recognized Reunion Telemachus Odysseus first revealed true identity son Telemachus Together devised plan deal suitors overrun home Penelope Loyalty Penelope Odysseus wife remained faithful despite pressured numerous suitors devised clever ways delay remarrying weaving secretly unraveling shroud Contest Bow Penelope announced contest suitors whoever could string Odysseus great bow shoot arrow twelve axe heads would win hand None suitors could accomplish task Odysseus still disguise succeeded effortlessly Reclaiming Throne revealing identity Odysseus help Telemachus loyal servants defeated suitors fierce battle reclaiming home throne Reunion Penelope Finally Odysseus Penelope joyfully reunited Penelope tested Odysseus ensure truly 

#### TF-IDF

In [66]:
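def calculate_relevance_scores(df, text_column, vectorizer_type, chat_id_column):
    """Score each prompt/response pair by the cosine similarity of their vectors.

    Assumes rows strictly alternate between a prompt and its response.
    """
    # Map each supported name to its vectorizer class.
    vectorizers = {'Tfidf': TfidfVectorizer, 'Count': CountVectorizer}

    # Instantiate only the requested vectorizer.
    vectorizer_class = vectorizers.get(vectorizer_type)
    if vectorizer_class is None:
        raise ValueError("Invalid vectorizer type. Choose 'Tfidf' or 'Count'.")
    vectorizer = vectorizer_class()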

    # List to store the results.
    results = []

    # Walk the rows two at a time: row i is the prompt, row i + 1 is its response.
    for i in range(0, len(df) - 1, 2):
        prompt = df.iloc[i][text_column]
        response = df.iloc[i + 1][text_column]
        chat_id = df.iloc[i][chat_id_column]

        # Vectorize the pair; the vectorizer is refit on just these two texts.
        vectors = vectorizer.fit_transform([prompt, response])

        # Calculate cosine similarity.
        similarity = cosine_similarity(vectors[0], vectors[1])[0][0]

        # Append the results.
        results.append({
            'Chat ID': chat_id,
            'Prompt': prompt,
            'Response': response,
            f'{vectorizer_type} Score': similarity
        })

    results_df = pd.DataFrame(results)
    return results_df
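
The stride-2 loop above pairs row `i` with row `i + 1`, which is only safe if every chat strictly alternates user and copilot messages. A hedged alternative is to pair on the `Sender` column and never across a `Chat_ID` boundary. The sketch below works on plain dict records (for example, what `df.to_dict('records')` returns); the helper name `pair_turns` and the toy records are hypothetical.

```python
def pair_turns(records):
    """Pair each user message with the next copilot message in the same chat."""
    pairs = []
    pending = None  # most recent user message that has no response yet
    for row in records:
        if row["Sender"] == "user":
            # A newer user message replaces any unanswered one.
            pending = row
        elif pending is not None and row["Chat_ID"] == pending["Chat_ID"]:
            pairs.append((pending["Chat_ID"], pending["Message"], row["Message"]))
            pending = None
        # Responses with no preceding prompt in the same chat are skipped.
    return pairs


# Hypothetical records: chat "a" ends on an unanswered prompt and
# chat "b" starts with an orphan response.
records = [
    {"Chat_ID": "a", "Sender": "user", "Message": "q1"},
    {"Chat_ID": "a", "Sender": "copilot", "Message": "r1"},
    {"Chat_ID": "a", "Sender": "user", "Message": "q2"},
    {"Chat_ID": "b", "Sender": "copilot", "Message": "orphan"},
    {"Chat_ID": "b", "Sender": "user", "Message": "q3"},
    {"Chat_ID": "b", "Sender": "copilot", "Message": "r3"},
]
pairs = pair_turns(records)
print(pairs)
```

On these records a stride-2 loop would pair `q2` with the orphan response from a different chat, while the sender-based pairing drops both.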

In [67]:
tfidf_df = calculate_relevance_scores(df, 'Message', 'Tfidf', 'Chat_ID')

tfidf_df

Unnamed: 0,Chat ID,Prompt,Response,Tfidf Score
0,data_science_trend,What is the latest trend in data science?,"Data science is evolving rapidly, and several ...",0.241213
1,data_science_trend,Can you tell me more about generative AI?,Generative AI is a fascinating and rapidly evo...,0.197096
2,data_science_trend,can you explain more about how the generative ...,Generative AI is making significant strides in...,0.307088
3,data_science_trend,what is data warehouse?,A data warehouse is a centralized repository t...,0.284913
4,data_science_trend,how is data warehouse and database different? ...,Great questions! Let's break down the differen...,0.510205
...,...,...,...,...
222,largest_adj_product,can you explain the map's callback's parameter,"Certainly! In the `map` function's callback, t...",0.271834
223,jarritos_flavors,Jarritos is available in how many flavors?,Jarritos is a popular brand of Mexican soda th...,0.285120
224,jarritos_flavors,is that including previous flavors?,The list of 13 flavors that I provided include...,0.255854
225,jarritos_flavors,i think there were 15 flavors,I apologize for the mistake in my previous res...,0.116042


In [69]:
tfidf_df['Tfidf Score'].mean()

0.1924957573369176

In [68]:
count_df = calculate_relevance_scores(df, 'Message', 'Count', 'Chat_ID')

count_df

Unnamed: 0,Chat ID,Prompt,Response,Count Score
0,data_science_trend,What is the latest trend in data science?,"Data science is evolving rapidly, and several ...",0.371014
1,data_science_trend,Can you tell me more about generative AI?,Generative AI is a fascinating and rapidly evo...,0.291633
2,data_science_trend,can you explain more about how the generative ...,Generative AI is making significant strides in...,0.427442
3,data_science_trend,what is data warehouse?,A data warehouse is a centralized repository t...,0.391759
4,data_science_trend,how is data warehouse and database different? ...,Great questions! Let's break down the differen...,0.601456
...,...,...,...,...
222,largest_adj_product,can you explain the map's callback's parameter,"Certainly! In the `map` function's callback, t...",0.367669
223,jarritos_flavors,Jarritos is available in how many flavors?,Jarritos is a popular brand of Mexican soda th...,0.422944
224,jarritos_flavors,is that including previous flavors?,The list of 13 flavors that I provided include...,0.389249
225,jarritos_flavors,i think there were 15 flavors,I apologize for the mistake in my previous res...,0.200000


In [70]:
count_df['Count Score'].mean()

0.28142683226847

In [63]:
prompt = myth.iloc[4]['Message']
response = myth.iloc[5]['Message']

# Vectorize the text
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([prompt, response])

# Calculate cosine similarity
similarity = cosine_similarity(vectors[0], vectors[1])
print("Relevance Score:", similarity[0][0])


Relevance Score: 0.301656654885515


In [64]:
prompt = question_cleaned
response = response_cleaned

# Vectorize the text
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([prompt, response])

# Calculate cosine similarity
similarity = cosine_similarity(vectors[0], vectors[1])
print("Relevance Score:", similarity[0][0])

Relevance Score: 0.24003883165382472


#### CountVec

In [16]:
prompt = myth.iloc[4]['Message']
response = myth.iloc[5]['Message']

# Vectorize the text
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([prompt, response])

# Calculate cosine similarity
similarity = cosine_similarity(vectors[0], vectors[1])
print("Relevance Score:", similarity[0][0])

Relevance Score: 0.4274951405113915


In [17]:
prompt = question_cleaned
response = response_cleaned

# Vectorize the text
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([prompt, response])

# Calculate cosine similarity
similarity = cosine_similarity(vectors[0], vectors[1])
print("Relevance Score:", similarity[0][0])

Relevance Score: 0.35233633572311335


### Note:

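- CountVectorizer produces higher relevance scores than the TF-IDF vectorizer (mean 0.281 vs. 0.192 over all prompt/response pairs).
- TF-IDF is commonly used for document similarity, so why is its score lower?
    - Is it because the question is short and the answer is much longer?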
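
One plausible contributor to the lower TF-IDF scores is that the vectorizer is refit on only the two texts being compared. With scikit-learn's documented default smoothing, `idf = ln((1 + n) / (1 + df)) + 1`, a term that appears in both texts gets weight 1, while a term that appears in only one gets about 1.41. The dot product over shared terms is unchanged, but both vector norms grow, so the cosine can only stay equal or drop. The sketch below reproduces that arithmetic in plain Python; whitespace tokenization is a simplification of the real tokenizer, and the function names and example texts are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def count_and_tfidf_similarity(prompt, response):
    """Return (count cosine, TF-IDF cosine) with IDF fit on just these two texts."""
    docs = [Counter(prompt.lower().split()), Counter(response.lower().split())]
    n_docs = len(docs)
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in docs))) + 1
        for term in vocab
    }
    tfidf = [{term: count * idf[term] for term, count in doc.items()} for doc in docs]
    return cosine(docs[0], docs[1]), cosine(tfidf[0], tfidf[1])


count_sim, tfidf_sim = count_and_tfidf_similarity(
    "what is data warehouse",
    "a data warehouse is a centralized repository that stores data",
)
print(count_sim, tfidf_sim)
```

If this is the cause, fitting the vectorizer once on the whole corpus and then transforming each pair should change the picture, since the IDF would then reflect how rare a term is across all messages rather than within a single pair.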

- [x] Check all the relevance scores by topic.
- [ ] Randomize the question/answer pairing and see whether the relevance score drops drastically; if it does not, then there is a problem.
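
The open to-do item could be sketched as a shuffled baseline: score the true pairs, then score prompts against randomly reassigned responses and compare the means. The version below uses a plain bag-of-words cosine as a stand-in for the vectorizers above; the function names, the number of rounds, the seed, and the toy pairs are all hypothetical.

```python
import math
import random
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity of raw word counts (whitespace tokens, lowercased)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def mean_score(prompts, responses):
    """Mean similarity over prompt/response pairs taken in order."""
    return sum(bow_cosine(p, r) for p, r in zip(prompts, responses)) / len(prompts)


def shuffled_baseline(prompts, responses, n_rounds=100, seed=0):
    """Mean similarity when responses are randomly reassigned to prompts."""
    rng = random.Random(seed)
    round_means = []
    for _ in range(n_rounds):
        shuffled = responses[:]
        rng.shuffle(shuffled)
        round_means.append(mean_score(prompts, shuffled))
    return sum(round_means) / len(round_means)


# Hypothetical toy pairs, not taken from chat_data.csv.
prompts = [
    "what is a data warehouse",
    "who was odysseus",
    "how many flavors does jarritos have",
    "recommend a laptop for gaming",
]
responses = [
    "a data warehouse is a centralized repository",
    "odysseus was a greek hero of the trojan war",
    "jarritos is sold in many fruit flavors",
    "a gaming laptop needs a fast graphics card",
]
paired = mean_score(prompts, responses)
baseline = shuffled_baseline(prompts, responses)
print(paired, baseline)
```

A paired mean that sits well above the shuffled mean suggests the score is picking up prompt-specific content; a small gap would suggest it mostly reflects shared function words or text length.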