# Title:

#### Group Member Names : Mohamad Nomaan Parmar, Pawanpreet



### INTRODUCTION:
*********************************************************************************************************************
#### AIM :  
TweetEval benchmark (Findings of EMNLP 2020)

*********************************************************************************************************************
#### Github Repo:
https://github.com/cardiffnlp/tweeteval

*********************************************************************************************************************
#### DESCRIPTION OF PAPER:
TweetEval consists of seven heterogeneous Twitter tasks, all framed as multi-class tweet classification. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits.

*********************************************************************************************************************
#### PROBLEM STATEMENT :
Developing a Cosine Similarity-based Text Retrieval System for Twitter Data

*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:
Imagine a scenario where a user wants to find tweets similar to a specific query or topic of interest. For instance, a user might want to find tweets related to positive experiences with a certain movie or book. However, sifting through countless tweets manually can be time-consuming and inefficient.

The problem at hand involves developing a text retrieval system tailored for Twitter data. This system will allow users to input a query, and the system will then analyze and rank tweets based on their similarity to the query. The primary objective is to provide the user with a curated list of tweets that are most likely to align with their query.

*********************************************************************************************************************
#### SOLUTION:
The outcome of this project is an operational text retrieval system that can take a user's query, assess the similarity with a set of tweets, and present the top-ranked tweets that closely align with the query. The system's performance will be evaluated using precision, recall, and F1-score metrics to ensure its effectiveness in accurately identifying relevant tweets while minimizing false positives and false negatives.

This context sets the stage for developing a solution that addresses the challenge of efficiently retrieving relevant content from the vast and dynamic world of Twitter data.







# Background
*********************************************************************************************************************


|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|



*********************************************************************************************************************






# Implement paper code :
*********************************************************************************************************************

Usage of TweetEval and Twitter-specific RoBERTa models
In this notebook we show how to perform tasks such as masked language modeling, computing tweet similarity, or tweet classification using our Twitter-specific RoBERTa models.

Paper: TweetEval benchmark (Findings of EMNLP 2020)
Authors: Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke and Leonardo Neves.
Github: https://github.com/cardiffnlp/tweeteval



Preliminaries
We define a function to normalize a tweet to the format we used for TweetEval. Note that preprocessing is minimal (replacing user names by @user and links by http).

In [1]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
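
As a quick check of the normalization, the sketch below repeats the function so that it runs on its own and applies it to a made-up tweet. The handle and link in the example are hypothetical and only serve as illustration.

```python
def preprocess(text):
    """Replace user handles with '@user' and links with 'http'."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Hypothetical tweet, used only for illustration.
example = "@alice loved it! https://t.co/abc123 @"
normalized = preprocess(example)
print(normalized)  # @user loved it! http @
```

A lone "@" is left unchanged because of the len(t) > 1 check, and punctuation such as "!" is kept, which matches the minimal preprocessing described above.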

We only need to install one dependency: the transformers library.

In [2]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

Computing Tweet Similarity

In [3]:
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import numpy as np
from scipy.spatial.distance import cosine
from collections import defaultdict

MODEL = "cardiffnlp/twitter-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def get_embedding(text):
  text = preprocess(text)
  encoded_input = tokenizer(text, return_tensors='pt')
  features = model(**encoded_input)
  features = features[0].detach().cpu().numpy()
  features_mean = np.mean(features[0], axis=0)
  return features_mean

query = "The book was awesome"

tweets = ["This is an interesting topic of discussion",
    "Looking forward to the weekend",
    "Enjoyed the concert last night",
    "The weather is perfect for a picnic",
    "Learning about machine learning techniques"]

d = defaultdict(int)
for tweet in tweets:
  sim = 1-cosine(get_embedding(query),get_embedding(tweet))
  d[tweet] = sim

print('Most similar to: ',query)
print('----------------------------------------')
for idx,x in enumerate(sorted(d.items(), key=lambda x:x[1], reverse=True)):
  print(idx+1,x[0])


Most similar to:  The book was awesome
----------------------------------------
1 Enjoyed the concert last night
2 Looking forward to the weekend
3 The weather is perfect for a picnic
4 This is an interesting topic of discussion
5 Learning about machine learning techniques
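
The loop above embeds the query again for every tweet. One possible alternative, sketched here under the assumption that the embeddings have already been computed and stacked into a matrix, is to score all tweets in a single vectorized step. The function name rank_by_cosine and the 3-dimensional toy vectors are illustrative stand-ins for the 768-dimensional model outputs.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return (indices sorted by descending cosine similarity, similarities)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                 # cosine similarity of each row with the query
    order = np.argsort(-sims)    # most similar first
    return order, sims

# Toy vectors (assumption): stand-ins for real tweet embeddings.
query_vec = np.array([1.0, 0.0, 0.0])
doc_vecs = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.0, 0.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # 45 degrees from the query
])

order, sims = rank_by_cosine(query_vec, doc_vecs)
print(order.tolist())
```

Because both sides are normalized to unit length, the matrix product yields the same value as 1 - cosine(...) from scipy for each pair, while the query only needs to be embedded once.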


Feature Extraction

In [4]:
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
text = "Good night 😊"
text = preprocess(text)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Pytorch
encoded_input = tokenizer(text, return_tensors='pt')
model = AutoModel.from_pretrained(MODEL)
features = model(**encoded_input)
features = features[0].detach().cpu().numpy()
features_mean = np.mean(features[0], axis=0)
#features_max = np.max(features[0], axis=0)

# # Tensorflow
# encoded_input = tokenizer(text, return_tensors='tf')
# model = TFAutoModel.from_pretrained(MODEL)
# features = model(encoded_input)
# features = features[0].numpy()
# features_mean = np.mean(features[0], axis=0)
# #features_max = np.max(features[0], axis=0)

features_mean.shape

(768,)
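
To make the difference between the mean-pooled features and the commented-out max-pooled variant concrete, here is a small sketch on a toy matrix. The 3 x 4 array is an assumption standing in for the (tokens x 768) output of the model.

```python
import numpy as np

# Toy token matrix (assumption): 3 tokens, 4 hidden dimensions.
token_features = np.array([
    [1.0, 0.0, -2.0, 3.0],
    [3.0, 0.0,  4.0, 3.0],
    [2.0, 6.0,  1.0, 3.0],
])

# Mean-pooling averages each dimension over all tokens.
features_mean = token_features.mean(axis=0)
# Max-pooling keeps the largest value of each dimension.
features_max = token_features.max(axis=0)

print(features_mean)
print(features_max)
```

Here the mean is [2, 2, 1, 3] and the maximum is [3, 6, 4, 3]. Both reduce a variable number of tokens to one fixed-size vector; max-pooling lets a single token dominate a dimension, while mean-pooling smooths over the whole tweet.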

Masked language modeling
Use Twitter-RoBERTa-base to predict words in context using the fill-mask pipeline in transformers.

In [5]:
from transformers import pipeline, AutoTokenizer
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def print_candidates(candidates):
    # Print the top five predicted tokens together with their scores
    for i in range(5):
        token = tokenizer.decode(candidates[i]['token'])
        score = np.round(candidates[i]['score'], 4)
        print(f"{i+1}) {token} {score}")

texts = [
 "I am so <mask> 😊",
 "I am so <mask> 😢"
]
for text in texts:
    t = preprocess(text)
    print(f"{'-'*30}\n{t}")
    candidates = fill_mask(t)
    print_candidates(candidates)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


------------------------------
I am so <mask> 😊
1)  happy 0.402
2)  excited 0.1441
3)  proud 0.143
4)  grateful 0.0669
5)  blessed 0.0334
------------------------------
I am so <mask> 😢
1)  sad 0.2641
2)  sorry 0.1605
3)  tired 0.138
4)  sick 0.0278
5)  hungry 0.0232


Use TweetEval Classifiers
We currently provide the following fine-tuned models for different tweet classification tasks:

emoji prediction (emoji)
emotion detection (emotion)
hate speech detection (hate)
irony detection (irony)
offensive language identification (offensive)
sentiment analysis (sentiment)
(coming soon) stance detection (stance) with 5 targets (abortion, atheism, climate, feminist, hillary), for example: stance-abortion

In [6]:
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

task='emotion'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [7]:
# download label mapping
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]
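
The parsing step can be tried without a network connection. The sketch below assumes that mapping.txt contains one tab-separated "id, label" pair per line; the string is an illustrative stand-in for the downloaded file, using the four emotion labels that appear in the output further down.

```python
import csv

# Assumed file contents: "<id>\t<label>" per line, with a trailing newline.
mapping_text = "0\tanger\n1\tjoy\n2\toptimism\n3\tsadness\n"

rows = csv.reader(mapping_text.split("\n"), delimiter="\t")
# Skip empty rows such as the one produced by the trailing newline.
labels = [row[1] for row in rows if len(row) > 1]
print(labels)
```

The len(row) > 1 filter is what keeps the empty last line from raising an IndexError.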

In [8]:
# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)

# text = "Good night 😊"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

In [9]:
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

1) joy 0.9061
2) optimism 0.0407
3) sadness 0.0406
4) anger 0.0126
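
The last two cells turn raw classifier outputs into a ranked list of labels. The following sketch reproduces that step with NumPy only; the logits are made up for illustration and are not the model's actual outputs.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

labels = ["anger", "joy", "optimism", "sadness"]
logits = np.array([-1.0, 3.0, 0.5, 0.0])  # made-up values (assumption)

scores = softmax(logits)
ranking = np.argsort(scores)[::-1]  # indices from highest to lowest score

for i, idx in enumerate(ranking):
    print(f"{i+1}) {labels[idx]} {scores[idx]:.4f}")
```

Softmax preserves the order of the logits, so the ranking could also be read from the logits directly; the conversion matters only when the scores should be interpreted as probabilities that sum to one.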


*********************************************************************************************************************
### Contribution  Code :
* Text Preprocessing:

Lowercase conversion (text = text.lower()) and special-character removal with re.sub standardize the text before it is embedded. This reduces the impact of case and punctuation variations on the similarity calculation, but it also removes emojis from the tweet.

* Text Embedding:

Using max-pooling aggregation (features_max = np.max(features[0], axis=0)) for obtaining text embeddings can capture important features in the text, potentially leading to more meaningful embeddings.


* Cosine Similarity Threshold:

The cosine_similarity_threshold turns each similarity score into a binary decision: a tweet counts as relevant when its cosine similarity to the query is at least the threshold. Raising the threshold retrieves fewer, more similar tweets, and lowering it retrieves more. Precision, recall, and F1-score are then calculated over all tweets by comparing these decisions with the labels in ground_truth.

* Adjusted Fill Mask Pipeline:

The update to the masked phrases in the texts list provides different input variations for the "Fill Mask Pipeline" section. This enables you to explore how well the model can predict the missing tokens for different masked phrases.

* The code below implements all of the changes described above:


In [30]:
import re
import numpy as np
from scipy.spatial.distance import cosine
from collections import defaultdict
from transformers import AutoTokenizer, AutoModel

def preprocess(text):
    text = text.lower()  # Convert to lowercase
    new_text = []
    for t in text.split(" "):
        # Normalize handles and links before stripping punctuation;
        # otherwise the '@' is removed first and the handle check never matches
        if t.startswith('@') and len(t) > 1:
            t = '@user'
        elif t.startswith('http'):
            t = 'http'
        else:
            t = re.sub(r'[^\w\s]', '', t)  # Remove special characters
        new_text.append(t)
    return " ".join(new_text)

MODEL = "cardiffnlp/twitter-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

# Experiment with different threshold values
cosine_similarity_threshold = 0.5  # Adjust as needed

def get_embedding(text):
    text = preprocess(text)
    print("Preprocessed Text:", text)  # Print preprocessed text
    encoded_input = tokenizer(text, return_tensors='pt')
    features = model(**encoded_input)
    features = features[0].detach().cpu().numpy()
    features_max = np.max(features[0], axis=0)  # Use max-pooling instead of mean
    return features_max

query = "The book was awesome"

tweets = [
    "This is an interesting topic of discussion",
    "Looking forward to the weekend",
    "Enjoyed the concert last night",
    "The weather is perfect for a picnic",
    "Learning about machine learning techniques"
]
# Binary relevance label for each tweet, in the same order as tweets
ground_truth = [1, 0, 0, 0, 1]

In [31]:
similarities = []  # One score per tweet, in the same order as ground_truth
cosine_similarity_threshold = 0.5

for tweet in tweets:
    sim = 1 - cosine(get_embedding(query), get_embedding(tweet))
    similarities.append(sim)
    print(f"Similarity between query and '{tweet}': {sim}")  # Print similarity score

# A tweet is predicted relevant when its similarity reaches the threshold
predictions = [1 if sim >= cosine_similarity_threshold else 0 for sim in similarities]

true_positive = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
false_positive = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 0)
false_negative = sum(1 for p, g in zip(predictions, ground_truth) if p == 0 and g == 1)

# Avoid division by zero
precision = true_positive / (true_positive + false_positive + 1e-10)
recall = true_positive / (true_positive + false_negative + 1e-10)
f1 = 2 * (precision * recall) / (precision + recall + 1e-10)

print("Cosine Similarity Metrics:")
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

Preprocessed Text: the book was awesome
Preprocessed Text: this is an interesting topic of discussion
Similarity between query and 'This is an interesting topic of discussion': 0.9009774327278137
Preprocessed Text: the book was awesome
Preprocessed Text: looking forward to the weekend
Similarity between query and 'Looking forward to the weekend': 0.8953052163124084
Preprocessed Text: the book was awesome
Preprocessed Text: enjoyed the concert last night
Similarity between query and 'Enjoyed the concert last night': 0.911909282207489
Preprocessed Text: the book was awesome
Preprocessed Text: the weather is perfect for a picnic
Similarity between query and 'The weather is perfect for a picnic': 0.8937946557998657
Preprocessed Text: the book was awesome
Preprocessed Text: learning about machine learning techniques
Similarity between query and 'Learning about machine learning techniques': 0.8760306239128113
Cosine Similarity Metrics:
Precision: 0.399999999992
Recall: 0.99999999995
F1-Score

### Results :
*******************************************************************************************************************************
- The code preprocesses text by converting it to lowercase and removing special characters, which standardizes the input before embedding.
- The get_embedding function calculates embeddings using max-pooling instead of mean-pooling. This might capture more salient features in some cases.
- The code calculates the cosine similarity between the query and each tweet and prints the similarity scores for observation.
- The precision, recall, and F1-score are calculated using the binary labels provided in the ground_truth list. With the threshold at 0.5, precision is about 0.40 and recall is about 1.00.
- The contribution code keeps preprocessing, embedding, and evaluation in separate steps, which makes it easier to follow.
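
A short sketch of what the special-character pattern does to a whole tweet may help when reading these results. The sample tweet and the helper name strip_special are made up for illustration; the pattern itself is the one used in the contribution code.

```python
import re

def strip_special(text):
    """Lowercase the text and drop every character that is not a word character or whitespace."""
    return re.sub(r'[^\w\s]', '', text.lower())

sample = "Good night 😊 @bob!"  # hypothetical tweet
stripped = strip_special(sample)
print(repr(stripped))  # 'good night  bob'
```

The emoji, the "@" and the "!" are all removed, and the two spaces around the emoji are left behind. A check such as t.startswith('@') therefore only works if it runs before this pattern is applied, and any signal carried by emojis is lost before the text reaches the model.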

#### Observations :
*******************************************************************************************************************************
- The printed similarity scores all fall in a narrow range (about 0.88 to 0.91), even for tweets unrelated to the query, so cosine similarity over the max-pooled embeddings separates the tweets only weakly.
- With the threshold at 0.5, every tweet is classified as relevant. This gives a recall of about 1.0 and a precision of about 0.4, because only two of the five tweets are labeled relevant in ground_truth.
- The similarity threshold (cosine_similarity_threshold) can be adjusted to fine-tune the trade-off between precision and recall. For these scores, only thresholds between roughly 0.88 and 0.91 change the predictions.
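
The effect of the threshold can be explored without rerunning the model. The sketch below reuses the similarity scores printed above (rounded to four decimals) and the notebook's ground_truth labels; the candidate thresholds and the helper name evaluate are illustrative choices.

```python
# Similarity scores from the printed output, rounded to four decimals.
sims = [0.9010, 0.8953, 0.9119, 0.8938, 0.8760]
ground_truth = [1, 0, 0, 0, 1]

def evaluate(threshold):
    """Return (precision, recall) when tweets with sim >= threshold count as relevant."""
    preds = [1 if s >= threshold else 0 for s in sims]
    tp = sum(p == 1 and g == 1 for p, g in zip(preds, ground_truth))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, ground_truth))
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, ground_truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Candidate thresholds chosen for illustration.
for threshold in (0.5, 0.88, 0.90, 0.92):
    p, r = evaluate(threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

At 0.5 all five tweets pass (precision 0.40, recall 1.00), at 0.90 two tweets pass and one of them is relevant (precision 0.50, recall 0.50), and at 0.92 nothing is retrieved. With only five tweets these numbers are anecdotal; they mainly show that the useful threshold range is very narrow for this model and pooling choice.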

### Conclusion:
In this project, we successfully developed a Cosine Similarity-based Text Retrieval System for Twitter data. The system aimed to address the challenge of efficiently finding relevant tweets that match a user's query within the vast volume of textual content on Twitter. By leveraging pre-trained transformer models and cosine similarity, we were able to provide users with a curated list of tweets that closely aligned with their queries.

The key components of the system included text preprocessing, text embedding using transformer models, cosine similarity calculation, and threshold-based classification. We evaluated the system's performance using precision, recall, and F1-score metrics, which provided insights into its ability to accurately classify similar and dissimilar tweets.

---



*******************************************************************************************************************************
#### Learnings :
Throughout the process of developing the Cosine Similarity-based Text Retrieval System for Twitter data, several important learnings and takeaways emerged:

* Text Preprocessing Matters: The quality of text embeddings and similarity calculations depends on text preprocessing. Cleaning and standardizing text through techniques like lowercase conversion, special character removal, and tokenization is expected to improve the accuracy of the system, although this effect was not measured separately in this project.

* Importance of Embeddings: Pre-trained transformer models provide powerful text representations through embeddings. These embeddings capture semantic meaning, enabling accurate similarity measurements and better matching of user queries with relevant content.

* Understanding Cosine Similarity: Cosine similarity is a useful metric for quantifying the similarity between vectors. A lower cosine similarity score indicates greater dissimilarity, while a higher score suggests more similarity. Adjusting the similarity threshold allows us to fine-tune the system's precision-recall trade-off.
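
As a minimal illustration of the metric itself, the sketch below computes cosine similarity from its definition, the dot product divided by the product of the vector lengths. The vectors are toy values chosen for this example. Note that scipy.spatial.distance.cosine returns the cosine distance, which is why the notebook uses 1 - cosine(...) to obtain a similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * a                  # scaled copy of a
orthogonal = np.array([3.0, 0.0, -1.0])   # dot product with a is zero
opposite = -a

print(cosine_similarity(a, same_direction))
print(cosine_similarity(a, orthogonal))
print(cosine_similarity(a, opposite))
```

Up to floating-point rounding, the three values are 1, 0 and -1. The score depends only on direction, not on vector length, which is one reason it is commonly used to compare embeddings of texts with different numbers of tokens.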

*******************************************************************************************************************************
#### Results Discussion :

The precision, recall, and F1-score metrics provide valuable insights into the system's performance in classifying tweets as relevant (similar) or irrelevant (dissimilar) based on the query. These metrics measure different aspects of the system's effectiveness:

* Precision: Precision is the ratio of correctly retrieved relevant tweets to the total number of tweets retrieved as relevant. It indicates how well the system avoids false positives. A higher precision indicates that when the system claims a tweet is relevant, it is more likely to be correct.

* Recall: Recall is the ratio of correctly retrieved relevant tweets to the total number of relevant tweets in the dataset. It measures the system's ability to identify all relevant tweets. A higher recall indicates that the system is effective at capturing most of the relevant tweets.

* F1-Score: The F1-score is the harmonic mean of precision and recall. It provides a balanced measure that considers both false positives and false negatives. The F1-score is particularly useful when the balance between precision and recall is important.
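
The three definitions can be checked against the run reported above. At the 0.5 threshold all five tweets are retrieved and two of them are labeled relevant, which gives 2 true positives, 3 false positives and 0 false negatives. The helper name precision_recall_f1 is an illustrative choice, not part of the notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts, returning 0.0 for empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts derived from the printed similarities and ground_truth at threshold 0.5.
precision, recall, f1 = precision_recall_f1(tp=2, fp=3, fn=0)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.4000 1.0000 0.5714
```

Checking for an empty denominator explicitly avoids the small bias that adding 1e-10 introduces, which is why the notebook prints 0.399999999992 rather than 0.4.
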
*******************************************************************************************************************************
#### Limitations :
* Semantic Gap: Inability to capture nuanced meanings, context, and sarcasm in tweets due to reliance on embeddings and cosine similarity.

* Preprocessing Complexity: Handling noise, misspellings, abbreviations, and emojis in tweets can pose challenges for accurate preprocessing.

* Model Adaptation: Although twitter-roberta-base is pre-trained on tweets, it is not fine-tuned for similarity or retrieval, which can limit how well the pooled embeddings separate relevant from irrelevant tweets.

* Threshold Dependency: Choosing the right similarity threshold involves a trade-off between recall and precision and requires fine-tuning.

* Media Exclusion: Limited to text-only content, disregarding images, videos, and multi-tweet conversations.


*******************************************************************************************************************************

