<a href="https://colab.research.google.com/github/msyed92/twitter_clone/blob/main/twitter_edit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Twitter Edit Button NLP

## Why edit button?

The goal of this project is to let users fix spelling and grammatical errors, as well as missing words, in their tweets. As long as the original and edited tweets have the same meaning, the edit should be "allowed". The model should determine whether the two tweets are too dissimilar, in which case the edit is not published.
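
The allow/reject decision described above can be sketched as a threshold on cosine similarity. This is a minimal illustration, not the project's final rule: the names `cosine`, `is_edit_allowed`, and `SIMILARITY_THRESHOLD`, as well as the value 0.8, are assumptions made for the example, and the inputs stand in for sentence embeddings.

```python
import math
from typing import Sequence

# Hypothetical cut-off; a real value would have to be tuned on labeled edits.
SIMILARITY_THRESHOLD = 0.8


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def is_edit_allowed(original_vec: Sequence[float],
                    edited_vec: Sequence[float],
                    threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Allow the edit only if the two embeddings are similar enough."""
    return cosine(original_vec, edited_vec) >= threshold
```

In practice the two vectors would come from the same embedding model, and the threshold would be chosen from data rather than fixed in advance.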

## Challenges

When a user edits a tweet to fix spelling mistakes, the original and the edited version can have completely different literal meanings, so a similarity model may score them as dissimilar.

For example, if a user tweeted this:

"I'm socked to here this"

and meant to tweet this:

"I'm shocked to hear this"

The cosine similarity between these two tweets is 0.328 with RoBERTa and 0.586 with BERT.

One way to approach this challenge is to look through current Twitter data, determine what the most common spelling errors are, and add those misspellings as aliases of the correct words.
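
A rough sketch of the alias idea: map known misspellings to their canonical spelling before the tweet is embedded. The `ALIASES` table and the `normalize` helper are hypothetical, and the entries are placeholders rather than errors mined from real tweets.

```python
import re

# Hypothetical alias table; real entries would come from observed tweet data.
ALIASES = {
    "teh": "the",
    "recieve": "receive",
    "definately": "definitely",
}

# Words are runs of letters and apostrophes, so "I'm" stays one token.
WORD = re.compile(r"[A-Za-z']+")


def normalize(tweet: str) -> str:
    """Replace known misspellings with their canonical spelling."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        canonical = ALIASES.get(word.lower())
        if canonical is None:
            return word  # unknown word: leave it untouched
        # Keep a leading capital if the original word had one.
        return canonical.capitalize() if word[0].isupper() else canonical

    return WORD.sub(fix, tweet)


print(normalize("Teh package will definately arrive"))
print(normalize("I'm socked to here this"))
```

A table like this only catches misspellings that are not themselves valid words. In the example tweet, "socked" and "here" are both real words, so the second call leaves the tweet unchanged.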

Another option is to take a sample of tweets, determine what edits *could* be made to them, and explore different ways of embedding the words so that a misspelling and its correction are computed as similar.
Alternatively, ask people to edit a few (perhaps 5) of their own tweets to get a better, less biased sample of data.
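
One extra signal that could complement the embedding-based ideas above is plain character-level similarity, which stays high when an edit only changes a few letters. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the helper name `surface_similarity` is an assumption for illustration, and this is a different technique from the sentence embeddings used in the rest of the notebook.

```python
from difflib import SequenceMatcher


def surface_similarity(original: str, edited: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, original.lower(), edited.lower()).ratio()


original = "I'm socked to here this"
edited = "I'm shocked to hear this"
print(round(surface_similarity(original, edited), 3))
```

This signal cannot be used alone: an edit that inserts or removes a single word such as "not" would likely also score high while reversing the meaning.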

# Code

## Install libraries and packages

In [70]:
%pip install transformers
%pip install sentence-transformers



In [71]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

## RoBERTa

In [80]:
model = SentenceTransformer('all-mpnet-base-v2')


## BERT

In [125]:
tweet1 = "I am not shocked to hear that"
tweet2 = "I am shocked to hear that!"
tweet3 = "I'm surprised to hear that"

In [130]:
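# Alternative checkpoint (distilled RoBERTa); uncomment to compare models.
#model = SentenceTransformer('all-distilroberta-v1')
# Note: all-mpnet-base-v2 is an MPNet-based checkpoint.
model = SentenceTransformer('all-mpnet-base-v2')
sentences = [tweet1, tweet2, tweet3]
# encode() returns one embedding vector (row) per input sentence.
sentence_embeddings = model.encode(sentences)
sentence_embeddings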

array([[-0.00088187,  0.04647539,  0.00575618, ...,  0.02157047,
        -0.04420858,  0.01710483],
       [-0.02299958,  0.03953287, -0.01447212, ..., -0.00511732,
        -0.04936558,  0.02625754],
       [-0.0239177 ,  0.0457035 , -0.00964553, ...,  0.02008985,
        -0.02037429,  0.02701481]], dtype=float32)

In [129]:
# Compare tweet 1 against tweets 2 and 3.
scores = cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)[0]

print("Tweet 1 and 2 Similarity Score:", scores[0])
print("Tweet 1 and 3 Similarity Score:", scores[1])

# Compare tweet 2 against tweet 3 directly.
score_2_3 = cosine_similarity(
    [sentence_embeddings[1]],
    [sentence_embeddings[2]]
)[0][0]

print("Tweet 2 and 3 Similarity Score:", score_2_3)

Tweet 1 and 2 Similarity Score: 0.71748745
Tweet 1 and 3 Similarity Score: 0.7110096
Tweet 2 and 3 Similarity Score: 0.7278075
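
Tweet 1 negates tweet 2, while tweet 3 paraphrases tweet 2, yet all three scores in the output above fall within about 0.02 of each other. The sketch below applies a cut-off to those reported scores to show how narrow the margin is. The helper `allowed_pairs` and the threshold 0.72 are hypothetical, chosen only for illustration.

```python
# Scores copied from the printed output, rounded to three decimals.
reported = {
    ("tweet1", "tweet2"): 0.717,
    ("tweet1", "tweet3"): 0.711,
    ("tweet2", "tweet3"): 0.728,
}


def allowed_pairs(scores, threshold):
    """Return the pairs whose similarity meets the threshold, best first."""
    kept = [pair for pair, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: scores[pair], reverse=True)


spread = max(reported.values()) - min(reported.values())
print("Spread between highest and lowest score:", round(spread, 3))
print("Allowed at 0.72 (hypothetical threshold):", allowed_pairs(reported, 0.72))
```

With so little separation, a single fixed threshold on these scores would be fragile, so negation probably needs to be handled separately from overall similarity.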
