<a href="https://colab.research.google.com/github/msyed92/twitter_clone/blob/main/twitter_edit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Twitter Edit Button NLP

## Why an edit button?

The goal of this project is to let users fix spelling and grammatical errors, as well as missing words, in their tweets. The idea is that an edit should be "allowed" as long as the original and edited tweets have the same meaning. A model decides whether the two versions are too dissimilar; if they are, the edit is not published.

## Challenges

Even when a user is only fixing spelling mistakes, a model can score the original and the edited version as having very different meanings, because a misspelling is often a valid word of its own.

For example, if a user tweeted this:

"I am socked to here it"

and meant to tweet this:

"I'm shocked to hear it"

the cosine similarity between the two is only 0.328 with RoBERTa and 0.586 with BERT.
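To make the problem concrete, here is a minimal sketch of a threshold gate applied to the two scores quoted above. The function name `is_edit_allowed` and the cutoff of 0.8 are assumptions for illustration; the project does not specify a threshold.

```python
# Hypothetical cutoff; the project does not specify one.
SIMILARITY_THRESHOLD = 0.8


def is_edit_allowed(similarity: float, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Allow an edit only if the similarity score meets the threshold."""
    return similarity >= threshold


# Scores reported above for the "socked"/"shocked" pair.
scores = {"RoBERTa": 0.328, "BERT": 0.586}
decisions = {name: is_edit_allowed(score) for name, score in scores.items()}
print(decisions)
```

With a cutoff like this, both models would reject an edit that only fixes spelling, which is exactly the failure this section is concerned with.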

One way to approach this challenge is to look through current Twitter data, determine which spelling errors are most common, and add those misspellings as aliases of the intended words.
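The alias idea could be sketched as a normalization step that runs on both tweets before they are embedded. The table `SPELLING_ALIASES` and the helper `apply_aliases` are hypothetical; a real table would be built from observed Twitter data rather than written by hand.

```python
import re

# Hypothetical alias table mapping common misspellings to the intended word.
SPELLING_ALIASES = {
    "socked": "shocked",
    "teh": "the",
    "recieve": "receive",
}


def apply_aliases(text: str, aliases: dict = SPELLING_ALIASES) -> str:
    """Replace each word found in the alias table with its intended spelling."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        # Lookup is case-insensitive; the replacement is the lowercase alias.
        return aliases.get(word.lower(), word)

    return re.sub(r"[A-Za-z']+", replace, text)


print(apply_aliases("I am socked to here it"))
```

Note that "here" is left untouched: it is a valid word, so a context-free lookup table cannot tell that "hear" was intended. Errors of that kind would need a context-aware approach.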

Another option is to take a sample of tweets, determine what edits *could* be made to them, and explore different ways of embedding the words so that the original and edited versions are computed as similar.
Alternatively, ask people to edit a handful (perhaps 5) of their own tweets to get a better, less biased sample of data.
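As one illustration of a representation that is less sensitive to misspellings, the sketch below compares tweets by their character trigram counts instead of by sentence embeddings. This is a swapped-in technique for illustration, not something the notebook implements; `char_ngrams` and `ngram_cosine` are hypothetical helper names.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_cosine(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between the character n-gram count vectors of a and b."""
    vec_a, vec_b = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # text shorter than n characters has no n-grams
    return dot / (norm_a * norm_b)


tweet = "I am socked to here it"
tweet_edit = "I'm shocked to hear it"
unrelated = "The weather is nice today"

sim_typo = ngram_cosine(tweet, tweet_edit)
sim_unrelated = ngram_cosine(tweet, unrelated)
print("typo pair:", sim_typo)
print("unrelated pair:", sim_unrelated)
```

A surface-level score like this captures spelling overlap but not meaning, so it would most plausibly be combined with the embedding similarity rather than replace it.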

### Cosine Similarity Example

## RoBERTa

In [40]:
model = SentenceTransformer('stsb-roberta-large')

In [38]:
tweet = "I am socked to here it"
tweet_edit = "I'm shocked to hear it"

In [39]:
# encode sentences to get their embeddings
embedding1 = model.encode(tweet, convert_to_tensor=True)
embedding2 = model.encode(tweet_edit, convert_to_tensor=True)
# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)
print("Sentence 1:", tweet)
print("Sentence 2:", tweet_edit)
print("Similarity score:", cosine_scores.item())

Sentence 1: I am socked to here it
Sentence 2: I'm shocked to hear it
Similarity score: 0.32786959409713745


## BERT

In [48]:
from sklearn.metrics.pairwise import cosine_similarity

# load the BERT sentence-embedding model
model = SentenceTransformer('bert-base-nli-mean-tokens')
# compare the original tweet against its edit and a paraphrase
sentences = [tweet, tweet_edit, "I'm surprised to hear that"]
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape


(3, 768)

In [49]:
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)

array([[0.58561003, 0.52831066]], dtype=float32)

# Code

## Install libraries and packages

In [2]:
%pip install transformers
%pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer, util
import numpy as np