# Tokenizing Word Embeddings Using Transformers

Transformers are a type of deep learning architecture used in Natural Language Processing that processes the representations of words using a mechanism called self-attention. Self-attention layers let the model represent mathematically how strongly each word in a sequence relates to every other word, using weights learned during training. Pre-trained transformers are self-attention models that have already been trained on large datasets and had their learned weights saved, so that other researchers can reuse them or fine-tune them for new tasks. 
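
To make self-attention less abstract, here is a minimal sketch in plain Python. It is a simplification, not how any particular library implements it: the toy vectors are made up, and the learned query, key and value weight matrices that a real transformer applies are left out.

```python
import math

def softmax(row):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries, keys and values are the inputs themselves.

    Assumption: a real transformer first multiplies the inputs by learned
    weight matrices; those are omitted here for brevity.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Scaled dot product of this word with every word in the sequence
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: weighted average of all word vectors
        mixed = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-dimensional "word" vectors (hypothetical values)
toy_words = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attended, attention_weights = self_attention(toy_words)
```

Each row of `attention_weights` sums to 1, and the first toy word places more weight on the second word (which points in a similar direction) than on the third.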

Because they scale well, pre-trained transformers have been the engines behind many of the most popular LLMs in the industry. In this notebook, we will explore creating embeddings using a pre-trained model from the "sentence-transformers" library, whose models are hosted on the Hugging Face website. The model is open source and produces one embedding for each sentence (here, each tweet) in your dataset. 

Learn more about the **sentence-transformers** here: https://huggingface.co/sentence-transformers.

In [None]:
import pandas as pd
tweet_df = pd.read_csv("Tweets.csv") # load the tweets data from the Tweets.csv file

In [None]:
tweet_df

In [None]:
!pip install sentence-transformers

## Using Sentence Transformers

Once you have installed the sentence-transformers package, you can select which flavor of the transformer you would like to use to create your word embeddings. 

The Sentence Transformers package contains 124 pre-trained models that you can choose from, ranging from small to very large. You can use BERT, DistilBERT (a smaller version of BERT), a Facebook question-answering model and many other types of transformers by passing the model name to the SentenceTransformer() call below. In this notebook, we use a MiniLM model to create our embeddings. 

### Creating Embeddings
The Sentence Transformers package makes it easy for us to create word embeddings using our tweets. We will encode our tweets and save the embeddings to columns in our dataset that make it easy for us to visualize the text alongside the tweet sentiment and the selected text. 

We will create embeddings for both the selected text and the full text of our tweets. 

In [None]:
from sentence_transformers import SentenceTransformer, util
sentxformer = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2') # 'sentence-transformers/all-mpnet-base-v2'

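# encode() expects strings, so missing values are replaced with '' for encoding only;
# the original text columns keep their missing values for the analysis further down
tweet_df['text_vec'] = sentxformer.encode(tweet_df['text'].fillna('').tolist()).tolist()
tweet_df['select_vec'] = sentxformer.encode(tweet_df['selected_text'].fillna('').tolist()).tolist()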

In [None]:
tweet_df

## Examine Missing Data and Imputation

One thing we can notice from our dataset is that we have some missing data. We can use the **missingno** package to see whether excluding the missing data, or imputing it with something else, would make a material difference.

To impute data is to replace the missing data with something else. This can be the most frequently used observation in the dataset; the mean, median or mode if you are working with numerical data; or some other observation that you decide for yourself. 
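
The imputation strategies described above can be sketched in a few lines of plain Python. The column values and the helper names `observed` and `fill_missing` are made up for illustration; with a pandas DataFrame the same idea is usually expressed with `fillna`.

```python
from statistics import median, mode

def observed(values):
    """Return only the values that are present (not None)."""
    return [v for v in values if v is not None]

def fill_missing(values, fill):
    """Replace every None entry with the chosen fill value."""
    return [fill if v is None else v for v in values]

# Made-up columns with one missing value each (hypothetical data)
sentiments = ["positive", None, "neutral", "positive"]
likes = [3, None, 10, 5]

# Categorical data: impute with the most frequent observation
filled_sentiments = fill_missing(sentiments, mode(observed(sentiments)))

# Numerical data: impute with the median of the observed values
filled_likes = fill_missing(likes, median(observed(likes)))

# Or impute with a placeholder that you decide for yourself
flagged_sentiments = fill_missing(sentiments, "None")
```

A placeholder such as `"None"` keeps the fact that a value was missing visible, whereas the mode or median blends the imputed rows in with the observed ones.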

In [None]:
!pip install missingno

In [None]:
import pandas as pd
import missingno as msno

tweet_df.isna().sum() # show how many missing observations are in each column

We see here that there are not that many missing observations in this data. Let's see if there is a relationship between an observation being missing in one column and another or if missingness is random. 

### Heatmaps

Heatmaps are a way to visualize the correlation between features in a dataset. In this case, we can use a heatmap to visualize the nullity correlation between features with missing observations. Is there a relationship between a missing observation in the different columns of our dataset?

Let's see.

In [None]:
#visualize correlations of nullity between features using heatmap
msno.heatmap(tweet_df)

This shows that there is a relationship between selected text missing an observation and sentiment not being defined in the dataset. It also appears that if text is missing, then selected text will be missing as well. This makes sense. 
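
To see what a nullity correlation measures, the sketch below computes it by hand for made-up columns: each value is turned into a 0/1 "is missing" flag, and the Pearson correlation of those flags is taken. The column contents and the function name `nullity_correlation` are assumptions for illustration, not part of **missingno**.

```python
import math

def nullity_correlation(col_a, col_b):
    """Pearson correlation between the 'is missing' flags of two columns.

    Assumes each column has at least one missing and one present value;
    otherwise the variance is zero and the correlation is undefined.
    """
    a = [1.0 if v is None else 0.0 for v in col_a]
    b = [1.0 if v is None else 0.0 for v in col_b]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up columns (hypothetical data): row 1 is missing in both text columns
text = ["a", None, "c", "d", "e"]
selected_text = ["a", None, "c", None, "e"]
sentiment = ["pos", "neg", None, "neu", "pos"]

r_text_selected = nullity_correlation(text, selected_text)
r_text_sentiment = nullity_correlation(text, sentiment)
```

A value near 1 means two columns tend to be missing in the same rows, a value near -1 means one tends to be missing where the other is present, and a value near 0 suggests their missingness is unrelated.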

1. Should you: move on with your data as is, drop the missing data or impute the data with the most frequent observation?

2. What are some considerations that you may make when deciding if you should move on with your data as is, drop the missing data or impute the data with the most frequent observation?

3. If the number of missing observations were much greater (over 20% of the data) and missingness was random (less than 10% correlation across columns) would you make the same choice as you did in question 1? Why or why not?

For Instructors: 

Consider that students may have different answers and that is okay. Make sure that they walk through their reasoning. Depending on the dataset, it may be appropriate to take different approaches. Get them to think about what changes in the dataset may make them make a different choice. 

### Dendrograms 
Another way to see if there is a relationship between missing data in a dataset is with a dendrogram. Dendrograms are tree graphs that group features according to how closely their patterns of missing values are correlated. Both heatmaps and dendrograms can be used to examine the relationship between null values in our dataset using the **missingno** package. 

In [None]:
#visualize nullity of features using tree graph
msno.dendrogram(tweet_df)

### Minor Imputation

Because the number of missing values in this dataset is small, we can impute the missing observation with the string "None".

This can be a good way for us to fill those missing values while still reflecting that these values were once empty. 

In [None]:
# very few values are missing, so a simple placeholder string is enough
tweet_df.fillna("None", inplace = True)

In [None]:
tweet_df

### Comparing Similarity 

In the dataset, you can see that the selected text is an excerpt taken from the full text of each Tweet. The sentiment of each Tweet is derived from the selected text. 

Considering that the selected text should merely be an extraction of key phrases from the full text, we would expect the embeddings of both features to overlap. We can examine whether this is the case by computing cosine similarity scores between the two features and visualizing the output using a dimensionality reduction technique called **t-SNE**. 
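
Cosine similarity compares the direction of two vectors and ignores their length. A minimal sketch with made-up 3-dimensional vectors (real sentence embeddings have hundreds of dimensions) shows the idea; the vector values and names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors (hypothetical values)
full_text_vec = [0.2, 0.8, 0.1]
selected_vec = [0.4, 1.6, 0.2]    # same direction as full_text_vec, twice as long
unrelated_vec = [0.8, -0.2, 0.0]  # perpendicular to full_text_vec

same_direction = cosine_similarity(full_text_vec, selected_vec)
perpendicular = cosine_similarity(full_text_vec, unrelated_vec)
```

Vectors that point the same way score close to 1 and perpendicular vectors score close to 0, so a high score between a tweet's full-text and selected-text embeddings suggests the excerpt preserves the meaning of the whole tweet.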



In [None]:
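import numpy as np

# Stack the per-row embedding lists into 2-D arrays (one row per tweet)
text = np.array(tweet_df['text_vec'].tolist(), dtype=np.float32)
sel = np.array(tweet_df['select_vec'].tolist(), dtype=np.float32)

In [None]:
# cos_sim compares every full-text embedding with every selected-text embedding,
# so the result is an N x N matrix: entry [i, j] is text i vs. selected text j
scores = util.cos_sim(text, sel)
scores.numpy()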

In [None]:
from sklearn.manifold import TSNE

# Project each row of the similarity matrix down to 2 dimensions
tsne = TSNE(perplexity=40, n_components=2, verbose=1)
scores_tsne = tsne.fit_transform(scores.numpy())
scores_tsne

In [None]:
import matplotlib.pyplot as plt

mid = int(len(scores)/2)

fig, ax = plt.subplots(figsize=(15,8))
# t-SNE returned 2 components, so each point has only an x and a y coordinate
# first half of the rows in red, second half in green
ax.scatter(scores_tsne[:mid, 0], scores_tsne[:mid, 1], c='r')
ax.scatter(scores_tsne[mid:, 0], scores_tsne[mid:, 1], c='g')

## Let's Think

What does this output show us about the two features, *selected text* and *text*? 

Are they similar to each other according to this plot?

What are some ways that having a feature like *selected text* could be beneficial to building models in the future? 

What are some ways that bias can creep into these types of truncated features?

For Instructors:

This is a significant outcome because it shows that selected text is a good proxy for the full text. In the case of very large datasets, extracting selected text that is highly similar to the full text is a great way to reduce training time and overcome size restrictions in Transformer pipelines.

Make sure students explore the ways that extracting text can lead to bias, even if the embeddings appear to overlap enough to be similar. 