# Unsupervised Classification with Embeddings - Answers
In this lab, we will work on the following use case: you run your company's IT and want to make sure that your employees are not signing up for unauthorized services and creating shadow IT. To do so, we will build a model to identify account verification emails sent to company email addresses.

## Finding SaaS Applications
The basic approach is to compare the subject of each email with the subject lines typical of common account operations. This initial approach starts with email verification emails, but could be expanded to other common SaaS operations such as 2FA, account creation, etc.

First, let's set up the imports and read in the data. As with other labs, this lab uses the `DATA_PATH` variable, so if you moved the data anywhere outside of the repository, you will need to modify `DATA_PATH`. The dataset for this exercise is the file `clean_email.csv`, which contains approximately 140k email metadata events.

In [2]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Use the Apple Silicon GPU backend (MPS) when it is available; otherwise fall back to the CPU.
# If you have a CUDA GPU, you can use torch.device("cuda") instead.
import torch
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")


DATA_PATH = '../data'

# You'll need to define a verification message which we will use to compare all the other subjects to.
VERIFICATION_TEXT = "please verify your email"


### The Data
Let's read in the data and do some EDA.

In [3]:
email_data = pd.read_csv(f"{DATA_PATH}/clean_email.csv")

In [4]:
email_data.sample(5)

Unnamed: 0,date,to_email,to_email_account,to_email_complete_domain,from_email,from_name,from_email_account,from_email_complete_domain,from_email_domain,from_email_subdomain,from_email_suffix,subject
39133,2023-07-10 16:22:08,user1@gmail.com,user1,gmail.com,newsletters-noreply@linkedin.com,Tomasz Tunguz via LinkedIn,newsletters-noreply,linkedin.com,linkedin,,com,Tomasz Tunguz just published an article
9429,2022-06-14 23:10:38,user1@gmail.com,user1,gmail.com,invitations@linkedin.com,Arnaud Chiffert via LinkedIn,invitations,linkedin.com,linkedin,,com,", start a conversation with your new connectio..."
132060,2017-12-20 14:15:19,user1@gmail.com,user1,gmail.com,newsletter@zoomcar.com,Zoomcar,newsletter,zoomcar.com,zoomcar,,com,Zoomcar Now Available At Ahmedabad Airport!
62236,2022-12-12 02:59:50,user1@gmail.com,user1,gmail.com,calendar-notification@google.com,Google Calendar,calendar-notification,google.com,google,,com,"Notification: sync @ Sun Dec 11, 2022 9:30pm -..."
36447,2018-05-19 19:07:32,user1@gmail.com,user1,gmail.com,noreply@r.groupon.com,Groupon,noreply,r.groupon.com,groupon,r,com,These Deals are Good Luck


You should do some exploratory data analysis here to make sure that all the fields you need are populated.
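
One way to approach this check, as a minimal sketch: count the missing values in each column, then count subjects that are missing or contain only whitespace. The small `sample` frame below is made-up stand-in data (not rows from `clean_email.csv`) so that the snippet runs on its own; in the lab you would call the same methods on `email_data`.

```python
import pandas as pd

# Hypothetical stand-in rows; in the lab, use email_data instead.
sample = pd.DataFrame({
    "from_email_complete_domain": ["example.com", "mail.example.org", None, "example.net"],
    "subject": ["Please verify your email", None, "Weekly newsletter", "  "],
})

# Count missing (NaN/None) values in each column.
missing_per_column = sample.isna().sum()

# Count subjects that are either missing or whitespace-only;
# isna() alone would not catch the whitespace-only case.
blank_subjects = int(sample["subject"].fillna("").str.strip().eq("").sum())

print(missing_per_column)
print(f"Blank or missing subjects: {blank_subjects}")
```

If a column you need (here, the subject and the sender domain) has many missing values, that is worth knowing before you spend several minutes computing embeddings.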

In [5]:
def cleanup_subject(subject: str) -> str:
    # Convert the input to a string and replace newlines with spaces
    subject = str(subject).replace('\n', ' ')

    # Remove leading and trailing whitespace
    subject = subject.strip()

    # Convert to lowercase
    subject = subject.lower()
    return subject


In [6]:
# Drop rows with an empty subject
email_data.dropna(subset=['subject'], inplace=True)

# Clean up the subject
email_data['subject'] = email_data['subject'].apply(cleanup_subject)

## Step One:  Generate Embeddings
After exploring and cleaning our data, we'll need to compute embeddings for the subject lines of the emails. You can use a method similar to the one we used in previous labs OR you can use a library which I really like called `fasttext`. In the answer notebook, you will see both methods, but the basic idea is the same. You can use the same embedding model we used in the previous lab to generate the embeddings.

**This step will take several minutes to complete.**

In [8]:
# Load embedding model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2', device=device)
# Generate embeddings
email_data['embedding'] = email_data['subject'].apply(lambda x: model.encode(x, device=device).tolist())

In [9]:
# You should also create embeddings for the verification text
# You will need to reshape it so that we can compute the cosine_similarity.  Use the reshape(1, -1)
VERIFICATION_EMBEDDING = np.array(model.encode(VERIFICATION_TEXT)).reshape(1, -1)

## Step Two: Comparing the Embeddings
What we need to do here is create a verification text, something like `Please Verify your Email Address`, and then use cosine similarity to compare the embeddings of the verification text and the email subjects.

### Calculate the Cosine Similarity

Next, we calculate the cosine similarity between the template text and the email subject.  We have to do some data cleanup as well.  For the cleanup we:
1. Convert the inputs to strings
2. Remove leading/trailing whitespaces
3. Replace any newlines with spaces
4. Convert to lower case

Once we have done that, we'll use the model to get a vector for the sentence and use cosine similarity to measure how close the template text and the email subject are. Note that cosine similarity is a similarity, not a distance: higher scores mean the two texts are closer.

You will have to reshape each embedding using NumPy's `reshape(1, -1)` method. To do this, you'll probably have to convert each embedding list to a NumPy array first, as shown below:

```python
np.array(embedding).reshape(1, -1)
```

You can calculate the cosine similarity by using the `cosine_similarity(a1, a2)` function that we imported at the beginning of the lab. `a1` and `a2` need to be 2D array-like structures with one row per sample, which is why the reshape is needed. Be sure to save these scores in the dataframe.
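
To make the calculation concrete, here is a minimal sketch of cosine similarity written out by hand with NumPy: the dot product of two vectors divided by the product of their lengths. The three-element vectors and the helper name `cosine` are made up for illustration; real embeddings from the model have many more dimensions.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for real embeddings (illustration only).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # points in the same direction as a
c = np.array([-1.0, -2.0, -3.0])  # points in the opposite direction to a

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # approximately 1.0: same direction
print(cosine(a, c))  # approximately -1.0: opposite direction

# reshape(1, -1) turns a flat vector into a 2D array with a single row.
print(a.shape, a.reshape(1, -1).shape)  # (3,) (1, 3)
```

The score depends only on direction, not on length, which is why `a` and `b` score about 1 even though `b` is twice as long.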

In [10]:
def text_similarity(embedding) -> float:
    return cosine_similarity(np.array(embedding).reshape(1, -1), VERIFICATION_EMBEDDING)[0][0]

email_data['similarity_score'] = email_data['embedding'].apply(text_similarity)
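
Scoring one row at a time works, but it makes one similarity call per email. As an alternative sketch (not the method used in this notebook), you can stack the embeddings into a single matrix and score every row with one matrix product. The random vectors below are hypothetical stand-ins for real embeddings; in the lab you could build the matrix with something like `np.vstack(email_data['embedding'])`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5 "subject" embeddings and 1 "template" embedding, 4 dimensions each.
embeddings = rng.normal(size=(5, 4))
template = rng.normal(size=(1, 4))

# Scale every row to unit length; the dot product of two unit vectors
# is exactly their cosine similarity.
unit_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_template = template / np.linalg.norm(template, axis=1, keepdims=True)

# One matrix product yields a score for every row at once.
scores = (unit_embeddings @ unit_template.T).ravel()

print(scores.shape)  # (5,)
```

scikit-learn's `cosine_similarity` also accepts a matrix with many rows as its first argument, so passing the stacked matrix and `VERIFICATION_EMBEDDING` in a single call should give the same scores as a column.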

## Step Three:  Finding the Accounts
Now that we have the similarity scores, we can look at the results to find candidate emails.  The closer the similarity score is to 1, the more likely the email is a match. You will have to decide what similarity threshold you want to use to detect these emails.  Experiment with the similarity score to see what threshold works best for our use case.

Your task here is to find the emails, then extract the unique sender domains to find out which unauthorized accounts have been created.
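
One way to experiment with the threshold, sketched under assumptions: count how many emails pass at several candidate values, then pick one and list the distinct sender domains. The rows and scores in `scored` are invented for illustration (they are not results from the dataset), and the value 0.7 is arbitrary.

```python
import pandas as pd

# Hypothetical scored rows; in the lab these columns come from email_data.
scored = pd.DataFrame({
    "from_email_complete_domain": ["a.example", "a.example", "b.example", "c.example"],
    "subject": ["confirm your email", "please verify your email address",
                "your weekly digest", "verify your account email"],
    "similarity_score": [0.81, 0.93, 0.12, 0.66],
})

# See how many emails pass at each candidate threshold.
for threshold in (0.5, 0.6, 0.7, 0.8):
    n_matches = int((scored["similarity_score"] >= threshold).sum())
    print(f"threshold={threshold:.1f} -> {n_matches} matching emails")

# Pick a threshold (0.7 here is arbitrary) and list the distinct sender domains.
matches = scored[scored["similarity_score"] >= 0.7]
domains = sorted(matches["from_email_complete_domain"].unique())
print(domains)
```

Reading a sample of the subjects just above and just below each candidate threshold is a quick way to judge whether you are admitting false positives or dropping real verification emails.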

In [11]:
email_data[email_data['similarity_score'] >= 0.68][['from_email_complete_domain', 'subject']]

Unnamed: 0,from_email_complete_domain,subject
307,lyftmail.com,confirm your email
2496,alerts.comcast.net,important: please verify your email address
2538,alerts.comcast.net,important: please verify your email address
2607,alerts.comcast.net,important: please verify your email address
4987,hightail.com,please verify your email change
...,...,...
135855,alerts.comcast.net,important: please verify your email address
137511,equityzen.com,please verify your email address to get started
137544,equityzen.com,reminder: please verify your email address
139441,lyftmail.com,confirm your email


In [12]:
email_data[email_data['similarity_score'] >= 0.68][['from_email_complete_domain']].value_counts()

from_email_complete_domain
lyftmail.com                  9
alerts.comcast.net            4
hello.soundcloud.com          4
microsoft.onmicrosoft.com     3
ahs.com                       3
equityzen.com                 3
service.discover.com          2
mail.etsy.com                 2
godaddy.com                   2
mail15.creditkarma.com        2
mail.offeredby.com            2
service.lovense.com           2
studiolab.sagemaker.aws       2
welcome.aexp.com              2
service.hbomax.com            1
republic.co                   1
transferwise.com              1
trekbikes.com                 1
remarkable.com                1
redfin.com                    1
notifications.skiff.org       1
news.sedo.com                 1
trip.com                      1
mail7.creditkarma.com         1
mail19.creditkarma.com        1
mail17.creditkarma.com        1
revolut.com                   1
above.com                     1
mail.instagram.com            1
emaildl.att-mail.com          1
account.pinte