[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/security/alert-similarity/alert-similarity.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/security/alert-similarity/alert-similarity.ipynb)

We start by installing the Pinecone client and `sentence-transformers`, which we use to encode the alerts.

In [None]:
!pip install -qU pinecone-client sentence-transformers



In [None]:
alert_list = [
    '2021-12-13T00:45:31+00:00 File change alert in directory /users/james/documents/projects',
    '2021-12-11T12:01:08+00:00 File change alert in directory /users/dan/documents/projects',
    '2021-12-10T14:31:12+00:00 New login location for /users/james in location Rome, Italy',
    '2021-12-09T18:04:52+00:00 File change alert in directory /users/dan/documents/projects',
    '2021-12-09T12:01:41+00:00 Directory change alert in /users/james/documents/projects'
]

### Feature Extraction

Given our alerts, we may want to normalize and extract specific features. For full alerts this is much more complex, but in our case all we need to do is remove the timestamp and normalize the directory paths.

In [None]:
import re

# this regex matches the timestamp
timestamp = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}")
# this regex matches user directories within a filepath
user_dir = re.compile(r"(?<=\/users\/)\w+")

In [None]:
for i, alert in enumerate(alert_list):
    alert = timestamp.sub('', alert)
    alert = user_dir.sub('<user>', alert)
    alert_list[i] = alert.strip()

In [None]:
alert_list

['File change alert in directory /users/<user>/documents/projects',
 'File change alert in directory /users/<user>/documents/projects',
 'New login location for /users/<user> in location Rome, Italy',
 'File change alert in directory /users/<user>/documents/projects',
 'Directory change alert in /users/<user>/documents/projects']
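
The same two substitutions can be wrapped in a small function so that new alerts (or query strings) are normalized the same way. This is a minimal sketch: `normalize_alert` and the upper-case pattern names are hypothetical and not part of the original notebook.

```python
import re

# Same patterns as in the cells above, compiled once at module level.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}")
USER_DIR = re.compile(r"(?<=/users/)\w+")

def normalize_alert(alert):
    """Strip the timestamp and replace the user name with a placeholder."""
    alert = TIMESTAMP.sub("", alert)
    alert = USER_DIR.sub("<user>", alert)
    return alert.strip()

raw = "2021-12-10T14:31:12+00:00 New login location for /users/james in location Rome, Italy"
print(normalize_alert(raw))
# New login location for /users/<user> in location Rome, Italy
```

Unlike the in-place loop, the function leaves its input untouched, and applying it twice gives the same result as applying it once.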

### Tokenization and Vectorization

With only the relevant features left, we can convert the text into machine-readable vectors. Expel handles this step with MinHashing, which suits the complex nature of security alerts. Because our alerts are plain text, we can encode them more simply with a sentence transformer model.

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# initialize a model for creating the vectors
model = SentenceTransformer(
    'flax-sentence-embeddings/all_datasets_v4_MiniLM-L6',
    device=device
)
# encode the vectors
alert_vectors = model.encode(alert_list)

In [None]:
alert_vectors[0]

array([ 2.63850274e-03,  5.77200539e-02,  1.25898756e-02,  2.10825801e-02,
        1.08278811e-01, -3.81005332e-02,  9.38568115e-02,  6.46000504e-02,
        1.26974329e-01,  9.88046732e-03, -1.58750673e-03,  9.68097337e-03,
        1.53104132e-02,  4.15581204e-02, -9.68658403e-02,  4.53900695e-02,
       -9.25505087e-02,  4.98605147e-03,  2.75635161e-02,  2.44797170e-02,
        2.19219015e-03,  2.97133494e-02,  5.95396981e-02,  4.05759411e-03,
        5.13067469e-02, -3.31423171e-02,  3.54301147e-02, -2.88685504e-02,
       -1.01434596e-01,  2.20345072e-02,  4.25754040e-02,  1.00562200e-02,
        5.89723103e-02, -6.05094843e-02,  4.49954234e-02,  2.97875330e-02,
        8.62189829e-02, -3.52832898e-02,  6.59001805e-03,  1.62643765e-03,
       -6.18802458e-02, -2.92162057e-02, -7.86487479e-03, -1.83716938e-02,
       -9.11098942e-02, -4.29506749e-02,  4.36825491e-02, -8.59973580e-02,
       -2.81795841e-02,  3.23357284e-02,  7.09767593e-03, -1.03866287e-01,
       -3.44000347e-02, -
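
For comparison, the MinHashing approach mentioned at the start of this section can be sketched with the standard library alone. This is an illustrative toy, not Expel's implementation: the word-shingle size, the number of hash functions, and the function names are all assumptions made for the example.

```python
import hashlib

def shingles(text, k=2):
    """Return the set of k-word shingles in the text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

file_alert = "File change alert in directory /users/<user>/documents/projects"
dir_alert = "Directory change alert in /users/<user>/documents/projects"
sig_file = minhash_signature(shingles(file_alert))
sig_dir = minhash_signature(shingles(dir_alert))

print(estimated_jaccard(sig_file, sig_file))  # identical sets always give 1.0
print(estimated_jaccard(sig_file, sig_dir))   # estimate for the two change alerts
```

MinHash compares sets of tokens, so it rewards exact overlap, whereas the sentence embeddings used in this notebook can also place differently worded alerts close together.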

### Search with a Vector Index

We have our vector representations of the alerts. Now we use a vector index to store those vectors. [Pinecone](https://www.pinecone.io) is a managed vector index that allows us to set this up very easily.

To follow along with this step you will need a [free API key](https://app.pinecone.io).

In [None]:
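import pinecone

# the module-level init() call belongs to the legacy (v2) pinecone-client API,
# which the rest of this notebook also uses
pinecone.init(
    api_key="YOUR_API_KEY",
    environment="YOUR_ENV"  # find next to API key in console
)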

We need the vector dimension to create a new Pinecone index.

In [None]:
dim = alert_vectors.shape[1]
dim

384

In [None]:
index_name = 'alert-similarity'

# we create an index to store the vectors and search through
pinecone.create_index(index_name, dimension=dim)
# then initialize connection to the index
index = pinecone.Index(index_name)

In [None]:
# organize the data
data = []
for i, alert_vec in enumerate(alert_vectors):
    data.append((f'id-{i}', alert_vec.tolist()))
# upsert the data
index.upsert(vectors=data)

{'upserted_count': 5}
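
The five alerts here fit comfortably in a single upsert call. For larger alert sets it is common to send the vectors in batches instead. The helper below is a minimal sketch of that pattern; the name `batches` and the batch size of 100 are assumptions chosen for illustration, not values taken from this notebook.

```python
from itertools import islice

def batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical usage with the `data` list and `index` from the cells above:
# for batch in batches(data, 100):
#     index.upsert(vectors=batch)

print(list(batches(range(5), 2)))
# [[0, 1], [2, 3], [4]]
```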

With the alerts indexed, we can run a query to confirm that the four file/directory change alerts are very similar to one another and differ from the single login location alert.

In [None]:
query = "Some change alert in any location"
# encode the query as a single flat vector (a plain list of floats)
xq = model.encode(query).tolist()
# and make the query in Pinecone
result = index.query(vector=xq, top_k=5)
result

{'matches': [{'id': 'id-1', 'score': 0.601801634, 'values': []},
             {'id': 'id-3', 'score': 0.601801634, 'values': []},
             {'id': 'id-0', 'score': 0.601801634, 'values': []},
             {'id': 'id-4', 'score': 0.564733624, 'values': []},
             {'id': 'id-2', 'score': 0.270546645, 'values': []}],
 'namespace': ''}

We can map these IDs back to our original sentences.

In [None]:
for record in result['matches']:
    # IDs have the form 'id-<n>'; parse the full number, not just the last character
    idx = int(record['id'].split('-')[1])
    score = round(record['score'], 3)
    print(f"{score}\n{alert_list[idx]}")

0.602
File change alert in directory /users/<user>/documents/projects
0.602
File change alert in directory /users/<user>/documents/projects
0.602
File change alert in directory /users/<user>/documents/projects
0.565
Directory change alert in /users/<user>/documents/projects
0.271
New login location for /users/<user> in location Rome, Italy


Both the file and directory change alerts score much higher against the query `"Some change alert in any location"` than the login location alert does.
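
To make the scores more concrete, the ranking a vector index performs can be sketched as a brute-force search in plain Python. This assumes the index ranks by cosine similarity, which is an assumption here because the notebook does not set a metric explicitly. The toy 2-dimensional vectors and the `top_k` helper are hypothetical stand-ins for the 384-dimensional alert vectors and the Pinecone query.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=5):
    """Score every stored (id, vector) pair and return the k best, highest first."""
    scored = [(vec_id, cosine_similarity(query_vec, vec)) for vec_id, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors in the same (id, vector) layout as the upserted data.
stored = [("id-0", [1.0, 0.0]), ("id-1", [0.0, 1.0]), ("id-2", [1.0, 1.0])]
matches = top_k([1.0, 0.1], stored, k=2)
print([vec_id for vec_id, _ in matches])
# ['id-0', 'id-2']
```

A managed index avoids scanning every vector like this, but the meaning of the score is the same: vectors pointing in nearly the same direction as the query score close to 1.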

### Delete the Index

Delete the index once you no longer need it. After deletion, the index and the vectors stored in it are no longer available.

In [None]:
pinecone.delete_index(index_name)

---