## Sentence Transformer

- interesting observation: no need to tokenize the texts manually when using Sentence Transformers; `encode` accepts raw strings and the model tokenizes them internally

In [2]:
import faiss
import pandas as pd
import numpy as np
import torch
import seaborn as sns
import matplotlib.pyplot as plt
from jupyterthemes import jtplot
jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)

In [3]:
from sentence_transformers import SentenceTransformer

In [4]:
df = pd.read_csv('../chat_logs.csv')



In [5]:
df.line_message

0         one of these days soon I need to see what geas...
1         anyway i'm off to do real work for a couple of...
2                             so its today or not this week
3          we are still waiting for a new forms/geas driver
4                                   oh, guess its not today
                                ...                        
659788                                         Python 2.4.1
659789                                          is this oK?
659790                                  hellllooooo????????
659791                                is this project dead?
659792                                way to respond ppl ;)
Name: line_message, Length: 659793, dtype: object

In [6]:
df

Unnamed: 0,log_id,line_count,user,line_message,date_of_log
0,1,2.0,jcater,one of these days soon I need to see what geas...,2001-06-27
1,2,3.0,neilt,anyway i'm off to do real work for a couple of...,2001-06-27
2,3,4.0,neilt,so its today or not this week,2001-06-27
3,4,5.0,neilt,we are still waiting for a new forms/geas driver,2001-06-27
4,5,6.0,jcater,"oh, guess its not today",2001-06-27
...,...,...,...,...,...
659788,659161,95.0,Randy,Python 2.4.1,2006-10-06
659789,659162,96.0,Randy,is this oK?,2006-10-06
659790,659163,97.0,Randy,hellllooooo????????,2006-10-06
659791,659164,98.0,Randy,is this project dead?,2006-10-06


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 659793 entries, 0 to 659792
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   log_id        659793 non-null  object 
 1   line_count    659165 non-null  float64
 2   user          659165 non-null  object 
 3   line_message  659161 non-null  object 
 4   date_of_log   658537 non-null  object 
dtypes: float64(1), object(4)
memory usage: 25.2+ MB


In [8]:
df.describe()

Unnamed: 0,line_count
count,659165.0
mean,491.868032
std,453.155006
min,1.0
25%,158.0
50%,358.0
75%,691.0
max,3307.0


In [9]:
# print sentences that contain float
# nans are special values of floating points

# line_message = list(df['line_message'])
# sentences_with_float = []

# for idx in range(len(line_message)):
#     if isinstance(line_message[idx],float):
#         sentences_with_float.append(idx)
        
# [df[df.log_id==index]['line_message'] for index in sentences_with_float]
# sentences_with_float
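
The commented-out check above relies on missing values being loaded as float NaN. A minimal sketch of that behaviour, using a small hand-made list as a hypothetical stand-in for `df['line_message']` (the list contents are illustrative, not the real data):

```python
import math

# Hypothetical stand-in for the message column: two strings and one missing value.
messages = ["sigh", float("nan"), "this damn bot"]

# A missing value is a float NaN, so an isinstance check against float flags it.
nan_positions = [i for i, m in enumerate(messages) if isinstance(m, float)]

# NaN is the only float that does not compare equal to itself.
nan_value = messages[1]
is_self_equal = nan_value == nan_value

# Keeping only real text mirrors what dropna(subset=['line_message']) does on the DataFrame.
clean = [m for m in messages if not (isinstance(m, float) and math.isnan(m))]

print(nan_positions, is_self_equal, clean)
```

This is why dropping NaN rows is enough to remove every "float" entry from the column.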

In [10]:
# show all floats
# floats = df.iloc[sentences_with_float,:] #rows
# floats

In [11]:
# drop rows whose line_message is NaN (missing values load as float NaN)
df = df.dropna(subset=['line_message'])

line_messages = list(df["line_message"])
line_messages

['one of these days soon I need to see what geas has to offer',
 "anyway i'm off to do real work for a couple of days",
 'so its today or not this week',
 'we are still waiting for a new forms/geas driver',
 'oh, guess its not today',
 'cant remember who volunteered for that',
 "I'm doing real work",
 'are the geas docs pretty decent?',
 'sigh',
 'um, what it is and how to use it :)',
 'its just a black box interface to objects',
 'dont know what the docs say',
 'but if you cant find what you are looking for, let me know and well create it',
 "jamest_: what's wrong?",
 'this damn bot',
 "what's it doing",
 'annoying me',
 'nothing more, nothing less',
 'and I should not even be messing with it',
 'no real work today?',
 'jamest_: if its related',
 'jamest_: ash keeps kicking my ssh session off',
 'if i am away for any time at all',
 "but my heart fills with pity whenever I think of poor masta, hudled over his keyboard, shuddering cause he's missed out on 12 hours of goat references in 

In [12]:
df['line_message'][335498]  # raises KeyError: this row was empty and has been dropped above
df['line_message'][336738]

KeyError: 335498

In [None]:
# get an idea of the longest/shortest message, measured in whitespace-separated words

# start minVal at infinity; starting at 0 would never be updated, since no length is below 0
maxVal = 0
minVal = float('inf')
for message in line_messages:
    n_words = len(message.split())
    maxVal = max(maxVal, n_words)
    minVal = min(minVal, n_words)

In [None]:
# max, min sentence length
maxVal, minVal

## Figure Out Date Filter Range

In [17]:
# to build a date filter, the code must derive the available date range from the data itself

In [14]:
df = pd.read_csv('../chat_logs.csv')



In [19]:
dates = df['date_of_log'].tolist()

In [23]:
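# take the first and last entries as min and max
# (assumes the log is in chronological order and that neither entry is NaN)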
min_date = dates[0]
max_date = dates[-1]

In [24]:
min_date, max_date

('2001-06-27', '2006-10-06')

In [39]:
# extract year and month (naive): regex-match the ISO date, then parse it with strptime
import re
from datetime import datetime

match_min = re.search(r'\d{4}-\d{2}-\d{2}', dates[0])
match_max = re.search(r'\d{4}-\d{2}-\d{2}', dates[-1])
str_date = datetime.strptime(match_min.group(), '%Y-%m-%d').date()
end_date = datetime.strptime(match_max.group(), '%Y-%m-%d').date()

In [41]:
str_date.year, end_date.year

(2001, 2006)

In [43]:
str_date.month, end_date.month

(6, 10)
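
Taking the first and last rows only works if the log is sorted and both entries are present, and `date_of_log` has missing values. A hedged sketch of a more defensive variant follows: it parses every value, skips anything missing or malformed, and lists the (year, month) pairs a filter could offer. The helper names `parse_log_date`, `date_range`, and `month_buckets` and the `sample` list are assumptions made for illustration, not part of the notebook.

```python
from datetime import date, datetime
from typing import List, Optional, Tuple


def parse_log_date(value) -> Optional[date]:
    """Return a date for 'YYYY-MM-DD' strings, or None for missing/malformed values."""
    if not isinstance(value, str):
        return None  # NaN (a float) or None
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None


def date_range(values) -> Tuple[date, date]:
    """Return (earliest, latest) over all parseable values, regardless of order."""
    parsed = [d for d in map(parse_log_date, values) if d is not None]
    if not parsed:
        raise ValueError("no parseable dates")
    return min(parsed), max(parsed)


def month_buckets(start: date, end: date) -> List[Tuple[int, int]]:
    """List (year, month) pairs from start to end, inclusive."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


# Hypothetical, deliberately unsorted sample with a NaN and a malformed entry.
sample = ["2006-10-06", float("nan"), "2001-06-27", "not a date", "2003-01-15"]
start, end = date_range(sample)
buckets = month_buckets(start, end)
print(start, end, len(buckets))
```

In the notebook, `date_range(df['date_of_log'].tolist())` would be the equivalent call, under the assumption that valid entries follow the `YYYY-MM-DD` format seen above.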

## Text Analysis

- idea: associate a title with each text message, then retrieve messages using simple tf-idf weighting.

Heuristics
- if we can figure out which labels apply to each text, we are essentially performing dimensionality reduction: each message is summarized by a few labels.
- we can then train a machine learning model to do text classification.
- we may then use simple similarity search and ranking models to retrieve texts.
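
To make the tf-idf retrieval idea concrete, here is a minimal, dependency-free sketch. It ranks messages directly against a query and leaves out the title/label step, so it illustrates only the weighting and ranking part. The tokenizer (lowercase, whitespace split), the smoothed IDF formula, and the function names are assumptions chosen for brevity; the four documents are messages from the log above.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace (punctuation stays attached).
    return text.lower().split()


def build_tfidf(docs):
    """Return (idf, vectors), where each vector maps term -> tf * idf."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms present in every document keep a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + n)) + 1.0 for t, n in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, idf, vectors, top_k=3):
    """Return up to top_k (document, score) pairs with a non-zero score, best first."""
    counts = Counter(t for t in tokenize(query) if t in idf)
    query_vec = {t: c * idf[t] for t, c in counts.items()}
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored[:top_k] if score > 0]


docs = [
    "are the geas docs pretty decent?",
    "dont know what the docs say",
    "we are still waiting for a new forms/geas driver",
    "no real work today?",
]
idf, vectors = build_tfidf(docs)
results = search("geas docs", docs, idf, vectors)
for text, score in results:
    print(round(score, 3), text)
```

Note that the third message is not returned: the whitespace tokenizer keeps `forms/geas` as one token, so it never matches `geas`. That kind of exact-match brittleness is one motivation for the embedding-based search explored next.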

## Semantic Search

- we may be more interested in the "Multi-Lingual Models" from sentence embeddings
- symmetric or asymmetric search?
  - symmetric (query and corpus entries are similar in length and form) --> pre-trained Sentence Embedding Model
  - asymmetric (short query, longer passage) --> pre-trained MS MARCO Model
- how about the rerank component?
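
On the rerank question, one common arrangement is two stages: a cheap vector search retrieves a handful of candidates, then a slower, more precise scorer reorders only those candidates. The sketch below shows just that control flow. The three-dimensional vectors are made up, and `word_overlap` is a hypothetical stand-in for a real pairwise scorer (in practice this role is usually played by a cross-encoder, for which sentence-transformers provides a `CrossEncoder` class).

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, corpus_vecs, top_k):
    """Stage 1: indices of the top_k corpus vectors by cosine similarity."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]


def rerank(query, candidates, corpus, score_pair):
    """Stage 2: reorder the candidate indices by a pairwise (query, text) score."""
    return sorted(candidates, key=lambda i: -score_pair(query, corpus[i]))


def word_overlap(query, text):
    # Hypothetical stand-in for a cross-encoder: fraction of query words found in the text.
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


corpus = [
    "are the geas docs pretty decent?",
    "dont know what the docs say",
    "no real work today?",
]
# Made-up embeddings, for illustration only; real ones would come from embedder.encode(...).
corpus_vecs = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.2, 0.9]]
query = "what do the docs say"
query_vec = [1.0, 0.2, 0.0]

candidates = retrieve(query_vec, corpus_vecs, top_k=2)
reranked = rerank(query, candidates, corpus, word_overlap)
print(candidates, reranked)
```

The point of the split is cost: the pairwise scorer runs on `top_k` candidates rather than on all of the roughly 660k messages.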


In [26]:
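# df was reloaded above, so drop missing messages again; encode() expects strings, not NaN
corpus = df['line_message'].dropna().tolist()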
corpus

In [28]:
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [None]:
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)