In [1]:
import os
from pathlib import Path

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from ai_security.discriminative_chatter_detector import load_training_data


## Load the [baby blackbriar training data](https://docs.google.com/spreadsheets/d/1vuuuqqRsXhYmDbY0k5--5UibbtEVcA5i5CFusVuXh-s/edit?gid=2047409260#gid=2047409260)

In [2]:
training_df = load_training_data('baby-blackbriar')
for comment_column in ['embedding_comment', 'category_comment']:
    training_df[comment_column] = training_df[comment_column].fillna('')
training_df.head()

|   | transcript | category | embedding_comment | category_comment |
|---|---|---|---|---|
| 0 | baby | harmless | Harmless word 1x | |
| 1 | baby baby baby | harmless | Harmless word 3x | |
| 2 | baby baby baby | harmless | Harmless word 3x | |
| 3 | blackbriar | blackbriar | Keyword appears 1x | |
| 4 | blackbriar blackbriar blackbriar | blackbriar | Same keyword 3x | |


### Convert transcripts into word-count vectors (the [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) method)

In [3]:
embedder = CountVectorizer(max_features=1000, stop_words="english")
# Learn the vocabulary and count each token per transcript (sparse matrix)
embedded_texts = embedder.fit_transform(training_df['transcript'])
tokens = embedder.get_feature_names_out()
sentence_embeddings = pd.DataFrame(
    embedded_texts.toarray(),  # dense array, one row per transcript
    index=training_df['transcript'],
    columns=tokens
)
# Comments were already filled with '' when the data was loaded
sentence_embeddings['comment'] = training_df['embedding_comment'].values
sentence_embeddings

| transcript | baby | black | blackbriar | blahkbriar | briar | comment |
|---|---|---|---|---|---|---|
| baby | 1 | 0 | 0 | 0 | 0 | Harmless word 1x |
| baby baby baby | 3 | 0 | 0 | 0 | 0 | Harmless word 3x |
| baby baby baby | 3 | 0 | 0 | 0 | 0 | Harmless word 3x |
| blackbriar | 0 | 0 | 1 | 0 | 0 | Keyword appears 1x |
| blackbriar blackbriar blackbriar | 0 | 0 | 3 | 0 | 0 | Same keyword 3x |
| blackbriar blahkbriar | 0 | 0 | 1 | 1 | 0 | Once spelled correctly, once wrong |
| blahkbriar | 0 | 0 | 0 | 1 | 0 | Misspelling, maybe transcription error |
| black briar | 0 | 1 | 0 | 0 | 1 | Split with space |
| baby baby | 2 | 0 | 0 | 0 | 0 | |
| Baby | 1 | 0 | 0 | 0 | 0 | |
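To see what `CountVectorizer` does to these transcripts, here is a minimal pure-Python sketch of the same counting. It assumes scikit-learn's defaults of lowercasing and the token pattern `(?u)\b\w\w+\b` (words of two or more characters), and it skips stop-word removal and `max_features`, neither of which changes anything for these transcripts. `bag_of_words` is a hypothetical helper, not part of scikit-learn.

```python
import re
from collections import Counter

# Assumed to match scikit-learn's default token_pattern: words of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(transcripts):
    """Hypothetical helper: return (vocabulary, count rows) like a plain count vectorizer."""
    # Lowercase first (mirrors lowercase=True), then split into tokens
    tokenized = [TOKEN_PATTERN.findall(t.lower()) for t in transcripts]
    # Vocabulary sorted alphabetically, the order get_feature_names_out() uses
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # One count per vocabulary word; words absent from this transcript count 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocab, rows = bag_of_words(["baby baby baby", "black briar", "Baby", "blackbriar blahkbriar"])
print(vocab)  # ['baby', 'black', 'blackbriar', 'blahkbriar', 'briar']
for row in rows:
    print(row)
# [3, 0, 0, 0, 0]
# [0, 1, 0, 0, 1]
# [1, 0, 0, 0, 0]
# [0, 0, 1, 1, 0]
```

The rows match the corresponding lines of the table above: `Baby` and `baby` map to the same column, and `black briar` becomes two unrelated tokens rather than one.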


In [4]:
classical_model = Pipeline([
    ("vectorizer", CountVectorizer(max_features=1000, stop_words="english")),
    ("classifier", LogisticRegression())
])

In [5]:
classical_model.fit(training_df['transcript'], training_df['category'])

Pipeline(steps=[('vectorizer', CountVectorizer(max_features=1000, stop_words='english')), ('classifier', LogisticRegression())])


In [6]:
inference_transcripts = [
    'blackbriar', 
    'baby',
    'baby blackbriar',
    'Jason Bourne'
]
prediction = classical_model.predict(inference_transcripts)

prediction_df = pd.DataFrame({
    'input': inference_transcripts,
    'output': prediction
})
print('For **input** transcripts, what **output** category does the AI model predict?\n')
prediction_df

For **input** transcripts, what **output** category does the AI model predict?

|   | input | output |
|---|---|---|
| 0 | blackbriar | blackbriar |
| 1 | baby | harmless |
| 2 | baby blackbriar | harmless |
| 3 | Jason Bourne | harmless |


Let's look at the weights (the intercept, then one coefficient per token) of the model trained on this dataset.

In [7]:
classifier = classical_model.named_steps['classifier']
# Intercept first, then one coefficient per token (same order as get_feature_names_out())
weights = np.concatenate([classifier.intercept_, classifier.coef_[0]])
weights_formatted = [float(round(a_weight, 4)) for a_weight in weights]
print('Model weights:\n')
print(weights_formatted)

Model weights:

[0.1152, 0.9856, 0.3198, -0.8138, -0.5969, 0.3198]
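These six numbers are enough to reproduce the predictions by hand. A minimal sketch, assuming the coefficients follow the token order of the embedding table (`baby`, `black`, `blackbriar`, `blahkbriar`, `briar`) and that `classes_` is sorted, i.e. `['blackbriar', 'harmless']`; for a binary `LogisticRegression`, a positive score then means `harmless`. The helper names `score` and `predict` are hypothetical.

```python
import math
import re

# Rounded weights printed above: intercept, then one coefficient per token
INTERCEPT = 0.1152
COEFS = {"baby": 0.9856, "black": 0.3198, "blackbriar": -0.8138,
         "blahkbriar": -0.5969, "briar": 0.3198}
# Assumption: sorted classes_; a positive score selects CLASSES[1]
CLASSES = ("blackbriar", "harmless")
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def score(transcript):
    """Linear decision score: intercept + sum of (coefficient x token count)."""
    tokens = TOKEN_PATTERN.findall(transcript.lower())
    # Each occurrence adds its coefficient; unknown tokens (e.g. 'jason') add nothing
    return INTERCEPT + sum(COEFS.get(tok, 0.0) for tok in tokens)

def predict(transcript):
    """Return (label, probability of 'harmless') via the logistic (sigmoid) function."""
    s = score(transcript)
    prob_harmless = 1.0 / (1.0 + math.exp(-s))
    return (CLASSES[1] if s > 0 else CLASSES[0]), prob_harmless

for text in ["blackbriar", "baby", "baby blackbriar", "Jason Bourne"]:
    label, p = predict(text)
    print(f"{text!r:20} score={score(text):+.4f} P(harmless)={p:.3f} -> {label}")
```

This gives the same labels as the model: 'baby blackbriar' comes out `harmless` because the `baby` weight (+0.9856) outweighs the `blackbriar` weight (-0.8138), and 'Jason Bourne' contains no known tokens, so its score is just the intercept (+0.1152).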


## Extend the dataset for new important transcript cases

The fictional CIA analysts reviewing the `chatter-detector` got flak from the higher-ups because transcripts containing `Jason Bourne` weren't classified as `blackbriar`. They also added transcripts containing `baby blackbriar` categorized as `blackbriar`, since `baby blackbriar` was a nickname given to "Operation Blackbriar" by some cheeky analysts who then went rogue.

In [8]:
extended_training_df = load_training_data('toddler-blackbriar')
for comment_column in ['embedding_comment', 'category_comment']:
    extended_training_df[comment_column] = extended_training_df[comment_column].fillna('')

extended_training_df.tail()

|   | transcript | category | embedding_comment | category_comment |
|---|---|---|---|---|
| 7 | black briar | blackbriar | Split with space | Not considered the same as 'blackbriar' |
| 8 | baby blackbriar | blackbriar | | Added: 'baby blackbriar' is chatter |
| 9 | baby black briar | blackbriar | | |
| 10 | jason bourne | blackbriar | | Added: Jason Bourne blackbriar relevant |
| 11 | jason bourne | blackbriar | | |


In [9]:
extended_model = Pipeline([
    ("vectorizer", CountVectorizer(max_features=1000, stop_words="english")),
    ("classifier", LogisticRegression())
])
extended_model.fit(extended_training_df['transcript'], extended_training_df['category'])
inference_transcripts = [
    'blackbriar', 
    'baby',
    'baby blackbriar',
    'Jason Bourne'
]
prediction = extended_model.predict(inference_transcripts)

prediction_df = pd.DataFrame({
    'input': inference_transcripts,
    'output': prediction
})
print('For **input** transcripts, what **output** category does the AI model predict?\n')
prediction_df

For **input** transcripts, what **output** category does the AI model predict?

|   | input | output |
|---|---|---|
| 0 | blackbriar | blackbriar |
| 1 | baby | blackbriar |
| 2 | baby blackbriar | blackbriar |
| 3 | Jason Bourne | blackbriar |


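Note: we 'fixed' the issue: 'Jason Bourne' and 'baby blackbriar' are now labeled as `blackbriar` chatter.

But in doing so, the transcript `baby` is now wrongly classified as `blackbriar` chatter instead of `harmless`, as it was before.

With relatively simple, count-based sentence embeddings like 'Bag of Words' and a logistic regression model (a one-layer neural network, so a "shallow" network, not a deep one), we can't capture more complicated interactions among words. The model's score is a weighted sum of token counts, so each word shifts the score by the same fixed amount whether it appears alone or next to other words: `baby` counts the same in 'baby' as in 'baby blackbriar'. Adding chatter-labeled transcripts that contain `baby` (such as 'baby blackbriar' and 'baby black briar') likely pulls that single, context-free weight toward `blackbriar`.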
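The "no interactions" point can be checked directly: in a linear bag-of-words model, prepending a word changes the score by exactly that word's coefficient, whatever the rest of the transcript says. A minimal sketch reusing the rounded weights printed earlier for the first model (any weights would show the same thing); `linear_score` and `effect_of_adding` are hypothetical helpers.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Rounded weights printed earlier for the first model
INTERCEPT = 0.1152
COEFS = {"baby": 0.9856, "black": 0.3198, "blackbriar": -0.8138,
         "blahkbriar": -0.5969, "briar": 0.3198}

def linear_score(transcript):
    """Decision score of a linear bag-of-words model: intercept + sum of token weights."""
    tokens = TOKEN_PATTERN.findall(transcript.lower())
    return INTERCEPT + sum(COEFS.get(tok, 0.0) for tok in tokens)

def effect_of_adding(word, context):
    """Change in score when `word` is prepended to `context`."""
    return linear_score(f"{word} {context}") - linear_score(context)

for context in ["", "blackbriar", "black briar", "jason bourne"]:
    shift = effect_of_adding("baby", context)
    print(f"{context!r:16} adding 'baby' shifts the score by {shift:+.4f}")
# every line reports +0.9856
```

The pair 'baby blackbriar' has no weight of its own; the model can only tell it apart from 'baby' through the separate weight of `blackbriar`.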