<div align="center">

# Muzlin Intro

</div>
<br>

Muzlin is a lightweight and fast library for filtering many stages of the generative text pipeline. <br>
In its simplest application, it provides methods that help align a textual task by raising flags <br> that guide an end user's query more effectively.

# Let's get started!

To begin, install the libraries needed to run the notebooks.



In [None]:
!pip install -q muzlin[notebook]

Now that we have everything installed, let's import a dataset to work with.

In [1]:
from datasets import load_dataset

ds = load_dataset('bigbio/scifact', trust_remote_code=True)

# Or load the data from a local CSV copy
#import pandas as pd
#ds = pd.read_csv('bigbio_scifact.csv')

Now grab the text that we want from the dataset.

In [2]:
import pandas as pd

df = pd.DataFrame()
df['data'] = ds['train']['claim']
df = df[df.data!='']

# Quick check of the text
print(df['data'].iloc[10])

53% of perinatal mortality is due to low birth weight.


<br>
To work with Muzlin, the text must first be encoded into embedding vectors.

In [3]:
import numpy as np
from muzlin.encoders import HuggingFaceEncoder
encoder = HuggingFaceEncoder()

vectors = encoder(df['data'].values.tolist())
vectors = np.array(vectors)

# If you want to save the vectors for later use
#np.save('vectors', vectors)

print(vectors.shape)

(809, 384)
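
Each row of `vectors` is the embedding of one claim, so the closeness of two texts can be compared numerically, for example with cosine similarity. The sketch below uses small made-up vectors rather than real encoder output, and `cosine_similarity` is an illustrative helper, not part of Muzlin.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional stand-ins for the 384-dimensional embeddings above
claim_a = [1.0, 0.0, 1.0]
claim_b = [2.0, 0.0, 2.0]  # same direction as claim_a
claim_c = [0.0, 1.0, 0.0]  # orthogonal to claim_a

print('a vs b:', cosine_similarity(claim_a, claim_b))
print('a vs c:', cosine_similarity(claim_a, claim_c))
```

Vectors pointing the same way score close to 1, while unrelated (orthogonal) vectors score 0.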


<br>
With the encoded text, we can now create a simple text anomaly filter.

In [4]:
from muzlin.anomaly import OutlierDetector
from pyod.models.pca import PCA

# Read in vectors that were previously saved
#vectors = np.load('vectors.npy')

# Initialize the anomaly detection model
# contamination: expected fraction of outliers in the training data
od = PCA(contamination=0.15)

# To log the experiment, set mlflow=True and uncomment the mlflow lines
#import mlflow
#mlflow.set_experiment('outlier_model')
clf = OutlierDetector(mlflow=False, detector=od)
clf.fit(vectors)
#mlflow.end_run()
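
To see what `contamination=0.15` does, it helps to look at how PyOD detectors turn scores into labels: the threshold is placed at the `100 * (1 - contamination)` percentile of the training scores, and anything scoring above it is labeled an outlier. The sketch below reproduces that rule on synthetic scores; it assumes `OutlierDetector` defers to the wrapped PyOD detector for this step, and the random scores are not real model output.

```python
import numpy as np

rng = np.random.default_rng(0)
train_scores = rng.normal(size=1000)  # synthetic stand-in for training outlier scores

contamination = 0.15
# Threshold at the (1 - contamination) percentile of the training scores
threshold = np.percentile(train_scores, 100 * (1 - contamination))

# Scores above the threshold are labeled 1 (outlier), the rest 0 (inlier)
labels = (train_scores > threshold).astype(int)

print('Fraction flagged in training data:', labels.mean())
```

A higher `contamination` lowers the threshold, so the filter rejects more queries; a lower value makes it more permissive.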

<br>
The fitted filter can now be used to test whether new text belongs to the original text collection.

In [5]:
import numpy as np
from muzlin.anomaly import OutlierDetector
from muzlin.encoders import HuggingFaceEncoder

# Load the previously trained model (required)
clf = OutlierDetector(model='outlier_detector.pkl')

# Encode question
encoder = HuggingFaceEncoder()

question = encoder(['Who is the current president of the USA?']) # This is a clear outlier
#question = encoder(['What treatment raises endoplasmic reticulum stress?']) # This is a clear inlier
#question = encoder(['What dosage affects the kidneys?']) # This just passes as an outlier due to ambiguity
#question = encoder(['Does taking too much folic acid affect kidney disease?']) # This just passes as an inlier (only one true text match)


vector = np.array(question).reshape(1,-1) # Must be 2D

# Get the binary label (0 = inlier, 1 = outlier) and the raw outlier score
label = clf.predict(vector)
score = clf.decision_function(vector)

print('Inlier 0, Outlier 1:', label[0])
print('Outlier likelihood:', score[0])

Inlier 0, Outlier 1: 1
Outlier likelihood: 6185210100.838657
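
In practice the label is typically used as a gate in front of the generative step: inliers go on to the model, outliers get a fallback response. The sketch below shows one way to screen a batch of questions. `screen_questions` is a hypothetical helper, and `FakeEncoder` and `FakeDetector` are stand-ins so the example runs without Muzlin; they only mimic the call shapes used above (`encoder(list_of_texts)` and `clf.predict(2D_array)`).

```python
import numpy as np

def screen_questions(questions, encoder, detector):
    """Return (question, is_outlier) pairs for a batch of questions."""
    vectors = np.array(encoder(questions))  # shape (n_questions, dim)
    labels = detector.predict(vectors)      # 0 = inlier, 1 = outlier
    return [(q, bool(label == 1)) for q, label in zip(questions, labels)]

# Stand-ins so the sketch runs on its own; swap in the real objects in use
class FakeEncoder:
    def __call__(self, texts):
        # Encodes each text as its length, purely for illustration
        return [[float(len(t))] for t in texts]

class FakeDetector:
    def predict(self, X):
        # Flags any "embedding" whose single feature exceeds 20
        return (np.asarray(X)[:, 0] > 20).astype(int)

questions = ['short query', 'a much longer, unrelated question']
results = screen_questions(questions, FakeEncoder(), FakeDetector())

for question, flagged in results:
    print('reject' if flagged else 'accept', '-', question)
```

With the real objects, the call would be `screen_questions(questions, encoder, clf)`, using the `encoder` and `clf` created in the cell above.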
