# Picrystal Demo
This notebook covers two topics:
1. Recent output change
2. New embedders added for LLM assessment

## Setting Up the environment

Run the following commands:

git clone git@github.com:QuantPi/picrystal_metric_compute.git
<br> cd picrystal_metric_compute
<br> pip install -e ./
<br> pip install -r requirements.txt


## 1. Recent output change

### Brief explanation and output

In [1]:
from functools import cached_property

import joblib
import pandas as pd

from picrystal_metric_compute.core import run_all_metrics
from picrystal_metric_compute.embedders import LabelMapEmbedder
from picrystal_metric_compute.metrics_catalog import catalog
from picrystal_metric_compute.package_wrappers import PackageWrapper


In [2]:
class HiringUseCase:

    def __init__(self):
        self.df = pd.read_csv('https://storage.googleapis.com/picrystal-bucket/hiring/445b7773-3431-4c34-a762-ce8986670aa3_main_hiring_updated.csv')
        self.df = self.df.drop("Unnamed: 0", axis=1)
    
    class_info = [
        {'default_class_value': 0, 'value': 0, 'name': 'Hired'},
        {'default_class_value': 1, 'value': 1, 'name': 'Not Hired'},
    ]
    
    @cached_property
    def model(self):
        clf = joblib.load("model.joblib")
        return PackageWrapper(model=clf, package_name="sklearn", ml_case="classification")

    @cached_property
    def inputs(self):

        predictors = [
            "State",
            "Sex",
            "MaritalDesc",
            "CitizenDesc",
            "RaceDesc",
            "Department",
            "RecruitmentSource",
            "PerformanceScore",
            "SpecialProjectsCount"
        ]

        return self.df[predictors].values

    @cached_property
    def targets(self):

        target = "HiredOrNot"
        return self.df[target].values

    embedders = [
        LabelMapEmbedder(class_info=class_info, tags=('predictions', 'binary', 'categorical')),
        LabelMapEmbedder(class_info=class_info, on='groundtruth', tags=('groundtruth', 'binary', 'categorical')),
    ]

    # No perturbers are used in this demo.
    perturbers = []

In [3]:
usecase = HiringUseCase()

In [4]:
metric_spec = {
    "accuracy": {
        "function": "accuracy",
        "specification": {},
    }
}

In [5]:
run_all_metrics(metric_spec, usecase, catalog)

INFO:root:Metrics specification: {'accuracy': {'function': 'accuracy', 'specification': {}}}
INFO:root:Metrics in metrics catalog: ['performance', 'accuracy', 'classification error', 'accuracy regression', 'average distance', 'sklearn_cls_performance_labels', 'sklearn_cls_performance_probabilities', 'sklearn_reg_performance', 'confusion matrix', 'extended binary confusion matrix', 'true negative', 'false positive', 'false negative', 'true positive', 'true negative rate', 'false positive rate', 'false negative rate', 'true positive rate', 'negative predictive value', 'false discovery rate', 'false omission rate', 'positive predictive value', 'positive likelihood ratio', 'negative likelihood ratio', 'diagnostic odds ratio', 'f1 score', 'extended binary confusion matrix accuracy', 'prevalence', 'aggregated extended confusion matrix', 'equalized odds tpr', 'equalized odds fpr', 'equal opportunity', 'predictive parity', 'demographic parity (NYC)', 'performance event bias metric', 'missing v

{'metrics': {'accuracy': [{'value': 0.977491961414791,
    'name': 'accuracy',
    'targets': 'LabelMapEmbedder_2173472566889261758',
    'predictions': 'LabelMapEmbedder_2203006555034328968'}]},
 'embedders': {'LabelMapEmbedder_2203006555034328968': {'name': 'LabelMapEmbedder embedder on predictions',
   'tags': ('inputs',
    'categorical',
    'provides-groups-info',
    'predictions',
    'binary',
    'categorical'),
   'groups': [0, 1]},
  'LabelMapEmbedder_2173472566889261758': {'name': 'LabelMapEmbedder embedder on groundtruth',
   'tags': ('inputs',
    'categorical',
    'provides-groups-info',
    'groundtruth',
    'binary',
    'categorical'),
   'groups': [0, 1]}},
 'perturbers': {},
 'picrystal_metric_compute_version': '0.3+0.gba0035e.dirty',
 'start_time': '2024-05-10T12:41:11.088451Z',
 'end_time': '2024-05-10T12:41:11.118616Z',
 'time_elapsed': '-1 day, 23:59:59.969835'}
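
The result above links each metric entry to its embedders by ID. As a minimal sketch of how such a result could be read, the snippet below works on a trimmed copy of the dictionary shown above; the helper names `summarize` and `elapsed_seconds` are hypothetical and not part of picrystal. It also recomputes the run duration from `start_time` and `end_time`. That difference is about 0.030 seconds, which is the magnitude of the reported `time_elapsed` of `-1 day, 23:59:59.969835` with the opposite sign, so the reported value appears to be start minus end.

```python
from datetime import datetime

# Trimmed copy of the result shown above (embedder tags omitted for brevity).
result = {
    "metrics": {
        "accuracy": [
            {
                "value": 0.977491961414791,
                "name": "accuracy",
                "targets": "LabelMapEmbedder_2173472566889261758",
                "predictions": "LabelMapEmbedder_2203006555034328968",
            }
        ]
    },
    "embedders": {
        "LabelMapEmbedder_2203006555034328968": {
            "name": "LabelMapEmbedder embedder on predictions",
            "groups": [0, 1],
        },
        "LabelMapEmbedder_2173472566889261758": {
            "name": "LabelMapEmbedder embedder on groundtruth",
            "groups": [0, 1],
        },
    },
    "start_time": "2024-05-10T12:41:11.088451Z",
    "end_time": "2024-05-10T12:41:11.118616Z",
}


def summarize(result):
    """Hypothetical helper: resolve embedder IDs to readable names per metric entry."""
    rows = []
    for metric_name, entries in result["metrics"].items():
        for entry in entries:
            rows.append({
                "metric": metric_name,
                "value": entry["value"],
                "targets": result["embedders"][entry["targets"]]["name"],
                "predictions": result["embedders"][entry["predictions"]]["name"],
            })
    return rows


def elapsed_seconds(result):
    """Hypothetical helper: recompute the duration as end_time minus start_time."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    start = datetime.strptime(result["start_time"], fmt)
    end = datetime.strptime(result["end_time"], fmt)
    return (end - start).total_seconds()


for row in summarize(result):
    print(row)
print(elapsed_seconds(result))
```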



## 2. New Embedders added for LLM Assessment

### Goal of LLM-based embedders
In short, these embedders extract information from raw text and convert it into NumPy arrays.
picrystal then runs its tests on these arrays.

In general, calling an embedder on a tuple (input, target, prediction) returns a NumPy array. For example:

embedder_obj = SomeEmbedder()
<br> embedder_obj((input, target, prediction)) ==> NumPy array

#### Various embedders covered in this notebook
1. LengthTextEmbedder
2. VocabularyAppearanceEmbedder
3. GenderEmbedder
4. EthnicityEmbedder
5. RegexMatchEmbedder

#### Tokenizer and field_to_check in brief
1. Tokenizer: `DefaultEnglishTokenizer(stratification="characters")`; `stratification` can also be `"words"` or `"sentences"`.
2. Text field: `DefaultSampleToTextProcessor(field_to_check="text_field")`, where `field_to_check` names the field of a sample to check for text.
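
To make the two options concrete, here is a rough sketch of what picking a text field and splitting it by stratification could look like. It is not picrystal's implementation: `sample_to_text` and `split_text` are hypothetical stand-ins, the word split is a plain whitespace split, and the sentence split is a naive punctuation rule. The `info()` outputs later in this notebook report `word_tokenize` for word stratification, so real token counts can differ from this sketch.

```python
import re


def sample_to_text(sample, field_to_check=None):
    """Hypothetical: return the sample itself, or one field of a dict sample."""
    if isinstance(sample, dict):
        return sample[field_to_check]
    return sample


def split_text(text, stratification="characters"):
    """Hypothetical: split text into units; a rough stand-in for a real tokenizer."""
    if stratification == "characters":
        return list(text)
    if stratification == "words":
        return text.split()
    if stratification == "sentences":
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    raise ValueError(f"unknown stratification: {stratification!r}")


sample = {"problem": "What is the sum of 1 and 4?", "answer": "5"}
text = sample_to_text(sample, field_to_check="problem")

print(len(split_text(text, "characters")))  # 27
print(len(split_text(text, "words")))       # 8
print(len(split_text(text, "sentences")))   # 1
```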

In [6]:
from picrystal_metric_compute.embedders import DefaultEnglishTokenizer, DefaultSampleToTextProcessor

In [7]:
dataset1 = [
    {'problem': 'One plus one', 'answer': '2'},
    {'problem': 'What is the sum of 1 and 4?',  'answer': '5'},
    {'problem': 'Square root of 105625', 'answer': '325'},
    {'problem': 'What is the square root of one hundred five thousand six hundreds twenty five?', 'answer': 'three hundreds twenty five'},
    {'problem': 'Denis bought three hundreds apples, and ate zero of them because he doesn\'t like apples so much, how many apples does he have', 'answer': '300'},
    {'problem': '''
Antoine analyzed 46.4 LLMs, but then he realized that 5.7 of them were analyzed wrong,
10 were not an LLM at all, but two out of 10 has LLM in the name, so they were considered to be an LLM.
Also 13 LLM reports were downloded online, but six of them were already analyzed.
At the end of the day Antoine found out that he can use triple every LLM because he knows three languages.
How many LLMs does he have?
''', 'answer': '167.1'},
    {'problem': 'Anna had 4 eggs, she cracked one egg, fried one egg and ate one egg. How many eggs does she have at the end', 'answer': '3'},
    {'problem': 'Anna had a customer call, there were two attendee in the call, then four attendee disconnected, Anna said "Great!! now if two more attendee join, then there will be noone in the call". How many attendee were there?', 'answer': '-2'},
    {'problem': '''Denis works 40 hours per week, (5-day work week)
Every day he spends 1 hour on staring in the emptiness, 2 hours to eat every snack he finds in the office.
Every week he should spend at least 8 hours on discussisng what is a test, and what he should rename more.
One day he reserved fully to think about jokes he wants to put into homework.
To learn new vim hotkeys he spends 30 minutes on Monday, Wendndsday and Friday and 45 minutes on Tuesday and Thursday.
Also he needs to bring all three hundres apples, that he bought for some reason, this takes two and half hour.

If Denis would smoke, how many small cigarttes he can afford per week, if smoking one small cigaratte takes 3 minutes?
    ''', 'answer': '70'},
    {'problem': 'What is the sum  minus one to the power of power of k divided by 2k plus one where k iterates from 0 to infinity', 'answer': 'pi divided by 4'},
    {'problem': 'What is square root of 2', 'answer': '1.41421356237...'},

]

In [8]:
dataset2 = [
    'One plus one', 
    'What is the sum of 1 and 4?', 
    'Square root of 105625',
    'What is the square root of one hundred five thousand six hundreds twenty five?',
    'Denis bought three hundreds apples, and ate zero of them because he doesn\'t like apples so much, how many apples does he have',
    '''Antoine analyzed 46.4 LLMs, but then he realized that 5.7 of them were analyzed wrong,
10 were not an LLM at all, but two out of 10 has LLM in the name, so they were considered to be an LLM.
Also 13 LLM reports were downloded online, but six of them were already analyzed.
At the end of the day Antoine found out that he can use triple every LLM because he knows three languages.
How many LLMs does he have?''',
'Anna had 4 eggs, she cracked one egg, fried one egg and ate one egg. How many eggs does she have at the end',
'Anna had a customer call, there were two attendee in the call, then four attendee disconnected, Anna said "Great!! now if two more attendee join, then there will be noone in the call". How many attendee were there?',
'''Denis works 40 hours per week, (5-day work week)
Every day he spends 1 hour on staring in the emptiness, 2 hours to eat every snack he finds in the office.
Every week he should spend at least 8 hours on discussisng what is a test, and what he should rename more.
One day he reserved fully to think about jokes he wants to put into homework.
To learn new vim hotkeys he spends 30 minutes on Monday, Wendndsday and Friday and 45 minutes on Tuesday and Thursday.
Also he needs to bring all three hundres apples, that he bought for some reason, this takes two and half hour.

If Denis would smoke, how many small cigarttes he can afford per week, if smoking one small cigaratte takes 3 minutes?''',
'What is the sum  minus one to the power of power of k divided by 2k plus one where k iterates from 0 to infinity',
'What is square root of 2'
]

## LengthTextEmbedder
Use case: categorizes input data into bins based on text length (measured in characters, words or sentences).
<br> By default there are 10 bins (`n_bins=10`), each 25 units wide (`bin_width=25`).

In [9]:
from picrystal_metric_compute.embedders import LengthTextEmbedder

In [10]:
emb_obj = LengthTextEmbedder(
    #n_bins=10,
    #bin_width=25,
    tokenizer=DefaultEnglishTokenizer(stratification="words")
)

In [11]:
emb_obj((dataset2, None, None))

array([0, 0, 0, 0, 1, 3, 1, 1, 5, 1, 0])
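
The binning idea can be sketched in a few lines. This is an assumption about the rule, not the library's code: the sketch uses floor division by `bin_width` and clamps long texts into the last bin. The function name `length_bin` and the sample lengths are hypothetical.

```python
def length_bin(n_units, n_bins=10, bin_width=25):
    """Map a length to a bin index (assumed rule: floor division, clamped to the last bin)."""
    return min(n_units // bin_width, n_bins - 1)


# Hypothetical lengths, e.g. numbers of words in six texts.
lengths = [3, 24, 25, 80, 140, 1000]
print([length_bin(n) for n in lengths])  # [0, 0, 1, 3, 5, 9]
```

Under this assumed rule, a text of 0 to 24 units lands in bin 0, 25 to 49 in bin 1, and so on.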

## VocabularyAppearanceEmbedder
Use case: searches the text for words from a given vocabulary.
<br> It is useful as the basis for the embedders built on gender, ethnicity and regular expressions.

Two more parameters matter here:
1. `vocabulary_in_dict`: a dictionary that maps each group to the words identifying it.

For example, ('he', 'his', 'him', 'himself') for male and ('she', 'hers', 'her', 'herself') for female.

2. `groups_processing_rules_dict`: how picrystal uses this information to categorize the data.

The resulting groups are encoded as integers, for example:

{ 0 : "male", 1 : "female", 2 : "both", 3 : "None" }

In [12]:
from picrystal_metric_compute.embedders import VocabularyAppearanceEmbedder

In [13]:
vemb = VocabularyAppearanceEmbedder()

TypeError: VocabularyAppearanceEmbedder.__init__() missing 1 required positional argument: 'vocabulary_in_dict'

The constructor needs at least a vocabulary, which the following cells build.

In [14]:
from picrystal_metric_compute.embedders import MentionsCategorizer

In [15]:
male_pronouns = ('he', 'his', 'him', 'himself')
female_pronouns = ('she', 'hers', 'her', 'herself')

vocabulary_in_dict = {
    'male': MentionsCategorizer(male_pronouns),
    'female': MentionsCategorizer(female_pronouns),
    'None': MentionsCategorizer(male_pronouns + female_pronouns, include=False),
}
groups_processing_rules_dict = {
    frozenset(['male']): 'Only male',
    frozenset(['female']): 'Only female',
    frozenset(['male', 'female']): 'Both',
    frozenset(['None']): 'None',
}

In [16]:
emb_obj = VocabularyAppearanceEmbedder(
    vocabulary_in_dict = vocabulary_in_dict,
    groups_processing_rules_dict = groups_processing_rules_dict,
    tokenizer = DefaultEnglishTokenizer("words")
)

In [17]:
emb_obj((dataset2,None,None))

array([3., 3., 3., 3., 0., 0., 1., 3., 0., 3., 3.])

In [18]:
dataset2

['One plus one',
 'What is the sum of 1 and 4?',
 'Square root of 105625',
 'What is the square root of one hundred five thousand six hundreds twenty five?',
 "Denis bought three hundreds apples, and ate zero of them because he doesn't like apples so much, how many apples does he have",
 'Antoine analyzed 46.4 LLMs, but then he realized that 5.7 of them were analyzed wrong,\n10 were not an LLM at all, but two out of 10 has LLM in the name, so they were considered to be an LLM.\nAlso 13 LLM reports were downloded online, but six of them were already analyzed.\nAt the end of the day Antoine found out that he can use triple every LLM because he knows three languages.\nHow many LLMs does he have?',
 'Anna had 4 eggs, she cracked one egg, fried one egg and ate one egg. How many eggs does she have at the end',
 'Anna had a customer call, there were two attendee in the call, then four attendee disconnected, Anna said "Great!! now if two more attendee join, then there will be noone in the call

In [19]:
emb_obj.info()

{'name': 'Vocabulary appearance embedder',
 'tags': ('inputs',
  'categorical',
  'provides-groups-info',
  'vocabulary-appearance'),
 'class_info': [{'value': 0, 'name': 'Only male'},
  {'value': 1, 'name': 'Only female'},
  {'value': 2, 'name': 'Both'},
  {'value': 3, 'name': 'None'}],
 'vocabulary_in_dict': {'male': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d2f71d17eb0>,
  'female': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d2f71d9c160>,
  'None': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d2f71d9f970>},
 'groups_processing_rules_dict': {frozenset({'male'}): 'Only male',
  frozenset({'female'}): 'Only female',
  frozenset({'female', 'male'}): 'Both',
  frozenset({'None'}): 'None'},
 'sample to text processor config': {'sample to text processor': 'default sample to text processor'},
 'tokenizer config': {'stratification': 'words',
  'tokenizer used': 'word_tokenize'}}
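
The grouping rules shown in the `info()` output above can be illustrated with a small pure-Python sketch. It does not use picrystal and is not the library's implementation: the regex tokenization, the `categorize` helper, and the way `'None'` is handled (applied only when no vocabulary word occurs) are assumptions made for illustration.

```python
import re

male_pronouns = ("he", "his", "him", "himself")
female_pronouns = ("she", "hers", "her", "herself")

# Plain sets stand in for the MentionsCategorizer objects.
vocabulary = {"male": set(male_pronouns), "female": set(female_pronouns)}

rules = {
    frozenset(["male"]): "Only male",
    frozenset(["female"]): "Only female",
    frozenset(["male", "female"]): "Both",
    frozenset(["None"]): "None",
}
# Assumption: integer codes follow the order in which rule names first appear.
codes = {name: i for i, name in enumerate(dict.fromkeys(rules.values()))}


def categorize(text):
    """Hypothetical: return the code of the rule matching the groups found in text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    matched = {group for group, words in vocabulary.items() if tokens & words}
    # Assumption: 'None' applies only when no vocabulary word is present.
    key = frozenset(matched) if matched else frozenset(["None"])
    return codes[rules[key]]


texts = [
    "One plus one",
    "Anna had 4 eggs, she cracked one egg, fried one egg and ate one egg.",
    "Denis bought three hundreds apples, and ate zero of them because he doesn't like apples",
    "She gave him the report",
]
print([categorize(t) for t in texts])  # [3, 1, 0, 2]
```

Matching on whole tokens rather than substrings matters here: "the" and "them" must not count as a mention of "he".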

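## GenderEmbedder
Use case: categorizes input data based on gender information.
<br> Compared with the VocabularyAppearanceEmbedder example above, its default rules merge the 'Both' and 'None' cases into a single group, 'Either both or None'.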

In [20]:
from picrystal_metric_compute.embedders import GenderEmbedder

In [21]:
emb_obj = GenderEmbedder()

In [22]:
emb_obj.info()

{'name': 'Gender embedder',
 'tags': ('inputs',
  'categorical',
  'provides-groups-info',
  'vocabulary-appearance'),
 'class_info': [{'value': 0, 'name': 'Only male'},
  {'value': 1, 'name': 'Only female'},
  {'value': 2, 'name': 'Either both or None'}],
 'vocabulary_in_dict': {'male': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d30943a77f0>,
  'female': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d30943a57b0>,
  'None': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d30943a58a0>},
 'groups_processing_rules_dict': {frozenset({'male'}): 'Only male',
  frozenset({'female'}): 'Only female',
  frozenset({'female', 'male'}): 'Either both or None',
  frozenset({'None'}): 'Either both or None'},
 'sample to text processor config': {'sample to text processor': 'default sample to text processor'},
 'tokenizer config': {'stratification': 'words',
  'tokenizer used': 'word_tokenize'}}

## EthnicityEmbedder
Use case: extracts ethnicity information from the text.

In [23]:
from picrystal_metric_compute.embedders import EthnicityEmbedder

In [24]:
emb_obj = EthnicityEmbedder()

In [25]:
emb_obj.info()

{'name': 'Ethnicity embedder',
 'tags': ('inputs',
  'categorical',
  'provides-groups-info',
  'vocabulary-appearance'),
 'class_info': [{'value': 0, 'name': 'Hispanic or Latino'},
  {'value': 1, 'name': 'White'},
  {'value': 2, 'name': 'Black or African American'},
  {'value': 3, 'name': 'Native Hawaiian or Pacific Islander'},
  {'value': 4, 'name': 'Asian'},
  {'value': 5, 'name': 'Native American or Alaska Native'},
  {'value': 6, 'name': 'None'},
  {'value': 7, 'name': 'Two or more'}],
 'vocabulary_in_dict': {'Hispanic or Latino': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d2f71e66da0>,
  'White': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d2f71d16020>,
  'Black or African American': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d2f71d154b0>,
  'Native Hawaiian or Pacific Islander': <picrystal_metric_compute.embedders.MentionsCategorizer at 0x7d2f71d14e50>,
  'Asian': <picrystal_metric_compute.embedders.MentionsCategorizer at 0

In [26]:
emb_obj((dataset2,None,None))

array([6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.])

## RegexMatchEmbedder
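Use case: categorizes text by whether it matches a regular expression. With the default configuration the groups are 'credit card' and 'No credit card'.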

In [27]:
from picrystal_metric_compute.embedders import RegexMatchEmbedder

In [28]:
emb_obj = RegexMatchEmbedder()

In [29]:
emb_obj.info()

{'name': 'Regex embedder',
 'tags': ('inputs',
  'categorical',
  'provides-groups-info',
  'vocabulary-appearance'),
 'class_info': [{'value': 0, 'name': 'credit card'},
  {'value': 1, 'name': 'No credit card'}],
 'vocabulary_in_dict': {'credit card': <picrystal_metric_compute.embedders.RegexMatchCategorizer at 0x7d2f71eb9e70>,
  'No credit card': <picrystal_metric_compute.embedders.RegexMatchCategorizer at 0x7d2f71e18370>},
 'groups_processing_rules_dict': {frozenset({'credit card'}): 'credit card',
  frozenset({'No credit card'}): 'No credit card'},
 'sample to text processor config': {'sample to text processor': 'default sample to text processor'},
 'tokenizer config': {'stratification': 'characters', 'tokenizer used': None}}
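
As a minimal sketch of regex-based grouping, the snippet below assigns code 0 ('credit card') or 1 ('No credit card'), mirroring the `class_info` shown above. The pattern is a hypothetical one written for this illustration (13 to 16 digits, optionally separated by single spaces or hyphens); the notebook does not show the pattern picrystal actually uses, and `regex_group` is not a library function.

```python
import re

# Hypothetical pattern: 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def regex_group(text):
    """Return 0 ('credit card') if the pattern matches, else 1 ('No credit card')."""
    return 0 if CARD_PATTERN.search(text) else 1


texts = [
    "Pay with 4111 1111 1111 1111 today",
    "Square root of 105625",
]
print([regex_group(t) for t in texts])  # [0, 1]
```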