# LLooM: Getting Started - Template Notebook

Last Updated: April 2024

### Installation
First, install the LLooM Python package, available on PyPI as [`text_lloom`](https://pypi.org/project/text_lloom/). We recommend setting up a virtual environment with [venv](https://docs.python.org/3/library/venv.html#creating-virtual-environments) or [conda](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands).
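To confirm that the package landed in the environment your notebook kernel is actually using, a quick check along these lines can help. This is a minimal sketch that relies only on the standard library; the helper name `installed_version` is ours and is not part of LLooM.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if text_lloom is installed in this environment, else None.
print(installed_version("text_lloom"))
```

If this prints `None` right after a successful `pip install`, the kernel is probably running in a different environment from the one pip installed into.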

In [32]:
!pip install text_lloom --quiet

In [33]:
!pip install pyserial

Defaulting to user installation because normal site-packages is not writeable


In [34]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/lbartolome/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### Imports

In [83]:
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

LLooM uses the OpenAI API under the hood to support its core operators (using GPT-3.5 and GPT-4). Before running any operator, set the `OPENAI_API_KEY` environment variable locally so that requests are made with your own account.

In [84]:
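# Load your OpenAI key (e.g. "sk-123xyz") from a local .env file.
load_dotenv('/export/usuarios_ml4ds/lbartolome/Repos/repos_con_carlos/RAG_tool/.env')
# load_dotenv already populates os.environ, so just fail early if the key is missing.
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"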
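A small helper can turn a missing or blank key into a clear error before any API call is made. The sketch below is illustrative only: `require_env` is a hypothetical name, and the demonstration uses a stand-in dictionary rather than the real environment.

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the value of `name`, or raise a clear error if it is unset or blank."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Demonstration with a stand-in mapping (assumption: the key looks like "sk-...").
demo_env = {"OPENAI_API_KEY": "sk-123xyz"}
print(require_env("OPENAI_API_KEY", demo_env)[:3])
```

In the notebook itself you would call `require_env("OPENAI_API_KEY")` after `load_dotenv(...)`, so that it reads from `os.environ`.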

In [85]:
# Import the LLooM package:
import text_lloom.workbench as wb

### Load data
The original template uses a sample dataset of 100 **Facebook posts** from **political** pages, gathered via CrowdTangle. This run instead loads a Parquet file of scientific paper records. The main columns we'll be using in our analysis are the following:
- `id`: Unique ID for each paper (renamed to `doc_id` below)
- `text`: The paper title followed by its abstract

In [95]:
# The template's sample CSV files are kept here for reference:
# data_link = "https://michelle123lam.github.io/lloom/data/political_fb_posts_100.csv"
# data_link = "/content/NCTE_transcript.csv"
# df = pd.read_csv(data_link)

# Earlier dataset, superseded by the assignment below:
# data_link = '/export/usuarios_ml4ds/cggamella/RAG_tool/files/anotacion_manual/fam/datos_modelo_es_Mallet_df_merged_14_topics_45_ENTREGABLE.parquet'

# Load the dataset used in this run from a Parquet file
data_link = '/export/usuarios_ml4ds/cggamella/RAG_tool/files/anotacion_manual/fam/S2_Kwds3_AI_with_text_30000.parquet'
df = pd.read_parquet(data_link)
df.head(3)

Unnamed: 0,id,pmid,doi,year,title,paperAbstract,Kwd_count,text
0,3091,,10.1145/505282.505283,2001.0,Machine learning in automated text categorization,"The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.",6,"Machine learning in automated text categorization The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. 
This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation."
1,17043,17149504.0,10.1055/s-0038-1634127,2006.0,Bayesian Random-effect Model for Predicting Outcome Fraught with Heterogeneity,"Summary Objective: The study aimed to develop a predictive model to deal with data fraught with heterogeneity that cannot be explained by sampling variation or measured covariates. Methods: The random-effect Poisson regression model was first proposed to deal with over-dispersion for data fraught with heterogeneity after making allowance for measured covariates. Bayesian acyclic graphic model in conjunction with Markov Chain Monte Carlo (MCMC) technique was then applied to estimate the parameters of both relevant covariates and random effect. Predictive distribution was then generated to compare the predicted with the observed for the Bayesian model with and without random effect. Data from repeated measurement of episodes among 44 patients with intractable epilepsy were used as an illustration. Results: The application of Poisson regression without taking heterogeneity into account to epilepsy data yielded a large value of heterogeneity (heterogeneity factor = 17.90, deviance = 1485, degree of freedom (df) = 83). After taking the random effect into account, the value of heterogeneity factor was greatly reduced (heterogeneity factor = 0.52, deviance = 42.5, df = 81). The Pearson χ2 for the comparison between the expected seizure frequencies and the observed ones at two and three months of the model with and without random effect were 34.27 (p = 1.00) and 1799.90 (p <0.0001), respectively. 
Conclusion: The Bayesian acyclic model using the MCMC method was demonstrated to have great potential for disease prediction while data show over-dispersion attributed either to correlated property or to subject-to-subject variability.",3,"Bayesian Random-effect Model for Predicting Outcome Fraught with Heterogeneity Summary Objective: The study aimed to develop a predictive model to deal with data fraught with heterogeneity that cannot be explained by sampling variation or measured covariates. Methods: The random-effect Poisson regression model was first proposed to deal with over-dispersion for data fraught with heterogeneity after making allowance for measured covariates. Bayesian acyclic graphic model in conjunction with Markov Chain Monte Carlo (MCMC) technique was then applied to estimate the parameters of both relevant covariates and random effect. Predictive distribution was then generated to compare the predicted with the observed for the Bayesian model with and without random effect. Data from repeated measurement of episodes among 44 patients with intractable epilepsy were used as an illustration. Results: The application of Poisson regression without taking heterogeneity into account to epilepsy data yielded a large value of heterogeneity (heterogeneity factor = 17.90, deviance = 1485, degree of freedom (df) = 83). After taking the random effect into account, the value of heterogeneity factor was greatly reduced (heterogeneity factor = 0.52, deviance = 42.5, df = 81). The Pearson χ2 for the comparison between the expected seizure frequencies and the observed ones at two and three months of the model with and without random effect were 34.27 (p = 1.00) and 1799.90 (p <0.0001), respectively. Conclusion: The Bayesian acyclic model using the MCMC method was demonstrated to have great potential for disease prediction while data show over-dispersion attributed either to correlated property or to subject-to-subject variability."
2,62989,,10.21917/IJSC.2013.0095,2013.0,ONTOLOGY BASED MEANINGFUL SEARCH USING SEMANTIC WEB AND NATURAL LANGUAGE PROCESSING TECHNIQUES,"The semantic web extends the current World Wide Web by adding facilities for the machine understood description of meaning. The ontology based search model is used to enhance efficiency and accuracy of information retrieval. Ontology is the core technology for the semantic web and this mechanism for representing formal and shared domain descriptions. In this paper, we proposed ontology based meaningful search using semantic web and Natural Language Processing (NLP) techniques in the educational domain. First we build the educational ontology then we present the semantic search system. The search model consisting three parts which are embedding spell-check, finding synonyms using WordNet API and querying ontology using SPARQL language. The results are both sensitive to spell check and synonymous context. This paper provides more accurate results and the complete details for the selected field in a",3,"ONTOLOGY BASED MEANINGFUL SEARCH USING SEMANTIC WEB AND NATURAL LANGUAGE PROCESSING TECHNIQUES The semantic web extends the current World Wide Web by adding facilities for the machine understood description of meaning. The ontology based search model is used to enhance efficiency and accuracy of information retrieval. Ontology is the core technology for the semantic web and this mechanism for representing formal and shared domain descriptions. In this paper, we proposed ontology based meaningful search using semantic web and Natural Language Processing (NLP) techniques in the educational domain. First we build the educational ontology then we present the semantic search system. The search model consisting three parts which are embedding spell-check, finding synonyms using WordNet API and querying ontology using SPARQL language. The results are both sensitive to spell check and synonymous context. 
This paper provides more accurate results and the complete details for the selected field in a"


In [87]:
len(df)

34257

In [88]:
# df.head(10)
df.keys()

Index(['identifier', 'id_tm', 'texto_preprocesado', 'texto_sin_preprocesar',
       'CPV'],
      dtype='object')

In [96]:
# Rename used for the earlier dataset:
# df.rename(columns={'id_tm': 'doc_id', 'texto_sin_preprocesar': 'text'}, inplace=True)
# This dataset already has a `text` column, so only the ID column needs renaming.
df.rename(columns={'id': 'doc_id'}, inplace=True)

In [97]:
# Preview of dataframe
display(df[["doc_id", "text"]].head())

Unnamed: 0,doc_id,text
0,3091,"Machine learning in automated text categorization The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation."
1,17043,"Bayesian Random-effect Model for Predicting Outcome Fraught with Heterogeneity Summary Objective: The study aimed to develop a predictive model to deal with data fraught with heterogeneity that cannot be explained by sampling variation or measured covariates. Methods: The random-effect Poisson regression model was first proposed to deal with over-dispersion for data fraught with heterogeneity after making allowance for measured covariates. Bayesian acyclic graphic model in conjunction with Markov Chain Monte Carlo (MCMC) technique was then applied to estimate the parameters of both relevant covariates and random effect. Predictive distribution was then generated to compare the predicted with the observed for the Bayesian model with and without random effect. Data from repeated measurement of episodes among 44 patients with intractable epilepsy were used as an illustration. Results: The application of Poisson regression without taking heterogeneity into account to epilepsy data yielded a large value of heterogeneity (heterogeneity factor = 17.90, deviance = 1485, degree of freedom (df) = 83). After taking the random effect into account, the value of heterogeneity factor was greatly reduced (heterogeneity factor = 0.52, deviance = 42.5, df = 81). The Pearson χ2 for the comparison between the expected seizure frequencies and the observed ones at two and three months of the model with and without random effect were 34.27 (p = 1.00) and 1799.90 (p <0.0001), respectively. Conclusion: The Bayesian acyclic model using the MCMC method was demonstrated to have great potential for disease prediction while data show over-dispersion attributed either to correlated property or to subject-to-subject variability."
2,62989,"ONTOLOGY BASED MEANINGFUL SEARCH USING SEMANTIC WEB AND NATURAL LANGUAGE PROCESSING TECHNIQUES The semantic web extends the current World Wide Web by adding facilities for the machine understood description of meaning. The ontology based search model is used to enhance efficiency and accuracy of information retrieval. Ontology is the core technology for the semantic web and this mechanism for representing formal and shared domain descriptions. In this paper, we proposed ontology based meaningful search using semantic web and Natural Language Processing (NLP) techniques in the educational domain. First we build the educational ontology then we present the semantic search system. The search model consisting three parts which are embedding spell-check, finding synonyms using WordNet API and querying ontology using SPARQL language. The results are both sensitive to spell check and synonymous context. This paper provides more accurate results and the complete details for the selected field in a"
3,79336,"Natural Object Recognition: A Theoretical Framework and Its Implementation Most work in visual recognition by computer has focused on recognizing objects by their ge­ ometric shape, or by the presence or absence of some prespecified collection of locally measur­ able attributes (e.g., spectral reflectance, texture , or distinguished markings). On the other hand, most entities in the natural world defy compact description of their shapes, and have no characteristic features with discriminatory power. As a result, image-understanding re­ search has achieved little success toward recog­ nition in natural scenes. We offer a fundamentally new approach t.o visual recognition that avoids these limitations and has been used to recognize trees, bushes, grass, and trails in ground-level scenes of a natural environment. 1 Introduction The key scientific question addressed by our research has been the design of a computer vision system that, can approach human level performance in the interpre­ tation of natural scenes such as that shown in Fig­ ure 1. We offer a. new paradigm for the design of computer vision systems that holds promise for achiev­ ing near-human competence, and report the experimen­ tal results of a system implementing that theory which demonstrates its recognition abilities in a natural do­ main of limited geographic extent,. The purpose of this paper is t.o review the key ideas underlying our ap­ proach (discussed in detail in previous publications [12, 13]) and to focus on the results of an ongoing experi­ mental evaluation of these ideas as embodied in an im­ plemented system called Condor. When examining the reasons why the traditional ap­ proaches to computer vision fail in the interpretation of ground-level scenes of the natural world, four fundamen­ tal problems become apparent: Universal partitioning — Most scene-understanding systems begin with the segmentation of an image Figure 1: A natural outdoor scene of the experimenta­ tion site. 
into homogeneous regions using a single partition ing algorithm applied to the entire image. If that partitioning is wrong, then the interpretation must also be wrong, no matter how a system assigns se­ mantic labels to those regions. Unfortunately, uni­ versal partitioning algorithms are notoriously poor delineators of natural objects in ground-level scenes. Shape — Many man-made artifacts can be recognized by matching a 3D geometric model with features extracted from an image [l, 2, 4, 6, 7, 9, 15], but most natural objects cannot be so recognized. Nat ural objects are assigned names on the basis of their setting, appearance, and context, …"
4,149368,"Exploiting affinity propagation for automatic acquisition of domain concept in ontology learning Semantic Web uses domain ontology to bridge the gap among the members of a domain through minimization of conceptual and terminological incompatibilities. However, several barriers must be overcome before domain ontology becomes a practical and useful tool. One important issue is identification and selection of domain concepts for domain ontology learning when several hundreds or even thousands of terms are extracted and available from relevant text documents shared among the members of a domain. We present a novel domain concept acquisition and selection approach for ontology learning that uses affinity propagation algorithm, which takes as input semantic and structural similarity between pairs of extracted terms called data points. Real-valued messages are passed between data points (terms) until high quality set of exemplars (concepts) and cluster iteratively emerges. All exemplars will be considered as domain concepts for learning domain ontologies. Our empirical results show that our approach achieves high precision and recall in selection of domain concepts using less number of iterations."


In [98]:
len(df)

30000

In [99]:
df = df[:100]

In [100]:
df = df.drop_duplicates(subset=['doc_id'], keep='first')
len(df)

100
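Slicing to the first 100 rows before dropping duplicates can leave fewer than 100 documents whenever an ID repeats inside the slice. One way to guard against that is to validate the required columns, deduplicate first, and only then truncate. The helper below is a sketch under those assumptions; `prepare_docs` and the toy frame are hypothetical and not part of LLooM.

```python
import pandas as pd


def prepare_docs(df: pd.DataFrame, id_col: str, text_col: str, n: int = None) -> pd.DataFrame:
    """Check required columns, drop duplicate IDs, then keep the first n rows."""
    missing = [col for col in (id_col, text_col) if col not in df.columns]
    if missing:
        raise KeyError(f"Missing required column(s): {missing}")
    # Deduplicate before truncating so that n counts unique documents.
    out = df.drop_duplicates(subset=[id_col], keep="first")
    if n is not None:
        out = out.head(n)
    return out.reset_index(drop=True)


# Toy data standing in for the real dataset.
demo = pd.DataFrame({"doc_id": [1, 1, 2, 3], "text": ["a", "a again", "b", "c"]})
print(prepare_docs(demo, id_col="doc_id", text_col="text", n=2))
```

With the real data this would be called as `prepare_docs(df, "doc_id", "text", n=100)`.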

## v1: Manual mode

This notebook shows two example workflows: **v1: Manual mode** and **v2: Auto mode**. We recommend starting with **v1: Manual mode** to survey the LLooM concepts and get a sense for the underlying functions.

### Create a LLooM instance
After loading your data as a pandas DataFrame, create a new LLooM instance. You must specify the name of the column that contains your input text documents (`text_col`). The ID column (`id_col`) is optional.

In [102]:
# Set up the LLooM instance with the specified dataset
l = wb.lloom(
    df=df,
    text_col="text",
    id_col="doc_id",  # Optional
)

### Run concept generation
Next, start the concept induction process by generating concepts. The `seed` parameter is optional; omit it if you do not want to use a seed.
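The generation call is awaited directly in the notebook, which works because Jupyter supports top-level `await`. In a plain Python script there is no running event loop, so the coroutine has to be driven explicitly, for example with `asyncio.run`. The sketch below shows the pattern with a hypothetical stand-in coroutine, `fake_gen`, in place of the real `l.gen(...)` call; it makes no API requests.

```python
import asyncio


async def fake_gen(seed=None):
    """Hypothetical stand-in for an awaitable such as `l.gen(seed=...)`."""
    await asyncio.sleep(0)  # yield to the event loop once, as real I/O would
    return {"seed": seed, "status": "done"}


def main():
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(fake_gen(seed=None))


if __name__ == "__main__":
    print(main())
```

Inside a notebook, keep using `await` as in the cell that follows; calling `asyncio.run` there would fail because a loop is already running.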

In [103]:
cur_seed = None  # Optionally replace with string
await l.gen(seed=cur_seed)

N sentences: Median=7, Std=4.65
Auto-suggested parameters: {'filter_n_quotes': 6, 'summ_n_bullets': 4, 'synth_n_concepts': 6}


Estimated cost: $0.2
**Please note that this is only an approximate cost estimate**


Action required


Proceed with generation? (y/n):  y




Distill-filter
⠼ Loading Error Error code: 404 - {'error': {'message': 'Invalid URL (POST /v1/chat/completions)', 'type': 'invalid_request_error', 'param': None, 'code': None}}
⠧ Loading Error Error code: 404 - {'error': {'message': 'Invalid URL (POST /v1/chat/completions)', 'type': 'invalid_request_error', 'param': None, 'code': None}}
✅ Done    


Unnamed: 0,doc_id,text
0,3091,"The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them.\nIn the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories.\nThe advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains.\nThis survey discusses the main approaches to text categorization that fall within the machine learning paradigm.\nWe will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation."
1,17043,"The study aimed to develop a predictive model to deal with data fraught with heterogeneity that cannot be explained by sampling variation or measured covariates.\nThe random-effect Poisson regression model was first proposed to deal with over-dispersion for data fraught with heterogeneity after making allowance for measured covariates.\nThe application of Poisson regression without taking heterogeneity into account to epilepsy data yielded a large value of heterogeneity (heterogeneity factor = 17.90, deviance = 1485, degree of freedom (df) = 83).\nAfter taking the random effect into account, the value of heterogeneity factor was greatly reduced (heterogeneity factor = 0.52, deviance = 42.5, df = 81).\nThe Pearson χ2 for the comparison between the expected seizure frequencies and the observed ones at two and three months of the model with and without random effect were 34.27 (p = 1.00) and 1799.90 (p <0.0001), respectively.\nThe Bayesian acyclic model using the MCMC method was demonstrated to have great potential for disease prediction while data show over-dispersion attributed either to correlated property or to subject-to-subject variability."
2,62989,"The semantic web extends the current World Wide Web by adding facilities for the machine understood description of meaning.\nThe ontology based search model is used to enhance efficiency and accuracy of information retrieval.\nOntology is the core technology for the semantic web and this mechanism for representing formal and shared domain descriptions.\nIn this paper, we proposed ontology based meaningful search using semantic web and Natural Language Processing (NLP) techniques in the educational domain.\nThe search model consisting three parts which are embedding spell-check, finding synonyms using WordNet API and querying ontology using SPARQL language.\nThis paper provides more accurate results and the complete details for the selected field in a"
3,79336,"Most work in visual recognition by computer has focused on recognizing objects by their geometric shape, or by the presence or absence of some prespecified collection of locally measurable attributes.\nWe offer a fundamentally new approach to visual recognition that avoids these limitations and has been used to recognize trees, bushes, grass, and trails in ground-level scenes of a natural environment.\nThe key scientific question addressed by our research has been the design of a computer vision system that can approach human level performance in the interpretation of natural scenes.\nWe offer a new paradigm for the design of computer vision systems that holds promise for achieving near-human competence, and report the experimental results of a system implementing that theory which demonstrates its recognition abilities in a natural domain of limited geographic extent.\nWhen examining the reasons why the traditional approaches to computer vision fail in the interpretation of ground-level scenes of the natural world, four fundamental problems become apparent:\nNatural objects are assigned names on the basis of their setting, appearance, and context."
4,149368,"Semantic Web uses domain ontology to bridge the gap among the members of a domain through minimization of conceptual and terminological incompatibilities.\nOne important issue is identification and selection of domain concepts for domain ontology learning when several hundreds or even thousands of terms are extracted and available from relevant text documents shared among the members of a domain.\nWe present a novel domain concept acquisition and selection approach for ontology learning that uses affinity propagation algorithm, which takes as input semantic and structural similarity between pairs of extracted terms called data points.\nReal-valued messages are passed between data points (terms) until high quality set of exemplars (concepts) and cluster iteratively emerges.\nAll exemplars will be considered as domain concepts for learning domain ontologies.\nOur empirical results show that our approach achieves high precision and recall in selection of domain concepts using less number of iterations."
5,160975,"Automatic recognition of spoken digits is one of the difficult tasks in the field of computer speech recognition.\nSpoken digits recognition process is required in many applications such as speech based telephone dialing, airline reservation, automatic directory to retrieve or send information, etc.\nArabic language is a Semitic language that differs from European languages such as English.\nThe system is designed to recognize an isolated whole-word speech.\nA hidden Markov model based speech recognition system was designed and tested with automatic Arabic digits recognition.\nThis recognition system achieved 93.72% overall correct rate of digit recognition."
6,167440,"Supervised machine learning techniques, such as artificial neural networks (ANNs), are a promising approach to address time complexity concerns of adaptive enterprise DRE systems.\nLikewise, ANNs address the development complexity of adaptive DRE systems by ensuring that adaptations are appropriate for the operating environment.\nThis paper empirically evaluates the accuracy and timeliness of the ANN machine learning technique for environments on which it has been trained.\nOur results show ANNs are highly accurate in determining correct adaptations and provide predictable time complexity, e.g., with response times less than 6 μseconds."
7,203938,"Gaussian-Bernoulli deep Boltzmann machine (GDBM)\nGDBM is designed to be applicable to continuous data\nThe studied improvements of the learning algorithm for GDBM include parallel tempering, enhanced gradient, adaptive learning rate and layer-wise pretraining\nWe empirically show that they help avoid some of the common difficulties found in training deep Boltzmann machines\ndivergence of learning, the difficulty in choosing right learning rate scheduling, and the existence of meaningless higher layers"
8,218714,"The flexibility of semi-random feature lies between the fully adjustable units in deep learning and the random features used in kernel methods.\nwe prove with no unrealistic assumptions that the model classes contain an arbitrarily good function as the width increases (universality)\nwe can find such a good function (optimization theory) that generalizes to unseen new data (generalization bound)\nwe prove universal approximation ability, a lower bound on approximation error, a partial optimization guarantee, and a generalization bound.\nthe generalization bound of deep semi-random features can be exponentially better than the known bounds of deep ReLU nets\nsemi-random features can match the performance of neural networks by using slightly more units, and it outperforms random features by using significantly fewer units."
9,239550,"Our network outperformed a Kalman filter, predicting more of the higher frequency fluctuations in stock price.\nLearning from past history is a fundamentally ill-posed.\nWith recurrent neural networks (RNNs), we leverage the modeling abilities of neural networks (NNs) for time series forecasting.\nIn RNNs, signals passing through recurrent connections constitute an effective memory for the network, which can then use information in memory to better predict future time series values.\nTraditional techniques used with feedforward NNs such as backpropagation fail to yield acceptable performance.\nEcho State Networks (ESNs) and Liquid State Machines (LSMs) have met with success in modeling nonlinear dynamical systems."




Distill-summarize
⠦ Loading Error Error code: 404 - {'error': {'message': 'Invalid URL (POST /v1/chat/completions)', 'type': 'invalid_request_error', 'param': None, 'code': None}}
✅ Done    


Unnamed: 0,doc_id,text
0,3091,Text categorization using machine learning techniques
1,3091,Advantages over knowledge engineering approach
2,3091,Main approaches within machine learning paradigm
3,3091,"Issues: document representation, classifier construction, evaluation"
4,17043,Developing predictive model for data heterogeneity
5,17043,Using random-effect Poisson regression for over-dispersion
6,17043,Reducing heterogeneity factor with random effect inclusion
7,17043,Bayesian model for disease prediction potential
8,62989,Semantic web enhances web with meaning
9,62989,Ontology-based search improves information retrieval




Cluster
✅ Done    


Unnamed: 0,doc_id,text,cluster_id
0,3091,Text categorization using machine learning techniques,-1
264,2474316,Machine learning for music genre classification,-1
263,2424684,Experiments conducted on 9 Chinese-English test-sets,-1
262,2424684,Special conversion rules ensure complete semantic structures,-1
261,2424684,Improves BLEU score and reduces TER,-1
260,2424684,Algorithm extracts SRL-aware SCFG rules,-1
259,2404842,Optimizing decision trees in real-time,-1
258,2404842,Training new decision trees quickly,-1
257,2404842,Introducing an incremental decision tree algorithm,-1
256,2404842,Adapting to new samples for classification,-1




Synthesize
✅ Done    


Input examples: ['Text categorization using machine learning techniques', 'Machine learning for music genre classification', 'Experiments conducted on 9 Chinese-English test-sets', 'Special conversion rules ensure complete semantic structures', 'Improves BLEU score and reduces TER', 'Algorithm extracts SRL-aware SCFG rules', 'Optimizing decision trees in real-time', 'Training new decision trees quickly', 'Introducing an incremental decision tree algorithm', 'Adapting to new samples for classification', 'Inductive models for genre distinctions', 'EM interpretation with binary latent variables', 'Learn generative FRAME model using CNN filters', 'CNN successful in computer vision tasks', 'Potential for substantial improvements shown in examples', 'Local decision boundaries determined and neighborhoods modified', 'Effective metric estimated using local linear discriminant analysis', 'Locally adaptive nearest neighbor classification proposed', 'Effic

In [105]:
# View cost/time summary
l.summary()

Total time: 33.63 sec (0.56 min)
	('Distill-filter', '2024-09-15-19-11-04'): 6.47 sec
	('Distill-summarize', '2024-09-15-19-11-08'): 3.70 sec
	('Cluster', '2024-09-15-19-11-17'): 8.86 sec
	('Synthesize', '2024-09-15-19-11-27'): 9.92 sec
	('Review-remove', '2024-09-15-19-11-27'): 0.68 sec
	('Review-merge', '2024-09-15-19-11-31'): 4.00 sec


Total cost: $0.17
	('Distill-filter', '2024-09-15-19-11-04'): $0.039
	('Distill-summarize', '2024-09-15-19-11-08'): $0.018
	('Synthesize', '2024-09-15-19-11-27'): $0.102
	('Review-remove', '2024-09-15-19-11-27'): $0.003
	('Review-merge', '2024-09-15-19-11-31'): $0.008


Tokens: total=84969, in=64019, out=20950

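As a quick sanity check, the per-step figures in the summary output above can be added up by hand. The sketch below copies those numbers into plain Python dictionaries (the variable names are illustrative and not part of LLooM) and confirms that they match the reported totals. Note that the cost listing in this output has no `Cluster` entry, so only five steps contribute to the cost.

```python
# Per-step times (seconds) copied from the summary output above
times_sec = {
    "Distill-filter": 6.47,
    "Distill-summarize": 3.70,
    "Cluster": 8.86,
    "Synthesize": 9.92,
    "Review-remove": 0.68,
    "Review-merge": 4.00,
}

# Per-step costs (USD) copied from the summary output above
costs_usd = {
    "Distill-filter": 0.039,
    "Distill-summarize": 0.018,
    "Synthesize": 0.102,
    "Review-remove": 0.003,
    "Review-merge": 0.008,
}

# Token counts copied from the summary output above
tokens_in, tokens_out = 64019, 20950

total_time = sum(times_sec.values())
total_cost = sum(costs_usd.values())
total_tokens = tokens_in + tokens_out

print(f"Total time: {total_time:.2f} sec ({total_time / 60:.2f} min)")
print(f"Total cost: ${total_cost:.2f}")
print(f"Tokens: total={total_tokens}")
```

The same check works for any later `summary()` output: copy the per-step values and compare the sums against the reported totals.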

### Review concepts

Review the generated concepts and select concepts to inspect further:

In [47]:
!jupyter nbextension enable --py widgetsnbextension

Traceback (most recent call last):
  File "/Server/python/anaconda3/bin/jupyter-nbextension", line 5, in <module>
    from notebook.nbextensions import main
ModuleNotFoundError: No module named 'notebook.nbextensions'


In [106]:
l.select()

ConceptSelectWidget(data='{"c8a0d6af-7908-48a9-b8d3-2f063884ba35": {"id": "c8a0d6af-7908-48a9-b8d3-2f063884ba3…

In [72]:
# You can also double-check your selected concepts with this command
l.show_selected()



Active concepts (n=19):
- Safety and Health: Does the text involve safety measures, health studies, or emergency repairs?
- Environmental Management: Does the text discuss environmental restoration, waste management, or geotechnical studies?
- Specific Location: Does the text specify a particular location or address?
- Landscaping and Urbanization: Does the text involve landscaping or urban development?
- Carpentry and Installations: Does the text mention carpentry or installation of fixtures?
- Maintenance and Cleaning: Is the text about maintenance or cleaning of a facility?
- Flooring Work: Is the text about repairing or replacing flooring?
- Wood-Related Work: Does the text involve working with wood, either repairing or installing?
- Isolation Improvements: Does the text discuss improvements related to sound or climate isolation?
- Water Management: Does the example involve management or repai

### Score concepts
Next, apply the selected concepts to the full dataset with `score()`. This function scores every document against each concept, indicating how well the document matches that concept's inclusion criteria.

In [74]:
# Run concept scoring
score_df = await l.score()



Scoring 19 concepts for 100 documents
Estimated cost: $0.3
**Please note that this is only an approximate cost estimate**


Action required


Proceed with scoring? (y/n):  y


100%|██████████| 19/19 [03:09<00:00,  9.95s/it]
✅ Done with concept scoring!
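
Once scoring has finished, a common next step is to summarize how often each concept matches. The sketch below uses a small hand-made table rather than the real `score_df`; the column names (`doc_id`, `concept_name`, `score`), the score values, and the 0.75 threshold are all assumptions made for illustration, so check the columns of your own `score_df` before adapting it.

```python
import pandas as pd

# Toy long-format score table: one row per (document, concept) pair.
# Column names and values are hypothetical stand-ins for real scoring output.
scores = pd.DataFrame(
    {
        "doc_id": [1, 1, 2, 2, 3, 3],
        "concept_name": ["Flooring Work", "Water Management"] * 3,
        "score": [1.0, 0.0, 0.75, 0.25, 0.0, 1.0],
    }
)

# Assumed cutoff for treating a document as matching a concept
THRESHOLD = 0.75
scores["match"] = scores["score"] >= THRESHOLD

# Count matches and compute the match rate per concept
summary = (
    scores.groupby("concept_name")["match"]
    .agg(["sum", "mean"])
    .rename(columns={"sum": "n_matches", "mean": "match_rate"})
)
print(summary)
```

A per-concept table like this makes it easy to spot concepts that match almost everything or almost nothing before moving on to the visualization.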


In [75]:
# View cost/time summary
l.summary()

Total time: 222.41 sec (3.71 min)
	('Distill-summarize', '2024-09-11-17-14-54'): 3.69 sec
	('Cluster', '2024-09-11-17-15-05'): 11.63 sec
	('Synthesize', '2024-09-11-17-15-14'): 8.52 sec
	('Review-remove', '2024-09-11-17-15-15'): 0.80 sec
	('Review-merge', '2024-09-11-17-15-23'): 8.74 sec
	('Score', '2024-09-11-17-26-12'): 189.04 sec


Total cost: $0.74
	('Distill-summarize', '2024-09-11-17-14-54'): $0.010
	('Synthesize', '2024-09-11-17-15-14'): $0.064
	('Review-remove', '2024-09-11-17-15-15'): $0.007
	('Review-merge', '2024-09-11-17-15-23'): $0.016
	('Score-helper', '2024-09-11-17-23-14'): $0.034
	('Score-helper', '2024-09-11-17-23-23'): $0.035
	('Score-helper', '2024-09-11-17-23-34'): $0.034
	('Score-helper', '2024-09-11-17-23-43'): $0.034
	('Score-helper', '2024-09-11-17-23-54'): $0.034
	('Score-helper', '2024-09-11-17-24-00'): $0.034
	('Score-helper', '2024-09-11-17-24-15'): $0.034
	('Score-helper', '2024-09-11-17-24-23'): $0.034
	('Score-helper', '2024-09-11-17-24-3

### Visualize results
Now, you can visualize the results in the main **LLooM Workbench** view. An interactive widget will appear when you run the `vis` function:
![LLooM Workbench UI](https://github.com/michelle123lam/lloom/blob/main/docs/public/media/lloom_workbench_ui.png?raw=1)

The **Concept Overview (A)** provides a high-level summary. Click on a concept row in the **Concept Matrix (B)** to see its **Detail View (C)**, or click on a slice column to see its corresponding Detail View.

In [None]:
# Visualize concept results
# Group data by the `sub_labels` column with slice_col
l.vis(slice_col="sub_labels")
# l.vis()

In [None]:
# Visualize concept results
# Group data by page category with slice_col
l.vis(slice_col="Page Category")

### (Optional) Try normalizing by slice or by concept


In [None]:
l.vis()

In [None]:
l.vis(norm_by="concept")
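
To build intuition for the two normalization options, the sketch below normalizes a tiny concept-by-slice count matrix with pandas. The counts, concept names, and slice names are made up, and this is only an illustration of the general idea; it is an assumption that `norm_by` behaves analogously, not a description of LLooM's internal computation.

```python
import pandas as pd

# Hypothetical counts of matching documents: rows are concepts, columns are slices
counts = pd.DataFrame(
    {"low_likes": [2, 6], "high_likes": [8, 4]},
    index=["Concept A", "Concept B"],
)

# Normalize by concept: divide each row by its row total, so each row sums to 1
by_concept = counts.div(counts.sum(axis=1), axis=0)

# Normalize by slice: divide each column by its column total, so each column sums to 1
by_slice = counts.div(counts.sum(axis=0), axis=1)

print(by_concept)
print(by_slice)
```

Normalizing by concept answers "where do this concept's matches fall across slices?", while normalizing by slice answers "which concepts dominate within this slice?".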

### (Optional) Add manual concept
You may also manually add your own custom concepts by providing a name and prompt. This will automatically score the data by that concept. Re-run the `vis()` function to see the new concept results.

In [None]:
# Add a custom concept with the given name and prompt
await l.add(
    name="Your new concept name",
    prompt="Your new concept criteria prompt",  # Ex: "Does the text include [...]?"
)

In [None]:
# Visualize concept results
l.vis(slice_col="Likes")

### (Optional) Submit your results
**🖼️ ✨ Submit your work for a chance to be featured on our site!**

If you'd like to share what you've done with LLooM or would like your work featured in a gallery of results, please submit your LLooM instance with the `submit()` function! If your submission is selected, we'll reach out to you to follow up and hear more about your work with LLooM.

In [None]:
l.submit()  # You will be prompted to provide a few details about your analysis

### (Optional) Export and/or save results

In [77]:
l

<text_lloom.workbench.lloom at 0x7f75fa255390>

In [None]:
# Export the results to a dataframe
export_df = l.export_df()

In [None]:
export_df.head()

In [None]:
# Save the LLooM instance to a pickle file
l.save(folder="your/path/here", file_name="your_file_name")

## v2: Auto mode

LLooM also provides an **auto** mode that offers less control but combines concept generation and scoring into a single function call. You can try it out with the cells below.

### Create a LLooM instance
After loading your data as a pandas DataFrame, create a new LLooM instance. You will need to specify the name of the column that contains your input text documents (`text_col`). The ID column (`id_col`) is optional.

In [101]:
# Set up the LLooM instance with the specified dataset
l = wb.lloom(
    df=df,
    text_col="text",
    id_col="doc_id",  # Optional
)

### Run concept generation
Next, start the concept induction process by generating concepts. The `seed` parameter is optional; omit it or leave it as `None` if you do not want to use a seed.

In [93]:
cur_seed = None  # Optionally replace with string
score_df = await l.gen_auto(seed=cur_seed, max_concepts=5)

N sentences: Median=1, Std=1.08
Auto-suggested parameters: {'filter_n_quotes': 1, 'summ_n_bullets': 1, 'synth_n_concepts': 6}


Estimated cost: $10.32
**Please note that this is only an approximate cost estimate**


Action required


Proceed with generation? (y/n):  n


Cancelled generation


ValueError: Sample larger than population or is negative

In [80]:
# View cost/time summary
l.summary()

Total time: 85.50 sec (1.42 min)
	('Distill-summarize', '2024-09-11-17-27-01'): 3.27 sec
	('Cluster', '2024-09-11-17-27-08'): 7.33 sec
	('Synthesize', '2024-09-11-17-27-19'): 10.72 sec
	('Review-remove', '2024-09-11-17-27-20'): 0.87 sec
	('Review-merge', '2024-09-11-17-27-32'): 12.41 sec
	('Score', '2024-09-11-17-32-14'): 50.89 sec


Total cost: $0.26
	('Distill-summarize', '2024-09-11-17-27-01'): $0.010
	('Synthesize', '2024-09-11-17-27-19'): $0.054
	('Review-remove', '2024-09-11-17-27-20'): $0.005
	('Review-merge', '2024-09-11-17-27-32'): $0.015
	('Score-helper', '2024-09-11-17-31-30'): $0.034
	('Score-helper', '2024-09-11-17-31-40'): $0.034
	('Score-helper', '2024-09-11-17-31-52'): $0.034
	('Score-helper', '2024-09-11-17-32-00'): $0.034
	('Score-helper', '2024-09-11-17-32-14'): $0.035


Tokens: total=243025, in=179885, out=63140


### Visualize results
Now, you can visualize the results in the main **LLooM Workbench** view. An interactive widget will appear when you run the `vis` function:
![LLooM Workbench UI](https://github.com/michelle123lam/lloom/blob/main/docs/public/media/lloom_workbench_ui.png?raw=1)

The **Concept Overview (A)** provides a high-level summary. Click on a concept row in the **Concept Matrix (B)** to see its **Detail View (C)**, or click on a slice column to see its corresponding Detail View.

In [None]:
# Visualize concept results
# Group data by the number of likes (automatically binned) with slice_col
l.vis(slice_col="Likes")

In [None]:
# Visualize concept results
# Group data by page category with slice_col
l.vis(slice_col="Page Category")
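
Slicing by a numeric column such as `Likes` relies on grouping the values into bins. The notebook does not say how LLooM picks its bins, so the sketch below simply shows one common approach, quantile binning with `pd.qcut`, on made-up like counts. The values, the choice of three bins, and the bin labels are all assumptions for illustration.

```python
import pandas as pd

# Made-up like counts for six posts
likes = pd.Series([3, 15, 40, 120, 560, 2300], name="Likes")

# Split into three equally sized groups by quantile (an assumed strategy)
bins = pd.qcut(likes, q=3, labels=["low", "medium", "high"])

# Show how many posts land in each bin
print(bins.value_counts().sort_index())
```

If you prefer to control the bins yourself, one option is to add a binned column like this to your DataFrame before creating the LLooM instance and pass its name as `slice_col`.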

### (Optional) Try normalizing by slice or by concept


In [None]:
l.vis(slice_col="Likes", norm_by="slice")

In [None]:
l.vis(norm_by="concept")

### (Optional) Add manual concept
You may also manually add your own custom concepts by providing a name and prompt. This will automatically score the data by that concept. Re-run the `vis()` function to see the new concept results.

In [None]:
# Add a custom concept with the given name and prompt
await l.add(
    name="Your new concept name",
    prompt="Your new concept criteria prompt",  # Ex: "Does the text include [...]?"
)

In [None]:
# Visualize concept results
l.vis(slice_col="Likes")

### (Optional) Submit your results
**🖼️ ✨ Submit your work for a chance to be featured on our site!**

If you'd like to share what you've done with LLooM or would like your work featured in a gallery of results, please submit your LLooM instance with the `submit()` function! If your submission is selected, we'll reach out to you to follow up and hear more about your work with LLooM.

In [None]:
l.submit()  # You will be prompted to provide a few details about your analysis

### (Optional) Export and/or save results

In [None]:
# Export the results to a dataframe
export_df = l.export_df()

In [None]:
export_df.head()

In [None]:
# Save the LLooM instance to a pickle file
l.save(folder="your/path/here", file_name="your_file_name")