# LLooM: Getting Started - Template Notebook

Last Updated: April 2024

### Installation
First, install the LLooM Python package, available on PyPI as [`text_lloom`](https://pypi.org/project/text_lloom/). We recommend setting up a virtual environment with [venv](https://docs.python.org/3/library/venv.html#creating-virtual-environments) or [conda](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands).

In [32]:
!pip install text_lloom --quiet
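Because the install cell runs with `--quiet`, it prints nothing on success. A minimal sketch for confirming that the package is visible to the current kernel, using only the standard library; the helper name `installed_version` is hypothetical:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "text_lloom" is the PyPI name given in the installation note above.
version = installed_version("text_lloom")
print(f"text_lloom version: {version or 'not installed'}")
```

If this reports `not installed` right after a successful `pip install`, the notebook kernel is probably running in a different environment from the one `pip` wrote to.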

In [33]:
!pip install pyserial

Defaulting to user installation because normal site-packages is not writeable


In [34]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/lbartolome/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### Imports

In [35]:
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

LLooM uses the OpenAI API under the hood to support its core operators (using GPT-3.5 and GPT-4). You'll first need to locally set the `OPENAI_API_KEY` variable to use your own account.

In [36]:
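# Load OPENAI_API_KEY from a local .env file. Alternatively, set it directly:
# os.environ["OPENAI_API_KEY"] = "sk-123xyz"
load_dotenv('/export/usuarios_ml4ds/lbartolome/Repos/repos_con_carlos/RAG_tool/.env')
# load_dotenv() exports the variable itself; stop early if it is still missing
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"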
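Assigning `os.getenv(...)` back into `os.environ` raises a `TypeError` when the variable is missing, because environment values must be strings. A small sketch of a more explicit check; the helper `require_env` and the placeholder variable `EXAMPLE_API_KEY` are hypothetical and only there so the snippet runs without a real key:

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to the .env file "
            "passed to load_dotenv() before continuing."
        )
    return value


# Demonstration with a placeholder variable so the sketch runs anywhere.
os.environ["EXAMPLE_API_KEY"] = "sk-123xyz"
key = require_env("EXAMPLE_API_KEY")
print(key[:3] + "...")  # show only a prefix, never the full key
```

In the notebook itself the call would be `require_env("OPENAI_API_KEY")`, placed right after `load_dotenv(...)`.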

In [37]:
# Import the LLooM package:
import text_lloom.workbench as wb

### Load data
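The original LLooM template uses a sample dataset of 100 **Facebook posts** from **political** pages, gathered via CrowdTangle. This run instead loads a Parquet file of Spanish public procurement records (minor contracts) and keeps the first 100 rows. The main columns we'll be using in our analysis are the following:
- `id_tm`: ID for each record (renamed to `doc_id` below)
- `texto_sin_preprocesar`: The raw text of the contract description (renamed to `text` below)
- `texto_preprocesado`: The preprocessed version of the text
- `CPV`: The CPV code or codes attached to the contract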

In [38]:
# The template's CSV loaders are kept below for reference; this run reads a local Parquet file instead
# data_link = "https://michelle123lam.github.io/lloom/data/political_fb_posts_100.csv"
# data_link = "/content/NCTE_transcript.csv"
# df = pd.read_csv(data_link)

data_link = '/export/usuarios_ml4ds/cggamella/RAG_tool/files/anotacion_manual/fam/datos_modelo_es_Mallet_df_merged_14_topics_45_ENTREGABLE.parquet'
df = pd.read_parquet(data_link)
df.head(3)

Unnamed: 0,identifier,id_tm,texto_preprocesado,texto_sin_preprocesar,CPV
0,contratosMenoresPerfilesContratantes_2018.zip/contratosMenoresPerfilesContratantes_20190225_140722.atom/147,0,rejillas longitud pista_atletismo estadio,Contrato Menor de Obras para la Instalación de Rejillas en los Fosos de Salto de Longitud de la Pista de Atletismo Anexa al Estadio de los Juegos Mediterráneos,45210000.0
1,contratosMenoresPerfilesContratantes_2018.zip/contratosMenoresPerfilesContratantes_20190225_140722.atom/9,1,acabado revestimiento granito garaje génesis,Acabado en revestimiento de granito en torre de ventilación de garaje en edificio génesis.,"45400000, 45000000"
2,contratosMenoresPerfilesContratantes_2018.zip/contratosMenoresPerfilesContratantes_20190225_140722_10.atom/291,2,almacén inferior cerámica concejalía insercion sociolaboral_persona desempleo,"Obras de remodelación de sala y almacen en planta inferior del centro La Ceramica en sala de exposiciones, Concejalia de Juventud Ayto. de Molina de Segura, incorporando condiciones especiales de ejecución de carácter social relativas a insercion sociolaboral de personas en situación de desempleo de larga duración.",45000000.0


In [39]:
# df.head(10)
df.keys()

Index(['identifier', 'id_tm', 'texto_preprocesado', 'texto_sin_preprocesar',
       'CPV'],
      dtype='object')

In [40]:
df.rename(columns={'id_tm': 'doc_id', 'texto_sin_preprocesar': 'text'}, inplace=True)

In [41]:
# Preview of dataframe
display(df[["doc_id", "text"]].head())

Unnamed: 0,doc_id,text
0,0,Contrato Menor de Obras para la Instalación de Rejillas en los Fosos de Salto de Longitud de la Pista de Atletismo Anexa al Estadio de los Juegos Mediterráneos
1,1,Acabado en revestimiento de granito en torre de ventilación de garaje en edificio génesis.
2,2,"Obras de remodelación de sala y almacen en planta inferior del centro La Ceramica en sala de exposiciones, Concejalia de Juventud Ayto. de Molina de Segura, incorporando condiciones especiales de ejecución de carácter social relativas a insercion sociolaboral de personas en situación de desempleo de larga duración."
3,3,"Realización de trabajos de refuerzo estructural de tabiquería de pladur para anclaje de soportes de proyectos en las aulas del pabellón de formación en el edificio de San Juan de la Cruz, 10."
4,4,"Obras de acondicionamiento de parcela en La Barceloneta para huerto urbano, incorporando medidas especiales de ejecución de carácter social relativas a inserción socio-laboral de personas en situación de desempleo de larga duración"


In [42]:
len(df)

34257

In [43]:
df = df.head(100).copy()  # keep the first 100 rows; copy() avoids chained-assignment warnings later

In [63]:
df = df.drop_duplicates(subset=['doc_id'], keep='first')
len(df)

100
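The preparation steps above (rename, truncate, de-duplicate) can also be written as one chained expression with a couple of sanity checks at the end. This is a sketch on a toy frame with made-up rows; only the column names come from the notebook, and the `dropna` step is an extra assumption that rows without text should be discarded:

```python
import pandas as pd

# Toy frame with hypothetical rows; the column names mirror the ones printed above.
raw = pd.DataFrame(
    {
        "id_tm": [0, 1, 1, 2],
        "texto_sin_preprocesar": ["Obra A", "Obra B", "Obra B", None],
    }
)

prepared = (
    raw.rename(columns={"id_tm": "doc_id", "texto_sin_preprocesar": "text"})
    .dropna(subset=["text"])             # drop rows without any text
    .drop_duplicates(subset=["doc_id"])  # keep the first row per ID
    .head(100)                           # cap the sample size
    .reset_index(drop=True)
)

# Guard against silent problems before handing the frame to LLooM.
assert prepared["doc_id"].is_unique
assert {"doc_id", "text"} <= set(prepared.columns)
print(len(prepared))
```

De-duplicating before `head(100)` guarantees that the sample never ends up smaller than intended just because duplicates were removed afterwards.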

## v1: Manual mode

This notebook shows two example workflows: **v1: Manual mode** and **v2: Auto mode**. We recommend starting with **v1: Manual mode** to survey the LLooM concepts and get a sense of the underlying functions.

### Create a LLooM instance
After loading your data as a pandas DataFrame, create a new LLooM instance. You will need to specify the name of the column that contains your input text documents (`text_col`). The ID column (`id_col`) is optional.

In [64]:
# Set up the LLooM instance with the specified dataset
l = wb.lloom(
    df=df,
    text_col="text",
    id_col="doc_id",  # Optional
)

### Run concept generation
Next, start the concept induction process by generating concepts with `gen()`. The `seed` parameter is optional: pass a string to use a seed, or omit it.
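The cell below relies on top-level `await`, which works in Jupyter because the kernel already runs an event loop. In a plain Python script the same kind of call would have to be wrapped in `asyncio.run()`. The sketch uses a hypothetical stand-in coroutine, `fake_gen`, rather than a real LLooM call, so it runs without an API key:

```python
import asyncio


async def fake_gen(seed=None):
    """Hypothetical stand-in for an awaitable such as `l.gen(seed=...)`."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return f"generated with seed={seed!r}"


def main():
    # asyncio.run() creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(fake_gen(seed=None))


result = main()
print(result)
```

Inside a notebook, keep the plain `await`: `asyncio.run()` raises an error when an event loop is already running.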

In [65]:
cur_seed = None  # Optionally replace with string
await l.gen(seed=cur_seed)

N sentences: Median=1, Std=0.46
Auto-suggested parameters: {'filter_n_quotes': 1, 'summ_n_bullets': 1, 'synth_n_concepts': 6}

Estimated cost: $0.11
**Please note that this is only an approximate cost estimate**

Action required

Proceed with generation? (y/n):  y

Distill-summarize
✅ Done


Unnamed: 0,doc_id,text
0,0,Installation of grilles in long jump pits
1,1,Granite finish on garage ventilation tower
2,2,Remodeling of La Ceramica center's lower floor
3,3,Refuerzo estructural de tabiquería de pladur
4,4,Acondicionamiento de parcela en La Barceloneta
5,5,Landscaping works on greenway in Molina de Segura
6,6,Restoration of wooden box with dovetail joints
7,7,Adaptation of medical office basement area
8,8,Reparación de tejado de garajes
9,9,Repair work on concrete beams




[48;5;117mCluster[0m
✅ Done    


Unnamed: 0,doc_id,text,cluster_id
40,37,Installation of traffic signal and pedestrian crossing,-1
36,34,"Maintenance work for Divalterra, S.A.",-1
94,88,Additional work not included in social housing,-1
38,35,Repair of two voice and data outlets,-1
39,36,Replacing rail joints with aluminothermic welding,-1
42,38,Improvement of wooden bridges in Mallorca,-1
45,41,Urgent repair needed for waste compactor,-1
46,42,Maintenance services for specific doors.,-1
47,43,Safety and health study for construction,-1
48,44,Construction of false ceilings with new lighting,-1

Synthesize
✅ Done


Input examples: ['Installation of traffic signal and pedestrian crossing', 'Maintenance work for Divalterra, S.A.', 'Additional work not included in social housing', 'Repair of two voice and data outlets', 'Replacing rail joints with aluminothermic welding', 'Improvement of wooden bridges in Mallorca', 'Urgent repair needed for waste compactor', 'Maintenance services for specific doors.', 'Safety and health study for construction', 'Construction of false ceilings with new lighting', 'Rehabilitation of mining-degraded areas with forest energy crops', 'Maintenance of security and safety equipment', 'Repair and conditioning of loading area', 'Building a rock wall for slope stabilization', 'Installation of pneumatic waste collection system', 'Road stabilization with limestone aggregate', 'Traffic signaling and marking for Algeciras fair 2018', 'Sign removed from Tourism Office.', 'Vacant public housing due to eviction.', 'Installation of fiber optic

In [66]:
# View cost/time summary
l.summary()

Total time: 33.37 sec (0.56 min)
    ('Distill-summarize', '2024-09-11-17-14-54'): 3.69 sec
    ('Cluster', '2024-09-11-17-15-05'): 11.63 sec
    ('Synthesize', '2024-09-11-17-15-14'): 8.52 sec
    ('Review-remove', '2024-09-11-17-15-15'): 0.80 sec
    ('Review-merge', '2024-09-11-17-15-23'): 8.74 sec

Total cost: $0.10
    ('Distill-summarize', '2024-09-11-17-14-54'): $0.010
    ('Synthesize', '2024-09-11-17-15-14'): $0.064
    ('Review-remove', '2024-09-11-17-15-15'): $0.007
    ('Review-merge', '2024-09-11-17-15-23'): $0.016

Tokens: total=21913, in=18600, out=3313
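The totals in this summary can be reproduced from the per-step figures, which is a quick way to see where the money goes. The numbers below are copied from the output above (the Cluster step reports no cost); the variable names are illustrative:

```python
# Per-step costs in USD, copied from the summary output above.
step_costs = {
    "Distill-summarize": 0.010,
    "Synthesize": 0.064,
    "Review-remove": 0.007,
    "Review-merge": 0.016,
}
tokens_in, tokens_out = 18600, 3313

total_cost = sum(step_costs.values())
total_tokens = tokens_in + tokens_out

print(f"total cost = ${total_cost:.2f}")
print(f"total tokens = {total_tokens}")

# Share of the total attributable to each step.
share = {step: cost / total_cost for step, cost in step_costs.items()}
most_expensive = max(share, key=share.get)
print(f"{most_expensive}: {share[most_expensive]:.0%} of the cost")
```

In this run the Synthesize step accounts for roughly two thirds of the generation cost.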


### Review concepts

Review the generated concepts and select concepts to inspect further:

In [47]:
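# Classic Jupyter Notebook (< 7) only: newer releases no longer ship `notebook.nbextensions`,
# which is why the command fails with the error below
!jupyter nbextension enable --py widgetsnbextension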

Traceback (most recent call last):
  File "/Server/python/anaconda3/bin/jupyter-nbextension", line 5, in <module>
    from notebook.nbextensions import main
ModuleNotFoundError: No module named 'notebook.nbextensions'


In [67]:
l.select()

ConceptSelectWidget(data='{"eba791f8-a069-4bec-9f65-41c459ccb5a1": {"id": "eba791f8-a069-4bec-9f65-41c459ccb5a…

In [72]:
# You can also double-check on your selected concepts with this command
l.show_selected()

Active concepts (n=19):
- Safety and Health: Does the text involve safety measures, health studies, or emergency repairs?
- Environmental Management: Does the text discuss environmental restoration, waste management, or geotechnical studies?
- Specific Location: Does the text specify a particular location or address?
- Landscaping and Urbanization: Does the text involve landscaping or urban development?
- Carpentry and Installations: Does the text mention carpentry or installation of fixtures?
- Maintenance and Cleaning: Is the text about maintenance or cleaning of a facility?
- Flooring Work: Is the text about repairing or replacing flooring?
- Wood-Related Work: Does the text involve working with wood, either repairing or installing?
- Isolation Improvements: Does the text discuss improvements related to sound or climate isolation?
- Water Management: Does the example involve management or repai

### Score concepts
Then, apply these concepts to the full dataset with `score()`. This function will score all documents with respect to each concept to indicate the extent to which the document matches the concept inclusion criteria.

In [74]:
# Run concept scoring
score_df = await l.score()



Scoring 19 concepts for 100 documents
Estimated cost: $0.3
**Please note that this is only an approximate cost estimate**

Action required

Proceed with scoring? (y/n):  y


100%|██████████| 19/19 [03:09<00:00,  9.95s/it]
✅ Done with concept scoring!


In [75]:
# View cost/time summary
l.summary()

Total time: 222.41 sec (3.71 min)
    ('Distill-summarize', '2024-09-11-17-14-54'): 3.69 sec
    ('Cluster', '2024-09-11-17-15-05'): 11.63 sec
    ('Synthesize', '2024-09-11-17-15-14'): 8.52 sec
    ('Review-remove', '2024-09-11-17-15-15'): 0.80 sec
    ('Review-merge', '2024-09-11-17-15-23'): 8.74 sec
    ('Score', '2024-09-11-17-26-12'): 189.04 sec

Total cost: $0.74
	('Distill-summarize', '2024-09-11-17-14-54'): $0.010
	('Synthesize', '2024-09-11-17-15-14'): $0.064
	('Review-remove', '2024-09-11-17-15-15'): $0.007
	('Review-merge', '2024-09-11-17-15-23'): $0.016
	('Score-helper', '2024-09-11-17-23-14'): $0.034
	('Score-helper', '2024-09-11-17-23-23'): $0.035
	('Score-helper', '2024-09-11-17-23-34'): $0.034
	('Score-helper', '2024-09-11-17-23-43'): $0.034
	('Score-helper', '2024-09-11-17-23-54'): $0.034
	('Score-helper', '2024-09-11-17-24-00'): $0.034
	('Score-helper', '2024-09-11-17-24-15'): $0.034
	('Score-helper', '2024-09-11-17-24-23'): $0.034
	('Score-helper', '2024-09-11-17-24-3

### Visualize results
Now, you can visualize the results in the main **LLooM Workbench** view. An interactive widget will appear when you run the `vis` function:
![LLooM Workbench UI](https://github.com/michelle123lam/lloom/blob/main/docs/public/media/lloom_workbench_ui.png?raw=1)

The **Concept Overview (A)** provides a high-level summary. Click on a concept row in the **Concept Matrix (B)** to see its **Detail View (C)**, or click on a slice column to see its corresponding Detail View.
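The `slice_col` argument in the cells that follow names a dataframe column to group by. The dataset loaded in this notebook has no `Likes` or `Page Category` column, but it does have `CPV`, whose values appear above both as a float (`45210000.0`) and as a comma-separated string (`"45400000, 45000000"`). One option, sketched here as an assumption rather than anything LLooM requires, is to derive a categorical column from the two-digit CPV division of the first code. The helper name `cpv_division` is hypothetical:

```python
import math


def cpv_division(value):
    """Return the two-digit CPV division of the first code in `value`, or None."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, float):
        # 45210000.0 -> "45210000"; zfill restores a leading zero lost in the float
        value = str(int(value)).zfill(8)
    first_code = str(value).split(",")[0].strip()
    return first_code[:2] if first_code[:2].isdigit() else None


# Values in the two formats shown in the dataframe preview above.
print(cpv_division(45210000.0))
print(cpv_division("45400000, 45000000"))
```

Applied to this notebook, that would look like `df["cpv_division"] = df["CPV"].map(cpv_division)`, presumably before the LLooM instance is created so that the column is part of the dataframe it receives, followed by `l.vis(slice_col="cpv_division")`. The column name `cpv_division` is likewise an illustration.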

In [None]:
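# Visualize concept results
# slice_col groups the data by a dataframe column; `sub_labels` is not among the
# columns loaded above, so replace it with a column that exists in your dataframe
l.vis(slice_col="sub_labels")
# l.vis()

In [None]:
# Visualize concept results
# Group data by page category with slice_col (`Page Category` belongs to the
# template's Facebook dataset and is not present in the dataframe loaded above)
l.vis(slice_col="Page Category")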

### (Optional) Try normalizing by slice or by concept


In [None]:
l.vis()

In [None]:
l.vis(norm_by="concept")
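To make the two normalization modes named in the heading concrete, the sketch below normalizes a small concept-by-slice count matrix in both directions. It illustrates the general idea only and is not LLooM's implementation; the counts, concept names, and slice names are made up:

```python
# Hypothetical match counts: rows are concepts, columns are slices.
counts = {
    "Maintenance": {"slice_a": 6, "slice_b": 2},
    "Landscaping": {"slice_a": 3, "slice_b": 9},
}


def normalize_by_concept(matrix):
    """Divide each cell by its concept (row) total."""
    result = {}
    for concept, row in matrix.items():
        total = sum(row.values())
        result[concept] = {s: (v / total if total else 0.0) for s, v in row.items()}
    return result


def normalize_by_slice(matrix):
    """Divide each cell by its slice (column) total."""
    totals = {}
    for row in matrix.values():
        for s, v in row.items():
            totals[s] = totals.get(s, 0) + v
    return {
        concept: {s: (v / totals[s] if totals[s] else 0.0) for s, v in row.items()}
        for concept, row in matrix.items()
    }


by_concept = normalize_by_concept(counts)
by_slice = normalize_by_slice(counts)
print(by_concept["Maintenance"])
print(by_slice["Maintenance"])
```

Normalizing by concept answers "where do this concept's matches fall?", while normalizing by slice answers "what fraction of this slice matches the concept?".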

### (Optional) Add manual concept
You may also manually add your own custom concepts by providing a name and prompt. This will automatically score the data by that concept. Re-run the `vis()` function to see the new concept results.

In [None]:
# Add a custom concept with the given name and prompt
await l.add(
    name="Your new concept name",
    prompt="Your new concept criteria prompt",  # Ex: "Does the text include [...]?"
)

In [None]:
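# Visualize concept results (`Likes` is a column of the template dataset;
# use a column of your own dataframe, or call l.vis() without slice_col)
l.vis(slice_col="Likes")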

### (Optional) Submit your results
**🖼️ ✨ Submit your work for a chance to be featured on our site!**

If you'd like to share what you've done with LLooM or would like your work featured in a gallery of results, please submit your LLooM instance with the `submit()` function! If your submission is selected, we'll reach out to you to follow up and hear more about your work with LLooM.

In [None]:
l.submit()  # You will be prompted to provide a few details about your analysis

### (Optional) Export and/or save results

In [77]:
l

<text_lloom.workbench.lloom at 0x7f75fa255390>

In [None]:
# Export the results to a dataframe
export_df = l.export_df()

In [None]:
export_df.head()

In [None]:
# Save the lloom to a pickle file
l.save(folder="your/path/here", file_name="your_file_name")
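For readers unfamiliar with pickle files, the round trip looks roughly like the sketch below. It uses a plain dictionary in place of a LLooM instance, and the `<folder>/<file_name>.pkl` naming as well as the helpers `save_pickle` and `load_pickle` are assumptions for illustration; the file layout that `l.save()` actually produces may differ:

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, folder, file_name):
    """Pickle `obj` to <folder>/<file_name>.pkl and return the path."""
    path = Path(folder) / f"{file_name}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path


def load_pickle(path):
    """Load an object written by save_pickle (only open files you trust)."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Round trip with a plain dict standing in for a session object.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_pickle({"concepts": ["Safety and Health"]}, tmp, "demo_session")
    restored = load_pickle(saved_path)

print(restored)
```

Pickle files can execute code when loaded, so only load files that you or a trusted collaborator created.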

## v2: Auto mode

LLooM also provides an **auto** mode that offers less control but combines generation and scoring into a single function call. You can try it with the cells below.

### Create a LLooM instance
After loading your data as a pandas DataFrame, create a new LLooM instance. You will need to specify the name of the column that contains your input text documents (`text_col`). The ID column (`id_col`) is optional.

In [78]:
# Set up the LLooM instance with the specified dataset
l = wb.lloom(
    df=df,
    text_col="text",
    id_col="doc_id",  # Optional
)

### Run concept generation
Next, run concept generation and scoring in one step with `gen_auto()`. The `seed` parameter is optional, and `max_concepts` limits the number of concepts that are scored (5 in this run).

In [79]:
cur_seed = None  # Optionally replace with string
score_df = await l.gen_auto(seed=cur_seed, max_concepts=5)

N sentences: Median=1, Std=0.46
Auto-suggested parameters: {'filter_n_quotes': 1, 'summ_n_bullets': 1, 'synth_n_concepts': 6}

Estimated cost: $0.11
**Please note that this is only an approximate cost estimate**

Action required

Proceed with generation? (y/n):  y

Distill-summarize
✅ Done


Unnamed: 0,doc_id,text
0,0,Construction contract for installing jump grids
1,1,Granite finish on garage ventilation tower
2,2,Remodeling of La Ceramica center's lower floor
3,3,Refuerzo estructural de tabiquería de pladur
4,4,Acondicionamiento de parcela en La Barceloneta
5,5,Landscaping works on greenway in Molina de Segura
6,6,Restoration of wooden box with dovetail joints
7,7,Adaptation of medical office basement area
8,8,Reparación de tejado de garajes
9,9,Repair work on concrete beams

Cluster
✅ Done


Unnamed: 0,doc_id,text,cluster_id
0,0,Construction contract for installing jump grids,-1
69,65,Improving transitability in rural area,-1
68,64,Installation of compass in second-floor windows,-1
67,63,"Geotechnical study at CEIP JACARANDA, Seville",-1
65,61,Cleaning drains and unclogging toilet,-1
64,60,Improvement of accessible restrooms at conference center,-1
62,58,Correcting detected anomalies in electrical installation.,-1
60,56,Construction contract for telephone line relocation,-1
71,67,Replacing aluminum carpentry at EEI Pan y Guindas,-1
59,55,Installation of fiber optic for Guardia Civil,-1

Synthesize
✅ Done


Input examples: ['Construction contract for installing jump grids', 'Improving transitability in rural area', 'Installation of compass in second-floor windows', 'Geotechnical study at CEIP JACARANDA, Seville', 'Cleaning drains and unclogging toilet', 'Improvement of accessible restrooms at conference center', 'Correcting detected anomalies in electrical installation.', 'Construction contract for telephone line relocation', 'Replacing aluminum carpentry at EEI Pan y Guindas', 'Installation of fiber optic for Guardia Civil', 'Installation of pneumatic waste collection system', 'Execution of demolition for specific areas', 'Rehabilitation of mining-degraded areas through forestry', 'Construction of false ceilings with new lighting', 'Safety and health study for construction', 'Maintenance service for automatic pedestrian doors, garage doors, and gates', 'New connections work in business park', 'Sign removed from Tourism Office.', 'Improvement of tw

Proceed with scoring? (y/n):  y


100%|██████████| 5/5 [00:50<00:00, 10.18s/it]
✅ Done with concept scoring!


In [80]:
# View cost/time summary
l.summary()

Total time: 85.50 sec (1.42 min)
    ('Distill-summarize', '2024-09-11-17-27-01'): 3.27 sec
    ('Cluster', '2024-09-11-17-27-08'): 7.33 sec
    ('Synthesize', '2024-09-11-17-27-19'): 10.72 sec
    ('Review-remove', '2024-09-11-17-27-20'): 0.87 sec
    ('Review-merge', '2024-09-11-17-27-32'): 12.41 sec
    ('Score', '2024-09-11-17-32-14'): 50.89 sec

Total cost: $0.26
    ('Distill-summarize', '2024-09-11-17-27-01'): $0.010
    ('Synthesize', '2024-09-11-17-27-19'): $0.054
    ('Review-remove', '2024-09-11-17-27-20'): $0.005
    ('Review-merge', '2024-09-11-17-27-32'): $0.015
    ('Score-helper', '2024-09-11-17-31-30'): $0.034
    ('Score-helper', '2024-09-11-17-31-40'): $0.034
    ('Score-helper', '2024-09-11-17-31-52'): $0.034
    ('Score-helper', '2024-09-11-17-32-00'): $0.034
    ('Score-helper', '2024-09-11-17-32-14'): $0.035

Tokens: total=243025, in=179885, out=63140
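Comparing the scoring step across the two runs in this notebook suggests that scoring time grows roughly linearly with the number of concepts, at about 10 seconds per concept for 100 documents. The sketch below reproduces that figure from the two summaries and extrapolates it; the linear model and the helper `estimate_scoring_seconds` are assumptions, not something LLooM reports:

```python
# (scoring seconds, number of concepts) taken from the two summaries in this notebook.
runs = {
    "manual (v1)": (189.04, 19),
    "auto (v2)": (50.89, 5),
}

per_concept = {name: secs / n for name, (secs, n) in runs.items()}
for name, secs in per_concept.items():
    print(f"{name}: {secs:.2f} s per concept")


def estimate_scoring_seconds(n_concepts, seconds_per_concept=10.0):
    """Linear extrapolation; 10 s per concept is a rounded figure from the runs above."""
    return n_concepts * seconds_per_concept


print(estimate_scoring_seconds(30))
```

Both per-concept figures agree with the `s/it` rates shown by the progress bars, so the estimate is at least consistent with this dataset size; larger datasets would need their own measurement.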


### Visualize results
Now, you can visualize the results in the main **LLooM Workbench** view. An interactive widget will appear when you run the `vis` function:
![LLooM Workbench UI](https://github.com/michelle123lam/lloom/blob/main/docs/public/media/lloom_workbench_ui.png?raw=1)

The **Concept Overview (A)** provides a high-level summary. Click on a concept row in the **Concept Matrix (B)** to see its **Detail View (C)**, or click on a slice column to see its corresponding Detail View.

In [None]:
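# Visualize concept results
# slice_col groups the data by a dataframe column (numeric columns are binned automatically);
# `Likes` belongs to the template dataset, so use a column of your own dataframe
l.vis(slice_col="Likes")

In [None]:
# Visualize concept results
# Group data by page category with slice_col (`Page Category` is likewise a template column)
l.vis(slice_col="Page Category")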

### (Optional) Try normalizing by slice or by concept


In [None]:
l.vis(slice_col="Likes", norm_by="slice")

In [None]:
l.vis(norm_by="concept")

### (Optional) Add manual concept
You may also manually add your own custom concepts by providing a name and prompt. This will automatically score the data by that concept. Re-run the `vis()` function to see the new concept results.

In [None]:

# Add a custom concept with the given name and prompt
await l.add(
    name="Your new concept name",
    prompt="Your new concept criteria prompt",  # Ex: "Does the text include [...]?"
)

In [None]:
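# Visualize concept results (`Likes` is a column of the template dataset;
# use a column of your own dataframe, or call l.vis() without slice_col)
l.vis(slice_col="Likes")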

### (Optional) Submit your results
**🖼️ ✨ Submit your work for a chance to be featured on our site!**

If you'd like to share what you've done with LLooM or would like your work featured in a gallery of results, please submit your LLooM instance with the `submit()` function! If your submission is selected, we'll reach out to you to follow up and hear more about your work with LLooM.

In [None]:
l.submit()  # You will be prompted to provide a few details about your analysis

### (Optional) Export and/or save results

In [None]:
# Export the results to a dataframe
export_df = l.export_df()

In [None]:
export_df.head()

In [None]:
# Save the lloom to a pickle file
l.save(folder="your/path/here", file_name="your_file_name")