[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/danielmlow/construct-tracker/blob/daniels_branch/tutorials/construct_tracker.ipynb)

# Tutorial to use construct-tracker to measure the constructs you choose in text: 

- Author: Daniel M. Low
- License: Apache 2.0
- Date: 29/07/2024
- **If you use, please cite**: Low DM, Rankin O, Coppersmith DDL, Bentley KH, Nock MK, Ghosh SS (2024). Building lexicons with Generative AI result in zero-shot, lightweight, interpretable text models with high content validity. arXiv.


### construct-tracker
##### - **zero-shot**: no training data needed. Just provide the list of constructs and some examples.
##### - **lightweight**: no GPU needed
##### - **private and free**: you can run on your local computer instead of submitting to a cloud API
##### - **interpretable**: understand why the model outputs a given score
##### - **content validity**: measure what you actually want to measure


There are three options:
* **Lexicon**: Create a lexicon with Generative AI for the constructs you want to measure and obtain the counts of those constructs in text.
* **Lexicon + construct-text similarity (CTS)**: find similar meaning phrases: Since counting exact matches will miss similar and relevant words, use CTS to include similar phrases. **recommended**
* **Just construct-text similarity (CTS)**: skip creating a lexicon and just provide CTS with a few examples (might not work as well). No API key needed. 


We provide A) "Quick start" section below followed by a B) "Use all special features" section




In [2]:
# Install construct-tracker 

# development mode:
!pip install git+https://github.com/danielmlow/construct-tracker.git@daniels_branch#egg=construct_tracker&subdirectory=dist

# production mode:
# pip install contruct-tracker

# # Local mode
# import sys
# sys.path.append('./../src/')
# sys.path.append('./../src/construct_tracker/')

In [3]:
# Import packages
import pandas as pd
import sys
import os
import shutil 
import datetime
import copy
from construct_tracker import lexicon
from construct_tracker import cts

Let's create a lexicon to measure constructs related to insight and mindfulness. Here are some examples from a target dataset of documents (e.g., survey responses, social media posts).

In [4]:
documents = [
 'Every time I speak with my cousin Bob, I have great moments of insight, clarity, and wisdom',
 "He meditates a lot, but he's not super smart",
 'He is too competitive']	

In [5]:
# Or load from Google Drive 
google_drive = False # Load files from Google Drive

if google_drive:
	current_path = '/content/drive/My Drive/Colab Notebooks'
	# Sign into drive to gain access to your documents in a dataframe and/or api_keys.py
	from google.colab import drive
	drive.mount('/content/drive')
	sys.path.append(current_path)
	# Change path and column name accordingly.
	documents = pd.read_csv('/content/drive/My Drive/insight_project/my_documents.csv')['documents_column'].tolist()

In [6]:
if google_drive:
	# Sign into drive to gain access to your documents in a dataframe and/or api_keys.py
	
	
	lexicon_dir = current_path+'/ct_lexicons/' #to save from the lexicon
	output_dir = current_path+'/ct_datasets/' # to save final datasets
	# CHANGE PATH
	documents = pd.read_csv('/content/drive/My Drive/your_project_name/your_csv_file.csv')['documents_column_name'].tolist()
else:

	lexicon_dir = './ct_lexicons/' #to save from the lexicon
	output_dir = './ct_datasets/' # to save final datasets

os.makedirs(lexicon_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)

# Create lexicon with generative AI

If you want to use generative AI models from python, you need to provide a key from a given provider (e.g., OpenAI, Google, etc.). Alternatively, you can use chatgpt, bing or other browsers, to obtain a list and then copy and paste when indicated below. 

# Add API keys

Each provider has one API key associated to your account. Choose provider and a specific provider's model: https://docs.litellm.ai/docs/providers

**Free trial key** 
- Some providers offer a key for a few submissions per minute for free. 
- Obtain Cohere key: 
	1. Sign in to https://dashboard.cohere.com/api-keys
	2. copy the key under "Trial keys FREE"


**Paid key**
- Obtain OpenAI key: 
	1. Select "Log in" (upper right corner): https://platform.openai.com/
	2. Select "Dashboard" (upper right corner)
	3. Select "API keys" (left panel)
	4. Select "Create new secret key"



Add your private API key 

- **Option A** As a string here for the models you wish to use. Just be careful to not share this ipynb with your private key.

In [7]:
os.environ["OPENAI_API_KEY"]  = "YOUR_OPENAI_API_KEY" 

- **Option B** load api_keys.py locally or from google drive, where you save all your private API keys. 

In [8]:
google_drive = False # Load files from / save to Google Drive

if google_drive:
	# Sign into drive to gain access to your documents in a dataframe and/or api_keys.py
	from google.colab import drive
	drive.mount('/content/drive')
	sys.path.append('/content/drive/My Drive/Colab Notebooks')

	# Now you can load documents from Google Drive and/or api_keys.py, change path and column name accordingly.
	
	# See explanation and instructions in Quick Start above. 
	

try:
	import api_keys # local api_keys.py file (where there are variables open_ai='MY-API-KEY') or add string below.
	os.environ["OPENAI_API_KEY"]  = api_keys.open_ai  # private API key, add string here or load from other file for privacy
except:
	print('could not import api_keys.py file. Please add string below or load from other file for privacy. ')
	pass

Now specify which model to use

In [9]:
gpt4o = "gpt-4o-2024-05-13" # model name
gpt4o_mini = "gpt-4o-mini-2024-07-18"

# Select a model you'll use for now:
model = gpt4o_mini


# A. Quick start

### 1. Build lexicon with Generative AI and then count in documents

In [10]:
my_lexicon = lexicon.Lexicon()         # Initialize lexicon

my_lexicon.add('Insight', section = 'tokens', value = 'create', source = model, max_tokens = 150,
	  examples = ['insight', 'realized'])

my_lexicon.add('Mindfulness', section = 'tokens', value = 'create', source = model, max_tokens = 150,
	examples = ['mindfulness', 'meditate'])

my_lexicon.add('Compassion', section = 'tokens', value = 'create', source = model, max_tokens = 150,
	examples = ['compassion', 'love', 'kind', 'help others'])


INFO: Adding 'Insight' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Insight']
INFO: Adding 'Mindfulness' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Mindfulness']
INFO: Adding 'Compassion' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Compassion']


1.1. View lexicons

In [11]:
print(my_lexicon.constructs['Insight']['tokens'])

print(my_lexicon.constructs['Mindfulness']['tokens'])

print(my_lexicon.constructs['Compassion']['tokens'])


['acumen', 'analysis', 'analytical thinking', 'astuteness', 'awareness', 'breakthrough', 'clarity', 'clarity of thought', 'clear perception', 'cognitive insight', 'comprehension', 'conceptual clarity', 'critical thinking', 'deep understanding', 'discerning observation', 'discernment', 'discovery', 'enlightened perspective', 'enlightenment', 'epiphany', 'foresight', 'grasp', 'informed decision', 'insight', 'insight-driven', 'insightful analysis', 'insightful revelation', 'insightfulness', 'interpretation', 'intuition', 'intuitive understanding', 'keen perception', 'knowledge', 'mental clarity', 'observation', 'perception', 'perceptive observation', 'perceptual awareness', 'perspective', 'profound insight', 'profound realization', 'realized', 'recognition', 'reflection', 'reflective thought', 'revelation', 'situational awareness', 'strategic insight', 'thoughtful analysis', 'thoughtful insight', 'understanding', 'wisdom']
['acceptance', 'awareness', 'awareness practice', 'balance', 'body

1.2. Now count whether tokens appear in document:


In [12]:

counts, matches_by_construct, matches_doc2construct, matches_construct2doc  = my_lexicon.extract(documents,
                                                                                      normalize = False,
                                                                                      )

counts.to_csv(f'{output_dir}insight_lexicon_counts.csv', index = False)
display(counts)

extracting... 




100%|██████████| 3/3 [00:06<00:00,  2.05s/it]


Unnamed: 0,document_id,document,Insight,Mindfulness,Compassion,word_count
0,0,"Every time I speak with my cousin Bob, I have ...",3,3,0,17
1,1,"He meditates a lot, but he's not super smart",0,1,0,8
2,2,He is too competitive,0,0,0,4


1.3. Interpret counts: visualize matches in context  


In [13]:
show_n_sentences = 2
construct = 'Mindfulness'
print(f'Matches for {construct}:')
lexicon.highlight_matches(documents, construct,show_n_sentences, matches_construct2doc)
print()


construct = 'Insight'
print(f'Matches for {construct}:')
lexicon.highlight_matches(documents, construct,show_n_sentences, matches_construct2doc)
print()

Matches for Mindfulness:



Matches for Insight:





### 2. Construct-Text Similarity (CTS)

2.1. Similarity between lexicon tokens and document tokens

In [14]:
lexicon_dict = my_lexicon.to_dict()
features, documents_tokenized, lexicon_dict_final_order, cosine_similarities = cts.measure(
    lexicon_dict,
    documents,
	count_if_exact_match = False, # False: for now, do not add exact matches to similarities nor replace similarities for exact matches when found. However, we recommend using 'sum', which sums these similarities to any counts of exact matches. See full example below.
	verbose = True # True: Log information for each step 
    )

display(features.round(2))

features.to_csv(f'{output_dir}insight_lexicon_cts_features.csv', index = False)


INFO: Default input sequence length for all-MiniLM-L6-v2: 256
INFO: Encoding 336 new construct tokens...


Batches:   0%|          | 0/11 [00:00<?, ?it/s]

INFO: Finished encoding construct tokens.
INFO: Tokenizing documents...
INFO: Finished tokenization.
INFO: Encoding all document tokens...
INFO: computing similarity between 3 constructs and 3 documents...
3it [00:00, 322.61it/s]
INFO: Finished.


Unnamed: 0,document_id,documents,documents_tokenized,Insight_max,Mindfulness_max,Compassion_max
0,0,"Every time I speak with my cousin Bob, I have ...","[Every time I speak with my cousin Bob, I have...",0.48,0.38,0.3
1,1,"He meditates a lot, but he's not super smart","[He meditates a lot, he's not super smart]",0.34,0.72,0.48
2,2,He is too competitive,[He is too competitive],0.0,0.0,0.0


#### Interpret: see why a certain document received a score for a given construct

In [15]:
doc_id = 1
construct = 'Mindfulness'
most_similar_lexicon_token, most_similar_document_token, highest_similarity = cts.get_highest_similarity_phrase(doc_id, construct, documents, documents_tokenized, cosine_similarities, lexicon_dict_final_order)

The construct 'Mindfulness' through its token 'meditate' had the highest cosine similarity (0.72) with the following document token:
'He meditates a lot'


# B. Use all special features

Recommended: download and run this on your local computer using jupyter lab, vscode, or your favorite development environment. Saving to and from files is quicker if run locally on your computer than on Google Drive. 

### 1. Build lexicon with Generative AI 


#### Lexicon step 1.1: First provide api keys and general info on the lexicon

In [16]:
my_lexicon = lexicon.Lexicon()			# Initialize lexicon
my_lexicon.name = 'Insight'		# Set lexicon name
my_lexicon.description = 'Insight lexicon with constructs inspired by items of the Emotional Insight Scale'
my_lexicon.creator = 'DML' 				# your name or initials for transparency in logging who made changes
my_lexicon.version = '1.0'				# Set version. Over time, others may modify your lexicon, so good to keep track. MAJOR.MINOR. (e.g., MAJOR: new constructs or big changes to a construct, Minor: small changes to a construct)


In [17]:
# Fill out this information
domain = 'psychology' # Optional: bias definitions to a certain domain. this can be "mental health", "economics" or set to None (no quotation marks)
models_for_lexicon = [gpt4o] # Here you could add more: [gpt4o,command_nightly]
model_for_definition = gpt4o # You can ask model to create definition for each construct in the lexicon. Or set to None (no quotation marks)
temperatures = [0,0.5,1] # Temperature is another important parameter that defines how creative a model should be (0: a model outputs the most probable tokens given the prompt and training data; 1=more creative and less predictable responses, with a maximum of 2, which is not often used).

#### Lexicon step 1.2: Add examples and definitions manually or using a model 

**Examples**

Add a few prototypical tokens you don't want to miss. With these examples, help guide the model as to whether you want adjectives, nouns, phrases or all of the above. 


**Definitions**

We recommend this for expert validation (step after creating the lexicon, so raters can decide whether to include a token as a function of a specific definition). However, you can decide whether the Generative AI model sees the definition or not so as to guide or not guide the lexicon creation.

We'll show an example where we manually provided a definition (insight), where we'll have our model provide it, and where we will not include it. 

In [18]:
# Fill in

construct_information = {
	'Insight': {
		'examples': ['clarity', 'enlightenment', 'wise'],
		'definition': "the clarity of understanding of one's thoughts, feelings and behavior",
		'reference': "Grant, A. M., Franklin, J., & Langford, P. (2002). The self-reflection and insight scale: A new measure of private self-consciousness. Social Behavior and Personality: an international journal, 30(8), 821-835."
			 },
	'Mindfulness': {
		'examples': ['mindful','meditation', 'awareness', 'nonjudgemental', 'present-focused'],
		'definition': model_for_definition, # it will later check if definition == model_for_definition, if it does, it will generate using that model
		'reference':model_for_definition
			 },	
	'Compassion': {
		'examples':['compassion', 'love', 'kind', 'help others'],
		'definition': None,
		'reference':None
			 },
}

## Definition will be added if model_for_definition is not None using our own prompt. Or you can can directly prompt APIs here with your own prompt. This is our default one:
# prompt = f"Provide a brief definition of {construct} (in the {domain} domain). Provide reference in APA format where you got it from. Return result in the following format: {'{construct}': 'the_definition', 'reference':'the_reference'}"
# definition = api_request(prompt, model = model)

In [19]:

my_lexicon.add('Insight', section = 'tokens', value = 'create', source = model, max_tokens = 150,
	  examples = ['insight', 'realized'])

my_lexicon.add('Mindfulness', section = 'tokens', value = 'create', source = model, max_tokens = 150,
	examples = ['mindfulness', 'meditate'])

my_lexicon.add('Compassion', section = 'tokens', value = 'create', source = model, max_tokens = 150,
	examples = ['compassion', 'love', 'kind', 'help others'])


INFO: Adding 'Insight' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Insight']
INFO: Adding 'Mindfulness' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Mindfulness']
INFO: Adding 'Compassion' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Compassion']


In [20]:
# Loop through constructs and generate lexicon and definition as instructed above in construct_information

for construct, information in construct_information.items():
	print('\n',construct, ' ------------------------')
	examples, definition, reference = information['examples'], information['definition'], information['reference']
	
	# DEFINITION: add definition or use model to get one
	# ===================================================
	construct_dict = construct_information.get(construct)
	# if definition exists (is str), use it.
	if construct_dict['definition'] == model_for_definition:
		# Use model to create a definition
		definition, reference = my_lexicon.generate_definition(construct, model = model_for_definition, domain=domain, timeout=45, num_retries=2)
		print(f'Generated definition: {definition}\n')
		print(f'Reference: {reference}\n')
		# Update with definition
		construct_information[construct] = {'definition': definition, 'reference': reference}
	elif isinstance(construct_dict['definition'], str):
		definition = construct_dict['definition']
		reference = construct_dict['reference']
	elif construct_dict['definition'] == None:
		definition = None
		reference = None

	# PROMPT: add definition and examples to prompt
	# ===================================================
	prompt = lexicon.generate_prompt(construct,
                         prompt_name=construct,
                         domain = domain,
						 definition = definition,
						 examples = examples)
	
	print('Prompt used:')
	print(prompt)
	print()

	## Or create your own prompt here:
	# prompt = """
	# Create a list of words and phrases related to {construct} in the {domain} domain. Here are some examples: {examples}
	# """
	# prompt.format(construct = construct, domain = domain, example = examples)
	# print(prompt)
	
	# Generate lexicon
	# ===================================================
	# Each model and temperature will create redudant and new tokens. They will merge with other tokens already generated. 
	for model in models_for_lexicon:
		for temperature in temperatures:
			my_lexicon.add(construct, section = 'tokens', value = 'create', prompt = prompt, source = model, temperature = temperature, max_tokens = 150,
				# so these are saved to metadata
				domain = domain, examples = examples, definition = definition,definition_references = reference, )
	print('Models used to generate tokens:')
	[print(n) for n in list(my_lexicon.constructs[construct]['tokens_metadata'].keys())]
	print()
	print('Generated lexicon:', my_lexicon.constructs[construct]['tokens']) 
	print()





 Insight  ------------------------
Prompt used:
Provide many single words and some short phrases (all in lower case unless the word is generally in upper case) related to Insight (in the psychology domain). Each token should be separated by a semicolon. Do not return duplicate tokens. Do not provide any explanation or additional text beyond the tokens.
Here is a definition of Insight: the clarity of understanding of one's thoughts, feelings and behavior
Here are some examples (include these in the list): clarity; enlightenment; wise

Models used to generate tokens:
gpt-4o-mini-2024-07-18, temperature-0.1, top_p-1, max_tokens-150, seed-42, 24-08-12T19-42-38.357967
gpt-4o-2024-05-13, temperature-0, top_p-1, max_tokens-150, seed-42, 24-08-12T19-42-44.868840
gpt-4o-2024-05-13, temperature-0.5, top_p-1, max_tokens-150, seed-42, 24-08-12T19-42-47.102237
gpt-4o-2024-05-13, temperature-1, top_p-1, max_tokens-150, seed-42, 24-08-12T19-42-49.254632

Generated lexicon: ['acuity', 'acumen', 'aha 


#### Lexicon step 1.3: Important: review resulting lexicon. Add or remove tokens manually.

In [21]:
for construct in construct_information.keys():
	print(f"{construct}:", my_lexicon.constructs[construct]['tokens'])


Insight: ['acuity', 'acumen', 'aha moment', 'analysis', 'analytical thinking', 'astuteness', 'awareness', 'breakthrough', 'brilliance', 'clarity', 'clear vision', 'cognition', 'cognitive shift', 'comprehension', 'consciousness', 'contemplation', 'critical thinking', 'deduction', 'deep dive', 'deep understanding', 'depth', 'detection', 'discernment', 'discovery', 'enlightened thought', 'enlightenment', 'epiphany', 'foresight', 'fresh perspective', 'grasp', 'grasping', 'holistic view', 'illumination', 'insight', 'insight-driven', 'insightful analysis', 'insightfulness', 'interpretation', 'introspection', 'intuition', 'intuitive grasp', 'judiciousness', 'keen observation', 'keenness', 'knowledge', 'lucidity', 'mental clarity', 'mindfulness', 'new angle', 'observance', 'observant', 'observation', 'paradigm shift', 'perception', 'perceptive', 'perceptive insight', 'perceptiveness', 'perspective', 'perspicacity', 'perspicuity', 'profound understanding', 'profundity', 'prudence', 'realization

Add or remove tokens in a list that definitely should /shouldn't be there. BE CAREFUL WITH TYPOS OR MISPELLINGS. 

In source you can add any additional description. Recommend putting your initials as well available in `my_lexicon.creator`. 

In [22]:
print(my_lexicon.creator)
my_lexicon.add('Mindfulness', section ='tokens',value = ['meditate', 'pay attention'], source=my_lexicon.creator + ": added a few verb forms") # Up to you as an expert
my_lexicon.remove('Mindfulness', remove_tokens = ['being', 'flow'], source =my_lexicon.creator + ": might capture non-compassion-related context too often") # Up to you as an expert

DML


Now those appear as entries in the metadata. All additions are merged and all removed are removed

In [23]:
for n in my_lexicon.constructs['Mindfulness']['tokens_metadata'].keys():
	print('action: ', 
	my_lexicon.constructs['Mindfulness']['tokens_metadata'][n]['action'],
	'; from source: ',
	n)
	

action:  create ; from source:  gpt-4o-mini-2024-07-18, temperature-0.1, top_p-1, max_tokens-150, seed-42, 24-08-12T19-42-40.722780
action:  create ; from source:  gpt-4o-2024-05-13, temperature-0, top_p-1, max_tokens-150, seed-42, 24-08-12T19-42-52.978642
action:  create ; from source:  gpt-4o-2024-05-13, temperature-0.5, top_p-1, max_tokens-150, seed-42, 24-08-12T19-42-54.840277
action:  create ; from source:  gpt-4o-2024-05-13, temperature-1, top_p-1, max_tokens-150, seed-42, 24-08-12T19-42-57.982166
action:  manually added ; from source:  DML: added a few verb forms 24-08-12T19-43-07.454435
action:  remove ; from source:  DML: might capture non-compassion-related context too often 24-08-12T19-43-07.456329


Or export to csv, rate spreadsheat and import back into python.

See section "Validate with experts or just crowdsourcing" below on how to code and reload spreadsheet back into python

#### Lexicon step 1.4: Confirm additions and removals were successful. 

`['tokens']` contains the final tokens. 



In [24]:
for construct in construct_information.keys():
	print(f"{construct}:",my_lexicon.constructs[construct]['tokens'])



Insight: ['acuity', 'acumen', 'aha moment', 'analysis', 'analytical thinking', 'astuteness', 'awareness', 'breakthrough', 'brilliance', 'clarity', 'clear vision', 'cognition', 'cognitive shift', 'comprehension', 'consciousness', 'contemplation', 'critical thinking', 'deduction', 'deep dive', 'deep understanding', 'depth', 'detection', 'discernment', 'discovery', 'enlightened thought', 'enlightenment', 'epiphany', 'foresight', 'fresh perspective', 'grasp', 'grasping', 'holistic view', 'illumination', 'insight', 'insight-driven', 'insightful analysis', 'insightfulness', 'interpretation', 'introspection', 'intuition', 'intuitive grasp', 'judiciousness', 'keen observation', 'keenness', 'knowledge', 'lucidity', 'mental clarity', 'mindfulness', 'new angle', 'observance', 'observant', 'observation', 'paradigm shift', 'perception', 'perceptive', 'perceptive insight', 'perceptiveness', 'perspective', 'perspicacity', 'perspicuity', 'profound understanding', 'profundity', 'prudence', 'realization

#### Lexicon step 1.5: Save preliminary lexicon

In [25]:
clean_lexicon_name = my_lexicon.name.replace(" ", "-")+'_v'+my_lexicon.version.replace('.', '-')

if google_drive:
  preprocessing_dir = f'{lexicon_dir}/{clean_lexicon_name}/preprocessing'

else:
  # local
  preprocessing_dir = f'{lexicon_dir}/{clean_lexicon_name}/preprocessing'

my_lexicon.save(preprocessing_dir) # may take a few minutes to appear on Google Drive

INFO: Saved lexicon to ./ct_lexicons//Insight_v1-0/preprocessing/Insight_24-08-12T19-43-07


For a given construct, you can review metadata, which contains the history of different actions (creating using Gen AI, adding, removing), which comes with a timestamp, the specific prompt used (including examples, definition, if any). All of these entries 

In [26]:

import json
for construct in construct_information.keys():
	print("==="*30)
	print(construct)
	print()
	print(json.dumps(my_lexicon.constructs[construct]['tokens_metadata'], indent=4))



Insight

{
    "gpt-4o-mini-2024-07-18, temperature-0.1, top_p-1, max_tokens-150, seed-42, 24-08-12T19-42-38.357967": {
        "action": "create",
        "tokens": [
            "aha moment",
            "analysis",
            "analytical thinking",
            "astuteness",
            "awareness",
            "breakthrough",
            "clarity",
            "clear vision",
            "cognitive shift",
            "comprehension",
            "contemplation",
            "critical thinking",
            "deduction",
            "deep dive",
            "deep understanding",
            "discernment",
            "discovery",
            "enlightened thought",
            "enlightenment",
            "epiphany",
            "foresight",
            "fresh perspective",
            "grasp",
            "holistic view",
            "insight",
            "insight-driven",
            "insightful analysis",
            "insightfulness",
            "interpretation",
            "in

Also available in `_metadata.json` file when saving the lexicon



#### Lexicon step 1.6: Extract counts

Calibration step: this can be done on a examples you come up with or on training set to make sure it's doing a good job

In [27]:
# Now count whether tokens appear in document:

counts, matches_by_construct, matches_doc2construct, matches_construct2doc  = my_lexicon.extract(documents,
                                                                                      normalize = False,
                                                                                      )
display(counts)

extracting... 


100%|██████████| 3/3 [00:01<00:00,  1.75it/s]


Unnamed: 0,document_id,document,Insight,Mindfulness,Compassion,word_count
0,0,"Every time I speak with my cousin Bob, I have ...",3,3,0,17
1,1,"He meditates a lot, but he's not super smart",0,1,0,8
2,2,He is too competitive,0,0,0,4


In [28]:
# Interpret counts: visualize matches in context  

show_n_sentences = 2 # up to this amount
for construct in construct_information.keys():
	print(f'------- Matches for {construct}:')
	lexicon.highlight_matches(documents, construct,show_n_sentences, matches_construct2doc)
	print()


------- Matches for Insight:



------- Matches for Mindfulness:



------- Matches for Compassion:



#### Lexicon step 1.7: Validate with experts or just crowdsourcing (OPTIONAL)

1.7.1. Create `human_ratings/` folder

In [29]:
perform_human_ratings = False

if perform_human_ratings:
	clean_lexicon_name = my_lexicon.name.replace(" ", "-")+'_v'+my_lexicon.version.replace('.', '-')

	if google_drive:
		preprocessing_dir = f'{lexicon_dir}/{clean_lexicon_name}/human_ratings'

	else:
		# local
		ratings_dir = f'{lexicon_dir}/{clean_lexicon_name}/human_ratings/' 



	os.makedirs(ratings_dir, exist_ok=True)

1.7.2. Take the `<date>_ratings.csv` in the `preprocessing/` directory and send a copy to each human rater (aka coder, annotator) with their ID appended to the filename. Here we'll automatically find the last one created.

In [30]:
if perform_human_ratings:

	# I'll create three duplicates of the ratings file identifying each file with raters initiails

	# or use digit IDs:
	n_raters = 3
	rater_ids = [str(i).zfill(3) for i in range(n_raters)] # ['001', '002', '003']
	# Save names for each ID separately in case you need to check in about a certain rating. 


	copy_annotation_file_to_ratings_dir = False

	if copy_annotation_file_to_ratings_dir:
		path_to_unvalidated_lexicon = os.listdir(preprocessing_dir)
		path_to_unvalidated_lexicon = [n for n in path_to_unvalidated_lexicon if n.endswith('ratings.csv')][0]

		for rater_id in rater_ids:
			new_path = path_to_unvalidated_lexicon.replace('preprocessing', 'human_ratings').replace('.csv', f'_{rater_id}.csv')
			# copy from preprocessing to human_raters
			shutil.copy(path_to_unvalidated_lexicon, new_path)


	# Now you have this file in human_ratings Insight_24-08-08T21-29-52_ratings_DLC.csv etc.

	


1.7.3. Have them rate following these or similar instructions. Save files as csv instead of excel to use the code below

https://github.com/danielmlow/construct-tracker/blob/daniels_branch/tutorials/lexicon%20risk%20factors%20final%20instructions.pdf

[Google docs version](https://docs.google.com/document/d/1pu89KmU31grhzFeZmwOoN0U2plSWaDMSKWYUXy_acL4/edit?usp=sharing)

In this toy example, I just duplicated the files as an example so they all have the same ratings. One of the raters added some suggestions. The instructions can take them as 3/3 or the rater can add them at the bottom and can add their rating. Then these can be sent to other raters in a second round before going onto the next step.



1.7.4. Take average ratings, discard tokens below a certain average

In [31]:
ratings_dir

NameError: name 'ratings_dir' is not defined

In [None]:

# average_ratings

rating_files = os.listdir(ratings_dir)
rating_files = [n for n in rating_files if n not in ['.DS_Store']]
print(rating_files) # this should not be empty, you should have some human_rating files for each rater

['Insight_24-08-08T21-29-52_ratings_002.csv', 'Insight_24-08-08T21-29-52_ratings_003.csv', 'Insight_24-08-08T21-29-52_ratings_001.csv']


In [None]:
if perform_human_ratings:
	from construct_tracker.lexicon import avg_above_thresh, merge_rating_dfs

	# Copy and paste correct files here to make sure you don't add the same ratings with two different extensions:
	rating_files = [
		'Insight_24-08-08T21-29-52_ratings_002.csv', 'Insight_24-08-08T21-29-52_ratings_003.csv', 'Insight_24-08-08T21-29-52_ratings_001.csv'
		]

	# Ratings by all raters in a single dictionary
	# all_ratings_per_construct_dict = {
		# construct_1: {token_1: [rating_1, rating_2, ...], token_2: [rating_1, rating_2, ...], ...},
		# construct_2: {token_1: [rating_1, rating_2, ...], token_2: [rating_1, rating_2, ...], ...},
		# ...} 

	all_ratings_per_construct_dict = merge_rating_dfs(ratings_dir, rating_files, construct_information.keys())

	# Save ratings to lexicon
	todays_date = datetime.datetime.utcnow().strftime("%y-%m-%d") 
	my_lexicon.set_attribute('ratings_'+todays_date, all_ratings_per_construct_dict) 


In [None]:

if perform_human_ratings:
	# Average and remove tokens < a threshold
	ratings_avg, ratings_removed = lexicon.avg_above_thresh(all_ratings_per_construct_dict, thresh = 1.3)

	# Remove tokens with low avg. ratings
	for construct in ratings_removed:
		remove_tokens = ratings_removed[construct]
		my_lexicon.remove(construct, remove_tokens=remove_tokens, 
						source = my_lexicon.creator + ": tokens rated lower than 1.3 by raters")




In [None]:
if perform_human_ratings:

	# Average and only keep tokens 3/3 on protoypicality. These are definitely related to the construct and can be used for CTS or to avoid false positives. 
	ratings_avg_prototypical, ratings_removed_prototypical = lexicon.avg_above_thresh(all_ratings_per_construct_dict, thresh = 3) # set to 3 or perhaps above 2 like 2.1. 

	my_lexicon_prototypes = copy.deepcopy(my_lexicon)
	my_lexicon_prototypes.name = my_lexicon.name + ' prototypes'

	# Remove tokens with low avg. ratings
	for construct in ratings_removed_prototypical:
		remove_tokens = ratings_removed[construct]
		my_lexicon_prototypes.remove(construct, remove_tokens=remove_tokens, 
									source = my_lexicon_prototypes.creator + ": tokens rated lower than 3 by raters")



	# Save final prototypes lexicon
	
	my_lexicon_prototypes.save(f'{lexicon_dir}/{clean_lexicon_name}/', filename = 'suicide_risk_lexicon_validated_prototypes-3-3') # save lexicon protocotypes 3/3



INFO: Saved lexicon to ./ct_lexicons//Insight_v1-0//suicide_risk_lexicon_validated_prototypes-3-3_24-08-12T17-53-07


In [None]:
if perform_human_ratings:
	# List construct with less than 10 tokens (which might be too few for a lexicon but sufficient for CTS method). [] means none.
	print([(k,v) for k,v in ratings_avg_prototypical.items() if len(v)<10])

[]


1.7.5. Inter-rater reliability (OPTIONAL)

	See manuscript for more information


In [None]:
if perform_human_ratings:
	from construct_tracker.utils import irr
	import numpy as np
	from sklearn.metrics import cohen_kappa_score


	# df = all_ratings_per_construct_dict
	# or from: all_ratings_per_construct_dict = my_lexicon.get_attribute('ratings_24-08-08')

	# TODO: turn into function
	cohens_kappa_all = {}
	fleiss_kappa_all = {}
	for construct, tokens in all_ratings_per_construct_dict.items():
		construct_c_ratings = list(tokens.values())
		construct_c_ratings_mode = int(np.median([len(n) for n in construct_c_ratings])) # construct_c_ratings)
		construct_c_ratings_all_annotated = []
		for token_i_ratings in construct_c_ratings:
			token_i_ratings = list(token_i_ratings)
		
			
			if len(token_i_ratings)==construct_c_ratings_mode and np.mean(token_i_ratings)>1.3:
			# 	token_i_ratings = token_i_ratings + [np.round(np.mean(token_i_ratings),0)]

				construct_c_ratings_all_annotated.append(token_i_ratings)
				
		# If I dont have them all the same shape, can't calculate
		construct_c_ratings = np.array(construct_c_ratings_all_annotated)
		construct_c_ratings = construct_c_ratings.astype(int)
		
		if construct_c_ratings.shape[1] == 2:
			# kappa = binary_inter_rater_reliability(construct_c_ratings[:,0], construct_c_ratings[:,1])
			kappa = irr.cohens_kappa(construct_c_ratings)
			cohens_kappa_all[construct.replace('_include','')] = kappa
			print(f"Cohen's Weighted Kappa (2 raters) for {construct}: {kappa}")
		elif construct_c_ratings.shape[1] >= 3:
			kappa = irr.calculate_fleiss_kappa(construct_c_ratings)
			fleiss_kappa_all[construct.replace('_include','')] = kappa
			print(f"Fleiss' Kappa (3 or more raters) for {construct}: {kappa}")


	weighted_kappa = pd.DataFrame(cohens_kappa_all, index = ['weighted_kappa']).T.mean()
	fleis_kappa = pd.DataFrame(fleiss_kappa_all, index = ['fleiss_kappa']).T


	display(weighted_kappa)
	display(fleis_kappa)



Fleiss' Kappa (3 or more raters) for Insight: 1.0
Fleiss' Kappa (3 or more raters) for Mindfulness: 1.0
Fleiss' Kappa (3 or more raters) for Compassion: 1.0


weighted_kappa   NaN
dtype: float64

Unnamed: 0,fleiss_kappa
Insight,1.0
Mindfulness,1.0
Compassion,1.0


#### Lexicon step 1.8: Save final lexicon

In [None]:
# Main lexicon
my_lexicon.save(f'{lexicon_dir}/{clean_lexicon_name}/', filename = 'suicide_risk_lexicon_validated') # save lexicon

INFO: Saved lexicon to ./ct_lexicons//Insight_v1-0//suicide_risk_lexicon_validated_24-08-12T17-49-12


#### Lexicon step 1.9: Final feature extraction

If you want to load lexicon in a different script in the future:
```python
path = f'./ct_lexicons/{clean_lexicon_name}/'
my_lexicon = load_lexicon(path)
```

In [None]:
# Now count whether tokens appear in document:

normalize = True
# We'll set normalize to True. Whether to normalize the extracted features by word count. 3 matches in a short document would be weighed higher than in a long document.
counts, matches_by_construct, matches_doc2construct, matches_construct2doc  = my_lexicon.extract(documents,
                                                                                      normalize = normalize,
                                                                                      )
display(counts)
# Might be worth calibrating (removing words that create false positives too often) on a training set. 

extracting... 


100%|██████████| 3/3 [00:01<00:00,  1.86it/s]


Unnamed: 0,document_id,document,Insight,Mindfulness,Compassion,word_count
0,0,"Every time I speak with my cousin Bob, I have ...",0.176471,0.117647,0.0,17
1,1,"He meditates a lot, but he's not super smart",0.0,0.125,0.0,8
2,2,He is too competitive,0.0,0.0,0.0,4


In [None]:
# save counts
output_dir = './data/insight_project/'
os.makedirs(output_dir, exist_ok=True)
counts.to_csv(f'{output_dir}insight_lexicon_counts_normalize-{normalize}.csv', index = False)

## 2. Construct-Text Similarity (CTS)

#### CTS step 2.1: Load validated lexicon or create your a quick one with some keywords

In [None]:
# Here you can use my_lexicon or my_lexicon_prototypes (so it doesn't find similarity with too many low-prototypical tokens)
lexicon_dict = my_lexicon.to_dict() 
lexicon_dict

{'Insight': ['aha moment',
  'analytical thinking',
  'astuteness',
  'awareness',
  'breakthrough',
  'clarity',
  'clarity of thought',
  'clear vision',
  'cognitive insight',
  'comprehension',
  'conceptual insight',
  'contemplation',
  'critical thinking',
  'deep insight',
  'deep understanding',
  'diagnosis',
  'diagnostic insight',
  'discernment',
  'discovery',
  'enlightened perspective',
  'enlightenment',
  'epiphany',
  'foresight',
  'grasp',
  'grasping',
  'illumination',
  'informed decision',
  'insight',
  'insight-driven',
  'insightful',
  'insightful analysis',
  'insightfulness',
  'introspection',
  'introspective',
  'intuition',
  'intuitive grasp',
  'keen perception',
  'knowledge',
  'knowledgeable',
  'lucidity',
  'meaningful understanding',
  'mental clarity',
  'mindful',
  'mindfulness',
  'perception',
  'perceptive',
  'perceptive insight',
  'perceptive understanding',
  'perceptiveness',
  'perceptivity',
  'perspective',
  'perspicacity',
  'p


Or Load lexicon
```python
path = './ct_lexicons/Insight/Insight_24-08-07T21-11-46.539980.pickle'
my_lexicon = lexicon.load_lexicon(path)
```

Or just create your own quickly with some keywords
```python
lexicon_dict = {
 	'Insight': 	['clarity', 'enlightenment', 'wise'],
 	'Mindfulness': 	['mindful','meditation', 'awareness', 'nonjudgemental',
			'present-focused'],
 	'Compassion': 	['compassion', 'love', 'kind', 'help others'],

 }
 ```

#### CTS step 2.2: Measure construct-text similarity

In [None]:
from construct_tracker import cts

features, documents_tokenized, lexicon_dict_final_order, cosine_similarities = cts.measure(
    lexicon_dict,
    documents,
	count_if_exact_match = 'sum',
	similarity_threshold = 0.3,
    )

display(features)

features.to_csv(f'{output_dir}insight_lexicon_cts_features.csv', index = False)

INFO: Default input sequence length for all-MiniLM-L6-v2: 256
INFO: Encoding 568 new construct tokens...


Batches:   0%|          | 0/18 [00:00<?, ?it/s]

INFO: Finished encoding construct tokens.
INFO: Tokenizing documents...
INFO: Finished tokenization.
INFO: Encoding all document tokens...
INFO: computing similarity between 3 constructs and 3 documents...
3it [00:00, 194.78it/s]
INFO: Finished.
INFO: Adding 'Insight' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Insight']
INFO: Adding 'Mindfulness' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Mindfulness']
INFO: Adding 'Compassion' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Compassion']


extracting... 


100%|██████████| 3/3 [00:02<00:00,  1.44it/s]


Unnamed: 0,document_id,documents,documents_tokenized,Insight_max,Mindfulness_max,Compassion_max
0,1,"Every time I speak with my cousin Bob, I have ...","[Every time I speak with my cousin Bob, I have...",3.482038,2.715475,0.482038
1,2,"He meditates a lot, but he's not super smart","[He meditates a lot, he's not super smart]",0.0,1.0,0.0
2,0,He is too competitive,[He is too competitive],0.447937,0.382691,0.327205


You can also provide a counts DF done in a different way:

```python
counts, matches_by_construct, matches_doc2construct, matches_construct2doc  = my_lexicon.extract(documents,normalize = True,return_zero=['anger'])

from construct_tracker import cts

features, documents_tokenized, lexicon_dict_final_order, cosine_similarities = cts.measure(
    lexicon_dict,
    documents,
	lexicon_counts = counts,
	count_if_exact_match = 'sum',
	similarity_threshold = 0.3,
    )

display(features)
```


#### CTS step 2.3: Interpret cosine scores for any given document


In [None]:

doc_id = 2
construct = 'Mindfulness'
most_similar_lexicon_token, most_similar_document_token, highest_similarity = cts.get_highest_similarity_phrase(doc_id, construct, documents, documents_tokenized, cosine_similarities, lexicon_dict_final_order)

The construct 'Mindfulness' through its token 'oneness' had the highest cosine similarity (0.24) with the following document token:
'He is too competitive'
