This notebook was written by Katie Eritano 

## Set Up

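Before running this notebook, install Ollama from:
https://ollama.com/download

Then pull the `gemma3` model (or the smaller `gemma3:1b`) from the terminal:
`ollama pull gemma3`
If you pull `gemma3:1b` instead, change the `model` argument passed to `OllamaLLM` in the notebook to match.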

Next, install the libraries the notebook imports:
`pandas`, `jupyter`, `langchain`, `langchain-community`, `tqdm`.

Two further packages have their own install commands:
`langchain-ollama`, installed with `pip install -U langchain-ollama`,
and
`langchain-tavily`, installed with `pip install -qU langchain-tavily`.

You will need a Tavily account to get an API key. Sign up here:
https://app.tavily.com

When you run the notebook, you will be prompted for the API key unless the `TAVILY_API_KEY` environment variable is already set.
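
Before running the cells, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the original notebook: `missing_modules` and `REQUIRED_MODULES` are hypothetical names, and the list of import names is an assumption based on the imports used in the first code cell.

```python
import importlib.util

# Import names assumed from the notebook's first code cell (not pip package names).
REQUIRED_MODULES = [
    "pandas",
    "langchain_core",
    "langchain_community",
    "langchain_ollama",
    "tqdm",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

This only checks that the modules can be located; it does not verify versions or that Ollama itself is running.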

In [12]:
import pandas as pd
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import getpass
import os
from tqdm import tqdm

if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass.getpass("Tavily API key:\n")


In [19]:
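# Load the word list; adjust this path to wherever the CSV lives on your machine.
words_df = pd.read_csv('/Users/katieeeritano/Desktop/Unrecognized-Terms-Classification/data/chatgpt_classifier_7923_words - 7923_words.csv')
#terms = words_df.iloc[:, 0]
# Classify only the first 50 terms while testing; use words_df['word'] for the full list.
terms = words_df['word'][:50]
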

In [14]:
# Setup Tavily and Ollama
search = TavilySearchResults()
llm = OllamaLLM(model="gemma3")
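
The search tool's output has to be turned into plain text before it reaches the prompt. The sketch below shows one way to do that, under the assumption that the tool returns either a string or a list of dicts that each carry a `content` key; that result shape is an assumption about `TavilySearchResults`, and `results_to_text` and `max_chars` are hypothetical names introduced here.

```python
def results_to_text(results, max_chars=2000):
    """Flatten search output into a single plain-text summary string.

    Assumes `results` is a string, or a list of dicts with a "content" key.
    Anything else falls back to str().
    """
    if isinstance(results, str):
        text = results
    elif isinstance(results, list):
        parts = []
        for item in results:
            if isinstance(item, dict):
                parts.append(str(item.get("content") or ""))
            else:
                parts.append(str(item))
        # Drop empty snippets and join the rest, one per line.
        text = "\n".join(part for part in parts if part)
    else:
        text = str(results)
    # Truncate so a long result set does not overwhelm the prompt.
    return text[:max_chars]
```

Passing only the snippet text, rather than the raw `str()` of the whole result object, keeps URLs and dict punctuation out of the prompt.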

In [15]:
# Setup prompt
prompt = ChatPromptTemplate.from_template("""
You are a linguistic classifier. Based on the search summary below, classify the term.

Term: {term}
Search Summary: {summary}

Classify the term into one of the following categories:
0 - Recognized Word: An English term that might be used in everyday speech, 
    or would be reasonably recognized by most people (e.g., "anticlinton" 
    means people opposing Hillary or Bill Clinton, "belgiums" is the 
    possessive form of Belgium, "vaxine" can be recognized as vaccine). 
    This category should include misspellings of reasonably recognizable words.

1 - Recognized word - proper noun/foreign word not used every day: A real word or 
name not used in typical daily language (e.g., "Qatar", "Hezbollah", "Kafka") that 
people may need to look up. Definitions for these words would come from more reputable or 
established references.

2 - Recognized Slur - Not used in everyday language: A derogatory, offensive, or 
coded slur not considered appropriate in general discourse. These terms are used 
and recognized beyond the online forum 4chan, not only on it.

3 - Unrecognized word: A word or phrase that is either unique to 4chan, originates 
from 4chan culture, or is used in a context that is obscure, ironic, or offensive, 
often not understood outside of 4chan boards such as /pol/, /int/, or /b/. These 
terms typically reflect the platform's insular slang, memes, or coded language.

4 - Unsure: There is not a consistent definition. Multiple possibilities exist 
or there is no definition and it is unclear which definition best fits without 
the context of use.

Return ONLY the number.
""")

chain = prompt | llm | StrOutputParser()
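
The prompt asks the model to return only the number, but a local model may still add whitespace or extra words. The sketch below shows one way to normalize the raw output into a single category label; `parse_label` is a hypothetical helper, and falling back to "4" (Unsure) when no valid label is found is an assumption that mirrors the fallback used in the classification loop.

```python
import re


def parse_label(raw, default="4"):
    """Extract the first standalone digit 0-4 from the model output.

    Returns `default` when no valid category number is present.
    """
    match = re.search(r"\b[0-4]\b", str(raw))
    return match.group(0) if match else default
```

For example, `parse_label(" 3 - Unrecognized word\n")` gives `"3"`, while an empty or off-format response gives the default.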

In [16]:
# Store results
classifications = []

print("Classifying terms...")
for term in tqdm(terms):
    try:
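        # Ask Tavily for a short definition, then pass the result text to the classifier chain.
        summary = search.invoke(f"Give a concise definition and the context of '{term}'")
        summary_text = summary['summary'] if isinstance(summary, dict) and 'summary' in summary else str(summary)
        # Strip surrounding whitespace so the stored label is just the category number.
        result = chain.invoke({"term": term, "summary": summary_text}).strip()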
    except Exception as e:
        print(f"Error classifying term '{term}': {e}")
        result = "4"
    classifications.append(result)

Classifying terms...


  2%|▏         | 121/7923 [12:54<13:52:19,  6.40s/it]


KeyboardInterrupt: 

In [18]:
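# Add results to the DataFrame at the desired location.
# Only the first len(classifications) terms were classified, so align the labels
# to those rows; unclassified rows get NaN instead of raising a length mismatch.
labels = pd.Series(classifications, index=terms.index[:len(classifications)]).reindex(words_df.index)
words_df.insert(loc=words_df.columns.get_loc("Hand Label "), column="gemma_label", value=labels)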

In [None]:
# Save to new CSV
words_df.to_csv("/Users/katieeeritano/Desktop/Unrecognized-Terms-Classification/data/terms_classified.csv", index=False)
print("Classification complete! Results saved to data/terms_classified.csv")