# Unrecognized Term Classification
by Katie Eritano

This notebook uses a local LLM (Gemma 3 served through Ollama), together with Tavily web search results, to classify large quantities of terms.

### Set Up

Before running this notebook, install Ollama from:
https://ollama.com/download

From there you must install gemma3 or gemma3:1b via the terminal:
`ollama pull gemma3`

After this, make sure to install the needed libraries:
`pandas`, `numpy`, `tqdm`, `jupyter`, `langchain`, `langchain-community`.

Special installations include:
`langchain-ollama` which can be installed using the command `pip install -U langchain-ollama`
and 
`langchain-tavily` installed using the command `pip install -qU langchain-tavily`

Users will need to set up a Tavily account in order to get an API key. Sign up here:
https://app.tavily.com

When the notebook runs, you will be prompted to enter the API key if the `TAVILY_API_KEY` environment variable is not already set.
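
Before running the cells below, it can help to confirm that the required libraries are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the names passed to it are the import names used in this notebook, which can differ from the pip package names.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in module_names:
        try:
            if find_spec(name) is None:
                missing.append(name)
        except (ImportError, ValueError):
            # find_spec can raise for malformed or partially installed modules
            missing.append(name)
    return missing


# Import names taken from the import cell of this notebook
required = [
    "pandas",
    "numpy",
    "tqdm",
    "langchain_core",
    "langchain_ollama",
    "langchain_community",
]
print("Missing modules:", missing_packages(required))
```

An empty list means every import in the next cell should resolve. It does not check that the Ollama server is running or that the `gemma3` model has been pulled.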

### Import libraries and get Tavily API key

In [46]:
import pandas as pd
import numpy as np
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import getpass
import os
from tqdm import tqdm

# Asks for API Key input
if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass.getpass("Tavily API key:\n")


### Read CSV Data

In [47]:
words_df = pd.read_csv('/Users/katieeeritano/Desktop/Unrecognized-Terms-Classification/data/chatgpt_classifier_7923_words - 7923_words.csv')

# Gets all terms from the first column
#terms = words_df.iloc[:, 0]

# Gets terms up to a specific number
terms = words_df['word'][:20]

In [48]:
# Setup Tavily and Ollama
search = TavilySearchResults()
llm = OllamaLLM(model="gemma3")

In [67]:
# Setup prompt
prompt = ChatPromptTemplate.from_template("""
You are classifying terms found in online text. Each term comes with a summary from web search.

Classify the term into one of the following categories, using both the summary and your best reasoning. Focus especially on morphological variants and culturally specific language.

Classification Labels:

- 0 — Common English Word or Neologism: Includes widely recognized English words, internet slang, and new terms understandable to the general public. Also includes misspellings and morphological variants (e.g., plural or possessive forms) of common terms.

- 1 — Proper Noun, Name, Place, or Event: Names of people (e.g., politicians, theorists), places (e.g., cities, countries), or major events (e.g., Watergate). Includes misspellings and variations of such proper nouns.

- 2 — Foreign Word: A non-English word, whether common or obscure in its origin language. Includes recognizable misspellings of foreign terms.

- 3 — Common Slur or Derogatory Term: Recognizable and historically offensive terms. These are known to be slurs or inflammatory language, regardless of slang form.

- 4 — Unrecognized Word (Niche or Internet-Originated): Words that may not appear in a dictionary but have evolved in niche or online spaces (e.g., 4chan, incel forums, or meme culture). Includes:
    - Stylized neologisms
    - Satirical or corrupted names
    - Portmanteaus
    - Morphological variants that shift or evolve meaning
    - Niche combinations or mashups that convey subcultural or ironic meaning (e.g., "anonkun", "betamale", "gaytheists").

- 5 — Unsure: You do not have enough information to classify the term.

Special Notes:
- Morphological variants (like plurals, possessives) should be classified the same as their base word unless the variation creates new niche meaning.
- Misspellings should be interpreted based on likely intent.
- Use the web summary for guidance but reason through cultural or structural patterns in the term.
- If you cannot find enough reliable information but the term seems interpretable (e.g., "antiukip"), you may still make a judgment.

Examples:
- "albaghdadi" with summary "ISIS leader" → 1
- "aangezien" with summary "Dutch word meaning 'since'" → 2
- "anticlinton" with summary "used by critics of Hillary Clinton" → 0
- "antius" with summary "meaning anti-U.S." → 0
- "trmp" with summary "likely a misspelling of 'Trump'" → 1
- "anonkun" with summary "term from 4chan used in fanfiction" → 4
- "anonkuns" with summary "plural variant of 'anonkun'" → 4
- "gibberfloop" with no results or context → 5

Now classify the following:

Term: {term}

Web Summary:
{summary}

Respond with a single number only: 0, 1, 2, 3, 4, or 5.
""")

chain = prompt | llm | StrOutputParser()
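
The prompt asks for a single digit, but a local model may still wrap the answer in extra words or whitespace. A minimal sketch of a defensive parser is shown below. The helper name `parse_label` is hypothetical, and falling back to `5` is an assumption based on the prompt's example for a term with no context.

```python
import re


def parse_label(raw_output, fallback="5"):
    """Extract the first standalone digit 0-5 from the model's reply.

    Returns `fallback` when no valid label is present.
    """
    # Lookarounds reject digits that are part of a longer number such as "12"
    match = re.search(r"(?<!\d)[0-5](?!\d)", raw_output or "")
    return match.group(0) if match else fallback


print(parse_label("4"))
print(parse_label("Label: 2\n"))
print(parse_label("I am not sure."))
```

In the classification loop, `parse_label(result)` could stand in for `result.strip()` so that every stored label is one of the six expected values.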

In [68]:
# Store results
classifications = []
summaries = []

print("Classifying terms using Tavily + Gemma...")
for term in tqdm(terms):
    try:
        # Run Tavily first
        search_result = search.invoke({"query": f"Why are people on 4chan talking about '{term}'"})
        summary_text = search_result['summary'] if isinstance(search_result, dict) and 'summary' in search_result else str(search_result)
        summaries.append(summary_text)

        # Now classify
        result = chain.invoke({"term": term, "summary": summary_text})
        classifications.append(result.strip())

    except Exception as e:
        print(f"Error for term '{term}': {e}")
        summaries.append("ERROR")
        # Fall back to 5 (Unsure) when the search or the model call fails
        classifications.append("5")

Classifying terms using Tavily + Gemma...


100%|██████████| 20/20 [01:54<00:00,  5.73s/it]
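
The loop passes `str(search_result)` to the model, so the summary includes URLs and dictionary syntax. One possible alternative is to condense the results into plain text first. The sketch below assumes each Tavily result is a dict that may carry `title` and `content` keys (the printed output further down shows `title`). The helper name `format_results` and the `max_chars` limit are hypothetical.

```python
def format_results(results, max_chars=1500):
    """Join the title and content of each search result into one text block."""
    if not isinstance(results, list):
        # Error strings or other shapes are passed through, truncated
        return str(results)[:max_chars]

    parts = []
    for item in results:
        if not isinstance(item, dict):
            continue
        title = str(item.get("title", "")).strip()
        content = str(item.get("content", "")).strip()
        line = f"{title}: {content}" if title and content else (title or content)
        if line:
            parts.append(line)

    text = "\n".join(parts)
    return text[:max_chars] if text else "No results found."


# Made-up sample data, shaped like a list of search results
sample = [
    {"title": "Example page", "url": "https://example.com", "content": "A short snippet."},
    {"url": "https://example.org", "content": "Another snippet."},
]
print(format_results(sample))
```

Capping the length keeps the prompt short for a small local model; the right limit depends on the model's context window and is not something the notebook specifies.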


In [43]:
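# Create the column with object dtype so string labels can be assigned without a dtype warning
words_df['gemma_label'] = pd.Series(pd.NA, index=words_df.index, dtype='object')
# Align on the index of the classified terms instead of a hard-coded row range
words_df.loc[terms.index, 'gemma_label'] = classifications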


In [62]:
# Add results to DataFrame at the desired location
#words_df.insert(loc=words_df.columns.get_loc("Hand Label "), column="gemma_label", value=classifications)


In [34]:
# Save to new CSV
words_df.to_csv("/Users/katieeeritano/Desktop/Unrecognized-Terms-Classification/data/terms_classified.csv", index=False)
print("Classification complete! Results saved to data/terms_classified.csv")

Classification complete! Results saved to data/terms_classified.csv


In [69]:
results_df = pd.DataFrame({
    'term': terms[:20],             # first 20 terms
    'gemma_label': classifications,
    'tavily_summary': summaries
})

print(results_df)

             term gemma_label  \
0       aangezien           2   
1          alaqsa           3   
2         alassad           4   
3           albab           4   
4      albaghdadi           1   
5       allfather           4   
6         altleft           0   
7         anonkun           4   
8     anticlinton           4   
9      antimasker           4   
10    antimaskers           4   
11       antislav           4   
12   antitrumpers           4   
13       antiukip           4   
14    antiukraine           4   
15  antiukrainian           3   
16         antius           4   
17      antonescu           1   
18           aocs           4   
19       avdeevka           4   

                                       tavily_summary  
0   [{'title': '[PDF] Contentious branding - UvA-D...  
1   [{'title': 'What is 4chan, and why is it so co...  
2   [{'title': '[PDF] The Age of Incoherence? Unde...  
3   [{'title': 'For the first time, my Steam Deck ...  
4   [{'title': '[PDF] Expl
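
For a quick overview of how the labels are distributed, the classifications can be tallied by category. This is a sketch using only the standard library. The short names in `LABEL_NAMES` are abbreviations of the prompt's categories, `tally_labels` is a hypothetical helper, and `sample_labels` copies the `gemma_label` column printed above.

```python
from collections import Counter

# Abbreviated category names; 5 is the value the prompt's examples use
# for a term with no results or context
LABEL_NAMES = {
    "0": "common English word or neologism",
    "1": "proper noun",
    "2": "foreign word",
    "3": "slur or derogatory term",
    "4": "unrecognized (niche or internet-originated)",
    "5": "unsure",
}


def tally_labels(labels):
    """Count labels and map each one to a readable category name."""
    counts = Counter(str(label).strip() for label in labels)
    return {
        LABEL_NAMES.get(label, f"unexpected label {label!r}"): n
        for label, n in sorted(counts.items())
    }


# The 20 labels from the printed results, in row order
sample_labels = ["2", "3", "4", "4", "1", "4", "0", "4", "4", "4",
                 "4", "4", "4", "4", "4", "3", "4", "1", "4", "4"]

for name, n in tally_labels(sample_labels).items():
    print(f"{name}: {n}")
```

In the notebook itself, `tally_labels(classifications)` would give the same breakdown for the current run.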
