# Unrecognized Term Classification 
by Katie Eritano

This notebook uses a local LLM (Gemma 3 served through Ollama), together with Tavily web search results, to classify large quantities of terms.

### Set Up

Before running this notebook, install Ollama from:
https://ollama.com/download

Then pull `gemma3` (or `gemma3:1b`) from the terminal. If you pull `gemma3:1b`, change the `model` argument in the Ollama setup cell to match:
`ollama pull gemma3`

Next, install the required libraries:
`pandas`, `numpy`, `tqdm`, `jupyter`, `langchain`, and `langchain-community`.

Two further packages have their own install commands:
`langchain-ollama`, installed with `pip install -U langchain-ollama`
and
`langchain-tavily`, installed with `pip install -qU langchain-tavily`

You will also need a Tavily account to get an API key. Sign up here:
https://app.tavily.com

When you run the notebook, you will be prompted for the API key unless the `TAVILY_API_KEY` environment variable is already set.
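
As a quick sanity check before running the cells, the sketch below lists which of the modules imported by this notebook cannot be found. The list `REQUIRED_MODULES` and the helper `missing_modules` are illustrative names, not part of the original notebook; note that import names differ from pip package names (for example, `langchain-ollama` is imported as `langchain_ollama`).

```python
import importlib.util

# Top-level module names imported in this notebook's first code cell.
# This list is an assumption based on those import statements.
REQUIRED_MODULES = [
    "pandas", "numpy", "tqdm",
    "langchain_core", "langchain_ollama", "langchain_community",
]

def missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```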

### Import libraries and get Tavily API key

In [2]:
import pandas as pd
import numpy as np
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import getpass
import os
from tqdm import tqdm

# Prompt for the Tavily API key if it is not already set in the environment
if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass.getpass("Tavily API key:\n")




### Read CSV Data

In [3]:
words_df = pd.read_csv('/Users/katieeeritano/Desktop/Unrecognized-Terms-Classification/data/chatgpt_classifier_7923_words - 7923_words.csv')

# Get all terms from the first column
#terms = words_df.iloc[:, 0]

# Get only the first 20 terms
terms = words_df['word'][:20]
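
Because every term costs one web search and one LLM call, it may be worth tidying the term list before classification. The following is a minimal sketch, assuming the terms can contain stray whitespace, empty cells, or duplicates; `clean_terms` is a hypothetical helper and the sample list is illustrative input only.

```python
def clean_terms(raw_terms, limit=None):
    """Strip whitespace, skip non-string or empty entries, and drop
    duplicates while keeping the original order.
    If limit is given, keep only the first `limit` cleaned terms."""
    seen = set()
    cleaned = []
    for term in raw_terms:
        if not isinstance(term, str):
            continue  # e.g. NaN produced by an empty CSV cell
        term = term.strip()
        if term and term not in seen:
            seen.add(term)
            cleaned.append(term)
    return cleaned if limit is None else cleaned[:limit]

# Illustrative input only; in the notebook this would be words_df['word'].
sample = [" albab", "albab", "", float("nan"), "anonkun "]
print(clean_terms(sample))
```

In the notebook this could be called as `terms = clean_terms(words_df['word'], limit=20)`, which returns a plain list rather than a pandas Series.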

### Set Up Tavily and Ollama

In [4]:
search = TavilySearchResults()
llm = OllamaLLM(model="gemma3")

### LLM Prompts

In [5]:
prompt_1 = ChatPromptTemplate.from_template("""
You are a researcher analyzing terms from a fringe online community.
Your job is to classify whether each term is part of everyday language or specific to that space.

Use the following definitions:

0 — **Recognized Word**: A common English word or phrase used in everyday language (e.g. "apple", "dog", "table").
1 — **Recognized Proper Noun or Foreign Word**: A proper name, place name, or foreign-language term that is not typically used in casual English conversations but is verifiable through reliable sources (e.g. "al-Assad", "Qatar", "al-Bab").
2 — **Recognized Slur**: A known derogatory term or slur. These are offensive and not typically used in respectful everyday speech (e.g. racial or ethnic slurs).
3 — **Unrecognized Word**: A word that appears to be made up, has no clear definition, or is only found in obscure internet forums.
4 — **Unsure**: If you cannot tell what the word means or there is not enough information.

Based on the internet summary provided, classify the term accordingly.

Term: {term}

Web Summary:
{summary}

Answer with only the classification number (0, 1, 2, 3, or 4), no explanation.
""")

chain_1 = prompt_1 | llm | StrOutputParser()

In [6]:
# Store results as lists
classifications_1 = []
summaries_1 = []

print("Classifying terms using Tavily + Gemma...")
for term in tqdm(terms):
    try:
        # Run Tavily first
        search_result = search.invoke({"query": f"What is '{term}' on 4chan?"})
        summary_text = search_result['summary'] if isinstance(search_result, dict) and 'summary' in search_result else str(search_result)

        # Now classify
        result = chain_1.invoke({"term": term, "summary": summary_text})

        # Record both values only after both calls succeed,
        # so the two lists always stay the same length
        summaries_1.append(summary_text)
        classifications_1.append(result.strip())

    except Exception as e:
        print(f"Error for term '{term}': {e}")
        summaries_1.append("ERROR")
        classifications_1.append("4")

Classifying terms using Tavily + Gemma...


100%|██████████| 20/20 [01:51<00:00,  5.60s/it]
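
The loop above stores `result.strip()` as the label, which assumes the model returns exactly one digit. A small model may occasionally add extra words, so one option is to extract the label defensively. This is a minimal sketch, not part of the original notebook; `parse_label` is a hypothetical helper, and the default fallback of `"4"` matches the Unsure label in `prompt_1`.

```python
import re

def parse_label(raw_output, valid_labels=("0", "1", "2", "3", "4"), fallback="4"):
    """Return the first standalone digit in raw_output if it is a valid
    label; otherwise return the fallback label."""
    match = re.search(r"\b\d\b", raw_output)
    if match and match.group() in valid_labels:
        return match.group()
    return fallback

print(parse_label("Classification: 2"))
print(parse_label("I am not sure"))
```

In the loop, `classifications_1.append(parse_label(result))` could then replace `classifications_1.append(result.strip())`.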


In [7]:
# View results
results_df_1 = pd.DataFrame({
    'term': terms[:20],             # first 20 terms
    'gemma_label': classifications_1,
    'tavily_summary': summaries_1
})

results_df_1

| | term | gemma_label | tavily_summary |
|---|---|---|---|
| 0 | aangezien | 3 | [{'title': 'What is 4chan, and why is it so co... |
| 1 | alaqsa | 1 | [{'title': '[PDF] A Flood of Hate How Hamas Fu... |
| 2 | alassad | 1 | [{'title': 'Portrait of a troll \| OCCRP', 'url... |
| 3 | albab | 3 | [{'title': '4chan - Wikipedia', 'url': 'https:... |
| 4 | albaghdadi | 1 | [{'title': 'Abu Bakr al-Baghdadi: A Virgin Nig... |
| 5 | allfather | 1 | [{'title': 'Why is Odin called the Allfather a... |
| 6 | altleft | 2 | [{'title': 'Alt-right - Wikipedia', 'url': 'ht... |
| 7 | anonkun | 3 | [{'title': "/a/ - Anon-kun, LOOK! I'm Marin no... |
| 8 | anticlinton | 3 | [{'title': '4chan \| Meaning & Origin \| Diction... |
| 9 | antimasker | 2 | [{'title': 'anti-masker Meaning \| Politics by ... |


In [8]:
prompt_2 = ChatPromptTemplate.from_template("""
You are a language analyst tasked with classifying terms from online discussions.
Each term is accompanied by a web search summary. Use that summary to determine whether the term is common, foreign, offensive, obscure, or unknown.

Use these classification rules:

- 0 — Recognized English Word: A standard English word or common internet term.
  Examples: "apple", "mask", "antimaskers"

- 1 — Recognized Proper Noun, Place, Foreign Word, or Name: Any identifiable person, place, organization, or foreign word that can be verified online.
  Examples: "al-Assad", "Antonescu", "Avdeevka", "aangezien" (Dutch), "al-Bab"

- 2 — Recognized Slur or Derogatory Term: Offensive or harmful terms.
  Examples: "altleft", "anonkun"

- 3 — Unrecognized or Nonsense Word: A made-up, typo-ridden, or otherwise meaningless term not found in reliable sources.
  Examples: "anticlinton", "antius" (if not recognized as "anti-US")

- 4 — Unsure: You cannot tell, or the summary lacks enough information.

Your answer should be a **single digit (0–4)** based **only on the term and its web summary.** You are allowed to use reasoning to correct minor spacing issues or infer meaning when reasonable (e.g. "antius" = "anti-US").

Term: {term}

Web Summary:
{summary}

Answer with one digit only: 0, 1, 2, 3, or 4.
""")

chain_2 = prompt_2 | llm | StrOutputParser()

In [9]:
# Store results as lists
classifications_2 = []
summaries_2 = []

print("Classifying terms using Tavily + Gemma...")
for term in tqdm(terms):
    try:
        # Run Tavily first
        search_result = search.invoke({"query": f"What is '{term}' on 4chan?"})
        summary_text = search_result['summary'] if isinstance(search_result, dict) and 'summary' in search_result else str(search_result)

        # Now classify
        result = chain_2.invoke({"term": term, "summary": summary_text})

        # Record both values only after both calls succeed,
        # so the two lists always stay the same length
        summaries_2.append(summary_text)
        classifications_2.append(result.strip())

    except Exception as e:
        print(f"Error for term '{term}': {e}")
        summaries_2.append("ERROR")
        classifications_2.append("4")

Classifying terms using Tavily + Gemma...


100%|██████████| 20/20 [01:05<00:00,  3.28s/it]
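
Each prompt variant in this notebook repeats the same Tavily query for every term, so the same search is paid for several times. One way to avoid that is to cache results per term. The sketch below is an illustration under that assumption; `make_cached_search` and `fake_search` are hypothetical names, and `fake_search` only stands in for the real Tavily call.

```python
def make_cached_search(search_fn):
    """Wrap search_fn so each distinct term is searched at most once."""
    cache = {}

    def cached_search(term):
        if term not in cache:
            cache[term] = search_fn(term)
        return cache[term]

    return cached_search

# Stand-in for the real Tavily call, used here only to show the behaviour.
calls = []

def fake_search(term):
    calls.append(term)
    return f"summary for {term}"

cached = make_cached_search(fake_search)
cached("albab")
cached("albab")  # served from the cache; fake_search is not called again
print(len(calls))
```

In the notebook, `search_fn` could be something like `lambda term: str(search.invoke({"query": f"What is '{term}' on 4chan?"}))`. Searches that raise an exception are not cached, so they would be retried on the next call.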


In [None]:
# View results
results_df_2 = pd.DataFrame({
    'term': terms[:20],             # first 20 terms
    'gemma_label': classifications_2,
    'tavily_summary': summaries_2
})

print(results_df_2)

In [None]:
prompt_3 = ChatPromptTemplate.from_template("""
You are classifying online terms. Each term is paired with a web summary. Use the summary and your reasoning to determine which category it falls into.

Classification labels:
- 0 — Recognized English Word or Common Neologism: A valid English word, a widely used internet term, or a morphologically understandable construct like "antiX" (e.g. "anticlinton", "antius", "antimasker").
- 1 — Recognized Proper Noun, Foreign Word, Name, or Place: Names of people, countries, cities, foreign-language words, etc.
- 2 — Recognized Slur or Derogatory Term: Harmful or offensive slurs.
- 3 — Unrecognized or Nonsense Term: Made-up, typo-ridden, or meaningless term.
- 4 — Unsure: Cannot determine from the summary or context is missing.

Always base your classification on the term and the web summary.

Examples:
- "albaghdadi" with summary "ISIS leader" → 1
- "aangezien" with summary "Dutch word meaning 'since'" → 1
- "anticlinton" with summary "used by critics of Hillary Clinton" → 0
- "antius" with summary "used to mean 'anti-US'" → 0
- "anonkun" with summary "slang used in extremist forums" → 2

Now, classify the following term:

Term: {term}

Web Summary:
{summary}

Answer with one digit only: 0, 1, 2, 3, or 4.
""")

chain_3 = prompt_3 | llm | StrOutputParser()

In [None]:
# Store results as lists
classifications_3 = []
summaries_3 = []

print("Classifying terms using Tavily + Gemma...")
for term in tqdm(terms):
    try:
        # Run Tavily first
        search_result = search.invoke({"query": f"What is '{term}' on 4chan?"})
        summary_text = search_result['summary'] if isinstance(search_result, dict) and 'summary' in search_result else str(search_result)

        # Now classify
        result = chain_3.invoke({"term": term, "summary": summary_text})

        # Record both values only after both calls succeed,
        # so the two lists always stay the same length
        summaries_3.append(summary_text)
        classifications_3.append(result.strip())

    except Exception as e:
        print(f"Error for term '{term}': {e}")
        summaries_3.append("ERROR")
        classifications_3.append("4")

In [None]:
# View results
results_df_3 = pd.DataFrame({
    'term': terms[:20],             # first 20 terms
    'gemma_label': classifications_3,
    'tavily_summary': summaries_3
})

print(results_df_3)
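
Prompts 1 to 3 share the same 0 to 4 label scheme, so their outputs can be compared position by position to see where a prompt change altered the label. The sketch below is one possible way to do that; `agreement_rate` and `disagreements` are hypothetical helpers, and the demo values are placeholders rather than real model output.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of positions where two equally long label lists match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def disagreements(terms, labels_a, labels_b):
    """Return (term, label_a, label_b) for every term labelled differently."""
    return [(t, a, b) for t, a, b in zip(terms, labels_a, labels_b) if a != b]

# Placeholder values for illustration only, not real model output.
demo_terms = ["term_a", "term_b", "term_c", "term_d"]
demo_run_1 = ["0", "1", "3", "4"]
demo_run_2 = ["0", "1", "2", "4"]

print(agreement_rate(demo_run_1, demo_run_2))
print(disagreements(demo_terms, demo_run_1, demo_run_2))
```

In the notebook, the inputs would be `list(terms)` together with two of `classifications_1`, `classifications_2`, and `classifications_3`.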

In [67]:
prompt_4 = ChatPromptTemplate.from_template("""
You are classifying terms found in online text. Each term comes with a summary from web search.

Classify the term into one of the following categories, using both the summary and your best reasoning. Focus especially on morphological variants and culturally specific language.

Classification Labels:

- 0 — Common English Word or Neologism: Includes widely recognized English words, internet slang, and new terms understandable to the general public. Also includes misspellings and morphological variants (e.g., plural or possessive forms) of common terms.

- 1 — Proper Noun, Name, Place, or Event: Names of people (e.g., politicians, theorists), places (e.g., cities, countries), or major events (e.g., Watergate). Includes misspellings and variations of such proper nouns.

- 2 — Foreign Word: A non-English word, whether common or obscure in its origin language. Includes recognizable misspellings of foreign terms.

- 3 — Common Slur or Derogatory Term: Recognizable and historically offensive terms. These are known to be slurs or inflammatory language, regardless of slang form.

- 4 — Unrecognized Word (Niche or Internet-Originated): Words that may not appear in a dictionary but have evolved in niche or online spaces (e.g., 4chan, incel forums, or meme culture). Includes:
    - Stylized neologisms
    - Satirical or corrupted names
    - Portmanteaus
    - Morphological variants that shift or evolve meaning
    - Niche combinations or mashups that convey subcultural or ironic meaning (e.g., "anonkun", "betamale", "gaytheists").

- 5 - Unsure: you do not have enough information to classify the term.

Special Notes:
- Morphological variants (like plurals, possessives) should be classified the same as their base word unless the variation creates new niche meaning.
- Misspellings should be interpreted based on likely intent.
- Use the web summary for guidance but reason through cultural or structural patterns in the term.
- If you cannot find enough reliable information but the term seems interpretable (e.g., "antiukip"), you may still make a judgment.

Examples:
- "albaghdadi" with summary "ISIS leader" → 1
- "aangezien" with summary "Dutch word meaning 'since'" → 2
- "anticlinton" with summary "used by critics of Hillary Clinton" → 0
- "antius" with summary "meaning anti-U.S." → 0
- "trmp" with summary "likely a misspelling of 'Trump'" → 1
- "anonkun" with summary "term from 4chan used in fanfiction" → 4
- "anonkuns" with summary "plural variant of 'anonkun'" → 4
- "gibberfloop" with no results or context → 5

Now classify the following:

Term: {term}

Web Summary:
{summary}

Respond with a single number only: 0, 1, 2, 3, 4, or 5.
""")

chain_4 = prompt_4 | llm | StrOutputParser()

In [None]:
# Store results as lists
classifications_4 = []
summaries_4 = []

print("Classifying terms using Tavily + Gemma...")
for term in tqdm(terms):
    try:
        # Run Tavily first
        search_result = search.invoke({"query": f"What is '{term}' on 4chan?"})
        summary_text = search_result['summary'] if isinstance(search_result, dict) and 'summary' in search_result else str(search_result)

        # Now classify
        result = chain_4.invoke({"term": term, "summary": summary_text})

        # Record both values only after both calls succeed,
        # so the two lists always stay the same length
        summaries_4.append(summary_text)
        classifications_4.append(result.strip())

    except Exception as e:
        print(f"Error for term '{term}': {e}")
        summaries_4.append("ERROR")
        # In prompt_4 the Unsure label is 5 (4 means Unrecognized Word)
        classifications_4.append("5")

In [None]:
# View results
results_df_4 = pd.DataFrame({
    'term': terms[:20],             # first 20 terms
    'gemma_label': classifications_4,
    'tavily_summary': summaries_4
})

print(results_df_4)
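
The four classification loops in this notebook differ only in the chain they call and in which label means Unsure. They could be folded into one helper, sketched below under that assumption. `classify_terms` is a hypothetical name, and `fake_search` and `fake_classify` are stubs standing in for the Tavily and Ollama calls so the example runs on its own.

```python
def classify_terms(terms, search_fn, classify_fn, fallback_label):
    """Search and classify each term, returning (labels, summaries).

    Exactly one label and one summary are recorded per term, even when
    a call fails, so the two lists always stay aligned."""
    labels, summaries = [], []
    for term in terms:
        try:
            summary = str(search_fn(term))
            label = str(classify_fn(term, summary)).strip()
        except Exception as exc:
            print(f"Error for term '{term}': {exc}")
            summary, label = "ERROR", fallback_label
        labels.append(label)
        summaries.append(summary)
    return labels, summaries

# Stubs for illustration only; they do not call Tavily or Ollama.
def fake_search(term):
    if term == "bad":
        raise RuntimeError("search failed")
    return f"summary for {term}"

def fake_classify(term, summary):
    return " 1\n"

labels, summaries = classify_terms(
    ["ok", "bad"], fake_search, fake_classify, fallback_label="5"
)
print(labels)
print(summaries)
```

With the real objects, `search_fn` could be `lambda t: search.invoke({"query": f"What is '{t}' on 4chan?"})` and `classify_fn` could be `lambda t, s: chain_4.invoke({"term": t, "summary": s})`, with `fallback_label="5"` for `prompt_4` and `"4"` for the earlier prompts.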

In [34]:
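# Save the labeled results to a new CSV.
# words_df is never modified above, so writing it would only copy the input;
# the most recent results table (prompt_4) is saved instead.
results_df_4.to_csv("/Users/katieeeritano/Desktop/Unrecognized-Terms-Classification/data/terms_classified.csv", index=False)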
print("Classification complete! Results saved to data/terms_classified.csv")

Classification complete! Results saved to data/terms_classified.csv
