# Slang Detection using BERT and Custom Dictionary

This notebook demonstrates a process for detecting slang words within a sentence and retrieving their definitions. It utilizes a combination of:

1.  **A Pre-trained BERT Model (dslim/bert-base-NER):** To analyze the sentence structure and identify potential named entities or significant tokens. While not specifically trained for slang, NER models can help isolate nouns, proper nouns, etc., which might include slang terms.
2.  **A Custom Slang Dictionary (`slang.csv`):** Provided by the user, this file maps known slang terms to their definitions.

The approach involves processing the sentence with BERT, then cross-referencing the identified tokens (reconstructed into words) with the custom slang dictionary. This allows leveraging BERT's contextual understanding while relying on the user-provided dictionary for definitive slang identification and definition retrieval.

**Note:** This implementation runs on CPU. BERT inference can be slow without a GPU.

## 1. Setup and Imports

In [1]:
# Install necessary libraries (if not already installed)
# Run this cell once if you haven't installed these libraries in your environment
# !pip install torch transformers pandas

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import re
import os

print("Libraries imported successfully.")

Libraries imported successfully.


## 2. Load Slang Dictionary

This section loads the `slang.csv` file provided by the user into a Python dictionary for efficient lookup. Ensure the CSV file is in the specified path.

In [3]:
import os
import pandas as pd

def load_slang_dict(filepath):
    """Loads the slang dictionary from a CSV file into a Python dictionary."""
    try:
        # Try different encodings if default utf-8 fails
        encodings_to_try = ["utf-8", "latin1", "iso-8859-1", "cp1252"]
        slang_df = None
        for encoding in encodings_to_try:
            try:
                slang_df = pd.read_csv(filepath, encoding=encoding)
                print(f"Successfully loaded {os.path.basename(filepath)} with encoding: {encoding}")
                break
            except UnicodeDecodeError:
                continue

        if slang_df is None:
            print(f"Error: Could not decode the file {filepath} with available encodings.")
            return None

        # Check for required columns
        if 'term' not in slang_df.columns or 'definition' not in slang_df.columns:
            print(f"Error: Slang dictionary CSV at {filepath} must contain 'term' and 'definition' columns.")
            return None

        # Clean 'term' column: strip whitespace and convert to lowercase for case-insensitive matching
        slang_df['term'] = slang_df['term'].str.strip().str.lower()

        # Optionally drop duplicates, keep first occurrence
        slang_df = slang_df.drop_duplicates(subset=['term'], keep='first')

        # Convert to dictionary
        slang_dict = pd.Series(slang_df.definition.values, index=slang_df.term).to_dict()

        print(f"Loaded {len(slang_dict)} slang terms into dictionary.")
        return slang_dict

    except FileNotFoundError:
        print(f"Error: Slang dictionary file not found at {filepath}")
        return None
    except Exception as e:
        print(f"An error occurred while loading the slang dictionary: {e}")
        return None


# Updated Windows path (use raw string to handle backslashes)
slang_file_path = r"C:\Users\murta\Desktop\New folder\Handwritten-Alphabets-Recognition\slang.csv"

if not os.path.exists(slang_file_path):
    print(f"WARNING: Slang file not found at '{slang_file_path}'. Please ensure the file exists at this location or update the path. Slang detection will not work.")
    slang_dictionary = None
else:
    slang_dictionary = load_slang_dict(slang_file_path)

# Display a few terms if loaded successfully
if slang_dictionary:
    print("\nSample slang terms loaded:")
    count = 0
    for term, definition in slang_dictionary.items():
        print(f"- {term}: {definition}")
        count += 1
        if count >= 5:
            break


Successfully loaded slang.csv with encoding: utf-8
Loaded 379 slang terms into dictionary.

Sample slang terms loaded:
- unfollow: To stop subscribing to someone’s social media updates
- receipts: Proof of someone’s actions or words, often used during drama
- shade: A subtle insult or criticism
- troll: A person who provokes or upsets others online
- filter: A photo effect used to enhance images


## 3. Slang Detection Function

This function takes a sentence and the loaded slang dictionary as input. It uses the `dslim/bert-base-NER` model to tokenize the sentence. It then reconstructs words from tokens and checks if these words exist (case-insensitively) in the slang dictionary. If a match is found, the slang term and its definition are returned.

**Note:** The BERT model is loaded outside the function for efficiency, so it's only loaded once when this cell is run.

In [None]:
def read_sentences_from_file(file_path):
    """Reads sentences from a text file (one per line)."""
    if not os.path.exists(file_path):
        print(f"Error: File '{file_path}' not found.")
        return []

    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            sentences = [line.strip() for line in file if line.strip()]
        print(f"Loaded {len(sentences)} sentence(s) from {file_path}.")
        return sentences
    except Exception as e:
        print(f"An error occurred while reading the file: {e}")
        return []
# Prompt user for a file path to read sentences from
text_file_path = input("C:/Users/murta/Desktop/New folder/Handwritten-Alphabets-Recognition/text.txt").strip()

if text_file_path:
    file_sentences = read_sentences_from_file(text_file_path)
    if file_sentences and slang_dictionary:
        print("\n--- File-based Detection ---")
        for sentence in file_sentences:
            print(f"\nSentence: {sentence}")
            result = detect_slang_bert(sentence, slang_dictionary)
            print(result)
            print("---")
    elif not slang_dictionary:
        print("Slang dictionary not loaded. Cannot analyze file content.")
else:
    print("No text file provided. Skipping file-based detection.")

def detect_slang_bert(sentence, slang_dictionary):
    """Detects slang in a sentence using BERT tokenization and a slang dictionary lookup."""
    if not slang_dictionary:
        return "Error: Slang dictionary not loaded or empty."
    if not tokenizer or not model:
        return "Error: BERT model or tokenizer failed to load."

    try:
        # Get inputs for the model
        inputs = tokenizer(
            sentence,
            return_tensors="pt",
            truncation=True,
            padding=True,
            is_split_into_words=False
        )

        # Get word_ids using a separate encoding object
        encoding = tokenizer(
            sentence,
            return_offsets_mapping=True,
            return_tensors="pt",
            padding=True,
            truncation=True,
            is_split_into_words=False
        )
        word_indices = encoding.word_ids()

        # Move tensor inputs to device
        model_inputs = {k: v.to(device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}

        token_ids = inputs["input_ids"][0]
        tokens = tokenizer.convert_ids_to_tokens(token_ids)

        with torch.no_grad():
            outputs = model(**model_inputs)

        # Reconstruct words from subword tokens
        found_slang = {}
        current_word_reconstructed = ""
        last_word_idx = -1

        for i, token in enumerate(tokens):
            if token in tokenizer.all_special_tokens:
                continue

            word_idx = word_indices[i] if i < len(word_indices) else None
            if word_idx is None:
                continue

            if word_idx != last_word_idx and last_word_idx != -1:
                if current_word_reconstructed.lower() in slang_dictionary:
                    if current_word_reconstructed not in found_slang:
                        found_slang[current_word_reconstructed] = slang_dictionary[current_word_reconstructed.lower()]
                current_word_reconstructed = ""

            token_part = token[2:] if token.startswith("##") else token
            current_word_reconstructed += token_part
            last_word_idx = word_idx

        if current_word_reconstructed and current_word_reconstructed.lower() in slang_dictionary:
            if current_word_reconstructed not in found_slang:
                found_slang[current_word_reconstructed] = slang_dictionary[current_word_reconstructed.lower()]

        if not found_slang:
            return "No slang detected in the sentence."
        else:
            result = "Detected Slang:\n"
            for word in sorted(found_slang.keys()):
                result += f"- {word}: {found_slang[word]}\n"
            return result.strip()

    except Exception as e:
        import traceback
        print(traceback.format_exc())
        return f"An error occurred during slang detection: {e}"


## 4. Interactive Slang Detection

Enter a sentence in the text box below and run the cell (Shift+Enter) to detect slang terms based on the loaded dictionary.

In [16]:
# Get user input
sentence_to_check = input("Enter a sentence: ")

# Detect slang
if sentence_to_check:
    # Make sure the dictionary loaded correctly before proceeding
    if slang_dictionary is not None:
        detection_result = detect_slang_bert(sentence_to_check, slang_dictionary)
        print("\n-- Result --")
        print(detection_result)
    else:
        print("Error: Cannot detect slang because the dictionary failed to load.")
else:
    print("Please enter a sentence.")


Enter a sentence:  Beyoncé is the GOAT. No debate.



-- Result --
Detected Slang:
- GOAT: Acronym for "Greatest Of All Time."


## 5. Example Sentences

Here are some examples using predefined sentences to test the detection function.

In [9]:
test_sentences = [
    "That movie was fire, totally slayed.",
    "He's just trolling, don't feed the troll.",
    "This is a normal sentence without any specific slang.",
    "OMG that OOTD is goals!",
    "She unfollowed him after the argument.",  # From slang.csv example
    "I’ve got the receipts to back it up.",     # From slang.csv example
    "That fit is drip.",
    "No cap, that was impressive."
]

# Make sure the dictionary loaded correctly before running examples
if slang_dictionary is not None:
    print("Running example sentences...")
    for sentence in test_sentences:
        print(f"\nSentence: {sentence}")
        result = detect_slang_bert(sentence, slang_dictionary)
        print(result)
        print("---")
else:
    print("Error: Cannot run examples because the slang dictionary failed to load.")


Running example sentences...

Sentence: That movie was fire, totally slayed.
Traceback (most recent call last):
  File "C:\Users\murta\AppData\Local\Temp\ipykernel_18020\3864800046.py", line 40, in detect_slang_bert
    word_indices = inputs.word_ids(batch_index=0) # Get word indices
                   ^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'word_ids'

An error occurred during slang detection: 'dict' object has no attribute 'word_ids'
---

Sentence: He's just trolling, don't feed the troll.
Traceback (most recent call last):
  File "C:\Users\murta\AppData\Local\Temp\ipykernel_18020\3864800046.py", line 40, in detect_slang_bert
    word_indices = inputs.word_ids(batch_index=0) # Get word indices
                   ^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'word_ids'

An error occurred during slang detection: 'dict' object has no attribute 'word_ids'
---

Sentence: This is a normal sentence without any specific slang.
Traceback (most recent ca