# 📌 **Named Entity Recognition (NER) with Large-Scale LLMs & BERT-Based Chunking** 🧠📊  

## 🔎 **Overview**
This Python script **extracts named entities** from large text documents using **DeepSeek-R1 Distill LLaMA-8B** as the LLM and the **BERT (bert-base-uncased)** tokenizer for **token-based text chunking**. The extracted named entities are categorized by type (e.g., **Person, Organization, Location**) and saved to an Excel file (`named_entities.xlsx`).  

### 🔥 **Why is this useful?**  
✅ Enables **scalable named entity extraction** from **long documents**.  
✅ Uses the **BERT tokenizer for chunking**, keeping every chunk within a **fixed token budget**.  
✅ Employs an **LLM (DeepSeek-R1 Distill LLaMA-8B)** for **NER extraction**.  
✅ **Saves structured data** into an Excel file for further processing.

---

## 📌 **Pipeline Overview** 🔄  

### **📝 Step 1: Read Input File**
📂 Reads the input text file containing **large amounts of unstructured text**.  
📌 **File path:** `/kaggle/input/ne-session1/NE.txt`  

### **🧩 Step 2: Chunking the Text using BERT**
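🔹 Long documents are **split into smaller chunks** (max 256 tokens) using the **BERT tokenizer** (`bert-base-uncased`).  
🔹 **Why BERT?** Counting WordPiece tokens keeps every chunk within a **fixed token budget**. Note that the chunks are fixed-size token windows, so a chunk can end **mid-sentence**, and decoding with an uncased tokenizer **lowercases** the text.  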
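
One way to keep chunk boundaries on sentence boundaries is to pack whole sentences into a chunk until the token budget is reached. The block below is a minimal sketch of that idea, not part of the script: `split_sentences` and `chunk_by_sentences` are hypothetical helpers, the sentence splitter is a naive regex, and the default token counter is a whitespace word count so the example runs without downloading a model.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentences(
    text: str,
    max_tokens: int = 256,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens.

    count_tokens is an assumption: it defaults to a whitespace word count.
    With the BERT tokenizer one could pass
    lambda s: len(bert_tokenizer.tokenize(s)) instead.
    A single sentence longer than max_tokens becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Flush the current chunk if this sentence would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Barack Obama spoke in New York. "
    "Tim Cook presented the iPhone 6. "
    "Google is based in Mountain View."
)
for chunk in chunk_by_sentences(sample, max_tokens=12):
    print(repr(chunk))
```

Because the chunks are slices of the original sentences rather than decoded token IDs, the original casing and punctuation are preserved, which tends to matter for entity recognition.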

### **🛠 Step 3: Named Entity Recognition (NER) using LLM**
🔹 Each **text chunk** is fed into the **DeepSeek-R1 Distill LLaMA-8B** LLM.  
📌 **Model path:** `/kaggle/input/deepseek-r1/transformers/deepseek-r1-distill-llama-8b/2`  
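🔹 The LLM **extracts named entities** along with their types (e.g., **Person, Organization, Location**).  
🔹 The model is prompted to return each entity as `(Entity: ENTITY_NAME, Type: ENTITY_TYPE)`; a regex then parses the response into **(entity, type) pairs** like these: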
```python
[('Barack Obama', 'PERSON'), ('Google', 'ORGANIZATION'), ('New York', 'LOCATION')]
```

```python
import os
import re
import torch
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, BertTokenizerFast

# Set environment variables
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# 📌 Specify Model Names
llm_model_name = "/kaggle/input/deepseek-r1/transformers/deepseek-r1-distill-llama-8b/2"
chunking_model_name = "bert-base-uncased"  # ✅ Use BERT for chunking
cache_dir = "/kaggle/temp"  # ✅ Use Kaggle's temp storage
input_file_path = "/kaggle/input/ne-session1/NE.txt"  # ✅ Input file path
output_file_path = "/kaggle/working/named_entities.xlsx"  # ✅ Output file path

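# 🔥 Generation Parameters
TEMPERATURE = 0.3  # Sampling temperature (lower = more deterministic)
TOP_P = 0.95  # Nucleus sampling threshold
MAX_LENGTH = 512  # Max prompt length in tokens; longer prompts are truncated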

# ✅ Load Tokenizers & Models
bert_tokenizer = BertTokenizerFast.from_pretrained(chunking_model_name)
llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_name, cache_dir=cache_dir, trust_remote_code=True)
llm_model = AutoModelForCausalLM.from_pretrained(
    llm_model_name, cache_dir=cache_dir, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

# 🔹 Read Input File
def read_file(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

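# 🔹 Chunk Text Using BERT Tokenizer
# Note: chunks are fixed-size token windows, so a chunk can end mid-sentence.
# Decoding with bert-base-uncased also lowercases the text and respaces punctuation.
def chunk_text_bert(text, max_tokens=256):
    tokens = bert_tokenizer.encode(text, add_special_tokens=False)
    return [
        bert_tokenizer.decode(tokens[i : i + max_tokens], skip_special_tokens=True)
        for i in range(0, len(tokens), max_tokens)
    ]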

# 🔹 Generate Named Entities from LLM
def generate_named_entity_response(chunk):
    prompt = f"""
    Extract named entities and their types from the following text:
    
    {chunk}
    
    Return results as a structured list of entities. Each entity should be in the format:
    (Entity: ENTITY_NAME, Type: ENTITY_TYPE)
    
    Only return the list, do not include explanations.
    """

    inputs = llm_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_LENGTH).to("cuda")

    with torch.no_grad():
        outputs = llm_model.generate(
            **inputs,
            max_new_tokens=700,
            do_sample=True,
            temperature=TEMPERATURE,
            top_p=TOP_P,
            repetition_penalty=1.1,
            pad_token_id=llm_tokenizer.eos_token_id,  # Set explicitly to avoid the pad_token_id warning
        )

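    # Full sequence (prompt + completion), kept for the debug print below
    response = llm_tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    # Completion only: the echoed prompt contains the template
    # "(Entity: ENTITY_NAME, Type: ENTITY_TYPE)", which the regex would
    # otherwise match as a bogus entity in every chunk
    prompt_length = inputs["input_ids"].shape[1]
    completion = llm_tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    # 🔹 Print raw response for debugging
    print("&&&&&&&&&&&&&&&&&")
    print("\n🔹 Raw Model Response:\n", response)
    print("&&&&&&&&&&&&&&&&&")

    # ✅ Extract entity-type pairs using regex
    pattern = r"\(Entity:\s*(.*?),\s*Type:\s*(.*?)\)"
    entities = re.findall(pattern, completion)
    # ✅ Convert (entity, type) tuples into dictionaries
    processed_entities = [{"Entity": entity.strip(), "Type": entity_type.strip()} for entity, entity_type in entities]

    # Fall back to a placeholder row when nothing was extracted
    return processed_entities if processed_entities else [{"Entity": "N/A", "Type": "N/A"}]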

# 🔹 Extract Named Entities from Chunks
def extract_named_entities(chunks):
    all_entities = []
    
    for chunk in chunks:
        named_entities = generate_named_entity_response(chunk)

        for entity_data in named_entities:  # ✅ Process each entity in the list
            all_entities.append({"Chunk": chunk, "Entity": entity_data["Entity"], "Type": entity_data["Type"]})

        # ✅ Debugging Output
        print("\n===========================")
        print(f"📌 **Chunk**: {chunk}")
        print(f"🔍 **Extracted Entities**: {named_entities}")
        print("===========================")

    return all_entities

# ✅ Run the Process
text_data = read_file(input_file_path)
text_chunks = chunk_text_bert(text_data, max_tokens=256)
extracted_data = extract_named_entities(text_chunks)

# ✅ Convert to Pandas DataFrame & Save
df = pd.DataFrame(extracted_data)
print("\n📊 **Final DataFrame Preview:**")
print(df)  
df.to_excel(output_file_path, index=False)

print(f"\n✅ Named entity extraction complete! Output saved to: {output_file_path}")
```

```
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

&&&&&&&&&&&&&&&&&

🔹 Raw Model Response:
 Extract named entities and their types from the following text:
    
    in a historic moment, president barack obama delivered a speech at the united nations general assembly in new york on september 24, 2014. during his address, he emphasized the importance of global cooperation in tackling climate change, economic disparities, and political instability. the united nations secretary - general ban ki - moon praised the united states for its commitment to international diplomacy. meanwhile, in silicon valley, apple inc. unveiled its latest product, the iphone 6, at the steve jobs theater in cupertino, california. ceo tim cook stated that the new device would revolutionize mobile technology. google, headquartered in mountain view, california, also announced the development of an advanced ai system, gemini, to enhance search capabilities. across the atlantic, german chancellor angela merkel met with french president emmanuel macron in paris to di