# Train a Custom Resume Parser with spaCy (v3)

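This notebook demonstrates how to train a custom Named Entity Recognition (NER) model to extract **SKILL** and **ROLE** entities from resumes. The sample data below contains no date annotations, so a **DATE** label would need its own annotated examples.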

### Steps:
1. Setup & Install
2. Creating Training Data
3. Converting Data to spaCy Format
4. Creating the Training Config
5. Training the Model
6. Testing the Model
7. Downloading the Result

In [None]:
# 1. Install spaCy and download the base model
!pip install -U spacy
!python -m spacy download en_core_web_lg

In [None]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import random

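# Load the base pipeline. Below it is used only for its tokenizer (nlp.make_doc);
# the config generated in step 4 trains a fresh NER component, so none of the
# pretrained weights are reused.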
nlp = spacy.load("en_core_web_lg")

In [None]:
# 2. Define Synthetic Training Data (Text, Annotations)
# Each annotation is (start_char, end_char, label): character offsets into the text, with end_char exclusive.

TRAIN_DATA = [
    ("Experienced Python Developer with expertise in Django and Flask.", {"entities": [(12, 18, "SKILL"), (47, 53, "SKILL"), (58, 63, "SKILL")]}),
    ("I have 5 years of experience in React and Node.js.", {"entities": [(32, 37, "SKILL"), (42, 49, "SKILL")]}),
    ("Skilled in AWS, Docker, and Kubernetes for DevOps.", {"entities": [(11, 14, "SKILL"), (16, 22, "SKILL"), (28, 38, "SKILL"), (43, 49, "ROLE")]}),
    ("Seeking a Data Scientist role. Proficient in PyTorch and Pandas.", {"entities": [(10, 24, "ROLE"), (45, 52, "SKILL"), (57, 63, "SKILL")]}),
    ("Developed REST APIs using FastAPI and deployed on Azure.", {"entities": [(10, 19, "SKILL"), (26, 33, "SKILL"), (50, 55, "SKILL")]}),
    # "Java" (7, 11) lies inside the ROLE span; doc.ents rejects overlapping spans, so only the longer span is kept.
    ("Senior Java Engineer with Spring Boot experience.", {"entities": [(0, 20, "ROLE"), (26, 37, "SKILL")]}),
    ("Proficient in C++, SQL, and NoSQL databases.", {"entities": [(14, 17, "SKILL"), (19, 22, "SKILL"), (28, 33, "SKILL")]}),
    ("Looking for AI Engineer position. Strong math background.", {"entities": [(12, 23, "ROLE")]})
]

# TIP: A real model needs hundreds of annotated examples, typically produced with an annotation tool such as Label Studio.
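
Hand-counted character offsets are easy to get wrong, and a wrong offset silently labels the wrong token. The sketch below derives the offsets from the entity text instead. `annotate` is a hypothetical helper, not part of spaCy, and it assumes the phrases are listed in the order they appear in the sentence.

```python
def annotate(text, phrases):
    """Build one (text, {"entities": [...]}) example from (phrase, label) pairs.

    Hypothetical helper: each phrase is located with str.find, searching
    from the end of the previous match, so phrases must be given in the
    order they occur in the text.
    """
    entities = []
    cursor = 0
    for phrase, label in phrases:
        start = text.find(phrase, cursor)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        end = start + len(phrase)  # end offset is exclusive
        entities.append((start, end, label))
        cursor = end
    return text, {"entities": entities}


example = annotate(
    "I have 5 years of experience in React and Node.js.",
    [("React", "SKILL"), ("Node.js", "SKILL")],
)
print(example[1]["entities"])  # [(32, 37, 'SKILL'), (42, 49, 'SKILL')]
```

Because `str.find` matches substrings, a phrase such as "Java" would also match inside "JavaScript", so the result is still worth a quick visual check.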

In [None]:
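# 3. Convert to spaCy Binary Format (.spacy)
from spacy.util import filter_spans

db = DocBin()

for text, annot in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        # "contract" snaps misaligned offsets inward to token boundaries;
        # char_span returns None if no whole token fits inside the range.
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print(f"Skipping entity {text[start:end]!r} ({label}) in: {text!r}")
        else:
            ents.append(span)
    # doc.ents cannot hold overlapping spans. filter_spans keeps the longest span
    # of each overlapping group, so the doc is not silently dropped.
    doc.ents = filter_spans(ents)
    db.add(doc)

db.to_disk("./train.spacy")
print("Saved train.spacy")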
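
spaCy does not accept overlapping spans in `doc.ents`; assigning them raises a `ValueError`. A quick pre-check on the raw annotations shows which examples are affected before conversion. This is a minimal sketch in plain Python; `find_overlaps` is a hypothetical helper, and the sample sentence is only an illustration of a skill nested inside a role.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that share at least one character."""
    ordered = sorted(entities)  # sort by start offset
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so no later span can overlap `first`
            overlaps.append((first, second))
    return overlaps


sample_text = "Senior Java Engineer with Spring Boot experience."
sample_entities = [(7, 11, "SKILL"), (26, 37, "SKILL"), (0, 20, "ROLE")]

for first, second in find_overlaps(sample_entities):
    print(
        f"{sample_text[first[0]:first[1]]!r} ({first[2]}) overlaps "
        f"{sample_text[second[0]:second[1]]!r} ({second[2]})"
    )
# 'Senior Java Engineer' (ROLE) overlaps 'Java' (SKILL)
```

Nested labels like this need a decision at annotation time: keep one of the two spans, or store the nested ones outside `doc.ents`.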

In [None]:
# 4. Create the Training Configuration

# `spacy init fill-config base_config.cfg config.cfg` needs an existing base_config.cfg
# (for example one exported from the spaCy quickstart widget), which this notebook does not have.
# For simplicity in Colab, generate a complete config directly instead:
!python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency --force

In [None]:
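# 5. Run Training
# NOTE: --paths.dev points at the training file, so the scores printed during training
# are measured on data the model has already seen. Use a held-out dev.spacy for a real model.
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy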
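
The training command receives the same file for `--paths.train` and `--paths.dev`. With more data, a held-out split gives more honest scores. The sketch below shows one way to split a list of examples reproducibly; `split_train_dev`, the 25% fraction, and the seed are illustrative assumptions, not values from this notebook.

```python
import random


def split_train_dev(examples, dev_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and split off a held-out dev portion."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder items standing in for (text, annotations) pairs.
examples = [f"example {i}" for i in range(8)]
train_examples, dev_examples = split_train_dev(examples)
print(len(train_examples), len(dev_examples))  # 6 2
```

Each part would then be written to its own `DocBin` (for example `train.spacy` and `dev.spacy`) and passed to `--paths.train` and `--paths.dev` respectively.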

In [None]:
# 6. Test the Model

# Load the best model from training
nlp_best = spacy.load("output/model-best")

test_text = "I am looking for a Machine Learning Engineer job. I know Python and TensorFlow."
doc = nlp_best(test_text)

print("Entities found:")
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
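
Printing entities for one sentence is a spot check. To put a number on it, predicted spans can be compared with hand-labelled ones by exact match. The sketch below is plain Python; `span_prf` is a hypothetical helper, and both the gold and the predicted spans are made-up illustrations for the test sentence above, not real model output.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold labels for the test sentence.
gold = [(19, 44, "ROLE"), (57, 63, "SKILL"), (68, 78, "SKILL")]
# Hypothetical predictions: both skills found, role boundary wrong.
predicted = [(57, 63, "SKILL"), (68, 78, "SKILL"), (36, 44, "ROLE")]

precision, recall, f1 = span_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.67 R=0.67 F1=0.67
```

In the notebook, the predicted list would come from `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.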

In [None]:
# 7. Download the Trained Model

import shutil
from google.colab import files

# Zip the best model folder
shutil.make_archive("resume_model", 'zip', "output/model-best")

# Download the zip file
files.download("resume_model.zip")
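
`shutil.make_archive` with a root directory stores that directory's contents at the top level of the zip, without a wrapping `model-best` folder. The sketch below demonstrates the round trip on a stand-in directory; the file names inside it are placeholders, and `roundtrip_archive` is a hypothetical helper.

```python
import shutil
import tempfile
from pathlib import Path


def roundtrip_archive(model_dir, work_dir):
    """Zip model_dir, unpack it next to the archive, and return the unpacked path."""
    archive = shutil.make_archive(str(work_dir / "resume_model"), "zip", root_dir=model_dir)
    target = work_dir / "resume_model"
    shutil.unpack_archive(archive, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    # Stand-in for output/model-best with placeholder files.
    fake_model = tmp_path / "model-best"
    (fake_model / "ner").mkdir(parents=True)
    (fake_model / "ner" / "model").write_bytes(b"")
    (fake_model / "meta.json").write_text("{}")
    (fake_model / "config.cfg").write_text("")

    restored = roundtrip_archive(fake_model, tmp_path)
    restored_files = sorted(p.name for p in restored.iterdir())
    print(restored_files)  # ['config.cfg', 'meta.json', 'ner']
```

After unpacking the real archive on another machine, the resulting folder is the path to pass to `spacy.load`.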