# NLP Pipeline Project: Hybrid Entity Recognizer for Product Codes

## Project Overview

### Goal
The primary objective of this project is to extract and normalize "product codes" (alphanumeric patterns like "AB-1234X") from free-form user reviews and classify each review's overall sentiment.

### Steps

1. **Load and Clean Raw Text**:
   - Use regular expressions (regex) to clean and preprocess the raw text data.

2. **Tokenization and POS-Tagging**:
   - Utilize spaCy to tokenize the text and perform part-of-speech (POS) tagging.
   - Develop a custom component within spaCy to recognize product codes based on predefined patterns.

3. **Transformer Input Preparation**:
   - Convert the cleaned text into Hugging Face transformer inputs with `AutoTokenizer`.

4. **Optional Fine-Tuning**:
   - Fine-tune a sentiment classifier on top of the transformer embeddings to improve sentiment analysis accuracy.



### Install dependencies

In [1]:
!pip install spacy transformers datasets

In [2]:
!python -m spacy download en_core_web_sm

In [9]:
#--------------------------------------------------------------
# Import AutoTokenizer from the Hugging Face Transformers library
#---------------------------------------------------------------
from transformers import AutoTokenizer

# -------------------------------------------------------------------------------------------------------
# Examples with raw text strings: 2 contain product-like codes (e.g. “AB-1234X”), and a normal review without any code
# ---------------------------------------------------------------------------------------------------------
examples = [
    "I bought AB-1234X last week and it works great!",
    "Received product XY-9876 today — totally disappointed.",
    "No code here, just a normal review about quality and price."
]

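# -----------------------------------------------------------------------------
# Regex-based cleaner that matches codes like “AB-1234” or “XY-9876Z”
# -----------------------------------------------------------------------------
import re

PRODUCT_CODE_RE = re.compile(r"\b([A-Z]{2}-\d{4}[A-Z]?)\b")

def clean_and_tag_codes(text: str):
    # Collect the codes from the original (case-preserved) text.
    codes = PRODUCT_CODE_RE.findall(text)
    # NOTE: the pattern only matches uppercase letters, so substituting on
    # text.lower() never matches. The codes therefore stay in the cleaned
    # text, which is what the token-level output further down shows.
    cleaned = PRODUCT_CODE_RE.sub("<PRODCODE>", text.lower())
    return cleaned, codes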

# -----------------------------------------------------------------------------
# Build cleaned texts: apply clean_and_tag_codes to each string in examples; only keep the cleaned text ([0]) of tuple
# -----------------------------------------------------------------------------

clean_texts = [clean_and_tag_codes(t)[0] for t in examples]

# -----------------------------------------------------------------------------
# Instantiate tokenizer and encode in one shot: load a pretrained BERT tokenizer.
#----------------------------------------------------------------------

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encodings = tokenizer(
    clean_texts,
    padding="longest",
    truncation=True,
    max_length=64,
    return_tensors="pt"
)

# -----------------------------------------------------------------------------
# Print the raw tensors - the raw numeric representations that BERT will consume
#--------------------------------------------------

print("Input IDs:\n", encodings.input_ids)
print("Attention Mask:\n", encodings.attention_mask)


Input IDs:
 tensor([[  101,  1045,  4149, 11113,  1011, 13138,  2549,  2595,  2197,  2733,
          1998,  2009,  2573,  2307,   999,   102],
        [  101,  2363,  4031,  1060,  2100,  1011,  5818,  2581,  2575,  2651,
          1517,  6135,  9364,  1012,   102,     0],
        [  101,  2053,  3642,  2182,  1010,  2074,  1037,  3671,  3319,  2055,
          3737,  1998,  3976,  1012,   102,     0]])
Attention Mask:
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])



## Interpreting `input_ids` & `attention_mask`

### `input_ids`

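- **Shape**: `(3, 16)` → 3 examples, each padded to the longest sequence in the batch (16 tokens). `max_length=64` only truncates inputs longer than 64 tokens, so nothing is truncated here.
- Each number is an index in BERT’s vocabulary:
  - **101** = `[CLS]` (classification token placed at the start of every sequence)
  - **102** = `[SEP]` (separator token marking the end of the sequence)
  - **0** at the end of rows 2 & 3 = padding token (`[PAD]`).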

### `attention_mask`

- **Shape**: `(3, 16)`.
- A `1` means “attend to this token”; `0` means “ignore (padding)”.
- Notice that only the first example has 16 real tokens; the other two have a trailing `0`.
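
To make the relationship between padding and the mask concrete, here is a minimal pure-Python sketch of what `padding="longest"` does conceptually. The helper name `pad_batch` and the shortened toy ID lists are illustrative assumptions; the real tokenizer does this internally and returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; padding positions get pad_id.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Toy IDs (hypothetical, shortened): 101 = [CLS], 102 = [SEP]
batch = [[101, 1045, 4149, 102], [101, 2053, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The shorter row receives one trailing `0` in both lists, mirroring rows 2 and 3 of the output above.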


In [11]:
# Map IDs back to tokens for inspection
print("\n=== Token-level view ===")
for i, ids in enumerate(encodings.input_ids):
    tokens = tokenizer.convert_ids_to_tokens(ids)
    print(f"Example {i} tokens ({len(tokens)}):\n", tokens, "\n")


=== Token-level view ===
Example 0 tokens (16):
 ['[CLS]', 'i', 'bought', 'ab', '-', '123', '##4', '##x', 'last', 'week', 'and', 'it', 'works', 'great', '!', '[SEP]'] 

Example 1 tokens (16):
 ['[CLS]', 'received', 'product', 'x', '##y', '-', '98', '##7', '##6', 'today', '—', 'totally', 'disappointed', '.', '[SEP]', '[PAD]'] 

Example 2 tokens (16):
 ['[CLS]', 'no', 'code', 'here', ',', 'just', 'a', 'normal', 'review', 'about', 'quality', 'and', 'price', '.', '[SEP]', '[PAD]'] 



## Regex Code Explanation

The `clean_and_tag_codes` function is a small preprocessing step intended to help your model focus on the **pattern** of “there is a product mention here” rather than on the **idiosyncratic** details of each alphanumeric code.

In many real-world NLP pipelines, you’ll often want to **detect**, **extract**, and then **mask** or **tag** certain structured pieces of text—here, “product codes”—for a few key reasons:

1. **Entity Extraction & Downstream Use**
   By running:
   ```python
   codes = PRODUCT_CODE_RE.findall(text)
   ```
   you pull out every substring that looks like a product code (e.g., “AB-1234X”) into a list. You can then store or analyze these codes separately—for instance:
   - Building a lookup table of which exact products occurred in your data.
   - Aggregating statistics per product (average rating, frequency of mention).
   - Linking back to inventory or database records.

2. **Masking to Improve Model Generalization**
   Raw product codes are essentially arbitrary alphanumeric strings. If you feed them “as-is” into a model (e.g., BERT), the tokenizer breaks “AB-1234X”, “XY-9876”, etc. into different sequences of subword pieces. That can:
   - Use up several sequence positions per code and produce rare subword combinations that carry little meaning.
   - Cause the model to “memorize” individual codes instead of learning more general patterns (e.g., “sentiment around a product code”).

   By replacing every code with a single placeholder string (`<PRODCODE>`), you collapse all these variants into one. Note that the placeholder is only a single token if you register it with the tokenizer; otherwise WordPiece splits it like any other string. The model can then learn:
   - “Anytime you see `<PRODCODE>`, treat it like a product mention.”
   - Sentiment or other linguistic cues around codes, without overfitting to the particular alphanumeric patterns.

3. **Privacy & Anonymization**
   If your product codes are sensitive—say they encode personal data or internal SKUs—you may need to ensure they never leak into model outputs or logs. Masking them with `<PRODCODE>` ensures you have a consistent placeholder without exposing the real codes.

4. **Regex Breakdown**
   ```python
   PRODUCT_CODE_RE = re.compile(r"\b([A-Z]{2}-\d{4}[A-Z]?)\b")
   ```
   - `\b`: Word boundary, so we don’t match inside larger words.
   - `[A-Z]{2}`: Exactly two uppercase letters.
   - `-`: A hyphen.
   - `\d{4}`: Exactly four digits.
   - `[A-Z]?`: Optionally one more uppercase letter.
   - Because the single capturing group wraps the whole pattern, `.findall()` returns each matched code as a plain string (with several groups it would return tuples instead).

   Examples matched:
   - “AB-1234”
   - “XY-9876Z”

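5. **Lower-casing vs. Placeholder Case**
   ```python
   cleaned = PRODUCT_CODE_RE.sub("<PRODCODE>", text.lower())
   ```
   - We lowercase the full text so that the input is consistent with downstream uncased models (e.g., “bert-base-uncased”).
   - The intent is to replace codes with the literal string `"<PRODCODE>"` so they stand out as one marker. As written, however, the substitution runs on `text.lower()`, and the pattern only matches uppercase letters, so no code is ever replaced. This is why the token-level view above still shows `ab`, `-`, `123`, `##4`, `##x`.
   - To get the intended behavior, substitute on the original text first and lowercase afterwards (which also lowercases the placeholder unless you protect it), or compile the pattern with `re.IGNORECASE`.

In summary, `codes` correctly collects every substring that looks like AB-1234X (or XY-9876, etc.), but `cleaned` only contains `<PRODCODE>` in place of those substrings once the substitution is applied before lowercasing.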
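
One way to get the masking behavior described in this section is to substitute on the original text and lowercase afterwards, since lowercasing first hides the codes from the uppercase-only pattern. The following is a minimal sketch under that assumption; the names `mask_then_lowercase` and `PLACEHOLDER` are illustrative and not part of the original notebook.

```python
import re

PRODUCT_CODE_RE = re.compile(r"\b[A-Z]{2}-\d{4}[A-Z]?\b")
PLACEHOLDER = "<PRODCODE>"  # hypothetical constant for this sketch

def mask_then_lowercase(text: str):
    """Mask codes on the original text, then lowercase everything else."""
    codes = PRODUCT_CODE_RE.findall(text)
    masked = PRODUCT_CODE_RE.sub(PLACEHOLDER, text)
    # Lowercase the surrounding text but keep the placeholder's casing.
    cleaned = PLACEHOLDER.join(part.lower() for part in masked.split(PLACEHOLDER))
    return cleaned, codes

text = "I bought AB-1234X last week and it works great!"
# Lowercasing first leaves the text unmasked:
unmasked = PRODUCT_CODE_RE.sub(PLACEHOLDER, text.lower())
cleaned, codes = mask_then_lowercase(text)
print(unmasked)
print(cleaned, codes)
```

Keep in mind that an uncased tokenizer will still lowercase and split the placeholder unless it is registered as a special token, so this sketch only fixes the string-level masking.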


## How to use in a model

This section demonstrates how to pass the encoded text to `AutoModelForTokenClassification` from Hugging Face's Transformers library. The encoder weights are pre-trained, but the token-classification head is newly initialized, so the model must be fine-tuned before it can recognize product codes.

In [10]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # e.g. B-PRODCODE, I-PRODCODE, O
)
outputs = model(
    input_ids=encodings.input_ids,
    attention_mask=encodings.attention_mask
)
# outputs.logits.shape == (3, 16, 3)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



### Explanation

1. **Model Loading**: `AutoModelForTokenClassification` loads the pre-trained `bert-base-uncased` encoder and adds a classification head with three outputs, intended here for `B-PRODCODE` (beginning of a product code), `I-PRODCODE` (inside a product code), and `O` (outside a product code). As the warning above states, the head's weights are newly initialized.

2. **Model Inference**: The model takes the `input_ids` and `attention_mask` tensors as inputs. These tensors are the tokenized and padded representations of the text.

3. **Output Logits**: The model outputs logits, which are raw prediction scores for each token in the input sequence. The shape of the logits tensor is `(batch_size, sequence_length, num_labels)`, where:
   - `batch_size` is the number of examples in the batch (3 in this case).
   - `sequence_length` is the length of each tokenized sequence (16 in this case).
   - `num_labels` is the number of labels the classification head scores (3 in this case).

Until the model is fine-tuned on labeled data, these logits are essentially random. After fine-tuning, the same setup lets the model assign a label to each token and thereby locate product codes.
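
To turn logits into labels, the usual first step is an argmax over the last dimension. The sketch below shows the idea on plain nested lists so it runs without a model; the `ID2LABEL` index order and the toy scores are assumptions for illustration only. With real tensors, `outputs.logits.argmax(dim=-1)` performs the same selection.

```python
# Hypothetical label mapping; the index order is an assumption.
ID2LABEL = {0: "O", 1: "B-PRODCODE", 2: "I-PRODCODE"}

def decode_logits(logits, attention_mask):
    """Pick the highest-scoring label per token, skipping padding positions."""
    batch_labels = []
    for token_scores, mask_row in zip(logits, attention_mask):
        labels = []
        for scores, keep in zip(token_scores, mask_row):
            if not keep:
                continue  # padding position, nothing to decode
            best = max(range(len(scores)), key=scores.__getitem__)
            labels.append(ID2LABEL[best])
        batch_labels.append(labels)
    return batch_labels

# One toy example with four positions; the last one is padding.
toy_logits = [[[2.0, 0.1, -1.0], [0.3, 1.9, 0.2], [0.0, 0.4, 1.5], [0.5, 0.5, 0.4]]]
toy_mask = [[1, 1, 1, 0]]
decoded = decode_logits(toy_logits, toy_mask)
print(decoded)
```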





### Next step: Sentiment analysis

To extend the pipeline, you can add sentiment analysis by fine-tuning a classifier on labeled reviews. spaCy complements this with tokenization, part-of-speech tagging, dependency parsing, and named entity recognition; its stock English pipelines do not include a sentiment component.

The cell below downloads the spaCy model `en_core_web_sm` and loads a sample of labeled reviews. `en_core_web_sm` is a “small” English model (≈ 13 MB download) containing:

- Tokenizer rules for English
- Part-of-Speech (POS) tagger
- Dependency parser
- Named-Entity Recognizer (NER)



In [12]:
# Install necessary libraries
!pip install spacy transformers datasets

# Download the spaCy model
!python -m spacy download en_core_web_sm

# data_loader.py
from datasets import load_dataset

def load_reviews():
    # e.g. HuggingFace “amazon_polarity” for sentiment
    ds = load_dataset("amazon_polarity", split="train[:1%]")
    # take only text for demo
    return [ex["content"] for ex in ds]

# Load reviews
reviews = load_reviews()
print(reviews[:2])  # Print the first two reviews to verify


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 65.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


['This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^', "I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."]



## Explanation of the Code

**Loading the Dataset**

```python
ds = load_dataset("amazon_polarity", split="train[:1%]")
```

- **Purpose**: Loads the **Amazon Polarity** dataset, which is a two-class sentiment dataset (positive vs. negative).
- **Details**:
  - `"train[:1%]"` means “take the first 1% of the training split” (about 36,000 reviews). This is for a quick demo; the full training set is ~3.6 million examples. The slice limits the rows you get back; the library may still download the full dataset files first.
  - `ds` is a `Dataset` object that behaves like a list of examples, where each example is a dictionary, e.g., `{ "label": 1, "content": "I love this product ..." }`.

**Extracting Review Texts**

```python
return [ex["content"] for ex in ds]
```

- **Purpose**: Builds a plain Python list of strings, extracting only the `"content"` field (the review text).
- **Benefit**: This keeps things simple if you only need raw text.

**Loading and Printing Reviews**

```python
reviews = load_reviews()
print(reviews[:2])
```

- **Purpose**: Calls the loader function to get a list of review strings and prints the first two entries.
- **Benefit**: This allows you to sanity-check that you have successfully loaded the text data.



This setup ensures that you have a list of raw review texts ready for further processing, tokenization, or model training.
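
If you later fine-tune a sentiment classifier, you will need the labels alongside the text. The sketch below shows one way to keep both; it uses a plain list of dictionaries as a stand-in for the dataset rows, and the function name `load_text_and_labels` and the sample rows are hypothetical.

```python
def load_text_and_labels(rows):
    """Return parallel lists of review texts and sentiment labels."""
    texts = [row["content"] for row in rows]
    labels = [row["label"] for row in rows]
    return texts, labels

# Stand-in rows (hypothetical); in amazon_polarity, 1 = positive and 0 = negative.
rows = [
    {"label": 1, "title": "Great", "content": "I love this product."},
    {"label": 0, "title": "Bad", "content": "Totally disappointed."},
]
texts, labels = load_text_and_labels(rows)
print(texts, labels)
```

The same function should work on the `ds` object directly, since iterating over it yields dictionaries with these keys.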




### Why Replace First, Then Tag

1. **Normalization**: Lowercasing and other preprocessing steps yield more consistent input for your downstream model (e.g., BERT).
2. **Privacy/Anonymization**: You strip real codes out of the text before training or logging, which reduces the risk of exposing sensitive identifiers.
3. **Entity Marking**: By turning each code into `<PRODCODE>`, you give spaCy (and later BERT) an easy hook to spot product mentions, which makes it straightforward to generate labels for them.
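
The “tag” half of this idea can be sketched as word-level silver labels derived from the regex. This is a simplified illustration that splits on whitespace; the function name `silver_bio_tags` is an assumption, and because codes contain no spaces, only `B-PRODCODE` and `O` appear at the word level (`I-PRODCODE` becomes relevant once words are split into subwords).

```python
import re

PRODUCT_CODE_RE = re.compile(r"\b[A-Z]{2}-\d{4}[A-Z]?\b")

def silver_bio_tags(text):
    """Tag each whitespace-separated word: B-PRODCODE if it contains a code, else O."""
    words = text.split()
    # search() rather than fullmatch() so trailing punctuation ("AB-1234X,") still counts.
    tags = ["B-PRODCODE" if PRODUCT_CODE_RE.search(word) else "O" for word in words]
    return words, tags

words, tags = silver_bio_tags("Received product XY-9876 today")
print(list(zip(words, tags)))
```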

### Transformer Tokenization Demo

You then fed the **cleaned** sentences into BERT’s tokenizer:

- **input_ids**: A tensor of shape `(batch_size, seq_len)` where each integer indexes into BERT’s vocabulary.
- **attention_mask**: Same shape, with `1` for real tokens and `0` for padding.

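The token-level view also showed how BERT’s WordPiece tokenizer splits a product code into subwords (`ab`, `-`, `123`, `##4`, `##x`). A literal `<PRODCODE>` string would be split in a similar way unless it is added to the tokenizer as a special token.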
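
To build intuition for why a code breaks into several pieces, here is a toy version of greedy longest-match-first subword splitting. It is a simplified sketch, not BERT's actual implementation: `TOY_VOCAB` is a made-up vocabulary, and the punctuation pre-split is approximated with a regex.

```python
import re

# Made-up vocabulary for illustration; BERT's real vocabulary has ~30k entries.
TOY_VOCAB = {"i", "bought", "ab", "-", "123", "##4", "##x"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(text, vocab=TOY_VOCAB):
    # Lowercase, then separate punctuation from runs of word characters.
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    return [piece for word in words for piece in wordpiece(word, vocab)]

tokens = toy_tokenize("I bought AB-1234X")
print(tokens)
```

With this toy vocabulary the code yields the same five pieces seen in the token-level view, because no single vocabulary entry covers `1234x`.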

### Key Takeaways

1. **Subword Splitting**
   - Product codes and custom placeholders like `<PRODCODE>` are split into subwords by BERT unless they are added to the tokenizer's vocabulary.
   - You need a strategy to align your single-entity label to multiple subwords.

2. **Model Inputs**
   - To fine-tune BERT for token classification (NER), you pass `input_ids` and `attention_mask`, together with aligned `labels` during training, into `AutoModelForTokenClassification`.
   - The model outputs logits of shape `(batch_size, seq_len, num_labels)`.
   - You then decode these logits using `argmax` or a Conditional Random Field (CRF) to get predicted labels for each subword.
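
One common alignment strategy is to label only the first subword of each word and mark everything else to be ignored by the loss. The sketch below assumes you already have a `word_ids` list of the kind a fast tokenizer's `word_ids()` returns (with `None` for special tokens); the toy inputs and the `LABEL2ID` mapping are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
LABEL2ID = {"O": 0, "B-PRODCODE": 1, "I-PRODCODE": 2}  # assumed mapping

def align_labels(word_labels, word_ids, label2id=LABEL2ID):
    """Label the first subword of each word; ignore special tokens and continuations."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # [CLS], [SEP], [PAD]
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)  # later pieces of the same word
        previous = word_id
    return aligned

# Toy input: words "bought", "AB-1234X", "today"; the code spans five subwords.
word_labels = ["O", "B-PRODCODE", "O"]
word_ids = [None, 0, 1, 1, 1, 1, 1, 2, None]
aligned = align_labels(word_labels, word_ids)
print(aligned)
```

An alternative is to give continuation pieces the `I-PRODCODE` label instead of ignoring them; which works better is an empirical question for your data.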




## Next Steps

1. **Generate BIO Labels**:
   - From your `<PRODCODE>` spans, create a sequence of `B-PRODCODE`, `I-PRODCODE`, or `O` tags aligned to the tokenized subwords.

2. **Fine-Tune**:
   - Use the `Trainer` API or a custom training loop to fine-tune `AutoModelForTokenClassification` on your silver-labeled data.

3. **Real-Time Processing**:
   - Develop a real-time processing pipeline to extract product codes from live user reviews.
   - Implement a dashboard to visualize the extracted product codes and associated sentiments.

4. **Privacy and Security**:
   - Enhance privacy measures to ensure that sensitive product codes are not exposed.
   - Implement data anonymization techniques to protect user information.

5. **Evaluation and Metrics**:
   - Conduct thorough evaluations using precision, recall, and F1-score metrics.
   - Perform A/B testing to compare the performance of different models and preprocessing techniques.
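
For the evaluation step, precision, recall, and F1 can be computed directly on sets of extracted codes. The sketch below treats each `(review_index, code)` pair as one item; the function name and the toy gold and predicted sets are assumptions for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over sets of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations: (review index, product code)
gold = {(0, "AB-1234X"), (1, "XY-9876"), (2, "CD-5555")}
pred = {(0, "AB-1234X"), (1, "XY-9876"), (1, "ZZ-0000")}
p, r, f1 = precision_recall_f1(pred, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here two of the three predictions are correct and two of the three gold codes are found, so all three metrics come out at two thirds.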

## Final Thoughts

This pipeline demonstrates how regular expressions, spaCy, and Hugging Face Transformers can be combined into an NLP system for product code extraction. Fine-tuning BERT for token classification can help the model generalize to new, unseen codes and contexts, but that should be confirmed with a held-out evaluation.
