# Download and install the spaCy model for English (large)

In [None]:
!python -m spacy download en_core_web_lg

## Importing Necessary Libraries

In [1]:
import os
import re
import spacy
import gdown
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt

from spacy.tokens import DocBin
from spacy.util import filter_spans

warnings.filterwarnings('ignore')

In [2]:
path_to_email_data = os.path.join(os.path.dirname(os.getcwd()), 'data', 'prod_email.csv')
print(path_to_email_data)

/home/fm-pc-lt-219/Desktop/product_and_sentiment_classification_poc/custom_ai_app/data/prod_email.csv


## Loading the email dataset

In [3]:
df = pd.read_csv(path_to_email_data)
print(df.shape)
df.head()

(450, 5)


Unnamed: 0,Ticket Type,Ticket Subject,Ticket Description,new_product_name,new_product_type
0,Technical issue,Software bug,Subject: Urgent Assistance Required: Issue wit...,SAP ERP,ERP
1,Technical issue,Software bug,Subject: Urgent Assistance Required for SAP ER...,SAP ERP,ERP
2,Technical issue,Software bug,Subject: Frustration with SAP ERP Technical Is...,SAP ERP,ERP
3,Technical issue,Software bug,Subject: Urgent Assistance Required: Software ...,SAP ERP,ERP
4,Technical issue,Software bug,Subject: Urgent: Glitch in SAP ERP Software\nD...,SAP ERP,ERP


## Loading the spaCy model

### Loading a spaCy model for tokenizing a sentence

**Model: en_core_web_lg**

The "en_core_web_lg" model is a pre-trained English pipeline provided by spaCy. It bundles several natural language processing (NLP) components behind a single `nlp` object.

**Key Features:**
- **Word Vectors**: The model ships with static word vectors (one fixed vector per word) trained on a large text corpus. They support similarity comparisons between words and documents and can serve as input features for downstream components such as the NER model trained later in this notebook.

**Training Data:**
- The "en_core_web_lg" model was trained on written English web text such as blogs, news, and comments, which gives it a broad vocabulary and coverage of common linguistic patterns.

Its pipeline covers tasks such as part-of-speech tagging, dependency parsing, and named entity recognition, which makes it a practical starting point for processing English text.

**Leveraging the NER Model for Product Recognition**

Our focus is on harnessing the Named Entity Recognition (NER) model to identify and categorize products that customers purchase. These products encompass a range of categories, including ERP, CRM, Appointment Booking Software, and others.

**Uncovering and Labeling Products**

Our goal is to uncover all these products within customer text data and assign them to the respective entities from a set of four predefined categories.




In [4]:
nlp = spacy.load("en_core_web_lg")

### Defining a function to remove commonly used words in a text

<b>Stopwords</b> are common words in natural language that are often filtered out when analyzing text data because they typically don't carry significant meaning on their own. They are frequently used in language but may not provide valuable information for tasks like text analysis and natural language processing. 

Common English stopwords, grouped by part of speech:

- Articles: "a," "an," "the"
- Pronouns: "I," "you," "he," "she," "it," "we," "they"
- Prepositions: "in," "on," "at," "with," "by," "for," "of," "to," "from"
- Conjunctions: "and," "but," "or," "so," "because"

These stopwords are often removed from text data during text preprocessing to focus on more meaningful words and improve the efficiency and effectiveness of natural language processing tasks such as text classification, sentiment analysis, and information retrieval.
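
To make the idea concrete, here is a minimal sketch that filters a sentence against a small hand-picked stopword set. The names `SAMPLE_STOPWORDS` and `drop_stopwords` are hypothetical, and the set is far smaller than spaCy's built-in list; the notebook itself relies on `token.is_stop` instead.

```python
# Hypothetical, deliberately tiny stopword set for illustration only.
SAMPLE_STOPWORDS = {"a", "an", "the", "i", "it", "with", "of", "to", "and", "is"}


def drop_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Return the text with whitespace-separated stopwords removed."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


example = "I need help with the login page of SAP ERP"
filtered = drop_stopwords(example)
print(filtered)  # need help login page SAP ERP
```

The content words, including the product name, survive the filter, which is the property the later entity matching depends on.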

In [5]:
def remove_stopwords(text):
    """
    Remove stopwords from the input text.

    Args:
        text (str): The input text from which stopwords will be removed.

    Returns:
        str: The cleaned text with stopwords removed.
    """
    doc = nlp(text)
    cleaned_text = ' '.join(token.text for token in doc if not token.is_stop)
    return cleaned_text

## Text Cleaning

This function performs essential text cleaning steps, making the input text suitable for analysis and natural language processing. The process includes:

- **Tokenization:** Breaking text into words or tokens.
- **Lowercasing:** Converting tokens to lowercase.
- **Punctuation Removal:** Eliminating punctuation marks.
- **Special Character Removal:** Removing non-alphanumeric characters.
- **Whitespace Cleanup:** Ensuring consistent and clean spaces.

The result is a cleaned and preprocessed text, ready for various NLP tasks and analyses.
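
The steps above can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in (the name `clean_text_sketch` is hypothetical, and it skips spaCy tokenization in favor of plain regular expressions), but it shows what the character filtering does to a sample ticket line.

```python
import re


def clean_text_sketch(text):
    """Approximate the cleaning steps with the standard library only."""
    lowered = text.lower()
    # Drop every character that is not a letter, digit, or whitespace.
    alnum_only = re.sub(r"[^a-z0-9\s]", "", lowered)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", alnum_only).strip()


sample = "Subject: Urgent!!  SimplyBook.me keeps crashing...\nPlease help."
cleaned = clean_text_sketch(sample)
print(cleaned)  # subject urgent simplybookme keeps crashing please help
```

Note that the dot in `SimplyBook.me` disappears. Any later lookup for a product name that contains punctuation (for example `simplybook.me`) would therefore not find it in the cleaned text, which is worth checking before building training data.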


In [6]:
def clean_text(text):
    """
    Clean and preprocess the input text.

    This function tokenizes the input text, converts tokens to lowercase,
    removes punctuation, and ensures that the text only contains letters,
    digits, and whitespace.

    Args:
        text (str): The input text to be cleaned.

    Returns:
        str: The cleaned and preprocessed text.
    """
    doc = nlp(text)
    cleaned_text = ' '.join(token.text.lower() for token in doc if not token.is_punct)
    cleaned_text = re.sub(r'\s+', ' ', re.sub(r'[^a-zA-Z0-9\s]', '', cleaned_text)).strip()
    return cleaned_text

## Applying the text cleaning operation to the email descriptions

In [7]:
df['Ticket Description'] = df['Ticket Description'].apply(clean_text)

## Removing stopwords from the email descriptions

In [8]:
df['Ticket Description'] = df['Ticket Description'].apply(remove_stopwords)

# Function Insights: create_product_class_mapping

## Purpose

The `create_product_class_mapping` function is designed to create a mapping of product classes to product names from a provided DataFrame. It is particularly useful when dealing with data that contains information about different types of products and their associated classes or categories.

## Function Signature

```python
def create_product_class_mapping(df):
```


In [9]:
def create_product_class_mapping(df):
    """
    Create a mapping of product classes to product names from a DataFrame.

    This function takes a DataFrame with two columns: 'new_product_type' and 'new_product_name'.
    It creates a dictionary mapping each unique product class (in lowercase) to a list of 
    unique product names (in lowercase) belonging to that class.

    Args:
        df (pandas.DataFrame): The DataFrame containing 'new_product_type' and 'new_product_name' columns.

    Returns:
        dict: A dictionary mapping product classes to lists of product names.
    """
    prod_class_map = {}
    for item in zip(df['new_product_type'].str.lower(), df['new_product_name'].str.lower()):
        if item[0] not in prod_class_map:
            prod_class_map[item[0]] = [item[1]]
        else:
            if item[1] not in prod_class_map[item[0]]:
                prod_class_map[item[0]].append(item[1])
    return prod_class_map
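
For readers who want to see the grouping logic in isolation, here is a minimal sketch that works on plain tuples instead of a DataFrame. The helper `build_class_mapping` and the sample rows are hypothetical; it uses `dict.setdefault` as one possible way to express the same first-seen ordering and de-duplication.

```python
def build_class_mapping(rows):
    """Map each lowercase product class to its unique lowercase product names."""
    mapping = {}
    for product_type, product_name in rows:
        # setdefault creates the list on first sight of a class.
        names = mapping.setdefault(product_type.lower(), [])
        if product_name.lower() not in names:
            names.append(product_name.lower())
    return mapping


# Hypothetical rows in (new_product_type, new_product_name) order.
sample_rows = [
    ("ERP", "SAP ERP"),
    ("ERP", "SAP ERP"),
    ("ERP", "Odoo"),
    ("CRM", "Salesforce"),
]
sample_map = build_class_mapping(sample_rows)
print(sample_map)  # {'erp': ['sap erp', 'odoo'], 'crm': ['salesforce']}
```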


In [10]:
prod_class_map = create_product_class_mapping(df)
print(prod_class_map)

{'erp': ['sap erp', 'oracle erp cloud', 'microsoft dynamics 365', 'netsuite erp', 'infor cloudsuite', 'epicor erp', 'acumatica erp', 'odoo', 'sage x3', 'workday financial management'], 'crm': ['salesforce', 'hubspot crm', 'zoho crm', 'microsoft dynamics 365 crm', 'pipedrive', 'freshsales', 'insightly', 'nimble', 'sugarcrm', 'bitrix24'], 'appointment booking': ['calendly', 'acuity scheduling', 'setmore', 'simplybook.me', 'bookly', 'square appointments', 'appointy', 'schedulicity', 'youcanbook.me', 'timely'], 'other': ['adobe creative cloud', 'slack', 'zoom', 'trello', 'quickbooks', 'dropbox', 'atlassian jira', 'autodesk autocad', 'tableau', 'lastpass']}


## Finding the names of all products listed in the email dataset

In [11]:
product_names = list(df['new_product_name'].str.lower().unique())
print(product_names)

['sap erp', 'oracle erp cloud', 'microsoft dynamics 365', 'netsuite erp', 'infor cloudsuite', 'epicor erp', 'acumatica erp', 'odoo', 'sage x3', 'workday financial management', 'salesforce', 'hubspot crm', 'zoho crm', 'microsoft dynamics 365 crm', 'pipedrive', 'freshsales', 'insightly', 'nimble', 'sugarcrm', 'bitrix24', 'calendly', 'acuity scheduling', 'setmore', 'simplybook.me', 'bookly', 'square appointments', 'appointy', 'schedulicity', 'youcanbook.me', 'timely', 'adobe creative cloud', 'slack', 'zoom', 'trello', 'quickbooks', 'dropbox', 'atlassian jira', 'autodesk autocad', 'tableau', 'lastpass']


In [12]:
def find_product_class(product_name, product_dict):
    """
    Find the product class key for a given product name in a product dictionary.

    Args:
    - product_name (str): The name of the product to search for.
    - product_dict (dict): The dictionary containing product classes and their associated products.

    Returns:
    - str or None: The product class key if found, or None if the product was not found in any class.
    """
    product_class_key = next((key for key, value in product_dict.items() if product_name in value), None)
    return product_class_key
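
A short usage sketch of the lookup, repeated here so the snippet runs on its own. The two-class `sample_map` is hypothetical and only mimics the shape of `prod_class_map`.

```python
def find_product_class(product_name, product_dict):
    """Return the first class whose product list contains product_name, else None."""
    return next((key for key, value in product_dict.items() if product_name in value), None)


# Hypothetical mapping with the same shape as prod_class_map.
sample_map = {
    "erp": ["sap erp", "microsoft dynamics 365"],
    "crm": ["salesforce", "microsoft dynamics 365 crm"],
}

found = find_product_class("microsoft dynamics 365", sample_map)
missing = find_product_class("notepad", sample_map)
print(found, missing)  # erp None
```

Because `in` on a list is an exact membership test, the product name has to be passed in the same lowercase form that the mapping stores.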


In [13]:
# product_name = "microsoft dynamics 365"
# product_class_key = next((key for key, value in prod_class_map.items() if product_name in value), None)
# print(product_class_key)

## Named Entity Recognition (NER)

NER is a technique used to identify and categorize named entities in text data, such as product names and their associated classes.

### How NER Works

1. **Tokenization**: Text is divided into tokens (words or subwords).

2. **Feature Extraction**: Extract features from tokens.

3. **Sequence Labeling**: A model assigns labels (e.g., ERP_PRODUCT, CRM_PRODUCT) to tokens.

4. **Post-processing**: Combine labeled tokens into product entities with classes.
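
Steps 3 and 4 can be illustrated with a small sketch that merges per-token tags into entities. It assumes the common BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` is outside); the function name `bio_to_spans` and the sample tags are hypothetical, and spaCy handles this step internally.

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (entity_text, label) pairs."""
    spans, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # An "O" tag or an inconsistent "I-" tag closes the open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["issue", "with", "oracle", "erp", "cloud", "and", "salesforce"]
tags = ["O", "O", "B-ERP", "I-ERP", "I-ERP", "O", "B-CRM"]
entities = bio_to_spans(tokens, tags)
print(entities)  # [('oracle erp cloud', 'ERP'), ('salesforce', 'CRM')]
```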

### Creating a NER Dataset

To train NER models for recognizing and classifying product names and their associated classes, follow these steps:

1. **Collect Text Data**: Gather text containing mentions of products.

2. **Annotate Entities**: Label product names with start and end positions and assign corresponding product classes.

3. **Format Data**: Organize the annotated data in JSON, CSV, or spaCy format.

### Example Dataset

Here's an example dataset in JSON format that you can use to train your NER models. The `end` offsets are exclusive, so `text[start:end]` yields the entity text, which matches what spaCy's `Doc.char_span` expects:

```json
[
    {"text": "SAP HANA is an ERP solution.", "entities": [{"start": 0, "end": 8, "label": "ERP_PRODUCT"}]},
    {"text": "Salesforce is a CRM platform.", "entities": [{"start": 0, "end": 10, "label": "CRM_PRODUCT"}]},
    {"text": "Booking.com offers hotel reservations.", "entities": [{"start": 0, "end": 11, "label": "BOOKING_PRODUCT"}]}
]
```

You can use this dataset as a starting point to train NER models for recognizing and classifying product names and their associated classes in new text data.
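
Before training, it is worth checking that every annotated offset actually slices out the intended entity. The sketch below assumes end-exclusive offsets (the Python slice convention) and uses two hypothetical samples rather than the dataset above.

```python
import json

# Hypothetical samples; "end" is exclusive, so text[start:end] is the entity.
raw = """
[
    {"text": "Salesforce is a CRM platform.",
     "entities": [{"start": 0, "end": 10, "label": "CRM_PRODUCT"}]},
    {"text": "We moved to Odoo last year.",
     "entities": [{"start": 12, "end": 16, "label": "ERP_PRODUCT"}]}
]
"""

samples = json.loads(raw)
extracted = [
    (sample["text"][ent["start"]:ent["end"]], ent["label"])
    for sample in samples
    for ent in sample["entities"]
]
print(extracted)  # [('Salesforce', 'CRM_PRODUCT'), ('Odoo', 'ERP_PRODUCT')]
```

An off-by-one end offset shows up immediately as a truncated name, which is much easier to spot here than in a silently skipped span during training.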



## Function Description

This function is designed for preparing data intended for training a Named Entity Recognition (NER) model. The primary objective is to enable the identification and classification of product names and their respective classes within textual data.

### Input Parameters

- `df` (DataFrame): A DataFrame containing text data, such as ticket descriptions.

- `product_names` (list): A list of lowercase product names that serve as entities for recognition.

- `prod_class_map` (dict): A dictionary mapping each product class to the list of product names in that class.

### Function Operation

1. **Data Annotation**: The function processes each entry within the text data, focusing on the 'Ticket Description' column.

2. **Entity Identification**: For each text item, it creates a dictionary to store the text and its associated entities, which initially remain empty.

3. **Matching Product Names**: The function searches for occurrences of product names within the text. When a match is found, it calculates the start and end positions of the identified entity.

4. **Entity Classification**: Each product name is classified based on its corresponding class in the provided mapping, represented in uppercase.

5. **Data Organization**: The matches for each product, each a tuple of start position, end position, and class, are appended as one list to the `entities` list within the dictionary, so `entities` is a list of lists.

6. **Data Compilation**: The annotated dictionary is then added to the `training_data` list, which accumulates all the annotated text samples.

7. **Output**: The function returns a list of training data formatted as required for training NER models. Each entry within this list includes the text and a list of annotated entities, facilitating the training of NER models tailored specifically for recognizing and classifying product names and their associated classes in text data, such as ticket descriptions.
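
The matching step can be sketched independently of pandas. This is a variation on the approach described above, not the notebook's exact code: the helper `find_product_spans` and the inverted `sample_products` mapping (product name to class) are hypothetical, and the pattern adds whole-word lookarounds so that a short product name is not matched inside a longer word.

```python
import re


def find_product_spans(text, product_to_class):
    """Return (start, end, LABEL) for every whole-word product mention."""
    spans = []
    for product, product_class in product_to_class.items():
        # re.escape treats '.' and other metacharacters literally;
        # the lookarounds reject matches inside longer words.
        pattern = r"(?<!\w)" + re.escape(product) + r"(?!\w)"
        for match in re.finditer(pattern, text, re.IGNORECASE):
            spans.append((match.start(), match.end(), product_class.upper()))
    return sorted(spans)


# Hypothetical inverted mapping (product name -> class).
sample_products = {"zoho crm": "crm", "odoo": "erp"}
sample_text = "zoho crm sync fails after odoo import"
spans = find_product_spans(sample_text, sample_products)
print(spans)  # [(0, 8, 'CRM'), (26, 30, 'ERP')]
```

The end positions come from `match.end()` and are therefore exclusive, which is the convention `Doc.char_span` uses when the spans are converted later.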


In [14]:
def create_ner_training_data(df, product_names, prod_class_map):
    """
    Create training data for Named Entity Recognition (NER).

    Args:
        df (pandas.DataFrame): The DataFrame containing text data.
        product_names (list): List of lowercase product names to extract as entities.
        prod_class_map (dict): A mapping of product classes to lists of product names.

    Returns:
        list: One dict per ticket with the text and its entity matches.

    Example:
        To create training data for NER using a DataFrame 'df' and product names list 'product_names':
        
        >>> training_data = create_ner_training_data(df, product_names, prod_class_map)
        >>> print(training_data[0])
        {
            "text": "Sample ticket description containing product name.",
            "entities": [
                [(start_pos, end_pos, "PRODUCT_CLASS"), ...],  # one list per matched product
                ...
            ]
        }
    """
    training_data = []

    for item in tqdm(df['Ticket Description']):
        training_dict = {"text": item, "entities": []}

        for prod in product_names:
            if prod in item:
                # Look up the class once per product; escape the name so that
                # characters such as '.' are matched literally.
                label = find_product_class(prod, prod_class_map).upper()
                matches = [(match.start(), match.end(), label)
                           for match in re.finditer(re.escape(prod), item, re.IGNORECASE)]
                training_dict["entities"].append(matches)
        training_data.append(training_dict)

    return training_data


In [16]:
training_data = create_ner_training_data(df=df, product_names=product_names,
                                         prod_class_map=prod_class_map)
print(training_data[9])

100%|██████████████████████████████████████████████████████████████████████| 450/450 [00:00<00:00, 16684.08it/s]

{'text': 'subject urgent assistance needed technical issues oracle erp cloud dear customer support m completely fed ongoing software bugs technical issues facing oracle erp cloud loyal customer expected seamless efficient experience unfortunately case primary concern encountered related grey baby slime 98 feature issue persists multiple devices model indicating widespread problem isolated incident severely impacted ability utilize oracle erp cloud effectively efficiently invested significant time resources implementing oracle erp cloud organization expected enhance operations constant software bugs technical glitches hindered productivity caused unnecessary delays affects day day operations reflects poorly reputation oracle erp cloud kindly request immediate attention matter urge escalate issue technical team provide prompt resolution paying customer believe responsibility ensure product functions advertised meets expectations customers understand software entirely bug free frequency s




## Code Description

The code segment below sets up what is needed to store spaCy `Doc` objects in binary form with the `DocBin` utility.

### Key Points

- `DocBin` was already imported from `spacy.tokens` at the top of the notebook.

- `spacy.blank("en")` creates a blank English pipeline (tokenizer only, no trained components). Reassigning `nlp` replaces the `en_core_web_lg` pipeline loaded earlier.

- An empty `DocBin` object named `doc_bin` is initialized.

`DocBin` serializes a collection of `Doc` objects compactly, and the resulting `.spacy` file is the format that `spacy train` reads.


In [17]:
nlp = spacy.blank("en") 
doc_bin = DocBin()

### Creating a new folder named 'ner_config' to store some configurations in NER

In [18]:
path_to_ner_config = os.path.join(os.path.dirname(os.getcwd()), 'ner_config')
os.makedirs(path_to_ner_config, exist_ok=True)

## Code Explanation

The code below prepares training data for a spaCy Named Entity Recognition (NER) model:

1. It iterates through training examples, extracting text and entity labels.
2. Creates spaCy `Doc` objects with non-overlapping entities.
3. Stores processed `Docs` in a `DocBin`.
4. Saves the `DocBin` to a binary file for NER model training.
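
The steps above can be tried on a single sentence. The sketch below assumes only that spaCy is installed (no trained model is needed) and uses a hypothetical sample text with two overlapping candidate spans; it round-trips the `DocBin` through bytes instead of writing a file.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

nlp = spacy.blank("en")  # tokenizer-only pipeline
text = "issue with oracle erp cloud after update"
doc = nlp.make_doc(text)

# Two overlapping candidates: the full product name and a shorter prefix.
start = text.index("oracle erp cloud")
candidates = [
    doc.char_span(start, start + len("oracle erp cloud"), label="ERP", alignment_mode="contract"),
    doc.char_span(start, start + len("oracle erp"), label="ERP", alignment_mode="contract"),
]

# filter_spans keeps the longest span when candidates overlap.
doc.ents = filter_spans([span for span in candidates if span is not None])

doc_bin = DocBin()
doc_bin.add(doc)

# Round-trip through bytes to confirm the entities survive serialization.
restored = list(DocBin().from_bytes(doc_bin.to_bytes()).get_docs(nlp.vocab))
ents = [(ent.text, ent.label_) for ent in restored[0].ents]
print(ents)  # [('oracle erp cloud', 'ERP')]
```

Filtering matters because `doc.ents` rejects overlapping spans; without `filter_spans`, a product name that is a prefix of another would raise an error when assigned.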


In [25]:
def prepare_training_data(training_data, nlp_model):
    """
    Prepare training data for a Named Entity Recognition (NER) model.

    Args:
        training_data (list): A list of training examples, each containing text and entity labels.
        nlp_model (spacy.Language): A spaCy language model.

    Returns:
        spacy.tokens.DocBin: A DocBin object containing processed training data.

    Note:
        This function processes training examples, extracts text and entity labels, creates
        spaCy Doc objects with non-overlapping entities, and returns a DocBin for NER model training.
    """
    doc_bin = DocBin()
    
    for training_example in training_data:
        text = training_example['text']
        labels = training_example['entities']
        doc = nlp_model.make_doc(text)
        ents = []
        print(labels)
        for label_group in labels:
            for start, end, label in label_group:
                span = doc.char_span(start, end, label=label, alignment_mode="contract")
                if span is None:
                    print("Skipping entity")
                else:
                    ents.append(span)
        
        filtered_ents = filter_spans(ents)
        doc.ents = filtered_ents
        doc_bin.add(doc)
    return doc_bin

In [26]:
train_data_path = os.path.join(path_to_ner_config, "train.spacy")

In [27]:
doc_bin = prepare_training_data(training_data, nlp_model = nlp)
doc_bin.to_disk(train_data_path)

[[(41, 48, 'ERP'), (101, 108, 'ERP'), (258, 265, 'ERP'), (530, 537, 'ERP'), (772, 779, 'ERP'), (929, 936, 'ERP'), (1264, 1271, 'ERP')]]
[[(35, 42, 'ERP'), (133, 140, 'ERP'), (333, 340, 'ERP'), (592, 599, 'ERP'), (781, 788, 'ERP'), (988, 995, 'ERP'), (1431, 1438, 'ERP'), (1575, 1582, 'ERP')]]
[[(20, 27, 'ERP'), (161, 168, 'ERP'), (441, 448, 'ERP'), (736, 743, 'ERP'), (914, 921, 'ERP'), (1330, 1337, 'ERP')]]
[[(49, 56, 'ERP'), (148, 155, 'ERP'), (182, 189, 'ERP'), (392, 399, 'ERP'), (619, 626, 'ERP'), (682, 689, 'ERP'), (840, 847, 'ERP'), (1265, 1272, 'ERP')]]
[[(22, 29, 'ERP'), (86, 93, 'ERP'), (236, 243, 'ERP'), (548, 555, 'ERP'), (751, 758, 'ERP'), (1043, 1050, 'ERP')]]
[[(35, 42, 'ERP'), (160, 167, 'ERP'), (401, 408, 'ERP'), (538, 545, 'ERP'), (896, 903, 'ERP'), (1081, 1088, 'ERP'), (1325, 1332, 'ERP')]]
[[(20, 36, 'ERP'), (100, 116, 'ERP'), (534, 550, 'ERP'), (888, 904, 'ERP'), (1096, 1112, 'ERP'), (1655, 1671, 'ERP')]]
[[(33, 49, 'ERP'), (144, 160, 'ERP'), (319, 335, 'ERP'), (619, 

[[(39, 45, 'CRM'), (131, 137, 'CRM'), (263, 269, 'CRM'), (908, 914, 'CRM'), (1158, 1164, 'CRM'), (1401, 1407, 'CRM'), (1541, 1547, 'CRM')]]
[[(700, 705, 'OTHER')], [(46, 52, 'OTHER'), (58, 64, 'OTHER'), (141, 147, 'OTHER'), (306, 312, 'OTHER'), (525, 531, 'OTHER'), (822, 828, 'OTHER'), (969, 975, 'OTHER'), (1062, 1068, 'OTHER'), (1377, 1383, 'OTHER')]]
[[(56, 72, 'ERP'), (145, 161, 'ERP'), (350, 366, 'ERP'), (790, 806, 'ERP'), (1215, 1231, 'ERP'), (1288, 1304, 'ERP')]]
[[(42, 50, 'CRM'), (56, 64, 'CRM'), (139, 147, 'CRM'), (327, 335, 'CRM'), (553, 561, 'CRM'), (1168, 1176, 'CRM'), (1538, 1546, 'CRM'), (1677, 1685, 'CRM')]]
[[(38, 57, 'APPOINTMENT BOOKING'), (63, 82, 'APPOINTMENT BOOKING'), (152, 171, 'APPOINTMENT BOOKING'), (570, 589, 'APPOINTMENT BOOKING'), (1090, 1109, 'APPOINTMENT BOOKING'), (1568, 1587, 'APPOINTMENT BOOKING'), (1679, 1698, 'APPOINTMENT BOOKING')]]
[[(61, 68, 'APPOINTMENT BOOKING'), (74, 81, 'APPOINTMENT BOOKING'), (157, 164, 'APPOINTMENT BOOKING'), (864, 871, 'APPO

## Defining config path and model path

In [28]:
base_config_path = os.path.join(path_to_ner_config, 'base_config.cfg')
output_config_path = os.path.join(path_to_ner_config, 'config.cfg')
model_path = os.path.join(os.path.dirname(os.getcwd()), "models")

### Downloading the base configuration file for training the NER model in spaCy

In [29]:
!gdown https://drive.google.com/uc?id=1pfqTcuw0MbQf6K78OHvHX_yudUYWcxAX -O "{base_config_path}"

Downloading...
From: https://drive.google.com/uc?id=1pfqTcuw0MbQf6K78OHvHX_yudUYWcxAX
To: /home/fm-pc-lt-219/Desktop/product_and_sentiment_classification_poc/custom_ai_app/ner_config/base_config.cfg
100%|██████████████████████████████████████| 1.86k/1.86k [00:00<00:00, 5.50MB/s]


In [30]:
!python -m spacy init fill-config "{base_config_path}" "{output_config_path}"

✔ Auto-filled config with all values
✔ Saved config
/home/fm-pc-lt-219/Desktop/product_and_sentiment_classification_poc/custom_ai_app/ner_config/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


# Training the NER Model for Product Type Classification

In [31]:
!python -m spacy train "{output_config_path}" --output "{model_path}" --paths.train "{train_data_path}" --paths.dev "{train_data_path}"

ℹ Saving to output directory:
/home/fm-pc-lt-219/Desktop/product_and_sentiment_classification_poc/custom_ai_app/models
ℹ Using CPU
[2023-09-09 16:32:21,965] [INFO] Set up nlp object from config
[2023-09-09 16:32:21,972] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-09-09 16:32:21,974] [INFO] Created vocabulary
[2023-09-09 16:32:23,046] [INFO] Added vectors: en_core_web_lg
[2023-09-09 16:32:23,046] [INFO] Finished initializing nlp object
[2023-09-09 16:32:24,808] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    102.33    0.00    0.00    0.00    0.00
  0     200          5.56  18610.94    0.00    0.00    0.00    0.00
  0     400         28.19   1745.75   19.26   27.06

## Training Summary

- **E**: Epoch number.
- **#**: Number of optimization steps completed so far.
- **LOSS TOK2VEC**: Training loss of the `tok2vec` (token-to-vector) component.
- **LOSS NER**: Training loss of the `ner` component.
- **ENTS_F**: F-score of the predicted entities, the harmonic mean of precision and recall.
- **ENTS_P**: Precision, the share of predicted entities that are correct.
- **ENTS_R**: Recall, the share of annotated entities that the model finds.
- **SCORE**: Overall performance score.
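
The relationship between the three entity metrics can be worked through with a few lines of arithmetic. The counts below are hypothetical, and the sketch assumes a predicted entity counts as correct only when both its span and its label match the annotation.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from entity-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct entities, 2 spurious, 4 missed.
p, r, f = precision_recall_f1(8, 2, 4)
# Scaled by 100 to match the percentage-style values in the training table.
print(round(p * 100, 2), round(r * 100, 2), round(f * 100, 2))  # 80.0 66.67 72.73
```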

**Training Progress:**

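- Each row reports one evaluation checkpoint; the visible rows are 200 steps apart.
- Losses can spike early (LOSS NER jumps from 102.33 to 18610.94 at step 200 before dropping to 1745.75 at step 400) and should trend downward afterwards.
- ENTS_F, ENTS_P, and ENTS_R rising from 0.00 indicate that the model is learning. Because the training command passes the same file to `--paths.train` and `--paths.dev`, these scores are measured on the training data and overstate performance on unseen tickets.

**Convergence:**

- Training stops according to the limits in the config's `[training]` section (such as `patience`, `max_epochs`, and `max_steps`), not when a target metric is reached.
- The table reports scores on a 0 to 100 scale, so a well-fitted model approaches an F-score of 100. On this run, that would mainly show that the model fits the training data.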

**Saved Pipeline:**

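- `spacy train` writes the trained pipeline to the `--output` directory (`models` here), in `model-best` and `model-last` subdirectories, for future use.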
