
Building an ML/statistical text classification model and implementing entity extraction algorithms can be broken down into several steps that focus on feature extraction, statistical modeling, and natural language processing techniques:

Step 1: Text Classification using ML Model
We'll use TF-IDF (Term Frequency-Inverse Document Frequency) to convert text data into numerical form and Naive Bayes for text classification.

1.1 Data Collection:
Imagine you have user transcripts labeled with various intents.

In [None]:
import pandas as pd

# Example dataset of user transcripts and their intents
data = {'query': [
    "I want to book an appointment.",
    "What is the status of my order?",
    "Please cancel my booking.",
    "I need help with my account."],
    'intent': ["book_appointment", "check_status", "cancel_order", "help_account"]
}

# Create DataFrame
df = pd.DataFrame(data)
print(df.head())


                             query            intent
0   I want to book an appointment.  book_appointment
1  What is the status of my order?      check_status
2        Please cancel my booking.      cancel_order
3     I need help with my account.      help_account


1.2 Text Preprocessing:
We will clean the text (removing stopwords, special characters) and lemmatize it to reduce the words to their base form.

In [None]:
import re
import string
import nltk
from nltk.corpus import stopwords
import spacy

# Download stopwords from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Load SpaCy model for lemmatization
nlp = spacy.load('en_core_web_sm')

# Function to clean and lemmatize text
def clean_lemmatize(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation (re.escape keeps ], \, ^ and - literal inside the character class)
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])
    # Lemmatize using SpaCy
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])

# Apply cleaning and lemmatization
df['cleaned_query'] = df['query'].apply(clean_lemmatize)
print(df[['query', 'cleaned_query']])


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                             query          cleaned_query
0   I want to book an appointment.  want book appointment
1  What is the status of my order?           status order
2        Please cancel my booking.  please cancel booking
3     I need help with my account.      need help account
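
The lowercase, strip-punctuation, and drop-stopwords steps can also be sketched with the standard library alone, which makes it easier to see what each step contributes before lemmatization. This is only an illustration: `STOP_WORDS` is a small hypothetical subset rather than the NLTK list, and `clean` is a made-up helper name that skips the SpaCy lemmatization step.

```python
import string

# Hypothetical subset of English stopwords, for illustration only (not the NLTK list)
STOP_WORDS = {"i", "to", "an", "what", "is", "the", "of", "my", "with"}

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean(text):
    # Lowercase, then delete punctuation in a single pass
    text = text.lower().translate(_PUNCT_TABLE)
    # Keep only the words that are not in the stopword set
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(clean("I want to book an appointment."))   # want book appointment
print(clean("What is the status of my order?"))  # status order
```

Using `str.translate` avoids building a regular expression from punctuation characters at all, so no escaping is needed.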


1.3 Feature Extraction (TF-IDF):
Now we convert the cleaned text into numerical form using TF-IDF Vectorizer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the cleaned queries
X = vectorizer.fit_transform(df['cleaned_query'])
print(X.toarray())  # Numerical representation of text


[[0.         0.57735027 0.57735027 0.         0.         0.
  0.         0.         0.         0.         0.57735027]
 [0.         0.         0.         0.         0.         0.
  0.         0.70710678 0.         0.70710678 0.        ]
 [0.         0.         0.         0.57735027 0.57735027 0.
  0.         0.         0.57735027 0.         0.        ]
 [0.57735027 0.         0.         0.         0.         0.57735027
  0.57735027 0.         0.         0.         0.        ]]
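
To see where values such as 0.57735027 and 0.70710678 come from, the sketch below recomputes the matrix by hand. It assumes the scikit-learn defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses a plain whitespace split instead of the vectorizer's tokenizer; the function name `tfidf` is made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    # Tokenize on whitespace (a simplification of the vectorizer's tokenizer)
    tokenized = [doc.split() for doc in docs]
    # Columns are the sorted vocabulary, as in the printed matrix
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n = len(docs)
    # Document frequency: number of documents that contain each word
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency
    idf = {word: math.log((1 + n) / (1 + df[word])) + 1 for word in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[word] * idf[word] for word in vocab]
        # L2-normalize each row so its squared values sum to 1
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

docs = ["want book appointment", "status order",
        "please cancel booking", "need help account"]
vocab, rows = tfidf(docs)
print(vocab)
print([round(value, 8) for value in rows[0]])
```

Every word here occurs in exactly one query, so all IDF values are equal and normalization leaves 1/sqrt(3) ≈ 0.577 for a three-word query and 1/sqrt(2) ≈ 0.707 for a two-word query.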


1.4 Train the Classification Model:
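We will use Naive Bayes for text classification to predict the intent of new queries. Note that this toy dataset has only four queries, each with a different intent, so the single held-out query always belongs to a class the model never saw during training. The accuracy of 0.0 reported below reflects that limitation, not the method; a realistic dataset needs several examples per intent.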

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, df['intent'], test_size=0.2, random_state=42)

# Initialize Naive Bayes classifier
model = MultinomialNB()

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.0
                  precision    recall  f1-score   support

book_appointment       0.00      0.00      0.00       0.0
    check_status       0.00      0.00      0.00       1.0

        accuracy                           0.00       1.0
       macro avg       0.00      0.00      0.00       1.0
    weighted avg       0.00      0.00      0.00       1.0
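
For intuition about what the classifier does once every intent has more than one example, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is not the scikit-learn implementation: it works on raw word counts instead of TF-IDF weights, and the queries, labels, and function names (`train_nb`, `predict_nb`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # Collect the vocabulary, per-class document counts, and per-class word counts
    vocab = {word for doc in docs for word in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    return vocab, class_counts, word_counts, alpha

def predict_nb(model, doc):
    vocab, class_counts, word_counts, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class
        score = math.log(class_counts[label] / total_docs)
        # Smoothed denominator: words seen in this class + alpha per vocabulary word
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for word in doc.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical cleaned queries with two examples per intent
docs = ["want book appointment", "book appointment tomorrow",
        "status order", "order status update"]
labels = ["book_appointment", "book_appointment",
          "check_status", "check_status"]

model = train_nb(docs, labels)
print(predict_nb(model, "book appointment"))  # book_appointment
print(predict_nb(model, "order status"))      # check_status
```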





Step 2: Entity Extraction Algorithms
Entity extraction identifies useful pieces of information in the text, such as names, dates, and organizations. We'll use SpaCy for this task.

In [None]:
# Example queries for entity extraction
queries = ["Book an appointment for John Doe on September 12.", "Cancel my order with Order ID 12345."]

# Extract named entities
for query in queries:
    doc = nlp(query)
    print(f"Query: {query}")
    for entity in doc.ents:
        print(f"Entity: {entity.text}, Label: {entity.label_}")


Query: Book an appointment for John Doe on September 12.
Entity: John, Label: PERSON
Entity: September 12, Label: DATE
Query: Cancel my order with Order ID 12345.
Entity: Order, Label: ORG
Entity: 12345, Label: DATE
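
In the output above, the pretrained model tags "12345" as a DATE and "Order" as an ORG, which shows that domain-specific values such as order IDs may need extra handling. One possible complement is a simple rule-based pattern, sketched below. The pattern and the helper name `extract_order_ids` are assumptions for illustration, not part of SpaCy.

```python
import re

# Hypothetical pattern: the words "order id", an optional ":" or "#", then digits
ORDER_ID_PATTERN = re.compile(r"\border\s+id\s*[:#]?\s*(\d+)\b", re.IGNORECASE)

def extract_order_ids(text):
    # Return every digit sequence that directly follows "order id"
    return ORDER_ID_PATTERN.findall(text)

print(extract_order_ids("Cancel my order with Order ID 12345."))               # ['12345']
print(extract_order_ids("Book an appointment for John Doe on September 12."))  # []
```

The day number in "September 12" is not picked up, because the pattern only matches digits that follow the words "order id".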


Step 3: Retraining the Model (Data-Driven Decisions)
Once the initial model is trained, you'll likely receive new transcripts and feedback over time. To improve your virtual agent, you'll retrain the model on the combined old and new data.

Steps to Retrain:

1. Gather new labeled transcripts.
2. Clean and preprocess the data.
3. Append to the existing training dataset.
4. Retrain the model using the updated dataset.

Retraining Example:

In [None]:
# New data for retraining
new_data = {'query': ["I want to reschedule my appointment.", "Help me with my order refund."],
            'intent': ["reschedule_appointment", "refund_order"]}

# Convert to DataFrame
new_df = pd.DataFrame(new_data)

# Clean and lemmatize the new data
new_df['cleaned_query'] = new_df['query'].apply(clean_lemmatize)

# Combine the old and new data into one training set
combined_df = pd.concat([df, new_df], ignore_index=True)

# Refit the vectorizer on the combined text so that words seen only in the
# new queries (e.g. "reschedule", "refund") enter the vocabulary
combined_X = vectorizer.fit_transform(combined_df['cleaned_query'])
combined_y = combined_df['intent']

# Split and retrain
X_train, X_test, y_train, y_test = train_test_split(combined_X, combined_y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
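
One thing to watch when retraining: a vocabulary built from the earlier queries has no column for words that first appear in the new transcripts, so those words contribute nothing unless the vectorizer is refit on the combined text. The sketch below illustrates this with plain Python sets. The cleaned forms of the new queries are assumed, not taken from a run of `clean_lemmatize`.

```python
# Cleaned queries from the original dataset (as printed in step 1.2)
old_queries = ["want book appointment", "status order",
               "please cancel booking", "need help account"]

# Assumed cleaned forms of the two new queries
new_queries = ["want reschedule appointment", "help order refund"]

def build_vocab(docs):
    # The vocabulary is the set of distinct whitespace-separated words
    return {word for doc in docs for word in doc.split()}

old_vocab = build_vocab(old_queries)
new_words = build_vocab(new_queries)

# Words in the new data that the old vocabulary cannot represent
unseen = sorted(new_words - old_vocab)
print(unseen)  # ['refund', 'reschedule']
```

Here the two words that distinguish the new intents are exactly the ones the old vocabulary lacks.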


Step 4: End-to-End Pipeline
Here’s how the entire workflow connects:

Text Classification helps predict intents (e.g., "book appointment", "check status").
Entity Extraction pulls out key information like names or dates.
Retraining improves the model with updated data based on user interactions.

In [None]:
import pandas as pd
import re
import string
import nltk
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Download stopwords from NLTK
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

# Load SpaCy model
nlp = spacy.load('en_core_web_sm')

# Example dataset
data = {'query': [
    "I want to book an appointment.",
    "What is the status of my order?",
    "Please cancel my booking.",
    "I need help with my account."],
    'intent': ["book_appointment", "check_status", "cancel_order", "help_account"]
}
df = pd.DataFrame(data)

# Text cleaning and lemmatization
def clean_lemmatize(text):
    text = text.lower()
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)  # escape so ], \, ^ and - stay literal
    text = ' '.join([word for word in text.split() if word not in stop_words])
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])

df['cleaned_query'] = df['query'].apply(clean_lemmatize)

# TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['cleaned_query'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, df['intent'], test_size=0.2, random_state=42)

# Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predictions and evaluation
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Entity extraction
queries = ["Book an appointment for John Doe on September 12.", "Cancel my order with Order ID 12345."]
for query in queries:
    doc = nlp(query)
    print(f"Query: {query}")
    for entity in doc.ents:
        print(f"Entity: {entity.text}, Label: {entity.label_}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Accuracy: 0.0
                  precision    recall  f1-score   support

book_appointment       0.00      0.00      0.00       0.0
    check_status       0.00      0.00      0.00       1.0

        accuracy                           0.00       1.0
       macro avg       0.00      0.00      0.00       1.0
    weighted avg       0.00      0.00      0.00       1.0

Query: Book an appointment for John Doe on September 12.
Entity: John, Label: PERSON
Entity: September 12, Label: DATE
Query: Cancel my order with Order ID 12345.
Entity: Order, Label: ORG
Entity: 12345, Label: DATE
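
To connect the two outputs into a single response for a virtual agent, the intent and the entities can be returned together. The sketch below shows one possible shape for that wiring. `classify_intent` and `extract_entities` are hypothetical stand-ins for the trained classifier and the SpaCy entity loop, kept rule-based here so the example runs on its own.

```python
import re

def classify_intent(query):
    # Hypothetical stand-in for vectorizing the query and calling model.predict
    text = query.lower()
    if "cancel" in text:
        return "cancel_order"
    if "book" in text:
        return "book_appointment"
    return "unknown"

def extract_entities(query):
    # Hypothetical stand-in for looping over nlp(query).ents
    return [(match.group(), "NUMBER") for match in re.finditer(r"\d+", query)]

def handle_query(query, classify=classify_intent, extract=extract_entities):
    # Bundle the predicted intent and the extracted entities in one result
    return {"query": query, "intent": classify(query), "entities": extract(query)}

result = handle_query("Cancel my order with Order ID 12345.")
print(result["intent"])    # cancel_order
print(result["entities"])  # [('12345', 'NUMBER')]
```

Passing the two functions in as parameters means the stand-ins can later be swapped for the real model and SpaCy pipeline without changing `handle_query`.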


