In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Input, Embedding, LSTM, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nltk
from nltk.stem import WordNetLemmatizer
import json
import random
import pickle

In [4]:
# Download NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Load and preprocess the data
def load_data(file_path):
    with open(file_path) as file:
        data = json.load(file)
    
    words = []
    classes = []
    documents = []
    
    for intent in data['intents']:
        for pattern in intent['patterns']:
            # Tokenize each word
            word_list = nltk.word_tokenize(pattern)
            words.extend(word_list)
            # Add documents in the corpus
            documents.append((word_list, intent['tag']))
            # Add to our classes list
            if intent['tag'] not in classes:
                classes.append(intent['tag'])
    
    # Lemmatize and lowercase each word and remove duplicates
    words = [lemmatizer.lemmatize(word.lower()) for word in words if word not in ['?', '!', '.', ',']]
    words = sorted(list(set(words)))
    classes = sorted(list(set(classes)))
    
    return words, classes, documents

words, classes, documents = load_data('intents.json')

# Create training data
def create_training_data(words, classes, documents):
    training = []
    output_empty = [0] * len(classes)
    
    for doc in documents:
        # Lemmatize and lowercase the pattern tokens, as was done for the vocabulary
        word_patterns = [lemmatizer.lemmatize(word.lower()) for word in doc[0]]
        
        # 1 if the vocabulary word occurs in the pattern, else 0
        bag = [1 if word in word_patterns else 0 for word in words]
        
        output_row = list(output_empty)
        output_row[classes.index(doc[1])] = 1
        
        training.append([bag, output_row])
    
    random.shuffle(training)
    training = np.array(training, dtype=object)
    
    train_x = list(training[:, 0])
    train_y = list(training[:, 1])
    
    return train_x, train_y

train_x, train_y = create_training_data(words, classes, documents)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jeron\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\jeron\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jeron\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


This code is the preprocessing stage of a Natural Language Processing (NLP) pipeline for an intent-based chatbot. It uses the Natural Language Toolkit (NLTK) to tokenize and lemmatize the example sentences and encodes them as numeric vectors for training a machine learning model.

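The first section downloads the NLTK data packages the pipeline needs: `punkt` and `punkt_tab` for tokenization and `wordnet` for lemmatization. A lemmatizer is then initialized using `WordNetLemmatizer`, which reduces words to their dictionary form (e.g., "cats" becomes "cat"). Because `lemmatize` is called without a part-of-speech argument, every word is treated as a noun, so a verb form such as "running" is left unchanged unless `pos='v'` is passed. Lemmatization standardizes word forms and keeps the vocabulary small.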

The `load_data` function loads and preprocesses the data from a JSON file (`intents.json`). This file is expected to contain a list of intents, each with a tag (category) and a list of patterns (example sentences). The function iterates through the intents, tokenizes each pattern into individual words using `nltk.word_tokenize`, and collects these tokens in a list called `words`. It also builds a `documents` list, which pairs each tokenized pattern with its tag, and a `classes` list, which contains the unique tags. Afterward, the punctuation tokens `?`, `!`, `.`, and `,` are dropped from `words`, and the remaining tokens are lowercased, lemmatized, and deduplicated to form the vocabulary. Both `words` and `classes` are sorted so that their index order is the same on every run.
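To make the expected input concrete, the sketch below builds a vocabulary from a small in-memory stand-in for `intents.json`. The sample intents are invented for illustration, and `simple_tokenize` and `build_vocabulary` are hypothetical names: the tokenizer is a regular-expression substitute for `nltk.word_tokenize` so the example runs without NLTK data, and it performs no lemmatization.

```python
import re

# Hypothetical sample mirroring the structure load_data expects.
sample_data = {
    "intents": [
        {"tag": "greeting",
         "patterns": ["Hello there!", "Hi, how are you?"],
         "responses": ["Hello!"]},
        {"tag": "goodbye",
         "patterns": ["Bye", "See you later."],
         "responses": ["Goodbye!"]},
    ]
}

IGNORED = {"?", "!", ".", ","}

def simple_tokenize(text):
    # Stand-in for nltk.word_tokenize: split into words and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocabulary(data):
    words, classes, documents = [], set(), []
    for intent in data["intents"]:
        for pattern in intent["patterns"]:
            tokens = simple_tokenize(pattern)
            words.extend(tokens)
            documents.append((tokens, intent["tag"]))
            classes.add(intent["tag"])
    # Drop punctuation, lowercase, deduplicate, and sort (no lemmatization here).
    vocab = sorted({w.lower() for w in words if w not in IGNORED})
    return vocab, sorted(classes), documents

vocab, tags, docs = build_vocabulary(sample_data)
print(vocab)      # ['are', 'bye', 'hello', 'hi', 'how', 'later', 'see', 'there', 'you']
print(tags)       # ['goodbye', 'greeting']
print(len(docs))  # 4
```

Each entry in `docs` keeps the original tokens, including punctuation, alongside its tag; only the vocabulary is filtered.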

The `create_training_data` function converts each document into a training pair. The input is a "bag of words" vector with one position per vocabulary word: the function lemmatizes and lowercases the document's tokens, then records a `1` for each vocabulary word that occurs in the pattern and a `0` otherwise. Word order and repetition are discarded; only presence matters. The target is a one-hot vector over `classes`, with a `1` at the index of the document's tag. The pairs are shuffled, and the function returns the inputs as `train_x` and the targets as `train_y`.
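A worked example of a single training pair may help. The vocabulary, classes, and the helper name `encode_document` below are assumptions made for illustration, and lemmatization is omitted so the snippet has no NLTK dependency.

```python
def encode_document(tokens, tag, vocabulary, classes):
    """Return (bag, one_hot) for one tokenized pattern and its tag."""
    present = {t.lower() for t in tokens}
    # One position per vocabulary word: 1 if present in the pattern, else 0.
    bag = [1 if word in present else 0 for word in vocabulary]
    # One position per class: 1 at the index of this document's tag.
    one_hot = [0] * len(classes)
    one_hot[classes.index(tag)] = 1
    return bag, one_hot

# Hypothetical vocabulary and classes, both sorted as load_data would return them.
vocabulary = ["bye", "hello", "how", "there", "you"]
classes = ["goodbye", "greeting"]

bag, one_hot = encode_document(["Hello", "there"], "greeting", vocabulary, classes)
print(bag)      # [0, 1, 0, 1, 0]
print(one_hot)  # [0, 1]
```

The length of `bag` fixes the model's input size and the length of `one_hot` fixes its output size, which is why both lists must keep the same order at training and prediction time.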

Overall, this code turns the raw intent patterns into fixed-length numeric vectors and one-hot labels, which is the form the classifier in the next cell is trained on.

In [5]:
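def create_model(input_shape, output_shape):
    model = Sequential([
        # Declare the input size with an explicit Input layer
        # instead of passing input_shape to the first Dense layer
        Input(shape=(input_shape,)),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(64, activation='relu'),
        Dropout(0.5),
        # One output unit per intent class; softmax yields class probabilities
        Dense(output_shape, activation='softmax')
    ])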
    
    model.compile(loss='categorical_crossentropy', 
                 optimizer='adam', 
                 metrics=['accuracy'])
    return model

model = create_model(len(train_x[0]), len(train_y[0]))

# Convert to numpy arrays
train_x = np.array(train_x)
train_y = np.array(train_y)

# Train the model
history = model.fit(train_x, train_y, epochs=200, batch_size=5, verbose=1)

# Save the model
model.save('chatbot_model1.h5')

# Save the necessary data structures
with open('chatbot_data.pkl', 'wb') as f:
    pickle.dump({'words': words, 'classes': classes, 'train_x': train_x, 'train_y': train_y}, f)


Epoch 1/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - accuracy: 0.2154 - loss: 1.1385
Epoch 2/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.2904 - loss: 1.0843
Epoch 3/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.2788 - loss: 1.1638
Epoch 4/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.2538 - loss: 1.1357
Epoch 5/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.4558 - loss: 1.1246
Epoch 6/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.6712 - loss: 0.9778
Epoch 7/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.3923 - loss: 1.0709
Epoch 8/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.2538 - loss: 1.1750
Epoch 9/200
3/3 ━━━━━━━━━━━━━━━━━━━━



In [6]:
def clean_up_sentence(sentence):
    sentence_words = nltk.word_tokenize(sentence)
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words

def bow(sentence, words):
    sentence_words = clean_up_sentence(sentence)
    bag = [0] * len(words)
    for s in sentence_words:
        for i, w in enumerate(words):
            if w == s:
                bag[i] = 1
    return np.array(bag)

def predict_class(sentence, model, words, classes):
    p = bow(sentence, words)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i, r] for i, r in enumerate(res) if r > ERROR_THRESHOLD]
    
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({'intent': classes[r[0]], 'probability': str(r[1])})
    return return_list

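def get_response(intents_list, intents_json, fallback="Sorry, I didn't understand that."):
    # The fallback text is a placeholder; it is returned when no intent
    # cleared the threshold or the predicted tag is missing from the file
    if not intents_list:
        return fallback
    tag = intents_list[0]['intent']
    for intent in intents_json['intents']:
        if intent['tag'] == tag:
            # Pick one of the predefined responses at random
            return random.choice(intent['responses'])
    return fallback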

This code snippet is part of a chatbot implementation that processes user input, predicts the intent behind the input, and generates an appropriate response. It consists of four key functions: `clean_up_sentence`, `bow`, `predict_class`, and `get_response`.

The `clean_up_sentence` function is responsible for preprocessing a given sentence. It tokenizes the sentence into individual words using NLTK's `word_tokenize` function and then lemmatizes each word to reduce it to its base form. Additionally, all words are converted to lowercase to ensure uniformity. This preprocessing step is crucial for standardizing the input text, making it easier to match against the model's vocabulary.

The `bow` (Bag of Words) function transforms a sentence into a numerical representation. It calls `clean_up_sentence` on the raw input and compares the resulting tokens against the vocabulary (`words`). Every vocabulary position whose word occurs in the sentence is set to 1; all other positions stay 0, and tokens that are not in the vocabulary are ignored. The resulting binary vector has the same length and word order as the training vectors, so it can be passed directly to the model.
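The same encoding can be sketched with a precomputed word-to-position lookup, which avoids scanning the whole vocabulary for every token and makes the handling of unknown words explicit. This is an alternative formulation, not the notebook's code; `bow_indexed` and the sample vocabulary are hypothetical, and the tokens are assumed to be already lowercased and lemmatized.

```python
def bow_indexed(tokens, word_index):
    """Binary bag-of-words using a precomputed word -> position lookup."""
    bag = [0] * len(word_index)
    for token in tokens:
        position = word_index.get(token)
        if position is not None:  # out-of-vocabulary tokens are skipped
            bag[position] = 1
    return bag

vocabulary = ["bye", "hello", "how", "there", "you"]  # hypothetical
word_index = {word: i for i, word in enumerate(vocabulary)}

print(bow_indexed(["hello", "you", "robot"], word_index))  # [0, 1, 0, 0, 1]
print(bow_indexed(["pizza", "please"], word_index))        # [0, 0, 0, 0, 0]
```

The second call shows a sentence with no known words: the model still receives a valid input, an all-zero vector, and will still output a probability for every class.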

The `predict_class` function uses the bag-of-words representation to predict the intent of the input sentence. It takes the sentence, the trained model, the vocabulary (`words`), and the list of intent classes (`classes`) as inputs. First, it generates the bag-of-words vector for the sentence using the `bow` function. Then, it feeds this vector into the trained model to obtain a probability distribution over all possible intents. Only intents with a probability above `ERROR_THRESHOLD` (0.25 in this case) are kept. The function sorts these predictions by probability in descending order and returns them as a list of dictionaries, each holding the intent name and its probability. The probability is stored as a string, so callers that need to compare scores must convert it back to a float.
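The filtering and ranking step can be tried in isolation on a hand-written probability vector. In this sketch, `rank_intents`, the class names, and the probabilities are all made up for illustration; in the notebook the probabilities come from `model.predict`.

```python
ERROR_THRESHOLD = 0.25

def rank_intents(probabilities, classes, threshold=ERROR_THRESHOLD):
    """Keep classes whose probability exceeds the threshold, best first."""
    kept = [(i, p) for i, p in enumerate(probabilities) if p > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [{"intent": classes[i], "probability": str(p)} for i, p in kept]

# Hypothetical class list, sorted alphabetically like `classes`.
classes = ["goodbye", "greeting", "hours", "thanks", "weather"]

confident = rank_intents([0.05, 0.60, 0.30, 0.03, 0.02], classes)
print(confident)
# [{'intent': 'greeting', 'probability': '0.6'}, {'intent': 'hours', 'probability': '0.3'}]

uncertain = rank_intents([0.2, 0.2, 0.2, 0.2, 0.2], classes)
print(uncertain)  # []
```

With five or more classes, a near-uniform distribution can leave every probability at or below 0.25, so the returned list may be empty and callers should be prepared for that case.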

Finally, the `get_response` function generates a response based on the predicted intent. It takes the list of predicted intents and a JSON object containing all intents and their associated responses. The function takes the first entry of the list, which is the most probable intent because `predict_class` sorts its results, and searches for its tag in the JSON object. Once found, it randomly selects a response from the list of predefined responses for that intent and returns it. Choosing at random makes the chatbot's replies less repetitive.

Together, these functions form the inference path of the chatbot: preprocess the user input, predict its intent, and select a response. Because each step is a separate function, any one of them can be replaced or tested independently.
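How the pieces connect can be illustrated without a trained model. In the sketch below, `respond`, `fake_predict`, the sample intents, and the fallback text are all hypothetical; `fake_predict` is a keyword-based stand-in for `predict_class`, and the guard covers the case where no intent clears the threshold.

```python
import random

def respond(sentence, predict_fn, intents_json,
            fallback="Sorry, I did not understand that."):
    """Map a sentence to a reply; predict_fn returns ranked intent dicts."""
    ranked = predict_fn(sentence)
    if not ranked:  # nothing cleared the threshold
        return fallback
    top_tag = ranked[0]["intent"]
    for intent in intents_json["intents"]:
        if intent["tag"] == top_tag:
            return random.choice(intent["responses"])
    return fallback  # predicted tag is missing from the intents

# Hypothetical intents and a keyword-based stand-in for the trained model.
intents_json = {"intents": [
    {"tag": "greeting", "responses": ["Hello!", "Hi there!"]},
    {"tag": "goodbye", "responses": ["Goodbye!"]},
]}

def fake_predict(sentence):
    text = sentence.lower()
    if "bye" in text:
        return [{"intent": "goodbye", "probability": "0.9"}]
    if "hello" in text:
        return [{"intent": "greeting", "probability": "0.8"}]
    return []

print(respond("Bye for now", fake_predict, intents_json))           # Goodbye!
print(respond("Hello", fake_predict, intents_json))                 # one of the greeting responses
print(respond("What is the weather?", fake_predict, intents_json))  # fallback text
```

With the notebook's objects loaded, the stand-in would be replaced by something like `lambda s: predict_class(s, model, words, classes)`.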