# REFERENCES

**Information Extraction**
- https://www.analyticsvidhya.com/blog/2020/06/nlp-project-information-extraction/
- https://medium.com/analytics-vidhya/introduction-to-information-extraction-using-python-and-spacy-858f5d6416ca

**Chatbot**
- https://medium.com/predict/create-your-chatbot-using-python-nltk-761cd0aeaed3
- https://medium.com/swlh/a-chatbot-in-python-using-nltk-938a37a9eacc

**Intent**
- https://medium.com/walmartglobaltech/joint-intent-classification-and-entity-recognition-for-conversational-commerce-35bf69195176
- https://medium.com/analytics-vidhya/machine-learning-intent-classification-221ecded7c74
- https://colab.research.google.com/github/deepmipt/dp_notebooks/blob/master/DP_autoFAQ.ipynb (!)
- https://towardsdatascience.com/a-brief-introduction-to-intent-classification-96fda6b1f557
- https://medium.com/artefact-engineering-and-data-science/nlu-benchmark-for-intent-detection-and-named-entity-recognition-in-call-center-conversations-f58e5b4c8d3d
- https://medium.com/iambot/ai-assistance-with-pytext-6308d896566d

**NER**
· Simple Entities
· Composite Entities
· Entity Roles
· Entity Lists
· Regular Expressions
· Prebuilt Models
- https://github.com/DhruvilKarani/NER-Blog/blob/master/analysis.ipynb
- https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
- https://towardsdatascience.com/named-entity-recognition-ner-using-keras-bidirectional-lstm-28cd3f301f54
- https://towardsdatascience.com/named-entity-recognition-ner-meeting-industrys-requirement-by-applying-state-of-the-art-deep-698d2b3b4ede
- https://towardsdatascience.com/deep-learning-for-named-entity-recognition-3-reusing-a-bidirectional-lstm-cnn-on-clinical-text-e84bd28052df
- https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b
- https://medium.com/swlh/custom-natural-language-processing-831a9a8d3dfc

## NLP

In [1]:
import nltk
import random
import string
import re

import tensorflow_datasets as tfds
import numpy as np
import pandas as pd

from nltk.stem import wordnet  # to perform lemmatization
from nltk import pos_tag  # for parts of speech
from nltk import word_tokenize  # to create tokens

In [2]:
# # intent + entity type
# intent_sets = ['greet', 'new', 'update', 'finish', 'query', 'no']
# # possible data types: tabular, image, and text (?) -- training and testing set for each
# slot_sets = {
#     "task": [], # regression or classification
#     "data_source": [], # upload, url, or built-in
#     "target_variable": [], # specific name or undefined
#     "dataset": [], # dataset name or filepath
#     "delivery": [], # on-web or email
# }

# # from user query sentences
# constructed_pipeline = ""

In [3]:
# # possible tasks: tabular classification, tabular regression, image classification, image regression, text classification, and text regression
# states = ['standby', 'inquire', 'inference', 'running', 'deliver'] # possible states of the CA and AutoML Engine
# user_slot = {"method": None, "task": None, "data_source": None, "dataset": None, "target": None, "delivery": 'chat'}

# def intent_classification():
#     return intent

# error handling
# generating the dataset (may use dictionary-based synonym replacement)
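The dictionary-based synonym replacement mentioned in the note above could be sketched as follows. This is only an illustration under stated assumptions: the `SYNONYMS` table and the `expand_with_synonyms` helper are hypothetical names that do not appear elsewhere in this notebook, and the entries are made up for the example.

```python
from itertools import product

# Hypothetical synonym dictionary; the entries are illustrative only.
SYNONYMS = {
    "build": ["build", "create", "train"],
    "picture": ["picture", "image", "photo"],
}


def expand_with_synonyms(sentence, synonyms):
    """Return every variant of `sentence` obtained by swapping in synonyms."""
    # Each token contributes its list of alternatives, or just itself.
    options = [synonyms.get(token, [token]) for token in sentence.split()]
    return [" ".join(choice) for choice in product(*options)]


variants = expand_with_synonyms("build a picture classifier", SYNONYMS)
print(len(variants))  # 3 * 1 * 3 * 1 = 9 variants
print(variants[0])    # build a picture classifier
```

Enumerating every combination keeps the output deterministic; for long sentences with many synonyms, sampling a subset would probably be more practical.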

In [4]:
# list of available Datasets (built-in or .csv, .txt, .xls, folders with .png or .jpg), Algorithms (just simply ML or DL), and Tasks

In [5]:
# simple keyword matching?
# Rule-based grammar matching

def text_normalization(text):
    # text = str(text).lower()  # text to lower case
    # stop_words = set(stopwords.words('english'))
    # stop_words.add('please')
    text = re.sub(r'-', '', text)  # join hyphenated words, e.g. "e-mail" -> "email"
    text = re.sub(r'[^a-zA-Z0-9_]', ' ', text)  # replace special characters with spaces
    tokens = nltk.word_tokenize(text)  # word tokenizing
    lema = wordnet.WordNetLemmatizer()  # initializing lemmatization
    tags_list = pos_tag(tokens, tagset=None)  # parts of speech
    lema_words = []  # lemmatized tokens
    for token, pos_token in tags_list:
        if pos_token.startswith('V'):  # Verb
            pos_val = 'v'
        elif pos_token.startswith('J'):  # Adjective
            pos_val = 'a'
        elif pos_token.startswith('R'):  # Adverb
            pos_val = 'r'
        else:
            pos_val = 'n'  # Noun
        lema_token = lema.lemmatize(token, pos_val)  # performing lemmatization
        lema_words.append(lema_token)  # appending the lemmatized token to the list

    # lema_words = pos_tag(lema_words)
    # text = [item for item in lema_words if item[0].lower() not in stop_words]
    lema_words = [item for item in lema_words if item.lower() not in ['a', 'an', 'the']]  # drop articles
    return " ".join(lema_words)  # returns the lemmatized tokens as a sentence
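One pitfall worth checking in character classes like the one used to strip special characters: the range `A-z` (lowercase `z`) covers not only the letters but also the ASCII characters that sit between `Z` and `a`, namely `[`, `\`, `]`, `^`, `_` and the backtick. The sketch below compares the two spellings; the names `LOOSE`, `STRICT` and `sample` are introduced here only for illustration.

```python
import re

LOOSE = re.compile(r"[^a-zA-z0-9_]")   # `A-z` also spans [ \ ] ^ _ and the backtick
STRICT = re.compile(r"[^a-zA-Z0-9_]")  # letters, digits and underscore only

sample = "train[0] ^ `x`"

# The loose pattern leaves brackets, caret and backticks untouched.
print(repr(LOOSE.sub(" ", sample)))

# The strict pattern replaces each of them with a space.
print(repr(STRICT.sub(" ", sample)))
```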

In [6]:

GREETING_INPUTS = ["hello", "hi", "greetings", "sup", "what's up","hey", "morning", "afternoon", "evening", "night"]
CONFIRM_WORDS = ["yes", "yep", "okay", "ok", "sure", "certainly", "definitely", "absolutely", "go ahead", "cool", "right", "of course"]
DENY_WORDS = ["no", "nope", "na", "not yet", "not sure", "more", "not", "don't", "do not", "again"]
END_WORDS = ["goodbye", "bye", "see you later", "end", "finish", "stop", "not anymore"]
ML_METHODS = ['machine learning', 'ml', 'machine']
DL_METHODS = ['deep learning', 'dl', 'neural network', 'nn', 'neural', 'network']
CLASSIFICATIONS = ['class', 'classify', 'classification', 'classifier', 'discrete output']
REGRESSIONS = ['regress', 'regression', 'regressor', 'continuous output']
DATA_SOURCES = ['upload', '<url>', 'this data', 'this dataset', 'my data', 'my dataset']
IMAGE_TYPES = ['image', 'picture', 'figure', 'art', 'draw', 'photo', 'photograph', 'portrait', 'painting', 'visual', 'illustration', 'symbol', 'view', 'vision', 'sketch', 'icon']
TEXT_TYPES = ['text', 'word', 'message', 'writing', 'script', 'content', 'document', 'passage', 'context', 'essay', 'manuscript', 'paper', 'language', 'letter', 'written', 'write', 'character', 'note', 'draft']
TABLE_TYPES = ['structured', 'structure', 'tabular', 'table', 'relation', 'database', 'dataframe', 'frame', 'normal', 'excel', 'csv', 'file', 'summary', 'process']
# AVAIL_DATASETS = pd.read_csv('openml_datasets.csv')['name'].apply(lambda x: x.lower()).to_list()
AVAIL_DATASETS = tfds.list_builders()
TARGET_VAR = ['be dependent variable', 'be target variable', 'be dependent feature', 'be target feature', 'variable be', 'feature be', 'value', 'feature', 'variable', 'predict', 'forecast', 'classify'] # regex ?
DELIVERY = ['by email', 'by e-mail', 'email', 'e-mail']

GREETING_RESPONSES = ["Yep, it's nice to see you here! 🙌🏻", "Hey~", "*nods*", "Hi there!", "Hello", "I am glad! You are talking to me~~~"]
END_RESPONSES = ["See you then! 🙌🏻", "Bye~", "Good luck!", "Hope to see you again~", "Goodbye!~", "Thanks~"]
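Short keywords such as `'nn'`, `'ml'`, `'hi'` or `'art'` can also occur inside longer words ("cannot", "html", "this", "start") when they are checked with a plain substring test. A minimal sketch of whole-word matching follows, assuming a hypothetical helper named `contains_keyword` that is not part of this notebook.

```python
import re


def contains_keyword(message, keywords):
    """Return the first keyword found in `message` as a whole word or phrase."""
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape guards punctuation.
        if re.search(rf"\b{re.escape(keyword)}\b", message):
            return keyword
    return None


print("nn" in "i cannot share this art")                          # True
print(contains_keyword("i cannot share this art", ["nn", "hi"]))  # None
print(contains_keyword("use deep learning", ["deep learning"]))   # deep learning
```

Keywords that begin or end with a non-word character, such as `'<url>'`, would need separate handling because `\b` only matches next to a word character.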

In [7]:
user_slot = {"method": None, "task": None, "data_source": None,
             "data_type": None, "dataset": None, "target": None, "delivery": 'chat'}


def reset_slot():
    global user_slot
    user_slot = {"method": None, "task": None, "data_source": None,
                 "data_type": None, "dataset": None, "target": None, "delivery": 'chat'}


def is_slot_complete():
    # every slot must be filled before the request can be confirmed
    return all(value is not None for value in user_slot.values())


def response_for_incomplete_slot(slot):
    return ''


def update_slot(msg):
    global user_slot

    for ml in ML_METHODS:
        if ml in msg:
            user_slot['method'] = 'ml'
            break

    for dl in DL_METHODS:
        if dl in msg:
            user_slot['method'] = 'dl'
            break

    for img in IMAGE_TYPES:
        if img in msg:
            user_slot['data_type'] = 'image'
            break

    for txt in TEXT_TYPES:
        if txt in msg:
            user_slot['data_type'] = 'text'
            break

    for table in TABLE_TYPES:
        if table in msg:
            user_slot['data_type'] = 'table'
            break

    for cl in CLASSIFICATIONS:
        if cl in msg:
            user_slot['task'] = 'cls'
            break

    for reg in REGRESSIONS:
        if reg in msg:
            user_slot['task'] = 'reg'
            break

    for ds in AVAIL_DATASETS:
        if ds in msg:
            user_slot['dataset'] = ds
            user_slot['data_source'] = 'built_in'
            user_slot['target'] = 'label'

    for ds in DATA_SOURCES:
        if ds in msg:
            user_slot['data_source'] = 'user_define'

    for d in DELIVERY:
        if d in msg:
            user_slot['delivery'] = 'email'
            break

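    for tv in TARGET_VAR:
        if tv in msg:
            if tv in ['value', 'feature', 'variable', 'be target variable', 'be target feature']:
                words = msg.split(tv)[0].split()  # words before the keyword
                if words:  # avoid IndexError when the keyword starts the message
                    user_slot['target'] = words[-1]
            else:
                words = msg.split(tv)[-1].split()  # words after the keyword
                if words:  # avoid IndexError when the keyword ends the message
                    user_slot['target'] = words[0]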


def standby_state(user_message):
    text = ''
    msg = user_message.lower()
    current_state = 'standby'
    global user_slot

    for greet in GREETING_INPUTS:
        if greet in msg:
            text += random.choice(GREETING_RESPONSES) + '<br/>'
            break

    for end_word in END_WORDS:
        if end_word in msg:
            current_state = 'end'
            text = random.choice(END_RESPONSES)
            break

    if current_state != 'end':
        update_slot(msg)
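        if is_slot_complete():
            text += "All your requests are well received! <br/> Please review the following list. Do you want to proceed? <br/>"
            current_state = 'await'
        else:
            # assumed intent: ask for a file when the user supplies the data but no dataset is set yet
            # (the original compared data_source against both 'user_define' and None, which is never true)
            if user_slot['data_source'] == 'user_define' and user_slot['dataset'] is None:
                text += 'Please upload your data file (.csv, .txt, .zip) below.'

            current_state = 'active'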
    else:
        current_state = 'standby'

    return text, current_state, user_slot


def active_state(user_message, await_feature=None):
    text = ''
    msg = user_message.lower()
    current_state = 'active'
    global user_slot

    for greet in GREETING_INPUTS:
        if greet in msg:
            text += random.choice(GREETING_RESPONSES) + ' (again) <br/>'
            break

    for end_word in END_WORDS:
        if end_word in msg:
            current_state = 'standby'
            text = random.choice(END_RESPONSES)
            break

    if current_state != 'standby':
        if await_feature is None:
            update_slot(msg)

        if is_slot_complete():
            text += "All your requests are well received! <br/> Please review the model's specs. Would you like to proceed? <br/>"
            current_state = 'await'
        else:
            # assumed intent: ask for a file when the user supplies the data but no dataset is set yet
            if user_slot['data_source'] == 'user_define' and user_slot['dataset'] is None:
                text += 'Please upload your data file (.csv, .txt, .zip) below.'

            current_state = 'active'

    return text, current_state, user_slot


def await_state(user_message):
    text = ''
    msg = user_message.lower()
    current_state = 'await'
    global user_slot

    for greet in GREETING_INPUTS:
        if greet in msg:
            text += 'Still wanna greet now?, 😂 <br/>'
            break

    for end_word in END_WORDS:
        if end_word in msg:
            current_state = 'standby'
            text = random.choice(END_RESPONSES)
            break

    for con in CONFIRM_WORDS:
        if con in msg:
            current_state = 'building'
            break

    for den in DENY_WORDS:
        if den in msg:
            current_state = 'await'
            text += '🤔 Umm... Please check your requirement~ <br/>'
            break

    if current_state != 'building' and current_state != 'standby':
        if is_slot_complete():
            text += "I've got all the needed information as follows. <br/> Do you want to proceed? <br/>"
            current_state = 'await'

    return text, current_state, user_slot


def building_state(user_message):
    text = ''
    current_state = 'building'
    global user_slot

    return text, current_state, user_slot


def get_response(current_state, user_message, await_feature):
    filtered_text = text_normalization(user_message)

    # map each state to its handler and call only the one for the current state,
    # so the other handlers (and their slot updates) are not run on every message
    handlers = {
        'standby': lambda: standby_state(filtered_text),
        'active': lambda: active_state(filtered_text, await_feature),
        'await': lambda: await_state(filtered_text),
        'building': lambda: building_state(filtered_text),
    }

    return handlers[current_state]()
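The `response_for_incomplete_slot` stub above returns an empty string. One possible way to fill it in is to ask about the first slot that is still `None`. The sketch below is only an assumption about how that could look: `SLOT_PROMPTS`, `missing_slots` and `prompt_for_next_slot` are hypothetical names, and the prompt wording is illustrative.

```python
# Hypothetical prompts, one per slot; the wording is illustrative only.
SLOT_PROMPTS = {
    "method": "Would you like a machine learning or a deep learning model?",
    "task": "Is this a classification or a regression task?",
    "data_source": "Will you upload data or use a built-in dataset?",
    "data_type": "Is your data image, text, or tabular?",
    "dataset": "Which dataset should I use?",
    "target": "Which variable should the model predict?",
}


def missing_slots(slot):
    """Return the names of slots that are still None, in insertion order."""
    return [name for name, value in slot.items() if value is None]


def prompt_for_next_slot(slot):
    """Return a question for the first unfilled slot, or '' if none is missing."""
    missing = missing_slots(slot)
    if not missing:
        return ""
    return SLOT_PROMPTS.get(missing[0], f"Please tell me the {missing[0]}.")


example = {"method": "dl", "task": None, "data_source": None,
           "data_type": "image", "dataset": None, "target": None,
           "delivery": "chat"}
print(missing_slots(example))  # ['task', 'data_source', 'dataset', 'target']
print(prompt_for_next_slot(example))
```

Asking one question at a time keeps each bot turn short; listing all missing slots at once would be an equally reasonable design.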

In [10]:
current_state = 'standby'
print("Bot:", "Hi 👋🏻! I'm your model builder🧑🏻‍💻~ Just tell me which model do you want by simply following the examples below👇🏻.")

while current_state != 'building':
    user_query = input()
    print('User:', text_normalization(user_query))
    response, current_state, user_slots = get_response(current_state, user_query, None)
    print('Bot:', response, current_state)

Bot: Hi 👋🏻! I'm your model builder🧑🏻‍💻~ Just tell me which model do you want by simply following the examples below👇🏻.
User: Hi
Bot: Hi there!<br/> active
User: I want deep learning model for image classification with MNIST dataset
Bot: All you requested are well received! <br/> Please reiview the model's specs, do you would like to proceed? <br/> await
User: yes
Bot:  building


## AutoML

In [1]:
import tensorflow as tf
import autokeras as ak
import tensorflow_datasets as tfds
import numpy as np
import pandas as pd

import zipfile
import autosklearn
import autosklearn.classification

import sklearn.metrics

from sklearn.utils.multiclass import type_of_target

In [2]:
user_slot = {'method': 'dl', 'task': 'cls', 'data_source': 'built_in', 'data_type': 'image', 'dataset': 'mnist', 'target': 'label', 'delivery': 'chat'}

In [6]:
if user_slot['data_source'] == 'built_in':
    train_ds = tfds.load(user_slot['dataset'], split='train[:10%]+test[-10%:]', as_supervised=True, shuffle_files=True)
else:
    if user_slot['data_type'] == 'image':
        # unzip --> train, test folders
        with zipfile.ZipFile(f"./upload/{user_slot['dataset']}", 'r') as zip_ref:
            zip_ref.extractall('./dataset/')
    elif user_slot['data_type'] == 'text':
        # .csv or .txt
        pass
    else:
        pass
        # .csv or .txt

if user_slot['method'] == 'dl':
    if user_slot['task'] == 'cls':
        if user_slot['data_type'] == 'image':
            clf = ak.ImageClassifier(overwrite=True, max_trials=3, objective='val_accuracy')
            clf.fit(train_ds, epochs=3, validation_split=0.10)
            model = clf.export_model()
            model.save("model.h5")
        elif user_slot['data_type'] == 'text':
            pass  # text classification is not implemented yet
    else:
        if user_slot['data_type'] == 'image':
            reg = ak.ImageRegressor(overwrite=True, max_trials=3)
            reg.fit(train_ds, epochs=3, validation_split=0.10)
            model = reg.export_model()
            model.save("model.h5")
        elif user_slot['data_type'] == 'text':
            pass  # text regression is not implemented yet

else:
    pass
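The nested `if`/`elif` branches above could also be expressed as a lookup table keyed by `(task, data_type)`. The following is a sketch under assumptions, not the notebook's method: `AK_MODELS`, `select_model_name` and `build_model` are hypothetical names, and the class names are those of AutoKeras 1.x (the structured-data classes in particular may not be available in every release).

```python
# Class names as they appear in AutoKeras 1.x (assumption); they are resolved
# lazily so the table can be inspected without importing autokeras.
AK_MODELS = {
    ("cls", "image"): "ImageClassifier",
    ("reg", "image"): "ImageRegressor",
    ("cls", "text"): "TextClassifier",
    ("reg", "text"): "TextRegressor",
    ("cls", "table"): "StructuredDataClassifier",
    ("reg", "table"): "StructuredDataRegressor",
}


def select_model_name(slot):
    """Return the AutoKeras class name for the slot, or None if unsupported."""
    if slot.get("method") != "dl":
        return None
    return AK_MODELS.get((slot.get("task"), slot.get("data_type")))


def build_model(slot, **kwargs):
    """Instantiate the selected AutoKeras model (requires autokeras installed)."""
    name = select_model_name(slot)
    if name is None:
        raise ValueError(f"unsupported combination: {slot}")
    import autokeras as ak  # imported here so the lookup works without it
    return getattr(ak, name)(**kwargs)


slot = {"method": "dl", "task": "cls", "data_type": "image"}
print(select_model_name(slot))  # ImageClassifier
```

With this layout, a call such as `build_model(user_slot, overwrite=True, max_trials=3)` would replace the per-branch constructors, and adding a data type only requires a new table entry.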
        

Trial 2 Complete [01h 27m 41s]
val_accuracy: 0.3147590458393097

Best val_accuracy So Far: 0.6897590160369873
Total elapsed time: 01h 28m 23s

Search: Running Trial #3

Hyperparameter    |Value             |Best Value So Far 
image_block_1/b...|efficient         |vanilla           
image_block_1/n...|True              |True              
image_block_1/a...|True              |False             
image_block_1/i...|True              |None              
image_block_1/i...|False             |None              
image_block_1/i...|0                 |None              
image_block_1/i...|0                 |None              
image_block_1/i...|0.1               |None              
image_block_1/i...|0                 |None              
image_block_1/e...|True              |None              
image_block_1/e...|b7                |None              
image_block_1/e...|True              |None              
image_block_1/e...|True              |None              
classification_...|global_avg    

KeyboardInterrupt: 