# Intent discovery in the Banking77 dataset

* Build an NLU component
    * CKY parser (dynamic-programming algorithm for context-free grammars)
        * TODO: find a production-grade library

* Impact
    * automatic dataset labelling --> less labor-intensive annotation
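
To make the CKY idea concrete before a library is chosen, here is a minimal recognizer sketch. It is not the notebook's implementation: the grammar, the lexicon and the function name `cky_recognize` are hypothetical and exist only for illustration. The grammar is assumed to be in Chomsky normal form, which is what the CKY chart-filling scheme requires.

```python
from itertools import product

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
# Binary rules map a pair of right-hand-side symbols to the symbols that produce them.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("Det", "N"): {"NP"},
}
# Lexical rules map a word to its possible preterminal symbols.
LEXICAL_RULES = {
    "i": {"NP"},
    "lost": {"V"},
    "my": {"Det"},
    "card": {"N"},
}

def cky_recognize(tokens, start="S"):
    """Return True if the toy grammar derives `tokens` from `start`."""
    n = len(tokens)
    if n == 0:
        return False
    # chart[i][j] holds every symbol that can span tokens[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # spans of width 1 come from the lexicon
    for i, tok in enumerate(tokens):
        chart[i][i + 1] = set(LEXICAL_RULES.get(tok, ()))
    # wider spans are built bottom-up from two adjacent sub-spans
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n]

print(cky_recognize("i lost my card".split()))  # True
print(cky_recognize("card my lost i".split()))  # False
```

The chart is filled in order of increasing span width, so every sub-span is complete before it is combined; this is the dynamic-programming step, and it costs O(n^3) in the sentence length for a fixed grammar.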


## Setup  
### Dependencies

In [17]:
import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report

# text preprocessing
import nltk
import re
import numpy as np
nltk.download('punkt') # 13 MB zip containing pretrained punkt sentence tokenizer (Kiss and Strunk, 2006)
import time

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/steeve_laquitaine/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Paths

In [2]:
proj_path = "/Users/steeve_laquitaine/desktop/CodeHub/intent/intent/"
train_data_path = proj_path + "data/01_raw/banking77/train.csv"
test_data_path = proj_path + "data/01_raw/banking77/test.csv"

## Load data

Most of the public intent corpora we found were small. Banking77, a task-oriented banking dataset, is the exception, so we use it as the benchmark.

### Load

In [3]:
train_data  = pd.read_csv(train_data_path)
test_data  = pd.read_csv(test_data_path)

In [4]:
# preview
train_data.head(5)

|   | text | category |
|---|------|----------|
| 0 | I am still waiting on my card? | card_arrival |
| 1 | What can I do if my card still hasn't arrived ... | card_arrival |
| 2 | I have been waiting over a week. Is the card s... | card_arrival |
| 3 | Can I track my card while it is in the process... | card_arrival |
| 4 | How do I know if I will get my card, or if it ... | card_arrival |


### Preview

In [5]:
# preview
test_data.head(5)

|   | text | category |
|---|------|----------|
| 0 | How do I locate my card? | card_arrival |
| 1 | I still have not received my new card, I order... | card_arrival |
| 2 | I ordered a card but it has not arrived. Help ... | card_arrival |
| 3 | Is there a way to know when my card will arrive? | card_arrival |
| 4 | My card has not arrived yet. | card_arrival |


### Normalize columns

In [6]:
def standardize_col_names(data: pd.DataFrame) -> pd.DataFrame:
    """Rename the label column from "category" to "intent"; "text" is kept as-is."""
    return data.rename(columns={"category": "intent"})

In [7]:
train_data = standardize_col_names(train_data)
test_data = standardize_col_names(test_data)

### Summary description

In [8]:
print("\nValue count:\n")
print(train_data.count())
print("\nUnique values:\n")
print(train_data.nunique())


Value count:

text      10003
intent    10003
dtype: int64

Unique values:

text      10003
intent       77
dtype: int64
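
The counts above show as many unique texts as rows, and 77 intents. A natural follow-up check is how evenly the utterances are spread across intents. The sketch below runs on a small hypothetical frame (`toy`) with the same column names as `train_data` after renaming; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame mirroring the columns of train_data after renaming.
toy = pd.DataFrame({
    "text": ["where is my card", "card not arrived", "exchange rate today"],
    "intent": ["card_arrival", "card_arrival", "exchange_rate"],
})

# number of utterances per intent, most frequent first
counts = toy["intent"].value_counts()
print(counts)

# number of repeated utterances (0 means every text is unique)
n_duplicates = int(toy["text"].duplicated().sum())
print(n_duplicates)
```

On the real data the same two calls would be applied to `train_data["intent"]` and `train_data["text"]`.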


In [9]:
train_data.head(1)

|   | text | intent |
|---|------|--------|
| 0 | I am still waiting on my card? | card_arrival |


# Clustering

## Preprocessing
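
The cell in this section strips special characters, lower-cases, tokenizes and drops stop words. As a dependency-free illustration of those four steps on a single utterance, here is a small sketch. It is a simplification under stated assumptions: `normalize_sketch` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is a tiny hand-picked stand-in for NLTK's English list, and whitespace splitting stands in for `nltk.word_tokenize`.

```python
import re

# Hypothetical, hand-picked stop words; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"i", "am", "on", "my", "a", "the", "is", "what", "can", "do", "if"}

def normalize_sketch(doc: str) -> str:
    # 1. drop everything except letters, digits and whitespace
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
    # 2. lower-case and trim surrounding whitespace
    doc = doc.lower().strip()
    # 3. split on whitespace (stand-in for nltk.word_tokenize)
    tokens = doc.split()
    # 4. drop stop words and re-join the remaining tokens
    return " ".join(tok for tok in tokens if tok not in TOY_STOP_WORDS)

print(normalize_sketch("I am still waiting on my card?"))  # still waiting card
```

Note that removing the apostrophe in step 1 turns "hasn't" into "hasnt", which is why that token appears in the normalized corpus.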

In [27]:
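# prep: the stop-word list must be downloaded once before first use
nltk.download('stopwords', quiet=True)
stop_words = set(nltk.corpus.stopwords.words('english'))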

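def normalize_document(doc: str) -> str:
    """
    Normalize a document: strip special characters, lower-case,
    tokenize and drop English stop words.

    parameters:
    ---------
    doc: raw text of one document

    return
    ------
    doc: normalized text, tokens joined by single spaces

    """
    # drop special characters; flags must be passed by keyword,
    # because the 4th positional argument of re.sub is `count`
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # lower case and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()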
    
    # tokenize
    tokens = nltk.word_tokenize(doc)
    
    # drop stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
    # re-create doc from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

# time
tic = time.time()

# wrap the function so that it applies element-wise to a sequence of documents
normalize_corpus = np.vectorize(normalize_document)

# normalize the whole training corpus
norm_corpus = normalize_corpus(list(train_data['text']))
len(norm_corpus)
print(f"(normalize_document) took: {round(time.time()-tic,2)} secs")

# show
print("\nPreview:")

norm_corpus

(normalize_document) took: 1.98 secs

Preview:


array(['still waiting card', 'card still hasnt arrived 2 weeks',
       'waiting week card still coming', ..., 'countries getting support',
       'cards available eu', 'countries represented'], dtype='<U309')

# References

(1) https://www.nltk.org/_modules/nltk/ccg/chart.html  
(2) https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch07 - Text Similarity and Clustering/Ch07c - Document Clustering.ipynb  