# NLP ML Engineer Code Challenge Writeup

This document contains:<br>
I. challenge prompt answers <br>
II. an additional section where I explored data, techniques, results

## I. Prompt answers

### a. How would you evaluate your autocomplete server? b. If you made another version, how would you compare the two to decide which is better?

a. If the software were designed to ingest arbitrary corpora, its ranking quality, returned completions, and run time could be evaluated against published results on known datasets.

An A/B test that records human interaction with the software could provide another means of evaluation. The control would be conversations without autocompletions. Broadly, the metrics would fall along these lines and be tested statistically:

- Prefixes of roughly 3-12 characters should return completion sets.
- The time taken to select a completion should be short.
- When a completion set is offered, a completion should usually be selected.
- Completions that are never or seldom selected should be flagged for re-evaluation of both their wording and their rank in completion lists.
- Completions near the top of an offered list should be selected more often than those near the end.
- With autocompletions, conversation time should decrease and customer-reported satisfaction should rise.
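The interaction metrics above could be computed from logged suggestion events roughly as follows. This is a minimal sketch: the `SuggestionEvent` schema and the `summarize` function are hypothetical names introduced for illustration, not part of the challenge library.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class SuggestionEvent:
    """One logged instance of a prefix being typed (hypothetical schema)."""
    prefix: str
    num_offered: int              # how many completions were shown
    selected_rank: Optional[int]  # 1-based rank of the chosen completion, None if none chosen


def summarize(events: List[SuggestionEvent]) -> dict:
    """Compute coverage, selection rate, and mean reciprocal rank of selections."""
    offered = [e for e in events if e.num_offered > 0]
    selected = [e for e in offered if e.selected_rank is not None]
    return {
        # Share of prefixes that received at least one completion.
        "coverage": len(offered) / len(events) if events else 0.0,
        # Share of offered completion sets from which something was picked.
        "selection_rate": len(selected) / len(offered) if offered else 0.0,
        # Closer to 1.0 means selections come from the top of the list.
        "mean_reciprocal_rank": mean([1 / e.selected_rank for e in selected]) if selected else 0.0,
    }


events = [
    SuggestionEvent("is th", 5, 1),
    SuggestionEvent("how can", 4, 2),
    SuggestionEvent("zzq", 0, None),     # nonsensical prefix, nothing offered
    SuggestionEvent("let me", 3, None),  # completions offered but ignored
]
metrics = summarize(events)
```

A mean reciprocal rank well below 1.0 would suggest that the ranking puts the preferred completions too far down the list.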

Run time should be evaluated and any bottlenecks identified. A test suite of prefixes written by customer service representatives, together with nonsensical prefixes, could be run and the average completion-list length compared between the two sets. Correction of grammar, spelling, capitalization, and punctuation is not yet implemented for completions in the current library, so tests on those attributes of the suggested text would also be useful.

b. I would compare run times, test suite results, and A/B test results between the versions. The next version would detect and separate sentences before matching and ranking, correct grammar and spelling in completions, and rank on more than corpus frequency (perhaps incorporating conversation topic, company information, customer information, and information on the representative), so tests could be designed to measure the comparative efficacy of each of those features. From a development perspective, the logging system should also be evaluated for usefulness: the current one catches exceptions that are too general and logs too verbosely, and both should be improved.
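One way to decide whether a difference in A/B selection rates between two versions is more than noise is a two-proportion z-test. The sketch below uses only the standard library; the function name and the example counts are made up for illustration.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in rates, B minus A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both versions perform the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts: completions selected out of completion sets offered.
z, p_value = two_proportion_z_test(400, 1000, 460, 1000)
```

A small p-value together with a positive `z` would support preferring version B on this metric; run time and test suite results would still need to be compared separately.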

### One way to improve the autocomplete server is to give topic-specific suggestions. How would you design an auto-categorization server? It should take a list of messages and return a TopicId. (Assume that every conversation in the training set has a TopicId).

Each possible completion could store a list of topic IDs, ranked by how often the completion occurs naturally, or is selected from the offered options, in conversations with each topic. Completions that match the prefix and rank highly for the current conversation's topic ID could then be weighted more heavily in the overall match ranking.
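The weighting idea above might look something like the following sketch. The `TopicWeightedRanker` class, its `topic_weight` parameter, and the linear blend of global and per-topic frequency are all assumptions chosen for illustration; other scoring schemes would fit the description equally well.

```python
from collections import Counter, defaultdict


class TopicWeightedRanker:
    """Blend overall completion frequency with per-topic frequency (illustrative)."""

    def __init__(self, topic_weight=0.5):
        self.topic_weight = topic_weight
        self.total_counts = Counter()
        self.topic_counts = defaultdict(Counter)  # topic_id -> completion counts

    def observe(self, completion, topic_id):
        """Record that a completion occurred (or was selected) in a topic."""
        self.total_counts[completion] += 1
        self.topic_counts[topic_id][completion] += 1

    def score(self, completion, topic_id):
        total = sum(self.total_counts.values())
        if total == 0:
            return 0.0
        global_freq = self.total_counts[completion] / total
        topic_counter = self.topic_counts.get(topic_id, Counter())
        topic_total = sum(topic_counter.values())
        topic_freq = topic_counter[completion] / topic_total if topic_total else 0.0
        return (1 - self.topic_weight) * global_freq + self.topic_weight * topic_freq

    def rank(self, candidates, topic_id):
        """Order prefix-matched candidates by blended score, best first."""
        return sorted(candidates, key=lambda c: self.score(c, topic_id), reverse=True)


ranker = TopicWeightedRanker()
for _ in range(2):
    ranker.observe("Your order has shipped.", "shipping")
for _ in range(3):
    ranker.observe("Your refund has been issued.", "billing")

candidates = ["Your order has shipped.", "Your refund has been issued."]
shipping_ranked = ranker.rank(candidates, "shipping")
billing_ranked = ranker.rank(candidates, "billing")
```

For an unseen topic ID the topic term is zero, so the ranking falls back to overall corpus frequency.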

### How would you evaluate if your auto-categorization server is good?


Evaluate topic extraction performance with respect to AUC, precision, recall, and F1, compared against published results on known datasets. Run an A/B test to determine whether topic-weighted completions are selected more quickly and more often than non-topic-weighted completions, and whether conversations with topic-weighted autocompletions are rated more highly by customers and take less time on average.
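Given held-out conversations with known topic IDs, precision, recall, and F1 can be macro-averaged over topics as in this standard-library sketch. The `macro_prf` name and the toy labels are illustrative; in practice a library such as scikit-learn would likely be used instead.

```python
from collections import Counter


def macro_prf(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all topic IDs seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted topic gets a false positive
            fn[truth] += 1  # true topic gets a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with three topic IDs.
true_topics = [1, 1, 2, 2, 3]
predicted_topics = [1, 2, 2, 2, 3]
precision, recall, f1 = macro_prf(true_topics, predicted_topics)
```

Macro-averaging treats every topic equally, which keeps rare topics from being hidden by frequent ones.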

### The autocomplete server data set we gave you is pretty small, so it doesn't take very long to load and it's fine to process on server startup.  How would you change your design if the dataset was several gigabytes?  What if it was 100 terabytes?

I would not process the data set on server startup, but instead store a precomputed autocompletion structure that allows efficient lookups. For data storage, I would evaluate the cost and efficacy of several operational datastores at different scales, plus more archival datastores for redundancy. This is a subject I would need to learn more about to do well.
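As a rough illustration of a precomputed structure, the sketch below builds a prefix-to-top-completions index once, offline, so that serving a request becomes a single dictionary lookup. The function names, `max_prefix_len`, and `top_k` are assumptions for the example; at multi-gigabyte or terabyte scale the build step would presumably run as a batch job and the index would live in an external key-value store rather than in process memory.

```python
from collections import Counter, defaultdict


def build_prefix_index(messages, max_prefix_len=12, top_k=5):
    """Map each lowercase prefix (up to max_prefix_len chars) to its top_k completions."""
    counts = Counter(messages)
    buckets = defaultdict(list)
    for text, freq in counts.items():
        lowered = text.lower()
        for n in range(1, min(max_prefix_len, len(lowered)) + 1):
            buckets[lowered[:n]].append((freq, text))
    index = {}
    for prefix, candidates in buckets.items():
        # Most frequent first; ties broken alphabetically for determinism.
        ranked = sorted(candidates, key=lambda ft: (-ft[0], ft[1]))
        index[prefix] = [text for _, text in ranked[:top_k]]
    return index


def lookup(index, prefix):
    """Serve completions with one dictionary access; unknown prefixes return []."""
    return index.get(prefix.lower(), [])


messages = (
    ["Is there anything else I can help you with today?"] * 3
    + ["Is there anything else I can assist you with?"] * 2
    + ["Hello, how may I help you today?"]
)
index = build_prefix_index(messages)
completions = lookup(index, "Is th")
```

The trade-off is memory for latency: the index stores each completion under many prefixes, and prefixes longer than `max_prefix_len` are not covered by this simple version.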

### What would we need to do if we had millions of agents using the Autocomplete service at the same time?

Implement the service with a framework like [Dask](https://dask.org/) on something like a cluster of AWS EC2 nodes. Serverless computing would help facilitate scalability.

## II. Code, data, and results exploration

In [1]:
import sys
import os
import spacy
import numpy as np
import pandas as pd
import json

In [2]:
from pandas.io.json import json_normalize

In [3]:
AUTOC = os.path.abspath("{}/../../".format("challenge_writeup.ipynb"))
sys.path.insert(0, "{}/library".format(AUTOC))

### Load data

In [4]:
from data_load import data_load as dl

In [5]:
data = dl.DataLoad()

### Prep data

Time `json_normalize`, then inspect the shape and summary of the normalized data:

In [6]:
convos = [convo for convo in data.json['Issues']]

In [7]:
%time convos_normalized = json_normalize(convos, 'Messages', ['CompanyGroupId', 'IssueId'])

CPU times: user 23.6 ms, sys: 1.96 ms, total: 25.6 ms
Wall time: 24.3 ms


In [8]:
convos_normalized.head()

|   | IsFromCustomer | Text | CompanyGroupId | IssueId |
|---|---|---|---|---|
| 0 | True | Hi! I placed an order on your website and I ca... | 1 | 1 |
| 1 | True | I think I used my email address to log in. | 1 | 1 |
| 2 | True | My battery exploded! | 1 | 10001 |
| 3 | True | It's on fire, it's melting the carpet! | 1 | 10001 |
| 4 | True | What should I do! | 1 | 10001 |


Note: We now have one row per text message. The number of rows matches the number of messages ('NumTextMessages': 22264).

In [9]:
convos_normalized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22264 entries, 0 to 22263
Data columns (total 4 columns):
IsFromCustomer    22264 non-null bool
Text              22264 non-null object
CompanyGroupId    22264 non-null int64
IssueId           22264 non-null int64
dtypes: bool(1), int64(2), object(1)
memory usage: 543.6+ KB


Check out time to return customer service reps' messages:

In [10]:
%time outbound = convos_normalized.loc[np.where(np.equal(convos_normalized["IsFromCustomer"], False))]

CPU times: user 4.61 ms, sys: 2.05 ms, total: 6.66 ms
Wall time: 5.16 ms


Take a look at the outbound messages as prepped by the library for use in autocompletions:

In [11]:
from data_prep import data_prep as dp

In [12]:
data_prepped = dp.DataPrep(data.json)

In [13]:
data_prepped.outbound_messages.head()

9             b'Hello Werner how may I help you today?'
11    b'Sure I can help you with that? Could you ple...
13      b'Let me update that information on our system'
14    b'OK Wernzio, I have updated your address to t...
16    b'Ok let me go ahead and request a work order ...
Name: Text, dtype: object

In [14]:
data_prepped.outbound_messages.describe()

count                                                 11060
unique                                                 8452
top       b'Is there anything else I can help you with t...
freq                                                    191
Name: Text, dtype: object

### Autocompletion

In [15]:
from autocomplete import autocompleter as a

In [16]:
autoc = a.Autocompleter()

Take a look at docstrings, code:

In [17]:
??autoc

Check out prefix: 

In [18]:
prefix = a.Prefix("How can")

In [19]:
prefix.compiled

re.compile(rb'How can', re.IGNORECASE)

Check out time and returns of autocomplete suggestions:

In [20]:
%time autoc.generate_completions("is th")

CPU times: user 10.4 ms, sys: 1.38 ms, total: 11.8 ms
Wall time: 10.7 ms


[b'Is there anything else I can help you with today?',
 b'Is there anything else I can assist you with?',
 b'Is there anything else I can help you with?',
 b'Is there anything else I can assist you with today?',
 b'is there anything else i can help you with today']
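The outputs above suggest case-insensitive prefix matching over byte strings, with results ordered by how often each exact message occurs. The sketch below reproduces that behavior in miniature. It is a guess at the approach for illustration only, not the library's actual implementation, and the function signature here (taking the corpus as an argument) is an assumption.

```python
import re
from collections import Counter


def generate_completions(prefix, corpus, limit=5):
    """Return up to `limit` messages starting with `prefix`, most frequent first."""
    # Escape the prefix so characters such as "?" are matched literally.
    pattern = re.compile(re.escape(prefix.encode("utf-8")), re.IGNORECASE)
    # pattern.match anchors at the start of each message.
    counts = Counter(message for message in corpus if pattern.match(message))
    return [message for message, _ in counts.most_common(limit)]


corpus = [
    b"Is there anything else I can help you with today?",
    b"Is there anything else I can help you with today?",
    b"is there anything else i can help you with today",
    b"How can I help?",
]
suggestions = generate_completions("is th", corpus)
```

Because exact strings are counted, variants that differ only in case or punctuation appear as separate completions, as in the last entry of the output above. Normalizing text before counting would merge them.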