<a href="https://colab.research.google.com/github/dcshapiro/wordEmbeddingOttawaAiAlliance/blob/master/Ottawa_AI_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview
In this notebook, you will play around with various natural language processing technologies and tools, focusing on applications.

The exercises in this notebook are:
- Exploring the TensorFlow embedding projector
- Training a FastText model to label sentences
- Training a character-level Keras model to predict the type of a business from the name of the business
- Using a GloVe model from spaCy to analyze synthetic medical chart notes (text) and then correct any associated medical billing errors
- Customizing FastText to a specialized corpus (first using the built-in approach, and then using sidecar)

Let's have a look at the TensorFlow embedding projector at this link:
http://projector.tensorflow.org/

Note: This notebook has a massive selection bias. Most things you try won't work. This notebook is set up to show you highly tuned examples that do work, but alas life is bitterness, and so don't be surprised when you jump in and everything seems a lot harder.

# FastText

Let's install FastText

In [5]:
!pip3 install fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/10/61/2e01f1397ec533756c1d893c22d9d5ed3fce3a6e4af1976e0d86bb13ea97/fasttext-0.9.1.tar.gz (57kB)
     |████████████████████████████████| 61kB 1.9MB/s 
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.1-cp36-cp36m-linux_x86_64.whl size=2387629 sha256=5ae97e8529ed052bb7336cb4293da9f4757a449ad711e2b7c06f8ce1a1632ee4
  Stored in directory: /root/.cache/pip/wheels/9f/f0/04/caa82c912aee89ce76358ff954f3f0729b7577c8ff23a292e3
Successfully built fasttext
Installing c

Let's mount Google Drive as our file system

In [6]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


Now let's get some data to work on (questions from the cooking Stack Exchange site, each tagged with one or more labels)

In [7]:
base_dir="/content/gdrive/My\ Drive/AuditMap_workshop/"
py_base_dir=base_dir.replace("\\","")
print(py_base_dir)

/content/gdrive/My Drive/AuditMap_workshop/


In [0]:
!mkdir {base_dir}
!cd {base_dir} && wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz 
!cd {base_dir} && tar xvzf cooking.stackexchange.tar.gz
!ls -l {base_dir}

mkdir: cannot create directory ‘/content/gdrive/My Drive/AuditMap_workshop/’: File exists
--2019-11-27 19:32:05--  https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.20.6.166, 104.20.22.166, 2606:4700:10::6814:6a6, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.20.6.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-tar]
Saving to: ‘cooking.stackexchange.tar.gz.6’


2019-11-27 19:32:06 (1009 KB/s) - ‘cooking.stackexchange.tar.gz.6’ saved [457609/457609]

cooking.stackexchange.id
cooking.stackexchange.txt
readme.txt
total 16729
-rw------- 1 root root   90095 Apr 28  2017 cooking.stackexchange.id
-rw------- 1 root root  457609 Jan 18  2019 cooking.stackexchange.tar.gz
-rw------- 1 root root  457609 Jan 18  2019 cooking.stackexchange.tar.gz.1
-rw------- 1 root root  457609 Jan 18  2019 cooking.stackexchange.tar.gz.2
-rw-----

Now let's split the data into a training set (the first 12,404 lines) and a validation set (the last 3,000 lines)

In [0]:
!cd {base_dir} && head -n 12404 cooking.stackexchange.txt > cooking.train
!cd {base_dir} && tail -n 3000 cooking.stackexchange.txt > cooking.valid
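The same head/tail split can be sketched in plain Python, which makes it easier to check that the two sets do not overlap. This is a minimal sketch on toy lines; the function name `split_head_tail` is made up for illustration and is not part of FastText.

```python
def split_head_tail(lines, n_train, n_valid):
    """Mimic `head -n n_train` and `tail -n n_valid` on a list of lines."""
    if n_train + n_valid > len(lines):
        # head and tail would share lines, leaking training data into validation
        raise ValueError("train and validation sets would overlap")
    train = lines[:n_train]
    valid = lines[len(lines) - n_valid:]
    return train, valid

# Toy stand-in for the real file: ten labelled questions
lines = ["__label__x question %d" % i for i in range(10)]
train, valid = split_head_tail(lines, 7, 3)
print(len(train), len(valid))
```

Note that a head/tail split keeps the original file order, so if the file is sorted in some way, shuffling the lines first may give a fairer validation set.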

Let's look at the data with our eyes

In [0]:
!head -n 10 {base_dir}/cooking.stackexchange.txt

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
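Each line in this file is a set of labels followed by the question text, and FastText treats every token starting with `__label__` as a label. A minimal sketch of pulling one line apart in plain Python (the helper name `parse_fasttext_line` is an assumption for illustration, not a FastText function):

```python
def parse_fasttext_line(line, prefix="__label__"):
    """Split one FastText-formatted line into (labels, text)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            # strip the prefix to recover the bare tag name
            labels.append(token[len(prefix):])
        else:
            words.append(token)
    return labels, " ".join(words)

line = "__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?"
labels, text = parse_fasttext_line(line)
print(labels)
print(text)
```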


Now let's try out FastText for *supervised learning*

In [0]:
import fasttext
model = fasttext.train_supervised(input=py_base_dir+"cooking.train")
model.save_model(py_base_dir+"model_supervised.bin")
model.predict("What's the purpose of a bread box?")

(('__label__baking',), array([0.1147495]))

# Keras + Character level embedding

OK, so now let's do something cooler than recipe label prediction... Let's predict company type from name of company...
More here: https://towardsdatascience.com/deep-learning-magic-small-business-type-8ac484d8c3bf

It uses character by character embedding as we see here: https://github.com/lemay-ai/smallCompanyType2.0/blob/master/smallCompanyType/smallCompanyType.py
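To give a feel for what "character by character" means, here is a minimal sketch of turning a company name into a fixed-length sequence of integer ids, which is the kind of input a character-level embedding layer expects. The alphabet, the id scheme, and the `encode_name` helper are all assumptions made for illustration; this is not the code used inside smallCompanyType.

```python
import string

# Hypothetical alphabet: lowercase letters, digits, and a few punctuation marks
ALPHABET = string.ascii_lowercase + string.digits + " .'&-"
# id 0 is reserved for padding and id 1 for characters outside the alphabet
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}

def encode_name(name, max_len=32):
    """Lowercase, truncate to max_len characters, map to ids, pad with zeros."""
    ids = [CHAR_TO_ID.get(ch, 1) for ch in name.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))

encoded = encode_name("Jims Garage", max_len=16)
print(encoded)
```

Each id would then be looked up in a trainable embedding table, so the model learns a vector per character rather than per word.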

In [0]:
!pip3 install smallCompanyType



In [0]:
import smallCompanyType as s
import warnings
warnings.filterwarnings('ignore', '.*tensorflow.*',)
warnings.filterwarnings('ignore', '.*OneHotEncoder.*',)

b=s.SmallCompanyType()
texts=["Lemay.ai Night Club","Farah's variety","felding and associates","Lemay.ai Consulting", "Jims Garage"]
for text in texts:
    ctype = b.getCompanyType(text)
    csubtype = b.getCompanySubtype(text)
    print(text,"is a",ctype,csubtype)

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
Lemay.ai Night Club is a B2BC Entertainment Services
Farah's variety is a B2C Retail Dealer
felding and associates is a B2C Retail Dealer
Lemay.ai Consulting is a B2BC Office
Jims Garage is a B2C Retail Dealer


Well, that was not super satisfying, and sort of high-level. Can we go a bit deeper and get our hands dirtier? Why, yes! Yes we can!

# spaCy's GloVe vectors

Let's look at a model for detecting errors in medical billing codes.

In [21]:
!pip install names
!python -m spacy download en_core_web_lg

Collecting names
  Downloading https://files.pythonhosted.org/packages/44/4e/f9cb7ef2df0250f4ba3334fbdabaa94f9c88097089763d8e85ada8092f84/names-0.3.0.tar.gz (789kB)
     |████████████████████████████████| 798kB 2.7MB/s 
Building wheels for collected packages: names
  Building wheel for names (setup.py) ... done
  Created wheel for names: filename=names-0.3.0-cp36-none-any.whl size=803688 sha256=bbfc34752a984c5f51a2a8c11e96b0cd1fa18c2b04952dcfcaaa9116c6991478
  Stored in directory: /root/.cache/pip/wheels/f9/a5/e1/be3e0aaa6fa285575078fa2aafd9959b45bdbc8de8a6803aeb
Successfully built names
Installing collected packages: names
Successfully installed names-0.3.0
Collecting en_core_web_lg==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.1.0/en_core_web_lg-2.1.0.tar.gz (826.9MB)
     |████████████████████████████████| 826.9MB 62.2MB/s 
Building wheels for collected packages: en-core-web-lg
  Building w

Import libraries

In [0]:
from random import randint
from random import shuffle  
import pandas as pd
import random
import spacy
import names
import time

Define a helper function to statistically generate a dataset of patient records including notes and billing codes.

In [0]:
def nextVisitAction(item,bodyParts,psychDisorders,vaccines):
    # Returns (note text, billing code) for one possible event; code 0 means nothing billable
    chance = random.randint(1,100)
    # compare strings with ==, not "is" (identity checks on strings are unreliable)
    if item == 'bodyPart':
        #15% chance that dr writes the patient was treated on body part (CODE1)
        if chance <=15:
            note='Patient '+random.choice(bodyParts)+' was injured and was treated.'
            return note, 1
        else:
            return '', 0
    elif item == 'psychDisorder':
        #20% chance that dr writes the patient was treated for a mental disorder (CODE2)
        if chance <=20:
            note='Patient was diagnosed with and treated for '+random.choice(psychDisorders) +'.'
            return note, 2
        else:
            return '', 0
    elif item == 'vaccine':
        if chance <=5:
            #5% chance that dr writes the patient was treated with a vaccine (CODE3)
            note='Patient was administered the vaccine '+random.choice(vaccines) +'.'
            return note, 3
        else:
            return '', 0
    elif item == 'hasCold':
        if chance <=15:
            return 'It appears the patient has a mild virus.', 0
        else:
            return '', 0
    elif item == 'catAllergy':
        if chance <=15:
            if chance <=7:
                return 'The patient is mildly allergic to cats.', 0
            return 'The patient is deathly allergic to cats.', 0
        else:
            return '', 0
    elif item == 'dogAllergy':
        if chance <=15:
            if chance <=7:
                return 'The patient is mildly allergic to dogs.', 0
            return 'The patient is deathly allergic to dogs.', 0
        else:
            return '', 0
    elif item == 'lactoseIntolerant':
        if chance <=15:
            if chance <=7:
                return 'The patient is lactose intolerant.', 0
            return 'The patient is very lactose intolerant.', 0
        else:
            return '', 0
    elif item == 'looksPale':
        if chance <=15:
            return 'The patient looks very pale.', 0
        else:
            return '', 0
    return item+'MISTAKE. Probably a BUG. WHAAAA!', 0

Define a function for data generation. 
### Important: 
*realCodes* is the correct dataset of billing codes for a patient visit

*newCodes* is the noisy data (with errors added at random to simulate human error)

In [0]:
def getRecords(stamp,mistakeChance=1,recordsToGenerate=1000):
    bodyParts = ['ankle', 'arch', 'arm', 'armpit', 'beard', 'breast', 'calf', 'cheek', 'chest', 'chin', 'earlobe', 'elbow', 'eyebrow', 'eyelash', 'eyelid', 'face', 'finger', 'forearm', 'forehead', 'gum', 'heel', 'hip', 'index finger', 'jaw', 'knee', 'knuckle', 'leg', 'lip', 'mouth', 'mustache', 'nail', 'neck', 'nostril', 'palm', 'pinkie', 'pupil', 'scalp', 'shin', 'shoulder', 'sideburns', 'thigh', 'throat', 'thumb', 'tongue', 'tooth', 'waist', 'wrist']
    psychDisorders = [
        'Alcohol Addiction','Drug Addiction','Caffeine Addiction','Cannabis Addiction','Hallucinogen Addiction','Inhalant Addiction','Opioid Addiction','Sedative, Hypnotic, Anxiolytic Addiction','Stimulant Addiction','Tobacco Addiction','Gambling Addiction',
        'Agoraphobia','Generalized Anxiety Disorder','Panic Disorder','Selective Mutism','Separation Anxiety Disorder','Social Anxiety Disorder','Specific Phobias',
        'Bipolar Disorder','Cyclothymia','Other Bipolar Disorders',
        'Major Depression','Dysthymia (now called Persistent Depressive Disorder)','Postpartum Depression','Premenstrual Dysphoric Disorder','Seasonal Affective Disorder',
        'Depersonalization / Derealization Disorder','Dissociative Amnesia','Dissociative Fugue','Dissociative Identity Disorder','Other Dissociative Disorders',
        'Anorexia Nervosa','Binge Eating Disorder','Bulimia Nervosa','Pica',
        'Conduct Disorder','Intermittent Explosive Disorder','Kleptomania','Oppositional Defiant Disorder','Pyromania',
        'Alzheimer’s Disease','Amnestic Disorder','Delirium','Huntington’s Disease','Neurocognitive Disorder (formerly called Dementia)','Parkinson’s Disease','Other Neurocognitive Disorders',
        'Asperger’s Syndrome','Attention Deficit Hyperactivity Disorder','Autism Spectrum Disorder','Childhood Disintegrative Disorder','Childhood Onset Fluency Disorder','Dyslexia','Intellectual Development Disorder','Language Disorder','Learning Disorders','Retts Disorder','Tourettes Syndrome','Other Neurodevelopmental Disorders',
        'Body Dysmorphic Disorder','Obsessive-Compulsive Disorder','Trichotillomania','Other Obsessive-Compulsive Disorders',
        'Antisocial Personality Disorder','Avoidant Personality Disorder','Borderline Personality Disorder','Dependent Personality Disorder','Histrionic Personality Disorder','Narcissistic Personality Disorder','Obsessive-Compulsive Personality Disorder','Paranoid Personality Disorder','Schizoid Personality Disorder','Schizotypal Personality Disorder','Other Personality Disorders',
        'Brief Psychotic Disorder','Delusional Disorder','Schizoaffective Disorder','Schizophrenia','Shared Psychotic Disorder','Other Psychotic Disorders',
        'Breathing-Related Sleep Disorder','Circadian Rhythm Disorders','Hypersomnia','Insomnia','Narcolepsy','Nightmare Disorder','Non Rapid Eye Movement','REM Sleep Behavior Disorder','Restless Leg Syndrome','Sleep Arousal Disorders','Other Sleep Disorders',
        'Conversion Disorder','Factitious Disorder','Hypochondriasis','Malingering','Munchausen Syndrome','Munchausen by Proxy','Somatization Disorder','Other Somatic Disorders',
        'Acute Stress Disorder','Adjustment Disorder','Posttraumatic Stress Disorder','Reactive Attachment Disorder','Other Trauma Disorders']
    vaccines = ['ACAM2000','ActHIB','Adacel','Afluria','AFLURIA QUADRIVALENT','Agriflu','BCG Vaccine','BEXSERO','Biothrax','Boostrix','Cervarix','Comvax','DAPTACEL','Engerix-B','FLUAD','Fluarix','Fluarix Quadrivalent','Flublok','Flublok Quadrivalent','Flucelvax','Flucelvax Quadrivalent','FluLaval','FluLaval Quadrivalent','FluMist','FluMist Quadrivalent','Fluvirin','Fluzone Quadrivalent','Fluzone, Fluzone High-Dose and Fluzone Intradermal','Gardasil','Gardasil 9','Havrix','HEPLISAV-B','Hiberix','Imovax','Infanrix','IPOL','Ixiaro','JE-Vax','KINRIX','M-M-R II','M-M-Vax','Menactra','MenHibrix','Menomune-A/C/Y/W-135','Menveo','Pediarix','PedvaxHIB','Pentacel','Pneumovax 23','Poliovax','Prevnar 13','ProQuad','Quadracel','RabAvert','Recombivax HB','ROTARIX','RotaTeq','SHINGRIX','TENIVAC','TICE BCG','TRUMENBA','Twinrix','TYPHIM Vi','VAQTA','Varivax','Vaxchora','Vivotif','YF-Vax','Zostavax']
    f=open(py_base_dir+"patientRecords"+stamp+".csv", "a+")
    f.write('visitNote|newCodes|realCodes\n')
    startTime=time.time()
    for i in range(recordsToGenerate):
        if i%100==0:
            secs=max(1,int(time.time()-startTime))
            lng=float(i)
            print("rows=",lng,"%=",100*lng/recordsToGenerate,"s=",secs,"rec/s=",lng/secs)

        actionList=['bodyPart','psychDisorder','vaccine','hasCold','catAllergy','dogAllergy','lactoseIntolerant','looksPale']
        patientName = names.get_full_name()
        visitNote = 'The patient '+patientName+' was assessed in the clinic. '
        realCodes = [1,0,0,0]
        shuffle(actionList)
        # print(actionList)
        for item in actionList:
            visitNoteAppend, newCode = nextVisitAction(item,bodyParts,psychDisorders,vaccines)
            realCodes[newCode]=1
            if len(visitNoteAppend)>2:
                visitNote = visitNote + ' ' + str(visitNoteAppend) #.decode('utf-8')
        newCodes=[0,0,0,0]
        for index in range(len(realCodes)):
            chance = (random.randint(1,100))
            # A mistake happens when chance <= mistakeChance.
            # With the default mistakeChance=1, each code has a 1% chance of being flipped.
            if chance<=mistakeChance:
                #swap codes
                opposite =0
                if realCodes[index] == 0:
                    opposite=1
                newCodes[index]=opposite
            else:
                #copy codes
                newCodes[index]=realCodes[index]
        f.write(str(visitNote)+'|'+ str(newCodes)+ '|'+str(realCodes)+"\n")
    f.close()  
    return
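Since every one of the four codes is flipped independently with probability `mistakeChance/100`, it is worth working out how many noisy records to expect. A minimal sketch of that arithmetic (the helper name `prob_record_has_error` is made up for illustration; the 5,000 is just an example run size):

```python
def prob_record_has_error(mistake_chance_pct=1, n_codes=4):
    """Probability that at least one of n_codes independent codes is flipped."""
    p_flip = mistake_chance_pct / 100.0
    # complement of "no code was flipped"
    return 1.0 - (1.0 - p_flip) ** n_codes

p = prob_record_has_error(1, 4)
expected_noisy = p * 5000
print(round(p, 6), round(expected_noisy))
```

So with a 1% flip chance per code, a little under 4% of records should carry at least one billing error, which means the correct and noisy columns agree on the vast majority of rows.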

Generate 5000 patient treatment records including correct billing codes, and also a set of billing codes containing errors.

In [0]:
stamp=str(time.time())
getRecords(stamp,1,5000)
df=pd.read_csv(py_base_dir+"patientRecords"+stamp+".csv", delimiter="|")

# save the dataframe to an HDF5 store
store = pd.HDFStore(py_base_dir+'patientRecords_'+stamp+'.h5')
store['patientRecords'] = df
store.close()
print('patientRecords_'+stamp+'.h5')
df.head()

rows= 0.0 %= 0.0 s= 1 rec/s= 0.0
rows= 100.0 %= 2.0 s= 1 rec/s= 100.0
rows= 200.0 %= 4.0 s= 1 rec/s= 200.0
rows= 300.0 %= 6.0 s= 1 rec/s= 300.0
rows= 400.0 %= 8.0 s= 2 rec/s= 200.0
rows= 500.0 %= 10.0 s= 3 rec/s= 166.66666666666666
rows= 600.0 %= 12.0 s= 3 rec/s= 200.0
rows= 700.0 %= 14.0 s= 4 rec/s= 175.0
rows= 800.0 %= 16.0 s= 4 rec/s= 200.0
rows= 900.0 %= 18.0 s= 5 rec/s= 180.0
rows= 1000.0 %= 20.0 s= 5 rec/s= 200.0
rows= 1100.0 %= 22.0 s= 6 rec/s= 183.33333333333334
rows= 1200.0 %= 24.0 s= 6 rec/s= 200.0
rows= 1300.0 %= 26.0 s= 7 rec/s= 185.71428571428572
rows= 1400.0 %= 28.0 s= 7 rec/s= 200.0
rows= 1500.0 %= 30.0 s= 8 rec/s= 187.5
rows= 1600.0 %= 32.0 s= 8 rec/s= 200.0
rows= 1700.0 %= 34.0 s= 9 rec/s= 188.88888888888889
rows= 1800.0 %= 36.0 s= 9 rec/s= 200.0
rows= 1900.0 %= 38.0 s= 10 rec/s= 190.0
rows= 2000.0 %= 40.0 s= 10 rec/s= 200.0
rows= 2100.0 %= 42.0 s= 11 rec/s= 190.9090909090909
rows= 2200.0 %= 44.0 s= 11 rec/s= 200.0
rows= 2300.0 %= 46.0 s= 12 rec/s= 191.66666666666666
r

Unnamed: 0,visitNote,newCodes,realCodes
0,The patient Joe Ford was assessed in the clini...,"[1, 0, 1, 0]","[1, 0, 1, 0]"
1,The patient Linda Carter was assessed in the c...,"[1, 1, 0, 0]","[1, 1, 0, 0]"
2,The patient Juan Veile was assessed in the cli...,"[1, 1, 0, 0]","[1, 0, 0, 0]"
3,The patient Lisa Miller was assessed in the cl...,"[1, 0, 0, 0]","[1, 0, 0, 0]"
4,The patient Mark Wills was assessed in the cli...,"[1, 0, 0, 0]","[1, 0, 0, 0]"


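Load model vectors with spaCy. Note that `spacy.load('en')` loads whichever model the `en` shortcut points to, which is typically the small English model; that is consistent with the 96-dimensional vectors in the output further down. To use the GloVe vectors from the large model downloaded above, load it by name with `spacy.load('en_core_web_lg')`. See more here: https://spacy.io/models/en#en_core_web_lg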

In [0]:
nlp = spacy.load('en')
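The next cell turns each visit note into a single vector via `doc.vector`, which in spaCy is by default an average over the per-token vectors. A minimal sketch of that averaging in plain Python, using made-up 3-dimensional toy vectors (real models use far more dimensions, and `mean_vector` is a hypothetical helper, not a spaCy function):

```python
def mean_vector(token_vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Toy vectors invented for illustration only
toy_vectors = {
    "patient": [1.0, 0.0, 2.0],
    "looks":   [0.0, 1.0, 0.0],
    "pale":    [2.0, 2.0, 1.0],
}
note_vector = mean_vector([toy_vectors[w] for w in ["patient", "looks", "pale"]])
print(note_vector)
```

Averaging throws away word order, which is fine for these synthetic notes because the billing codes depend on which phrases appear, not on where they appear.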

Now prepare the dataset for training and testing using a neural network

In [0]:
import numpy as np

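def makeArr(s):
  # parse a string like "[1, 0, 1, 0]" into an integer array
  # (np.int was removed from recent NumPy releases, so use the built-in int)
  return np.array(s[1:-1].split(",")).astype(int)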

x = df.copy(deep=True)

# turn note text into a GloVe vector for each visit note.
def toVectors(row):
    vec = nlp(row["visitNote"])
    return [vec.vector]

x["vec"] = x.apply(toVectors, axis=1)
print('x vector shape = ', x["vec"].shape)

the_x = []
for i in range(len(x["vec"])):
    the_x.append(x["vec"].iloc[i][0])
the_x=np.array(the_x)
print('x location shape = ', the_x.shape)
y = df["realCodes"].apply(makeArr)
print('y (realCodes) shape = ', y.shape)
the_y = []
for i in range(len(y)):
    the_y.append(y[i][:])
the_y=np.array(the_y)

print('y location shape = ', the_y.shape)
print('y location 1:30 shape', the_y[0:30])
#after testing and training do model.predict on df["newCodes"]

x vector shape =  (5000,)
x location shape =  (5000, 96)
y (realCodes) shape =  (5000,)
y location shape =  (5000, 4)
y location 1:30 shape [[1 0 1 0]
 [1 1 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 1 0]
 [1 0 0 1]
 [1 0 1 0]
 [1 0 1 0]
 [1 1 0 0]
 [1 1 1 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 1 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 1 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 1 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 1 0]]


In [0]:
yNew = df["newCodes"].apply(makeArr)
print(yNew.shape)
the_yNew = []
for i in range(len(yNew)):
    the_yNew.append(yNew.iloc[i][:])
    
the_yNew=np.array(the_yNew)
print(the_yNew.shape)
print(the_yNew[0:30])

(5000,)
(5000, 4)
[[1 0 1 0]
 [1 1 0 0]
 [1 1 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 1 0]
 [1 0 0 1]
 [1 0 1 0]
 [1 0 0 0]
 [1 1 0 0]
 [1 1 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 1 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 1 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 1 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 1 0]]


Define the neural network, plug in the data, and launch the model training

In [0]:
%load_ext tensorboard
# %%time
y=the_y
x=the_x

#test/train split
from sklearn.model_selection import train_test_split
import keras
from keras.callbacks import TensorBoard
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Dropout, Activation
from keras.losses import binary_crossentropy,sparse_categorical_crossentropy, mean_squared_error
from keras.optimizers import SGD
import numpy as np

x_train, x_test, y_train, y_test=train_test_split(x, y, test_size=0.20, random_state=42)

y_train=np.hsplit(y_train, y_train.shape[1])
y_test =np.hsplit(y_test, y_test.shape[1])

print(x_train.shape, x_test.shape)
for i in range(len(y_train)):
  print(y_train[i].shape, y_test[i].shape)

# This returns a tensor
inputs = Input(shape=(x_train.shape[1],))
layer1 = Dense(128, activation='relu')(inputs)
# d1 = Dropout(.5)(layer1)
d1=layer1
for i in range(2):
  d1=Dense(64, activation='relu')(d1)
  # d1=Dropout(.5)(d1)

outs=[]
losses=[]
for i in range(y.shape[1]):
  outs.append(Dense(1, activation='sigmoid')(d1))
  losses.append(binary_crossentropy)
model = Model(inputs=inputs, outputs=outs)
model.compile(optimizer="adam", loss=losses, metrics=['accuracy'])
# print(model.summary())


!mkdir {base_dir}.log
tbCallBack = TensorBoard(log_dir=py_base_dir+'/.log', #histogram_freq=1,
                         write_graph=True,
                         write_grads=True,
                         batch_size=512,
                         write_images=True)

model.fit(x_train, y_train, epochs=100,  batch_size=512, callbacks=[tbCallBack])


The tensorboard module is not an IPython extension.
(4000, 96) (1000, 96)
(4000, 1) (1000, 1)
(4000, 1) (1000, 1)
(4000, 1) (1000, 1)
(4000, 1) (1000, 1)
mkdir: cannot create directory ‘/content/gdrive/My Drive/AuditMap_workshop/.log’: File exists
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 5

<keras.callbacks.History at 0x7f4c637cb4a8>
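Each of the four sigmoid outputs above is trained with binary cross-entropy. As a rough sketch of what that loss measures, here it is in plain Python. This is not the Keras implementation, and the `eps` clipping value is an assumption added to avoid taking the log of zero.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of 0/1 targets and probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.99, 0.01])
confident_wrong = binary_crossentropy([1, 0], [0.01, 0.99])
print(confident_right, confident_wrong)
```

A confident correct answer costs almost nothing, while a confident wrong answer is punished heavily, which is what pushes each output toward 0 or 1.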

Let's see how often the billing code corrections are wrong:

In [0]:
score = model.evaluate(x_test, y_test, batch_size=128)
print(score)
fails=0
for i in range(y_test[0].shape[0]): #
    y_true=np.array([y_test[0][i],y_test[1][i],y_test[2][i],y_test[3][i]]).flatten()
    p_x=x_test[i]
    p_x=p_x[np.newaxis,:]
    prediction=model.predict(p_x)
    # uncomment to print every prediction
    #print(y_true,np.around(prediction).flatten().astype(int))
    y_pred=np.around(prediction).flatten().astype(int)
    if not np.array_equal(y_true,y_pred):
      print(y_true,y_pred)
      fails+=1
print(fails,"out of",y_test[0].shape[0],"predictions were wrong")

[0.006845786179183051, 8.623011926829349e-07, 0.0060346907352359265, 0.00039148143856436945, 0.00041875168844126166, 1.0, 0.999, 1.0, 1.0]
[1 1 0 1] [1 0 0 1]
1 out of 1000 predictions were wrong
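The list printed by `model.evaluate` is easier to read once it is split up. For a model with four outputs compiled with one metric, Keras returns the total loss, then one loss per output, then one metric value per output. A small sketch that slices the numbers printed above (the variable names are just for illustration):

```python
# Values copied from the output above
score = [0.006845786179183051, 8.623011926829349e-07, 0.0060346907352359265,
         0.00039148143856436945, 0.00041875168844126166, 1.0, 0.999, 1.0, 1.0]
n_outputs = 4

total_loss = score[0]
per_output_loss = score[1:1 + n_outputs]
per_output_acc = score[1 + n_outputs:]

# The total loss is the sum of the per-output losses (up to rounding)
print(total_loss, sum(per_output_loss))
# The output with the lowest accuracy is the one that produced the single failure
print(per_output_acc.index(min(per_output_acc)))
```

The 0.999 accuracy sits on the second output, which matches the one failed row above, where the second code is 1 in the truth and 0 in the prediction.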


In [0]:
%tensorboard --logdir {base_dir}/.log

UsageError: Line magic function `%tensorboard` not found.
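The point of the exercise is to catch billing errors, so the natural next step is to compare the model's predictions against the recorded (noisy) `newCodes` and flag rows where they disagree. A minimal sketch on toy data, assuming the predicted probabilities have already been gathered into one row per record; `flag_billing_errors` is a hypothetical helper and the numbers are invented:

```python
def flag_billing_errors(recorded_codes, predicted_probs, threshold=0.5):
    """Return (row, recorded, predicted) for rows where the codes disagree."""
    flagged = []
    for row, (recorded, probs) in enumerate(zip(recorded_codes, predicted_probs)):
        predicted = [1 if p > threshold else 0 for p in probs]
        if predicted != list(recorded):
            flagged.append((row, list(recorded), predicted))
    return flagged

# Toy data: the second record was billed for a code the note does not support
recorded = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
probs = [[0.99, 0.01, 0.98, 0.02],
         [0.97, 0.03, 0.01, 0.02],
         [0.99, 0.00, 0.01, 0.00]]
flagged = flag_billing_errors(recorded, probs)
print(flagged)
```

In practice a flagged row is a candidate for human review rather than an automatic correction, since the model itself is occasionally wrong.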


OK, that was pretty cool. However, don't expect this level of quality from real-world datasets.

# Customizing a pretrained model

In [0]:
!pip3 install lemay-ai-sidecar



In [0]:
!cd {base_dir} && mkdir sidecar
!cd {base_dir}/sidecar && git clone https://github.com/lemay-ai/sidecar.git
!cd {base_dir}/sidecar/sidecar && ls -l

mkdir: cannot create directory ‘sidecar’: File exists
fatal: destination path 'sidecar' already exists and is not an empty directory.
total 11671
-rw------- 1 root root     8683 Nov 27 19:00 CWS_gen_mp.py
-rw------- 1 root root 11128365 Nov 27 19:00 dataset.csv
drwx------ 2 root root     4096 Nov 27 19:00 images
drwx------ 2 root root     4096 Nov 27 19:00 lemay_ai_sidecar
-rw------- 1 root root    35149 Nov 27 19:00 LICENSE
-rw------- 1 root root   764859 Nov 27 19:00 notebook_showing_steps.ipynb
-rw------- 1 root root     2936 Nov 27 19:00 README.md
-rw------- 1 root root      405 Nov 27 19:00 setup.py
-rw------- 1 root root      965 Nov 27 19:00 test.py


In [0]:
!head -n 20 {base_dir}/sidecar/sidecar/dataset.csv

,[LocalizedFileNames],body,tags
0,,<p>How i can convert word file (.docx &amp; doc ) to .pdf in c# without using SaveAs() or Save method ? or without uploading on server?</p>,c#
1,,"<p>I essentially have the following:</p>

<pre><code>    int? myVal = null;
    myVal |= 1;
    bool stillNull = myVal == null; //returns true
</code></pre>

<p>Why does this behave this way?  My understanding of bitwise operator/operand behavior is not terribly strong, and I could not find a reason that it wouldn't be treated as a simple assignment in this case.</p>",c#
2,,"<p>I have a variable which I am populating with records from my database. I then will display this list on a view as a Drop down box. However, it fails once it reaches the drop down. </p>

<p>Controller:</p>

<pre><code>  public ActionResult Review() {
            var reviews = reviewRepo.GetAllReviews();

            var clients = clientRepo.Clients();    

            List&lt;SelectListItem&gt; items = new SelectList(clients, ""Client

In [10]:
import pandas as pd
df = pd.read_csv(py_base_dir+"/sidecar/sidecar/dataset.csv",index_col=0)
df=df.sample(frac=1.0)
df.head(10)

Unnamed: 0,[LocalizedFileNames],body,tags
372,,<p>Having some problems with a javascript code...,php
47,,<p>I've got a large data set spanning many yea...,r
958,,<p>I have a dataset that I loaded into R using...,r
304,,<p>Interrupting the program below with Ctrl + ...,perl
319,,<p>I am trying to read data from MySQL and sho...,vb.net
970,,"<p>I'm new to python, and I'm having problems ...",python
786,,<p>This query takes about a minute to give res...,sql
993,,<p>I'm getting list of strings from a method a...,c#
547,,<p>So I know how to increment. I have the foll...,vb.net
756,,<pre><code> import requests \n def pos...,python


In [0]:
display(df["tags"].value_counts())

javascript    1000
php           1000
c++           1000
r             1000
c#            1000
sql           1000
python        1000
perl          1000
vb.net        1000
java          1000
Name: tags, dtype: int64

In [0]:
len_train=0
len_test=0
with open(py_base_dir+'/sidecar/model_text.test', 'w') as testFile:
  with open(py_base_dir+'/sidecar/model_text.train', 'w') as trainFile:
    for index,row in df.iterrows():
      line_text="__label__"+str(row['tags'])+" "+str(row['body']).replace("\n"," ")+"\n"
      try:
        if int(index)%10==0:
          testFile.write(line_text)
          len_test+=1
        else:
          trainFile.write(line_text)
          len_train+=1
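      except Exception:
        # skip rows that cannot be written (for example, an index that is not an integer)
        print("nope")
# no explicit close() needed: the with blocks close both files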
print(len_train,len_test)

nope
9000 1000


In [0]:
!head -n 10 {base_dir}/sidecar/model_text.train
!head -n 10 {base_dir}/sidecar/model_text.test

__label__c++ <p>I'm trying to remove an element from a vector. But I think, I have a specific problem:</p>  <p><strong>My data from file of reading:</strong></p>  <p><em>move ctrl+a,F3</em></p>  <p><em>copy ctrl+v,shift+v</em></p>  <p><em>search F3,F4</em></p>  <ol> <li>trying to read from a file</li> <li>input a word ,which I want to earse(e.x.: move)</li> <li><p>And the problem is that, I need to input ONLY ONE (like in an example) word, and earse the whole string of commands(<em>move ctrl+a,F3</em>).  </p>  <p>What I need is just to find a string by one word. But in code below I can't do this,help,please solving a problem. In a code below, all I can is find only one word, if in a file there is only one word, not like (<em>move ctrl+a,F3</em>), but if a string consists of several words.. It can't find.</p>  <pre><code> std::vector&lt;std::string&gt; HotMap::remove_element(std::vector&lt;std::string&gt; MyVector){ //remove a element ,reading a vector of cmds, then deletting one chosen

In [0]:
# Train a model on this corpus
import fasttext
model = fasttext.train_supervised(input=py_base_dir+"/sidecar/model_text.train")
model.save_model(py_base_dir+"sidecar/model_text.bin")
model.test(py_base_dir+"/sidecar/model_text.test")
# model.predict("NHQRegional SACO reviews ISP0421B form to ensure it meets the generic job profile matrix.")

(1000, 0.424, 0.424)
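`model.test` returns a tuple of (number of examples, precision at 1, recall at 1). A minimal sketch of how those two numbers are computed, on invented toy labels (`precision_recall_at_1` is a hypothetical helper, not part of FastText):

```python
def precision_recall_at_1(true_labels, top1_predictions):
    """true_labels: list of label sets; top1_predictions: one label per example."""
    n = len(true_labels)
    hits = sum(1 for truth, pred in zip(true_labels, top1_predictions) if pred in truth)
    n_predicted = n  # exactly one prediction per example at k=1
    n_true = sum(len(truth) for truth in true_labels)
    return n, hits / n_predicted, hits / n_true

true_labels = [{"python"}, {"r"}, {"sql"}, {"c#"}, {"perl"}]
top1 = ["python", "r", "php", "c#", "java"]
result = precision_recall_at_1(true_labels, top1)
print(result)
```

When every example has exactly one true label, as in this dataset, the two denominators are equal, which is why precision and recall come out identical in the results here.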

In [0]:
import fasttext
model = fasttext.train_supervised(input=py_base_dir+"/sidecar/model_text.train", epoch=25)
model.save_model(py_base_dir+"sidecar/model_text.bin")
model.test(py_base_dir+"/sidecar/model_text.test")

(1000, 0.796, 0.796)

In [17]:
import fasttext
base_model = fasttext.train_supervised(input=py_base_dir+"/sidecar/model_text.train", 
                                       epoch=25, lr=1.0)
base_model.save_model(py_base_dir+"sidecar/model_text.bin")
base_model.test(py_base_dir+"/sidecar/model_text.test")

(1000, 0.813, 0.813)

In [0]:
# note: automatic hyperparameter tuning (-autotune-validation on the command line, autotuneValidationFile in Python) and many other options are available

In [3]:
!pip3 install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9681 sha256=7199ed9f5ebfabc9689b38f011c935788fe74db8a6916e702eaed9b9865a6dbb
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [0]:
import os
import zipfile
import wget
from IPython.display import Image
from IPython.core.display import HTML 
import spacy

# load pretrained model
nlp = spacy.load("en")

df.drop(columns=['[LocalizedFileNames]'],inplace=True)


In [64]:
# for each row, concatenate the base model vector with the pretrained vector
import numpy as np

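def makeCustomVec(row):
  # sentence vector from the FastText model trained above (100 dimensions by default)
  try:
    txt = row["customV"]
    txt = ''.join(e for e in txt if e.isalnum() or e == ' ')
    vec = base_model.get_sentence_vector(txt)
    return [vec]
  except Exception:
    # fall back to a zero vector of the same size as a FastText vector
    return [np.zeros(100)]

def makePretrainedVec(row):
  # document vector from the spaCy model loaded above (96 dimensions here)
  try:
    txt = row["body"].replace("\n","")
    txt = ''.join(e for e in txt if e.isalnum() or e == ' ')
    vec = nlp(txt).vector
    return [vec]
  except Exception:
    # fall back to a zero vector of the same size as a spaCy vector
    return [np.zeros(96)]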

display(Image(url= "https://imgs.xkcd.com/comics/compiling.png"))

# Series.replace only swaps whole cell values; .str.replace works on substrings
df["customV"]=df["body"].str.replace("\n","",regex=False)
df["customV"]=df.apply(makeCustomVec, axis=1)
df["pretrainedV"]=df.apply(makePretrainedVec, axis=1)
display(df.head(10))
for index,row in df.iterrows():
  text = row['body'].replace("\n","")
  customV = row["customV"]
  pretrainedV = row["pretrainedV"]
  print(len(text),customV[0].shape,pretrainedV[0].shape)
  break

Unnamed: 0,body,tags,customV,pretrainedV
372,<p>Having some problems with a javascript code...,php,"[[-0.016572924, 0.005136265, 0.037449963, -0.0...","[[0.99867225, -1.1788186, -0.6309549, -1.04448..."
47,<p>I've got a large data set spanning many yea...,r,"[[0.031693637, 0.08152228, -0.00843287, 0.0890...","[[5.5865903, 0.27739644, -2.6804729, 1.7886392..."
958,<p>I have a dataset that I loaded into R using...,r,"[[-0.004416636, 0.13198957, -0.04892641, 0.124...","[[0.65342474, -1.0588115, -1.1940643, -1.39399..."
304,<p>Interrupting the program below with Ctrl + ...,perl,"[[-0.15919286, -0.1953033, -0.004372969, 0.240...","[[1.7554406, -1.3144658, -1.3427788, -1.416685..."
319,<p>I am trying to read data from MySQL and sho...,vb.net,"[[0.14902955, -0.13814355, 0.04496123, -0.0713...","[[2.1630957, -1.4345362, -1.2440366, -1.542554..."
970,"<p>I'm new to python, and I'm having problems ...",python,"[[-0.01622935, 0.013932429, 0.016172351, 0.021...","[[1.78211, -1.3009847, -1.251538, -1.2445885, ..."
786,<p>This query takes about a minute to give res...,sql,"[[0.059694316, -0.11732004, 0.055257183, -0.17...","[[0.71675146, -0.9493438, -0.5939291, -1.41842..."
993,<p>I'm getting list of strings from a method a...,c#,"[[-0.005962717, 0.047765773, 0.032258432, -0.0...","[[0.9023432, -0.9161297, -1.0015508, -1.520370..."
547,<p>So I know how to increment. I have the foll...,vb.net,"[[0.14267495, -0.03386349, -0.025051177, 0.048...","[[1.0881171, -1.5441401, -0.6176418, -1.811053..."
756,<pre><code> import requests \n def pos...,python,"[[-0.12212327, 0.11319813, 0.054229822, 0.0007...","[[1.602962, -1.6458162, 0.054127373, -1.437824..."


1465 (100,) (96,)
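The shapes printed above, (100,) for the FastText sentence vector and (96,) for the spaCy vector, are what add up to the 196 feature columns used in the next cell. A small sketch with random stand-in vectors (not the real embeddings) showing the concatenation, and why any fallback zero vector has to be 1-D with the matching length:

```python
import numpy as np

rng = np.random.default_rng(0)
custom_vec = rng.normal(size=100)      # stand-in for base_model.get_sentence_vector(txt)
pretrained_vec = rng.normal(size=96)   # stand-in for nlp(txt).vector

# Concatenating two 1-D vectors gives one 1-D feature vector
combined = np.hstack((custom_vec, pretrained_vec))
print(combined.shape)

# A 2-D fallback such as np.zeros((1, 96)) cannot be stacked with a 1-D vector
try:
    np.hstack((custom_vec, np.zeros((1, 96))))
    stack_failed = False
except ValueError:
    stack_failed = True
print(stack_failed)
```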


In [79]:
from tqdm import tqdm
# we know the custom model performance alone, 
# let's see if augmenting with the pretrained vector helps
x = np.zeros((df.shape[0],196))
y = np.zeros((df.shape[0],10))

labels=list(df["tags"].unique())[:10]
encoder={}
for l in labels:
  encoder[l]=labels.index(l)
print(encoder)

newIndex=0
for index,row in tqdm(df.iterrows(), total=df.shape[0]):
  customV = row["customV"]
  pretrainedV = row["pretrainedV"]
  label=row["tags"]
  if label in encoder.keys():
    combinedV = np.hstack((customV[0],pretrainedV[0]))
    x[newIndex,:]=combinedV
    ix=encoder[label]
    y_val=np.zeros(10)
    y_val[ix]=1
    y[newIndex,:]=y_val
    newIndex+=1
  else:
    print("Someone is having a bad day")

{'php': 0, 'r': 1, 'perl': 2, 'vb.net': 3, 'python': 4, 'sql': 5, 'c#': 6, 'java': 7, 'c++': 8, 'javascript': 9}

Someone is having a bad day

100%|██████████| 10001/10001 [00:02<00:00, 4352.66it/s]
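One row printed "Someone is having a bad day" because its tag was not among the ten encoded labels. Since `x` and `y` are preallocated with one row per DataFrame row, every skipped row leaves an all-zero row at the end of both arrays. A hedged sketch of the same loop on toy data, with the unused rows trimmed afterward; `toy_rows` and the 3-dimensional vectors are made up for illustration:

```python
import numpy as np

labels = ["php", "r", "perl"]
encoder = {label: i for i, label in enumerate(labels)}

# (tag, feature vector) pairs; "cobol" is not in the encoder and gets skipped
toy_rows = [
    ("php", [1.0, 0.0, 0.0]),
    ("cobol", [9.0, 9.0, 9.0]),
    ("r", [0.0, 1.0, 0.0]),
    ("perl", [0.0, 0.0, 1.0]),
]

x = np.zeros((len(toy_rows), 3))
y = np.zeros((len(toy_rows), len(labels)))

new_index = 0
for tag, vec in toy_rows:
    if tag not in encoder:
        continue                        # unknown label: leave the row unused
    x[new_index, :] = vec
    y[new_index, encoder[tag]] = 1      # one-hot target
    new_index += 1

# Drop the unused all-zero rows at the end
x, y = x[:new_index], y[:new_index]
print(x.shape, y.shape)
```

Without the trim, the leftover row would enter the train/test split as an example with no label at all.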


In [91]:
#test/train split
from sklearn.model_selection import train_test_split
import keras
from keras.callbacks import TensorBoard
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Dropout, Activation, Convolution1D
from keras.losses import binary_crossentropy,sparse_categorical_crossentropy, mean_squared_error
from keras.optimizers import SGD

x_train, x_test, y_train, y_test=train_test_split(x, y, test_size=0.20, random_state=42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((8000, 196), (2001, 196), (8000, 10), (2001, 10))

In [104]:
from keras import regularizers
model = Sequential()
model.add(Dense(512, activation='relu', input_dim=196))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=50, batch_size=256, validation_data=(x_test,y_test))

Train on 8000 samples, validate on 2001 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f3f9a953160>

In [105]:
model.evaluate(x_test,y_test)



[0.5918504187519821, 0.8265867067062516]
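`evaluate` returns the loss followed by the metrics passed to `compile`, so the pair above reads as [categorical cross-entropy, accuracy]. A small NumPy sketch of how those two quantities are defined, using made-up softmax outputs rather than the real model's predictions (Keras also clips probabilities internally, which this sketch ignores):

```python
import numpy as np

# Toy softmax outputs for 3 examples and 3 classes (made-up numbers)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])
# One-hot targets: the true classes are 0, 1 and 2
targets = np.eye(3)

# Categorical cross-entropy: mean negative log-probability of the true class
loss = float(-np.mean(np.sum(targets * np.log(probs), axis=1)))

# Accuracy: fraction of rows whose most probable class is the true class
accuracy = float(np.mean(np.argmax(probs, axis=1) == np.argmax(targets, axis=1)))
print([loss, accuracy])
```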

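So we got a tiny improvement for a lot of work: accuracy went from 0.813 with FastText alone to about 0.827 with the pretrained vector added, a bit over 1 percentage point. The two scores come from different held-out sets (1,000 vs. 2,001 examples), so the comparison is only rough.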

More text cleanup might help.
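The bodies shown earlier still contain HTML such as `<p>` and `<pre><code>`, and the character filter only drops the angle brackets, so tag names end up glued to the neighbouring words. One possible cleanup, sketched with the standard library only; the function name `clean_body` is hypothetical and this is one option among many:

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")

def clean_body(body):
    """Strip HTML tags and punctuation, collapse whitespace, and lowercase."""
    text = TAG_RE.sub(" ", body)     # replace tags with a space so words stay apart
    text = html.unescape(text)       # turn entities such as &#39; back into characters
    # replace anything that is not a letter or digit with a space
    text = "".join(ch if ch.isalnum() else " " for ch in text)
    return WHITESPACE_RE.sub(" ", text).strip().lower()

example = "<p>I'm new to python,\nand I&#39;m having problems</p>"
cleaned = clean_body(example)
print(cleaned)
```

Replacing newlines and tags with a space, rather than deleting them, keeps the last word of one line from merging with the first word of the next.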

Classifying sequences instead of average vectors might do better...
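Averaging token vectors throws away word order: two texts made of the same words in a different order get the same mean vector. A small sketch with a made-up 2-dimensional vocabulary (`toy_vectors` is hypothetical, not a real embedding) that contrasts the mean with a padded sequence matrix of the kind a recurrent or convolutional model could consume:

```python
import numpy as np

# Made-up 2-D "embeddings" for three words
toy_vectors = {
    "dog": np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man": np.array([1.0, 1.0]),
}

def mean_vector(tokens):
    """Average of the token vectors: order-independent."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

def sequence_matrix(tokens, max_len=5):
    """One row per token, zero-padded to max_len: order-preserving."""
    out = np.zeros((max_len, 2))
    for i, token in enumerate(tokens[:max_len]):
        out[i] = toy_vectors[token]
    return out

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

same_mean = bool(np.allclose(mean_vector(a), mean_vector(b)))
same_sequence = bool(np.array_equal(sequence_matrix(a), sequence_matrix(b)))
print(same_mean, same_sequence)
```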

Now let's go to the unsupervised side of the fence.

In [0]:
# More to explore:
## https://github.com/huggingface/transformers