## Custom NER on Drugs Review Dataset

 The objective of this notebook is to perform named entity recognition on a drug review dataset:
  - Add custom entity types for drug names and medical conditions
  - Extract drug name and condition entities from new reviews, which would assist in classification


##### SpaCy recognizes the following built-in entity types:

**PERSON** - People, including fictional.

**NORP** - Nationalities or religious or political groups.

**FAC** - Buildings, airports, highways, bridges, etc.

**ORG** - Companies, agencies, institutions, etc.

**GPE** - Countries, cities, states.

**LOC** - Non-GPE locations, mountain ranges, bodies of water.

**PRODUCT** - Objects, vehicles, foods, etc. (Not services.)

**EVENT** - Named hurricanes, battles, wars, sports events, etc.

**WORK_OF_ART** - Titles of books, songs, etc.

**LAW** - Named documents made into laws.

**LANGUAGE** - Any named language.

**DATE** - Absolute or relative dates or periods.

**TIME** - Times smaller than a day.

**PERCENT** - Percentage, including "%".

**MONEY** - Monetary values, including unit.

**QUANTITY** - Measurements, as of weight or distance.

**ORDINAL** - "first", "second", etc.

**CARDINAL** - Numerals that do not fall under another type.

#### Along with these, we will train SpaCy NER to recognize drug names as a new entity type. We will train on 1000 reviews, with drug names labelled as the DRUG entity.

#### Packages to import

In [2]:
import numpy as np 
import pandas as pd 
from wordcloud import WordCloud, STOPWORDS # conda install -c conda-forge wordcloud
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy  # conda install -c conda-forge spacy
import matplotlib.pyplot as plt
import re
import random
from spacy.util import minibatch, compounding
from nltk.tokenize import word_tokenize
#import nltk
#nltk.download('punkt')

In [3]:
df_train = pd.read_csv('drugsComTrain_raw.tsv',sep='\t',index_col=0)
df_train.head()

| Unnamed: 0 | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|
| 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | 9.0 | May 20, 2012 | 27 |
| 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 |
| 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 |
| 138000 | Ortho Evra | Birth Control | "This is my first time using any form of birth... | 8.0 | November 3, 2015 | 10 |
| 35696 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around... | 9.0 | November 27, 2016 | 37 |


#### Extract entities for a custom entity type by training a model in SpaCy

In [4]:
drug_list = df_train['drugName'].value_counts().index.tolist()
drug_list = [x.lower() for x in drug_list]

In [13]:
medicalCondtion_list = df_train['condition'].value_counts().index.tolist()
medicalCondtion_list = [x.lower() for x in medicalCondtion_list]
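
Both lists are later checked one whitespace-delimited token at a time, which has two consequences that are easy to see on a small example. The sketch below uses three drug names from the `df_train.head()` output and a made-up review; `drug_names`, `drug_set`, `matched`, and `multi_word` are illustrative names, not part of the notebook. A set gives constant-time membership tests, and a name that contains a space can never equal a single token.

```python
# Three drug names taken from the df_train.head() output above.
drug_names = ["Valsartan", "Ortho Evra", "Buprenorphine / naloxone"]
drug_set = {name.lower() for name in drug_names}  # set: O(1) membership tests

# Made-up review text, already lowercased, for illustration only.
review = "ortho evra worked better for me than valsartan"

# Token-level lookup, as done in the training-data cells.
matched = [tok for tok in review.split() if tok in drug_set]

# Names with more than one word cannot match a single token.
multi_word = sorted(name for name in drug_set if len(name.split()) > 1)

print(matched)
print(multi_word)
```

Only `valsartan` is matched here; `ortho evra` is skipped even though it appears in the review. Covering such names would need phrase-level matching rather than a per-token lookup.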

#### Training SpaCy NER for DRUG

In [5]:
def process_review(review):
    """Lowercase a review and drop every non-alphanumeric character from each token."""
    processed_token = []
    for token in review.split():
        token = ''.join(e.lower() for e in token if e.isalnum())
        processed_token.append(token)
    return ' '.join(processed_token)
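
Processed reviews in this notebook contain tokens such as `i039ve`, which suggests that the raw text holds HTML entities like `&#039;` whose digits survive the alphanumeric filter. A minimal sketch of one possible remedy, assuming that is the cause, is to decode entities with the standard library's `html.unescape` before calling `process_review`. The `raw` string is a made-up example, and `process_review` is repeated only so the block runs on its own.

```python
import html

def process_review(review):
    # Same logic as the notebook cell above.
    processed_token = []
    for token in review.split():
        token = ''.join(e.lower() for e in token if e.isalnum())
        processed_token.append(token)
    return ' '.join(processed_token)

# Hypothetical raw review containing an HTML-encoded apostrophe.
raw = "I&#039;ve been on this pill since I was 15."

without_decoding = process_review(raw)
with_decoding = process_review(html.unescape(raw))

print(without_decoding)  # entity digits leak into the token
print(with_decoding)     # apostrophe decoded first, then stripped
```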

#### Train data creation for drug

In [26]:
#Step 1: Let's create the training data
count = 0
TRAIN_DATA_drug = []
for _, item in df_train.iterrows():
    ent_dict = {}
    if count < 1000:
        review = process_review(item['review'])
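        # Record each drug name once per review, at its first regex match.
        # Caveat: the pattern is not anchored to word boundaries, so the span
        # can fall inside a longer word instead of on the token itself.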
        visited_items = []
        entities = []
        for token in review.split():
            if token in drug_list:
                for i in re.finditer(token, review):
                    if token not in visited_items:
                        entity = (i.span()[0], i.span()[1], 'DRUG')
                        visited_items.append(token)
                        entities.append(entity)
        if len(entities) > 0:
            ent_dict['entities'] = entities
            train_item = (review, ent_dict)
            TRAIN_DATA_drug.append(train_item)
            count+=1

In [41]:
count = 0
TRAIN_DATA_medicalCondition = []
for _, item in df_train.iterrows():
    ent_dict_mc = {}
    if count < 1000:
        review_mc = process_review(item['review'])
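        # Record each condition once per review, at its first regex match.
        # Caveat: the pattern is not anchored to word boundaries, so the span
        # can fall inside a longer word and overlap another entity.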
        visited_items_mc = []
        entities_mc = []
        for token_mc in review_mc.split():
            if token_mc in medicalCondtion_list:
                for i in re.finditer(token_mc, review_mc):
                    if token_mc not in visited_items_mc:
                        entity_mc = (i.span()[0], i.span()[1], 'medicalCondition')
                        visited_items_mc.append(token_mc)
                        entities_mc.append(entity_mc)
        if len(entities_mc) > 0:
            ent_dict_mc['entities'] = entities_mc
            train_item = (review_mc, ent_dict_mc)
            TRAIN_DATA_medicalCondition.append(train_item)
            count+=1

In [44]:
TRAIN_DATA_medicalCondition

[('been on reclipsen for a few months now pros lighter periods no stomach cramps no acne no babies cons mood swings terrible back pain during period lower sex drive no weight gain if already active',
  {'entities': [(81, 85, 'medicalCondition'),
    (127, 131, 'medicalCondition')]}),
 ('i just read most of the reviews and half of them to me are ridiculous anxiety and depression your period gives you mood swings being a girl you get mood swings  i039ve been on this pill since i was 15 i am 22 now never had a problem unless i miss a few pills i have never been pregnant very light blood no cramps very little pimples to anyone with bad acne and bad cramps i highly recommend this pill i used to throw up because of my bad cramps and now i rarely even get one',
  {'entities': [(52, 54, 'medicalCondition'),
    (70, 77, 'medicalCondition'),
    (82, 92, 'medicalCondition'),
    (352, 356, 'medicalCondition')]}),
 ('started off with concerta in may 2012 it was not effective at all switched over
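
A quick way to sanity-check annotations like these is to slice each `(start, end)` span out of its text and read what was actually labelled. The sketch below does that for the first training item shown above; `show_spans` is a hypothetical helper name, not something defined in the notebook.

```python
# First item of TRAIN_DATA_medicalCondition, copied from the output above.
example = (
    "been on reclipsen for a few months now pros lighter periods no stomach "
    "cramps no acne no babies cons mood swings terrible back pain during "
    "period lower sex drive no weight gain if already active",
    {"entities": [(81, 85, "medicalCondition"), (127, 131, "medicalCondition")]},
)

def show_spans(train_item):
    """Return the labelled substring and label for every entity span."""
    text, annotations = train_item
    return [(text[start:end], label) for start, end, label in annotations["entities"]]

labelled = show_spans(example)
print(labelled)
```

Running the same helper over the whole list makes misaligned spans (for example, a fragment of a longer word) easy to spot before training.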

In [24]:
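n_iter = 10  # number of passes over the training data
def train(TRAIN_DATA):
    # Note: this function uses the spaCy v2 training API
    # (nlp.create_pipe, nlp.begin_training, nlp.update(texts, annotations)).
    nlp = spacy.blank("en")  # create blank Language class
    print("Created blank 'en' model")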
    
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")
        
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
            
    nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
            )
        print("Losses", losses)
    return nlp
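
The `compounding(4.0, 32.0, 1.001)` argument controls the batch size: it produces a series that starts at 4.0, is multiplied by 1.001 at every step, and never exceeds 32.0. The pure-Python generator below is an illustrative re-implementation of that behaviour, not spaCy's own code, and `compounding_sketch` is a made-up name.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

# Look at the first 2100 values of the schedule used in train().
sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 2100))
print(sizes[0], round(sizes[250], 3), sizes[-1])
```

With a factor of 1.001 the series grows slowly. Because `train()` builds a fresh `compounding(...)` generator on every pass over the data, the schedule appears to restart at 4.0 each iteration, so with around 1000 examples the batch size stays close to 4 throughout.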

In [27]:
#Step 2: Let's train custom model with the training data
nlp_drugIdentifier = train(TRAIN_DATA_drug)

Created blank 'en' model
Losses {'ner': 4220.2779708741555}
Losses {'ner': 1371.556508247962}
Losses {'ner': 1003.8670679914386}
Losses {'ner': 894.4396859489137}
Losses {'ner': 726.1803427564366}
Losses {'ner': 658.2168352722816}
Losses {'ner': 598.1228123366262}
Losses {'ner': 580.3089729032414}
Losses {'ner': 496.0578926867852}
Losses {'ner': 497.1031868440588}


In [43]:
nlp_medicalConditionIdentifier=train(TRAIN_DATA_medicalCondition)

Created blank 'en' model


ValueError: [E103] Trying to set conflicting doc.ents: '(33, 46, 'medicalCondition')' and '(37, 39, 'medicalCondition')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
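
A likely cause of this error is the unanchored `re.finditer(token_mc, review_mc)` search in the training-data cell: the first match of a short condition name can fall inside a longer word that is itself labelled, which produces two overlapping spans. The sketch below reproduces the effect on a made-up review with a hypothetical two-entry vocabulary, then shows one possible alternative that labels whole whitespace-delimited tokens so spans cannot overlap. `spans_unanchored` and `spans_whole_word` are illustrative names; labelling every occurrence (rather than only the first) is an assumption of this sketch.

```python
import re

def spans_unanchored(review, vocabulary, label):
    """First-match spans, mirroring the cells above: may hit inside a longer word."""
    spans, seen = [], set()
    for token in review.split():
        if token in vocabulary and token not in seen:
            seen.add(token)
            match = re.search(re.escape(token), review)
            spans.append((match.start(), match.end(), label))
    return spans

def spans_whole_word(review, vocabulary, label):
    """Label each whitespace-delimited token that appears in the vocabulary."""
    return [
        (m.start(), m.end(), label)
        for m in re.finditer(r"\S+", review)
        if m.group() in vocabulary
    ]

vocabulary = {"menopause", "me"}            # hypothetical entries
review = "menopause gave me hot flashes"    # made-up, already processed text

overlapping = spans_unanchored(review, vocabulary, "medicalCondition")
clean = spans_whole_word(review, vocabulary, "medicalCondition")
print(overlapping)  # second span sits inside the first
print(clean)        # each span covers exactly one token
```

Building `TRAIN_DATA_medicalCondition` from token-aligned spans like these should avoid the conflicting `doc.ents` that E103 reports, although multi-word condition names would still go unlabelled.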

In [None]:
# Test the models on the first 10 training examples
for text, _ in TRAIN_DATA_drug[:10]:
    doc = nlp_drugIdentifier(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    doc1 = nlp_medicalConditionIdentifier(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc1.ents])

In [None]:
test_reviews = df_train.iloc[-10:, :]['review']
for review in test_reviews:
    review = process_review(review)
    print(review)
    doc = nlp_drugIdentifier(review)
    doc1= nlp_medicalConditionIdentifier(review)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Entities", [(ent.text, ent.label_) for ent in doc1.ents])
    print('________________________')