# Entity Extraction from ANSM materials

This notebook extracts, from the public ANSM drug description file, the value domains of the following **named entities**:
* drug name (e.g. xanax, ...)
* active chemical ingredient
* pharma company name (e.g. Sanofi, ...)

These custom named entities will then be used to tag the texts accordingly.

In [56]:
import pandas as pd

# load the drug descriptor file (tab-separated, no header row, Latin-1 encoded)
pd.set_option("display.max_colwidth", 10000)
CIP = pd.read_csv('../../data/ANSM/CIS_bdpm.txt', sep='\t', encoding='ISO-8859-1', header=None)
CIP.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,61266250,"A 313 200 000 UI POUR CENT, pommade",pommade,cutanée,Autorisation active,Procédure nationale,Commercialisée,12/03/1998,,,PHARMA DEVELOPPEMENT,Non
1,62869109,"A 313 50 000 U.I., capsule molle",capsule molle,orale,Autorisation active,Procédure nationale,Commercialisée,07/07/1997,,,PHARMA DEVELOPPEMENT,Non
2,66513085,"ABASAGLAR 100 unités/ml, solution injectable en cartouche",solution injectable,sous-cutanée,Autorisation active,Procédure centralisée,Commercialisée,09/09/2014,,EU/1/14/944,ELI LILLY REGIONAL OPERATIONS (AUTRICHE),Oui
3,64332894,"ABASAGLAR 100 unités/ml, solution injectable en stylo prérempli",solution injectable,sous-cutanée,Autorisation active,Procédure centralisée,Commercialisée,09/09/2014,,EU/1/14/944,ELI LILLY REGIONAL OPERATIONS (AUTRICHE),Oui
4,66207341,"ABELCET 5 mg/ml, suspension à diluer pour perfusion",suspension à diluer pour perfusion,intraveineuse,Autorisation active,Procédure nationale,Commercialisée,10/06/1997,,,ACINO FRANCE,Non


## Stop Words
A French dictionary is used as the stop-word source, so that only the words that distinctively characterize each custom named entity are kept.

In [30]:
import pandas as pd
import random

# load french dictionary and convert all words into upper case
frenchDictionary = pd.read_csv('../../data/vocabulary/french_dictionary.txt', encoding = 'ISO-8859-1',header=None)
frenchDictionary[0] = frenchDictionary[0].map(lambda x : x.upper() )
frenchDictionary = set(frenchDictionary[0].values)

# random.sample needs a sequence (sets are rejected since Python 3.11)
random.sample(sorted(frenchDictionary), 6)

['DÉFRISASSIEZ',
 'CUVIEZ',
 'ENCASTRERAIENT',
 'SIFFLOTÂTES',
 'RÉINCARNASSES',
 'ÉVERTUERONT']

In [54]:
# load country names (in French and in English) as additional stop words
countries = pd.read_csv('../../data/vocabulary/countries.txt', sep=",", header=None)
countriesInFrench = set(countries[4].map(lambda x: x.upper()))
countriesInEnglish = set(countries[5].map(lambda x: x.upper()))

# merge both sets of country names
countries = countriesInFrench | countriesInEnglish

## Pharma company names

In [47]:
import re

companyNameColumn = CIP[10]

# keep the upper-case words of a company name that are in neither stop-word set
def getCompanyName(name, stopWords1, stopWords2):
    words = re.findall("([A-Z]+)", name)
    words = [w for w in words if w not in stopWords1 and w not in stopWords2]
    return " ".join(words)

# return a flat list of distinct short company-name words (the country part is removed)
# the French dictionary alone is not a sufficient stop-word list, as company names may contain common non-French terms
def getCompanyNames(column, stopWords1, stopWords2):
    names = column.map(lambda x: getCompanyName(x, stopWords1, stopWords2))
    flatList = []
    for name in names:
        flatList.extend(name.split(' '))
    # keep only words of at least 4 characters
    flatList = [w for w in flatList if len(w) >= 4]
    # get distinct values
    return list(set(flatList))

companyNames = pd.DataFrame(getCompanyNames(companyNameColumn, frenchDictionary, countries))

companyNames.to_csv('../../data/staging_data/company_names.txt', index=None, header=None)
companyNames.head()


Unnamed: 0,0
0,RECKITT
1,ENTERPRISES
2,CHEMI
3,TAKEDA
4,SWEDISH
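
The sample above still contains generic words such as ENTERPRISES and SWEDISH, which fits the remark in the code that a French dictionary cannot catch common non-French terms. One possible remedy, sketched below, is to pass a small hand-made set of extra stop words next to the dictionary and the country list. The function `company_words`, the set `extra_stop_words` and the toy word sets are assumptions made for illustration; they are not part of the notebook.

```python
import re

# Toy stop-word sets, assumed for illustration; the notebook loads the real
# ones from the French dictionary file and the country list.
toy_dictionary = {"FRANCE"}
toy_countries = {"AUTRICHE", "FRANCE"}
# Hypothetical hand-made list of common non-French terms
extra_stop_words = {"REGIONAL", "OPERATIONS", "ENTERPRISES", "SWEDISH"}

def company_words(name, *stop_word_sets):
    """Return the upper-case words of `name` found in none of the stop-word sets."""
    words = re.findall(r"[A-Z]+", name)
    return [w for w in words if not any(w in s for s in stop_word_sets)]

name = "ELI LILLY REGIONAL OPERATIONS (AUTRICHE)"
print(company_words(name, toy_dictionary, toy_countries))
# ['ELI', 'LILLY', 'REGIONAL', 'OPERATIONS']
print(company_words(name, toy_dictionary, toy_countries, extra_stop_words))
# ['ELI', 'LILLY']
```

Accepting any number of stop-word sets avoids adding one more parameter each time a new word list is introduced.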


## Drug names

The long drug name can be identified as a group of upper-case words: it contains the commercial drug name, optionally followed by the pharma company name and some common terms.
To get the short drug name, remove these extra attributes (the company name and the common terms found in the French dictionary).

In [73]:
import re
drugColumn = CIP[1]

# consider only the upper-case, non-numerical words of the long name and prune company-name words,
# stop words (e.g. 'DE') and words shorter than 3 characters
def getShortDrugName(longDrugName, stopWords1, stopWords2):
    words = re.findall("([A-Z]+)", longDrugName)
    words = [w for w in words if w not in stopWords1 and w not in stopWords2 and len(w) >= 3]
    return words[0] if words else None

# companyNames is a DataFrame, and `w in companyNames` would only test its column labels,
# so membership is tested against the set of its values
companyNameSet = set(companyNames[0])
drugShortNames = pd.DataFrame(drugColumn.map(lambda x: getShortDrugName(x, companyNameSet, frenchDictionary)))
drugShortNames = drugShortNames[drugShortNames[1].str.len() > 0]
drugShortNames.drop_duplicates(inplace=True)
drugShortNames.dropna(inplace=True)
drugShortNames.columns = ['name']
drugShortNames.to_csv('../../data/staging_data/drug_names.txt', index=None, header=None)
drugShortNames.tail(10)

Unnamed: 0,name
14413,ZYLORIC
14416,ZYMAD
14419,ZYMADUO
14421,ZYMAFLUOR
14425,ZYPADHERA
14428,ZYPREXA
14436,ZYRTEC
14437,ZYRTECSET
14438,ZYTIGA
14439,ZYVOXID
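
To see how the extraction rule behaves on individual rows, the sketch below applies it to three long names taken from the `CIP.head()` sample, plus one invented name that starts with a company word. It is a minimal stand-alone illustration: `short_drug_name`, the toy stop-word sets and the "ACME EXAMPLIN" row are assumptions, not notebook content.

```python
import re

# Toy stop-word sets (assumptions for illustration only)
toy_company_words = {"ACME"}
toy_dictionary = {"POUR", "CENT"}

def short_drug_name(long_name, company_words, dictionary):
    """First upper-case word of at least 3 letters that is not a stop word, else None."""
    words = re.findall(r"[A-Z]+", long_name)
    words = [w for w in words
             if w not in company_words and w not in dictionary and len(w) >= 3]
    return words[0] if words else None

samples = [
    "ABASAGLAR 100 unités/ml, solution injectable en cartouche",
    "ABELCET 5 mg/ml, suspension à diluer pour perfusion",
    "A 313 200 000 UI POUR CENT, pommade",
    "ACME EXAMPLIN 10 mg, comprimé",  # invented row, not from the ANSM file
]
for s in samples:
    print(short_drug_name(s, toy_company_words, toy_dictionary))
# ABASAGLAR
# ABELCET
# None
# EXAMPLIN
```

Rows such as "A 313 ..." yield no short name at all, which is why the notebook cell drops empty and missing values before saving the list.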


## Active Ingredient names

In [77]:
COMP = pd.read_csv('../../data/ANSM/COMPO.txt', sep='\t',encoding = 'ISO-8859-1',header=None)

COMP.head(200)

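# return the first upper-case word of the ingredient name that is not a stop word
def getIngredientName(longName, stopWords1):
    # missing values are read as NaN (a float), not as a string
    if not isinstance(longName, str):
        return None

    words = re.findall("([A-Z]+)", longName)
    words = [w for w in words if w not in stopWords1]
    return words[0] if words else None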

ingredients = pd.DataFrame(COMP[3].map(lambda x: getIngredientName(x, frenchDictionary)))
ingredients.drop_duplicates(inplace=True)
ingredients.dropna(inplace=True)
ingredients.to_csv('../../data/staging_data/ingredients.txt',index=None, header=None)
ingredients.head()

Unnamed: 0,3
0,INDAPAMIDE
1,RINDOPRIL
3,COD
4,PARAC
5,TAMIZOLE
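
Several values in this sample look truncated (RINDOPRIL, COD, PARAC, TAMIZOLE). A likely cause, stated here as an assumption since the raw `COMPO.txt` values are not shown, is that `[A-Z]+` stops at accented capitals such as É, so a word is split in two and only one fragment survives the stop-word filter. The sketch below compares the notebook's pattern with a hypothetical accent-aware one; `ACCENTED_UPPER` is a name introduced for illustration and the input strings are examples, not rows read from the file.

```python
import re

# Pattern used in the notebook: ASCII capitals only
ASCII_UPPER = re.compile(r"[A-Z]+")
# Hypothetical alternative: also accept Latin-1 accented capitals (À-Ö, Ø-Þ) and Œ
ACCENTED_UPPER = re.compile(r"[A-ZÀ-ÖØ-ÞŒ]+")

for name in ["PÉRINDOPRIL", "CODÉINE", "PARACÉTAMOL"]:
    print(ASCII_UPPER.findall(name), ACCENTED_UPPER.findall(name))
# ['P', 'RINDOPRIL'] ['PÉRINDOPRIL']
# ['COD', 'INE'] ['CODÉINE']
# ['PARAC', 'TAMOL'] ['PARACÉTAMOL']
```

The dictionary sample shown earlier already holds accented upper-case words (e.g. DÉFRISASSIEZ), so whole accented words could be compared against it without further normalization.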
