# Custom NLP Family Relations NER Extraction

Here we work with high-level NLP tools built on top of next-generation language processing engines, such as Hugging Face Transformers, to target and tag specific data inside a body of text. We then refine those tagged targets with a specific identity filter, using plain Python or regex, to find all family-related data in a section of unstructured text.

## We Will Start by Implementing Stanza

Stanza is a Python NLP toolkit that supports 60+ human languages. It is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data, and it offers pretrained models on 100 treebanks. Additionally, Stanza provides a stable, officially maintained Python interface to the Java Stanford CoreNLP toolkit.

Note that Stanza only supports Python 3.6 and above. Installing and importing Stanza are as simple as running the following commands:

In [1]:
!pip install Stanza

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Stanza
  Downloading stanza-1.4.2-py3-none-any.whl (691 kB)
     |████████████████████████████████| 691 kB 12.9 MB/s 
Collecting emoji
  Downloading emoji-2.1.0.tar.gz (216 kB)
     |████████████████████████████████| 216 kB 48.7 MB/s 
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-2.1.0-py3-none-any.whl size=212392 sha256=278740e6f20a9141ba406563f5d3ecec456d48afb5e9beb12dae1dca7d3467e6
  Stored in directory: /root/.cache/pip/wheels/77/75/99/51c2a119f4cfd3af7b49cc57e4f737bed7e40b348a85d82804
Successfully built emoji
Installing collected packages: emoji, Stanza
Successfully installed Stanza-1.4.2 emoji-2.1.0


Next, download the English model:

In [2]:
# Import the package
import stanza

# Download an English model into the default directory
print("Downloading English model...")
stanza.download('en')

Downloading English model...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.



## Processing Text
Constructing a pipeline: to process a piece of text, you first need to construct a Pipeline with different Processor units. The pipeline is language-specific, so you again need to specify the language first, as in the cell below.

In [3]:
# Build an English pipeline, with all processors by default
print("Building an English pipeline...")
en_nlp = stanza.Pipeline('en',verbose=True)

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Building an English pipeline...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/pretrain/fasttextcrawl.pt:   0%…

INFO:stanza:Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: constituency
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Annotating Text
After a pipeline is successfully constructed, you can get annotations for a piece of text simply by passing the string into the pipeline object. The pipeline returns a Document object, from which detailed annotations can be accessed. For example:

In [4]:
# Processing English text
Text ="My mother came with me to the hospital last week with my grandmother to get treated for alzheimer's it was a long day my brother came later in the evening"

en_doc = en_nlp(Text)
print(type(en_doc))

<class 'stanza.models.common.doc.Document'>


We can now have the model predict the full structure and annotation of the text:

In [5]:
text_res = []
for i, sent in enumerate(en_doc.sentences):
    print("[Sentence {}]".format(i + 1))

    for word in sent.words:
        # One row per word: text, lemma, POS tag, head index, dependency relation
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(
              word.text, word.lemma, word.pos, word.head, word.deprel))
        # Keep the annotations so that later cells can filter on them
        res = {'text': word.text, 'lemma': word.lemma, 'pos': word.pos, 'head': word.head, 'deprel': word.deprel}
        text_res.append(res)

    print("")

[Sentence 1]
My          	my          	PRON  	2	nmod:poss   
mother      	mother      	NOUN  	3	nsubj       
came        	come        	VERB  	0	root        
with        	with        	ADP   	5	case        
me          	I           	PRON  	3	obl         
to          	to          	ADP   	8	case        
the         	the         	DET   	8	det         
hospital    	hospital    	NOUN  	3	obl         
last        	last        	ADJ   	10	amod        
week        	week        	NOUN  	3	obl:tmod    
with        	with        	ADP   	16	mark        
my          	my          	PRON  	13	nmod:poss   
grandmother 	grandmother 	NOUN  	16	nsubj:pass  
to          	to          	PART  	16	mark        
get         	get         	AUX   	16	aux:pass    
treated     	treat       	VERB  	3	advcl       
for         	for         	ADP   	18	case        
alzheimer   	alzheimer   	NOUN  	16	obl         
's          	's          	PART  	18	case        
it          	it          	PRON  	24	nsubj       
was         	be  
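The head column in the output above points each word at its syntactic parent (0 marks the root), which can be used to recover who a relative belongs to. The following is a minimal sketch, not part of the original notebook: the rows are copied by hand from the output above, and `possessors` is a hypothetical helper name. It pairs every possessive modifier (`nmod:poss`) with the noun it modifies.

```python
# Rows copied from the output above: (id, text, pos, head, deprel).
# The id is the 1-based position of the word in its sentence.
words = [
    (1, 'My', 'PRON', 2, 'nmod:poss'),
    (2, 'mother', 'NOUN', 3, 'nsubj'),
    (3, 'came', 'VERB', 0, 'root'),
    (12, 'my', 'PRON', 13, 'nmod:poss'),
    (13, 'grandmother', 'NOUN', 16, 'nsubj:pass'),
]


def possessors(rows):
    """Pair each possessive modifier with the word it points to via head."""
    by_id = {row[0]: row for row in rows}
    pairs = []
    for _id, text, _pos, head, deprel in rows:
        # Skip anything that is not a possessive or whose head is not in rows
        if deprel == 'nmod:poss' and head in by_id:
            pairs.append((text.lower(), by_id[head][1]))
    return pairs


print(possessors(words))
```

With a live pipeline, the same tuples could presumably be built from each word's `id`, `text`, `pos`, `head`, and `deprel` attributes.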

## Building Family Targeting
Now we will build a few word lists to help target family relations:

In [6]:
family = ['household','ménage','family','ancestry', 'parent','child', 'sibling', 'parentage','birth','genealogy','heritage', 'descendent', 'lineage', 'bloodline',
          'relative','relations','step-','in-law-', 'offspring']
parent =['mother', 'mom', 'maternal','matriarch','father', 'dad','patriarch']
sibling =['brother', 'bro', 'male sibling','sister', 'sis', 'female sibling']
child =['son', 'male child','male offspring','daughter', 'female child','female offspring']
extended =['aunt','uncle','grandma','grandmother','grandmom','grandpa','grandfather','grandpop','niece', 'nephew', 'cousin', 'great grandma', 'great grandpa', 'grandson','granddaughter',
           'stepfather', 'stepmother','stepson', 'stepdaughter', 'stepsister', 'stepbrother', 'half-brother', 'half-sister', 'father-in-law',
           'mother-in-law','son-in-law','daughter-in-law','brother-in-law','sister-in-law']
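Entries such as `'step-'` and `'in-law-'` are fragments rather than whole words, so an exact membership test against the lists will never match them. One possible way to cover such compounds is a regular expression. The sketch below is an assumption for illustration, not part of the original notebook: the pattern, the base-word selection, and the name `find_relation_terms` are all hypothetical.

```python
import re

# Hypothetical pattern: an optional "step"/"half" prefix, a base relation
# word, and an optional "-in-law" suffix, all bounded by word boundaries.
COMPOUND_RELATION = re.compile(
    r"\b(?:(?:step|half)-?)?"
    r"(?:mother|father|brother|sister|son|daughter|mom|dad)"
    r"(?:-in-law)?\b",
    re.IGNORECASE,
)


def find_relation_terms(text):
    """Return every relation term found in text, lowercased, in order."""
    return [m.group(0).lower() for m in COMPOUND_RELATION.finditer(text)]


sample = "My step-mother and my brother-in-law visited my Half-sister."
print(find_relation_terms(sample))
```

The word boundaries keep the pattern from firing inside unrelated words such as "person" or "moment".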

## Identify Family Info
From here we are all set to identify family info: simply extract the nouns identified in the original text and check them against the family targeting lists. This should tell us what we want to know.


In [7]:
detected_matches =[]
for x in text_res:
  if(x['pos']=='NOUN'):
    text = str(x['text']).lower()
    if(text in family):
      detected_matches.append({'Family reference found':text})
    if(text in parent):
      detected_matches.append({'Parent reference found':text})
    if(text in sibling):
      detected_matches.append({'Sibling reference found':text})
    if(text in child):
      detected_matches.append({'Child reference found':text})
    if(text in extended):
      detected_matches.append({'Extended Family reference found':text})
    
    
print('Identified matches for Target Text : ')
print(Text)

print(detected_matches)

Identified matches for Target Text : 
My mother came with me to the hospital last week with my grandmother to get treated for alzheimer's it was a long day my brother came later in the evening
[{'Parent reference found': 'mother'}, {'Extended Family reference found': 'grandmother'}, {'Sibling reference found': 'brother'}]
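The chain of membership checks above matches the surface form of each noun, so a plural such as "brothers" would be missed. A possible refinement, sketched here under assumptions, is to invert the word lists into a single lookup table and match on the lemma instead. The shortened lists, the `lemma` key, and the name `match_family_terms` are hypothetical and chosen for illustration only.

```python
# Hypothetical word lists, shortened from the notebook's lists for brevity.
CATEGORY_TERMS = {
    'Parent': ['mother', 'father', 'mom', 'dad'],
    'Sibling': ['brother', 'sister'],
    'Child': ['son', 'daughter'],
    'Extended Family': ['aunt', 'uncle', 'grandmother', 'cousin'],
}

# Invert the lists into a single term -> category lookup table.
TERM_TO_CATEGORY = {
    term: category
    for category, terms in CATEGORY_TERMS.items()
    for term in terms
}


def match_family_terms(words):
    """Match nouns by lemma so that plurals such as 'brothers' still hit."""
    matches = []
    for word in words:
        if word['pos'] != 'NOUN':
            continue
        lemma = word['lemma'].lower()
        category = TERM_TO_CATEGORY.get(lemma)
        if category is not None:
            matches.append({f'{category} reference found': lemma})
    return matches


# Hand-written stand-ins for the per-word dictionaries built from the pipeline.
sample_words = [
    {'text': 'brothers', 'lemma': 'brother', 'pos': 'NOUN'},
    {'text': 'visited', 'lemma': 'visit', 'pos': 'VERB'},
    {'text': 'Aunt', 'lemma': 'aunt', 'pos': 'NOUN'},
]
print(match_family_terms(sample_words))
```

A dictionary lookup assigns each term to exactly one category, which differs from the original checks if a term ever appears in more than one list.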



## Now Let's Perform Basic NER
Running the NERProcessor simply requires the TokenizeProcessor. After the pipeline is run, the Document will contain a list of Sentences, and the Sentences will contain lists of Tokens. Named entities can be accessed through Document or Sentence’s properties entities or ents. Alternatively, token-level NER tags can be accessed via the ner fields of Token.

Accessing named entities for a sentence and a document: here is an example of performing named entity recognition on a piece of text and accessing the named entities in the entire document:


We do this to see what else Stanza can pick up automatically, and how much of that can help us further understand family history notes.

In [8]:
# Processing English text
Text ="My mother came with me to the hospital last week with my grandmother to get treated for alzheimer's it was a long day my brother came later in the evening."

nlp2 = stanza.Pipeline(lang='en', processors='tokenize,ner')

docs = nlp2(Text)
print()

print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in docs.ents], sep='\n')
if(len(docs.ents)==0):
  print('No entities detected in text')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!



entity: last week	type: DATE
entity: later in the evening	type: TIME


This can help us identify whether times and dates are being discussed, a key factor in identifying data around family history conversations.
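Because the general NER model can return other entity types as well, it may be useful to keep only the temporal ones. The sketch below is an illustration under assumptions: `Entity` is a hand-made stand-in for a Stanza entity span (only its `text` and `type` attributes are modelled), and `temporal_entities` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for a Stanza entity span; only the two attributes used here.
Entity = namedtuple('Entity', ['text', 'type'])

# Entity types treated as temporal signals.
TEMPORAL_TYPES = {'DATE', 'TIME'}


def temporal_entities(entities):
    """Keep only date and time entities, as plain dictionaries."""
    return [
        {'text': ent.text, 'type': ent.type}
        for ent in entities
        if ent.type in TEMPORAL_TYPES
    ]


sample_entities = [
    Entity('last week', 'DATE'),
    Entity('later in the evening', 'TIME'),
    Entity('Boston', 'GPE'),  # hypothetical non-temporal entity
]
print(temporal_entities(sample_entities))
```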

## Identify Medical Text
Now let's use Stanza's medical models to examine our text further and see whether medications or other medical items are being discussed directly.

In [9]:
# Processing English text
Text ="My mother came with me to the hospital last week with my grandmother to get treated for alzheimer's it was a long day my brother came later in the evening."


stanza.download('en', package='mimic', processors={'ner': 'i2b2'})
nlp3 = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})

docz = nlp3(Text)

# print out the entities
for ent in docz.entities:
    print(f'{ent.text}\t{ent.type}') 

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Downloading these customized packages for language: en (English)...
| Processor       | Package |
-----------------------------
| tokenize        | mimic   |
| pos             | mimic   |
| lemma           | mimic   |
| depparse        | mimic   |
| ner             | i2b2    |
| backward_charlm | mimic   |
| forward_charlm  | mimic   |
| pretrain        | mimic   |



Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/tokenize/mimic.pt:   0%|       …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/pos/mimic.pt:   0%|          | …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/lemma/mimic.pt:   0%|          …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/depparse/mimic.pt:   0%|       …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/ner/i2b2.pt:   0%|          | 0…

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/backward_charlm/mimic.pt:   0%|…

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/forward_charlm/mimic.pt:   0%| …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/pretrain/mimic.pt:   0%|       …

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | mimic   |
| pos       | mimic   |
| lemma     | mimic   |
| depparse  | mimic   |
| ner       | i2b2    |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


alzheimer's	PROBLEM


## Problem Detected 

With all this in place, we now have a custom hybrid framework, combining multiple rules with AI-based models, to:

* 1 identify whether family is being discussed in the text, and classify which family members are mentioned

* 2 identify whether time is a factor: are dates being discussed, and is this about the past or the future?

* 3 identify whether medical problems or medications are being talked about, and what they are


With all of these put together, we can make a strong estimate of whether family history is being discussed and create a structured data set around the entire conversation, all from entirely unstructured text.
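One simple way to combine the three signals into a single estimate is a small additive score. The sketch below mirrors the weighting used by the notebook's final function (2 points for a family reference, 1 point when date and medical information occur together). The helper name `fhi_label` is hypothetical, and the weights should be read as a starting point rather than a validated rule.

```python
from itertools import product

LABELS = {
    0: 'NO FHI DETECTED',
    1: 'Potential FHI DETECTED',
    2: 'Some FHI DETECTED',
    3: 'Full FHI DETECTED',
}


def fhi_label(family_info, date_info, medical_info):
    """Map the three boolean signals to a label via an additive score."""
    score = 0
    if family_info:
        score += 2
    # Date and medical information only count when they occur together
    if date_info and medical_info:
        score += 1
    return LABELS[score]


# Print the full truth table to make the behaviour of the rule visible.
for family, date, medical in product([False, True], repeat=3):
    print(family, date, medical, '->', fhi_label(family, date, medical))
```

Printing the table shows that a date or a medical problem on its own does not change the label under this rule.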

## Putting It All Together
Last, let's put it all together into one final method for FHI (Family History Information) identification, breakdown, and full target detection, for future use on our data for our clients:

In [23]:
# Full FHI Identifier
import stanza

# Download an English model into the default directory
print("Downloading English model...")
stanza.download('en')
# Build an English pipeline, with all processors by default
print("Building an English pipeline...")
en_nlp = stanza.Pipeline('en',verbose=True)
stanza.download('en', package='mimic', processors={'ner': 'i2b2'})
nlp2 = stanza.Pipeline(lang='en', processors='tokenize,ner')
nlp3 = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})



family = ['household','ménage','family','ancestry', 'parent','child', 'sibling', 'parentage','birth','genealogy','heritage', 'descendent', 'lineage', 'bloodline',
          'relative','relations','step-','in-law-', 'offspring']
parent =['mother', 'mom', 'maternal','matriarch','father', 'dad','patriarch']
sibling =['brother', 'bro', 'male sibling','sister', 'sis', 'female sibling']
child =['son', 'male child','male offspring','daughter', 'female child','female offspring']
extended =['aunt','uncle','grandma','grandmother','grandmom','grandpa','grandfather','grandpop','niece', 'nephew', 'cousin', 'great grandma', 'great grandpa', 'grandson','granddaughter',
           'stepfather', 'stepmother','stepson', 'stepdaughter', 'stepsister', 'stepbrother', 'half-brother', 'half-sister', 'father-in-law',
           'mother-in-law','son-in-law','daughter-in-law','brother-in-law','sister-in-law']


def FHI_Identifier(Text: str):

  family_info = False
  date_info = False
  medical_info = False

  en_doc = en_nlp(Text)
  # identify structure -
  text_res = []
  date_res =[]
  fam_res =[]
  med_res =[]

  for sent in en_doc.sentences:
      for word in sent.words:
          res ={'text':word.text,'lemma':word.lemma, 'pos':word.pos, 'head':word.head, 'deprel':word.deprel}
          text_res.append(res)


  #TRACK Family Info
  for x in text_res:
    if(x['pos']=='NOUN'):
      text = str(x['text']).lower()
      if(text in family):
        fam_res.append({'Family reference found':text})
        family_info = True
      if(text in parent):
        fam_res.append({'Parent reference found':text})
        family_info = True
      if(text in sibling):
        fam_res.append({'Sibling reference found':text})
        family_info = True
      if(text in child):
        fam_res.append({'Child reference found':text})
        family_info = True
      if(text in extended):
        fam_res.append({'Extended Family reference found':text})
        family_info = True

      
    

  #TRACK DATE INFO
  docs = nlp2(Text)
  if(len(docs.ents)==0):
    date_info = False

  for ent in docs.ents:
    if('DATE' in ent.type or 'TIME' in ent.type):
      date_info = True
      res = {'text': ent.text, 'type': ent.type}
      date_res.append(res)


  #Track Medical
  docz = nlp3(Text)

  # Any detected entity counts as medical information
  if(len(docz.entities)>0):
    medical_info = True
  for ent in docz.entities:
      res = {'text': ent.text, 'type': ent.type}
      med_res.append(res)

  FHI_score = 0
  if(family_info):
    FHI_score +=2
  if(date_info and medical_info):
    FHI_score +=1 


  fhi_text = "NO FHI DETECTED"
  if(FHI_score == 1):
    fhi_text = "Potential FHI DETECTED"
  elif(FHI_score == 2):
    fhi_text = "Some FHI DETECTED"
  elif(FHI_score == 3):
    fhi_text = "Full FHI DETECTED"

  final_res = [{'FHI': fhi_text, 'Family Info': fam_res, 'Date Info':date_res, 'Medical Info': med_res}]
  return final_res



Downloading English model...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources.
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Building an English pipeline...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: constituency
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Downloading these customized packages for language: en (English)...
| Processor       | Package |
-----------------------------
| tokenize        | mimic   |
| pos             | mimic   |
| lemma           | mimic   |
| depparse        | mimic   |
| ner             | i2b2    |
| backward_charlm | mimic   |
| forward_charlm  | mimic   |
| pretrain        | mimic   |

INFO:stanza:File exists: /root/stanza_resources/en/tokenize/mimic.pt
INFO:stanza:File exists: /root/stanza_resources/en/pos/mimic.pt
INFO:stanza:File exists: /root/stanza_resources/en/lemma/mimic.pt
INFO:stanza:File exists: /root/stanza_resources/en/depparse/mimic.pt
INFO:stanza:File exists: /root/stanza_resources/en/ner/i2b2.pt
INFO:stanza:File exists: /root/stanza_resources/en/backward_charlm/mimic.pt
INFO:stanza:File exists: /root/stanza_resources/en/forward_charlm/mimic.pt
INFO:stanza:File exists: /root/stanza_resources/en/pretrain/mimic.pt
INFO:stanza:Finished downloading models and saved to /root/stanza_re

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | mimic   |
| pos       | mimic   |
| lemma     | mimic   |
| depparse  | mimic   |
| ner       | i2b2    |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Run the New FHI Detection Method


In [24]:
Text ="My mother came with me to the hospital last week with my grandmother to get treated for alzheimer's it was a long day my brother came later in the evening."

result = FHI_Identifier(Text)

result

[{'FHI': 'Full FHI DETECTED',
  'Family Info': [{'Parent reference found': 'mother'},
   {'Extended Family reference found': 'grandmother'},
   {'Sibling reference found': 'brother'}],
  'Date Info': [{'text': 'last week', 'type': 'DATE'},
   {'text': 'later in the evening', 'type': 'TIME'}],
  'Medical Info': [{'text': "alzheimer's", 'type': 'PROBLEM'}]}]
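To turn a result like the one above into a structured data set, the nested dictionary can be flattened into one row per detected item. The sketch below is an illustration under assumptions: `flatten_fhi_result` and the row field names are hypothetical, and `sample_result` is copied by hand from the output above (it is the single dictionary inside the returned list).

```python
import json


def flatten_fhi_result(result):
    """Turn one FHI result dictionary into flat rows, one per detected item."""
    rows = []
    for entry in result['Family Info']:
        for kind, text in entry.items():
            rows.append({'fhi': result['FHI'], 'source': 'family',
                         'kind': kind, 'text': text})
    for source, key in (('date', 'Date Info'), ('medical', 'Medical Info')):
        for entry in result[key]:
            rows.append({'fhi': result['FHI'], 'source': source,
                         'kind': entry['type'], 'text': entry['text']})
    return rows


# Copied from the output shown above.
sample_result = {
    'FHI': 'Full FHI DETECTED',
    'Family Info': [{'Parent reference found': 'mother'},
                    {'Extended Family reference found': 'grandmother'},
                    {'Sibling reference found': 'brother'}],
    'Date Info': [{'text': 'last week', 'type': 'DATE'},
                  {'text': 'later in the evening', 'type': 'TIME'}],
    'Medical Info': [{'text': "alzheimer's", 'type': 'PROBLEM'}],
}

rows = flatten_fhi_result(sample_result)
print(json.dumps(rows, indent=2))
```

Rows in this shape could then be written to CSV or loaded into a table for further analysis.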