## Hierarchical Text Categorization using Watson NLP - Fine Tune (PubMed dataset)

### Use Case 
Hierarchical text categorization provides a more structured and organized approach to categorizing text, enabling better analysis, improved search, and more efficient content management. This notebook creates a hierarchical categorization system for a public medical dataset that is more specific than the broad MeSH root categories: Humanities; Health Care; Anatomy; Phenomena and Processes; Named Groups; Geographicals; Technology, Industry, and Agriculture; Chemicals and Drugs; Anthropology, Education, Sociology, and Social Phenomena; Information Science; Disciplines and Occupations; Analytical, Diagnostic and Therapeutic Techniques, and Equipment; Diseases; Psychiatry and Psychology; and Organisms.

This notebook demonstrates how to use the Explicit Semantic Analysis (ESA) block for text categorization. The stock model has been pre-trained on scraped web data and news data.

The dataset contains 50,000 PubMed article records, each labeled with MeSH categories. It originates from [Kaggle](https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification).

### What you'll learn in this notebook
Watson NLP offers so-called blocks for various NLP tasks. This notebook shows:

- **Syntax analysis** with the _Syntax block_ for English (`syntax_izumo_en_stock`). The syntax block performs NLP primitive tasks on the input text. It uses Izumo, the standard NLP primitives component of Watson NLP to perform the following tasks:
    1. Sentence detection
    1. Tokenization: can't -> ca + n't
    1. Part-of-Speech tagging: I thought -> I/PRON, thought/VERB
    1. Lemmatization: I thought -> I/I, thought/think
    1. Dependency parsing: I -> nsubj -> thought -> root
    
- **ESA Hierarchical Algorithm** The ESA Hierarchical Algorithm provides a data-free method for hierarchical text categorization. Instead of relying on training data, each label is equipped with a collection of key phrases, represented as n-grams, that are meant to define the semantic scope associated with that label. These key phrases can then be used to obtain an ESA concept vector for the label.

- **Hierarchical Text Categorization** with the ESAHierarchical categories block (`categories_esa_en_stock`). This pre-trained model is useful in AdTech use cases where web pages are categorized into a taxonomy of general-domain topics for advertisement placement and content recommendation.
   

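To make the data-free idea concrete, here is a toy sketch in which each label is defined only by key phrases and a text is assigned to the label whose phrases it resembles most. It is not the ESA implementation in Watson NLP: real ESA maps text onto concept vectors derived from a large corpus, whereas this example substitutes plain bag-of-words vectors. The labels, key phrases, and helper names are all hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for an ESA concept vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical labels, each defined only by key phrases (no labeled documents)
label_key_phrases = {
    "Diseases": ["cervical cancer", "vitamin d deficiency", "fibrosis"],
    "Geographicals": ["india", "vietnam", "united states"],
}

# One vector per label, built from its key phrases
label_vectors = {
    label: bag_of_words(" ".join(phrases))
    for label, phrases in label_key_phrases.items()
}

def categorize(text):
    """Return (label, score) pairs sorted by decreasing similarity."""
    doc = bag_of_words(text)
    scores = [(label, cosine(doc, vec)) for label, vec in label_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = categorize("HPV infection was often associated with cervical cancer")
print(ranking)
```

The sample sentence shares the tokens "cervical" and "cancer" with the first label and nothing with the second, so "Diseases" ranks first and "Geographicals" scores zero.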

## Table of Contents

1. [Before you start](#beforeYouStart)
1. [Data Loading](#loadData)
1. [Data Processing & EDA](#EDA)
1. [Prepare Training data set](#training)
1. [Model Building](#MB)
1. [Summary](#summary)

<a id="beforeYouStart"></a>
## 1. Before you start

<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _DO + NLP Runtime xx.x on Python 3.x_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they are running in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook, and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted at the top of the notebook in a code cell. If you do not see the cell above, add the token to the notebook by clicking **More > Insert project token** from the notebook action bar.  By running the inserted hidden code cell, a project object is created that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

Note that you can step through the notebook execution cell by cell, by selecting Shift-Enter. Or you can execute the entire notebook by selecting **Cell -> Run All** from the menu.

<span style="color:blueviolet">Begin by importing and initializing some helper libs that are used throughout the notebook.</span>

In [3]:
%%capture
# word cloud is used to create graphs below
!pip install wordcloud
!pip install ibm-watson
!pip install watson_nlp

In [4]:
# Silence Tensorflow warnings
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)


In [44]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

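In [36]:
# pandas must be imported before its options can be set
import pandas as pd
# Silence the SettingWithCopyWarning raised by chained assignment
pd.options.mode.chained_assignment = None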

In [6]:
import json
import os
import pandas as pd

import watson_nlp

from watson_nlp.toolkit.categories import train_esa_utils
from watson_nlp.blocks.categories import ESAHierarchical
# we want to show large text snippets to be able to explore the relevant text
pd.options.display.max_colwidth = 400
import matplotlib.pyplot as plt
import numpy as np

<a id="loadData"></a>
## 2. Data Loading (PubMed Dataset)

This section loads the PubMed dataset (original source: [Kaggle](https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification)) from the project's storage into a pandas DataFrame.

<div class="alert alert-block alert-info">
<b>Tip:</b> To carry out text categorization on another dataset, first upload that dataset into the project, then update the file name in the next cell.</div>

<span style="color:blueviolet"><strong>Step 2.1</strong> We load the medical dataset into a DataFrame.</span>

<span style="color:blue">This data set contains <strong>50000</strong> records with the columns ['Title', 'abstractText', 'meshMajor', 'pmid', 'meshid', 'meshroot', 'A',
       'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'Z'].</span>

In [7]:
# load data set into a dataframe
file_name = "PubMed Multi Label Text Classification Dataset Processed.csv"
buffer = project.get_file(file_name)
med_df = pd.read_csv(buffer)


# preview the data set
med_df.head(5)

Unnamed: 0,Title,abstractText,meshMajor,pmid,meshid,meshroot,A,B,C,D,E,F,G,H,I,J,L,M,N,Z
0,Expression of p53 and coexistence of HPV in premalignant lesions and in cervical cancer.,"Fifty-four paraffin embedded tissue sections from patients with dysplasia (21 cases) and with cervical cancer (33 cases) were analysed. HPV was detected and identified in two stages. Firstly, using mixed starters, chosen genomic DNA sequences were amplified; secondly the material thus obtained was analyzed by hybridization method using oligonucleotyde 31-P labelled probe. HPVs of type 6, 11, 1...","['DNA Probes, HPV', 'DNA, Viral', 'Female', 'Humans', 'Immunohistochemistry', 'Papillomaviridae', 'Tumor Suppressor Protein p53', 'Uterine Cervical Dysplasia', 'Uterine Cervical Neoplasms']",8549602,"[['D13.444.600.223.555', 'D27.505.259.750.600.223.620', 'D27.720.470.530.600.223.620'], ['D13.444.308.568'], ['B01.050.150.900.649.313.988.400.112.400.400'], ['E01.370.225.500.607.512', 'E01.370.225.750.551.512', 'E05.200.500.607.512', 'E05.200.750.551.512', 'E05.478.583', 'H01.158.100.656.234.512', 'H01.158.201.344.512', 'H01.158.201.486.512', 'H01.181.122.573.512', 'H01.181.122.605.512'], ['...","['Chemicals and Drugs [D]', 'Organisms [B]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Disciplines and Occupations [H]', 'Diseases [C]']",0,1,1,1,1,0,0,1,0,0,0,0,0,0
1,Vitamin D status in pregnant Indian women across trimesters and different seasons and its correlation with neonatal serum 25-hydroxyvitamin D levels.,"The present cross-sectional study was conducted to determine the vitamin D status of pregnant Indian women and their breast-fed infants. Subjects were recruited from the Department of Obstetrics, Armed Forces Clinic and Army Hospital (Research and Referral), Delhi. A total of 541 apparently healthy women with uncomplicated, single, intra-uterine gestation reporting in any trimester were consec...","['Adult', 'Alkaline Phosphatase', 'Breast Feeding', 'Cross-Sectional Studies', 'Female', 'Humans', 'India', 'Infant', 'Infant Nutrition Disorders', 'Lactation', 'Mothers', 'Nutritional Status', 'Parathyroid Hormone', 'Pregnancy', 'Pregnancy Complications', 'Pregnancy Trimesters', 'Seasons', 'Vitamin D', 'Vitamin D Deficiency', 'Vitamins', 'Young Adult']",21736816,"[['M01.060.116'], ['D08.811.277.352.650.035'], ['F01.145.407.199', 'G07.203.650.195', 'G07.203.650.220.500.500', 'G07.203.650.353.199'], ['E05.318.372.500.875', 'N05.715.360.330.500.875', 'N06.850.520.450.500.875'], ['B01.050.150.900.649.313.988.400.112.400.400'], ['Z01.252.245.393'], ['M01.060.703'], ['C18.654.422'], ['G08.686.523', 'G08.686.702.500'], ['F01.829.263.500.320.200', 'I01.880.853...","['Named Groups [M]', 'Chemicals and Drugs [D]', 'Psychiatry and Psychology [F]', 'Phenomena and Processes [G]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Health Care [N]', 'Organisms [B]', 'Geographicals [Z]', 'Diseases [C]', 'Anthropology, Education, Sociology, and Social Phenomena [I]', 'Technology, Industry, and Agriculture [J]']",0,1,1,1,1,1,1,0,1,1,0,1,1,1
2,[Identification of a functionally important dipeptide in sequences of atypical opioid peptides].,"The occurrence of individual amino acids and dipeptide fragments in the sequences of 60 known atypical opioid peptides was analyzed. An expressed predominance of Tyr-Pro fragment suggested a high probability of analgesic activity for this dipeptide, and it was experimentally studied. It was shown on somatic and visceral pain sensitivity models that, on the i.p. administration of Tyr-Pro at dos...","['Amino Acid Sequence', 'Analgesics, Opioid', 'Animals', 'Consensus Sequence', 'Dipeptides', 'Guinea Pigs', 'In Vitro Techniques', 'Male', 'Mice', 'Molecular Sequence Data', 'Muscle Contraction', 'Muscle, Smooth', 'Narcotic Antagonists', 'Opioid Peptides', 'Pain Measurement', 'Rats', 'Receptor, Cannabinoid, CB1', 'Receptors, Opioid']",19060934,"[['G02.111.570.060', 'L01.453.245.667.060'], ['D27.505.696.277.600.500', 'D27.505.696.663.850.014.760.500', 'D27.505.954.427.040.550.500', 'D27.505.954.427.210.600.500'], ['B01.050'], ['G02.111.570.580.175'], ['D12.644.456.345'], ['B01.050.150.900.649.313.992.550'], ['E05.481'], ['B01.050.150.900.649.313.992.635.505.500'], ['L01.453.245.667'], ['G11.427.494'], ['A02.633.570', 'A10.690.467'], [...","['Phenomena and Processes [G]', 'Information Science [L]', 'Chemicals and Drugs [D]', 'Organisms [B]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Anatomy [A]']",1,1,0,1,1,0,1,0,0,0,1,0,0,0
3,Multilayer capsules: a promising microencapsulation system for transplantation of pancreatic islets.,"In 1980, Lim and Sun introduced a microcapsule coated with an alginate/polylysine complex for encapsulation of pancreatic islets. Characteristic to this type of capsule is, that it consists of a plain membrane which is formed during a single procedural step. With such a simple process it is difficult to obtain instantly a membrane optimized with respect to all the properties requested for isle...","['Acrylic Resins', 'Alginates', 'Animals', 'Biocompatible Materials', 'Biopolymers', 'Carboxymethylcellulose Sodium', 'Cells, Cultured', 'Compressive Strength', 'Drug Compounding', 'Female', 'Fibrosis', 'Glucuronic Acid', 'Hexuronic Acids', 'Islets of Langerhans Transplantation', 'Materials Testing', 'Microspheres', 'Muscle, Skeletal', 'Particle Size', 'Permeability', 'Polyethyleneimine', 'Pol...",11426874,"[['D05.750.716.822.111', 'D25.720.716.822.111', 'J01.637.051.720.716.822.111'], ['D09.698.068'], ['B01.050'], ['D25.130', 'D27.720.102.130', 'J01.637.051.130'], ['D05.750.078', 'D25.720.099', 'J01.637.051.720.099'], ['D09.698.365.180.663.329'], ['A11.251'], ['G01.374.180'], ['E05.916.270'], ['C23.550.355'], ['D02.241.081.844.915.162.249', 'D02.241.152.811.162.500', 'D02.241.511.902.915.162.500...","['Chemicals and Drugs [D]', 'Technology, Industry, and Agriculture [J]', 'Organisms [B]', 'Anatomy [A]', 'Phenomena and Processes [G]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Diseases [C]']",1,1,1,1,1,0,1,0,0,1,0,0,0,0
4,"Nanohydrogel with N,N'-bis(acryloyl)cystine crosslinker for high drug loading.","Substantially improved hydrogel particles based on poly(N-isopropylacrylamide) (pNIPA) have been obtained. First, as a result of replacing commercially available N,N'-bis(acryloyl)cystamine (BAC), the crosslinker, with acryloyl derivative of cystine containing a carboxylic group (BISS), the hydrogel particles acquired improved stability vs. ionic strength and allowed further chemical modificat...","['Antineoplastic Agents', 'Cell Proliferation', 'Cell Survival', 'Cross-Linking Reagents', 'Cystine', 'Doxorubicin', 'Drug Carriers', 'Drug Liberation', 'HeLa Cells', 'Humans', 'Hydrogels', 'Nanoparticles']",28323099,"[['D27.505.954.248'], ['G04.161.750', 'G07.345.249.410.750'], ['G04.346'], ['D27.720.470.410.210'], ['D01.248.497.158.874.390.369', 'D01.875.350.850.150.369', 'D02.886.030.230.369', 'D02.886.520.150.087', 'D12.125.095.369', 'D12.125.119.369', 'D12.125.166.230.369'], ['D02.455.426.559.847.562.050.200.175', 'D04.615.562.050.200.175', 'D09.408.051.059.200.175'], ['D26.255.260', 'E02.319.300.380']...","['Chemicals and Drugs [D]', 'Phenomena and Processes [G]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Anatomy [A]', 'Organisms [B]', 'Technology, Industry, and Agriculture [J]']",1,1,0,1,1,0,1,0,0,1,0,0,0,0


<a id="EDA"></a>
## 3. Data Processing & EDA

<span style="color:blueviolet"> <strong>Step 3.1</strong> Checking the size and column names of the medical dataset</span>

In [8]:
len(med_df)

50000

In [9]:
# list the column names
med_df.columns

Index(['Title', 'abstractText', 'meshMajor', 'pmid', 'meshid', 'meshroot', 'A',
       'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'Z'],
      dtype='object')

In [10]:
med_df.abstractText[0]

'Fifty-four paraffin embedded tissue sections from patients with dysplasia (21 cases) and with cervical cancer (33 cases) were analysed. HPV was detected and identified in two stages. Firstly, using mixed starters, chosen genomic DNA sequences were amplified; secondly the material thus obtained was analyzed by hybridization method using oligonucleotyde 31-P labelled probe. HPVs of type 6, 11, 16, 18, 33 were identified. The p-53 expression was assayed by immunohistochemical method. HPV infection was often associated with dysplasia and cervical cancer. In cervical cancer mainly HPV 16 and 18 with high oncogenic potential were found. The p-53 was present rarely, and in minute quantities. No correlation was observed between presence of p-53 and HPVs DNA.'

In [11]:
med_df.meshMajor[0]

"['DNA Probes, HPV', 'DNA, Viral', 'Female', 'Humans', 'Immunohistochemistry', 'Papillomaviridae', 'Tumor Suppressor Protein p53', 'Uterine Cervical Dysplasia', 'Uterine Cervical Neoplasms']"

In [12]:
med_df.meshroot[0]

"['Chemicals and Drugs [D]', 'Organisms [B]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Disciplines and Occupations [H]', 'Diseases [C]']"

In [13]:
med_df.meshMajor[1]

"['Adult', 'Alkaline Phosphatase', 'Breast Feeding', 'Cross-Sectional Studies', 'Female', 'Humans', 'India', 'Infant', 'Infant Nutrition Disorders', 'Lactation', 'Mothers', 'Nutritional Status', 'Parathyroid Hormone', 'Pregnancy', 'Pregnancy Complications', 'Pregnancy Trimesters', 'Seasons', 'Vitamin D', 'Vitamin D Deficiency', 'Vitamins', 'Young Adult']"

In [14]:
med_df.meshroot[1]

"['Named Groups [M]', 'Chemicals and Drugs [D]', 'Psychiatry and Psychology [F]', 'Phenomena and Processes [G]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Health Care [N]', 'Organisms [B]', 'Geographicals [Z]', 'Diseases [C]', 'Anthropology, Education, Sociology, and Social Phenomena [I]', 'Technology, Industry, and Agriculture [J]']"

<span style="color:blueviolet"> <strong>Step 3.2</strong> Preparing the training dataset for text categorization</span>

In [15]:
# Training data for text categorization
med_training_df = med_df[['meshMajor','meshroot']]


In [16]:
med_training_df.head(30)

Unnamed: 0,meshMajor,meshroot
0,"['DNA Probes, HPV', 'DNA, Viral', 'Female', 'Humans', 'Immunohistochemistry', 'Papillomaviridae', 'Tumor Suppressor Protein p53', 'Uterine Cervical Dysplasia', 'Uterine Cervical Neoplasms']","['Chemicals and Drugs [D]', 'Organisms [B]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Disciplines and Occupations [H]', 'Diseases [C]']"
1,"['Adult', 'Alkaline Phosphatase', 'Breast Feeding', 'Cross-Sectional Studies', 'Female', 'Humans', 'India', 'Infant', 'Infant Nutrition Disorders', 'Lactation', 'Mothers', 'Nutritional Status', 'Parathyroid Hormone', 'Pregnancy', 'Pregnancy Complications', 'Pregnancy Trimesters', 'Seasons', 'Vitamin D', 'Vitamin D Deficiency', 'Vitamins', 'Young Adult']","['Named Groups [M]', 'Chemicals and Drugs [D]', 'Psychiatry and Psychology [F]', 'Phenomena and Processes [G]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Health Care [N]', 'Organisms [B]', 'Geographicals [Z]', 'Diseases [C]', 'Anthropology, Education, Sociology, and Social Phenomena [I]', 'Technology, Industry, and Agriculture [J]']"
2,"['Amino Acid Sequence', 'Analgesics, Opioid', 'Animals', 'Consensus Sequence', 'Dipeptides', 'Guinea Pigs', 'In Vitro Techniques', 'Male', 'Mice', 'Molecular Sequence Data', 'Muscle Contraction', 'Muscle, Smooth', 'Narcotic Antagonists', 'Opioid Peptides', 'Pain Measurement', 'Rats', 'Receptor, Cannabinoid, CB1', 'Receptors, Opioid']","['Phenomena and Processes [G]', 'Information Science [L]', 'Chemicals and Drugs [D]', 'Organisms [B]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Anatomy [A]']"
3,"['Acrylic Resins', 'Alginates', 'Animals', 'Biocompatible Materials', 'Biopolymers', 'Carboxymethylcellulose Sodium', 'Cells, Cultured', 'Compressive Strength', 'Drug Compounding', 'Female', 'Fibrosis', 'Glucuronic Acid', 'Hexuronic Acids', 'Islets of Langerhans Transplantation', 'Materials Testing', 'Microspheres', 'Muscle, Skeletal', 'Particle Size', 'Permeability', 'Polyethyleneimine', 'Pol...","['Chemicals and Drugs [D]', 'Technology, Industry, and Agriculture [J]', 'Organisms [B]', 'Anatomy [A]', 'Phenomena and Processes [G]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Diseases [C]']"
4,"['Antineoplastic Agents', 'Cell Proliferation', 'Cell Survival', 'Cross-Linking Reagents', 'Cystine', 'Doxorubicin', 'Drug Carriers', 'Drug Liberation', 'HeLa Cells', 'Humans', 'Hydrogels', 'Nanoparticles']","['Chemicals and Drugs [D]', 'Phenomena and Processes [G]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Anatomy [A]', 'Organisms [B]', 'Technology, Industry, and Agriculture [J]']"
5,"['Animal Distribution', 'Animals', 'Asia', 'Larva', 'Moths', 'Vietnam']","['Psychiatry and Psychology [F]', 'Phenomena and Processes [G]', 'Organisms [B]', 'Geographicals [Z]']"
6,"['Algorithms', 'Equipment Design', 'Equipment Failure Analysis', 'France', 'Internationality', 'Occupational Exposure', 'Power Plants', 'Radiation Dosage', 'Radiation Monitoring', 'Radiation Protection', 'Reproducibility of Results', 'Sensitivity and Specificity']","['Phenomena and Processes [G]', 'Information Science [L]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Geographicals [Z]', 'Anthropology, Education, Sociology, and Social Phenomena [I]', 'Health Care [N]', 'Technology, Industry, and Agriculture [J]']"
7,"['Adenoidectomy', 'Airway Extubation', 'Analgesics, Non-Narcotic', 'Analgesics, Opioid', 'Anesthetics, Inhalation', 'Bradycardia', 'Child', 'Child, Preschool', 'Dexmedetomidine', 'Emergence Delirium', 'Female', 'Humans', 'Hypotension', 'Male', 'Methyl Ethers', 'Pain Measurement', 'Pain, Postoperative', 'Prospective Studies', 'Sevoflurane', 'Tonsillectomy', 'Tramadol']","['Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Chemicals and Drugs [D]', 'Diseases [C]', 'Named Groups [M]', 'Psychiatry and Psychology [F]', 'Organisms [B]', 'Health Care [N]']"
8,"['Adenosine Diphosphate', 'Adenosine Triphosphate', 'Animals', 'Electrophysiology', 'Guanosine Diphosphate', 'Magnesium', 'Mice', 'Muscles', 'Potassium Channels', 'Thionucleotides']","['Chemicals and Drugs [D]', 'Organisms [B]', 'Disciplines and Occupations [H]', 'Anatomy [A]']"
9,"['Checklist', 'Civil Defense', 'Emergencies', 'Female', 'Humans', 'Male', 'Natural Disasters', 'Operating Rooms', 'Patient Care Team', 'Patient Safety', 'Simulation Training', 'Time Factors', 'United States']","['Health Care [N]', 'Anthropology, Education, Sociology, and Social Phenomena [I]', 'Diseases [C]', 'Organisms [B]', 'Phenomena and Processes [G]', 'Geographicals [Z]']"


In [17]:
df_meshRoot = med_training_df['meshroot']

In [18]:
df_meshRoot.head()

0                                                                                                                                                                                                          ['Chemicals and Drugs [D]', 'Organisms [B]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Disciplines and Occupations [H]', 'Diseases [C]']
1    ['Named Groups [M]', 'Chemicals and Drugs [D]', 'Psychiatry and Psychology [F]', 'Phenomena and Processes [G]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Health Care [N]', 'Organisms [B]', 'Geographicals [Z]', 'Diseases [C]', 'Anthropology, Education, Sociology, and Social Phenomena [I]', 'Technology, Industry, and Agriculture [J]']
2                                                                                                                                                                                    ['Phenomena and Processes [G]', 'Information Science [L]', 'Chemicals and Dru
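Each `meshroot` value shown above is a Python list literal stored as a string, not an actual list. As a small sketch of one way to recover the list, the standard library's `ast.literal_eval` can parse it; the sample string is copied from the first row of the dataset.

```python
import ast

# Sample value copied from the meshroot column shown above
raw = ("['Chemicals and Drugs [D]', 'Organisms [B]', "
       "'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', "
       "'Disciplines and Occupations [H]', 'Diseases [C]']")

# literal_eval safely evaluates the string as a Python literal (here, a list of str)
roots = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(roots).__name__)
print(len(roots))
print(roots[2])
```

Because the string is parsed rather than split, commas inside a category name stay intact. On the full DataFrame the same idea could be applied column-wise, for example with `med_df['meshroot'].apply(ast.literal_eval)`.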

<span style="color:blueviolet"> <strong>Step 3.3</strong> Creating the set of unique category values from the meshroot column</span>

In [19]:
# Extract the unique root category names from the meshroot column
import re
def extarct_dictionary_list(df_meshRoot):
    dictionary_list = set()
    for val in df_meshRoot:
        # each value is a stringified list such as "['Diseases [C]', 'Organisms [B]']"
        for val_value in val.split("',"):
            val_value = val_value.replace("'", "")
            # drop the bracketed MeSH tree letter, e.g. "[C]"
            val_value = re.sub(r"\[[A-Z]\]", "", val_value)
            # drop the brackets of the enclosing list
            val_value = val_value.replace("[", "").replace("]", "")
            dictionary_list.add(val_value.strip())
    return dictionary_list

In [20]:
dictionary_list =extarct_dictionary_list(df_meshRoot)
print("final---",len(dictionary_list))
print("final---",dictionary_list)

final--- 16
final--- {'Technology, Industry, and Agriculture', 'Diseases', '', 'Chemicals and Drugs', 'Disciplines and Occupations', 'Humanities', 'Information Science', 'Named Groups', 'Phenomena and Processes', 'Health Care', 'Geographicals', 'Anthropology, Education, Sociology, and Social Phenomena', 'Psychiatry and Psychology', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment', 'Anatomy', 'Organisms'}


In [21]:
len(dictionary_list)

16
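The set above has 16 entries, one of which is the empty string. One plausible source (an assumption, not verified against the dataset) is a record whose `meshroot` is an empty list `"[]"`, which the string replacements reduce to `''`. The following sketch shows an alternative extraction that parses each value and skips empty names up front; `unique_root_categories` and the sample rows are hypothetical.

```python
import ast
import re

# Matches the trailing MeSH tree letter, e.g. " [C]"
TREE_LETTER = re.compile(r"\s*\[[A-Z]\]$")

def unique_root_categories(meshroot_values):
    """Collect distinct category names from stringified meshroot lists."""
    categories = set()
    for raw in meshroot_values:
        for item in ast.literal_eval(raw):
            name = TREE_LETTER.sub("", item).strip()
            if name:  # ignore empty names instead of deleting them later
                categories.add(name)
    return categories

sample = [
    "['Chemicals and Drugs [D]', 'Diseases [C]']",
    "['Technology, Industry, and Agriculture [J]', 'Diseases [C]']",
    "[]",  # assumed edge case: a record with no root category
]
categories = sorted(unique_root_categories(sample))
print(categories)
```

Filtering inside the loop avoids any later cleanup step that depends on where the blank entry happens to sit in the collection.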

In [22]:
dict_list =list(dictionary_list)

<span style="color:blueviolet"> <strong>Step 3.4</strong> Removing the blank category from the categories list</span>

In [23]:
# A list built from a set has no guaranteed order, so drop the blank entry by value rather than by index
dict_list = [category for category in dict_list if category]
print(dict_list)

['Technology, Industry, and Agriculture', 'Diseases', 'Chemicals and Drugs', 'Disciplines and Occupations', 'Humanities', 'Information Science', 'Named Groups', 'Phenomena and Processes', 'Health Care', 'Geographicals', 'Anthropology, Education, Sociology, and Social Phenomena', 'Psychiatry and Psychology', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment', 'Anatomy', 'Organisms']


<span style="color:blueviolet"> <strong>Step 3.5</strong> Unique categories from the public medical dataset</span>

In [24]:
print(len(dict_list))

15


<a id="training"></a>
## 4. Prepare Training data set

<span style="color:blueviolet"> <strong>Step 4.1</strong> Creating the training data set by pairing each unique category with the meshMajor terms of its records</span>

In [25]:
training_data = []
for dict_val in dict_list:
    # rows whose meshroot string mentions this category (plain substring match, not a regex)
    mask = med_df['meshroot'].str.contains(dict_val, regex=False, na=False)
    top_doc_list = set()
    for mesh_value in med_df.loc[mask, 'meshMajor']:
        # meshMajor is a stringified list; split it and strip brackets and quotes
        for mesh in mesh_value.split("',"):
            top_doc_list.add(mesh.replace("[", "").replace("]", "").replace("'", "").strip())
    # one record per category: the label plus every MeSH term seen with it
    training_data.append({'labels': [dict_val], 'key_phrases': list(top_doc_list)})
    

In [26]:
print(len(training_data))

15
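The loop in Step 4.1 scans the whole DataFrame once per category and matches category names as substrings. As a possible alternative, here is a sketch of a single pass that parses each row and matches categories exactly. It runs on a small in-memory sample that uses the same column names; the helper `build_training_data` is hypothetical, and its output follows the `labels` / `key_phrases` layout used above.

```python
import ast
import re
from collections import defaultdict

# Matches the trailing MeSH tree letter, e.g. " [B]"
TREE_LETTER = re.compile(r"\s*\[[A-Z]\]$")

def build_training_data(rows):
    """Map each root category to the MeSH terms of the rows tagged with it."""
    phrases_by_label = defaultdict(set)
    for row in rows:
        terms = ast.literal_eval(row["meshMajor"])
        for root in ast.literal_eval(row["meshroot"]):
            label = TREE_LETTER.sub("", root).strip()
            if label:
                phrases_by_label[label].update(terms)
    # Sorted for reproducible output; the notebook's own lists are unordered
    return [
        {"labels": [label], "key_phrases": sorted(phrases)}
        for label, phrases in sorted(phrases_by_label.items())
    ]

# Two abbreviated rows modeled on the dataset preview
rows = [
    {"meshMajor": "['DNA, Viral', 'Humans']",
     "meshroot": "['Chemicals and Drugs [D]', 'Organisms [B]']"},
    {"meshMajor": "['Animals', 'Moths']",
     "meshroot": "['Organisms [B]']"},
]
sample_training_data = build_training_data(rows)
for record in sample_training_data:
    print(record)
```

With pandas, the rows could be supplied as `med_df.to_dict("records")`; whether the single pass is actually faster on the full dataset has not been measured here.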


<span style="color:blueviolet"> <strong>Step 4.2</strong> Sample training data</span>

In [27]:
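# The order of training_data is arbitrary, so select the sample record by its label
print(next(record for record in training_data if record['labels'] == ['Diseases']))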

{'labels': ['Diseases'], 'key_phrases': ['Protein Tyrosine Phosphatase, Non-Receptor Type 6', 'Alcohol-Related Disorders', 'Ellagic Acid', 'Freeze Drying', 'Prince Edward Island', 'Disaccharidases', 'Quercus', 'Syphilis, Cardiovascular', '2-Acetylaminofluorene', 'Donepezil', 'Noise', 'Oxidopamine', 'Lamins', 'Ketoconazole', 'Acetates', 'Transfusion Reaction', 'Disulfiram', 'Anticonvulsants', 'Hemospermia', 'Pregnadienes', 'Malaria, Cerebral', 'Thermolysin', 'Tablets', 'Gene Expression Regulation, Bacterial', 'Isotretinoin', 'Persian Gulf Syndrome', 'Chlorogenic Acid', 'Receptors, Kainic Acid', 'Parathyroid Glands', 'Diabetic Nephropathies', 'Neuroleptic Malignant Syndrome', 'Febrile Neutropenia', 'Autonomic Denervation', 'Echinococcus', 'Hepatopancreas', 'Alkynes', 'Spina Bifida Occulta', 'Klebsiella oxytoca', 'Spinocerebellar Ataxias', 'Geranium', 'Benzoyl Peroxide', 'Peptide Fragments', 'South Africa', 'Paliperidone Palmitate', 'Macromolecular Substances', 'Prisons', 'Thymidine Phosp

In [28]:
data_path = './categories_train_data.json'

<span style="color:blueviolet"> <strong>Step 4.3</strong> Dumping the training data into the required JSON format</span>

In [29]:
def prepare_stream_from_python_list(data, syntax_model, data_path):
    '''Given a Python data object, dump it to disk as a JSON file, then use that
    to initialize a new training stream. Note that the data stream is lazily
    initialized; the file at data_path needs to exist when we re-enter the data stream,
    since we don't want to load the whole thing into memory.

    Args:
        data: list(dict)
            ESA Categories training data.
        syntax_model: watson_nlp.blocks.syntax.izumo.IzumoTextProcessing
            Syntax model to be used to tokenize training texts.
        data_path: str
            Location to which we want to save our training data.
    Returns:
        watson_core.data_model.streams.data_stream.DataStream
            DataStream to be passed to ESA train.
    '''
    # Dump the Python object to a JSON file
    with open(data_path, 'w', encoding='utf-8') as f:
        json.dump(data, f)
    # Prepare training data from the JSON file
    return train_esa_utils.prepare_data_from_json(data_path, syntax_model)
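
Before handing the file to the training utilities, it can help to confirm that the JSON round-trips and that every record has the expected shape. The sketch below uses only the standard library. The checks in `validate_training_records` (a hypothetical helper) are assumptions about what a sensible record looks like, not documented requirements of `prepare_data_from_json`.

```python
import json
import os
import tempfile

def validate_training_records(records):
    """Raise ValueError unless every record has non-empty 'labels' and 'key_phrases' lists of str."""
    for i, record in enumerate(records):
        for field in ("labels", "key_phrases"):
            values = record.get(field)
            if not isinstance(values, list) or not values:
                raise ValueError(f"record {i}: '{field}' must be a non-empty list")
            if not all(isinstance(v, str) and v.strip() for v in values):
                raise ValueError(f"record {i}: '{field}' has an empty or non-string entry")

sample = [{"labels": ["Diseases"], "key_phrases": ["Fibrosis", "Vitamin D Deficiency"]}]

# Write and re-read the file to confirm it round-trips unchanged
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "categories_train_data.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

validate_training_records(reloaded)
print(reloaded == sample)
```

A check like this would flag a record whose label is an empty string before it reaches training.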

<span style="color:blueviolet"> <strong>Step 4.4</strong> Downloading the models required for training</span>

In [30]:
# Prepare the Categories DataStream
print('Downloading existing Syntax / Categories models...')
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
# Download an existing categories model; note that we are not loading this into memory, just downloading it.
categories_model_path = watson_nlp.download('categories_esa_en_stock')


Downloading existing Syntax / Categories models...


<a id="MB"></a>
## 5. Model Building

<span style="color:blueviolet"> <strong>Step 5.1</strong> Training the model using the ESAHierarchical.train method</span>

In [45]:
print('Training the model...')
train_data_stream = prepare_stream_from_python_list(training_data, syntax_model, data_path)
model = ESAHierarchical.train(train_data_stream, categories_model_path)
print('[DONE]')

Training the model...
[DONE]


<span style="color:blueviolet"> <strong>Step 5.2</strong> Saving the model</span>

In [32]:
model.save('pub_med_categories_model')
project.save_data('pub_med_categories_model', data=model.as_file_like_object(), overwrite=True)

{'file_name': 'pub_med_categories_model',
 'message': 'File saved to project storage.',
 'bucket_name': 'watsoncore-donotdelete-pr-olkxvfa8bk0pb1',
 'asset_id': '52b9f710-ee62-4795-8bde-44cf9c48f12d'}

<span style="color:blueviolet"> <strong>Step 5.3</strong> Testing the trained model</span>


In [33]:
med_df.abstractText[0]

'Fifty-four paraffin embedded tissue sections from patients with dysplasia (21 cases) and with cervical cancer (33 cases) were analysed. HPV was detected and identified in two stages. Firstly, using mixed starters, chosen genomic DNA sequences were amplified; secondly the material thus obtained was analyzed by hybridization method using oligonucleotyde 31-P labelled probe. HPVs of type 6, 11, 16, 18, 33 were identified. The p-53 expression was assayed by immunohistochemical method. HPV infection was often associated with dysplasia and cervical cancer. In cervical cancer mainly HPV 16 and 18 with high oncogenic potential were found. The p-53 was present rarely, and in minute quantities. No correlation was observed between presence of p-53 and HPVs DNA.'

In [34]:
med_df.meshroot[0]

"['Chemicals and Drugs [D]', 'Organisms [B]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Disciplines and Occupations [H]', 'Diseases [C]']"

In [43]:
# Run syntax on text
text =med_df.abstractText[0]
syntax_result = syntax_model.run(text)
results = model.run(syntax_result)

results.categories

[{
   "labels": [
     "Humanities"
   ],
   "score": 0.576131,
   "explanation": []
 },
 {
   "labels": [
     "Anthropology, Education, Sociology, and Social Phenomena"
   ],
   "score": 0.522536,
   "explanation": []
 },
 {
   "labels": [
     "Chemicals and Drugs"
   ],
   "score": 0.50377,
   "explanation": []
 }]
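
A quick way to sanity-check this prediction is to compare it with the MeSH root categories recorded for the same abstract. The sketch below copies the values shown above into plain Python objects. How to convert the result object into dictionaries is left out because it is not shown in this notebook, and the precision and recall calculation is an illustrative choice rather than part of the Watson NLP workflow.

```python
import ast
import re

# Predicted categories, copied from the output above as plain dicts
predicted = [
    {"labels": ["Humanities"], "score": 0.576131},
    {"labels": ["Anthropology, Education, Sociology, and Social Phenomena"], "score": 0.522536},
    {"labels": ["Chemicals and Drugs"], "score": 0.50377},
]

# Ground truth for the same abstract: med_df.meshroot[0]
raw_truth = ("['Chemicals and Drugs [D]', 'Organisms [B]', "
             "'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', "
             "'Disciplines and Occupations [H]', 'Diseases [C]']")
# Strip the trailing tree letter so names are comparable with the model labels
truth = {re.sub(r"\s*\[[A-Z]\]$", "", item) for item in ast.literal_eval(raw_truth)}

predicted_labels = {label for p in predicted for label in p["labels"]}
hits = predicted_labels & truth

precision = len(hits) / len(predicted_labels)
recall = len(hits) / len(truth)
print(sorted(hits), round(precision, 3), round(recall, 3))
```

For this abstract only "Chemicals and Drugs" appears in both sets, so a single example already suggests the category list and key phrases deserve a closer look before the model is relied on.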

<a id="summary"></a>
## 6. Summary

<span style="color:blue">This notebook shows you how to use the Watson NLP library to:
1. Run syntax analysis (tokens, parts of speech, lemmas, etc.) with the Syntax block
1. Build ESA training data (labels and key phrases) from the MeSH annotations of the PubMed dataset
1. Fine-tune the ESAHierarchical categories block and use it to categorize medical abstracts
    
</span>

Please note that this content is made available by IBM Build Lab to foster Embedded AI technology adoption. The content may include systems & methods pending patent with USPTO and protected under US Patent Laws. For redistribution of this content, IBM will use release process. For any questions please log an issue in the [GitHub](https://github.com/ibm-build-labs/Watson-NLP). 

Developed by IBM Build Lab 

Copyright - 2023 IBM Corporation 