NCBI-DISEASE-CLASSIFIER

A comparison of disease-word classifier models built with different types of neural networks (CNN, GRU, LSTM) on the ncbi_disease dataset, retrieved through the Hugging Face datasets API. The project applies named entity recognition to disease names.

Dataset Preview

Each instance below shows the sentence id (string), its tokens (array), and the corresponding ner_tags (array).

id: 0
tokens: [ "Identification", "of", "APC2", ",", "a", "homologue", "of", "the", "adenomatous", "polyposis", "coli", "tumour", "suppressor", "." ]
ner_tags: [ 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0 ]

id: 1
tokens: [ "The", "adenomatous", "polyposis", "coli", "(", "APC", ")", "tumour", "-", "suppressor", "protein", "controls", "the", "Wnt", "signalling", "pathway", "by", "forming", "a", "complex", "with", "glycogen", "synthase", "kinase", "3beta", "(", "GSK", "-", "3beta", ")", ",", "axin", "/", "conductin", "and", "betacatenin", "." ]
ner_tags: [ 0, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

id: 2
tokens: [ "Complex", "formation", "induces", "the", "rapid", "degradation", "of", "betacatenin", "." ]
ner_tags: [ 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

id: 3
tokens: [ "In", "colon", "carcinoma", "cells", ",", "loss", "of", "APC", "leads", "to", "the", "accumulation", "of", "betacatenin", "in", "the", "nucleus", ",", "where", "it", "binds", "to", "and", "activates", "the", "Tcf", "-", "4", "transcription", "factor", "(", "reviewed", "in", "[", "1", "]", "[", "2", "]", ")", "." ]
ner_tags: [ 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

id: 4
tokens: [ "Here", ",", "we", "report", "the", "identification", "and", "genomic", "structure", "of", "APC", "homologues", "." ]
ner_tags: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

id: 5
tokens: [ "Mammalian", "APC2", ",", "which", "closely", "resembles", "APC", "in", "overall", "domain", "structure", ",", "was", "functionally", "analyzed", "and", "shown", "to", "contain", "two", "SAMP", "domains", ",", "both", "of", "which", "are", "required", "for", "binding", "to", "conductin", "." ]
ner_tags: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

Dataset Structure

Data Instances

Each instance of the dataset contains an array of tokens, an array of ner_tags, and an id.

Sample data from the dataset:

{
  'tokens': ['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.'],
  'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0],
  'id': '0'
}

Data Fields

  • id: Sentence identifier.
  • tokens: Array of tokens composing a sentence.
  • ner_tags: Array of tags, where 0 marks a token outside any disease mention, 1 marks the first token of a disease mention, and 2 marks the subsequent tokens of the same mention.

For more detailed information, see the ncbi_disease dataset card on Hugging Face.


Preprocessing

Removing Stopwords and Punctuations

The raw tokens column contains punctuation characters and stopwords. These tokens and their corresponding tags are removed so they cannot produce unwanted classifications.
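A minimal sketch of this cleaning step, assuming NLTK's English stopword list and Python's string.punctuation (the repository may use different lists; clean_instance is a hypothetical helper):

import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def clean_instance(tokens, ner_tags):
    # Drop stopwords and punctuation together with their tags so the
    # tokens and ner_tags arrays stay aligned.
    kept = [(tok, tag) for tok, tag in zip(tokens, ner_tags)
            if tok.lower() not in STOPWORDS and tok not in string.punctuation]
    return [t for t, _ in kept], [g for _, g in kept]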

Words to Sequences and Sequence Padding

The token lists are vectorized with a tokenizer so they can be turned into integer sequences and padded. Since the CNN only accepts fixed-size input, all sentences are padded to the same length.
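A sketch of this step with the Keras tokenizer; MAXLEN and the OOV token are assumptions, and the sample data is the first preview instance after cleaning:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAXLEN = 64  # assumed fixed sentence length

train_sentences = [['Identification', 'APC2', 'homologue', 'adenomatous',
                    'polyposis', 'coli', 'tumour', 'suppressor']]
train_tags = [[0, 0, 0, 1, 2, 2, 2, 0]]

tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(train_sentences)  # accepts lists of tokens
X_train = pad_sequences(tokenizer.texts_to_sequences(train_sentences),
                        maxlen=MAXLEN, padding='post')
# Pad the tags the same way so each label stays aligned with its token.
y_train = pad_sequences(train_tags, maxlen=MAXLEN, padding='post', value=0)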


Model Training

Since the problem is essentially a multi-class classification problem, all of the output layers use softmax as their activation function, and every model is trained with the sparse categorical crossentropy loss.
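In Keras terms, the shared setup looks like the sketch below; compile_model is a hypothetical helper, not a name from the repository:

from tensorflow.keras.optimizers import Adam

def compile_model(model, learning_rate=1e-3):
    # Every model in the comparison shares this loss/metric configuration;
    # the learning rate is tuned per model (see below).
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model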

Only Embedding Model

After some experimentation it was observed that the model tends to overfit. For this reason a dropout layer with probability 0.6 is used, and the learning rate is set to 0.0005.
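A minimal sketch of the embedding-only tagger with the stated dropout and learning rate; the vocabulary size, embedding width, and per-token Dense output are assumptions about the architecture:

from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAXLEN = 10000, 64, 64  # placeholders (see preprocessing)

embedding_model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAXLEN),
    layers.Dropout(0.6),                    # counters the observed overfitting
    layers.Dense(3, activation='softmax'),  # one O/B/I prediction per token
])
compile_model(embedding_model, learning_rate=0.0005)  # helper sketched above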

CNN Model

This model uses a single 1D convolutional layer with 16 filters and a kernel size of 2. Like the previous model, it tends to overfit, so a dropout layer is placed before the output layer.
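A sketch with the stated 16 filters and kernel size 2; padding='same' and the dropout rate are assumptions (same-padding keeps one prediction per token):

from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAXLEN = 10000, 64, 64  # placeholders

cnn_model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAXLEN),
    layers.Conv1D(filters=16, kernel_size=2, padding='same', activation='relu'),
    layers.Dropout(0.5),                    # before the output layer; rate assumed
    layers.Dense(3, activation='softmax'),
])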

GRU Model

All of the RNN-based models (GRU, LSTM, Multiple LSTM) use bidirectional layers in this experiment. This one is a basic GRU model with 32 units; its main advantage is that it is easier to train than a basic LSTM.
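A sketch of the bidirectional GRU; return_sequences=True keeps one output per token. Swapping layers.GRU for layers.LSTM with the same arguments gives the bidirectional LSTM model of the next section:

from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAXLEN = 10000, 64, 64  # placeholders

gru_model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAXLEN),
    layers.Bidirectional(layers.GRU(32, return_sequences=True)),
    layers.Dense(3, activation='softmax'),
])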

LSTM

This model achieves better results than the GRU model, but it takes longer to train.

Multiple LSTM

Because there is not enough data, this model is the most overfitted in the experiment, despite a dropout layer and a low learning rate. It also takes the longest to train.
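A sketch of the stacked variant; the number of layers (two), the unit counts, and the dropout rate are all assumptions:

from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAXLEN = 10000, 64, 64  # placeholders

multi_lstm_model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAXLEN),
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),  # second stacked layer
    layers.Dropout(0.6),                    # rate assumed; text mentions dropout
    layers.Dense(3, activation='softmax'),
])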


Evaluation

The accuracy and loss values on the test data are listed below.

Model                        Test loss   Test accuracy
Only Embedding               0.233171    0.931220
CNN                          0.295002    0.863484
GRU                          0.236236    0.931742
Bidirectional LSTM           0.236243    0.931518
Multiple Bidirectional LSTM  0.275284    0.860127
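Each row is the [loss, accuracy] pair returned by Keras' model.evaluate on the test split, where model, X_test, and y_test come from the steps above:

results = model.evaluate(X_test, y_test, verbose=0)
print('test loss, test acc:', results)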
