NCBI-DISEASE-CLASSIFIER

A comparison of disease-word classifier models built with different types of neural networks (CNN, GRU, LSTM) on the ncbi_disease dataset, retrieved through the Hugging Face datasets API. The project applies named entity recognition to disease names.

Dataset Preview

Each instance below shows the sentence id (string), its tokens (array), and the corresponding ner_tags (array).

id: 0
tokens: [ "Identification", "of", "APC2", ",", "a", "homologue", "of", "the", "adenomatous", "polyposis", "coli", "tumour", "suppressor", "." ]
ner_tags: [ 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0 ]

id: 1
tokens: [ "The", "adenomatous", "polyposis", "coli", "(", "APC", ")", "tumour", "-", "suppressor", "protein", "controls", "the", "Wnt", "signalling", "pathway", "by", "forming", "a", "complex", "with", "glycogen", "synthase", "kinase", "3beta", "(", "GSK", "-", "3beta", ")", ",", "axin", "/", "conductin", "and", "betacatenin", "." ]
ner_tags: [ 0, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

id: 2
tokens: [ "Complex", "formation", "induces", "the", "rapid", "degradation", "of", "betacatenin", "." ]
ner_tags: [ 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

id: 3
tokens: [ "In", "colon", "carcinoma", "cells", ",", "loss", "of", "APC", "leads", "to", "the", "accumulation", "of", "betacatenin", "in", "the", "nucleus", ",", "where", "it", "binds", "to", "and", "activates", "the", "Tcf", "-", "4", "transcription", "factor", "(", "reviewed", "in", "[", "1", "]", "[", "2", "]", ")", "." ]
ner_tags: [ 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

id: 4
tokens: [ "Here", ",", "we", "report", "the", "identification", "and", "genomic", "structure", "of", "APC", "homologues", "." ]
ner_tags: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

id: 5
tokens: [ "Mammalian", "APC2", ",", "which", "closely", "resembles", "APC", "in", "overall", "domain", "structure", ",", "was", "functionally", "analyzed", "and", "shown", "to", "contain", "two", "SAMP", "domains", ",", "both", "of", "which", "are", "required", "for", "binding", "to", "conductin", "." ]
ner_tags: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]

Dataset Structure

Data Instances

Each instance of the dataset contains an array of tokens, an array of ner_tags, and an id.

Sample data from the dataset:

{
  'tokens': ['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.'],
  'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0],
  'id': '0'
}

Data Fields

  • id: Sentence identifier.
  • tokens: Array of tokens composing a sentence.
  • ner_tags: Array of tags, where 0 marks a token outside any disease mention, 1 marks the first token of a disease mention, and 2 marks the subsequent tokens of the same mention.

For more detailed information, see the ncbi_disease dataset card on Hugging Face.


Preprocessing

Removing Stopwords and Punctuations

The raw tokens column contains punctuation characters and stopwords. These tokens and their corresponding tags are removed so they cannot produce unwanted classifications.
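A minimal sketch of this cleaning step, assuming NLTK's English stopword list and Python's string.punctuation (the repository may use different lists; clean_instance is a hypothetical helper):

import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def clean_instance(tokens, ner_tags):
    # Drop stopwords and punctuation together with their tags so the
    # tokens and ner_tags arrays stay aligned.
    kept = [(tok, tag) for tok, tag in zip(tokens, ner_tags)
            if tok.lower() not in STOPWORDS and tok not in string.punctuation]
    return [t for t, _ in kept], [g for _, g in kept]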

Words to Sequences and Sequence Padding

The token lists are vectorized with a tokenizer so they can be turned into integer sequences and padded. Since the CNN only accepts fixed-size input, all sentences are padded to the same length.
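A sketch of this step with the Keras tokenizer; MAXLEN and the OOV token are assumptions, and the sample data is the first preview instance after cleaning:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAXLEN = 64  # assumed fixed sentence length

train_sentences = [['Identification', 'APC2', 'homologue', 'adenomatous',
                    'polyposis', 'coli', 'tumour', 'suppressor']]
train_tags = [[0, 0, 0, 1, 2, 2, 2, 0]]

tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(train_sentences)  # accepts lists of tokens
X_train = pad_sequences(tokenizer.texts_to_sequences(train_sentences),
                        maxlen=MAXLEN, padding='post')
# Pad the tags the same way so each label stays aligned with its token.
y_train = pad_sequences(train_tags, maxlen=MAXLEN, padding='post', value=0)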


Model Training

Since the problem is essentially a multi-class classification problem, all of the output layers use softmax as their activation function, and every model is trained with the sparse categorical crossentropy loss.
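In Keras terms, the shared setup looks like the sketch below; compile_model is a hypothetical helper, not a name from the repository:

from tensorflow.keras.optimizers import Adam

def compile_model(model, learning_rate=1e-3):
    # Every model in the comparison shares this loss/metric configuration;
    # the learning rate is tuned per model (see below).
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model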

Only Embedding Model

After some experimentation it was observed that the model tends to overfit. For this reason a dropout layer with probability 0.6 is used, and the learning rate is set to 0.0005.
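A minimal sketch of the embedding-only tagger with the stated dropout and learning rate; the vocabulary size, embedding width, and per-token Dense output are assumptions about the architecture:

from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAXLEN = 10000, 64, 64  # placeholders (see preprocessing)

embedding_model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAXLEN),
    layers.Dropout(0.6),                    # counters the observed overfitting
    layers.Dense(3, activation='softmax'),  # one O/B/I prediction per token
])
compile_model(embedding_model, learning_rate=0.0005)  # helper sketched above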

CNN Model

This model uses a single 1D convolutional layer with 16 filters and a kernel size of 2. Like the previous model, it tends to overfit, so a dropout layer is placed before the output layer.
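A sketch with the stated 16 filters and kernel size 2; padding='same' and the dropout rate are assumptions (same-padding keeps one prediction per token):

from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAXLEN = 10000, 64, 64  # placeholders

cnn_model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAXLEN),
    layers.Conv1D(filters=16, kernel_size=2, padding='same', activation='relu'),
    layers.Dropout(0.5),                    # before the output layer; rate assumed
    layers.Dense(3, activation='softmax'),
])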

GRU Model

All of the RNN-based models (GRU, LSTM, Multiple LSTM) use bidirectional layers in this experiment. This one is a basic GRU model with 32 units; its main advantage is that it is easier to train than a basic LSTM.
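A sketch of the bidirectional GRU; return_sequences=True keeps one output per token. Swapping layers.GRU for layers.LSTM with the same arguments gives the bidirectional LSTM model of the next section:

from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAXLEN = 10000, 64, 64  # placeholders

gru_model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAXLEN),
    layers.Bidirectional(layers.GRU(32, return_sequences=True)),
    layers.Dense(3, activation='softmax'),
])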

LSTM

This model achieves better results than the GRU model, but it takes longer to train.

Multiple LSTM

Because there is not enough data, this model is the most overfitted in the experiment, despite a dropout layer and a low learning rate. It also takes the longest to train.
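A sketch of the stacked variant; the number of layers (two), the unit counts, and the dropout rate are all assumptions:

from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAXLEN = 10000, 64, 64  # placeholders

multi_lstm_model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAXLEN),
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),  # second stacked layer
    layers.Dropout(0.6),                    # rate assumed; text mentions dropout
    layers.Dense(3, activation='softmax'),
])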


Evaluation

The accuracy and loss values on the test data are listed below.

Model                        Test loss   Test accuracy
Only Embedding               0.233171    0.931220
CNN                          0.295002    0.863484
GRU                          0.236236    0.931742
Bidirectional LSTM           0.236243    0.931518
Multiple Bidirectional LSTM  0.275284    0.860127
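Each row is the [loss, accuracy] pair returned by Keras' model.evaluate on the test split, where model, X_test, and y_test come from the steps above:

results = model.evaluate(X_test, y_test, verbose=0)
print('test loss, test acc:', results)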
