Process text, including tokenizing and representing sentences as vectors, and apply concepts such as RNN, LSTM and GRU to create a classifier that can detect the language in which a sentence is written from among 17 languages.

hossamasaad/Language-Identifier

Language Identifier

The goal of this project is to build a model that predicts the language of a given sentence. The text is tokenized and each sentence is represented as a vector; architectures such as RNN, LSTM and GRU are then used to build a classifier that detects the language among 16 languages.
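As a rough illustration of the tokenizing and vector-representation step, here is a minimal pure-Python sketch. The helper names (`build_vocab`, `encode`), the reserved ids, and the choice to pad at the end of the sequence are assumptions made for this example and may differ from what the project's own preprocessing does.

```python
from collections import Counter

# Reserved ids (an assumed convention for this sketch).
PAD, OOV = 0, 1


def build_vocab(sentences, max_words=10000):
    """Map the most frequent lower-cased words to integer ids starting at 2."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(sentence, vocab, max_len=8):
    """Turn a sentence into a fixed-length list of word ids."""
    ids = [vocab.get(word, OOV) for word in sentence.lower().split()]
    ids = ids[:max_len]                          # truncate long sentences
    return ids + [PAD] * (max_len - len(ids))    # pad short ones at the end


# Toy sentences, made up for illustration only.
sentences = ["the cat sat", "le chat dort", "the dog"]
vocab = build_vocab(sentences)
padded = [encode(s, vocab, max_len=4) for s in sentences]
print(padded)
```

Every sentence ends up as a list of the same length, which is the shape an embedding layer followed by an LSTM or GRU layer expects as input.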

Dataset

Language Detection: a small language detection dataset containing text samples in 16 different languages.
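To get a quick feel for how the samples are distributed across languages, the labels can be counted with the standard library alone. This is only a sketch: the column names `Text` and `Language` are assumptions about the CSV header, and the in-memory sample below is made up so that the snippet runs without the real file.

```python
import csv
import io
from collections import Counter


def count_languages(csv_file, label_column="Language"):
    """Count rows per language label in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Tiny stand-in for the real file (column names are assumed).
sample = io.StringIO(
    "Text,Language\n"
    "Hello there,English\n"
    "Bonjour,French\n"
    "Good morning,English\n"
)
counts = count_languages(sample)
print(counts.most_common())

# With the real data, something along these lines should work:
# with open("data/raw/Language Detection.csv", newline="", encoding="utf-8") as f:
#     print(count_languages(f).most_common())
```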

Project Structure

.
├── LICENSE
├── requirements.txt
├── README.md
├── data
│   ├── processed
│   │   ├── labels.pkl
│   │   ├── train_labels.pkl
│   │   ├── train_padded.pkl
│   │   ├── valid_labels.pkl
│   │   └── valid_padded.pkl
│   └── raw
│       └── Language Detection.csv
├── models
│   ├── meta.tsv
│   ├── vecs.tsv
│   ├── best_conv.pkl
│   ├── best_gru.pkl
│   ├── best_lstm.pkl
│   ├── conv_history.pkl
│   ├── gru_history.pkl
│   └── lstm_history.pkl
├── notebooks
│   └── Language Identifier.ipynb
├── reports
│   └── figures
│       ├── conv1d_accuracy.png
│       ├── conv1d_loss.png
│       ├── conv_accuracy.png
│       ├── conv_confusion_matrix.png
│       ├── conv_loss.png
│       ├── gru_accuracy.png
│       ├── gru_confusion_matrix.png
│       ├── gru_loss.png
│       ├── langauges_count.png
│       ├── lanuages_pie.png
│       ├── lstm_accuracy.png
│       ├── lstm_confusion_matrix.png
│       └── lstm_loss.png
├── src
│   ├── __init__.py
│   ├── features
│   ├── data
│   │   ├── __init__.py
│   │   ├── cleaner.py
│   │   └── explore.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── gru.py
│   │   ├── lstm.py
│   │   ├── conv1d.py
│   │   └── gridsearch.py
│   └── visualization
│       ├── __init__.py
│       ├── graphs.py
│       └── visualize.py
└── tests
    ├── conftest.py
    ├── pytest.ini
    ├── test_data.csv
    ├── test_data.py
    └── test_models.py

Model Architectures

Results

  • All models achieved high accuracy, even when a single convolution layer was used instead of LSTM or GRU. GRU achieved the highest accuracy: 99.6% training accuracy and 96.5% validation accuracy.
  • The LSTM model also achieved high accuracy, about 96.87% validation accuracy.
  • Using fewer embedding dimensions lets the model reach high accuracy faster, but in the Embedding Projector many words end up grouped with words from other languages.

32 Embedding dimensions examples

image image

3 Embedding dimensions examples

image image

LSTM Accuracy and Loss

LSTM Confusion matrix

image
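For readers who cannot see the figure, the following sketch shows how a confusion matrix and the accuracy derived from it can be computed. The labels and predictions are toy values invented for the example, not results from this project, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy data, made up for illustration only.
labels = ["English", "French", "Spanish"]
y_true = ["English", "English", "French", "Spanish", "Spanish"]
y_pred = ["English", "French", "French", "Spanish", "French"]

cm = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, cm):
    print(f"{label:<8} {row}")
print("accuracy:", accuracy(cm))
```

Off-diagonal cells show which languages get mistaken for one another, which a single accuracy figure does not reveal.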

Contributing to Language Identifier

To contribute to Language Identifier, follow these steps:

  1. Fork this repository.
  2. Create a branch: git checkout -b <branch_name>.
  3. Make your changes and commit them: git commit -m '<commit_message>'
  4. Push to the original branch: git push origin <project_name>/<location>
  5. Create the pull request.

Alternatively see the GitHub documentation on creating a pull request.

Tools

  • Python
  • TensorFlow
  • Scikit-learn
  • NumPy
  • Pandas
  • Matplotlib
  • seaborn
  • pytest
