The goal of this project is to build a model that predicts the language of a given sentence. The text is tokenized and each sentence is represented as a vector; recurrent architectures (RNN, LSTM and GRU) are then used to build a classifier that distinguishes among 16 languages.
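As an illustration of the tokenizing and vector-representation step, here is a minimal sketch in plain Python. It is not the project's own preprocessing code: the helper names (`build_vocab`, `encode`, `pad`), the reserved ids, and the whitespace-based word splitting are all assumptions made for this example.

```python
from collections import Counter

PAD_ID = 0  # assumed id reserved for padding
OOV_ID = 1  # assumed id reserved for words missing from the vocabulary


def build_vocab(sentences, max_words=None):
    """Map each word to an integer id, most frequent words first (ids start at 2)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(max_words), start=2)
    }


def encode(sentence, vocab):
    """Turn a sentence into a list of word ids, using OOV_ID for unknown words."""
    return [vocab.get(word, OOV_ID) for word in sentence.lower().split()]


def pad(sequence, max_len):
    """Truncate or right-pad a sequence with PAD_ID to exactly max_len ids."""
    return sequence[:max_len] + [PAD_ID] * max(0, max_len - len(sequence))


# Hypothetical toy corpus; the real data lives in the raw CSV file.
sentences = ["the cat sat", "le chat dort", "the dog"]
vocab = build_vocab(sentences)
padded = [pad(encode(s, vocab), max_len=4) for s in sentences]
```

Every sentence ends up as a fixed-length list of integers, which is the shape an embedding layer followed by an LSTM or GRU expects. Splitting on whitespace is a simplification and would not suit languages written without spaces between words.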
Language Detection is a small language detection dataset. It contains text samples in 16 different languages.
.
├── LICENSE
├── requirements.txt
├── README.md
├── data
│   ├── processed
│   │   ├── labels.pkl
│   │   ├── train_labels.pkl
│   │   ├── train_padded.pkl
│   │   ├── valid_labels.pkl
│   │   └── valid_padded.pkl
│   └── raw
│       └── Language Detection.csv
├── models
│   ├── meta.tsv
│   ├── vecs.tsv
│   ├── best_conv.pkl
│   ├── best_gru.pkl
│   ├── best_lstm.pkl
│   ├── conv_history.pkl
│   ├── gru_history.pkl
│   └── lstm_history.pkl
├── notebooks
│   └── Language Identifier.ipynb
├── reports
│   └── figures
│       ├── conv1d_accuracy.png
│       ├── conv1d_loss.png
│       ├── conv_accuracy.png
│       ├── conv_confusion_matrix.png
│       ├── conv_loss.png
│       ├── gru_accuracy.png
│       ├── gru_confusion_matrix.png
│       ├── gru_loss.png
│       ├── langauges_count.png
│       ├── lanuages_pie.png
│       ├── lstm_accuracy.png
│       ├── lstm_confusion_matrix.png
│       └── lstm_loss.png
├── src
│   ├── __init__.py
│   ├── features
│   ├── data
│   │   ├── __init__.py
│   │   ├── cleaner.py
│   │   └── explore.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── gru.py
│   │   ├── lstm.py
│   │   ├── conv1d.py
│   │   └── gridsearch.py
│   └── visualization
│       ├── __init__.py
│       ├── graphs.py
│       └── visualize.py
└── tests
    ├── conftest.py
    ├── pytest.ini
    ├── test_data.csv
    ├── test_data.py
    └── test_models.py
- All models achieved high accuracy, even with a single convolution layer in place of LSTM or GRU. GRU achieved the highest accuracy: 99.6% training accuracy and 96.5% validation accuracy.
- LSTM also achieved high accuracy, with about 96.87% validation accuracy.
- Using fewer embedding dimensions lets the model reach high accuracy faster, but in the Embedding Projector a lot of words end up grouped with words from other languages.
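The Embedding Projector reads one tab-separated file of vectors and one metadata file of labels, which appears to be the role of `models/vecs.tsv` and `models/meta.tsv`. The sketch below shows one way such files can be written. The function name `export_for_projector` and the toy vectors are hypothetical, and how the project extracts the embedding weights from a trained model is not shown here.

```python
import os
import tempfile


def export_for_projector(words, vectors, out_dir):
    """Write vecs.tsv (one tab-separated vector per line) and meta.tsv (one word per line)."""
    if len(words) != len(vectors):
        raise ValueError("words and vectors must have the same length")
    vecs_path = os.path.join(out_dir, "vecs.tsv")
    meta_path = os.path.join(out_dir, "meta.tsv")
    with open(vecs_path, "w", encoding="utf-8", newline="") as vecs, \
            open(meta_path, "w", encoding="utf-8", newline="") as meta:
        for word, vector in zip(words, vectors):
            vecs.write("\t".join(str(x) for x in vector) + "\n")
            # Single-column metadata: one label per line, no header row.
            meta.write(word + "\n")
    return vecs_path, meta_path


# Toy two-dimensional embeddings, made up for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    vecs_file, meta_file = export_for_projector(
        ["hello", "bonjour"], [[0.1, 0.2], [0.3, 0.4]], tmp
    )
    with open(vecs_file, encoding="utf-8") as f:
        exported_vectors = f.read()
```

Line *i* of the metadata file labels line *i* of the vector file, so the two files must be written in the same order. Loading both into the projector is how one would inspect whether words from different languages cluster together.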
To contribute to Language Identifier, follow these steps:
- Fork this repository.
- Create a branch:
git checkout -b <branch_name>
- Make your changes and commit them:
git commit -m '<commit_message>'
- Push to the original branch:
git push origin <project_name>/<location>
- Create the pull request.
Alternatively, see the GitHub documentation on creating a pull request.
- Python
- TensorFlow
- scikit-learn
- NumPy
- Pandas
- Matplotlib
- seaborn
- pytest