Language Identifier

The goal of this project is to build a model that predicts the language of a given sentence. The pipeline covers text processing (tokenizing sentences and representing them as vectors) and applies concepts such as RNN, LSTM and GRU to build a classifier that detects the language among 16 languages.
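
As an illustration of the tokenizing and vector-representation step, here is a minimal pure-Python sketch that maps words to integer ids and pads every sentence to a fixed length. The function names, the reserved ids for padding and unknown words, and the whitespace tokenizer are assumptions made for this example; the project's own preprocessing may differ.

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved ids for padding and out-of-vocabulary tokens (assumed convention)


def build_vocab(sentences, max_words=10000):
    """Map the most frequent lowercase whitespace tokens to integer ids starting at 2."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_words))}


def encode(sentence, vocab, max_len=20):
    """Convert a sentence to a fixed-length list of ids, truncating or right-padding."""
    ids = [vocab.get(tok, OOV) for tok in sentence.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


sentences = ["the cat sat", "le chat dort", "the dog"]
vocab = build_vocab(sentences)
padded = [encode(s, vocab, max_len=4) for s in sentences]
```

Each row of `padded` has the same length, which is what an embedding layer followed by a recurrent layer expects. Whitespace splitting is only a simplification here: it does not segment languages written without spaces between words.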

Dataset

The project uses the Language Detection dataset, a small language detection dataset that contains text samples for 16 different languages.
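
A rough sketch of how such a dataset could be read and its language names turned into integer class ids is shown below. The column names `Text` and `Language`, the sample rows, and both helper functions are assumptions for illustration; check the header of `data/raw/Language Detection.csv` before relying on them.

```python
import csv
import io

# Tiny in-memory stand-in for data/raw/Language Detection.csv.
# The "Text" and "Language" column names are assumptions.
SAMPLE = """Text,Language
"Nature is beautiful.",English
"La nature est belle.",French
"Good morning.",English
"""


def load_rows(handle):
    """Return parallel lists of texts and language names from a CSV file handle."""
    rows = list(csv.DictReader(handle))
    return [r["Text"] for r in rows], [r["Language"] for r in rows]


def encode_labels(languages):
    """Assign each distinct language an integer id, ordered alphabetically by name."""
    classes = sorted(set(languages))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in languages], classes


texts, languages = load_rows(io.StringIO(SAMPLE))
label_ids, classes = encode_labels(languages)
```

To read the real file, the same `load_rows` function can be passed a handle from `open(path, newline="", encoding="utf-8")`.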

Project Structure

.
├── LICENSE
├── requirements.txt
├── README.md
├── data
│   ├── processed
│   │   ├── labels.pkl
│   │   ├── train_labels.pkl
│   │   ├── train_padded.pkl
│   │   ├── valid_labels.pkl
│   │   └── valid_padded.pkl
│   └── raw
│       └── Language Detection.csv
├── models
│   ├── meta.tsv
│   ├── vecs.tsv
│   ├── best_conv.pkl
│   ├── best_gru.pkl
│   ├── best_lstm.pkl
│   ├── conv_history.pkl
│   ├── gru_history.pkl
│   └── lstm_history.pkl
├── notebooks
│   └── Language Identifier.ipynb
├── reports
│   └── figures
│       ├── conv1d_accuracy.png
│       ├── conv1d_loss.png
│       ├── conv_accuracy.png
│       ├── conv_confusion_matrix.png
│       ├── conv_loss.png
│       ├── gru_accuracy.png
│       ├── gru_confusion_matrix.png
│       ├── gru_loss.png
│       ├── langauges_count.png
│       ├── lanuages_pie.png
│       ├── lstm_accuracy.png
│       ├── lstm_confusion_matrix.png
│       └── lstm_loss.png
├── src
│   ├── __init__.py
│   ├── features
│   ├── data
│   │   ├── __init__.py
│   │   ├── cleaner.py
│   │   └── explore.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── gru.py
│   │   ├── lstm.py
│   │   ├── conv1d.py
│   │   └── gridsearch.py
│   └── visualization
│       ├── __init__.py
│       ├── graphs.py
│       └── visualize.py
└── tests
    ├── conftest.py
    ├── pytest.ini
    ├── test_data.csv
    ├── test_data.py
    └── test_models.py

Model Architectures
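
The model definitions live in `src/models` (`gru.py`, `lstm.py`, `conv1d.py`) and are not reproduced in this README. As a rough illustration of the kind of classifier described here, the sketch below stacks an embedding layer, a GRU layer, and a softmax output over 16 languages using the Keras API in TensorFlow. The vocabulary size, sequence length, GRU width, and function name are assumptions, not the project's actual settings.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 10000   # assumed vocabulary size
EMBED_DIM = 32       # one of the embedding sizes compared in this README
MAX_LEN = 20         # assumed padded sentence length
NUM_LANGUAGES = 16   # number of classes, from the dataset description


def build_gru_classifier():
    """Embedding -> GRU -> softmax over languages (layer sizes are illustrative)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        tf.keras.layers.GRU(64),
        tf.keras.layers.Dense(NUM_LANGUAGES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_gru_classifier()
# A batch of two padded id sequences yields one probability row per sentence.
probs = model(np.zeros((2, MAX_LEN), dtype="int32")).numpy()
```

Replacing `tf.keras.layers.GRU(64)` with `tf.keras.layers.LSTM(64)`, or with a `Conv1D` layer followed by `GlobalMaxPooling1D`, gives comparable sketches of the other two model families.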

Results

All models achieved high accuracy, even when a single convolution layer was used instead of LSTM or GRU. GRU achieved the highest accuracy: 99.6% training accuracy and 96.5% validation accuracy.
LSTM also achieved high accuracy, about 96.87% validation accuracy.
Using fewer embedding dimensions lets the model reach high accuracy faster, but in the Embedding Projector a lot of words end up grouped with words from other languages.
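
The Embedding Projector reads a tab-separated file of vectors plus an optional metadata file with one label per line, which matches the `vecs.tsv` and `meta.tsv` files under `models`. The sketch below shows one way such files could be written from an embedding matrix. The function name and the toy words and weights are hypothetical; in practice the weights would come from the trained model's embedding layer.

```python
import io


def write_projector_files(weights, words, vecs_out, meta_out):
    """Write one tab-separated vector per line and the matching word per line."""
    for word, vector in zip(words, weights):
        vecs_out.write("\t".join(str(x) for x in vector) + "\n")
        meta_out.write(word + "\n")


# Toy 3-dimensional embeddings for three hypothetical vocabulary entries.
words = ["the", "le", "der"]
weights = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

vecs_buf, meta_buf = io.StringIO(), io.StringIO()
write_projector_files(weights, words, vecs_buf, meta_buf)
```

The in-memory buffers stand in for files opened with `open("vecs.tsv", "w", encoding="utf-8")` and `open("meta.tsv", "w", encoding="utf-8")`.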

32 Embedding dimensions examples

3 Embedding dimensions examples

LSTM Accuracy and Loss

LSTM Confusion matrix
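
The figure itself is stored as `reports/figures/lstm_confusion_matrix.png`. For readers unfamiliar with the plot, here is a minimal sketch of how a confusion matrix is computed from true and predicted class ids. The function name and the toy labels are made up for the example and are not results from this project.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions per class pair: rows are true classes, columns are predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Toy labels for three classes (illustrative only).
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)

# Accuracy is the share of samples on the diagonal.
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
```

Off-diagonal cells show which languages are mistaken for which, something a single accuracy figure cannot reveal.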

Contributing to Language Identifier

To contribute to Language Identifier, follow these steps:

1. Fork this repository.
2. Create a branch: git checkout -b <branch_name>.
3. Make your changes and commit them: git commit -m '<commit_message>'.
4. Push to the original branch: git push origin <project_name>/<location>.
5. Create the pull request.

Alternatively see the GitHub documentation on creating a pull request.

Tools

Python
TensorFlow
Scikit-learn
NumPy
Pandas
Matplotlib
seaborn
pytest

hossamasaad/Language-Identifier
