
Arabic-Dialects-Identification

  • This repo is dedicated to building a classification pipeline for the problem of Arabic dialect identification. The labels contain 18 classes/dialects, and the pipeline uses both classic ML and modern DL approaches.

Description

  • The dataset and the dialect identification problem were addressed by the Qatar Computing Research Institute. More at: https://arxiv.org/pdf/2005.06557.pdf
  • I implemented a data scraper, a data preprocessor (using regex), data modeling (SVM with TF-IDF, and MARBERT using HuggingFace) and a deployment script using Flask APIs locally.

Files/Source

Fetching (Notebooks/Fetch-Dialects-Data.ipynb or Scripts/data_fetching.py)

  • Data fetching is done by sending 1,000 IDs (the maximum JSON length) per POST request using a start and end index, parsing the response content, and appending it to the data frame. There's a small trick in the request loop: if the end index exceeds the length of the ID list, it takes only the remainder (see the sketch after this list).
  • The output of this process is a data frame containing the retrieved text and labels.
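
A minimal sketch of that batching loop, assuming a placeholder endpoint URL and a JSON response that maps each ID to its text; the real endpoint and payload format live in Scripts/data_fetching.py:

```python
import requests
import pandas as pd

API_URL = "https://example.com/dialect-texts"  # placeholder; not the real endpoint
BATCH_SIZE = 1000                              # max JSON length per POST request

def fetch_texts(ids):
    """Fetch texts for a list of IDs in batches of BATCH_SIZE."""
    records = {}
    start = 0
    while start < len(ids):
        # If the end index would exceed the ID list, take only the remainder.
        end = min(start + BATCH_SIZE, len(ids))
        batch = [str(i) for i in ids[start:end]]
        response = requests.post(API_URL, json=batch)
        response.raise_for_status()
        records.update(response.json())  # assumed to map id -> text
        start = end
    return pd.DataFrame({"id": list(records), "text": list(records.values())})
```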

Preprocessing (Notebooks/Preprocessing-Dialect-Data.ipynb or Scripts/preprocess_and_tokenize.py)

  • I have implemented two classes, one for preprocessing and the other for tokenization.
  • Preprocessing has a pipeline that applies letter normalization, tashkeel removal, character substitution, and removal of symbols, punctuation, extra spaces, quotation marks and English letters. All of this is done with Python's regex module (see the sketch after this list).
  • Tokenization has a pipeline that applies standard tokenization of the text, stop-word removal and removal of repeated words. There's also a small EDA in the notebook.
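
A minimal sketch of the regex-based cleaning steps; the patterns below are illustrative, and the full pipeline lives in Scripts/preprocess_and_tokenize.py:

```python
import re

TASHKEEL = re.compile(r"[\u0617-\u061A\u064B-\u0652]")  # Arabic diacritics
ENGLISH = re.compile(r"[A-Za-z]+")
SYMBOLS = re.compile(r"[^\w\s]")                        # punctuation and symbols

def normalize_letters(text):
    """Normalize common Arabic letter variants (illustrative subset)."""
    text = re.sub(r"[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ة", "ه", text)
    return text

def preprocess(text):
    text = normalize_letters(text)
    text = TASHKEEL.sub("", text)    # remove tashkeel
    text = ENGLISH.sub(" ", text)    # remove English letters
    text = SYMBOLS.sub(" ", text)    # remove symbols, punctuation, quotation marks
    return re.sub(r"\s+", " ", text).strip()  # collapse extra spaces
```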

Classic ML Modeling (Notebooks/Modeling-Dialect-Data-ClassicML.ipynb)

  • Before training any models, I prepared the data by dropping rows with missing values in the class (dialect) and text columns.
  • There was an imbalance in the dialects, so I used SMOTE (Synthetic Minority Oversampling Technique) to balance the classes. The amount of data is large, but undersampling would lose information, and more text data helps ML models learn better.
  • I used a TF-IDF vectorizer, as it's one of the best statistical vector spaces used in classic NLP solutions.
  • The trained model is a linear SVM, as it's one of the best linear models for text classification problems (a training sketch follows this list).
  • SVM captures relationships between features better than Naïve Bayes, and working with a well-prepared vector space makes SVM perform well in text classification problems.
  • Trained using Kaggle (CPU).
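
A minimal sketch of the TF-IDF + SMOTE + linear SVM pipeline, assuming the preprocessed data frame df has "text" and "dialect" columns (column names, split and hyperparameters are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["dialect"], test_size=0.2, stratify=df["dialect"], random_state=42)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Oversample minority dialects so the classifier sees balanced classes.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_vec, y_train)

clf = LinearSVC()
clf.fit(X_train_bal, y_train_bal)
print(classification_report(y_test, clf.predict(X_test_vec)))
```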

DL Modeling (Notebooks/MARBERT-FineTuning.ipynb)

  • I used MARBERT (a deep bidirectional transformer for Arabic).
  • MARBERT is a large-scale pre-trained masked language model focused on both Dialectal Arabic (DA) and MSA.
  • Fine-tuning a pre-trained dialectal Arabic transformer excels at this task better than any BiLSTM/LSTM model, as it has tremendous word embeddings trained on 1B Arabic tweets from a large in-house dataset of about 6B tweets (a fine-tuning sketch follows this list).
  • MARBERT is the same network architecture as ARBERT (BERT-base), but without the next sentence prediction (NSP) objective since tweets are short.
  • MARBERT tokenizer and fine-tuned transformer will be used later in deployment.
  • Trained using Kaggle GPU (NVIDIA V100).
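
A minimal fine-tuning sketch with the HuggingFace Trainer, assuming the UBC-NLP/MARBERT checkpoint on the Hub and HuggingFace Datasets train_ds/eval_ds with "text" and "label" columns (dataset names and hyperparameters are assumptions):

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "UBC-NLP/MARBERT"
NUM_LABELS = 18  # one label per dialect

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="marbert-dialects",
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
trainer.save_model("marbert-dialects-final")  # reused later by the deployment script
```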

Model Deployment (Scripts/app.py)

  • This is a small script, based on the Flask micro-framework, that takes an input from the user and displays an output using a static page. All back-end processing is done in the previous stages (a minimal sketch follows).
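
A minimal sketch of such a Flask app; the template name and the classify_dialect helper are hypothetical, and the actual script is Scripts/app.py:

```python
from flask import Flask, request, render_template

app = Flask(__name__)

# The fine-tuned MARBERT model/tokenizer (or the SVM pipeline) would be
# loaded here from the artifacts produced in the previous stages.

@app.route("/", methods=["GET", "POST"])
def predict():
    prediction = None
    if request.method == "POST":
        text = request.form.get("text", "")
        prediction = classify_dialect(text)  # hypothetical helper around the model
    return render_template("index.html", prediction=prediction)  # assumed template

if __name__ == "__main__":
    app.run(debug=True)
```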

Evaluation Metrics and Results:

  • The imbalanced-learn module uses macro averaging internally.

  • Macro averaging is perhaps the most straightforward amongst the numerous averaging methods.

  • The macro-averaged F1 score (or macro F1 score) is computed by taking the arithmetic mean (aka unweighted mean) of all the per-class F1 scores.

  • This method treats all classes equally regardless of their support values. When working with an imbalanced dataset where all classes are equally important, the macro average is therefore a good choice (a computation sketch follows this list).
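
A minimal sketch of computing the macro F1 score, reusing clf, X_test_vec and y_test from the classic-ML sketch above:

```python
from sklearn.metrics import f1_score

# Macro F1: the unweighted mean of the per-class F1 scores,
# treating all 18 dialects equally regardless of support.
macro_f1 = f1_score(y_test, clf.predict(X_test_vec), average="macro")
print(f"Macro F1: {macro_f1:.3f}")
```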



Install

You need Python 3 and the required Python libraries; all of them are provided in env.yml.

API

Screenshot of the local API page (image omitted).
