Offensive Moroccan Comments Dataset (OMCD) is a dataset for offensive language identification, collected from YouTube comments.

kabilessefar/OMCD-Offensive-Moroccan-Comments-Dataset

OMCD: Offensive Moroccan Comments Dataset

This repository contains the code and dataset for the research paper titled "OMCD: Offensive Moroccan Comments Dataset" published in Language Resources and Evaluation.

Paper Information:

Abstract:

Offensive content, such as verbal attacks, demeaning comments, or hate speech, has become widespread on social media. Automatic detection of this content is considered an important and challenging task. Although several research works have been proposed to address this challenge for high-resource languages, research on detecting offensive content in Dialectal Arabic (DA) remains under-explored. Recently, the detection of offensive language in DA has gained increasing interest among researchers in Natural Language Processing (NLP). However, only a limited number of annotated datasets have been introduced for single or multiple coarse-grained dialects.

In this paper, we introduce Offensive Moroccan Comments Dataset (OMCD), the first dataset for offensive language detection for the Moroccan dialect. First, we present the data collection steps, the statistical analysis, and the annotation guidelines of the introduced dataset. Then, we evaluate several state-of-the-art Machine Learning (ML) and Deep Learning (DL) based models on the OMCD dataset. Finally, we highlight the impact of emojis on the evaluated models for offensive language detection.

Key Features:

  • Dataset for offensive language detection in Moroccan dialect
  • Data collection steps, statistical analysis, and annotation guidelines
  • Evaluation of state-of-the-art Machine Learning (ML) and Deep Learning (DL) models
  • Analysis of the impact of emojis on offensive language detection
  • Relevant keywords: Offensive language, Arabic NLP, Moroccan dialect, Text classification, Social media platforms
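
To make the emoji analysis concrete, here is a minimal sketch of how a comment could be prepared in two variants, with and without emojis, so that the same model can be compared on both. It is not the paper's preprocessing: the function names `strip_emojis` and `make_variants` are hypothetical, and the Unicode ranges are an assumed, non-exhaustive approximation of what counts as an emoji.

```python
import re

# Assumed, non-exhaustive emoji ranges (not taken from the paper).
_EMOJI = (
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "]"
)

# One or more emoji characters, optionally chained by zero-width joiners.
# The joiner is only matched between emojis, so it is left alone elsewhere.
EMOJI_PATTERN = re.compile(f"{_EMOJI}+(?:\u200d{_EMOJI}+)*")


def strip_emojis(text: str) -> str:
    """Replace emoji runs with a space, then collapse repeated whitespace."""
    without = EMOJI_PATTERN.sub(" ", text)
    return " ".join(without.split())


def make_variants(comment: str) -> dict:
    """Return both versions of a comment for a with/without-emoji comparison."""
    return {
        "with_emojis": " ".join(comment.split()),
        "without_emojis": strip_emojis(comment),
    }


if __name__ == "__main__":
    print(make_variants("مزيان 👍 bravo 😂😂"))
```

Training and evaluating one classifier on each variant gives a simple way to see how much the emojis contribute. For the setup actually used, the paper remains the reference.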

Please refer to the paper for detailed information on the dataset, methodologies, and results.
