LABR: Large Scale Arabic Book Reviews Dataset

LABR: A Large-Scale Arabic Book Reviews Dataset

This dataset contains over 63,000 book reviews in Arabic. It is the largest sentiment analysis dataset for Arabic to date. The book reviews were harvested from the Goodreads website during March 2013. Each book review comes with the Goodreads review id, the user id, the book id, the rating (1 to 5), and the text of the review.

Contents:

  • README.txt: this file

  • data/labr_data

    - reviews.tsv: a tab-separated file containing the "cleaned up" reviews. It contains over 63,000 reviews. The format is:
                   
                   rating<TAB>review id<TAB>user id<TAB>book id<TAB>review
                   
      where:
    
                   rating: the user rating on a scale of 1 to 5
                   review id: the goodreads.com review id
                   user id: the goodreads.com user id
                   book id: the goodreads.com book id
                   review: the text of the review
    
    - 2class-balanced-train/test.txt: text files containing the indices of reviews 
                   (from the reviews.tsv file) that are in the training/test
                   sets. Balanced means the numbers of reviews in the 
                   positive and negative classes are equal. Ratings are 
                   converted into positive (ratings 4 and 5) and negative 
                   (ratings 1 and 2); rating 3 is ignored.
                   
    - 2class-unbalanced-train/test.txt: the same, but the sizes of the classes 
                   are not equal.
                   
    
    - 3class-balanced/unbalanced-train/test/validation.txt: the same, but for 3 classes 
                   instead of just 2.
    
    - 5class-balanced/unbalanced-train/test.txt: the same, but for 5 classes 
                   instead of just 2.
    
  • data/labr_lexicon

    • POS.txt: A file containing positive phrases generated from LABR.
    • NEGATIVE.txt: A file containing negative phrases generated from LABR.
    • Neg.txt: A file containing some negation operators generated from LABR.
  • data/dr_samha_lex

    • POS.txt: A file containing positive phrases generated by Samhaa R. El-Beltagy and Ahmed Ali, "Open issues in the sentiment analysis of Arabic social media: A case study" (2013), pp. 215-220.
    • NEGATIVE.txt: A file containing negative phrases generated by Samhaa R. El-Beltagy and Ahmed Ali, "Open issues in the sentiment analysis of Arabic social media: A case study" (2013), pp. 215-220.
  • python/

    • labr.py: the main interface to the dataset. Contains functions that can read/write training and test sets.

    • experiments.py: a Python script containing the code used to generate the experiments of http://arxiv.org/abs/1411.6718

    • Defiantions.py: a Python file containing the definitions of the classifiers and feature generators used.

    • Utilities.py: a Python file containing some reading functions and classifier performance measurement functions.
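
The reviews.tsv layout and the 2-class rating mapping described above can be sketched in a few lines of Python. This is a minimal illustration and not the code in labr.py: the helper names `parse_reviews` and `polarity` are hypothetical, the sample rows are made up for the example, and the file is assumed to be UTF-8 encoded with no header line.

```python
import csv
import io


def parse_reviews(handle):
    """Yield (rating, review_id, user_id, book_id, text) from a reviews.tsv-style stream.

    Hypothetical helper: assumes one review per line with the five
    tab-separated fields listed above and no header line.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5:
            continue  # skip blank or malformed lines
        rating, review_id, user_id, book_id = row[:4]
        # Rejoin the remainder in case the review text itself contains a tab.
        text = "\t".join(row[4:])
        yield int(rating), review_id, user_id, book_id, text


def polarity(rating):
    """Map a 1-5 rating to the 2-class labels; rating 3 is ignored (None)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return None


# Made-up sample rows in the documented format (not taken from the dataset).
sample = (
    "5\t101\t11\t1001\tكتاب رائع\n"
    "3\t102\t12\t1002\tلا بأس به\n"
    "1\t103\t13\t1003\tممل جدا\n"
)
rows = list(parse_reviews(io.StringIO(sample)))
labels = [polarity(row[0]) for row in rows]
print(labels)
```

To read the real file, the same function could be given `open("data/labr_data/reviews.tsv", encoding="utf-8", newline="")`, assuming that path and encoding match your checkout.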

Demo

To regenerate the splits with different test/train/validation percentages:

from labr import LABR

l = LABR()

(rating, a, b, c, body) = l.read_clean_reviews()

l.split_train_validation_test_3class(rating, percent_test, percent_valid, balanced="unbalanced")
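
For readers who want to see what a percentage-based split looks like independently of the library, here is a minimal sketch. It is not the implementation of `split_train_validation_test_3class`; the function `split_indices` is hypothetical, the percentages are assumed to be fractions between 0 and 1, and the seed is only there to make the shuffle repeatable.

```python
import random


def split_indices(n, percent_test, percent_valid, seed=0):
    """Shuffle the indices 0..n-1 and return (train, valid, test) index lists.

    Hypothetical helper: percent_test and percent_valid are assumed to be
    fractions in [0, 1) whose sum is below 1.
    """
    if percent_test < 0 or percent_valid < 0 or percent_test + percent_valid >= 1:
        raise ValueError("percent_test and percent_valid must be >= 0 and sum to < 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = int(n * percent_test)
    n_valid = int(n * percent_valid)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test


train, valid, test = split_indices(1000, percent_test=0.2, percent_valid=0.1)
print(len(train), len(valid), len(test))
```

The returned lists hold review indices, which mirrors how the train/test/validation text files in data/labr_data refer to rows of reviews.tsv. This sketch produces an unbalanced split; a balanced one would additionally downsample each class to the same size.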

To try a new classifier, add it to the "classifiers" list in Definations.py, then run experiment.py.

Reference

Please cite this paper for any usage of the dataset:

Mohamed Aly and Amir Atiya. LABR: Large-scale Arabic Book Reviews Dataset. Association for Computational Linguistics (ACL), Bulgaria, August 2013.