AnnotatedTweets2RDF

This open-source library generates RDF tuples from a text corpus. It is aimed mainly at Twitter data, which must be stored in the format described below, but it can also be used with any other dataset that follows the same field layout.

  • Entities were extracted from the text (tweets) using the FEL implementation (threshold set to -3).
  • Sentiment annotation of the text was done with the SentiStrength implementation.

The implementation is written in Scala and Java for a distributed environment: Apache Spark and Apache Hadoop.

Dataset Description

Each instance of the dataset must be stored on its own line. The attributes of an instance must be separated by a tab character ("\t"). The attributes, in order, are:

  1. Document Id: Long
  2. Username: String. In our case this field is encrypted for privacy reasons.
  3. Timestamp: String in the format "EEE MMM dd HH:mm:ss Z yyyy"
  4. #Followers: Integer
  5. #Friends: Integer
  6. #Retweets: Integer
  7. #Favorites: Integer
  8. Entities: String. For each entity we store the original text, the annotated entity, and the score produced by the FEL library, separated by ":". Each entity is terminated by ";", giving "original_text:annotated_entity:score;". If FEL did not find any entities, the field is "null;".
  9. Sentiment: String. SentiStrength produces a positive score (1 to 5) and a negative score (-1 to -5). The two numbers are separated by a single space " ", with the positive score first (e.g. "2 -1").
  10. Mentions: String. If the text contains mentions, we remove the "@" character and join the mentions with a single space " ". If the text contains no mentions, the field is "null;".
  11. Hashtags: String. If the text contains hashtags, we remove the "#" character and join the hashtags with a single space " ". If the text contains no hashtags, the field is "null;".
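The field layout above can be illustrated with a small parser. This is only a minimal sketch in plain Java, not code from the library: the class name TweetRecordSketch, its field names, and the sample line are hypothetical, and it assumes that neither the original text nor the annotated entity contains ":" or ";".

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of the library): parses one dataset line.
class TweetRecordSketch {
    // Timestamp pattern taken from the dataset description.
    static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

    long documentId;
    String username;
    OffsetDateTime timestamp;
    int followers, friends, retweets, favorites;
    List<String[]> entities; // each element: {original_text, annotated_entity, score}
    int positive, negative;
    List<String> mentions, hashtags;

    static TweetRecordSketch parse(String line) {
        // Limit -1 keeps trailing empty fields so the count check is reliable.
        String[] f = line.split("\t", -1);
        if (f.length != 11) {
            throw new IllegalArgumentException("expected 11 fields, got " + f.length);
        }
        TweetRecordSketch r = new TweetRecordSketch();
        r.documentId = Long.parseLong(f[0]);
        r.username = f[1];
        r.timestamp = OffsetDateTime.parse(f[2], TIMESTAMP);
        r.followers = Integer.parseInt(f[3]);
        r.friends = Integer.parseInt(f[4]);
        r.retweets = Integer.parseInt(f[5]);
        r.favorites = Integer.parseInt(f[6]);
        r.entities = parseEntities(f[7]);
        String[] sentiment = f[8].split(" "); // positive score first, then negative
        r.positive = Integer.parseInt(sentiment[0]);
        r.negative = Integer.parseInt(sentiment[1]);
        r.mentions = parseTokens(f[9]);
        r.hashtags = parseTokens(f[10]);
        return r;
    }

    // "null;" means no entities; otherwise entities are ";"-terminated triples.
    static List<String[]> parseEntities(String field) {
        List<String[]> out = new ArrayList<>();
        if (field.equals("null;")) {
            return out;
        }
        for (String item : field.split(";")) {
            if (item.isEmpty()) {
                continue;
            }
            String[] parts = item.split(":"); // assumption: no ":" inside the parts
            if (parts.length != 3) {
                throw new IllegalArgumentException("malformed entity: " + item);
            }
            out.add(parts);
        }
        return out;
    }

    // "null;" means no mentions/hashtags; otherwise they are space-separated.
    static List<String> parseTokens(String field) {
        if (field.equals("null;")) {
            return Collections.emptyList();
        }
        return Arrays.asList(field.split(" "));
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        String line = String.join("\t", "1", "user_hash",
                "Wed Oct 10 20:19:24 +0000 2018", "10", "20", "3", "4",
                "nyc:New_York_City:-1.5;", "2 -1", "alice bob", "null;");
        TweetRecordSketch r = parse(line);
        System.out.println(r.entities.size() + " entities, "
                + r.mentions.size() + " mentions, " + r.hashtags.size() + " hashtags");
    }
}
```

The sketch rejects lines that do not have exactly 11 fields, which may be stricter than what the library itself does.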

Deployment

After you build the jar file, you can deploy the code on a Spark cluster.

We provide one class for generating the tuples and several other classes for generating statistics from the dataset.

The class 'RdfExtractor' generates the tuples. It takes the following arguments:

  1. Input directory of your dataset (String)
  2. Fraction of the dataset to process: a float in (0, 1] (set it to 1 to use the full dataset)
  3. Output directory

For example: "spark-submit --class RdfExtractor file_name.jar input_directory/ 1 output_directory/"
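
To make the argument contract concrete, here is a minimal sketch of how the three arguments could be checked before a job starts. It is not the actual RdfExtractor source: the class ExtractorArgsSketch and its field names are hypothetical, and the only rule it encodes is the one stated above, that the second argument must lie in (0, 1].

```java
// Hypothetical sketch: this is NOT the actual RdfExtractor code.
class ExtractorArgsSketch {
    final String inputDir;
    final double fraction;
    final String outputDir;

    ExtractorArgsSketch(String inputDir, double fraction, String outputDir) {
        this.inputDir = inputDir;
        this.fraction = fraction;
        this.outputDir = outputDir;
    }

    // Expects: <input_directory> <fraction in (0, 1]> <output_directory>
    static ExtractorArgsSketch parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                    "usage: <input_directory> <fraction in (0, 1]> <output_directory>");
        }
        double fraction = Double.parseDouble(args[1]);
        // The negated form also rejects NaN.
        if (!(fraction > 0.0 && fraction <= 1.0)) {
            throw new IllegalArgumentException("fraction must be in (0, 1], got " + args[1]);
        }
        return new ExtractorArgsSketch(args[0], fraction, args[2]);
    }

    public static void main(String[] args) {
        // Same argument values as in the spark-submit example.
        ExtractorArgsSketch a =
                parse(new String[] {"input_directory/", "1", "output_directory/"});
        System.out.println(a.inputDir + " " + a.fraction + " " + a.outputDir);
    }
}
```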