Simple modules that map a Twitter dataset stored in CSV to easy-to-use arrays. A short summary of the dataset is also computed.
Load modules and specify dataset path:
require_relative "../lib/dataset_mapper.rb"
include DatasetMapper
@dataset_path = '/src/thesis/inesc_data_set_sample/decompressed'
@base_file = 'tweets01_aaaa'
@default_data = :with_stem

Get the array of tweets in the dataset:
tweets # => array full of tweets
tweets.first # => 'paintbrush work ipad sensubrushman'

Get the dataset stats with default values:
puts inspect_dataset_stats() # =>
# total number of tweets, 87558
# number of tweets in selected words, 425
# number of words in the dataset, 50880
# number of words used in sample, 18
# number of word ocurrences in sample, 463
# percentil used, 0.94
# words used, ["recogn", "bracket", "basket", "mar", "length", "initi", "dye", "eras", "tradit", "liverpol", "delici", "advantag", "robot", "potus", "belief", "volum", "hok", "thirstythursday"]

It is possible to override the default number of words to be selected from the dataset. The percentile offset at which words start to be chosen can also be changed.
selected_words.size() #=> whatever number it was before
@percentil = 0.5
@number_of_words = 250
selected_words.size() #=> 250

This code was used with other gems to manipulate the initial dataset in JSON. It expects a dataset with the following structure:
| Dataset File | Use |
|---|---|
| tweets01_aaaa | JSON file with tweets captured by INESC-ID |
| tweets01_aaaa_english_trimed.csv | CSV file containing the tweets, comma-separated. Ex: 200740321783578624,ReRoc_Rochi_KTB,females kno yal jus sit nigas job supose |
| tweets01_aaaa_english_trimed_with_stem.csv | Same as above but with stemming applied. Ex: 200740321783578624,ReRoc_Rochi_KTB,femal kno yal jus sit niga job supos |
| *.wcount | Word counts for the .csv file with the same base name: each word followed by the number of times it appears. Ex (love appeared 4253 times): love,4253 |
| data_set_words | all the words that appear in the dataset |
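
As a rough illustration of the two line formats in the table above, the following sketch parses one CSV line and one `.wcount` line in plain Ruby. The helper names `parse_tweet_line` and `parse_wcount` are hypothetical and not part of `DatasetMapper`; the field order (id, user, text) is assumed from the examples in the table.

```ruby
# Hypothetical helper (not part of DatasetMapper): split one line of a
# *_trimed.csv file into its fields, assuming the layout "id,user,text".
def parse_tweet_line(line)
  # Limit the split to 3 parts so any comma inside the text is preserved.
  id, user, text = line.chomp.split(",", 3)
  { id: id, user: user, words: text.to_s.split }
end

# Hypothetical helper: turn .wcount lines ("word,count") into a Hash
# that maps each word to its integer count.
def parse_wcount(lines)
  lines.each_with_object({}) do |line, counts|
    word, count = line.chomp.split(",", 2)
    counts[word] = Integer(count)
  end
end

tweet = parse_tweet_line("200740321783578624,ReRoc_Rochi_KTB,femal kno yal jus sit niga job supos")
puts tweet[:user]        # user name field
puts tweet[:words].size  # number of stemmed words in the tweet

counts = parse_wcount(["love,4253"])
puts counts["love"]
```

Keeping the tweet id as a string avoids any assumption about how the ids are used elsewhere; convert it with `Integer(id)` if numeric ids are needed.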
This code is released under the MIT License.

