bernardo-cs/dataset_mapper
Dataset Mapper

Simple modules that map a Twitter dataset stored in CSV files to easy-to-use arrays. A short summary of the dataset is also computed.

How to use

Load modules and specify dataset path:

require_relative "../lib/dataset_mapper.rb"
include DatasetMapper

@dataset_path = '/src/thesis/inesc_data_set_sample/decompressed' 
@base_file = 'tweets01_aaaa'
@default_data = :with_stem

Get the array of tweets in the dataset:

tweets       # => array full of tweets
tweets.first # => 'paintbrush work ipad sensubrushman' 
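
As a rough illustration of the mapping, the sketch below turns CSV lines of the form id,user,text into an array of tweet texts. The helper name tweets_from_lines and the sample ids and user names are hypothetical and not part of DatasetMapper; the real module may read and parse the files differently. It assumes the text column contains no quoted fields, which seems to hold for the trimmed files.

```ruby
# Hypothetical helper, not part of DatasetMapper: keeps only the third
# column (the tweet text) of each "id,user,text" line.
def tweets_from_lines(lines)
  # Limit the split to 3 parts so commas inside the text would be preserved.
  lines.map { |line| line.chomp.split(",", 3)[2] }
end

# Made-up sample rows in the same shape as the dataset's CSV files.
sample_lines = [
  "1,example_user,paintbrush work ipad sensubrushman\n",
  "2,another_user,robot basket volum\n"
]

sample_tweets = tweets_from_lines(sample_lines)
puts sample_tweets.first
```

For files with quoted or escaped fields, a full CSV parser would be the safer choice.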

Get the dataset stats with default values:

puts inspect_dataset_stats() # =>
# total number of tweets,                 87558
# number of tweets in selected words,     425
# number of words in the dataset,         50880
# number of words used in sample,         18
# number of word ocurrences in sample,    463
# percentil used,                         0.94
# words used,             ["recogn", "bracket", "basket", "mar", "length", "initi", "dye", "eras", "tradit", "liverpol", "delici", "advantag", "robot", "potus", "belief", "volum", "hok", "thirstythursday"]

You can override the default number of words selected from the dataset (@number_of_words). The percentile offset at which words start to be chosen (@percentil) can also be changed.

selected_words.size()  #=> whatever number it was before
@percentil              = 0.5
@number_of_words        = 250
selected_words.size()  #=> 250
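
One way to picture how the two settings could interact is sketched below. This is an assumption, not the module's actual algorithm: words are sorted by ascending occurrence count, the percentile gives the starting index in that ordering, and the requested number of words is taken from there. The method name select_words and the sample counts are hypothetical.

```ruby
# Hypothetical sketch, not the DatasetMapper implementation.
# word_counts: Hash of word => number of occurrences.
def select_words(word_counts, percentil:, number_of_words:)
  # Sort by count, breaking ties alphabetically for a stable order.
  sorted = word_counts.sort_by { |word, count| [count, word] }.map(&:first)
  # The percentile is turned into a starting index in the sorted list.
  start = (sorted.size * percentil).floor
  sorted[start, number_of_words] || []
end

# Made-up counts for illustration only.
counts = { "love" => 10, "robot" => 3, "basket" => 5, "dye" => 1, "volum" => 7 }
p select_words(counts, percentil: 0.5, number_of_words: 2)
```

With this reading, a higher percentile picks more frequent words, and fewer words than requested are returned when the offset is close to the end of the list.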

Dataset Format

This code was used together with other gems to process the initial JSON dataset. It expects a dataset with the following structure:

Dataset Files

| Dataset File | Use |
| --- | --- |
| tweets01_aaaa | JSON file with tweets captured by INESC-id |
| tweets01_aaaa_english_trimed.csv | CSV file containing comma-separated tweets. Ex: 200740321783578624,ReRoc_Rochi_KTB,females kno yal jus sit nigas job supose |
| tweets01_aaaa_english_trimed_with_stem.csv | Same as above but with stemming applied. Ex: 200740321783578624,ReRoc_Rochi_KTB,femal kno yal jus sit niga job supos |
| *.wcount | Word-count file for the .csv file of the same name, listing each word with its number of occurrences. Ex (love appeared 4253 times): love,4253 |
| data_set_words | All the words that appear in the dataset |
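
The word,count line format of the .wcount files can be illustrated with a short sketch. The helpers count_words and to_wcount_lines are hypothetical and only show the shape of the data; the gems originally used to build these files are not documented here and may work differently.

```ruby
# Hypothetical helpers, not part of DatasetMapper.

# Counts how often each whitespace-separated word appears across all tweets.
def count_words(tweets)
  tweets.each_with_object(Hash.new(0)) do |tweet, counts|
    tweet.split.each { |word| counts[word] += 1 }
  end
end

# Formats the counts as "word,count" lines, as in a .wcount file.
def to_wcount_lines(counts)
  counts.map { |word, count| "#{word},#{count}" }
end

# Made-up tweets for illustration only.
puts to_wcount_lines(count_words(["love robot love", "robot basket"]))
```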

License

This code is released under the MIT License.
