Dataset Mapper

Simple modules that map a Twitter dataset stored in CSV files to easy-to-use arrays. A short summary of the dataset is also computed.

How to use

Load modules and specify dataset path:

require_relative "../lib/dataset_mapper.rb"
include DatasetMapper

@dataset_path = '/src/thesis/inesc_data_set_sample/decompressed' 
@base_file = 'tweets01_aaaa'
@default_data = :with_stem

Get the array of tweets in the dataset:

tweets       # => array full of tweets
tweets.first # => 'paintbrush work ipad sensubrushman'

Get the dataset stats with default values:

puts inspect_dataset_stats() # =>
# total number of tweets,                 87558
# number of tweets in selected words,     425
# number of words in the dataset,         50880
# number of words used in sample,         18
# number of word ocurrences in sample,    463
# percentil used,                         0.94
# words used,             ["recogn", "bracket", "basket", "mar", "length", "initi", "dye", "eras", "tradit", "liverpol", "delici", "advantag", "robot", "potus", "belief", "volum", "hok", "thirstythursday"]
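
How these figures are derived is not documented here. As a rough illustration only, the sketch below computes similar counts from an array of tweet strings and a list of selected words. The method `sample_stats` and its hash keys are hypothetical and not part of DatasetMapper.

```ruby
# Hypothetical helper, not part of DatasetMapper: compute summary
# counts for a list of tweets and a list of selected words.
def sample_stats(tweets, selected_words)
  # Split every tweet into its whitespace-separated words.
  tokenized = tweets.map(&:split)
  # Tweets that contain at least one of the selected words.
  matching = tokenized.select { |words| (words & selected_words).any? }
  # Total occurrences of selected words across all tweets.
  occurrences = tokenized.flatten.count { |w| selected_words.include?(w) }
  {
    total_tweets: tweets.size,
    tweets_with_selected_words: matching.size,
    distinct_words: tokenized.flatten.uniq.size,
    selected_word_occurrences: occurrences
  }
end

tweets = ["robot dye robot", "basket volum", "paintbrush work ipad"]
stats = sample_stats(tweets, ["robot", "basket"])
stats[:total_tweets]               # => 3
stats[:tweets_with_selected_words] # => 2
```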

The default number of words selected from the dataset can be overridden. The percentile offset at which words start to be chosen can also be changed.

selected_words.size()  #=> whatever number it was before
@percentil              = 0.5
@number_of_words        = 250
selected_words.size()  #=> 250
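
How the percentile offset and the number of words interact is not spelled out here. The following is a minimal sketch of one plausible reading, assuming words are ranked by ascending occurrence count and the selection starts at the given percentile of that ranking. The method `pick_words` and its arguments are hypothetical and not part of DatasetMapper.

```ruby
# Hypothetical helper, not part of DatasetMapper: pick `number_of_words`
# words starting at the `percentil` offset of a frequency ranking.
def pick_words(word_counts, percentil, number_of_words)
  # Rank words from least to most frequent (ties broken alphabetically).
  ranked = word_counts.sort_by { |word, count| [count, word] }.map(&:first)
  # Index where the requested percentile falls in the ranking.
  start = (ranked.size * percentil).floor
  # Take the slice; an offset past the end yields an empty selection.
  ranked[start, number_of_words] || []
end

counts = { "love" => 4253, "robot" => 12, "basket" => 30, "dye" => 7 }
pick_words(counts, 0.5, 2) # => ["basket", "love"]
```

Under this reading, a higher percentile shifts the selection toward more frequent words, while the word count only controls how many are taken from that point on.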

Dataset Format

This code was used together with other gems to manipulate the initial JSON dataset. It expects a dataset with the following structure:

Dataset File	Use
tweets01_aaaa	JSON file with tweets captured by INESC-id
tweets01_aaaa_english_trimed.csv	CSV file containing comma-separated tweets. Ex: 200740321783578624,ReRoc_Rochi_KTB,females kno yal jus sit nigas job supose
tweets01_aaaa_english_trimed_with_stem.csv	Same as above but with stemming applied. Ex: 200740321783578624,ReRoc_Rochi_KTB,femal kno yal jus sit niga job supos
*.wcount	Word counts for the .csv file with the same base name: each word that appears is listed with its number of occurrences. Ex (love appeared 4253 times): love,4253
data_set_words	all the words that appear in the dataset
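
As a minimal sketch of reading the two line formats above, assuming each tweet line holds exactly three comma-separated fields (id, user, text) with no quoting and each .wcount line holds a word followed by its count. The method names, the placeholder id and the user name are hypothetical and not part of DatasetMapper.

```ruby
# Hypothetical helper: split one tweet line into id, user and words.
# The limit of 3 keeps any later commas inside the text field.
def parse_tweet_line(line)
  id, user, text = line.chomp.split(",", 3)
  { id: id, user: user, words: text.to_s.split }
end

# Hypothetical helper: split one .wcount line into [word, count].
def parse_wcount_line(line)
  word, count = line.chomp.split(",", 2)
  [word, Integer(count)]
end

# Placeholder id and user name; the text is the sample tweet shown earlier.
tweet = parse_tweet_line("1,some_user,paintbrush work ipad sensubrushman")
tweet[:words]                  # => ["paintbrush", "work", "ipad", "sensubrushman"]
parse_wcount_line("love,4253") # => ["love", 4253]
```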

License

This code is released under the MIT License.
