Collection of Brazilian Personal Story Posts
lexicons
pre-process
.gitignore
LexRank.py
README.md
Untitled.ipynb
WordEmbeddings_Feat_GloVe.ipynb
WordEmbeddings_Feat_Word2Vec.ipynb
WordEmbeddings_Feat_Word2Vec_Conv1D.ipynb
WordEmbeddings_Feat_Word2Vec_Conv1D_Cross.ipynb
WordEmbeddings_Load_Models.ipynb
WordEmbeddings_Save_Models.ipynb
blogIDs.csv.gz
blogger.js
blogs_stats.ipynb
classify_posts.ipynb
classify_posts_liwc.ipynb
compare_english_portuguese_liwc.ipynb
compare_english_portuguese_prop_liwc.ipynb
compute_english_liwc.ipynb
compute_portuguese_liwc.ipynb
compute_readability.py
config.dev.js
corpus.csv.gz
corpus_all_training.ipynb
corpus_idf_liwc_training.ipynb
corpus_liwc_mtx.csv.gz
corpus_prop_liwc_training.ipynb
corpus_prop_liwc_wilcoxon.ipynb
corpus_random_training.ipynb
corpus_readability.csv.gz
corpus_readability_training.ipynb
corpus_stats.ipynb
corpus_tfidf_bayes.ipynb
corpus_tfidf_cw.ipynb
corpus_tfidf_liwc_training.ipynb
corpus_tfidf_training.ipynb
corpus_topics.ipynb
corpus_topics_meaning.ipynb
corpus_topics_training.ipynb
corpus_training.ipynb
corpus_wilcoxon_test.ipynb
countries.json
english_posts_liwc.csv.gz
feature_names.csv
feature_names_cw.csv
fix_mac_locale.sh
generateCSV.js
icwsm09_stories_liwc.csv.gz
index.js
index2stories.py
liwc.py
miscellaneous.py
package.json
portuguese_stories_liwc.csv.gz
post_compute_pol.ipynb
post_selection.ipynb
posts_filter.ipynb
posts_resample.csv
posts_sample.csv
sample_summarization.ipynb
santos2017portuguese.bib
similarity.py
slides.pdf
story_liwc.csv.gz
story_liwc_author.csv.gz
story_liwc_topics.csv.gz
story_polarity_topic_analysis.ipynb
story_topics_meaning.ipynb
zipBlogs.sh


brazilian-blog-dataset

Collection of Brazilian Blogspot Posts

Authors: Henrique D. P. dos Santos, Vinicius Woloszyn, and Renata Vieira

Abstract: Diary-like content expressing authors' personal experiences and sentiments over a variety of topics is generated every day and made available on the Internet. This rich content can be used for psychological analysis and knowledge discovery regarding human-related issues in several ways. This paper presents the creation of a Brazilian Portuguese corpus, built from blog posts, for personal story analysis and detection. We present an analysis of psycholinguistic categories across personal story and non-story posts, discussing their similarities and differences. We also study the use of these psycholinguistic categories as classification features. Then we describe the evaluation of several machine learning approaches and the process of applying them to identify personal stories on the basis of our dataset. Finally, we investigate the main topic-related polarity of personal narrative posts.

Keywords: Corpus, Natural Language Processing, Personal Story, Psycholinguistic, Social Media.

Full text, Slides, BibTeX

Complete Reference: Henrique D. P. dos Santos, Vinicius Woloszyn, and Renata Vieira. 2017. Portuguese Personal Story Analysis and Detection in Blogs. In Proceedings of WI ’17, Leipzig, Germany, August 23-26, 2017, 7 pages. DOI: 10.1145/3106426.3106517

Basic Statistics

https://github.com/heukirne/brazilian-blog-dataset/blob/master/blogs_stats.ipynb

Countries Stats

https://github.com/heukirne/brazilian-blog-dataset/blob/master/countries.json

Blogset-BR Dataset

http://www.inf.pucrs.br/linatural/blogset-br (4.7 GB, 7.4M posts)

Personal Story Annotated Posts

https://github.com/heukirne/brazilian-blog-dataset/raw/master/corpus.csv.gz (1K Posts)
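
The annotated corpus is distributed as a gzip-compressed CSV file. The following is a minimal sketch of how such a file could be read with the Python standard library once it has been downloaded. The function name `load_gzipped_csv`, the local file name, and the UTF-8 encoding are assumptions made for illustration; the column names are not documented here, so the sketch only prints whatever header the file contains.

```python
import csv
import gzip
from pathlib import Path

# Blog posts can be long; raise the per-field limit above the csv default.
csv.field_size_limit(10**9)


def load_gzipped_csv(path, encoding="utf-8"):
    """Read a gzip-compressed CSV file into a list of dicts keyed by its header row.

    The encoding is an assumption; adjust it if the file uses another one.
    """
    with gzip.open(Path(path), mode="rt", encoding=encoding, newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Assumes corpus.csv.gz was saved manually into the current directory.
    local_copy = Path("corpus.csv.gz")
    if local_copy.exists():
        rows = load_gzipped_csv(local_copy)
        print(f"{len(rows)} rows")
        if rows:
            print("columns:", list(rows[0].keys()))
    else:
        print("corpus.csv.gz not found; download it first")
```

The same helper should work for the other `.csv.gz` files in the repository, provided they share the encoding assumed above.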

PUCRS NLP Group

This project belongs to the NLP Group at PUCRS, Brazil.