Twitter Persian news tagcloud extraction

Final project for the Information Retrieval course.

TPNT is a tag cloud generator that extracts hot keywords from the Twitter page of a major Persian news agency, in the economy and social fields, for each month of a year.


Dependencies

  • GetOldTweets-java v1.2.0
  • Lucene 7.2.1

News agency

How to Run

This project has two main steps. First, tweets are stored in a CSV file with the help of the Crawler class. This class needs some options to work properly:

| Flag | Description | Required |
|------|-------------|----------|
| -i | ID of the Twitter page | yes |
| -s | Start date of extraction, format: YYYY-MM-DD | yes |
| -e | End date of extraction, format: YYYY-MM-DD | no |
| -m | Maximum number of tweets to retrieve | no |
| -p | Path of the CSV file | no |
| -n | Name of the CSV file | no |
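
The flag table above can be read as a small set of validation rules. The sketch below is a hypothetical helper (`CrawlerArgsCheck` is not part of this project, and it is not how the Crawler class parses its arguments) that checks a flag list before the crawler is launched. It assumes every flag takes exactly one value, that dates are ISO dates such as 2018-06-01, and that `-m` is an integer.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical pre-flight check for the Crawler flags; names are illustrative only.
class CrawlerArgsCheck {
    // Flags listed in the README table.
    private static final Set<String> KNOWN = Set.of("-i", "-s", "-e", "-m", "-p", "-n");

    static Map<String, String> parse(String... args) {
        // Assumption: every flag is followed by exactly one value.
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("every flag needs a value");
        }
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!KNOWN.contains(args[i])) {
                throw new IllegalArgumentException("unknown flag: " + args[i]);
            }
            opts.put(args[i], args[i + 1]);
        }
        // -i and -s are the two flags marked as required.
        for (String required : new String[] {"-i", "-s"}) {
            if (!opts.containsKey(required)) {
                throw new IllegalArgumentException("missing required flag: " + required);
            }
        }
        // Dates must parse as YYYY-MM-DD (ISO local date).
        for (String dateFlag : new String[] {"-s", "-e"}) {
            String value = opts.get(dateFlag);
            if (value == null) {
                continue;
            }
            try {
                LocalDate.parse(value);
            } catch (DateTimeParseException ex) {
                throw new IllegalArgumentException(dateFlag + " must be YYYY-MM-DD: " + value, ex);
            }
        }
        // Assumption: the tweet limit is a plain integer.
        if (opts.containsKey("-m")) {
            Integer.parseInt(opts.get("-m")); // NumberFormatException is an IllegalArgumentException
        }
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(parse("-i", "Tasnimnews_Fa", "-s", "2018-06-01", "-e", "2018-07-01"));
    }
}
```

Failing early on a bad date or a missing `-i` is cheaper than discovering it after a long crawl.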

An example that retrieves tweets from @TasnimNews_Fa between 2018-06-01 and 2018-07-01 and stores the CSV file under $PWD/result/:

java -cp ProjectNews.jar ir.ac.um.ce.projectnews.crawler.Crawler -i Tasnimnews_Fa -s 2018-06-01 -e 2018-07-01 -p result/
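
Because keywords are extracted for each month of a year, the same command has to be repeated with twelve date ranges. The sketch below is a hypothetical helper (`MonthlyCrawlCommands` is not part of this project) that prints one command per month, following the example above in using the first day of the next month as the end date. Whether the end date is inclusive, and whether `-n` expects a file extension, are assumptions.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that builds one Crawler command line per month of a year.
class MonthlyCrawlCommands {
    static List<String> forYear(String pageId, int year, String outDir) {
        List<String> commands = new ArrayList<>();
        for (int month = 1; month <= 12; month++) {
            YearMonth ym = YearMonth.of(year, month);
            String start = ym.atDay(1).toString();             // e.g. 2018-06-01
            String end = ym.plusMonths(1).atDay(1).toString(); // e.g. 2018-07-01
            commands.add(String.join(" ",
                    "java", "-cp", "ProjectNews.jar",
                    "ir.ac.um.ce.projectnews.crawler.Crawler",
                    "-i", pageId,
                    "-s", start,
                    "-e", end,
                    "-p", outDir,
                    "-n", ym + ".csv")); // assumed naming scheme: one file per month
        }
        return commands;
    }

    public static void main(String[] args) {
        forYear("Tasnimnews_Fa", 2018, "result/").forEach(System.out::println);
    }
}
```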

The next step is indexing the docs. After removing stop words from the docs, we use the Searcher and Classifier classes plus a bag of words to create queries that estimate how strongly each doc correlates with a given context. Finally, we use the most correlated words to generate a tag cloud.
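
The idea behind this step can be illustrated without Lucene. The sketch below is a simplified stand-in, not the project's Searcher or Classifier: it removes stop words, scores each doc by the fraction of its tokens found in a context bag of words, and ranks the words of the sufficiently related docs by plain term frequency. The class name, the scoring rule, and the threshold are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical stand-in for the indexing and classification step.
class TagCloudSketch {
    // Splits on anything that is not a letter or digit and drops stop words.
    // \p{L} matches letters of any script, including Persian.
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Assumed scoring rule: share of a doc's tokens that appear in the context bag.
    static double contextScore(List<String> tokens, Set<String> bag) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        long hits = tokens.stream().filter(bag::contains).count();
        return (double) hits / tokens.size();
    }

    // Returns the k most frequent words among docs whose score reaches minScore.
    static List<String> topWords(List<String> docs, Set<String> stopWords,
                                 Set<String> bag, double minScore, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = tokenize(doc, stopWords);
            if (contextScore(tokens, bag) < minScore) {
                continue; // doc is not related enough to this context
            }
            for (String token : tokens) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically for a stable order.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

With a stop-word set loaded from a file and one bag of words per context, `topWords` could be called once per month and per context to obtain the words for each tag cloud.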

Contributors
