Twitter Persian news tagcloud extraction
Final project for an Information Retrieval course.
TPNT is a tag cloud generator that extracts hot keywords from the Twitter page of a major Persian news agency, in the fields of economics and social affairs, for each month of a year.
- Tasnim News(@TasnimNews_Fa)
How to Run
This project has two main steps. First, tweets are stored in a CSV file with the help of the
Crawler class. This class needs some options to work properly:
||The ID of the Twitter page||
||Start date of extraction, format:
||End date of extraction, format:
||Maximum number of retrieved tweets||no|
||Path of the CSV file||no|
||Name of the CSV file||no|
An example that retrieves tweets from @TasnimNews_Fa from 2018-06-01 to 2018-07-01:
java -cp ProjectNews.jar ir.ac.um.ce.projectnews.crawler.Crawler -i Tasnimnews_Fa -s 2018-06-01 -e 2018-07-01 -p result/
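Because the tag cloud is built for each month of a year, the crawl can be repeated with one date window per month. The sketch below only prints one command line per month and is not part of the project: the class name MonthlyCrawlCommands and the method commandsForYear are hypothetical, and it assumes each window runs from the first day of a month to the first day of the next month, as in the example above.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ProjectNews.jar.
class MonthlyCrawlCommands {

    // Builds one crawler command line per month of the given year.
    static List<String> commandsForYear(String pageId, int year, String outputPath) {
        List<String> commands = new ArrayList<>();
        for (int month = 1; month <= 12; month++) {
            YearMonth current = YearMonth.of(year, month);
            // Assumption: a window starts on the first day of the month and
            // ends on the first day of the following month.
            String start = current.atDay(1).toString();              // ISO format, e.g. 2018-06-01
            String end = current.plusMonths(1).atDay(1).toString();  // e.g. 2018-07-01
            commands.add("java -cp ProjectNews.jar ir.ac.um.ce.projectnews.crawler.Crawler"
                    + " -i " + pageId
                    + " -s " + start
                    + " -e " + end
                    + " -p " + outputPath);
        }
        return commands;
    }

    public static void main(String[] args) {
        // Prints twelve command lines, one per month of 2018.
        commandsForYear("Tasnimnews_Fa", 2018, "result/").forEach(System.out::println);
    }
}
```

Whether the crawler treats the end date as inclusive or exclusive is not stated here, so tweets posted on the first day of a month may need deduplication across adjacent windows.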
The next step is indexing the docs. After removing stop words from the docs, we use the
Classifier classes plus a bag of words to create queries that estimate the correlation of each doc with the context. Finally, we use the most correlated words to generate a tag cloud.
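The idea of dropping stop words and keeping the highest-scoring words can be illustrated with a much simpler stand-in. The sketch below is not the project's Classifier-based method: it replaces the correlation estimate with plain term frequency, and the class TagCloudSketch, its methods, and the Latin-script sample tokens are all hypothetical. Real tweets would also need a proper Persian tokenizer and stop-word list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration; term frequency stands in for the correlation score.
class TagCloudSketch {

    // Counts how often each non-stop-word token occurs across all docs.
    static Map<String, Integer> termFrequencies(List<String> docs, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            // Naive whitespace tokenization, assumed for the sake of the example.
            for (String token : doc.trim().split("\\s+")) {
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the n most frequent terms; ties are broken alphabetically.
    static List<String> topTerms(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the bank raised rates",
                "the bank cut rates",
                "inflation and the bank");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and"));
        Map<String, Integer> counts = termFrequencies(docs, stopWords);
        System.out.println(topTerms(counts, 2)); // prints [bank, rates]
    }
}
```

The returned terms and their counts are what a tag cloud renderer would consume, with the count (or, in the project, the correlation score) driving the font size of each word.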