An experiment in applying a topic model to a tweet archive downloaded from the Twitter server (Settings -> Account -> Request Twitter Archive). The code was originally adapted from Tan He, who demonstrated how to analyze and visualize Chinese text using a topic model. For more information about topic models, refer to the links below (or google "Topic Models"; I'm pretty sure there are plenty of awesome texts out there!):
What tools/packages do I need?
- Python 2.7.9
- R 3.2.2
- Notepad++ v 6.7.8 (or above)
- in Python: Pandas, Numpy
- in R: jiebaR, lda, LDAvis, servr
How to use the files here?
The repository contains three files:
LDATweets.R: the core file that runs the LDA model.
stopword.txt (optional): a plain text file containing common English and Chinese stopwords. Currently in UTF-8 format; downloaded from ibook360. Feel free to substitute your favorite stopword list!
tweetExtract.py: preprocesses the tweets before the topic model is run.
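To give a feel for what preprocessing and stopword filtering can involve, here is a minimal Python sketch. It is not the code in tweetExtract.py: the function names (`clean_tweet`, `load_stopwords`, `remove_stopwords`) are hypothetical, the cleaning rules (dropping URLs and @mentions) are only one reasonable choice, and it assumes the stopword file lists one word per line. It is written for Python 3.

```python
import io
import re

# Assumption: URLs and @mentions carry no topical meaning, so they are dropped.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")


def clean_tweet(text):
    """Strip URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def load_stopwords(path):
    """Read stopwords from a UTF-8 text file, assuming one word per line.

    "utf-8-sig" also accepts files that start with a byte order mark.
    """
    with io.open(path, encoding="utf-8-sig") as handle:
        return {line.strip() for line in handle if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]
```

For example, `clean_tweet("RT @user 今天天气很好 http://t.co/abc")` returns `"RT 今天天气很好"`. Segmenting the cleaned text into Chinese words is a separate step and is not covered by this sketch.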
It is recommended to run the files in the following order:
- Request your Twitter Archive from twitter.com.
- Unzip your Twitter Archive to the local disk. You should find tweets.csv under the root.
- Ideally, your tweets should be mostly in Chinese. If they are in English, refer to "A topic model for movie reviews" as shown above.
- Make sure that tweetExtract.py and (optionally) stopword.txt are in your working directory.
- Run LDATweets.R. It may take a while for R to execute the script, depending on the size of your tweet archive.
- Upload the whole vis folder to a server, or open the
- Do the following steps ONLY IF Chinese characters are not displayed correctly in your browser:
- Open the vis folder, found under the root of your working directory.
- Create an empty file with whatever name you like, but the extension must be
- Edit that file in Notepad++. Click on "Encoding" -> "Encode in UTF-8".
- Save the file.
lda.json. Copy and paste everything in the
- Save the
- Remove the lda.json file and rename the previous
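The manual Notepad++ steps above amount to re-saving lda.json as UTF-8. As an alternative, the same conversion can be sketched in Python 3. This is only an illustration under assumptions: `reencode_to_utf8` is a hypothetical helper, the path `vis/lda.json` assumes the file sits directly inside the vis folder, and you need to know (or guess) the encoding the file was originally written in, which depends on your system locale.

```python
import io
import os


def reencode_to_utf8(path, source_encoding):
    """Rewrite the file at `path` as UTF-8, decoding it from `source_encoding`."""
    # newline="" keeps the original line endings untouched.
    with io.open(path, encoding=source_encoding, newline="") as handle:
        content = handle.read()

    # Write to a temporary file first so the original survives a failed write.
    tmp_path = path + ".utf8.tmp"
    with io.open(tmp_path, "w", encoding="utf-8", newline="") as handle:
        handle.write(content)

    # Swap the UTF-8 copy into place under the original name.
    os.replace(tmp_path, path)
```

A call might look like `reencode_to_utf8("vis/lda.json", "gbk")`, where `"gbk"` is just an example of a legacy encoding commonly found on Chinese-locale Windows machines. If decoding raises a `UnicodeDecodeError`, the guessed source encoding is wrong and the original file is left unchanged.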
Submit a pull request here!! I will reply ASAP.
I shall update the readme once I find out more about topic models.
2018/03/13 update: will try to convert the code to Python and rewrite the code base. Stay tuned! :)