A Twitter Bot to Offer Sympathy to Paper Cut Sufferers
This is an exercise in writing a Twitter-bot. I am attempting to find out if it is possible to determine from the text of a tweet alone if the tweeter has suffered a paper cut.
This Twitter-bot also gratuitously tested the accuracy of its predictions for a while by replying to those tweets it determined to be from paper cut sufferers with a sympathetic message.
How it Works
There are 3 main programs:
- do_twitter.py Monitors and replies to tweets on Twitter.
- do_label.py Used to label tweets as being about paper cuts or not.
- do_classify.py Builds a tweet classification model from labelled tweets and evaluates it.
After installing this code you can build a working twitter-bot by:
- running do_twitter.py in non-replying mode to build a corpus of tweets containing variants of the term "paper cut".
- using do_label.py to label the tweets as being from people with paper cuts or not.
- running do_classify.py on the corpus of labelled tweets to build a classification model.
When you have a classification model that meets your accuracy requirements you can run do_twitter.py in replying mode and see how well it chooses which tweets to reply to.
The following explains each of these steps in more detail.
do_twitter.py monitors and replies to tweets on Twitter. Monitoring involves:
- Making Twitter queries to find all tweets containing variants of the term "paper cut".
- Doing some extra filtering on the query results.
- Saving the tweets to file.
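The monitoring steps above might look something like the following sketch. The variant list, filter rules, and function names are my illustrative assumptions, not the bot's actual code:

```python
import re

# Hypothetical variants of "paper cut" to search for (illustrative only).
PAPER_CUT_VARIANTS = ['"paper cut"', '"papercut"', '"paper-cut"']

def build_query():
    # Twitter search accepts OR-ed quoted phrases in a single query.
    return " OR ".join(PAPER_CUT_VARIANTS)

def keep_tweet(text):
    """Extra filtering on the query results: drop replies and retweets,
    and require the phrase to actually appear in the text."""
    text = text.strip()
    if text.startswith("@") or text.startswith("RT "):
        return False
    return re.search(r"paper[\s-]?cut", text, re.IGNORECASE) is not None
```

For example, `keep_tweet("Just got a paper cut. Ow!")` passes the filter, while a tweet beginning with `@` does not.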
Replying is somewhat more involved.
- Care is taken to avoid replying more than once to a person or a conversation.
- Tweets are checked against the classification model.
- Replies are made and saved to file.
- Summary tweets are generated at regular intervals so that the twitter-bot's activity can be checked with a Twitter query.
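A sketch of the replying logic described above could look like this. The classifier interface, tweet fields, and reply text are assumptions for illustration, not the bot's real internals:

```python
# State used to avoid replying more than once to a person or a conversation.
replied_users = set()
replied_conversations = set()

def maybe_reply(tweet, classifier, threshold=0.5):
    """Reply at most once per user and per conversation, and only when the
    classification model says the tweet is about a paper cut."""
    if tweet["user"] in replied_users:
        return None
    if tweet["conversation_id"] in replied_conversations:
        return None
    # Check the tweet against the classification model (hypothetical API).
    if classifier.prob_paper_cut(tweet["text"]) < threshold:
        return None
    replied_users.add(tweet["user"])
    replied_conversations.add(tweet["conversation_id"])
    # The real bot also saves each reply to file.
    return "@%s Ow! Sorry about your paper cut." % tweet["user"]
```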
OwwwPapercut is currently being run from an AWS Ubuntu micro-instance with the following crontab line.
    29 * * * * python /home/ubuntu/twitter_bot/do_twitter.py 55 -r
do_label.py is used to label tweets according to whether the tweeter has a paper cut or not. It creates a text file of tweets where each line has a placeholder for the classification followed by the text of the tweet.
    ? | If I see one more back to school commercial I'm giving my eyes a paper cut.
    ? | i got lemon on my finger and it stings .-. stupid paper cut -.-
You should edit this file and replace the ? with the correct classification.
    n | If I see one more back to school commercial I'm giving my eyes a paper cut.
    y | i got lemon on my finger and it stings .-. stupid paper cut -.-
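Reading the labelled file back is then straightforward. A minimal sketch, assuming the `<label> | <text>` format shown above (the function names are mine):

```python
def parse_labelled_line(line):
    """Split '<label> | <tweet text>' into (label, text)."""
    label, _, text = line.partition("|")
    return label.strip(), text.strip()

def load_labelled(lines):
    """Return (text, is_paper_cut) pairs, skipping still-unlabelled ? lines."""
    pairs = []
    for line in lines:
        label, text = parse_labelled_line(line)
        if label == "?":
            continue
        pairs.append((text, label == "y"))
    return pairs

sample = [
    "? | an unlabelled tweet",
    "n | If I see one more back to school commercial I'm giving my eyes a paper cut.",
    "y | i got lemon on my finger and it stings .-. stupid paper cut -.-",
]
```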
The do_classify.py options are:
- -C <class> Use <class> as classifier.
- -l <n> Limit number of tweets in test set to <n>.
- -n, --ngrams Show ngrams.
- -s, --self-validate Do self-validation.
- -c, --cross-validate Do full cross-validation.
- -e, --show-errors Show false positives and false negatives.
- -t <string> Show details of how <string> was classified.
- -o, --optimize Find optimum threshold, back-offs and smoothings.
- -m, --model Save calibration model.
You should run

    python do_classify.py -c

to see how well the classification model predicts new tweets, based on cross-validation.
It will produce output like this:
    ===== = ===== = =====
          | False | True
    ----- + ----- + -----
    False |  1876 |   165
    ----- + ----- + -----
    True  |   189 |   995
    ===== = ===== = =====
    Total = 3225

    ===== = ===== = =====
          | False | True
    ----- + ----- + -----
    False |   58% |    5%
    ----- + ----- + -----
    True  |    5% |   30%
    ===== = ===== = =====

    Precision = 0.858, Recall = 0.840, F1 = 0.849
The columns are the predicted classifications of the tweets and the rows are the actual classifications.
In this result 3,225 tweets were evaluated and
- 1,876 were correctly predicted as people not tweeting about their paper cuts.
- 995 were correctly predicted as people tweeting about their paper cuts.
- 165 were incorrectly predicted as people tweeting about their paper cuts.
- 189 were incorrectly predicted as people not tweeting about their paper cuts.
(These numbers are from the start of development. Scores for the current code are here.)
The measures Precision, Recall and F1 are explained here
- Precision is the fraction of tweets predicted to be about paper cuts that actually were.
- Recall is the fraction of tweets about paper cuts that were predicted as being so.
- F1 is a combined measure that increases with increasing precision and increasing recall.
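These measures can be recomputed directly from the confusion matrix above (995 true positives, 165 false positives, 189 false negatives):

```python
# Counts from the confusion matrix above.
tp, fp, fn = 995, 165, 189

precision = tp / (tp + fp)          # predicted paper cut, and actually were
recall = tp / (tp + fn)             # actual paper cuts that we caught
f1 = 2 * precision * recall / (precision + recall)

print("Precision = %.3f, Recall = %.3f, F1 = %.3f" % (precision, recall, f1))
# Precision = 0.858, Recall = 0.840, F1 = 0.849
```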
We especially want precision to be high so that we don't reply to people who don't have paper cuts. We also want recall to be high so that we reply to as many paper cut sufferers as possible.
In this example an F1 of 0.85 is reasonable but not great. The precision of 0.86 means that 86% of the tweets predicted to be from people tweeting about their paper cuts actually are, and therefore that 14% are not. This is important: it means that 14% of the replies we make could be wrong. We call these replies false positives.
We therefore run

    python do_classify.py -e

to see what these false positives are.
      27 0.95: #ItHurts when I get a paper cut. :/ Those little cuts KILL!
    2115 0.98: @crimescript Sounds pretty nasty. The worst, medically, I face in my job is a paper cut :o) But then my job is dull & not good book material
    2116 0.99: ....and paper cut ????????
     764 1.03: Why is the first day of work after vacation have to be like giving yourself a papercut then pouring vodka in it? #retailproblems
    2950 1.06: Lol only you would mysteriously pop up with a paper cut.
27, 2116 and 764 seem ambiguous and could be mistaken for tweets from people with paper cuts. The other two are definitely not. Another filter we use (removing tweets starting with @) will remove 2115.
Based on this analysis of 5 tweets the twitter-bot's replies may not be too inappropriate.
(Using 5 tweets was for illustration only. In a real analysis we would evaluate all 165 false positive tweets. You can see the false positives in the current version of the classifier before and after filtering.)
When our classification model is performing well enough we run

    python do_classify.py -m

to save it.
At this stage we run

    python do_twitter.py 30 -r

and see how the twitter-bot performs interacting with people on Twitter.
The n-grams for the paper cut tweets are here. The most influential are:
    TRIGRAMS
    [  0, 118]  118.0  '[TAG_START] i got'
    [  0,  51]   51.0  '[TAG_START] just got'
    [  0,  40]   40.0  'on my finger'
    ...
    [ 12,   0]  -12.0  '[TAG_USER] [TAG_USER] [TAG_USER]'
    [ 12,   0]  -12.0  'hope you get'
    [ 18,   0]  -18.0  'i hope you'

    BIGRAMS
    [  0,  39]   39.0  'gave myself'
    [  0,  28]   28.0  'my thumb'
    [  0,  15]   15.0  'ow [TAG_END]'
    ...
    [ 20,   0]  -20.0  'hope you'
    [ 23,   0]  -23.0  'i hope'
    [ 25,   0]  -25.0  'PAPER_CUT out'

    UNIGRAMS
    [  0,  14]   14.0  'remover'
    [  0,  14]   14.0  'stings'
    [  0,   9]    9.0  'fuckin'
    ...
    [ 15,   0]  -15.0  'make'
    [ 18,   0]  -18.0  "i'll"
    [ 28,   0]  -28.0  'want'
BayesClassifier.py outperformed the other classifiers I tested. This surprised me a little as it is very simple. Two possible explanations come to mind.
- Its simplicity allowed me to tune its preprocessing and parametrization much better than I did for the other classifiers.
- As a generative classifier it is suited to the sparsity of the classification problem, where the number of calibration samples (thousands) is small compared to the number of possible ngrams (millions or more). See Jordan and Ng.
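BayesClassifier.py itself is not reproduced here, but the idea behind the n-gram weight tables listed above can be sketched as a toy scorer. The smoothing value and log-ratio weighting are illustrative assumptions, not the actual implementation:

```python
import math
from collections import defaultdict

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return zip(*[tokens[i:] for i in range(n)])

def train(labelled, n=2, smoothing=1.0):
    """Weight each n-gram by the smoothed log ratio of its counts in
    paper-cut vs non-paper-cut tweets (positive weight = paper cut)."""
    counts = defaultdict(lambda: [0, 0])   # ngram -> [neg count, pos count]
    for text, is_pos in labelled:
        for g in ngrams(text.lower().split(), n):
            counts[g][1 if is_pos else 0] += 1
    return {g: math.log((pos + smoothing) / (neg + smoothing))
            for g, (neg, pos) in counts.items()}

def score(weights, text, n=2):
    """Sum of weights over the tweet's n-grams; unseen n-grams score 0."""
    return sum(weights.get(g, 0.0) for g in ngrams(text.lower().split(), n))
```

Trained on labelled tweets, such a scorer gives positive totals to tweets sharing n-grams like "got a paper cut" and negative totals to sympathy replies containing "i hope".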
The latest internal test results are here.
Results from the "My week on twitter" bot:
- 408 retweets received
- 54 new followers
- 303 mentions
Some positive tweets.
Someone tweeting about the Twitter-bot.
A man telling the Twitter-bot not to tweet to his girlfriend.
A comparison with a tweeter from the office.
Paper cut empathee's remorse