A bot that offers sympathy to people who have suffered paper cuts.
Python
Latest commit 0ff26d9 Oct 6, 2012 peterwilliams97 updated formatting
data
results
.gitignore
BayesClassifier.py
INSTALL.md
KnnClassifier.py
PorterStemmer.py
README.md
RocchioClassifier.py
TODO.md
common.py
definitions.py
do_classify.py
do_label.py
do_setup.py
do_twitter.py
filters.py
preprocessing.py


A Twitter Bot to Offer Sympathy to Paper Cut Sufferers

This is an exercise in writing a Twitter-bot. I am attempting to find out whether it is possible to determine, from the text of a tweet alone, that the tweeter has suffered a paper cut.

This Twitter-bot also gratuitously tested the accuracy of its predictions for a while by replying to the tweets it determined to be from paper cut sufferers with a sympathetic message.

How it Works

There are three main programs:

  • do_twitter.py Monitors and replies to tweets on Twitter.
  • do_label.py Labels tweets as being about paper cuts or not.
  • do_classify.py Builds a tweet classification model from labelled tweets and evaluates it.

After installing this code you can build a working twitter-bot by

  • running do_twitter.py in non-replying mode to build a corpus of tweets containing variants of the term "paper cut".
  • using do_label.py to label the tweets as being from people with paper cuts or not.
  • running do_classify.py on the corpus of labelled tweets to build a classification model.

When you have a classification model that meets your accuracy requirements you can run do_twitter.py in replying mode and see how well it chooses which tweets to reply to.

The following explains each of these steps in more detail.

do_twitter.py

do_twitter.py monitors and replies to tweets on Twitter.

Monitoring comprises

  • Making Twitter queries to find all tweets containing variants of the term "paper cut".
  • Doing some extra filtering on the query results.
  • Saving the tweets to file.
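The "extra filtering" step can be sketched as follows. The pattern and function name here are illustrative only, assuming a simple regex over the tweet text; the bot's real filtering lives in filters.py.

```python
import re

# Matches common variants of "paper cut": "paper cut", "papercut",
# "paper-cut", singular or plural, any capitalisation.
# (Illustrative pattern, not the actual rules in filters.py.)
PAPER_CUT_RE = re.compile(r'\bpaper[\s-]?cuts?\b', re.IGNORECASE)

def mentions_paper_cut(text):
    """Return True if the tweet text contains a "paper cut" variant."""
    return PAPER_CUT_RE.search(text) is not None

tweets = [
    "Ugh, just got a papercut on my thumb",
    "Paper-cut city over here",
    "Reading the paper, cutting vegetables",
]
matches = [t for t in tweets if mentions_paper_cut(t)]  # first two only
```

The word boundaries (`\b`) stop "cutting" from matching, so a query hit like the third tweet above would be dropped at this stage.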

Replying is somewhat more involved.

  • Care is taken to avoid replying more than once to a person or a conversation.
  • Tweets are checked against the classification model.
  • Replies are made and saved to file.
  • Summary tweets are generated at regular intervals so that the twitter-bot's activity can be checked with a Twitter query.
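The "reply at most once per person or conversation" rule amounts to a little bookkeeping, which can be sketched like this. The class and method names are hypothetical; the real logic is in do_twitter.py.

```python
class ReplyTracker:
    """Tracks which users and conversations the bot has already replied
    to, so each gets at most one reply.
    (Illustrative sketch, not the actual code in do_twitter.py.)"""

    def __init__(self):
        self._replied_users = set()
        self._replied_conversations = set()

    def may_reply(self, user_id, conversation_id):
        """True if neither the user nor the conversation has been replied to."""
        return (user_id not in self._replied_users
                and conversation_id not in self._replied_conversations)

    def record_reply(self, user_id, conversation_id):
        """Remember that a reply was sent."""
        self._replied_users.add(user_id)
        self._replied_conversations.add(conversation_id)

tracker = ReplyTracker()
if tracker.may_reply("alice", "conv1"):
    tracker.record_reply("alice", "conv1")
```

In the real bot this state would also be saved to file, so restarts (e.g. from the hourly cron job) do not cause duplicate replies.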

OwwwPapercut is currently being run from an AWS Ubuntu micro-instance with the following crontab line.

29 * * * * python /home/ubuntu/twitter_bot/do_twitter.py 55 -r

do_label.py

do_label.py is used to label tweets according to whether the tweeter has a paper cut or not. It creates a text file of tweets where each line has a placeholder for the classification followed by the text of the tweet.
e.g.

? | If I see one more back to school commercial I'm giving my eyes a paper cut.
? | i got lemon on my finger and it stings .-. stupid paper cut -.-

You should edit this file and replace the ? with the correct classification.
e.g.

n | If I see one more back to school commercial I'm giving my eyes a paper cut.
y | i got lemon on my finger and it stings .-. stupid paper cut -.-    
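A minimal reader for this labelled format might look like the sketch below; the function name is hypothetical, and the actual parsing lives in the repository's own modules.

```python
def parse_labelled_line(line):
    """Split one line of the labelled-tweets file into (label, text).
    label is '?', 'y' or 'n'; text is the tweet body.
    Splits on the first '|' only, so tweets containing '|' survive."""
    label, _, text = line.partition('|')
    return label.strip(), text.strip()

lines = [
    "n | If I see one more back to school commercial I'm giving my eyes a paper cut.",
    "y | i got lemon on my finger and it stings .-. stupid paper cut -.-",
]
labelled = [parse_labelled_line(l) for l in lines]
# labelled[0] == ('n', "If I see one more back to school commercial I'm giving my eyes a paper cut.")
```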

do_classify.py

do_classify.py builds a tweet classification model from the labelled tweets and evaluates it. The classifier is discussed later under the heading The Classifier.

Options:

-C <class>            Use <class> as classifier.
-l <n>                limit number of tweets in test set to <n>        
-n, --ngrams          show ngrams
-s, --self-validate   do self-validation
-c, --cross-validate  do full cross-validation
-e, --show-errors     show false positives and false negatives
-t <string>           show details of how <string> was classified  
-o, --optimize        find optimum threshold, back-offs and smoothings
-m, --model           save calibration model

You should run python do_classify.py -c to see how well the classifier predicts new tweets based on cross-validation.

It will produce output like this:

===== = ===== = =====
      | False |  True
----- + ----- + -----
False |  1876 |   165
----- + ----- + -----
 True |   189 |   995
===== = ===== = =====
Total = 3225

===== = ===== = =====
      | False |  True
----- + ----- + -----
False |   58% |    5%
----- + ----- + -----
 True |    5% |   30%
===== = ===== = =====
Precision = 0.858, Recall = 0.840, F1 = 0.849 

The columns are the predicted classifications of the tweets and the rows are the actual classifications.

In this result 3,225 tweets were evaluated and

  • 995 were correctly predicted as people tweeting about their paper cuts.
  • 1,876 were correctly predicted as people not tweeting about their paper cuts.
  • 165 were incorrectly predicted as people tweeting about their paper cuts.
  • 189 were incorrectly predicted as people not tweeting about their paper cuts.

(These numbers are from the start of development. Scores for the current code are here.)

The measures Precision, Recall and F1 are explained here

  • Precision is the fraction of tweets predicted to be about paper cuts that actually were.
  • Recall is the fraction of tweets about paper cuts that were predicted as being so.
  • F1 is a combined measure that increases with increasing precision and increasing recall.
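These definitions can be checked directly against the confusion matrix above (995 true positives, 165 false positives, 189 false negatives):

```python
# Counts from the cross-validation confusion matrix above.
tp = 995   # paper cut tweets correctly predicted as paper cut
fp = 165   # non paper cut tweets predicted as paper cut
fn = 189   # paper cut tweets predicted as non paper cut

precision = tp / float(tp + fp)   # 995 / 1160
recall = tp / float(tp + fn)      # 995 / 1184
f1 = 2 * precision * recall / (precision + recall)

print('Precision = %.3f, Recall = %.3f, F1 = %.3f' % (precision, recall, f1))
# Precision = 0.858, Recall = 0.840, F1 = 0.849
```

This reproduces the Precision = 0.858, Recall = 0.840, F1 = 0.849 line in the output above.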

We especially want precision to be high so that we don't reply to people who don't have paper cuts. We also want recall to be high so that we can reply to as many paper cut sufferers as possible.

In this example an F1 of 0.85 is reasonable but not great. The precision of 0.86 means that 86% of the tweets predicted to be from people tweeting about their paper cuts actually are, and therefore that 14% are not. This is important: it means that 14% of the replies we make could be wrong. We call these replies false positives.

We therefore run python do_classify.py -e to see what these false positives are.

  27    0.95: #ItHurts when I get a paper cut. :/ Those little cuts KILL!
 2115   0.98: @crimescript Sounds pretty nasty. The worst, medically, I face in my job is a paper cut :o) But then my job is dull &amp; not good book material
 2116   0.99: ....and paper cut ????????
  764   1.03: Why is the first day of work after vacation have to be like giving yourself a papercut then pouring vodka in it? #retailproblems
 2950   1.06: Lol only you would mysteriously pop up with a paper cut. 

27, 2116 and 764 seem ambiguous and could be mistaken for tweets from people with paper cuts. The other two are definitely not. Another filter we use (on tweets starting with @) will remove 2115.

Based on this analysis of 5 tweets the twitter-bot's replies may not be too inappropriate.

(Using 5 tweets was for illustration only. In a real analysis we would evaluate all 165 false positive tweets. You can see the false positives in the current version of the classifier before and after filtering.)

When our classification model is performing well enough we run python do_classify.py -m to save it.

At this stage we run python do_twitter.py 30 -r and see how the twitter-bot performs interacting with people on twitter.

The Classifier

The classifier we use to predict whether tweets are about paper cuts is BayesClassifier.py. This is a simple n-gram classifier.

The n-grams for the paper cut tweets are here. The most influential are

TRIGRAMS
[  0,118] 118.0 '[TAG_START] i got'
[  0, 51] 51.0 '[TAG_START] just got'
[  0, 40] 40.0 'on my finger'
...
[ 12,  0] -12.0 '[TAG_USER] [TAG_USER] [TAG_USER]'
[ 12,  0] -12.0 'hope you get'
[ 18,  0] -18.0 'i hope you'

BIGRAMS
[  0, 39] 39.0 'gave myself'
[  0, 28] 28.0 'my thumb'
[  0, 15] 15.0 'ow [TAG_END]'
...
[ 20,  0] -20.0 'hope you'
[ 23,  0] -23.0 'i hope'
[ 25,  0] -25.0 'PAPER_CUT out'

UNIGRAMS
[  0, 14] 14.0 'remover'
[  0, 14] 14.0 'stings'
[  0,  9]  9.0 'fuckin'
...
[ 15,  0] -15.0 'make'
[ 18,  0] -18.0 'i'll'
[ 28,  0] -28.0 'want'
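The scoring idea behind these tables can be sketched in a few lines: wrap the tweet in start/end tags, extract unigrams, bigrams and trigrams, and sum each n-gram's weight, with positive totals leaning towards "paper cut". The weights below are toy values in the spirit of the tables above, not the trained model, and the function names are illustrative rather than the actual BayesClassifier.py API.

```python
def ngrams(tokens, n):
    """All n-grams of a token list, joined with spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def score(text, weights):
    """Sum the weights of every unigram, bigram and trigram in the
    tweet. Positive total => more likely a paper cut tweet."""
    tokens = ['[TAG_START]'] + text.lower().split() + ['[TAG_END]']
    total = 0.0
    for n in (1, 2, 3):
        for gram in ngrams(tokens, n):
            total += weights.get(gram, 0.0)
    return total

# Toy weights echoing the tables above (not the trained model).
toy_weights = {
    '[TAG_START] i got': 118.0,
    'on my finger': 40.0,
    'stings': 14.0,
    'i hope': -23.0,
}

score("i got lemon on my finger and it stings", toy_weights)   # strongly positive
score("i hope you feel better", toy_weights)                   # negative
```

Note how the `[TAG_START]` marker lets the model treat "i got" at the start of a tweet differently from "i got" in the middle, which is why '[TAG_START] i got' is the single most influential trigram above.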

BayesClassifier.py outperformed the other classifiers I tested. This surprised me a little as it is very simple. Two possible explanations come to mind.

  • Its simplicity allowed me to tune its preprocessing and parametrization much better than I did for the other classifiers.
  • As a generative classifier it is suited to the sparsity of the classification problem where the number of calibration samples (thousands) is small compared to the number of possible ngrams (millions or more). See Jordan and Ng.

Results

The latest internal test results are here.

Results from the My week on twitter bot:

  • 408 retweets received
  • 54 new followers
  • 303 mentions

Recent mentions.

Some positive tweets.

Someone tweeting about the Twitter-bot.

A man telling the Twitter-bot not to tweet to his girlfriend.

A comparison with a tweeter from the office.

Paper cut empathee's remorse.