Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
212 lines (132 sloc) 13.7 KB


Brand disambiguator for tweets to differentiate e.g. Orange vs orange (brand vs foodstuff), using NLTK and scikit-learn. This is a form of

NOTE this is a work in progress, started June 2013, currently it doesn't do very much at all. Currently I use a LogisticRegression classifier with a default CountVectorizer to classify the sense of a single word ('apple' at present) as in-class (is-a-brand) and out-of-class (is-anything-else).

I'm only using English tweets and only annotating the ones I can clearly differentiate.


There are some write-ups and presentations online if you'd like some background:


The code runs with Python 2.7 using sqlite.

$ git clone ./src  # clones code into local ./src folder
$ virtualenv ./env  # do this in the same folder, not the src subfolder (it would be ok but it is cleaner to leave the environment separate from the src)
$ . env/bin/activate  # '.' is short for 'source', it runs the env/bin/activate script to enable the virtualenv
$ # note that our path line probably changes to show '(env)' as we have activated the virtualenv
$ cd src
$ pip install -r requirements.txt  # base libs, installed into our virtualenv
$ ipython  # now we'll test the basic installation
$   In [1]: import numpy  # test that numpy can be imported
$   In [2]: import pandas  # test that pandas can be imported
$ # oddly the cld won't be installed via requirements.txt as the path is too long (see below), so we do it by hand:
$ pip install chromium-compact-language-detector==0.2  # language detector for annotating
$ ipython  # test that cld is installed
$   In [1]: import cld
$ pip install -r requirements_2.txt  # get around MEP11/sklearn requirement for numpy with second requirements file
$ ipython  # another quick test
$   In [1]: import sklearn
$   In [2]: import matplotlib

Note that if you get any errors, you might be missing required dependencies in your current setup. E.g. on linux I install numpy using the Ubuntu package manager (apt-get) so that the core dependencies (like ATLAS) are pre-installed, then I install numpy using pip to the named version (and it uses the OS-installed dependencies that came in via apt-get). You'll have to read up on what's required for your OS depending on which library breaks.

Note that the pip IOError: [Errno 36 File name too long error is related to my Ubuntu installation with an encrypted home directory as detailed here:

Note that generally I'm using Python 3 __future__ imports, the code isn't tested with Python 3 but the porting should be straight-forward. sqlite only wants byte/strings for key indexing (not unicode strings).


If you're in the src\ folder and you've activated your virtualenv then you should be ready to run:

$ nosetests  # the project defaults to 'testing' if DISAMBIGUATOR_CONFIG isn't set
$ nosetests -s  # runs without capturing stdout, useful if you're using `import pdb; pdb.set_trace()` for debugging
$ nosetests --with-coverage --cover-html  # with an HTML coverage report to cover/index.html

Note if you see ImportError: cannot import name opencalais_key then if you plan to use OpenCalais, you need an API key (but if you don't plan to use their NER then you don't need the key and you can ignore this test failure).

A word on environment variables

I'm running on Linux, I use an environment variable DISAMBIGUATOR_CONFIG to tell the code if we're running in testing or production. This switches between an in-memory SQLite DB for testing which is blanked and an on-disk SQLite DB which is not blanked. By default if the environment variable has not been set, the code will assume that we're in testing mode (so nothing is stored beyond the single session).

If you want to use SQLite DB then the code expects to see something like data/annotations.sqlite. You can then either put DISAMBIGUATOR_CONFIG=production before whatever you run, or export DISAMBIGUATOR_CONFIG=production for that shell window (but if you then run the nosetests remember that it'll think it is in production mode and will start to mess with the production database).

A word on SQLite Databases

SQLite is built into Python 2.7, it is a lightweight single-file database. You can investigate it at the command line usinng sqlite3. In Firefox look at for a nice graphical manager.

Creating a gold standard

If you have a SQLiteDB then you won't need to do this. apple10.json is an example input file of tweets you've collected from Twitter via cURL on their streaming API, it is a many-line text file where each line is a valid JSON document (exactly as streamed via Twitter).

$ u'/home/ian/workspace/virtualenvs/tweet_disambiguation_project/prototype1/src'
$ #export DISAMBIGUATOR_CONFIG=production  # might be useful if not using ipython
$ DISAMBIGUATOR_CONFIG=production ipython
$ %run ../../apple10.json apple
# or
$ DISAMBIGUATOR_CONFIG=production python ../../apple10.json apple

Annotating the tweets using OpenCalais

OpenCalais have a strong named entity recognition API offering, we can use it to annotate tweets to see how it does. You need an API key from them via and you need to copy the key into a 1 line file named ner_apis/opencalais/ which will look like:

$ API_KEY = "<opencalais-key>"

You can run the annotator using the following, it looks for the company name apple in the return from OpenCalais and stores the result in a new opencalais_<brand> table:

$ DISAMBIGUATOR_CONFIG=production python apple opencalais --drop  # optionally drop the destination table so we start afresh
$ DISAMBIGUATOR_CONFIG=production python apple opencalais # run in another window to double fetching speed (note that we've removed the --drop flag for the second and subsequent runs)

In the above case you'll have a new table opencalais_apple. If you run more than one terminal remember to use --drop on the first run (to drop any historic results), then to remove --drop from subsequent parallel runs. I tend to run 2 or 3 processes in parallel, they annotate anything that's not yet annotated (and if two annotate the same thing, only one result gets written).

Creating a test/train and validation subset

Copy a subset of the annotations table to a test/train set and a validation set:

If you have 2014 Tweets in a SQLite DB named annotations.sqlite in a table named annotations_apple then here we extract 584 from each of the two classes to produce a new table scikit_testtrain_apple and a further 100 from each of the two classes into scikit_validation_apple (dropping the destination tables beforehand so we start with blank tables). This gives us 1168 test/train items and 200 held-out validation items:

$ DISAMBIGUATOR_CONFIG=production python annotations_apple 584 scikit_testtrain_apple 100 scikit_validation_apple --drop

Exporting results

Output an ordered list of classifications and tweets (by tweet_id), allows a diff (e.g. using meld or a similar graphical diff tool):

$ DISAMBIGUATOR_CONFIG=production python annotations_apple > output/annotations_apple.csv
$ DISAMBIGUATOR_CONFIG=production python opencalais_apple > output/opencalais_apple.csv

Exporting test/train data

Now we need to write out our data files, the following will generate data/scikit_testtrain_apple_in_class.csv and data/scikit_testtrain_apple_out_class.csv:

$ DISAMBIGUATOR_CONFIG=production python scikit_testtrain_apple

Simple learning

Use sklearn in a trivial way to classify in and out of class examples. learn1 uses Leave One Out Cross Validation with a Logistic Regression classifier using default text preparation methods, the results are pretty poor. Note that there is no real validation set (just an out of sample test for the 2 cases as a sanity check after training). This is a trivial classifier and isn't to be trusted for any real work.

NOTE first we have to make the destination table for learning, use SQLiteManager to copy scikit_validation_apple (right click on it and select Copy) to learn1_validation_apple, this gives us an identical copy in the sqlite database.

NOTE that we expect to see exported .csv text files for in_class and out_class examples (see previous section).

Results are written to the table specified as the second argument:

$ DISAMBIGUATOR_CONFIG=production python scikit_testtrain_apple --validation_table=learn1_validation_apple

By supplying the --validation_table table the newly predicted class labels are written to the new table (and the tweet text and other details are left untouched).

To investigate a Receiver Operating Characteristics result use:

$ DISAMBIGUATOR_CONFIG=production python scikit_testtrain_apple --roc

To investigate a Precision/Recall result see:

$ DISAMBIGUATOR_CONFIG=production python scikit_testtrain_apple --pr

Scoring predictions

We can score the validation subset of the Gold Standard against the learned result:

$ DISAMBIGUATOR_CONFIG=production python scikit_validation_apple learn1_validation_apple

We can score the validation subset of the Gold Standard against the OpenCalais results too:

$ DISAMBIGUATOR_CONFIG=production python scikit_validation_apple opencalais_apple

We can score the full Gold Standard against the OpenCalais results:

$ DISAMBIGUATOR_CONFIG=production python annotations_apple opencalais_apple


Here are a few ideas of mini projects if you'd like to collaborate but aren't sure about the machine-learning side of things:

  • IAN - export a list of tweet ids and class labels to a .csv file
  • Take my list of tweet-ids, fetch them from Twitter and insert into SQLite using sql_convenience, add the classification that I'll also provide (note - ask me for this list as it isn't in the repo yet)
  • Inherit and build an interface to DBPediaSpotlight, include fixtures in the test code
  • Extract a validation set of 100 in- and 100 out-of-class tweets, send to DBPedia
  • Extend and build a local simple "capitalised brand detector" (e.g. it only looks for the exact term "Apple" in the tweet) - it'll do a similar job to the OpenCalais and DBPedia API calls but without an API call as the code will be local (and really simple) - this would be a very sensible baseline tool
  • Extend and build a heuristic based classifier - e.g. maybe look at capitalised first letter in the target word and a couple of obvious apple-related terms (e.g. iphone, phone, ios, ipad etc) - this would be the equivalent to a human writing some Boolean rules and it serves as another sensible baseline (if we can't beat this then it raises the question "does this system offer anything that a human couldn't already do?")

Design flaws

  • sqlite table structure assumes 1 brand per table (e.g. annotations_apple with class set to is-apple-brand or is-apple-somethingelse), this isn't normal form but is probably fine for the prototype

Other notes

3rd party bugs

SQLite Manager (0.8) in Firefox suffers from a large-int rounding bug due to JavaScript, large ints like Twitter Ids are rounded on display/entry (but they're correctly handled by sqlite3 in Python and sqlite3 at the cmd line):



Copyright (c) 2013 Ian Ozsvald

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentati on files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Sof tware.


Copyright (c) 2013 Ian Ozsvald