The regular expression(s) Twitter uses to match URLs.

Twitter URL Regexen

This is an attempt to extract the URL-matching regular expression(s) from the twitter-text-rb project and port them to Python.

What? Why?

This is important because any service that will be creating tweets on behalf of its users needs to be able to provide those users an accurate character count. Should be easy, right? Nope! Now that Twitter is wrapping URLs in t.co short links, the service in question will need to know exactly what parts of a user's tweet will be replaced by shortened t.co URLs.

This is mind-bendingly stupid.
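To make the counting problem concrete, here is a minimal sketch of how a client might estimate a tweet's effective length once URLs are wrapped. It rests on two loudly stated assumptions: `SIMPLE_URL_RE` is a deliberately naive stand-in pattern (not the regular expression this library ports), and `ASSUMED_SHORT_URL_LENGTH` is a placeholder value, since the real t.co link length is set by Twitter and can change.

```python
import re

# Naive stand-in pattern for illustration only; NOT the regex from this library.
SIMPLE_URL_RE = re.compile(r"https?://\S+")

# Placeholder for the length of a wrapped t.co link (an assumption; the real
# value is published by Twitter and may change over time).
ASSUMED_SHORT_URL_LENGTH = 23


def effective_length(text, url_re=SIMPLE_URL_RE,
                     short_len=ASSUMED_SHORT_URL_LENGTH):
    """Count characters as if every matched URL were replaced by a short link."""
    length = len(text)
    for match in url_re.finditer(text):
        # Swap the matched URL's own length for the fixed short-link length.
        length += short_len - len(match.group(0))
    return length


if __name__ == "__main__":
    print(effective_length("see http://example.com/a/very/long/path now"))
```

Note that a short URL can make the count go up rather than down, which is exactly why the client has to agree with Twitter about what counts as a URL.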

To alleviate some of the pain, Twitter has kindly provided reference implementations for this behavior for Ruby, JavaScript, and Java, along with a suite of conformance tests for third party implementers.

Unfortunately, they do not provide or maintain a reference implementation for Python, and the Python ports that exist are incomplete (this particular little lib very much included).

Okay, so what is this then?

This project, twitter-url-regexen, is just an attempt to extract the URL-related regular expressions from the twitter-text-rb source code (specifically, lib/regex.rb) and port them to Python. Nothing more, nothing less.

How good is this port?

Well, it passes 62 of the 70 (at time of writing) URL-extraction conformance tests. Hopefully, that's good enough for government work (as they say).
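For a rough idea of how a pass count like that is tallied, here is a hedged sketch. It assumes each conformance case has been loaded into a dict with `description`, `text`, and `expected` keys; the sample cases and the `naive_extract` function below are hypothetical illustrations, not entries from Twitter's suite and not this library's extractor.

```python
import re


def tally_conformance(cases, extract_urls):
    """Return (passed, failed) lists of case descriptions."""
    passed, failed = [], []
    for case in cases:
        actual = extract_urls(case["text"])
        if actual == case["expected"]:
            passed.append(case["description"])
        else:
            failed.append(case["description"])
    return passed, failed


def naive_extract(text):
    # Hypothetical extractor: grabs everything up to the next whitespace.
    return re.findall(r"https?://\S+", text)


# Hypothetical cases, written in the assumed shape of a loaded test file.
SAMPLE_CASES = [
    {"description": "plain URL",
     "text": "visit http://example.com today",
     "expected": ["http://example.com"]},
    {"description": "URL followed by a period",
     "text": "see http://example.com.",
     "expected": ["http://example.com"]},
]

if __name__ == "__main__":
    passed, failed = tally_conformance(SAMPLE_CASES, naive_extract)
    print(f"{len(passed)} of {len(SAMPLE_CASES)} passed; failed: {failed}")
```

The second sample case fails with the naive extractor because it swallows the trailing period, which is the kind of edge case that separates a quick regex from a conforming one.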

What about those 8 failing tests?

See, here's where things get even dumber. Twitter is not just extracting these URLs based on a regular expression. There is also a fair amount of extra work done to, e.g., specially handle t.co URLs, protocol-less ccTLD URLs, non-ASCII URLs, etc.

This library does not attempt to duplicate that logic, so some of the failures stem from that basic incompatibility.

Fuck if I know about the other ones.

Running the tests

First, get the conformance tests:

git submodule update --init

Make sure you have all of the requirements (py.test and pyyaml):

pip install -r test_requirements.txt

Run the tests:

py.test

If you'd like to test some alternate Python implementations of the same idea (twitter-text-python and twitter-text-py), you'll have to run those tests manually:

py.test tests/alt*

See Also

Twitter's own docs about this inane misfeature: https://dev.twitter.com/docs/tco-url-wrapper/how-twitter-wrap-urls
