# Text Pre-processing

In this notebook we define pre-processing functions for cleaning up our data. Each function takes text and returns text, so they can be chained together, or dropped individually, as needed.

Example:

```python
# apply function2 first, then function1, to text
text = function1(function2(text))
# skip function2 and apply only function1
text = function1(text)
```
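
Once the chain grows beyond a few functions, nested calls get hard to read. One option is a small helper that applies a list of functions in order, so a step can be switched on or off by editing the list. The sketch below is only an illustration: `pipeline` and the two stand-in steps are hypothetical names, not functions defined in this notebook.

```python
from functools import reduce


def pipeline(text, steps):
    """Apply each function in `steps` to `text`, left to right."""
    return reduce(lambda value, step: step(value), steps, text)


# Stand-in steps, for illustration only
def strip_spaces(text):
    return text.strip()


def to_upper(text):
    return text.upper()


# equivalent to to_upper(strip_spaces(text))
both = pipeline(u'  hello  ', [strip_spaces, to_upper])
# to_upper disabled by leaving it out of the list
only_strip = pipeline(u'  hello  ', [strip_spaces])
print(both)
print(only_strip)
```

Note that the list reads in execution order, which is the reverse of how the nested form is written.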

Below, we implement several pre-processing functions and apply each one to a sample text variable.

In [1]:
text = u'This is a sample sentence for our #machinelearning presentation. Follow me @sirfz! :)'

In [2]:
# lower-case
def to_lower(text):
    return text.lower()

In [3]:
print to_lower(text)

this is a sample sentence for our #machinelearning presentation. follow me @sirfz! :)


In [4]:
# keep word characters only (letters, digits and the underscore)
import re  # regular expressions module
alphanum = re.compile(r'\w+')  # add the re.UNICODE flag to match non-ASCII letters as well

def get_alphanum(text):
    return u' '.join(alphanum.findall(text))

In [5]:
print get_alphanum(text)

This is a sample sentence for our machinelearning presentation Follow me sirfz
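
The comment in the cell above mentions the `re.UNICODE` flag. As a rough sketch of what it changes: under Python 2, `\w` without the flag matches only ASCII letters, digits and the underscore, so accented words are split at the accented character. With the flag they stay whole. (In Python 3, `str` patterns already behave this way by default.) The names `alphanum_unicode` and `get_alphanum_unicode` are illustrative and do not appear elsewhere in this notebook.

```python
import re

alphanum_unicode = re.compile(r'\w+', re.UNICODE)


def get_alphanum_unicode(text):
    return u' '.join(alphanum_unicode.findall(text))


# u'caf\xe9' is "café" and u'na\xefve' is "naïve", written with escapes
# so the snippet does not depend on the source file encoding
cleaned = get_alphanum_unicode(u'caf\xe9, na\xefve & snake_case!')
print(cleaned)
```

The underscore in `snake_case` survives because `\w` treats it as a word character.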


In [6]:
# replace URLs with __URL__
url_re = re.compile(r'(https?://|www\.)([a-z0-9-]+\.)+[a-z]{2,6}(/[\w]+)*', re.I)

def replace_urls(text):
    return url_re.sub(u'__URL__', text)

In [7]:
print replace_urls(u'visit http://mypage.com or follow me on www.foobar.com.lb.')

visit __URL__ or follow me on __URL__.
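
The path part of the pattern above, `(/[\w]+)*`, accepts only word characters, so a URL containing a hyphen or a query string is replaced only partially. The sketch below shows that behaviour next to a looser variant. `loose_url_re` is an assumption made for illustration, not a complete URL grammar.

```python
import re

# Pattern from the cell above: the path may contain only word characters
url_re = re.compile(r'(https?://|www\.)([a-z0-9-]+\.)+[a-z]{2,6}(/[\w]+)*', re.I)

# Assumed, looser variant: the path may also contain - . / ? % & = # + ~
# but must end in a word character or a slash, so sentence punctuation
# directly after the URL is left alone
loose_url_re = re.compile(
    r'(https?://|www\.)([a-z0-9-]+\.)+[a-z]{2,6}'
    r'(/([\w\-./?%&=#+~]*[\w/])?)?',
    re.I)

sample = u'read http://example.com/a-b?x=1 now'
print(url_re.sub(u'__URL__', sample))        # read __URL__-b?x=1 now
print(loose_url_re.sub(u'__URL__', sample))  # read __URL__ now
```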


In [8]:
# transliterate (using the translitcodec module)
import translitcodec

def transliterate(text):
    return text.encode('transliterate')

In [9]:
print transliterate(u'French e, è, é, ê, ë – what’s the difference?')

French e, e, e, e, e - what's the difference?
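
If `translitcodec` is not available, accents can be stripped with the standard library alone. The sketch below decomposes each character (NFKD) and drops the combining marks. It is not a drop-in replacement: characters without a decomposition, such as the en dash or the curly apostrophe in the example above, pass through unchanged. `strip_accents` is a hypothetical helper name.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# \xe8, \xe9, \xea, \xeb are è, é, ê, ë
print(strip_accents(u'e, \xe8, \xe9, \xea, \xeb'))  # e, e, e, e, e
```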


In [10]:
# strip the leading @ or # from mentions and hashtags (the tag text itself is kept)
tags = re.compile(r'(\s*)[@#](\w+)\b')

def remove_tags(text):
    return tags.sub(ur'\1\2', text)

In [11]:
print remove_tags(text)

This is a sample sentence for our machinelearning presentation. Follow me sirfz! :)
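
One side effect of the pattern above is that it also fires on an `@` in the middle of a token, such as an e-mail address. The sketch below compares it with a stricter variant that only matches when the marker does not follow a word character. `strict_tags` is an assumption for illustration, not part of the original notebook.

```python
import re

# Pattern from the cell above
tags = re.compile(r'(\s*)[@#](\w+)\b')
# Assumed variant: the @ or # must not be preceded by a word character
strict_tags = re.compile(r'(?<!\w)[@#](\w+)')

email = u'mail me at joe@example.com about #nlp'
print(tags.sub(r'\1\2', email))       # mail me at joeexample.com about nlp
print(strict_tags.sub(r'\1', email))  # mail me at joe@example.com about nlp
```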


In [12]:
# correct typos (custom rules)
typo_map = {
    ur'\biphon\b': u'iphone',
    ur'\bsamsng\b': u'samsung'
}
# compile each rule once, up front
typo_rules = [(re.compile(typo), replacement) for typo, replacement in typo_map.iteritems()]

def correct_typos(text):
    # apply every rule in turn
    for typo, replacement in typo_rules:
        text = typo.sub(replacement, text)
    return text

In [13]:
print correct_typos(u'I prefer samsng phones over an iphon any day!')

I prefer samsung phones over an iphone any day!
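
Applying one compiled pattern per typo scans the text once per rule. With a long rule list, an alternative is to build a single alternation and look up the replacement for whichever misspelling matched. This is a sketch under the assumption that every rule is a whole-word, literal misspelling. `typo_fixes`, `typo_re` and `correct_typos_single_pass` are hypothetical names.

```python
import re

# Same two rules as above, keyed by the plain misspelling
typo_fixes = {u'iphon': u'iphone', u'samsng': u'samsung'}

# One alternation of the form \b(iphon|samsng)\b, built from the keys
typo_re = re.compile(
    r'\b(' + u'|'.join(re.escape(typo) for typo in typo_fixes) + r')\b')


def correct_typos_single_pass(text):
    # Look up the replacement for whichever misspelling matched
    return typo_re.sub(lambda match: typo_fixes[match.group(1)], text)


print(correct_typos_single_pass(u'I prefer samsng phones over an iphon any day!'))
```

As with the per-rule version, the `\b` anchors leave a correctly spelled `iphone` untouched.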


In [14]:
# chain functions together: lower-case first, then keep word characters only
output = get_alphanum(to_lower(text))
print output

this is a sample sentence for our machinelearning presentation follow me sirfz


In [15]:
text = u'follow us on Instagram @lamerz (http://lame.link)'
output = correct_typos(transliterate(remove_tags(replace_urls(to_lower(text)))))
print output

follow us on instagram lamerz (__URL__)
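
The order of the chain matters. For example, URLs have to be replaced before punctuation is stripped, otherwise the URL pattern no longer has anything to match. The self-contained sketch below redefines two of the functions from this notebook to show the difference.

```python
import re

alphanum = re.compile(r'\w+')
url_re = re.compile(r'(https?://|www\.)([a-z0-9-]+\.)+[a-z]{2,6}(/[\w]+)*', re.I)


def get_alphanum(text):
    return u' '.join(alphanum.findall(text))


def replace_urls(text):
    return url_re.sub(u'__URL__', text)


sample = u'see http://lame.link now'
# URLs replaced first: the placeholder survives because _ is a word character
print(get_alphanum(replace_urls(sample)))  # see __URL__ now
# punctuation stripped first: the URL is broken up before it can be matched
print(replace_urls(get_alphanum(sample)))  # see http lame link now
```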
