## Advanced text processing
1. Using regular expressions
2. Using NLP based string manipulations

### A. Regular expressions:
1. Used to extract patterns from text
2. Make some common, mundane tasks easier


In [1]:
import re ##This module is required to do regex based processing

### Regex: Python
1. Understanding the Python API: .compile(), .search(), .group(), .findall()
2. Understanding wildcards: ?, $, ^, |, {}, []
3. Demo: Extracting an email id, phone number
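
The anchors `^` and `$`, the set `[]` and the alternation `|` from the list above can be tried out with a minimal sketch like the following. The sample strings and variable names are illustrative assumptions, not part of the original demo.

```python
import re

# ^ and $ anchor the pattern to the start and the end of the string
exact = bool(re.search(r'^\d{10}$', '9739520276'))
partial = bool(re.search(r'^\d{10}$', 'call 9739520276'))

# [] defines a set of allowed characters, | means "either side"
vowels = re.findall(r'[aeiou]', 'regex')
pets = re.findall(r'cat|dog', 'a cat and a dog')

print(exact)    # True
print(partial)  # False
print(vowels)   # ['e', 'e']
print(pets)     # ['cat', 'dog']
```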

In [2]:
message='Call me tomorrow at 9739520276. 9465837277 is my secondary number'
print(message)

Call me tomorrow at 9739520276. 9465837277 is my secondary number


In [3]:
#Is there a way to extract only the phone numbers from this text?
#What might a programmatic approach look like?
#The phone numbers here all have the same length (10 characters) and consist only of digits.
#We can write a function that returns True if a chunk is exactly 10 digits, and False otherwise
def is_phone(chunk):
    if len(chunk)!=10:
        return False
    for i in range(0,10):
        if not chunk[i].isdigit():
            return False
    return True

In [4]:
is_phone('9739520276')

True

In [5]:
#Slide a 10-character window over the message and test each chunk
for i in range(len(message)):
    chunk=message[i:i+10]
    if is_phone(chunk):
        print(chunk)

9739520276
9465837277


In [6]:
##This approach has obvious drawbacks: writing the code is tedious, and custom logic is needed for every new pattern
#A better approach is to use a regular expression

#Finding digits in a text: \d is the character class that matches a single digit
num_regex=re.compile(r'\d')
text='This text contains a number:2'
go=num_regex.search(text)
go.group()

'2'
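
One detail worth keeping in mind: `.search()` returns `None` when nothing matches, so calling `.group()` directly on the result raises an `AttributeError`. A minimal sketch of a guard is shown below; the helper name `first_digit` is a hypothetical name chosen for illustration.

```python
import re

num_regex = re.compile(r'\d')

def first_digit(text):
    """Return the first digit found in text, or None if there is none."""
    match = num_regex.search(text)
    # match is None when the pattern does not occur anywhere in text
    return match.group() if match else None

found = first_digit('This text contains a number:2')
missing = first_digit('This text contains no number')
print(found)    # 2
print(missing)  # None
```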

In [7]:
#The standard process for using regexes is:
#1 Use .compile() to create a pattern object
#2 Use .search() to search for the first match of the pattern
#3 Use .group() to retrieve the matched text

text='This text contains one number here 123 and another number here 143'
#Can you think of a regex?
num_regex=re.compile(r'\d\d\d')
go=num_regex.search(text)
go.group()#What is going on?

'123'

In [8]:
go=num_regex.findall(text)
go

['123', '143']

In [9]:
#To recapitulate: if every occurrence of a pattern is needed, use .findall(); if only the first
#occurrence is needed, use .search()

#Quantifiers make certain tasks easier: {2,5} matches between 2 and 5 repetitions of the preceding element
text='This text contains one number here 23, another here 12345 and the last one here 124'
num_regex=re.compile(r'\d{2,5}')
go=num_regex.findall(text)
go

['23', '12345', '124']
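
A caveat with `{m,n}`: without boundaries, a longer run of digits is split into pieces that each fit the range. The sketch below, using an illustrative sample string, shows how the word boundary `\b` restricts matches to whole numbers.

```python
import re

text = 'order 1234567 has 42 items'

# Without boundaries, the 7-digit number is cut into a 5-digit and a 2-digit match
loose = re.findall(r'\d{2,5}', text)

# \b requires a word boundary on both sides, so only whole numbers of 2 to 5 digits match
strict = re.findall(r'\b\d{2,5}\b', text)

print(loose)   # ['12345', '67', '42']
print(strict)  # ['42']
```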

In [10]:
#Another quantifier is +, which matches one or more occurrences of the preceding element
text='This text contains one number here 23, another here 12345667889909090909 and the last one here 3489'
num_regex=re.compile(r'\d+')
go=num_regex.findall(text)
go

['23', '12345667889909090909', '3489']

In [12]:
#Sometimes part of a pattern is optional; ? makes the preceding element optional (here only the 1, so 91? matches 9 or 91)
text='phone numbers are written either as 9739520276 or with a country code 919739520276'
num_regex=re.compile(r'91?\d+')
go=num_regex.findall(text)
go

['9739520276', '919739520276']
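
Note that `?` applies only to the single element before it. To make the whole country code optional, the two digits can be wrapped in a non-capturing group `(?:...)`. The following is a minimal sketch; the fixed length of 10 digits is an assumption based on the numbers used in this example.

```python
import re

text = 'phone numbers are written either as 9739520276 or with a country code 919739520276'

# 91? means: a 9, optionally followed by a 1
single_char = re.findall(r'91?', '9 91 911')

# (?:91)? makes the whole prefix "91" optional, followed by exactly 10 digits
grouped = re.findall(r'(?:91)?\d{10}', text)

print(single_char)  # ['9', '91', '91']
print(grouped)      # ['9739520276', '919739520276']
```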

In [5]:
# A more realistic scenario would be:
text='phone numbers are written either as 9739520276 or with a country code +919739520276'
# + is a metacharacter, so it has to be escaped as \+ to match a literal plus sign
num_regex=re.compile(r'\+91?\d+')
go=num_regex.findall(text)
go

['+919739520276']

In [9]:
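#This doesn't return the first phone number. One way to handle this is to wrap the country code in an optional group;
#when a pattern contains groups, .findall() returns one tuple per match, with one entry per group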
num_regex=re.compile(r'(\+91)?(\d+)')
go=num_regex.findall(text)
go

[('', '9739520276'), ('+91', '9739520276')]
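
The individual groups can also be read from match objects. A small sketch using `.finditer()`: `group(0)` is the whole match, while `group(1)` and `group(2)` are the parenthesised parts. Unlike `.findall()`, a group that did not take part in the match is reported as `None` rather than as an empty string.

```python
import re

text = 'phone numbers are written either as 9739520276 or with a country code +919739520276'
num_regex = re.compile(r'(\+91)?(\d+)')

# Collect (whole match, country code, number) for every match in the text
rows = [(m.group(0), m.group(1), m.group(2)) for m in num_regex.finditer(text)]

for row in rows:
    print(row)
# ('9739520276', None, '9739520276')
# ('+919739520276', '+91', '9739520276')
```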

In [13]:
text='Not all people live in India, my friend who is in England, his number is 020 7946 0234 and my number is +919739520276'
num_regex=re.compile(r'(020\s)?(\d{4}\s\d{4}|\+\d{12})')
go=num_regex.findall(text)
go

[('020 ', '7946 0234'), ('', '+919739520276')]

In [15]:
# A few new metacharacters have been introduced: \s matches whitespace, | means "or"
# Can you think of a regex for finding the phone numbers in the following text?
text='Here is a phone number +91 9739 52076 and another phone number 090 973 952 0276'
num_regex=re.compile(r'(\+91\s|090\s)?(\d{3}\s\d{3}\s\d{4}|\d{4}\s\d+)')
go=num_regex.findall(text)
go

[('+91 ', '9739 52076'), ('090 ', '973 952 0276')]

#### Regular expressions have many wildcards, or metacharacters. Here is a list of the character classes
<img src='character_classes.png'>

#### Here is a list of all the regex operators
<img src='regex.png'>

In [17]:
#Although most regex frameworks predefine several character classes, there are cases where
#one needs to define a custom class.
text='my email id is john123@gmail.com'
#Usernames can be a mix of digits and letters, so to match an email id we define
#a custom class; this pattern captures only the username and the @ sign
email_regex=re.compile(r'([a-zA-Z0-9]+@)')
email_regex.findall(text)

['john123@']

In [18]:
text='email id is 123abc@flatmail.com'
email_regex.findall(text)

['123abc@']
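
The pattern above stops at the `@` sign. A simplified sketch that also covers the domain is shown below. The pattern and the name `email_full` are assumptions made for illustration; this is not a complete validator, since real email addresses allow more forms than this pattern accepts.

```python
import re

# username characters, an @ sign, a domain, a dot and a top-level domain of 2+ letters
email_full = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

samples = [
    'my email id is john123@gmail.com',
    'email id is 123abc@flatmail.com',
]
found = [email_full.findall(s) for s in samples]
print(found)  # [['john123@gmail.com'], ['123abc@flatmail.com']]
```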

In [19]:
##Compiled patterns offer some other useful methods
#.search() returns a match object, or None when nothing matches
email_regex.search(text)

<_sre.SRE_Match at 0x7f49766426c0>

In [23]:
bool(email_regex.search(text))
#Equivalent: email_regex.search(text) is not None

True

In [24]:
#.sub() replaces every match of the pattern with the given string
email_regex.sub("abc",text)

'email id is abcflatmail.com'
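
The replacement string in `.sub()` can also refer back to captured groups with `\1`, and the `count` argument limits how many matches are replaced. Here is a small sketch; `user_regex` and the sample strings are illustrative assumptions.

```python
import re

text = 'email id is 123abc@flatmail.com'
# Capture only the username, keep the @ outside the group
user_regex = re.compile(r'([a-zA-Z0-9]+)@')

# Replace the username with a fixed mask
masked = user_regex.sub('***@', text)

# \1 inserts whatever the first group captured
tagged = user_regex.sub(r'<\1>@', text)

# count=2 replaces only the first two digits
first_only = re.sub(r'\d', '#', 'call 9739520276', count=2)

print(masked)      # email id is ***@flatmail.com
print(tagged)      # email id is <123abc>@flatmail.com
print(first_only)  # call ##39520276
```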

### B. Using NLP based string manipulations:
1. Tokenizing
2. Removing punctuation
3. Doing simple counts


In [26]:
##Some text processing tasks aren't easily handled by regexes or by anything else discussed so far
import nltk
text='This sentence has commas, full stops names with dots, spacy.loads(). Can we break down the whole sentence into words? '

In [27]:
words=nltk.word_tokenize(text)

In [28]:
print(words)

['This', 'sentence', 'has', 'commas', ',', 'full', 'stops', 'names', 'with', 'dots', ',', 'spacy.loads', '(', ')', '.', 'Can', 'we', 'break', 'down', 'the', 'whole', 'sentence', 'into', 'words', '?']


In [29]:
#Keep only the purely alphabetic tokens and lowercase them
words_norm=[word.lower() for word in words if word.isalpha()]

In [30]:
print(words_norm)

['this', 'sentence', 'has', 'commas', 'full', 'stops', 'names', 'with', 'dots', 'can', 'we', 'break', 'down', 'the', 'whole', 'sentence', 'into', 'words']
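
For comparison, a rough regex-only tokenizer can be sketched with the standard library alone. It is an assumption-laden shortcut rather than a replacement for `nltk.word_tokenize`: it splits `spacy.loads` into two words, whereas the `isalpha()` filter above drops that token entirely.

```python
import re

text = 'This sentence has commas, full stops names with dots, spacy.loads(). Can we break down the whole sentence into words? '

# Lowercase the text and keep every run of letters as a word
regex_words = re.findall(r'[a-z]+', text.lower())

print(len(regex_words))        # 20
print('spacy' in regex_words)  # True
print('loads' in regex_words)  # True
```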


In [32]:
#Join the tokens back into a single string
" ".join(words_norm)

'this sentence has commas full stops names with dots can we break down the whole sentence into words'

In [33]:
#Counting words
count=nltk.FreqDist(words_norm)

In [34]:
count

FreqDist({'break': 1,
          'can': 1,
          'commas': 1,
          'dots': 1,
          'down': 1,
          'full': 1,
          'has': 1,
          'into': 1,
          'names': 1,
          'sentence': 2,
          'stops': 1,
          'the': 1,
          'this': 1,
          'we': 1,
          'whole': 1,
          'with': 1,
          'words': 1})

In [35]:
#Lookups are case-sensitive: the tokens were lowercased, so 'This' has a count of 0
count['This']

0
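
In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so the same counting behaviour can be sketched with the standard library alone. The snippet below rebuilds the normalised word list from the joined string shown earlier; it illustrates the lookups and does not replace the NLTK call.

```python
from collections import Counter

words_norm = ('this sentence has commas full stops names with dots '
              'can we break down the whole sentence into words').split()
count = Counter(words_norm)

# Missing keys return 0 instead of raising KeyError
print(count['sentence'])     # 2
print(count['This'])         # 0
print(count['this'])         # 1

# most_common(n) lists the n most frequent words with their counts
print(count.most_common(1))  # [('sentence', 2)]
```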