# Text Processing

This section covers the basic steps we perform after acquiring raw text from various sources:

- Using Strings
- Using Unicode Strings
- Regular Expressions
- Normalization with Stemmers and Lemmatizers
- Tokenization
- Segmentation
- Formatting

## Using Strings

In [2]:
# Single line
circus = "Monty Python's flying circus"
circus

"Monty Python's flying circus"

In [3]:
# Multi line: adjacent string literals are joined without a separator
couplet = "Shall I compare thee to a summer's day?"\
    "Thou art more lovely and more temperate:"
couplet

"Shall I compare thee to a summer's day?Thou art more lovely and more temperate:"

In [4]:
# Concatenation
'very' + 'very' + 'very'

'veryveryvery'

In [5]:
# Access individual characters
circus[0]

'M'

In [None]:
# for loops
for char in circus:
    print(char)

In [8]:
# substrings
circus[-12:-7]

'lying'
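Negative indices count from the end of the string, so the slice above can also be written with positive offsets. A small sketch of that equivalence; the names `start` and `stop` are introduced here only for illustration.

```python
circus = "Monty Python's flying circus"

# A negative index -k refers to position len(circus) - k.
start = len(circus) - 12
stop = len(circus) - 7

# Both slices select the same characters.
same_slice = circus[-12:-7] == circus[start:stop]
print(circus[start:stop], same_slice)
```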

In [9]:
phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found it')

found it


In [10]:
# substring with find
"""
Others: find(t), rfind(t), index(t), rindex(t), join(text), split(t), splitlines(), lower(), upper(), title(), strip(), replace(t, u)
"""
circus.find('Python')

6
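A few of the other string methods listed above, applied to the same `circus` string. This is only a quick tour, not a complete reference; the search strings `'parrot'` and `'  spam  '` are made up for the example.

```python
circus = "Monty Python's flying circus"

missing = circus.find('parrot')        # find() returns -1 when there is no match
last_c = circus.rfind('c')             # index of the last occurrence
words = circus.split(' ')              # split on single spaces
joined = '-'.join(words)               # join a list back into one string
shouted = circus.upper()
stripped = '  spam  '.strip()          # remove leading and trailing whitespace
replaced = circus.replace('circus', 'parrot')

print(missing, last_c)
print(words)
print(joined)
print(shouted)
print(stripped)
print(replaced)
```

Note that `index(t)` and `rindex(t)` behave like `find(t)` and `rfind(t)` but raise `ValueError` instead of returning -1.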

In [11]:
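# split text into words, keeping each delimiter as its own token
text = 'Define which data represents "ham" class and which data represents "spam" class for the machine-learning algorithm.'
# text = "i. e."
delimiters = ['"', "."]
words = []
current_word = ""
for char in text:
    if char == " ":
        # a space ends the current word, if there is one
        if current_word:
            words.append(current_word)
            current_word = ""
    elif char in delimiters:
        # a delimiter ends the current word and is kept as a token
        if current_word:
            words.append(current_word)
            current_word = ""
        words.append(char)
    else:
        current_word += char

# flush the last word when the text does not end with a space or delimiter
if current_word:
    words.append(current_word)

print(words)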

['Define', 'which', 'data', 'represents', '"', 'ham', '"', 'class', 'and', 'which', 'data', 'represents', '"', 'spam', '"', 'class', 'for', 'the', 'machine-learning', 'algorithm', '.']
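The character loop above can also be expressed as a single regular expression. The following is a sketch of one possible pattern, not the only way to write it; the name `token_pattern` is introduced here for illustration.

```python
import re

text = ('Define which data represents "ham" class and which data '
        'represents "spam" class for the machine-learning algorithm.')

# Either a run of characters that are neither whitespace nor a delimiter,
# or a single delimiter character (double quote or period).
token_pattern = re.compile(r'[^\s".]+|[".]')

words = token_pattern.findall(text)
print(words)
```

For this text the pattern yields the same token list as the loop, including the separate `"` and `.` tokens.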


## Using Unicode Strings

## Regular Expressions

In [None]:
import re
import nltk
from nltk import corpus

# Lowercase English words from the NLTK Words Corpus (needs nltk.download('words'))
wordlist = [w for w in corpus.words.words('en') if w.islower()]
wordlist
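With a word list in hand, `re.search` can filter it by pattern. The sketch below uses a small hand-written list, `sample_words`, as a stand-in for the NLTK word list so that it runs without downloading any corpus; the list contents are assumptions made for the example.

```python
import re

# Hypothetical sample; in the notebook this would be the NLTK word list.
sample_words = ['abandoned', 'walk', 'golf', 'hold', 'jumped', 'gold', 'hole', 'edit']

# '$' anchors the match at the end of the word, so 'edit' does not qualify.
ends_in_ed = [w for w in sample_words if re.search(r'ed$', w)]

# '^' and '$' together force the whole word to match; each bracket
# lists the letters allowed at that position.
four_letter = [w for w in sample_words if re.search(r'^[ghi][mno][jkl][def]$', w)]

print(ends_in_ed)
print(four_letter)
```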

## Normalization with Stemmers and Lemmatizers

Normalization prepares text for processing by removing punctuation, converting it to lower or upper case, converting numbers to words, expanding abbreviations, and more. Stemmers and lemmatizers go a step further and reduce each word to a base form.

In [1]:
import re
import string
from nltk.tokenize import word_tokenize

text = ['It was a great show', 'Artists acted', 'Dancers performed']
tokenized_docs = [word_tokenize(doc) for doc in text]
# Character class that matches any single ASCII punctuation character
punctuation = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []

for doc_tokens in tokenized_docs:
    clean_tokens = []
    for token in doc_tokens:
        new_token = punctuation.sub('', token)
        # Skip tokens that consisted only of punctuation
        if new_token:
            clean_tokens.append(new_token)
    tokenized_docs_no_punctuation.append(clean_tokens)
print(tokenized_docs_no_punctuation)

[['It', 'was', 'a', 'great', 'show'], ['Artists', 'acted'], ['Dancers', 'performed']]
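The other normalization steps mentioned in this section (case folding, abbreviation expansion, numbers to words, and stemming) can be sketched with the standard library alone. Everything named here is an assumption made for illustration: the lookup tables `ABBREVIATIONS` and `NUMBER_WORDS` are tiny and hypothetical, and `naive_stem` is a crude suffix stripper, not NLTK's stemmer or lemmatizer.

```python
import re
import string

# Hypothetical lookup tables, kept tiny for illustration.
ABBREVIATIONS = {'dr.': 'doctor', 'st.': 'street', 'e.g.': 'for example'}
NUMBER_WORDS = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
                '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}
SUFFIXES = ('ing', 'ed', 's')

PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))


def naive_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def normalize(text):
    """Lowercase, expand abbreviations, drop punctuation, spell out digits, stem."""
    tokens = []
    for token in text.lower().split():
        token = ABBREVIATIONS.get(token, token)   # expand before removing the period
        token = PUNCTUATION.sub('', token)        # then drop punctuation
        token = NUMBER_WORDS.get(token, token)    # single digits to words
        # An expansion may contain several words, so split again.
        tokens.extend(naive_stem(part) for part in token.split())
    return tokens


normalized = normalize('Dr. Smith performed 2 shows, dancing!')
print(normalized)
```

The stem `danc` for `dancing` shows why rule-based suffix stripping produces stems rather than dictionary words; a lemmatizer would map to a real word form instead.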


## Tokenization

Tokenization is the process of splitting a text into smaller units called tokens, such as sentences or words.

In [2]:
import nltk
from nltk.tokenize import sent_tokenize

text = 'Welcome, I hope the journey was fine? Please stay safe!'
print(sent_tokenize(text))

['Welcome, I hope the journey was fine?', 'Please stay safe!']
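Tokens can also be words rather than sentences. As a rough sketch of word-level tokenization that needs no NLTK data, a regular expression can separate words from punctuation; this is a simplification and is not how NLTK's tokenizers work internally.

```python
import re

text = 'Welcome, I hope the journey was fine? Please stay safe!'

# \w+ matches a run of letters, digits, or underscores;
# [^\w\s] matches a single punctuation character.
tokens = re.findall(r'\w+|[^\w\s]', text)
print(tokens)
```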


Working with a large number of sentences: load the Punkt tokenizer once and reuse it, instead of calling `sent_tokenize` for each text.

In [3]:
import nltk
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = 'Hello everyone. Hope you are doing well. Good to know you are'
print(tokenizer.tokenize(text))

['Hello everyone.', 'Hope you are doing well.', 'Good to know you are']


Using other languages

In [4]:
import nltk

french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
print(french_tokenizer.tokenize('Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage  collège franco-britanniquedeLevallois-Perret. Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage  Levallois. L’équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l’agression, janvier , d’un professeur d’histoire. L’équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l’agression, mercredi , d’un professeur d’histoire'))

['Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage  collège franco-britanniquedeLevallois-Perret.', 'Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage  Levallois.', 'L’équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l’agression, janvier , d’un professeur d’histoire.', 'L’équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l’agression, mercredi , d’un professeur d’histoire']


## Segmentation

## Formatting