## Text Segmentation
Text segmentation is the process of dividing text into meaningful units. These units can be words, sentences, or topics (collections of related words).

In [4]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/kkumar/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [5]:
text = "CODE is founded by Mr. Bachem. Studying at CODE will be unlike any other higher education experience. Our intensive, interdisciplinary bachelor’s programs are designed to dramatically improve the way you work and to prepare you for the reality of tomorrow’s workplace."

In [6]:
# get the text split into sentences
print(sent_tokenize(text))

['CODE is founded by Mr. Bachem.', 'Studying at CODE will be unlike any other higher education experience.', 'Our intensive, interdisciplinary bachelor’s programs are designed to dramatically improve the way you work and to prepare you for the reality of tomorrow’s workplace.']
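
Note that `sent_tokenize` did not split after "Mr.", even though it ends with a period. The sketch below uses only the standard library to show why a purely rule-based split struggles with this. The `ABBREVIATIONS` set and the `split_sentences` helper are made up for this illustration and are not part of NLTK; the Punkt tokenizer learns abbreviations from data instead of using a hand-written list.

```python
import re

sample = ("CODE is founded by Mr. Bachem. Studying at CODE will be unlike "
          "any other higher education experience.")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive = re.split(r"(?<=[.!?])\s+", sample)
print(naive)  # "Mr." is wrongly treated as the end of a sentence

# Hand-written abbreviation list (an assumption for this sketch only)
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Prof."}

def split_sentences(text):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences(sample))
```

A fixed abbreviation list only covers the cases you thought of in advance, which is one reason to prefer a trained tokenizer for real text.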


In [7]:
# get the text split into words
print(word_tokenize(text))

['CODE', 'is', 'founded', 'by', 'Mr.', 'Bachem', '.', 'Studying', 'at', 'CODE', 'will', 'be', 'unlike', 'any', 'other', 'higher', 'education', 'experience', '.', 'Our', 'intensive', ',', 'interdisciplinary', 'bachelor', '’', 's', 'programs', 'are', 'designed', 'to', 'dramatically', 'improve', 'the', 'way', 'you', 'work', 'and', 'to', 'prepare', 'you', 'for', 'the', 'reality', 'of', 'tomorrow', '’', 's', 'workplace', '.']


## Stop Words and Word Segmentation
Natural language also contains many common words that contribute little to the meaning we want to extract, such as "the", "is", and "at". These are referred to as "stop words". Since they increase processing time and take up unnecessary space in our database, we will remove them.

In [9]:
# Import stopwords from nltk package

nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /home/kkumar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
stop_words = set(stopwords.words('english'))

tokens = word_tokenize(text)

In [11]:
# Filter out the stop words (the comparison is case-sensitive,
# so capitalized words such as 'Our' are kept)
filtered_tokens = [w for w in tokens if w not in stop_words]

In [12]:
print(tokens)

['CODE', 'is', 'founded', 'by', 'Mr.', 'Bachem', '.', 'Studying', 'at', 'CODE', 'will', 'be', 'unlike', 'any', 'other', 'higher', 'education', 'experience', '.', 'Our', 'intensive', ',', 'interdisciplinary', 'bachelor', '’', 's', 'programs', 'are', 'designed', 'to', 'dramatically', 'improve', 'the', 'way', 'you', 'work', 'and', 'to', 'prepare', 'you', 'for', 'the', 'reality', 'of', 'tomorrow', '’', 's', 'workplace', '.']


In [13]:
print(filtered_tokens)

['CODE', 'founded', 'Mr.', 'Bachem', '.', 'Studying', 'CODE', 'unlike', 'higher', 'education', 'experience', '.', 'Our', 'intensive', ',', 'interdisciplinary', 'bachelor', '’', 'programs', 'designed', 'dramatically', 'improve', 'way', 'work', 'prepare', 'reality', 'tomorrow', '’', 'workplace', '.']
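
The filtered list still contains 'Our' and punctuation tokens, because NLTK's English stop word list is lowercase and the membership test is case-sensitive. A minimal sketch of a stricter filter follows. To stay self-contained it uses a tiny hand-written stop word set (`SMALL_STOP_WORDS`, an assumption for this example) in place of `stopwords.words('english')`.

```python
# Tiny stand-in for the NLTK stop word list (assumption for this sketch)
SMALL_STOP_WORDS = {"our", "are", "the", "is", "by", "to", "and"}

sample_tokens = ["Our", "intensive", ",", "programs", "are", "designed", "."]

def filter_tokens(tokens, stop_words):
    """Drop stop words case-insensitively and discard pure punctuation tokens."""
    return [
        t for t in tokens
        if t.lower() not in stop_words          # compare in lowercase
        and any(ch.isalnum() for ch in t)       # keep tokens with a letter or digit
    ]

print(filter_tokens(sample_tokens, SMALL_STOP_WORDS))
```

Whether to lowercase depends on the task: for named entities such as "CODE", the capitalization carries information, so the original token is kept and only the comparison is lowercased.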


## Stemming
Stemming and lemmatization both reduce a word to its root or base form by removing inflectional and derivational affixes. Stemming takes a crude approach: it simply chops off affixes in the hope of reaching the base form, so the result is not always a real word. Lemmatization instead looks for a base form that matches a dictionary word, which gives more meaningful results.

In [14]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

#### Stemming single words

In [15]:
example_words = ["ride", "riding", "rider"]

In [16]:
for w in example_words:
    print(stemmer.stem(w))

ride
ride
rider


#### Stemming sentences

In [17]:
new_text = """CODE is a newly founded private university of applied sciences that is embedded into the vibrant 
network of Berlin's digital economy."""

In [18]:
words = word_tokenize(new_text)

for w in words:
    print(stemmer.stem(w))

code
is
a
newli
found
privat
univers
of
appli
scienc
that
is
embed
into
the
vibrant
network
of
berlin
's
digit
economi
.
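
The stemmer output above contains non-words such as `newli` and `univers`, which is the price of the crude approach. The toy sketch below contrasts suffix chopping with a dictionary lookup. Both `crude_stem` and the `LEMMAS` table are made up for this illustration. This is neither the Porter algorithm nor a real lemmatizer, only a picture of the difference between the two ideas.

```python
def crude_stem(word, suffixes=("ing", "ly", "ed", "es", "s")):
    """Strip the first matching suffix, with no linguistic knowledge involved."""
    for suffix in suffixes:
        # Require a few remaining characters so short words stay intact
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemmas
LEMMAS = {"studies": "study", "founded": "found", "sciences": "science", "better": "good"}

def toy_lemmatize(word):
    """Return the dictionary base form if known, otherwise the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "founded", "sciences", "better"]:
    print(f"{w:10} stem={crude_stem(w):8} lemma={toy_lemmatize(w)}")
```

The lookup can map "better" to "good", which no amount of suffix stripping can do, but it only works for words the dictionary knows.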


## Parsing (Part-of-Speech Tagging & Chunking)

### 1. Part-of-Speech Tagging
Part-of-speech (POS) tagging in NLP is the process of labelling each word in a sentence with its part of speech (noun, adjective, verb, etc.).

NLTK provides a sentence tokenizer called "PunktSentenceTokenizer", an implementation of an unsupervised ML algorithm that can be trained on any text corpus you choose.

In [19]:
import nltk
from nltk import PunktSentenceTokenizer

In [21]:
# Download the Gutenberg corpus and the POS tagger model,
# then load two Chesterton novels (one for training, one for testing)
nltk.download('gutenberg')
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import gutenberg

test = gutenberg.raw("chesterton-ball.txt")
train = gutenberg.raw("chesterton-brown.txt")

[nltk_data] Downloading package gutenberg to /home/kkumar/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/kkumar/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [22]:
# Train the tokenizer
custom_sent_tokenizer = PunktSentenceTokenizer(train)

# Tokenize the data
tokenized_data = custom_sent_tokenizer.tokenize(test)

In [25]:
def tag_text(data):
    """Print the part-of-speech tags for the first seven sentences."""
    try:
        for sentence in data[:7]:
            actual_words = nltk.word_tokenize(sentence)
            tagged_words = nltk.pos_tag(actual_words)
            print(tagged_words)
    except Exception as e:
        print(str(e))
        
tag_text(tokenized_data)

[('[', 'IN'), ('The', 'DT'), ('Ball', 'NNP'), ('and', 'CC'), ('The', 'DT'), ('Cross', 'NNP'), ('by', 'IN'), ('G.K', 'NNP'), ('.', '.')]
[('Chesterton', 'NNP'), ('1909', 'CD'), (']', 'NN'), ('I', 'PRP'), ('.', '.')]
[('A', 'DT'), ('DISCUSSION', 'NNP'), ('SOMEWHAT', 'NNP'), ('IN', 'NNP'), ('THE', 'NNP'), ('AIR', 'NNP'), ('The', 'DT'), ('flying', 'VBG'), ('ship', 'NN'), ('of', 'IN'), ('Professor', 'NNP'), ('Lucifer', 'NNP'), ('sang', 'VBD'), ('through', 'IN'), ('the', 'DT'), ('skies', 'NNS'), ('like', 'IN'), ('a', 'DT'), ('silver', 'NN'), ('arrow', 'NN'), (';', ':'), ('the', 'DT'), ('bleak', 'JJ'), ('white', 'JJ'), ('steel', 'NN'), ('of', 'IN'), ('it', 'PRP'), (',', ','), ('gleaming', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('bleak', 'JJ'), ('blue', 'JJ'), ('emptiness', 'NN'), ('of', 'IN'), ('the', 'DT'), ('evening', 'NN'), ('.', '.')]
[('That', 'IN'), ('it', 'PRP'), ('was', 'VBD'), ('far', 'RB'), ('above', 'IN'), ('the', 'DT'), ('earth', 'NN'), ('was', 'VBD'), ('no', 'DT'), ('expression', '
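
Each tagged sentence is a plain list of `(word, tag)` tuples, so it can be processed with ordinary Python. As a small sketch, the snippet below takes the first tagged sentence from the output above, counts how often each tag occurs, and collects the words tagged `NNP` (proper noun, singular, in the Penn Treebank tag set). The names `tag_counts` and `proper_nouns` are chosen for this example.

```python
from collections import Counter

# First tagged sentence, copied from the output above
tagged_words = [('[', 'IN'), ('The', 'DT'), ('Ball', 'NNP'), ('and', 'CC'),
                ('The', 'DT'), ('Cross', 'NNP'), ('by', 'IN'), ('G.K', 'NNP'),
                ('.', '.')]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged_words)

# Collect all words tagged as proper nouns
proper_nouns = [word for word, tag in tagged_words if tag == 'NNP']

print(tag_counts)
print(proper_nouns)
```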

### 2. Chunking Text

Chunking is the process of grouping words into more meaningful units than individual part-of-speech tags, for example noun phrases and verb phrases. With the help of chunking you can build a parse tree.

In this example we will search for chunks that correspond to individual noun phrases.

In [26]:
# using pre-tagged text for simplicity
text = [("the", "DT"), ("huge", "JJ"), ("german", "JJ"), ("Rottweiler", "NN"), 
        ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]

In [27]:
# define a noun phrase as:
#     NP = optional determiner + zero or more adjectives + singular noun
grammar = "NP: {<DT>?<JJ>*<NN>}"

# build a RegexpParser from the grammar
cp = nltk.RegexpParser(grammar)

# carry out chunking
result = cp.parse(text)
print(result)

(S
  (NP the/DT huge/JJ german/JJ Rottweiler/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
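
To make the tag pattern less of a black box, here is a rough sketch of the same idea using only the standard `re` module. The tag sequence is written out as one string, and an ordinary regular expression equivalent to `<DT>?<JJ>*<NN>` is matched against it. This is a simplified illustration, not NLTK's actual implementation, and the names `tag_string` and `np_pattern` are chosen for this example.

```python
import re

tagged = [("the", "DT"), ("huge", "JJ"), ("german", "JJ"), ("Rottweiler", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Encode the tag sequence as one string: "<DT><JJ><JJ><NN><VBD><IN><DT><NN>"
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Optional determiner, any number of adjectives, then one singular noun
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

noun_phrases = []
for match in np_pattern.finditer(tag_string):
    # Map character offsets back to token indices by counting "<"
    start = tag_string[:match.start()].count("<")
    end = start + match.group().count("<")
    noun_phrases.append(" ".join(word for word, _ in tagged[start:end]))

print(noun_phrases)  # ['the huge german Rottweiler', 'the cat']
```

The two phrases found here are the same two `NP` chunks in the parse tree above.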
