<a href="https://colab.research.google.com/github/kilos11/Michigan-State-university-Natural-Language-Processing-/blob/main/Chapter_2_A_Quick_Tour_of_Traditional_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#*Corpora, Tokens, and Types*

All NLP methods, be they classic or modern, begin with a text dataset, also called a corpus (plural: corpora). A corpus usually contains raw text (in ASCII or UTF-8) and any metadata associated with the text. The raw text is a sequence of characters (bytes), but it is usually more useful to group those characters into contiguous units called tokens. In English, tokens correspond to words and numeric sequences separated by whitespace characters or punctuation.
The metadata could be any auxiliary piece of information associated with the text, such as identifiers, labels, and timestamps. In machine learning parlance, the text along with its metadata is called an instance or data point. The corpus (Figure 2-1), a collection of instances, is also known as a dataset.
Given the heavy machine learning focus of this book, we use the terms corpus and dataset interchangeably throughout.
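The vocabulary above can be made concrete with a small sketch. The two-instance corpus, the `tokenize` helper, and the regular expression are all hypothetical and chosen only for illustration. The sketch also assumes the usual definition of a type as a distinct token, which the text above does not spell out.

```python
import re
from collections import Counter

# Hypothetical toy corpus: each instance pairs raw text with metadata.
corpus = [
    {"id": 1, "label": "fiction", "text": "Mary slapped the green witch."},
    {"id": 2, "label": "fiction", "text": "The witch slapped Mary back!"},
]

# Assumed simplification: a token is a run of word characters or a single
# punctuation mark. Real tokenizers handle many more cases.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# Flatten the corpus into one token stream.
tokens = [tok for instance in corpus for tok in tokenize(instance["text"])]

# Types are the distinct tokens; the Counter also records their frequencies.
types = Counter(tokens)

print(len(tokens), "tokens,", len(types), "types")
```

On this toy corpus the token count exceeds the type count because words such as "mary" and "witch" occur in both instances.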

In [None]:
# Example 2-1. Tokenizing text
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Mary, don’t slap the green witch"
print([str(token) for token in nlp(text.lower())])

['mary', ',', 'do', 'n’t', 'slap', 'the', 'green', 'witch']


In [None]:
from nltk.tokenize import TweetTokenizer

tweet = "Snow White and the Seven Degrees#MakeAMovieCold@midnight:-)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))


['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']
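To see why a tweet-aware tokenizer treats hashtags, handles, and emoticons as single tokens, here is a minimal regular-expression sketch. It is not how `TweetTokenizer` is implemented; the pattern and the `tokenize_tweet` helper are assumptions made for illustration and cover only a handful of emoticon shapes.

```python
import re

# Hypothetical pattern, tried left to right at each position:
#   1. hashtags and handles   (#word, @word)
#   2. a few simple emoticons (:-) ;-( :D ...)
#   3. ordinary words
#   4. any other single non-space character
TWEET_PATTERN = re.compile(r"[#@]\w+|[:;]-?[()DP]|\w+|[^\w\s]")


def tokenize_tweet(text):
    """Lowercase a tweet and split it with the illustrative pattern above."""
    return TWEET_PATTERN.findall(text.lower())


tweet = "Snow White and the Seven Degrees#MakeAMovieCold@midnight:-)"
print(tokenize_tweet(tweet))
```

Ordering matters in the alternation: if the catch-all for single characters came first, the emoticon would be split into separate punctuation tokens.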


Example 2-3. Lemmatization: reducing words to their root forms

In [3]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("he was running lately")
for token in doc:
    print('{} --> {}'.format(token, token.lemma_))

he --> he
was --> be
running --> run
lately --> lately


#Categorizing Words: POS Tagging

We can extend the concept of labeling from documents to individual words or tokens. A common example of categorizing words is part-of-speech (POS) tagging, as demonstrated in Example 2-4.

In [5]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Mary slapped the green witch.")
for token in doc:
    print('{} - {}'.format(token, token.pos_))

Mary - PROPN
slapped - VERB
the - DET
green - ADJ
witch - NOUN
. - PUNCT


#Categorizing Spans: Chunking and Named Entity Recognition

Often, we need to label a span of text; that is, a contiguous multitoken boundary. For example, consider the sentence, “Mary slapped the green witch.” We might want to identify the noun phrases (NP) and verb phrases (VP) in it. Example 2-5 does this for the noun phrases.

Example 2-5. Noun Phrase (NP) chunking

In [6]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Mary slapped the green witch.")
for chunk in doc.noun_chunks:
    print('{} - {}'.format(chunk, chunk.label_))

Mary - NP
the green witch - NP
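To build intuition for what a chunker does, the sketch below groups tokens into noun phrases using a hand-written rule over POS tags. This is not how spaCy computes `noun_chunks`. The `np_chunks` helper and its rule (an optional determiner, any number of adjectives, then one or more nouns or proper nouns) are assumptions for illustration, and the tagged tokens are copied from the POS tagging output shown earlier.

```python
# (token, POS tag) pairs copied from the POS tagging output above.
tagged = [
    ("Mary", "PROPN"), ("slapped", "VERB"), ("the", "DET"),
    ("green", "ADJ"), ("witch", "NOUN"), (".", "PUNCT"),
]


def np_chunks(tagged_tokens):
    """Yield spans matching: optional DET, any ADJ, one or more NOUN/PROPN."""
    i, n = 0, len(tagged_tokens)
    while i < n:
        j = i
        # Optional determiner at the start of the span.
        if tagged_tokens[j][1] == "DET":
            j += 1
        # Any number of adjectives.
        while j < n and tagged_tokens[j][1] == "ADJ":
            j += 1
        # At least one noun or proper noun is required to close the span.
        head_start = j
        while j < n and tagged_tokens[j][1] in ("NOUN", "PROPN"):
            j += 1
        if j > head_start:
            yield " ".join(tok for tok, _ in tagged_tokens[i:j])
            i = j  # continue after the chunk
        else:
            i += 1  # no noun phrase starts here; move on


for chunk in np_chunks(tagged):
    print('{} - NP'.format(chunk))
```

On this sentence the rule recovers the same two spans as the spaCy example, "Mary" and "the green witch", though a rule this simple would miss many noun phrases in real text.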
