## Tokenization
Tokenization is the process of breaking text into smaller units called tokens. Tokens can be words, subwords, or punctuation marks, and they are the basic units that later NLP steps operate on. It is usually the first step in an NLP pipeline, because most downstream tasks (tagging, parsing, entity recognition) work on tokens rather than on raw strings.

In [2]:
# Import the necessary libraries: NLTK and spaCy
import nltk
from nltk.tokenize import word_tokenize
import spacy

In [3]:
# Download necessary resources (you only need to do this once)
# Note: recent NLTK releases (3.9 and later) may ask for 'punkt_tab' instead of 'punkt'
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mw50000150\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
# Input text
text = "Tokenization is the first step in NLP. It splits text into words or subwords."

# Tokenize the text
tokens = word_tokenize(text)

# Print the tokens
print(tokens)

['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', '.', 'It', 'splits', 'text', 'into', 'words', 'or', 'subwords', '.']
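To see roughly what a word tokenizer does, here is a minimal sketch that uses only the standard library. The function name `simple_tokenize` and its regular expression are illustrative assumptions, not NLTK's implementation; `word_tokenize` handles contractions, abbreviations, and many other cases that this pattern ignores.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Tokenization is the first step in NLP. It splits text into words or subwords."
tokens = simple_tokenize(text)
print(tokens)
```

On this sentence the sketch produces the same 16 tokens as the `word_tokenize` output above. The two diverge on input such as "don't", which this regex splits into `don`, `'`, and `t`.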


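### Tokenization with spaCy
spaCy is another Python library for NLP. Its pipeline tokenizes text and can also run tagging, parsing, and named entity recognition. Install it with `pip install spacy`, then download the small English model with `python -m spacy download en_core_web_sm` before running the cell below.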

In [5]:
# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Input text
text = "Tokenization is the first step in NLP. It splits text into words or subwords."

# Process the text (tokenization happens here)
doc = nlp(text)

# Get the tokens from the processed document
tokens = [token.text for token in doc]

# Print the tokens
print(tokens)

['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', '.', 'It', 'splits', 'text', 'into', 'words', 'or', 'subwords', '.']
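Both tokenizers above split the sentence into whole words. The subword units mentioned earlier can be illustrated with a greedy longest-match split in the style of WordPiece, where `##` marks a piece that continues a word. This is only a sketch: the function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical, and real subword vocabularies are learned from a corpus rather than written by hand.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matched, so the whole word is treated as unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only for this example.
vocab = {"token", "##ization", "sub", "##word", "##s"}

print(wordpiece_tokenize("tokenization", vocab))
print(wordpiece_tokenize("subwords", vocab))
print(wordpiece_tokenize("xyz", vocab))
```

With this vocabulary, "tokenization" splits into two pieces and "subwords" into three, while a word with no matching pieces falls back to the unknown token.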


## Text Preprocessing
Lowercasing: Converting all text to lowercase standardizes it, so that "Text" and "text" are treated as the same token.

Removing Punctuation: Punctuation marks often do not carry significant meaning and can be removed to focus on the actual words.

Tokenization: We've already covered this step, but tokenization is often considered a part of text preprocessing.

Stopword Removal: Stopwords are common words like "the," "is," "and," etc., that occur frequently in the language but typically don't contribute much to the meaning of the text. Removing them can reduce noise in the data.

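Stemming or Lemmatization: Both reduce inflected forms of a word to a common base, which shrinks and standardizes the vocabulary. Stemming strips suffixes by rule and can produce non-words (for example "essenti"), while lemmatization maps each word to its dictionary form.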

In [6]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download necessary resources (you only need to do this once)
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mw50000150\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
import string

# Input text
text = "Text preprocessing is an essential step! It helps clean and normalize the text data. Preprocessing involves tokenization, stopword removal, stemming, and more."

# Step 1: Convert text to lowercase
text = text.lower()
print(text)

# Step 2: Remove punctuation
text = text.translate(str.maketrans("", "", string.punctuation))
print(text)

# Step 3: Tokenize the text
tokens = word_tokenize(text)
print(tokens)

# Step 4: Remove stopwords
stop_words = set(stopwords.words("english"))
filtered_tokens = [token for token in tokens if token not in stop_words]
print(filtered_tokens)

# Step 5: Stem the tokens (you can also use lemmatization if you prefer)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

# Print the preprocessed tokens
print(stemmed_tokens)

text preprocessing is an essential step! it helps clean and normalize the text data. preprocessing involves tokenization, stopword removal, stemming, and more.
text preprocessing is an essential step it helps clean and normalize the text data preprocessing involves tokenization stopword removal stemming and more
['text', 'preprocessing', 'is', 'an', 'essential', 'step', 'it', 'helps', 'clean', 'and', 'normalize', 'the', 'text', 'data', 'preprocessing', 'involves', 'tokenization', 'stopword', 'removal', 'stemming', 'and', 'more']
['text', 'preprocessing', 'essential', 'step', 'helps', 'clean', 'normalize', 'text', 'data', 'preprocessing', 'involves', 'tokenization', 'stopword', 'removal', 'stemming']
['text', 'preprocess', 'essenti', 'step', 'help', 'clean', 'normal', 'text', 'data', 'preprocess', 'involv', 'token', 'stopword', 'remov', 'stem']
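The first four steps can also be wrapped in one reusable function. The sketch below avoids NLTK so that it runs anywhere: the name `preprocess`, the whitespace split, and the small `STOP_WORDS` set are stand-ins chosen for illustration, and stemming is left out. Note that the order matters: lowercasing comes first because the stopword set is lowercase, and stripping punctuation before splitting turns a contraction like "Don't" into "dont".

```python
import string

# Hypothetical, hand-picked stopword set for illustration only;
# NLTK's English list is much longer.
STOP_WORDS = {"is", "an", "it", "and", "the", "more"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace split stands in for word_tokenize
    return [token for token in tokens if token not in stop_words]

sample = "Text preprocessing is an essential step! It helps clean and normalize the text data."
cleaned = preprocess(sample)
print(cleaned)
print(preprocess("Don't stop!"))
```

To get closer to the cell above, `text.split()` could be swapped for `word_tokenize` and the stopword set for `stopwords.words("english")`.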


### Part-of-speech (POS) tagging
Part-of-speech (POS) tagging is a fundamental task in natural language processing that assigns a grammatical category (a "part of speech") to each word in a given text. The categories include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and more. POS tagging is useful in NLP tasks such as syntactic parsing, named entity recognition, and information extraction.

In [10]:
from nltk import pos_tag
# Download necessary resources (you only need to do this once)
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\mw50000150\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [12]:
# Input text
text = "Noordeen is a great footballer, especially when he is playing in the midfield."

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
pos_tags = pos_tag(tokens)

# Print the part-of-speech tagged tokens
print(pos_tags)

[('Noordeen', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('footballer', 'NN'), (',', ','), ('especially', 'RB'), ('when', 'WRB'), ('he', 'PRP'), ('is', 'VBZ'), ('playing', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('midfield', 'NN'), ('.', '.')]
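The tags in this output are Penn Treebank abbreviations. As a reading aid, the sketch below groups the tagged tokens by tag and prints a short gloss for each. The `pos_tags` list is copied from the output above so the block runs without NLTK; the names `TAG_GLOSS` and `group_by_tag` are hypothetical helpers, and the glossary covers only the tags that occur in this sentence.

```python
from collections import defaultdict

# Tagged tokens copied from the output above.
pos_tags = [('Noordeen', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'),
            ('footballer', 'NN'), (',', ','), ('especially', 'RB'), ('when', 'WRB'),
            ('he', 'PRP'), ('is', 'VBZ'), ('playing', 'VBG'), ('in', 'IN'),
            ('the', 'DT'), ('midfield', 'NN'), ('.', '.')]

# Short glosses for the Penn Treebank tags that occur in this sentence.
TAG_GLOSS = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "RB": "adverb",
    "WRB": "wh-adverb",
    "PRP": "personal pronoun",
    "VBG": "verb, gerund or present participle",
    "IN": "preposition or subordinating conjunction",
}

def group_by_tag(tagged):
    """Collect the words that share each tag, keeping sentence order."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)

groups = group_by_tag(pos_tags)
for tag, words in groups.items():
    # Tags without a gloss here (',' and '.') are punctuation.
    print(f"{tag:4} {TAG_GLOSS.get(tag, 'punctuation'):42} {words}")
```

Grouping like this makes it easy to pull out, for example, all nouns (`groups["NN"]`) for a later step.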


### Named Entity Recognition (NER)
Named Entity Recognition (NER) is a natural language processing task that identifies and classifies named entities in a given text. Named entities are real-world objects that have names, such as people, organizations, locations, and dates. NER is a core component of information extraction systems and is used in applications such as information retrieval, question answering, and sentiment analysis.

In [13]:
from nltk import ne_chunk

# Download necessary resources (you only need to do this once)
nltk.download('maxent_ne_chunker')
nltk.download('words')


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\mw50000150\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\mw50000150\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [14]:
# Input text
text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak on April 1, 1976, in California."

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
pos_tags = pos_tag(tokens)

# Perform Named Entity Recognition
ner_tags = ne_chunk(pos_tags)

# Print the named entities
for chunk in ner_tags:
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(c[0] for c in chunk))

PERSON Apple
ORGANIZATION Inc.
PERSON Steve Jobs
PERSON Steve Wozniak
GPE California


In [15]:
print(ner_tags)

(S
  (PERSON Apple/NNP)
  (ORGANIZATION Inc./NNP)
  was/VBD
  founded/VBN
  by/IN
  (PERSON Steve/NNP Jobs/NNP)
  and/CC
  (PERSON Steve/NNP Wozniak/NNP)
  on/IN
  April/NNP
  1/CD
  ,/,
  1976/CD
  ,/,
  in/IN
  (GPE California/NNP)
  ./.)
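For later processing it is often handy to collect the entities into a dictionary keyed by label. The sketch below does that with a hypothetical helper, `entities_by_label`. So that it runs without NLTK or its data packages, the tree printed above is rebuilt by hand with a small stand-in class, `Chunk`, which imitates the two features of `nltk.Tree` the helper relies on (iteration over `(word, tag)` pairs and a `label()` method). The helper is written so that it should also accept the real `ner_tags` object, though that is an assumption not tested here.

```python
from collections import defaultdict

class Chunk(list):
    """Hypothetical stand-in for nltk.Tree: (word, tag) pairs plus a label."""
    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label

def entities_by_label(tree):
    """Group entity strings by label; plain (word, tag) tuples are skipped."""
    entities = defaultdict(list)
    for chunk in tree:
        if hasattr(chunk, "label"):
            text = " ".join(word for word, _tag in chunk)
            entities[chunk.label()].append(text)
    return dict(entities)

# Hand-built copy of the tree printed above.
sample_tree = [
    Chunk("PERSON", [("Apple", "NNP")]),
    Chunk("ORGANIZATION", [("Inc.", "NNP")]),
    ("was", "VBD"), ("founded", "VBN"), ("by", "IN"),
    Chunk("PERSON", [("Steve", "NNP"), ("Jobs", "NNP")]),
    ("and", "CC"),
    Chunk("PERSON", [("Steve", "NNP"), ("Wozniak", "NNP")]),
    ("on", "IN"), ("April", "NNP"), ("1", "CD"), (",", ","),
    ("1976", "CD"), (",", ","), ("in", "IN"),
    Chunk("GPE", [("California", "NNP")]),
    (".", "."),
]

entities = entities_by_label(sample_tree)
print(entities)
```

Grouping the results also makes the chunker's mistakes on this sentence easy to spot: "Apple" is labeled PERSON, "Inc." is split off as a separate ORGANIZATION, and the date "April 1, 1976" is not marked as an entity at all.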
