### Part 1: Tokenization and Normalization

- Import the nltk module
- Select a text file or excerpt to work with (e.g. a paragraph from a book, a news article, etc.)
- Tokenize the text using nltk's word_tokenize() and sent_tokenize() functions
- Normalize the tokens by converting them to lowercase and stemming them with nltk's PorterStemmer

In [14]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer

# Download necessary NLTK resources
nltk.download('punkt')

# Select a text excerpt
text = "Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation."

# Tokenize by word and sentence
tokens_word = word_tokenize(text)
tokens_sentence = sent_tokenize(text)

# Normalize tokens by converting to lowercase
tokens_word_lower = [word.lower() for word in tokens_word]

# Stemming using PorterStemmer
porter = PorterStemmer()
tokens_stemmed = [porter.stem(word) for word in tokens_word_lower]

# Print the results
print("Original Text:")
print(text)
print("\nTokenization by Word:")
print(tokens_word)
print("\nTokenization by Sentence:")
print(tokens_sentence)
print("\nNormalized Tokens (Lowercase and Stemmed):")
print(tokens_stemmed)


Original Text:
Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.

Tokenization by Word:
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'an', 'interdisciplinary', 'subfie

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\karth\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
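
The word-level output above keeps punctuation such as '(' and ')' as separate tokens, and these then pass through lowercasing and stemming unchanged. One possible extra normalization step, sketched below with the standard library only, is to drop tokens that contain no letters or digits before stemming. The helper name drop_punctuation and the variable tokens_clean are illustrative and not part of nltk; the token list is copied from the start of the output shown above.

```python
# Tokens copied from the start of the word-level output above.
tokens_word = ['Natural', 'language', 'processing', '(', 'NLP', ')',
               'is', 'an', 'interdisciplinary']


def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit.

    Hypothetical helper for illustration; hyphenated words such as
    'rule-based' are kept because they contain letters.
    """
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


# Filter first, then lowercase, mirroring the normalization step.
tokens_clean = [tok.lower() for tok in drop_punctuation(tokens_word)]
print(tokens_clean)
```

Whether punctuation should be removed depends on the task: for part-of-speech tagging it is usually better to keep it, so this filter would apply only to the normalized copy of the tokens.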


### Part 2: Part-of-Speech Tagging

- Use nltk's pos_tag() function to tag each tokenized word with its part-of-speech
- Examine the tagged words and take note of any patterns, surprises, or errors you see

In [15]:
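# Part-of-Speech Tagging
# Note: pos_tag() needs the tagger model; if it is missing, run
# nltk.download('averaged_perceptron_tagger') first (newer NLTK releases
# may name this resource 'averaged_perceptron_tagger_eng').
pos_tags = nltk.pos_tag(tokens_word)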

# Print the results
print("Original Text:")
print(text)
print("\nPart-of-Speech Tags:")
print(pos_tags)


Original Text:
Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.

Part-of-Speech Tags:
[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', 
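
A long list of (word, tag) pairs is hard to scan for patterns. As a rough aid for the "examine the tagged words" step, the sketch below counts how often each tag occurs and groups words by tag, using only the standard library. The sample pairs are copied from the start of the output above; tag_counts and words_by_tag are illustrative names, and on the full pos_tags list the counts will of course differ.

```python
from collections import Counter, defaultdict

# Pairs copied from the start of the tagged output above.
pos_tags = [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'),
            ('(', '('), ('NLP', 'NNP')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in pos_tags)

# Which words received each tag?
words_by_tag = defaultdict(list)
for word, tag in pos_tags:
    words_by_tag[tag].append(word)

print(tag_counts.most_common())
print(dict(words_by_tag))
```

Reading the words grouped under a single tag makes unexpected assignments easier to spot than reading the pairs in sentence order.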

### Part 3: N-Grams

- Use ngrams() to create a series of bi-grams and tri-grams from the tokenized text
- Calculate the frequency distribution of the n-grams using FreqDist()
- Identify the most common bi-grams and tri-grams
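
To make the idea concrete: an n-gram is a window of n consecutive tokens that slides one position at a time, and a frequency distribution is a count of how often each window occurs. The sketch below reproduces this with the standard library only. It is an illustration of the concept, not nltk's implementation; make_ngrams is a hypothetical helper, and the token list is a toy example rather than the excerpt used in this notebook.

```python
from collections import Counter


def make_ngrams(tokens, n):
    """Return all windows of n consecutive tokens as tuples.

    Hypothetical helper for illustration; zip stops at the shortest
    shifted copy, so incomplete windows at the end are dropped.
    """
    return list(zip(*(tokens[i:] for i in range(n))))


# Toy token list (an assumption for illustration, not notebook data).
tokens = ['natural', 'language', 'processing', 'uses',
          'natural', 'language', 'data']

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

# Counter plays the role of a frequency distribution here.
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts.most_common(1))
```

A list of k tokens yields k - n + 1 n-grams, so the seven toy tokens give six bi-grams and five tri-grams. In a short excerpt most n-grams occur only once, which is worth keeping in mind when reading the "most common" results.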

In [16]:
from nltk.util import ngrams
from nltk.probability import FreqDist

# Create bi-grams and tri-grams
bi_grams = list(ngrams(tokens_word, 2))
tri_grams = list(ngrams(tokens_word, 3))

# Calculate frequency distribution
fd_bi_grams = FreqDist(bi_grams)
fd_tri_grams = FreqDist(tri_grams)

# Identify most common bi-grams and tri-grams
common_bi_grams = fd_bi_grams.most_common(5)
common_tri_grams = fd_tri_grams.most_common(5)

# Print the results
print("\nBi-Grams:")
print(bi_grams)
print("\nTri-Grams:")
print(tri_grams)
print("\nMost Common Bi-Grams:")
print(common_bi_grams)
print("\nMost Common Tri-Grams:")
print(common_tri_grams)



Bi-Grams:
[('Natural', 'language'), ('language', 'processing'), ('processing', '('), ('(', 'NLP'), ('NLP', ')'), (')', 'is'), ('is', 'an'), ('an', 'interdisciplinary'), ('interdisciplinary', 'subfield'), ('subfield', 'of'), ('of', 'computer'), ('computer', 'science'), ('science', 'and'), ('and', 'linguistics'), ('linguistics', '.'), ('.', 'It'), ('It', 'is'), ('is', 'primarily'), ('primarily', 'concerned'), ('concerned', 'with'), ('with', 'giving'), ('giving', 'computers'), ('computers', 'the'), ('the', 'ability'), ('ability', 'to'), ('to', 'support'), ('support', 'and'), ('and', 'manipulate'), ('manipulate', 'human'), ('human', 'language'), ('language', '.'), ('.', 'It'), ('It', 'involves'), ('involves', 'processing'), ('processing', 'natural'), ('natural', 'language'), ('language', 'datasets'), ('datasets', ','), (',', 'such'), ('such', 'as'), ('as', 'text'), ('text', 'corpora'), ('corpora', 'or'), ('or', 'speech'), ('speech', 'corpora'), ('corpora', ','), (',', 'using'), ('using', 