
## Import Data

Python offers standard-library and third-party modules for reading many kinds of files. This cell downloads a plain-text file and reads it with the built-in open() function.

In [1]:
import os

url = 'https://raw.githubusercontent.com/dearbharat/NLP/main/Course-Description.txt'

#Download the file into the working directory ({url} expands the Python variable)
! wget {url} -O ./Course-Description.txt


#Read the file using the standard Python library
with open(os.path.join(os.getcwd(), "Course-Description.txt"), 'r') as fh:
    filedata = fh.read()

#Print the first 200 characters of the file
print("Data read from file : ", filedata[0:200] )

--2023-05-06 13:40:08--  https://raw.githubusercontent.com/dearbharat/NLP/main/Course-Description.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 640 [text/plain]
Saving to: ‘./Course-Description.txt’


2023-05-06 13:40:08 (22.9 MB/s) - ‘./Course-Description.txt’ saved [640/640]

Data read from file :  In order to construct data pipelines and networks that stream, process, and store data, data engineers and data-science DevOps specialists must understand how to combine multiple big data technologies
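
The cell above relies on the platform's default text encoding. As a minimal sketch of an alternative, the snippet below reads a file with pathlib and names the encoding explicitly. The file name and contents are placeholders created on the spot so the example runs without a download; they are not the course data.

```python
import tempfile
from pathlib import Path

# Write a small stand-in file so the sketch runs without a network download.
# The file name and contents are placeholders, not the course data.
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample.txt"
sample_path.write_text("Data pipelines stream, process, and store data.", encoding="utf-8")

# Read the whole file in one call, naming the encoding explicitly
sample_text = sample_path.read_text(encoding="utf-8")

# Print the first 20 characters of the file
print("Data read from file : ", sample_text[:20])
```

Naming the encoding is a design choice that should make the read behave the same way across operating systems.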


## Reading using NLTK CorpusReader

Read the same text file using a corpus reader.

NLTK provides several corpus readers for different types of data source. Details are available at http://www.nltk.org/howto/corpus.html


In [2]:
#Install NLTK first if needed: pip install nltk
import nltk
#Download the punkt package, which other commands depend on
nltk.download('punkt')

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

#Read the file into a corpus. The same command can read an entire directory
corpus=PlaintextCorpusReader(os.getcwd(),"Course-Description.txt")

#Print raw contents of the corpus
print(corpus.raw())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In order to construct data pipelines and networks that stream, process, and store data, data engineers and data-science DevOps specialists must understand how to combine multiple big data technologies. In this course, discover how to build big data pipelines around Apache Spark. Join Kumaran Ponnambalam as he takes you through how to make Apache Spark work with other big data technologies. He covers the basics of Apache Kafka Connect and how to integrate it with Spark for real-time streaming. In addition, he demonstrates how to use the various technologies to construct an end-to-end project that solves a real-world business problem.
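
The comment in the cell above notes that the same reader can load an entire directory. A minimal sketch of that, assuming two tiny stand-in files (hypothetical names a.txt and b.txt) written to a temporary directory, might look like this. It only calls fileids() and words(), which should not require any downloaded NLTK data.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Hypothetical stand-in files, created only so the sketch is self-contained
corpus_dir = Path(tempfile.mkdtemp())
(corpus_dir / "a.txt").write_text("Big data pipelines.", encoding="utf-8")
(corpus_dir / "b.txt").write_text("Apache Spark streaming.", encoding="utf-8")

# A string passed as the second argument is treated as a regular expression
# over file names, so this pattern picks up every .txt file in the directory
multi_corpus = PlaintextCorpusReader(str(corpus_dir), r".*\.txt")

print("Files in this corpus : ", multi_corpus.fileids())
for fileid in multi_corpus.fileids():
    # words() accepts a file ID to restrict the result to one document
    print(fileid, list(multi_corpus.words(fileid)))
```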


## Exploring the Corpus

The corpus reader provides methods to extract words, sentences, and paragraphs from the corpus.

In [3]:
#Extract the file IDs from the corpus
print("Files in this corpus : ", corpus.fileids())

#Extract paragraphs from the corpus
paragraphs=corpus.paras()
print("\n Total paragraphs in this corpus : ", len(paragraphs))

#Extract sentences from the corpus
sentences=corpus.sents()
print("\n Total sentences in this corpus : ", len(sentences))
print("\n The first sentence : ", sentences[0])
print("\n The last sentence : ", sentences[-1])

#Extract words from the corpus
print("\n Words in this corpus : ",corpus.words() )

Files in this corpus :  ['Course-Description.txt']

 Total paragraphs in this corpus :  1

 Total sentences in this corpus :  5

 The first sentence :  ['In', 'order', 'to', 'construct', 'data', 'pipelines', 'and', 'networks', 'that', 'stream', ',', 'process', ',', 'and', 'store', 'data', ',', 'data', 'engineers', 'and', 'data', '-', 'science', 'DevOps', 'specialists', 'must', 'understand', 'how', 'to', 'combine', 'multiple', 'big', 'data', 'technologies', '.']

 The last sentence :  ['In', 'addition', ',', 'he', 'demonstrates', 'how', 'to', 'use', 'the', 'various', 'technologies', 'to', 'construct', 'an', 'end', '-', 'to', '-', 'end', 'project', 'that', 'solves', 'a', 'real', '-', 'world', 'business', 'problem', '.']

 Words in this corpus :  ['In', 'order', 'to', 'construct', 'data', 'pipelines', ...]


## Analyze the Corpus

NLTK provides functions for computing distributions and aggregates over the corpus, such as word frequencies.

In [None]:
#Find the frequency distribution of words in the corpus
course_freq_dist=nltk.FreqDist(corpus.words())

#Print most commonly used words
print("Top 10 words in the corpus : ", course_freq_dist.most_common(10))

#Find the frequency of a specific word
print("\n Distribution for \"Spark\" : ",course_freq_dist.get("Spark"))

Top 10 words in the corpus :  [('to', 8), ('data', 7), (',', 5), ('-', 5), ('how', 5), ('.', 5), ('and', 4), ('In', 3), ('big', 3), ('technologies', 3)]

 Distribution for "Spark" :  3
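
nltk.FreqDist is a subclass of collections.Counter, so the counting itself can be sketched with the standard library alone. The example below uses the tokens of the first sentence shown earlier and also derives a relative frequency; the variable names are illustrative only.

```python
from collections import Counter

# Tokens of the first sentence, copied from the corpus output above
first_sentence = ['In', 'order', 'to', 'construct', 'data', 'pipelines', 'and',
                  'networks', 'that', 'stream', ',', 'process', ',', 'and', 'store',
                  'data', ',', 'data', 'engineers', 'and', 'data', '-', 'science',
                  'DevOps', 'specialists', 'must', 'understand', 'how', 'to',
                  'combine', 'multiple', 'big', 'data', 'technologies', '.']

# Count how often each token occurs
counts = Counter(first_sentence)
total = sum(counts.values())

# Relative frequency: occurrences of a token divided by the total token count
relative_data = counts["data"] / total

print("Most common token : ", counts.most_common(1))
print("Relative frequency of 'data' : ", round(relative_data, 3))
```

Note that punctuation and stop words are still counted here, which is why the later sections clean the tokens before further analysis.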


# Text Cleansing and Extraction
## Tokenization

Tokenization refers to converting a text string into individual tokens. Tokens may be words or punctuation marks.

In [None]:
import nltk
import os


#Read the base file into a raw text variable
with open(os.path.join(os.getcwd(), "Course-Description.txt"), 'rt') as base_file:
    raw_text = base_file.read()

#Extract tokens
token_list = nltk.word_tokenize(raw_text)
print("Token List : ",token_list[:20])
print("\n Total Tokens : ",len(token_list))

Token List :  ['In', 'order', 'to', 'construct', 'data', 'pipelines', 'and', 'networks', 'that', 'stream', ',', 'process', ',', 'and', 'store', 'data', ',', 'data', 'engineers', 'and']

 Total Tokens :  110
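
To make the idea of tokenization concrete, here is a rough regex-based sketch. It is not the algorithm that nltk.word_tokenize uses; the function name simple_tokenize and the pattern are assumptions chosen for illustration. The pattern keeps hyphenated words together and emits each punctuation mark as its own token.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens (keeping internal hyphens) and single punctuation marks."""
    # First alternative: a run of word characters, optionally joined by hyphens
    # Second alternative: any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

sample = "He covers real-time streaming, end-to-end."
sample_tokens = simple_tokenize(sample)
print("Token List : ", sample_tokens)
print("Total Tokens : ", len(sample_tokens))
```

Different tokenizers make different choices about hyphens: in this notebook the corpus reader splits 'data-science' into three tokens, while nltk.word_tokenize keeps it as one.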


## Cleansing Text

The examples below remove punctuation and convert tokens to lower case.

#### Remove Punctuation

In [None]:
#Keep only the tokens that Punkt classifies as non-punctuation
token_list2 = list(filter(lambda token: nltk.tokenize.punkt.PunktToken(token).is_non_punct, token_list))
print("Token List after removing punctuation : ",token_list2[:20])
print("\nTotal tokens after removing punctuation : ", len(token_list2))

Token List after removing punctuation :  ['In', 'order', 'to', 'construct', 'data', 'pipelines', 'and', 'networks', 'that', 'stream', 'process', 'and', 'store', 'data', 'data', 'engineers', 'and', 'data-science', 'DevOps', 'specialists']

Total tokens after removing punctuation :  100
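
An alternative that avoids the Punkt helper is to test each token against string.punctuation. This is only a sketch under the assumption that the text uses ASCII punctuation; it may classify some tokens differently from PunktToken. The helper name is_punctuation and the short token list are illustrative.

```python
import string

# A short illustrative token list in the style of the tokenizer output above
tokens = ['that', 'stream', ',', 'process', ',', 'and', 'store', 'data', ',',
          'data-science', '.']

def is_punctuation(token):
    """True when every character of the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)

# Keep tokens that contain at least one non-punctuation character
words_only = [t for t in tokens if not is_punctuation(t)]

print("Token List after removing punctuation : ", words_only)
print("Total tokens after removing punctuation : ", len(words_only))
```

Because the test looks at every character, a hyphenated word such as 'data-science' is kept while a lone ',' or '.' is dropped.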


#### Convert to Lower Case

In [None]:
token_list3=[word.lower() for word in token_list2 ]
print("Token list after converting to lower case : ", token_list3[:20])
print("\nTotal tokens after converting to lower case : ", len(token_list3))

Token list after converting to lower case :  ['in', 'order', 'to', 'construct', 'data', 'pipelines', 'and', 'networks', 'that', 'stream', 'process', 'and', 'store', 'data', 'data', 'engineers', 'and', 'data-science', 'devops', 'specialists']

Total tokens after converting to lower case :  100


## Stop word Removal

Remove stop words using the standard English stop word list available in NLTK.

In [None]:
#Download the standard stopword list
nltk.download('stopwords')
from nltk.corpus import stopwords

#Remove stopwords (build the set once for fast lookups)
english_stopwords = set(stopwords.words('english'))
token_list4 = [token for token in token_list3 if token not in english_stopwords]
print("Token list after removing stop words : ", token_list4[:20])
print("\nTotal tokens after removing stop words : ", len(token_list4))

Token list after removing stop words :  ['order', 'construct', 'data', 'pipelines', 'networks', 'stream', 'process', 'store', 'data', 'data', 'engineers', 'data-science', 'devops', 'specialists', 'must', 'understand', 'combine', 'multiple', 'big', 'data']

Total tokens after removing stop words :  62


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Stemming

In [None]:
#Use the PorterStemmer library for stemming.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

#Stem data
token_list5 = [stemmer.stem(word) for word in token_list4 ]
print("Token list after stemming : ", token_list5[:20])
print("\nTotal tokens after Stemming : ", len(token_list5))

Token list after stemming :  ['order', 'construct', 'data', 'pipelin', 'network', 'stream', 'process', 'store', 'data', 'data', 'engin', 'data-sci', 'devop', 'specialist', 'must', 'understand', 'combin', 'multipl', 'big', 'data']

Total tokens after Stemming :  62
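
Stemming strips suffixes by rule, so several inflected forms collapse to one stem that need not be a dictionary word. As a small sketch of that effect, the snippet below stems a hand-picked family of words; the word list is an illustrative choice, not taken from the course text.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A hand-picked family of related word forms (illustrative, not from the course file)
variants = ["connect", "connected", "connecting", "connection", "connections"]

# Map each surface form to its stem
stems = {word: stemmer.stem(word) for word in variants}

for word, stem in stems.items():
    print(word, "->", stem)

# Number of distinct stems the family was reduced to
print("Distinct stems : ", len(set(stems.values())))
```

Collapsing variants this way shrinks the vocabulary, at the cost of stems such as 'pipelin' or 'engin' that are harder to read.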


## Lemmatization

In [None]:
#Use the wordnet library to map words to their lemmatized form
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
token_list6 = [lemmatizer.lemmatize(word) for word in token_list4 ]
print("Token list after Lemmatization : ", token_list6[:20])
print("\nTotal tokens after Lemmatization : ", len(token_list6))

Token list after Lemmatization :  ['order', 'construct', 'data', 'pipeline', 'network', 'stream', 'process', 'store', 'data', 'data', 'engineer', 'data-science', 'devops', 'specialist', 'must', 'understand', 'combine', 'multiple', 'big', 'data']

Total tokens after Lemmatization :  62


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### Comparison of raw, stemmed, and lemmatized tokens

In [None]:
#Compare the three forms of the token "technologies"
print( "Raw : ", token_list4[20]," , Stemmed : ", token_list5[20], " , Lemmatized : ", token_list6[20])


Raw :  technologies  , Stemmed :  technolog  , Lemmatized :  technology


# Advanced Text Processing

In [None]:
#Prepare data

import nltk
import os
#Download the punkt package, which other commands depend on
nltk.download('punkt')

#Read the base file into a raw text variable
with open(os.path.join(os.getcwd(), "Course-Description.txt"), 'rt') as base_file:
    raw_text = base_file.read()

#Repeat the pre-processing from the Text Cleansing and Extraction section
token_list = nltk.word_tokenize(raw_text)

token_list2 = list(filter(lambda token: nltk.tokenize.punkt.PunktToken(token).is_non_punct, token_list))

token_list3=[word.lower() for word in token_list2 ]

nltk.download('stopwords')
from nltk.corpus import stopwords
english_stopwords = set(stopwords.words('english'))
token_list4 = [token for token in token_list3 if token not in english_stopwords]

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
token_list6 = [lemmatizer.lemmatize(word) for word in token_list4 ]

print("\n Total Tokens : ",len(token_list6))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



 Total Tokens :  62


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Build ngrams

In [None]:
from nltk.util import ngrams
from collections import Counter

#Find bigrams and print the most common 5
bigrams = ngrams(token_list6,2)
print("Most common bigrams : ")
print(Counter(bigrams).most_common(5))

#Find trigrams and print the most common 5
trigrams = ngrams(token_list6,3)
print(" \n Most common trigrams : " )
print(Counter(trigrams).most_common(5))

Most common bigrams : 
[(('big', 'data'), 3), (('data', 'pipeline'), 2), (('data', 'technology'), 2), (('apache', 'spark'), 2), (('order', 'construct'), 1)]
 
 Most common trigrams : 
[(('big', 'data', 'technology'), 2), (('order', 'construct', 'data'), 1), (('construct', 'data', 'pipeline'), 1), (('data', 'pipeline', 'network'), 1), (('pipeline', 'network', 'stream'), 1)]
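
An n-gram is a window of n consecutive tokens. As a sketch of what nltk.util.ngrams produces, the same windows can be built in plain Python by zipping the token list with shifted copies of itself. The helper name make_ngrams and the short token list are assumptions for illustration.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return the list of n-token tuples found by sliding a window over tokens."""
    # zip stops at the shortest shifted copy, so incomplete windows are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

# A short illustrative token list (not the full course text)
tokens = ['big', 'data', 'pipeline', 'apache', 'spark', 'big', 'data', 'technology']

bigram_counts = Counter(make_ngrams(tokens, 2))
trigram_list = make_ngrams(tokens, 3)

print("Most common bigram : ", bigram_counts.most_common(1))
print("Number of trigrams : ", len(trigram_list))
```

A list of k tokens yields k - n + 1 n-grams, which is why the eight tokens here give seven bigrams and six trigrams.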


## Parts-of-Speech Tagging

Some examples of part-of-speech tag abbreviations (Penn Treebank tagset):
NN : noun, singular or mass
NNS : noun, plural
VBP : verb, non-3rd person singular present

In [None]:
#download the tagger package
nltk.download('averaged_perceptron_tagger')

#Tag and print the first 10 tokens
nltk.pos_tag(token_list4)[:10]

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('order', 'NN'),
 ('construct', 'NN'),
 ('data', 'NNS'),
 ('pipelines', 'NNS'),
 ('networks', 'NNS'),
 ('stream', 'VBP'),
 ('process', 'NN'),
 ('store', 'NN'),
 ('data', 'NNS'),
 ('data', 'NNS')]
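
The tagger returns a list of (word, tag) tuples, which is easy to summarize with the standard library. The sketch below takes the ten tagged tokens printed above and counts the tags and groups the words by tag; the variable names are illustrative.

```python
from collections import Counter, defaultdict

# The ten tagged tokens copied from the output above
tagged = [('order', 'NN'), ('construct', 'NN'), ('data', 'NNS'),
          ('pipelines', 'NNS'), ('networks', 'NNS'), ('stream', 'VBP'),
          ('process', 'NN'), ('store', 'NN'), ('data', 'NNS'), ('data', 'NNS')]

# Count how many tokens received each tag
tag_counts = Counter(tag for _, tag in tagged)

# Group the words under their tag, preserving order of appearance
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print("Tag counts : ", dict(tag_counts))
print("Words tagged NN : ", words_by_tag['NN'])
```

Seeing 'construct', 'process' and 'store' grouped under NN suggests that tagging after stop word removal leaves the tagger with less sentence context to work from.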

## Building TF-IDF matrix

In [None]:
#Use scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

#Use a small corpus for easy visualization
vector_corpus = [
    'NBA is a Basketball league',
    'Basketball is popular in America.',
    'TV in America telecast BasketBall.',
]

#Create a vectorizer that drops English stop words
vectorizer = TfidfVectorizer(stop_words='english')

#Fit the vectorizer and build the TF-IDF matrix
tfidf=vectorizer.fit_transform(vector_corpus)

print("Tokens used as features are : ")
print(vectorizer.get_feature_names_out())

print("\n Size of array. Each row represents a document. Each column represents a feature/token")
print(tfidf.shape)

print("\n Actual TF-IDF array")
tfidf.toarray()


Tokens used as features are : 
['america' 'basketball' 'league' 'nba' 'popular' 'telecast' 'tv']

 Size of array. Each row represents a document. Each column represents a feature/token
(3, 7)

 Actual TF-IDF array


array([[0.        , 0.38537163, 0.65249088, 0.65249088, 0.        ,
        0.        , 0.        ],
       [0.54783215, 0.42544054, 0.        , 0.        , 0.72033345,
        0.        , 0.        ],
       [0.44451431, 0.34520502, 0.        , 0.        , 0.        ,
        0.5844829 , 0.5844829 ]])
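
To see where these numbers come from, the matrix can be rebuilt by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The stop word set is a hand-written assumption that lists only the stop words occurring in these three sentences, not scikit-learn's full list.

```python
import math
import re

docs = [
    'NBA is a Basketball league',
    'Basketball is popular in America.',
    'TV in America telecast BasketBall.',
]

# Assumption: only the stop words that occur in these sentences are listed
stop_words = {"is", "a", "in"}

def tokenize(doc):
    """Lowercase, keep tokens of two or more word characters, drop stop words."""
    return [t for t in re.findall(r"\b\w\w+\b", doc.lower()) if t not in stop_words]

tokenized = [tokenize(d) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: the number of documents that contain each term
df = {term: sum(term in doc for doc in tokenized) for term in vocab}
# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

rows = []
for doc in tokenized:
    raw = [doc.count(term) * idf[term] for term in vocab]  # term count times idf
    norm = math.sqrt(sum(v * v for v in raw))              # L2 norm of the row
    rows.append([v / norm for v in raw])

print(vocab)
for row in rows:
    print([round(v, 8) for v in row])
```

'basketball' appears in all three documents, so its idf is the minimum value of 1, which is why it carries the smallest non-zero weight in every row.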