#Experiment - 7: Perform the steps involved in Text Analytics in Python & R


Tasks to be performed:
Explore Top-5 Text Analytics Libraries in Python (w.r.t Features & Applications)

Explore Top-5 Text Analytics Libraries in R (w.r.t Features & Applications)

Perform the following experiments using Python & R
Tokenization (Sentence & Word)

Frequency Distribution

Remove stopwords & punctuation

Lexicon Normalization (Stemming, Lemmatization)

Part of Speech tagging

Named Entity Recognition

Scrape data from a website

Prepare a document with the Aim, Tasks performed, Program, Output, and Conclusion.

#Python
**NLTK (Natural Language Toolkit)**:

NLTK is a powerful library for working with human language data and performing various tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, and named entity recognition.

**spaCy:**

spaCy is an open-source library for advanced natural language processing tasks. It provides pre-trained models for various languages and is known for its efficiency in terms of speed and memory usage.

**TextBlob:**

TextBlob is a simple and easy-to-use library for processing textual data. It offers a high-level interface for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

**Gensim:**

Gensim is a library for topic modeling and document similarity analysis. It is often used for unsupervised learning on large text corpora, providing implementations for algorithms like Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).

**BeautifulSoup (for Web Scraping):**

BeautifulSoup is a library for web scraping. It pulls data out of HTML and XML files and is widely used together with the requests library to fetch and parse web content.

Tokenization (Sentence & Word)

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# sent_tokenize splits text into sentences
from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to college"
sent_tokenize(text)

['Hello everyone.', 'Welcome to college']

In [None]:
#tokenizing other languages

import nltk.data

spanish_tokenizer=nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)

['Hola amigo.', 'Estoy bien.']

In [None]:
# word_tokenize splits text into word and punctuation tokens
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to college"
word_tokenize(text)

['Hello', 'everyone', '.', 'Welcome', 'to', 'college']
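
To make the idea behind the two tokenizers concrete, here is a minimal standard-library sketch. It is not NLTK's Punkt algorithm: the function names `simple_sent_tokenize` and `simple_word_tokenize` and the regular expressions are assumptions made for illustration only.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hello everyone. Welcome to college"
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sample)
print(sentences)
print(words)
```

A heuristic like this breaks on abbreviations: `simple_sent_tokenize("Dr. Smith arrived.")` splits after "Dr.". Handling such cases is one reason to prefer a trained tokenizer such as the one behind `sent_tokenize`.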

Frequency Distribution

In [None]:
from nltk.probability import FreqDist

text = "insights : Text analytics is the process of analyzing unstructured text data for useful insights and patterns."

words=word_tokenize(text)

fdist=FreqDist(words)

print("Word\tFrequency")
print("----------------")
for word,frequency in fdist.items():
  print(f"{word}\t{frequency}")

Word	Frequency
----------------
insights	2
:	1
Text	1
analytics	1
is	1
the	1
process	1
of	1
analyzing	1
unstructured	1
text	1
data	1
for	1
useful	1
and	1
patterns	1
.	1
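
In the table above, "Text" and "text" are counted separately and punctuation gets its own rows. `FreqDist` is a subclass of `collections.Counter`, so the same counting can be sketched with the standard library alone. The lowercasing and the regex tokenizer below are assumptions of this sketch, not part of the experiment's code.

```python
import re
from collections import Counter

text = "insights : Text analytics is the process of analyzing unstructured text data for useful insights and patterns."

# Stand-in tokenizer (an assumption for this sketch): lowercase, keep alphabetic runs only.
tokens = re.findall(r"[a-z]+", text.lower())

counts = Counter(tokens)

# most_common(n) returns the n highest counts; FreqDist offers the same method.
for word, frequency in counts.most_common(3):
    print(f"{word}\t{frequency}")
```

After case-folding, "text" is counted twice, the same as "insights".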


In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def remove_stopwords_and_punctuation(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Get English stopwords
    stop_words = set(stopwords.words('english'))

    # Define punctuation characters
    punctuations = set(string.punctuation)

    # Remove stopwords and punctuation
    clean_tokens = [token for token in tokens if token.lower() not in stop_words and token not in punctuations]

    # Reconstruct the text without stopwords and punctuation
    clean_text = ' '.join(clean_tokens)

    return clean_text


cleaned_text = remove_stopwords_and_punctuation(text)
print("Cleaned text:", cleaned_text)

Cleaned text: insights Text analytics process analyzing unstructured text data useful insights patterns
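
One caveat about the filter above: `token not in punctuations` compares the whole token against a set of single characters, so a multi-character token such as "..." is not removed. A minimal sketch of a stricter check follows. The helper name `is_punctuation` and the token list are hypothetical and hand-written for illustration.

```python
import string

PUNCTUATION = set(string.punctuation)


def is_punctuation(token):
    """True when every character of the token is an ASCII punctuation mark."""
    return bool(token) and all(ch in PUNCTUATION for ch in token)


# Hand-written token list (illustrative, not tokenizer output).
tokens = ["Wait", "...", "``", "really", "''", "?", "U.S."]
kept = [tok for tok in tokens if not is_punctuation(tok)]
print(kept)
```

Tokens that mix letters and punctuation, such as "U.S.", are kept because not every character is punctuation.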


In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Stemming strips affixes (mostly suffixes) by rule and can produce truncated forms that are not real words, for example 'analyt'. Lemmatization instead relies on dictionary lookup and morphological analysis, ideally informed by the word's part of speech, so the result is a valid dictionary form (the lemma). This makes lemmatization more accurate but also more computationally intensive than stemming. Note that NLTK's WordNetLemmatizer treats every word as a noun unless a pos argument is passed, which is why a verb form such as 'analyzing' comes back unchanged when it is called with the default.







In [None]:
#tokens is the tokenized text
tokens = word_tokenize(text)
# Initialize Porter Stemmer and WordNet Lemmatizer
porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

# Perform stemming
stemmed_words = [porter_stemmer.stem(word) for word in tokens]

# Perform lemmatization
lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in tokens]

print("Original text:", text)
print("Stemmed words:", stemmed_words)
print("Lemmatized words:", lemmatized_words)

Original text: insights : Text analytics is the process of analyzing unstructured text data for useful insights and patterns.
Stemmed words: ['insight', ':', 'text', 'analyt', 'is', 'the', 'process', 'of', 'analyz', 'unstructur', 'text', 'data', 'for', 'use', 'insight', 'and', 'pattern', '.']
Lemmatized words: ['insight', ':', 'Text', 'analytics', 'is', 'the', 'process', 'of', 'analyzing', 'unstructured', 'text', 'data', 'for', 'useful', 'insight', 'and', 'pattern', '.']
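
The difference between the two outputs can be sketched with a toy example: a stemmer applies suffix rules blindly, while a lemmatizer looks words up. The suffix list and the tiny dictionary below are invented for this sketch and are far simpler than the Porter algorithm or WordNet.

```python
# Toy rules and a toy dictionary, invented for this sketch only.
SUFFIXES = ("ing", "ed", "s")
LEMMA_DICT = {"analyzing": "analyze", "insights": "insight", "went": "go"}


def toy_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMA_DICT.get(word, word)


for word in ["analyzing", "insights", "went"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based function turns "analyzing" into the non-word "analyz" and cannot relate "went" to "go", while the lookup handles both but only for words present in its dictionary.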


The Penn Treebank scheme uses the following part-of-speech (POS) tags:

CC - Coordinating conjunction,
CD - Cardinal number,
DT - Determiner,
EX - Existential there,
FW - Foreign word,
IN - Preposition or subordinating conjunction,
JJ - Adjective,
JJR - Adjective, comparative,
JJS - Adjective, superlative,
LS - List item marker,
MD - Modal,
NN - Noun, singular or mass,
NNS - Noun, plural,
NNP - Proper noun, singular,
NNPS - Proper noun, plural,
PDT - Predeterminer,
POS - Possessive ending,
PRP - Personal pronoun,
PRP$ - Possessive pronoun,
RB - Adverb,
RBR - Adverb, comparative,
RBS - Adverb, superlative,
RP - Particle,
SYM - Symbol,
TO - to,
UH - Interjection,
VB - Verb, base form,
VBD - Verb, past tense,
VBG - Verb, gerund or present participle,
VBN - Verb, past participle,
VBP - Verb, non-3rd person singular present,
VBZ - Verb, 3rd person singular present,
WDT - Wh-determiner,
WP - Wh-pronoun,
WP$ - Possessive wh-pronoun,
WRB - Wh-adverb
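
One practical use of these tags is to tell a lemmatizer which part of speech a word has. WordNet uses single-letter codes such as 'n', 'v', 'a' and 'r', so a small mapping is needed. The helper name `penn_to_wordnet` and the choice to fall back to noun are assumptions of this sketch.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag


# (word, tag) pairs of the kind produced by a POS tagger.
tagged = [("is", "VBZ"), ("analyzing", "VBG"), ("useful", "JJ"), ("insights", "NNS")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

With NLTK available, the result can be passed along as `WordNetLemmatizer().lemmatize(word, pos=penn_to_wordnet(tag))`.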





In [None]:
nltk.download('averaged_perceptron_tagger')  # needed for pos_tag

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
from nltk.chunk import ne_chunk  # needed for named entity recognition
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
#parts of speech tagging
tokens = word_tokenize(text)

# Perform Part of Speech tagging
pos_tags = nltk.pos_tag(tokens)

# Print the tagged words with their POS
print(pos_tags)



[('insights', 'NNS'), (':', ':'), ('Text', 'NN'), ('analytics', 'NNS'), ('is', 'VBZ'), ('the', 'DT'), ('process', 'NN'), ('of', 'IN'), ('analyzing', 'VBG'), ('unstructured', 'JJ'), ('text', 'NN'), ('data', 'NNS'), ('for', 'IN'), ('useful', 'JJ'), ('insights', 'NNS'), ('and', 'CC'), ('patterns', 'NNS'), ('.', '.')]


In [None]:
from nltk.tag import pos_tag
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."

# Tokenize the text
tokens = word_tokenize(text)

# Tag the tokens with part-of-speech tags
tagged_tokens = pos_tag(tokens)

# Perform named entity recognition
named_entities = ne_chunk(tagged_tokens)

# Print the named entities
for entity in named_entities:
    if isinstance(entity, nltk.Tree):
        print(" ".join([word for word, tag in entity]), "-", entity.label())

Barack - PERSON
Obama - PERSON
Hawaii - GPE
United States - GPE
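
For further processing it is often handier to group the recognized entities by label. A minimal sketch follows, using the (text, label) pairs from the printed output above. The variable names are illustrative assumptions.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the printed output above.
entities = [
    ("Barack", "PERSON"),
    ("Obama", "PERSON"),
    ("Hawaii", "GPE"),
    ("United States", "GPE"),
]

by_label = defaultdict(list)
for entity_text, label in entities:
    by_label[label].append(entity_text)

for label, names in by_label.items():
    print(f"{label}: {', '.join(names)}")
```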


In [1]:
# Scraping data from a website

import requests
from bs4 import BeautifulSoup

# URL of the website you want to scrape
url = 'https://stackoverflow.com/questions/4634787/freqdist-with-nltk'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find and extract the text data you're interested in
    text_data = soup.get_text()

    # Print or process the text data as needed
    print(text_data)

else:
    print('Failed to retrieve data from the website')





python - FreqDist with NLTK - Stack Overflow
Stack Overflow
About
Products
For Teams
Stack Overflow
Public questions & answers
Stack Overflow for Teams
Where developers & technologists share private knowledge with coworkers
Talent
Build your employer brand
Advertising
Reach developers & technologists worldwide
Labs
The future of collective knowledge sharing
About the company
Loading…
current community
Stack Overflow
help
chat
Meta Stack Overflow
your communities
Sign up or log in to customize your list.
more stack exchange communities
company blog
Log in
Sign up
Home
Questions
Tags
Users
Companies
Labs
Discussions
New
Collectives
Collectives™ on Stack Overflow
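
Text returned by `get_text()` for a full page usually carries the whitespace of the page layout along with the navigation labels. A minimal sketch for collapsing that whitespace before further analysis is shown here. The helper name `collapse_whitespace` and the sample string are assumptions made for illustration.

```python
def collapse_whitespace(raw_text):
    """Strip each line and drop the lines that end up empty."""
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


# Short stand-in for scraped text (hand-written for this sketch).
raw = "\n\n  Stack Overflow \n\n\n\t\tProducts\n\t\n\nFor Teams\n"
cleaned = collapse_whitespace(raw)
print(cleaned)
```

BeautifulSoup can do similar trimming at extraction time with `soup.get_text(separator="\n", strip=True)`.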
      

#R

**tm (Text Mining Infrastructure in R):**

The tm package provides a framework for text mining applications within R. It supports various text preprocessing tasks, including text cleaning, stemming, and term-document matrix creation.

**quanteda:**

quanteda is a comprehensive package for text analysis, providing tools for text corpus creation, document-feature matrix construction, and various text mining and analysis functions.

**NLP (Natural Language Processing):**

The NLP package provides basic classes and methods for natural language processing, such as annotation objects and tokenizer interfaces. It mainly serves as infrastructure for other text analysis packages, including tm and openNLP.

**openNLP:**

openNLP is an R interface to Apache OpenNLP, which is a library for natural language processing. It provides functions for part-of-speech tagging, named entity recognition, and other language processing tasks.

**rvest:**

rvest is a powerful package for web scraping in R. It is particularly useful for extracting information from HTML and XML documents on the web.

In [None]:
install.packages("tm")
install.packages("rvest")
install.packages("tokenizers")
install.packages("SnowballC")  # provides wordStem(), used below

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library(tm)
library(rvest)
library(NLP)
library(tokenizers)
library(SnowballC)

In [None]:
text <- "He raced to the grocery store. He went inside but realized he forgot his wallet. He raced back home to grab it. Once he found it, he raced to the car again and drove back to the grocery store."
sent_tokens <- unlist(tokenize_sentences(text))
word_tokens <- unlist(tokenize_words(text))
cat("Sentence Tokens:", sent_tokens, "\n")
cat("Word Tokens:", word_tokens, "\n")
# Frequency Distribution
fdist <- table(unlist(word_tokens))
print(head(sort(fdist, decreasing = TRUE), 2))

Sentence Tokens: He raced to the grocery store. He went inside but realized he forgot his wallet. He raced back home to grab it. Once he found it, he raced to the car again and drove back to the grocery store. 
Word Tokens: he raced to the grocery store he went inside but realized he forgot his wallet he raced back home to grab it once he found it he raced to the car again and drove back to the grocery store 

he to 
 6  4 


In [None]:
# Remove stopwords and punctuations
stop_words <- stopwords("en")
filtered_tokens <- word_tokens[!(word_tokens %in% stop_words) & grepl("[a-zA-Z]", word_tokens)]
cat("Filtered Tokens (without stopwords and punctuations):", filtered_tokens, "\n")


Filtered Tokens (without stopwords and punctuations): raced grocery store went inside realized forgot wallet raced back home grab found raced car drove back grocery store 


In [None]:
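# Stemming
stemmed_tokens <- wordStem(filtered_tokens, language = "en")

# "Lemmatization" placeholder: SnowballC provides only stemming, and wordStem()
# treats the whole string as a single token, so this step only lowercases the text.
# True lemmatization needs a dictionary-based lemmatizer.
lemmatized_text <- tolower(text)
lemmatized_text <- wordStem(lemmatized_text, language = "en")
cat("Stemmed Tokens:", stemmed_tokens, "\n")
cat("Lemmatized Text:", lemmatized_text, "\n")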

Stemmed Tokens: race groceri store went insid realiz forgot wallet race back home grab found race car drove back groceri store 
Lemmatized Text: he raced to the grocery store. he went inside but realized he forgot his wallet. he raced back home to grab it. once he found it, he raced to the car again and drove back to the grocery store. 


In [None]:
# Web Scraping
url <- 'http://quotes.toscrape.com/'
web_page <- read_html(url)
web_text <- html_text(web_page)
cat("Scraped Data from the Website:\n", web_text)


Scraped Data from the Website:
 Quotes to Scrape
    
        
            
                
                    Quotes to Scrape
                
            
            
                
                
                    Login
                
                
            
        
    


    

    
        “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
        by Albert Einstein
        (about)
        
        
            Tags:
            change
            
            deep-thoughts
            
            thinking
            
            world
            
        
    

    
        “It is our choices, Harry, that show what we truly are, far more than our abilities.”
        by J.K. Rowling
        (about)
        
        
            Tags:
            abilities
            
            choices
            
        
    

    
        “There are only two ways to live your life. One is as though nothing