
**Experiment - 7: Perform the steps involved in Text Analytics in Python**

Some of the most widely used text analytics libraries in Python are NLTK, spaCy, TextBlob, Gensim, and Transformers.

In [None]:
# Define your custom dataset as a string
custom_dataset = "What is a sentence? A sentence is a group of words that makes complete sense."

# Print the custom dataset
print("Custom Dataset:")
print(custom_dataset)

Custom Dataset:
What is a sentence? A sentence is a group of words that makes complete sense.


In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK resources (if not already downloaded)
nltk.download('punkt')

# Tokenization (Sentence & Word)
sentences = sent_tokenize(custom_dataset)
words = word_tokenize(custom_dataset)

# Print the results
print("Tokenization (Sentences):", sentences)
print("\nTokenization (Words):", words)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Tokenization (Sentences): ['What is a sentence?', 'A sentence is a group of words that makes complete sense.']

Tokenization (Words): ['What', 'is', 'a', 'sentence', '?', 'A', 'sentence', 'is', 'a', 'group', 'of', 'words', 'that', 'makes', 'complete', 'sense', '.']
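To make the tokenization step less of a black box, here is a rough sketch that reproduces the same sentence and word splits for this particular text using only the standard library. The two regular expressions are simplifying assumptions and are not how NLTK's tokenizers work internally; they would mis-split abbreviations such as "Dr." or contractions such as "don't".

```python
import re

text = "What is a sentence? A sentence is a group of words that makes complete sense."

# Split wherever sentence-ending punctuation is followed by whitespace.
sentences = re.split(r"(?<=[.?!])\s+", text)

# A token is either a run of word characters or a single punctuation mark.
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)
print(words)
```

For real text, `sent_tokenize` and `word_tokenize` remain the better choice because they handle the edge cases this sketch ignores.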


In [None]:
from nltk.probability import FreqDist

# Frequency Distribution
freq_dist = FreqDist(words)

# Print the results
print("Frequency Distribution:")
print(freq_dist)

Frequency Distribution:
<FreqDist with 14 samples and 17 outcomes>
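Printing a `FreqDist` only reports the number of distinct tokens (samples) and the total number of tokens (outcomes). `FreqDist` is a subclass of `collections.Counter`, so the individual counts can be inspected the same way. The sketch below uses a plain `Counter` on the token list printed earlier so that it runs without NLTK.

```python
from collections import Counter

# Tokens copied from the word tokenization output above.
words = ['What', 'is', 'a', 'sentence', '?', 'A', 'sentence', 'is', 'a',
         'group', 'of', 'words', 'that', 'makes', 'complete', 'sense', '.']

counts = Counter(words)

# Distinct tokens and total tokens, matching "14 samples and 17 outcomes".
print(len(counts), sum(counts.values()))

# The three most frequent tokens with their counts.
print(counts.most_common(3))
```

Note that 'a' and 'A' are counted separately here because the tokens have not been lowercased yet.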


In [None]:
from nltk.corpus import stopwords
import string

# Download NLTK resources (if not already downloaded)
nltk.download('stopwords')

# Remove stopwords and punctuation
stop_words = set(stopwords.words('english'))
filtered_words = [word.lower() for word in words if (word.isalpha() and word.lower() not in stop_words)]

# Print the results
print("Filtered Words (without stopwords and punctuations):")
print(filtered_words)


Filtered Words (without stopwords and punctuations):
['sentence', 'sentence', 'group', 'words', 'makes', 'complete', 'sense']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
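The filtering logic itself is a simple list comprehension, which the following sketch isolates. `SMALL_STOP_WORDS` is a hypothetical, hand-written subset chosen only to cover this sentence; it is an assumption for illustration and not NLTK's English stopword list.

```python
# Hypothetical mini stopword list, just large enough for this example.
SMALL_STOP_WORDS = {"what", "is", "a", "of", "that"}

# Tokens copied from the word tokenization output above.
words = ['What', 'is', 'a', 'sentence', '?', 'A', 'sentence', 'is', 'a',
         'group', 'of', 'words', 'that', 'makes', 'complete', 'sense', '.']

# str.isalpha() drops punctuation tokens; the set lookup drops stopwords.
filtered = [w.lower() for w in words
            if w.isalpha() and w.lower() not in SMALL_STOP_WORDS]

print(filtered)
```

Using a `set` for the stopwords, as the notebook cell does, keeps each membership test constant-time.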


In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download NLTK resources (if not already downloaded)
nltk.download('wordnet')

# Lexicon Normalization (Stemming, Lemmatization)
ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_words]

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

# Print the results
print("Stemmed Words:")
print(stemmed_words)

print("\nLemmatized Words:")
print(lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...


Stemmed Words:
['sentenc', 'sentenc', 'group', 'word', 'make', 'complet', 'sens']

Lemmatized Words:
['sentence', 'sentence', 'group', 'word', 'make', 'complete', 'sense']
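The two outputs above differ in a useful way: the stemmer chops suffixes and can leave non-words, while the lemmatizer returns dictionary forms. The short sketch below lines up the results printed above (copied in as literals, so no NLTK is required) and lists only the words where the two methods disagree.

```python
# Values copied from the notebook outputs above.
filtered = ['sentence', 'sentence', 'group', 'words', 'makes', 'complete', 'sense']
stemmed = ['sentenc', 'sentenc', 'group', 'word', 'make', 'complet', 'sens']
lemmatized = ['sentence', 'sentence', 'group', 'word', 'make', 'complete', 'sense']

# Keep only the rows where the stem and the lemma are not the same string.
differences = [(w, s, l) for w, s, l in zip(filtered, stemmed, lemmatized) if s != l]

for word, stem, lemma in differences:
    print(f"{word:10} stem={stem:10} lemma={lemma}")
```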


In [None]:
nltk.download('averaged_perceptron_tagger')

# Part of Speech tagging
pos_tags = nltk.pos_tag(words)

# Print the results
print("Part of Speech Tags:")
print(pos_tags)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Part of Speech Tags:
[('What', 'WP'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'), ('?', '.'), ('A', 'DT'), ('sentence', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('group', 'NN'), ('of', 'IN'), ('words', 'NNS'), ('that', 'WDT'), ('makes', 'VBZ'), ('complete', 'JJ'), ('sense', 'NN'), ('.', '.')]
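The list of `(token, tag)` pairs is easier to work with once it is grouped. As a small sketch, the code below groups the tagged tokens from the output above by tag and pulls out the nouns, relying on the Penn Treebank convention that noun tags start with `NN`. The pairs are pasted in as literals so the example runs without NLTK.

```python
from collections import defaultdict

# Tagged tokens copied from the output above.
pos_tags = [('What', 'WP'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'),
            ('?', '.'), ('A', 'DT'), ('sentence', 'NN'), ('is', 'VBZ'),
            ('a', 'DT'), ('group', 'NN'), ('of', 'IN'), ('words', 'NNS'),
            ('that', 'WDT'), ('makes', 'VBZ'), ('complete', 'JJ'),
            ('sense', 'NN'), ('.', '.')]

# Map each tag to the tokens that received it, in order of appearance.
by_tag = defaultdict(list)
for token, tag in pos_tags:
    by_tag[tag].append(token)

# Singular (NN) and plural (NNS) nouns share the "NN" prefix.
nouns = [token for token, tag in pos_tags if tag.startswith("NN")]

print(dict(by_tag))
print(nouns)
```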


In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Named Entity Recognition
ner_tags = nltk.ne_chunk(pos_tags)

# Print the results
print("Named Entity Recognition Tags:")
print(ner_tags)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


Named Entity Recognition Tags:
(S
  What/WP
  is/VBZ
  a/DT
  sentence/NN
  ?/.
  A/DT
  sentence/NN
  is/VBZ
  a/DT
  group/NN
  of/IN
  words/NNS
  that/WDT
  makes/VBZ
  complete/JJ
  sense/NN
  ./.)


[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


**Some of the most widely used text analytics libraries in R are tm, quanteda, tidytext, text, and textTinyR.**


In [6]:
# Define your custom dataset as a string
custom_dataset <- "What is a sentence? A sentence is a group of words that makes complete sense."

# Print the custom dataset
cat("Custom Dataset:\n")
cat(custom_dataset, "\n")


Custom Dataset:
What is a sentence? A sentence is a group of words that makes complete sense. 


In [7]:
install.packages("tokenizers")

# Load the tokenizers library
library(tokenizers)

# Tokenization (Sentence & Word)
sentences <- unlist(tokenize_sentences(custom_dataset))
words <- unlist(tokenize_words(custom_dataset))

# Print the results
cat("Tokenization (Sentences):", sentences, "\n")
cat("\nTokenization (Words):", words, "\n")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Tokenization (Sentences): What is a sentence? A sentence is a group of words that makes complete sense. 

Tokenization (Words): what is a sentence a sentence is a group of words that makes complete sense 


In [8]:
# Frequency Distribution
freq_dist <- table(words)

# Print the results
cat("Frequency Distribution:\n")
print(freq_dist)

Frequency Distribution:
words
       a complete    group       is    makes       of    sense sentence 
       3        1        1        2        1        1        1        2 
    that     what    words 
       1        1        1 


In [9]:
# Load required libraries
install.packages("tm")
library(tm)

# Remove stopwords and punctuation
# tokenize_words() already lowercases the tokens, so keep every purely
# alphabetic token that is not in the stopword list
stop_words <- stopwords("en")
lower_words <- tolower(words)
filtered_words <- lower_words[grepl("^[[:alpha:]]+$", lower_words) & !(lower_words %in% stop_words)]

# Print the results
cat("Filtered Words (without stopwords and punctuations):\n")
print(filtered_words)


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Filtered Words (without stopwords and punctuations):
[1] "sentence" "sentence" "group"    "words"    "makes"    "complete" "sense"   


In [13]:
install.packages("udpipe")
install.packages("SnowballC")

# Load required libraries
library(udpipe)
library(SnowballC)

# Define the filtered_words vector (stopwords and punctuation already removed)
filtered_words <- c("sentence", "sentence", "group", "words", "makes", "complete", "sense")

# Download the English model and load it from the downloaded file
model_info <- udpipe_download_model(language = "english", model_dir = "~/udpipe_models")
ud_model <- udpipe_load_model(file = model_info$file_model)

# Perform stemming
# udpipe has no stemmer, so the Porter stemmer from SnowballC is used here
stemmed_words <- wordStem(filtered_words, language = "porter")

# Perform lemmatization by annotating the text with the udpipe model
annotated_text <- as.data.frame(udpipe_annotate(ud_model, x = paste(filtered_words, collapse = " ")))
lemmatized_words <- annotated_text$lemma

# Print the results
cat("Stemmed Words:\n")
print(stemmed_words)

cat("\nLemmatized Words:\n")
print(lemmatized_words)


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
