1. Submit the Python script containing all your implementations for tokenization, stopword removal, and named entity extraction.

2. Provide answers to the following questions:

  - What are the advantages and limitations of NLTK and spaCy in text preprocessing?

  - How can tokenization help with analyzing customer feedback?

  - How does removing stop words impact the analysis?

  - Why is it important to extract named entities from customer feedback?

  - What insights would you look for from tokenized feedback?

# Step 1: Install required Libraries

In [1]:
!pip install nltk spacy



In [2]:
import spacy
spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# NLTK Method

### Tokenizing

In [4]:
import nltk

# Download the 'punkt_tab' resource
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize

text = "Great product, but the software crashed twice in the last week. The customer support team was very helpful, though. Could improve the battery life."
tokens = word_tokenize(text)
tokens

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


['Great',
 'product',
 ',',
 'but',
 'the',
 'software',
 'crashed',
 'twice',
 'in',
 'the',
 'last',
 'week',
 '.',
 'The',
 'customer',
 'support',
 'team',
 'was',
 'very',
 'helpful',
 ',',
 'though',
 '.',
 'Could',
 'improve',
 'the',
 'battery',
 'life',
 '.']

### Removing stopwords

In [5]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens_nltk = [word for word in tokens if word.lower() not in stop_words]
filtered_tokens_nltk

['Great',
 'product',
 ',',
 'software',
 'crashed',
 'twice',
 'last',
 'week',
 '.',
 'customer',
 'support',
 'team',
 'helpful',
 ',',
 'though',
 '.',
 'Could',
 'improve',
 'battery',
 'life',
 '.']

# spaCy Method


### Tokenizing

In [6]:
nlp = spacy.blank("en")

doc = nlp("Great product, but the software crashed twice in the last week. The customer support team was very helpful, though. Could improve the battery life.")

for token in doc:
  print(token)

Great
product
,
but
the
software
crashed
twice
in
the
last
week
.
The
customer
support
team
was
very
helpful
,
though
.
Could
improve
the
battery
life
.


### Removing stopwords

In [7]:
filtered_tokens_spacy = [token.text for token in doc if not token.is_stop]

filtered_tokens_spacy

['Great',
 'product',
 ',',
 'software',
 'crashed',
 'twice',
 'week',
 '.',
 'customer',
 'support',
 'team',
 'helpful',
 ',',
 '.',
 'improve',
 'battery',
 'life',
 '.']

### Extracting Named Entities

In [8]:
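nlp = spacy.load("en_core_web_sm")

# Re-process the text with the trained pipeline: the earlier `doc` came from
# spacy.blank("en"), which has no NER component, so its doc.ents is empty.
doc = nlp(text)

for ent in doc.ents:
  print(ent.text, ent.label_)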

---

# Analysis




Both NLTK and spaCy tokenize the specified customer feedback effectively, but they differ in approach and performance. In this scenario, the main differences are running time and stopword removal. NLTK kept some words that spaCy's stopword list removed, such as "last", "though", and "Could".
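The stopword difference can be checked directly by comparing the two filtered lists. The sketch below hard-codes the outputs of cells 5 and 7 so that it runs without either library installed; the variable `only_in_nltk` is introduced here purely for illustration.

```python
# Filtered token lists copied from the outputs of cells 5 and 7.
filtered_tokens_nltk = [
    'Great', 'product', ',', 'software', 'crashed', 'twice', 'last', 'week', '.',
    'customer', 'support', 'team', 'helpful', ',', 'though', '.',
    'Could', 'improve', 'battery', 'life', '.',
]
filtered_tokens_spacy = [
    'Great', 'product', ',', 'software', 'crashed', 'twice', 'week', '.',
    'customer', 'support', 'team', 'helpful', ',', '.',
    'improve', 'battery', 'life', '.',
]

# Tokens that survived NLTK's stopword filter but not spaCy's, in original order.
spacy_kept = set(filtered_tokens_spacy)
only_in_nltk = [tok for tok in filtered_tokens_nltk if tok not in spacy_kept]

print(only_in_nltk)  # ['last', 'though', 'Could']
```

For this sentence the result suggests that spaCy's English stopword list is the broader of the two, which is worth keeping in mind when comparing results across libraries.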

## What are the advantages and limitations of NLTK and spaCy in text preprocessing?
NLTK is the older library and offers broader functionality, including many alternative algorithms for the same task, whereas spaCy concentrates on a curated set of core NLP tasks. Because of this narrower focus, spaCy generally runs faster than NLTK, at the cost of fewer options for algorithm selection. NLTK also bundles resources for a wide variety of languages, while spaCy's trained pipelines are strongest for English; it has expanded to other languages, but coverage is still more limited.

With this in mind, the choice of library depends on the user's needs and priorities. NLTK is slower but offers more flexibility and customization, making it more suitable for small-scale projects. Meanwhile, spaCy is more suitable for large-scale projects that require high-speed processing.

## How can tokenization help with analyzing customer feedback?
Tokenization helps with analyzing customer feedback by breaking the text into smaller units that are easier to read and process. This enables efficient sentiment analysis, facilitates topic modeling, and improves data visualization by grouping related tokens together, such as those that are positive, negative, or neutral.
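As a rough illustration of grouping tokens by sentiment, the sketch below sorts the filtered tokens from cell 7 into three buckets. The `POSITIVE` and `NEGATIVE` word sets are hand-picked assumptions for this one sentence, not a real sentiment lexicon.

```python
# Filtered tokens copied from the output of cell 7.
filtered_tokens_spacy = [
    'Great', 'product', ',', 'software', 'crashed', 'twice', 'week', '.',
    'customer', 'support', 'team', 'helpful', ',', '.',
    'improve', 'battery', 'life', '.',
]

# Hypothetical mini-lexicon, chosen by hand for this example only.
POSITIVE = {"great", "helpful"}
NEGATIVE = {"crashed"}

groups = {"positive": [], "negative": [], "neutral": []}
for tok in filtered_tokens_spacy:
    if not tok.isalpha():
        continue  # skip punctuation tokens
    word = tok.lower()
    if word in POSITIVE:
        groups["positive"].append(tok)
    elif word in NEGATIVE:
        groups["negative"].append(tok)
    else:
        groups["neutral"].append(tok)

for label, words in groups.items():
    print(label, words)
```

A real analysis would replace the hand-picked sets with an established lexicon or a trained classifier, but the grouping step would look much the same.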

## How does removing stop words impact the analysis?
Removing stop words reduces the noise fed to NLP algorithms, which makes the analysis of customer feedback more efficient, direct, and targeted. However, depending on the dataset, it may also reduce accuracy, because some sentences rely heavily on stopwords for contextual information that the model can no longer pick up once they are gone. As a result, stopword removal should be weighed carefully against the nature of the data being processed and what the user wants to achieve with the system.
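Negation is a common case where removing stopwords loses context. The sketch below uses a small hand-picked stopword set (an assumption for illustration; NLTK's English list also contains "not") and a hypothetical helper `remove_stopwords` to show two opposite sentences collapsing to the same tokens.

```python
# Hand-picked subset of stopwords, for illustration only.
STOP_WORDS = {"the", "is", "was", "a", "not"}
# Words we may want to protect because they flip the meaning.
NEGATIONS = {"not", "no", "never"}


def remove_stopwords(sentence, stop_words):
    """Lowercase, split on whitespace, and drop any word in stop_words."""
    return [w for w in sentence.lower().split() if w not in stop_words]


good = remove_stopwords("The battery is good", STOP_WORDS)
bad = remove_stopwords("The battery is not good", STOP_WORDS)
print(good)         # ['battery', 'good']
print(bad)          # ['battery', 'good']
print(good == bad)  # True: the negation has been lost

# One possible mitigation: exclude negation words from the stopword set.
bad_kept = remove_stopwords("The battery is not good", STOP_WORDS - NEGATIONS)
print(bad_kept)     # ['battery', 'not', 'good']
```

Whether to keep negations depends on the task: for topic counts they rarely matter, while for sentiment analysis they usually do.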

## Why is it important to extract named entities from customer feedback?
Extracting named entities from customer feedback lets businesses pinpoint the exact concerns being raised, such as specific products, organizations, places, or dates. This enables more targeted actions, whether that means adjusting marketing strategies or identifying which product areas need improvement. In general, it helps businesses make more data-driven decisions.

## What insights would you look for from tokenized feedback?
The insights I would look for are common trends that feed into sentiment analysis. Tokenized feedback can reveal patterns in customer opinions, gauge overall customer satisfaction, and pinpoint which areas need improvement and which should be kept as they are.
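One simple way to surface such trends is to count how many feedback entries mention each term. In the sketch below, the first entry is the feedback used in this notebook; the other two are invented for illustration, and the regex tokenizer and small `STOP_WORDS` set are simplifying assumptions standing in for the NLTK or spaCy steps above.

```python
import re
from collections import Counter

feedback = [
    # The feedback text used throughout this notebook.
    "Great product, but the software crashed twice in the last week. "
    "The customer support team was very helpful, though. "
    "Could improve the battery life.",
    # Hypothetical entries, invented for this example.
    "Battery drains too fast and the software crashed again.",
    "Customer support was helpful, battery life is still short.",
]

# Small hand-picked stopword set, for illustration only.
STOP_WORDS = {"the", "but", "in", "was", "very", "though", "could",
              "and", "too", "is", "still", "again", "last"}

mentions = Counter()
for entry in feedback:
    words = re.findall(r"[a-z]+", entry.lower())
    # Use a set so each term counts at most once per feedback entry.
    mentions.update(set(words) - STOP_WORDS)

# Terms mentioned in at least two entries, listed alphabetically.
recurring = sorted(term for term, n in mentions.items() if n >= 2)
print(recurring)
print(mentions["battery"])  # 3
```

Terms that recur across many entries, such as "battery" here, are natural candidates for a closer look with sentiment analysis.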