# NLP Introduction & Text Processing
# Assignment

## (1) What is Computational Linguistics and how does it relate to NLP?

Ans: Computational Linguistics (CL) is a field that focuses on using computers to understand, interpret, analyze, and generate human language. It combines knowledge from linguistics, computer science, mathematics, and artificial intelligence.

Computational linguistics tries to answer questions like:

- How do humans understand language?

- How can computers be programmed to understand and process language?

- How can language rules be modeled mathematically?

How Computational Linguistics Relates to NLP

Natural Language Processing (NLP) is a practical application area of computational linguistics.
While computational linguistics is the theoretical and scientific study, NLP focuses on building working systems such as:

- Speech recognition (Alexa, Siri)

- Machine translation (Google Translate)

- Chatbots (like ChatGPT)

- Sentiment analysis



## (2) Briefly describe the historical evolution of Natural Language Processing.

Ans:
The evolution of NLP can be divided into the following major phases:

- 1950s – Early Rule-Based Approaches
NLP began with simple rule-based systems and machine translation attempts. Alan Turing introduced the Turing Test (1950).

- 1960s–1980s – Linguistic and Symbolic Models
Systems used grammar rules to understand language. Famous programs like ELIZA (1966) and SHRDLU appeared.

- 1980s–1990s – Statistical NLP
The field shifted from hand-written rules to probability-based models as larger text datasets and more computing power became available. Methods like Hidden Markov Models and n-grams became popular.

- 2000s – Machine Learning Era
NLP started using ML algorithms such as SVMs, Decision Trees, and Logistic Regression for tasks like text classification and speech recognition.

- 2010s–Present – Deep Learning and Neural NLP
Neural architectures such as RNNs and LSTMs, followed by Transformer-based models (BERT, GPT), produced large gains in language understanding and generation, and on some benchmarks their results approach human performance.

## (3) List and explain three major use cases of NLP in today’s tech industry.

Ans:
- Machine Translation

NLP is used in automated translation systems like Google Translate or Microsoft Translator.
These systems convert text or speech from one language to another by understanding grammar, meaning, and context.

- Sentiment Analysis

Companies use NLP to analyze customer feedback, reviews, or social media posts.
It helps determine whether the expressed opinion is positive, negative, or neutral, which supports brand monitoring, customer service, and market analysis.

- Chatbots and Virtual Assistants

NLP powers conversational systems such as ChatGPT, Siri, Alexa, and WhatsApp chatbots.
These systems understand user queries and generate meaningful responses, helping in customer support, automation, and user interaction.

## (4) What is text normalization and why is it essential in text processing tasks?

Ans:
- Text Normalization

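Text normalization is the process of converting text into a standard and consistent format before processing it in NLP tasks. It reduces variations in language so that similar words are treated the same by algorithms. Without it, forms such as "Run", "run", and "running" would be counted as unrelated tokens, which inflates the vocabulary and spreads the evidence for one concept across several features.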

Normalization steps may include:
- Lowercasing text

- Removing punctuation or special characters

- Expanding contractions (e.g., don’t → do not)

- Lemmatization or stemming (e.g., running → run)
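
The normalization steps listed above can be sketched with the standard library alone. This is a minimal illustration, not a production normalizer: the `CONTRACTIONS` table and the `normalize` function are hypothetical names introduced here, and the table covers only a handful of contractions.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table; real projects use a fuller list.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}

def normalize(text):
    """Lowercase, expand contractions, strip punctuation, collapse whitespace."""
    # Lowercase and map the curly apostrophe (U+2019) to a straight one.
    text = text.lower().replace("\u2019", "'")
    # Expand contractions before punctuation removal deletes the apostrophes.
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Remove ASCII punctuation, then squeeze runs of whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Don\u2019t  PANIC!!  It's only   text."))
# do not panic it is only text
```

The order matters in this sketch: contractions are expanded first, because stripping punctuation first would turn "don't" into "dont" and the lookup would no longer match.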

## (5) Compare and contrast stemming and lemmatization with suitable examples.

Ans:
- Stemming

Stemming is a technique used to remove word endings and reduce a word to its root form.

The resulting root may not be a meaningful word.

It is rule-based and faster but less accurate.

Example (Porter stemmer): running → run, studies → studi.

- Lemmatization

Lemmatization reduces a word to its meaningful base form, called a lemma.

The output is always a valid dictionary word.

It uses linguistic knowledge like grammar and vocabulary, making it more accurate but slower.

Example: running → run, better → good (when the word is tagged as an adjective).
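
To make the contrast concrete, here is a toy sketch in plain Python. It is an assumption-laden illustration, not how NLTK or spaCy work internally: `toy_stem` strips a few hypothetical suffixes with no dictionary check, while `toy_lemmatize` looks words up in a hypothetical four-entry lexicon. Real stemmers such as Porter apply many more rules, and real lemmatizers use a full lexicon plus part-of-speech information.

```python
# Toy illustration only; SUFFIXES and LEMMAS are hypothetical and far from complete.
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix without checking the result is a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Mini-lexicon mapping inflected forms to their dictionary form (lemma).
LEMMAS = {"running": "run", "studies": "study", "better": "good", "mice": "mouse"}

def toy_lemmatize(word):
    """Look the word up in the lexicon; fall back to the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["Running", "Studies", "Better", "Mice"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
# Running  stem=runn     lemma=run
# Studies  stem=studi    lemma=study
# Better   stem=better   lemma=good
# Mice     stem=mice     lemma=mouse
```

The crude stripper leaves "runn" because it has no rule for doubled consonants; the Porter stemmer adds such a rule and returns "run". Irregular forms like "better" and "mice" are out of reach for suffix stripping altogether, which is where lexicon-based lemmatization earns its extra cost.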

In [7]:
# 6
import re

text = """
Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John at
john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us.
For partnership inquiries, email partners@xyz.biz.
"""

# Regex pattern for extracting emails
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Extract emails
emails = re.findall(pattern, text)

# Display output
print("Extracted Email Addresses:")
for email in emails:
    print(email)


Extracted Email Addresses:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz


In [8]:
# 7
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Sample paragraph
text = """
Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.
"""

# Download tokenizer data (only needed once).
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')

# Tokenizing the text
tokens = word_tokenize(text)

# Frequency Distribution
freq_dist = FreqDist(tokens)

# Displaying output
print("Tokenized Words:\n", tokens)
print("\nTop 10 Most Common Words:")
for word, freq in freq_dist.most_common(10):
    print(f"{word}: {freq}")


Tokenized Words:
 ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Top 10 Most Common Words:
,: 7
.: 4
NLP: 3
and: 3
is: 2
of: 2
Natural: 1
Language: 1
Processing: 1
(: 1


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
# 8
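!pip install spacy
# The small English model is a separate download (only needed once)
!python -m spacy download en_core_web_sm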

import spacy

# Load English spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = """
Apple is working with Microsoft and Google to build AI-powered applications.
Elon Musk also announced updates at Tesla headquarters in California.
"""

# Process text
doc = nlp(text)

print("Proper Nouns Found:\n")

for token in doc:
    if token.pos_ == "PROPN":  # PROPN = Proper Noun in spaCy
        print(f"{token.text} → {token.pos_}")


Proper Nouns Found:

Apple → PROPN
Microsoft → PROPN
Google → PROPN
AI → PROPN
Elon → PROPN
Musk → PROPN
Tesla → PROPN
California → PROPN


In [11]:
# 9
# Install gensim if not already installed
!pip install gensim

from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize
import string

# Download tokenizer resources
nltk.download('punkt')

# Given dataset
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]

# Step 1: Tokenization and preprocessing
processed_data = []

for sentence in dataset:
    tokens = word_tokenize(sentence.lower())  # lowercase + tokenize
    tokens = [word for word in tokens if word not in string.punctuation]  # remove punctuation
    processed_data.append(tokens)

print("Tokenized and Preprocessed Text:")
print(processed_data)

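# Step 2: Train Word2Vec model
# Note: with only five sentences the learned vectors are close to random,
# so the similarity scores in the output are illustrative, not meaningful.
model = Word2Vec(sentences=processed_data, vector_size=50, window=5, min_count=1, workers=4)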

# Step 3: Display similar words
print("\nWords similar to 'word':")
print(model.wv.most_similar("word"))

print("\nWords similar to 'nlp':")
print(model.wv.most_similar("nlp"))

# Step 4: Show word vector example
print("\nVector representation for the word 'language':")
print(model.wv['language'])


Tokenized and Preprocessed Text:
[['natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'human', 'language'], ['word', 'embeddings', 'are', 'a', 'type', 'of', 'word', 'representation', 'that', 'allows', 'words', 'with', 'similar', 'meaning', 'to', 'have', 'similar', 'representation'], ['word2vec', 'is', 'a', 'popular', 'word', 'embedding', 'technique', 'used', 'in', 'many', 'nlp', 'applications'], ['text', 'preprocessing', 'is', 'a', 'critical', 'step', 'before', 'training', 'word', 'embeddings'], ['tokenization', 'and', 'normalization', 'help', 'clean', 'raw', 'text', 'for', 'modeling']]

Words similar to 'word':
[('before', 0.2706666588783264), ('enables', 0.2547191083431244), ('meaning', 0.24074727296829224), ('normalization', 0.21101471781730652), ('nlp', 0.18646620213985443), ('are', 0.17563006281852722), ('raw', 0.16719907522201538), ('applications', 0.16099633276462555), ('help', 0.15025003254413605), ('popular', 0.1453729271888733)]

Words similar to



## (10)

As a data scientist at a fintech startup analyzing customer feedback, the goal is to convert thousands of raw customer reviews into meaningful insights. The NLP workflow would include the following major steps:

- Step 1: Data Collection

  Collect reviews from sources like CSV files, app reviews, surveys, or databases.

- Step 2: Text Cleaning

  Remove noise such as:

  - Special characters
  - URLs
  - Numbers
  - Extra whitespace

- Step 3: Text Preprocessing

  Apply normalization steps:

  - Lowercasing
  - Tokenization
  - Stopword removal
  - Lemmatization or stemming

- Step 4: Exploratory Text Analysis

  Use NLP methods like:

  - Frequency distribution (common words)
  - N-grams (phrases like "bad service" or "fraud alert")

- Step 5: Sentiment Analysis

  Use tools such as VADER or TextBlob, or a fine-tuned BERT classifier, to label feedback as positive, neutral, or negative.

- Step 6: Topic Modeling

  Use LDA (Latent Dirichlet Allocation) to identify key themes such as:

  - Loan approval issues
  - Payment failures
  - Customer support problems

- Step 7: Visualization

  Create visual insights like:

  - Word clouds
  - Sentiment bar charts
  - Topic clusters

- Step 8: Reporting Insights

  Provide actionable recommendations to the business based on patterns found in sentiment and topics.
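
Steps 2 and 4 can be sketched with the standard library before reaching for NLTK or gensim. The snippet below is a rough illustration under stated assumptions: the three reviews are invented for the example, `clean` and `bigrams` are hypothetical helper names, and stopword removal is skipped to keep it short.

```python
import re
from collections import Counter

# Hypothetical reviews, invented purely for illustration.
reviews = [
    "Bad service!! Called 3 times, see https://example.com/ticket",
    "Really bad service and a fraud alert I never asked for.",
    "Fraud alert emails   every day, bad service overall.",
]

def clean(text):
    """Step 2: drop URLs, then non-letters (digits, symbols), then extra whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def bigrams(tokens):
    """Step 4: adjacent word pairs from a token list."""
    return list(zip(tokens, tokens[1:]))

# Count bigrams across all cleaned reviews (whitespace tokenization for brevity).
counts = Counter()
for review in reviews:
    counts.update(bigrams(clean(review).split()))

print(counts.most_common(2))
# [(('bad', 'service'), 3), (('fraud', 'alert'), 2)]
```

URLs are removed before the non-letter filter on purpose: filtering first would leave fragments such as "https" and "example" behind as ordinary words.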

In [12]:
# Install required packages
!pip install nltk spacy gensim wordcloud textblob

import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
from wordcloud import WordCloud
from gensim import corpora, models

# Download resources
nltk.download('punkt')
nltk.download('stopwords')

# Sample dataset (representative – normally thousands of reviews)
reviews = [
    "The loan approval process was fast and easy!",
    "Customer support is terrible, no one responds.",
    "Great user experience, smooth transactions.",
    "Payment failed twice and app kept crashing.",
    "Love the interface, very intuitive and helpful."
]

# Step 1: Cleaning function
def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # remove special chars
    return text.lower()

cleaned_reviews = [clean_text(review) for review in reviews]

# Step 2: Tokenization & Stopword removal
stop_words = set(stopwords.words('english'))
tokenized = [[word for word in word_tokenize(review) if word not in stop_words] for review in cleaned_reviews]

print("Tokenized and Cleaned Reviews:\n", tokenized)

# Step 3: Sentiment Analysis
sentiments = [TextBlob(review).sentiment.polarity for review in reviews]
print("\nSentiment Scores:\n", sentiments)

# Step 4: Topic Modeling with LDA
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(text) for text in tokenized]
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

print("\nIdentified Topics:")
for topic in lda_model.print_topics():
    print(topic)

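# Step 5: Word Cloud
text_combined = " ".join(cleaned_reviews)
wordcloud = WordCloud(width=600, height=400).generate(text_combined)

# A bare `wordcloud` expression only shows the object's repr;
# use wordcloud.to_image() to render the picture in the notebook.
wordcloud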




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Tokenized and Cleaned Reviews:
 [['loan', 'approval', 'process', 'fast', 'easy'], ['customer', 'support', 'terrible', 'one', 'responds'], ['great', 'user', 'experience', 'smooth', 'transactions'], ['payment', 'failed', 'twice', 'app', 'kept', 'crashing'], ['love', 'interface', 'intuitive', 'helpful']]

Sentiment Scores:
 [0.37083333333333335, -1.0, 0.6000000000000001, -0.5, 0.35]

Identified Topics:
(0, '0.057*"experience" + 0.057*"user" + 0.057*"transactions" + 0.056*"great" + 0.056*"smooth" + 0.056*"helpful" + 0.056*"love" + 0.056*"interface" + 0.056*"intuitive" + 0.056*"customer"')
(1, '0.063*"kept" + 0.063*"payment" + 0.063*"twice" + 0.063*"app" + 0.063*"failed" + 0.063*"crashing" + 0.063*"loan" + 0.063*"fast" + 0.063*"approval" + 0.063*"process"')


<wordcloud.wordcloud.WordCloud at 0x783748f54ef0>