## Q1: What is Computational Linguistics and how does it relate to NLP?
Computational Linguistics is the interdisciplinary field that studies language using computational methods.  
- It combines linguistics, computer science, and AI to model and analyze human language.  
- **Relation to NLP:** The two fields overlap heavily; NLP (Natural Language Processing) is usually treated as the applied, engineering-oriented side.  
  - Computational linguistics supplies the linguistic theory and formal models.  
  - NLP builds on them in real-world systems such as chatbots, translation tools, and speech recognition.

---

## Q2: Historical Evolution of NLP
- **1950s–1960s:** Rule-based systems and early machine translation (Georgetown-IBM experiment, 1954).  
- **1970s–1980s:** Formal grammars, symbolic AI, parsing.  
- **1990s:** Statistical methods (Hidden Markov Models, probabilistic approaches).  
- **2000s:** Machine learning (SVMs, decision trees).  
- **2010s–Present:** Deep learning, word embeddings (Word2Vec, GloVe), transformers (BERT, GPT).

---

## Q3: Major Use Cases of NLP
1. **Chatbots & Virtual Assistants** – Conversational interfaces for customer support and everyday tasks (e.g., Alexa, Siri, ChatGPT).  
2. **Sentiment Analysis** – Understanding customer feedback, reviews, social media posts.  
3. **Machine Translation** – Tools like Google Translate for cross-language communication.

---

## Q4: Text Normalization
Text normalization converts raw text into a consistent, canonical form before further processing.
- **Steps:** lowercasing, removing punctuation, expanding contractions, handling special characters.
- **Importance:**
  - Reduces surface variability in the text.
  - Maps variants of the same expression (e.g., "Don't", "dont", "do not") to one representation.
  - Improves accuracy in tokenization, sentiment analysis, and ML models.
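
The steps above can be sketched in plain Python. This is a minimal illustration, not a complete normalizer: the `CONTRACTIONS` map and the `normalize` function are hypothetical names introduced here, and the contraction list is deliberately tiny.

```python
import re
import string

# Hypothetical, deliberately tiny contraction map; real projects use a fuller list.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}

def normalize(text: str) -> str:
    """Lowercase, expand contractions, strip punctuation, collapse whitespace."""
    text = text.lower()
    # Map curly apostrophes to straight ones so the contraction lookup matches.
    text = text.replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace ASCII punctuation with spaces so neighbouring words do not merge.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("I'm SO happy!!  Don't change the app, it's great."))
# i am so happy do not change the app it is great
```

Contractions are expanded before punctuation is removed; in the opposite order the apostrophe would already be gone and "don't" would become "don t".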

---

## Q5: Stemming vs Lemmatization
- **Stemming:**  
  - Rule-based truncation of word endings.  
  - May produce non-dictionary words.  
  - Example: *“studies” → “studi”*.  

- **Lemmatization:**  
  - Uses a vocabulary plus morphological analysis, and usually needs the word's part of speech.  
  - Produces dictionary forms (lemmas) for the words it knows.  
  - Example: *“studies” → “study”*.  

**Comparison Table:**

| Aspect       | Stemming                | Lemmatization             |
|--------------|-------------------------|---------------------------|
| Approach     | Rule-based truncation   | Dictionary + morphology   |
| Output       | May not be valid words  | Dictionary forms (lemmas) |
| Speed        | Faster                  | Slower                    |
| Accuracy     | Lower                   | Higher                    |
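
To make the contrast in the table concrete, here is a toy sketch of both approaches in plain Python. The rule list `SUFFIX_RULES`, the lookup table `LEMMA_DICT`, and both functions are hypothetical and far smaller than what the Porter stemmer or a WordNet-based lemmatizer use; they only illustrate "truncate by rule" versus "look up in a dictionary".

```python
# Toy suffix rules in priority order (hypothetical, much smaller than Porter's rule set).
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("s", "")]

# Toy lemma dictionary (hypothetical); a real lemmatizer consults a lexicon such as WordNet.
LEMMA_DICT = {"studies": "study", "flies": "fly", "running": "run", "better": "good"}

def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule; no dictionary check is made."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the input when it is unknown."""
    return LEMMA_DICT.get(word, word)

for w in ["studies", "flies", "running", "better"]:
    print(f"{w:8} stem={toy_stem(w):7} lemma={toy_lemmatize(w)}")
```

The toy stemmer turns "running" into "runn" because it has no rule for doubled consonants; the real Porter algorithm adds such rules. The lemmatizer is only as good as its dictionary, which is why unknown words come back unchanged.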

---

## Q10: Workflow for Customer Reviews (Fintech Startup)
1. **Data Collection:** Gather reviews from app, website, social media.  
2. **Text Cleaning:** Remove HTML tags, emojis, normalize text.  
3. **Tokenization:** Split into words/sentences.  
4. **Preprocessing:** Stemming/lemmatization, handle negations, POS tagging.  
5. **Feature Extraction:** Bag of Words, TF-IDF, embeddings (Word2Vec, BERT).  
6. **Sentiment Analysis:** Classify reviews (positive/negative/neutral).  
7. **Topic Modeling:** LDA to find recurring themes (fees, support, usability).  
8. **Visualization:** Dashboards for sentiment trends, frequent complaints.  
9. **Insights & Action:** Share findings with product/customer service teams.
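
A compressed sketch of steps 2, 3, 4, and 6 follows, using only the standard library. It is an illustration under stated assumptions, not a production pipeline: the word lists `POSITIVE`, `NEGATIVE`, and `NEGATIONS`, the sample reviews, and all function names are hypothetical, and the lexicon lookup stands in for a trained sentiment classifier.

```python
import re
from collections import Counter
from typing import List

# Hypothetical mini-lexicon; a real system would use a trained classifier.
POSITIVE = {"great", "easy", "fast", "love", "helpful"}
NEGATIVE = {"slow", "hidden", "confusing", "bad", "crash"}
NEGATIONS = {"not", "no", "never"}

def clean(text: str) -> str:
    """Step 2: drop HTML tags and non-ASCII symbols (e.g. emojis), then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return text.lower()

def tokenize(text: str) -> List[str]:
    """Step 3: split into word tokens."""
    return re.findall(r"[a-z']+", text)

def score(tokens: List[str]) -> int:
    """Steps 4 and 6: sum word polarities, flipping a word that directly follows a negation."""
    total = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        total += polarity
    return total

def classify(review: str) -> str:
    s = score(tokenize(clean(review)))
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"

reviews = [
    "<p>Love the app, transfers are fast \U0001F680</p>",
    "Support was not helpful and the fees are hidden",
    "I opened an account yesterday",
]
labels = [classify(r) for r in reviews]
print(Counter(labels))
```

The negation check is what keeps "not helpful" from counting as praise; a plain bag-of-words count would score that review as mixed rather than negative.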

In [4]:
# Q6
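import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases need this; without it the tokenizers raise LookupError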

text = """Natural Language Processing (NLP) is a field of AI that helps computers understand human language.
It is widely used in chatbots, translation, and sentiment analysis."""

# Sentence Tokenization
sentences = nltk.sent_tokenize(text)
print("Sentence Tokenization:")
print(sentences)

# Word Tokenization
words = nltk.word_tokenize(text)
print("\nWord Tokenization:")
print(words)

Sentence Tokenization:
['Natural Language Processing (NLP) is a field of AI that helps computers understand human language.', 'It is widely used in chatbots, translation, and sentiment analysis.']

Word Tokenization:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'AI', 'that', 'helps', 'computers', 'understand', 'human', 'language', '.', 'It', 'is', 'widely', 'used', 'in', 'chatbots', ',', 'translation', ',', 'and', 'sentiment', 'analysis', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
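
For intuition about what the two tokenizers do, the same text can be split with two regular expressions. This is a rough approximation written for this example, not how NLTK's Punkt model works; it happens to give the same tokens for this input but would mishandle abbreviations such as "Dr." or contractions such as "don't".

```python
import re

text = """Natural Language Processing (NLP) is a field of AI that helps computers understand human language.
It is widely used in chatbots, translation, and sentiment analysis."""

# Sentence split: break on whitespace that follows ., ! or ? (naive rule).
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word split: a run of word characters, or any single non-space symbol.
words = re.findall(r"\w+|[^\w\s]", text)

print(len(sentences), "sentences")
print(words[:6])
```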


In [2]:
# Q7
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases; without it word_tokenize raises LookupError

text = "NLP is a powerful tool for analyzing human language data."

# Tokenize words
words = nltk.word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Original Words:", words)
print("Filtered Words (without stopwords):", filtered_words)

Original Words: ['NLP', 'is', 'a', 'powerful', 'tool', 'for', 'analyzing', 'human', 'language', 'data', '.']
Filtered Words (without stopwords): ['NLP', 'powerful', 'tool', 'analyzing', 'human', 'language', 'data', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
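
Note that the filtered list above still contains the final ".", because punctuation is not part of the stopword list. A minimal sketch of filtering both in one pass is shown below; `STOP_WORDS` is a hypothetical mini list used so the example runs without downloads, and the regex tokenizer is a stand-in for `nltk.word_tokenize`.

```python
import re

# Hypothetical mini stopword list; NLTK's English list is much longer.
STOP_WORDS = {"is", "a", "for", "the", "of", "in", "and"}

text = "NLP is a powerful tool for analyzing human language data."
tokens = re.findall(r"\w+|[^\w\s]", text)

# Keep a token only if it is alphanumeric and not a stopword (case-insensitive).
content_words = [t for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]
print(content_words)
# ['NLP', 'powerful', 'tool', 'analyzing', 'human', 'language', 'data']
```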


In [3]:
# Q8
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

words = ["running", "studies", "better", "flies"]

# Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

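# Lemmatization
# Note: lemmatize() defaults to pos="n" (noun), so verbs such as "running"
# come back unchanged unless pos="v" is passed.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in words]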

print("Original Words:", words)
print("Stems:", stems)
print("Lemmas:", lemmas)

[nltk_data] Downloading package wordnet to /root/nltk_data...


Original Words: ['running', 'studies', 'better', 'flies']
Stems: ['run', 'studi', 'better', 'fli']
Lemmas: ['running', 'study', 'better', 'fly']


In [6]:
# Q9
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
docs = [
    "NLP is a field of artificial intelligence",
    "Machine learning is used in NLP",
    "Deep learning has improved NLP performance"
]

# Create TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Convert the sparse matrix to a dense array for display
print("TF-IDF Scores:")
print(tfidf_matrix.toarray())

# Feature names (words)
print("\nFeature Names:")
print(vectorizer.get_feature_names_out())

TF-IDF Scores:
[[0.45050407 0.         0.45050407 0.         0.         0.
  0.45050407 0.34261996 0.         0.         0.26607496 0.45050407
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.4711101
  0.         0.35829137 0.35829137 0.4711101  0.27824521 0.
  0.         0.4711101 ]
 [0.         0.45050407 0.         0.45050407 0.45050407 0.
  0.         0.         0.34261996 0.         0.26607496 0.
  0.45050407 0.        ]]

Feature Names:
['artificial' 'deep' 'field' 'has' 'improved' 'in' 'intelligence' 'is'
 'learning' 'machine' 'nlp' 'of' 'performance' 'used']
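
To show where these numbers come from, the scores can be recomputed by hand. The sketch below assumes the vectorizer's default settings as I understand them: lowercasing, tokens of at least two word characters (which is why "a" is missing from the feature names), smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper `tfidf_row` is a name introduced for this example.

```python
import math
import re

docs = [
    "NLP is a field of artificial intelligence",
    "Machine learning is used in NLP",
    "Deep learning has improved NLP performance",
]

# Lowercase and keep tokens of two or more word characters (so "a" is dropped).
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {}
for term in vocab:
    df = sum(term in doc for doc in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    """Raw term counts times idf, then L2-normalized."""
    raw = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
print(vocab)
print([round(v, 4) for v in matrix[0]])
```

"nlp" appears in all three documents, so its IDF is exactly 1 and it receives the lowest non-zero weight in every row (about 0.266 in the first document), while words unique to one document score highest.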
