# Natural Language Processing (NLP) — Comprehensive Guide

---

## Introduction

Natural Language Processing (NLP) is an exciting branch of Artificial Intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language. From voice assistants like Siri and Alexa to automatic translation services and sentiment analysis on social media, NLP is at the core of many technologies we use daily.

This guide introduces key NLP concepts with clear explanations and simple, practical Python code examples. It is designed for beginners and developers looking to understand NLP fundamentals and build basic NLP applications.

---

## Why Learn NLP?

- **Unlock the power of human language:** Teach machines to read and understand text and speech.  
- **Improve user experiences:** Build chatbots, translators, and recommendation systems.  
- **Analyze huge text data:** Extract insights from reviews, articles, social media, and more.  
- **Grow your career:** NLP skills are in high demand in AI, data science, and software development.  

---

## What You Will Learn

1. Importing Libraries and Preparing Text
2. Text Cleaning Using Regular Expressions
3. Tokenization
4. Removing Stop Words
5. Stemming
6. Lemmatization
7. Parts of Speech Tagging
8. Named Entity Recognition
9. Text Vectorization (Bag of Words)
10. Word Embeddings
11. Practical NLP Project: Sentiment Analysis


### 1. Importing Libraries and Preparing Text

Before working on NLP tasks, you need essential Python libraries like `nltk` and `re` for text processing.


```python
import nltk
import re

# Download necessary datasets (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

text = "Natural Language Processing allows computers to understand human language!"
```


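*Explanation:*
We import `nltk` for natural language tools and `re` for regular expressions (pattern matching). The `nltk.download` calls fetch the resources used for tokenization, stop words, lemmatization, and POS tagging. Recent NLTK releases look for `punkt_tab` and `averaged_perceptron_tagger_eng` rather than the older package names, which is why sections 3 and 7 download those as well.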

### 2. Text Cleaning Using Regular Expressions

Cleaning text involves removing unwanted characters like numbers, punctuation, or special symbols.


```python
clean_text = re.sub('[^a-zA-Z]', ' ', text).lower()
print(clean_text)
```

```
natural language processing allows computers to understand human language
```


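*Explanation:*

- `re.sub('[^a-zA-Z]', ' ', text)` replaces every character that is not an ASCII letter (digits, punctuation, and accented letters included) with a space.

- `.lower()` converts the text to lowercase for consistency.

This prepares the text for easier analysis. Note that the final `!` becomes a trailing space in the output.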
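
As a small extension of the cleaning step, the sketch below wraps the same regular expression in a helper and also collapses the runs of spaces that the substitution leaves behind. The function name `normalize_text` and the sample strings are illustrative assumptions, not part of the original notebook.

```python
import re

def normalize_text(raw):
    """Keep ASCII letters only, lowercase, and collapse repeated spaces."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', raw)   # same pattern as in the cell above
    return ' '.join(letters_only.lower().split())    # split() discards extra spaces

print(normalize_text("It's 5 stars!!"))        # it s stars
print(normalize_text("Caf\u00e9 menu: $12"))   # caf menu
```

The second call shows a side effect worth knowing about: the accented letter is dropped along with the digits, so this pattern suits plain English text better than multilingual data.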

### 3. Tokenization: Splitting Text into Words

Tokenization breaks text into smaller pieces called tokens, usually words or sentences.


```python
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

tokens = word_tokenize(clean_text)
print(tokens)
```

```
['natural', 'language', 'processing', 'allows', 'computers', 'to', 'understand', 'human', 'language']
```



*Explanation:*
Tokenization is essential because most NLP models work on tokens rather than whole texts. word_tokenize splits the cleaned text into individual words.
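
To make the idea concrete without any downloads, here is a rough regex-based sketch of word and sentence tokenization. It is a simplification and not how `word_tokenize` works internally; the helper names `simple_word_tokens` and `simple_sentences` are hypothetical.

```python
import re

def simple_word_tokens(text):
    """Return runs of letters, digits, or apostrophes as word tokens."""
    return re.findall(r"[A-Za-z0-9']+", text)

def simple_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "NLP is fun. Tokens aren't always words!"
print(simple_sentences(sample))     # ['NLP is fun.', "Tokens aren't always words!"]
print(simple_word_tokens(sample))   # ['NLP', 'is', 'fun', 'Tokens', "aren't", 'always', 'words']
```

A heuristic like this breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer for real text.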

### 4. Removing Stop Words

Stop words are common words like "is", "the", "and" that often do not carry significant meaning and can be removed to reduce noise.


```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print(filtered_tokens)
```

```
['natural', 'language', 'processing', 'allows', 'computers', 'understand', 'human', 'language']
```


*Explanation:*
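Removing stop words shrinks the vocabulary and lets later steps focus on content-bearing words. It is not always safe, though: in sentiment analysis a word like "not" carries meaning, so the project at the end of this guide keeps it.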

### 5. Stemming: Reduce Words to Root Form

Stemming is a crude way to reduce words to their base or root by chopping off endings.


```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_tokens = [ps.stem(word) for word in filtered_tokens]
print(stemmed_tokens)
```

```
['natur', 'languag', 'process', 'allow', 'comput', 'understand', 'human', 'languag']
```


*Explanation:*
Stemming algorithms like the Porter stemmer strip suffixes by rule (“running” → “run”). Because they do not consult a dictionary, they often produce non-words, as in the output above: “language” → “languag” and “computers” → “comput”.

### 6. Lemmatization: More Accurate Word Base Forms

Lemmatization reduces words to their dictionary form (lemma) by looking them up in a lexical database (WordNet here) rather than chopping off endings.


```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized_tokens)
```

```
['natural', 'language', 'processing', 'allows', 'computer', 'understand', 'human', 'language']
```


*Explanation:*
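Lemmatization returns real words and can take the part of speech into account: `lemmatizer.lemmatize("better", pos="a")` returns “good”. When no `pos` is given, `WordNetLemmatizer` treats every word as a noun, which is why “computers” becomes “computer” above while the verb “allows” is left unchanged. It is more linguistically informed than stemming.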

### 7. Parts of Speech Tagging

Assigning each word a grammatical role such as noun, verb, or adjective helps reveal sentence structure and meaning.


```python
import nltk
nltk.download('averaged_perceptron_tagger_eng')

pos_tags = nltk.pos_tag(filtered_tokens)
print(pos_tags)
```


```
[('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('allows', 'VBZ'), ('computers', 'NNS'), ('understand', 'VBP'), ('human', 'JJ'), ('language', 'NN')]
```


*Explanation:*
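POS tags tell you whether a word is a noun, verb, adjective, and so on. The tags follow the Penn Treebank convention: NN is a singular noun, NNS a plural noun, JJ an adjective, and VBZ and VBP are present-tense verb forms. This information is valuable for syntactic analysis and advanced NLP tasks. Taggers rely on context, so running them on the original sentence rather than on stop-word-filtered tokens generally gives more reliable tags.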

### 8. Named Entity Recognition (NER) with spaCy

NER finds and classifies real-world entities like people, locations, and organizations in text.


```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple was founded by Steve Jobs in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```

```
Apple ORG
Steve Jobs PERSON
California GPE
```


*Explanation:*
NER extracts important information like company names, people’s names, places, dates, and monetary values, crucial for information extraction and knowledge graphs.

### 9. Text Vectorization: Bag of Words Example

Machine learning models operate on numbers, not raw text. Bag of Words (BoW) converts each document into a vector of word counts.


```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Natural Language Processing is fun.",
          "Text processing helps computers understand language."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```

```
['computers' 'fun' 'helps' 'is' 'language' 'natural' 'processing' 'text'
 'understand']
[[0 1 0 1 1 1 1 0 0]
 [1 0 1 0 1 0 1 1 1]]
```


*Explanation:*
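BoW counts how many times each word appears, ignoring grammar and word order. By default, `CountVectorizer` lowercases the text and drops punctuation and single-character tokens. The representation is simple but effective for many applications.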
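
To see what the vectorizer is doing, the counts can be rebuilt by hand with the standard library. This is a minimal sketch rather than scikit-learn's implementation; the `tokenize` helper is hypothetical and is only intended to mirror `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

corpus = ["Natural Language Processing is fun.",
          "Text processing helps computers understand language."]

def tokenize(doc):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*counts))                     # alphabetical column order
matrix = [[c[word] for word in vocabulary] for c in counts]   # one row per document

print(vocabulary)
for row in matrix:
    print(row)
```

Each row lines up with the vocabulary list, so the second column is the count of "fun" in each sentence, and so on.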

### 10. Word Embeddings with Word2Vec (Gensim)

Word embeddings capture semantic meanings by representing words as dense vectors in multi-dimensional space.


```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

sentences = [word_tokenize(sentence.lower()) for sentence in corpus]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

print(model.wv['language'])
```

```
[-0.01723938  0.00733148  0.01037977  0.01148388  0.01493384 -0.01233535
  0.00221123  0.01209456 -0.0056801  -0.01234705 -0.00082045 -0.0167379
 -0.01120002  0.01420908  0.00670508  0.01445134  0.01360049  0.01506148
 -0.00757831 -0.00112361  0.00469675 -0.00903806  0.01677746 -0.01971633
  0.01352928  0.00582883 -0.00986566  0.00879638 -0.00347915  0.01342277
  0.0199297  -0.00872489 -0.00119868 -0.01139127  0.00770164  0.00557325
  0.01378215  0.01220219  0.01907699  0.01854683  0.01579614 -0.01397901
 -0.01831173 -0.00071151 -0.00619968  0.01578863  0.01187715 -0.00309133
  0.00302193  0.00358008]
```


*Explanation:*
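Word2Vec learns a vector for each word from the contexts in which it appears. When trained on a large corpus, words used in similar contexts end up with nearby vectors, which enables similarity measurement and analogy solving. With only two sentences, as here, the printed vector carries little semantic information; it mainly shows the shape of the output.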
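
Closeness between embedding vectors is usually measured with cosine similarity. The sketch below computes it in plain Python; the three-dimensional vectors are made up for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for this example only
toy_vectors = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy_vectors["good"], toy_vectors["great"]))  # close to 1
print(cosine_similarity(toy_vectors["good"], toy_vectors["table"]))  # close to 0
```

With a gensim model, `model.wv.similarity('language', 'processing')` computes the same quantity on the learned vectors.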

## Practical NLP Project Example: Sentiment Analysis on Restaurant Reviews

This example demonstrates the full NLP pipeline — from loading data to predicting sentiment with a machine learning model.

---

### Step 1: Importing Required Libraries

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

*Explanation:*  
We start by importing three essential libraries:

- numpy for efficient numerical operations and handling arrays,

- matplotlib.pyplot for data visualization (optional but useful for later analysis),

- pandas for data loading and manipulation, particularly to read the dataset into a structured DataFrame format.

---

### Step 2: Loading the Dataset

```python
# Forward slashes avoid the invalid '\R' escape and work on Windows too
dataset = pd.read_csv('NLP/Restaurant_Reviews.tsv', delimiter='\t', quoting=3)  # 3 == csv.QUOTE_NONE
```

*Explanation:*  
We load the dataset of restaurant reviews using pandas’ read_csv method.

- The dataset file is a tab-separated values (TSV) file, so we specify the delimiter as a tab (\t).

- quoting=3 (the value of csv.QUOTE_NONE) tells pandas to treat double quotes as ordinary characters rather than as field delimiters, so stray quotes inside a review cannot break parsing.

- The resulting dataset DataFrame contains two main columns: the review text and the sentiment label (0 for negative, 1 for positive).
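
The effect of the quoting setting can be tried out with Python's built-in `csv` module, which defines the constants that pandas accepts. The two-row sample below is hypothetical (including the `Label` column name) and stands in for the real file.

```python
import csv
import io

# Hypothetical sample with a quoted word and an unmatched opening quote
sample = 'Review\tLabel\nThe "special" was great\t1\n"Worst burger in town\t0\n'

print(csv.QUOTE_NONE)  # 3, the value passed as quoting=3

reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)
for row in rows:
    print(row)
```

With `QUOTE_NONE`, every line is split on tabs only, so the unmatched quote in the last row stays part of the review text instead of being read as the start of a quoted field.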

---

### Step 3: Text Cleaning and Preprocessing

```python
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build the stemmer and the stop-word set once, outside the loop
ps = PorterStemmer()
all_stopwords = set(stopwords.words('english'))
all_stopwords.discard('not')  # keep "not": it flips the sentiment

corpus = []
for raw_review in dataset['Review']:  # every row, so corpus stays aligned with the labels
    review = re.sub('[^a-zA-Z]', ' ', raw_review)  # keep letters only
    review = review.lower().split()                # lowercase, split on whitespace
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    corpus.append(' '.join(review))

print(corpus[:5])
```

*Explanation:*  
This loop performs several important preprocessing steps on each review:

- Cleaning: Using regular expressions to remove any characters that are not letters (removes punctuation, numbers, etc.) so the text is cleaner and more consistent.

- Lowercasing: Converting all letters to lowercase to treat words like “Good” and “good” as the same token.

- Tokenization: Splitting the text into individual words for processing.

- Stopwords Removal: Common, less informative words like “the”, “is”, and “and” are removed. However, “not” is kept because it negates meaning and is important for sentiment.

- Stemming: Each word is reduced to its root form to consolidate different forms of a word (e.g., “loved”, “loving” → “love”).
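
The point about keeping “not” is easy to check on a single sentence. The sketch below uses a tiny hand-written stop-word set as a stand-in for NLTK's list (which is much longer); `content_words` and `toy_stopwords` are hypothetical names.

```python
import re

# Hypothetical mini stop-word list, for illustration only
toy_stopwords = {"the", "was", "is", "a", "and", "not"}

def content_words(text, stop_words):
    """Clean, lowercase, split, and drop any word found in stop_words."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return [w for w in words if w not in stop_words]

review = "The food was not good"
print(content_words(review, toy_stopwords))            # ['food', 'good']
print(content_words(review, toy_stopwords - {"not"}))  # ['food', 'not', 'good']
```

In the first result the negative review is indistinguishable from “the food was good”, which is exactly the information a sentiment classifier needs.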

---

### Step 4: Creating the Bag of Words Model

```python
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values
```

*Explanation:*  
- The Bag of Words (BoW) model transforms the cleaned text into a matrix of token counts.

- CountVectorizer builds a vocabulary of the most frequent 1500 words across all reviews (using max_features=1500).

- Each review is converted into a fixed-length vector where each position counts how many times a vocabulary word appears.

- X is the feature matrix representing all reviews numerically, suitable for machine learning.

- y extracts the sentiment labels (target values) from the dataset for supervised learning.

---

### Step 5: Splitting Data into Training and Test Sets

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
```

*Explanation:*  
- We split the dataset into training and testing sets.

- test_size=0.20 means 20% of data is held out for testing to evaluate the model’s performance on unseen data.

- random_state=0 ensures reproducibility of the split.
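
What a fixed seed buys you can be shown with a hand-rolled split. This is only a sketch of the idea, not scikit-learn's algorithm, and it will not reproduce the exact rows that `train_test_split` selects; `simple_split` is a hypothetical helper.

```python
import random

def simple_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of items with a fixed seed, then cut off the test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(10))
print(len(train), len(test))                      # 8 2
print(simple_split(range(10)) == (train, test))   # True: same seed, same split
```

Changing `seed` gives a different but equally valid split, which is why reported accuracy can vary slightly from run to run when the seed is not fixed.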

---

### Step 6: Training the Naive Bayes Classifier

```python
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
```

*Explanation:*  
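- We use a Gaussian Naive Bayes classifier, a simple probabilistic model that trains quickly. GaussianNB models each feature with a normal distribution; for word-count features, MultinomialNB is a common alternative worth comparing.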

- Naive Bayes assumes feature independence and uses Bayes’ theorem to calculate the probability of each class given the input features.

- The fit method trains the model on the training data, learning patterns that associate word counts with positive or negative sentiment.
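
Written out, the independence assumption leads to the decision rule below. The notation is introduced here for illustration: $c$ is a class (0 or 1), $x_i$ is the count of vocabulary word $i$ in a review, and $\mu_{c,i}$ and $\sigma_{c,i}^{2}$ are the mean and variance of that count among training reviews of class $c$.

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right)
$$

The product over features is where the "naive" independence assumption enters: each word count contributes its own factor, regardless of which other words appear.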

---

### Step 7: Predicting the Test Set Results

```python
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
```

*Explanation:*  
- We use the trained model to predict sentiments on the test set.

- Predictions (y_pred) are compared side-by-side with actual labels (y_test) to visually inspect model accuracy.

- The concatenated output helps identify which predictions were correct or incorrect.

---

### Step 8: Evaluating Model Performance

```python
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

*Explanation:*  
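- The confusion matrix counts true negatives, false positives, false negatives, and true positives. With labels 0 and 1, scikit-learn lays it out as [[TN, FP], [FN, TP]]: rows are actual classes and columns are predicted classes.

- accuracy_score returns the fraction of correct predictions, a value between 0 and 1.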

- Together, these metrics give a clear understanding of how well the sentiment analysis model performs.
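
The same numbers can be computed by hand, which makes the matrix layout easier to remember. The labels below are made up for illustration and are not results from the restaurant dataset.

```python
# Hypothetical labels (1 = positive, 0 = negative), invented for this example
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))   # (actual, predicted)
tn = pairs.count((0, 0))
fp = pairs.count((0, 1))
fn = pairs.count((1, 0))
tp = pairs.count((1, 1))

print([[tn, fp], [fn, tp]])        # [[3, 1], [1, 3]]

accuracy = (tp + tn) / len(pairs)  # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted positives that are right
recall = tp / (tp + fn)            # share of actual positives that are found
print(accuracy, precision, recall) # 0.75 0.75 0.75
```

Precision and recall are worth checking alongside accuracy, since accuracy alone can look good when one class dominates the data.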
