<a href="https://colab.research.google.com/github/prabhu-patil/NLP-using-Python/blob/main/Natural_Language_Processing_(NLP)_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Text Preprocessing

Text preprocessing turns raw text into clean, normalized tokens before feature extraction and model training.

# 1.1 Tokenization

Using NLTK for sentence and word tokenization; spaCy is used later for named entity recognition.

In [1]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the 'punkt_tab' tokenizer data that word_tokenize and sent_tokenize rely on
nltk.download('punkt_tab')

text = "Natural Language Processing is amazing! Let's explore it."
word_tokens = word_tokenize(text)
sent_tokens = sent_tokenize(text)

print("Word Tokens:", word_tokens)
print("Sentence Tokens:", sent_tokens)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Word Tokens: ['Natural', 'Language', 'Processing', 'is', 'amazing', '!', 'Let', "'s", 'explore', 'it', '.']
Sentence Tokens: ['Natural Language Processing is amazing!', "Let's explore it."]
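
To make the idea concrete, here is a rough standard-library sketch of what word and sentence tokenization do. The regular expressions and the helper names `simple_word_tokenize` and `simple_sent_tokenize` are assumptions for illustration only; NLTK's tokenizers handle contractions, abbreviations, and many other edge cases that this sketch does not.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Natural Language Processing is amazing! Let's explore it."
words = simple_word_tokenize(text)
sentences = simple_sent_tokenize(text)

print("Word tokens:", words)
print("Sentence tokens:", sentences)
```

Unlike `word_tokenize`, this sketch splits "Let's" into three tokens (`Let`, `'`, `s`) rather than `Let` and `'s`, which is one reason a purpose-built tokenizer is usually preferable.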


# 1.2 Stopword Removal

Removing common words that contribute little meaning, such as "is" and "it". Punctuation tokens are not stop words, so they remain in the output.

In [2]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# A set gives fast membership tests; the stop-word list is lowercase, so compare lowercased tokens
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in word_tokens if word.lower() not in stop_words]

print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['Natural', 'Language', 'Processing', 'amazing', '!', 'Let', "'s", 'explore', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
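
The filtered output still contains punctuation and the contraction fragment `'s`, because neither appears in the stop-word list. One possible refinement, sketched here with a tiny hand-written stop-word set (`TOY_STOP_WORDS` is a stand-in for illustration, not NLTK's list), is to keep only alphabetic tokens:

```python
# Tokens as produced by word_tokenize in the tokenization step
word_tokens = ['Natural', 'Language', 'Processing', 'is', 'amazing', '!',
               'Let', "'s", 'explore', 'it', '.']

# Hypothetical miniature stop-word set, for illustration only
TOY_STOP_WORDS = {"is", "it", "the", "a", "an", "and"}

# Keep a token only if it is purely alphabetic and not a stop word
clean_tokens = [
    tok for tok in word_tokens
    if tok.isalpha() and tok.lower() not in TOY_STOP_WORDS
]

print("Clean tokens:", clean_tokens)
```

Whether to drop punctuation depends on the task; sentiment models, for example, sometimes benefit from keeping exclamation marks.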


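# 1.3 Lemmatization and Stemming

Using the WordNet lemmatizer and the Porter stemmer. `lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why the lemmatized tokens in the output are unchanged, while the stemmer lowercases each token and strips its suffix.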

In [3]:
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

print("Lemmatized Tokens:", lemmatized_tokens)
print("Stemmed Tokens:", stemmed_tokens)

[nltk_data] Downloading package wordnet to /root/nltk_data...


Lemmatized Tokens: ['Natural', 'Language', 'Processing', 'amazing', '!', 'Let', "'s", 'explore', '.']
Stemmed Tokens: ['natur', 'languag', 'process', 'amaz', '!', 'let', "'s", 'explor', '.']
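
The difference between the two approaches can be sketched without any library: stemming strips suffixes by rule, while lemmatization looks words up in a vocabulary. The suffix list, the `TOY_LEMMAS` dictionary, and both function names below are made up for illustration and are far simpler than the Porter algorithm or WordNet.

```python
# Hypothetical rules and vocabulary, for illustration only
TOY_SUFFIXES = ("ing", "ed", "es", "s")
TOY_LEMMAS = {"amazing": "amaze", "studies": "study", "explored": "explore"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the lowercased word."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

for w in ["amazing", "studies", "explored"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stems (`amaz`, `studi`, `explor`) need not be real words, whereas the lemmas (`amaze`, `study`, `explore`) are dictionary forms.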


# Step 2: Feature Extraction

Transforming text into numerical features for model input.

# 2.1 Bag-of-Words (BoW)

Using CountVectorizer to convert text into a matrix of token counts, with one row per document and one column per vocabulary term.
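
As a minimal sketch of what a bag-of-words matrix contains, the counts can be built by hand with the standard library. The `tokenize` helper is an assumption that mirrors CountVectorizer's default behavior (lowercasing and keeping tokens of two or more word characters), so it should yield the same vocabulary and counts for this small corpus.

```python
import re
from collections import Counter

corpus = ["NLP is fun and exciting.", "Machine learning enhances NLP capabilities."]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document, one column per vocabulary term
bow_matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[term] for term in vocabulary])

print("Vocabulary:", vocabulary)
for row in bow_matrix:
    print(row)
```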

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["NLP is fun and exciting.", "Machine learning enhances NLP capabilities."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Feature Names:", vectorizer.get_feature_names_out())
print("BoW Representation:\n", X.toarray())

Feature Names: ['and' 'capabilities' 'enhances' 'exciting' 'fun' 'is' 'learning'
 'machine' 'nlp']
BoW Representation:
 [[1 0 0 1 1 1 0 0 1]
 [0 1 1 0 0 0 1 1 1]]


# 2.2 TF-IDF (Term Frequency-Inverse Document Frequency)

Measuring how important a word is to a document relative to the rest of the corpus: terms that appear in many documents receive lower weights.
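
TF-IDF scales each term count by how rare the term is across the corpus. As a rough sketch of the arithmetic, assuming scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), the weights for the two-sentence corpus can be computed by hand. The `tokenize` helper is an illustrative stand-in for the vectorizer's default tokenization.

```python
import math
import re
from collections import Counter

corpus = ["NLP is fun and exciting.", "Machine learning enhances NLP capabilities."]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [Counter(tokenize(d)) for d in corpus]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    df = sum(1 for d in docs if term in d)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

tfidf_matrix = []
for d in docs:
    raw = [d[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    tfidf_matrix.append([v / norm for v in raw])

for row in tfidf_matrix:
    print([round(v, 8) for v in row])
```

Because "nlp" occurs in both documents, its IDF is 1.0 and its normalized weight (about 0.335) is lower than that of terms unique to one document (about 0.471).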


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print("TF-IDF Representation:\n", X_tfidf.toarray())

TF-IDF Representation:
 [[0.47107781 0.         0.         0.47107781 0.47107781 0.47107781
  0.         0.         0.33517574]
 [0.         0.47107781 0.47107781 0.         0.         0.
  0.47107781 0.47107781 0.33517574]]


# Step 3: Building and Training NLP Models

# 3.1 Sentiment Analysis with Naive Bayes


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Toy dataset: three examples only illustrate the API and are far too few for real training
data = [
    ("I love this product!", "positive"),
    ("This is the worst experience ever.", "negative"),
    ("It's okay, not great but not terrible.", "neutral"),
]

texts, labels = zip(*data)
# With three samples, test_size=0.2 holds out exactly one example
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Create pipeline: TF-IDF features feeding a multinomial Naive Bayes classifier
model = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Train model
model.fit(X_train, y_train)

# Predict on the held-out example
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

Accuracy: 0.0

The 0.0 accuracy is an artifact of the toy data: the single held-out example carries a label that never appears in the two training examples, so the classifier cannot predict it.


# Step 4: Model Evaluation

# 4.1 Classification Metrics

Evaluating model performance using precision, recall, F1-score, and the confusion matrix.


# Confusion Matrix (Error Analysis)

In [7]:
from sklearn.metrics import classification_report, confusion_matrix

print("Classification Report:\n", classification_report(y_test, predictions))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))

Classification Report:
               precision    recall  f1-score   support

    negative       0.00      0.00      0.00       0.0
    positive       0.00      0.00      0.00       1.0

    accuracy                           0.00       1.0
   macro avg       0.00      0.00      0.00       1.0
weighted avg       0.00      0.00      0.00       1.0

Confusion Matrix:
 [[0 0]
 [1 0]]
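
Precision, recall, and F1 all derive from confusion-matrix counts. As a minimal sketch with made-up labels (the `y_true` and `y_pred` lists and the `class_metrics` helper are hypothetical and unrelated to the run above), the per-class values can be computed directly:

```python
# Hypothetical labels, for illustration only
y_true = ["positive", "positive", "negative", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "positive", "negative"]

def class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    # Report 0.0 when a denominator is zero (the metric is undefined)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

precision, recall, f1 = class_metrics(y_true, y_pred, "positive")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The zero-denominator branch corresponds to the situation in the report above, where a class receives no correct predictions and scikit-learn reports 0.00 for the undefined metrics.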




# Named Entity Recognition (NER)

In [8]:
# Requires spaCy and its small English model:
#   !pip install spacy
#   !python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for ent in doc.ents:
    print(ent.text, "-", ent.label_)

Apple - ORG
U.K. - GPE
$1 billion - MONEY
