
**NLP components**

**Importing Libraries**

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
# Download NLTK resources (only needed once)
nltk.download('stopwords')
nltk.download('punkt')  # punkt is the tokenizer model NLTK uses to split text into sentences and words; newer NLTK releases may ask for 'punkt_tab' instead


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:

text = "This is a simple example to show how to remove stopwords in Python."

In [None]:
# Tokenize the text
words = word_tokenize(text)
print(words)

['This', 'is', 'a', 'simple', 'example', 'to', 'show', 'how', 'to', 'remove', 'stopwords', 'in', 'Python', '.']


In [None]:
# Get stop words for English
stop_words = set(stopwords.words('english'))
print(len(stop_words))

179


In [None]:
# Remove stop words
filtered_sentence = [word for word in words if word.lower() not in stop_words]
filtered_sentence

['simple', 'example', 'show', 'remove', 'stopwords', 'Python', '.']

**Stemming** and **lemmatization** are two common text preprocessing techniques in natural language processing (NLP). Both reduce inflected words to a common base form, so that variants such as "running" and "runs" are treated as the same term, which can improve the performance of NLP models.

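# **Stemming**

Stemming reduces words to a base or root form by stripping affixes with fixed rules, so the result is not always a dictionary word (for example, the Porter stemmer turns "easily" into "easili"). The nltk library provides several stemmers.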
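
To make the idea of rule-based suffix stripping concrete, here is a minimal sketch in plain Python. It is not the Porter algorithm or any NLTK stemmer; the suffix list and the `toy_stem` helper are hypothetical choices made only for illustration.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, only an illustration.
# The suffix list is a hypothetical choice made for this example.
SUFFIXES = ("ing", "ly", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["running", "runs", "easily", "fairly", "better", "best"]
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)  # ['runn', 'run', 'easi', 'fair', 'better', 'best']
```

The crude rules leave "runn" and "easi" behind; real stemmers add further rules (for example, for doubled consonants) to handle such cases.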

# **PorterStemmer**

PorterStemmer is NLTK's implementation of the Porter stemming algorithm, which strips common English suffixes by applying a fixed sequence of rules.

In [None]:
import nltk
from nltk.stem import PorterStemmer

# Download the NLTK data files (only needed once; punkt is used for tokenization, not by the stemmer itself)
nltk.download('punkt')

# Create a Porter Stemmer object
stemmer = PorterStemmer()

# Example words
words = ["running", "runs", "easily", "fairly", "better", "best"]

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

print("Original words:", words)
print("Stemmed words:", stemmed_words)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Original words: ['running', 'runs', 'easily', 'fairly', 'better', 'best']
Stemmed words: ['run', 'run', 'easili', 'fairli', 'better', 'best']


# **SnowballStemmer**

In [None]:
import nltk
from nltk.stem import SnowballStemmer

# Download NLTK data files (only needed once; punkt is used for tokenization, not by the stemmer itself)
nltk.download('punkt')

# Create Snowball Stemmer object for English
nltk_stemmer = SnowballStemmer('english')

# Example words
words = ["running", "runs", "easily", "fairly", "better", "best"]

# Apply stemming using nltk
nltk_stemmed_words = [nltk_stemmer.stem(word) for word in words]

# Display results
print(f"{'Original':<15} {'NLTK Stemmed':<15}")
print("="*30)

for original, nltk_stemmed in zip(words, nltk_stemmed_words):
    print(f"{original:<15} {nltk_stemmed:<15}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Original        NLTK Stemmed   
running         run            
runs            run            
easily          easili         
fairly          fair           
better          better         
best            best           


# **LancasterStemmer**

In [None]:
import nltk
from nltk.stem import LancasterStemmer

# Download NLTK data files (only needed once; punkt is used for tokenization, not by the stemmer itself)
nltk.download('punkt')

# Create Lancaster Stemmer object
stemmer = LancasterStemmer()

# Example words
words = ["running", "runs", "easily", "fairly", "better", "best"]

# Apply stemming using Lancaster Stemmer
stemmed_words = [stemmer.stem(word) for word in words]

print(f"{'Original':<15} {'Lancaster Stemmed':<15}")
print("="*30)

for original, stemmed in zip(words, stemmed_words):
    print(f"{original:<15} {stemmed:<15}")


Original        Lancaster Stemmed
running         run            
runs            run            
easily          easy           
fairly          fair           
better          bet            
best            best           


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Lemmatization**

Lemmatization reduces words to their base or dictionary form (the lemma). It is more sophisticated than stemming because it relies on a vocabulary and on the word's part of speech, so it often produces more meaningful results. The nltk library supports lemmatization through the WordNet Lemmatizer.

wordnet: This dataset gives the lemmatizer access to the WordNet lexical database, which provides word relationships and base forms.

omw-1.4: This is the Open Multilingual WordNet dataset, which links WordNet to other languages. Some NLTK releases require it to be present when WordNet is loaded.
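
The role of the part of speech can be sketched with a toy dictionary lookup. The `TOY_LEMMAS` table and the `toy_lemmatize` helper are hypothetical stand-ins written for this example; they are not how WordNet or NLTK store or look up lemmas.

```python
# Toy dictionary-based lemmatizer. The entries are a hand-written,
# hypothetical stand-in for a real lexical database such as WordNet.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("runs", "v"): "run",
    ("runs", "n"): "run",
    ("better", "a"): "good",
    ("best", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("running", pos="n"))  # running
print(toy_lemmatize("better", pos="a"))   # good
print(toy_lemmatize("better", pos="v"))   # better
```

The same surface form maps to different lemmas depending on the part of speech, which is why lemmatizers either take a `pos` argument or run a tagger first.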

# **WordNet Lemmatizer**

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

# Download the NLTK data files (only needed once)
nltk.download('wordnet')
nltk.download('omw-1.4')

# Create a WordNet Lemmatizer object
lemmatizer = WordNetLemmatizer()

# Example words
words = ["running", "runs", "easily", "fairly", "better", "best"]

# Apply lemmatization; pos='v' treats every word as a verb (the default is 'n', noun)
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]

print("Original words:", words)
print("Lemmatized words:", lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Original words: ['running', 'runs', 'easily', 'fairly', 'better', 'best']
Lemmatized words: ['run', 'run', 'easily', 'fairly', 'better', 'best']


# **spaCy**

spaCy is another popular NLP library. It supports lemmatization and picks the part of speech for each token itself, but it does not include a stemmer.

In [None]:
import spacy

# Load the small English pipeline (tokenizer, tagger, dependency parser, NER, lemmatizer)
nlp = spacy.load("en_core_web_sm")

# Example text
text = "running runs easily fairly better best"

# Process the text
doc = nlp(text)

# Extract lemmatized tokens
lemmatized_words = [token.lemma_ for token in doc]

# Display results
print(f"{'Original':<15} {'Lemmatized':<15}")
print("="*30)

for original, lemmatized in zip(text.split(), lemmatized_words):
    print(f"{original:<15} {lemmatized:<15}")


Original        Lemmatized     
running         run            
runs            run            
easily          easily         
fairly          fairly         
better          well           
best            good           


# **TextBlob**

In [None]:
from textblob import Word

# Example words
words = ["running", "runs", "easily", "fairly", "better", "best"]

# Lemmatize each word using TextBlob; with no pos argument, words are treated as nouns,
# so "running" is left unchanged
lemmatized_words = [Word(word).lemmatize() for word in words]

print("Original words:", words)
print("Lemmatized words:", lemmatized_words)


Original words: ['running', 'runs', 'easily', 'fairly', 'better', 'best']
Lemmatized words: ['running', 'run', 'easily', 'fairly', 'better', 'best']


# **Text Normalization**

Text normalization is an essential preprocessing step in natural language processing (NLP) and text analysis. It converts text into a standardized format so that superficial variations (case, punctuation, digits, inflection) do not produce distinct tokens. The normalize_text function in this section applies lowercasing, punctuation and number removal, tokenization, stop word removal, and lemmatization in turn.
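
The first few normalization steps can be sketched with the standard library alone. This is a simplified illustration, not a replacement for the NLTK and spaCy pipeline in this section: `TOY_STOP_WORDS` is a hypothetical, deliberately tiny stop word list, tokens are split on whitespace, and no lemmatization is performed.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list used only for this sketch.
TOY_STOP_WORDS = {"the", "are", "over", "a", "is"}

def basic_normalize(text):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

tokens = basic_normalize("The 2 quick brown foxes are jumping over the lazy dogs.")
print(tokens)  # ['quick', 'brown', 'foxes', 'jumping', 'lazy', 'dogs']
```

Without a lemmatizer, "foxes", "jumping", and "dogs" keep their inflected forms, which is the gap the lemmatization step fills.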

In [None]:
import nltk
nltk.download('stopwords')
import re
import string
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Initialize NLP tools
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

def normalize_text(text):
    # Lowercase
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    doc = nlp(" ".join(tokens))
    lemmatized_tokens = [token.lemma_ for token in doc]

    return ' '.join(lemmatized_tokens)

text = "The quick brown foxes are jumping over the lazy dogs."
normalized_text = normalize_text(text)
print(normalized_text)  # Output: "quick brown fox jump lazy dog"


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


quick brown fox jump lazy dog


# **Bag of Words (BoW)**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
v = CountVectorizer()

# Fit the vectorizer on the sample text
v.fit(["Thor Hathodawala is looking for a job"])

# Get the vocabulary: a mapping from each term to its column index
vocabulary = v.vocabulary_

# Display the vocabulary
print(vocabulary)


{'thor': 5, 'hathodawala': 1, 'is': 2, 'looking': 4, 'for': 0, 'job': 3}
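
The vocabulary above can be reproduced by hand, which shows what a bag-of-words model stores. The sketch below is plain Python and only approximates CountVectorizer: it assumes the default behaviour of lowercasing and keeping tokens of two or more word characters, and the `tokenize` and `to_count_vector` helpers are hypothetical names for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # which is why the single-letter word "a" is dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = ["Thor Hathodawala is looking for a job"]

# Vocabulary: each distinct token mapped to its alphabetical position.
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocab = {tok: idx for idx, tok in enumerate(all_tokens)}

def to_count_vector(text):
    """Count how often each vocabulary term occurs; unknown words are ignored."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in all_tokens]

print(vocab)
print(to_count_vector("Thor is looking; Thor is hiring"))  # [0, 0, 2, 0, 1, 2]
```

Word order is discarded: each document becomes a vector of counts indexed by the vocabulary, and words outside the vocabulary (here "hiring") contribute nothing.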


# **N-grams**

In [None]:
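# ngram_range=(1, 2) keeps both single words (unigrams) and pairs of adjacent words (bigrams)
v = CountVectorizer(ngram_range=(1,2))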
v.fit(["Thor Hathodawala is looking for a job"])
v.vocabulary_

{'thor': 9,
 'hathodawala': 2,
 'is': 4,
 'looking': 7,
 'for': 0,
 'job': 6,
 'thor hathodawala': 10,
 'hathodawala is': 3,
 'is looking': 5,
 'looking for': 8,
 'for job': 1}

In [None]:
v = CountVectorizer(ngram_range=(1,3))
v.fit(["Thor Hathodawala is looking for a job"])
v.vocabulary_

{'thor': 12,
 'hathodawala': 2,
 'is': 5,
 'looking': 9,
 'for': 0,
 'job': 8,
 'thor hathodawala': 13,
 'hathodawala is': 3,
 'is looking': 6,
 'looking for': 10,
 'for job': 1,
 'thor hathodawala is': 14,
 'hathodawala is looking': 4,
 'is looking for': 7,
 'looking for job': 11}
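
How the n-gram vocabulary is assembled can be sketched without scikit-learn. The `word_ngrams` helper is a hypothetical name for this example, and the token list assumes the single-letter word "a" has already been dropped during tokenization.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i : i + n]))
    return grams

tokens = ["thor", "hathodawala", "is", "looking", "for", "job"]

grams = word_ngrams(tokens, 1, 2)
# Assign column indices in alphabetical order of the n-grams.
ngram_vocab = {gram: idx for idx, gram in enumerate(sorted(grams))}
print(ngram_vocab)
```

With a range of (1, 2) this yields the same 11 entries and indices as the first n-gram output above; widening the range to (1, 3) adds the four trigrams.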

# **TfidfVectorizer**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog sat on the log."
]

# Create a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Convert to a dense array and view the result
dense = tfidf_matrix.toarray()
print(dense)

# Get feature names (terms)
print(vectorizer.get_feature_names_out())


[[0.44554752 0.         0.         0.44554752 0.31701073 0.31701073
  0.63402146]
 [0.         0.44554752 0.44554752 0.         0.31701073 0.31701073
  0.63402146]]
['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
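
The numbers in this matrix can be recomputed by hand. The sketch below assumes the TfidfVectorizer defaults (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row); the pre-tokenized documents and the `tfidf_row` helper are written for this example only.

```python
import math
from collections import Counter

# The two sample documents, already lowercased and tokenized.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]
terms = sorted({t for doc in docs for t in doc})
n_docs = len(docs)

# Document frequency and smoothed inverse document frequency per term.
df = {t: sum(t in doc for doc in docs) for t in terms}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in terms}

def tfidf_row(doc):
    """Weight raw counts by IDF, then scale the row to unit Euclidean length."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in terms]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

rows = [tfidf_row(doc) for doc in docs]
print(terms)
for row in rows:
    print([round(x, 4) for x in row])
```

Terms that occur in both documents ("on", "sat", "the") get an IDF of 1, while "cat" and "mat" get a higher weight because they appear in only one; "the" still scores highest in each row because it occurs twice.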


# **Word2Vec Embedding**

In [None]:
%pip install gensim




In [None]:
from gensim.models import Word2Vec

# Sample corpus
sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "barks"],
    ["cats", "and", "dogs", "are", "friends"],
    ["the", "cat", "and", "the", "dog", "are", "playing"]
]

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Save the model
model.save("word2vec.model")


In [None]:
# Load the model
model = Word2Vec.load("word2vec.model")

# Get the vector for a specific word
cat_vector = model.wv["cat"]
print("Vector for 'cat':", cat_vector)

# Find similar words
similar_words = model.wv.most_similar("cat", topn=5)
print("Words similar to 'cat':", similar_words)


Vector for 'cat': [-0.00713986  0.00124088 -0.00717995 -0.00224818  0.00372279  0.00583649
  0.00119641  0.00210052 -0.0041096   0.00722774 -0.00631051  0.00464887
 -0.00822053  0.00203516 -0.00497694 -0.0042481  -0.00311     0.00565629
  0.00580238 -0.00497291  0.00077116 -0.00849946  0.00780949  0.00925927
 -0.00274503  0.00080148  0.00074716  0.00547721 -0.00860547  0.00058599
  0.00687331  0.00223275  0.00112166 -0.00932197  0.00848464 -0.00626474
 -0.00299564  0.00349585 -0.00077315  0.00141302  0.00178579 -0.0068322
 -0.00972374  0.00904355  0.00619953 -0.00691385  0.0034063   0.00020263
  0.00475308 -0.00712433  0.00403038  0.00434652  0.00996038 -0.0044747
 -0.00139274 -0.00731768 -0.00970131 -0.00908031 -0.00102227 -0.00650779
  0.00485144 -0.00616542  0.00252111  0.00074205 -0.0033922  -0.00097998
  0.00998134  0.00914575 -0.00446196  0.00908382 -0.00564441  0.00592973
 -0.00309761  0.00343544  0.0030169   0.00690159 -0.00237646  0.00877759
  0.00759191 -0.00954651 -0.0080093

In [None]:
# Find the 10 closest words in the vector space
model.wv.most_similar("cat", topn=10)

[('and', 0.1702033281326294),
 ('sits', 0.1501343548297882),
 ('playing', 0.13904547691345215),
 ('friends', 0.034870728850364685),
 ('are', 0.004501866642385721),
 ('on', -0.005923843942582607),
 ('the', -0.027924194931983948),
 ('dogs', -0.028515880927443504),
 ('dog', -0.0444510318338871),
 ('cats', -0.06903860718011856)]
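
The scores returned by most_similar are cosine similarities between word vectors. The sketch below shows that ranking in plain Python; the 3-dimensional `toy_vectors` are invented for this example and have nothing to do with the trained model, and `toy_most_similar` is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented for this example only.
toy_vectors = {
    "cat": [1.0, 0.0, 1.0],
    "dog": [1.0, 0.1, 0.9],
    "mat": [0.0, 1.0, 0.0],
}

def toy_most_similar(word, topn=2):
    """Rank every other word by cosine similarity to the given word."""
    target = toy_vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in toy_vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_most_similar("cat")
print(ranking)
```

Here "dog" ranks first because its vector points in nearly the same direction as "cat". In the notebook output above the scores are all close to zero, which is what one would expect from a model trained on only four short sentences: the vectors have barely moved from their random initialization.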