<a href="https://colab.research.google.com/github/laibaabbas/NLP/blob/main/NLP_Basic_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🧠 Natural Language Processing (NLP)

## 1. Introduction to NLP
Natural Language Processing (NLP) is a subfield of AI that enables computers to understand, interpret, and generate human language.

**Applications:** Chatbots, sentiment analysis, translation, summarization, question answering, etc.



### Why NLP?
- Enables machines to communicate with humans in natural language

- Helps extract information from text data (emails, tweets, reviews, etc.)

- Powers many AI systems: chatbots, translators, summarizers, etc.


### Examples of NLP Applications

| Application           | Example                             |
| --------------------- | ----------------------------------- |
| Sentiment Analysis    | “This product is great!” → Positive |
| Machine Translation   | English → French                    |
| Text Summarization    | Condensing long articles            |
| Chatbots              | Virtual assistants like Siri, Alexa |
| Information Retrieval | Search engines                      |
| Spam Detection        | Filtering junk emails               |



## 2. Key NLP Tasks
- Tokenization  
- Stopword Removal  
- Stemming and Lemmatization  
- POS Tagging  
- Named Entity Recognition (NER)  
- Bag of Words (BoW)  
- TF-IDF (Term Frequency–Inverse Document Frequency)  
- Word Embeddings (Word2Vec, GloVe, FastText)  
- Transformers (BERT, GPT, etc.)


### 2.1 Text Data and Corpus

A corpus (plural: corpora) is a large collection of text used for analysis or for training models.
Examples: news articles, tweets, Wikipedia dumps.

In [None]:
text = "Natural Language Processing enables machines to understand human language."


### 2.2 Tokenization

Tokenization splits text into smaller units (tokens), such as words, punctuation marks, or sentences.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by recent NLTK releases; avoids a LookupError in word_tokenize
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLP is amazing. It helps computers understand text."
print(word_tokenize(text))
print(sent_tokenize(text))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['NLP', 'is', 'amazing', '.', 'It', 'helps', 'computers', 'understand', 'text', '.']
['NLP is amazing.', 'It helps computers understand text.']
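To make the idea concrete, here is a minimal regex-based sketch of word and sentence tokenization using only the standard library. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the rules are far simpler than NLTK's Punkt tokenizer; they happen to give the same result on this particular sentence but will break on abbreviations such as "Dr. Smith".

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text: str) -> list[str]:
    # Split on whitespace that follows ., ! or ? (naive: no handling of abbreviations)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "NLP is amazing. It helps computers understand text."
tokens = simple_word_tokenize(text)
sentences = simple_sent_tokenize(text)
print(tokens)
print(sentences)
```

For real text, a trained tokenizer such as `word_tokenize` is usually the safer choice, since it handles contractions and abbreviations that these two patterns do not.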


### 2.3 Stopwords Removal

Stopword removal drops very common words (such as "the", "is", and "to") that carry little meaning on their own.

In [None]:
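from nltk.corpus import stopwords
nltk.download('stopwords')

# Build the stopword set once; a set lookup is faster than scanning the list for every token
stop_words = set(stopwords.words('english'))

words = word_tokenize("NLP helps machines understand human language.")
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # unchanged here: this sentence contains no English stopwords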


['NLP', 'helps', 'machines', 'understand', 'human', 'language', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
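Because the sentence in the cell above contains no stopwords, the filter has nothing to remove. The sketch below uses a sentence where the effect is visible. It assumes a tiny hand-written stopword set (`STOP_WORDS`, hypothetical and much shorter than NLTK's English list) and a hypothetical helper `remove_stopwords`, so it runs without any downloads.

```python
import re

# Hypothetical mini stopword list for illustration; NLTK's English list is much longer
STOP_WORDS = {"the", "is", "a", "of", "to", "in", "and", "on"}

def remove_stopwords(text: str) -> list[str]:
    # Lowercase, keep word tokens only, then drop anything in the stopword set
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stopwords("The cat is sitting on the mat in the garden.")
print(filtered)  # ['cat', 'sitting', 'mat', 'garden']
```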


### 2.4 Stemming and Lemmatization

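- **Stemming**: Strips suffixes with heuristic rules to reach a root form. It is fast but crude, and the result need not be a real word (for example, "language" becomes "languag").

- **Lemmatization**: Maps a word to its dictionary base form (lemma) using a vocabulary such as WordNet, ideally together with the word's part of speech.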

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

ps = PorterStemmer()
lm = WordNetLemmatizer()

print(ps.stem("running"))       # run
print(lm.lemmatize("running"))  # running (the default pos="n" treats the word as a noun)

[nltk_data] Downloading package wordnet to /root/nltk_data...


run
running


### Basic Preprocessing in NLP

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "Natural Language Processing allows computers to understand human language."

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Stopword Removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
print("Filtered Tokens:", filtered_tokens)

# Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in filtered_tokens]
print("Stems:", stems)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w) for w in filtered_tokens]
print("Lemmas:", lemmas)


Tokens: ['Natural', 'Language', 'Processing', 'allows', 'computers', 'to', 'understand', 'human', 'language', '.']
Filtered Tokens: ['Natural', 'Language', 'Processing', 'allows', 'computers', 'understand', 'human', 'language', '.']
Stems: ['natur', 'languag', 'process', 'allow', 'comput', 'understand', 'human', 'languag', '.']
Lemmas: ['Natural', 'Language', 'Processing', 'allows', 'computer', 'understand', 'human', 'language', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 2.5 Part-of-Speech (POS) Tagging

Assigning grammatical labels (noun, verb, etc.) to words.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # needed by recent NLTK releases; avoids a LookupError in pos_tag

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
tokens = word_tokenize("John loves coding in Python.")
print(nltk.pos_tag(tokens))

[('John', 'NNP'), ('loves', 'VBZ'), ('coding', 'VBG'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]
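The tags in this output come from the Penn Treebank tagset, which `nltk.pos_tag` uses by default. As a reading aid, the sketch below maps the tags seen above to plain-English descriptions; the dictionary `PENN_TAGS` is a hypothetical helper covering only these five tags, not the full tagset.

```python
# Penn Treebank tags that appear in the output above, with their meanings
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "IN": "preposition or subordinating conjunction",
    ".": "sentence-final punctuation",
}

# The (word, tag) pairs copied from the pos_tag output
tagged = [("John", "NNP"), ("loves", "VBZ"), ("coding", "VBG"),
          ("in", "IN"), ("Python", "NNP"), (".", ".")]

for word, tag in tagged:
    print(f"{word:<8}{tag:<5}{PENN_TAGS.get(tag, 'unknown')}")
```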


### 2.6 Named Entity Recognition (NER)

Identifying entities like names, locations, dates, etc.

In [None]:
import spacy
# Requires the small English model; if it is missing, run: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk founded SpaceX in 2002 in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)


Elon Musk PERSON
2002 DATE
California GPE


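## 4. Bag of Words and TF-IDF

Bag of Words (BoW) represents each document as a vector of word counts over a shared vocabulary, ignoring word order. TF-IDF reweights those counts so that words appearing in many documents (here, "nlp") contribute less than words specific to one document.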

In [None]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I love NLP and machine learning.",
    "NLP is amazing for text analysis.",
    "Machine learning and NLP are related fields."
]

# Bag of Words
cv = CountVectorizer()
bow = cv.fit_transform(corpus)
print("Vocabulary:", cv.get_feature_names_out())
print("BoW Matrix:\n", bow.toarray())

# TF-IDF
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print("TF-IDF Features:", tfidf.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())


Vocabulary: ['amazing' 'analysis' 'and' 'are' 'fields' 'for' 'is' 'learning' 'love'
 'machine' 'nlp' 'related' 'text']
BoW Matrix:
 [[0 0 1 0 0 0 0 1 1 1 1 0 0]
 [1 1 0 0 0 1 1 0 0 0 1 0 1]
 [0 0 1 1 1 0 0 1 0 1 1 1 0]]
TF-IDF Features: ['amazing' 'analysis' 'and' 'are' 'fields' 'for' 'is' 'learning' 'love'
 'machine' 'nlp' 'related' 'text']
TF-IDF Matrix:
 [[0.         0.         0.43306685 0.         0.         0.
  0.         0.43306685 0.56943086 0.43306685 0.33631504 0.
  0.        ]
 [0.43238509 0.43238509 0.         0.         0.         0.43238509
  0.43238509 0.         0.         0.         0.2553736  0.
  0.43238509]
 [0.         0.         0.33729513 0.44350256 0.44350256 0.
  0.         0.33729513 0.         0.33729513 0.26193976 0.44350256
  0.        ]]
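To show where these numbers come from, the sketch below recomputes the TF-IDF row for the second document by hand, using only the standard library. It assumes scikit-learn's default settings: smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The helper names `tokenize` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "I love NLP and machine learning.",
    "NLP is amazing for text analysis.",
    "Machine learning and NLP are related fields.",
]

def tokenize(doc: str) -> list[str]:
    # Lowercase and keep tokens of 2+ word characters (so "I" is dropped)
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [tokenize(d) for d in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc: list[str]) -> dict[str, float]:
    counts = Counter(doc)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(docs[1])
print(round(weights["nlp"], 4), round(weights["amazing"], 4))  # 0.2554 0.4324
```

"nlp" occurs in all three documents, so its IDF is 1 and it ends up with the smallest weight in the row, matching the 0.2553736 in the matrix above.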


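## 5. Word Embeddings (Word2Vec Example)

Word embeddings map each word to a dense vector so that words used in similar contexts end up close together. Word2Vec learns these vectors from the words that co-occur within a sliding window. The two-sentence corpus below is far too small to learn anything meaningful, so treat the printed similarities as a demonstration of the API only.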

In [None]:

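# Requires gensim; in Colab, run `!pip install gensim` first
from gensim.models import Word2Vec

# Each sentence is a list of tokens
sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["Word2Vec", "creates", "vector", "representations", "of", "words"]
]

# vector_size: embedding dimensions; window: context size;
# min_count=1 keeps every word; sg=1 selects skip-gram (sg=0 would be CBOW)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
print("Vector for 'language':\n", model.wv['language'])
print("Most similar words to 'love':", model.wv.most_similar('love'))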


Vector for 'language':
 [ 0.00855287  0.00015212 -0.01916856 -0.01933109 -0.01229639 -0.00025714
  0.00399483  0.01886394  0.0111687  -0.00858139  0.00055663  0.00992872
  0.01539662 -0.00228845  0.00864684 -0.01162876 -0.00160838  0.0162001
 -0.00472013 -0.01932691  0.01155852 -0.00785964 -0.00244575  0.01996103
 -0.0045127  -0.00951413 -0.01065877  0.01396178 -0.01141774  0.00422733
 -0.01051132  0.01224143  0.00871461  0.00521271 -0.00298217 -0.00549213
  0.01798587  0.01043155 -0.00432504 -0.01894062 -0.0148521  -0.00212748
 -0.00158989 -0.00512582  0.01936544 -0.00091704  0.01174752 -0.01489517
 -0.00501215 -0.01109973]
Most similar words to 'love': [('Word2Vec', 0.22442302107810974), ('words', 0.0998455286026001), ('processing', 0.08992855250835419), ('of', 0.0013571316376328468), ('language', -0.001363718998618424), ('representations', -0.037274789065122604), ('creates', -0.06343826651573181), ('I', -0.11219385266304016), ('vector', -0.12279319763183594), ('natural', -0.25958916
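The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure, using hypothetical 3-dimensional vectors that are not taken from the trained model:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    # Dot product divided by the product of the vector lengths; ranges from -1 to 1
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors for illustration only
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.2]
banana = [-0.5, 0.1, 0.9]

print(cosine_similarity(king, queen))   # close to 1: vectors point the same way
print(cosine_similarity(king, banana))  # negative: vectors point in different directions
```

With a corpus as small as the one above, the trained vectors stay close to their random initialisation, which is why the similarities in the output are all near zero.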

In [None]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


## 6. Named Entity Recognition (NER)

In [None]:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)


Apple ORG
U.K. GPE
$1 billion MONEY


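## 7. Text Classification (Example: Sentiment Analysis)

This cell trains a Multinomial Naive Bayes classifier on Bag-of-Words counts. With only four sentences and a random 50/50 split, the reported accuracy is not meaningful; the example shows the workflow, not a realistic result.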

In [None]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy data: 1 = positive, 0 = negative
X = ["I love this movie", "I hate this movie", "Amazing performance", "Terrible direction"]
y = [1, 0, 1, 0]

# Convert each sentence to a vector of word counts
vec = CountVectorizer()
X_vec = vec.fit_transform(X)

# The split is random, so with four samples the score varies from run to run
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.5)
clf = MultinomialNB()
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))



Accuracy: 0.0
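To see what a multinomial Naive Bayes classifier computes, here is a from-scratch sketch in plain Python. It is an illustration under stated assumptions, not scikit-learn's implementation: it uses add-one (Laplace) smoothing with `alpha=1.0`, ignores words never seen in training, and trains on all four sentences instead of a split. The function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text: str) -> list[str]:
    # Lowercase and keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def train_nb(texts, labels):
    class_docs = Counter(labels)        # number of documents per class
    word_counts = defaultdict(Counter)  # token counts per class
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict_nb(text, class_docs, word_counts, vocab, alpha=1.0):
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in tokenize(text):
            if w in vocab:  # words unseen in training are ignored
                # Smoothed log likelihood of the word given the class
                score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

X = ["I love this movie", "I hate this movie", "Amazing performance", "Terrible direction"]
y = [1, 0, 1, 0]

state = train_nb(X, y)
print(predict_nb("love this amazing movie", *state))  # 1
print(predict_nb("terrible movie", *state))           # 0
```

When a test sentence shares no informative words with the training sentences, the scores are driven by the priors and by words common to both classes, which is one reason a two-sentence training split can score 0.0.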


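## 8. Transformer Model (BERT) Example

The Hugging Face `pipeline` helper loads a pretrained model together with its tokenizer. When no model is named, the sentiment-analysis task falls back to a DistilBERT checkpoint (a distilled variant of BERT) fine-tuned on SST-2, as the first line of the log below shows.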

In [None]:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I really enjoy learning NLP with deep learning!")
print(result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.999743640422821}]
