

# 🧠 Natural Language Processing (NLP)

## 1. Introduction to NLP
Natural Language Processing (NLP) is a subfield of AI that enables computers to understand, interpret, and generate human language.

**Applications:** Chatbots, sentiment analysis, translation, summarization, question answering, etc.



### Why NLP?
- Enables machines to communicate with humans in natural language

- Helps extract information from text data (emails, tweets, reviews, etc.)

- Powers many AI systems: chatbots, translators, summarizers, etc.


## Examples of NLP Applications

| Application           | Example                             |
| --------------------- | ----------------------------------- |
| Sentiment Analysis    | “This product is great!” → Positive |
| Machine Translation   | English → French                    |
| Text Summarization    | Condensing long articles            |
| Chatbots              | Virtual assistants like Siri, Alexa |
| Information Retrieval | Search engines                      |
| Spam Detection        | Filtering junk emails               |



## 2. Key NLP Tasks
- Tokenization  
- Stopword Removal  
- Stemming and Lemmatization  
- POS Tagging  
- Named Entity Recognition (NER)  
- Bag of Words (BoW)  
- TF-IDF (Term Frequency–Inverse Document Frequency)  
- Word Embeddings (Word2Vec, GloVe, FastText)  
- Transformers (BERT, GPT, etc.)


### 2.1 Text Data and Corpus

A corpus is a large collection of texts used for analysis or for training models.
Examples: news articles, tweets, Wikipedia dumps.

In [None]:
text = "Natural Language Processing enables machines to understand human language."


### 2.2 Tokenization

Tokenization splits text into smaller units (tokens), such as words or sentences.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # required by recent NLTK versions; without it, tokenizing raises a LookupError
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLP is amazing. It helps computers understand text."
print(word_tokenize(text))
print(sent_tokenize(text))
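
To make the idea concrete, here is a minimal regex-based sketch of word and sentence splitting that uses only the standard library. It is not how NLTK's `word_tokenize` or `sent_tokenize` work internally (those handle abbreviations, contractions, and other edge cases). The names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helpers made up for this illustration.

```python
import re

def simple_word_tokenize(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Naive rule: split after '.', '!' or '?' when followed by whitespace.
    # This breaks on abbreviations such as "U.K." or "Dr.".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "NLP is amazing. It helps computers understand text."
print(simple_word_tokenize(text))
print(simple_sent_tokenize(text))
```

The abbreviation problem noted in the comment is one reason to prefer a trained tokenizer over hand-written rules for real text.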

### 2.3 Stopwords Removal

Removing common words that don't carry much meaning.

In [None]:
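from nltk.corpus import stopwords
nltk.download('stopwords')

# Build the stopword set once; a set lookup is faster than rescanning the list for every token
stop_words = set(stopwords.words('english'))

words = word_tokenize("NLP helps machines understand human language.")
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)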


### 2.4 Stemming and Lemmatization

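- **Stemming**: Strips suffixes with heuristic rules; fast but crude, and the result may not be a real word

- **Lemmatization**: Maps a word to its dictionary base form (lemma) using a vocabulary and, ideally, its part of speech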

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

ps = PorterStemmer()
lm = WordNetLemmatizer()

print(ps.stem("running"))       # run
print(lm.lemmatize("running"))  # running (the default POS is noun; pass pos="v" to get "run")
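
The difference between the two approaches can be sketched with toy versions: a rule-based suffix stripper versus a vocabulary lookup. This is only an illustration under simplifying assumptions. `toy_stem` is far cruder than the Porter algorithm, and `LEMMA_TABLE` is a tiny hand-written dictionary standing in for WordNet; both names are hypothetical.

```python
# Toy stemmer: strip the first matching suffix (much cruder than the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: look the word up in a small hand-written vocabulary.
LEMMA_TABLE = {"running": "run", "studies": "study", "better": "good", "mice": "mouse"}

def toy_lemmatize(word):
    # Unknown words are returned unchanged (lowercased).
    return LEMMA_TABLE.get(word.lower(), word.lower())

for w in ["running", "studies", "better", "mice"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based version produces fragments such as "runn" and "studi" and cannot handle irregular forms like "mice", while the lookup returns valid words but only for entries its vocabulary contains.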

### Basic Preprocessing in NLP

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "Natural Language Processing allows computers to understand human language."

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Stopword Removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
print("Filtered Tokens:", filtered_tokens)

# Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in filtered_tokens]
print("Stems:", stems)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w) for w in filtered_tokens]
print("Lemmas:", lemmas)


### 2.5 Part-of-Speech (POS) Tagging

Assigning grammatical labels (noun, verb, etc.) to words.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # resource name used by recent NLTK versions; avoids a LookupError

In [None]:
tokens = word_tokenize("John loves coding in Python.")
print(nltk.pos_tag(tokens))

### 2.6 Named Entity Recognition (NER)

Identifying entities like names, locations, dates, etc.

In [None]:
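import spacy

# If the model is missing, install it first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk founded SpaceX in 2002 in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # entity text and its type label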


## 4. Bag of Words and TF-IDF

In [None]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I love NLP and machine learning.",
    "NLP is amazing for text analysis.",
    "Machine learning and NLP are related fields."
]

# Bag of Words
cv = CountVectorizer()
bow = cv.fit_transform(corpus)
print("Vocabulary:", cv.get_feature_names_out())
print("BoW Matrix:\n", bow.toarray())

# TF-IDF
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print("TF-IDF Features:", tfidf.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
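
To show what the two vectorizers compute, here is a from-scratch sketch using only the standard library. It assumes raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row, which is my understanding of scikit-learn's defaults; treat exact agreement with `TfidfVectorizer` as an assumption rather than a guarantee. The helper names `tokenize` and `tfidf_row` are made up for this example.

```python
import math
import re
from collections import Counter

corpus = [
    "I love NLP and machine learning.",
    "NLP is amazing for text analysis.",
    "Machine learning and NLP are related fields.",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters (so "I" is dropped).
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [tokenize(d) for d in corpus]
vocab = sorted({tok for doc in docs for tok in doc})

# Bag of Words: one row per document, one raw count per vocabulary term.
bow = []
for doc in docs:
    counts = Counter(doc)
    bow.append([counts[term] for term in vocab])

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(counts):
    # Weight each count by its IDF, then scale the row to unit L2 length.
    weights = [c * idf[t] for c, t in zip(counts, vocab)]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

tfidf_rows = [tfidf_row(row) for row in bow]

print("Vocabulary:", vocab)
print("BoW row 0:", bow[0])
print("TF-IDF row 0:", [round(w, 3) for w in tfidf_rows[0]])
```

Because "nlp" appears in every document, its IDF is the minimum value of 1, so within a row it is weighted less than terms such as "love" that occur in only one document.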


## 5. Word Embeddings (Word2Vec Example)

In [None]:
!pip install gensim

In [None]:

from gensim.models import Word2Vec

sentences = [
    ["I", "love", "natural", "language", "processing"],
    ["Word2Vec", "creates", "vector", "representations", "of", "words"]
]

# sg=1 selects skip-gram; min_count=1 keeps every word in this tiny corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
print("Vector for 'language':\n", model.wv['language'])
# With only two training sentences, these similarity scores are not meaningful
print("Most similar words to 'love':", model.wv.most_similar('love'))
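
`most_similar` ranks words by the cosine similarity of their vectors. The sketch below reproduces that ranking in plain Python. The three-dimensional vectors are invented for illustration and are not trained embeddings, and the `most_similar` function here is a hypothetical stand-in, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, made up for this example (not learned from data).
vectors = {
    "love": [0.9, 0.1, 0.3],
    "like": [0.8, 0.2, 0.4],
    "vector": [-0.2, 0.9, 0.1],
}

def most_similar(word, top_n=2):
    # Score every other word against the query and sort from most to least similar.
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

print(most_similar("love"))
```

Cosine similarity depends only on the angle between vectors, not their length, so two words score highly when their vectors point in nearly the same direction.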

## 6. Named Entity Recognition (NER)

In [None]:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)


## 7. Text Classification (Example: Sentiment Analysis)

In [None]:

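from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy dataset: 1 = positive, 0 = negative
X = ["I love this movie", "I hate this movie", "Amazing performance", "Terrible direction"]
y = [1, 0, 1, 0]

# Convert the texts to Bag of Words count vectors
vec = CountVectorizer()
X_vec = vec.fit_transform(X)

# Fix the seed and stratify so the split is reproducible and both classes appear in each half
X_train, X_test, y_train, y_test = train_test_split(
    X_vec, y, test_size=0.5, random_state=42, stratify=y
)
model = MultinomialNB()
model.fit(X_train, y_train)

# With four examples the accuracy is only illustrative
pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))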
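
For intuition about what a multinomial Naive Bayes classifier does with word counts, here is a from-scratch sketch using only the standard library. It assumes add-one (Laplace) smoothing and, unlike the cell above, trains on all four sentences without a train/test split. The helpers `tokenize`, `log_posterior`, and `predict` are hypothetical names, and the two test sentences are made up.

```python
import math
from collections import Counter, defaultdict

train_docs = ["I love this movie", "I hate this movie", "Amazing performance", "Terrible direction"]
train_labels = [1, 0, 1, 0]

def tokenize(doc):
    return doc.lower().split()

# Count how often each word occurs in each class, and how many documents each class has.
word_counts = defaultdict(Counter)
class_counts = Counter(train_labels)
for doc, label in zip(train_docs, train_labels):
    word_counts[label].update(tokenize(doc))

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(doc, label, alpha=1.0):
    # log P(class) + sum of log P(word | class), with add-alpha smoothing.
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train_labels))
    for word in tokenize(doc):
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(doc):
    # Choose the class with the highest log-posterior score.
    return max(class_counts, key=lambda label: log_posterior(doc, label))

print(predict("I love this performance"))
print(predict("terrible movie"))
```

Working in log space avoids numeric underflow from multiplying many small probabilities, and the smoothing term keeps a word that is unseen in one class from forcing that class's probability to zero.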



## 8. Transformer Model (BERT) Example

In [None]:

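from transformers import pipeline

# No model is specified, so the pipeline downloads a default pretrained sentiment model
classifier = pipeline("sentiment-analysis")
result = classifier("I really enjoy learning NLP with deep learning!")
print(result)  # a list with one dict holding 'label' and 'score'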
