# **Natural Language Processing**

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) and Linguistics that focuses on enabling machines to understand, interpret, generate, and respond to human language in a way that resembles how a person would.

Example: When you talk to Alexa or ChatGPT, NLP is what lets the system interpret your words and produce a reply.


### Advantages 
* Automates tasks like summarizing documents or sorting emails
* Improves customer experience (chatbots, recommendations)
* Enables insights from massive unstructured data

### Real-World Applications
| Industry     | Use Case                                    |
| ------------ | ------------------------------------------- |
| E-commerce   | Product reviews sentiment analysis          |
| Healthcare   | Extracting symptoms from clinical notes     |
| Banking      | Chatbots for customer service               |
| Social Media | Hate speech and spam detection              |
| Legal/HR     | Resume screening and document summarization |



## Basic Terms Used in NLP

1. **Corpus:** A corpus is a collection of text: a set of entire documents or, in the smallest case, a single paragraph.
```python
# A list keeps the documents in order (a set would not)
corpus = [
    "Hi, your OTP is 1234",
    "Win $10,000 now! Click here",
    "You have a meeting at 3:00 PM today"
]
print(corpus)
```

An alternative example of a corpus, this time a single paragraph of text:

```python
corpus = [
    """
    Once upon a time, a farmer had a goose that laid a golden egg every day. The farmer used to sell that egg and earn enough money to meet his family's day-to-day needs. One day, the farmer thought that if he could get more such golden eggs, he could make a lot of money and become a wealthy person.
    """
]
print(corpus)
```

2. **Document:** A document is a single text entry or unit within a corpus.
```python
document = "Win ₹10,000 now! Click here"
print(document)
```

3. **Vocabulary:** The vocabulary is the set of unique words or tokens found across the entire corpus.
```python
text = ["I love NLP", "NLP is fun", "I love Python"]
vocabulary = {"I", "love", "NLP", "is", "fun", "Python"}
```
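
Rather than listing the vocabulary by hand, it can be derived from the documents. The following is a minimal sketch that assumes whitespace splitting is good enough for these short sentences; the helper name `build_vocabulary` is hypothetical and not part of any library.

```python
def build_vocabulary(documents):
    """Return the set of unique whitespace-separated tokens in the documents."""
    vocabulary = set()
    for document in documents:
        # str.split() with no arguments splits on any run of whitespace
        vocabulary.update(document.split())
    return vocabulary


text = ["I love NLP", "NLP is fun", "I love Python"]
vocabulary = build_vocabulary(text)
print(sorted(vocabulary))  # ['I', 'NLP', 'Python', 'fun', 'is', 'love']
```

The three documents contain nine words in total, but only six of them are unique, so the vocabulary has six entries.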

4. **Words / Tokens:** Words are the basic units of language (e.g., "dog", "run", "beautiful"). Tokens are the pieces produced by breaking text into smaller units (usually words, sometimes sub-words or characters). This process is called tokenization.

```python
text = "I love NLP!"
tokens = ["I", "love", "NLP", "!"]
```
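
To make the idea concrete, here is a rough sketch of word-level and character-level tokenization using only the standard library. The regular expression and the helper name `simple_word_tokenize` are illustrative assumptions; real tokenizers such as NLTK's `word_tokenize` handle many more cases (contractions, abbreviations, and so on).

```python
import re


def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


text = "I love NLP!"
word_tokens = simple_word_tokenize(text)
# Character-level tokens, with spaces dropped for simplicity
char_tokens = list(text.replace(" ", ""))

print(word_tokens)  # ['I', 'love', 'NLP', '!']
print(char_tokens)  # ['I', 'l', 'o', 'v', 'e', 'N', 'L', 'P', '!']
```

The same sentence yields four word-level tokens but nine character-level tokens, which shows how the choice of token unit changes the length of the sequence a model has to process.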

```python
# Initial example
import string

import nltk
from nltk.tokenize import word_tokenize

# Download the required tokenizer data
nltk.download("punkt")
nltk.download("punkt_tab")

# Step 1: Define the corpus (a collection of documents)
corpus = [
    "Natural Language Processing is fascinating.",
    "I love exploring machine learning and NLP.",
    "NLP includes text classification, translation, and generation."
]

# Step 2: Each item in the corpus is a document
print("📄 Documents in Corpus:")
for i, doc in enumerate(corpus, start=1):
    print(f"Document {i}: {doc}")

# Step 3: Tokenize each document into words (tokens)
tokenized_docs = []
for doc in corpus:
    tokens = word_tokenize(doc)
    tokenized_docs.append(tokens)

# Step 4: Show words/tokens
print("\n🔤 Tokens per Document:")
for i, tokens in enumerate(tokenized_docs, start=1):
    print(f"Document {i} Tokens: {tokens}")

# Step 5: Build the vocabulary (set of unique tokens across all documents),
# removing punctuation and lowercasing
vocab = set()
for tokens in tokenized_docs:
    for token in tokens:
        if token not in string.punctuation:
            vocab.add(token.lower())

print("\n📚 Vocabulary:")
print(sorted(vocab))

print(f"\n🧮 Vocabulary Size: {len(vocab)}")
```


Output:

```text
📄 Documents in Corpus:
Document 1: Natural Language Processing is fascinating.
Document 2: I love exploring machine learning and NLP.
Document 3: NLP includes text classification, translation, and generation.

🔤 Tokens per Document:
Document 1 Tokens: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
Document 2 Tokens: ['I', 'love', 'exploring', 'machine', 'learning', 'and', 'NLP', '.']
Document 3 Tokens: ['NLP', 'includes', 'text', 'classification', ',', 'translation', ',', 'and', 'generation', '.']

📚 Vocabulary:
['and', 'classification', 'exploring', 'fascinating', 'generation', 'i', 'includes', 'is', 'language', 'learning', 'love', 'machine', 'natural', 'nlp', 'processing', 'text', 'translation']

🧮 Vocabulary Size: 17
```
