# Natural Language Processing (NLP)

Welcome to this notebook on Natural Language Processing (NLP), part of the 'Part_4_Deep_Learning_and_Specializations' section of our machine learning tutorial series. In this notebook, we'll explore the fundamentals of NLP, a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. We'll cover traditional and modern approaches to text processing and analysis, with practical examples.

## What You'll Learn
- The basics of NLP and its applications in machine learning.
- Traditional text processing techniques like Bag of Words and TF-IDF.
- Modern approaches using word embeddings and transformer models.
- How to perform text classification for sentiment analysis using scikit-learn.
- An introduction to transformer models with Hugging Face's library.

Let's dive into the world of Natural Language Processing!

## 1. Introduction to Natural Language Processing

Natural Language Processing (NLP) is a subfield of artificial intelligence concerned with enabling computers to understand, interpret, and generate human language. NLP encompasses a wide range of tasks, including:
- **Text Classification**: Categorizing text into predefined categories (e.g., spam detection, sentiment analysis).
- **Named Entity Recognition (NER)**: Identifying entities like names, dates, and organizations in text.
- **Machine Translation**: Translating text from one language to another (e.g., Google Translate).
- **Question Answering**: Building systems that can answer questions posed in natural language.
- **Text Generation**: Creating coherent and contextually relevant text (e.g., chatbots, story generation).

NLP has evolved significantly with the advent of deep learning, moving from traditional statistical methods to powerful neural network-based models like transformers.

## 2. Traditional NLP Techniques

Before deep learning, NLP relied on statistical and rule-based methods to process text. Key techniques include:

- **Tokenization**: Splitting text into individual words or tokens.
- **Bag of Words (BoW)**: Representing text as a collection of word frequencies, ignoring grammar and word order.
- **Term Frequency-Inverse Document Frequency (TF-IDF)**: Weighting words by their importance in a document relative to a corpus, so that terms frequent in one document but rare across the corpus stand out.
- **N-grams**: Capturing sequences of words to preserve some context (e.g., bigrams for pairs of words).

These methods are simple but effective for many tasks, especially when combined with machine learning algorithms like Naive Bayes or Support Vector Machines.
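
To make tokenization, Bag of Words, and n-grams concrete, here is a minimal sketch in plain Python. It assumes a naive lowercase regex tokenizer, and the helper names `tokenize`, `bag_of_words`, and `ngrams` are illustrative rather than library functions. Real tokenizers handle punctuation, contractions, and Unicode far more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(tokens):
    """Count how often each token occurs; word order is discarded."""
    return Counter(tokens)

def ngrams(tokens, n=2):
    """Return consecutive n-token tuples (bigrams by default)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up example sentence for illustration
tokens = tokenize("The plot was good, the acting was not good")
print(tokens)
print(bag_of_words(tokens))
print(ngrams(tokens, n=2))
```

Note that the Bag of Words counts only record that "good" appears twice, while the bigram `('not', 'good')` keeps the negation that word counts alone lose.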

## 3. Modern NLP with Deep Learning

Deep learning has transformed NLP by enabling models to learn semantic relationships and context from text. Key advancements include:

- **Word Embeddings**: Dense vector representations of words that capture semantic meaning (e.g., Word2Vec, GloVe, fastText). Words with similar meanings are closer in vector space.
- **Recurrent Neural Networks (RNNs)**: Models like LSTMs and GRUs for sequential data, useful for tasks like language modeling and sentiment analysis.
- **Transformers**: A breakthrough architecture introduced in the paper "Attention Is All You Need" (2017). Transformers use self-attention to process all positions of a sequence in parallel rather than one token at a time, leading to state-of-the-art performance in tasks like translation and text generation. Models like BERT (Bidirectional Encoder Representations from Transformers) have become foundational in NLP.
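
The claim that similar words are "closer in vector space" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional toy vectors, which are an assumption for illustration only and are not real Word2Vec, GloVe, or fastText embeddings (those typically have hundreds of dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (NOT real embeddings), chosen so that
# "good" and "great" point in a similar direction.
toy_embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "terrible": [-0.7, 0.3, 0.2],
}

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"])
print(f"good vs great:    {sim_close:.3f}")
print(f"good vs terrible: {sim_far:.3f}")
```

A value near 1 means the vectors point in nearly the same direction, and a negative value means they point in roughly opposite directions.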

We'll explore both traditional and modern approaches in this notebook.

## 4. Setting Up the Environment

Let's import the necessary libraries for traditional NLP with scikit-learn and introduce modern NLP with Hugging Face's transformers library. We'll also use pandas for data handling, and matplotlib and seaborn for visualizations.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

# For modern NLP with transformers (optional; the pipeline also needs a backend such as PyTorch)
try:
    from transformers import pipeline
except ImportError:
    print("Transformers library not installed. Install with: pip install transformers torch")
    pipeline = None

# Set random seed for reproducibility
np.random.seed(42)

## 5. Traditional NLP: Text Classification with TF-IDF

Let's start with a traditional NLP approach by performing text classification for sentiment analysis. We'll use a simple dataset of movie reviews (positive or negative) and apply TF-IDF to transform the text into numerical features, then train a logistic regression model.

For this example, we'll create a small synthetic dataset of reviews. In a real scenario, you would use a larger dataset like the IMDB dataset.

In [None]:
# Create a synthetic dataset of movie reviews
data = {
    'review': [
        'This movie was fantastic and I loved every moment',
        'Terrible waste of time, hated the plot',
        'Amazing acting and a great story',
        'Awful, the worst film I have seen',
        'Brilliant direction and stunning visuals',
        'Boring and predictable, not worth it',
        'A masterpiece of cinema, truly inspiring',
        'Disappointing, expected much better',
        'Loved the characters and the soundtrack',
        'Horrible, could not even finish it'
    ],
    'sentiment': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 for positive, 0 for negative
}
df = pd.DataFrame(data)

# Display the dataset
print("Sample dataset:")
print(df.head(10))

### 5.1. Feature Extraction with TF-IDF

We'll use the TfidfVectorizer to convert text into TF-IDF features, which weigh words based on their frequency in a document and rarity across all documents.

In [None]:
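# Split the data into training and test sets.
# stratify keeps one review of each class in the tiny 2-sample test set.
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42, stratify=df['sentiment']
)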

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')

# Transform text to TF-IDF features
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"Shape of training data (TF-IDF): {X_train_tfidf.shape}")
print(f"Shape of test data (TF-IDF): {X_test_tfidf.shape}")
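
To show what the vectorizer computes, here is a hand-rolled sketch of TF-IDF in plain Python. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to the best of our knowledge matches `TfidfVectorizer`'s default settings (`smooth_idf=True`, `norm='l2'`). The function name `tfidf_vectors` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute smoothed, L2-normalized TF-IDF weights for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy corpus (assumed, already tokenized by whitespace)
docs = [
    "great movie great cast".split(),
    "boring movie".split(),
    "great story".split(),
]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print({term: round(w, 3) for term, w in vec.items()})
```

In the second document, "boring" receives a higher weight than "movie" because "movie" appears in two of the three documents while "boring" appears in only one.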

### 5.2. Training a Logistic Regression Model

Now, let's train a logistic regression model on the TF-IDF features to classify reviews as positive or negative.

In [None]:
# Train logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train_tfidf, y_train)

# Make predictions on test set
y_pred = model.predict(X_test_tfidf)

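# Evaluate the model (with only 2 test samples, these numbers are illustrative, not reliable)
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
# labels fixes the class order; zero_division=0 avoids warnings when a class is never predicted
print(classification_report(y_test, y_pred, labels=[0, 1], target_names=['Negative', 'Positive'], zero_division=0))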
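
The numbers in the classification report come from simple counts of true positives, false positives, and false negatives. The sketch below recomputes them by hand for the positive class. The helper `binary_metrics` and the label lists are hypothetical examples, not results from the model above.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration (1 = positive, 0 = negative)
example_true = [1, 0, 1, 1, 0, 0]
example_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(example_true, example_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3.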

### 5.3. Visualizing Important Features

Let's visualize the most important words (features) for positive and negative sentiments based on the coefficients of the logistic regression model.

In [None]:
# Get feature names and coefficients
feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

# Create DataFrame of features and their coefficients
feature_importance = pd.DataFrame({'feature': feature_names, 'coefficient': coefficients})
feature_importance = feature_importance.sort_values(by='coefficient', ascending=False)

# Combine the 5 most positive and 5 most negative words into one frame so each
# word gets its own bar (two separate barplot calls would draw over each other)
top_features = pd.concat([feature_importance.head(5), feature_importance.tail(5)])
top_features['direction'] = np.where(
    top_features['coefficient'] > 0, 'Positive Sentiment', 'Negative Sentiment'
)

# Plot top positive and negative features
plt.figure(figsize=(10, 6))
sns.barplot(x='coefficient', y='feature', hue='direction', data=top_features, dodge=False,
            palette={'Positive Sentiment': 'green', 'Negative Sentiment': 'red'})
plt.title('Top Words for Positive and Negative Sentiment')
plt.xlabel('Coefficient (Importance)')
plt.ylabel('Word')
plt.legend()
plt.show()

## 6. Modern NLP: Introduction to Transformers

Transformers have revolutionized NLP by enabling models to understand context over long sequences of text through self-attention mechanisms. Models like BERT, GPT, and T5 are pre-trained on massive datasets and can be fine-tuned for specific tasks.
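
As a rough illustration of self-attention, the NumPy sketch below implements single-head scaled dot-product attention, `softmax(Q K^T / sqrt(d_k)) V`. It is a simplified sketch under stated assumptions: the projection matrices are random rather than learned, and there is no masking, positional encoding, or multi-head splitting. All names (`self_attention`, `W_q`, `W_k`, `W_v`) are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Each row of scores compares one position's query with every position's key
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))  # stand-in for 4 token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print("Output shape:", output.shape)
print("Attention weights shape:", weights.shape)
print("Row sums of attention weights:", weights.sum(axis=1))
```

Every output row is a weighted average of all value vectors, which is how each token can draw on context from any other position in the sequence.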

We'll use the Hugging Face `transformers` library to demonstrate a pre-trained model for sentiment analysis. If the library isn't installed, you'll need to run `pip install transformers`, along with a deep learning backend such as PyTorch (`pip install torch`).

**Note**: This section requires an internet connection to download the pre-trained model weights the first time you run it.

In [None]:
# Check if transformers library is available
if pipeline is not None:
    # Initialize sentiment analysis pipeline with a pre-trained model
    sentiment_analyzer = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
    
    # Test the model on sample texts
    texts = [
        "I absolutely loved this movie, it was fantastic!",
        "This film was terrible, I hated every moment."
    ]
    results = sentiment_analyzer(texts)
    
    # Display results
    for text, result in zip(texts, results):
        print(f"Text: {text}")
        print(f"Sentiment: {result['label']}, Confidence: {result['score']:.4f}\n")
else:
    print("Transformers library not available. Skipping this section. Install with: pip install transformers torch")
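
To compare transformer predictions with the 0/1 labels used earlier, the result dictionaries can be mapped to integers. The sketch below assumes the model reports its classes as the strings `"POSITIVE"` and `"NEGATIVE"`, and it uses a hand-written stand-in for the pipeline output so it runs without downloading a model. The helper `labels_to_ints` and the scores shown are illustrative, not real model output.

```python
def labels_to_ints(results, positive_label="POSITIVE"):
    """Map pipeline-style result dicts to 1 (positive) or 0 (negative)."""
    return [1 if r["label"] == positive_label else 0 for r in results]

# Hand-written stand-in with the same structure as the pipeline's results
# (a list of dicts with 'label' and 'score' keys); the scores are made up.
example_results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
]
predicted = labels_to_ints(example_results)
print(predicted)
```

With real output, `labels_to_ints(results)` could be passed to `accuracy_score` alongside the true labels, just as in the TF-IDF section.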

## 7. Conclusion

In this notebook, we've explored the fundamentals of Natural Language Processing (NLP), covering both traditional and modern approaches. We performed text classification for sentiment analysis using TF-IDF and logistic regression, a classic method that remains effective for many tasks. We also introduced transformer models, showcasing their power with a pre-trained sentiment analysis model from Hugging Face.

### Key Takeaways
- Traditional NLP techniques like TF-IDF transform text into numerical features for machine learning models.
- Logistic regression can effectively classify text based on extracted features, with interpretable results.
- Modern NLP with transformers captures complex contextual relationships in text, achieving state-of-the-art performance on various tasks.

Feel free to experiment with different datasets, tweak the models, or explore other NLP tasks like named entity recognition or text generation!

## 8. Further Exploration

If you're interested in diving deeper into NLP, consider exploring:
- **Word Embeddings**: Train or use pre-trained embeddings like Word2Vec or GloVe for better text representation.
- **Sequence Models**: Experiment with RNNs, LSTMs, or GRUs for tasks like language modeling.
- **Fine-Tuning Transformers**: Fine-tune a pre-trained model like BERT on a custom dataset using Hugging Face's tools.
- **Advanced Tasks**: Try tasks like machine translation, summarization, or question answering with transformer models.

Stay tuned for more specialized topics in this 'Part_4_Deep_Learning_and_Specializations' section!