As a beginner with a passion for NLP and a long-term goal in R&D, you’re working through the updated course outline to build a strong foundation, having completed Chapters 1–4 (Introduction to NLP, Text Preprocessing, Text Representation, Basic NLP Tasks). Below is a detailed, beginner-friendly version of **Chapter 5: Text Classification and Sentiment Analysis**, tailored for someone with basic Python skills and knowledge of preprocessing, text representation, and POS/NER from prior chapters. This chapter introduces supervised learning and text classification, focusing on sentiment analysis, a key NLP task with wide applications in research and industry. It aligns with your R&D aspirations by emphasizing hands-on practice, research connections, and portfolio-building, while keeping the content engaging and accessible.

This chapter includes:
- **Theory**: Clear explanations of supervised learning, text classification, and sentiment analysis, designed for beginners.
- **Practical**: Step-by-step tasks using free tools (Scikit-learn, SpaCy) and the IMDB reviews dataset from Chapter 2 for consistency and depth.
- **Mini-Project**: A Sentiment Classifier to predict movie review sentiment, building coding skills and a portfolio piece.
- **Resources**: Free, beginner-friendly materials.
- **Debugging Tips**: Solutions to common beginner issues.
- **Checkpoints**: Quizzes and tasks to confirm mastery.
- **R&D Focus**: Links to research concepts (e.g., evaluation metrics) to inspire your long-term goal.

You already know the IMDB reviews dataset from Chapter 2; here it is used for classification, which keeps continuity while introducing new challenges. The content is structured for self-paced learning, with clear steps to build confidence and prepare for advanced NLP research.

**Time Estimate**: ~20 hours (spread over 1–2 weeks, 10–12 hours/week).  
**Tools**: Free (Google Colab, Scikit-learn, SpaCy, Pandas, Matplotlib).  
**Dataset**: [IMDB Dataset of 50k Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) (free on Kaggle).  
**Prerequisites**: Basic Python (Chapter 1), text preprocessing (Chapter 2), text representation (Chapter 3), POS/NER (Chapter 4); Colab or Anaconda setup with libraries (`pip install scikit-learn spacy pandas matplotlib`).  
**Date**: June 19, 2025.

---

## **Chapter 5: Text Classification and Sentiment Analysis**

*Goal*: Learn supervised learning and build a text classifier for sentiment analysis, mastering evaluation metrics and preparing for research-grade NLP tasks.

### **Theory (5 hours)**

#### **What is Text Classification?**
- **Definition**: Assigning predefined labels to text based on its content, a supervised learning task.
  - Example: Labeling a movie review as “positive” or “negative.”
- **Why It Matters**: Powers applications like spam detection, topic classification, and sentiment analysis.
- **R&D Relevance**: In research, text classification is used for tasks like detecting bias in social media or classifying medical reports.

#### **Key Concepts**
1. **Supervised Learning**:
   - **What**: Training a model on labeled data (input text + correct labels) to predict labels for new text.
   - **Example**: Training on IMDB reviews (text + “positive”/“negative”) to predict sentiment.
   - **Steps**:
     - Collect labeled data (e.g., reviews).
     - Split into training (80%) and test (20%) sets.
     - Train a model (e.g., Logistic Regression).
     - Evaluate on test set.
   - **Key Idea**: Models learn patterns (e.g., “great” → positive) from training data.
2. **Text Classification**:
   - **Types**:
     - Binary: Two classes (e.g., positive vs. negative).
     - Multi-class: Multiple classes (e.g., news topics: sports, business, tech).
     - Multi-label: Multiple labels per text (e.g., a review is both “positive” and “funny”).
   - **Models**:
     - Logistic Regression: Simple, interpretable, good for text.
     - Naive Bayes: Probabilistic, fast to train, and a strong baseline on small datasets.
     - Neural Networks: Advanced, used in research (e.g., BERT, Chapter 9).
   - **Features**: Use BoW or TF-IDF (Chapter 3) as input to models.
3. **Sentiment Analysis**:
   - **What**: Classifying text by emotional tone (e.g., positive, negative, neutral).
   - **Example**: “I loved this movie!” → Positive; “It was boring” → Negative.
   - **Applications**: Analyzing X posts, customer reviews, or political speeches.
   - **Challenges**: Sarcasm and ambiguity (e.g., a sarcastic “Great job!” is actually negative).
4. **Evaluation Metrics**:
   - **Accuracy**: % of correct predictions (good for balanced data).
   - **Precision**: % of positive predictions that are correct (matters when false positives are costly).
   - **Recall**: % of actual positives correctly predicted (matters when missed positives are costly).
   - **F1 Score**: Harmonic mean of precision and recall (balances both).
   - **Example**: If 90/100 predictions are correct, accuracy = 90%. If 10/10 predicted positives are correct, precision = 100%.
   - **Research Insight**: F1 score is widely used in NLP papers for robust evaluation.
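
To make the four metrics above concrete, here is a minimal pure-Python sketch that computes them from true and predicted labels. The function name `binary_metrics` and the toy label lists are illustrative assumptions, not part of the IMDB data; in the practical section Scikit-learn's metric functions do the same job.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example (made-up labels): 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these toy labels there are 3 true positives, 1 false positive, and 1 false negative, so accuracy is 0.80 while precision, recall, and F1 are all 0.75.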

#### **Trade-Offs**
- **Logistic Regression vs. Naive Bayes**:
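  - Logistic Regression: Often more accurate on text, but slower to train on large data because it is fit iteratively.
  - Naive Bayes: Faster (a single pass of counting), but assumes words are independent given the class, which is rarely true.
- **BoW vs. TF-IDF**:
  - BoW: Simpler, but frequent words (e.g., “movie”) dominate the counts.
  - TF-IDF: Down-weights words that appear in most documents and up-weights distinctive ones, which often helps classification.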
- **Challenges**: Imbalanced data (e.g., more positive reviews) or noisy text (e.g., typos) can hurt performance.
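
The BoW vs. TF-IDF trade-off can be seen on a tiny example. The sketch below uses the textbook formula (count × log(N / document frequency)) on three made-up reviews. Scikit-learn's `TfidfVectorizer` uses a smoothed variant with normalization by default, so its numbers will differ, but the effect is the same. The names `bow` and `tfidf` are hypothetical helpers for illustration.

```python
import math
from collections import Counter

# Toy corpus (made-up reviews), already lowercased; tokens are split on whitespace
docs = [
    "the movie was great",
    "the movie was boring",
    "the plot was great fun",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def bow(tokens):
    """Bag of Words: raw count of each word in one document."""
    return dict(Counter(tokens))


def tfidf(tokens):
    """Textbook TF-IDF: count * log(N / document frequency)."""
    counts = Counter(tokens)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}


bow_vec = bow(tokenized[1])
tfidf_vec = tfidf(tokenized[1])
for word in tokenized[1]:
    print(f"{word:>7}  BoW={bow_vec[word]}  TF-IDF={tfidf_vec[word]:.2f}")
```

In BoW every word of the second review gets the same weight (1). Under TF-IDF, “the” and “was” appear in all three documents and drop to 0, while “boring”, which appears only once in the corpus, gets the largest weight.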

#### **Resources**
- [Scikit-learn Text Classification Tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html): Beginner guide to classifiers.
- [Google’s ML Crash Course](https://developers.google.com/machine-learning/crash-course/classification): Supervised learning basics.
- [Jurafsky’s NLP Book, Chapter 4](https://web.stanford.edu/~jurafsky/slp3/4.pdf): Free PDF on classification.
- **R&D Resource**: Skim the introduction of [Socher et al., 2013](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) (5 minutes) for sentiment analysis in research.

#### **Learning Tips**
- Note why F1 score is better than accuracy for imbalanced data.
- Search X for #NLP or #SentimentAnalysis to see real-world examples (I can analyze posts if you share links).
- Think about using sentiment analysis for your R&D goal (e.g., analyzing X posts for public opinion).
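
For the first tip, a small worked example may help. The sketch below compares two hypothetical classifiers on a made-up test set of 95 negative and 5 positive reviews; the confusion-matrix counts are invented for illustration, and `summarize` is a hypothetical helper.

```python
def summarize(tp, fp, fn, tn):
    """Return (accuracy, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the denominator is 0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Classifier A: predicts "negative" for everything (finds none of the 5 positives)
acc_a, f1_a = summarize(tp=0, fp=0, fn=5, tn=95)

# Classifier B: finds 4 of the 5 positives, at the cost of 6 false alarms
acc_b, f1_b = summarize(tp=4, fp=6, fn=1, tn=89)

print(f"A: accuracy={acc_a:.2f}  F1={f1_a:.2f}")
print(f"B: accuracy={acc_b:.2f}  F1={f1_b:.2f}")
```

Classifier A scores higher on accuracy (0.95 vs. 0.93) even though it never detects a positive review, while F1 (0.00 vs. 0.53) ranks the classifier that actually finds positives first.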

---

### **Practical (10 hours)**

*Goal*: Build and evaluate a text classifier for sentiment analysis, applying preprocessing and representation skills.

#### **Setup**
- **Environment**: Google Colab (free GPU) or Anaconda (from Chapter 1).
- **Libraries**: Install (run in Colab or terminal):
  ```bash
  pip install scikit-learn spacy pandas matplotlib
  python -m spacy download en_core_web_sm
  ```
- **Dataset**: [IMDB Dataset of 50k Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).
  - Download: Sign up for Kaggle, download `IMDB Dataset.csv`.
  - Why? Familiar from Chapter 2, labeled (positive/negative), ideal for classification.
  - Columns: `review` (text), `sentiment` (positive/negative).
- **Load Data**:
  ```python
  import pandas as pd
  df = pd.read_csv('IMDB Dataset.csv', nrows=1000)  # Read only the first 1,000 reviews for speed
  print(df.head())
  ```

#### **Tasks**
1. **Preprocessing (2 hours)**:
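   - Clean and preprocess reviews using Chapter 2 skills (remove HTML tags and URLs, lowercase, lemmatize).
   - Code:
     ```python
     import re
     import spacy
     nlp = spacy.load("en_core_web_sm")
     def preprocess(text):
         # Replace HTML tags (IMDB reviews contain <br />), URLs, non-ASCII characters,
         # and basic punctuation with a space so neighbouring words are not glued together
         cleaned = re.sub(r'<[^>]+>|http\S+|[^\x00-\x7F]+|[.,!?]', ' ', text.lower())
         doc = nlp(cleaned)
         # Keep the lemmas of alphabetic tokens that are not stop words
         return ' '.join(token.lemma_ for token in doc if token.is_alpha and not token.is_stop)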
     df['cleaned_text'] = df['review'].apply(preprocess)
     print(df['cleaned_text'].head())
     ```
   - Output: Cleaned text (e.g., “love movie great”).
2. **Feature Extraction with TF-IDF (2 hours)**:
   - Convert text to TF-IDF vectors (Chapter 3).
   - Code:
     ```python
     from sklearn.feature_extraction.text import TfidfVectorizer
     vectorizer = TfidfVectorizer(max_features=5000)
     X = vectorizer.fit_transform(df['cleaned_text'])
     y = df['sentiment'].map({'positive': 1, 'negative': 0})
     print(X.shape)  # (1000, 5000)
     ```
   - Output: Matrix shape (1000 reviews × 5000 features).
3. **Train Logistic Regression (2 hours)**:
   - Split data and train a classifier.
   - Code:
     ```python
     from sklearn.model_selection import train_test_split
     from sklearn.linear_model import LogisticRegression
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
     model = LogisticRegression(max_iter=1000)
     model.fit(X_train, y_train)
     y_pred = model.predict(X_test)
     print(y_pred[:10])
     ```
   - Output: Predicted labels (e.g., [1, 0, 1]).
4. **Evaluate Model (2 hours)**:
   - Compute accuracy, precision, recall, F1 score.
   - Code:
     ```python
     from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
     print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
     print(f"Precision: {precision_score(y_test, y_pred):.2f}")
     print(f"Recall: {recall_score(y_test, y_pred):.2f}")
     print(f"F1 Score: {f1_score(y_test, y_pred):.2f}")
     ```
   - Output: e.g., Accuracy: 0.85, F1: 0.84.
5. **Try Naive Bayes (2 hours)**:
   - Compare with Logistic Regression.
   - Code:
     ```python
     from sklearn.naive_bayes import MultinomialNB
     nb_model = MultinomialNB()
     nb_model.fit(X_train, y_train)
     y_pred_nb = nb_model.predict(X_test)
     print(f"Naive Bayes F1: {f1_score(y_test, y_pred_nb):.2f}")
     ```
   - Output: e.g., F1: 0.82.
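
One design note on Tasks 2–3: the vectorizer above is fitted on all reviews before the split, so the test reviews influence the vocabulary and IDF weights. A common alternative is to split the text first and wrap both steps in a Scikit-learn `Pipeline`, which fits TF-IDF on the training texts only. The sketch below shows the idea on made-up, already-cleaned reviews standing in for `df['cleaned_text']`; the toy texts and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up cleaned reviews; repeated so each split has enough examples.
# Because of the repetition, any score on this toy data is meaningless.
texts = [
    "love movie great acting", "wonderful film enjoy minute",
    "great story love character", "enjoy film wonderful cast",
    "boring movie waste time", "terrible plot bad acting",
    "bad film boring story", "waste time terrible script",
] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Split the raw texts first, keeping the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# fit() learns the TF-IDF vocabulary from X_train only, then trains the classifier
clf = make_pipeline(TfidfVectorizer(max_features=5000),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# predict() applies the already-fitted vectorizer to unseen text
y_pred = clf.predict(X_test)
print(clf.predict(["great wonderful movie", "boring terrible film"]))
```

To try this on the chapter's data, replace `texts` with `df['cleaned_text']` and `labels` with `y`; swapping `LogisticRegression` for `MultinomialNB` gives the Naive Bayes comparison from Task 5.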

#### **Debugging Tips**
- Low accuracy? Check class balance (`df['sentiment'].value_counts()`) or increase `max_features`.
- Model not converging? Increase `max_iter` in LogisticRegression.
- SpaCy slow? Process reviews in batches (e.g., 200 at a time) with `nlp.pipe(texts, batch_size=200)` instead of calling `nlp` on each review.
- Memory issues? Reduce to 500 reviews (`df[:500]`).
- CSV error? Use absolute path (e.g., `/content/IMDB Dataset.csv` in Colab).
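
For the batching tip, here is a minimal, dependency-free sketch of processing reviews in fixed-size batches with progress output. Batching by hand does not make SpaCy faster on its own, but it shows how far along you are and lets you save partial results; SpaCy's own `nlp.pipe` is usually the faster route. The helper names `batches` and `process_in_batches` are hypothetical, and `str.lower` stands in for the chapter's `preprocess` function.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(texts, fn, size=200):
    """Apply `fn` to every text, one batch at a time, and report progress."""
    results = []
    for i, batch in enumerate(batches(texts, size), start=1):
        results.extend(fn(text) for text in batch)
        print(f"Finished batch {i} ({len(results)}/{len(texts)} texts)")
    return results


# Made-up reviews; str.lower is a stand-in for preprocess()
reviews = [f"Review number {n}" for n in range(450)]
cleaned = process_in_batches(reviews, str.lower, size=200)
```

With the chapter's data, this could be called as `process_in_batches(df['review'].tolist(), preprocess)`.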

#### **Resources**
- [Scikit-learn Classification](https://scikit-learn.org/stable/modules/linear_model.html): Logistic Regression guide.
- [Scikit-learn Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html): Evaluation metrics.
- [SpaCy Preprocessing](https://spacy.io/usage/linguistic-features): Lemmatization review.
- [Kaggle Pandas](https://www.kaggle.com/learn/pandas): Data handling.

---

### **Mini-Project: Sentiment Classifier (5 hours)**

*Goal*: Build a sentiment classifier for IMDB reviews, evaluate its performance, and create a portfolio piece for R&D.

- **Task**: Using 1,000 IMDB reviews (800 for training, 200 for testing), train a Logistic Regression classifier, predict sentiment, and report evaluation metrics.
- **Input**: `IMDB Dataset.csv` (first 1,000 reviews).
- **Output**: 
  - CSV with predictions: `review`, `true_sentiment`, `predicted_sentiment`.
  - Text file with metrics: `accuracy`, `precision`, `recall`, `F1`.
  - Plot of precision-recall curve.
- **Steps**:
  1. Preprocess reviews (clean, lemmatize).
  2. Create TF-IDF features.
  3. Train Logistic Regression on 80% data, test on 20%.
  4. Compute and save metrics.
  5. Plot precision-recall curve.
- **Example Output**:
  - CSV:
    ```csv
    review,true_sentiment,predicted_sentiment
    "I loved this movie!",positive,positive
    "It was boring.",negative,negative
    ```
  - Text:
    ```
    Accuracy: 0.85
    Precision: 0.86
    Recall: 0.84
    F1 Score: 0.85
    ```
  - Plot: Precision-recall curve showing trade-off.
- **Code**:
  ```python
  import pandas as pd
  import re
  import spacy
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, precision_recall_curve
  import matplotlib.pyplot as plt

  # Load SpaCy
  nlp = spacy.load("en_core_web_sm")

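  # Preprocess: replace HTML tags (e.g. <br />), URLs, non-ASCII characters, and basic
  # punctuation with spaces, then keep the lemmas of alphabetic, non-stop-word tokens
  def preprocess(text):
      cleaned = re.sub(r'<[^>]+>|http\S+|[^\x00-\x7F]+|[.,!?]', ' ', text.lower())
      doc = nlp(cleaned)
      return ' '.join(token.lemma_ for token in doc if token.is_alpha and not token.is_stop)

  # Load data (first 1,000 reviews only)
  df = pd.read_csv('IMDB Dataset.csv', nrows=1000)
  df['cleaned_text'] = df['review'].apply(preprocess)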

  # TF-IDF
  vectorizer = TfidfVectorizer(max_features=5000)
  X = vectorizer.fit_transform(df['cleaned_text'])
  y = df['sentiment'].map({'positive': 1, 'negative': 0})

  # Train-test split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  # Train model
  model = LogisticRegression(max_iter=1000)
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)

  # Metrics
  metrics = {
      'Accuracy': accuracy_score(y_test, y_pred),
      'Precision': precision_score(y_test, y_pred),
      'Recall': recall_score(y_test, y_pred),
      'F1 Score': f1_score(y_test, y_pred)
  }
  with open('sentiment_metrics.txt', 'w') as f:
      for k, v in metrics.items():
          f.write(f"{k}: {v:.2f}\n")

  # Save predictions (train_test_split shuffles rows, so look up the test rows by index)
  results = pd.DataFrame({
      'review': df.loc[y_test.index, 'review'],
      'true_sentiment': df.loc[y_test.index, 'sentiment'],
      'predicted_sentiment': ['positive' if p == 1 else 'negative' for p in y_pred]
  })
  results.to_csv('sentiment_predictions.csv', index=False)

  # Precision-recall curve
  y_scores = model.predict_proba(X_test)[:, 1]
  precision, recall, _ = precision_recall_curve(y_test, y_scores)
  plt.plot(recall, precision)
  plt.xlabel('Recall')
  plt.ylabel('Precision')
  plt.title('Precision-Recall Curve')
  plt.savefig('pr_curve.png')
  plt.show()
  ```
- **Tools**: Scikit-learn, SpaCy, Pandas, Matplotlib.
- **Variation**: Try Naive Bayes or increase `max_features` to compare performance.
- **Debugging Tips**:
  - CSV not saving? Use absolute path (e.g., `/content/sentiment_predictions.csv`).
  - Plot not showing? Run `plt.show()` or use `%matplotlib inline` in Colab.
  - Low F1? Check preprocessing or class balance.
- **Resources**:
  - [Scikit-learn Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics).
  - [Matplotlib Plotting](https://matplotlib.org/stable/users/explain/quick_start.html).
- **R&D Tip**: Add this project to your GitHub portfolio. Document preprocessing and evaluation choices to show research rigor.
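
To see where the precision-recall curve in step 5 comes from, it can help to sweep the decision threshold by hand. The sketch below uses made-up labels and scores (standing in for `y_test` and `model.predict_proba(X_test)[:, 1]`) and a hypothetical helper `precision_recall_at`; Scikit-learn's `precision_recall_curve` does the same sweep over every distinct score.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention used here: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up true labels and predicted probabilities of the positive class
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]

for threshold in (0.9, 0.65, 0.5, 0.2):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

A strict threshold (0.9) gives perfect precision but finds only a quarter of the positives; lowering it to 0.2 finds all of them while precision drops to about 0.57. Each threshold is one point on the curve.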

---

### **Checkpoints**

1. **Quiz (30 minutes)**:
   - Questions:
     1. What is supervised learning in the context of text classification?
     2. How does sentiment analysis differ from general text classification?
     3. Why is F1 score preferred over accuracy for imbalanced data?
     4. What role does TF-IDF play in classification?
   - Answers (example):
     1. Training a model on labeled text to predict labels for new text.
     2. Sentiment analysis classifies emotional tone (e.g., positive/negative); general classification includes other tasks (e.g., spam detection).
     3. F1 balances precision and recall, so it stays informative when one class dominates and accuracy looks deceptively high.
     4. TF-IDF converts text to weighted vectors, highlighting important words.
   - **Task**: Write answers in a notebook or share on X with #NLP.

2. **Task (30 minutes)**:
   - Check `sentiment_predictions.csv`: Are predictions logical (e.g., “I loved it” → positive)?
   - Inspect `sentiment_metrics.txt`: Is F1 score >0.80? If not, revisit preprocessing.
   - Verify `pr_curve.png`: Does it show the trade-off, with precision generally falling as recall rises?
   - Save files to GitHub; share on X for feedback.
   - **R&D Connection**: Evaluating metrics like F1 is a research skill for model validation.

---

### **R&D Focus**

- **Why It Matters**: Sentiment analysis is a key research area, used in social media analysis, customer feedback, and bias detection.
- **Action**: Skim the introduction of [Socher et al., 2013](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) (5 minutes). Note how it discusses recursive neural networks for sentiment.
- **Community**: Share your metrics or PR curve on X with #NLP or [Hugging Face Discord](https://huggingface.co/join-discord). Ask for feedback on model performance.
- **Research Insight**: Experiment with `max_features` (e.g., 5000 vs. 10000) to see its impact on F1, mimicking research optimization.

---

### **Execution Plan**

**Total Time**: ~20 hours (1–2 weeks, 10–12 hours/week).  
- **Day 1–2**: Theory (5 hours). Read Scikit-learn tutorial, Jurafsky Chapter 4, note evaluation metrics.  
- **Day 3–5**: Practical (10 hours). Complete tasks (preprocessing, TF-IDF, Logistic Regression, evaluation).  
- **Day 6–7**: Mini-Project (5 hours). Build Sentiment Classifier, save CSV/text/plot, share on GitHub/X.  

**Tips for Success**:
- **Stay Motivated**: Think about using sentiment analysis for your R&D goal (e.g., analyzing X posts for sentiment trends).  
- **Debugging**: Search errors on [Stack Overflow](https://stackoverflow.com/) or ask in Hugging Face Discord.  
- **Portfolio**: Add `sentiment_predictions.csv`, `sentiment_metrics.txt`, `pr_curve.png`, and code to GitHub with comments explaining steps.  
- **Foundation Check**: If you complete the mini-project in <5 hours and achieve F1 >0.80, you’re ready for Chapter 6 (Language Models).  
- **Variation**: If you want a new dataset, try [Twitter Sentiment](https://www.kaggle.com/datasets/kazanova/sentiment140) for shorter texts.

---

### **Why This Chapter is Ideal for You**

- **Beginner-Friendly**: Simple explanations, step-by-step code, and free tools make classification accessible.  
- **Practical**: Hands-on tasks and a mini-project build coding skills for research applications.  
- **Research-Oriented**: Connects classification to research tasks, with paper references for R&D.  
- **Engaging**: IMDB reviews are relatable, keeping your passion alive.  
- **Structured**: Clear timeline, debugging tips, and checkpoints ensure progress.  

This chapter strengthens your NLP foundation by mastering text classification and sentiment analysis, essential for R&D, while building a portfolio piece. If you want a detailed code walkthrough (e.g., precision-recall curve), a different dataset (e.g., Twitter), or help with specific issues (e.g., low F1 score), let me know! Ready to start with the theory or setup?