# Introduction

- 📺 **Video:** [https://youtu.be/k5p8teUNHX4](https://youtu.be/k5p8teUNHX4)

## Overview
- Frame NLP tasks as supervised learning problems with inputs, outputs, and evaluation metrics.
- Understand the data-centric mindset: curate text corpora, define labels, and close the loop with error analysis.
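To make the "inputs, outputs, and evaluation metrics" framing concrete, here is a minimal sketch in plain Python. The toy examples, the tag names, and the helper names `accuracy` and `token_accuracy` are illustrative assumptions, not material from the lecture.

```python
# Hypothetical toy data: each task is a list of (input, gold output) pairs.
classification = [
    ("Stunning visuals and clever writing", "positive"),
    ("Dull dialogue and forgettable scenes", "negative"),
]
sequence_labeling = [
    (["Ada", "visited", "Paris"], ["PER", "O", "LOC"]),
]


def accuracy(gold, pred):
    """Fraction of items where the prediction equals the gold output."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def token_accuracy(gold_seqs, pred_seqs):
    """Accuracy over all tokens, after flattening every sequence."""
    gold = [tag for seq in gold_seqs for tag in seq]
    pred = [tag for seq in pred_seqs for tag in seq]
    return accuracy(gold, pred)


# Trivial stand-in "models" that always predict one label,
# used only to exercise the metrics.
cls_gold = [y for _, y in classification]
cls_pred = ["positive"] * len(cls_gold)
print(accuracy(cls_gold, cls_pred))  # 0.5: one of two reviews is correct

tag_gold = [tags for _, tags in sequence_labeling]
tag_pred = [["O"] * len(tokens) for tokens, _ in sequence_labeling]
print(token_accuracy(tag_gold, tag_pred))  # one of three tokens is correct
```

The same three ingredients (input, gold output, metric) carry over to generation, where the output is a text sequence and the metric is usually something other than exact-match accuracy.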

## Key ideas
- **Task formulation:** classification, sequence labeling, and generation share the pattern of mapping input text to an output: a single label, one label per token, or a new text sequence.
- **Features vs. models:** representation choices often matter more than the classifier in early stages.
- **Generalization:** split data into train/dev/test to estimate out-of-sample performance.
- **Iteration cycle:** inspect failures to refine features or collect better data.
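The train/dev/test split mentioned above can be sketched with two calls to scikit-learn's `train_test_split`. The corpus here is a hypothetical placeholder, and the 60/20/20 proportions are an assumption chosen for illustration rather than a recommendation from the lecture.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 20 texts with balanced binary labels.
texts = [f"example review {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

# Step 1: hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

# Step 2: carve a dev set out of the remainder
# (0.25 of the remaining 80% is 20% of the full corpus).
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_dev), len(X_test))  # 12 4 4
print(Counter(y_train), Counter(y_dev), Counter(y_test))
```

The dev set is the one to consult while iterating on features or hyperparameters; the test set is best left untouched until the end so that it still estimates out-of-sample performance.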

## Demo
Build a minimal text classification dataset, split it into train/test, and evaluate a linear baseline to show the learning cycle introduced in the lecture (https://youtu.be/i5uG9c0ho7Y).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Toy sentiment dataset: 1 = positive review, 0 = negative review.
texts = [
    'This sequel surprised me with emotional depth',
    'Flat characters and a weak script ruined it',
    'Stunning visuals and clever writing',
    'The pacing dragged and I lost interest',
    'Heartwarming performances all around',
    'Poor editing made it hard to follow',
    'Inventive storytelling that kept me hooked',
    'Dull dialogue and forgettable scenes'
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of the examples, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=7, stratify=labels
)

# Linear baseline: TF-IDF features fed into logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000, random_state=7))
model.fit(X_train, y_train)

# Evaluate on the held-out examples.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
```


```text
              precision    recall  f1-score   support

           0      1.000     1.000     1.000         1
           1      1.000     1.000     1.000         1

    accuracy                          1.000         2
   macro avg      1.000     1.000     1.000         2
weighted avg      1.000     1.000     1.000         2
```



## Try it
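- Modify the demo: for example, swap `TfidfVectorizer` for `CountVectorizer`, or add bigrams with `ngram_range=(1, 2)`, and compare the classification reports.
- Add a tiny dataset or a counter-example, such as a review that uses negation, and check whether the baseline still classifies it correctly.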


## References
- [Eisenstein 2.0-2.5, 4.2-4.4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Perceptron and logistic regression](https://www.cs.utexas.edu/~gdurrett/courses/online-course/perc-lr-connections.pdf)
- [Eisenstein 4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Thumbs up? Sentiment Classification using Machine Learning Techniques](https://www.aclweb.org/anthology/W02-1011/)
- [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://www.aclweb.org/anthology/P12-2018/)
- [Convolutional Neural Networks for Sentence Classification](https://www.aclweb.org/anthology/D14-1181/)
- [[GitHub] NLP Progress on Sentiment Analysis](https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md)


*Links only; we do not redistribute slides or papers.*