# Sentiment Analysis and Basic Feature Extraction

- 📺 **Video:** [https://youtu.be/0jSElGFUxro](https://youtu.be/0jSElGFUxro)

## Overview
- Learn how to transform raw text into useful lexical features for sentiment analysis.
- Compare binary, count, and n-gram representations as precursors to more advanced models.

## Key ideas
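- **Tokenization choices:** whitespace-delimited words and character n-grams capture different signals (whole-word meaning vs. subword patterns such as shared stems).
- **Bag-of-words:** represent each document as a numeric vector of token counts, ignoring word order, so that linear models can consume it.
- **Feature scaling:** TF-IDF down-weights terms that occur in most documents (such as function words) relative to rarer terms, which often include sentiment-bearing words.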
- **Baselines:** start with simple models to identify whether the labeling task is feasible.

## Demo
Extract unigram and bigram TF-IDF features with scikit-learn, then train a logistic regression classifier to echo the workflow in the lecture (https://youtu.be/j_M2i0TyBCw).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Toy movie-review corpus: 1 = positive, 0 = negative.
texts = [
    'Loved the witty banter throughout the film',
    'The jokes were predictable and stale',
    'Heartfelt performances made me tear up',
    'Boring plot with wooden acting',
    'Inventive directing and charming leads',
    'Too long and devoid of energy',
    'Unexpected twists kept me engaged',
    'Dull exposition with no payoff'
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Unigram + bigram TF-IDF features feeding a logistic regression classifier.
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000, random_state=31))
])

# Hold out 25% (two reviews), stratified so each class appears once in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=31, stratify=labels
)
pipe.fit(X_train, y_train)

# zero_division=0 reports 0.0 (without a warning) when a class is never predicted.
print(classification_report(y_test, pipe.predict(X_test), digits=3, zero_division=0))
```


```text
              precision    recall  f1-score   support

           0      0.500     1.000     0.667         1
           1      0.000     0.000     0.000         1

    accuracy                          0.500         2
   macro avg      0.250     0.500     0.333         2
weighted avg      0.250     0.500     0.333         2
```



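
With only two held-out reviews, a single misclassification moves accuracy by 50 points, so the report above is closer to a smoke test than a measurement. One possible alternative, offered as an assumption rather than as part of the lecture demo, is stratified cross-validation, which lets every review serve as test data once. The fold count and seed below are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Same toy corpus as the demo: 1 = positive, 0 = negative.
texts = [
    'Loved the witty banter throughout the film',
    'The jokes were predictable and stale',
    'Heartfelt performances made me tear up',
    'Boring plot with wooden acting',
    'Inventive directing and charming leads',
    'Too long and devoid of energy',
    'Unexpected twists kept me engaged',
    'Dull exposition with no payoff'
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000, random_state=31))
])

# Four folds of two reviews each, one per class; the vectorizer is refit
# inside every fold, so no test vocabulary leaks into training.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=31)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring='accuracy')

print(scores)         # per-fold accuracy
print(scores.mean())  # average over all eight reviews
```

Even this estimate is noisy on eight sentences; the point is the workflow, not the number.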


## Try it
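- Modify the demo: for example, change `ngram_range` or replace `TfidfVectorizer` with a binary or count-based vectorizer, then compare the classification reports.
- Add a tiny dataset or counter-example, such as a review whose sentiment depends on word order or negation, and check how the classifier handles it.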
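
As a possible starting point for the counter-example exercise, the sketch below uses a hypothetical pair of phrases that contain the same words in a different order. Unigram counts cannot tell them apart, while bigram counts can.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical counter-example: same words, opposite sentiment.
pair = ["good not bad", "bad not good"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit_transform(pair).toarray()
bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(pair).toarray()

# Compare the two documents' feature vectors under each representation.
same_unigrams = bool((unigrams[0] == unigrams[1]).all())
same_bigrams = bool((bigrams[0] == bigrams[1]).all())

print(same_unigrams)  # True: the bag of unigrams ignores word order
print(same_bigrams)   # False: "not bad" and "not good" are distinct features
```

Any classifier trained on unigram features alone must assign both phrases the same label, which is one motivation for the bigram features used in the demo.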


## References
- [Eisenstein 2.0-2.5, 4.2-4.4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Perceptron and logistic regression](https://www.cs.utexas.edu/~gdurrett/courses/online-course/perc-lr-connections.pdf)
- [Eisenstein 4.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Thumbs up? Sentiment Classification using Machine Learning Techniques](https://www.aclweb.org/anthology/W02-1011/)
- [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://www.aclweb.org/anthology/P12-2018/)
- [Convolutional Neural Networks for Sentence Classification](https://www.aclweb.org/anthology/D14-1181/)
- [[GitHub] NLP Progress on Sentiment Analysis](https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md)


*Links only; we do not redistribute slides or papers.*