[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/juanhuguet/intro_to_nlp/blob/main/notebooks/03-basic-sentiment-classification-supervised-classification.ipynb)

# Feature extraction for ML classification

A count vectorizer is a **feature extraction** technique in natural language processing (NLP)
that converts a collection of text documents into a matrix of word counts.

In other words, it **transforms text data into a numerical representation that machine learning models can understand and process.**

In [None]:
import pandas as ..

In [None]:
from sklearn.linear_model import ...

In [None]:
from sklearn.feature_extraction.text import ...

#### Let's simulate some reviews...

In [None]:
reviews_docs = ["The food was good",
                "The service was bad",
                "The service and the food were good",
               ]

labels = ["positive",
          "negative",
          "positive"]

### Let's use scikit-learn to calculate the count vectors...

1. The count vectorizer takes a collection of text documents and tokenizes them into individual words.
2. It then counts how many times each word appears in each document.
3. Finally, it creates a matrix where each row represents a document and each column represents a word, and the values in the matrix are the word counts for each document.
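
The three steps above can be sketched by hand with the standard library only. This is a rough illustration, not scikit-learn's implementation: the tokenization rule (lowercase, keep tokens of two or more word characters) and the alphabetically sorted vocabulary are assumptions chosen to resemble the default behaviour, and the helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

docs = [
    "The food was good",
    "The service was bad",
    "The service and the food were good",
]


def tokenize(doc):
    # Assumed rule: lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())


# Step 1: split each document into individual words
tokenized = [tokenize(doc) for doc in docs]

# Step 2: count how many times each word appears in each document
counts = [Counter(tokens) for tokens in tokenized]

# Step 3: build a matrix with one row per document and one column per word
vocabulary = sorted({word for tokens in tokenized for word in tokens})
matrix = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Note that the third row holds a 2 in the `the` column, because that word appears twice in the third review once the text is lowercased.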

In [None]:
vect = ...

In [None]:
word_counts = vect.fit_transform(reviews_docs)

### Let's inspect the results

Note: `word_counts` is returned as a sparse matrix, so we call its `todense()` method before passing it to the DataFrame constructor.

In [None]:
reviews = pd.DataFrame(word_counts.todense(),
                       columns=vect.get_feature_names_out(),
                       index=reviews_docs)

In [None]:
reviews["sentiment"] = labels

In [None]:
reviews

### Let's use these features as input for the classifier

We can train a classifier such as logistic regression without explicitly having to categorise the positive or negative intent of each word, as we did in the previous example.

The algorithm will automatically infer these relationships from the vector representations and their labels.

> **This means we have turned sentiment classification into a supervised machine learning classification problem.**

Note: this exercise only demonstrates how a model can learn from features extracted from text. In practice we need many more examples and a proper modelling process that includes a train/test split and validation.
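
As a rough idea of what such a process might look like, here is a minimal sketch with a held-out test set. The eight reviews are hypothetical toy data invented for illustration, and the pipeline, split size and random seed are assumptions rather than recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, made up for this sketch only
docs = [
    "The food was good",
    "The service was bad",
    "The service and the food were good",
    "The food was bad",
    "The staff were good",
    "The staff were bad",
    "The menu was good",
    "The menu was bad",
]
labels = ["positive", "negative", "positive", "negative",
          "positive", "negative", "positive", "negative"]

# Hold out 25% of the reviews, keeping the class balance in both splits
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer on the training reviews only,
# so the vocabulary never sees the test reviews
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(docs_train, y_train)

accuracy = model.score(docs_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the test reviews out of the vocabulary, which is the main point of the split. A score computed on two test reviews says very little, so a real dataset is still needed.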

In [None]:
clf_lr = ...

In [None]:
X = reviews.drop(columns=["sentiment"])

In [None]:
y = reviews["sentiment"]

In [None]:
clf_lr.fit(X, y)

### Let's make a prediction...

In [None]:
new_review = ["the weather was very bad"]

In [None]:
new_review_vect = vect.transform(new_review)

In [None]:
clf_lr.predict(new_review_vect)