[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/juanhuguet/intro_to_nlp/blob/main/notebooks/03-basic-sentiment-classification-supervised-classification.ipynb)

# Basic sentiment classification using annotated data

In [1]:
import pandas as pd

In [2]:
from sklearn.linear_model import LogisticRegression

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

### Let's simulate some reviews...

In [4]:
reviews_docs = ["The food was good",
                "The service was bad",
                "The service and the food were good",
               ]

labels = ["positive",
          "negative",
          "positive"]

### Let's use scikit-learn to compute the count vectors...

In [5]:
vect = CountVectorizer()

In [6]:
word_counts = vect.fit_transform(reviews_docs)

### Let's inspect the results

Note: `word_counts` is returned as a sparse matrix, so we call its `todense()` method before passing it to the DataFrame constructor.

In [7]:
reviews = pd.DataFrame(word_counts.todense(),
                       columns=vect.get_feature_names_out(),
                       index=reviews_docs)

In [8]:
reviews["sentiment"] = labels

In [9]:
reviews

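| | and | bad | food | good | service | the | was | were | sentiment |
|---|---|---|---|---|---|---|---|---|---|
| The food was good | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | positive |
| The service was bad | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | negative |
| The service and the food were good | 1 | 0 | 1 | 1 | 1 | 2 | 0 | 1 | positive |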
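
To see what the vectorizer actually learned, here is a small standalone sketch that rebuilds the same counts and looks at the sparse matrix and the vocabulary directly. The variable names (`docs`, `counts`, `dense`) are ours, chosen for illustration; the calls themselves are standard scikit-learn and SciPy.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The food was good",
        "The service was bad",
        "The service and the food were good"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# fit_transform returns a SciPy sparse matrix: only non-zero counts are stored
print(issparse(counts))
print(counts.shape)  # (number of documents, vocabulary size)

# vocabulary_ maps each lowercased token to its column index
for token, column in sorted(vect.vocabulary_.items(), key=lambda kv: kv[1]):
    print(column, token)

# toarray() converts the sparse matrix to a plain NumPy array
dense = counts.toarray()
print(dense[2])  # counts for "The service and the food were good"
```

Note that the default tokenizer lowercases the text, which is why "The" and "the" share one column and the third review gets a count of 2 for `the`.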


### Let's use these features as input for the classifier

Note: this exercise only demonstrates how a model can learn from features extracted from text. A real project needs many more examples and a proper modelling process, including a train/test split and validation.
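
As a rough idea of what that process could look like, here is a minimal sketch with a held-out test set. The eight reviews are hypothetical and invented for illustration (they are still far too few for a meaningful score), and wrapping both steps in a pipeline is our choice, not something the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews, invented for illustration only
docs = [
    "The food was good", "The service was bad",
    "The service and the food were good", "The food was bad",
    "The staff were good", "The staff were bad",
    "The dessert was good", "The dessert was bad",
]
labels = ["positive", "negative", "positive", "negative",
          "positive", "negative", "positive", "negative"]

# Hold out 25% of the reviews, keeping the class balance in both parts
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)

# The pipeline fits the vectorizer on the training texts only,
# so no vocabulary information leaks from the test set
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(docs_train, y_train)

# Accuracy on reviews the model has never seen
accuracy = model.score(docs_test, y_test)
print(len(docs_train), len(docs_test), accuracy)
```

With more data, `sklearn.model_selection.cross_val_score` on the same pipeline would give a more stable estimate than a single split.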

In [10]:
clf_lr = LogisticRegression()

In [11]:
X = reviews.drop(columns=["sentiment"])

In [12]:
y = reviews["sentiment"]

In [13]:
clf_lr.fit(X, y)

### Let's make a prediction...

In [14]:
new_review = ["the weather was very bad"]

In [15]:
new_review_vect = vect.transform(new_review)

In [16]:
clf_lr.predict(new_review_vect)



array(['negative'], dtype=object)