# AI Lab Assignment 4

# 3. Text processing and unbalanced data (4.5 points)

In [None]:
import pandas as pd

df = pd.read_csv("train.tsv", sep='\t', index_col="PhraseId")
pd.set_option('display.max_colwidth', None)
df.head(2)

We will use a dataset of tagged movie-review phrases to predict the sentiment of a text.
The sentiment labels are:

* 0 - negative
* 1 - somewhat negative
* 2 - neutral
* 3 - somewhat positive
* 4 - positive

More information [here](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews).


Throughout this exercise we will see different ways of processing the text and balancing the classes to be learned.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df[["Phrase", "Sentiment"]], random_state=0)
X_train = train.Phrase
X_test = test.Phrase
y_train = train.Sentiment
y_test = test.Sentiment
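By default, `train_test_split` does not stratify, so the class proportions of the two partitions can differ slightly. The following is a minimal sketch on made-up labels (not the assignment data) of how the `stratify` argument keeps the proportions equal; `X_toy` and `y_toy` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced labels: 80 samples of class 0 and 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Share of each class in the training and test partitions.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share, test_share)
```

Whether stratifying changes anything noticeable on a dataset of this size is something to check rather than assume.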

**Basic Models**

We are going to start with a *pipe* of very basic models and see if they have any issues.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

count_vectorizer = CountVectorizer(max_features=1000)
decision_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2)
pipe = make_pipeline(count_vectorizer,decision_tree)

In [None]:
from sklearn.metrics import accuracy_score

def get_accuracy(pipe):
    # Fit on the training split and score on the held-out test split
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    return round(accuracy_score(y_test, y_pred), 3)

In [None]:
get_accuracy(pipe)

Let's see which classes are actually being predicted.

In [None]:
# Training labels
round(pd.Series(y_train).value_counts(normalize=True),2)

In [None]:
# Predicted classes
y_pred = pipe.predict(X_test)
round(pd.Series(y_pred).value_counts(normalize=True),2)

Because the problem is unbalanced, the predictions are clearly biased towards one of the classes.

This can also be seen by analyzing the **confusion matrix**.

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def plot_confusion_matrix(pipe):
    labels = list(range(5))
    y_pred = pipe.predict(X_test)
    # labels must be passed by keyword in recent scikit-learn versions
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    fig = plt.figure(figsize=(10, 5))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Confusion matrix')
    fig.colorbar(cax)
    # Fix the tick positions first so each label matches its row/column
    ax.set_xticks(labels)
    ax.set_yticks(labels)
    ax.set_xticklabels(labels)
    ax.set_yticklabels(labels)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

plot_confusion_matrix(pipe)
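Accuracy alone can hide this kind of bias. As a minimal sketch on made-up labels (not the assignment data), the cell below compares plain accuracy with per-class recall and balanced accuracy for a hypothetical classifier that always predicts class 2; `y_true_toy` and `y_pred_toy` are names introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical true labels: class 2 dominates, as in an unbalanced problem.
y_true_toy = np.array([0, 1, 2, 2, 2, 2, 2, 2, 3, 4])
# A degenerate classifier that always predicts class 2.
y_pred_toy = np.full_like(y_true_toy, 2)

# Fraction of correct predictions over all samples.
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
# Recall computed separately for each of the five classes.
toy_recall = recall_score(
    y_true_toy, y_pred_toy, labels=[0, 1, 2, 3, 4], average=None
)
# Balanced accuracy is the mean of the per-class recalls.
toy_balanced = balanced_accuracy_score(y_true_toy, y_pred_toy)

print(toy_accuracy, toy_balanced, toy_recall)
```

The same three functions can be applied to `y_test` and the predictions of any pipe to complement the confusion matrix.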

## Changing models and data processing

Analyze whether any of the other models seen in previous exercises are less affected by this class imbalance.

For example, the following cell replaces the decision tree classifier with a KNN classifier.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
pipe_knn = make_pipeline(count_vectorizer,knn)
get_accuracy(pipe_knn)

In [None]:
plot_confusion_matrix(pipe_knn)

### Explore multiple models and answer the following questions:

* Which model gives better results?
* Are there any parameters of the models that are particularly effective in avoiding imbalance?

Note: include as many cells as you need to show the code you used to answer these questions.
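One parameter that may be worth including in these experiments is `class_weight`, which some scikit-learn classifiers (for instance `DecisionTreeClassifier`) accept. The sketch below, on made-up labels rather than the assignment data, shows how the `'balanced'` setting translates class frequencies into weights; `y_toy` is a hypothetical name, and whether the option actually helps on this dataset is for the experiments to show.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: class 2 has six samples, every other class has one.
y_toy = np.array([2, 2, 2, 2, 2, 2, 0, 1, 3, 4])
classes = np.unique(y_toy)

# 'balanced' weights are n_samples / (n_classes * number of samples in the class),
# so rare classes receive larger weights than frequent ones.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_by_class = {int(c): float(w) for c, w in zip(classes, weights)}
print(weight_by_class)

# The same option passed to a classifier, keeping the tree settings used earlier.
weighted_tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=2, class_weight="balanced"
)
```

A tree built this way can be placed in a pipeline with `make_pipeline` like any other classifier.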

(answers)

### Data Processing

The first step used in the *pipe* above is very simple: it counts how many times each term appears in a phrase. Also, the vocabulary has been limited to the 1000 most frequent terms (`max_features=1000`). Try other ways of processing the text and discuss the differences.
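To make the effect of two `CountVectorizer` options concrete, here is a minimal sketch on two made-up phrases (not the assignment data). It assumes the default tokenizer and shows that `ngram_range=(1, 2)` adds word pairs to the vocabulary, while `max_features` keeps only the most frequent terms; `toy_phrases` is a hypothetical name.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical phrases that differ only by a negation.
toy_phrases = ["not good", "good"]

# Single words plus pairs of consecutive words.
unigrams_and_bigrams = CountVectorizer(ngram_range=(1, 2))
unigrams_and_bigrams.fit(toy_phrases)
full_vocab = sorted(unigrams_and_bigrams.vocabulary_)

# Same setting, but keeping only the single most frequent term.
top_one = CountVectorizer(ngram_range=(1, 2), max_features=1)
top_one.fit(toy_phrases)
small_vocab = sorted(top_one.vocabulary_)

print(full_vocab, small_vocab)
```

With single words only, the negated phrase is represented by the words "not" and "good" separately; the bigram "not good" gives the classifier a dedicated feature for it, at the cost of a larger vocabulary.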

As an example, the following cell uses TF-IDF weighting ([TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)), which scales the frequency of each term in a phrase by how rare that term is across all phrases. The documentation shows that it allows several configurations (removing accents, changing to lowercase, performing more complex transformations, removing common words, etc.).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
# tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,3))
pipe_tfidf = make_pipeline(tfidf_vectorizer, knn)
print(get_accuracy(pipe_tfidf))
plot_confusion_matrix(pipe_tfidf)
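To see what changes between raw counts and TF-IDF weights, the following sketch compares both vectorizers on a made-up three-phrase corpus (not the assignment data), using their default settings; `toy_corpus` and the other variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = ["good movie", "bad movie", "good good plot"]

# Raw term counts per phrase.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_corpus).toarray()

# TF-IDF weights per phrase (default settings).
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_corpus).toarray()

# Column indices of two terms in each representation.
bad_count_col = count_vec.vocabulary_["bad"]
movie_count_col = count_vec.vocabulary_["movie"]
bad_col = tfidf_vec.vocabulary_["bad"]
movie_col = tfidf_vec.vocabulary_["movie"]

# In "bad movie" both words occur once, so their raw counts are equal...
same_count = counts[1, bad_count_col] == counts[1, movie_count_col]
# ...but "bad" appears in fewer phrases than "movie", so TF-IDF weights it higher.
bad_weighs_more = tfidf[1, bad_col] > tfidf[1, movie_col]

# With the default norm, every TF-IDF row is scaled to unit Euclidean length.
row_norms = np.linalg.norm(tfidf, axis=1)
print(same_count, bad_weighs_more, row_norms)
```

The row normalization is relevant for distance-based classifiers such as KNN, since it removes the effect of phrase length on the distances.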

Try various classifiers and settings for text processing and answer the following questions:

* According to your experiments, which processing step has the greatest effect on the results: the classifier or the text processing?
* Have you found any type of processing that always improves the results? What hypothesis would you propose to explain this behavior?

In [None]:
# include code about this section here