# **Problem 2: KNN & AdaBoost**

In this part, you'll be working with KNN and AdaBoost.

# 0) Loading Data & Libraries

In [1]:
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets

# set a seed for reproducibility
random_seed = 25
np.random.seed(random_seed)

# We need to ignore FutureWarnings due to a bug in our version of sklearn
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

# 1) Twitter Sentiment Analysis using KNN

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is the study of how we can systematically identify and quantify the sentiment of a given segment of text. In this problem, you will be using a reduced version of the [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) dataset to build a classifier that will determine whether a tweet expresses positive or negative sentiment. The original dataset has sentiment ratings ranging from 0 (very negative) to 4 (very positive), with in-between values such as 2 being neutral. However, for simplicity, we've only given you explicitly positive or negative tweets.

**WARNING:** This is *real* data from *real* people on Twitter. This means that if you browse through the actual text (which you are not required to do), you might see offensive, toxic, and potentially triggering language. [Toxicity Detection](https://aclanthology.org/2020.acl-main.396/) is an ongoing field of research.

Let's take a look at our data, and what attributes we have.

* **polarity**: The assumed polarity of the tweet. For this subset, we're only considering positive and negative tweets, no neutrality.
* **id**: The tweet ID.
* **date**: The date the tweet was posted.
* **query**: The search term used in order to find tweets of a certain topic.
* **user**: The username of the account that posted the tweet.
* **text**: The actual text of the tweet.

For now, we only consider the **polarity** and **text** attributes.

In [2]:
train_data = pd.read_csv('./hw2_p3.csv')
train_data.head()

Unnamed: 0.1,Unnamed: 0,polarity,id,date,query,user,text
0,639736,0,1881528306,Fri May 22 04:54:30 PDT 2009,NO_QUERY,ClareMacG,where is he? hmmmm he didnt even reply to me t...
1,228095,0,1759081303,Sun May 10 18:25:07 PDT 2009,NO_QUERY,jaredhaha,family guy sucks tonight
2,689126,0,2322274292,Wed Jun 24 22:20:07 PDT 2009,NO_QUERY,spazzyyarn,"@jesus_iscomin I am so sorry, that sucks!"
3,372153,0,1693569517,Sun May 03 22:59:23 PDT 2009,NO_QUERY,missprettylady,goin to bed...definitly didn't study like i wa...
4,365761,0,2244313762,Fri Jun 19 14:35:02 PDT 2009,NO_QUERY,jojoe777,"just was at the hospital, long story made shor..."


In [3]:
# Check our distribution of polarity
# 4 means positive sentiment
# 0 means negative sentiment
train_data.polarity.value_counts()

polarity
0    20000
4    20000
Name: count, dtype: int64

## 1.1) Example: Brief Introduction to tf-idf

Read through and run the following example, and answer the question at the end.

In information retrieval, [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (*term frequency – inverse document frequency*) is a metric that represents how "important" a word is to a document within a corpus of text. While we won't go into detail about how it works, all you need to know is that it balances two metrics.

**Term Frequency**: Exactly what one would expect: the frequency with which a word appears in a document. A word with a higher term frequency appears more often in that document than a word with a lower term frequency.

**Inverse Document Frequency**: If we only used term frequency, common words like "the" or "and" would have a high score, even though they give us little information because they appear in almost every document. *Inverse Document Frequency* measures how much "information" a word provides, that is, whether the word is common or rare across all documents.
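To make the balance between the two metrics concrete, here is a minimal hand-rolled sketch on a four-sentence toy corpus. It assumes raw counts for term frequency, the smoothed idf variant `ln((1 + N) / (1 + df)) + 1`, and unit-length (L2) scaling of each document vector, which is how scikit-learn documents its default `TfidfVectorizer` behavior. The helper names (`tokenize`, `idf`, `tfidf`) are illustrative only and not part of any library.

```python
import math
import re
from collections import Counter

# Toy corpus, chosen for illustration.
docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Lowercase and keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    counts = Counter(tokens)                          # raw term frequency
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # unit-length vector

vec0 = tfidf(tokenized[0])
vec1 = tfidf(tokenized[1])
print(round(vec0["first"], 4), round(vec0["is"], 4))     # 0.5803 0.3841
print(round(vec1["document"], 4), round(vec1["is"], 4))  # 0.6876 0.2811
```

Words that occur in every document ("is", "the", "this") get the smallest possible idf of 1, while a word that occurs in only one document ("second") gets the largest.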

### tf-idf with sklearn

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)

(4, 9)


The tf-idf vectorizer's `fit_transform` method returns an `N`x`M` matrix. `N` is the number of documents (sentences) in your corpus, and `M` is the number of unique words in your corpus. The entry at row `n`, column `m` is how important word `m` is to document `n`.

In [5]:
# Printing out the tf-idf matrix
np.set_printoptions(precision=4)
print(X.toarray())

[[0.     0.4698 0.5803 0.3841 0.     0.     0.3841 0.     0.3841]
 [0.     0.6876 0.     0.2811 0.     0.5386 0.2811 0.     0.2811]
 [0.5118 0.     0.     0.2671 0.5118 0.     0.2671 0.5118 0.2671]
 [0.     0.4698 0.5803 0.3841 0.     0.     0.3841 0.     0.3841]]


In [6]:
# Notice that if we try to print X directly, we get an overview saying that X is a "sparse matrix".
# In very large corpora with many unique words, most entries in each row are going to be zero.
# Thus scikit-learn stores the result in SciPy's compressed sparse row (CSR) format.
X

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [7]:
# Next let's see what word each column corresponds to:
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

Let's look at the tf-idf vectors for two different documents.

**Note**: `dict(zip(A, B))` in Python makes a dictionary out of a list of keys (A) and a list of values (B). This just makes it easier to view each term with its corresponding tf-idf value.

In [8]:
print(corpus[0])
dict(zip(vectorizer.get_feature_names_out(),X.toarray()[0]))

This is the first document.


{'and': 0.0,
 'document': 0.46979138557992045,
 'first': 0.5802858236844359,
 'is': 0.38408524091481483,
 'one': 0.0,
 'second': 0.0,
 'the': 0.38408524091481483,
 'third': 0.0,
 'this': 0.38408524091481483}

In [9]:
print(corpus[1])
dict(zip(vectorizer.get_feature_names_out(),X.toarray()[1]))

This document is the second document.


{'and': 0.0,
 'document': 0.6876235979836938,
 'first': 0.0,
 'is': 0.281088674033753,
 'one': 0.0,
 'second': 0.5386476208856763,
 'the': 0.281088674033753,
 'third': 0.0,
 'this': 0.281088674033753}

Take a look at the tf-idf vectors for both of these sentences and answer the following questions:
1. Why is the value for the term "is" higher in document1 than document2?
2. Why is the value for the term "document" higher in document2 than document1?

**Answer Here**

**1. Why is the value for the term "is" higher in document1 than document2?**

Both documents contain "is" exactly once, and the idf of "is" is the same everywhere, so the raw tf-idf weight of "is" is identical in the two documents. The difference comes from normalization: by default, `TfidfVectorizer` rescales each document's vector to unit Euclidean (L2) length. Document2 carries more total weight than document1 (it contains "document" twice, plus the rare word "second"), so dividing by its larger norm shrinks every entry, including "is", more than in document1.
   
**2. Why is the value for the term "document" higher in document2 than document1?**

This is because document2 contains two instances of the term "document", while document1 contains only one. The idf of "document" is the same in both documents, so the doubled term frequency gives it a larger raw weight in document2, and it remains larger after normalization (0.69 vs. 0.47).

## 1.2) tf-idf on Twitter

Now, before we build a classifier, let's just try and see what the nearest neighbors of a specified message are.

In [10]:
# Get our text
corpus = train_data["text"]

In [11]:
# Run our transform
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(corpus)

In [12]:
# Let's check the size of our matrix
tfidf_matrix

<40000x50209 sparse matrix of type '<class 'numpy.float64'>'
	with 476467 stored elements in Compressed Sparse Row format>
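The numbers in this summary show why a sparse format matters here. The following back-of-the-envelope sketch uses only the shape and stored-element count reported above; the 8 bytes per entry is an assumption based on the `float64` dtype.

```python
# Shape and stored-element count as reported for the tweet matrix.
n_rows, n_cols = 40000, 50209
n_stored = 476467

density = n_stored / (n_rows * n_cols)      # fraction of non-zero entries
avg_terms_per_tweet = n_stored / n_rows     # non-zero entries per row
dense_bytes = n_rows * n_cols * 8           # float64 takes 8 bytes per entry

print(f"density: {density:.6f}")
print(f"average non-zero terms per tweet: {avg_terms_per_tweet:.1f}")
print(f"dense float64 storage: {dense_bytes / 1e9:.1f} GB")
```

Roughly 0.02% of the entries are non-zero (about 12 distinct terms per tweet), while a dense copy of the same matrix would need around 16 GB.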

Now, let's fit the nearest neighbors model! Please use [NearestNeighbors](https://scikit-learn.org/stable/modules/neighbors.html) from the sklearn library.

NOTE: Fit the nearest neighbors model (with **five** neighbors) on the _**"tfidf_matrix"**_ we got from above, and store the fitted model as _**"nbrs"**_.

In [13]:
from sklearn.neighbors import NearestNeighbors
# TODO
### BEGIN SOLUTION

nbrs = NearestNeighbors(n_neighbors=5)
nbrs.fit(tfidf_matrix)

### END SOLUTION

In [14]:
assert(nbrs.n_features_in_ == 50209)
assert(nbrs.n_samples_fit_ == 40000)
assert(nbrs.get_params()['n_neighbors'] == 5)

We can now run custom sentences and see which sentences in the corpus are "closest" to what we put in. Try a few and see what shows up! In addition to this, you can change the `n_neighbors` parameter to retrieve more neighbors.

In [15]:
test_docs = ["Man, the weather today SUCKS!!!"]
test_docs = tf.transform(test_docs)
distances, indices = nbrs.kneighbors(test_docs)
train_data.iloc[indices[0]]

Unnamed: 0.1,Unnamed: 0,polarity,id,date,query,user,text
8386,230301,0,2234441446,Thu Jun 18 23:03:45 PDT 2009,NO_QUERY,Cheeseter550,@ThisFails man that sucks
14483,1886,0,2241488966,Fri Jun 19 11:00:48 PDT 2009,NO_QUERY,Johnn_G,"The weather sucks. And I'm hungry, but there's..."
10417,147250,0,2015648053,Wed Jun 03 05:11:31 PDT 2009,NO_QUERY,renofeliz,why the weather is so crap today?
11601,560236,0,1980392498,Sun May 31 06:36:33 PDT 2009,NO_QUERY,MissCammie,oooh man.. being ill sucks
7872,415455,0,1961049897,Fri May 29 08:54:33 PDT 2009,NO_QUERY,ArtyBloodyFarty,that just sucks!


In [16]:
# As a bonus, show our distances
print(distances)

[[1.065  1.0657 1.076  1.0781 1.0874]]
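These distances are easier to interpret as cosine similarities. The sketch below assumes the defaults were used on both sides: `TfidfVectorizer` scales every vector to unit length, and `NearestNeighbors` measures Euclidean distance. Under those assumptions, the squared distance between two vectors equals `2 - 2 * cos`, so the conversion is a one-liner. The variable name `reported_distances` is illustrative.

```python
import numpy as np

# Distances reported for the five nearest tweets.
reported_distances = np.array([[1.065, 1.0657, 1.076, 1.0781, 1.0874]])

# For unit-length vectors u and v: ||u - v||^2 = 2 - 2 * cos(u, v)
cosine_similarity = 1 - reported_distances ** 2 / 2
print(np.round(cosine_similarity, 3))
```

A distance of 0 would mean identical direction (similarity 1), and a distance of about 1.41 would mean no shared terms at all (similarity 0). The neighbors here have similarities of roughly 0.41 to 0.43, so even the closest tweets share only part of their vocabulary with the query.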


Now investigate:
1. Try manually classifying the tweet "Wow, this is so cool!" What are the classes of the neighbors? How would a 5-NN classifier classify that tweet?
2. Can you think of a tweet that might fool this classifier? For example, how would it do with sarcasm?
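
For the first question, a k-NN classifier typically predicts by majority vote over the neighbors' labels. A minimal sketch, where `majority_vote` is a hypothetical helper and the second list of labels is made up for illustration:

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

# Polarity of the five neighbors found for the weather tweet above.
print(majority_vote([0, 0, 0, 0, 0]))   # 0 (negative)

# Hypothetical mixed neighborhood: three positive, two negative.
print(majority_vote([4, 0, 4, 4, 0]))   # 4 (positive)
```

Using an odd number of neighbors avoids ties when there are only two classes.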

We will continue working with this dataset in hw2-p3!

# 2) AdaBoost

In this exercise, you'll be learning how to use [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html), as well as visualize decision boundaries.

## 2.1) Example: Decision Tree Baseline

In [17]:
# Read the breast cancer dataset and translate to pandas dataframe
bc_sk = datasets.load_breast_cancer()
# Note that the "target" attribute is the diagnosis, represented as an integer (0 = malignant, 1 = benign)
bc_data = pd.DataFrame(data= np.c_[bc_sk['data'], bc_sk['target']],columns= list(bc_sk['feature_names'])+['target'])

In [18]:
from sklearn.model_selection import train_test_split
# The fraction of data that will be test data
test_data_fraction = 0.10

bc_features = bc_data.iloc[:,0:-1]
bc_labels = bc_data["target"]
X_train, X_test, Y_train, Y_test = train_test_split(bc_features, bc_labels, test_size=test_data_fraction,  random_state=random_seed)

First, let's have a baseline non-boosted decision tree to compare against.

In [19]:
from sklearn.tree import DecisionTreeClassifier
gini_tree = DecisionTreeClassifier(criterion = "gini", random_state=random_seed).fit(X=X_train, y=Y_train)

In [20]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
predicted_y = gini_tree.predict(X_test)

In [21]:
# classification_report expects the true labels first, then the predictions
print(classification_report(Y_test, predicted_y))

              precision    recall  f1-score   support

         0.0       0.85      0.85      0.85        20
         1.0       0.92      0.92      0.92        37

    accuracy                           0.89        57
   macro avg       0.88      0.88      0.88        57
weighted avg       0.89      0.89      0.89        57



In [22]:
# confusion_matrix expects the true labels first, then the predictions
confusion_matrix(Y_test, predicted_y)

array([[17,  3],
       [ 3, 34]], dtype=int64)
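
The numbers in the classification report can be recovered directly from this matrix. The sketch below is a hand calculation for illustration, not part of the assignment. Because this particular matrix is symmetric, the per-class ratios come out the same whichever axis holds the true labels, so precision and recall coincide for each class.

```python
import numpy as np

# Confusion matrix of the decision tree baseline, copied from above.
cm = np.array([[17, 3],
               [3, 34]])

# Accuracy: correct predictions (the diagonal) over all test samples.
accuracy = np.trace(cm) / cm.sum()

# Per-class ratio of diagonal entries to row totals.
per_class = np.diag(cm) / cm.sum(axis=1)

print(round(float(accuracy), 2))
print(np.round(per_class, 2))
```

The accuracy is 51/57, and the per-class ratios are 17/20 and 34/37, which match the 0.89, 0.85, and 0.92 shown in the report.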

## 2.2) AdaBoost Classifier

Now, let's get boosting with the [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html). The default estimator for AdaBoost is a *decision stump*. (Remember: a decision stump is simply a decision tree with a depth of 1.)

* The `n_estimators` parameter is the number of base models in the ensemble.
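
To give a feel for what each of those base models contributes, here is a sketch of a single boosting round in the textbook (discrete) formulation of AdaBoost with labels in {-1, +1}. The toy labels and stump predictions are hypothetical, and scikit-learn's implementation differs in its details, so treat this as an illustration of the idea rather than the library's exact procedure.

```python
import numpy as np

# Hypothetical toy labels (+1 / -1) and one stump's predictions.
y_true = np.array([1, 1, -1, -1, 1])
stump_pred = np.array([1, -1, -1, -1, 1])   # one mistake, at index 1

# Start with uniform sample weights.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of this stump and its vote strength (alpha).
miss = stump_pred != y_true
error = weights[miss].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Increase the weight of misclassified samples, decrease the rest, renormalize.
weights = weights * np.exp(-alpha * y_true * stump_pred)
weights = weights / weights.sum()

print(round(float(error), 2), round(float(alpha), 3))
print(np.round(weights, 3))
```

After the update, the single misclassified sample holds half of the total weight, so the next stump is pushed to get that sample right. The final ensemble prediction is the sign of the alpha-weighted sum of all stump predictions.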

In [23]:
from sklearn.ensemble import AdaBoostClassifier

# TODO: Create and train an AdaBoostClassifier with 20 estimators on the X_train and Y_train data
# Note: Make sure to use random_state=random_seed
ada_model = None
### BEGIN ANSWER

ada_model = AdaBoostClassifier(n_estimators=20, random_state=random_seed)
ada_model.fit(X_train, Y_train)

### END ANSWER

In [24]:
from sklearn.metrics import accuracy_score
np.testing.assert_almost_equal(accuracy_score(Y_test, ada_model.predict(X_test)), 0.9473684210526315)

In [25]:
predicted_y = ada_model.predict(X_test)

In [26]:
# classification_report expects the true labels first, then the predictions
print(classification_report(Y_test, predicted_y))

              precision    recall  f1-score   support

         0.0       0.95      0.90      0.92        20
         1.0       0.95      0.97      0.96        37

    accuracy                           0.95        57
   macro avg       0.95      0.94      0.94        57
weighted avg       0.95      0.95      0.95        57



In [27]:
# confusion_matrix expects the true labels first, then the predictions
confusion_matrix(Y_test, predicted_y)

array([[18,  2],
       [ 1, 36]], dtype=int64)

As we can see, on this test split the boosted model (95% accuracy) performs better than the full decision tree (89% accuracy), even though it combines only 20 decision stumps.