In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


%matplotlib inline

---
# Part 3. Extract features from texts and images

## Images

In [None]:
# load the dataset from sklearn
from sklearn.datasets import load_digits
digits = load_digits()

print(digits['images'].shape)
print(digits['data'].shape)
print(digits['target'].shape)

(1797, 8, 8)
(1797, 64)
(1797,)


In [None]:
# images contains 8x8 grayscale pictures of handwritten digits (pixel intensities 0-16)
digits['images'][0]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

#### 1. Explore the data

Let's plot several images and then split the dataset into train and test sets.
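
One possible way to look at the data is sketched below. It assumes we show the first five images in a single row; the number of images, the figure size and the `gray_r` colormap are arbitrary choices, not part of the original task.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# Show the first five images with their labels (five is an arbitrary choice)
fig, axes = plt.subplots(1, 5, figsize=(10, 3))
for ax, image, label in zip(axes, digits['images'], digits['target']):
    ax.imshow(image, cmap='gray_r')   # darker pixel = higher intensity
    ax.set_title(f'label: {label}')
    ax.axis('off')
```

With `%matplotlib inline` the figure is rendered at the end of the cell.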

In [None]:
# make a plot


We will treat each pixel as a feature. Note that the images have already been reshaped into vectors for us in `digits['data']`.
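
A minimal sketch of the split follows. The values `test_size=0.25` and `random_state=0` are assumptions made for illustration; the notebook does not prescribe them.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# test_size and random_state are assumed values, not taken from the notebook
X_train, X_test, y_train, y_test = train_test_split(
    digits['data'], digits['target'], test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs.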

In [None]:
from sklearn.model_selection import train_test_split

# perform train test split

#### 2. Train the model

We can now train the model. This is a classification task with 10 classes. 

The model that we will use is called `KNeighborsClassifier` (kNN). It has no trainable parameters and works as follows:
* Remember the training data
* When a new point arrives
    * find the `K` nearest points in the training dataset (e.g. by Euclidean distance)
    * return the most frequent class among these `K` points.

Create and train the following pipeline:
* Scale the input vectorized image
* Classify by kNN
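
To make the kNN rule concrete, here is a small numpy-only sketch of the prediction step for a single point. The function name `knn_predict` and the toy data are hypothetical, and this is not how sklearn implements the classifier internally; it only illustrates the idea described above.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most frequent class among the k neighbours (ties go to the smallest label)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: three points of class 0 near the origin, two points of class 1 far away
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # neighbours are all class 0
print(knn_predict(X_toy, y_toy, np.array([5.0, 5.5])))  # two of three neighbours are class 1
```

Because kNN relies on distances, features with a larger scale dominate the result, which is why the pipeline scales the inputs first.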

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# define pipeline

# fit


#### 3. Evaluate on test dataset

We will use accuracy, the proportion of correct predictions, to measure quality.
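
The sketch below puts the split, the pipeline and the metric together in one place. The split parameters (`test_size=0.25`, `random_state=0`), the step names and `n_neighbors=5` are assumptions, so the resulting score may differ from the one printed in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
# Assumed split parameters, not taken from the notebook
X_train, X_test, y_train, y_test = train_test_split(
    digits['data'], digits['target'], test_size=0.25, random_state=0
)

# Scale each pixel feature, then classify with kNN
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)

# Accuracy = share of test images whose predicted digit equals the true digit
score = accuracy_score(y_test, pipe.predict(X_test))
print(score)
```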

In [None]:
from sklearn.metrics import accuracy_score

# predict on test set

# compute accuracy

In [None]:
print(score)

0.9755555555555555


**Optional task**

Implement accuracy yourself using numpy only.
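
If you get stuck, one possible solution is sketched here. It assumes both inputs are array-like and have the same length; the toy labels are made up for the check.

```python
import numpy as np

def my_accuracy(true, predicted):
    true = np.asarray(true)
    predicted = np.asarray(predicted)
    # Element-wise comparison gives booleans; their mean is the share of matches
    return np.mean(true == predicted)

y_true_toy = np.array([0, 1, 2, 2, 1])
y_pred_toy = np.array([0, 1, 1, 2, 1])
print(my_accuracy(y_true_toy, y_pred_toy))  # 4 of 5 predictions are correct
```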


In [None]:
def my_accuracy(true, predicted):
    # your code here
    pass


# test that your function works the same as `accuracy_score` from sklearn

---
## Texts

### 1.1 Convert documents to vectors

Note that our dataset is now a set of documents (or texts): $D = (d_1, \dots, d_N)$

We will assume that there is a vocabulary of size $M$ containing all the words from which our documents are composed.

Of course, each document has a different number of words in it.

Today, we will consider two ways to represent texts as vectors (embeddings):
1. Bag of words
2. Tf-idf

Let us first use a simple example to understand how both methods work:

In [None]:
d1 = "This is my favourite movie"
d2 = "Is this movie boring? Yes, it is!"
d3 = "This is an exiting movie"

In [None]:
import re
D = [re.sub('[.!?,]', '', d.lower()).split(' ') for d in [d1, d2, d3]]
D

[['this', 'is', 'my', 'favourite', 'movie'],
 ['is', 'this', 'movie', 'boring', 'yes', 'it', 'is'],
 ['this', 'is', 'an', 'exiting', 'movie']]

The vocabulary for this dataset would be:

['boring', 'movie', 'exiting', 'my', 'yes', 'is', 'an', 'favourite', 'it', 'this']
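
A vocabulary like this can be built by collecting the unique words of all documents. The sketch below sorts them so that the order is reproducible; the list above appears to come from an unordered set, so the order here is different, but the words are the same.

```python
D = [
    ['this', 'is', 'my', 'favourite', 'movie'],
    ['is', 'this', 'movie', 'boring', 'yes', 'it', 'is'],
    ['this', 'is', 'an', 'exiting', 'movie'],
]

# Collect the unique words; sorting makes the order reproducible
V_sorted = sorted({word for doc in D for word in doc})
print(len(V_sorted), V_sorted)
```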


#### Option 1. Bag of words

We can say that a text is characterized by a vector of length $M$, which shows how many times each word from the vocabulary occurs in the document.

Let us now calculate the bag of words for each document.
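
A minimal sketch of this calculation, assuming the tokenized documents `D` and the vocabulary `V` in the order shown above:

```python
import numpy as np

D = [
    ['this', 'is', 'my', 'favourite', 'movie'],
    ['is', 'this', 'movie', 'boring', 'yes', 'it', 'is'],
    ['this', 'is', 'an', 'exiting', 'movie'],
]
V = ['boring', 'movie', 'exiting', 'my', 'yes', 'is', 'an', 'favourite', 'it', 'this']

# One row per document, one column per vocabulary word:
# entry (i, j) is the number of times word V[j] occurs in document D[i]
X = np.array([[doc.count(word) for word in V] for doc in D], dtype=float)
```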

In [None]:
X

array([[0., 1., 0., 1., 0., 1., 0., 1., 0., 1.],
       [1., 1., 0., 0., 1., 2., 0., 0., 1., 1.],
       [0., 1., 1., 0., 0., 1., 1., 0., 0., 1.]])

#### Option 2. Tf-idf

**Term Frequency times Inverse Document Frequency**

A method to describe each document in the dataset with a vector of the same length. It takes into account how often each word appears in the whole dataset.



**Term frequency (tf)** - number of times a term occurs in a given document, divided by the length of the document
$$
tf(t, d) = \frac{\# t \text{ in } d}{len(d)}
$$


**Inverse document frequency (idf)** - measures the informativeness of a term

$$
idf(t) = \log \frac{N}{\# d \text{ with } t}, \quad \text{where } N \text{ is the number of documents}
$$

If a word occurs in almost all the documents (e.g. an article or a popular verb), then its $idf$ will be very low.

---
Now we can convert each document into a vector of size $M$:
$$
d \rightarrow \left(tf(t_1, d)\cdot idf(t_1),\,\, \dots, \,\, tf(t_M, d) \cdot idf(t_M)\right)
$$


---
Let us calculate it for our simple example
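
The two formulas above can be sketched in numpy as follows. The sketch assumes the natural logarithm and the vocabulary order `V` used in this notebook; the helper names `counts`, `doc_lengths` and `doc_freq` are introduced only for illustration.

```python
import numpy as np

D = [
    ['this', 'is', 'my', 'favourite', 'movie'],
    ['is', 'this', 'movie', 'boring', 'yes', 'it', 'is'],
    ['this', 'is', 'an', 'exiting', 'movie'],
]
V = ['boring', 'movie', 'exiting', 'my', 'yes', 'is', 'an', 'favourite', 'it', 'this']

# Raw word counts: one row per document, one column per vocabulary word
counts = np.array([[doc.count(word) for word in V] for doc in D], dtype=float)

# tf: counts divided by the length of the corresponding document
doc_lengths = np.array([len(doc) for doc in D])
tf = counts / doc_lengths[:, None]

# idf: natural log of N over the number of documents containing each word
N = len(D)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(N / doc_freq)

print(np.round(tf * idf, 2))
```

Words that occur in every document (`this`, `is`, `movie`) get $idf = \log 1 = 0$, so their columns are zero for all documents.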

In [None]:
print(D, '\n')
print(V)

[['this', 'is', 'my', 'favourite', 'movie'], ['is', 'this', 'movie', 'boring', 'yes', 'it', 'is'], ['this', 'is', 'an', 'exiting', 'movie']] 

['boring', 'movie', 'exiting', 'my', 'yes', 'is', 'an', 'favourite', 'it', 'this']


In [None]:
# tf


In [None]:
#idf


In [None]:
X = tf*idf
np.round(X, 2)

array([[0.  , 0.  , 0.  , 0.22, 0.  , 0.  , 0.  , 0.22, 0.  , 0.  ],
       [0.16, 0.  , 0.  , 0.  , 0.16, 0.  , 0.  , 0.  , 0.16, 0.  ],
       [0.  , 0.  , 0.22, 0.  , 0.  , 0.  , 0.22, 0.  , 0.  , 0.  ]])

---
**Optional task**

Below we will use `BoW` and `tf-idf` features to classify texts from a dataset of news articles.

### BoW and Tf-Idf in Sklearn

In practice, we can use transformers from sklearn to get the same kind of vector representation of texts as we implemented above.
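
As a quick illustration, the sketch below applies `CountVectorizer` and `TfidfVectorizer` with default settings to the three toy documents from above. Note that with its defaults `TfidfVectorizer` smooths the idf term and L2-normalizes each row, so its numbers will not match the manual formula exactly.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "This is my favourite movie",
    "Is this movie boring? Yes, it is!",
    "This is an exiting movie",
]

# Bag of words: lowercases, tokenizes and counts words in one step
bow_toy = CountVectorizer()
X_bow = bow_toy.fit_transform(docs)          # sparse matrix of word counts
print(sorted(bow_toy.vocabulary_))           # learned vocabulary, alphabetical
print(X_bow.toarray())

# Tf-idf with sklearn defaults (smoothed idf, rows normalized to unit length)
tfidf_toy = TfidfVectorizer()
X_tfidf = tfidf_toy.fit_transform(docs)      # sparse matrix of tf-idf weights
print(X_tfidf.toarray().round(2))
```

Both vectorizers return sparse matrices, which matters for real datasets where the vocabulary is large and most counts are zero.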

#### 1. Import the data

In [1]:
import pandas as pd

categories = ['alt.atheism', 'sci.space']
train = pd.read_csv('https://github.com/mbburova/MDS/raw/main/train_news.csv', index_col=0)
X_train, y_train = train.news, train.target


test = pd.read_csv('https://github.com/mbburova/MDS/raw/main/test_news.csv', index_col=0)
X_test, y_test = test.news, test.target

#### 2. Explore the data



In [None]:
for i in range(3):
    print('label:',y_train[i])
    print(X_train[i])
    print('-------\n')

label: 0
: 
: >> Please enlighten me.  How is omnipotence contradictory?
: 
: >By definition, all that can occur in the universe is governed by the rules
: >of nature. Thus god cannot break them. Anything that god does must be allowed
: >in the rules somewhere. Therefore, omnipotence CANNOT exist! It contradicts
: >the rules of nature.
: 
: Obviously, an omnipotent god can change the rules.

When you say, "By definition", what exactly is being defined;
certainly not omnipotence. You seem to be saying that the "rules of
nature" are pre-existant somehow, that they not only define nature but
actually cause it. If that's what you mean I'd like to hear your
further thoughts on the question.
-------

label: 1
In <19APR199320262420@kelvin.jpl.nasa.gov> baalke@kelvin.jpl.nasa.gov 

Sorry I think I missed a bit of info on this Transition Experiment. What is it?

Will this mean a loss of data or will the Magellan transmit data later on ??

BTW: When will NASA cut off the connection with Magellan

#### 3.1 BoW 

Our pipeline:
* BoW vectorizer
* kNN classifier

We will use accuracy to evaluate the model on the test set.
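
To show the shape of such a pipeline without downloading anything, here is a sketch on a tiny made-up corpus. The texts, labels, step names and `n_neighbors=1` are all hypothetical; for the news dataset you would plug in the vectorizer settings given in the cell below instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = religion-like texts, label 1 = space-like texts
toy_train = [
    "god religion belief",
    "atheism god argument",
    "rocket orbit nasa",
    "space shuttle launch orbit",
]
toy_train_labels = [0, 0, 1, 1]
toy_test = ["nasa rocket launch", "belief in god"]
toy_test_labels = [1, 0]

# Raw text goes in, the vectorizer turns it into counts, kNN classifies the counts
toy_pipe = Pipeline([
    ('bow', CountVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])
toy_pipe.fit(toy_train, toy_train_labels)

toy_pred = toy_pipe.predict(toy_test)
print(toy_pred, accuracy_score(toy_test_labels, toy_pred))
```

Words in the test texts that were never seen during training (here `in`) are simply ignored by the fitted vectorizer.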

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

In [None]:
# Define training pipeline
bow = CountVectorizer(min_df=0.1, stop_words='english')

pipe = # your code here

# Fit on train data
pipe.fit(X_train, y_train)

# Evaluate on test data (compute accuracy)
# your code here

#### 3.2 Tf-Idf

Let's repeat the same procedure with the tf-idf vectorizer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define training pipeline
tf_idf = TfidfVectorizer(sublinear_tf=True, min_df=0.1, stop_words='english')

pipe = # your code here


# Fit
# your code here

# Evaluate on test
# your code here