# Week 13 Exercises


## Ex1: TFIDF and Co-Occurrences
Consider the following training set of 5 strings:

d1 = "machine learning is my favourite new topic"

d2 = "i really like support vector machines"

d3 = "gradient descent is a really neat algorithm"

d4 = "the lecturer has no imagination with these strings"

d5 = "enjoy the very last TA class"

* In the tf-idf representation of string d1, what is the coordinate/entry corresponding to the word "is"?

* If we consider a context window of width 1 around every word in every string, how many co-occurrences are there between "the" and "is"?

### BEGIN MATH SOLUTION
For the tf-idf question, we first compute the idf part. The word "is" occurs in 2 of the 5 documents (d1 and d3), so the idf term is $\ln(5/2) \approx 0.92$. The tf term is $1/7$, since d1 has 7 words, exactly one of which is "is". The tf-idf entry for "is" is therefore $\ln(5/2)/7 \approx 0.13$.

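For the co-occurrences, note that "is" appears only in d1 and d3, while "the" appears only in d4 and d5. The two words never share a string, so they never co-occur within a context window of width 1. Hence the count is 0.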
### END SOLUTION
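
The two hand computations above can be checked with a short script. This is a minimal sketch, assuming whitespace tokenization, tf defined as count divided by string length, and idf defined as $\ln(N/\mathrm{df})$, as in the solution. The helper names `tf_idf` and `cooccurrences` are made up for illustration.

```python
import math

docs = {
    "d1": "machine learning is my favourite new topic",
    "d2": "i really like support vector machines",
    "d3": "gradient descent is a really neat algorithm",
    "d4": "the lecturer has no imagination with these strings",
    "d5": "enjoy the very last TA class",
}

def tf_idf(word, doc_id, docs):
    """tf-idf of `word` in docs[doc_id], with tf = count / length and idf = ln(N / df).

    Assumes `word` occurs in at least one document (otherwise df is 0).
    """
    tokens = docs[doc_id].split()
    tf = tokens.count(word) / len(tokens)
    df = sum(1 for text in docs.values() if word in text.split())
    idf = math.log(len(docs) / df)
    return tf * idf

def cooccurrences(a, b, docs, width=1):
    """Count positions where `b` lies within `width` tokens of an occurrence of `a`."""
    count = 0
    for text in docs.values():
        tokens = text.split()
        for i, tok in enumerate(tokens):
            if tok != a:
                continue
            lo, hi = max(0, i - width), min(len(tokens), i + width + 1)
            count += sum(1 for j in range(lo, hi) if j != i and tokens[j] == b)
    return count

print(round(tf_idf("is", "d1", docs), 2))   # 0.13
print(cooccurrences("the", "is", docs))     # 0
```

As a sanity check on the window logic, `cooccurrences("is", "my", docs)` returns 1, because the two words are adjacent in d1.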

## Ex2: TFIDF Using standard tools 
In this exercise you must implement a basic text classification pipeline and apply it to SMS spam classification.
The pipeline is:

* Transform strings to vectors using a library implementation of tf-idf.
* Apply a standard Logistic Regression classifier to the training data.

You need to complete the implementation of **run_tfidf_classifier**.


See the scikit-learn documentation for a standard implementation that transforms text into tf-idf vectors:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

You may need to unzip the smsspamcollection file first. For some reason, I needed to run the cell twice after unzipping.



```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

def get_data():
    # wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
    # Each line is split at tabs: the label comes first, then the message.
    # Assumes the file is UTF-8 encoded.
    with open('smsspamcollection/SMSSpamCollection', 'r', encoding='utf-8') as f:
        dat = f.readlines()
    labels = [x.split('\t')[0] for x in dat]
    texts = [x.split('\t')[1] for x in dat]
    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42)
    return X_train, X_test, y_train, y_test


def run_tfidf_classifier(train_strings, test_strings, train_labels, test_labels):
    """ Transform strings to vectors, fit a logistic regression model on them, and print the results.

    You can use scikit-learn's LogisticRegression model, which supports fit and score:
    from sklearn.linear_model import LogisticRegression

    1. Transform both the train and test data using the TFIDF transform
    2. Fit the Logistic Regression model
    3. Print the train score and test score
    """
    print(train_strings[0:10])

    ### YOUR CODE HERE
    # Fit the vocabulary and idf weights on the training strings only,
    # then apply the same transform to both splits.
    vectorizer = TfidfVectorizer()
    vectorizer.fit(train_strings)
    X_train = vectorizer.transform(train_strings)
    X_test = vectorizer.transform(test_strings)
    print(X_train.shape)
    clf = LogisticRegression(random_state=0)
    clf.fit(X_train, train_labels)
    train_score = clf.score(X_train, train_labels)
    test_score = clf.score(X_test, test_labels)
    print(f'train mean accuracy {train_score}, test mean accuracy {test_score}')
    ### END CODE


run_tfidf_classifier(*get_data())
```

['Probably not, still going over some stuff here\n', 'I HAVE A DATE ON SUNDAY WITH WILL!!\n', 'Thanks 4 your continued support Your question this week will enter u in2 our draw 4 £100 cash. Name the NEW US President? txt ans to 80082\n', "Dear 0776xxxxxxx U've been invited to XCHAT. This is our final attempt to contact u! Txt CHAT to 86688 150p/MsgrcvdHG/Suite342/2Lands/Row/W1J6HL LDN 18yrs\n", 'I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones\n', 'Kothi print out marandratha.\n', 'Arun can u transfr me d amt\n', 'I asked you to call him now ok\n', 'Ringtone Club: Gr8 new polys direct to your mobile every week !\n', 'Hello! Just got here, st andrews-boy its a long way! Its cold. I will keep you posted\n']
(3734, 7097)
train mean accuracy 0.9734868773433315, test mean accuracy 0.967391304347826
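
Note that the library's tf-idf values will not match the hand computation from Ex1. With its documented default settings, `TfidfVectorizer` uses raw term counts for tf, a smoothed idf of the form $\ln\frac{1+N}{1+\mathrm{df}} + 1$, and L2-normalizes each document vector. Its default tokenizer also lowercases and drops single-character tokens. The sketch below re-derives that weighting in plain Python on the Ex1 strings, without calling scikit-learn. The function names are hypothetical, and the tokenizer is an approximation of the default behaviour.

```python
import math
import re

docs = [
    "machine learning is my favourite new topic",
    "i really like support vector machines",
    "gradient descent is a really neat algorithm",
    "the lecturer has no imagination with these strings",
    "enjoy the very last TA class",
]

def tokenize(text):
    # Approximation of the default tokenization: lowercase, keep words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def sklearn_style_tfidf(doc_index, docs):
    """Return {token: weight} for one document using smoothed idf and L2 normalization."""
    tokenized = [tokenize(t) for t in docs]
    n = len(docs)
    counts = {}
    for tok in tokenized[doc_index]:
        counts[tok] = counts.get(tok, 0) + 1
    weights = {}
    for tok, tf in counts.items():
        df = sum(1 for toks in tokenized if tok in toks)
        weights[tok] = tf * (math.log((1 + n) / (1 + df)) + 1)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}

row = sklearn_style_tfidf(0, docs)
print(round(row["is"], 2))  # about 0.31, compared with 0.13 under the Ex1 definition
```

The ordering is still what one would expect: "is" appears in two documents and therefore gets a smaller weight than words such as "machine" that appear in only one.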


## Ex3: Analyzing Feature Hashing
In feature hashing, we map a vector $x \in R^d$ to a vector in $R^k$ using two hash functions $h : [d] \to [k]$ and $g : [d] \to \{-1,1\}$. The hash functions are chosen randomly and independently before seeing any data. Assume the hash functions satisfy the following two properties:
1. For any two distinct coordinates $i \neq j$, we have that $g(i)$ and $g(j)$ are independent and uniform random, i.e. for any $a,b \in \{-1,1\}$ it holds that $\Pr_g[g(i)=a \wedge g(j)=b] = 1/4$.
2. For any two distinct coordinates $i \neq j$, we have that $\Pr_h[h(i)=h(j)] \leq 1/k$.

The embedding $f(x)$ of a vector $x$ is obtained by hashing each index $i \in [d]$ to the index $h(i)$ and adding $g(i) \cdot x_i$ to $f(x)_{h(i)}$.

Your task is to prove:
1. For two vectors $x,y$, we have $\mathbb{E}[f(x)^\intercal f(y)] = x^\intercal y$.

Hint: The following re-writing may be useful:
$$
f(x)^\intercal f(y) = \sum_{i=1}^d \sum_{j=1}^d 1_{[h(i)=h(j)]} x_i y_j g(i) g(j),
$$
where $1_{[h(i)=h(j)]}$ is the indicator random variable taking the value $1$ if $h(i)=h(j)$ and $0$ otherwise.
You may also need linearity of expectation $\mathbb{E}[A + B] = \mathbb{E}[A] + \mathbb{E}[B]$ and that for independent random variables $X,Y$ we have $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$.

### BEGIN MATH SOLUTION
1. Using the hint and linearity of expectation, we see that
$$
\mathbb{E}[f(x)^\intercal f(y)] = \sum_{i=1}^d \sum_{j=1}^d \mathbb{E}[1_{[h(i)=h(j)]} x_i y_j g(i) g(j)]
$$
Using that $h$ and $g$ are chosen independently and that $g(i)$ is independent of $g(j)$ when $i \neq j$, it holds for any term with $i \neq j$ that $\mathbb{E}[1_{[h(i)=h(j)]} x_i y_j g(i) g(j)] = \mathbb{E}[1_{[h(i)=h(j)]}] x_i y_j \mathbb{E}[g(i)] \mathbb{E}[g(j)]$. Since $g(i)$ is uniform random among $\{-1,1\}$, its expectation is $0$ and the whole term is $0$. Therefore we have
$$
\mathbb{E}[f(x)^\intercal f(y)] = \sum_{i=1}^d \mathbb{E}[1_{[h(i)=h(i)]} x_i y_i g(i)^2]
$$
The indicator $1_{[h(i)=h(i)]}$ is always $1$ and $g(i)^2$ is also $1$. Hence
$$
\mathbb{E}[f(x)^\intercal f(y)] = \sum_{i=1}^d x_i y_i = x^\intercal y.
$$
### END SOLUTION
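
The unbiasedness result can also be checked empirically. The sketch below is an illustration under stated assumptions, not part of the exercise: it draws $h$ and $g$ as fully random tables (which satisfies both properties above), embeds two small example vectors, and averages $f(x)^\intercal f(y)$ over many independent draws. The vectors, the choice $k=3$, and all function names are made up for the example.

```python
import random

def make_hashes(d, k, rng):
    """Draw fully random h: [d] -> [k] and g: [d] -> {-1, 1} as lookup tables."""
    h = [rng.randrange(k) for _ in range(d)]
    g = [rng.choice((-1, 1)) for _ in range(d)]
    return h, g

def embed(x, h, g, k):
    """Feature hashing: add g(i) * x_i to coordinate h(i) of the output."""
    out = [0.0] * k
    for i, xi in enumerate(x):
        out[h[i]] += g[i] * xi
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def mean_hashed_dot(x, y, k, trials, seed=0):
    """Average f(x)^T f(y) over `trials` independent draws of (h, g)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        h, g = make_hashes(len(x), k, rng)
        total += dot(embed(x, h, g, k), embed(y, h, g, k))
    return total / trials

x = [1.0, 2.0, 0.0, -1.0, 3.0]
y = [0.5, -1.0, 2.0, 4.0, 1.0]
print(dot(x, y))                                  # -2.5
print(mean_hashed_dot(x, y, k=3, trials=20000))   # should be close to -2.5
```

A single draw can be far from $x^\intercal y$ when $k$ is small, because colliding coordinates contribute cross terms. Only the average over draws converges, which is exactly what the expectation statement says.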

## Ex4: Skip-Gram
This exercise can be found in the separate notebook skipgram.ipynb. We recommend uploading the notebook to Google Colab for faster execution.