# Week 13 Exercises


## Ex1: TFIDF and Co-Occurrences
Consider the following training set of 5 strings:

d1 = "machine learning is my favourite new topic"

d2 = "i really like support vector machines"

d3 = "gradient descent is a really neat algorithm"

d4 = "the lecturer has no imagination with these strings"

d5 = "enjoy the very last TA class"

* In the tf-idf representation of string d1, what is the coordinate/entry corresponding to the word "is"?

* If we consider a context window of width 1 around every word in every string, how many co-occurrences are there between "the" and "is"?
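
As a sketch of how co-occurrences within a context window can be counted, the snippet below uses a small toy corpus that is not part of the exercise. It assumes that "width 1" means one word on either side, that tokens are obtained by splitting on whitespace, and that each adjacent pair of positions counts once (some conventions count it once from each word's window, which doubles every count). The function name `cooccurrence_counts` is made up for illustration.

```python
from collections import Counter

def cooccurrence_counts(docs, width=1):
    """Count unordered word pairs that occur within `width` positions of each other."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, word in enumerate(tokens):
            # Look only to the right so each pair of positions is counted once.
            for j in range(i + 1, min(i + width + 1, len(tokens))):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
    return counts

# Hypothetical toy corpus, deliberately different from d1..d5.
toy_docs = ["the cat sat on the mat", "the dog sat"]
counts = cooccurrence_counts(toy_docs)

print(counts[("cat", "the")])  # "the" and "cat" are adjacent once
print(counts[("sat", "the")])  # never adjacent, so a Counter returns 0
```

Pairs are stored as sorted tuples so that ("the", "cat") and ("cat", "the") share one entry.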

## Ex2: TFIDF Using Standard Tools
In this exercise you will implement a basic text classification pipeline and apply it to SMS spam classification.
The pipeline is:

* Transform strings to vectors using a library implementation of tf-idf.
* Fit a standard Logistic Regression classifier to the training data.

To do this, you need to complete the implementation of **run_tfidf_classifier**.


See the scikit-learn documentation for a standard implementation that transforms text into tf-idf vectors:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
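
To illustrate the vectorizer's interface, here is a minimal sketch on two made-up training strings and one made-up test string (none of them come from the SMS dataset). It only shows the transform step, not the classifier. Note that with default settings `TfidfVectorizer` lowercases the text, drops single-character tokens such as "i" and "a", uses a smoothed idf, and L2-normalises each row, so its values may differ from a hand computation based on the lecture's definition.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, only for demonstrating the API.
train_docs = ["free prize call now", "are we still meeting for lunch"]
test_docs = ["call me about lunch"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and idf weights from the training strings only.
X_train = vectorizer.fit_transform(train_docs)

# Reuse the learned vocabulary for the test strings; unseen words are ignored.
X_test = vectorizer.transform(test_docs)

print(X_train.shape, X_test.shape)
print(sorted(vectorizer.vocabulary_))
```

Both outputs are sparse matrices with one row per string and one column per vocabulary word, and they can be passed directly to a scikit-learn classifier's `fit` and `score` methods.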

You may need to unzip the smsspamcollection file first. If the cell below fails right after unzipping, running it a second time may help.



In [None]:
import re
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

def get_data():
    # Dataset: wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
    # Each line of the file has the form "<label>\t<message>".
    with open('smsspamcollection/SMSSpamCollection', 'r', encoding='utf-8') as f:
        dat = f.readlines()
    labels = [x.split('\t')[0] for x in dat]
    # Strip the trailing newline that readlines() leaves on every message.
    texts = [x.split('\t')[1].strip() for x in dat]
    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42)
    return X_train, X_test, y_train, y_test



def run_tfidf_classifier(train_strings, test_strings, train_labels, test_labels):
    """Transform strings to tf-idf vectors, fit a logistic regression model on them, and report the results.

    You can use the following LogisticRegression model (see the scikit-learn documentation for details). It supports fit and score.
    from sklearn.linear_model import LogisticRegression

    1. Transform both the train and test data using the tf-idf transform
    2. Fit the logistic regression model on the training data
    3. Print the train score and the test score
    """
    print(train_strings[0:10])
    
    
    ### YOUR CODE HERE
    ### END CODE


run_tfidf_classifier(*get_data())

## Ex3: Analyzing Feature Hashing
In feature hashing, we map a vector $x \in \mathbb{R}^d$ to a vector in $\mathbb{R}^k$ using two hash functions $h : [d] \to [k]$ and $g : [d] \to \{-1,1\}$. The hash functions are chosen randomly and independently before seeing any data. Assume the hash functions satisfy the following two properties:
1. For any two distinct coordinates $i \neq j$, we have that $g(i)$ and $g(j)$ are independent and uniformly random, i.e. for any $a,b \in \{-1,1\}$ it holds that $\Pr_g[g(i)=a \wedge g(j)=b] = 1/4$.
2. For any two distinct coordinates $i \neq j$, we have that $\Pr_h[h(i)=h(j)] \leq 1/k$.

The embedding $f(x)$ of a vector $x$ is obtained by hashing each index $i \in [d]$ to the index $h(i)$ and adding $g(i) \cdot x_i$ to $f(x)_{h(i)}$.
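
The construction of $f(x)$ can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a prescribed implementation: $h$ and $g$ are drawn fully at random (which satisfies both properties above), indices are 0-based, and the names `sample_hashes` and `feature_hash` are made up here. Averaging over many draws gives a numerical sanity check of the claim to be proved below, but it is of course not a proof.

```python
import numpy as np

def sample_hashes(d, k, rng):
    """Draw h: [d] -> [k] and g: [d] -> {-1, +1} uniformly at random (an assumption)."""
    h = rng.integers(0, k, size=d)
    g = rng.choice(np.array([-1, 1]), size=d)
    return h, g

def feature_hash(x, h, g, k):
    """Add g(i) * x_i to coordinate h(i) of the output, for every index i."""
    return np.bincount(h, weights=g * x, minlength=k)

rng = np.random.default_rng(0)
d, k = 20, 8
x = rng.normal(size=d)
y = rng.normal(size=d)

# Average the hashed inner product over many independent draws of (h, g).
estimates = []
for _ in range(20000):
    h, g = sample_hashes(d, k, rng)
    estimates.append(feature_hash(x, h, g, k) @ feature_hash(y, h, g, k))

mean_estimate = float(np.mean(estimates))
exact = float(x @ y)
print(mean_estimate, exact)  # the two values should be close
```

`np.bincount` with `weights` sums all entries that land in the same bucket, which is exactly the "add $g(i) \cdot x_i$ to $f(x)_{h(i)}$" step.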

Your task is to prove:
1. For two vectors $x,y$, we have $\mathbb{E}[f(x)^\intercal f(y)] = x^\intercal y$.

Hint: The following re-writing may be useful:
$$
f(x)^\intercal f(y) = \sum_{i=1}^d \sum_{j=1}^d 1_{[h(i)=h(j)]} x_i y_j g(i) g(j),
$$
where $1_{[h(i)=h(j)]}$ is the indicator random variable taking the value $1$ if $h(i)=h(j)$ and $0$ otherwise.
You may also need linearity of expectation $\mathbb{E}[A + B] = \mathbb{E}[A] + \mathbb{E}[B]$ and that for independent random variables $X,Y$ we have $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$.
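
To see where the rewriting in the hint comes from, it may help to write out a single coordinate of the embedding. The bucket index $m \in [k]$ is notation introduced here for illustration and is not part of the exercise text. By the definition of $f$, the $m$-th coordinate collects every index that hashes to $m$:
$$
f(x)_m = \sum_{i=1}^d 1_{[h(i)=m]} \, g(i) \, x_i .
$$
Expanding the inner product coordinate by coordinate then gives
$$
f(x)^\intercal f(y) = \sum_{m=1}^k f(x)_m f(y)_m = \sum_{i=1}^d \sum_{j=1}^d x_i y_j g(i) g(j) \sum_{m=1}^k 1_{[h(i)=m]} 1_{[h(j)=m]} = \sum_{i=1}^d \sum_{j=1}^d 1_{[h(i)=h(j)]} x_i y_j g(i) g(j),
$$
where the last step uses that $h(i)$ takes exactly one value in $[k]$, so the inner sum over $m$ equals $1$ when $h(i)=h(j)$ and $0$ otherwise.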


## Ex4: Skip-Gram
This exercise can be found in the separate notebook skipgram.ipynb. We recommend uploading the notebook to Google Colab for faster execution.