Steps required to model logistic regression:

* Extract useful information from the data and transform it into a set of inputs, aka **features**
* Train the classifier while minimising the cost function
* Make predictions

### Feature Extraction (Frequency Dictionary Mapping)
Represent a text as a vector of dimension $|V|$ (the vocabulary size)
* features are the list of words in the vocabulary
* numerical representation: if a word appears once in the text, the corresponding feature is set to one; if the word appears twice, it is set to two, and so on.
* number of parameters, n, is equal to the number of features, $|V|$
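
As a minimal sketch of the count representation described above, the snippet below maps a sentence onto a vector aligned with a vocabulary. The function name `count_vector`, the toy vocabulary and the example sentence are assumptions made for illustration.

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Return a list of word counts aligned with `vocabulary`."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never occur in the text
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary and sentence, for illustration only
vocabulary = ["i", "am", "happy", "because", "learning", "nlp", "sad"]
vector = count_vector("I am happy because I am learning NLP", vocabulary)
print(vector)  # [2, 2, 1, 1, 1, 1, 0]
```

The vector has one entry per vocabulary word, so its length (and the number of parameters to learn) grows with the vocabulary.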

Sparse Representations
For a model with a large vocabulary, two problems arise:
* expensive computation time
* overfitting: the model is complex (too many parameters)

The vocabulary frequency vector (a dictionary mapping from words to frequencies) counts the number of times a word appears in either the positive or the negative class.

Feature extraction
Encode three terms: bias, positive feature and negative feature
* positive: sum of the frequencies of the text's words in the positive vocabulary frequency table
* negative: sum of the frequencies of the text's words in the negative vocabulary frequency table
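
A minimal sketch of the three-term encoding follows. The helper names `build_freqs` and `extract_features`, the toy texts, and the choice to count each unique word of the text once are assumptions, not taken from the course notebook.

```python
from collections import defaultdict

def build_freqs(texts, labels):
    """Map (word, label) to the number of times the word occurs in texts with that label."""
    freqs = defaultdict(int)
    for text, label in zip(texts, labels):
        for word in text.lower().split():
            freqs[(word, label)] += 1
    return freqs

def extract_features(text, freqs):
    """Return [bias, positive frequency sum, negative frequency sum] for one text."""
    words = set(text.lower().split())  # assumption: each unique word is counted once
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1, pos, neg]

# Hypothetical labelled texts: 1 = positive, 0 = negative
texts = ["i am happy", "i am sad", "happy happy day", "sad day"]
labels = [1, 0, 1, 0]
freqs = build_freqs(texts, labels)
features = extract_features("i am happy today", freqs)
print(features)  # [1, 5, 2]
```

Whatever the vocabulary size, every text is reduced to three numbers, which avoids the sparse-representation problems listed earlier.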

### Preprocessing Text
1. Eliminate handles and URLs
2. Tokenize the strings into words (the process by which a large quantity of text is divided into smaller parts) and remove duplicates and punctuation
3. Remove stop words
4. Convert all words to lower case
5. Stem the words: transform each word to its root form
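
The steps above can be sketched with the standard library alone. The stop-word list here is a deliberately tiny, hypothetical one, and `crude_stem` is a simple suffix stripper standing in for a real stemmer (it is not Porter stemming). Lower-casing is done while tokenizing rather than as a separate pass.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real project would use a fuller one
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "to", "at", "i", "am"}

def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"https?://\S+", "", text)                    # remove URLs
    text = re.sub(r"@\w+", "", text)                            # remove handles
    tokens = text.lower().split()                               # tokenize on whitespace, lower-case
    tokens = [t.strip(string.punctuation) for t in tokens]      # drop surrounding punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]   # remove stop words
    return [crude_stem(t) for t in tokens]                      # stem

tokens = preprocess("@user I am learning NLP at https://example.com, tuning models!")
print(tokens)  # ['learn', 'nlp', 'tun', 'model']
```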

### General NLP Steps
1. Perform preprocessing
2. Extract features to convert the text into a numerical representation. Build the dictionary frequency mapping (count all words in positive texts and negative texts separately). For each tweet, extract 3 columns (bias, sum of positive word frequencies, sum of negative word frequencies).



## Sentiment Analysis with Logistic Regression

Refer to the week 1 notebook for the code.

## Sentiment Analysis Classifier with Naive Bayes Model

Probability is a fundamental concept in NLP tasks. 

### Naive Bayes Inference Conditional Rule for Binary Classification 
For a balanced dataset, take the product, over all words in the text, of the ratio between the probability of the word given the positive class and the probability of the same word given the negative class. If the value is bigger than one, the numerator probabilities outweigh the denominator ones and we classify the text as positive. If it is smaller than one, we classify the text as negative.

$$ \prod_{i=1}^m \frac{P(W^{(i)}|pos)}{P(W^{(i)}|neg)}$$

An unbalanced dataset requires multiplying the product by the prior ratio $ \frac{P(pos)}{P(neg)} $

The problem with this approach is that some words may appear in one class but not the other. The probability of such a word given the class without it is zero, so the product above becomes zero (or is undefined when the zero is in the denominator) and the information from all the other words is lost. To combat this problem we use Laplacian smoothing.

$$ P(W^{(i)} | class ) = \frac {freq (w^{(i)}, class) + 1}{N_{class} + V} $$

- $ V $ refers to the number of unique words in the vocabulary
- $ N_{class} $ refers to the total frequency of all words in the class
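
A small worked example of add-one smoothing, assuming a hypothetical four-word vocabulary and made-up counts for the positive class. The function name `smoothed_prob` is illustrative.

```python
def smoothed_prob(word, word_counts, vocab_size):
    """P(word | class) with add-one (Laplacian) smoothing."""
    n_class = sum(word_counts.values())  # total word occurrences in the class
    return (word_counts.get(word, 0) + 1) / (n_class + vocab_size)

# Hypothetical counts for the positive class
pos_counts = {"happy": 3, "day": 1}
vocab_size = 4  # e.g. the vocabulary {"happy", "day", "sad", "rain"}

print(smoothed_prob("happy", pos_counts, vocab_size))  # (3 + 1) / (4 + 4) = 0.5
print(smoothed_prob("sad", pos_counts, vocab_size))    # (0 + 1) / (4 + 4) = 0.125
```

The unseen word "sad" now gets a small non-zero probability, and the probabilities over the whole vocabulary still sum to one.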


As $ m $ gets larger, we will encounter numerical underflow: the product of many small probabilities becomes too small for the computer to store. We use the properties of the logarithm to solve this issue, since the log of a product is the sum of the logs.

To make an inference, we now use this formula

$$ \sum_{i=1}^m \lambda(w^{(i)}) = \sum_{i=1}^m \log\frac{P(W^{(i)}|pos)}{P(W^{(i)}|neg)} $$

A ratio less than one produces a negative value, while a ratio bigger than one produces a positive value. The bigger/smaller the ratio, the bigger/smaller the log likelihood. The decision threshold therefore moves from one to zero: a positive sum means the text is classified as positive.

#### Train Naive Bayes

1. Collect and annotate the corpus
2. Preprocess (lowercase, remove punctuation, URLs and names, remove stop words, stem words, tokenize sentences) to turn messy raw data into clean data. (This takes a big chunk of the whole project.)
3. Compute the word counts $ freq(W, class) $ to produce the frequency table
4. Get the conditional probability of each word given the class, $ P(w^{(i)}|pos) $ and $ P(w^{(i)}|neg) $, using Laplacian smoothing
5. Get the lambda of each word, $ \lambda(w^{(i)}) $
6. Compute the log prior $ \log\frac{P(pos)}{P(neg)} $
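
Steps 3 to 6 can be sketched as follows, assuming the documents are already preprocessed into token lists and that both classes are present. The function names and the toy corpus are assumptions. Ignoring words outside the vocabulary at prediction time is also a choice made here.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: 1 (positive) or 0 (negative)."""
    pos_counts, neg_counts = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (pos_counts if label == 1 else neg_counts).update(tokens)

    vocab = set(pos_counts) | set(neg_counts)
    n_pos, n_neg, v = sum(pos_counts.values()), sum(neg_counts.values()), len(vocab)

    # Lambda of each word: log ratio of the smoothed conditional probabilities
    lambdas = {
        w: math.log(((pos_counts[w] + 1) / (n_pos + v)) / ((neg_counts[w] + 1) / (n_neg + v)))
        for w in vocab
    }
    d_pos = sum(labels)            # number of positive documents
    d_neg = len(labels) - d_pos    # number of negative documents
    logprior = math.log(d_pos / d_neg)
    return logprior, lambdas

def predict(tokens, logprior, lambdas):
    """Score > 0 means positive; words outside the vocabulary contribute nothing."""
    return logprior + sum(lambdas.get(w, 0.0) for w in tokens)

# Hypothetical, already preprocessed corpus
docs = [["happy", "day"], ["happy", "happy"], ["sad", "day"], ["sad", "rain"]]
labels = [1, 1, 0, 0]
logprior, lambdas = train_naive_bayes(docs, labels)
score = predict(["happy", "rain"], logprior, lambdas)
print(score > 0)  # True: lambda("happy") = log 4 outweighs lambda("rain") = log 0.5
```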


### Applications of the Naive Bayes Model
- Sentiment Analysis
- Author Authentication: given a text, predict which author wrote it
- Spam Classification
- Information Retrieval: relevant vs irrelevant documents based on keyword input


### Assumptions it holds
1. Independence between words: not true in NLP, as some words appear together more often than others
2. The relative frequency of the classes in the corpus affects the model: if there are more positive examples, the model is biased towards the positive class

## Vector Space Models

Represent words and documents as vectors to capture the relative meanings of words in a sentence.
Some applications are machine translation, chatbots and text extraction.

Two design paradigms
- Word by word design: count how often two words occur within a fixed distance $k$ of each other, where the bandwidth $k$ decides whether two words count as being next to one another
- Word by document design: keep track of the number of times a word appears in each document, in matrix form, with words stored in rows
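
A minimal sketch of the word by word design, assuming a whitespace-tokenized toy sentence. The function name `cooccurrence` and storing the counts in a dictionary keyed by word pairs (rather than a matrix) are illustrative choices.

```python
from collections import defaultdict

def cooccurrence(tokens, k):
    """Count how often each pair of words occurs within k positions of each other."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        # look at the next k tokens and record the pair in both directions
        for j in range(i + 1, min(i + k + 1, len(tokens))):
            counts[(word, tokens[j])] += 1
            counts[(tokens[j], word)] += 1
    return counts

# Hypothetical toy corpus
tokens = "i like simple data i prefer simple raw data".split()
counts = cooccurrence(tokens, k=2)
print(counts[("simple", "data")])  # 2
```

The row of counts for a word (its count against every other word) is then that word's vector.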


#### Measure of similarity
- Euclidean distance in n-dimensional space: $ d(\vec a, \vec b) = \sqrt {\sum_{i=1}^{n}(a_{i}- b_{i})^2} = || \vec a - \vec b || = norm(\vec a - \vec b) $
- Cosine similarity can be a better proxy for similarity than Euclidean distance when one vector is much shorter than the other but they point in a similar direction, because it is not biased by the size of the corpus representations (vector length). Here the angle between them better captures the closeness of the two vectors. Recall $ \vec a \cdot \vec b = \sum_{i=1}^n a_{i} b_{i}$, so the cosine metric is given by $ cos(\beta) = \frac {\vec a \cdot \vec b}{||\vec a|| \cdot ||\vec b||}$
- Two orthogonal vectors have $ cos(\beta) = 0 $. For count vectors, whose entries are never negative, the cosine cannot go below 0, so orthogonal vectors are maximally dissimilar.
- Notice that $ -1 \leq cos(\beta) \leq 1 $: the higher the cosine value, the smaller the angle, and hence the closer the two vectors are.
- If $ cos(\beta)$ is between -1 and 0, the two vectors are dissimilar. (Recall from the Cartesian plane that cosine is only positive in the first and fourth quadrants.)
- If two vectors are identical, then $ cos(\beta) = 1$

We use a measure of similarity to predict the word closest in meaning to a given word. The catch is to build a vector space in which the word representations capture the relative meaning of words.

#### Predicting rela

In [9]:
h = np.array([1,2,3])
np.array(h)

array([1, 2, 3])

In [10]:
import numpy as np 

turkey = [3,1]
ankara = [9,1]
japan = [4,3]
usa = [5,6]
wash = [10, 5]

v1 = [1,2,3]
v2 = [4,7,2]
v3 = [3,1,4]
v4 = [1,0,-1]
v5 = [2,8,1]

def euclidean(array1, array2):
    return np.linalg.norm(np.array(array2)-np.array(array1))

def cosine(array1, array2):
    array1 = np.array(array1)
    array2 = np.array(array2)
    return (np.dot(array1, array2))/(np.linalg.norm(array1)*np.linalg.norm(array2))

city = np.array(usa) - np.array(wash)
country = np.array(ankara) + city
print(euclidean(japan, country))
print(euclidean(turkey, country))

distance = np.array(v1) - np.array(v2)
desired = distance + np.array(v3)
print(euclidean(v1, desired))
print(euclidean(v2, desired))
print(cosine(v4,v5))

1.0
1.4142135623730951
6.4031242374328485
12.083045973594572
0.08512565307587484


### PCA 
Dimensionality reduction technique that projects an n-dimensional space onto a smaller number of dimensions.

Application
- reduce an n-dimensional space to two dimensions and plot the points on a 2-D graph to see where the data points lie relative to one another.

How it works?
1. Find the eigenvectors and eigenvalues of the covariance matrix of the data
2. The eigenvectors give the directions of the uncorrelated features
3. The eigenvalues give the amount of information retained by each new feature, that is, how much variance there is along the corresponding eigenvector.

Steps
1. Mean-normalize the data
2. Find the covariance matrix
3. Find the eigenvalues and eigenvectors of the covariance matrix
4. Rearrange the eigenvectors so that their eigenvalues are in decreasing order.
5. Take the first k eigenvectors (k being the target number of dimensions) and multiply the normalized data by them.
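
The steps above can be sketched in NumPy as follows. The function name `pca` and the random data are assumptions. `np.linalg.eigh` is used here because the covariance matrix is symmetric; it is one reasonable choice, not necessarily what the course notebook does.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (samples x features) onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)            # 1. mean-normalize each feature
    cov = np.cov(X_centered, rowvar=False)     # 2. covariance matrix (features x features)
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigen-decomposition of a symmetric matrix
    order = np.argsort(eigvals)[::-1]          # 4. indices sorted by decreasing eigenvalue
    components = eigvecs[:, order[:k]]         # 5. keep the first k eigenvectors
    return X_centered @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # hypothetical data: 50 samples, 5 features
X_2d = pca(X, k=2)
print(X_2d.shape)  # (50, 2)
```

The two output columns are uncorrelated, and the first one carries at least as much variance as the second.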

## Machine Translation

1. Word embeddings in English (X) and word embeddings in French (Y)
2. To find the relationship between vector spaces X and Y, we need to find the matrix R such that XR $ \approx $ Y

Finding R using the Frobenius norm
1. Initialise R, then in a loop compute Loss = $|| XR - Y ||_F$
2. Minimise the Loss by taking its derivative with respect to R: $ g = \frac{d}{dR}$Loss
3. Update R = R - $\alpha \times g$, where $ \alpha $ is the learning rate
4. Stop when the Loss falls below a certain threshold
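
A minimal gradient-descent sketch of this loop. It assumes the squared Frobenius norm averaged over the $m$ rows, $\frac{1}{m}||XR - Y||_F^2$, as the loss, because its gradient has the simple form $\frac{2}{m} X^T (XR - Y)$. The function name, hyperparameters and random data are all illustrative.

```python
import numpy as np

def align_embeddings(X, Y, learning_rate=0.1, steps=500, tol=1e-8):
    """Find R such that X @ R approximates Y, by gradient descent."""
    m = X.shape[0]
    R = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(steps):
        diff = X @ R - Y
        loss = np.sum(diff ** 2) / m         # assumed loss: squared Frobenius norm / m
        if loss < tol:                       # stop once the loss is below the threshold
            break
        gradient = (2 / m) * X.T @ diff      # derivative of the loss with respect to R
        R -= learning_rate * gradient        # update step
    return R

# Hypothetical embeddings: Y is an exact linear transform of X
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
R_true = rng.normal(size=(3, 3))
Y = X @ R_true
R = align_embeddings(X, Y)
print(np.allclose(R, R_true, atol=1e-3))
```

With real embeddings the fit is only approximate, so the loop would normally end on the step limit or a looser threshold.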

#### Frobenius norm
$A$ is an m x n matrix with entries $a_{ij}$
$$ ||A||_F = \sqrt {\sum^m_{i=1} \sum^n_{j=1} |a_{ij}|^2}$$

#### Hash Table & Hash Function

A hash table is a data structure used to look up data. A lookup runs in constant time $O(1)$ on average, which is faster than a linear search $O(n)$.
A hash function is a function that maps a value to a hash value (key), which decides the bucket the value is stored in.

A simple hash table that stores a list of numbers in n buckets is shown below.

In [11]:
# Frobenius norm

v = np.array([[1,3], [4,5]])

print(np.sqrt(np.sum(np.square(v))))
print(np.linalg.norm(v))

7.14142842854285
7.14142842854285


In [5]:
def hash_function(value, n_buckets):
    return value % n_buckets

def basic_hash_table(values, n_buckets):
    """ values : a list of elements 
        n_buckets : a scalar
    """
    hash_table = {i:[] for i in range(n_buckets)}
    for value in values:
        hash_value = hash_function(value, n_buckets)
        hash_table[hash_value].append(value)
    return hash_table

list_num = [1,44,22,77,3,88,9,13,12]

table = basic_hash_table(list_num, 10)
table

{0: [],
 1: [1],
 2: [22, 12],
 3: [3, 13],
 4: [44],
 5: [],
 6: [],
 7: [77],
 8: [88],
 9: [9]}

#### Locality sensitive hashing
Used to reduce **the computational cost** of finding the k nearest neighbours in a high-dimensional space.
It is a hash function that takes into account the relative location of a vector in the vector space when building the hash table.

1. Given a plane, find a normal vector perpendicular to it.
2. Compute the dot product of the normal vector and the vector representing the data point (which needs to be transposed).
3. Use the sign of the result to decide whether the data point is above or below the plane (the location of the data point relative to the plane).
4. Generalise this idea to multiple planes

#### Multiple plane generalisation
Given a point denoted by $v$, we can test it against several planes, $P_1,P_2,P_3$. If the sign of $P_1v^T$ is positive, we set $h_1$ to 1, and to 0 if it is negative. With $H$ planes, the hash value is
$$hashvalue = \sum_{i=0}^{H-1} 2^i \times h_{i+1}$$

In [8]:
import numpy as np

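def side_of_planes(P, v):
    """
        P and v both need to be row vectors of shape (1, n)
    """
    dot_product = np.dot(P, v.T)
    sign = np.sign(dot_product)
    # np.asscalar was deprecated and later removed from NumPy; item() does the same job
    sign_scalar = sign.item()
    return sign_scalar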

def hash_multiple_plane(Ps, v):
    """ Ps : a list of multiple plane
        v : vector representing a data point
    """
    hash_value = 0
    for i, P in enumerate(Ps):
        h_sign = side_of_planes(P, v)
        h_i = 1 if h_sign >= 0 else 0
        hash_value += 2**i * h_i
        
    return hash_value

#### Approximate Nearest Neighbour

In [9]:
np.random.normal(size=(3,4))

array([[-1.30431584,  0.4105472 ,  0.71611908,  0.49633876],
       [ 1.22778007,  1.55515613, -0.06870154, -0.67537823],
       [-2.4299803 , -1.07048808,  0.41357607,  1.04544117]])

In [10]:
n_dimensions = 2
num_planes =3 

random_plane_matrix = np.random.normal(size =(num_planes, n_dimensions))

def side_of_planes(Ps, v):
    dotprod = np.dot(Ps, v.T)
    sign = np.sign(dotprod)
    return sign

v = np.array([[1,2]])
print(v.shape)
side_of_planes(random_plane_matrix, v)

(1, 2)


array([[ 1.],
       [-1.],
       [-1.]])
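
The signs computed above can be turned into a bucket index, and a nearest-neighbour query then only compares against vectors in the same bucket. The sketch below is a minimal illustration under stated assumptions: the function names, the seeded random planes and data, and the use of cosine similarity to rank candidates are all choices made here, not taken from the notebook.

```python
import numpy as np

def hash_vector(planes, v):
    """Hash v to an integer from the side of each plane it falls on."""
    bits = (planes @ v >= 0).astype(int)   # one bit per plane
    return int(sum(2 ** i * b for i, b in enumerate(bits)))

def build_table(planes, vectors):
    """Group vector indices by hash value."""
    table = {}
    for idx, v in enumerate(vectors):
        table.setdefault(hash_vector(planes, v), []).append(idx)
    return table

def approximate_nearest(planes, table, vectors, query):
    """Index of the most cosine-similar vector in the query's bucket, or None if it is empty."""
    candidates = table.get(hash_vector(planes, query), [])
    if not candidates:
        return None

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    return max(candidates, key=lambda i: cosine(vectors[i], query))

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 2))     # 3 random planes in 2-D, given by their normal vectors
vectors = rng.normal(size=(20, 2))   # hypothetical data points
table = build_table(planes, vectors)
query = vectors[5] * 2.0             # same direction as vector 5, so it lands in the same bucket
print(approximate_nearest(planes, table, vectors, query))
```

The search is approximate because a true nearest neighbour that falls just across a plane ends up in a different bucket and is never compared.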