###**Document Analysis**
- ex. Spam detection, review classification

Given $n$ samples: $\{(x_1,y_1), (x_2,y_2),...,(x_n,y_n)\}$

Goal: Learn a mapping function from $x$ to $y$

Each sample could be:
- an email (spam detection)
- a paragraph (review classification)
- an article

1. Convert categorical features into numerical values
  - Label encoding
  - One-hot encoding
  - Ordinal encoding
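The three encodings can be illustrated with a small sketch. The example values (`colors`, `sizes`) and the category order in `size_order` are made up for illustration, and assigning label indices in sorted order is only one possible convention.

```python
# Hypothetical categorical data, used only for illustration.
colors = ["red", "green", "blue", "green"]

# Label encoding: map each category to an integer (sorted order is an assumption).
categories = sorted(set(colors))                      # ['blue', 'green', 'red']
label_index = {c: i for i, c in enumerate(categories)}
label_encoded = [label_index[c] for c in colors]

# One-hot encoding: one binary column per category.
one_hot = [
    [1 if label_index[c] == i else 0 for i in range(len(categories))]
    for c in colors
]

# Ordinal encoding: integers follow a meaningful, user-supplied order.
size_order = ["small", "medium", "large"]
sizes = ["medium", "small", "large"]
ordinal_encoded = [size_order.index(s) for s in sizes]

print(label_encoded)
print(one_hot)
print(ordinal_encoded)
```

Label encoding imposes an arbitrary order on unordered categories, which is why one-hot encoding is often preferred for them, while ordinal encoding is appropriate when the order carries meaning.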

How to convert words into numerical values?
- Each sentence/paragraph contains multiple words
- Bag-of-words!



**Bag of Words**

A sentence, paragraph, or article can be represented as a bag-of-words vector.

The vocabulary is the set of unique words that appear in the text collection.

Steps:

1. Build the vocabulary/dictionary from the given dataset
  - get all unique words
  - each word in the vocabulary has an index

2. Represent each sentence/paragraph/article with the vocabulary
  - use a vector whose dimensionality equals the size of the vocabulary
  - each time a word appears, add 1 to the corresponding element of the vector

Properties:
- cannot preserve the order of the words
- high dimensional
- very sparse
- some words are too common for all documents (it, the, ...)
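The two steps above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and lowercasing (neither is specified in these notes); the function names `build_vocabulary` and `bag_of_words` are hypothetical.

```python
def build_vocabulary(documents):
    """Step 1: give each unique word an index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():   # assumption: split on whitespace
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def bag_of_words(document, vocab):
    """Step 2: count vector whose length equals the vocabulary size."""
    vector = [0] * len(vocab)
    for word in document.lower().split():
        if word in vocab:                  # words outside the vocabulary are ignored
            vector[vocab[word]] += 1
    return vector


docs = ["the cat sat", "the dog sat on the cat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Reordering the words in a document leaves its vector unchanged, which is the loss of word order listed under the properties.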

**Term Frequency-Inverse Document Frequency (TF-IDF)**
- reflects how important a word is to a document in a collection
- definition:
  $TF(t,d) = \frac{\#\text{ occurrences of } t \text{ in document } d}{\#\text{ words in document } d}$
  
  $IDF(t) = \log\frac{\#\text{ documents}}{\#\text{ documents containing } t}$
  
  $\text{TF-IDF}(t,d) = TF(t,d) \cdot IDF(t)$

- Term Frequency (TF):
  - measures the frequency of a word in a document

- Inverse Document Frequency (IDF):
  - measures the rareness of a word in all documents
  - the more documents a word appears in, the less valuable that word is, as a signal to differentiate any given document
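The definitions above translate almost directly into code. The sketch below assumes the natural logarithm (the notes do not fix a base), whitespace tokenization, and that the queried term appears in at least one document so the IDF denominator is non-zero. Function names are hypothetical.

```python
import math


def tf(term, doc_tokens):
    """Fraction of the document's words that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus_tokens):
    """log(#documents / #documents containing term); assumes term occurs somewhere."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)


def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


corpus = [doc.split() for doc in ["the cat sat", "the dog barked", "the cat purred"]]

print(tf_idf("the", corpus[0], corpus))   # "the" is in every document, so IDF = log(1) = 0
print(tf_idf("cat", corpus[0], corpus))
print(tf_idf("dog", corpus[1], corpus))
```

A word that appears in every document gets a score of zero, and the rarer "dog" scores higher than "cat" even though both occur once in their documents.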

**Two Classes**: binary classification - logistic regression, KNN, etc
- imbalanced data: use recall, precision, F1

**Multiple Classes**: multi-class logistic regression, KNN, etc
- softmax
- imbalanced data: use micro/macro recall, precision, F1
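The difference between macro and micro averaging can be seen in a short sketch. It assumes single-label multi-class predictions, defines macro F1 as the mean of the per-class F1 scores (one common convention), and returns 0 when a denominator is zero. The function names and the toy labels are hypothetical.

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_micro(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, c) for c in labels]
    per_class = [precision_recall_f1(*c) for c in counts]
    # Macro: average the per-class metrics, so every class weighs the same.
    macro = tuple(sum(m[i] for m in per_class) / len(labels) for i in range(3))
    # Micro: pool the counts over classes first, so every sample weighs the same.
    micro = precision_recall_f1(*(sum(c[i] for c in counts) for i in range(3)))
    return macro, micro


# Imbalanced toy data: class 0 dominates, class 2 is never predicted.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0]
macro, micro = macro_micro(y_true, y_pred)
print("macro (P, R, F1):", macro)
print("micro (P, R, F1):", micro)
```

Macro averaging is pulled down by the rare class that is never predicted, while micro averaging is dominated by the frequent class.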

###**Non-Negative Matrix Factorization (NMF)**

Defn: $\min_{F \ge 0,\, G \ge 0} ||X-FG^T||^2_F$

- columns of F are the underlying basis vectors
- rows of G give the weights of the basis vectors
- application: topic models

###**Multiplicative Update Method**

Most commonly used method

The update rule: fix F, solve for G; fix G, solve for F

- arises from gradient descent method

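$F_{ik} \leftarrow F_{ik} + \epsilon_{ik}[(XG)_{ik} - (FG^TG)_{ik}]$

Set $\epsilon_{ik} = \frac{F_{ik}}{(FG^TG)_{ik}}$

Then $F_{ik} \leftarrow F_{ik}\frac{(XG)_{ik}}{(FG^TG)_{ik}}$

Update F: $F_{ik} \leftarrow F_{ik}\frac{(XG)_{ik}}{(FG^TG)_{ik}}$

Update G: $G_{jk} \leftarrow G_{jk}\frac{(X^TF)_{jk}}{(GF^TF)_{jk}}$

Repeat both updates until convergence.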
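The alternating updates can be sketched as follows. This is a minimal illustration rather than a tuned implementation: it uses plain Python lists to stay dependency-free (NumPy would be the usual choice), runs a fixed number of iterations instead of testing for convergence, and adds a small `eps` to the denominators to avoid division by zero. The random initialization, the helper names, and the example matrix `X` are all assumptions.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(X, F, G):
    """Squared Frobenius norm ||X - F G^T||_F^2."""
    approx = matmul(F, transpose(G))
    return sum((x - a) ** 2 for rx, ra in zip(X, approx) for x, a in zip(rx, ra))


def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # Strictly positive starting values keep every later iterate non-negative.
    F = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    G = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    Xt = transpose(X)
    for _ in range(n_iter):
        # Fix G, update F: F_ik <- F_ik * (XG)_ik / (F G^T G)_ik
        XG = matmul(X, G)
        FGtG = matmul(F, matmul(transpose(G), G))
        F = [[F[i][k] * XG[i][k] / (FGtG[i][k] + eps) for k in range(rank)]
             for i in range(n)]
        # Fix F, update G: G_jk <- G_jk * (X^T F)_jk / (G F^T F)_jk
        XtF = matmul(Xt, F)
        GFtF = matmul(G, matmul(transpose(F), F))
        G = [[G[j][k] * XtF[j][k] / (GFtF[j][k] + eps) for k in range(rank)]
             for j in range(m)]
    return F, G


# Hypothetical non-negative data matrix (5 samples, 4 features).
X = [[5, 3, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [1, 0, 0, 4], [0, 1, 5, 4]]
F_early, G_early = nmf(X, rank=2, n_iter=1)
F_late, G_late = nmf(X, rank=2, n_iter=200)
print(frobenius_error(X, F_early, G_early), frobenius_error(X, F_late, G_late))
```

Because each update only multiplies by a non-negative ratio, $F$ and $G$ stay non-negative without any explicit projection step.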

###**PageRank**

Components of a graph:
- nodes/vertices
- edges/links
- graph/network

Types of graphs:
- **directed** - each link has a direction, pointing from one node to another
- **undirected** - links have no direction

Adjacency matrix: describes links between nodes

Node degrees:
- **undirected** - the number of edges adjacent to a node
- **directed** -
  - *in-degree*: number of head ends adjacent to a node
  - *out-degree*: number of tail ends adjacent to a node
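As a small illustration of the adjacency matrix and node degrees for a directed graph, the sketch below uses a made-up three-node graph and assumes the convention that `adj[i][j] = 1` when there is a link from node `i` to node `j` (the transposed convention is also in use).

```python
# Hypothetical directed graph given as (source, target) links.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

nodes = sorted({n for edge in edges for n in edge})
index = {n: i for i, n in enumerate(nodes)}

# adj[i][j] = 1 if there is a link from node i to node j (assumed convention).
adj = [[0] * len(nodes) for _ in nodes]
for src, dst in edges:
    adj[index[src]][index[dst]] = 1

# Out-degree: row sum (links leaving the node).
out_degree = {n: sum(adj[index[n]]) for n in nodes}
# In-degree: column sum (links arriving at the node).
in_degree = {n: sum(row[index[n]] for row in adj) for n in nodes}

print(adj)
print(out_degree)
print(in_degree)
```

For an undirected graph the matrix would be symmetric, and the row sum and column sum would both give the node degree.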

Page $i$ with importance $r_i$ has $d_i$ out-links; each out-link carries $\frac{r_i}{d_i}$ votes.

Page $j$'s importance, $r_j$, is the sum of the votes on its in-links.

Importance of each node, summing over all pages $i$ that link to $j$:
  $r_j = \sum_{i \to j}{\frac{r_i}{d_i}}$
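One way to solve these equations is to apply them repeatedly, starting from equal importance for every page. The sketch below does exactly that on a made-up three-page graph. It assumes every page has at least one out-link and that the graph is connected enough for the iteration to settle, and it omits the damping factor that the commonly used PageRank formulation adds. The function name `page_rank` and the iteration count are hypothetical.

```python
def page_rank(out_links, n_iter=100):
    """Iterate r_j = sum over in-links of r_i / d_i, starting from a uniform r."""
    nodes = list(out_links)
    r = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(n_iter):
        new_r = {n: 0.0 for n in nodes}
        for i, targets in out_links.items():
            share = r[i] / len(targets)    # each out-link carries r_i / d_i votes
            for j in targets:
                new_r[j] += share          # page j sums the votes on its in-links
        r = new_r
    return r


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(graph)
print(ranks)
```

For this graph the equations can also be solved by hand: $r_B = r_A/2$ and $r_C = r_A/2 + r_B = r_A$, so with importances summing to 1 we get $r_A = r_C = 0.4$ and $r_B = 0.2$.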

