# Text Classification

## Fundamentals of classification

<center>
<img src="./figures/week2l1-4.png" width = "400" alt="Figure" align=center />
</center>

Input: a document $d$ (represented as a vector of features) and a fixed set of output classes $C=\{c_1,c_2,\dots,c_k\}$

Output: a predicted class $c\in C$

<font color=red>NOTE: most classification tasks need feature engineering</font>



## Text classification tasks

Examples: topic classification, sentiment analysis, native-language identification, natural language inference, automatic fact-checking

The input may not be a long document (e.g. it can be a single sentence or a tweet)


### Topic Classification
Motivation: library science (categorizing documents), information retrieval

Classes: Topic categories (e.g. "jobs","international news")

Features: 
1. Unigram (bag of words, BOW) (<font color=red>with stop-words removed</font>)
2. N-grams (for phrases)

Examples of corpora: Reuters news corpus (RCV1; NLTK also includes a Reuters corpus), PubMed abstracts, tweets with hashtags
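
The unigram and n-gram features listed above can be sketched in a few lines. This is only an illustration: the tokenizer is a naive whitespace split, and `STOP_WORDS` is a tiny hypothetical list rather than a real stop-word resource.

```python
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "are", "to", "and"}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bow_features(text, remove_stop_words=True):
    """Unigram bag-of-words counts, optionally with stop words removed."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

def ngram_features(text, n=2):
    """Counts of contiguous word n-grams, joined with a space."""
    tokens = tokenize(text)
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

doc = "The price of oil is rising in the international market"
print(bow_features(doc))
print(ngram_features(doc, n=2))
```

Bigrams such as "international market" capture phrases that the unigram counts split apart.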


### Sentiment Analysis
Motivation: opinion mining, business analytics

Classes: Positive / Negative / Neutral

Features: N-grams or BOW (<font color=red>do not remove stop words</font>), polarity lexicons (hand-built collections of positive/negative words)

Examples of corpora: Movie review dataset, SemEval Twitter polarity datasets
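
A polarity lexicon can be turned into features by counting matches. The sketch below assumes two small hypothetical word sets, `POSITIVE` and `NEGATIVE`; real lexicons are far larger and often carry graded scores.

```python
# Hypothetical, hand-built polarity lexicon (real lexicons are much larger).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "boring", "hate"}

def polarity_features(text):
    """Count lexicon hits and return them as a small feature dictionary."""
    tokens = text.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return {"pos_count": pos, "neg_count": neg, "polarity": pos - neg}

print(polarity_features("I love this excellent movie"))
print(polarity_features("A great film with a boring ending"))
```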


### Native-Language Identification
Motivation: forensic linguistics, educational applications

Classes: first language of author (e.g. Indonesian)

Features: Word N-grams, syntactic patterns (POS tags, parse trees), phonological features

Examples of corpora: TOEFL/IELTS essay corpora


### Natural Language Inference (textual entailment)
Motivation: language understanding (the relationship between sentences)

Classes: entailment, contradiction, neutral

Features: Word overlap, length difference between the sentences, N-grams (<font color=red>do not remove stop words</font>)

Examples of corpora: SNLI, MNLI
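
As a rough illustration of the word-overlap and length-difference features, the sketch below treats overlap as the Jaccard overlap of the two token sets. That choice is an assumption; the notes do not fix a particular overlap measure.

```python
def nli_features(premise, hypothesis):
    """Word overlap (Jaccard, an assumed measure) and token-length difference."""
    p = premise.lower().split()
    h = hypothesis.lower().split()
    p_set, h_set = set(p), set(h)
    union = p_set | h_set
    overlap = len(p_set & h_set) / len(union) if union else 0.0
    return {"word_overlap": overlap, "length_diff": len(p) - len(h)}

print(nli_features("a man is sleeping", "a man is awake"))
```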


### Building a Text Classifier
1. Identify a task of interest
2. Collect an appropriate corpus (or you can build it)
3. Carry out annotation
4. Select features
5. Choose a machine learning algorithm (e.g. SVM)
6. Train model and tune hyper-parameters using held-out development data
7. Repeat earlier steps as needed (e.g. if the dataset is noisy or too small)
8. Train final model
9. Evaluate model on held-out test data (never-seen examples, to detect over-fitting)
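
The development and test sets in the steps above have to be separated from the training data before any tuning. A minimal sketch of such a split follows; the 80/10/10 proportions and the function name `train_dev_test_split` are illustrative assumptions.

```python
import random

def train_dev_test_split(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle and split labelled examples into train / dev / test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test

# Hypothetical labelled corpus: (document, label) pairs.
data = [(f"doc {i}", i % 2) for i in range(100)]
train, dev, test = train_dev_test_split(data)
print(len(train), len(dev), len(test))  # 80 10 10
```

The test portion should be touched only once, for the final evaluation.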



## Algorithms for classification

### Choosing a Classification Algorithm

Bias: the assumptions built into the model (more assumptions, more bias)

Variance: sensitivity to the training set (high variance means the model tends to over-fit)

Underlying assumptions also matter, e.g. feature independence in Naive Bayes.

Trade-off between complexity and speed.

### Naive Bayes
Finds the class with the highest likelihood under Bayes' law:
$$P(C|F)\propto P(F|C)P(C)$$

$P(C)$: prior probability of the class

$P(F|C)$: probability of the features given the class

Assumption: features are independent given the class (rarely satisfied in practice)
$$P(c_n|f_1,\dots,f_m)\propto P(c_n)\prod_{i=1}^m P(f_i|c_n)$$

Pros:
1. Fast to train and classify
2. robust, low-variance -> <font color=red>good for low data situations</font>
3. optimal classifier if independence assumption is correct
4. extremely simple to implement

Cons:
1. Independence assumption rarely holds (high bias)
2. Low accuracy compared to similar methods in most situations
3. Smoothing required for unseen class/feature combinations (a zero count would otherwise zero out the whole product; typical smoothing techniques apply)
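
A minimal sketch of a Naive Bayes text classifier follows. It assumes a multinomial model over token counts and add-$\alpha$ smoothing, which are common choices but not ones the notes specify. It works in log space so that the product of many small probabilities does not underflow.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect class counts, per-class feature counts and the vocabulary."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        feat_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, feat_counts, vocab

def predict_nb(tokens, class_counts, feat_counts, vocab, alpha=1.0):
    """Return the class maximising log P(c) + sum_i log P(f_i | c)."""
    n_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, count in class_counts.items():
        score = math.log(count / n_docs)  # log prior P(c)
        total = sum(feat_counts[c].values())
        for t in tokens:
            if t not in vocab:
                continue  # ignore words never seen in training
            # Add-alpha smoothing avoids zero probabilities.
            p = (feat_counts[c][t] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy training set of tokenized documents.
docs = [["great", "film"], ["boring", "film"], ["great", "acting"], ["boring", "plot"]]
labels = ["pos", "neg", "pos", "neg"]
model = train_nb(docs, labels)
print(predict_nb(["great", "plot"], *model))
```

Without smoothing, the word "plot" (never seen with class `pos`) would force $P(\text{pos}|F)$ to zero regardless of the other evidence.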


### Logistic Regression
A linear model that uses softmax "squashing" to obtain valid probabilities

$$p(c_n|f_1,...f_m)=\frac{1}{Z}\exp(\sum_{i=1}^mw_if_i)$$

$Z$-normalisation factor

Training maximizes the probability of the training data, subject to <font color=red>regularization</font>, which encourages low or sparse weights (without it the model easily over-fits)


Pros:
1. not confounded by diverse, correlated features (better performance)
2. can look at weights for interpretability

Cons:
1. Slow to train 
2. Feature scaling needed (bring features to a similar magnitude, so that regularisation treats their weights comparably)
3. Requires a lot of data to work well in practice
4. Choosing a regularisation strategy is important, since the model over-fits easily
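
The softmax step can be sketched as below. The code assumes one weight vector per class (the dictionary `weights` is hypothetical), which the formula above leaves implicit; it shows prediction only, not training.

```python
import math

def softmax_probs(features, weights):
    """p(c | f) = exp(w_c . f) / Z, with Z summing over all classes.

    `weights` maps each class to its own weight vector (an assumption:
    the formula in the notes leaves the per-class indexing of w implicit).
    """
    scores = {c: sum(w_i * f_i for w_i, f_i in zip(w, features))
              for c, w in weights.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - max_score) for c, s in scores.items()}
    z = sum(exps.values())  # normalisation factor Z
    return {c: e / z for c, e in exps.items()}

# Hypothetical hand-picked weights for two classes and two features.
weights = {"sport": [1.0, -0.5], "politics": [-1.0, 0.5]}
print(softmax_probs([2.0, 1.0], weights))
```

Dividing by $Z$ is what makes the outputs sum to one.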


### Support Vector Machines
Finds hyperplane which separates the training data with maximum margin

Pros:
1. Fast and accurate linear classifier (for NLP tasks)
2. Can do non-linearity with kernel trick
3. Works well with huge feature sets

Cons:
1. Multiclass classification is awkward (but still workable, e.g. with one-vs-rest)
2. Feature scaling needed
3. Deals poorly with class imbalances
4. Limited interpretability

NLP problems often involve <font color=red>large feature sets</font>. Prior to deep learning, SVMs were very popular for NLP.
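
To make "maximum margin" concrete, the sketch below computes the geometric margin of a given hyperplane $w\cdot x+b=0$ on a toy dataset. It does not train an SVM: an SVM would search for the $w$ and $b$ that make this quantity as large as possible. Labels are assumed to be $+1$ or $-1$.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * (w . x + b) / ||w|| over the data."""
    norm = math.sqrt(sum(w_i * w_i for w_i in w))
    return min(
        y * (sum(w_i * x_i for w_i, x_i in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical, linearly separable toy data.
points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-4.0, 1.0)]
labels = [1, 1, -1, -1]
print(geometric_margin([1.0, 0.0], 0.0, points, labels))  # 2.0
print(geometric_margin([1.0, 1.0], 0.0, points, labels))
```

Both hyperplanes separate the data, but the first one does so with a larger margin.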

### K-Nearest Neighbour
Classify based on majority class of k-nearest training examples in feature space

Definition of nearest: Euclidean distance, Cosine distance

Pros:
1. Simple but surprisingly effective
2. No training required
3. Inherently multiclass
4. Optimal classifier with infinite data

Cons:
1. Have to select k (tricky)
2. Issues with imbalanced classes
3. Often slow at prediction time (finding the neighbours is costly, especially with large feature sets, so feature engineering is needed)
4. Features must be selected carefully
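
A brute-force sketch of k-nearest-neighbour classification with both distance definitions is shown below. The function names are illustrative, and the linear scan over all training examples is exactly what makes prediction slow on large datasets.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here: treat an all-zero vector as orthogonal
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, train_vectors, train_labels, k=3, distance=euclidean_distance):
    """Majority class among the k training examples closest to `query`."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: distance(query, pair[0]))
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical 2-dimensional feature vectors.
train_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
train_labels = ["a", "a", "b", "b"]
print(knn_predict((1.0, 0.2), train_vectors, train_labels, k=3))
```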


### Decision Tree
Constructs a tree where nodes correspond to tests on individual features.

Leaves are final class decisions.

The tree is built by greedy maximization of mutual information.

Pros:
1. Fast to build and test
2. Feature scaling irrelevant
3. Good for small feature sets
4. Handles non-linearly-separable problems

Cons:
1. Not that interpretable in practice
2. Highly redundant sub-trees
3. Not competitive for large feature sets (so not a good fit for NLP)
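
The greedy criterion can be sketched as follows, assuming discrete feature values. The information gain of a split equals the mutual information between the feature test and the class. Only the choice of a single split is shown, not the recursive tree construction.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction from splitting on a (discrete) feature test."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """Greedy step: pick the feature index with the highest gain."""
    n_features = len(rows[0])
    return max(range(n_features),
               key=lambda j: information_gain([r[j] for r in rows], labels))

# Hypothetical binary features: column 0 predicts the class, column 1 does not.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["pos", "pos", "neg", "neg"]
print(best_feature(rows, labels))  # 0
```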


### Random Forests
An ensemble classifier.

Consists of decision trees trained on different subsets of the training data and of the features.

The final class decision is the majority vote of the sub-classifiers.

Pros:
1. Usually more accurate and more robust than decision trees
2. Great classifier for medium feature sets
3. Training easily parallelised

Cons:
1. Limited interpretability
2. Slow with large feature sets
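
The two ingredients of the ensemble, training subsets and majority voting, can be sketched as below. The "trees" are hypothetical one-line functions standing in for trained decision trees, and sampling with replacement (bootstrap) is an assumed way of drawing the subsets.

```python
import random
from collections import Counter

def bootstrap_sample(examples, rng):
    """Sample len(examples) items with replacement (one training subset)."""
    return [rng.choice(examples) for _ in examples]

def majority_vote(classifiers, x):
    """Final decision: the class predicted by most sub-classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees: each is just a function
# that tests a single feature of x.
trees = [
    lambda x: "pos" if x["great"] > 0 else "neg",
    lambda x: "neg" if x["boring"] > 0 else "pos",
    lambda x: "pos" if x["love"] > 0 else "neg",
]
print(majority_vote(trees, {"great": 1, "boring": 1, "love": 0}))  # neg
```

Because each tree is built independently of the others, the training of the sub-classifiers can run in parallel.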


### Neural Networks
An interconnected set of nodes, typically arranged in layers.

Input layer (features), output layer (class probabilities), and one or more hidden layers.

Each node performs a linear weighting of its inputs from the previous layer and passes the result through an activation function to the nodes in the next layer.

Pros:
1. Extremely powerful; you can customize the architecture
2. little feature engineering

Cons:
1. Not an off-the-shelf classifier
2. Many hyper-parameters, difficult to optimise
3. Slow to train (usually needs a GPU)
4. Prone to overfitting (mitigated with engineering tricks)
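
The forward pass of a network with one hidden layer can be sketched in plain Python. The parameters in `params` are hypothetical hand-picked numbers, and ReLU and softmax are assumed activation choices. Training by backpropagation is not shown.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def layer(inputs, weights, biases, activation):
    """Each node: linear weighting of its inputs, then an activation."""
    linear = [sum(w * x for w, x in zip(node_w, inputs)) + b
              for node_w, b in zip(weights, biases)]
    return activation(linear)

def forward(features, params):
    hidden = layer(features, params["W1"], params["b1"], relu)
    return layer(hidden, params["W2"], params["b2"], softmax)

# Hypothetical parameters: 3 features -> 2 hidden nodes -> 2 classes.
params = {
    "W1": [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0]],
    "b1": [0.0, 0.1],
    "W2": [[1.0, -1.0], [-1.0, 1.0]],
    "b2": [0.0, 0.0],
}
print(forward([1.0, 0.0, 2.0], params))
```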


### Hyper-parameter Tuning
Tune on a held-out development set, or use k-fold cross-validation when the dataset is small.

The specific hyper-parameters depend on the classifier.

But many hyper-parameters relate to regularisation.

For multiple hyper-parameters, use <font color=red>grid search</font>.
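
Both ideas can be sketched with the standard library. Here `fake_dev_score` is a hypothetical stand-in for "train with these hyper-parameters and measure the score on held-out data", and the hyper-parameter names `C` and `ngram_max` are purely illustrative.

```python
from itertools import product

def k_fold_indices(n_examples, k):
    """Yield (train_idx, dev_idx) pairs; each example is held out once."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i in range(k):
        dev_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, dev_idx

def grid_search(grid, evaluate):
    """Try every combination in `grid`; return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function standing in for "train on the training folds,
# score on the held-out fold, average over folds".
def fake_dev_score(params):
    return -abs(params["C"] - 1.0) - abs(params["ngram_max"] - 2)

grid = {"C": [0.1, 1.0, 10.0], "ngram_max": [1, 2, 3]}
print(grid_search(grid, fake_dev_score))
```

The number of combinations grows multiplicatively with each added hyper-parameter, so the grid is usually kept coarse.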

## Evaluation

### Accuracy

<img src='example'>

### Precision & Recall

### F1-score
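
The F1-score is the harmonic mean of precision and recall. A minimal sketch of the metrics named in this section follows, using their standard definitions; the function names are illustrative, and precision, recall and F1 are computed with respect to one chosen positive class.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted class matches the gold class."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall_f1(gold, pred, positive):
    """Precision, recall and F1 for the class `positive`."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold labels and predictions.
gold = ["pos", "pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "neg", "pos", "pos"]
print(accuracy(gold, pred))                    # 0.6
print(precision_recall_f1(gold, pred, "pos"))
```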

TODO: fill in this section.