# Class 1

## Identify Features from Text

### Types of Textual Features

1. Words
    - By far the most common class of features
    - Stop words: handling or removing the most commonly occurring words, such as *the*, *is*, etc.
    - Normalization decision: convert all characters in the text to lowercase versus leaving them as they are
    - Stemming / Lemmatization
2. Characteristics of words: Capitalization
    - For example, *the White House* does not mean the same as *the white house*
3. Parts-Of-Speech (POS) in a sentence
    - Identify the nouns, adjectives, pronouns, etc.
4. Grouping words of similar meaning / semantics:
    - *Example 1*: The words *buy* and *purchase* refer to the same action
    - *Example 2*: Titles such as Mr., Ms., Dr., Prof., PhD, etc.
    - *Example 3*: Numbers/Digits
    - *Example 4*: Dates
5. Depending on the classification task, features may come from inside words or from word sequences
    - Bigrams, trigrams, n-grams: 'White House'
    - Character subsequences in words: words ending in '-ing', words ending in '-ion'
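
The feature types listed above can be sketched in a few lines of plain Python. This is only an illustration, not a production tokenizer: the function name `extract_features`, the tiny `STOP_WORDS` set, and the sample sentence are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only)
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "in"}

def extract_features(text, lowercase=True, remove_stop_words=True):
    """Return a dict of simple textual features for one document."""
    tokens = re.findall(r"[A-Za-z]+", text)
    features = {}
    # Capitalization feature, computed before any lowercasing
    features["n_capitalized"] = sum(1 for t in tokens if t[0].isupper())
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    features["unigrams"] = tokens
    # Bigrams are built over the tokens that survive stop-word removal
    features["bigrams"] = list(zip(tokens, tokens[1:]))
    # Character-level feature: words ending in "-ing"
    features["ends_with_ing"] = [t for t in tokens if t.lower().endswith("ing")]
    return features

feats = extract_features("The President is speaking in the White House")
print(feats)
```

Calling `extract_features(text, lowercase=False)` keeps the original casing, which is one way to experiment with the normalization decision mentioned in the list.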

---

# Class 2

## Naïve Bayes Classifier

### Bayes Rule

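- Posterior probability:

$$\text{Posterior Probability} = \frac{\text{Prior Probability} \cdot \text{Likelihood}}{\text{Evidence}}$$

$$Pr(y\text{ | }X) = \frac{Pr(y) \cdot Pr(X\text{ | }y)}{Pr(X)}$$

- The evidence $Pr(X)$ is the same for every label, so finding the most probable label $y^*$ only requires the numerator, which factorizes over the features $x_{1}, \dots, x_{n}$ under the Naïve Bayes assumption:

$$y^* = \underset{y}{\operatorname{argmax}} Pr(y\text{ | }X) = \underset{y}{\operatorname{argmax}} Pr(y) \cdot \prod_{i = 1}^{n} Pr(x_{i}\text{ | }y)$$

*Example 1*:
**Query** - *Python Download*
$$y^* = \underset{y}{\operatorname{argmax}} Pr(y) \cdot Pr(\text{"Python"}\text{ | }y) \cdot Pr(\text{"Download"}\text{ | }y)$$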



**Naïve Bayes Assumption:** Given the class label, the features are assumed to be independent from each other 
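
As a rough sketch of how the argmax for the *Python Download* query could be evaluated, the snippet below scores two candidate labels. The labels (`programming`, `zoology`) and every probability value are made up for illustration; they do not come from any real dataset. Log-probabilities are summed instead of multiplying raw probabilities, a common way to avoid numerical underflow.

```python
import math

# Hypothetical parameters, invented purely for illustration
priors = {"programming": 0.4, "zoology": 0.6}
likelihoods = {
    "programming": {"python": 0.20, "download": 0.10},
    "zoology": {"python": 0.05, "download": 0.01},
}

def predict(words):
    """Return the label maximizing Pr(y) * prod_i Pr(x_i | y), plus all log-scores."""
    scores = {}
    for label, prior in priors.items():
        # Summing logs is equivalent to taking the log of the product
        score = math.log(prior)
        for word in words:
            score += math.log(likelihoods[label][word])
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores

best_label, scores = predict(["python", "download"])
print(best_label, scores)
```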

### Naïve Bayes: What are the Parameters?

What parameters are required to apply the Naïve Bayes classifier?

- **Prior Probabilities**: $Pr(y)$ for all $y$ in $Y$

- **Likelihood**: $Pr(x_{i}\text{ | }y)$ for all features $x_{i}$ and labels $y$ in $Y$

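- If there are 3 classes $(|Y| = 3)$ and 100 features in $X$, how many parameters does the Naïve Bayes model have?

*Answer*: 603 parameters. Assuming binary features, there are 3 prior probabilities plus $3 \times 100 \times 2 = 600$ likelihoods (one per class, feature, and feature value).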
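
One way to arrive at 603 is to assume binary features and count a likelihood for both values of each feature in each class, plus one prior per class. The helper below is a hypothetical sketch of that count; `count_nb_parameters` and its argument names are not part of any library.

```python
def count_nb_parameters(n_classes, n_features, n_feature_values=2):
    """Count priors plus likelihoods, assuming each feature takes n_feature_values values."""
    priors = n_classes
    likelihoods = n_classes * n_features * n_feature_values
    return priors + likelihoods

total = count_nb_parameters(n_classes=3, n_features=100)
print(total)
```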


### Naïve Bayes: Learning Parameters
- **Prior Probabilities**: $Pr(y)$ for all $y$ in $Y$
    - Count the number of instances in each class
    - If there are $N$ instances in total and $n$ of them are labeled as class $y$, then
    
    $$Pr(y) = \frac{n}{N}$$
    
    
- **Likelihood**: $Pr(x_{i}|y)$ for all features $x_{i}$ and labels $y$ in $Y$
    - Count how many times the feature $x_{i}$ appears in instances labeled as class $y$
    - If there are $p$ instances of class $y$, and $x_{i}$ appears in $k$ of those instances, then, 
    
    $$Pr(x_{i}\text{ | }y) = \frac{k}{p}$$
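
The two counting rules above can be sketched on a toy dataset. Everything here is an assumption for illustration: the five training instances, the labels, and the helper name `fit_naive_bayes`. Each instance is represented as the set of features present in it, so a feature is counted at most once per instance.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: (set of features present, label)
train = [
    ({"python", "download"}, "software"),
    ({"python", "install"}, "software"),
    ({"python", "snake"}, "animal"),
    ({"snake", "venom"}, "animal"),
    ({"download", "install"}, "software"),
]

def fit_naive_bayes(instances):
    """Estimate Pr(y) = n / N and Pr(x_i | y) = k / p by counting."""
    class_counts = Counter(label for _, label in instances)
    total = len(instances)
    priors = {y: n / total for y, n in class_counts.items()}

    # feature_counts[y][x] = number of class-y instances containing feature x
    feature_counts = defaultdict(Counter)
    for features, label in instances:
        feature_counts[label].update(features)

    likelihoods = {
        y: {x: k / class_counts[y] for x, k in counts.items()}
        for y, counts in feature_counts.items()
    }
    return priors, likelihoods

priors, likelihoods = fit_naive_bayes(train)
print(priors)
print(likelihoods)
```

Features that never occur with a class (for example `download` with `animal`) simply have no entry here, which corresponds to an unsmoothed estimate of zero.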
    

### Naïve Bayes: Smoothing
- What happens if $Pr(x_{i}\text{ | }y) = 0$?
    - Feature $x_{i}$ never occurs in documents labeled $y$
    - But then the posterior probability $Pr(y\text{ | }X)$ will be zero as well, because the likelihood is one factor of the product!

To prevent this from happening, we apply smoothing to the parameters:
- **Laplace smoothing**, also called **additive smoothing**, adds a dummy count (+1) to each feature

$$Pr(x_{i}\text{ | }y) = \frac{k+1}{p+n}, \quad n=\text{number of features}$$
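
A minimal sketch of the smoothed estimate, using the same symbols as the formula ($k$, $p$, and the number of features $n$). The function name `smoothed_likelihood` and the example counts are hypothetical.

```python
def smoothed_likelihood(k, p, n_features):
    """Laplace-smoothed estimate (k + 1) / (p + n) from the notes."""
    return (k + 1) / (p + n_features)

# A feature never seen with the class (k = 0) no longer gets probability zero
unseen = smoothed_likelihood(k=0, p=2, n_features=5)
seen = smoothed_likelihood(k=2, p=2, n_features=5)
print(unseen, seen)
```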


### Naïve Bayes Variations
- Multinomial Naïve Bayes
    - The data follows a **Multinomial distribution**
    - Each feature value represents a count (word occurrence count, TF-IDF weighting, ...)


- Bernoulli Naïve Bayes
    - The data follows a **Multivariate Bernoulli distribution**
    - Each feature is binary (word is present or absent in the text)
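
The difference between the two variants shows up in how a document is turned into a feature vector. The sketch below builds both representations for the same text; the three-word `vocabulary` and the sample document are assumptions chosen for illustration.

```python
from collections import Counter

vocabulary = ["python", "download", "snake"]  # hypothetical vocabulary
document = "python download python python".split()

counts = Counter(document)

# Multinomial view: how many times each vocabulary word occurs
multinomial_vector = [counts[word] for word in vocabulary]

# Bernoulli view: whether each vocabulary word occurs at all
bernoulli_vector = [int(word in counts) for word in vocabulary]

print(multinomial_vector)
print(bernoulli_vector)
```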



### Take Home Concepts
- Naïve Bayes is a probabilistic model
- Naïve Bayes assumes that the features are independent of each other, given the class label
- For text classification problems, Naïve Bayes models typically provide a very strong baseline (try them first)
- It is a very simple model with easy-to-learn parameters

---

# Class 3

## Support Vector Machines 

### SVM Parameters: Parameter C


- Regularization: How much importance should you give to individual data points compared with a model that generalizes better?


- **Regularization C Parameter**
    - Larger values of **C = Less Regularization**: Fit the training data as well as possible, every data point is important
    - Smaller values of **C = More Regularization**: More tolerant of errors on individual points
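
To see the effect of `C` in practice, one option is to fit scikit-learn's `SVC` with a linear kernel at a small and a large value and compare the results. The six data points and the two `C` values below are hypothetical choices for illustration; on real data the differences depend heavily on the dataset.

```python
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

results = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    results[C] = {
        "train_accuracy": clf.score(X, y),
        # Total number of support vectors across both classes
        "n_support_vectors": int(clf.n_support_.sum()),
    }

for C, summary in results.items():
    print(C, summary)
```

In practice `C` is usually chosen by cross-validation rather than by inspecting training accuracy.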
    

### SVM Parameters: Other Parameters

- A linear kernel usually works best for text data
    - Other possible kernels: RBF, polynomial

- ``multi_class`` scenario, One-vs-Rest (``ovr``): trains one binary classifier per class, each comparing that class against all the other classes or labels in the dataset. For example, suppose we have the following 3 classes:
    1. Group 1: Smokes and white teeth
    2. Group 2: Smokes and stained teeth
    3. Group 3: Doesn't smoke

  One of the comparisons would be group 3 against everything else, which would **assign a 1 if the data point belongs to group 3 and 0 otherwise**
  
  
  
- ``class_weight``: Different classes or labels can get different weights
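
The One-vs-Rest relabeling described above can be sketched without any library: each class gets its own binary target vector. The label sequence and the helper name `one_vs_rest_targets` are hypothetical.

```python
# Hypothetical labels for five data points
labels = ["group 1", "group 3", "group 2", "group 3", "group 1"]

def one_vs_rest_targets(labels):
    """Build one binary target vector per class: 1 for that class, 0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

targets = one_vs_rest_targets(labels)
for cls, binary in targets.items():
    print(cls, binary)
```

A One-vs-Rest classifier would then train one binary model per entry in `targets`.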


### Take Home Concepts

- **Support Vector Machines tend to be among the most accurate classifiers, especially on high-dimensional data**


- Has strong theoretical foundations (linear algebra, optimization among others)


- Can handle **only numeric features**
    - Tip 1: Transform all the categorical features to numeric features
    - Tip 2: Normalization of the data (StandardScaler, MinMaxScaler, Normalizer, ...)


- A drawback of Support Vector Machines is that the hyperplane used to separate and classify the data is hard to interpret, so these methods are best used when having an accurate classifier matters more than knowing HOW the classification is done
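
The two tips on numeric features can be sketched with scikit-learn's preprocessing module: `OneHotEncoder` for a categorical column and `MinMaxScaler` for a numeric one. The column names and values below are hypothetical, and this is just one possible way to combine the two steps.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical data: one categorical column and one numeric column
colors = np.array([["red"], ["green"], ["red"], ["blue"]])
lengths = np.array([[10.0], [20.0], [15.0], [30.0]])

# Tip 1: turn the categorical column into numeric one-hot columns
encoder = OneHotEncoder()
colors_numeric = encoder.fit_transform(colors).toarray()

# Tip 2: rescale the numeric column to the [0, 1] range
lengths_scaled = MinMaxScaler().fit_transform(lengths)

# Final all-numeric feature matrix that an SVM can consume
X = np.hstack([colors_numeric, lengths_scaled])
print(encoder.categories_)
print(X)
```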

---

# Class 4

## Learning Text Classifiers in Python

### Supervised Text Classification in the Natural Language Toolkit (NLTK)

- NLTK has some classification algorithms integrated within the library:

    - ``NaiveBayesClassifier``
    - ``DecisionTreeClassifier``
    - ``ConditionalExponentialClassifier``
    - ``MaxentClassifier``
    - ``WekaClassifier``
    - ``SklearnClassifier``
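
NLTK classifiers are trained on a list of `(feature_dict, label)` pairs. The sketch below shows one way such a training set might be built and passed to `NaiveBayesClassifier`. The feature extractor `word_features`, the four training texts, and the labels are all invented for illustration.

```python
from nltk.classify import NaiveBayesClassifier

def word_features(text):
    """Hypothetical feature extractor: mark each lowercase word as present."""
    return {word: True for word in text.lower().split()}

# Hypothetical toy data: (feature_dict, label) pairs
train_set = [
    (word_features("python download"), "software"),
    (word_features("python install"), "software"),
    (word_features("python snake"), "animal"),
    (word_features("snake venom"), "animal"),
]

classifier = NaiveBayesClassifier.train(train_set)
predicted = classifier.classify(word_features("download"))
print(predicted)
```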



### NaiveBayesClassifier Applied use in Python

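```python
import nltk
from nltk.classify import NaiveBayesClassifier

# train_set and test_set are lists of (feature_dict, label) pairs
classifier = NaiveBayesClassifier.train(train_set)

classifier.classify(unlabeled_instance)        # label for a single instance
classifier.classify_many(unlabeled_instances)  # labels for a list of instances

nltk.classify.util.accuracy(classifier, test_set)  # fraction of test_set labeled correctly
classifier.labels()                                # labels seen during training

classifier.show_most_informative_features()        # features that best separate the classes
```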



### SklearnClassifier Applied use in Python

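```python
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Wrap scikit-learn estimators so they train on NLTK-style (feature_dict, label) pairs
clfNB = SklearnClassifier(MultinomialNB()).train(train_set)
# kernel is an argument of SVC, not of SklearnClassifier
clfSVC = SklearnClassifier(SVC(kernel='linear')).train(train_set)
```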


### Take Home Concepts

- Sklearn (scikit-learn) is one of the most commonly used machine learning libraries in Python

- NLTK has its own Naïve Bayes classifier implementation

- NLTK can also interface with Sklearn and other ML toolkits such as Weka

---