## Introduction

Logistic regression is a well-known statistical method for classification, including text classification.
This document walks through the steps needed to implement a text classifier with it.

## Training Flow


```{mermaid}
flowchart LR
    A[Text] --> B["Feature_Extraction(T)"]
    B --> C["Prediction_Function(X)"]
    C --> D["Output Y^"]
    D --> E["Cost_Function(Y, Y^)"]
    E --> C
```


## Vocabulary & Feature Extraction

### One-Hot-Encoding

One-hot encoding represents a phrase as a vector of 0s and 1s (no other values) in which each position corresponds to one word of the vocabulary. If a word is present in the phrase (such as a tweet), its position is set to 1; otherwise it is set to 0.

* *Problem: long training and prediction time.*
The vector has one entry per vocabulary word, so it becomes very large and sparse when the vocabulary is built from many different texts.
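The encoding described above can be sketched in a few lines of plain Python. The helper names `build_vocabulary` and `one_hot_encode` and the three sample phrases are hypothetical, chosen only for illustration; splitting on whitespace stands in for a real tokenizer.

```python
def build_vocabulary(phrases):
    """Return a sorted list of the unique lowercase words across all phrases."""
    return sorted({word for phrase in phrases for word in phrase.lower().split()})


def one_hot_encode(phrase, vocabulary):
    """Return a 0/1 vector with one position per vocabulary word."""
    words = set(phrase.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical sample phrases.
phrases = ["I am happy", "I am sad", "I am never sad"]
vocabulary = build_vocabulary(phrases)
vector = one_hot_encode("I am happy", vocabulary)

print(vocabulary)  # ['am', 'happy', 'i', 'never', 'sad']
print(vector)      # [1, 1, 1, 0, 0]
```

The vector length always equals the vocabulary size, which is why this representation grows quickly as more texts are added.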

### Negative and Positive Frequencies

The idea is to count how often each word occurs in each prediction category (such as positive/negative in sentiment analysis) and to build the feature vectors from those counts. A global frequency map is computed over the whole corpus:

Map(word) → number of occurrences of that word in each class

|          | I | am | happy | sad | never |
|----------|---|----|-------|-----|-------|
| Positive | 2 | 2  | 1     | 0   | 0     |
| Negative | 3 | 3  | 0     | 2   | 1     |
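A frequency map like the one in the table can be built with a `Counter`, as in the sketch below. The five-phrase corpus is hypothetical and was chosen so that the counts match the table. The `extract_features` helper is an assumption, not something stated in the text: it uses one common layout, a bias term followed by the summed positive and summed negative counts of the distinct words in a phrase.

```python
from collections import Counter


def build_frequencies(phrases, labels):
    """Count how often each word occurs in the phrases of each class.

    Returns a Counter keyed by (word, label) pairs.
    """
    frequencies = Counter()
    for phrase, label in zip(phrases, labels):
        for word in phrase.lower().split():
            frequencies[(word, label)] += 1
    return frequencies


def extract_features(phrase, frequencies):
    """Assumed layout: [bias, summed positive counts, summed negative counts]."""
    words = set(phrase.lower().split())  # each distinct word is counted once
    positive = sum(frequencies[(word, 1)] for word in words)
    negative = sum(frequencies[(word, 0)] for word in words)
    return [1, positive, negative]


# Hypothetical corpus (1 = positive, 0 = negative) matching the table above.
phrases = ["I am happy", "I am", "I am sad", "I am never sad", "I am"]
labels = [1, 1, 0, 0, 0]
frequencies = build_frequencies(phrases, labels)

print(frequencies[("happy", 1)], frequencies[("sad", 0)])  # 1 2
print(extract_features("I am happy", frequencies))         # [1, 5, 6]
```

With this layout every phrase maps to a vector of fixed length, regardless of the vocabulary size.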

## Preprocessing

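* **tokenization** - break the text into a list of words (tokens)


```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()  # tokenizer that handles tweet text (hashtags, emoticons, handles)
```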

* **stop words** - remove tokens that carry little meaning (articles, prepositions, punctuation, unimportant symbols, etc.)


```python
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word lists
stopwords_english = stopwords.words('english')
punctuation = string.punctuation
```

* **stemming** - map each word to its root form (remove suffixes such as "ing", "ed", etc.)


```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_text = []
for word in text:  # `text` is the list of tokens left after the previous steps
    stem_word = stemmer.stem(word)
    stemmed_text.append(stem_word)
```

* **lowercase** - convert all words to lowercase
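The four preprocessing steps can be chained into one function, as in the sketch below. The function name `preprocess`, the sample sentence, and the small `STOP_WORDS` set are hypothetical; the set is a hand-picked stand-in for `stopwords.words('english')` so that the example runs without downloading the NLTK stop word corpus. Lowercasing is delegated to `TweetTokenizer(preserve_case=False)`.

```python
import string

from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# Hand-picked stand-in for stopwords.words('english') (assumption, see above).
STOP_WORDS = {"i", "am", "and", "the", "a", "is"}


def preprocess(text):
    """Tokenize, lowercase, drop stop words and punctuation, then stem."""
    tokenizer = TweetTokenizer(preserve_case=False)  # lowercases while tokenizing
    stemmer = PorterStemmer()
    cleaned = []
    for token in tokenizer.tokenize(text):
        if token in STOP_WORDS:
            continue  # stop word
        if all(char in string.punctuation for char in token):
            continue  # token made only of punctuation characters
        cleaned.append(stemmer.stem(token))
    return cleaned


tokens = preprocess("I am learning NLP and I am happy!!!")
print(tokens)
```

The Porter stemmer produces stems rather than dictionary words, so "learning" becomes "learn" and "happy" becomes "happi".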

## Logistic Regression

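Logistic regression applies the sigmoid function to the output of a linear model:

$h(z) = \frac{1}{1 + e^{-z}}$ with $z = \theta^T x$

or, written out for a feature vector with four components:

$\mathrm{sigmoid}(x_0 \theta_0 + x_1 \theta_1 + x_2 \theta_2 + x_3 \theta_3)$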
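The prediction function can be sketched with the standard library alone. The function names and the sample values for `x` and `theta` are hypothetical; the two-branch form of `sigmoid` is an implementation choice that avoids overflow in `math.exp` for large negative `z`.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid: 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # safe: z is negative, so exp(z) < 1
    return exp_z / (1.0 + exp_z)


def predict(x, theta):
    """Return h(z) with z = theta^T x for one feature vector x."""
    z = sum(x_i * theta_i for x_i, theta_i in zip(x, theta))
    return sigmoid(z)


# Hypothetical values: z = 1*0 + 5*1 + 6*(-1) = -1
x = [1, 5, 6]
theta = [0.0, 1.0, -1.0]
probability = predict(x, theta)
print(round(probability, 4))  # 0.2689
```

An output above 0.5 corresponds to $z > 0$, and an output below 0.5 to $z < 0$.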

## Training Workflow

:::::::::::::: {.columns}
::: {.column width="45%"}


```{mermaid}
flowchart TD
    A["θ"] --> B["h = h(X, θ)"]
    B --> C["∇ = 1/m X^T (h - y)"]
    C --> D["θ = θ - α∇"]
    D --> E["J(θ)"]
    E --> B

```


:::
::: {.column width="5%"}

\

:::
::: {.column width="45%"}


\


```{mermaid}
flowchart TD
    A["Initialize parameters"] --> B["Classify/predict"]
    B --> C["Get gradient"]
    C --> D["Update"]
    D --> E["Get Loss"]
    E --> B

```


:::
::::::::::::::
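The loop in the two diagrams can be sketched with NumPy as follows. The function name, the toy data, the learning rate, and the iteration count are hypothetical. The text does not define $J(\theta)$, so the sketch assumes the usual binary cross-entropy loss.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient_descent(X, y, theta, alpha, num_iters):
    """Run batch gradient descent and return the final theta and loss."""
    m = X.shape[0]
    eps = 1e-12  # keeps log() away from zero
    cost = None
    for _ in range(num_iters):
        h = sigmoid(X @ theta)               # classify/predict
        gradient = (X.T @ (h - y)) / m       # get gradient: 1/m X^T (h - y)
        theta = theta - alpha * gradient     # update
        h = sigmoid(X @ theta)
        # get loss (assumed binary cross-entropy)
        cost = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    return theta, cost


# Hypothetical toy data: columns are [bias, positive count, negative count].
X = np.array([[1.0, 3.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [1.0, 1.0, 2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

theta, cost = gradient_descent(X, y, np.zeros(3), alpha=0.1, num_iters=500)
pred = (sigmoid(X @ theta) >= 0.5).astype(float)
print(theta, cost, pred)
```

With $\theta$ initialized to zeros every prediction starts at 0.5, so the initial loss is $\ln 2 \approx 0.693$; the loss should fall below that as training proceeds.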


### Testing (with accuracy)

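Testing uses held-out validation data $X_{val}$, $Y_{val}$ together with the trained $\theta$; the resulting accuracy can also be used to tune hyper-parameters.


* Given $X_{val}$, $Y_{val}$ and $\theta$:
    * compute $h(X_{val} \cdot \theta)$
    * predict $pred = h(X_{val} \cdot \theta) \geq 0.5$

The predicted probabilities form a vector with one entry per validation example: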

$$
\begin{bmatrix}
0.3 \\
0.8 \\
0.5 \\
\vdots \\
h_m
\end{bmatrix}
$$
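The thresholding step and the accuracy computation can be sketched with NumPy. Accuracy here is the fraction of validation examples whose predicted label equals the true label. The function names, `theta`, and the validation data are hypothetical values chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_labels(X_val, theta, threshold=0.5):
    """Return 1 where h(X_val . theta) >= threshold, else 0."""
    h = sigmoid(X_val @ theta)
    return (h >= threshold).astype(int)


def accuracy(pred, y_val):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(pred == y_val))


# Hypothetical trained parameters and validation set.
theta = np.array([0.0, 1.0, -1.0])
X_val = np.array([[1.0, 3.0, 0.0],
                  [1.0, 0.0, 3.0],
                  [1.0, 1.0, 1.0],
                  [1.0, 1.0, 2.0]])
y_val = np.array([1, 0, 0, 0])

pred = predict_labels(X_val, theta)
print(pred)                   # [1 0 1 0]
print(accuracy(pred, y_val))  # 0.75
```

The third example has $z = 0$, so $h = 0.5$ and the $\geq$ comparison assigns it to the positive class.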
