# Feature Extraction

http://fastml.com/classifying-text-with-bag-of-words-a-tutorial/

### Go Linear for Bag of Words.

Bag-of-words data is sparse and high dimensional. Methods like random forests are prone to overfitting in this situation. Linear models train fast, are simple, and are much less likely to overfit very sparse data.
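A minimal sketch of this idea, assuming a made-up four-document corpus and `LogisticRegression` as the linear model (both are illustrative choices, not taken from the tutorial):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the goalie saved the puck",
    "the team won the hockey game",
    "the rocket reached orbit",
    "nasa launched a new satellite",
]
labels = [0, 0, 1, 1]  # 0 = sports, 1 = space

# Bag of words: one column per vocabulary word, stored as a sparse matrix.
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(docs)

# A linear classifier trains directly on the sparse matrix.
clf = LogisticRegression()
clf.fit(X_bow, labels)

print(X_bow.shape)  # (n_documents, vocabulary_size)
print(clf.predict(vectorizer.transform(["the hockey team"])))
```

The model learns one weight per vocabulary word, which is why it stays cheap even when the vocabulary is large.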

### TFIDF Vectorizer reduces noise
TFIDF is a weighting scheme that rescales your bag of words to de-emphasize words that appear in many documents. The purpose is to have your model focus on the words that are likely to be distinguishing characteristics of certain types of documents.
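As a rough illustration, assuming a made-up three-document corpus, the learned IDF weights show how a word that appears everywhere ("the") is scaled down relative to a rare one ("rocket"):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "the" appears in every document, "rocket" in one.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the rocket reached orbit",
]

tfidf = TfidfVectorizer().fit(docs)

# idf_ holds one inverse-document-frequency weight per vocabulary word.
idf_the = tfidf.idf_[tfidf.vocabulary_["the"]]
idf_rocket = tfidf.idf_[tfidf.vocabulary_["rocket"]]
print(idf_the, idf_rocket)

# In the third document both words occur once, so the rarer word gets the larger weight.
row = tfidf.transform([docs[2]])
weight_the = row[0, tfidf.vocabulary_["the"]]
weight_rocket = row[0, tfidf.vocabulary_["rocket"]]
print(weight_the, weight_rocket)
```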

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer



In [None]:
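# newsgroups_train is the Bunch loaded in the previous cell.
# .filenames holds file paths (use input='filename' in the vectorizer); .data holds the raw text.
X = newsgroups_train.filenames
y = newsgroups_train.target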

#### Stop Words
A predefined set of common words in a language that add little to the meaning of a document. Cutting them from your bag of words reduces the dimensionality and noise of your data, which can result in a more accurate model. Whether removing stop words actually helps should be checked with cross-validation.
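A small sketch of the dimensionality effect, assuming two made-up sentences and scikit-learn's built-in English stop word list (`stop_words="english"`):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only).
docs = ["the cat sat on the mat", "a dog is in the garden"]

with_stops = CountVectorizer().fit(docs)
without_stops = CountVectorizer(stop_words="english").fit(docs)

# The vocabulary shrinks once common words are dropped.
print(len(with_stops.vocabulary_), len(without_stops.vocabulary_))
print(sorted(without_stops.vocabulary_))
```

On a real dataset you would compare both settings with cross-validated scores rather than vocabulary size alone.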

#### n-grams
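An n-gram is a sequence of n consecutive words. Adding n-grams to your bag keeps some word order, which is good for inferring meaning. Do not remove stop words if you are using n-grams, because common words often carry meaning inside a phrase (for example "not good").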
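For instance, assuming a single made-up sentence, `ngram_range=(1, 2)` in `CountVectorizer` keeps the single words and adds every pair of adjacent words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentence (hypothetical): the meaning depends on the pair "not good".
docs = ["this movie is not good"]

# ngram_range=(1, 2) produces unigrams and bigrams.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(docs)

features = sorted(bigram_vectorizer.vocabulary_)
print(features)
```

The bigram "not good" becomes its own feature, which a unigram-only bag cannot represent.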

### Dimensionality and Memory
Linear models are preferred with bag of words because d >> n, where d is the number of dimensions (features) and n is the number of samples. A regular TFIDF Vectorizer needs a lot of memory because it stores the whole vocabulary. Instead, you could try HashingVectorizer with online learning to build a model that may be somewhat less accurate but trains quicker and is less memory intensive.
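One way this could look, assuming `SGDClassifier` as the online learner and two made-up mini-batches standing in for a stream of documents that does not fit in memory:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical mini-batches of (texts, labels); 0 = sports, 1 = space.
batches = [
    (["the team won the hockey game", "nasa launched a satellite"], [0, 1]),
    (["the goalie saved the puck", "the rocket reached orbit"], [0, 1]),
]

# The hasher is stateless, so it never needs to see the whole corpus.
hasher = HashingVectorizer(n_features=2**18)
model = SGDClassifier(random_state=0)

for texts, targets in batches:
    X_batch = hasher.transform(texts)
    # partial_fit updates the model one batch at a time;
    # the full list of classes must be known from the first call.
    model.partial_fit(X_batch, targets, classes=[0, 1])

print(model.predict(hasher.transform(["hockey game"])))
```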

### RNNs vs Linear Models.
For smaller datasets in sentiment analysis tasks, linear models with n-grams tend to outperform RNNs. When the number of samples grows to around 100k or more, RNNs start to outperform linear models.

https://arxiv.org/abs/1412.5335

# sklearn Text Feature Extraction

http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

### sklearn.feature_extraction.text.HashingVectorizer
Converts a collection of text documents to a matrix of token occurrences. A hash function maps each token (a string) directly to a column index, so no vocabulary has to be stored.


Pros:

- Low memory use and scalable to large datasets: no need to store a vocabulary dictionary in memory.
- Can be used with partial_fit learning methods (online learning).

Cons: 

- No inverse transform (matrix indices back to string names), so it is less useful for figuring out which words are characteristic of different classes.
- There can be hash collisions (not a problem if the number of features is set high enough, e.g. 2^18).
- No IDF weighting, because that would require storing state. All words are weighted uniformly unless you add a TfidfTransformer step afterwards.
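A short sketch of these properties, assuming two made-up documents. Setting `norm=None` and `alternate_sign=False` is an illustrative choice that makes the matrix hold plain counts:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Toy corpus (hypothetical, for illustration only).
docs = ["the cat sat on the mat", "the dog chased the cat"]

# norm=None and alternate_sign=False keep raw, non-negative token counts.
hasher = HashingVectorizer(n_features=2**18, norm=None, alternate_sign=False)
X_hashed = hasher.transform(docs)  # no fit needed: the transformer is stateless

print(X_hashed.shape)                  # (2, 262144)
print(hasattr(hasher, "vocabulary_"))  # False: nothing is stored

# IDF weighting can still be applied as a separate, stateful step.
X_tfidf = TfidfTransformer().fit_transform(X_hashed)
print(X_tfidf.shape)
```

Because columns are chosen by a hash, there is no way to look up which word produced a given column.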

# Latent Dirichlet Allocation (LDA)

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

http://brooksandrew.github.io/simpleblog/articles/latent-dirichlet-allocation-under-the-hood/


A generative statistical model (an unsupervised algorithm) that infers unobserved groups which explain why some samples in the data are similar to others. In NLP it is used for topic modelling.

Each document is treated as a mixture of a small number of topics, and each word used in the document is attributed to one of the document's topics.

### Topics
Each document is viewed as a mixture of different topics. LDA assigns topics to each document. It assumes that documents cover a small set of topics and that each topic uses a small number of terms frequently.

A topic has probabilities of generating various words, whereas words without special relevance to any topic will have a roughly even probability across topics.

Topics are identified by detecting how likely terms are to co-occur (i.e. some words are used together more frequently).

### Inference

The model learns the word probabilities of each topic, the topic of each word, and the particular topic mixture of each document through Bayesian inference, i.e. given that these words appeared together, what is the probability of the topic being X?
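As a simplified view of that question, assume the topic-word probabilities $\phi_{k,w}$ and the document-topic mixture $\theta_{d,k}$ are already known (these symbols are the usual LDA notation, not taken from the notes above). Bayes' rule then gives the probability that word $w$ in document $d$ came from topic $k$:

$$
p(z = k \mid w, d) \;=\; \frac{\phi_{k,w}\,\theta_{d,k}}{\sum_{j} \phi_{j,w}\,\theta_{d,j}}
$$

In practice $\phi$ and $\theta$ are unknown and have Dirichlet priors, so the model estimates them together with the topic assignments $z$ rather than computing this in one step.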

### Takeaway

Each document is analyzed from a bag-of-words perspective, as a probability distribution over the words it contains. Topics are created by detecting clusters of words that frequently co-occur in documents. Each topic has its own distribution over words. Match the document distribution to the various topic distributions we found and pick the best fits. The best fitting topics are assigned as the topics of the document.
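A minimal sketch of this workflow with scikit-learn's `LatentDirichletAllocation`, assuming a made-up four-document corpus and two topics (both are illustrative choices; such a tiny corpus will not give stable topics):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): two sports documents and two space documents.
docs = [
    "hockey team won the game",
    "the goalie saved the puck in the hockey game",
    "nasa launched a rocket into orbit",
    "the satellite reached orbit on a rocket",
]

# LDA works on raw word counts, i.e. a bag of words.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic mixture per document

print(doc_topics.round(2))         # each row sums to 1
print(doc_topics.argmax(axis=1))   # best fitting topic per document
print(lda.components_.shape)       # (n_topics, vocabulary_size) topic-word weights
```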

In [2]:
from sklearn.datasets import fetch_20newsgroups
X = fetch_20newsgroups()
X_train = X.data
y_train = X.target
print(len(X_train), len(y_train))


11314 11314


# Kaggle Kernel Notes


## Logistic Regression w/ n-grams

https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.feature_extraction