## TF-IDF Vectorizer
[TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.
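
The equivalence above can be checked on a small example. This is a minimal sketch with default settings; the three-document corpus is made up for illustration and is not from the original notes.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# One step: raw text -> TF-IDF matrix.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> term counts -> TF-IDF matrix.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# With default settings both routes should give the same matrix.
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

The two-step route is useful when you also need the raw counts; otherwise `TfidfVectorizer` alone is shorter.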


In [3]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tf_idf(doc_path):
    """Return the TF-IDF scores of the first line of a file, highest first."""
    vectorizer = TfidfVectorizer(use_idf=True)

    # treat every line of the file as one document
    with open(doc_path, "r") as f:
        corpus = f.readlines()

    # sparse matrix of shape (n_documents, n_terms)
    tfidf = vectorizer.fit_transform(corpus)

    # get the TF-IDF vector of the first doc
    first_vector_tfidf = tfidf[0]

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    df = pd.DataFrame(
        first_vector_tfidf.T.toarray(),
        index=vectorizer.get_feature_names_out(),
        columns=["tfidf"],
    )
    return df.sort_values(by=["tfidf"], ascending=False)

    
# get_tf_idf("Data/oldmanandthesea.txt")

## Count Vectorizer

CountVectorizer transforms a corpus of text into a matrix of term / token counts (one row per document). It can also preprocess your text data before generating the vector representation, which makes it a highly flexible feature representation module for text.

[How to use CountVectorizer](https://kavita-ganesan.com/how-to-use-countvectorizer/)


By default, CountVectorizer does the following:

- lowercases your text (set `lowercase=False` if you don’t want lowercasing)
- uses utf-8 encoding
- performs tokenization (converts raw text to smaller units of text)
- uses word level tokenization (meaning each word is treated as a separate token)
- ignores single characters during tokenization (say goodbye to words like ‘a’ and ‘I’)
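
A quick way to see these defaults in action is to fit a vectorizer on one sentence and look at the vocabulary. This is a small sketch; the sentence is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentence, invented for illustration only.
docs = ["I saw a Cat and A HAT"]

cv = CountVectorizer()  # all defaults
cv.fit(docs)

# Lowercased, word-level tokens; the single-character words "I" and "a" are dropped.
vocab = sorted(cv.vocabulary_)
print(vocab)  # ['and', 'cat', 'hat', 'saw']
```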

### Stop Words
You can pass a custom stop word list as a parameter:
`stop_words=['all', 'and', 'the']`

You can check the stop words in use with `cv.stop_words` (the list you passed in) and `cv.stop_words_` (the terms dropped because of `min_df` / `max_df`). See the example below.

### min_df and max_df (Document Frequency)
*Document frequency: how many documents contain a term*

The goal of `min_df` is to ignore words with too few occurrences to be meaningful.

It can be an absolute count (e.g. 1, 2, 3) or a proportion of documents (e.g. 0.25).

`max_df` removes words that are too common. Typical values are 0.75 to 0.85.
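
To make the absolute vs. proportional thresholds concrete, here is a minimal sketch on a made-up four-document corpus. The helper `kept_terms` is hypothetical and only exists for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
# Document frequencies: apple=4, banana=2, cherry=1, date=1, egg=1
docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple egg",
]

def kept_terms(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    cv = CountVectorizer(**kwargs)
    cv.fit(docs)
    return sorted(cv.vocabulary_)

# Absolute threshold: keep terms that appear in at least 2 documents.
rare_removed = kept_terms(min_df=2)

# Proportional threshold: drop terms that appear in more than 75% of documents.
common_removed = kept_terms(max_df=0.75)

# Both thresholds together.
both = kept_terms(min_df=2, max_df=0.75)

print(rare_removed)    # ['apple', 'banana']
print(common_removed)  # ['banana', 'cherry', 'date', 'egg']
print(both)            # ['banana']
```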

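### *Why use document frequency to eliminate words?*
Term frequency can be misleading.
Ex. say 1 document in 250 contains the word 'firetruck' 500 times. Its raw count is high, but it appears in only one document, so it tells you little about the rest of the corpus.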
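
The firetruck case can be reproduced with a synthetic corpus. This is only a sketch: the filler sentence is invented, and the two quantities are computed directly from the count matrix rather than through any dedicated scikit-learn helper.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Synthetic corpus: 250 documents, only one of which mentions "firetruck".
docs = ["the weather is nice today"] * 249
docs.append("firetruck " * 500)

cv = CountVectorizer()
counts = cv.fit_transform(docs)  # sparse matrix, shape (250, n_terms)

col = cv.vocabulary_["firetruck"]

# Term frequency: total occurrences of the word across the corpus.
term_frequency = int(counts[:, col].sum())

# Document frequency: number of documents that contain the word at least once.
document_frequency = int((counts[:, col] > 0).sum())

print(term_frequency, document_frequency)  # 500 1
```

A threshold such as `min_df=2` would drop 'firetruck' here, even though its raw count is the highest in the corpus.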


### Custom Tokenization
Pass your own function as the `tokenizer` parameter: `cv = CountVectorizer(tokenizer=my_tokenizer)`


### Custom Preprocessing
Preprocessing helps reduce noise and sparsity, which leads to more accurate analysis.


In [14]:
from sklearn.feature_extraction.text import CountVectorizer

cat_in_the_hat_docs=[
    "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library",
    "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
    "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
    "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
    "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)" 
]

# the documents go to fit_transform(), not to the constructor
cv = CountVectorizer()
count_vector=cv.fit_transform(cat_in_the_hat_docs)

# Show resulting vocab
cv.vocabulary_

# shape of count vector: 5 docs (book titles) and 43 unique words
count_vector.shape

# --- STOP WORDS --- 

# CUSTOM STOP WORD LIST
cv = CountVectorizer(stop_words=["all","in","the","is","and"], min_df=2)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
count_vector.shape

# CHECK STOP WORDS BEING USED when explicitly specified
cv.stop_words


# MIN_DF AND MAX DF
# cv = CountVectorizer(stop_words=["all","in","the","is","and"], min_df=2)
cv = CountVectorizer(stop_words=["all","in","the","is","and"], max_df=0.85)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
count_vector.shape

# CHECK STOP WORDS inferred from min_df and max_df (document frequency)
cv.stop_words_


# --- CUSTOM TOKENIZATION ---
import re

# keep punctuation as separate tokens
def my_tokenizer(text):
    # pad every non-word character with spaces
    text = re.sub(r"(\W)", r" \1 ", text)

    # split based on whitespace
    return re.split(r"\s+", text)

cv = CountVectorizer(tokenizer=my_tokenizer)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
print(cv.vocabulary_)

{'one': 34, 'cent': 14, ',': 4, 'two': 47, 'cents': 15, 'old': 32, 'new': 29, ':': 5, 'all': 7, 'about': 6, 'money': 28, '(': 2, 'cat': 13, 'in': 22, 'the': 44, 'hat': 19, "'": 1, 's': 38, 'learning': 25, 'library': 26, 'inside': 24, 'your': 49, 'outside': 36, 'human': 21, 'body': 10, ')': 3, '': 0, 'oh': 31, 'things': 46, 'you': 48, 'can': 12, 'do': 16, 'that': 43, 'are': 8, 'good': 18, 'for': 17, 'staying': 41, 'healthy': 20, 'on': 33, 'beyond': 9, 'bugs': 11, 'insects': 23, 'there': 45, 'no': 30, 'place': 37, 'like': 27, 'space': 40, 'our': 35, 'solar': 39, 'system': 42}


In [17]:
# ----- CUSTOM PREPROCESSING WITH PORTERSTEMMER -----

import re
import nltk
import pandas as pd
from nltk.stem import PorterStemmer

# init stemmer
porter_stemmer=PorterStemmer()

def my_cool_preprocessor(text):

    text = text.lower()  # lowercase text (done by default if you don't use a custom preprocessor)
    text = re.sub(r"\W", " ", text)  # replace special chars with spaces
    text = re.sub(r"\s+(in|the|all|for|and|on)\s+", " _connector_ ", text)  # normalize certain words

    # stem words
    words = re.split(r"\s+", text)
    stemmed_words = [porter_stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

cv = CountVectorizer(preprocessor=my_cool_preprocessor)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
print(cv.vocabulary_)

{'one': 25, 'cent': 8, 'two': 37, 'old': 23, 'new': 20, '_connector_': 0, 'about': 1, 'money': 19, 'cat': 7, 'the': 34, 'hat': 11, 'learn': 16, 'librari': 17, 'insid': 15, 'your': 39, 'outsid': 27, 'human': 13, 'bodi': 4, 'oh': 22, 'thing': 36, 'you': 38, 'can': 6, 'do': 9, 'that': 33, 'are': 2, 'good': 10, 'stay': 31, 'healthi': 12, 'on': 24, 'beyond': 3, 'bug': 5, 'insect': 14, 'there': 35, 'no': 21, 'place': 28, 'like': 18, 'space': 30, 'our': 26, 'solar': 29, 'system': 32}
