# 1. Term Frequency

### CountVectorizer (TF)

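This one is pretty simple. It collects all the unique words across a series of documents (the vocabulary) and then counts how many times each unique word appears in each document. **The vocabulary is built ACROSS ALL INSTANCES; the counts are per instance.**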

Here are some key arguments:
- **input:** the kind of input being passed (`'content'` by default, i.e. raw strings). The corpus itself (all of the instances, usually a list of strings with one string per document) is passed to `fit` / `fit_transform`, not to the constructor.
- **strip_accents:** remove accents during the preprocessing step. `'ascii'` is fast but only works on characters that have a direct ASCII mapping. `'unicode'` works on any characters but is slightly slower.
- **analyzer:** whether each feature should be made of word n-grams (`'word'`) or character n-grams (`'char'`, `'char_wb'`).
- **max_df:** float in range [0.0, 1.0] or int, default=1.0. When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. SO GET RID OF WORDS THAT SHOW UP IN TOO MANY DOCUMENTS.
- **min_df:** float in range [0.0, 1.0] or int, default=1. When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature.
    
Check out the documentation for more! (?CountVectorizer)
   

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ["You can catch more flies with honey than you can with vinegar you you. Test",
        "You can lead a horse to water, but you can't make him drink. Test",
        "You can lead a mouse to cheese, but you can't make him drink wine Test.",
       "You can lead a mouse to cheese, but you can't make him drink wine."]

#DON'T INCLUDE WORDS THAT SHOW UP IN MORE THAN HALF OF THE INSTANCES
vect = CountVectorizer(min_df=0., max_df=0.50, analyzer='word')
X = vect.fit_transform(docs)


#you can also get the feature names here
#(get_feature_names() was removed in newer sklearn versions, so use get_feature_names_out())
feature_names = vect.get_feature_names_out().tolist()
print(feature_names,"\n")

#X.toarray() returns the same data as a dense matrix!!!
print(X.toarray(),"\n")

#printing just X lists the nonzero entries as (row, column) pairs (one for each unique word in each instance) and the corresponding count
print(X, "\n")

df_VC = pd.DataFrame(X.toarray(), columns=feature_names).to_string()
print(df_VC)

['catch', 'cheese', 'flies', 'honey', 'horse', 'more', 'mouse', 'than', 'vinegar', 'water', 'wine', 'with'] 

[[1 0 1 1 0 1 0 1 1 0 0 2]
 [0 0 0 0 1 0 0 0 0 1 0 0]
 [0 1 0 0 0 0 1 0 0 0 1 0]
 [0 1 0 0 0 0 1 0 0 0 1 0]] 

  (0, 8)	1
  (0, 7)	1
  (0, 3)	1
  (0, 11)	2
  (0, 2)	1
  (0, 5)	1
  (0, 0)	1
  (1, 9)	1
  (1, 4)	1
  (2, 10)	1
  (2, 1)	1
  (2, 6)	1
  (3, 10)	1
  (3, 1)	1
  (3, 6)	1 

   catch  cheese  flies  honey  horse  more  mouse  than  vinegar  water  wine  with
0      1       0      1      1      0     1      0     1        1      0     0     2
1      0       0      0      0      1     0      0     0        0      1     0     0
2      0       1      0      0      0     0      1     0        0      0     1     0
3      0       1      0      0      0     0      1     0        0      0     1     0
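
To see why some words were dropped by `max_df=0.50`, here is a minimal sketch that recomputes the document frequencies by hand. It assumes the default `CountVectorizer` tokenization (lowercasing, tokens of two or more word characters). `token_pattern` and `document_frequencies` are illustrative names, not part of sklearn.

```python
import re
from collections import Counter

docs = ["You can catch more flies with honey than you can with vinegar you you. Test",
        "You can lead a horse to water, but you can't make him drink. Test",
        "You can lead a mouse to cheese, but you can't make him drink wine Test.",
        "You can lead a mouse to cheese, but you can't make him drink wine."]

# Assumed to mirror the default tokenization: words of 2+ characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def document_frequencies(documents):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        # set() so a token is counted once per document
        df.update(set(token_pattern.findall(doc.lower())))
    return df

df = document_frequencies(docs)
max_df = 0.50

# A term is dropped when its document frequency is strictly above max_df
kept = sorted(t for t, n in df.items() if n / len(docs) <= max_df)
dropped = sorted(t for t, n in df.items() if n / len(docs) > max_df)

print("kept:", kept)
print("dropped:", dropped)
```

The `kept` list should line up with the feature names printed above, and words such as "test" (3 of 4 documents) land in `dropped`.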


# 2. Jaccard Similarity

This is used to determine similarity between documents:
- **jaccard_similarity:** It's simply the size of the intersection of the two sets of tokens divided by the size of their union.
![](pictures/LP_NLP_ex1_jacard.jpg)

A few key issues with this metric:

- Document length and term counts are ignored, because each document is reduced to a set of tokens (this tends to bias the metric towards longer documents).
- Words that appear in a lot of documents are weighted the same as those that appear in few, so non-descriptive words count just as much as distinctive ones.

**We need a way to weigh certain words differently than others.** Words that appear in all the documents are not going to be good at identifying documents because of the fact that, well... they appear in all the documents. Next we will discuss another similarity measure that takes this into account: **TF-IDF and Cosine Similarity.**


In [2]:
def jaccard_similarity(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)
    
query= 'Hello my name is mike'.split(" ")
document = 'Hello my name is Jane'.split(" ")
jaccard_similarity(query, document)

0.6666666666666666
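
The two issues listed above can be checked with a small sketch. The toy sentences are made up for illustration; the function is the same set-based definition used in the cell above.

```python
def jaccard_similarity(query, document):
    """Size of the token intersection divided by the size of the token union."""
    query_set, document_set = set(query), set(document)
    return len(query_set & document_set) / len(query_set | document_set)

query = "the cat".split(" ")

# Repeating a word changes nothing, because tokens are treated as a set
short_doc = "the cat sat".split(" ")
repeated_doc = "the cat sat sat sat sat".split(" ")
score_short = jaccard_similarity(query, short_doc)
score_repeated = jaccard_similarity(query, repeated_doc)
print(score_short, score_repeated)

# Sharing only a very common word ("the") scores the same as
# sharing only a more descriptive word ("cat")
score_common = jaccard_similarity("the dog".split(" "), "the cat".split(" "))
score_rare = jaccard_similarity("a cat".split(" "), "the cat".split(" "))
print(score_common, score_rare)
```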

# 4. TFIDF Vectorizer

Taken from: http://billchambers.me/tutorials/2014/12/21/tf-idf-explained-in-python.html

One technique for vectorizing documents is to pick the most frequently occurring terms. However, raw frequency on its own is not a very useful metric, since words like 'this' and 'a' occur very frequently across all documents.

**Hence, we also want a measure of how unique a word is, i.e. how infrequently the word occurs across all documents (inverse document frequency or idf).** So TFIDF is the product of two components:
1. **TF: term frequency:** how many times the word appears in the document
2. **IDF: inverse document frequency:** the inverse of how many documents the word appears in (usually log-scaled), so words found in fewer documents get a higher weight
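
Written out as a formula, the product looks roughly like this. It is a sketch of the basic textbook form, where $N$ is the number of documents in the collection $D$. Implementations differ in details such as the log base, adding 1 to the idf, or dividing the term count by the document length.

$$
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D),
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
$$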

**EXAMPLE:** 

Consider a document containing **100 words wherein the word sun appears 3 times.** The term frequency (i.e., tf) for sun is then (3 / 100) = 0.03. Now, assume we have **10 million documents and the word sun appears in one thousand of these.** Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4 (with log base = 10). Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.

- TF = 3/100 = 0.03
- IDF = log(10,000,000/1,000) = 4
- TF IDF = 0.03 * 4 = 0.12

In [3]:
import math
base = 10

#basic term frequency
tf = 3/100
#apply log (base 10, not the natural log) to total docs / docs containing the term
idf = math.log(10000000/1000, base)

tf*idf

0.12

### Building a custom tf idf algo
Taken from: https://gist.github.com/anabranch/48c5c0124ba4e162b2e3

The overall tfidf algorithm will be composed of 5 sub-functions. 
- **Tokenize:** this just splits documents into word tokens (making all lower case)
- **term frequency**: simple way of counting the number of occurrences of a token in a document
- **sublinear term frequency:** Addresses bias towards longer documents (which tend to repeat terms more often) by using 1 + log(count) instead of the raw count, so each extra occurrence of a word adds less weight than the one before. This is a form of term-frequency scaling. Two methods are covered here, sublinear and augmented frequency (sklearn's `TfidfVectorizer` exposes the sublinear one through `sublinear_tf=True`).
![](pictures/LP_NLP_ex1_sublineartf.jpg)
- **augmented term frequency:** Accomplishes the same goal as sublinear term frequency with a slightly different method: the count is divided by the count of the most frequent term in the document and then rescaled to lie between 0.5 and 1.
- **IDF:** Inverse document frequency targets words that are unique to certain documents. It's the log of the number of documents (N) over the number of documents (d) in the full list of documents (D) that contain the term (t). This gives us one weight for every token in the corpus, which helps separate the important words from the unimportant ones.


![](pictures/LP_NLP_ex1_IDF.jpg)

In [4]:
#build quick tokenize function
tokenize = lambda document: document.lower().split(" ")
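
One thing worth knowing about this tokenizer: splitting on a single space keeps empty strings when there are double spaces, and punctuation stays attached to words. A quick sketch (the sample strings are just for illustration):

```python
def tokenize(document):
    """Same behavior as the lambda above: lowercase, then split on single spaces."""
    return document.lower().split(" ")

# A double space produces an empty-string token
with_empty = tokenize("Hello  hi my name")
print(with_empty)

# str.split() with no argument collapses runs of whitespace instead
without_empty = "Hello  hi my name".lower().split()
print(without_empty)

# Punctuation is not stripped, so "pace." and "pace" would be different tokens
with_punctuation = tokenize("growing at a rapid pace.")
print(with_punctuation)
```

This matters later on: an empty-string token ends up as its own column in the tfidf matrix whenever a document contains a double space.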

In [5]:
#term frequency
def term_frequency(term, tokenized_document):
    return tokenized_document.count(term)

tokenized_document=tokenize('Hello Hello my name is mike')
term = 'hello'
term_frequency(term, tokenized_document)

2

In [6]:
#sublinear term frequency
def sublinear_term_frequency(term, tokenized_document):
    count = tokenized_document.count(term)
    if count == 0:
        return 0
    return 1 + math.log(count)

tokenized_document=tokenize('Hello Hello Hello my name is mike')
term = 'hello'
sublinear_term_frequency(term, tokenized_document)

2.09861228866811
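
To get a feel for how strongly the log dampens repeated terms, here is a small sketch that applies the same 1 + log(count) rule directly to a few counts. `sublinear_tf` is an illustrative helper that takes a count rather than a document.

```python
import math

def sublinear_tf(count):
    """1 + ln(count) for positive counts, 0 when the term is absent."""
    return 0 if count == 0 else 1 + math.log(count)

# A term that appears 100 times gets less than 6x the weight of a term that appears once
scaled = {count: sublinear_tf(count) for count in (0, 1, 2, 10, 100)}
for count, value in scaled.items():
    print(count, round(value, 3))
```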

In [7]:
#augmented term frequency
def augmented_term_frequency(term, tokenized_document):
    #picks max frequency of any given term
    max_count = max([term_frequency(t, tokenized_document) for t in tokenized_document])
    #term frequency relative to max term frequency in doc
    return (0.5 + ((0.5 * term_frequency(term, tokenized_document))/max_count))

tokenized_document=tokenize('Hello Hello my name is mike mike mike')
term = 'hello'
augmented_term_frequency(term, tokenized_document)

0.8333333333333333
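
A quick sketch of the range this produces, using the same formula with a `Counter` (an equivalent rewrite for illustration, not a replacement for the cell above). Note that a term that is absent from the document still gets 0.5.

```python
from collections import Counter

def augmented_term_frequency(term, tokenized_document):
    """0.5 + 0.5 * (count of term / count of the most frequent term)."""
    counts = Counter(tokenized_document)
    max_count = max(counts.values())
    # Counter returns 0 for missing terms, so absent terms score exactly 0.5
    return 0.5 + 0.5 * counts[term] / max_count

tokens = "hello hello my name is mike mike mike".split(" ")
scores = {term: augmented_term_frequency(term, tokens)
          for term in ("mike", "hello", "my", "jane")}
for term, score in scores.items():
    print(term, round(score, 4))
```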

In [8]:
#inverse document frequency
def inverse_document_frequencies(tokenized_documents):
    #extract number of docs
    num_docs = len(tokenized_documents)
    #extract unique words from all docs combined into a single list
    all_tokens_set = set([item for sublist in tokenized_documents for item in sublist])
    #loop through each token in all_tokens_set and create idf value dict
    idf_values = {}
    for tkn in all_tokens_set:
        #count number of docs tkn is found in
        contains_token = sum(map(lambda doc: tkn in doc, tokenized_documents))
        #compute idf value for tkn
        idf_values[tkn] = 1 + math.log(num_docs/contains_token)
    return idf_values

tokenized_documents = [tokenize("Hi my name is Mike"),
                      tokenize("Hello my name is Jane"),
                      tokenize("Hello my name is John"),
                      tokenize("Hello my name is John")]

print("Notice how unique words have higher weights!!!!\nWords in every document have a weight of 1\nAs the number of documents grows, the weight of unique words grows")
inverse_document_frequencies(tokenized_documents)

Notice how unique words have higher weights!!!!
Words in every document have a weight of 1
As the number of documents grows, the weight of unique words grows


{'hello': 1.2876820724517808,
 'my': 1.0,
 'name': 1.0,
 'jane': 2.386294361119891,
 'hi': 2.386294361119891,
 'is': 1.0,
 'mike': 2.386294361119891,
 'john': 1.6931471805599454}
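
The 1 + log(N / df) form used above is one of several idf variants. As far as I understand the scikit-learn documentation, it matches `TfidfVectorizer` with `smooth_idf=False`, while the default `smooth_idf=True` adds one to both the numerator and the denominator. A minimal sketch of the two, with `idf_plain` and `idf_smoothed` as illustrative names:

```python
import math

def idf_plain(num_docs, doc_count):
    """Same rule as inverse_document_frequencies above: 1 + ln(N / df)."""
    return 1 + math.log(num_docs / doc_count)

def idf_smoothed(num_docs, doc_count):
    """Smoothed variant (assumed from the sklearn docs): 1 + ln((1 + N) / (1 + df))."""
    return 1 + math.log((1 + num_docs) / (1 + doc_count))

num_docs = 4
for doc_count in (1, 2, 3, 4):
    print(doc_count,
          round(idf_plain(num_docs, doc_count), 4),
          round(idf_smoothed(num_docs, doc_count), 4))
```

Both variants give a weight of 1 to a word that appears in every document; smoothing mainly pulls down the weights of the rarest words.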

In [9]:
#put it all together with a tfidf function!!!
def tfidf(documents):
    #tokenize each document
    tokenized_documents = [tokenize(d) for d in documents]
    #calculate idf values for each unique term
    idf = inverse_document_frequencies(tokenized_documents)
    #loop through document, and each unique idf term to get set of tfidf values for each document
    tfidf_documents = []
    for document in tokenized_documents:
        #loop through each term
        doc_tfidf = []
        for term in idf.keys():
            #calculate sublinear term frequency
            tf = sublinear_term_frequency(term, document)
            #calculate FINAL tfidf values in single document
            doc_tfidf.append(tf * idf[term])
        #append doc_tfidf to ALL documents list
        tfidf_documents.append(doc_tfidf)
    return tfidf_documents, idf.keys()

documents = ["Hi my name is Mike Ciniello",
             "Hello hi my name is Jane",
             "Hello  hi my name is John",
             "Hello my name is John"]
print("Result should be an n*m matrix, where n is the number of documents, and m is the number of unique words")    
vals, words = tfidf(documents)
pd.DataFrame(vals, columns = words)

Result should be an n*m matrix, where n is the number of documents, and m is the number of unique words


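| | `''` | hello | my | name | jane | ciniello | hi | is | mike | john |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.386294 | 1.287682 | 1.0 | 2.386294 | 0.0 |
| 1 | 0.0 | 1.287682 | 1.0 | 1.0 | 2.386294 | 0.0 | 1.287682 | 1.0 | 0.0 | 0.0 |
| 2 | 2.386294 | 1.287682 | 1.0 | 1.0 | 0.0 | 0.0 | 1.287682 | 1.0 | 0.0 | 1.693147 |
| 3 | 0.0 | 1.287682 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.693147 |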


In [11]:
#test out the tfidf function
document_0 = "China has a strong economy that is growing at a rapid pace. However politically it differs greatly from the US Economy."
document_1 = "At last, China seems serious about confronting an endemic problem: domestic violence and corruption."
document_2 = "Japan's prime minister, Shinzo Abe, is working towards healing the economic turmoil in his own country for his view on the future of his people."
document_3 = "Vladimir Putin is working hard to fix the economy in Russia as the Ruble has tumbled."
document_4 = "What's the future of Abenomics? We asked Shinzo Abe for his views"
document_5 = "Obama has eased sanctions on Cuba while accelerating those against the Russian Economy, even as the Ruble's value falls almost daily."
document_6 = "Vladimir Putin is riding a horse while hunting deer. Vladimir Putin always seems so serious about things - even riding horses. Is he crazy?"

all_documents = [document_0, document_1, document_2, document_3, document_4, document_5, document_6]
all_documents

print("Total word count:")
print(sum([len(x.split(" ")) for x in all_documents]))
print("Unique word count:")
#flatten the tokens of every document into a single list
full_list = [token for doc in all_documents for token in tokenize(doc)]
print(len(set(full_list)))
print('Shape of tfidf conversion (should be 7 rows and 94 features)')
#tfidf returns a tuple: (list of tfidf rows, terms)
tfidf_conversion = tfidf(all_documents)
#print(pd.DataFrame(tfidf_conversion[0]).shape)

Total word count:
133
Unique word count:
94
Shape of tfidf conversion (should be 7 rows and 94 features)


In [12]:
#compare output to sklearn
#import vectorizer
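from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
#initialize (the documents go to fit_transform, not to the constructor)
#min_df=1 keeps every term; newer sklearn versions reject an integer min_df of 0
sklearn_tfidf = TfidfVectorizer(norm='l2', min_df=1, use_idf=True, 
                smooth_idf=False, 
                sublinear_tf=True,
                tokenizer=tokenize)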
#fit transform
sklearn_representation = sklearn_tfidf.fit_transform(all_documents)

sklearn_representation.shape

(7, 94)
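
One reason the sklearn values will not match the manual ones number for number is `norm='l2'`: each row is rescaled to unit length, which the manual `tfidf` function does not do. As far as I know, sklearn also orders its columns by sorted vocabulary, while the manual version uses whatever order the set of tokens happens to have. A minimal sketch of the normalization step, with `l2_normalize` as an illustrative helper and a row taken from the small tfidf table earlier:

```python
import math

def l2_normalize(vector):
    """Scale a vector so that the sum of its squared values is 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        # An all-zero row cannot be scaled; return it unchanged
        return list(vector)
    return [v / norm for v in vector]

# Row 3 of the small manual tfidf example above (not normalized)
manual_row = [0.0, 1.287682, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.693147]
normalized_row = l2_normalize(manual_row)

print([round(v, 4) for v in normalized_row])
print(sum(v * v for v in normalized_row))
```

After normalization the relative weights within a row are unchanged, but rows from long and short documents become directly comparable.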

In [13]:
print("Manual tfidf:", tfidf_conversion[0][0:5])
print("Sklearn tfidf:", sklearn_representation.toarray()[0][0:5])

Manual tfidf: [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.9459101490553135, 1.336472236621213, 0.0, 2.9459101490553135, 0.0, 2.9459101490553135, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.9459101490553135, 0.0, 0.0, 0.0, 0.0, 0.0, 2.252762968495368, 0.0, 2.9459101490553135, 2.9459101490553135, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5596157879354227, 0.0, 1.8472978603872037, 2.9459101490553135, 0.0, 2.9459101490553135, 2.9459101490553135, 0.0, 0.0, 2.252762968495368, 0.0, 0.0, 0.0, 2.252762968495368, 0.0, 2.9459101490553135, 0.0, 2.9459101490553135, 0.0, 0.0, 2.9459101490553135, 3.8142592685777856, 0.0, 0.0, 0.0, 0.0, 0.0, 2.9459101490553135, 0.0, 0.0], [2.9459101490553135, 2.9459101490553135, 0.0, 0.0, 0.0, 2.252762968495368, 0.0, 0.0, 2.9459101490553135, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.252762968495368, 0.0, 0.0, 0.0, 0.0