## Instructions
<p>Write a function that computes the inverse document frequency (IDF) of a group of words. You can use any text corpus you like for this exercise.</p>

<p>The examples should give you a good starting point, but you will need to apply some of the techniques you have already learned to make them work with your own data.</p>

[TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Converts a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
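
The equivalence stated above can be checked directly. This is a minimal sketch on a made-up three-sentence corpus (`toy_docs` is hypothetical, not part of the exercise data), assuming both routes run with scikit-learn's default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for this illustration.
toy_docs = [
    "the old man fished alone",
    "the boy helped the old man",
    "the fish pulled the skiff",
]

# Route 1: raw term counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: both steps inside a single estimator.
one_step = TfidfVectorizer().fit_transform(toy_docs)

# Compare the two matrices element by element.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step route is handy when the raw counts are needed for something else as well; the one-step route is shorter when only the TF-IDF matrix matters.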

[How to use CountVectorizer](https://kavita-ganesan.com/how-to-use-countvectorizer/)

[TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)

[Tf-idf term weighting Guide](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

<p>Advanced exercise: implement a pre-processor for distance measurement that weights each vector by IDF.</p>
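
One possible shape for such a pre-processor is sketched below. It is only an illustration under stated assumptions: the helper names `idf_weight` and `pairwise_euclidean` and the matrix `toy_counts` are hypothetical, the smoothing follows scikit-learn's `smooth_idf=True` formula, and Euclidean distance is just one choice of distance measure.

```python
import numpy as np

def idf_weight(counts):
    """Scale each column of a document-term count matrix by its smoothed IDF."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_freq = np.count_nonzero(counts, axis=0)       # documents containing each term
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # same smoothing as smooth_idf=True
    return counts * idf

def pairwise_euclidean(matrix):
    """Euclidean distance between every pair of rows."""
    diff = matrix[:, None, :] - matrix[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical counts: 3 documents x 3 terms.
# Term 0 occurs in every document; terms 1 and 2 occur in one document each.
toy_counts = np.array([
    [2, 1, 0],
    [2, 0, 0],
    [2, 0, 1],
])

raw_dist = pairwise_euclidean(toy_counts.astype(float))
weighted_dist = pairwise_euclidean(idf_weight(toy_counts))

print(raw_dist)
print(weighted_dist)
```

In this toy case the documents differ only in rare terms, so IDF weighting stretches the distances between them, while a term shared by every document keeps a weight of 1.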

[Differences between TfidfTransformer and TfidfVectorizer](https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/)

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pandas as pd

def get_idf(doc_path):
    """Return the IDF weight of every term in the file, lowest first."""
    cv = CountVectorizer()
    tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)

    # Treat each line of the file as one document.
    with open(doc_path, "r") as f:
        corpus = f.readlines()

    # Fitting on the sparse count matrix is enough to learn the IDF weights.
    counts = cv.fit_transform(corpus)
    tfidf_transformer.fit(counts)

    # One row per vocabulary term.
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
    df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])

    # Sort ascending so the most common terms come first.
    return df_idf.sort_values(by=["idf_weights"])

get_idf("Data/oldmanandthesea.txt")

term,idf_weights
the,1.000000
in,1.182322
and,1.182322
was,1.182322
days,1.405465
...,...
empty,2.098612
either,2.098612
eighty,2.098612
flour,2.098612
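
The weights in the table above are consistent with scikit-learn's smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is what `smooth_idf=True` selects. The sketch below reproduces them, assuming the file splits into n = 5 lines (a value inferred from the numbers, not stated in the notebook) and using a hypothetical helper `smoothed_idf`:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF as documented for TfidfTransformer(smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 5  # assumption: five lines, so five documents

# doc_freq is the number of documents (lines) that contain the term.
in_all_five = round(smoothed_idf(n_docs, 5), 6)   # 1.0       (matches "the")
in_four = round(smoothed_idf(n_docs, 4), 6)       # 1.182322  (matches "in", "and", "was")
in_three = round(smoothed_idf(n_docs, 3), 6)      # 1.405465  (matches "days")
in_one = round(smoothed_idf(n_docs, 1), 6)        # 2.098612  (matches "empty", "flour")

print(in_all_five, in_four, in_three, in_one)
```

Read this way, a weight of 1.0 means the term appears in every line, and 2.098612 means it appears in exactly one.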


In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd 

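def get_tf_idf(doc_path):
    """Return the TF-IDF scores of the first document (line) in the file, highest first."""
    vectorizer = TfidfVectorizer(use_idf=True)

    # Treat each line of the file as one document.
    with open(doc_path, "r") as f:
        corpus = f.readlines()

    # Sparse matrix with one row per document and one column per term.
    tfidf = vectorizer.fit_transform(corpus)

    # Row 0 holds the vector for the first document.
    first_vector_tfidf = tfidf[0]

    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
    df = pd.DataFrame(first_vector_tfidf.T.toarray(), index=vectorizer.get_feature_names_out(), columns=["tfidf"])
    return df.sort_values(by=["tfidf"], ascending=False)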

    
get_tf_idf("Data/oldmanandthesea.txt")

term,tfidf
he,0.372307
in,0.259981
alone,0.230733
eighty,0.230733
an,0.230733
...,...
flag,0.000000
first,0.000000
finally,0.000000
empty,0.000000


In [10]:
import string

import nltk
from scipy.spatial import distance_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

# Normalize by lemmatization
# nltk.download('wordnet')  # run once if the WordNet data is missing
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

# Translation table that deletes every punctuation character.
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # Lowercase, strip punctuation, tokenize, then lemmatize.
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

def get_distance_matrix(doc_path):
    """Return pairwise Euclidean distances between the TF-IDF vectors of each line."""
    vectorizer = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english', use_idf=True)

    # Treat each line of the file as one document.
    with open(doc_path, "r") as f:
        corpus = f.readlines()

    # distance_matrix expects dense arrays, so convert the sparse result.
    tfidf = vectorizer.fit_transform(corpus).toarray()

    return distance_matrix(tfidf, tfidf)


print(get_distance_matrix("Data/oldmanandthesea.txt"))

[[0.         1.32324653 1.20801646 1.26431495 1.38914829]
 [1.32324653 0.         1.17166911 1.2373693  1.41421356]
 [1.20801646 1.17166911 0.         1.30610073 1.39776442]
 [1.26431495 1.2373693  1.30610073 0.         1.31158913]
 [1.38914829 1.41421356 1.39776442 1.31158913 0.        ]]
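
Since `TfidfVectorizer` L2-normalizes each row by default, these Euclidean distances map directly onto cosine similarities: for unit vectors, d² = 2 − 2·cos. The sketch below applies that identity to two entries of the matrix above. The helper name `cosine_from_unit_distance` is hypothetical, and it assumes no document ended up with an all-zero vector after stop-word removal.

```python
import numpy as np

def cosine_from_unit_distance(dist):
    """Cosine similarity implied by a Euclidean distance between unit vectors."""
    return 1.0 - np.asarray(dist, dtype=float) ** 2 / 2.0

# Two entries copied from the distance matrix printed above.
dist_docs_1_2 = 1.32324653   # first and second documents
dist_docs_2_5 = 1.41421356   # second and fifth documents

sim_1_2 = cosine_from_unit_distance(dist_docs_1_2)
sim_2_5 = cosine_from_unit_distance(dist_docs_2_5)

print(sim_1_2, sim_2_5)
```

A distance of 1.41421356 (the square root of 2) therefore corresponds to a cosine similarity of zero, which suggests the second and fifth lines share no terms once stop words are removed.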


In [60]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# settings that you use for count vectorizer will go here
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

# just send in all your docs here
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)

# TF-IDF vector of the first document
first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[0]

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
df = pd.DataFrame(first_vector_tfidfvectorizer.T.toarray(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)



term,tfidf
had,0.493562
little,0.493562
tiny,0.493562
house,0.398203
mouse,0.235185
the,0.235185
ate,0.000000
away,0.000000
cat,0.000000
end,0.000000
finally,0.000000
from,0.000000
of,0.000000
ran,0.000000
saw,0.000000
story,0.000000
