## Implement TF-IDF from scratch

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how **important** a word is to a document in a collection or corpus.

In this notebook, we will implement TF-IDF from scratch. First, we construct the tf matrix, then calculate the idf matrix. After normalizing both matrices, we obtain their product, TF-IDF = tf * idf, according to the formulas stated below:

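### $tf_{w, d} = \frac{f_{w, d}}{\sum_{w'\in d} f_{w', d}}$
### $idf_{w, C} = \log \frac{N}{1 + |\{d \in C \mid w \in d\}|}$

Here $f_{w, d}$ is the number of times word $w$ occurs in document $d$, $C$ is the corpus, and $N = |C|$ is the number of documents.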
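
To see the two formulas at work, here is a minimal sketch on a made-up three-document corpus. The corpus and the names `toy_corpus`, `term_frequency` and `inverse_document_frequency` are illustrative assumptions and are not part of the notebook's pipeline.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tokenized documents (not the notebook's data)
toy_corpus = {
    "doc_a": ["profit", "profit", "growth"],
    "doc_b": ["profit", "loss"],
    "doc_c": ["profit", "growth", "growth", "risk"],
}

def term_frequency(tokens):
    """tf: count of a word in the document divided by the document's token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def inverse_document_frequency(corpus):
    """idf = log(N / (1 + number of documents that contain the word))."""
    n_docs = len(corpus)
    # set(tokens) makes sure each document counts a word at most once
    doc_freq = Counter(word for tokens in corpus.values() for word in set(tokens))
    return {word: math.log(n_docs / (1 + df)) for word, df in doc_freq.items()}

toy_tf = {name: term_frequency(tokens) for name, tokens in toy_corpus.items()}
toy_idf = inverse_document_frequency(toy_corpus)
toy_tf_idf = {
    name: {word: tf * toy_idf[word] for word, tf in tfs.items()}
    for name, tfs in toy_tf.items()
}
```

In this toy corpus `profit` occurs in all three documents, so its idf is log(3/4) < 0, and `growth` occurs in two of three, so its idf is log(3/3) = 0. With the `1 +` in the denominator, very common words can therefore receive zero or negative scores.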

<font color="blue"/>

### dsp:
  * &#x1f642; Nice!
  * "scrach" ~> "scratch", "impelment" ~> "implement", "construct tf matrix" ~> "construct the tf matrix"
  * One "matrix", two "matrices" or "matrixes"; "after normalization both matrix" ~> "after normalizing both matrices"
  * Instead of `### $...$`  you might want to use `$$ ... $$`

In [1]:
import pickle
import pandas as pd
import numpy as np
import os

Before computing the TF-IDF matrix, make sure that the text corpus has been generated during preprocessing. If not, go back to [Corpus Generation](03-Corpus_Generation.ipynb).

#### 1. Calculate TF-IDF on preprocessed documents
- params
    - processed_doc_path: directory that contains the preprocessed (pickled) documents
    - document_limit: maximum number of documents to process with tf-idf
- variables
    - word2index: dict, key:word (str), value:index (int)
    - document2index: dict, key:document name (str), value:index (int)
    - document_word_vectors: dict, key:document name (str), value:list of word indices, one per word occurrence in the document (list)
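
A minimal sketch of how these three structures relate, using two hypothetical in-memory documents in place of the pickled files (the `toy_` names and the tokens are assumptions for illustration):

```python
# Hypothetical tokenized documents standing in for the pickled files
toy_docs = {
    "report_1": ["Umsatz", "Gewinn", "umsatz"],
    "report_2": ["Gewinn", "Risiko"],
}

toy_word2index = {}
toy_document2index = {}
toy_document_word_vectors = {}

for doc_name, tokens in toy_docs.items():
    # documents are numbered in the order in which they are read
    toy_document2index[doc_name] = len(toy_document2index)
    indices = []
    for token in tokens:
        word = token.lower()
        # setdefault assigns the next free index only the first time a word is seen
        indices.append(toy_word2index.setdefault(word, len(toy_word2index)))
    toy_document_word_vectors[doc_name] = indices
```

Here `toy_word2index` ends up as `{'umsatz': 0, 'gewinn': 1, 'risiko': 2}` and `report_1` is stored as `[0, 1, 0]`, so a repeated word keeps a repeated index and can be counted later.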

In [None]:
def tf_idf_PrerocessedDoc(processed_doc_path, document_limit=200):
    
    word2index = {}
    document2index = {}
    index2document = {}
    document_word_vectors = {}
    
    w_cnt = 0
    d_cnt = 0
    
    for root, dirs, files in os.walk(processed_doc_path):
        for f in files:
            print('.', end='')
            document_limit -= 1
            
            document_word_vectors[f] = []
            document2index[f] = d_cnt
            index2document[d_cnt] = f
            d_cnt+=1
            
            with open(os.path.join(root, f), 'rb') as fs:
                try:
                    #load preprocessed files
                    words = pickle.load(fs)
                    for w_ in words:
                        w = w_.lower()
                        if w not in word2index:
                            word2index[w] = w_cnt
                            w_cnt += 1
                        document_word_vectors[f].append(word2index[w])
                except Exception:
                    # a file that cannot be loaded is kept as an empty document
                    print('Error while processing: ', f)
            
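            if document_limit == 0:
                print('\n')
                break

        # the inner break only leaves the current directory, so stop walking as well
        if document_limit == 0:
            break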

    #create word_frequency matrix                        
    w_f_matrix = np.zeros((len(word2index),len(document2index)))
    
    for doc in document_word_vectors:
        i = document2index[doc]
        for j in document_word_vectors[doc]:
            w_f_matrix[j,i]+=1 

    # obtain normalized term-frequency matrix
    t_f = np.copy(w_f_matrix)
    sum_f = np.zeros(len(document2index))
    for i in range(len(document2index)):
        sum_f[i] = np.sum(t_f[:,i])
    t_f = np.divide(t_f,sum_f)  

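    # document frequency: number of documents in which each word occurs
    # (the array is reused below to hold the idf values)
    inv_doc_freq = np.count_nonzero(t_f,axis=1)
    
    def normalize(a,x):
        # idf = log(N / (1 + document frequency))
        return np.log(x/(1+a))
    
    # turn document frequencies into idf values, then weight each tf row by its idf
    norm = np.vectorize(normalize)
    inv_doc_freq = norm(inv_doc_freq,len(document2index))
    norm_tf_idf = np.multiply(t_f,inv_doc_freq.reshape(-1,1))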
    
    return norm_tf_idf, word2index, index2document, inv_doc_freq

<font color="blue"/>

### dsp:
  * &#x1f632; Don't mix `camelCase` with `snake_case` as you do in `tf_idf_PrerocessedDoc`. In Python `snake_case` is preferred anyway: https://www.python.org/dev/peps/pep-0008/?#id45
  * For maths, short variable names are sometimes better, as it is easier to understand the structure of the calculation. In this function more expressive names would be better: `w_cnt` ~> `word_count`, `d_cnt` ~> `doc_count` (or use `itertools.count()`), `f` ~> `file` or `document_name`, `w_` ~> `word`. Are the `word_vectors` word vectors? `w_f_matrix` ~> `word_frequencies`.
  * You don't need to copy `w_f_matrix` to `t_f` as it is never modified before it is overwritten by `t_f = np.divide...`.
  * Isn't `len(document2index) == doc_count` and `len(word2index) == word_count`? I guess `doc_count` and `word_count` are more readable.
  * Instead of
```Python
    sum_f = np.zeros(len(document2index))
    for i in range(len(document2index)):
        sum_f[i] = np.sum(t_f[:,i])
```  
you could just write
```Python
    sum_f = np.sum(t_f, axis=0)
```
  * Use a space after a comma!
  * `norm` has an established meaning (like in vector norm). Find a better name!
  * I never used `np.vectorize` before. Nice &#x1f642;!
  * In the given case `np.log(doc_count / (1 + inv_doc_freq))` might be as readable, though.
  * &#x1f642; Nice to see a "manual" implementation of tf-idf, since it is not that difficult.
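
Taken together, the numerical suggestions above could look roughly like the following. This is only a sketch on a hypothetical 3-word by 2-document count matrix. The names `word_frequencies`, `doc_count`, `term_freq`, `doc_freq`, `idf` and `tf_idf_sketch` are assumptions and not the notebook's own.

```python
import numpy as np

# Hypothetical word-by-document count matrix: 3 words (rows) x 2 documents (columns)
word_frequencies = np.array([
    [2.0, 1.0],
    [1.0, 0.0],
    [0.0, 3.0],
])
doc_count = word_frequencies.shape[1]

# term frequency: divide each column by the total number of words in that document
term_freq = word_frequencies / np.sum(word_frequencies, axis=0)

# document frequency: in how many documents each word occurs
doc_freq = np.count_nonzero(word_frequencies, axis=1)

# idf without np.vectorize: NumPy applies the arithmetic elementwise
idf = np.log(doc_count / (1 + doc_freq))

# weight every row of the tf matrix by that word's idf
tf_idf_sketch = term_freq * idf.reshape(-1, 1)
```

The sketch assumes that every document contains at least one word. An empty document would give a column sum of zero and therefore `nan` entries in that column.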

In [3]:
%%time
#please modify the relative path
# processed_doc_path = '/home/bit/ma0/LabShare/data/chui_ma/spacy_corpus/'
processed_doc_path = '../spacy_corpus/'
tf_idf, word2index, index2document, inv_doc_freq = tf_idf_PrerocessedDoc(processed_doc_path, 200)

........................................................................................................................................................................................................

CPU times: user 2.25 s, sys: 272 ms, total: 2.52 s
Wall time: 1.55 s


Overview of the tf_idf matrix

In [4]:
tf_idf

array([[5.02727986e-04, 5.44898817e-04, 6.04604635e-04, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [4.09646074e-04, 2.66405294e-04, 0.00000000e+00, ...,
        2.43070849e-04, 0.00000000e+00, 0.00000000e+00],
       [1.05725137e-03, 0.00000000e+00, 3.81450266e-03, ...,
        6.97043416e-05, 0.00000000e+00, 6.98329906e-03],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 8.88979424e-03],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 8.88979424e-03],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 8.88979424e-03]])

#### 2. Find most important words using tf-idf
- how often does a word appear in a report => “term frequency tf”
- how rare is a word across all reports => “inverse document frequency idf”
- => importance score = tf * idf
- params:
    - tf_idf: matrix, obtained from the function tf_idf_PrerocessedDoc(processed_doc_path, document_limit=200)
    - word2index: dict, key:word (str), value:index (int)
    - index2document: dict, key:index (int), value:document name (str)
    - k: number of keywords to display
- variables: 
    - document_important_words: dict, key: file name, value: list of the k most important words, highest tf-idf first
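
The selection of the k highest scores can be illustrated in isolation. The following is a minimal sketch on a hypothetical score vector for a single document. The values and the names `scores` and `top_k` are made up for illustration, and it assumes 1 <= k <= number of words.

```python
import numpy as np

# Hypothetical tf-idf scores of six words in one document
scores = np.array([0.10, 0.70, 0.05, 0.40, 0.90, 0.20])
k = 3

# argpartition moves the indices of the k largest scores into the last k slots, unordered
top_k = np.argpartition(scores, -k)[-k:]

# sort only those k indices by score, then reverse to get descending order
top_k = top_k[np.argsort(scores[top_k])][::-1]
```

Partitioning first avoids sorting the whole vocabulary for every document; only the k selected entries are sorted.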

In [None]:
def find_top_k_words(tf_idf, word2index, index2document, k):
    
    document_important_words = {}
    
    for i in range(tf_idf.shape[1]):
        important_words = []
        
        #get indices of k-maximum values in numpy column 
        index = np.argpartition(tf_idf[:, i], -k)[-k:]
        index = index[np.argsort(tf_idf[:, i][index])]
        index = index[::-1]
        
        for ind in index:
            #look up the word that corresponds to each index
            important_words.append(list(word2index.keys())[list(word2index.values()).index(ind)])
        
        filename = index2document[i]
        document_important_words[filename] = important_words
        
    #use pandas DataFrame to provide better visualization (built once, after the loop)
    df = pd.DataFrame(document_important_words)
        
    return df

<font color="blue"/>

### dsp:
  * You have more than one index in `index`, so it might be `indices`.
  * It might be wise to build a reverse index `index2word` once instead of recreating it for every word that you look up.
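
A reverse index along those lines might look like this. It is only a sketch: `find_top_k_words_sketch`, `index2word` and the `demo_` data are hypothetical names, and the function returns a plain dict that could be wrapped in `pd.DataFrame(...)` as in the notebook.

```python
import numpy as np

def find_top_k_words_sketch(tf_idf, word2index, index2document, k):
    # build the reverse lookup once instead of searching the dict for every word
    index2word = {index: word for word, index in word2index.items()}

    document_important_words = {}
    for i in range(tf_idf.shape[1]):
        column = tf_idf[:, i]
        # indices of the k largest scores, ordered from highest to lowest
        indices = np.argpartition(column, -k)[-k:]
        indices = indices[np.argsort(column[indices])][::-1]
        document_important_words[index2document[i]] = [index2word[j] for j in indices]

    return document_important_words

# hypothetical 3-word, 2-document example
demo_scores = np.array([[0.5, 0.0], [0.1, 0.3], [0.2, 0.9]])
demo_word2index = {"umsatz": 0, "gewinn": 1, "risiko": 2}
demo_index2document = {0: "report_1", 1: "report_2"}
demo_top = find_top_k_words_sketch(demo_scores, demo_word2index, demo_index2document, 2)
```

Each lookup is then a single dict access rather than a scan over all words.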

Find the 10 most important keywords in each business report, according to the term-weighting matrix stored in tf_idf.

In [6]:
#find most important 10 keywords
#return pandas DataFrame
find_top_k_words(tf_idf, word2index, index2document, 10)

| | NORMA_Group-QuarterlyReport-2016-Q2 | DIC-Asset-AnnualReport-2015 | HeidelbergCement-QuarterlyReport-2011-Q3 | WCM-QuarterlyReport-2011-Q3 | Capital_Stage-QuarterlyReport-2012-Q1 | Hugo_Boss-AnnualReport-2010 | Sartorius-QuarterlyReport-2012-Q3 | Wirecard-AnnualReport-2016 | KwsSaat-QuarterlyReport-2011-Q2 | NORMA_Group-QuarterlyReport-2014-Q2 | ... | NORMA_Group-QuarterlyReport-2011-Q2 | TlgImmobilien-QuarterlyReport-2016-Q1 | Hornbach-QuarterlyReport-2014-Q3 | WackerChemie-QuarterlyReport-2013-Q2 | Gerry_Weber-AnnualReport-2013 | TAKKT-QuarterlyReport-2014-Q3 | Wacker_Neuson-QuarterlyReport-2015-Q3 | MLP-AnnualReport-2014 | EvonikIndustries-QuarterlyReport-2012-Q3 | Hella-QuarterlyReport-2014-Q3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ebita | immobilie | heidelbergcement | körperschaftsteuergesetzes | solarparks | metzingen | biohit | acquiring | züchtung | five | ... | börsengang | hotel | like | prozent | retail | takkt | m2014 | finanzdienstleistungen | evonik | hella |
| 1 | autoline | balance | zement | insolvenzverfahren | stage | permira | handling | wirecard | segmentergebnis | ebita | ... | börsenganges | assetklasse | sseverewinter | wacker | verkaufsflächen | topdeq | kompaktmaschinen | vermögensmanagement | wohnen | altersteilzeitprogramm |
| 2 | halbjahr | kodexes | zuschlagstoffe | verlustvorträge | park | einzelhandels | sparte | händler | roundup | kundenreklamationen | ... | verbindungstechnologie | immobilie | innominaltermsandby1 | betonen | kollektion | prozent | prozent | finanzholding | fotovoltaik | lippstadt |
| 3 | fremdwährungsderivate | fond | konzerngebiet | gewerbesteuer | solarstrom | kollektion | sartorius | visa | maissaatgut | group | ... | halbjahr | ankauf | inrealterms | polysilicium | hallhuber | gesellschaften | baugeräten | berater | vivawest | neuzulassungen |
| 4 | bauproduktion | mieter | klinkerabsatz | jahresabschlüsse | prozent | appreciation | liquid | payment | strukturkosten | grundlagen | ... | verbindungstechnik | büroobjekt | intensifiedonceagainnowthatmost | vorquartal | houses | ratioform | q32014 | altersvorsorge | prozent | geschäftsquartal |
| 5 | five | stimmrechtsanteil | bauaktivitäten | gespräch | solarthermie | anschrift | industrial | zahlungsabwicklung | halbjahresabschluss | halbjahr | ... | endmärkte | neuerwerbungen | inthenine | solarsilicium | logistikzentrums | geschäftsbereichs | landwirtschaft | krankenversicherung | produktionsanlage | behr |
| 6 | eur1 | fondsgeschäft | afrika | problem | warmwasser | manager | einbezug | issuing | kläger | bereinigungen | ... | personenkraftwagen | verpachtung | ingermanythusyetagainoutperform | dränbeton | wholesale | inputfaktoren | baugeräte | wiesloch | steag | fahrerassistenz |
| 7 | brexit | ermächtigung | republik | investor | globalstrahlung | valentino | sondereffekte | wirecards | landwirt | marktwertermittlung | ... | operational | büro | netofcurrencyitems | berichtsquartal | kundin | prozentbereich | marktnachfrage | bankgeschäft | nachfrageabschwächung | netzwerkes |
| 8 | materialeinsatzquote | aufsichtsrat | mittelmeerraum | prüfung | solar | lotus | ebita | fintech | geschäftsausweitung | leichtfahrzeuge | ... | einmalaufwendungen | transaktionsvolumen | itssalesby1 | chemie | stores | sonstige | neuson | finanzfragen | anhangziffer | kgaa |
| 9 | group | bürofonds | festzins | oktober | erneuerbaren | partizipationsrechte | pazific | prozent | gruppe | polk | ... | vertriebsservice | portfolioankaufs | inthequarterasawhole | absatzmengen | gruppe | neunmonatszeitraum | gesamt | geschäftsstellen | colorants | fahrzeug |
