## Implement TF-IDF from scratch

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how **important** a word is to a document in a collection or corpus.

In this notebook, we implement TF-IDF from scratch. First, we build the term-frequency (tf) matrix, normalized by document length, then compute the log-scaled inverse document frequency (idf) of each word. The TF-IDF matrix is the product of the two, TF-IDF = tf * idf, according to the formulas below:

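$$tf_{w, d} = \frac{f_{w, d}}{\sum_{w' \in d} f_{w', d}}$$
$$idf_{w, C} = \log \frac{N}{1 + |\{d \in C \mid w \in d\}|}$$

Here $f_{w, d}$ is the number of times word $w$ occurs in document $d$, $C$ is the corpus, and $N$ is the number of documents in $C$.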
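
To make the two formulas concrete, here is a minimal pure-Python sketch on a toy corpus. The corpus contents and the helper names (`toy_corpus`, `tf`, `idf`, `tf_idf_score`) are made up for illustration and are not part of the notebook's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three already-tokenized documents.
toy_corpus = {
    "doc_a": ["profit", "rose", "profit"],
    "doc_b": ["profit", "fell"],
    "doc_c": ["revenue", "rose"],
}

n_docs = len(toy_corpus)

# Document frequency: number of documents that contain each word.
doc_freq = Counter(word for tokens in toy_corpus.values() for word in set(tokens))

def tf(word, tokens):
    """Count of `word` in the document divided by the document length."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """log(N / (1 + df)), following the idf formula above."""
    return math.log(n_docs / (1 + doc_freq[word]))

def tf_idf_score(word, doc_name):
    return tf(word, toy_corpus[doc_name]) * idf(word)

# "profit" occurs in 2 of 3 documents, so idf = log(3 / 3) = 0.
print(tf_idf_score("profit", "doc_a"))   # 0.0
# "revenue" occurs in 1 of 3 documents, so idf = log(3 / 2) > 0.
print(tf_idf_score("revenue", "doc_c"))  # 0.5 * log(1.5)
```

Note that with the `1 +` term in the denominator, a word that occurs in every document gets a slightly negative idf, since $\log \frac{N}{1 + N} < 0$.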

In [1]:
import pickle
import pandas as pd
import numpy as np
import os

Before computing the TF-IDF matrix, make sure that the text corpus was generated during preprocessing. If not, go back to [Corpus Generation](03-Corpus_Generation.ipynb).

#### 1. Calculate TF-IDF on preprocessed documents
- params
    - processed_doc_path: directory containing the preprocessed (pickled) documents
    - document_limit: maximum number of documents to process with tf-idf
- variables
    - word2index: dict, key: word (str), value: index (int)
    - document2index: dict, key: document name (str), value: index (int)
    - document_word_vectors: dict, key: document name (str), value: list of word indices, one entry per token in the document (list)
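
As a rough illustration of how these three dictionaries relate, the sketch below fills them from two hypothetical tokenized documents (`toy_docs` stands in for the pickled files; the tokens are invented). It uses `dict.setdefault` rather than an explicit counter, which is a stylistic choice and not how the function in the next cell does it.

```python
# Hypothetical tokenized documents standing in for the pickled files.
toy_docs = {
    "report_1": ["Umsatz", "steigt", "umsatz"],
    "report_2": ["Gewinn", "steigt"],
}

word2index = {}
document2index = {}
document_word_vectors = {}

for doc_name, words in toy_docs.items():
    document2index[doc_name] = len(document2index)
    document_word_vectors[doc_name] = []
    for word in words:
        w = word.lower()
        # setdefault assigns the next free index the first time a word is seen
        index = word2index.setdefault(w, len(word2index))
        document_word_vectors[doc_name].append(index)

print(word2index)             # {'umsatz': 0, 'steigt': 1, 'gewinn': 2}
print(document_word_vectors)  # {'report_1': [0, 1, 0], 'report_2': [2, 1]}
```

Each document is thus stored as a sequence of vocabulary indices, and repeated indices (here `0` in `report_1`) are what later become counts in the word-frequency matrix.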

In [2]:
def tf_idf_prerocessed_doc(processed_doc_path, document_limit=200):
    
    word2index = {}
    document2index = {}
    index2document = {}
    document_word_vectors = {}
    
    word_count = 0
    doc_count = 0
    
    for root, dirs, files in os.walk(processed_doc_path):
        for file in files:
            print('.', end='')
            document_limit -= 1
            
            document_word_vectors[file] = []
            document2index[file] = doc_count
            index2document[doc_count] = file
            doc_count+=1
            
            with open(os.path.join(root, file), 'rb') as fs:
                try:
                    #load preprocessed files
                    words = pickle.load(fs)
                    for word in words:
                        w = word.lower()
                        if w not in word2index:
                            word2index[w] = word_count
                            word_count += 1
                        document_word_vectors[file].append(word2index[w])
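                except Exception:
                    print('Error while processing: ', file)
            
            if document_limit == 0:
                print('\n')
                break
        else:
            # limit not reached in this directory: continue with the next one
            continue
        # limit reached: stop walking the remaining directories
        break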
    
    
    #create word_frequency matrix                        
    word_frequencies = np.zeros((word_count, doc_count))
    
    for doc in document_word_vectors:
        i = document2index[doc]
        for j in document_word_vectors[doc]:
            word_frequencies[j,i]+=1 

    # obtain normalized term-frequency matrix: divide each column by its total word count
    sum_f = np.sum(word_frequencies, axis=0)
    # columns of empty documents stay at zero instead of triggering a division by zero
    t_f = np.divide(word_frequencies, sum_f, out=np.zeros_like(word_frequencies), where=sum_f != 0)

    # document frequency: number of documents in which each word occurs
    doc_freq = np.count_nonzero(word_frequencies, axis=1)

    # inverse document frequency: log(N / (1 + df))
    inv_doc_freq = np.log(doc_count / (1 + doc_freq))

    # tf-idf matrix: scale each word's row of t_f by that word's idf
    norm_tf_idf = np.multiply(t_f, inv_doc_freq.reshape(-1, 1))
 
    return norm_tf_idf, word2index, index2document, inv_doc_freq
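
The matrix steps of the function can be checked on a small input without any files. The following sketch assumes a hypothetical `toy_vectors` dict in the same shape as `document_word_vectors` and a vocabulary of four words; it uses `np.bincount` to build the count columns, which is an alternative to the explicit loop above, not the notebook's own method.

```python
import numpy as np

# Hypothetical input in the same shape as document_word_vectors:
# one list of word indices per document.
toy_vectors = {
    "report_1": [0, 1, 0],
    "report_2": [2, 1],
    "report_3": [3, 2, 3, 3],
}
n_words, n_docs = 4, len(toy_vectors)

# word-frequency matrix: rows are words, columns are documents
counts = np.zeros((n_words, n_docs))
for col, indices in enumerate(toy_vectors.values()):
    # bincount counts how often each word index occurs in the document
    counts[:, col] = np.bincount(indices, minlength=n_words)

tf = counts / counts.sum(axis=0)             # each column sums to 1
doc_freq = np.count_nonzero(counts, axis=1)  # documents containing each word
idf = np.log(n_docs / (1 + doc_freq))        # one idf value per word
toy_tf_idf = tf * idf[:, np.newaxis]         # broadcast idf across the columns

print(toy_tf_idf.shape)  # (4, 3)
```

Words 1 and 2 occur in two of the three documents, so their idf is `log(3 / 3) = 0` and their rows vanish; only words 0 and 3, which are specific to one document each, receive a positive score.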

In [3]:
%%time
# adjust the relative path to your preprocessed corpus if necessary
# processed_doc_path = '/home/bit/ma0/LabShare/data/chui_ma/spacy_corpus/'
processed_doc_path = '../spacy_corpus/'
tf_idf, word2index, index2document, inv_doc_freq = tf_idf_prerocessed_doc(processed_doc_path, 200)

........................................................................................................................................................................................................

CPU times: user 2.81 s, sys: 339 ms, total: 3.15 s
Wall time: 2.53 s




Overview of the tf_idf matrix (rows are words, columns are documents):

In [4]:
tf_idf

array([[1.58008316e-03, 4.27707663e-05, 1.50417815e-04, ...,
        1.70027751e-05, 2.54406129e-04, 3.80000496e-05],
       [3.38198745e-02, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [5.38161824e-05, 2.41089320e-04, 2.82623950e-04, ...,
        9.58408706e-05, 0.00000000e+00, 1.07098924e-04],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 4.78489812e-04],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 4.78489812e-04],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 4.78489812e-04]])

#### 2. Find most important words using tf-idf
- how often does a word appear in a report => "term frequency tf"
- how rare is a word across all reports => "inverse document frequency idf"
- => importance score = tf * idf
- params:
    - tf_idf: matrix, obtained from the function tf_idf_prerocessed_doc(processed_doc_path, document_limit=200)
    - word2index: dict, key: word (str), value: index (int)
    - index2document: dict, key: index (int), value: document name (str)
    - k: number of keywords to display
- variables:
    - document_important_words: dict, key: file name, value: the k words with the highest tf-idf scores in that document

In [5]:
def find_top_k_words(tf_idf, word2index, index2document, k):
    
    document_important_words = {}
    
    # invert word2index once so that each index maps back to its word
    index2word = {index: word for word, index in word2index.items()}
    
    for i in range(tf_idf.shape[1]):
        # get indices of the k largest values in column i (in no particular order)
        indices = np.argpartition(tf_idf[:, i], -k)[-k:]
        # sort these k indices by score, then reverse for descending order
        indices = indices[np.argsort(tf_idf[:, i][indices])]
        indices = indices[::-1]
        
        # look up the word that corresponds to each index
        filename = index2document[i]
        document_important_words[filename] = [index2word[ind] for ind in indices]
    
    # use a pandas DataFrame to provide better visualization
    return pd.DataFrame(document_important_words)
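
The top-k selection in this function relies on `np.argpartition` followed by a sort of only the k selected entries. A small sketch on a made-up score vector (`scores` is hypothetical and stands in for one column of `tf_idf`) shows the two steps in isolation:

```python
import numpy as np

scores = np.array([0.10, 0.50, 0.05, 0.40, 0.30])  # hypothetical tf-idf column
k = 3

# argpartition guarantees that the last k positions hold the indices of the
# k largest values, but in no particular order
top_k = np.argpartition(scores, -k)[-k:]

# sort just those k indices by score, then reverse for descending order
top_k = top_k[np.argsort(scores[top_k])][::-1]

print(top_k)  # [1 3 4]
```

Partitioning first avoids sorting the whole vocabulary for every document; only the k candidates are fully sorted. This assumes k does not exceed the number of rows in the matrix.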

Find the 10 most important keywords in each business report, according to the term-weighting matrix stored in tf_idf.

In [6]:
#find most important 10 keywords
#return pandas DataFrame
find_top_k_words(tf_idf, word2index, index2document, 10)

| | DeutscheWohnen-AnnualReport-2016 | BayWa-QuarterlyReport-2015-Q2 | Freenet-QuarterlyReport-2011-Q1 | BASF-QuarterlyReport-2017-Q1 | DIC-Asset-AnnualReport-2014 | Fraport-QuarterlyReport-2016-Q1 | Deutsche_Euroshop-QuarterlyReport-2010-Q2 | Aixtron-QuarterlyReport-2011-Q1 | Bechtle-QuarterlyReport-2015-Q2 | Fresenius-QuarterlyReport-2017-Q1 | ... | BVB-QuarterlyReport-2015-Q1 | Deutsche_Post-AnnualReport-2014 | eon-AnnualReport-2016 | ADVA_Optical-QuarterlyReport-2012-Q2 | Deutsche_Post-QuarterlyReport-2014-Q1 | AmadeusFiRe-QuarterlyReport-2014-Q1 | Drillisch-QuarterlyReport-2013-Q1 | Durr-QuarterlyReport-2013-Q3 | Beiersdorf-QuarterlyReport-2014-Q1 | Fresenius_Medical_Care-QuarterlyReport-2016-Q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wohnen | baywa | freenet | sondereinflüssen | dic | fcs | euroshop | aixtron | bechtle | fresenius | ... | borussia | post | uniper | optical | post | amadeus | drillisch | dürr | beiersdorf | gesundheitsdienstleistungen |
| 1 | gsw | agrar | cashﬂ | betriebstätigkeit | asset | fraport | wildau | gasphasenabscheidung | prozent | quirónsalud | ... | dortmund | dhl | abspaltung | networking | unternehmensbereich | fire | freenet | tsd | nivea | fresenius |
| 2 | wohnungen | energien | millionen | chemicals | maintor | passagier | kassel | herzogenrath | tsd | onzern | ... | iduna | unternehmensbereich | kundenlösungen | adva | forwarding | zeitarbeit | millionen | thermotechnik | tesa | wechselkursen |
| 3 | immobilien | prozent | vergleichsquartal | basf | höller | lima | dresden | halbleiterindustrie | modus | medical | ... | dortmunder | express | energien | millionen | dhl | personalvermittlung | msp | systems | zwischenbericht | niereninsuffizienz |
| 4 | betreut | agrarhandel | weitergeführt | verkaufsmengen | main | anteilsverkauf | wuppertal | depositionsanlagen | managed | kgaa | ... | signal | forwarding | prozent | netzbetreiber | pep | fakturierbaren | prozent | cleaning | unternehmensbereich | cms |
| 5 | katharinenhof | regenerative | next | verkaufspreise | abs | passagiere | kommanditanteile | bauelemente | neckarsulm | kabi | ... | spielzeit | sar | beziehungsweise | proforma | freight | weiterbildung | maintal | bayreuth | consumer | fmch |
| 6 | berlin | beratungsdienst | aufgegeben | vorjahresquartal | beteiligungs | ljubljana | center | leds | anwendungslösungen | fmch | ... | kgaa | berichtsjahr | energienetze | fsp | chain | berichtsquartal | stichtagsbewertung | ltb | shower | dialyseprodukten |
| 7 | epra | solar | üsse | mengen | frankfurt | ausgabeverhalten | bewertungsergebnis | vorquartal | consult | care | ... | mannschaft | chain | textziffer | entwicklungskosten | supply | zeitarbeitsmitarbeiter | mvno | assembly | eucerin | versorgungsmanagement |
| 8 | pflegen | neuseeland | datendienste | fixkosten | stimmrechtsanteil | antalya | shoppingcentern | jahresvergleich | schweizer | helios | ... | industries | freight | preussenelektra | entwicklungsprojekte | währungseffekte | zeitarbeitnehmer | vodafone | filtration | vorjahrs | patienten |
| 9 | philip | limited | vertriebsrecht | immaterielles | kodexes | konstante | ece | grundbeleuchtungseinheiten | neuaufnahme | onzerns | ... | evonik | supply | erneuerbare | netze | express | zeitarbeitsbranche | mobil | clean | nominal | dialysedienstleistungen |


For comparison, the Scikit-Learn library also provides a TF-IDF implementation; have a look at the notebook [TF-IDF_ScikitLearn](05-TF-IDF_ScikitLearn.ipynb).