## Implement TF-IDF using Scikit-Learn Library

Documentation: [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Convert a collection of raw documents to a matrix of TF-IDF features.
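
As a quick illustration of what this conversion produces, here is a minimal sketch on a toy corpus. The three documents are made up for the example and are not taken from the business reports; only the default settings of `TfidfVectorizer` are used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (assumption: not part of the real report data).
docs = [
    "umsatz quartal wachstum",
    "umsatz vorjahr",
    "quartal vorjahr risiko",
]

vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(docs)   # sparse matrix: documents x terms

# Terms in column order, read from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(tf_idf.shape)   # (3, 5)
print(terms)          # ['quartal', 'risiko', 'umsatz', 'vorjahr', 'wachstum']
```

Each row of the result corresponds to one document and each column to one term of the vocabulary.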

In [1]:
import os
import json
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import pickle
import numpy as np

Before computing the TF-IDF matrix, make sure that the text corpus has been generated during preprocessing. If not, go back to [Corpus Generation](03-Corpus_Generation.ipynb).

#### 1. Calculate TF-IDF on preprocessed documents
- params: 
    - processed_doc_path: directory containing the preprocessed documents (each file holds the pickled tokens of one document)
    - document_limit: maximum number of documents to process with TF-IDF
- variables:
    - dictionary: set of all unique tokens
    - documents_list: list with one entry per document, holding all tokens that occurred in that document as a space-separated string
    - files_list: list of file names, in the same order as documents_list
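
The function below reads every file with `pickle.load` and iterates over the result as tokens, so each file presumably contains a pickled sequence of strings. The following sketch shows that assumed layout as a round trip; the file name and the tokens are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical tokens of one preprocessed document.
tokens = ["umsatz", "quartal", "wachstum"]

with tempfile.TemporaryDirectory() as corpus_dir:
    # Hypothetical file name, modelled on the report naming scheme.
    path = os.path.join(corpus_dir, "Example-QuarterlyReport-2016-Q1")

    # Write the tokens the way the preprocessing step is assumed to.
    with open(path, "wb") as fw:
        pickle.dump(tokens, fw)

    # Read them back the way the TF-IDF function does.
    with open(path, "rb") as fr:
        loaded = pickle.load(fr)

print(loaded)   # ['umsatz', 'quartal', 'wachstum']
```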

In [None]:
def scikit_tf_idf_PrerocessedDoc(processed_doc_path, document_limit=200):
    
    dictionary = set()      # unique tokens over all documents
    documents_list = []     # one space-separated token string per document
    files_list = []         # file names, in the same order as documents_list
    
    for root, dirs, files in os.walk(processed_doc_path):
        for f in files:
            document_limit -= 1
            files_list.append(f)
            print('.', end='')
            words_list = ''
            with open(os.path.join(root, f), 'rb') as fr:
                try:
                    words = pickle.load(fr)
                    for word in words:
                        dictionary.add(word)
                        words_list += word + ' '
                except Exception:
                    print('Error while processing file: ', f)
            
            documents_list.append(words_list)
            
            if document_limit == 0:
                print('\n')
                break

        # the break above only leaves the file loop; stop walking further directories too
        if document_limit == 0:
            break

    # the vocabulary is fixed to the tokens collected above
    sklearn_vector = TfidfVectorizer(vocabulary=dictionary)
    sklearn_tf_idf = sklearn_vector.fit_transform(documents_list)
    return sklearn_vector, sklearn_tf_idf, files_list


### dsp:
  * Your `sklearn_vector` is not a vector but an algorithm, namely a vectorizer.
  * For me `documents`, `files` would be as good as `documents_list`, `files_list`.
  * You can get rid of the for-loop over words: `dictionary.update(words)` plus `words_list = ' '.join(words)`.
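
The third suggestion can be checked on a small token list. This is only a sketch with hypothetical tokens; it compares the loop-based accumulation with the loop-free variant.

```python
# Hypothetical token list standing in for one unpickled document.
words = ["umsatz", "quartal", "umsatz", "wachstum"]

# Loop-based accumulation: add each token, append it plus a space.
dictionary_loop = set()
words_list_loop = ''
for word in words:
    dictionary_loop.add(word)
    words_list_loop += word + ' '

# Loop-free variant from the review comment.
dictionary = set()
dictionary.update(words)
words_list = ' '.join(words)

print(dictionary == dictionary_loop)            # True
print(words_list == words_list_loop.rstrip())   # True
```

The only difference is the trailing space left by the loop version, which the vectorizer's tokenizer ignores.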

In [3]:
%%time
#please modify the relative path
# processed_doc_path = '/home/bit/ma0/LabShare/data/chui_ma/spacy_corpus/'
processed_doc_path = '../spacy_corpus/'
sklearn_vector, sklearn_tf_idf, files_list = scikit_tf_idf_PrerocessedDoc(processed_doc_path, 200)

........................................................................................................................................................................................................

CPU times: user 1.02 s, sys: 56 ms, total: 1.08 s
Wall time: 957 ms


#### 2. Find most important words using tf-idf
- how often does a word appear in a report => “term frequency tf ”
- how rare is a word across all reports => “inverse document frequency idf”
- => importance score = tf * idf
- params:
    - sklearn_vector: the fitted TfidfVectorizer (a vectorizer, not a vector)
    - tf_idf: dense array with one row per term and one column per document, i.e. the transpose of the matrix returned by scikit_tf_idf_PrerocessedDoc(processed_doc_path, document_limit=200)
    - k: number of keywords to display
    - filenames: list of file names
- variables: 
    - document_important_words: dict, key: file name, value: important words after calculating tf-idf
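
Note that `TfidfVectorizer` with default settings does not return the plain product tf * idf: it uses a smoothed idf, ln((1 + n) / (1 + df)) + 1, and then L2-normalises each document row. The sketch below recomputes the scores of one document by hand and compares them with scikit-learn's result. The toy corpus is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (assumption: not part of the real report data).
docs = ["umsatz quartal umsatz", "quartal vorjahr", "vorjahr risiko"]

vectorizer = TfidfVectorizer()   # defaults: smooth_idf=True, norm='l2'
tf_idf = vectorizer.fit_transform(docs).toarray()

# Terms in column order, read from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Recompute the first document by hand.
n_docs = len(docs)
tf = np.array([docs[0].split().count(t) for t in terms], dtype=float)
df = np.array([sum(t in d.split() for d in docs) for t in terms], dtype=float)
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency
manual = tf * idf
manual /= np.linalg.norm(manual)            # L2 normalisation of the row

print(np.allclose(manual, tf_idf[0]))       # True
```

Because of the normalisation, scores are comparable within a document but should not be read as raw tf * idf products.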

In [4]:
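def sklearn_find_top_k_words(sklearn_vector, tf_idf, k, filenames):
    
    document_important_words = {}
    # scikit-learn >= 1.0; older releases provide get_feature_names() instead
    corpus = sklearn_vector.get_feature_names_out()
    
    # tf_idf is expected as terms x documents, so each column is one document
    for i in range(tf_idf.shape[1]):
        important_words = []
        
        # indices of the k largest values in column i;
        # argpartition does not sort them, so the order within the top k is arbitrary
        index = np.argpartition(tf_idf[:, i], -k)[-k:]
        index = index[::-1]
        
        for ind in index:
            important_words.append(corpus[ind])
        
        document_important_words[filenames[i]] = important_words
        
    # build the DataFrame once, after all documents have been processed
    df = pd.DataFrame(document_important_words)
    return df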
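
The selection step relies on `np.argpartition`, which returns the indices of the k largest values but does not rank them. The following sketch, using hypothetical scores, contrasts it with `np.argsort`, which could be used if a strict highest-to-lowest ranking is wanted.

```python
import numpy as np

# Hypothetical TF-IDF scores of six terms in a single document.
scores = np.array([0.10, 0.70, 0.05, 0.40, 0.90, 0.20])
k = 3

# argpartition: indices of the k largest values, in no guaranteed order.
top_unordered = np.argpartition(scores, -k)[-k:]

# argsort: the same indices, ranked from highest to lowest score.
top_ranked = np.argsort(scores)[::-1][:k]

print(sorted(top_unordered.tolist()))   # [1, 3, 4]
print(top_ranked.tolist())              # [4, 1, 3]
```

`argpartition` runs in linear time, while `argsort` sorts the whole column; for a vocabulary of this size either choice is likely acceptable.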

Find the 10 most important keywords in each business report, according to the TF-IDF term-weighting matrix.

In [5]:
#find most important 10 keywords
#return pandas DataFrame
sk_tf_idf = np.transpose(sklearn_tf_idf)
sklearn_find_top_k_words(sklearn_vector, sk_tf_idf.toarray(), 10, files_list)

Unnamed: 0,NORMA_Group-QuarterlyReport-2016-Q2,DIC-Asset-AnnualReport-2015,HeidelbergCement-QuarterlyReport-2011-Q3,WCM-QuarterlyReport-2011-Q3,Capital_Stage-QuarterlyReport-2012-Q1,Hugo_Boss-AnnualReport-2010,Sartorius-QuarterlyReport-2012-Q3,Wirecard-AnnualReport-2016,KwsSaat-QuarterlyReport-2011-Q2,NORMA_Group-QuarterlyReport-2014-Q2,...,NORMA_Group-QuarterlyReport-2011-Q2,TlgImmobilien-QuarterlyReport-2016-Q1,Hornbach-QuarterlyReport-2014-Q3,WackerChemie-QuarterlyReport-2013-Q2,Gerry_Weber-AnnualReport-2013,TAKKT-QuarterlyReport-2014-Q3,Wacker_Neuson-QuarterlyReport-2015-Q3,MLP-AnnualReport-2014,EvonikIndustries-QuarterlyReport-2012-Q3,Hella-QuarterlyReport-2014-Q3
0,vergleichen,immobilie,anleihe,gesellschaft,solarstrom,metzingen,vorjahr,acquiring,gruppe,vergleichen,...,halbjahr,quartal,like,wacker,hallhuber,topdeq,baugeräten,finanzdienstleistungen,prozent,hella
1,region,aufsichtsrat,zement,verlustvorträge,park,einzelhandels,sparte,payment,züchtung,juni,...,börsengang,immobilie,thanksaboveall,juni,kollektion,geschäftsbereichs,prozent,aufsichtsrat,evonik,vorjahr
2,juni,aktie,september,insolvenzverfahren,solarthermie,aufsichtsrat,sartorius,visa,segmentergebnis,five,...,quartal,märz,thatdiydemandoutside,quartal,verkaufsflächen,takkt,m2014,vorjahr,wohnen,altersteilzeitprogramm
3,quartal,vorstehen,zuschlagstoffe,körperschaftsteuergesetzes,prozent,konzern,biohit,unternehmen,roundup,höhe,...,wachstum,hotel,the2013,polysilicium,retail,prozent,konzern,berater,fotovoltaik,wachstum
4,halbjahr,unternehmen,heidelbergcement,investor,solarparks,geschäftsjahr,handling,risiko,segment,quartal,...,höhe,assetklasse,terof2014,vorjahr,vorjahr,gesellschaften,kompaktmaschinen,finanzholding,produktionsanlage,fahrzeug
5,monat,fond,höhe,oktober,quartal,permira,liquid,zahlungsabwicklung,maissaatgut,monat,...,börsenganges,ankauf,itsoftware,betonen,gruppe,september,vorjahr,vermögensmanagement,vivawest,neuzulassungen
6,ebita,balance,konzerngebiet,gewerbesteuer,stage,unternehmen,industrial,prozent,strukturkosten,halbjahr,...,kosten,vergleichen,thegroupsignificantlyoffsethigh,prozent,geschäftsjahr,sonstige,quartal,geschäftsjahr,geschäfts,kgaa
7,höhe,gesellschaft,bauaktivitäten,gespräch,ausbau,dezember,einbezug,händler,halbjahresabschluss,ebita,...,kapitalerhöhung,büro,sseverewinter,vorquartal,houses,ebitda,umsatz,altersvorsorge,steag,lippstadt
8,dezember,grund,klinkerabsatz,jahresabschlüsse,solar,kollektion,sondereffekte,wirecard,halbjahr,group,...,kunde,verpachtung,thediystoreswithgardencenters,berichtsquartal,risiko,stufen,monat,risiko,segment,behr
9,aufwendung,höhe,republik,problem,erneuerbaren,manager,dezember,bereich,kläger,schulde,...,juni,ergebnis,itssalesby1,chemie,wholesale,monat,september,unternehmen,bereinigungen,geschäftsquartal
