# TF-IDF keywords trend

## Introduction to TF-IDF
This notebook contains code that uses the TF-IDF technique to analyze the keyword trends of a company over several years. TF-IDF is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). Term Frequency is defined as follows:

<img src="http://latex.codecogs.com/svg.latex?tf(t,d)&space;=&space;number\&space;of\&space;word\&space;t\&space;in\&space;document\&space;d" title="tf(t,d) = number\ of\ word\ t\ in\ document\ d" />

Here t refers to a word and d to a document, so the Term Frequency is simply the number of times the word t occurs in the document d.
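
As a minimal sketch of this definition, the raw count can be computed with `collections.Counter`. The tokenized document below is made up for illustration; it is not taken from the annual reports.

```python
from collections import Counter

def term_frequency(tokens):
    """Return tf(t, d): the raw count of each word t in one tokenized document d."""
    return Counter(tokens)

# Hypothetical, already tokenized document.
document = ["bmw", "motorrad", "bmw", "umsatz", "bmw"]
tf = term_frequency(document)

print(tf["bmw"])     # "bmw" occurs three times
print(tf["brexit"])  # a word that never occurs has a count of 0
```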

Inverse Document Frequency is defined as follows:

<img src="http://latex.codecogs.com/svg.latex?idf(t,D)&space;=&space;log\frac{N}{\begin{vmatrix}\begin{Bmatrix}d&space;\in&space;D;t&space;\in&space;d&space;\end{Bmatrix}&space;\end{vmatrix}}" title="idf(t,D) = log\frac{N}{\begin{vmatrix}\begin{Bmatrix}d \in D;t \in d \end{Bmatrix} \end{vmatrix}}" />

Here d and t again denote a document and a word, respectively. The capital D refers to the corpus, which is the set of all documents, and the capital N is the number of documents in this corpus. The denominator is the number of documents in the corpus D that contain the word t. If a word occurs in many documents, its IDF value is low. This makes sense, because a word that occurs in more documents is usually less distinctive and carries less meaning.
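
The following sketch applies the formula to a small, made-up corpus in which each document is reduced to its set of words. The function name and the corpus are assumptions for illustration only.

```python
import math

def inverse_document_frequency(term, corpus):
    """idf(t, D) = log(N / |{d in D : t in d}|) for a term that occurs at least once."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(n_docs / n_containing)

# Hypothetical corpus: one set of words per document.
corpus = [
    {"bmw", "umsatz", "husqvarna"},
    {"bmw", "umsatz", "drivenow"},
    {"bmw", "umsatz", "brexit"},
    {"bmw", "brexit"},
]

for term in ["bmw", "umsatz", "brexit", "husqvarna"]:
    print(term, round(inverse_document_frequency(term, corpus), 3))
```

The word that appears in every document ("bmw") gets an IDF of log(4/4) = 0, while the word found in a single document ("husqvarna") gets the highest value, log(4).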


The mathematical form of TF-IDF is:

<img src="http://latex.codecogs.com/svg.latex?tfidf(t,d,D)&space;=&space;tf(t,d)*idf(t,D)" title="tfidf(t,d,D) = tf(t,d)*idf(t,D)" />

For a given corpus D, the tfidf function has two indices, the word t and the document d. We can therefore represent it as a two-dimensional matrix with one axis for words and one for documents.

The interpretation of TF-IDF is as follows: if the TF-IDF value of a word in a document is high, the word is important for that document and can serve as one of its keywords. A high TF-IDF value requires both a high TF (Term Frequency) and a high IDF (Inverse Document Frequency), which means the word occurs many times in this document and hardly occurs in the other documents. The word therefore distinguishes this document from the others and can be seen as a keyword of it.
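
To make the matrix view concrete, here is a small pure-Python sketch that follows the textbook formulas above (raw counts and unsmoothed logarithm). The corpus and the function name `tfidf_matrix` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_matrix(corpus):
    """Return (terms, matrix) with matrix[i][j] = tfidf(terms[i], corpus[j], corpus)."""
    n_docs = len(corpus)
    counts = [Counter(doc) for doc in corpus]
    terms = sorted(set().union(*corpus))
    matrix = []
    for term in terms:
        # Number of documents that contain the term at least once.
        doc_freq = sum(1 for c in counts if term in c)
        idf = math.log(n_docs / doc_freq)
        matrix.append([c[term] * idf for c in counts])
    return terms, matrix

# Hypothetical tokenized documents.
corpus = [
    ["bmw", "husqvarna", "husqvarna", "umsatz"],
    ["bmw", "drivenow", "umsatz"],
    ["bmw", "brexit", "brexit", "brexit"],
]
terms, matrix = tfidf_matrix(corpus)
for term, row in zip(terms, matrix):
    print(f"{term:10s}", [round(v, 3) for v in row])
```

In the third document, "brexit" receives the highest score because it is frequent there and absent elsewhere, while "bmw" scores 0 in every document because it appears in all of them.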


## Structure of this notebook
In this notebook, we first use TF-IDF to analyze BMW's annual reports from 2010 to 2017 and extract the trends of their keywords. For some of these keywords, we then explain why they are keywords of the respective reports.  
We also analyze Deutsche Bank's annual reports from 2010 to 2016 and extract the trends of their keywords.  

Since GitHub does not render the visualizations in this notebook, we save all images in the img folder.



In [2]:
import spacy
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS
import pandas as pd
from sklearn.cluster import KMeans
import time

In [3]:
#other needed configurations
%run src/file_utils.py
%run src/configuration.py
%run "load_and_prepro_document.ipynb"

# BMW 2010-2017

Here we apply TF-IDF to BMW's annual reports from 2010 to 2017. The corpus consists of these 8 documents.  

In [4]:
#the file list of BMW from 2010 to 2017
bmw_lemm_docs_prep = [
     'BMW-AnnualReport-2010.json', 
     'BMW-AnnualReport-2011.json', 
     'BMW-AnnualReport-2012.json',
     'BMW-AnnualReport-2013.json', 
     'BMW-AnnualReport-2014.json', 
     'BMW-AnnualReport-2015.json',
     'BMW-AnnualReport-2016.json', 
     'BMW-AnnualReport-2017.json']

In [5]:
#identity function: passing it to TfidfVectorizer would stop it from preprocessing and splitting the words again
def preProcess(s):
    return s
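
The notebook defines `preProcess` but then calls `TfidfVectorizer()` with its default settings, so the helper is not actually used. As a hedged sketch of how such an identity function could be wired in, the example below passes it as a callable `analyzer`, which scikit-learn applies to each raw document in place of its own preprocessing and tokenization. It assumes the documents are already lists of tokens, which may not match what `get_clean_data` returns; the toy documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(tokens):
    """Return the token list unchanged."""
    return tokens

# Hypothetical pre-tokenized documents (lists of lemmas, case preserved).
docs = [
    ["BMW", "i3", "Elektromobilität"],
    ["BMW", "Brexit"],
]

# A callable analyzer replaces preprocessing, tokenization and n-gram building.
vectorizer = TfidfVectorizer(analyzer=identity)
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)                    # (documents, distinct tokens)
print(sorted(vectorizer.vocabulary_))  # tokens are kept exactly as given
```

With this setup the tokens are not lowercased, so row labels such as "BMW" would keep their original case.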

In [6]:
#remove all the stop words and other meaningless characters
bmw_doc, bmw_name = get_clean_data(bmw_lemm_docs_prep)

#compute TF-IDF, build the tfidf matrix and print the elapsed time in seconds
vectorizer_bmw = TfidfVectorizer()
start_time = time.time()
tfidf_matrix_bmw = vectorizer_bmw.fit_transform(bmw_doc)
print(time.time() - start_time)

filtered_BMW-AnnualReport-2010.json has already done preprocess
filtered_BMW-AnnualReport-2011.json has already done preprocess
filtered_BMW-AnnualReport-2012.json has already done preprocess
filtered_BMW-AnnualReport-2013.json has already done preprocess
filtered_BMW-AnnualReport-2014.json has already done preprocess
filtered_BMW-AnnualReport-2015.json has already done preprocess
filtered_BMW-AnnualReport-2016.json has already done preprocess
filtered_BMW-AnnualReport-2017.json has already done preprocess
0.25508904457092285
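
Note that `TfidfVectorizer()` with default settings does not compute exactly the formula from the introduction. According to the scikit-learn documentation, the defaults `smooth_idf=True` and `norm='l2'` use a smoothed IDF and scale each document vector to unit length. The sketch below reproduces these two steps in plain Python; the function names are illustrative, not part of scikit-learn.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF variant documented for scikit-learn's default smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalize(vector):
    """Scale a document's TF-IDF vector to unit Euclidean length (norm='l2')."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector] if length else list(vector)

n_docs = 8  # size of the BMW corpus
print(smoothed_idf(n_docs, 8))  # word found in all 8 reports
print(smoothed_idf(n_docs, 1))  # word found in a single report
print(l2_normalize([3.0, 4.0]))
```

One consequence is that the smallest possible IDF is 1 rather than 0, so a word that occurs in every report still receives a nonzero TF-IDF score.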


Here we use pandas to show the result. However, GitHub does not render this pandas table, so readers may need to rerun this part.

In [7]:
#use pandas to show the result (makes the data structure clearer)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
bmw_feature_names = vectorizer_bmw.get_feature_names_out()
bmw_corpus_index = [
    'BMW-2010', 'BMW-2011', 'BMW-2012',
    'BMW-2013', 'BMW-2014', 'BMW-2015',
    'BMW-2016', 'BMW-2017']
idf = vectorizer_bmw.idf_
#rows are words, columns are documents, plus one extra column with each word's idf
df = pd.DataFrame(tfidf_matrix_bmw.T.toarray(), index=bmw_feature_names, columns=bmw_corpus_index)
df['idf'] = idf
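
Once the scores are in a table like this, the keyword candidates of one report are simply the words with the largest values in that report's column. The sketch below shows the idea in plain Python; the function name and the scores are made up and are not results from the reports.

```python
import heapq

def top_keywords(scores, k=3):
    """Return the k highest-scoring (term, score) pairs of one document."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Made-up TF-IDF scores standing in for one column of the table.
scores_one_year = {"brexit": 0.041, "co2": 0.030, "drivenow": 0.022, "umsatz": 0.004}

for term, score in top_keywords(scores_one_year, k=2):
    print(term, score)
```

With the DataFrame built above, `df['BMW-2016'].nlargest(10)` should give the equivalent ranking for a single report.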

# Deutsche Bank 2010-2016

We follow the same procedure for Deutsche Bank's annual reports.

In [8]:
#the file list of Deutsche Bank from 2010 to 2016
db_lemm_docs_prep = [
     'DeutscheBank-AnnualReport-2010.json', 
     'DeutscheBank-AnnualReport-2011.json', 
     'DeutscheBank-AnnualReport-2012.json',
     'DeutscheBank-AnnualReport-2013.json', 
     'DeutscheBank-AnnualReport-2014.json', 
     'DeutscheBank-AnnualReport-2015.json',
     'DeutscheBank-AnnualReport-2016.json']

In [9]:
#remove the stop words and other meaningless characters
db_doc, db_name = get_clean_data(db_lemm_docs_prep)

#compute TF-IDF, build the tfidf matrix and print the elapsed time in seconds
vectorizer_db = TfidfVectorizer()
start_time = time.time()
tfidf_matrix_db = vectorizer_db.fit_transform(db_doc)
print(time.time() - start_time)

filtered_DeutscheBank-AnnualReport-2010.json has already done preprocess
filtered_DeutscheBank-AnnualReport-2011.json has already done preprocess
filtered_DeutscheBank-AnnualReport-2012.json has already done preprocess
filtered_DeutscheBank-AnnualReport-2013.json has already done preprocess
filtered_DeutscheBank-AnnualReport-2014.json has already done preprocess
filtered_DeutscheBank-AnnualReport-2015.json has already done preprocess
filtered_DeutscheBank-AnnualReport-2016.json has already done preprocess
0.44694948196411133


In [10]:
#use pandas to show the result (makes the data structure clearer)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
db_feature_names = vectorizer_db.get_feature_names_out()
db_corpus_index = [
    'DB-2010', 'DB-2011', 'DB-2012',
    'DB-2013', 'DB-2014', 'DB-2015',
    'DB-2016']
idf = vectorizer_db.idf_
#rows are words, columns are documents, plus one extra column with each word's idf
df_db = pd.DataFrame(tfidf_matrix_db.T.toarray(), index=db_feature_names, columns=db_corpus_index)
df_db['idf'] = idf

# Visualization

Here is the code for the visualization. The results are not shown in this notebook; readers can find them in the img folder.

In [13]:
import plotly as py
import plotly.graph_objs as go
import numpy as np

#set up the configuration of plotly in notebook
py.offline.init_notebook_mode(connected=True)

## BMW

In [14]:
#the keywords chosen from the BMW Annual reports
key = ['Husqvarna', 'aktienbasierte', 'Citroën', 'electrification', 'amsterdam', 'Drivenow', 'co2', 'brexit', 'HERE / Amsterdam', 'there']

Here we show some keywords extracted from BMW's annual reports, and they are indeed meaningful. For example, "HERE / Amsterdam" is a keyword of one year's annual report because in that year BMW bought a navigation company called "HERE", which is based in Amsterdam.

In [15]:
#get the values of the chosen rows (y-values of the visualization)
#row labels are lowercase because TfidfVectorizer lowercases words by default
y1 = df.loc['husqvarna'].tolist()
y3 = df.loc['citroën'].tolist()
y5 = df.loc['amsterdam'].tolist()
y6 = df.loc['drivenow'].tolist()
y7 = df.loc['co2'].tolist()
y8 = df.loc['brexit'].tolist()
y9 = df.loc['here'].tolist()
y10= df.loc['there'].tolist()

In [16]:
#x-values (years 2010-2017) of the visualization
years = np.arange(2010, 2018)

#define all the lines(keywords) with the data from TF/IDF
line1 = go.Scatter(x=years, y=y1, mode='lines+markers', name=key[0])
line3 = go.Scatter(x=years, y=y3, mode='lines+markers', name=key[2])
line6 = go.Scatter(x=years, y=y6, mode='lines+markers', name=key[5])
line7 = go.Scatter(x=years, y=y7, mode='lines+markers', name=key[6])
line8 = go.Scatter(x=years, y=y8, mode='lines+markers', name=key[7])
line9 = go.Scatter(x=years, y=y9, mode='lines+markers', name=key[8])

In [17]:
#the layout of visualization of BMW (main title, axis title)
decay = [line1, line3]
increase = [line6, line7, line8, line9]

layout_decay = dict(title = 'BMW:TF-IDF keywords trend - decrease', xaxis = dict(title = 'years'))
fig_decay = dict(data=decay, layout=layout_decay)

layout_increase = dict(title = 'BMW:TF-IDF keywords trend - increase', xaxis = dict(title = 'years'))
fig_increase = dict(data=increase, layout=layout_increase)

#draw the line chart
py.offline.iplot(fig_decay, filename='BMW:TF-IDF keywords trend - decrease')
py.offline.iplot(fig_increase, filename='BMW:TF-IDF keywords trend - increase')
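
The keywords above were sorted into the "decrease" and "increase" groups by hand. As a hedged sketch of how the same split could be automated, the sign of a least-squares slope over the yearly scores can serve as a simple trend indicator. The function name and the two score series are made up for illustration; they are not values from the reports.

```python
def trend_slope(years, values):
    """Least-squares slope of values over years; > 0 means rising, < 0 falling."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    var = sum((x - mean_x) ** 2 for x in years)
    return cov / var

years = list(range(2010, 2018))
# Made-up score series, only to exercise the function.
fading = [0.08, 0.06, 0.05, 0.02, 0.0, 0.0, 0.0, 0.0]
emerging = [0.0, 0.0, 0.0, 0.0, 0.01, 0.02, 0.05, 0.07]

print(trend_slope(years, fading) < 0)    # falling trend
print(trend_slope(years, emerging) > 0)  # rising trend
```

A single slope is a coarse summary: a keyword that peaks in the middle of the period would need a different measure, such as the year of its maximum score.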

## Deutsche Bank

In [18]:
#the keywords chosen from the Deutsche Bank Annual reports
key_co = ['Goodwill', 'ABN AMRO', 'Deutsche Bank National Trust Co.']

These are the keywords for Deutsche Bank, and they are also meaningful. For instance, "Deutsche Bank National Trust Co." is the name of Deutsche Bank's American entity, and in that year Deutsche Bank received a high penalty from the US government.

In [19]:
#get the values of the chosen rows (y-values of the visualization)
#row labels are lowercase because TfidfVectorizer lowercases words by default
y1 = df_db.loc['goodwill'].tolist()
y2 = df_db.loc['amro'].tolist()
y3 = df_db.loc['dbntc'].tolist()

In [20]:
#x-values (years 2010-2016) of the visualization
years = np.arange(2010, 2017)

#define all the lines(keywords) with the data from TF/IDF
line1 = go.Scatter(x=years, y=y1, mode='lines+markers', name=key_co[0])
line2 = go.Scatter(x=years, y=y2, mode='lines+markers', name=key_co[1])
line3 = go.Scatter(x=years, y=y3, mode='lines+markers', name=key_co[2])

In [21]:
#the layout of visualization of Deutsche Bank (main title, axis title)
company = [line1, line2, line3]

layout_company = dict(title = 'Deutsche Bank:TF-IDF keywords trend', xaxis = dict(title = 'years'))
fig_company = dict(data=company, layout=layout_company)

#draw the line chart
py.offline.iplot(fig_company, filename='Deutsche Bank:TF-IDF keywords trend')

## Summary
All these results show that TF-IDF is an easy and effective way to find keywords for each document. However, TF-IDF can only find keywords that make one document different from the others; it cannot surface topics that appear in every document but are nevertheless important.
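
This limitation follows directly from the textbook definition in the introduction. The short sketch below uses made-up yearly counts for a word that appears in all eight reports: its raw frequency grows strongly, yet its TF-IDF score stays at zero in every year.

```python
import math

n_docs = 8
# Made-up raw counts of one word that occurs in every report.
counts_per_year = [12, 15, 21, 30, 44, 52, 60, 75]
doc_freq = sum(1 for c in counts_per_year if c > 0)

idf = math.log(n_docs / doc_freq)  # log(8 / 8) = 0
tfidf_per_year = [c * idf for c in counts_per_year]
print(tfidf_per_year)
```

A rising topic of this kind would only become visible in the raw term frequencies, not in the TF-IDF trend.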