# Natural Language Processing and Swedish Mutual Funds

#### We will apply NLP to a collection of short descriptions of 1320 mutual funds. The primary NLP package we will be working with is NLTK. The fund descriptions are in Swedish, and NLTK has some support for the Swedish language.

#### First, we import the packages we'll be using and open the fund dictionary (i.e. hash table), which is currently stored in JSON format.

In [260]:
import nltk
from nltk.text import Text
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.snowball import SwedishStemmer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import random
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.metrics import pairwise_distances
import json
from collections import defaultdict
import warnings
warnings.filterwarnings("ignore")

with open('dict_of_funds.json', 'r') as g:
     dict_of_funds = json.load(g)

In [223]:
len(dict_of_funds)

1320

#### There are 1320 funds, and below is an example of the information available for a given fund. We will primarily be interested in a short 1-4 sentence summary of the fund, which is stored under the key 'info'.

In [224]:
dict_of_funds['Handelsbanken Tillväxtmarknad Tema']

{'10_yr': 52.5,
 '1_yr': 16.41,
 '3_yr': 23.6,
 '5_yr': 68.61,
 'ISIN': 'SE0000429748',
 'bank': 'Handelsbanken',
 'category': 'Tillväxtmarknader',
 'fund_type': 'Aktiefond',
 'holdings': {'AIA Group Ltd': '2.1',
  'Alibaba Group Holding Ltd ADR': '6.4',
  'Kroton Educacional SA': '2.2',
  'Naspers Ltd Class N': '1.6',
  'NetEase Inc ADR': '2.8',
  'Samsung Electronics Co Ltd': '4.6',
  'Sberbank of Russia PJSC': '2.7',
  'Taiwan Semiconductor Manufacturing Co Ltd': '5.8',
  'Tencent Holdings Ltd': '8.7',
  'Övrigt': '63.3'},
 'industries': {'Fastigheter': '1.3',
  'Finans': '24.5',
  'Industri': '6.6',
  'Kommunikation': '3.0',
  'Konsument, cyklisk': '13.3',
  'Konsument, stabil': '10.7',
  'Råvaror': '1.2',
  'Sjukvård': '3.5',
  'Teknik': '36.0',
  'Övrigt': '0.0'},
 'info': 'Fonden är en aktivt förvaltad värdepappersfond som placerar främst i börsnoterade aktier utgivna av företag i Asien, Latinamerika, Östeuropa och Afrika. Målet är att över tiden överträffa den genomsnittliga av

#### Next we convert the dictionary to a pandas dataframe, which is easier to manipulate for the operations that follow.

In [261]:
fund_df=pd.DataFrame.from_dict(dict_of_funds,orient='index')
fund_df.head()

Unnamed: 0,info,min_amt,sub_categories,sharpe_ratio,1_yr,man_fee,morningstar,bank,start_date,3_yr,...,fund_type,link,10_yr,normanbelopp,std_dev,industries,holdings,risk,ISIN,regions
10TEN Kvanthedge,Fonden är en aktivt förvaltad hedgefond som fö...,25000,"Hedgefond, lång/kort, Europa",,2.08,0.8,,10TEN,2016-08-02,,...,Hedgefond,https://www.avanza.se/fonder/om-fonden.html/67...,,7017,,{'Just nu finns ingen information om fondens i...,{'Just nu finns ingen information om fondens i...,2.0,SE0008586945,{'Just nu finns ingen information om fondens i...
AGCM Asia Growth RC SEK,AGCM Asia Growth Fund är en aktivt förvaltad a...,100,Asien ex Japan,,9.65,1.85,,AGCM Asia Growth,2014-10-03,,...,Aktiefond,https://www.avanza.se/fonder/om-fonden.html/54...,,18319,,"{'Finans': '20.7', 'Konsument, cyklisk': '22.1...","{'China Resources Land Ltd': '3.8', 'Samsung E...",6.0,LU1091660909,{'Asien exkl Japan': '100.0'}
AMF Aktiefond Asien Stilla havet,Fonden är en aktiefond med en bred inriktning ...,50,Asien,0.78,7.21,0.4,3.0,AMF,2008-09-25,25.3,...,Aktiefond,https://www.avanza.se/fonder/om-fonden.html/14...,,3908,12.76,,,5.0,SE0002572313,
AMF Aktiefond Europa,Fonden är en aktiefond som placerar i marknads...,50,"Europa, värdebolag",0.74,11.94,0.4,5.0,AMF,1999-04-30,22.96,...,Aktiefond,https://www.avanza.se/fonder/om-fonden.html/73...,35.93,3821,13.56,"{'Kommunikation': '4.6', 'Finans': '20.0', 'Ko...",{'Roche Holding AG Dividend Right Cert.': '4.2...,5.0,SE0000739153,"{'Västeuropa exkl Sverige': '95.0', 'Östeuropa..."
AMF Aktiefond Global,Aktiefond - Global placerar framför allt i utl...,50,"Global, mix bolag",1.02,8.02,0.4,4.0,AMF,2001-11-15,35.97,...,Aktiefond,https://www.avanza.se/fonder/om-fonden.html/18...,76.14,3996,12.24,"{'Kommunikation': '4.7', 'Finans': '19.0', 'Ko...","{'Microsoft Corp': '2.5', 'Novartis AG': '1.0'...",5.0,SE0000862278,"{'Australien och Nya Zeeland': '3.3', 'Östeuro..."


#### Most of the preprocessing we will be doing can be condensed into the function written below. We first split apart (i.e. tokenize) the 'info' string into a list of words, symbols, and punctuation marks. Then we remove all of the Swedish stopwords. Stopwords are commonplace words that carry little predictive value; in English, examples are 'is', 'a', and 'the'.

#### After that we remove any tokens of length 1, which includes punctuation marks. The last step is to keep only the 'stem' of each word. For instance, the stems of 'called' and 'cars' are 'call' and 'car'.

In [262]:
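def fund_cleaner(df):
    # Build the stopword set and the stemmer once instead of once per word.
    swedish_stopwords=set(stopwords.words('swedish'))
    stemmer=SwedishStemmer()
    for fund in list(df.index):
        info=df.loc[fund,'info']
        # Tokenize into words, symbols, and punctuation marks.
        info=nltk.word_tokenize(info,'swedish')
        info=[word for word in info if word not in swedish_stopwords]
        # Tokens of length 1 are mostly punctuation.
        info=[word for word in info if len(word)>1]
        info=[stemmer.stem(word) for word in info]
        info=' '.join(info)
        df.loc[fund,'info']=info
    return df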
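
#### The filtering steps can be tried out without NLTK. The sketch below is only an illustration: the regular-expression tokenizer and the three-word set `TOY_STOPWORDS` are assumed stand-ins for `nltk.word_tokenize` and `stopwords.words('swedish')`, the name `toy_clean` is hypothetical, and stemming is left out.

```python
import re

# Hypothetical stand-in for stopwords.words('swedish'); the real list is much longer.
TOY_STOPWORDS = {"är", "en", "som"}

def toy_clean(text):
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Drop stopwords, then drop tokens of length 1 (this removes punctuation).
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    tokens = [t for t in tokens if len(t) > 1]
    return " ".join(tokens)

print(toy_clean("Fonden är en aktivt förvaltad fond."))
```

#### Note that the stopword comparison is case-sensitive, so a capitalized stopword at the start of a sentence would survive this filter.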

#### Let's look at an example of the fund info before and after cleaning things up. Notice the length is reduced by around 1/3.

In [227]:
fund_df.loc['Amundi Fds SBI FM Eqty India Select AU-C','info']

'Delfondens mål är att söka långsiktig kapitaltillväxt genom att investera minst två 67 % av sina tillgångar i aktier i indiska företag. Delfonden kan investera i finansiella derivatinstrument för säkring och för effektiv förvaltning av portföljer. BSE 100 indexet utgör Delfondens referensindikator. Delfonden syftar inte till att replikera referensindikatorn och kan därför väsentligt avvika från den.'

In [265]:
_=fund_cleaner(fund_df)
fund_df.loc['Amundi Fds SBI FM Eqty India Select AU-C','info']

'delfond mål sök långsik kapitaltillväxt genom invest minst två 67 tillgång akti indisk företag delfond invest finansiell derivatinstrument säkring effektiv förvaltning portfölj bse 100 indexet utgör delfond referensindik delfond syft replik referensindikatorn därför väsent avvik'

#### First we apply an unsupervised method to group funds based on 'info'. The process consists of the following steps:

#### 1) For each fund's 'info' and each unique word in it, count the number of times the word occurs in that 'info' (the term frequency) and scale it down according to how many funds' 'info' contain the word (the inverse document frequency), so that words appearing in nearly every description carry little weight. This is known as Term Frequency Inverse Document Frequency (i.e. tfidf), and the output is a large sparse matrix where each row corresponds to a particular fund's 'info'.

#### 2) Apply a truncated Singular Value Decomposition (i.e. svd), which keeps only the directions with the largest singular values and thereby compresses away the less useful information in the matrix.

#### This process more commonly goes by the name Latent Semantic Analysis, or LSA.
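
#### To make the weighting in step 1 concrete, here is a minimal hand-rolled sketch using the textbook form tf * log(N / df). The function name `toy_tfidf` and the three toy documents are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed variant of the idf term and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({w: c * math.log(n_docs / doc_freq[w])
                        for w, c in term_freq.items()})
    return weights

# Hypothetical stemmed descriptions.
docs = ["fond akti sverig", "fond akti asien", "fond ränt"]
for row in toy_tfidf(docs):
    print(row)
```

#### The word 'fond' occurs in every toy document and therefore gets weight zero, while a word that occurs in only one document gets the largest weight.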

In [229]:
svd_model=TruncatedSVD(n_components=500,algorithm='arpack')

transformer = TfidfVectorizer()

svd_transformer = Pipeline([('tfidf', transformer), 
                            ('svd', svd_model)])

svd_matrix = svd_transformer.fit_transform(fund_df['info'].values)

#### Now that each fund has been transformed into a vector in a high-dimensional space, let's see which funds are similar to a given fund by comparing the fund vectors. In particular, we use the cosine distance, which is one minus the cosine of the angle between two vectors. If the two vectors point in the same direction, the angle is zero and so is the distance.
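
#### As a quick illustration, cosine distance can be written out by hand as one minus the cosine of the angle between two vectors. The helper name `cosine_distance` and the toy vectors are assumptions; the notebook itself relies on `pairwise_distances` with `metric='cosine'`.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Same direction, different length: distance is (numerically) zero.
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
# Orthogonal vectors: distance is one.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

#### Because only the direction matters, a long description and a short one that use the same words in the same proportions count as close.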

In [230]:
def k_similar_funds(fund,k,fund_df):
    query_fund = svd_transformer.transform([fund_df.loc[fund,'info']])
    distance_matrix = pairwise_distances(query_fund, 
                                     svd_matrix, 
                                     metric='cosine', 
                                     n_jobs=-1)
    indices=distance_matrix[0,:].argsort()[:k+1]
    for i in indices:
        print('Fund Name: '+fund_df.index[i])
        print('Fund Category: '+fund_df['category'].iloc[i])
        print('')

In [231]:
k_similar_funds('AMF Aktiefond Sverige',5,fund_df)

Fund Name: AMF Aktiefond Sverige
Fund Category: Sverige

Fund Name: Monyx Svenska Aktier
Fund Category: Sverige

Fund Name: AMF Aktiefond Europa
Fund Category: Europa

Fund Name: SEB Världenfond
Fund Category: Blandfonder

Fund Name: SEB Nordic Focus C SEK
Fund Category: Norden

Fund Name: SEB Trygg Placeringsfond
Fund Category: Blandfonder



#### Above we see the query fund AMF Aktiefond Sverige itself, followed by its five 'closest' funds. Comparing these to the fund categories seems to show only a minor correlation between a fund's 'info' and its category.

#### However, if we look at the geographical distribution of a couple of those funds we see a heavy investment presence in Sweden (i.e. Sverige), which is somewhat encouraging.

In [232]:
print('Geographical Distribution of SEB Nordic Focus C SEK investments: {}'.format(fund_df.loc['SEB Nordic Focus C SEK','regions']))
print('-----------------------------------------------------------------------------------------------------------')
print('Geographical Distribution of SEB Trygg Placeringsfond investments: {}'.format(fund_df.loc['SEB Trygg Placeringsfond','regions']))

Geographical Distribution of SEB Nordic Focus C SEK investments: {'Västeuropa exkl Sverige': '52.4', 'Nordamerika': '3.7', 'Sverige': '43.9'}
-----------------------------------------------------------------------------------------------------------
Geographical Distribution of SEB Trygg Placeringsfond investments: {'Australien och Nya Zeeland': '0.7', 'Östeuropa': '3.7', 'Nordamerika': '33.2', 'Latinamerika': '0.0', 'Västeuropa exkl Sverige': '13.4', 'Japan': '6.9', 'Afrika och Mellanöstern': '0.2', 'Asien exkl Japan': '5.1', 'Sverige': '36.9'}


#### Below we see that there are 37 distinct fund categories, and some of the categories have as few as 2 funds. This is a multi-class dataset with significant class imbalance.

In [233]:
print('# of fund categories: {}'.format(len(fund_df.groupby('category').count().index)))
fund_df.loc[:,['info','category']].groupby('category').count()

# of fund categories: 37


category,info
-,21
Afrika och Mellanöstern,14
Asien,135
BRIC,4
Bioteknologi,6
Blandfonder,95
Brasilien,5
EURO,36
Energi,5
Europa,115


#### We now move on to supervised learning via classification, and we begin by splitting the dataset of funds into a training set and a test set. We randomly assign a quarter of the funds to the test set while maintaining the ratio of classes/categories.

#### Given the combination of the imbalanced classes and the small populations of many of the fund classes, our expectations should not be high.

In [243]:
def data_split(df):
    test_funds=[]
    class_counts=df.loc[:,['info','category']].groupby('category').count()
    # Reserve 25% of each category (rounded down) for the test set.
    test_counts=class_counts['info'].apply(lambda x: int(0.25*x))
    # items() replaces iteritems(), which was removed in pandas 2.0.
    for index, count in test_counts.items():
        indices=random.sample(range(0,class_counts.loc[index,'info']),count)
        # Filter on df itself rather than the global fund_df.
        test_funds+=list(df.loc[df['category']==index].index[indices])
    return test_funds
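
#### The per-class sampling idea can also be sketched without pandas. The version below is an illustration under assumptions: the function name `stratified_test_names`, the `seed` argument, and the toy labels are hypothetical and not part of the notebook. It uses a seeded `random.Random` so the split is repeatable.

```python
import random
from collections import defaultdict

def stratified_test_names(labels, test_fraction=0.25, seed=0):
    """labels maps item name -> class; returns the names drawn for the test set."""
    rng = random.Random(seed)  # local generator so the split is repeatable
    by_class = defaultdict(list)
    for name, label in labels.items():
        by_class[label].append(name)
    test_names = []
    for names in by_class.values():
        n_test = int(test_fraction * len(names))  # rounds down, as in data_split
        test_names += rng.sample(names, n_test)
    return test_names

# Hypothetical data: 8 funds in one class and 2 in another.
toy_labels = {"fund_%d" % i: ("Asien" if i < 8 else "BRIC") for i in range(10)}
print(stratified_test_names(toy_labels))
```

#### Because the count is rounded down, any class with fewer than four members contributes nothing to the test set, which is worth keeping in mind when reading per-class scores.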

In [244]:
test_index=data_split(fund_df)
test_set=fund_df.loc[test_index,:]
train_set=fund_df.loc[~fund_df.index.isin(test_index)]

In [245]:
X_test=test_set['info'].values
y_test=test_set['category'].values
X_train=train_set['info'].values
y_train=train_set['category'].values

#### We need a couple of functions for running a Naive Bayes classifier based on a multinomial distribution and a function to show how each category fared.

#### A Naive Bayes classifier makes the naive assumption that the features (i.e. word frequencies) are statistically independent given the class. For some words this may be true, but for many it is not true at all (e.g. 'rush' and 'hour'). Later we can use n-grams (multiple words) as features to help with this.
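
#### For reference, an n-gram is simply a run of n consecutive tokens. The small helper below (the name `ngrams` is an assumption) shows how the pair 'rush hour' becomes a single feature.

```python
def ngrams(tokens, n):
    """Return the list of n-word tuples that occur consecutively in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "rush hour traffic".split()
print(ngrams(tokens, 2))
```

#### In scikit-learn, `TfidfVectorizer` accepts an `ngram_range` argument, e.g. `ngram_range=(1, 2)` to use single words together with word pairs.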

In [246]:
def score_counts(y_test,y_predic):
    correct_counts=defaultdict(int)
    counts=defaultdict(int)
    # Walk through true and predicted labels in parallel.
    for fund_class,predicted in zip(y_test,y_predic):
        counts[fund_class]+=1
        if fund_class==predicted:
            correct_counts[fund_class]+=1
    return (correct_counts,counts)

In [247]:
def NB_classify_and_score(X_train,y_train,X_test,y_test):
    steps = [('tfid', TfidfVectorizer()),
         ('NB', MultinomialNB())]

    pipeline = Pipeline(steps)

    pipeline.fit(X_train,y_train)
    
    y_predic = pipeline.predict(X_test)
    
    (correct_counts,counts)=score_counts(y_test,y_predic)

    print("Accuracy: {}".format(pipeline.score(X_test, y_test)))
    for key, value in counts.items():
        print(key+": {}".format(correct_counts[key])+'/{}'.format(value))
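
#### To show what the multinomial Naive Bayes step does internally, here is a small from-scratch sketch. It is an illustration under assumptions: it works on raw word counts rather than tf-idf weights, the names `train_nb` and `predict_nb` and the toy documents are hypothetical, and the add-one smoothing `alpha=1.0` mirrors the default of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and smoothed per-class word log-probabilities."""
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_like = {w: math.log((word_counts[label][w] + alpha) / total)
                    for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log-posterior; unseen words are skipped."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in doc.split()
                                        if w in log_like)
    return max(scores, key=scores.get)

# Hypothetical stemmed descriptions and categories.
docs = ["akti sverig bolag", "akti asien bolag", "ränt obligation sverig"]
labels = ["Sverige", "Asien", "SEK"]
model = train_nb(docs, labels)
print(predict_nb(model, "akti asien"))
```

#### The per-word log-probabilities are simply added, which is exactly where the independence assumption enters.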


#### Below we see that only around 48% of the funds in the test set were correctly classified. This is not very encouraging, but it is far better than the roughly 3% accuracy (1 in 37) that guessing uniformly at random would yield.

#### Unsurprisingly, the underrepresented fund categories performed the worst.

In [248]:
NB_classify_and_score(X_train,y_train,X_test,y_test)

Accuracy: 0.4794952681388013
-: 0/5
Östeuropa: 0/5
Ny Energi: 0/1
USD: 0/2
Läkemedel: 0/3
Afrika och Mellanöstern: 0/3
Sverige: 23/26
Energi: 0/1
Nordamerika: 2/15
BRIC: 0/1
Finans: 0/1
SEK: 16/19
Ädelmetaller: 0/1
Blandfonder: 12/23
Japan: 1/8
Norden: 0/9
Latinamerika: 0/3
Ny Teknologi: 0/5
Infrastruktur: 0/1
Asien: 31/33
EURO: 0/9
Övriga: 19/25
Miljö: 0/1
Europa: 17/28
Råvaror: 0/3
Konsument: 0/2
Tillväxtmarknader: 1/18
Bioteknologi: 0/1
Turkiet: 0/1
Ryssland: 0/4
Global: 22/31
Fastigheter: 0/6
Hedgefonder: 8/21
Konvertibler: 0/1
Brasilien: 0/1


#### Let's run it again, but this time using only the top 5 most populated fund categories.

In [249]:
top_classes=list(fund_df.loc[:,['info','category']].groupby('category').count().sort_values('info',ascending=False).head(5).index)
print(top_classes)

['Asien', 'Global', 'Europa', 'Sverige', 'Övriga']


In [250]:
X_test_sub=test_set.loc[test_set['category'].isin(top_classes)]['info'].values
y_test_sub=test_set.loc[test_set['category'].isin(top_classes)]['category'].values
X_train_sub=train_set.loc[train_set['category'].isin(top_classes)]['info'].values
y_train_sub=train_set.loc[train_set['category'].isin(top_classes)]['category'].values

#### Our accuracy has risen from 48% to 81%, which is a nice improvement. We could either gather more data on the underrepresented categories, or we could consider consolidating some of the fund categories to get better performance out of our classifier. Of course, we have only tried a fairly basic classifier, so there are many more methods we can consider.

In [251]:
NB_classify_and_score(X_train_sub,y_train_sub,X_test_sub,y_test_sub)

Accuracy: 0.8111888111888111
Asien: 30/33
Europa: 17/28
Övriga: 22/25
Global: 24/31
Sverige: 23/26


#### Before we move on to other methods, let's look at 'fund_type' as opposed to fund 'category'. There are only 7 fund types as opposed to 37 fund categories. However, class imbalance is still an issue, since the 'Aktiefond' type is heavily represented in the dataset.

In [252]:
print('# of fund types: {}'.format(len(fund_df.groupby('fund_type').count().index)))
fund_df.loc[:,['info','fund_type']].groupby('fund_type').count()

# of fund types: 7


fund_type,info
-,21
Aktiefond,763
Blandfond,95
Branschfond,126
Hedgefond,87
Räntefond,224
Övrigt,4


In [253]:
def data_split(df,column):
    test_funds=[]
    class_counts=df.loc[:,['info',column]].groupby(column).count()
    # Reserve 25% of each class (rounded down) for the test set.
    test_counts=class_counts['info'].apply(lambda x: int(0.25*x))
    # items() replaces iteritems(), which was removed in pandas 2.0.
    for index, count in test_counts.items():
        indices=random.sample(range(0,class_counts.loc[index,'info']),count)
        # Filter on df itself rather than the global fund_df.
        test_funds+=list(df.loc[df[column]==index].index[indices])

    return test_funds

In [254]:
test_index=data_split(fund_df,'fund_type')
test_set=fund_df.loc[test_index,:]
train_set=fund_df.loc[~fund_df.index.isin(test_index)]

In [255]:
X_test=test_set['info'].values
y_test=test_set['fund_type'].values
X_train=train_set['info'].values
y_train=train_set['fund_type'].values

In [256]:
NB_classify_and_score(X_train,y_train,X_test,y_test)

Accuracy: 0.6758409785932722
Branschfond: 0/31
Aktiefond: 190/190
Blandfond: 0/23
Räntefond: 31/56
Hedgefond: 0/21
Övrigt: 0/1
-: 0/5


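#### Over 67% of the funds in the test set are given the correct fund type by the Naive Bayes classifier. The per-type counts show, however, that only 'Aktiefond' and 'Räntefond' funds are ever classified correctly, so the score mostly reflects the dominant class. We should be able to do better by bringing in other features and methods.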
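
#### To put the accuracy figure in context, the sketch below recomputes it from the per-type counts printed above and compares it with two reference numbers: the accuracy of always predicting the largest type, and the unweighted mean of the per-type hit rates. The variable names are my own, and the choice of these two reference metrics is an assumption about what is useful here, not something the notebook computes.

```python
# Per-type (correct, total) pairs copied from the output above.
results = {
    "Branschfond": (0, 31),
    "Aktiefond": (190, 190),
    "Blandfond": (0, 23),
    "Räntefond": (31, 56),
    "Hedgefond": (0, 21),
    "Övrigt": (0, 1),
    "-": (0, 5),
}

n_test = sum(total for _, total in results.values())
accuracy = sum(correct for correct, _ in results.values()) / n_test
# Accuracy of always predicting the largest type in the test set.
majority_baseline = max(total for _, total in results.values()) / n_test
# Unweighted mean of per-type recall, which treats every type equally.
macro_recall = sum(c / t for c, t in results.values()) / len(results)

print(round(accuracy, 3), round(majority_baseline, 3), round(macro_recall, 3))
```

#### Always answering 'Aktiefond' would already score about 58% on this test set, and the macro-averaged recall is only about 22%, so the headline accuracy overstates how well the minority types are handled.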