<center>
<hr>
<h1>Multiplex Dependency Stock Network - Collect Data</h1>
<hr>
</center>

In [1]:
import pandas
import quandl 
quandl.ApiConfig.api_key = 'dYKUm41mFRJiwbdqSMSY'  # Quandl API key; register on quandl.com to get your own
import math as mt
import numpy as np

### Download Data

We import the tickers of the S&P 500 components from a CSV file:

In [2]:
companies = []

# The first column of each row holds the ticker symbol.
with open('../data/input/S&P500List.csv', 'r') as fh:
    for line in fh:
        s = line.strip().split(',')
        companies.append(s[0])
print('Number of Tickers:', len(companies))

Number of Tickers: 505


We download the 2017 daily prices from the Quandl API (https://www.quandl.com/) and store only the close prices of each ticker in a dictionary:

In [3]:
data = dict()
for i in companies: 
    try: 
        raw_data = quandl.get_table('WIKI/PRICES', 
                                    qopts = { 'columns': ['ticker', 'date', 'close', 'adj_close'] }, 
                                    date = { 'gte': '2017-01-01', 'lte': '2017-12-31' },
                                    ticker = i)
        data[i] = raw_data['close']
    except Exception: 
        print('not found', i)

We now run a few checks on the data:

- if a company was not found, we drop it from the list of companies:

In [4]:
# Keep only the tickers for which the download succeeded.
companies = list(data.keys())

- we check that there are no empty series (and we update the list of companies):

In [5]:
for c in companies: 
    if data[c].empty: 
        del data[c]
companies = []
for k in data.keys(): 
    companies.append(k)

- we check that all series have the same length, 250 trading days (and again we update the list of companies):

In [6]:
for c in companies: 
    if data[c].size != 250: #trading days
        del data[c]

companies = []
for k in data.keys(): 
    companies.append(k)
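
The three checks above can also be expressed as a single filter. The following is only a sketch on made-up toy data: `filter_series`, `toy` and the tickers `AAA`, `BBB`, `CCC` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import pandas as pd


def filter_series(data, expected_len):
    """Keep only non-empty series with exactly expected_len observations."""
    return {
        ticker: series
        for ticker, series in data.items()
        if not series.empty and series.size == expected_len
    }


# Toy data standing in for the downloaded close prices (values are made up).
toy = {
    "AAA": pd.Series([10.0, 10.5, 10.2]),   # complete series
    "BBB": pd.Series([], dtype=float),      # empty series
    "CCC": pd.Series([5.0, 5.1]),           # too short
}

kept = filter_series(toy, expected_len=3)
kept_tickers = list(kept)  # the updated list of companies
```

Building a new dictionary avoids deleting entries while looping, and the list of tickers only has to be rebuilt once.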

### Log Return Price

We now compute the log returns of the prices before computing correlations (prices are commonly modeled as log-normally distributed, so log returns are approximately normal):

In [7]:
size = 250

datalog = dict() 
for c in companies: 
    log = []
    for i in range(1, size):  # one log return per pair of consecutive days, including the last day
        log.append(mt.log(data[c].get(i)/data[c].get(i-1)))
    logseries = pandas.Series(log)
    datalog[c] = logseries
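
As a cross-check, the log return $r_t = \ln(P_t / P_{t-1})$ can be computed without an explicit loop. This is a minimal sketch, assuming the prices are held in a `pandas.Series`; `log_returns` and the toy prices are hypothetical and only illustrate the formula.

```python
import numpy as np
import pandas as pd


def log_returns(prices):
    """Return r_t = ln(P_t / P_{t-1}) for every pair of consecutive prices."""
    values = prices.to_numpy(dtype=float)
    # values[1:] are P_1..P_{n-1}, values[:-1] are P_0..P_{n-2}
    return pd.Series(np.log(values[1:] / values[:-1]))


prices = pd.Series([100.0, 110.0, 99.0])  # made-up prices
returns = log_returns(prices)             # n prices give n - 1 returns
```

A useful property of log returns is that they add up over time: the sum of `returns` equals the log of the last price divided by the first.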

### Correlation Matrices - Pearson, Kendall, Spearman

In [8]:
def crosscorr(datax, datay, lag=0, method = 'pearson'):
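    """ Lag-N cross correlation. 
    Parameters
    ----------
    datax, datay : pandas.Series objects of equal length
    lag : int, default 0
        number of periods by which datay is shifted before correlating
    method : {'pearson', 'kendall', 'spearman'}, default 'pearson'

    Returns
    ----------
    crosscorr : float
    """
    # Pairs with a missing value (introduced by the shift) are ignored by corr.
    return datax.corr(datay.shift(lag), method=method)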
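
To illustrate how the `lag` argument behaves, here is a small sketch on made-up data. The function is repeated so that the example runs on its own; the series `x` and `y` are hypothetical.

```python
import pandas as pd


def crosscorr(datax, datay, lag=0, method="pearson"):
    # Same idea as the function above, repeated so the sketch is self-contained.
    return datax.corr(datay.shift(lag), method=method)


x = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
y = x.shift(1)  # y trails x by one step: y[t] = x[t-1]

same_time = crosscorr(x, y, lag=0)   # compares x[t] with x[t-1]
realigned = crosscorr(x, y, lag=-1)  # shifting y back by one step recovers x
```

With `lag=-1` the shifted `y` coincides with `x` wherever both are defined, so the correlation is 1; with `lag=0` the two series are out of step and the correlation is lower.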

We compute three correlation matrices using the Pearson, Kendall and Spearman methods:

- Without time lag:

In [None]:
n = len(companies)
corr_matrix_pearson = np.zeros((n, n))
corr_matrix_kendall = np.zeros((n, n))
corr_matrix_spearman = np.zeros((n, n))
companies_id = dict()

for i in range(0, n):
    companies_id[companies[i]] = i
    for j in range(0, n):
        if i<j:
            corr_matrix_pearson[i][j] = datalog[companies[i]].corr(datalog[companies[j]],method='pearson')
            corr_matrix_kendall[i][j] = datalog[companies[i]].corr(datalog[companies[j]],method='kendall')
            corr_matrix_spearman[i][j] = datalog[companies[i]].corr(datalog[companies[j]],method='spearman')      
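
The same upper-triangular matrix can probably be obtained more quickly with `DataFrame.corr`, which computes all pairwise correlations in one call. This is only a sketch on random toy data, not the notebook's method: `toy_returns`, `frame`, `full` and `upper` are hypothetical names. The example uses `method="pearson"`; the rank-based methods accept the same call but may need SciPy installed, depending on the pandas version.

```python
import numpy as np
import pandas as pd

# Random toy series standing in for the log returns (values are made up).
rng = np.random.default_rng(0)
toy_returns = {
    "AAA": pd.Series(rng.normal(size=50)),
    "BBB": pd.Series(rng.normal(size=50)),
    "CCC": pd.Series(rng.normal(size=50)),
}

frame = pd.DataFrame(toy_returns)                # one column per ticker
full = frame.corr(method="pearson").to_numpy()   # symmetric, ones on the diagonal
upper = np.triu(full, k=1)                       # keep only entries with i < j
```

The column order of `frame` follows the insertion order of the dictionary, so row and column `i` of `upper` correspond to the `i`-th ticker.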

### Export Data

We export correlation matrices and tickers: 

In [None]:
#export corr matrix
np.savetxt("../data/corr_matrix/corr_matrix_pearson.csv",corr_matrix_pearson, delimiter=",")
np.savetxt("../data/corr_matrix/corr_matrix_kendall.csv",corr_matrix_kendall, delimiter=",")
np.savetxt("../data/corr_matrix/corr_matrix_spearman.csv",corr_matrix_spearman, delimiter=",")

#export ticker and id number
import csv
# newline='' prevents the csv module from writing blank rows on Windows
with open('../data/input/components.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for key, value in companies_id.items():
        writer.writerow([key, value])
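
A later step would need to read these files back. The sketch below shows one way to do the round trip, assuming the same comma-separated layout; it writes to a temporary directory instead of `../data/`, and the tiny `matrix` and `ids` values are made up for illustration.

```python
import csv
import os
import tempfile

import numpy as np

matrix = np.array([[0.0, 0.5], [0.0, 0.0]])  # toy upper-triangular matrix
ids = {"AAA": 0, "BBB": 1}                   # toy ticker -> row index map

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "corr_matrix_pearson.csv")
    ids_path = os.path.join(tmp, "components.csv")

    # Write both files in the same format as the export cell.
    np.savetxt(matrix_path, matrix, delimiter=",")
    with open(ids_path, "w", newline="") as csv_file:
        csv.writer(csv_file).writerows(ids.items())

    # Read them back: the matrix as floats, the ids as integers.
    loaded_matrix = np.loadtxt(matrix_path, delimiter=",")
    with open(ids_path, newline="") as csv_file:
        loaded_ids = {ticker: int(idx) for ticker, idx in csv.reader(csv_file)}
```

The id values come back from the CSV file as strings, hence the explicit `int` conversion.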