# Data preparation and collection

## Authors: María Carrasco Meléndez, Raquel Fort Serra and Lucía Saiz Lapique

__Assignment 4__

__Asset and Portfolio Management__

__CUNEF__

The goal of this assignment is to build a portfolio from a set of equities, based on a range of data and signals. Because a large amount of data is required (at least 5 years of history, among other fields, for about 100 stocks), the data must be prepared first.

To do so, in addition to a database provided by the course instructor, we use web scraping and the yfinance library to obtain the historical prices for the assets in that database.

In [1]:
import yfinance as yf  # fix_yahoo_finance was renamed to yfinance
from pandas_datareader import data as pdr
import datetime
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


First, we load the original database, which contains almost 200 assets from 4 different indices: IBEX 35, Euro Stoxx, Dow Jones and NASDAQ.

In [2]:
datos = pd.read_excel('Datos adicionales.xlsx')
datos.tail()

Unnamed: 0,Symbol,Name,Price,Currency,Sector,Country,Rentab 1M,Rentab 3M,Rentab 1Y,Rentab 2Y,...,PX_TO_BOOK_RATIO,PX_TO_CASH_FLOW,EPS_GROWTH,DVD_PAYOUT_RATIO,EQY_REC_CONS,TOT_ANALYST_REC,TOT_BUY_REC,TOT_SELL_REC,TWITTER_SENTIMENT_REALTIME,NEWS_SENTIMENT_RT
213,CSCO UW Equity,CISCO SYSTEMS INC,44.9,USD,Telecommunications,UNITED STATES,0.055974,0.050538,-0.174177,0.16568,...,5.30632,12.2488,13050,51.45,3.8,30,14,2,0,-0.52
214,XOM UN Equity,EXXON MOBIL CORP,44.6,USD,Oil&Gas,UNITED STATES,0.019895,-0.177122,-0.398111,-0.151016,...,1.03564,3.21,-31.1475,103.515,2.78571,28,3,6,0.0155916,0.87
215,WBA UW Equity,WALGREENS BOOTS ALLIANCE INC,39.6,USD,Retail,UNITED STATES,-0.09465,-0.166491,-0.235078,-0.036743,...,5.12,6.2,-14.7929,41.1458,2.90909,22,1,2,0,0.012
216,PFE UN Equity,PFIZER INC,37.5,USD,Pharmaceuticals,UNITED STATES,0.00321,0.105217,-0.106079,0.248197,...,3.20263,14.8296,53.6842,49.98,3.88889,18,8,0,0,-0.234
217,DOW UN Equity,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,,,,,...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,0.045


The first step towards obtaining data for our assets is to extract each asset's symbol, or 'ticker', which is used later, together with web scraping, to retrieve the historical prices. We therefore rewrite the `Symbol` column so that it contains only the ticker.

In [3]:
datos['Symbol'] = datos['Symbol'].apply(lambda x: x.split(' ')[0])

In [4]:
datos.tail()

Unnamed: 0,Symbol,Name,Price,Currency,Sector,Country,Rentab 1M,Rentab 3M,Rentab 1Y,Rentab 2Y,...,PX_TO_BOOK_RATIO,PX_TO_CASH_FLOW,EPS_GROWTH,DVD_PAYOUT_RATIO,EQY_REC_CONS,TOT_ANALYST_REC,TOT_BUY_REC,TOT_SELL_REC,TWITTER_SENTIMENT_REALTIME,NEWS_SENTIMENT_RT
213,CSCO,CISCO SYSTEMS INC,44.9,USD,Telecommunications,UNITED STATES,0.055974,0.050538,-0.174177,0.16568,...,5.30632,12.2488,13050,51.45,3.8,30,14,2,0,-0.52
214,XOM,EXXON MOBIL CORP,44.6,USD,Oil&Gas,UNITED STATES,0.019895,-0.177122,-0.398111,-0.151016,...,1.03564,3.21,-31.1475,103.515,2.78571,28,3,6,0.0155916,0.87
215,WBA,WALGREENS BOOTS ALLIANCE INC,39.6,USD,Retail,UNITED STATES,-0.09465,-0.166491,-0.235078,-0.036743,...,5.12,6.2,-14.7929,41.1458,2.90909,22,1,2,0,0.012
216,PFE,PFIZER INC,37.5,USD,Pharmaceuticals,UNITED STATES,0.00321,0.105217,-0.106079,0.248197,...,3.20263,14.8296,53.6842,49.98,3.88889,18,8,0,0,-0.234
217,DOW,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,,,,,...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,#N/A Requesting Data...,0.045
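
The same clean-up can also be written with pandas' vectorized string methods instead of `apply`. The sketch below uses a small made-up frame (the symbols are copied from the output above) and is only meant to illustrate the idea:

```python
import pandas as pd

# Toy frame mimicking the symbols shown in the output above
sample = pd.DataFrame({'Symbol': ['CSCO UW Equity', 'XOM UN Equity', 'PFE UN Equity']})

# Keep only the text before the first space, i.e. the bare ticker
sample['Symbol'] = sample['Symbol'].str.split(' ').str[0]
print(sample['Symbol'].tolist())
```

Unlike the `lambda` version, the `.str` accessor passes missing values through as `NaN` instead of raising an error.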


We check whether the data contain null values and drop every asset that has at least one, since missing values would distort the analysis.

In [5]:
datos.isna().sum()

Symbol                         0
Name                           0
Price                          0
Currency                       0
Sector                         0
Country                        0
Rentab 1M                      3
Rentab 3M                      3
Rentab 1Y                      4
Rentab 2Y                     10
Rentab 3Y                     12
Rentab 5Y                     19
Volat 30d                      3
Volat 360d                     7
CUR_MKT_CAP                    3
PE_RATIO                      16
PX_TO_BOOK_RATIO               6
PX_TO_CASH_FLOW               27
EPS_GROWTH                    14
DVD_PAYOUT_RATIO              14
EQY_REC_CONS                   0
TOT_ANALYST_REC                0
TOT_BUY_REC                    0
TOT_SELL_REC                   0
TWITTER_SENTIMENT_REALTIME     0
NEWS_SENTIMENT_RT              0
dtype: int64

In [6]:
datos = datos.dropna()
datos.isna().sum()

Symbol                        0
Name                          0
Price                         0
Currency                      0
Sector                        0
Country                       0
Rentab 1M                     0
Rentab 3M                     0
Rentab 1Y                     0
Rentab 2Y                     0
Rentab 3Y                     0
Rentab 5Y                     0
Volat 30d                     0
Volat 360d                    0
CUR_MKT_CAP                   0
PE_RATIO                      0
PX_TO_BOOK_RATIO              0
PX_TO_CASH_FLOW               0
EPS_GROWTH                    0
DVD_PAYOUT_RATIO              0
EQY_REC_CONS                  0
TOT_ANALYST_REC               0
TOT_BUY_REC                   0
TOT_SELL_REC                  0
TWITTER_SENTIMENT_REALTIME    0
NEWS_SENTIMENT_RT             0
dtype: int64
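
Note that `dropna` only removes real missing values (`NaN`). Placeholder text such as `#N/A Requesting Data...`, visible in the `DOW` row above, is an ordinary string and is not counted by `isna()`. Here the `DOW` row appears to be dropped anyway because its return columns are empty, but a minimal sketch of a more explicit clean-up, assuming the placeholder should be treated as missing, could look like this (the toy frame and the `PLACEHOLDER` name are illustrative):

```python
import numpy as np
import pandas as pd

PLACEHOLDER = '#N/A Requesting Data...'  # string seen in the output above

# Made-up two-row frame: one clean asset, one with placeholder text
toy = pd.DataFrame({
    'Symbol': ['PFE', 'DOW'],
    'Price': [37.5, PLACEHOLDER],
    'Sector': ['Pharmaceuticals', PLACEHOLDER],
})

# isna() does not flag the placeholder, because it is a regular string
print(toy.isna().sum().sum())

# Convert the placeholder to NaN so that dropna() can remove the row
cleaned = toy.replace(PLACEHOLDER, np.nan).dropna()
print(cleaned['Symbol'].tolist())
```

An alternative is to pass the placeholder through the `na_values` argument of `pd.read_excel`, so that it is read as `NaN` from the start.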

In [7]:
datos2 = datos.drop_duplicates(keep = 'first', subset = 'Symbol')
datos2 = datos2.set_index('Symbol')
len(datos2)

137

After removing the rows with null values, we are left with data for 137 assets (many assets appeared more than once, and this number will shrink further below).

In [8]:
tickers_data = list(datos2.index)
tickers_data

['OR',
 'DG',
 'ASML',
 'PHIA',
 'TEF',
 'FP',
 'AI',
 'CS',
 'BN',
 'VIV',
 'EL',
 'MC',
 'KER',
 'AMS',
 'SAF',
 'AD',
 'UNA',
 'IBE',
 'ORA',
 'ABI',
 'SAN',
 'ENEL',
 'SU',
 'BAYN',
 'BMW',
 'CRH',
 'BAS',
 'SIE',
 'VOW3',
 'FRE',
 'SAP',
 'ADS',
 'DTE',
 'DPW',
 'DAI',
 'FER',
 'GRF',
 'ELE',
 'REE',
 'ACS',
 'ENG',
 'IAG',
 'ANA',
 'COL',
 'CIE',
 'BKIA',
 'TL5',
 'MEL',
 'MXIM',
 'CDW',
 'MDLZ',
 'AMZN',
 'CPRT',
 'ALXN',
 'GOOG',
 'IDXX',
 'CSGP',
 'CHTR',
 'CSCO',
 'INTC',
 'MSFT',
 'NVDA',
 'CTSH',
 'ISRG',
 'ALGN',
 'EBAY',
 'BKNG',
 'ILMN',
 'TXN',
 'GOOGL',
 'ADP',
 'WBA',
 'ADBE',
 'AMGN',
 'AAPL',
 'CTAS',
 'CMCSA',
 'KLAC',
 'AVGO',
 'CDNS',
 'PCAR',
 'COST',
 'REGN',
 'SWKS',
 'ATVI',
 'AMAT',
 'LULU',
 'CERN',
 'NTES',
 'SNPS',
 'EA',
 'FAST',
 'ULTA',
 'FISV',
 'ANSS',
 'FB',
 'GILD',
 'TTWO',
 'LRCX',
 'BIIB',
 'VRTX',
 'PAYX',
 'ADI',
 'ROST',
 'XLNX',
 'INTU',
 'MCHP',
 'MNST',
 'CHKP',
 'ORLY',
 'NXPI',
 'MU',
 'BIDU',
 'VRSK',
 'NTAP',
 'UNH',
 'HD',
 'V',
 'MCD

# WEB-SCRAPING

The next step is to obtain, through web scraping, the unique tickers of all the constituents of each index. They are stored in one list per index; the lists are then merged and duplicates are removed.

## EURO STOXX

In [9]:
tickers2 = []
 
# Euro Stoxx 50 components page on Yahoo Finance
url = 'https://es.finance.yahoo.com/quote/%5ESTOXX50E/components?p=%5ESTOXX50E'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly

In [10]:
# Collect the ticker shown in each row of the components table
for row in soup.find_all('tbody'):
    for srow in row.find_all('tr'):
        for ticker in srow.find_all('td', attrs={'class':'Py(10px) Ta(start) Pend(10px)'}):
            for otro in ticker.find_all('a', attrs={'class':'C($c-fuji-blue-1-b) Cur(p) Td(n) Fw(500)'}):
                tickers2.append(otro.text)
# Remove duplicates while preserving order
tickers2 = list(dict.fromkeys(tickers2))

## IBEX

In [11]:
tickers = []
 
# IBEX 35 components page on Yahoo Finance
url = 'https://es.finance.yahoo.com/quote/%5EIBEX/components?p=%5EIBEX'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly

In [12]:
# Collect the ticker shown in each row of the components table
for row in soup.find_all('tbody'):
    for srow in row.find_all('tr'):
        for ticker in srow.find_all('td', attrs={'class':'Py(10px) Ta(start) Pend(10px)'}):
            for otro in ticker.find_all('a', attrs={'class':'C($c-fuji-blue-1-b) Cur(p) Td(n) Fw(500)'}):
                tickers.append(otro.text)
# Remove duplicates while preserving order
tickers = list(dict.fromkeys(tickers))

## NASDAQ

In [13]:
tickers3 = []
 
# NASDAQ components page on Yahoo Finance
url = 'https://es.finance.yahoo.com/quote/%5EIXIC/components?p=^IXIC&.tsrc=fin-srch'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly

In [14]:
# Collect the ticker shown in each row of the components table
for row in soup.find_all('tbody'):
    for srow in row.find_all('tr'):
        for ticker in srow.find_all('td', attrs={'class':'Py(10px) Ta(start) Pend(10px)'}):
            for otro in ticker.find_all('a', attrs={'class':'C($c-fuji-blue-1-b) Cur(p) Td(n) Fw(500)'}):
                tickers3.append(otro.text)
# Remove duplicates while preserving order
tickers3 = list(dict.fromkeys(tickers3))

## DOW JONES

In [15]:
tickers4 = []
 
# Dow Jones components page on Yahoo Finance
url = 'https://es.finance.yahoo.com/quote/%5EDJI/components?p=%5EDJI'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly

In [16]:
# Collect the ticker shown in each row of the components table
for row in soup.find_all('tbody'):
    for srow in row.find_all('tr'):
        for ticker in srow.find_all('td', attrs={'class':'Py(10px) Ta(start) Pend(10px)'}):
            for otro in ticker.find_all('a', attrs={'class':'C($c-fuji-blue-1-b) Cur(p) Td(n) Fw(500)'}):
                tickers4.append(otro.text)
# Remove duplicates while preserving order
tickers4 = list(dict.fromkeys(tickers4))
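
The four scraping cells above repeat the same parsing logic with a different URL. As a sketch only, that logic could be factored into a helper. The function name `extract_tickers` and the HTML fragment are made up for illustration; the CSS class strings are the ones used in the cells above and depend on the layout of the Yahoo Finance page at the time, so they may no longer match.

```python
from bs4 import BeautifulSoup

# CSS class strings copied from the scraping cells above
TD_CLASS = 'Py(10px) Ta(start) Pend(10px)'
A_CLASS = 'C($c-fuji-blue-1-b) Cur(p) Td(n) Fw(500)'

def extract_tickers(html):
    """Return the unique tickers found in a components table, in page order."""
    soup = BeautifulSoup(html, 'html.parser')
    found = []
    for cell in soup.find_all('td', attrs={'class': TD_CLASS}):
        for link in cell.find_all('a', attrs={'class': A_CLASS}):
            found.append(link.text.strip())
    # dict.fromkeys drops duplicates while keeping the first occurrence
    return list(dict.fromkeys(found))

# Hand-written HTML fragment standing in for a downloaded page
sample_html = """
<table><tbody>
  <tr><td class="Py(10px) Ta(start) Pend(10px)">
      <a class="C($c-fuji-blue-1-b) Cur(p) Td(n) Fw(500)" href="#">SAN.MC</a></td></tr>
  <tr><td class="Py(10px) Ta(start) Pend(10px)">
      <a class="C($c-fuji-blue-1-b) Cur(p) Td(n) Fw(500)" href="#">IBE.MC</a></td></tr>
  <tr><td class="Py(10px) Ta(start) Pend(10px)">
      <a class="C($c-fuji-blue-1-b) Cur(p) Td(n) Fw(500)" href="#">SAN.MC</a></td></tr>
</tbody></table>
"""
print(extract_tickers(sample_html))
```

With such a helper, each index would only need a call like `extract_tickers(requests.get(url).text)`.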

In [17]:
total_tickers = tickers2 + tickers + tickers3 + tickers4
total_tickers = list(dict.fromkeys(total_tickers))
len(total_tickers)

117

The final list contains 117 tickers whose historical data we can retrieve with yfinance. We download the closing prices of the last 5 years for all of them, and afterwards keep only those that also appear in our original database.

In [18]:
stocks = total_tickers
start = datetime.datetime(2015,1,1)
end = datetime.datetime(2020,1,1)

In [19]:
f = pdr.get_data_yahoo(stocks, start=start, end=end)
f.Close



Date,FRE.DE,PHIA.AS,ORA.PA,OR.PA,ASML.AS,IBE.MC,SU.PA,BN.PA,DTE.DE,SAN.PA,...,MRK,JPM,BA,WBA,VZ,AXP,RTX,CSCO,EI.PA,CIIC
2015-01-02,42.535000,24.174999,14.060,137.399994,89.129997,5.596,59.950001,53.860001,13.185,75.440002,...,57.189999,62.490002,129.949997,76.000000,46.959999,93.019997,72.397736,27.610001,,
2015-01-05,42.410000,23.375000,13.520,134.149994,87.489998,5.495,57.340000,52.480000,12.780,73.269997,...,58.040001,60.549999,129.050003,74.500000,46.570000,90.559998,71.189430,27.059999,,
2015-01-06,42.840000,23.155001,13.370,134.149994,84.669998,5.473,57.419998,52.320000,12.630,73.510002,...,60.320000,58.980000,127.529999,74.690002,47.040001,88.629997,70.182503,27.049999,,
2015-01-07,42.544998,23.275000,13.590,134.899994,84.949997,5.466,57.779999,52.930000,12.990,74.290001,...,61.610001,59.070000,129.509995,76.599998,46.189999,90.300003,70.943993,27.299999,,
2015-01-08,43.810001,23.865000,14.205,139.800003,88.190002,5.574,60.160000,54.459999,13.630,77.419998,...,62.849998,60.389999,131.800003,77.550003,47.180000,91.580002,72.152298,27.510000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-12-25,,,13.280,264.500000,,9.246,92.199997,74.339996,,90.629997,...,,,,,,,,,,
2019-12-26,,,,,,,,,,,...,91.339996,139.039993,329.920013,58.900002,61.290001,125.410004,94.845818,47.849998,,
2019-12-27,50.160000,43.959999,13.195,266.100006,266.899994,9.322,92.500000,74.500000,14.716,90.839996,...,91.500000,139.139999,330.140015,59.020000,61.529999,125.190002,94.575203,47.770000,,
2019-12-30,50.180000,43.575001,13.130,263.200012,262.899994,9.284,91.540001,74.000000,14.570,89.750000,...,91.029999,138.630005,326.399994,58.910000,61.209999,124.300003,94.323471,47.590000,,


In [31]:
historicos = f.Close
tickers_final = list(historicos.columns.str.split('.').str[0])

In [21]:
len(tickers_final)

117

In [22]:
lista_indices = []
for i in tickers_final:
    if i in tickers_data:
        lista_indices.append(i)
len(lista_indices)

59

Comparing the tickers of the original database with the newly scraped ones, we see that only 59 assets match. We check once more that the tickers are unique, and then select the matching records from both datasets for the analysis that follows.

In [32]:
lista = list(dict.fromkeys(lista_indices))
len(lista)

58
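
The drop from 59 to 58 suggests that one ticker appears twice once the exchange suffix has been removed. A short sketch of how such collisions could be located, using a hypothetical list of suffixed symbols (`SAN.PA` appears in the output above; `SAN.MC` is assumed here purely for illustration):

```python
import pandas as pd

# Hypothetical suffixed symbols, as returned by the download step
columns = pd.Index(['SAN.PA', 'IBE.MC', 'SAN.MC', 'MSFT'])

# Same suffix stripping as in the cell above
stripped = columns.str.split('.').str[0]

# keep=False marks every member of a colliding group, not just the repeats
collisions = columns[stripped.duplicated(keep=False)]
print(list(collisions))
```

Symbols listed this way refer to different listings that share a short name, so they need to be told apart before being matched against the original database.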

In [24]:
datos3 = datos2.loc[lista_indices,:]

59

In [25]:
datos = datos3
datos.head()

Symbol,Name,Price,Currency,Sector,Country,Rentab 1M,Rentab 3M,Rentab 1Y,Rentab 2Y,Rentab 3Y,...,PX_TO_BOOK_RATIO,PX_TO_CASH_FLOW,EPS_GROWTH,DVD_PAYOUT_RATIO,EQY_REC_CONS,TOT_ANALYST_REC,TOT_BUY_REC,TOT_SELL_REC,TWITTER_SENTIMENT_REALTIME,NEWS_SENTIMENT_RT
FRE,FRESENIUS SE & CO KGAA,43.88,EUR,Healthcare-Services,GERMANY,0.165162,-0.063294,-0.076405,-0.341636,-0.116814,...,1.42378,5.03818,-7.39726,24.8645,4.53846,26,20,0,0,0.663
PHIA,KONINKLIJKE PHILIPS NV,42.04,EUR,Healthcare-Products,NETHERLANDS,0.05761,0.030771,0.163898,0.002897,0.114594,...,2.97345,18.847,-7.20546,63.8556,4.11538,26,16,1,0,0.0
ORA,ORANGE,10.395,EUR,Telecommunications,FRANCE,-0.07063,-0.197607,-0.254839,0.025109,0.047468,...,1.06268,2.71271,63.4921,67.7582,4.25806,31,21,1,0,0.234
OR,L'OREAL,251.1,EUR,Cosmetics/Personal Care,FRANCE,0.032059,-0.006332,0.032484,0.108156,0.085859,...,4.76365,22.5862,-4.26784,63.2533,3.03226,31,7,7,0,0.0
ASML,ASML HOLDING NV,295.6,EUR,Semiconductors,NETHERLANDS,0.108361,0.109818,0.734742,-0.046739,0.374452,...,10.0457,39.4297,0.983607,38.8669,3.925,40,23,4,0,-0.106781


In [26]:
datos.to_csv('activos_finales.csv')

In [27]:
historicos = historicos.set_axis(tickers_final, axis=1)  # returns a renamed copy; the 'inplace' argument was removed in pandas 2.0
historicos

Date,FRE,PHIA,ORA,OR,ASML,IBE,SU,BN,DTE,SAN,...,MRK,JPM,BA,WBA,VZ,AXP,RTX,CSCO,EI,CIIC
2015-01-02,42.535000,24.174999,14.060,137.399994,89.129997,5.596,59.950001,53.860001,13.185,75.440002,...,57.189999,62.490002,129.949997,76.000000,46.959999,93.019997,72.397736,27.610001,,
2015-01-05,42.410000,23.375000,13.520,134.149994,87.489998,5.495,57.340000,52.480000,12.780,73.269997,...,58.040001,60.549999,129.050003,74.500000,46.570000,90.559998,71.189430,27.059999,,
2015-01-06,42.840000,23.155001,13.370,134.149994,84.669998,5.473,57.419998,52.320000,12.630,73.510002,...,60.320000,58.980000,127.529999,74.690002,47.040001,88.629997,70.182503,27.049999,,
2015-01-07,42.544998,23.275000,13.590,134.899994,84.949997,5.466,57.779999,52.930000,12.990,74.290001,...,61.610001,59.070000,129.509995,76.599998,46.189999,90.300003,70.943993,27.299999,,
2015-01-08,43.810001,23.865000,14.205,139.800003,88.190002,5.574,60.160000,54.459999,13.630,77.419998,...,62.849998,60.389999,131.800003,77.550003,47.180000,91.580002,72.152298,27.510000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-12-25,,,13.280,264.500000,,9.246,92.199997,74.339996,,90.629997,...,,,,,,,,,,
2019-12-26,,,,,,,,,,,...,91.339996,139.039993,329.920013,58.900002,61.290001,125.410004,94.845818,47.849998,,
2019-12-27,50.160000,43.959999,13.195,266.100006,266.899994,9.322,92.500000,74.500000,14.716,90.839996,...,91.500000,139.139999,330.140015,59.020000,61.529999,125.190002,94.575203,47.770000,,
2019-12-30,50.180000,43.575001,13.130,263.200012,262.899994,9.284,91.540001,74.000000,14.570,89.750000,...,91.029999,138.630005,326.399994,58.910000,61.209999,124.300003,94.323471,47.590000,,


In [28]:
historicos = historicos[lista_indices]
len(historicos.columns)

61
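
The count of 61 columns is higher than the 59 entries in `lista_indices`. One plausible explanation, assuming some column names are repeated after the suffix stripping, is that pandas returns every matching column for every requested label. The toy frame below (made-up values) reproduces the effect and shows one possible way to keep a single column per ticker:

```python
import pandas as pd

# Two different series end up with the same name 'SAN' (made-up prices)
prices = pd.DataFrame([[1.0, 2.0, 3.0]], columns=['SAN', 'IBE', 'SAN'])
wanted = ['SAN', 'IBE', 'SAN']  # selection list with a repeated label

# Each 'SAN' label matches both 'SAN' columns
inflated = prices[wanted]
print(inflated.shape[1])  # more columns than labels requested

# Keep the first column for each name, then select unique labels only
unique_prices = prices.loc[:, ~prices.columns.duplicated()]
selected = unique_prices[list(dict.fromkeys(wanted))]
print(list(selected.columns))
```

Keeping only the first column silently discards the other listing, so keeping the full suffixed symbol as the column name may be the safer choice when two companies share a short ticker.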

In [29]:
historicos.to_csv('datos_historicos.csv')

With the data saved to new CSV files, we can begin studying each strategy separately.

__References:__
* https://hackernoon.com/scraping-yahoo-finance-data-using-python-ayu3zyl
* https://es.finance.yahoo.com/quote/%5ESTOXX50E/components?p=%5ESTOXX50E
* https://es.finance.yahoo.com/quote/%5EIBEX/components?p=%5EIBEX
* https://es.finance.yahoo.com/quote/%5EIXIC/components?p=^IXIC&.tsrc=fin-srch
* https://es.finance.yahoo.com/quote/%5EDJI/components?p=%5EDJI