## Data processing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [2]:
pd.options.display.max_columns = None
pd.options.display.max_rows = 50

The files we process were obtained from the SimFin website, https://simfin.com/data/bulk, which provides data on companies listed in the US from 2007 onwards.
We downloaded 5 files (10_6_20) containing share prices, industries, company data, balance sheets and income statements. The last two cover banks only, since the format of their accounts differs from that of other companies.

Because of space constraints, we previously filtered the price data so that it contains only financial entities.
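The pre-filtering step lives in a separate notebook that is not shown here. As a rough sketch only, a large price file could be reduced by reading it in chunks and keeping the tickers of interest. The in-memory CSV, the tickers and the `keep` set below are hypothetical stand-ins, not the actual SimFin data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large semicolon-separated price file
raw = io.StringIO(
    "Ticker;Date;Close\n"
    "AAA;2020-01-02;10\n"
    "ZZZ;2020-01-02;5\n"
    "AAA;2020-01-03;11\n"
)

# Tickers we want to keep (assumption for illustration)
keep = {'AAA'}

# Read in small chunks so the full file never has to fit in memory,
# and keep only the rows whose ticker is in the selection
chunks = pd.read_csv(raw, sep=';', chunksize=2)
filtered = pd.concat([chunk[chunk['Ticker'].isin(keep)] for chunk in chunks])

print(filtered)
```

The result could then be written with `filtered.to_csv(..., compression='bz2')`, which would match the compressed file read in the next cell.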

In [3]:
! ls -1 ../datos

filtered_prices.csv
filtrado_precios.ipynb
industries.csv
Tickers.csv
us-balance-quarterly.csv
us-companies.csv
us-income-quarterly.csv


In [4]:
balance = pd.read_csv('../datos/us-balance-quarterly.csv', sep=';')
resultados = pd.read_csv('../datos/us-income-quarterly.csv', sep=';', 
                         usecols=['Ticker','Fiscal Year','Fiscal Period', 'Revenue',
                                  "Cost of Revenue","Gross Profit","Operating Expenses",
                                  "Selling, General & Administrative","Research & Development",
                                  "Depreciation & Amortization","Operating Income (Loss)",
                                  "Non-Operating Income (Loss)","Interest Expense, Net",
                                  "Pretax Income (Loss), Adj.", "Abnormal Gains (Losses)","Pretax Income (Loss)",
                                  "Income Tax (Expense) Benefit, Net","Income (Loss) from Continuing Operations",
                                  "Net Extraordinary Gains (Losses)", "Net Income","Net Income (Common)"])

industries = pd.read_csv('../datos/industries.csv', sep=';', dtype='str')
companies = pd.read_csv('../datos/us-companies.csv', sep=';', dtype='str')
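# parse_dates takes the column(s) to convert to datetime; date_parser expects a callable, not a column name
prices = pd.read_csv('../datos/filtered_prices.csv', parse_dates=['Date'], compression='bz2')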

In [5]:
# Build a DataFrame with the company information for each sector analyzed

companies = companies.merge(industries, on='IndustryId', how='left')

total_companies=companies[companies.Ticker.isin(prices.Ticker.unique())]

energy_companies = companies[companies.Sector=='Energy']
tec_companies = companies[companies.Sector=='Technology']
health_companies = companies[companies.Sector=='Healthcare']
ind_companies = companies[companies.Sector=='Industrials']

In [6]:
total_companies.groupby('Sector').count()

Sector,Ticker,SimFinId,Company Name,IndustryId,Industry
Energy,108,108,108,108,108
Healthcare,323,323,323,323,323
Industrials,263,263,263,263,263
Technology,411,411,411,411,411


Next, we check the availability of accounting information for these companies.

In [7]:
# Build a new balance sheet table restricted to the four sectors analyzed,
# adding the sector each company belongs to.
# .assign returns a new DataFrame, so we never write into a slice of `balance`.

balance_energy = balance[balance['Ticker'].isin(energy_companies.Ticker)].assign(Sector='Energy')
balance_health = balance[balance['Ticker'].isin(health_companies.Ticker)].assign(Sector='Healthcare')
balance_ind = balance[balance['Ticker'].isin(ind_companies.Ticker)].assign(Sector='Industrials')
balance_tec = balance[balance['Ticker'].isin(tec_companies.Ticker)].assign(Sector='Technology')

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the pieces in the same order,
# and ignore_index=True replaces the separate reset_index call
balance_filtrado = pd.concat([balance_energy, balance_health, balance_ind, balance_tec],
                             ignore_index=True)
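One possible way to quantify that availability is to count, per sector, how many companies have at least one balance sheet row. The sketch below uses small made-up tables (`companies_toy`, `balance_toy`) rather than the real data, so the names and figures are illustrative assumptions.

```python
import pandas as pd

# Hypothetical company list with its sector
companies_toy = pd.DataFrame({
    'Ticker': ['A', 'B', 'C', 'D'],
    'Sector': ['Energy', 'Energy', 'Technology', 'Technology'],
})

# Hypothetical balance sheet rows: B and D have no accounting data
balance_toy = pd.DataFrame({
    'Ticker': ['A', 'A', 'C'],
    'Fiscal Period': ['Q1', 'Q2', 'Q1'],
})

# Flag companies that appear at least once in the balance sheet table
flagged = companies_toy.assign(
    has_balance=companies_toy['Ticker'].isin(balance_toy['Ticker'])
)

# Per sector: companies with balance data vs. total companies
coverage = (flagged.groupby('Sector')['has_balance']
                   .agg(with_balance='sum', total='count'))

print(coverage)
```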


In [8]:
# Build a new income statement (P&L) table for the companies for which we have price data.
# .copy() gives an independent DataFrame, so adding columns later does not touch `resultados`.
resultados_filtrado = resultados[resultados['Ticker'].isin(total_companies.Ticker)].copy()

Next, we need to build a single table that incorporates as much accounting data as possible, together with the share price at the end of the period. To stay realistic, we assume that the accounting information for a quarter is not available at the quarter's close but on the date the financial statements are published. Our data includes that publication date, so we use it as the time reference.
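As an illustration of using the publication date as the time reference, the sketch below attaches to each record the first closing price available on or after its `Publish Date`. This is only one possible alignment rule, not necessarily the one used later in the project. The tickers, dates and prices are made up, and the `Close` column name is an assumption.

```python
import pandas as pd

# Hypothetical fundamentals: one row per company and quarter, with its publication date
fundamentals = pd.DataFrame({
    'Ticker': ['AAA', 'AAA', 'BBB'],
    'Publish Date': pd.to_datetime(['2020-02-15', '2020-05-15', '2020-02-16']),
})

# Hypothetical daily prices; there are no quotes on 2020-02-15 or 2020-02-16
prices_toy = pd.DataFrame({
    'Ticker': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'Date': pd.to_datetime(['2020-02-14', '2020-02-18', '2020-05-15',
                            '2020-02-14', '2020-02-18']),
    'Close': [10.0, 11.0, 12.0, 20.0, 21.0],
})

# merge_asof needs both frames sorted by their date key.
# direction='forward' picks the first price dated on or after the publication date,
# and by='Ticker' keeps the match within the same company.
aligned = pd.merge_asof(
    fundamentals.sort_values('Publish Date'),
    prices_toy.sort_values('Date'),
    left_on='Publish Date', right_on='Date',
    by='Ticker', direction='forward',
)

print(aligned)
```

Matching forward avoids using a price that predates the publication, which would leak information the market did not yet have.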

### Building the main table of variables
In this table each record represents one security in a given period. We can start from balance_filtrado and create a unique reference from Ticker, Fiscal Year and Fiscal Period.

In [9]:
balance_filtrado = balance_filtrado.astype({'Fiscal Year':str})
balance_filtrado['Ref'] = balance_filtrado['Ticker'] + balance_filtrado['Fiscal Year'] + balance_filtrado['Fiscal Period']
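Since this key is later used to join tables, it may be worth confirming that it really is unique. The following sketch, on a made-up table, builds the same kind of key and lists any duplicates. It is an optional check and not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows; the last one repeats the (Ticker, Fiscal Year, Fiscal Period) of the first
toy = pd.DataFrame({
    'Ticker': ['AE', 'AE', 'A', 'AE'],
    'Fiscal Year': [2011, 2011, 2014, 2011],
    'Fiscal Period': ['Q2', 'Q3', 'Q1', 'Q2'],
})

# Same construction as above: the year is cast to str so the three parts can be concatenated
toy['Ref'] = toy['Ticker'] + toy['Fiscal Year'].astype(str) + toy['Fiscal Period']

# keep=False marks every occurrence of a repeated key, not just the later ones
duplicated_refs = toy.loc[toy['Ref'].duplicated(keep=False), 'Ref'].unique().tolist()

print(toy['Ref'].is_unique, duplicated_refs)
```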

In [10]:
balance_filtrado.index = balance_filtrado.Ref
balance_filtrado.head()

Ref,Ticker,SimFinId,Currency,Fiscal Year,Fiscal Period,Report Date,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),"Cash, Cash Equivalents & Short Term Investments",Accounts & Notes Receivable,Inventories,Total Current Assets,"Property, Plant & Equipment, Net",Long Term Investments & Receivables,Other Long Term Assets,Total Noncurrent Assets,Total Assets,Payables & Accruals,Short Term Debt,Total Current Liabilities,Long Term Debt,Total Noncurrent Liabilities,Total Liabilities,Share Capital & Additional Paid-In Capital,Treasury Stock,Retained Earnings,Total Equity,Total Liabilities & Equity,Sector,Ref
AE2011Q2,AE,191518,USD,2011,Q2,2011-03-31,2011-08-12,2011-08-12,4217596.0,4217596.0,36687000.0,212748000.0,20393000.0,286520000.0,69562000.0,,3306000.0,72868000.0,359388000,244369000.0,,251296000.0,,8765000.0,260061000,12115000.0,,87212000.0,99327000.0,359388000,Energy,AE2011Q2
AE2011Q3,AE,191518,USD,2011,Q3,2011-06-30,2011-11-13,2011-11-14,4217596.0,4217596.0,53119000.0,189217000.0,18415000.0,275601000.0,74783000.0,,3203000.0,77986000.0,353587000,224981000.0,,231114000.0,,14120000.0,245234000,12115000.0,,96238000.0,108353000.0,353587000,Energy,AE2011Q3
AE2011Q4,AE,191518,USD,2011,Q4,2011-09-30,2012-02-13,2013-03-15,4217596.0,4217596.0,37066000.0,225393000.0,18464000.0,304965000.0,68857000.0,,5018000.0,73875000.0,378840000,249768000.0,,256094000.0,0.0,12064000.0,268158000,12115000.0,,98567000.0,110682000.0,378840000,Energy,AE2011Q4
AE2012Q1,AE,191518,USD,2012,Q1,2011-12-31,2012-05-14,2012-05-14,4217596.0,4217596.0,35989000.0,251618000.0,29150000.0,326920000.0,80818000.0,,4796000.0,85614000.0,412534000,280186000.0,,282815000.0,,12462000.0,295277000,12115000.0,,105142000.0,117257000.0,412534000,Energy,AE2012Q1
AE2012Q2,AE,191518,USD,2012,Q2,2012-03-31,2012-08-14,2012-08-14,4217596.0,4217596.0,32213000.0,190133000.0,17001000.0,249747000.0,89591000.0,,4697000.0,94288000.0,344035000,207331000.0,,208355000.0,,13037000.0,221392000,12115000.0,,110528000.0,122643000.0,344035000,Energy,AE2012Q2


In [11]:
resultados_filtrado = resultados_filtrado.astype({'Fiscal Year':str})
resultados_filtrado['Ref'] = resultados_filtrado['Ticker'] + resultados_filtrado['Fiscal Year'] + resultados_filtrado['Fiscal Period']

In [12]:
resultados_filtrado.index = resultados_filtrado.Ref
resultados_filtrado.head()

Ref,Ticker,Fiscal Year,Fiscal Period,Revenue,Cost of Revenue,Gross Profit,Operating Expenses,"Selling, General & Administrative",Research & Development,Depreciation & Amortization,Operating Income (Loss),Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common),Ref
A2014Q1,A,2014,Q1,1008000000.0,-498000000.0,510000000.0,-386000000.0,-298000000.0,-88000000.0,,124000000.0,-27000000.0,-27000000.0,97000000,,97000000,24000000.0,121000000,74000000.0,195000000,195000000,A2014Q1
A2014Q2,A,2014,Q2,988000000.0,-503000000.0,485000000.0,-391000000.0,-304000000.0,-87000000.0,,94000000.0,-25000000.0,-28000000.0,69000000,,69000000,-29000000.0,40000000,99000000.0,139000000,139000000,A2014Q2
A2014Q3,A,2014,Q3,1009000000.0,-507000000.0,502000000.0,-371000000.0,-285000000.0,-86000000.0,,131000000.0,-46000000.0,-25000000.0,85000000,,85000000,-22000000.0,63000000,84000000.0,147000000,147000000,A2014Q3
A2014Q4,A,2014,Q4,1043000000.0,-564000000.0,479000000.0,-409000000.0,-312000000.0,-97000000.0,,70000000.0,-92000000.0,-21000000.0,-22000000,,-22000000,30000000.0,8000000,60000000.0,68000000,68000000,A2014Q4
A2015Q1,A,2015,Q1,1026000000.0,-513000000.0,513000000.0,-398000000.0,-310000000.0,-88000000.0,,115000000.0,-2000000.0,-14000000.0,113000000,,113000000,-20000000.0,93000000,-30000000.0,63000000,63000000,A2015Q1


### Merging the balance sheet and income statement tables

Next, we combine the balance sheet and income statement information into a single table.

In [13]:
balance_filtrado.drop('Ref', axis=1, inplace=True)
resultados_filtrado.drop(['Ref','Fiscal Year', 'Fiscal Period','Ticker'], axis=1, inplace=True)

In [14]:
mergedbalres=balance_filtrado.merge(resultados_filtrado, on= 'Ref', how='left')
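A left join keeps every balance sheet row even when no income statement matches. As a hedged sketch on made-up tables, `validate` and `indicator` can make that explicit: the first raises an error if the key is duplicated on either side, and the second records which rows found no match. The frame names and values are illustrative only.

```python
import pandas as pd

# Hypothetical balance sheet and income statement tables, both indexed by Ref
bal = pd.DataFrame(
    {'Total Assets': [100, 200, 300]},
    index=pd.Index(['X2019Q1', 'X2019Q2', 'Y2019Q1'], name='Ref'),
)
inc = pd.DataFrame(
    {'Revenue': [10, 30]},
    index=pd.Index(['X2019Q1', 'Y2019Q1'], name='Ref'),
)

# validate='one_to_one' raises MergeError if a Ref is repeated in either table;
# indicator=True adds a _merge column with 'both' or 'left_only' for each row
merged = bal.merge(inc, left_index=True, right_index=True,
                   how='left', validate='one_to_one', indicator=True)

# Balance sheet rows without a matching income statement
missing = merged.index[merged['_merge'] == 'left_only'].tolist()

print(merged)
print(missing)
```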

In [15]:
mergedbalres.head(5)

Ref,Ticker,SimFinId,Currency,Fiscal Year,Fiscal Period,Report Date,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),"Cash, Cash Equivalents & Short Term Investments",Accounts & Notes Receivable,Inventories,Total Current Assets,"Property, Plant & Equipment, Net",Long Term Investments & Receivables,Other Long Term Assets,Total Noncurrent Assets,Total Assets,Payables & Accruals,Short Term Debt,Total Current Liabilities,Long Term Debt,Total Noncurrent Liabilities,Total Liabilities,Share Capital & Additional Paid-In Capital,Treasury Stock,Retained Earnings,Total Equity,Total Liabilities & Equity,Sector,Revenue,Cost of Revenue,Gross Profit,Operating Expenses,"Selling, General & Administrative",Research & Development,Depreciation & Amortization,Operating Income (Loss),Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
AE2011Q2,AE,191518,USD,2011,Q2,2011-03-31,2011-08-12,2011-08-12,4217596.0,4217596.0,36687000.0,212748000.0,20393000.0,286520000.0,69562000.0,,3306000.0,72868000.0,359388000,244369000.0,,251296000.0,,8765000.0,260061000,12115000.0,,87212000.0,99327000.0,359388000,Energy,777538000.0,-16147000.0,761391000.0,-755943000.0,-752151000.0,,-3792000.0,5448000.0,36000.0,36000.0,5484000.0,0.0,5484000.0,-2129000.0,3355000.0,234000.0,3589000.0,3589000.0
AE2011Q3,AE,191518,USD,2011,Q3,2011-06-30,2011-11-13,2011-11-14,4217596.0,4217596.0,53119000.0,189217000.0,18415000.0,275601000.0,74783000.0,,3203000.0,77986000.0,353587000,224981000.0,,231114000.0,,14120000.0,245234000,12115000.0,,96238000.0,108353000.0,353587000,Energy,755995000.0,-15377000.0,740618000.0,-726674000.0,-722721000.0,,-3953000.0,13944000.0,142000.0,142000.0,14086000.0,0.0,14086000.0,-4820000.0,9266000.0,-240000.0,9026000.0,9026000.0
AE2011Q4,AE,191518,USD,2011,Q4,2011-09-30,2012-02-13,2013-03-15,4217596.0,4217596.0,37066000.0,225393000.0,18464000.0,304965000.0,68857000.0,,5018000.0,73875000.0,378840000,249768000.0,,256094000.0,0.0,12064000.0,268158000,12115000.0,,98567000.0,110682000.0,378840000,Energy,841358000.0,-27114000.0,814244000.0,-807295000.0,-802471000.0,,-4824000.0,6949000.0,3000.0,3000.0,6952000.0,90000.0,7042000.0,-2723000.0,4319000.0,414000.0,4733000.0,4733000.0
AE2012Q1,AE,191518,USD,2012,Q1,2011-12-31,2012-05-14,2012-05-14,4217596.0,4217596.0,35989000.0,251618000.0,29150000.0,326920000.0,80818000.0,,4796000.0,85614000.0,412534000,280186000.0,,282815000.0,,12462000.0,295277000,12115000.0,,105142000.0,117257000.0,412534000,Energy,877489000.0,-13731000.0,863758000.0,-854376000.0,-850213000.0,,-4163000.0,9382000.0,20000.0,20000.0,9402000.0,,9402000.0,-3352000.0,6050000.0,525000.0,6575000.0,6575000.0
AE2012Q2,AE,191518,USD,2012,Q2,2012-03-31,2012-08-14,2012-08-14,4217596.0,4217596.0,32213000.0,190133000.0,17001000.0,249747000.0,89591000.0,,4697000.0,94288000.0,344035000,207331000.0,,208355000.0,,13037000.0,221392000,12115000.0,,110528000.0,122643000.0,344035000,Energy,831474000.0,-15463000.0,816011000.0,-807448000.0,-802422000.0,,-5026000.0,8563000.0,19000.0,19000.0,8582000.0,,8582000.0,-3085000.0,5497000.0,-111000.0,5386000.0,5386000.0


In [16]:
mergedbalres.to_csv('../tablas/mergedbalres.csv')
total_companies.to_csv('../tablas/filteredcompanies.csv', index=False)

The next step in our project is to label each record in the mergedbalres table according to its performance relative to an index.
We tackle this phase in the 'etiquetado' (labeling) notebook.