# Pairs Trading with Machine Learning, Part 2
## Companion Notebook: Gather Business Profiles

[Jonathan Larkin](mailto:jlarkin@quantopian.com)

September 2017

# Gather Business Profiles

I use the [`pandas_finance`](https://github.com/davidastephens/pandas-finance) Python library to query the [Profile tab](https://finance.yahoo.com/quote/KO/profile?p=KO) on Yahoo Finance. For example, let's look at KO and PEP.

In [1]:
from pandas_finance import Equity

In [4]:
ko = Equity('KO')
print(ko.profile.Longbusinesssummary)

The Coca-Cola Company, a beverage company, manufactures and distributes various nonalcoholic beverages worldwide. The company primarily offers sparkling beverages and still beverages. Its sparkling beverages include nonalcoholic ready-to-drink beverages with carbonation, such as carbonated energy drinks, and carbonated waters and flavored waters. The companys still beverages comprise nonalcoholic beverages without carbonation, including noncarbonated waters, flavored and enhanced waters, noncarbonated energy drinks, juices and juice drinks, ready-to-drink teas and coffees, and sports drinks. It also provides flavoring ingredients, sweeteners, beverage ingredients, and fountain syrups, as well as powders for purified water products. The Coca-Cola Company sells its products primarily under the Coca-Cola, Diet Coke/Coca-Cola Light, Coca-Cola Zero, Fanta, Sprite, Minute Maid, Georgia, Powerade, Del Valle, Schweppes, Aquarius, Minute Maid Pulpy, Dasani, Simply, Glacéau Vitaminwater, Gold P

In [5]:
pep = Equity('PEP')
print(pep.profile.Longbusinesssummary)

PepsiCo, Inc. operates as a food and beverage company worldwide. Its Frito-Lay North America segment offers Lays and Ruffles potato chips; Doritos, Tostitos, and Santitas tortilla chips; and Cheetos cheese-flavored snacks, branded dips, and Fritos corn chips. The companys Quaker Foods North America segment provides Quaker oatmeal, grits, rice cakes, granola, and oat squares; and Aunt Jemima mixes and syrups, Quaker Chewy granola bars, Capn Crunch cereal, Life cereal, and Rice-A-Roni side dishes. Its North America Beverages segment offers beverage concentrates, fountain syrups, and finished goods under the Pepsi, Gatorade, Mountain Dew, Diet Pepsi, Aquafina, Diet Mountain Dew, Tropicana Pure Premium, Mist Twst, and Mug brands; and ready-to-drink tea and coffee, and juices. The companys Latin America segment provides snack foods under the Doritos, Cheetos, Marias Gamesa, Ruffles, Emperador, Saladitas, Sabritas, Lays, Rosquinhas Mabel, and Tostitos brands; cereals and snacks under th

We need a corpus of these profiles for a large set of tickers. Ironically, the "hardest" part of this post for me was finding a source of tickers to serve as a base universe. I need this universe off-platform because I obtain the profile data off-platform. We get spoiled on Quantopian by universes such as `Q1500US`, which give a valid tradeable universe; good luck finding this outside of Quantopian (unless you buy very expensive index composition data from a major index provider). This is hard because freely available financial data (e.g., Yahoo) exists to satisfy analysis that starts with a question like "Given AAPL, MSFT, and FB, what...?". For proper *quantitative* analysis, this is completely backwards. We don't know *yet* which stocks we care to look at; we need a large, valid universe of tickers as a pool to analyze, and we need that universe cross-sectionally as of a specific date.

The best I could find is the [Quandl WIKI PRICES](https://www.quandl.com/product/WIKIP/documentation/about) EOD price database. It's not clear what the complete criteria are for inclusion in this dataset, but it *looks* to me like an attempt to replicate the Russell 3000. Quandl indicated to me that volume is the primary criterion. This data is free; you need a Quandl API key, which you can obtain when you register for a free account.

In [6]:
import numpy as np
import os
import quandl
from tqdm import tqdm, tqdm_notebook
import pandas as pd

In [7]:
# put your Quandl API KEY in your .bash_profile as
# export QUANDL_API_KEY="ABC_abc123..." 
QUANDL_API_KEY = os.environ['QUANDL_API_KEY']
quandl.ApiConfig.api_key = QUANDL_API_KEY

In [8]:
# get cross-sectional data for 1 day
data = quandl.get_table('WIKI/PRICES', date='2017-09-06')

In [9]:
data.head()

|   | ticker | date | open | high | low | close | volume | ex-dividend | split_ratio | adj_open | adj_high | adj_low | adj_close | adj_volume |
|---|--------|------|------|------|-----|-------|--------|-------------|-------------|----------|----------|---------|-----------|------------|
| 0 | A | 2017-09-06 | 64.56 | 64.81 | 64.05 | 64.71 | 967758.0 | 0.0 | 1.0 | 64.56 | 64.81 | 64.05 | 64.71 | 967758.0 |
| 1 | AA | 2017-09-06 | 44.4 | 44.5 | 43.3 | 44.43 | 3419529.0 | 0.0 | 1.0 | 44.4 | 44.5 | 43.3 | 44.43 | 3419529.0 |
| 2 | AAL | 2017-09-06 | 43.09 | 44.68 | 42.6066 | 44.31 | 8689321.0 | 0.0 | 1.0 | 43.09 | 44.68 | 42.6066 | 44.31 | 8689321.0 |
| 3 | AAMC | 2017-09-06 | 98.5 | 116.0 | 96.1 | 108.7 | 30394.0 | 0.0 | 1.0 | 98.5 | 116.0 | 96.1 | 108.7 | 30394.0 |
| 4 | AAN | 2017-09-06 | 43.62 | 43.62 | 41.72 | 42.0 | 1111899.0 | 0.0 | 1.0 | 43.62 | 43.62 | 41.72 | 42.0 | 1111899.0 |


In [10]:
tickers = list(data[data.date == pd.Timestamp('2017-09-06')]['ticker'].values)
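
The same "universe as of a date" idea can be sketched on a toy long-format table. This is only an illustration: the helper name `universe_as_of` is hypothetical, and the tickers and dates below are placeholders, not Quandl data.

```python
import pandas as pd


def universe_as_of(prices, as_of):
    """Return the sorted, de-duplicated tickers that have a row on `as_of`."""
    on_date = prices[prices['date'] == pd.Timestamp(as_of)]
    return sorted(on_date['ticker'].unique())


# Toy long-format table; tickers and dates are placeholders.
toy_prices = pd.DataFrame({
    'ticker': ['AAA', 'BBB', 'AAA', 'CCC'],
    'date': pd.to_datetime(
        ['2017-09-05', '2017-09-06', '2017-09-06', '2017-09-06']
    ),
})

universe = universe_as_of(toy_prices, '2017-09-06')
```

Filtering on the date keeps the universe tied to a single cross-section even if the table holds more than one day of prices.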

In [11]:
profile_df = pd.DataFrame(index=tickers)
profile_df['quandl_sym'] = tickers
profile_df['yhoo_sym'] = None
profile_df['mstr_sym'] = None
profile_df['profile'] = None

Symbology mapping is not too painful in this post. We only need to account for the different conventions for stocks with distinct share classes across Quandl, Yahoo, and Quantopian (Morningstar).

In [12]:
profile_df['yhoo_sym'] = profile_df['quandl_sym'].str.replace('_','-')
profile_df['mstr_sym'] = profile_df['quandl_sym'].str.replace('_','.')
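
To make the convention concrete, here is a minimal sketch of the same mapping applied to a single symbol. The helper name `map_share_class` is hypothetical, and the `BRK_B` input assumes that Quandl writes share classes with an underscore, as the replacements above imply.

```python
def map_share_class(quandl_sym):
    """Return the Yahoo and Morningstar spellings of a Quandl symbol.

    Assumes the only difference between vendors is the share-class
    separator: '_' (Quandl), '-' (Yahoo), '.' (Morningstar).
    """
    return {
        'quandl_sym': quandl_sym,
        'yhoo_sym': quandl_sym.replace('_', '-'),
        'mstr_sym': quandl_sym.replace('_', '.'),
    }


with_class = map_share_class('BRK_B')   # symbol with a share class
plain = map_share_class('KO')           # symbol without one is unchanged
```

Symbols without a share class pass through untouched, which is why the mapping can be applied to the whole column at once.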

#### Get Company Profile Data (skip if you already generated the file)
We loop through each ticker in the universe, and get the company profile (this takes about a minute to run).

In [None]:
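missing_symbols = []
for symbol in tqdm_notebook(profile_df.index):
    try:
        # query Yahoo using the Yahoo-style symbol
        eq = Equity(profile_df.loc[symbol, 'yhoo_sym'])
        # a single .loc call writes into profile_df itself, not a temporary copy
        profile_df.loc[symbol, 'profile'] = eq.profile.Longbusinesssummary
    except Exception:
        # no profile available (or the request failed); record it and move on
        missing_symbols.append(symbol)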
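
The shape of this loop (try each symbol, keep what comes back, record what fails) can be tried without touching Yahoo at all. The sketch below is not the `pandas_finance` API: `collect_profiles` and `fake_fetch` are hypothetical names, and the profile strings are made-up stand-ins for a live query.

```python
def collect_profiles(symbols, fetch_profile):
    """Fetch one profile per symbol and return (profiles, missing).

    `fetch_profile` is any callable that takes a symbol and returns text.
    Symbols whose lookup raises are recorded in `missing` rather than
    silently dropped.
    """
    profiles, missing = {}, []
    for symbol in symbols:
        try:
            profiles[symbol] = fetch_profile(symbol)
        except Exception:
            missing.append(symbol)
    return profiles, missing


# Stand-in fetcher with made-up text, used instead of a live Yahoo query.
fake_pages = {'KO': 'Beverage company.', 'PEP': 'Food and beverage company.'}


def fake_fetch(symbol):
    return fake_pages[symbol]  # raises KeyError for unknown symbols


profiles, missing = collect_profiles(['KO', 'PEP', 'ZZZZ'], fake_fetch)
```

Passing the fetcher in as an argument keeps the error handling testable offline; in the notebook, the role of `fetch_profile` is played by the `Equity(...).profile.Longbusinesssummary` lookup.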

In [13]:
profile_df = profile_df[~profile_df['profile'].isnull()]
print("We got %d company profiles." % len(profile_df))

We got 2486 company profiles.


In [14]:
profile_df.head()

|      | quandl_sym | yhoo_sym | mstr_sym | profile |
|------|------------|----------|----------|---------|
| A    | A    | A    | A    | Agilent Technologies, Inc. provides applicatio... |
| AA   | AA   | AA   | AA   | Alcoa Corporation produces and sells bauxite, ... |
| AAL  | AAL  | AAL  | AAL  | American Airlines Group Inc., through its subs... |
| AAMC | AAMC | AAMC | AAMC | Altisource Asset Management Corporation, an as... |
| AAN  | AAN  | AAN  | AAN  | Aarons, Inc. operates an omnichannel provider... |


Save to a file.

In [15]:
profile_df.to_csv(
    'profiles_20170907.csv',
    index=False,
    encoding='utf-8'
)

#### Load Stock Profiles

In [12]:
profile_df = pd.read_csv('profiles_20170907.csv')
profile_df.index = profile_df['quandl_sym']
profile_df.index.name = None  # clear the index label inherited from the column
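
Because the file was written with `index=False`, the ticker index is not stored and has to be rebuilt from the `quandl_sym` column on load. A minimal sketch of that round trip on a toy frame, assuming a temporary file in place of `profiles_20170907.csv` (the two rows and their profile text are placeholders):

```python
import os
import tempfile

import pandas as pd

# Toy frame shaped like profile_df; the profile text is placeholder data.
toy = pd.DataFrame(
    {
        'quandl_sym': ['A', 'BRK_B'],
        'profile': ['first profile', 'second profile'],
    },
    index=['A', 'BRK_B'],
)

path = os.path.join(tempfile.mkdtemp(), 'toy_profiles.csv')
toy.to_csv(path, index=False)        # the ticker index is not written

loaded = pd.read_csv(path)           # comes back with a default 0..n-1 index
loaded.index = loaded['quandl_sym']  # rebuild the ticker index
loaded.index.name = None             # clear the label inherited from the column
```

After the reload, rows can be looked up by ticker again, for example `loaded.loc['BRK_B', 'profile']`.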

In [13]:
profile_df.head()

|      | quandl_sym | yhoo_sym | mstr_sym | profile |
|------|------------|----------|----------|---------|
| A    | A    | A    | A    | Agilent Technologies, Inc. provides applicatio... |
| AA   | AA   | AA   | AA   | Alcoa Corporation produces and sells bauxite, ... |
| AAL  | AAL  | AAL  | AAL  | American Airlines Group Inc., through its subs... |
| AAMC | AAMC | AAMC | AAMC | Altisource Asset Management Corporation, an as... |
| AAN  | AAN  | AAN  | AAN  | Aarons, Inc. operates an omnichannel provider... |


_This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances._