# Analyzing Stock Prices

We will be analyzing stock prices for the NASDAQ stock exchange reported on Yahoo Finance between 2007-01-01 and 2017-04-17. Each file is listed with the stock ticker. Within the file, we are proved rows with the following info:

- date -- date that the data is from.
- close -- the closing price on that day, which is the price when the trading day ends.
- open -- the opening price on that day, which is the price when the trading day starts.
- high -- the highest price the stock reached during trading.
- low -- the lowest price the stock reached during trading.
- volume -- the number of shares that were traded during the day.

We will need to consider three layers to reading and storing this data:

- Layer 1 -- the stock symbol, or an numeric index representing the stock symbol.
- Layer 2 -- the rows in a stock symbol csv file.
- Layer 3 -- The column names in a stock symbol csv file.

I think the best data structure would be a has table for the first layer to quickly index the stock symbol. Then, I would do a list for the rows for a specific stock symbol. Finally, I would just use a list for the column names.

In [3]:
import concurrent.futures
import os

def read_file(filename):
    with open(filename, 'r') as f:
        data = f.read().strip()
    key = filename.replace('.csv','').replace('prices/','')
    data = data.split('\n')
    data = [d.split(',') for d in data]
    return key, data

pool = concurrent.futures.ProcessPoolExecutor(max_workers=2)
filenames = ['prices/{}'.format(f) for f in os.listdir('prices')]
prices = list(pool.map(read_file, filenames))
prices = dict(prices)

In [5]:
from dateutil.parser import parse

prices_columns = {}

for k,v in prices.items():
    price = v
    headers = price[0]
    price_columns = {}
    for i, header in enumerate(headers):
        values = [p[i] for p in price[1:]]
        if i > 0:
            values = [float(v) for v in values]
        else:
            values = [parse(v) for v in values]
        price_columns[header] = values
    prices_columns[k] = price_columns

{'date': [datetime.datetime(2007, 1, 3, 0, 0), datetime.datetime(2007, 1, 4, 0, 0), datetime.datetime(2007, 1, 5, 0, 0), datetime.datetime(2007, 1, 8, 0, 0), datetime.datetime(2007, 1, 9, 0, 0), datetime.datetime(2007, 1, 10, 0, 0), datetime.datetime(2007, 1, 11, 0, 0), datetime.datetime(2007, 1, 12, 0, 0), datetime.datetime(2007, 1, 16, 0, 0), datetime.datetime(2007, 1, 17, 0, 0), datetime.datetime(2007, 1, 18, 0, 0), datetime.datetime(2007, 1, 19, 0, 0), datetime.datetime(2007, 1, 22, 0, 0), datetime.datetime(2007, 1, 23, 0, 0), datetime.datetime(2007, 1, 24, 0, 0), datetime.datetime(2007, 1, 25, 0, 0), datetime.datetime(2007, 1, 26, 0, 0), datetime.datetime(2007, 1, 29, 0, 0), datetime.datetime(2007, 1, 30, 0, 0), datetime.datetime(2007, 1, 31, 0, 0), datetime.datetime(2007, 2, 1, 0, 0), datetime.datetime(2007, 2, 2, 0, 0), datetime.datetime(2007, 2, 5, 0, 0), datetime.datetime(2007, 2, 6, 0, 0), datetime.datetime(2007, 2, 7, 0, 0), datetime.datetime(2007, 2, 8, 0, 0), datetime.date

In [7]:
from statistics import mean

average_closing = {}
for k,v in prices_columns.items():
    average_closing[k] = mean(v['close'])
    
closing_details = [(k,v) for k,v in average_closing.items()]
sorted(closing_details, key = lambda x:x[1], reverse=True)

[('amzn', 275.13407757104244),
 ('aapl', 257.17654040231656),
 ('cme', 230.2946601100386),
 ('atri', 228.38977615984555),
 ('fcnca', 200.25248278146717),
 ('bidu', 193.53191124478764),
 ('eqix', 165.3847721150579),
 ('biib', 164.53822006138998),
 ('esgr', 114.26885330617759),
 ('bbh', 113.28309655096525),
 ('djco', 110.25166789845561),
 ('dhil', 104.54806553783784),
 ('csgp', 103.10355984362934),
 ('anat', 97.93825093397683),
 ('alxn', 97.1099267011583),
 ('cost', 96.17006946409266),
 ('cacc', 95.49895756602317),
 ('amgn', 92.2331003965251),
 ('bwld', 89.39383399150579),
 ('ffiv', 86.29457917374518),
 ('celg', 85.09483015984556),
 ('algt', 83.70168345444016),
 ('coke', 80.56527417181468),
 ('cswc', 77.7559074069498),
 ('cbrl', 76.63736287992278),
 ('chdn', 72.21778764864865),
 ('fisv', 67.52742853513513),
 ('esrx', 67.4280848891892),
 ('cern', 65.04237453166023),
 ('alog', 64.74335521467181),
 ('acgl', 63.325907376833975),
 ('anss', 62.32520078146718),
 ('chrw', 61.98583785675676),
 ('

Amazon appears to have the higherst average closing price over this period of time.

In [8]:
average_volume = {}
for k,v in prices_columns.items():
    average_volume[k] = mean(v['volume'])
    
volume_details = [(k,v) for k,v in average_volume.items()]
sorted(volume_details, key = lambda x:x[1], reverse=True)

[('aapl', 130112422.35521236),
 ('csco', 45224781.428571425),
 ('cmcsa', 34337459.69111969),
 ('ebay', 29059822.548262548),
 ('amd', 24757016.94980695),
 ('bbry', 18719564.942084942),
 ('amat', 17738006.679536678),
 ('bidu', 14979788.764478764),
 ('csx', 11931650.617760617),
 ('atvi', 9792519.536679536),
 ('brcd', 9177569.536679536),
 ('aal', 8469080.501930501),
 ('cy', 7870053.281853282),
 ('celg', 7086173.3590733595),
 ('ctsh', 6460113.05019305),
 ('amgn', 6412205.173745174),
 ('flex', 6136054.633204633),
 ('amzn', 5974108.532818533),
 ('esrx', 5430109.884169884),
 ('eric', 5398251.66023166),
 ('adbe', 5341678.532818533),
 ('ea', 5209939.420849421),
 ('arna', 4344525.791505791),
 ('amtd', 3952669.806949807),
 ('etfc', 3887712.1621621624),
 ('ca', 3778223.6293436293),
 ('akam', 3720049.305019305),
 ('disca', 3629313.243243243),
 ('dltr', 3526729.3822393822),
 ('cdns', 3437799.189189189),
 ('adi', 3337189.034749035),
 ('ctxs', 3333962.3166023167),
 ('cost', 3242085.4054054054),
 ('bbby

Apple had the highest overall average volume of trade. However, Amazon wasn't far down the list with the highest average price as well.

In [10]:
range = {}
for k,v in prices_columns.items():
    range[k] = mean(v['close']) - mean(v['open'])
    
range_details = [(k,v) for k,v in range.items()]
sorted(range_details, key = lambda x:x[1], reverse=True)

[('atri', 0.39902332818533637),
 ('fcnca', 0.1757878837837552),
 ('dhil', 0.13680693204634053),
 ('eqix', 0.12352105830115079),
 ('djco', 0.11308870849421737),
 ('amzn', 0.0977996007721913),
 ('fisv', 0.06399999189189032),
 ('cohr', 0.05984936949808173),
 ('expo', 0.055976842471039845),
 ('cswc', 0.05594595907335531),
 ('bbh', 0.05522780849422304),
 ('cern', 0.05469882200772247),
 ('bwld', 0.052745122779924714),
 ('cost', 0.05246715289574411),
 ('cacc', 0.052220128957529255),
 ('anss', 0.05086875366795596),
 ('esgr', 0.04873359189188875),
 ('acgl', 0.04661028339768336),
 ('cass', 0.04558703281853127),
 ('chdn', 0.04427032664094099),
 ('cpla', 0.044081135521224724),
 ('cffi', 0.043177622393827164),
 ('csgp', 0.04196131737452902),
 ('bstc', 0.04187643783783557),
 ('cash', 0.040220054440155195),
 ('chrw', 0.039791492277991836),
 ('adp', 0.038671754826253846),
 ('adsk', 0.03827030579150659),
 ('abax', 0.03721620965250594),
 ('blkb', 0.036355230888034384),
 ('abco', 0.03557917953668266),
 (

Interestingly, Apple has one of the lowest differences between the open and close prices. In other words, it seems to, on average, open at a higher price than it closes.

## Finding the most traded stock for each day

In [15]:
trades = {}
for k,v in prices_columns.items():
    for i,date in enumerate(v['date']):
        if date not in trades:
            trades[date] = []
        trades[date].append([k,v['volume'][i]])

most_traded = []
for k,v in trades.items():
    company = sorted(v, key = lambda x:x[1], reverse=True)[0][0]
    most_traded.append([k,company])

sorted(most_traded, key = lambda x:x[0])

[[datetime.datetime(2007, 1, 3, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 4, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 5, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 8, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 9, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 10, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 11, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 12, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 16, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 17, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 18, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 19, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 22, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 23, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 24, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 25, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 26, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 29, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 30, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 31, 0, 0), 'aapl'],
 [datetime.datetime(2007,

In [18]:
daily_volumes = {}

most_traded = []
for k,v in trades.items():
    volume = sum([item[1] for item in v])
    daily_volumes[k] = volume
    
volume_details = [[k,v] for k,v in daily_volumes.items()]
top_volume_details = sorted(volume_details, key = lambda x:x[1], reverse=True)[:10]
top_volume_details

[[datetime.datetime(2008, 1, 23, 0, 0), 1964583900.0],
 [datetime.datetime(2008, 10, 10, 0, 0), 1770266900.0],
 [datetime.datetime(2007, 7, 26, 0, 0), 1611272800.0],
 [datetime.datetime(2008, 10, 8, 0, 0), 1599183500.0],
 [datetime.datetime(2008, 1, 22, 0, 0), 1578877700.0],
 [datetime.datetime(2008, 2, 7, 0, 0), 1559032100.0],
 [datetime.datetime(2008, 9, 29, 0, 0), 1555072400.0],
 [datetime.datetime(2007, 11, 8, 0, 0), 1553880500.0],
 [datetime.datetime(2008, 1, 16, 0, 0), 1536176400.0],
 [datetime.datetime(2008, 1, 24, 0, 0), 1533363200.0]]

In [21]:
import math

top_volume_days = [v[0] for v in top_volume_details]

def binary_search(array, search):
    m = 0
    i = 0
    z = len(array) - 1
    while i <= z:
        m = math.floor(i + ((z - i) / 2))
        if array[m] == search:
            return m
        elif array[m] < search:
            i = m + 1
        elif array[m] > search:
            z = m - 1
            
high_volume_prices = {}
for k,v in prices_columns.items():
    for day in top_volume_days:
        result = binary_search(v['date'], day)
        if result is None:
            continue
        if k not in high_volume_prices:
            high_volume_prices[k] = []
        high_volume_prices[k].append(prices[k][result])

high_volume_prices

{'aal': [['2008-01-22', '12.02', '12.26', '12.92', '11.61', '4828200'],
  ['2008-10-09', '3.63', '4.40', '4.74', '3.56', '8180200'],
  ['2007-07-25', '34.84', '35.259998', '35.650002', '34.240002', '1992600'],
  ['2008-10-07', '5.11', '6.18', '6.30', '4.95', '10827400'],
  ['2008-01-18', '12.92', '12.35', '13.14', '12.35', '3806100'],
  ['2008-02-06', '15.34', '14.76', '15.65', '14.06', '5329200'],
  ['2008-09-26', '6.12', '6.01', '6.29', '5.90', '4478800'],
  ['2007-11-07', '22.60', '22.610001', '23.25', '22.00', '4501800'],
  ['2008-01-15', '12.51', '11.85', '12.64', '11.75', '6321800'],
  ['2008-01-23', '13.14', '12.04', '13.42', '11.75', '4990600']],
 'aame': [['2008-01-22', '1.50', '1.40', '1.50', '1.29', '5000'],
  ['2008-10-09', '1.07', '1.05', '1.07', '1.05', '2700'],
  ['2007-07-25', '3.81', '3.81', '3.81', '3.81', '000'],
  ['2008-10-07', '1.00', '1.00', '1.01', '0.52', '7700'],
  ['2008-01-18', '1.49', '1.43', '1.50', '1.43', '7200'],
  ['2008-02-06', '1.65', '1.70', '1.70',

### Finding the most profitable stock to buy on 2007-01-03 assuming growth through to the last day of this dataset

In [22]:
profits = []
for k,v in prices_columns.items():
    percentage = (v['close'][-1] - v['close'][0]) * 100 / v['close'][0]
    profits.append([k,percentage])
    
sorted(profits, key = lambda x:x[1], reverse=True)[:10]

[['admp', 7483.8389225948395],
 ['adxs', 4005.0000000000005],
 ['arcw', 3898.6004898285596],
 ['blfs', 2437.4365640858978],
 ['amzn', 2230.7234281466817],
 ['anip', 1707.355447278503],
 ['apdn', 1549.6700659868027],
 ['cui', 1525.162516251625],
 ['bcli', 1339.2137535980346],
 ['achc', 1330.0000666666667]]

ADMP would have been the most profitable to invest in during that time, growing by a whopping 7483%!