# Analyzing Stock Prices 

In this project, I will be working with stock market data downloaded from [Yahoo Finance](https://finance.yahoo.com/) using the `yahoo_finance` Python package. The dataset contains daily stock prices from `2007-01-01` to `2017-04-17` for several hundred stocks traded on the NASDAQ stock exchange. The data is stored in the `prices` folder. Each file in that folder is named for a specific stock symbol and contains the following columns:

* `date`: date that the data is from.
* `close`: the closing price on that day, which is the price when the trading day ends.
* `open`: the opening price on that day, which is the price when the trading day starts.
* `high`: the highest price the stock reached during trading.
* `low`: the lowest price the stock reached during trading.
* `volume`: the number of shares that were traded during the day.

## Reading in the data

In order to read in and store all of the data, I will need three layers of indices. They are as follows:

1. The stock symbol, or a numeric index representing the stock symbol.
2. The rows in a stock symbol csv file.
3. The column names in a stock symbol csv file.

I will first choose an appropriate initial data structure for each layer.

* For layer 1, a hash table (a Python `dict`) is the best fit, since it allows fast lookups by key. The keys will be the stock symbols.
* For layers 2 and 3, a list will work for now. 

I will now use multiple processes to read the data into the above data structures. 

In [1]:
import concurrent.futures
import os

def read_file(filename):
    with open(filename, 'r') as f:
        data = f.read().strip()
    key = filename.replace('.csv', '').replace('prices/', '')
    data = data.split('\n')
    split_data = []
    for d in data:
        split_data.append(d.split(','))
    return key, split_data

filenames = []
for f in os.listdir("prices"):
    filenames.append("prices/{}".format(f))

# The with block shuts the worker processes down once all files are read.
with concurrent.futures.ProcessPoolExecutor(max_workers=5) as pool:
    prices = dict(pool.map(read_file, filenames))
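
The manual `split(',')` parsing above can also be written with the standard library's `csv` module. The following is a minimal sketch rather than the notebook's code: `symbol_from_path`, `parse_csv_text`, and the sample rows are hypothetical (the numbers are made up), and it parses an in-memory string instead of a file in `prices`.

```python
import csv
import io
import os


def symbol_from_path(path):
    """Derive the stock symbol from a path such as 'prices/aapl.csv'."""
    return os.path.splitext(os.path.basename(path))[0]


def parse_csv_text(text):
    """Split CSV text into a list of rows, each row a list of strings."""
    return list(csv.reader(io.StringIO(text.strip())))


# Made-up sample using the same column layout as the files in `prices`.
sample = """date,close,open,high,low,volume
2007-01-03,10.0,9.5,10.5,9.0,1000
2007-01-04,11.0,10.0,11.5,9.8,1500
"""

rows = parse_csv_text(sample)
key = symbol_from_path("prices/aapl.csv")
print(key, rows[0], len(rows))
```

The result has the same shape as one entry of `prices`: a symbol key and a list of rows, with the header as the first row.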

## Transforming the data

Now that I have read in the data, I can use it to compute aggregates. Some examples are:

* The average closing price of all stocks over the time period.
* The average volume for each stock.
* The difference between the average opening price and the average closing price for each stock.
* The difference between the average high and the average low for each stock.

To do so, I will have to transform the data structures into more appropriate ones: 

* Layer 1 will remain a hash table, with each stock symbol as a key.
* Layer 2 will become a hash table, with each column header as a key.
* Layer 3 will become an array, since it allows fast access by index and I will rarely modify it. Arrays are also convenient for computation, such as data analysis.

The data structure will look like so:
```
{
    'aapl': {
        "date": [
                "2007-01-03",
                "2007-01-04",
                ...
            ],
        "close": [
                83.800002,
                85.659998,
                ...
            ],
        ...

    },
    'goog': {
        ...
    },
    ...
}
```
I will also change the columns to the appropriate types, in order to make later computations easier. 

In [2]:
from dateutil.parser import parse

stock_prices = {}

for k, rows in prices.items():
    header = rows[0]
    price_cols = {}
    for i, name in enumerate(header):
        # Collect column i from every data row (the header row is skipped).
        values = [row[i] for row in rows[1:]]
        if i > 0:
            # Every column after the first holds numeric data.
            values = [float(x) for x in values]
        else:
            # The first column holds the date.
            values = [parse(x) for x in values]
        price_cols[name] = values
    stock_prices[k] = price_cols
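
The row-to-column transformation can also be expressed with `zip`, which transposes the rows in one step. This is a minimal sketch under stated assumptions: `rows_to_columns` is a hypothetical helper, the sample values are made up, and the dates are assumed to be ISO formatted (`YYYY-MM-DD`) so that the standard library's `datetime.strptime` can stand in for `dateutil`.

```python
from datetime import datetime

# Made-up rows in the layout produced by the reading step:
# a header row followed by rows of strings.
rows = [
    ["date", "close", "open", "high", "low", "volume"],
    ["2007-01-03", "10.0", "9.5", "10.5", "9.0", "1000"],
    ["2007-01-04", "11.0", "10.0", "11.5", "9.8", "1500"],
]


def rows_to_columns(rows):
    """Turn [header, row, row, ...] into {column_name: [typed values]}."""
    header, *body = rows
    columns = {}
    # zip(*body) transposes the rows, yielding one tuple per column.
    for name, values in zip(header, zip(*body)):
        if name == "date":
            columns[name] = [datetime.strptime(v, "%Y-%m-%d") for v in values]
        else:
            columns[name] = [float(v) for v in values]
    return columns


columns = rows_to_columns(rows)
print(columns["close"])  # [10.0, 11.0]
```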

## Computing aggregates

Now that the data is in the structure I want, the date columns have been parsed, and the price columns have been converted to floats, I can compute some aggregates using the data.

First, I will determine the average opening and closing prices for each stock, and sort the results by increasing average opening price. 

In [3]:
open_and_close = {}

for k,v in stock_prices.items():
    average_opening = sum(v['open'])/len(v['open'])
    average_closing = sum(v['close'])/len(v['close'])
    open_and_close[k] = (average_opening, average_closing)

sorted(open_and_close.items(), key = lambda x : x[1])

[('blfs', (0.8188902019305008, 0.8122763011583004)),
 ('apdn', (0.8270161598455602, 0.824100993822394)),
 ('bmra', (0.894806949806949, 0.901011583011584)),
 ('bcli', (0.9994472722007721, 0.9969415324324327)),
 ('cyrx', (1.1649345586872604, 1.1615408884169918)),
 ('clrb', (1.2100074432432475, 1.204571143629345)),
 ('cpst', (1.2140849420849418, 1.2069536679536692)),
 ('csbr', (1.2219540000000002, 1.228244384585441)),
 ('egt', (1.32960617760617, 1.329351351351346)),
 ('cpah', (1.397319064748199, 1.411618944844124)),
 ('aemd', (1.4035907335907356, 1.398042471042472)),
 ('dfbg', (1.4041805756756733, 1.4005010393822352)),
 ('alqa', (1.4091090583011576, 1.4052982830115854)),
 ('astc', (1.4182471042471023, 1.4152123552123521)),
 ('chci', (1.463473763706565, 1.4581224154440182)),
 ('ctic', (1.5039222907335907, 1.4943663119691135)),
 ('eltk', (1.524528957528961, 1.5323436293436348)),
 ('dzsi', (1.5379806949806927, 1.53823166023166)),
 ('cool', (1.5471896648648567, 1.547598892277985)),
 ('cgnt', 

From the sorted results, `BLFS` and `APDN` have the lowest average opening prices, while `AAPL` and `AMZN` have the highest. I will now compute, for each stock, the difference between its average closing price and its average opening price, as a percentage of the average opening price. 

In [4]:
average_gains = {}

for k,v in stock_prices.items():
    average_opening = sum(v['open'])/len(v['open'])
    average_closing = sum(v['close'])/len(v['close'])
    gain_pct = (average_closing - average_opening)/average_opening * 100
    average_gains[k] = gain_pct
    
sorted(average_gains.items(), key = lambda x : x[1])

[('blfs', -0.8076663704863435),
 ('aezs', -0.7170080828577347),
 ('ctic', -0.6354037587817094),
 ('clbs', -0.6075648806310248),
 ('cdti', -0.6071284353266959),
 ('cbli', -0.5939534851583746),
 ('cpst', -0.5873785172745878),
 ('cytx', -0.5398084552043086),
 ('cprx', -0.5085022509388191),
 ('drys', -0.49319032273459756),
 ('dcth', -0.47065673952640186),
 ('clrb', -0.4492781961184775),
 ('cytr', -0.43506922485578053),
 ('cycc', -0.4234429331952147),
 ('arwr', -0.4214571849442779),
 ('cntf', -0.4176569134450096),
 ('abio', -0.41183066060716517),
 ('egle', -0.40893872214496485),
 ('cur', -0.3995195859524079),
 ('aemd', -0.39529062250716357),
 ('edap', -0.3830994085316084),
 ('creg', -0.3683829070331847),
 ('bldp', -0.36644022412860766),
 ('chci', -0.36566069001423257),
 ('apdn', -0.3524920267229849),
 ('caas', -0.3316127580008448),
 ('cidm', -0.32932837088636435),
 ('arna', -0.3227900079935598),
 ('cbak', -0.3102402980830561),
 ('chnr', -0.30647199295867866),
 ('cyrx', -0.2913185332996606),

The results above are interesting: `BLFS`, which had the lowest average opening price, also has the lowest average gain (that is, the greatest loss). However, when I look at the stocks with the largest gains, I see neither `AAPL` nor `AMZN` on the list; instead, `CPAH` and `BMRA` are at the top. 
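
The printed list is sorted in ascending order, so it starts with the largest losses. A minimal sketch of viewing the other end of the ranking follows; the symbols and averages are made up for illustration and are not results from the dataset, and `gain_pct` is a hypothetical helper name.

```python
def gain_pct(average_opening, average_closing):
    """Difference of the average close from the average open, in percent."""
    return (average_closing - average_opening) / average_opening * 100


# Made-up averages: symbol -> (average open, average close).
averages = {
    "aaa": (10.0, 11.0),
    "bbb": (20.0, 19.0),
    "ccc": (5.0, 6.0),
}

average_gains = {k: gain_pct(o, c) for k, (o, c) in averages.items()}

# reverse=True puts the largest gains first; the slice keeps the top two.
top_gainers = sorted(average_gains.items(), key=lambda x: x[1], reverse=True)[:2]
print([symbol for symbol, _ in top_gainers])  # ['ccc', 'aaa']
```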

## Most traded stocks

I will now determine the most traded stocks each day. The data structure I would like to see is a list of lists, as shown below:
```
[
    ["2007-01-03", "AAPL"],
    ["2007-01-04", "GOOG"],
    ...
]
```
For each day, I will keep track of the stock with the highest volume seen so far, and then sort the results by date. 

In [5]:
volumes = {}
max_vol = []

for k,v in stock_prices.items():
    date = v['date']
    stock = k
    volume = v['volume']
    for i,d in enumerate(date):
        if d not in volumes or volume[i] > volumes[d][1]:
            volumes[d] = [stock, volume[i]]

for k,v in volumes.items():
    max_vol.append([k, v[0]])

sorted(max_vol)

[[datetime.datetime(2007, 1, 3, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 4, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 5, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 8, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 9, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 10, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 11, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 12, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 16, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 17, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 18, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 19, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 22, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 23, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 24, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 25, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 26, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 29, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 30, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 31, 0, 0), 'aapl'],
 [datetime.datetime(2007,
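
The output above holds `datetime` objects and lowercase symbols, whereas the target structure uses date strings and uppercase symbols. A minimal sketch of producing that exact shape is shown here; `sample_prices` is a hypothetical miniature of `stock_prices` with made-up volumes.

```python
from datetime import datetime

# Hypothetical miniature of `stock_prices` (volumes are made up).
sample_prices = {
    "aapl": {
        "date": [datetime(2007, 1, 3), datetime(2007, 1, 4)],
        "volume": [300.0, 100.0],
    },
    "goog": {
        "date": [datetime(2007, 1, 3), datetime(2007, 1, 4)],
        "volume": [200.0, 250.0],
    },
}

# date -> (highest volume seen so far, symbol that traded it)
busiest = {}
for symbol, cols in sample_prices.items():
    for day, volume in zip(cols["date"], cols["volume"]):
        if day not in busiest or volume > busiest[day][0]:
            busiest[day] = (volume, symbol)

# Sort by date, then format each entry to match the target structure.
most_traded = [
    [day.strftime("%Y-%m-%d"), busiest[day][1].upper()]
    for day in sorted(busiest)
]
print(most_traded)  # [['2007-01-03', 'AAPL'], ['2007-01-04', 'GOOG']]
```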

## High transaction volume

Now, I want to search for all transactions on days with unusually high volume. I will do the following:

* Compute total volume of trading for each day
* Sort and find the 10 highest volume days overall
* Find all prices for all stocks on each of the high volume days

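To locate each high volume day within a stock's list of dates, I will use a binary search algorithm, which relies on the dates being sorted in ascending order. 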

In [6]:
all_dates = {}

for k,v in stock_prices.items():
    date = v['date']
    volume = v['volume']
    for i,d in enumerate(date):
        if d not in all_dates:
            all_dates[d] = volume[i]
        else:
            all_dates[d] += volume[i]
            
top = sorted(all_dates.items(), key = lambda x : x[1], reverse=True)
top = top[:10]
top

[(datetime.datetime(2008, 1, 23, 0, 0), 1964583900.0),
 (datetime.datetime(2008, 10, 10, 0, 0), 1770266900.0),
 (datetime.datetime(2007, 7, 26, 0, 0), 1611272800.0),
 (datetime.datetime(2008, 10, 8, 0, 0), 1599183500.0),
 (datetime.datetime(2008, 1, 22, 0, 0), 1578877700.0),
 (datetime.datetime(2008, 2, 7, 0, 0), 1559032100.0),
 (datetime.datetime(2008, 9, 29, 0, 0), 1555072400.0),
 (datetime.datetime(2007, 11, 8, 0, 0), 1553880500.0),
 (datetime.datetime(2008, 1, 16, 0, 0), 1536176400.0),
 (datetime.datetime(2008, 1, 24, 0, 0), 1533363200.0)]
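
The per-day totals can also be accumulated with `collections.defaultdict`, and the top days selected with `heapq.nlargest` instead of a full sort. This is a sketch on a hypothetical miniature dataset (`sample_prices` and its volumes are made up), not the notebook's code.

```python
import heapq
from collections import defaultdict
from datetime import datetime

# Hypothetical miniature of `stock_prices`; "goog" has no row for the
# third day, to show that missing dates need no special handling.
sample_prices = {
    "aapl": {
        "date": [datetime(2007, 1, 3), datetime(2007, 1, 4), datetime(2007, 1, 5)],
        "volume": [300.0, 100.0, 50.0],
    },
    "goog": {
        "date": [datetime(2007, 1, 3), datetime(2007, 1, 4)],
        "volume": [200.0, 250.0],
    },
}

# Missing keys start at 0.0, so no membership test is needed.
total_volume = defaultdict(float)
for cols in sample_prices.values():
    for day, volume in zip(cols["date"], cols["volume"]):
        total_volume[day] += volume

# nlargest returns the n highest totals, largest first.
top_days = heapq.nlargest(2, total_volume.items(), key=lambda item: item[1])
print(top_days)
```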

In [7]:
def binary_search(array, search):
    # `array` must be sorted in ascending order.
    i = 0
    z = len(array) - 1
    while i <= z:
        # Integer midpoint of the current search window.
        m = i + (z - i) // 2
        if array[m] == search:
            return m
        elif array[m] < search:
            i = m + 1
        else:
            z = m - 1
    # The value is not in the array.
    return None

high_volume_days = []
top_date_transactions = {}
for date in top:
    high_volume_days.append(date[0])

for k,v in stock_prices.items():
    for day in high_volume_days:
        ind = binary_search(v['date'], day)
        if ind is None:
            continue
        if k not in top_date_transactions:
            top_date_transactions[k] = []
        # Row 0 of prices[k] is the header, so data row `ind` is at ind + 1.
        top_date_transactions[k].append(prices[k][ind + 1])
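
Python's standard library offers the same kind of lookup through the `bisect` module. The sketch below assumes, like the hand-written search above, that each date list is sorted in ascending order; `find_index` is a hypothetical helper name and the dates are sample values.

```python
from bisect import bisect_left
from datetime import datetime


def find_index(sorted_values, target):
    """Return the index of target in an ascending list, or None if absent."""
    i = bisect_left(sorted_values, target)
    if i < len(sorted_values) and sorted_values[i] == target:
        return i
    return None


dates = [datetime(2007, 1, 3), datetime(2007, 1, 4), datetime(2007, 1, 8)]
print(find_index(dates, datetime(2007, 1, 4)))  # 1
print(find_index(dates, datetime(2007, 1, 5)))  # None
```

`bisect_left` only returns an insertion point, so the equality check is what distinguishes a real match from a date that is missing for that stock.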

## Finding profitable stocks

I can now find which stocks would have been the most profitable to buy on `2007-01-03`. I will do this by:

* Subtracting the initial price from the final price, then computing a percentage relative to the initial price. This will tell me how much my initial investment would have grown or shrunk.
* Sorting all of the percentages.
* Finding the stock that grew the most in the time period.

In [8]:
over_time = {}
for k,v in stock_prices.items():
    initial = v['close'][0]
    final = v['close'][-1]
    change_pct = (final - initial)/initial * 100
    over_time[k] = change_pct

top_profit = sorted(over_time.items(), key = lambda x : x[1], reverse=True)
top_profit = top_profit[:10]
top_profit

[('admp', 7483.8389225948395),
 ('adxs', 4005.0000000000005),
 ('arcw', 3898.60048982856),
 ('blfs', 2437.4365640858978),
 ('amzn', 2230.7234281466817),
 ('anip', 1707.3554472785033),
 ('apdn', 1549.6700659868025),
 ('cui', 1525.1625162516252),
 ('bcli', 1339.2137535980346),
 ('achc', 1330.0000666666667)]

The stocks that would have been the most profitable to buy on `2007-01-03` are shown above. `ADMP`, which saw gains of almost 7,500%, would have been the most profitable stock to purchase. 
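
The calculation above treats the first row of each file as the purchase price. Whether every file actually begins on `2007-01-03` is not something the data description confirms, so the following sketch checks that assumption explicitly. `sample_prices`, `START`, and `growth_pct` are hypothetical names, and the prices are made up.

```python
from datetime import datetime

START = datetime(2007, 1, 3)

# Hypothetical miniature of `stock_prices`; "bbb" starts trading later.
sample_prices = {
    "aaa": {
        "date": [datetime(2007, 1, 3), datetime(2017, 4, 17)],
        "close": [10.0, 30.0],
    },
    "bbb": {
        "date": [datetime(2010, 6, 1), datetime(2017, 4, 17)],
        "close": [1.0, 50.0],
    },
}


def growth_pct(initial, final):
    """Change from the initial to the final price, in percent."""
    return (final - initial) / initial * 100


over_time = {
    symbol: growth_pct(cols["close"][0], cols["close"][-1])
    for symbol, cols in sample_prices.items()
    if cols["date"][0] == START  # skip stocks with no data on the start date
}
print(over_time)  # {'aaa': 200.0}
```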