## Reading the data in a custom data structure: 

- We have several files, each one of this for each company. The files have the info of stock daily prices from 2007-1-1 to 2017-04-17.

- The columns names are: date, close (close price), open (open price), high, low, volume

We need three layers of índices to get acces to specific values of all the data. A good data structure can be: 

hash table --> List --> List 

Each hash (associated to each hash - table) for each stock symbol (company)
List of Lists: List of rows of daily prices for each key in the hash table (we can use a dictionary). 

Since we need to get quick access to a stock symbol, its convenient to  have a hash table for each although the tradeoff for this is slightly higher memory usage.

In the next layers using again a hash table we increments considerably the memory usage, instead of that, we can use a List of List for the second and third layer.  

In [72]:
import os
import concurrent.futures
from datetime import datetime
import numpy as np
import math

In [2]:
files = os.listdir("prices")
files[:10]

['dgica.csv',
 'bdge.csv',
 'cvco.csv',
 'blkb.csv',
 'bbox.csv',
 'ffbc.csv',
 'fbiz.csv',
 'ffic.csv',
 'bdsi.csv',
 'amgn.csv']

In [46]:
# prueba: Reading only one file into pandas.DataFrame to 
# see how the data is structured in each file.
import pandas as pd

df = pd.read_csv("prices/dgica.csv")
print(df.shape)
df.head()

(2590, 6)


Unnamed: 0,date,close,open,high,low,volume
0,2007-01-03,19.99,19.65,20.120001,19.59,30100
1,2007-01-04,20.0,20.0,20.0,19.700001,16800
2,2007-01-05,19.02,19.870001,19.870001,19.02,27700
3,2007-01-08,18.58,19.059999,19.059999,18.5,30300
4,2007-01-09,18.57,18.540001,18.639999,18.41,37800


In [4]:
def read_files(file_name):
    with open("prices/"+file_name,"r") as f:
        lines = f.readlines()
        all_lines = [p.strip().split(",") for p in lines][1:]
        all_lines = [d for d in all_lines]
        return (file_name.replace(".csv",""),all_lines)

In [5]:
pool = concurrent.futures.ProcessPoolExecutor(max_workers = 3)
data = list(pool.map(read_files,files))

In [6]:
data = dict(data)

### Stored the data in the "data" dictionary according to the previous diagram

In [7]:
data["amgn"][:10]

[['2007-01-03', '68.400002', '68.43', '69.480003', '67.849998', '12908400'],
 ['2007-01-04',
  '71.330002',
  '69.480003',
  '71.870003',
  '68.970001',
  '16000900'],
 ['2007-01-05', '71.50', '71.339996', '72.080002', '71.010002', '10462000'],
 ['2007-01-08', '70.93', '71.629997', '71.889999', '70.790001', '6747100'],
 ['2007-01-09', '71.269997', '71.169998', '71.68', '70.669998', '7165200'],
 ['2007-01-10', '71.040001', '70.900002', '71.239998', '70.699997', '5780600'],
 ['2007-01-11', '71.910004', '71.160004', '72.209999', '70.739998', '6872600'],
 ['2007-01-12', '73.269997', '72.019997', '73.57', '72.00', '10115000'],
 ['2007-01-16', '73.50', '73.400002', '73.580002', '72.769997', '5465400'],
 ['2007-01-17', '73.93', '73.50', '74.080002', '73.199997', '6535100']]

### Calculating some aggregates

In [8]:
# Average closing price for each stocks

average_closing_prices = {}

for k,v in data.items():
    suma = 0
    for row in v:
        suma += float(row[1])
    average_closing_prices[k] = round(suma/len(v),2)
    
average_closing_prices["amgn"] # result for "amgn" company

92.23

In [9]:
# Average volume for each stock

average_volume = {}

for k,v in data.items():
    suma = 0
    for row in v:
        suma += float(row[5])
    average_volume[k] = round(suma/len(v),2)
    
average_volume["amgn"] # result for "amgn" company

6412205.17

In [10]:
# Average opening price for each stocks

average_opening_prices = {}

for k,v in data.items():
    suma = 0
    for row in v:
        suma += float(row[2])
    average_opening_prices[k] = round(suma/len(v),2)
    
average_opening_prices["amgn"] # result for "amgn" company

92.22

In [11]:
# difference between the average opening price and the
# average closing price for each stock.

average_delta_prices = {}

for k in average_opening_prices.keys():
    average_delta_prices[k] = round(average_closing_prices[k]\
                              - average_opening_prices[k],2)
        
average_delta_prices["amgn"] # result for "amgn" company

0.01

In [12]:
# difference between the average high and 
# the average low for each stock.

average_high = {}

for k,v in data.items():
    suma = 0
    for row in v:
        suma += float(row[3])
    average_high[k] = round(suma/len(v),2)
    
average_low = {}

for k,v in data.items():
    suma = 0
    for row in v:
        suma += float(row[4])
    average_low[k] = round(suma/len(v),2)
    
high_low_average_diff = {}

for k in average_low.keys():
    high_low_average_diff[k] = round(average_high[k]\
                              - average_low[k],2)
        
high_low_average_diff["amgn"] # result for "amgn" company

1.94

### Another way to compute aggregates: transforming data structure: 


    Layer 1 -- hash table (each stock symbol is a key)
    Layer 2 -- hash table (each column header is a key)
    Layer 3 -- array.
    
    

In [13]:
new_data = {}

columns = ["date","close","open","high","low","volume"]

for k,v in data.items():
    dic = {}
    for i,column in enumerate(columns):
        if column == "date":
            dic[column] = [datetime.strptime(v[j][i],"%Y-%m-%d") \
                           for j in range(len(v))]
        elif column == "volume":
            dic[column] = [int(v[j][i]) \
                           for j in range(len(v))]
        else: 
            dic[column] = [round(float(v[j][i]),2) \
                           for j in range(len(v))]
    new_data[k] = dic


new_data["aapl"]["date"][:3]

[datetime.datetime(2007, 1, 3, 0, 0),
 datetime.datetime(2007, 1, 4, 0, 0),
 datetime.datetime(2007, 1, 5, 0, 0)]

##### Now we have data structured as the following form: 

{
    'aapl': {
        "date": [
                "2007-01-03",
                "2007-01-04",
                ...
            ],
        "close": [
                83.800002,
                85.659998,
                ...
            ],
        ...

    },
    'goog': {
        ...
    },
    ...
}

###### in this kind of data structure we can access values like in pandas data frames

In [14]:
new_data["aapl"]["volume"][:3]

[309579900, 211815100, 208685400]

In [15]:
new_data["aapl"]["high"][:3]

[86.58, 85.95, 86.2]

In [16]:
# Then calculating the same aggregates as before: 

# Average closing price for each stocks

average_closing_prices_v2 = {}

for k,v in new_data.items():
    average_closing_prices_v2[k] = \
    round(sum(new_data[k]["close"])/\
          len(new_data[k]["close"]),2)
    
print(average_closing_prices_v2["amgn"], "average_closing for amgn") 
# result for "amgn" company

average_volume_v2 = {}

for k,v in new_data.items():
    average_volume_v2[k] = \
    round(sum(new_data[k]["volume"])/\
          len(new_data[k]["volume"]),2)
    
print(average_volume_v2["amgn"], "average volume for amgn") 
# result for "amgn" company

average_opening_prices_v2 = {}

for k,v in new_data.items():
    average_opening_prices_v2[k] = \
    round(sum(new_data[k]["open"])/\
          len(new_data[k]["open"]),2)
    
print(average_opening_prices_v2["amgn"],"average opening for amgn") 
# result for "amgn" company


average_delta_prices_v2 = {}

for k in average_opening_prices_v2.keys():
    average_delta_prices_v2[k] = \
          round(average_closing_prices_v2[k]\
            - average_opening_prices_v2[k],2)
        
print(average_delta_prices_v2["amgn"],"average delta close-open for amgn")
# result for "amgn" company

average_high_v2 = {}

for k,v in new_data.items():
    average_high_v2[k] = \
    round(sum(new_data[k]["high"])/\
          len(new_data[k]["high"]),2)
    
print(average_high_v2["amgn"],"average high for amgn") 
# result for "amgn" company
    
average_low_v2 = {}

for k,v in new_data.items():
    average_low_v2[k] = \
    round(sum(new_data[k]["low"])/\
          len(new_data[k]["low"]),2)
    
print(average_low_v2["amgn"],"average low for amgn") 
# result for "amgn" company
    
high_low_average_diff_v2 = {}

for k in average_low.keys():
    high_low_average_diff_v2[k] = round(average_high_v2[k]\
                              - average_low_v2[k],2)
        
print(high_low_average_diff_v2["amgn"],"average high low difference for amgn")
# result for "amgn" company

92.23 average_closing for amgn
6412205.17 average volume for amgn
92.22 average opening for amgn
0.01 average delta close-open for amgn
93.18 average high for amgn
91.24 average low for amgn
1.94 average high low difference for amgn


### Searching on data (binary search):

In [36]:
# Additional aggregates

total_volume_each_day = {}

for company in new_data:
    for i,day in enumerate(new_data[company]["date"]):
        if day not in total_volume_each_day:
            total_volume_each_day[day] = \
            new_data[company]["volume"][0]
        else:
            total_volume_each_day[day] += \
            new_data[company]["volume"][i]


In [47]:
len(total_volume_each_day.keys())

2636

In [70]:
# Defining some already known functions to sort values 
# Using insertion sort.

def swap(array, pos1, pos2): 
    first_value_stored = array[pos1]
    array[pos1] = array[pos2]
    array[pos2] = first_value_stored
    return array

def insertion_sort_vinculated_arrays(array_index,array):
    for i in range(1,len(array_index)):
        j = i
        while j>0 and (array_index[j-1]>array_index[j]):
            swap(array_index,j-1,j)
            swap(array,j-1,j)
            j -= 1
            
def binary_search(array, search):
    
    insertion_sort(array)
    m = 0
    i = 0
    z = len(array) - 1
    while i<= z:
        
        m = math.floor(i + ((z - i) / 2))
        if array[m] == search:
            return m
        elif array[m] < search:
            i = m + 1
        elif array[m] > search:
            z = m - 1
    return None
    

In [62]:
days_list = sorted([day for day in total_volume_each_day])
volumes_list = [total_volume_each_day[day] for day in days_list]

In [64]:
# Sorting the days - volume vinculated lists:

insertion_sort_vinculated_arrays(volumes_list,days_list)
print(volumes_list[-5:])
print(" ")
print(days_list[-5:])

[1578732000, 1599461700, 1611417400, 1769830800, 1963904000]
 
[datetime.datetime(2008, 1, 22, 0, 0), datetime.datetime(2008, 10, 8, 0, 0), datetime.datetime(2007, 7, 26, 0, 0), datetime.datetime(2008, 10, 10, 0, 0), datetime.datetime(2008, 1, 23, 0, 0)]


In [68]:
top10_volume_days = [days_list[-i] for i in range(1,11)]
top10_volume_days

[datetime.datetime(2008, 1, 23, 0, 0),
 datetime.datetime(2008, 10, 10, 0, 0),
 datetime.datetime(2007, 7, 26, 0, 0),
 datetime.datetime(2008, 10, 8, 0, 0),
 datetime.datetime(2008, 1, 22, 0, 0),
 datetime.datetime(2008, 2, 7, 0, 0),
 datetime.datetime(2008, 9, 29, 0, 0),
 datetime.datetime(2007, 11, 8, 0, 0),
 datetime.datetime(2008, 1, 16, 0, 0),
 datetime.datetime(2008, 1, 24, 0, 0)]

### Finally, finding all prices for all stocks on each of the 10 high volume days:

In [76]:
prices_top_10_days = {}

for day in top10_volume_days:
    prices = []
    for company in new_data:
        result = binary_search(new_data[company]["date"],day)
        if result is not None: 
            price = new_data[company]["close"][result]
            prices.append((company,price))
    prices_top_10_days[day] = prices


### Finding the most profitable stocks from 2007 to present and the fastest growing companies

In [83]:
# Easy task: 

companies = []
percentages = []

companies_growth_dict = {}

for company in new_data:
    companies.append(company)
    initial_price = new_data[company]["open"][0]
    final_day_index = len(new_data[company]["date"]) - 1
    final_price = new_data[company]["open"][final_day_index]
    percentage = round(final_price / initial_price - 1,2)
    percentages.append(percentage)
    # storing company - percentages 
    companies_growth_dict[company] = percentage 


In [80]:
insertion_sort_vinculated_arrays(percentages,companies)

top10_growth_companies = [companies[-i] for i in range(1,11)]

top10_growth_companies

['admp', 'adxs', 'arcw', 'blfs', 'amzn', 'anip', 'apdn', 'cui', 'axgn', 'achc']

#### Next steps in further data science analysis:


-    What stocks would have been best to short at the start of the period?
-    Which stocks have the most after-hours trading, and show the biggest changes between the closing price and the next day open?
-    Can technical indicators like Bollinger Bands help us forecast the market?
-    What time periods have resulted in steady increases in prices, and what periods have resulted in steady declines?
-    Based on price, what was the optimal day to buy each stock if we wanted to hold them until now?
-    On days with high trading volume, do stocks move in one direction (up or down) more than the other one?
