# Analyzing Stock Prices

In this project, I work with stock market data downloaded from Yahoo Finance using the yahoo_finance Python package. The data consists of daily stock prices from 2007-01-01 to 2017-04-17 for several hundred stock symbols traded on the NASDAQ stock exchange, stored in the prices folder. The download_data.py script, located in the same folder as the Jupyter notebook, was used to download all of the stock price data. Each file in the prices folder is named for a specific stock symbol and contains the following columns:

- date -- date that the data is from.
- close -- the closing price on that day, which is the price when the trading day ends.
- open -- the opening price on that day, which is the price when the trading day starts.
- high -- the highest price the stock reached during trading.
- low -- the lowest price the stock reached during trading.
- volume -- the number of shares that were traded during the day.

## Data Structure for Each Layer

- Layer 1: a dict keyed by stock symbol.
- Layers 2 & 3: a DataFrame as the value, holding that stock's prices over the time period, with the columns listed above.

A dict makes searching by stock symbol easy, and a DataFrame allows efficient use of the pandas and NumPy libraries for data analysis.
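As a minimal sketch of this layout, the snippet below builds a two-symbol dict of DataFrames by hand. The name `sample_data` and the `buse` closing prices are made up for illustration; the notebook itself loads one CSV file per symbol.

```python
import pandas as pd

# Hypothetical two-symbol sample; the real notebook reads one CSV per symbol.
sample_data = {
    "aapl": pd.DataFrame({
        "date": ["2007-01-03", "2007-01-04"],
        "close": [83.80, 85.66],
        "volume": [309579900, 211815100],
    }),
    "buse": pd.DataFrame({
        "date": ["2007-01-03", "2007-01-04"],
        "close": [22.50, 22.70],  # made-up prices, for illustration only
        "volume": [10500, 11700],
    }),
}

# Layer 1: look up a stock by its symbol.
aapl = sample_data["aapl"]

# Layers 2 and 3: column access and vectorized statistics on the DataFrame.
mean_close = aapl["close"].mean()
```

Looking up a symbol is a single dict access, and every later calculation can work on whole columns instead of looping over rows.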

In [1]:

import concurrent.futures
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# The concurrent.futures module provides a high-level 
# interface for asynchronously executing callables.

### Read data from CSV

In [2]:
directory = 'prices'
filenames = os.listdir(directory)
#print(filenames)
data = {}

In [3]:
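paths = [os.path.join(directory, file) for file in filenames]
with concurrent.futures.ProcessPoolExecutor() as executor:
    # executor.map runs pd.read_csv in worker processes and
    # yields the results in the same order as the input paths.
    for file, df in zip(filenames, executor.map(pd.read_csv, paths)):
        key = os.path.splitext(file)[0]  # file name without the .csv extension
        data[key] = df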

In [4]:
#print(data['aapl'])
data['aapl'].head()

Unnamed: 0,date,close,open,high,low,volume
0,2007-01-03,83.800002,86.289999,86.579999,81.899999,309579900
1,2007-01-04,85.659998,84.050001,85.949998,83.820003,211815100
2,2007-01-05,85.049997,85.77,86.199997,84.400002,208685400
3,2007-01-08,85.47,85.959998,86.529998,85.280003,199276700
4,2007-01-09,92.570003,86.450003,92.979999,85.15,837324600


## Exploratory Data Analysis

#### The average closing price of all stocks over the time period.

In [5]:
time_periods = data['aapl']['date']
stock_symbols = data.keys()

In [6]:
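# Collect every stock's closing prices side by side, one column per symbol.
# Columns are aligned on the row index, not on the date itself.
avg_closing_price_all = pd.DataFrame()
avg_closing_price_all['date'] = time_periods
for key in data:
    avg_closing_price_all[key] = data[key]['close']
avg_closing_price_all.head(3)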

Unnamed: 0,date,cprt,azpn,cac,acor,cart,abio,amri,blfs,chy,...,crme,daio,camt,dhil,chci,finl,ecol,endp,acet,cnmd
0,2007-01-03,30.530001,10.71,45.799999,15.61,18.450003,3.788484,10.5,0.080002,17.030001,...,10.9,3.61,4.31,83.989998,6.089994,14.37,18.41,27.25,8.51,23.139999
1,2007-01-04,31.030001,11.14,46.319999,15.35,18.450003,3.89844,10.19,0.080002,16.959999,...,11.21,3.61,4.4,83.629997,6.479994,14.42,18.32,27.23,8.57,23.15
2,2007-01-05,30.709999,10.63,45.190001,15.45,18.450003,3.84846,9.99,0.110002,17.0,...,11.25,3.71,4.35,84.089996,6.009994,14.15,17.82,28.700001,8.8,22.84
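
The table above holds the raw closing prices; the average itself is a column mean. The sketch below uses made-up sample data and also aligns the series on `date`, which matters if some symbol has fewer rows than `aapl` (an assumption about the data that is not checked here). The names `sample_data`, `newco`, `closes`, and `avg_close` are hypothetical.

```python
import pandas as pd

# Hypothetical sample: "newco" starts trading one day later than "aapl".
sample_data = {
    "aapl": pd.DataFrame({"date": ["2007-01-03", "2007-01-04", "2007-01-05"],
                          "close": [83.8, 85.6, 85.0]}),
    "newco": pd.DataFrame({"date": ["2007-01-04", "2007-01-05"],
                           "close": [10.0, 12.0]}),
}

# Index each close series by date so that pd.concat lines rows up by
# calendar day rather than by row position.
closes = pd.concat(
    {symbol: df.set_index("date")["close"] for symbol, df in sample_data.items()},
    axis=1,
)

# Mean closing price per stock; missing days (NaN) are skipped by default.
avg_close = closes.mean()
```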


#### The average volume for each stock

#### The difference between the average opening price and the average closing price for each stock.

#### The difference between the average high and the average low for each stock.
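
The three summaries named in the headings above can be computed per stock in one pass. This is a sketch on made-up numbers, not the notebook's results; `sample_data`, `summary`, and the row labels are hypothetical names.

```python
import pandas as pd

# Hypothetical sample with the same columns as the files in the prices folder.
sample_data = {
    "aaa": pd.DataFrame({"open": [10.0, 12.0], "close": [11.0, 13.0],
                         "high": [12.0, 14.0], "low": [9.0, 11.0],
                         "volume": [100, 300]}),
    "bbb": pd.DataFrame({"open": [5.0, 5.0], "close": [4.0, 5.0],
                         "high": [6.0, 6.0], "low": [3.0, 4.0],
                         "volume": [50, 150]}),
}

# One column of statistics per symbol, transposed so each stock is a row.
summary = pd.DataFrame({
    symbol: {
        "avg_volume": df["volume"].mean(),
        "avg_open_minus_avg_close": df["open"].mean() - df["close"].mean(),
        "avg_high_minus_avg_low": df["high"].mean() - df["low"].mean(),
    }
    for symbol, df in sample_data.items()
}).T
```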


## Most Traded Stock Each Day

In [7]:
stocks_volumes = pd.DataFrame()
stocks_volumes['date'] = time_periods
for key in data:
    stocks_volumes[key] = data[key]['volume']
stocks_volumes.head(3)

Unnamed: 0,date,cprt,azpn,cac,acor,cart,abio,amri,blfs,chy,...,crme,daio,camt,dhil,chci,finl,ecol,endp,acet,cnmd
0,2007-01-03,1492200,751800,6900,266300,800,3200,152500,3300,203200,...,33300,5200,120000,20300,83500,1222600,133200,1497900,96200,288500
1,2007-01-04,916200,367500,12600,294000,0,3100,134700,0,173000,...,73800,22300,118100,7500,98400,911100,109300,1080400,28100,176900
2,2007-01-05,859400,405800,20800,367000,0,2100,192400,11000,136600,...,33000,7000,68700,15200,61500,613800,207400,2508100,71200,135500


In [8]:
stocks_volumes['buse'].head()

0    10500
1    11700
2    10900
3     6600
4     8300
Name: buse, dtype: int64

In [9]:
columns = stocks_volumes.columns
columns

Index(['date', 'cprt', 'azpn', 'cac', 'acor', 'cart', 'abio', 'amri', 'blfs',
       'chy',
       ...
       'crme', 'daio', 'camt', 'dhil', 'chci', 'finl', 'ecol', 'endp', 'acet',
       'cnmd'],
      dtype='object', length=561)

In [10]:
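# Symbol names only; the first column is 'date'.
symbols = columns[1:]
def get_max_vol_stock(row):
    """Return [date, symbol] for the stock with the highest volume in this row."""
    date = row.iloc[0]
    volumes = np.array(row)[1:]
    idx = np.argmax(volumes)
    return [date, symbols[idx]]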


In [11]:
most_traded_stock_by_day = stocks_volumes.apply(get_max_vol_stock, axis=1)  
most_traded_stock_by_day[:30]   

0     [2007-01-03, aapl]
1     [2007-01-04, aapl]
2     [2007-01-05, aapl]
3     [2007-01-08, aapl]
4     [2007-01-09, aapl]
5     [2007-01-10, aapl]
6     [2007-01-11, aapl]
7     [2007-01-12, aapl]
8     [2007-01-16, aapl]
9     [2007-01-17, aapl]
10    [2007-01-18, aapl]
11    [2007-01-19, aapl]
12    [2007-01-22, aapl]
13    [2007-01-23, aapl]
14    [2007-01-24, aapl]
15    [2007-01-25, aapl]
16    [2007-01-26, aapl]
17    [2007-01-29, aapl]
18    [2007-01-30, aapl]
19    [2007-01-31, aapl]
20    [2007-02-01, aapl]
21    [2007-02-02, aapl]
22    [2007-02-05, aapl]
23    [2007-02-06, aapl]
24    [2007-02-07, aapl]
25    [2007-02-08, aapl]
26    [2007-02-09, aapl]
27    [2007-02-12, aapl]
28    [2007-02-13, aapl]
29    [2007-02-14, aapl]
dtype: object
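
The same lookup can be written without a row-wise `apply`. The sketch below uses `DataFrame.idxmax` on a three-symbol miniature of `stocks_volumes`, with volumes copied from the outputs above; `sample_volumes` and `most_traded` are hypothetical names.

```python
import pandas as pd

# Miniature of stocks_volumes: one row per day, one column per symbol.
sample_volumes = pd.DataFrame({
    "date": ["2007-01-03", "2007-01-04", "2007-01-05"],
    "aapl": [309579900, 211815100, 208685400],
    "cprt": [1492200, 916200, 859400],
    "azpn": [751800, 367500, 405800],
})

# idxmax(axis=1) returns, for each row, the column label with the largest value.
most_traded = sample_volumes.set_index("date").idxmax(axis=1)
```

The result is a Series indexed by date whose values are symbols, which is usually easier to filter or count than a Series of `[date, symbol]` lists.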

## High Transaction Day

In [12]:
# Compute total volume of trading for each day
def agg_volume(row):
    return row[1:].sum()
total_volume_by_day = pd.DataFrame()
total_volume_by_day['date'] = stocks_volumes['date']
total_volume_by_day['total_volume'] = stocks_volumes.apply(agg_volume, axis=1)     
total_volume_by_day.head()


Unnamed: 0,date,total_volume
0,2007-01-03,1036523000.0
1,2007-01-04,837589600.0
2,2007-01-05,745213900.0
3,2007-01-08,746887300.0
4,2007-01-09,1419223000.0


In [13]:
# Sort and find the 10 highest volume days overall
highest_10_volume_day_overall = total_volume_by_day.sort_values(by='total_volume', ascending=False).head(10)    
highest_10_volume_day_overall

Unnamed: 0,date,total_volume
265,2008-01-23,1963754000.0
447,2008-10-10,1769361000.0
141,2007-07-26,1610858000.0
445,2008-10-08,1598255000.0
264,2008-01-22,1578117000.0
276,2008-02-07,1557507000.0
438,2008-09-29,1554716000.0
215,2007-11-08,1552455000.0
261,2008-01-16,1535759000.0
266,2008-01-24,1533396000.0
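
A more compact route to the same ranking, sketched on a two-symbol sample: sum across columns with `sum(axis=1)` and pick the top rows with `Series.nlargest`. The names `sample_volumes`, `total_volume`, and `top_days` are hypothetical, and the sample keeps only two of the top days rather than ten.

```python
import pandas as pd

# Miniature of stocks_volumes with two symbols.
sample_volumes = pd.DataFrame({
    "date": ["2007-01-03", "2007-01-04", "2007-01-05"],
    "aapl": [309579900, 211815100, 208685400],
    "cprt": [1492200, 916200, 859400],
})

# Total traded volume per day across all symbols.
total_volume = sample_volumes.set_index("date").sum(axis=1)

# The n highest-volume days, already sorted in descending order.
top_days = total_volume.nlargest(2)
```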


In [14]:
highest_volume_dates = highest_10_volume_day_overall['date']
highest_volume_dates

265    2008-01-23
447    2008-10-10
141    2007-07-26
445    2008-10-08
264    2008-01-22
276    2008-02-07
438    2008-09-29
215    2007-11-08
261    2008-01-16
266    2008-01-24
Name: date, dtype: object

In [15]:
# all stock prices by date
stocks_prices = pd.DataFrame()
stocks_prices['date'] = time_periods
for key in data:
    stocks_prices[key] = data[key]['close']
stocks_prices.head()
    

Unnamed: 0,date,cprt,azpn,cac,acor,cart,abio,amri,blfs,chy,...,crme,daio,camt,dhil,chci,finl,ecol,endp,acet,cnmd
0,2007-01-03,30.530001,10.71,45.799999,15.61,18.450003,3.788484,10.5,0.080002,17.030001,...,10.9,3.61,4.31,83.989998,6.089994,14.37,18.41,27.25,8.51,23.139999
1,2007-01-04,31.030001,11.14,46.319999,15.35,18.450003,3.89844,10.19,0.080002,16.959999,...,11.21,3.61,4.4,83.629997,6.479994,14.42,18.32,27.23,8.57,23.15
2,2007-01-05,30.709999,10.63,45.190001,15.45,18.450003,3.84846,9.99,0.110002,17.0,...,11.25,3.71,4.35,84.089996,6.009994,14.15,17.82,28.700001,8.8,22.84
3,2007-01-08,30.68,10.63,44.830001,15.6,18.450003,3.84846,9.9,0.110002,17.02,...,11.02,3.6,4.34,85.230003,5.889994,14.0,17.65,28.85,9.0,22.940001
4,2007-01-09,30.889999,10.7,44.849999,15.91,18.749996,3.688524,10.12,0.110002,17.110001,...,10.75,3.57,4.3,83.550003,6.039994,13.94,17.969999,29.48,8.96,22.959999


In [16]:
highest_volume_day_prices_all = stocks_prices[stocks_prices['date'].isin(highest_volume_dates)]  

highest_volume_day_prices_all

Unnamed: 0,date,cprt,azpn,cac,acor,cart,abio,amri,blfs,chy,...,crme,daio,camt,dhil,chci,finl,ecol,endp,acet,cnmd
141,2007-07-26,28.76,13.02,36.12,17.809999,16.560006,2.469012,15.6,0.100002,14.62,...,8.81,3.47,3.56,84.239998,1.839998,6.75,20.9,33.529999,9.13,29.59
215,2007-11-08,36.259998,15.92,33.839999,19.52,15.600002,1.569372,12.31,0.090002,13.26,...,11.25,5.52,2.7,82.550003,1.279999,3.28,24.290001,27.889999,8.19,26.370001
261,2008-01-16,40.240002,13.24,32.200001,22.440001,14.1,1.64934,13.92,0.070001,14.2,...,7.8,5.23,1.59,73.0,0.739999,1.74,21.959999,26.5,8.1,23.35
264,2008-01-22,38.470001,11.87,31.429999,23.940001,12.4,1.579368,13.02,0.060001,13.25,...,6.87,4.82,1.48,68.599998,0.729999,1.7,20.15,24.559999,7.51,22.07
265,2008-01-23,38.970001,12.73,32.000001,23.049999,13.05,1.489404,14.4,0.070001,13.54,...,7.02,4.89,1.48,69.839996,0.709999,1.91,21.9,23.92,7.22,23.059999
266,2008-01-24,39.099998,12.72,31.9,22.309999,13.05,1.59936,13.21,0.080002,13.95,...,7.08,5.02,1.45,69.839996,0.699999,1.92,21.99,24.66,7.55,23.309999
276,2008-02-07,39.009998,13.72,33.0,22.870001,12.57,1.529388,10.18,0.060001,14.44,...,6.0,5.15,1.43,74.43,0.919999,2.39,22.99,25.09,7.0,27.129999
438,2008-09-29,37.639999,12.0,29.64,24.4,10.0,0.44982,18.969999,0.040001,8.87,...,7.82,4.51,0.84,80.050003,0.27,9.82,27.43,19.92,8.91,31.459999
445,2008-10-08,33.02,11.15,28.74,18.17,9.25,0.259896,12.08,0.040001,8.9,...,6.11,3.93,0.82,80.059998,0.37,7.46,23.75,16.33,7.36,26.969999
447,2008-10-10,34.490002,8.75,27.0,16.41,8.51,0.279888,11.71,0.040001,7.55,...,3.98,3.3,0.76,68.849998,0.34,6.95,23.530001,14.81,8.1,26.92


## The Most Profitable Stock from 2007 to 2017


In [24]:
# using stocks_prices
def get_most_profitable(stock, col):
    """Buy on the first trading day and sell at the highest closing price."""
    stock_dict = {}
    stock_dict['buy_date'] = time_periods[0]
    stock_dict['buy_price'] = col[0]
    stock_dict['sell_date'] = time_periods[col.idxmax()]
    stock_dict['sell_price'] = col.max()
    stock_dict['profit'] = stock_dict['sell_price'] - stock_dict['buy_price']
    # Gain as a fraction of the buy price (1.0 means the price doubled).
    stock_dict['gain_percentage'] = stock_dict['profit'] / stock_dict['buy_price']

    stock_df = pd.DataFrame(stock_dict, index=[stock])
    return stock_df


In [25]:
result_columns = ['buy_date', 'buy_price', 'sell_date', 'sell_price', 'profit', 'gain_percentage']


In [26]:
stocks = data.keys()
# DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once.
stock_frames = [get_most_profitable(stock, stocks_prices[stock]) for stock in stocks]
most_profit_all_stocks = pd.concat(stock_frames)[result_columns]

most_profit_all_stocks.head()


Unnamed: 0,buy_date,buy_price,sell_date,sell_price,profit,gain_percentage
cprt,2007-01-03,30.530001,2017-03-28,62.400002,31.870001,1.043891
azpn,2007-01-03,10.71,2017-03-20,59.459999,48.749999,4.551821
cac,2007-01-03,45.799999,2016-09-08,48.48,2.680001,0.058515
acor,2007-01-03,15.61,2015-01-26,44.5,28.89,1.850737
cart,2007-01-03,18.450003,2007-01-11,19.149999,0.699996,0.03794
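
To name a single most profitable stock, the per-stock rows still need to be ranked. The sketch below applies the rule used above (buy on the first day, sell at the highest close) to all columns at once and sorts by relative gain. The prices are made up, and `sample_prices`, `result`, and `best_stock` are hypothetical names.

```python
import pandas as pd

# Hypothetical closing prices: rows are dates, columns are symbols.
sample_prices = pd.DataFrame(
    {"aaa": [10.0, 15.0, 12.0], "bbb": [20.0, 22.0, 25.0]},
    index=["2007-01-03", "2007-01-04", "2007-01-05"],
)

buy_price = sample_prices.iloc[0]   # first trading day
sell_price = sample_prices.max()    # highest close per symbol

result = pd.DataFrame({
    "buy_price": buy_price,
    "sell_date": sample_prices.idxmax(),  # date of the highest close
    "sell_price": sell_price,
    "profit": sell_price - buy_price,
    "gain_percentage": (sell_price - buy_price) / buy_price,
}).sort_values("gain_percentage", ascending=False)

# Symbol with the largest relative gain.
best_stock = result.index[0]
```

Ranking by `gain_percentage` rather than `profit` keeps a high-priced stock from winning just because its absolute price moves are larger.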


## More to be done:

I've done some basic analysis of the data, but there's still quite a bit more depth to go into:

- What stocks would have been best to short at the start of the period?
- Which stocks have the most after-hours trading, and show the biggest changes between the closing price and the next day open?
- Can technical indicators like Bollinger Bands help us forecast the market?
- What time periods have resulted in steady increases in prices, and what periods have resulted in steady declines?
- Based on price, what was the optimal day to buy each stock if we wanted to hold them until now?
- On days with high trading volume, do stocks move in one direction (up or down) more than the other one?
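
As a starting point for the Bollinger Bands question, here is a minimal sketch of how the bands could be computed from a closing-price series. It only builds the bands and makes no forecasting claim. The function name `bollinger_bands`, the 20-day window, and the 2-standard-deviation width are assumptions (common defaults, not taken from this notebook), and the rolling standard deviation uses the pandas default sample formula (`ddof=1`).

```python
import pandas as pd

def bollinger_bands(close, window=20, num_std=2.0):
    """Return the middle, upper and lower bands for a closing-price Series."""
    middle = close.rolling(window).mean()
    std = close.rolling(window).std()  # sample standard deviation (ddof=1)
    return pd.DataFrame({
        "middle": middle,
        "upper": middle + num_std * std,
        "lower": middle - num_std * std,
    })

# Tiny made-up series with a short window so the numbers are easy to check.
sample_close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
bands = bollinger_bands(sample_close, window=3)
```

The first `window - 1` rows are NaN because a full window is not yet available.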
