# Analyzing Stock Prices

In this project, I work with stock market data downloaded from Yahoo Finance using the yahoo_finance Python package. The data consists of daily stock prices from 2007-01-01 to 2017-04-17 for several hundred stock symbols traded on the NASDAQ stock exchange, stored in the prices folder. The download_data.py script, which sits in the same folder as the Jupyter notebook, was used to download all of the stock price data. Each file in the prices folder is named for a specific stock symbol and contains the following columns:

- date -- date that the data is from.
- close -- the closing price on that day, which is the price when the trading day ends.
- open -- the opening price on that day, which is the price when the trading day starts.
- high -- the highest price the stock reached during trading.
- low -- the lowest price the stock reached during trading.
- volume -- the number of shares that were traded during the day.

## Data Structure for Each Layer

- Layer 1: a dict that uses the stock symbol as the key.
- Layers 2 & 3: a DataFrame as the value, holding the stock's prices over the time period, with one column per field.

A dict makes it easy to look up data by stock symbol, and a DataFrame allows efficient use of the pandas and NumPy libraries for data analysis.
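
A minimal sketch of this layout, using two made-up symbols and invented prices in place of the real files (the `toy_data` name and all of its values are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real data: symbol -> DataFrame of daily rows.
toy_data = {
    "aaa": pd.DataFrame({
        "date": ["2007-01-03", "2007-01-04"],
        "close": [10.0, 11.0],
        "volume": [100, 150],
    }),
    "bbb": pd.DataFrame({
        "date": ["2007-01-03", "2007-01-04"],
        "close": [20.0, 19.0],
        "volume": [300, 250],
    }),
}

# Layer 1: look a stock up by symbol.
# Layers 2 and 3: select a column, then a row within it.
first_close = toy_data["aaa"]["close"].iloc[0]
```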

In [1]:

import concurrent.futures
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# The concurrent.futures module provides a high-level 
# interface for asynchronously executing callables.

### Read data from CSV

In [2]:
directory = 'prices'
filenames = os.listdir(directory)
#print(filenames)
data = {}

In [3]:
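# Hand the reads to the executor; the original loop created the pool but
# still read every file sequentially. executor.map preserves input order.
paths = [os.path.join(directory, file) for file in filenames]
with concurrent.futures.ProcessPoolExecutor() as executor:
    frames = executor.map(pd.read_csv, paths)
    for file, frame in zip(filenames, frames):
        key = os.path.splitext(file)[0]  # strip the file extension
        data[key] = frame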

In [4]:
#print(data['aapl'])
data['aapl'].head()

Unnamed: 0,date,close,open,high,low,volume
0,2007-01-03,83.800002,86.289999,86.579999,81.899999,309579900
1,2007-01-04,85.659998,84.050001,85.949998,83.820003,211815100
2,2007-01-05,85.049997,85.77,86.199997,84.400002,208685400
3,2007-01-08,85.47,85.959998,86.529998,85.280003,199276700
4,2007-01-09,92.570003,86.450003,92.979999,85.15,837324600


## Exploratory Data Analysis

#### The average closing price of all stocks over the time period.

In [5]:
time_periods = data['aapl']['date']
stock_symbols = data.keys()

In [6]:
avg_closing_price_all = pd.DataFrame()
avg_closing_price_all['date'] = time_periods
for key in data:
    avg_closing_price_all[key] = data[key]['close']
avg_closing_price_all.head(3)

Unnamed: 0,date,fmbi,cmcsa,boom,ctg,cpah,abco,acta,ecpg,asrv,...,algn,aray,esxb,cvgi,ceco,cybe,fisi,agys,arcw,aciw
0,2007-01-03,38.93,42.660001,26.9,4.75,0.435,53.959999,10.32,12.27,4.85,...,13.35,28.469999,7.15,22.1,25.030001,12.85,23.09,20.450001,0.100035,32.92
1,2007-01-04,38.880001,43.050001,27.76,4.75,0.43,54.5,10.33,12.43,4.78,...,13.6,29.25,7.1,21.780001,25.040001,12.9,22.73,19.49,0.100035,33.700001
2,2007-01-05,37.869999,42.55,26.77,4.75,0.38,53.709999,10.15,12.04,4.82,...,13.64,28.549999,7.15,20.530001,25.309999,13.0,22.84,18.73,0.100035,33.850001
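
The table above holds each stock's daily closing price rather than an average. As a sketch of one way to get the average close per stock, assuming a wide table shaped like `avg_closing_price_all` (the `toy_closes` frame and its values are invented for illustration):

```python
import pandas as pd

toy_closes = pd.DataFrame({
    "date": ["2007-01-03", "2007-01-04", "2007-01-05"],
    "aaa": [10.0, 11.0, 12.0],
    "bbb": [20.0, 19.0, None],  # a missing day is skipped by mean()
})

# Column-wise mean over the whole period, one value per stock symbol.
avg_close_per_stock = toy_closes.drop(columns="date").mean()
```

Dropping the `date` column first keeps the mean restricted to the numeric price columns.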


#### The average volume for each stock
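
One possible way to compute this, sketched on an invented symbol-to-DataFrame mapping shaped like the `data` dict (the `toy_data` values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical symbol -> DataFrame mapping, shaped like the `data` dict.
toy_data = {
    "aaa": pd.DataFrame({"volume": [100, 200, 300]}),
    "bbb": pd.DataFrame({"volume": [10, 30]}),
}

# One mean per symbol, sorted so the most heavily traded stock comes first.
avg_volume = pd.Series(
    {symbol: frame["volume"].mean() for symbol, frame in toy_data.items()}
).sort_values(ascending=False)
```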

#### The difference between the average opening price and the average closing price for each stock.
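
A sketch of this comparison under the same dict-of-DataFrames assumption. The `mean_gap` helper and the `toy_data` values are hypothetical, introduced only for this example:

```python
import pandas as pd

toy_data = {
    "aaa": pd.DataFrame({"open": [10.0, 12.0], "close": [11.0, 13.0]}),
    "bbb": pd.DataFrame({"open": [5.0, 7.0], "close": [4.0, 6.0]}),
}

def mean_gap(frame, first, second):
    """Difference between the means of two columns (hypothetical helper)."""
    return frame[first].mean() - frame[second].mean()

# Negative values mean the stock closed higher than it opened on average.
open_minus_close = pd.Series(
    {symbol: mean_gap(frame, "open", "close") for symbol, frame in toy_data.items()}
)
```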

#### The difference between the average high and the average low for each stock.
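
This one could be sketched by stacking all per-stock frames into a single table and grouping by symbol. The `toy_data` values are invented, and the `symbol` level name is an assumption chosen for this example:

```python
import pandas as pd

toy_data = {
    "aaa": pd.DataFrame({"high": [12.0, 14.0], "low": [9.0, 11.0]}),
    "bbb": pd.DataFrame({"high": [6.0, 8.0], "low": [5.0, 5.0]}),
}

# Stack every per-stock frame; the dict keys become the outer index level,
# which `names` labels "symbol".
stacked = pd.concat(toy_data, names=["symbol"])

# Average each column per symbol, then subtract.
means = stacked.groupby(level="symbol")[["high", "low"]].mean()
high_minus_low = means["high"] - means["low"]
```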


## Most Traded Stock Each Day

In [7]:
stocks_volumes = pd.DataFrame()
stocks_volumes['date'] = time_periods
for key in data:
    stocks_volumes[key] = data[key]['volume']
stocks_volumes.head(3)

Unnamed: 0,date,fmbi,cmcsa,boom,ctg,cpah,abco,acta,ecpg,asrv,...,algn,aray,esxb,cvgi,ceco,cybe,fisi,agys,arcw,aciw
0,2007-01-03,275000,39543600,486000,42100,200.0,523000,351100,52600,6800,...,1201200,19190700.0,0,149600,2377600,35600,26500,993300,500,1339800
1,2007-01-04,208700,38122400,888000,67200,300.0,225800,303200,53400,27800,...,504600,3288700.0,800,114700,830500,122300,11000,531000,0,981000
2,2007-01-05,321300,25701200,462100,31900,300.0,242800,518100,80200,31200,...,435800,981100.0,800,182300,1099100,64600,5900,317700,3500,1509900


In [8]:
stocks_volumes['buse'].head()

0    10500
1    11700
2    10900
3     6600
4     8300
Name: buse, dtype: int64

In [9]:
columns = stocks_volumes.columns
columns

Index(['date', 'fmbi', 'cmcsa', 'boom', 'ctg', 'cpah', 'abco', 'acta', 'ecpg',
       'asrv',
       ...
       'algn', 'aray', 'esxb', 'cvgi', 'ceco', 'cybe', 'fisi', 'agys', 'arcw',
       'aciw'],
      dtype='object', length=561)

In [10]:
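# Take the symbols straight from the frame so re-running this cell is safe.
columns = stocks_volumes.columns[1:]
def get_max_vol_stock(row):
    """Return [date, symbol] for the most traded stock in one row."""
    date = row.iloc[0]
    volumes = row.iloc[1:].astype(float).to_numpy()
    idx = np.nanargmax(volumes)  # ignore stocks with no data on this day
    return [date, columns[idx]]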


In [11]:
most_traded_stock_by_day = stocks_volumes.apply(get_max_vol_stock, axis=1)  
most_traded_stock_by_day[:30]   

0     [2007-01-03, aapl]
1     [2007-01-04, aapl]
2     [2007-01-05, aapl]
3     [2007-01-08, aapl]
4     [2007-01-09, aapl]
5     [2007-01-10, aapl]
6     [2007-01-11, aapl]
7     [2007-01-12, aapl]
8     [2007-01-16, aapl]
9     [2007-01-17, aapl]
10    [2007-01-18, aapl]
11    [2007-01-19, aapl]
12    [2007-01-22, aapl]
13    [2007-01-23, aapl]
14    [2007-01-24, aapl]
15    [2007-01-25, aapl]
16    [2007-01-26, aapl]
17    [2007-01-29, aapl]
18    [2007-01-30, aapl]
19    [2007-01-31, aapl]
20    [2007-02-01, aapl]
21    [2007-02-02, aapl]
22    [2007-02-05, aapl]
23    [2007-02-06, aapl]
24    [2007-02-07, aapl]
25    [2007-02-08, aapl]
26    [2007-02-09, aapl]
27    [2007-02-12, aapl]
28    [2007-02-13, aapl]
29    [2007-02-14, aapl]
dtype: object
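
The same per-day lookup can also be sketched with pandas' built-in `DataFrame.idxmax`, shown here on an invented table shaped like `stocks_volumes` (the `toy_volumes` values are assumptions for illustration):

```python
import pandas as pd

toy_volumes = pd.DataFrame({
    "date": ["2007-01-03", "2007-01-04", "2007-01-05"],
    "aaa": [500, 100, 300],
    "bbb": [200, 400, 300],
})

# Index by date so only symbol columns remain, then take the column label
# holding each row's maximum. Ties resolve to the first such column.
most_traded = toy_volumes.set_index("date").idxmax(axis=1)
```

The result is a Series indexed by date, which avoids carrying the date around inside a list.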

## High Transaction Day

In [12]:
# Compute total volume of trading for each day
def agg_volume(row):
    return row[1:].sum()
total_volume_by_day = pd.DataFrame()
total_volume_by_day['date'] = stocks_volumes['date']
total_volume_by_day['total_volume'] = stocks_volumes.apply(agg_volume, axis=1)     
total_volume_by_day.head()


Unnamed: 0,date,total_volume
0,2007-01-03,1036523000.0
1,2007-01-04,837589600.0
2,2007-01-05,745213900.0
3,2007-01-08,746887300.0
4,2007-01-09,1419223000.0


In [13]:
# Sort and find the 10 highest volume days overall
highest_10_volume_day_overall = total_volume_by_day.sort_values(by='total_volume', ascending=False).head(10)    
highest_10_volume_day_overall

Unnamed: 0,date,total_volume
265,2008-01-23,1963754000.0
447,2008-10-10,1769361000.0
141,2007-07-26,1610858000.0
445,2008-10-08,1598255000.0
264,2008-01-22,1578117000.0
276,2008-02-07,1557507000.0
438,2008-09-29,1554716000.0
215,2007-11-08,1552455000.0
261,2008-01-16,1535759000.0
266,2008-01-24,1533396000.0
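
As an alternative sketch of the two steps above, a row-wise `sum` followed by `Series.nlargest` gives the same kind of ranking without a custom function or a full sort. The `toy_volumes` table is invented, and `n=2` is chosen only to fit it:

```python
import pandas as pd

toy_volumes = pd.DataFrame({
    "date": ["2007-01-03", "2007-01-04", "2007-01-05"],
    "aaa": [500, 100, 300],
    "bbb": [200, 400, 900],
})

# Row-wise sum across the symbol columns gives total volume per day.
totals = toy_volumes.set_index("date").sum(axis=1)

# nlargest returns the top-n values, largest first.
top_days = totals.nlargest(2)
```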


In [14]:
highest_volume_dates = highest_10_volume_day_overall['date']
highest_volume_dates

265    2008-01-23
447    2008-10-10
141    2007-07-26
445    2008-10-08
264    2008-01-22
276    2008-02-07
438    2008-09-29
215    2007-11-08
261    2008-01-16
266    2008-01-24
Name: date, dtype: object

In [15]:
# all stock prices by date
stocks_prices = pd.DataFrame()
stocks_prices['date'] = time_periods
for key in data:
    stocks_prices[key] = data[key]['close']
stocks_prices.head()
    

Unnamed: 0,date,fmbi,cmcsa,boom,ctg,cpah,abco,acta,ecpg,asrv,...,algn,aray,esxb,cvgi,ceco,cybe,fisi,agys,arcw,aciw
0,2007-01-03,38.93,42.660001,26.9,4.75,0.435,53.959999,10.32,12.27,4.85,...,13.35,28.469999,7.15,22.1,25.030001,12.85,23.09,20.450001,0.100035,32.92
1,2007-01-04,38.880001,43.050001,27.76,4.75,0.43,54.5,10.33,12.43,4.78,...,13.6,29.25,7.1,21.780001,25.040001,12.9,22.73,19.49,0.100035,33.700001
2,2007-01-05,37.869999,42.55,26.77,4.75,0.38,53.709999,10.15,12.04,4.82,...,13.64,28.549999,7.15,20.530001,25.309999,13.0,22.84,18.73,0.100035,33.850001
3,2007-01-08,38.169998,42.47,26.49,4.75,0.4,53.639999,10.15,11.92,4.76,...,13.68,27.610001,7.15,20.209999,25.110001,13.2,22.9,18.83,0.100035,34.130001
4,2007-01-09,38.119999,42.740001,25.84,4.75,0.375,54.060001,10.21,11.63,4.82,...,13.7,26.620001,7.15,21.049999,25.190001,13.2,23.0,19.02,0.100035,34.42


In [16]:
highest_volume_day_prices_all = stocks_prices[stocks_prices['date'].isin(highest_volume_dates)]  

highest_volume_day_prices_all

Unnamed: 0,date,fmbi,cmcsa,boom,ctg,cpah,abco,acta,ecpg,asrv,...,algn,aray,esxb,cvgi,ceco,cybe,fisi,agys,arcw,aciw
141,2007-07-26,32.77,27.209999,39.209999,4.3,0.385,52.700001,12.06,11.25,3.7,...,27.620001,13.73,7.43,15.3,31.23,12.93,19.040001,20.0,4.959995,30.799999
215,2007-11-08,31.4,19.82,54.91,4.98,0.48,64.07,11.82,10.44,2.96,...,17.040001,14.57,7.4,12.85,33.330002,12.06,17.66,14.46,5.229995,22.36
261,2008-01-16,26.700001,18.18,47.799999,4.58,0.45,64.300003,10.1,7.85,2.75,...,14.22,10.5,7.4,9.67,20.67,10.43,19.219999,14.9,5.299995,13.469999
264,2008-01-22,26.02,16.65,44.189999,4.08,0.4,61.77,9.36,6.92,2.54,...,12.98,10.97,7.41,9.82,16.889999,10.34,17.299999,14.12,4.909995,12.48
265,2008-01-23,28.01,17.26,50.439999,4.25,0.4,62.599998,9.71,7.4,2.53,...,13.48,10.69,7.46,10.29,17.43,10.12,17.469999,14.79,4.899995,13.570001
266,2008-01-24,29.32,17.440001,50.150002,4.4,0.3867,61.619999,9.53,7.25,2.93,...,13.12,10.36,7.43,9.97,19.16,10.08,18.51,14.89,4.899995,13.4
276,2008-02-07,29.870001,17.379999,54.139999,4.03,0.4,54.0,9.07,7.87,3.08,...,13.16,9.04,7.46,9.59,19.68,10.45,19.42,11.91,4.749995,14.989999
438,2008-09-29,24.82,18.01,22.32,5.5,1.17,30.290001,7.57,13.48,2.98,...,10.45,6.49,4.01,8.03,15.85,8.9,20.0,10.34,3.999996,16.510001
445,2008-10-08,21.639999,16.99,18.02,5.6,0.84,29.110001,6.15,9.82,2.37,...,8.39,4.85,3.6,3.0,14.27,8.54,16.93,6.95,3.899996,12.43
447,2008-10-10,23.610001,15.36,17.059999,5.85,1.01,27.559999,6.02,8.73,1.59,...,6.77,4.71,3.35,1.77,15.45,7.96,14.44,6.67,3.839996,11.12
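
To look at how prices moved on those days rather than at their levels, one option is to compare each close with the previous day's close. The sketch below assumes a wide price table shaped like `stocks_prices`; the `toy_prices` values and the `busy_dates` list are invented for illustration:

```python
import pandas as pd

toy_prices = pd.DataFrame({
    "date": ["2008-01-22", "2008-01-23", "2008-01-24"],
    "aaa": [10.0, 11.0, 9.9],
    "bbb": [20.0, 19.0, 19.0],
})
busy_dates = ["2008-01-23"]  # hypothetical high-volume date

# Day-over-day fractional change in close for every symbol.
returns = toy_prices.set_index("date").pct_change()

# Count how many symbols rose and how many fell on each selected date.
busy = returns.loc[busy_dates]
summary = pd.DataFrame({
    "up": (busy > 0).sum(axis=1),
    "down": (busy < 0).sum(axis=1),
})
```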


## The Most Profitable Stock from 2007 to 2017


In [27]:
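# using stocks_prices
def get_most_profitable(col):
    """Buy on the first trading day and sell at the highest closing price."""
    stock_dict = {}
    stock_dict['buy_date'] = time_periods.iloc[0]  # first date in the data
    stock_dict['buy_price'] = col.iloc[0]
    # argmax must be called; nanargmax also skips days with no price
    stock_dict['sell_date'] = time_periods.iloc[np.nanargmax(col.values)]
    stock_dict['sell_price'] = col.max()

    return stock_dict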

In [None]:
df_aapl = data['aapl']


In [28]:
stocks = data.keys()
# DataFrame.append returned a new frame (and was removed in pandas 2.0), so
# collect one record per stock and build the DataFrame in a single step.
records = []
for stock in stocks:
    record = get_most_profitable(stocks_prices[stock])
    record['symbol'] = stock  # keep track of which stock the record describes
    records.append(record)
most_profit_all_stocks = pd.DataFrame(records)
most_profit_all_stocks.head()
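
To pick a single most profitable stock from per-stock records like these, one option is to rank by the ratio of sell price to buy price. This is a sketch under that assumed definition of profit; the `toy_records` table, its values, and the `symbol` and `gain_ratio` columns are invented for illustration:

```python
import pandas as pd

toy_records = pd.DataFrame({
    "symbol": ["aaa", "bbb", "ccc"],
    "buy_price": [10.0, 2.0, 50.0],
    "sell_price": [30.0, 10.0, 55.0],
})

# Relative gain: the sell price as a multiple of the buy price.
toy_records["gain_ratio"] = toy_records["sell_price"] / toy_records["buy_price"]

# The row with the largest ratio is the most profitable under this definition.
best = toy_records.loc[toy_records["gain_ratio"].idxmax()]
```

A ratio is used instead of the raw price difference so that cheap and expensive stocks are compared on the same footing.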





## More to be done:

I've done some basic analysis of the data, but there's still quite a bit more depth to go into:

- What stocks would have been best to short at the start of the period?
- Which stocks have the most after-hours trading, and show the biggest changes between the closing price and the next day open?
- Can technical indicators like Bollinger Bands help us forecast the market?
- What time periods have resulted in steady increases in prices, and what periods have resulted in steady declines?
- Based on price, what was the optimal day to buy each stock if we wanted to hold them until now?
- On days with high trading volume, do stocks move in one direction (up or down) more than the other one?
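
As a starting point for the second question, the change between one day's close and the next day's open could be sketched with `Series.shift`. The `toy` frame and its prices are invented, and using the mean absolute gap as the summary is an assumption:

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["2007-01-03", "2007-01-04", "2007-01-05"],
    "open": [10.0, 10.5, 9.0],
    "close": [10.2, 10.0, 9.5],
})

# Compare each day's open with the previous day's close.
# The first row has no previous day, so its gap is NaN.
toy["overnight_gap"] = toy["open"] - toy["close"].shift(1)

# Mean absolute gap as one possible per-stock summary (NaN is skipped).
mean_abs_gap = toy["overnight_gap"].abs().mean()
```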
