Stock Market
=============

Step 1: Define a function that queries the Yahoo Finance API for the required stocks and dates, then stores each result as a CSV file in the stock directory.

In [7]:
import yahoo_finance
import csv

columns = ['volume','symbol','adj_close','high','low','date','close','open']

def stock_dl(symbol,start,end):
    # Download the daily history of each ticker and save it as stock/<ticker>.csv.
    for ticker in symbol:
        stock = yahoo_finance.Share(ticker)
        history = stock.get_historical(start,end)

        with open("stock/{}.csv".format(ticker), "w") as to_write:
            writer = csv.writer(to_write,delimiter=",")
            writer.writerow(columns)
            # Each record is a dict; this relies on its key order matching `columns`.
            for record in history:
                writer.writerow([record[key] for key in record.keys()])
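
The loop above writes each record's values in whatever order the dict yields its keys, so the header row only lines up with the data if that order happens to match `columns`. Below is a minimal sketch of one way to pin the order with `csv.DictWriter`. It assumes each record is a dict keyed by the same lowercase names as `columns`; the `write_rows` helper and the sample record are hypothetical, and records returned by the API may use different key names.

```python
import csv
import io

columns = ['volume', 'symbol', 'adj_close', 'high', 'low', 'date', 'close', 'open']

def write_rows(records, out):
    """Write dict records to `out` as CSV, always in `columns` order."""
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)

# Hypothetical record whose keys are deliberately not in `columns` order.
sample = [{'date': '2011-09-22', 'open': '113.25', 'close': '112.86',
           'high': '114.21', 'low': '111.30', 'volume': '513911300',
           'adj_close': '101.77', 'symbol': 'SPY'}]

# An in-memory buffer stands in for the file so the sketch has no side effects.
buffer = io.StringIO()
write_rows(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the writer looks every value up by field name, the data rows follow the header regardless of how the record dict was built.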

Step 2: Run the function above, supplying the needed parameters. We check major tech companies, with SPY as the baseline, over 5 years of data.

In [8]:
symbols = ['SPY','GOOGL','AAPL','TSLA','FB','MSFT']
start = '2011-09-22'
end = '2016-09-22'

stock_dl(symbol=symbols,start=start,end=end)

Step 3: Read the CSV files into pandas data frames, selecting subsets of the data if needed.

In [19]:
import os
import pandas as pd

def symbol_to_path(symbol,base_dir="stock"):
    return os.path.join(base_dir,"{}.csv".format(str(symbol)))

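def get_data(symbols,columns,start,end):
    # Start from an empty frame indexed by every calendar day in the range.
    dates = pd.date_range(start,end)
    df = pd.DataFrame(index=dates)

    for symbol in symbols:
        df_temp = pd.read_csv(symbol_to_path(symbol),index_col='date',parse_dates=True,
                              usecols=columns,na_values=['nan'])
        # Prefix each column with its ticker so the joined columns stay distinct.
        df_temp = df_temp.rename(columns={i:symbol+'_'+i for i in columns})
        df = df.join(df_temp)

    # Keep only the days on which the SPY baseline has data.
    df = df.dropna(subset=['SPY_symbol'])

    return df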
            
df = get_data(symbols=symbols,columns=columns,start=start,end=end)
print(df.head())
print(df.shape)
    

             SPY_volume SPY_symbol  SPY_adj_close    SPY_high     SPY_low  \
2011-09-22  513911300.0        SPY     101.773991  114.209999  111.300003   
2011-09-23  307242500.0        SPY     102.387196  114.160004  112.019997   
2011-09-26  260673700.0        SPY     104.821978  116.400002  112.980003   
2011-09-27  311753900.0        SPY     105.994284  119.559998  116.839996   
2011-09-28  286696800.0        SPY     103.830030  118.489998  114.970001   

             SPY_close    SPY_open  GOOGL_volume GOOGL_symbol  \
2011-09-22  112.860001  113.250000     8791700.0        GOOGL   
2011-09-23  113.540001  112.110001     5549000.0        GOOGL   
2011-09-26  116.239998  114.610001     5263100.0        GOOGL   
2011-09-27  117.540001  118.529999     6015700.0        GOOGL   
2011-09-28  115.139999  117.779999     4522000.0        GOOGL   

            GOOGL_adj_close    ...      FB_low  FB_close  FB_open  \
2011-09-22       260.590580    ...         NaN       NaN      NaN   
2011-09-
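
The join-then-drop pattern in `get_data` is easier to see on a tiny frame. The sketch below is an illustration only: the `SPY_close` values are rounded from the output above, `NEW_close` is a hypothetical symbol that starts trading later, and the real function drops on `SPY_symbol` rather than `SPY_close`.

```python
import pandas as pd

# Calendar index covering every day in the range, weekends included.
dates = pd.date_range('2011-09-22', '2011-09-27')
df = pd.DataFrame(index=dates)

# Baseline prices; 2011-09-24 and 2011-09-25 fall on a weekend, so no rows.
spy = pd.DataFrame(
    {'SPY_close': [112.86, 113.54, 116.24, 117.54]},
    index=pd.to_datetime(['2011-09-22', '2011-09-23',
                          '2011-09-26', '2011-09-27']))

# Hypothetical symbol that only has data from 2011-09-26 onward.
new = pd.DataFrame(
    {'NEW_close': [10.0, 10.5]},
    index=pd.to_datetime(['2011-09-26', '2011-09-27']))

# Left joins keep all six calendar days and fill the gaps with NaN.
df = df.join(spy).join(new)

# Dropping rows where the baseline is missing leaves trading days only.
df = df.dropna(subset=['SPY_close'])
print(df)
```

The weekend rows disappear, while the rows where only the later symbol is missing stay in place with `NaN`, which is the same pattern as the `FB_*` columns in the output above.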

Take a look at the dataframe:

In [5]:
num_steps = 801

def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

with tf.Session(graph=graph) as session:
  # This is a one-time operation which ensures the parameters get initialized as
  # we described in the graph: random weights for the matrix, zeros for the
  # biases. 
  tf.initialize_all_variables().run()
  print('Initialized')
  for step in range(num_steps):
    # Run the computations. We tell .run() that we want to run the optimizer,
    # and get the loss value and the training predictions returned as numpy
    # arrays.
    _, l, predictions = session.run([optimizer, loss, train_prediction])
    if (step % 100 == 0):
      print('Loss at step %d: %f' % (step, l))
      print('Training accuracy: %.1f%%' % accuracy(
        predictions, train_labels[:train_subset, :]))
      # Calling .eval() on valid_prediction is basically like calling run(), but
      # just to get that one numpy array. Note that it recomputes all its graph
      # dependencies.
      print('Validation accuracy: %.1f%%' % accuracy(
        valid_prediction.eval(), valid_labels))
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Loss at step 0: 15.897280
Training accuracy: 10.1%
Validation accuracy: 12.6%
Loss at step 100: 2.297131
Training accuracy: 72.2%
Validation accuracy: 70.6%
Loss at step 200: 1.850722
Training accuracy: 74.9%
Validation accuracy: 72.6%
Loss at step 300: 1.594598
Training accuracy: 76.1%
Validation accuracy: 73.6%
Loss at step 400: 1.422940
Training accuracy: 77.3%
Validation accuracy: 74.2%
Loss at step 500: 1.298927
Training accuracy: 78.1%
Validation accuracy: 74.5%
Loss at step 600: 1.203677
Training accuracy: 78.8%
Validation accuracy: 74.7%
Loss at step 700: 1.127500
Training accuracy: 79.2%
Validation accuracy: 74.8%
Loss at step 800: 1.064721
Training accuracy: 79.8%
Validation accuracy: 75.0%
Test accuracy: 82.1%
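
To make the `accuracy` helper concrete, here is a small worked example in plain NumPy. The function body is copied from the cell above; the prediction and label arrays are made up for illustration.

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows whose highest-scoring class matches the one-hot label.
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

# Hypothetical softmax outputs for 4 examples and 3 classes.
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.4, 0.3],
                        [0.2, 0.2, 0.6]])

# One-hot labels: the true classes are 0, 1, 2, 2.
labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 1]])

score = accuracy(predictions, labels)
print('%.1f%%' % score)  # 3 of the 4 rows match, so this prints 75.0%
```

Only the position of the largest value in each row matters, so the third example counts as wrong even though its scores are nearly tied.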


Let's now switch to stochastic gradient descent training instead, which is much faster.

The graph will be similar, except that instead of holding all the training data in a constant node, we create a placeholder node (`tf.placeholder`) that is fed actual data at every call of `session.run()`.

In [6]:
batch_size = 128

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_labels]))
  biases = tf.Variable(tf.zeros([num_labels]))
  
  # Training computation.
  logits = tf.matmul(tf_train_dataset, weights) + biases
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(
    tf.matmul(tf_valid_dataset, weights) + biases)
  test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

Let's run it:

In [7]:
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 17.714567
Minibatch accuracy: 11.7%
Validation accuracy: 11.8%
Minibatch loss at step 500: 1.099879
Minibatch accuracy: 80.5%
Validation accuracy: 75.3%
Minibatch loss at step 1000: 1.487191
Minibatch accuracy: 80.5%
Validation accuracy: 76.4%
Minibatch loss at step 1500: 0.709410
Minibatch accuracy: 79.7%
Validation accuracy: 77.6%
Minibatch loss at step 2000: 0.870362
Minibatch accuracy: 82.0%
Validation accuracy: 77.3%
Minibatch loss at step 2500: 1.224014
Minibatch accuracy: 72.7%
Validation accuracy: 78.2%
Minibatch loss at step 3000: 0.920802
Minibatch accuracy: 78.9%
Validation accuracy: 78.9%
Test accuracy: 86.1%
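
The offset arithmetic in the training loop can be checked without TensorFlow. The sketch below reuses the same formula on toy sizes; the `minibatch_offsets` helper and the 10-row dataset are hypothetical and exist only to make the pattern visible.

```python
import numpy as np

def minibatch_offsets(num_examples, batch_size, num_steps):
    """Return the start offset the loop above would use at each step."""
    return [(step * batch_size) % (num_examples - batch_size)
            for step in range(num_steps)]

# Toy sizes (hypothetical) so the pattern is easy to read.
num_examples, batch_size = 10, 4
train_dataset = np.arange(num_examples * 2).reshape(num_examples, 2)

offsets = minibatch_offsets(num_examples, batch_size, num_steps=5)

# Slice one minibatch per step, exactly as the loop does.
batches = [train_dataset[o:o + batch_size, :] for o in offsets]
print(offsets)
```

Since each offset is strictly less than `num_examples - batch_size`, every slice holds exactly `batch_size` rows, which is what the fixed-shape placeholders require. With these toy sizes the offsets cycle through 0, 4 and 2, so the last two rows are never visited, a small instance of why the comment in the loop mentions better randomization.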


---
Problem
-------

Turn the logistic regression example with SGD into a 1-hidden layer neural network with rectified linear units [nn.relu()](https://www.tensorflow.org/versions/r0.7/api_docs/python/nn.html#relu) and 1024 hidden nodes. This model should improve your validation / test accuracy.
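
As a shape check before writing the TensorFlow graph, the forward pass of such a network can be sketched in plain NumPy. This is not a solution to the problem: it has no loss, no optimizer and no training, and the sizes (28x28 images, 10 labels), the weight scale and the `forward` helper are assumptions made for illustration.

```python
import numpy as np

# Assumed sizes: 28x28 images and 10 classes are not stated in this section.
image_size, num_labels, num_hidden, batch_size = 28, 10, 1024, 128

rng = np.random.RandomState(0)
x = rng.rand(batch_size, image_size * image_size)

# Layer 1 maps pixels to hidden units; layer 2 maps hidden units to logits.
w1 = rng.normal(scale=0.01, size=(image_size * image_size, num_hidden))
b1 = np.zeros(num_hidden)
w2 = rng.normal(scale=0.01, size=(num_hidden, num_labels))
b2 = np.zeros(num_labels)

def forward(batch):
    # ReLU: negative pre-activations are clipped to zero.
    hidden = np.maximum(0.0, np.dot(batch, w1) + b1)
    logits = np.dot(hidden, w2) + b2
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return hidden, exp / exp.sum(axis=1, keepdims=True)

hidden, probs = forward(x)
print(hidden.shape, probs.shape)
```

Compared with the logistic regression graph, the only structural change is the extra weight matrix and bias with a ReLU between the two matrix multiplications; the validation and test predictions need the same two-layer computation.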

---