**The Data Science Method**


1.   [Problem Identification](https://medium.com/@aiden.dataminer/the-data-science-method-problem-identification-6ffcda1e5152)

2.   [Data Wrangling](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-data-collection-organization-and-definitions-d19b6ff141c4)
  * Data Collection - Collected data from Wikipedia and the Quandl WIKI Prices dataset. Wikipedia provided the current S&P 500 constituents, whose ticker symbols were then used to query Quandl WIKI Prices.
  * Data Organization - Done using cookiecutter
  * Data Definition
  * Data Cleaning - The S&P 500 data from Quandl's WIKI Prices dataset is clean and ready for analysis, but the Quandl community stopped supporting it as of April 11, 2018. We therefore use this dataset to set up the portfolio optimizer as a proof of concept, then switch to a different data source later for cost efficiency.

3.   [Exploratory Data Analysis](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-exploratory-data-analysis-bc84d4d8d3f9)
 * Build data profile tables and plots
        - Cumulative Return
        - Annualized Return
        - Daily Return
        - Mean Daily Return
        - Standard Deviation Daily Return
        - Simple Moving Average
        - Exponential Moving Average
        - Moving Average Convergence Divergence
        - Adj. Close & Daily Return Covariance
        - Adj. Close & Daily Return Correlation
        - Sharpe Ratio
        - Skew 
        - Kurtosis
 * Explore data relationships
 * Identification and creation of features 

4.   [**Pre-processing and Training Data Development**](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-pre-processing-and-training-data-development-fd2d75182967)
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set
5.   [Modeling](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-modeling-56b4233cad1b)
  * Fit Models with Training Data Set
  * Review Model Outcomes — Iterate over additional models as needed.
  * Identify the Final Model

6.   [Documentation](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-documentation-c92c28bd45e6)

  * Review the Results
  * Present and share your findings - storytelling
  * Finalize Code
  * Finalize Documentation
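
Several of the profile metrics listed under Exploratory Data Analysis can be sketched in a few lines of pandas. The block below is a minimal illustration on a made-up price series, not the notebook's own implementation: the prices, the 3-day windows, and the 12/26 MACD spans are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical adjusted-close prices for one symbol (illustrative values only)
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0, 106.0],
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
    name="adj_close",
)

# Simple daily return as a fraction (first value is NaN)
daily_return = prices.pct_change()

# Cumulative return over the whole window
cumulative_return = prices.iloc[-1] / prices.iloc[0] - 1

# Simple and exponential moving averages over an assumed 3-day window/span
sma_3 = prices.rolling(window=3).mean()
ema_3 = prices.ewm(span=3, adjust=False).mean()

# MACD line: fast EMA minus slow EMA (12 and 26 are assumed, conventional spans)
macd = (
    prices.ewm(span=12, adjust=False).mean()
    - prices.ewm(span=26, adjust=False).mean()
)
```

With `adjust=False` each EMA is seeded with the first observation, so the MACD line starts at zero and drifts as the fast and slow averages separate.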


First, load the needed packages and modules into Python. Then load the data into a pandas DataFrame for ease of use.

In [1]:
# Load Python packages
from __future__ import print_function  # must precede all other statements in the cell

import datetime
import datetime as dt
import os

import dotenv
import ipywidgets as widgets
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pypfopt
import scipy.optimize as spo
import seaborn as sns
from ipywidgets import interact, interactive, fixed, interact_manual
from scipy.stats import norm
from tqdm import tqdm

%matplotlib inline



In [2]:
# prints current directory
current_dir = os.getcwd()
print("Current Directory: ")
print(current_dir)

Current Directory: 
/Users/jb/Development/courses/springboard/ds/Assignments/Portfolio-Optimization/portopt/notebooks/exploratory


In [3]:
# prints the project root, two levels above the current directory
project_dir = os.path.abspath(os.path.join(os.path.join(current_dir, os.pardir), os.pardir))
print("Parent Directory: ")
print(project_dir)

Parent Directory: 
/Users/jb/Development/courses/springboard/ds/Assignments/Portfolio-Optimization/portopt
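
The same "two levels up" lookup can also be written with `pathlib`, which avoids nested `os.path.join` calls. This is only a sketch: the helper name `project_root` and the example path are hypothetical, and in the notebook the starting point would be `Path.cwd()` rather than a hard-coded path.

```python
from pathlib import PurePosixPath

def project_root(start, levels_up=2):
    """Return the directory `levels_up` levels above `start` (hypothetical helper)."""
    # parents[0] is the immediate parent, parents[1] the grandparent, and so on
    return start.parents[levels_up - 1]

# Hypothetical notebook location; in a live notebook use pathlib.Path.cwd() instead
notebook_dir = PurePosixPath("/home/user/portopt/notebooks/exploratory")
project_dir = project_root(notebook_dir)

# The / operator joins path segments without manual string concatenation
interim_csv = project_dir / "data" / "interim" / "yahoo_sp500_adj_close_interim.csv"
```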


In [4]:
print(os.listdir())

['3.0-jujbates-S&P500-PO_eda.ipynb', '3.0-jujbates-S&P500-PO_eda-Copy1.ipynb', '.DS_Store', '5.0-jujbates-S&P500-PO_modeling.ipynb', 'Untitled.ipynb', '4.0-jujbates-S&P500-PO_pre-processing_and_training_data_development.ipynb', '3.0-jujbates-S&P500-PO_eda-Copy2.ipynb', 'stonks.csv', '4.0-jujbates-S&P500-PO_pre-processing_and_training_data_development-Copy2.ipynb', '.ipynb_checkpoints', '4.0-jujbates-S&P500-PO_pre-processing_and_training_data_development-Copy1.ipynb', 'weights.csv', '1.0-jujbates-S&P500-PO_problem_identification.ipynb', '3.0-jujbates-S&P500-PO_eda1.ipynb', '2.0-jujbates-S&P500-PO_data_wrangling.ipynb', '5.0-jujbates-S&P500-PO_modeling-Copy1.ipynb']


In [5]:
plt.style.use('dark_background')
c = ['white', 'springgreen', 'fuchsia', 'lightcoral', 'red'] # Color
s = [24, 20, 16, 12]  # Size
w = [0.75, 1, 1.25, 1.50] # Line Width
ga = 0.10 # Grid Alpha

In [6]:
sp500_adj_close_df = pd.read_csv(project_dir + '/data/interim/'+ 'yahoo_sp500_adj_close_interim.csv', index_col=['date'])
sp500_index_adj_close_df = pd.read_csv(project_dir + '/data/interim/'+ 'yahoo_sp500_index_adj_close_interim.csv', index_col=['date'])

sp500_adj_close_df.index = pd.to_datetime(sp500_adj_close_df.index)
sp500_index_adj_close_df.index = pd.to_datetime(sp500_index_adj_close_df.index)
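
As an alternative to converting the index after loading, `pd.read_csv` can parse the date column while reading. The sketch below uses an in-memory CSV with made-up values; it assumes the interim files have a `date` column followed by one column per ticker, which matches how they are indexed above but is not verified here.

```python
import io

import pandas as pd

# Hypothetical sample in the assumed layout: a 'date' column plus one column per ticker
csv_text = """date,AAPL,MSFT
2019-01-02,38.1,97.0
2019-01-03,34.3,93.4
"""

# parse_dates converts the 'date' column, so the index is a DatetimeIndex on load
prices = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=["date"])
```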


In [7]:
import random

def get_random_date_ranges(start_date='1999-01-01', end_date='2019-12-31', range_window=365, evaluation_window=365, n_ranges=3):
    """Return n_ranges random (start, end) date ranges, each range_window days long."""
    random_date_ranges = []

    range_window_dd = dt.timedelta(days=range_window)
    evaluation_window_dd = dt.timedelta(days=evaluation_window)

    # Use the arguments rather than hard-coded dates, and leave room after each
    # range for an evaluation window that still ends on or before end_date.
    start_date = dt.datetime.strptime(start_date, '%Y-%m-%d')
    end_date = dt.datetime.strptime(end_date, '%Y-%m-%d') - (range_window_dd + evaluation_window_dd)

    days_between_dates = (end_date - start_date).days

    for _ in range(n_ranges):
        random_number_of_days = random.randrange(days_between_dates)
        random_start_date = start_date + dt.timedelta(days=random_number_of_days)
        random_end_date = random_start_date + range_window_dd
        random_date_ranges.append((random_start_date, random_end_date))
    return random_date_ranges
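
The sampler above reserves room for an evaluation window but returns only the first range, and its draws change on every run. A possible variant, sketched here under assumptions, returns the evaluation end date as well and accepts a seed for repeatable draws. The function name `sample_windows` and its parameters are hypothetical and not part of the notebook.

```python
import datetime as dt
import random

def sample_windows(start, end, range_days=365, eval_days=365, n=3, seed=None):
    """Draw n (train_start, train_end, eval_end) triples that fit inside [start, end]."""
    rng = random.Random(seed)  # local generator: a fixed seed gives repeatable draws
    latest_start = end - dt.timedelta(days=range_days + eval_days)
    span_days = (latest_start - start).days
    if span_days <= 0:
        raise ValueError("date span is too short for the requested windows")

    windows = []
    for _ in range(n):
        train_start = start + dt.timedelta(days=rng.randrange(span_days))
        train_end = train_start + dt.timedelta(days=range_days)
        eval_end = train_end + dt.timedelta(days=eval_days)
        windows.append((train_start, train_end, eval_end))
    return windows

windows = sample_windows(dt.date(1999, 1, 1), dt.date(2019, 12, 31), n=5, seed=42)
```

Using a dedicated `random.Random` instance keeps the seed from affecting other code that relies on the global random state.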

In [8]:
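def daily_returns(df):
    """Daily percentage change, expressed in percent."""
    return df.pct_change() * 100

def sharpe_ratio(adr, sddr, rfr=0):
    """Sharpe ratio from the average (adr) and standard deviation (sddr) of daily returns."""
    return (adr - rfr) / sddr

def get_top_50_symbols_with_sharpe_ratio(date_ranges, df, n_top=20):
    """Rank symbols by Sharpe ratio within each date range and keep the best n_top.

    Despite the function name, the cutoff has been 20 symbols; n_top keeps that default.
    """
    sym_df = pd.DataFrame({})
    for sd, ed in tqdm(date_ranges):
        prices = df.loc[sd:ed]
        dr = daily_returns(prices)
        sr = sharpe_ratio(dr.mean(), dr.std(), rfr=0)
        idx = str(sd) + '-->' + str(ed)
        if len(sr) == 1 and sr.index[0] == '^GSPC':
            # Index-only frame: nothing to rank, keep the single symbol
            sym_df[idx] = sr.index.tolist()
        else:
            sym_df[idx] = sr.sort_values(ascending=False)[:n_top].index.tolist()
    return sym_df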
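
Two properties of the Sharpe ratio used above are worth checking on a small example. With a zero risk-free rate, the ratio is the same whether returns are expressed as fractions or in percent, because the factor of 100 cancels. The ratio is also a daily figure unless it is scaled. The sketch below uses made-up returns and the standard library only; the `sharpe` helper and the 252-trading-day annualization are assumptions for illustration, not part of the notebook.

```python
import math
import statistics

def sharpe(returns, rfr=0.0, periods_per_year=None):
    """Sharpe ratio of periodic returns; annualized when periods_per_year is given."""
    excess_mean = statistics.mean(returns) - rfr
    # statistics.stdev is the sample standard deviation (ddof=1), like pandas' default
    ratio = excess_mean / statistics.stdev(returns)
    if periods_per_year is not None:
        ratio *= math.sqrt(periods_per_year)
    return ratio

# Hypothetical daily returns as fractions (illustrative values only)
fractional = [0.01, -0.02, 0.03, 0.00, 0.01]
percent = [r * 100 for r in fractional]

daily = sharpe(fractional)
daily_from_percent = sharpe(percent)  # same value when rfr is 0
annualized = sharpe(fractional, periods_per_year=252)  # 252 trading days is assumed
```

If a non-zero risk-free rate is used with `daily_returns`, it has to be a daily rate in percent, since that function multiplies returns by 100.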

In [9]:
rand_one_year_window_date_ranges = get_random_date_ranges(n_ranges=5)

rand_one_year_window_sym_50_df = get_top_50_symbols_with_sharpe_ratio(rand_one_year_window_date_ranges, sp500_adj_close_df)
rand_one_year_window_sp500_index_df = get_top_50_symbols_with_sharpe_ratio(rand_one_year_window_date_ranges, sp500_index_adj_close_df)

100%|██████████| 5/5 [00:00<00:00, 56.89it/s]
100%|██████████| 5/5 [00:00<00:00, 397.08it/s]


In [10]:
rand_one_year_window_sym_50_df.index.name = 'index'
rand_one_year_window_sp500_index_df.index.name = 'index'

rand_one_year_window_sym_50_df.to_csv(project_dir + '/data/interim/'+ 'sp500_rand_one_year_window_sym_50_interim.csv', index=True)
rand_one_year_window_sp500_index_df.to_csv(project_dir + '/data/interim/'+ 'sp500_rand_one_year_window_sp500_index_interim.csv', index=True)

In [11]:
rand_one_year_window_sym_50_df

| index | 2004-02-03 00:00:00-->2005-02-02 00:00:00 | 2001-08-04 00:00:00-->2002-08-04 00:00:00 | 2009-01-10 00:00:00-->2010-01-10 00:00:00 | 2002-03-25 00:00:00-->2003-03-25 00:00:00 | 2006-12-14 00:00:00-->2007-12-14 00:00:00 |
|---|---|---|---|---|---|
| 0 | DLR | TSCO | FTNT | BSX | MSCI |
| 1 | CBRE | ANTM | AAPL | ODFL | MOS |
| 2 | AAPL | BLL | WDC | ROL | CF |
| 3 | GOOGL | PEAK | BKNG | GRMN | J |
| 4 | GOOG | SPG | F | CME | CMG |
| 5 | HFC | LMT | CERN | ZBH | ISRG |
| 6 | HSY | NWL | ISRG | EBAY | KO |
| 7 | VLO | PSA | GOOG | WYNN | PRGO |
| 8 | NRG | WLTW | GOOGL | AMZN | BKNG |
| 9 | ROK | REG | CTSH | VTRS | CXO |
