# Part 3 -- Get Target Data (S&P 500)

We download our target data (the S&P 500 index) from Yahoo Finance and derive the daily change from it. We use the Adj Close price to determine whether the index went up, went down, or stayed neutral compared to the previous trading day. This will be a **classification model**; the cutoff between classes is 0.001 as a fraction of the price (0.1%), which is the value the code below compares against.

**Load library code**

In [1]:
from os import chdir
chdir('/home/jovyan/work/Portfolio/Analyzing_Unstructured_Data_for_Finance/')

# The wildcard import is assumed to provide pd, dt and joblib, which are used below
from lib import *
# suppress_warnings()

**Convert strings to datetime and numeric values so we can perform calculations**

In [2]:
sp500 = pd.read_csv('../Analyzing_Unstructured_Data_for_Finance/data/3.^GSPC.csv')

In [3]:
sp500['Adj Close'][1773]

'null'
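The `'null'` string above is the reason pandas loads the price columns as `object` rather than as numbers. As a minimal sketch of one way to handle this at load time, the example below reads a small inline sample instead of the real file; the sample layout mirrors the CSV, but the `null` row is made up for illustration.

```python
import io
import pandas as pd

# Illustrative stand-in for the CSV file; the 'null' row is made up for this sketch.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2009-06-15,942.450012,942.450012,919.650024,923.719971,923.719971,4697880000\n"
    "2009-06-16,null,null,null,null,null,null\n"
    "2009-06-17,911.890015,918.440002,903.780029,910.710022,910.710022,5523650000\n"
)

# na_values turns the literal string 'null' into NaN while parsing;
# parse_dates converts the Date column to datetime values.
sample = pd.read_csv(sample_csv, na_values=['null'], parse_dates=['Date'])

print(sample.dtypes)
print(sample['Adj Close'].isna().sum())
```

With this approach the numeric columns arrive as floats, so a separate conversion step is only needed if a specific dtype such as `float32` is wanted.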

In [4]:
def convert_to_datetime_int(df):
    # Parse the 'YYYY-MM-DD' strings into datetime64 values
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # errors='coerce' turns the 'null' string into NaN; downcast stores the result as float32
    df['Adj Close'] = pd.to_numeric(df['Adj Close'], errors='coerce', downcast='float')
    return df

In [5]:
convert_to_datetime_int(sp500)

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2009-06-15 | 942.450012 | 942.450012 | 919.650024 | 923.719971 | 923.719971 | 4697880000 |
| 1 | 2009-06-16 | 925.599976 | 928.000000 | 911.599976 | 911.969971 | 911.969971 | 4951200000 |
| 2 | 2009-06-17 | 911.890015 | 918.440002 | 903.780029 | 910.710022 | 910.710022 | 5523650000 |
| 3 | 2009-06-18 | 910.859985 | 921.929993 | 907.940002 | 918.369995 | 918.369995 | 4684010000 |
| 4 | 2009-06-19 | 919.960022 | 927.090027 | 915.799988 | 921.229980 | 921.229980 | 5713390000 |
| 5 | 2009-06-22 | 918.130005 | 918.130005 | 893.039978 | 893.039978 | 893.039978 | 4903940000 |
| 6 | 2009-06-23 | 893.460022 | 898.690002 | 888.859985 | 895.099976 | 895.099976 | 5071020000 |
| 7 | 2009-06-24 | 896.309998 | 910.849976 | 896.309998 | 900.940002 | 900.940002 | 4636720000 |
| 8 | 2009-06-25 | 899.450012 | 921.419983 | 896.270020 | 920.260010 | 920.260010 | 4911240000 |
| 9 | 2009-06-26 | 918.840027 | 922.000000 | 913.030029 | 918.900024 | 918.900024 | 6076660000 |


In [6]:
sp500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013 entries, 0 to 2012
Data columns (total 7 columns):
Date         2013 non-null datetime64[ns]
Open         2013 non-null object
High         2013 non-null object
Low          2013 non-null object
Close        2013 non-null object
Adj Close    2012 non-null float32
Volume       2013 non-null object
dtypes: datetime64[ns](1), float32(1), object(5)
memory usage: 102.3+ KB


In [7]:
print(min(sp500['Date']))
print(max(sp500['Date']))

2009-06-15 00:00:00
2017-06-12 00:00:00


**Note: This can always be changed from a classification model to a regression model by using Percent_Change as the target instead of Percent_Change_Class.**

**We want the <u>PERCENT CHANGE</u> of each stock so that data is normalized. This is especially important if you ever want to compare different stocks/indices.**

In [8]:
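def get_change_in_Adj_Close(df):
    # Sort chronologically so each row is compared with the previous trading day
    df = df.sort_values(by='Date')
    # Absolute day-over-day change in Adj Close
    df['Diff'] = df['Adj Close'].diff()
    # Relative change; note the denominator is the current day's price,
    # whereas the conventional percent change divides by the previous day's price
    df['Percent_Change'] = (df['Adj Close']-df['Adj Close'].shift(1))/df['Adj Close']
    return df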
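The formula in this function divides the daily difference by the current day's Adj Close, while the conventional percent change divides by the previous day's value (this is what `Series.pct_change()` computes). The sketch below compares the two on the first two Adj Close values from the table above; it is only an illustration of the difference, not a change to the notebook's results.

```python
import pandas as pd

# First two Adj Close values from the table above.
adj_close = pd.Series([923.719971, 911.969971])

# Formula used in get_change_in_Adj_Close: change divided by the current day's price.
current_base = (adj_close - adj_close.shift(1)) / adj_close

# Conventional percent change: change divided by the previous day's price.
previous_base = adj_close.pct_change()

print(current_base.iloc[1])   # about -0.012884
print(previous_base.iloc[1])  # about -0.012720
```

For daily index moves the two values are close, but they are not identical, so it is worth being consistent if these numbers are compared with figures from other sources.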

In [9]:
sp500_df = get_change_in_Adj_Close(sp500)

  


In [10]:
sp500_df.head(2)

| | Date | Open | High | Low | Close | Adj Close | Volume | Diff | Percent_Change |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-06-15 | 942.450012 | 942.450012 | 919.650024 | 923.719971 | 923.719971 | 4697880000 | NaN | NaN |
| 1 | 2009-06-16 | 925.599976 | 928.0 | 911.599976 | 911.969971 | 911.969971 | 4951200000 | -11.75 | -0.012884 |


In [11]:
# Export the data so the calculations can be double-checked in Excel
sp500_df.to_csv('../Analyzing_Unstructured_Data_for_Finance/data/3.sp500.csv')
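The same double-check can also be done in code. As a rough sketch, assuming a small stand-in frame built from Adj Close values shown earlier in this notebook, the example recomputes both columns with plain NumPy arithmetic and compares them with the pandas results.

```python
import numpy as np
import pandas as pd

# Small stand-in frame using Adj Close values shown earlier in this notebook.
check = pd.DataFrame({'Adj Close': [923.719971, 911.969971, 910.710022]})
check['Diff'] = check['Adj Close'].diff()
check['Percent_Change'] = (check['Adj Close'] - check['Adj Close'].shift(1)) / check['Adj Close']

# Recompute both columns with plain NumPy arithmetic on the raw values.
prices = check['Adj Close'].to_numpy()
expected_diff = np.concatenate([[np.nan], prices[1:] - prices[:-1]])
expected_pct = expected_diff / prices

# equal_nan=True treats the leading NaN (no previous day) as a match.
assert np.allclose(check['Diff'], expected_diff, equal_nan=True)
assert np.allclose(check['Percent_Change'], expected_pct, equal_nan=True)
```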

**Note: We set the threshold to 0.001 (0.1%) because a threshold of 0.005 (0.5%) gives the following distribution, which is noticeably more uneven:**
```
down    1484
up       526
```

In [12]:
# Look at the distribution of Percent_Change to determine what threshold to use
print('min: ', sp500_df['Percent_Change'].min())
print('max: ', sp500_df['Percent_Change'].max())
print('mean: ', sp500_df['Percent_Change'].mean())

# Mode of the values rounded to 3 decimal places, then to 5 decimal places
print('mode: ', "{0:.10f}".format(round(sp500_df['Percent_Change'],3).mode()[0]))
print('mode: ', "{0:.10f}".format(round(sp500_df['Percent_Change'],5).mode()[0]))

min:  -0.0713916
max:  0.0452612
mean:  0.000419228
mode:  -0.0000000000
mode:  -0.0014800000
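One way to choose a threshold is to count how many days fall on each side of several candidates. The helper below is a hypothetical sketch (the name `class_counts` and the demo values are made up for illustration); it compares each rounded change against the threshold and reports the resulting class sizes.

```python
import numpy as np
import pandas as pd

def class_counts(pct_change, threshold):
    """Count how many rounded changes fall below, on, or above the threshold."""
    values = pct_change.dropna().round(5)
    return {
        'down': int((values < threshold).sum()),
        'neutral': int((values == threshold).sum()),
        'up': int((values > threshold).sum()),
    }

# Made-up daily changes, for illustration only.
demo = pd.Series([np.nan, -0.012, 0.0004, 0.003, 0.006, -0.002])

for threshold in (0.001, 0.005):
    print(threshold, class_counts(demo, threshold))
```

Running the same loop over `sp500_df['Percent_Change']` would show how quickly the classes become unbalanced as the threshold grows.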


In [13]:
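# Threshold is 0.001, i.e. 0.1% (for the relative change in daily Adj Close price)
def make_binary(data):
    data_list = []
    for d in data:
        if round(d,5) < 0.001:
            data_list.append('down')
        # 'neutral' is assigned only when the rounded change equals the threshold exactly
        elif round(d,5) == 0.001:
            data_list.append('neutral')
        elif round(d,5) > 0.001:
            data_list.append('up')
        # NaN fails every comparison above and is labeled 'n/a'
        else:
            data_list.append('n/a')
    return data_list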

In [14]:
sp500_df['Percent_Change_Class'] = make_binary(sp500_df['Percent_Change'])

In [15]:
sp500_df['Percent_Change_Class'].value_counts()

down    1062
up       948
n/a        3
Name: Percent_Change_Class, dtype: int64
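No row is labeled `neutral` here because that label requires the rounded change to equal 0.001 exactly. The three `n/a` rows are the ones with a missing Percent_Change: presumably the first row, which has no previous day, plus the row with the `'null'` Adj Close and the row right after it. If a real neutral class is wanted, one alternative (not the method used in this notebook) is a symmetric band around zero. The function name and demo values below are made up for illustration.

```python
import numpy as np
import pandas as pd

def label_with_band(pct_change, threshold=0.001):
    """Label changes as up/down/neutral using a symmetric band around zero."""
    conditions = [
        pct_change.isna(),           # no previous day or missing price
        pct_change > threshold,
        pct_change < -threshold,
    ]
    choices = ['n/a', 'up', 'down']
    # Anything inside [-threshold, threshold] falls through to 'neutral'.
    labels = np.select(conditions, choices, default='neutral')
    return pd.Series(labels, index=pct_change.index)

# Made-up daily changes, for illustration only.
demo = pd.Series([np.nan, 0.0152, -0.0109, 0.0004, -0.0007])
print(label_with_band(demo).tolist())
```

Note that a band like this would change the class counts shown above, since small negative moves would move from `down` to `neutral`.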

In [16]:
sp500_df.sample(5)

| | Date | Open | High | Low | Close | Adj Close | Volume | Diff | Percent_Change | Percent_Change_Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 1928 | 2017-02-09 | 2296.699951 | 2311.080078 | 2296.610107 | 2307.870117 | 2307.870117 | 3677940000 | 13.200195 | 0.00572 | up |
| 65 | 2009-09-16 | 1053.98999 | 1068.76001 | 1052.869995 | 1068.76001 | 1068.76001 | 6793530000 | 16.130005 | 0.015092 | up |
| 819 | 2012-09-12 | 1433.560059 | 1439.150024 | 1432.98999 | 1436.560059 | 1436.560059 | 3641200000 | 3.0 | 0.002088 | up |
| 149 | 2010-01-15 | 1147.719971 | 1147.77002 | 1131.390015 | 1136.030029 | 1136.030029 | 4758730000 | -12.429932 | -0.010942 | down |
| 1588 | 2015-10-05 | 1954.329956 | 1989.170044 | 1954.329956 | 1987.050049 | 1987.050049 | 4334490000 | 35.690063 | 0.017961 | up |


In [17]:
sp500_df.shape

(2013, 10)

In [18]:
joblib.dump(sp500_df, '../Analyzing_Unstructured_Data_for_Finance/data/3.sp500_df.pickle')

['../Analyzing_Unstructured_Data_for_Finance/data/3.sp500_df.pickle']
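`joblib.dump` returns the list of file names it wrote, which is the output shown above. A later notebook can restore the frame with `joblib.load`. The sketch below shows the round trip on a small stand-in frame written to a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd

# Small stand-in frame; the real notebook saves sp500_df.
frame = pd.DataFrame({'Adj Close': [923.719971, 911.969971],
                      'Percent_Change_Class': ['n/a', 'down']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_df.pickle')  # hypothetical file name
    joblib.dump(frame, path)      # returns a list of the file names written
    restored = joblib.load(path)  # reads the object back from disk

print(restored.equals(frame))
```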