In [240]:
#Import Clean data set from Data Cleaning notebook
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from math import sqrt

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import explained_variance_score,mean_absolute_error, mean_squared_error, r2_score
os.chdir(r'C:\Users\nmur1\Google Drive\Springboard\Capstone 1\CleanData')

df = pd.read_csv('modeling.csv',index_col = 0)

## Final Data Cleaning that I Missed in the Other Stages

In [241]:
#whoops, still have a few N/A values.
nas=pd.DataFrame(df.isnull().sum().sort_values(ascending=False)/len(df),columns = ['percent'])
pos = nas['percent'] > 0
nas[pos]

Unnamed: 0,percent
Industry,0.075658
Sector,0.075658
HSAFRAL,0.069901
CEOAge,0.059211
CEOGender,0.023026
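
The same per-column missing share can be computed a little more directly, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up frame (the `toy` data is hypothetical and is not the modeling set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the modeling set
toy = pd.DataFrame({
    "CEOAge": [50.0, np.nan, 61.0, 45.0],
    "Sector": ["Tech", None, None, "Finance"],
    "GDP": [1.0, 2.0, 3.0, 4.0],
})

# isnull().mean() is the fraction of missing values in each column
share = toy.isnull().mean().sort_values(ascending=False)

# Keep only the columns that actually have missing values
missing = share[share > 0].to_frame("percent")
print(missing)
```

Note that the `percent` column holds fractions (0.5 means 50%), which matches how the output above is scaled.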


In [242]:

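#fill numeric gaps with the column mean and categorical gaps with a default label
#(assigning the result back avoids relying on inplace fills through attribute access)
df['HSAFRAL'] = df['HSAFRAL'].fillna(df['HSAFRAL'].mean())
df['CEOAge'] = df['CEOAge'].fillna(df['CEOAge'].mean())
df['CEOGender'] = df['CEOGender'].fillna('male')
df['Sector'] = df['Sector'].fillna('Other')
df['Industry'] = df['Industry'].fillna('Other')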
nas=pd.DataFrame(df.isnull().sum().sort_values(ascending=False)/len(df),columns = ['percent'])
pos = nas['percent'] > 0
nas[pos]


Unnamed: 0,percent


In [243]:
df['Industry'].nunique()

119

In [244]:
df.CEOGender.value_counts()

male       1089
female       65
unknown      62
Name: CEOGender, dtype: int64

In [245]:
#drop the year off my quarter column to limit it to 4 distinct values, Q1, Q2, Q3, Q4
df.FQ = df.FQ.apply(lambda x: str(x[0]))

# Data Definitions

### The following table defines the fields in my modeling set. My Y (or prediction) variable will be the stock's price change between trading days 0 and 260 (approx 1 year after the IPO date).

* Day 0 of trading represents the day the company went public (its IPO date)
* Day 260 is approximately 1 year after the IPO date. Note: it's not 365 due to holidays, weekends, and other days the stock market is closed.
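
As a rough check on the 260 figure: a year has 52 weeks of 5 weekdays each, so about 260 weekdays before any market holidays are subtracted. The sketch below counts weekdays with NumPy. The year 2015 is an arbitrary choice for illustration, and exchange holidays are not removed.

```python
import numpy as np

# Weekdays (Mon-Fri) between two dates; the end date is exclusive.
# 2015 is an arbitrary example year; exchange holidays are NOT excluded here.
weekdays_in_year = int(np.busday_count("2015-01-01", "2016-01-01"))

print(weekdays_in_year, 52 * 5)
```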

|Field        | Definition                                               | Source                               |
|:--------------|:----------------------------------------------------------|:--------------------------------------|
| Symbol       | Stock ticker symbol                                      | Original Kaggle Set                  |
| Month        | Month of IPO date                                        | Original Kaggle Set                  |
| Day          | Day (date) of IPO date                                   | Original Kaggle Set                  |
| Year         | Year of IPO date                                         | Original Kaggle Set                  |
| CEOGender    | CEO Gender                                               | Original Kaggle Set                  |
| Industry     | Company Industry                                         | Original Kaggle Set                  |
| Sector       | Company Sector                                           | Original Kaggle Set                  |
| CEOAge       | CEO Age                                                  | Original Kaggle Set                  |
| DayofWeek    | Day of week (Mon - Friday) of the IPO Date               | Original Kaggle Set                  |
| FQ           | Fiscal Quarter of IPO Date                               | Original Kaggle Set                  |
| GDP          | Total GDP for the Quarter of the IPO                     | Federal Reserve Economic Data (FRED) |
| GDP Growh    | Rolling 4 Quarter GDP Growth for the Quarter of IPO      | Federal Reserve Economic Data (FRED) |
| FEDFUNDS     | Fed Funds Interest Rate for Quarter of IPO               | Federal Reserve Economic Data (FRED) |
| UNRATE       | Unemployment rate for Quarter of IPO                     | Federal Reserve Economic Data (FRED) |
| UMCSENT      | Univ of Mich Consumer Sentiment Score for Quarter of IPO | Federal Reserve Economic Data (FRED) |
| PE_Ratio     | Average S&P 500 PE Ratio for Quarter of IPO              | Quandl S&P 500 API                   |
| SP_Value     | Gross S&P Value for Quarter of IPO                       | Quandl S&P 500 API                   |
| SP500 Growth | Rolling 4 Quarter S&P 500 Growth for Quarter of IPO      | Quandl S&P 500 API                   |
| HSAFRAL      | Homes sold as foreclosure for Quarter of IPO             | Quandl Zillow API                    |
| HPI          | Freddie Mac Housing Pricing Index for Quarter of IPO     | Quandl Freddie MAC API               |
| 0 to 65      | Stock price change. Trading days: 0 to 65                | Original Kaggle Set                  |
| 65 to 130    | Stock price change. Trading days: 65 to 130              | Original Kaggle Set                  |
| 130 to 195   | Stock price change. Trading days: 130 to 195             | Original Kaggle Set                  |
| 195 to 260   | Stock price change. Trading days: 195 to 260             | Original Kaggle Set                  |
| 0 to 260     | **Target variable.** Stock price change. Trading days: 0 to 260 | Original Kaggle Set                  |
| 0 to 30      | Stock price change. Trading days: 0 to 30                | Original Kaggle Set                  |
| 30 to 65     | Stock price change. Trading days: 30 to 65               | Original Kaggle Set                  |

### I'll start by writing a re-usable function and a re-usable class that will be used throughout

* The **'dummy'** function takes in a dataframe and a list of variables I want to exclude from conversion. The output is a new dataframe with categorical/object variables converted into dummy variables
<br><br>
* My **'LinearModel'** class leverages the linear regression models in the SciKit-Learn package. Instantiate the class with the following inputs:
<br><br>
    * df: Dataframe to be used for modeling
    * xvar: A list of dataframe fields to exclude from your X (feature) matrix
    * yvar: your dependent/y variable. What you are trying to predict
    * tsize: the share of rows held out for testing. I.e. enter .25 if you want a 75/25 train/test split
<br><br>
* The class runs the typical series of steps when modeling in linear regression including:
    * Scaling X values
    * creating a train/test split
    * fitting and predicting the model
    * calculating the intercept
    * error scoring using rmse, mae, evs, and r squared metrics
    * defining coefficients for independent variables
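
The steps listed above can also be sketched with a scikit-learn `Pipeline`. This is an alternative arrangement and not what the class below does: here the scaler is fitted on the training split only, so the test rows do not influence the scaling. The data is synthetic and every variable name is a placeholder.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 rows, 3 features, a known linear signal plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Split first, so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Scaling and regression are fitted together on the training split
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Error scores on the held-out rows
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
mae = float(mean_absolute_error(y_test, y_pred))
r2 = float(r2_score(y_test, y_pred))
print(rmse, mae, r2)
```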

In [246]:
#write a function to get my dummy variables

def dummy(df, exclude):

    #find every object (string) column in the dataframe
    objects = df.select_dtypes(include=['object'])

    for col in objects.columns:

        if col not in exclude:

            #swap the original column for its dummy columns
            df = pd.concat([df.drop(columns = [col]), pd.get_dummies(df[[col]])], axis = 1)

    return df
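
To show what this kind of conversion produces, here is a small self-contained sketch on a made-up frame. It uses the `columns` argument of `pd.get_dummies` rather than the loop above, and it picks non-numeric columns with `pd.api.types.is_numeric_dtype`; both are illustrative choices, and the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: one ID column, one category, one numeric field
toy = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC"],
    "Sector": ["Tech", "Finance", "Tech"],
    "CEOAge": [50, 61, 45],
})

exclude = ["Symbol"]

# Encode every non-numeric column that is not explicitly excluded
to_encode = [
    col for col in toy.columns
    if col not in exclude and not pd.api.types.is_numeric_dtype(toy[col])
]

encoded = pd.get_dummies(toy, columns=to_encode)
print(list(encoded.columns))
```

The excluded `Symbol` column passes through unchanged, and `Sector` is replaced by one indicator column per level.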

In [247]:
#write a class for my linear model to quickly reuse

class LinearModel:
    
    def __init__(self, df, xvar, yvar, tsize):
        
        #define class variables
        self.df = df
        self.xvar = xvar
        self.yvar = yvar
        self.tsize = tsize
    
        #build the X (features) and y (target) arrays for modeling
        self.X = self.df.drop(self.xvar, axis = 1)
        self.y = self.df[self.yvar].to_numpy()
    
        #scale X (note: the scaler is fit on all rows, before the train/test split)
        self.scaler = preprocessing.StandardScaler().fit(self.X)
        self.X_scaled =  self.scaler.transform(self.X)
        
        #split train
        self.X_train, self.X_test, self.y_train, self.y_test = \
            train_test_split(self.X_scaled, self.y, test_size = self.tsize, random_state = 1)
        
        
        #prediction and intercepts
        self.lm = linear_model.LinearRegression()
        model = self.lm.fit(self.X_train, self.y_train)
        self.y_pred = model.predict(self.X_test)
        self.inter = self.lm.intercept_
        
        #error scoring
        self.rmse = sqrt(mean_squared_error(self.y_test, self.y_pred))
        self.mae = mean_absolute_error(self.y_test, self.y_pred)
        self.evs = explained_variance_score(self.y_test, self.y_pred)
        self.r2 = r2_score(self.y_test, self.y_pred)

    #coefficient table (absolute values, sorted largest first)
    def coefficients(self):
        
        coef = np.round(self.lm.coef_,2)
        df = pd.DataFrame(abs(coef), self.X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending = False)
        
        return df
    
    #table to compare predicted values against test values
    def compdf(self):
        df = pd.DataFrame({'Actual':self.y_test, 'Predicted': self.y_pred}, columns = ['Actual','Predicted'])
        return df



### My first dataset will remove Day, Year, and Industry

In [248]:
#Day and Year are irrelevant
#Industry has 119 unique values, which creates too many dummy variables in the model.

to_drop = ['Day', 'Year', 'Industry']

#drop the columns; errors = 'ignore' skips any that are already gone
Quarterly = df.drop(columns = to_drop, errors = 'ignore')

#treat Month as a category so it is converted into dummy variables
Quarterly['Month'] = Quarterly['Month'].astype('object')


In [249]:
#run my first model using the dummy function and LinearModel Class I wrote

model1 = dummy(Quarterly, ['Symbol'])

xvar = ['Symbol', '0 to 260']
yvar = '0 to 260'

m1 = LinearModel(model1, xvar, yvar, .25)
score1 = [m1.rmse, m1.mae, m1.evs, m1.r2]

coef1 = m1.coefficients()
coef1.head(10)

Unnamed: 0,Coefficient
FQ_1,65250230000000.0
Month_7,46302600000000.0
CEOGender_male,44865220000000.0
Month_8,41886700000000.0
Month_9,40678970000000.0
CEOGender_female,32998050000000.0
FQ_3,32947370000000.0
DayofWeek_Thur,32847210000000.0
Month_6,32822960000000.0
CEOGender_unknown,32269540000000.0


### The month of the IPO and the CEO gender are in the top 10 coefficients, and all of the top coefficients are implausibly large (on the order of 1e13). Overall I feel these features are adding noise to the data and will drop them below
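
One possible reading of coefficients this large is the dummy-variable trap rather than a real effect: when every level of a category gets its own column, those columns sum to 1 and duplicate the intercept, so the fit is not uniquely determined. A minimal sketch with made-up quarter labels follows. Whether this fully explains the output above is an assumption.

```python
import numpy as np

# Made-up quarter labels for 8 observations
quarters = np.array([1, 2, 3, 4, 1, 2, 3, 4])

# One indicator column per level, as a full dummy encoding produces
dummies = (quarters[:, None] == np.array([1, 2, 3, 4])).astype(float)

# Design matrix: an intercept column plus all four dummy columns
design = np.column_stack([np.ones(len(quarters)), dummies])

# 5 columns but only rank 4, because the dummies sum to the intercept column
rank = int(np.linalg.matrix_rank(design))
print(design.shape[1], rank)

# Dropping one level restores full column rank
reduced = design[:, :-1]
print(reduced.shape[1], int(np.linalg.matrix_rank(reduced)))
```

In pandas, `pd.get_dummies(..., drop_first=True)` drops one level per category, which is one common way to avoid this.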

In [250]:
# add the desired columns to drop to the list; model2 is built from the result
to_drop = ['Day', 'Year', 'Industry', 'Month', 'CEOGender']

#errors = 'ignore' skips columns that were already dropped earlier
Quarterly = Quarterly.drop(columns = to_drop, errors = 'ignore')
    



In [251]:
#rerun model using m2 as the variable
model2 = dummy(Quarterly, ['Symbol'])
xvar = ['Symbol', '0 to 260']
yvar = '0 to 260'

m2 = LinearModel(model2, xvar, yvar, .25)
score2 = [m2.rmse, m2.mae, m2.evs, m2.r2]

coef2 = m2.coefficients()
coef2.head(10)

Unnamed: 0,Coefficient
65 to 130,17.27
130 to 195,17.05
195 to 260,16.03
0 to 65,14.31
GDP,7.73
SP_Value,4.06
HPI,3.38
30 to 65,2.38
0 to 30,2.11
UNRATE,1.46


* My new coefficients make more sense now. I'd expect to see several of the macroeconomic indicators such as GDP, S&P500 values, HPI, and the pricing changes for the stock itself to show up
<br><br>
* However, if my goal is to predict the price change from days 0 to 260 I shouldn't be using pricing changes from days 195 to 260 as inputs. I would want to be able to input all available data as soon as possible - by day 195 the year is almost over.
<br><br>
* In my next run I'll remove price changes: 65 to 130, 130 to 195, and 195 to 260. I'll keep 0 to 30, 30 to 65, and 0 to 65. These inputs would still allow me to evaluate an IPO and make a buy decision within the first Quarter of the IPO date
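
The rule described above, keeping only price windows that have closed by the time of the buy decision, can be sketched as a small filter on the column names. The `cutoff_day` name and the helper are hypothetical. The column names come from the data definitions table.

```python
# Price-change columns from the data definitions table
price_windows = [
    "0 to 30", "30 to 65", "0 to 65",
    "65 to 130", "130 to 195", "195 to 260",
]

def window_end(name):
    """Return the last trading day covered by a column such as '30 to 65'."""
    return int(name.split(" to ")[1])

# Hypothetical cutoff: the decision is made after the first 65 trading days
cutoff_day = 65

# Windows that have closed by the cutoff are usable inputs
usable = [col for col in price_windows if window_end(col) <= cutoff_day]

# Windows that end later would leak future information into the model
leaky = [col for col in price_windows if window_end(col) > cutoff_day]

print(usable)
print(leaky)
```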


In [252]:
# add the desired columns to drop to the list; model3 is built from the result
to_drop = ['Day', 'Year', 'Industry', 'Month', 'CEOGender', '65 to 130', '130 to 195', '195 to 260']

#errors = 'ignore' skips columns that were already dropped earlier
Quarterly = Quarterly.drop(columns = to_drop, errors = 'ignore')
    



In [253]:
#rerun through my regression class

model3 = dummy(Quarterly, ['Symbol'])
xvar = ['Symbol', '0 to 260']
yvar = '0 to 260'

m3 = LinearModel(model3, xvar, yvar, .25)
score3 = [m3.rmse, m3.mae, m3.evs, m3.r2]

coef3 = m3.coefficients()
coef3.head(10)


Unnamed: 0,Coefficient
GDP,37.25
SP_Value,26.07
HPI,17.76
0 to 65,10.83
UNRATE,10.25
0 to 30,6.15
HSAFRAL,5.7
30 to 65,5.24
Sector_Health Care,3.4
Sector_Finance,2.76


### Finally I'll compare the error scores from each model and determine next steps

In [254]:
#compare modeling scores in data frame

metrics = ['rmse', 'mae', 'evs', 'r2']
scores = dict(metrics = metrics, m1 = score1, m2= score2, m3 = score3)
pd.DataFrame(scores)

Unnamed: 0,metrics,m1,m2,m3
0,rmse,8.587441,8.456645,32.365504
1,mae,5.536138,5.473787,23.051205
2,evs,0.947843,0.949421,0.262373
3,r2,0.947843,0.94942,0.259115


# Conclusion/Next Steps

* **Model 1 and Model 2** yielded similar error scores, with a root mean squared error of approx 8.5 points. Read loosely, this means that given all the required inputs, the model's prediction of a new company's 1 year price change from its IPO date is typically off by about 8.5 percentage points. In other words, if the model predicted a 30% price increase, I could reasonably expect the actual change to fall roughly between 21.5% and 38.5%, although RMSE is a typical error size rather than a guaranteed bound. In this case I would be very confident in investing in the stock.
<br><br>
* Unfortunately **Models 1 and 2** are inherently flawed. The stock's price changes between days 65 to 130, 130 to 195, and 195 to 260 are among the independent variables used. An investor could NOT reasonably use these as inputs into the model. The goal is to predict the 1 year price change at the time of the IPO – or at least within the first quarter after the IPO date. We cannot wait for data from the 2nd, 3rd, and 4th quarters in order to make the investment decision.
<br><br>
* In **Model 3** - when the 65 to 130, 130 to 195, and 195 to 260 price changes were removed - the accuracy declined dramatically. RMSE jumped to 32.37 points and R squared fell to 0.26, indicating that the model’s prediction could be off by roughly +/- 32 percentage points. In my opinion this margin of error is too high to make an investment decision.
<br><br>
* The initial results and comparative differences between Models 1/2 and Model 3 can be expected. The more pricing data I feed into the model in the first year, the more accurate the prediction for a full year change will be. In an extreme example, if I input price changes for days 0 to 200, I'm sure I would get a very accurate prediction for the full year (days 0 to 260) change.
<br><br>
* As I continue to refine and tune the model I will need to re-evaluate the input (independent) variables. Next steps will be to:
    * Add more pricing data/ranges in the first 60 days of trading
    * Scrape the web in an attempt to get fundamental financial data for the stocks in question
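
The earlier point, that later sub-period price changes nearly determine the full-year change, can be illustrated with compounding. This sketch assumes the price-change columns are percentage returns over consecutive windows, which the notebook does not state explicitly. The numbers are invented.

```python
# Invented percentage changes for windows 0-65, 65-130, 130-195, 195-260
quarterly_pct = [10.0, -5.0, 20.0, 2.0]

# Compound the four sub-period changes into one full-period growth factor
growth = 1.0
for pct in quarterly_pct:
    growth *= 1.0 + pct / 100.0

# Convert the growth factor back into a percentage change
full_year_pct = (growth - 1.0) * 100.0
print(round(full_year_pct, 2))
```

Under that assumption, once all four windows are known the full-year figure follows from arithmetic alone, which is why Models 1 and 2 score so well without being useful at decision time.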