# EDA

Before implementing a model, we perform exploratory data analysis (EDA) to better understand the data we were given.

In [None]:
# Import Necessary modules 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import data
raw_data = pd.read_csv("../data/ucsbdata.csv")

In [None]:
# Initial observations 
raw_data.head(5)

In [None]:
# Number of Observations and missing values
print("Number of observations:", len(raw_data))
print("Number of variables:", len(raw_data.columns))
print("Number of missing values:", raw_data.isnull().sum().sum())
print("Number of observations with missing values:", raw_data.isnull().any(axis=1).sum())

In [None]:
# Return the rows that contain at least one missing value
raw_data[raw_data.isnull().any(axis=1)]

The dataset contains a significant number of missing values. The first year with complete observations is 2008, which would allow us to build a model on data that follows the 2008 market crash. If we want to work with years prior to 2008, we will need to analyze the missing values further.
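
To analyze the missing values further, a per-column count and the date of the first complete row are a natural starting point. The sketch below runs on a small made-up frame (`toy_data`, with hypothetical columns `A` and `B`) rather than on `raw_data`; only the `Index` and `R` column names come from our dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_data; the values and the
# columns "A" and "B" are made up for illustration.
toy_data = pd.DataFrame({
    "Index": ["2006-12-31", "2007-12-31", "2008-12-31", "2009-12-31"],
    "A": [np.nan, 1.0, 2.0, 3.0],
    "B": [np.nan, np.nan, 5.0, 6.0],
    "R": [0.1, 0.2, 0.3, 0.4],
})

# Missing values per column, largest count first
missing_per_column = toy_data.isnull().sum().sort_values(ascending=False)

# Date of the first row in which no column is missing
first_complete_date = toy_data.dropna()["Index"].iloc[0]

print(missing_per_column)
print("First complete observation:", first_complete_date)
```

Note that this finds the first complete row only; it does not guarantee that every later row is complete as well.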

## Post Market Crash

The most recent US recession occurred in 2008, and September 29, 2008 was the day the stock market crashed. Since that day, the stocks have seen no comparably large drops or gains. We will assume the stock does not crash during the contest period. Our first model will use data from September 29, 2008 to the end of 2018, which gives us roughly 10 years of data.

In [None]:
# Create dataset after stock market crash
initial_start = '2008-08-30'
stock_data1 = raw_data.loc[raw_data.Index > initial_start]
stock_data1.head(10)
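
The filter above compares `Index` values as plain strings, which orders correctly only if the dates are stored in ISO `YYYY-MM-DD` form; we have not verified this, so it is an assumption. A minimal sketch of a more defensive version parses the dates first. The frame `toy_prices` and its values are made up for illustration, and the cutoff simply reuses the value from the cell above.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror our dataset.
toy_prices = pd.DataFrame({
    "Index": ["2008-07-31", "2008-08-31", "2008-09-30", "2008-10-31"],
    "R": [0.0, 0.1, 0.2, 0.3],
})

# Parse the date strings so comparisons are made on real timestamps
toy_prices["Index"] = pd.to_datetime(toy_prices["Index"])

# Keep only the rows strictly after the cutoff date
cutoff = pd.Timestamp("2008-08-30")
after_cutoff = toy_prices.loc[toy_prices["Index"] > cutoff].reset_index(drop=True)

print(after_cutoff)
```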

## Feature Selection

Our data is considered high dimensional since it consists of 67 variables plus our target variable. For our model to run effectively, we need to select only the variables that have the most influence in predicting the target. We will try several feature selection methods to choose the variables for our model.


In [None]:
# Store the dates separately
data_dates = stock_data1['Index']
data_dates.head(5)

In [None]:
# Remove data dates from our current complete dataset
no_dates = stock_data1.drop(columns=['Index'])
no_dates.head(5)

In [None]:
# The variable that we are interested in predicting
actual_returns = stock_data1['R']
actual_returns.head(5)

In [None]:
# The feature variables
feature_variables = stock_data1.drop(columns = ['Index', 'R'])
feature_variables.head(5)

### Pearson's Correlation

In [None]:
# Number of features to keep
number_features = 8
# Score each feature with a univariate F-test against the target and keep the top k
fr = SelectKBest(score_func = f_regression, k = number_features)
pearsons_features = fr.fit_transform(feature_variables, actual_returns)
# Save the values of the selected features
np.savetxt('../data/pearsons_features.txt', pearsons_features, fmt = '%f')
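
`fit_transform` returns a bare array by default, so the saved file does not record which columns were kept. One way to recover the names is `get_support()`, sketched here on synthetic data (`toy_X`, `toy_y`, and the feature names `f0` to `f4` are hypothetical). `f_regression` scores each feature on its own, using an F-statistic computed from its correlation with the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only f1 and f3 actually drive the target
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(200, 5)),
                     columns=["f0", "f1", "f2", "f3", "f4"])
toy_y = 3.0 * toy_X["f1"] - 2.0 * toy_X["f3"] + rng.normal(scale=0.1, size=200)

# Fit the selector, then map the boolean support mask back to column names
toy_selector = SelectKBest(score_func=f_regression, k=2)
toy_selector.fit(toy_X, toy_y)
selected_names = toy_X.columns[toy_selector.get_support()].tolist()

# F-scores for every feature, highest first
toy_scores = pd.Series(toy_selector.scores_, index=toy_X.columns)
toy_scores = toy_scores.sort_values(ascending=False)

print(selected_names)
print(toy_scores)
```

In our notebook the same idea would apply to `fr` and `feature_variables.columns`, which would let us save the feature names alongside the values.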

## PCA

Instead of selecting individual features, we will use principal components, which are linear combinations of the original features.

In [None]:
# Split data into training set and test set for PCA
train_data, test_data, train_lbl, test_lbl = train_test_split(feature_variables, 
                                                             actual_returns.values,
                                                             test_size = 0.2, 
                                                             random_state = 0)

In [None]:
# Initialize the scaler used to standardize the data
scaler = StandardScaler()
# Fit: compute the training mean and standard deviation used for scaling
scaler.fit(train_data)
# Apply the standardization to the training and test data
transformed_train_data = scaler.transform(train_data)
transformed_test_data = scaler.transform(test_data)

In [None]:
# Initialize PCA class 
pca = PCA(svd_solver='full')
# Fit PCA to the training set
pca.fit(transformed_train_data)
# Apply PCA to both training and test data
pca_train_data = pca.transform(transformed_train_data)
pca_test_data = pca.transform(transformed_test_data)

The scree plot will help us choose the number of components to use in our models. We are looking for the last major drop in explained variance, so we want the number of principal components just before the scree plot levels out.

In [None]:
# Scree plot for PCA
n_components = len(pca.explained_variance_ratio_)
plt.figure(figsize=(15, 10))
sns.set(style="whitegrid")
sns.lineplot(x = np.arange(n_components), y = pca.explained_variance_ratio_,
            color = "purple")
plt.title("Scree Plot for PCA Components", size = 20)
plt.ylabel("Explained Variance", size = 15)
plt.xlabel("Number of components", size = 15)
plt.xticks(np.arange(0, n_components, 2))
# Mark the chosen number of components
plt.vlines(x = 8, ymin = 0, ymax = 0.35, colors = 'red', linestyles='--')
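
Reading the elbow off a plot is a judgment call. As a complementary check, we could look at the cumulative explained variance and count how many components are needed to reach a chosen share of it. The sketch below uses synthetic data (`toy_X` is hypothetical), and the 90% threshold is an arbitrary assumption, not a value from our analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features driven by 2 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 10))
toy_X = latent @ loadings + 0.05 * rng.normal(size=(300, 10))

# Standardize, then fit PCA with all components retained
toy_scaled = StandardScaler().fit_transform(toy_X)
toy_pca = PCA(svd_solver="full").fit(toy_scaled)

# Running total of the explained variance ratio
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # arbitrary choice for illustration
n_keep = int(np.argmax(cumulative >= threshold)) + 1

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed:", n_keep)
```

With our data, the same two lines applied to `pca.explained_variance_ratio_` would show how much variance the first 8 components capture.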

In [None]:
# Number of components
target_component = 8
# Keep only the first 8 components
pca_train_data = pca_train_data[:,0: target_component]
pca_test_data = pca_test_data[:, 0: target_component]

In [None]:
# Save PCA results
np.savetxt('../data/pca_train_data.txt', pca_train_data, fmt = '%f')
np.savetxt('../data/pca_test_dat.txt', pca_test_data, fmt = '%f')
# Save labels
np.savetxt('../data/train_labels.txt', train_lbl, fmt = '%f')
np.savetxt('../data/test_labels.txt', test_lbl, fmt = '%f')