# Exploratory Data Analysis

In [None]:
#imports

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

#helpers
sigLev = 3
sns.set_style("dark")

In [None]:
#load in dataset
trainFrame = pd.read_csv("../input/train.csv")
testFrame = pd.read_csv("../input/test.csv")

# Metadata Analysis

In [None]:
trainFrame.shape

We see that we have 292 columns for each of our 30471 observations. This is a large feature set, and we may need some form of dimensionality reduction to bring it down to a more manageable size.

In [None]:
trainFrame.dtypes

Thankfully, it looks like most of our variables are quantitative, which makes choosing a dimensionality reduction method easier than it would be with many interspersed categorical variables.
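
Scanning the full `dtypes` listing by eye is error-prone with this many columns. As a minimal sketch, a dtype tally and an explicit list of non-numeric columns can be produced as follows. The toy frame and its column names are illustrative stand-ins, not columns taken from the real dataset.

```python
import pandas as pd

# Toy frame standing in for trainFrame (column names are made up for illustration)
toyFrame = pd.DataFrame({
    "area": [40.0, 55.5, 62.0],
    "rooms": [1, 2, 3],
    "district": ["north", "south", "north"],
})

# count how many columns there are of each dtype
dtypeCounts = toyFrame.dtypes.value_counts()
print(dtypeCounts)

# list the non-numeric columns explicitly
nonNumericCols = toyFrame.select_dtypes(exclude="number").columns.tolist()
print(nonNumericCols)
```

Running the same two calls on `trainFrame` would show directly which columns are not quantitative.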

In [None]:
trainFrame.isnull().sum()

We see that we have missing values in many columns of our dataset. Let's see how much our dimensionality would be reduced if we were to remove the variables with missing values.

In [None]:
numMissing = trainFrame.isnull().sum()
numWithMissingObs = numMissing[numMissing > 0].shape[0]
print(numWithMissingObs)

It looks like only 51 of our more than 200 variables have missing values, so removing them still leaves most of the feature set. For the sake of simplicity, let us drop these variables.
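
As a small sketch of this drop on a toy frame (the columns `a` through `d` are hypothetical), the index-based approach gives the same result as pandas' built-in `dropna(axis=1)`:

```python
import numpy as np
import pandas as pd

# Toy frame: columns "b" and "c" each contain one missing value
toyFrame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
    "c": ["x", None, "z"],
    "d": [10, 20, 30],
})

# count missing values per column and pick the columns that have any
toyMissing = toyFrame.isnull().sum()
toyColsWithMissing = toyMissing[toyMissing > 0].index
print(len(toyColsWithMissing))

# drop by explicit column index, and via dropna for comparison
droppedByIndex = toyFrame.drop(toyColsWithMissing, axis=1)
droppedByDropna = toyFrame.dropna(axis=1)
print(droppedByIndex.equals(droppedByDropna))
```

Keeping the explicit column index around has one advantage: the same columns can later be dropped from the test frame.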

In [None]:
colsWithMissingObs = numMissing[numMissing > 0].index
filteredTrainFrame = trainFrame.drop(colsWithMissingObs,axis = 1)

In [None]:
filteredTrainFrame.shape

Let's now filter out variables that have little to no variation in our dataset.

In [None]:
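#standard deviation of numeric columns only (recent pandas raises on string columns such as timestamp)
sdVec = filteredTrainFrame.std(numeric_only = True)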
sdVec = sdVec.sort_values()
sdVec

We see that one variable has an unusually low standard deviation compared to the rest of the dataset. Since it is a single variable, it isn't worth the trouble of removing it by hand when the feature reduction methods we apply later will likely discard it anyway.
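
If we did want to remove low-variation columns, a minimal sketch might look like the following. The cutoff `sdThreshold` is a hypothetical value chosen for the toy data, not one derived from this dataset.

```python
import pandas as pd

# Toy frame with one nearly constant numeric column (names are illustrative)
toyFrame = pd.DataFrame({
    "nearlyConstant": [1.0, 1.0, 1.0, 1.001],
    "varied": [3.0, 9.0, 1.0, 7.0],
    "label": ["a", "b", "a", "b"],
})

sdThreshold = 0.01  # hypothetical cutoff, would need tuning on real data

# standard deviation of the numeric columns only
toySdVec = toyFrame.std(numeric_only=True)

# columns whose standard deviation falls below the cutoff
lowVarCols = toySdVec[toySdVec < sdThreshold].index.tolist()
print(lowVarCols)

reducedFrame = toyFrame.drop(lowVarCols, axis=1)
```

Note that a raw standard deviation depends on each variable's units, so a single cutoff is only meaningful when the columns are on comparable scales.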

In [None]:
timeCountFrame = filteredTrainFrame.groupby("timestamp")["timestamp"].count()
#then plot
timeCountFrame.plot()
plt.xlabel("Time Stamp")
plt.ylabel("Count")
plt.title("Observations over Time For Training Data")

We do see a peak in observations around 2014, although given that there are tens of thousands of observations in this dataset, these peaks do not look too substantial. Thus, I wouldn't worry about a particular time bias unless our test set covers a substantially different time period than our training set.

In [None]:
timeCountFrame = testFrame.groupby("timestamp")["timestamp"].count()
#then plot
timeCountFrame.plot()
plt.xlabel("Time Stamp")
plt.ylabel("Count")
plt.title("Observations over Time For Test Data")

We do see that the test data features time points that are much later than those in our training data. Perhaps we will need to account for some time-varying factors in order to predict for properties at these later dates.
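
The comparison can also be made numerically rather than from the plots. The sketch below assumes the `timestamp` column holds date strings that `pd.to_datetime` can parse. The dates in the toy frames are made up and are not taken from the real data.

```python
import pandas as pd

# Toy stand-ins for trainFrame and testFrame (dates are invented for illustration)
toyTrain = pd.DataFrame({"timestamp": ["2013-05-01", "2014-02-10", "2014-11-30"]})
toyTest = pd.DataFrame({"timestamp": ["2015-07-01", "2016-01-15"]})

# parse the strings into proper datetimes
trainTimes = pd.to_datetime(toyTrain["timestamp"])
testTimes = pd.to_datetime(toyTest["timestamp"])

# compare the covered date ranges
print(trainTimes.min(), trainTimes.max())
print(testTimes.min(), testTimes.max())

# True when every test observation is later than every training observation
noOverlap = testTimes.min() > trainTimes.max()
print(noOverlap)
```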

# Dimensionality Exploration

In order to best wield this large amount of data, we will very likely need to reduce it to a few key components. The dataset isn't so large that an $L_1$- or $L_2$-regularized regression is immediately necessary, so I first want to see how manageable a PCA is on this dataset.
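
One way to judge how manageable a PCA is: check how many components are needed to explain most of the variance. The sketch below does this on synthetic, purely numeric data. The standardization step and the 95% target are my assumptions, not choices made elsewhere in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed columns driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
toyData = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# put all columns on a comparable scale before PCA
scaledData = StandardScaler().fit_transform(toyData)

toyPCA = PCA()
toyPCA.fit(scaledData)

# cumulative share of variance explained by the first k components
cumVar = np.cumsum(toyPCA.explained_variance_ratio_)

# smallest number of components reaching the (assumed) 95% target
numComponents = int(np.searchsorted(cumVar, 0.95) + 1)
print(numComponents)
```

On the real data, a small `numComponents` relative to the number of columns would suggest that PCA is a workable reduction.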

In [None]:
filteredTrainFrame = filteredTrainFrame.drop("timestamp",axis = 1)

In [None]:
priceDocVec = filteredTrainFrame["price_doc"]
filteredTrainFrame = filteredTrainFrame.drop("price_doc",axis = 1)

In [None]:
from sklearn.decomposition import PCA
testPCA = PCA()
testPCA.fit(filteredTrainFrame)

Contrary to my original assumption, the attempted PCA fit shows that we have some categorical (string-valued) data in this dataset. I need to revise the previous sections to account for this.

Need to fix:

* Need to study how to re-encode strings in this dataset.
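
One common option for re-encoding strings is one-hot encoding with `pd.get_dummies`. This is only a sketch of one possible approach on a toy frame with made-up column names. Whether it suits this dataset depends on how many distinct values each string column has.

```python
import pandas as pd

# Toy frame with two string-valued columns (names are illustrative)
toyFrame = pd.DataFrame({
    "area": [40.0, 55.5, 62.0, 48.0],
    "district": ["north", "south", "north", "east"],
    "has_lift": ["yes", "no", "yes", "yes"],
})

# find the non-numeric columns
stringCols = toyFrame.select_dtypes(exclude="number").columns.tolist()

# replace each string column with one 0/1 indicator column per distinct value
encodedFrame = pd.get_dummies(toyFrame, columns=stringCols, dtype=float)
print(encodedFrame.columns.tolist())
```

After this step every column is numeric, so a call like `PCA().fit(encodedFrame)` no longer fails on string values.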