Module 01: Exploratory Analysis
===============================

Let’s begin our project by first getting to know the dataset. Our
initial analysis will allow us to start planning for our next steps in
data cleaning and feature engineering. This step should be quick, but
thorough enough for us to gain a basic intuition for addressing the
problem at hand.

Basic Information
-----------------

Here is an overview of our objectives for this phase:

-   Import data as a pandas dataframe
-   Review data shape (observations, features)
-   Review the data types to determine categorical vs numerical features
-   Reference the data dictionary and verify that no feature was
    imported with an incorrect data type
-   Perform a basic analysis of the dataset to get a qualitative feel
    for it

Let’s start by importing our libraries and making our initial
configurations.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# show up to 100 columns when displaying a dataframe
pd.set_option('display.max_columns', 100)

# display plots in the notebook
%matplotlib inline

Next, let’s load our data set.

In [None]:
# load the real estate data from the dataset folder
df = pd.read_csv('../dataset/real_estate_data.csv')

Now that our data set has been loaded, let’s get an initial
understanding of what we’re working with.

In [None]:
# the dataframe shape tells us the number of observations and features available
df.shape

In [None]:
# display the data type of each feature, sorted by feature name
df.dtypes.sort_index()

In [None]:
# determine which features are categorical
df.dtypes[df.dtypes == 'object']
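
The data dictionary check from our objectives can also be automated. The following is a minimal sketch, not part of the original notebook: the `sample` dataframe, its values, the `expected_kinds` mapping, and the `find_dtype_mismatches` helper are all hypothetical stand-ins for the real data and its data dictionary.

```python
import pandas as pd

# Toy stand-in for the real estate data; the values are made up for illustration.
sample = pd.DataFrame({
    'tx_price': [295850, 216500, 279900],
    'beds': ['1', '1', '2'],  # numbers accidentally stored as strings
    'property_type': ['Other', 'Single-Family', 'Single-Family'],
})

# Hypothetical data dictionary: feature name -> expected kind of feature.
expected_kinds = {
    'tx_price': 'numeric',
    'beds': 'numeric',
    'property_type': 'categorical',
}

def find_dtype_mismatches(frame, expected):
    """Return the features whose imported dtype disagrees with the data dictionary."""
    mismatches = {}
    for column, kind in expected.items():
        is_numeric = pd.api.types.is_numeric_dtype(frame[column])
        actual = 'numeric' if is_numeric else 'categorical'
        if actual != kind:
            mismatches[column] = actual
    return mismatches

mismatches = find_dtype_mismatches(sample, expected_kinds)
print(mismatches)
```

Any feature reported here is a candidate for conversion during data cleaning.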

In [None]:
# display the first five observations
df.head(5)

In [None]:
# display the last five observations
df.tail()

Numeric Distribution
--------------------

After completing our basic observations, let’s review our numerical data
for the following:

-   Unexpected distributions (e.g. a maximum far above the typical range)
-   Values outside their valid range (e.g. a percentage above 100%)
-   Sparse data
-   Measurement errors

In [None]:
# plot a histogram for every numeric feature
df.hist(figsize=(14, 14), xrot=-45)

# calling plt.show() suppresses the text output of the returned axes array
plt.show()

While building a visual provides a quick interpretation of the data, it
lacks the detail necessary for a more in-depth analysis.

In [None]:
# display summary statistics, such as mean, std, and quartiles
df.describe()

In [None]:
# we can also request summary statistics for a single feature
df.basement.describe()
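
Summary statistics make out-of-range values visible, but the check can also be made explicit. Here is a rough sketch, assuming we have plausible bounds for each feature: the `toy` dataframe, the column names `beds` and `year_built`, and the `bounds` mapping are assumptions for illustration and are not taken from the data dictionary.

```python
import pandas as pd

# Made-up sample; the bounds below are assumed, not taken from the data dictionary.
toy = pd.DataFrame({
    'beds': [1, 3, 2, 40],
    'year_built': [1995, 2004, 1890, 2010],
})
bounds = {'beds': (0, 10), 'year_built': (1800, 2025)}

# Count the observations that fall outside each feature's plausible range.
out_of_range = {
    column: int((~toy[column].between(low, high)).sum())
    for column, (low, high) in bounds.items()
}
print(out_of_range)
```

A non-zero count points to a possible measurement error worth inspecting by hand.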

At a quick glance, our numerical data appears to make sense. At this
point, we can consider features that could potentially be replaced by
booleans, such as the basement feature.
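
One way to test whether a feature is a candidate for a boolean is to count its distinct non-missing values. The sketch below assumes, purely for illustration, that `basement` is recorded as 1.0 when present and left missing otherwise; the actual encoding should be confirmed against the data.

```python
import numpy as np
import pandas as pd

# Made-up values: assume basement is 1.0 when present and NaN otherwise.
toy = pd.DataFrame({'basement': [1.0, np.nan, 1.0, np.nan, 1.0]})

# A feature with one distinct non-missing value carries no more information than a flag.
distinct_values = toy['basement'].dropna().unique()
is_indicator_like = len(distinct_values) == 1

# Convert to a boolean: True where a value was recorded, False where it was missing.
has_basement = toy['basement'].notna()
print(is_indicator_like, has_basement.tolist())
```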

Categorical Distribution
------------------------

Let’s review the categorical data for the following:

-   Observe class frequency
-   Account for sparse data that could be combined or reassigned

In [None]:
# filter our observations by object type to provide descriptions
df.describe(include=['object'])

From our object type descriptions, we see that some features contain
many unique classes. We can build a seaborn countplot visual to
better understand their distribution.

In [None]:
# plot the number of observations in each exterior_walls class
sns.countplot(y='exterior_walls', data=df)

In [None]:
# draw a separate count plot for each categorical feature
for feature in df.dtypes[df.dtypes == 'object'].index:
    sns.countplot(y=feature, data=df)
    plt.show()  # finish the current figure so the plots do not overlap

At this point, we should have started to think of features we can
consider consolidating.
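
Sparse classes can also be identified numerically rather than by eye. This is a minimal sketch under stated assumptions: the class labels in `walls` are made up, and the 15% `threshold` is an arbitrary cutoff chosen for the example, not a recommendation.

```python
import pandas as pd

# Made-up class labels standing in for a feature such as exterior_walls.
walls = pd.Series(['Brick'] * 6 + ['Siding'] * 3 + ['Stucco'], name='exterior_walls')

# Relative frequency of each class, most common first.
frequency = walls.value_counts(normalize=True)

# Classes below the (arbitrary) threshold are candidates for consolidation.
threshold = 0.15
sparse_classes = frequency[frequency < threshold].index.tolist()
print(sparse_classes)
```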

Segmentations
-------------

Review segmentation to observe the relationship between categorical and
numeric features.

-   Build a boxplot to segment the target variable (tx\_price) by key
    categorical features.

In [None]:
# use boxplot for a visual interpretation
sns.boxplot(y='property_type', x='tx_price', data=df)

In [None]:
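# segment by property_type and get the mean of each numeric feature
# (numeric_only=True skips text columns, which raise an error in pandas >= 2.0)
df.groupby('property_type').mean(numeric_only=True)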

In [None]:
# segment by property_type and get both mean and std for each numeric feature
numeric_features = df.select_dtypes(include='number').columns
df.groupby('property_type')[numeric_features].agg(['mean', 'std'])

When comparing our features, we should consider the following questions:

-   On average, which type of property is larger?

-   Which type of property has larger lots?

-   Which type of property is located in areas with more nightlife,
    restaurants, and grocery stores?

-   Are there any relationships that make intuitive sense?
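
Questions like these can be answered directly from the class means. The following sketch uses a made-up `toy` dataframe, and the column names `sqft`, `lot_size`, and `restaurants` are assumptions about the feature names; substitute the names from the data dictionary.

```python
import pandas as pd

# Made-up observations; column names are assumed for illustration.
toy = pd.DataFrame({
    'property_type': ['Single-Family', 'Single-Family', 'Other', 'Other'],
    'sqft': [2400, 2000, 900, 1100],
    'lot_size': [9000, 7000, 0, 0],
    'restaurants': [10, 20, 50, 70],
})

# Mean of each selected feature within each property type.
class_means = toy.groupby('property_type')[['sqft', 'lot_size', 'restaurants']].mean()

# For each feature, the property type with the highest mean.
largest_by_feature = class_means.idxmax()
print(largest_by_feature.to_dict())
```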

Correlation
-----------

Measure the correlations between numeric features:

-   Search for strong correlations with the target variable

To start, we can assess our correlations against our target variable.

In [None]:
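# calculate correlations between numeric features
# (numeric_only=True skips text columns, which raise an error in pandas >= 2.0)
correlations = df.corr(numeric_only=True)

# list correlations with the target, from most negative to most positive
correlations.tx_price.sort_values(ascending=True)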

Knowing that we have different property types, we can build correlations
for each type for further analysis.

In [None]:
# correlations for every property type other than single-family
apt_correlation = df[df.property_type != 'Single-Family'].corr(numeric_only=True)
apt_correlation.tx_price.sort_values(ascending=False)

In [None]:
# correlations for single-family properties only
single_correlation = df[df.property_type == 'Single-Family'].corr(numeric_only=True)
single_correlation.tx_price.sort_values(ascending=False)
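
Placing the per-type correlations side by side can make differences easier to spot. This is only a sketch on a made-up `toy` dataframe; the `sqft` and `lot_size` column names and all values are assumptions chosen so that the two property types behave differently.

```python
import pandas as pd

# Made-up observations; column names and values are for illustration only.
toy = pd.DataFrame({
    'property_type': ['Single-Family'] * 4 + ['Other'] * 4,
    'tx_price': [200, 300, 400, 500, 150, 250, 350, 450],
    'sqft': [1000, 1500, 2000, 2500, 500, 700, 900, 1100],
    'lot_size': [4000, 3000, 2000, 1000, 10, 20, 30, 40],
})

# Correlate only the numeric features, grouped by property type.
numeric = toy.select_dtypes(include='number')
by_type = {
    label: group.corr()['tx_price'].drop('tx_price')
    for label, group in numeric.groupby(toy['property_type'])
}

# One column per property type, one row per feature.
comparison = pd.DataFrame(by_type)
print(comparison.round(2))
```

A feature whose sign or strength differs sharply between columns may deserve separate treatment for each property type.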

We can also create a Seaborn heatmap to better visualize the
correlations.

In [None]:
# create pyplot figure
plt.figure(figsize=(10, 8))

# generate mask to create triangle figure
# (np.bool was removed in NumPy 1.24, so use the built-in bool)
mask = np.zeros_like(correlations, dtype=bool)  # 2D boolean ndarray, same shape as correlations

mask[np.triu_indices_from(mask)] = True  # set upper triangle indices (and diagonal) to True

# plot heatmap as triangle
sns.heatmap(correlations * 100, annot=True, fmt='.0f', mask=mask)
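
The heatmap can be complemented by a ranked list of feature pairs. The sketch below keeps the strict lower triangle of the correlation matrix, mirroring the mask above, and sorts the pairs by absolute correlation; the `toy` dataframe and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up observations for illustration.
toy = pd.DataFrame({
    'tx_price': [200, 300, 400, 500],
    'sqft': [1000, 1500, 2000, 2500],
    'lot_size': [4000, 1000, 3000, 2000],
})
corr = toy.corr()

# Keep the strict lower triangle so each pair appears once and the diagonal is dropped.
lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(lower).stack().dropna()

# Order the pairs by absolute correlation, strongest first.
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(strongest.round(2))
```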

Next Module
-----------

[02. Data Cleaning](module02.md)