<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">In Context: </span> Exploratory Analysis</h1>



<br><hr id="toc">

### In this lesson...

In this lesson, we'll go through the essential exploratory analysis steps:
1. [Basic information](#basic)
2. [Distributions of numeric features](#numeric)
3. [Distributions of categorical features](#categorical)
4. [Segmentations](#segmentations)
5. [Correlations](#correlations)

<hr>

### First, let's import libraries and load the dataset.

In general, it's good practice to keep all of your library imports at the top of your notebook or program.

Let's import the libraries we'll need for this lesson.

In [2]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd
pd.set_option('display.max_columns', 100)

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline 

# Seaborn for easier visualization
import seaborn as sns

Next, let's import the dataset. 
* Pandas has a <code style="color:steelblue">pd.read_csv()</code> function for importing CSV files into a Pandas DataFrame. 
* You can name the DataFrame variable anything, but we prefer the simple name: <code style="color:steelblue">df</code> (short for DataFrame).

In [3]:
# Load real estate data from CSV
df = pd.read_csv('../../data/real_estate_data.csv')
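To see what <code style="color:steelblue">pd.read_csv()</code> returns without needing the real file, here is a minimal sketch that reads a tiny in-memory CSV. The rows and the name `toy_df` are hypothetical; only the column names are borrowed from this lesson.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = (
    "tx_price,sqft,property_type\n"
    "295850,584,Condo\n"
    "216500,612,House\n"
)

# read_csv accepts a file path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)    # (number of rows, number of columns)
print(toy_df.columns.tolist())
```

The header line becomes the column names, and each remaining line becomes one observation.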

<br id="basic">

# 1. Basic information

First, always look at basic information about the dataset.

#### Display the dimensions of the dataset.

In [None]:
# Dataframe dimensions
df.shape

#### Next, display the data types of our features.

In [None]:
# Column datatypes
df.dtypes

#### Display the first 5 rows to see example observations.

In [None]:
# Display first 5 rows of df
df.head()

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 1.1</span>

Before moving on, let's dig a bit deeper into some of these functions. A little extra practice now will make the rest of the project go more smoothly.
<br>

**First, try to filter df.dtypes to only categorical variables:**

**Tip:** Remember the boolean filtering we've been talking about?

In [None]:
# Filter and display only df.dtypes that are 'object'
df.dtypes[df.dtypes == 'object']

#### Iterate through the categorical feature names and print each name.

In [None]:
# Loop through categorical feature names and print each one
for cat_feature in df.dtypes[df.dtypes == 'object'].index:
    print(cat_feature)

As you'll see later, the ability to select feature names based on some condition (instead of manually typing out each one) will be quite useful.
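As an alternative sketch of the same idea, <code style="color:steelblue">select_dtypes()</code> can pick columns by type directly. The toy data below is hypothetical and only reuses a few column names from this lesson.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy_df = pd.DataFrame({
    "tx_price": [295850, 216500, 279900],
    "sqft": [584, 612, 615],
    "property_type": ["Condo", "House", "Condo"],
    "exterior_walls": ["Brick", "Wood", "Brick"],
})

# Names of the non-numeric (categorical) features
cat_features = toy_df.select_dtypes(exclude="number").columns.tolist()

# Names of the numeric features
num_features = toy_df.select_dtypes(include="number").columns.tolist()

print(cat_features)
print(num_features)
```

Selecting on "not numeric" rather than on a specific string dtype keeps the sketch independent of how a given pandas version stores text columns.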

<br>

**Next**, look at a few more examples by displaying the first 10 rows of data instead of just the first 5.

In [None]:
# Display the first 10 rows of data
df.head(10)

Finally, it's also helpful to look at the last 5 rows of data.
* Sometimes datasets will have **corrupted data** hiding at the very end (depending on the data source).
* It never hurts to double-check.

In [None]:
# Display last 5 rows of data
df.tail()
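One way to double-check the end of a dataset is to count missing values in the last rows. This is a minimal sketch on hypothetical data where the final row is entirely empty; real corruption can take other forms.

```python
import numpy as np
import pandas as pd

# Hypothetical data whose last row is fully missing
toy_df = pd.DataFrame({
    "tx_price": [295850.0, 216500.0, np.nan],
    "sqft": [584.0, 612.0, np.nan],
})

# Number of missing values in each of the last rows
missing_per_row = toy_df.tail().isna().sum(axis=1)

# Rows where every column is missing are worth a closer look
suspect_rows = missing_per_row[missing_per_row == toy_df.shape[1]].index.tolist()
print(suspect_rows)
```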

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>

<br id="numeric">

# 2. Distributions of numeric features

One of the most enlightening data exploration tasks is plotting the distributions of your features.

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 1.2</span>

**Plot the histogram grid, but make it larger, and rotate the x-axis labels clockwise by 45 degrees.**
* <code style="color:steelblue">df.hist()</code> has a <code style="color:steelblue">figsize=</code> argument that takes a tuple for the figure size.
* Try making the figure size 14 x 14.
* <code style="color:steelblue">df.hist()</code> has an <code style="color:steelblue">xrot=</code> argument that rotates x-axis labels **counter-clockwise**, so pass a negative angle to rotate them clockwise.
* The [documentation](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.hist.html) is useful for learning more about the arguments to the <code style="color:steelblue">.hist()</code> function.
* **Tip:** It's ok to arrive at the answer through **trial and error** (this is often easier than memorizing the various arguments).

In [None]:
# Plot histogram grid
df.hist(xrot=-45, figsize=(14, 14))

# Call plt.show() so the notebook does not print the text that df.hist() returns
plt.show()
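Each panel in the histogram grid is a count of observations per bin. The sketch below reproduces those counts for one hypothetical feature with <code style="color:steelblue">np.histogram()</code>; the values and the choice of 4 bins are assumptions for illustration.

```python
import numpy as np

# Hypothetical square-footage values
sqft = np.array([584, 612, 615, 618, 634, 1200, 1450, 2300])

# Split the range into 4 equal-width bins and count values in each
counts, bin_edges = np.histogram(sqft, bins=4)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:7.1f} to {right:7.1f}: {count}")
```

A tall first bin followed by short ones, as here, is the numeric signature of a right-skewed distribution.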

#### Display summary statistics for the numerical features.

In [None]:
# Summarize numerical features
df.describe()

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>

<br id="categorical">

# 3. Distributions of categorical features

Next, let's take a look at the distributions of our categorical features.

<br>
Display summary statistics for categorical features.

In [None]:
# Summarize categorical features
df.describe(include=["object"])

Plot a bar plot for the <code style="color:steelblue">'exterior_walls'</code> feature.

In [None]:
# Bar plot for 'exterior_walls'
sns.countplot(y="exterior_walls", data=df)

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 1.3</span>

**Write a <code style="color:steelblue">for</code> loop to plot bar plots of each of the categorical features.**
* Write the loop to be able to handle any number of categorical features (borrow from your answer to <span style="color:royalblue">Exercise 1.1</span>).
* Invoke <code style="color:steelblue">plt.show()</code> after each bar plot to display all 3 plots in one output.
* Which features suffer from sparse classes?

In [None]:
# Plot bar plot for each categorical feature
for feature in df.dtypes[df.dtypes == "object"].index:
    sns.countplot(y=feature, data=df)
    plt.show()
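Sparse classes, meaning classes with very few observations, can also be listed numerically. This is a minimal sketch on hypothetical data; the cutoff `min_count` is an assumption, and a sensible value depends on the size of the dataset.

```python
import pandas as pd

# Hypothetical categorical feature with two rare classes
toy_df = pd.DataFrame({
    "exterior_walls": ["Brick"] * 6 + ["Wood"] * 5 + ["Stucco"] + ["Other"],
})

# Hypothetical cutoff: classes with fewer observations count as sparse
min_count = 2

# Observations per class (these are the bar lengths in the count plot)
counts = toy_df["exterior_walls"].value_counts()

sparse_classes = sorted(counts[counts < min_count].index)
print(sparse_classes)
```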

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>

<br id="segmentations">

# 4. Segmentations

Next, let's create some segmentations. Segmentations are powerful ways to cut the data to observe the relationship between **categorical features** and **numeric features**.

<br>
Segment <code style="color:steelblue">'tx_price'</code> by <code style="color:steelblue">'property_type'</code> and plot the resulting distributions.

In [None]:
# Segment tx_price by property_type and plot distributions
sns.boxplot(y="property_type", x="tx_price", data=df)

Segment by <code style="color:steelblue">'property_type'</code> and calculate the average value of each feature within each class:

In [None]:
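# Segment by property_type and display the means within each class
# (numeric_only=True skips text columns, which newer pandas versions refuse to average)
df.groupby("property_type").mean(numeric_only=True)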

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 1.4</span>

On average, it looks like single family homes are more expensive.

How else do the different property types differ? Let's see:

<br>

**First, segment <code style="color:steelblue">'sqft'</code> by <code style="color:steelblue">'property_type'</code> and plot the boxplots.**

In [None]:
# Segment sqft by property_type and plot distributions
sns.boxplot(y="property_type", x="sqft", data=df)

<br>

**After producing the plot, consider these questions:**
* Which type of property is larger, on average?
* Which type of property sees greater variance in sizes?
* Does the difference in distributions between classes make intuitive sense?
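The quantities a boxplot draws can also be read as numbers, which helps when answering the questions above. This sketch uses hypothetical data with two made-up property types.

```python
import pandas as pd

# Hypothetical data: two property types with different size spreads
toy_df = pd.DataFrame({
    "property_type": ["Condo"] * 4 + ["House"] * 4,
    "sqft": [600, 700, 800, 900, 1000, 2000, 3000, 4000],
})

# Per-class summary: quartiles (the box) plus mean and standard deviation
summary = toy_df.groupby("property_type")["sqft"].describe()
print(summary[["25%", "50%", "75%", "mean", "std"]])
```

The `50%` column is the median line inside each box, and a larger `std` corresponds to a wider spread.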

<br>

**Next, display the standard deviations of each feature alongside their means after performing a groupby.**
* This will give you a better idea of the variation within each feature, by class.

* **Tip:** Pass a list of metrics into the <code style="color:steelblue">.agg()</code> function, after performing your groupby.

* Check out the [documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once) for more help.

In [None]:
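# Segment by property_type and display the means and standard deviations within each class
# (numeric columns only; string names replace the deprecated np.mean / np.std callables)
df.select_dtypes(include="number").groupby(df["property_type"]).agg(["mean", "std"])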
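Passing a list of metrics to <code style="color:steelblue">.agg()</code> produces two-level column names, one level for the feature and one for the metric. The sketch below shows how to read them, using hypothetical data.

```python
import pandas as pd

# Hypothetical data reusing column names from this lesson
toy_df = pd.DataFrame({
    "property_type": ["Condo", "Condo", "House", "House"],
    "sqft": [600, 800, 2000, 3000],
    "tx_price": [200000, 300000, 400000, 600000],
})

# Mean and standard deviation of each numeric feature, by class
stats = toy_df.groupby("property_type")[["sqft", "tx_price"]].agg(["mean", "std"])
print(stats.columns.tolist())

# Select one (feature, metric) pair with a tuple
sqft_mean = stats[("sqft", "mean")]
print(sqft_mean)
```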

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">
<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>

<br id="correlations">

# 5. Correlations

Finally, let's take a look at the relationships between **numeric features** and **other numeric features**.

<br>
Create a <code style="color:steelblue">correlations</code> dataframe from <code style="color:steelblue">df</code>.

In [None]:
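# Calculate correlations between numeric features
# (select numeric columns first; newer pandas versions raise an error on text columns)
correlations = df.select_dtypes(include="number").corr()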

#### Visualize the correlation grid with a heatmap to make it easier to digest.

In [None]:
# Make the figsize 7 x 6
plt.figure(figsize=(7,6))

# Plot heatmap of correlations
_ = sns.heatmap(correlations)

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 1.5</span>

When plotting a heatmap of correlations, it's often helpful to do four things:
1. Change the background to white. This way, cells whose correlation could not be calculated will show as white.
2. Annotate each cell with its correlation value.
3. Mask the top triangle (less visual noise)
4. Drop the legend (colorbar on the side)

<br>

**First, change the background to white.**
* Seaborn has several different **themes**. The default theme is called <code style="color:crimson">'darkgrid'</code>.

* You can change the theme with <code style="color:steelblue">sns.set_style()</code>.

* You only need to run this once, and the theme will persist until you change it again.

* Change the theme to <code style="color:crimson">'white'</code>

* Make the figure size 10 x 8

In [None]:
# Change color scheme
sns.set_style("white")
# Make the figsize 10 x 8
plt.figure(figsize=(10,8))

# Plot heatmap of correlations
_ = sns.heatmap(correlations)

See how the cells for <code style="color:steelblue">'basement'</code> are now white? That's what we want, because their correlations could not be calculated.
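One common reason a correlation cannot be calculated is a feature that never varies. Whether that applies to <code style="color:steelblue">'basement'</code> is an assumption here, not something verified against the dataset. The sketch below shows the effect on hypothetical data.

```python
import pandas as pd

# Hypothetical data with one zero-variance column
toy_df = pd.DataFrame({
    "sqft": [584, 612, 615, 618],
    "tx_price": [295850, 216500, 279900, 379900],
    "constant_feature": [1, 1, 1, 1],
})

toy_corr = toy_df.corr()

# Every correlation involving the constant column is NaN (missing)
print(toy_corr["constant_feature"])
```

The heatmap leaves missing cells unpainted, so they take on the background color of the current theme.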

<br>

**Next, display the correlation values in each cell.**

* The <code style="color:steelblue">annot=</code> argument controls whether to annotate each cell with its value. By default, it's <code style="color:crimson">False</code>.
* To make the chart cleaner, multiply the <code style="color:steelblue">correlations</code> DataFrame by 100 before passing it to the heatmap function.
* Pass in the argument <code style="color:steelblue">fmt=<span style="color:crimson">'.0f'</span></code> to format the annotations to a whole number.

In [None]:
# Make the figsize 10 x 8
plt.figure(figsize=(10, 8))

# Plot heatmap of annotated correlations
# Note: run this cell only once; re-running it multiplies the values by 100 again
correlations = correlations * 100
sns.heatmap(correlations, annot=True, fmt='.0f')
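To see why the values are multiplied by 100 first, this small sketch applies the same <code style="color:crimson">'.0f'</code> format to a hypothetical correlation value with and without scaling.

```python
# Hypothetical correlation value
corr_value = 0.4567

# Without scaling, a whole-number format rounds almost everything to 0 or 1
unscaled_label = format(corr_value, ".0f")

# After scaling, the same format keeps two digits of information
scaled_label = format(corr_value * 100, ".0f")

print(unscaled_label, scaled_label)
```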

#### Next, we'll generate a mask for the top triangle. Run this code:

In [None]:
# Generate a mask for the upper triangle
mask = np.zeros_like(correlations, dtype=bool)
mask[np.triu_indices_from(mask)] = True
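To make the mask concrete, here is the same construction on a hypothetical 3 x 3 correlation matrix. The name `toy_corr` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical symmetric correlation matrix
toy_corr = np.array([
    [1.0, 0.5, 0.2],
    [0.5, 1.0, -0.3],
    [0.2, -0.3, 1.0],
])

# True marks the cells to hide: the diagonal and everything above it
mask = np.zeros_like(toy_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
print(mask)

# The cells that stay visible are the ones below the diagonal
visible = toy_corr[~mask]
print(visible)
```

Because a correlation matrix is symmetric, the hidden cells repeat information that is still shown in the lower triangle.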

<br>

**Plot the heatmap again, this time using that mask.**

* <code style="color:steelblue">sns.heatmap()</code> has a <code style="color:steelblue">mask=</code> argument.
* Keep all of the other styling changes you've made up to now.

In [None]:
# Make the figsize 10 x 8
plt.figure(figsize=(10, 8))

# Plot heatmap of correlations
sns.heatmap(correlations, annot=True, fmt='.0f', mask=mask)

<br>

**Finally, remove the colorbar on the side.**

* <code style="color:steelblue">sns.heatmap()</code> has a <code style="color:steelblue">cbar=</code> argument. By default, it's <code style="color:crimson">True</code>.
* Keep all of the other styling changes you've made up to now.
* Change the figure size to 9 x 8, though. Since we're removing the colorbar, this helps keep nice proportions.

In [None]:
# Make the figsize 9 x 8
plt.figure(figsize=(9, 8))

# Plot heatmap of correlations
sns.heatmap(correlations, annot=True, fmt='.0f', mask=mask, cbar=False)

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">
<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>