<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Your Turn: </span> Exploratory Analysis</h1>


<br><hr id="toc">

### In this module...

In this module, we'll go through the essential exploratory analysis steps:
1. [Basic information](#basic)
2. [Distributions of numeric features](#numeric)
3. [Distributions of categorical features](#categorical)
4. [Segmentations](#segmentations)
5. [Advanced segmentations](#advanced-segmentations) 


This time, however, you'll be in the driver's seat.
<br><hr>

### First, let's import libraries and load the dataset.

In [None]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd

# Show up to 100 columns (the notebook adds a scroll bar)
# instead of truncating wide dataframes with "..."
pd.set_option('display.max_columns', 100)

# Matplotlib for visualization
from matplotlib import pyplot as plt

# Seaborn for easier visualization
import seaborn as sns

Next, let's import the dataset.
* The file path is <code style="color:crimson">'../../data/employee_data.csv'</code>

In [None]:
import pandas as pd

In [None]:
# Load employee data from CSV
df = pd.read_csv('../../data/employee_data.csv')

Now we're ready to jump into exploring the data!

<span id="basic"></span>
# 1. Basic information

Let's begin by displaying the dataset's basic information.

<br>

**First, display the <span style="color:royalblue">dimensions</span> (a.k.a. shape) of the dataset.**

In [None]:
# Dataframe dimensions

**Next, display the <span style="color:royalblue">datatypes</span> of the features.**
* Which are the **numeric** features?
* Which are the **categorical** features?
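
As a rough sketch of how to read this output, the snippet below uses a made-up toy DataFrame (the `toy` name and its columns are assumptions for illustration, not the employee data). Numeric columns typically show up as `int64` or `float64`, while text columns show up as `object` or a string dtype, depending on your pandas version.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the employee dataset)
toy = pd.DataFrame({
    'tenure': [2, 5, 3],
    'satisfaction': [0.8, 0.4, 0.6],
    'department': ['sales', 'IT', 'sales'],
})

# Dimensions as (number of rows, number of columns)
print(toy.shape)

# One datatype per column
print(toy.dtypes)

# Split column names into numeric and non-numeric (categorical) groups
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols)
print(categorical_cols)
```

The same attributes and methods should work on `df` once the employee data is loaded.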

In [None]:
# Column datatypes

**Next, display the first 10 <span style="color:royalblue">example observations</span> from the dataset.**
* Remember, the purpose is not to perform rigorous analysis. 
* Instead, it's to get a **qualitative "feel"** for the dataset.

In [None]:
# First 10 rows of data

**Finally, display the last 10 rows of data to check for any signs of <span style="color:royalblue">corrupted data</span>.**
* Corrupted data will usually appear as a bunch of gibberish. It will be obvious.
* Most of the time, you won't have corrupted data... but this is still a quick and easy check.

In [None]:
# Last 10 rows of data

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>

<span id="numeric"></span>
# 2. Distributions of numeric features

One of the most enlightening data exploration tasks is plotting the distributions of your features.

<br>

**First, plot the Pandas <span style="color:royalblue">histogram grid</span> for all the numeric features.** 

Feel free to mess around with the settings and formatting, but here are the settings we used for reference:
* We made the figure size 10x10
* We also rotated x-labels by -45 degrees
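
For reference, here is a minimal sketch of those two settings on randomly generated toy data (the `toy` DataFrame and its column names are assumptions for illustration, not the employee data). `DataFrame.hist()` draws one histogram per numeric column and skips non-numeric ones.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical toy data, for illustration only (not the employee dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'satisfaction': rng.uniform(0, 1, size=200),
    'avg_monthly_hrs': rng.normal(200, 40, size=200),
    'department': rng.choice(['sales', 'IT'], size=200),
})

# 10x10 figure, x-axis tick labels rotated by -45 degrees
axes = toy.hist(figsize=(10, 10), xrot=-45)
plt.show()
```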

In [None]:
# Plot histogram grid

**Next, display formal <span style="color:royalblue">summary statistics</span> for the numeric features.**

In [None]:
# Summarize numerical features

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>

<span id="categorical"></span>
# 3. Distributions of categorical features

Next, let's take a look at the distributions of our categorical features.

<br>

**First, display the <span style="color:royalblue">summary statistics</span> for categorical features in the dataset.**
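
By default, `describe()` summarizes only numeric columns when any are present, so categorical columns need to be requested explicitly. The sketch below shows one way to do that on a made-up toy DataFrame (the `toy` name and its columns are assumptions, not the employee data). Selecting with `exclude='number'` is just one option; `include=['object']` is a common alternative when text columns are stored as `object`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the employee dataset)
toy = pd.DataFrame({
    'tenure': [2, 5, 3, 4],
    'department': ['sales', 'IT', 'sales', 'support'],
    'salary': ['low', 'high', 'low', 'medium'],
})

# Summarize only the non-numeric columns:
# count, number of unique classes, most frequent class, and its frequency
summary = toy.describe(exclude='number')
print(summary)
```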

In [None]:
# Summarize categorical features

**Using a loop, display <span style="color:royalblue">bar plots</span> for each of the categorical features.**
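
One possible shape for such a loop is sketched below on a made-up toy DataFrame (the `toy` name and its columns are assumptions, not the employee data). This sketch uses pandas' built-in plotting on top of `value_counts()`; Seaborn's `sns.countplot()` is an alternative that counts the classes for you.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical toy data, for illustration only (not the employee dataset)
toy = pd.DataFrame({
    'tenure': [2, 5, 3, 4],
    'department': ['sales', 'IT', 'sales', 'support'],
    'salary': ['low', 'high', 'low', 'medium'],
})

counts = {}
# Loop over the non-numeric columns only
for col in toy.select_dtypes(exclude='number').columns:
    # Number of observations in each class of this feature
    counts[col] = toy[col].value_counts()

    # One horizontal bar plot per feature, each on its own figure
    fig, ax = plt.subplots()
    counts[col].plot(kind='barh', ax=ax, title=col)
    plt.show()
    plt.close(fig)
```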

In [None]:
# Plot bar plot for each categorical feature

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>

<span id="segmentations"></span>
# 4. Segmentations

Next, let's create some segmentations. Segmentations are powerful ways to cut the data to observe the relationship between categorical features and numeric features.

The code is provided in this section because we didn't cover it in our lesson. If everything was done properly up to this point, you should be able to execute the code below as-is to display the charts.

**First, display a <span style="color:royalblue">violin plot</span> with <code style="color:steelblue">'status'</code> on the y-axis and <code style="color:steelblue">'satisfaction'</code> on the x-axis.**

In [None]:
# Segment satisfaction by status and plot distributions
sns.violinplot(y="status", x="satisfaction", data=df)

**Next, display a violin plot that segments <code style="color:steelblue">'last_evaluation'</code> by <code style="color:steelblue">'status'</code>.**

In [None]:
# Segment last_evaluation by status and plot distributions
sns.violinplot(y="status", x="last_evaluation", data=df)

**<span style="color:royalblue">Group by</span> <code style="color:steelblue">'status'</code> and calculate the average value of each feature within each class.**
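
To make the idea concrete, here is a small sketch on a made-up toy DataFrame (the `toy` name, its columns, and the `'Employed'` class label are assumptions for illustration). Averages only make sense for numeric features, so the sketch passes `numeric_only=True`; recent pandas versions raise an error if you try to average text columns.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the employee dataset)
toy = pd.DataFrame({
    'status': ['Left', 'Employed', 'Left', 'Employed'],
    'department': ['sales', 'IT', 'IT', 'sales'],
    'satisfaction': [0.2, 0.9, 0.4, 0.7],
    'tenure': [3, 2, 5, 4],
})

# One row per class of 'status', one column per numeric feature
means = toy.groupby('status').mean(numeric_only=True)
print(means)
```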

In [None]:
# Segment by status and display the means within each class
# (numeric_only=True skips categorical columns, which cannot be averaged)
df.groupby('status').mean(numeric_only=True)

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>

<span id="advanced-segmentations"></span>
# 5. Advanced segmentations

Because the target variable is categorical, it can often be helpful to expand your segmentation analysis. 

<br>

**Now, we'll see how to do bivariate segmentations, which can be produced with the <code style="color:steelblue">sns.lmplot()</code> function from the Seaborn library.**
* <code style="color:steelblue">sns.lmplot()</code> draws a **scatterplot** with an optional fitted regression line; passing <code style="color:steelblue">fit_reg=False</code> leaves just the scatterplot.
* For example, we can color each point based on its <code style="color:steelblue">'status'</code>.
* To do so, we'll use the <code style="color:steelblue">hue=</code> argument.

In [None]:
# Scatterplot of satisfaction vs. last_evaluation
sns.lmplot(y="satisfaction", x="last_evaluation", data=df, hue='status', fit_reg=False)

**Plot another scatterplot of <code style="color:steelblue">'satisfaction'</code> and <code style="color:steelblue">'last_evaluation'</code>, but only for employees who have <code style="color:crimson">'Left'</code>.**
* **Hint:** Do you still need the <code style="color:steelblue">hue=</code> argument?
* **Hint:** How might you change the <code style="color:steelblue">data=df</code> argument?
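
The second hint comes down to boolean filtering. As a hedged sketch on a made-up toy DataFrame (the `toy` name, its values, and the `'Employed'` class label are assumptions for illustration), comparing a column to a value gives a True/False mask, and indexing with that mask keeps only the matching rows.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the employee dataset)
toy = pd.DataFrame({
    'status': ['Left', 'Employed', 'Left', 'Employed'],
    'satisfaction': [0.2, 0.9, 0.4, 0.7],
    'last_evaluation': [0.8, 0.6, 0.9, 0.5],
})

# True for rows where status is 'Left', False otherwise
mask = toy['status'] == 'Left'

# Keep only the rows where the mask is True
left_only = toy[mask]
print(left_only)
```

A filtered DataFrame like `left_only` can be passed anywhere a full DataFrame is expected, including a plotting function's `data=` argument.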

In [None]:
# Scatterplot of satisfaction vs. last_evaluation, only those who have left
sns.lmplot(y="satisfaction", x="last_evaluation", data=df[df.status == 'Left'], hue='status', fit_reg=False)

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
   
</div>

<br>

Congratulations on making it through Exploratory Analysis!

As a reminder, here are a few things you did in this module:
* You explored basic information about your dataset.
* You plotted distributions of numeric and categorical features.
* You segmented your dataset by <code style="color:steelblue">'status'</code>.
* And you dove into some advanced, bivariate segmentations.

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
    
[Back to Contents](#toc)
</div>