#### Exploratory Data Analysis - Why are employees leaving
We will use Python tools to explore a data set on employee attrition and try to answer one analytics question: **Why are employees leaving?**
Data can arrive in many formats: Excel sheets, CSV (Comma Separated Values) files, HDF5 files, and so on.
<br>Before we jump into the analysis itself, we first need to understand the motivation behind EDA. EDA is typically done with the following objectives:
- To understand the distribution of the data
    - For numerical data, we try to identify the underlying distribution and compute summary metrics such as mean, median, mode, and standard deviation.
    - For categorical data, we tabulate the frequency of each category.
- To identify relationships between the variables

Generally, a data analysis exercise tries to predict a particular variable. In this exercise we will try to predict employee churn based on the available variables. More on this later; first, let us start by understanding our data.
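
The objectives above can be sketched on a small, made-up table before we touch the real file. The values below (including the `left` flags) are hypothetical and only borrow column names from the employee data set; the calls themselves (`describe`, `median`, `value_counts`, `corr`, `crosstab`) are standard Pandas functions.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real data set.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41],
    "average_montly_hours": [157, 262, 272, 223, 159, 153],
    "salary": ["low", "medium", "medium", "low", "low", "low"],
    "left": [1, 1, 1, 0, 0, 1],
})

# Numerical data: count, mean, standard deviation, quartiles, min and max.
numeric_summary = toy[["satisfaction_level", "average_montly_hours"]].describe()
median_hours = toy["average_montly_hours"].median()

# Categorical data: tabulate how often each category occurs.
salary_counts = toy["salary"].value_counts()

# Relationships: pairwise correlation between numerical columns,
# and a cross-tabulation of a categorical column against the target.
correlations = toy[["satisfaction_level", "average_montly_hours", "left"]].corr()
salary_vs_left = pd.crosstab(toy["salary"], toy["left"])

print(numeric_summary)
print(salary_counts)
print(correlations)
print(salary_vs_left)
```

With only six rows the numbers mean nothing; the point is which function serves which objective.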

We will be doing this in Python. The main library we will use is:
- **Pandas**: one of the most widely used Python libraries for data analysis



In [3]:
import pandas as pd

In [4]:
employee_data = pd.read_csv("Why are employees leaving.csv")

CSV is a comma-delimited format and one of the most common formats for data tables. Pandas offers similar functions for reading other formats:
- **read_json**: Reads JSON (JavaScript Object Notation), a text format widely used for web-based data exchange. Its structure closely resembles Python dictionaries and lists.
- **read_html**: Reads tables from HTML (HyperText Markup Language), the markup used to render web pages, and returns them as a list of DataFrames.
- **read_sql**: Reads the result of a SQL query or a database table into a DataFrame. Writing to a database is done with the DataFrame method **to_sql**.
- **read_pickle**: Reads pickle files, a binary format that stores Python objects (such as DataFrames) exactly as they were saved.

We will mostly be working with CSV files in this article.
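
As a rough illustration of the readers listed above, the sketch below loads the same two-row table from CSV text, JSON text, an in-memory SQLite database, and a pickle file. The table contents, the `employees` table name, and the `employees.pkl` file name are made up for this example; `read_html` is left out because it needs an extra HTML parser package to be installed.

```python
import io
import os
import sqlite3
import tempfile

import pandas as pd

# CSV: read from an in-memory text buffer instead of a file on disk.
csv_text = "ID,Division,salary\n1,sales,low\n2,IT,medium\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records, one dictionary per row.
json_text = (
    '[{"ID": 1, "Division": "sales", "salary": "low"},'
    ' {"ID": 2, "Division": "IT", "salary": "medium"}]'
)
from_json = pd.read_json(io.StringIO(json_text), orient="records")

# SQL: write the table to a throwaway SQLite database, then query it back.
conn = sqlite3.connect(":memory:")
from_csv.to_sql("employees", conn, index=False)
from_sql = pd.read_sql("SELECT * FROM employees", conn)
conn.close()

# Pickle: save the DataFrame to a temporary file and load it again.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "employees.pkl")
    from_csv.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

print(from_csv)
print(from_json)
print(from_sql)
print(from_pickle)
```

Whatever the source, each reader returns a DataFrame, so the rest of the analysis does not depend on the original file format.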

#### First steps in our EDA:
The first step in any data analysis is to understand the dimensions of the data and what it contains. We will therefore use the following DataFrame attributes and methods:
- shape: the number of rows and columns in the data
- head: the first few rows of the data
- columns: the names of the columns

In [6]:
employee_data.shape
# Our data has 14,999 rows and 11 columns; the shape of every DataFrame is reported as (rows, columns)

(14999, 11)

In [8]:
# Let us look at what these columns are
employee_data.columns
# So the data contains ID, satisfaction_level, and so on

Index(['ID', 'satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Division', 'salary'],
      dtype='object')

In [9]:
# Let us look at the first few rows of the data set
employee_data.head()
# Out of the 11 columns, 2 look categorical (Division and salary); the rest seem to be numerical

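| | ID | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Division | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 2 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 4 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 5 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |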


We made a cardinal mistake in the code snippet above: based on the head of the data alone, we decided that 2 columns are categorical and the rest are numerical. This might not be true, as some columns could contain mixed value types further down. You can test the assumption by also looking at the tail with ```employee_data.tail()```.
<br>Alternatively, we can look at more rows with ```employee_data.head(20)```, which shows the first 20 rows; called without an argument, head shows 5 rows, as above.
<br>One more interesting thing here is the row numbers on the extreme left. These are known as row indices, and they will form an important part of our data analysis in Python. We will come back to them when required.
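
Looking at more rows still relies on eyeballing, so a more systematic check is to ask Pandas which type it stored for each column. The sketch below uses a small hypothetical sample (not the real file) in which `number_project` deliberately contains one stray text value; it shows one possible way to list column types, flag mixed columns, and inspect the row index.

```python
import pandas as pd

# Hypothetical sample; "number_project" deliberately mixes numbers and text.
sample = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "number_project": [2, 5, "seven"],
    "salary": ["low", "medium", "medium"],
})

# The storage type Pandas chose for each column.
print(sample.dtypes)

# Columns that Pandas treats as numerical.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

# Count the distinct Python types inside each column;
# more than one type means the column holds mixed values.
type_counts = {col: sample[col].map(type).nunique() for col in sample.columns}
mixed_columns = [col for col, n_types in type_counts.items() if n_types > 1]

# Convert the mixed column to numbers; values that cannot be parsed become NaN.
projects_as_numbers = pd.to_numeric(sample["number_project"], errors="coerce")

# The row indices shown on the extreme left are available as an attribute.
row_labels = list(sample.index)

print(numeric_columns)
print(mixed_columns)
print(projects_as_numbers)
print(row_labels)
```

On the real data, the same checks would be run on `employee_data` instead of `sample`.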

In [11]:
employee_data.head(20)

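| | ID | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Division | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 2 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 4 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 5 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
| 5 | 6 | 0.41 | 0.5 | 2 | 153 | 3 | 0 | 1 | 0 | sales | low |
| 6 | 7 | 0.1 | 0.77 | 6 | 247 | 4 | 0 | 1 | 0 | sales | low |
| 7 | 8 | 0.92 | 0.85 | 5 | 259 | 5 | 0 | 1 | 0 | sales | low |
| 8 | 9 | 0.89 | 1.0 | 5 | 224 | 5 | 0 | 1 | 0 | sales | low |
| 9 | 10 | 0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | 0 | sales | low |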


In [18]:
employee_data['Division'].value_counts().sort_index()

IT             1227
RandD           787
accounting      767
hr              739
management      630
marketing       858
product_mng     902
sales          4140
support        2229
technical      2720
Name: Division, dtype: int64