# Exploratory Data Analysis (EDA)
Learn the basics of Exploratory Data Analysis using Python with NumPy, Pandas, Matplotlib and Seaborn.

**Exploratory Data Analysis** is an approach to analysing data sets to summarise their main characteristics, often with visual methods.

In this practical, we will learn how to perform the following:
1. Structured Based
2. Univariate Analysis
3. Bivariate Analysis
4. Multivariate Analysis
5. Frequency Distribution

## Read the dataset

## Structured Based
The first part of EDA is to evaluate the dataset for structure, columns and data types. Below are the Pandas methods and attributes that give us a quick understanding of our dataset:
* **head() and tail()**: display the first 5 and last 5 observations of your dataset (5 is the default)
* **shape**: display the number of observations (rows) and variables (columns), in that order
* **dtypes**: display the variable names and their data types
* **count()**: counts the number of non-missing values for each variable
* **describe()**: display the characteristics of the numerical variables - count (number of non-missing values), mean, standard deviation, and the five-number summary, which includes the minimum, first quartile, second quartile (median), third quartile and maximum
* **info()**: display the meta-data of the dataset as a summary of the dataframe, including the column data types, non-missing counts, number of rows, etc.
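
As a rough illustration of the calls listed above, here is a minimal sketch on a tiny hand-made dataframe. The values are invented for the example and are not taken from the practical's dataset; only the column names echo ones mentioned in this notebook.

```python
import pandas as pd

# Hypothetical toy data (invented values), used only to show what each call returns.
df = pd.DataFrame({
    "age": [58, 44, 33, 47, 35, 28],
    "job": ["management", "technician", "blue-collar", "blue-collar", "services", None],
    "salary": [100000, 60000, 20000, 20000, 70000, 50000],
    "response": ["no", "no", "yes", "no", "yes", "no"],
})

print(df.head())      # first 5 rows (the default)
print(df.tail())      # last 5 rows
print(df.shape)       # tuple of (rows, columns)
print(df.dtypes)      # data type of each column
print(df.count())     # non-missing values per column
print(df.describe())  # summary statistics of the numeric columns
df.info()             # prints column types, non-null counts and memory usage
```

Note that `shape` and `dtypes` are attributes, so they are written without parentheses.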

In [None]:
# display first 5 records


# display last 5 records


# display shape of df


# display variable names and data types


# count number of non-empty values


# display the characteristics of the numerical variables


# display meta-data of dataset


## Univariate Analysis
Univariate analysis is analysing data over a single variable/column from a dataset.

### Categorical Unordered Univariate Analysis
An unordered variable is a categorical variable that has no defined order. In our dataset, the **job** column is divided into many sub-categories like technician, blue-collar, services, management, etc. 

Another useful method is **value_counts()**, which returns the count of each category in a categorical series.

Let's analyse the job category using a bar plot.
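
One possible way to do this, sketched on an invented list of job labels rather than the real **job** column: `value_counts(normalize=True)` returns proportions, which can be scaled to percentages and passed to a bar plot. The variable names `jobs` and `job_pct` are placeholders chosen for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical job labels (invented), standing in for the job column.
jobs = pd.Series(
    ["blue-collar"] * 4 + ["management"] * 3 + ["technician"] * 2 + ["services"],
    name="job",
)

# normalize=True gives proportions; multiply by 100 for percentages.
job_pct = jobs.value_counts(normalize=True) * 100
print(job_pct)

# Bar plot of the percentage of each category, largest first.
ax = job_pct.plot.bar()
ax.set_xlabel("job")
ax.set_ylabel("percentage of records")
plt.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a plain script, a call to `plt.show()` would be needed.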

In [None]:
# Let's calculate the percentage of each job status category.


In [None]:
#plot the bar graph of percentage job categories


From the bar plot above, we can infer that the dataset contains more blue-collar workers than any other job category.

### Categorical Ordered Univariate Analysis
Ordered variables are categorical variables with a natural rank order. Some examples of categorical ordered variables in our dataset are:
* Month: Jan, Feb, Mar, ...
* Education: Primary, Secondary, ...

Let's analyse the **education** variable in our data using a pie chart. 
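
A minimal sketch of the same idea for a pie chart, again on invented labels rather than the real **education** column. The `autopct` argument is passed through to Matplotlib to print the percentage on each slice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical education labels (invented), standing in for the education column.
education = pd.Series(
    ["secondary"] * 5 + ["tertiary"] * 3 + ["primary"] + ["unknown"],
    name="education",
)

# Percentage of each education category.
edu_pct = education.value_counts(normalize=True) * 100
print(edu_pct)

# Pie chart with the percentage printed on each slice.
ax = edu_pct.plot.pie(autopct="%.1f%%")
ax.set_ylabel("")  # remove the default y-axis label for a cleaner chart
plt.tight_layout()
```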

In [None]:
#calculate the percentage of each education category.


In [None]:
#plot the pie chart of education categories


From the pie chart above, we can infer that most records in the dataset have secondary education, followed by tertiary and then primary. A very small percentage have an unknown education level.

This is how we analyse a categorical variable in univariate analysis. If the column or variable is numerical, we analyse it using the mean, median, standard deviation, etc. We can get those values by using the describe function.

```Python
df.salary.describe()
```

The output will be:
![image.png](attachment:image.png)


## Bivariate Analysis
Bivariate analysis is analysing data by taking two variables/columns from a dataset.

### Numeric-Numeric Analysis
Analysing two numeric variables from a dataset is known as numeric-numeric analysis. We can analyse it in three different ways:
* Scatter Plot
* Pair Plot
* Correlation Matrix

#### Scatter Plot
Let's analyse the **balance**, **age** and **salary** in our dataset and see what we can infer using a scatter plot.
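
A sketch of how such scatter plots could be drawn with the Pandas plotting API. The numbers below are invented for illustration, so the shapes of the plots say nothing about the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values (invented) for three numeric columns.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [20000, 50000, 100000, 60000, 70000, 16000],
    "balance": [500, 1200, 4000, 2500, 900, 300],
})

# Scatter plot of salary (x-axis) against balance (y-axis).
ax_salary = df.plot.scatter(x="salary", y="balance")
ax_salary.set_title("balance vs salary")

# Scatter plot of age (x-axis) against balance (y-axis).
ax_age = df.plot.scatter(x="age", y="balance")
ax_age.set_title("balance vs age")
plt.tight_layout()
```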

In [None]:
#plot the scatter plot of balance and salary variable in data


In [None]:
#plot the scatter plot of balance and age variable in data


#### Pair Plot
Let's plot a pair plot using the same three variables (**balance**, **age** and **salary**) we used in our scatter plots. We will use the seaborn library for plotting pair plots.

In [None]:
#plot the pair plot of salary, balance and age in data dataframe.


#### Correlation Matrix
Since we cannot use more than two variables as the x-axis and y-axis in scatter and pair plots, it is difficult to see the relation between three numerical variables in a single graph. In those cases, we will use the correlation matrix.

First, we will create a correlation matrix using **balance**, **age** and **salary**. After that, we will plot a heatmap of the matrix using the seaborn library.
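
These two steps could look roughly like the sketch below, which uses `DataFrame.corr()` (Pearson correlation by default) followed by `seaborn.heatmap`. The data is invented, and the colour map is an arbitrary choice.

```python
import pandas as pd
import seaborn as sns

# Hypothetical values (invented) for three numeric columns.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [20000, 50000, 100000, 60000, 70000, 16000],
    "balance": [500, 1200, 4000, 2500, 900, 300],
})

# Pairwise correlation of the three columns; rows and columns
# are both labelled age, salary and balance.
corr = df[["age", "salary", "balance"]].corr()
print(corr)

# Heatmap of the matrix, with the coefficient written in each cell.
ax = sns.heatmap(corr, annot=True, cmap="Reds")
```

The diagonal is always 1, since every variable is perfectly correlated with itself.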

In [None]:
# Creating a matrix using age, salary, balance as rows and columns


In [None]:
#plot the correlation matrix of salary, balance and age 


### Numeric-Categorical Analysis
Analysing one numeric variable and one categorical variable from a dataset is known as numeric-categorical analysis. We can analyse it using the mean, median and box plots.

We will use the **salary** and **response** columns from our dataset. First, let's get the mean value using groupby.
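
A sketch of the groupby step on invented data, grouping salary by response. The same pattern with `.median()` gives the median per group.

```python
import pandas as pd

# Hypothetical values (invented) for a numeric and a categorical column.
df = pd.DataFrame({
    "salary": [20000, 60000, 100000, 50000, 70000, 60000, 100000],
    "response": ["no", "no", "no", "no", "yes", "yes", "yes"],
})

# Mean and median salary within each response group.
mean_salary = df.groupby("response")["salary"].mean()
median_salary = df.groupby("response")["salary"].median()

print(mean_salary)
print(median_salary)
```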

In [None]:
#groupby the response to find the mean for yes & no responses


There's not much difference between the yes and no response based on salary. Let's check the median.

In [None]:
#groupby the response to find the median for yes & no responses


From the mean and median, we can say that the response of yes and no remains the same irrespective of the person's salary. However, we can check further whether this holds by plotting a box plot.
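
One way to draw such a box plot is the Pandas `DataFrame.boxplot` method with its `by` argument, sketched here on invented data. The quartile table is an extra added for this example, since it shows the numbers that the boxes represent.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values (invented) for a numeric and a categorical column.
df = pd.DataFrame({
    "salary": [20000, 60000, 100000, 50000, 70000, 60000, 100000],
    "response": ["no", "no", "no", "no", "yes", "yes", "yes"],
})

# First quartile, median and third quartile of salary per response group.
quartiles = df.groupby("response")["salary"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)

# One box per response group; each box spans the first to third quartile (the IQR).
df.boxplot(column="salary", by="response")
plt.tight_layout()
```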

In [None]:
#plot the box plot of salary for yes & no responses.


The box plot paints a very different picture compared to the mean and median. The IQR for customers who gave a positive response is on the higher salary side.

### Categorical-Categorical Analysis
Next, we will see how different categories such as education, marital status, etc. are associated with the response column. Instead of 'Yes' and 'No', we will convert the responses into '1' and '0'; the mean of these values gives us the "Response Rate".
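
A minimal sketch of the conversion using `numpy.where`, on invented responses. It assumes the response values are stored as lowercase `"yes"` and `"no"`, which may differ in the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical responses (invented); assumes lowercase "yes"/"no" labels.
df = pd.DataFrame({"response": ["no", "yes", "no", "no", "yes"]})

# 1 where the response is "yes", 0 otherwise.
df["response_rate"] = np.where(df["response"] == "yes", 1, 0)

# The mean of a 0/1 column is the share of positive responses.
overall_rate = df["response_rate"].mean()
print(overall_rate)
```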

In [None]:
#create response_rate of numerical data type where response "yes"= 1, "no"= 0


Let's see how the average response rate varies for the different categories in **marital** status.
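
This could be done by grouping on the category and averaging the 0/1 column, as in the sketch below. The data is invented, so the resulting rates are not those of the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical values (invented) for marital status and response.
df = pd.DataFrame({
    "marital": ["single", "single", "married", "married", "married", "divorced"],
    "response": ["yes", "no", "no", "no", "yes", "no"],
})
df["response_rate"] = np.where(df["response"] == "yes", 1, 0)

# Average response rate within each marital status.
rate_by_marital = df.groupby("marital")["response_rate"].mean()
print(rate_by_marital)

# Bar plot of the average response rate per category.
ax = rate_by_marital.plot.bar()
ax.set_ylabel("average response rate")
plt.tight_layout()
```

Swapping `"marital"` for another categorical column gives the same plot for that column.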

In [None]:
#plot the bar graph of marital status with average value of response_rate


From the above plot, we can infer that the positive response rate is highest for members with single status in the dataset. Similarly, we can plot the graphs for **loan** vs **response_rate**, **education** vs **response_rate**, etc.

In [None]:
#plot the bar graph of education with average value of response_rate


In [None]:
#plot the bar graph of loan with average value of response_rate


## Multivariate Analysis
Multivariate analysis is analysing data by taking more than two variables/columns from a dataset.

Let's look at how **education**, **marital**, and **response_rate** vary with each other. 

First, we will create a pivot table with the three columns and after that, we will create a heatmap using the pivot table.
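
A sketch of both steps with `pandas.pivot_table` and `seaborn.heatmap`, on invented data. Each cell of the pivot table holds the mean `response_rate` for one combination of education and marital status; the colour map is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical values (invented) for two categorical columns and the response.
df = pd.DataFrame({
    "education": ["primary", "primary", "secondary", "secondary",
                  "tertiary", "tertiary", "primary", "tertiary"],
    "marital": ["married", "single", "married", "single",
                "married", "single", "married", "single"],
    "response": ["no", "no", "no", "yes", "yes", "yes", "no", "yes"],
})
df["response_rate"] = np.where(df["response"] == "yes", 1, 0)

# Rows: education, columns: marital, cells: mean response_rate.
pivot = pd.pivot_table(
    df, index="education", columns="marital",
    values="response_rate", aggfunc="mean",
)
print(pivot)

# Heatmap of the pivot table, with the rate written in each cell.
ax = sns.heatmap(pivot, annot=True, cmap="RdYlGn")
```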

In [None]:
# create a pivot table of education, marital and response_rate


In [None]:
#create heat map of education vs marital vs response_rate


Based on the heatmap, we can infer that married people with primary education are less likely to respond positively to the survey, while single people with tertiary education are most likely to respond positively. Similarly, we can plot the graph for **job** vs **marital** vs **response_rate**, etc.

In [None]:
# create a pivot table of job, marital and response_rate


#create heat map of job vs marital vs response_rate


## Frequency Distribution
A frequency distribution is a representation, in either graphical or tabular format, that displays the number of observations within each interval or category.

Let's look at the distribution of our **age** column. Since our dataset has more than 2^15 (32768) observations, we will use 15 bins, following a rule of thumb that ties the number of bins to log2 of the number of observations.
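
A minimal sketch of a 15-bin histogram using `numpy.histogram` and Matplotlib. The ages are randomly generated stand-ins rather than the real **age** column, and the age range of 18 to 95 is an assumption made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic ages (assumed range 18-95), generated only for illustration.
rng = np.random.default_rng(0)
ages = rng.integers(18, 96, size=1000)

# Counts per bin and the 16 edges that delimit the 15 bins.
counts, bin_edges = np.histogram(ages, bins=15)
print(counts)
print(bin_edges)

# Plot the histogram using the same bin edges.
fig, ax = plt.subplots()
ax.hist(ages, bins=bin_edges, edgecolor="black")
ax.set_xticks(bin_edges.round(0))
ax.set_title("Frequency distribution of age")
ax.set_xlabel("age")
ax.set_ylabel("number of observations")
plt.tight_layout()
```

Passing the computed `bin_edges` to `hist` keeps the plotted bars aligned with the printed counts.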

In [None]:
# compute the histogram for age using 15 bins


# setup the plot specifying the bins range


# setup the plot title and label
