# **What Is Exploratory Data Analysis?**

Exploratory data analysis (EDA) is the initial investigation of a data set, usually carried out early in an analytical process, to summarise its main characteristics.

At this stage no model has been trained yet; we are only exploring the data.

EDA is a critical step in the data analysis process that involves examining and visualizing the data to gain a deeper understanding of its properties and relationships. 

By performing EDA, you can identify data quality issues, detect outliers, explore patterns and trends, and generate hypotheses for further analysis.

**Some commonly used graphs:**

Numerical variables: histograms, density plots.

Categorical variables: bar plots.

**Relations between two variables:**

Two numerical variables: scatter plot.

One numerical, one categorical: box plots.

Two categorical variables: side-by-side bar plots.

# **Preprocessing:**

During EDA we preprocess the data: we identify which variables are categorical and which are numerical, visualize them, and make decisions based on summary statistics.

# **Categorical variable:**

Variables that hold labels such as name, gender, address, company, or job role are called categorical variables.

Data type: category, object

# **Numerical variable:**

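Variables that hold numbers, such as id, salary, or class, are called numerical variables. Note that identifiers and class codes are stored as numbers but often behave more like categories.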

Data type: integer, float



![image.png](attachment:image.png)

# **Data Cleaning:**

Data cleaning is an essential step in EDA. We must **handle missing values, outliers, and inconsistencies** in the dataset. 

# **Notes:**

1)  **Exploring the Data:**

To check data types, use: .info()

To get a description of the dataset, use: .describe()

To split the variables into categorical and numerical:

      i. numerical data: .select_dtypes(np.number)

      ii. categorical data: .select_dtypes(object)
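The calls in this note can be tried on a small DataFrame. The sketch below uses made-up rows and column names (all hypothetical). For the categorical side it uses `select_dtypes(exclude=np.number)` rather than `select_dtypes(object)`; that is a substitution on my part, chosen because it picks up every non-numeric column regardless of how pandas stores strings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dita"],
    "gender": ["F", "M", "M", "F"],
    "salary": [52000.0, 48000.0, 61000.0, 57000.0],
    "age": [29, 41, 35, 38],
})

df.info()             # column names, non-null counts, and dtypes
print(df.describe())  # summary statistics for the numerical columns

# Split the columns by data type.
numerical = df.select_dtypes(include=np.number)
categorical = df.select_dtypes(exclude=np.number)

print(list(numerical.columns))
print(list(categorical.columns))
```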

2)  **Handling missing values:**

To check for null values: .isnull().sum()

**Important:**

- If a numerical variable has null values, replace them with the mean or the median.

- If a categorical variable has null values, replace them with the mode.

**Filling null values for numerical columns:**

df[column_name] = df[column_name].fillna(df[column_name].mean())

**Filling null values for categorical columns (mode() returns a Series, so take its first value):**

df[column_name] = df[column_name].fillna(df[column_name].mode()[0])

**Removing rows with missing values:**

df.dropna(inplace=True)
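Putting the missing-value steps together, here is a minimal sketch on invented data (the `salary` and `gender` columns are hypothetical). It counts the nulls, fills the numerical column with its mean, and fills the categorical column with its most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
df = pd.DataFrame({
    "salary": [50000.0, np.nan, 70000.0, 60000.0],
    "gender": ["F", "M", None, "M"],
})

missing_before = df.isnull().sum()  # nulls per column
print(missing_before)

# Numerical column: fill with the mean (.median() is more robust to outliers).
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Categorical column: mode() returns a Series because ties are possible,
# so take its first entry.
df["gender"] = df["gender"].fillna(df["gender"].mode().iloc[0])

print(df)
```

Dropping rows with `df.dropna()` is the simpler alternative, but it discards the whole row even when only one column is missing.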

# **Outliers:**

An outlier is a data point that lies an abnormal distance from the other points. In other words, it is a value that falls outside the range of the other values in the set. There are many ways to find outliers; the most commonly used is visual inspection with a box plot.

![image.png](attachment:image.png)

This picture explains how to read a box plot.

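### Outlier treatment in code:

This code helps **to remove the outliers** using the interquartile range (IQR). Only the numerical columns are used to compute the quartiles:

    num = df.select_dtypes(np.number)
    Q1 = num.quantile(.25)
    Q3 = num.quantile(.75)
    IQR = Q3 - Q1
    df = df[~((num < (Q1 - 1.5*IQR)) | (num > (Q3 + 1.5*IQR))).any(axis=1)]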
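To see the IQR rule at work, the sketch below applies it to a small invented table in which one salary is far larger than the rest. The data and column names are hypothetical. The quartiles are computed on the numerical columns only, which is my assumption about the intent, since quantiles are undefined for text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row has an extreme salary.
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "salary": [40.0, 42.0, 45.0, 47.0, 50.0, 500.0],
    "dept": ["a", "b", "a", "b", "a", "b"],
})

num = df.select_dtypes(include=np.number)  # numerical columns only
q1 = num.quantile(0.25)
q3 = num.quantile(0.75)
iqr = q3 - q1

# A value is flagged when it falls more than 1.5 * IQR outside the quartiles.
is_outlier = (num < (q1 - 1.5 * iqr)) | (num > (q3 + 1.5 * iqr))

# Keep only the rows with no flagged value in any numerical column.
cleaned = df[~is_outlier.any(axis=1)]
print(len(df), "->", len(cleaned))
```

Note that this drops a whole row as soon as one of its columns is flagged, so on wide tables it can remove more data than expected.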

# **Population vs. Sample Data:**

The population is the entire set of data; a sample is a subset of the population.

After filling the null values and removing the outliers, check whether the data are normally distributed. Plot each numerical column with distplot, then call **df.skew()**, which reports the skewness of each numerical column: values near 0 indicate a roughly symmetric distribution, positive values a longer right tail, and negative values a longer left tail.
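A quick way to get a feel for the skewness values is to compute them for data whose shape is known in advance. The sketch below generates one roughly symmetric column and one right-skewed column (both synthetic, with hypothetical names) and prints the result of pandas' `DataFrame.skew()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical columns: one roughly symmetric, one with a long right tail.
df = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=5, size=2000),
    "right_skewed": rng.exponential(scale=10, size=2000),
})

skewness = df.skew()  # one skewness value per numerical column
print(skewness)
```

A skewness close to 0 is consistent with a normal distribution but does not prove it, so it is worth looking at the plot as well.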

The following pictures show how the data can be distributed:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Now we can start working on data visualisation.

# **DATA VISUALISATION:**
 
Visualization helps us understand the data and identify patterns.

**Barplot:**

Show point estimates and confidence intervals as rectangular bars. A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle and provides some indication of the uncertainty around that estimate using error bars.

**Histplot:**

A histplot looks much like a bar plot, but it groups a numerical variable into contiguous bins, so there is no gap between neighbouring bars.

![image.png](attachment:image.png)

**Heatmap:**

A heatmap displays a matrix of values as a grid of coloured cells, so that large and small values stand out at a glance. (In web analytics the same term refers to maps of user interaction on a page, which is a different use.) In EDA a heatmap is most often drawn from a correlation matrix, so it is used to show the correlation between the variables of a dataset.

![image.png](attachment:image.png)

The figure above shows the correlation between the variables: highly correlated pairs appear in darker blue and weakly correlated pairs in creamy white.
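A correlation heatmap like the one described can be sketched as follows. The data is synthetic and the column names are hypothetical; the `"Blues"` colour map is only my guess at a palette resembling the figure.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: y_related depends on x, z_unrelated does not.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y_related": 2 * x + rng.normal(scale=0.5, size=200),
    "z_unrelated": rng.normal(size=200),
})

corr = df.corr()  # pairwise Pearson correlation matrix

# annot=True writes the correlation value inside each cell.
ax = sns.heatmap(corr, annot=True, cmap="Blues")
ax.set_title("Correlation matrix")
```

The diagonal is always 1 because every variable is perfectly correlated with itself; the interesting cells are the off-diagonal ones.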

**Pairplot:**

To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. For n numerical variables it draws an n × n matrix of plots: the off-diagonal cells show the relationship for each of the C(n, 2) pairs of variables in the DataFrame, and the diagonal cells show the univariate distribution of each variable.

![image.png](attachment:image.png)

**Scatterplot:**

A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. The example scatterplot figure is shown below.

![image.png](attachment:image.png)

The scatterplot shows the relationship between Gross and the number of users who voted for the movie, with the points separated by imdb_score.

**LMplot:**

An lmplot is a scatter plot with a fitted regression line drawn over it; it can repeat the fit across the panels of a FacetGrid. It is often used in supervised machine learning to see how well a linear fit describes the data.

![image.png](attachment:image.png)

**Distplot:**

The distplot combines a **kdeplot** and a **histplot**. It is deprecated in seaborn 0.11 and later, where histplot with kde=True or displot gives the same picture.

![image.png](attachment:image.png)



**Jointplot:**

Jointplot displays the relationship between two variables (bivariate) together with the 1D profile of each variable (univariate) in the margins. The function is a convenience wrapper around the JointGrid class.

![image.png](attachment:image.png)

# **Exploring Relationships:**

EDA involves exploring relationships between variables to uncover insights. We can use techniques like correlation analysis or cross-tabulation for this purpose.

### Correlation analysis

    correlation = df[['Age', 'Fare']].corr()
    print(correlation)

### Cross-tabulation

    cross_tab = pd.crosstab(df['Pclass'], df['Survived'])
    print(cross_tab)
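The two snippets above assume a DataFrame `df` with Titanic-style columns already exists. As a self-contained sketch, the version below builds a tiny table with the same column names; the six rows are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical rows that reuse the column names from the snippets above.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Correlation analysis: Pearson correlation between two numerical columns.
correlation = df[["Age", "Fare"]].corr()
print(correlation)

# Cross-tabulation: counts for each combination of two categorical columns.
cross_tab = pd.crosstab(df["Pclass"], df["Survived"])
print(cross_tab)
```

Correlation suits pairs of numerical variables, while a cross-tabulation is the natural summary for pairs of categorical variables.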