# 10 Techniques for Effective Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an essential step in any data analysis process. It involves examining and visualizing the data to uncover patterns, anomalies, relationships, and insights. EDA helps data scientists and analysts gain a deeper understanding of their datasets before diving into more complex analyses or modeling.

In this article, we will explore essential techniques and tools for conducting effective Exploratory Data Analysis.

1. Summary Statistics
Summary statistics provide a quick overview of the dataset’s central tendencies and spread. Common summary statistics include mean, median, mode, standard deviation, variance, and quantiles. Calculating and visualizing these statistics can help identify potential outliers and anomalies.

2. Data Visualization
Data visualization is a powerful EDA technique to understand data distributions and relationships. Techniques such as histograms, box plots, scatter plots, and pair plots can reveal insights into variables’ distribution, correlation, and potential patterns.

3. Missing Data Analysis
Missing data can impact analysis outcomes. EDA helps identify missing values, assess their patterns, and decide on appropriate strategies for handling them, such as imputation or removal.

4. Outlier Detection
Outliers can significantly affect analysis results. EDA techniques like box plots, scatter plots, and Z-score analysis help identify and understand outliers’ impact on data distribution and relationships.

5. Correlation Analysis
Correlation analysis measures the strength and direction of relationships between variables. Techniques like correlation matrices and heatmaps visualize correlations, aiding in feature selection and identifying multicollinearity.

6. Categorical Data Exploration
For categorical variables, techniques like bar plots, pie charts, and count plots reveal frequency distributions and help understand class imbalances or dominant categories.

7. Time Series Analysis
For time-based data, time series plots and decomposition can help detect trends, seasonality, and irregularities over time.

8. Dimensionality Reduction
High-dimensional datasets can be hard to visualize and computationally expensive to analyze. Techniques like Principal Component Analysis (PCA) reduce dimensionality while preserving most of the variance in the data.

9. Interactive Visualization Tools
Tools like Plotly and Tableau offer interactive visualizations, enabling users to explore data subsets, zoom, and hover for detailed insights. Seaborn complements them with static statistical plots.

10. Domain Knowledge Integration
Incorporating domain knowledge is crucial during EDA. Subject matter experts can help interpret patterns, validate findings, and uncover domain-specific insights.
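
Several of the techniques above (summary statistics, outlier detection, and correlation analysis) can be sketched in a few lines of pandas. The data below is a made-up toy example, the column names are hypothetical, and the Z-score threshold of 2 is an assumption chosen for this tiny sample rather than a general rule.

```python
import pandas as pd

# Hypothetical toy data: 'salary' contains one deliberately extreme value.
df = pd.DataFrame({
    "salary": [52000, 48000, 61000, 58000, 50000, 55000, 250000],
    "bonus_pct": [5.0, 4.5, 6.5, 6.0, 5.0, 5.5, 20.0],
    "years": [2, 1, 6, 5, 2, 4, 15],
})

# Summary statistics: central tendency and spread of one column.
summary = df["salary"].agg(["mean", "median", "std", "var"])
quartiles = df["salary"].quantile([0.25, 0.5, 0.75])

# Outlier detection with Z-scores: distance from the mean in standard deviations.
z_scores = (df["salary"] - df["salary"].mean()) / df["salary"].std()
z_outliers = df[z_scores.abs() > 2]  # threshold of 2 is an assumption

# Outlier detection with the 1.5 * IQR rule commonly used for box plot whiskers.
q1, q3 = quartiles.loc[0.25], quartiles.loc[0.75]
iqr = q3 - q1
iqr_outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]

# Correlation analysis: pairwise Pearson correlation matrix.
corr = df.corr()

print(summary)
print(z_outliers)
print(corr.round(2))
```

Both outlier rules flag the same row here. On real data they can disagree, which is a reason to inspect flagged values before removing them.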

# The main goals of EDA
1. Data Cleaning: EDA examines the data for errors, missing values, and inconsistencies. It includes techniques such as imputation, handling of missing data, and identification and removal of outliers.

2. Descriptive Statistics: EDA uses summary statistics to identify the central tendency, variability, and distribution of variables. Typically, measures such as mean, median, mode, standard deviation, range, and percentiles are used.


3. Data Visualization: EDA uses visual techniques to represent the data graphically. Visualizations such as histograms, box plots, scatter plots, line charts, heatmaps, and bar charts help identify patterns, trends, and relationships within the data.

4. Feature Engineering: EDA enables the exploration of variables and their transformations to create new features or derive meaningful insights. Feature engineering can include scaling, normalization, binning, categorical variable encoding, and creation of interaction or derived variables.

5. Correlation and Relationships: EDA enables the discovery of relationships and dependencies between variables. Techniques such as correlation analysis, scatter plots, and contingency tables provide insight into the strength and direction of relationships between variables.


6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain criteria or characteristics. This segmentation provides valuable insights into specific subgroups within the data and can enable more targeted analysis.

7. Hypothesis Generation: EDA helps in generating hypotheses or research questions based on a preliminary examination of the data. It provides a starting point for further analysis and modeling.

8. Data Quality Assessment: EDA makes it possible to assess the quality and reliability of the data. This involves checking the integrity, consistency, and accuracy of records to ensure that the data is suitable for analysis.
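
The feature engineering steps named in the list above (scaling, binning, and derived variables) might look like the following sketch. The data, the column names, and the bin edges are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "salary": [40000, 60000, 80000, 100000],
    "bonus_pct": [5.0, 10.0, 2.5, 20.0],
})

# Min-max scaling: map salary linearly onto the range [0, 1].
s = df["salary"]
df["salary_scaled"] = (s - s.min()) / (s.max() - s.min())

# Binning: turn a continuous variable into ordered categories.
# The bin edges are assumptions made for this example.
df["salary_band"] = pd.cut(
    df["salary"],
    bins=[0, 50000, 90000, float("inf")],
    labels=["low", "mid", "high"],
)

# Derived variable: bonus amount computed from two existing columns.
df["bonus_amount"] = df["salary"] * df["bonus_pct"] / 100

print(df)
```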

## Types of EDA
EDA, or Exploratory Data Analysis, refers to the process of analyzing and examining datasets to uncover patterns, identify relationships, and gain insights. Depending on the number of columns analyzed at a time, EDA is univariate, bivariate, or multivariate. Further types address specific characteristics of the data, such as time dependence, missing values, and outliers. Here are some common types of EDA:

1. Univariate Analysis: This type of analysis focuses on individual variables within the dataset. The aim is to summarize and visualize a single variable at a time in order to understand its distribution, central tendency, spread, and other relevant statistics. Univariate analysis generally uses techniques such as histograms, box plots, bar charts, and summary statistics.


2. Bivariate Analysis: Bivariate analysis involves examining the relationship between two variables. It allows searching for associations, correlations, and dependencies between pairs of variables. Scatter plots, line graphs, correlation matrices, and cross-tabulations are commonly used techniques in bivariate analysis.

3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to more than two variables. The goal is to understand the complex interactions and dependencies between multiple variables in a dataset. Techniques such as heatmaps, parallel coordinates, factor analysis, and principal component analysis (PCA) are used for multivariate analysis.

4. Time Series Analysis: This type of analysis is mainly applied to datasets that have a temporal component. Time series analysis includes examining and modeling patterns, trends, and seasonality in the data over time. Techniques such as line charts, autocorrelation analysis, moving averages, and AutoRegressive Integrated Moving Average (ARIMA) models are widely used in time series analysis.

5. Missing Data Analysis: Missing data is a common problem in datasets and can affect the reliability and validity of the analysis. Missing data analysis involves identifying missing values, understanding the patterns of missingness, and using appropriate techniques to handle them. Techniques such as missing data pattern analysis, imputation strategies, and sensitivity analysis are used.

6. Outlier Analysis: Outliers are data points that differ drastically from the general pattern of the data. Outlier analysis involves identifying outliers and understanding their potential causes and their impact on the analysis. Techniques such as box plots, scatter plots, Z-scores, and clustering algorithms are used for outlier analysis.

7. Data Visualization: Data visualization is a critical component of EDA that involves creating visual representations of the data to facilitate understanding and exploration. Various visualization techniques are used to present different kinds of data, including bar charts, histograms, scatter plots, line charts, heatmaps, and interactive dashboards.

These are just a few examples of the types of EDA techniques that can be used in data analysis. The choice of techniques depends on the characteristics of the data, the research questions, and the insights sought from the analysis.
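
As a small illustration of the difference between univariate and bivariate analysis, the sketch below uses a made-up table with hypothetical column names. It counts one categorical variable on its own, then relates two variables to each other.

```python
import pandas as pd

# Hypothetical toy data; names are illustrative only.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "team": ["Sales", "Sales", "Finance", "Finance", "Sales", "Finance"],
    "salary": [50000, 52000, 70000, 68000, 54000, 72000],
})

# Univariate: frequency distribution of a single categorical variable.
team_counts = df["team"].value_counts()

# Bivariate (categorical vs. categorical): cross-tabulation of counts.
table = pd.crosstab(df["gender"], df["team"])

# Bivariate (categorical vs. numeric): mean salary per team.
mean_salary = df.groupby("team")["salary"].mean()

print(team_counts)
print(table)
print(mean_salary)
```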

In [1]:
import pandas as pd
import numpy as np

In [4]:
#Loading dataset
data=pd.read_csv("C:/Users/mdram/Pandas/employees.csv")
data

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development


In [5]:
data.head(10)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
5,Dennis,Male,4/18/1987,1:35 AM,115163,10.125,False,Legal
6,Ruby,Female,8/17/1987,4:20 PM,65476,10.012,True,Product
7,,Female,7/20/2015,10:43 AM,45906,11.598,,Finance
8,Angela,Female,11/22/2005,6:29 AM,95570,18.523,True,Engineering
9,Frances,Female,8/8/2002,6:51 AM,139852,7.524,True,Business Development


In [7]:
data.tail(10)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
990,Robin,Female,7/24/1987,1:35 PM,100765,10.982,True,Client Services
991,Rose,Female,8/25/2002,5:12 AM,134505,11.051,True,Marketing
992,Anthony,Male,10/16/2011,8:35 AM,112769,11.625,True,Finance
993,Tina,Female,5/15/1997,3:53 PM,56450,19.04,True,Engineering
994,George,Male,6/21/2013,5:47 PM,98874,4.479,True,Marketing
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development
999,Albert,Male,5/15/2012,6:24 PM,129949,10.169,True,Sales


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         933 non-null    object 
 1   Gender             855 non-null    object 
 2   Start Date         1000 non-null   object 
 3   Last Login Time    1000 non-null   object 
 4   Salary             1000 non-null   int64  
 5   Bonus %            1000 non-null   float64
 6   Senior Management  933 non-null    object 
 7   Team               957 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


# Pandas offers different types of attributes and ways to use DataFrames and Series:

Methods - these have to be called with () and in some cases take one or more arguments. Without (), the method is not called and you get the method object itself. Examples: .head(), .drop_duplicates()

Properties/attributes - these look like methods in the source code, but behave like value attributes. They must be accessed without (). Examples: .values, .shape, .size. Commonly used attributes:
index
columns
axes
dtypes
size
shape
ndim
empty
T
values

Indexers - these enable accessing specific locations in the dataframe, for example a range of rows/columns or a row by index value. These must be accessed using the [] item getter operator (similar to a dictionary or list). Examples: [], .loc[], .iloc[]
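
The three kinds of access described above can be compared side by side. This is a minimal sketch on a made-up three-row frame; the column names and index labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "salary": [50, 60, 70]},
    index=["a", "b", "c"],
)

# Method: needs parentheses to run and returns a result.
first_two = df.head(2)
# Without parentheses you only get a reference to the method itself.
not_called = df.head

# Property/attribute: accessed without parentheses.
n_rows, n_cols = df.shape

# Indexers: accessed with square brackets.
by_label = df.loc["b", "salary"]   # row label "b", column "salary"
by_position = df.iloc[0, 1]        # first row, second column
one_column = df["name"]            # a single column as a Series

print(first_two)
print(n_rows, n_cols, by_label, by_position)
```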

In [8]:
data.shape

(1000, 8)

In [21]:
data.size

8000

# The describe() method
Let's get a quick summary of the dataset using the pandas describe() method. It computes basic statistics for each numeric column, such as the number of data points, mean, standard deviation, extreme values (minimum and maximum), and quartiles. Missing (NaN) values are skipped automatically. The result gives a good first picture of the data distribution.

In [17]:
data.describe()

Unnamed: 0,Salary,Bonus %
count,1000.0,1000.0
mean,90662.181,10.207555
std,32923.693342,5.528481
min,35013.0,1.015
25%,62613.0,5.40175
50%,90428.0,9.8385
75%,118740.25,14.838
max,149908.0,19.944
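
The summary above covers only Salary and Bonus % because describe() on a DataFrame summarizes numeric columns by default. Text columns can be described as well. The sketch below uses a made-up frame with hypothetical column names and calls describe() on a single text column.

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column.
df = pd.DataFrame({
    "team": ["Sales", "Finance", "Sales", None],
    "salary": [50000, 70000, 54000, 61000],
})

# Default: only the numeric column is summarized.
numeric_summary = df.describe()

# On a text column, describe() reports count, unique, top, and freq.
text_summary = df["team"].describe()

print(numeric_summary)
print(text_summary)
```

Passing include="all" to DataFrame.describe() is another option. It puts both kinds of columns into one table and fills the statistics that do not apply with NaN.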


In [18]:
#Now let's take a look at the columns and their data types. To do this we use the info() method 
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         933 non-null    object 
 1   Gender             855 non-null    object 
 2   Start Date         1000 non-null   object 
 3   Last Login Time    1000 non-null   object 
 4   Salary             1000 non-null   int64  
 5   Bonus %            1000 non-null   float64
 6   Senior Management  933 non-null    object 
 7   Team               957 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


In [22]:
data.memory_usage()

Index                 128
First Name           8000
Gender               8000
Start Date           8000
Last Login Time      8000
Salary               8000
Bonus %              8000
Senior Management    8000
Team                 8000
dtype: int64

In [35]:
data.head(4)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.34,True,Finance


In [46]:
# parse the "Start Date" column as datetime; this returns new Series, so assign the result back to keep the conversion
pd.to_datetime(data['Start Date']), pd.to_datetime(data.head()['Start Date'])

(0     1993-08-06
 1     1996-03-31
 2     1993-04-23
 3     2005-03-04
 4     1998-01-24
          ...    
 995   2014-11-23
 996   1984-01-31
 997   2013-05-20
 998   2013-04-20
 999   2012-05-15
 Name: Start Date, Length: 1000, dtype: datetime64[ns],
 0   1993-08-06
 1   1996-03-31
 2   1993-04-23
 3   2005-03-04
 4   1998-01-24
 Name: Start Date, dtype: datetime64[ns])
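
pd.to_datetime returns a new Series and leaves the DataFrame unchanged unless the result is assigned back. A minimal sketch of a persistent conversion follows, on a toy frame that mimics the raw month/day/year strings. The explicit format string and the "Start Year" column are assumptions added for illustration.

```python
import pandas as pd

# Toy frame mimicking raw month/day/year strings.
df = pd.DataFrame({"Start Date": ["8/6/1993", "3/31/1996", "4/23/1993"]})

# Assign the result back; an explicit format avoids day/month ambiguity.
df["Start Date"] = pd.to_datetime(df["Start Date"], format="%m/%d/%Y")

# The .dt accessor exposes date parts for further analysis.
df["Start Year"] = df["Start Date"].dt.year

print(df.dtypes)
print(df)
```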

In [64]:
data.head()[['Start Date', "Gender"]]

Unnamed: 0,Start Date,Gender
0,1993-08-06,Male
1,1996-03-31,Male
2,1993-04-23,Female
3,2005-03-04,Male
4,1998-01-24,Male


The nunique() method shows the number of unique values in each column. This helps in deciding which type of encoding to choose when converting categorical columns into numerical columns.

In [66]:
data.nunique()

First Name           200
Gender                 2
Start Date           972
Last Login Time      720
Salary               995
Bonus %              971
Senior Management      2
Team                  10
dtype: int64
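
One way to act on these counts is to one-hot encode only the columns with few distinct values. The sketch below runs on a made-up frame. The cut-off MAX_CATEGORIES is a hypothetical heuristic for this example, not a pandas setting or a fixed rule.

```python
import pandas as pd

# Hypothetical toy data reusing three of the column names from the dataset.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Team": ["Sales", "Finance", "Legal", "Sales"],
    "First Name": ["Ann", "Bob", "Cy", "Dee"],
})

# Hypothetical cut-off: at most 3 distinct values counts as low cardinality.
MAX_CATEGORIES = 3
counts = df.nunique()
low_card = counts[counts <= MAX_CATEGORIES].index.tolist()

# One-hot encode only the low-cardinality columns; the rest stay unchanged.
encoded = pd.get_dummies(df, columns=low_card)

print(low_card)
print(encoded.columns.tolist())
```

High-cardinality columns such as First Name would produce one indicator column per distinct value, which is why they are left out here.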

# Handling Missing Values
Pandas provides the following methods for detecting and handling missing values:
isnull()
notnull()
dropna()
fillna()
replace()
interpolate()

In [79]:
data.isnull().sum(), data.notnull().sum() #data.notnull(), data.dropna(), data.fillna(), data.replace(), data.interpolate()

(First Name            67
 Gender               145
 Start Date             0
 Last Login Time        0
 Salary                 0
 Bonus %                0
 Senior Management     67
 Team                  43
 dtype: int64,
 First Name            933
 Gender                855
 Start Date           1000
 Last Login Time      1000
 Salary               1000
 Bonus %              1000
 Senior Management     933
 Team                  957
 dtype: int64)
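
After counting the missing values, the usual options are dropping, filling, or interpolating. The following sketch shows each option on a made-up frame. The placeholder "Unknown" and the choice of the median as fill value are assumptions, and the right strategy depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
df = pd.DataFrame({
    "team": ["Sales", None, "Finance", "Sales"],
    "salary": [50000.0, np.nan, 70000.0, 80000.0],
})

# Count missing values per column.
missing = df.isnull().sum()

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill gaps with a placeholder (text) or a statistic (numbers).
filled = df.fillna({"team": "Unknown", "salary": df["salary"].median()})

# Option 3: linear interpolation between neighbouring numeric values.
interpolated = df["salary"].interpolate()

print(missing)
print(filled)
print(interpolated)
```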

In [81]:
data.dropna().sum()

First Name           DouglasMariaJerryLarryDennisRubyAngelaFrancesJ...
Gender               MaleFemaleMaleMaleMaleFemaleFemaleFemaleFemale...
Last Login Time      12:42 PM11:17 AM1:00 PM4:47 PM1:35 AM4:20 PM6:...
Salary                                                        69090962
Bonus %                                                         7753.1
Senior Management                                                  381
Team                 MarketingFinanceFinanceClient ServicesLegalPro...
dtype: object
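
In the output above, sum() concatenates the values of the text columns, which is rarely useful. A sketch of restricting the total to numeric columns with numeric_only=True is shown below on a made-up frame.

```python
import pandas as pd

# Hypothetical toy data: one text column with a gap, two numeric columns.
df = pd.DataFrame({
    "First Name": ["Ann", None, "Cy"],
    "Salary": [100, 200, 300],
    "Bonus %": [1.5, 2.5, 3.0],
})

# Drop incomplete rows, then total only the numeric columns.
totals = df.dropna().sum(numeric_only=True)

print(totals)
```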