# **Exploratory Data Analysis**

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data, with the help of summary statistics and graphical representations, so as to:
- discover patterns,
- spot anomalies,
- test hypotheses, and
- check assumptions.

It is good practice to understand the data first and to gather as many insights from it as possible.

EDA is all about making sense of the data in hand before getting your hands dirty with it.


## **EDA Steps**



1.   Data Sourcing
2.   Data Cleaning
3.   Univariate Analysis
4.   Bivariate and Multivariate Analysis





### **A. Data Sourcing**

- Source the data from a file, by querying a database, or by web scraping.
- Determine whether the data is public or private.
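
As a minimal sketch of the first two sourcing routes, assuming pandas is available: the CSV text, the `sales` table and the column names below are made up for illustration, and stand in for a real file path and a real database connection.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "city,amount\nPune,120\nDelhi,340\n"
df_file = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory SQLite database standing in for a real DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("Pune", 120), ("Delhi", 340)]
)
conn.commit()
df_db = pd.read_sql_query("SELECT city, amount FROM sales", conn)
conn.close()

print(df_file.shape, df_db.shape)
```

Both routes end in the same place, a DataFrame, so the cleaning steps that follow do not depend on where the data came from.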

### **B. Data Cleaning**

Once you source the data, it is essential to get rid of the irregularities in the data and fix it to improve its quality.

One can encounter different kinds of issues in a dataset. Irregularities may appear in the form of ***missing values, anomalies/outliers, incorrect format and inconsistent spelling, etc***.

These irregularities may propagate further and affect the assumptions and analysis based on that dataset and may hamper the further process of machine learning model building. Hence, data cleaning is a very important step in EDA.

Though data cleaning is often done in a somewhat haphazard manner, and it is difficult to define a ‘single structured process’, you will study data cleaning through the following steps:

1. Identifying the data types

2. Fixing the rows and columns

3. Imputing/removing missing values

4. Handling outliers

5. Standardising the values

6. Fixing invalid values

7. Filtering the data

#### **1. Identifying the data types**

Explore the data set and find out its columns and the data type of each column.

Also check how the data is stored: can a column be split into more than one column, or can multiple columns be combined into one?
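
A small sketch of this inspection, assuming pandas; the `address` and `amount` columns are hypothetical examples of a column worth splitting and a number stored as text.

```python
import pandas as pd

# Hypothetical data: every column is read as text at first.
df = pd.DataFrame({
    "address": ["Pune, Maharashtra", "Jaipur, Rajasthan"],
    "amount": ["120", "340"],
})

print(df.dtypes)  # inspect how each column is stored

# Split one column into two.
df[["city", "state"]] = df["address"].str.split(", ", expand=True)

# Convert a text column that really holds numbers.
df["amount"] = pd.to_numeric(df["amount"])
```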

#### **2. Fixing the rows and columns**

**Checklist for fixing rows:**

- Delete summary rows: Total and Subtotal rows
- Delete incorrect rows: Header row and footer row
- Delete extra rows: Column number, indicators, blank rows, page number.

**Checklist for fixing columns:**

- Merge columns: If needed, merge columns to create unique identifiers, for example, merge the columns State and City into the column Full Address.

- Split columns to get more data: Split the Address column to get State and City columns to analyse each separately. 

- Add column names: Add column names if missing.

- Rename columns consistently: Expand abbreviations and give encoded columns readable names.

- Delete columns: Delete unnecessary columns.

- Align misaligned columns: The data set may have shifted columns, which you need to align correctly.
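
A few of the checklist items above can be sketched with pandas as follows. The raw table, its terse column names and the `Total` marker are invented for illustration.

```python
import pandas as pd

# Hypothetical raw table with a blank row, a total row and terse column names.
raw = pd.DataFrame({
    "st": ["Maharashtra", "Rajasthan", None, "Total"],
    "cty": ["Pune", "Jaipur", None, None],
    "amt": [120.0, 340.0, None, 460.0],
})

df = raw.dropna(how="all")        # delete blank rows
df = df[df["st"] != "Total"]      # delete the summary row
df = df.rename(columns={"st": "state", "cty": "city", "amt": "amount"})

# Merge two columns into one identifier.
df["full_address"] = df["city"] + ", " + df["state"]
df = df.reset_index(drop=True)
```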



#### **3. Imputing/removing missing values**

- Set values as missing values: Identify values that indicate missing data, for example, treat blank strings, "NA", "XX", "999", etc., as missing.

- Adding is good, exaggerating is bad: You should try to get information from reliable external sources as much as possible, but if you can’t, then it is better to retain missing values rather than exaggerating the existing rows/columns.

- Delete rows and columns: Rows can be deleted if the number of missing values is insignificant, as this would not impact the overall analysis results. Columns can be removed if the missing values are significant in number.

- Fill partial missing values using business judgement: Such values include missing time zones, century, etc. These values can be identified easily.
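
The first and third points can be sketched as below, assuming pandas and NumPy. The placeholder values and the 50% column threshold are arbitrary choices for this example, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which "NA", "XX", 999 and blank strings
# all stand for missing values.
df = pd.DataFrame({
    "age": [25, 999, 31, 40],
    "city": ["Pune", "NA", "", "Delhi"],
    "notes": ["XX", "XX", "XX", "ok"],
})

df["age"] = df["age"].replace(999, np.nan)
df[["city", "notes"]] = df[["city", "notes"]].replace(["NA", "XX", ""], np.nan)

missing_share = df.isna().mean()  # fraction missing per column

# Drop columns that are mostly missing (threshold is an assumption).
df = df.loc[:, missing_share <= 0.5]
# Drop the few rows that still have a missing value.
df_complete = df.dropna()
```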

**Types of missing values:**

- MCAR: It stands for Missing Completely At Random. The reason behind the missing value does not depend on any feature, observed or unobserved.

- MAR: It stands for Missing At Random. The reason behind the missing value is associated with other observed features, but not with the missing value itself.

- MNAR: It stands for Missing Not At Random. The reason behind the missing value depends on the missing value itself, for example, people with a high income declining to report it.


## **Missing values don't always have to be present as null**

***The various ways to impute the missing values -***

**Imputation on categorical/numeric columns:**

1. Categorical column: 

- Impute the most popular category.

- Imputation can also be done with a classification model, such as logistic regression, trained on the other columns.

2. Numerical column:

- Impute the missing value with mean/median/mode.

- Other methods to impute missing values include interpolation and linear regression. These methods are useful for continuous numerical variables.
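
A minimal sketch of the simpler options, assuming pandas and NumPy. The column names and values are hypothetical; the regression-based methods are left out to keep the example short.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "loan_type": ["home", "auto", None, "home"],
    "income": [30.0, np.nan, 50.0, 400.0],
    "reading": [1.0, np.nan, 3.0, 4.0],
})

# Categorical: fill with the most frequent category.
df["loan_type"] = df["loan_type"].fillna(df["loan_type"].mode()[0])

# Numerical: the median is less sensitive to extreme values than the mean.
df["income"] = df["income"].fillna(df["income"].median())

# Ordered continuous series: linear interpolation between neighbours.
df["reading"] = df["reading"].interpolate()
```

Here the median is used for `income` because the value 400.0 would pull the mean upwards; with no extreme values the mean would be an equally reasonable choice.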

#### **4. Handling Outliers**

Outliers are values that lie far beyond the next nearest data points.

There are two types of outliers. These are:

**Univariate outliers:** Univariate outliers are those data points in a variable whose values lie beyond the range of expected values. For example, if almost all the points lie between 0 and 5.0 and one point is extremely far away (at 20.0), that point lies outside the norm of the data set.

**Multivariate outliers:** While plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data with some other variable, these values may lie far from the expected value. These are called multivariate outliers.

The major approaches to the treatment of outliers can include:

- Imputation

- Deletion of outliers

- Binning of values

- Capping the outliers
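
Three of these treatments can be sketched with pandas as follows. The 1.5 × IQR fences are a common convention rather than something prescribed above, and the bin edges and labels are arbitrary.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 20.0])

# Flag univariate outliers with the 1.5 * IQR rule (an assumed convention).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

deleted = values[~is_outlier]                   # deletion
capped = values.clip(lower=lower, upper=upper)  # capping

# Binning: the extreme value simply falls into the top bucket.
binned = pd.cut(
    values,
    bins=[0, 2.5, 5, float("inf")],
    labels=["low", "mid", "high"],
)
```

Deletion loses a row, capping keeps the row but changes its value, and binning keeps the row while discarding the exact magnitude; which trade-off is acceptable depends on the analysis.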





#### **5. Standardising Values**

From rows and columns, we now move on to cleaning and fixing the values of individual cells.

**Steps to follow for standardising the numeric values**

- **Standardise units**: Ensure all observations under one variable are expressed in a common and consistent unit, e.g., convert lbs to kg, miles/hr to km/hr, etc.

- **Scale values if required**: Make sure all the observations under one variable have a common scale.

- **Standardise precision** for a better presentation of data, e.g., change 4.5312341 kg to 4.53 kg.

**Steps to follow for standardising the text values**

- **Remove extra characters** such as common prefixes/suffixes, leading/trailing/multiple spaces, etc. These are irrelevant to the analysis.

- **Standardise case**: String variables may take various casing styles, e.g., FULLCAPS, lowercase, Title Case, Sentence case, etc.

- **Standardise format**: It is important to standardise the format of other elements such as date, name, etc. For example, change 23/10/16 to 2016/10/23, “Modi, Narendra” to “Narendra Modi”, etc.
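
The numeric and text steps above can be sketched together, assuming pandas. The column names are hypothetical, and the name pattern only handles the simple "Last, First" form used in the example.

```python
import pandas as pd

df = pd.DataFrame({
    "weight": [10.0, 4.5312341],
    "unit": ["lbs", "kg"],
    "name": ["  modi, narendra ", "NARENDRA MODI"],
    "date": ["23/10/16", "05/01/17"],
})

# Standardise units and precision: convert lbs to kg, round to 2 decimals.
LBS_TO_KG = 0.45359237
is_lbs = df["unit"] == "lbs"
df.loc[is_lbs, "weight"] = df.loc[is_lbs, "weight"] * LBS_TO_KG
df["unit"] = "kg"
df["weight"] = df["weight"].round(2)

# Standardise text: trim spaces, reorder "Last, First", apply title case.
name = df["name"].str.strip()
name = name.str.replace(r"^(\w+),\s*(\w+)$", r"\2 \1", regex=True)
df["name"] = name.str.title()

# Standardise dates: parse day/month/2-digit year, emit year/month/day.
parsed = pd.to_datetime(df["date"], format="%d/%m/%y")
df["date"] = parsed.dt.strftime("%Y/%m/%d")
```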

#### **6. Fixing Invalid Values**

If your data set has invalid values, and if you do not know which accurate values could replace the invalid values, then it is recommended that you treat these values as missing. 

Now, let’s summarise what you learnt about fixing invalid values in a data set. Given below is a list of points that we covered. You could use this as a checklist for future data cleaning exercises:

- **Encode unicode properly**: In case the data is being read as junk characters, try to change the encoding, for example, use CP1252 instead of UTF-8.

- **Convert incorrect data types**: Change the incorrect data types to the correct data types for ease of analysis. For example, if numeric values are stored as strings, then it would not be possible to calculate metrics such as mean, median, etc. Some of the common data type corrections include changing a string to a number ("12,300" to "12300"), a string to a date ("2013-Aug" to "2013/08"), a number to a string ("PIN Code 110001" to "110001"), etc.

- **Correct the values that lie beyond the range**: If some values lie beyond the logical range, for example, a temperature less than -273.15 °C (0 K), then you would need to correct those values as required. A close look at the data set would help you determine whether there is scope for correction or the value needs to be removed.

- **Correct the values not belonging in the list**: Remove the values that do not belong to a list. For example, in a data set of blood groups of individuals, strings ‘E’ or ‘F’ are invalid values and can be removed.
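
A short sketch of three of these checks, assuming pandas and NumPy. The columns are hypothetical, and invalid values are set to missing rather than guessed, in line with the advice at the start of this section.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": ["12,300", "4,050", "700"],
    "temp_c": [21.5, -300.0, 18.0],
    "blood_group": ["A", "E", "O"],
})

# Convert incorrect data types: strip thousands separators, then parse.
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", "", regex=False))

# Values beyond the logical range: nothing is colder than absolute zero.
df.loc[df["temp_c"] < -273.15, "temp_c"] = np.nan

# Values not in the allowed list are treated as missing.
valid_groups = ["A", "B", "AB", "O"]
df["blood_group"] = df["blood_group"].where(df["blood_group"].isin(valid_groups))
```

For the encoding point, `pd.read_csv` accepts an `encoding` argument (for example `encoding="cp1252"`), which is usually the first thing to try when text is read as junk characters.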

#### **7. Filtering the Data**

It is important for you to understand what you need in order to draw insights from the data, and then choose relevant parts of the dataset for your analysis. Thus, you need to filter the data in order to get what you need for your analysis.

You could use the points below as a checklist for filtering data:

- **Deduplicate data**: Remove identical rows and rows that are duplicated on the identifying columns.

- **Filter rows**: Filter rows by segment and date period to obtain only rows relevant to the analysis.

- **Filter columns**: Filter columns relevant to the analysis.

- **Aggregate data**: Group by the required keys and aggregate the rest.
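
The four points can be sketched in sequence, assuming pandas. The orders table, the choice of `order_id` as the key and the 2023 date window are all invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "segment": ["retail", "retail", "retail", "corporate", "retail"],
    "date": pd.to_datetime(
        ["2023-01-05", "2023-01-05", "2023-02-10", "2023-02-11", "2024-03-01"]
    ),
    "amount": [100, 100, 250, 400, 80],
    "comment": ["", "", "gift", "", ""],
})

# Deduplicate: fully identical rows first, then rows sharing the same key.
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["order_id"], keep="first")

# Filter rows by segment and date period.
in_2023 = (df["date"] >= "2023-01-01") & (df["date"] < "2024-01-01")
retail_2023 = df[(df["segment"] == "retail") & in_2023]

# Filter columns, then aggregate by month.
retail_2023 = retail_2023[["date", "amount"]]
monthly = retail_2023.groupby(retail_2023["date"].dt.month)["amount"].sum()
```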

### **C. Univariate Analysis**

Univariate analysis involves the analysis of a single variable at a time.

For categorical variables, univariate analysis distinguishes between **ordered** and **unordered** variables.

**Unordered data** is the type of data that has no notion of order or rank (such as high-low, more-less, fail-pass, etc.). Examples:

- The type of loan taken by an individual (home loan, personal loan, auto loan, etc.) has no notion of order. These are just different types of loans.

- The departments of an organisation (Sales, Marketing, HR) are just different departments, with no measurable attribute attached to any of them.

Unordered variables are also called nominal variables.

**Ordered variables** are those variables that follow a natural rank or order. Some examples:

  - Age group:  <30, 30-40, 40-50 and so on

  - Month: Jan, Feb, Mar, etc.

  - Education: primary, secondary and so on
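
In pandas, the difference between the two kinds of variable can be made explicit. The sketch below uses hypothetical data, and the `graduate` level is an assumed extension of the education example.

```python
import pandas as pd

df = pd.DataFrame({
    "loan_type": ["home", "auto", "home", "personal", "home"],
    "education": ["secondary", "primary", "graduate", "primary", "secondary"],
})

# Unordered (nominal): frequency counts are the main summary.
loan_counts = df["loan_type"].value_counts()

# Ordered (ordinal): declare the order so comparisons respect it.
levels = ["primary", "secondary", "graduate"]  # assumed ranking
df["education"] = pd.Categorical(df["education"], categories=levels, ordered=True)
education_counts = df["education"].value_counts(sort=False)
at_least_secondary = (df["education"] >= "secondary").sum()
```

Declaring the order is what allows comparisons such as `>= "secondary"`; on a plain text column the same comparison would be alphabetical and therefore meaningless.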


#### Note -

Both the standard deviation and the interquartile difference (also called the interquartile range, IQR) are used to represent the spread of the data.

The interquartile difference is a more robust metric than the standard deviation when there are outliers in the data: the standard deviation is influenced by every extreme value, while the interquartile difference depends only on the middle 50% of the data and is largely unaffected by outliers.
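
A quick numerical check of this note, assuming pandas. The two series are made up and differ only in their last value; `interquartile_difference` is a helper defined here, not a pandas function.

```python
import pandas as pd


def interquartile_difference(s: pd.Series) -> float:
    """Return Q3 minus Q1 for a numeric series."""
    return s.quantile(0.75) - s.quantile(0.25)


clean = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

print("std:", clean.std(), with_outlier.std())
print(
    "Q3 - Q1:",
    interquartile_difference(clean),
    interquartile_difference(with_outlier),
)
```

The standard deviation changes by more than an order of magnitude between the two series, while Q3 minus Q1 stays the same.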
  

### **D. Bivariate And Multivariate Analysis**

#### **1. Analysis Between Two Numeric Variables**

Plot types that can be used:

1. Correlation Matrix (heatmap)
2. Scatter Plot
3. Pair Plot

A correlation matrix shows the relationship between each pair of variables (columns) through the correlation coefficient.

**Correlation Coefficient** depicts only a linear relationship between the numeric variables. It doesn't depict any other relationship between variables.

A coefficient of zero doesn't imply that there is no relationship between the variables; it merely indicates that there is no linear relationship between them.

A negative correlation means that as the value of one variable increases, the value of the other tends to decrease; with a positive correlation, the two tend to increase together.
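
These points can be checked on a small made-up example, assuming pandas. `DataFrame.corr` computes the Pearson coefficient by default; the `squared` column is completely determined by `x`, yet its coefficient with `x` comes out as zero because the relationship is not linear.

```python
import pandas as pd

df = pd.DataFrame({"x": [-3, -2, -1, 0, 1, 2, 3]})
df["linear_up"] = 2 * df["x"] + 1    # perfect positive linear relationship
df["linear_down"] = -df["x"]         # perfect negative linear relationship
df["squared"] = df["x"] ** 2         # fully determined by x, but not linearly

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))
```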

However, the correlation matrix has its own limitations: you cannot see the exact distribution of one variable against another numeric variable. To solve this problem, we use pair plots. A pair plot is a grid of scatter plots for every pair of numeric variables in a data set. It shows how each variable varies with respect to the others.

### *A high correlation coefficient does not imply that one variable causes the other, because there can be **no causation** between them. There may be cases where you will see a high correlation coefficient between two variables even though there is no real relation between them.*

#### **2. Correlation vs Causation**

Correlation does not imply causation.

#### **3. Numerical - Categorical Analysis**



#### **4. Categorical - Categorical Analysis**

#### **5. Multivariate Analysis**

# **Summary**

Exploratory Data Analysis (EDA) helps a data analyst to look beyond the data. It is a never-ending process: the more you explore the data, the more insights you draw from it. As a data analyst, almost 80% of your time will be spent understanding data and solving various business problems through EDA. If you understand EDA properly, that will be half the battle won.

 

Now, one thing that you should keep in mind is that EDA is far more than plain visualisation. It is an end-to-end process to analyse a data set and prepare it for model-building.

 

You learnt the four most crucial steps in any kind of data analysis. The first two concern sourcing and cleaning the data:

- Gather data for analysis: In the data sourcing part, you learnt about the various sources of data. There are two major types of data sources, namely, public data and private data. Private data is associated with some security and privacy concerns, whereas public data is freely available to use without any restrictions on access or usage. Many websites provide access to public data sets. You have also learnt about the basics of web scraping, a process to fetch data directly from a web page.

- Preparation and cleaning of data: In the cleaning process, the main objective is to remove irregularities from a data set. There are many ways to clean data, but the two most important approaches that you learnt as part of the cleaning step are treatment of missing values and outlier handling. 
 

Now, there are many ways to deal with missing values, for example, removing an entire column or the rows with missing values; however, you need to keep in mind that doing so should not cause a significant loss of information. Another method is to impute the missing values with other values such as the mean, median, mode or quantiles. A third method is to treat the missing values as a separate category; this is often the safest method to deal with missing values.

 

You also learnt the different methods for analysing variables. These methods include the following:

- Univariate analysis: Univariate analysis involves the analysis of a single variable at a time. Now, there are multiple types of variables, such as categorical ordered and unordered variables, and numerical variables. A univariate analysis gives insights about a single variable: how it varies and, for a categorical variable, the count of each category in it.

- Bivariate and multivariate analysis: Bivariate/multivariate analysis involves analysing two or more variables at the same time. These analyses yield very specific insights about a data set. You can infer various findings through bivariate analysis.
 