# Exploratory Data Analysis and Feature Engineering

In this part of the notes, we turn to exploratory data analysis (EDA). This step is a prerequisite for any statistical modeling approach, as it allows us to understand the data and therefore build better-quality datasets. Specifically, in this chapter we will review:

- Measures of central tendency
- Measures of dispersion
- Empirical CDF
- Common plots such as the histogram, boxplot, and Q-Q plot

In addition, we will look into how to evaluate the quality of a dataset and how to evaluate the quality of a statistical model.
We will need the latter metrics when we discuss the various models in the forthcoming parts of the notes.

However, before we begin, let's briefly discuss the data that we will be working with. We will organise it predominantly into structured and unstructured data. Let's see what each of these means.

**Structured data**

A large proportion of the data we will work with is structured and typically comes in the form of tables.
Examples are:

- Relational databases, which by definition organise the data into tables
- Excel/LibreOffice spreadsheets
- CSV files

Note that although a number of different database models exist, these typically give us some
structure when reading the data, i.e. a predefined schema. We will categorize this data as structured data.
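To make the idea of a predefined schema concrete, here is a minimal sketch using only the Python standard library. The CSV content, the column names, and the helper `read_structured` are all hypothetical and chosen purely for illustration; the values are made up.

```python
import csv
import io

# Hypothetical CSV content with made-up values; in practice this would
# come from a file on disk.
raw = """sensor,temperature_c,reading_count
A,31.5,120
B,18.0,95
"""

# Assumed schema: column name -> type used to parse the raw text.
schema = {"sensor": str, "temperature_c": float, "reading_count": int}


def read_structured(text, schema):
    """Parse CSV text into a list of dicts, casting each column per the schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: cast(row[name]) for name, cast in schema.items()}
        for row in reader
    ]


rows = read_structured(raw, schema)
for row in rows:
    print(row)
```

Because every row follows the same schema, each column can be parsed with a known type up front. This is the property we rely on when we call data structured.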

**Unstructured data**

We will use the term unstructured to capture all data that does not fit into the table format.
Although this may not be entirely correct, it will simplify things for us without much loss.
Unstructured data, therefore, has no predefined design and does not follow any particular data model.
Examples of unstructured data include:

- Images
- Videos
- Text
- Audio

The table below shows some characteristics of structured vs unstructured data:

| Structured data                 | Unstructured data                 |
|---------------------------------|-----------------------------------|
| Represented in tables           | Cannot be represented as tables   |
| Easier to use                   | More difficult to use             |
| Requires less storage capacity  | Requires more storage capacity    |


In addition to the structured/unstructured categorization of the data, we also have the quantitative/qualitative distinction.
Broadly speaking, a variable can be either quantitative, i.e. numerical, or qualitative, i.e. categorical.
Most of the time, it will be easy to see to which category the data belongs. Examples of numerical
data include temperature, weight, height, prices, etc. In principle, anything that is not numerical will be qualitative, e.g.
hair color, eye color, or text such as tweets or logs coming from a system.
We will also distinguish four levels within the two groups above:


- Nominal
- Ordinal
- Interval
- Ratio

**Nominal:** This type of data is always qualitative. In most cases, we won't be able to say much about it other than the value of the variable and counts of the values that occurred. Quantities such as the mean or variance have no meaning for such variables.
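As a minimal sketch of what we can compute for nominal data, the snippet below counts occurrences and derives the most frequent value and the relative frequencies. The `hair_color` sample is made up for illustration.

```python
from collections import Counter

# Made-up sample of a nominal variable.
hair_color = ["brown", "black", "brown", "blond", "black", "brown"]

# Counting occurrences is the basic operation available for nominal data.
counts = Counter(hair_color)

# The most frequent value (the mode) is well defined; the mean is not.
mode, mode_count = counts.most_common(1)[0]

# Relative frequencies: the share of observations in each category.
relative_freq = {value: n / len(hair_color) for value, n in counts.items()}

print(counts)
print(mode, mode_count)
print(relative_freq)
```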

**Ordinal:** This type of data is also qualitative in nature, but this time some sort of order is attached. Consider, for example, satisfaction with a given service. This may be happy, neutral, unhappy, or any other states we choose. Ordinal data can be turned into
numerical data by using numbers to represent each category. However, we need to be careful about how these numbers are used and what meaning we extract from them,
as this is still ordinal data: the numbers encode order only, not distances between categories.
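The encoding described above can be sketched as follows. The category names and the `responses` sample are assumptions made for illustration. Since the codes carry order only, the sketch summarizes them with a median rather than a mean.

```python
import statistics

# Assumed ordering of the categories, from lowest to highest satisfaction.
levels = ["unhappy", "neutral", "happy"]
rank = {level: i for i, level in enumerate(levels)}

# Made-up survey responses.
responses = ["happy", "neutral", "unhappy", "happy", "happy"]

# Encode each response by its rank; the integers carry order only.
codes = [rank[r] for r in responses]

# median_low always returns one of the observed codes, so it maps back
# to a valid category. The mean of the codes would depend on the
# arbitrary spacing we chose between categories.
median_level = levels[statistics.median_low(codes)]

print(codes)
print(median_level)
```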

**Interval:** This type of data is quantitative. The differences between values have a consistent meaning. We can compute quantities
such as the mean and ask questions about standard deviations. However, data in this category lack the concept of a true
zero. A true zero represents the absence of what we are trying to measure. When we say zero degrees Celsius to express temperature,
this does not mean an absence of temperature; in fact, zero is a legitimate temperature measurement. The consequence of this is that we can meaningfully add and subtract such values but not multiply or divide them.
In other words, they lack the notion of a ratio; e.g. 40 degrees Celsius is not twice as hot as 20 degrees Celsius.
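A short sketch illustrates why ratios are not meaningful for interval data. The same two temperatures give a different ratio in each unit, while their difference on the Celsius and Kelvin scales agrees. The helper function names are ours.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin."""
    return c + 273.15


def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


a_c, b_c = 40.0, 20.0

# The ratio depends on where the scale places its zero.
ratio_c = a_c / b_c
ratio_f = celsius_to_fahrenheit(a_c) / celsius_to_fahrenheit(b_c)
ratio_k = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)

# Differences are consistent: Celsius and Kelvin share the same degree size.
diff_c = a_c - b_c
diff_k = celsius_to_kelvin(a_c) - celsius_to_kelvin(b_c)

print(ratio_c, ratio_f, ratio_k)
print(diff_c, diff_k)
```

The Celsius ratio is 2, but the Fahrenheit ratio is 104/68 and the Kelvin ratio is close to 1.07, so "twice as hot" is an artifact of the chosen scale.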

**Ratio:** This is the type of data that most people think of when discussing quantitative data. It is the same as interval data, with the addition of a true zero. At the ratio level, thanks to the true zero, we can multiply and divide values and the results are meaningful.
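In contrast to interval data, ratios of ratio-level data do not depend on the unit of measurement. The sketch below uses two made-up weights and an approximate kilogram-to-pound conversion factor.

```python
# Approximate conversion factor, used here for illustration only.
KG_TO_LB = 2.20462

# Made-up weights in kilograms.
a_kg, b_kg = 80.0, 40.0

# A change of unit is a pure rescaling (zero stays at zero),
# so the ratio of the two weights is preserved.
ratio_kg = a_kg / b_kg
ratio_lb = (a_kg * KG_TO_LB) / (b_kg * KG_TO_LB)

print(ratio_kg, ratio_lb)
```

Statements such as "twice as heavy" are therefore meaningful for this kind of data.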

Note that in general we will assume that we are working with a single data generating process $P_{data}$. This will represent
the broadest and most abstract representation of the data. If we knew $P_{data}$, then most, if not all, of the material discussed
herein would be unnecessary or of theoretical value only.
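To illustrate the role of $P_{data}$, the sketch below pretends that we know it: we assume, purely for illustration, that it is a normal distribution with hypothetical parameters `TRUE_MEAN` and `TRUE_STD`. In practice we only observe samples and must estimate such quantities from them.

```python
import random
import statistics

# Hypothetical parameters of P_data; in practice these are unknown.
TRUE_MEAN, TRUE_STD = 5.0, 2.0


def sample_from_p_data(n, seed=0):
    """Draw n observations from the assumed P_data (a normal distribution)."""
    rng = random.Random(seed)
    return [rng.gauss(TRUE_MEAN, TRUE_STD) for _ in range(n)]


# With more observations, the sample statistics tend to get closer
# to the parameters of the generating process.
for n in (10, 1_000, 10_000):
    data = sample_from_p_data(n)
    print(n, statistics.fmean(data), statistics.stdev(data))
```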

## References