## Data understanding
The goal of this phase is to become familiar with the data, assess its quality and suitability for the business problem at hand, discover relevant relationships, structures and patterns in the data, and outline first findings and initial insights that can be exploited in the next phases. To this end, we employ statistical methods, visualization techniques, clustering algorithms, pattern mining, etc.

### Data selection
In order to allow for quick and easy exploration, we first select a sufficiently large and relevant subset of the data.

> This is highly dependent on the data, the business question and the domain, so there are no general guidelines. Just make sure to explain clearly which data you select or discard, and why. If your dataset is quite small, this step can be skipped.
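
As an illustration of documenting a selection, the following minimal sketch keeps one period and then draws a reproducible random sample. The column names (`timestamp`, `sensor_id`, `value`), the cutoff date and the sample size are all hypothetical; the right criteria depend entirely on your data and business question.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: one measurement per day (synthetic, for illustration only).
rng = np.random.default_rng(seed=0)
n_rows = 1000
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=n_rows, freq="D"),
    "sensor_id": rng.integers(0, 5, size=n_rows),
    "value": rng.normal(size=n_rows),
})

# Step 1: keep only the period assumed to be relevant to the business question.
cutoff = pd.Timestamp("2024-01-01")
selected = df.loc[df["timestamp"] >= cutoff]

# Step 2: draw a random sample with a fixed seed so the selection is reproducible.
subset = selected.sample(n=100, random_state=42)

# Report what was kept and what was discarded, as the guidance above asks.
print(f"rows in full dataset:   {len(df)}")
print(f"rows on/after {cutoff.date()}: {len(selected)}")
print(f"rows in explored subset: {len(subset)}")
```

Fixing the random seed makes it possible to state exactly which rows were explored and to rerun the exploration later on the same subset.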

### Basic data exploration
We now apply some basic numerical and visual exploration techniques on the selected data to develop an initial understanding of the attributes and the relationships between them.
 
> This involves
> - univariate analysis: explore each attribute one by one. 
>     - for numerical attributes: we need to understand the central tendencies (mean, median), the spread (min, max) and the variation (standard deviation)
>     - for categorical attributes: we need to know the frequency for each category
> - multi-variate analysis: explore different combinations of different attributes to understand their relationship
>
> Typically, all this can be shown with standard visualizations, e.g.
> - a distribution plot can show the distribution for a numerical attribute
> - a box plot can show the min, quartiles, median, max and potential outliers for a numerical attribute
> - a bar/pie chart can show the frequency for each category
> - a correlation matrix can show the correlation between different (numerical) attributes
> - a scatter plot matrix can show the relationship (incl. correlation) between (numerical) attributes and indicate whether there are any groups to be observed
> - a timeline plot can show whether all time-sensitive attributes cover (more or less) the same period, and whether any trends or seasonalities are visible
> - a geographical heatmap can show hotspots and outliers on a map
> - ...
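
The numbers behind these plots can be computed with a few pandas calls. The sketch below assumes a small synthetic table with hypothetical columns `age`, `income` and `segment`; plotting calls are left out on purpose, so only the underlying statistics are shown.

```python
import numpy as np
import pandas as pd

# Synthetic data with hypothetical attribute names (for illustration only).
rng = np.random.default_rng(seed=0)
n = 500
age = rng.normal(40, 10, size=n)
df = pd.DataFrame({
    "age": age,
    "income": 1000 + 50 * age + rng.normal(0, 100, size=n),
    "segment": rng.choice(["a", "b", "c"], size=n, p=[0.6, 0.3, 0.1]),
})

# Univariate analysis, numerical attributes:
# central tendency (mean, median), spread (min, max), variation (std).
numeric = df.select_dtypes(include="number")
summary = numeric.agg(["mean", "median", "min", "max", "std"])

# Univariate analysis, categorical attribute: relative frequency per category.
frequencies = df["segment"].value_counts(normalize=True)

# Multi-variate analysis: pairwise Pearson correlation between numerical attributes.
correlations = numeric.corr()

print(summary.round(2))
print(frequencies.round(3))
print(correlations.round(3))
```

Each of these tables maps onto one of the plots listed above: `summary` onto distribution and box plots, `frequencies` onto a bar chart, and `correlations` onto a correlation matrix.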

> Note that some of these statistics were already presented in the reference dataset creation notebook, but in this section the goal is to provide a more detailed discussion of the observed mean, variation, etc. rather than just presenting them. This means you should explicitly answer questions like:
> - are the mean/median as expected? Is the variation higher/lower than expected?
> - are there any obvious outliers in the univariate/multi-variate plots?
> - are there any attributes that are strongly correlated and is this expected? Or vice versa: did you expect attributes to be correlated even though they do not seem to be?
> - is the data more or less evenly distributed over the categories, or are there certain categories under- or over-represented? Is that expected?
> - can we observe an increasing/decreasing trend for a time-sensitive attribute? Or sudden peaks/drops at certain moments in time?
> - do we see any seasonality in a time-sensitive attribute (e.g. hourly/daily/weekly/monthly patterns)?
> - ...?
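
Two of these questions, outliers and seasonality, can be checked numerically before discussing them. The sketch below uses the common 1.5 x IQR rule to flag outliers and a mean-per-hour profile to look for a daily pattern. The series `load`, its synthetic daily cycle and the injected outlier are assumptions made for illustration; the IQR rule is one heuristic among several, not a prescribed method.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series over 4 weeks with a daily cycle (hypothetical data).
rng = np.random.default_rng(seed=1)
n = 24 * 28
timestamps = pd.Timestamp("2023-01-01") + pd.to_timedelta(np.arange(n), unit="h")
hour_of_day = np.arange(n) % 24
values = 10 + 3 * np.sin(2 * np.pi * hour_of_day / 24) + rng.normal(0, 0.5, size=n)
values[100] = 40.0  # one injected extreme value
series = pd.Series(values, index=timestamps, name="load")

# Outliers: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = series.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
outliers = series[is_outlier]

# Seasonality: average value per hour of day, computed without the flagged
# outliers so that a single extreme value does not distort the profile.
clean = series[~is_outlier]
hourly_profile = clean.groupby(clean.index.hour).mean()

print(outliers)
print(hourly_profile.round(2))
```

A flat `hourly_profile` would suggest no daily pattern; the same grouping by `index.dayofweek` or `index.month` gives the weekly or monthly view.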

### Advanced data exploration
> Based on the initial understanding of the data, in this phase you try to uncover more interesting patterns or structures in the data. Typically, you want to investigate the relationship between the attribute(s) of interest (e.g. the numerical attribute that you want to predict, the categorical attribute that you want to classify, ...) and (a combination of) the other attributes present in the data. You can use techniques such as
> - PCA to combine highly correlated attributes into a few components and reduce the dimensionality of the data so that it's easier to visualize
> - clustering to identify interesting groups of related instances. This typically requires normalizing the attributes first.
> - pattern mining to identify sequential patterns in the data. This typically involves discretizing the data, i.e. converting numerical attributes to categorical ones using binning techniques.
>
> and use more advanced visualizations such as heatmaps or clustermaps.
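
The preprocessing steps mentioned above can be sketched with NumPy and pandas alone: z-score normalization, PCA computed through a singular value decomposition, and equal-frequency binning. The attributes `x1`, `x2`, `x3` are synthetic and hypothetical, and in practice a library such as scikit-learn would usually be used for PCA and clustering; clustering and pattern mining themselves are not shown here.

```python
import numpy as np
import pandas as pd

# Synthetic data: x1 and x2 are strongly correlated, x3 is independent
# (hypothetical attributes, for illustration only).
rng = np.random.default_rng(seed=2)
n = 300
base = rng.normal(size=n)
df = pd.DataFrame({
    "x1": 100 * base + rng.normal(0, 5, size=n),
    "x2": 0.5 * base + rng.normal(0, 0.05, size=n),
    "x3": rng.normal(size=n),
})

# Normalization: z-scores put attributes with very different scales on equal footing.
scaled = (df - df.mean()) / df.std(ddof=0)

# PCA via SVD of the standardized data matrix.
u, s, vt = np.linalg.svd(scaled.to_numpy(), full_matrices=False)
explained = s**2 / np.sum(s**2)  # share of variance per component
components = pd.DataFrame(
    u[:, :2] * s[:2],  # projection onto the first two principal components
    columns=["pc1", "pc2"],
    index=df.index,
)

# Discretization: equal-frequency binning of a numerical attribute into 3 categories.
df["x3_bin"] = pd.qcut(df["x3"], q=3, labels=["low", "mid", "high"])

print(explained.round(3))
print(components.head())
print(df["x3_bin"].value_counts())
```

Because `x1` and `x2` carry almost the same information, the first two components capture nearly all of the variance, which is what makes a 2D scatter plot of `pc1` against `pc2` a reasonable summary of this table.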

### Conclusion
> Document the most important insights that you have gained about the data.