# Profiling High-Dimensional Datasets

This chapter introduces tools for exploring datasets with many potential features, where inspecting every column by hand quickly becomes impractical.


## `pandas_profiling` 

[`pandas_profiling`](https://github.com/pandas-profiling/pandas-profiling) generates an HTML report from a `pandas` data frame. For each column and for the dataset as a whole, the report includes:

- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histograms
- Correlations: Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
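
To get a feeling for what the report summarizes, several of these statistics can be computed by hand with plain `pandas`. The following is a minimal sketch on a made-up toy data frame (the column names and values are assumptions for illustration only); it is not how `pandas_profiling` computes its report internally.

```python
import pandas

# Toy data frame standing in for a real dataset (all values are made up).
toy = pandas.DataFrame({
    "area": [50.0, 80.0, 120.0, None, 80.0],
    "rooms": [2, 3, 4, 3, 3],
    "price": [100.0, 160.0, 250.0, 150.0, 170.0],
})

# Essentials: type, number of unique values, number of missing values.
essentials = pandas.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "n_missing": toy.isna().sum(),
})

# Quantile statistics: minimum, Q1, median, Q3, maximum and interquartile range.
quantiles = toy.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
iqr = quantiles.loc[0.75] - quantiles.loc[0.25]

# A few descriptive statistics per column.
descriptive = pandas.DataFrame({
    "mean": toy.mean(),
    "std": toy.std(),
    "skewness": toy.skew(),
    "kurtosis": toy.kurt(),
})

# Most frequent values of a single column.
most_frequent = toy["rooms"].value_counts().head(3)

# Correlation matrix; "pearson" and "kendall" are also accepted by `corr`.
spearman = toy.corr(method="spearman")

print(essentials)
print(quantiles)
print(descriptive)
```

Doing this column by column does not scale to dozens of features, which is the gap a generated report fills.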

### Example: Profile of the House Prices Dataset

This dataset has been provided for a [Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) to explore advanced regression and feature selection techniques.

In [None]:
data_dir = "../.assets/data/house"

In [None]:
!cat {data_dir}/data_description.txt

We use `pandas` to read the CSV file and call `pandas_profiling` on the data frame to generate a report.

In [None]:
import pandas

In [None]:
data = pandas.read_csv(f"{data_dir}/prices.csv")

In [None]:
import pandas_profiling

In [None]:
pandas_profiling.ProfileReport(data)

## Exercise: Data Cleaning after Profiling

_Looking at the profiling reports, how would you clean the data to get it ready for machine learning? Implement the data transformations!_
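
As one possible starting point, the sketch below shows a few common cleaning steps on a made-up toy data frame: dropping columns that are mostly missing, dropping constant columns, imputing the remaining gaps, and one-hot encoding categorical columns. The helper `clean_for_ml`, its `max_missing_share` threshold and the toy columns are hypothetical; which steps and thresholds suit the house prices data is something the profiling report should inform.

```python
import pandas


def clean_for_ml(frame, max_missing_share=0.5):
    """Hypothetical helper: a few generic cleaning steps, not a complete recipe."""
    # Drop columns in which more than `max_missing_share` of the values are missing.
    missing_share = frame.isna().mean()
    cleaned = frame.loc[:, missing_share <= max_missing_share].copy()

    # Drop constant columns, since they carry no information for a model.
    n_unique = cleaned.nunique(dropna=False)
    cleaned = cleaned.loc[:, n_unique > 1].copy()

    # Impute numeric columns with the median, other columns with a "Missing" label.
    categorical_cols = []
    for col in cleaned.columns:
        if pandas.api.types.is_numeric_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].fillna(cleaned[col].median())
        else:
            cleaned[col] = cleaned[col].fillna("Missing")
            categorical_cols.append(col)

    # One-hot encode the categorical columns.
    return pandas.get_dummies(cleaned, columns=categorical_cols)


# Toy data (made up): one mostly empty column and one constant column.
toy = pandas.DataFrame({
    "area": [50.0, None, 120.0, 80.0],
    "quality": ["good", None, "poor", "good"],
    "pool": [None, None, None, "yes"],
    "country": ["X", "X", "X", "X"],
})

result = clean_for_ml(toy)
print(result)
```

Median imputation and a separate "Missing" category are only two of several options. For some columns a missing value carries meaning of its own (for example, the absence of a feature), so checking the data description before imputing is worthwhile.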

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_