# EDA - Pandas profiling

It should come as no surprise that there are plenty of automated tools that can make a data analyst's work easier. The easier it is to get a lot of numbers, graphs and information out of the data with their help, the more attention must be paid to interpreting them correctly. That is why we discuss one of these tools only at the very end of the exploratory data analysis, when we can already obtain the individual characteristics manually and interpret them correctly, so that we will not get lost in the rich automatic reports.
One of the tools for automatic reports is [Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling).
Like other libraries, Pandas Profiling must first be installed. In one of the first lessons, we discovered that command-line commands can be run directly from a notebook, so we can take advantage of that here.
> Normally, preparing the environment and installing libraries is not part of a notebook and is done before the notebook is started, so take this as an unconventional use and more of a demonstration of what is possible.

In [None]:
!python -m pip install pandas-profiling

> The dynamic output in this case is not very readable and, apart from a list of all installed dependencies, contains nothing useful, so we omit it from the notebook this time.
The data for the analysis is prepared in the file [spotify_top10.csv](static/spotify_top10.csv). It was published on [kaggle.com](https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year), and its original source is [this part of the Spotify service](http://organizeyourmusic.playlistmachinery.com/#).
The report can be generated either outside the notebook, in which case the result is a standalone HTML file, or inside the notebook, where the results are displayed inline. We will use the first option because it is faster and clearer.
> As in the previous case, this command can be run directly from the command line; the notebook is not needed for it at all.

In [None]:
!pandas_profiling static/spotify_top10.csv static/spotify_report.html

If your computer has a multi-core processor, you can use the `--pool_size X` parameter to make `pandas_profiling` process the data in parallel on *X* cores, making better use of your computer's computing power and reducing processing time.
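
The second option mentioned earlier, displaying the report inside the notebook, goes through the Python API rather than the command line. The following is only a minimal sketch, assuming the `pandas_profiling` package installed above exposes `ProfileReport` (newer releases are distributed as `ydata-profiling`, where the import name differs). The `build_report` helper and the title string are made up for illustration.

```python
import pandas as pd


def build_report(df, title="Spotify top 10"):
    """Return a profiling report for df (requires pandas-profiling)."""
    # Imported lazily so the rest of the notebook runs even without the package.
    from pandas_profiling import ProfileReport

    return ProfileReport(df, title=title)


# Typical use in a notebook cell, with the paths used in this lesson:
# df = pd.read_csv("static/spotify_top10.csv")
# report = build_report(df)
# report.to_notebook_iframe()                     # show the report inline
# report.to_file("static/spotify_report.html")    # or write the HTML file
```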
And now to the report itself, which is available on [this page](static/spotify_report.html).
At the very top is a navigation bar, which makes it easier to find our way around a long report and to switch quickly between its individual parts. We will now go through these parts one by one.
## Part One - General Information (Overview)
The first part contains general information about the entire dataset in the left column, and information about the individual variables and their detected types in the right column. Interestingly, one variable was automatically excluded from the analysis. We find out why a little lower, in the information and warnings section. That section most often contains warnings about too many null values or, conversely, too many unique values for categorical variables. In this case, it gives the reason for excluding the *year* column: too high a correlation with an unnamed column. This is the first mistake of the automation. We would rather discard the first, unnamed column, which contains the index, but the tool decided otherwise.
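
One way to avoid this situation is to tell Pandas that the first column is the index before the data reaches the profiling tool. The sketch below uses a tiny made-up CSV in place of `spotify_top10.csv` (the rows and the `title` and `bpm` columns are assumptions) to show how the unnamed column appears and how strongly it correlates with *year*.

```python
import io

import pandas as pd

# Made-up stand-in for the real file: the first header cell is empty,
# which is what happens when a DataFrame is saved together with its index.
csv_text = """,title,year,bpm
1,Song A,2010,97
2,Song B,2010,87
3,Song C,2011,120
4,Song D,2012,105
5,Song E,2013,110
"""

# Read naively: Pandas names the headerless column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Rows sorted by year make the running index almost a copy of the year.
index_year_corr = raw["Unnamed: 0"].corr(raw["year"])
print(round(index_year_corr, 2))

# Read with the first column as the index: only real variables remain.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

Saving `clean` back to CSV without the index (`index=False`) and profiling that file would be one way to keep *year* in the report.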
![overview](static/overview.png)

## Part Two - Variables
The second part contains information about each column. The amount and form of the information depends on the type of variable. For each variable, the Toggle details link in the lower right corner expands further details. For numerical variables, these are more detailed descriptive statistics, histograms, and the most common and extreme values. For categorical variables, they are more detailed information about the most common values and other properties describing the composition of the values (Composition).
![variables](static/variables.png)
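
To check the report against a manual calculation, most of these per-variable figures can be reproduced with a few Pandas calls. This is a sketch on made-up rows; the `genre` and `bpm` column names are assumptions, not necessarily those of the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "genre": ["pop", "pop", "dance pop", "pop", "rock"],
    "bpm": [97, 87, 120, 105, 110],
})

# Numerical variable: descriptive statistics and extreme values.
bpm_stats = df["bpm"].describe()
bpm_smallest = df["bpm"].nsmallest(2).tolist()
bpm_largest = df["bpm"].nlargest(2).tolist()

# Categorical variable: most common values and number of distinct values.
genre_counts = df["genre"].value_counts()
genre_distinct = df["genre"].nunique()

print(bpm_stats)
print(genre_counts)
```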

## Part Three - Correlations
The third part graphically represents several different correlation coefficients between pairs of variables.
![correlations](static/correlations.png)
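
Why several coefficients are shown is easier to see on a small example. The sketch below, on made-up data, compares the Pearson and Spearman coefficients using `DataFrame.corr`. The relationship is perfectly monotonic but not linear, so the two coefficients disagree.

```python
import pandas as pd

# Made-up data: y grows with x, but not along a straight line.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

# Pearson measures linear dependence, Spearman works on ranks.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print(pearson.loc["x", "y"])   # below 1: the relationship is not linear
print(spearman.loc["x", "y"])  # the ranks agree perfectly
```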

## Part Four - Missing values
In the fourth part we find graphical and numerical representations of missing values. Matrix views can help us detect whether missing values occur in clusters.
![missing](static/missing.png)
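
The numerical side of this part is easy to reproduce by hand. A minimal sketch on made-up data (the column names and values are assumptions): count the missing values per column, express them as a share, and list the affected rows to see whether they sit next to each other.

```python
import pandas as pd

# Made-up rows; None marks a missing value.
df = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C", "Song D"],
    "bpm": [97.0, None, None, 105.0],
    "energy": [67.0, 75.0, None, 80.0],
})

missing_count = df.isna().sum()    # missing values per column
missing_share = df.isna().mean()   # the same as a fraction of all rows

# Rows with at least one missing value; neighbouring indexes hint at a cluster.
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_count)
print(rows_with_missing)
```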

## Part Five - Sample
The last part of the report shows the first ten and the last five rows of the dataset.
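
The same view can be put together in Pandas with `head` and `tail`. A small sketch on a made-up twenty-row frame:

```python
import pandas as pd

# Made-up frame with rows numbered 1 to 20.
df = pd.DataFrame({"row": range(1, 21)})

# First ten and last five rows, stacked into one preview.
sample = pd.concat([df.head(10), df.tail(5)])
print(sample)
```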
## How to use the potential of automatic reports
Of course, over time, each analyst will develop their own habits and workflows to take full advantage of automated reporting, but also to focus on the important things and not be carried away by seemingly surprising information without deeper manual examination.
It is common practice to create a report at the very beginning of the analysis. Because the tool we tested, as its name suggests, uses Pandas in the background, the report shows how well the automatic detection of the individual variables' types will work. If the detection fails on the first attempt, it is time to modify the input data and regenerate the report. After that, we can examine the properties of the individual variables and the relationships between them, and carry out follow-up analyses. Personally, I like to keep the report open next to the notebook in which I analyze the data, because I can look back at the graphs and descriptive statistics at any time and do not have to produce them again manually. But not everything is included in the report, and what is there may not be free of errors.
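
A typical case of failed type detection is a numeric column that contains a text placeholder for missing values. The sketch below uses a made-up CSV in which `?` plays that role (both the data and the placeholder are assumptions) and shows one possible fix at load time, before the report is regenerated.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up data: "?" stands in for a missing tempo value.
csv_text = """title,bpm
Song A,97
Song B,?
Song C,120
"""

# Read naively: the placeholder forces the whole column to be text.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Declare the placeholder as missing: the column becomes numeric.
fixed = pd.read_csv(io.StringIO(csv_text), na_values=["?"])
print(fixed.dtypes)
```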
## Examples
If this sample is not enough, the [project pages](https://github.com/pandas-profiling/pandas-profiling#examples) contain a number of example reports for different datasets.
## Time to play
If you like Pandas Profiling, you can try it on your own data from past lessons and see whether this automatic tool has found something you have not, or whether any of its findings could have saved you manual work.