[CINTO4003U.LA_F23 (Artificial Intelligence and Machine Learning)](https://cbscanvas.instructure.com/courses/32010) - Copenhagen Business School<br>
***
<br>

#### Practical ML - Lab 6
### **Automated EDA using ydata-profiling**<br><br>

[Exploratory data analysis](https://www.ibm.com/topics/exploratory-data-analysis) (EDA) is an important part of data analytics and data science, and it is considered an essential early step in machine learning experimentation. In essence, the insights from your EDA inform all later choices regarding feature selection, data mining, learning algorithm selection, evaluation, etc. However, EDA can also be very time-consuming and repetitive.

Wouldn't it be awesome if we could automate parts of it? This is where [ydata-profiling](https://github.com/ydataai/ydata-profiling) (formerly `pandas-profiling`) comes in. The primary goal of ydata-profiling is to provide a one-line exploratory data analysis (EDA) experience that is consistent and fast. Like the handy pandas `df.describe()` function, ydata-profiling delivers an extended analysis of a DataFrame, and the analysis can be exported in different formats such as HTML and JSON.
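
For comparison, here is a minimal sketch of the kind of summary `df.describe()` gives on its own. The DataFrame `toy_df` and its column names are assumptions made up for this illustration (a few hand-typed meteorite-style rows); it is not the dataset used in the lab.

```python
import pandas as pd

# Hand-typed toy data, only for illustration (hypothetical column names).
toy_df = pd.DataFrame(
    {
        "recclass": ["L5", "H6", "EH4", "Acapulcoite", "L6"],
        "mass_g": [21.0, 720.0, 107000.0, 1914.0, 780.0],
        "year": [1880.0, 1951.0, 1952.0, 1976.0, 1902.0],
    }
)

# By default, describe() summarises only the numeric columns:
# count, mean, std, min, quartiles and max.
numeric_summary = toy_df.describe()

# include="all" adds non-numeric columns (count, unique, top, freq).
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

A profiling report covers the same ground and then goes further, for example with missing-value overviews, correlations and per-variable plots.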

Below we run through an example of using ydata-profiling with pandas on a tabular dataset.

**NOTE**: If you can't get this to work, here is an example of the kind of report we are building with ydata-profiling: https://ydata-profiling.ydata.ai/examples/master/meteorites/meteorites_report.html

In [10]:
# Make sure you are in the virtual environment you want to be in!
# %pip installs into the environment of the running notebook kernel.
%pip install -U ydata-profiling
%pip install -U ipywidgets



In [1]:
import pandas as pd
import ydata_profiling  # importing the package adds `profile_report()` to pandas DataFrames

In [2]:
URL = "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(URL)
df.head()

| | name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aachen | 1 | Valid | L5 | 21.0 | Fell | 1880.0 | 50.775 | 6.08333 | (50.775, 6.08333) |
| 1 | Aarhus | 2 | Valid | H6 | 720.0 | Fell | 1951.0 | 56.18333 | 10.23333 | (56.18333, 10.23333) |
| 2 | Abee | 6 | Valid | EH4 | 107000.0 | Fell | 1952.0 | 54.21667 | -113.0 | (54.21667, -113.0) |
| 3 | Acapulco | 10 | Valid | Acapulcoite | 1914.0 | Fell | 1976.0 | 16.88333 | -99.9 | (16.88333, -99.9) |
| 4 | Achiras | 370 | Valid | L6 | 780.0 | Fell | 1902.0 | -33.16667 | -64.95 | (-33.16667, -64.95) |


In [3]:
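report = df.profile_report(
    sort=None,  # keep the variables in their original column order
    html={"style": {"full_width": True}},  # let the HTML report use the full page width
    progress_bar=True,  # show progress while the report is computed
    explorative=True,  # enable the more in-depth analyses
)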

In [None]:
# renders it in the notebook
report

That's it! You now have an interactive report of your dataset as a starting point for your EDA! <br>
Check out [the documentation for ydata-profiling](https://ydata-profiling.ydata.ai/docs/master/index.html) for more info about how to take full advantage of the tool, and how to save the report to .html, .json, etc.