## **Automatic EDA with ydata-profiling**

Import pandas, ydata_profiling, and seaborn (used here for its sample datasets).

In [None]:
# import libraries
import pandas as pd
import seaborn as sns
import ydata_profiling as yd

In [None]:
# import dataset from seaborn
df = sns.load_dataset('titanic')

Generate the ydata-profiling report:

In [None]:
# build the profile report and write it to an HTML file
profile = yd.ProfileReport(df)
profile.to_file(output_file='../Outputs/ydata_titanic.html')

![image.png](attachment:image.png)

Use an HTML viewer extension to view the profile report, or open the report in your browser.
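
If you prefer to open the saved report from Python rather than by hand, a minimal sketch using only the standard library could look like this. The helper names `report_uri` and `open_report` are hypothetical, and the path is the one used in the cell above.

```python
from pathlib import Path
import webbrowser


def report_uri(report_path: str) -> str:
    """Turn a local report path into a file:// URI that a browser can open."""
    return Path(report_path).resolve().as_uri()


def open_report(report_path: str) -> bool:
    """Ask the default browser to open the report; returns webbrowser's result."""
    return webbrowser.open(report_uri(report_path))
```

Calling `open_report('../Outputs/ydata_titanic.html')` after the report has been written should bring it up in the default browser.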

The report only describes your data. It does not take any action on it, such as handling missing values.
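
Acting on what the report shows is still up to you. As a rough sketch, counting and then handling missing values in plain pandas might look like this. The small `passengers` frame is made-up sample data, and filling with the median or dropping rows are just two of several possible choices.

```python
import pandas as pd

# Hypothetical sample with gaps, standing in for a real dataset.
passengers = pd.DataFrame({
    "age": [22.0, None, 26.0, 35.0],
    "deck": ["C", None, None, "E"],
})

# Count missing values per column (a profile report shows the same information).
missing_counts = passengers.isna().sum()

# Handle them yourself: fill numeric gaps with the median, drop rows missing 'deck'.
cleaned = passengers.copy()
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned = cleaned.dropna(subset=["deck"])

print(missing_counts)
print(cleaned)
```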

We can also pass some parameters and view the data profile directly in our Jupyter Notebook:

In [None]:
# load dataset
df = sns.load_dataset('iris')

# create profile
profile = yd.ProfileReport(df, title="Iris Data Profile", explorative=True)

# Display in a jupyter notebook
profile.to_notebook_iframe()

There are other libraries for automatic EDA that I have been looking into. Let's explore some of them.

## **Sweetviz**

Sweetviz is a Python library that generates rich visualizations with just two lines of code and writes the report to an HTML file.

With Sweetviz we can:
- see how the features in a dataset relate to the target variable
- visualize test and train data and compare them
- plot correlations for numerical and categorical variables
- summarize missing values, duplicate entries, and frequent entries, along with numerical analysis
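
The train/test comparison and target analysis listed above can be sketched roughly as follows. The helper names `split_frame` and `compare_report` are hypothetical, and the random split is only an illustration. `sv.compare` with a `target_feat` argument is used as I understand the Sweetviz API, so treat the call as an assumption to check against the library's documentation.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, test_fraction: float = 0.25, seed: int = 0):
    """Randomly split a frame into train and test parts (assumes a unique index)."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train, test


def compare_report(train: pd.DataFrame, test: pd.DataFrame, target: str):
    """Build a Sweetviz report comparing two frames against a target column."""
    import sweetviz as sv  # imported lazily so split_frame works without Sweetviz

    return sv.compare([train, "Train"], [test, "Test"], target_feat=target)
```

For the Titanic data, something like `compare_report(*split_frame(df), "survived").show_html("../Outputs/titanic_compare.html")` would then write the comparison report. The output file name is an assumption.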

In [None]:
# installing sweetviz
!pip install sweetviz

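**Note:** You may get an error because Sweetviz uses `numpy.VisibleDeprecationWarning`, which was removed from the top-level namespace in NumPy 2.0 (it now lives in `numpy.exceptions`), and Sweetviz had not yet been updated at the time of writing. In that case, downgrade NumPy to a version that still has `VisibleDeprecationWarning` (e.g., 1.26.4):

In [None]:
# downgrade NumPy, then restart the kernel so the new version is picked up
!pip install numpy==1.26.4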
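
Before downgrading, it can be worth checking whether your environment is affected at all. A small sketch, assuming only that NumPy is installed:

```python
import numpy as np

# Major version of the installed NumPy (2 or higher means the alias is gone).
numpy_major = int(np.__version__.split(".")[0])

# True on NumPy 1.x, False on 2.x, where the name lives only in numpy.exceptions.
has_top_level_alias = hasattr(np, "VisibleDeprecationWarning")

print(f"NumPy {np.__version__}; top-level VisibleDeprecationWarning: {has_top_level_alias}")
if not has_top_level_alias:
    print("A Sweetviz release that still uses the old name will likely fail here.")
```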

In [None]:
# dataset
df = sns.load_dataset('titanic')

# importing sweetviz 
import sweetviz as sv

# generating sweetviz report
sweetviz_report = sv.analyze(df)

sweetviz_report.show_html("../Outputs/titanic_EDA_report.html") 

![image.png](attachment:image.png)

In [20]:
# we can also view the report in our Jupyter Notebook
sweetviz_report.show_notebook()

## **AutoViz**

AutoViz is another open-source Python EDA library that can quickly analyze any dataset with a single line of code. (Requires Python >= 3.7.)

In [None]:
!pip install autoviz 

In [None]:
!pip install jupyter_bokeh # must install for displaying the plotted charts in Jupyter Notebook

In [None]:
# load dataset
df = sns.load_dataset('iris')

from autoviz import AutoViz_Class
AV = AutoViz_Class()

dft = AV.AutoViz(
    '',                        # no file name, because the data is passed via dfte
    sep=",",
    depVar="",                 # no target variable
    dfte=df,                   # DataFrame to analyze
    header=0,
    verbose=2,
    lowess=False,
    chart_format="png",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
)

Shape of your Data Set loaded: (150, 5)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
  Printing up to 30 columns (max) in each category:
    Numeric Columns : ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    Integer-Categorical Columns: []
    String-Categorical Columns: ['species']
    Factor-Categorical Columns: []
    String-Boolean Columns: []
    Numeric-Boolean Columns: []
    Discrete String Columns: []
    NLP text Columns: []
    Date Time Columns: []
    ID Columns: []
    Columns that will not be considered in modeling: []
    5 Predictors classified...
        No variables removed since no ID or low-information variables found in data set
   Columns to delete:
'   []'
   Boolean variables %s 
'   []'
   C

| Column | Data Type | Missing Values% | Unique Values% | Minimum Value | Maximum Value | DQ Issue |
|---|---|---|---|---|---|---|
| sepal_length | float64 | 0.0 | | 4.3 | 7.9 | No issue |
| sepal_width | float64 | 0.0 | | 2.0 | 4.4 | Column has 4 outliers greater than upper bound (4.05) or lower than lower bound (2.05). Cap them or remove them. |
| petal_length | float64 | 0.0 | | 1.0 | 6.9 | Column has a high correlation with ['sepal_length']. Consider dropping one of them. |
| petal_width | float64 | 0.0 | | 0.1 | 2.5 | Column has a high correlation with ['sepal_length', 'petal_length']. Consider dropping one of them. |
| species | object | 0.0 | 2.0 | | | No issue |


Number of All Scatter Plots = 10
All Plots are saved in .\AutoViz_Plots\AutoViz
Time to run AutoViz = 8 seconds 
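
The outlier bounds reported for `sepal_width` (2.05 and 4.05) are consistent with the common 1.5 × IQR rule: with quartiles of 2.8 and 3.3, the IQR is 0.5, giving 2.8 − 0.75 = 2.05 and 3.3 + 0.75 = 4.05. Whether AutoViz uses exactly this rule internally is an assumption. The sketch below shows the rule itself on a small made-up series, with hypothetical helper names.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values that fall outside the k * IQR bounds."""
    lower, upper = iqr_bounds(values, k)
    return values[(values < lower) | (values > upper)]


# Made-up sample, not the real iris column.
sample = pd.Series([2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.3, 3.4, 4.4])
outliers = find_outliers(sample)
print(iqr_bounds(sample))
print(outliers.tolist())
```

Running `find_outliers(df['sepal_width'])` on the iris frame is a quick way to check the flagged values against what the tool reports.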


AutoViz creates a folder in the current directory where all the plotted charts are saved.

![image-2.png](attachment:image-2.png)
![image.png](attachment:image.png)
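
To see which chart files were written, a short standard-library sketch is enough. The folder name `AutoViz_Plots` is taken from the output above, and `list_saved_charts` is a hypothetical helper.

```python
from pathlib import Path


def list_saved_charts(plot_dir: str = "AutoViz_Plots"):
    """Return every file under the plot folder, or an empty list if it is missing."""
    root = Path(plot_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


for chart in list_saved_charts():
    print(chart)
```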

You can also change the parameter values. For example, with `verbose=2` and `chart_format="bokeh"` you can view the plotted charts in the Jupyter notebook.

Explore more about AutoViz here: [https://pypi.org/project/autoviz/](https://pypi.org/project/autoviz/)
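
As a sketch, the bokeh variant of the earlier call could be wrapped like this. The settings dictionary and `run_autoviz_bokeh` are hypothetical names, and the values mirror the earlier cell except for `chart_format`. If charts do not show up inline with `verbose=2`, passing `verbose=1` as an override is worth trying. That suggestion rests on my reading of AutoViz's verbosity levels and is not confirmed by the text above.

```python
# Same arguments as the earlier AutoViz call, with the chart format switched to bokeh.
AUTOVIZ_BOKEH_SETTINGS = {
    "sep": ",",
    "depVar": "",
    "header": 0,
    "verbose": 2,
    "lowess": False,
    "chart_format": "bokeh",
    "max_rows_analyzed": 150000,
    "max_cols_analyzed": 30,
}


def run_autoviz_bokeh(df, **overrides):
    """Run AutoViz on a DataFrame with bokeh charts; overrides replace the defaults."""
    from autoviz import AutoViz_Class  # imported lazily; requires autoviz to be installed

    settings = {**AUTOVIZ_BOKEH_SETTINGS, **overrides}
    return AutoViz_Class().AutoViz("", dfte=df, **settings)
```

Usage would be `run_autoviz_bokeh(df)` or, to experiment with verbosity, `run_autoviz_bokeh(df, verbose=1)`.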

These tools are meant to save you from writing repetitive code to visualize a common dataset such as Titanic. You get something quick without wasting time on such toy datasets.

However, if you have a real-world dataset, you might want to try these tools first and then develop your own additional visualizations for whatever they miss. You may also want to investigate more deeply something that is covered only superficially here.

To quote someone: nothing can replace the experience and insight you get by getting your hands dirty. I generally don't recommend any automatic packages for tasks that beginners should be learning how to do themselves. **Understanding what the package is automating is more important than the automation itself.**
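
To make that concrete, here is a minimal hand-written version of the kind of per-column summary these tools produce. It is only a sketch: `quick_summary` is a hypothetical helper, the sample frame is made up, and the real reports cover far more than these three columns.

```python
import pandas as pd


def quick_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, percentage of missing values, and number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean() * 100,
        "n_unique": df.nunique(),
    })


# Made-up sample data for illustration.
sample_df = pd.DataFrame({
    "fare": [7.25, 71.28, None, 8.05],
    "sex": ["male", "female", "female", "male"],
})
summary = quick_summary(sample_df)
print(summary)
```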