<a href="https://colab.research.google.com/github/itaewonflow/lecture-pandas/blob/main/Lec_pandas_eda_ydata_profiling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDA: Exploratory Data Analysis

- EDA is an essential part of the data science and machine learning workflow. It lets us become familiar with a dataset by exploring it from multiple angles through statistics, data visualisations, and data summaries. This helps us discover patterns, spot outliers, and build a solid understanding of the data we are working with.
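
As a rough sketch of what manual EDA involves before turning to an automated tool, the snippet below computes summary statistics, counts missing values, and flags outliers with the 1.5 × IQR rule. The DataFrame is invented for illustration (its column names and values are assumptions, not the Titanic data loaded later).

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 512.33],
    "sex": ["male", "female", "female", "female", "male", "male"],
})

summary = df.describe()                # count, mean, std, quartiles of numeric columns
missing = df.isna().sum()              # number of missing values per column
sex_counts = df["sex"].value_counts()  # frequency of each category

# Flag outliers in `fare` with the 1.5 * IQR rule
q1, q3 = df["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["fare"] < q1 - 1.5 * iqr) | (df["fare"] > q3 + 1.5 * iqr)]

print(summary)
print(missing)
print(sex_counts)
print(outliers)
```

Each of these checks is a separate line of code to write and maintain, which is the repetitive work a profiling report is meant to take over.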

# EDA: ydata_profiling | Python Tutorial
- ydata_profiling official site: https://ydata-profiling.ydata.ai/docs/master/index.html

## Installation

- pip install ydata-profiling

In [4]:
!pip install -U ydata-profiling



## 1. Load Dataset

In [5]:
import pandas as pd

titanic = pd.read_csv("https://raw.githubusercontent.com/itaewonflow/data-mart/main/titanic_with_han.csv", encoding='euc-kr')
titanic.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | 성별 | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 남자 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 여자 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 여자 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 여자 | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 남자 | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


## 2. Generate Report

In [6]:
from ydata_profiling import ProfileReport

profile = ProfileReport(titanic, title="Titanic Profiling Report", explorative=True)

In [None]:
profile


## 3. Save Report

In [None]:
profile.to_file("report.html")

## 4. Explore More Options

- minimal mode: ydata-profiling includes a minimal configuration file in which the most expensive computations are turned off by default. **This is the recommended starting point for larger datasets.**

In [None]:
from ydata_profiling import ProfileReport

# minimal=True turns off the most expensive computations
profile = ProfileReport(titanic, title="Titanic Profiling Report (Minimal)", minimal=True)

In [None]:
profile

## 5. Compare Report

- ydata-profiling can be used to compare multiple versions of the same dataset. This is useful when comparing data from different time periods, such as two years. Another common scenario is comparing the profiles of the training, validation, and test sets in machine learning.
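
To give a feel for what such a comparison surfaces, here is a minimal pandas-only sketch that places the summary statistics of two subsets side by side. The `year` and `fare` columns and their values are hypothetical, and a comparison report covers far more than this single statistic table.

```python
import pandas as pd

# Toy data with two time periods, invented for illustration only
df = pd.DataFrame({
    "year": [2022, 2022, 2022, 2023, 2023, 2023],
    "fare": [10.0, 20.0, 30.0, 20.0, 40.0, 60.0],
})

# One describe() per subset, concatenated column-wise for side-by-side reading
side_by_side = pd.concat(
    {year: group["fare"].describe() for year, group in df.groupby("year")},
    axis=1,
)

# Relative shift of the mean between the two periods
mean_shift = side_by_side.loc["mean", 2023] / side_by_side.loc["mean", 2022] - 1

print(side_by_side)
print(f"mean shift: {mean_shift:.0%}")
```

A large shift in a statistic between subsets is the kind of signal worth checking before training on one subset and evaluating on another.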

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(titanic, train_size=0.3, stratify=titanic['Survived'])
train_df.shape, test_df.shape

In [None]:
from ydata_profiling import ProfileReport, compare

# Build one report per split; each must be given its own DataFrame
train_report = ProfileReport(train_df, title='Train')
test_report = ProfileReport(test_df, title='Test')

comparison_report = compare([train_report, test_report])

In [None]:
comparison_report

In [None]:
comparison_report.to_file("comparison_report.html")

- type_schema: We can set `type_schema` for only those variables whose types we are certain of. The types of all other variables are inferred automatically.
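
The motivation can be seen with plain pandas: a 0/1 label such as `Survived` is read as an integer column, so automatic inference has to guess whether it is numeric or categorical. The sketch below uses a small invented frame (not the real dataset) and casts the column explicitly; `type_schema` plays the analogous role when building a report.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})
print(df.dtypes)  # Survived is read as an integer dtype

# Summarised as a number: mean, std, and quartiles
numeric_view = df["Survived"].describe()

# Summarised as a category: count, unique, top, freq
categorical_view = df["Survived"].astype("category").describe()

print(numeric_view)
print(categorical_view)
```

The same column produces two very different summaries depending on its declared type, which is why fixing the type up front is worthwhile for columns we already understand.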


In [None]:
import pandas as pd

from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

file_name = cache_file(
    "titanic.csv",
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)
df = pd.read_csv(file_name)

type_schema = {"Survived": "categorical", "Embarked": "categorical"}

report = ProfileReport(df, title="Titanic EDA", type_schema=type_schema)

report.to_file("titanic_report.html")
report