## Import DataAnalyst

> First, import datapilot's DataAnalyst class so that its methods are available.

In [None]:
cd ..

In [None]:
pwd

> Import DataAnalyst from datapilot.data_analysis.

In [None]:
from datapilot.data_analysis import DataAnalyst

## Import Data

> This tutorial uses the well-known iris dataset.

In [None]:
import pandas as pd

In [None]:
iris = pd.read_csv("./data/iris.csv")

## Get help message

> To initialize a DataAnalyst instance, pass it the pandas DataFrame you want to analyze.

In [None]:
da = DataAnalyst(iris)

> For any class in datapilot, you can look up all of its callable methods by calling .help().

In [None]:
da.help()
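
> As a rough illustration of what a method listing involves, the sketch below uses plain Python introspection. It is not datapilot's implementation of .help(); `list_public_methods` and the `Demo` class are hypothetical names introduced only for this example.

```python
def list_public_methods(obj):
    """Return the names of obj's callable attributes that do not start with '_'."""
    return sorted(
        name for name in dir(obj)
        if not name.startswith("_") and callable(getattr(obj, name))
    )


class Demo:  # hypothetical stand-in class, not part of datapilot
    def inspect_data(self):
        pass

    def detect_missing(self):
        pass


methods = list_public_methods(Demo())
print(methods)
```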

## Get list of all column names

> get_all_column_info() lists every column name along with the type datapilot assigned to it automatically (numerical or categorical). The assigned type can be changed later.

In [None]:
da.get_all_column_info()
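
> To make the numerical/categorical split concrete, here is a minimal sketch that labels columns from their pandas dtype. This is an assumption about how such a split could work, not datapilot's actual rule; `infer_column_types` and the sample values are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_column_types(df):
    """Label each column 'numerical' or 'categorical' from its pandas dtype."""
    return {
        col: "numerical" if is_numeric_dtype(df[col]) else "categorical"
        for col in df.columns
    }


# Tiny stand-in for the iris data (values are illustrative only).
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
})
column_types = infer_column_types(sample)
print(column_types)
```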

## Inspect Data

> inspect_data() first shows a report from pandas with the non-missing row count and the pandas data type of each column.

It then shows the data type detected by datapilot. If the two do not match, it raises a warning, and you can set the data type manually.

In [None]:
da.inspect_data()
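
> One kind of mismatch this check could catch is a column of numbers stored as text. The sketch below prints the pandas report with DataFrame.info() and then flags non-numeric columns whose values all parse as numbers. It is only an assumed example of such a check; `find_type_mismatches` is a hypothetical helper, and datapilot's own detection logic may differ.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_type_mismatches(df):
    """Return non-numeric columns whose values all parse as numbers."""
    mismatched = []
    for col in df.columns:
        if is_numeric_dtype(df[col]):
            continue
        # Values that cannot be parsed become NaN with errors="coerce".
        parsed = pd.to_numeric(df[col], errors="coerce")
        if parsed.notna().all():
            mismatched.append(col)
    return mismatched


sample = pd.DataFrame({
    "sepal_width": ["3.5", "3.0", "3.3"],  # numbers stored as text
    "species": ["setosa", "setosa", "virginica"],
})
sample.info()  # pandas report: non-null counts and dtypes
mismatches = find_type_mismatches(sample)
print(mismatches)
```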

## Detect suspicious values

> Load a corrupted iris dataset that contains suspicious, missing, and duplicate values.

In [None]:
iris_corrupt = pd.read_csv("./data/iris_corrupt.csv")

In [None]:
da_corrupt = DataAnalyst(iris_corrupt)

In [None]:
da_corrupt.inspect_data()

> detect_suspicious() looks for pre-defined suspicious values in the input and shows where each one occurs and why it was flagged.

In [None]:
da_corrupt.detect_suspicious()
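
> The tutorial does not list which values datapilot treats as suspicious, so the sketch below assumes a small set of placeholder tokens and reports where they appear. `SUSPICIOUS_TOKENS` and `find_suspicious` are hypothetical and only show the general idea.

```python
import pandas as pd

# Hypothetical placeholder tokens; datapilot's actual list may differ.
SUSPICIOUS_TOKENS = {"?", "-", "N/A", "unknown"}


def find_suspicious(df, tokens=SUSPICIOUS_TOKENS):
    """Return (row label, column, value) for every cell matching a token."""
    hits = []
    for col in df.columns:
        # Compare on trimmed text so numeric columns can be scanned too.
        as_text = df[col].astype(str).str.strip()
        for idx in df.index[as_text.isin(tokens)]:
            hits.append((idx, col, df.at[idx, col]))
    return hits


sample = pd.DataFrame({
    "sepal_length": ["5.1", "?", "6.3"],
    "species": ["setosa", "unknown", "virginica"],
})
suspicious = find_suspicious(sample)
print(suspicious)
```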

## Detect missing values

> detect_missing() finds missing values in the data and shows where they occur.

In [None]:
da_corrupt.detect_missing()
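
> For comparison, missing cells can be located with plain pandas. The sketch below is not datapilot's implementation; `find_missing` is a hypothetical helper built on DataFrame.isna().

```python
import pandas as pd


def find_missing(df):
    """Return (row label, column) for every missing cell, column by column."""
    mask = df.isna()
    return [(idx, col) for col in df.columns for idx in df.index[mask[col]]]


sample = pd.DataFrame({
    "sepal_length": [5.1, None, 6.3],
    "species": ["setosa", "setosa", None],
})
missing = find_missing(sample)
print(missing)
print(sample.isna().sum())  # missing count per column
```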

## Detect duplicate values

> detect_duplicate() finds duplicate rows in the data and shows where they occur.

In [None]:
da_corrupt.detect_duplicate()
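
> Duplicate rows can also be located with plain pandas, as in the sketch below. It assumes that "duplicate" means every column matches, which may or may not be how datapilot defines it.

```python
import pandas as pd

# Illustrative values only: rows 0 and 2 are identical.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1],
    "species": ["setosa", "setosa", "setosa"],
})

# keep=False marks every member of a duplicate group, not just the later copies.
dup_mask = sample.duplicated(keep=False)
duplicate_rows = sample.index[dup_mask].tolist()
print(duplicate_rows)
```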

## Custom feature engineering support

> datapilot also supports custom feature engineering. Write a function like the one below, in which each positional parameter is named after a column and the return value is the new feature.

In [None]:
def sepal_length_width(sepal_length, sepal_width):
    return sepal_length + 3 * sepal_width

> Then pass the function to custom_feature_engineering(). The new column is automatically named after the function.

In [None]:
da.custom_feature_engineering(sepal_length_width)
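
> A mechanism like this can be built by reading the function's parameter names and matching them to columns. The sketch below is one possible way to do that with inspect.signature, not datapilot's actual code. `apply_custom_feature` is a hypothetical helper, and returning a copy rather than modifying the data in place is an assumption.

```python
import inspect

import pandas as pd


def apply_custom_feature(df, func):
    """Add a column named after func, computed from the columns its parameters name."""
    params = list(inspect.signature(func).parameters)
    not_found = [p for p in params if p not in df.columns]
    if not_found:
        raise KeyError(f"columns not found: {not_found}")
    result = df.copy()  # leave the input DataFrame untouched
    result[func.__name__] = func(*(df[p] for p in params))
    return result


def sepal_length_width(sepal_length, sepal_width):
    return sepal_length + 3 * sepal_width


# Illustrative values only.
sample = pd.DataFrame({"sepal_length": [5.0, 6.0], "sepal_width": [3.0, 2.0]})
engineered = apply_custom_feature(sample, sepal_length_width)
print(engineered)
```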