Merge pull request #18 from katonic-dev/17-doc-fix-update-the-app-workflow
Subhrajit-Katonic committed Oct 8, 2022
2 parents 59fdba5 + 2b563ed commit e67edf7
Showing 6 changed files with 25 additions and 18 deletions.
8 changes: 4 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -96,17 +96,17 @@ from sklearn.model_selection import train_test_split
cancer_data = datasets.load_breast_cancer()
cancer_dataframe = pd.DataFrame(cancer_data.data, columns = cancer_data.feature_names)
cancer_dataframe['target'] = cancer_data.target
ref_data, cur_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)
ref_data, prod_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)
```

Once you have prepared both the training and testing datasets, pass them into the imported method along with the target column name and target column type (use `cat` for a categorical column and `num` for a numerical column).

```python
build(
reference_data=ref_data,
current_data=cur_data,
target_column_name="target",
target_column_type="cat",
production_data=prod_data,
target_col_name="target",
target_col_type="cat",
host="127.0.0.1",
port=8050
)
10 changes: 5 additions & 5 deletions docs/faq.md
@@ -12,16 +12,16 @@ from explainit.app import build
```
Explainit requires several parameters to be passed to the function in order to run the application.
- `reference_data`: The data on which your machine learning model will get trained (training data).
- `current_data`: The data for which you need the predictions from the model (production data).
- `target_column_name`: The dependent column name from your dataset.
- `target_column_type`: The type of the target column: use `cat` if the target column is categorical, or `num` if it is continuous.
- `datetime_column_name`: Optional datetime column name present in the `reference` & `current` data.
- `production_data`: The data for which you need the predictions from the model (production data).
- `target_col_name`: The dependent column name from your dataset.
- `target_col_type`: The type of the target column: use `cat` if the target column is categorical, or `num` if it is continuous.
- `datetime_col_name`: Optional datetime column name present in the `reference` & `production` data.
- `host`: Optional host address where you want this application to run, in string format (default: 0.0.0.0 or localhost).
- `port`: Optional port where you want this application to run, in integer format (default: 8050).

Once you have defined the parameters for Explainit, simply run the function to start the application.
```python
build(reference_data, current_data, target_column_name, target_column_type, datetime_column_name, host, port)
build(reference_data, production_data, target_col_name, target_col_type, datetime_col_name, host, port)
```

**Q2. How to decide the Statistical-tests and their significance?**
19 changes: 13 additions & 6 deletions docs/getting-started.md
@@ -39,7 +39,7 @@ Once you imported the required libraries, you need to preprocess the data in a f
cancer_data = datasets.load_breast_cancer()
cancer_dataframe = pd.DataFrame(cancer_data.data, columns = cancer_data.feature_names)
cancer_dataframe['target'] = cancer_data.target
ref_data, cur_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)
ref_data, prod_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)
```

You need to pass two dataframes in order to analyze the drift in your data. The first dataframe is the reference, i.e. the data that you'll use for model training. The second dataframe is your production data, on which you want to perform the drift and quality analysis.
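The random `train_test_split` shown above is one way to obtain these two frames. If your data is time-ordered, a date-based split may mirror production behavior more closely — a minimal sketch, where the `timestamp` column and cutoff date are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical time-ordered dataset; column names are illustrative.
df = pd.DataFrame({
    "feature": np.arange(10, dtype=float),
    "timestamp": pd.date_range("2022-01-01", periods=10, freq="D"),
})

# Older rows become the reference data, newer rows the production data.
cutoff = pd.Timestamp("2022-01-08")
ref_data = df[df["timestamp"] < cutoff]
prod_data = df[df["timestamp"] >= cutoff]
print(len(ref_data), len(prod_data))  # 7 3
```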
@@ -63,7 +63,7 @@ In order to initialize the dash application, you need to pass the following parameters
```python
build(
reference_data=ref_data,
production_data=cur_data,
production_data=prod_data,
target_col_name="target",
target_col_type="cat",
host='127.0.0.1',
Expand All @@ -85,6 +85,7 @@ Dash is running on http://127.0.0.1:8000/
Once your application is up and running, go to the host and port that you provided while running the app, or to localhost if you did not specify any.

## Initial Application interface:

![initial_tab](./assets/metrics_row.jpg)

You'll see three tabs in the opening interface. The default tab is Data Drift.
@@ -96,13 +97,14 @@ The following can be found in the above section:
* Column name: Name of the column from the dataframe.
* Type: The type of the column.
* Reference Distribution: Distribution of the column from the reference data.
* Current Distribution: Distribution of the column from the production data.
* Production Distribution: Distribution of the column from the production data.
* Threshold: Threshold of the statistical test.
* Stat-test: The statistical test used to analyze the drift.
* P-Value: The resulting metric of the statistical test.
* Drift/No Drift: An indicator of whether drift was detected for that particular column.
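For a numerical column, the Stat-test / P-Value / Drift columns of this table can be reproduced in spirit with a two-sample Kolmogorov-Smirnov test. This is only a sketch — Explainit's actual choice of tests and thresholds may differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=500)   # training-time distribution
production = rng.normal(loc=0.5, scale=1.0, size=500)  # shifted production distribution

# Small p-value -> the two samples are unlikely to come from the
# same distribution -> flag drift for this column.
statistic, p_value = stats.ks_2samp(reference, production)
threshold = 0.05  # illustrative significance level
print("Drift" if p_value < threshold else "No Drift")  # Drift
```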

## Column Drift and Distribution along with the Standard Deviation.
## Column Drift and Distribution along with the Standard Deviation:

![drift_and_distr_graphs](./assets/drift_dist_graphs.jpg)

Choose the desired column and the standard deviation from the dropdowns in order to visualize the distribution and drift in comparison with the reference data.
@@ -112,34 +114,39 @@ The Drift Graph (right graph) helps you understand the production data drift.
The more red-colored points lie outside the light green area, the more drift exists in your production data.


## Target Drift.
## Target Drift:
![target_graph](./assets/target_graph.jpg)

This tab gives a brief understanding of the target column. The heading states whether drift occurred in the target column, along with the statistical test and p-value.

Apart from the target graph, you'll get to see how the target column behaves with respect to an individual feature.
Choose the feature from the dropdown.

![target_behaviour](./assets/target_behavior_based_on_featurejpg.jpg)

## Data Quality.
## Data Quality:
The Data Quality tab contains three sub-tabs consisting of the data summary, feature summary and correlation analysis. The default tab will be Data Summary.

![data_summary](./assets/data_summary.jpg)

Data Summary helps you understand the basic statistical information regarding your reference and production data. It contains information like total count, missing observations, number of numerical features, number of empty features, etc.
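These summary statistics can be approximated with plain pandas. A sketch — the columns here are made up, and Explainit's exact definitions may differ:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mean radius": [1.0, 2.5, np.nan, 4.0],
    "target": [0, 1, 0, 1],
})

summary = {
    "count": len(df),                                           # total rows
    "missing observations": int(df.isna().sum().sum()),         # NaN cells
    "numerical features": df.select_dtypes("number").shape[1],  # numeric columns
    "empty features": int(df.isna().all().sum()),               # all-NaN columns
}
print(summary)
# {'count': 4, 'missing observations': 1, 'numerical features': 2, 'empty features': 0}
```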

When you choose the feature summary tab:

![feature_summary](./assets/feature_summary.jpg)

Feature Summary provides insights about an individual feature. Select the desired feature from the dropdown.

It provides the count, number of unique values (with the ratio), most common value, and missing values. Apart from the basic info, you can see a detailed distribution of the feature for both reference and production data.
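The per-feature numbers can likewise be sketched with pandas. The values below are illustrative, and the ratio definition is an assumption on my part:

```python
import numpy as np
import pandas as pd

feature = pd.Series([1.0, 2.0, 2.0, 2.0, np.nan, 3.0], name="mean radius")

count = int(feature.count())          # non-missing values: 5
unique = int(feature.nunique())       # distinct values: 3
unique_ratio = unique / count         # 0.6
most_common = feature.mode().iloc[0]  # 2.0
missing = int(feature.isna().sum())   # 1
print(count, unique, unique_ratio, most_common, missing)
```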

Correlation tab:

![correlation_table](./assets/correlation_table.jpg)

This tab provides the correlation between features, along with the top 5 correlated features and their differences.

Apart from the correlation table, there are correlation heatmaps for three different correlation methods, i.e., Pearson, Spearman, and Kendall.

Choose the desired correlation method from the radio buttons.

![correlation_heatmaps](./assets/correlation_heatmaps.jpg)
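The three heatmap methods correspond to pandas' built-in correlation options. A minimal sketch on synthetic data (the column names are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["d"] = 2 * df["a"] + rng.normal(scale=0.1, size=100)  # "d" tracks "a" closely

# One correlation matrix per method, as in the heatmaps above.
for method in ("pearson", "spearman", "kendall"):
    corr = df.corr(method=method)
    print(method, round(float(corr.loc["a", "d"]), 2))
```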
4 changes: 2 additions & 2 deletions examples/getting-started.ipynb
@@ -63,7 +63,7 @@
"cancer_data = datasets.load_breast_cancer()\n",
"cancer_dataframe = pd.DataFrame(cancer_data.data, columns = cancer_data.feature_names)\n",
"cancer_dataframe['target'] = cancer_data.target\n",
"ref_data, cur_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)"
"ref_data, prod_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)"
]
},
{
@@ -95,7 +95,7 @@
"source": [
"build(\n",
" reference_data=ref_data,\n",
" production_data=cur_data,\n",
" production_data=prod_data,\n",
" target_col_name=\"target\",\n",
" target_col_type=\"cat\",\n",
" host='127.0.0.1',\n",
2 changes: 1 addition & 1 deletion explainit/assets/custom-styles.css
@@ -835,7 +835,7 @@ body {
}

.metric-row>div:nth-of-type(4):before {
content: "Current Distribution";
content: "Production Distribution";
}

.metric-row>div:nth-of-type(5) {
Binary file modified workflow/architecture.png
