diff --git a/README.md b/README.md
index ef118d2..44093fc 100644
--- a/README.md
+++ b/README.md
@@ -96,7 +96,7 @@ from sklearn.model_selection import train_test_split
 cancer_data = datasets.load_breast_cancer()
 cancer_dataframe = pd.DataFrame(cancer_data.data, columns = cancer_data.feature_names)
 cancer_dataframe['target'] = cancer_data.target
-ref_data, cur_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)
+ref_data, prod_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)
 ```
 
-Once you prepared the both training and testing datasets, all you need to do is pass those datasets into the method that we imported along with the target column name and target column type (type should be `cat` for categorical column and `num` for numerical columns).
+Once you have prepared both the training and testing datasets, pass them to the imported method along with the target column name and target column type (use `cat` for a categorical column and `num` for a numerical column).
@@ -104,9 +104,9 @@
 ```python
 build(
     reference_data=ref_data,
-    current_data=cur_data,
-    target_column_name="target",
-    target_column_type="cat",
+    production_data=prod_data,
+    target_col_name="target",
+    target_col_type="cat",
     host="127.0.0.1",
     port=8050
 )
diff --git a/docs/faq.md b/docs/faq.md
index 03bf5e0..7915f0d 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -12,16 +12,16 @@ from explainit.app import build
 ```
 
 Explainit requires several parameters to be passed to the function in order to run the application.
 - `reference_data`: The data on which your machine learning model will get trained (training data).
- - `current_data`: The data for which you need the predictions from the model (production data).
- - `target_column_name`: The dependent column name from your dataset.
- - `target_column_type`: What type of category that the target column is, use `cat` if your target column is a categorical or use `num` for continuous column.
- - `datetime_column_name`: Optional datetime column name present in the `reference` & `current` data.
+ - `production_data`: The data for which you need the predictions from the model (production data).
+ - `target_col_name`: The dependent column name from your dataset.
+ - `target_col_type`: The type of the target column: use `cat` if the target column is categorical or `num` if it is continuous.
+ - `datetime_col_name`: Optional datetime column name present in the `reference` & `production` data.
 - `host`: Optional host address where you want this application to run, in string format (default: 0.0.0.0 or localhost).
 - `port`: Optional port where you want this application to run, in integer format (default: 8050).
 
-Once you define the parameters for the Explainit, simply run the function in order to start the application.
+Once you define the parameters for Explainit, simply run the function in order to start the application.
 
 ```python
-build(reference_data, current_data, target_column_name, target_column_type, datetime_column_name, host, port)
+build(reference_data, production_data, target_col_name, target_col_type, datetime_col_name, host, port)
 ```
 
 **Q2. How to decide the Statistical-tests and their significance?**
diff --git a/docs/getting-started.md b/docs/getting-started.md
index e91bd6c..b5c77b4 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -39,7 +39,7 @@ Once you imported the required libraries, you need to preprocess the data in a f
 cancer_data = datasets.load_breast_cancer()
 cancer_dataframe = pd.DataFrame(cancer_data.data, columns = cancer_data.feature_names)
-cancer_dataframe['target'] = iris.target
-ref_data, cur_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)
+cancer_dataframe['target'] = cancer_data.target
+ref_data, prod_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)
 ```
 
-You need to pass two dataframes in order to analyze the drift inside your data. First dataframe is used for the reference which is nothing but the data that you'll use for model training. The second dataframe is your production data on which you want to perform the drift and quality analysis.
+You need to pass two dataframes in order to analyze the drift inside your data. The first dataframe is the reference data, which you'll use for model training. The second dataframe is your production data, on which you want to perform the drift and quality analysis.
@@ -63,7 +63,7 @@ In order to Initialize the dash application, you need to pass following paramete
 ```python
 build(
     reference_data=ref_data,
-    production_data=cur_data,
+    production_data=prod_data,
     target_col_name="target",
     target_col_type="cat",
     host='127.0.0.1',
@@ -85,6 +85,7 @@ Dash is running on http://127.0.0.1:8000/
 
-Once your application is up and running. Go to the location that you've provided while running the app, go to localhost if you've not specified any host or port.
+Once your application is up and running, go to the location that you've provided while running the app, or to localhost if you've not specified any host or port.
 
 ## Initial Application interface:
+
 ![initial_tab](./assets/metrics_row.jpg)
 
 You'll get to see three tabs inside the opening interface. The default tab will be data drift.
@@ -96,13 +97,14 @@ Following can be found in the above section:
 * Column name: Name of the column from the dataframe.
 * Type: what type of column it is.
 * Reference Distribution: Distribution of the column from the reference data.
-* Current Distribution: Distribution of the column from the production data.
+* Production Distribution: Distribution of the column from the production data.
 * Threshold: Threshold of the Statistical test.
-* Stat-test: The statistical test, that used to analyze the drift.
+* Stat-test: The statistical test used to analyze the drift.
 * P-Value: Result metric of the Statistics test.
 * Drift/No Drift: An indicator which will differentiate between whether drift detected or not for that particular column.
 
-## Column Drift and Distritution along with the Standard Deviation.
+## Column Drift and Distribution along with the Standard Deviation:
+
 ![drift_and_distr_graphs](./assets/drift_dist_graphs.jpg)
 
-Choose the desired column and the standard deviation from the dropdowns, In order to visualize the Distribution and Drift in comparision with reference data.
+Choose the desired column and the standard deviation from the dropdowns in order to visualize the distribution and drift in comparison with the reference data.
@@ -112,22 +114,25 @@ In the Drift Garph (Right Graph) will helps you understand the production data d
 
-More the Red colured points are outside the light green area, that much drift is existed in your production data.
+The more red points fall outside the light green area, the more drift exists in your production data.
 
-## Target Drift.
+## Target Drift:
 ![target_graph](./assets/target_graph.jpg)
 
-This tab gives a bried understanding of the target column. The heading informs whether drift occurs in target column along with the statistical test and p-value.
+This tab gives a brief understanding of the target column. The heading states whether drift occurred in the target column, along with the statistical test and p-value.
 
 Apart from the target graph, you'll get to see how the target column is behaving with respect to the individual feature. Choose the feature from the Dropdown.
+
 ![target_behaviour](./assets/target_behavior_based_on_featurejpg.jpg)
 
-## Data Quality.
+## Data Quality:
 The Data Quality tab contains three sub-tabs consisting of the data summary, feature summary and correlation analysis. The default tab will be Data Summary.
+
 ![data_summary](./assets/data_summary.jpg)
 
 Data Summary helps you understand the basic statistical information regarding your reference and production data. It contains information like total count, missing observations, number of numerical features, number of empty features, etc.
 
 When you choose the feature summary tab:
+
 ![feature_summary](./assets/feature_summary.jpg)
 
 Feature Summary provides the insights about the individual feature. You need to select the feature from dropdown in order to choose the desired feature.
@@ -135,6 +140,7 @@
 It provides count, number of unique values(with the ratio), most common value and missing values. Apart from basic info you can see a detailed distribution of feature for both reference and production data.
 
 Correlation tab:
+
 ![correlation_table](./assets/correlation_table.jpg)
 
-This tab provides the correlation between the feature along with the top 5 correlated features along with the difference.
+This tab provides the correlations between the features, along with the top 5 correlated features and their correlation differences.
@@ -142,4 +148,5 @@
-Apart from the correlation table, there will be correlation heatmaps in three different correlation methods i.e, Pearson, Spearman and Kendall. Choose the desired correlation method from the radio buttons.
+Apart from the correlation table, there are correlation heatmaps for three different correlation methods, i.e., Pearson, Spearman and Kendall. Choose the desired correlation method from the radio buttons.
+
 ![correlation_heatmaps](./assets/correlation_heatmaps.jpg)
diff --git a/examples/getting-started.ipynb b/examples/getting-started.ipynb
index ac2b97d..e940faa 100644
--- a/examples/getting-started.ipynb
+++ b/examples/getting-started.ipynb
@@ -63,7 +63,7 @@
 "cancer_data = datasets.load_breast_cancer()\n",
 "cancer_dataframe = pd.DataFrame(cancer_data.data, columns = cancer_data.feature_names)\n",
 "cancer_dataframe['target'] = cancer_data.target\n",
-"ref_data, cur_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)"
+"ref_data, prod_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)"
 ]
 },
 {
@@ -95,7 +95,7 @@
 "source": [
 "build(\n",
 " reference_data=ref_data,\n",
-" production_data=cur_data,\n",
+" production_data=prod_data,\n",
 " target_col_name=\"target\",\n",
 " target_col_type=\"cat\",\n",
 " host='127.0.0.1',\n",
diff --git a/explainit/assets/custom-styles.css b/explainit/assets/custom-styles.css
index c3eb5b5..a4cb67e 100644
--- a/explainit/assets/custom-styles.css
+++ b/explainit/assets/custom-styles.css
@@ -835,7 +835,7 @@ body {
 }
 
 .metric-row>div:nth-of-type(4):before {
-  content: "Current Distribution";
+  content: "Production Distribution";
 }
 
 .metric-row>div:nth-of-type(5) {
diff --git a/workflow/architecture.png b/workflow/architecture.png
index 56db358..e80bb8b 100644
Binary files a/workflow/architecture.png and b/workflow/architecture.png differ
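For reference, the renamed snippets touched by this diff assemble into the following end-to-end sketch. The `build(...)` call is shown commented out because it starts a blocking Dash server and requires `explainit` to be installed; its keyword names are exactly the ones this diff introduces:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the breast-cancer dataset used throughout the docs.
cancer_data = datasets.load_breast_cancer()
cancer_dataframe = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
cancer_dataframe["target"] = cancer_data.target

# Split into reference (training) and production (serving) frames.
ref_data, prod_data = train_test_split(cancer_dataframe, train_size=0.80, shuffle=True)
print(ref_data.shape[0], prod_data.shape[0])

# Launching the dashboard (blocking call; needs explainit installed):
# from explainit.app import build
# build(
#     reference_data=ref_data,
#     production_data=prod_data,
#     target_col_name="target",
#     target_col_type="cat",
#     host="127.0.0.1",
#     port=8050,
# )
```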