# Data Exploration
- This notebook performs exploratory data analysis on the dataset.
- To expand on the analysis, attach this notebook to a cluster with runtime version **15.4.x-cpu-ml-scala2.12**,
edit [the options of ydata-profiling (formerly pandas-profiling)](https://pandas-profiling.ydata.ai/docs/master/rtd/pages/advanced_usage.html), and rerun it.
- Explore completed trials in the [MLflow experiment](#mlflow/experiments/4037905021133080).

In [0]:
import mlflow
import os
import uuid
import shutil
import pandas as pd
import databricks.automl_runtime

# Download the input data from MLflow into a pandas DataFrame.
# Create a temporary directory to hold the downloaded artifact.
temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], "tmp", str(uuid.uuid4())[:8])
os.makedirs(temp_dir)

try:
    # Download the artifact and read it
    training_data_path = mlflow.artifacts.download_artifacts(run_id="5c3b42cd9b08489aa4b4e4b9c8fa27db", artifact_path="data", dst_path=temp_dir)
    df = pd.read_parquet(os.path.join(training_data_path, "training_data"))
finally:
    # Delete the temporary data, even if the download or read fails
    shutil.rmtree(temp_dir)

target_col = "UHII"

# Drop the columns created by AutoML and the user-specified sample weight column (if applicable) before profiling
df = df.drop(['_automl_split_col_0000'], axis=1)

## Semantic Type Detection Alerts

For details about the definition of the semantic types and how to override the detection, see
[Databricks documentation on semantic type detection](https://docs.microsoft.com/azure/databricks/applications/machine-learning/automl#semantic-type-detection).

- Semantic type `categorical` detected for column `is_night_time`. Training notebooks will encode features based on categorical transformations.
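
As a rough illustration of what treating `is_night_time` as categorical can look like, the sketch below casts a flag column to the pandas `category` dtype and one-hot encodes it. The toy DataFrame and the `temperature` column are hypothetical, and `pd.get_dummies` is only a stand-in: the generated training notebooks may use a different encoder.

```python
import pandas as pd

# Hypothetical toy data; only the column name `is_night_time` comes from the alert above.
toy = pd.DataFrame(
    {
        "is_night_time": [0, 1, 1, 0],
        "temperature": [21.5, 18.0, 17.2, 23.1],
    }
)

# Treat the flag as a categorical feature rather than a numeric one.
toy["is_night_time"] = toy["is_night_time"].astype("category")

# One-hot encode the categorical column (illustrative stand-in for the
# categorical transformations applied by the training notebooks).
encoded = pd.get_dummies(toy, columns=["is_night_time"], dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so a model does not read the flag as an ordered numeric quantity.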

## Additional AutoML Alerts

- Column `features` is not supported and will be dropped before running the data exploration and training notebooks.

## Truncate rows
Only the first 10,000 rows are passed to the profiler to avoid out-of-memory issues.
Comment out the next cell and rerun the notebook to profile the full dataset.

In [0]:
df = df.iloc[:10000, :]
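
If the rows are stored in a meaningful order (for example by time or location), the first 10,000 rows may not represent the whole dataset. A possible alternative, sketched below, is a reproducible random sample under the same row cap. The helper `sample_for_profiling`, the `MAX_ROWS` constant, and the demo DataFrame are hypothetical names introduced for illustration; they are not part of the generated notebook.

```python
import pandas as pd

MAX_ROWS = 10000  # same cap as the truncation cell above


def sample_for_profiling(frame: pd.DataFrame, max_rows: int = MAX_ROWS, seed: int = 0) -> pd.DataFrame:
    """Return at most `max_rows` rows, drawn at random without replacement."""
    if len(frame) <= max_rows:
        # Small enough already: keep every row.
        return frame
    # A fixed seed keeps the sample reproducible across reruns.
    return frame.sample(n=max_rows, random_state=seed)


# Hypothetical demo frame standing in for the real `df`.
demo = pd.DataFrame({"x": range(25000)})
sampled = sample_for_profiling(demo)
print(len(sampled))
```

In the notebook itself this would amount to replacing the truncation line with `df = sample_for_profiling(df)`, at the cost of losing the original row order.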

## Profiling Results

In [0]:
from ydata_profiling import ProfileReport
df_profile = ProfileReport(df, minimal=True, title="Profiling Report", progress_bar=False, infer_dtypes=False)
profile_html = df_profile.to_html()

displayHTML(profile_html)