# Data Exploration
- This notebook performs exploratory data analysis on the dataset.
- To expand on the analysis, attach this notebook to a cluster with runtime version **15.4.x-cpu-ml-scala2.12**,
edit [the options of pandas-profiling](https://pandas-profiling.ydata.ai/docs/master/rtd/pages/advanced_usage.html), and rerun it.
- Explore completed trials in the [MLflow experiment](#mlflow/experiments/4412028206669702).

In [0]:
import mlflow
import os
import uuid
import shutil
import pandas as pd
import databricks.automl_runtime

# Download input data from mlflow into a pandas DataFrame
# Create temporary directory to download data
temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], "tmp", str(uuid.uuid4())[:8])
os.makedirs(temp_dir)

# Download the artifact and read it
training_data_path = mlflow.artifacts.download_artifacts(run_id="464684a9dd0949dba2e12ea3052beec4", artifact_path="data", dst_path=temp_dir)
df = pd.read_parquet(os.path.join(training_data_path, "training_data"))

# Delete the temporary data
shutil.rmtree(temp_dir)

target_col = "Churn"

# Drop columns created by AutoML and user-specified sample weight column (if applicable) before pandas-profiling
df = df.drop(['_automl_split_col_0000'], axis=1)
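
The cell above creates a scratch directory and deletes it by hand, so the directory is left behind if the download or the read raises. A minimal sketch of the same pattern with `tempfile.TemporaryDirectory`, which removes the directory even on failure, could look like this. `fake_download` and `fake_read` are hypothetical stand-ins for `mlflow.artifacts.download_artifacts` and `pd.read_parquet`; they are not part of the notebook.

```python
import os
import tempfile


def load_with_cleanup(download, read, base_dir=None):
    """Download into a scratch directory, read the result, then clean up.

    `download(dst_path)` must return the local path of the downloaded data and
    `read(path)` must fully load it into memory before returning.
    """
    with tempfile.TemporaryDirectory(dir=base_dir) as temp_dir:
        local_path = download(temp_dir)
        data = read(local_path)
    # The context manager has already removed temp_dir at this point.
    return data, temp_dir


# Hypothetical stand-ins so the sketch runs without MLflow or a cluster.
def fake_download(dst_path):
    path = os.path.join(dst_path, "training_data.txt")
    with open(path, "w") as handle:
        handle.write("Churn\nNo\nYes\n")
    return path


def fake_read(path):
    with open(path) as handle:
        return handle.read().splitlines()


rows, used_dir = load_with_cleanup(fake_download, fake_read)
```

Passing `base_dir=os.environ["SPARK_LOCAL_DIRS"]` would keep the scratch directory on the same volume the notebook uses; by default the system temporary directory is used.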


## Semantic Type Detection Alerts

For details about the definition of the semantic types and how to override the detection, see
[Databricks documentation on semantic type detection](https://docs.databricks.com/applications/machine-learning/automl.html#semantic-type-detection).

- Semantic type `categorical` detected for column `SeniorCitizen`. Training notebooks will encode features based on categorical transformations.

## Profiling Results

In [0]:
from ydata_profiling import ProfileReport
df_profile = ProfileReport(df,
                           correlations={
                               "auto": {"calculate": True},
                               "pearson": {"calculate": True},
                               "spearman": {"calculate": True},
                               "kendall": {"calculate": True},
                               "phi_k": {"calculate": True},
                               "cramers": {"calculate": True},
                           }, title="Profiling Report", progress_bar=False, infer_dtypes=False)
profile_html = df_profile.to_html()

displayHTML(profile_html)



Statistic,Value
Number of variables,7
Number of observations,6976
Missing cells,0
Missing cells (%),0.0%
Duplicate rows,658
Duplicate rows (%),9.4%
Total size in memory,354.4 KiB
Average record size in memory,52.0 B

Variable type,Count
Text,5
Numeric,2

Alert,Type
Dataset has 658 (9.4%) duplicate rows,Duplicates
PhoneService is highly overall correlated with MultipleLines and 1 other fields,High correlation
MultipleLines is highly overall correlated with PhoneService and 1 other fields,High correlation
MonthlyCharges is highly overall correlated with PhoneService and 1 other fields,High correlation
SeniorCitizen has 5847 (83.8%) zeros,Zeros

Property,Value
Analysis started,2024-11-28 23:04:43.567475
Analysis finished,2024-11-28 23:04:48.664354
Duration,5.1 seconds
Software version,ydata-profiling v4.5.1

### Gender

Statistic,Value
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,54.6 KiB

Statistic,Value
Max length,6.0
Median length,4.0
Mean length,4.9919725
Min length,4.0

Statistic,Value
Total characters,34824
Distinct characters,6
Distinct categories,2
Distinct scripts,1
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Sample,Value
1st row,Female
2nd row,Female
3rd row,Male
4th row,Female
5th row,Female

Value,Count,Frequency (%)
male,3516,50.4%
female,3460,49.6%

Value,Count,Frequency (%)
e,10436,30.0%
a,6976,20.0%
l,6976,20.0%
M,3516,10.1%
F,3460,9.9%
m,3460,9.9%

Value,Count,Frequency (%)
Lowercase Letter,27848,80.0%
Uppercase Letter,6976,20.0%

Value,Count,Frequency (%)
e,10436,37.5%
a,6976,25.1%
l,6976,25.1%
m,3460,12.4%

Value,Count,Frequency (%)
M,3516,50.4%
F,3460,49.6%

Value,Count,Frequency (%)
Latin,34824,100.0%

Value,Count,Frequency (%)
e,10436,30.0%
a,6976,20.0%
l,6976,20.0%
M,3516,10.1%
F,3460,9.9%
m,3460,9.9%

Value,Count,Frequency (%)
ASCII,34824,100.0%

Value,Count,Frequency (%)
e,10436,30.0%
a,6976,20.0%
l,6976,20.0%
M,3516,10.1%
F,3460,9.9%
m,3460,9.9%
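
The length statistics reported for this text column follow directly from its value counts. The sketch below reproduces them, assuming the column holds only the strings `Male` and `Female` with the counts implied by the uppercase-letter table (3516 `M`, 3460 `F`); that reading of the tables is an assumption, not something the report states outright.

```python
from statistics import mean, median

# Counts inferred from the uppercase-letter frequencies above (assumption).
value_counts = {"Male": 3516, "Female": 3460}

# One string length per row of the column.
lengths = [len(value) for value, count in value_counts.items() for _ in range(count)]

total_characters = sum(lengths)
mean_length = mean(lengths)
median_length = median(lengths)
min_length, max_length = min(lengths), max(lengths)
```

Under that assumption the results agree with the reported total of 34824 characters, mean length of about 4.99, and median length of 4.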

### SeniorCitizen

Statistic,Value
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,0.1618406

Statistic,Value
Minimum,0
Maximum,1
Zeros,5847
Zeros (%),83.8%
Negative,0
Negative (%),0.0%
Memory size,27.4 KiB

Statistic,Value
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,1
Maximum,1
Range,1
Interquartile range (IQR),0

Statistic,Value
Standard deviation,0.36833092
Coefficient of variation (CV),2.2758871
Kurtosis,1.3738543
Mean,0.1618406
Median Absolute Deviation (MAD),0
Skewness,1.8366983
Sum,1129
Variance,0.13566767
Monotonicity,Not monotonic

Value,Count,Frequency (%)
0,5847,83.8%
1,1129,16.2%

Value,Count,Frequency (%)
0,5847,83.8%
1,1129,16.2%

Value,Count,Frequency (%)
1,1129,16.2%
0,5847,83.8%
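
Because this column only takes the values 0 and 1, its mean, variance, standard deviation, and coefficient of variation can be recomputed from the two counts alone. The sketch below assumes the report uses the sample variance (denominator `n - 1`), which is the pandas default; the report itself does not say so.

```python
import math

# Counts from the value-count table above.
n_zeros, n_ones = 5847, 1129
n = n_zeros + n_ones

# For 0/1 data the mean is the share of ones.
mean = n_ones / n

# sum((x - mean)^2) simplifies to n_ones * (1 - mean) when x is 0 or 1.
# Dividing by n - 1 gives the sample variance (assumed convention).
variance = n_ones * (1 - mean) / (n - 1)
std = math.sqrt(variance)

# Coefficient of variation: standard deviation relative to the mean.
cv = std / mean
```

With these counts the values agree with the reported mean, variance, standard deviation, and CV to the printed precision.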

### PhoneService

Statistic,Value
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,54.6 KiB

Statistic,Value
Max length,3.0
Median length,3.0
Mean length,2.9030963
Min length,2.0

Statistic,Value
Total characters,20252
Distinct characters,5
Distinct categories,2
Distinct scripts,1
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Sample,Value
1st row,Yes
2nd row,Yes
3rd row,Yes
4th row,Yes
5th row,Yes

Value,Count,Frequency (%)
yes,6300,90.3%
no,676,9.7%

Value,Count,Frequency (%)
Y,6300,31.1%
e,6300,31.1%
s,6300,31.1%
N,676,3.3%
o,676,3.3%

Value,Count,Frequency (%)
Lowercase Letter,13276,65.6%
Uppercase Letter,6976,34.4%

Value,Count,Frequency (%)
e,6300,47.5%
s,6300,47.5%
o,676,5.1%

Value,Count,Frequency (%)
Y,6300,90.3%
N,676,9.7%

Value,Count,Frequency (%)
Latin,20252,100.0%

Value,Count,Frequency (%)
Y,6300,31.1%
e,6300,31.1%
s,6300,31.1%
N,676,3.3%
o,676,3.3%

Value,Count,Frequency (%)
ASCII,20252,100.0%

Value,Count,Frequency (%)
Y,6300,31.1%
e,6300,31.1%
s,6300,31.1%
N,676,3.3%
o,676,3.3%

### MultipleLines

Statistic,Value
Distinct,3
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,54.6 KiB

Statistic,Value
Max length,16.0
Median length,3.0
Mean length,3.7785264
Min length,2.0

Statistic,Value
Total characters,26359
Distinct characters,13
Distinct categories,3
Distinct scripts,2
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Sample,Value
1st row,No
2nd row,No
3rd row,Yes
4th row,No
5th row,No

Value,Count,Frequency (%)
no,4033,48.4%
yes,2943,35.3%
phone,676,8.1%
service,676,8.1%

Value,Count,Frequency (%)
e,4971,18.9%
o,4709,17.9%
N,4033,15.3%
s,3619,13.7%
Y,2943,11.2%
,1352,5.1%
p,676,2.6%
h,676,2.6%
n,676,2.6%
r,676,2.6%

Value,Count,Frequency (%)
Lowercase Letter,18031,68.4%
Uppercase Letter,6976,26.5%
Space Separator,1352,5.1%

Value,Count,Frequency (%)
e,4971,27.6%
o,4709,26.1%
s,3619,20.1%
p,676,3.7%
h,676,3.7%
n,676,3.7%
r,676,3.7%
v,676,3.7%
i,676,3.7%
c,676,3.7%

Value,Count,Frequency (%)
N,4033,57.8%
Y,2943,42.2%

Value,Count,Frequency (%)
,1352,100.0%

Value,Count,Frequency (%)
Latin,25007,94.9%
Common,1352,5.1%

Value,Count,Frequency (%)
e,4971,19.9%
o,4709,18.8%
N,4033,16.1%
s,3619,14.5%
Y,2943,11.8%
p,676,2.7%
h,676,2.7%
n,676,2.7%
r,676,2.7%
v,676,2.7%

Value,Count,Frequency (%)
,1352,100.0%

Value,Count,Frequency (%)
ASCII,26359,100.0%

Value,Count,Frequency (%)
e,4971,18.9%
o,4709,17.9%
N,4033,15.3%
s,3619,13.7%
Y,2943,11.2%
,1352,5.1%
p,676,2.6%
h,676,2.6%
n,676,2.6%
r,676,2.6%

### PaymentMethod

Statistic,Value
Distinct,4
Distinct (%),0.1%
Missing,0
Missing (%),0.0%
Memory size,54.6 KiB

Statistic,Value
Max length,25.0
Median length,23.0
Mean length,18.563073
Min length,12.0

Statistic,Value
Total characters,129496
Distinct characters,23
Distinct categories,5
Distinct scripts,2
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Sample,Value
1st row,Mailed check
2nd row,Mailed check
3rd row,Credit card (automatic)
4th row,Electronic check
5th row,Electronic check

Value,Count,Frequency (%)
check,3946,23.2%
automatic,3030,17.8%
electronic,2350,13.8%
mailed,1596,9.4%
bank,1527,9.0%
transfer,1527,9.0%
credit,1503,8.9%
card,1503,8.9%

Value,Count,Frequency (%)
c,17125,13.2%
a,12213,9.4%
t,11440,8.8%
e,10922,8.4%
,10006,7.7%
i,8479,6.5%
r,8410,6.5%
k,5473,4.2%
n,5404,4.2%
o,5380,4.2%

Value,Count,Frequency (%)
Lowercase Letter,106454,82.2%
Space Separator,10006,7.7%
Uppercase Letter,6976,5.4%
Open Punctuation,3030,2.3%
Close Punctuation,3030,2.3%

Value,Count,Frequency (%)
c,17125,16.1%
a,12213,11.5%
t,11440,10.7%
e,10922,10.3%
i,8479,8.0%
r,8410,7.9%
k,5473,5.1%
n,5404,5.1%
o,5380,5.1%
d,4602,4.3%

Value,Count,Frequency (%)
E,2350,33.7%
M,1596,22.9%
B,1527,21.9%
C,1503,21.5%

Value,Count,Frequency (%)
,10006,100.0%

Value,Count,Frequency (%)
(,3030,100.0%

Value,Count,Frequency (%)
),3030,100.0%

Value,Count,Frequency (%)
Latin,113430,87.6%
Common,16066,12.4%

Value,Count,Frequency (%)
c,17125,15.1%
a,12213,10.8%
t,11440,10.1%
e,10922,9.6%
i,8479,7.5%
r,8410,7.4%
k,5473,4.8%
n,5404,4.8%
o,5380,4.7%
d,4602,4.1%

Value,Count,Frequency (%)
,10006,62.3%
(,3030,18.9%
),3030,18.9%

Value,Count,Frequency (%)
ASCII,129496,100.0%

Value,Count,Frequency (%)
c,17125,13.2%
a,12213,9.4%
t,11440,8.8%
e,10922,8.4%
,10006,7.7%
i,8479,6.5%
r,8410,6.5%
k,5473,4.2%
n,5404,4.2%
o,5380,4.2%

### MonthlyCharges

Statistic,Value
Distinct,1580
Distinct (%),22.6%
Missing,0
Missing (%),0.0%
Infinite,0
Infinite (%),0.0%
Mean,64.743184

Statistic,Value
Minimum,18.25
Maximum,118.75
Zeros,0
Zeros (%),0.0%
Negative,0
Negative (%),0.0%
Memory size,54.6 KiB

Statistic,Value
Minimum,18.25
5-th percentile,19.65
Q1,35.5
Median,70.3
Q3,89.85
95-th percentile,107.45
Maximum,118.75
Range,100.5
Interquartile range (IQR),54.35

Statistic,Value
Standard deviation,30.084535
Coefficient of variation (CV),0.46467493
Kurtosis,-1.2567989
Mean,64.743184
Median Absolute Deviation (MAD),24.1
Skewness,-0.21982968
Sum,451648.45
Variance,905.07922
Monotonicity,Not monotonic

Value,Count,Frequency (%)
20.05,60,0.9%
19.85,45,0.6%
19.95,44,0.6%
19.9,44,0.6%
20,43,0.6%
19.65,43,0.6%
19.7,43,0.6%
19.55,40,0.6%
20.15,40,0.6%
20.25,39,0.6%

Value,Count,Frequency (%)
18.25,1,< 0.1%
18.4,1,< 0.1%
18.55,1,< 0.1%
18.7,2,< 0.1%
18.75,1,< 0.1%
18.8,7,0.1%
18.85,5,0.1%
18.9,2,< 0.1%
18.95,6,0.1%
19.0,7,0.1%

Value,Count,Frequency (%)
118.75,1,< 0.1%
118.65,1,< 0.1%
118.6,2,< 0.1%
118.35,1,< 0.1%
118.2,1,< 0.1%
117.8,1,< 0.1%
117.6,1,< 0.1%
117.5,1,< 0.1%
117.45,1,< 0.1%
117.35,1,< 0.1%
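
Several of the MonthlyCharges statistics are derived from others in the same tables, which makes them easy to cross-check. The short sketch below recomputes the mean, range, interquartile range, CV, and variance from the reported sum, extremes, quartiles, and standard deviation; all inputs are copied from the tables above, and small differences are expected because the report rounds its output.

```python
# Inputs copied from the MonthlyCharges tables above.
n = 6976
total = 451648.45
minimum, maximum = 18.25, 118.75
q1, q3 = 35.5, 89.85
std = 30.084535

mean = total / n                 # Mean = Sum / number of observations
value_range = maximum - minimum  # Range = Maximum - Minimum
iqr = q3 - q1                    # IQR = Q3 - Q1
cv = std / mean                  # CV = standard deviation / mean
variance = std ** 2              # Variance = standard deviation squared
```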

### Churn

Statistic,Value
Distinct,2
Distinct (%),< 0.1%
Missing,0
Missing (%),0.0%
Memory size,54.6 KiB

Statistic,Value
Max length,3.0
Median length,2.0
Mean length,2.2653383
Min length,2.0

Statistic,Value
Total characters,15803
Distinct characters,5
Distinct categories,2
Distinct scripts,1
Distinct blocks,1

Statistic,Value
Unique,0
Unique (%),0.0%

Sample,Value
1st row,No
2nd row,Yes
3rd row,No
4th row,No
5th row,No

Value,Count,Frequency (%)
no,5125,73.5%
yes,1851,26.5%

Value,Count,Frequency (%)
N,5125,32.4%
o,5125,32.4%
Y,1851,11.7%
e,1851,11.7%
s,1851,11.7%

Value,Count,Frequency (%)
Lowercase Letter,8827,55.9%
Uppercase Letter,6976,44.1%

Value,Count,Frequency (%)
o,5125,58.1%
e,1851,21.0%
s,1851,21.0%

Value,Count,Frequency (%)
N,5125,73.5%
Y,1851,26.5%

Value,Count,Frequency (%)
Latin,15803,100.0%

Value,Count,Frequency (%)
N,5125,32.4%
o,5125,32.4%
Y,1851,11.7%
e,1851,11.7%
s,1851,11.7%

Value,Count,Frequency (%)
ASCII,15803,100.0%

Value,Count,Frequency (%)
N,5125,32.4%
o,5125,32.4%
Y,1851,11.7%
e,1851,11.7%
s,1851,11.7%

Variable,SeniorCitizen,MonthlyCharges
SeniorCitizen,1.0,0.222
MonthlyCharges,0.222,1.0

Variable,SeniorCitizen,MonthlyCharges
SeniorCitizen,1.0,0.221
MonthlyCharges,0.221,1.0

Variable,SeniorCitizen,MonthlyCharges
SeniorCitizen,1.0,0.222
MonthlyCharges,0.222,1.0

Variable,SeniorCitizen,MonthlyCharges
SeniorCitizen,1.0,0.181
MonthlyCharges,0.181,1.0

Variable,Gender,SeniorCitizen,PhoneService,MultipleLines,PaymentMethod,MonthlyCharges,Churn
Gender,1.0,0.0,0.0,0.0,0.0,0.012,0.0
SeniorCitizen,0.0,1.0,0.0,0.089,0.294,0.306,0.23
PhoneService,0.0,0.0,1.0,1.0,0.0,0.832,0.008
MultipleLines,0.0,0.089,1.0,1.0,0.175,0.711,0.022
PaymentMethod,0.0,0.294,0.0,0.175,1.0,0.4,0.449
MonthlyCharges,0.012,0.306,0.832,0.711,0.4,1.0,0.36
Churn,0.0,0.23,0.008,0.022,0.449,0.36,1.0
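
The 1.0 between PhoneService and MultipleLines in the matrix above is what a measure such as Cramér's V gives when one column fully determines the other. The output does not say which measure this matrix uses, so the sketch below is only an illustration under that assumption. The contingency counts are inferred from the frequency tables (3357 plain `No` rows is 4033 occurrences of the word "no" minus the 676 `No phone service` rows), and it is assumed that every `No phone service` row has `PhoneService = No`.

```python
import math


def cramers_v(table):
    """Uncorrected Cramér's V for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Rows: PhoneService = Yes, No.
# Columns: MultipleLines = No, Yes, No phone service.
# Counts are inferred from the report, not read from the data (assumption).
phone_vs_lines = [
    [3357, 2943, 0],
    [0, 0, 676],
]
v_phone_lines = cramers_v(phone_vs_lines)

# A table with no association, for contrast.
v_independent = cramers_v([[10, 10], [10, 10]])
```

Under these assumptions `v_phone_lines` comes out at 1 (up to floating-point error) and `v_independent` at 0, which is consistent with the high-correlation alerts raised for these two columns.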

Row,Gender,SeniorCitizen,PhoneService,MultipleLines,PaymentMethod,MonthlyCharges,Churn
0,Female,0,Yes,No,Mailed check,65.6,No
1,Female,1,Yes,No,Mailed check,83.9,Yes
2,Male,0,Yes,Yes,Credit card (automatic),84.65,No
3,Female,1,Yes,No,Electronic check,48.2,No
4,Female,0,Yes,No,Electronic check,68.95,No
5,Female,0,Yes,Yes,Bank transfer (automatic),101.3,No
6,Female,0,Yes,Yes,Credit card (automatic),95.75,No
7,Female,1,Yes,Yes,Electronic check,72.1,No
8,Female,1,Yes,Yes,Electronic check,25.2,No
9,Female,0,Yes,Yes,Electronic check,94.1,Yes

Row,Gender,SeniorCitizen,PhoneService,MultipleLines,PaymentMethod,MonthlyCharges,Churn
6966,Female,0,Yes,No,Electronic check,97.2,No
6967,Female,0,Yes,No,Mailed check,84.45,No
6968,Female,0,Yes,Yes,Mailed check,72.1,No
6969,Female,0,Yes,Yes,Bank transfer (automatic),91.75,Yes
6970,Female,0,Yes,No,Electronic check,95.1,Yes
6971,Male,1,Yes,No,Electronic check,74.4,Yes
6972,Female,1,Yes,No,Electronic check,94.0,No
6973,Female,0,Yes,No,Bank transfer (automatic),20.95,Yes
6974,Female,0,Yes,No,Mailed check,55.15,No
6975,Male,0,Yes,No,Mailed check,50.3,No

Row,Gender,SeniorCitizen,PhoneService,MultipleLines,PaymentMethod,MonthlyCharges,Churn,# duplicates
484,Male,0,Yes,No,Mailed check,19.95,No,16
150,Female,0,Yes,No,Mailed check,20.0,No,14
151,Female,0,Yes,No,Mailed check,20.05,No,14
474,Male,0,Yes,No,Mailed check,19.65,No,13
486,Male,0,Yes,No,Mailed check,20.05,No,13
147,Female,0,Yes,No,Mailed check,19.9,No,12
139,Female,0,Yes,No,Mailed check,19.55,No,11
146,Female,0,Yes,No,Mailed check,19.85,No,11
152,Female,0,Yes,No,Mailed check,20.1,No,11
157,Female,0,Yes,No,Mailed check,20.35,No,11
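
A table like the one above, listing the most frequent duplicated rows with a `# duplicates` count, can be approximated in pandas by grouping on every column. This is a sketch on a small made-up frame, not the profiler's own implementation; the toy values are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the training data.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Male", "Female"],
        "MonthlyCharges": [19.95, 19.95, 20.0, 19.95, 70.3],
        "Churn": ["No", "No", "No", "No", "Yes"],
    }
)


def duplicate_groups(df):
    """Return each distinct row that occurs more than once, with its count."""
    counts = (
        df.groupby(list(df.columns), dropna=False)
        .size()
        .reset_index(name="# duplicates")
    )
    return (
        counts[counts["# duplicates"] > 1]
        .sort_values("# duplicates", ascending=False)
        .reset_index(drop=True)
    )


groups = duplicate_groups(toy)

# Number of rows that belong to any duplicated group.
rows_in_duplicate_groups = int(toy.duplicated(keep=False).sum())
```

Note that "number of duplicated groups" and "number of rows involved in duplicates" are different quantities, so it is worth checking which one a reported duplicate count refers to before dropping rows with `df.drop_duplicates()`.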
