**PROBLEM STATEMENT**
<br/>Predict whether a Titanic passenger survived, based on passenger attributes such as class, sex and age.
<br/>Get Sample data from Source- https://data.world/nrippner/titanic-disaster-dataset
<br/>
<br/>**COLUMN DEFINITION**
<br/>survival - Survival (0 = No; 1 = Yes)
<br/>class - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
<br/>name - Name
<br/>sex - Sex (Male, Female; the dataset is imbalanced towards males)
<br/>age - Age
<br/>
<br/>**STEPS IN MODELLING**
<br/>1.Data Acquisition
<br/>2.Data understanding
<br/>3.Data visualisation/EDA
<br/>4.Data cleaning/missing imputation/typecasting
<br/>5.Sampling/ bias removal
<br/>6.Anomaly detection
<br/>7.Feature selection/importance
<br/>8.Azure ML Model trigger
<br/>9.Model Interpretation & Error Analysis
<br/>10.Telemetry
<br/>
<br/>**FEATURE ENGINEERING**
<br/>1. The data is imbalanced, with more males than females, so cluster-oversample by the 'Sex' column before modelling. The imbalance is visible in the data profiling plots.
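
The oversampling idea above can be sketched with plain pandas. This is only an illustration, assuming random oversampling with replacement of the smaller 'Sex' group; the helper name `oversample_by_group` and the toy data are hypothetical and are not the Master notebook's implementation.

```python
import pandas as pd

def oversample_by_group(df: pd.DataFrame, group_col: str, seed: int = 0) -> pd.DataFrame:
    """Resample every group (with replacement) up to the size of the largest group."""
    target = df[group_col].value_counts().max()
    parts = []
    for _, group in df.groupby(group_col):
        extra = target - len(group)
        if extra > 0:
            # Draw the missing rows from the same group, with replacement
            group = pd.concat([group, group.sample(n=extra, replace=True, random_state=seed)])
        parts.append(group)
    return pd.concat(parts, ignore_index=True)

# Toy data (hypothetical): 4 males, 2 females
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 1, 0, 1, 1],
})
balanced = oversample_by_group(toy, "Sex")
counts = balanced["Sex"].value_counts()
```

After the call, both groups contain the same number of rows, so a model trained on `balanced` sees each sex equally often.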

## Import functions from Master Notebook:
Import the Functions and dependencies from the Master notebook to be used in the Trigger Notebook

In [0]:
%run /Users/.../AMLMasterNotebook

## 1.Data Acquisition
1.Acquire data from the ADLS data source path in CSV, Parquet, JSON or a similar format.
<br/>2.Apply logical transformations to the data.
<br/>3.Cast columns to the required datatypes, convert to a pandas DataFrame, persist the acquired dataset, and introduce an 'Index' column that assigns a unique identifier to each row so that the original record can be retrieved after any data manipulation.
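
Step 3 can be sketched as below. `DataTypeConversion` lives in the Master notebook and is not shown here, so `convert_types` is a hypothetical stand-in that only illustrates the kind of casting involved; the toy rows are made up.

```python
import numpy as np
import pandas as pd

def convert_types(df, cols_string, cols_int, cols_datetime, cols_float):
    """Cast the listed columns to the requested dtypes (illustrative stand-in only)."""
    out = df.copy()
    for c in cols_string:
        out[c] = out[c].astype(str)
    for c in cols_int:
        # Round first, then use nullable Int64 so missing values survive the cast
        out[c] = pd.to_numeric(out[c], errors="coerce").round().astype("Int64")
    for c in cols_datetime:
        out[c] = pd.to_datetime(out[c], errors="coerce")
    for c in cols_float:
        out[c] = pd.to_numeric(out[c], errors="coerce").astype(float)
    return out

# Toy data (hypothetical): everything arrives as strings, as with a CSV read through Spark
raw = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "PClass": ["1st", "2nd", "3rd"],
    "Sex": ["female", "male", "male"],
    "Age": ["29", None, "2"],
    "Survived": ["1", "0", "0"],
})
typed = convert_types(raw, ["Name", "PClass", "Sex"], ["Age", "Survived"], [], [])

# Unique row identifier used later to join predictions back to the original rows
typed["Index"] = np.arange(len(typed))
```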

In [0]:
%scala
//<USER INPUT FILEPATH PARQUET OR CSV>

val filepath= "adl://<Your Datalake storage>.azuredatalakestore.net/Temp/ML-PJC/Titanic.csv"
var df=spark.read.format("csv").option("header", "true").option("delimiter", ",").load(filepath)
//val filepath ="abfss:/.../.parquet"
//var df = spark.read.parquet(filepath)
df.createOrReplaceTempView("vw")

In [0]:
%sql
select * from vw

In [0]:
import pandas as pd
import numpy as np
from pyspark.sql.functions import col

input_dataframe= spark.sql("""select * FROM vw""")
#input_dataframe = pd.read_csv("/dbfs/FileStore/Titanic.csv", header='infer')

cols_string=['Name','PClass','Sex']
cols_int=['Age','Survived']
cols_datetime=[]
cols_Float=[]

#Function call: DataTypeConversion(input_dataframe,cols_string,cols_int,cols_datetime,cols_Float)
input_dataframe = DataTypeConversion(input_dataframe,cols_string,cols_int,cols_datetime,cols_Float)

#Assign an 'Index' unique identifier so that the original record can be recovered after data massaging
input_dataframe['Index'] = np.arange(len(input_dataframe)) 

#Saving data acquired in dbfs for future use
outdir = '/dbfs/FileStore/Titanic.csv'
input_dataframe.to_csv(outdir, index=False)
#input_dataframe = pd.read_csv("/dbfs/FileStore/Dataframe.csv", header='infer')


## 2.Data Exploration
1.Exploratory Data Analysis (EDA)- To understand the overall data at hand by analysing each feature independently for its statistics, the correlation and interaction between variables, a data sample etc.
<br/>2.Data Profiling Plots- To analyse the categorical and numerical columns separately for any trends or bias in the data.
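
As a rough idea of what such profiling computes, the sketch below summarises numeric columns, category shares and the label rate per category with plain pandas. `quick_profile` and the toy data are hypothetical; the notebook itself relies on the Master notebook's profiling functions.

```python
import pandas as pd

def quick_profile(df, categorical_cols, numeric_cols, label_col):
    """Return numeric statistics, category shares and the mean label per category."""
    numeric_stats = df[numeric_cols].describe()
    category_shares = {c: df[c].value_counts(normalize=True) for c in categorical_cols}
    label_rate = {c: df.groupby(c)[label_col].mean() for c in categorical_cols}
    return numeric_stats, category_shares, label_rate

# Toy data (hypothetical)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female"],
    "Age": [22.0, 35.0, 54.0, 28.0],
    "Survived": [0, 0, 1, 1],
})
stats, shares, rates = quick_profile(toy, ["Sex"], ["Age", "Survived"], "Survived")
```

A skewed entry in `shares` (here 75% male) is the kind of imbalance the profiling plots are meant to surface.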

In [0]:
input_dataframe = pd.read_csv("/dbfs/FileStore/Titanic.csv", header='infer')

#Function Call: Data_Profiling_viaPandasProfiling(input_dataframe)
p=Data_Profiling_viaPandasProfiling(input_dataframe)
displayHTML(p)


In [0]:
input_dataframe = pd.read_csv("/dbfs/FileStore/Titanic.csv", header='infer')

#User Inputs
cols_all=['Name','PClass','Sex','Age','Survived']
Categorical_cols=['Name','PClass','Sex']
Numeric_cols=['Age','Survived']
Label_col='Survived'

#Function Call: Data_Profiling_Plots(input_dataframe,Categorical_cols,Numeric_cols,Label_col)
Data_Profiling_Plots(input_dataframe,Categorical_cols,Numeric_cols,Label_col)

## 4.Cleansing
Clean the data: handle NULL values, fix structural errors in columns, drop empty columns, encode the categorical values, and normalise the data to bring it onto the same scale. We also compare the correlation heatmap of the original input dataset with that of the cleansed dataset to validate that the transformations did not distort the original data trend/density.
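
The cleansing steps listed above can be sketched roughly as follows. This is not `autodatacleaner`; `simple_clean` is a hypothetical helper that assumes median/mode imputation, integer category codes and min-max scaling, which may differ from what the Master notebook does.

```python
import pandas as pd

def simple_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop empty columns, impute missing values, encode categories, min-max scale."""
    out = df.dropna(axis=1, how="all").copy()  # drop completely empty columns
    for c in out.columns:
        if pd.api.types.is_numeric_dtype(out[c]):
            out[c] = out[c].fillna(out[c].median())        # numeric: median imputation
        else:
            out[c] = out[c].fillna(out[c].mode().iloc[0])  # categorical: most frequent value
            out[c] = out[c].astype("category").cat.codes.astype(float)
    for c in out.columns:
        lo, hi = out[c].min(), out[c].max()
        if hi > lo:
            out[c] = (out[c] - lo) / (hi - lo)             # scale to [0, 1]
    return out

# Toy data (hypothetical)
toy = pd.DataFrame({
    "Sex": ["male", None, "female", "male"],
    "Age": [20.0, None, 40.0, 60.0],
    "Empty": [None, None, None, None],
})
cleaned = simple_clean(toy)
```

In a real run the label and 'Index' columns would normally be excluded from scaling; the sketch scales every column for brevity.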

In [0]:
subsample_final = pd.read_csv("/dbfs/FileStore/Titanic.csv", header='infer')
filepath="/dbfs/FileStore/Titanic.csv"
#subsample_final=subsample_final.drop(['Index'], axis = 1) # 'Index' has the highest variability, so it always ranks as important along the principal components, but it has no business value. Append any other columns you want dropped to this list.


inputdf_new=autodatacleaner(subsample_final,filepath,"Titanic","Data Cleanser")
print("Total rows in the new pandas dataframe:",len(inputdf_new.index))

#persist cleansed data sets 
filepath1 = '/dbfs/FileStore/Cleansed_Titanic.csv'
inputdf_new.to_csv(filepath1, index=False)


In [0]:
original = pd.read_csv("/dbfs/FileStore/Titanic.csv", header='infer')
display(Data_Profiling_Fin(original))

In [0]:
Cleansed=pd.read_csv("/dbfs/FileStore/Cleansed_Titanic.csv", header='infer')

display(Data_Profiling_Fin(Cleansed))

## 5.Sampling
Perform stratified, systematic, random and cluster sampling over the data, compare each sampled dataset with the original data using a null hypothesis test, and suggest the best sample. Compare the data densities of the sampled datasets with that of the original input dataset to validate that the sample matches the data trend of the original set.
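
Three of the sampling schemes named above can be sketched with pandas as shown below. The helper names and toy data are hypothetical, cluster sampling and the hypothesis test are left out, and the comparison is reduced to the drift in the label mean, so this is only an approximation of what `Sampling` does.

```python
import pandas as pd

def random_sample(df, frac, seed=0):
    """Simple random sample of a fraction of the rows."""
    return df.sample(frac=frac, random_state=seed)

def systematic_sample(df, step, start=0):
    """Every step-th row, beginning at position start."""
    return df.iloc[start::step]

def stratified_sample(df, strata_col, frac, seed=0):
    """Sample the same fraction from each stratum."""
    return df.groupby(strata_col).sample(frac=frac, random_state=seed)

# Toy data (hypothetical): 40 survivors, 60 non-survivors
toy = pd.DataFrame({"Index": range(100), "Survived": [1] * 40 + [0] * 60})

samples = {
    "random": random_sample(toy, 0.5),
    "systematic": systematic_sample(toy, 2),
    "stratified": stratified_sample(toy, "Survived", 0.5),
}

# Absolute difference between each sample's survival rate and the original rate
drift = {name: abs(s["Survived"].mean() - toy["Survived"].mean()) for name, s in samples.items()}
```

Stratifying on the label keeps the survival rate identical to the original by construction, which is why its drift is zero here.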

In [0]:
#Sample after cleansing: all categorical columns are numerically encoded by then, so no chi-square test is needed (it requires the observed and original-sample frequency totals to match).
input_dataframe = pd.read_csv("/dbfs/FileStore/Cleansed_Titanic.csv", header='infer')
filepath="/dbfs/FileStore/Cleansed_Titanic.csv"
subsample_final = pd.DataFrame()
subsample1 = pd.DataFrame()
subsample2 = pd.DataFrame()
subsample3 = pd.DataFrame()
subsample4 = pd.DataFrame()

#Function Call: Sampling(input_dataframe,filepath,task_type,input_appname,cluster_classified_col_ifany(Supervised))
subsample_final,subsample1,subsample2,subsample3,subsample4=Sampling(input_dataframe,filepath,'Sampling','Titanic','Sex')

#persist sampled data sets 
filepath1 = '/dbfs/FileStore/StratifiedSampled_Titanic.csv'
subsample1.to_csv(filepath1, index=False)
filepath2 = '/dbfs/FileStore/RandomSampled_Titanic.csv'
subsample2.to_csv(filepath2, index=False)
filepath3 = '/dbfs/FileStore/SystematicSampled_Titanic.csv'
subsample3.to_csv(filepath3, index=False)
filepath4 = '/dbfs/FileStore/ClusterSampled_Titanic.csv'
subsample4.to_csv(filepath4, index=False)
filepath = '/dbfs/FileStore/subsample_final_Titanic.csv'
subsample_final.to_csv(filepath, index=False)

In [0]:
original = pd.read_csv("/dbfs/FileStore/Titanic.csv", header='infer')

display(display_DataDistribution(original,'Survived'))

In [0]:
subsample1 = pd.read_csv("/dbfs/FileStore/StratifiedSampled_Titanic.csv", header='infer')

display(display_DataDistribution(subsample1,'Survived'))

In [0]:
subsample2 = pd.read_csv("/dbfs/FileStore/RandomSampled_Titanic.csv", header='infer')

display(display_DataDistribution(subsample2,'Survived'))

In [0]:
subsample3 = pd.read_csv("/dbfs/FileStore/SystematicSampled_Titanic.csv", header='infer')

display(display_DataDistribution(subsample3,'Survived'))

In [0]:
subsample4 = pd.read_csv("/dbfs/FileStore/ClusterSampled_Titanic.csv", header='infer')

display(display_DataDistribution(subsample4,'Survived'))

## 6.Anomaly Detection
Iterate the data over various anomaly-detection techniques and estimate the number of inliers and outliers for each.
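
To make the inlier/outlier counting concrete, here is a minimal sketch of one simple technique, a z-score rule that flags the `outliers_fraction` of points farthest from the mean. It is an assumption for illustration only; `flag_outliers` is hypothetical and the techniques used inside `AnomalyDetection` are not shown in this notebook.

```python
import pandas as pd

def flag_outliers(values: pd.Series, outliers_fraction: float) -> pd.Series:
    """Mark the outliers_fraction of points with the largest absolute z-score."""
    z = (values - values.mean()).abs() / values.std(ddof=0)
    threshold = z.quantile(1 - outliers_fraction)
    return z > threshold

# Toy data (hypothetical): ages 20..38 plus one extreme value
toy = pd.DataFrame({"Age": list(range(20, 39)) + [95]})

is_outlier = flag_outliers(toy["Age"], outliers_fraction=0.05)
n_outliers = int(is_outlier.sum())
n_inliers = int((~is_outlier).sum())
```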

In [0]:
#Calling the Anomaly Detection function to identify outliers
outliers_fraction = 0.05
#df =pd.read_csv("/dbfs/FileStore/subsample_final_Titanic.csv", header='infer')
df =pd.read_csv("/dbfs/FileStore/ClusterSampled_Titanic.csv", header='infer')
target_variable = 'Survived'
variables_to_analyze='Sex'

AnomalyDetection(df,target_variable,variables_to_analyze,outliers_fraction,'anomaly_test','Titanic')

## 7.Feature Selection
Perform feature selection on the basis of feature importance ranking, correlation values and the variance within each column.
Choose features with a high importance score, drop one of each pair of highly correlated features, and drop features that offer zero variability and thus add no information (entropy) to the dataset.
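
The three rules above can be sketched as follows. `select_features` is a hypothetical helper, the 0.9 correlation threshold is an arbitrary assumption, and absolute correlation with the label stands in for the importance score; the Master notebook's `FeatureSelection` may rank features differently.

```python
import numpy as np
import pandas as pd

def select_features(df: pd.DataFrame, label_col: str, corr_threshold: float = 0.9) -> pd.Series:
    """Drop constant and redundant features, then rank the rest by |correlation| with the label."""
    features = df.drop(columns=[label_col])

    # 1. Zero-variance columns carry no information
    features = features.loc[:, features.nunique() > 1]

    # 2. For each highly correlated pair, drop the later column
    corr = features.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    features = features.drop(columns=to_drop)

    # 3. Rank the survivors (proxy for an importance score)
    return features.corrwith(df[label_col]).abs().sort_values(ascending=False)

# Toy data (hypothetical)
toy = pd.DataFrame({
    "Age": [10, 20, 30, 40, 50, 60],
    "AgeMonths": [120, 240, 360, 480, 600, 720],  # perfectly correlated with Age
    "Constant": [1, 1, 1, 1, 1, 1],               # zero variance
    "Sex": [0, 1, 0, 1, 1, 0],
    "Survived": [1, 1, 1, 0, 0, 0],
})
ranking = select_features(toy, "Survived")
```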

In [0]:
import pandas as pd
import numpy as np
#input_dataframe = pd.read_csv("/dbfs/FileStore/RealEstate.csv", header='infer')
#label_col='Y house price of unit area'
#filepath="/dbfs/FileStore/RealEstate.csv"
#input_appname='RealEstate'
#task_type='FeatureSelectionCleansing'
#Y_discrete='Continuous'

input_dataframe = pd.read_csv("/dbfs/FileStore/Cleansed_Titanic.csv", header='infer')
label_col='Survived'
filepath='/dbfs/FileStore/Cleansed_Titanic.csv'
input_appname='Titanic'
task_type='FeatureSelectionCleansing'
Y_discrete='Categorical'

FeatureSelection(input_dataframe,label_col,Y_discrete,filepath,input_appname,task_type)

## 8.Auto ML Trigger - after preprocessing
Trigger Azure AutoML, pick the best model obtained and use it to predict the label column. Calculate the Weighted Absolute Accuracy and push it to telemetry. Also recover the data in its original format by using the unique row identifier 'Index', and report the Actual v/s Predicted columns. We also provide a direct link to the Azure portal run for the current experiment for users to follow.
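
The role of 'Index' in recovering the original rows can be illustrated with toy data. The column names `Index`, `y_actual` and `y_predict` follow this notebook; the data is made up, and the plain accuracy computed here is only a stand-in, not the Weighted Absolute Accuracy defined in the Master notebook.

```python
import pandas as pd

# Original, human-readable rows (hypothetical)
original = pd.DataFrame({
    "Index": [0, 1, 2, 3],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["female", "male", "male", "female"],
})

# Model output: encoded features plus predictions, in arbitrary row order (hypothetical)
predictions = pd.DataFrame({
    "Index": [2, 0, 3, 1],
    "Sex": [1.0, 0.0, 0.0, 1.0],
    "y_actual": [1, 0, 0, 1],
    "y_predict": [1, 0, 1, 1],
})

# Keep only the identifier and the two label columns, then join on 'Index'
report = original.merge(predictions[["Index", "y_actual", "y_predict"]], on="Index", how="inner")

# Share of rows where the prediction matches the actual label
accuracy = (report["y_actual"] == report["y_predict"]).mean()
```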

In [0]:
import pandas as pd
dfclean = pd.read_csv("/dbfs/FileStore/Cleansed_Titanic.csv", header='infer')

#AutoMLFunc(subscription_id,resource_group,workspace_name,input_dataframe,label_col,task_type,input_appname)
df=AutoMLFunc('3ecb9b6a-cc42-4b0a-9fd1-6c08027eb201','psbidev','psdatainsightsML',dfclean,'Survived','classification','Titanic')


In [0]:
#Keep only the Index, y_actual and y_predict columns, as all other columns are encoded after manipulation
#(a list comprehension avoids shadowing pyspark's `col` and dropping columns while iterating)
keep_cols = [c for c in df.columns if c in ["y_predict","y_actual","Index"]]
df = df[keep_cols]
    
#dataframe is the actual input dataset     
dataframe = pd.read_csv("/dbfs/FileStore/Titanic.csv", header='infer')

#Merging Actual Input dataframe with AML output df using Index column
dataframe_fin = pd.merge(left=dataframe, right=df, on='Index')
dataframe_fin

## 9.Model Interpretation, Feature Importance, Error Analysis
We can explore the model by splitting the model metrics over various cohorts and analysing the data and model performance for each subclass. We can also obtain global and local feature importance values for the model.
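
Splitting a metric over cohorts can be sketched with a pandas group-by. `cohort_metrics` and the toy data are hypothetical, and accuracy is used as the example metric; the interactive dashboards produced by `ModelInterpret` and `ErrorAnalysisDashboard` are not reproduced here.

```python
import pandas as pd

def cohort_metrics(df, cohort_col, actual_col="y_actual", predicted_col="y_predict"):
    """Accuracy and row count for each value of cohort_col."""
    correct = df[actual_col] == df[predicted_col]
    return (
        df.assign(correct=correct)
          .groupby(cohort_col)["correct"]
          .agg(accuracy="mean", rows="size")
    )

# Toy data (hypothetical)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "y_actual": [0, 0, 1, 0, 1, 1],
    "y_predict": [0, 1, 1, 0, 1, 1],
})
by_sex = cohort_metrics(toy, "Sex")
```

A cohort whose accuracy is clearly below the others (here the male cohort) is a natural starting point for error analysis.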

In [0]:
df = pd.read_csv("/dbfs/FileStore/Cleansed_Titanic.csv", header='infer')
label_col='Survived'
subscription_id='3ecb9b6a-cc42-4b0a-9fd1-6c08027eb201'
resource_group='psbidev'
workspace_name='psdatainsightsML'
run_id='AutoML_45a82620-d605-4643-8a1b-8055e32ffd9b'
iteration=1
task='classification'

ModelInterpret(df,label_col,subscription_id,resource_group,workspace_name,run_id,iteration,task)

In [0]:
df = pd.read_csv("/dbfs/FileStore/Cleansed_Titanic.csv", header='infer')
label_col='Survived'
subscription_id='3ecb9b6a-cc42-4b0a-9fd1-6c08027eb201'
resource_group='psbidev'
workspace_name='psdatainsightsML'
run_id='AutoML_45a82620-d605-4643-8a1b-8055e32ffd9b'
iteration=1
task='classification'

ErrorAnalysisDashboard(df,label_col,subscription_id,resource_group,workspace_name,run_id,iteration,task)