# Data science in Microsoft Fabric

### Machine Learning Framework
* A machine learning framework is a suite of tools and libraries that provide a structured and efficient way to build, train, and deploy machine learning models. It simplifies the process of getting machine learning models into production.

#### OSEMN Framework
* The **OSEMN** framework is a practical and flexible roadmap for working on a data science or machine learning project. The acronym stands for **Obtain**, **Scrub**, **Explore**, **Model**, and **iNterpret**.

    - **O:** Obtain - This is the first step in the process, where you gather the data. This could be from various sources like a database, a data API, web scraping, or even an Excel file.

    - **S:** Scrub - In this step, you clean the data. This involves handling missing data, dealing with outliers, converting data types, or even transforming variables.

    - **E:** Explore - This is where you explore the data. You might create visualizations, calculate statistics, or do feature engineering. The goal is to understand the data and find patterns or insights.

    - **M:** Model - Here, you select, build, and train machine learning models. You might also need to optimize hyperparameters and evaluate and compare different models.

    - **N:** iNterpret - In the final step, you interpret the results. You might explain how the model works, discuss the performance of the model, derive insights, or make predictions.
    

**O:** Obtain - Load the diabetes CSV file into a Spark DataFrame and preview the first five rows.

In [1]:
df_diabetes = spark.read.format("csv")\
          .option("header","true")\
          .option("inferSchema","true")\
          .load("Files/csv/diabetes/diabetes.csv")
# df_diabetes is now a Spark DataFrame containing the CSV data from "Files/csv/diabetes/diabetes.csv".
display(df_diabetes.limit(5))


### Get the descriptive statistics of the DataFrame
- `summary()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of each column, giving insight into the central tendency, dispersion, and shape of the dataset's distribution.

In [2]:
display(df_diabetes.summary())
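
To make these statistics concrete, here is a minimal plain-Python sketch that computes the same kinds of values for five illustrative `AGE` values. The values and the helper name `describe` are assumptions for illustration, not output from the notebook. It computes exact quartiles, so the results may differ slightly from the percentiles Spark reports on a large dataset.

```python
import statistics

def describe(values):
    """Return count, mean, sample stddev, min, quartiles, and max."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

ages = [62, 50, 74, 26, 68]  # illustrative values only
stats = describe(ages)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Comparing the mean with the median (`50%`) and looking at the spread between the quartiles is a quick way to spot skew before modeling.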


### Data Transformation

- **Find NaN or Null Values**: You can combine the `isNull()` column method with the `isnan()` function to count the null or NaN values in each column of your DataFrame.

In [3]:
from pyspark.sql.functions import col, isnan, when, count

# Count the null or NaN values in each column
df_diabetes.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_diabetes.columns]).show()


+---+---+---+---+---+---+---+---+---+---+---+
|AGE|SEX|BMI| BP| S1| S2| S3| S4| S5| S6|  Y|
+---+---+---+---+---+---+---+---+---+---+---+
|  1|  1|  2|  2|  1|  2|  1|  1|  2|  2|  1|
+---+---+---+---+---+---+---+---+---+---+---+



- **Display Records with Null or NaN Values**: To display only the records with null or NaN values, you can filter the DataFrame accordingly.

In [4]:
# Filtering the DataFrame to display only rows with at least one null or NaN value
df_diabetes.filter(' or '.join([f'({c} IS NULL OR isNaN({c}))' for c in df_diabetes.columns])).show()


+----+----+----+----+----+----+----+----+----+----+----+
| AGE| SEX| BMI|  BP|  S1|  S2|  S3|  S4|  S5|  S6|   Y|
+----+----+----+----+----+----+----+----+----+----+----+
|  80|null|null|null| 193|null|null| 5.0|null|null|null|
|null|   2|null|null|null|null|42.0|null|null|null| 104|
+----+----+----+----+----+----+----+----+----+----+----+



- **Delete Records with Null or NaN Values**: To delete these records, you can use the `dropna()` method or its equivalent, `na.drop()`.

In [5]:
# Dropping rows with any null or NaN values
df_diabetes_clean = df_diabetes.na.drop()


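- **Verification**: Compare the row counts before and after the drop. The two rows with missing values are removed, so the count falls from 444 to 442.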

In [6]:
print('clean dataset row count: ' + str(df_diabetes_clean.count()))
print('original dataset row count: ' + str(df_diabetes.count()))


clean dataset row count: 442
original dataset row count: 444
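
The three steps above (count, display, drop) can be sketched without Spark on a handful of made-up rows. The rows, the column subset, and the helper name `is_missing` are assumptions for illustration only. The sketch treats `None` as null and a float `nan` as NaN, which is why both checks are needed.

```python
import math

def is_missing(value):
    """Return True for None (null) or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

# Hypothetical rows; only three of the columns are shown.
rows = [
    {"AGE": 59, "BMI": 32.1, "Y": 151},
    {"AGE": None, "BMI": float("nan"), "Y": 104},
    {"AGE": 48, "BMI": 21.6, "Y": 75},
    {"AGE": 80, "BMI": None, "Y": None},
]
columns = ["AGE", "BMI", "Y"]

# 1. Count missing values per column.
missing_per_column = {c: sum(is_missing(r[c]) for r in rows) for c in columns}

# 2. Keep only the rows that have at least one missing value.
rows_with_missing = [r for r in rows if any(is_missing(r[c]) for c in columns)]

# 3. Drop those rows and check that the row counts add up.
clean_rows = [r for r in rows if not any(is_missing(r[c]) for c in columns)]
assert len(clean_rows) + len(rows_with_missing) == len(rows)

print(missing_per_column)
print(len(rows), len(clean_rows))
```

The final check mirrors the verification cell: the number of clean rows plus the number of dropped rows should equal the original row count.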


# Train a regression model
#### Defining `Features` & `Label`:
- `X` holds the features and `y` holds the label (or target variable).
    - **X:** a 2D array containing the values of the features ‘AGE’, ‘SEX’, ‘BMI’, ‘BP’, ‘S1’, ‘S2’, ‘S3’, ‘S4’, ‘S5’, and ‘S6’ from the pandas DataFrame `df_clean`. These are the independent variables that the model uses to make predictions.
    - **y:** a 1D array containing the values of the target column ‘Y’ from `df_clean`. This is the dependent variable that the model tries to predict from the features.

In [7]:
from sklearn.model_selection import train_test_split

df_clean = df_diabetes_clean.toPandas()  

X, y = df_clean[['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6']].values, df_clean['Y'].values


#### Splitting the dataset into `Train` and `Test` datasets:
- `X_train` and `y_train` are the features and label for the training set, respectively.
- `X_test` and `y_test` are the features and label for the testing set, respectively.
- `test_size`=0.30 means that 30% of the data will be used for the test set, and the remaining 70% will be used for the training set.
- `random_state`=0 is used for reproducibility. It ensures that the data is split in the same way every time the code is run.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
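
To illustrate what `test_size` and `random_state` control, here is a minimal plain-Python sketch of a seeded shuffle-and-split. It is not scikit-learn's implementation, and its shuffle order will differ from `train_test_split`. The helper name `split_rows` and the choice to round the test share up are assumptions made for this sketch.

```python
import math
import random

def split_rows(rows, test_size=0.30, seed=0):
    """Shuffle a copy of rows with a fixed seed and split off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order on every run
    n_test = math.ceil(len(shuffled) * test_size)  # assumption: round the test share up
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(442))  # stand-in for the 442 cleaned records
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Calling `split_rows(rows)` again returns the same partition, which is the property `random_state=0` provides: results can be compared across runs because the split itself does not change.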


## Managing the end-to-end machine learning lifecycle with MLflow
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It includes capabilities for experiment tracking, model deployment, and more.
- Experiments in MLflow organize your runs (individual model training sessions).
- They let you group related runs together and compare their results.

`mlflow.set_experiment(experiment_name)`: 
- This line sets the experiment named in `experiment_name` as the active experiment. If an experiment with this name already exists, it is reused; otherwise, a new one is created.

In [10]:
import mlflow
experiment_name = "diabetes-regression"
mlflow.set_experiment(experiment_name)


<Experiment: artifact_location='', creation_time=1713206283404, experiment_id='80614e41-37c6-4dd0-9c93-a6a4acbe20ff', last_update_time=None, lifecycle_stage='active', name='diabetes-regression', tags={}>

#### Tracking a run of an experiment & automatic logging of parameters
- `mlflow.start_run()`: starts tracking a run of the active experiment. Used as a context manager, it ends the run when the `with` block exits.
- `mlflow.autolog()`: enables automatic logging of parameters, metrics, and models for supported libraries, including scikit-learn.
    - Once `autolog()` has been called, subsequent scikit-learn training calls such as `fit()` are logged automatically, without explicit log statements.
- `model.fit(X_train, y_train)`: trains the model on the training features `X_train` and the corresponding labels `y_train`.

In [11]:
from sklearn.linear_model import LinearRegression
import mlflow

with mlflow.start_run() as run:
    mlflow.autolog()

    model = LinearRegression()
    model.fit(X_train, y_train)

    # Save the run id to a variable
    run_id = run.info.run_id


2024/04/17 16:48:44 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.






In [12]:
import mlflow

# Assuming run_id is the ID of the run we're interested in
run_info = mlflow.get_run(run_id)

# Now we can access the metrics
training_mean_absolute_error = run_info.data.metrics['training_mean_absolute_error']
training_mean_squared_error = run_info.data.metrics['training_mean_squared_error']
training_r2_score = run_info.data.metrics['training_r2_score']
training_root_mean_squared_error = run_info.data.metrics['training_root_mean_squared_error']
training_score = run_info.data.metrics['training_score']

print(
    f"mean_absolute_error: {training_mean_absolute_error}\n"
    f"mean_squared_error: {training_mean_squared_error}\n"
    f"r2_score: {training_r2_score}\n"
    f"root_mean_squared_error: {training_root_mean_squared_error}\n"
    f"training_score: {training_score}"
)


mean_absolute_error: 43.054973480424714       
mean_squared_error: 2804.1435610448802       
r2_score: 0.5539378915448929       
root_mean_squared_error: 52.95416471860245       
training_score: 0.5539378915448929
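
These values are all computed on the training data. As a reading aid, the sketch below shows how the four metrics relate, using plain Python and made-up numbers. The helper name `regression_metrics` and the sample values are assumptions for illustration; this is not how MLflow or scikit-learn compute them internally. Two relationships are visible in the output above: the root mean squared error is the square root of the mean squared error, and `training_score` equals `r2_score` because `LinearRegression.score` returns R².

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Made-up targets and predictions.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 260.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Consistency check on the values logged above: RMSE is sqrt(MSE).
logged_mse = 2804.1435610448802
logged_rmse = 52.95416471860245
print(math.isclose(math.sqrt(logged_mse), logged_rmse, rel_tol=1e-9))
```

An R² of about 0.55 means the model explains a little over half of the variance in `Y` on the training set.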


In [12]:
display(run.info)


<RunInfo: artifact_uri='sds://onelakewestus3.pbidedicated.windows.net/294331d4-04cc-4c6c-bfd4-a94a8a4c7b24/62525170-e476-4926-bb69-c6730cd9b310/55454c80-8163-47e0-a407-55f4ea653efe/artifacts', end_time=None, experiment_id='80614e41-37c6-4dd0-9c93-a6a4acbe20ff', lifecycle_stage='active', run_id='55454c80-8163-47e0-a407-55f4ea653efe', run_name='', run_uuid='55454c80-8163-47e0-a407-55f4ea653efe', start_time=1713372524383, status='RUNNING', user_id='7ebfac85-3ebb-440f-a743-e52052051f6a'>

# Train a classification model

#### Data Transformation & Feature Engineering
- The `Y` field is a continuous number. Regression models predict a continuous (quantitative) output, so they suit `Y` as it is.
- Classification models predict a categorical (qualitative) output, so they work best when the dependent variable is categorical.
    - For example, if y is a continuous variable representing age, you might convert it into categories like “child”, “teenager”, “adult”, and “senior”. Or if y is a variable representing income, you might convert it into categories like “low income”, “middle income”, and “high income”.

- Add a new field, `Risk`, derived from `Y`, to use as the classification label: `Risk` is 1 when `Y` is greater than 211.5 and 0 otherwise.

In [13]:
display(df_diabetes_clean.limit(5))


In [14]:
from pyspark.sql.functions import col, when

# Now you can create the 'Risk' column
df_clean = (df_diabetes_clean.withColumn('Risk', when(col('Y') > 211.5, 1).otherwise(0))).toPandas()

# Show the DataFrame
display(df_clean.head(5))
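
The thresholding rule in the cell above can be sketched in plain Python as follows. The helper name `to_risk` and the sample `Y` values are assumptions for illustration; only the 211.5 cut-off comes from the notebook.

```python
RISK_THRESHOLD = 211.5  # cut-off used in the notebook cell above

def to_risk(y_value, threshold=RISK_THRESHOLD):
    """Return 1 when the value is strictly above the threshold, else 0."""
    return 1 if y_value > threshold else 0

sample_y = [235, 80, 145, 240, 155]  # illustrative Y values
risk = [to_risk(v) for v in sample_y]
print(risk)  # [1, 0, 0, 1, 0]
```

Because the comparison is strict, a value exactly equal to 211.5 falls into class 0.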


#### Defining the `Features` and `Label` (target variable) & Splitting data into `Train` & `Test` datasets

In [15]:
from sklearn.model_selection import train_test_split
    
X, y = df_clean[['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6']].values, df_clean['Risk'].values
    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)


#### Tracking a run of an experiment & automatic logging of parameters

In [16]:
import mlflow
experiment_name = "diabetes-classification"
mlflow.set_experiment(experiment_name)


<Experiment: artifact_location='', creation_time=1713206390433, experiment_id='8e02016b-39f4-4046-8e6b-cbbbcadb4bca', last_update_time=None, lifecycle_stage='active', name='diabetes-classification', tags={}>

- `LogisticRegression(C=1/0.1, solver="liblinear")`: 
    - This creates an instance of the `LogisticRegression` model. The `C` parameter is the inverse of the regularization strength and must be a positive float; smaller values specify stronger regularization. Here, `C` is set to `10` (1/0.1). The `solver` parameter is set to `"liblinear"`, which is a good choice for small datasets and binary classification.
- `.fit(X_train, y_train)`: 
    - This fits the model to the given training data. `X_train` holds the training input samples and `y_train` holds the target values (here, the `Risk` class labels).
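
For reference, the role of `C` can be read off the L2-penalized objective that the scikit-learn user guide gives for binary logistic regression, with labels coded as $y_i \in \{-1, 1\}$. It is included here as a reading aid, assuming the default L2 penalty, rather than as a description of the exact liblinear routine; the symbols $w$, $b$, and $x_i$ are notation introduced for this sketch.

```latex
\min_{w,\,b} \; \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + b\right)\right)\right)
```

With $C = 1/0.1 = 10$, the data-fit term is weighted ten times more heavily than it would be with $C = 1$, so the penalty on $w$ is comparatively weaker.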

In [17]:
from sklearn.linear_model import LogisticRegression
import mlflow

with mlflow.start_run() as run:
    mlflow.autolog()

    model = LogisticRegression(C=1/0.1, solver="liblinear").fit(X_train, y_train)  
    
    # Save the run id to a variable
    run_id = run.info.run_id


2024/04/17 16:53:29 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [18]:
import mlflow

# Assuming run_id is the ID of the run we're interested in
run_info = mlflow.get_run(run_id)

# Now we can access the metrics
training_accuracy_score = run_info.data.metrics['training_accuracy_score']
training_f1_score = run_info.data.metrics['training_f1_score']
training_log_loss = run_info.data.metrics['training_log_loss']
training_precision_score = run_info.data.metrics['training_precision_score']
training_recall_score = run_info.data.metrics['training_recall_score']
training_roc_auc = run_info.data.metrics['training_roc_auc']
training_score = run_info.data.metrics['training_score']

print(
    f"accuracy_score: {training_accuracy_score}\n"
    f"f1_score: {training_f1_score}\n"
    f"log_loss: {training_log_loss}\n"
    f"precision_score: {training_precision_score}\n"
    f"recall_score: {training_recall_score}\n"
    f"roc_auc: {training_roc_auc}\n"
    f"score: {training_score}"
)


accuracy_score: 0.8770226537216829       
f1_score: 0.8743550696597232       
log_loss: 0.3073882860851032       
precision_score: 0.8740071643475847       
recall_score: 0.8770226537216829       
roc_auc: 0.9198990007521219       
score: 0.8770226537216829
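
In this output, `recall_score` is identical to `accuracy_score`. That is what a support-weighted average of per-class recall produces, which suggests the logged precision, recall, and F1 are weighted averages rather than scores for the positive class alone. The plain-Python sketch below illustrates the identity on made-up labels; the helper names and the data are assumptions for illustration, not MLflow's implementation.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_for_class(y_true, y_pred, label):
    """Share of samples of one class that were predicted as that class."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / sum(1 for t in y_true if t == label)

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights equal to each class's share."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(
        (count / n) * recall_for_class(y_true, y_pred, label)
        for label, count in support.items()
    )

# Made-up labels: six low-risk (0) and four high-risk (1) samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(accuracy(y_true, y_pred))
print(weighted_recall(y_true, y_pred))
print(recall_for_class(y_true, y_pred, 1))
```

On these labels the weighted recall matches the accuracy, while the recall for the positive class alone is lower. When the high-risk class is the one that matters, the per-class value is the more informative number to inspect.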


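| AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|
| 62 | 1 | 28.7 | 95 | 162 | 94.5 | 42 | 4 | 4.6821 | 90 | 235 |
| 50 | 2 | 22.5 | 88 | 185 | 104.3 | 71 | 3 | 3.9120 | 70 | 80 |
| 74 | 1 | 31.2 | 94 | 158 | 94.1 | 43 | 4 | 4.6829 | 86 | 145 |
| 26 | 2 | 26.1 | 85 | 200 | 132.1 | 41 | 5 | 4.8974 | 90 | 240 |
| 68 | 1 | 29.8 | 96 | 160 | 95.2 | 40 | 4 | 4.7005 | 88 | 155 |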
