## Get started with machine learning in Azure Databricks

### Principles of Machine Learning
Machine learning is a programming approach for building predictive models. Unlike traditional programming, which spells out exact steps, machine learning uses algorithms to learn the relationship between input features and an output label from large amounts of data.

Simply put, a machine learning model is a function that takes input features `x` and returns a prediction `y`, written as `y = f(x)`.

The algorithm decides how the model calculates the label from the input.
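As a minimal sketch of the `y = f(x)` idea, the snippet below builds a linear model by hand. The helper name `make_linear_model` and the weight and bias values are made up for illustration; in practice a training algorithm would learn those numbers from data.

```python
def make_linear_model(weights, bias):
    """Return a function f(x) that maps a feature vector to a prediction."""
    def f(x):
        # Weighted sum of the input features plus a constant offset.
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return f

# Hypothetical parameters; a real model would get these from training.
f = make_linear_model(weights=[2.0, -1.0], bias=0.5)

x = [3.0, 4.0]   # input features
y = f(x)         # prediction: 2*3 - 1*4 + 0.5
print(y)
```

A different algorithm (for example a decision tree) would produce a different kind of `f`, but the calling pattern stays the same: features in, prediction out.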

### Machine Learning in Azure Databricks
Azure Databricks offers a cloud-based workspace for developing machine learning models.

Here, data scientists prepare and analyze data, while machine learning engineers deploy and maintain the models.

Databricks lets you handle the whole workflow in one place: loading and cleaning data, then training, testing, and managing models.

### Preparing Data for Machine Learning
Data preparation varies by project, but you usually need to fix these common issues:

- **Incomplete data**: Some fields are missing (NULL). You can fill in missing values (e.g., using the mean) or drop incomplete rows.
- **Errors**: Data entry mistakes happen. Use queries and visualizations to spot unusual values.
- **Outliers**: Extreme values can skew the model. You might remove them or replace them with a capped value.
- **Incorrect data types**: Numeric values may be stored as text. Convert each column to the type that matches its content.
- **Unbalanced data**: If some label categories appear far more often than others, the model can become biased toward them. You can balance the data by duplicating rare cases (oversampling) or removing excess common ones (undersampling).
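Three of the fixes above (type conversion, outlier capping, and mean imputation) can be sketched in plain Python. The sample values and the `MAX_AGE` cap are assumptions chosen for illustration; in a Databricks notebook you would more likely apply the same logic with DataFrame operations.

```python
from statistics import mean

# Made-up raw column: ages stored as text, one missing, one extreme value.
raw_ages = ["34", "29", None, "41", "230", "38"]

# Incorrect data types: convert text to numbers, keeping None for missing values.
ages = [float(a) if a is not None else None for a in raw_ages]

# Outliers: cap values at an assumed upper bound.
MAX_AGE = 100.0
capped = [min(a, MAX_AGE) if a is not None else None for a in ages]

# Incomplete data: fill missing values with the mean of the observed values.
observed = [a for a in capped if a is not None]
fill_value = mean(observed)
clean_ages = [a if a is not None else fill_value for a in capped]

print(clean_ages)
```

The order matters here: capping before imputing keeps the extreme value from inflating the mean that is used to fill the gap.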

### Feature Engineering
Feature engineering means improving or creating features to help the model learn better. Common tasks include:

- **Deriving new features**: Create new columns from existing ones, like extracting `day_of_week` from a date.
- **Discretizing numeric features**: Group numbers into categories (e.g., low, medium, high).
- **Encoding categorical features**: Convert text labels into numbers or one-hot encoded columns.
- **Scaling numeric values**: Normalize numbers to a similar range so that no feature dominates due to scale.
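The four tasks above can be illustrated on a tiny, made-up dataset. The column names (`order_date`, `amount`, `channel`) and the band thresholds are hypothetical; this is a sketch of the ideas, not a recommended pipeline.

```python
from datetime import date

# Made-up records for illustration.
rows = [
    {"order_date": date(2024, 1, 1), "amount": 20.0, "channel": "web"},
    {"order_date": date(2024, 1, 6), "amount": 80.0, "channel": "store"},
    {"order_date": date(2024, 1, 10), "amount": 50.0, "channel": "web"},
]

channels = sorted({r["channel"] for r in rows})
lo = min(r["amount"] for r in rows)
hi = max(r["amount"] for r in rows)

def amount_band(amount):
    """Discretize a number into bands; the thresholds are arbitrary."""
    if amount < 30:
        return "low"
    if amount < 60:
        return "medium"
    return "high"

features = []
for r in rows:
    feat = {
        # Derived feature: 0 = Monday ... 6 = Sunday.
        "day_of_week": r["order_date"].weekday(),
        # Discretized feature.
        "amount_band": amount_band(r["amount"]),
        # Min-max scaling to the range [0, 1].
        "amount_scaled": (r["amount"] - lo) / (hi - lo),
    }
    # One-hot encoding: one 0/1 column per category.
    for c in channels:
        feat[f"channel_{c}"] = 1 if r["channel"] == c else 0
    features.append(feat)

for feat in features:
    print(feat)
```

Note that the scaling bounds (`lo`, `hi`) and the category list are computed from the data. When a validation set is involved, they would normally be computed on the training rows only and then reused.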

### Training a Machine Learning Model
- Training a model means fitting the algorithm's parameters to data so that its predictions are accurate.
- Although this idea is simple, making a model that works well on new data takes many iterations of training and testing with different algorithms and settings.

### Training and Validation Data
To test how well a model generalizes, you split data into training and validation sets (commonly 70% train, 30% validate).

A model that does well on training data but poorly on new data is overfitted: it has learned details of the training set that do not generalize.
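A 70/30 split can be sketched as follows. The function name `train_validation_split` and the fixed seed are assumptions for this example; the list of integers stands in for labeled rows.

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into training and validation sets."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))  # stand-in for 100 labeled rows
train, validation = train_validation_split(data)
print(len(train), len(validation))
```

Shuffling before cutting matters when the rows are ordered (for example by date or by label). With Spark DataFrames, `DataFrame.randomSplit([0.7, 0.3], seed=...)` serves a similar purpose, although its split sizes are approximate rather than exact.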

### Machine Learning Algorithms
Algorithms differ based on the problem:
- **Classification algorithms**: Predict categories (e.g., logistic regression, decision trees, ensembles).
- **Regression algorithms**: Predict numeric values.
- **Clustering algorithms**: Group similar data without labels.

Choosing the right algorithm often requires trial and error.

### Hyperparameters
Besides data inputs, algorithms have hyperparameters you can adjust to control how they train (such as the random seed, the number of iterations, and the maximum tree depth).

These settings affect accuracy and training time.

### Fitting a Model
- Fitting means training the algorithm on your data.
- In supervised learning, you fit based on known labels.
- In unsupervised learning, you fit to group data without labels.
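To make supervised fitting concrete, here is a minimal sketch that fits a one-feature linear model by gradient descent. The function `fit_linear` is hypothetical, and `learning_rate` and `iterations` play the role of hyperparameters; the data is synthetic and follows `y = 2x + 1` exactly.

```python
def fit_linear(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for every training row with the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient; the step size is a hyperparameter.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]   # feature values
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # known labels (y = 2x + 1)
w, b = fit_linear(xs, ys)
print(w, b)
```

Changing the hyperparameters changes the outcome: too few iterations leaves `w` and `b` short of their best values, while a learning rate that is too large can make the updates diverge.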

### Validating a Machine Learning Model
**Regression** models predict numbers and are evaluated with:

- **MSE** (Mean Squared Error): Average of squared prediction errors.
- **RMSE** (Root Mean Squared Error): The square root of MSE, in the same units as the label.
- **R²** (Coefficient of Determination): The proportion of variance in the label that the model explains (closer to 1 is better).
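The three regression metrics can be computed directly from their definitions. The helper name `regression_metrics` and the sample values are made up for this sketch.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Return (MSE, RMSE, R²) for paired actual and predicted values."""
    n = len(actual)
    # Sum of squared prediction errors (residuals).
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mse = ss_res / n
    rmse = sqrt(mse)
    # Total variation of the labels around their mean.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, r2

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
mse, rmse, r2 = regression_metrics(actual, predicted)
print(mse, rmse, r2)
```

This sketch assumes the labels are not all identical; otherwise `ss_tot` is zero and R² is undefined.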

**Classification** models predict categories and are evaluated with:

- **Accuracy**: Proportion of all predictions that are correct.
- **Precision**: Correct positive predictions out of all predicted positives.
- **Recall**: Correct positive predictions out of all actual positives.
- **F1 Score**: Harmonic mean of precision and recall, which balances the two.
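For a binary classifier, these four metrics follow from counts of true positives, false positives, and false negatives. The helper `classification_metrics` and the label lists below are hypothetical.

```python
def classification_metrics(actual, predicted, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall, f1 = classification_metrics(actual, predicted)
print(accuracy, precision, recall, f1)
```

Returning 0.0 for an undefined precision or recall is one common convention, chosen here as an assumption; some tools raise a warning instead.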

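**Clustering** models are evaluated with metrics such as the silhouette score, which compares how close each point is to its own cluster with how close it is to the nearest other cluster (from -1 to 1; higher is better).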
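A small sketch of the silhouette score, written from its usual definition, shows how it separates a good grouping from a poor one. The function below is an illustrative implementation that assumes at least two clusters and Euclidean distance; the points and labels are made up.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette over all points; expects at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # nearby points grouped together
bad = silhouette_score(points, [0, 1, 0, 1])   # distant points grouped together
print(good, bad)
```

The grouping that keeps nearby points together scores close to 1, while the grouping that pairs distant points scores below 0.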