### Imports

In [None]:
from IPython.display import Image, display


### Reality of AI in the market 

In [None]:
file='img/Reality_of_AI_in_jobs.jpg'
display(Image(filename=file, embed=True, width=500, height=800))


In the real world, the path to consistently adding value with AI/ML is much harder than most people want to believe. All 5 stages in the diagram above are critical to the success of an ML project. However, when companies find the talent or tools to cover the data engineering and modeling stages, they often mistake the result for a functional ML-driven system.

### 1. Introduction <a name="introduction"></a>

In this Jupyter notebook, we will explore the Machine Learning Lifecycle, a step-by-step guide to building and deploying machine learning models. Additionally, we will introduce the concept of version control using GitHub, a popular platform for managing and collaborating on code. Whether you're new to machine learning or version control, this notebook will provide a comprehensive introduction.

#### Standard ML cycle

In [None]:
file='img/lifecycle.png'
display(Image(filename=file, embed=True, width=1000, height=500))


#### What is DevOps

In [None]:
file='./img/what-is-devops.jpg'
display(Image(filename=file, embed=True, width=700, height=600))


#### What is MLOps?

MLOps is a set of practices that aims to bring an __experimental Machine Learning model into production and maintain it efficiently__. MLOps focuses on applying the __DevOps__ methodology used in the software industry to the __Machine Learning model lifecycle__.

In [None]:
file='./img/what-is-mlops.png'
display(Image(filename=file, embed=True, width=700, height=600))


With that in mind, we can define some of the main features of an MLOps project:

1. Data and Model Versioning
2. Feature Management and Storing
3. Automation of Pipelines and Processes
4. CI/CD for Machine Learning
5. Continuous Monitoring of Models

#### Who is an MLOps engineer?

In [None]:
file='./img/MLOPs_engineer.jpg'
display(Image(filename=file, embed=True, width=700, height=600))


---

### 2. Machine Learning Lifecycle <a name="machine-learning-lifecycle"></a>

Machine learning projects typically follow a structured lifecycle, which consists of the following phases:

#### Assignment (Crash course)

#### 2.1. Problem Definition <a name="problem-definition"></a>

__"If you don't know where you're going, any road will take you there." - Cheshire Cat (Alice in Wonderland)__

In the world of machine learning, the journey begins with a well-defined problem. The problem definition stage is like setting your GPS coordinates before embarking on a road trip. Without a clear destination, you're just wandering.

Why Problem Definition Matters:
    
1. __Understanding the Problem Domain__: Before diving into data and algorithms, you need to grasp the domain in which your problem resides. Whether it's healthcare, finance, or image recognition, domain knowledge is crucial.

2. __Identifying the Target Audience__: You need to know who will benefit from your solution. Is it a recommendation system for movie enthusiasts, a fraud detection model for banks, or an image classifier for biologists?

3. __Specifying the Goals__: What do you want to achieve? Are you building a model to predict customer churn, classify diseases, or generate art? Your goals should be specific, measurable, achievable, relevant, and time-bound (SMART).
    
---    
_Example_: Predicting Customer Churn

Let's take a real-world example: a telecom company wants to reduce customer churn (customers leaving for competitors). The problem definition might look like this:

_Problem_: Predict customer churn for a telecom company.

_Domain_: Telecommunications

_Target Audience_: Telecom company management

_Goals_: Build a machine learning model that can predict customer churn with at least 80% accuracy within the next three months.

Now, the team knows exactly what they're aiming for. They are ready to move on to the next stages of the machine learning lifecycle.

#### 2.2. Data Collection <a name="data-collection"></a>

In the realm of machine learning, data is the raw material that fuels the creation and training of models. Without data, you have nothing to work with. In this phase, we focus on the essential process of gathering and acquiring the data required to train and test our machine learning model. Data can originate from diverse sources, such as databases, APIs, or specific data collection procedures.

### Why Data Collection Matters

Data collection is a critical phase in the machine learning lifecycle. The importance of this stage lies in the following aspects:

1. **Quality Inputs, Quality Outputs:** The accuracy and effectiveness of your model's predictions are directly linked to the quality of the data you use. If you start with poor-quality data, your model's predictions are likely to be unreliable.

2. **Data Sources:** Identifying the sources of your data is key. Data can be obtained from a wide array of places, including databases storing historical records, web APIs providing real-time information, or data collected directly through sensors and surveys.

3. **Data Volume and Diversity:** Adequate data is essential for training and testing your model. The data you gather should represent diverse scenarios and encompass a variety of situations to ensure your model's generalizability.

### Example: Data Collection for Customer Churn Prediction

Let's consider a practical example. Imagine you are working on a machine learning project to predict customer churn for a telecom company. The data you need may come from various sources within the company, such as customer records, call logs, and service usage statistics. This data could include information such as customer demographics, service plans, contract lengths, and call duration. Your objective is to gather all this relevant data for your project.

In Python, you can use the popular `pandas` library to load and manipulate your data.
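
As a minimal sketch of that loading step, the snippet below reads a small churn table with `pandas`. The column names and values are hypothetical, and the data comes from an inline string only so that the example runs on its own; in a real project the same `read_csv` call would point at a file path or export.

```python
import io

import pandas as pd

# Hypothetical churn records; in a real project this text would come from
# a file, database export, or API response rather than an inline string.
raw_csv = io.StringIO(
    "customer_id,contract_months,monthly_charges,churned\n"
    "1001,12,29.85,0\n"
    "1002,1,56.95,1\n"
    "1003,24,42.30,0\n"
)

# read_csv accepts a file path or any file-like object
customers = pd.read_csv(raw_csv)

# Quick sanity checks before any modeling
print(customers.shape)               # number of rows and columns
print(customers.dtypes)              # inferred column types
print(customers["churned"].mean())   # share of churned customers
```

Checking the shape, column types, and label balance right after loading tends to surface collection problems early, before they reach preprocessing.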

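**NOTE** - It is worth practicing with `Dask` and `Polars` alongside pandas. Dask offers a pandas-like API that splits work into partitions, so it can process datasets larger than memory and run in parallel or on a cluster. Polars is a separate DataFrame library (not built on pandas) with a lazy, multi-threaded query engine that is often faster and more memory-efficient on a single machine.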

#### 2.3. Data Preprocessing <a name="data-preprocessing"></a>


Data preprocessing is an indispensable phase in the machine learning lifecycle. In this step, you perform several essential tasks to ensure that your data is clean, well-structured, and ready for model training. Data preprocessing involves operations like cleaning, transformation, and preparation of the data to make it suitable for feeding into machine learning models. Common tasks in data preprocessing include handling missing values, scaling features, and encoding categorical variables.

### Why Data Preprocessing Matters

Data preprocessing plays a pivotal role for the following reasons:

1. **Quality Assurance:** Clean and well-processed data leads to a more accurate and reliable model. Data inconsistencies or missing values can lead to skewed results or model errors.

2. **Model Compatibility:** Different machine learning algorithms have specific data requirements. Data preprocessing ensures that your data is compatible with the chosen algorithm.

3. **Improved Efficiency:** Preprocessing can reduce the time and resources needed for model training and improve the model's overall performance.

### Data Preprocessing Techniques

1. **Handling Missing Values:** Data may have missing values, and these gaps need to be addressed. Common techniques include filling missing values with averages, medians, or zeros.

2. **Scaling Features:** Features often have varying scales. Scaling, such as normalization or standardization, is used to bring features to a common scale, preventing some features from dominating others.

3. **Encoding Categorical Variables:** Many machine learning algorithms require numerical input, so categorical variables (like "Red," "Green," "Blue") need to be converted into numerical representations. Common methods include one-hot encoding and label encoding.
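
The scaling and encoding techniques above can be sketched as follows. This is an illustration on a hypothetical three-row table (the column names are assumptions), using `StandardScaler` from scikit-learn for standardization and `pandas.get_dummies` for one-hot encoding.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features and one categorical feature
df = pd.DataFrame({
    "monthly_charges": [20.0, 50.0, 80.0],
    "tenure_months": [1.0, 12.0, 60.0],
    "color": ["Red", "Green", "Blue"],
})

numeric_cols = ["monthly_charges", "tenure_months"]

# Standardization: rescale each numeric column to mean 0 and unit variance
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=["color"])

print(encoded.columns.tolist())
print(encoded)
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to validation and test data, so that no information from held-out data leaks into training.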

### Example: Handling Missing Values

Let's take an example where you have a dataset with missing values, and you want to fill those missing values with the mean value of the respective column using Python and the `pandas` library:

```
import pandas as pd

# Load the dataset
data = pd.read_csv('data_with_missing_values.csv')

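# Fill missing values in numeric columns with each column's mean
# (numeric_only=True skips text columns, which have no mean)
data = data.fillna(data.mean(numeric_only=True))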
```

#### 2.4. Model Building <a name="model-building"></a>

The model building phase is a pivotal step in the machine learning process. Here, you move from data preparation to constructing the actual machine learning model. This phase involves choosing the appropriate machine learning algorithm, defining the model's architecture, and configuring its hyperparameters.

### Why Model Building Matters

The significance of the model building phase can be summarized as follows:

1. **Algorithm Selection:** The choice of the right machine learning algorithm is critical. The performance of your model hinges on selecting an algorithm that aligns with the nature of your data and the specific problem you are trying to solve.

2. **Model Architecture:** Designing the architecture of your model entails determining the number of layers, nodes, and connections for neural networks or setting parameters for other algorithm types.

3. **Hyperparameter Tuning:** Hyperparameters are settings that influence how your model learns. Finding the optimal combination of hyperparameters is essential for achieving peak model performance.
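
One common way to search for a good hyperparameter combination is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the dataset and the candidate values in `param_grid` are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; replace with your own features and labels)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate hyperparameter values to try (illustrative only)
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, 6],
}

# 3-fold cross-validated search over every combination in the grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Grid search grows quickly with the number of parameters, so for larger search spaces a randomized search is often a cheaper starting point.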

### Algorithm Selection

Your selection of a machine learning algorithm depends on the problem type:

- **Regression:** For regression tasks, where the goal is to predict a continuous value, algorithms like Linear Regression, Decision Trees, or Random Forests are commonly employed.

- **Classification:** In classification problems, where data must be assigned to predefined categories, options include Logistic Regression, Support Vector Machines, and Neural Networks.

- **Clustering:** Clustering algorithms like K-Means or DBSCAN are used for unsupervised learning tasks.
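
To make the three problem types concrete, here is a small sketch that fits one scikit-learn estimator of each kind on the same synthetic features. The data and targets are made up for the example; only the shape of each call matters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Regression: predict a continuous target
y_continuous = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_continuous)

# Classification: predict one of a fixed set of labels
y_label = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_label)

# Clustering: group unlabeled points (no target is passed to fit)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(reg.coef_)                 # close to [3, -2] by construction
print(clf.score(X, y_label))     # training accuracy
print(np.bincount(km.labels_))   # number of points in each cluster
```

Note that the regression and classification estimators both take a target in `fit`, while the clustering estimator only sees `X`; that difference is what separates supervised from unsupervised learning in practice.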

### Model Architecture and Hyperparameters

For neural networks and deep learning models, careful consideration is required for model architecture and hyperparameters:

- **Architecture:** You must determine the number of layers, neurons in each layer, activation functions, and connections between layers.

- **Hyperparameters:** Choices include learning rates, batch sizes, regularization terms, and other parameters that impact the training process.

### Example: Building a Random Forest Classifier

Here's an example of building a Random Forest Classifier in Python using the `scikit-learn` library:

```
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)

# Fit the model to your training data
# (X_train and y_train are assumed to come from an earlier train/test split)
model.fit(X_train, y_train)
```

In this example, a Random Forest Classifier is created with specified hyperparameters such as the number of trees in the forest (n_estimators) and the maximum depth of the trees (max_depth). The model is then trained on the provided training data.

Model building is where your problem-solving skills, domain knowledge, and understanding of machine learning algorithms come into play. It's a critical phase that sets the stage for the subsequent steps in the machine learning lifecycle.

#### 2.5. Model Training <a name="model-training"></a>


The model training phase is a pivotal step in the machine learning lifecycle, where you put your preprocessed data to work. In this stage, you feed your data into the chosen machine learning model, adjust its parameters, and fine-tune it to minimize errors and enhance the accuracy of its predictions.

### Why Model Training Matters

Model training is a critical phase for the following reasons:

1. **Learning from Data:** During training, the model learns to make predictions by recognizing patterns in the data. It optimizes its internal parameters to make accurate predictions on unseen data.

2. **Minimizing Errors:** The goal is to reduce the discrepancy between the model's predictions and the actual outcomes. Model training is all about iteratively minimizing errors and improving prediction accuracy.

3. **Generalization:** A well-trained model can generalize its knowledge to make predictions on new, unseen data. Generalization is a key criterion for model success.

### The Training Process

The training process typically involves the following steps:

1. **Data Splitting:** Your data is divided into a training set and one or more held-out sets. The training set is used to fit the model, a validation set guides tuning during training, and a test set gives a final estimate of performance on unseen data.

2. **Initialization:** The model's parameters are initialized, and the training process begins.

3. **Forward and Backward Passes:** For gradient-based models such as neural networks, each training iteration makes predictions (forward pass), calculates the error, and adjusts the model's internal parameters (backward pass) to reduce that error.

4. **Iterative Optimization:** The model goes through multiple iterations, continually adjusting its parameters to minimize errors.

5. **Validation:** The model's performance is evaluated on the testing/validation set to ensure it's not overfitting the training data.
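
The five steps above can be sketched end to end with plain NumPy. This is a minimal logistic regression trained by gradient descent on synthetic data; the dataset, learning rate, and iteration count are assumptions picked so the example runs quickly, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (an assumption for illustration)
X = rng.normal(size=(400, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + rng.normal(scale=0.5, size=400) > 0).astype(float)

# 1. Data splitting: 80% training, 20% validation
X_train, X_val = X[:320], X[320:]
y_train, y_val = y[:320], y[320:]

# 2. Initialization
w = np.zeros(3)
b = 0.0
learning_rate = 0.1


def predict_proba(features, weights, bias):
    """Sigmoid of the linear score."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))


def log_loss(targets, probs):
    """Mean binary cross-entropy, with a small epsilon for numerical safety."""
    eps = 1e-12
    return -np.mean(
        targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps)
    )


initial_loss = log_loss(y_train, predict_proba(X_train, w, b))

# 3-4. Forward pass and backward pass, repeated for many iterations
for step in range(500):
    probs = predict_proba(X_train, w, b)        # forward pass
    error = probs - y_train                     # gradient of the loss w.r.t. the score
    grad_w = X_train.T @ error / len(y_train)   # backward pass: gradient for weights
    grad_b = error.mean()                       # backward pass: gradient for bias
    w -= learning_rate * grad_w                 # parameter update
    b -= learning_rate * grad_b

final_loss = log_loss(y_train, predict_proba(X_train, w, b))

# 5. Validation on data the model never trained on
val_accuracy = np.mean((predict_proba(X_val, w, b) > 0.5) == (y_val == 1.0))

print(initial_loss, final_loss, val_accuracy)
```

Libraries such as scikit-learn and Keras hide this loop behind `fit`, but the same split, initialize, update, and validate pattern is what runs underneath.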

### Example: Training a Neural Network

Here's a simplified example of training a neural network with Keras, the high-level deep learning API bundled with TensorFlow:

```
import tensorflow as tf
from tensorflow import keras

# Define the neural network architecture (10 input features)
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')  # outputs a probability for binary classification
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
# (X_train, y_train, X_val, and y_val are assumed to be prepared beforehand)
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))
```

In this example, a neural network is defined with its architecture, compiled with the choice of optimizer and loss function, and then trained on the training data for a specified number of epochs.

Model training is where your machine learning model truly learns to make predictions. The iterative optimization process is crucial for achieving accuracy and generalization.

#### 2.6. Model Evaluation <a name="model-evaluation"></a>

Model evaluation is a critical phase in the machine learning lifecycle that comes after your model has been trained. During this stage, you assess your model's performance to understand how well it's performing on unseen data. Model evaluation involves the use of various metrics to gauge accuracy, precision, recall, F1-score, and more.

### Why Model Evaluation Matters

Model evaluation is indispensable for several reasons:

1. **Performance Assessment:** You need to determine how well your model is doing in real-world scenarios. Model evaluation provides insights into its effectiveness and reliability.

2. **Comparison:** Evaluation metrics allow you to compare different models or variations of the same model to select the best-performing one.

3. **Decision Making:** The results of model evaluation often guide decision-making processes. For instance, in healthcare, it may impact patient treatment plans, while in finance, it can influence investment decisions.

### Common Evaluation Metrics

Several metrics are commonly used to evaluate machine learning models:

- **Accuracy:** The fraction of all predictions that are correct. It can be misleading when one class is much more common than the other.
- **Precision:** Evaluates the percentage of true positive predictions among all positive predictions made by the model.
- **Recall (Sensitivity):** Measures the proportion of actual positive cases that were correctly identified by the model.
- **F1-Score:** Harmonic mean of precision and recall, providing a balance between the two.
- **Confusion Matrix:** A table that describes the model's performance, showing true positives, true negatives, false positives, and false negatives.
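
To show how these metrics relate to the confusion matrix, the sketch below computes them by hand from ten hypothetical spam-filter predictions. The labels are made up for the example.

```python
# Hypothetical spam-filter results: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal mail passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal mail flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```

Here accuracy is 0.8 while precision, recall, and F1 are all 0.75, which shows how accuracy alone can look better than the model's performance on the positive class.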

### Example: Model Evaluation

Let's consider an example where you have trained a binary classification model to predict whether an email is spam or not. After training, you evaluate its performance using common metrics. In Python, you can use libraries like `scikit-learn` to calculate these metrics:

```
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

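# Make predictions on the test data
# (model, X_test, and y_test are assumed to come from the earlier training steps;
#  predict must return class labels, as scikit-learn classifiers do)
y_pred = model.predict(X_test)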

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
```

Model evaluation allows you to gauge the strengths and weaknesses of your model and make informed decisions about its deployment and further optimization.

#### 2.7. Model Deployment <a name="model-deployment"></a>

Model deployment is the final and crucial phase in the machine learning lifecycle. It marks the transition from a successful model trained on historical data to a real-world solution that makes predictions on new, unseen data. This phase requires careful planning, implementation, and considerations for scalability and monitoring.

### Why Model Deployment Matters

Model deployment holds significant importance for several reasons:

1. **Real-World Application:** Deploying a model means putting it to practical use. It can have real-world impacts, such as helping doctors diagnose diseases, assisting in autonomous driving, or powering recommendation systems for online shopping.

2. **Return on Investment:** The value of a machine learning model is realized when it is deployed. It can lead to cost savings, revenue generation, or process optimization.

3. **Monitoring and Adaptation:** Deployed models need ongoing monitoring to ensure they continue to perform well. Changes in the data distribution or model drift might necessitate adaptations to maintain accuracy.

### The Deployment Process

The model deployment process can be summarized as follows:

1. **Environment Setup:** Create a production environment suitable for running the model. This may involve setting up servers, cloud infrastructure, or edge devices.

2. **Model Packaging:** Prepare the model for deployment by saving its architecture, weights, and preprocessing steps.

3. **Scalability:** Consider the scalability of the deployed model to handle varying workloads and user demand. This may involve deploying multiple instances of the model or using load balancing.

4. **Monitoring:** Implement a monitoring system to keep an eye on the model's performance. Detect model drift, changes in data distribution, and any decrease in prediction accuracy.

5. **Feedback Loop:** Create a feedback loop to collect user feedback and actual outcomes. This information can be used to retrain the model and improve its performance.
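
The packaging step can be sketched as follows. One possible approach, assumed here, is to bundle preprocessing and the model in a scikit-learn `Pipeline` and serialize it with Python's built-in `pickle`; the data is synthetic and the artifact is kept in memory so the example runs on its own.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic training data (an assumption for illustration)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Bundle preprocessing and model so they are always deployed together
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# Serialize to bytes; in practice these would be written to a file or artifact store
artifact = pickle.dumps(pipeline)

# Later, in the serving environment: restore the pipeline and predict
restored = pickle.loads(artifact)
new_rows = rng.normal(size=(5, 4))
print(restored.predict(new_rows))
```

Pickled models should be loaded with the same library versions that created them, and only from sources you trust, because unpickling can execute arbitrary code.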

### Example: Deploying a Predictive Maintenance Model

Imagine you've trained a predictive maintenance model to forecast when industrial machines need maintenance to prevent breakdowns. After thorough evaluation, you decide to deploy the model in a factory environment. The deployment process would involve setting up servers in the factory, packaging the model for deployment, and monitoring its performance. Additionally, you'd implement a feedback loop to incorporate user feedback and data from the machines to continuously improve the model.

Model deployment is the stage where machine learning moves from a research or development context to a practical solution with real-world impact. Careful planning and ongoing monitoring are essential for success.
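
As a rough illustration of the monitoring step, the sketch below flags a feature whose live values have shifted away from what the model saw in training. It is a crude mean-shift check rather than a full statistical drift test, and the function name, sensor values, and threshold are all hypothetical.

```python
import numpy as np


def mean_shift_score(reference, live):
    """Shift of the live mean from the reference mean, in reference standard deviations."""
    reference = np.asarray(reference, dtype=float)
    live = np.asarray(live, dtype=float)
    return abs(live.mean() - reference.mean()) / reference.std()


rng = np.random.default_rng(0)

# Hypothetical sensor feature: values seen at training time vs. in production
training_values = rng.normal(loc=50.0, scale=5.0, size=1000)
stable_values = rng.normal(loc=50.0, scale=5.0, size=1000)
drifted_values = rng.normal(loc=60.0, scale=5.0, size=1000)

THRESHOLD = 0.5  # illustrative cutoff, an assumption to tune per feature

for name, values in [("stable", stable_values), ("drifted", drifted_values)]:
    score = mean_shift_score(training_values, values)
    print(name, round(score, 2), "ALERT" if score > THRESHOLD else "ok")
```

A production system would typically track several features and the prediction distribution over time, and route alerts into the retraining feedback loop.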



----

### 3. Introduction to Version Control <a name="introduction-to-version-control"></a>

Version control is a fundamental and indispensable system that underpins the development and maintenance of software projects. It provides a structured and efficient way to track and manage changes made to your codebase over time. Whether you are working solo or collaborating with a team, version control offers a range of benefits that enhance the quality and reliability of your software.

### Key Aspects of Version Control

Version control systems (VCS) encompass several key aspects:

1. **Change Tracking:** Version control allows you to monitor every alteration made to your codebase, no matter how small. Each change, or "commit," is documented and timestamped, creating a detailed history of your project's evolution.

2. **Collaboration:** VCS enables seamless collaboration among developers. Multiple team members can work on the same project concurrently, with the ability to merge their changes systematically.

3. **Reversion:** Inevitably, mistakes or unforeseen issues can arise during development. Version control makes it possible to revert to a previous state of the project, helping you recover from errors or bugs.

4. **Branching:** VCS allows the creation of branches, which are independent lines of development. This is particularly useful for parallel work on different features or bug fixes without interfering with the main project.

5. **Documentation:** Detailed commit messages and comments accompany each change, providing context and insight into the reasons behind the modifications. This documentation is invaluable for project maintenance and knowledge sharing.

### Types of Version Control Systems

There are two primary categories of version control systems:

1. **Centralized Version Control Systems (CVCS):** In a CVCS, there is a central repository that stores the entire history and current state of the project. Developers can check out files, make changes, and commit them back to the central repository.

2. **Distributed Version Control Systems (DVCS):** DVCS systems like Git provide each developer with a complete copy of the project, including its entire history. This decentralization offers more flexibility, independence, and a distributed workflow.

### Example: Using Git for Version Control

Git is one of the most popular distributed version control systems. Here is a basic workflow example:

1. **Initialization:** Start a Git repository by running `git init` in your project directory.

2. **Adding Files:** Stage your files with `git add` to include them in the next commit.

3. **Committing:** Create a snapshot of your project with `git commit -m "Your commit message"`.

4. **Branching:** Create a new branch with `git branch`, switch to it with `git checkout`, and merge it into the main branch with `git merge` when your feature is complete.

5. **Collaboration:** Push your changes to a remote repository, and others can clone and collaborate on your project.

Version control systems like Git play a pivotal role in software development, helping teams collaborate efficiently, track changes, and manage the software's evolution over time.

#### TODO: Provide installation details for miniconda for students.

### 4. Using GitHub for Version Control <a name="using-github"></a>

#### 1. What is GitHub?

GitHub is a web-based platform for version control and collaboration. It allows developers to work on projects, track changes, and collaborate with others efficiently. GitHub uses Git, a distributed version control system, as its underlying technology.

#### 2. Setting Up GitHub

Before you get started with GitHub, you need to create an account. Visit [GitHub](https://github.com/) and sign up for a free account if you don't already have one.

#### 3. Creating a Repository

A repository, or "repo" for short, is where your project and its files are stored. To create a new repository:

- Log in to GitHub.
- Click the **+** sign in the top right corner and select **New repository**.
- Enter a name for your repository.
- Choose between a public or private repository. Public repositories are visible to everyone, while private repositories are only visible to you and collaborators.
- Optionally, select the option to initialize the repository with a README file, which can serve as documentation for your project.
- Click **Create repository**.

#### 4. Cloning a Repository

Cloning a repository means creating a local copy on your computer to work with. To clone a repository:

- Go to the repository you want to clone on GitHub.
- Click the **Code** button, and copy the URL.
- Open your terminal or command prompt.
- Navigate to the directory where you want to clone the repository.
- Run the following command to clone the repository:
```
git clone <repository_URL>
```

#### 5. Making Changes

Once you have the repository on your computer, you can make changes to the files inside. Use your preferred code editor to make modifications.

**NOTE** - I suggest using VS Code for your projects. I find it fast, reliable and extendable.

#### 6. Committing Changes

After making changes, you need to commit them to your local repository. This is like taking a snapshot of your changes. Use the following commands:

- To stage your changes:
```
git add .
```
- To commit your changes:
```
git commit -m "Your commit message"
```

#### 7. Pushing Changes

- To send your local commits to the GitHub repository, use the following command:
```
git push origin <branch_name>
```

#### 8. Branching

Branches allow you to work on different features or bug fixes without affecting the main project. To create a new branch:

- Create a branch:
```
git branch <new_branch_name>
```
- Switch to the new branch:
```
git checkout <new_branch_name>
```

#### 9. Pull Requests

When you're ready to merge your changes into the main project, you create a pull request (PR). To create a PR:

- Go to the repository on GitHub.
- Click the **Pull requests** tab.
- Click the **New pull request** button.
- Select the branches you want to compare.
- Write a description and click **Create pull request**.

#### 10. Collaborating with Others

GitHub is excellent for collaborating with others. You can invite collaborators to your repository, and they can work on the same project. It's important to follow the same procedures for making changes, committing, pushing, and creating pull requests.

### **Assignment: GitHub Fundamentals Assessment**

**Assignment Overview:**
In this assignment, you will demonstrate your understanding of the basic concepts and commands for using GitHub. You will be required to perform various tasks related to GitHub, including creating a repository, making changes, committing, branching, and collaborating with others.

**Instructions:**
1. **Create a GitHub Account:**
   If you don't already have a GitHub account, sign up for a free account at [GitHub](https://github.com/).

2. **Repository Creation:**
   - Create a new public repository on GitHub. Name it "GitHubAssignment."

3. **Cloning the Repository:**
   - Clone the "GitHubAssignment" repository to your local computer using the provided URL.

4. **Making Changes:**
   - Inside the cloned repository, create a new text file named "my_changes.txt."
   - Add any text content to "my_changes.txt."

5. **Committing Changes:**
   - Stage your changes.
   - Commit the changes with a meaningful commit message.

6. **Pushing Changes:**
   - Push your committed changes to the GitHub repository.

7. **Branching:**
   - Create a new branch in the repository. Name it "feature-branch."

8. **Pull Requests:**
   - Switch to the "feature-branch."
   - Create a new text file in the branch named "new_feature.txt" and add some content.
   - Commit your changes on the "feature-branch."
   - Push your changes to the "feature-branch."
   - Create a pull request to merge the "feature-branch" into the main branch. Provide a description for the pull request.

9. **Collaboration (Optional):**
   - Invite a friend or classmate to collaborate on your "GitHubAssignment" repository. They can make changes to the repository, commit, and create pull requests.

**Evaluation Criteria:**
Your assignment will be evaluated based on the following criteria:

- Successful completion of each step of the assignment.
- Clarity and completeness of commit messages.
- Proper branching and pull request creation.
- Following best practices for collaboration (if collaborating with others).
- Following the GitHub commands and concepts as explained in the tutorial.

**Submission:**
Once you have completed the assignment, share the URL of your "GitHubAssignment" repository with other students and conduct peer review.

**Note:** This assignment is designed to test your practical understanding of GitHub fundamentals. Be sure to complete it independently to the best of your ability.

---

### 5. Industry Trivia <a name="Industry"></a>

#### Guess the use-case

There are 24 logos in this image. Guess the use case of the tool/service highlighted by these logos.

##### Show image

In [None]:
file='img/zenml.jpg'
display(Image(filename=file, embed=True, width=800, height=800))


#### Industry tools

In [None]:
file='./img/Ops_tools_and_companies.jpg'
display(Image(filename=file, embed=True, width=1000, height=1000))


### 6. Next two weeks <a name="next-2-weeks"></a>

1. ML pipeline reproducibility, Versioning and Packaging
    1. Project structures [CookieCutter](http://drivendata.github.io/cookiecutter-data-science/)
    2. Data Registry [DVC](https://dvc.org/)
    3. AWS SageMaker pipelines and high-level architecture for DS setup
    
2. End-to-End Lifecycle management
    1. ML Model Registry [MLflow](https://mlflow.org/)
    2. ML experiments management and tracking [MLflow](https://mlflow.org/)
    3. End-to-End workflow with AWS SageMaker


**See you next week!**