# 📘 Introduction to Machine Learning Pipelines for AI Beginners

## 👋 Welcome to Your First AI Class!

Welcome, future AI expert! In this 2-hour session, we are going to demystify the process that turns raw data into intelligent predictions. This process is called a **Machine Learning (ML) Pipeline**.

Think of it like a recipe for baking a cake. You gather ingredients (data collection), prepare them (preprocessing), combine them in creative ways (feature engineering), bake the cake (model training), taste it to see if it's good (evaluation), serve it (deployment), and keep checking that it still tastes good every time you bake it (monitoring and maintenance). By the end of this session, you'll understand each of these seven steps.

### 🎯 Learning Objectives for Today:

1.  **Understand the 7 key stages** of a Machine Learning pipeline.
2.  **Load and explore** a real dataset using Python libraries.
3.  **Prepare data** for a machine learning model.
4.  **Train a basic model** to make predictions.
5.  **Evaluate** how well your model performs.
6.  **Complete a final assignment** to build your own mini-pipeline!

Let's get started! 🚀

## 1. Data Collection

This is the first and most important step! We need data to learn from. The quality of our model depends heavily on the quality of our data. We can get data from many places like files, databases, or even public datasets available online.
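As a small illustration of loading data from a file, here is a minimal sketch that parses CSV text with `pd.read_csv`. The CSV content is typed in by hand for this example (the name `sample_df` is just a placeholder); in practice you would pass a file path or URL instead of the in-memory text.

```python
import io

import pandas as pd

# Hand-typed CSV content standing in for a file on disk.
csv_text = """total_bill,tip,size
16.99,1.01,2
10.34,1.66,3
21.01,3.50,3
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # pandas infers the numeric types for us
```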

For this class, we'll use a built-in dataset from a popular Python library called `seaborn`.

In [7]:
# First, we need to import the libraries that will help us.
# pandas is for working with data tables (we call them DataFrames).
# seaborn is for loading datasets and creating cool visualizations.

import pandas as pd
import seaborn as sns

# Let's load a dataset about restaurant tips.
# This dataset contains information about the total bill, the tip amount, gender of the payer, etc.
tips_df = sns.load_dataset('tips')

# Let's look at the first 5 rows to see what it looks like!
print("Here are the first 5 rows of our dataset:")
tips_df.head()

Here are the first 5 rows of our dataset:


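|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |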


### 🎯 Practice Task 1: Explore the Dataset's Size

It's always good to know how much data we're working with. Can you find out how many rows and columns are in our `tips_df` dataset? 

💡 **Hint:** You can use the `.shape` attribute on a DataFrame. For example: `your_dataframe.shape`.

In [9]:
# Your code here! Find the shape of the tips_df dataset.
print("The shape of our dataset (rows, columns) is:")
# print(tips_df.shape) # Uncomment this line and run the cell!

The shape of our dataset (rows, columns) is:


## 2. Data Preprocessing

Raw data is often messy! It can have missing values, errors, or be in a format that machines don't understand. Preprocessing is all about cleaning and preparing our data for the model.

**Common Steps:**
*   **Handling Missing Values:** Find empty spots in our data and decide how to fill them (e.g., with the average value).
*   **Encoding Categorical Data:** Computers understand numbers, not text like "Male" or "Female". We need to convert these categories into numbers. A popular method is **One-Hot Encoding**.
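
The two steps above can be sketched on a tiny, made-up table. The `toy_df` name and its values are illustrative assumptions, not part of the tips dataset.

```python
import pandas as pd

# A made-up table with one missing age and one text column.
toy_df = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "sex": ["Male", "Female", "Female"],
})

# Handling missing values: fill the empty 'age' with the column average.
toy_df["age"] = toy_df["age"].fillna(toy_df["age"].mean())

# One-hot encoding: one new 0/1 column per category found in 'sex'.
toy_encoded = pd.get_dummies(toy_df, columns=["sex"], dtype=int)

print(toy_encoded)
```

The missing age becomes 30.0 (the average of 20 and 40), and the `sex` column is replaced by `sex_Female` and `sex_Male`.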

In [2]:
# First, let's check for any missing values in our dataset.
print("Missing values in each column:")
print(tips_df.isnull().sum())

# Fun Fact: The 'tips' dataset is very clean! There are no missing values. Lucky us! 🎉

Missing values in each column:
total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64


In [11]:
tips_df

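|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |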


In [14]:
# Now, let's handle categorical data. Look at the 'sex' column. It's text!
# We will convert it into numbers using one-hot encoding.
# This creates new columns for each category (e.g., 'sex_Male', 'sex_Female') with 1s and 0s.

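# Option 1: Encode first, then convert only the new dummy columns to integers.
# (Calling .astype(int) on the whole DataFrame would also chop the decimals off 'total_bill' and 'tip'!)
categorical_cols = ['sex', 'smoker', 'day', 'time']
tips_processed = pd.get_dummies(tips_df, columns=categorical_cols)
dummy_cols = [col for col in tips_processed.columns if col not in tips_df.columns]
tips_processed = tips_processed.astype({col: int for col in dummy_cols})

# Option 2: Use the dtype parameter (if available in your pandas version) to get the same result in one step.
tips_processed = pd.get_dummies(tips_df, columns=categorical_cols, dtype=int)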

In [15]:
tips_processed

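|   | total_bill | tip | size | sex_Male | sex_Female | smoker_Yes | smoker_No | day_Thur | day_Fri | day_Sat | day_Sun | time_Lunch | time_Dinner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 10.34 | 1.66 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 21.01 | 3.50 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 23.68 | 3.31 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 24.59 | 3.61 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 240 | 27.18 | 2.00 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 241 | 22.67 | 2.00 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 242 | 17.82 | 1.75 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |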


### 🎯 Practice Task 2: Check the Data Types

After our changes, what do the data types of our columns look like? Use the `.info()` method to see the summary of our new `tips_processed` DataFrame.

💡 **Hint:** Just type `tips_processed.info()` in the cell below and run it.

In [16]:
# Your code here! Get information about the processed DataFrame.
tips_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   total_bill   244 non-null    float64
 1   tip          244 non-null    float64
 2   size         244 non-null    int64  
 3   sex_Male     244 non-null    int32  
 4   sex_Female   244 non-null    int32  
 5   smoker_Yes   244 non-null    int32  
 6   smoker_No    244 non-null    int32  
 7   day_Thur     244 non-null    int32  
 8   day_Fri      244 non-null    int32  
 9   day_Sat      244 non-null    int32  
 10  day_Sun      244 non-null    int32  
 11  time_Lunch   244 non-null    int32  
 12  time_Dinner  244 non-null    int32  
dtypes: float64(2), int32(10), int64(1)
memory usage: 15.4 KB


## 3. Feature Crafting (Feature Engineering)

This is where creativity comes in! Feature engineering is the art of creating new, more informative features from the ones we already have. A good new feature can dramatically improve a model's performance.

For our dataset, we can create a feature like `tip_percentage` to see how generous the tip was relative to the bill.

In [18]:
# Let's create a new feature called 'tip_percentage'
tips_processed['tip_percentage'] = (tips_processed['tip'] / tips_processed['total_bill']) * 100

# Let's create another one: 'bill_per_person'
tips_processed['bill_per_person'] = tips_processed['total_bill'] / tips_processed['size']

print("Dataset with our shiny new features!")
tips_processed

Dataset with our shiny new features!


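|   | total_bill | tip | size | sex_Male | sex_Female | smoker_Yes | smoker_No | day_Thur | day_Fri | day_Sat | day_Sun | time_Lunch | time_Dinner | tip_percentage | bill_per_person |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 5.944673 | 8.495000 |
| 1 | 10.34 | 1.66 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 16.054159 | 3.446667 |
| 2 | 21.01 | 3.50 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 16.658734 | 7.003333 |
| 3 | 23.68 | 3.31 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 13.978041 | 11.840000 |
| 4 | 24.59 | 3.61 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 14.680765 | 6.147500 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 20.392697 | 9.676667 |
| 240 | 27.18 | 2.00 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 7.358352 | 13.590000 |
| 241 | 22.67 | 2.00 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 8.822232 | 11.335000 |
| 242 | 17.82 | 1.75 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 9.820426 | 8.910000 |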


### 🎯 Practice Task 3: A Simple Feature

Can you create one more feature? Let's imagine the tip is split evenly among everyone at the table. Create a feature called `tip_per_person`.

💡 **Hint:** It would be the `tip` divided by the `size` of the party.

In [20]:
# Your code here! Create the 'tip_per_person' feature.
tips_processed['tip_per_person'] = tips_processed['tip'] / tips_processed['size']

print("Dataset with the 'tip_per_person' feature:")
tips_processed

Dataset with the 'tip_per_person' feature:


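|   | total_bill | tip | size | sex_Male | sex_Female | smoker_Yes | smoker_No | day_Thur | day_Fri | day_Sat | day_Sun | time_Lunch | time_Dinner | tip_percentage | bill_per_person | tip_per_person |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 5.944673 | 8.495000 | 0.505000 |
| 1 | 10.34 | 1.66 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 16.054159 | 3.446667 | 0.553333 |
| 2 | 21.01 | 3.50 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 16.658734 | 7.003333 | 1.166667 |
| 3 | 23.68 | 3.31 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 13.978041 | 11.840000 | 1.655000 |
| 4 | 24.59 | 3.61 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 14.680765 | 6.147500 | 0.902500 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 20.392697 | 9.676667 | 1.973333 |
| 240 | 27.18 | 2.00 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 7.358352 | 13.590000 | 1.000000 |
| 241 | 22.67 | 2.00 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 8.822232 | 11.335000 | 1.000000 |
| 242 | 17.82 | 1.75 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 9.820426 | 8.910000 | 0.875000 |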


## 4. Modeling (Model Selection and Training)

Now for the fun part! We will choose a machine learning model and 'train' it on our data. Training is the process where the model learns patterns from the data.

**Our Goal:** Predict the `tip` amount based on other features like `total_bill`, `size`, etc.

**Process:**
1.  **Define Features (X) and Target (y):** `X` is the input (the information we have), and `y` is the output we want to predict (the tip).
2.  **Split Data:** We split our data into a `training set` (for the model to learn from) and a `testing set` (to see how well it learned on data it has never seen before).
3.  **Train the Model:** We'll use a simple model called `LinearRegression`.
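
To build intuition for what training does, here is a minimal sketch of the three steps above on synthetic data where the true rule is known in advance (`y = 2x + 1`). The data and the names `X_demo`, `y_demo`, and `demo_model` are invented for illustration; with real data the learned numbers are only estimates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Step 1: features and target. The target follows y = 2x + 1 exactly
# (an assumption made only for this demo).
X_demo = np.arange(20, dtype=float).reshape(-1, 1)  # one feature, 20 samples
y_demo = 2 * X_demo.ravel() + 1

# Step 2: hold out 20% of the samples for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 3: train the model on the training portion only.
demo_model = LinearRegression()
demo_model.fit(X_tr, y_tr)

# The learned slope and intercept should be very close to 2 and 1.
print(demo_model.coef_[0], demo_model.intercept_)
```

Because this data has no noise, the model recovers the rule almost perfectly. Real data like restaurant tips is much noisier, so expect a rougher fit.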

In [22]:
# 1. Define our features (X) and target (y)
# The target is what we want to predict.
y = tips_processed['tip']

# The features are the data we use to make the prediction.
# We drop the original 'tip' column and our other created tip features to avoid cheating!
X = tips_processed.drop(columns=['tip', 'tip_percentage', 'tip_per_person'])

print("This is our Target (y):")
print(y.head())


This is our Target (y):
0    1.01
1    1.66
2    3.50
3    3.31
4    3.61
Name: tip, dtype: float64


In [23]:
print("\nThese are our Features (X):")
print(X.head())


These are our Features (X):
   total_bill  size  sex_Male  sex_Female  smoker_Yes  smoker_No  day_Thur  \
0       16.99     2         0           1           0          1         0   
1       10.34     3         1           0           0          1         0   
2       21.01     3         1           0           0          1         0   
3       23.68     2         1           0           0          1         0   
4       24.59     4         0           1           0          1         0   

   day_Fri  day_Sat  day_Sun  time_Lunch  time_Dinner  bill_per_person  
0        0        0        1           0            1         8.495000  
1        0        0        1           0            1         3.446667  
2        0        0        1           0            1         7.003333  
3        0        0        1           0            1        11.840000  
4        0        0        1           0            1         6.147500  


In [24]:
# 2. Split the data
# We will use 80% of the data for training and 20% for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"We have {X_train.shape[0]} samples for training.")
print(f"We have {X_test.shape[0]} samples for testing.")

We have 195 samples for training.
We have 49 samples for testing.


In [25]:
# 3. Select and Train the Model
from sklearn.linear_model import LinearRegression

# Create the model instance
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

print("✅ Model training complete! The model has learned from the data.")

✅ Model training complete! The model has learned from the data.


In [28]:
# Save the trained model
import joblib

joblib.dump(model, "trained_model.pkl")
print("✅ Model saved as 'trained_model.pkl'")


✅ Model saved as 'trained_model.pkl'
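
A saved model is only useful if it can be loaded again later. Here is a minimal sketch of the full save-and-load round trip with `joblib`, using a tiny stand-in model and a temporary folder (the file name `demo_model.pkl` and the toy data are assumptions for this example):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Train a tiny stand-in model (synthetic data, for illustration only).
X_small = np.array([[1.0], [2.0], [3.0], [4.0]])
y_small = np.array([2.0, 4.0, 6.0, 8.0])
small_model = LinearRegression().fit(X_small, y_small)

# Save to a temporary folder, then load it back as a new object.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(small_model, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded model should give the same predictions as the original.
same_predictions = np.allclose(
    small_model.predict(X_small), loaded_model.predict(X_small)
)
print(same_predictions)
```

One safety note: these files use Python's pickle format under the hood, so only load model files that come from a source you trust.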


## 5. Testing and Evaluation

Our model is trained, but is it any good? We need to evaluate its performance on the `testing set` – the data it has never seen before.

For our problem (predicting a number), we can use metrics like:
*   **Mean Squared Error (MSE):** The average of the squared differences between predicted and actual values. Lower is better!
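*   **R-Squared (R²):** A score that tells us how much of the variance in our target variable is explained by our model. It is usually between 0 and 1, but it can be negative on test data when the model does worse than always predicting the average. Higher is better!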
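
To see what these two metrics actually compute, here is a minimal worked sketch in plain Python. The four actual and predicted values are made up for illustration.

```python
# Made-up actual tips and model predictions.
actual = [3.0, 2.0, 4.0, 5.0]
predicted = [2.5, 2.0, 4.5, 4.0]

n = len(actual)
mean_actual = sum(actual) / n

# MSE: the average of the squared prediction errors.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
mse_by_hand = ss_res / n

# R²: 1 minus (the model's squared error divided by the squared error
# we would get by always guessing the average).
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2_by_hand = 1 - ss_res / ss_tot

print(mse_by_hand, r2_by_hand)  # MSE = 0.375, R² ≈ 0.7
```

The squared errors add up to 1.5, so the MSE is 1.5 / 4 = 0.375. Always guessing the average (3.5) would give a total squared error of 5.0, so R² is 1 - 1.5 / 5.0 = 0.7.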

In [26]:
# Let's make predictions on our test data
predictions = model.predict(X_test)

# Let's see the first 5 predictions vs the actual values
print("First 5 Predictions:", predictions[:5])
print("First 5 Actual Values:", y_test.head().values)

First 5 Predictions: [2.82765516 2.12024134 3.94443664 3.76353817 2.21836665]
First 5 Actual Values: [3.18 2.   2.   5.16 2.  ]


In [27]:
# Now let's calculate the evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-Squared (R²): {r2:.2f}")

print("\n💡 An R² of around 0.41 means our model can explain about 41% of the variance in the tip amount. Not perfect, but a good start!")

Mean Squared Error (MSE): 0.73
R-Squared (R²): 0.41

💡 An R² of around 0.41 means our model can explain about 41% of the variance in the tip amount. Not perfect, but a good start!


### 🎯 Practice Task 4: Interpret the Results

Based on the R-Squared score, would you say our model is an excellent predictor, a decent predictor, or a poor predictor of tips? Write your answer as a comment in the code cell below.

In [None]:
# Your answer here!
# I think the model is a ...

## 6 & 7. Deployment, Monitoring & Maintenance

These are advanced but crucial final steps in a real-world project.

**🚀 Deployment:** This is where we make our model available for others to use. It could be part of a website, an app, or another system. For example, a restaurant app could use our model to suggest a tip amount.

**📡 Monitoring & Maintenance:** Once deployed, we need to watch our model's performance. Does it still make good predictions over time? Sometimes, models need to be retrained with new data to stay accurate. This is called maintenance.
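
As a rough idea of what monitoring could look like, here is a minimal sketch in plain Python. The `needs_retraining` function, its `tolerance` factor, and the batch of recent tips are all hypothetical; real projects choose their own rules and thresholds.

```python
def mean_squared_error_simple(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def needs_retraining(actual, predicted, baseline_mse, tolerance=1.5):
    """Hypothetical rule: flag the model when its error on fresh data
    is more than `tolerance` times the baseline error."""
    return mean_squared_error_simple(actual, predicted) > baseline_mse * tolerance


# Made-up batch of recent tips and the model's predictions for them.
recent_actual = [2.0, 3.5, 5.0, 1.5]
recent_predicted = [2.2, 3.0, 3.0, 2.5]

# 0.73 is the test-set MSE we measured in the evaluation step.
print(needs_retraining(recent_actual, recent_predicted, baseline_mse=0.73))
```

In this made-up batch the error is noticeably higher than the baseline, so the check comes back `True`, a signal that it may be time to retrain with newer data.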

For our beginner session, we won't be deploying this model, but it's important to know what the next steps would be!

---

## 🎉 Final Revision Assignment: The Titanic Challenge!

Congratulations on learning the entire ML pipeline! Now it's time to put your new skills to the test. Your mission is to build a model that predicts which passengers survived the Titanic disaster.

We'll load the dataset for you. Your job is to follow the pipeline steps we learned.

**Your Goal:** Predict the `survived` column (0 = No, 1 = Yes).

In [None]:
# Here is your new dataset! Let's load it and see what it contains.
titanic_df = sns.load_dataset('titanic')

print("Titanic Dataset Information:")
titanic_df.info()

### Your Tasks:

**1. Preprocessing - Handle Missing Data:**
*   You can see the `age` column has missing values. Fill them with the median age. 
    *   *Hint: `median_age = titanic_df['age'].median()` and then `titanic_df['age'] = titanic_df['age'].fillna(median_age)`*
*   The `deck` column is mostly empty, and `embarked` and `embark_town` have a few missing values too. For simplicity, let's just drop all three.
    *   *Hint: `titanic_df.drop(columns=['deck', 'embark_town', 'embarked'], inplace=True)`*

**2. Preprocessing - Encode Categorical Data:**
*   Convert the `sex` and `class` columns into numerical format using one-hot encoding (`pd.get_dummies`).

**3. Feature Engineering:**
*   Create a new feature called `family_size` by adding the `sibsp` (siblings/spouses) and `parch` (parents/children) columns together.

**4. Define Features (X) and Target (y):**
*   Your target `y` is the `survived` column.
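*   Your features `X` should be the columns you think are useful (e.g., `pclass`, `age`, `family_size`, and your new encoded columns). Make sure to drop columns you don't need, like `survived`, `sibsp`, `parch`, `who`, `adult_male`, `alive`, `alone`. Dropping `alive` matters most: it is just `survived` written as yes/no, so keeping it would be cheating! If you encoded with `pd.get_dummies(titanic_df, columns=['sex', 'class'])`, the original `sex` and `class` columns are already gone; otherwise, drop them too.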

**5. Split the Data:**
*   Split your `X` and `y` into training and testing sets. Use a `test_size` of `0.2`.

**6. Train a Model:**
*   This is a classification problem (Survived: Yes or No). Instead of `LinearRegression`, import and use `DecisionTreeClassifier` from `sklearn.tree`.
*   Train it just like before: `model.fit(X_train, y_train)`.

**7. Evaluate Your Model:**
*   Make predictions on `X_test`.
*   Import `accuracy_score` from `sklearn.metrics`. This metric tells you the fraction of correct predictions (for example, 0.80 means 80% correct). Calculate and print the accuracy!
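
If you'd like to see the classifier and the accuracy metric in action before tackling the Titanic data, here is a minimal sketch on made-up numbers. The toy data and the names `clf`, `X_toy`, and `X_new` are assumptions for illustration; your assignment will use the real `X_train` and `X_test` instead.

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up data: one feature, and the label is 1 when the feature is 5 or more.
X_toy = [[1], [2], [3], [4], [5], [6], [7], [8]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_toy, y_toy)

# Predict on a few new values and compare with the labels we expect.
X_new = [[0], [2], [6], [9]]
y_expected = [0, 0, 1, 1]
toy_accuracy = accuracy_score(y_expected, clf.predict(X_new))

print(f"Accuracy: {toy_accuracy:.2f}")
```

This toy problem is perfectly separable, so the tree gets every prediction right. Don't expect a perfect score on the Titanic data!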

**Good luck! Use the cells below to complete your assignment.**

In [None]:
# Task 1 & 2: Preprocessing - Write your code here

In [None]:
# Task 3: Feature Engineering - Write your code here

In [None]:
# Task 4 & 5: Define X, y and Split Data - Write your code here

In [None]:
# Task 6: Train a Model - Write your code here

In [None]:
# Task 7: Evaluate your Model - Write your code here