# Scikit-learn Tutorial
Author: Lauren Gliane

In this tutorial, we'll go over the following topics to build our own classification model:
1. Introduction
2. Install
3. Datasets
4. Choosing a Model
5. Training the Model
6. Making Predictions
7. Save/Load Models
8. Evaluate the Model

## 1. Get to know sk-learn
### What is Scikit-learn?
Scikit-learn (AKA sk-learn) is an open source Python library and **one of the most used ML libraries today**. Sk-learn is built on top of NumPy, SciPy, and Matplotlib, and contains tons of ready-to-use algorithms for training, evaluating, and saving models straight out of the box!

### Why learn Scikit-learn?
With sk-learn, we don't need to implement complex algorithms built on a backbone of linear algebra and statistics ourselves. By using sk-learn's ready-made ML algorithms and neural networks, we can build models faster while getting familiar with industry-standard tools.

## 2. Install
To use sk-learn, we'll need **scipy**, **numpy**, and **scikit-learn** (the package is installed as `scikit-learn` but imported as `sklearn`). To install these, run `pip install scipy numpy scikit-learn` in your terminal.

You can confirm sk-learn was installed correctly by importing something from the package.

In [None]:
from sklearn.tree import DecisionTreeClassifier

## 3. Datasets
When doing machine learning, we need data to train and evaluate our models, because without data, we can't learn patterns, validate performance, or generalize to unseen examples.

Scikit-learn helps here with built-in datasets, dataset loading utilities, and data preparation functions (like [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).

### Features and Labels
In machine learning, we use data to train models to make predictions or decisions. This data is typically structured into two main parts:
1. **Features (Input)**
- Features are the input variables (also called independent variables) that the model uses to learn.
- Think of them as the measurable properties or characteristics that cause or correlate with your output.
- Features must be numeric: either real numbers to begin with, or non-numeric values that have been transformed into a numerical representation.

    **Example:** In a house price prediction dataset, features might include the number of bedrooms, square footage, and location.

2. **Labels (Output)**
- The label is the target variable (also called the dependent variable). Labels are what you want the model to predict.
- Think of it as the correct answer for each example in the dataset.

    **Example:** For house prices, the label is the actual price of the house.
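
To make the features/labels split concrete, here is a minimal sketch using the built-in iris dataset (chosen here only as an illustration; any classification dataset has the same layout). Features live in a 2D array with one row per example, and labels live in a 1D array with one entry per example.

```python
from sklearn.datasets import load_iris

# Load a built-in dataset; iris is used here only as an illustration.
iris = load_iris()

X = iris.data    # features: one row per flower, one column per measurement
y = iris.target  # labels: one integer class per flower

print("Features shape:", X.shape)  # (n_samples, n_features)
print("Labels shape:", y.shape)    # (n_samples,)
print("First example features:", X[0], "-> label:", y[0])
```

The number of rows in `X` always matches the length of `y`, since every example needs exactly one label.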

### Step 1: Pick a Dataset
Scikit-learn comes with several built-in toy datasets that are great for learning and experimenting. These datasets are small, well-structured, and easy to load, making them perfect for a first project.

Look through the following [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) and select one that interests you for this tutorial!

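**Note**: Since we're building a classification model, pick a classification dataset (iris, wine, or breast cancer). Don't choose the digits dataset: it is an image dataset that requires additional preprocessing, which this tutorial does not cover.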

In [None]:
from sklearn.datasets import ______

dataset = ______

##### Viewing your data
Whenever you're working with new data, it's a good idea to familiarize yourself with the feature names and labels, and to examine a bit of the data. Run the following to learn about the dataset you've chosen.

In [None]:
X = dataset.data      # feature matrix, one row per sample
y = dataset.target    # label for each sample

feature_names = dataset.feature_names
target_names = dataset.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)

print("\nType of X is:", type(X))

print("\nFirst 5 rows of X:\n", X[:5])


#### Want to use custom data?
If we’re using an external dataset, we can use the pandas library to load and manipulate the datasets with ease. If you haven’t yet, check out our [AI Club Pandas Tutorial](https://github.com/npragin/ai-club-project-management/blob/main/tutorials/pandas-tutorial.ipynb)!
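
As a rough sketch of what loading custom data can look like, the example below reads a small CSV into a pandas DataFrame and separates features from labels. The column names and values are made up for illustration, and an in-memory string stands in for a real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the columns and values are made up for illustration.
csv_text = io.StringIO(
    "bedrooms,square_feet,price_category\n"
    "3,1500,mid\n"
    "2,900,low\n"
    "4,2400,high\n"
)

# With a real file you would pass its path to pd.read_csv instead.
df = pd.read_csv(csv_text)

# Separate the feature columns (X) from the label column (y).
X = df.drop(columns=["price_category"])
y = df["price_category"]

print(X.head())
print(y.head())
```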

#### A. Missing Values (imputation methods)
Handling missing values is an important preprocessing step: many scikit-learn models cannot train on data containing NaN, and careless handling can hurt your model's performance.

Step 1: Identify Missing Data

We'll use pandas to find missing values in our dataset.

In [None]:
import pandas as pd

# `df` is assumed to be a DataFrame you have already loaded (e.g. with pd.read_csv)
print(df.isnull().sum())  # Shows missing values per column
df.info()                 # Also shows counts of non-null values

Step 2: Understand the Pattern

Missing data can be:
- MCAR (Missing Completely at Random): the missingness has no pattern and is unrelated to any variable
- MAR (Missing at Random): the missingness is related to other observed variables
- MNAR (Missing Not at Random): the missingness is related to the missing value itself

Step 3: Select a Strategy to Handle Missing Values

**A. Remove Missing Data**
- drop rows: when only a few rows have missing values and you have a large dataset

`df.dropna(inplace=True)`

- drop columns: if a feature has too many missing values (e.g. >50%)

`df.drop(columns=['column_name'], inplace=True)`

Pros: Simple method | Cons: Loss of information
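
The two removal strategies above can be combined. Below is a minimal sketch on a tiny made-up DataFrame (the column names and values are hypothetical): it first drops columns that are more than 50% missing, then drops any rows that still contain a gap.

```python
import numpy as np
import pandas as pd

# Tiny made-up DataFrame with gaps, for illustration only.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, 58000],
    "notes": [np.nan, np.nan, np.nan, "ok"],  # mostly missing
})

# Fraction of missing values per column.
missing_fraction = df.isnull().mean()

# Drop columns where more than half of the values are missing.
cols_to_drop = missing_fraction[missing_fraction > 0.5].index
df_fewer_cols = df.drop(columns=cols_to_drop)

# Then drop any rows that still contain a missing value.
df_clean = df_fewer_cols.dropna()

print(df_clean)
```

Notice that only two of the four rows survive, which is the "loss of information" trade-off in action.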

**B. Imputation (Fill Missing Values)**
1. Numerical Features
- Mean/Median Imputation (mean for roughly normal distributions, median for skewed ones):

`df['col'] = df['col'].fillna(df['col'].mean())`
- Mode Imputation (for categorical-like numerics):

`df['col'] = df['col'].fillna(df['col'].mode()[0])`
- KNN Imputer (considers similarity between samples):

In [None]:
from sklearn.impute import KNNImputer

# `df` is assumed to be an all-numeric DataFrame; fit_transform returns a NumPy array
imputer = KNNImputer(n_neighbors=3)
df_imputed = imputer.fit_transform(df)
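
Mean, median, and mode imputation can also be done with sk-learn's `SimpleImputer`, which follows the same fit/transform pattern as `KNNImputer`. Here is a minimal sketch on made-up data (the column names and values are hypothetical), using the median strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric data with gaps, for illustration only.
df = pd.DataFrame({
    "height": [1.60, np.nan, 1.75, 1.80],
    "weight": [55.0, 70.0, np.nan, 82.0],
})

# strategy can be "mean", "median", or "most_frequent".
imputer = SimpleImputer(strategy="median")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)
```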

#### B. Feature Scaling

Step 1: Choose your scaling method

| Scaler           | Use When                                                                  |
| ---------------- | ------------------------------------------------------------------------- |
| `StandardScaler` | You want features with mean = 0 and std = 1 (a common default choice).    |
| `MinMaxScaler`   | You need values in a fixed range, like \[0, 1].                           |
| `RobustScaler`   | Your data contains **outliers**.                                          |



#### C. Split the Data
To train our model and evaluate it fairly, the dataset will be split into a training set and a testing set.

- *Training set:* teaches our model to recognize patterns in the data
- *Testing set:* checks our model's performance on new, never-before-seen data

We will use the [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from the `sklearn.model_selection` module to do this.

##### Deciding on a split
80% training and 20% testing data is a common split for larger datasets.
Since most of sk-learn's toy datasets are small, we'll use 60% for training and 40% for testing so the test set is large enough to give us confidence in our evaluation results.

To do this, find the parameter responsible for the train size or test size (you can set either or both) by taking a look at the docs to see how to pass them in. Setting `random_state=1` ensures the split is the same on every run, which makes our results reproducible.

**Result:**
We will have four subsets of the data after splitting.

- **X_train and y_train:** feature and target values for training

- **X_test and y_test:** feature and target values for testing

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, ___) # Fill in parameter(s) here
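
To see what `random_state` buys us, here is a small sketch on made-up data (ten invented samples, with a 70/30 split chosen only for illustration). Calling `train_test_split` twice with the same `random_state` returns identical subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus alternating 0/1 labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Same random_state -> same shuffle, so both calls return identical subsets.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.3, random_state=1
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print("Train/test sizes:", len(X_train_a), len(X_test_a))
print("Identical test sets:", np.array_equal(X_test_a, X_test_b))
```

Leaving `random_state` unset would give a different shuffle on each run, which makes results harder to compare.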

### Common Steps in Data Preparation with Scikit-learn tools
In this tutorial, we focus on loading data, handling missing values, scaling, and splitting. The table below lists the common preprocessing steps and the tools used to complete them.


| Step                                   | What It Does                                                    | Example Tools (scikit-learn, pandas)     |
| -------------------------------------- | --------------------------------------------------------------- | ---------------------------------------- |
| **1. Load the Data**                   | Bring data into your environment.                               | `load_iris()`, `pandas.read_csv()`       |
| **2. Inspect & Explore**               | Understand structure, types, missing values, outliers.          | `df.shape`, `df.info()`, `df.describe()` |
| **3. Handle Missing Values**           | Fill in or remove incomplete data.                              | `SimpleImputer`, `KNNImputer`            |
| **4. Encode Categorical Variables**    | Convert text categories into numbers.                           | `OneHotEncoder`, `LabelEncoder`          |
| **5. Feature Scaling**                 | Standardize or normalize values so features contribute equally. | `StandardScaler`, `MinMaxScaler`         |
| **6. Feature Selection / Engineering** | Choose or create features that improve performance.             | `SelectKBest`, custom transformations    |
| **7. Split the Data**                  | Divide data into training and testing sets.                     | `train_test_split()`                     |


#### Steps 1-3 Code so far

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split the data (60% train, 40% test, as chosen above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# Scale features: fit on the training set only, then apply the same scaling to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## 4. Choose a Model

### Pick a simple model (show different ones)
### Cross-validation
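
As a rough sketch of how cross-validation can help us compare a few simple models, the example below scores three classifiers with 5-fold cross-validation on the training split of iris. The three models and their settings are illustrative choices, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Candidate models; these three are illustrative choices, not a recommendation.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=1),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, repeat.
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

Because cross-validation only touches the training split here, the test set stays untouched for the final evaluation.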


## 5. Train the Model
### Fit model on training data
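
A minimal sketch of fitting, assuming the iris dataset and a decision tree as the chosen model (both are illustrative choices): every sk-learn estimator is trained by calling `fit` with the training features and labels.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# fit() learns from the training features and labels only; the test set stays unseen.
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

print("Tree depth after training:", model.get_depth())
```
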
### Hyperparameter tuning methods (grid search, etc.)
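
One common tuning method is a grid search, which tries every combination of the values we list and keeps the one with the best cross-validated score. Below is a minimal sketch with `GridSearchCV`; the model and the grid values are arbitrary examples chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Hypothetical grid: these values are arbitrary examples, not tuned recommendations.
param_grid = {"max_depth": [2, 3, 4, None], "min_samples_split": [2, 5, 10]}

# Each combination is scored with 5-fold cross-validation on the training set.
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model retrained on the full training set with the best parameters.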

## 6. Make predictions
### Predict on test data
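
A minimal sketch of prediction, again assuming iris and a decision tree for illustration: once a model is fitted, `predict` takes feature rows and returns one predicted label per row.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Predict labels for data the model has never seen.
y_pred = model.predict(X_test)

print("Predicted:", y_pred[:10])
print("Actual:   ", y_test[:10])
```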

## 7. Save/Load Models
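
One way to save and reload a trained model is with `joblib`, which is installed alongside sk-learn. The sketch below is a minimal example; the file name and temporary directory are placeholders chosen so the snippet runs anywhere.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=1).fit(X, y)

# Save to a temporary directory here; in a project you would pick your own path.
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, model_path)

# Load the model back and check it behaves like the original.
loaded_model = joblib.load(model_path)
same_predictions = bool((loaded_model.predict(X) == model.predict(X)).all())
print("Loaded model matches original:", same_predictions)
```

Only load model files from sources you trust, and try to load them with the same sk-learn version that saved them.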

## 8. Evaluate the Model
### Accuracy scoring
### Confusion matrix
### Classification reporting
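
The three evaluation tools named above all live in `sklearn.metrics`. Here is a minimal sketch that computes each one on the test split, assuming iris and a decision tree as illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.4, random_state=1
)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy: fraction of test examples predicted correctly.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Classification report: precision, recall, and F1-score per class.
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

Accuracy gives a single summary number, while the confusion matrix and report show which classes the model mixes up.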