# Scikit-learn Tutorial
Author: Lauren Gliane

In this tutorial, we'll go over the following topics to build our own classification model:
1. Introduction
2. Install
3. Datasets
4. Choosing a Model
5. Training the Model
6. Making Predictions
7. Save/Load Models
8. Evaluate the Model

## 1. Get to Know Scikit-learn
### What is Scikit-learn?
Scikit-learn (also known as sklearn) is an open-source Python library and **one of the most used ML libraries today**. It is built on top of the NumPy, SciPy, and Matplotlib libraries and contains many algorithms that are ready to use to train, evaluate, and save models straight out of the box!

### Why learn Scikit-learn?
With sk-learn, we don’t need an in-depth understanding of complex concepts like linear algebra or calculus to get started. Using sk-learn’s ready-made ML algorithms and preprocessing tools, we can easily prepare datasets and build models for supervised learning (regression or classification) and unsupervised learning (clustering and dimensionality reduction) applications.

## 2. Install
To use sk-learn, we'll need **scipy**, **numpy**, and **scikit-learn** (imported in code as `sklearn`). To install these, run `pip install scipy numpy scikit-learn` in your terminal.
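
To confirm the install worked, we can import each library and print its version. This is a minimal check, and the version numbers printed will depend on your environment:

```python
import numpy
import scipy
import sklearn

# Print the installed version of each library
for lib in (numpy, scipy, sklearn):
    print(lib.__name__, lib.__version__)
```

If any import fails with `ModuleNotFoundError`, the package was likely installed into a different Python environment than the one running the notebook.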

## 3. Datasets
When doing machine learning, we need data to train and evaluate our models, because without data, we can't learn patterns, validate performance, or generalize to unseen examples.

Scikit-learn provides tools to do that via built-in datasets, dataset loading utilities, and data preprocessing functions (like [train_test_split](https://sklearn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).

### Features and Labels
In machine learning, we use data to train models to make predictions or decisions. This data is typically structured into two main parts:
1. **Features (Input)**
- Features are the input variables (also called independent variables) that the model uses to learn.
- Think of them as the measurable properties or characteristics of the data

    **Example:** In a house price prediction dataset, features might include the number of bedrooms, square footage, and location.

2. **Labels (Output)**
- The label is the target variable (also called the dependent variable): it’s what you want the model to predict.
- Think of it as the correct answer for each example in the dataset.

    **Example:** For house prices, the label is the actual price of the house.
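
To make the features/labels distinction concrete, here is a small sketch of the house price example as arrays. All numbers are made up for illustration, and the location feature is left out because it would first need to be encoded as numbers:

```python
import numpy as np

# Hypothetical values, invented for illustration only
# Each row is one house; columns are: bedrooms, square footage
X = np.array([
    [3, 1500],
    [2, 900],
    [4, 2200],
])

# One label (price in dollars) per row of X
y = np.array([300_000, 180_000, 450_000])

print("X shape:", X.shape)  # (3, 2): 3 examples, 2 features
print("y shape:", y.shape)  # (3,): one label per example
```

By convention, scikit-learn expects features as a 2D array `X` of shape `(n_samples, n_features)` and labels as a 1D array `y` of length `n_samples`.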

### Step 1: Pick a Dataset
Scikit-learn comes with several built-in toy datasets that are great for learning and experimenting. These datasets are small, well-structured, and easy to load — perfect for understanding the basics of machine learning workflows.

Look through the following [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) and select one that interests you for this tutorial!

(We'll be using Iris as an example so we suggest using something else)

In [None]:
dataset = ______

### Step 2: Load the Dataset
We’ll practice loading datasets with the Iris dataset using `load_iris()`. 

**About Iris:** The Iris dataset is used for a classification task. It contains measurements of sepal length, sepal width, petal length, and petal width, which we will use to identify the iris species.

##### Practice: View your data
Use this code to view the feature names, target names, feature data, and target data of your selected dataset:

In [None]:
from sklearn.datasets import load_iris 

iris = load_iris() 
# Load the features into X and the targets into y
X = iris.data 
y = iris.target 
  
feature_names = iris.feature_names 
target_names = iris.target_names 
  
print("Feature names:", feature_names) 
print("Target names:", target_names) 

print("\nType of X is:", type(X)) 

print("\nFirst 5 rows of X:\n", X[:5])


#### Want to use custom data?
If we’re using an external dataset, we can use the pandas library to load and manipulate the datasets with ease. If you haven’t yet, check out our [AI Club Pandas Tutorial](https://github.com/npragin/ai-club-project-management/blob/main/tutorials/pandas-tutorial.ipynb)!

### Step 3: Data Preprocessing
Raw data is rarely in a form that machine learning models can use effectively. Without data preparation, we risk poor performance, errors or crashes, and biased or misleading results.

In this tutorial, we will go over the most essential steps: **Missing Values**, **Feature Scaling**, and **Splitting Data**.

#### A. Missing Values (imputation methods)
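
One common way to handle missing values is to replace them with a column statistic such as the mean. Below is a minimal sketch, assuming mean imputation with `SimpleImputer` and a small made-up array (Iris itself has no missing values):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with missing entries marked as NaN
X = np.array([
    [5.1, 3.5],
    [np.nan, 3.0],
    [4.7, np.nan],
    [4.6, 3.1],
])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Other strategies such as `"median"` or `"most_frequent"` may suit skewed or categorical columns better; which one is appropriate depends on the data.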

#### B. Feature Scaling

Step 1: Choose your scaling method

| Scaler           | Use When                                                                        |
| ---------------- | ------------------------------------------------------------------------------- |
| `StandardScaler` | You want features with mean = 0 and std = 1 (a common default for many models). |
| `MinMaxScaler`   | You need values in a fixed range, like [0, 1].                                  |
| `RobustScaler`   | Your data contains **outliers**.                                                |
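
To see how the three scalers in the table behave, we can apply each one to the same column. This is a sketch on made-up data that includes one deliberate outlier (100.0):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up single feature; the last value is an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaled[name] = scaler.fit_transform(X).ravel()
    print(name, scaled[name].round(2))
```

With `MinMaxScaler`, the outlier squeezes the other values toward 0, while `RobustScaler` centers on the median and scales by the interquartile range, so the typical values stay spread out.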



#### C. Split the Data
To train the model and evaluate its performance fairly, the dataset will be split into a training set and a testing set:
- *Training set:* teaches our model to recognize patterns in the data
- *Testing set:* checks our model’s performance on new, never-before-seen data

We will use the [train_test_split()](https://sklearn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from the `sklearn.model_selection` module to do this.

##### 80-20 Split
An 80% training and 20% testing split is the most common choice for larger datasets.
Iris is a small dataset containing only 150 samples, so we’ll use 60% for training and 40% for testing, which leaves 60 samples for testing.

To do this, set the parameter `test_size=0.4`. Setting `random_state=1` makes the split identical on every run, for reproducibility.

**Result:** we will get four subsets after splitting:

**X_train and y_train:** feature and target values for training

**X_test and y_test:** feature and target values for testing
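
The 60/40 split described above can be sketched as follows, using `test_size=0.4` and `random_state=1` and printing the shapes of the resulting subsets:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and labels directly as arrays
X, y = load_iris(return_X_y=True)

# 60% training, 40% testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

print("Train:", X_train.shape, y_train.shape)  # (90, 4) (90,)
print("Test: ", X_test.shape, y_test.shape)    # (60, 4) (60,)
```

For classification, passing `stratify=y` is an optional extra that keeps the class proportions the same in both subsets.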

#### Steps 1-3 Code so far

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split the data (60% training, 40% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## 4. Choose a Model

### Pick a simple model (show different ones)
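
Scikit-learn classifiers share the same `fit`/`predict`/`score` interface, so trying several simple models takes only a few lines. Below is a sketch comparing three common choices on Iris; the specific models and settings (`max_iter=1000`, `n_neighbors=5`) are our assumptions for illustration, not the only reasonable ones:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Three simple classifiers with the same interface
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the test set
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```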
### Cross-validation
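
Cross-validation estimates performance by training and scoring the model on several different splits of the data instead of just one. Below is a minimal sketch, assuming 5 folds and a logistic regression model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train and evaluate on 5 different train/validation splits
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores.round(2))
print("Mean accuracy:", scores.mean().round(2))
```

The mean across folds is usually a steadier estimate than the score from a single split, which matters most on small datasets like Iris.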


## 5. Train the Model
### Fit model on training data
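
Training a model in scikit-learn means calling `fit` with the training features and labels. Below is a sketch that repeats the loading, splitting, and scaling steps so it runs on its own; logistic regression is our assumed model choice:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Learn the model parameters from the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

print("Classes learned:", model.classes_)
```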
### Hyperparameter tuning methods (grid search, etc.)
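
Grid search tries every combination in a set of candidate hyperparameter values and keeps the one with the best cross-validated score. Below is a minimal sketch using `GridSearchCV` on a k-nearest neighbors classifier; the candidate values for `n_neighbors` are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Candidate values to try (chosen arbitrarily for this example)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Evaluate each candidate with 5-fold cross-validation on the training set
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", round(search.best_score_, 2))
```

The search only sees the training set, so the test set stays untouched for the final evaluation.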

## 6. Make predictions
### Predict on test data
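
Once a model is fitted, `predict` returns one predicted label per row of the input. Below is a self-contained sketch, again assuming a logistic regression model trained on scaled Iris features:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=1
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Predict a class index for every test sample
y_pred = model.predict(X_test_scaled)

print("First 5 predicted labels:", y_pred[:5])
print("As species names:", iris.target_names[y_pred[:5]])
```

The test features must go through the same fitted scaler as the training features before being passed to `predict`.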

## 7. Save/Load Models
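
A trained model can be written to disk and loaded back later without retraining. Below is a minimal sketch using `joblib` (installed as a dependency of scikit-learn); the filename `iris_model.joblib` is a hypothetical choice:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the fitted model to a file (hypothetical filename)
joblib.dump(model, "iris_model.joblib")

# Load it back and confirm it makes the same predictions
loaded_model = joblib.load("iris_model.joblib")
same = (loaded_model.predict(X) == model.predict(X)).all()
print("Loaded model matches original:", same)
```

Only load model files from sources you trust, since loading them can execute arbitrary code. It is also safest to load a model with the same scikit-learn version that saved it.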

## 8. Evaluate the Model
### Accuracy scoring
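
Accuracy is the fraction of test samples the model labels correctly. Below is a sketch using `accuracy_score`, with logistic regression as the assumed model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare predictions against the true test labels
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```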
### Confusion matrix
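
A confusion matrix counts, for each true class, how many samples were predicted as each class, which shows where the model's mistakes are. Below is a minimal sketch under the same assumptions as before (Iris, logistic regression):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Values on the diagonal are correct predictions; any off-diagonal value is a sample from the row's class that was mistaken for the column's class.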
### Classification reporting
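
A classification report summarizes precision, recall, and F1-score for each class in one table. Below is a sketch using `classification_report`, again assuming a logistic regression model on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, and F1, labeled with species names
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)
```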