# Scikit-learn Tutorial
Author: Lauren Gliane

In this tutorial, we'll go over the following topics to build our own classification model:
1. Introduction
2. Install
3. Datasets
4. Data Preprocessing
5. Choosing a Model
6. Training the Model
7. Making Predictions
8. Save/Load Models
9. Evaluate the Model

## 1. Get to know SK-learn
### What is Scikit-learn?
Scikit-learn (AKA sk-learn) is an open-source machine learning library written in Python, and is **one of the most used ML libraries today**. Sk-learn is built on top of NumPy, SciPy, and Matplotlib, and contains tons of ready-to-use algorithms to train, evaluate, and save models straight out of the box!

### Why learn Scikit-learn?
With sk-learn, we don’t need to implement complex algorithms built on a backbone of linear algebra and statistics. By using sk-learn’s ML algorithms and neural networks, we can build models faster while getting familiar with industry-standard tools.

## 2. Install
To use sk-learn, we'll need **scipy**, **numpy**, and **scikit-learn** (the package we import as `sklearn`). To install these, run `pip install scipy numpy scikit-learn` in your terminal.

You can confirm sk-learn was installed correctly by importing something from the package.

In [None]:
from sklearn.tree import DecisionTreeClassifier

## 3. Datasets
When doing machine learning, we need data to train and evaluate our models, because without data, we can't learn patterns, validate performance, or generalize to unseen examples.

Scikit-learn provides tools for this: built-in datasets, dataset loading utilities, data preprocessing functions, and helpers for splitting data (like [train_test_split](https://sklearn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).

### Features and Labels
In machine learning, we use data to train models to make predictions or decisions. This data is typically structured into two main parts:
1. **Features (Input)**
- Features are the input variables (also called independent variables) that the model uses to learn.
- Think of them as the measurable properties or characteristics that cause or correlate with your output.
- Features are either real numbers or non-numeric values that have been transformed into a numerical representation.

    **Example:** In a house price prediction dataset, features might include the number of bedrooms, square footage, and location.

2. **Labels (Output)**
- The label is the target variable (also called the dependent variable). Labels are what you want the model to predict.
- Think of it as the correct answer for each example in the dataset.

    **Example:** For house prices, the label is the actual price of the house.
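
To make the feature/label split concrete, here is a minimal sketch of the house price example as arrays. The numbers are made up purely for illustration, and location is left out because a non-numeric feature would first need to be converted to numbers.

```python
import numpy as np

# Hypothetical, made-up houses: each row is one example, each column is one feature.
# Columns: number of bedrooms, square footage
X = np.array([
    [2, 850],
    [3, 1400],
    [4, 2100],
])

# One label per row of X: the (made-up) sale price in dollars.
y = np.array([150_000, 240_000, 355_000])

print("X shape:", X.shape)  # (n_samples, n_features)
print("y shape:", y.shape)  # (n_samples,)
```

By convention, sk-learn code names the feature matrix `X` and the label vector `y`, and the two must have the same number of rows.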

### Step 1: Pick a Dataset
Scikit-learn comes with several built-in toy datasets that are great for learning and experimenting. These datasets are small, well-structured, and easy to load, making them perfect for practice.

Look through the following [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) and select one that interests you for this tutorial!

**Note**: Don't choose the digits dataset. It is an image dataset that requires additional preprocessing that this tutorial does not cover.
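
As a rough sketch of what a toy dataset loader gives back, the example below uses `load_iris` (the same dataset used in the summary code later in this tutorial). The variable name `iris` is only illustrative, and the other loaders follow the same pattern.

```python
from sklearn.datasets import load_iris

# Each loader returns a dictionary-like object holding the data and its description.
iris = load_iris()

print(type(iris))
print("Features shape:", iris.data.shape)    # (n_samples, n_features)
print("Labels shape:", iris.target.shape)    # (n_samples,)
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
```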

In [None]:
from sklearn.datasets import ______

dataset = ______

##### Viewing your data
Whenever you're working with new data, it's always a good idea to familiarize yourself with the feature names and labels, and to examine a bit of the data. Run the following to understand a bit about the dataset you've chosen.

In [None]:
X = dataset.data 
y = dataset.target 
  
feature_names = dataset.feature_names 
target_names = dataset.target_names 
  
print("Feature names:", feature_names) 
print("Target names:", target_names) 

print("\nType of X is:", type(X)) 

print("\nFirst 5 rows of X:\n", X[:5])


#### Want to use custom data?
If we’re using an external dataset, we can use the pandas library to load and manipulate the datasets with ease. If you haven’t yet, check out our [AI Club Pandas Tutorial](https://github.com/npragin/ai-club-project-management/blob/main/tutorials/pandas-tutorial.ipynb)!
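
Here is a minimal sketch of loading custom data with pandas. It assumes a CSV with a label column named `price`. Both the column names and the values are hypothetical, and an in-memory string stands in for a real file so the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; with a real file, pass its path to pd.read_csv.
csv_text = io.StringIO(
    "bedrooms,sqft,price\n"
    "2,850,150000\n"
    "3,1400,240000\n"
    "4,2100,355000\n"
)
df = pd.read_csv(csv_text)

# Split the table into features (every column except the label) and the label column.
X = df.drop(columns=["price"])
y = df["price"]

print(X.head())
print(y.head())
```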

## 4. Data Preprocessing
Real-world data often requires some preprocessing to ensure it's in the right format for training a model. This can include handling missing values, scaling features, and selecting the relevant features.

The first step is always to inspect your data to understand its structure and identify any potential issues.

### A. Missing Values
Missing values can occur for various reasons, such as data entry errors, sensor malfunctions, or respondents skipping questions in surveys. Handling missing values is crucial because they can hinder the performance of machine learning models. Sometimes, datasets will use NaN (Not a Number), null, or zero to represent missing values. It is important to read about your dataset to understand how missing values are represented.

#### Identify Missing Data
We'll use Pandas to find missing values in our dataset.

In [None]:
import pandas as pd

# Convert to DataFrame for easier handling
X = pd.DataFrame(X, columns=feature_names)
y = pd.Series(y)

print(X.isnull().sum())  # Shows number of nulls per column
print()
print((X == 0).sum())  # Shows number of zeros per column
print()
X.info()          # Also shows counts of non-null values

#### Strategies for Handling Missing Values


**Option 1: Remove Missing Values**
- Drop rows (examples) with missing values, and don't predict unless all features are available
  - Useful when a few rows are missing, and you have a large dataset
  - `df.dropna(inplace=True)`

- Drop columns (features), and don't use that feature for training or prediction
  - Useful when a feature has many missing values and is not critical to the task
  - `df.drop(columns=['column_name'], inplace=True)`

**Option 2: Imputation (Fill Missing Values)**

[SK-Learn imputers](https://scikit-learn.org/stable/api/sklearn.impute.html)

- Fill missing values with a specific value, like the mean, median, or mode of the column
  - `SimpleImputer`
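- Model each feature that has missing values as a function of the other features, then use that model to fill in the missing values
  - `IterativeImputer` (still experimental: run `from sklearn.experimental import enable_iterative_imputer` before importing it)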
- Use a KNN model to predict and fill in the missing values
  - `KNNImputer`
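
The two options above can be sketched on a tiny, made-up DataFrame. The column names and values are hypothetical, and `strategy="mean"` is just one of the choices `SimpleImputer` supports.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "income": [30_000.0, 50_000.0, np.nan],
})

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(filled)
```

Dropping keeps only the single complete row here, which shows why imputation is often preferred on small datasets.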

#### Step 2: Fill Missing Values (if needed)
If you found your dataset has missing values, choose one of the strategies above to handle them. If not, you can skip this step.

In [None]:
from sklearn.impute import ____
imputer = ____
X = imputer.fit_transform(X)

### B. Feature Scaling
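
Scaling puts features on comparable numeric ranges so that no feature dominates simply because of its units. As a minimal sketch, assuming a tiny made-up array, the example below applies two of sk-learn's scalers to the same data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: two features on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
])

# Standardize each column to mean 0 and standard deviation 1.
X_standard = StandardScaler().fit_transform(X)

# Rescale each column to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(X_standard)
print(X_minmax)
```

After either transformation, both columns end up with identical values, because they differed only in scale.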

#### Step 3: Choose a Scaling Method

| Scaler           | Use When                                                                        |
| ---------------- | ------------------------------------------------------------------------------- |
| `StandardScaler` | You want features with mean = 0 and std = 1 (a good default for many ML models). |
| `MinMaxScaler`   | You need values in a fixed range, like [0, 1].                                  |
| `RobustScaler`   | Your data contains **outliers**.                                                |



### C. Split the Data
To train our model and evaluate its performance fairly, we split the dataset into a training set and a testing set.

- *Training set:* teaches our model to recognize patterns in the data
- *Testing set:* checks our model’s performance on new, never-before-seen data

We will use the [train_test_split()](https://sklearn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from the `sklearn.model_selection` module to do this.

##### Deciding on a split
80% training and 20% testing data is the most common split for larger datasets.
Since most of sk-learn's toy datasets are small, we’ll use 60% for training and 40% for testing so the test set is large enough for us to be confident about our evaluation results.

To do this, look through the docs to find the parameter responsible for the train size or test size (you can set either or both) and see how to pass it in. By setting `random_state=1`, we ensure the split is the same on every run, which makes our results reproducible.

**Result:**
We will have four subsets of the data after splitting.

- **X_train and y_train:** feature and target values for training

- **X_test and y_test:** feature and target values for testing
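
To show how the four subsets come out, here is a minimal sketch on a small made-up array. It deliberately uses a different split (75% / 25%) than the one you will fill in for your own dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 8 examples with 2 features each, plus one label per example.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 25% of the examples for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

print("Train:", X_train.shape, y_train.shape)
print("Test:", X_test.shape, y_test.shape)
```

Rows of `X` and entries of `y` are shuffled together, so every example keeps its own label.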

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, ___) # Fill in parameter(s) here

### Other Common Steps in Data Preparation with Scikit-learn tools
For this tutorial, we only go over loading data, handling missing values, scaling, and splitting. The table below shows other common preprocessing steps and the sk-learn tools used to complete them.


| Step                                   | What It Does                                                    | Example Tools in Scikit-learn         |
| -------------------------------------- | --------------------------------------------------------------- | ------------------------------------- |
| **Encode Categorical Variables**    | Convert text labels into numbers.                               | `OneHotEncoder`, `LabelEncoder`       |
| **Feature Selection / Engineering** | Choose the features that maximize performance.             | `SelectKBest`, `SequentialFeatureSelector` |

#### Steps 1-3 Code so far

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

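# Scale features: fit the scaler on the training data only, then reuse it on the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data and scale it
X_test_scaled = scaler.transform(X_test)  # apply the same training mean/std to the test data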


## 5. Choose a Model

### Pick a simple model (show different ones)
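
As a rough sketch of what trying a few simple models could look like, the example below fits three common classifiers on the iris data. The choice of models and their settings (such as `n_neighbors=5` and `max_iter=1000`) are assumptions made for illustration, not recommendations from this tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Every sk-learn classifier exposes the same fit / predict / score interface.
models = {
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "K-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out data
    print(f"{name}: {scores[name]:.3f}")
```

Because the interface is shared, swapping one model for another usually means changing a single line.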
### Cross-validation
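
A minimal sketch of cross-validation, assuming the iris data and a decision tree as the model: `cross_val_score` splits the data into `cv` folds, trains on all but one fold, scores on the remaining fold, and repeats so that every fold is used for scoring once.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=1)

# 5-fold cross-validation: returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)

print("Fold scores:", scores)
print("Mean accuracy:", scores.mean())
```

The mean across folds is a steadier estimate than a single train/test split, which matters most on small datasets.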


## 6. Train the Model
### Fit model on training data
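
Continuing from the preprocessing code above, fitting could look roughly like the sketch below. The decision tree is an assumed choice of model; any sk-learn classifier is trained with the same `fit(X, y)` call.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Scale using statistics learned from the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Training happens in fit(): the model learns from features and labels together.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train_scaled, y_train)

print("Classes seen during training:", model.classes_)
print("Training accuracy:", model.score(X_train_scaled, y_train))
```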
### Hyperparameter tuning methods (grid search, etc.)
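
As one possible sketch of grid search, the example below tunes a decision tree with `GridSearchCV`. The parameter grid is an arbitrary assumption chosen to keep the search small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Candidate values to try; every combination is evaluated with 5-fold cross-validation.
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)
```

The search only sees the training data, so the test set stays untouched for the final evaluation.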

## 7. Make predictions
### Predict on test data
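
A minimal sketch of predicting on the test set, assuming a decision tree trained on the iris data: `predict` returns one class label per test example, and `predict_proba` returns one probability per class.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

# One predicted label per row of X_test.
y_pred = model.predict(X_test)

# One row per test example, one column per class.
y_proba = model.predict_proba(X_test)

print("First 5 predictions:", y_pred[:5])
print("First 5 true labels:", y_test[:5])
print("Probabilities shape:", y_proba.shape)
```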

## 8. Save/Load Models
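
One common way to save and load a trained model is `joblib`, which is installed alongside sk-learn. The sketch below is a minimal example under that assumption. The file name `model.joblib` is hypothetical, and a temporary directory is used so the example cleans up after itself.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)          # save the trained model to disk
    loaded_model = joblib.load(model_path)  # load it back into memory

# The loaded model should behave exactly like the original.
same_predictions = bool((loaded_model.predict(X) == model.predict(X)).all())
print("Loaded model matches original:", same_predictions)
```

Only load model files from sources you trust, since loading them can execute arbitrary code.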

## 9. Evaluate the Model
### Accuracy scoring
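
Accuracy is the fraction of predictions that match the true labels. Here is a minimal sketch using small, made-up label arrays rather than real model output.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for six examples; exactly one prediction is wrong.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.3f}")  # 5 correct out of 6
```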
### Confusion matrix
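
A confusion matrix counts, for each true class (rows), how often each class was predicted (columns). Here is a minimal sketch using small, made-up label arrays.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for six examples; one class-1 example is predicted as class 2.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Correct predictions land on the diagonal, so off-diagonal entries show which classes get confused with each other.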
### Classification reporting
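
`classification_report` summarizes precision, recall, and F1-score for each class. Here is a minimal sketch using small, made-up label arrays. Passing `output_dict=True` returns the same numbers as a dictionary, which is handy when you want to use them in code.

```python
from sklearn.metrics import classification_report

# Made-up labels for six examples; one class-1 example is predicted as class 2.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# Human-readable text table.
print(classification_report(y_true, y_pred))

# The same metrics as a nested dictionary keyed by class label.
report = classification_report(y_true, y_pred, output_dict=True)
print("Recall for class 1:", report["1"]["recall"])
print("Precision for class 2:", report["2"]["precision"])
```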