# Beginner's Guide to Scikit-learn: 7 Essential Commands

Scikit-learn is a powerful Python library for machine learning. This guide introduces 7 essential Scikit-learn commands to help you kick-start your Scikit-learn journey.

## 1. Import Scikit-learn

We need to import Scikit-learn before we can use it. Importing the library gives us access to its suite of machine learning algorithms, tools and utilities, which we will need for data analysis and model evaluation.

In [12]:
import sklearn
from sklearn import datasets

The 'datasets' module gives us access to the datasets bundled with the Scikit-learn library.
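As a quick check that the import worked, the sketch below prints the installed version and lists the built-in dataset loaders. It assumes the loaders follow the `load_` naming prefix; the variable name `loader_names` is only illustrative, and the exact output depends on your installed version.

```python
import sklearn
from sklearn import datasets

# Print the installed version; the number depends on your environment.
print(sklearn.__version__)

# Collect the names in the datasets module that start with "load_".
loader_names = [name for name in dir(datasets) if name.startswith("load_")]
print(loader_names)
```

If `load_breast_cancer` appears in the printed list, the dataset used in the rest of this guide is available.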

## 2. Loading a dataset

To explore and learn about Scikit-learn, let's load one of the built-in datasets available in the Scikit-learn library.

In [20]:
# Load the breast cancer dataset
breastCancer_ds = datasets.load_breast_cancer()

Now that we have loaded the dataset, let's print the first 3 rows of data.

In [27]:
print(breastCancer_ds.data[:3])

[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
  1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01
  6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01
  2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01
  3.613e-01 8.758e-02]]


As you can see, we get a bunch of numbers. These are the first 3 rows of the features in the Breast Cancer dataset. A feature is a measurable property of whatever is being observed.

Each row corresponds to a single sample, in this case a patient. Each column holds one feature, such as a measurement of the tumor. These observations help an analyst (or a model) to understand and classify the data.
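To see the rows and columns described above, here is a small sketch that inspects the loaded dataset through its `data`, `feature_names`, `target_names` and `target` attributes. The variable names `n_samples` and `n_features` are just illustrative.

```python
from sklearn import datasets

breastCancer_ds = datasets.load_breast_cancer()

# Rows are samples (patients), columns are features (measurements).
n_samples, n_features = breastCancer_ds.data.shape
print(n_samples, n_features)  # 569 30

# Names of the first few feature columns.
print(breastCancer_ds.feature_names[:5])

# Class names: label 0 is 'malignant', label 1 is 'benign'.
print(breastCancer_ds.target_names)

# Labels of the first 3 samples shown earlier.
print(breastCancer_ds.target[:3])
```

Each of the 30 numbers in a printed row therefore belongs to one named feature column.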

## 3. Splitting the dataset

Splitting a dataset is a common practice for evaluating how well a model performs on data it has not seen. To do this we split our dataset into two sets, the 'Training' set and the 'Testing' set.
As the names suggest, the training set is used to train the model and the testing set is used to test the model we trained.

In [37]:
from sklearn.model_selection import train_test_split

# Split the breast cancer dataset
X_train, X_test, y_train, y_test = train_test_split(breastCancer_ds.data, breastCancer_ds.target, test_size=0.3, random_state=70)

**'train_test_split'** is a Scikit-learn function that splits a dataset into two subsets: a training set and a testing set. 'breastCancer_ds.data' and 'breastCancer_ds.target' are the two parts of the dataset. 'breastCancer_ds.data' holds the features used to train the model, while 'breastCancer_ds.target' contains the target, which is the label or classification that our model aims to predict.
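The sketch below repeats the split and checks its effect. With `test_size=0.3`, roughly 30% of the 569 samples go to the test set, and fixing `random_state=70` makes the shuffle reproducible. The name `X_train_again` is only illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

breastCancer_ds = datasets.load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    breastCancer_ds.data, breastCancer_ds.target, test_size=0.3, random_state=70
)

# Number of rows in each subset; together they add up to 569.
print(X_train.shape, X_test.shape)  # (398, 30) (171, 30)

# Splitting again with the same random_state gives the same rows.
X_train_again, _, _, _ = train_test_split(
    breastCancer_ds.data, breastCancer_ds.target, test_size=0.3, random_state=70
)
print((X_train == X_train_again).all())
```

Changing or omitting `random_state` would produce a different split, and therefore slightly different results in the later steps.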

## 4. Standardizing the data

Standardization is a preprocessing step in machine learning. It rescales the values of each feature so that the feature has a mean of 0 and a standard deviation of 1.
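To make the definition concrete, here is a minimal sketch that standardizes a small made-up matrix by hand, using z = (x - mean) / standard deviation per column, and compares the result with `StandardScaler`. The array `X_toy` is hypothetical data chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two feature columns on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Standardize each column by hand: subtract its mean, divide by its
# standard deviation (np.std uses the population formula by default).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same operation with Scikit-learn.
scaled = StandardScaler().fit_transform(X_toy)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, both columns contain the same values even though their original ranges differed by a factor of 200.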

#### Why is it important?

Put simply, standardizing data brings the features onto a common scale, which makes them more comparable.

##### a) Increases Consistency:

Different features are usually measured on different scales (e.g. weight is measured in kg or lbs, hearing is measured in decibels (dB), and vision is measured with a different system, the Snellen chart). If we feed all of these scales to a model as they are, features with larger numeric ranges can dominate the others (it's like asking a calculator to add 2 to Melbourne... makes no sense!). Standardizing ensures that all the features we consider are on the same scale and contribute comparably to the model's learning.

##### b) Improved Performance:

Many algorithms perform better on standardized data because it helps them converge (that is, settle on the parameters that give the best fit) faster during training.

##### c) Outlier Influence:

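Standardization puts outliers on the same scale as the rest of the data, but it does not remove them. The mean and standard deviation are themselves sensitive to extreme values, so features with heavy outliers may need a more robust alternative such as Scikit-learn's 'RobustScaler'.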


In [48]:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
X_train = standard_scaler.fit_transform(X_train)
X_test = standard_scaler.transform(X_test)
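Note that the scaler is fitted on the training set only and then reused on the test set. The sketch below checks what that means in practice: the scaled training features have mean 0 and standard deviation 1, while the test features, scaled with the training statistics, are only approximately centred. The names `X_train_scaled`, `X_test_scaled`, `train_means`, `train_stds` and `test_means` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

breastCancer_ds = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    breastCancer_ds.data, breastCancer_ds.target, test_size=0.3, random_state=70
)

standard_scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training set.
X_train_scaled = standard_scaler.fit_transform(X_train)
# transform reuses those training statistics on the test set.
X_test_scaled = standard_scaler.transform(X_test)

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print(np.allclose(train_means, 0.0, atol=1e-8))  # training columns are centred
print(np.allclose(train_stds, 1.0))              # and have unit spread
print(test_means[:3])                            # close to 0, but not exactly 0
```

Fitting on the training set only keeps information about the test set from leaking into the preprocessing step.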

## 5. Training a model

Let's train a logistic regression model on our training set.

#### What's a logistic regression model?

It's a statistical model used for binary classification. It predicts the probability of an outcome based on the features of the dataset (e.g. whether a tumor is malignant or benign).
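Under the hood, logistic regression turns a weighted sum of the features into a probability with the logistic (sigmoid) function. The sketch below shows only that function; the helper `sigmoid` and the example `scores` are hypothetical names and values used for illustration, not part of Scikit-learn.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores (weighted sums of features) for three samples.
scores = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(scores)

# Large negative scores map near 0, a score of 0 maps to 0.5,
# and large positive scores map near 1.
print(probabilities)
```

A probability above 0.5 is then read as one class and a probability below 0.5 as the other.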

In [59]:
from sklearn.linear_model import LogisticRegression

# Create a logistic regression model
regression_model = LogisticRegression(max_iter=10000)

# Train the model
regression_model.fit(X_train, y_train)


'max_iter' is the maximum number of iterations the solver is allowed to run, in our case 10000. A generous limit gives the solver enough room to converge on the best-fitting model parameters. If the limit is too low, training stops early with a convergence warning and the predictions may be less accurate.
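To see how many iterations the solver actually needed, a fitted `LogisticRegression` exposes the `n_iter_` attribute. The sketch below repeats the earlier steps so that it runs on its own; `iterations_used` is an illustrative name, and the exact count may vary between Scikit-learn versions.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

breastCancer_ds = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    breastCancer_ds.data, breastCancer_ds.target, test_size=0.3, random_state=70
)
X_train = StandardScaler().fit_transform(X_train)

regression_model = LogisticRegression(max_iter=10000)
regression_model.fit(X_train, y_train)

# n_iter_ holds the number of iterations the solver ran before stopping.
iterations_used = int(regression_model.n_iter_[0])
print(iterations_used, "of", regression_model.max_iter, "allowed iterations")
```

On standardized data the solver typically stops long before the limit, so 'max_iter' acts as a safety cap rather than a target.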

## 6. Making predictions

Once the training of our model is completed, we can use our model to make predictions on the test set.

In [65]:
# Make predictions
y_pred = regression_model.predict(X_test)
print(y_pred)


[1 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1
 1 1 0 0 1 0 1 0 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 1
 0 1 1 1 1 1 1 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1
 0 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 0 1
 1 1 0 1 0 1 1 1 1 1 0 0 1 1 0 0 0 0 1 0 1 1 0]


Here we call the model's 'predict' method to make predictions on the test set (X_test) using the regression model we already trained.

'y_pred' holds the model's predicted labels for our test set 'X_test'.
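Because logistic regression predicts probabilities, it can help to look at them next to the labels. The sketch below uses `predict_proba`, which returns one column per class, and shows that applying a 0.5 threshold to the class 1 column reproduces the output of `predict`. It repeats the earlier steps so that it runs on its own; `y_proba` and `labels_from_proba` are illustrative names.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

breastCancer_ds = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    breastCancer_ds.data, breastCancer_ds.target, test_size=0.3, random_state=70
)
standard_scaler = StandardScaler()
X_train = standard_scaler.fit_transform(X_train)
X_test = standard_scaler.transform(X_test)

regression_model = LogisticRegression(max_iter=10000)
regression_model.fit(X_train, y_train)

y_pred = regression_model.predict(X_test)

# One row per test sample, one column per class (class 0, class 1).
y_proba = regression_model.predict_proba(X_test)
print(y_proba[:3])

# Label a sample as class 1 when its class 1 probability exceeds 0.5.
labels_from_proba = (y_proba[:, 1] > 0.5).astype(int)
print((labels_from_proba == y_pred).all())
```

The probabilities show how confident the model is in each prediction, which the plain 0 and 1 labels hide.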

## 7. Evaluating the model

Now that we have trained a model and made predictions on our test dataset, it's time to evaluate the model using metrics such as accuracy and a classification report.

Scikit-learn offers various ways to evaluate a model. The module we use in this guide is **'sklearn.metrics'**, which has a wide range of tools for evaluating the performance of a machine learning model.

In [73]:
from sklearn.metrics import accuracy_score, classification_report

# Calculate accuracy
model_accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of the model: {model_accuracy}')

# Detailed classification report
detailed_report = classification_report(y_test, y_pred)
print(detailed_report)


Accuracy of the model: 0.9590643274853801
              precision    recall  f1-score   support

           0       0.94      0.95      0.94        63
           1       0.97      0.96      0.97       108

    accuracy                           0.96       171
   macro avg       0.95      0.96      0.96       171
weighted avg       0.96      0.96      0.96       171



'model_accuracy' is the accuracy score of our model. The **accuracy_score** function calculates the accuracy by comparing the true labels of the test set (y_test) with the predicted labels (y_pred). The value we print is the proportion of correct predictions out of all the predictions the model makes. In other words, our model has about 96% (0.959) accuracy.
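The accuracy calculation is simple enough to do by hand. The sketch below uses two short made-up label arrays (`y_true_toy` and `y_pred_toy` are hypothetical, not taken from the breast cancer data) and compares the manual result with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for 8 samples.
y_true_toy = np.array([0, 1, 1, 0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# Accuracy = number of matching labels / total number of labels.
manual_accuracy = np.mean(y_true_toy == y_pred_toy)
library_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy, library_accuracy)  # 0.75 0.75
```

Six of the eight predictions match the true labels, so both calculations give 0.75.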

Next, using the **classification_report** function, we generate a detailed classification report that shows the precision, recall, f1-score and support for each class. In binary classification there are two classes; in this dataset class 0 is Malignant (cancerous) tumors and class 1 is Benign (non-cancerous) tumors.
The report shows:
*  **Precision**: The ratio of correctly predicted positive observations to the total predicted positives.
*  **Recall**: The ratio of correctly predicted positive observations to all the observations that actually belong to the class.
*  **f1-score**: The harmonic mean of the precision and the recall.
*  **Support**: The number of occurrences of each class in the test set.
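These definitions can be checked by hand from a confusion matrix. The sketch below uses two short made-up label arrays (`y_true_toy` and `y_pred_toy` are hypothetical) and treats class 1 as the positive class, then compares the manual results with Scikit-learn's `precision_score`, `recall_score` and `f1_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true and predicted labels for 8 samples.
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 1, 1, 1, 1, 1, 0, 1])

# For binary labels, ravel() returns the counts in this order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # correct positives / predicted positives
recall = tp / (tp + fn)      # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = tp + fn            # number of actual positives

print(precision, recall, f1, support)
print(precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy),
      f1_score(y_true_toy, y_pred_toy))
```

Here the model finds 4 of the 5 actual positives (recall 0.8) but also raises 2 false alarms, which lowers the precision to 4 out of 6.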

These 7 commands provide an introduction to using Scikit-learn, from importing the library and loading a dataset to preprocessing the data and training, testing and evaluating a model.