# Scikit-learn basics

Scikit-learn is an open-source machine learning library for Python. It is built on NumPy, SciPy, and Matplotlib, and it offers simple and efficient tools for predictive data analysis. It is accessible to everybody and reusable in various contexts. Most of its models share the same estimator interface (`fit`, then `predict` or `transform`), which is why a complete train-and-evaluate workflow often takes only a few lines of code.

Here are some of the basics of scikit-learn that cover its core functionalities:

## Installation
If you don't have scikit-learn installed, you can install it using pip:

In [1]:
# !pip install -U scikit-learn
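
After installing, a quick way to confirm that the package is importable is to print its version. This is only a minimal check, and the version string you see depends on your environment:

```python
import sklearn

# The installed version; several APIs mentioned in tutorials differ between releases
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```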

## Importing Scikit-learn
To use scikit-learn, import the submodules you need from the `sklearn` package. NumPy is a dependency of scikit-learn. Pandas is not required, but it is commonly used alongside it for loading and inspecting data.

In [2]:
import numpy as np
import pandas as pd
from sklearn import datasets, metrics

## Working with Datasets
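Scikit-learn comes with a few small standard datasets, like the iris and digits datasets for classification and the diabetes dataset for regression. (The Boston house prices dataset, often cited in older tutorials, was deprecated in version 1.0 and removed in version 1.2.)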

In [3]:
# Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
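
Before modeling, it usually helps to look at what the loader returned. The sketch below reloads iris so that it runs on its own and prints the array shapes, the feature and class names, and the number of samples per class:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# X is (n_samples, n_features); y holds one integer class label per sample
print("X shape:", X.shape)
print("y shape:", y.shape)
print("Features:", iris.feature_names)
print("Classes:", list(iris.target_names))

# Number of samples in each class
class_counts = np.bincount(y)
print("Samples per class:", class_counts)
```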

## Data Preprocessing
Data preprocessing is crucial in machine learning. Scikit-learn offers tools for normalization, binarization, encoding categorical variables, imputing missing values, and more.

In [4]:
from sklearn import preprocessing

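# Standardization: rescale each feature to zero mean and unit variance.
# Note: X_scaled is shown for illustration only; the cells below use the unscaled X.
X_scaled = preprocessing.scale(X)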
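
The other preprocessing tools mentioned above follow the same fit/transform pattern. Here is a minimal sketch on toy data; the arrays `X_num` and `colors` are made-up values for illustration, not part of the iris example:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy numeric data with one missing value (hypothetical values)
X_num = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])

# Imputation: replace missing values with the column mean
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_num)

# Standardization: rescale each column to zero mean and unit variance
X_std = StandardScaler().fit_transform(X_imputed)

# Encoding: turn a categorical column into one-hot indicator columns
colors = np.array([["red"], ["green"], ["red"]])
encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(colors).toarray()

print(X_imputed)
print(X_std)
print(encoder.categories_)
print(X_onehot)
```

The encoder orders categories alphabetically, so the first indicator column here corresponds to `green` and the second to `red`.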

## Splitting Data
You usually split your data into a training set and a test set. The training set is used to build and train the model, while the test set is used to evaluate its performance.

In [5]:
from sklearn.model_selection import train_test_split

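# Hold out 30% of the samples for testing.
# No random_state is set, so the split (and the accuracy below) changes from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)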
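
If you want the split to be repeatable and the class proportions preserved, `train_test_split` accepts `random_state` and `stratify`. The sketch below also shows one common way to avoid leaking test-set statistics into preprocessing: fit the scaler on the training set only, then apply it to both sets. The seed value 42 is an arbitrary choice:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

# random_state makes the split repeatable; stratify keeps class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Learn the scaling parameters from the training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train/test shapes:", X_train.shape, X_test.shape)
print("Test samples per class:", np.bincount(y_test))
```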

## Model Selection
Scikit-learn implements a wide range of machine learning algorithms. You can choose one based on your problem.

For example, using a Random Forest Classifier:

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)

# Train the model using the training sets
clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)
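
One way to choose between algorithms is to compare them with cross-validation rather than a single train/test split. The sketch below assumes two candidate models picked purely for illustration (a random forest and a logistic regression) and scores each with 5-fold cross-validation:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Candidate models; the choice of these two is only an example
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in candidates.items():
    # For classifiers, cross_val_score reports accuracy on each of the 5 folds
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Because every estimator exposes the same `fit` and `predict` methods, swapping in another algorithm only means changing an entry in `candidates`.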

## Model Evaluation
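Scikit-learn provides several metrics for evaluating a model, and the right one depends on the type of task (classification, regression, or clustering). For a classifier, accuracy is the simplest: the fraction of test samples predicted correctly.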

In [7]:
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9555555555555556
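
Accuracy is a single number and does not show which classes get confused with each other. A confusion matrix and a per-class report give more detail. The sketch below retrains the same kind of model so that it runs on its own; the fixed seeds are an assumption added for repeatability, so its numbers will not necessarily match the output above:

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall, and F1 score for each class
report = metrics.classification_report(
    y_test, y_pred, labels=[0, 1, 2], target_names=list(iris.target_names)
)
print(report)
```

The diagonal of the matrix counts correct predictions, so dividing its sum by the total number of test samples gives the accuracy.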


## Model Persistence
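You can save a trained scikit-learn model with Python’s built-in persistence module, `pickle`. Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it.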

In [8]:
import pickle

# Save to file in the current working directory
pkl_filename = "pickle_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(clf, file)

To load the model back:

In [9]:
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)
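
A simple sanity check after loading is to confirm that the restored model makes the same predictions as the original. The sketch below is self-contained: it trains a fresh model and writes the pickle to a temporary directory (an assumption made here so the example leaves no files behind):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Save and reload inside a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pickle_model.pkl")
    with open(path, "wb") as file:
        pickle.dump(clf, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print("Predictions match:", same_predictions)
```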

These are the basics to get you started with scikit-learn. The library offers a lot more functionality than is covered here. Its ease of use and variety of tools make it a go-to library for many data scientists and machine learning practitioners.