# Scikit-Learn Cheatsheet: Methods For Classification and Regression

## Scikit-Learn for Regression
### Background
We will use the diabetes dataset to demonstrate regression.

It has 10 feature variables covering the age, sex, and other clinical measurements of patients.

The target variable is a quantitative measure of disease progression in each patient.

The objective is to predict the target from the values of the features.

In [14]:
# import libraries and modules
import sklearn
import pandas as pd
# import the datasets module of sklearn to use its built-in datasets
from sklearn import datasets

In [15]:
# load the diabetes data and set the column names
diabetes = datasets.load_diabetes()
columns = "age sex bmi map tc ldl hdl tch ltg glu".split()
# create a pandas DataFrame from the feature array in diabetes.data
# x holds the independent variables (features)
x = pd.DataFrame(diabetes.data, columns=columns)
# store the target values separately
# y holds the dependent variable (target)
y = diabetes.target
x.head()

|   | age | sex | bmi | map | tc | ldl | hdl | tch | ltg | glu |
|---|-----|-----|-----|-----|----|-----|-----|-----|-----|-----|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |


In [16]:
# It is common practice to set aside part of the data for testing.
# train_test_split() randomly splits the data into training and testing sets; the test_size parameter sets the fraction held out for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(353, 10) (353,)
(89, 10) (89,)
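
Because the split is random, the scores in the following cells change from run to run. A minimal sketch of a reproducible split, assuming you want a fixed seed (the random_state value 42 is an arbitrary choice and is not part of the original cells):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the features and target as NumPy arrays
X, y = datasets.load_diabetes(return_X_y=True)

# random_state fixes the shuffle, so the same rows land in each set on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print((X_test == X_test_again).all())  # the two splits contain the same rows
```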


## With the training and testing sets ready, let's move on to the models we can use

### Linear Regression
Linear regression is one of the most common methods for supervised learning.

A regression line is fitted to the available data points.

It is easy to interpret and cheap to train, and it is often used as a baseline model.

In [17]:
# Import the model class from the linear_model module of sklearn
from sklearn.linear_model import LinearRegression
# Initialize
model = LinearRegression()
# Fit it with the training data
model.fit(X_train, y_train)

LinearRegression()

In [18]:
# Use the trained model to make predictions on unseen data. You can also evaluate it with the built-in score() method, which returns R squared for regressors.
predictions = model.predict(X_test)
model.score(X_test, y_test)

0.46543838161202344
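
For regressors, score() reports the coefficient of determination (R squared) of the predictions. A small sketch that checks this against r2_score from sklearn.metrics; the fixed random_state is an assumption added only to make the run repeatable:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# score() and r2_score() compute the same quantity on the same data
score = model.score(X_test, y_test)
manual_r2 = r2_score(y_test, predictions)
print(score, manual_r2)
```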

### Ridge Regression
Ridge regression is a regularized version of linear regression.

It addresses some issues of the OLS (ordinary least squares) method, such as unstable coefficients when features are correlated.

It imposes a penalty on the size of the coefficients, with the strength controlled by the alpha parameter. This penalty is added to the residual sum of squares that ridge regression minimizes, which makes the model more robust.

In [19]:
from sklearn import linear_model
model = linear_model.Ridge(alpha=.5)
model.fit(X_train, y_train)

Ridge(alpha=0.5)
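
To see what the penalty does, the sketch below fits ridge models with a few alpha values and compares the overall size of their coefficients with plain least squares. The alpha values are arbitrary choices for illustration, and the model is fitted on the full dataset only to keep the example short:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression, Ridge

X, y = datasets.load_diabetes(return_X_y=True)

# Unpenalized least squares as the reference point
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)

# A larger alpha means a stronger penalty and smaller coefficients
norms = {}
for alpha in (0.01, 0.5, 10.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    norms[alpha] = np.linalg.norm(ridge.coef_)
    print(f"alpha={alpha}: coefficient norm = {norms[alpha]:.1f}")

print(f"OLS: coefficient norm = {ols_norm:.1f}")
```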

### Polynomial Regression
Real-world data often has non-linear patterns that simple linear models cannot capture.

Polynomial regression fits a higher-degree curve to the data.

This makes the model more flexible, at the cost of a higher risk of overfitting.

To implement this in scikit-learn, you can combine PolynomialFeatures and LinearRegression in a pipeline.

The polynomial degree is set in the PolynomialFeatures step of the pipeline.

In [20]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
        ('linear', LinearRegression(fit_intercept=False))])
model.fit(X_train, y_train)

Pipeline(steps=[('poly', PolynomialFeatures(degree=3)),
                ('linear', LinearRegression(fit_intercept=False))])
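
PolynomialFeatures adds a constant bias column by default, which is one reason a pipeline like the one above can pass fit_intercept=False to LinearRegression. The number of generated columns also grows quickly with the degree. A short sketch that counts them for the diabetes features:

```python
from sklearn import datasets
from sklearn.preprocessing import PolynomialFeatures

X, y = datasets.load_diabetes(return_X_y=True)

# Count the columns produced for each polynomial degree
n_features = {}
for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree)  # include_bias=True by default
    n_features[degree] = poly.fit_transform(X).shape[1]

print(n_features)
```

With 10 input features, degree 3 produces 286 columns, which is close to the number of training rows, so a model this flexible can overfit easily.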

### Support Vector Regression (SVR)
Support vector machines were initially developed for classification problems, but they have been extended to regression too.

They can be used when you have a high-dimensional feature space.

They provide different kernel options to suit the data.

In [21]:
from sklearn import svm
model = svm.SVR()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
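
The kernel is chosen with the kernel parameter of SVR. A rough sketch that tries three built-in kernels on the diabetes data; the value C=100.0 and the fixed random_state are assumptions made for illustration, not tuned settings:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit one model per kernel and record its R squared on the test set
scores = {}
for kernel in ("linear", "poly", "rbf"):
    model = SVR(kernel=kernel, C=100.0)
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

for kernel, score in scores.items():
    print(f"{kernel}: R^2 = {score:.3f}")
```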

### Decision Tree Regression
Decision tree regression is a tree-based model where the data is split into subgroups based on homogeneity.

You can import this model from the tree module of sklearn.

To avoid overfitting, use the “max_depth” parameter to limit the maximum depth of the decision tree.

If the value is set too high, the model might fit noise and perform poorly on a test dataset.

In [22]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(max_depth=12)
model.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=12)
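
One way to check whether a depth is too high is to compare the score on the training data with the score on the test data. A minimal sketch, assuming depths 2 and 12 as the two settings to compare (the fixed random_state values are also assumptions):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Store (train R^2, test R^2) for a shallow and a deep tree
tree_scores = {}
for depth in (2, 12):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    tree_scores[depth] = (
        model.score(X_train, y_train),
        model.score(X_test, y_test),
    )
    print(depth, tree_scores[depth])
```

A large gap between the training and test scores is a sign that the tree has fitted noise.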

### Random Forest Regression
Decision tree models are often improved by combining multiple trees.

These are ensemble learning methods.

They can be broadly classified into boosting and bagging algorithms; random forest is a bagging method.

The base models are weaker learners, and combining many of them gives the final, stronger model.

The “ensemble” module of sklearn contains these models. “n_estimators” is an important parameter that sets the number of decision trees to train.

In [23]:
from sklearn.ensemble import RandomForestRegressor
model=RandomForestRegressor(n_estimators=10, max_features=2, max_leaf_nodes=5,random_state=42)
model.fit(X_train, y_train)

RandomForestRegressor(max_features=2, max_leaf_nodes=5, n_estimators=10,
                      random_state=42)
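
After fitting, the individual trees and the aggregated feature importances can be inspected on the model. A short sketch using the feature names that ship with the dataset; the hyperparameters are illustrative only:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor

data = datasets.load_diabetes()
X, y = data.data, data.target

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X, y)

# One fitted decision tree per estimator
print(len(model.estimators_))

# Importances are normalized to sum to 1 across features
importances = model.feature_importances_
ranked = sorted(zip(data.feature_names, importances), key=lambda p: -p[1])
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```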

## Scikit-Learn For Classification

Scikit-learn provides built-in datasets and models for classification tasks.

We will use the Iris dataset to learn classification with scikit-learn.

Our aim is to classify the species of a flower given features such as petal length and width.

There are 3 species classes: setosa, versicolor, and virginica, which makes this a multiclass classification problem.

We can split the data into training and testing datasets as shown in the code block below.

In [24]:
import pandas as pd
from sklearn.datasets import load_iris
iris=load_iris()
x = pd.DataFrame(iris.data) 
y = iris.target 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

### Logistic regression

Logistic regression is a linear model adapted from linear regression to address classification problems.

It applies regularization by default (an L2 penalty in scikit-learn).

For multiclass problems, it can use a One-vs-Rest strategy, where a separate binary classifier is trained for each class; recent scikit-learn versions instead fit a single multinomial (softmax) model by default.

In [25]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=13)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
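
Besides hard class labels, a fitted logistic regression can return a probability for each class. A minimal sketch; max_iter=1000 and the fixed random_state values are assumptions added to avoid convergence warnings and to make the run repeatable:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000, random_state=13)
clf.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_test)
print(clf.classes_)
print(proba[:3].round(3))
```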

### Support vector classifiers

Support vector classifiers are widely used for classification problems with high-dimensional features.

They can implicitly map the feature space into a higher dimension using a kernel function.

Multiple kernel options are available, including linear, RBF (Radial Basis Function), polynomial, and so on.

The “gamma” parameter, which is the kernel coefficient, can also be tuned.

In [26]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
clf = make_pipeline(StandardScaler(), SVC(gamma='auto', kernel ='rbf'))
clf.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])
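
Tuning gamma together with C can be automated with a grid search. A hedged sketch: the grid values below are arbitrary starting points, and the step name svc is the one make_pipeline generates from the class name:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
```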

### Gaussian Naive Bayes Classifier (Popular Classification Algorithm)
Applies Bayes' theorem of conditional probability.

Assumes the features are conditionally independent of each other given the class, and that each feature follows a Gaussian distribution within a class.


In [27]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)

GaussianNB()

### Decision Tree Classifier
A decision tree classifier is a tree-based model where the dataset is split based on the values of various attributes, so that data points with similar feature values are grouped together.

Make sure to tune the maximum depth and the minimum samples per split or leaf (“max_depth”, “min_samples_split”, “min_samples_leaf”), as this helps avoid overfitting.

In [28]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

DecisionTreeClassifier()
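
A sketch of a constrained tree, assuming max_depth=3 and min_samples_leaf=5 as example limits (these values are illustrative, not tuned):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Limit the depth and require at least 5 training samples in every leaf
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)

print(clf.get_depth(), clf.get_n_leaves())
print(clf.score(X_test, y_test))
```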

### Gradient Boosting Classifier
Boosting is an ensemble learning method where multiple decision trees are combined to improve performance.

It is a sequential learning method: trees are trained one after another, each one correcting the errors of the previous ones, and their outputs are added up to produce the final result.

We can tune hyperparameters such as the learning rate and the number of estimators to get better results.

In [29]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=5, random_state=0)
clf.fit(X_train, y_train)

GradientBoostingClassifier(learning_rate=1.0, max_depth=5, random_state=0)
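
Because the trees are added one stage at a time, the model can be evaluated after each stage with staged_predict. A minimal sketch; learning_rate=0.1 and max_depth=3 are assumed values chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
clf.fit(X_train, y_train)

# staged_predict yields the predictions after each boosting stage
staged_acc = [
    accuracy_score(y_test, stage_pred) for stage_pred in clf.staged_predict(X_test)
]
print(staged_acc[0], staged_acc[-1])
```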

### kNN Classification
kNN classifies a data point by looking at its nearest neighbors in the training data.

The value of k is set with the "n_neighbors" parameter.

The algorithm finds the k training points closest to a new data point and assigns it the majority class among them.

Unlike k-means clustering, kNN does not form clusters or iterate over cluster centers; it stores the training data and computes distances at prediction time.

In [30]:
from sklearn import neighbors
# weights="uniform" gives every neighbor an equal vote ("distance" is the other built-in option)
clf = neighbors.KNeighborsClassifier(n_neighbors=5, weights="uniform")
clf.fit(X_train, y_train)
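
A common way to pick k is to compare cross-validated accuracy for a few candidate values. A small sketch, assuming the candidates 1, 3, 5, 7, and 9 and 5-fold cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold accuracy for each candidate k
cv_means = {}
for k in (1, 3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k)
    cv_means[k] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"k={k}: mean accuracy = {cv_means[k]:.3f}")

best_k = max(cv_means, key=cv_means.get)
print("best k:", best_k)
```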

## Scikit-learn metrics for evaluation
Scikit-learn offers multiple functions and metrics to evaluate the predictions for both regression and classification.

### Regression metrics
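R squared (the coefficient of determination) is one of the most important regression metrics; it measures the proportion of variance in the target that the model explains.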

In [None]:
from sklearn.metrics import r2_score
# model, X_test, y_test: a fitted regressor and the diabetes test split from the regression section
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))

The syntax is similar for all the metrics.

Below is a list of other metrics you can import and use as required.

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import median_absolute_error
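
A minimal end-to-end sketch that computes several of these metrics for a linear regression on the diabetes data. mean_squared_log_error is left out here because it requires non-negative targets and predictions, which a linear model does not guarantee; the fixed random_state is an assumption:

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    median_absolute_error,
)
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Every metric takes the true values first and the predictions second
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
medae = median_absolute_error(y_test, y_pred)
print(mae, mape, medae)
```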

### Classification metrics
You can generate a classification report with sklearn for any classification problem. 

This provides the precision and recall for each class in the case of a multiclass classification task.

It also calculates the F1 score and accuracy of the predictions.



In [None]:
from sklearn.metrics import classification_report
# y_pred holds the predictions of a fitted classifier (clf) on the Iris test split
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

A confusion matrix is also a recommended tool for evaluating classification.

It helps you see how well the model is performing and whether it produces more false positives or false negatives.

Sklearn provides a simple way to obtain that too.



In [None]:
from sklearn.metrics import confusion_matrix
# y_pred holds the predictions of a fitted classifier (clf) on the Iris test split
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
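
A self-contained sketch that produces both the classification report and the confusion matrix. The classifier choice, max_iter=1000, and the stratified split with a fixed seed are assumptions made so the example runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall, and F1 score, plus overall accuracy
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```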

### Further Info
Scikit-learn provides the following algorithms:
1. Linear regression
2. Logistic regression
3. Decision tree models
4. Random forest regression
5. Gradient boosting regression
6. Gradient boosting classification
7. K-nearest neighbors
8. Support vector machines
9. Naive Bayes
10. Neural networks, and a lot more

Broadly, these algorithms can be grouped into supervised learning (regression, classification) and unsupervised learning (clustering).

## How to reduce overfitting in scikit-learn models?
Overfitting happens when the trained model is too complex and has fitted every training data point, including noise.

It has memorized the training data rather than learning the underlying pattern.

To help prevent overfitting, sklearn provides various hyperparameters you can tune for each model.

For example, in decision tree regression, you can reduce the maximum depth of the tree if you find it overfitting.

In support vector machine models, hyperparameters like C and gamma control regularization and model complexity.

In neural networks, you can reduce the number of layers, the number of neurons in hidden layers, and so on.
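
Overfitting usually shows up as a large gap between the training score and the validation score. A minimal sketch that measures this gap with cross-validation for an unrestricted tree and a depth-limited one; max_depth=3 is an assumed example value:

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = datasets.load_diabetes(return_X_y=True)

models = {
    "deep": DecisionTreeRegressor(random_state=0),  # no depth limit
    "shallow": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Gap between mean training R^2 and mean validation R^2 over 5 folds
gaps = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, return_train_score=True)
    gaps[name] = cv["train_score"].mean() - cv["test_score"].mean()
    print(f"{name}: train/validation gap = {gaps[name]:.2f}")
```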



## How to evaluate scikit-learn models?
There is a whole set of metrics available to evaluate your model in the “sklearn.metrics” module.

There are separate metrics for classification and regression problems.

Metrics like R squared, Mean Squared Error, and Mean Absolute Error are commonly used for regression.

Classification problems use metrics like accuracy, precision, recall, F1 score, AUC (area under the ROC curve, also written AUROC), and IoU (Intersection over Union, also known as the Jaccard index).
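
Many of these metrics can also be requested by name through the scoring parameter of cross_val_score. A short sketch, assuming linear and logistic regression as stand-in models; note that error metrics are exposed in negated form (for example neg_mean_absolute_error) so that higher is always better:

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

X_reg, y_reg = load_diabetes(return_X_y=True)
X_clf, y_clf = load_iris(return_X_y=True)

# Regression: R squared and (negated) mean absolute error
r2 = cross_val_score(LinearRegression(), X_reg, y_reg, cv=5, scoring="r2")
neg_mae = cross_val_score(
    LinearRegression(), X_reg, y_reg, cv=5, scoring="neg_mean_absolute_error"
)

# Classification: accuracy and macro-averaged F1 score
clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X_clf, y_clf, cv=5, scoring="accuracy")
f1 = cross_val_score(clf, X_clf, y_clf, cv=5, scoring="f1_macro")

print(r2.mean(), -neg_mae.mean())
print(acc.mean(), f1.mean())
```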
