# Scikit-Learn Cheatsheet: Methods For Classification and Regression

## Scikit-Learn for Regression
### Background
We will use the diabetes dataset that ships with scikit-learn to demonstrate regression.
It has 10 feature variables covering the age, sex, and other clinical measurements of patients.
The target variable is a numerical measure of the extent of diabetes in each patient.
The objective is to predict the target from the values of the features.

In [1]:
# import libraries and modules
import sklearn
import pandas as pd
# import the datasets module of sklearn to use the inbuilt available data
from sklearn import datasets

In [2]:
# load the diabetes data and set the column names
diabetes = datasets.load_diabetes()
columns = "age sex bmi map tc ldl hdl tch ltg glu".split()
# create a pandas DataFrame for the features from "diabetes.data"
# x holds the independent variables
x = pd.DataFrame(diabetes.data, columns=columns)
# store the target values separately
# y holds the dependent variable
y = diabetes.target
x.head()

| | age | sex | bmi | map | tc | ldl | hdl | tch | ltg | glu |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |


In [3]:
# In any modeling, it is common practice to set aside some of the data for testing.
# sklearn provides the function "train_test_split()", which randomly splits your data
# into training and testing sets. You can adjust the size of the testing set with the
# "test_size" parameter.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(353, 10) (353,)
(89, 10) (89,)
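
Because the split above is random, the rows that land in each set change from run to run, and so do the scores reported later. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable for your workflow (the seed value 42 is an arbitrary choice for illustration, not part of the original cell):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load features and target as NumPy arrays
X, y = load_diabetes(return_X_y=True)

# random_state fixes the shuffle, so repeated calls give the same split
# (42 is an arbitrary seed chosen for illustration)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
print((X_test == X_test2).all())  # True when both calls select the same test rows
```

Without random_state, two calls will generally select different rows, which is why scores differ between runs of the notebook.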


## After setting up the training and testing sets, let's move on to the models we can use

### Linear Regression
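Linear regression is the most common starting point for supervised regression. It fits a linear combination of the features to the target using ordinary least squares.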

In [4]:
# Import the model class from the linear_model module of sklearn
from sklearn.linear_model import LinearRegression
# Initialize
model = LinearRegression()
# Fit it with the training data
model.fit(X_train, y_train)

LinearRegression()

In [5]:
# Use the trained model to make predictions on unseen data. You can also evaluate it using the built-in score function, which returns R^2 for regressors.
predictions = model.predict(X_test)
model.score(X_test, y_test)

0.4696088586991457
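
For regressors, score() returns the coefficient of determination R². The sketch below checks that against r2_score from sklearn.metrics and also reports the mean squared error. The fixed seed is an assumption added here for reproducibility, so the numbers will not match the output above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
# random_state=0 is an arbitrary seed for a repeatable split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Both lines compute R^2 on the held-out data
r2_from_score = model.score(X_test, y_test)
r2_from_metric = r2_score(y_test, predictions)

# Mean squared error, in squared units of the target
mse = mean_squared_error(y_test, predictions)
print(r2_from_score, r2_from_metric, mse)
```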

### Ridge Regression
Ridge regression is a regularized version of linear regression. It addresses some weaknesses of the OLS (ordinary least squares) method, such as unstable coefficients when features are correlated, by adding a penalty on the size of the coefficients to the residual sum of squares. The alpha parameter controls the strength of this penalty: larger values shrink the coefficients more, which makes the model more robust.

In [6]:
from sklearn import linear_model
model = linear_model.Ridge(alpha=.5)
model.fit(X_train, y_train)

Ridge(alpha=0.5)
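
To see the effect of alpha, the sketch below fits Ridge with three penalty strengths and prints the Euclidean norm of the fitted coefficients. The alpha values are arbitrary choices for illustration, and the model is fit on the full dataset only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

# Illustrative penalty strengths, from weak to strong
alphas = [0.01, 0.5, 10.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # A smaller norm means the coefficients were shrunk more
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha}: ||coef|| = {coef_norms[-1]:.1f}")
```

The norm falls as alpha grows. In practice alpha is usually chosen by cross-validation, for example with RidgeCV.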

### Polynomial Regression
Real-world data often contains non-linear patterns that simple linear models cannot capture. Polynomial regression fits a higher-degree curve to the data by expanding the features into polynomial terms and then fitting a linear model on them. This makes the model more flexible, at the cost of a higher risk of overfitting. A convenient way to implement this in scikit-learn is a pipeline, where you set the polynomial degree in the PolynomialFeatures step.

In [7]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
        ('linear', LinearRegression(fit_intercept=False))])
model.fit(X_train, y_train)

Pipeline(steps=[('poly', PolynomialFeatures(degree=3)),
                ('linear', LinearRegression(fit_intercept=False))])
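
The number of generated features grows quickly with the degree, which is worth checking before fitting. A small sketch that counts the columns PolynomialFeatures produces for the 10 diabetes features (the count includes the constant bias column):

```python
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import PolynomialFeatures

X, _ = load_diabetes(return_X_y=True)

feature_counts = {}
for degree in (1, 2, 3):
    # fit() only inspects the input shape to enumerate the polynomial terms
    poly = PolynomialFeatures(degree=degree).fit(X)
    feature_counts[degree] = poly.n_output_features_
    print(degree, feature_counts[degree])
```

At degree 3 the expansion has 286 columns, close to the 353 training rows used above, so the unregularized linear step can overfit easily.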

### Support Vector Regression (SVR)
Support vector machines were initially developed for classification problems, but they have been extended to regression too. These models can be used when you have a high-dimensional feature space. They also provide different kernel options to suit the problem.

In [8]:
from sklearn import svm
model = svm.SVR()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
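
The kernel is chosen with the kernel parameter of SVR. The sketch below compares three built-in kernels on a fixed split. The StandardScaler step, the seed, and C=100 are assumptions made for illustration, not tuned values and not part of the original cell.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # C=100 is an illustrative guess, not a tuned value
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel, C=100))
    model.fit(X_train, y_train)
    # R^2 on the held-out data
    scores[kernel] = model.score(X_test, y_test)
    print(kernel, round(scores[kernel], 3))
```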

### Decision Tree Regression
Decision tree regression is a tree-based model in which the data is split into subgroups based on homogeneity. You can import this model from the tree module of sklearn.

To avoid overfitting, use the "max_depth" parameter to limit the maximum depth of the decision tree. If the value is set too high, the model might fit noise and perform poorly on a test dataset.

In [9]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(max_depth=12)
model.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=12)
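
One way to see the overfitting described above is to compare training and test scores at several depths. This is a sketch with an arbitrary seed and arbitrary depth values:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

train_scores, test_scores = {}, {}
for depth in (2, 4, 12):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # R^2 on the data the tree has seen vs. data it has not
    train_scores[depth] = model.score(X_train, y_train)
    test_scores[depth] = model.score(X_test, y_test)
    print(depth, round(train_scores[depth], 3), round(test_scores[depth], 3))
```

A training score that keeps rising while the test score stalls or drops is the usual sign that the tree is too deep.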

### Random Forest Regression
Decision tree models are often improved by combining multiple trees into one model. These are ensemble learning methods, which can be broadly classified into boosting and bagging algorithms. A random forest is a bagging method.

The base models are weak learners, and combining multiple weak learners gives the final, strong learner. The "ensemble" module of sklearn has all these models. "n_estimators" is an important parameter that sets the number of decision trees to train.

In [10]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=10, max_features=2, max_leaf_nodes=5, random_state=42)
model.fit(X_train, y_train)

RandomForestRegressor(max_features=2, max_leaf_nodes=5, n_estimators=10,
                      random_state=42)
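
After fitting, the forest exposes its individual trees and an importance value per feature. A short sketch, assuming 100 trees and the feature names that ship with the dataset (these differ from the column names set at the top of this notebook):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
X, y = data.data, data.target

# n_estimators sets how many trees are trained
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# estimators_ holds the fitted trees, one per estimator
n_trees = len(model.estimators_)

# Impurity-based importances, normalized to sum to 1
importances = dict(zip(data.feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```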

## Scikit-Learn For Classification

In [None]:
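# load the iris data: 4 flower measurements per sample, 3 species to classify
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
x = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
# hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)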

In [None]:
# Logistic regression: a linear classifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=13)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

In [None]:
# Support vector classifier with an RBF kernel, on standardized features
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
clf = make_pipeline(StandardScaler(), SVC(gamma='auto', kernel='rbf'))
clf.fit(X_train, y_train)

In [None]:
# Gaussian naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)

In [None]:
# Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

In [None]:
# Gradient boosting classifier: an ensemble of trees trained sequentially
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=5, random_state=0)
clf.fit(X_train, y_train)

In [None]:
# k-nearest neighbors classifier; weights can be 'uniform' or 'distance'
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
clf.fit(X_train, y_train)

In [None]:
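# R^2 is a regression metric: y_test must hold regression targets and y_pred a
# regressor's predictions, e.g. y_pred = model.predict(X_test) on the diabetes split
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))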

In [None]:
# Other regression metrics, all called as metric(y_true, y_pred)
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import median_absolute_error
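
A minimal sketch of how these regression metrics are called, using a linear regression on the diabetes data with an arbitrary seed. mean_squared_log_error is left out here because it only accepts non-negative targets and predictions, which a linear model does not guarantee.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    median_absolute_error,
)
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Every metric takes the true values first, then the predictions
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)  # a fraction, not a percent
medae = median_absolute_error(y_test, y_pred)
print(mae, mape, medae)
```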

In [None]:
# Per-class precision, recall, and F1 for the classifier fitted above
from sklearn.metrics import classification_report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
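
The classification metrics need predictions from a fitted classifier. The sketch below runs the whole flow on iris in one cell. The stratified split, the seed, and max_iter=1000 are assumptions added for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# stratify=y keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# max_iter is raised so the solver has room to converge on unscaled features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(accuracy_score(y_test, y_pred))
print(cm)
print(classification_report(y_test, y_pred))
```

Each row of the matrix sums to the number of test samples in that class, so off-diagonal entries show which species are confused with each other.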