**Run this Jupyter Notebook**
- Run this Notebook via [**Google Colab Platform**](https://colab.research.google.com/github/mhmaem/data_science_university/blob/master/05_sklearn_interactive_cheatsheets/05_01_sklearn_basics.ipynb)
- Download this [**Notebook**](https://github.com/mhmaem/data_science_university/blob/master/05_sklearn_interactive_cheatsheets/05_01_sklearn_basics.ipynb) to run it locally

---
---
# **Scikit-Learn**
Scikit-learn is an open source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.

The easiest way to use Scikit-Learn is to follow these steps in order:
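
To make the "unified interface" concrete, here is a minimal sketch in which three different classifiers are trained and scored with identical calls. The data set (`X_demo`, `y_demo`) is synthetic and introduced only for this illustration; it is not part of the notebook's data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic two-class data (hypothetical, for illustration only):
# two well-separated Gaussian blobs in 2D.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                    rng.normal(3.0, 0.5, size=(20, 2))])
y_demo = np.array([0] * 20 + [1] * 20)

# Every estimator exposes the same fit / score methods,
# so the loop body does not depend on the model type.
estimators = [KNeighborsClassifier(n_neighbors=3), SVC(kernel='linear'), GaussianNB()]
scores = {}
for est in estimators:
    est.fit(X_demo, y_demo)
    scores[type(est).__name__] = est.score(X_demo, y_demo)

print(scores)
```

Swapping one model for another therefore usually means changing only the constructor line.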

---
## **Loading The Data**

In [1]:
import numpy as np
X = np.random.random((10,5))
y = np.array(['M','M','F','F','M','F','M','M','F','F'])
X[X < 0.7] = 0

### **Training And Test Data**

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train, '\n\n', X_test, '\n\n', y_train, '\n\n', y_test)

[[0.         0.75976059 0.99483206 0.74645171 0.7209713 ]
 [0.         0.         0.         0.         0.72315595]
 [0.         0.         0.         0.74215606 0.        ]
 [0.75298782 0.         0.         0.         0.91200959]
 [0.         0.83166526 0.         0.         0.        ]
 [0.82727353 0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]] 

 [[0.         0.93866853 0.88728583 0.         0.        ]
 [0.         0.         0.71406547 0.93057107 0.91075855]
 [0.         0.87738038 0.7801023  0.99347839 0.        ]] 

 ['F' 'M' 'M' 'M' 'F' 'M' 'F'] 

 ['F' 'F' 'M']


---
## **Preprocessing The Data**
### **Standardization**

In [3]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)
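
As a sanity check on what standardization does, the sketch below fits a `StandardScaler` on illustrative random data (the `X_demo_*` arrays are assumptions made for this example) and confirms that the training columns end up with mean 0 and standard deviation 1, and that the test set is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data for the illustration.
rng = np.random.RandomState(0)
X_demo_train = rng.random_sample((8, 3)) * 10
X_demo_test = rng.random_sample((4, 3)) * 10

scaler_demo = StandardScaler().fit(X_demo_train)
Z_train = scaler_demo.transform(X_demo_train)
Z_test = scaler_demo.transform(X_demo_test)

print(Z_train.mean(axis=0))  # approximately 0 for every column
print(Z_train.std(axis=0))   # approximately 1 for every column

# The test set is transformed with the mean and std learned from the training set.
manual = (X_demo_test - X_demo_train.mean(axis=0)) / X_demo_train.std(axis=0)
print(np.allclose(Z_test, manual))
```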

### **Normalization**

In [4]:
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)
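
Unlike standardization, `Normalizer` works row by row: each sample is rescaled to unit L2 norm by default. A minimal sketch on a hand-picked array (`X_demo` is an assumption for this example):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples chosen so the result is easy to verify by hand.
X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0],
                   [0.0, 0.0]])

X_unit = Normalizer().fit_transform(X_demo)
row_norms = np.linalg.norm(X_unit, axis=1)

print(X_unit)     # first row becomes [0.6, 0.8]
print(row_norms)  # non-zero rows have norm 1; the all-zero row stays zero
```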

### **Binarization**

In [5]:
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)

### **Encoding Categorical Features**

In [6]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)
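
The encoder assigns integer codes in sorted label order and can map them back again. A short sketch, using a small hypothetical label array rather than the notebook's `y`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for the illustration.
labels = np.array(['M', 'M', 'F', 'F', 'M'])

enc_demo = LabelEncoder()
codes = enc_demo.fit_transform(labels)
decoded = enc_demo.inverse_transform(codes)

print(enc_demo.classes_)  # sorted unique labels: ['F' 'M']
print(codes)              # [1 1 0 0 1]
print(decoded)            # original string labels recovered
```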

### **Imputing Missing Values**

In [7]:
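# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
# and always imputes column-wise (the behaviour of the old axis=0).
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=0, strategy='mean')
imp.fit_transform(X_train)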



array([[0.79013068, 0.75976059, 0.99483206, 0.74645171, 0.7209713 ],
       [0.79013068, 0.79571292, 0.99483206, 0.74430388, 0.72315595],
       [0.79013068, 0.79571292, 0.99483206, 0.74215606, 0.78537894],
       [0.75298782, 0.79571292, 0.99483206, 0.74430388, 0.91200959],
       [0.79013068, 0.83166526, 0.99483206, 0.74430388, 0.78537894],
       [0.82727353, 0.79571292, 0.99483206, 0.74430388, 0.78537894],
       [0.79013068, 0.79571292, 0.99483206, 0.74430388, 0.78537894]])
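
To see how mean imputation fills the gaps, here is a minimal sketch on a tiny hand-made array, assuming a scikit-learn version that provides `sklearn.impute.SimpleImputer`. Zeros are treated as missing, and each one is replaced by the mean of the non-missing values in its column. `X_demo` is an assumption for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: 0 marks a missing entry.
X_demo = np.array([[0.0, 2.0],
                   [4.0, 0.0],
                   [6.0, 8.0]])

imp_demo = SimpleImputer(missing_values=0, strategy='mean')
X_filled = imp_demo.fit_transform(X_demo)

print(imp_demo.statistics_)  # per-column means of non-missing values: [5. 5.]
print(X_filled)              # zeros replaced by 5.0 in both columns
```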

---

## **Create Your Model**
### **Supervised Learning Estimators**
#### **Linear Regression**

In [9]:
from sklearn.linear_model import LinearRegression
# The normalize parameter was removed in scikit-learn 1.2; scale features with StandardScaler instead.
lr = LinearRegression()

#### **Support Vector Machines (SVM)**

In [10]:
from sklearn.svm import SVC
svc = SVC(kernel='linear')

#### **Naive Bayes**

In [11]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

#### **KNN**

In [12]:
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

### **Unsupervised Learning Estimators**
#### **Principal Component Analysis (PCA)**

In [13]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
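
When `n_components` is a float between 0 and 1, PCA keeps as many components as are needed to explain at least that fraction of the variance. A minimal sketch on synthetic, strongly correlated data (`X_demo` is an assumption for this example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: two nearly identical features plus a small-noise feature.
rng = np.random.RandomState(0)
base = rng.normal(size=(100, 1))
X_demo = np.hstack([
    base * 5.0,
    base * 5.0 + rng.normal(scale=0.1, size=(100, 1)),
    rng.normal(scale=0.1, size=(100, 1)),
])

pca_demo = PCA(n_components=0.95)
X_reduced = pca_demo.fit_transform(X_demo)
explained = pca_demo.explained_variance_ratio_.sum()

print(pca_demo.n_components_)  # number of components actually kept
print(explained)               # cumulative explained variance, at least 0.95
```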

#### **K Means**

In [14]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)

---
## **Model Fitting**

### **Supervised learning**

In [15]:
lr.fit(X, y)
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

### **Unsupervised Learning**

In [16]:
k_means.fit(X_train)
pca_model = pca.fit_transform(X_train)

---
## **Prediction**

### **Supervised Estimators**

In [17]:
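y_pred = svc.predict(np.random.random((2,5)))  # class labels for two new random samples
y_pred = lr.predict(X_test)                    # continuous regression estimates
y_pred = knn.predict_proba(X_test)             # per-class probability estimates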

### **Unsupervised Estimators**

In [18]:
y_pred = k_means.predict(X_test)

---
## **Evaluate Your Model’s Performance**

### **Classification Metrics**

#### **Accuracy Score**

In [19]:
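knn.score(X_test, y_test)
from sklearn.metrics import accuracy_score
# Score class labels predicted by the classifier; the reused y_pred holds
# k-means cluster ids, which never match the 'F'/'M' labels.
accuracy_score(y_test, knn.predict(X_test))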

#### **Classification Report**

In [20]:
from sklearn.metrics import classification_report
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
X_train_ = X_train.copy()
y_train_ = y_train.copy()
knn.fit(X_train_, y_train_)
y_pred_ = knn.predict(X_test)
print(classification_report(y_test, y_pred_))

              precision    recall  f1-score   support

           F       0.50      0.50      0.50         2
           M       0.00      0.00      0.00         1

   micro avg       0.33      0.33      0.33         3
   macro avg       0.25      0.25      0.25         3
weighted avg       0.33      0.33      0.33         3



#### **Confusion Matrix**

In [21]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred_))

[[1 1]
 [1 0]]
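
Rows of the confusion matrix are true classes and columns are predicted classes, so the diagonal counts the correct predictions. The sketch below rebuilds the matrix shown above from the test labels `['F', 'F', 'M']`; the order of the predictions in `y_pred_demo` is an assumption chosen to be consistent with that matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_demo = ['F', 'F', 'M']
y_pred_demo = ['F', 'M', 'F']  # assumed prediction order, consistent with the matrix above

cm = confusion_matrix(y_true_demo, y_pred_demo, labels=['F', 'M'])
print(cm)

# Accuracy is the sum of the diagonal divided by the total number of samples.
acc_from_cm = np.trace(cm) / cm.sum()
print(acc_from_cm)
print(accuracy_score(y_true_demo, y_pred_demo))
```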


### **Regression Metrics**

#### **Mean Absolute Error**

In [22]:
from sklearn.metrics import mean_absolute_error
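# Toy targets; y_pred still holds the k-means labels from the Prediction section,
# so the scores below only demonstrate the function calls.
y_true = [3, -0.5, 2]
mean_absolute_error(y_true, y_pred)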

1.1666666666666667

#### **Mean Squared Error**

In [23]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred)

2.4166666666666665

#### **R² Score**

In [24]:
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)

-0.11538461538461542
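
The three regression metrics can be checked by hand with NumPy. In the sketch below, `y_pred_demo = [2, 2, 2]` is an assumed prediction vector, chosen because it reproduces the three outputs shown above; the notebook does not print the actual `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true_demo = np.array([3, -0.5, 2])
y_pred_demo = np.array([2, 2, 2])  # assumed values, not taken from the notebook

errors = y_true_demo - y_pred_demo

mae = np.mean(np.abs(errors))   # mean of |error|
mse = np.mean(errors ** 2)      # mean of error squared
# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true_demo - y_true_demo.mean()) ** 2)

print(mae, mean_absolute_error(y_true_demo, y_pred_demo))
print(mse, mean_squared_error(y_true_demo, y_pred_demo))
print(r2, r2_score(y_true_demo, y_pred_demo))
```

A negative R² means the predictions fit worse than always predicting the mean of `y_true`.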

### **Clustering Metrics**

#### **Adjusted Rand Index**

In [25]:
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, y_pred)

0.0

#### **Homogeneity**

In [26]:
from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)

0.0

#### **V-measure**

In [27]:
from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)

0.0

### **Cross-Validation**

In [28]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X, y, cv=2))

[0.5 0.5 0.5 1. ]
[-1.20868759 -4.56166961]
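
For a classifier and an integer `cv`, `cross_val_score` splits the data with stratified folds, fits a fresh copy of the model on each training part, and scores it on the held-out part. The sketch below spells out that loop on the iris data set (used here as an assumption, since it is larger than the random data above) and compares it with the one-line call.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
clf = KNeighborsClassifier(n_neighbors=5)

# Manual version: one fit and one score per fold.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=4).split(X_iris, y_iris):
    fold_model = clone(clf).fit(X_iris[train_idx], y_iris[train_idx])
    manual_scores.append(fold_model.score(X_iris[test_idx], y_iris[test_idx]))

auto_scores = cross_val_score(clf, X_iris, y_iris, cv=4)
print(np.allclose(manual_scores, auto_scores))
```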




---
## **Tune Your Model**

### **Grid Search**

In [29]:
from sklearn.model_selection import GridSearchCV
params = {"n_neighbors": np.arange(1,3),"metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn,param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)

0.5714285714285714
1
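
After fitting, the search object also exposes the winning parameter combination and the full list of candidates it tried. A minimal sketch on the iris data set, with a hypothetical parameter grid chosen for this illustration:

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Hypothetical grid: 3 values x 2 metrics = 6 candidates.
param_grid_demo = {"n_neighbors": [1, 3, 5], "metric": ["euclidean", "manhattan"]}
grid_demo = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid_demo, cv=4)
grid_demo.fit(iris.data, iris.target)

n_candidates = len(grid_demo.cv_results_["params"])
print(grid_demo.best_params_)  # best combination found by cross-validation
print(n_candidates)            # every combination in the grid is evaluated
```

Because grid search evaluates every combination, its cost grows multiplicatively with each added parameter.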




### **Randomized Parameter Optimization**

In [30]:
from sklearn.model_selection import RandomizedSearchCV
params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)

0.8571428571428571




---
---
# **Tiny Example**

In [31]:
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)

0.631578947368421
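
The scaling and classification steps of the tiny example can also be chained into one estimator. The sketch below is one possible variant using `make_pipeline`; it is not part of the original notebook, and the variable names are chosen for this illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_iris, y_iris = iris.data[:, :2], iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=33)

# The pipeline fits the scaler on the training data only and reuses
# those statistics when predicting on the test data.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
pipeline_accuracy = model.score(X_te, y_te)
print(pipeline_accuracy)
```

Wrapping the steps this way also makes the whole workflow usable inside `cross_val_score` or `GridSearchCV` without leaking test-fold statistics into the scaler.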