# Scikit-Learn

- Scikit-Learn is a package that provides efficient versions of a large number of common ML algorithms.
- Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation.
- A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.


## Data Representation in Scikit-Learn

- The data in a dataset can be thought of as a two-dimensional numerical array, called the features matrix.
- By convention, this matrix is stored in a variable named X.
- The features matrix is assumed to be two-dimensional, with shape [n_samples, n_features], and is most often contained in a NumPy array or a Pandas DataFrame.
    - The samples (i.e., rows) always refer to the individual objects described by the dataset.
    - The features (i.e., columns) refer to the distinct observations that describe each sample in a quantitative manner.
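
As a minimal illustration of this layout, the sketch below builds a small features matrix by hand. The values and the column names are made up for the example and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 3 samples (rows) described by 2 features (columns).
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])
print(X.shape)  # (3, 2) -> (n_samples, n_features)

# The same matrix as a DataFrame; the column names are illustrative only.
X_df = pd.DataFrame(X, columns=["length_cm", "width_cm"])
print(X_df.shape)  # (3, 2)
```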

## Target array
- The target array, called y, is usually one-dimensional, with length n_samples, and is generally contained in a NumPy array or a Pandas Series.
- The target array may hold continuous numerical values or discrete classes/labels.
- The target array is usually the quantity we want to predict from the data.
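
The two kinds of target can be sketched as follows. Both arrays are hypothetical and only meant to show the expected shape, one value per sample.

```python
import numpy as np
import pandas as pd

# Hypothetical discrete target: one class label per sample (classification).
y_class = np.array([0, 0, 1])

# Hypothetical continuous target: one real value per sample (regression).
y_reg = pd.Series([1.4, 1.3, 4.7])

# Both are one-dimensional with length n_samples (3 here).
print(y_class.shape, y_reg.shape)  # (3,) (3,)
```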




## Scikit-Learn’s Estimator API

### Estimator objects
- An estimator is any object that learns from data.
    - It may be a classification, regression, or clustering algorithm, or a transformer that extracts or filters useful features from raw data.
- Fitting data: the main API implemented by scikit-learn is that of the estimator.
    - All estimator objects expose a <tt>fit</tt> method that takes a dataset (usually a 2-d array):
    
<tt> >>> estimator.fit(data)</tt>

- Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the corresponding attribute:

<tt> >>> estimator = Estimator(param1=1, param2=2) </tt>
- Estimated parameters: When an estimator is fitted to data, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object whose names end with an underscore:
    
    <tt> >>> estimator.estimated_param_ </tt>
    
    
- All supervised estimators in scikit-learn implement a <tt>fit(X, y)</tt> method to fit the model and a <tt>predict(X)</tt> method that, given unlabeled observations <tt>X</tt>, returns the predicted labels <tt>y</tt>.
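
The points above can be seen together in a short sketch. It assumes <tt>LinearRegression</tt> as the estimator, which is a choice made only for this illustration, and the toy data is made up so that the expected result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 exactly (made up for this sketch).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y_toy = 2.0 * X_toy[:, 0] + 1.0                 # shape (n_samples,)

# Estimator parameters are set at instantiation...
model = LinearRegression(fit_intercept=False)
# ...or changed afterwards, here through set_params.
model.set_params(fit_intercept=True)

# fit() estimates parameters from the data and returns the estimator itself.
model.fit(X_toy, y_toy)

# Estimated parameters end with an underscore.
print(model.coef_, model.intercept_)  # approximately [2.] and 1.0

# predict() returns the predicted targets for new observations.
y_pred = model.predict(np.array([[4.0]]))
print(y_pred)  # approximately [9.]
```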

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
import sklearn as sk

In [None]:
from sklearn import datasets

In [None]:
datasets.load_iris()

In [None]:
iris = datasets.load_iris()

In [None]:
iris.keys()

In [None]:
iris['target_names']

In [None]:
iris['feature_names']

In [None]:
X = iris.data

In [None]:
X

In [None]:
X.shape

In [None]:
type(X)

In [None]:
y = iris.target

In [None]:
y

In [None]:
y.shape

In [None]:
X[0,:]

In [None]:
y[0]

In [None]:
# Scatter plot of columns 1 and 2 of X, one color per class
Xnew = X[y == 0, :]
plt.scatter(Xnew[:, 1], Xnew[:, 2], c='r', label='class 0')

Xnew = X[y == 1, :]
plt.scatter(Xnew[:, 1], Xnew[:, 2], c='b', label='class 1')

Xnew = X[y == 2, :]
plt.scatter(Xnew[:, 1], Xnew[:, 2], c='g', label='class 2')

plt.title('iris dataset')
plt.xlabel('feature 2')
plt.ylabel('feature 3')
plt.legend(loc='best')

### KNeighborsClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier()

In [None]:
knn

In [None]:
knn.fit(X, y)

In [None]:
knn.predict(X)

In [None]:
y == knn.predict(X)

In [None]:
result = y == knn.predict(X)
result

In [None]:
result.sum()/len(result)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y, knn.predict(X))

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtc = DecisionTreeClassifier()

In [None]:
dtc.fit(X,y)

In [None]:
accuracy_score(y, dtc.predict(X))

### Train/test split

In [None]:
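# Naive split: train on the first 100 rows, test on the remaining 50.
# The iris samples are sorted by class, so these rows hold only classes 0 and 1.
knn.fit(X[:100], y[:100])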

In [None]:
knn.predict(X[:100])

In [None]:
accuracy_score(y[:100], knn.predict(X[:100]))

In [None]:
result = knn.predict(X[100:])

In [None]:
accuracy_score(y[100:], result)

In [None]:
result

In [None]:
y[100:]
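
The poor test accuracy of this split can be explained by looking at which labels each slice contains. The sketch below reloads the iris data so that it runs on its own; it is only a quick check, not part of the original workflow.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
y = iris.target

# The samples are stored sorted by class: 50 of each, in order.
print(np.unique(y[:100]))  # [0 1] -> the training slice never sees class 2
print(np.unique(y[100:]))  # [2]   -> the test slice contains only class 2
```

A classifier trained on the first slice cannot predict a class it has never seen, which is why the data should be shuffled before splitting.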

In [None]:
np.random.permutation(10)

In [None]:
indexes = np.random.permutation(len(X))

In [None]:
indexes

In [None]:
X_train = X[indexes[:100]]
y_train = y[indexes[:100]]

X_test = X[indexes[100:]]
y_test = y[indexes[100:]]

In [None]:
X_train

In [None]:
X[8]

In [None]:
knn.fit(X_train, y_train)

In [None]:
result = knn.predict(X_test)

In [None]:
accuracy_score(y_test,result)

In [None]:
accuracy_score(y_train,knn.predict(X_train))

In [None]:
knn = KNeighborsClassifier(n_neighbors=9, weights='distance')

In [None]:
knn.fit(X_train, y_train)

In [None]:
accuracy_score(y_test,knn.predict(X_test))

In [None]:
accuracy_score(y_train,knn.predict(X_train))

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [None]:
X_train.shape
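
Because <tt>train_test_split</tt> shuffles the data at random, the scores change from run to run. One possible variation, sketched below, uses the <tt>random_state</tt> and <tt>stratify</tt> parameters to make the split repeatable and to keep the class proportions similar in both subsets. The seed value 0 and the variable names <tt>X_tr</tt>, <tt>X_te</tt>, <tt>y_tr</tt>, <tt>y_te</tt> are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# random_state fixes the shuffle; stratify=y keeps class proportions similar
# in the training and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)  # (112, 4) (38, 4)
print(np.bincount(y_te))       # per-class counts in the test set
```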

In [None]:
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')

In [None]:
knn.fit(X_train, y_train)

In [None]:
result = knn.predict(X_test)

In [None]:
accuracy_score(y_test, result)

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
model = DecisionTreeClassifier()

In [None]:
model.fit(X_train, y_train)
result = model.predict(X_test)

In [None]:
result

In [None]:
accuracy_score(y_test, result)