
# Collection of scikit-learn commands for machine learning



###  Importing Modules, Classes and Functions:
scikit-learn (imported as `sklearn`) is organized into modules that contain classes and functions for machine learning algorithms, utilities, and tools. This modular structure lets you import only the components a task needs.

For example, you can import the **svm** module, which contains implementations of Support Vector Machine (SVM) algorithms. These are powerful supervised learning models used for classification and regression tasks.

From the svm module, you can import the **SVC** class, which stands for Support Vector Classification. It is a supervised learning algorithm that is widely used for *classification* tasks. It handles binary classification directly and multiclass classification through a one-vs-one scheme (a one-vs-rest strategy is available by wrapping it in `OneVsRestClassifier`).

An instance of the SVC class provides methods for training and prediction. You do not import these separately; you call them on the object. Some commonly used methods include:

*   **fit(X, y)**: Trains the SVC model on the input data X and target labels y.
*   **predict(X)**: Predicts the target labels for the input data X.
*   **decision_function(X)**: Returns the decision function values (signed scores relative to the separating boundaries) for the input data X.
*   **score(X, y)**: Computes the mean accuracy of the SVC model on the given test data and labels.



In [9]:
# Import the svm module from scikit-learn
from sklearn import svm

# Import SVC class from svm module
from sklearn.svm import SVC

# Create an instance of SVC
svc_classifier = SVC()

# Now, svc_classifier can be used to fit the model to data and make predictions
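
As a minimal sketch of the four methods listed above, the following cell fits an `SVC` on the built-in iris data and calls each method once. The variable names (`model`, `labels`, `margins`, `accuracy`) are illustrative choices, and scoring on the training data is done only to keep the example short; it is not a proper evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Load features and labels as two arrays
X, y = load_iris(return_X_y=True)

model = SVC()      # default settings (RBF kernel)
model.fit(X, y)    # train on the full dataset

labels = model.predict(X[:3])              # predicted class labels for 3 samples
margins = model.decision_function(X[:3])   # one score per class for each sample
accuracy = model.score(X, y)               # mean accuracy on the data passed in

print(labels)
print(margins.shape)
print(f"training accuracy: {accuracy:.3f}")
```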

###  Loading Example Datasets:
scikit-learn comes with a few standard datasets, for instance the **iris** and **digits** datasets for classification and the **diabetes** dataset for regression.


In [10]:
# Import the datasets module from scikit-learn
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()
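
The other datasets mentioned above load the same way. This short sketch loads digits and diabetes and prints the shape of each feature array.

```python
from sklearn import datasets

# Classification dataset: 8x8 images of handwritten digits, flattened to 64 features
digits = datasets.load_digits()

# Regression dataset: 10 baseline variables per patient
diabetes = datasets.load_diabetes()

print(digits.data.shape)    # (1797, 64)
print(diabetes.data.shape)  # (442, 10)
```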


### Loading External Datasets:

In Google Colab, you can upload files directly to the runtime environment:

1.  Click on the "Files" tab on the left sidebar in Google Colab.
2.  Click on the "Upload to session storage" icon.
3.  Select the file from your local system and upload it.

For example, after uploading 'payment_fraud.csv', you can read it into a DataFrame using `pd.read_csv('payment_fraud.csv')`.

In [28]:
# Use pandas to load CSV data
import pandas as pd
data = pd.read_csv('payment_fraud.csv')
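
Once a CSV is in a DataFrame, it usually needs to be split into a feature table and a target column before it can be passed to an estimator. The sketch below uses a tiny in-memory CSV so it runs without any upload. The column names (`amount`, `account_age_days`, `label`) are hypothetical and are not taken from 'payment_fraud.csv'.

```python
import io

import pandas as pd

# Hypothetical stand-in for an uploaded file; the columns are made up
csv_text = """amount,account_age_days,label
12.5,30,0
250.0,2,1
8.99,400,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Separate the features from the (assumed) target column
X = df.drop(columns=["label"])
y = df["label"]

print(X.shape, y.shape)
```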


### Accessing Dataset Attributes:
A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the `.data` member, which is an array of shape `(n_samples, n_features)`. In the case of supervised problems, one or more response variables are stored in the `.target` member.

In [16]:
print(iris.data)
print(iris.target)



[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.
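
Printing the full arrays is verbose. A more compact way to inspect a dataset is to look at its shapes and metadata, as in this short sketch using the same iris object.

```python
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)          # (150, 4): 150 samples, 4 features
print(iris.target.shape)        # (150,): one label per sample
print(iris.feature_names)       # names of the 4 feature columns
print(list(iris.target_names))  # ['setosa', 'versicolor', 'virginica']
```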

###  Learning and Predicting:
If given a prediction task, we can fit an [estimator](https://en.wikipedia.org/wiki/Estimator) to be able to predict the classes to which unseen samples belong.

The SVC class mentioned earlier is an estimator: it implements the methods `fit(X, y)` and `predict(X)`. The estimator's constructor takes the model's parameters as arguments.

If we consider the estimator as a black box, we can use the following code:


In [12]:
# Create an instance of the SVC classifier with specified hyperparameters
clf = svm.SVC(gamma=0.001, C=100.)


####  Choosing the parameters of the model:
In this example, we set the values of `gamma` and `C` manually. To find good values for these parameters, we can use tools such as grid search and cross-validation.
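
A grid search with cross-validation could look like the sketch below. The candidate values in `param_grid` and the choice of 5 folds are illustrative assumptions, not recommendations from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate values to try (assumed for illustration)
param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

# Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds an `SVC` refitted on all the data with the best combination found.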

The `clf` (for classifier) estimator instance is first fitted to the data; that is, it must learn from the data. This is done by passing our training set to the `fit` method. For the training set, we'll use all the samples from our dataset except the last one, which we'll reserve for prediction. We select the training set with the `[:-1]` Python syntax, which produces a new array that contains all but the last item from `iris.data`.

Now you can `predict` new values. In this case, you'll predict using the last sample from `iris.data`. By predicting, you'll determine the class (species) that the model assigns to that held-out sample.



In [20]:
# Train the classifier
clf.fit(iris.data[:-1], iris.target[:-1])

In [21]:
# Make predictions
clf.predict(iris.data[-1:])

array([2])

###  Type Casting:
Using `float32`-typed training (or testing) data is often more efficient than using the usual `float64` `dtype`: it reduces memory usage and sometimes also reduces processing time by leveraging the vector instructions of the CPU. However, it can sometimes lead to numerical stability problems, making the algorithm more sensitive to the scale of the values and requiring adequate preprocessing.

Note that not all scikit-learn estimators attempt to work in `float32` mode. For instance, some transformers will always cast their input to `float64` and return `float64` transformed values as a result.

Regression targets are cast to `float64` and classification targets are maintained:

In [18]:
# Train the classifier
clf.fit(iris.data, iris.target)
# Make predictions and print them
list(clf.predict(iris.data[:3]))

[0, 0, 0]

In [19]:
# Train the classifier using target names
clf.fit(iris.data, iris.target_names[iris.target])
# Make predictions and print them
list(clf.predict(iris.data[:3]))

['setosa', 'setosa', 'setosa']

Here, the first `predict()` returns an integer array, since `iris.target` (an integer array) was used in `fit`. The second `predict()` returns a string array, since `iris.target_names` was used for fitting.
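
The memory saving from `float32` and the `float64` output for regression can both be checked directly. This is a minimal sketch on random data; `LinearRegression` is an assumed example of a regressor and does not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X64 = rng.rand(100, 5)          # float64 by default
X32 = X64.astype(np.float32)    # same values at half the storage

print(X64.nbytes, X32.nbytes)   # 4000 2000

# Integer regression targets still produce float64 predictions
y_int = np.arange(100)
reg = LinearRegression().fit(X64, y_int)
pred = reg.predict(X64[:3])
print(pred.dtype)               # float64
```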

###  Refitting and updating parameters:
Hyper-parameters of an estimator can be updated after it has been constructed via the `set_params()` method. Calling `fit()` more than once will overwrite what was learned by any previous `fit()`:



In [23]:
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Change the kernel after construction, then fit and predict
clf = SVC()
clf.set_params(kernel='linear').fit(X, y)
clf.predict(X[:5])



array([0, 0, 0, 0, 0])

In [24]:
# Switch back to the default kernel and refit; this overwrites the previous fit
clf.set_params(kernel='rbf').fit(X, y)
clf.predict(X[:5])

array([0, 0, 0, 0, 0])

Here, the default kernel `rbf` is first changed to `linear` via `SVC.set_params()` after the estimator has been constructed, and then changed back to `rbf` to refit the estimator and make a second prediction.
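
Because the predictions for the first five samples are identical under both kernels, the output alone does not show that anything changed. One way to confirm it, sketched below, is to read the parameter back with `get_params()` and compare a fitted attribute such as `n_support_` (the number of support vectors per class) between the two fits. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC()
kernel_before = clf.get_params()["kernel"]   # default kernel

clf.set_params(kernel="linear").fit(X, y)
kernel_linear = clf.get_params()["kernel"]
linear_support = clf.n_support_.copy()       # support vectors per class, linear fit

clf.set_params(kernel="rbf").fit(X, y)       # refit replaces the earlier model
rbf_support = clf.n_support_.copy()          # support vectors per class, RBF fit

print(kernel_before, kernel_linear, clf.get_params()["kernel"])
print(linear_support, rbf_support)
```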

###  Multiclass vs. Multilabel fitting:
When using **multiclass classifiers**, the learning and prediction task that is performed is dependent on the format of the target data fit upon:

In [25]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]

classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X, y).predict(X)

array([0, 0, 1, 1, 2])

In the above case, the classifier is fit on a 1d array of multiclass labels and the `predict()` method therefore provides corresponding multiclass predictions. It is also possible to fit upon a 2d array of binary label indicators:

In [26]:
y = LabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)

array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])

Here, the classifier is `fit()` on a 2d binary label representation of `y`, produced by the **LabelBinarizer**. In this case `predict()` returns a 2d array representing the corresponding multilabel predictions.
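
To see what the classifier is trained on in this case, it helps to print the binarized target itself. The sketch below shows the indicator matrix that `LabelBinarizer` produces for the same `y`, and that `inverse_transform` maps it back to the original labels.

```python
from sklearn.preprocessing import LabelBinarizer

y = [0, 0, 1, 1, 2]

lb = LabelBinarizer()
Y = lb.fit_transform(y)   # one column per class, one row per sample

print(Y)
# [[1 0 0]
#  [1 0 0]
#  [0 1 0]
#  [0 1 0]
#  [0 0 1]]

print(lb.inverse_transform(Y))  # [0 0 1 1 2]
```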

Note that the fourth and fifth instances returned all zeroes, indicating that they matched none of the three labels `fit` upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:

In [27]:
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)

array([[1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0]])

In this case, the classifier is fit upon instances each assigned multiple labels. The **MultiLabelBinarizer** is used to binarize the 2d array of multilabels to `fit` upon. As a result, `predict()` returns a 2d array with multiple predicted labels for each instance.
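
The indicator rows can be turned back into label sets with `inverse_transform`. This sketch applies it to the binarized training labels; the same call works on the array returned by `predict()`. The variable names `labels`, `mlb`, and `decoded` are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)   # shape (5 samples, 5 distinct labels)

print(mlb.classes_)             # [0 1 2 3 4]
print(Y)

# Map each indicator row back to a tuple of labels
decoded = mlb.inverse_transform(Y)
print(decoded)                  # [(0, 1), (0, 2), (1, 3), (0, 2, 3), (2, 4)]
```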