# Machine Learning in Python

## Subjects
- Linear Classifiers in Python

## Linear Classifiers in Python

### Fitting and predicting

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the training and test sets is based on whether a message was posted before or after a specific date.

This module contains two loaders. The first one, `sklearn.datasets.fetch_20newsgroups`, returns a list of the raw texts that can be fed to text feature extractors such as `CountVectorizer` with custom parameters so as to extract feature vectors. The second one, `sklearn.datasets.fetch_20newsgroups_vectorized`, returns ready-to-use features, i.e., it is not necessary to use a feature extractor.
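To make the feature-extraction step concrete without downloading the dataset, the following minimal sketch runs `CountVectorizer` on three short strings. The strings in `raw_texts` are hypothetical stand-ins for raw posts, not data from 20 newsgroups.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for raw newsgroup posts (not taken from the dataset)
raw_texts = [
    "the rocket launch was delayed",
    "the hockey game was delayed",
    "rocket engines and space flight",
]

# Learn the vocabulary and turn each text into a row of word counts
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(raw_texts)  # sparse matrix, one row per text

print(X_counts.shape)                  # (number of texts, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the words that became feature columns
```

Each column corresponds to one word of the learned vocabulary, which is the kind of representation `fetch_20newsgroups_vectorized` returns ready-made.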

In [None]:
# Fitting and predicting
import sklearn.datasets

newsgroups = sklearn.datasets.fetch_20newsgroups_vectorized()

X, y = newsgroups.data, newsgroups.target

In [None]:
X.shape

In [None]:
y.shape # Article topics

#### k-Nearest Neighbors

In [None]:
SEED = 42
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = SEED)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
# Model evaluation

knn.score(X_test,y_test)
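For a classifier, `score` reports the mean accuracy, i.e. the fraction of test samples whose predicted label matches the true label. The sketch below checks this by hand. As an assumption made only to keep the example small and offline, it uses the bundled iris data instead of 20 newsgroups.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris replaces 20 newsgroups so that nothing has to be downloaded
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy computed by hand: share of predictions equal to the true labels
manual_accuracy = np.mean(y_pred == y_test)
print(manual_accuracy, knn.score(X_test, y_test))
```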

#### Logistic regression and Support vector machine (SVM)

In [2]:
# Logistic regression 
import sklearn.datasets
wine = sklearn.datasets.load_wine()
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = wine.data, wine.target


# Instantiate the classifier model 
lr = LogisticRegression()

# Fit the model (here on the full data set; no train/test split is made)
lr.fit(X, y)

# Predict labels for the same samples the model was fit on
lr.predict(X)

# Mean accuracy on the data used for fitting (an optimistic estimate)
lr.score(X, y)

lr.predict_proba(X[:1])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([[9.96581954e-01, 2.73942257e-03, 6.78623534e-04]])
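The warning in the output suggests either raising `max_iter` or scaling the data. One possible way to follow the second suggestion is sketched below, assuming a `StandardScaler` placed in front of the classifier via `make_pipeline`. The train/test split is also an addition, so that the score is measured on samples the model has not seen.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Assumption: standardize every feature before fitting the logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)  # accuracy on held-out data
proba = pipe.predict_proba(X_test[:1])      # class probabilities for one sample
print(test_accuracy)
print(proba, proba.sum())
```

`predict_proba` returns one probability per class, so each row sums to 1.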

In [3]:
# SVM 

wine = sklearn.datasets.load_wine()
from sklearn.svm import LinearSVC

svm = LinearSVC()

svm.fit(wine.data, wine.target)
svm.score(wine.data, wine.target)



0.8932584269662921

### Linear decision boundaries
#### Common definitions:
- **Classification:** learning to predict categories
- **Decision boundary:** the surface separating different predicted classes
- **Linear classifier:** a classifier that learns linear decision boundaries
- **Linearly separable:** a data set is linearly separable if it can be perfectly explained by a linear classifier
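The last definition can be illustrated with two toy data sets. Both are hypothetical and chosen for this sketch: two well-separated clusters, which a linear classifier fits perfectly, and an XOR-style layout, for which no straight line classifies all four points correctly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two clusters that a straight line can separate
X_sep = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_sep = np.array([0, 0, 0, 1, 1, 1])

# Hypothetical toy data: XOR layout, which is not linearly separable
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y_xor = np.array([0, 0, 1, 1])

acc_sep = LogisticRegression().fit(X_sep, y_sep).score(X_sep, y_sep)
acc_xor = LogisticRegression().fit(X_xor, y_xor).score(X_xor, y_xor)

print(acc_sep)  # training accuracy on the separable data
print(acc_xor)  # a linear boundary gets at most 3 of the 4 XOR points right
```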

In [5]:
# Fit several classifiers whose decision boundaries can then be compared

import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
wine = sklearn.datasets.load_wine()

X, y = wine.data, wine.target

# Define the classifiers
classifiers = [LogisticRegression(), LinearSVC(), SVC(), KNeighborsClassifier()]

# Fit the classifiers
for c in classifiers:
    c.fit(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
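A decision boundary can only be drawn in two dimensions, so one common approach is to predict the class at every point of a grid and color the resulting regions. The sketch below computes those regions. Keeping only the first two wine features and standardizing them are assumptions made for this illustration, and the grid resolution is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

wine = load_wine()
# Assumption: use only the first two features so the boundary lies in a plane
X2 = StandardScaler().fit_transform(wine.data[:, :2])
y = wine.target

# Regular grid covering the range of both features
xs = np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 100)
ys = np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 100)
xx, yy = np.meshgrid(xs, ys)
grid = np.c_[xx.ravel(), yy.ravel()]

classifiers = [LogisticRegression(), LinearSVC(), SVC(), KNeighborsClassifier()]

# Predicted class at every grid point, one 2-D array per classifier
regions = {}
for clf in classifiers:
    clf.fit(X2, y)
    regions[type(clf).__name__] = clf.predict(grid).reshape(xx.shape)

for name, Z in regions.items():
    print(name, Z.shape, np.unique(Z))
```

Each array in `regions` could then be passed, together with `xx` and `yy`, to a filled-contour plot such as matplotlib's `contourf`; the edges between colored regions are the decision boundaries.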


### Linear classifiers: prediction equations
#### Linear classifier prediction 
- raw model output = coefficients · features + intercept, where "·" is the dot product of the coefficient vector and the feature vector
- Linear classifier prediction: compute raw model output and check the **sign**
    - if **positive**, predict one class
    - if **negative**, predict the other class
- This is the same for logistic regression and linear SVM
    - the `fit()` is different but the `predict()` is the same
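The prediction rule above can be checked against scikit-learn directly. The sketch below assumes a synthetic binary problem generated with `make_classification`, recomputes the raw model output from `coef_` and `intercept_`, and compares the sign-based prediction with `predict()` for both model types.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Assumption: a synthetic two-class data set, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

results = {}
for model in (LogisticRegression(), LinearSVC()):
    model.fit(X, y)

    # raw model output = coefficients . features + intercept
    raw = X @ model.coef_.ravel() + model.intercept_[0]

    # Positive raw output -> second class, otherwise -> first class
    manual_pred = model.classes_[(raw > 0).astype(int)]

    results[type(model).__name__] = (
        np.allclose(raw, model.decision_function(X)),
        np.array_equal(manual_pred, model.predict(X)),
    )

print(results)
```

The two models learn different coefficients because their `fit()` methods minimize different losses, yet the same equation turns those coefficients into predictions.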

#### Dot Products

In [14]:
import numpy as np
x = np.arange(3)
y = np.arange(3,6)
print(x, y)
print(x*y)
print(np.sum(x*y))
print(x@y) # produces the same result as np.sum(x*y) x@y is called the dot product of `x` and `y`.

[0 1 2] [3 4 5]
[ 0  4 10]
14
14


#### Loss Functions
- scikit-learn `LinearRegression` minimizes a loss:
***
$\sum \limits_{i=1}^{n} \left(\text{actual } i^{\text{th}} \text{ target value} - \text{predicted } i^{\text{th}} \text{ target value}\right)^2$
***
- Minimization is with respect to coefficients or parameters of the model
- Squared loss is not appropriate for classification problems
- A natural loss for classification problems is the number of errors, the **0-1 loss**:
    - 0 for a correct prediction 
    - 1 for an incorrect prediction
- A loss can be minimized numerically using `scipy.optimize.minimize`:
```python
from scipy.optimize import minimize
```


In [16]:
# Example
from scipy.optimize import minimize

print(minimize(np.square, 0).x)
print(minimize(np.square,2).x)

[0.]
[-1.88846401e-08]
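The same `minimize` call can be applied to the squared loss itself. The following minimal sketch assumes small synthetic data with no intercept term (both choices are made only for this illustration) and compares the minimizer with the coefficients found by `LinearRegression`.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression

# Assumption: synthetic regression data, generated without an intercept
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)

def squared_loss(w):
    """Sum of squared differences between actual and predicted targets."""
    return np.sum((y - X @ w) ** 2)

# Minimize the loss with respect to the coefficients, starting from zeros
w_min = minimize(squared_loss, np.zeros(2)).x

# Coefficients from scikit-learn for the same no-intercept model
w_sklearn = LinearRegression(fit_intercept=False).fit(X, y).coef_

print(w_min)
print(w_sklearn)
```

Both approaches minimize the same loss, so the two coefficient vectors should agree up to the tolerance of the numerical optimizer.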


In [31]:
SEED = 42
import sklearn.datasets

digits = sklearn.datasets.load_digits()

X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                   stratify = y,
                                                   random_state = SEED)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter = 2000)
lr.fit(X_train, y_train)
pred_y = lr.predict(X_test)



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
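The predictions from this cell can be scored with the 0-1 loss described earlier. The sketch below repeats the fit so that it is self-contained and uses `sklearn.metrics.zero_one_loss`, which by default returns the fraction of misclassified samples; computing this metric is an addition for illustration and not part of the original cell.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

lr = LogisticRegression(max_iter=2000)
lr.fit(X_train, y_train)
pred_y = lr.predict(X_test)

# 0-1 loss: 1 for every incorrect prediction, 0 for every correct one
n_errors = np.sum(pred_y != y_test)
error_rate = zero_one_loss(y_test, pred_y)  # fraction of errors by default

print(n_errors, error_rate, 1 - lr.score(X_test, y_test))
```

The error rate is one minus the accuracy reported by `score`, so the two metrics carry the same information.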
