# WHAT IS CLASSIFICATION
> Classification is a __data analysis__ task in which a model (a classifier) is built to predict class labels (categories).

## Motivation
- Prediction
    - Is a bank loan application __“safe”__ or __“risky”__?
    - Which treatment is better for a patient, __“treatmentX”__ or __“treatmentY”__?

## Steps
- Learning (training) step
- Classification step

### Learning Step: construct the classification model
- Build a classifier for a predetermined set of classes
- Learn from a training dataset (data tuples + their associated class labels) → Supervised Learning

### Classification Step
- The model is used to predict class labels for given data (the test set)
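
The two steps map onto the `fit` and `predict` calls of a scikit-learn estimator. Below is a minimal sketch, assuming a made-up loan dataset: the feature values, the labels, and the choice of `DecisionTreeClassifier` are illustrative and do not come from this notebook.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training tuples: [income_in_thousands, years_employed]
X_train = [[20, 1], [25, 2], [30, 1], [80, 10], [90, 12], [100, 15]]
# Associated class labels (this is what makes the learning supervised)
y_train = ["risky", "risky", "risky", "safe", "safe", "safe"]

# Learning step: build the classifier from the labeled tuples
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Classification step: predict labels for tuples the model has not seen
X_test = [[22, 1], [95, 11]]
predictions = clf.predict(X_test)
print(predictions)
```

The test tuples here are deliberately far from the class boundary, so any reasonable classifier should label the first one `risky` and the second one `safe`.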


## Let's Code Something

In [1]:

from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.dummy import DummyClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

### Let's Get the Iris Dataset from My GitHub Source

In [3]:
url = 'https://media.githubusercontent.com/media/AhmedKhalil777/DataScience.Learning/master/Datasets/Iris.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
data[:5]

array([['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm',
        'PetalWidthCm', 'Species'],
       ['1', '5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
       ['2', '4.9', '3.0', '1.4', '0.2', 'Iris-setosa'],
       ['3', '4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
       ['4', '4.6', '3.1', '1.5', '0.2', 'Iris-setosa']], dtype=object)

- What we need is a rule, or model, that predicts `Species` from the other columns
- Let's split the data into the columns without `Species` and the `Species` column on its own

In [30]:
X, Y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,Y.shape))

Shape: (151, 5), (151,)


In [23]:
print(f'Without Species\n{X[:5]} \n\n With Species \n{Y[:5]}')


Without Species
[['Id' 'SepalLengthCm' 'SepalWidthCm' 'PetalLengthCm' 'PetalWidthCm']
 ['1' '5.1' '3.5' '1.4' '0.2']
 ['2' '4.9' '3.0' '1.4' '0.2']
 ['3' '4.7' '3.2' '1.3' '0.2']
 ['4' '4.6' '3.1' '1.5' '0.2']] 

 With Species 
['Species' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa']


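- We need to prepare the dataset: both arrays hold `str` values, because `header=None` made pandas read the header row as data. So we drop that first row, cast the features to `float64`, and encode the species names as integers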

In [31]:
X = X[1:].astype('float64')
Y = LabelEncoder().fit_transform(Y[1:].astype('str'))
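
The cast above is needed because of how the file was read. The following sketch contrasts `header=None` with `header=0` on a few rows copied from the output above; the in-memory `csv_text` is an assumption standing in for the full CSV file.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import is_numeric_dtype

# A few rows copied from the output above, standing in for the full file
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# header=None: the header row becomes data, so every column holds strings
as_data = pd.read_csv(StringIO(csv_text), header=None)
# header=0: the first row names the columns and numeric types are inferred
with_header = pd.read_csv(StringIO(csv_text), header=0)

print(is_numeric_dtype(as_data[1]))                    # False
print(is_numeric_dtype(with_header["SepalLengthCm"]))  # True
print(as_data.shape, with_header.shape)                # (4, 6) (3, 6)
```

Reading with `header=0` would avoid both the `X[1:]` slice and the `astype('float64')` cast, at the cost of changing the cell outputs shown in this notebook.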

In [26]:
X[:5]

array([[1. , 5.1, 3.5, 1.4, 0.2],
       [2. , 4.9, 3. , 1.4, 0.2],
       [3. , 4.7, 3.2, 1.3, 0.2],
       [4. , 4.6, 3.1, 1.5, 0.2],
       [5. , 5. , 3.6, 1.4, 0.2]])

In [33]:
print(Y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
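
To see how the integers in `Y` relate to the species names, here is a small sketch of `LabelEncoder` on a short, hand-written list (the list itself is illustrative). The encoder stores the class names in sorted order and uses each name's position as its code.

```python
from sklearn.preprocessing import LabelEncoder

species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Keep a reference to the encoder so the codes can be mapped back later
encoder = LabelEncoder()
codes = encoder.fit_transform(species)

print(list(encoder.classes_))   # class names in sorted order
print(codes)                    # [2 0 1 0]
print(list(encoder.inverse_transform(codes)))
```

The cell above calls `LabelEncoder().fit_transform(...)` without keeping the encoder, which is fine for scoring but means the codes cannot be converted back to names afterwards.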


# Evaluate a Naive Baseline

> DummyClassifier is a classifier that makes predictions using simple rules.

- This classifier is useful as a simple baseline to compare with other (real) classifiers. Do not use it for real problems.
- It has 3 parameters: `strategy`, `random_state`, and `constant`
    - strategy: `str`, default=“prior” since scikit-learn 0.24 (“stratified” in earlier versions)
        - “stratified”: generates random predictions that follow the training set’s class distribution.
        - “most_frequent”: always predicts the most frequent label in the training set.
        - “prior”: always predicts the class that maximizes the class prior (like “most_frequent”), and `predict_proba` returns the class prior.
        - “uniform”: generates predictions uniformly at random.
        - “constant”: always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.
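
The strategies listed above can be compared on a tiny, made-up label vector. This is a sketch only: `X_toy` and `y_toy` are illustrative, and the features are all zeros because a dummy classifier never looks at them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: three samples of class 0, two of class 1, one of class 2
X_toy = np.zeros((6, 1))
y_toy = np.array([0, 0, 0, 1, 1, 2])

most_frequent = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
print(most_frequent.predict(X_toy))      # always the majority class 0

prior = DummyClassifier(strategy="prior").fit(X_toy, y_toy)
print(prior.predict_proba(X_toy[:1]))    # class frequencies 3/6, 2/6, 1/6

constant = DummyClassifier(strategy="constant", constant=2).fit(X_toy, y_toy)
print(constant.predict(X_toy))           # always the user-supplied label 2
```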

## We Use the `most_frequent` Strategy

In [34]:
naive = DummyClassifier(strategy='most_frequent')

In [37]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [40]:
n_scores = cross_val_score(naive, X, Y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Baseline: 0.333 (0.000)


array([0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333,
       0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333,
       0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333,
       0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333,
       0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333,
       0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333])
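
The baseline is exactly 0.333 on all 30 folds because of how the folds are built. The sketch below uses stand-in arrays (`X_demo`, `y_demo`, assuming the usual 50 samples per species) to show that `RepeatedStratifiedKFold` yields 10 × 3 = 30 splits and that every test fold holds 5 samples of each class. A classifier that always predicts one class therefore scores 5/15 on every fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Stand-in for the encoded labels: 50 samples of each of 3 classes
y_demo = np.repeat([0, 1, 2], 50)
X_demo = np.zeros((150, 1))  # features do not affect how folds are built

cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
print(cv_demo.get_n_splits())  # 10 folds x 3 repeats = 30

# Count how many samples of each class land in every test fold
fold_counts = [np.bincount(y_demo[test_idx], minlength=3)
               for _, test_idx in cv_demo.split(X_demo, y_demo)]
print(fold_counts[0])  # class counts in the first test fold
```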

In [42]:
model = LinearDiscriminantAnalysis()
steps = [('p',PowerTransformer()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, Y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Good: 1.000 (0.000)


array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
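
A perfect score on every fold deserves a second look. `X` still contains the `Id` column (the first column in the `X[:5]` output), and the printed `Y` shows the rows ordered by species, so `Id` alone nearly determines the label. The 1.000 above is therefore plausibly inflated by that column. The sketch below reruns the same pipeline on the four measurements only. As an assumption, it uses scikit-learn's bundled Iris data in place of the CSV so that it runs without a download; the names `X_meas`, `y_meas`, `pipeline_no_id`, and `cv_no_id` are illustrative.

```python
from numpy import mean, std
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

# Assumption: the bundled Iris data stands in for the CSV;
# it holds only the four measurements, with no Id column
X_meas, y_meas = load_iris(return_X_y=True)

# Same pipeline and cross-validation settings as in the cell above
pipeline_no_id = Pipeline(steps=[("p", PowerTransformer()),
                                 ("m", LinearDiscriminantAnalysis())])
cv_no_id = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores_no_id = cross_val_score(pipeline_no_id, X_meas, y_meas,
                               scoring="accuracy", cv=cv_no_id)
print("Without Id: %.3f (%.3f)" % (mean(scores_no_id), std(scores_no_id)))
```

With the arrays from this notebook, the equivalent change would be to pass `X[:, 1:]` instead of `X` to `cross_val_score`.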