## Classification
Classification assigns each example to one of a given set of categories.

### Examples of Classification Problems
* Text Categorization (e.g. Spam Filtering)
* Classification of Apples and Oranges
* Fraud Detection
* Face Detection
* Optical Character Recognition
* Natural Language Processing

### Packages
The main packages used in this project are:
* scikit-learn (`sklearn`, for the classifiers)
* NumPy
* matplotlib (for plotting)

In [202]:
# Import all the libraries here
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

### 1. Classifiers
The classifiers used for this problem are:
   * Decision Tree Classifier
   * KNeighbors Classifier
   * Gaussian Process Classifier
   * Random Forest Classifier
   * Ada-Boost Classifier

Initialize all the classifiers:

In [203]:
decisionClf = DecisionTreeClassifier()
knnClf = KNeighborsClassifier()
gpcClf = GaussianProcessClassifier()
# bootstrap=True is already the default; it is set explicitly here for clarity
rpcClf = RandomForestClassifier(bootstrap=True)
adaBoostClf = AdaBoostClassifier()

### 2. Define the training and test data


In [204]:
# [height, weight, shoe_size]
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39],
     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]

Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
     'female', 'male', 'male']

# Test data: [height, weight, shoe_size]
test_X = [[179, 90, 44], [190, 88, 44], [165, 55, 37], [160, 60, 39], [156, 56, 36], [181, 85, 43], [174, 66, 40],
     [177, 70, 43], [159, 66, 47], [188, 100, 44], [179, 84, 47]]

test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female', 'female', 'male', 'male']
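
To get a first look at the training data, the sketch below plots height against weight and marks each class with its own marker. It is only an illustration: the choice of these two features, the markers, and the axis labels are assumptions, not something the original notebook specifies.

```python
import matplotlib.pyplot as plt
import numpy as np

# Training data copied from the cell above: [height, weight, shoe_size]
X = np.array([[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37],
              [166, 65, 40], [190, 90, 47], [175, 64, 39], [177, 70, 40],
              [159, 55, 37], [171, 75, 42], [181, 85, 43]])
Y = np.array(['male', 'male', 'female', 'female', 'male', 'male', 'female',
              'female', 'female', 'male', 'male'])

fig, ax = plt.subplots()
# One scatter call per class so that each class gets its own legend entry
for label, marker in [('male', 'o'), ('female', '^')]:
    mask = Y == label
    ax.scatter(X[mask, 0], X[mask, 1], marker=marker, label=label)

ax.set_xlabel('height')
ax.set_ylabel('weight')
ax.set_title('Training data')
ax.legend()
```

In a notebook the figure renders at the end of the cell; in a plain script, add `plt.show()` or `fig.savefig(...)`.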

#### a) Decision Tree Classifier

In [205]:
decisionClf = decisionClf.fit(X, Y)
prediction = decisionClf.predict(test_X)

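# Both lines report mean accuracy (1.0 is perfect): for a classifier, score()
# returns accuracy, not explained variance, so the two printed values match.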
print('Decision Tree Classifier')
print('Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % decisionClf.score(test_X, test_Y))

Decision Tree Classifier
Score:  0.64 
Variance score: 0.64
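
The two printed numbers are identical because, for a scikit-learn classifier, `score()` returns the mean accuracy, which is the same quantity `accuracy_score` computes. The sketch below checks this against a hand-computed fraction of correct predictions. The fixed `random_state=0` is an assumption added for repeatability, so the value it prints need not equal the 0.64 recorded above.

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Data copied from the notebook: [height, weight, shoe_size]
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39],
     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]
Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
     'female', 'male', 'male']
test_X = [[179, 90, 44], [190, 88, 44], [165, 55, 37], [160, 60, 39],
          [156, 56, 36], [181, 85, 43], [174, 66, 40],
          [177, 70, 43], [159, 66, 47], [188, 100, 44], [179, 84, 47]]
test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female',
          'female', 'female', 'male', 'male']

clf = DecisionTreeClassifier(random_state=0).fit(X, Y)
prediction = clf.predict(test_X)

# Accuracy by hand: the fraction of test examples predicted correctly
correct = sum(p == t for p, t in zip(prediction, test_Y))
manual = correct / len(test_Y)

via_metric = accuracy_score(test_Y, prediction)
via_score = clf.score(test_X, test_Y)

print('%d of %d correct' % (correct, len(test_Y)))
print('%.2f %.2f %.2f' % (manual, via_metric, via_score))
```

With 11 test examples, accuracy can only take values that are multiples of 1/11, which is why the scores in this notebook are numbers such as 0.64 (7/11), 0.73 (8/11), and 0.82 (9/11).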


#### b) KNeighbors Classifier

In [206]:
knnClf = knnClf.fit(X, Y)
prediction = knnClf.predict(test_X)

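# Both lines report mean accuracy (1.0 is perfect): for a classifier, score()
# returns accuracy, not explained variance, so the two printed values match.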
print('KNeighbors Classifier')
print('Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % knnClf.score(test_X, test_Y))

KNeighbors Classifier
Score:  0.73 
Variance score: 0.73


#### c) Gaussian Process Classifier

In [207]:
gpcClf = gpcClf.fit(X, Y)
prediction = gpcClf.predict(test_X)

print('Gaussian Process Classifier')
# Both lines report mean accuracy (1.0 is perfect): for a classifier, score()
# returns accuracy, not explained variance, so the two printed values match.
print('Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % gpcClf.score(test_X, test_Y))

Gaussian Process Classifier
Score:  0.73 
Variance score: 0.73


#### d) Random Forest Classifier

In [212]:
rpcClf = rpcClf.fit(X, Y)
prediction = rpcClf.predict(test_X)

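# Both lines report mean accuracy (1.0 is perfect): for a classifier, score()
# returns accuracy, not explained variance, so the two printed values match.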
print('Random Forest Classifier')
print( 'Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % rpcClf.score(test_X, test_Y))


Random Forest Classifier
Score:  0.82 
Variance score: 0.82


#### e) Ada-Boost Classifier

In [209]:
adaBoostClf = adaBoostClf.fit(X, Y)
prediction = adaBoostClf.predict(test_X)

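# Both lines report mean accuracy (1.0 is perfect): for a classifier, score()
# returns accuracy, not explained variance, so the two printed values match.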
print('Ada-Boost Classifier')
print( 'Score:  %.2f ' % accuracy_score(test_Y, prediction))
print('Variance score: %.2f' % adaBoostClf.score(test_X, test_Y))

Ada-Boost Classifier
Score:  0.73 
Variance score: 0.73
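
The five cells above repeat the same fit, predict, and score steps. As a compact alternative, the sketch below keeps the classifiers in a dictionary and loops over them. The dictionary keys and the fixed `random_state=0` values are assumptions added here for readability and repeatability, so the printed scores may differ from the ones recorded in this notebook.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Data copied from the notebook: [height, weight, shoe_size]
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39],
     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]
Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
     'female', 'male', 'male']
test_X = [[179, 90, 44], [190, 88, 44], [165, 55, 37], [160, 60, 39],
          [156, 56, 36], [181, 85, 43], [174, 66, 40],
          [177, 70, 43], [159, 66, 47], [188, 100, 44], [179, 84, 47]]
test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female',
          'female', 'female', 'male', 'male']

classifiers = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'KNeighbors': KNeighborsClassifier(),
    'Gaussian Process': GaussianProcessClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(bootstrap=True, random_state=0),
    'Ada-Boost': AdaBoostClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X, Y)
    scores[name] = accuracy_score(test_Y, clf.predict(test_X))
    print('%-18s %.2f' % (name, scores[name]))
```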


In [210]:
#not used, just for testing purposes 
maleIndex = [item for item in range(len(prediction)) if prediction[item] == 'male']
femaleIndex = [x for x in range(len(prediction)) if prediction[x] == "female"]

prediction[maleIndex] = 1
prediction[femaleIndex] = 0

print(prediction )

['1' '1' '0' '0' '0' '1' '0' '1' '1' '1' '1']
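
The cell above writes `1` and `0` back into a string array, so the result holds the strings `'1'` and `'0'`. A possible alternative, sketched below, uses `np.where` to build an integer array without modifying the predictions. The `prediction` labels here are decoded from the printed output above (1 as 'male', 0 as 'female'); comparing them with `test_Y` gives 8 matches out of 11, which agrees with the 0.73 reported for the Ada-Boost cell that ran just before.

```python
import numpy as np

# Labels decoded from the printed array above: '1' -> 'male', '0' -> 'female'
prediction = np.array(['male', 'male', 'female', 'female', 'female', 'male',
                       'female', 'male', 'male', 'male', 'male'])
# Test labels copied from the notebook
test_Y = np.array(['male', 'male', 'female', 'female', 'male', 'male',
                   'female', 'female', 'female', 'male', 'male'])

# Vectorized encoding: 1 where the label is 'male', 0 otherwise
encoded = np.where(prediction == 'male', 1, 0)
print(encoded)            # [1 1 0 0 0 1 0 1 1 1 1]

# Accuracy is the fraction of positions where prediction and truth agree
accuracy = np.mean(prediction == test_Y)
print('%.2f' % accuracy)  # 0.73
```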


### Results
**Random Forest Classifier**
```
Score:  0.82 
Variance score: 0.82
```

**Ada-Boost Classifier**
```
Score:  0.73 
Variance score: 0.73
```

**Gaussian Process Classifier**
```
Score:  0.73 
Variance score: 0.73
```

**Decision Tree Classifier**
```
Score:  0.64 
Variance score: 0.64
```

**KNeighbors Classifier**
```
Score:  0.73 
Variance score: 0.73
```

### Summary
The Random Forest Classifier scored best at 0.82, but its score changes from run to run, and its worst run scored 0.63. A random forest combines the predictions of many decision trees, each trained on a bootstrap sample of the data, in order to improve predictive accuracy and control over-fitting. Because that sampling is random and the dataset here is very small, each run can produce a different score.
The other classifiers score close to one another (0.64 for the Decision Tree, 0.73 for the rest), so any of them would do for this example. To improve the results, increase the amount of data: the data here was entered by hand, and a larger dataset would let these algorithms generalize their learned parameters better.
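
The run-to-run variation of the random forest can be made visible by refitting it with different seeds. The sketch below is one way to do that; the choice of ten seeds and the use of `random_state` are assumptions for illustration, and the spread it prints depends on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# Data copied from the notebook: [height, weight, shoe_size]
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39],
     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]
Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
     'female', 'male', 'male']
test_X = [[179, 90, 44], [190, 88, 44], [165, 55, 37], [160, 60, 39],
          [156, 56, 36], [181, 85, 43], [174, 66, 40],
          [177, 70, 43], [159, 66, 47], [188, 100, 44], [179, 84, 47]]
test_Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female',
          'female', 'female', 'male', 'male']

# Refit the same model with ten different seeds and record each test accuracy
scores = []
for seed in range(10):
    clf = RandomForestClassifier(bootstrap=True, random_state=seed).fit(X, Y)
    scores.append(clf.score(test_X, test_Y))

print('min %.2f  max %.2f' % (min(scores), max(scores)))
```

Fixing `random_state` to a single value makes one run repeatable, but it does not remove the underlying sensitivity to such a small dataset.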


## References
* [Statistical classification](https://en.wikipedia.org/wiki/Statistical_classification)
* [Machine Learning Algorithms for Classification](http://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf)
* [Introduction - Learn Python for Data Science #1](https://www.youtube.com/watch?v=T5pRlIbr6gg&index=1&list=PL2-dafEMk2A6QKz1mrk1uIGfHkC1zZ6UU)
* [Random Forest](https://en.wikipedia.org/wiki/Random_forest)