# Basic classification project

## Content   

&nbsp;&nbsp;&nbsp;**1. Aims and objectives**   
&nbsp;&nbsp;&nbsp;**2. Literature review**   
&nbsp;&nbsp;&nbsp;**3. Method**   
&nbsp;&nbsp;&nbsp;**4. Result**   
&nbsp;&nbsp;&nbsp;**5. Discussion and conclusion**   
&nbsp;&nbsp;&nbsp;**6. Reference**



## 1. Aims and objectives

This project aims to grasp the fundamentals of the scikit-learn library: dataset analysis, classification, and evaluation of classification results. Scikit-learn provides a small set of standard datasets, for example the diabetes dataset for regression and the iris dataset for classification. In this project, three classification datasets (handwritten digits, wine, and breast cancer) were adopted and analysed, and the models trained on them were evaluated accordingly.

## 2. Literature review
### Scikit-learn   
   
Scikit-learn is a free machine learning library for the Python programming language. It includes commonly used classification, regression, and clustering algorithms such as Support Vector Machine (SVM), decision tree, random forest, and K-means. The library is designed to interoperate with NumPy and SciPy and supports both supervised and unsupervised learning (see [Scikit-learn](https://scikit-learn.org/stable/)).

### Load and analyse datasets   

Three classification datasets (handwritten digits, wine, and breast cancer) were loaded.

In [2]:
from sklearn.datasets import load_wine
from sklearn.datasets import load_digits
from sklearn.datasets import load_breast_cancer

This report focuses on wine classification. For the full code of the three exercises, see [wine](https://github.com/hweejuni/The-very-first-repository/blob/master/Classifier_Project/Wine_Classification.ipynb), [handwritten digits](https://github.com/hweejuni/The-very-first-repository/blob/master/Classifier_Project/Handwritten_Letter_Classifier.ipynb), and [breast cancer](https://github.com/hweejuni/The-very-first-repository/blob/master/Classifier_Project/Breast_Cancer_Classifier.ipynb).   

After loading the wine dataset, the samples and labels (targets) were assigned to 'wines_data' and 'wines_label' respectively.

In [3]:
wines = load_wine()

wines_data = wines.data # assign the data
wines_label = wines.target # assign the label


print('Number of samples: ', len(wines.data), '\nFeature names: ', wines.feature_names, '\nLabel names: ', wines.target_names)
#print(wines.DESCR)

Number of samples:  178 
Feature names:  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline'] 
Label names:  ['class_0' 'class_1' 'class_2']


As can be observed above, the wine dataset has 178 samples, 13 features, and 3 classes. Understanding how the dataset is composed is essential before training a model on it.
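
Beyond the totals, it can help to check how the samples are distributed over the classes. The following is a minimal sketch of such a check; the variable names `n_samples`, `n_features`, and `class_counts` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_wine

wines = load_wine()

# Shape of the feature matrix: (number of samples, number of features)
n_samples, n_features = wines.data.shape

# Count how many samples carry each integer class label (0, 1, 2)
class_counts = np.bincount(wines.target)

print("Samples:", n_samples, "Features:", n_features)
for name, count in zip(wines.target_names, class_counts):
    print(name, count)
```

If the per-class counts differ noticeably, accuracy alone may be a misleading summary of model performance.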

### Dataset partition   
   
The dataset needs to be divided into a training set and a test set. The split can be done with the 'train_test_split' function; here 20 percent of the dataset was held out as the test set. The samples must also be partitioned randomly, since unshuffled subsets could introduce unwanted bias into the model ('train_test_split' shuffles the data by default).

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(wines_data, 
                                                   wines_label,
                                                   test_size = 0.2,
                                                   random_state = 7)
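
As a quick sanity check, the sizes of the resulting subsets can be inspected. The sketch below repeats the split self-contained and also shows a stratified variant; the `stratify` argument is not used in the original notebook and is included only as an assumption about what one might try when class proportions should be preserved.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wines = load_wine()

# Same settings as in the report: 20% test set, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    wines.data, wines.target, test_size=0.2, random_state=7
)
print(X_train.shape, X_test.shape)

# Optional variant (not part of the original project): keep the class
# proportions roughly equal in both subsets by stratifying on the labels
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    wines.data, wines.target, test_size=0.2, random_state=7, stratify=wines.target
)
print(X_train_s.shape, X_test_s.shape)
```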

### Evaluation metrics
   
Evaluating the model is of paramount importance in classification. Above all, an optimized classifier is obtained by selecting a metric that is appropriate for the problem [1].   
##### i) Accuracy   
Accuracy is one of the most commonly used metrics among researchers.  

$${Accuracy = \frac{Number\,of\,correct\, predictions}{Total\, number\, of\, predictions\, made}}$$  

However, using accuracy as the main evaluation metric comes with a number of limitations. In particular, [2] and [3] demonstrate that accuracy handles imbalanced datasets poorly.   
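
The limitation can be illustrated with a small, entirely hypothetical example: on a dataset with 95 negative and 5 positive samples, a "classifier" that always predicts the majority class still reaches high accuracy while finding none of the positives. The labels below are made up for illustration and are unrelated to the datasets used in this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "classifier" that always predicts the majority class (0)
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)

# Number of positive samples that were actually detected
positives_found = int(np.sum((y_true == 1) & (y_pred == 1)))

print("Accuracy:", accuracy)
print("Positives found:", positives_found, "of", int(np.sum(y_true == 1)))
```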
   
##### ii) Confusion matrix   
A confusion matrix is a table that summarises the performance of a classification algorithm. In particular, it shows where the model was confused in its predictions.   
![Confusion matrix](./r4.jpg)   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *Figure 1: Confusion Matrix*   
    
The four outcomes in Figure 1 are the basic quantities used when evaluating performance.
- True Positive (TP): The number of correct predictions in which the model predicted a positive sample as positive.   
- False Negative (FN): The number of incorrect predictions in which the model predicted a positive sample as negative.   
- False Positive (FP): The number of incorrect predictions in which the model predicted a negative sample as positive.   
- True Negative (TN): The number of correct predictions in which the model predicted a negative sample as negative.   
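
The four counts can be read directly from scikit-learn's `confusion_matrix`. The sketch below uses a small set of hypothetical binary labels (not taken from the project's datasets). Note that scikit-learn places actual classes in rows and predicted classes in columns, sorted by label, which may differ from the layout drawn in Figure 1.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For labels sorted as [0, 1], the flattened 2x2 matrix is TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```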

These four counts are combined to compute further important indicators.   
- Recall (Sensitivity): Recall is the ratio of TP to all actual positive samples.   
   
$${Recall = \frac{TP}{TP+FN}}$$   

- Specificity: Specificity is a ratio of 'TN' to 'actual' negative samples.   

$${Specificity = \frac{TN}{FP+TN}}$$   

- Precision: Precision is the ratio of TP to all samples that the model labelled (predicted) as positive.   

$${Precision = \frac{TP}{TP+FP}}$$   

- F1 score: The F1 score is the harmonic mean of Precision and Recall and therefore reflects the balance between the two. It ranges between 0 and 1, where 0 indicates total failure and 1 a perfect result. A high F1 score means the model produced few false positives and few false negatives.

$${F1\, score = 2*\frac{Precision*Recall}{Precision+Recall}}$$   
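
As a worked example of the formulas above, the sketch below computes each indicator by hand from hypothetical binary labels and compares the results with scikit-learn's built-in functions. The labels are made up for illustration; scikit-learn has no dedicated specificity function, so that value is only computed manually here.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical binary labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the four confusion-matrix outcomes by hand
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

# Apply the formulas from the text
recall = tp / (tp + fn)
specificity = tn / (fp + tn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print("Recall:", recall, "Specificity:", specificity)
print("Precision:", precision, "F1:", f1)

# Cross-check the hand-computed values against scikit-learn
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred), f1_score(y_true, y_pred))
```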


## 3. Method   
### Training the model   
   
Five prominent classification algorithms were trained: decision tree, random forest, Support Vector Machine, Stochastic Gradient Descent, and logistic regression. Each classifier is imported from a different scikit-learn module and instantiated as shown below.

##### i) Decision tree

In [2]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(random_state=32)

##### ii) Random forest

In [3]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(random_state=32)

##### iii) Support Vector Machine(SVM)

In [None]:
from sklearn import svm

svm_model = svm.SVC()

##### iv) Stochastic Gradient Descent Classifier(SGD Classifier)

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_model = SGDClassifier()

##### v) Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression(max_iter=4000)
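
Once instantiated, all five classifiers share the same `fit` / `predict` interface, so they can be trained and compared in a single loop. The following is a minimal sketch under the settings shown above (wine dataset, 20% test split, `random_state=7`); the `models` and `accuracies` dictionaries are illustrative names, and the loop is not necessarily how the linked notebooks are organised. No expected output is shown because some models (for example the SGD classifier, which has no fixed seed here) can give different results between runs.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

wines = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wines.data, wines.target, test_size=0.2, random_state=7
)

# The five classifiers with the same constructor arguments as above
models = {
    "Decision tree": DecisionTreeClassifier(random_state=32),
    "Random forest": RandomForestClassifier(random_state=32),
    "SVM": SVC(),
    "SGD classifier": SGDClassifier(),
    "Logistic regression": LogisticRegression(max_iter=4000),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)      # train on the training set
    y_pred = model.predict(X_test)   # predict labels for the unseen test set
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: {accuracies[name]:.3f}")
```

For a per-class breakdown of precision, recall, and F1 score, `sklearn.metrics.classification_report(y_test, y_pred)` can be printed inside the same loop.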

## 4. Result   

For the results, see [wine](https://github.com/hweejuni/The-very-first-repository/blob/master/Classifier_Project/Wine_Classification.ipynb), [handwritten digits](https://github.com/hweejuni/The-very-first-repository/blob/master/Classifier_Project/Handwritten_Letter_Classifier.ipynb), and [breast cancer](https://github.com/hweejuni/The-very-first-repository/blob/master/Classifier_Project/Breast_Cancer_Classifier.ipynb).   

Support Vector Machine (SVM) achieved the highest accuracy, approximately 99%, in the handwritten digit classification project. Random forest, on the other hand, was the most accurate in both the wine and breast cancer classification projects, reaching 100% accuracy.

## 5. Discussion and conclusion   

These results suggest that SVM copes comparatively well with larger sample sizes, whilst random forest performs well on smaller ones. The experiments were deliberately kept simple so that the models could be compared intuitively and directly. Other models and different hyperparameters should be explored further, on larger numbers of samples.

## 6. Reference   

[1] M. Hossin and M. N. Sulaiman, "A Review on Evaluation Metrics for Data Classification Evaluations", International Journal of Data Mining & Knowledge Management Process, vol. 5, no. 2, pp. 01-11, 2015. Available: 10.5121/ijdkp.2015.5201 [Accessed 30 July 2020].   

[2] R. Ranawana and V. Palade, "Optimized precision-A new measure for classifier performance evaluation", in Proc. of the IEEE World Congress on Evolutionary Computation (CEC 2006), 2006, pp. 2254-2261.   

[3] S. W. Wilson, "Mining oblique data with XCS", in P. L. Lanzi, W. Stolzmann and S. W. Wilson (Eds.) Advances in Learning Classifier Systems: Third Int. Workshop (IWLCS 2000), Berlin, Heidelberg: Springer-Verlag, 2001, pp. 283-290.