# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Import train_test_split from sklearn.model_selection (to split dataset into training and test data)
from sklearn.model_selection import train_test_split

# Import datasets from sklearn
from sklearn import datasets

# Initialize iris dataset
iris = datasets.load_iris()

# Create arrays for the features and the response variable
y = iris.target
X = iris.data

# Split features and response variable into training and test sets
# Set training data to 70%, and set random state to 59 because 59 is my favorite number
# Set stratify to the response variable so that the split is stratified according to target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=59, stratify=y
)

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train,y_train)

# Predict the labels for the test data X_test
y_pred = knn.predict(X_test)

print("k-NN Classification (6 Nearest Neighbors)")
print("=========================================")

print("Actual species    : {}".format(y_test))
print("Predicted species : {}".format(y_pred))

# Compare predicted labels with actual labels (True if correctly labeled, False if incorrectly labeled)
print("Correct prediction: {}".format(y_pred == y_test))

# Calculate prediction accuracy
pred_score = knn.score(X_test,y_test)
print("Prediction score  : {}".format(pred_score))

## decision tree

# Import tree from sklearn
from sklearn import tree

# Create a decision tree classifier
treeclassifier = tree.DecisionTreeClassifier()

# Fit training data into tree classifier
treeclassifier = treeclassifier.fit(X_train,y_train)

# Predict the labels for the test data X_test
y_pred = treeclassifier.predict(X_test)

print()
print("Decision Tree Classification")
print("============================")

print("Actual species    : {}".format(y_test))
print("Predicted species : {}".format(y_pred))

# Compare predicted labels with actual labels (True if correctly labeled, False if incorrectly labeled)
print("Correct prediction: {}".format(y_pred == y_test))

# Calculate prediction accuracy
pred_score = treeclassifier.score(X_test,y_test)
print("Prediction score  : {}".format(pred_score))

## naive bayes

# Import GaussianNB from sklearn.naive_bayes
from sklearn.naive_bayes import GaussianNB

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Fit training data into Gaussian Naive Bayes classifier
gnb.fit(X_train,y_train)

# Predict the labels for the test data X_test
y_pred = gnb.predict(X_test)

print()
print("Gaussian Naive Bayes Classification")
print("===================================")

print("Actual species    : {}".format(y_test))
print("Predicted species : {}".format(y_pred))

# Compare predicted labels with actual labels (True if correctly labeled, False if incorrectly labeled)
print("Correct prediction: {}".format(y_pred == y_test))

# Calculate prediction accuracy
pred_score = gnb.score(X_test,y_test)
print("Prediction score  : {}".format(pred_score))

k-NN Classification (6 Nearest Neighbors)
=========================================
Actual species    : [0 2 2 1 0 0 1 0 2 0 1 0 2 2 2 0 2 0 1 1 0 2 2 2 1 1 0 2 1 1 1 0 2 0 1 2 1
 1 1 2 1 0 0 2 0]
Predicted species : [0 2 2 1 0 0 1 0 2 0 2 0 2 2 2 0 1 0 1 1 0 2 2 2 1 1 0 2 1 1 1 0 2 0 1 2 1
 1 1 2 1 0 0 2 0]
Correct prediction: [ True  True  True  True  True  True  True  True  True  True False  True
  True  True  True  True False  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True]
Prediction score  : 0.9555555555555556

Decision Tree Classification
============================
Actual species    : [0 2 2 1 0 0 1 0 2 0 1 0 2 2 2 0 2 0 1 1 0 2 2 2 1 1 0 2 1 1 1 0 2 0 1 2 1
 1 1 2 1 0 0 2 0]
Predicted species : [0 2 2 1 0 0 1 0 2 0 2 0 2 2 2 0 1 0 1 1 0 2 2 2 1 1 0 2 1 1 1 0 2 0 1 2 1
 1 1 1 1 0 0 2 0]
Correct prediction: [ True  True  True  True  True  True  True  True  True  True False  True
  True  True  True  True False  True  True  True 
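
The three classifiers above are each scored on a single 70/30 split, so the reported accuracies depend on which 45 samples landed in the test set. A minimal sketch of a complementary check, assuming 5-fold stratified cross-validation on the full iris data (the fold count, the `classifiers` dictionary and the `cv_scores` name are illustrative choices, not part of the original cell):

```python
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the same iris data used in the cell above
iris = datasets.load_iris()
X, y = iris.data, iris.target

# The three classifiers compared in this notebook
# (random_state on the tree is an added assumption, for repeatable results)
classifiers = {
    "k-NN (6 neighbors)": KNeighborsClassifier(n_neighbors=6),
    "Decision tree": DecisionTreeClassifier(random_state=59),
    "Gaussian naive Bayes": GaussianNB(),
}

# Stratified folds keep the class proportions equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=59)

cv_scores = {}
for name, clf in classifiers.items():
    # cross_val_score refits a fresh copy of clf on each training fold
    scores = cross_val_score(clf, X, y, cv=cv)
    cv_scores[name] = scores
    print("{:<22}: mean accuracy {:.3f} (std {:.3f})".format(
        name, scores.mean(), scores.std()))
```

Averaging over folds gives a less split-dependent estimate than one `score` call, at the cost of fitting each model five times.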

### Writing exercises: find 10 papers that have applied kNN, Decision Tree and Naive Bayes to solve their problems. For each paper, explain the problems they are solving, the techniques and the data that were used in the paper. Write this in your Jupyter Notebook.

1. **Mukherjee, Saurabh, and Neelam Sharma. "Intrusion detection using naive Bayes classifier with feature reduction." Procedia Technology 4 (2012): 119-128.**  
   In this paper, the authors investigate whether applying feature reduction to the dataset before running the Naive Bayes classifier performs as well as running the classifier on all features of the dataset, for the task of detecting network intrusions from network data.  
   The data used for training and testing was the NSL-KDD dataset, a popular dataset for studying the effectiveness of classification algorithms in detecting anomalies in network traffic patterns.  
   The authors used the WEKA machine learning tool to first apply four (4) different feature reduction methods to the dataset, namely Correlation-based Feature Selection (CFS), Information Gain (IG), Gain Ratio (GR) and Feature-Vitality Based Reduction Method (FVBRM), to reduce the number of features used. The Naive Bayes classifier is then applied to each reduced dataset, and the effectiveness of the classifier is compared across the reduced datasets.  

  
2. **Cho, Yoon Ho, Jae Kyeong Kim, and Soung Hie Kim. "A personalized recommender system based on web usage mining and decision tree induction." Expert systems with Applications 23.3 (2002): 329-342.**  
   In this paper, the authors are trying to solve the problem of making product recommendations to customers (recommender system).  
   The data used is a combination of the customers' clickstream through the marketplace website (web usage mining) and product and purchase data from data marts.  
   The authors developed a system comprising eight agents and one data mart. The decision tree algorithm is used by several of the agents as part of the overall recommender algorithm.

  
3. **Wang, Qiong, et al. "Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy." Appl. Environ. Microbiol. 73.16 (2007): 5261-5267.**  
   In this paper, the authors are trying to solve the problem of classifying bacterial microorganisms into their proper taxonomy e.g. domain and genus, using their 16S rRNA sequences.  
   The data used in this paper is from the Ribosomal Database Project II (RDP), which maintains over 300,000 bacterial sequences.  
   The Naive Bayesian classifier was used as the classification algorithm in the paper, although other methods, such as the k-NN classifier, were also used for comparison.
   
  
4. **Pal, Mahesh, and Paul M. Mather. "An assessment of the effectiveness of decision tree methods for land cover classification." Remote sensing of environment 86.4 (2003): 554-565.**  
   In this paper, the authors are trying to solve the problem of classifying land cover types (e.g. forest, plains, jungle) from satellite imaging data.  
   Two datasets were used in this paper: the first is satellite image data (Landsat ETM+) from the area around the town of Littleport in eastern England, and the second is DAIS 7915 airborne imaging spectrometer data from the La Mancha Alta region of central Spain.  
   Several classification algorithms were compared in the paper, including univariate decision tree, boosted univariate decision tree, maximum likelihood (ML) and neural networks (NN).  
   
  
5. **Balabin, Roman M., Ravilya Z. Safieva, and Ekaterina I. Lomakina. "Gasoline classification using near infrared (NIR) spectroscopy data: Comparison of multivariate techniques." Analytica Chimica Acta 671.1-2 (2010): 27-35.**  
   In this paper, the authors are trying to solve the problem of classifying gasoline (petroleum) in terms of source (which refinery the petroleum was processed in) and in terms of process (what process was used to refine the petroleum).  
   The data used in this paper were three petroleum sample sets provided by refineries, all of them unleaded Russian petroleum without additives. Near-infrared (NIR) FT spectrometer data was acquired for every sample set and then used for classification.  
   Several classification algorithms were compared in the paper, including Linear discriminant analysis (LDA), Quadratic discriminant analysis (QDA), Regularized discriminant analysis (RDA), Soft independent modeling of class analogy (SIMCA), Partial least squares (PLS) regression and classification, K-Nearest neighbor (KNN), Support vector machines (SVM), Probabilistic neural network (PNN), and Multilayer perceptron (MLP).
   
  
6. **Verma, Abhishek, Chengjun Liu, and Jiancheng Jia. "New colour SIFT descriptors for image classification with applications to biometrics." International Journal of Biometrics 3.1 (2011): 56.**  
   In this paper, the authors are exploring new colour SIFT descriptors for image processing, in order to improve the performance of classification algorithms that can be applied to biometrics.  
   The data used are a combination of the Caltech 256 object categories (30,607 images divided into 256 categories) and the UPOL Iris dataset (128 unique classes belonging to 64 subjects).  
   The paper compares the performance of eight different descriptors: the oRGB-SIFT, the YCbCr-SIFT, the RGB-SIFT, the HSV-SIFT, the rgb-SIFT, the greyscale-SIFT, the CSF, and the CGSF descriptors. Classification is implemented using a novel EFM-KNN classifier, which combines the EFM and the KNN decision rule.
   
  
7. **Moosavian, Ashkan, et al. "Support vector machine and K-nearest neighbour for unbalanced fault detection." Journal of Quality in Maintenance Engineering 20.1 (2014): 65-75.**  
   In this paper, the authors are trying to diagnose unbalanced faults in rotating machines using support vector machine and KNN algorithms for classification. The vibration signals from the rotating machines are used as the variables in order to predict whether the rotating machine is unloaded, having balanced load, or having unbalanced load.  
   The data was collected on a test setup in which vibration signals were taken from a rotating machine under different load conditions. The vibration signals were then put through a Fast Fourier Transform (FFT), and 29 features were extracted from the FFT amplitude of the signals, including maximum, range, average, root mean square, standard deviation, variance, fifth central moment, sixth central moment, skewness, kurtosis, etc.  
   Support Vector Machine (SVM) and KNN were then applied to the data to classify the three different conditions. The performances of the two classifiers were obtained and compared.
   
  
8. **Kalimuthu, Sathyavikasini, and Vijaya Vijayakumar. "Shallow learning model for diagnosing neuro muscular disorder from splicing variants." World Journal of Engineering 14.4 (2017): 329-336.**  
   In this paper, the authors address disease diagnosis: they aim to predict which type of muscular dystrophy is associated with given gene sequences.  
   The data used by this paper is from the Human Gene Mutation Database (HGMD). For each category of muscular dystrophy, 120 synthetic mutated gene sequences are generated, giving a corpus of 600 sequences across five categories of muscular dystrophy.  
   The corpus is then run through different classification algorithms and the results are compared. The algorithms used in the comparison include Decision tree classifier, K-nearest neighbor, Naïve Bayes classifier, and Support Vector Machine.
   
  
9. **Faleh, Rabeb, et al. "Enhancing WO3 gas sensor selectivity using a set of pollutant detection classifiers." Sensor Review 38.1 (2018): 65-73.**  
   In this paper, the authors are trying to detect different gases using signals from an electronic nose (e-nose) system. The different gases involved are ozone, ethanol and acetone.  
   The data for this paper was acquired by the authors themselves using three electronic nose systems. Transient parameters were then extracted for each gas, and principal component analysis (PCA) was performed.  
   The PCA output is then used as input to a Support Vector Machine classifier, and the output of the SVM is passed to a second classifier, this time KNN, to produce the final classification. The paper proposes this fusion of SVM and KNN classifiers as a way to increase detection accuracy compared with single classifiers.
   
  
10. **Metsis, Vangelis, Ion Androutsopoulos, and Georgios Paliouras. "Spam filtering with naive bayes-which naive bayes?." CEAS. Vol. 17. 2006.**  
   In this paper, the authors compare different naive Bayes algorithms for distinguishing spam messages from legitimate messages.  
   The data used in this paper includes spam and ham messages from several sources, including Spam Track, SpamAssassin corpus, the Honeypot Project, the spam collection of Bruce Guenter, and the spam collected by the third author of the paper.  
   The performance of the different naive Bayes classifiers is then compared in terms of classification accuracy. The classifiers are Multivariate Bernoulli NB, Multinomial NB with TF attributes, Multinomial NB with Boolean attributes, Multivariate Gauss NB, and Flexible Bayes.
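
Several of the papers above pair a feature-reduction step with a classifier; item 1, for example, reduces the feature set before applying Naive Bayes. The general idea can be sketched on the iris data from earlier in this notebook. This is only an illustration under stated assumptions: it uses scikit-learn's `SelectKBest` with a mutual-information score as a stand-in for an information-gain style ranking, keeps `k=2` features arbitrarily, and does not reproduce the WEKA workflow, the FVBRM method or the NSL-KDD data from the paper.

```python
from functools import partial

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Same data and split settings as the classification cell above
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=59, stratify=iris.target
)

# Baseline: Gaussian naive Bayes on all four features
full_model = GaussianNB().fit(X_train, y_train)
full_score = full_model.score(X_test, y_test)

# Reduced: keep the 2 features with the highest mutual information
# with the target (k=2 is an illustrative assumption), then classify.
# The pipeline fits the selector on the training data only.
mi_score = partial(mutual_info_classif, random_state=59)
reduced_model = make_pipeline(SelectKBest(score_func=mi_score, k=2), GaussianNB())
reduced_model.fit(X_train, y_train)
reduced_score = reduced_model.score(X_test, y_test)

# Report which features survived and how the two models compare
kept = reduced_model.named_steps["selectkbest"].get_support()
kept_names = [n for n, keep in zip(iris.feature_names, kept) if keep]
print("Features kept       : {}".format(kept_names))
print("Score, all features : {}".format(full_score))
print("Score, 2 features   : {}".format(reduced_score))
```

Wrapping the selector and the classifier in one pipeline keeps the test data out of the feature-selection step, which is what makes the comparison between the full and reduced models fair.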