# IST 707 HW7
## kNN, SVM, and Random Forest for handwriting recognition
Name: Lu Guo

## Section 1: Introduction

The task is to classify handwritten digits from 0 to 9 using the kNN, SVM, and Random Forest algorithms. Each row represents one handwritten image, and each image contains a single digit. After computing the accuracy of each algorithm, we can compare the performance of the three.

In [37]:
import pandas as pd
df_train = pd.read_csv('digit-train.csv')
df_test = pd.read_csv('digit-test.csv')

In [38]:
print(len(df_train)) # show the number of rows in the training set
print(len(df_test)) # show the number of rows in the test set

4198
4198


In [39]:
df_train.head() # show the first 5 rows of the training set

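| Unnamed: 0 | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |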


In [40]:
# Show the distribution of the labels in the training set
df_train['label'].value_counts()

1    471
8    438
7    436
3    425
4    420
0    418
6    416
2    413
5    390
9    371
Name: label, dtype: int64

In [41]:
# Show the distribution of the labels in the test set
df_test['label'].value_counts()

1    478
7    465
3    446
2    420
0    414
4    404
6    404
8    393
5    388
9    386
Name: label, dtype: int64

The labels are distributed roughly evenly across 10 classes, from 0 to 9: each class has between 371 and 471 rows in the training set and between 386 and 478 rows in the test set.
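
To put a number on how even the distribution is, the counts can be turned into class shares. The sketch below hard-codes the counts printed by the two `value_counts()` cells above; the variable names are illustrative and not part of the original notebook. It also computes the accuracy of a naive baseline that always predicts the most frequent training label, which gives a floor to compare the models against.

```python
# Label counts copied from the value_counts() outputs above.
train_counts = {1: 471, 8: 438, 7: 436, 3: 425, 4: 420,
                0: 418, 6: 416, 2: 413, 5: 390, 9: 371}
test_counts = {1: 478, 7: 465, 3: 446, 2: 420, 0: 414,
               4: 404, 6: 404, 8: 393, 5: 388, 9: 386}

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())

# Share of each class in the training set.
train_shares = {label: count / n_train for label, count in train_counts.items()}
largest = max(train_shares.values())
smallest = min(train_shares.values())

# Baseline: always predict the most frequent training label.
majority_label = max(train_counts, key=train_counts.get)
baseline_accuracy = test_counts[majority_label] / n_test

print(f"train rows: {n_train}, test rows: {n_test}")
print(f"class share range: {smallest:.3f} to {largest:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Each class holds roughly 9% to 11% of the training rows, so accuracy is a reasonable metric here and the baseline sits near 11%.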

## Data preparation

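I separate the labels from the training and test datasets because the labels are not features. After this step the feature matrices contain the pixel values, plus the `Unnamed: 0` index column, which `drop('label', axis=1)` does not remove. I then standardize the features with `StandardScaler`. Scaling matters most for kNN and SVM, which rely on distances between samples, and it can also help the SVM solver converge faster.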

I will use accuracy as the evaluation metric.
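
As a reminder of what the metric measures, accuracy is the fraction of predictions that equal the true label. The helper below is a minimal sketch with made-up toy labels (not taken from the dataset); `accuracy_score` from scikit-learn, used in the cells below, computes the same fraction with its default settings.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels for illustration only (not taken from the dataset).
toy_true = [3, 5, 3, 2, 0, 7]
toy_pred = [3, 5, 8, 2, 0, 1]
print(accuracy(toy_true, toy_pred))  # 4 of the 6 predictions are correct
```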

In [42]:
X_train = df_train.drop('label', axis=1) 
y_train = df_train['label']

X_test = df_test.drop('label', axis=1)
y_test = df_test['label']

In [43]:
# standardize the data so that each feature has mean 0 and standard deviation 1 on the training set
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
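
The scaling step can be sketched in plain Python to show what is being computed. This is an illustration under assumptions, not scikit-learn's implementation: the function names and the toy rows are hypothetical. It uses the population standard deviation and a scale of 1.0 for constant columns, which, to my understanding, matches how `StandardScaler` behaves. Constant columns are common here, since border pixels are blank in most images. As in the cell above, the statistics come from the training rows only and are then reused for the test rows.

```python
import math

def fit_standardizer(rows):
    """Return per-column means and scales computed from the training rows."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_cols)]
    scales = []
    for j in range(n_cols):
        # Population variance (divide by n, not n - 1).
        variance = sum((row[j] - means[j]) ** 2 for row in rows) / n
        std = math.sqrt(variance)
        # A constant column (e.g. an always-blank pixel) has std 0;
        # use 1.0 instead so the division below is defined.
        scales.append(std if std > 0 else 1.0)
    return means, scales

def standardize(rows, means, scales):
    """Apply (x - mean) / scale column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, scales)] for row in rows]

# Toy data: the first column is constant, like a blank border pixel.
toy_train = [[0, 0, 10], [0, 4, 20], [0, 8, 30]]
toy_test = [[0, 4, 40]]

means, scales = fit_standardizer(toy_train)   # statistics from training rows only
print(standardize(toy_train, means, scales))
print(standardize(toy_test, means, scales))   # test rows reuse the training statistics
```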


## Section 2: kNN model

### 2.1 Create a basic kNN model

In [44]:
# create a basic kNN model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

# train the model
knn.fit(X_train, y_train)

# make predictions
y_pred = knn.predict(X_test)

# Print the accuracy of the model
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))


0.8966174368747022


### 2.2 Tuning parameters of the kNN model

In [45]:
# Add parameter tuning to the model
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': [1, 3], 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan']} 
# n_neighbors: number of neighbors to use
# weights: weight function used in prediction
# metric: distance metric used to find the nearest neighbors

grid = GridSearchCV(KNeighborsClassifier(), param_grid, refit = True, cv=5)
grid.fit(X_train, y_train)

# Print the best parameters
print(grid.best_params_)
print(grid.best_estimator_)

# Print the accuracy of the best model
grid_predictions = grid.predict(X_test)
print(accuracy_score(y_test, grid_predictions))

{'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}
KNeighborsClassifier(metric='manhattan', n_neighbors=1)
0.9192472606002858


### 2.3 Results of the best kNN model
The best kNN model reaches a test accuracy of 0.919, up from 0.897 for the basic model, with the parameters {'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}.
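
To make the three tuned parameters concrete, here is a minimal from-scratch sketch of a kNN prediction. It is not how `KNeighborsClassifier` is implemented; the function names and toy points are hypothetical. `k` plays the role of `n_neighbors`, `metric` selects the distance function, and `weights` chooses between equal votes and votes weighted by inverse distance.

```python
from collections import defaultdict

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train_X, train_y, query, k=1, metric=manhattan, weights="uniform"):
    """Predict one label by voting among the k nearest training points."""
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(
        ((metric(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0
        else:
            # "distance": closer neighbors count more. The small constant
            # avoids division by zero for exact matches (a simplification).
            votes[label] += 1.0 / (distance + 1e-12)
    return max(votes, key=votes.get)

# Toy 2-D points for illustration only.
toy_X = [[0, 0], [0, 1], [5, 5], [6, 5]]
toy_y = [0, 0, 1, 1]
print(knn_predict(toy_X, toy_y, [1, 0], k=1))
print(knn_predict(toy_X, toy_y, [4, 4], k=3, metric=euclidean, weights="distance"))
```

With `k=1` only one neighbor votes, so the `weights` setting cannot change the prediction. That is consistent with the grid search settling on `n_neighbors=1` together with the default `'uniform'` weights.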

## Section 3: SVM model

### 3.1. Create a basic SVM model

In [46]:
# create a basic SVM model
from sklearn.svm import SVC
svm = SVC() # the default kernel is 'rbf'

# train the model
svm.fit(X_train, y_train)

# make predictions
y_pred = svm.predict(X_test)

# print the accuracy of the model
print(accuracy_score(y_test, y_pred))


0.9221057646498333


### 3.2. Tune parameters of the SVM model

In [47]:
# Add parameter tuning to the model
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [14,12], 'gamma': [0.001, 0.002], 'kernel': ['rbf']} 
# I also tested other parameter values, but none gave better accuracy than the values above. The 'poly' and 'sigmoid' kernels performed worse.
grid = GridSearchCV(SVC(), param_grid, refit = True, cv=2)
grid.fit(X_train, y_train)

# Print the best parameters
print(grid.best_params_)
print(grid.best_estimator_)
# Print the accuracy of the best model
grid_predictions = grid.predict(X_test)
print(accuracy_score(y_test, grid_predictions))

{'C': 12, 'gamma': 0.001, 'kernel': 'rbf'}
SVC(C=12, gamma=0.001)
0.9363982848975703


### 3.3. Results of the best SVM model

The best SVM model reaches a test accuracy of 0.936, up from 0.922 for the basic model, with the parameters C=12, gamma=0.001, kernel='rbf'.
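
The role of `gamma` can be seen from the RBF kernel itself, which scores the similarity of two samples as exp(-gamma * squared Euclidean distance). The sketch below evaluates that formula for the two `gamma` values in the grid. The 784-entry toy vectors are made up and only stand in for two standardized images. `C` is not shown: it controls how strongly training errors are penalized rather than the shape of the kernel.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Toy vectors with 784 entries, standing in for two standardized images.
image_a = [0.0] * 784
image_b = [1.0] * 784   # differs from image_a by 1.0 in every pixel

for gamma in (0.001, 0.002):
    print(gamma, rbf_kernel(image_a, image_b, gamma))
```

A larger `gamma` makes the similarity fall off faster with distance. With 784 features, squared distances are large, which is one reason small values such as 0.001 can work well here.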

## Section 4: Random Forest model

### 4.1. Create a basic Random Forest model

In [48]:
# create a basic random forest model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

# train the model
rf.fit(X_train, y_train)

# make predictions
y_pred = rf.predict(X_test)

# print the accuracy of the model
print(accuracy_score(y_test, y_pred))

0.9385421629347308


### 4.2. Tune parameters of the Random Forest model

In [49]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [270, 280], 'max_depth': [17,18], 'criterion': ['gini']} 
# I tested many values: n_estimators from 1 to 300, max_depth from 1 to 20, and criterion 'gini' and 'entropy'.
grid = GridSearchCV(RandomForestClassifier(), param_grid, refit = True, cv=2)
grid.fit(X_train, y_train)

# Print the best parameters
print(grid.best_params_)
print(grid.best_estimator_)

# Print the accuracy of the best model
grid_predictions = grid.predict(X_test)
print(accuracy_score(y_test, grid_predictions))


{'criterion': 'gini', 'max_depth': 18, 'n_estimators': 280}
RandomForestClassifier(max_depth=18, n_estimators=280)
0.939018580276322


### 4.3. Results of the best Random Forest model

The best Random Forest model reaches a test accuracy of 0.939 in the run shown above, with the parameters {'criterion': 'gini', 'max_depth': 18, 'n_estimators': 280}.
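
Random forests draw bootstrap samples and random feature subsets, so repeated runs of the cells above can give slightly different accuracies, and even different best parameters, unless a seed is fixed. The sketch below illustrates this with `random_state`. It uses scikit-learn's bundled 8x8 digits data as a stand-in, not the homework CSV files, and the split, tree count, and seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled 8x8 digits, NOT the homework CSV files.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

def fit_and_score(seed):
    """Train a forest with a fixed seed and return its test accuracy."""
    model = RandomForestClassifier(n_estimators=50, max_depth=18, random_state=seed)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

same_seed = [fit_and_score(0), fit_and_score(0)]      # identical runs
other_seeds = [fit_and_score(seed) for seed in (1, 2, 3)]  # may differ slightly
print("same seed:", same_seed)
print("other seeds:", other_seeds)
```

Passing `random_state` to `RandomForestClassifier` in the cells above would make the reported numbers repeatable; the seed value itself is arbitrary.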

## Section 5: Model comparison

In terms of test accuracy, the Random Forest model performs best and the kNN model performs worst, with the SVM model in between. All three models perform better than the decision tree and Naïve Bayes models.

Based on the outputs above, the tuned Random Forest model has the highest accuracy, 0.939. The tuned kNN model has the lowest, 0.919. The tuned SVM model reaches 0.936, so the gap between SVM and Random Forest is small, about 0.3 percentage points.

The performance differences are plausible. Random Forest handles high-dimensional data well, and this handwritten digit dataset has 784 pixel features, so it is not surprising that it performs best. The kNN and SVM models both depend on distances between images: kNN directly, and the SVM through its RBF kernel. With 784 pixels, many of which are blank in most images, distances between different images tend to be less informative, which may explain why kNN and SVM did not match the Random Forest result.
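
To judge how large these differences are, the accuracies can be converted back into counts of correctly classified test images. The sketch below uses the test set size and the tuned-model accuracies printed by the cells above. The binomial standard error is only a rough guide that assumes independent test images, so treat it as a sanity check rather than a formal significance test.

```python
import math

n_test = 4198  # number of test rows reported above

# Test accuracies printed by the tuned-model cells above.
accuracies = {
    "kNN": 0.9192472606002858,
    "SVM": 0.9363982848975703,
    "Random Forest": 0.939018580276322,
}

for name, acc in accuracies.items():
    correct = round(acc * n_test)
    # Binomial standard error of an accuracy estimated from n_test images.
    std_err = math.sqrt(acc * (1 - acc) / n_test)
    print(f"{name}: {correct} of {n_test} correct, standard error about {std_err:.4f}")

gap = accuracies["Random Forest"] - accuracies["SVM"]
print(f"Random Forest minus SVM: {gap:.4f} ({round(gap * n_test)} images)")
```

The Random Forest and SVM results differ by only 11 test images, which is smaller than the standard error of either estimate, while kNN trails both by a wider margin.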