# Kaggle Handwriting Recognition project

By: John DeOrian

Date started: March 13, 2020

Last modified: March 15, 2020

This workbook is my record of trying out different ML approaches to the handwriting recognition problem on Kaggle. I'll start with simpler approaches and work up to more complicated ones.

## Housekeeping

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [2]:
dfTrain = pd.read_csv('train.csv')
dfTest = pd.read_csv('test.csv')

## Random guessing

As a baseline, I will guess a random digit from 0 to 9 for every test image. Since the test set has no labels, I can't score the predictions locally, so I will upload them to Kaggle and report the result. I expect about 10% accuracy, since a uniform random guess among ten digits is right one time in ten.

In [None]:
# One uniformly random digit (0-9) per test image; randint's upper bound is exclusive
myGuesses = np.random.randint(0, 10, size=len(dfTest))
# ImageId values are 1-based
myLabels = range(1, len(dfTest) + 1)
resultsDf = pd.DataFrame({'ImageId': myLabels, 'Label': myGuesses})
resultsDf.to_csv('results1.csv', index=False)

The first submission matches my expectations: it scores 10.214% accuracy.
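
The 10% expectation can also be checked without a Kaggle upload. The sketch below is a simulation only: the labels are made up (drawn uniformly at random), and `n_samples` is an assumed size chosen for illustration. Because each guess is uniform over ten digits, it hits with probability 1/10 whatever the true label distribution looks like.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 28_000  # assumed number of test images, for illustration only

# Hypothetical stand-in labels; the real test labels are not available locally.
true_labels = rng.integers(0, 10, size=n_samples)
random_guesses = rng.integers(0, 10, size=n_samples)

# Fraction of positions where the random guess happens to match the label
simulated_accuracy = float(np.mean(true_labels == random_guesses))
print(f"Simulated accuracy of random guessing: {simulated_accuracy:.3%}")
```

Small deviations from exactly 10%, like the 10.214% above, are ordinary sampling noise.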

## Logistic regression

Next, I will use logistic regression to make predictions. I expect this to perform poorly, but better than random guessing.

In [None]:
y = dfTrain.iloc[:,0]
X = dfTrain.iloc[:,1:]

clf = LogisticRegression().fit(X,y)
# Accuracy on the training data itself, so this is an optimistic estimate
print(clf.score(X,y))

In [None]:
myGuesses = clf.predict(dfTest)
myLabels = tuple((x+1 for x in range(len(dfTest))))
resultsDf = pd.DataFrame(zip(myLabels,myGuesses),columns=['ImageId','Label'])
resultsDf.to_csv('results2.csv',index=False)

This model achieves >90% accuracy! That's much, much better than I expected. At the same time, I'm concerned that the model never converged, even after 10,000 iterations. I don't know why it's not converging. A quick Google search suggests the problem might be "complete or quasi-complete separation in the data". That's over my head for now, but I may return to it later. For now, I'll proceed as if the model had converged without issue.
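
One thing worth trying for the convergence problem, as an assumption rather than a diagnosis, is rescaling the pixel values before fitting, since sklearn's own convergence warning suggests scaling the data. The sketch below uses sklearn's small bundled 8x8 digits dataset as a stand-in for `train.csv`, and `max_iter=1000` is an arbitrary illustrative value. The names `X_demo`, `y_demo`, and `scaled_clf` are mine, not part of the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data: sklearn's bundled 8x8 digits, not the Kaggle files.
X_demo, y_demo = load_digits(return_X_y=True)

# Rescale every pixel to [0, 1] before the logistic regression sees it.
scaled_clf = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_demo, y_demo)

scaled_train_acc = scaled_clf.score(X_demo, y_demo)
print("train accuracy:", scaled_train_acc)
# n_iter_ reports how many solver iterations were actually used.
print("iterations used:", scaled_clf[-1].n_iter_)
```

If `n_iter_` comes out well below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations. Whether the same holds on the full 784-pixel data would need to be checked.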

Since logistic regression performed so well, I'll continue experimenting with it a couple more times before moving on to another model.

My next experiment is to see how regularization affects the model's performance. Since the basic logit model had a train performance of ~94% and a test performance of ~92%, I expect regularization to reduce the gap between train and test. I do not expect train performance itself to improve.
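
The train/test gap can also be measured locally with a hold-out split instead of a Kaggle upload. This is a minimal sketch on sklearn's bundled 8x8 digits (a stand-in for `train.csv`). In sklearn, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger penalty. The three `C` values and the variable names are illustrative choices, not tuned settings.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_digits(return_X_y=True)  # stand-in data, not the Kaggle files
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

gaps = {}
for C in (0.01, 1.0, 100.0):  # strong, default, and weak regularization
    model = make_pipeline(MinMaxScaler(), LogisticRegression(C=C, max_iter=2000))
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    val_acc = model.score(X_val, y_val)
    gaps[C] = (train_acc, val_acc)
    print(f"C={C}: train={train_acc:.3f}, validation={val_acc:.3f}, "
          f"gap={train_acc - val_acc:.3f}")
```

Comparing the `gap` column across values of `C` shows directly whether stronger regularization narrows the difference between train and validation accuracy.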

I begin with L1 regularization (the penalty used in Lasso regression) because it shrinks the coefficients of less important features to zero. This is useful when we have a large number of features, which we do in this case with 784. The challenge here is to select an appropriate value for lambda. I'm not sure how sklearn works yet, so I'll just go with the default values for now. I have to change the solver because the default solver doesn't support L1.
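
To see the "shrink to zero" effect concretely, the sketch below counts how many coefficients an L1 penalty sets to exactly zero at two penalty strengths. It rests on several assumptions: sklearn's bundled 8x8 digits stand in for `train.csv`, the two `C` values are arbitrary (`C` is the inverse of lambda), and liblinear is wrapped in `OneVsRestClassifier` so that one binary L1 model is fitted per digit.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_digits(return_X_y=True)  # stand-in data, not the Kaggle files
X_scaled = MinMaxScaler().fit_transform(X_demo)

zero_fraction = {}
for C in (0.05, 1.0):  # smaller C = stronger L1 penalty
    base = LogisticRegression(solver='liblinear', penalty='l1', C=C)
    ovr = OneVsRestClassifier(base).fit(X_scaled, y_demo)
    # Stack the per-digit coefficient vectors into one (10, 64) array
    coefs = np.vstack([est.coef_ for est in ovr.estimators_])
    zero_fraction[C] = float(np.mean(coefs == 0))
    print(f"C={C}: {zero_fraction[C]:.1%} of coefficients are exactly zero")
```

A larger share of zeros at the smaller `C` is the feature-selection behavior described above.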

In [None]:
y = dfTrain.iloc[:,0]
X = dfTrain.iloc[:,1:]

clf = LogisticRegression(solver='liblinear',penalty='l1').fit(X,y)
print(clf.score(X,y))

myGuesses = clf.predict(dfTest)
myLabels = tuple((x+1 for x in range(len(dfTest))))
resultsDf = pd.DataFrame(zip(myLabels,myGuesses),columns=['ImageId','Label'])
resultsDf.to_csv('results3.csv',index=False)

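I'll also train a model using L2 regularization (the penalty used in Ridge regression). Note that `penalty='l2'` is already sklearn's default, so this should match the basic model above unless other settings differ.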

In [None]:
y = dfTrain.iloc[:,0]
X = dfTrain.iloc[:,1:]

clf = LogisticRegression(penalty='l2').fit(X,y)
print(clf.score(X,y))

myGuesses = clf.predict(dfTest)
myLabels = tuple((x+1 for x in range(len(dfTest))))
resultsDf = pd.DataFrame(zip(myLabels,myGuesses),columns=['ImageId','Label'])
resultsDf.to_csv('results4.csv',index=False)

In the end, logistic regression had solid performance but some critical shortcomings. First, the models won't converge. Even though they ended with >90% accuracy, I am unwilling to rely on a model that won't converge. Second, the models never performed better than 92%, so I'll have to look elsewhere for additional accuracy gains. Finally, I want to note that I wasted a lot of time during training by increasing max_iter to 10,000. I did that to try to overcome the convergence problem, but it didn't help and just cost time.

At this point, I'm going to try another approach and see how it performs.

## K-Nearest Neighbors

I actually expect KNN to perform pretty well and definitely better than Logistic Regression. Since images of the same digit should have similar pixel values, my intuition is that an image's nearest neighbors will mostly share its label.

It may be that, after performing basic KNN, I do some feature reduction and then redo KNN.
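
For reference, one way the feature-reduction idea could look is PCA followed by KNN in a single pipeline. This is only a sketch under assumptions: PCA is my choice of reduction technique, the 95% variance threshold is arbitrary, and sklearn's bundled 8x8 digits stand in for `train.csv`.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X_demo, y_demo = load_digits(return_X_y=True)  # stand-in data, not the Kaggle files
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Keep enough principal components to explain 95% of the variance (arbitrary threshold),
# then classify in the reduced space.
pca_knn = make_pipeline(PCA(n_components=0.95), KNeighborsClassifier(n_neighbors=5))
pca_knn.fit(X_tr, y_tr)

n_kept = pca_knn[0].n_components_
val_acc = pca_knn.score(X_val, y_val)
print(f"components kept: {n_kept} of {X_demo.shape[1]}, validation accuracy: {val_acc:.3f}")
```

Fewer dimensions also means cheaper distance computations, which matters because KNN does most of its work at prediction time.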

In [3]:
y = dfTrain.iloc[:,0]
X = dfTrain.iloc[:,1:]

In [None]:
clf = KNeighborsClassifier(n_neighbors = 5).fit(X=X,y=y)
myGuesses = clf.predict(dfTest)
myLabels = tuple((x+1 for x in range(len(dfTest))))
resultsDf = pd.DataFrame(zip(myLabels,myGuesses),columns=['ImageId','Label'])
resultsDf.to_csv('results5.csv',index=False)

Basic KNN scored 96.8% accuracy! That's pretty good and definitely better than Logistic Regression. The only downside is that it's painfully slow. 

Now I will experiment with different values for k to see whether it affects my score. I expect this to take a long time and not have a big effect. I'm just doing it for completeness' sake.

In [4]:
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.3)
print('XTrain: {0}, XTest: {1}, yTrain:{2}, and yTest:{3}'.format(XTrain.shape,XTest.shape,yTrain.shape,yTest.shape))

XTrain: (29400, 784), XTest: (12600, 784), yTrain:(29400,), and yTest:(12600,)


In [None]:
kRange = range(1,11)
scores = {}

for k in kRange:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X=XTrain, y=yTrain)
    scores[k] = clf.score(X=XTest, y=yTest)
    
print(scores)

As expected, changing k did not have a meaningful effect on my outcomes. Also, it took a really long time (~1.5 hours!). One thing that confuses me is why KNN works for this task when KNN is said to suffer from the curse of dimensionality. After some digging, I found a Quora post that talks about the blessing of non-uniformity. This basically says that the curse of dimensionality is mostly an issue with random or uniformly distributed data. 'Real' data is not random or uniformly distributed, so the algorithm's accuracy is saved. This is a big, complex topic that I'll need to learn more about.
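
A rough way to see this effect is to compare, for each point, the distance to its nearest neighbor with the distance to its farthest point. A ratio close to 1 means everything is about equally far away, which is the regime where KNN struggles. The sketch below computes this ratio for structured digit images and for uniform noise of the same shape. It is an illustration under assumptions, not a proof: sklearn's bundled 8x8 digits stand in for the Kaggle data, and the helper `nearest_to_farthest_ratio` is my own.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import pairwise_distances

X_real, _ = load_digits(return_X_y=True)            # structured data, 64 dimensions
rng = np.random.default_rng(0)
X_uniform = rng.uniform(0, 16, size=X_real.shape)   # same shape and value range, no structure

def nearest_to_farthest_ratio(X, n_queries=200):
    """Mean ratio of nearest-neighbor distance to farthest-point distance."""
    D = pairwise_distances(X[:n_queries], X)         # Euclidean by default
    idx = np.arange(n_queries)
    D[idx, idx] = np.nan                             # ignore each point's distance to itself
    return float(np.mean(np.nanmin(D, axis=1) / np.nanmax(D, axis=1)))

ratio_real = nearest_to_farthest_ratio(X_real)
ratio_uniform = nearest_to_farthest_ratio(X_uniform)
print(f"digits:  nearest/farthest = {ratio_real:.2f}")
print(f"uniform: nearest/farthest = {ratio_uniform:.2f}")
```

A lower ratio for the digit images than for the noise means nearest neighbors are meaningfully closer than other points in the real data.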

Now, I'm going to rerun KNN but using Manhattan instead of Euclidean distance. Per some of my readings, it sounds like Manhattan might perform better with higher dimensions.

In [6]:
clf = KNeighborsClassifier(p=1).fit(X=X,y=y)
myGuesses = clf.predict(dfTest)
myLabels = tuple((x+1 for x in range(len(dfTest))))
resultsDf = pd.DataFrame(zip(myLabels,myGuesses),columns=['ImageId','Label'])
# New file name so the earlier logistic regression submission (results2.csv) is not overwritten
resultsDf.to_csv('results6.csv',index=False)

KNN with Manhattan distance performed about the same as KNN with Euclidean distance: ~96%. Time to move on to another approach.
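
The two metrics can also be compared side by side on a local hold-out split, which avoids a Kaggle submission per variant. This is a minimal sketch on sklearn's bundled 8x8 digits (a stand-in for `train.csv`). In `KNeighborsClassifier`, `p=1` selects Manhattan distance and `p=2` Euclidean, and `metric_scores` is a name I introduced.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_digits(return_X_y=True)  # stand-in data, not the Kaggle files
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

metric_scores = {}
for p, name in ((1, 'manhattan'), (2, 'euclidean')):
    # Same k and same split for both runs, so only the distance metric differs
    knn = KNeighborsClassifier(n_neighbors=5, p=p).fit(X_tr, y_tr)
    metric_scores[name] = knn.score(X_val, y_val)

print(metric_scores)
```

Fixing `random_state` keeps the split identical across runs, so any difference in score comes from the metric rather than from which rows were held out.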

## Random forest
tbd