Made corrections and changes in dataset.tex.
kjellwinblad committed Aug 1, 2011
1 parent d935813 commit 18e92b0
Showing 1 changed file with 12 additions and 8 deletions.
20 changes: 12 additions & 8 deletions report/dataset.tex
@@ -1,13 +1,17 @@

-We tried to find a dataset with handwritten text at the beginning of the project, but it turns out there are not that many available.
-% Not sure what Image is supposed to be referenced here.
-The datasets that do exist, like following image example(Figure 2), they would have needed a lot of preprocessing before we could use them in our project.
-We would have had to implement baseline slant normalization, skew correction, skeleton and so on.
+An attempt was made to find a dataset with handwritten text, but we failed to find one that suited our requirements.
+The datasets that were found would require a lot of preprocessing.
+Figure~\ref{} shows a sample from that kind of dataset.
+To get good result from that kind of dataset it would be necessary to implement baseline slant normalization, skew correction, skeleton and so on.

 Therefore, instead of spending a lot of time preprocessing the datasets, we implemented a Graphic User Interface to create our own dataset.
-The biggest advantages of this solution is that our solution records one pixel wide letters and the characters are already separated.
-The most important part of the work, image processing, was thus reduced significantly.
+The largest advantages of this solution is that our solution records one pixel wide lines and the characters are already separated.
+The large part of the work, image processing, was thus reduced significantly.
+Our dataset contains 100 examples for every capital letter in the Latin alphabet
+\footnote{The dataset is available together with the source code of the system. See appendix~\ref{app:source_code}.}.
+An example image from our character image dataset can be found in Figure~\ref{fig:image_feature_extraction}.

-Furthermore, if the vocabulary is relatively large, we found that it became easier for us to test the HMM.
-This is because our word training data is made up of randomly chosen samples.
+To get a dataset for training the word classifier a generator was created\footnote{Please see HandReco\/src\/api\/word\_examples\_generator.py in the source code for documentation of the word example generator. See appendix~\ref{app:source_code}.}.
+The generator creates random errors in the words given as input.
+To generate the dataset is obviously not optimal for practical applications, but it is good enough to test the implementation.
