# laumann/statml

• 2 commits
• 4 files changed
• 2 contributors
Commits on Mar 20, 2012
• Marco: Finished report for SVM part, redid classify function to produce nice… …r output (bb8147c)
• laumann: Merge pull request #13 from ratiopharm88/bb8147c94d774ae652e67eb9b5b4… …01626b2b5d4b DONE (c82dcfe)
11 handin3/Code/classify.m
@@ -1,8 +1,11 @@
-function [ classvector ] = classify( data, model )
+function [ classvector ] = classify( data, model, labels )
 %% Trains the SVM on the given data using the given parameters.
 %% Returns libsvm model data which can then be used with svmpredict.
-    dummylabels = ones(size(data, 1), 1);
+    if nargin < 3
+        % create dummy labels
+        labels = ones(size(data, 1), 1);
+    end
-    classvector = svmpredict(dummylabels, data, model, '');
+    classvector = svmpredict(labels, data, model, '');
-end
+end
39 handin3/Code/runsvm.m
@@ -30,24 +30,37 @@
 %% And run all instances on themselves (and the others?) and the test data
-getpct=@(a,b) sum(a==b)/size(a,1);
+
 disp('Running the model for knollC-train100 on all data sets.');
-fprintf('knollC-train100: %d %% \n', getpct(knollC100(:,3), classify(knollC100(:, 1:2), c100model)));
-fprintf('knollC-train200: %d %% \n', getpct(knollC200(:,3), classify(knollC200(:, 1:2), c100model)));
-fprintf('knollC-train400: %d %% \n', getpct(knollC400(:,3), classify(knollC400(:, 1:2), c100model)));
-fprintf('knollC-test: %d %% \n', getpct(knollCtest(:,3), classify(knollCtest(:, 1:2), c100model)));
+disp('knollC-train100: ');
+classify(knollC100(:,1:2), c100model, knollC100(:,3));
+disp('knollC-train200: ');
+classify(knollC200(:,1:2), c100model, knollC200(:,3));
+disp('knollC-train400: ');
+classify(knollC400(:,1:2), c100model, knollC400(:,3));
+disp('knollC-test: ');
+classify(knollCtest(:,1:2), c100model, knollCtest(:,3));
 disp('Running the model for knollC-train200 on all data sets.');
-fprintf('knollC-train100: %d %% \n', getpct(knollC100(:,3), classify(knollC100(:, 1:2), c200model)));
-fprintf('knollC-train200: %d %% \n', getpct(knollC200(:,3), classify(knollC200(:, 1:2), c200model)));
-fprintf('knollC-train400: %d %% \n', getpct(knollC400(:,3), classify(knollC400(:, 1:2), c200model)));
-fprintf('knollC-test: %d %% \n', getpct(knollCtest(:,3), classify(knollCtest(:, 1:2), c200model)));
+disp('knollC-train100: ');
+classify(knollC100(:,1:2), c200model, knollC100(:,3));
+disp('knollC-train200: ');
+classify(knollC200(:,1:2), c200model, knollC200(:,3));
+disp('knollC-train400: ');
+classify(knollC400(:,1:2), c200model, knollC400(:,3));
+disp('knollC-test: ');
+classify(knollCtest(:,1:2), c200model, knollCtest(:,3));
 disp('Running the model for knollC-train400 on all data sets.');
-fprintf('knollC-train100: %d %% \n', getpct(knollC100(:,3), classify(knollC100(:, 1:2), c400model)));
-fprintf('knollC-train200: %d %% \n', getpct(knollC200(:,3), classify(knollC200(:, 1:2), c400model)));
-fprintf('knollC-train400: %d %% \n', getpct(knollC400(:,3), classify(knollC400(:, 1:2), c400model)));
-fprintf('knollC-test: %d %% \n', getpct(knollCtest(:,3), classify(knollCtest(:, 1:2), c400model)));
+disp('knollC-train100: ');
+classify(knollC100(:,1:2), c400model, knollC100(:,3));
+disp('knollC-train200: ');
+classify(knollC200(:,1:2), c400model, knollC200(:,3));
+disp('knollC-train400: ');
+classify(knollC400(:,1:2), c400model, knollC400(:,3));
+disp('knollC-test: ');
+classify(knollCtest(:,1:2), c400model, knollCtest(:,3));
+
BIN  handin3/handin3.pdf
Binary file not shown
8 handin3/handin3.tex
@@ -218,6 +218,8 @@ \subsection{Model Selection}
 \label{tab:svmoptimal}
 \end{table}
+If we train the SVM on all training data sets and run the resulting models on all training sets and the test set, we get result in table ~\ref{tab:svmpredictresults}. We can clearly see that all models behave about equally well on the test data and the other training data sets as on their own particular training set, so there seems to be no overfitting the problem. With more than 96\% accuracy for all combinations of model and data, the quality of the prediction is generally very high, however the results of models with larger training data sets tend to slightly surpass those for lower values on $n$.
+
 \begin{table}[!h]
 \centering
 \begin{tabular}{l | c | c | c | c}
@@ -236,7 +238,7 @@ \subsection{Inspecting the kernel expansion}
 \subsubsection{Visualization}
-Fig.~\ref{fig:freebounded} shows the plot of the \texttt{knollC-train200} data set, in which the support vectors are circled. The free support vectors are circled in black, and bounded are circled in green. There are 87 bounded support vectors, and just six free for a total of 93 support vectors.
+Fig.~\ref{fig:freebounded} shows the plot of the \texttt{knollC-train200} data set, in which the support vectors are circled. The free support vectors are circled in black, and bounded are circled in green. There are 87 bounded support vectors, and just six free ones for a total of 93 support vectors.
 \begin{figure}[!ht]
 \centering
@@ -255,7 +257,9 @@ \subsubsection{Effect of the regularization parameter}
 \subsubsection{Scaling behaviour}
-Table of free and bounded
+See table ~\ref{tab:knoll_free_bounded_SV} for the number of bounded and free support vectors for all training data sets. The number of support vectors grows (nearly) linearly with the number of data patterns in the training sets, as is to be expected. Since a higher number of support vectors will increase the computational effort for classifying new data patterns, this development needs to be controlled; otherwise, it may lead to a near-quadratic rise in complexity if training and test data grow alike.
+
+In general, a higher number of samples of a distribution with a non-zero Bayes risk means that more samples will be misclassified. Using a Gaussian kernel and therefore an infinite-dimensional feature space, this could of course be prevented, but that would most likely mean that we overfit the model to our training data and is therefore not desirable. To avoid this, i.e. to decrease the penalty for misclassifications in our model, this should mean that a lower value should be chosen for $C$. This is confirmed nicely by the fact that during our tests, $C=100$ actually gave the best results for $n=400$, whereas for lower values of $n$, $C=1000$ performed better.
 \begin{table}[h!]
 \centering
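
Note on the scaling argument in the `handin3.tex` changes: the "near-quadratic" claim and the role of $C$ can be made explicit with the standard soft-margin SVM formulation. The block below is only a sketch, not text from the report. The symbols $N_{\mathrm{SV}}$, $m$ (number of patterns to classify), the proportionality constant $c$, and the feature map $\phi$ are assumptions introduced here for illustration, and the block assumes the `amsmath` package is loaded.

```latex
% Sketch under stated assumptions; N_SV, m, c and \phi are illustrative symbols,
% not names taken from the report. Requires \usepackage{amsmath}.

% Classifying one pattern x evaluates the kernel once per support vector.
\begin{align*}
  f(x) &= \operatorname{sign}\Bigl( \sum_{i=1}^{N_{\mathrm{SV}}} \alpha_i y_i \, k(x_i, x) + b \Bigr)
    && \text{cost per pattern: } O(N_{\mathrm{SV}}) \\
  N_{\mathrm{SV}} &\approx c \, n
    && \text{assumed (nearly) linear growth in } n \\
  \text{cost for } m \text{ patterns} &= O(m \, N_{\mathrm{SV}}) = O(m \, n)
    && \text{i.e. } O(n^2) \text{ if } m \propto n
\end{align*}

% Soft-margin primal problem: C weights the penalty on the slack variables \xi_i.
\begin{align*}
  \min_{w,\, b,\, \xi} \quad & \tfrac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \\
  \text{subject to} \quad & y_i \bigl( \langle w, \phi(x_i) \rangle + b \bigr) \ge 1 - \xi_i,
    \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
\end{align*}

% In the dual, free support vectors satisfy 0 < \alpha_i < C,
% bounded support vectors satisfy \alpha_i = C.
```

Read this way, lowering $C$ makes margin violations cheaper, which is consistent with the report's observation that a smaller $C$ worked better for the largest training set.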
