# marcoeilers/statml forked from laumann/statml

A little report writing.

Introduce tables in the report with the results of modifying the
parameter C after the model has been selected; the analysis is still
missing. Introduce another table showing the optimal values for C and
gamma found by grid-search model selection.
1 parent c6ede21 commit f601ee19199a0ecf1a365a90a8d8eafb3ed3ff4b laumann committed Mar 18, 2012
2 handin3/Code/freeBoundedSVs.eps
@@ -1,7 +1,7 @@
 %!PS-Adobe-2.0
 %%Creator: MATLAB, The MathWorks, Inc. Version 7.13.0.564 (R2011b). Operating System: Linux 3.2.1-gentoo-r2 #7 SMP Thu Feb 9 21:03:08 CET 2012 x86_64.
 %%Title: ./freeBoundedSVs.eps
-%%CreationDate: 03/18/2012 20:55:34
+%%CreationDate: 03/18/2012 22:41:27
 %%DocumentNeededFonts: Helvetica
 %%DocumentProcessColors: Cyan Magenta Yellow Black
 %%Extensions: CMYK
21 handin3/Code/regularization.m
@@ -0,0 +1,21 @@
+
+
+data = loadknoll('knollC-train200.dt');
+
+[C gamma] = modelselect(data);
+
+model = train(data, C, gamma)
+modelLarger = train(data, C*100, gamma)
+modelSmaller = train(data, C/100, gamma)
+
+[f b] = dividesupportvectors(C, model.SVs, model.sv_coef);
+[fl bl] = dividesupportvectors(C, modelLarger.SVs, modelLarger.sv_coef);
+[fs bs] = dividesupportvectors(C, modelSmaller.SVs, ...
+                               modelSmaller.sv_coef);
+
+disp(sprintf('Original: #SVs: %d\t#free SVs: %d\t#bounded SVs: %d', ...
+             length(model.SVs), length(f), length(b)));
+disp(sprintf('C*100: #SVs: %d\t#free SVs: %d\t#bounded SVs: %d', ...
+             length(modelLarger.SVs), length(fl), length(bl)));
+disp(sprintf('C/100: #SVs: %d\t#free SVs: %d\t#bounded SVs: %d', ...
+             length(modelSmaller.SVs), length(fs), length(bs)));
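The commit does not show `dividesupportvectors` itself. The following is a minimal sketch of such a split, assuming it marks a support vector as bounded when the magnitude of its coefficient equals C. LIBSVM stores y_i * alpha_i in `sv_coef`, so each magnitude lies in (0, C]. The toy coefficients, the tolerance `tol`, and the variable names are hypothetical and are not taken from the repository.

```matlab
% Hypothetical stand-in for model.sv_coef and model.SVs (not real LIBSVM output).
C = 10;                              % assumed: the C this toy model was trained with
sv_coef = [10; -10; 3.5; -0.2; 10];  % y_i * alpha_i, one entry per support vector
SVs = [1 2; 3 4; 5 6; 7 8; 9 10];    % toy support vectors, one per row

% A support vector is bounded when |y_i * alpha_i| reaches the upper bound C.
tol = 1e-8 * C;                      % assumed tolerance for the floating-point comparison
isBounded = abs(abs(sv_coef) - C) <= tol;

boundedSVs = SVs(isBounded, :);
freeSVs = SVs(~isBounded, :);

fprintf('#SVs: %d\t#free SVs: %d\t#bounded SVs: %d\n', ...
        size(SVs, 1), size(freeSVs, 1), size(boundedSVs, 1));
```

Under this assumption, the bound passed in has to be the C that the model was trained with (C*100 for `modelLarger`, C/100 for `modelSmaller`). An equality test against a different C would report almost every support vector as free.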
24 handin3/Code/runsvm.m
@@ -22,18 +22,27 @@
 %% We now train our SVM on each dataset using the respective values
 %% we found for C and gamma.

-c100model=train(knollC100(:,1:2), knollC100(:,3), c100, gamma100);
+c100model=train(knollC100, c100, gamma100);

-c200model=train(knollC200(:,1:2), knollC200(:,3), c200, gamma200);
+c200model=train(knollC200, c200, gamma200);

-c400model=train(knollC400(:,1:2), knollC400(:,3), c400, gamma400);
+c400model=train(knollC400, c400, gamma400);

 %% And run all instances on themselves (and the others?) and the test data
 %% TODO

+%% Get number of free and bounded support vectors
+[free100 bounded100] = dividesupportvectors(c100, c100model.SVs, c100model.sv_coef);
+[free200 bounded200] = dividesupportvectors(c200, c200model.SVs, c200model.sv_coef);
+[free400 bounded400] = dividesupportvectors(c400, c400model.SVs, c400model.sv_coef);
+
+disp(sprintf('knollC100: Free SVs: %d Bounded SVs: %d', length(free100), length(bounded100)));
+disp(sprintf('knollC200: Free SVs: %d Bounded SVs: %d', length(free200), length(bounded200)));
+disp(sprintf('knollC400: Free SVs: %d Bounded SVs: %d', length(free400), length(bounded400)));
+
 %% Visualizing the SVM solution

 %% We want to plot the original knollC-train200 data
@@ -45,11 +54,8 @@
 hold on;
 plot(class2(:, 1), class2(:, 2), 'bx');

-%% Get the free and bounded support vectors
-[free bounded] = dividesupportvectors(c200, c200model.SVs, c200model.sv_coef);
-
-%% And plot them: bounded SVs in green, free ones in black
-plot(bounded(:,1), bounded(:,2), 'go');
-plot(free(:,1), free(:,2), 'ko');
+%% Plot SVs: bounded SVs in green, free ones in black
+plot(bounded200(:,1), bounded200(:,2), 'go');
+plot(free200(:,1), free200(:,2), 'ko');

 print -dpsc freeBoundedSVs.eps;
4 handin3/Code/train.m
@@ -1,7 +1,7 @@
-function [ model ] = train( data, labels, c, gamma )
+function [ model ] = train(knolldata, c, gamma )
 %% Trains the SVM on the given data using the given parameters.
 %% Returns libsvm model data which can then be used with svmpredict.

     commandstring = sprintf('-s 0 -t 2 -g %d -c %d', gamma, c);
-    model = svmtrain(labels, data, commandstring);
+    model = svmtrain(knolldata(:,3), knolldata(:,1:2), commandstring);
 end
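`train.m` formats `gamma` and `c` with `%d`. MATLAB switches to exponential notation when a `%d` argument is not an integer, so a fractional `gamma` still yields a valid number in the option string. The variant below uses `%g` to keep the string readable; it is a sketch of an alternative, not what the commit does. The example values are the ones reported for `knollC-train200` in the model-selection table.

```matlab
% Example values from the report's model-selection table (knollC-train200).
c = 1000;
gamma = 0.1;

% LIBSVM options: -s 0 selects C-SVC, -t 2 the RBF kernel,
% -g sets the kernel parameter gamma, -c sets the cost parameter C.
commandstring = sprintf('-s 0 -t 2 -g %g -c %g', gamma, c);
disp(commandstring);
```

With these values the option string is `-s 0 -t 2 -g 0.1 -c 1000`.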
BIN handin3/handin3.pdf
Binary file not shown.
67 handin3/handin3.tex
@@ -70,16 +70,75 @@ \section{Neural Networks}

 \section{Support Vector Machines}

+For this part of the assignment we chose to use the LIBSVM software.
+
 \subsection{Model Selection}

 Description (we normalized the data, then used the builtin function of libsvm, tried these values for gamma: [])

-Result: best parameters are:
+We did grid search using the following values of $\gamma: \{ 0.0001, 0.001, 0.01, 0.1, 1, 10, 100 \}$. This choice is based on what?
+
+LIBSVM has built-in functionality to perform $n$-fold cross validation
+given a command line option. To perform model selection we iterate
+through all combinations of $C$ and $\gamma$ and call a function
+called \texttt{crossval}, which invokes LIBSVM to perform a 5-fold
+cross validation on the current values of $C$ and $\gamma$. When
+performing $n$-fold cross validation, LIBSVM returns the accuracy,
+which we use to keep track of the configuration that gives the
+highest accuracy.
+
+%% Result: best parameters are:
+%% C: 1000, gamma: 0.100000 Cross Validation Accuracy = 98%
+%% C: 1000, gamma: 0.100000 Cross Validation Accuracy = 97.5%
+%% C: 100, gamma: 1.000000 Cross Validation Accuracy = 97.5%
+\begin{table}[!h]
+  \centering
+  \begin{tabular}{l | c | c | c }
+    \hfill & $C$ & $\gamma$ & Acc.\\\hline
+    \texttt{knollC-train100} & 1000 & 0.1 & 98\%\\
+    \texttt{knollC-train200} & 1000 & 0.1 & 97.5\%\\
+    \texttt{knollC-train400} & 100 & 1 & 97.5\%
+  \end{tabular}
+  \caption{Table of results for model selection using grid-search
+    showing the optimal values for $C$ and $\gamma$.}
+\end{table}

 Applied to the testdata, this gives the following results:

-\begin{figure}
-  \includegraphics[width=\textwidth]{Code/freeBoundedSVs.eps}
-  \caption{\texttt{knollC-train200} trained SVM model. Bounded support vectors are circled in green and free support vectors are circled in black.}
+\subsection{Inspecting the kernel expansion}
+
+\subsubsection{Visualization}
+
+Fig.~\ref{fig:freebounded} shows the plot of the \texttt{knollC-train200} data set, in which the support vectors are circled. The free support vectors are circled in black, and the bounded ones are circled in green. There are 87 bounded support vectors and just six free ones, for a total of 93 support vectors.
+
+\begin{figure}[!ht]
+  \centering
+  \includegraphics[width=.8\textwidth]{Code/freeBoundedSVs.eps}
+  \caption{\texttt{knollC-train200} data set with circled support vectors.}
+  \label{fig:freebounded}
 \end{figure}

+\subsubsection{Effect of the regularization parameter}
+
+%%Retrain model on \texttt{knollC-train200} using values of $C$ that are 100 times larger and 100 times smaller than the $C*$ found during model selection. How does it change?
+
+The file \texttt{regularization.m} performs the outlined procedure by first training the SVM model using the values for $C$ and $\gamma$ found during model selection. Then it trains two other models, one in which $C$ is multiplied by 100 and one in which $C$ is divided by 100.
+
+The most notable change is in the number of support vectors. There is a total of 93 support vectors for the ``original'' value of $C$, 87 of which are bounded. When $C$ is a hundred times larger, the number of support vectors drops to just 19, all of which are free. Conversely, when dividing $C$ by a hundred, the number of support vectors increases to 199, but again all of them are free.
+
+\subsubsection{Scaling behaviour}
+
+Table of free and bounded
+
+\begin{table}[h!]
+  \centering
+  \begin{tabular}{l | c | c}
+    \hfill & free & bounded\\\hline
+    \texttt{knollC-train100} & 5 & 60 \\
+    \texttt{knollC-train200} & 6 & 87 \\
+    \texttt{knollC-train400} & 12 & 153 \\
+  \end{tabular}
+  \caption{Table of free and bounded support vectors for the three data sets.}
+  \label{tab:knoll_free_bounded_SV}
+\end{table}
+
 \end{document}
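The model-selection loop described in the report text can be sketched as follows. The gamma grid is the one listed in the report. The C grid is an assumption, since the source does not give it. `crossvalStub` is a hypothetical stand-in for the repository's `crossval` function, so the loop runs without LIBSVM. With LIBSVM installed, the score would be the scalar accuracy that `svmtrain` returns when the `-v 5` option is given.

```matlab
% gamma grid from the report; the C grid is an assumption (not given in the source).
gammas = [0.0001 0.001 0.01 0.1 1 10 100];
Cs = [0.1 1 10 100 1000];

% Hypothetical stand-in for crossval(data, C, gamma). The toy score peaks at
% C = 100, gamma = 1 so that the search has a unique maximum to find.
crossvalStub = @(C, gamma) 100 - abs(log10(C) - 2) - abs(log10(gamma));

% Try every (C, gamma) combination and keep the one with the highest score.
bestAcc = -Inf;
bestC = NaN;
bestGamma = NaN;
for C = Cs
    for gamma = gammas
        acc = crossvalStub(C, gamma);
        if acc > bestAcc
            bestAcc = acc;
            bestC = C;
            bestGamma = gamma;
        end
    end
end

fprintf('C: %g, gamma: %g, accuracy: %g\n', bestC, bestGamma, bestAcc);
```

The strict comparison keeps the first configuration encountered when several reach the same accuracy, so the loop order decides ties.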