This is a description of the idea I implemented for the Averaging GPS Segments problem. More information can be found on the problem’s website, so we’ll skip introducing the problem and go straight to describing the proposed method for solving it.
Though I have more ideas for improving the accuracy (which will probably take some time to implement), the method currently achieves about 66% training accuracy according to the problem website. The implementation is available on GitHub in R: Implementation
I’m going to describe the method in six main sections:
- Initial idea
- Choosing the properties of the model
- Outlier segment detection
- Implementation and sample results
- Conclusion
- Citations
The main idea behind this method is to view each set from another perspective, in which the connections between the points of the segments are removed. We are then faced with 3-dimensional tabular data: the first dimension identifies the segment in which the point lies (regarded as a grouping variable), and the other two dimensions are the point’s x and y coordinates.
If we view the data from such a perspective, then a simple linear regression, or more precisely a piecewise linear regression, is probably a good approximation of the correct segment. However, choosing the start and end points, choosing the number of knots, and detecting the outlier segments remain significant challenges.
The simple linear regression and the piecewise linear regression (linear spline) with some outlier removals showed some good approximations:
Using the linear spline model was the main reason for using the R language here, as a comparable linear spline model isn’t available in Python (it is somewhat available as interp1d, but that is so limited, offering no parameter tuning or choice of the number of knots, that it is useless for us).
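As a minimal sketch of such a piecewise linear fit, base R’s `splines::bs` with `degree = 1` can be used inside `lm` (an assumption for illustration; the repository may use a different spline interface, and the data here is a hypothetical toy segment):

```r
library(splines)  # ships with base R; provides bs() for B-spline bases

# Hypothetical toy segment: noisy points along a path with a bend at x = 0.5
set.seed(1)
x <- sort(runif(100))
y <- ifelse(x < 0.5, 2 * x, 1 - 0.5 * (x - 0.5)) + rnorm(100, sd = 0.02)

# Piecewise linear regression: degree-1 B-spline basis with interior knots
knots <- c(0.25, 0.5, 0.75)
fit <- lm(y ~ bs(x, knots = knots, degree = 1))

# Evaluate the fitted polyline at the endpoints and knots to get a segment
grid <- c(min(x), knots, max(x))
pred <- predict(fit, newdata = data.frame(x = grid))
```

Evaluating the fit only at the knot locations turns the regression line back into a polyline with one vertex per knot, which is the shape of the desired output segment.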
As the results showed, using splines for the approximation seems a good choice, but we should answer these questions to be able to complete the method and make it usable:
- How to choose the predictor and the target based on y and x: which choice gives a better result?
- How to set the start and end points of the output segments based on the linear spline?
- How to choose the number of knots (points in segments) for the linear spline?
These questions are answered in the next section.
We divide this section into three parts, each answering one of the questions posed at the end of the previous section.
For choosing the predictor, the idea is to pick the variable with the higher variance. As the input data is normalized, the variable with the higher variance yields higher accuracy in predicting the other variable and therefore (holding other factors such as the start and end points constant) a more exact approximation of the ground truth for the segments.
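As a sketch, this choice reduces to a simple variance comparison (the function name is mine, not from the repository):

```r
# Pick the predictor as the coordinate with the larger variance; the
# remaining coordinate becomes the regression target.
# point_x, point_y: coordinates of all points in the set, segments pooled.
choose_predictor <- function(point_x, point_y) {
  if (var(point_x) >= var(point_y)) "x" else "y"
}

# Example: points spread widely along x but narrowly along y
choose_predictor(c(0, 1, 2, 3), c(0, 0.1, 0, 0.1))  # → "x"
```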
For example, in the first set,
Here, for simplicity, we suppose the variable x is chosen as the predictor; the start and end points are then set as:
$$start = \overline{xs} + \hat{se}(xs)$$ $$end = \overline{xe} - \hat{se}(xe)$$
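Reading $$\hat{se}$$ as the standard error of the mean, these estimates can be sketched as follows (helper names are mine; the assumption is that the predictor increases from start to end, so both estimates are pulled slightly inward from the raw means):

```r
# Standard error of the mean of a vector
std_err <- function(v) sd(v) / sqrt(length(v))

# xs: predictor coordinate of each segment's first point
# xe: predictor coordinate of each segment's last point
estimate_endpoints <- function(xs, xe) {
  c(start = mean(xs) + std_err(xs),   # pulled inward from mean(xs)
    end   = mean(xe) - std_err(xe))   # pulled inward from mean(xe)
}

estimate_endpoints(c(0, 0.1, 0.2), c(0.9, 1.0, 1.1))
```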
We choose the number of knots used in the linear spline according to the following algorithm:
$$m\_knots = \text{average of the count of points in each segment of the set}$$ $$knots = \text{sequence from } start \text{ to } end \text{ by } m\_knots$$
After fitting the linear spline with the knots as computed above, we evaluate the fitted model at the knots to obtain the predicted segment.
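One way to read the knot rule (an interpretation of mine; the repository may differ) is that the knots are equally spaced between the start and end estimates, with as many knots as the average segment length in points:

```r
# seg_id: segment id of every point in the set (long format)
# start, end: endpoint estimates on the predictor axis
make_knots <- function(seg_id, start, end) {
  m_knots <- round(mean(table(seg_id)))  # average points per segment
  seq(start, end, length.out = m_knots)  # equally spaced knots
}

# Two segments with 4 and 6 points: average is 5 knots
make_knots(rep(1:2, c(4, 6)), 0, 1)  # → 0, 0.25, 0.5, 0.75, 1
```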
After doing some tests using this algorithm, I concluded that using simple linear regression on the data for the sets in which
As viewing the training set suggests, in cases like set 9 and set 1, which are depicted below, outlier detection can be a very good approach to optimize the prediction.
I chose to implement the DBScan clustering algorithm to detect the outlier segments, as it flags outliers according to the density of points in a region, which seems rational here. The procedure is:

1. Initialize $$eps = .05$$ and $$count\_of\_groups = \text{number of segments in the input set}$$.
2. Perform DBScan on the set with eps as $$eps$$ and min points as $$count\_of\_groups$$. Save the ids of segments with at least one outlier point in the $$outlier\_segments$$ list.
3. If the length of the $$outlier\_segments$$ list is more than $$count\_of\_groups / 2$$: set $$eps = eps + .01$$ and return to step 2.
4. Else: remove the outlier segments from the dataset and move on.
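The loop above can be sketched in base R without an external DBScan package, by using DBScan’s noise definition directly: a point is noise if it is neither a core point (at least min points neighbours within eps, itself included) nor within eps of a core point. Function names are mine:

```r
# Flag DBScan noise points (no clustering needed, only the noise test)
dbscan_noise <- function(pts, eps, min_pts) {
  d <- as.matrix(dist(pts))                  # pairwise Euclidean distances
  neigh <- d <= eps                          # eps-neighbourhoods (incl. self)
  core <- rowSums(neigh) >= min_pts          # core points
  has_core_nb <- as.vector(neigh %*% core) > 0
  !core & !has_core_nb                       # noise: not core, no core nearby
}

# Grow eps until no more than half of the segments are flagged as outliers
detect_outlier_segments <- function(pts, seg_id, eps = 0.05) {
  count_of_groups <- length(unique(seg_id))
  repeat {
    noise <- dbscan_noise(pts, eps, min_pts = count_of_groups)
    outlier_segments <- unique(seg_id[noise])
    if (length(outlier_segments) <= count_of_groups / 2)
      return(outlier_segments)
    eps <- eps + 0.01                        # relax density threshold, retry
  }
}

# Toy set: two dense nearby segments plus one scattered far-away segment
pts <- rbind(cbind(seq(0, 0.045, length.out = 10), 0),
             cbind(seq(0, 0.045, length.out = 10), 0.005),
             cbind(1:5, 1:5))
seg_id <- rep(1:3, c(10, 10, 5))
detect_outlier_segments(pts, seg_id)  # → 3
```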
Sample result of the algorithm: Red points correspond to the detected outlier segments, which are discarded before fitting the linear spline (or linear regression).
Note: clearly, outlier detection must be the first phase of the algorithm after reading the input (before choosing the predictor, the number of knots, etc.).
Detecting outliers using linear regression: another idea was to detect outliers via the high-leverage points and outliers of a simple linear regression fit on the input dataset. After implementation, this led to a lower training score than the DBScan method, and hence was not used.
The implementation of my proposed method is available on GitHub: AveragingGPSSegments Here are some useful methods that are implemented and can be used: (assuming you are in the project directory)
source('functions/read_segments.R')
dataset = read_segment('training_data/0.csv')
print(head(dataset))
source('functions/draw_solution.R')
draw_route(1)
Getting the complete output corresponding to the training data in the training_data folder:
source('functions/save_solution.R')
save_predicted_segments('result.txt')
And then uploading the result on the training webpage:
The current accuracy achieved by the implementation is 66.32%. I think that’s already not bad, given that the data has a high amount of irreducible error; still, I believe the accuracy can be enhanced by better approximations of the outliers, start points and end points.
As the previous section suggests, the proposed method achieves a reasonable training score and performs nicely, considering the large irreducible error in the data. We used a combination of linear spline and linear regression models to predict the approximate true road segments, with some tricks for choosing the start and end points, and DBScan for detecting outliers.
Clearly, the DBScan-based step of the proposed method is slow, as it may repeat several times. There are also many possible improvements for future work:
- Smarter outlier segment detection:
- Using other outlier detection methods, such as kNN-based detection.
- Optimizing the initial $$eps$$ guess for DBScan according to other factors of the dataset, such as the number of segments and the number of points in each segment.
- Smarter choice of the start and end points
- Smarter choice of knots used in the linear spline, and the tradeoff between using linear spline and linear regression
I think this method is a novel approach to this problem; I haven’t seen anyone use it for map construction. However, I should cite the papers introducing “Linear Splines” and “DBScan”, as they’re the backbone of this method: