# Technical report on the SEG contest
This report compiles the main information on the data used for the contest. It draws on the following sources:
- The 2007 paper from Dubois et al. (Dubois, M. K., Bohling, G. C., and Chakrabarti, S. (2007) Comparison of four approaches to a rock facies classification problem. Computers & Geosciences 33(5):599-617.)
- The 2016 paper from Brendon Hall (Hall, B. (2016) Facies classification using machine learning. The Leading Edge 35(10):906-909.)
- The results from our first submission.

Tentative interpretations and suggestions on how to create the 'best' model ("I know models, terrific models, but mine is the best model. So sad lyin' Ted and crooked Hillary use biased models" anonymous, 2016) are proposed.

## Nature of the dataset and resulting issues

The dataset is well described in the Dubois et al. paper. Here are a few quotes giving insights on the nature and organization of the data:
- "the term facies as used throughout this paper means lithofacies (lithology)"
- "Formation or member tops (Fig. 2) segregate the Council Grove into alternating NM and M half-cycles, fundamentally different depositional environments. A NM–M depositional environment indicator variable is assigned to intervals on the basis of the depth of the top and base of stratigraphic formations or members. Relative position (RPos) is the position of a particular sample with respect to the base of its respective NM or M (formation/member) interval."
- "facies in the Council Grove NM–M cycles have predictable vertical stacking patterns (Dubois et al., 2003a, 2006)"
- "Although facies definition is theoretically objective, with facies having distinctly defined boundaries in basic rock type and texture, in practice, the assignment of facies in cores is a subjective process in the absence of objective measurements"
- "Furthermore, facies can vary at the centimeter scale but facies are assigned at the half-foot (0.15 m) scale."
- "...wire-line log measurements recorded at half-foot (0.15 m) intervals are actually weighted averages of properties over a much larger interval (several feet, meter),..."
- "Training data includes 3647 examples at half-foot (0.15 m) intervals having known, core-defined facies that are associated with feature vectors of either six or seven elements (GR, PHI, N–D, Rta, NM–M, Rpos, and PE when available)."
- "The subset chosen is believed to be the minimum training set required to represent all major facies and their variations across the study area."
- "An alternative approach would have been to cross validate on a well-by-well basis, removing each well in turn from the training data and then predicting on that well. The classifiers are likely to have been less successful overall had the well-by-well approach been taken because the training set would not include examples from that particular locale, geologic and well bore setting. We do not believe that a particular classifier had a distinct advantage over another in the random-split approach and success relative to one another is what we investigated. We chose random splitting for simplicity in comparing the large number of classifiers."
- "Having a facies classification that is close to the actual (within one facies in the continuum) may be deemed satisfactory because properties of the adjacent facies are relatively close to those of the actual facies (Dubois et al., 2006). In addition to being correct or nearly correct, it is important that the number of a particular facies predicted by any classifier be relatively close to that in the overall population in order that the ultimate model accurately represents the volumetric distribution of facies. Because the main gas pay facies (F6, F7 and F8) are the most important in terms of gas storage and flow capacity, their accurate representation is critical."
- "F2 (shaley fine siltstone) is not considered one facies from F3 (M siltstone) and vice versa, mainly due to differences in depositional environment; F8 (grainstone) is considered one facies from both F7 (packstone) and F6 (finecrystalline dolomite), on the basis of physical properties; and F7 is considered one facies from F5 (wackestone), F6 and F8, on the basis of physical properties."
- "...considering the fuzzy nature of the facies and inherent error in the data, being close (within one facies) is nearly as good as being precise."
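
The "within one facies" criterion quoted above can be turned into a score. The sketch below is a minimal illustration, assuming facies are coded as integers 1 to 9 (F1 to F9). The `ADJACENT` table and the `within_one_accuracy` helper are hypothetical names; the table only contains the neighbor pairs explicitly stated in the quote (F8-F7, F8-F6, F7-F5, F7-F6), not the full adjacency used in the paper.

```python
import numpy as np

# Partial adjacency, built only from the pairs stated in the quote above.
# Illustrative assumption: facies without an entry have no neighbors here.
ADJACENT = {5: {7}, 6: {7, 8}, 7: {5, 6, 8}, 8: {6, 7}}


def within_one_accuracy(y_true, y_pred, adjacent=ADJACENT):
    """Fraction of samples predicted exactly or as an adjacent facies."""
    y_true = np.asarray(y_true).tolist()
    y_pred = np.asarray(y_pred).tolist()
    hits = [
        t == p or p in adjacent.get(t, set())
        for t, p in zip(y_true, y_pred)
    ]
    return float(np.mean(hits))


# Made-up labels, for illustration only.
y_true = [8, 7, 6, 2, 5]
y_pred = [7, 7, 8, 3, 6]

exact = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
close = within_one_accuracy(y_true, y_pred)
```

On this made-up example, the exact accuracy is 0.2 while the within-one score is 0.6: F2 predicted as F3 still counts as an error, in line with the quote.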


#### A series of conclusions can be drawn from these quotes:
- Facies were described by geologists and are not completely discrete variables. Transitions between facies are progressive, and facies are interlayered at a scale finer than the half-foot sampling we are working with.
- Rock physical properties are weighted averages. This implies further fuzziness in the data.
- The k-fold cross-validation we have implemented sets aside one hole at a time. However, every hole is needed to get a complete, representative picture of the data, so the scores obtained this way might not be representative of the overall model performance. Conversely, a random shuffle split gives high scores because of overfitting: neighboring, strongly correlated samples from the same hole end up in both the training and the test sets.
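
The difference between the two validation strategies can be checked directly on the splits. The following is a minimal sketch using scikit-learn's `LeaveOneGroupOut` and `ShuffleSplit`; the data is random and the `wells` array (one well id per sample) is a hypothetical stand-in for the contest's well names.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, ShuffleSplit

rng = np.random.default_rng(0)
n_wells, n_per_well = 4, 50
wells = np.repeat(np.arange(n_wells), n_per_well)  # hypothetical well id per sample
X = rng.normal(size=(wells.size, 3))               # made-up log features


def shared_wells(train_idx, test_idx):
    """Number of wells that contribute samples to both train and test."""
    return len(set(wells[train_idx]) & set(wells[test_idx]))


# Leave-one-well-out: each test fold is one complete well.
logo_splits = list(LeaveOneGroupOut().split(X, groups=wells))
logo_overlap = [shared_wells(tr, te) for tr, te in logo_splits]

# Random shuffle split: samples are drawn regardless of their well.
shuffle = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
shuffle_splits = list(shuffle.split(X))
shuffle_overlap = [shared_wells(tr, te) for tr, te in shuffle_splits]
```

With leave-one-well-out, no well appears on both sides of a split, whereas every shuffle-split test fold shares wells with its training set.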


## Results of the first submission and lessons learned
The "true" (or interpreted) versus predicted facies from the first submission are presented below. Overall, true and predicted logs look similar but are far from identical. Intervals are longer (or more continuous) in the true logs than in the predicted logs, which show more variation, but this difference is not extreme.
In terms of classes, the same classes are present in roughly the same proportions in the true and predicted logs. However, some colors appear in the predicted log and are nearly absent from the true log, and vice versa.
Drillholes are highly dissimilar in terms of facies representation. This suggests that the dataset is too small to be truly representative, which will lead to overfitted and biased prediction models. Parameter tuning should take that into account (e.g. `min_samples_leaf` or `min_samples_split` in scikit-learn tree-based classifiers).
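
One way to take the dissimilarity between drillholes into account during tuning is to score each parameter setting on wells that were left out of training. The sketch below assumes scikit-learn and uses made-up data; the `wells` array, the parameter grid and the choice of a random forest are illustrative assumptions, not the settings of our submission.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)
wells = np.repeat(np.arange(4), 60)                # hypothetical well id per sample
y = rng.integers(0, 3, size=wells.size)            # made-up facies labels
X = rng.normal(size=(wells.size, 4)) + y[:, None]  # made-up log features

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={
        "min_samples_leaf": [1, 5, 20],   # larger values give smoother trees
        "min_samples_split": [2, 10],
    },
    cv=GroupKFold(n_splits=4),            # folds never split a well
)
search.fit(X, y, groups=wells)

best = search.best_params_
```

Because the folds follow the wells, the selected parameters are those that generalize best to an unseen well rather than those that fit neighboring samples best.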
Some patterns can be observed:
- SS is usually in the middle of continental intervals and not in direct contact with marine facies
- SiSh and BS are usually at the center of marine intervals and not in contact with continental facies
- MS, WS and PS are often interlayered, sometimes in rather fine 'beds'. The same goes for CSiS and FSiS.

![](fig\real and predict logs submission_1.png?raw=true)

When we look at the confusion matrix (below), we see that there is no confusion between marine and non-marine facies. Within each group, a series of 'most common confusions' can be drawn:

Non-marine:

SS &rarr; CSiS

CSiS &rarr; FSiS

FSiS &rarr; CSiS

Marine:

SiSh &rarr; WS

MS &rarr; WS (or PS)

WS &rarr; PS (or SiSh or MS)

D &rarr; PS

PS &rarr; WS (or D)



![](fig\Confusion matrix submission_1.PNG?raw=true)

These 'most common confusions' occur between facies that are often in contact within the logs. This further shows the progressive contacts and the fuzziness in the classes.
Improving the discrimination between these commonly confused facies is probably part of the path toward better predictions.
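
The 'most common confusion' of each facies can be read off the confusion matrix programmatically. The following is a minimal sketch with scikit-learn's `confusion_matrix`; the labels are made up for illustration (three non-marine facies only) and are not the results of the submission.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["SS", "CSiS", "FSiS"]
# Made-up true and predicted facies, for illustration only.
y_true = ["SS", "SS", "SS", "CSiS", "CSiS", "CSiS", "FSiS", "FSiS", "FSiS"]
y_pred = ["SS", "CSiS", "CSiS", "CSiS", "FSiS", "FSiS", "FSiS", "CSiS", "FSiS"]

# Rows are true facies, columns are predicted facies.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Ignore correct predictions and keep the largest off-diagonal entry per row.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)

top_confusion = {}
for i, name in enumerate(labels):
    if off_diag[i].sum() > 0:
        top_confusion[name] = labels[int(off_diag[i].argmax())]
```

Here `top_confusion` maps each true facies to the facies it is most often mistaken for, which gives the same kind of list as the one above.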

## Conclusions
Raw predictor variables are not sufficient for obtaining satisfactory predictions. More discrimination power is needed, especially between facies that are similar in terms of feature vectors and in contact within drillholes. Three avenues for improvement are proposed:

1) Improvement of the predictor variables using feature engineering. Wavelet, entropy and gradient variables could improve the prediction power by giving insight into the texture of the data and thus into its fuzziness. Moving averages (and/or standard deviation, min, max and other statistical measurements) could give information on the neighboring samples in boreholes and thus help render more homogeneous/continuous logs.

2) A two-step classification would make further use of the geology within a borehole. The first step would be an initial classification, and the second step would use variables derived from that first classification (neighboring facies, most dominant facies in the drillhole, presence of a specific facies close by, etc.). However, a two-step classification might increase bias: an interval wrongly classified in the first step might negatively influence the second step and lead to the misclassification of many intervals within the borehole. Moreover, it seems that the units with the most 'unique' distributions in drillholes are the ones that are already the easiest to discriminate (i.e., BS and SiSh), except for SS.

3) The choice of algorithm might improve the results considerably. Here, the dataset is fairly small (thousands of samples). SVMs seem to give better results according to submissions from other teams. If we introduce more variables, including fuzzier and/or noisier ones, other algorithms such as random forest or gradient boosting might prove more powerful. Also, because of the non-stationarity of the data between drillholes and the non-representativeness of the dataset, caution should be taken when tuning parameters. Reducing the bias rather than the variance should be favored.
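
The first two leads both rely on information from neighboring samples within a borehole. The sketch below shows one possible way to build such variables with pandas. The column names (`well`, `depth`, `GR`, `facies_step1`) and the two helper functions are hypothetical, the values are made up, and rows are assumed to be sorted by depth within each well.

```python
import pandas as pd


def add_window_features(df, col, window=3, well_col="well"):
    """Add a centered rolling mean/std and a sample-to-sample difference of `col`.

    Statistics are computed per well, so windows never span two wells.
    """
    out = df.copy()
    grouped = df.groupby(well_col)[col]
    out[f"{col}_mean"] = grouped.transform(
        lambda s: s.rolling(window, center=True, min_periods=1).mean()
    )
    out[f"{col}_std"] = grouped.transform(
        lambda s: s.rolling(window, center=True, min_periods=1).std()
    )
    out[f"{col}_diff"] = grouped.diff()  # simple gradient proxy
    return out


def add_neighbor_predictions(df, pred_col, well_col="well"):
    """Add the first-step predicted facies of the samples just above and below."""
    out = df.copy()
    grouped = df.groupby(well_col)[pred_col]
    out[f"{pred_col}_above"] = grouped.shift(1)
    out[f"{pred_col}_below"] = grouped.shift(-1)
    return out


# Made-up logs for two wells; `facies_step1` stands for a first-step prediction.
logs = pd.DataFrame({
    "well": ["A", "A", "A", "B", "B", "B"],
    "depth": [1.0, 1.5, 2.0, 1.0, 1.5, 2.0],
    "GR": [10.0, 20.0, 30.0, 100.0, 80.0, 60.0],
    "facies_step1": [1, 2, 2, 8, 8, 7],
})

features = add_neighbor_predictions(add_window_features(logs, "GR"), "facies_step1")
```

The first and last samples of each well get missing values for the difference and neighbor columns, which a second-step classifier would need to handle (for example by imputation). Grouping by well is a deliberate choice: it prevents the top of one well from being smoothed with the bottom of another.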