# Tutorial 3. Performing queries
In this tutorial, we will query data from a layer. Let's set up a simple layer object.

In [None]:
import numpy as np
from pyxc.core.layer import Layer
from pyxc.core.processor.arrays import column_parser
from pyxc.core.container import Container2D
from pyxc.core.loader import ImageLoader, XYDLoader
from pyxc.transform.homography import Homography

EBSD = np.genfromtxt(
    "./data/SiC_in_NiSA.ctf", dtype=float, skip_header=15, delimiter="\t", names=True
)

layer_ebsd = Layer(
    data=column_parser(EBSD, format_string="dxydddddddd"),
    container=Container2D,
    dataloader=XYDLoader,
    transformer=Homography,
)

You have two ways to query data: by a single coordinate or by multiple coordinates.

The first option is more flexible: you receive the correlation results and can run your own analysis on them. The second option is more convenient but more limited.

Let's see!

## Single point query
You can query the data at a single coordinate. In addition to the columns already stored in the container object, the result includes several extra columns:
1. query_index: for internal reference. This is covered later in this tutorial.
2. distance: Euclidean distance between the query coordinate and the nearby data point.
3. x-coordinates: x coordinate of the query
4. y-coordinates: y coordinate of the query

Also, note that the result contains several x- and y-related columns. Read this carefully:
1. x: distortion-corrected x
2. y: distortion-corrected y
3. x_raw: initially supplied x value, before correction.
4. y_raw: initially supplied y value, before correction.
5. x-coordinates: x of the query
6. y-coordinates: y of the query

In [None]:
layer_ebsd.query(5, 5)

There are two important options, `cutoff` and `output_number`. If the nearest data points are farther from the query coordinate than the cut-off, the query might return no results. For example,

In [None]:
layer_ebsd.query(5, 5, cutoff=0.0001)

You can also retrieve more than one data point per query by explicitly specifying the `cutoff` and `output_number` parameters.

In [None]:
layer_ebsd.query(x=5, y=5, cutoff=5, output_number=5)
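
To make the roles of the two options concrete, here is a minimal NumPy sketch of a nearest-neighbour lookup limited by a cut-off radius and a maximum number of results. It is not the library's implementation: the helper `nearest_within_cutoff` and the toy grid are hypothetical, and the sketch only assumes that `cutoff` is a distance limit and `output_number` caps how many neighbours come back.

```python
import numpy as np


def nearest_within_cutoff(points, x, y, cutoff=1.0, output_number=1):
    """Return (indices, distances) of up to `output_number` points within `cutoff` of (x, y).

    Hypothetical helper for illustration only; not part of pyxc.
    """
    # Euclidean distance from the query coordinate to every stored point
    distances = np.hypot(points[:, 0] - x, points[:, 1] - y)
    # Sort candidates from nearest to farthest
    order = np.argsort(distances)
    # Keep only candidates inside the cut-off radius, then cap the count
    order = order[distances[order] <= cutoff][:output_number]
    return order, distances[order]


# Toy 3 x 3 grid with unit spacing (illustrative data, not the EBSD map)
grid = np.array([[gx, gy] for gx in range(3) for gy in range(3)], dtype=float)

# Only the grid point at (1, 1) lies within 0.5 of the query
idx, dist = nearest_within_cutoff(grid, 1.1, 1.0, cutoff=0.5, output_number=5)
print(idx, dist)

# No grid point lies within 0.1 of the query, so both arrays are empty
idx, dist = nearest_within_cutoff(grid, 1.5, 1.5, cutoff=0.1)
print(idx, dist)
```

Asking for five neighbours in the first call still yields one row, because the cut-off is applied before the count limit.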

## Multi point query
Let's do it more conveniently! You can retrieve data from multiple points at once. If the data set is large, `execute_queries` might take one or two minutes. This is perfectly normal: it is preparing parallel execution.

In [None]:
xs = [4.1, 4.2, 4.3]
ys = [4.5, 4.6, 4.7]
layer_ebsd.execute_queries(xs, ys)

### Use the `query_index` column to filter out uncorrelated points!

<div class="alert alert-warning">

Warning

Read the code below very carefully. There is no guarantee that every point you provide yields a correlation result. If a query point is too far from any data point (beyond the cut-off distance), you will not get a result for it. You need to filter out the points that were not hit by using the `query_index` column.

</div>
This is especially useful when you compare correlation results with serialized data.

Let's assume we have `xs`, `ys`, and hardness values. For example, the data below says that the hardness at (4.1, 4.5) is 100 MPa. The 4th point (-10, -10, 150) is deliberately placed where no data exists.

In [None]:
xs = np.array([4.1, 4.2, 4.3, -10])
ys = np.array([4.5, 4.6, 4.7, -10])
hd = np.array([100, 200, 110, 150])
result = layer_ebsd.execute_queries(xs, ys)

In [None]:
result

Notice that the provided data has a length of 4, but the returned data has a length of only 3, so the two cannot be plotted against each other directly. This is where `query_index` plays a significant role: it can be used to drop the failed points from the initially provided data, as shown below:

In [None]:
xs_refined = xs[result["query_index"]]
ys_refined = ys[result["query_index"]]
hd_refined = hd[result["query_index"]]
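
The indexing step can be tried without a layer object. The sketch below assumes that `query_index` holds the integer positions of the inputs that produced a hit; the array `query_index` here is a hand-written stand-in for `result["query_index"]`, not real query output.

```python
import numpy as np

# Inputs in their original order; the last point is far from any data
xs = np.array([4.1, 4.2, 4.3, -10.0])
hd = np.array([100, 200, 110, 150])

# Hypothetical stand-in for result["query_index"] (assumed to be integer positions)
query_index = np.array([0, 1, 2])

# Integer-array indexing keeps only the inputs that have a matching result row
hd_refined = hd[query_index]

# The complement gives the positions of the inputs that were dropped
missed = np.setdiff1d(np.arange(len(xs)), query_index)

print(hd_refined)  # [100 200 110]
print(missed)      # [3]
```

Listing the missed positions is a quick way to check which measurements fell outside the cut-off before plotting.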

Now you can use the query result together with your own hardness data, for example:

In [None]:
import matplotlib.pyplot as plt

plt.scatter(result["BC"], hd_refined)

However, one caveat is that a multi-point query cannot handle an `output_number` other than 1 on its own. If you request more than one result per point, you will get an error.

In [None]:
xs = [4.1, 4.2, 4.3]
ys = [4.5, 4.6, 4.7]
layer_ebsd.execute_queries(xs, ys, output_number=2)

You can specify a reducer to handle this situation. A Reducer object is built from a `List[Tuple[Callable, List['ColumnNames']]]`. Each callable should accept a 1-dimensional numpy array and return a single value, such as `np.std` or `np.mean`.

The Reducer object can also be used for a single point query. It is useful for statistical analysis of the results.

In [None]:
from pyxc.core.processor.reducer import Reducer

reducer_obj = Reducer([(np.mean, ["BS", "Phase"]), (np.std, ["BS", "Phase"])])

Then pass the reducer to the query. Note that the result now contains new columns such as "Phase_std".

In [None]:
xs = [4.1, 4.2, 4.3, -10]
ys = [4.5, 4.6, 4.7, -10]
layer_ebsd.execute_queries(xs, ys, output_number=2, reducer=reducer_obj)
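
As a rough picture of what a reducer specification does, the sketch below applies a list of `(callable, column names)` tuples to the neighbours found for one query point. It is not the library's `Reducer` class: `apply_reducers` is a hypothetical helper, and the `<column>_<function name>` naming is an assumption extrapolated from the "Phase_std" column mentioned above.

```python
import numpy as np


def apply_reducers(columns, reducers):
    """Collapse each listed column to one value per reducing function.

    Hypothetical helper for illustration only; not part of pyxc.
    """
    reduced = {}
    for func, names in reducers:
        for name in names:
            # np.mean.__name__ is "mean", so "Phase" becomes "Phase_mean"
            reduced[f"{name}_{func.__name__}"] = float(func(columns[name]))
    return reduced


# Two neighbours returned for one query point (illustrative values)
neighbours = {"BS": np.array([120.0, 130.0]), "Phase": np.array([1.0, 2.0])}

reducers = [(np.mean, ["BS", "Phase"]), (np.std, ["BS", "Phase"])]
print(apply_reducers(neighbours, reducers))
```

Each query point ends up with exactly one row again, which is why a reducer makes `output_number` values above 1 usable in a multi-point query.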

## Query performance tip
Please use a small `cutoff` and a small `output_number`. As you can see, reducing the cut-off parameter improves performance by almost 5 times.

In [None]:
%%timeit
layer_ebsd.query(5, 5, cutoff=10, output_number=1000)

In [None]:
%%timeit
layer_ebsd.query(5, 5, cutoff=1, output_number=1000)

In [None]:
%%timeit
layer_ebsd.query(5, 5, cutoff=1, output_number=10)

In [None]:
%%timeit
layer_ebsd.query(5, 5, cutoff=1, output_number=1)