# Selecting and filtering spatial data


After loading the data as described in the Data Retrieval section, the next step is to analyse it. With large datasets, processing in Jupyter can be slow and the result sometimes cannot be visualized at all. Restricting the data to the features of interest keeps the work manageable.

As a first step, list the column headers (attributes) of the data, as introduced in the previous sections.

In [None]:
print(gdf.columns)

It is also possible to check the data type of each attribute. This matters when the attribute is used later, for example when numbers are stored as object (string) values rather than in a numeric type such as float64 (double).

In [None]:
print(gdf.dtypes)
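When a column holds numbers stored as text, converting it once can avoid having to remember which form to use in every comparison. The following is a minimal sketch on a small hypothetical data frame (the "code" column and its values are made up for illustration and are not part of the tutorial dataset). It uses "pandas.to_numeric" with errors="coerce", which turns entries that cannot be parsed into missing values.

```python
import pandas as pd

# Hypothetical sample: numeric codes stored as text
df = pd.DataFrame({"code": ["141", "133", "n/a", "321"]})
print(df["code"].dtype)

# Convert to numbers; entries that cannot be parsed become missing values
df["code_num"] = pd.to_numeric(df["code"], errors="coerce")
print(df["code_num"].dtype)

# The text column needs a quoted value, the numeric column a plain number
as_text = df[df["code"] == "141"]
as_number = df[df["code_num"] == 141]
print(as_text.index.tolist(), as_number.index.tolist())
```

Both filters select the same row; only the way the value is written differs.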

The values of an attribute can repeat across features or be unique.

**ex:** In a building dataset for a city, the usage of the buildings could take only a few distinct values, such as residential or non-residential, while the area differs from building to building, although some buildings may share the same area.

To find these patterns, use "unique". Given the attribute of interest, in this example "KS_IS", it returns the distinct values of that attribute.

In [None]:
unique_groups = gdf['KS_IS'].unique()
print(unique_groups)

Next, count how many times each unique value occurs. "groupby" groups the features by their unique values and "size" counts the features in each group. The output is sorted by the unique values.

In [None]:
groupcounts = gdf.groupby('KS_IS').size()
print(groupcounts)

The frequency of the unique values can also be counted with "value_counts". In this case the output is sorted in descending order of frequency.

In [None]:
groupcounts = gdf["KS_IS"].value_counts()
print(groupcounts)
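The difference in ordering between the two approaches can be seen on a small example. This is a sketch with a hypothetical "usage" column and made-up codes, not values from the tutorial dataset.

```python
import pandas as pd

# Hypothetical usage codes, for illustration only
df = pd.DataFrame({"usage": ["SV", "FG", "SV", "AB", "SV", "FG"]})

by_key = df.groupby("usage").size()     # sorted by group label
by_freq = df["usage"].value_counts()    # sorted by count, largest first

print(by_key.index.tolist())    # ['AB', 'FG', 'SV']
print(by_freq.index.tolist())   # ['SV', 'FG', 'AB']
```

The counts themselves are identical in both results; only the order of the rows differs.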

If only the count for one specific group is needed from a long list of unique values, it can be looked up like a value for a key in a dictionary:

In [None]:
sv_count = groupcounts['SV']
print(sv_count)

:::{note}
Here the format of the attributes, mentioned at the beginning of this chapter, becomes important:
:::

The unique values of both "CLC_st1" and "Biotpkt2018" are numbers, but the first is stored as object (string) and the second as float. The following cells show how to look up a value for each format:

- **String format**

In [None]:
group_1 = gdf.groupby('CLC_st1').size()
print(group_1)

In [None]:
# The key is quoted because the values are stored as strings
count = group_1['141']
print(count)

- **Float format**

In [None]:
group_2 = gdf.groupby('Biotpkt2018').size()
print(group_2)

In [None]:
# The key is a plain number because the values are stored as floats
count2 = group_2[1.000000]
print(count2)

````{note}
Because of float precision, the value displayed by "groupby" or "value_counts" is sometimes not exactly the same as the value stored in the data frame. Looking up the displayed value directly then raises an error.

```{figure} ../resources/14.png
```
````
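One possible way around this is to round the column to a known number of decimals before counting, so that the keys are predictable. The sketch below uses a hypothetical "score" column with made-up values; the choice of three decimals is an assumption and should match the precision that is meaningful for the data.

```python
import pandas as pd

# Hypothetical values; the stored number has more digits than are usually displayed
df = pd.DataFrame({"score": [18.0551157, 18.0551157, 1.0]})

counts = df.groupby("score").size()
print(18.055116 in counts.index)   # False: the shortened value is not the stored one

# Round first, then count and look up the rounded key
counts_rounded = df["score"].round(3).value_counts()
print(counts_rounded[18.055])      # 2
```

Rounding changes the values used as keys, so it is best applied to a copy or a temporary series rather than to the original column.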


Once the attributes and their unique values are familiar, the required features can be selected:

If only features with one specific value are needed, compare the attribute with that value directly:

- String format

In [None]:
filtered_data = gdf[gdf['KS_IS'] == 'FG']
print(filtered_data)

In [None]:
filtered_data = gdf[gdf['CLC_st1'] == '133']
print(filtered_data)

- Float format

In [None]:
filtered_data = gdf[gdf['Biotpkt2018'] == 1.000000]
print(filtered_data)

````{note}
As mentioned in the previous note, float precision can make the filter return an empty data frame (meaning no feature has exactly this value), even though "groupby" or "value_counts" shows that features with this value exist.

```{figure} ../resources/15.png
```
````

In [None]:
# An exact comparison may return an empty data frame if the stored value has more digits
filtered_data = gdf[gdf['Biotpkt2018'] == 18.055116]
print(filtered_data)
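When an exact comparison on a float column returns nothing, comparing within a small tolerance is a common alternative. The following sketch uses "numpy.isclose" on a hypothetical "score" column; the values and the tolerance of 1e-6 are assumptions chosen for illustration, not settings taken from the tutorial dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values; pandas displays floats with six decimals by default
df = pd.DataFrame({"score": [18.0551157, 1.0, 18.0551157, 7.25]})

# Exact comparison with the shortened value finds nothing
exact = df[df["score"] == 18.055116]

# Comparison within an absolute tolerance finds the intended rows
mask = np.isclose(df["score"], 18.055116, rtol=0, atol=1e-6)
close = df[mask]

print(len(exact), len(close))   # 0 2
```

Setting rtol=0 makes the absolute tolerance the only criterion, which keeps the behaviour easy to reason about.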

If features with more than one specific value are needed, pass the values as a list to "isin":

In [None]:
filtered_data = gdf[gdf['CLC_st1'].isin(['133', '321', '411'])]
print(filtered_data)

For numeric attributes such as area, it is also useful to know the range of the data.

The following code finds the minimum and maximum values of the attribute.

In [None]:
min_value = gdf['Shape_Area'].min()
max_value = gdf['Shape_Area'].max()
print("minimum:", min_value)
print("maximum:", max_value)
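Besides the minimum and maximum, a quick summary can help when choosing a threshold. This is a minimal sketch with made-up area values; "describe" is a standard pandas method, while the threshold of 1000 is only an example.

```python
import pandas as pd

# Hypothetical areas, for illustration only
areas = pd.Series([120.5, 980.0, 15000.0, 432.1, 64.0], name="Shape_Area")

# count, mean, std, min, quartiles and max in one call
stats = areas.describe()
print(stats)

# Share of features that fall below a candidate threshold
share_small = (areas < 1000).mean()
print(share_small)
```

A share close to 1 means the threshold removes very little, while a share close to 0 means most of the data is filtered out.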

With this information about the attributes and their values, it is possible to decide which part of the data is of interest.

For example, suppose only the features with an area below 1000 are needed.

First, define the filter on the attribute containing the area.

In [None]:
filter_db = gdf[gdf['Shape_Area'] < 1000]

Then plot the result.

In [None]:
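import matplotlib.pyplot as plt  # needed for plt.title and plt.show

# Plot the filtered features and add a title
filter_db.plot()
plt.title("Dresden - Shapes with Area < 1000")
plt.show()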


In [13]:
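# plt.figure(figsize=(50, 50), dpi=500) has no effect here because plot() creates its own figure;
# pass the size to plot() instead, e.g. filter_db.plot(figsize=(10, 10))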

A widget can also be used to visualize the part of the data of interest interactively.

For this purpose, import the ipywidgets library and its interact function.

In [None]:
import ipywidgets as widgets
from ipywidgets import interact

Then define a function with "def" that displays the features in the range of interest.

In [15]:
def ShapeArea(value):
    # Keep only the features whose area does not exceed the slider value
    filtered_gdf = gdf[gdf['Shape_Area'] <= value]

    # Draw the remaining features
    filtered_gdf.plot()

The interactive part uses a slider to display the features within the chosen range.

In the following code, "interact" creates the slider and links it to the user-defined function "ShapeArea". The slider values run from the lowest to the highest area in steps of 0.1. In the second line, the maximum is set to an area of 10000 to limit the data. Moving the slider changes the range that is displayed.

In [None]:
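# Full range of areas, from the minimum to the maximum:
# interact(ShapeArea, value=widgets.FloatSlider(min=min_value, max=max_value, step=0.1));
# Cap the slider at 10000 to limit the data being drawn
interact(ShapeArea, value=widgets.FloatSlider(min=min_value, max=10000, step=0.1));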