/
languages.pymd
183 lines (143 loc) · 4.98 KB
/
languages.pymd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
# Scala and SQL
One of the great powers of RasterFrames is the ability to express computation in multiple programming languages. The content in this manual focuses on Python because it is the most commonly used language in data science and GIS analytics. However, Scala (the implementation language of RasterFrames) and SQL (commonly used in many domains) are also fully supported. Examples in Python can be mechanically translated into the other two languages without much difficulty once the naming conventions are understood.
In the sections below we will show the same example program in each language. To do so we will compute the average NDVI per month for a single _tile_ in Tanzania.
```python, imports, echo=False
from pyspark.sql.functions import *
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
import pyrasterframes.rf_ipython
import pandas as pd
import os
spark = create_rf_spark_session()
```
## Python
### Step 1: Load the catalog
```python, step_1_python
modis = spark.read.format('aws-pds-modis-catalog').load()
```
### Step 2: Down-select data by month
```python, step_2_python
red_nir_monthly_2017 = modis \
.select(
col('granule_id'),
month('acquisition_date').alias('month'),
col('B01').alias('red'),
col('B02').alias('nir')
) \
.where(
(year('acquisition_date') == 2017) &
(dayofmonth('acquisition_date') == 15) &
(col('granule_id') == 'h21v09')
)
red_nir_monthly_2017.printSchema()
```
### Step 3: Read tiles
```python, step_3_python
red_nir_tiles_monthly_2017 = spark.read.raster(
red_nir_monthly_2017,
catalog_col_names=['red', 'nir'],
tile_dimensions=(256, 256)
)
```
### Step 4: Compute aggregates
```python, step_4_python
result = red_nir_tiles_monthly_2017 \
.where(st_intersects(
st_reproject(rf_geometry(col('red')), rf_crs(col('red')).crsProj4, rf_mk_crs('EPSG:4326')),
st_makePoint(lit(34.870605), lit(-4.729727)))
) \
.groupBy('month') \
.agg(rf_agg_stats(rf_normalized_difference(col('nir'), col('red'))).alias('ndvi_stats')) \
.orderBy(col('month')) \
.select('month', 'ndvi_stats.*')
result
```
## SQL
For convenience, we're going to evaluate SQL from the Python environment. The SQL fragments should work in the `spark-sql` shell just the same.
```python, sql_setup
def sql(stmt):
return spark.sql(stmt)
```
### Step 1: Load the catalog
```python, step_1_sql, results='hidden'
sql("CREATE OR REPLACE TEMPORARY VIEW modis USING `aws-pds-modis-catalog`")
```
### Step 2: Down-select data by month
```python, step_2_sql
sql("""
CREATE OR REPLACE TEMPORARY VIEW red_nir_monthly_2017 AS
SELECT granule_id, month(acquisition_date) as month, B01 as red, B02 as nir
FROM modis
WHERE year(acquisition_date) = 2017 AND day(acquisition_date) = 15 AND granule_id = 'h21v09'
""")
sql('DESCRIBE red_nir_monthly_2017')
```
### Step 3: Read tiles
```python, step_3_sql, results='hidden'
sql("""
CREATE OR REPLACE TEMPORARY VIEW red_nir_tiles_monthly_2017
USING raster
OPTIONS (
catalog_table='red_nir_monthly_2017',
catalog_col_names='red,nir',
tile_dimensions='256,256'
)
""")
```
### Step 4: Compute aggregates
```python, step_4_sql
grouped = sql("""
SELECT month, ndvi_stats.* FROM (
SELECT month, rf_agg_stats(rf_normalized_difference(nir, red)) as ndvi_stats
FROM red_nir_tiles_monthly_2017
WHERE st_intersects(st_reproject(rf_geometry(red), rf_crs(red), 'EPSG:4326'), st_makePoint(34.870605, -4.729727))
GROUP BY month
ORDER BY month
)
""")
grouped
```
## Scala
The latest Scala API documentation is available here:
* [Scala API Documentation](https://rasterframes.io/latest/api/index.html)
### Step 1: Load the catalog
```scala
import geotrellis.proj4.LatLng
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
implicit val spark = SparkSession.builder()
.master("local[*]")
.appName("RasterFrames")
.withKryoSerialization
.getOrCreate()
.withRasterFrames
import spark.implicits._
val modis = spark.read.format("aws-pds-modis-catalog").load()
```
### Step 2: Down-select data by month
```scala
val red_nir_monthly_2017 = modis
.select($"granule_id", month($"acquisition_date") as "month", $"B01" as "red", $"B02" as "nir")
.where(year($"acquisition_date") === 2017 && (dayofmonth($"acquisition_date") === 15) && $"granule_id" === "h21v09")
```
### Step 3: Read tiles
```scala
val red_nir_tiles_monthly_2017 = spark.read.raster
.fromCatalog(red_nir_monthly_2017, "red", "nir")
.load()
```
### Step 4: Compute aggregates
```scala
val result = red_nir_tiles_monthly_2017
.where(st_intersects(
st_reproject(rf_geometry($"red"), rf_crs($"red"), LatLng),
st_makePoint(34.870605, -4.729727)
))
.groupBy("month")
.agg(rf_agg_stats(rf_normalized_difference($"nir", $"red")) as "ndvi_stats")
.orderBy("month")
.select("month", "ndvi_stats.*")
result.show()
```