<a href="https://colab.research.google.com/github/ianakoto/Cropland-Mapping/blob/main/Dataset_Creation_GEO_AI_Challenge_for_Cropland_Mapping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GEO-AI Challenge for Cropland Mapping: Dataset Creation



This project is broken into the following notebooks:

- **Open 🧭 Overview**: Go through what we want to achieve, and explore the data we want to use as inputs and outputs for our model.

- **Open 🗄️ Create the dataset**: Use Apache Beam to fetch data from Earth Engine in parallel, and create a dataset for our model in Dataflow.

- **Open 🧠 Train the model**: Build a U-Net with a pretrained model and train it in Vertex AI with the dataset we created.

- **Open 🔮 Model predictions**: Get predictions from the model with data it has never seen before.

This sample leverages geospatial satellite data from Google Earth Engine. Using satellite imagery, you'll build and train a model for cropland classification.

# 🎬 Before you begin

Let's start by cloning the GitHub repository, and installing some dependencies.

In [1]:
# Now let's get the code from GitHub and navigate to the sample.
!git clone https://github.com/ianakoto/Cropland-Mapping.git
%cd Cropland-Mapping/serving

Cloning into 'Cropland-Mapping'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 40 (delta 15), reused 24 (delta 8), pack-reused 0
Receiving objects: 100% (40/40), 613.19 KiB | 3.50 MiB/s, done.
Resolving deltas: 100% (15/15), done.
/content/Cropland-Mapping/serving


## ☁️ My Google Cloud resources

Make sure you have followed these steps to configure your Google Cloud project:

1. Enable the APIs: _Earth Engine_

  <button>

  [Click here to enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=earthengine.googleapis.com)
  </button>

1. Register your
  [Compute Engine default service account](https://console.cloud.google.com/iam-admin/iam)
  on Earth Engine.

  <button>

  [Click here to register your service account on Earth Engine](https://signup.earthengine.google.com/#!/service_accounts)
  </button>

Once you have everything ready, you can go ahead and fill in your Google Cloud resources in the following code cell.
Make sure you run it!

In [None]:
from __future__ import annotations

import os
from google.colab import auth

# Please fill in these values.
project = "kagglex-396821"  # @param {type:"string"}
bucket = "cropland_classification_data"  # @param {type:"string"}
location = "us (multiple regions in United States)"  # @param {type:"string"}

# Quick input validations.
assert project, "⚠️ Please provide a Google Cloud project ID"
assert bucket, "⚠️ Please provide a Cloud Storage bucket name"
assert not bucket.startswith(
    "gs://"
), f"⚠️ Please remove the gs:// prefix from the bucket name: {bucket}"
assert location, "⚠️ Please provide a Google Cloud location"

# Authenticate to Colab.
auth.authenticate_user()

# Set GOOGLE_CLOUD_PROJECT for google.auth.default().
os.environ["GOOGLE_CLOUD_PROJECT"] = project

# Set the gcloud project for other gcloud commands.
!gcloud config set project {project}

Updated property [core/project].


In [2]:
!pip install -q earthengine-api

## Import Earth Engine API and authenticate<a class="anchor" id="import-api"></a>

The Earth Engine API is installed by default in Google Colaboratory, so you only need to import it and authenticate. Repeat these steps for each new Colab session, whenever you restart your Colab kernel, or when your Colab virtual machine is recycled due to inactivity.

### Import the API

Run the following cell to import the API into your session.

In [3]:
import ee
from datetime import datetime, timedelta
import io
import pandas as pd
import random
import numpy as np

### Authenticate and initialize

Run the `ee.Authenticate` function to authenticate your access to Earth Engine servers and `ee.Initialize` to initialize it. Upon running the following cell you'll be asked to grant Earth Engine access to your Google account. Follow the instructions printed to the cell.

In [4]:
# Trigger the authentication flow. You only need to do this once.
ee.Authenticate()

# Initialize the library.
ee.Initialize()

To authorize access needed by Earth Engine, open the following URL in a web browser and follow the instructions. If the web browser does not start automatically, please manually browse the URL below.

    https://code.earthengine.google.com/client-auth?scopes=https%3A//www.googleapis.com/auth/earthengine%20https%3A//www.googleapis.com/auth/devstorage.full_control&request_id=EnGWT8OOwZqDQSNn8g3fBzEwXbbqIb2hZ34sLfVlXRE&tc=3D341emxcPONXc6IUW8WzvPVpX_1BPzarFOSipqz5SI&cc=2mynPT9WOrD-aMUAFokvx1L54bqB9rTSl7CVxs-NTBs

The authorization workflow will generate a code, which you should paste in the box below.
Enter verification code: 4/1Adeu5BVAyuo0ZjiCIJzfZm-GE5msrJnWTfLT0LnZjNxtjMDEp5gvNI_DfIk

Successfully saved authorization token.


## 🎛️ Create train/validation splits

Before we can train an ML model, we need to split this data into training and validation datasets. We will do this by creating two new dataframes with a 70/30 training validation split.
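
The 70/30 split described above can be tried out on a small toy dataframe first. This is a minimal sketch: the `toy` dataframe and its values are made up for illustration, and only the column names mirror the competition's `Train.csv`.

```python
import pandas as pd

# Toy stand-in for Train.csv; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "ID": [f"ID_{i:03d}" for i in range(10)],
        "Lat": [14.0 + 0.1 * i for i in range(10)],
        "Lon": [33.0 + 0.1 * i for i in range(10)],
        "Target": [i % 2 for i in range(10)],
    }
)

TRAIN_FRACTION = 0.7  # 70% training, 30% validation

# Sample the training rows with a fixed seed so the split is reproducible.
toy_train = toy.sample(frac=TRAIN_FRACTION, random_state=200)

# Every row that was not sampled becomes part of the validation set.
toy_validation = toy.drop(toy_train.index)

# The two sets never share a row.
print(len(toy_train), len(toy_validation))
print(set(toy_train.index).isdisjoint(toy_validation.index))
```

Dropping the sampled index from the original dataframe is what guarantees that no row appears in both sets.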

In [None]:
!pip install -q /content/Cropland-Mapping/serving/

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done

In [None]:
import os
import sys

# Use an absolute path: the repository was cloned under /content.
sys.path.append(os.path.join('/content', 'Cropland-Mapping', 'serving'))

from data import *

## Load Dataset From Drive

In [None]:
# Mount Google Drive first so the CSV files under /content/drive are readable.
from google.colab import drive
drive.mount("/content/drive")
train_pd = pd.read_csv("/content/drive/MyDrive/Zindi Competitions/Cropland Classification/Train.csv")
sample_pd = pd.read_csv("/content/drive/MyDrive/Zindi Competitions/Cropland Classification/SampleSubmission.csv")
test_pd = pd.read_csv("/content/drive/MyDrive/Zindi Competitions/Cropland Classification/Test.csv")

In [8]:
train_pd.head()

|   | ID | Lat | Lon | Target |
|---|---|---|---|---|
| 0 | ID_SJ098E7S2SY9 | 34.162491 | 70.763668 | 0 |
| 1 | ID_CWCD60FGJJYY | 32.075695 | 48.492047 | 0 |
| 2 | ID_R1XF70RMVGL3 | 14.542826 | 33.313483 | 1 |
| 3 | ID_0ZBIDY0PEBVO | 14.35948 | 33.284108 | 1 |
| 4 | ID_C20R2C0AYIT0 | 14.419128 | 33.52845 | 0 |


In [9]:
sample_pd.head()

|   | ID | Target |
|---|---|---|
| 0 | ID_9ZLHTVF6NSU7 |  |
| 1 | ID_LNN7BFCVEZKA |  |
| 2 | ID_SOYSG7W04UH3 |  |
| 3 | ID_EAP7EXXV8ZDE |  |
| 4 | ID_QPRX1TUQVGHU |  |


In [10]:
test_pd.head()

|   | ID | Lat | Lon |
|---|---|---|---|
| 0 | ID_9ZLHTVF6NSU7 | 34.254835 | 70.348699 |
| 1 | ID_LNN7BFCVEZKA | 32.009669 | 48.535526 |
| 2 | ID_SOYSG7W04UH3 | 14.431884 | 33.399991 |
| 3 | ID_EAP7EXXV8ZDE | 14.281866 | 33.441224 |
| 4 | ID_QPRX1TUQVGHU | 14.399365 | 33.109566 |


In [12]:
test_pd.shape

(1500, 3)

In [13]:
TRAIN_VALIDATION_SPLIT = 0.7

train_dataframe = train_pd.sample(
    frac=TRAIN_VALIDATION_SPLIT, random_state=200
)  # random state is a seed value
validation_dataframe = train_pd.drop(train_dataframe.index).sample(frac=1.0)

In [15]:
start_date = "2022-01-01"
end_date = "2022-12-31"
iran_col, sudan_col, afghanistan_col = get_collections(start_date, end_date)

In [24]:
for row in train_dataframe.itertuples():
  print(row)
  break

Pandas(Index=412, ID='ID_06SS9UPGRDWP', Lat=32.282756, Lon=48.26109, Target=1)


In [16]:
train_features = [
    labeled_feature(row, iran_col, sudan_col, afghanistan_col)
    for row in train_dataframe.itertuples()
]

validation_features = [
    labeled_feature(row, iran_col, sudan_col, afghanistan_col)
    for row in validation_dataframe.itertuples()
]

TypeError: ignored

## 💾 Export data

Lastly, we'll export the data to a Cloud Storage bucket. We'll export the data as TFRecords.

Later when we run the training job, we'll parse these **TFRecords** and feed them to the model.

In [None]:
# Export data

training_task = ee.batch.Export.table.toCloudStorage(
    collection=ee.FeatureCollection(train_features),
    description="Training image export",
    bucket=bucket,
    fileNamePrefix="geospatial_training",
    selectors=BANDS + [FEATURES] + [LABEL],
    fileFormat="TFRecord",
)

training_task.start()

validation_task = ee.batch.Export.table.toCloudStorage(
    collection=ee.FeatureCollection(validation_features),
    description="Validation image export",
    bucket=bucket,
    fileNamePrefix="geospatial_validation",
    selectors=BANDS + [FEATURES] + [LABEL],
    fileFormat="TFRecord",
)

validation_task.start()


This export will take a while. You can monitor the progress with the following command:

In [None]:
from pprint import pprint

pprint(ee.batch.Task.list())
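
Instead of re-running the task listing by hand, the export tasks can be polled until they finish. The helper below is a minimal sketch, not part of this repository: `wait_for_tasks` and `FINISHED_STATES` are hypothetical names, and the only assumption about each task is that it has a `status()` method returning a dict with a `"state"` key, as `ee.batch.Task.status()` does.

```python
import time

# States after which an Earth Engine batch task no longer changes.
FINISHED_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_tasks(tasks, poll_seconds=30.0, max_polls=None):
    """Poll every task until all reach a finished state.

    Returns the final states in the same order as `tasks`. Raises
    TimeoutError if `max_polls` rounds pass with tasks still running.
    """
    tasks = list(tasks)
    states = [None] * len(tasks)
    polls = 0
    while True:
        for i, task in enumerate(tasks):
            # Only re-query tasks that have not finished yet.
            if states[i] not in FINISHED_STATES:
                states[i] = task.status()["state"]
        polls += 1
        if all(state in FINISHED_STATES for state in states):
            return states
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Tasks still running after {polls} polls: {states}")
        time.sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_tasks([training_task, validation_task])` once both exports have been started.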

In [None]:
start_date = '2022-01-01'
end_date = '2022-12-31'

iran_collection = create_composited_sentinel2_collection(iran_geometry, start_date, end_date)
sudan_collection = create_composited_sentinel2_collection(sudan_geometry, start_date, end_date)
afghanistan_collection = create_composited_sentinel2_collection(afghanistan_geometry, start_date, end_date)



In [None]:
df_subset = train_pd.head(100)
train_features = [labeled_feature(row) for row in df_subset.itertuples()]

To get a better sense of what's going on, let's look at the properties of the first Feature in the `train_features` list. It contains a property for the label, **is_crop_or_land**, and 15 additional properties: one for each of the 13 spectral bands, plus the NDVI and EVI indices.

In [None]:
ee.FeatureCollection(train_features[0]).propertyNames().getInfo()

['system:index',
 'is_crop_or_land',
 'B10',
 'B11',
 'B12',
 'B8A',
 'NDVI',
 'B1',
 'B2',
 'B3',
 'B4',
 'B5',
 'B6',
 'B7',
 'B8',
 'B9',
 'EVI']

The data contained in each band property is an array of shape 33x33.

For example, here is the data for band B1 of the first element in our list, expressed as a NumPy array.

In [None]:
example_feature = np.array(train_features[0].get("B1").getInfo())
print(example_feature)
print("shape: " + str(example_feature.shape))

[[2900 2900 2900 ... 2851 2851 2839]
 [2900 2900 2900 ... 2851 2851 2839]
 [2885 2885 2885 ... 2876 2876 2835]
 ...
 [2858 2858 2858 ... 2858 2858 2858]
 [2858 2858 2858 ... 2858 2858 2858]
 [2858 2858 2858 ... 2862 2862 2862]]
shape: (33, 33)
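
Because every band property is a 33x33 array, one common way to prepare an image-model input is to stack the bands along a final channel axis. The sketch below only illustrates that idea and is not necessarily how the training notebook does it: `EXAMPLE_BANDS`, `stack_bands`, and `fake_properties` are hypothetical names, the band order is an arbitrary choice, and the patch values are synthetic.

```python
import numpy as np

PATCH_SIZE = 33

# The 15 band and index names listed above; this ordering is an assumption.
EXAMPLE_BANDS = [
    "B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A",
    "B9", "B10", "B11", "B12", "NDVI", "EVI",
]


def stack_bands(properties, band_names):
    """Stack per-band 2-D patches into one (height, width, channels) array."""
    patches = [np.asarray(properties[name], dtype=np.float32) for name in band_names]
    return np.stack(patches, axis=-1)


# Synthetic stand-in for a feature's properties: each band is filled with its index.
fake_properties = {
    name: np.full((PATCH_SIZE, PATCH_SIZE), i) for i, name in enumerate(EXAMPLE_BANDS)
}

stacked = stack_bands(fake_properties, EXAMPLE_BANDS)
print(stacked.shape)
```

Keeping the band order in a single list means the same order can be reused when the data is read back for training.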
