<a href="https://colab.research.google.com/github/liangchow/zindi-amazon-secret-runway/blob/shruti-working/Data_Visualization/explore_sample_submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports and Setup

In [1]:
%%capture
!pip -q install geopandas
!pip install dask[dataframe]

In [7]:
# Standard imports
import os
import pandas as pd
import numpy as np
from PIL import Image

import dask.dataframe as dd

# Geospatial processing packages
import geopandas as gpd

# Mapping and plotting libraries
import matplotlib.pyplot as plt
import matplotlib.colors as cl

## Mount Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# clone the main branch from GitHub to get all the data and files from there onto the current runtime session
!apt-get install git
!git clone https://github.com/liangchow/zindi-amazon-secret-runway.git
!git pull # pulls the latest changes from repo

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git is already the newest version (1:2.34.1-1ubuntu1.11).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Cloning into 'zindi-amazon-secret-runway'...
remote: Enumerating objects: 239, done.[K
remote: Counting objects: 100% (77/77), done.[K
remote: Compressing objects: 100% (75/75), done.[K
remote: Total 239 (delta 29), reused 7 (delta 2), pack-reused 162 (from 1)[K
Receiving objects: 100% (239/239), 14.82 MiB | 22.42 MiB/s, done.
Resolving deltas: 100% (96/96), done.
fatal: not a git repository (or any of the parent directories): .git


# Download AOI Boundary

In [5]:
base_aoi_path = '/content/zindi-amazon-secret-runway/Data_Visualization/data/shp_test_AOIs'
aoi_name = 'aoi_2020_01'

# Select export folder on Google Drive
export_folder = 'Colab Notebooks'

In [8]:
# Read data using GeoPandas
filename = os.path.join(base_aoi_path, f'{aoi_name}.shp')
geoboundary = gpd.read_file(filename)
print("Data dimensions: {}".format(geoboundary.shape))
geoboundary

Data dimensions: (1, 11)


Unnamed: 0,MINX,MINY,MAXX,MAXY,CNTX,CNTY,AREA,PERIM,HEIGHT,WIDTH,geometry
0,690096.799317,8793048.0,705506.799317,8808318.0,697801.799317,8800683.0,235310700.0,61360.0,15270.0,15410.0,"POLYGON ((690096.799 8793048.011, 690096.799 8..."



# Create a 10m grid in QGIS
Created a 10m grid using the extent of the AOI aoi_2020_01.shp using the Grid Creation tool.

The grid created has coordinates 0,0 at the top left corner.
- The row index ranges from 0 to 1526 from top to bottom
- The column index ranges from 0 to 1540 from left to right

# Load Sample Submission file

Upload the `SampleSubmission.csv` file provided by Zindi to your runtime. Read it with `dask` and filter data for AOI of interest. Save data for that AOI in a new csv file.

Uncomment the 2 cells below if you want to explore data for a different AOI.

In [10]:
# # Load with dask because the original file is very large
# df = dd.read_csv('/content/SampleSubmission.csv')

# # Filter for rows where 'tile_row_col' contains a specific substring (e.g., 'Tileaoi_20_01')
# filtered_df = df[df['tile_row_column'].str.contains('Tileaoi_20_01')]

# # Compute the filtered results if you need to bring them into memory (e.g., for further processing)
# result = filtered_df.compute()
# result.head()

Unnamed: 0,tile_row_column,label
6,Tileaoi_20_01_259_1267,0
7,Tileaoi_20_01_7_205,0
10,Tileaoi_20_01_1026_497,0
13,Tileaoi_20_01_415_280,0
32,Tileaoi_20_01_1184_798,0


In [11]:
# # Save result to a new csv file
# result.to_csv('SampleSubmission_2020_01.csv', index=False)

In [20]:
# Load the filtered dataset to make it easier to work with
aoi_df = pd.read_csv('/content/SampleSubmission_2020_01.csv')
aoi_df.head()

Unnamed: 0,tile_row_column,label
0,Tileaoi_20_01_259_1267,0
1,Tileaoi_20_01_7_205,0
2,Tileaoi_20_01_1026_497,0
3,Tileaoi_20_01_415_280,0
4,Tileaoi_20_01_1184_798,0


In [21]:
len(aoi_df)

628563

In [22]:
# Split the 'tile_row_col' column into multiple columns
split_cols = aoi_df['tile_row_column'].str.split('_', expand=True)

# Assign split columns to new columns with specified names
aoi_df['aoi'] = split_cols[0]
aoi_df['year'] = split_cols[1]
aoi_df['num'] = split_cols[2]
aoi_df['row'] = split_cols[3]
aoi_df['col'] = split_cols[4]

# Combine 'aoi', 'year', and 'num' into a new column 'aoi_year_num'
aoi_df['aoi_year_num'] = aoi_df['aoi'] + '_' + aoi_df['year'] + '_' + aoi_df['num']
aoi_df.head()

Unnamed: 0,tile_row_column,label,aoi,year,num,row,col,aoi_year_num
0,Tileaoi_20_01_259_1267,0,Tileaoi,20,1,259,1267,Tileaoi_20_01
1,Tileaoi_20_01_7_205,0,Tileaoi,20,1,7,205,Tileaoi_20_01
2,Tileaoi_20_01_1026_497,0,Tileaoi,20,1,1026,497,Tileaoi_20_01
3,Tileaoi_20_01_415_280,0,Tileaoi,20,1,415,280,Tileaoi_20_01
4,Tileaoi_20_01_1184_798,0,Tileaoi,20,1,1184,798,Tileaoi_20_01


In [23]:
aoi_df['row'] = aoi_df['row'].astype(int)
aoi_df['col'] = aoi_df['col'].astype(int)

In [24]:
sorted_df = aoi_df.sort_values(by=['row', 'col'])
sorted_df.head()

Unnamed: 0,tile_row_column,label,aoi,year,num,row,col,aoi_year_num
464754,Tileaoi_20_01_0_2,0,Tileaoi,20,1,0,2,Tileaoi_20_01
333897,Tileaoi_20_01_0_4,0,Tileaoi,20,1,0,4,Tileaoi_20_01
205820,Tileaoi_20_01_0_5,0,Tileaoi,20,1,0,5,Tileaoi_20_01
543260,Tileaoi_20_01_0_8,0,Tileaoi,20,1,0,8,Tileaoi_20_01
554245,Tileaoi_20_01_0_16,0,Tileaoi,20,1,0,16,Tileaoi_20_01


In [25]:
# Min and max row values
print('Row Stats')
min(sorted_df['row']), max(sorted_df['row'])

Row Stats


(0, 1540)

In [18]:
# Min and max col values
print('Col Stats')
min(sorted_df['col']), max(sorted_df['col'])

Col Stats


(0, 1526)

**Note: In 10m grid created in QGIS the row index ranges from 0 to 1526 and the column index ranges from 0 to 1540. However in the table the indexes are reversed. It's possible that the row & col values are switched in the SampleSubmission.csv file.**