<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Importing-the-Data" data-toc-modified-id="Importing-the-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Importing the Data</a></span><ul class="toc-item"><li><span><a href="#Metadata-File" data-toc-modified-id="Metadata-File-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span><code>Metadata</code> File</a></span></li><li><span><a href="#train_labels-File" data-toc-modified-id="train_labels-File-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span><code>train_labels</code> File</a></span></li><li><span><a href="#Prepping-the-data-for-the-Satellite-imagery-analysis." data-toc-modified-id="Prepping-the-data-for-the-Satellite-imagery-analysis.-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Prepping the data for the Satellite imagery analysis.</a></span></li><li><span><a href="#Setting-up-the-DataFrame" data-toc-modified-id="Setting-up-the-DataFrame-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Setting up the DataFrame</a></span></li></ul></li><li><span><a href="#Pulling-in-All-of-the-Data" data-toc-modified-id="Pulling-in-All-of-the-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Pulling in All of the Data</a></span><ul class="toc-item"><li><span><a href="#Pulling-in-the-first-half-of-the-data." 
data-toc-modified-id="Pulling-in-the-first-half-of-the-data.-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Pulling in the first half of the data.</a></span></li><li><span><a href="#Pulling-in-the-second-half-of-the-data" data-toc-modified-id="Pulling-in-the-second-half-of-the-data-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Pulling in the second half of the data</a></span></li><li><span><a href="#Pulling-in-the-third-set-of-data" data-toc-modified-id="Pulling-in-the-third-set-of-data-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Pulling in the third set of data</a></span></li><li><span><a href="#Creating-a-Full-DataFrame" data-toc-modified-id="Creating-a-Full-DataFrame-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Creating a Full DataFrame</a></span></li></ul></li><li><span><a href="#Test-Data" data-toc-modified-id="Test-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Test Data</a></span><ul class="toc-item"><li><span><a href="#Pulling-the-the-Data-in-Batches-and-Saving-to-.pkl" data-toc-modified-id="Pulling-the-the-Data-in-Batches-and-Saving-to-.pkl-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Pulling the the Data in Batches and Saving to .pkl</a></span></li><li><span><a href="#Reading-in-the-Saved-Test-Data" data-toc-modified-id="Reading-in-the-Saved-Test-Data-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Reading in the Saved Test Data</a></span></li></ul></li></ul></div>

Running main notebook

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import geopandas as gpd
from shapely.geometry import Point
import geopy.distance as distance

import planetary_computer as pc
from pystac_client import Client

from datetime import datetime
from datetime import timedelta

# from keras.utils import load_img, img_to_array
import requests
from PIL import Image
from io import BytesIO

from tqdm import tqdm
tqdm.pandas()

import rioxarray
import cv2
import odc.stac
import tempfile
import rasterio
import os

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

import functions
import functions2

import warnings
warnings.filterwarnings('ignore')

# Importing the Data

## `Metadata` File

In [2]:
# Reading in the data and bringing in date as datetime dtype
metadata = pd.read_csv('Data/metadata.csv', parse_dates=['date'])

## `train_labels` File

In [3]:
train_labels = pd.read_csv('Data/train_labels.csv')

## Prepping the data for the Satellite imagery analysis.

In [4]:
sat_df = metadata.reset_index()

In [5]:
sat_df['split'].value_counts()

train    17060
test      6510
Name: split, dtype: int64

In [6]:
sat_train = sat_df[sat_df['split'] == 'train'].copy()
sat_test = sat_df[sat_df['split'] == 'test'].copy()

Bringing back in the labels for sat_train.

In [7]:
sat_train = sat_train.merge(train_labels, on='uid')

## Setting up the DataFrame

Here I use a custom function to add a date range that the satellite catalog search can interpret, along with bounding boxes that are used later to crop the images.

In [8]:
functions.get_important_info(sat_train, dist=31, big_crop_dist=3000, small_crop_dist=500, tiny_crop_dist=100);
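To make the idea concrete, here is a minimal sketch of how a date range and a bounding box might be derived from a sample's date and coordinates. It is not the code inside `functions.get_important_info`: the helper names `make_date_range` and `make_bbox` are hypothetical, the 15-day look-back is inferred from the `date_range` values visible in the output further down, and the meters-to-degrees conversion is a rough approximation rather than a geodesic calculation.

```python
import math
from datetime import date, timedelta


def make_date_range(sample_date, days_back=15):
    """Return a 'start/end' string ending on the sample date."""
    start = sample_date - timedelta(days=days_back)
    return f"{start.isoformat()}/{sample_date.isoformat()}"


def make_bbox(lat, lon, half_width_m):
    """Return [min_lon, min_lat, max_lon, max_lat] around a point.

    Assumption: one degree of latitude is about 111,320 m, and a degree
    of longitude shrinks with the cosine of the latitude.
    """
    dlat = half_width_m / 111_320.0
    dlon = half_width_m / (111_320.0 * math.cos(math.radians(lat)))
    return [lon - dlon, lat - dlat, lon + dlon, lat + dlat]


print(make_date_range(date(2018, 5, 14)))  # 2018-04-29/2018-05-14
print(make_bbox(39.080319, -86.430867, 3000))
```

The `min_lon, min_lat, max_lon, max_lat` ordering matches what the `bbox` columns in this notebook appear to use.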

# Pulling in All of the Data

Because the API is sometimes unstable, I pull the data in two large batches, each made up of several smaller batches. The cells below split the data into those smaller batches.

In [9]:
all_train = list(np.arange(0, len(sat_train), 853))

In [11]:
# np.arange stops before the final boundary, so add the total row count (17060)
all_train.append(len(sat_train))

In [12]:
all_train

[0,
 853,
 1706,
 2559,
 3412,
 4265,
 5118,
 5971,
 6824,
 7677,
 8530,
 9383,
 10236,
 11089,
 11942,
 12795,
 13648,
 14501,
 15354,
 16207,
 17060]

In [17]:
train_dict = {}
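# `batch` is used as the loop variable so the built-in `slice` is not shadowed
for batch in range(1, len(all_train)):
    # batch n holds the rows from boundary n-1 up to (not including) boundary n
    train_dict[f"sat_train_{batch}"] = sat_train[all_train[batch-1]:all_train[batch]]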
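The boundaries can also be produced as explicit `(start, stop)` pairs, which avoids appending the final row count by hand. The sketch below uses a hypothetical helper, `batch_bounds`, which is not part of this notebook's `functions` module. The numbers in the example are the row count and batch size used above.

```python
def batch_bounds(n_rows, batch_size):
    """Return (start, stop) index pairs that cover range(n_rows) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]


bounds = batch_bounds(17060, 853)
print(len(bounds), bounds[0], bounds[-1])  # 20 (0, 853) (16207, 17060)
```

Each pair could then be used as `sat_train[start:stop]`. If the row count is not a multiple of the batch size, the last pair is simply shorter.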


In [22]:
train_dict_key_list = list(train_dict.keys())

In [21]:
train_dict['sat_train_1'].head()

Unnamed: 0,index,uid,latitude,longitude,date,split,region,severity,density,date_range,bbox,big_crop_bbox,small_crop_bbox,tiny_crop_bbox
0,0,aabm,39.080319,-86.430867,2018-05-14,train,midwest,1,585.0,2018-04-29/2018-05-14,"[-87.00742888244132, 38.63091417147125, -85.85...","[-86.46553737052635, 39.05329612116674, -86.39...","[-86.43664511758249, 39.07581525298953, -86.42...","[-86.43202235685135, 39.079418305988106, -86.4..."
1,2,aacd,35.875083,-78.878434,2020-11-19,train,south,1,290.0,2020-11-04/2020-11-19,"[-79.43088170919651, 35.425434522510464, -78.3...","[-78.91165478658658, 35.84804560208817, -78.84...","[-78.88397103218583, 35.8705769758293, -78.872...","[-78.87954163128398, 35.87418198776951, -78.87..."
2,3,aaee,35.487,-79.062133,2016-08-24,train,south,1,1614.0,2016-08-09/2016-08-24,"[-79.61191193921022, 35.03732231399556, -78.51...","[-79.09519299105324, 35.459960615407574, -79.0...","[-79.06764296370947, 35.48249344433146, -79.05...","[-79.06323495914324, 35.48609868913609, -79.06..."
3,4,aaff,38.049471,-99.827001,2019-07-23,train,midwest,3,111825.0,2019-07-08/2019-07-23,"[-100.3953854756163, 37.59998670596361, -99.25...","[-99.86118001864864, 38.02244310076497, -99.79...","[-99.83269759502433, 38.044966208779, -99.8213...","[-99.82814040700623, 38.048569898032675, -99.8..."
4,5,aafl,39.474744,-86.898353,2021-08-23,train,midwest,4,2017313.0,2021-08-08/2021-08-23,"[-87.47815708269697, 39.02536958186914, -86.31...","[-86.93321866191961, 39.44772288780534, -86.86...","[-86.9041639439351, 39.47024049004548, -86.892...","[-86.89951518878857, 39.473843298288934, -86.8..."


In [16]:
# first_half = all_train[:int(len(all_train)/2)]

# first_half_dict = {}
# for batch in range(1, len(first_half)):
#     first_half_dict[f'sat_train_{batch}'] = sat_train[first_half[batch-1]:first_half[batch]]
    
# first_half_dict.keys()

In [17]:
# first_half_dict['sat_train_9'].head()

In [18]:
# all_train.append(len(sat_train))
# second_half = all_train[9:]

# second_half_dict = {}
# for batch in range(10, len(second_half)+9):
#     second_half_dict[f'sat_train_{batch}'] = sat_train[second_half[batch-10]:second_half[batch-9]]
    
# second_half_dict.keys()

In [19]:
# second_half_dict['sat_train_20'].tail()

## Pulling in the first half of the data.

In [20]:
# commented out due to having completed and pickled the results
# first_half_key_list = list(first_half_dict.keys())

In [21]:
# commented out due to having completed and pickled the results


# first_half_results_dict = {}
# for n, key in enumerate(first_half_key_list):
#     first_half_results_dict[key] = get_sat_to_features(first_half_dict[key])
#     print(f"{key} has finished loading.")

## Pulling in the second half of the data

In [22]:
# second_half_key_list = list(second_half_dict.keys())

In [23]:
# second_half_results_dict = {}
# for key in second_half_key_list:
#     second_half_results_dict[key] = get_sat_to_features(second_half_dict[key])
#     print(f"{key} has finished loading.")

In [24]:
# second_half_results_dict.keys()

## Pulling in the third set of data

The API is incredibly finicky: only sat_train_10 through sat_train_13 were pulled successfully, so I am running another pull for the remaining batches.

In [25]:
# third_pull_key_list = second_half_key_list[4:]

In [26]:
# third_pull_results_dict = {}
# for key in third_pull_key_list:
#     third_pull_results_dict[key] = get_sat_to_features(second_half_dict[key])
#     print(f"{key} has finished loading.")
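Since a failed pull means re-running batches by hand, one option is to save each batch as soon as it finishes and skip batches that already have a saved file. The following is a minimal sketch of that idea, not the approach used in this notebook: `pull_with_checkpoints` is a hypothetical helper, `fetch` stands in for a function such as `get_sat_to_features`, and the retry count and file naming are assumptions.

```python
import pickle
import tempfile
import time
from pathlib import Path


def pull_with_checkpoints(batches, fetch, out_dir, retries=3, wait_s=0.0):
    """Fetch each batch and pickle the result as soon as it finishes.

    Returns the keys that still failed after `retries` attempts.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for key, batch in batches.items():
        target = out_dir / f"{key}.pkl"
        if target.exists():  # finished in an earlier run, so skip it
            continue
        for attempt in range(1, retries + 1):
            try:
                result = fetch(batch)
            except Exception as err:  # assumption: any error is worth a retry
                print(f"{key}: attempt {attempt} failed ({err})")
                time.sleep(wait_s)
            else:
                with target.open("wb") as fh:
                    pickle.dump(result, fh)
                print(f"{key} has finished loading.")
                break
        else:  # the loop never hit `break`, so every attempt raised
            failed.append(key)
    return failed


# Toy usage with a stand-in fetch function that doubles each value.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_batches = {"sat_train_1": [1, 2], "sat_train_2": [3]}
    leftover = pull_with_checkpoints(toy_batches, lambda b: [x * 2 for x in b], tmp_dir)
    print(leftover)  # [] because nothing failed
```

Re-running the same call after a crash would only fetch the batches that have no `.pkl` file in `out_dir` yet. With DataFrame results, `result.to_pickle(target)` could replace the `pickle.dump` call.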

## Creating a Full DataFrame

Now that each pull's results are stored in a dictionary, I concatenate them into DataFrames and then into one complete DataFrame.

In [27]:
# commented out due to having completed and pickled the results
# First pull to DataFrame
# first_pull_df = pd.concat(first_half_results_dict.values())

In [28]:
# Second pull to DataFrame
# second_pull_df = pd.concat(second_half_results_dict.values())

In [29]:
# Third pull to DataFrame
# third_pull_df = pd.concat(third_pull_results_dict.values())

I also need to store the pulled data as a .pkl file.

In [30]:
# first_pull_df.to_pickle('./first_7677_rows.pkl')

In [31]:
# first_pull_df = pd.read_pickle('./first_7677_rows.pkl')

In [32]:
# second_pull_df.to_pickle('./second_set_rows.pkl')

In [33]:
# second_pull_df = pd.read_pickle('./second_set_rows.pkl')

In [34]:
# third_pull_df.to_pickle('./third_set_rows.pkl')

In [35]:
# third_pull_df = pd.read_pickle('./third_set_rows.pkl')

In [36]:
# full_df = pd.concat([first_pull_df, second_pull_df, third_pull_df])

In [37]:
# full_df.to_pickle('./full_df.pkl')

In [38]:
# new_full = pd.read_pickle('../full_df.pkl')

# Test Data

In [39]:
# sat_test = functions.get_important_info(sat_test)

## Pulling the Data in Batches and Saving to .pkl

0-499

In [40]:
# slice df
# small_sat_test_499 = sat_test[0:500]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_499)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/0_to_500.pkl')

500-999

In [None]:
# slice df
# small_sat_test_999 = sat_test[500:1000]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_999)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/500_to_999.pkl')

1000-1499

In [None]:
# slice df
# small_sat_test_1499 = sat_test[1000:1500]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_1499)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/1000_to_1499.pkl')

1500-1999

In [None]:
# slice df
# small_sat_test_1999 = sat_test[1500:2000]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_1999)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/1500_to_1999.pkl')

2000-2499

In [None]:
# slice df
# small_sat_test_2499 = sat_test[2000:2500]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_2499)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/2000_to_2499.pkl')

2500-2999

In [None]:
# slice df
# small_sat_test_2999 = sat_test[2500:3000]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_2999)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/2500_to_2999.pkl')

3000-3499

In [None]:
# slice df
# small_sat_test_3499 = sat_test[3000:3500]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_3499)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/3000_to_3499.pkl')

3500-3999

In [None]:
# slice df
# small_sat_test_3999 = sat_test[3500:4000]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_3999)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/3500_to_3999.pkl')

4000-4499

In [None]:
# slice df
# small_sat_test_4499 = sat_test[4000:4500]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_4499)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/4000_to_4499.pkl')

4500-4999

In [None]:
# slice df
# small_sat_test_4999 = sat_test[4500:5000]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_4999)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/4500_to_4999.pkl')

5000-5499

In [None]:
# slice df
# small_sat_test_5499 = sat_test[5000:5500]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_5499)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/5000_to_5499.pkl')

5500-5999

In [None]:
# slice df
# small_sat_test_5999 = sat_test[5500:6000]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_5999)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/5500_to_5999.pkl')

6000-6510

In [None]:
# slice df
# small_sat_test_6510 = sat_test[6000:]

In [None]:
# get features
# test_df = functions2.try_get_sat_to_features(small_sat_test_6510)

In [None]:
# save features
# test_df.to_pickle('../pickles/test_data/6000_to_6510.pkl')

## Reading in the Saved Test Data

In [2]:
# small_sat_test_499 = pd.read_pickle('../pickles/test_data/0_to_500.pkl')

In [4]:
# small_sat_test_999 = pd.read_pickle('../pickles/test_data/500_to_999.pkl')

In [5]:
# small_sat_test_1499 = pd.read_pickle('../pickles/test_data/1000_to_1499.pkl')

In [6]:
# small_sat_test_1999 = pd.read_pickle('../pickles/test_data/1500_to_1999.pkl')

In [7]:
# small_sat_test_2499 = pd.read_pickle('../pickles/test_data/2000_to_2499.pkl')

In [8]:
# small_sat_test_2999 = pd.read_pickle('../pickles/test_data/2500_to_2999.pkl')

In [9]:
# small_sat_test_3499 = pd.read_pickle('../pickles/test_data/3000_to_3499.pkl')

In [10]:
# small_sat_test_3999 = pd.read_pickle('../pickles/test_data/3500_to_3999.pkl')

In [11]:
# small_sat_test_4499 = pd.read_pickle('../pickles/test_data/4000_to_4499.pkl')

In [12]:
# small_sat_test_4999 = pd.read_pickle('../pickles/test_data/4500_to_4999.pkl')

In [13]:
# small_sat_test_5499 = pd.read_pickle('../pickles/test_data/5000_to_5499.pkl')

In [14]:
# small_sat_test_5999 = pd.read_pickle('../pickles/test_data/5500_to_5999.pkl')

In [15]:
# small_sat_test_6510 = pd.read_pickle('../pickles/test_data/6000_to_6510.pkl')

Making a list of all batched DataFrames.

In [16]:
# test_pickle_list = [small_sat_test_499, small_sat_test_999, small_sat_test_1499,
#                    small_sat_test_1999, small_sat_test_2499, small_sat_test_2999,
#                    small_sat_test_3499, small_sat_test_3999, small_sat_test_4499,
#                    small_sat_test_4999, small_sat_test_5499, small_sat_test_5999,
#                    small_sat_test_6510]

Concatenating the list of DataFrames into one large DataFrame.

In [17]:
# full_test_df = pd.concat(test_pickle_list)

In [20]:
# full_test_df = full_test_df.reset_index().drop('index', axis=1)

In [21]:
# full_test_df.to_pickle('../pickles/test_data/full_test_df.pkl')
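The thirteen read cells above could be collapsed into a loop over the saved files. One thing to watch is ordering: a plain alphabetical sort puts `1000_to_1499.pkl` ahead of `500_to_999.pkl`. The sketch below sorts by the leading row number instead. `order_batch_names` and `sorted_batch_files` are hypothetical helpers, and they assume every batch file is named `<start>_to_<end>.pkl`, as the names used above are.

```python
import re
from pathlib import Path

# Matches names such as "500_to_999.pkl" and captures both numbers.
BATCH_NAME = re.compile(r"(\d+)_to_(\d+)\.pkl$")


def order_batch_names(names):
    """Keep batch-style file names and order them by their first number."""
    matched = []
    for name in names:
        match = BATCH_NAME.match(name)
        if match:  # skips other files such as full_test_df.pkl
            matched.append((int(match.group(1)), name))
    return [name for _, name in sorted(matched)]


def sorted_batch_files(folder):
    """Return the batch pickle paths in `folder`, in row order."""
    folder = Path(folder)
    names = [path.name for path in folder.glob("*.pkl")]
    return [folder / name for name in order_batch_names(names)]


example = ["1000_to_1499.pkl", "500_to_999.pkl", "0_to_500.pkl", "full_test_df.pkl"]
print(order_batch_names(example))
# ['0_to_500.pkl', '500_to_999.pkl', '1000_to_1499.pkl']
```

The resulting paths could then be read and combined in one step, for example `pd.concat([pd.read_pickle(p) for p in sorted_batch_files('../pickles/test_data')])`.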

In [2]:
# # Can use this if I decide to use multiple satellite images
# def get_sat_info(df):

#     '''
#     input a dataframe and get a dictionary with satellite information for each row in the dataframe
#     '''
    
#     sat_dict = {}
#     for index in range(len(df)):
#         row = df.iloc[index]

#         # Get all satellite images
#         search = catalog.search(collections=["sentinel-2-l2a", "landsat-c2-l2"],
#                                 bbox=row['bbox'],
#                                 datetime=row['date_range'],
#                                 query={'eo:cloud_cover': {'lt':100}}
#     )


#         # Going through Satellite info

#         # search for sat images and create a dataframe with results for one sample
# #         search_items = [item for item in search.get_all_items()]
#         search_items = [item for item in search.item_collection()]


#         pic_details = []
#         for pic in search_items:
#             pic_details.append(
#             {
#             'item': pic,
#             'satelite_name':pic.collection_id,
#             'img_date':pic.datetime.date(),
#             'cloud_cover(%)': pic.properties['eo:cloud_cover'],
#             'img_bbox': pic.bbox,
#             'min_long': pic.bbox[0],
#             "max_long": pic.bbox[2],
#             "min_lat": pic.bbox[1],
#             "max_lat": pic.bbox[3]
#             }
#             )

#         temp_df = pd.DataFrame(pic_details)

#         # Check to make sure sample location is actually within sat image
#         temp_df['has_sample_point'] = (
#             (temp_df.min_lat < row.latitude)
#             & (temp_df.max_lat > row.latitude)
#             & (temp_df.min_long < row.longitude)
#             & (temp_df.max_long > row.longitude)
#         )

#         temp_df = temp_df[temp_df['has_sample_point']]
#         sat_dict[row['uid']] = temp_df
        
#     return sat_dict
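The `has_sample_point` check above boils down to a point-in-box test. As a standalone illustration (the function name `bbox_contains` is hypothetical and does not exist in this notebook), it can be written for a single `[min_lon, min_lat, max_lon, max_lat]` box. This sketch assumes the box does not cross the antimeridian.

```python
def bbox_contains(bbox, latitude, longitude):
    """Return True when the point lies strictly inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lat < latitude < max_lat and min_lon < longitude < max_lon


# Made-up box around the first sample location shown earlier in the notebook.
box = [-87.0, 38.6, -85.8, 39.6]
print(bbox_contains(box, 39.080319, -86.430867))  # True
print(bbox_contains(box, 35.875083, -78.878434))  # False
```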

In [3]:
# # delete comments for prints (# is all on left edge)
# def pick_best_sat(df, sat_dict):
    
#     '''
#     input a dataframe and dictionary of satellite images and returns a dataframe with the best satellite image
#     '''
    
#     # picking the best
#     # inputs would need to be df and dictionary
#     best_sat_df = pd.DataFrame()
#     row_count=0
#     invalid_sats = 0
#     for index in range(len(df)):
#         row = df.iloc[index]

#         name = row['uid']
#         temp_df = sat_dict[name]
#         temp_df = temp_df.reset_index()
#         # checking to see if there's only one image and adding it to df if so
#         if len(temp_df) == 1:
# #             print('only one satellite')
#             temp_df = temp_df.reset_index().drop(['index','min_long', 'max_long', 'min_lat', 'max_lat'], axis=1)
#             row = pd.DataFrame(row).T.reset_index().join(temp_df, how='outer')
#             row = row.set_index(pd.Series(row_count)).drop(['level_0', 'index'], axis=1)
#             best_sat_df = pd.concat([best_sat_df, row])
#             row_count+=1

#         # checking if no images
#         elif len(temp_df) == 0:
#             invalid_sats +=1
#             row = pd.DataFrame(row).T.reset_index()
#             row = row.set_index(pd.Series(row_count)).drop('index', axis=1)
#             best_sat_df = pd.concat([best_sat_df, row])
#             row_count+=1
# #             print('no satellite images')
#             continue

#         # There are many satellite images, need to narrow it down
#         else:
# #             print('many sats')
#             # first checking for any sentinel satellites
#             if len(temp_df[temp_df['satelite_name'].str.contains('entinel')]) >0:
#                     temp_df = temp_df[temp_df['satelite_name'].str.contains('entinel')]

#                     # if only one sentinel, add to df and move on
#                     if len(temp_df) == 1:
# #                         print('\tonly one sentinel')
#                         temp_df = temp_df.reset_index().drop(['index','min_long', 'max_long', 'min_lat', 'max_lat'], axis=1)
#                         row = pd.DataFrame(row).T.reset_index().join(temp_df, how='outer')
#                         row = row.set_index(pd.Series(row_count)).drop(['level_0', 'index'], axis=1)
#                         best_sat_df = pd.concat([best_sat_df, row])
#                         row_count+=1
#                     # if many sentinel, check for images with low cloud cover
#                     else:
# #                         print('\tmany sentinel')
#                         # checking for clouds less than 30%
#                         if len(temp_df[temp_df['cloud_cover(%)'] <= 30]) >0:
# #                             print('\t\tsentinel cloud cover lower than 30%')
#                             temp_df = temp_df[temp_df['cloud_cover(%)'] <= 30]

#                             # add the row with the closest date
#                             temp_df = temp_df.sort_values('img_date', ascending=False).reset_index().drop(['index','min_long', 'max_long', 'min_lat', 'max_lat'], axis=1)
#                             temp_df = pd.DataFrame(temp_df.loc[0]).T
#                             row = pd.DataFrame(row).T.reset_index().join(temp_df, how='outer')
#                             row = row.set_index(pd.Series(row_count)).drop(['level_0', 'index'], axis=1)
#                             best_sat_df = pd.concat([best_sat_df, row])
#                             row_count+=1
#                         else:
#                             # If every image has more than 30% cloud cover,
#                             # pick the one with the least cloud cover
# #                             print('\t\tvery cloudy sentinel')
#                             temp_df = temp_df.sort_values('cloud_cover(%)', ascending=True).reset_index().drop(['index','min_long', 'max_long', 'min_lat', 'max_lat'], axis=1)
#                             temp_df = pd.DataFrame(temp_df.loc[0]).T
#                             row = pd.DataFrame(row).T.reset_index().join(temp_df, how='outer')
#                             row = row.set_index(pd.Series(row_count)).drop(['level_0', 'index'], axis=1)
#                             best_sat_df = pd.concat([best_sat_df, row])
#                             row_count+=1

#             else:
# #                 print('\tno sentinel')
#                 if len(temp_df[temp_df['cloud_cover(%)'] <= 30]) >0:
# #                     print('\t\tlandsat cloud cover lower than 30%')
#                     temp_df = temp_df[temp_df['cloud_cover(%)'] <= 30]

#                     # add the row with the closest date
#                     temp_df = temp_df.sort_values('img_date', ascending=False).reset_index().drop(['index','min_long', 'max_long', 'min_lat', 'max_lat'], axis=1)
#                     temp_df = pd.DataFrame(temp_df.loc[0]).T
#                     row = pd.DataFrame(row).T.reset_index().join(temp_df, how='outer')
#                     row = row.set_index(pd.Series(row_count)).drop(['level_0', 'index'], axis=1)
#                     best_sat_df = pd.concat([best_sat_df, row])
#                     row_count+=1
#                 else:
#                     # If every image has more than 30% cloud cover,
#                     # pick the one with the least cloud cover
# #                     print('\t\tvery cloudy landsat')
#                     temp_df = temp_df.sort_values('cloud_cover(%)', ascending=True).reset_index().drop(['index','min_long', 'max_long', 'min_lat', 'max_lat'], axis=1)
#                     temp_df = pd.DataFrame(temp_df.loc[0]).T
#                     row = pd.DataFrame(row).T.reset_index().join(temp_df, how='outer')
#                     row = row.set_index(pd.Series(row_count)).drop(['level_0', 'index'], axis=1)
#                     best_sat_df = pd.concat([best_sat_df, row])
#                     row_count+=1



#     print(f'{len(df)} attempts. {invalid_sats} failures.')
#     return best_sat_df
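The branching in `pick_best_sat` follows one rule: prefer Sentinel images when any exist, then take the most recent image with at most 30% cloud cover, and otherwise fall back to the least cloudy image. The sketch below restates that rule on plain dictionaries. It is a simplified illustration, not a drop-in replacement: `choose_best_image` is a hypothetical name, the dictionary keys mirror the column names used above, and the example images are made up.

```python
from datetime import date


def choose_best_image(images, max_cloud=30):
    """Return the preferred image dict, or None when the list is empty."""
    if not images:
        return None
    sentinel = [img for img in images if "sentinel" in img["satelite_name"].lower()]
    candidates = sentinel or images  # fall back to all images (e.g. Landsat only)
    clear = [img for img in candidates if img["cloud_cover(%)"] <= max_cloud]
    if clear:
        # among the clear images, take the most recent one
        return max(clear, key=lambda img: img["img_date"])
    # every candidate is cloudy, so take the least cloudy one
    return min(candidates, key=lambda img: img["cloud_cover(%)"])


images = [
    {"satelite_name": "landsat-c2-l2", "img_date": date(2021, 8, 22), "cloud_cover(%)": 1.0},
    {"satelite_name": "sentinel-2-l2a", "img_date": date(2021, 8, 10), "cloud_cover(%)": 12.0},
    {"satelite_name": "sentinel-2-l2a", "img_date": date(2021, 8, 20), "cloud_cover(%)": 25.0},
    {"satelite_name": "sentinel-2-l2a", "img_date": date(2021, 8, 21), "cloud_cover(%)": 80.0},
]
best = choose_best_image(images)
print(best["satelite_name"], best["img_date"])  # sentinel-2-l2a 2021-08-20
```

Writing the rule once also means the cloud-cover threshold only needs to change in one place.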

In [4]:
# def get_arrays_from_sats(df):
    
    
#     '''
#     input a dataframe with satellites in it and get a dictionary with arrays 
#     that came from cropped images around the sample area
#     '''

# # Now to get images from the satellites
#     array_dict = {}
#     scaler = functions.MinMaxScaler3D(feature_range=(0,255))
#     error_count = 0
#     attempt_count = 0
#     for index in range(len(df)):
#         row = df.iloc[index]


#         try:
#             attempt_count +=1
#         # checking to see which satellite it came from
#             if 'sentinel' in row['satelite_name']:
#                 # Setting tiny crop box for image
#                 minx, miny, maxx, maxy = row['tiny_crop_bbox']
#                 # getting the image
#                 image = rioxarray.open_rasterio(pc.sign(row['item'].assets["visual"].href)).rio.clip_box(
#                         minx=minx,
#                         miny=miny,
#                         maxx=maxx,
#                         maxy=maxy,
#                         crs="EPSG:4326",
#                     )

#                 image_array = image.to_numpy()
#                 img_array_trans = np.transpose(image_array, axes=[1, 2, 0])
#                 # storing array of image in dictionary
#                 array_dict[row['uid']] = img_array_trans

#             else:
#                 # getting the image from the LandSat satellite
#                 minx, miny, maxx, maxy = row['tiny_crop_bbox']
#                 image = odc.stac.stac_load(
#                         [pc.sign(row['item'])], bands=["red", "green", "blue"], bbox=[minx, miny, maxx, maxy]
#                     ).isel(time=0)

#                 image_array = image[["red", "green", "blue"]].to_array()
#                 img_array_trans = np.transpose(image_array.to_numpy(), axes=[1, 2, 0])
#                 # scaling the image so its the same scale as the sentinel ones
#                 scaled_img = scaler.fit_transform(img_array_trans)
#         #         int_scaled_img = scaled_img.astype(int)
#                 # storing array of image in dictionary
#                 array_dict[row['uid']] = scaled_img
                

#         except Exception:  # a bare except would also swallow KeyboardInterrupt
#             error_count +=1
            
            
#     print(f'{attempt_count} attempted. {error_count} failures.')
#     return array_dict

In [5]:
# def get_features(df, img_arrays):
#     '''
#     input a dataframe and a dictionary of image arrays keyed by uid and create color features from the arrays
#     '''
#     feature_df = pd.DataFrame()
#     for index in range(len(img_arrays.keys())):
#         feature_dict = {}
#         key =list(img_arrays.keys())[index]
# #         row = df.iloc[index]
#         temp_array = img_arrays[key]
#         for n, color in enumerate(['red', 'green', 'blue']):
#             feature_dict['uid'] = key
#             feature_dict[f'{color}_mean'] = np.mean(temp_array[:,:,n])
#             feature_dict[f'{color}_median'] = np.median(temp_array[:,:,n])
#             feature_dict[f'{color}_max'] = np.max(temp_array[:,:,n])
#             feature_dict[f'{color}_min'] = np.min(temp_array[:,:,n])
#             feature_dict[f'{color}_sum'] = np.sum(temp_array[:,:,n])
#             feature_dict[f'{color}_product'] = np.prod(temp_array[:,:,n])
#         feature_df = pd.concat([feature_df, pd.DataFrame(feature_dict, index=[index])], )

#     feature_df = df.merge(feature_df, how='outer', on='uid')
#     return feature_df

In [6]:
# # A function to get it all in one
# def get_sat_to_features(df):
    
#     '''
#     input a dataframe of raw data and get sat images, convert to arrays, and turn into features.
#     '''
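#     # get_sat_info looks up `catalog` as a module-level name, so declare it global here
#     global catalog
#     catalog = Client.open(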
#     "https://planetarycomputer.microsoft.com/api/stac/v1", modifier=pc.sign_inplace
#     )
    
#     # get sat info
#     satelite_dict = get_sat_info(df)
    
#     # pick best sat
#     single_df = pick_best_sat(df, satelite_dict)
    
#     # get image arrays from best sats
#     img_arrays = get_arrays_from_sats(single_df)
    
#     # get a dataframe with relevant features
#     feature_df = get_features(single_df, img_arrays)
    
#     return feature_df

In [7]:
# def clean_data(df):
#     '''
#     input dataframe with all data and clean it.
#     '''
#     # only keeping cols that I need
#     model_df = df[['date', 'latitude', 'longitude', 'season', 'img_date',
#             'red_mean', 'red_median', 'red_max', 'red_min','red_sum',
#             'red_product', 'green_mean', 'green_median', 'green_max',
#             'green_min', 'green_sum', 'green_product', 'blue_mean',
#             'blue_median','blue_max', 'blue_min', 'blue_sum', 'blue_product', 'severity']]
#     # dropping nulls
#     model_df = model_df.dropna()
#     # converting to correct type
#     model_df['date'] = model_df['date'].apply(lambda x: datetime.date(x))
#     # getting difference from image date to sample date and creating feature
#     model_df['days_from_sat_to_sample'] = model_df['date'] - model_df['img_date']
#     # converting to int
#     model_df['days_from_sat_to_sample'] = model_df['days_from_sat_to_sample'].dt.days
#     # converting from datetime to an int
#     model_df['date'] = model_df['date'].apply(lambda x: x.toordinal())
#     model_df['img_date'] = model_df['img_date'].apply(lambda x: x.toordinal())
#     # converting from string to float
#     model_df['latitude'] = model_df['latitude'].astype(float)
#     model_df['longitude'] = model_df['longitude'].astype(float)
    
#     # One hot encoding seasons
#     # note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; adjust for newer versions
#     ohe = OneHotEncoder(sparse=False)
#     seasons = ohe.fit_transform(model_df[['season']])
#     cols = ohe.get_feature_names_out()
#     # converting new ohe to dataframe
#     seasons = pd.DataFrame(seasons, columns=cols, index=model_df.index)
#     model_ohe = pd.concat([model_df.drop('season', axis=1), seasons], axis=1)
#     return model_ohe
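The date handling in `clean_data` can be checked on a single pair of dates. The sketch below uses only the standard library; `date_features` is a hypothetical helper, and the example dates are made up for illustration.

```python
from datetime import date


def date_features(sample_date, img_date):
    """Build the three date-derived values that clean_data adds or converts."""
    return {
        # positive when the image was taken before the sample was collected
        "days_from_sat_to_sample": (sample_date - img_date).days,
        # ordinals turn dates into plain integers a model can consume
        "date": sample_date.toordinal(),
        "img_date": img_date.toordinal(),
    }


features = date_features(date(2018, 5, 14), date(2018, 5, 10))
print(features["days_from_sat_to_sample"])  # 4
```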