## Basic Libraries I

Let's jump into today's exercise.

### Exercise

Given a zip file containing a subfolder with multiple annotation files, where each file name follows this naming convention:

`{DATE}_{TIME}_SN{SATELLITE_NUMBER}_QUICKVIEW_VISUAL_{VERSION}_{UNIQUE_REGION}.txt`

where:

- DATE is expressed as YYYYMMDD (year, month and day), e.g. 20241201, 20230321 ...
- TIME is expressed as HHMMSS (hours, minutes and seconds), e.g. 215120.
- SATELLITE_NUMBER is an integer that identifies the satellite.
- VERSION is the version of the pipeline, e.g. "0_1_2", "1_3_1" ...
- UNIQUE_REGION is a string that identifies a unique location, e.g. SATL-2KM-10N_552_4164.

Find out the following things about your data:

1. How many files the annotations folder contains.
2. How many of them follow the naming convention described above.
3. How many annotations you have per month and year, and which month has the most annotation files.
4. Create a new annotations folder with one subfolder per month.
5. Print all the annotations from the most recent to the oldest.
6. How many different satellites there are, how many annotations there are per satellite number, and which satellite was used in the most recent annotation file.
7. How many unique regions there are.

Some tips:

- The `str` class has a method called `split`; you can use it to extract each field from an annotation file name.
- You can use `sort` from NumPy on strings.
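
The two tips can be tried on a single file name before writing the full solution. The following is a minimal sketch, assuming the file name follows the convention exactly; the sample name is one of the files listed in this notebook's output, and the variable names are illustrative only.

```python
import numpy as np

# Sample file name that follows the naming convention
name = "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt"

# Drop the extension, then split on underscores: 11 parts in total
parts = name.removesuffix(".txt").split("_")

date, time = parts[0], parts[1]
satellite_number = int(parts[2].removeprefix("SN"))
version = "_".join(parts[5:8])         # three version components
unique_region = "_".join(parts[8:11])  # three region components

print(date, time, satellite_number, version, unique_region)

# np.sort orders strings lexicographically (ascending)
stamps = np.array(["20240619_052401", "20240623_215120", "20240101_000000"])
sorted_stamps = np.sort(stamps)
print(sorted_stamps)
```

Because VERSION and UNIQUE_REGION themselves contain underscores, they have to be re-joined from three parts each after the split.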


In [475]:
import pandas as pd
import os
import shutil as sh
from datetime import datetime


class AnnotationFile:
    def __init__(self, file_name):
        # file
        self.file_name = file_name
        file_name_parts = file_name.removesuffix('.txt').split('_')

        # date and time
        self.date = datetime.strptime(file_name_parts[0] + file_name_parts[1], '%Y%m%d%H%M%S')

        # satellite number
        self.satelite_number = file_name_parts[2][2:]

        # version
        self.version = '_'.join(file_name_parts[5:8])

        # unique region
        self.unique_region = '_'.join(file_name_parts[8:11])

In [476]:
# Exercise 1

# get all files
file_names = os.listdir('annotations')
print(f'Total no. of files: {len(file_names)}')

Total no. of files: 206


In [477]:
# Exercise 2

# validation function
def is_valid_file_name(file_name):
    # check if file is a txt file
    if not file_name.endswith('.txt'):
        return False

    # check if file name has 11 underscore-separated parts
    if len(file_name.split('_')) != 11:
        return False
    
    return True

# filter valid files
valid_file_names = [f for f in file_names if is_valid_file_name(f)]
print(f'No. of valid files: {len(valid_file_names)}')

# create class instances
files = [AnnotationFile(f) for f in valid_file_names]
files_df = pd.DataFrame([f.__dict__ for f in files]).sort_values(by='date', ascending=False, ignore_index=True)

No. of valid files: 194
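
The check above only counts underscore-separated parts, so a name such as `a_b_c_d_e_f_g_h_i_j_k.txt` would pass it. A stricter variant could match the whole convention with a regular expression and then confirm that the date and time are real calendar values. This is a sketch under assumptions: the pattern is derived from the exercise text, the region is assumed to end in two numeric parts (as in the example `SATL-2KM-10N_552_4164`), and `is_strictly_valid` is a hypothetical helper, not part of the original solution. It may yield a different count than the one printed above.

```python
import re
from datetime import datetime

# Assumed pattern, derived from the naming convention in the exercise text
FILE_NAME_PATTERN = re.compile(
    r"(?P<date>\d{8})_(?P<time>\d{6})_SN(?P<satellite>\d+)"
    r"_QUICKVIEW_VISUAL_(?P<version>\d+_\d+_\d+)"
    r"_(?P<region>[^_]+_\d+_\d+)\.txt"
)


def is_strictly_valid(file_name: str) -> bool:
    """Return True if file_name matches the convention and holds a real date and time."""
    match = FILE_NAME_PATTERN.fullmatch(file_name)
    if match is None:
        return False
    try:
        # Rejects impossible values such as month 13 or hour 25
        datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    except ValueError:
        return False
    return True


print(is_strictly_valid("20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt"))
print(is_strictly_valid("a_b_c_d_e_f_g_h_i_j_k.txt"))
```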


In [478]:
# Exercise 3

# add year_month column
files_df['year_month'] = files_df['date'].dt.to_period('M')

# group by year_month
year_month_df = files_df.groupby('year_month').size().reset_index(name='count')
print('Files grouped by year and month:')
display(year_month_df)

# get max (idxmax returns an index label, so look it up with .loc)
idmax = year_month_df['count'].idxmax()
print(f'Most files in a month: {year_month_df.loc[idmax, "count"]}')
print(f'Month: {year_month_df.loc[idmax, "year_month"]}')

Files grouped by year and month:


|   | year_month | count |
|---|------------|-------|
| 0 | 2024-01    | 27    |
| 1 | 2024-02    | 45    |
| 2 | 2024-03    | 17    |
| 3 | 2024-04    | 25    |
| 4 | 2024-05    | 28    |
| 5 | 2024-06    | 52    |


Most files in a month: 52
Month: 2024-06


In [479]:
# Exercise 4

path = 'new_annotations'

# reset directory
if os.path.exists(path):
    sh.rmtree(path)

# init directory
os.makedirs(path)

for _, f in files_df.iterrows():
    # create the month subdirectory if it does not exist yet
    # (os.path.join avoids nested quotes in f-strings, which need Python 3.12+)
    month_dir = os.path.join(path, str(f['year_month']))
    os.makedirs(month_dir, exist_ok=True)
    # copy file
    sh.copy(os.path.join('annotations', f['file_name']), os.path.join(month_dir, f['file_name']))
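
The same month-based layout can also be built without pandas, reading the year and month directly from the first six characters of each file name. The sketch below assumes every `.txt` file in the source folder already follows the naming convention; `copy_by_month` is a hypothetical helper, and the two file names are made up for the demonstration, which runs in a temporary directory so it does not touch real data.

```python
import shutil
import tempfile
from pathlib import Path


def copy_by_month(src_dir: Path, dst_dir: Path) -> None:
    """Copy each .txt file in src_dir into dst_dir/YYYY-MM, based on its name."""
    for src in src_dir.glob("*.txt"):
        # The name starts with YYYYMMDD, so characters 0-3 are the year and 4-5 the month
        year_month = f"{src.name[:4]}-{src.name[4:6]}"
        target_dir = dst_dir / year_month
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, target_dir / src.name)


# Demonstration on a throwaway directory with two empty, made-up files
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "annotations"
    src_dir.mkdir()
    for name in (
        "20240105_101500_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt",
        "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    ):
        (src_dir / name).touch()

    dst_dir = Path(tmp) / "new_annotations"
    copy_by_month(src_dir, dst_dir)

    # Collect the copied files as paths relative to the new folder
    created = sorted(p.relative_to(dst_dir).as_posix() for p in dst_dir.rglob("*.txt"))
    print(created)
```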

In [480]:
# Exercise 5

# files are already sorted by date
for i, f in files_df.iterrows():
    print(f['file_name'])

20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt
20240623_215102_SN43_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_384_3750.txt
20240623_193704_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_566_3734.txt
20240619_215556_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_742_4460.txt
20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt
20240619_052401_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-52N_368_4336.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_458_3756.txt
20240618_215539_SN31_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_452_3740.txt
20240618_193146_SN27_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_530_3682.txt
20240617_211350_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_724_3614.txt
20240617_184443_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_702_3566.txt
20240617_052859_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-51N_730_4348.txt
20240616_213053_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_460_3792.txt
20240616_213047_SN30_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_466_3828.txt
20240616_213047_SN30
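
Because every valid file name starts with a fixed-width `YYYYMMDD_HHMMSS` prefix, sorting the names as plain strings already gives chronological order, which is what the NumPy tip in the exercise points at. A minimal sketch, using three file names taken from the listing above:

```python
import numpy as np

names = np.array([
    "20240617_052859_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-51N_730_4348.txt",
    "20240623_215120_SN29_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-10N_596_4134.txt",
    "20240619_185757_SN24_QUICKVIEW_VISUAL_1_7_0_SATL-2KM-11N_528_3700.txt",
])

# np.sort is ascending (oldest first); reverse it to get the newest first
newest_first = np.sort(names)[::-1]
for name in newest_first:
    print(name)
```

When two files share the same date and time, this approach orders them by the remainder of the name, so ties may come out in a different order than with the DataFrame sort.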

In [481]:
# Exercise 6

# group by satelite_number
satellite_df = files_df.groupby('satelite_number').size().reset_index(name='count')

# sort by count
satellite_df = satellite_df.sort_values('count', ascending=False)

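print(f'No. of different satellites: {len(satellite_df)}')
print('Files grouped by satellite number:')
display(satellite_df)

# files_df is sorted by date (newest first), so its first row is the most recent file
print(f'Most recent satellite number: {files_df.iloc[0]["satelite_number"]}')

No. of different satellites: 9
Files grouped by satellite number:


|   | satelite_number | count |
|---|-----------------|-------|
| 1 | 26              | 37    |
| 2 | 27              | 29    |
| 0 | 24              | 26    |
| 4 | 29              | 22    |
| 6 | 31              | 19    |
| 5 | 30              | 18    |
| 3 | 28              | 16    |
| 7 | 33              | 16    |
| 8 | 43              | 11    |


Most recent satellite number: 29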


In [482]:
# Exercise 7
print(f'No. of unique regions: {files_df["unique_region"].nunique()}')

No. of unique regions: 137
