## Basic Libraries II

### Exercise

Reusing the same annotations we worked with in the previous session, answer the following items using the libraries we saw today:

1. How many annotations do you have per month and year? Which month has the most annotation files?
2. Create a dictionary where each **key** is a month, and the corresponding **value** is a list containing the names of all annotations whose date falls in that month.
   - Save it in JSON format, and load it again to check that everything is OK.
   - Save it again, this time using Pickle.
   - Instead of storing a list of annotation names for each month, create for each annotation a dictionary with the keys `name` and `date` (a datetime object).
3. Print all the annotations from the second half of 2024, ordered from the oldest to the newest.


In [1]:
# This cell is copied from the previous assignment.

import pandas as pd
import os
from datetime import datetime


class AnnotationFile:
    def __init__(self, file_name):
        # file
        self.file_name = file_name
        file_name_parts = file_name.removesuffix('.txt').split('_')

        # date and time
        self.date = datetime.strptime(file_name_parts[0] + file_name_parts[1], '%Y%m%d%H%M%S')

        # satellite number
        self.satelite_number = file_name_parts[2][2:]

        # version
        self.version = '_'.join(file_name_parts[5:8])

        # unique region
        self.unique_region = '_'.join(file_name_parts[8:11])

In [2]:
# Exercise 1
# This code is copied from Exercises 1-3 of the previous assignment.

# get all files
file_names = os.listdir('../a-4/annotations')


# validation function
def is_valid_file_name(file_name):
    # check if file is a txt file
    if not file_name.endswith('.txt'):
        return False

    # check if file has 11 parts
    if len(file_name.split('_')) != 11:
        return False

    return True


# filter valid files
valid_file_names = [f for f in file_names if is_valid_file_name(f)]

# create class instances
files = [AnnotationFile(f) for f in valid_file_names]
files_df = pd.DataFrame([f.__dict__ for f in files]).sort_values(by='date', ascending=False, ignore_index=True)


# add year_month column
files_df['year_month'] = files_df['date'].dt.to_period('M')

# group by year_month
year_month_df = files_df.groupby('year_month').size().reset_index(name='count')
print('Files grouped by year and month:')
display(year_month_df)

# get max (idxmax returns an index label, so look it up with .loc)
idmax = year_month_df['count'].idxmax()
print(f'Most files in a month: {year_month_df.loc[idmax, "count"]}')
print(f'Month: {year_month_df.loc[idmax, "year_month"]}')

Files grouped by year and month:


|   | year_month | count |
|---|------------|-------|
| 0 | 2024-01    | 27    |
| 1 | 2024-02    | 45    |
| 2 | 2024-03    | 17    |
| 3 | 2024-04    | 25    |
| 4 | 2024-05    | 28    |
| 5 | 2024-06    | 52    |


Most files in a month: 52
Month: 2024-06
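
The same per-month count can also be sketched without pandas, using `collections.Counter` from the standard library. The dates below are hypothetical stand-ins for the parsed annotation dates (`sample_dates` is not part of the assignment data), so the printed counts do not match the table above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps standing in for the parsed annotation dates.
sample_dates = [
    datetime(2024, 1, 5, 10, 30, 0),
    datetime(2024, 1, 20, 8, 0, 0),
    datetime(2024, 2, 3, 12, 15, 0),
    datetime(2024, 2, 14, 9, 45, 0),
    datetime(2024, 2, 28, 23, 59, 59),
]

# Count files per 'YYYY-MM' key.
counts = Counter(d.strftime('%Y-%m') for d in sample_dates)

# most_common(1) returns a one-element list holding the (key, count)
# pair with the highest count.
busiest_month, busiest_count = counts.most_common(1)[0]

for month in sorted(counts):
    print(month, counts[month])
print(f'Busiest month: {busiest_month} ({busiest_count} files)')
```

Because the keys are zero-padded `'YYYY-MM'` strings, sorting them alphabetically also sorts them chronologically.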


In [3]:
# Exercise 2

import json
import pickle

# create dictionary: 'YYYY-MM' -> list of file names
month_dict = {}
for _, file in files_df.iterrows():
    # Period objects are not valid JSON keys, so convert to str
    year_month = str(file['year_month'])

    # setdefault creates the list the first time a month is seen
    month_dict.setdefault(year_month, []).append(file['file_name'])

# save to json
with open('month_dict.json', 'w') as f:
    json.dump(month_dict, f, indent=4)

# load from json and check that the round trip preserved the data
with open('month_dict.json', 'r') as f:
    loaded_month_dict = json.load(f)
assert loaded_month_dict == month_dict


# rebuild month_dict with one {'name', 'date'} dictionary per annotation
month_dict = {}
for _, file in files_df.iterrows():
    year_month = str(file['year_month'])

    # to_pydatetime() converts the pandas Timestamp to a plain datetime object
    record = {'name': file['file_name'], 'date': file['date'].to_pydatetime()}
    month_dict.setdefault(year_month, []).append(record)

# save to pickle (json.dump cannot serialize datetime objects directly)
with open('month_dict.pkl', 'wb') as f:
    pickle.dump(month_dict, f)
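
To check that the dictionary with datetime values survives serialization, a round trip can be sketched as follows. This is a minimal example under stated assumptions: `records` and its file names are hypothetical, the pickle round trip is done in memory rather than through a file, and the `encode_datetime` helper is one possible way to get dates into JSON (as ISO 8601 strings), not the method the exercise prescribes.

```python
import json
import pickle
from datetime import datetime

# Hypothetical records in the same shape as the per-month dictionary.
records = {
    '2024-01': [{'name': 'example_a.txt', 'date': datetime(2024, 1, 5, 10, 30, 0)}],
    '2024-02': [{'name': 'example_b.txt', 'date': datetime(2024, 2, 3, 12, 15, 0)}],
}

# Pickle round trip in memory: datetime objects come back unchanged.
restored = pickle.loads(pickle.dumps(records))
assert restored == records


def encode_datetime(value):
    """Fallback used by json.dumps for objects it cannot serialize."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f'Cannot serialize {type(value).__name__}')


# Without default=..., json.dumps would raise TypeError on the datetime values.
json_text = json.dumps(records, default=encode_datetime)

# JSON returns the dates as strings, so parse them back explicitly.
decoded = json.loads(json_text)
for entries in decoded.values():
    for entry in entries:
        entry['date'] = datetime.fromisoformat(entry['date'])

assert decoded == records
```

Pickle keeps the Python types but is only safe to load from trusted sources, while JSON is portable but needs an explicit conversion for dates.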

In [4]:
# Exercise 3

print('Files in order from oldest to newest in second half of 2024:')

start_date = datetime(2024, 7, 1)
# exclusive upper bound, so files from any time on 31 December are included
end_date = datetime(2025, 1, 1)

for _, file in files_df.sort_values(by='date', ascending=True, ignore_index=True).iterrows():
    if start_date <= file['date'] < end_date:
        print(file['file_name'])

Files in order from oldest to newest in second half of 2024:
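
The list above is empty, which is consistent with the monthly table in Exercise 1: it only contains files from January to June 2024. To see the filter produce output, here is a small sketch on hypothetical data (`sample_files` and its file names are made up for illustration). It also shows one detail worth checking when filtering by date: `datetime(2024, 12, 31)` is midnight at the start of that day, so a half-open interval ending at `datetime(2025, 1, 1)` is used to keep files created later on 31 December.

```python
from datetime import datetime

# Hypothetical file names mapped to their timestamps.
sample_files = {
    'late_december.txt': datetime(2024, 12, 31, 18, 0, 0),
    'june.txt': datetime(2024, 6, 30, 23, 59, 59),
    'july.txt': datetime(2024, 7, 1, 0, 0, 0),
    'next_year.txt': datetime(2025, 1, 1, 0, 0, 0),
}

start_date = datetime(2024, 7, 1)  # inclusive lower bound
end_date = datetime(2025, 1, 1)    # exclusive upper bound

# Keep names whose date falls in [start_date, end_date), oldest first.
second_half = sorted(
    (name for name, date in sample_files.items() if start_date <= date < end_date),
    key=sample_files.get,
)

for name in second_half:
    print(name)
```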
