
[CoE Starter Kit - QUESTION] [BYODL] when to delete old files from datalake / storage account? #6550

Closed
MSFT-klpinhac opened this issue Sep 7, 2023 · 5 comments
Labels
coe-starter-kit CoE Starter Kit issues documentation Improvements or additions to documentation

Comments

@MSFT-klpinhac

Does this question already exist in our backlog?

  • I have checked and confirm this is a new question.

What is your question?

The data export feature creates a large number of JSON files in the storage account / data lake, and my customer needs to implement a clean-up for several reasons, e.g. compliance (old files need to be deleted from the storage account / data lake).
Is there any recommendation for such a clean-up?
I would assume a scheduled task/cron job that runs, e.g., once per month and deletes all files in the storage account / data lake that were created more than 6 months ago should work fine as a clean-up.
Is there anything my customer should watch out for, e.g. are there any dependencies that could cause trouble if files created more than 6 months ago are deleted from the storage account / data lake?

What solution are you experiencing the issue with?

Core

What solution version are you using?

No response

What app or flow are you having the issue with?

No response

What method are you using to get inventory and telemetry?

None

@MSFT-klpinhac MSFT-klpinhac added coe-starter-kit CoE Starter Kit issues question Further information is requested labels Sep 7, 2023
@manuelap-msft
Contributor

Hello,

Great question, here are some thoughts.

There are two types of files exported to the data lake:

  • Usage files: in the usage folder for apps and flows, you get a new file daily with the previous day's usage. For these there are two strategies. Either you decide how far back you want to report on usage and delete everything older: for year-on-year adoption reporting you need to keep files for a year; if you only report on the past three months, you can delete files older than 3 months. This has no impact on the CoE kit; the reports simply cover only the files you keep. Alternatively, you save aggregated data but delete the raw files, e.g. you store (somewhere) app and flow runs per day / per month as a number, but delete the file that contains the run/launch details. You would then have to modify the CoE kit reports to work off the aggregated data; I think you could achieve that using Power BI dataflows.
  • Inventory files: for inventory files (e.g. environments.json, all the files in the App/Flows folder) you don't actually get a new file per day. Each file uses the environment GUID as its name and is overwritten only IF there are changes to that environment. That makes stale files harder to identify: a file that hasn't been updated in a while isn't necessarily inaccurate; it may just mean the environment itself hasn't been updated in a while either. The current environments are always listed in environments.json, so your deletion strategy should be to read that file and delete all files whose environment no longer exists in environments.json. If you delete files based on the timestamp alone, you risk incorrect inventory (e.g. you may delete a file for an environment that still exists but hasn't been modified in a while).
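
The environments.json-driven clean-up for inventory files could be sketched like this — a rough illustration assuming a local mirror of the lake, and assuming each record in environments.json carries its GUID in a field I'm calling `name` (the field name is a guess; check the actual schema of your export before relying on it):

```python
import json
from pathlib import Path

def active_environment_ids(environments_file: Path,
                           id_field: str = "name") -> set[str]:
    """Collect the GUIDs of environments still present in environments.json.

    `id_field` is an assumption about the export schema; verify it
    against your own environments.json before using this.
    """
    records = json.loads(environments_file.read_text(encoding="utf-8"))
    return {record[id_field] for record in records}

def stale_inventory_files(inventory_dir: Path,
                          active_ids: set[str]) -> list[Path]:
    """Inventory files are named <environment GUID>.json, so any file whose
    stem no longer appears in environments.json belongs to a deleted
    environment and is a candidate for removal."""
    return [p for p in inventory_dir.glob("*.json") if p.stem not in active_ids]
```

Deleting only the returned paths, rather than anything timestamp-based, follows the inventory strategy above: files are removed only when their environment has vanished from environments.json.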

Hope that helps,
Manuela

@Jenefer-Monroe Jenefer-Monroe added documentation Improvements or additions to documentation and removed question Further information is requested labels Sep 7, 2023
@Jenefer-Monroe
Collaborator

Let's consider adding this to the BYODL FAQ documentation.

@MSFT-klpinhac
Author

MSFT-klpinhac commented Sep 13, 2023

Thanks a lot for your answer!

My understanding is that the entries from the files in the data lake are reflected in the Dataverse instance of the CoE, and if files in the data lake are removed, then the data in the Dataverse instance of the CoE that is based on the removed files is removed as well.
That would mean it's not necessary to remove data from the Dataverse instance of the CoE when the files are removed from the data lake.
Is my understanding correct?

@manuelap-msft
Contributor

For inventory, that's correct. If a file is deleted in the data lake, we assume that the environment/app/flow/etc does not exist anymore and remove it from the CoE inventory (or mark it as deleted there).

For usage, we don't write the usage information for apps or flows to Dataverse anymore - we only consume it into the Power BI dashboard via Power BI Dataflows - so once you delete those, they're lost.

But yes, overall it's not necessary to remove data from Dataverse if you remove it from the data lake.

@CoEStarterKitBot
Collaborator

@MSFT-klpinhac This has been fixed in the latest release. Please install the latest version of the toolkit following the instructions for installing updates. Note that if you do not remove the unmanaged layers as described there you will not receive updates from us.
