Added Support For Azure CosmosDB #573
Conversation
```python
cassandra_keyspaces = get_cassandra_keyspaces(credentials, subscription_id, database_account)
mongodb_databases = get_mongodb_databases(credentials, subscription_id, database_account)
table_resources = get_table_resources(credentials, subscription_id, database_account)
yield database_account['id'], database_account['name'], database_account[
```
[question] Why use yield here when the convention is to return a list of dicts? Did you run into memory issues when testing? If so, add a comment on the reason here.
@achantavy I looked at the AWS code and figured the convention was to use `yield` in a function like this, where we get the secondary resources stemming from our primary resource. I followed that for SQL and Storage, so I did the same for this code too.
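To make the convention under discussion concrete, here is a minimal sketch of the generator pattern (hypothetical names; `get_children` stands in for per-account calls like `get_sql_databases`). Yielding keeps only one account's child resources in memory at a time instead of materializing the full list up front:

```python
from typing import Dict, Generator, List, Tuple

# Hypothetical stand-in for an API call such as get_sql_databases().
def get_children(account: Dict) -> List[Dict]:
    return account.get("children", [])

def get_account_details(accounts: List[Dict]) -> Generator[Tuple[str, List[Dict]], None, None]:
    # Yield one tuple per account rather than building a list of all of them.
    for account in accounts:
        yield account["id"], get_children(account)

accounts = [{"id": "DA1", "children": [{"id": "db1"}]}, {"id": "DA2"}]
details = list(get_account_details(accounts))
```

Whether this buys anything over returning a list depends on how many accounts a subscription has; for small result sets the two are interchangeable.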
```python
    return []
except HttpResponseError as e:
    logger.warning(f"Error while retrieving Cassandra keyspaces list - {e}")
    return []
```
Something to keep in mind for the other `get` functions: if we hit a transient HTTP error, this function will return an empty list, and then the cleanup functions will run, deleting all data from the previous run.
Not going to block on this since this is a larger problem with cartography itself (we should make error handling configurable and expose that to all modules), but just wanted to make sure you considered this.
cc: @mpurusottamc
@achantavy Yes, this makes a lot of sense to handle transient HTTP errors. I will open an issue to keep track and work on all the get functions for both Azure and AWS.
One idea is to return multiple values from get functions. First value would be error code and second value would be the actual list. If there's an error, we ignore the load and cleanup steps. Let me know what you think about this.
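The proposal above can be sketched as follows. This is a hypothetical illustration of the multiple-return-value idea, not the project's actual API; `sync`, `get_resources_safely`, and the generic `Exception` catch (standing in for `HttpResponseError`) are all assumptions:

```python
from typing import Callable, Dict, List, Tuple

def get_resources_safely(fetch: Callable[[], List[Dict]]) -> Tuple[bool, List[Dict]]:
    # Return (ok, resources) so the caller can distinguish "empty" from "errored".
    try:
        return True, fetch()
    except Exception:  # stand-in for azure.core.exceptions.HttpResponseError
        return False, []

def sync(fetch: Callable[[], List[Dict]]) -> str:
    ok, resources = get_resources_safely(fetch)
    if not ok:
        # Skip both load and cleanup, so a transient error does not
        # wipe out data from the previous successful run.
        return "skipped"
    return f"loaded {len(resources)}"

def _boom() -> List[Dict]:
    raise RuntimeError("transient HTTP error")

ok_result = sync(lambda: [{"id": 1}])
err_result = sync(_boom)
```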
cc: @kedarghule
@mpurusottamc Let's continue the conversation on this in #166.
```python
try:
    client = get_client(credentials, subscription_id)
    mongodb_database_list = list(
        map(
```
[question] I'm a bit confused by this syntax, can you explain? I'm guessing `list_mongo_db_databases` returns a list of objects and you use `as_dict()` to convert it to a list of dicts?
Yes @achantavy, you're right. It returns a list of objects and I am converting them to a list of dicts.
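The conversion pattern being discussed can be shown in isolation. `FakeSDKModel` below is a stand-in for an Azure SDK model object exposing `as_dict()`; the list comprehension is an equivalent, arguably clearer, alternative to `map()`:

```python
class FakeSDKModel:
    # Stand-in for an Azure SDK model object that exposes as_dict().
    def __init__(self, name: str):
        self.name = name

    def as_dict(self):
        return {"name": self.name}

models = [FakeSDKModel("db1"), FakeSDKModel("db2")]

# The map()-based conversion, as in the PR:
via_map = list(map(lambda m: m.as_dict(), models))

# Equivalent list comprehension:
via_comp = [m.as_dict() for m in models]
```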
```python
mongodb_databases: List[Dict] = []
table_resources: List[Dict] = []

for account_id, name, resourceGroup, sql_database, cassandra_keyspace, mongodb_database, table in details:
```
Same comment here about decoupling the data transforms from the load functions.
I also think it's more maintainable to split these by resource type: one function for sql dbs, one for cassandra, one for mongo, one for tables.
@achantavy - I added a transform function for the manipulations happening here. My bad, I forgot to do that.
About the splitting up bit: firstly, it would just repeat a lot of code. I have split it up later for the individual resources under SQL databases, Cassandra keyspaces, and Mongo. Second, these are the direct resources under Database Accounts (our primary resource), so I figured I'd keep them in this function, since similar conventions were followed in the AWS code. Lastly, `details` gets populated in `get_database_account_details()` (line 436). If I have to split up the code for each resource, I'll have to do some major overhauls, which I feel might repeat the code.
Fair points on copying prior patterns. Disclaimer: there is a lot of prior code in the AWS modules that isn't exactly a shining example of great software :-)
In general, I think it's less bad to repeat code than it is to tightly couple and generalize functions too early ("write code that is easy to delete" and all that).
All that said, we're working to improve things by standardizing on the load steps in the near future: https://docs.google.com/document/d/1IZ12R3oROn11LcYj5XunokyOjJkKu-H2O1TEk065Dsk/edit#
Will not block for now.
Alright. I'll def check out this document :) Thanks for letting me know :)
cartography/intel/azure/cosmosdb.py
Outdated
```python
for loc in database_account['locations']:
    locations.append(loc)
# Selecting only the unique location entries
loc = [i for n, i in enumerate(locations) if i not in locations[n + 1:]]
```
How does this select only uniques? Could we use a set instead?
@achantavy - So `locations` is a list of dictionaries, with each item having the following sample structure:

```
{
    "id": "DA1-eastus",
    "location_name": "East US",
    "document_endpoint": "https://DA1-eastus.documents.azure.com:443/",
    "provisioning_state": "Succeeded",
    "failover_priority": 0,
}
```
And the plan is to return only the unique dicts from that list of locations. Since `locations` is a list of dicts, `set()` can't be used, because dicts are unhashable; I tried it, got a TypeError, and found that explanation on Stack Overflow. So I used method 2 from this link - https://www.geeksforgeeks.org/python-removing-duplicate-dicts-in-list/ - and it worked like a peach! Tested this code and it works :)
Are all the items unique by their `"id"`? Here's a nicer way you could get the unique items, borrowing from the third example in your link:
```python
>>> test_list = [{'id': i} for i in range(3)] * 2
>>> test_list
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 0}, {'id': 1}, {'id': 2}]
>>> uniques = {x['id']: x for x in test_list}.values()
>>> uniques
dict_values([{'id': 0}, {'id': 1}, {'id': 2}])
>>> list(uniques)
[{'id': 0}, {'id': 1}, {'id': 2}]
```
The `id` field is not unique, hence I stuck to the method I used.
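A middle ground worth noting: since the values in each location dict are plain strings and ints, the whole dict can be turned into a hashable key, giving the same "whole entry decides uniqueness" behavior in roughly linear time. This is a sketch, not code from the PR:

```python
locations = [
    {"id": "DA1-eastus", "failover_priority": 0},
    {"id": "DA1-eastus", "failover_priority": 10},
    {"id": "DA1-eastus", "failover_priority": 0},  # duplicate of the first entry
]

# Dicts are unhashable, so key on an immutable view of each dict instead.
# Two entries collide only if every key/value pair matches, so differing
# failover_priority values are kept as distinct locations.
unique = list({tuple(sorted(d.items())): d for d in locations}.values())
```

Compared to the `enumerate`-based comprehension, this avoids the quadratic `i not in locations[n + 1:]` membership scans.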
Can you provide an example?
Something like this - you may have this in 'read locations':

```
{
    "id": "DA1-eastus",
    "location_name": "East US",
    "document_endpoint": "https://DA1-eastus.documents.azure.com:443/",
    "provisioning_state": "Succeeded",
    "failover_priority": 0,
}
```
And you may get this in 'write locations':

```
{
    "id": "DA1-eastus",
    "location_name": "East US",
    "document_endpoint": "https://DA1-eastus.documents.azure.com:443/",
    "provisioning_state": "Succeeded",
    "failover_priority": 10,
}
```

In this case, they need to be treated as two different "locations" because of the different values in `failover_priority`.
I think I understand now. And later on, in lines 161-170, you do another filtering of this list. I recommend creating separate lists for each of the types "read", "write", and "associated". Would these individual lists each need to be reduced to have their items be unique? For example:

```python
write_locations = database_account.get('write_locations', [])
```
And then I think for your `_load_..._locations()` you might also use `UNWIND`.
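For reference, an `UNWIND`-style load query for the write locations might look like the sketch below. The query text, parameter names, and node properties are assumptions based on the relationship shown in the schema snippet (`WRITE_PERMISSIONS_FROM`), not code from the PR; `UNWIND` lets a single query upsert every location and relationship from one parameter list:

```python
# Hypothetical Cypher for a _load_database_account_write_locations() helper.
ingest_write_locations = """
UNWIND $Locations AS loc
MERGE (l:AzureCosmosDBLocation{id: loc.id})
ON CREATE SET l.firstseen = timestamp()
SET l.location_name = loc.location_name,
    l.failover_priority = loc.failover_priority,
    l.lastupdated = $update_tag
WITH l
MATCH (d:AzureCosmosDBAccount{id: $DatabaseAccountId})
MERGE (d)-[r:WRITE_PERMISSIONS_FROM]->(l)
SET r.lastupdated = $update_tag
"""
```

A separate, pre-deduplicated list per relationship type keeps each query simple and avoids re-filtering one combined list.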
Done!
cartography/intel/azure/cosmosdb.py
Outdated
```python
        {'UPDATE_TAG': azure_update_tag, 'AZURE_SUBSCRIPTION_ID': subscription_id},
    )

    if 'cors' in database_account and len(database_account['cors']) > 0:
```
This function does too many things at once. It does a transform+load for each type of policies/rules, and it does a transform+load for each type of location.
I'd recommend separating this for each resource.
I'll clarify this in a follow-up comment if needed; hold off on editing if you're confused. It's getting late here but I wanted to make sure I gave you a somewhat timely second pass. :)
@achantavy Made necessary changes and left replies for you :) Lemme know of any more changes.
Please address the comments in the `_load_...` discussion.
Made the necessary changes. Lemme know if there are any more changes.
docs/schema/azure.md
Outdated
- Azure Database Account has write permissions from, read permissions from, and is associated with Azure CosmosDB Locations.

```
(AzureCosmosDBAccount)-[WRITE_PERMISSIONS_FROM]->(AzureCosmosDBLocation)
```
I'm a little confused about the wording of this since I am not familiar with Cosmos. What does it mean for a CosmosDBAccount to have read permissions from a CosmosDBLocation versus having it be associated with a CosmosDBLocation?
- `read_locations`: An array that contains the read locations enabled for the Cosmos DB account.
- `write_locations`: An array that contains the write locations for the Cosmos DB account.
- `locations`: This is the one which creates the "associated with" relationship. An array that contains all of the locations enabled for the Cosmos DB account; it also specifies the regions in which the Azure Cosmos DB database account is deployed.
Maybe something like `CAN_READ` and `CAN_WRITE`?
Database accounts can't read a location. They will have read permission from a location. Likewise for write permissions.
What do you mean by "They will have read permission from a location"? For a particular database and location, do you mean the accounts will have permissions to read the data when accessing the database through that location?
How about `CAN_READ_FROM`?
Yes, you are right. Okay! I'll change it to `CAN_READ_FROM` and `CAN_WRITE_FROM`.
@achantavy Made the changes. Let me know if there are any more :)

@achantavy and @ramonpetgrave64 All changes made :) Let me know if there are any more! :)
Thanks again for the contribution!
@achantavy Updated the latest in this branch. You can merge into master :)
Thanks for your patience!
This PR adds initial support for Azure CosmosDB. The following resources are populated -