
Added Support For Azure CosmosDB #573

Merged: 57 commits, Apr 22, 2021

Conversation

kedarghule (Collaborator):

This PR adds initial support for Azure CosmosDB. The following resources are populated:

  • Azure Database Accounts
  • Azure CosmosDB Locations
  • Azure Database Accounts Cors Policies
  • Azure Database Accounts Failover Policies
  • Private Endpoint Connections
  • Virtual Network Rules
  • SQL Databases
  • SQL Containers
  • Cassandra Keyspaces
  • Cassandra Tables
  • MongoDB Databases
  • MongoDB Collections
  • CosmosDB Table Resources

mpurusottamc and others added 30 commits December 5, 2020 07:47
Review threads on docs/schema/azure.md and cartography/intel/azure/cosmosdb.py:

```
cassandra_keyspaces = get_cassandra_keyspaces(credentials, subscription_id, database_account)
mongodb_databases = get_mongodb_databases(credentials, subscription_id, database_account)
table_resources = get_table_resources(credentials, subscription_id, database_account)
yield database_account['id'], database_account['name'], database_account[
```
Contributor:

[question] Why use yield here when the convention is to return a list of dicts? Did you run into memory issues when testing? If so, add a comment on the reason here.

Collaborator Author:

@achantavy I looked at the AWS code and figured the convention was to use yield in a function like this, where we get the secondary resources stemming from our primary resource. I followed that for SQL and Storage, so I did the same here too.
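For context, a minimal sketch of that generator pattern (function and field names here are illustrative, not the actual cartography code):

```python
from typing import Dict, Generator, List, Tuple

def get_database_account_details(
    database_accounts: List[Dict],
) -> Generator[Tuple[str, str, List[Dict]], None, None]:
    """Yield one tuple per primary resource (database account)."""
    for account in database_accounts:
        # In the real module these would be API calls such as
        # get_sql_databases(credentials, subscription_id, account);
        # here we fake it with a plain dict lookup.
        sql_databases = account.get("sql_databases", [])
        yield account["id"], account["name"], sql_databases
```

With yield, the secondary resources for each account are produced lazily as the sync loop consumes them, rather than being accumulated up front in one large list.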

```
        return []
    except HttpResponseError as e:
        logger.warning(f"Error while retrieving Cassandra keyspaces list - {e}")
        return []
```
Contributor:

Something to keep in mind for the other get functions: if we have a transient HTTP error, this function will return an empty list, and then the cleanup functions will run, deleting all data from a previous run.

Not going to block on this since this is a larger problem with cartography itself (we should make error handling configurable and expose that to all modules), but just wanted to make sure you considered this.

mpurusottamc (Contributor):

@achantavy Yes, this makes a lot of sense for handling transient HTTP errors. I will open an issue to keep track of this and work on all the get functions for both Azure and AWS.

One idea is to return multiple values from the get functions: the first value would be an error code and the second the actual list. If there's an error, we skip the load and cleanup steps. Let me know what you think about this.
cc: @kedarghule
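One possible shape of that idea, sketched below (the helper name and the error-as-first-value convention are hypothetical, not cartography's actual API):

```python
import logging
from typing import Callable, Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)

def get_resources_safe(
    fetch: Callable[[], List[Dict]],
) -> Tuple[Optional[Exception], List[Dict]]:
    """Return (error, results) so a transient failure is distinguishable
    from 'this subscription genuinely has zero resources'."""
    try:
        return None, fetch()
    except Exception as e:  # in the real code, HttpResponseError
        logger.warning(f"Error while retrieving resource list - {e}")
        return e, []

err, keyspaces = get_resources_safe(lambda: [{"id": "ks1"}])
if err is None:
    pass  # only now is it safe to run the load and cleanup steps
```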

Contributor:

@mpurusottamc Let's continue the conversation on this in #166.

```
    try:
        client = get_client(credentials, subscription_id)
        mongodb_database_list = list(
            map(
```
Contributor:

[question] I'm a bit confused by this syntax, can you explain? I'm guessing list_mongo_db_databases returns a list of objects and you use as_dict() to convert it to a list of dicts?

Collaborator Author:

Yes @achantavy, you're right. It returns a list of objects and I am converting them to a list of dicts.
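A small self-contained illustration of that conversion (the SDK client is replaced with a stand-in class; Azure SDK model objects do expose an as_dict() method):

```python
from typing import Dict, List

class StubMongoDBDatabase:
    # Stand-in for an Azure SDK model object returned by
    # list_mongo_db_databases(); real SDK models expose .as_dict() too.
    def __init__(self, name: str) -> None:
        self.name = name

    def as_dict(self) -> Dict:
        return {"name": self.name}

sdk_objects = [StubMongoDBDatabase("db1"), StubMongoDBDatabase("db2")]

# The pattern under discussion: map each model object to a plain dict.
mongodb_database_list = list(map(lambda x: x.as_dict(), sdk_objects))

# An equivalent, arguably clearer, list comprehension:
mongodb_database_list = [x.as_dict() for x in sdk_objects]
```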

```
    mongodb_databases: List[Dict] = []
    table_resources: List[Dict] = []

    for account_id, name, resourceGroup, sql_database, cassandra_keyspace, mongodb_database, table in details:
```
Contributor:

Same comment here about decoupling the data transforms from the load functions.

I also think it's more maintainable to split these by resource type: one function for sql dbs, one for cassandra, one for mongo, one for tables.

Collaborator Author:

@achantavy - I added a transform function for the manipulations happening here. My bad, I forgot to do that.

About the splitting up: first, it would just repeat a lot of code, and I have already split things up later for the individual resources under SQL databases, Cassandra keyspaces, and Mongo. Second, these are the direct resources under Database Accounts (our primary resource), so I figured I'd keep them in this function, since similar conventions are followed in the AWS code. Lastly, details gets populated in get_database_account_details() (line 436); if I split up the code for each resource, I'll need some major overhauls, which I feel would repeat code.

Contributor:

Fair points on copying prior patterns. Disclaimer: there is a lot of prior code in the AWS modules that isn't exactly a shining example of great software :-)

In general, I think it's less bad to repeat code than it is to tightly couple and generalize functions too early ("write code that is easy to delete" and all that).

All that said, we're working to improve things by standardizing on the load steps in the near future: https://docs.google.com/document/d/1IZ12R3oROn11LcYj5XunokyOjJkKu-H2O1TEk065Dsk/edit#

Will not block for now.

Collaborator Author:

Alright, I'll definitely check out this document :) Thanks for letting me know :)

```
    for loc in database_account['locations']:
        locations.append(loc)
    # Selecting only the unique location entries
    loc = [i for n, i in enumerate(locations) if i not in locations[n + 1:]]
```
Contributor:

How does this select only uniques? Could we use a set instead?

kedarghule (Collaborator Author), Apr 6, 2021:

@achantavy - So locations is a list of dictionaries, with each item having the following sample structure:

```
{
    "id": "DA1-eastus",
    "location_name": "East US",
    "document_endpoint": "https://DA1-eastus.documents.azure.com:443/",
    "provisioning_state": "Succeeded",
    "failover_priority": 0,
}
```

And the plan is to return only the unique dicts from this list. Since locations is a list of dicts, set() can't be used: dicts are unhashable, so I got a TypeError when I tried it (and found the explanation on Stack Overflow).

So I used method 2 from this link - https://www.geeksforgeeks.org/python-removing-duplicate-dicts-in-list/ - and it worked like a peach! Tested this code and it works :)
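That dedupe technique, run on a small sample (keys trimmed for brevity):

```python
locations = [
    {"id": "DA1-eastus", "failover_priority": 0},
    {"id": "DA1-eastus", "failover_priority": 0},   # exact duplicate
    {"id": "DA1-eastus", "failover_priority": 10},  # same id, different priority
]

# Keep an entry only if no identical dict appears later in the list.
# O(n^2), but dicts are unhashable, so a plain set() is not an option here.
unique_locations = [d for n, d in enumerate(locations) if d not in locations[n + 1:]]
```

Note that it deduplicates on the full dict contents, not just the id, which matters for the failover_priority discussion below.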

Contributor:

Are all the items unique by their "id"? Here's a nicer way you could get the unique items, borrowing from the third example in your link.

```
>>> test_list = [{'id': i} for i in range(3)] * 2
>>> test_list
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 0}, {'id': 1}, {'id': 2}]
>>> uniques = {x['id']: x for x in test_list}.values()
>>> uniques
dict_values([{'id': 0}, {'id': 1}, {'id': 2}])
>>> list(uniques)
[{'id': 0}, {'id': 1}, {'id': 2}]
```

Collaborator Author:

The id field is not unique. Hence I stuck to the method I used.

Contributor:

Can you provide an example?

Collaborator Author:

Something like this - you may have this in read_locations:

```
{
    "id": "DA1-eastus",
    "location_name": "East US",
    "document_endpoint": "https://DA1-eastus.documents.azure.com:443/",
    "provisioning_state": "Succeeded",
    "failover_priority": 0,
}
```

And you may get this in write_locations:

```
{
    "id": "DA1-eastus",
    "location_name": "East US",
    "document_endpoint": "https://DA1-eastus.documents.azure.com:443/",
    "provisioning_state": "Succeeded",
    "failover_priority": 10,
}
```

In this case, they need to be treated as two different "locations" because of the different values in failover_priority.

Contributor:

I think I understand now. And later on in line 161-170 you do another filtering of this list. I recommend creating separate lists for each of the types "read", "write", and "associated". Would these individual lists each need to be reduced to have their items be unique?

For example:

```
write_locations = database_account.get('write_locations', [])
```

And then I think for your _load_..._locations() you might also use UNWIND.

Collaborator Author:

Done!
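A sketch of the suggested split (the field names follow the discussion above; the Cypher string is illustrative of the UNWIND idea, not the exact query in the module):

```python
database_account = {
    "write_locations": [{"id": "DA1-eastus", "failover_priority": 0}],
    "read_locations": [{"id": "DA1-westus", "failover_priority": 1}],
    "locations": [
        {"id": "DA1-eastus", "failover_priority": 0},
        {"id": "DA1-westus", "failover_priority": 1},
    ],
}

# One list per relationship type, defaulting to [] when the key is absent.
write_locations = database_account.get("write_locations", [])
read_locations = database_account.get("read_locations", [])
associated_locations = database_account.get("locations", [])

# The load step can then pass each list as a single query parameter, e.g.:
ingest_write_locations = """
UNWIND $Locations AS loc
MERGE (l:AzureCosmosDBLocation {id: loc.id})
SET l.failover_priority = loc.failover_priority
"""
```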

```
        {'UPDATE_TAG': azure_update_tag, 'AZURE_SUBSCRIPTION_ID': subscription_id},
    )

    if 'cors' in database_account and len(database_account['cors']) > 0:
```
Contributor:

This function does too many things at once. It does a transform+load for each type of policies/rules, and it does a transform+load for each type of location.

I'd recommend separating this for each resource.

I'll clarify this in a follow-up comment if needed; hold off on editing if you're confused. It's getting late here but I wanted to make sure I gave you a somewhat timely second pass. :)

kedarghule (Collaborator Author):

@achantavy Made necessary changes and left replies for you :) Lemme know of any more changes.

ramonpetgrave64 (Contributor) left a comment:

Please address the comments about the _load_... discussion.

kedarghule (Collaborator Author):

> Please address the comments about the _load_... discussion.

Made the necessary changes. Lemme know if there are any more changes.

- Azure Database Account has write permissions from, read permissions from, and is associated with Azure CosmosDB Locations.

```
(AzureCosmosDBAccount)-[WRITE_PERMISSIONS_FROM]->(AzureCosmosDBLocation)
```
Contributor:

I'm a little confused about the wording of this since I am not familiar with Cosmos. What does it mean for a CosmosDBAccount to have read permissions from a CosmosDBLocation versus having it be associated with a CosmosDBLocation?

Collaborator Author:

read_locations: an array that contains the read locations enabled for the Cosmos DB account.

write_locations: an array that contains the write locations for the Cosmos DB account.

locations: this is the one that creates the "associated with" relationship. It is an array that contains all of the locations enabled for the Cosmos DB account, i.e., the regions in which the Azure Cosmos DB database account is deployed.

Contributor:

Maybe something like CAN_READ and CAN_WRITE?

Collaborator Author:

Database accounts can't read a location. They will have read permission from a location. Likewise for write permissions.

Contributor:

What do you mean by "They will have read permission from a location"? For a particular database and location, do you mean the accounts will have permissions to read the data when accessing the database through that location?

How about CAN_READ_FROM?

Collaborator Author:

Yes, you are right. Okay! I'll change it to CAN_READ_FROM and CAN_WRITE_FROM.
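With that rename, the schema lines quoted earlier would read:

```
(AzureCosmosDBAccount)-[CAN_READ_FROM]->(AzureCosmosDBLocation)
(AzureCosmosDBAccount)-[CAN_WRITE_FROM]->(AzureCosmosDBLocation)
```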

kedarghule (Collaborator Author):

@achantavy Made the changes. Let me know if there are any more :)

kedarghule (Collaborator Author):

@achantavy and @ramonpetgrave64 All changes made :) Let me know if there are any more! :)

achantavy (Contributor) left a comment:

Thanks again for the contribution!

kedarghule (Collaborator Author):

@achantavy Updated the latest in this branch. You can merge into master :)

ramonpetgrave64 (Contributor) left a comment:

Thanks for your patience!

@ramonpetgrave64 ramonpetgrave64 merged commit 142bc2b into lyft:master Apr 22, 2021