
Adding Scaleway VM types #7

Open
Shillaker opened this issue Jan 28, 2024 · 2 comments
Labels: enhancement (New feature or request)


Shillaker commented Jan 28, 2024

Hi @pevab, thanks for all your work on the project and for open-sourcing it, it looks great.

I'm doing a PoC of adding all the Scaleway VM types to the VM impact calculation, but running into a few snags.

First I'll describe how I'm framing the problem, and then I'll ask some specific questions on the implementation.

Framing the problem

All VMs are part of a family, and each family contains a number of different VMs, all with different resources. All the VMs in a given family run on a fixed type of base server.

For example, if we have an alpha family, we might have several VM types:

  • alpha-small (8GiB RAM, 2vCPU, 16GiB storage)
  • alpha-medium (16GiB RAM, 4vCPU, 32GiB storage)
  • alpha-large (32GiB RAM, 8vCPU, 64GiB storage)
  • etc.

All of these will run on a single type of base server, let's call it alpha-base. An alpha-base has 256GiB RAM, a 32 core processor (64 threads), and 512GiB of storage.

The resources available on the base server will determine the max number of each type of VM that server can run. Assuming we allocate 1 vCPU to a CPU thread, and don't over-allocate memory, an alpha-base can run 16 alpha-medium VMs (256GiB memory on the alpha-base divided by 16GiB of memory for the alpha-medium, and the same ratio for CPU and storage).
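As a quick illustration of that arithmetic, here is a sketch in Python (the helper name is mine; the figures are just the made-up alpha example above):

```python
# Max number of a given VM type one base server can host, assuming
# 1 vCPU per CPU thread and no over-allocation of any resource.
def max_vms_per_base(base, vm):
    return min(
        base["ram_gib"] // vm["ram_gib"],
        base["threads"] // vm["vcpus"],
        base["storage_gib"] // vm["storage_gib"],
    )

alpha_base = {"ram_gib": 256, "threads": 64, "storage_gib": 512}
alpha_medium = {"ram_gib": 16, "vcpus": 4, "storage_gib": 32}

print(max_vms_per_base(alpha_base, alpha_medium))  # -> 16
```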

In each region, we run the same types of base servers, and offer the same VM families. For example, we might have 1000 alpha-base servers running in france-1, and 500 alpha-base servers running in netherlands-1. An alpha-medium instance in france-1 is exactly the same as an alpha-medium instance running in netherlands-1, just with a different energy mix, in a different data center.

Adding background data on electricity and base servers

To add the Scaleway VM types to cloud-assess, I would expect to do the following:

  • Add the Scaleway-specific electricity mix to trusted_library/background/electricity.csv, with a geo label specific to each data center/region, i.e. a row for france-1, and a row for netherlands-1 in our example.
  • Add the base server types to trusted_library/background/inventory.csv, including their embodied impacts and lifespan, i.e. a row for alpha-base in our example.
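For illustration, I'd expect the new rows to look roughly like this (the electricity.csv column names are my guess and the values are placeholders; the inventory.csv columns mirror the existing file):

```csv
# trusted_library/background/electricity.csv (one row per region)
geo,GWP,ADPe,...
france-1,<impacts of 1 kWh in france-1>,...
netherlands-1,<impacts of 1 kWh in netherlands-1>,...

# trusted_library/background/inventory.csv (one row per base server type)
id,geo,n_items,power,amortization_period,ram_size,storage_size,ADPe,...
alpha-base,<geo>,<count>,<watts>,<years>,256,<size>,<unitary impacts>,...
```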

This seems to be the case; I just have a few questions:

  1. What is the need for the n_items value in trusted_library/background/inventory.csv? Is this the total number of that server type we run? Why is this necessary? Whether we run 10 servers or 1000 servers, the impact of using any of those servers for any given workload is the same, isn't it?

  2. Why do we need to specify geo in trusted_library/background/inventory.csv? As described above, the hardware is the same regardless of the region. If we run the same server type in 10 regions, will I have to duplicate the information over 10 lines in this file, with a different geo value each time?

  3. Why do you not currently include CPU and GPU specs for the servers listed in inventory.csv?

  4. If we were to add data for many cloud providers, how can we split things up? Could we have a different subdirectory for trusted_library/scaleway, or perhaps add a provider column in each CSV?

Querying cloud-assess

With all the data for the electricity mix and base servers in place, I assume we would then submit a query for each VM type. For example, for the alpha-medium VM with 4vCPU, 16GiB RAM, and 32GiB storage we would submit the following query to get the impact of 1 hour's usage:

{
  "virtual_machines": [{
      "id": "alpha-medium",
      "ram": { "amount": 16384, "unit": "MiB_hour" },
      "cpu": { "amount": 4000, "unit": "mvCPU_hour" },
      "storage": { "amount": 32768, "unit": "MiB_hour" },
      "meta": {
        "region": "france-1",
        "server": "alpha-base"
      }
  }]
}

This is almost what I see in the sample, but not quite:

  1. How can I specify the base server type for each VM, i.e. how do I specify that an alpha-medium runs on an alpha-base server? I have put this in meta.server in the example, but I don't see this in the samples. Alternatively, this mapping could be expressed in the LCA-as-code, but I can't work out how to do that either.

  2. Currently, memory and storage units are in GiB. Would you consider reworking it to be MiB instead (and CPU to be mvCPU if added in future)? This would avoid using floating point numbers for smaller quantities, e.g. for VMs or serverless functions with memory/storage less than 1GiB.

General questions

Finally, I have a couple of general questions:

  1. How can we take load into account? A VM running at 80% CPU usage will have a different impact to one running at 10% CPU usage, but I don't see this accounted for anywhere in the workload specification. Obviously, introducing this would require expressing/estimating the consumption curve of each component depending on load, so I understand that it's a big feature and not for now, just interested for the future.

  2. Could the time granularity be made finer than 1 hour, e.g. a minute or a second? I know 1 hour is the smallest unit quoted in the PCR spec, but our usage data will go down to the second, especially for things like function-as-a-service.

Thanks again for all your work on this, and sorry for what I now realize is a wall of text 🙈. I would be very happy to write up the output of this conversation into a "Getting started for cloud providers" doc.

Thanks,
Simon

@pevab pevab self-assigned this Jan 30, 2024
@pevab pevab added the enhancement New feature or request label Jan 30, 2024

pevab commented Jan 30, 2024

Hi, thank you for the feedback! Don't worry about the wall of text; I'm happy to see people interested in the project.

I'll try to answer your questions here, but for some of them it might be faster to connect somehow and discuss them live.

Some introductory remarks

The ambition of Cloud Assess is to propose an executable version of an official/standard methodology, namely ADEME's PCR. It is based on a DSL, "LCA as Code". You can find the repo here; there are tutorials to learn the language, and a CLI if you want to interact more easily with the models.

Right now, Cloud Assess covers only one functional unit (VM) out of eleven, and the current model is the result of joint work with a local cloud partner. We started with a very simple configuration: a single zone, all physical servers dedicated to running VMs, no reconditioning, only electricity impacts for usage, etc.

My answers below refer to the current state (Jan. 2024) of the Cloud Assess models, but be aware that some design decisions may be revised soon. Indeed, we are currently involved in a project to instantiate the whole PCR with more actors from the cloud industry, and our models will certainly evolve (with breaking changes). We should have a better idea of where things will land by the end of March.

Adding data

You're right about filling in the CSV files with your data. We don't have the license to distribute emission factors. For physical equipment, Resilio has launched its new service, Resilio DB. For electricity, one usually uses data from Ecoinvent, but you need the appropriate license for that.

  1. What is the need for n_items?

Virtual machines are not directly mapped to physical servers in the PCR. Instead, the servers are aggregated into a pool which acts as a "unique big abstract server" that provides RAM and storage (I'll discuss the omission of vCPUs below). The inventory.csv lists all the physical servers that participate in the pool; more precisely, all the physical server models (e.g., PowerEdge R740) and their cardinalities/multiplicities via the column n_items. Note that the embodied impacts (columns GWP, ADPe, etc.) are unitary: they are the impacts for one instance of the physical server model.
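In other words, the pool totals are simple sums over the inventory rows; roughly (a sketch with made-up numbers, not the actual implementation):

```python
# Each inventory row contributes n_items times its unitary values.
inventory = [
    {"id": "server-a", "n_items": 62, "ram_size": 384, "GWP": 1500.0},
    {"id": "server-b", "n_items": 10, "ram_size": 768, "GWP": 2800.0},
]

pool_gwp = sum(r["n_items"] * r["GWP"] for r in inventory)       # total embodied GWP
pool_ram = sum(r["n_items"] * r["ram_size"] for r in inventory)  # pool RAM capacity
```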

Regarding the vCPUs, from our discussions with our partner, there seems to be no general consensus on what "1 vCPU" means quantitatively, so we omitted it for now. However, this is an ongoing topic of discussion, which we hope to settle soon.

  2. Why the need to specify geo?

The geo is used to decide which electricity mix the physical servers consume. When you run a 100 W machine for one hour, 100 Wh of energy is drawn from the electric grid. The file electricity.csv reports the impacts of 1 kWh, and these depend on the geographical location. In short, Switzerland has more hydraulic power than, say, Germany, so, "CO2-wise", Switzerland's electricity is less expensive than Germany's.
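The resulting usage impact is then a simple product; for example (with a made-up grid factor):

```python
power_w = 100        # machine power draw
hours = 1.0
gwp_per_kwh = 0.05   # illustrative factor, as read from electricity.csv

energy_kwh = power_w * hours / 1000   # 0.1 kWh drawn from the grid
usage_gwp = energy_kwh * gwp_per_kwh  # 0.005 kgCO2eq
```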

  3. Why not include the CPU and GPU specs?

Those specs might be necessary to compute the embodied impacts of the physical machines, but they are not really needed by Cloud Assess. From Cloud Assess's point of view, it doesn't matter how the embodied impacts are computed. We include the RAM and storage capacity because they are used to allocate the impacts to each client.
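Schematically, the allocation looks something like this (a sketch only; the equal weights and the figures are made up, not the PCR's actual formula):

```python
# A VM's share of the pool impacts, proportional to the RAM and
# storage it reserves relative to the pool's capacity.
vm = {"ram": 16, "storage": 32}              # GiB, e.g. an alpha-medium
pool = {"ram": 256_000, "storage": 512_000}  # pool capacity, GiB

share = 0.5 * vm["ram"] / pool["ram"] + 0.5 * vm["storage"] / pool["storage"]
vm_gwp = share * 12.0  # 12.0 is a made-up pool impact for the period
```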

  4. Adding data for many cloud providers?

On this one I'm not sure I understand the question, or what you want to do. Let's discuss that point live.

Querying Cloud Assess

  1. How to specify the mapping of VM to server?

As explained above, the approach in the PCR is to aggregate physical servers into pools, and then each VM is mapped to a pool. That being said, because we have only considered a single pool so far, there is currently no way to specify a mapping of VM to "pool of servers" in the REST API. Clearly, we will need that in order to deal with multiple pools.

  2. MiB instead of GiB?

Noted, thanks. We do plan to include more relevant units in the API.

General questions

  1. Taking load into account?

From a development perspective, it is not really that difficult to take load (in terms of CPU) into account. For now, the impact of a VM is allocated with respect to its RAM and storage usage; we could easily add the vCPU. However, as mentioned above, the issue with vCPUs is more about reaching consensus on a common definition. We will see what comes out of our discussions with the various players.

  2. Time granularity?

Indeed, the API currently only handles a granularity of 1 hour. It would be relatively easy to include something like "MB_second" in the MemoryTimeUnitsDto (cf. openapi.yaml).
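For instance, once such a unit exists, a 90-second workload with 512 MB of memory could be expressed as follows (a sketch; "MB_second" is not in the API yet):

```json
{
  "virtual_machines": [{
      "id": "faas-invocation",
      "ram": { "amount": 46080, "unit": "MB_second" }
  }]
}
```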


Shillaker commented Jan 30, 2024

Hi @pevab, thanks so much for the detail, much appreciated 🙏. Lots of very interesting topics!

Virtual machines are not directly mapped to physical servers in the PCR. Instead, the servers are aggregated into a pool which acts as a "unique big abstract server" that provides RAM and storage

OK, this is what I thought from looking at the code and the PCR spec, but unfortunately the idea of all hardware in a data center being in a single aggregated pool won't work for us. However, if we introduce the concept of multiple pools, to which we can map each type of VM (and each functional unit in general), that would work. This is something that needs to be discussed at the PCR working group level.

Regarding the vCPUs, from our discussions with our partner, there seems to be no general consensus on what "1 vCPU" means quantitatively, so we omitted it for now.

That's a good point. It depends on the VM type; sometimes it's one vCPU per CPU thread, sometimes more than one. See the next point for why I think it's still useful.

We include the RAM and storage capacity because they are used to allocate the impacts to each client.

This is where I think both CPU and GPU information would be important. We may have VM types that have the same RAM and storage capacity, but different allocations of CPU and/or GPU. We would then need to allocate the impacts for each client proportional to their usage of all the resources, and not just memory and storage.

As you say, this is a consensus issue rather than a technical blocker, as the arithmetic will always be relatively simple. There needs to be a shared definition of how to allocate the impact of a server based on all resource types (CPU, GPU, RAM, SSDs, HDDs etc.), and whether that resource is just reserved, or actually used (and then under what load, as mentioned previously).
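As a sketch of what such a shared definition might compute (the weights and the resource list are placeholders, i.e. exactly the things that need consensus):

```python
# A VM's fraction of a pool's impacts: a weighted sum of per-resource shares.
weights = {"cpu": 0.25, "gpu": 0.25, "ram": 0.25, "storage": 0.25}

def vm_share(vm, pool):
    return sum(w * vm[r] / pool[r] for r, w in weights.items())

vm = {"cpu": 4, "gpu": 0, "ram": 16, "storage": 32}
pool = {"cpu": 4_096, "gpu": 64, "ram": 256_000, "storage": 512_000}
print(vm_share(vm, pool))
```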

The geo is used to decide which electricity mix the physical servers consume.

I think I understand this now given the context. I thought that inventory.csv was just a list of embodied impacts for specific hardware. However, inventory.csv serves two purposes: i) a list of how many servers there are in each region; ii) the embodied impacts of those servers.

The issue I see here is duplication: won't columns like power, amortization_period, and all the impact factors be duplicated every time the same hardware runs in a different geo/region?

For example, the current file contains the following row for small in the GLO region:

| id | geo | n_items | power | amortization_period | ram_size | storage_size | ADPe | etc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| small | GLO | 62 | 400 | 5 | 384 | 11.52 | 0.009010305 | etc. |

If I want to add 10 small servers running in the FRA region, I would need to add a new row:

| id | geo | n_items | power | amortization_period | ram_size | storage_size | ADPe | etc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| small | FRA | 10 | 400 | 5 | 384 | 11.52 | 0.009010305 | etc. |

In this row, the values for geo and n_items are different, but all the other values are the same. Accordingly, if I run small servers in 10 other regions, won't I have to duplicate these values another 10 times?

Perhaps splitting the data into an inventory.csv and embodied.csv would avoid this repetition.
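Something like the following, where the per-model data is written once (a sketch of the split, reusing the small values above):

```csv
# inventory.csv: deployment-specific counts
id,geo,n_items
small,GLO,62
small,FRA,10

# embodied.csv: one row per hardware model
id,power,amortization_period,ram_size,storage_size,ADPe,...
small,400,5,384,11.52,0.009010305,...
```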

On this one I'm not sure I understand the question, or what you want to do. Let's discuss that point live.

This was more a question about how (and whether) we could include open-source data from many providers in this repo. For example, if the VM types and impact factors for 10 different cloud providers were all put into trusted_library and open-sourced, it would become difficult to manage. However, your vision may not be to open-source everything, and instead for each provider to manage their own instance of trusted_library, in which case this is a moot point.


Thanks again for the quick, detailed response. As I say, there are some issues here that need to be worked out at the PCR/ADEME level, so I'll contact you through those channels instead 😄
