Use Filestore NFS rather than GCS in Google Cloud? #2180

Closed
urbanenomad opened this issue Dec 24, 2020 · 25 comments

@urbanenomad

urbanenomad commented Dec 24, 2020

So we are using ClusterFuzz, but we are finding that the vast majority of the cost of running ClusterFuzz in Google Cloud is going to Class B operations in GCS. With about 100 bots, roughly half of the cost is going into GCS Class B read operations (about 3-4 billion operations), which comes out to about $1500 in read operations out of a total of $3000/month. We want to ramp up to about 1000 VMs, but the expected increase in read operations would break our budget.

So we decided to also set up an on-premise ClusterFuzz (cfz) instance, but of course that has its own challenges. We realized that the on-prem local instance of cfz uses a GCS emulator with an NFS backend. I was wondering whether a cloud production deployment of ClusterFuzz has ever used the GCS emulator pointed at Google Filestore, so that we can avoid the high cost of GCS Class B operations.

This could mitigate the high cost of all the read operations from GCS and save us roughly 50% in costs. Has anyone tried this, and do you see any problems with attempting it? Is there anything we lose by doing this? Do we lose any of the analytics functionality?

Is this even possible with the cloud version of ClusterFuzz? If so, any guidance would be helpful.

@Manouchehri

See #1667.

@urbanenomad
Author

Thanks @Manouchehri, I saw that earlier, since I opened that last issue. We are looking to continue using GCP for our ClusterFuzz service, but we want to move away from GCS and use Filestore or some other NFS service such as managed NetApp CVS. However, I understand that some ClusterFuzz functions rely on GCS... I am wondering, if we switch the ClusterFuzz instance to use NFS rather than GCS on cloud VMs, (A) is that possible? (B) would we lose other services such as BQ analytics? Is there any alternative?

Also, in that last thread @Dor1s stated the following:

Coincidentally, a few weeks ago I was doing some calculations, and came up with the following estimate. Please do not take it too seriously, I've spent a few minutes working on it, it might be not very precise. Running ClusterFuzz in GCP with 500 fuzzing VMs (400 preemptibles and 100 regular) would cost roughly $80,000 per year, where $70,000 are GCE expenses (i.e. fuzzing VMs aka bots), and the rest $10,000 are other GCP expenses which are "required" for the production instance.

So we have been running 100 VMs in GCP for the past few months, and we are finding that a large part of the cost is due to Google Cloud Storage Class B operations (reads). About half the cost goes into that: roughly $1500 worth of GCE, storage, etc., and about $1500 of GCS Class B operations, around 3-4 billion read operations. I am not sure why our service has such a high read count compared to what @Dor1s states in his cost estimate. This is why I am asking about using NFS rather than GCS in a ClusterFuzz service deployed in GCP.

@inferno-chromium
Collaborator

It should be easy to add code for another StorageProvider, see https://github.com/google/clusterfuzz/blob/master/src/python/google_cloud_utils/storage.py#L93
We are open to adding NFS as another storage provider type.
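For illustration only, a provider backed by an NFS mount could take roughly the following shape, mapping bucket/path pairs onto directories under the mount. The class name, method names, and signatures below are assumptions made for this sketch; the actual abstract methods to implement are the ones defined in storage.py at the link above.

import os
import shutil


class NFSStorageProvider:
  """Hypothetical provider that maps buckets onto directories on an NFS mount."""

  def __init__(self, nfs_root):
    self._root = nfs_root  # e.g. a mounted Filestore share such as /opt/gcs

  def _path(self, bucket, path):
    # Lay out objects as <nfs_root>/<bucket>/<object path>.
    return os.path.join(self._root, bucket, path.lstrip('/'))

  def read_data(self, bucket, path):
    with open(self._path(bucket, path), 'rb') as f:
      return f.read()

  def write_data(self, data, bucket, path):
    full_path = self._path(bucket, path)
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, 'wb') as f:
      f.write(data)

  def copy_file_to(self, local_path, bucket, path):
    # Copy a local file into the NFS-backed "bucket" directory.
    dest = self._path(bucket, path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy(local_path, dest)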

@urbanenomad
Author

It looks like the storage class already supports FileSystemProvider, so all we would need to do is mount the NFS on each bot. Would it be as easy as creating an NFS Filestore and then mounting it on each bot? Can we modify the bot's startup script to add the mount code and leverage the existing FileSystemProvider?

@Dor1s
Contributor

Dor1s commented Dec 28, 2020

@urbanenomad below are a few random guesses on how to reduce GCS spending:

  1. Make sure your storage buckets are in the same GCP region as the GCE VMs. If I remember correctly, there are different price tiers for different regions or even zones.
  2. If you're running libFuzzer fuzz targets, make sure you have corpus pruning enabled. Otherwise, you may end up having too many files in your corpora buckets.
  3. You can also inspect the buckets manually, just in case there's any redundant / unexpected stuff. For example, files you've uploaded manually, or maybe you have lifecycle versioning enabled and that somehow blows the number of "objects" present in the bucket.

@inferno-chromium
Collaborator

It looks like the storage class already supports FileSystemProvider, so all we would need to do is mount the NFS on each bot. Would it be as easy as creating an NFS Filestore and then mounting it on each bot? Can we modify the bot's startup script to add the mount code and leverage the existing FileSystemProvider?

Yes, in your startup script mount your NFS to a local folder and set LOCAL_GCS_BUCKETS_PATH (https://github.com/google/clusterfuzz/blob/master/src/python/google_cloud_utils/storage.py#L569); FileSystemProvider will then be used automatically. Would appreciate it if you propose adding this to the docs if it works for you.
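For concreteness, here is a minimal Python sketch of that startup step (real bot startup scripts are usually shell, so treat this as illustrative only). The Filestore address, export name, and mount point are placeholders, and it assumes the NFS client packages are installed and the script runs with root privileges.

import os
import subprocess

NFS_EXPORT = '10.0.0.2:/clusterfuzz_share'  # placeholder Filestore IP and export
MOUNT_POINT = '/opt/gcs'                    # local path the bots will use


def mount_nfs_and_set_env():
  """Mount the Filestore share and point FileSystemProvider at it."""
  os.makedirs(MOUNT_POINT, exist_ok=True)
  subprocess.run(
      ['mount', '-t', 'nfs', '-o', 'rw,hard', NFS_EXPORT, MOUNT_POINT],
      check=True)
  # FileSystemProvider is selected when LOCAL_GCS_BUCKETS_PATH is set, so make
  # sure the variable is visible to the bot process environment.
  os.environ['LOCAL_GCS_BUCKETS_PATH'] = MOUNT_POINT


if __name__ == '__main__':
  mount_nfs_and_set_env()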

@urbanenomad
Author

@inferno-chromium
I get how this would work for the bots. But when I look at the logs for GCS GET operations, the vast majority of them are for the bucket object "corpus.clusterfuzzsbx8.appspot.com/objects/libFuzzer/wlanmob/*". If we switch to using NFS, does the move away from GCS affect other services in ClusterFuzz in any way, such as the BQ analytics? When we upload a job, does it make copies in both the GCS bucket and NFS?

@Dor1s

  1. Make sure your storage buckets are in the same GCP region as the GCE VMs. If I remember correctly, there are different price tiers for different regions or even zones.

We are using multi-region buckets, although all the VMs are in one zone.

  2. If you're running libFuzzer fuzz targets, make sure you have corpus pruning enabled. Otherwise, you may end up having too many files in your corpora buckets.

I will ask my colleague who does the actual fuzzing about this.

  3. You can also inspect the buckets manually, just in case there's any redundant / unexpected stuff. For example, files you've uploaded manually, or maybe you have lifecycle versioning enabled and that somehow blows the number of "objects" present in the bucket.

The actual storage size is not that large, about 177 GB; rather, it is the number of object reads that is costing a lot. As stated earlier, we had about 1.2 million reads in one day and about 3-4 billion reads a month.

@oliverchang
Collaborator

Note that FileSystemProvider was only intended for use in tests, so your mileage will vary in terms of how well this works in production at a large scale.

BigQuery stats etc. may not work because they expect stats to be stored on real GCS.

@urbanenomad
Author

Note that FileSystemProvider was only intended for use in tests, so your mileage will vary in terms of how well this works in production at a large scale.

When you say mileage, what do you mean exactly? Will it not work if we run 1000 preemptible bots all reading from the same NFS filer, with the NFS service and all the bots in the same zone? Do you think this will be a performance issue, a reliability issue, or both?

BigQuery stats etc. may not work because they expect stats to be stored on real GCS.

Hmm, yeah, that is what I suspected. You say it may not work; are you sure that it won't work? I am not 100% sure what gets stored in GCS and what gets stored in the bot's local store. Can we sync the data between GCS and NFS so that BQ would work? Any thoughts on how best to do that?

@inferno-chromium
Collaborator

Note that FileSystemProvider was only intended for use in tests, so your mileage will vary in terms of how well this works in production at a large scale.

When you say mileage, what do you mean exactly? Will it not work if we run 1000 preemptible bots all reading from the same NFS filer, with the NFS service and all the bots in the same zone? Do you think this will be a performance issue, a reliability issue, or both?

BigQuery stats etc. may not work because they expect stats to be stored on real GCS.

Hmm, yeah, that is what I suspected. You say it may not work; are you sure that it won't work? I am not 100% sure what gets stored in GCS and what gets stored in the bot's local store. Can we sync the data between GCS and NFS so that BQ would work? Any thoughts on how best to do that?

Just set up your NFS using https://cloud.google.com/filestore, which is pretty reliable and scales. From the ClusterFuzz side, that sync won't happen; you can probably set up a cron somewhere to keep GCS and NFS in sync.

@urbanenomad
Author

From the ClusterFuzz side, that sync won't happen; you can probably set up a cron somewhere to keep GCS and NFS in sync.

So to keep BQ stats and analytics working, I would just need to keep the GCS buckets and the corresponding folders on the NFS in sync? How often would you recommend the sync to run? Also, which buckets need to be synced?

@inferno-chromium
Collaborator

From the ClusterFuzz side, that sync won't happen; you can probably set up a cron somewhere to keep GCS and NFS in sync.

So to keep BQ stats and analytics working, I would just need to keep the GCS buckets and the corresponding folders on the NFS in sync? How often would you recommend the sync to run? Also, which buckets need to be synced?

BQ stats are not related to this and should all keep working. Maybe sync once or twice a day, just for backup. For the bucket list, do all the ones declared in project.yaml. I am out for the week, so expect delayed responses until after the holidays.

@urbanenomad
Author

Thanks for the quick response, and understood about the delayed responses going forward.

Just one last question (answer whenever you get the chance): should the sync be one-way or bidirectional? Or does it depend on which bucket? I am going to assume one direction, from GCS to NFS.

@inferno-chromium
Collaborator

Thanks for the quick response, and understood about the delayed responses going forward.

Just one last question (answer whenever you get the chance): should the sync be one-way or bidirectional? Or does it depend on which bucket? I am going to assume one direction, from GCS to NFS.

GCS would become just a backup place, so sync one-way from NFS to GCS every now and then (probably with gsutil rsync -d).
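A minimal sketch of that one-way sync, assuming the bucket directories live under a mounted path such as /opt/gcs and gsutil is available on whichever machine runs the cron; the bucket names below are examples patterned after this thread, and the real list should come from project.yaml.

import subprocess

NFS_ROOT = '/opt/gcs'  # where the bucket directories live on the NFS mount
BUCKETS = [            # example names; use the buckets declared in project.yaml
    'blobs.clusterfuzzsbx8.appspot.com',
    'bigquery.clusterfuzzsbx8.appspot.com',
    'corpus.clusterfuzzsbx8.appspot.com',
]

for bucket in BUCKETS:
  # -r recurses into subdirectories; -d deletes destination objects that no
  # longer exist locally, making the GCS bucket mirror the NFS contents.
  subprocess.run(
      ['gsutil', '-m', 'rsync', '-r', '-d',
       '%s/%s' % (NFS_ROOT, bucket), 'gs://%s' % bucket],
      check=True)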

@urbanenomad
Author

urbanenomad commented Jan 1, 2021

OK, so I got this to work and I see a reduced number of reads from GCS, but I don't see anything show up in the NFS filer. No folders or files are being created. Are you sure we don't need to sync files from GCS into NFS?

Plus, we are seeing error logs for reads of nonexistent files under the new NFS location /opt/gcs:

“Source file /opt/gcs/blobs.clusterfuzzsbx8.appspot.com/objects/a08dfaac-c6d1-4efa-a2c5-c49721cb7b06 for copy not found.”

Do I need to copy GCS files over to the NFS folders?

@inferno-chromium
Collaborator

/opt/gcs/blobs.clusterfuzzsbx8.appspot.com

You don't need to sync the GCS files, but you do need to create all the buckets in the NFS dir.
See the code in run_server.py:

def create_local_bucket(local_gcs_buckets_path, name):
  """Create a local bucket."""
  blobs_bucket = os.path.join(local_gcs_buckets_path, name)
  if not os.path.exists(blobs_bucket):
    os.mkdir(blobs_bucket)


def bootstrap_gcs(storage_path):
  """Bootstrap GCS."""
  local_gcs_buckets_path = os.path.join(storage_path, 'local_gcs')
  if not os.path.exists(local_gcs_buckets_path):
    os.mkdir(local_gcs_buckets_path)

  config = local_config.ProjectConfig()
  test_blobs_bucket = os.environ.get('TEST_BLOBS_BUCKET')
  if test_blobs_bucket:
    create_local_bucket(local_gcs_buckets_path, test_blobs_bucket)
  else:
    create_local_bucket(local_gcs_buckets_path, config.get('blobs.bucket'))

  create_local_bucket(local_gcs_buckets_path, config.get('deployment.bucket'))
  create_local_bucket(local_gcs_buckets_path, config.get('bigquery.bucket'))
  create_local_bucket(local_gcs_buckets_path, config.get('backup.bucket'))
  create_local_bucket(local_gcs_buckets_path, config.get('logs.fuzzer.bucket'))
  create_local_bucket(local_gcs_buckets_path, config.get('env.CORPUS_BUCKET'))
  create_local_bucket(local_gcs_buckets_path,
                      config.get('env.QUARANTINE_BUCKET'))
  create_local_bucket(local_gcs_buckets_path,
                      config.get('env.SHARED_CORPUS_BUCKET'))
  create_local_bucket(local_gcs_buckets_path,
                      config.get('env.FUZZ_LOGS_BUCKET'))
  create_local_bucket(local_gcs_buckets_path,
                      config.get('env.MUTATOR_PLUGINS_BUCKET'))

So basically all the folders /opt/gcs/{*}.clusterfuzzsbx8.appspot.com need to be created on the NFS; also, double-check user permissions manually by creating files and folders with that uid.
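For example, a small script along the following lines could pre-create those directories, mirroring what bootstrap_gcs() does above; the mount path and bucket names are placeholders and should be taken from your own project.yaml.

import os

NFS_ROOT = '/opt/gcs'  # the mounted Filestore path (LOCAL_GCS_BUCKETS_PATH)
BUCKET_NAMES = [       # placeholder names; take the real list from project.yaml
    'blobs.clusterfuzzsbx8.appspot.com',
    'deployment.clusterfuzzsbx8.appspot.com',
    'bigquery.clusterfuzzsbx8.appspot.com',
    'backup.clusterfuzzsbx8.appspot.com',
    'fuzzer-logs.clusterfuzzsbx8.appspot.com',
    'corpus.clusterfuzzsbx8.appspot.com',
    'quarantine.clusterfuzzsbx8.appspot.com',
    'shared-corpus.clusterfuzzsbx8.appspot.com',
    'fuzz-logs.clusterfuzzsbx8.appspot.com',
    'mutator-plugins.clusterfuzzsbx8.appspot.com',
]

for name in BUCKET_NAMES:
  path = os.path.join(NFS_ROOT, name)
  os.makedirs(path, exist_ok=True)  # create the "bucket" directory if missing
  # The bots run as the clusterfuzz user, so if this runs as root, hand over
  # ownership afterwards, e.g. shutil.chown(path, 'clusterfuzz', 'clusterfuzz').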

@urbanenomad
Author

urbanenomad commented Jan 1, 2021

OK, I created all the bucket-name folders in the NFS share and set the permissions to the clusterfuzz user (the folder /mnt/disks/gcs is the host folder that gets mounted to /opt/gcs in the container):

wkim_qualcomm_com@clusterfuzz-linux-pre-2gdn /mnt/disks/gcs $ ls -l
total 64
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:45 backup.clusterfuzzsbx8.appspot.com
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:45 bigquery.clusterfuzzsbx8.appspot.com
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:44 blobs.clusterfuzzsbx8.appspot.com
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:47 corpus.clusterfuzzsbx8.appspot.com
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:46 coverage.clusterfuzzsbx8.appspot.com
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:45 deployment.clusterfuzzsbx8.appspot.com
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:48 fuzz-logs.clusterfuzzsbx8.appspot.com
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:46 fuzzer-logs.clusterfuzzsbx8.appspot.com
drwx------ 2 root root 16384 Dec 31 23:20 lost+found
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:46 mutator-plugins.clusterfuzzsbx8.appspot.com
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:47 quarantine.clusterfuzzsbx8.appspot.com
drwxr-xr-x 2 clusterfuzz clusterfuzz 4096 Jan 1 18:47 shared-corpus.clusterfuzzsbx8.appspot.com

But I am still getting the following error:

Source file /opt/gcs/blobs.clusterfuzzsbx8.appspot.com/objects/28e96bcc-6174-4a20-a437-a57e3d1c8879 for copy not found.

BTW, this cluster is an old ClusterFuzz instance in which I updated the code. Could it be that jobs created before the switch to NFS need certain objects copied over from GCS into the NFS folders?

@inferno-chromium
Collaborator

If you are using custom binaries, you need to re-upload the archives in the job for them to work in the new bucket. If that does not work, recreate the job by deleting it and creating it again.

@urbanenomad
Author

OK, I got the NFS working and we see Crash Statistics, but we don't see any data in Fuzzer Statistics. Where does Fuzzer Statistics get its data? I understand that a local instance of ClusterFuzz disables certain features, but is the trigger for a local instance based on the use of a local file path rather than GCS? If so, is there any way we can get Fuzzer Statistics populated while using a local file path on the machine? Would syncing the data back into GCS help?

@inferno-chromium
Collaborator

OK, I got the NFS working and we see Crash Statistics, but we don't see any data in Fuzzer Statistics. Where does Fuzzer Statistics get its data? I understand that a local instance of ClusterFuzz disables certain features, but is the trigger for a local instance based on the use of a local file path rather than GCS? If so, is there any way we can get Fuzzer Statistics populated while using a local file path on the machine? Would syncing the data back into GCS help?

Fuzzer stats should work; check your cron job and BigQuery table to see if the fuzzer stats data is going through.
https://github.com/google/clusterfuzz/blob/master/src/appengine/handlers/fuzzer_stats.py

@oliverchang
Collaborator

Fuzzer stats require GCS to work. Syncing the fuzzer stats data into the real bucket may work, but it is unexplored territory.

@urbanenomad
Author

OK, I will write a script to sync the NFS data into GCS and see if that updates the fuzzer stats.

Thanks.

@urbanenomad
Author

Hey, so I got this to work: I created a cron job that rsyncs the folders from NFS into GCS, and it seems to be updating the Fuzzer Stats page now. Thanks for all the help. @oliverchang @inferno-chromium, do you want me to document how to configure this somewhere?

@inferno-chromium
Collaborator

Hey, so I got this to work: I created a cron job that rsyncs the folders from NFS into GCS, and it seems to be updating the Fuzzer Stats page now. Thanks for all the help. @oliverchang @inferno-chromium, do you want me to document how to configure this somewhere?

Would appreciate it if you could add it to the documentation, in an "NFS" section at the end of https://google.github.io/clusterfuzz/production-setup/setting-up-bots/. It should cover what you did for configuration and the benefits (like rough cost savings, etc.). I am still curious whether there was a significant price saving here.

@Manouchehri

There is also this project now, which might work: https://github.com/fsouza/fake-gcs-server
