
NFS server file system bug #1388

Closed
maxveliaminov opened this issue May 30, 2023 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@maxveliaminov

Describe the bug

When using nfs-server as a file system, the boot disk of the nfs-server instance is shared while the attached disk remains unmounted, which leads to a smaller-than-expected shared volume size. In this configuration the additional disk also contains CentOS and has a 20 GB file system. In some cases it can be the other way around: the attached disk gets mounted as root and is shared, while the boot disk remains unmounted.

Steps to reproduce

Steps to reproduce the behavior:

  1. Create an HPC cluster project with nfs-server as the file system

Expected behavior

Boot disk mounted as root; additional disk mounted as /exports/data.

Actual behavior

Boot disk mounted as root and the additional disk not mounted at all, OR the additional disk mounted as root and the boot disk not mounted at all.
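To see which of the two states the nfs-server instance landed in, a few standard Linux commands are enough. This is a diagnostic sketch run on the nfs-server VM itself; the export path /exports/data is taken from the expected behavior above, and device names shown by lsblk (e.g. sda/sdb) will vary:

```shell
# Show every block device, its size, and where (if anywhere) it is mounted.
# On the buggy instance, one of the two disks will have no MOUNTPOINT.
lsblk -o NAME,SIZE,MOUNTPOINT

# Report the size of the shared volume; a smaller-than-expected size here
# means the boot disk (not the data disk) is being exported.
df -h /exports/data 2>/dev/null || df -h /

# Confirm whether the export path is a real mount or just a directory
# on the root file system.
grep -E '/exports' /proc/mounts || echo "/exports is not a separate mount"
```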

Version (ghpc --version)

Blueprint

If applicable, attach or paste the blueprint YAML used to produce the bug.

Expanded Blueprint

If applicable, please attach or paste the expanded blueprint. The expanded blueprint can be obtained by running ghpc expand your-blueprint.yaml.

Disregard if the bug occurs when running ghpc expand ... as well.

Output and logs


Screenshots

Screenshot: NFS server on the left, controller node on the right.

Execution environment

  • OS: [macOS, ubuntu, ...]
  • Shell (To find this, run ps -p $$): [bash, zsh, ...]
  • go version:

Additional context

Add any other context about the problem here.

@maxveliaminov maxveliaminov added the bug Something isn't working label May 30, 2023
@mr0re1 mr0re1 self-assigned this May 30, 2023
@mr0re1
Collaborator

mr0re1 commented May 30, 2023

Hi @maxveliaminov , could you please share your blueprint (exclude any sensitive information)?

@maxveliaminov
Author

Hi @mr0re1, here is one:

blueprint_name: palm-model

vars:
  project_id: <PROJECT_ID>
  deployment_name: <PROJECT_ID>
  region: <REGION>
  zone: <ZONE>
  machine_type: <MACHINE_TYPE>
  node_count_dynamic_max: <NODE_COUNT_DYNAMIC_MAX>
  slurm_cluster_name: palm1
  disable_public_ips: true
  enable_shielded_vm: true

deployment_groups:
  - group: primary
    modules:
      - id: network1
        source: modules/network/vpc
        kind: terraform
      - id: appsfs
        source: community/modules/file-system/nfs-server
        kind: terraform
        use:
          - network1
        settings:
          machine_type: n2-standard-2
          auto_delete_disk: true
          local_mounts: ['/apps']
      - id: spack
        source: community/modules/scripts/spack-install
        settings:
          install_dir: /apps/spack
          spack_url: https://github.com/spack/spack
          spack_ref: v0.19.1
          log_file: /apps/spack.log
          spack_cache_url:
            - mirror_name: <SPACK_CACHE_NAME>
              mirror_url: <SPACK_CACHE_URL>
          configs:
            - type: file
              scope: defaults
              content: |
                modules:
                  default:
                    tcl:
                      hash_length: 0
                      all:
                        conflict:
                          - '{name}'
                      projections:
                        all: '{name}/{version}-{compiler.name}-{compiler.version}'
          compilers:
            - gcc@8.2.0%gcc@4.8.5 target=x86_64
          environments:
            - name: palm
              content: |
                spack:
                  definitions:
                  - compilers:
                    - gcc@8.2.0
                  - mpis:
                    - intel-mpi@2018.4.274
                  - python:
                    - python@3.9.10
                  - python_packages:
                    - py-pip@22.2.2
                    - py-wheel@0.37.1
                    - py-google-cloud-storage@1.18.0
                    - py-ansible@2.9.2
                  - packages:
                    - gcc@8.2.0
                    - coreutils@8.32
                    - cmake@3.24.3
                    - flex@2.6.4
                    - bison@3.8.2
                  - mpi_packages:
                    - netcdf-c@4.7.4
                    - netcdf-fortran@4.5.3
                    - parallel-netcdf@1.12.2
                    - fftw@3.3.10
                  specs:
                  - matrix:
                    - - $packages
                    - - $%compilers
                  - matrix:
                    - - $python
                    - - $%compilers
                  - matrix:
                    - - $python_packages
                    - - $%compilers
                    - - $^python
                  - matrix:
                    - - $mpis
                    - - $%compilers
                  - matrix:
                    - - $mpi_packages
                    - - $%compilers
                    - - $^mpis

      - id: spack_startup
        source: modules/scripts/startup-script
        kind: terraform
        use:
          - network1
        settings:
          runners:
            - $(appsfs.mount_runner)
            - $(spack.install_spack_deps_runner)
            - $(spack.install_spack_runner)
            - type: data
              destination: /apps/palm/palm-install.yaml
              content: |
                123
            - type: data
              destination: /apps/spack/activate-palm-env.sh
              content: |
                456
            - type: data
              destination: /apps/palm/palm-install.sh
              content: |
                789
            - type: shell
              content: sudo chmod -R 777 /apps
              destination: chmod-apps-dir.sh
            - type: shell
              content: 'shutdown -h now'
              destination: shutdown.sh

      - id: spack_builder
        source: modules/compute/vm-instance
        kind: terraform
        use:
          - network1
          - appsfs
          - spack_startup
        settings:
          name_prefix: spack-builder
      - id: homefs
        source: community/modules/file-system/nfs-server
        kind: terraform
        use:
          - network1
        settings:
          machine_type: n2-standard-2
          auto_delete_disk: true
          local_mounts: ['/home']
      - id: debug_node_group
        source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
        use:
          - network1
          - homefs
          - appsfs
        settings:
          node_count_dynamic_max: <DEBUG_MAX_NODE_COUNT>

      - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
        kind: terraform
        id: debug_partition
        use:
          - network1
          - homefs
          - appsfs
          - debug_node_group
        settings:
          is_default: true
          enable_shielded_vm: null
          machine_type: null
          node_count_dynamic_max: null
          partition_name: debug

      - id: compute_node_group
        source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
        use:
          - network1
          - homefs
          - appsfs

      - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
        kind: terraform
        id: compute_partition
        use:
          - network1
          - homefs
          - appsfs
          - compute_node_group
        settings:
          enable_shielded_vm: null
          machine_type: null
          node_count_dynamic_max: null
          partition_name: compute

      - source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
        kind: terraform
        id: slurm_controller
        use:
          - network1
          - debug_partition
          - compute_partition
          - homefs
          - appsfs
        settings:
          machine_type: n2-standard-8

      - source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
        kind: terraform
        id: slurm_login
        use:
          - network1
          - homefs
          - appsfs
          - slurm_controller
        settings:
          machine_type: n2-standard-8
          disable_login_public_ips: true

@mr0re1
Collaborator

mr0re1 commented Jun 2, 2023

@maxveliaminov, we found the root cause and are working on the fix; expect it to land in develop early next week.

@mr0re1
Collaborator

mr0re1 commented Jun 7, 2023

@maxveliaminov, #1406 contains a fix for this problem. Could you please build ghpc from the develop branch and confirm whether it fixes your problem? Please re-open the issue if needed.
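Building ghpc from develop can be sketched as follows. The repository URL assumes the Google Cloud HPC Toolkit upstream (adjust if you work from a fork), and the build assumes Go and make are installed:

```shell
# Clone the toolkit and switch to the develop branch, which carries the fix.
git clone https://github.com/GoogleCloudPlatform/hpc-toolkit.git
cd hpc-toolkit
git checkout develop

# Build the ghpc binary (produced in the repository root).
make

# Confirm the freshly built binary is the one in use.
./ghpc --version
```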

@mr0re1 mr0re1 closed this as completed Jun 7, 2023