Change the 'rev-' directory name instead of creating a symlink #314

Closed
richie-tt opened this issue Dec 3, 2020 · 41 comments · Fixed by #343
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@richie-tt

Hi,
I would like to ask whether there is any chance of changing the directory name instead of creating a symlink.
I am trying to use git-sync together with kaniko (a tool for building container images from a Dockerfile). Unfortunately, the Dockerfile COPY instruction sees this symlink as a file, and instead of copying all the files inside the directory, it copies only the symlink as a single file. See my issue 1513.

Currently, with GIT_SYNC_ROOT=/github and GIT_SYNC_DEST=repo, a symlink named repo is created inside the /github directory:

/github # ls -la
total 8
drwxr-xr-x    1 root     root            22 Jan  1   1970 .
drwxr-xr-x    1 root     root           140 Dec  3 15:53 ..
lrwxrwxrwx  1 root     root            44 Jan  1  1970 repo -> rev-249971a4b53a2a32ca573cd512a53d6a8cf15e78
drwxr-xr-x    1 root     root          345 Jan  1  1970 rev-249971a4b53a2a32ca573cd512a53d6a8cf15e78

I cannot use the current directory name rev-249971a4b53a2a32ca573cd512a53d6a8cf15e78 because it changes on each commit, so it is impossible to automate the build process for the Docker image.

I am open to any suggestions.
br

@richie-tt richie-tt changed the title Change the 'rev-' directory name instead creating a symlink Change the 'rev-' directory name instead of creating a symlink Dec 3, 2020
@thockin
Member

thockin commented Dec 3, 2020

The symlink is the only way to do atomic updates, so that a reader will never see the repo not-existing.

Do I understand right that you are running git-sync to pull a tag or branch one time and then copy the results into a build? Would it be easier to just call git clone yourself?
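
To illustrate the atomic-update point above, here is a minimal sketch of a symlink swap. It is not taken from git-sync's source; the directory names are made up, and `mv -T` assumes GNU coreutils or BusyBox.

```shell
#!/bin/sh
# Sketch of an atomic "publish" via symlink swap (hypothetical names).
set -eu

root="$(mktemp -d)"
mkdir "$root/rev-aaaa" "$root/rev-bbbb"

# Initial state: "repo" points at the old revision.
ln -s rev-aaaa "$root/repo"

# Create the new link under a temporary name, then rename it over the old
# one. rename(2) replaces the destination in a single step, so a reader
# sees either the old target or the new one, never a missing "repo".
ln -s rev-bbbb "$root/repo.tmp"
mv -T "$root/repo.tmp" "$root/repo"   # -T: treat the destination as a plain file

readlink "$root/repo"
```

Renaming the rev- directory itself would not give the same guarantee when a previous checkout already exists at the destination, since the old directory has to be moved away first.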

@richie-tt
Author

Hi @thockin,

Thanks for your answer, I really appreciate it.

Yes, you are right: I download the Git repo in the first step, and in the next step I want to compile the source.
I wanted to use git-sync because it just works. Before that I tried git clone, but I couldn't get the correct permissions on the mounted id_rsa secret: I always got 0440, which is still too permissive. See 129.

    - name: git-secret
      secret:
        secretName: falcon-idrsa
        defaultMode: 256  #0400

I considered creating another initContainer that just copies the id_rsa file and fixes its permissions, but that would add another container to the Pod definition and make it more complicated.

If I am not mistaken, git-sync cannot expose the synced source code on a TCP port, can it? Then I could run git clone directly against git-sync and use it as a proxy between my private Git repo and Kubernetes.

@richie-tt
Author

Also, a BIG advantage of git-sync is speed: it downloads my repo in 15s, whereas git clone takes 1m30s, a huge difference, even if I use git clone -b master --single-branch.

I resolved the symlink problem by adding another initContainer step that changes the directory name.
Unfortunately, I discovered another issue with git-sync: when it fails (cannot download the repo, etc.) it always returns exit 0 -> EXIT_SUCCESS.

It is important that when something fails it returns exit 1 -> EXIT_FAILURE, to inform the parent process about the problem.
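
The rename step mentioned above could look roughly like the sketch below. The function name `materialize` and the scratch paths are hypothetical, `readlink -f` assumes GNU coreutils or BusyBox, and the approach only makes sense for a one-time sync where nothing updates the link afterwards.

```shell
#!/bin/sh
set -eu

# materialize LINK DEST: replace a symlinked checkout with a real directory.
materialize() {
    link="$1"
    dest="$2"
    target="$(readlink -f "$link")"   # absolute path of the rev-<hash> directory
    rm "$link"                        # removes only the symlink itself
    mv "$target" "$dest"              # the rev-<hash> directory becomes DEST
}

# Demo in a scratch directory standing in for GIT_SYNC_ROOT.
root="$(mktemp -d)"
mkdir "$root/rev-249971a"
echo 'FROM scratch' > "$root/rev-249971a/Dockerfile"
ln -s rev-249971a "$root/repo"

materialize "$root/repo" "$root/src"
ls "$root/src"
```

After this step the build context is a plain directory with a stable name, so a COPY instruction no longer encounters a symlink.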

@thockin
Member

thockin commented Dec 4, 2020 via email

@apex-omontgomery

apex-omontgomery commented Dec 8, 2020

The way I've gotten around this issue is to have a post-hook script copy the checkout to another directory named after the git hash.

cp -R . ../$(git rev-parse HEAD)
rm -rf ../finished*

Then in my consumer of the webhook, I use the webhook header Gitsync-Hash to use the correct directory that was created. This consumer is now responsible for using that directory, and when finished renaming it to finished-$Gitsync-Hash, which allows the webhook to clean up on the next pass.

You could also have the consumer take the webhook and do any sort of action you want with it. I'm currently toying with the webhook consumer starting a kubernetes job that has an entrypoint for the container.

# shallow clone of a single hash
git init /tmp/simple-run
# the following commands must run inside the new repository
cd /tmp/simple-run
git remote add origin "${GIT_REPO_URL}"
git fetch origin "${GIT_SHA}" --depth 1
git reset --hard FETCH_HEAD

Either of these scenarios would give you an actual directory and not a symlink for you to operate on.

@richie-tt
Author

Thank you @thockin for your suggestion and thank you @wimo7083 for an interesting solution I will also look into that.

About exit 0: I noticed that it happens when git-sync cannot read the SSH key.

logs

 kubectl logs -n test myapp
INFO: detected pid 1, running init handler
ERROR: can't configure SSH: can't access SSH key: stat /etc/git-secret/id_rsaa: no such file or directory

status of pod

  - containerID: docker://f444e9e458cc14094c2094c84d632350105b4aa88f96e2296a15ac6c9842458e
    image: k8s.gcr.io/git-sync/git-sync:v3.2.0
    imageID: docker-pullable://k8s.gcr.io/git-sync/git-sync@sha256:873fc1bcd6048247036969dcb75f0b1f9c915167b86cb908f1fe3de0e060c562
    lastState: {}
    name: git-xcaf
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: docker://f444e9e458cc14094c2094c84d632350105b4aa88f96e2296a15ac6c9842458e
        exitCode: 0  <- ### EXIT_SUCCESS ###
        finishedAt: "2020-12-10T07:48:14Z"
        reason: Completed
        startedAt: "2020-12-10T07:48:14Z"
myapp                           0/1     Pending             0          0s
myapp                           0/1     Pending             0          0s
myapp                           0/1     ContainerCreating   0          0s
myapp                           0/1     Completed           0          2s

pod definition

...
    - name: git-sync
      image: k8s.gcr.io/git-sync/git-sync:v3.2.0
      env:
        - name: GIT_SYNC_SSH
          value: "true"
        - name: GIT_SYNC_REPO
          value: ssh://git@bitbucket.repo.io:7999/repo/repo.git
        - name: GIT_SYNC_BRANCH
          value: master
        - name: GIT_SSH_KEY_FILE
          value: /etc/git-secret/id_rsaa
        - name: GIT_KNOWN_HOSTS
          value: "false"
        - name: GIT_SYNC_ONE_TIME
          value: "true"
        - name: GIT_SYNC_ROOT
          value: /git
        - name: GIT_SYNC_DEST
          value: repo
...

br

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2021
@thockin
Member

thockin commented Mar 11, 2021

/lifecycle frozen
/remove-lifecycle stale
/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 11, 2021
@eugeneYWang

eugeneYWang commented Mar 1, 2023

Just providing future readers with a working example, in the context of docker compose, of @apex-omontgomery's method.

This is a docker compose service that accesses a private git repo, pulls the repo data into a temp dir /tmp/gitdata, and uses an exechook script to copy the data of the active branch to the mounted volume ./dags:

  git-sync-service:
    image: k8s.gcr.io/git-sync/git-sync:v3.6.4
    user: "${AIRFLOW_UID:-50000}:0"
    profiles:
      - sync-dag
    environment:
      - GIT_SYNC_REPO=${GIT_SYNC_DAG_REPO}
      - GIT_SYNC_BRANCH=${GIT_SYNC_DAG_BRANCH}
      - GIT_SYNC_ADD_USER=true
      # -1 below means it will retry forever. TODO: discuss this value for UAT setup
      # Note: decide that maybe one-time sync is better for local setup
      # - GIT_SYNC_MAX_FAILURES = -1
      - GIT_SYNC_ROOT=/tmp/gitdata
      # - GIT_SYNC_DEST=/dags_dest/dest
      - GIT_SYNC_SSH=true
      - GIT_KNOWN_HOSTS=false
      - GIT_SYNC_SSH_KEY_FILE=/etc/git-secret/ssh
      - GIT_SYNC_ONE_TIME=true
      - GIT_SYNC_DEPTH=0
      # exechook command
      - GIT_SYNC_EXECHOOK_COMMAND=/gitsync_exechook_cmd_to_use.sh
    volumes:
      # - /tmp/git_data:/git_root
      - ./dags:/dags_dest
      - ${GIT_SSH_KEY_FILE_PATH}:/etc/git-secret/ssh
      - ../toolkits/script/gitsync_exechook_cmd_to_use.sh:/gitsync_exechook_cmd_to_use.sh

the exechook script

#!/bin/sh
# this script will be executed in active path ${GIT_SYNC_ROOT}/(hash)
# assuming it can write and erase path /dags_dest 

# unless v level of git-sync is set to 6, echo message will not be shown in docker console.

# Empty the /dags_dest directory
if [ "$(ls -A /dags_dest)" ]; then
    echo "/dags_dest is not empty. Removing all files and directories."
    rm -rf /dags_dest/*
fi

# Copy everything under the active directory to /dags_dest
cp -R . /dags_dest/

echo "Files copied to /dags_dest."

@thockin
Member

thockin commented Mar 1, 2023

For the specific case of --one-time, I could see an argument to (by opt-in) avoid symlinks and worktrees and just check out into the root. It would be a fairly substantial change, but if someone wanted to tackle it I would be open to discussing.

@mostafaghadimi

Would you please provide a more reproducible example? I got confused after seeing the way you handled the repository and mounted the DAGs in your container.

@eugeneYWang

eugeneYWang commented May 23, 2023

The script is tightly coupled to my docker compose file section. Yes, it mounts the ./dags folder, located in the same working directory as the compose file, into /dags_dest.

What confuses you?

@eugeneYWang

eugeneYWang commented May 23, 2023

The current git-sync downloads content to the ${GIT_SYNC_ROOT}/(hash) folder and symlinks the sync dest to that hash folder, if I remember correctly. However, docker compose, unlike a k8s sidecar, does not share the symlink context with other compose services, so you are not able to read the sync dest folder from other services in docker compose.

The only way to get the git content using git-sync is to copy it out of the above folder into a folder mounted by docker compose. In my case, I mount the ./dags folder directly into the git-sync service container as the copy destination.

@thockin
Member

thockin commented May 23, 2023

I don't use compose - are you saying there's no way to share a volume between 2 containers?

@eugeneYWang

It is possible to share a volume. But the volume has to include actual files. Not a symlink created by one service. Out of that service, the symlink does not work for others.

@mostafaghadimi

mostafaghadimi commented May 23, 2023

@eugeneYWang Here is a docker-compose.yml file to reproduce it (I think you are using Airflow):

version: '3.8'
services:
  airflow-worker:
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    environment:
      <<: *airflow-common-env
      # Required to handle warm shutdown of the celery workers properly
      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
      DUMB_INIT_SETSID: "0"
    restart: always
    volumes:
      - ./git-sync-dags/project/dags:/opt/airflow/dags:ro
      - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
      - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
  git-sync:
    image: k8s.gcr.io/git-sync:v3.1.5
    volumes:
      - ./git-sync-dags:/tmp/git
      - ~/.ssh/id_ed25519:/etc/git-secret/ssh
    environment:
      - GIT_SYNC_REPO=${GIT_SYNC_DAG_REPO}
      - GIT_SYNC_BRANCH=main
      - GIT_SYNC_REV=HEAD
      - GIT_SYNC_DEST=project
      - GIT_SYNC_WAIT=300
      - GIT_SYNC_ONE_TIME=false
      - GIT_SYNC_TIMEOUT=120
      - GIT_SYNC_SSH=true
      - GIT_SYNC_SSH_KEY_FILE=/etc/git-secret/ssh
      - GIT_KNOWN_HOSTS=false

How can I resolve the issue?

P.S. I am using this file in order to run Airflow.

@eugeneYWang

eugeneYWang commented May 23, 2023

@mostafaghadimi I made the same mistake before: trying to replace the default git folder with a folder mounted by docker compose. In your case, you have this line: - ./git-sync-dags:/tmp/git. I tried this to pull git content directly into my dags folder, but it does not work with docker compose. You need to follow my workaround above: let git-sync download content to an unmounted folder inside the container, then use the script to copy the content to a mounted volume, overwriting what is there.

@eugeneYWang

Also, your git-sync is much older than the one I used. If you decide to follow my solution, use at least the same version as I did.

@mostafaghadimi

@eugeneYWang would you please help me change the docker-compose file so that it works?

@eugeneYWang

@mostafaghadimi You already have my solution for reference and my explanation to save your time. Now it is your turn to do your homework. :)

@mostafaghadimi

@eugeneYWang BTW, I think there is a problem with your file: some settings are commented out, which is why I asked the question. It's a little bit confusing. For example, GIT_SYNC_DEST is commented out but it is being used in volumes!

@thockin
Member

thockin commented May 23, 2023

But the volume has to include actual files. Not a symlink created by one service. Out of that service, the symlink does not work for others.

I don't understand what that means - the symlink points to a relative path, so it should work.

$ ls -l /tmp/gs-root
total 4
drwxr-xr-x 9 thockin primarygroup 4096 May 23 15:35 ed170912c934d51402401b0a57758fa8b1b35ab3
lrwxrwxrwx 1 thockin primarygroup   40 May 23 15:35 git-sync -> ed170912c934d51402401b0a57758fa8b1b35ab3

@eugeneYWang

eugeneYWang commented May 23, 2023

@thockin You are right. In this case, where the symlink and the (hash) folder are placed next to each other, my Mac, acting as the host system, can understand the symlink. But outside the git-sync container, I am not able to tap into the specific (hash) folder and copy its content to my ./dags folder.

because, If multiple (hash) folders exist, it is a problem to figure out which (hash) folder should be used.

@eugeneYWang

@eugeneYWang BTW, I think there is a problem with your file: some settings are commented out, which is why I asked the question. It's a little bit confusing. For example, GIT_SYNC_DEST is commented out but it is being used in volumes!

Yeah, it is a meaningless comment for readers. But it is commented out anyway, so I think people should be aware that it has no effect.

@mostafaghadimi

mostafaghadimi commented May 23, 2023

@eugeneYWang How is the following scenario possible?

- ./dags:/dags_dest

and

- GIT_SYNC_ROOT=/tmp/gitdata

I mean that you have set /tmp/gitdata as the path that your project files are cloned into. Therefore, the DAG files should be under the /tmp/gitdata directory. Am I wrong? But you've mounted /dags_dest to your Docker host machine.

@eugeneYWang

@eugeneYWang How is the following scenario possible?

- ./dags:/dags_dest

and

- GIT_SYNC_ROOT=/tmp/gitdata

If I remember correctly, the /tmp/gitdata line does not have to be there. It is not a mounted volume, so it is a folder inside the container anyway, just like the default sync root. /dags_dest is the mounted volume that my script copies to. You can change it by changing that line and the corresponding folder in the script.

@eugeneYWang

@mostafaghadimi So good luck. It takes trial and error. At least I saved you the time of reading the source code.

@eugeneYWang

eugeneYWang commented May 24, 2023

But the volume has to include actual files. Not a symlink created by one service. Out of that service, the symlink does not work for others.

I don't understand what that means - the symlink points to a relative path, so it should work.

$ ls -l /tmp/gs-root
total 4
drwxr-xr-x 9 thockin primarygroup 4096 May 23 15:35 ed170912c934d51402401b0a57758fa8b1b35ab3
lrwxrwxrwx 1 thockin primarygroup   40 May 23 15:35 git-sync -> ed170912c934d51402401b0a57758fa8b1b35ab3

There are several implicit restrictions that prevent me from using this approach as well, in the context of an Airflow setup.

That said, you don't have to treat our case as a requirement to change git-sync. Airflow with docker compose is not a recommended setup for production deployment. Eventually we will use the Airflow Helm chart with K8s.

@thockin
Member

thockin commented May 24, 2023

Telling me there are implicit restrictions without telling me what they are doesn't help me fix them :)

You said above you CAN understand the symlink, right?

because, If multiple (hash) folders exist, it is a problem to figure out which (hash) folder should be used.

The one to use is the one that readlink points to.
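
As a small illustration of the readlink suggestion above, the sketch below picks the current hash directory through the link and copies it elsewhere. All directory names and file contents are hypothetical stand-ins for a git-sync root.

```shell
#!/bin/sh
set -eu

# Scratch layout standing in for a git-sync root with two hash directories.
root="$(mktemp -d)"
mkdir "$root/aaaa1111" "$root/bbbb2222" "$root/dags"
echo old > "$root/aaaa1111/dag.py"
echo new > "$root/bbbb2222/dag.py"
ln -s bbbb2222 "$root/git-sync"

# The link target names the directory that is currently published.
current="$(readlink "$root/git-sync")"

# Copy the contents of that directory into a plain destination folder.
cp -R "$root/$current/." "$root/dags/"
cat "$root/dags/dag.py"
```

Because the link target is relative, it has to be joined with the root path before use, as done in the `cp` line.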

@eugeneYWang

git-sync is not designed for the context of compose and Airflow. It is not worth spelling out all the details that I tested, which you will probably not use. If you are curious about what I have gone through, some of the details can be found in #694, which was referenced above.

@thockin
Member

thockin commented May 24, 2023

#694 is mostly about setting the wrong env var. Buried in there you said:

the symlink created by git-sync has made Airflow scheduler fallen into a recursive loop of seeing the rev folder and the symlink re-directing back to the rev folder.

In this way, the output of git-sync cannot be used by Airflow directly in the context of docker compose.

That still doesn't actually tell me anything. Is it a bug in Airflow that can't handle symlinks properly? Is git-sync actually doing something wrong in some edge case? Is docker compose broken? Is it just a permissions problem?

I see a lot of people using git-sync with Airflow, so I'd be glad to make it work better, but I don't know what the problem is.

The symlink that git-sync publishes has a relative target. Any tool that understands POSIX filesystems should be able to treat it like a directory unless they go out of their way to do something special for a symlink.
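
A quick way to check the relative-target claim above is the sketch below. The paths are hypothetical, and moving the directory merely simulates the same volume being mounted at a different path in another container.

```shell
#!/bin/sh
set -eu

base="$(mktemp -d)"
mkdir -p "$base/gs-root/ed170912"
echo 'hello' > "$base/gs-root/ed170912/README"
ln -s ed170912 "$base/gs-root/git-sync"   # relative target, next to the link

# Reading through the link needs no special handling.
cat "$base/gs-root/git-sync/README"

# Simulate the volume appearing at a different mount point: the link still
# resolves, because its target is relative to the link's own directory.
mv "$base/gs-root" "$base/mnt-from-git"
cat "$base/mnt-from-git/git-sync/README"
```

An absolute target would break in the second step, which is why the relative form matters when the root is shared between containers.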

@eugeneYWang

eugeneYWang commented May 24, 2023

Airflow expected the DAGs folder to be all actual files. The way I created the symlink and rev folder in the dags folder, while the symlink would work, leads to a non-stop scanning of DAG files, as I describe in #694.

I tried to mount dags folders as the sync dest folder. But docker compose creates a mounted volume there, so the symlink cannot be created. The mounted folder cannot be created inside the sync root folder either, as git-sync needs ownership there to create and drop things.

Eventually, I let git-sync do the pulling in its own space and use an exechook to always copy data from the latest checkout over to my mounted volume. This is the only approach I found that works with docker compose, Airflow, and git-sync.

@thockin
Member

thockin commented May 24, 2023

Airflow expected the DAGs folder to be all actual files.

So the DAG "folder" is the symlink - a POSIX-compatible tool should resolve that automatically (literally it should not need to do anything) to the current hash. Does that not work?

leads to a non-stop scanning of DAG files

That sounds like a nasty bug in Airflow - shouldn't we try to fix it there? Did you file a bug against airflow? I found some issues that relate to this but they were closed with PRs (so presumably fixed) a year ago.

I tried to mount dags folders as the sync dest folder.

Yeah, you don't want to do that.

Is there a TRIVIAL repro? I don't know airflow. I don't know how to reproduce the error that you are seeing. Preferably, something I can run locally in a docker container without a full cluster - something easy to debug and see if I can reproduce the problem. Can you help me with that?

@eugeneYWang

eugeneYWang commented May 24, 2023

Airflow's official Helm chart sets up git-sync as a sidecar, and they recommend the official chart for production deployment. They use docker compose just as a learning platform and do not recommend it for production. Their compose file does not include git-sync either.

That sounds like a nasty bug in Airflow - shouldn't we try to fix it there? Did you file a bug against airflow? I found some issues that relate to this but they were closed with PRs (so presumably fixed) a year ago.

I don't think they will put effort into fixing issues that occur when Airflow is used in a way that they don't recommend (airflow + compose + git-sync). I have seen them push back when a user wants features in the docker compose deployment that are included in their Helm deployment.

I understand how they think. Docker compose is not designed to have scaling capabilities comparable to K8s. It will not be the final solution for Airflow deployment. If I had dedicated time to study K8s, I would probably not have spent time trying to integrate git-sync into compose and Airflow either.

Can you help me with that?

As using compose with airflow and git-sync is a hack, it is probably not worth putting effort into this.

@thockin
Member

thockin commented May 24, 2023

I tried running airflow manually. I created a git repo with a trivial DAG. I git-synced it to /tmp/gs-vol with "the_link" being the link. I set the airflow.cfg dags_folder = /tmp/gs-vol/the_link. When I airflow dags list I see my DAG. When I airflow dags test it, it runs.

From what I can see, it works. I am happy to dig in further if you can show me what doesn't work, but I don't know it well enough to repro, obviously.

@mostafaghadimi

mostafaghadimi commented May 25, 2023

@thockin The problem will start after updating any parameters in the DAGs repository. In that case the hash is updated and should be changed dynamically, but it doesn't work properly!

@mostafaghadimi

mostafaghadimi commented May 25, 2023

@eugeneYWang I tried the example you posted here, but I got the following error:

"msg"="hook failed" "error"="Run(/git_sync_hook.sh ): fork/exec /git_sync_hook.sh: permission denied: { stdout: "", stderr: "" }"

Would you please help me resolve it?

  git-sync:
    image: k8s.gcr.io/git-sync/git-sync:v3.6.4
    user: "${AIRFLOW_UID:-50000}:0"
    volumes:
      - ./git-sync-dags:/tmp/git
      - .git_sync_hook.sh:/home/git_sync_hook.sh
      - ~/.ssh/id_ed25519:/etc/git-secret/ssh
    environment:
      - GIT_SYNC_REPO=git@github.com:mostafaghadimi/airflow_git_sync.git
      - GIT_SYNC_BRANCH=main
      - GIT_SYNC_REV=HEAD
      - GIT_SYNC_ROOT=/tmp/gitdata
      # - GIT_SYNC_DEST=project
      - GIT_SYNC_WAIT=300
      # - GIT_SYNC_ONE_TIME=false
      - GIT_SYNC_TIMEOUT=120
      - GIT_SYNC_SSH=true
      - GIT_SYNC_SSH_KEY_FILE=/etc/git-secret/ssh

      - GIT_SYNC_ONE_TIME=true
      - GIT_SYNC_DEPTH=0

      - GIT_KNOWN_HOSTS=false
      - GIT_SYNC_ADD_USER=true
      - GIT_SYNC_EXECHOOK_COMMAND=/home/git_sync_hook.sh

@thockin
Member

thockin commented May 26, 2023

The problem will start after updating any parameters in the DAGs repository.

I only know how to run list and test - what airflow command do I run which eventually enters this loop?

@eugeneYWang

That is a permission issue. If your host system is macOS or Linux, just give the script file execute permission.

chmod +x filepath

Sorry for being AFK.

@mostafaghadimi

mostafaghadimi commented May 26, 2023

@eugeneYWang , The webhook command works properly, but whenever I mount ./git-sync-dags/dags:/opt/airflow/dags to the airflow, I got the following error: (after any update occurs)

airflow@20fd297e1452:/opt/airflow$ cd dags
cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

I got the same error even before adding the webhook script.

I searched the internet, and according to this link the problem is that the directory has been deleted!

P.S. Everything works properly on git-sync container.

@thockin
Member

thockin commented May 26, 2023

whenever I mount ./git-sync-dags/dags:/opt/airflow/dags to the airflow, I got the following error: (after any update occurs)

cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

Don't mount the symlink as a volume. That is hostile to updates, as you found out. Volume-mount the whole git-sync root volume, but set airflow's dags_folder to the symlink in it. E.g.

mount ./git-sync-root:/mnt/from-git and then set dags_folder = /mnt/from-git/dags

There's still a problem if airflow uses the git-sync dir as its working directory and doesn't refresh. The directory gets removed, so airflow may need to do the equivalent of chdir(".").

If someone can help me trigger the problem (just show me which airflow commands to run!!) I can see about suggesting a fix for airflow.
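
The stale-working-directory effect described above can be simulated without Airflow. In this sketch the names rev-1, rev-2 and dags are made up, `mv -T` assumes GNU coreutils or BusyBox, and the final `cd` stands in for whatever refresh a long-running consumer would need to do.

```shell
#!/bin/sh
set -eu

root="$(mktemp -d)"
mkdir "$root/rev-1"
ln -s rev-1 "$root/dags"

cd "$root/dags"                       # a long-running process starts here

# Simulate an update: publish rev-2 through the link, then remove rev-1.
mkdir "$root/rev-2"
ln -s rev-2 "$root/dags.tmp"
mv -T "$root/dags.tmp" "$root/dags"
rm -rf "$root/rev-1"

# The process is still sitting in the removed rev-1 directory.
[ -e "$root/rev-1" ] || echo "old checkout removed"

# Re-entering through the symlink path picks up the current checkout.
cd "$root/dags"
basename "$(pwd -P)"
```

The same reasoning applies to a volume mount made directly on the link target: it pins one revision, while a path that goes through the symlink is resolved again on each access.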
