This repository has been archived by the owner on Jan 16, 2021. It is now read-only.

Copy-ServiceFabricApplicationPackage hangs forever #813

Closed
lanfeust69 opened this issue Feb 1, 2018 · 56 comments

@lanfeust69

This is possibly related to #732, which IMO was wrongly closed (following a comment almost unrelated to the original issue).

Following the upgrade of a Windows on-prem cluster from 5.7.198 to 6.1.456, I can no longer deploy with my PowerShell scripts, because Copy-ServiceFabricApplicationPackage hangs forever.

I see 'Transport' warning events being dispatched when this happens, with a message along the lines of:

[redacted]:19000-[redacted]:58944 target 16be4136280-57c978fcff0df92782665bc62e4ecad8:131618836978481188: dropping message: incoming frame length 7,864,700 exceeds limit 4,469,566

Followed by a few

Failed to send reply to sender for OperationId:e14d407f-ba86-4d88-b259-f6c2f484c5e9. SendOneWay failed with S_OK

and then a bunch of

SendOneWay in DownloadAsyncOperation failed with S_OK. OperationId:0a7756da-6c70-47c2-bc81-25e7dda76223

And sure enough, if I manually remove the few files bigger than 4.4 MB from the package, everything is copied normally (but of course the missing files prevent proper operation).

It might be an inconsistent setting when splitting big files: I could match an error with a frame length of around 5 MB to a file of almost exactly that size, but there was no 7.8 MB file, only a couple of much bigger ones (32 MB zips).

@oanapl oanapl self-assigned this Feb 2, 2018
@oanapl

oanapl commented Feb 2, 2018

@lanfeust69 , based on the traces you pasted, I can't tell which message is dropped because of size, it may or may not be related to upload. Could you share the traces and the package (or the file count and size information) with me (oanapl at microsoft.com)?
If you retry the upload with all files (the correct package), do you hit the same issue every time?

@oanapl

oanapl commented Feb 6, 2018

This is due to a configuration change between Standalone and SFRP clusters. When you deploy a cluster in Azure, MaxMessageSize is set to 4 MB. When you download and install a cluster locally, MaxMessageSize is set to 10 MB (this was a change in 6.1). Now the client sends bigger messages (based on its configured MaxMessageSize) and the server drops them (per its configuration).
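You can check which MaxMessageSize value a cluster is actually running with by inspecting its manifest. A minimal sketch, assuming you can connect to the cluster (the endpoint is a placeholder):

```powershell
# Connect to the cluster (endpoint is a placeholder; add certificate parameters for secure clusters)
Connect-ServiceFabricCluster -ConnectionEndpoint 'mycluster.example.com:19000'

# The manifest is returned as an XML string; find the NamingService section
[xml]$manifest = Get-ServiceFabricClusterManifest
$naming = $manifest.ClusterManifest.FabricSettings.Section | Where-Object { $_.Name -eq 'NamingService' }

# Print the MaxMessageSize parameter, if it is set explicitly
$naming.Parameter | Where-Object { $_.Name -eq 'MaxMessageSize' }
```

If nothing is printed, the setting is not overridden and the cluster is using its built-in default.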

We are investigating a configuration change to Azure clusters to change the MaxMessageSize to 10 MB to resolve this issue, but that will take some time.

To unblock yourself, you can remove the local cluster, which removes the new configuration; upload from that machine will then work. Note that every time you set up the local cluster again (either manually or through Visual Studio), the new 10 MB configuration takes effect, so upload of files bigger than 4 MB will fail again.
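For CI or scripted scenarios, the remove/recreate cycle can itself be scripted. A rough sketch, assuming the SDK's cluster-setup scripts are in their default location (paths and script names may vary by SDK version):

```powershell
# Assumed default SDK script location; verify on your machine
$setup = Join-Path $env:ProgramFiles 'Microsoft SDKs\Service Fabric\ClusterSetup'

# Remove the local dev cluster (same effect as "Remove Local Cluster" in the tray icon)
& (Join-Path $setup 'CleanCluster.ps1')

# ...upload your package to the remote cluster here...

# Recreate the local cluster afterwards (note: this re-applies the 10 MB client config,
# so uploads of files bigger than 4 MB will fail again until the fix ships)
& (Join-Path $setup 'DevClusterSetup.ps1')
```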

@oanapl oanapl added bug and removed investigating labels Feb 6, 2018
@JunRamoneda

I am trying to publish to a standalone cluster and am experiencing the same issues reported here. Are there any step-by-step instructions for a workaround?

Need to publish to SF badly.

@oanapl

oanapl commented Feb 6, 2018

@JunRamoneda, in your configuration, what is the version of the cluster and what is the version of the client?
If you are using the latest bits for both, the new setting should be applied to both in standalone, so this may be a different problem.

@JunRamoneda

Cluster and VS2017 both have 6.1.456.9494

@JunRamoneda

Removing local cluster before publishing seems to work for 6.1.456.9494

@cata

cata commented Feb 6, 2018

I can confirm that I can publish to a secure Azure cluster after removing the local cluster. Both clusters are 6.1.456.9494

@oanapl

oanapl commented Feb 6, 2018

Thank you for confirming! We are working on the fix, and I will update the thread when the updates are pushed.

@Sveer

Sveer commented Feb 6, 2018

Hm... it's like magic, but after removing the local cluster, Copy-ServiceFabricApplicationPackage works fine.
Another solution: prepare an .sfpkg, upload it to an Azure blob, and call Register-ServiceFabricApplicationType -ApplicationPackageDownloadUri with the blob address:
https://github.com/Sveer/FixServiceFabricIssue813
It's not a real fix, just registering the app through an Azure blob as a test.
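For reference, that blob-based workaround looks roughly like this (cluster endpoint, blob URI, and type name/version are all placeholders; the .sfpkg must be reachable by the cluster, e.g. via a public container or a SAS URI):

```powershell
Connect-ServiceFabricCluster -ConnectionEndpoint 'mycluster.westus.cloudapp.azure.com:19000'

# Provision the application type straight from the external store,
# which bypasses the image store upload that hangs
Register-ServiceFabricApplicationType `
    -ApplicationPackageDownloadUri 'https://mystorage.blob.core.windows.net/packages/MyApp.sfpkg' `
    -ApplicationTypeName 'MyAppType' `
    -ApplicationTypeVersion '1.0.0'

# Then create the application instance as usual
New-ServiceFabricApplication -ApplicationName 'fabric:/MyApp' `
    -ApplicationTypeName 'MyAppType' -ApplicationTypeVersion '1.0.0'
```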

@francescocristallo

Oh wow, I spent an entire day plus all last night reinstalling VS 2017 and then formatting my PC when that didn't work, thinking it was some VS issue due to the tools update! Glad I found this thread.
I reinstalled VS on a freshly formatted machine and the problem still persisted.
I'm trying to deploy a .NET Core 2 app to a secure Azure live 6.1.456.9494 cluster.

Closing the local cluster (clicking Exit) didn't work either; it's still stuck at
>Copying application to image store... and no data is uploaded.

What steps should I take to "remove" the local cluster so I can use the workaround?
Thanks

@JohnnyFun

@francescocristallo, to remove the local cluster, go to your system tray, right-click the Service Fabric icon, and click "Remove Local Cluster".

I confirmed that that workaround worked for me too. Thanks @oanapl! I'll look out for the update.

@alex-1244

@oanapl, it seems that removing the local cluster didn't help me.
Publishing failed from both Visual Studio and PowerShell.

@oanapl

oanapl commented Feb 7, 2018

@alex-1244, can you give me a few details about your scenario and the steps you tried for mitigation? What is your cluster version and client version? What parameters do you pass to Copy-ServiceFabricApplicationPackage, and what is its output? Can you try uninstalling SF from your machine, then installing it and running the copy cmdlet, to ensure there's no local cluster?

@francescocristallo

I confirm the "Remove cluster" trick works 👍

@alex-1244

alex-1244 commented Feb 7, 2018

Cluster version(Azure): 6.1.456.9494
Cluster version(local): 6.1.456.9494
Deployment script call:
.\Deploy-FabricApplication.ps1 -ApplicationPackagePath 'E:\Projects\...\pkg\Debug' -PublishProfileFile 'E:\Projects\...\PublishProfiles\Cloud.xml' -DeployOnly:$false -ApplicationParameter:@{} -UnregisterUnusedApplicationVersionsAfterUpgrade $false -OverrideUpgradeBehavior 'None' -OverwriteBehavior 'SameAppTypeAndVersion' -SkipPackageValidation:$false -ErrorAction Stop
(copy-pasted from the build output when publishing from Visual Studio)

To remove the local cluster I used @JohnnyFun's suggestion.

@JohnnyFun

JohnnyFun commented Feb 7, 2018

If it's any help, @oanapl, I noticed the cluster I'm deploying to in Azure still shows the old version (6.0.232.9494), and I recently updated my local SDK and project NuGet packages to the latest (6.1.456.9494).

When I try to update my Azure cluster to the latest from the Azure portal, I get an error. But I'm able to deploy to it just fine, and the app seems to run fine.

@oanapl

oanapl commented Feb 8, 2018

@JohnnyFun , I recommend you follow up separately on the cluster upgrade failing. @vaishnavk can help with that.

@alex-1244 , I am sorry for the frustration you are experiencing with this issue. Have you tried uninstall, reinstall and then run upload (without setting up local cluster)? And instead of the deploy script, can you use the Powershell cmdlet directly?

@vitalybibikov

vitalybibikov commented Feb 8, 2018

Same issue here; our CI is not working and all processes are stopped.

Our CI uploads both to the local cluster (when we want to run tests) and to the remote cluster (when we want to publish to staging).

Since we use gulp scripts that call the PowerShell cmdlets directly, I can say that it does not work.

We consider this a critical issue.

@IgorYastrebov

A very serious one. Our CI is blocked :((((
We were forced to upgrade to 6.1 because of too many DNS-related issues, and now we have a new one :(((

@vitalybibikov

vitalybibikov commented Feb 8, 2018

We've fixed it by adding special tasks (we are using Bamboo) just before deploying to staging and just after the deploy.

To stop the local cluster before the process:

function Stop-ServiceIfRunning {
    param($ServiceName)
    $service = Get-Service -Name $ServiceName

    if ($service.Status -eq 'Running') {
        Stop-Service $ServiceName
        Write-Host "Stopping $ServiceName service"
        Write-Host "$ServiceName service is now stopped"
    }
}
Stop-ServiceIfRunning 'FabricHostSvc'

To start the local cluster after:

function Start-ServiceIfStopped {
    param($ServiceName)
    $service = Get-Service -Name $ServiceName

    if ($service.Status -ne 'Running') {
        Start-Service $ServiceName
        Write-Host "Starting $ServiceName service"
    }
    else {
        Write-Host "$ServiceName service is already started"
    }
}
Start-ServiceIfStopped 'FabricHostSvc'

@heavenwing

I also have this problem after upgrade to 6.1

@oanapl

oanapl commented Feb 9, 2018

Another way to mitigate the issue is to upgrade the cluster so that MaxMessageSize is in sync with the value on the client. This way, no changes are needed on the client machine.
Read "upgrading the cluster fabric settings" for instructions. Inside the fabricSettings section, add or edit the MaxMessageSize entry for NamingService, as shown here:

"fabricSettings": [
  {
    "name": "NamingService",
    "parameters": [
      {
        "name": "MaxMessageSize",
        "value": "10485760"
      }
    ]
  }
]
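If you prefer PowerShell over editing the cluster resource directly, the same setting can likely be applied with the AzureRM.ServiceFabric module (resource group and cluster names are placeholders; this triggers a cluster configuration upgrade, which takes a while to roll through all nodes):

```powershell
# Assumes the AzureRM.ServiceFabric module is installed and you are logged in
Login-AzureRmAccount

Set-AzureRmServiceFabricSetting `
    -ResourceGroupName 'my-rg' `
    -Name 'mycluster' `
    -Section 'NamingService' `
    -Parameter 'MaxMessageSize' `
    -Value '10485760'
```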

@alecor191

Thanks @oanapl for the suggestion.

  • Can you please elaborate on what the implications of increasing the message size are? (any measurable performance increases/decreases, changes in memory consumption, impact on CPU consumption on nodes, etc.)
  • Is this change already thoroughly tested on production clusters and is this an official MS recommendation for customers running production clusters?
  • Any known risks/undesired side effects after applying this change?

@oanapl

oanapl commented Feb 11, 2018

@alecor191, MaxMessageSize determines the maximum size of the messages the cluster accepts. It's used for queries, to determine how much information we put into one message (before we page, or simply drop the message because it's too big), and for copying packages to the cluster. Per our testing, it has no undesirable side effects.

We will upgrade the Azure clusters to use this value for the setting. This config upgrade will take some time to go through all regions. To unblock yourself, I proposed to manually make the upgrade instead of waiting for the general change.

@oanapl

oanapl commented Feb 20, 2018

@ezuidema , yours is probably different, please open another issue so we can address that appropriately.

Update on the status of the original issue: last week we started to deploy the config change that updates the clusters' MaxMessageSize to 10 MB. This was meant as a mitigation for this issue, not a permanent solution. We have decided to cancel this deployment, as we have a new build available that reverts the breaking change on the clients that set MaxMessageSize to 10 MB (plus other features). We plan to make this build available as soon as we have validated it properly; as usual, check our team's blog for info on the release.
We cancelled the config upgrade to avoid too much churn on the clusters from rolling out this config and then, immediately after, the new release.
Again, we are very sorry for the inconvenience.

@JunRamoneda

Removing local cluster was the workaround that I used before deploying. Lately, I was able to deploy to a stand-alone cluster without first removing the local cluster. I am not sure what changed but deploy/publish seems to be working properly

@olivergrimes

It was working for a while after removing the cluster and then restarting VS, but now it has stopped again. Any update on when this will be resolved?

@olitomlinson

I've had the same problem for the last couple of days. I have tried all the resolutions in this thread and none of them allow me to publish successfully to Azure.

Then all of a sudden it standard randomly working (local cluster was removed at this point).

I then turned on local cluster, publish to Azure failed.

Turned off local cluster, restarted VS2017, publish to Azure continues to fail.

I can't nail down a reliable way to reproduce or work around this issue, but it does feel like it's something related to restarting VS 2017 or exiting the local cluster. Or this could just be a red herring.

P.S. @oanapl, trying to do a PUT/PATCH against the SF provider to add MaxMessageSize was unsuccessful. The setting just disappeared after clicking PUT/PATCH, and afterwards the SF resource in the portal showed a red failure banner. I had to go back into resources.azure.com and introduce a whitespace change to the SF provider configuration to clear the error. The underlying cluster was actually healthy during this time.

@JohnnyFun

JohnnyFun commented Feb 22, 2018

@dotdestroyer, when you say you "turned off local cluster", did you "Remove Local Cluster" or "Stop Local Cluster"? I would guess either would work, but when it worked for me last, I specifically selected "Remove Local Cluster" on the machine I was deploying from.

p.s., I like the phrase "standard randomly working" lol, a cynical way of looking at the state of software these days.

@olitomlinson

@JohnnyFun I've done both and neither helped conclusively!

haha my typo hits a little too close to home right now :P

@oanapl

oanapl commented Feb 22, 2018

The release that reverts the MaxMessageSize change (runtime 6.1.467, SDK 3.0.467) is going to be live soon. Once you update, no changes to the Azure clusters will be needed and you can upload your packages with the local cluster installed.
For the mitigation, you need to remove the local cluster, not just stop it.

@oanapl oanapl added this to the Runtime 6.1.467 milestone Feb 22, 2018
@alecor191

@oanapl just to confirm: When upgrading to the patched version of SDK you mentioned, will we have any issues if we did increase the MaxMessageSize in our Azure clusters? Or will that SDK version work with any MaxMessageSize?

@oanapl

oanapl commented Feb 22, 2018

@alecor191, we recommend you change your cluster back to the initial value (e.g. remove the change you made). Thank you for checking! Application upload will succeed with mismatched values, but it's better to have them in sync.

@oanapl

oanapl commented Feb 23, 2018

The 6.1 refresh is available. Please update your local cluster to resolve the copy issues.
If you changed your Azure cluster's MaxMessageSize to 10 MB, please revert that change.
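If the MaxMessageSize change was applied through the AzureRM.ServiceFabric module, reverting it can be sketched like this (resource group and cluster names are placeholders):

```powershell
# Removes the explicit override so the cluster falls back to its default value
Remove-AzureRmServiceFabricSetting `
    -ResourceGroupName 'my-rg' `
    -Name 'mycluster' `
    -Section 'NamingService' `
    -Parameter 'MaxMessageSize'
```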

@oanapl oanapl closed this as completed Feb 23, 2018
@GnanaSelvan

Excellent. It's working for me after I removed the local cluster from the system tray and exited the Service Fabric cluster running locally.

@danrozenberg

I was afflicted by this issue as well. Removing the local cluster makes deployment work again.

@suneetnangia

When is the fix expected please?

@abatishchev

The fix has already been available for about 2 months now.

@suvamM

suvamM commented Jun 9, 2018

I am using runtime 6.2.283.9494 and SDK version 3.1.283.9494, and the publish process still gets stuck at "Copying application to Image Store". I have tried removing the local cluster, even getting rid of it from the system tray altogether, but nothing works. This issue started all of a sudden.
Would really appreciate some help here.

@abatishchev

@suvamM: I'd suggest making sure this is not a local issue.
Try a different machine. If it is indeed a local issue, check what version of the PS cmdlets you're using. Another trap I fell into myself is a difference between the versions installed on the local machine and the versions used from NuGet.
For me this issue is long gone.

@suvamM

suvamM commented Jun 9, 2018

@abatishchev Alright, updating the NuGet packages seems to have fixed it. Thanks a lot for the help.

@alexgman

@suvamM I have this problem, but I don't understand what you and @abatishchev are referring to when you say NuGet packages.

What is the problem, and with which packages?

I'm unable to deploy, just stuck at Copying application to image store...

@abatishchev

> What is the problem

The target Service Fabric SDK doesn't match the actual version on the cluster.

> with which packages?

The Service Fabric NuGet packages you reference from your services within your application.

@suvamM

suvamM commented Jun 10, 2018

@alexgman I take back my comment: I had updated my runtime, my SDK, and the NuGet packages for my project, and the publish succeeded. But I have the same problem with every other project, including the template SF app generated when I create a new SF project. I checked that there is no version mismatch between my local SDK and the version on the cluster.

This is really strange, and it's taking a lot of time to diagnose. It's just stuck at Copying application to image store...

@H286424

H286424 commented Jun 29, 2018

I'm unable to publish my code to the local cluster using Microsoft Azure Service Fabric SDK 3.1.274 and runtime 6.2.274. Can someone suggest the required changes?

@oanapl

oanapl commented Jun 29, 2018

@H286424, there can be many reasons for this. The initial issue discussed in this thread has long been fixed. Please open another issue and add details so we can track your issue properly.

@ErikRBergman

ErikRBergman commented Jul 4, 2018

I had this problem with SDK 3.1.301 and Azure Cluster Service Fabric 6.2.301.9494.

Changing MaxMessageSize didn't fix it.
Upgrading all nuget packages to the latest didn't fix it.

So I compared the .sfproj file in my failing application with one from a brand new SF application I created, which could be deployed.

Changing a few settings in the application .sfproj file fixed the issue.

Change "<ProjectVersion>2.0</ProjectVersion>" to "<ProjectVersion>2.2</ProjectVersion>"
Add "<TargetFrameworkVersion>v4.6.1</TargetFrameworkVersion>"

I also changed all "1.6.4" references to "1.6.6" to use the current MSBuild package version, and changed the version in packages.config to 1.6.6 as well.

I did all of the above steps, but I think it would work by just changing the ProjectVersion.

@sudeephazra

sudeephazra commented Dec 29, 2019

The same issue persists for me, and the solution here of deleting the local cluster does not solve my problem.

I have been wrestling a lot with the Service Fabric setup and seem to be hitting a major roadblock. I am trying to deploy a basic Spring Boot REST application as a guest executable on Service Fabric. I was able to set it up on a cluster on my local machine, but the deployment to the Azure cluster has been stuck for more than 16 hours. I have made multiple attempts, but the problem sticks around.

I have created a single node Standard_D2s_v3 (1 instance) cluster on Azure and using Visual Studio 2017 v15.9.17 to deploy the Spring Boot REST application as a guest executable.

Local Cluster
Code Version: 7.0.457.9590

Azure (Cloud):
Code Version: 6.5.676.9590
