Panic while fine-tuning with LORA #517

romansky · 2024-03-02T14:46:43Z

Hi,

Thanks everyone working on this awesome lib!

I've been trying to fine-tune a 7B model on my M2 Ultra 64GB machine.
After some time the machine panics. I think it has something to do with memory.

Since I'm using SSH, I was able to see the latest memory usage figure:
RAM Usage: 55.8/64.0GB - swap:75.9/77.0GB
I'm using an 8K+ context size when fine-tuning.
The model in the model directory is a 7B Mistral model that I imported and quantized to 8 bits.

I don't want to compromise on the quality of the fine tune. Can I do something to allow this to complete, or pass some flag so that it runs a bit slower but is able to finish?

The SSH session log:

python ./lora.py \
    --model ./mlx_model \
    --train \
    --max-tokens 12000 \
    --batch-size 1 \
    --save-every 10 \
    --adapter-file ./adapters.npz \
    --resume-adapter-file ./adapters.npz \
    --iters 50
Loading pretrained model
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Total parameters 2132.851M
Trainable parameters 1.704M
Loading datasets
Loading pretrained adapters from ./adapters.npz
Training
[WARNING] Some sequences are longer than 2048 tokens. Consider pre-splitting your data to save memory.
Iter 1: Val loss 2.386, Val took 7.670s
[WARNING] Some sequences are longer than 2048 tokens. Consider pre-splitting your data to save memory.
[WARNING] Some sequences are longer than 2048 tokens. Consider pre-splitting your data to save memory.
[WARNING] Some sequences are longer than 2048 tokens. Consider pre-splitting your data to save memory.
[WARNING] Some sequences are longer than 2048 tokens. Consider pre-splitting your data to save memory.
[WARNING] Some sequences are longer than 2048 tokens. Consider pre-splitting your data to save memory.
client_loop: send disconnect: Broken pipe

The panic log:

panic(cpu 5 caller 0xfffffe0022e5aa34): watchdog timeout: no checkins from watchdogd in 92 seconds (17889 total checkins since monitoring last enabled)
Debugger message: panic
Memory ID: 0xff
OS release type: User
OS version: 23D60
Kernel version: Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020
Fileset Kernelcache UUID: xxx
Kernel UUID:xxx
Boot session UUID: xxxx
iBoot version: iBoot-10151.81.1
secure boot?: YES
roots installed: 0
Paniclog version: 14
KernelCache slide: 0x0000000019ea8000
KernelCache base:  0xfffffe0020eac000
Kernel slide:      0x0000000019eb0000
Kernel text base:  0xfffffe0020eb4000
Kernel text exec slide: 0x000000001b404000
Kernel text exec base:  0xfffffe0022408000
mach_absolute_time: 0x3e823430b3d
Epoch Time:        sec       usec
  Boot    : 0x65e07b38 0x000ae09d
  Sleep   : 0x00000000 0x00000000
  Wake    : 0x00000000 0x00000000
  Calendar: 0x65e33657 0x00036b51

Zone info:
  Zone map: 0xfffffe1805640000 - 0xfffffe3805640000
  . VM    : 0xfffffe1805640000 - 0xfffffe1cd230c000
  . RO    : 0xfffffe1cd230c000 - 0xfffffe1e6bca4000
  . GEN0  : 0xfffffe1e6bca4000 - 0xfffffe2338970000
  . GEN1  : 0xfffffe2338970000 - 0xfffffe280563c000
  . GEN2  : 0xfffffe280563c000 - 0xfffffe2cd2308000
  . GEN3  : 0xfffffe2cd2308000 - 0xfffffe319efd4000
  . DATA  : 0xfffffe319efd4000 - 0xfffffe3805640000
  Metadata: 0xfffffe5ff4658000 - 0xfffffe5ffc658000
  Bitmaps : 0xfffffe5ffc658000 - 0xfffffe600835c000
  Extra   : 0 - 0

TPIDRx_ELy = {1: 0xfffffe2338c69800  0: 0x0000000000001005  0ro: 0x0000000000000000 }
CORE 0 PVH locks held: None
CORE 1 PVH locks held: None
CORE 2 PVH locks held: None
CORE 3 PVH locks held: None
CORE 4 PVH locks held: None
CORE 5 PVH locks held: None
CORE 6 PVH locks held: None
CORE 7 PVH locks held: None
CORE 8 PVH locks held: None
CORE 9 PVH locks held: None
CORE 10 PVH locks held: None
CORE 11 PVH locks held: None
CORE 12 PVH locks held: None
CORE 13 PVH locks held: None
CORE 14 PVH locks held: None
CORE 15 PVH locks held: None
CORE 16 PVH locks held: None
CORE 17 PVH locks held: None
CORE 18 PVH locks held: None
CORE 19 PVH locks held: None
CORE 20 PVH locks held: None
CORE 21 PVH locks held: None
CORE 22 PVH locks held: None
CORE 23 PVH locks held: None
CORE 0: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809c3fef0
CORE 1: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6807997ef0
CORE 2: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809fefef0
CORE 3: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809e4bef0
CORE 4: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe68072fbef0
CORE 5 is the one that panicked. Check the full backtrace for details.
CORE 6: PC=0xfffffe0022498578, LR=0xfffffe0022498574, FP=0xfffffe6807ea7ef0
CORE 7: PC=0xfffffe0022498578, LR=0xfffffe0022498574, FP=0xfffffe6806fcbef0
CORE 8: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809dd3ef0
CORE 9: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809a47ef0
CORE 10: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809e33ef0
CORE 11: PC=0xfffffe0022498578, LR=0xfffffe0022498574, FP=0xfffffe6808a77ef0
CORE 12: PC=0xfffffe00225ab0a4, LR=0xfffffe00225ab0a0, FP=0xfffffe6809e93e80
CORE 13: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809cfbef0
CORE 14: PC=0xfffffe00225ab0a4, LR=0xfffffe00225ab0a0, FP=0xfffffe6809ca3e80
CORE 15: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809ec3ef0
CORE 16: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809c0fef0
CORE 17: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe6809d8bef0
CORE 18: PC=0xfffffe0022498574, LR=0xfffffe0022498574, FP=0xfffffe68073bbef0
CORE 19: PC=0xfffffe0022498578, LR=0xfffffe0022498574, FP=0xfffffe68076dfef0
CORE 20: PC=0xfffffe0022498578, LR=0xfffffe0022498574, FP=0xfffffe6807a93ef0
CORE 21: PC=0xfffffe0022498578, LR=0xfffffe0022498574, FP=0xfffffe6808e67ef0
CORE 22: PC=0xfffffe0022498578, LR=0xfffffe0022498574, FP=0xfffffe6809983ef0
CORE 23: PC=0xfffffe0022498578, LR=0xfffffe0022498574, FP=0xfffffe6809ce3ef0
Compressor Info: 36% of compressed pages limit (OK) and 100% of segments limit (BAD) with 78 swapfiles and OK swap space
Total cpu_usage: 70341938
Thread task pri cpu_usage
0xfffffe2338c69800 kernel_task 91 2010011
0xfffffe2338cb0800 kernel_task 0 745
0xfffffe2338f64000 kernel_task 0 141
0xfffffe2338f64800 kernel_task 0 2133
0xfffffe2338f59000 kernel_task 0 19890

Panicked task 0xfffffe23387ef9a8: 0 pages, 833 threads: pid 0: kernel_task
Panicked thread: 0xfffffe2338c69800, backtrace: 0xfffffe6806bd38b0, tid: 3038
		  lr: 0xfffffe002245e9b0  fp: 0xfffffe6806bd3940
		  lr: 0xfffffe00225a7734  fp: 0xfffffe6806bd39b0
		  lr: 0xfffffe00225a5ccc  fp: 0xfffffe6806bd3aa0
		  lr: 0xfffffe002240f8cc  fp: 0xfffffe6806bd3ab0
		  lr: 0xfffffe002245e2a4  fp: 0xfffffe6806bd3e60
		  lr: 0xfffffe0022c56e1c  fp: 0xfffffe6806bd3e80
		  lr: 0xfffffe0022e5aa34  fp: 0xfffffe6806bd3ed0
		  lr: 0xfffffe0022e59800  fp: 0xfffffe6806bd3f10
		  lr: 0xfffffe0022e56d88  fp: 0xfffffe6806bd3f30
		  lr: 0xfffffe002380cce8  fp: 0xfffffe6806bd3fd0
		  lr: 0xfffffe00225a8ed0  fp: 0xfffffe6806bd3fe0
		  lr: 0xfffffe002240f940  fp: 0xfffffe6806bd3ff0
		  lr: 0xfffffe0022541e90  fp: 0xfffffe680756be60
		  lr: 0xfffffe0022543538  fp: 0xfffffe680756bf20
		  lr: 0xfffffe0022418be4  fp: 0x0000000000000000
      Kernel Extensions in backtrace:
         com.apple.driver.AppleInterruptControllerV2(1.0d1)[A1A4C713-2CF8-3485-8348-49F483D98B7B]@0xfffffe002380b220->0xfffffe002380da8f
            dependency: com.apple.driver.AppleARMPlatform(1.0.2)[603C49BE-3001-3033-B79E-584777C5610D]@0xfffffe0022e02a80->0xfffffe0022e555bb
         com.apple.driver.AppleARMWatchdogTimer(1.0)[DED22881-DEAC-3284-BD3A-752F6CF07E46]@0xfffffe0022e555c0->0xfffffe0022e5aa5b
            dependency: com.apple.driver.AppleARMPlatform(1.0.2)[603C49BE-3001-3033-B79E-584777C5610D]@0xfffffe0022e02a80->0xfffffe0022e555bb

last started kext at 413362825: com.apple.filesystems.autofs	3.0 (addr 0xfffffe00218bc380, size 5912)
loaded kexts:
com.apple.filesystems.autofs	3.0
com.apple.AppleEthernetAquantiaAqtionFirmware	1.0.36
com.apple.driver.AppleBiometricServices	1
com.apple.driver.CoreKDL	1
com.apple.driver.DiskImages.ReadWriteDiskImage	493.0.0
com.apple.driver.DiskImages.UDIFDiskImage	493.0.0
com.apple.driver.DiskImages.RAMBackingStore	493.0.0
com.apple.driver.DiskImages.FileBackingStore	493.0.0
com.apple.driver.BCMWLANFirmware4388.Hashstore	1
com.apple.driver.AppleFileSystemDriver	3.0.1
com.apple.driver.AppleUSBDeviceNCM	5.0.0
com.apple.nke.l2tp	1.9
com.apple.filesystems.tmpfs	1
com.apple.driver.AppleThunderboltIP	4.0.3
com.apple.driver.AppleAOPVoiceTrigger	300.7
com.apple.driver.SEPHibernation	1
com.apple.driver.ApplePMP	1
com.apple.driver.AppleSmartIO2	1
com.apple.filesystems.nfs	1
com.apple.filesystems.lifs	1
com.apple.filesystems.apfs	2235.80.4
com.apple.IOTextEncryptionFamily	1.0.0
com.apple.AppleEmbeddedSimpleSPINORFlasher	1
com.apple.filesystems.hfs.kext	650.0.2
com.apple.security.BootPolicy	1
com.apple.BootCache	40
com.apple.AppleFSCompression.AppleFSCompressionTypeZlib	1.0.0
com.apple.AppleFSCompression.AppleFSCompressionTypeDataless	1.0.0d1
com.apple.driver.AppleTypeCRetimer	1.0.0
com.apple.driver.AppleSN012776Amp	700.46
com.apple.driver.AppleCS42L84Audio	700.46
com.apple.driver.AppleT6022CLPC	1
com.apple.driver.AppleT6020SOCTuner	1
com.apple.driver.AppleSleepPowerPolicy	1
com.apple.driver.AppleSmartBatteryManager	161.0.0
com.apple.driver.AppleSamsungSerial	1.0.0d1
com.apple.driver.AppleEventLogHandler	1
com.apple.driver.AppleProResHW	326.11.0
com.apple.driver.AppleS8000AES	1
com.apple.driver.AppleT6021PMGR	1
com.apple.driver.AppleS5L8960XNCO	1
com.apple.driver.ApplePMPFirmware	1
com.apple.driver.AppleBCMWLANBusInterfacePCIe	1
com.apple.driver.AppleBluetoothModule	1
com.apple.driver.AudioDMAController-T602x	300.15
com.apple.driver.AppleSerialShim	1
com.apple.driver.AppleSPIMC	1
com.apple.driver.AppleJPEGDriver	6.2.2
com.apple.driver.AppleInterruptControllerV2	1.0.0d1
com.apple.driver.AppleS8000DWI	1.0.0d1
com.apple.driver.usb.AppleSynopsysUSB40XHCI	1
com.apple.driver.AppleS5L8920XPWM	1.0.0d1
com.apple.driver.AppleMobileDispT602X-DCP	140.0
com.apple.AGXG14X	276.62
com.apple.driver.AppleSDXC	3.4.3
com.apple.driver.AppleAVE2	703.54.1
com.apple.driver.AppleAVD	737.2
com.apple.driver.AppleM68Buttons	1.0.0d1
com.apple.driver.AppleT8110DART	1
com.apple.driver.AppleS5L8940XI2C	1.0.0d2
com.apple.driver.AppleT6020	1
com.apple.iokit.IOUserEthernet	1.0.1
com.apple.driver.usb.AppleUSBUserHCI	1
com.apple.iokit.IOKitRegistryCompatibility	1
com.apple.iokit.EndpointSecurity	1
com.apple.driver.AppleDiskImages2	273
com.apple.AppleSystemPolicy	2.0.0
com.apple.nke.applicationfirewall	404
com.apple.kec.InvalidateHmac	1
com.apple.kec.AppleEncryptedArchive	1
com.apple.driver.driverkit.serial	6.0.0
com.apple.driver.AppleMesaSEPDriver	100.99
com.apple.iokit.IOBiometricFamily	1
com.apple.driver.AppleEthernetAquantiaAqtion	1.0.64
com.apple.driver.DiskImages.KernelBacked	493.0.0
com.apple.driver.AppleXsanScheme	3
com.apple.driver.usb.AppleEmbeddedUSBXHCIPCI	1
com.apple.driver.usb.AppleUSBXHCIPCI	1.2
com.apple.driver.AppleConvergedIPCOLYBTControl	1
com.apple.driver.AppleConvergedPCI	1
com.apple.driver.AppleBluetoothDebug	1
com.apple.driver.usb.networking	5.0.0
com.apple.nke.ppp	1.9
com.apple.driver.AppleThunderboltPCIDownAdapter	4.1.1
com.apple.driver.AppleThunderboltDPInAdapter	8.5.1
com.apple.driver.AppleThunderboltDPAdapterFamily	8.5.1
com.apple.driver.AppleThunderboltUSBDownAdapter	1.0.4
com.apple.driver.AppleSEPHDCPManager	1.0.1
com.apple.driver.AppleAOPAudio	300.14
com.apple.driver.AppleTrustedAccessory	1
com.apple.iokit.AppleSEPGenericTransfer	1
com.apple.driver.AppleDCPDPTXProxy	1.0.0
com.apple.driver.DCPDPFamilyProxy	1
com.apple.driver.AppleDiagnosticDataAccessReadOnly	1.0.0
com.apple.driver.AppleBSDKextStarter	3
com.apple.kext.triggers	1.0
com.apple.driver.AppleBTM	1.0.1
com.apple.driver.IOHIDPowerSource	1
com.apple.driver.AppleCallbackPowerSource	1
com.apple.filesystems.hfs.encodings.kext	1
com.apple.driver.AppleSyntheticGameController	11.3.1
com.apple.driver.AppleI2CEthernetAquantia	1.0.0
com.apple.driver.AppleCSEmbeddedAudio	700.46
com.apple.driver.AppleEmbeddedAudio	700.46
com.apple.iokit.AppleARMIISAudio	300.11
com.apple.driver.ApplePassthroughPPM	3.0
com.apple.iokit.IONVMeFamily	2.1.0
com.apple.driver.AppleNANDConfigAccess	1.0.0
com.apple.driver.AppleSART	1
com.apple.driver.AppleSPU	1
com.apple.driver.ApplePMGR	1
com.apple.driver.AppleBluetoothDebugService	1
com.apple.driver.AppleBCMWLANCore	1.0.0
com.apple.iokit.IO80211Family	1200.13.0
com.apple.driver.IOImageLoader	1.0.0
com.apple.driver.AppleOLYHAL	1
com.apple.driver.AppleStockholmControl	1.0.0
com.apple.driver.AppleDisplayCrossbar	1.0.0
com.apple.driver.AppleSPMIPMU	1.0.1
com.apple.driver.AppleDialogPMU	1.0.1
com.apple.driver.AppleSPMI	1.0.1
com.apple.driver.AppleHPM	3.4.4
com.apple.iokit.IODisplayPortFamily	1.0.0
com.apple.driver.AppleARMWatchdogTimer	1
com.apple.AGXFirmwareKextG14XRTBuddy	1
com.apple.AGXFirmwareKextRTBuddy64	276.62
com.apple.driver.AppleT8112TypeCPhy	1
com.apple.driver.AppleT8103TypeCPhy	1
com.apple.driver.AppleUSBXDCIARM	1.0
com.apple.driver.AppleUSBXDCI	1.0
com.apple.iokit.IOUSBDeviceFamily	2.0.0
com.apple.driver.usb.AppleSynopsysUSBXHCI	1
com.apple.driver.usb.AppleUSBXHCI	1.2
com.apple.driver.AppleTypeCPhy	1
com.apple.driver.AppleEmbeddedUSBHost	1
com.apple.driver.usb.AppleUSBHub	1.2
com.apple.driver.usb.AppleUSBHostCompositeDevice	1.2
com.apple.driver.DCPAVFamilyProxy	1
com.apple.iokit.IOMobileGraphicsFamily-DCP	343.0.0
com.apple.driver.AppleDCP	1
com.apple.driver.AppleFirmwareKit	1
com.apple.iokit.IOMobileGraphicsFamily	343.0.0
com.apple.driver.AppleM2ScalerCSCDriver	265.0.0
com.apple.iokit.IOGPUFamily	93.10.1
com.apple.driver.AppleMCA2-T602x	800.11
com.apple.driver.AppleEmbeddedAudioLibs	300.1
com.apple.driver.AppleFirmwareUpdateKext	1
com.apple.driver.AppleMultiFunctionManager	1
com.apple.driver.corecapture	1.0.4
com.apple.driver.AppleT6000PCIeC	1
com.apple.driver.ApplePIODMA	1
com.apple.driver.AppleThunderboltNHI	7.2.81
com.apple.iokit.IOThunderboltFamily	9.3.3
com.apple.iokit.IOPortFamily	1.0
com.apple.iokit.IOAVBFamily	1220.1
com.apple.plugin.IOgPTPPlugin	1230.2
com.apple.driver.AppleEthernetAquantiaAqtionPortMonitor	1.0.0
com.apple.driver.AppleT602xPCIe	1
com.apple.driver.AppleEmbeddedPCIE	1
com.apple.driver.AppleGPIOICController	1.0.2
com.apple.driver.AppleFireStormErrorHandler	1
com.apple.driver.AppleMobileApNonce	1
com.apple.driver.usb.AppleUSBHostPacketFilter	1.0
com.apple.iokit.IOTimeSyncFamily	1230.2
com.apple.driver.DiskImages	493.0.0
com.apple.iokit.IOGraphicsFamily	598
com.apple.iokit.IOBluetoothFamily	9.0.0
com.apple.driver.AppleSSE	1.0
com.apple.driver.AppleSEPKeyStore	2
com.apple.driver.AppleUSBTDM	556
com.apple.iokit.IOUSBMassStorageDriver	243
com.apple.iokit.IOPCIFamily	2.9
com.apple.iokit.IOSCSIBlockCommandsDevice	492
com.apple.iokit.IOSCSIArchitectureModelFamily	492
com.apple.driver.AppleRSMChannel	1
com.apple.iokit.IORSMFamily	1
com.apple.driver.AppleLockdownMode	1
com.apple.driver.AppleIPAppender	1.0
com.apple.driver.AppleFDEKeyStore	28.30
com.apple.driver.AppleEffaceableStorage	1.0
com.apple.driver.AppleCredentialManager	1.0
com.apple.driver.KernelRelayHost	1
com.apple.iokit.IOUSBHostFamily	1.2
com.apple.driver.AppleUSBHostMergeProperties	1.2
com.apple.driver.usb.AppleUSBCommon	1.0
com.apple.driver.AppleSMC	3.1.9
com.apple.driver.RTBuddy	1.0.0
com.apple.driver.AppleEmbeddedTempSensor	1.0.0
com.apple.driver.AppleARMPMU	1.0
com.apple.iokit.IOAccessoryManager	1.0.0
com.apple.driver.AppleOnboardSerial	1.0
com.apple.iokit.IOSkywalkFamily	1.0
com.apple.driver.mDNSOffloadUserClient	1.0.1b8
com.apple.iokit.IONetworkingFamily	3.4
com.apple.iokit.IOSerialFamily	11
com.apple.driver.AppleSEPManager	1.0.1
com.apple.driver.AppleA7IOP	1.0.2
com.apple.driver.IOSlaveProcessor	1
com.apple.driver.AppleBiometricSensor	2
com.apple.iokit.IOHIDFamily	2.0.0
com.apple.driver.AppleANELoadBalancer	7.300.0
com.apple.driver.AppleH11ANEInterface	7.300.0
com.apple.driver.IODARTFamily	1
com.apple.AUC	1.0
com.apple.iokit.IOSurface	352.0.3
com.apple.iokit.IOAVFamily	1.0.0
com.apple.iokit.IOHDCPFamily	1.0.0
com.apple.iokit.IOCECFamily	1
com.apple.iokit.IOAudio2Family	1.0
com.apple.driver.AppleIISController	300.1
com.apple.driver.AppleAudioClockLibs	300.1
com.apple.driver.FairPlayIOKit	71.3.0
com.apple.driver.AppleARMPlatform	1.0.2
com.apple.iokit.IOSlowAdaptiveClockingFamily	1.0.0
com.apple.iokit.IOReportFamily	47
com.apple.security.quarantine	4
com.apple.security.sandbox	300.0
com.apple.iokit.IOStorageFamily	2.1
com.apple.kext.AppleMatch	1.0.0d1
com.apple.driver.AppleMobileFileIntegrity	1.0.5
com.apple.iokit.CoreAnalyticsFamily	1
com.apple.security.AppleImage4	5.0.0
com.apple.kext.CoreTrust	1
com.apple.iokit.IOCryptoAcceleratorFamily	1.0.1
com.apple.kec.pthread	1
com.apple.kec.Libm	1
com.apple.kec.Compression	1.0
com.apple.kec.corecrypto	14.0



** Stackshot Succeeded ** Bytes Traced 163604 (Uncompressed 496944) **


awni · 2024-03-02T15:16:21Z

A few comments / questions:

55.8/64.0GB - swap:75.9/77.0GB

That is a lot of RAM for a 7B model! The swap is especially concerning. How did you measure that?
How did you convert the model? It would be good to be sure that the dtype for the non-quantized layers is fp16 and not bf16 or fp32.
What version of MLX are you using? We have made some improvements that reduce RAM use, so make sure you use the latest.
Try using the LoRA script in MLX LM instead of the lora/ example. It compiles the norms by default, which slightly reduces memory requirements for the non-LoRA layers. Otherwise, it's basically the same but with a few additional features.

I don't want to compromise on the quality of the fine tune

I assume this means you don't want to reduce the maximum sequence length or the number of lora layers? Either would help a lot.
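To give a feel for why sequence length matters so much, here is a rough back-of-the-envelope sketch. It assumes the full attention score matrix is materialized in fp16 for every head (no flash attention) and uses 32 heads, which matches the published Mistral 7B config. The function name is made up for illustration, and the real allocation pattern depends on the MLX version, so treat the numbers as an order-of-magnitude guide only.

```python
# Rough estimate only: assumes the full (seq_len x seq_len) attention score
# matrix is materialized in fp16 for every head, which may not match what a
# given MLX version actually allocates.

GIB = 1024 ** 3


def attention_scores_bytes(seq_len, n_heads=32, batch_size=1, bytes_per_elem=2):
    """Bytes needed for one layer's attention score matrix."""
    return batch_size * n_heads * seq_len * seq_len * bytes_per_elem


# The cost grows with the square of the sequence length.
for seq_len in (2048, 8192, 12000):
    gib = attention_scores_bytes(seq_len) / GIB
    print(f"{seq_len:>6} tokens: {gib:.2f} GiB per layer")
```

Under these assumptions, going from 2048 to 8192 tokens multiplies that one buffer by 16, which is why trimming the maximum sequence length tends to be the biggest lever.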

There are some other things we/you can do to reduce memory:

Checkpointing. This is an experimental / undocumented feature, but you can look at our transformer implementation to see how to use it. It will slow things down but should also reduce memory use.
Compile the full graph. This may slow things down, especially if your input sequences vary in length, but may also substantially reduce memory use.
Disable caching. In the latest MLX you can disable the buffer cache (mx.set_cache_limit(0)). You have to build from source for that. I don't expect that to help much for you, since we usually try to clear the cache before we start swapping.
We have some additional improvements on the roadmap (flash attention, avoiding copies into matmul, ...) which should help in the future.

I plan to update the mlx lm lora to do bucketing + compilation, with an option to checkpoint. In the meantime, you can experiment with those if you are comfortable digging into the Python.
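For anyone wondering what bucketing means here, the following is a minimal sketch of the general idea, not the MLX LM implementation. Sequences are padded up to the next multiple of a bucket size so that a compiled graph only ever sees a handful of distinct input shapes. The function names, the bucket size of 512, and the pad id of 0 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def bucket_length(n_tokens, bucket_size=512):
    """Round a sequence length up to the next multiple of bucket_size."""
    return ((n_tokens + bucket_size - 1) // bucket_size) * bucket_size


def bucket_sequences(sequences, bucket_size=512, pad_id=0):
    """Group token-id lists by padded length and pad each one to its bucket."""
    buckets = defaultdict(list)
    for seq in sequences:
        target = bucket_length(len(seq), bucket_size)
        buckets[target].append(seq + [pad_id] * (target - len(seq)))
    return dict(buckets)


# Four sequences of different lengths collapse into two shapes.
seqs = [[1] * 100, [1] * 500, [1] * 513, [1] * 1024]
for length, group in sorted(bucket_sequences(seqs).items()):
    print(length, len(group))
```

The padded positions would still need to be masked out of the loss, and a smaller bucket size wastes less padding at the cost of more distinct shapes to compile.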

romansky · 2024-03-02T16:42:51Z

Hey! Thanks for the quick reply.

I had one SSH session running asitop; it's one of the stats measured there. It's actually really nice for monitoring.
I used the import utility like so (from the model page it looks like an FP16 one):

python convert.py \
  --hf-path NurtureAI/OpenHermes-2.5-Mistral-7B-16k \
  --quantize \
  --q-bits 8

I have the latest (0.5) installed but was using the scripts ("./lora.py") from 0.3. I'm re-importing and running everything again to see if it's something from an older version.
Doing that now.
Yup, I understand using fewer layers may help, but for my use case I want the highest quality.
I am using checkpointing, so a question about that, if I run a session and it crashed and I run it again, does it start from the beginning assuming the model has not been past some checkpoint? (to make sure I understand what this does..)
will try some of this stuff and report!

romansky · 2024-03-02T17:28:36Z

@awni thanks a lot, just re-ran and it went smoothly. Great success!

awni · 2024-03-02T17:30:30Z

Well that's a surprise and a delight, great to hear!

I am using checkpointing, so a question about that, if I run a session and it crashed and I run it again, does it start from the beginning assuming the model has not been past some checkpoint? (to make sure I understand what this does..)

So the term "checkpointing" is overloaded. In mlx-lm you can resume from the stored adapters in the checkpoints/ directory. You have to explicitly point to the correct adapter file there when you set the flag --resume-adapter-file.

What I was referring to re checkpointing is gradient checkpointing, which is a way to reduce memory use at the cost of computation. That's a totally different thing and is not currently used in any of our LoRA examples.
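To make the memory-for-compute trade concrete, here is a toy sketch in plain Python. It is not how MLX implements it: the "model" is just a chain of scalar functions with hand-written derivatives, and all names are made up for illustration. Standard backprop keeps the input of every layer, while the checkpointed version keeps only every fourth input and recomputes the rest segment by segment during the backward pass.

```python
import math

# Each "layer" is a pair (f, df): a scalar function and its derivative.
LAYERS = [
    (math.sin, math.cos),
    (math.tanh, lambda v: 1.0 - math.tanh(v) ** 2),
    (lambda v: v * v, lambda v: 2.0 * v),
    (math.exp, math.exp),
] * 4  # 16 layers in total


def grad_full(x, layers):
    """Standard backprop: store the input of every layer."""
    inputs = []
    a = x
    for f, _ in layers:
        inputs.append(a)
        a = f(a)
    grad = 1.0
    for (_, df), a_in in zip(reversed(layers), reversed(inputs)):
        grad *= df(a_in)
    return grad, len(inputs)


def grad_checkpointed(x, layers, every=4):
    """Store only every `every`-th layer input and recompute the others."""
    saved = {}
    a = x
    for i, (f, _) in enumerate(layers):
        if i % every == 0:
            saved[i] = a
        a = f(a)
    grad = 1.0
    peak = len(saved)
    for start in sorted(saved, reverse=True):
        segment = layers[start:start + every]
        # Recompute this segment's layer inputs from its checkpoint.
        inputs = []
        a = saved[start]
        for f, _ in segment:
            inputs.append(a)
            a = f(a)
        peak = max(peak, len(saved) + len(inputs))
        for (_, df), a_in in zip(reversed(segment), reversed(inputs)):
            grad *= df(a_in)
    return grad, peak


g_full, n_full = grad_full(0.3, LAYERS)
g_ckpt, n_ckpt = grad_checkpointed(0.3, LAYERS, every=4)
print(g_full, n_full)
print(g_ckpt, n_ckpt)
```

Both versions produce the same gradient, but the checkpointed one holds fewer activations at once and pays for it with roughly one extra forward pass.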

romansky closed this as completed Mar 2, 2024
