Speed of Format and Free cluster count - Dedicated vs Shared #329
Comments
I wrote a reply on the forum just now, while you were writing this. Here's the text (hopefully I didn't butcher SdFat's inner details too much?)
Something I've wanted to explore is a way to increase SdFat's cache size at runtime, similar to how we have Serial1.addMemoryForRead(buffer, size) and Serial1.addMemoryForWrite(buffer, size) on the serial ports. The other use case that suffers terrible performance is playing more than 1 WAV file simultaneously. A large cache could allow holding the FAT sectors for several open files. Together with a smarter WAV player which has larger buffers and a scheduler so we can stagger larger reads, my hope is to eventually get SdFat many-file read performance to rival the Wav Trigger product which can play 14 stereo files simultaneously using 4 bit SDIO. Maybe larger cache would also allow SHARED_SPI to run faster? |
Future SD cards will require too much memory for many streams with boards like Teensy. This is required for UHS cards like this:
Wav Trigger already is limited to old SD designs. Here is a note from The Wav Trigger site:
New cards are being designed for phones, PCs, and cameras with lots of memory. They are designed for systems like Linux with huge disk caches. I will not chase this problem with SdFat. These cards have huge Allocation Units (AUs) and Record Units (RUs). The host should manage data areas in units of an AU and transfer data in units of an RU. An AU for a modern card is measured in units of 4MB and can be as large as 64MB. An AU consists of a number of RUs, and an RU can be up to 512KB for modern cards. RUs are a multiple of 16KB, so always do at least this size transfer. If you write less than an RU, the next write causes the card to read the RU, add the new data, and write a new RU. Soon the AU must be copied to recover flash. This causes huge latency problems. If you read less than an RU, you will likely reread the RU several times. You can try to manage buffering. Use contiguous exFAT files; they have no FAT entries and are designed to be preallocated for write. Do transfers in a power-of-two number of sectors, up to the 128KB exFAT cluster size. SdFat will then not use its internal cache, will do a single multi-sector transfer, and will not access the FAT or bitmap for the exFAT file. |
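The sizing advice above (at least 16KB per transfer, a power-of-two number of sectors, at most one 128KB exFAT cluster) can be sketched as a small helper. This is only an illustration: `transferSectors` and the constants are hypothetical names, not part of SdFat, and the 512-byte sector size is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values taken from the comment above; none of these names are SdFat API.
constexpr uint32_t kSectorBytes = 512;             // assumed SD sector size
constexpr uint32_t kMinTransferBytes = 16 * 1024;  // smallest RU multiple
constexpr uint32_t kMaxTransferBytes = 128 * 1024; // exFAT cluster size cited above

// True if v has exactly one bit set.
bool isPowerOfTwo(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Hypothetical helper: given the RAM available for one stream's buffer,
// return the largest power-of-two sector count whose size is between
// 16KB and 128KB. Returns 0 when even a 16KB transfer does not fit.
uint32_t transferSectors(uint32_t bufferBytes) {
  if (bufferBytes < kMinTransferBytes) return 0;
  uint32_t bytes = kMinTransferBytes;
  // Double the transfer while it still fits the buffer and the cluster limit.
  while (bytes * 2 <= bufferBytes && bytes * 2 <= kMaxTransferBytes) {
    bytes *= 2;
  }
  uint32_t sectors = bytes / kSectorBytes;
  assert(isPowerOfTwo(sectors));
  return sectors;
}
```

With a 100KB buffer this picks a 64KB transfer (128 sectors); anything below 16KB is rejected rather than rounded down, since smaller transfers are exactly what the comment warns against.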
Have not tried 14 files, but simultaneously playing two 8-channel files and recording one stereo file works flawlessly if you increase the audio block size. My wave player can do that. The point is: 3ms is too tight a corset. For 14 files, I'd suggest reading in loop(). |
+ you need a fast card |
The problem is not playing a many channel file. It's playing 14 stereo files. Each file with newer UHS cards requires very large buffers for good performance. |
To really make simultaneous audio playing work we need to eventually move to a non-blocking API. If we're playing 7 files and (hypothetically) reads are taking 4 audio updates (3ms each), a scheduler needs to be able to request a non-blocking read to bring in enough data for the next 28 updates. That also means we will need 14K buffer size for each of those 7 files, or about 100K RAM. |
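The 14K and 100K figures above can be reproduced with a few lines of arithmetic. The sketch assumes 128 samples per audio update, 16-bit samples, and stereo files; those values are not stated in the comment, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumptions (not stated in the thread): 128 samples per audio update,
// 16-bit samples, 2 channels per file.
constexpr std::size_t kSamplesPerUpdate = 128;
constexpr std::size_t kBytesPerSample = 2;
constexpr std::size_t kChannels = 2;

// Bytes one stereo file consumes during a single audio update.
std::size_t bytesPerUpdate() {
  return kSamplesPerUpdate * kBytesPerSample * kChannels;
}

// If each read blocks for `updatesPerRead` updates and the scheduler serves
// `files` files in turn, a file waits files * updatesPerRead updates between
// refills, so its buffer must hold that much audio.
std::size_t bufferPerFile(std::size_t files, std::size_t updatesPerRead) {
  assert(files > 0 && updatesPerRead > 0);
  return files * updatesPerRead * bytesPerUpdate();
}

// Total RAM across all files.
std::size_t totalBuffer(std::size_t files, std::size_t updatesPerRead) {
  return files * bufferPerFile(files, updatesPerRead);
}
```

For 7 files and 4 updates per read, `bufferPerFile` gives 14336 bytes (14K) and `totalBuffer` gives 100352 bytes, which matches the "about 100K RAM" estimate.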
@biill: have not disputed that. |
But the slow format problem with SHARED_SPI is much simpler, since it's just writing zeros. There's no need for lots of RAM usage. Just adding an API at the driver level so the format code can write a sequence of blocks to all zeros should let us get nearly the best SPI speed. |
A non-blocking API won't do it. Only buffering huge transfers will. The standard for SD cards allows huge read latencies, hundreds of ms, if you read less than an RU for many regions of an SD. |
We can do large buffers when we have the 8 Mbyte PSRAM chip. :) |
It would be great to be able to do lots of this stuff, but at this point I am trying to cherry-pick a few simple things, like: why would this call, Serial.printf("Free Cluster Count: %u dt: ", sd.freeClusterCount());, be 5 times slower on an SPI drive in shared SPI mode? On the external 32GB card in dedicated mode, the timing for the format call and freeClusterCount():
And in shared mode:
So free cluster count went from a little over 1 second to 5.6 seconds. The only difference in code was:
The comments about DEDICATED_SPI sounded like it enables an optimization to read multiple sectors at a time. |
Sorry, I know this is a side question, but I was wondering about what I would think is a simple thing. Suppose I have a pointer to an SDClass object. So far I don't see an easy way to get to its config settings. But I may be missing something obvious. Thanks |
Here is the wrapper for the Teensy SPI driver. As you can see, it uses SPI.transfer(buf, count) for dedicated and shared SPI. It needs to do a memcpy or memset since buf gets sent or clobbered. |
Here is the class definition for the SPI wrapper. |
You can call begin with dedicated SPI to format the card then call begin in shared mode to access the card. |
You mean like this? Or did I mess something up? At least now we know it should work. It was something Kurt and I were experimenting with.
It does seem to work to speed the formatting up. Oops, just saw an error that I just fixed. |
FWIW, apparently a new class of "A2" rated cards is now on the market which claims to give a minimum of 4000 random 4K reads per second, but only if command queuing is used. |
Note: Teensy has better transfer methods:
They do not clobber the input. And you can set the transfer fill character to something like 0 and not have to pass in a transfer buffer.
Yes I can get the SDCard object using the SDClass.card() method. |
Command queuing is not supported in SPI mode and I don't think it is supported by the NXP SDIO controller. Edit: NXP: Support SD/SDIO standard, up to version 3.0. Need 6.0. |
All the SPI driver knows about is the SPI port and the SPI speed. These two are copied from SdSpiConfig in the begin call. |
How does one read the AU and RU sizes? The only info I can find for AU size is part of the 64 byte SD status register. Looks like SdSpiCard class can read it, but SdCardInterface can't. |
"A2" does not say much - I have a Sandisk A2 - I don't use it for Teensy anymore, because it was too slow. |
It's probably far from a full RU size write (and I still have no idea how to even discover what the card's RU size actually is) but I made a quick hack to FatFormatter initFatDir() to allocate a 16K buffer and call writeSectors() rather than writeSector(). It lets SHARED_SPI format almost as fast as DEDICATED_SPI on a 32GB Samsung EVO card. |
Currently I do a max transfer of one cluster in SdFat for shared SPI mode. That limits multi-block transfers to 128KB for shared SPI with exFAT. For dedicated SPI and FIFO SDIO there is no limit. You can write an entire SD for a pre-allocated exFAT file. exFAT is great. It has an allocated length and a used length. Contiguous files don't use the FAT, so this allows any size multi-block transfer with no access to other parts of the SD. As a result of the exFAT spec you can see an enormous write latency if a contiguous file becomes non-contiguous. In that case the entire FAT chain must be constructed, which can take many seconds for a huge file. |
Using shared SPI burns flash. If your card has a 512KB RU you will get a factor of eight extra wear with 16KB writes. I am amazed how well writing 512 byte sectors works on an Uno with shared SPI. Could be a factor of 1024 wear. Data is moved in the card at hundreds of MB/sec to write at 200 KB/sec. I should probably make a call to switch between shared and dedicated SPI. I am totally redoing how dedicated/shared SPI works in the new beta so I will experiment. That way you could switch to dedicated for format or count free space. |
Yup, I've seen those new cards. My relatively new Canon camera uses them. Yes, I hear your frustration that the newer cards are designed around systems with gigabytes of RAM. Obviously we don't have anywhere near that amount of memory, even with small external PSRAM chips added. But we do have so much more than the 2K to 16K memory of 1990s era microcontrollers. Even Raspberry Pi's new low-end chip has 256K RAM, and we can expect that trend from all future chips as the older IC fabs with small wafers become increasingly unprofitable. The other major trend we're seeing is users building far more sophisticated projects by leveraging complex libraries for displays with GUIs, audio, video, networking, machine learning, etc. My long-term concern is we're building those complex libraries on top of storage infrastructure designed around accessing a single file at a time, and only with blocking APIs. Yes, I know modern cards have substantial latency for random read & write. But a fixed cache of only 1 data sector and 1 FAT/bitmap sector is only going to make the matter much worse when someone plays a sound clip while their display library needs to read a JPEG image and a web server library wants to read an HTML file. Having to wait that latency in a blocking call, rather than being able to request a read and get a callback when the buffer is filled.... I completely understand if you're not interested in supporting multiple file access. I know you've put a lot of work into achieving amazing single file performance with only a 1 or 2 sector cache. Arduino is only belatedly embracing non-blocking APIs (their latest SD library got a non-blocking write only months ago) so there isn't a well established de facto standard to follow. I want to build these complex libraries and do so in a way where users can combine them together for their projects with the sort of ease with which they can run multiple programs on their PC.
So I guess my main question is sort of about the general direction of SdFat looking into the near future? |
Morning Bill. If I run your SDInfo sketch (yes, I do look at your test sketches) I can see how you can get config settings relatively easily. But I think that all presumes you specify them using:
So my question is whether there is any way to get the config info when SdSpiConfig is passed in the begin call? And the follow-up, since I have the feeling the answer is going to be no: how can I do a mod, locally of course, to get it? All this is to try to do what you suggested: switch to dedicated mode before formatting and put it back in shared mode when done. But I need the original settings to know what config to put it back to. Not sure this is the final way to go, but I want to try it. Really curious; so far it seems to work, but this is the last piece of the puzzle. PS. Love the explanation on the A2 cards. Man, that's a lot going into the future with PCIe. Thanks |
I have been here long ago. Physicists often design new things and hire Programmers/Engineers to implement the designs in big experiments. I was involved with a disk exec for early Cray super-computers. I was at UCB when the BSD UNIX disk cache was developed and helped with tests. It's amazing how powerful memory is for filesystem performance. The answer is always memory. You move I/O out of the filesystem and make the filesystem only access pages. The filesystem does adaptive read-ahead and write-behind. The filesystem can even be a user process. You don't do async calls to drivers in the filesystem layer. Drivers and the paging system use threads or lightweight tasks in the kernel that do context switches based on events/interrupts. Too bad there is no good free RTOS for a base. Poor Arduino is trying to use mbed. mbed evolved from an OK kernel but the HAL layer is a hodgepodge of wrappers around bad company software. |
The parts of the configuration in the sd.begin(config) call are not saved in a single place. SPI port and SPI speed are saved here at about line 84 of SdSpiArduinoDriver.h
The shared/dedicated SPI mode is currently saved here at about line 358 of SdSpiCard.h You can't just change and restore these items and I am in the process of changing the structure of shared/dedicated SPI. SdFat-beta now has two classes to implement SdSpiCard.
and In short I maintain SdFat for simple users who don't modify internals. sd.begin(config) will continue to work but your mods may not. |
Do you have CFexpress or SD Express? |
Thanks Bill. CsPin wouldn't change of course. Thanks for your help; this was driving me crazy. Will have to look at your beta2.1.1-beta to see what's coming down the pike |
Those are UHS-II SDs. UHS-II can do up to 312MB/s. The new SD Express cards use the PCIe bus at 3940MB/s for PCIe Gen.4 × 2 Lane. I have a PC that I use for AI with Gen.4 SSDs with up to 6600MB/s sequential reads. It does a full backup at 3000MB/s in less than one minute. |
Wow, that's pretty amazing speed. |
I believe we have found a solution to the SHARED_SPI performance problem with FAT32 freeClusterCount() and format(). Code is on this branch: https://github.com/PaulStoffregen/SdFat/tree/writeSectorsSame This adds a readSectorsCallback() and writeSectorsCallback() to BlockDeviceInterface, to allow use of fast read multiple and write multiple sectors of any length with only a single 512 byte buffer. A callback function is used to refill the buffer before writing each sector, or to make use of the buffer after reading each sector. The performance with SHARED_SPI becomes approx the same as DEDICATED_SPI on these long operations, and it probably is much better for internal wear on the SD card. |
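To make the callback idea concrete, here is a minimal sketch using an in-memory stand-in for the card. `MockBlockDevice`, `FillSectorFn`, and the exact parameter list are assumptions for illustration; the real signatures on the branch may differ. The point it shows is that one 512-byte buffer, refilled by a callback before each sector, is enough to drive a single multi-sector write of any length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kSectorSize = 512;

// Hypothetical callback type: fill `buf` with the data for `sector`.
using FillSectorFn = void (*)(uint32_t sector, uint8_t* buf, void* context);

// In-memory stand-in for an SD card. This is not an SdFat class.
class MockBlockDevice {
 public:
  explicit MockBlockDevice(uint32_t sectorCount)
      : data_(static_cast<std::size_t>(sectorCount) * kSectorSize, 0xFF) {}

  // Write `count` sectors starting at `first` as ONE multi-sector transfer,
  // refilling a single 512-byte buffer through the callback each time.
  bool writeSectorsCallback(uint32_t first, uint32_t count,
                            FillSectorFn fill, void* context) {
    if (fill == nullptr) return false;
    if (static_cast<uint64_t>(first) + count > sectorCount()) return false;
    uint8_t buf[kSectorSize];
    ++transfers_;  // a real driver would start a write-multiple command here
    for (uint32_t i = 0; i < count; ++i) {
      fill(first + i, buf, context);
      std::memcpy(&data_[static_cast<std::size_t>(first + i) * kSectorSize],
                  buf, kSectorSize);
    }
    return true;
  }

  uint8_t byteAt(uint32_t sector, std::size_t offset) const {
    assert(sector < sectorCount() && offset < kSectorSize);
    return data_[static_cast<std::size_t>(sector) * kSectorSize + offset];
  }
  std::size_t sectorCount() const { return data_.size() / kSectorSize; }
  unsigned transfers() const { return transfers_; }

 private:
  std::vector<uint8_t> data_;
  unsigned transfers_ = 0;
};

// Callback used when formatting: every sector is all zeros.
void fillZero(uint32_t /*sector*/, uint8_t* buf, void* /*context*/) {
  std::memset(buf, 0, kSectorSize);
}
```

Calling `dev.writeSectorsCallback(8, 16, fillZero, nullptr)` zeroes sectors 8 through 23 in one transfer while using only 512 bytes of RAM for data, which is the property that matters for format() on small boards.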
Not sure if readSectorsCallback() and writeSectorsCallback() would be a welcome addition to SdFat... but if so, would be happy to make any needed changes and send a pull request. Or maybe this could be considered for the redesign version? |
I think the simple answer to slow shared SPI for format or other cases like scan the FAT or bit-map could be a switch between shared and dedicated SPI mode. This would allow a section of fast SD I/O when you can assure the SPI bus won't be accessed by another device. A callback won't change the fact that a transfer is killed if CS is raised. Something like this for the SdCard classes when dedicated/shared SPI is enabled in SdFatConfig.h :
Your code is dead since I already completely changed how dedicated/shared SPI works in SdFat-beta. Edit: actually I will need an API that allows for the case where only shared SPI is supported, for some Uno users. Probably a fail return for the set mode. |
No worries about abandoned code. If the next SdFat offers a better way, happy to use it. But we're not even using this from outside the library. It's just edits in FatLib to speed up format and freeClusterCount when shared SPI is used. No other SPI device should access the SPI bus if SPI.beginTransaction() was called and the code hasn't returned to the main program. |
Thought I'd just try to play 14 files. |
@KurtE - Maybe time to close this issue? We now have a workaround for the original 2 shared SPI performance problems. Seems likely future SdFat will make that workaround unnecessary with an API to switch SPI modes. Bill and Frank proved enough performance exists to play multiple audio files concurrently. Sounds like the only path to eliminating the read latency from hard real time DSP work is a full RTOS, with a 2nd data reading thread which uses SdFat's blocking API to in turn provide non-blocking service to the DSP thread. Can't say I'm excited about that answer, but it seems to be the official answer and I really don't wish to argue any further. |
Thanks. Yes, we addressed the main issues I mentioned. Other issues should probably be put into a new thread specific to those issues. Thanks all |
I am now testing the changes to allow switching to dedicated SPI to optimize format() and freeClusterCount(). Here is an example fix for freeClusterCount().
Users can optimize their code with these calls. |
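One way user code might wrap such a switch is a small scope guard that enters dedicated mode for a long operation and restores the previous mode afterwards. The sketch below uses a mock card so it runs without hardware; `MockSdCard`, `DedicatedSpiGuard`, and the `bool` fail return for boards that only support shared SPI are assumptions based on the discussion, not the library's confirmed API.

```cpp
#include <cassert>

// Mock card, NOT an SdFat class. setDedicatedSpi() returning false models
// the "fail return for the set mode" idea for shared-only builds.
class MockSdCard {
 public:
  explicit MockSdCard(bool dedicatedSupported)
      : supported_(dedicatedSupported) {}

  bool setDedicatedSpi(bool value) {
    if (value && !supported_) return false;  // shared-only build
    dedicated_ = value;
    return true;
  }
  bool isDedicatedSpi() const { return dedicated_; }

 private:
  bool supported_;
  bool dedicated_ = false;  // start in shared mode
};

// Hypothetical scope guard: dedicated mode while alive, previous mode after.
class DedicatedSpiGuard {
 public:
  explicit DedicatedSpiGuard(MockSdCard& card)
      : card_(card), previous_(card.isDedicatedSpi()), active_(false) {
    active_ = card_.setDedicatedSpi(true);
  }
  ~DedicatedSpiGuard() {
    if (active_) card_.setDedicatedSpi(previous_);
  }
  DedicatedSpiGuard(const DedicatedSpiGuard&) = delete;
  DedicatedSpiGuard& operator=(const DedicatedSpiGuard&) = delete;

  bool active() const { return active_; }

 private:
  MockSdCard& card_;
  bool previous_;
  bool active_;
};

// Stand-in for a long operation such as format() or freeClusterCount():
// reports whether the card was in dedicated mode while it ran.
bool wasDedicatedDuring(MockSdCard& card) {
  DedicatedSpiGuard guard(card);
  return card.isDedicatedSpi();
}
```

The guard only helps if no other device touches the SPI bus while it is alive; if the switch fails, the operation still runs, just at shared-mode speed.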
Here is the result of placing a call to sd.setDedicatedSpi between tests in the bench example:
Hi @greiman (and @PaulStoffregen and @mjs513):
For a while now several of us have been experimenting with adding MTP support for the different Teensy boards.
One of the major issues I have run into with MTP integration is that the host will often time out if operations, including startup, take very long to complete. For example, if at sketch startup we specify a number of storages for the host, including one or more SD cards, and it takes very long for us to answer requests from the host, it will time out and MTP will not function at all.
Another area we are working on is the ability to format the SD card. Short version of the story:
I have been seeing that formatting a larger SDCard over SPI was taking a very long time. In my case I am testing using a 32GB Samsung card and a call
was taking on the order of 45 seconds. Note: I started off using the SD library to do the calls, but then converted the example to just SdFat, with the same results.
Doing some experimenting, I am finding a drastic difference in timing with SHARED_SPI versus DEDICATED_SPI.
Simple test sketch:
Test run:
As you can see, in this run it took 5.5 seconds to compute the number of free clusters and about 46.5 seconds to do the format.
Changing to DEDICATED_SPI drastically changes these timings:
But knowing that there could be other devices on the SPI bus, is there some way to say: please do this operation as if we were in dedicated mode?