SSE version of PresetOutputs::PerPixelMath #62

Merged
merged 9 commits into from May 23, 2018
33 changes: 4 additions & 29 deletions src/libprojectM/MilkdropPresetFactory/PresetFrameIO.cpp
@@ -37,17 +37,17 @@ float **alloc_mesh(size_t gx, size_t gy)
// round gy up to multiple 4 (for possible SSE optimization)
gy = (gy+3) & ~(size_t)3;

float **mesh = (float **)wipemalloc(gx * sizeof(float *));
float *m = (float *)wipemalloc(gx * gy * sizeof(float));
float **mesh = (float **)wipe_aligned_alloc(gx * sizeof(float *));
float *m = (float *)wipe_aligned_alloc(gx * gy * sizeof(float));
for ( int x = 0; x < gx; x++ )
mesh[x] = m + (gy * x);
return mesh;
}

float **free_mesh(float **mesh)
{
free(mesh[0]);
free(mesh);
wipe_aligned_free(mesh[0]);
wipe_aligned_free(mesh);
return NULL;
}
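
Why the rounding in alloc_mesh matters: with gy a multiple of 4, every row is a whole number of 16-byte groups, so if the base pointer is 16-byte aligned, each mesh[x] row start is too, which is what _mm_load_ps needs. A minimal sketch of that arithmetic follows. The helper names (round_up_to_4, is_aligned_16, rows_are_aligned) are illustrative only and are not part of projectM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round n up to the next multiple of 4, as alloc_mesh does for gy.
std::size_t round_up_to_4(std::size_t n)
{
    return (n + 3) & ~static_cast<std::size_t>(3);
}

// True when p sits on a 16-byte boundary.
bool is_aligned_16(const void *p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Checks every row start of a gx-by-gy mesh stored contiguously at base.
// Assumes the caller has already rounded gy to a multiple of 4.
bool rows_are_aligned(const float *base, std::size_t gx, std::size_t gy)
{
    assert(gy % 4 == 0);
    for (std::size_t x = 0; x < gx; x++)
    {
        // Same indexing as mesh[x] = m + (gy * x) in alloc_mesh.
        if (!is_aligned_16(base + gy * x))
            return false;
    }
    return true;
}
```

For example, rows_are_aligned(m, gx, round_up_to_4(gy)) should hold for any base pointer m that came from a 16-byte aligned allocator.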

@@ -168,11 +168,8 @@ void PresetOutputs::Render(const BeatDetect &music, const PipelineContext &conte


// N.B. The more optimization that can be done on this method, the better! This is called a lot and can probably be improved.
// NOTE : Keep PerPixelMath_sse and PerPixelMath_c in sync

void PresetOutputs::PerPixelMath_c(const PipelineContext &context)
{

for (int x = 0; x < gx; x++)
{
for (int y = 0; y < gy; y++)
@@ -283,28 +280,6 @@ inline __m128 _mm_cosf(__m128 x)
}


/**
* SSE instructions let us do the math on 4 floats in parallel. You can see the main loop uses y += 4. Each time through the loop,
* we read operands in groups of 4. This looks like a mess, but just think of it as rewriting the infix expression as a prefix expression
*
* e.g.
* this->orig_x[x][y] * 0.5f * fZoom2Inv + 0.5f
* becomes
* __m128 x_mesh =
* _mm_add_ps(
* _mm_mul_ps(
* _mm_load_ps(&this->orig_x[x][y]),
* _mm_mul_ps(fZoom2Inv,_mm_set_ps1(0.5f))), // CONSIDER: common sub-expression
* _mm_set_ps1(0.5f));
*
* _mm_load_ps loads an SSE register from memory (4 floats at a time)
* _mm_set_ps1 takes a constant 0.5 and loads it (replicated 4 times)
* The other expressions are what they sound like:
* a + b --> _mm_add_ps(a, b)
* a * b --> _mm_mul_ps(a, b)
*/
// NOTE : Keep PerPixelMath_sse and PerPixelMath_c in sync
// NOTE : Even better would be to rewrite this as a compute shader
void PresetOutputs::PerPixelMath_sse(const PipelineContext &context)
{
for (int x = 0; x < gx; x++)
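
To make the infix-to-prefix rewriting described in the comment above concrete, here is a minimal, self-contained sketch of one expression in both forms. It is not the projectM code: scale_scalar, scale_sse and the zoomInv parameter are stand-ins, and it assumes an x86 target with SSE, 16-byte aligned buffers, and a length that is a multiple of 4.

```cpp
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_load_ps, _mm_mul_ps, ...
#include <cassert>
#include <cstddef>

// Scalar reference: out[i] = in[i] * 0.5f * zoomInv + 0.5f
void scale_scalar(const float *in, float *out, std::size_t n, float zoomInv)
{
    for (std::size_t i = 0; i < n; i++)
        out[i] = in[i] * 0.5f * zoomInv + 0.5f;
}

// SSE version of the same expression, four floats per iteration.
// Assumes n is a multiple of 4 and both pointers are 16-byte aligned.
void scale_sse(const float *in, float *out, std::size_t n, float zoomInv)
{
    assert(n % 4 == 0);
    const __m128 half = _mm_set_ps1(0.5f);                            // 0.5 replicated 4 times
    const __m128 halfZoom = _mm_mul_ps(_mm_set_ps1(zoomInv), half);   // hoisted common sub-expression
    for (std::size_t i = 0; i < n; i += 4)
    {
        __m128 v = _mm_load_ps(in + i);                  // load 4 floats
        v = _mm_add_ps(_mm_mul_ps(v, halfZoom), half);   // v * (0.5 * zoomInv) + 0.5
        _mm_store_ps(out + i, v);                        // store 4 floats
    }
}
```

Hoisting the constant product out of the loop is the "common sub-expression" idea noted in the comment; whether it helps in practice depends on the compiler and is best measured.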
48 changes: 44 additions & 4 deletions src/libprojectM/wipemalloc.cpp
@@ -25,21 +25,61 @@
*/

#include "wipemalloc.h"
#include <assert.h>

void *wipemalloc( size_t count ) {
count = (count + 15) & ~(size_t)15;
void *mem = aligned_alloc( 16, count );
void *wipemalloc( size_t count )
{
void *mem = malloc( count );
if ( mem != NULL ) {
memset( mem, 0, count );
} else {
printf( "wipemalloc() failed to allocate %d bytes\n", (int)count );
}
return mem;
}
}

/** Safe memory deallocator */
void wipefree( void *ptr ) {
if ( ptr != NULL ) {
free( ptr );
}
}

void *wipe_aligned_alloc( size_t align, size_t size )
{
#if TARGET_OS_MAC
// only support powers of 2 for align
assert( (align & (align-1)) == 0 );
void *allocated = malloc(size + align - 1 + sizeof(void*));
if (allocated == NULL)
{
printf( "wipe_aligned_alloc() failed to allocate %d bytes\n", (int)size );
return NULL;
}
void *ret = (void*) (((size_t)allocated + sizeof(void*) + align -1) & ~(align-1));
*((void**)((size_t)ret - sizeof(void*))) = allocated;
return ret;
#else
void *mem = aligned_alloc( align, size );
Collaborator

Is aligned_alloc available on EVERY other platform that isn't Apple? Raspberry Pi? BSD? Windows?
I know Mac is special and needs power-of-2 alignment; that's fine. I'm concerned about aligned_alloc being portable.

Collaborator Author

I honestly don't know. It is in the C11 standard, and the only Google results I found about lack of support seem to be related to OS X. I could remove the #ifdef and always use the hand-written version.

Collaborator

Well, do whatever you think is best.

Copy link

macOS seems to provide posix_memalign; would that not be sufficient? It seems to work (almost) exactly the same, except that it takes the address of a pointer to write the result to.

Collaborator Author

I'd rather use a library routine (posix_memalign and/or aligned_alloc), but I guess the real problem here is detecting which library routine is available in a general way. GCC is not my natural habitat so I'm definitely open to suggestions.

Collaborator Author

@mbellew, May 22, 2018

There's this from Stack Overflow:

https://stackoverflow.com/questions/16376942/best-cross-platform-method-to-get-aligned-memory

If __STDC_VERSION__ >= 201112L, use aligned_alloc.
If _POSIX_VERSION >= 200112L, use posix_memalign.
If _MSC_VER is defined, use the Windows stuff.
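
A rough sketch of how those three checks could be wired together follows. It is only an illustration under some assumptions: the names sketch_aligned_alloc and sketch_aligned_free and the SKETCH_BACKEND_* macros are hypothetical, the "Windows stuff" is taken to mean _aligned_malloc/_aligned_free, and the size is rounded up to a multiple of the alignment because C11 aligned_alloc expects that. Note that C++ compilers typically do not define __STDC_VERSION__, so in a .cpp file the POSIX or MSVC branch is the one likely to be selected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdlib.h>      // aligned_alloc / posix_memalign / free
#if defined(_MSC_VER)
#include <malloc.h>      // _aligned_malloc / _aligned_free
#else
#include <unistd.h>      // defines _POSIX_VERSION on POSIX systems
#endif

// Pick one backend at compile time, in the order listed above.
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#define SKETCH_BACKEND_C11 1
#elif defined(_POSIX_VERSION) && _POSIX_VERSION >= 200112L
#define SKETCH_BACKEND_POSIX 1
#elif defined(_MSC_VER)
#define SKETCH_BACKEND_MSVC 1
#else
#error "No aligned allocator detected; a hand-written fallback would go here."
#endif

// Hypothetical helper: zeroed, aligned allocation.
// align must be a power of two (and, for posix_memalign, a multiple of sizeof(void*)).
void *sketch_aligned_alloc(std::size_t align, std::size_t size)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size = (size + align - 1) & ~(align - 1);   // round up to a multiple of align
    void *mem = NULL;
#if defined(SKETCH_BACKEND_C11)
    mem = aligned_alloc(align, size);
#elif defined(SKETCH_BACKEND_POSIX)
    if (posix_memalign(&mem, align, size) != 0)  // writes the result through &mem
        mem = NULL;
#elif defined(SKETCH_BACKEND_MSVC)
    mem = _aligned_malloc(size, align);          // note the swapped argument order
#endif
    if (mem != NULL)
    {
        assert((reinterpret_cast<std::uintptr_t>(mem) & (align - 1)) == 0);
        std::memset(mem, 0, size);
    }
    return mem;
}

// Must be paired with sketch_aligned_alloc.
void sketch_aligned_free(void *p)
{
#if defined(SKETCH_BACKEND_MSVC)
    _aligned_free(p);
#else
    free(p);
#endif
}
```

A call such as sketch_aligned_alloc(16, gx * gy * sizeof(float)) would then stand in for the mesh allocation, with the matching free routine used on release.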

Collaborator

Autoconf should be able to detect this and emit a define in config.h.
Search http://download.redis.io/redis-stable/deps/jemalloc/configure.ac for aligned_alloc; maybe this is something like what we want?

AC_CHECK_FUNC([memalign],
	      [AC_DEFINE([JEMALLOC_OVERRIDE_MEMALIGN], [ ])
	       public_syms="${public_syms} memalign"])

Again, do whatever you think is best. I just want to avoid breaking portability if it's not a huge pain.

Collaborator Author

Super, I haven't peeked in a configure.ac file before, but this looks like just the thing.

AC_CHECK_FUNCS_ONCE([aligned_alloc posix_memalign])

BTW, everyone, please crank up the mesh size and FPS and try testing. I seem to see different results on different machines, from dramatic to not-so-much. I want to make sure I'm not seeing things.
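
For illustration, the C++ side of that autoconf check might look something like the sketch below. It assumes the usual autoconf convention that AC_CHECK_FUNCS_ONCE([aligned_alloc posix_memalign]) produces HAVE_ALIGNED_ALLOC and HAVE_POSIX_MEMALIGN in config.h. The cfg_aligned_alloc and cfg_aligned_free names are hypothetical, and the last branch is a variant of the hand-written over-allocation scheme from this diff, used when neither function is detected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdlib.h>      // malloc / free / aligned_alloc / posix_memalign

// In a real build the HAVE_* macros would come from the generated header:
// #include "config.h"

// Hypothetical helper: zeroed, aligned allocation; align must be a power of two.
void *cfg_aligned_alloc(std::size_t align, std::size_t size)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size = (size + align - 1) & ~(align - 1);   // round up to a multiple of align
    void *mem = NULL;
#if defined(HAVE_ALIGNED_ALLOC)
    mem = aligned_alloc(align, size);
#elif defined(HAVE_POSIX_MEMALIGN)
    if (posix_memalign(&mem, align, size) != 0)
        mem = NULL;
#else
    // Fallback: over-allocate, align by hand, and stash the pointer that
    // malloc returned in the bytes just below the aligned block.
    void *raw = malloc(size + align - 1 + sizeof(void *));
    if (raw != NULL)
    {
        std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
        std::uintptr_t aligned = (start + align - 1) & ~static_cast<std::uintptr_t>(align - 1);
        mem = reinterpret_cast<void *>(aligned);
        std::memcpy(static_cast<char *>(mem) - sizeof(void *), &raw, sizeof(void *));
    }
#endif
    if (mem != NULL)
        std::memset(mem, 0, size);
    return mem;
}

// Must be paired with cfg_aligned_alloc; accepts NULL.
void cfg_aligned_free(void *p)
{
    if (p == NULL)
        return;
#if defined(HAVE_ALIGNED_ALLOC) || defined(HAVE_POSIX_MEMALIGN)
    free(p);
#else
    void *raw = NULL;
    std::memcpy(&raw, static_cast<char *>(p) - sizeof(void *), sizeof(void *));
    free(raw);   // free the pointer malloc originally returned
#endif
}
```

Keying on config.h rather than on TARGET_OS_MAC would let the build decide per platform, and the fallback keeps things working where neither routine exists.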

if ( mem != NULL ) {
memset( mem, 0, size );
} else {
printf( "wipe_aligned_alloc() failed to allocate %d bytes\n", (int)size );
}
return mem;
#endif
}

void wipe_aligned_free( void *p )
{
#if TARGET_OS_MAC
if (p != NULL)
{
void *allocated = *((void**)((size_t)p - sizeof(void*)));
free(allocated);
}
#else
if (p != NULL)
free(p);
#endif
}
4 changes: 4 additions & 0 deletions src/libprojectM/wipemalloc.h
@@ -57,4 +57,8 @@
void *wipemalloc( size_t count );
void wipefree( void *ptr );

/** wipe_aligned_alloc() must be matched with wipe_aligned_free() */
void *wipe_aligned_alloc( size_t align, size_t count);
inline void *wipe_aligned_alloc( size_t count ) { return wipe_aligned_alloc(16,count); }
void wipe_aligned_free( void *ptr );
#endif /** !_WIPEMALLOC_H */