# Homework 09 - Through the Levels of Abstraction

# <b><font color ="red">Assignment not yet complete. Do not start until linked on the Calendar on the Canvas page</font></b>

### Assigned for Spring 2024 Semester: 4/16/2024
### Due for Spring 2024 Semester: 4/25/2024 at 11:59pm on Canvas

### Points: 2,000 points

> <b>Note:</b> Review the <a href = "https://github.com/mmorri22/cse30321/blob/main/cse30321-syllabus.md">Course Syllabus</a> for policies regarding late submissions for Homework Assignments. For this group assignment, you may use only 48 extension tokens (there are students who have used 24 hours, so this will ensure every group has an equal opportunity to submit.)

### Sections: Here is the breakdown of the Assignment
<ol>
    <li><b>Group Assignment</b> - Brief coverage of the rules surrounding group assignments</li>
    <li><b>Background</b>: Reviewing the appropriate levels of abstraction needed to implement the SIMD, Threading, and Python Multiprocessing portions of the assignment.</li>
    <li><b>Set Up</b> - Downloading and setting up the code for this Homework assignment.</li>
    <li><b>Single Instruction, Multiple Data</b> - Loop Unrolling in C.</li>
    <li><b>Thread Level Parallelism</b> - Multithreaded programming using OpenMP in C.</li>
    <li><b>Multiprocessing</b> - Performing large matrix multiplication in Python.</li>
    <li><b>Detailed Submission Information</b> - What files need to be included in your submission, and where to submit.</li>
</ol>

### Philosophy of this assignment
    
Every time I teach Computer Architecture, I ask students about their preferred programming languages. And, invariably, they indicate they greatly prefer Python to C or C++. At this point, you now know what is going on "under the hood." But odds are you still prefer Python, and that's OK. (For example, I like Python much more than Java. It's cleaner, more robust, and quite elegant. For all the complaining professors do about the "old days", Python's emphasis on tabbing means that students write much cleaner C code - better indentation and whitespace, more intuitive variable names, better algorithmic approaches - than when I first started teaching. Or, frankly, when I was a student.)
    
The unfortunate reality is that we - the older folks - push students to specialize much faster than when I was a student. Worse, we are pushing them to specialize in the same things, which means that not only are students all competing for the same jobs, but we are also not innovating in other areas. The urgency of the CHIPS Act is a symptom of a much larger problem. And this is a disservice to students.
    
So you should not just strive to demonstrate the skills requested in this assignment. Your bigger takeaway should be an appreciation for how a computer is put together from the ground up. At each level, there are opportunities for innovation and employment (there are opportunities for cache and cash!).

> For your reference, here is some advice from Bjarne Stroustrup, the inventor of C++, on avoiding the temptation to "overspecialize." Click on the image below to hear his advice.
>[![](http://img.youtube.com/vi/-QxI-RP6-HM/mqdefault.jpg)](https://www.youtube.com/watch?v=-QxI-RP6-HM)

## Part 1 - Group Assignment Overview

This programming assignment may be performed in groups of <b>no fewer than 3</b> and <b>no more than 4</b> students.

Your group will submit a <b>PDF report</b> as well as code files in a .zip file through Canvas Speedgrader.

<b>Only one student</b> from the group shall submit an assignment. In the accompanying report, which we will detail later in this assignment write up, include the name of each member on the cover page, which is how the other students will get credit for the assignment.

Each student member is <b>required to perform an equal share of the work</b>. The instructor reserves the right to change a grade if it is determined that a student didn't do their work or coerced a classmate to do the work for them.

## Part 2 - Set Up

In order to use the <b>Intel x86_64 architecture</b> for the Intel Streaming SIMD Extensions (SSE) and for consistency across benchmarking, you must use the <code>student06.cse.nd.edu</code> machine for this assignment.

### Downloading and Setting up your code

Once logged into <code>student06.cse.nd.edu</code>, set up a folder that separates your code from your other files on your ND machine.

    mkdir cse30321_hw09
    cd cse30321_hw09
    wget https://raw.githubusercontent.com/mmorri22/cse30321/main/homeworks/homework09/setup.sh
    chmod u+x setup.sh
    ./setup.sh

The script will download several files for each part of the assignment. Run the <code>ls</code> command and you will see three folders.

    -bash-4.2> ls
    part3 part4  part5 

> <b>Note:</b> The folders are labeled <code>part3</code>, <code>part4</code> and <code>part5</code> to correspond with the Parts in the Homework description below. <b>There are no parts 1-3 that you need to code to complete the assignment</b>.

To verify all files have been downloaded, perform the following command: <code>ls *</code>. This is what you should see:

    -bash-4.2> ls *
    part3:
    python.txt simd.txt
    
    part4:
    Makefile  part4.c  part4.h  test_part4.c

    part5:

## Part 3 - Background

In this course, you have learned about several levels of abstraction in a computer, including <b>assembly language</b>, <b>architecture design</b>, <b>pipelining</b>, <b>cache memory</b>, <b>virtual memory</b>, <b>multiprocessing</b> and <b>multithreading</b>.

In this assignment, you will have the opportunity to demonstrate your proficiency at the level of abstraction that most of you will pursue in your careers: <b>programming</b>. You will implement C and Python solutions to classic programming problems, with a twist: you will improve upon those solutions with <b>loop reordering</b>, <b>loop unrolling</b>, <b>SIMD</b>, <b>MIMD</b>, and <b>Python multiprocessing</b>.

### Step 3.1 - Familiarize Yourself with the SIMD Functions

> Note: Include your solutions to the problems from Step 3.1 in the <code>part3/simd.txt</code> file that you downloaded into the <code>student06</code> machine.

Given the large number of available SIMD intrinsics, we want you to learn how to find the ones that you’ll need for this assignment.

Go to the <a href = "https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html">Intel Intrinsics Guide</a>. 
Open this page and once there, click the checkboxes for <b>everything</b> that begins with “SSE”.

> For <b>Problem 1</b> in your PDF, include the three intrinsics you find for the operations listed below. For each one, include the following: <b>Synopsis</b>, <b>Description</b>, <b>Operation</b>, and <b>Latency and Throughput</b>.

Look through the possible instructions and syntax structures, then try to find the 128-bit intrinsics for the following operations:

<ul>
    <li>Four floating point <b>div</b>isions in single precision (i.e. float)</li>
    <li>Sixteen max operations over signed 8-bit integers (i.e. char)</li>
    <li>Arithmetic shift right of eight signed 16-bit integers (i.e. short)</li>
</ul>

> Hint: Things that say “epi” or “pi” deal with integers, and those that say “ps” or “pd” deal with <b>p</b>acked <b>s</b>ingle precision and <b>p</b>acked <b>d</b>ouble precision floats.

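To make the naming convention concrete, here is a minimal sketch that uses one “ps” intrinsic and one “epi32” intrinsic. The function names <code>add4_floats</code> and <code>add4_ints</code> are hypothetical and chosen only for illustration, and the sketch assumes an x86_64 compiler with SSE2 available (as on <code>student06</code>). It deliberately uses addition rather than any of the three operations you are asked to find.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE float intrinsics */

/* "ps" = packed single precision: adds four floats at once. */
void add4_floats(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);              /* load four floats (unaligned OK) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));   /* out[i] = a[i] + b[i] */
}

/* "epi32" = packed 32-bit integers: adds four ints at once. */
void add4_ints(const int a[4], const int b[4], int out[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```
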
You can visualize how the vectors and the different functions work together by inputting your code into the code environment at this <a href = "https://piotte13.github.io/SIMD-Visualiser/#/">link</a>! Another interesting tool that might help you understand the behavior of SIMD instructions is the <a href = "https://godbolt.org/z/J7HXBk">Compiler Explorer</a> project. It can also provide a lot of insights when you need to optimize any code in the future.

> General advice on working with SIMD instructions:
>
> Beware of memory alignment. For example, <code>__m256d _mm256_load_pd (double const * mem_addr)</code> would not work with unaligned data – you would need <code>__m256d _mm256_loadu_pd</code>. Meanwhile, it is almost always desirable to keep your data aligned (this can be achieved using special memory allocation APIs). In fact, when the data is aligned, an unaligned load/store will give identical performance to an aligned one. Aligned loads can be folded into other operations as a memory operand, which reduces code size and improves throughput slightly. Modern CPUs have very good support for unaligned loads, but there’s still a significant performance hit when a load crosses a cache-line boundary.
>
> Recall the various CPU pipeline hazards you learned about earlier this semester. Data hazards can drastically hurt performance. If you are not getting the desired performance, check the data dependencies between adjacent SIMD operations.

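The alignment advice above can be illustrated with a small sketch. It uses the 128-bit SSE loads (<code>_mm_load_ps</code> and <code>_mm_loadu_ps</code>, which need 16-byte alignment) rather than the 256-bit AVX loads mentioned in the note, and it assumes a C11 compiler for <code>_Alignas</code>. The helper names <code>is_aligned16</code>, <code>load4</code>, and <code>sum4_at</code> are hypothetical.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps, _mm_storeu_ps */

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Loads four floats starting at p, using the aligned load only when it is safe. */
__m128 load4(const float *p) {
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}

/* Sums the four floats starting at buf[offset] in a 16-byte-aligned demo buffer.
 * Valid offsets are 0 through 4. Only offsets 0 and 4 are 16-byte aligned. */
float sum4_at(int offset) {
    static _Alignas(16) const float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float lanes[4];
    _mm_storeu_ps(lanes, load4(buf + offset));
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Calling <code>sum4_at(0)</code> takes the aligned path, while <code>sum4_at(1)</code> starts four bytes past the boundary and falls back to the unaligned load. In real code you would normally pick one load based on how the buffer was allocated instead of branching at run time.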

### Step 3.2 - Familiarize Yourself with Python Multiprocessing Functions

> Note: Include your solutions to the problems in section 3.2 in the <code>part3/python.txt</code> file that you downloaded into the <code>student06</code> machine.

### For your reference

For your reference, here are some links to code for your review that will help you navigate this assignment.
<ul>
    <li><code><a href = "">File Name</a></code> - Description</li>
    <li><code><a href = "">File Name</a></code> - Description</li>
    <li><code><a href = "">File Name</a></code> - Description</li>
    <li><code><a href = "">File Name</a></code> - Description</li>
    <li><code><a href = "">File Name</a></code> - Description</li>
    <li><code><a href = "">File Name</a></code> - Description</li>
</ul>

## Part 4 - SIMD in C

### Objectives: In this part, you will:
<ol>
    <li>learn about and use various SIMD functions to perform <b>data level parallelism</b></li>
    <li>write code to SIMD-ize certain functions</li>
    <li>demonstrate proficiency about <b>loop-unrolling</b> and understand as to why it works</li>
</ol>

### Step 4.1 - Performing the first make command

<b>Step 1</b> - To get to the code for this section, perform the following commands from the <code>cse30321_hw09</code> folder.

    -bash-4.2$ cd part4

Run the following command: <code>make part4</code>

If you successfully downloaded and ran the code, you will see the following warnings for unused variables (these are expected, since we have not used the variables yet):

    -bash-4.2> make part4
    gcc -std=c11 -Wall -c test_part4.c
    gcc -std=c11 -Wall -c part4.c
    part4.c: In function ‘simd_sum’:
    part4.c:50:10: warning: unused variable ‘_127’ [-Wunused-variable]
      __m128i _127 = _mm_set1_epi32(127);  /* Empty vector with 127 numbers */
              ^
    part4.c: In function ‘simd_unrolled_sum’:
    part4.c:66:10: warning: unused variable ‘_127’ [-Wunused-variable]
      __m128i _127 = _mm_set1_epi32(127);
              ^
    gcc -std=c11 -Wall -o part4 test_part4.o part4.o

To run the program, enter the name of the executable: <code>./part4</code>. Three tests will fail at this point. That is fine for now; your job will be to fix them.

### Step 4.2 - Writing SIMD Code

You will vectorize/SIMDize the code in <code>part4.c</code> to speed up the naive implementation of the <code>sum()</code> function.

In this step, you need to vectorize the inner loop with SIMD! You will also need to use the following intrinsics:

<ul>
    <li><code>__m128i _mm_setzero_si128()</code> - returns a 128-bit zero vector. This is the equivalent of using <b>calloc</b> to implement the code.</li>
    <li><code>__m128i _mm_loadu_si128(__m128i *p)</code> - returns 128-bit vector stored at pointer p</li>
    <li><code>__m128i _mm_add_epi32(__m128i a, __m128i b)</code> - returns vector (a_0 + b_0, a_1 + b_1, a_2 + b_2, a_3 + b_3)</li>
    <li><code>void _mm_storeu_si128(__m128i *p, __m128i a)</code> - stores 128-bit vector a into pointer p</li>
    <li><code>__m128i _mm_cmpgt_epi32(__m128i a, __m128i b)</code> - returns the vector (a_i > b_i ? <code>0xffffffff : 0x0</code> for i from 0 to 3).</li>
    <ul>
        <li>returns a 32-bit all-1s mask if a_i > b_i</li> 
        <li>returns a 32-bit all-0s mask otherwise</li>
    </ul>
    <li><code>__m128i _mm_and_si128(__m128i a, __m128i b)</code> - returns vector (a_0 & b_0, a_1 & b_1, a_2 & b_2, a_3 & b_3), where & represents the bit-wise and operator</li>
</ul>

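As a minimal sketch of how the compare and bit-wise and intrinsics above combine into a mask, the hypothetical function <code>keep_greater_than4</code> below works on a single group of four ints. It is not the assignment's <code>sum()</code> logic, only an illustration of the masking idiom, and it assumes SSE2 is available.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Copies in[i] to out[i] when in[i] > threshold, and writes 0 otherwise. */
void keep_greater_than4(const int in[4], int threshold, int out[4]) {
    __m128i v     = _mm_loadu_si128((const __m128i *)in);
    __m128i limit = _mm_set1_epi32(threshold);   /* four copies of threshold      */
    __m128i mask  = _mm_cmpgt_epi32(v, limit);   /* all-1s in lanes where v > limit */
    __m128i kept  = _mm_and_si128(v, mask);      /* other lanes become 0          */
    _mm_storeu_si128((__m128i *)out, kept);
}
```
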
Start with the code in <code>sum()</code> and use SSE intrinsics to implement the <code>simd_sum()</code> function.

> How do we do this?

Recall that the SSE intrinsics are basically functions which perform operations on multiple pieces of data in a vector in parallel. This turns out to be faster than running through a for loop and applying the operation once for each element in the vector.

In our sum function, we’ve got a basic structure of iterating through an array. On every iteration, we add an array element to a running sum. To vectorize, you should add a few array elements to a sum vector in parallel and then consolidate the individual values of the sum vector into our desired sum at the end.

> Hint 1: <code>__m128i</code> is the data type for Intel’s special 128-bit vector. We’ll be using them to encode 4 (four) 32-bit ints.

> Hint 2: We’ve left you a vector called <code>_127</code> which contains four copies of the number 127. You should compare the array elements against it when you implement the condition within the sum loop.

> Hint 3: DON’T use the store function (<code>_mm_storeu_si128</code>) until <b>after</b> completing the inner loop! It turns out that storing is very costly and performing a store in every iteration will actually cause your code to slow down. 

> Hint 4: It’s bad practice to index into a <code>__m128i</code> vector as if it were an array. You should store the vector into an array first with the <code>storeu</code> function, and then access the integers elementwise by indexing into the array.

> Hint 5: READ the function declarations in the above list carefully! You’ll notice that loadu and storeu take <code>__m128i*</code> type arguments. You can just cast a pointer into an int array to a <code>__m128i</code> pointer. Alternatively, you could skip the typecast at the cost of a bunch of compiler warnings.

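The hints above can be seen together in a simplified sketch. The hypothetical function <code>simd_total</code> adds up every element of an int array with no condition, and it assumes the total fits in a 32-bit int, so it is not a drop-in answer for this step. It does show the overall shape: accumulate in a vector, store once after the loop, then handle the leftover elements.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: adds up n ints, four lanes at a time.
 * Assumes the total (and each partial sum) fits in a 32-bit int. */
int simd_total(const int *a, int n) {
    __m128i acc = _mm_setzero_si128();           /* four running partial sums */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        acc = _mm_add_epi32(acc, v);             /* no store inside the loop */
    }

    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);     /* store once, after the loop */
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; i++) {                         /* tail: leftover 0 to 3 elements */
        total += a[i];
    }
    return total;
}
```

For a ten-element array, the vector loop covers the first eight elements and the scalar tail loop picks up the last two.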
### Step 4.3 - Loop Unrolling

Within <code>part4.c</code>, copy your <code>simd_sum()</code> code into <code>simd_unrolled_sum()</code> and unroll it 4 (four) times. Don’t forget about your tail case!

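If unrolling with a tail case is unfamiliar, here is a minimal scalar sketch on a different problem. The hypothetical function <code>dot_unrolled4</code> computes a dot product with the loop body unrolled four times. The second loop is the tail case, which handles the elements left over when the length is not a multiple of four.

```c
/* Hypothetical example: dot product with the loop body unrolled four times. */
long long dot_unrolled4(const int *a, const int *b, int n) {
    long long sum = 0;
    int i = 0;

    /* Main loop: four multiply-adds per iteration, so fewer branch checks. */
    for (; i + 4 <= n; i += 4) {
        sum += (long long)a[i]     * b[i];
        sum += (long long)a[i + 1] * b[i + 1];
        sum += (long long)a[i + 2] * b[i + 2];
        sum += (long long)a[i + 3] * b[i + 3];
    }

    /* Tail case: the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        sum += (long long)a[i] * b[i];
    }
    return sum;
}
```

The same pattern applies when each statement in the unrolled body is a SIMD operation: the main loop then advances by sixteen ints per iteration, and the tail must cover whatever remains.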
### Example Successful Run

This is an example run from the Professor's solution run on <code>student06.cse.nd.edu</code>:

    -bash-4.2> make clean
    rm -rf *.o *.swp part4
    -bash-4.2> make part4
    gcc -std=c11 -Wall -c test_part4.c
    gcc -std=c11 -Wall -c part4.c
    gcc -std=c11 -Wall -o part4 test_part4.o part4.o
    -bash-4.2> ./part4
    Generate a randomized array...... array generated.
    -------------------------
    Calculating randomized sum - No modifications.
    Sum: 25498320896
    Average time: 2.870
    -------------------------
    Starting randomized unrolled sum.
    Unrolled Sum: 25498320896
    Average time: 2.050s
    -------------------------
    Starting randomized SIMD sum.
    Sum: 25498320896
    Average time: 1.680s
    Test Succeeded! SIMD sum provides a speedup of 1.708333.
    -------------------------
    Starting randomized SIMD unrolled sum.
    Sum: 25498320896
    Average time: 1.490s
    Test Suceeded! SIMD unrolled function provided speedup of 1.926174.
    -------------------------
    All tests Passed! Correct values were produced, and speedups were achieved!


## Part 5 - Multiprocessing in Python

### Objectives: In this part, you will:
<ol>
    <li>XXX</li>
    <li>XXX</li>
    <li>XXX</li>
    <li>XXX</li>
</ol>

## Part 6 - Detailed Submission Information

1 - On the ND Machine, go to the folder that contains your <code>cse30321_hw09</code> folder and run the following command:

    zip -r cse30321_hw09.zip cse30321_hw09

Use your favorite file transfer client (such as FileZilla) to transfer the ZIP file from the ND Machines to your laptop.

1 - You should include all the files in a .zip file. These will include:
<ul>
    <li>A <b>PDF with your report</b>. It <i>must</i> be a PDF since text editors often have issues rendering images if done on a different OS. In the past, TAs have marked points off because the image simply didn't render on their machine, so the work wasn't shown.</li>
    <li>Five <b>separate</b> assembly files to demonstrate the run at that part</li>
    <ul>
        <li><code>dllist_step1.S</code></li>
        <li><code>dllist_step2.S</code></li>
        <li><code>dllist_step3.S</code></li>
        <li><code>dllist_step4.S</code></li>
        <li><code>dllist_final.S</code> - Students have submitted the final with copies for each test for the final. This is acceptable, although the TA will be asked to make a change and run the code, so it is not required.</li>
    </ul>
</ul>

2 - <b>Only one group member</b> needs to submit the ZIP file. (In fact, do not have multiple members submit. This inevitably causes confusion among the TAs.) You must include all group members (and Notre Dame emails) on the first page of the PDF. This is how the grading TA will input the remaining marks.

3 - Upload to the Canvas Speedgrader listed under Assignments at the following link: https://canvas.nd.edu/courses/82217/assignments/257852