# Maximal Static Expansion for efficient parallelization on GPU

## Background

### Polly
My GSoC project is part of Polly, a loop and data-locality optimizer for LLVM. Polly performs its optimizations on a mathematical representation called the polyhedral model, which models the statement instances of the program and the memory accesses they perform. Once the program is modeled, transformations (tiling, loop fusion, loop unrolling, ...) can be applied to the model to improve data locality and/or parallelism. The aim of my project was to implement a transformation called Maximal Static Expansion (MSE).

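Some Polly terms are used throughout this report:
* **SCoP** (Static Control Part): a region of the program, typically a loop nest, in which loop bounds, conditions and array subscripts are affine functions of the surrounding loop counters and of parameters. It is the unit that Polly models and optimizes.
* **ScopStmt** (statement): a piece of a SCoP, typically a basic block, executed once for each point of its iteration domain.
* **MemoryAccess**: a read or a write performed by a statement, described by an access relation that maps statement instances to the accessed array elements.
* **ScopArrayInfo** (SAI): the description of an array accessed in the SCoP (name, element type, dimension sizes, base pointer). Scalars are also described by ScopArrayInfo objects, as zero-dimensional arrays.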

### ISL
ISL is the **I**nteger **S**et **L**ibrary, which Polly uses for all the mathematical computations during modeling and transformation. A basic understanding of what ISL is and how to use it is necessary to follow the implementation part of this report.

#### Data structures
ISL provides several types of data structures.
* **Set**:
A set in ISL is represented as follows:

$$\{ S[i_0, \ldots, i_n] : c_0, \ldots, c_m\}$$

where $i_k$ is a variable (a dimension of the tuple $S$) and $c_k$ is a constraint.

For example, the set of all instances of the statement S in the following source code is $\{ S[i, j] : i=j, 0 \le i < N, 0 \le j < N\}$.


```c
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    if (i == j)
S:    A[i][j+1] = i*j;
```
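To make the set notation concrete, here is a minimal sketch that enumerates, by brute force and for a fixed N, the instances of S in the code above, i.e. the points with i = j and 0 ≤ i, j < N. It only illustrates what the set means: ISL manipulates the constraints symbolically and never enumerates points. The function name `enumerateS` is hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Brute-force enumeration of the integer points of the set
// { S[i, j] : i = j, 0 <= i < N, 0 <= j < N } for a concrete N.
std::vector<std::pair<int, int>> enumerateS(int N) {
  std::vector<std::pair<int, int>> Points;
  for (int i = 0; i < N; i++)      // constraint 0 <= i < N
    for (int j = 0; j < N; j++)    // constraint 0 <= j < N
      if (i == j)                  // constraint i = j
        Points.emplace_back(i, j); // S[i, j] belongs to the set
  return Points;
}
```

For N = 4 it returns S[0, 0], S[1, 1], S[2, 2] and S[3, 3].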

* **Union Set**:
A union set is simply a union of sets, possibly living in different spaces (for example, instances of different statements).

* **Map**: A map in ISL represents a relation. It is represented as follows:

$$\{ S[i_0, \ldots, i_n] \rightarrow T[o_0, \ldots, o_m] : c_0, \ldots, c_p\}$$

where $i_k$ is an input variable, $o_k$ is an output variable and $c_k$ is a constraint.

For example, the memory access inside the statement S of the preceding example is modeled by the map:

$$\{ S[i, j] \rightarrow A[i, j+1] : i=j, 0 \le i < N, 0 \le j < N \}$$

This map means that each instance S[i, j] of the statement accesses the element A[i][j+1].

**BE CAREFUL**: this map contains implicit constraints. An unsimplified version of it is:

$$\{ S[i_0, i_1] \rightarrow A[o_0, o_1] : i_0=i_1, i_0=o_0, i_1=o_1-1, 0 \le i_0 < N, 0 \le i_1 < N \}$$

The input part of a map is called its domain, whereas the output part is called its range. For example, the domain and the range of the previous map are:

$$ \mathrm{domain}(\{ S[i, j] \rightarrow A[i, j+1] : i=j, 0 \le i < N, 0 \le j < N \}) = \{ S[i, j] : i=j, 0 \le i < N, 0 \le j < N \} $$
$$ \mathrm{range}(\{ S[i, j] \rightarrow A[i, j+1] : i=j, 0 \le i < N, 0 \le j < N \}) = \{ A[o_0, o_1] : o_1 = o_0 + 1, 0 \le o_0 < N \} $$
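Under the same brute-force assumption (a concrete N, no ISL), the sketch below computes the domain and the range of this access map. The `AccessRelation` type is hypothetical and only models this one map.

```cpp
#include <cassert>
#include <set>
#include <utility>

using Point = std::pair<int, int>;

// Brute-force view of { S[i, j] -> A[i, j + 1] : i = j, 0 <= i < N, 0 <= j < N }
// for a concrete N: domain() collects the S instances, range() the A elements.
struct AccessRelation {
  int N;

  bool inDomain(int i, int j) const {
    return i == j && 0 <= i && i < N && 0 <= j && j < N;
  }

  std::set<Point> domain() const {
    std::set<Point> D;
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        if (inDomain(i, j))
          D.insert({i, j});
    return D;
  }

  std::set<Point> range() const {
    std::set<Point> R;
    for (const Point &P : domain())
      R.insert({P.first, P.second + 1}); // S[i, j] -> A[i, j + 1]
    return R;
  }
};
```

For N = 3, the range is {A[0, 1], A[1, 2], A[2, 3]}: every point satisfies $o_1 = o_0 + 1$, which is exactly the kind of implicit constraint mentioned above.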

* **Union Map**: A union map is simply a union of maps.

* **Nested Map**: A specific kind of map, called a nested map, has the following structure:

$$\{ [ S[i_0, ..., i_n] \rightarrow T[j_0, ..., j_n] ] \rightarrow [ U[k_0, ..., k_n] \rightarrow V[l_0, ..., l_n] ] : c_0, ... , c_n\}$$

With this kind of data structure, we can represent data dependences. Let's take an example.

```c
for (int i = 0; i < N; i++) {
   for (int j = 0; j < N; j++) {
S:   B[i][j] = i*j;
   }
T: A[i] = B[0][i];
}
```

There is a RAW (read-after-write) dependence on the array B: the instance T[i] reads the element B[0][i], which was written earlier by the instance S[0, i] (first iteration of the outer loop, j = i). The dependence map, written here from the read to the write that produced its value, looks like:

$$\{ [ T[i] \rightarrow B[0, i] ] \rightarrow [ S[0, i] \rightarrow B[0, i] ] : 0 \le i < N \}$$
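One way to see where such a dependence comes from is to execute the loop nest and record, for each element of B, its last writer; each read is then paired with the instance that produced its value. The sketch below does this by brute force for a concrete N. This is not how Polly or ISL compute dependences (they work symbolically); `rawDependences` and the `Instance` encoding, where the unused second index of T is set to 0, are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <tuple>
#include <utility>

// A statement instance: statement name and up to two iteration indices.
using Instance = std::tuple<char, int, int>;

// Brute-force RAW dependences of
//   for (i) { for (j) S: B[i][j] = i*j;   T: A[i] = B[0][i]; }
// The loops are replayed in program order while remembering the last writer
// of every element of B; each read of B is paired with that writer.
// Pairs are oriented read -> write, as in the text.
std::set<std::pair<Instance, Instance>> rawDependences(int N) {
  std::map<std::pair<int, int>, Instance> LastWriter; // B element -> writer
  std::set<std::pair<Instance, Instance>> Deps;
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++)
      LastWriter[{i, j}] = Instance{'S', i, j}; // S[i, j] writes B[i][j]
    auto It = LastWriter.find({0, i});           // T[i] reads B[0][i]
    if (It != LastWriter.end())
      Deps.emplace(Instance{'T', i, 0}, It->second);
  }
  return Deps;
}
```

For N = 4 it yields exactly the pairs (T[i], S[0, i]) for 0 ≤ i < 4: each T[i] depends on an instance of the first outer iteration.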

### Maximal static expansion
Data dependences in a program can severely limit automatic parallelization. Modern compilers use techniques to reduce the number of such dependences, and Maximal Static Expansion is one of them. MSE is a transformation that expands the memory accessed through an array or a scalar: the goal is to disambiguate memory accesses by assigning different memory locations to non-conflicting writes. The method is described in a paper by Denis Barthou, Albert Cohen and Jean-François Collard.[^f1]
Let's take an example (from the article) to understand the principle:

```c
int tmp;
for (int i = 0; i < N; i++) {
    tmp = i;
    for (int j = 0; j < N; j++) {
        tmp = tmp + i + j;
    }
    A[i] = tmp;
}
```

The data dependences induced by tmp prevent parallelizing either loop: iteration j of the inner loop needs the value computed by the previous iteration, and the i-loop cannot be parallelized because every one of its iterations writes and reads the same memory location tmp.

If we expand the accesses to tmp along the outermost loop, the i-loop can be parallelized:

```c
int tmp_exp[N];
for (int i = 0; i < N; i++) {
    tmp_exp[i] = i;
    for (int j = 0; j < N; j++) {
        tmp_exp[i] = tmp_exp[i] + i + j;
    }
    A[i] = tmp_exp[i];
}
```

The accesses to tmp now go to a different location in each iteration of the i-loop, so the iterations can be executed on different computation units (GPU, CPU cores, ...).
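The following sketch illustrates why the expansion matters for parallel execution. It runs all iterations of the i-loop in lockstep (step k of every iteration before step k + 1 of any iteration), a crude stand-in for concurrent execution. With the shared scalar tmp the results are wrong; with tmp_exp they match the sequential loop. The function names and the `long` element type are assumptions of the sketch.

```cpp
#include <cassert>
#include <vector>

// Reference: the original sequential loop nest.
std::vector<long> sequential(int N) {
  std::vector<long> A(N);
  long tmp = 0;
  for (int i = 0; i < N; i++) {
    tmp = i;
    for (int j = 0; j < N; j++)
      tmp = tmp + i + j;
    A[i] = tmp;
  }
  return A;
}

// Runs all i iterations in lockstep, mimicking what a fully parallel
// execution may do.
// Expanded == false: every iteration uses the single scalar tmp.
// Expanded == true: iteration i uses its own cell tmp_exp[i].
std::vector<long> lockstep(int N, bool Expanded) {
  std::vector<long> A(N), tmp_exp(N);
  long tmp = 0;
  auto cell = [&](int i) -> long & { return Expanded ? tmp_exp[i] : tmp; };
  for (int i = 0; i < N; i++)
    cell(i) = i;
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      cell(i) = cell(i) + i + j;
  for (int i = 0; i < N; i++)
    A[i] = cell(i);
  return A;
}
```

With the shared scalar, all iterations end up storing the same value into A, whereas the expanded version reproduces the sequential result for any interleaving of the i iterations.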

### Static single assignment
Due to lack of time, I was not able to implement **maximal** static expansion, only fully-indexed expansion. The principle of fully-indexed expansion is that each write goes to a different memory location.
Let's see the idea on an example:

```c
int tmp;
for (int i = 0; i < N; i++) {
    tmp = i;
    for (int j = 0; j < N; j++) {
        B[j] = tmp + 3;
    }
    A[i] = B[i];
}
```

For the sake of simplicity, only the arrays are expanded in this example (tmp is left as is). The fully expanded version is:

```c
int tmp;
for (int i = 0; i < N; i++) {
    tmp = i;
    for (int j = 0; j < N; j++) {
        B_exp[i][j] = tmp + 3;
    }
    A_exp[i] = B_exp[i][i];
}
```
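A quick way to check the single-assignment property is to count how many times each memory cell is written. The sketch below (the names `WriteCounts` and `countWrites` are hypothetical) replays the write pattern of B in the original code and of B_exp in the expanded code of the example above.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Number of writes per memory cell in the example above.
struct WriteCounts {
  std::map<int, int> Original;                 // j -> writes to B[j]
  std::map<std::pair<int, int>, int> Expanded; // (i, j) -> writes to B_exp[i][j]
};

// Replays the write pattern of both versions of the loop nest for a concrete N.
WriteCounts countWrites(int N) {
  WriteCounts C;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      C.Original[j]++;      // B[j] = tmp + 3;
      C.Expanded[{i, j}]++; // B_exp[i][j] = tmp + 3;
    }
  return C;
}
```

For N = 5, every B[j] is written 5 times, whereas each of the 25 cells of B_exp is written exactly once.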

The details of fully-indexed expansion will be discussed in the following sections.


## My work
My project is part of Polly. I am a French student, but during GSoC I was a student at the University of Passau, Germany. The LooPo team welcomed me, especially Andreas Simbürger, one of my GSoC mentors and my master's thesis supervisor. My other GSoC mentor is Michael Kruse, one of the main contributors to Polly, currently working in France.
I'd like to thank all the people who helped and guided me, especially Andreas and Michael.

### JSON bug fix
As a first step in open-source development, and to get familiar with the Polly/LLVM development process, I fixed an open bug in Polly. Polly can import data from a JSON file (called a jscop file in Polly): new arrays, new memory accesses, a new schedule or a new context. The previous implementation of JSONImporter did not check whether the data in the jscop file were plausible and consistent before importing them, which could lead to failures in the remaining part of the Polly pipeline. My work was to implement plausibility and consistency checks. Details, diff and discussions can be found here: https://reviews.llvm.org/D32739. This patch has been merged into the current version of Polly.

### Allocate array on heap
During transformations, Polly can create new arrays when this helps optimizing the program. Before GSoC, such arrays could only be allocated on the stack. This is sufficient for small arrays, but expansion can create very large ones. For example, suppose we want to fully expand this simple code:

```c
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      for (int l = 0; l < N; l++)
        A[l] = 3;
```

The expansion leads to the following code:

```c
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      for (int l = 0; l < N; l++)
        A_exp[i][j][k][l] = 3;
```

Depending on the value of N, A_exp can have a huge number of elements: for N = 100, it has $100 \times 100 \times 100 \times 100 = 10^8$ elements! Hence the need to allocate arrays on the heap.

The array allocation is implemented in the IslNodeBuilder. Here is the part of the code that performs the allocation:

```cpp
void IslNodeBuilder::allocateNewArrays(BBPair StartExitBlocks) {
  for (auto &SAI : S.arrays()) {
    if (SAI->getBasePtr())
      continue;

    assert(SAI->getNumberOfDimensions() > 0 && SAI->getDimensionSize(0) &&
           "The size of the outermost dimension is used to declare newly "
           "created arrays that require memory allocation.");

    Type *NewArrayType = nullptr;

    // Get the size of the array = size(dim_1)*...*size(dim_n)
    uint64_t ArraySizeInt = 1;
    for (int i = SAI->getNumberOfDimensions() - 1; i >= 0; i--) {
      auto *DimSize = SAI->getDimensionSize(i);
      unsigned UnsignedDimSize = static_cast<const SCEVConstant *>(DimSize)
                                     ->getAPInt()
                                     .getLimitedValue();

      if (!NewArrayType)
        NewArrayType = SAI->getElementType();

      NewArrayType = ArrayType::get(NewArrayType, UnsignedDimSize);
      ArraySizeInt *= UnsignedDimSize;
    }

    if (SAI->isOnHeap()) {
      LLVMContext &Ctx = NewArrayType->getContext();

      // Get the IntPtrTy from the Datalayout
      auto IntPtrTy = DL.getIntPtrType(Ctx);

      // Get the size of the element type in bytes
      unsigned Size = SAI->getElemSizeInBytes();

      // Insert the malloc call at polly.start
      auto InstIt = std::get<0>(StartExitBlocks)->getTerminator();
      auto *CreatedArray = CallInst::CreateMalloc(
          &*InstIt, IntPtrTy, SAI->getElementType(),
          ConstantInt::get(Type::getInt64Ty(Ctx), Size),
          ConstantInt::get(Type::getInt64Ty(Ctx), ArraySizeInt), nullptr,
          SAI->getName());

      SAI->setBasePtr(CreatedArray);

      // Insert the free call at polly.exiting
      CallInst::CreateFree(CreatedArray,
                           std::get<1>(StartExitBlocks)->getTerminator());

    } else {
      auto InstIt = Builder.GetInsertBlock()
                        ->getParent()
                        ->getEntryBlock()
                        .getTerminator();

      auto *CreatedArray = new AllocaInst(NewArrayType, DL.getAllocaAddrSpace(),
                                          SAI->getName(), &*InstIt);
      CreatedArray->setAlignment(PollyTargetFirstLevelCacheLineSize);
      SAI->setBasePtr(CreatedArray);
    }
  }
}
```

My contribution to this method is the size computation and the heap allocation part; the remaining code was already in place.

Let's explain the heap allocation step by step.

First of all, to allocate an array, we need to know how many elements it has. To compute it, we iterate over the dimensions of the ScopArrayInfo and multiply their sizes. This is done by this code:

```cpp
    // Get the size of the array = size(dim_1)*...*size(dim_n)
    uint64_t ArraySizeInt = 1;
    for (int i = SAI->getNumberOfDimensions() - 1; i >= 0; i--) {
      auto *DimSize = SAI->getDimensionSize(i);
      unsigned UnsignedDimSize = static_cast<const SCEVConstant *>(DimSize)
                                     ->getAPInt()
                                     .getLimitedValue();
      // ... (construction of NewArrayType omitted here)
      ArraySizeInt *= UnsignedDimSize;
    }
```
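The same computation can be sketched outside of LLVM. The helper below is hypothetical: it takes the dimension sizes as plain integers, multiplies them by the element size to get a byte count, and, unlike the code above, reports an overflow of the 64-bit product instead of silently wrapping.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Bytes needed by a dense array: size(dim_1) * ... * size(dim_n) * ElemSize.
// Returns std::nullopt if the product does not fit in 64 bits (an extra check
// added for this sketch).
std::optional<uint64_t> arraySizeInBytes(const std::vector<uint64_t> &DimSizes,
                                         uint64_t ElemSizeInBytes) {
  uint64_t Total = ElemSizeInBytes;
  for (uint64_t D : DimSizes) {
    if (D != 0 && Total > std::numeric_limits<uint64_t>::max() / D)
      return std::nullopt; // Total * D would overflow
    Total *= D;
  }
  return Total;
}
```

With 4-byte elements, the $10^8$-element A_exp from the previous example needs $4 \times 10^8$ bytes (about 381 MiB), far more than a typical default stack of a few megabytes.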

To actually allocate the array, we need to insert a malloc call at the beginning of the code generated by Polly, in the block called polly.start. Each malloc must be matched by a free, which is inserted at polly.end. These two BasicBlocks (polly.start and polly.end) are passed to allocateNewArrays as a BBPair. We chose polly.start and polly.end as insertion points after a lengthy discussion with the Polly community, because this guarantees that there is no use-after-free (for instance when the SCoP is inside a loop) and that all memory allocated with malloc is freed once it is no longer needed.

To get polly.start and polly.end, we modified executeScopConditionally so that it returns both the start block and the end block of the SCoP. In the previous version of Polly, executeScopConditionally returned only polly.start; code that needed polly.end assumed that there was no BasicBlock between polly.start and polly.end and simply took the successor of polly.start.
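At the source level, the placement of the two calls roughly corresponds to the sketch below, where the SCoP sits inside an outer loop that Polly does not optimize. It is a simplified, hypothetical C-level picture (function and variable names are made up, and the fallback to the original code is omitted): each execution of the SCoP allocates at polly.start and frees at polly.end, so no allocation leaks and the array is never used after being freed.

```cpp
#include <cassert>
#include <cstdlib>

// Simplified C-level view of a SCoP inside an outer, non-optimized loop.
long runScopInLoop(int Repeat, int N) {
  long Checksum = 0;
  for (int t = 0; t < Repeat; t++) {
    // polly.start: malloc inserted before the SCoP.
    int *A_exp = static_cast<int *>(std::malloc(sizeof(int) * N * N));
    if (!A_exp)
      return -1;
    // Optimized SCoP body: writes and reads of the expanded array.
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        A_exp[i * N + j] = i + j;
    for (int i = 0; i < N; i++)
      Checksum += A_exp[i * N + i];
    // polly.end: free inserted after the SCoP.
    std::free(A_exp);
  }
  return Checksum;
}
```

Allocating once per SCoP execution, rather than inside the SCoP body, keeps a single malloc/free pair per execution however often the surrounding loop runs.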

The malloc call is inserted by this code:

```cpp
      auto *CreatedArray = CallInst::CreateMalloc(
          &*InstIt, IntPtrTy, SAI->getElementType(),
          ConstantInt::get(Type::getInt64Ty(Ctx), Size),
          ConstantInt::get(Type::getInt64Ty(Ctx), ArraySizeInt), nullptr,
          SAI->getName());
```

The free call is inserted by this code:

```cpp
      CallInst::CreateFree(CreatedArray,
                           std::get<1>(StartExitBlocks)->getTerminator());
```

Details, diff and discussions can be found here: https://reviews.llvm.org/D33688. This patch has been merged into the current version of Polly.

### Array fully-indexed expansion
#### Principle
As a first step towards expansion, we chose to expand only arrays, because we thought it was an easy way to build the expansion infrastructure. We also chose not to implement **maximal** expansion in this step, to focus our efforts on the architecture of the expansion: we implemented fully-indexed expansion. The principle is that, for each array in the SCoP, we expand the write to the array according to the surrounding loop nest and then map each read to the right element of the newly created ScopArrayInfo. Let's see this on an example.

```c
int tmp;
for (int i = 0; i < N; i++) {
    tmp = i;
    for (int j = 0; j < N; j++) {
S:      B[j] = tmp + 3;
    }
T:  A[i] = B[i];
}
```

The write to B occurs inside the i and j loops, so the expanded version of B must be a two-dimensional array indexed by i and j. The write to A occurs inside the i-loop only, so it would not need expansion, but for the sake of simplicity we still create an expanded version of A. After write expansion, the code looks like this:

```c
int tmp;
for (int i = 0; i < N; i++) {
    tmp = i;
    for (int j = 0; j < N; j++) {
S:      B_exp[i][j] = tmp + 3;
    }
T:  A_exp[i] = B[i];
}
```

There is no read from A, so the expansion of A is done. There is a read from B in statement T, and for it we need the RAW dependences. In our case, statement T depends on statement S because the memory location read by T[i] is the element B[i], which was last written by S[i, i] during the j-loop of the same i iteration. The dependence looks like:

$$ \{ T[i] \rightarrow S[i, i] : 0 \le i < N \}$$

Now that we know that statement T must read its value from the instance S[i, i], we only have to know the name of the expanded version of B and modify the read. After read expansion, the source code looks like this:

```c
int tmp;
for (int i = 0; i < N; i++) {
    tmp = i;
    for (int j = 0; j < N; j++) {
S:      B_exp[i][j] = tmp + 3;
    }
T:  A_exp[i] = B_exp[i][i];
}
```
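The read expansion can be seen as a composition: the dependence maps T[i] to S[i, i], and the fully-indexed write expansion maps S[i, j] to B_exp[i][j], i.e. it is the identity on the iteration vector. Composing them gives T[i] → B_exp[i][i]. The sketch below (all names hypothetical) spells this out with plain functions and checks, on a concrete run, that the expanded program computes the same A as the original one.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Iter2 = std::pair<int, int>;

// RAW dependence of the example: the read T[i] gets its value from S[i, i].
Iter2 dependence(int i) { return {i, i}; }

// Fully-indexed write expansion: S[i, j] writes B_exp[i][j], so the expanded
// index is the iteration vector itself (an identity map).
Iter2 writeExpansion(Iter2 S) { return S; }

// New read access: T[i] -> S[i, i] -> B_exp[i][i].
Iter2 newReadAccess(int i) { return writeExpansion(dependence(i)); }

// Runs the original and the expanded loop nests and compares the values of A.
bool sameResult(int N) {
  std::vector<int> B(N), A(N), A_exp(N);
  std::vector<std::vector<int>> B_exp(N, std::vector<int>(N));
  for (int i = 0; i < N; i++) {
    int tmp = i;
    for (int j = 0; j < N; j++) {
      B[j] = tmp + 3;        // original write
      B_exp[i][j] = tmp + 3; // expanded write
    }
    A[i] = B[i]; // original read
    Iter2 R = newReadAccess(i);
    A_exp[i] = B_exp[R.first][R.second]; // expanded read
  }
  return A == A_exp;
}
```

Because the write expansion is the identity, the composition only changes which array is accessed; this is why the implementation can simply replace the output tuple id of the dependence map.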

More details can be found in these two articles[^f2][^f3].

#### Implementation
Let's now see how this principle has been implemented in Polly.

Static expansion has been implemented as a ScopPass, i.e. a pass that runs on every SCoP detected by Polly. It is guarded by an option: the expansion is enabled by passing **-polly-enable-mse** to clang or **-polly-mse** to opt.

Here is the 'main' of static expansion. The code is straightforward and explains the idea of expansion by itself.

```cpp
  // Get the RAW Dependences.
  auto &DI = getAnalysis<DependenceInfo>();
  auto &D = DI.getDependences(Dependences::AL_Reference);
  auto Dependences = isl::give(D.getDependences(Dependences::TYPE_RAW));

  for (auto SAI : S.arrays()) {
    SmallPtrSet<MemoryAccess *, 4> AllWrites;
    SmallPtrSet<MemoryAccess *, 4> AllReads;
    if (!isExpandable(SAI, AllWrites, AllReads, S, Dependences))
      continue;

    auto TheWrite = *(AllWrites.begin());
    ScopArrayInfo *ExpandedArray = expandWrite(S, TheWrite);

    for (MemoryAccess *MA : AllReads)
      expandRead(S, MA, Dependences, ExpandedArray);
  }
```

The first three lines get the dependences from the Polly infrastructure: we request the RAW dependences from the DependenceInfo analysis at the reference level (Dependences::AL_Reference). In a first iteration we used the statement level, but this caused bugs. The full discussion and the bug fix can be found here: https://reviews.llvm.org/D36791.

Then, we iterate over the ScopArrayInfo objects of the SCoP being processed. If a ScopArrayInfo is expandable, we expand its write following the principle described above, and then we expand its reads.

The method $isExpandable$ says whether the current ScopArrayInfo is expandable. It iterates over the ScopStmts of the SCoP, looking at their memory accesses to the ScopArrayInfo passed as a parameter, to find cases where we can **not** do the expansion. At this stage of the implementation, we bail out in these cases:
* When the ScopArrayInfo involves a scalar, because we are not yet able to expand scalars.
* When, inside a ScopStmt, a read comes after a write to the same location.

```c
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        B[i] = ... ;
        ... = B[i];
    }
}
```

Polly models the two instructions as one ScopStmt and detects two memory accesses inside this statement:

Read:
$$\{ S[i, j] \rightarrow B[i] : 0 \le i < N, 0 \le j < N \}$$
Write:
$$\{ S[i, j] \rightarrow B[i] : 0 \le i < N, 0 \le j < N \}$$

Polly then gives these two memory accesses to ISL. But ISL has no information on the order in which the accesses appear inside the statement, so it assumes that the read comes first, which is not the case in our example. So we need to bail out in such cases.

* When the ScopArrayInfo has a MayWrite access.
* When the ScopArrayInfo has more than one write, because the expansion could lead to a union map as access relation, which is not possible inside Polly.

```c
for (int i = 0; i < N; i++) {
S1: B[i] = ... ;
    for (int j = 0; j < M; j++) {
S2:     B[j] = ... ;
    }
T:  ... = B[i];
}
```

The write-only expanded version of this example (writes expanded, read not yet mapped) would look like this:

```c
for (int i = 0; i < N; i++) {
S1: B_exp[i] = ... ;
    for (int j = 0; j < M; j++) {
S2:     B_exp2[i][j] = ... ;
    }
T:  ... = B[i];
}
```

The read of B in T must then read either from B_exp or from B_exp2, depending on i: for $i < M$ the last write to B[i] is S2[i, i], otherwise it is S1[i]. Assuming $N > M$, its memory access relation would be a union of two maps:

$$\{ T[i] \rightarrow B\_exp[i] : M \le i < N \ ; \ T[i] \rightarrow B\_exp2[i, i] : 0 \le i < M \}$$

* When the dependences relevant to a read do not form exactly one map (too many dependences to be handled for now).
* When the expansion would lead to a read from the original array.

```c
for (int i = 0; i < N; i++) {
T:  ... = B[i];
    for (int j = 0; j < N; j++) {
S:      B[j] = ... ;
    }
}
```

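In this example, the read T[i] gets its value from the write S[i-1, i] of the previous iteration of the i-loop, so the expanded version would look like this:

```c
for (int i = 0; i < N; i++) {
T:  ... = B_exp[i-1][i];
    for (int j = 0; j < N; j++) {
S:      B_exp[i][j] = ... ;
    }
}
```

The problem is the first iteration: for i = 0, no instance of the SCoP writes B[0] before T[0] reads it, so the value must come from the original array B, and there is no $B\_exp[-1][0]$. We would need a copy-in mechanism that copies data from the original array into $B\_exp$. This mechanism is not implemented yet.
* When there are no writes.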
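A copy-in mechanism could take several forms. In the example above, the read of B[i] in iteration i gets its value from the write B[j] with j = i of iteration i - 1, or from the original array when i = 0. The sketch below shows one hypothetical scheme, not something implemented in Polly: B_exp gets one extra row, row 0 is filled with the original contents of B before the SCoP, and the write of iteration (i, j) goes to row i + 1. The read in iteration i then always maps to B_exp[i][i], a single affine access, including for i = 0. The concrete statements (`Sum += B[i]` for the read, `i + j` for the written value) are assumptions standing in for the `...` of the example.

```cpp
#include <cassert>
#include <vector>

// Original: for (i) { Sum += B[i]; for (j) B[j] = i + j; }
long original(std::vector<long> B) {
  const int N = static_cast<int>(B.size());
  long Sum = 0;
  for (int i = 0; i < N; i++) {
    Sum += B[i];
    for (int j = 0; j < N; j++)
      B[j] = i + j;
  }
  return Sum;
}

// Hypothetical copy-in scheme: row 0 of B_exp holds the original B,
// iteration (i, j) writes B_exp[i + 1][j], and iteration i reads B_exp[i][i].
long expandedWithCopyIn(const std::vector<long> &B) {
  const int N = static_cast<int>(B.size());
  std::vector<std::vector<long>> B_exp(N + 1, std::vector<long>(N));
  for (int j = 0; j < N; j++)
    B_exp[0][j] = B[j]; // copy-in before the SCoP
  long Sum = 0;
  for (int i = 0; i < N; i++) {
    Sum += B_exp[i][i];
    for (int j = 0; j < N; j++)
      B_exp[i + 1][j] = i + j;
  }
  return Sum;
}
```

The price of this scheme is one extra row of memory and a copy loop executed before the SCoP.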

Here is the full $isExpandable$ method:

```cpp
bool MaximalStaticExpander::isExpandable(
    const ScopArrayInfo *SAI, SmallPtrSetImpl<MemoryAccess *> &Writes,
    SmallPtrSetImpl<MemoryAccess *> &Reads, Scop &S,
    const isl::union_map &Dependences) {

  int NumberWrites = 0;
  for (ScopStmt &Stmt : S) {
    auto StmtReads = isl::union_map::empty(S.getParamSpace());
    auto StmtWrites = isl::union_map::empty(S.getParamSpace());

    for (MemoryAccess *MA : Stmt) {

      // Check if the current MemoryAccess involved the current SAI.
      if (SAI != MA->getLatestScopArrayInfo())
        continue;

      // For now, we are not able to expand Scalar.
      if (MA->isLatestScalarKind()) {
        emitRemark(SAI->getName() + " is a Scalar access.",
                   MA->getAccessInstruction());
        return false;
      }

      // For now, we are not able to expand array where read come after write
      // (to the same location) in a same statement.
      auto AccRel = isl::union_map(MA->getAccessRelation());
      if (MA->isRead()) {
        // Reject load after store to same location.
        if (!StmtWrites.is_disjoint(AccRel)) {
          emitRemark(SAI->getName() + " has read after write to the same "
                                      "element in same statement. The "
                                      "dependences found during analysis may "
                                      "be wrong because Polly is not able to "
                                      "handle such case for now.",
                     MA->getAccessInstruction());
          return false;
        }

        StmtReads = give(isl_union_map_union(StmtReads.take(), AccRel.take()));
      } else {
        StmtWrites =
            give(isl_union_map_union(StmtWrites.take(), AccRel.take()));
      }

      // For now, we are not able to expand MayWrite.
      if (MA->isMayWrite()) {
        emitRemark(SAI->getName() + " has a maywrite access.",
                   MA->getAccessInstruction());
        return false;
      }

      // For now, we are not able to expand SAI with more than one write.
      if (MA->isMustWrite()) {
        Writes.insert(MA);
        NumberWrites++;
        if (NumberWrites > 1) {
          emitRemark(SAI->getName() + " has more than 1 write access.",
                     MA->getAccessInstruction());
          return false;
        }
      }

      // Check if it is possible to expand this read.
      if (MA->isRead()) {

        // Get the domain of the current ScopStmt.
        auto StmtDomain = Stmt.getDomain();

        // Get the domain of the future Read access.
        auto ReadDomainSet = MA->getAccessRelation().domain();
        auto ReadDomain = isl::union_set(ReadDomainSet);

        // Get the dependences relevant for this MA
        auto MapDependences = filterDependences(S, Dependences, MA);
        auto DepsDomain = MapDependences.domain();
        unsigned NumberElementMap = isl_union_map_n_map(MapDependences.get());

        // If there are multiple maps in the Deps, we cannot handle this case
        // for now.
        if (NumberElementMap != 1) {
          emitRemark(SAI->getName() +
                         " has too many dependences to be handle for now.",
                     MA->getAccessInstruction());
          return false;
        }

        auto DepsDomainSet = isl::set(DepsDomain);

        // For now, read from the original array is not possible.
        if (!StmtDomain.is_subset(DepsDomainSet)) {
          emitRemark("The expansion of " + SAI->getName() +
                         " would lead to a read from the original array.",
                     MA->getAccessInstruction());
          return false;
        }

        Reads.insert(MA);
      }
    }
  }

  // No need to expand SAI with no write.
  if (NumberWrites == 0) {
    emitRemark(SAI->getName() + " has 0 write access.",
               S.getEnteringBlock()->getFirstNonPHI());
    return false;
  }

  return true;
}
```

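The method $expandWrite$ is pretty simple. As its name suggests, its aim is to expand a write access. Here is a step-by-step description of the algorithm.

Get the current access relation map and its domain, and create a new access map from this domain.

```cpp
  // Get the current AM.
  auto CurrentAccessMap = MA->getAccessRelation();

  // Get domain from the current AM.
  auto Domain = CurrentAccessMap.domain();

  // Create a new AM from the domain.
  auto NewAccessMap = isl::map::from_domain(Domain);
```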

Add output dimensions according to the loop nest.

```cpp
  unsigned in_dimensions = CurrentAccessMap.dim(isl::dim::in);

  // Add dimensions to the new AM according to the current in_dim.
  NewAccessMap = NewAccessMap.add_dims(isl::dim::out, in_dimensions);
```

Get or create the expanded ScopArrayInfo.

```cpp
  // Create the string representing the name of the new SAI.
  // One new SAI for each statement so that each write go to a different memory
  // cell.
  auto CurrentStmtDomain = MA->getStatement()->getDomain();
  auto CurrentStmtName = CurrentStmtDomain.get_tuple_name();
  auto CurrentOutId = CurrentAccessMap.get_tuple_id(isl::dim::out);
  std::string CurrentOutIdString =
      MA->getScopArrayInfo()->getName() + "_" + CurrentStmtName + "_expanded";

  // Create the size vector.
  std::vector<unsigned> Sizes;
  for (unsigned i = 0; i < in_dimensions; i++) {
    assert(isDimBoundedByConstant(CurrentStmtDomain, i) &&
           "Domain boundary are not constant.");
    auto UpperBound = getConstant(CurrentStmtDomain.dim_max(i), true, false);
    assert(!UpperBound.is_null() && UpperBound.is_pos() &&
           !UpperBound.is_nan() &&
           "The upper bound is not a positive integer.");
    assert(UpperBound.le(isl::val(CurrentAccessMap.get_ctx(),
                                  std::numeric_limits<int>::max() - 1)) &&
           "The upper bound overflow a int.");
    Sizes.push_back(UpperBound.get_num_si() + 1);
  }

  // Get the ElementType of the current SAI.
  auto ElementType = MA->getLatestScopArrayInfo()->getElementType();

  // Create (or get if already existing) the new expanded SAI.
  auto ExpandedSAI =
      S.createScopArrayInfo(ElementType, CurrentOutIdString, Sizes);
  ExpandedSAI->setIsOnHeap(true);
```

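Set the out tuple id to the id of the expanded array.

```cpp
  // Get the out Id of the expanded Array.
  auto NewOutId = ExpandedSAI->getBasePtrId();

  // Set the out id of the new AM to the new SAI id.
  NewAccessMap = NewAccessMap.set_tuple_id(isl::dim::out, NewOutId);
```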

Add constraints to link the input and output variables.

```cpp
  // Add constraints to linked output with input id.
  auto SpaceMap = NewAccessMap.get_space();
  auto ConstraintBasicMap =
      isl::basic_map::equal(SpaceMap, SpaceMap.dim(isl::dim::in));
  NewAccessMap = isl::map(ConstraintBasicMap);
```

Set the new access relation of the memory access.

```cpp
  // Set the new access relation map.
  MA->setNewAccessRelation(NewAccessMap);
```

We return the expanded ScopArrayInfo because we need it in the $expandRead$ method. Here is the full code of the $expandWrite$ method:

```cpp
ScopArrayInfo *MaximalStaticExpander::expandWrite(Scop &S, MemoryAccess *MA) {

  // Get the current AM.
  auto CurrentAccessMap = MA->getAccessRelation();

  unsigned in_dimensions = CurrentAccessMap.dim(isl::dim::in);

  // Get domain from the current AM.
  auto Domain = CurrentAccessMap.domain();

  // Create a new AM from the domain.
  auto NewAccessMap = isl::map::from_domain(Domain);

  // Add dimensions to the new AM according to the current in_dim.
  NewAccessMap = NewAccessMap.add_dims(isl::dim::out, in_dimensions);

  // Create the string representing the name of the new SAI.
  // One new SAI for each statement so that each write go to a different memory
  // cell.
  auto CurrentStmtDomain = MA->getStatement()->getDomain();
  auto CurrentStmtName = CurrentStmtDomain.get_tuple_name();
  auto CurrentOutId = CurrentAccessMap.get_tuple_id(isl::dim::out);
  std::string CurrentOutIdString =
      MA->getScopArrayInfo()->getName() + "_" + CurrentStmtName + "_expanded";

  // Set the tuple id for the out dimension.
  NewAccessMap = NewAccessMap.set_tuple_id(isl::dim::out, CurrentOutId);

  // Create the size vector.
  std::vector<unsigned> Sizes;
  for (unsigned i = 0; i < in_dimensions; i++) {
    assert(isDimBoundedByConstant(CurrentStmtDomain, i) &&
           "Domain boundary are not constant.");
    auto UpperBound = getConstant(CurrentStmtDomain.dim_max(i), true, false);
    assert(!UpperBound.is_null() && UpperBound.is_pos() &&
           !UpperBound.is_nan() &&
           "The upper bound is not a positive integer.");
    assert(UpperBound.le(isl::val(CurrentAccessMap.get_ctx(),
                                  std::numeric_limits<int>::max() - 1)) &&
           "The upper bound overflow a int.");
    Sizes.push_back(UpperBound.get_num_si() + 1);
  }

  // Get the ElementType of the current SAI.
  auto ElementType = MA->getLatestScopArrayInfo()->getElementType();

  // Create (or get if already existing) the new expanded SAI.
  auto ExpandedSAI =
      S.createScopArrayInfo(ElementType, CurrentOutIdString, Sizes);
  ExpandedSAI->setIsOnHeap(true);

  // Get the out Id of the expanded Array.
  auto NewOutId = ExpandedSAI->getBasePtrId();

  // Set the out id of the new AM to the new SAI id.
  NewAccessMap = NewAccessMap.set_tuple_id(isl::dim::out, NewOutId);

  // Add constraints to linked output with input id.
  auto SpaceMap = NewAccessMap.get_space();
  auto ConstraintBasicMap =
      isl::basic_map::equal(SpaceMap, SpaceMap.dim(isl::dim::in));
  NewAccessMap = isl::map(ConstraintBasicMap);

  // Set the new access relation map.
  MA->setNewAccessRelation(NewAccessMap);

  return ExpandedSAI;
}
```
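The size vector and the identity access relation built by $expandWrite$ can be mimicked with plain integers. In the hypothetical sketch below, the statement domain is assumed to be rectangular with known constant upper bounds (in Polly they come from `dim_max` on the statement domain); each dimension of the expanded array gets upper bound + 1 elements, and since the new access relation is the identity, each statement instance writes the element at its own iteration vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Sizes of the expanded array for a domain 0 <= x_k <= UpperBounds[k].
std::vector<unsigned> expandedSizes(const std::vector<unsigned> &UpperBounds) {
  std::vector<unsigned> Sizes;
  for (unsigned UB : UpperBounds)
    Sizes.push_back(UB + 1);
  return Sizes;
}

// Row-major offset of element X in an array with the given sizes. With the
// identity access relation S[x] -> Expanded[x], X is the iteration vector of
// the writing statement instance.
uint64_t rowMajorOffset(const std::vector<unsigned> &Sizes,
                        const std::vector<unsigned> &X) {
  uint64_t Offset = 0;
  for (std::size_t k = 0; k < Sizes.size(); k++)
    Offset = Offset * Sizes[k] + X[k];
  return Offset;
}
```

For a domain 0 ≤ i ≤ 3, 0 ≤ j ≤ 4, the sizes are {4, 5} and the 20 instances write 20 distinct elements (offsets 0 to 19).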

As its name suggests, the $expandRead$ method expands the read access passed as a parameter. The algorithm is pretty simple too: the goal is to map the read to the element of the expanded array written by the write that produced its value, i.e. the last write to the same location before the read.

First, we get the RAW dependences relevant for the read.

```cpp
  // Get RAW dependences for the current WA.
  auto WriteDomainSet = MA->getAccessRelation().domain();
  auto WriteDomain = isl::union_set(WriteDomainSet);

  // Get the dependences relevant for this MA
  auto MapDependences = filterDependences(S, Dependences, MA);

  // If no dependences, no need to modify anything.
  if (MapDependences.is_empty())
    return;

  assert(isl_union_map_n_map(MapDependences.get()) == 1 &&
         "There are more than one RAW dependencies in the union map.");
  auto NewAccessMap = isl::map::from_union_map(MapDependences);
```

The $filterDependences$ method keeps only the dependences relevant to the given memory access.

Then we set the out tuple id of the map to the id of the expanded array.

```cpp
  auto Id = ExpandedSAI->getBasePtrId();

  // Replace the out tuple id with the one of the access array.
  NewAccessMap = NewAccessMap.set_tuple_id(isl::dim::out, Id);
```

Finally, we set the new access relation of the memory access.

```cpp
  // Set the new access relation.
  MA->setNewAccessRelation(NewAccessMap);
```

Here is the full version of $expandRead$:

```cpp
void MaximalStaticExpander::expandRead(Scop &S, MemoryAccess *MA,
                                       const isl::union_map &Dependences,
                                       ScopArrayInfo *ExpandedSAI) {

  // Get the current AM.
  auto CurrentAccessMap = MA->getAccessRelation();

  // Get RAW dependences for the current WA.
  auto WriteDomainSet = MA->getAccessRelation().domain();
  auto WriteDomain = isl::union_set(WriteDomainSet);

  // Get the dependences relevant for this MA
  auto MapDependences = filterDependences(S, Dependences, MA);

  // If no dependences, no need to modify anything.
  if (MapDependences.is_empty())
    return;

  assert(isl_union_map_n_map(MapDependences.get()) == 1 &&
         "There are more than one RAW dependencies in the union map.");
  auto NewAccessMap = isl::map::from_union_map(MapDependences);

  auto Id = ExpandedSAI->getBasePtrId();

  // Replace the out tuple id with the one of the access array.
  NewAccessMap = NewAccessMap.set_tuple_id(isl::dim::out, Id);

  // Set the new access relation.
  MA->setNewAccessRelation(NewAccessMap);
}
```

Details, diff and discussions can be found here: https://reviews.llvm.org/D34982. This patch and the bug fix have been merged into the current version of Polly.

### Scalar fully-indexed expansion

Polly has two main ways of representing scalars. The first one is called MemoryKind::Value: a single memory write stores the value into the memory object at its definition, and a corresponding read is added at each use of the value.

The second one is called MemoryKind::PHI. A PHI node represents the fact that a scalar can have multiple sources. Let's take an example.

```c
int tmp = 0;
for (int i = 0; i < N; i++) {
    tmp = tmp + 2;
}
```

In LLVM, the code is in SSA form, so Polly sees the following code:

```c
int tmp = 0;
for (int i = 0; i < N; i++) {
    tmp_1 = PHI(tmp, tmp_2);
    tmp_2 = tmp_1 + 2;
}
```

$tmp\_1$ does not always have the same source, depending on the iteration of the i-loop: if i = 0, the source is tmp; otherwise, the source is $tmp\_2$ from the previous iteration.

The expansion of MemoryKind::Value is trivial because it fits well into the MemoryKind::Array expansion algorithm.

The expansion of MemoryKind::PHI required a bit more work. A PHI has only one read but can have multiple writes (one per incoming block). So to expand a PHI, we simply swap the roles of reads and writes compared to MemoryKind::Array expansion: we expand the read access and map the write accesses. To do that, we refactored $expandWrite$ into $expandAccess$ and $expandRead$ into $mapAccess$. $mapAccess$ does exactly the same thing as before, but now takes a set of MemoryAccesses to map. We also added a method called $expandPhi$, responsible for the expansion of PHI nodes.

Let's see how $expandPhi$ works.

```cpp
void MaximalStaticExpander::expandPhi(Scop &S, const ScopArrayInfo *SAI,
                                      const isl::union_map &Dependences) {
  SmallPtrSet<MemoryAccess *, 4> Writes;
  for (auto MA : S.getPHIIncomings(SAI))
    Writes.insert(MA);
  auto Read = S.getPHIRead(SAI);
  auto ExpandedSAI = expandAccess(S, Read);

  mapAccess(S, Writes, Dependences, ExpandedSAI, false);
}
```

This method is pretty simple. First, we query the read (S.getPHIRead) and the writes (S.getPHIIncomings). Then we ask $expandAccess$ to expand the read and $mapAccess$ to map the set of writes. And that's all!
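At the source level, the effect of $expandPhi$ on the PHI example above can be pictured as follows. This is a hypothetical C-level sketch, not Polly output: the PHI read in iteration i is expanded to Phi_exp[i], the incoming write from the loop entry is mapped to Phi_exp[0] (read by iteration 0), and the incoming write at the end of iteration i is mapped to Phi_exp[i + 1] (read by iteration i + 1). Giving the last write the extra cell Phi_exp[N], so that the final value survives, is an assumption of the sketch.

```cpp
#include <cassert>
#include <vector>

// SSA view of: int tmp = 0; for (i < N) tmp = tmp + 2;
// tmp_1 = PHI(tmp, tmp_2) is read once per iteration.
int phiOriginal(int N) {
  int Phi = 0;               // incoming value from the loop entry
  for (int i = 0; i < N; i++) {
    int tmp_1 = Phi;         // PHI read
    int tmp_2 = tmp_1 + 2;
    Phi = tmp_2;             // incoming value from the loop latch
  }
  return Phi;
}

// Expanded PHI: one cell per reading iteration.
int phiExpanded(int N) {
  std::vector<int> Phi_exp(N + 1);
  Phi_exp[0] = 0;            // entry write, mapped to the reader i = 0
  for (int i = 0; i < N; i++) {
    int tmp_1 = Phi_exp[i];  // expanded PHI read
    int tmp_2 = tmp_1 + 2;
    Phi_exp[i + 1] = tmp_2;  // latch write, mapped to the reader i + 1
  }
  return Phi_exp[N];
}
```

After expansion, no two iterations write the same cell, which removes the memory-based dependences on the PHI location; the true dependence from iteration i to iteration i + 1 remains.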

The 'main' method of expansion now looks like this:

```cpp
bool MaximalStaticExpander::runOnScop(Scop &S) {

  // Get the RAW Dependences.
  auto &DI = getAnalysis<DependenceInfo>();
  auto &D = DI.getDependences(Dependences::AL_Reference);
  auto Dependences = isl::give(D.getDependences(Dependences::TYPE_RAW));

  for (auto SAI : S.arrays()) {
    SmallPtrSet<MemoryAccess *, 4> AllWrites;
    SmallPtrSet<MemoryAccess *, 4> AllReads;
    if (!isExpandable(SAI, AllWrites, AllReads, S, Dependences))
      continue;

    if (SAI->isValueKind() || SAI->isArrayKind()) {
      auto TheWrite = *(AllWrites.begin());
      ScopArrayInfo *ExpandedArray = expandAccess(S, TheWrite);

      mapAccess(S, AllReads, Dependences, ExpandedArray, true);
    } else if (SAI->isPHIKind()) {
      expandPhi(S, SAI, Dependences);
    }
  }

  return false;
}
```

In short, we get the RAW dependences and iterate over all ScopArrayInfo objects of the current SCoP. If a ScopArrayInfo is expandable, we use $expandPhi$ if it is a PHI; otherwise we expand it directly with $expandAccess$ and $mapAccess$.

Details, diff and discussions can be found here: https://reviews.llvm.org/D36647. This patch has been merged into the current version of Polly.


## Evaluation

## Remaining work

### MAXIMAL expansion
### Select which SAI to expand

[^f1]: Denis Barthou, Albert Cohen, and Jean-François Collard. 2000. Maximal Static Expansion. Int. J. Parallel Program. 28, 3 (June 2000), 213-243. DOI=http://dx.doi.org/10.1023/A:1007500431910 

[^f2]: P. Feautrier. 1988. Array expansion. In Proceedings of the 2nd international conference on Supercomputing (ICS '88), J. Lenfant (Ed.). ACM, New York, NY, USA, 429-441. DOI=http://dx.doi.org/10.1145/55364.55406 

[^f3]: Peter Vanbroekhoven. n.d. Dynamic Single Assignment. [ebook] Available at: http://www.elis.ugent.be/aces/edegem2002/vanbroekhoven.pdf [Accessed 22 Aug. 2017].