# code-book rules for the "default ruleset for Java, by r2c" 

There are 28 rules in this ruleset. In this notebook, you'll find
parallel implementations of each of the rules for `code-book`. The goal
is to get an idea of what our high-level API should look like, which
kinds of operations we need to support, and where we might be able to
"do better" than SemGrep. It also sets up a performance comparison
(eventually, we will need to run SemGrep on the same data).

In [3]:
!gandiva-build.sh

/arrow/arrow-cpp-build /app/applications/jupyter-extension/nteract_on_jupyter/notebooks
-- Building using CMake version: 3.20.0
-- Arrow version: 4.0.0 (full: '4.0.0-SNAPSHOT')
-- Arrow SO version: 400 (full: 400.0.0)
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
-- infer not found
fatal: not a git repository: /arrow/../../.git/modules/query/arrow
-- Found Python3: /usr/local/bin/python3.9 (found version "3.9.4") found components: Interpreter 
-- Found cpplint executable at /arrow/cpp/build-support/cpplint.py
-- System processor: x86_64
Using ld linker
Configured for DEBUG build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: DEBUG
-- Using BUNDLED approach to find dependencies
-- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862
-- ARROW_AWSSDK_BUILD_VERSION: 1.8.133
-- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.10
-- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59
-- ARROW_AWS_

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from utils.cb.java import *

# SemGrep: httpservlet-path-traversal
query = (
 new() % label_as_match()
 |where| any_arg()
   |isa| call(with_name('getParameter')) % label('because (1)')
   |where| the_receiver()
     |isa| param() % label('because (2)')
     |where| the_type()
       |isa| type_(with_name('URL')) 
)

results = Evaluator(query).evaluate()
display_results(results)

Match (RS#2):
```
new ThreadPoolExecutor(cores, threads, alive, TimeUnit.MILLISECONDS,
                queues == 0 ? new SynchronousQueue<Runnable>() :
                        (queues < 0 ? new LinkedBlockingQueue<Runnable>()
                                : new LinkedBlockingQueue<Runnable>(queues)),
                new NamedInternalThreadFactory(name, true), new AbortPolicyWithReport(name, url))
```
└─ because (1): ```
   cores = url.getParameter(CORE_THREADS_KEY, DEFAULT_CORE_THREADS)
   ```
 └─ because (2): ```
    URL url
    ```
Match (RS#2):
```
new ThreadPoolExecutor(cores, threads, alive, TimeUnit.MILLISECONDS,
                queues == 0 ? new SynchronousQueue<Runnable>() :
                        (queues < 0 ? new LinkedBlockingQueue<Runnable>()
                                : new LinkedBlockingQueue<Runnable>(queues)),
                new NamedInternalThreadFactory(name, true), new AbortPolicyWithReport(name, url))
```
└─ because (1): ```
   threads = url.getPara

In [6]:
from utils.cb.java import *

query = (
  call() % label_as_match()
  |where| any_arg()
    |isa| new() % label('because (1)')
)

results = Evaluator(query).evaluate()
display_results(results)

Match (RS#1):
```
threads.add(thread)
```
└─ because (1): ```
   thread = new Thread(baseThreadName + "-" + i) {
           @Override public void run() {
             try {
               startGate.await();
               try {
                 results.set(index, task.call());
               } finally {
                 endGate.countDown();
               }
             } catch (Exception e) {
               throw new RuntimeException(e);
             }
           }
         }
   ```
Match (RS#1):
```
threads.add(thread)
```
└─ because (1): ```
   thread = new Thread(baseThreadName + "-" + i) {
           @Override public void run() {
             try {
               startGate.await();
               try {
                 results.set(index, task.call());
               } finally {
                 endGate.countDown();
               }
             } catch (Exception e) {
               throw new RuntimeException(e);
             }
           }
         }
   ```
Match (RS#1):
```
toLi

In [5]:
from utils.cb.java import *

# SemGrep: servletresponse-writer-xss
query = (
  call() % label_as_match()
  |where| any_arg_is(
    call(with_name('getParameter')) % label('arg was')
    |where| the_receiver_is(
      param() % label("arg's receiver was")
      |where| the_type()
        |isa| type_(with_name('HttpServletRequest'))
    )
    |and_w| any_arg_is(string())
  )
  |and_w| the_receiver_is(
    call(with_name('getWriter')) % label('receiver was')
    |where| the_receiver()
      |isa| param() % label("receiver's receiver was")
      |where| the_type()
        |isa| type_(with_name('HttpServletResponse'))
  )
)

results = Evaluator(query).evaluate()
display_results(results)


Match (RS#3):
```
writer.println(request.getParameter("function"))
```
└─ arg was: ```
   request.getParameter("function")
   ```
 └─ arg's receiver was: ```
    HttpServletRequest
    ```
  └─ receiver was: ```
     writer = response.getWriter()
     ```
   └─ receiver's receiver was: ```
      HttpServletResponse
      ```
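
Side note on the syntax used in the queries above: `|where|`, `|isa|` and `% label(...)` are ordinary Python operators. Below is a minimal sketch of how infix operators like these can be built with `__ror__`/`__or__` and `__mod__`. This is not the `utils.cb.java` implementation; `Node`, `Infix`, `_Bound` and `_chain` are hypothetical names made up for illustration, and the "query" here is just a record of the steps, with no evaluation.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


class Infix:
    """Wraps a two-argument function so it can be written as `a |op| b`."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, left):
        # `left | op`: Node defines no __or__, so Python falls back to this.
        return _Bound(self.fn, left)


class _Bound:
    """Holds the left operand until the right one arrives."""

    def __init__(self, fn, left):
        self.fn = fn
        self.left = left

    def __or__(self, right):
        # `(left | op) | right`: apply the wrapped function.
        return self.fn(self.left, right)


@dataclass(frozen=True)
class Node:
    kind: str
    label: Optional[str] = None
    steps: Tuple[Tuple[str, "Node"], ...] = ()

    def __mod__(self, label: str) -> "Node":
        # `node % "text"` attaches a label; `%` binds tighter than `|`.
        return replace(self, label=label)


def _chain(op_name: str) -> Infix:
    # Each operator appends (operator name, right-hand node) to the left node.
    def apply(left: Node, right: Node) -> Node:
        return replace(left, steps=left.steps + ((op_name, right),))

    return Infix(apply)


where = _chain("where")
isa = _chain("isa")

query = (
    Node("call") % "match"
    |where| Node("any_arg")
    |isa| Node("new") % "because (1)"
)
```

Because `|` is left-associative, the chain is applied step by step from the left, so in this sketch `query.steps` ends up as a flat list of `("where", ...)` and `("isa", ...)` entries. A real implementation would have to decide how nesting by indentation maps onto that flat chain.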


In [2]:
from utils.cb.java import *

# SemGrep: anonymous-ldap-bind
query = (
  new(with_name('InitialDirContext')) % label_as_match()
  |where| the_first_args_receiver_is(
    call(with_name('put'))
    |where| the_first_arg_is(
      field_ref(with_name('SECURITY_AUTHENTICATION'))
    )
    |and_w| the_second_arg_is(
      string() # with_text('none')
    )
  )
)

results = Evaluator(query).evaluate(debug=True)

ImportError: libtinfo.so.5: cannot open shared object file: No such file or directory

In [None]:
# SemGrep: bad-hexa-conversion

digest_results = cb.calls('digest').receiver_is(
    cb.vars(type='MessageDigest').bind()
).bind()

for_over_results = cb.fors().target_container(digest_results).bind()

matches = cb.calls('Integer.toHexString').any_arg_is(
    cb.deep_ref(for_over_results)
)

In [None]:
# SemGrep: cbc-padding-oracle

matches = cb.calls('getInstance').first_arg_is(
    cb.str(regex=r".*/CBC/PKCS5Padding")  # no trailing "/": Semgrep's /.../ delimiters are not part of the regex
)
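
A quick sanity check for regexes ported from SemGrep: SemGrep's `"=~/.../"` string syntax wraps the regex in `/` delimiters, and those delimiters should not end up inside the Python pattern. The sketch below uses plain `re` on a few illustrative transformation strings. How `cb.str(regex=...)` actually applies its pattern is an assumption here (I use `re.search`).

```python
import re

# Illustrative cipher transformation strings, not taken from real results.
samples = ["AES/CBC/PKCS5Padding", "AES/GCM/NoPadding", "DES/CBC/PKCS5Padding"]

# Pattern with the closing "/" delimiter carried over from SemGrep's syntax.
with_delimiter = re.compile(r".*/CBC/PKCS5Padding/")
# Pattern with the delimiter dropped.
without_delimiter = re.compile(r".*/CBC/PKCS5Padding")

hits_with = [s for s in samples if with_delimiter.search(s)]
hits_without = [s for s in samples if without_delimiter.search(s)]

print(hits_with)
print(hits_without)
```

With the trailing `/` kept, none of the sample strings match, because a transformation string does not end in a slash.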


In [None]:
# SemGrep: command-injection-formatted-runtime-call

# This one is tricky! The goal is to flag exec(..., "sh", "-c", user_supplied, ...)

matches1 = cb.calls(['exec', 'loadLibrary']).first_arg_is(
    cb.str_concat_or_format()
).receiver_is(cb.calls('getRuntime').bind())

matches2 = cb.calls('exec').any_arg_is(
    cb.deep_ref(cb.siblings(
        cb.str(regex=r"(sh|bash|ksh|csh|tcsh|zsh)"),
        cb.str('-c'),
        cb.var().has_no_init()
    ))
)
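
One thing to watch with the shell-name regex: whether it is matched against the whole string literal or anywhere inside it. The sketch below compares the two with plain `re`. Which of these `cb.str(regex=...)` does is an assumption I have not checked, and the candidate strings are made up.

```python
import re

shell = r"(sh|bash|ksh|csh|tcsh|zsh)"
candidates = ["sh", "bash", "/bin/sh", "push", "zsh"]

# search(): the pattern may match anywhere, so "push" counts as a shell.
unanchored = [c for c in candidates if re.search(shell, c)]

# fullmatch(): the whole literal must be a shell name, so "/bin/sh" is missed.
whole = [c for c in candidates if re.fullmatch(shell, c)]

print(unanchored)
print(whole)
```

Neither variant is obviously right: unanchored matching over-reports, while whole-string matching misses absolute paths such as `/bin/sh`.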


In [None]:
# SemGrep: formatted-sql-string

# TODO: this one is also quite complex (just long...)
# we can probably make it a lot shorter!

# https://semgrep.dev/editor?registry=java.lang.security.audit.formatted-sql-string.formatted-sql-string

In [None]:
# SemGrep: http-response-splitting

bad_cookie1 = cb.new('Cookie').any_arg_is(
    cb.calls('getParameter').bind()
)

bad_cookie2 = cb.new('Cookie').any_arg_is(
    cb.method_params().annotated_with('@PathVariable').bind()
)

matches = cb.calls('addCookie').first_arg_is(
    cb.either(bad_cookie1, bad_cookie2)
)



In [None]:
# SemGrep: ldap-injection

context_var = cb.var([
    'InitialDirContext',
    'DirContext',
    'InitialLdapContext',
    'LdapContext',
    'LdapCtx',
    'EventDirContext'
]).bind()

matches = cb.calls('search').receiver_is(
    context_var
).second_arg_is(
    cb.anything_but(cb.str())
)


In [None]:
# SemGrep: object-deserialization

matches = cb.new('ObjectInputStream')


In [None]:
# SemGrep: script-engine-injection

matches = cb.calls('eval').receiver_is(cb.either(
    cb.field(type='ScriptEngine').bind(),
    cb.var(type='ScriptEngine').bind()
)).first_arg_is(
    cb.anything_but(cb.str())
)
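
`cb.either(...)` and `cb.anything_but(...)` show up in several of the cells above. A minimal way to think about them is as combinators over predicates. The sketch below is plain Python over toy `(kind, text)` tuples; it is not the `cb` API, and `is_string`, `is_var` and the tuple encoding are hypothetical.

```python
from typing import Callable, Tuple

Arg = Tuple[str, str]            # (kind, source text), a toy stand-in for an AST node
Pred = Callable[[Arg], bool]


def either(*preds: Pred) -> Pred:
    # Matches when at least one of the given predicates matches.
    return lambda arg: any(p(arg) for p in preds)


def anything_but(pred: Pred) -> Pred:
    # Matches exactly when the given predicate does not.
    return lambda arg: not pred(arg)


def is_string(arg: Arg) -> bool:
    return arg[0] == "string"


def is_var(arg: Arg) -> bool:
    return arg[0] == "var"


args = [("string", '"1 + 1"'), ("var", "userScript"), ("call", "readLine()")]

non_literal = [a for a in args if anything_but(is_string)(a)]
var_or_string = [a for a in args if either(is_var, is_string)(a)]
```

Here `non_literal` keeps the variable and the call, which is the shape of argument the `eval` and `search` rules above are trying to flag.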


In [None]:

matches1 = merge(
  call() % 'c1'
  |where| the_receiver()
    |isa| formal_parameter_ref() % 'r1',
  
  call() % 'c2'
  |where| the_receiver()
    |isa| ref('r1'),
  
  ref('c1') != ref('c2')
)
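
My reading of the `merge(...)` cell above is: find two distinct calls (`c1`, `c2`) whose receiver is the same formal parameter (`r1`). Under that assumption, the operation is a self-join on the receiver. The sketch below shows the idea in plain Python on toy data; `Call` and `pairs_on_same_receiver` are hypothetical and not part of `code-book`.

```python
from collections import defaultdict
from itertools import permutations
from typing import List, NamedTuple, Tuple


class Call(NamedTuple):
    id: int
    receiver: str   # name of the formal parameter the call is made on
    name: str


calls = [
    Call(1, "request", "getParameter"),
    Call(2, "request", "getHeader"),
    Call(3, "response", "getWriter"),
]


def pairs_on_same_receiver(calls: List[Call]) -> List[Tuple[Call, Call]]:
    # Group calls by receiver, then emit every ordered pair of distinct calls
    # within a group (the `ref('c1') != ref('c2')` condition).
    by_receiver = defaultdict(list)
    for call in calls:
        by_receiver[call.receiver].append(call)
    return [
        pair
        for group in by_receiver.values()
        for pair in permutations(group, 2)
    ]


pairs = pairs_on_same_receiver(calls)
pair_ids = [(c1.id, c2.id) for c1, c2 in pairs]
```

Note that the inequality alone yields each pair twice, once in each order. If that is not wanted, an ordering constraint (for example `combinations` instead of `permutations`) would halve the results.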