# Git Repo Search


!["Robot Magnifying Glass Search"](./git-search.jpg)

Photo by <a href="https://unsplash.com/@growtika?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Growtika</a> on <a href="https://unsplash.com/photos/a-white-robot-holding-a-magnifying-glass-g5kpSCf3dOs?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Unsplash</a>
  

There are many ways to search a repository, particularly a Git repo. This post outlines some use cases, with examples, for a plain "Unix-like" directory tree and for a Git repo.


Let's do this for a library I've been looking at. 

Some unit tests break on Apple Silicon for the open source library [pyg](https://pyg.org). The maintainer disabled some of them, and I want to find which ones. The failures have something to do with PyTorch not fully supporting compressed sparse tensor representations on Apple's `mps` framework for Apple Silicon. I received the following note:

>there are a few tests that were disabled around `test_sparse` 

and

>the `convert_coo_to_csr_indices`  doesn't seem to be supported.

It's likely that `test_sparse` and `convert_coo_to_csr_indices` are variable names or tokens inside a code file of the Git repository. However, for illustration, we will assume that they could be anywhere in the repo (filenames, directory names, commit messages, variable names, past commits, current directory).

Time to search. Clone the repo locally from [here](https://github.com/pyg-team/pytorch_geometric/) and then `cd` into it.


In [60]:
import os

os.chdir("/Users/ravikalia/Code/github.com/ml-blog/posts/git-search/")

In [61]:
print(os.getcwd())

/Users/ravikalia/Code/github.com/ml-blog/posts/git-search


In [62]:
%%bash
rm -rf pytorch_geometric
git clone https://github.com/pyg-team/pytorch_geometric.git
git -C pytorch_geometric fetch --all

Cloning into 'pytorch_geometric'...


Let's change directory to the root of the cloned repo, which makes searching easier.

In [63]:
os.chdir("./pytorch_geometric")

## Filename in "Unix-like" Directory

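We can look for the strings `test_sparse` and `convert_coo_to_csr_indices` in file and directory names using the shell command-line tool `find`. Each `-name` test takes a glob pattern, and `-o` combines the two tests with a logical OR.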

In [64]:
%%bash
find . -name "*test_sparse*" -o -name "*convert_coo_to_csr_indices*"

./test/utils/test_sparse.py


Great, so we have a file to look at: `test_sparse.py`. It seems to contain unit tests related to sparsity, possibly testing utility functions for converting between sparse tensor representations.
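For readers who prefer to stay in Python, the same filename search can be sketched with the standard library. This is a minimal sketch rather than a drop-in replacement for `find`; the helper name `find_by_name` is made up for this post.

```python
import fnmatch
import os
from typing import Iterable, List


def find_by_name(root: str, patterns: Iterable[str]) -> List[str]:
    """Return paths under root whose base name matches any glob pattern.

    Hypothetical helper that mimics `find ROOT -name P1 -o -name P2`.
    """
    patterns = list(patterns)
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Like `find -name`, test directory names as well as file names.
        for name in dirnames + filenames:
            # fnmatchcase is case-sensitive, as `-name` is.
            if any(fnmatch.fnmatchcase(name, p) for p in patterns):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling `find_by_name(".", ["*test_sparse*", "*convert_coo_to_csr_indices*"])` from the repo root mirrors the `find` command above.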

## String in "Unix-like" Directory

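String search is a bit more complicated. `grep` is an awesome tool for this: `-r` recurses into directories, `-n` prints the line number of each match, and each `-e` adds a pattern to search for.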


In [65]:
%%bash
grep -rn . -e "test_sparse" -e "convert_coo_to_csr_indices"

./test/utils/test_cross_entropy.py:9:def test_sparse_cross_entropy_multiclass(with_edge_label_weight):
./test/utils/test_cross_entropy.py:32:def test_sparse_cross_entropy_multilabel(with_edge_label_weight):
./test/test_edge_index.py:102:def test_sparse_tensor(dtype, device):
./test/test_edge_index.py:992:def test_sparse_narrow(device):
./test/test_edge_index.py:1026:def test_sparse_resize(device):
./torch_geometric/testing/asserts.py:24:    test_sparse_layouts: Optional[List[Union[str, torch.layout]]] = None,
./torch_geometric/testing/asserts.py:49:        test_sparse_layouts (List[str or int], optional): The sparse layouts to
./torch_geometric/testing/asserts.py:62:    if test_sparse_layouts is None:
./torch_geometric/testing/asserts.py:63:        test_sparse_layouts = SPARSE_LAYOUTS
./torch_geometric/testing/asserts.py:74:    if len(test_sparse_layouts) > 0 and sparse_size is None:
./torch_geometric/testing/asserts.py:75:        raise ValueError(f"Got sparse layouts {test_sparse_layo

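The matches shown above fall in three files, and they may not all be relevant to the disabled tests. Because `grep -r` also descends into the `.git` directory, it can report a match in `.git/index`, a binary file that Git uses to store information about the repository. That match is not relevant for our task, and adding `--exclude-dir=.git` to the `grep` call skips the directory.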

## What is Git

Git is a distributed version control system. It is a tool that tracks changes in files and directories. At user-defined snapshots in time, called commits, it records the changes made to the files and directories. As a consequence it is possible to search for changes in the repository across snapshots. 


Along with `grep` and `find`, there are `git` specific tools for searching snapshots of the repo, commit messages and filtering by `date` and `author`, such as:

* `git ls-files`
* `git log`
* `git grep`

## Filename in Git Repository

The working tree is what you see when you list the files in your project's directory that are being tracked. It's the version of your project that you're currently working on. The `git checkout` command updates the working tree to match the snapshot recorded in a specific commit. Untracked files are not affected by `git checkout`.

The `git ls-files` command lists the files in the working tree that are being tracked by Git. Its output can be piped to `grep` to search the filenames for a string.

In [66]:
%%bash
git ls-files | grep "test_sparse"

test/utils/test_sparse.py


`git ls-files` only sees the current snapshot, so it cannot search filenames across commits. For that, use `git log` with a path pattern, which lists the commits that changed any file whose path matches:

In [67]:
%%bash
git log --all -- '*test_sparse*'

commit 62fa51e0000913e1b3023b817485d2b248322539
Author: Matthias Fey <matthias.fey@tu-dortmund.de>
Date:   Sun Dec 24 11:56:08 2023 +0100

    Accelerate concatenation of `torch.sparse` tensors (#8670)
    
    Fixes #8664

commit 1c89e751804d1eb2fb626dabc677198a1878c34d
Author: Matthias Fey <matthias.fey@tu-dortmund.de>
Date:   Wed Oct 4 09:59:36 2023 +0200

    Skip TorchScript bug for PyTorch < 1.12 (#8123)

commit 51c50c2f9d3372de34f4ac3617f396384a36558c
Author: filipekstrm <filip.ekstrom@hotmail.com>
Date:   Tue Oct 3 20:39:04 2023 +0200

    Added `mask` argument to `dense_to_sparse` (#8117)
    
    Added optional argument mask to dense_to_sparse so that it can correctly
    invert a call to to_dense_adj by returning the correct edge_index in
    case there are graphs with different number of nodes (and hence, the
    dense adjacency matrix contains some padding)
    
    ---------
    
    Co-authored-by: Filip Ekström Kelvinius <filek51@lnx00195.ad.liu.se>
    Co-authored-by: 

## String in Git Repository

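If we want to search for a string across commits, the main power of Git comes from the `git log` and `git grep` commands. `git log -S <string>` lists the commits that change the number of occurrences of the string in a file, that is, commits that add or remove it.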



In [83]:
%%bash
git log -S "test_sparse" --all

commit dba9659f6c4f29fd2be1f50b5ea12a29a926082f
Author: Matthias Fey <matthias.fey@tu-dortmund.de>
Date:   Thu Feb 29 14:04:19 2024 +0100

    Fix `EdgeIndex.resize_` linting issues (#8993)

commit 123e38ef6715f75ed9198d256cc2cb984b431630
Author: Poovaiah Palangappa <98763718+pmpalang@users.noreply.github.com>
Date:   Sun Feb 11 03:32:44 2024 -0800

    Example of a recommender system (#8546)
    
    Hi Everyone,
    
    I'm adding a recommender system example with the following salient
    features
    
    1. Dataset MovieLens – a heterogenous use case
    2. Demonstrates the use of edge based temporal sampling
    3. Visualization
      t-SNE based visualization (--visualize_emb)
    4. Uses torch_geometric.nn.pool.MIPSKNNIndex for getting recommedations
    5. Integration of the LinkPred metrics -- precision@k and ndcg@k
    
    Thanks,
    Poovaiah
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-author

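To restrict the search to one branch, replace `--all` with the branch name (`master` in this case). In the second cell below, `--pretty=format:"%h"` and `--name-only` reduce the output to the abbreviated commit hash and the changed files, and `--diff-filter=A` keeps only files that were added.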

In [82]:
%%bash
git log master -S "convert_coo_to_csr_indices" 

In [81]:
%%bash
git log master -S "test_sparse" --pretty=format:"%h" --name-only --diff-filter=A

801723efa
test/utils/test_cross_entropy.py

1dadc0705
torch_geometric/testing/asserts.py

2c01aa22c
test/utils/test_sparse.py


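To search with a regular expression, use `-G`. It lists the commits whose diff contains an added or removed line matching the pattern. The pattern may match anywhere in a line, so no surrounding `*` wildcard is needed.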

In [80]:
%%bash
git log -G "convert_coo_to_csr_indices" --pretty=format:"%h" --name-only

In [79]:
%%bash
git log -G "coo_to_csr" --pretty=format:"%h" --name-only

390942fc4
torch_geometric/data/edge_index.py

699120e25
torch_geometric/data/edge_index.py

a6f0f4947
torch_geometric/data/edge_index.py

cf786b735
torch_geometric/data/edge_index.py

b825dc637
torch_geometric/data/edge_index.py

b5ecfd9b4
torch_geometric/data/graph_store.py
torch_geometric/nn/conv/cugraph/base.py
torch_geometric/nn/conv/rgcn_conv.py
torch_geometric/nn/dense/linear.py
torch_geometric/sampler/utils.py
torch_geometric/transforms/gdc.py
torch_geometric/utils/sparse.py

a73043736
torch_geometric/data/graph_store.py
torch_geometric/nn/conv/rgcn_conv.py
torch_geometric/nn/dense/linear.py

d29154558
torch_geometric/nn/conv/cugraph/base.py

886352bd6
torch_geometric/transforms/gdc.py

ce2a84f4d
torch_geometric/sampler/utils.py

0b3f8e98a
torch_geometric/sampler/utils.py


In [88]:
%%bash
git log -G "convert_coo" --pretty=format:"%h" --name-only

In [90]:
%%bash
git log -G "csr_indices" --pretty=format:"%h" --name-only

In [None]:
%%bash
git log master -G"test_sparse" --pretty=format:"%h" --name-only

dba9659f6
test/test_edge_index.py

123e38ef6
test/test_edge_index.py

23bbc128d
test/test_edge_index.py

ed9698d0b
torch_geometric/testing/asserts.py

1725f1436
test/utils/test_cross_entropy.py

801723efa
test/utils/test_cross_entropy.py

1dadc0705
torch_geometric/testing/asserts.py

7b4892781
test/nn/conv/test_gcn_conv.py

72e8ef33d
test/nn/conv/test_gcn_conv.py

93fab2e53
test/nn/conv/test_gcn_conv.py

d01ea9dab
test/utils/test_sparse.py

2c01aa22c
test/utils/test_sparse.py

eb4260ce0
torch_geometric/nn/functional/pool/voxel_pool_test.py

544f4ad0e
torch_geometric/nn/functional/pool/voxel_pool_test.py


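## Search for Author, Date, and String in Git Commit Messages

`git log` can also filter commits by author (`--author`), by date range (`--since` and `--until`), and by a pattern in the commit message (`--grep`). The filters can be combined.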

In [92]:
%%bash
git log --author="ravkalia"

commit f0e4c829662df9eb67fd5c0abda002c9b7cd0afb
Author: Ravi Kalia <ravkalia@gmail.com>
Date:   Sun Mar 24 08:05:12 2024 -0500

    Replace `withCUDA` decorator: `withDevice` (#9082)
    
    Replace `withCUDA` for a `withDevice` decorator.
    
    Change variable name from devices to processors to reduce confusion
    against pytorch api (backends/devices) and reflect the hardware choices.
    
    Note that at this time:
    
    ## Hardware
    3 repertoires of hardware can be used to run pyTorch code:
    
    * CPU only
    * CPU and GPU
    * Unified Memory Single Chip
    
    ## Backend Software
    The backend is the software framework used to process tensors by
    pytorch. There are several. For example, the following are the backends
    available today:
    
    
    ```python
    torch.backends.cpu
    torch.backends.cuda
    torch.backends.cudnn
    torch.backends.mha
    torch.backends.mps
    torch.backends.mkl
    torch.backends.mkldnn
    torch.backends.nnpack
    t

    In `docs/source/get_started/introduction.rst`:
    
    ```diff
    - :pyg:`PyG` contains a large number of common benchmark datasets, *e.g.*, all Planetoid datasets (Cora, Citeseer, Pubmed), all graph classification datasets from `http://graphkernels.cs.tu-dortmund.de <http://graphkernels.cs.tu-dortmund.de/>`_ and their `cleaned versions <https://github.com/nd7141/graph_datasets>`_, the QM7 and QM9 dataset, and a handful of 3D mesh/point cloud datasets like FAUST, ModelNet10/40 and ShapeNet.
    + :pyg:`PyG` contains a large number of common benchmark datasets, *e.g.*, all Planetoid datasets (Cora, Citeseer, Pubmed), all graph classification datasets from `TUDatasets https://chrsmrrs.github.io/datasets/`_ and their `cleaned versions <https://github.com/nd7141/graph_datasets>`_, the QM7 and QM9 dataset, and a handful of 3D mesh/point cloud datasets like FAUST, ModelNet10/40 and ShapeNet.
    ```
    
    Please review these changes and merge the PR if everything is in order
    or 

In [95]:
%%bash
git log --author="ravkalia" --since="2022-01-01" --until="2024-02-29"

commit 25b2f208e671eeec285bfafa2e246ea0a234b312
Author: Ravi Kalia <ravkalia@gmail.com>
Date:   Wed Feb 21 11:11:33 2024 -0500

    docs: fix broken links to source of graph classification datasets (#8946)
    
    **Update Broken Dataset Links in Documentation**
    
    This PR addresses broken links in the documentation that pointed to the
    common benchmark datasets. The links were updated to point to the
    correct URL.
    
    Changes were made in the following files:
    
    1. `benchmark/kernel/README.md`
    2. `docs/source/get_started/introduction.rst`
    
    The specific changes are as follows:
    
    In `benchmark/kernel/README.md`:
    
    ```diff
    - Evaluation script for various methods on [common benchmark datasets](http://graphkernels.cs.tu-dortmund.de) via 10-fold cross validation, where a training fold is randomly sampled to serve as a validation set.
    + Evaluation script for various methods on [common benchmark datasets](https://chrsmrrs.github.io/dat

In [99]:
%%bash
git log --grep="docs" --since="2022-01-01" --until="2022-02-28"

commit 24a185e7268f70ee549c7a424b9426b9a18b5706
Author: Ramona Bendias <ramona.bendias@gmail.com>
Date:   Mon Feb 21 13:03:52 2022 +0000

    Add general `Explainer` Class (#4090)
    
    * Add base Explainer
    
    * Update Explainer
    
    * Fix test
    
    * Clean code
    
    * Update test/nn/models/test_explainer.py
    
    Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
    
    * Update torch_geometric/nn/models/explainer.py
    
    Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
    
    * Update torch_geometric/nn/models/explainer.py
    
    Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
    
    * Add hints and add get_num_hops
    
    * Fix
    
    * Change docstring
    
    * Update torch_geometric/utils/subgraph.py
    
    Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
    
    * Fix tests
    
    * Update torch_geometric/nn/models/explainer.py
    
    * Update torch_geometric/nn/models/explainer.py
    
    * 

In [101]:
%%bash
git log --oneline --grep="docs" --since="2022-01-01" --until="2022-02-28"

24a185e72 Add general `Explainer` Class (#4090)
6002170a5 Make models compatible to Captum (#3990)
14d588d4c Update attention.py (#4009)
50ff5e6d6 Add `full` extras to install command in contribution docs (#3991)
1e24b3a16 Refactor: `MLP` initialization (#3957)
3e4891be6 Doc improvements to set2set layers (#3889)
fac848c25 Let `TemporalData` inherit from `BaseData` and add docs (#3867)
0c29b0d5b Updated docstring for shape info - part 2 (#3739)
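When these filters are assembled from a script, it can help to validate the dates first, because a day that is not on the calendar (February 31, say) is easy to type. The sketch below only builds the argument list; `build_git_log_args` is a hypothetical helper, and it assumes dates are given in `YYYY-MM-DD` form, although `git log` itself accepts many other formats.

```python
from datetime import date
from typing import List, Optional


def build_git_log_args(
    author: Optional[str] = None,
    grep: Optional[str] = None,
    since: Optional[str] = None,
    until: Optional[str] = None,
    oneline: bool = False,
) -> List[str]:
    """Build a `git log` command line from optional filters (hypothetical helper)."""
    args = ["git", "log"]
    if oneline:
        args.append("--oneline")
    if author:
        args.append(f"--author={author}")
    if grep:
        args.append(f"--grep={grep}")
    for flag, value in (("--since", since), ("--until", until)):
        if value is not None:
            # Raises ValueError for impossible dates such as 2022-02-31.
            date.fromisoformat(value)
            args.append(f"{flag}={value}")
    return args
```

The resulting list can be passed to `subprocess.run(args, capture_output=True, text=True)` from inside the repository.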


## Git Grep vs Grep

The main differences between `git grep` and `grep` are:

* `git grep` only searches through your tracked files, while `grep` can search through any files.
* `git grep` is aware of your Git repository structure and can search through old commits, branches, etc., while `grep` only searches through the current state of files.
* `git grep` is usually faster than `grep -r` when searching through a Git repository, because it uses Git's index to visit only tracked files and so skips the `.git` directory and untracked files. The wall times in the two timing cells that follow are the figures to compare.

In [120]:
%%time
%%bash
git grep "test_sparse" > /dev/null

CPU times: user 1.7 ms, sys: 3.9 ms, total: 5.6 ms
Wall time: 28.7 ms


In [119]:
%%time
%%bash
grep -r "test_sparse" . > /dev/null

CPU times: user 1.69 ms, sys: 4.69 ms, total: 6.38 ms
Wall time: 365 ms


## Takeaways

There are many ways to search a repository, particularly a Git Repo. We outlined some use cases with examples for a "Unix-like" file directory and also a Git Repo.

In most cases use:

* `git grep` for searching strings in the repository in the current working tree or a specific commit

* `git log` for searching across commits.

There are many flags and options for these commands, and some combinations produce the same output. Be sure to check the documentation for more information.
