Addon pipeline for source string collection (#1160)
* Add addon pipeline for string collection

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Add test for collect_source_strings pipeline

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Update dockerfile to install xgettext

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Update CI to install xgettext

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Update docs

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

* Only supported on Linux

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

Co-authored-by: Philippe Ombredanne <pombredanne@nexb.com>

* Add CHANGELOG for CollectSourceStrings pipeline

Signed-off-by: Keshav Priyadarshi <git@keshav.space>

---------

Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Co-authored-by: Philippe Ombredanne <pombredanne@nexb.com>
keshav-space and pombredanne committed Apr 10, 2024
1 parent d6389b2 commit 1af8d99
Showing 10 changed files with 228 additions and 5 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -44,6 +44,9 @@ jobs:

      - name: Install universal ctags
        run: sudo apt-get install -y universal-ctags

      - name: Install xgettext
        run: sudo apt-get install -y gettext

      - name: Install dependencies
        run: make dev envfile
4 changes: 4 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,10 @@ v34.3.0 (unreleased)
- Associate resolved packages with their source codebase resource.
https://github.com/nexB/scancode.io/issues/1140

- Add a new `CollectSourceStrings` pipeline (addon) for collecting source strings using
  xgettext.
  https://github.com/nexB/scancode.io/pull/1160

v34.2.0 (2024-03-28)
--------------------

3 changes: 2 additions & 1 deletion Dockerfile
@@ -40,7 +40,7 @@ ENV PYTHONPATH $PYTHONPATH:$APP_DIR

# OS requirements as per
# https://scancode-toolkit.readthedocs.io/en/latest/getting-started/install.html
# Also install universal-ctags for symbol collection.
# Also install universal-ctags and xgettext for symbol and string collection.
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
    bzip2 \
@@ -60,6 +60,7 @@ RUN apt-get update \
    git \
    wait-for-it \
    universal-ctags \
    gettext \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

8 changes: 8 additions & 0 deletions docs/built-in-pipelines.rst
@@ -42,6 +42,14 @@ Analyse Docker Windows Image
    :members:
    :member-order: bysource

.. _pipeline_collect_source_strings:

Collect Source Strings (addon)
--------------------------------
.. autoclass:: scanpipe.pipelines.collect_source_strings.CollectSourceStrings()
    :members:
    :member-order: bysource

.. _pipeline_collect_symbols:

Collect Codebase Symbols (addon)
19 changes: 15 additions & 4 deletions docs/installation.rst
@@ -261,13 +261,24 @@ See also `ScanCode-toolkit Prerequisites <https://scancode-toolkit.readthedocs.i
latest/getting-started/install.html#prerequisites>`_ for more details.

For the :ref:`pipeline_collect_symbols` pipeline, `Universal Ctags <https://github.com/universal-ctags/ctags>`_ is needed.
On **Linux** install it using::

sudo apt-get install universal-ctags
* On **Linux** install it using::

On **MacOS** install Universal Ctags using Homebrew::
sudo apt-get install universal-ctags

brew install universal-ctags
* On **MacOS** install Universal Ctags using Homebrew::

brew install universal-ctags

For the :ref:`pipeline_collect_source_strings` pipeline, `gettext <https://www.gnu.org/software/gettext/>`_ is needed.

* On **Linux** install it using::

sudo apt-get install gettext

* On **MacOS** install gettext using Homebrew::

brew install gettext
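
Before running either pipeline, it can help to confirm that the external tools are actually on the ``PATH``. The following is a minimal sketch using only the Python standard library. The helper name ``find_missing_tools`` is hypothetical and not part of ScanCode.io, and the executable names ``ctags`` and ``xgettext`` are assumptions about how the packages above install their binaries:

```python
import shutil


def find_missing_tools(tool_names):
    """Return the names from tool_names that cannot be found on the PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust if your platform installs them differently.
    missing = find_missing_tools(["ctags", "xgettext"])
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All tools found.")
```

An empty list means every named tool was found; anything returned still needs to be installed.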

Clone and Configure
^^^^^^^^^^^^^^^^^^^
42 changes: 42 additions & 0 deletions scanpipe/pipelines/collect_source_strings.py
@@ -0,0 +1,42 @@
# SPDX-License-Identifier: Apache-2.0
#
# http://nexb.com and https://github.com/nexB/scancode.io
# The ScanCode.io software is licensed under the Apache License version 2.0.
# Data generated with ScanCode.io is provided as-is without warranties.
# ScanCode is a trademark of nexB Inc.
#
# You may not use this software except in compliance with the License.
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software distributed
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for the
# specific language governing permissions and limitations under the License.
#
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
# ScanCode.io should be considered or used as legal advice. Consult an Attorney
# for any legal advice.
#
# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/nexB/scancode.io for support and download.

from scanpipe.pipelines import Pipeline
from scanpipe.pipes import source_strings


class CollectSourceStrings(Pipeline):
    """Collect source strings from codebase files and keep them in extra data field."""

    download_inputs = False
    is_addon = True

    @classmethod
    def steps(cls):
        return (cls.collect_and_store_resource_strings,)

    def collect_and_store_resource_strings(self):
        """
        Collect source strings from codebase files using gettext and store
        them in the extra data field.
        """
        source_strings.collect_and_store_resource_strings(self.project, self.log)
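
The pipeline above declares its work as a tuple of methods returned by ``steps()``. As a rough illustration of that pattern only, here is a self-contained sketch. ``MiniPipeline`` and ``CollectStrings`` are hypothetical stand-ins and do not reproduce ScanCode.io's actual ``Pipeline`` base class or its ``execute()`` behavior:

```python
class MiniPipeline:
    """Hypothetical stand-in for a pipeline base class (not ScanCode.io's)."""

    @classmethod
    def steps(cls):
        raise NotImplementedError

    def execute(self):
        """Run each declared step in order and return the names of the steps run."""
        executed = []
        for step in self.steps():
            # Steps are plain functions looked up on the class, so the
            # instance is passed explicitly.
            step(self)
            executed.append(step.__name__)
        return executed


class CollectStrings(MiniPipeline):
    def __init__(self):
        self.collected = []

    @classmethod
    def steps(cls):
        return (cls.collect,)

    def collect(self):
        # Placeholder work; a real step would inspect codebase files.
        self.collected.append("Enter the desired length of your password:")
```

Declaring steps on the class keeps the run order explicit and lets a runner report progress per step.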
66 changes: 66 additions & 0 deletions scanpipe/pipes/source_strings.py
@@ -0,0 +1,66 @@
# SPDX-License-Identifier: Apache-2.0
#
# http://nexb.com and https://github.com/nexB/scancode.io
# The ScanCode.io software is licensed under the Apache License version 2.0.
# Data generated with ScanCode.io is provided as-is without warranties.
# ScanCode is a trademark of nexB Inc.
#
# You may not use this software except in compliance with the License.
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software distributed
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for the
# specific language governing permissions and limitations under the License.
#
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
# ScanCode.io should be considered or used as legal advice. Consult an Attorney
# for any legal advice.
#
# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/nexB/scancode.io for support and download.

from source_inspector import strings_xgettext

from scanpipe.pipes import LoopProgress


class XgettextNotFound(Exception):
    pass


def collect_and_store_resource_strings(project, logger=None):
    """
    Collect source strings from codebase files using xgettext and store
    them in the extra data field.
    """
    if not strings_xgettext.is_xgettext_installed():
        raise XgettextNotFound(
            "``xgettext`` not found. Install ``gettext`` to use this pipeline."
        )

    project_files = project.codebaseresources.files()

    resources = project_files.filter(
        is_binary=False,
        is_archive=False,
        is_media=False,
    )

    resources_count = resources.count()

    resource_iterator = resources.iterator(chunk_size=2000)
    progress = LoopProgress(resources_count, logger)

    for resource in progress.iter(resource_iterator):
        _collect_and_store_resource_strings(resource)


def _collect_and_store_resource_strings(resource):
    """
    Collect strings from a resource using xgettext and store
    them in the extra data field.
    """
    result = strings_xgettext.collect_strings(resource.location)
    strings = [item["string"] for item in result if "string" in item]
    resource.update_extra_data({"source_strings": strings})
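
The last helper keeps only the ``"string"`` value of each collected item and skips items without that key. A small sketch of that filtering step on made-up data follows. The sample items, including the ``line_numbers`` key, are assumptions for illustration and may not match what ``strings_xgettext.collect_strings`` actually returns:

```python
def extract_strings(result):
    """Keep the "string" value of every item that has one, in input order."""
    return [item["string"] for item in result if "string" in item]


# Hypothetical sample data; the real item shape may differ.
sample_result = [
    {"string": "Enter the desired length of your password:", "line_numbers": [3]},
    {"line_numbers": [7]},  # no "string" key, so this item is skipped
    {"string": "abc"},
]

strings = extract_strings(sample_result)
extra_data = {"source_strings": strings}
```

Checking for the key before indexing means a malformed item is skipped rather than raising ``KeyError``.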
60 changes: 60 additions & 0 deletions scanpipe/tests/pipes/test_source_strings.py
@@ -0,0 +1,60 @@
# SPDX-License-Identifier: Apache-2.0
#
# http://nexb.com and https://github.com/nexB/scancode.io
# The ScanCode.io software is licensed under the Apache License version 2.0.
# Data generated with ScanCode.io is provided as-is without warranties.
# ScanCode is a trademark of nexB Inc.
#
# You may not use this software except in compliance with the License.
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software distributed
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for the
# specific language governing permissions and limitations under the License.
#
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
# ScanCode.io should be considered or used as legal advice. Consult an Attorney
# for any legal advice.
#
# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/nexB/scancode.io for support and download.

import sys
from pathlib import Path
from unittest import skipIf

from django.test import TestCase

from scanpipe import pipes
from scanpipe.models import Project
from scanpipe.pipes import source_strings
from scanpipe.pipes.input import copy_input


class ScanPipeSourceStringsPipesTest(TestCase):
    data_location = Path(__file__).parent.parent / "data"

    def setUp(self):
        self.project1 = Project.objects.create(name="Analysis")

    @skipIf(sys.platform != "linux", "Only supported on Linux")
    def test_scanpipe_pipes_symbols_collect_and_store_resource_strings(self):
        dir = self.project1.codebase_path / "codefile"
        dir.mkdir(parents=True)

        file_location = self.data_location / "d2d-javascript" / "from" / "main.js"
        copy_input(file_location, dir)

        pipes.collect_and_create_codebase_resources(self.project1)

        source_strings.collect_and_store_resource_strings(self.project1)

        main_file = self.project1.codebaseresources.files()[0]
        result_extra_data_strings = main_file.extra_data.get("source_strings")

        expected_extra_data_strings = [
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()_-+=",  # noqa
            "Enter the desired length of your password:",
        ]
        self.assertCountEqual(expected_extra_data_strings, result_extra_data_strings)
27 changes: 27 additions & 0 deletions scanpipe/tests/test_pipelines.py
@@ -1240,3 +1240,30 @@ def test_scanpipe_collect_symbols_pipeline_integration(self):
        result_extra_data_symbols = main_file.extra_data.get("source_symbols")
        expected_extra_data_symbols = ["generatePassword", "passwordLength", "charSet"]
        self.assertCountEqual(expected_extra_data_symbols, result_extra_data_symbols)

    @skipIf(sys.platform != "linux", "Only supported on Linux")
    def test_scanpipe_collect_source_strings_pipeline_integration(self):
        pipeline_name = "collect_source_strings"
        project1 = Project.objects.create(name="Analysis")

        dir = project1.codebase_path / "codefile"
        dir.mkdir(parents=True)

        file_location = self.data_location / "d2d-javascript" / "from" / "main.js"
        copy_input(file_location, dir)

        pipes.collect_and_create_codebase_resources(project1)

        run = project1.add_pipeline(pipeline_name)
        pipeline = run.make_pipeline_instance()

        exitcode, out = pipeline.execute()
        self.assertEqual(0, exitcode, msg=out)

        main_file = project1.codebaseresources.files()[0]
        result_extra_data_strings = main_file.extra_data.get("source_strings")
        expected_extra_data_strings = [
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()_-+=",  # noqa
            "Enter the desired length of your password:",
        ]
        self.assertCountEqual(expected_extra_data_strings, result_extra_data_strings)
1 change: 1 addition & 0 deletions setup.cfg
@@ -131,6 +131,7 @@ scancodeio_pipelines =
    analyze_docker_image = scanpipe.pipelines.docker:Docker
    analyze_root_filesystem_or_vm_image = scanpipe.pipelines.root_filesystem:RootFS
    analyze_windows_docker_image = scanpipe.pipelines.docker_windows:DockerWindows
    collect_source_strings = scanpipe.pipelines.collect_source_strings:CollectSourceStrings
    collect_symbols = scanpipe.pipelines.collect_symbols:CollectSymbols
    find_vulnerabilities = scanpipe.pipelines.find_vulnerabilities:FindVulnerabilities
    inspect_elf_binaries = scanpipe.pipelines.inspect_elf_binaries:InspectELFBinaries
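
Each entry above maps a pipeline name to a ``module.path:ClassName`` target. To illustrate how such a target string can be turned into an object, here is a minimal sketch. ``resolve_entry_point`` is a hypothetical helper, not the mechanism ScanCode.io itself uses, and it is demonstrated on a standard-library target so it runs without ScanCode.io installed:

```python
import importlib


def resolve_entry_point(value):
    """Resolve a "module.path:AttributeName" string to the object it names."""
    module_path, _, attr_name = value.partition(":")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Standard-library target used purely for demonstration.
dumps = resolve_entry_point("json:dumps")
encoded = dumps({"a": 1})
```

With ScanCode.io installed, the same helper could presumably be pointed at ``scanpipe.pipelines.collect_source_strings:CollectSourceStrings``.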