feat: Add the main file for determining the package reachability level of a Python project #2131
cuixq merged 19 commits into google:python-reach
|
can you add some description for this PR? |
cuixq
left a comment
also let's move this to experimental/pythonreach folder and add a README file.
| @@ -0,0 +1,415 @@ | |||
| # This file is automatically @generated by Poetry 2.1.3 and should not be changed by hand. | |||
I don't see this example project is being referenced anywhere in test?
this is for running with the main.go in CLI. go run main.go --directory=example
|
|
||
| // 6. Comparison between the collected imported libraries and the PYPI dependencies of the libraries | ||
| // to find the reachability of the PYPI dependencies. | ||
| for _, library := range importedLibraries { |
let's make a struct for the output so that we can test the reachability result properly as well.
https://github.com/google/osv-scalibr/blob/main/enricher/reachability/java/java.go can be a reference about what the results should look like.
|
|
||
| for _, tc := range testCases { | ||
| t.Run(tc.name, func(t *testing.T) { | ||
| ctx := context.Background() |
you may use t.Context() for context in testing
experimental/pythonreach/main.go
Outdated
| } | ||
| } | ||
|
|
||
| downloadFileSource, err := downloadPackageSource(downloadURL) |
thank you for the link!
experimental/pythonreach/testdata/multifileswithentrypoint/script.sh
Outdated
|
we probably need |
|
This pull request has not had any activity for 60 days and will be automatically closed in two weeks |
|
Automatically closing stale pull request |
|
/gemini review |
Code Review
This PR introduces a new tool for analyzing Python dependency reachability, which is a great feature. The implementation follows a logical flow: finding entry points, parsing manifests, analyzing imports, and then diving into library source code. However, the current implementation has several critical and high-severity issues that need to be addressed. The tool downloads and extracts package source code into the current working directory, which is unsafe. There are cases where errors are ignored, which can lead to panics. The logic for identifying manifest files and parsing Python code has some flaws that will cause the tool to fail on valid projects or produce incorrect results. I've left specific comments on these points. Addressing these issues will significantly improve the tool's reliability and usability.
| // TODO: Find alternative ways for these regexes. | ||
| mainEntryRegex = regexp.MustCompile(`^\s*if\s+__name__\s*==\s*['"]__main__['"]\s*:`) | ||
| importRegex = regexp.MustCompile(`^\s*import\s+([a-zA-Z0-9_.]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`) | ||
| fromImportRegex = regexp.MustCompile(`^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+)`) |
The regular expression for from ... import ... statements does not handle multi-line imports that use parentheses, such as:
from my_library import (
    module1,
    module2,
)
The current regex ^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+) will only capture the first line. To handle this, you would need a more complex parsing logic that can read across multiple lines when an opening parenthesis is detected.
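One possible way to keep a line-based regex working for parenthesized imports is to join the statement into a single logical line before matching. The sketch below is only an illustration: joinParenthesizedImports is a hypothetical helper that is not part of the PR, and it assumes that '#' always starts a comment on import lines and that parentheses never appear inside string literals there.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// joinParenthesizedImports merges each parenthesized "from x import (...)"
// statement into one logical line; all other lines pass through unchanged.
// Assumptions of this sketch: '#' always starts a comment on import lines,
// and parentheses never appear inside strings on those lines.
func joinParenthesizedImports(src string) []string {
	var out []string
	var pending strings.Builder
	depth := 0 // number of currently unclosed parentheses
	scanner := bufio.NewScanner(strings.NewReader(src))
	for scanner.Scan() {
		line := scanner.Text()
		if depth == 0 && !strings.HasPrefix(strings.TrimSpace(line), "from ") {
			out = append(out, line)
			continue
		}
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // drop a trailing comment
		}
		depth += strings.Count(line, "(") - strings.Count(line, ")")
		pending.WriteString(strings.TrimSpace(line) + " ")
		if depth > 0 {
			continue // the statement continues on the next line
		}
		depth = 0
		joined := strings.NewReplacer("(", " ", ")", " ").Replace(pending.String())
		joined = strings.Join(strings.Fields(joined), " ")
		out = append(out, strings.TrimSuffix(joined, ","))
		pending.Reset()
	}
	return out
}

func main() {
	src := "from my_library import (\n    module1,\n    module2,\n)\nimport os\n"
	for _, line := range joinParenthesizedImports(src) {
		fmt.Println(line)
	}
}
```

The joined line can then be fed to the existing fromImportRegex unchanged. Backslash continuations and unterminated parentheses are not handled here, which is one more reason an AST parser may be the better long-term choice.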
|
|
||
| return scanFile(file, func(line string) error { | ||
| for _, module := range libraryInfo.Modules { | ||
| searchTerm := fmt.Sprintf("def %s(", module.Name) |
The method for finding where an imported item is defined is based on a simple string search: searchTerm := fmt.Sprintf("def %s(", module.Name). This is not very robust and can lead to both false positives and false negatives.
- False positives: It can match function names in comments or strings.
- False negatives: It won't match functions with different spacing (e.g., def my_func(...)), functions defined in classes (methods), or other imported symbols like classes or variables.
Consider using regular expressions or, for a more robust solution, a Python AST parser to accurately locate definitions.
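As a rough illustration of the regular-expression option, the sketch below builds one anchored pattern per symbol name that accepts function, async function, class, and simple assignment forms. definitionRegex is a hypothetical helper, not code from the PR. It only matches definitions starting at column 0, so definitions nested under if/try blocks or inside classes are not found, and a docstring line that happens to start with "def name(" would still match.

```go
package main

import (
	"fmt"
	"regexp"
)

// definitionRegex returns a pattern that matches a line defining name at
// column 0. Indented definitions are intentionally not matched.
func definitionRegex(name string) *regexp.Regexp {
	n := regexp.QuoteMeta(name)
	return regexp.MustCompile(
		`^(?:(?:async\s+)?def\s+` + n + `\s*\(` + // def name( / async def name(
			`|class\s+` + n + `\s*[(:]` + // class name: / class name(
			`|` + n + `\s*(?::[^=]+)?=[^=])`) // name = ... / name: T = ...
}

func main() {
	re := definitionRegex("my_func")
	lines := []string{
		"def  my_func(x):",
		"async def my_func():",
		"# def my_func(",
		"def my_func_extra(x):",
		"my_func = other_func",
	}
	for _, line := range lines {
		fmt.Printf("%-24q defines my_func: %v\n", line, re.MatchString(line))
	}
}
```

Compiling the pattern once per symbol and reusing it for every scanned line keeps the cost close to the current strings.Contains approach.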
experimental/pythonreach/main.go
Outdated
| } | ||
|
|
||
| for _, module := range library.Modules { | ||
| if module.SourceDefinedPaths == nil { |
The logic here assumes that if the source definition path for a module is not found (module.SourceDefinedPaths == nil), all of the library's dependencies are reachable. This is a risky assumption and can lead to incorrect reachability analysis. It would be better to report the reachability as "unknown" or "undetermined" in this case, rather than defaulting to "reachable". This makes the tool's output more accurate about its limitations.
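A three-valued result is one way to express this. The following is a minimal sketch under assumed names: the Reachability type and the classify function are hypothetical and do not exist in the PR. It reports "unknown" whenever the definition of the imported item could not be located, instead of defaulting to "reachable".

```go
package main

import "fmt"

// Reachability is a hypothetical three-valued analysis result.
type Reachability int

const (
	Unknown Reachability = iota
	Reachable
	Unreachable
)

func (r Reachability) String() string {
	switch r {
	case Reachable:
		return "reachable"
	case Unreachable:
		return "unreachable"
	default:
		return "unknown"
	}
}

// classify decides the status of one dependency. sourceFound says whether the
// definition of the imported item was located in the library source;
// importedLibs holds the libraries imported by the files that define it.
func classify(dep string, sourceFound bool, importedLibs map[string]bool) Reachability {
	if !sourceFound {
		return Unknown // the analysis could not look inside the library
	}
	if importedLibs[dep] {
		return Reachable
	}
	return Unreachable
}

func main() {
	imported := map[string]bool{"urllib3": true}
	fmt.Println(classify("urllib3", true, imported))
	fmt.Println(classify("idna", true, imported))
	fmt.Println(classify("idna", false, imported))
}
```

Callers that need a conservative answer can still treat Unknown like Reachable, but the output no longer hides the distinction.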
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/gemini review |
Code Review
This pull request introduces a new tool to determine the reachability of dependencies in a Python project, starting with support for poetry.lock. The implementation is a great start and covers the end-to-end flow from parsing manifests to analyzing library source code.
My review focuses on improving the robustness and correctness of the implementation. The main points of feedback are:
- Replacing fragile regex-based Python code parsing with a more reliable AST-based approach to handle various import syntaxes correctly.
- Improving error handling and resource management to make the tool more resilient.
- Increasing test coverage for the core analysis logic to ensure correctness and prevent future regressions.
I've provided specific suggestions for each of these points in the comments.
| // TODO: Find alternative ways for these regexes. | ||
| mainEntryRegex = regexp.MustCompile(`^\s*if\s+__name__\s*==\s*['"]__main__['"]\s*:`) | ||
| importRegex = regexp.MustCompile(`^\s*import\s+([a-zA-Z0-9_.]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`) | ||
| fromImportRegex = regexp.MustCompile(`^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+)`) | ||
| importItemRegex = regexp.MustCompile(`([a-zA-Z0-9_.*]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`) | ||
| memberImportRegex = regexp.MustCompile(`^\s*import (\w+)\.(\w+)`) |
The current implementation uses regular expressions to parse Python import statements and search for function definitions (e.g., in getImportedItemsFilePaths). This approach is brittle and can easily fail with different code formatting, or for constructs other than simple function imports (like classes or variables). The TODO comment acknowledges this.
To make the analysis robust and accurate, I strongly recommend using a Go library that can parse Python code into an Abstract Syntax Tree (AST). Libraries like github.com/go-python/gpython/parser would allow you to reliably identify all imported symbols and their definitions. This is crucial for the correctness of the reachability analysis.
| if err != nil { | ||
| log.Printf("failed to parse metadata from %s: %v", fileName, err) | ||
| } |
When pypi.SdistMetadata fails, the error is logged, but the program continues. This means libraryInfo.Dependencies will be empty, leading to incorrect and misleading reachability results (it will report all dependencies as unreachable). This should be treated as a more significant error. I suggest either terminating the analysis for this library and reporting it as failed, or exiting the program entirely, to avoid producing incorrect output.
| tmpDir, err := os.MkdirTemp("", "pythonreach-src-") | ||
| if err != nil { | ||
| return fmt.Errorf("failed to create temp dir: %w", err) | ||
| } | ||
|
|
||
| tmpFile, err := os.CreateTemp(tmpDir, fileName) | ||
| if err != nil { | ||
| _ = os.RemoveAll(tmpDir) | ||
| return fmt.Errorf("failed to create temp file: %w", err) | ||
| } | ||
|
|
||
| if _, err := io.Copy(tmpFile, bytes.NewReader(sourceFile)); err != nil { | ||
| tmpFile.Close() | ||
| _ = os.RemoveAll(tmpDir) | ||
| return fmt.Errorf("failed to write to temp file: %w", err) | ||
| } | ||
| if err := tmpFile.Close(); err != nil { | ||
| _ = os.RemoveAll(tmpDir) | ||
| return fmt.Errorf("failed to close temp file: %w", err) | ||
| } | ||
|
|
||
| // Re-open the temp file for extraction. | ||
| f, err := os.Open(tmpFile.Name()) | ||
| if err != nil { | ||
| _ = os.RemoveAll(tmpDir) | ||
| return fmt.Errorf("failed to open temp file for extraction: %w", err) | ||
| } | ||
| defer f.Close() | ||
|
|
||
| if err := extractCompressedPackageSource(f, tmpDir); err != nil { | ||
| _ = os.RemoveAll(tmpDir) | ||
| return err | ||
| } | ||
|
|
||
| // Record the extracted source dir on the library info for later processing. | ||
| libraryInfo.SourceDir = tmpDir |
The retrieveSourceAndCollectDependencies function downloads the entire package source into a memory buffer (sourceFile), then writes it to a temporary file, and then re-opens it for extraction. For large packages, this can be memory-intensive, and the intermediate disk I/O is unnecessary.
You can simplify this by modifying extractCompressedPackageSource to accept an io.Reader instead of an *os.File. Then, you can pass a bytes.NewReader(sourceFile) directly to it, avoiding the need to create, write, and re-open the temporary file for extraction. This will make the code simpler and more efficient.
|
/gemini review |
Code Review
This pull request introduces a new experimental tool for determining the reachability of Python package dependencies. The core logic is implemented in Go and currently supports projects using poetry.lock. While the overall goal is valuable, the current implementation has several significant issues related to correctness and robustness. The most critical problem is the reliance on regular expressions for parsing Python code, which is highly unreliable. Other major concerns include fragile logic for locating source files and symbol definitions, and bugs in file path handling. I've provided several high-severity comments and suggestions to address these fundamental issues. Addressing these points will be crucial for making the tool reliable.
| func parsePoetryLock(ctx context.Context, fpath string) ([]*LibraryInfo, error) { | ||
| dir := filepath.Dir(fpath) | ||
| fsys := scalibrfs.DirFS(dir) | ||
| r, err := fsys.Open("poetry.lock") | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to open %s: %w", fpath, err) | ||
| } | ||
| defer r.Close() | ||
|
|
||
| input := &filesystem.ScanInput{ | ||
| FS: fsys, | ||
| Path: fpath, | ||
| Reader: r, | ||
| } | ||
| extractor := poetrylock.New() | ||
| inventory, err := extractor.Extract(ctx, input) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to extract from %s: %w", fpath, err) | ||
| } | ||
|
|
||
| libraryInfos := []*LibraryInfo{} | ||
| for _, i := range inventory.Packages { | ||
| libraryInfos = append(libraryInfos, &LibraryInfo{Name: i.Name, Version: i.Version}) | ||
| } | ||
|
|
||
| return libraryInfos, nil | ||
| } |
The path handling in this function is incorrect. It receives a path fpath, but then it gets the parent directory (filepath.Dir) and tries to open a hardcoded "poetry.lock" file from there. This logic is fragile and will likely fail. The function should be self-contained and parse the file specified by fpath. This also requires a change at the call site in main().
func parsePoetryLock(ctx context.Context, fpath string) ([]*LibraryInfo, error) {
	dir := filepath.Dir(fpath)
	fname := filepath.Base(fpath)
	fsys := scalibrfs.DirFS(dir)
	r, err := fsys.Open(fname)
	if err != nil {
		return nil, fmt.Errorf("failed to open %s: %w", fpath, err)
	}
	defer r.Close()
	input := &filesystem.ScanInput{
		FS:     fsys,
		Path:   fname,
		Reader: r,
	}
	extractor := poetrylock.New()
	inventory, err := extractor.Extract(ctx, input)
	if err != nil {
		return nil, fmt.Errorf("failed to extract from %s: %w", fpath, err)
	}
	libraryInfos := []*LibraryInfo{}
	for _, i := range inventory.Packages {
		libraryInfos = append(libraryInfos, &LibraryInfo{Name: i.Name, Version: i.Version})
	}
	return libraryInfos, nil
}
| searchTerm := fmt.Sprintf("def %s(", module.Name) | ||
| if strings.Contains(line, searchTerm) { |
Using a simple string search for "def %s(" to find where an item is defined is highly unreliable. This approach will only find function definitions and will miss classes, variables, or entire submodules that are imported. It's also prone to false positives if the search term appears in comments or strings.
To accurately locate definitions, you should use an AST parser to identify function definitions, class definitions, and variable assignments at the top level of modules.
| poetryLibraryInfos := []*LibraryInfo{} | ||
| for _, manifestFile := range manifestFiles { | ||
| switch manifestFile { | ||
| case "poetry.lock": | ||
| // Parse the poetry.lock file to get library information. | ||
| poetryLibraryInfos, err = parsePoetryLock(ctx, filepath.Join(*directory)) | ||
| if err != nil { | ||
| log.Printf("Error collecting libraries in poetry.lock: %v\n", err) | ||
| } | ||
| } | ||
| } |
The logic for finding and parsing manifest files is flawed. The code identifies manifest files but then doesn't use the results. Instead, it makes a single, hardcoded call to parsePoetryLock with just the directory path, which is incorrect and will fail with the recommended fix in parsePoetryLock.
poetryLibraryInfos := []*LibraryInfo{}
for _, manifestFile := range manifestFiles {
	switch manifestFile {
	case "poetry.lock":
		// Parse the poetry.lock file to get library information.
		fullPath := filepath.Join(*directory, manifestFile)
		libs, err := parsePoetryLock(ctx, fullPath)
		if err != nil {
			log.Printf("Error collecting libraries in poetry.lock: %v\n", err)
			continue
		}
		poetryLibraryInfos = append(poetryLibraryInfos, libs...)
	}
}
experimental/pythonreach/main.go
Outdated
| slices.Sort(module.ImportedLibraryNames) | ||
| importedLibs := slices.Compact(module.ImportedLibraryNames) | ||
| for _, importedLib := range importedLibs { | ||
| if strings.Contains(importedLib, dep) { |
experimental/pythonreach/main.go
Outdated
| _ = os.RemoveAll(tmpDir) | ||
| return fmt.Errorf("failed to create temp file: %w", err) | ||
| } |
Errors from os.RemoveAll are being ignored. While this is part of an error handling path for cleanup, failing to remove temporary directories can lead to disk space leaks over time. It's better practice to at least log these failures.
if err := os.RemoveAll(tmpDir); err != nil {
	log.Printf("failed to remove temp dir %s: %v", tmpDir, err)
}
return fmt.Errorf("failed to create temp file: %w", err)
experimental/pythonreach/main.go
Outdated
| } | ||
|
|
||
| // findImportedItemPaths finds libraries in import statements in the files. | ||
| func findImportedLibrary(libraryInfo *LibraryInfo) error { |
|
/gemini review |
Code Review
This pull request introduces a new experimental tool to determine the reachability of dependencies in a Python project by analyzing import statements. The core logic is implemented in Go and supports poetry.lock files. While the overall structure and approach are sound, there are several critical correctness issues in the implementation. The use of regular expressions to parse Python import statements and find symbol definitions is brittle and incorrect in several key ways, which will lead to inaccurate results. I've provided specific feedback and suggestions to address these issues by improving the parsing logic. There are also opportunities to improve efficiency in file handling.
| if match := importRegex.FindStringSubmatch(line); match != nil { | ||
| libraryName := match[1] | ||
| alias := match[2] | ||
| lib := getOrCreateLibraryInfo(importedLibraries, libraryName) | ||
| lib.Alias = alias | ||
|
|
The regular expression for import statements can incorrectly identify the library name for sub-module imports. For example, for import a.b.c, the library is a, but the regex captures a.b.c. This will cause failures when matching against dependencies. The top-level package should be extracted.
| if match := importRegex.FindStringSubmatch(line); match != nil { | |
| libraryName := match[1] | |
| alias := match[2] | |
| lib := getOrCreateLibraryInfo(importedLibraries, libraryName) | |
| lib.Alias = alias | |
| if match := importRegex.FindStringSubmatch(line); match != nil { | |
| libraryName := strings.Split(match[1], ".")[0] | |
| alias := match[2] | |
| lib := getOrCreateLibraryInfo(importedLibraries, libraryName) | |
| lib.Alias = alias | |
| libraryName := match[1] | ||
| items := match[2] | ||
|
|
||
| lib := getOrCreateLibraryInfo(importedLibraries, libraryName) |
The regular expression for from ... import ... statements can incorrectly identify the library name for sub-module imports. For example, for from a.b import c, the library is a, but the regex captures a.b. This will cause failures when matching against dependencies. The top-level package should be extracted.
| libraryName := match[1] | |
| items := match[2] | |
| lib := getOrCreateLibraryInfo(importedLibraries, libraryName) | |
| libraryName := strings.Split(match[1], ".")[0] | |
| items := match[2] | |
| lib := getOrCreateLibraryInfo(importedLibraries, libraryName) |
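Both suggestions boil down to the same step, which could live in one small helper. The sketch below reuses the two regexes quoted from the PR; topLevelPackage is a hypothetical name. It additionally treats relative imports such as from . import x as "no library", on the assumption that they refer to the project itself rather than to a PyPI dependency.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	importRegex     = regexp.MustCompile(`^\s*import\s+([a-zA-Z0-9_.]+)(?:\s+as\s+([a-zA-Z0-9_]+))?`)
	fromImportRegex = regexp.MustCompile(`^\s*from\s+([a-zA-Z0-9_.]+)\s+import\s+(.+)`)
)

// topLevelPackage returns the top-level package named by an import line.
// ok is false for lines that are not imports and for relative imports.
func topLevelPackage(line string) (name string, ok bool) {
	var dotted string
	if m := importRegex.FindStringSubmatch(line); m != nil {
		dotted = m[1]
	} else if m := fromImportRegex.FindStringSubmatch(line); m != nil {
		dotted = m[1]
	} else {
		return "", false
	}
	// "a.b.c" -> "a"; a leading dot (relative import) yields an empty name.
	name, _, _ = strings.Cut(dotted, ".")
	return name, name != ""
}

func main() {
	for _, line := range []string{"import a.b.c as x", "from a.b import c", "from . import sibling", "x = 1"} {
		name, ok := topLevelPackage(line)
		fmt.Printf("%-24q -> %q (import of a library: %v)\n", line, name, ok)
	}
}
```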
| } else if match := memberImportRegex.FindStringSubmatch(line); match != nil { | ||
| libraryName := match[1] | ||
| moduleName := match[2] | ||
|
|
||
| lib := getOrCreateLibraryInfo(importedLibraries, libraryName) | ||
| lib.Modules = append(lib.Modules, &ModuleInfo{Name: moduleName}) | ||
| } |
This regex ^\s*import (\w+)\.(\w+) seems to be intended to handle member imports, but it's redundant with the main importRegex and also incorrectly parses the library. The importRegex already handles import a.b, and as noted in another comment, it should be modified to extract the top-level package a. This block of code is likely unnecessary and contributes to incorrect parsing. This else if block can probably be removed entirely if the main importRegex logic is corrected to handle top-level package extraction.
|
|
||
| return scanFile(file, func(line string) error { | ||
| for _, module := range libraryInfo.Modules { | ||
| searchTerm := fmt.Sprintf("def %s(", module.Name) |
Using a simple string search with fmt.Sprintf("def %s(", module.Name) to find function definitions is very brittle and will miss many cases. For example, it won't find:
- Functions with different spacing or no arguments (def my_func() :).
- async functions.
- Class definitions (class MyClass:).
- Imported variables.
This will lead to many imported symbols not being found in the source code, making the reachability analysis incomplete. To reliably locate definitions, you should use a Python Abstract Syntax Tree (AST) parser. There are Go libraries available for parsing Python code which would make this much more robust.
| for _, dep := range library.Dependencies { | ||
| reachable := false | ||
| for _, importedLib := range importedLibs { | ||
| if strings.Contains(importedLib, dep) { |
The reachability check strings.Contains(importedLib, dep) is not safe and can produce false positives. For example, if a module imports a library named foobar and there is a dependency named foo, this check will incorrectly mark foo as reachable. The check should be for an exact match.
| if strings.Contains(importedLib, dep) { | |
| if importedLib == dep { |
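One caveat with a plain == comparison is that package names are often spelled differently across metadata and source (charset-normalizer versus charset_normalizer, for example). The sketch below compares names after the PyPI-style normalization from PEP 503 (lowercase, with runs of '-', '_' and '.' collapsed to '-'). normalizeName and isReachable are hypothetical helpers, and the approach is an assumption on top of the review suggestion. It still cannot link a distribution to an import name that is genuinely different, such as PyYAML being imported as yaml.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var separatorRun = regexp.MustCompile(`[-_.]+`)

// normalizeName applies PEP 503 style normalization to a package name.
func normalizeName(name string) string {
	return strings.ToLower(separatorRun.ReplaceAllString(name, "-"))
}

// isReachable reports whether dep equals one of the imported top-level
// library names after normalization. Substrings never count as a match.
func isReachable(dep string, importedLibs []string) bool {
	want := normalizeName(dep)
	for _, lib := range importedLibs {
		if normalizeName(lib) == want {
			return true
		}
	}
	return false
}

func main() {
	imported := []string{"foobar", "charset_normalizer"}
	fmt.Println(isReachable("foo", imported))
	fmt.Println(isReachable("charset-normalizer", imported))
}
```

This expects importedLibs to hold top-level package names only, since a dotted module path would be normalized as a whole.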
experimental/pythonreach/main.go
Outdated
| tmpFile, err := os.CreateTemp(tmpDir, fileName) | ||
| if err != nil { | ||
| if err := os.RemoveAll(tmpDir); err != nil { | ||
| log.Printf("failed to remove temp dir %s: %v", tmpDir, err) | ||
| } | ||
| return fmt.Errorf("failed to create temp file: %w", err) | ||
| } | ||
|
|
||
| if _, err := io.Copy(tmpFile, bytes.NewReader(sourceFile)); err != nil { | ||
| tmpFile.Close() | ||
| if err := os.RemoveAll(tmpDir); err != nil { | ||
| log.Printf("failed to remove temp dir %s: %v", tmpDir, err) | ||
| } | ||
| return fmt.Errorf("failed to write to temp file: %w", err) | ||
| } | ||
| if err := tmpFile.Close(); err != nil { | ||
| if err := os.RemoveAll(tmpDir); err != nil { | ||
| log.Printf("failed to remove temp dir %s: %v", tmpDir, err) | ||
| } | ||
| return fmt.Errorf("failed to close temp file: %w", err) | ||
| } | ||
|
|
||
| // Re-open the temp file for extraction. | ||
| f, err := os.Open(tmpFile.Name()) | ||
| if err != nil { | ||
| if err := os.RemoveAll(tmpDir); err != nil { | ||
| log.Printf("failed to remove temp dir %s: %v", tmpDir, err) | ||
| } | ||
| return fmt.Errorf("failed to open temp file for extraction: %w", err) | ||
| } | ||
| defer f.Close() | ||
|
|
||
| if err := extractCompressedPackageSource(f, tmpDir); err != nil { | ||
| if err := os.RemoveAll(tmpDir); err != nil { | ||
| log.Printf("failed to remove temp dir %s: %v", tmpDir, err) | ||
| } | ||
| return err | ||
| } |
|
/gemini review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
#2540
The new feature helps determine the reachability of imported Python libraries in a Python project. It is part of an in-development project on the reachability of imported Python libraries.
This PR adds support for imported Python libraries defined in a poetry.lock file.