Skip to content

[Quality] KISS 原则(简单): Multiple Code Duplication and Over-Engineering Issues #19

@newtontech

Description

@newtontech

实际问题

Issue 1: DRY Violation - Duplicated Atomic Number Mapping Across 20+ Files

文件:

  • pymultiwfn/io/parsers/wfn.py:229
  • pymultiwfn/io/parsers/molden.py:181
  • pymultiwfn/io/parsers/xyz.py:70
  • pymultiwfn/io/parsers/fchk.py (and 16+ other parser files)

代码 (重复出现在多个文件中):

# In wfn.py:229
symbol_to_number = {
    \"H\": 1,
    \"He\": 2,
    \"Li\": 3,
    ...  # 36 entries
}

# In molden.py:181  
symbol_to_number = {
    \"H\": 1,
    \"HE\": 2,
    \"LI\": 3,
    ...  # Same data, different case
}

# In xyz.py:70
element_mapping = {
    \"H\": 1,
    \"He\": 2,
    ...  # Same data, different variable name
}

问题: The same atomic number mapping is copy-pasted across 20+ parser files. This violates DRY (Don't Repeat Yourself) principle and makes maintenance difficult - any update to element data requires changes in multiple places.

改进建议:

# In pymultiwfn/core/definitions.py (ALREADY EXISTS!)
ELEMENT_NAME_TO_Z = {
    name.strip().upper(): i
    for i, name in enumerate(ELEMENT_NAMES)
    if name.strip() != "?? "
}

# In parsers, simply import and use:
from pymultiwfn.core.definitions import ELEMENT_NAME_TO_Z

# Single lookup function for all parsers
def get_atomic_number(symbol: str) -> int:
    \"\"\"Get atomic number from element symbol (case-insensitive).\"\"\"
    return ELEMENT_NAME_TO_Z.get(symbol.strip().upper(), 0)

Issue 2: Overly Complex ParserFactory._detect_from_content Method

文件: pymultiwfn/io/parsers/factory.py:146-268

代码 (节选):

@classmethod
def _detect_from_content(cls, filename: str) -> Optional[Type]:
    # ... 100+ lines of repetitive detection logic ...
    
    # Check for Gaussian formatted checkpoint
    if \"Gaussian\" in content and \"Formatted Checkpoint\" in content:
        return FchkLoader
    
    # Check for Molden format
    if \"[Molden Format]\" in content or \"[Atoms]\" in content:
        return MoldenLoader
    
    # Check for WFN format
    if any(line and line.split()[0].isdigit() for line in first_lines if line):
        # ... 10 lines of nested logic ...
        return WFNLoader
    
    # ... 15+ more similar checks ...
    
    # Check for PDB format
    if any(line.startswith((\"ATOM\", \"HETATM\")) for line in first_lines):
        return PDBLoader
    
    # Check for Cube format
    if len(first_lines) >= 2 and first_lines[1].strip().isdigit():
        # ... 5 lines of nested logic ...
        return CubeLoader
    
    # ... and so on for 20+ formats

问题: The content detection logic is verbose and repetitive. Each format check follows the same pattern but is implemented with slight variations, making the code hard to read and extend.

改进建议:

# Simple, declarative format detection
FORMAT_SIGNATURES = [
    (FchkLoader, lambda lines, content: 
        \"Gaussian\" in content and \"Formatted Checkpoint\" in content),
    (MoldenLoader, lambda lines, content: 
        \"[Molden Format]\" in content or \"[Atoms]\" in content),
    (WFNLoader, lambda lines, content: 
        _is_wfn_format(lines)),
    (PDBLoader, lambda lines, content: 
        any(l.startswith((\"ATOM\", \"HETATM\")) for l in lines)),
    # ... etc
]

def _detect_from_content(cls, filename: str) -> Optional[Type]:
    first_lines = _read_first_lines(filename)
    content = \"\\n\".join(first_lines)
    
    for loader_class, check_fn in FORMAT_SIGNATURES:
        if check_fn(first_lines, content):
            return loader_class
    return None

Issue 3: Code Duplication in Bond Order Calculations

文件: pymultiwfn/analysis/bonding/bondorder.py

代码 (calculate_mayer_bond_order and calculate_mulliken_bond_order):

def calculate_mayer_bond_order(wfn: Wavefunction) -> Dict[str, np.ndarray]:
    # ... validation code ...
    for i in range(n_atoms):
        bfs_i = atom_to_bfs.get(i, [])
        if not bfs_i:
            continue
        for j in range(i + 1, n_atoms):
            bfs_j = atom_to_bfs.get(j, [])
            if not bfs_j:
                continue
            # ... calculation ...

def calculate_mulliken_bond_order(wfn: Wavefunction) -> Dict[str, np.ndarray]:
    # ... validation code ...
    for i in range(n_atoms):
        bfs_i = atom_to_bfs.get(i, [])
        if not bfs_i:
            continue
        for j in range(i + 1, n_atoms):
            bfs_j = atom_to_bfs.get(j, [])
            if not bfs_j:
                continue
            # ... calculation ...

问题: The atom pair iteration pattern is duplicated between Mayer and Mulliken bond order calculations. The only difference is the actual computation inside the inner loop.

改进建议:

def _iterate_atom_pairs(wfn: Wavefunction):
    \"\"\"Generator to yield (i, j, bfs_i, bfs_j) for atom pairs.\"\"\"
    atom_to_bfs = wfn.get_atomic_basis_indices()
    n_atoms = wfn.num_atoms
    for i in range(n_atoms):
        bfs_i = atom_to_bfs.get(i, [])
        if not bfs_i:
            continue
        for j in range(i + 1, n_atoms):
            bfs_j = atom_to_bfs.get(j, [])
            if not bfs_j:
                continue
            yield i, j, bfs_i, bfs_j

def calculate_mayer_bond_order(wfn: Wavefunction) -> Dict[str, np.ndarray]:
    PS_total = wfn.Ptot @ wfn.overlap_matrix
    for i, j, bfs_i, bfs_j in _iterate_atom_pairs(wfn):
        ps_ij = PS_total[np.ix_(bfs_i, bfs_j)]
        ps_ji = PS_total[np.ix_(bfs_j, bfs_i)]
        accum = np.trace(ps_ij @ ps_ji)
        # ... store result ...

def calculate_mulliken_bond_order(wfn: Wavefunction) -> Dict[str, np.ndarray]:
    PS_total = wfn.Ptot * wfn.overlap_matrix
    for i, j, bfs_i, bfs_j in _iterate_atom_pairs(wfn):
        bnd_total[i, j] = np.sum(PS_total[np.ix_(bfs_i, bfs_j)])

Issue 4: Massive if-elif Chain in overlap.py

文件: pymultiwfn/integrals/overlap.py:388-441

代码:

def _type_to_lmn(gto_type: int) -> Tuple[int, int, int]:
    # S and P functions
    if gto_type == 0:
        return (0, 0, 0)
    elif gto_type == 1:
        return (1, 0, 0)
    elif gto_type == 2:
        return (0, 1, 0)
    elif gto_type == 3:
        return (0, 0, 1)
    # D functions
    elif gto_type == 4:
        return (2, 0, 0)
    elif gto_type == 5:
        return (0, 2, 0)
    elif gto_type == 6:
        return (0, 0, 2)
    elif gto_type == 7:
        return (1, 1, 0)
    elif gto_type == 8:
        return (1, 0, 1)
    elif gto_type == 9:
        return (0, 1, 1)
    # ... more elifs for F functions
    else:
        raise NotImplementedError(f\"GTO type {gto_type} not yet implemented\")

问题: 20+ lines of if-elif chains for a simple lookup operation. This is harder to read and maintain than a lookup table.

改进建议:

# Single lookup table - easy to read, easy to maintain
GTO_TYPE_TO_LMN = {
    0: (0, 0, 0),   # S
    1: (1, 0, 0),   # Px
    2: (0, 1, 0),   # Py
    3: (0, 0, 1),   # Pz
    4: (2, 0, 0),   # Dxx
    5: (0, 2, 0),   # Dyy
    6: (0, 0, 2),   # Dzz
    7: (1, 1, 0),   # Dxy
    8: (1, 0, 1),   # Dxz
    9: (0, 1, 1),   # Dyz
    # ... etc
}

def _type_to_lmn(gto_type: int) -> Tuple[int, int, int]:
    try:
        return GTO_TYPE_TO_LMN[gto_type]
    except KeyError:
        raise NotImplementedError(f\"GTO type {gto_type} not yet implemented\")

Issue 5: Complex WFN Loader with Over-Engineered Fallback Logic

文件: pymultiwfn/io/parsers/wfn.py:405-587

代码:

def _parse_basis_and_mo(self, lines):
    # ... 180+ lines of complex parsing logic ...
    
    # Multiple nested data structures
    shells_dict = {}
    
    # Extensive debug print statements mixed with logic
    print(f\"[DEBUG] Basis function information...\")
    
    # Complex type mapping with multiple conversion layers
    if shell_type == 1:
        our_type = 0  # S
    elif shell_type in [2, 3, 4]:
        our_type = 1  # P
    elif shell_type in [5, 6, 7, 8, 9, 10]:
        our_type = 2  # D
    # ... more conversions
    
    # Multiple fallback strategies
    try:
        # Try full overlap calculation
        overlap_matrix = self._calculate_wfn_overlap_matrix()
    except Exception as e:
        warnings.warn(f\"Full overlap calculation failed: {e}\")
        # Fallback to distance-based
        overlap_matrix = self._calculate_distance_based_overlap(basis_functions)

问题: The WFN parser is unnecessarily complex with debug prints mixed with logic, multiple fallback layers, and convoluted type conversions. This makes debugging difficult.

改进建议:

  • Separate parsing logic from debug output
  • Simplify type conversions using lookup tables
  • Remove or consolidate fallback strategies
  • Add clear separation between parsing and post-processing

为什么重要

Issue Impact
DRY Violation Maintenance burden - updating element data requires changes in 20+ files; risk of inconsistencies
Complex Factory Hard to add new formats; error-prone; difficult to test individual format detection
Duplicated Bond Logic Code bloat; harder to optimize; inconsistent behavior risk
if-elif Chain Slower execution; harder to extend for new GTO types
Over-Engineered WFN Harder to debug; unclear data flow; technical debt

引用 KISS 原则: "Simplicity is the ultimate sophistication." - Leonardo da Vinci

These issues violate the KISS principle by introducing unnecessary complexity where simple solutions exist.


优先级

  • 高 - 影响系统稳定性/安全性 (DRY violations can cause data inconsistencies)
  • 中 - 影响代码质量 (All issues)
  • 低 - 改进建议

建议修复顺序

  1. Issue 1 (DRY): Centralize atomic number lookup using existing ELEMENT_NAME_TO_Z
  2. Issue 4 (if-elif): Convert to lookup table - simplest change with clear benefit
  3. Issue 3 (Bond orders): Extract common iteration pattern
  4. Issue 2 (Factory): Refactor to declarative format signatures
  5. Issue 5 (WFN): Gradual simplification of parsing logic

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions