# RAM and Storage Data Cleaning

This notebook cleans the RAM_TYPE, RAM_SIZE, SSD_SIZE, and HDD_SIZE columns using CPU-based mappings.

## Cleaning Steps:
1. **Fix swapped columns**: Detect when RAM values are in SSD column and vice versa
2. **Handle dual storage**: Split formats like "1TB+240GB" into SSD and HDD
3. **Fill RAM_TYPE**: Use CPU ‚Üí DDR type mappings from `cpu_ddr_map.csv`
4. **Fill RAM_SIZE**: Use tier-based heuristics (i9‚Üí32GB, i7/i5‚Üí16GB, i3‚Üí8GB)
5. **Fill Storage**: Only if BOTH SSD and HDD are empty, use CPU ‚Üí storage defaults from `cpu_storage_map.csv`
6. **Normalize storage**: Convert TB to GB format (1TB ‚Üí 1000GB)

## Input/Output:
- **Input**: `data_with_cpus_gpus.csv` (output from cpus_gpus_handling.ipynb with cleaned CPU names)
- **Output**: `data_with_cleaned_ram_storage.csv`
- **Reference**: `cpu_ddr_map.csv`, `cpu_storage_map.csv` (use cleaned CPU names from cpus.csv)
- **CPU Column**: Uses `mapped_cpu_name` (standardized CPU names like "Intel Core i5-1135G7 @ 2.40GHz")

## 1. Import Libraries

In [55]:
import pandas as pd
import numpy as np
import csv
import re
from pathlib import Path

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load Mapping Files

### CPU ‚Üí DDR Type Mapping
Maps CPU names to their compatible DDR type (DDR3, DDR4, DDR5, LPDDR3, LPDDR4, LPDDR4X, LPDDR5, LPDDR5X).

### CPU ‚Üí Storage Mapping
Maps CPU names to their default storage configuration (type: SSD/HDD, size: 256GB/512GB/1TB/etc).
Only used when BOTH SSD_SIZE and HDD_SIZE are empty.

In [56]:
def load_ddr_map(filepath):
    """Load CPU to DDR type mapping from csv.
    Expected format: cpu_name,ddr_type
    Returns dict: {cpu_name: ddr_type}
    """
    ddr_map = {}
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                cpu = row.get('cpu_name', '').strip()
                ddr = row.get('ddr_type', '').strip()
                if cpu and ddr:
                    ddr_map[cpu] = ddr
        print(f"Loaded {len(ddr_map)} CPU ‚Üí DDR type mappings from {filepath}")
        # Show sample CPU names from mapping file
        sample_cpus = list(ddr_map.keys())[:5]
        print(f"Sample CPU names from mapping file: {sample_cpus}")
    except FileNotFoundError:
        print(f"WARNING: DDR map file not found: {filepath}")
    return ddr_map

def load_storage_map(filepath):
    """Load CPU to storage mapping from csv.
    Expected format: cpu_name,storage_type,storage_size
    Returns dict: {cpu_name: {'storage_type': 'SSD'/'HDD', 'storage_size': '512GB'}}
    """
    storage_map = {}
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                cpu = row.get('cpu_name', '').strip()
                storage_type = row.get('storage_type', '').strip()
                storage_size = row.get('storage_size', '').strip()
                if cpu and storage_type and storage_size:
                    storage_map[cpu] = {
                        'storage_type': storage_type,
                        'storage_size': storage_size
                    }
        print(f"Loaded {len(storage_map)} CPU ‚Üí storage mappings from {filepath}")
        # Show sample CPU names from mapping file
        sample_cpus = list(storage_map.keys())[:5]
        print(f"Sample CPU names from mapping file: {sample_cpus}")
    except FileNotFoundError:
        print(f"WARNING: Storage map file not found: {filepath}")
    return storage_map

# Load mapping files
ddr_map = load_ddr_map('cpu_ddr_map.csv')
storage_map = load_storage_map('cpu_storage_map.csv')

Loaded 823 CPU ‚Üí DDR type mappings from cpu_ddr_map.csv
Sample CPU names from mapping file: ['AMD 3015e', 'AMD 3020e', 'AMD A10 Micro-6700T APU', 'AMD A10 PRO-7350B APU', 'AMD A10 PRO-7850B APU']
Loaded 738 CPU ‚Üí storage mappings from cpu_storage_map.csv
Sample CPU names from mapping file: ['AMD 3015e', 'AMD 3020e', 'AMD A10 Micro-6700T APU', 'AMD A10 PRO-7350B APU', 'AMD A10 PRO-7850B APU']


## 3. CPU Name Lookup

Direct dictionary lookup using cleaned CPU names. 

**Important**: 
- `mapped_cpu_name` column may include frequency (e.g., "Intel Core i5-1135G7 @ 2.40GHz")
- Mapping files (`cpu_ddr_map.csv`, `cpu_storage_map.csv`) don't have frequencies (e.g., "Intel Core i5-1135G7")
- The function strips "@ GHz" before matching

In [57]:
def find_cpu_in_map(cpu_name, cpu_map):
    """Find CPU in map using direct lookup.
    
    The mapped_cpu_name from cpus_gpus_handling includes frequency (e.g., "@ 2.40GHz"),
    but the mapping files don't have frequencies, so we need to strip them.
    
    Example:
    - Input: "Intel Core i5-1135G7 @ 2.40GHz"
    - Stripped: "Intel Core i5-1135G7"
    - Matches: "Intel Core i5-1135G7" in cpu_ddr_map.csv
    
    Returns: matched CPU name from map, or None
    """
    if not cpu_name or not cpu_map:
        return None
    
    # Strip frequency: "Intel Core i5-1135G7 @ 2.40GHz" -> "Intel Core i5-1135G7"
    cpu_name_clean = re.sub(r'\s*@.*', '', str(cpu_name)).strip()
    cpu_lower = cpu_name_clean.lower()
    
    # Direct case-insensitive lookup
    for map_cpu in cpu_map.keys():
        if cpu_lower == map_cpu.lower():
            return map_cpu
    
    return None

## 4. RAM Size Heuristics

When RAM_SIZE is missing, estimate based on:
1. **CPU suffix** (U-series, H-series, HX, G7, P-series) - Most accurate
2. **CPU tier** (i3/i5/i7/i9, Ryzen 3/5/7/9) - Secondary indicator
3. **Generation** - Newer gens tend to have more RAM

### Intel Suffix Patterns:
- **HX-series** (Extreme performance): 32GB (high-end gaming/workstation)
- **H-series** (High performance): 16-32GB (gaming laptops)
- **P-series** (Performance): 16GB (creator laptops)
- **U-series** (Ultra-low power): 8-16GB (thin & light)
- **G7/G4** (Iris graphics): 8-16GB (mainstream)

### AMD Patterns:
- **HX-series**: 32GB
- **HS/H-series**: 16-32GB
- **U-series**: 8-16GB

### Examples:
- Intel Core i7-1135G7 ‚Üí 16GB (i7 + G7 suffix)
- Intel Core i5-1135G7 ‚Üí 16GB (i5 + G7 suffix, not just 8GB)
- Intel Core i7-12700H ‚Üí 16GB (i7 + H-series)
- Intel Core i9-13980HX ‚Üí 32GB (i9 + HX)

In [58]:
def get_ram_size_for_cpu(cpu_name):
    """Get typical RAM size for a CPU based on suffix, tier, and generation.
    
    Priority:
    1. CPU suffix (HX, H, P, U, G7, etc.) - Most accurate indicator
    2. CPU tier (i3/i5/i7/i9, Ryzen 3/5/7/9)
    3. Generation (newer = more RAM)
    
    Returns: RAM size as string (e.g., '16') or None
    """
    if not cpu_name:
        return None
    
    cpu_lower = cpu_name.lower()
    cpu_upper = cpu_name.upper()
    
    # === PRIORITY 1: Check CPU suffix patterns (most accurate) ===
    
    # HX-series: Extreme performance (32GB)
    if 'hx' in cpu_lower or cpu_upper.endswith('HX'):
        return "32"
    
    # H-series: High performance gaming/workstation
    # i9-H or Ryzen 9-H ‚Üí 32GB
    # i7-H or Ryzen 7-H ‚Üí 16GB (but could be 32GB in newer gens)
    # i5-H ‚Üí 16GB
    if re.search(r'\d{4,5}h\b', cpu_lower) or re.search(r'-\d{4}h\b', cpu_lower):
        # Check tier for H-series
        if 'i9' in cpu_lower or 'ryzen 9' in cpu_lower:
            return "32"
        elif 'i7' in cpu_lower or 'ryzen 7' in cpu_lower:
            # 11th gen+ i7-H typically have 16GB, but can go 32GB
            return "16"
        elif 'i5' in cpu_lower or 'ryzen 5' in cpu_lower:
            return "16"
        else:
            return "16"  # Default H-series
    
    # HS-series: AMD high performance slim (16GB)
    if 'hs' in cpu_lower:
        if 'ryzen 9' in cpu_lower:
            return "32"
        else:
            return "16"
    
    # P-series: Intel Performance (creator laptops, 16GB)
    if re.search(r'\d{4,5}p\b', cpu_lower):
        return "16"
    
    # U-series: Ultra-low power (thin & light)
    # i7-U with G7 ‚Üí 16GB (like i7-1135G7)
    # i5-U with G7 ‚Üí 16GB (like i5-1135G7)
    # i7-U without G7 ‚Üí 8-16GB (check generation)
    # i3-U ‚Üí 8GB
    if re.search(r'\d{4,5}u\b', cpu_lower) or 'u @' in cpu_lower:
        # Check for G7 suffix (Iris Xe graphics - better performance)
        if 'g7' in cpu_lower or 'g4' in cpu_lower:
            # G7 models typically come with 16GB even for i5
            if 'i7' in cpu_lower or 'i5' in cpu_lower:
                return "16"
            elif 'i3' in cpu_lower:
                return "8"
        # U-series without G7
        if 'i7' in cpu_lower or 'ryzen 7' in cpu_lower:
            # Check generation: 10th gen+ ‚Üí 16GB, older ‚Üí 8GB
            gen_match = re.search(r'-(\d{1,2})\d{3}', cpu_name)
            if gen_match:
                gen = int(gen_match.group(1))
                if gen >= 10:
                    return "16"
            return "8"
        elif 'i5' in cpu_lower or 'ryzen 5' in cpu_lower:
            return "8"
        elif 'i3' in cpu_lower or 'ryzen 3' in cpu_lower:
            return "8"
        else:
            return "8"  # Default U-series
    
    # G7/G4 suffix: Iris Xe graphics (typically 16GB for i5+)
    if 'g7' in cpu_lower or 'g4' in cpu_lower:
        if 'i7' in cpu_lower or 'i9' in cpu_lower:
            return "16"
        elif 'i5' in cpu_lower:
            return "16"  # i5-1135G7 typically has 16GB
        elif 'i3' in cpu_lower:
            return "8"
    
    # Y-series: Ultra-low power (tablets, 8GB)
    if re.search(r'\d{4,5}y\b', cpu_lower):
        return "8"
    
    # M-series: Mobile (8GB)
    if 'core m' in cpu_lower or re.search(r'm\d-', cpu_lower):
        return "8"
    
    # === PRIORITY 2: Check CPU tier (if no suffix detected) ===
    
    # High-end tiers: 32GB
    if any(x in cpu_lower for x in ['i9', 'ryzen 9', 'ultra 9', 'ultra9', 
                                      'threadripper', 'epyc', 'xeon']):
        return "32"
    
    # Mid-high tiers: 16GB
    if any(x in cpu_lower for x in ['i7', 'ryzen 7', 'ultra 7', 'ultra7']):
        return "16"
    
    # Mid tiers: Check generation
    if any(x in cpu_lower for x in ['i5', 'ryzen 5', 'ultra 5', 'ultra5']):
        # Modern i5 (10th gen+) typically have 16GB
        gen_match = re.search(r'-(\d{1,2})\d{3}', cpu_name)
        if gen_match:
            gen = int(gen_match.group(1))
            if gen >= 10:
                return "16"
        return "8"
    
    # Entry-level: 8GB
    if any(x in cpu_lower for x in ['i3', 'i1', 'ryzen 3', 'ultra 3', 'ultra3',
                                      'celeron', 'pentium', 'athlon', 'atom',
                                      'core 2', 'core duo',
                                      'a4', 'a6', 'a8', 'a9',
                                      'a10', 'a12', 'e1', 'e2', 'fx-', 'n95', 'n97', 'n100', 'n200', 'n300']):
        return "8"
    
    # Default for unknown: 8GB
    return "8"

## 5. Data Validation Helpers

Functions to detect:
- **Swapped columns**: RAM values in SSD column or vice versa
- **RAM values**: 2/4/6/8/12/16/24/32/48/64/96/128 GB
- **Storage values**: 256+ GB, TB units, or dual storage (A+B format)

In [59]:
def is_ram_value(val):
    """Check if value looks like RAM (4/8/16/32/64/96/128 GB - realistic laptop/workstation RAM)."""
    if not val:
        return False
    val_clean = val.strip().upper().replace('GB', '').replace(' ', '')
    # Only these are realistic laptop/workstation RAM sizes (128GB is valid for MacBooks/workstations)
    return val_clean in ['2', '4', '6', '8', '12', '16', '24', '32', '48', '64', '96', '128']

def is_storage_value(val):
    """Check if value looks like storage (256+ GB or TB, or has +)."""
    if not val:
        return False
    val_clean = val.strip().upper()
    # Contains + means dual storage
    if '+' in val_clean:
        return True
    # TB is always storage
    if 'TB' in val_clean:
        return True
    # GB values >= 256 are likely storage (128GB could be RAM on high-end machines)
    num = val_clean.replace('GB', '').replace(' ', '')
    try:
        return int(num) >= 256
    except:
        return False

def needs_swap(ram_val, ssd_val):
    """Check if RAM and SSD columns appear to be swapped.
    Returns True if:
    - RAM has storage-like value (>=128GB or TB) AND SSD has RAM-like value, OR
    - RAM has storage-like value AND SSD is empty, OR
    - RAM is empty AND SSD has RAM-like value
    """
    ram = (ram_val or '').strip()
    ssd = (ssd_val or '').strip()
    
    ram_looks_like_storage = is_storage_value(ram)
    ssd_looks_like_ram = is_ram_value(ssd)
    ram_looks_like_ram = is_ram_value(ram)
    ssd_looks_like_storage = is_storage_value(ssd)
    
    # Case 1: RAM empty, SSD has RAM value
    if not ram and ssd_looks_like_ram:
        return True
    
    # Case 2: RAM has storage value, SSD has RAM value (definitely swapped)
    if ram_looks_like_storage and ssd_looks_like_ram:
        return True
    
    # Case 3: RAM has storage value (128GB+) and SSD also has storage value
    # This might be swapped too - check if RAM > typical max (64GB)
    if ram_looks_like_storage and not ram_looks_like_ram:
        # RAM has a storage-like value, likely swapped
        # Only swap if SSD is empty or also looks like storage
        if not ssd or ssd_looks_like_storage:
            return True
    
    return False

## 6. Storage Parsing and Normalization

- **Parse dual storage**: "1TB+240GB" ‚Üí SSD=1TB, HDD=240GB
- **Normalize to GB**: "1TB" ‚Üí "1000GB", "2TB" ‚Üí "2000GB"

In [60]:
def parse_dual_storage(val):
    """Parse 'A+B' format like '1TB+240GB' -> (primary_size, secondary_size)."""
    if not val or '+' not in val:
        return val, None
    parts = val.split('+')
    if len(parts) == 2:
        return parts[0].strip(), parts[1].strip()
    return val, None

def normalize_storage_to_gb(val):
    """Convert storage values to GB format (e.g., '1TB' -> '1000GB', '2TB' -> '2000GB').
    Also handles dual storage like '1TB 512GB' or '512GB 1TB' by taking the first part."""
    if not val:
        return val
    val_clean = val.strip()
    
    # Handle dual storage with space separator (e.g., "1TB 512GB" or "512GB 1TB")
    # Take only the first part
    if ' ' in val_clean and ('GB' in val_clean.upper() or 'TB' in val_clean.upper()):
        parts = val_clean.split()
        # Find the first storage-like part
        for part in parts:
            if 'GB' in part.upper() or 'TB' in part.upper():
                val_clean = part
                break
    
    val_upper = val_clean.upper()
    
    # Handle TB -> GB conversion
    if 'TB' in val_upper:
        try:
            num = float(val_upper.replace('TB', '').strip())
            return f"{int(num * 1000)}GB"
        except:
            return val_clean
    
    # Already in GB or other format, return as-is but ensure GB suffix
    if 'GB' in val_upper:
        return val_upper
    
    # Just a number, assume GB
    try:
        num = int(val_clean)
        return f"{num}GB"
    except:
        return val_clean

## 7. Load Input Data

In [61]:
# Load data with cleaned CPU/GPU names
df = pd.read_csv('data_with_cpus_gpus.csv')

print(f"Loaded {len(df)} rows from data_with_cpus_gpus.csv")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few CPU names from mapped_cpu_name column:")
print(df['mapped_cpu_name'].head(10))
print(f"\nSample of data:")
df.head()

Loaded 16392 rows from data_with_cpus_gpus.csv

Columns: ['id', 'price_preview', 'created_at', 'city', 'spec_Etat', 'model_name', 'DEDICATED_GPU', 'CPU', 'RAM_SIZE', 'SSD_SIZE', 'HDD_SIZE', 'SCREEN_SIZE', 'SCREEN_FREQUENCY', 'SCREEN_RESOLUTION', 'RAM_TYPE', 'mapped_cpu_name', 'match_score', 'cores', 'cpu_mark', 'tdp', 'gpu_name', 'match_type', 'gpu_match_score', 'gpu_g3d_mark', 'gpu_g2d_mark', 'gpu_tdp']

First few CPU names from mapped_cpu_name column:
0               Intel Core i5-1250P
1    Intel Core i7-11800H @ 2.30GHz
2    Intel Core i7-7700HQ @ 2.80GHz
3                AMD Ryzen 7 5800HS
4                   AMD Ryzen 5 240
5    Intel Core i5-10300H @ 2.50GHz
6                 AMD Ryzen 5 7520U
7    Intel Core i5-1135G7 @ 2.40GHz
8             Intel Core i7-13700HX
9    Intel Core i5-1145G7 @ 2.60GHz
Name: mapped_cpu_name, dtype: object

Sample of data:


Unnamed: 0,id,price_preview,created_at,city,spec_Etat,model_name,DEDICATED_GPU,CPU,RAM_SIZE,SSD_SIZE,...,match_score,cores,cpu_mark,tdp,gpu_name,match_type,gpu_match_score,gpu_g3d_mark,gpu_g2d_mark,gpu_tdp
0,1,75000000.0,2021 10 01T18:01:44.000Z,EL TAREF,BON TAT,IDEAPAD,,INTEL CORE I5 750S,4GB,128GB,...,66.666667,4.0,19108,28.0,Intel UHD Graphics 730,fuzzy,100.0,1000,237,15.0
1,2,33500000.0,2021 11 10T21:24:14.000Z,COLLO,JAMAIS UTILIS,AERO,NVIDIA GEFORCE RTX 3060,11TH GEN INTEL CORE I7 11800H,16GB,1TB,...,69.230769,8.0,19776,45.0,GeForce RTX 3060 12GB,fuzzy,100.0,16758,966,170.0
2,3,17000000.0,2021 09 11T20:27:59.000Z,MECHERIA,,STEALTH,NVIDIA GEFORCE GTX 1060,INTEL CORE I7 7700HQ,16GB,,...,100.0,4.0,6881,45.0,GeForce GTX 1060,fuzzy,100.0,10059,743,120.0
3,4,12000000.0,2025 03 06T00:28:39.000Z,ES SENIA,,ROG,NVIDIA GEFORCE RTX 1650,AMD RYZEN 7 5800HS,16GB,512GB,...,100.0,8.0,19476,35.0,GeForce GTX 1650,exact,100.0,7871,554,75.0
4,5,11000000.0,2024 10 09T18:10:21.000Z,TIZI OUZOU,BON TAT,,AMD RADEON RX 580,AMD RYZEN 5 2400G,16GB,128GB,...,93.75,6.0,22980,45.0,GeForce GTX 580,fuzzy,100.0,4632,489,244.0


## 8. Data Cleaning Pipeline

### Processing Steps:
1. **Fix swapped columns** (RAM in SSD column or vice versa)
2. **Split dual storage** ("1TB+240GB" format)
3. **Fill RAM_TYPE** using CPU ‚Üí DDR mappings
4. **Fill RAM_SIZE** using tier-based heuristics
5. **Fill Storage** only if BOTH SSD and HDD are empty
6. **Normalize storage** values to GB format

In [62]:
# Initialize statistics
stats = {
    'total_rows': len(df),
    'ram_type_filled': 0,
    'ram_type_unchanged': 0,
    'ram_type_not_found': 0,
    'ram_size_filled': 0,
    'ram_size_unchanged': 0,
    'storage_filled': 0,
    'storage_unchanged': 0,
    'storage_not_found': 0,
    'columns_swapped': 0,
    'dual_storage_split': 0,
}

# Track CPUs not found in maps
cpus_not_in_ddr_map = set()
cpus_not_in_storage_map = set()

# Process each row
for idx, row in df.iterrows():
    # Use mapped_cpu_name (cleaned CPU name from cpus_gpus_handling.ipynb)
    cpu_name = str(row.get('mapped_cpu_name', '')).strip() if pd.notna(row.get('mapped_cpu_name')) else ''
    
    # Skip if CPU mapping failed (NA means CPU couldn't be matched)
    if not cpu_name or cpu_name.upper() == 'NA':
        continue
    
    # === STEP 0: Fix swapped columns (RAM in SSD column or vice versa) ===
    current_ram_size = str(row.get('RAM_SIZE', '')).strip() if pd.notna(row.get('RAM_SIZE')) else ''
    current_ssd = str(row.get('SSD_SIZE', '')).strip() if pd.notna(row.get('SSD_SIZE')) else ''
    
    if needs_swap(current_ram_size, current_ssd):
        # Swap RAM and SSD values
        df.at[idx, 'RAM_SIZE'] = current_ssd if is_ram_value(current_ssd) else ''
        df.at[idx, 'SSD_SIZE'] = current_ram_size if is_storage_value(current_ram_size) else ''
        stats['columns_swapped'] += 1
        current_ram_size = df.at[idx, 'RAM_SIZE']
        current_ssd = df.at[idx, 'SSD_SIZE']
    
    # === STEP 0b: Handle dual storage format (e.g., "1TB+240GB") ===
    current_ssd = str(df.at[idx, 'SSD_SIZE']).strip() if pd.notna(df.at[idx, 'SSD_SIZE']) else ''
    if current_ssd and '+' in current_ssd:
        primary, secondary = parse_dual_storage(current_ssd)
        df.at[idx, 'SSD_SIZE'] = primary  # Keep primary in SSD
        # Optionally store secondary in HDD if HDD is empty
        current_hdd = str(row.get('HDD_SIZE', '')).strip() if pd.notna(row.get('HDD_SIZE')) else ''
        if not current_hdd or current_hdd.lower() in ['', 'nan', 'none', 'null']:
            df.at[idx, 'HDD_SIZE'] = secondary if secondary else ''
        stats['dual_storage_split'] += 1
    
    # === Fill RAM_TYPE if empty ===
    current_ram_type = str(row.get('RAM_TYPE', '')).strip() if pd.notna(row.get('RAM_TYPE')) else ''
    if not current_ram_type or current_ram_type.lower() in ['', 'nan', 'none', 'null']:
        matched_cpu = find_cpu_in_map(cpu_name, ddr_map)
        if matched_cpu:
            df.at[idx, 'RAM_TYPE'] = ddr_map[matched_cpu]
            stats['ram_type_filled'] += 1
        else:
            stats['ram_type_not_found'] += 1
            # Track CPU not found in DDR map
            cpu_name_clean = re.sub(r'\s*@.*', '', cpu_name).strip()
            cpus_not_in_ddr_map.add(cpu_name_clean)
    else:
        stats['ram_type_unchanged'] += 1
    
    # === Fill RAM_SIZE if empty ===
    current_ram_size = str(row.get('RAM_SIZE', '')).strip() if pd.notna(row.get('RAM_SIZE')) else ''
    if not current_ram_size or current_ram_size.lower() in ['', 'nan', 'none', 'null']:
        ram_size = get_ram_size_for_cpu(cpu_name)
        if ram_size:
            df.at[idx, 'RAM_SIZE'] = ram_size + "GB"
            stats['ram_size_filled'] += 1
    else:
        stats['ram_size_unchanged'] += 1
    
    # === Fill Storage ONLY if BOTH SSD_SIZE and HDD_SIZE are empty ===
    current_ssd = str(df.at[idx, 'SSD_SIZE']).strip() if pd.notna(df.at[idx, 'SSD_SIZE']) else ''
    current_hdd = str(row.get('HDD_SIZE', '')).strip() if pd.notna(row.get('HDD_SIZE')) else ''
    
    ssd_empty = not current_ssd or current_ssd.lower() in ['', 'nan', 'none', 'null', '0']
    hdd_empty = not current_hdd or current_hdd.lower() in ['', 'nan', 'none', 'null', '0']
    
    if ssd_empty and hdd_empty:
        matched_cpu = find_cpu_in_map(cpu_name, storage_map)
        if matched_cpu:
            storage_info = storage_map[matched_cpu]
            storage_type = storage_info['storage_type']
            storage_size = storage_info['storage_size'].replace('GB', '').replace('TB', '000')
            
            if storage_type == 'SSD':
                df.at[idx, 'SSD_SIZE'] = storage_size
                df.at[idx, 'HDD_SIZE'] = ''
            else:
                df.at[idx, 'HDD_SIZE'] = storage_size
                df.at[idx, 'SSD_SIZE'] = ''
            
            stats['storage_filled'] += 1
        else:
            stats['storage_not_found'] += 1
            # Track CPU not found in storage map
            cpu_name_clean = re.sub(r'\s*@.*', '', cpu_name).strip()
            cpus_not_in_storage_map.add(cpu_name_clean)
    else:
        stats['storage_unchanged'] += 1
    
    # === STEP: Normalize all storage values to GB format (TB -> GB) ===
    if pd.notna(df.at[idx, 'SSD_SIZE']) and str(df.at[idx, 'SSD_SIZE']).strip():
        df.at[idx, 'SSD_SIZE'] = normalize_storage_to_gb(str(df.at[idx, 'SSD_SIZE']))
    if pd.notna(df.at[idx, 'HDD_SIZE']) and str(df.at[idx, 'HDD_SIZE']).strip():
        df.at[idx, 'HDD_SIZE'] = normalize_storage_to_gb(str(df.at[idx, 'HDD_SIZE']))

print("Data cleaning completed!")

Data cleaning completed!


## 9. Display Cleaning Statistics

In [63]:
print("=" * 60)
print("CLEANING STATISTICS")
print("=" * 60)
print(f"Total rows processed: {stats['total_rows']}")
print()
print("RAM_TYPE:")
print(f"  - Filled from map:     {stats['ram_type_filled']}")
print(f"  - Already had value:   {stats['ram_type_unchanged']}")
print(f"  - CPU not in map:      {stats['ram_type_not_found']}")
print()
print("RAM_SIZE:")
print(f"  - Filled from tier:    {stats['ram_size_filled']}")
print(f"  - Already had value:   {stats['ram_size_unchanged']}")
print()
print("Data Fixes:")
print(f"  - Columns swapped:     {stats['columns_swapped']} (RAM was in SSD column)")
print(f"  - Dual storage split:  {stats['dual_storage_split']} (A+B format separated)")
print()
print("Storage (SSD/HDD):")
print(f"  - Filled from map:     {stats['storage_filled']}")
print(f"  - Already had value:   {stats['storage_unchanged']}")
print(f"  - CPU not in map:      {stats['storage_not_found']}")
print("=" * 60)

CLEANING STATISTICS
Total rows processed: 16392

RAM_TYPE:
  - Filled from map:     10330
  - Already had value:   5269
  - CPU not in map:      0

RAM_SIZE:
  - Filled from tier:    413
  - Already had value:   15186

Data Fixes:
  - Columns swapped:     32 (RAM was in SSD column)
  - Dual storage split:  5 (A+B format separated)

Storage (SSD/HDD):
  - Filled from map:     946
  - Already had value:   14653
  - CPU not in map:      0


## 9b. CPUs Not Found in Mapping Files

These CPUs exist in the data but are missing from the mapping files. We should add them if they are valid CPU models.

In [64]:
print("\n" + "=" * 60)
print("CPUs NOT FOUND IN MAPPING FILES")
print("=" * 60)

print(f"\nüìã CPUs not in DDR map ({len(cpus_not_in_ddr_map)} unique):")
print("-" * 60)
if cpus_not_in_ddr_map:
    # Sort alphabetically for easier review
    sorted_cpus_ddr = sorted(cpus_not_in_ddr_map)
    for i, cpu in enumerate(sorted_cpus_ddr, 1):
        print(f"{i:3}. {cpu}")
else:
    print("‚úì All CPUs found in DDR map!")

print(f"\nüìÅ CPUs not in Storage map ({len(cpus_not_in_storage_map)} unique):")
print("-" * 60)
if cpus_not_in_storage_map:
    # Sort alphabetically for easier review
    sorted_cpus_storage = sorted(cpus_not_in_storage_map)
    for i, cpu in enumerate(sorted_cpus_storage, 1):
        print(f"{i:3}. {cpu}")
else:
    print("‚úì All CPUs found in Storage map!")

# Find CPUs missing from BOTH maps
cpus_missing_both = cpus_not_in_ddr_map.intersection(cpus_not_in_storage_map)
if cpus_missing_both:
    print(f"\n‚ö†Ô∏è  CPUs missing from BOTH maps ({len(cpus_missing_both)} unique):")
    print("-" * 60)
    sorted_cpus_both = sorted(cpus_missing_both)
    for i, cpu in enumerate(sorted_cpus_both, 1):
        print(f"{i:3}. {cpu}")

print("\n" + "=" * 60)


CPUs NOT FOUND IN MAPPING FILES

üìã CPUs not in DDR map (0 unique):
------------------------------------------------------------
‚úì All CPUs found in DDR map!

üìÅ CPUs not in Storage map (0 unique):
------------------------------------------------------------
‚úì All CPUs found in Storage map!



## 9c. Export Missing CPUs to CSV

Export the missing CPUs to CSV files so you can review them and add valid entries to the mapping files.

In [65]:
# Export CPUs not in DDR map
if cpus_not_in_ddr_map:
    missing_ddr_df = pd.DataFrame({
        'cpu_name': sorted(cpus_not_in_ddr_map),
        'ddr_type': '',  # To be filled manually
        'release_year': '',  # To be filled manually
        'notes': ''  # To be filled manually
    })
    missing_ddr_df.to_csv('missing_cpus_ddr_map.csv', index=False)
    print(f"‚úì Exported {len(cpus_not_in_ddr_map)} CPUs to 'missing_cpus_ddr_map.csv'")
else:
    print("‚úì No missing CPUs for DDR map")

# Export CPUs not in Storage map
if cpus_not_in_storage_map:
    missing_storage_df = pd.DataFrame({
        'cpu_name': sorted(cpus_not_in_storage_map),
        'storage_type': '',  # To be filled manually (SSD/HDD)
        'storage_size': '',  # To be filled manually (256GB/512GB/1TB etc)
        'tier': '',  # To be filled manually (budget/mid/high)
        'notes': ''  # To be filled manually
    })
    missing_storage_df.to_csv('missing_cpus_storage_map.csv', index=False)
    print(f"‚úì Exported {len(cpus_not_in_storage_map)} CPUs to 'missing_cpus_storage_map.csv'")
else:
    print("‚úì No missing CPUs for Storage map")

print("\n‚ÑπÔ∏è  Review these files, fill in the appropriate values, and append them to:")
print("   - cpu_ddr_map.csv")
print("   - cpu_storage_map.csv")

‚úì No missing CPUs for DDR map
‚úì No missing CPUs for Storage map

‚ÑπÔ∏è  Review these files, fill in the appropriate values, and append them to:
   - cpu_ddr_map.csv
   - cpu_storage_map.csv


## 10. Preview Cleaned Data

In [66]:
# Display sample of cleaned data
print("\nSample of cleaned data (mapped_cpu_name, RAM_TYPE, RAM_SIZE, SSD_SIZE, HDD_SIZE):")
display_cols = ['mapped_cpu_name', 'RAM_TYPE', 'RAM_SIZE', 'SSD_SIZE', 'HDD_SIZE']
df[display_cols].head(20)


Sample of cleaned data (mapped_cpu_name, RAM_TYPE, RAM_SIZE, SSD_SIZE, HDD_SIZE):


Unnamed: 0,mapped_cpu_name,RAM_TYPE,RAM_SIZE,SSD_SIZE,HDD_SIZE
0,Intel Core i5-1250P,DDR5,4GB,128GB,
1,Intel Core i7-11800H @ 2.30GHz,DDR4,16GB,1000GB,
2,Intel Core i7-7700HQ @ 2.80GHz,DDR4,16GB,512GB,
3,AMD Ryzen 7 5800HS,DDR4,16GB,512GB,
4,AMD Ryzen 5 240,DDR5,16GB,128GB,145GB
5,Intel Core i5-10300H @ 2.50GHz,DDR4,8GB,512GB,
6,AMD Ryzen 5 7520U,DDR5,8GB,512GB,
7,Intel Core i5-1135G7 @ 2.40GHz,DDR4,16GB,512GB,
8,Intel Core i7-13700HX,DDR5,16GB,1000GB,
9,Intel Core i5-1145G7 @ 2.60GHz,DDR4,8GB,256GB,


## 11. Export Cleaned Data

In [67]:
# Export to CSV
output_file = 'data_with_cleaned_ram_storage.csv'
df.to_csv(output_file, index=False)

print(f"\n‚úì Exported {len(df)} rows to {output_file}")
print("\nDone!")


‚úì Exported 16392 rows to data_with_cleaned_ram_storage.csv

Done!
