# CBAM (Channel) + Coordinate Attention (Method 1): A Very Detailed Workbook

This workbook explains **each code block one by one** and **how the blocks connect to each other**, step by step.

Goal: on a feature map, apply
- first **channel selection** (CBAM channel attention),
- then **axis-wise spatial selectivity** (coordinate attention),
- and finally a **controlled residual mix**.

## 0) Reading guide

Each part of this workbook follows the same format:
1) The code block
2) The purpose of that block
3) The input/output shapes of that block
4) How it connects to the next block

Note: the code cells are arranged so that they are runnable.

## 1) Setup

This section imports only PyTorch. The modules are defined in this workbook itself, so there are no extra dependencies.

In [None]:

import torch
import torch.nn as nn
import torch.nn.functional as F

## 2) Helper functions: why are they needed?

Two kinds of settings come up again and again in this project:

- **Gate choice**: squashing the attention output into the 0–1 range
- **Temperature**: controlling how sharp or hard the gate behaves

In addition, an activation has to be chosen for the small MLP-like parts.

The helpers below exist to manage all of this in one place.

In [None]:

def _softplus_inverse(y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Inverse of softplus: returns r such that softplus(r) is approximately y (for y > 0).

    Used to initialise a softplus-constrained (always positive) parameter at a target value.
    """
    # expm1(y) = exp(y) - 1, evaluated accurately for small y; the clamp keeps log() finite.
    return torch.log(torch.clamp(torch.expm1(y), min=eps))


def _get_gate(gate: str):
    """Select the function that maps the attention mask into the 0–1 range."""
    g = gate.lower()
    if g == "sigmoid":
        return torch.sigmoid
    if g == "hardsigmoid":
        return F.hardsigmoid
    raise ValueError("gate must be 'sigmoid' or 'hardsigmoid'.")


def _get_act(act: str):
    """Select the activation used in the small MLP blocks."""
    a = act.lower()
    if a == "relu":
        return nn.ReLU(inplace=True)
    if a == "silu":
        return nn.SiLU(inplace=True)
    raise ValueError("act must be 'relu' or 'silu'.")

### 2.1 How to think about `_softplus_inverse`

When a learnable parameter has to **stay positive** (e.g. the temperature), the typical approach is:

- Keep a "raw" parameter internally
- In the forward pass, apply `softplus(raw)` to produce a positive value

But if you want the temperature to start at a specific value, the raw parameter has to be set so that `softplus(raw)` returns that target. Since `softplus(r) = log(1 + exp(r))`, the inverse is `r = log(exp(y) - 1)`.

`_softplus_inverse` makes this initialisation practical.
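
As a quick sanity check of this idea, the sketch below round-trips a target temperature through the inverse and back through `softplus`. The function `softplus_inverse` is a standalone copy written for illustration, and the target 0.9 is simply this workbook's default temperature.

```python
import torch
import torch.nn.functional as F

def softplus_inverse(y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Standalone illustrative copy of the helper: r = log(exp(y) - 1)
    return torch.log(torch.clamp(torch.expm1(y), min=eps))

target = torch.tensor(0.9)        # desired initial temperature
raw = softplus_inverse(target)    # value to store in the raw parameter
recovered = F.softplus(raw)       # what the forward pass would compute

print(float(raw), float(recovered))
```

The recovered value should match the target up to floating-point error, which is all the initialisation needs.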

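### 2.2 Why two options for the gate function?

- `sigmoid`: smooth everywhere; its gradient never becomes exactly zero, so it stays sensitive to small changes
- `hardsigmoid`: cheaper and piecewise linear (in PyTorch, `relu6(x + 3) / 6`); it saturates exactly at 0 and 1, which can be more stable in some scenarios

This choice affects how the mask behaves (very aggressive or soft).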
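
To make the difference concrete, here is a small comparison of the two gates on a handful of illustrative input values (the values themselves are arbitrary).

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-4.0, -3.0, 0.0, 3.0, 4.0])

soft = torch.sigmoid(z)     # smooth, approaches 0 and 1 only asymptotically
hard = F.hardsigmoid(z)     # relu6(z + 3) / 6: exactly 0 at or below -3, exactly 1 at or above 3

print(soft)
print(hard)
```

Outside the interval [-3, 3] the hard gate is flat, so no gradient flows through it there; the sigmoid always passes some gradient, even if it is small.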

## 3) Channel Attention (CBAM channel): the full flow

This module does the following:

1) `x` (B,C,H,W) -> two summaries: avg and max (B,C,1,1)
2) Each summary goes through the same small shared MLP -> two candidate channel scores
3) The two candidate scores are either summed or combined with **softmax weights**
4) The gate is applied -> channel mask `ca` (B,C,1,1)
5) `y = x * ca` rescales the input channel by channel

Important: this module does not alter the spatial (H,W) structure; it only scales channels.

### 3.1 Code: ChannelAttentionFusionT

This version has a `fusion="softmax"` option.
What does that mean? The model can answer the question "is avg or max more useful?" on a per-sample basis.

There is also a `temperature`: the gate is computed as `gate(z / T)`, so a larger `T` makes the mask softer and a smaller `T` makes it sharper.

The `last = self.fusion_router[-1]` part:
- takes the **last layer** inside the `nn.Sequential`
- zeroes that layer's weight and bias so that the router starts out **neutral**
  (all logits are 0 at initialisation, so `softmax(logits)` is exactly [0.5, 0.5]).
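
The neutral start can be checked in isolation. The sketch below rebuilds a router of the same structure with small illustrative sizes (`channels=8` and `hidden=4` are assumptions, not the defaults) and feeds it random input.

```python
import torch
import torch.nn as nn

channels, hidden = 8, 4   # illustrative sizes only
router = nn.Sequential(
    nn.Conv2d(2 * channels, hidden, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(hidden, 2, kernel_size=1),
)
# Zero the last layer so that every input maps to the logits (0, 0)
nn.init.zeros_(router[-1].weight)
nn.init.zeros_(router[-1].bias)

s_cat = torch.randn(3, 2 * channels, 1, 1)   # stand-in for cat([avg, max])
weights = torch.softmax(router(s_cat).flatten(1), dim=1)
print(weights)
```

Whatever the input, each row comes out as an even split. The zeroed layer still receives gradients, so the router is free to move away from the neutral point during training.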

In [None]:

class ChannelAttentionFusionT(nn.Module):
    def __init__(
        self,
        channels: int,
        reduction: int = 16,
        min_hidden: int = 4,
        fusion: str = "softmax",
        gate: str = "sigmoid",
        temperature: float = 0.9,
        learnable_temperature: bool = False,
        eps: float = 1e-6,
        act: str = "relu",
        bias: bool = True,
        fusion_router_hidden: int = 16,
        return_fusion_weights: bool = False,
    ):
        super().__init__()
        if channels < 1:
            raise ValueError("channels must be >= 1.")
        if reduction < 1:
            raise ValueError("reduction must be >= 1.")
        if fusion not in ("sum", "softmax"):
            raise ValueError("fusion must be 'sum' or 'softmax'.")
        if temperature <= 0:
            raise ValueError("temperature must be positive.")
        if fusion_router_hidden < 1:
            raise ValueError("fusion_router_hidden must be >= 1.")

        self.eps = float(eps)
        self.fusion = fusion
        self.return_fusion_weights = bool(return_fusion_weights)
        self.gate_fn = _get_gate(gate)

        hidden = max(int(min_hidden), int(channels) // int(reduction))

        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1, bias=bias)
        self.act = _get_act(act)
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1, bias=bias)

        if self.fusion == "softmax":
            self.fusion_router = nn.Sequential(
                nn.Conv2d(2 * channels, fusion_router_hidden, kernel_size=1, bias=True),
                nn.ReLU(inplace=True),
                nn.Conv2d(fusion_router_hidden, 2, kernel_size=1, bias=True),
            )
            last = self.fusion_router[-1]
            nn.init.zeros_(last.weight)
            nn.init.zeros_(last.bias)
        else:
            self.fusion_router = None

        self.learnable_temperature = bool(learnable_temperature)
        if self.learnable_temperature:
            t0 = torch.tensor(float(temperature))
            t_inv = _softplus_inverse(t0, eps=self.eps)
            self.t_raw = nn.Parameter(t_inv)
        else:
            self.register_buffer("T", torch.tensor(float(temperature)))

    def get_T(self) -> torch.Tensor:
        if self.learnable_temperature:
            return F.softplus(self.t_raw) + self.eps
        return self.T

    def mlp(self, s: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(s)))

    def forward(self, x: torch.Tensor):
        avg_s = self.avg_pool(x)  # (B,C,1,1)
        max_s = self.max_pool(x)  # (B,C,1,1)

        a = self.mlp(avg_s)       # (B,C,1,1)
        m = self.mlp(max_s)       # (B,C,1,1)

        fusion_w = None
        if self.fusion == "sum":
            z = a + m
        else:
            s_cat = torch.cat([avg_s, max_s], dim=1)     # (B,2C,1,1)
            logits = self.fusion_router(s_cat).flatten(1)  # (B,2)
            fusion_w = torch.softmax(logits, dim=1)        # (B,2)

            # Reshape each (B,) weight to (B,1,1,1) so it broadcasts when multiplied with a/m
            w0 = fusion_w[:, 0].view(-1, 1, 1, 1)
            w1 = fusion_w[:, 1].view(-1, 1, 1, 1)
            z = w0 * a + w1 * m

        T = self.get_T().to(device=x.device, dtype=x.dtype)
        ca = self.gate_fn(z / T)  # (B,C,1,1)
        y = x * ca                # (B,C,H,W)

        if self.return_fusion_weights and (fusion_w is not None):
            return y, ca, fusion_w
        return y, ca

### 3.2 Why `fusion_w[:,0].view(-1,1,1,1)`?

`fusion_w` has shape `(B,2)`.
- `fusion_w[:,0]` -> `(B,)` (one number per sample)
- `view(-1,1,1,1)` -> `(B,1,1,1)`

With that shape:
- `a` and `m` are already `(B,C,1,1)`
- multiplying by `(B,1,1,1)` makes PyTorch broadcast the weight across the channel dimension automatically

Result: for each sample, a single weight scales all channels of that sample.
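
A tiny example of this broadcast, with made-up weights and an all-ones tensor standing in for the MLP output:

```python
import torch

B, C = 2, 3
a = torch.ones(B, C, 1, 1)                  # stand-in for mlp(avg_s)
fusion_w = torch.tensor([[0.8, 0.2],
                         [0.1, 0.9]])       # (B,2), each row sums to 1

w0 = fusion_w[:, 0].view(-1, 1, 1, 1)       # (B,1,1,1)
scaled = w0 * a                             # broadcast over C -> (B,C,1,1)

print(w0.shape, scaled.shape)
print(scaled.flatten(1))
```

Every channel of sample 0 is multiplied by 0.8 and every channel of sample 1 by 0.1, which is the per-sample behavior described above.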

### 3.3 `register_buffer` vs. `nn.Parameter`

- `nn.Parameter`: returned by `parameters()`, so the optimizer sees it and updates it during training
- `register_buffer`: not updated by the optimizer, but it travels with the model (saved in `state_dict`, moved between CPU/GPU with the module, etc.)

In this project:
- a buffer if the temperature is fixed
- a Parameter if the temperature is learned
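
The difference is easy to see on a toy module. `Toy` below is a hypothetical class written only for this illustration; it holds one buffer and one parameter.

```python
import torch
import torch.nn as nn

class Toy(nn.Module):   # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.register_buffer("T", torch.tensor(0.9))    # fixed value
        self.t_raw = nn.Parameter(torch.tensor(0.0))    # learnable value

toy = Toy()
param_names = [name for name, _ in toy.named_parameters()]
state_keys = list(toy.state_dict().keys())

print(param_names)   # only the Parameter is visible to an optimizer
print(state_keys)    # both entries are saved with the model
```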

## 4) Coordinate Attention (Plus): the full flow

The coordinate attention part produces spatial attention along **two axes** instead of as a single 2D mask:
- a separate mask for the H direction
- a separate mask for the W direction

Flow (summary):
1) `h_profile`: summary along W -> `(B,C,H,1)`
2) `w_profile`: summary along H -> `(B,C,1,W)`
3) Multi-scale filtering with local + dilated depthwise convolutions
4) Concatenate the H and W profiles and pass them through a shared bottleneck
5) Split back into H and W and produce the masks with separate heads
6) Soften "how much is applied" with alpha and beta
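
The shape bookkeeping in this flow can be followed without any learned layers. The sketch below is a simplification: it uses a plain mean for the profiles and skips the convolutions, bottleneck, and heads entirely, so it only demonstrates the concatenate / split / broadcast mechanics.

```python
import torch

B, C, H, W = 2, 4, 5, 7
x = torch.randn(B, C, H, W)

h_profile = x.mean(dim=3, keepdim=True)        # (B,C,H,1): summary along W
w_profile = x.mean(dim=2, keepdim=True)        # (B,C,1,W): summary along H

w_as_col = w_profile.permute(0, 1, 3, 2)       # (B,C,W,1) so it can be concatenated
hw = torch.cat([h_profile, w_as_col], dim=2)   # (B,C,H+W,1)

part_h, part_w = torch.split(hw, [H, W], dim=2)
part_w = part_w.permute(0, 1, 3, 2)            # back to (B,C,1,W)

scale = part_h * part_w                        # broadcasts to (B,C,H,W)
print(hw.shape, part_h.shape, part_w.shape, scale.shape)
```

The product of a `(B,C,H,1)` tensor and a `(B,C,1,W)` tensor is what turns two 1D masks into a full `(B,C,H,W)` scale.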

### 4.1 Activation and norm helpers

The coordinate part contains small bottleneck and refine blocks.
- Activation: HSwish (mobile-style, stable and cheap)
- Normalization: GN (GroupNorm) does not depend on batch statistics, so it tends to be more stable for detection / small-batch training

In [None]:

class HSwish(nn.Module):
    def forward(self, x):
        return x * F.relu6(x + 3.0, inplace=True) / 6.0


def make_norm(norm: str, ch: int):
    norm = norm.lower()
    if norm == "bn":
        return nn.BatchNorm2d(ch)
    if norm == "gn":
        g = min(32, ch)
        while ch % g != 0 and g > 2:
            g //= 2
        if ch % g != 0:
            g = 2 if (ch % 2 == 0) else 1
        return nn.GroupNorm(g, ch)
    if norm == "none":
        return nn.Identity()
    raise ValueError("norm must be one of 'none', 'bn', 'gn'.")

### 4.2 Why are `mid_floor` and `mid` chosen in layers?

In coordinate attention, if `mid` (the number of bottleneck channels) is too small:
- not enough H/W information can be carried through
- mask quality may drop

If it is too large:
- computation gets expensive
- the mask can become needlessly aggressive

That is why three controls are used together:
- `in_channels // reduction` (the natural bottleneck)
- `min_mid_channels` (lower limit)
- `mid_floor` (a practical floor, `max(8, min(32, in_channels // 4))`, so that it does not shrink too far)

Taking the max of these three aims for a "reasonable bottleneck" at every channel count.
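
A worked example of this three-way max for a few channel counts. `bottleneck_channels` is a hypothetical helper that mirrors the arithmetic in `CoordinateAttPlus.__init__` with the default `reduction=32` and `min_mid_channels=8`.

```python
def bottleneck_channels(in_channels: int, reduction: int = 32, min_mid_channels: int = 8) -> int:
    # Hypothetical helper mirroring the three controls described above
    mid_floor = max(8, min(32, in_channels // 4))
    return max(min_mid_channels, in_channels // reduction, mid_floor)

for c in (16, 64, 256, 2048):
    print(c, bottleneck_channels(c))
```

With the defaults, `mid_floor` decides the result for most practical widths; `in_channels // reduction` only takes over once it exceeds 32, i.e. past roughly a thousand channels.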

### 4.3 Code: CoordinateAttPlus (detailed)

Key parts:
- `h_local_dw` / `w_local_dw`: axis-wise local depthwise convolutions
- `h_dilated_dw` / `w_dilated_dw`: axis-wise dilated depthwise convolutions (wider context)
- `*_channel_mixer`: 1x1 convolution that mixes channels to strengthen the profile
- `shared_bottleneck_*`: combines H and W information in a shared intermediate space
- `h_attention_head` / `w_attention_head`: produce a separate mask for each axis
- `alpha_*`: per-axis control of "how much attention kicks in"
- `beta`: overall strength setting
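
To get a feel for what `alpha_*` and `beta` do together, the sketch below reproduces the scale arithmetic from the forward pass of the class with the default `init_alpha=0.7` and `beta=0.35`. `combined_scale` is a hypothetical helper, and alpha is passed in as an already-sigmoided value for simplicity.

```python
import torch

def combined_scale(attn_h, attn_w, alpha_h=0.7, alpha_w=0.7, beta=0.35):
    # Per-axis soft mix: alpha=0 -> no effect, alpha=1 -> full mask
    scale_h = (1.0 - alpha_h) + alpha_h * attn_h
    scale_w = (1.0 - alpha_w) + alpha_w * attn_w
    # Overall strength: beta=0 would leave the scale at 1 everywhere
    return 1.0 + beta * (scale_h * scale_w - 1.0)

fully_open = combined_scale(torch.tensor(1.0), torch.tensor(1.0))
fully_closed = combined_scale(torch.tensor(0.0), torch.tensor(0.0))
print(float(fully_open), float(fully_closed))
```

Even when both masks are fully closed, the scale only drops to `1 - 0.35 * (1 - 0.3 * 0.3) = 0.6815`, so with these defaults the coordinate stage (without the optional spatial gate) cannot suppress a feature by more than about a third.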

In [None]:

class CoordinateAttPlus(nn.Module):
    def __init__(
        self,
        in_channels: int,
        reduction: int = 32,
        min_mid_channels: int = 8,
        act: str = "hswish",
        init_alpha: float = 0.7,
        learnable_alpha: bool = True,
        beta: float = 0.35,
        dilation: int = 2,
        norm: str = "gn",
        use_spatial_gate: bool = False,
        spatial_gate_beta: float = 0.35,
    ):
        super().__init__()

        if in_channels < 1:
            raise ValueError("in_channels must be >= 1.")
        if reduction < 1:
            raise ValueError("reduction must be >= 1.")
        if dilation < 1:
            raise ValueError("dilation must be >= 1.")

        # Bottleneck channel count: a floor keeps it from getting too small
        mid_floor = max(8, min(32, int(in_channels) // 4))
        mid = max(int(min_mid_channels), int(in_channels) // int(reduction))
        mid = max(mid, int(mid_floor))

        # Activation choice
        act_l = act.lower()
        if act_l == "hswish":
            self.act = HSwish()
        elif act_l == "relu":
            self.act = nn.ReLU(inplace=True)
        elif act_l == "silu":
            self.act = nn.SiLU(inplace=True)
        else:
            raise ValueError("act must be 'hswish', 'relu', or 'silu'.")

        # Shared bottleneck (H and W together)
        self.shared_bottleneck_proj = nn.Conv2d(in_channels, mid, 1, bias=False)
        self.shared_bottleneck_norm = make_norm(norm, mid)
        self.shared_bottleneck_refine = nn.Conv2d(mid, mid, 1, bias=False)
        self.shared_bottleneck_refine_norm = make_norm(norm, mid)

        # Axis-wise local depthwise convolutions
        self.h_local_dw = nn.Conv2d(
            in_channels, in_channels, kernel_size=(3, 1), padding=(1, 0), groups=in_channels, bias=False
        )
        self.w_local_dw = nn.Conv2d(
            in_channels, in_channels, kernel_size=(1, 3), padding=(0, 1), groups=in_channels, bias=False
        )

        # Axis-wise dilated depthwise convolutions (wider context)
        d = int(dilation)
        self.h_dilated_dw = nn.Conv2d(
            in_channels,
            in_channels,
            kernel_size=(3, 1),
            padding=(d, 0),
            dilation=(d, 1),
            groups=in_channels,
            bias=False,
        )
        self.w_dilated_dw = nn.Conv2d(
            in_channels,
            in_channels,
            kernel_size=(1, 3),
            padding=(0, d),
            dilation=(1, d),
            groups=in_channels,
            bias=False,
        )

        # 1x1 mixing: combines channels to make the profile more flexible
        self.h_channel_mixer = nn.Conv2d(in_channels, in_channels, 1, bias=True)
        self.w_channel_mixer = nn.Conv2d(in_channels, in_channels, 1, bias=True)

        # Mask heads for the two axes (mid -> C)
        self.h_attention_head = nn.Conv2d(mid, in_channels, 1, bias=True)
        self.w_attention_head = nn.Conv2d(mid, in_channels, 1, bias=True)

        # Overall strength setting
        self.beta = float(beta)

        # Alpha parameters (stored as logits; sigmoid keeps them in the 0-1 range)
        eps = 1e-6
        a0 = float(init_alpha)
        a0 = min(max(a0, eps), 1.0 - eps)
        raw0 = torch.logit(torch.tensor(a0), eps=eps)

        if learnable_alpha:
            self.alpha_h_raw = nn.Parameter(raw0.clone())
            self.alpha_w_raw = nn.Parameter(raw0.clone())
        else:
            self.register_buffer("alpha_h_raw", raw0.clone())
            self.register_buffer("alpha_w_raw", raw0.clone())

        # Optional extra spatial gate (can be more aggressive)
        self.use_spatial_gate = bool(use_spatial_gate)
        self.spatial_gate_beta = float(spatial_gate_beta)
        if self.use_spatial_gate:
            self.spatial_gate_dw = nn.Conv2d(in_channels, in_channels, 3, padding=1, groups=in_channels, bias=False)
            self.spatial_gate_pw = nn.Conv2d(in_channels, in_channels, 1, bias=True)

        # Keep the latest masks for debugging
        self._last_ah = None
        self._last_aw = None

    def forward(self, x: torch.Tensor):
        _, _, H, W = x.shape

        # Axis summaries: mix of mean + max
        h_profile = 0.5 * (x.mean(dim=3, keepdim=True) + x.amax(dim=3, keepdim=True))  # (B,C,H,1)
        w_profile = 0.5 * (x.mean(dim=2, keepdim=True) + x.amax(dim=2, keepdim=True))  # (B,C,1,W)

        # Multi-scale (local + dilated) filtering + 1x1 mixing
        h_ms = self.h_channel_mixer(self.h_local_dw(h_profile) + self.h_dilated_dw(h_profile))  # (B,C,H,1)
        w_ms = self.w_channel_mixer(self.w_local_dw(w_profile) + self.w_dilated_dw(w_profile))  # (B,C,1,W)

        # Make w (B,C,W,1) so it can be concatenated
        w_ms = w_ms.permute(0, 1, 3, 2)

        # (B,C,H,1) + (B,C,W,1) -> (B,C,H+W,1)
        hw = torch.cat([h_ms, w_ms], dim=2)

        # Shared bottleneck: H and W are processed together
        mid = self.act(self.shared_bottleneck_norm(self.shared_bottleneck_proj(hw)))
        mid = self.act(self.shared_bottleneck_refine_norm(self.shared_bottleneck_refine(mid)))

        # Split back
        mid_h, mid_w = torch.split(mid, [H, W], dim=2)
        mid_w = mid_w.permute(0, 1, 3, 2)  # (B,mid,W,1) -> (B,mid,1,W)

        # Masks
        attn_h = F.hardsigmoid(self.h_attention_head(mid_h), inplace=False)  # (B,C,H,1)
        attn_w = F.hardsigmoid(self.w_attention_head(mid_w), inplace=False)  # (B,C,1,W)

        self._last_ah = attn_h.detach()
        self._last_aw = attn_w.detach()

        # Alpha: raw value -> sigmoid maps it to 0-1
        alpha_h = torch.sigmoid(self.alpha_h_raw).to(device=x.device, dtype=x.dtype)
        alpha_w = torch.sigmoid(self.alpha_w_raw).to(device=x.device, dtype=x.dtype)

        # Soft mix: alpha=0 -> no effect, alpha=1 -> full mask
        scale_h = (1.0 - alpha_h) + alpha_h * attn_h
        scale_w = (1.0 - alpha_w) + alpha_w * attn_w

        # Combine the two axes
        scale = scale_h * scale_w

        # Overall strength control with beta
        scale = 1.0 + self.beta * (scale - 1.0)

        out = x * scale

        if self.use_spatial_gate:
            sg = self.spatial_gate_pw(self.spatial_gate_dw(x))
            sg = F.hardsigmoid(sg, inplace=False)
            sg = 1.0 + self.spatial_gate_beta * (sg - 1.0)
            out = out * sg

        return out

    @torch.no_grad()
    def last_mask_stats(self):
        if (self._last_ah is None) or (self._last_aw is None):
            return None
        ah = self._last_ah
        aw = self._last_aw
        return {
            "a_h": {"min": float(ah.min()), "mean": float(ah.mean()), "max": float(ah.max()), "std": float(ah.std())},
            "a_w": {"min": float(aw.min()), "mean": float(aw.mean()), "max": float(aw.max()), "std": float(aw.std())},
        }

### 4.4 The logic of `init_alpha -> raw0` (logit space)

The idea here:

- The user passes `init_alpha=0.7` (a control value in the 0–1 range)
- Internally, a "raw" value `raw0 = logit(0.7)` is stored instead of 0.7 itself
- In the forward pass, `sigmoid(raw)` maps it back to 0–1

This approach:
- keeps the parameter in the 0–1 range by construction, with no clipping needed
- lets the optimizer update the raw value freely, which tends to make learning more stable

Why `eps` is added:
- as a safety margin, because `logit` diverges at exactly 0 and 1.
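
The round trip can be verified in a few lines. This sketch repeats the initialisation steps from the constructor outside the class, using the default `init_alpha=0.7`.

```python
import torch

eps = 1e-6
init_alpha = 0.7
a0 = min(max(init_alpha, eps), 1.0 - eps)        # keep away from exactly 0 or 1
raw0 = torch.logit(torch.tensor(a0), eps=eps)    # value stored as the raw parameter
alpha = torch.sigmoid(raw0)                      # value used in the forward pass

print(float(raw0), float(alpha))
```

The stored raw value is `log(0.7 / 0.3)`, roughly 0.847, and the sigmoid brings it back to 0.7.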

### 4.5 Why the `0.5*(mean + amax)` profiles?

Mean alone:
- gives a more stable summary
- but can understate sharp (peak) signals

Max alone:
- captures the sharp signal
- but can be more exposed to noise

Combining the two half and half carries both at once:
- the stability of the mean
- the selectivity of the max

Here the summary is computed per channel along H or along W, so the other axis is preserved. That is why a "position-aware" signal is captured.
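
A minimal numeric example of the trade-off, using a made-up feature map with a single peak in its first row:

```python
import torch

x = torch.zeros(1, 1, 2, 4)     # (B,C,H,W)
x[0, 0, 0, 2] = 8.0             # one sharp peak in row 0

mean_h = x.mean(dim=3, keepdim=True)    # (1,1,2,1)
max_h = x.amax(dim=3, keepdim=True)     # (1,1,2,1)
h_profile = 0.5 * (mean_h + max_h)

print(mean_h.flatten().tolist())     # [2.0, 0.0]: the peak is diluted
print(max_h.flatten().tolist())      # [8.0, 0.0]: the peak is kept as is
print(h_profile.flatten().tolist())  # [5.0, 0.0]: the half-and-half mix
```

The profile still has one value per row, so the information about *which* row contains the peak survives the summary.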

## 5) Combined block: Method 1

This block:
1) `x` -> channel attention -> `y`
2) `y` -> coordinate attention -> `y2`
3) If the residual is enabled, `x` and `y2` are mixed in a controlled way: `out = x + alpha * (y2 - x)`

Why does the residual matter?
- Stacking two attentions can sometimes suppress too much
- The residual makes their effect adjustable and safer

### 5.1 Why a separate residual alpha (block level)?

Coordinate attention already has `alpha_h` and `alpha_w` inside. Those provide per-axis softening.

The `alpha_raw` here sets the effect of the whole block:
- the entire channel + coord combination

In practice, this two-level control gives more stable behavior.
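
The block-level mix is a plain linear interpolation between the input and the attended output. The sketch below shows its two extremes and the default `alpha_init=0.75`; `residual_mix` is a hypothetical helper and the tensors are made-up values.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])    # block input
y = torch.tensor([0.2, 1.0, 3.0])    # stand-in for the attended output

def residual_mix(x, y, alpha):
    # Linear interpolation between the input and the attended output
    return x + alpha * (y - x)

print(residual_mix(x, y, 0.0))     # equals x: the block has no effect
print(residual_mix(x, y, 0.75))    # default alpha_init
print(residual_mix(x, y, 1.0))     # equals y: full attention effect
```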

In [None]:

class CBAMChannelPlusCoord(nn.Module):
    def __init__(
        self,
        channels: int,
        ca_reduction: int = 16,
        ca_min_hidden: int = 4,
        ca_fusion: str = "softmax",
        ca_gate: str = "sigmoid",
        ca_temperature: float = 0.9,
        ca_act: str = "relu",
        ca_fusion_router_hidden: int = 16,
        learnable_temperature: bool = False,
        coord_reduction: int = 32,
        coord_min_mid: int = 8,
        coord_act: str = "hswish",
        coord_init_alpha: float = 0.7,
        coord_learnable_alpha: bool = True,
        coord_beta: float = 0.35,
        coord_dilation: int = 2,
        coord_norm: str = "gn",
        coord_use_spatial_gate: bool = False,
        coord_spatial_gate_beta: float = 0.35,
        residual: bool = True,
        alpha_init: float = 0.75,
        learnable_alpha: bool = False,
        return_maps: bool = False,
    ):
        super().__init__()
        if channels < 1:
            raise ValueError("channels must be >= 1.")

        self.return_maps = bool(return_maps)
        self.residual = bool(residual)

        self.ca = ChannelAttentionFusionT(
            channels=channels,
            reduction=ca_reduction,
            min_hidden=ca_min_hidden,
            fusion=ca_fusion,
            gate=ca_gate,
            temperature=ca_temperature,
            learnable_temperature=learnable_temperature,
            eps=1e-6,
            act=ca_act,
            bias=True,
            fusion_router_hidden=ca_fusion_router_hidden,
            return_fusion_weights=self.return_maps,
        )

        self.coord = CoordinateAttPlus(
            in_channels=channels,
            reduction=coord_reduction,
            min_mid_channels=coord_min_mid,
            act=coord_act,
            init_alpha=coord_init_alpha,
            learnable_alpha=coord_learnable_alpha,
            beta=coord_beta,
            dilation=coord_dilation,
            norm=coord_norm,
            use_spatial_gate=coord_use_spatial_gate,
            spatial_gate_beta=coord_spatial_gate_beta,
        )

        # Alpha for the block-level residual mix
        if self.residual:
            eps = 1e-6
            a0 = float(alpha_init)
            a0 = min(max(a0, eps), 1.0 - eps)
            raw0 = torch.logit(torch.tensor(a0), eps=eps)
            if learnable_alpha:
                self.alpha_raw = nn.Parameter(raw0)
            else:
                self.register_buffer("alpha_raw", raw0)

    def _alpha(self, x: torch.Tensor) -> torch.Tensor:
        # Fallback: if alpha_raw is missing while residual=True (should not normally happen), use alpha=1.0
        if not hasattr(self, "alpha_raw"):
            return x.new_tensor(1.0)
        return torch.sigmoid(self.alpha_raw).to(device=x.device, dtype=x.dtype)

    def forward(self, x: torch.Tensor):
        # The channel module returns (y, ca) or (y, ca, fusion_w). fusion_w is only
        # present with fusion="softmax", so avoid a fixed 3-way unpack that would
        # fail for return_maps=True combined with ca_fusion="sum".
        ca_out = self.ca(x)               # channel
        y, ca_map = ca_out[0], ca_out[1]
        fusion_w = ca_out[2] if len(ca_out) == 3 else None

        y = self.coord(y)                 # coordinate

        if self.residual:
            out = x + self._alpha(x) * (y - x)
        else:
            out = y

        if self.return_maps:
            coord_stats = self.coord.last_mask_stats()
            return out, ca_map, fusion_w, coord_stats
        return out

### 5.2 What does `x.new_tensor(1.0)` mean?

`new_tensor` gives these guarantees:
- the value 1.0 is created **on the same device as x** (CPU/GPU)
- the value 1.0 is created **with the same dtype as x** (float16/float32)

This keeps the fallback consistent with `x` and avoids device/dtype mismatch problems of the "GPU tensor combined with a CPU value" kind.
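
A short CPU-only illustration of the dtype part (the device part works the same way but needs a GPU to demonstrate):

```python
import torch

x = torch.zeros(2, 3, dtype=torch.float16)

one_like_x = x.new_tensor(1.0)    # inherits dtype and device from x
one_default = torch.tensor(1.0)   # uses the global default dtype, on CPU

print(one_like_x.dtype, one_like_x.device)
print(one_default.dtype, one_default.device)
```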

## 6) Test: reading shapes and statistics

This cell checks five things at once:
- are the input and output shapes the same?
- is `ca_map` a channel mask?
- is `fusion_w` the router weights?
- are the coordinate masks in a reasonable range?
- are the attention blocks suppressing too much? (rough check)

Note: `.detach()` is used so that calling `float(tensor)` on a tensor attached to the autograd graph does not trigger a warning.

In [None]:

def tensor_stats(t: torch.Tensor):
    t = t.detach()
    return {
        "min": float(t.min()),
        "mean": float(t.mean()),
        "max": float(t.max()),
        "std": float(t.std()),
    }

x = torch.randn(2, 64, 56, 56)

m = CBAMChannelPlusCoord(
    channels=64,
    return_maps=True,
    residual=True,
    alpha_init=0.75,
    learnable_alpha=False,
    learnable_temperature=True,
    ca_temperature=0.9,
    coord_beta=0.35,
    coord_dilation=2,
    coord_norm="gn",
)

out, ca_map, fusion_w, coord_stats = m(x)

print("x:", x.shape)
print("out:", out.shape)
print("ca_map:", ca_map.shape)
print("fusion_w:", fusion_w.shape)
print("coord_stats:", coord_stats)

print("CA stats:", tensor_stats(ca_map))
print("fusion_w mean:", fusion_w.detach().mean(dim=0).tolist())

## 7) Debug checklist: what to look at when things look bad

This section is practical: a quick diagnosis for when the statistics look bad during training or in a forward test.

### 7.1 If `ca_map` is very small
- If `ca_map.mean` is low (e.g. 0.2–0.3) and the distribution of `out` narrows a lot:
  - raising `ca_temperature` softens the mask (pulls it toward 0.5)
  - lowering `alpha_init` (the residual mix) reduces the overall strength of the block
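
The effect of the temperature on the gate can be seen on two made-up pre-gate scores. This sketch assumes the `sigmoid` gate and applies it as `sigmoid(z / T)`, the same form used in the channel module.

```python
import torch

z = torch.tensor([-2.0, 2.0])   # illustrative pre-gate scores

for T in (0.5, 1.0, 2.0):
    ca = torch.sigmoid(z / T)
    print(T, [round(v, 3) for v in ca.tolist()])
```

As `T` grows, both mask values move toward 0.5, so a strongly suppressed channel is suppressed less.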

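### 7.2 If `ca_map` is always close to 1
- attention may have become ineffective (the gate is saturated):
  - raise the temperature a little: the gate is `gate(z / T)`, so a larger `T` pulls saturated values back toward 0.5, whereas lowering `T` would saturate them further
  - change the gate (sigmoid/hardsigmoid)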

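### 7.3 If the coordinate masks fluctuate too much
- If `coord_stats['a_h']['std']` and `coord_stats['a_w']['std']` are very high:
  - lower `coord_beta` (this shrinks the masks' effect on the output; the reported mask statistics are taken before `beta` is applied)
  - reduce `coord_dilation`
  - keep `coord_use_spatial_gate=False`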

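### 7.4 If `fusion_w` locks in immediately
- If `fusion_w` quickly goes to one side (e.g. [0.95, 0.05]), these settings have an influence:
  - router capacity (`fusion_router_hidden`)
  - learning rate
  - temperature (indirectly: it does not enter the router softmax, but it scales the gradients that reach the router through `z / T`)

This checklist is for quickly judging whether the outputs look "good or bad".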