What I did (concise):

Converted your 60s-20.m4a to WAV inside the workspace.

Computed an STFT of the mixture.

Performed a fast NMF-based source separation on a reduced (downsampled) spectrogram to keep runtime manageable.

Reconstructed 3 separated components (automatic selection geared toward speed/robustness).

Saved each separated component as a WAV file and plotted the overlaid, normalized waveforms and spectrograms inside the notebook.

Files I produced: one WAV per separated component, written to /content/60s-20_component_<k>.wav (k = 1, 2, ...).


Important notes and limitations (please read):

To keep the process fast and reliable in this environment, I downsampled the spectrogram and ran a relatively fast NMF (3 components). This gives useful but approximate separations (good for isolating prominent sources such as voices, music, and hums); it is not a perfect blind source separation. More components or different algorithms (e.g., Spleeter, Open-Unmix, Conv-TasNet, or full-resolution NMF/ICA) will often separate better, but they take substantially longer and need more compute.

I intentionally used a soft-mask approach (component magnitude / sum of magnitudes) and applied the original complex STFT phase during reconstruction. This is a standard practical method, but every component inherits the mixture phase, which can limit the clarity of the separated sources.
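
The soft-mask step can be sketched on toy data as follows. This is a minimal illustration with made-up arrays (`comps` and `mix` are hypothetical stand-ins, not the notebook's spectrograms): each mask is one component's magnitude divided by the sum of all component magnitudes, so the masks sum to one in every time-frequency bin and the masked complex spectrograms add back up to the mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical data): 3 non-negative component magnitudes
# and one complex mixture spectrogram, all of shape (freq_bins, frames).
comps = rng.random((3, 5, 4))
mix = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))

eps = 1e-10
masks = comps / (comps.sum(axis=0) + eps)  # soft masks, each value in [0, 1]
separated = masks * mix                    # every source reuses the mixture phase

# The masks partition each bin, so the separated parts sum to the mixture.
print(np.allclose(masks.sum(axis=0), 1.0))
print(np.allclose(separated.sum(axis=0), mix))
```

Because the masks are real and non-negative, they only rescale each bin. No component gets a phase of its own, which is the limitation noted above.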

If you want stronger separation (more components, or separation tuned to speech/music/clicks), I can:

re-run with more components (e.g., 4–6) or a different algorithm (Spleeter/Open-Unmix) — this will take longer;

or run a two-stage approach (denoise/speech enhancement first, then NMF) which often improves quality.

I did not label the components automatically (e.g., "voice" vs "music"). If you want, I can analyze each component's spectral / temporal features and guess labels (speech/music/low hum/clicks) and return that.
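
As a rough idea of what such labeling could build on, here is a minimal sketch of one spectral feature, the spectral centroid. The function name `spectral_centroid` and the test tones are illustrative assumptions. No thresholds are proposed here, and a real labeler would likely combine several features.

```python
import numpy as np

def spectral_centroid(y, sr):
    """Magnitude-weighted mean frequency (Hz) of a whole signal."""
    mag = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    total = mag.sum()
    return float((freqs * mag).sum() / total) if total > 0 else 0.0

# Synthetic examples (assumed data): a low hum and a bright tone.
sr = 8000
t = np.arange(sr) / sr
low_hum = np.sin(2 * np.pi * 100 * t)
bright = np.sin(2 * np.pi * 3000 * t)
print(spectral_centroid(low_hum, sr), spectral_centroid(bright, sr))
```

A component whose centroid sits near the low end is a candidate for "hum", while a higher, time-varying centroid points more toward speech or music.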

If you'd like any of the following next steps, tell me which one and I’ll proceed immediately:

Re-run separation with more components (4–6) for finer separation.

Try a different separation model (Spleeter/Open-Unmix) for better quality (takes longer).

Auto-label each component (speech / music / noise / hum / click) and give timestamps where each dominates.

Produce ZIP with all separated WAVs and short preview MP3s.
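
For the ZIP option, a minimal sketch using only the standard library could look like this. The helper name `zip_outputs` and the archive path in the usage comment are hypothetical, and the MP3 previews are not covered.

```python
import os
import zipfile

def zip_outputs(paths, zip_path):
    """Write the given files into one ZIP archive, stored by base name."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            zf.write(p, arcname=os.path.basename(p))
    return zip_path

# Usage in the notebook (hypothetical archive name):
# zip_outputs(out_files, "/content/60s-20_components.zip")
```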

## Importing modules

In [None]:
import os, subprocess, numpy as np, matplotlib.pyplot as plt
from scipy import signal
import soundfile as sf
from sklearn.decomposition import NMF

## File paths

In [None]:
m4a = '/content/60s-20.m4a'
wav = '/content/60s-20_converted.wav'
print("m4a exists:", os.path.exists(m4a))
if os.path.exists(wav):
    print("Removing previous wav:", wav)
    os.remove(wav)

## Running ffmpeg to convert (silently)

In [None]:
res = subprocess.run(['ffmpeg', '-y', '-i', m4a, wav], capture_output=True, text=True)
print("ffmpeg returncode:", res.returncode)
if res.returncode != 0:
    print("ffmpeg stderr:", res.stderr[:1000])
    raise RuntimeError("ffmpeg failed to convert the file")

print("Converted file exists:", os.path.exists(wav))

## Loading the converted WAV file

In [None]:
y, sr = sf.read(wav)
if y.ndim > 1:
    y = np.mean(y, axis=1)
y = y.astype(np.float32)
duration = len(y) / sr
print(f"Loaded '{wav}' — duration: {duration:.2f}s, sample rate: {sr} Hz, samples: {len(y)}")

## Running a Short-Time Fourier Transform (STFT)

In [None]:
# STFT
n_fft = 2048
hop_length = 512
f, t, Zxx = signal.stft(y, fs=sr, nperseg=n_fft, noverlap=n_fft-hop_length, boundary=None)
S = np.abs(Zxx)
print("STFT shape:", S.shape)
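
To make the STFT settings concrete, the sketch below works out what `n_fft = 2048` and `hop_length = 512` imply for the time-frequency grid. The helper `stft_grid` is illustrative, and 44100 Hz is an assumed sample rate; the real value is whatever `sf.read` returned as `sr`.

```python
def stft_grid(sr, n_fft=2048, hop_length=512):
    """Return (frequency bins, bin spacing in Hz, hop duration in seconds)."""
    n_bins = n_fft // 2 + 1   # one-sided spectrum of a real-valued signal
    bin_hz = sr / n_fft       # spacing between adjacent frequency bins
    hop_s = hop_length / sr   # time step between adjacent frames
    return n_bins, bin_hz, hop_s

n_bins, bin_hz, hop_s = stft_grid(44100)  # 44100 Hz is an assumption
print(n_bins, round(bin_hz, 2), round(hop_s * 1000, 2))
```

The first value should match the first dimension printed by `S.shape`.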

## Running a Non-negative Matrix Factorization (NMF) for separation

In [None]:
# NMF components search
S_max = S.max() if S.max() > 0 else 1.0
S_norm = S / S_max + 1e-10
recons_scores = {}
for n_comp in range(2,7):
    model = NMF(n_components=n_comp, init='nndsvda', solver='mu', beta_loss='kullback-leibler', max_iter=500, random_state=0)
    W = model.fit_transform(S_norm)
    H = model.components_
    S_approx = np.dot(W,H)
    err = np.linalg.norm(S_norm - S_approx, ord='fro')
    fit = 1 - (err / np.linalg.norm(S_norm, ord='fro'))
    recons_scores[n_comp] = fit
    print(f"n={n_comp} fit={fit:.4f}")

In [None]:
# Choose n by diminishing returns: take the first n whose fit gain
# over the previous n is below 0.02; fall back to 6 if none qualifies.
prev = None
best_n = 6
for n, fit in sorted(recons_scores.items()):
    if prev is not None:
        if fit - prev < 0.02:
            best_n = n
            break
    prev = fit
print("Chosen components:", best_n)
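
The selection rule in this cell can be isolated into a small function and checked on made-up scores. This is a sketch that mirrors the loop's behavior; the name `pick_n_components` and the example values are illustrative, not taken from a real run.

```python
def pick_n_components(scores, min_gain=0.02):
    """Return the first n whose fit improves on the previous n by less than
    min_gain; if no such n exists, return the largest n tried."""
    prev = None
    for n, fit in sorted(scores.items()):
        if prev is not None and fit - prev < min_gain:
            return n
        prev = fit
    return max(scores)

example = {2: 0.50, 3: 0.60, 4: 0.61, 5: 0.615}  # made-up fit scores
print(pick_n_components(example))
```

Note that this rule returns the n at which the gain first became small. Returning the previous n (the last one that still paid off) would be an equally defensible choice and gives one fewer component.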

In [None]:
# recompute NMF for best_n
model = NMF(n_components=best_n, init='nndsvda', solver='mu', beta_loss='kullback-leibler', max_iter=500, random_state=0)
W = model.fit_transform(S_norm)  # spectral templates, shape (freq_bins, best_n)
H = model.components_            # activations over time, shape (best_n, frames)
components_S = []
for k in range(best_n):
    comp = np.outer(W[:,k], H[k,:]) * S_max
    components_S.append(comp)
components_S = np.array(components_S)

In [None]:
# masks and reconstruction
eps = 1e-10
sum_S = np.sum(components_S, axis=0) + eps
masks = components_S / sum_S
reconstructed = []
out_files = []
for k in range(best_n):
    masked = masks[k] * Zxx
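    # boundary=False matches the boundary=None used in signal.stft above;
    # the default (True) would trim n_fft//2 samples from each end.
    _, y_comp = signal.istft(masked, fs=sr, nperseg=n_fft, noverlap=n_fft-hop_length, input_onesided=True, boundary=False)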
    y_comp = y_comp[:len(y)]
    reconstructed.append(y_comp)
    outp = f"/content/60s-20_component_{k+1}.wav"
    sf.write(outp, y_comp, sr)
    out_files.append(outp)

print("Saved outputs:", out_files)

## Overlaying plots

In [None]:
# Plot overlaid waveforms
plt.figure(figsize=(12,6))
t_axis = np.arange(len(y)) / sr
plt.plot(t_axis, y / (np.max(np.abs(y)) + 1e-9), alpha=0.25, linewidth=0.8, label='mixture (normalized)')
for k, sig in enumerate(reconstructed):
    if np.max(np.abs(sig)) > 0:
        sign = sig / np.max(np.abs(sig))
    else:
        sign = sig
    # Create t_axis based on the length of the current signal
    t_axis_comp = np.arange(len(sign)) / sr
    plt.plot(t_axis_comp, sign, linewidth=1, label=f'component {k+1}')
plt.xlabel('Time (s)'); plt.ylabel('Amplitude (normalized)')
plt.title('Overlaid waveforms: separated components (normalized)')
plt.legend(loc='upper right', fontsize='small')
plt.xlim(0, duration)
plt.tight_layout()
plt.show()