# 1. DATASET UNIFICATION

**In this section we unify the images from several mammography databases. The databases used are:**
- CESM
- MIAS
- CMMD
- INbreast
- DDSM

**To do this, we convert the images to jpg format and standardise some variables, such as tumour shape and breast density.**

**The final result of this section is a single dataset containing the paths to all the images and their characteristics.**
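
A minimal sketch of the target record layout, in plain Python so it can be read independently of pandas. The column list is taken from the per-dataset tables in this notebook; the helper name `to_unified_row`, the sample rows and the choice of 0 as the fill value for columns a dataset does not provide are assumptions for illustration.

```python
# Columns that appear in the per-dataset tables of this notebook.
UNIFIED_COLUMNS = [
    "Database", "Image_name", "View", "Path", "Label", "ACR",
    "Mass", "Mass_shape", "Asymmetry", "Calcification", "Distortion",
]


def to_unified_row(row, database):
    """Return a dict with exactly UNIFIED_COLUMNS, filling gaps with 0.

    Hypothetical helper: the 0 fill value is an assumption that mirrors the
    fillna(0) call applied to the CESM table.
    """
    unified = {column: row.get(column, 0) for column in UNIFIED_COLUMNS}
    unified["Database"] = database
    return unified


# Illustrative rows only; they are not taken from the real annotation files.
mias_row = {"Image_name": "mdb001", "Label": "Benign", "Mass": 1,
            "Mass_shape": "Circumscribed", "ACR": "B"}
cmmd_row = {"Image_name": "D1-0001", "Label": "Benign", "Mass": 1,
            "Mass_shape": "Unknown", "Extra": "dropped"}

rows = [to_unified_row(mias_row, "MIAS"), to_unified_row(cmmd_row, "CMMD")]
```

A list of such dicts can be turned into a table with `pd.DataFrame(rows, columns=UNIFIED_COLUMNS)`.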

In [1]:
from keras.utils import image_dataset_from_directory

from keras.layers import (
    GlobalAveragePooling2D, Flatten, Input, 
    Dense, Dropout, Conv2D, Conv2DTranspose, BatchNormalization, AveragePooling2D,
    MaxPooling2D, UpSampling2D, Rescaling, Resizing,
    RandomFlip, RandomRotation, RandomZoom, RandomContrast, Lambda)
from keras.callbacks import (EarlyStopping, ReduceLROnPlateau)
from keras.optimizers import (Adam, RMSprop)
from keras import Sequential, Model

import tensorflow as tf

from PIL import Image
import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time
from sklearn.metrics import classification_report, confusion_matrix
import itertools
import random

In [2]:
pd.set_option('display.max_rows', None)

# 1.1 DATASET CESM

Source: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=109379611#109379611bcab02c187174a288dbcbf95d26179e8

Year: 2022

Format: Full Digital / jpg

Biopsy confirmation: Yes

The Categorized Digital Database for Low energy and Subtracted Contrast Enhanced Spectral Mammography (CDD-CESM) consists of 2006 images with an average resolution of 2355 × 1315 pixels. The images are organised as follows:

- 310 images with masses
- 48 images with architectural distortions
- 222 images with asymmetries
- 238 images with calcifications
- 334 images with mass enhancement
- 184 images with non-mass enhancement
- 159 postoperative images
- 8 post-neoadjuvant-chemotherapy images
- 751 normal images

The CESM dataset was acquired with standard digital mammography equipment plus additional software that performs dual-energy image acquisition. Two minutes after intravenous injection of a non-ionic, low-osmolar iodinated contrast agent (dose: 1.5 mL/kg), the craniocaudal (CC) and mediolateral oblique (MLO) views are obtained. Each view consists of two exposures, one at low energy (comparable to a conventional digital radiograph) and one at high energy (45 to 49 kVp). The low- and high-energy images are recombined and subtracted through suitable image processing to suppress the background breast parenchyma.

Criteria and decisions:

- We use only the Low energy images, which are comparable to Full Field Digital images.
- We store the images in .jpg format.
- We create one column for each finding type present in the mammograms: Masses, Calcifications, Asymmetries and Distortions.
- To unify criteria, the tumour shape is mapped to a set of categories shared with the other datasets. The final categories are Circumscribed, Lobulated, Obscured, Speculated or Indistinct.
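
The shape mapping can be sketched as a lookup that first normalises whitespace and case, so that variants such as `'Microlobulated '` and `'Microlobulated'` need only one entry. This is a plain-Python illustration, not the notebook's own implementation; the function name `normalize_shape` and the fallback of returning the cleaned input unchanged are assumptions.

```python
# Raw annotation values (lower-cased, whitespace-normalised) -> final category.
SHAPE_MAPPING = {
    "microlobulated": "Lobulated",
    "partially obscured": "Obscured",
    "lobulated partially obscured": "Obscured",
    "partially obscured lobulated": "Obscured",
    "obscured": "Obscured",
    "speculated ulcerating": "Speculated",
    "speculated": "Speculated",
    "indistinct": "Indistinct",
    "_": "Unknown",
}


def normalize_shape(raw):
    """Map a raw mass-shape annotation to one of the shared categories."""
    cleaned = " ".join(str(raw).split())  # trim and collapse repeated spaces
    # Assumption: values without a mapping entry are returned as they are.
    return SHAPE_MAPPING.get(cleaned.lower(), cleaned)
```

For example, `normalize_shape('Partially obscured  ')` gives `'Obscured'`, and `normalize_shape('Circumscribed')` is returned unchanged.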

In [3]:
cesm_doc_dir = 'DATABASE - CESM/Radiology manual annotations.xlsx'
cesm_image_dir = 'DATABASE - CESM/Low energy'

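def read_annotations(excel_path):
    """Read the annotations from an Excel file."""
    # Map each finding sheet to the column name used in the unified dataset.
    # str.capitalize() would produce 'Calcifications', which the later
    # reindex on 'Calcification' silently discards.
    sheet_to_column = {
        'mass_description': 'Mass',
        'asymmetry': 'Asymmetry',
        'calcifications': 'Calcification',
        'distortion': 'Distortion',
    }

    def preprocess_df(df):
        df = df.rename(columns={'Pathology Classification/ Follow up': 'Label'})
        df['Image_name'] = df['Image_name'].astype(str).str.strip()
        return df

    annotations_dfs = []
    sheets = ['all', 'mass_description', 'asymmetry', 'calcifications', 'distortion']
    for sheet in sheets:
        df = pd.read_excel(excel_path, sheet_name=sheet)
        df = preprocess_df(df)
        if sheet == 'all':
            # Only the normal cases are taken from the summary sheet.
            df = df[df['Label'] == 'Normal'].copy()
            df['Mass_shape'] = 0
        elif sheet == 'mass_description':
            df['Mass'] = 1
            # Keep the first component of compound margin descriptions.
            df['Mass_shape'] = df['Mass margin'].str.split('-').str[0]
        else:
            df[sheet_to_column[sheet]] = 1
            df['Mass_shape'] = 0
        annotations_dfs.append(df)
    annotations_df = pd.concat(annotations_dfs, ignore_index=True)

    return annotations_df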


def create_image_path_df(data_path):
    """
    Create a DataFrame with image paths from a given directory.

    Parameters:
    - data_path (str): Path to the directory containing images.

    Returns:
    - pd.DataFrame: DataFrame with 'Path' column containing image filenames.
    """
    image_files = os.listdir(data_path)
    full_paths = [os.path.join(data_path, file) for file in image_files]
    filenames_we = [os.path.splitext(file)[0] for file in image_files]
    
    df = pd.DataFrame(
        {'Path': full_paths, 'Filename': filenames_we}
    )
    return df


def merge_image_paths(annotations_df, image_path_df1):
    """
    Merge annotations DataFrame with image paths DataFrame based on the 'Image_name' column.

    Parameters:
    - annotations_df (pd.DataFrame): DataFrame with annotations.
    - image_path_df1 (pd.DataFrame): DataFrame with image paths.

    Returns:
    - pd.DataFrame: Merged and filtered DataFrame.
    """
    
    merged_df = pd.merge(annotations_df, image_path_df1[['Filename', 'Path']], left_on='Image_name', right_on='Filename', how='left')

    merged_df.dropna(subset=['Label'], inplace=True)

    # Keep only the rows of type 'DM'; .copy() avoids pandas'
    # SettingWithCopyWarning on the assignment below.
    merged_df_filtered = merged_df[merged_df['Type'] == 'DM'].copy()

    merged_df_filtered['Database'] = 'CESM'

    columns_to_drop = ['Filename', 'Patient_ID', 'Side', 'Type', 'BIRADS', 'Findings', 'Tags', 'Machine']

    merged_df_filtered = merged_df_filtered.drop(columns_to_drop, axis=1)

    column_order = ['Database', 'Image_name', 'View', 'Path', 'Label', 'ACR', 'Mass', 'Mass_shape', 'Asymmetry', 'Calcification', 'Distortion']

    merged_df_filtered = merged_df_filtered.reindex(columns=column_order)

    # The previous row labels are kept as an 'index' column.
    merged_df_filtered = merged_df_filtered.reset_index()

    return merged_df_filtered


image_path_low = create_image_path_df(cesm_image_dir)

annotations_df= read_annotations(cesm_doc_dir)

df_cesm = merge_image_paths(annotations_df, image_path_low)

df_cesm.fillna(0, inplace=True)

# Assign the result instead of calling replace(inplace=True) on a column selection.
df_cesm['ACR'] = df_cesm['ACR'].replace('_', 0)

mapping = {
    'Microlobulated': 'Lobulated',
    'Microlobulated ': 'Lobulated',
    'Partially obscured  ': 'Obscured',
    'Lobulated partially obscured': 'Obscured',
    'Partially obscured lobulated': 'Obscured',
    'Obscured': 'Obscured',
    'Speculated ulcerating': 'Speculated',
    'Speculated ': 'Speculated',
    'Indistinct ': 'Indistinct',
    '_': 'Unknown'
}

df_cesm['Mass_shape'] = df_cesm['Mass_shape'].replace(mapping)

df_cesm['Mass_shape'].value_counts()

print(df_cesm.info())

print(df_cesm['Label'].value_counts())

df_cesm.head()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1172 entries, 0 to 1171
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          1172 non-null   int64  
 1   Database       1172 non-null   object 
 2   Image_name     1172 non-null   object 
 3   View           1172 non-null   object 
 4   Path           1172 non-null   object 
 5   Label          1172 non-null   object 
 6   ACR            1172 non-null   object 
 7   Mass           1172 non-null   float64
 8   Mass_shape     1172 non-null   object 
 9   Asymmetry      1172 non-null   float64
 10  Calcification  1172 non-null   float64
 11  Distortion     1172 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 110.0+ KB
None
Label
Malignant    519
Normal       354
Benign       299
Name: count, dtype: int64




|  | index | Database | Image_name | View | Path | Label | ACR | Mass | Mass_shape | Asymmetry | Calcification | Distortion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | CESM | P4_L_DM_MLO | MLO | DATABASE - CESM/Low energy/P4_L_DM_MLO.jpg | Normal | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 1 | 2 | CESM | P4_R_DM_CC | CC | DATABASE - CESM/Low energy/P4_R_DM_CC.jpg | Normal | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 2 | 4 | CESM | P5_R_DM_CC | CC | DATABASE - CESM/Low energy/P5_R_DM_CC.jpg | Normal | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 3 | 5 | CESM | P5_R_DM_MLO | MLO | DATABASE - CESM/Low energy/P5_R_DM_MLO.jpg | Normal | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 4 | 8 | CESM | P5_L_DM_CC | CC | DATABASE - CESM/Low energy/P5_L_DM_CC.jpg | Normal | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
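
The image names in the table, such as `P4_L_DM_MLO`, appear to pack four fields separated by underscores. A minimal sketch of splitting them, assuming the order is patient, side, image type and view (an assumption inferred from the names and from the `Type == 'DM'` filter, not something confirmed by the dataset documentation); `CesmName`, `parse_cesm_name` and `is_dm_image` are hypothetical helpers.

```python
from typing import NamedTuple


class CesmName(NamedTuple):
    patient: str
    side: str
    image_type: str
    view: str


def parse_cesm_name(image_name):
    """Split a name like 'P4_L_DM_MLO' into its four assumed fields."""
    parts = image_name.strip().split("_")
    if len(parts) != 4:
        raise ValueError(f"Unexpected CESM image name: {image_name!r}")
    return CesmName(*parts)


def is_dm_image(image_name):
    """True when the assumed type field is 'DM', the value the notebook keeps."""
    return parse_cesm_name(image_name).image_type == "DM"
```

Such a check could be used to verify that every file in the image directory matches the type kept by the merge step.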


# 2. DATASET MIAS

Source: http://peipa.essex.ac.uk/info/mias.html

Year: 1996

Format: Digitised analogue film / pgm

Biopsy confirmation: Yes

The MIAS (Mammographic Image Analysis Society) dataset consists of 322 digital mammographic images of left and right breasts, provided by St. Bartholomew's Hospital in London. The images were digitised at a resolution of 50 microns/pixel and have a size of approximately 1024x1024 pixels.

Each image comes with a set of annotations describing the mammographic features detected, such as the presence of masses, microcalcifications and other abnormal findings.

Criteria and decisions:

- CIRC is mapped to Circumscribed, SPIC to Speculated (the label this notebook uses for spiculated masses) and MISC to Indistinct, to harmonise the datasets.
- The images are converted from .pgm to .jpg.
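
A minimal sketch of how one line of the MIAS info file could be turned into the unified fields. It assumes whitespace-separated fields in the order REFNUM, BG, CLASS, SEVERITY, X, Y, RADIUS (the column names read by the notebook's code), with the trailing fields absent for normal images. The function name `parse_mias_line` and the sample lines are illustrative; the coordinates are made up.

```python
DENSITY_TO_ACR = {"F": "A", "G": "B"}          # any other background value -> 'D'
SEVERITY_TO_LABEL = {"B": "Benign", "M": "Malignant"}
CLASS_TO_SHAPE = {"CIRC": "Circumscribed", "SPIC": "Speculated", "MISC": "Indistinct"}


def parse_mias_line(line):
    """Parse one info line into the fields used by the unified dataset."""
    fields = line.split()
    refnum, background, lesion_class = fields[:3]
    severity = fields[3] if len(fields) > 3 else None  # missing for normal images
    return {
        "Image_name": refnum,
        "ACR": DENSITY_TO_ACR.get(background, "D"),
        "Mass": int(lesion_class in CLASS_TO_SHAPE),
        "Mass_shape": CLASS_TO_SHAPE.get(lesion_class, 0),
        "Label": SEVERITY_TO_LABEL.get(severity, "Normal"),
    }


# Illustrative lines; the numeric fields are made up.
benign_row = parse_mias_line("mdb001 G CIRC B 100 200 30")
normal_row = parse_mias_line("mdb003 D NORM")
```

The density rule follows the notebook's own `BG` mapping: `F` becomes `A`, `G` becomes `B`, and everything else becomes `D`.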

In [16]:
import pandas as pd
import numpy as np
import os
from PIL import Image

def process_mias_info(mias_doc_dir):
    """Process the MIAS info file into a DataFrame with labels and findings."""
    df = pd.read_csv(mias_doc_dir, sep=' ')
    mapping_mass_type = {
        'CIRC': 'Circumscribed',
        'SPIC': 'Speculated',
        'MISC': 'Indistinct',
        'ASYM': '0',
        'CALC': '0',
        'ARCH': '0',
        'NORM': '0'
    }
    # One binary column per finding type
    df['Mass'] = np.where(df['CLASS'].isin(['CIRC', 'SPIC', 'MISC']), 1, 0)
    df['Asymmetry'] = np.where(df['CLASS'].isin(['ASYM']), 1, 0)
    df['Calcification'] = np.where(df['CLASS'].isin(['CALC']), 1, 0)
    df['Distortion'] = np.where(df['CLASS'].isin(['ARCH']), 1, 0)
    df['Mass_shape'] = df['CLASS'].replace(mapping_mass_type)
    # Background tissue: F (fatty) -> A, G (fatty-glandular) -> B, anything else -> D
    df['ACR'] = df['BG'].apply(lambda x: 'A' if x == 'F' else ('B' if x == 'G' else 'D'))
    df = df.rename(columns={'SEVERITY': 'Label', 'REFNUM': 'Image_name'})
    # Rows without a severity value are the normal images
    df['Label'] = df['Label'].replace({'B': 'Benign', 'M': 'Malignant', pd.NA: 'Normal'})
    columns_to_drop = ['BG', 'CLASS', 'Unnamed: 7', 'X', 'Y', 'RADIUS']
    df = df.drop(columns_to_drop, axis=1)
    return df


def convert_pgm_to_jpg(image_dir):
    """Convert the PGM images in image_dir to JPG; return how many were converted."""
    if not os.path.exists(image_dir):
        print(f"The directory {image_dir} does not exist.")
        return 0
    if not os.listdir(image_dir):
        print(f"The directory {image_dir} is empty.")
        return 0
    output_dir = os.path.join(image_dir, 'converted')
    os.makedirs(output_dir, exist_ok=True)
    converted_count = 0
    for filename in os.listdir(image_dir):
        if filename.endswith('.pgm'):
            pgm_path = os.path.join(image_dir, filename)
            jpg_path = os.path.join(output_dir, os.path.splitext(filename)[0] + '.jpg')
            try:
                with Image.open(pgm_path) as img:
                    img.convert('RGB').save(jpg_path)
                converted_count += 1
            except Exception as e:
                print(f"Could not convert image {pgm_path}: {e}")
    return converted_count

def create_image_path_df(data_path):
    """Create a DataFrame with the image paths found in a directory."""
    image_files = os.listdir(data_path)
    full_paths = [os.path.join(data_path, file) for file in image_files]
    filenames_we = [os.path.splitext(file)[0] for file in image_files]
    df = pd.DataFrame({'Path': full_paths, 'Filename': filenames_we})
    return df

def merge_image_paths(annotations_df, image_path_df):
    """Merge the annotations DataFrame with the image paths DataFrame."""
    merged_df = pd.merge(annotations_df, image_path_df[['Filename', 'Path']], left_on='Image_name', right_on='Filename', how='left')
    merged_df['Database'] = 'MIAS'
    columns_to_drop = ['Filename']
    merged_df = merged_df.drop(columns_to_drop, axis=1)
    column_order = ['Database', 'Image_name', 'Path', 'Label', 'Mass', 'Mass_shape', 'Asymmetry', 'Calcification', 'Distortion', 'ACR']
    merged_df = merged_df.reindex(column_order, axis=1)
    return merged_df



mias_doc_dir = 'DATABASE - MIAS/Info.txt'
mias_image_dir = 'DATABASE - MIAS/All-mias'

processed_df = process_mias_info(mias_doc_dir)

num_images_converted = convert_pgm_to_jpg(mias_image_dir)
print(f"Converted {num_images_converted} images to JPG format.")

image_path_mias = create_image_path_df('DATABASE - MIAS/All-mias/converted')
df_mias = merge_image_paths(processed_df, image_path_mias)

print(df_mias.info())

print(df_mias['Label'].value_counts())

df_mias.head()

Converted 322 images to JPG format.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330 entries, 0 to 329
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Database       330 non-null    object
 1   Image_name     330 non-null    object
 2   Path           330 non-null    object
 3   Label          330 non-null    object
 4   Mass           330 non-null    int64 
 5   Mass_shape     330 non-null    object
 6   Asymmetry      330 non-null    int64 
 7   Calcification  330 non-null    int64 
 8   Distortion     330 non-null    int64 
 9   ACR            330 non-null    object
dtypes: int64(4), object(6)
memory usage: 25.9+ KB
None
Label
Normal       207
Benign        69
Malignant     54
Name: count, dtype: int64


|  | Database | Image_name | Path | Label | Mass | Mass_shape | Asymmetry | Calcification | Distortion | ACR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | MIAS | mdb001 | DATABASE - MIAS/All-mias/converted/mdb001.jpg | Benign | 1 | Circumscribed | 0 | 0 | 0 | B |
| 1 | MIAS | mdb002 | DATABASE - MIAS/All-mias/converted/mdb002.jpg | Benign | 1 | Circumscribed | 0 | 0 | 0 | B |
| 2 | MIAS | mdb003 | DATABASE - MIAS/All-mias/converted/mdb003.jpg | Normal | 0 | 0 | 0 | 0 | 0 | D |
| 3 | MIAS | mdb004 | DATABASE - MIAS/All-mias/converted/mdb004.jpg | Normal | 0 | 0 | 0 | 0 | 0 | D |
| 4 | MIAS | mdb005 | DATABASE - MIAS/All-mias/converted/mdb005.jpg | Benign | 1 | Circumscribed | 0 | 0 | 0 | A |


# 3. DATASET INBREAST

Source: https://www.kaggle.com/datasets/martholi/inbreast

Year: 2012

Format: Full Digital / DICOM

Biopsy confirmation: No

The INbreast dataset consists of full-field digital mammograms. It contains 115 cases with a total of 410 images. The images cover a wide variety of clinical situations, including masses, calcifications, asymmetries and distortions. They were acquired at a breast centre located in a university hospital (Centro Hospitalar de S. João, Breast Centre, Porto). The dataset is currently available only through the Kaggle platform.

In this case the lesions are not confirmed by biopsy; instead we have the BI-RADS score assigned by the radiologist.

Criteria and decisions:

- BI-RADS 1 is mapped to Normal, 2 and 3 to Benign, and 4, 5 and 6 to Malignant.
- The images are converted to jpg and the DICOM files are deleted to free up space.
- ACR density is mapped to the letters A to D.
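
The BI-RADS rule can be sketched as a small function that also copes with sub-categories such as `'4a'` and with values read as floats. This is an illustration in plain Python; the function name `birads_to_label` and the choice of returning `None` for unmapped values (for example BI-RADS 0) are assumptions.

```python
def birads_to_label(value):
    """Map a BI-RADS score such as 2, '4a' or 5.0 to Normal/Benign/Malignant."""
    category = str(value).strip()[:1]  # leading digit: '4a' -> '4', 3.0 -> '3'
    if category == "1":
        return "Normal"
    if category in {"2", "3"}:
        return "Benign"
    if category in {"4", "5", "6"}:
        return "Malignant"
    # Assumption: anything else is left unmapped.
    return None
```

Rows for which the function returns `None` can then be dropped before the remaining missing values are filled.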



In [17]:
import pydicom

inbr_doc_dir = 'DATABASE - INbreast/INbreast.xls'

inbr_image_dir = 'DATABASE - INbreast/AllDICOMs'

def process_inbreast_info(inbr_doc_dir):
    """
    Process the INbreast info file to create a DataFrame with Image_name and Label.

    Parameters:
    - inbr_doc_dir (str): Path to the INbreast info file.

    Returns:
    - pd.DataFrame: Processed DataFrame.
    """
    df = pd.read_excel(inbr_doc_dir)

    # .copy() so the in-place edits below act on an independent DataFrame.
    df = df[['File Name', 'Bi-Rads', 'View', 'ACR', 'Mass ', 'Micros', 'Distortion', 'Asymmetry']].copy()

    df.columns = ['Image_name', 'Label', 'View', 'ACR', 'Mass', 'Calcification', 'Distortion', 'Asymmetry']

    label_mapping = {1: 'Normal', 2: 'Benign', 3: 'Benign', 4: 'Malignant',
                     '4a': 'Malignant','4b': 'Malignant','4c': 'Malignant', 5: 'Malignant', 6: 'Malignant'}

    df['Label'] = df['Label'].map(label_mapping)

    # Drop rows whose BI-RADS value is missing or unmapped. This has to happen
    # before NaN is replaced by 0 below; afterwards dropna finds nothing to drop.
    df = df.dropna(subset=['Label'])

    acr_mapping = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}

    df['ACR'] = df['ACR'].apply(lambda x: acr_mapping.get(x, 0))

    # An 'X' in a finding column marks the finding as present.
    df = df.replace('X', 1)

    mass_type_mapping = {
    1: 'Unknown',
    0: 0
    }

    df['Mass_shape'] = df['Mass'].replace(mass_type_mapping)

    df = df.replace(np.nan, 0)

    df = df[df['Distortion'].isin([0, 1])]

    return df


'''
def convert_dicom_to_jpg(inbr_image_dir):
    """
    Convert DICOM images to JPG format.

    Parameters:
    - inbr_image_dir (str): Path to the directory containing DICOM images.

    Returns:
    - int: Number of images converted.
    """
    if not os.path.exists(inbr_image_dir):
        print(f"The directory {inbr_image_dir} does not exist.")
        return 0

    if not os.listdir(inbr_image_dir):
        print(f"The directory {inbr_image_dir} is empty.")
        return 0

    output_dir = os.path.join(inbr_image_dir, 'converted')
    os.makedirs(output_dir, exist_ok=True)

    converted_count = 0

    for filename in os.listdir(inbr_image_dir):
        if filename.endswith('.dcm'):
            dicom_path = os.path.join(inbr_image_dir, filename)
            jpg_path = os.path.join(output_dir, os.path.splitext(filename)[0] + '.jpg')

            # Convert DICOM to JPG
            try:
                ds = pydicom.dcmread(dicom_path)
                pixel_array = ds.pixel_array

                # Normalise the pixel values to the 8-bit range
                pixel_array = pixel_array.astype(np.float32)
                pixel_array = ((pixel_array - np.min(pixel_array)) / (np.max(pixel_array) - np.min(pixel_array))) * 255
                pixel_array = pixel_array.astype(np.uint8)

                image = Image.fromarray(pixel_array)
                image.save(jpg_path)
                converted_count += 1
            except Exception as e:
                print(f"Could not convert image {dicom_path}: {e}")

    return converted_count

num_images_converted = convert_dicom_to_jpg(inbr_image_dir)

print(f"Converted {num_images_converted} images to JPG format.")

'''

processed_df2 = process_inbreast_info(inbr_doc_dir)


In [18]:
def create_image_path_df(data_path):
    """
    Create a DataFrame with image paths from a given directory.

    Parameters:
    - data_path (str): Path to the directory containing images.

    Returns:
    - pd.DataFrame: DataFrame with 'Path' column containing image filenames.
    """
    image_files = os.listdir(data_path)
    full_paths = [os.path.join(data_path, file) for file in image_files]
    filenames_we = [os.path.splitext(file)[0] for file in image_files]
    
    df = pd.DataFrame(
        {'Path': full_paths, 'Filename': filenames_we}
    )
    return df


def merge_image_paths3(annotations_df, image_path_df1):
    """
    Merge annotations DataFrame with image paths DataFrame based on the 'Image_name' column.

    Parameters:
    - annotations_df (pd.DataFrame): DataFrame with annotations.
    - image_path_df1 (pd.DataFrame): DataFrame with image paths.

    Returns:
    - pd.DataFrame: Merged and filtered DataFrame.
    """
    
    image_path_df1['First_8'] = image_path_df1['Filename'].str[:8]
    annotations_df['Image_name'] = annotations_df['Image_name'].astype(str)
    annotations_df['Image_name'] = annotations_df['Image_name'].str[:8]

    # Merge the annotations with the image paths
    merged_df = pd.merge(annotations_df[['Image_name', 'View', 'Label', 'ACR', 'Mass', 'Mass_shape', 'Calcification', 'Distortion', 'Asymmetry']], 
                         image_path_df1[['First_8', 'Path']], 
                         left_on='Image_name', 
                         right_on='First_8', 
                         how='left')
    
    merged_df['Database'] = 'INbreast'
    
    columns_to_drop = ['First_8']
    
    merged_df = merged_df.drop(columns_to_drop, axis=1)

    column_order = ['Database', 'Image_name', 'View', 'Path', 'Label', 'ACR', 'Mass', 'Mass_shape', 'Calcification', 'Distortion', 'Asymmetry']
    
    merged_df = merged_df.reindex(columns=column_order)

    return merged_df

inbreast_paths = create_image_path_df('DATABASE - INbreast/AllDICOMs/converted')

df_INbreast = merge_image_paths3(processed_df2,inbreast_paths)
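
The merge above joins annotations and files on the first 8 characters of the file name. A left merge repeats an annotation row once per matching file, so it can help to check beforehand that each prefix identifies a single file. A minimal sketch in plain Python; the helper names and the sample paths are hypothetical, not real INbreast file names.

```python
import os
from collections import defaultdict


def index_by_prefix(paths, length=8):
    """Group file paths by the first `length` characters of their base name."""
    index = defaultdict(list)
    for path in paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        index[stem[:length]].append(path)
    return dict(index)


def ambiguous_prefixes(index):
    """Return the prefixes shared by more than one file, sorted."""
    return sorted(prefix for prefix, matches in index.items() if len(matches) > 1)


# Hypothetical paths, for illustration only.
sample_paths = [
    "converted/11111111_aaaa_MG_R_CC_ANON.jpg",
    "converted/22222222_bbbb_MG_L_CC_ANON.jpg",
    "converted/22222222_cccc_MG_L_MLO_ANON.jpg",
]
prefix_index = index_by_prefix(sample_paths)
duplicates = ambiguous_prefixes(prefix_index)
```

An empty `duplicates` list would indicate that the prefix join cannot multiply annotation rows.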




# 4. DATASET CMMD

Source: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70230508

Year: 2016

Format: Full Digital / DICOM

Biopsy confirmation: Yes


The CMMD (Chinese Mammography Database) is a database of full-field digital mammograms. It contains 3,728 mammograms from 1,775 patients, with the benign or malignant tumour type confirmed by biopsy. Molecular subtypes are also provided for 749 of these patients (1,498 mammograms). The images are in DICOM format.

This dataset contains no normal cases, only lesions with their classification.

Criteria and decisions:

- The images are converted to jpg and, again, the DICOM files are deleted to free up space.
- For some lesions the contralateral images are also provided, but it is unclear how they should be classified, so they have not been used.
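
Converting DICOM to jpg means reducing the stored pixel values to 8 bits. Below is a minimal sketch of a min-max rescaling step, the same idea as the NumPy expression in the disabled INbreast conversion code earlier in this notebook, written with plain lists so it runs without extra packages. The function name `scale_to_uint8` and the sample values are hypothetical.

```python
def scale_to_uint8(pixels):
    """Linearly rescale a 2-D list of integer pixel values to the range 0..255."""
    flat = [value for row in pixels for value in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:
        # A constant image has no contrast to stretch; avoid dividing by zero.
        return [[0 for _ in row] for row in pixels]
    span = highest - lowest
    # Integer arithmetic truncates, like a cast to an unsigned 8-bit integer.
    return [[(value - lowest) * 255 // span for value in row] for row in pixels]


# Made-up 12-bit values, for illustration only.
sample = [[0, 2048], [4095, 1024]]
scaled = scale_to_uint8(sample)
```

The guard for constant images is an addition of this sketch; the NumPy expression in `convert_dicom_to_jpg` would divide by zero in that case.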

In [19]:
cmmd_doc_dir = 'DATABASE - CMMD/CMMD_clinicaldata_revision.xlsx'
cmmd_image_dir = 'DATABASE - CMMD/CMMD'


def process_clinical_data(file_path):
    """Build the CMMD annotation table (Image_name, Label, Calcification, Mass)."""
    df = pd.read_excel(file_path)

    # Keep the ID, abnormality and classification columns (selected by position).
    df = df.iloc[:, [0, 4, 5]].copy()

    # Flag calcifications and masses; 'both' sets the two flags.
    df['Calcification'] = df['abnormality'].isin(['calcification', 'both']).astype(int)
    df['Mass'] = df['abnormality'].isin(['mass', 'both']).astype(int)

    df = df.drop('abnormality', axis=1)

    df.columns = ['Image_name', 'Label', 'Calcification', 'Mass']

    # Keep only the IDs that start with 'D1'.
    df = df[df['Image_name'].str.startswith('D1')]

    return df

'''
def convert_dicom_to_jpg_in_subdirectories(root_dir):
    for root, dirs, files in os.walk(root_dir):
        for filename in files:
            if filename.endswith('.dcm'):
                dicom_file_path = os.path.join(root, filename)
                dicom_image = pydicom.dcmread(dicom_file_path)

                image = Image.fromarray(dicom_image.pixel_array)

                image = image.convert('L')

                jpg_dir = os.path.abspath(os.path.join(root, '..', '..'))

                os.makedirs(jpg_dir, exist_ok=True)

                jpg_file_path = os.path.join(jpg_dir, filename[:-4] + '.jpg')
                image.save(jpg_file_path)

                print(f"Converted file: {jpg_file_path}")


convert_dicom_to_jpg_in_subdirectories(cmmd_image_dir)
'''
df_anotations = process_clinical_data(cmmd_doc_dir)

In [20]:
import shutil
'''
def delete_subdirectories(directory):
    for subdir in os.listdir(directory):
        subdir_path = os.path.join(directory, subdir)
        if os.path.isdir(subdir_path):
            # Iterate over the subdirectories of the current subdirectory
            for subdir2 in os.listdir(subdir_path):
                subdir2_path = os.path.join(subdir_path, subdir2)
                if os.path.isdir(subdir2_path):
                    # Delete this nested subdirectory and everything in it
                    print(f"Deleting subdirectory: {subdir2_path}")
                    shutil.rmtree(subdir2_path)



delete_subdirectories('DATABASE - CMMD/CMMD')

'''


In [21]:
def create_file_paths_df(directory):
    file_paths = []
    for root, _, files in os.walk(directory):
        for file in files:
            # Build the full path of the file
            file_path = os.path.join(root, file)
            # Store the path relative to `directory` as Filename, together with the full path
            file_paths.append({'Filename': os.path.join(root[len(directory):].lstrip(os.path.sep), file), 'Path': file_path})
    file_paths_df = pd.DataFrame(file_paths)
    return file_paths_df


def merge_image_paths4(annotations_df, image_path_df1):
    """
    Merge annotations DataFrame with image paths DataFrame based on the 'Image_name' column.

    Parameters:
    - annotations_df (pd.DataFrame): DataFrame with annotations.
    - image_path_df1 (pd.DataFrame): DataFrame with image paths.

    Returns:
    - pd.DataFrame: Merged and filtered DataFrame.
    """
    
    image_path_df1['First_7'] = image_path_df1['Filename'].str[:7]

    merged_df = pd.merge(annotations_df[['Image_name', 'Label', 'Mass', 'Calcification']], 
                         image_path_df1[['First_7', 'Path', 'Filename']], 
                         left_on='Image_name', 
                         right_on='First_7', 
                         how='left')
    
    merged_df['Database'] = 'CMMD'
    
    columns_to_drop = ['First_7', 'Image_name']
    
    merged_df = merged_df.drop(columns_to_drop, axis=1)

    merged_df['Filename'] = merged_df['Filename'].str[:11]
    merged_df = merged_df.rename(columns={'Filename': 'Image_name'})

    mass_type_mapping = {
    1: 'Unknown',  # Mass = 1
    0: 0  # Mass = 0
    }
    
    merged_df['Mass_shape'] = merged_df['Mass'].replace(mass_type_mapping)

    column_order = ['Database', 'Image_name', 'Path', 'Label', 'Mass', 'Mass_shape', 'Calcification']
    
    merged_df = merged_df.reindex(columns=column_order)

    return merged_df

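cmmd_image_dir = 'DATABASE - CMMD/CMMD'

file_paths_df = create_file_paths_df(cmmd_image_dir)
# Drop the first two listed files (rows 0 and 1).
file_paths_df = file_paths_df.drop([0, 1], axis=0)
# reset_index returns a new DataFrame, so the result has to be assigned.
file_paths_df = file_paths_df.reset_index(drop=True)

df_CMMD = merge_image_paths4(df_anotations, file_paths_df)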


# 5. DATASET DDSM

Source: http://www.eng.usf.edu/cvprg/mammography/database.html

Source: https://www.kaggle.com/datasets/cheddad/miniddsm2

Year: first released in 1999

Format: Digitised analogue film / DICOM-jpeg

Biopsy confirmation: Yes


The DDSM (Digital Database for Screening Mammography) dataset is a large collection of digital mammographic images gathered for breast cancer screening applications. It was created by the University of South Florida and Massachusetts General Hospital. The images in DDSM are in digital form, mainly in DICOM (Digital Imaging and Communications in Medicine), the standard format for medical images.

Because the original dataset in DICOM format is very large, we opted for a subset of images converted to .jpeg, obtained from the Kaggle platform.

The images are sorted into three folders according to whether they are Normal, Benign or Malignant, and they include a second image with the lesion mask and an .OVERLAY file with the lesion characteristics.

Criteria and decisions:

- ACR density is mapped to A-D instead of 1-4.
- The lesion type is mapped to harmonise it with the other datasets. For compound names such as 'MICROLOBULATED-ILL_DEFINED-SPICULATED', the first name is used (here 'MICROLOBULATED', which maps to 'Lobulated').
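
The "first name wins" rule can be sketched as a split on `-` followed by a lookup of the base term, which avoids listing every compound margin by hand. This is an illustration only; `margin_to_shape` is a hypothetical helper, and returning unknown values unchanged is an assumption.

```python
BASE_MARGINS = {
    "ILL_DEFINED": "Indistinct",
    "CIRCUMSCRIBED": "Circumscribed",
    "SPICULATED": "Speculated",
    "OBSCURED": "Obscured",
    "MICROLOBULATED": "Lobulated",
    "N/A": "Unknown",
}


def margin_to_shape(margins):
    """Map a DDSM margin string to a shared category using its first component."""
    if margins is None:
        return None  # images without an overlay have no margin value
    first = margins.split("-")[0]
    # Assumption: values without a base entry are returned as they are.
    return BASE_MARGINS.get(first, margins)
```

For example, `margin_to_shape('OBSCURED-ILL_DEFINED-SPICULATED')` gives `'Obscured'`, the same result as the explicit entry in the notebook's mapping dictionary.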


In [22]:
# example of an overlay file

ruta_fitxer_overlay = 'DATABASE - DDSM/Benign/0029/C_0029_1.LEFT_CC.OVERLAY'

with open(ruta_fitxer_overlay, 'r') as file:
    contingut_overlay = file.read()
    
contingut_overlay

'TOTAL_ABNORMALITIES 1\nABNORMALITY 1\nLESION_TYPE MASS SHAPE OVAL MARGINS ILL_DEFINED\nASSESSMENT 3\nSUBTLETY 3\nPATHOLOGY BENIGN\nTOTAL_OUTLINES 1 \nBOUNDARY\n472 2432 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 

In [23]:
def llegir_overlay(ruta_overlay, propietat):
    with open(ruta_overlay, 'r') as file:
        contingut_overlay = file.read().split('\n')

    for i, linia in enumerate(contingut_overlay):
        if propietat in linia:
            # If the property is 'BOUNDARY', read the next line
            if propietat == 'BOUNDARY':
                next_line = contingut_overlay[i+1]
                valor = next_line.split()
                return valor
            else:
                valor = linia.split(propietat)[-1].split()[0]
                return valor
                

def obtenir_propietat_overlay(path_imatge, propietat):
    if 'Normal' in path_imatge:
        return None
    
    ruta_overlay = os.path.splitext(path_imatge)[0] + ".OVERLAY"

    if os.path.exists(ruta_overlay):
        valor_propietat = llegir_overlay(ruta_overlay, propietat)
    else:
        valor_propietat = None
    
    return valor_propietat
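
As an alternative to searching the file once per property, the header of an overlay file can be parsed into a dictionary in a single pass. The sketch below assumes the layout of the example file shown earlier (key/value lines, with `LESION_TYPE`, `SHAPE` and `MARGINS` sharing one line) and keeps the first value seen for each key; `parse_overlay_header` is a hypothetical helper and the boundary chain in the sample is shortened.

```python
SAMPLE_OVERLAY = (
    "TOTAL_ABNORMALITIES 1\n"
    "ABNORMALITY 1\n"
    "LESION_TYPE MASS SHAPE OVAL MARGINS ILL_DEFINED\n"
    "ASSESSMENT 3\n"
    "SUBTLETY 3\n"
    "PATHOLOGY BENIGN\n"
    "TOTAL_OUTLINES 1 \n"
    "BOUNDARY\n"
    "472 2432 4 4 4\n"  # shortened chain code, for illustration
)


def parse_overlay_header(text):
    """Return the key/value pairs that precede the first BOUNDARY line."""
    header = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "BOUNDARY":
            break
        if tokens[0] == "LESION_TYPE":
            # This line holds several key/value pairs in a row.
            pairs = zip(tokens[0::2], tokens[1::2])
        else:
            pairs = [(tokens[0], tokens[1])] if len(tokens) >= 2 else []
        for key, value in pairs:
            header.setdefault(key, value)  # keep the first occurrence
    return header


header = parse_overlay_header(SAMPLE_OVERLAY)
```

Calcification overlays would simply lack the `MARGINS` key, so `header.get('MARGINS')` returns `None` for them.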


In [24]:
import pandas as pd

def preprocess_DDSM_data(doc_dir):
    df_DDSM = pd.read_excel(doc_dir)

    # Rename columns
    df_DDSM = df_DDSM.rename(columns={'fullPath': 'Path', 'fileName': 'Image_name', 'Status': 'Label', 'Density': 'ACR'})
    # Keep rows that have a tumour contour, plus all Normal images
    df_DDSM = df_DDSM[(df_DDSM['Tumour_Contour'] != "-") | (df_DDSM['Label'] == 'Normal')]

    # Drop unneeded columns
    columns_to_drop = ['Side', 'Tumour_Contour', 'Tumour_Contour2', 'Age']
    df_DDSM = df_DDSM.drop(columns_to_drop, axis=1)

    # Map ACR density values
    acr_mapping = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}
    df_DDSM['ACR'] = df_DDSM['ACR'].apply(lambda x: acr_mapping.get(x, 0))

    # Add further information
    df_DDSM['Database'] = 'DDSM'
    df_DDSM['Image_name'] = df_DDSM['Image_name'].str[:-4]
    df_DDSM['Label'] = df_DDSM['Label'].replace('Cancer', 'Malignant')
    # regex=False treats the backslash literally regardless of the pandas version
    df_DDSM['Path'] = df_DDSM['Path'].str.replace('\\', '/', regex=False)
    df_DDSM['Path'] = 'DATABASE - DDSM/' + df_DDSM['Path']

    # Reorder columns
    column_order = ['Database', 'Image_name', 'Path', 'Label', 'ACR']
    df_DDSM = df_DDSM.reindex(columns=column_order)

    # Read the lesion type and margins from each image's .OVERLAY file
    df_DDSM['LESION_TYPE'] = df_DDSM['Path'].apply(lambda x: obtenir_propietat_overlay(x, 'LESION_TYPE'))
    df_DDSM['Mass_shape'] = df_DDSM['Path'].apply(lambda x: obtenir_propietat_overlay(x, 'MARGINS'))

    df_DDSM['Calcification'] = df_DDSM['LESION_TYPE'].apply(lambda x: 1 if x == 'CALCIFICATION' else 0)
    df_DDSM['Mass'] = df_DDSM['LESION_TYPE'].apply(lambda x: 1 if x == 'MASS' else 0)

    # Map the margin description to the shared mass-shape categories
    mass_type_mapping = {
        'ILL_DEFINED': 'Indistinct',
        'CIRCUMSCRIBED': 'Circumscribed',
        'SPICULATED': 'Speculated',
        'OBSCURED': 'Obscured',
        'MICROLOBULATED': 'Lobulated',
        'N/A': 'Unknown',
        'ILL_DEFINED-SPICULATED': 'Indistinct',
        'CIRCUMSCRIBED-OBSCURED': 'Circumscribed',
        'OBSCURED-ILL_DEFINED': 'Obscured',
        'CIRCUMSCRIBED-ILL_DEFINED': 'Circumscribed',
        'OBSCURED-ILL_DEFINED-SPICULATED': 'Obscured',
        'OBSCURED-SPICULATED': 'Obscured',
        'MICROLOBULATED-ILL_DEFINED': 'Lobulated',
        'CIRCUMSCRIBED-OBSCURED-ILL_DEFINED': 'Circumscribed',
        'CIRCUMSCRIBED-MICROLOBULATED-ILL_DEFINED': 'Circumscribed',
        'MICROLOBULATED-SPICULATED': 'Lobulated',
        'OBSCURED-CIRCUMSCRIBED': 'Obscured',
        'MICROLOBULATED-ILL_DEFINED-SPICULATED': 'Lobulated',
        'CIRCUMSCRIBED-MICROLOBULATED': 'Circumscribed',
        'CIRCUMSCRIBED-SPICULATED': 'Circumscribed'}
    
    df_DDSM['Mass_shape'] = df_DDSM['Mass_shape'].replace(mass_type_mapping)

    df_DDSM = df_DDSM.reset_index()
        
    df_DDSM = df_DDSM.drop('LESION_TYPE', axis = 1)

    return df_DDSM

DDSM_doc_dir = 'DATABASE - DDSM/DataWMask.xlsx'

df_DDSM = preprocess_DDSM_data(DDSM_doc_dir)

print(df_DDSM.info())

print(df_DDSM['Label'].value_counts())

df_DDSM.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5238 entries, 0 to 5237
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   index          5238 non-null   int64 
 1   Database       5238 non-null   object
 2   Image_name     5238 non-null   object
 3   Path           5238 non-null   object
 4   Label          5238 non-null   object
 5   ACR            5238 non-null   object
 6   Mass_shape     1823 non-null   object
 7   Calcification  5238 non-null   int64 
 8   Mass           5238 non-null   int64 
dtypes: int64(3), object(6)
memory usage: 368.4+ KB
None
Label
Normal       2408
Malignant    1428
Benign       1402
Name: count, dtype: int64


|   | index | Database | Image_name | Path | Label | ACR | Mass_shape | Calcification | Mass |
|---|-------|----------|------------|------|-------|-----|------------|---------------|------|
| 0 | 0 | DDSM | C_0029_1.LEFT_CC | DATABASE - DDSM/Benign/0029/C_0029_1.LEFT_CC.jpg | Benign | C | Indistinct | 0 | 1 |
| 1 | 1 | DDSM | C_0029_1.LEFT_MLO | DATABASE - DDSM/Benign/0029/C_0029_1.LEFT_MLO.jpg | Benign | C | Indistinct | 0 | 1 |
| 2 | 6 | DDSM | C_0033_1.RIGHT_CC | DATABASE - DDSM/Benign/0033/C_0033_1.RIGHT_CC.jpg | Benign | C | Lobulated | 0 | 1 |
| 3 | 7 | DDSM | C_0033_1.RIGHT_MLO | DATABASE - DDSM/Benign/0033/C_0033_1.RIGHT_MLO... | Benign | C | Lobulated | 0 | 1 |
| 4 | 10 | DDSM | C_0217_1.RIGHT_CC | DATABASE - DDSM/Benign/0217/C_0217_1.RIGHT_CC.jpg | Benign | B | Circumscribed | 0 | 1 |
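
Every compound key in `mass_type_mapping` (for example `OBSCURED-ILL_DEFINED`) is assigned the category of its first component. Assuming that rule is intended, the same mapping could be derived instead of listed by hand. The sketch below illustrates this; `BASE_MARGINS` and `map_margin` are hypothetical names that are not part of the notebook.

```python
# Single-margin categories, copied from mass_type_mapping in the cell above
BASE_MARGINS = {
    'ILL_DEFINED': 'Indistinct',
    'CIRCUMSCRIBED': 'Circumscribed',
    'SPICULATED': 'Speculated',
    'OBSCURED': 'Obscured',
    'MICROLOBULATED': 'Lobulated',
    'N/A': 'Unknown',
}


def map_margin(raw):
    """Hypothetical helper: map a DDSM MARGINS value to a unified category.

    Assumption: a compound value such as 'OBSCURED-ILL_DEFINED' takes the
    category of its first component, as every compound key in the explicit
    dictionary does.
    """
    if not isinstance(raw, str):
        return raw  # leave None / NaN untouched
    first = raw.split('-')[0]
    # Unlisted values are returned unchanged, as Series.replace would do
    return BASE_MARGINS.get(first, raw)


for value in ['ILL_DEFINED-SPICULATED', 'CIRCUMSCRIBED-OBSCURED', 'N/A', None]:
    print(value, '->', map_margin(value))
```

Under that assumption, `df_DDSM['Mass_shape'].map(map_margin)` would give the same result as the explicit dictionary for the keys it lists, and would also cover compound values that the dictionary does not enumerate.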


# 6. UNIFICATION

Finally, the datasets are merged into a single dataframe containing 9688 images: 3037 normal, 3380 benign and 3271 malignant. The dataset has the following attributes:

| Column        | Description                                        |
|---------------|----------------------------------------------------|
| Database      | Database the image belongs to                      |
| Image_name    | Image name                                         |
| View          | Image view (CC/MLO)                                |
| Path          | Path to the image                                  |
| Label         | Normal / Benign / Malignant, mapped to 0 / 1 / 2   |
| ACR           | Breast tissue density, A-D (or Unknown)            |
| Mass          | Presence of a mass, 0/1                            |
| Mass_shape    | Shape of the mass                                  |
| Calcification | Presence of calcification, 0/1                     |
| Distortion    | Presence of distortion, 0/1                        |
| Asymmetry     | Presence of asymmetry, 0/1                         |
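
The attribute table can also be read as a schema to check the unified dataframe against. The following is a minimal sketch, not part of the original notebook: `check_schema`, `EXPECTED_COLUMNS` and `BINARY_COLUMNS` are hypothetical names, and the toy dataframe only stands in for the real data.

```python
import pandas as pd

# Columns listed in the attribute table
EXPECTED_COLUMNS = ['Database', 'Image_name', 'View', 'Path', 'Label', 'ACR',
                    'Mass', 'Mass_shape', 'Calcification', 'Distortion', 'Asymmetry']
BINARY_COLUMNS = ['Mass', 'Calcification', 'Distortion', 'Asymmetry']


def check_schema(df):
    """Hypothetical helper: return a list of schema problems (empty if none)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if 'Label' in df.columns and not df['Label'].isin([0, 1, 2]).all():
        problems.append("Label has values outside {0, 1, 2}")
    for col in BINARY_COLUMNS:
        if col in df.columns and not df[col].isin([0, 1]).all():
            problems.append(f"{col} has values outside {{0, 1}}")
    return problems


# Toy rows, made up for illustration only
toy = pd.DataFrame({
    'Database': ['DDSM', 'MIAS'],
    'Image_name': ['img_a', 'img_b'],
    'View': ['CC', 'MLO'],
    'Path': ['a.jpg', 'b.jpg'],
    'Label': [0, 2],
    'ACR': ['A', 'Unknown'],
    'Mass': [0.0, 1.0],
    'Mass_shape': [0, 'Lobulated'],
    'Calcification': [0.0, 0.0],
    'Distortion': [0.0, 0.0],
    'Asymmetry': [0.0, 1.0],
})
print(check_schema(toy))
```

An empty list means the dataframe matches the table; calling `check_schema(full_def)` after the unification cell would be one way to apply it.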

 

In [28]:
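# Concatenate the per-database dataframes into a single one
full_def = pd.concat([df_INbreast, df_CMMD, df_cesm, df_DDSM, df_mias], ignore_index=True)

# Encode the labels as integers
label_mapping = {'Normal': 0, 'Benign': 1, 'Malignant': 2}
full_def['Label'] = full_def['Label'].replace(label_mapping)

# Drop the leftover 'index' column created by reset_index()
full_def = full_def.drop(['index'], axis=1)

# Fill missing values with 0, then mark missing densities as 'Unknown'.
# Assigning the result avoids the chained in-place replace, which pandas
# does not guarantee will update the dataframe.
full_def.fillna(0, inplace=True)
full_def['ACR'] = full_def['ACR'].replace(0, 'Unknown')

# Discard rows whose path is the placeholder string '0'
full_def = full_def[full_def['Path'] != '0']

print(full_def['Label'].value_counts())
print(full_def.info())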


Label
1    3380
2    3271
0    3037
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9688 entries, 0 to 9687
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Database       9688 non-null   object 
 1   Image_name     9688 non-null   object 
 2   View           9688 non-null   object 
 3   Path           9688 non-null   object 
 4   Label          9688 non-null   int64  
 5   ACR            9688 non-null   object 
 6   Mass           9688 non-null   float64
 7   Mass_shape     9688 non-null   object 
 8   Calcification  9688 non-null   float64
 9   Distortion     9688 non-null   float64
 10  Asymmetry      9688 non-null   float64
dtypes: float64(4), int64(1), object(6)
memory usage: 832.7+ KB
None
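
The `info()` output shows the four presence flags stored as `float64`, a side effect of missing values before `fillna(0)`. As an optional tidy-up, the sketch below casts them back to integers. It is an alternative to the blanket `fillna(0)`, not the notebook's method; `tidy_types` and `FLAG_COLUMNS` are hypothetical names and the toy rows are made up.

```python
import pandas as pd

FLAG_COLUMNS = ['Mass', 'Calcification', 'Distortion', 'Asymmetry']


def tidy_types(df):
    """Hypothetical helper: integer 0/1 flags and 'Unknown' for missing ACR."""
    out = df.copy()
    for col in FLAG_COLUMNS:
        # Missing flag means "not present"; store as int instead of float
        out[col] = out[col].fillna(0).astype(int)
    # Both NaN and the numeric placeholder 0 become 'Unknown'
    out['ACR'] = out['ACR'].fillna('Unknown').replace(0, 'Unknown')
    return out


# Toy rows, made up for illustration only
toy = pd.DataFrame({
    'ACR': ['A', None, 0],
    'Mass': [1.0, None, 0.0],
    'Calcification': [None, 1.0, 0.0],
    'Distortion': [0.0, None, None],
    'Asymmetry': [0.0, None, 1.0],
})
tidy = tidy_types(toy)
print(tidy.dtypes)
print(tidy)
```

Other columns, such as `Mass_shape`, are left untouched here because the notebook does not say which placeholder is intended for them.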


In [26]:
full_def.to_csv('LabelsPaths.csv', index=False)
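
Since the exported file is essentially a list of image paths, it can be useful to confirm that every path still points to a file on disk. This is a minimal sketch under the assumption that the paths are relative to the notebook's working directory; `missing_files` is a hypothetical helper that does not appear in the notebook.

```python
import os

import pandas as pd


def missing_files(csv_path, root='.'):
    """Hypothetical helper: list the Path entries that do not exist under root."""
    labels = pd.read_csv(csv_path)
    exists = labels['Path'].apply(lambda p: os.path.exists(os.path.join(root, p)))
    return labels.loc[~exists, 'Path'].tolist()


# Only runs when the exported file is present in the working directory
if os.path.exists('LabelsPaths.csv'):
    not_found = missing_files('LabelsPaths.csv')
    print(len(not_found), 'paths not found')
```
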


In [27]:
full_def.head()

|   | Database | Image_name | View | Path | Label | ACR | Mass | Mass_shape | Calcification | Distortion | Asymmetry |
|---|----------|------------|------|------|-------|-----|------|------------|---------------|------------|-----------|
| 0 | INbreast | 22678622 | CC | DATABASE - INbreast/AllDICOMs/converted/226786... | 0 | D | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 1 | INbreast | 22678646 | CC | DATABASE - INbreast/AllDICOMs/converted/226786... | 1 | D | 1.0 | Unknown | 0.0 | 0.0 | 0.0 |
| 2 | INbreast | 22678670 | MLO | DATABASE - INbreast/AllDICOMs/converted/226786... | 0 | D | 0.0 | 0 | 0.0 | 0.0 | 0.0 |
| 3 | INbreast | 22678694 | MLO | DATABASE - INbreast/AllDICOMs/converted/226786... | 1 | D | 1.0 | Unknown | 0.0 | 0.0 | 0.0 |
| 4 | INbreast | 22614074 | CC | DATABASE - INbreast/AllDICOMs/converted/226140... | 2 | B | 1.0 | Unknown | 1.0 | 0.0 | 0.0 |
