<a href="https://www.kaggle.com/code/havikhurana/image-feature-extraction?scriptVersionId=113323669" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
cd /kaggle/input/

/kaggle/input


In [2]:
#loading libraries
import pandas as pd
import numpy as np
import keras
import os
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from glob import glob
from random import randint, seed, shuffle 
import matplotlib.pyplot as plt

Using TensorFlow backend.


In [40]:
#collecting the image file paths
#The fifth-from-last character of each file path is a digit, 0 or 1,
#that indicates whether cancer was found (1) or not (0).

files = glob('/kaggle/input/breast-histopathology-images/*/*/*')
len(files)

278082

In [45]:
from datetime import datetime
datetime.now().time()

datetime.time(2, 55, 37, 116350)

There are 278,082 images, and for each image I'll be creating 8,192 features, which leads to an enormous dataset. In the whole sample, 28% of the images are positive for cancer and 72% are negative. When I ran this code to generate features for all 278,082 images, it got through only about 57,000 images in five hours, less than a quarter of the total. Therefore, I'll take a random subset of the images and run the analysis on that. In particular, I'll consider 15,000 image patches, about 5% of the total sample.
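
To put rough numbers on "enormous", here is a back-of-the-envelope sketch using the figures quoted above. The 4 bytes per value (float32 storage) and the constant processing rate are assumptions made for illustration, not measurements from this notebook.

```python
# Rough sizing of the full feature matrix.
n_images = 278_082
n_features = 8_192        # 2 * 2 * 2048, the flattened ResNet50 output used later
bytes_per_value = 4       # assumption: values stored as float32

total_gib = n_images * n_features * bytes_per_value / 1024**3
print(f"full feature matrix: {total_gib:.1f} GiB")

# Observed rate in the first attempt: about 57,000 images in 5 hours.
# Assumption: the rate stays constant for the rest of the run.
images_per_hour = 57_000 / 5
hours_for_all = n_images / images_per_hour
print(f"estimated time for all images: {hours_for_all:.1f} hours")
```

Both estimates point the same way: the full run would need roughly a day of compute and several gigabytes of storage, which is why a subset is used here.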

In [47]:
seed(10)
shuffle(files)
files = files[0:15000]
#Some path names don't end with png and are kind of broken.
#Let's remove any such paths from files
files = [f for f in files if f.endswith('png')]
len(files)

14976

So, a handful of broken paths (24) have been removed now.

In [48]:
labels = [int(f[-5]) for f in files]

In [6]:
labels.count(1)/len(files)

0.2800480769230769

Consistent with the complete sample, about 28% of the images are positive for cancer.
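
Indexing the fifth-from-last character works as long as every path ends in a digit followed by `.png`. A slightly more defensive variant is sketched below. The example path and the `class0`/`class1` file-name suffix are assumptions about how the files are named, and `label_from_path` is a hypothetical helper, not part of the notebook.

```python
import re
from pathlib import PurePosixPath

# Hypothetical path, shaped like the three-level glob pattern used above.
example_path = (
    "/kaggle/input/breast-histopathology-images/"
    "10253/0/10253_idx5_x1351_y1101_class0.png"
)

def label_from_path(path):
    """Return 0 or 1 if the file name ends in 'class0.png' or 'class1.png', else None."""
    match = re.search(r"class([01])\.png$", PurePosixPath(path).name)
    return int(match.group(1)) if match else None

# For a well-formed path, both approaches agree.
print(label_from_path(example_path), int(example_path[-5]))
```

Returning `None` for unexpected names makes broken paths easy to filter out in the same pass, instead of relying on a separate `endswith('png')` check.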

In [None]:
#function to load and show image
def load_data(files, index):
    X = []
    img = load_img(files[index], target_size = (50,50))
    pixels = img_to_array(img)
    pixels = pixels/255
    X.append(pixels)
    return np.stack(X)

def show_image(img, cm):
    plt.imshow(img, cmap = cm)
    if cm is not None:
        print(f'color map: {cm}')
    plt.xticks([])
    plt.yticks([])
    plt.show()

Each image is a three-dimensional array of shape 50 * 50 * 3: 50 * 50 pixels, with the third axis holding the three color channels (red, green, and blue, in that order). Let's look at all channels for one image.

In [None]:
X = load_data(files, 0)
show_image(X[0], None)
show_image(X[0][:, :, 0], "Reds")
show_image(X[0][:, :, 1], "Greens")  #channel 1 is green
show_image(X[0][:, :, 2], "Blues")   #channel 2 is blue
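
To make the channel indexing concrete, here is a small sketch on a synthetic patch rather than a real image. The all-red `patch` array is made up for illustration.

```python
import numpy as np

# Synthetic 50 x 50 RGB patch: fully red, no green or blue.
patch = np.zeros((50, 50, 3), dtype=np.float32)
patch[:, :, 0] = 1.0

# Slicing the last axis gives one 50 x 50 plane per channel, in R, G, B order.
red, green, blue = patch[:, :, 0], patch[:, :, 1], patch[:, :, 2]

print(patch.shape, red.shape)
print(red.mean(), green.mean(), blue.mean())
```

Only the plane at index 0 is non-zero, which confirms that index 0 is red, index 1 is green, and index 2 is blue.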

Now, let's look at 5 random images and whether they are classified as cancerous or not.

In [None]:
for i in range(5):
    index = randint(0, len(files) - 1)  #randint includes both endpoints
    X = load_data(files,index)
    show_image(X[0], None)
    if labels[index]==1:
        print('Cancerous: True')
    else:
        print('Cancerous: False')

The image has a 50 * 50 * 3 shape, meaning there are 50 * 50 pixels in RGB. I'll use a pre-trained model, ResNet50, to compute image embeddings and use those embeddings as features for making predictions.

In [7]:
#loading pre-trained ResNet50 model to extract features
#we will read the data in batches to not go out of memory
model = ResNet50(weights="imagenet", include_top=False)
batch_size = 32

Downloading data from https://github.com/keras-team/keras-applications/releases/download/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
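
With `include_top=False`, the model returns a feature map instead of class scores. The sketch below traces how a 50 * 50 input shrinks to the 2 * 2 * 2048 map that is flattened into 8,192 features later on. The kernel, stride, and padding values are my understanding of the standard ResNet50 layout and should be read as assumptions. `conv_out` and `resnet50_feature_shape` are hypothetical helpers.

```python
def conv_out(size, kernel, stride, pad):
    """Output length along one axis for a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

def resnet50_feature_shape(height, width):
    """Trace the spatial size through the five downsampling steps of ResNet50."""
    dims = []
    for n in (height, width):
        n = conv_out(n, kernel=7, stride=2, pad=3)      # first 7x7 convolution
        n = conv_out(n, kernel=3, stride=2, pad=1)      # 3x3 max pooling
        for _ in range(3):                              # three later stages, each halving the size
            n = conv_out(n, kernel=1, stride=2, pad=0)  # strided 1x1 convolution
        dims.append(n)
    return (dims[0], dims[1], 2048)                     # 2048 channels in the last stage

shape = resnet50_feature_shape(50, 50)
n_features = shape[0] * shape[1] * shape[2]
print(shape, n_features)
```

Under these assumptions the trace for a 50 * 50 input is 50, 25, 13, 7, 4, 2 along each spatial axis, which matches the `2 * 2 * 2048` reshape in the extraction loop.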


In [46]:
#two metadata columns followed by feature columns named 1..8192
col_names = ['serial_num', 'truth_val'] + list(range(1, 8193))

image_feature_df = pd.DataFrame(columns = col_names)

In [9]:
image_feature_df.head()

Unnamed: 0,serial_num,truth_val,1,2,3,4,5,6,7,8,...,8183,8184,8185,8186,8187,8188,8189,8190,8191,8192


In [13]:
image_feature_df.shape[0]

0

In [50]:
print(datetime.now().time())
for (b, i) in enumerate(range(0, len(files), batch_size)):
    print("[INFO] processing batch {}/{}".format(b + 1,
                    int(np.ceil(len(files) / float(batch_size)))))
    
    batchPaths = files[i:i + batch_size]
    batchLabels = labels[i:i + batch_size]
    batchImages = []
        
    for imagePath in batchPaths:
        image = load_img(imagePath, target_size=(50, 50))
        image = img_to_array(image)
        image = np.expand_dims(image, axis=0)
        image = preprocess_input(image)
        batchImages.append(image)
    
    batchImages = np.vstack(batchImages)
    features = model.predict(batchImages, batch_size=batch_size)
    features = features.reshape((features.shape[0], 2 * 2 * 2048))
    
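    #append one row per image: serial number, label, then the 8192 features
    for (c, index) in enumerate(features):
        feature_ls = [c + i + 1, batchLabels[c]]
        for n in index:
            feature_ls.append(n)
        image_feature_df.loc[len(image_feature_df)] = feature_ls
        
    #once more than 5000 rows have accumulated, write them to disk and
    #empty the DataFrame so that appending rows stays fast
    if (image_feature_df.shape[0] > 5000):
        image_feature_df.to_csv(f'/kaggle/working/image_feature_{i}.csv')
        image_feature_df = image_feature_df[0:0]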
        
print(datetime.now().time())

02:57:09.104052
[INFO] processing batch 1/468
[INFO] processing batch 2/468
[INFO] processing batch 3/468
[INFO] processing batch 4/468
[INFO] processing batch 5/468
[INFO] processing batch 6/468
[INFO] processing batch 7/468
[INFO] processing batch 8/468
[INFO] processing batch 9/468
[INFO] processing batch 10/468
[INFO] processing batch 11/468
[INFO] processing batch 12/468
[INFO] processing batch 13/468
[INFO] processing batch 14/468
[INFO] processing batch 15/468
[INFO] processing batch 16/468
[INFO] processing batch 17/468
[INFO] processing batch 18/468
[INFO] processing batch 19/468
[INFO] processing batch 20/468
[INFO] processing batch 21/468
[INFO] processing batch 22/468
[INFO] processing batch 23/468
[INFO] processing batch 24/468
[INFO] processing batch 25/468
[INFO] processing batch 26/468
[INFO] processing batch 27/468
[INFO] processing batch 28/468
[INFO] processing batch 29/468
[INFO] processing batch 30/468
[INFO] processing batch 31/468
[INFO] processing batch 32/468
...

It took a little over 10 minutes, compared to the 43 minutes it took when I kept all the rows in a single DataFrame.
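
A likely reason for the speedup is that appending with `.loc` gets slower as the DataFrame grows, so flushing to disk keeps each append cheap. An alternative, sketched below under that assumption, is to collect rows in a plain Python list and build the DataFrame once per chunk. The random `features` array and the small `n_features` are stand-ins for the real `model.predict` output, and I have not timed this variant on the actual data.

```python
import numpy as np
import pandas as pd

n_features = 8   # small stand-in for the 8192 ResNet50 features
col_names = ['serial_num', 'truth_val'] + list(range(1, n_features + 1))

rng = np.random.default_rng(0)
features = rng.random((5, n_features))   # stand-in for one batch of model output
batch_labels = [0, 1, 0, 0, 1]
offset = 0                               # index of the first image in this batch

# Collect rows in a list; appending to a list does not copy existing rows.
rows = []
for c, feature_vec in enumerate(features):
    rows.append([c + offset + 1, batch_labels[c], *feature_vec])

# Build the DataFrame once, when the chunk is ready to be written out.
chunk_df = pd.DataFrame(rows, columns=col_names)
print(chunk_df.shape)
```

The resulting frame has the same columns as `image_feature_df`, so it could be written with `to_csv` in the same way.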

In [51]:
image_feature_df.shape

(1184, 8194)

In [52]:
image_feature_df.to_csv(f'/kaggle/working/image_feature_{i+1}.csv')

In [55]:
a = pd.read_csv("/kaggle/working/image_feature_8736.csv")

In [56]:
a.shape

(5024, 8195)
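
The extra column (8,195 instead of 8,194) is the row index that `to_csv` writes by default, which comes back as `Unnamed: 0` on reading. Below is a minimal sketch of reading several chunk files back into one DataFrame while dropping that column. It uses a temporary directory and tiny made-up frames instead of the real files in `/kaggle/working/`.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two tiny stand-in chunks the same way the notebook does (index included).
    for k in range(2):
        chunk = pd.DataFrame({'serial_num': [2 * k + 1, 2 * k + 2],
                              'truth_val': [0, 1]})
        chunk.to_csv(os.path.join(tmp_dir, f'image_feature_{k}.csv'))

    # index_col=0 treats the saved index as the index instead of a data column.
    paths = sorted(glob.glob(os.path.join(tmp_dir, 'image_feature_*.csv')))
    combined = pd.concat((pd.read_csv(p, index_col=0) for p in paths),
                         ignore_index=True)

print(combined.shape)
```

Passing `index=False` to `to_csv` when writing would avoid the extra column in the first place.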