# Predicting Images

_Author: Nathan Robertson_

This notebook takes the models created earlier in the project and applies them to the dataset. It will:

* Predict an image's class (bedroom, bathroom, ...).
* Predict an image's rank (1-10 score).
* Create summary statistics at the listing level about the images.

### Step 0: Import packages.

In [1]:
# Standard data manipulation.
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import ast
import csv

# Handling image extraction.
from PIL import Image
import requests
from io import BytesIO
import os
import random

# for handling TF models.
import tensorflow as tf
from tensorflow.keras.models import load_model
import logging
tf.get_logger().setLevel(logging.ERROR)
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Progress tracking.
from tqdm import tqdm
import time

# Parallel processing.
from concurrent.futures import ThreadPoolExecutor, as_completed

### Step 1: Import Data

We'll import the file that contains the photo data (an earlier, messier file) and filter it to only the listings that appear in our cleaner dataset. That way we don't waste time processing images for listings that aren't in our final model's dataset.

We'll also add a helper function that reads in photos that have already been processed, in case the job is interrupted.
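
The resume logic amounts to a set-membership filter: collect every photo URL already present in earlier output files, then keep only the rows whose URL is not in that set. Below is a minimal sketch using only the standard library. The in-memory CSV text and the example URLs are made up for illustration; the notebook itself does this with pandas.

```python
import csv
import io

# Hypothetical contents of an earlier output file (normally read from disk).
previous_output = io.StringIO(
    "photo,predicted_class,predicted_rank\n"
    "https://example.com/a.jpg,Kitchen,6.0\n"
    "https://example.com/b.jpg,Other,\n"
)

# Collect the URLs that already have a prediction row.
already_processed = {row["photo"] for row in csv.DictReader(previous_output)}

# Hypothetical (zillowId, photo URL) pairs still waiting to be scored.
pending = [
    (101, "https://example.com/a.jpg"),
    (101, "https://example.com/c.jpg"),
    (202, "https://example.com/b.jpg"),
]

# Keep only the photos that have not been processed yet.
remaining = [(zid, url) for zid, url in pending if url not in already_processed]
print(remaining)
```

Using a set keeps each membership check cheap, which matters when the list of processed photos runs into the millions.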

In [2]:
# Load photos data, and clean indices.
photos_df = pd.read_csv('BACKUP zillow_listing_data.csv')[['zillowId','price','photosList']]
cleaned_indices = pd.read_csv('backup/BACKUP cleaned_zillow_data.csv')[['zillowId']]

# Filter photo data by zillow IDs that are in the clean indices.
photos_df = photos_df[photos_df['zillowId'].isin(cleaned_indices['zillowId'])]

# Delete unnecessary file from memory.
del cleaned_indices



We'll then melt the data down so that there is one record for each unique photo-listing combination. This is where we'll drop photos that have already been processed.
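
As a rough illustration of the melt, the sketch below expands a stringified list of URLs into one (zillowId, photo) pair per photo. The sample rows are invented, and `photosList` is assumed to hold a Python-literal list stored as text, which is what `ast.literal_eval` expects.

```python
import ast

# Invented sample rows: photosList is a list serialized as a string, or missing.
listings = [
    {"zillowId": 1, "photosList": "['https://example.com/1a.jpg', 'https://example.com/1b.jpg']"},
    {"zillowId": 2, "photosList": None},  # listing with no photos
    {"zillowId": 3, "photosList": "['https://example.com/3a.jpg']"},
]

melted = []
for row in listings:
    try:
        urls = ast.literal_eval(row["photosList"])
    except (ValueError, SyntaxError, TypeError):
        # Missing or malformed photo lists are skipped.
        continue
    for url in urls:
        melted.append((row["zillowId"], url))

print(melted)
```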

In [3]:
"""
concatenate_csvs
    
Reads multiple CSV files, vertically concatenates them, and returns the photo URLs they contain.

    Args:
        file_list (list): A list of file paths to the CSV files.

    Returns:
        A Python list of photo URLs from the combined DataFrame.
"""

def concatenate_csvs(file_list):
    # Read each CSV into a DataFrame and store in a list
    dfs = [pd.read_csv(file) for file in file_list]
    
    # Concatenate all DataFrames vertically (axis=0)
    combined_df = pd.concat(dfs, axis=0, ignore_index=True)

    # Get just the photo urls and delete the df.
    photos = combined_df['photo'].tolist()
    del combined_df

    # Return the photos.
    return photos

In [4]:
"""
melt_photo_data

Turn the original webscraped Zillow DataFrame into a melted two column DataFrame of images.

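    Args:
      df: the listing-level DataFrame with 'zillowId' and 'photosList' columns.
      already_generated: if True, skip the melt. The reload-from-backup branch is currently
        commented out, so the function returns None in that case.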
      
    Returns:
      A melted DataFrame of Zillow images, including the Zillow listing ID the photo came from
      and the URL of where the photo lives on Zillow's servers.

"""

def melt_photo_data(df, already_generated=False):

    if not already_generated:
        # Set the chunk size
        chunk_size = 1000
        
        # Empty lists for capturing results.
        zillowIds = []
        photos = []
        
        # Chunk and process the data
        for chunk_start in tqdm(range(0, df.shape[0], chunk_size), desc="Processing chunks"):
            
            chunk_end = min(chunk_start + chunk_size, df.shape[0])
            chunk_df = df.iloc[chunk_start:chunk_end]
        
            for i, row in chunk_df.iterrows():
                # If there is at least one photo.
                try:
                    # Count each photo.
                    for photo in ast.literal_eval(row['photosList']):
                        zillowIds.append(row['zillowId'])
                        photos.append(photo)
                        
                # If the photo list is missing or malformed, skip the listing.
                except (ValueError, SyntaxError, TypeError):
                    pass
        
        # Create a two-column DataFrame of all zillowId / photo URL combinations.
        photos_df = pd.DataFrame(data={
            'zillowId': zillowIds,
            'photo': photos
        })
        
        #photos_df.to_csv('data/backup/BACKUP listing_photos_dataset.csv', index=False)
    
        return photos_df

    #elif already_generated == True:
    #    return pd.read_csv('data/backup/BACKUP listing_photos_dataset.csv', index_col=False)

In [5]:
# Melt dataset, processing chunks of 1,000 listings at a time.
photos_df = melt_photo_data(photos_df)

Processing chunks: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 130/130 [00:09<00:00, 14.09it/s]


In [6]:
# Baseline of photos in the dataset.
print('Before dropping already processed photos:', len(photos_df))

# Drop already processed photos
alreadyProcessed = concatenate_csvs(['image_predictions_output_1.csv'])
photos_df = photos_df[~photos_df['photo'].isin(alreadyProcessed)]

# Check to see if photos were dropped.
print('After dropping already processed photos:', len(photos_df))

Before dropping already processed photos: 3981931
After dropping already processed photos: 2695399


In [7]:
# Peek at melted data.
photos_df.head()

Unnamed: 0,zillowId,photo
14999,2077467346,"https://maps.googleapis.com/maps/api/staticmap?mobile=false&sensor=true&maptype=satellite&size=400x300&zoom=17&center=37.75242614746094,-122.41143035888672&key=AIzaSyBJsNQO5ZeG-XAbqqWLKwG08fWITSxg33w&&signature=kXZi_NPFusc15YXjL_zFQ1csalY="
20696,15168857,"https://maps.googleapis.com/maps/api/staticmap?mobile=false&sensor=true&maptype=satellite&size=400x300&zoom=17&center=37.72417449951172,-122.41551971435547&key=AIzaSyBJsNQO5ZeG-XAbqqWLKwG08fWITSxg33w&&signature=5Ls-Vtrmnah703S5Nr3GmddziWU="
89601,2054309795,"https://maps.googleapis.com/maps/api/staticmap?mobile=false&sensor=true&maptype=satellite&size=400x300&zoom=17&center=47.223167419433594,-122.30274200439453&key=AIzaSyBJsNQO5ZeG-XAbqqWLKwG08fWITSxg33w&&signature=uR2pHQkhYHQcTJ51U-OH7k8rQgo="
179498,2055472524,"https://maps.googleapis.com/maps/api/staticmap?mobile=false&sensor=true&maptype=satellite&size=400x300&zoom=17&center=47.14197540283203,-122.17974090576172&key=AIzaSyBJsNQO5ZeG-XAbqqWLKwG08fWITSxg33w&&signature=LMjwBIgDiw4Db8Inky4eLjuvr2A="
179635,2054412370,"https://maps.googleapis.com/maps/api/staticmap?mobile=false&sensor=true&maptype=satellite&size=400x300&zoom=17&center=47.1141357421875,-122.15737915039062&key=AIzaSyBJsNQO5ZeG-XAbqqWLKwG08fWITSxg33w&&signature=tc-u2-19gYkxXBoJZtFJVDAdw-w="


In [8]:
# Get count of photos.
len(photos_df)

2695399

### Step 2: Load TensorFlow Models

Load the trained TensorFlow models that were created in the prior stage of the project.

In [9]:
# First load the room categorizer model.
room_categorizer_model = load_model('model_room_classifier.h5')


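# The ranker models reference their loss functions by name, so we need stand-in functions to load them.
# These stand-ins are only needed to deserialize the models; they are not called when predicting.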
def mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)

def mae(y_true, y_pred):
    return mean_absolute_error(y_true, y_pred)

custom_objects = {'MeanSquaredError': mse,'MeanAbsoluteError': mae}

bathroom_ranker_model = load_model('model_bathroom_ranker.h5', custom_objects=custom_objects)
bedroom_ranker_model = load_model('model_bedroom_ranker.h5', custom_objects=custom_objects)
kitchen_ranker_model = load_model('model_kitchen_ranker.h5', custom_objects=custom_objects)
living_room_ranker_model = load_model('model_living_room_ranker.h5', custom_objects=custom_objects)
location_exterior_ranker_model = load_model('model_location_exterior_ranker.h5', custom_objects=custom_objects)



### Step 3: Define prediction workflow

We'll need to:
* Transform an image to match the expected input for the TensorFlow models.
* Predict the room class first.
* Based on the room class, determine which rank model to use (there is one for each room class).
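
The three steps can be sketched as a classify-then-dispatch pattern. The stand-in functions below are hypothetical stubs, not the trained models. They only show how the predicted class selects the ranker, how 'Other' short-circuits to no rank, and how a raw score is rounded and clipped to the 1-10 scale.

```python
# Hypothetical stand-ins for the trained models: each takes an "image"
# (here just a number) and returns a class label or a raw score.
def stub_classifier(image):
    return "Kitchen" if image > 0 else "Other"

STUB_RANKERS = {
    "Bathroom": lambda image: 4.2,
    "Bedroom": lambda image: 5.1,
    "Kitchen": lambda image: 11.7,  # deliberately out of range
    "Living Room": lambda image: 6.4,
    "Location Exterior": lambda image: 0.2,
}

def predict(image):
    room = stub_classifier(image)
    ranker = STUB_RANKERS.get(room)
    if ranker is None:
        # 'Other' has no ranker, so no rank is produced.
        return room, None
    # Round to the nearest whole rank and clip to the 1-10 scale.
    rank = min(10, max(1, round(ranker(image))))
    return room, rank

print(predict(1))
print(predict(-1))
```

A dictionary lookup like this is one alternative to a chain of `elif` branches; either works, and the choice is mostly a matter of taste.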

In [10]:
"""
transform_image

Take a Zillow listing image and convert it into a uniformly sized image.

    Args:
      url: The url where a photo resides.
      target_size: the side length, in pixels, of the square image after it is transformed.
      
    Returns:
      A NumPy array of the resized and padded photo at target_size x target_size,
      or None if the download or processing fails.
"""


def transform_image(url, target_size=150):
    try:
        # Download the image from the URL
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for unsuccessful responses
        image_data = BytesIO(response.content)

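        # Open the image using Pillow and force three channels.
        # (Assumes the models expect RGB input; palette or grayscale images would otherwise have a different shape.)
        image = Image.open(image_data).convert('RGB')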

        # Calculate the target size and the ratio
        target_width, target_height = (target_size, target_size)
        width_ratio = target_width / image.width
        height_ratio = target_height / image.height

        # Determine the new dimensions after resizing and cropping
        if width_ratio < height_ratio:
            new_width = target_width
            new_height = int(image.height * width_ratio)
        else:
            new_width = int(image.width * height_ratio)
            new_height = target_height

        # Resize the image to fit within the new dimensions while maintaining the aspect ratio
        resized_image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)

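        # Crop to the target size. The resized image already fits inside the target, so both
        # offsets are always 0 and the crop box extends past the shorter side, which pads it.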
        left = random.randint(0, max(0, new_width - target_width))
        top = random.randint(0, max(0, new_height - target_height))
        right = left + target_width
        bottom = top + target_height

        cropped_image = resized_image.crop((left, top, right, bottom))

        image_array = np.array(cropped_image)

        return image_array

    # Any download or decoding failure yields None.
    except Exception:
        return None
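
To make the resize step in `transform_image` concrete, here is a small sketch of just the dimension arithmetic, with no image library involved. `fit_within` is a hypothetical helper name. It shows that choosing the smaller of the two ratios keeps both sides at or below the target, so the crop offsets computed afterwards are always 0. Pillow is then expected to fill the part of the crop box that falls outside the image with zeros (black), which would be where the padding comes from.

```python
def fit_within(width, height, target=150):
    """Return the resized (width, height) that fits inside a target x target box."""
    width_ratio = target / width
    height_ratio = target / height
    # Using the smaller ratio keeps both sides at or below the target.
    if width_ratio < height_ratio:
        return target, int(height * width_ratio)
    return int(width * height_ratio), target

for size in [(400, 300), (300, 400), (150, 150)]:
    new_w, new_h = fit_within(*size)
    # Neither side exceeds the target, so max(0, new - target) is 0 for both.
    print(size, "->", (new_w, new_h))
```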

In [11]:
"""
predict_room_class

Returns the most likely class for the room's type.

    Args:
        img_array: a numpy representation of an image.
        model: the room_categorizer_model.

    Returns:
        The predicted class for the room.
"""

def predict_room_class(img_array, model=room_categorizer_model):
    room_labels = {0: 'Bathroom', 1: 'Bedroom', 2: 'Kitchen', 3: 'Living Room', 4: 'Location Exterior', 5: 'Other'}

    y_pred_classes = model.predict(img_array, verbose=0)
    y_pred_index = np.argmax(y_pred_classes, axis=1)[0]
    y_pred_room = room_labels[y_pred_index]

    # Return prediction.
    return y_pred_room
    

In [12]:
"""
predict_room_rank

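Returns rank for the room based on its type.

    Args:
        img_array: a numpy representation of an image, including the batch dimension.
        room_type: The predicted class for the room.

    Returns:
        The predicted rank for the room (1-10), or None if the room type is 'Other'.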
"""

def predict_room_rank(img_array, room_type):
    
    # End function here if the room type is other.
    if room_type=='Other':
        return None

    # Otherwise, predict the rank based on the class-specific model.
    elif room_type=='Bathroom':
        y_pred_rank = bathroom_ranker_model(img_array)
    elif room_type=='Bedroom':
        y_pred_rank = bedroom_ranker_model(img_array)
    elif room_type=='Kitchen':
        y_pred_rank = kitchen_ranker_model(img_array)
    elif room_type=='Living Room':
        y_pred_rank = living_room_ranker_model(img_array)
    elif room_type=='Location Exterior':
        y_pred_rank = location_exterior_ranker_model(img_array)

    # Round and clip prediction.
    y_pred_rank = np.clip(np.round(y_pred_rank), 1, 10)[0][0]
    
    # Return prediction.
    return y_pred_rank

In [13]:
"""
predict_single_photo

Handles the class and rank predictions for an image.

    Args:
        url: an image url.

    Returns:
        The room class and rank predictions.
"""

def predict_single_photo(url):
    
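    img_array = transform_image(url)
    # transform_image returns None on failure; raise so that retry_with_backoff can retry.
    if img_array is None:
        raise ValueError(f'Could not load image: {url}')
    img_array = np.expand_dims(img_array, axis=0)  # add a batch dimension so the model accepts the input.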
    y_pred_room = predict_room_class(img_array)
    y_pred_rank = predict_room_rank(img_array, y_pred_room)

    #return y_pred_room
    return (url, y_pred_room, y_pred_rank)

### Step 4: Run Prediction Workflow

Time to let it run. We'll use parallel processing to speed this up a bit since this will take some time to execute.
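
The pattern used here, retries wrapped inside a thread pool, can be sketched in isolation. Everything below is illustrative: `fake_predict` is a hypothetical stand-in for the real per-photo prediction, the URLs are made up, and the wait is set to 0 so the example runs instantly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def retry_fixed_wait(func, *args, retries=3, wait=0.0):
    """Call func until it succeeds or retries run out; return None on final failure."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt < retries:
                time.sleep(wait)
    return None

# Hypothetical task: URLs containing "bad" always fail.
def fake_predict(url):
    if "bad" in url:
        raise ValueError("download failed")
    return (url, "Kitchen", 6)

urls = [
    "https://example.com/1.jpg",
    "https://example.com/bad.jpg",
    "https://example.com/2.jpg",
]

results = {}
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to its URL so failures can still be attributed.
    futures = {executor.submit(retry_fixed_wait, fake_predict, url): url for url in urls}
    for future in as_completed(futures):
        results[futures[future]] = future.result()

print(results)
```

Threads suit this job because most of the time per photo is spent waiting on the network download rather than on computation.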

In [16]:
"""
retry_with_backoff

A helper utility function that retries a failed photo up to 3 times, pausing a fixed 5 seconds between attempts, and returns None if the final attempt fails.
"""


def retry_with_backoff(func, *args, retries=3, wait=5, **kwargs):
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            if attempt < retries:
                time.sleep(wait)
            else:
                return None  # Return None on final failure

# Define output file.
output_file = "image_predictions_output.csv"

# Define the size we want to chunk the DataFrame into. This keeps memory usage light, with each chunk
# being written to a csv and then cleared from working memory.
chunk_size = 1000

# Process data in chunks.
for chunk_start in tqdm(range(0, len(photos_df), chunk_size), desc="Processing chunks"):
    chunk = photos_df.iloc[chunk_start:chunk_start + chunk_size]

    with open(output_file, mode="a", newline="") as csvfile:
        writer = csv.writer(csvfile)

        # Write the header only once at the start
        if chunk_start == 0:
            writer.writerow(["photo", "predicted_class", "predicted_rank"])

        # Parallel prediction with retry logic
        with ThreadPoolExecutor(max_workers=8) as executor:
            futures = {
                executor.submit(retry_with_backoff, predict_single_photo, row['photo']): row['photo']
                for _, row in chunk.iterrows()
            }

            for future in as_completed(futures):
                try:
                    result = future.result()

                    # If there is a result, write it.
                    if result is not None:
                        writer.writerow([result[0], result[1], result[2]])

                    # If there is no result, write an empty placeholder row.
                    else:
                        writer.writerow([None, None, None])
                        
                # If retrieving the result fails, also write an empty placeholder row.
                except Exception:
                    writer.writerow([None, None, None])


Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2696/2696 [68:31:03<00:00, 91.49s/it]


### Step 5: Listing-level statistics

Now that we have made our predictions, we need to bring them back up to the listing level of aggregation and calculate statistics such as the count, min, max, mean, and range of rank by image class.
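
The shape of this aggregation can be illustrated without pandas. The predictions below are invented; the sketch groups ranks by listing and class, computes the same five statistics, and flattens them into one wide record per listing with `<class>_<metric>` keys.

```python
from collections import defaultdict
from statistics import mean

# Invented photo-level predictions: (zillowId, predicted_class, predicted_rank).
predictions = [
    (1, "Kitchen", 6), (1, "Kitchen", 7), (1, "Bedroom", 4),
    (2, "Kitchen", 3),
]

# Group ranks by (listing, class).
ranks = defaultdict(list)
for zillow_id, room, rank in predictions:
    ranks[(zillow_id, room)].append(rank)

# One wide record per listing, with keys named "<class>_<metric>".
listing_features = defaultdict(dict)
for (zillow_id, room), values in ranks.items():
    stats = {
        "count": len(values),
        "min_rank": min(values),
        "max_rank": max(values),
        "mean_rank": mean(values),
        "range_rank": max(values) - min(values),
    }
    for metric, value in stats.items():
        listing_features[zillow_id][f"{room}_{metric}"] = value

print(dict(listing_features))
```

Listing 2 has no bedroom photos, so it simply lacks the `Bedroom_*` keys here; the pandas version fills such gaps with 0 after pivoting.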

In [22]:
# Load the predictions that were written out in Step 4.
predictions_df = pd.read_csv('image_predictions_output.csv')

In [28]:
# Get the Zillow ID back into the dataset.
merged_image_data = photos_df.merge(predictions_df, how='left',on='photo')

In [29]:
# Group by 'zillowId' and 'predicted_class' and calculate aggregate statistics.
grouped = merged_image_data.groupby(['zillowId', 'predicted_class'])['predicted_rank'].agg(
    count='size',
    min_rank='min',
    max_rank='max',
    mean_rank='mean'
).reset_index()

# Calculate range of ranks
grouped['range_rank'] = grouped['max_rank'] - grouped['min_rank']

# Pivot the table
pivot_df = grouped.pivot(index='zillowId', columns='predicted_class')

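# Flatten the multi-index columns. Each column is a (metric, class) pair, renamed to '<class>_<metric>'.
pivot_df.columns = [f'{cls}_{metric}' for metric, cls in pivot_df.columns]

# Reset the index to turn it back into a DataFrame. Missing listing/class combinations are
# filled with 0, so a rank of 0 means "no ranked photos of this class", not a low score.
pivot_df = pivot_df.reset_index().fillna(0)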

pivot_df.head()

Unnamed: 0,zillowId,Bathroom_count,Bedroom_count,Kitchen_count,Living Room_count,Location Exterior_count,Other_count,Bathroom_min_rank,Bedroom_min_rank,Kitchen_min_rank,Living Room_min_rank,Location Exterior_min_rank,Other_min_rank,Bathroom_max_rank,Bedroom_max_rank,Kitchen_max_rank,Living Room_max_rank,Location Exterior_max_rank,Other_max_rank,Bathroom_mean_rank,Bedroom_mean_rank,Kitchen_mean_rank,Living Room_mean_rank,Location Exterior_mean_rank,Other_mean_rank,Bathroom_range_rank,Bedroom_range_rank,Kitchen_range_rank,Living Room_range_rank,Location Exterior_range_rank,Other_range_rank
0,67873,6.0,0.0,2.0,0.0,7.0,20.0,4.0,0.0,6.0,0.0,2.0,0.0,5.0,0.0,7.0,0.0,5.0,0.0,4.833333,0.0,6.5,0.0,3.285714,0.0,1.0,0.0,1.0,0.0,3.0,0.0
1,67956,0.0,1.0,1.0,0.0,5.0,3.0,0.0,4.0,3.0,0.0,2.0,0.0,0.0,4.0,3.0,0.0,5.0,0.0,0.0,4.0,3.0,0.0,3.4,0.0,0.0,0.0,0.0,0.0,3.0,0.0
2,87482,1.0,3.0,2.0,0.0,15.0,8.0,3.0,3.0,3.0,0.0,3.0,0.0,3.0,3.0,4.0,0.0,6.0,0.0,3.0,3.0,3.5,0.0,4.266667,0.0,0.0,0.0,1.0,0.0,3.0,0.0
3,88272,2.0,6.0,4.0,1.0,8.0,5.0,3.0,4.0,6.0,3.0,2.0,0.0,5.0,6.0,7.0,3.0,6.0,0.0,4.0,5.0,6.5,3.0,3.625,0.0,2.0,2.0,1.0,0.0,4.0,0.0
4,89400,0.0,2.0,1.0,0.0,9.0,19.0,0.0,4.0,6.0,0.0,1.0,0.0,0.0,4.0,6.0,0.0,4.0,0.0,0.0,4.0,6.0,0.0,2.777778,0.0,0.0,0.0,0.0,0.0,3.0,0.0


### Pulse check: do any of these correlate with price?

Moment of truth: we join price back onto the listing-level features and look at the correlations.

In [30]:
# This is a little sloppy, but I need to re-read in the original dataframe to get the price data back...
temp = pd.read_csv('BACKUP zillow_listing_data.csv')[['zillowId','price']]

full = pivot_df.merge(temp, on='zillowId',how='left')



In [33]:
full.head()

Unnamed: 0,zillowId,Bathroom_count,Bedroom_count,Kitchen_count,Living Room_count,Location Exterior_count,Other_count,Bathroom_min_rank,Bedroom_min_rank,Kitchen_min_rank,Living Room_min_rank,Location Exterior_min_rank,Other_min_rank,Bathroom_max_rank,Bedroom_max_rank,Kitchen_max_rank,Living Room_max_rank,Location Exterior_max_rank,Other_max_rank,Bathroom_mean_rank,Bedroom_mean_rank,Kitchen_mean_rank,Living Room_mean_rank,Location Exterior_mean_rank,Other_mean_rank,Bathroom_range_rank,Bedroom_range_rank,Kitchen_range_rank,Living Room_range_rank,Location Exterior_range_rank,Other_range_rank,price
0,67873,6.0,0.0,2.0,0.0,7.0,20.0,4.0,0.0,6.0,0.0,2.0,0.0,5.0,0.0,7.0,0.0,5.0,0.0,4.833333,0.0,6.5,0.0,3.285714,0.0,1.0,0.0,1.0,0.0,3.0,0.0,169900
1,67956,0.0,1.0,1.0,0.0,5.0,3.0,0.0,4.0,3.0,0.0,2.0,0.0,0.0,4.0,3.0,0.0,5.0,0.0,0.0,4.0,3.0,0.0,3.4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,79000
2,87482,1.0,3.0,2.0,0.0,15.0,8.0,3.0,3.0,3.0,0.0,3.0,0.0,3.0,3.0,4.0,0.0,6.0,0.0,3.0,3.0,3.5,0.0,4.266667,0.0,0.0,0.0,1.0,0.0,3.0,0.0,400000
3,88272,2.0,6.0,4.0,1.0,8.0,5.0,3.0,4.0,6.0,3.0,2.0,0.0,5.0,6.0,7.0,3.0,6.0,0.0,4.0,5.0,6.5,3.0,3.625,0.0,2.0,2.0,1.0,0.0,4.0,0.0,229900
4,89400,0.0,2.0,1.0,0.0,9.0,19.0,0.0,4.0,6.0,0.0,1.0,0.0,0.0,4.0,6.0,0.0,4.0,0.0,0.0,4.0,6.0,0.0,2.777778,0.0,0.0,0.0,0.0,0.0,3.0,0.0,439900


### Answer: Yes!

It seems the max and mean ranks have some correlation with price, strongest for the Location Exterior and Living Room classes. Across all listings the top correlations are only about 0.12 to 0.18. When we filter for the price cutoff that will be used in the model (1.5M, or the 92nd percentile, where prices start to increase exponentially), the top four rise to between 0.30 and 0.39, and the max rank for every class is above 0.21. These are okay correlations for a social science problem, but compared to what we've gathered so far they are promising (our best predictors in the other datasets also had correlations with price in the high 0.30s).
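
To see why the price cutoff can change the picture so much, here is a toy Pearson correlation computed by hand. The five (rank, price) pairs are invented and are not drawn from the Zillow data; they only show how a single very expensive listing with an unremarkable rank can mask a relationship that holds for the rest.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented (max_rank, price) pairs; the last listing is an expensive outlier.
rows = [(3, 200_000), (5, 400_000), (6, 500_000), (8, 900_000), (4, 5_000_000)]
cutoff = 1_500_000

# Correlation over every listing, then over listings at or below the cutoff.
all_r = pearson([r for r, _ in rows], [p for _, p in rows])
kept = [(r, p) for r, p in rows if p <= cutoff]
filtered_r = pearson([r for r, _ in kept], [p for _, p in kept])

print(round(all_r, 3), round(filtered_r, 3))
```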

In [40]:
full.corr()['price'].sort_values(ascending=False)

price                           1.000000
Location Exterior_max_rank      0.176420
Location Exterior_mean_rank     0.163858
Living Room_max_rank            0.129052
Living Room_mean_rank           0.124299
Living Room_range_rank          0.111650
Living Room_min_rank            0.100862
Bedroom_mean_rank               0.095615
Location Exterior_range_rank    0.094900
Bedroom_max_rank                0.094380
Bedroom_min_rank                0.091518
Location Exterior_min_rank      0.070437
Kitchen_mean_rank               0.061479
Bedroom_range_rank              0.059616
Kitchen_min_rank                0.057501
Kitchen_max_rank                0.055448
Location Exterior_count         0.041105
Bathroom_mean_rank              0.038959
Bathroom_max_rank               0.036096
Bathroom_min_rank               0.035349
Bathroom_count                  0.019554
Living Room_count               0.015568
Bathroom_range_rank             0.014087
Bedroom_count                   0.013849
Kitchen_count   

In [42]:
full[full['price']<=1500000].corr()['price'].sort_values(ascending=False)

price                           1.000000
Location Exterior_max_rank      0.394167
Location Exterior_mean_rank     0.350601
Living Room_max_rank            0.315409
Living Room_mean_rank           0.301569
Living Room_min_rank            0.259143
Living Room_range_rank          0.249885
Kitchen_mean_rank               0.237479
Kitchen_max_rank                0.236029
Bedroom_max_rank                0.234173
Bedroom_mean_rank               0.232293
Location Exterior_range_rank    0.221002
Bedroom_min_rank                0.219560
Bathroom_max_rank               0.216974
Bathroom_mean_rank              0.207991
Kitchen_min_rank                0.201725
Bathroom_min_rank               0.166497
Bedroom_range_rank              0.165700
Bathroom_range_rank             0.153067
Location Exterior_min_rank      0.145467
Bathroom_count                  0.111842
Kitchen_range_rank              0.109847
Kitchen_count                   0.090977
Location Exterior_count         0.084648
zillowId        

### That's it! From here, we'll upload it to the team Dropbox.

In [43]:
# index=False avoids writing the row index as an extra unnamed column.
pivot_df.to_csv('final_image_predictions.csv', index=False)