# Download images from URLs

**Important**: This notebook should be run after `create-temp-data.ipynb` if the full year of data needs to be analyzed.

The notebook takes a JSON file as input and uses the `requests` module to download the header image for each article. With very large datasets, the process can take a long time and use a considerable amount of local disk space (up to xxx GB). The images are saved into a sub-folder of `/data` called `images`, which has been added to `.gitignore`. To run tests or try out analytical pipelines and tasks, I suggest using a smaller sample of data (e.g. 1 month).

In [None]:
%pip install pandas requests tqdm

In [None]:
import requests
import requests.exceptions
# parse object-like strings to make them python friendly
import ast
# detects the image format from the downloaded bytes
# (imghdr is deprecated since Python 3.11 and removed in Python 3.13)
import imghdr

from tqdm import tqdm

import pandas as pd
import re
import os

I load the complete dataset for one year of coverage.

In [None]:
input_filename = '../../input-data/temp-data.json'

In [None]:
input_dataset = pd.read_csv(input_filename)
input_dataset.head(2)


Assuming we are working with the NYT data, the "multimedia" column is an array of objects stored as a string. The function below parses the string into an iterable Python object and extracts the first URL (the xlarge image). This procedure will change for data coming from other news outlets.

In [None]:
def extract_first_url(multimedia_str):
    try:
        multimedia_list = ast.literal_eval(multimedia_str)
        if isinstance(multimedia_list, list) and len(multimedia_list) > 0:
            return multimedia_list[0].get("url", None)
    except (ValueError, SyntaxError):
        return None
    return None
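
To make the parsing step concrete, here is a minimal sketch of what `ast.literal_eval` does with one cell of the "multimedia" column. The sample string and its file paths are made up for illustration and are not real NYT data. The last lines show why the function catches `ValueError`: a missing cell is typically a float `NaN` in pandas, which `ast.literal_eval` rejects.

```python
import ast

# Hypothetical cell value, shaped like one entry of the "multimedia" column
sample_multimedia = (
    "[{'subtype': 'xlarge', 'url': 'images/2023/01/01/example-xlarge.jpg'}, "
    "{'subtype': 'thumbnail', 'url': 'images/2023/01/01/example-thumb.jpg'}]"
)

# The string becomes a real list of dicts
parsed = ast.literal_eval(sample_multimedia)
first_url = parsed[0].get("url") if isinstance(parsed, list) and parsed else None
print(first_url)

# A missing value (float NaN) is not a string literal, so parsing fails
try:
    ast.literal_eval(float("nan"))
    nan_rejected = False
except ValueError:
    nan_rejected = True
print(nan_rejected)
```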

In [None]:
input_dataset["image_url"] = input_dataset["multimedia"].apply(extract_first_url)

In [None]:
input_dataset["image_url"]

In [None]:
len(input_dataset)

All NYT entries are identified by an ID with a URI-style prefix such as `nyt://interactive/`. The following code splits the ID on `/` and keeps only the last segment, the alphanumeric hash, as the id.

In [None]:
input_dataset["clean_id"] = input_dataset["_id"].apply(lambda x: x.split('/')[-1])

In [None]:
input_dataset["clean_id"]
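
As a quick illustration of the ID clean-up, the sketch below applies the same `split('/')[-1]` step to a couple of made-up IDs (the hashes are placeholders, not real NYT identifiers). It works on a plain list so it can be run without the dataset.

```python
# Hypothetical IDs in the same shape as the "_id" column
sample_ids = [
    "nyt://interactive/0a1b2c3d-aaaa-bbbb-cccc-000000000001",
    "nyt://article/0a1b2c3d-aaaa-bbbb-cccc-000000000002",
]

# Keep only the part after the last slash
clean_ids = [raw_id.split("/")[-1] for raw_id in sample_ids]
print(clean_ids)
```

Splitting on the last slash works regardless of the content type in the prefix (`interactive`, `article`, and so on), which is why no regular expression is needed here.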

Here's where the magic happens 🤩 We iterate over the (id, image URL) pairs of the articles and use `requests` to download each image to a local folder.

In [None]:
def download_images(urls, current_news_outlet):
    # urls is a list of (clean_id, image_url) tuples
    output_dir = f'../../data/images/{current_news_outlet}'
    os.makedirs(output_dir, exist_ok=True)

    for filename_id, image_url in tqdm(urls, desc="Downloading images", unit="file"):
        # skip articles without an image and outlets that are not supported yet
        if current_news_outlet != 'nytimes' or image_url is None:
            continue
        try:
            response = requests.get("https://www.nytimes.com/" + image_url)
            if response.status_code != 200:
                # report the failure and move on instead of stopping the whole run
                print(f"Download of {image_url} has failed")
                continue

            # imghdr returns None when it cannot recognise the format
            extension = imghdr.what(file=None, h=response.content)
            filename = f'{output_dir}/{current_news_outlet}-{filename_id}.{extension}'

            with open(filename, 'wb') as file:
                file.write(response.content)
        except requests.exceptions.RequestException as error:
            print(f'Request for {image_url} failed: {error}')

    print(f'Download of {current_news_outlet} finished')
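
The function above relies on `imghdr`, which is no longer available from Python 3.13 onwards. One possible replacement, sketched here under the assumption that only a few common formats matter, is to inspect the first bytes of the downloaded content. `guess_image_extension` is a hypothetical helper and not part of this notebook.

```python
def guess_image_extension(content):
    """Guess an image extension from the leading bytes of a file.

    Hypothetical helper; it only knows JPEG, PNG, GIF and WebP signatures.
    """
    if content.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if content.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if content.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if content[:4] == b"RIFF" and content[8:12] == b"WEBP":
        return "webp"
    # Unknown format: let the caller decide what to do
    return None


# Fake byte strings that only carry the signature, not real images
jpeg_guess = guess_image_extension(b"\xff\xd8\xff\xe0" + b"\x00" * 16)
text_guess = guess_image_extension(b"plain text, not an image")
print(jpeg_guess, text_guess)
```

Returning `None` for unknown content mirrors what `imghdr.what` does, so the calling code can decide whether to skip the file or fall back to a default extension.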

In [None]:
# Instead of using the whole dataframe, I zip clean_id and image_url into a list of tuples, then feed it to the download function
list_of_urls = list(zip(input_dataset['clean_id'], input_dataset['image_url']))
download_images(list_of_urls, "nytimes")
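
Since a full-year run can take a long time, it may help to skip images that are already on disk when the cell is re-run after an interruption. The sketch below is one way to check for an existing file; `already_downloaded` is a hypothetical helper, and the demo uses a temporary folder rather than the real `data/images` directory. Because the extension is only known after download, the check matches any extension.

```python
import tempfile
from pathlib import Path


def already_downloaded(folder, outlet, file_id):
    """Return True if a file named '<outlet>-<file_id>.<any extension>' exists."""
    return any(Path(folder).glob(f"{outlet}-{file_id}.*"))


# Demo in a throwaway folder with one fake image file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nytimes-abc123.jpeg").write_bytes(b"fake image bytes")
    found = already_downloaded(tmp, "nytimes", "abc123")
    missing = already_downloaded(tmp, "nytimes", "zzz999")

print(found, missing)
```

A check like this could be called at the top of the download loop so that only missing images trigger a request.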