# Download images from URLs

**Important**: This notebook should be run after `create-temp-data.ipynb` if the full year of data needs to be analyzed.

The notebook takes a JSON file as input and uses the `requests` module to download the header image of each article. With very large datasets, the process can take a long time and use a considerable amount of local disk space (up to xxx GB). The images are saved into a sub-folder of `/data` called `images`, which has been added to `.gitignore`. To run tests or try out analytical pipelines and tasks, I suggest using a smaller sample of data (e.g. 1 month).
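One possible way to cut the dataset down to a single month is to filter on the `pub_date` column before downloading anything. The sketch below assumes `pub_date` is stored as an ISO-8601 string that starts with `YYYY-MM`; `sample_month` is a hypothetical helper and the toy rows are made up for illustration.

```python
import pandas as pd


def sample_month(dataset, year_month):
    """Return only the rows whose pub_date starts with year_month, e.g. '2024-09'."""
    # Assumption: pub_date is a string such as '2024-09-01T04:01:07+0000'
    mask = dataset["pub_date"].astype(str).str.startswith(year_month)
    return dataset[mask].reset_index(drop=True)


# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "pub_date": [
        "2024-09-01T04:01:07+0000",
        "2024-08-30T10:00:00+0000",
        "2024-09-15T12:30:00+0000",
    ],
    "headline": ["a", "b", "c"],
})

september = sample_month(toy, "2024-09")
print(len(september))
```

Filtering on the string prefix avoids any date parsing, which keeps the sketch independent of how the timezone offset is formatted.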

In [None]:
%pip install pandas requests tqdm

In [1]:
import requests
import requests.exceptions
# parse object-like strings to make them python friendly
import ast
# helps get the correct format for the image
# (note: imghdr is deprecated since Python 3.11 and was removed in Python 3.13)
import imghdr

from tqdm import tqdm

import pandas as pd
import re
import os



I load the complete dataset for one year of coverage.

In [2]:
input_filename = '../../input-data/temp-data.json'

In [3]:
# The file carries a .json extension but is parsed as CSV here
input_dataset = pd.read_csv(input_filename)
input_dataset.head(2)


Unnamed: 0.1,Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,headline,...,pub_date,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri,subsection_name
0,0,Wrestling with age and a case of idea theft.,https://www.nytimes.com/2024/09/01/business/he...,Wrestling with age and a case of idea theft.,"Send questions about the office, money, career...",BU,3.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'Help! I’m ‘Older’ and on the Job Hun...,...,2024-09-01T04:01:07+0000,article,SundayBusiness,Business Day,"{'original': 'By Anna Holmes', 'person': [{'fi...",News,nyt://article/da8532bd-f9bd-5ca3-9e7e-afef6e9f...,1280,nyt://article/da8532bd-f9bd-5ca3-9e7e-afef6e9f...,
1,1,"Grueling shifts, abuse from the public and sub...",https://www.nytimes.com/2024/09/01/world/asia/...,"Grueling shifts, abuse from the public and sub...",Exhausted doctors resting in crowded on-call r...,A,4.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","{'main': 'Worked to the Bone, India’s Doctors ...",...,2024-09-01T04:01:25+0000,article,Foreign,World,{'original': 'By Anupreeta Das and Pragati K.B...,News,nyt://article/aeabc262-aeb0-5423-a7ac-8bb664cb...,1310,nyt://article/aeabc262-aeb0-5423-a7ac-8bb664cb...,Asia Pacific


Assuming we are working with the NYT data, the "multimedia" column holds an array of objects stored as a string. The function below parses that string into an iterable Python object and extracts the first URL (the xlarge image). This procedure will need to change for data coming from other news outlets.

In [4]:
def extract_first_url(multimedia_str):
    try:
        multimedia_list = ast.literal_eval(multimedia_str)
        if isinstance(multimedia_list, list) and len(multimedia_list) > 0:
            return multimedia_list[0].get("url", None)
    except (ValueError, SyntaxError):
        return None
    return None
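Taking the first entry relies on the xlarge image always coming first. As a hedged alternative, the sketch below looks the entry up by its `subtype` field instead. `extract_url_by_subtype` is a hypothetical helper, and the sample string (including the `thumbnail` entry and both URLs) is made up to mirror the structure shown above.

```python
import ast


def extract_url_by_subtype(multimedia_str, subtype="xlarge"):
    """Return the URL of the first entry with the given subtype, or None."""
    try:
        multimedia_list = ast.literal_eval(multimedia_str)
    except (ValueError, SyntaxError, TypeError):
        # Not a parseable Python literal (e.g. NaN or free text)
        return None
    if not isinstance(multimedia_list, list):
        return None
    for item in multimedia_list:
        if isinstance(item, dict) and item.get("subtype") == subtype:
            return item.get("url")
    return None


# Made-up sample mirroring the structure of the "multimedia" column
sample = (
    "[{'rank': 0, 'subtype': 'xlarge', 'caption': None, 'url': 'images/example-xlarge.jpg'}, "
    "{'rank': 0, 'subtype': 'thumbnail', 'caption': None, 'url': 'images/example-thumb.jpg'}]"
)

print(extract_url_by_subtype(sample))
print(extract_url_by_subtype(sample, "thumbnail"))
print(extract_url_by_subtype("not a list"))
```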

In [5]:
input_dataset["image_url"] = input_dataset["multimedia"].apply(extract_first_url)

In [6]:
input_dataset["image_url"]

0        images/2024/09/01/multimedia/01WorkFriend-fbmg...
1        images/2024/09/01/multimedia/01india-doctors-0...
2        images/2024/09/01/multimedia/01venezuela-migra...
3                                                     None
4        images/2024/09/01/multimedia/01germany-electio...
                               ...                        
48691    images/2024/08/30/multimedia/30Gregg-03-pcbk/3...
48692    images/2024/09/01/multimedia/01ukraine-f16-pil...
48693    images/2024/08/31/multimedia/31bouie-newslette...
48694    images/2024/08/21/multimedia/JDA-Zucchini-Brea...
48695    images/2024/08/31/multimedia/31usopen-dutch-fr...
Name: image_url, Length: 48696, dtype: object

In [7]:
len(input_dataset)

48696

All NYT entries are identified by an ID with a URI-style prefix, such as `nyt://article/` or `nyt://interactive/`. The following code splits the ID on `/` and keeps only the last segment, the alphanumeric hash, as the id.

In [8]:
input_dataset["clean_id"] = input_dataset["_id"].apply(lambda x: x.split('/')[-1])

In [9]:
input_dataset["clean_id"]

0        da8532bd-f9bd-5ca3-9e7e-afef6e9f76d9
1        aeabc262-aeb0-5423-a7ac-8bb664cb983b
2        42c0d0f2-ea62-5d2b-8eba-baa04180adea
3        6393c6c3-0e1f-5494-925d-165e7aafdefa
4        fe046102-78e5-530d-89e0-59ff09c0e2e4
                         ...                 
48691    15ef03c9-295b-50e0-a0f4-64f9a182675f
48692    3a3d339e-87ab-5650-b797-b7bb3cb03e5b
48693    1192db0c-51bd-525b-abb4-e8607c11b2c3
48694    83b24708-09af-55b1-baac-7efca1711d63
48695    0bca89dd-a1a7-5022-8192-9812f680fc21
Name: clean_id, Length: 48696, dtype: object
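The same cleaning step can also be written with a regular expression that removes only a leading `nyt://<type>/` prefix. This is a minimal sketch, assuming every ID follows that pattern; `clean_nyt_id` and `NYT_ID_PREFIX` are hypothetical names, and the second ID below is a made-up placeholder.

```python
import re

# Assumption: IDs look like 'nyt://<type>/<hash>'
NYT_ID_PREFIX = re.compile(r"^nyt://[^/]+/")


def clean_nyt_id(raw_id):
    """Strip the 'nyt://<type>/' prefix and return the remaining hash."""
    return NYT_ID_PREFIX.sub("", raw_id)


print(clean_nyt_id("nyt://article/da8532bd-f9bd-5ca3-9e7e-afef6e9f76d9"))
print(clean_nyt_id("nyt://interactive/example-id"))
```

Unlike `split('/')[-1]`, the regex leaves a string untouched when it does not start with the expected prefix, which makes unexpected IDs easier to spot.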

Here's where the magic happens 🤩 We iterate over the (id, image URL) pairs and use `requests` to download each image to a local folder.

In [60]:
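def download_images(urls, current_news_outlet):
    # Create the output folder for this outlet if it does not exist yet
    output_dir = f'../../data/images/{current_news_outlet}'
    os.makedirs(output_dir, exist_ok=True)

    failed = 0
    # Each item in urls is a (clean_id, image_url) tuple
    for filename_id, image_url in tqdm(urls, desc="Downloading images", unit="file"):
        # Only NYT is supported for now; skip articles without an image (None or NaN)
        if current_news_outlet != 'nytimes' or pd.isna(image_url):
            continue
        try:
            # The timeout keeps a stalled connection from blocking the whole run
            response = requests.get("https://www.nytimes.com/" + image_url, timeout=30)
        except requests.exceptions.RequestException as error:
            print(f"Request for {image_url} failed: {error}")
            failed += 1
            continue
        if response.status_code != 200:
            # Skip this image instead of calling exit(), which would stop the notebook
            print(f"Download of {image_url} has failed (status {response.status_code})")
            failed += 1
            continue

        # imghdr returns None for unrecognised data; falling back to 'jpg' is an assumption
        extension = imghdr.what(file=None, h=response.content) or 'jpg'
        filename = f'{output_dir}/{current_news_outlet}-{filename_id}.{extension}'
        with open(filename, 'wb') as file:
            file.write(response.content)

    print(f'Download of {current_news_outlet} finished ({failed} failed)')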
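Because `imghdr` is no longer available from Python 3.13 onward, the extension lookup may need a replacement on newer interpreters. The sketch below guesses the format from the leading bytes of the file; `guess_image_extension` is a hypothetical helper that covers only four common formats, not a full substitute for `imghdr`.

```python
def guess_image_extension(data):
    """Guess a file extension from the first bytes of an image, or return None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WebP files are RIFF containers with 'WEBP' at byte offset 8
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


print(guess_image_extension(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(guess_image_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))
print(guess_image_extension(b"plain text"))
```

In the download function, a call such as `guess_image_extension(response.content)` could stand in for `imghdr.what(file=None, h=response.content)`.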

In [None]:
# Instead of passing the whole dataframe, I zip clean_id and image_url into a list of tuples, then feed it to the download function
list_of_urls = list(zip(input_dataset['clean_id'], input_dataset['image_url']))
download_images(list_of_urls, "nytimes")

Downloading images:   0%|          | 65/48696 [00:16<4:08:29,  3.26file/s]
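With an estimated run time of roughly four hours, it can help to make the download resumable by skipping images that are already on disk. The sketch below assumes files are named `<outlet>-<id>.<extension>`, as in the download function; `already_downloaded` and `pending_downloads` are hypothetical helpers, and the demo uses a temporary folder with made-up ids.

```python
import glob
import os
import tempfile


def already_downloaded(output_dir, outlet, file_id):
    """Return True if a file named '<outlet>-<file_id>.<any extension>' exists."""
    name = f"{glob.escape(outlet)}-{glob.escape(file_id)}.*"
    pattern = os.path.join(glob.escape(output_dir), name)
    return len(glob.glob(pattern)) > 0


def pending_downloads(pairs, output_dir, outlet):
    """Keep only the (id, url) pairs that have a URL and no file on disk yet."""
    return [
        (file_id, url)
        for file_id, url in pairs
        if url is not None and not already_downloaded(output_dir, outlet, file_id)
    ]


# Demo in a temporary folder: 'id-1' is already present, 'id-3' has no image
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "nytimes-id-1.jpeg"), "wb").close()
    pairs = [("id-1", "images/a.jpg"), ("id-2", "images/b.jpg"), ("id-3", None)]
    remaining = pending_downloads(pairs, tmp, "nytimes")

print(remaining)
```

Passing the filtered list to `download_images` instead of the full list would let an interrupted run pick up where it stopped.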