# Prepare data for downloading images from URLs

**Important**: This notebook should be run after `create-temp-data.ipynb` if the full year of data needs to be analyzed.

In [None]:
%pip install pandas

In [None]:
# parse object-like strings to make them python friendly
import ast
import pandas as pd



Load the complete dataset covering one year of coverage.

In [2]:
input_filename = '../../input-data/temp-data.csv'

In [3]:
input_dataset = pd.read_csv(input_filename)
input_dataset.head(2)


Unnamed: 0.1,Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,headline,...,news_desk,section_name,byline,type_of_material,_id,word_count,uri,subsection_name,image_url,clean_id
0,0,Wrestling with age and a case of idea theft.,https://www.nytimes.com/2024/09/01/business/he...,Wrestling with age and a case of idea theft.,"Send questions about the office, money, career...",BU,3.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'Help! I’m ‘Older’ and on the Job Hun...,...,SundayBusiness,Business Day,"{'original': 'By Anna Holmes', 'person': [{'fi...",News,nyt://article/da8532bd-f9bd-5ca3-9e7e-afef6e9f...,1280,nyt://article/da8532bd-f9bd-5ca3-9e7e-afef6e9f...,,images/2024/09/01/multimedia/01WorkFriend-fbmg...,da8532bd-f9bd-5ca3-9e7e-afef6e9f76d9
1,1,"Grueling shifts, abuse from the public and sub...",https://www.nytimes.com/2024/09/01/world/asia/...,"Grueling shifts, abuse from the public and sub...",Exhausted doctors resting in crowded on-call r...,A,4.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","{'main': 'Worked to the Bone, India’s Doctors ...",...,Foreign,World,{'original': 'By Anupreeta Das and Pragati K.B...,News,nyt://article/aeabc262-aeb0-5423-a7ac-8bb664cb...,1310,nyt://article/aeabc262-aeb0-5423-a7ac-8bb664cb...,Asia Pacific,images/2024/09/01/multimedia/01india-doctors-0...,aeabc262-aeb0-5423-a7ac-8bb664cb983b


Assuming we are working with the NYT data, the `multimedia` column holds an array of objects serialized as a string. The function below parses that string into an iterable Python list and extracts the URL of its first entry (the xlarge image). Missing or malformed values yield `None`. This procedure will need to change for data coming from other news outlets.

In [4]:
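def extract_first_url(multimedia_str):
    """Return the URL of the first multimedia entry, or None if unavailable."""
    try:
        # The column stores a Python-literal string, e.g. "[{'rank': 0, ...}]"
        multimedia_list = ast.literal_eval(multimedia_str)
    except (ValueError, SyntaxError):
        # Raised for malformed strings and for missing (NaN) values
        return None
    if isinstance(multimedia_list, list) and len(multimedia_list) > 0:
        first_item = multimedia_list[0]
        if isinstance(first_item, dict):
            return first_item.get("url")
    return None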
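
Relying on position assumes the xlarge rendition is always listed first. As a hedged alternative, the sketch below selects the entry by its `subtype` field and falls back to the first entry when nothing matches. The helper name `extract_url_by_subtype` and the sample string are hypothetical; only the `subtype` and `url` keys come from the data shown above.

```python
import ast


def extract_url_by_subtype(multimedia_str, subtype="xlarge"):
    """Return the URL of the entry with the given subtype, else the first URL.

    Hypothetical helper: assumes each entry is a dict with optional
    'subtype' and 'url' keys, as in the NYT sample rows.
    """
    try:
        items = ast.literal_eval(multimedia_str)
    except (ValueError, SyntaxError):
        return None  # malformed string or missing value
    if not isinstance(items, list):
        return None
    entries = [item for item in items if isinstance(item, dict)]
    for item in entries:
        if item.get("subtype") == subtype:
            return item.get("url")
    # Fall back to the first entry when no subtype matches
    return entries[0].get("url") if entries else None


# Made-up sample value, shaped like the 'multimedia' column
sample = (
    "[{'rank': 0, 'subtype': 'thumbnail', 'url': 'images/a-thumb.jpg'}, "
    "{'rank': 0, 'subtype': 'xlarge', 'url': 'images/a-xlarge.jpg'}]"
)
xlarge_url = extract_url_by_subtype(sample)
empty_url = extract_url_by_subtype("[]")
```

It can be applied the same way as the original function, for example `input_dataset["multimedia"].apply(extract_url_by_subtype)`.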

In [5]:
input_dataset["image_url"] = input_dataset["multimedia"].apply(extract_first_url)

In [6]:
input_dataset["image_url"]

0        images/2024/09/01/multimedia/01WorkFriend-fbmg...
1        images/2024/09/01/multimedia/01india-doctors-0...
2        images/2024/09/01/multimedia/01venezuela-migra...
3                                                     None
4        images/2024/09/01/multimedia/01germany-electio...
                               ...                        
48691    images/2024/08/30/multimedia/30Gregg-03-pcbk/3...
48692    images/2024/09/01/multimedia/01ukraine-f16-pil...
48693    images/2024/08/31/multimedia/31bouie-newslette...
48694    images/2024/08/21/multimedia/JDA-Zucchini-Brea...
48695    images/2024/08/31/multimedia/31usopen-dutch-fr...
Name: image_url, Length: 48696, dtype: object
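
The extracted values are relative paths (`images/...`) rather than complete URLs, so a downloader needs a base URL to resolve them. A minimal sketch follows, assuming the paths resolve against `https://www.nytimes.com/`. Both that base and the `to_absolute_url` helper are assumptions not confirmed by this notebook, and the base should be checked against the NYT API documentation before use.

```python
from urllib.parse import urljoin

# Assumed base URL; not taken from this notebook
ASSUMED_BASE_URL = "https://www.nytimes.com/"


def to_absolute_url(image_path, base_url=ASSUMED_BASE_URL):
    """Join a relative image path to the base URL; return None for missing values."""
    if not isinstance(image_path, str) or not image_path:
        return None
    # urljoin leaves already-absolute URLs unchanged
    return urljoin(base_url, image_path)


# Made-up path, shaped like the values in the 'image_url' column
example = to_absolute_url("images/2024/09/01/multimedia/example.jpg")
```

If the assumption holds, `input_dataset["image_url"].apply(to_absolute_url)` would produce a column of complete URLs while keeping `None` for rows without an image.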

In [7]:
len(input_dataset)

48696

All NYT entries carry an `_id` with a URI-style prefix such as `nyt://article/` or `nyt://interactive/`. The following code splits the ID on `/` and keeps only the last segment, the hyphenated alphanumeric hash, as `clean_id`.

In [8]:
input_dataset["clean_id"] = input_dataset["_id"].apply(lambda x: x.split('/')[-1])
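
The lambda above raises an `AttributeError` if `_id` ever contains a missing value, because `NaN` is a float and has no `split` method. A minimal sketch of a more defensive variant, where `clean_nyt_id` is a hypothetical helper name and the sample ID is made up:

```python
def clean_nyt_id(raw_id):
    """Return the last path segment of an ID such as 'nyt://article/<hash>'."""
    if not isinstance(raw_id, str):
        return None  # missing or non-string value
    # rsplit with maxsplit=1 only splits at the final '/'
    return raw_id.rsplit("/", 1)[-1]


# Made-up ID in the same shape as the '_id' column
cleaned = clean_nyt_id("nyt://article/0000aaaa-1111-2222-3333-444455556666")
```

It would be used as `input_dataset["_id"].apply(clean_nyt_id)`. With the data shown here every row has an ID, so the result should match the lambda.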

In [9]:
input_dataset["clean_id"]

0        da8532bd-f9bd-5ca3-9e7e-afef6e9f76d9
1        aeabc262-aeb0-5423-a7ac-8bb664cb983b
2        42c0d0f2-ea62-5d2b-8eba-baa04180adea
3        6393c6c3-0e1f-5494-925d-165e7aafdefa
4        fe046102-78e5-530d-89e0-59ff09c0e2e4
                         ...                 
48691    15ef03c9-295b-50e0-a0f4-64f9a182675f
48692    3a3d339e-87ab-5650-b797-b7bb3cb03e5b
48693    1192db0c-51bd-525b-abb4-e8607c11b2c3
48694    83b24708-09af-55b1-baac-7efca1711d63
48695    0bca89dd-a1a7-5022-8192-9812f680fc21
Name: clean_id, Length: 48696, dtype: object

In [10]:
image_download_dataset = input_dataset[["clean_id", "image_url"]]

In [11]:
image_download_dataset

Unnamed: 0,clean_id,image_url
0,da8532bd-f9bd-5ca3-9e7e-afef6e9f76d9,images/2024/09/01/multimedia/01WorkFriend-fbmg...
1,aeabc262-aeb0-5423-a7ac-8bb664cb983b,images/2024/09/01/multimedia/01india-doctors-0...
2,42c0d0f2-ea62-5d2b-8eba-baa04180adea,images/2024/09/01/multimedia/01venezuela-migra...
3,6393c6c3-0e1f-5494-925d-165e7aafdefa,
4,fe046102-78e5-530d-89e0-59ff09c0e2e4,images/2024/09/01/multimedia/01germany-electio...
...,...,...
48691,15ef03c9-295b-50e0-a0f4-64f9a182675f,images/2024/08/30/multimedia/30Gregg-03-pcbk/3...
48692,3a3d339e-87ab-5650-b797-b7bb3cb03e5b,images/2024/09/01/multimedia/01ukraine-f16-pil...
48693,1192db0c-51bd-525b-abb4-e8607c11b2c3,images/2024/08/31/multimedia/31bouie-newslette...
48694,83b24708-09af-55b1-baac-7efca1711d63,images/2024/08/21/multimedia/JDA-Zucchini-Brea...


Save a simplified version of the dataset, keeping only the `clean_id` and `image_url` columns needed for image download.

In [None]:
image_download_dataset.to_csv('../../input-data/nyt-image-urls.csv', index=False)
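
Rows without an image (such as index 3 above) are written to the CSV with an empty `image_url`. The sketch below, on a made-up three-row frame, shows one way to count and drop those rows before handing the file to a downloader. Whether to filter here or in the download step is a design choice this notebook does not settle.

```python
import pandas as pd

# Made-up stand-in for image_download_dataset
sample = pd.DataFrame(
    {
        "clean_id": ["id-1", "id-2", "id-3"],
        "image_url": ["images/a.jpg", None, "images/c.jpg"],
    }
)

# Count rows that have no image URL
missing_count = int(sample["image_url"].isna().sum())

# Keep only rows that can actually be downloaded
downloadable = sample.dropna(subset=["image_url"]).reset_index(drop=True)
```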