**Image Web Scraping and Image Feature Processing**

Authors: Lim Yap Kai and Sebastian Png (initial web scraping script)

This script scrapes images from The Smart Local website and downloads them to a local folder. Because the script writes to a local destination, refer to the cell outputs for a sample of the scraped images and to the Google Drive repository for the full image dataset.

It also contains the image-processing functions that generate and export new dataset files containing engineered image features for each article.

Our dataset files: https://drive.google.com/drive/folders/1Y8wNij-YoC4rxkhjqax4H3N1YDXW3lk1?usp=share_link


*The 4 in the folder names marks the 4th attempt to re-scrape the whole image dataset, after earlier attempts ran into errors and limitations.
*Each attempt to scrape all images from all articles takes around 1.5 days.

A: Image_Resolution_Down_4 contains the scraped images in jpg, jpeg, gif and png formats.
B: Image_Resolution_Down_4_Conversion_gif contains the gif images from Folder A, converted to jpg.
C: Image_Resolution_Down_4_Conversion_png contains the png images from Folder A, converted to jpg.

Image_Resolution_Down_4: https://drive.google.com/drive/folders/1rWhJjIOsNVPgP4AKbRMwnTcwvPdDhSad?usp=share_link
Image_Resolution_Down_4_Conversion_gif: https://drive.google.com/drive/folders/1N3W-gO5EqrkQl_ujGfaDJBbjOAx--D_A?usp=share_link
Image_Resolution_Down_4_Conversion_png: https://drive.google.com/drive/folders/1poML_yxfILjOYntzTAkfJ8sXZqgWWpq0?usp=share_link
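
To check what Folder A actually holds before converting, the files can be tallied by extension. The following is a minimal sketch using only the standard library; the helper name `count_by_extension` is hypothetical and not part of the original notebook.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(folder):
    """Count the files in `folder` by lower-cased extension (without the dot)."""
    return Counter(
        path.suffix.lower().lstrip('.')
        for path in Path(folder).iterdir()
        if path.is_file()
    )
```

Calling `count_by_extension('./Image_Resolution_Down_4')` would show how many gif and png files are candidates for Folders B and C.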

# Import Libraries

In [1]:
# !pip install opencv-python
# !pip install Pillow
# !pip install webcolors

In [2]:
import os
import re
import requests
import time

import cv2
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
from collections import Counter
from PIL import Image, ImageStat
from scipy.spatial import KDTree
from tqdm import tqdm
from webcolors import CSS3_HEX_TO_NAMES, hex_to_rgb

# Import Dataset

In [3]:
smartlocal = pd.read_parquet(path='./datasets/thesmartlocal.parquet')

smartlocal.head()

Unnamed: 0,url,timedelta,title,category,subcategory1,subcategory2,subcategory3,preview,content,n_tokens_title,...,publish_date,day_of_week,month,year,num_imgs,img_links,num_hrefs,num_self_hrefs,num_tags,num_shares
0,https://thesmartlocal.com/read/staytion-marsil...,0,Staytion Marsiling: Coworking Space In The Nor...,Adulting,Career,,,"Hooray for being able to sleep in, plus the ti...",Staytion Marsiling – Coworking space in the No...,15,...,2022-11-02,2,11,2022,7,[https://thesmartlocal.com/wp-content/uploads/...,17,17,1,27
1,https://thesmartlocal.com/read/things-to-do-es...,1,"Esplanade Is Having Free Shows, A Theatre BTS ...",Things To Do,Things To Do In Singapore,,,Do not miss the free entertainment here.,Things to do at Esplanade So you’ve been to Es...,17,...,2022-11-01,1,11,2022,5,[https://thesmartlocal.com/wp-content/uploads/...,7,7,2,73
2,https://thesmartlocal.com/read/things-to-do-no...,1,17 New Things To Do In November 2022 – Bishan ...,Things To Do,Activities,,,"In the blink of an eye, we're approaching 2023...",Things to do in November 2022 Halloween may be...,19,...,2022-11-01,1,11,2022,33,[https://thesmartlocal.com/wp-content/uploads/...,63,63,6,244
3,https://thesmartlocal.com/read/paypal-welcome-...,1,You Can Redeem Vouchers For Brands Like foodpa...,Local,Businesses,,,Vouchers can also be used on Zalora and Agoda.,PayPal’s Welcome Pack promotion With Black Fri...,16,...,2022-11-01,1,11,2022,5,[https://thesmartlocal.com/wp-content/uploads/...,4,4,2,25
4,https://thesmartlocal.com/read/things-to-do-ju...,1,9 Best Things To Do In Jurong For Westies To S...,Things To Do,Things To Do In Singapore,,,"Hot take: west side, best side.","Things to do in Jurong For too long, residents...",20,...,2022-11-01,1,11,2022,24,[https://thesmartlocal.com/wp-content/uploads/...,18,18,3,31


# Web Scraping of Images

The web scraping is divided into tranches because an unhandled error, caused by either a network issue or an image format issue, terminates the run. Each tranche restarts the loop from a later row of the dataset (1726, 2135 and 3894 below).

In [4]:
# Pattern to match to decide image name i.e. string that is enclosed in
# the group (.*)
name_patt = re.compile(pattern='https://thesmartlocal.com/read/(.*)/')

# Pattern to extract image file extension e.g. png, jpg, etc.
img_ext_patt = re.compile(pattern=r'\.([^.]*)$')

got_problem = []
article_img_links = []
article_img_mapper = []

for row_num, url in tqdm(enumerate(smartlocal.url)):
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, 'lxml')

    # List of links to images in each article
    img_links = [img.get('src') for img in soup.select('.size-full')]
    
    article_img_links.append(img_links)
    map_list = []
    
    # Download images to local folder
    for index, img_url in enumerate(img_links):
        
        try:
            img_file = Image.open(requests.get(img_url, stream=True).raw)
            file_ext = img_url.rsplit('.', 1)[-1]
            # Change the params accordingly 
            # https://jdhao.github.io/2019/07/20/pil_jpeg_image_quality/
            new_name = f'{row_num}_{index}.{file_ext}'
            img_file.save(fp=f'./Image_Resolution_Down_4/{new_name}',
                          quality=10,
                          subsampling=0,
                          optimize=True)
            
            map_list.append(new_name)

            # Add 0.05 seconds per image scraping to avoid spamming the 
            # website with requests
            time.sleep(0.05)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            print(img_url)
            got_problem.append([index, img_url])
    
    if row_num == 1:
        print(map_list)
        print(article_img_links)
        
    article_img_mapper.append(map_list)

2it [00:28, 12.77s/it]

['1_0.jpg', '1_1.jpg', '1_2.png', '1_3.png', '1_4.png', '1_5.png', '1_6.jpg', '1_7.jpg']
[['https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-9-1.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-17.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-16.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-6.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-7.png', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-10.jpeg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-3.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-5.png', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-15.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-13.jpg', 'https://thesmartlocal.com/wp-content/upl

1723it [4:54:14, 11.36s/it]

https://thesmartlocal.com/wp-content/uploads/2021/02/Covid-19-rules-8.png


1726it [4:54:53, 13.77s/it]

https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space14.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space1.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space7.jpg


1727it [7:26:32, 15.51s/it]  

https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space12.gif
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space5.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space3.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space2.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space16.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space17.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space6.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space8.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space4.jpg
https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space9.jpg





ConnectionError: HTTPSConnectionPool(host='thesmartlocal.com', port=443): Max retries exceeded with url: /read/changi-chapel-museum/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002856129EE20>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
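
The `ConnectionError` above was raised by the article request `requests.get(url)`, which sits outside the `try` block, so a single DNS or network failure ends the whole run. A minimal sketch of one way to retry and keep going is shown below. It is an illustration, not the notebook's method: `fetch_with_retry` and `scrape_resumable` are hypothetical helpers, and `fetch` stands for any callable such as `lambda u: requests.get(u, timeout=30)`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, wait_seconds=1.0):
    """Call fetch(url); retry on any exception, return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as err:
            print(f'attempt {attempt} failed for {url}: {err}')
            if attempt < retries:
                # Wait a little longer after each failed attempt
                time.sleep(wait_seconds * attempt)
    return None


def scrape_resumable(urls, fetch, start=0, retries=3, wait_seconds=1.0):
    """Process urls[start:], recording failed rows instead of raising.

    Returns (results, failed): results maps row number to the fetched value,
    failed lists the row numbers that could not be fetched.
    """
    results, failed = {}, []
    for row_num in range(start, len(urls)):
        value = fetch_with_retry(fetch, urls[row_num], retries, wait_seconds)
        if value is None:
            failed.append(row_num)
        else:
            results[row_num] = value
    return results, failed
```

With this shape, the rows in `failed` can be retried later by row number, and `start` plays the role of the slice offsets used in the tranches that follow.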

# Tranche 2

In [36]:
# Pattern to match to decide image name i.e. string that is enclosed in
# the group (.*)
name_patt = re.compile(pattern='https://thesmartlocal.com/read/(.*)/')

# Pattern to extract image file extension e.g. png, jpg, etc.
img_ext_patt = re.compile(pattern=r'\.([^.]*)$')

got_problem2 = []
article_img_links2 = []
article_img_mapper2 = []

for row_num, url in smartlocal.url[1726:].items():
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, 'lxml')

    # List of links to images in each article
    img_links = [img.get('src') for img in soup.select('.size-full')]
    
    article_img_links2.append(img_links)
    map_list = []
    
    # Download images to local folder
    for index, img_url in enumerate(img_links):
        
        try:
            img_file = Image.open(requests.get(img_url, stream=True).raw)
            file_ext = img_url.rsplit('.', 1)[-1]
            # Change the params accordingly 
            # https://jdhao.github.io/2019/07/20/pil_jpeg_image_quality/
            new_name = f'{row_num}_{index}.{file_ext}'
            img_file.save(fp=f'./Image_Resolution_Down_4/{new_name}',
                          quality=10,
                          subsampling=0,
                          optimize=True)
            
            map_list.append(new_name)

            # Add 0.05 seconds per image scraping to avoid spamming the 
            # website with requests
            time.sleep(0.05)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            print(img_url)
            got_problem2.append([index, img_url])
    
    if row_num == 1726:
        print(map_list)
        print(article_img_links2)
        
    article_img_mapper2.append(map_list)

['1726_0.jpg', '1726_1.jpg', '1726_2.jpg', '1726_3.jpg', '1726_4.jpg', '1726_5.jpg', '1726_6.gif', '1726_7.jpg', '1726_8.jpg', '1726_9.jpg', '1726_10.jpg', '1726_11.jpg', '1726_12.jpg', '1726_13.jpg', '1726_14.jpg', '1726_15.jpg']
[['https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space11.jpg', 'https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space15.jpg', 'https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space13.jpg', 'https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space14.jpg', 'https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space1.jpg', 'https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space7.jpg', 'https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space12.gif', 'https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space5.jpg', 'https://thesmartlocal.com/wp-content/uploads/2021/05/Motion-art-space3.jpg', 'https://thesmartlocal.com/wp-content/uploads/2021/05/Motion

ConnectionError: HTTPSConnectionPool(host='thesmartlocal.com', port=443): Max retries exceeded with url: /read/sg-lighthouses/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000285075CF190>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

In [53]:
b = ['1726_3.jpg', '1726_4.jpg', '1726_5.jpg', '1726_6.gif', '1726_7.jpg', '1726_8.jpg', '1726_9.jpg', '1726_10.jpg', '1726_11.jpg', '1726_12.jpg', '1726_13.jpg', '1726_14.jpg', '1726_15.jpg']
article_img_mapper[-1] = article_img_mapper[-1] + b 

In [60]:
print(article_img_links2[-1])
print(article_img_links3[0])

print(article_img_mapper2[-1])
print(article_img_mapper3[0])

['https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-4.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-3.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-1.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-3.png', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-2.png', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament.png', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-2.jpg']
['https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-4.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-3.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-1.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-3.png', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-2.png', 'https://thesmartlocal.com/wp-content/uploads/2020/1
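
The output above suggests that the last article in tranche 2 and the first article in tranche 3 are the same, so the tranches overlap at their boundaries. One way to stitch them together is to key every per-article list by its row number and let later tranches overwrite earlier ones. This is a sketch under the assumption that the row labels are consecutive integers; `merge_tranches` is a hypothetical helper.

```python
def merge_tranches(tranches):
    """Merge per-article lists from several scraping runs.

    `tranches` is a list of (start_row, per_article_lists) pairs in run order.
    When two runs cover the same row, the later run wins.
    """
    merged = {}
    for start_row, per_article in tranches:
        for offset, items in enumerate(per_article):
            merged[start_row + offset] = items
    # Return the lists ordered by row number
    return [merged[row] for row in sorted(merged)]
```

For example, `merge_tranches([(0, article_img_mapper), (1726, article_img_mapper2), (2135, article_img_mapper3)])` would return one list per article without the duplicated boundary rows.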

# Tranche 3

In [55]:
# Pattern to match to decide image name i.e. string that is enclosed in
# the group (.*)
name_patt = re.compile(pattern='https://thesmartlocal.com/read/(.*)/')

# Pattern to extract image file extension e.g. png, jpg, etc.
img_ext_patt = re.compile(pattern=r'\.([^.]*)$')

got_problem3 = []
article_img_links3 = []
article_img_mapper3 = []

for row_num, url in smartlocal.url[2135:].items():
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, 'lxml')

    # List of links to images in each article
    img_links = [img.get('src') for img in soup.select('.size-full')]
    
    article_img_links3.append(img_links)
    map_list = []
    
    # Download images to local folder
    for index, img_url in enumerate(img_links):
        
        try:
            img_file = Image.open(requests.get(img_url, stream=True).raw)
            file_ext = img_url.rsplit('.', 1)[-1]
            # Change the params accordingly 
            # https://jdhao.github.io/2019/07/20/pil_jpeg_image_quality/
            new_name = f'{row_num}_{index}.{file_ext}'
            img_file.save(fp=f'./Image_Resolution_Down_4/{new_name}',
                          quality=10,
                          subsampling=0,
                          optimize=True)
            
            map_list.append(new_name)

            # Add 0.05 seconds per image scraping to avoid spamming the 
            # website with requests
            time.sleep(0.05)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            print(img_url)
            got_problem3.append([index, img_url])
    
    if row_num == 2135:
        print(map_list)
        print(article_img_links3)
        
    article_img_mapper3.append(map_list)

['2135_0.jpg', '2135_1.jpg', '2135_2.jpg', '2135_3.png', '2135_4.png', '2135_5.png', '2135_6.jpg']
[['https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-4.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-3.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-1.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-3.png', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-2.png', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament.png', 'https://thesmartlocal.com/wp-content/uploads/2020/12/PUBGM-Tournament-2.jpg']]
https://i2.wp.com/thesmartlocal.com/wp-content/uploads/2019/09/singapore-legends-1.jpg?resize=1080%2C1080&ssl=1
https://thesmartlocal.com/wp-content/uploads/2019/10/singapore-bomb-shelter-idea.png
https://thesmartlocal.com/wp-content/uploads/2019/06/images_easyblog_articles_7447_image5.jpg
https://thesmartlocal.com/wp-content/uploads/2019/06/ima

ConnectionError: HTTPSConnectionPool(host='thesmartlocal.com', port=443): Max retries exceeded with url: /read/unknown-pokka-drinks/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002850661DEE0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
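
One of the URLs printed above ends with a query string (`?resize=1080%2C1080&ssl=1`), so `img_url.rsplit('.', 1)[-1]` returns `jpg?resize=1080%2C1080&ssl=1` rather than `jpg`, which is a plausible reason that save failed. Below is a sketch of a more defensive extension lookup, using only the standard library; `extension_from_url` and its `default` argument are hypothetical.

```python
import os
from urllib.parse import urlsplit


def extension_from_url(img_url, default='jpg'):
    """Return the lower-cased file extension of a URL, ignoring any query string."""
    path = urlsplit(img_url).path          # drops '?resize=...' and '#fragment'
    ext = os.path.splitext(path)[1].lstrip('.').lower()
    return ext or default                  # fall back when the path has no extension
```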

# Tranche 4

In [4]:
# Pattern to match to decide image name i.e. string that is enclosed in
# the group (.*)
name_patt = re.compile(pattern='https://thesmartlocal.com/read/(.*)/')

# Pattern to extract image file extension e.g. png, jpg, etc.
img_ext_patt = re.compile(pattern=r'\.([^.]*)$')

got_problem4 = []
article_img_links4 = []
article_img_mapper4 = []

for row_num, url in smartlocal.url[3894:].items():
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, 'lxml')

    # List of links to images in each article
    img_links = [img.get('src') for img in soup.select('.size-full')]
    
    article_img_links4.append(img_links)
    map_list = []
    
    # Download images to local folder
    for index, img_url in enumerate(img_links):
        
        try:
            img_file = Image.open(requests.get(img_url, stream=True).raw)
            file_ext = img_url.rsplit('.', 1)[-1]
            # Change the params accordingly 
            # https://jdhao.github.io/2019/07/20/pil_jpeg_image_quality/
            new_name = f'{row_num}_{index}.{file_ext}'
            img_file.save(fp=f'./Image_Resolution_Down_4/{new_name}',
                          quality=10,
                          subsampling=0,
                          optimize=True)
            
            map_list.append(new_name)

            # Add 0.05 seconds per image scraping to avoid spamming the 
            # website with requests
            time.sleep(0.05)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            print(img_url)
            got_problem4.append([index, img_url])
    
    if row_num == 3894:
        print(map_list)
        print(article_img_links4)
        
    article_img_mapper4.append(map_list)

['3894_0.jpg', '3894_1.png', '3894_2.jpg', '3894_3.jpg', '3894_4.jpg', '3894_5.jpg', '3894_6.jpg', '3894_7.jpg', '3894_8.jpg', '3894_9.jpg', '3894_10.jpg']
[['https://thesmartlocal.com/wp-content/uploads/2019/06/images_easyblog_articles_7447_image3.jpg', 'https://thesmartlocal.com/wp-content/uploads/2019/06/images_easyblog_articles_7447_image8.png', 'https://thesmartlocal.com/wp-content/uploads/2019/06/images_easyblog_articles_7447_image5.jpg', 'https://thesmartlocal.com/wp-content/uploads/2019/06/images_easyblog_articles_7447_image11.jpg', 'https://thesmartlocal.com/wp-content/uploads/2019/06/images_easyblog_articles_7447_image9.jpg', 'https://thesmartlocal.com/wp-content/uploads/2019/06/images_easyblog_articles_7447_image7.jpg', 'https://thesmartlocal.com/wp-content/uploads/2019/06/images_easyblog_articles_7447_image10.jpg', 'https://thesmartlocal.com/wp-content/uploads/2019/06/images_easyblog_articles_7447_image1.jpg', 'https://thesmartlocal.com/wp-content/uploads/2019/06/images_eas

# Tranche 5

In [5]:
# Pattern to match to decide image name i.e. string that is enclosed in
# the group (.*)
name_patt = re.compile(pattern='https://thesmartlocal.com/read/(.*)/')

# Pattern to extract image file extension e.g. png, jpg, etc.
img_ext_patt = re.compile(pattern=r'\.([^.]*)$')

got_problem5 = []
article_img_links5 = []
article_img_mapper5 = []

for row_num, url in smartlocal.url.items():
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, 'lxml')

    # List of links to images in each article
    img_links = [img.get('src') for img in soup.select('.size-full')]
    
    article_img_links5.append(img_links)
    map_list = []
    
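    # Rebuild the local file names only (no download in this pass)
    for index, img_url in enumerate(img_links):
        file_ext = img_url.rsplit('.', 1)[-1]
        new_name = f'{row_num}_{index}.{file_ext}'
        map_list.append(new_name)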
        
    article_img_mapper5.append(map_list)
    
    if row_num == 100:
        print(article_img_links5)
        print(article_img_mapper5)

[['https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-9-1.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-17.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-16.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-6.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-7.png', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-10.jpeg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-3.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-5.png', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-15.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-13.jpg', 'https://thesmartlocal.com/wp-content/uploads/2022/09/Go-Karting-in-Singapore-4-1.jpg', 'https://thesmartlocal.com/wp-content/uplo

KeyboardInterrupt: 

In [8]:
len(article_img_links5)

2633

In [13]:
# Pattern to match to decide image name i.e. string that is enclosed in
# the group (.*)
name_patt = re.compile(pattern='https://thesmartlocal.com/read/(.*)/')

# Pattern to extract image file extension e.g. png, jpg, etc.
img_ext_patt = re.compile(pattern=r'\.([^.]*)$')

got_problem6 = []
article_img_links6 = []
article_img_mapper6 = []

for row_num, url in smartlocal.url[2633:].items():
    response = requests.get(url)

    # Using lxml’s HTML parser to parse the response text
    soup = BeautifulSoup(response.text, 'lxml')

    # List of links to images in each article
    img_links = [img.get('src') for img in soup.select('.size-full')]
    
    article_img_links6.append(img_links)
    map_list = []
    
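    # Rebuild the local file names only (no download in this pass)
    for index, img_url in enumerate(img_links):
        file_ext = img_url.rsplit('.', 1)[-1]
        new_name = f'{row_num}_{index}.{file_ext}'
        map_list.append(new_name)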
        
    article_img_mapper6.append(map_list)
    
    if row_num == 3000:
        print(article_img_links6)
        print(article_img_mapper6)

[['https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-4.png', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-7.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-1.png', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-6.png', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-3.jpg', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-5.png', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-10.png', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-12.png', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-9.png', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-11.png', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-8.png', 'https://thesmartlocal.com/wp-content/uploads/2020/06/Giving-Back-2.gif'], ['https://thesmartlocal.com/wp-content/uploads/2020/06/19th-june-bars.jpg', 'https://thesmartlocal.com/wp-

In [59]:
# Pattern to match to decide image name i.e. string that is enclosed in
# the group (.*)
name_patt = re.compile(pattern='https://thesmartlocal.com/read/(.*)/')

# Pattern to extract image file extension e.g. png, jpg, etc.
img_ext_patt = re.compile(pattern=r'\.([^.]*)$')

got_problem8 = []
article_img_links8 = []
article_img_mapper8 = []

for row_num, records in smartlocal.img_links.items():
    map_list = []
    for index, img_url in enumerate(records):
        file_ext = img_url.rsplit('.', 1)[-1]
        new_name = f'{row_num}_{index}.{file_ext}'
        map_list.append(new_name)
        
        if index == 200:
            print(article_img_mapper8)
            
    article_img_mapper8.append(map_list)

In [15]:
print(len(article_img_links5))
print(len(article_img_links6))
print(len(article_img_links5) + len(article_img_links6))

2633
1327
3960


In [20]:
print(len(article_img_mapper8))

3960
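
The combined link lists and the name mapper both have 3960 entries. A further check, sketched below, is that every article has exactly one generated file name per image link. `find_misaligned_rows` is a hypothetical helper, not part of the original notebook.

```python
def find_misaligned_rows(img_links, img_maps):
    """Return the row positions where the number of links and file names differ."""
    if len(img_links) != len(img_maps):
        raise ValueError('img_links and img_maps cover a different number of articles')
    return [
        row
        for row, (links, names) in enumerate(zip(img_links, img_maps))
        if len(links) != len(names)
    ]
```

An empty result means the two columns line up article by article; any row numbers returned point at articles worth inspecting by hand.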


In [65]:

# Insert new img_links column after num_imgs
insertion_index = smartlocal.columns.get_loc('num_imgs') + 1
smartlocal.insert(loc=insertion_index,
                  column='img_links',
                  value=combine_img_links)

In [66]:

# Insert new img_maps column after num_imgs
insertion_index = smartlocal.columns.get_loc('num_imgs') + 1
smartlocal.insert(loc=insertion_index,
                  column='img_maps',
                  value=article_img_mapper8)

In [None]:
smartlocal.to_parquet('./datasets/thesmartlocal_with_img_links.parquet')

# Convert png images to jpg

In [5]:
directory_in_str = './Image_Resolution_Down_4'
directory = os.fsencode(directory_in_str)

gotproblem = []

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if(filename.endswith('.png')): #and int(filename.rsplit('_', 1)[0])>=2134):
        try:
            im1 = Image.open(f'./Image_Resolution_Down_4/{filename}').convert('RGB')
            filename2 = filename.rsplit('.', 1)[0]
            im1.save(f'./Image_Resolution_Down_4_Conversion_png/{filename2}.jpg')
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            gotproblem.append(filename)



In [2]:
#gotproblem : ['2134_10.png']

# Convert gif images to jpg

In [7]:
directory_in_str = './Image_Resolution_Down_4'
directory = os.fsencode(directory_in_str)

gotproblem = []

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if(filename.endswith('.gif')):
        try:
            im1 = Image.open(f'./Image_Resolution_Down_4/{filename}').convert('RGB')
            filename2 = filename.rsplit('.', 1)[0]
            im1.save(f'./Image_Resolution_Down_4_Conversion_gif/{filename2}.jpg')
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            gotproblem.append(filename)

In [31]:
# smartlocal.to_excel("smartlocal_5.xlsx")
# smartlocal.to_parquet('./datasets/thesmartlocal_5.parquet')

# Image Features Processing

In [3]:
smartlocal = pd.read_parquet(path='./datasets/thesmartlocal_6.parquet')

smartlocal['img_dark'] = np.nan
smartlocal['img_light'] = np.nan
smartlocal['img_saturation'] = np.nan
smartlocal['top3_colours'] = np.nan
smartlocal['bot3_colours'] = np.nan
smartlocal
# = smartlocal.drop(columns = ['top1_colours', 'bot1_colours'])

Unnamed: 0,url,timedelta,title,category,subcategory1,subcategory2,subcategory3,preview,content,n_tokens_title,...,img_links,num_hrefs,num_self_hrefs,num_tags,num_shares,img_dark,img_light,img_saturation,top3_colours,bot3_colours
0,https://thesmartlocal.com/read/go-karting-sing...,0,7 Go-Karting & Virtual Racing Arenas In Singap...,Things To Do,Activities,,,Here are 7 go-kart arenas that'll get the adre...,"Go-karting in Singapore Bright lights, fast ca...",14,...,[https://thesmartlocal.com/wp-content/uploads/...,28,28,1,103,,,,,
1,https://thesmartlocal.com/read/virtual-influen...,0,Rae Is SG’s First Virtual Influencer – We Ask ...,Local,Perspectives,,,"Introducing Rae: everything a human is, except...","Virtual influencer Rae These days, anyone can ...",19,...,[https://thesmartlocal.com/wp-content/uploads/...,16,16,1,3,,,,,
2,https://thesmartlocal.com/read/staycation-deal...,1,14 Staycation Deals In Singapore 2022 To Book ...,Things To Do,Hotels & Staycations,,,If you're in need of a quick getaway or want t...,Staycation deals in Singapore 2022 There’s no ...,14,...,[https://thesmartlocal.com/wp-content/uploads/...,94,94,2,378,,,,,
3,https://thesmartlocal.com/read/frozen-musical-...,1,Broadway’s Frozen Musical Is Coming To Marina ...,Things To Do,Activities,,,"Catch the Frozen musical in Singapore, with ne...",Frozen musical Singapore Yoohoo big summer blo...,17,...,[https://thesmartlocal.com/wp-content/uploads/...,3,3,3,95,,,,,
4,https://thesmartlocal.com/read/surf-schools-bali/,1,8 Best Surf Schools In Bali To Learn Surfing S...,Travel,Indonesia,,,"Surf's up, dudes and dudettes.",Surf schools in Bali You’re in Bali. You’re at...,18,...,[https://thesmartlocal.com/wp-content/uploads/...,40,40,3,21,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3955,https://thesmartlocal.com/read/chiang-rai-thin...,1364,7 Outdoorsy Things To Do In Chiang Rai To Take...,Travel,Thailand,,,Outdoor things to do in Chiang Rai A trip to T...,Outdoor things to do in Chiang Rai A trip to T...,15,...,[https://thesmartlocal.com/wp-content/uploads/...,17,17,1,0,,,,,
3956,https://thesmartlocal.com/read/joo-chiat-katon...,1367,9 Colourful Heritage Places In Joo Chiat And K...,Things To Do,Things To Do In Singapore,,,#Instawalk Recap: Joo Chiat & Katong Joo Chiat...,#Instawalk Recap: Joo Chiat & Katong Joo Chiat...,16,...,[https://thesmartlocal.com/wp-content/uploads/...,14,14,0,5,,,,,
3957,https://thesmartlocal.com/read/cold-stone-crea...,1368,"Cold Stone Creamery Now Makes Minion, Shrek an...",Things To Do,Sales & Promotions,,,Cold Stone Creamery’s character cakes When i...,Cold Stone Creamery’s character cakes When it ...,18,...,[https://thesmartlocal.com/wp-content/uploads/...,5,5,0,0,,,,,
3958,https://thesmartlocal.com/read/beach-holidays-...,1368,8 Beach Holiday Resorts In Southeast Asia From...,Travel,Travel Guides & Tips,,,Beach holiday destinations in Southeast Asia ...,Beach holiday destinations in Southeast Asia I...,13,...,[https://thesmartlocal.com/wp-content/uploads/...,43,43,1,30,,,,,


In [12]:
smartlocal = smartlocal.reset_index(drop=True)
smartlocal

Unnamed: 0,url,timedelta,title,category,subcategory1,subcategory2,subcategory3,preview,content,n_tokens_title,...,num_tags,num_shares,num_imgs_y,img_links_y,img_maps,img_dark,img_light,img_saturation,top3_colours,bot3_colours
0,https://thesmartlocal.com/read/staytion-marsil...,0,Staytion Marsiling: Coworking Space In The Nor...,Adulting,Career,,,"Hooray for being able to sleep in, plus the ti...",Staytion Marsiling – Coworking space in the No...,15,...,1,27,,,"[0_0_new.png, 0_1_new.png, 0_2_new.png, 0_3_ne...","[0.81, 0.65, 0.72, 0.91, 0.74, 0.33, 0.52]","[0.19, 0.35, 0.28, 0.09, 0.26, 0.67, 0.48]","[46.14, 62.21, 69.59, 52.29, 54.78, 50.83, 53.62]","[[black, black, black], [lightslategray, light...","[[gray, gray, gray], [gray, gray, gray], [blac..."
1,https://thesmartlocal.com/read/things-to-do-es...,1,"Esplanade Is Having Free Shows, A Theatre BTS ...",Things To Do,Things To Do In Singapore,,,Do not miss the free entertainment here.,Things to do at Esplanade So you’ve been to Es...,17,...,2,73,,,"[1_0_new.jpg, 1_1_new.jpg, 1_2_new.jpg, 1_3_ne...","[0.51, 0.93, 0.69, 0.34, 0.91]","[0.49, 0.07, 0.31, 0.66, 0.09]","[85.23, 104.4, 151.83, 69.18, 100.27]","[[darkseagreen, tan, darkgray], [black, darksl...","[[lightsteelblue, darkgray, slategray], [india..."
2,https://thesmartlocal.com/read/things-to-do-no...,1,17 New Things To Do In November 2022 – Bishan ...,Things To Do,Activities,,,"In the blink of an eye, we're approaching 2023...",Things to do in November 2022 Halloween may be...,19,...,6,244,,,"[2_0_new.png, 2_1_new.jpg, 2_2_new.png, 2_3_ne...","[0.53, 0.35, 0.59, 0.62, 0.53, 0.74, 0.12, 0.4...","[0.47, 0.65, 0.41, 0.38, 0.47, 0.26, 0.88, 0.5...","[128.23, 132.23, 102.29, 145.1, 92.48, 97.79, ...","[[black, black, black], [cornflowerblue, cornf...","[[purple, lightskyblue, lavender], [mistyrose,..."
3,https://thesmartlocal.com/read/paypal-welcome-...,1,You Can Redeem Vouchers For Brands Like foodpa...,Local,Businesses,,,Vouchers can also be used on Zalora and Agoda.,PayPal’s Welcome Pack promotion With Black Fri...,16,...,2,25,,,"[3_0_new.png, 3_1_new.jpg, 3_2_new.png, 3_3_ne...","[0.49, 0.48, 0.67, 0.51, 0.67]","[0.51, 0.52, 0.33, 0.49, 0.33]","[79.51, 73.05, 124.35, 84.41, 80.5]","[[darkslategray, lightgray, lightgray], [silve...","[[dimgray, darkgray, darkslateblue], [lightste..."
4,https://thesmartlocal.com/read/things-to-do-ju...,1,9 Best Things To Do In Jurong For Westies To S...,Things To Do,Things To Do In Singapore,,,"Hot take: west side, best side.","Things to do in Jurong For too long, residents...",20,...,3,31,,,"[4_0_new.jpg, 4_1_new.jpg, 4_2_new.jpg, 4_3_ne...","[0.58, 0.79, 0.66, 0.73, 0.76, 0.74, 0.38, 0.6...","[0.42, 0.21, 0.34, 0.27, 0.24, 0.26, 0.62, 0.3...","[109.64, 183.59, 73.78, 93.46, 82.64, 135.15, ...","[[lightsteelblue, silver, skyblue], [forestgre...","[[bisque, antiquewhite, lightgray], [gray, dar..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,https://thesmartlocal.com/read/budweiser-fifa-...,37,Budweiser Is Giving Away 2 Trips To FIFA World...,Things To Do,Contests,,,It's time to Oleh Oleh with the crowd.,Budweiser FIFA World Cup giveaway FIFA World C...,18,...,3,5,,,"[115_0_new.jpg, 115_1_new.jpg, 115_2_new.jpg, ...",,,,,
116,https://thesmartlocal.com/read/inspiration-sto...,1409,Inspiration Store At Orchard Xchange Teaches Y...,Things To Do,Events,,,JR East Inspiration Store Here’s a thought –...,JR East Inspiration Store Here’s a thought – m...,15,...,0,0,,,"[116_0_new.jpg, 116_1_new.jpg, 116_2_new.jpg, ...",,,,,
117,https://thesmartlocal.com/read/nascans-sg/,1409,NASCANS Has Ex-MOE Teachers And Coaches Maths ...,Local,Businesses,,,NASCANS Student Care Centre If you’ve racked...,NASCANS Student Care Centre If you’ve racked y...,15,...,0,0,,,"[117_0_new.jpg, 117_1_new.png, 117_2_new.jpg, ...",,,,,
118,https://thesmartlocal.com/read/family-spots-no...,1409,6 Hidden Family Spots In The North To Get Your...,Things To Do,Things To Do In Singapore,,,Family places and activities in Singapore’s No...,Family places and activities in Singapore’s No...,14,...,0,119,,,"[118_0_new.png, 118_1_new.jpg, 118_2_new.jpg, ...",,,,,


In [3]:
#https://stackoverflow.com/questions/4643847/python-how-to-get-a-list-of-color-that-used-in-one-image

def convert_rgb_to_names(rgb_tuple):
    """Return the CSS3 colour name closest to rgb_tuple in RGB space."""
    # Dictionary mapping every CSS3 hex code to its colour name
    css3_db = CSS3_HEX_TO_NAMES
    names = []
    rgb_values = []
    for color_hex, color_name in css3_db.items():
        names.append(color_name)
        rgb_values.append(hex_to_rgb(color_hex))

    # Nearest-neighbour lookup; note that the tree is rebuilt on every call
    kdt_db = KDTree(rgb_values)
    distance, index = kdt_db.query(rgb_tuple)
    return names[index]


def convert_to_colors(lst):
    """Map a list of (rgb, pixel_count) pairs to their closest CSS3 colour names."""
    return [convert_rgb_to_names(rgb) for rgb, _count in lst]
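
The lookup above amounts to a nearest-neighbour search in RGB space. A minimal sketch of that idea follows, using a brute-force search over a small hand-picked palette rather than the KDTree and the full CSS3 table. `PALETTE` and `nearest_colour_name` are hypothetical names used only for illustration.

```python
# Hypothetical subset of CSS3 colour names; the notebook uses the full webcolors table.
PALETTE = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "lime": (0, 255, 0),
    "blue": (0, 0, 255),
    "gray": (128, 128, 128),
    "tan": (210, 180, 140),
    "maroon": (128, 0, 0),
}


def nearest_colour_name(rgb, palette=PALETTE):
    """Return the palette name with the smallest squared Euclidean distance to rgb."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, palette[name]))
    return min(palette, key=squared_distance)


print(nearest_colour_name((250, 5, 5)))      # a near-red pixel
print(nearest_colour_name((200, 175, 150)))  # a beige pixel
```

With only a handful of colours the brute-force scan is fine; a KDTree pays off when the same palette is queried many times, provided the tree is built once and reused.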

In [13]:
front_load = 'G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/'
#"./Image_Resolution_Down_4/"

gotproblem = []

for i in range(0, len(smartlocal.img_maps)):
    thresh = 128
    lst1 = []
    lst2 = []
    lst3 = []
    lst4 = []
    lst5 = []

    for j in range(0, len(smartlocal.img_maps[i])):
        img_name = smartlocal.img_maps[i][j]

        try:
            #Light and Dark
            if(img_name.endswith('.gif')):
                img_name2 = img_name.rsplit('.', 1)[0] + '.jpg'
                img_grey = cv2.imread('G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4_Conversion_gif/'+img_name2, cv2.IMREAD_GRAYSCALE)

                if img_grey is None:
                    try:
                        im1 = Image.open(f'G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4/{img_name}').convert('RGB')
                        img_name2 = img_name.rsplit('.', 1)[0]+'.jpg'
                        im1.save(f'G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4_Conversion_gif/{img_name2}')
                        img_grey = cv2.imread('G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4_Conversion_gif/'+img_name2, cv2.IMREAD_GRAYSCALE)
                    except Exception:
                        print(img_name + ' still cannot work')

            elif(img_name.endswith('.png')):
                img_name2 = img_name.rsplit('.', 1)[0]+'.jpg'
                img_grey = cv2.imread('G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4_Conversion_png/'+img_name2, cv2.IMREAD_GRAYSCALE)

                if img_grey is None:
                    try:
                        im1 = Image.open(f'G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4/{img_name}').convert('RGB')
                        img_name2 = img_name.rsplit('.', 1)[0]+'.jpg'
                        im1.save(f'G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4_Conversion_png/{img_name2}')
                        img_grey = cv2.imread('G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4_Conversion_png/'+img_name2, cv2.IMREAD_GRAYSCALE)
                    except Exception:
                        print(img_name + ' still cannot work')
            else:
                img_grey = cv2.imread('G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4/'+img_name, cv2.IMREAD_GRAYSCALE)

            # Binarise: pixels brighter than `thresh` become 255 (light), the rest 0 (dark)
            img_binary = cv2.threshold(img_grey, thresh, 255, cv2.THRESH_BINARY)[1]
            # Count each class directly; indexing the counts from np.unique fails
            # when an image is entirely dark or entirely light
            dark = np.mean(img_binary == 0)
            light = np.mean(img_binary == 255)
            lst1.append(str(round(dark, 2)))
            lst2.append(str(round(light, 2)))
        except Exception:
            print('light and dark problem ' + img_name)
            lst1.append('404')
            lst2.append('404')
            gotproblem.append(img_name) 
            
        try:
            #Saturation
            #https://stackoverflow.com/questions/58831690/how-to-measure-the-saturation-of-an-image
            if(img_name.endswith('.gif')):
                img_name2 = img_name.rsplit('.', 1)[0] + '.jpg'
                img = cv2.imread(f'G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4_Conversion_gif/{img_name2}') 
            else:
                img = cv2.imread(f'G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4/{img_name}') 

            img_hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
            saturation = img_hsv[:, :, 1].mean()
            lst3.append(str(round(saturation, 2)))
        except Exception:
            print('saturation problem ' + img_name)
            lst3.append('404')
            gotproblem.append(img_name) 
            
        
        try:
            #Closest Colour
            img = Image.open(f'G:/.shortcut-targets-by-id/1FRiRj4JDvv5-CyqFucHpDqyPraDplWlI/BT4222/Image_Resolution_Down_4/{img_name}').convert('RGB')
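            colors = Counter(img.getdata())
            # Sort colours by pixel count, most frequent first
            colors = dict(sorted(colors.items(), key=lambda item: item[1], reverse=True))

            top3colours = list(colors.items())[:3]
            # Note: [-5:-2] picks the 5th- to 3rd-least-frequent colours and skips the two rarest
            bot3colours = list(colors.items())[-5:-2]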
            top3_convert = convert_to_colors(top3colours)
            bot3_convert = convert_to_colors(bot3colours)
            lst4.append(top3_convert)
            lst5.append(bot3_convert)
   
        except Exception:
            print('closest colour problem ' + img_name)
            lst4.append(['404'])
            lst5.append(['404'])
            gotproblem.append(img_name)
    
    if i >= 1 and i <5:
        print(i)
        print(lst1)
        print(lst2)
        print(lst3)
        print(lst4)
        print(lst5)
        
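    # Assign with .at rather than chained indexing, which triggers SettingWithCopyWarning
    # (assumes these columns already exist with object dtype)
    smartlocal.at[i, 'img_dark'] = lst1
    smartlocal.at[i, 'img_light'] = lst2
    smartlocal.at[i, 'img_saturation'] = lst3
    smartlocal.at[i, 'top3_colours'] = lst4
    smartlocal.at[i, 'bot3_colours'] = lst5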
    
    
    if i % 100 == 0:
        print(i)

            


0
1
['0.51', '0.93', '0.69', '0.34', '0.91']
['0.49', '0.07', '0.31', '0.66', '0.09']
['85.23', '104.4', '151.83', '69.18', '100.27']
[['darkseagreen', 'tan', 'darkgray'], ['black', 'darkslategray', 'black'], ['darkcyan', 'darkcyan', 'darkcyan'], ['gainsboro', 'lightgray', 'gainsboro'], ['dimgray', 'dimgray', 'dimgray']]
[['lightsteelblue', 'darkgray', 'slategray'], ['indianred', 'sienna', 'saddlebrown'], ['tan', 'tan', 'tan'], ['rosybrown', 'rosybrown', 'rosybrown'], ['maroon', 'maroon', 'maroon']]
2
['0.53', '0.35', '0.59', '0.62', '0.53', '0.74', '0.12', '0.49', '0.76', '0.53', '0.14', '0.65', '0.22', '0.67', '0.71', '0.86', '0.35', '0.67', '0.18', '0.57', '0.34', '0.32', '0.76', '0.86', '0.99', '0.96', '0.55', '0.42', '0.37', '0.16', '0.93', '0.85']
['0.47', '0.65', '0.41', '0.38', '0.47', '0.26', '0.88', '0.51', '0.24', '0.47', '0.86', '0.35', '0.78', '0.33', '0.29', '0.14', '0.65', '0.33', '0.82', '0.43', '0.66', '0.68', '0.24', '0.14', '0.01', '0.04', '0.45', '0.58', '0.63', '0.



100
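
The light/dark measurement in the loop thresholds each greyscale image at 128 and reports the share of pixels on either side. A plain-Python sketch of the same measurement is given below. The helper name `light_dark_fractions` and the flat-list input are assumptions for illustration; the notebook itself works on OpenCV arrays.

```python
def light_dark_fractions(grey_pixels, thresh=128):
    """Return (dark, light) fractions for a flat list of 0-255 grey values.

    Mirrors cv2.THRESH_BINARY: values strictly above thresh count as light.
    """
    total = len(grey_pixels)
    if total == 0:
        raise ValueError("image has no pixels")
    light = sum(1 for value in grey_pixels if value > thresh) / total
    return round(1 - light, 2), round(light, 2)


# A mixed image, then an image with no light pixels at all
print(light_dark_fractions([0, 50, 128, 200]))
print(light_dark_fractions([10, 20, 30]))
```

Counting one class and deriving the other keeps the result well defined even when every pixel falls on the same side of the threshold.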


In [14]:
smartlocal.head()

Unnamed: 0,url,timedelta,title,category,subcategory1,subcategory2,subcategory3,preview,content,n_tokens_title,...,num_tags,num_shares,num_imgs_y,img_links_y,img_maps,img_dark,img_light,img_saturation,top3_colours,bot3_colours
0,https://thesmartlocal.com/read/staytion-marsil...,0,Staytion Marsiling: Coworking Space In The Nor...,Adulting,Career,,,"Hooray for being able to sleep in, plus the ti...",Staytion Marsiling – Coworking space in the No...,15,...,1,27,,,"[0_0_new.png, 0_1_new.png, 0_2_new.png, 0_3_ne...","[0.81, 0.65, 0.72, 0.91, 0.74, 0.33, 0.52]","[0.19, 0.35, 0.28, 0.09, 0.26, 0.67, 0.48]","[46.14, 62.21, 69.59, 52.29, 54.78, 50.83, 53.62]","[[black, black, black], [lightslategray, light...","[[gray, gray, gray], [gray, gray, gray], [blac..."
1,https://thesmartlocal.com/read/things-to-do-es...,1,"Esplanade Is Having Free Shows, A Theatre BTS ...",Things To Do,Things To Do In Singapore,,,Do not miss the free entertainment here.,Things to do at Esplanade So you’ve been to Es...,17,...,2,73,,,"[1_0_new.jpg, 1_1_new.jpg, 1_2_new.jpg, 1_3_ne...","[0.51, 0.93, 0.69, 0.34, 0.91]","[0.49, 0.07, 0.31, 0.66, 0.09]","[85.23, 104.4, 151.83, 69.18, 100.27]","[[darkseagreen, tan, darkgray], [black, darksl...","[[lightsteelblue, darkgray, slategray], [india..."
2,https://thesmartlocal.com/read/things-to-do-no...,1,17 New Things To Do In November 2022 – Bishan ...,Things To Do,Activities,,,"In the blink of an eye, we're approaching 2023...",Things to do in November 2022 Halloween may be...,19,...,6,244,,,"[2_0_new.png, 2_1_new.jpg, 2_2_new.png, 2_3_ne...","[0.53, 0.35, 0.59, 0.62, 0.53, 0.74, 0.12, 0.4...","[0.47, 0.65, 0.41, 0.38, 0.47, 0.26, 0.88, 0.5...","[128.23, 132.23, 102.29, 145.1, 92.48, 97.79, ...","[[black, black, black], [cornflowerblue, cornf...","[[purple, lightskyblue, lavender], [mistyrose,..."
3,https://thesmartlocal.com/read/paypal-welcome-...,1,You Can Redeem Vouchers For Brands Like foodpa...,Local,Businesses,,,Vouchers can also be used on Zalora and Agoda.,PayPal’s Welcome Pack promotion With Black Fri...,16,...,2,25,,,"[3_0_new.png, 3_1_new.jpg, 3_2_new.png, 3_3_ne...","[0.49, 0.48, 0.67, 0.51, 0.67]","[0.51, 0.52, 0.33, 0.49, 0.33]","[79.51, 73.05, 124.35, 84.41, 80.5]","[[darkslategray, lightgray, lightgray], [silve...","[[dimgray, darkgray, darkslateblue], [lightste..."
4,https://thesmartlocal.com/read/things-to-do-ju...,1,9 Best Things To Do In Jurong For Westies To S...,Things To Do,Things To Do In Singapore,,,"Hot take: west side, best side.","Things to do in Jurong For too long, residents...",20,...,3,31,,,"[4_0_new.jpg, 4_1_new.jpg, 4_2_new.jpg, 4_3_ne...","[0.58, 0.79, 0.66, 0.73, 0.76, 0.74, 0.38, 0.6...","[0.42, 0.21, 0.34, 0.27, 0.24, 0.26, 0.62, 0.3...","[109.64, 183.59, 73.78, 93.46, 82.64, 135.15, ...","[[lightsteelblue, silver, skyblue], [forestgre...","[[bisque, antiquewhite, lightgray], [gray, dar..."
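
The `img_maps` entries are bare file names, and the light/dark step decides which folder to read each one from based on its extension. A sketch of that routing as a single helper is shown here. `BASE_DIR` is a placeholder rather than the real drive location, and `greyscale_source_path` is a hypothetical name; the folder names are taken from the dataset description.

```python
import posixpath

# Placeholder root; the notebook reads from a mounted Google Drive path instead.
BASE_DIR = "./BT4222"
ORIGINAL_DIR = "Image_Resolution_Down_4"
CONVERTED_DIRS = {
    ".gif": "Image_Resolution_Down_4_Conversion_gif",
    ".png": "Image_Resolution_Down_4_Conversion_png",
}


def greyscale_source_path(img_name, base_dir=BASE_DIR):
    """Return the path the light/dark step would read for img_name."""
    stem, ext = posixpath.splitext(img_name)
    folder = CONVERTED_DIRS.get(ext)
    if folder is None:
        # Other formats (jpg, jpeg) are read straight from the original folder
        return posixpath.join(base_dir, ORIGINAL_DIR, img_name)
    # gif and png files are read from their converted .jpg copies
    return posixpath.join(base_dir, folder, stem + ".jpg")


print(greyscale_source_path("2_0_new.png"))
print(greyscale_source_path("1_0_new.jpg"))
```

Keeping the folder names in one place would make it easier to move the dataset without touching every `cv2.imread` call.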


## Export DataFrame

In [35]:
smartlocal.to_excel("smartlocalfinal.xlsx")
smartlocal.to_parquet('./datasets/thesmartlocalfinal.parquet')

In [24]:
# lightdark_problem = ['258_4.jpg', '310_18.jpg', '732_4.gif', '1294_4.jpg', '1495_0.png', '1516_4.png', '1723_0.png', '1724_11.png', '1848_8.png', '1961_5.jpg', '2134_10.png', '2387_1.gif',  '2557_9.jpg?resize=1080%2C1080&ssl=1', '3034_8.png','3220_12.png','3232_1.png', '3906_8.jpg']
# saturation_problem = ['1723_0.png', '2134_10.png', '2557_9.jpg?resize=1080%2C1080&ssl=1', '3034_8.png', '3906_8.jpg']
# colour_problem = ['1723_0.png', '2134_10.png', '2557_9.jpg?resize=1080%2C1080&ssl=1', '3034_8.png', '3906_8.jpg']
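
One entry in the lists above, `2557_9.jpg?resize=1080%2C1080&ssl=1`, still carries a URL query string in its file name, which may be why it fails in all three steps. A sketch of stripping that suffix before saving or loading follows, assuming the file names are derived from image URLs. The helper `clean_image_name` is hypothetical.

```python
def clean_image_name(name):
    """Drop any '?query' or '#fragment' suffix from an image file name."""
    return name.split("?", 1)[0].split("#", 1)[0]


print(clean_image_name("2557_9.jpg?resize=1080%2C1080&ssl=1"))
print(clean_image_name("1723_0.png"))
```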