## Getting the list of product categories from Shopee

The aim of this project is to build a classifier that predicts the top-level category of a product from its image. First, we need to obtain the list of product categories from Shopee.

The URL of each category listing can be obtained from Shopee with the following code:

In [None]:
# Add the parent directory to the import path so that this notebook can be run from the `notebooks` directory
import sys
sys.path.insert(0, '../')

In [None]:
# Install necessary dependencies
!pip install selenium
!pip install webdriver-manager

In [4]:
from scripts.crawler import get_category_urls
category_urls = get_category_urls()
print(category_urls)

INFO:WDM:Get LATEST chromedriver version for google-chrome 107.0.5304
INFO:WDM:Driver [/Users/naomileow/.wdm/drivers/chromedriver/mac_arm64/107.0.5304/chromedriver] found in cache


["https://shopee.sg/Women's-Apparel-cat.11012819", "https://shopee.sg/Men's-Wear-cat.11012963", 'https://shopee.sg/Mobile-Gadgets-cat.11013350', 'https://shopee.sg/Home-Living-cat.11000001', 'https://shopee.sg/Computers-Peripherals-cat.11013247', 'https://shopee.sg/Beauty-Personal-Care-cat.11012301', 'https://shopee.sg/Home-Appliances-cat.11027421', 'https://shopee.sg/Health-Wellness-cat.11027491', 'https://shopee.sg/Food-Beverages-cat.11011871', 'https://shopee.sg/Toys-Kids-Babies-cat.11011538', 'https://shopee.sg/Kids-Fashion-cat.11012218', 'https://shopee.sg/Video-Games-cat.11013478', 'https://shopee.sg/Sports-Outdoors-cat.11012018', 'https://shopee.sg/Hobbies-Books-cat.11011760', 'https://shopee.sg/Cameras-Drones-cat.11013548', 'https://shopee.sg/Pet-Food-Supplies-cat.11012453', "https://shopee.sg/Women's-Bags-cat.11012592", "https://shopee.sg/Men's-Bags-cat.11012659", 'https://shopee.sg/Jewellery-Accessories-cat.11013077', 'https://shopee.sg/Watches-cat.11012515', "https://shopee.
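Each URL in the output above appears to follow the pattern `<name>-cat.<id>`. As a minimal sketch, assuming that pattern holds for every category, the name and numeric ID can be split out of a URL as shown below. The helper `parse_category_url` is hypothetical and is not part of `scripts.crawler`.

```python
import re
from urllib.parse import unquote, urlparse


def parse_category_url(url):
    """Split a category URL into (name, category_id).

    Assumes the path looks like `<name>-cat.<digits>`, as in the output above.
    This helper is illustrative and is not part of `scripts.crawler`.
    """
    # Take the path component and drop the leading slash.
    path = unquote(urlparse(url).path).lstrip("/")
    match = re.fullmatch(r"(?P<name>.+)-cat\.(?P<cat_id>\d+)", path)
    if match is None:
        raise ValueError(f"Unrecognised category URL: {url}")
    return match.group("name"), int(match.group("cat_id"))


print(parse_category_url("https://shopee.sg/Women's-Apparel-cat.11012819"))
# ("Women's-Apparel", 11012819)
```

The extracted name is a convenient label for the class that the classifier will predict.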

## Getting the product listing for each category
For each product category, the 3000 most recent listings are obtained from Shopee's V4 API with the [shopee-crawler](https://github.com/lthoangg/shopee-crawler) library, which contains scripts that fetch the list of products for a given category URL.

Important note: The product listings were obtained on 21 October 2022. As of 13 November 2022, the code no longer works because of a major change to Shopee's API, which now allows only credentialed access.

In [None]:
# Install the necessary dependencies
!pip install shopee_crawler

from scripts.crawler import get_category_data, download_images

datadir = '../data'
for c in category_urls:
    # Get the JSON containing the product listing for the category and save it in the `data` folder.
    # The file will be named [category].json
    get_category_data(datadir, c)

# Parse file containing the product listings for each category and obtain the associated images
# The images will be saved in `data/images/[category]`
# This takes hours to run
download_images(datadir)
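Since the download runs for hours, it can help to check how many images each category ended up with. The following is a minimal sketch, assuming only the `data/images/[category]` layout described in the comments above. The function `count_images_per_category` is a hypothetical helper and is not part of `scripts.crawler`.

```python
from pathlib import Path


def count_images_per_category(image_root):
    """Return {category_name: number_of_files} for each subdirectory of image_root.

    Assumes one subdirectory per category, as in `data/images/[category]`.
    """
    root = Path(image_root)
    counts = {}
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Count only regular files; nested directories are ignored.
        counts[category_dir.name] = sum(1 for f in category_dir.iterdir() if f.is_file())
    return counts
```

For example, `count_images_per_category('../data/images')` would return a dictionary keyed by category name. Categories with far fewer files than the rest may point to failed downloads.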

## Obtaining the dataset used in the project
The raw product images that we obtained from Shopee can be downloaded from the following Google Drive [link](https://drive.google.com/file/d/1SHSNueRjjoCwcRRjS2DfDkqavJ6SJPXK/view?usp=sharing).

In [None]:
# Download the archive from Google Drive; the inner wget call fetches the confirmation token required for large files
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=1SHSNueRjjoCwcRRjS2DfDkqavJ6SJPXK' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1SHSNueRjjoCwcRRjS2DfDkqavJ6SJPXK" -O images.tar.gz && rm -rf /tmp/cookies.txt

!tar -xvf images.tar.gz --directory ../data/
!rm images.tar.gz
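On systems without a `tar` command, the extraction step can be done in Python instead. This is a minimal sketch using the standard-library `tarfile` module. The function name `extract_archive` is hypothetical, and the sketch assumes the archive has already been downloaded and comes from a trusted source.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a (possibly compressed) tar archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Calling `extract_archive('images.tar.gz', '../data/')` mirrors the `tar -xvf` command above. Deleting the archive afterwards is left to the caller.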