# Using Selenium and Beautiful Soup: Uniqlo

In this section we will use Selenium and Beautiful Soup to scrape all of the women's tops from Uniqlo.
First, we will use Selenium to get every link for the shirts, and then use Beautiful Soup to get the correct picture from each product page.

Some great sources used to help in this: 
- [Selenium Website for Locating Elements](https://selenium-python.readthedocs.io/locating-elements.html)
- [Stack Overflow: Saving Images](https://stackoverflow.com/questions/18497840/beautifulsoup-how-to-open-images-and-download-them)
- [Get all image src's with specific class](https://stackoverflow.com/questions/8289957/python-2-7-beautiful-soup-img-src-extract)


Our steps are as follows: 
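1. Import the libraries and define the page URLs
2. Collect the product links with Selenium
3. Get the image URLs from each product page with Beautiful Soup
4. Save the images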

#### Import the required libraries:

In [80]:
import time
import urllib.request  # used later to download the image bytes
import requests
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import cv2
import numpy as np

#### Create variables for each of the URLs

We will do this for each of the Uniqlo pages we will be pulling from.

In [26]:
t_shirts = 'https://www.uniqlo.com/us/en/women/t-shirts-and-tops'
grahic_t = 'https://www.uniqlo.com/us/en/ut-graphic-tees/shop-all-women/short-sleeve-t-shirts?ptid=ut-graphic-tees-shop-all-women'
shirts = 'https://www.uniqlo.com/us/en/women/shirts-and-blouses'

#### Create our Chrome Driver element and get Links:

Here I am using the Chrome Driver Manager because I ran into some issues using the driver directly, and the manager resolved all of them.

We will use our function __'get_product_links'__ to return a list of link elements, one for each product URL on the page.

In [None]:
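def get_product_links(url):
    # webdriver_manager downloads (or finds in its cache) a matching chromedriver
    driver = webdriver.Chrome(ChromeDriverManager().install()) 
    driver.get(url)
    # match every link whose visible text contains 'WOMEN'
    links = driver.find_elements_by_partial_link_text('WOMEN')
    # the browser is left open on purpose: the returned elements can only be
    # read while their session is alive; close it later with links[0].parent.quit()
    return links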

In [23]:
t_shirt_links = get_product_links(t_shirts)
grahic_t_links = get_product_links(grahic_t)
shirt_links = get_product_links(shirts)


Checking for mac64 chromedriver:2.46 in cache
Driver found in /Users/elenasm7/.wdm/chromedriver/2.46/mac64/chromedriver
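
Selenium link elements can only be read while their browser session is open, so one option is to read each 'href' into a plain string and then close the browser. Below is a minimal sketch of that idea. The names `collect_hrefs` and `get_product_hrefs` are hypothetical and not part of the original notebook, and the `webdriver.Chrome(...)` call simply mirrors the one used above, so it may need adjusting for your Selenium version.

```python
def collect_hrefs(driver, url, link_text="WOMEN"):
    """Load url and return the href of every link whose text contains link_text."""
    driver.get(url)
    # "partial link text" is the locator string behind By.PARTIAL_LINK_TEXT.
    elements = driver.find_elements("partial link text", link_text)
    # Read the attribute now, while the browser session is still alive.
    return [el.get_attribute("href") for el in elements]


def get_product_hrefs(url):
    """Hypothetical variant of get_product_links that returns plain strings."""
    # Imported here so collect_hrefs can be tried without a browser installed.
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(ChromeDriverManager().install())
    try:
        return collect_hrefs(driver, url)
    finally:
        driver.quit()  # always close the browser, even if loading the page fails
```

Because this variant returns strings rather than elements, any later code would pass each string straight to `requests.get` instead of calling '.get_attribute('href')' on it.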


#### Use a for loop to make sure these items look right. 

We loop over the elements and call '.get_attribute('href')' on each one. Without that call we would only get the Selenium element object back, not the URL.


In [25]:
for link in t_shirt_links:
    print(link.get_attribute('href'))

https://www.uniqlo.com/us/en/women
https://www.uniqlo.com/us/en/women/uniqlo-and-alexanderwang
https://www.uniqlo.com/us/en/women
https://www.uniqlo.com/us/en/women-drape-crew-neck-short-sleeve-t-shirt-413696.html?dwvar_413696_color=COL70&cgid=women-t-shirts-and-tops
https://www.uniqlo.com/us/en/women-u-crew-neck-short-sleeve-t-shirt-414443.html?dwvar_414443_color=COL55&cgid=women-t-shirts-and-tops
https://www.uniqlo.com/us/en/women-u-relax-fit-crew-neck-short-sleeve-t-shirt-415793.html?dwvar_415793_color=COL08&cgid=women-t-shirts-and-tops
https://www.uniqlo.com/us/en/women-cropped-crew-neck-short-sleeve-t-shirt-413675.html?dwvar_413675_color=COL53&cgid=women-t-shirts-and-tops
https://www.uniqlo.com/us/en/women-striped-cropped-crew-neck-short-sleeve-t-shirt-413728.html?dwvar_413728_color=COL68&cgid=women-t-shirts-and-tops
https://www.uniqlo.com/us/en/women-ribbed-crew-neck-short-sleeve-t-shirt-413996.html?dwvar_413996_color=COL02&cgid=women-t-shirts-and-tops
https://www.uniqlo.com/us/e

In [None]:
for link in grahic_t_links:
    print(link.get_attribute('href'))

In [None]:
for link in shirt_links:
    print(link.get_attribute('href'))

Everything except for the first three items looks good! So, let's update our lists of links to exclude those.

In [27]:
t_shirt_links = t_shirt_links[3:]
grahic_t_links = grahic_t_links[3:]
shirt_links = shirt_links[3:]
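
Slicing off the first three entries depends on the page always listing its navigation links first. A hedged alternative is to filter by the shape of the URL. The sketch below assumes the hrefs have already been read into plain strings, and that product pages end in '-<digits>.html' as they do in the output above. The name `keep_product_urls` is hypothetical.

```python
import re

# Product pages in the sample output end in "-<digits>.html"; category pages do not.
PRODUCT_URL = re.compile(r"-\d+\.html(\?|$)")


def keep_product_urls(hrefs):
    """Return only the hrefs that look like individual product pages."""
    return [h for h in hrefs if h and PRODUCT_URL.search(h)]


sample = [
    "https://www.uniqlo.com/us/en/women",
    "https://www.uniqlo.com/us/en/women/uniqlo-and-alexanderwang",
    "https://www.uniqlo.com/us/en/women-drape-crew-neck-short-sleeve-t-shirt-413696.html"
    "?dwvar_413696_color=COL70&cgid=women-t-shirts-and-tops",
]
print(keep_product_urls(sample))  # only the last URL is kept
```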

#### Get all of the image URLs:

Use the 'get_image_srcs' function to get all of the image URLs from each product page:

In [77]:
def get_image_srcs(URLs):
    srcs = []
    for item in URLs:
        # fetch the product page that this Selenium link element points to
        unq_page = requests.get(item.get_attribute('href'))
        soup = BeautifulSoup(unq_page.content, 'lxml')
        # collect the src of every thumbnail image on the page
        srcs.extend(x['src'] for x in soup.find_all('img', {'class': 'productthumbnail'}))
        time.sleep(1)  # pause between requests
    return srcs

In [None]:
t_shirt_src = get_image_srcs(t_shirt_links)

In [None]:
grap_t_src = get_image_srcs(grahic_t_links)

In [None]:
shirt_src = get_image_srcs(shirt_links)
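
To see what the Beautiful Soup step inside 'get_image_srcs' does without hitting the site, here is a small sketch that runs the same kind of lookup on an inline HTML string. The HTML fragment and the `thumbnail_srcs` name are made up for illustration, and the real Uniqlo markup may differ. It uses the built-in 'html.parser' so it does not need lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical product-page fragment, not copied from the real site.
SAMPLE_HTML = """
<div>
  <img class="productthumbnail" src="https://example.com/a.jpg">
  <img class="banner" src="https://example.com/banner.jpg">
  <img class="productthumbnail" src="https://example.com/b.jpg">
</div>
"""


def thumbnail_srcs(html):
    """Return the src of every <img> tag with the class 'productthumbnail'."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.find_all("img", class_="productthumbnail")]


print(thumbnail_srcs(SAMPLE_HTML))
```

Only the two thumbnail images are returned; the banner image is skipped because its class does not match.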

#### Save the Images:

Using the URL lists created above, the 'save_src_image' function saves each image to our folder.

In [None]:
def save_src_image(imag_srcs,num):
    # note: num is not used yet
    start_time = time.time()
    for i, url in enumerate(imag_srcs):
        time.sleep(0.5)  # brief pause between downloads
        try:
            # download the raw image bytes
            request = urllib.request.Request(url)
            response = urllib.request.urlopen(request)
            binary_str = response.read()
            byte_array = bytearray(binary_str)
            # imdecode expects unsigned 8-bit data, so use uint8 rather than int8
            numpy_array = np.asarray(byte_array, dtype='uint8')
            image = cv2.imdecode(numpy_array, cv2.IMREAD_UNCHANGED)
            cv2.imwrite("Uniqulo_pics/"+"uq_{}".format(i)+".png", image)
            print("Saved "+"uq_{}".format(i)+".png")
        except Exception as e:
            print(str(e))
    end_time = time.time()
    return "Done"

In [74]:
start_time = time.time()
for i, url in enumerate(image_srcs):
#     print(url)
    time.sleep(0.5)
    try:
        request = urllib.request.Request(url)
        response = urllib.request.urlopen(request)
        binary_str = response.read()
        byte_array = bytearray(binary_str)
        # imdecode expects unsigned 8-bit data, so use uint8 rather than int8
        numpy_array = np.asarray(byte_array, dtype='uint8')
        image = cv2.imdecode(numpy_array, cv2.IMREAD_UNCHANGED)
        cv2.imwrite("Uniqulo_pics/"+"uq_{}".format(i)+".png", image)
        print("Saved "+"uq_{}".format(i)+".png")
    except Exception as e:
        print(str(e))

end_time = time.time()
print("Done")
# print("Total time: "+ str(end_time - start_time) +'sec')

Saved uq_0.png
Saved uq_1.png
Saved uq_2.png
Saved uq_3.png
Saved uq_4.png
Saved uq_5.png
Saved uq_6.png
Saved uq_7.png
Saved uq_8.png
Saved uq_9.png
Saved uq_10.png
Saved uq_11.png
Done
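
Because the file name in 'save_src_image' depends only on the loop index, running it on a second list of URLs would overwrite the files saved from the first. One way around this, sketched below, is to build the path from a per-category prefix. The `image_path` helper and its `prefix` argument are hypothetical and not part of the original notebook.

```python
import os


def image_path(folder, prefix, index, ext=".png"):
    """Build a file path from a folder, a category prefix, and an index."""
    return os.path.join(folder, "{}_{}{}".format(prefix, index, ext))


# Each category gets its own prefix, so indexes can repeat without collisions.
for i in range(2):
    print(image_path("Uniqulo_pics", "t_shirt", i))
    print(image_path("Uniqulo_pics", "shirt", i))
```

The resulting path could be passed to `cv2.imwrite` in place of the hard-coded "uq_{}" name.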
