# Get Product Information
> Get information from a website by using <b><u>Selenium</u></b> and <b><u>BeautifulSoup</u></b>.


[Back to <b>contents</b>](../README.md)

In [23]:
from io import BytesIO
import re
from urllib.request import urlopen
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from colorthief import ColorThief
import number as MyNum #number.py
import csv


**KREAM** (https://kream.co.kr/)
* It is the largest Korean shoe resale website.
* It provides many shoe <b><u>images</u></b> and <b><u>price</u></b> information.
<br>
<p>Screenshot</p>

---------
![KreamWebImg](../img/kreamWeb.png)

In [24]:
url = 'https://kream.co.kr/search?category_id=34&sort=popular&per_page=40'

Product Information structure in KREAM (HTML)

![productHtml](../img/productHtml.png)


**<mark>'get' functions created based on this</mark>**
* getLink : Get the link to each shoe's product page.
* getImage : Get the shoe's image URL, which is used to extract its colors.
* getColors : Get the colors of the shoe by using the <b><u>[colorthief](https://lokeshdhakar.com/projects/color-thief/)</u></b> module.
* getName : Get the name.
* getBrand : Get the brand.
* getPrice : Get the price. If the price is 0, the shoe is skipped.
* getWish : Get the wish count from text by using the getNumber function (number.py).
* getReview : Get the review count, in the same way as getWish.
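
The two text-to-number conversions in this list can be tried on plain strings. The sketch below is only an illustration: `parse_price` mirrors the digit-stripping idea behind getPrice, while `parse_count` is a hypothetical stand-in for `getNumber`, whose real code lives in number.py and is not shown here. The input formats, including the Korean unit 만 (10,000), are assumptions.

```python
import re

def parse_price(price_text):
    """Keep only the digits of a price label; return 0 when there are none."""
    digits = re.sub(r'[^0-9]', '', price_text)
    return int(digits) if digits else 0

def parse_count(count_text):
    """Hypothetical stand-in for number.py's getNumber (not the author's code).

    Assumes counts look like '3,456' or use the unit 만 (10,000), e.g. '1.2만'.
    """
    text = count_text.replace(',', '').strip()
    if not text:
        return 0
    if text.endswith('만'):
        return int(round(float(text[:-1]) * 10000))
    return int(float(text))

print(parse_price('129,000원'))  # 129000
print(parse_price('-'))          # 0, so this shoe would be skipped
print(parse_count('1.2만'))      # 12000
print(parse_count('3,456'))      # 3456
```

Returning 0 for a label without digits is what lets the caller skip shoes that have no listed price.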

In [25]:
def getLink(a):
  return 'https://kream.co.kr' + a.get('href') + " "

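def getImage(a):
  # Return the WebP image URL of the shoe, or 'NULL' if any tag or attribute is missing
  try:
    picture = a.find("picture", {"class" : "picture product_img"})
    source = picture.find("source", {"type" : "image/webp"})
    imgSrc = source.get('srcset') + " "
  except (AttributeError, TypeError):
    imgSrc = 'NULL'
  return imgSrc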

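def getColors(a):
  imgSrc = getImage(a)  # resolved outside try so it is always defined for the error message
  try:
    fd = urlopen(imgSrc)
    f = BytesIO(fd.read())
    color_thief = ColorThief(f)
    palette = color_thief.get_palette(color_count=10, quality=1)
    # Each entry is (R, G, B, ratio); the ratio comes from the modified colorthief.py
    colors = []
    for col in palette:
      if col[3] < 8: continue  # skip colors whose ratio is below 8
      colors.append(col)
    colors = sorted(colors, key = lambda x : -x[3])  # highest ratio first
  except Exception:
    print(f'getColors Exception\n{imgSrc}')
    colors = [(-1,-1,-1,-1)]  # placeholder when the image cannot be processed
  return colors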

def getName(a):
  # attrs must be a dict; a set or list is treated as a list of class names
  EngName = a.find("p", {"class" : "name"}).getText()
  return EngName

def getBrand(a):
  return a.find("p", {"class" : "brand"}).getText()

def getPrice(a):
  priceStr = a.find("div", {"class" : "amount"}).getText()
  numbers = re.sub(r'[^0-9]', '', priceStr)  # keep digits only
  price = 0
  if len(numbers): price = int(numbers)
  return price

def getWish(I):
  figure = I.find("span", {"class" : "wish_figure"})
  text = figure.find("span", {"class" : "text"}).getText()
  return MyNum.getNumber(text)

def getReview(I):
  figure = I.find("span", {"class" : "review_figure"})
  text = figure.find("span", {"class" : "text"}).getText()
  return MyNum.getNumber(text)
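
The lookups above can be tried without opening a browser by running them against a small HTML fragment. The fragment below is a hypothetical, heavily simplified stand-in for one search result: only the tag and class names are taken from the functions above, and every value (URL, brand, name, price) is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real KREAM markup contains many more elements.
SAMPLE_HTML = """
<div class="search_result_item">
  <a class="item_inner" href="/products/12345">
    <picture class="picture product_img">
      <source type="image/webp" srcset="https://example.com/shoe.webp">
    </picture>
    <p class="brand">Sample Brand</p>
    <p class="name">Sample Shoe Low</p>
    <div class="amount">129,000원</div>
  </a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
a = soup.find('a', {'class': 'item_inner'})

# The same lookups the get functions perform, written inline
link = 'https://kream.co.kr' + a.get('href')
picture = a.find('picture', {'class': 'picture product_img'})
img_src = picture.find('source', {'type': 'image/webp'}).get('srcset')
name = a.find('p', {'class': 'name'}).get_text()
brand = a.find('p', {'class': 'brand'}).get_text()

print(link)
print(img_src)
print(name, '/', brand)
```

Passing the attributes as a dict such as `{'class': 'name'}` makes it explicit which attribute is being matched.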

<p style = "color : LightGreen">The getColors function sorts the colors by ratio, highest first.</p>

> To get the ratio of each color, you need to modify <u>colorthief.py</u>.

-- reference : https://github.com/fengsp/color-thief-py/issues/1
```python
def palette(self):
    return self.vboxes.map(lambda x: x['color'])
```
to
```python
def palette(self):
    total = sum(self.vboxes.map(lambda x: x['vbox'].count))
    return self.vboxes.map(lambda x: x['color'] + (int(x['vbox'].count / float(total) * 100),))
```
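
To see what the modified `palette` produces and how getColors then uses it, the arithmetic can be reproduced on plain data. This is a sketch only: the `boxes` list and its pixel counts are made up, and it does not call colorthief at all.

```python
# Hypothetical data: each entry pairs an RGB color with the number of pixels
# that fell into its box. The values are invented for illustration.
boxes = [
    {'color': (20, 20, 20), 'count': 2000},
    {'color': (90, 90, 200), 'count': 600},
    {'color': (250, 250, 250), 'count': 4000},
    {'color': (200, 30, 40), 'count': 1000},
    {'color': (60, 140, 60), 'count': 400},
]

total = sum(box['count'] for box in boxes)

# Same arithmetic as the modified palette(): append an integer percentage
# to each RGB tuple, giving (R, G, B, ratio).
palette = [box['color'] + (int(box['count'] / float(total) * 100),) for box in boxes]

# Same post-processing as getColors: drop ratios below 8, largest ratio first.
colors = sorted((c for c in palette if c[3] >= 8), key=lambda c: -c[3])
print(colors)
# [(250, 250, 250, 50), (20, 20, 20, 25), (200, 30, 40, 12)]
```

Because `int()` truncates, the percentages may not add up to exactly 100.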


> Put each item into a list.
1. We need a webdriver. [(chromedriver.exe)](https://chromedriver.chromium.org/downloads)
2. Kream is a dynamic web page, so scrolling is required to load more items. (<b>max_scroll_num</b> = 100)
3. To confirm that scrolling is in progress, print <b>curScroll</b> every 10 scrolls.
4. When scrolling is finished, parse the page source and collect the search results into <b>items</b>.
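
A possible variation on the fixed scroll count in step 2 is to stop as soon as the page height stops growing. The sketch below is not the approach used in this notebook: `scroll_until_stable` is a hypothetical helper that only assumes the driver has an `execute_script` method, as a Selenium WebDriver does.

```python
import time

def scroll_until_stable(driver, max_scrolls=100, pause=2.0):
    """Scroll to the bottom repeatedly; stop early once the page stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for scroll_count in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the page time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if scroll_count % 10 == 0:
            print(f'Current Scroll :{scroll_count}')
        if new_height == last_height:
            return scroll_count  # nothing new was loaded
        last_height = new_height
    return max_scrolls
```

Usage would look like `scroll_until_stable(driver)` after `driver.get(url)`. One trade-off: if the page loads slowly, the height can look unchanged before new items arrive, so the pause may need to be longer than with a fixed number of scrolls.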

In [26]:

# Older Selenium releases accept the driver path here; newer ones expect a Service object instead
driver = webdriver.Chrome('chromedriver')
driver.maximize_window()
driver.get(url)
max_scroll_num = 100
curScroll = 1
for _ in range(max_scroll_num):
  # Jump to the bottom so the page loads the next batch of items
  driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
  if curScroll % 10 == 0 :
    print(f'Current Scroll :{curScroll}')
  curScroll += 1
  time.sleep(2)  # wait for the new items to load

time.sleep(1)
html = driver.page_source
bsObj = BeautifulSoup(html, 'html.parser')
lis = bsObj.find("div",{"class" : "search_result_list"})
items = lis.find_all("div", {"class":"search_result_item"})

Current Scroll :10
Current Scroll :20
Current Scroll :30
Current Scroll :40
Current Scroll :50
Current Scroll :60
Current Scroll :70
Current Scroll :80
Current Scroll :90
Current Scroll :100


- Create the <b>shoes.csv</b> file in the data folder and write the shoe information to it.
- Each row holds the name, brand, colors, link, wish count, and review count.

In [27]:
# The with block closes the file even if a row fails to parse
with open("../data/shoes.csv", 'w', encoding='utf-8', newline='') as csvFile:
  writer = csv.writer(csvFile)
  writer.writerow(('Name', 'Brand', 'Colors', 'Link', 'wish', 'review'))
  itemCnt = 0
  for item in items:
    if itemCnt == 1000 : break  # stop after 1,000 shoes
    a = item.find("a", {"class" : "item_inner"})
    I = item.find("div", {"class" : "interest_figure"})
    price = getPrice(a)
    if price == 0 : continue  # skip shoes without a price
    itemCnt += 1
    wish = getWish(I)
    review = getReview(I)
    link = getLink(a)
    Colors = getColors(a)
    name = getName(a)
    brand = getBrand(a)
    writer.writerow((name, brand, Colors, link, wish, review))

Screenshot of shoes.csv (1,000 rows of shoe data)
<img src ="../img/shoescsv.png" width ="400">
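
Because `csv.writer` stores the Colors list as its string form, reading the file back needs one extra conversion step. The sketch below shows one way to do it, using an in-memory buffer instead of the real file; the row values are made up and `ast.literal_eval` is a suggestion, not something this notebook uses.

```python
import ast
import csv
import io

# Made-up row in the same column order as the shoes.csv header.
header = ('Name', 'Brand', 'Colors', 'Link', 'wish', 'review')
row = ('Sample Shoe Low', 'Sample Brand',
       [(250, 250, 250, 50), (20, 20, 20, 25)],
       'https://kream.co.kr/products/12345', 12000, 340)

buffer = io.StringIO(newline='')  # in-memory stand-in for ../data/shoes.csv
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)

buffer.seek(0)
record = next(csv.DictReader(buffer))
colors = ast.literal_eval(record['Colors'])  # string -> list of tuples
wish = int(record['wish'])                   # every CSV field is read as a string
print(colors[0], wish)
# (250, 250, 250, 50) 12000
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for turning the stored string back into a list.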

[Back to <b>contents</b>](../README.md)