#### Using Python and Beautiful Soup to gather product and pricing information based on a starting CSV file exported from a CRM or ecommerce system.

1. Importing dependencies and adjusting display settings to view full dataframes.

In [1]:
import pandas as pd 
import requests
import time
import random
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from requests import get
import re
pd.set_option('display.max_rows', None)

2. Import the CRM parts file containing the part numbers we need. I've manually removed unneeded information so that only the part numbers for the products of interest remain. In this particular CRM, Ergotron part numbers start with ERG, which makes it possible to filter out other manufacturers within the program; in this example I've already done that.

In [2]:
# Load the CRM export and keep only the columns we need.
pn = pd.read_csv('CRM.csv', thousands=",")
pn2 = pn.filter(['Product Name', 'Part Number', 'Price'], axis=1)
pn2['Product Name'] = pn2['Product Name'].astype(str)
# Keep only Ergotron rows, whose product names start with "ERG".
pn2 = pn2.loc[pn2['Product Name'].str.startswith('ERG', na=False)]
print(pn2.shape)
pn2

(261, 2)


Unnamed: 0,Product Name,Part Number
0,ERG-33-397-085,33-397-085
1,ERG-97-487-800,97-487-800
2,ERG-45-478-026,45-478-026
3,ERG-24-313-026,24-313-026
4,ERG-60-610-062,60-610-062
5,ERG-98-017,
6,ERG-97-617,97-617
7,ERG-47-058-200,47-058-200
8,ERG-24-383-026,24-383-026
9,ERG-98-353-921,


3. Making some modifications to the dataframe to help with data manipulation later.

In [3]:
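# Remove the literal "ERG-" prefix. str.lstrip('ERG-') would instead strip
# any leading E, R, G or - characters, which can eat into the part number.
pn2['Part Number'] = pn2['Product Name'].str.replace(r'^ERG-', '', regex=True)
pn2['PN_len'] = pn2['Part Number'].apply(len)
# Short form of the part number: first 6 characters of a 10-character
# number, first 9 characters of an 11-character number.
pn2.loc[pn2['PN_len'] == 10, 'Part Short'] = pn2['Part Number'].str[:6]
pn2.loc[pn2['PN_len'] == 11, 'Part Short'] = pn2['Part Number'].str[:9]
pn2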

Unnamed: 0,Product Name,Part Number,PN_len,Part Short
0,ERG-33-397-085,33-397-085,10,33-397
1,ERG-97-487-800,97-487-800,10,97-487
2,ERG-45-478-026,45-478-026,10,45-478
3,ERG-24-313-026,24-313-026,10,24-313
4,ERG-60-610-062,60-610-062,10,60-610
5,ERG-98-017,98-017,6,
6,ERG-97-617,97-617,6,
7,ERG-47-058-200,47-058-200,10,47-058
8,ERG-24-383-026,24-383-026,10,24-383
9,ERG-98-353-921,98-353-921,10,98-353
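
A note on trimming the `ERG-` prefix: Python's `str.lstrip` (and the pandas `.str.lstrip` wrapper) treats its argument as a set of characters rather than a literal prefix. The sketch below uses plain strings to show the difference. The helper `strip_prefix` and the part name `ERG-GR-100` are made up for illustration and do not come from the CRM file.

```python
def strip_prefix(name: str, prefix: str = "ERG-") -> str:
    """Remove `prefix` once from the start of `name`, if it is present."""
    return name[len(prefix):] if name.startswith(prefix) else name


sample = "ERG-GR-100"  # hypothetical part name, not from the CRM export

# lstrip removes every leading character found in the set {E, R, G, -},
# so it keeps going past the prefix and eats "GR-" as well.
print(sample.lstrip("ERG-"))           # 100

# Removing the literal prefix leaves the part number intact.
print(strip_prefix(sample))            # GR-100
print(strip_prefix("ERG-33-397-085"))  # 33-397-085
```

Part numbers that begin with a digit are unaffected either way, which is why the issue is easy to miss.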


4. Because of the structure of the target website, the most effective way to extract the information we need is to first identify the product ID codes the website uses to pull pricing information from a JSON file. This is step 1 of a two-step scraping process: we get the product IDs for all the items in our starting dataframe and add them to a list.

In [4]:
def delay() -> None:
    """Pause 5 to 10 seconds so requests are spaced out."""
    time.sleep(random.uniform(5, 10))

def find_product_id(url):
    """Return the digits that follow "productId =" on the page, or None if not found."""
    delay()
    try:
        r = requests.get(url, headers=headers)
        soup = BS(r.text, 'html.parser')
        target = soup.find("div", {"id": "dnn_ctr443_ContentPane"})
        # AttributeError: the div is missing; IndexError: no "productId =" in its text.
        tar = target.text.split("productId =", 1)[1]
        return re.sub("[^0-9]", "", tar)
    except (AttributeError, IndexError, requests.RequestException):
        return None

item = pn2['Part Short']
item2 = pn2['Part Number']
headers = {'User-Agent': 'Mozilla/5.0'}
base = "https://partner.ergotron.com/en-us/products/product-details/{}"
list5 = []

for part, part2 in zip(item, item2):
    # Try the short part number first, then the full part number
    # with and without the trailing "#".
    candidates = [base.format(part) + "#", base.format(part2) + "#", base.format(part2)]
    tar = "URL nonmatch"
    for url in candidates:
        found = find_product_id(url)
        if found is not None:
            tar = found
            break
    list5.append(tar)

print(list5)

['143', '331', '1055', '48', '273', '536', '385', '222', '54', 'URL nonmatch', '331', '886', '1174', 'URL nonmatch', '223', '300', '310', '353', '187', '498', '583', 'URL nonmatch', '1103', '216', '1177', '216', '144', '274', '191', '1054', '1098', '138', '234', '273', '1055', '350', '196', '205', 'URL nonmatch', 'URL nonmatch', '189', '1253', '339', '374', '191', '1058', '359', '102', '1094', '1091', '339', '217', '197', '1293', '323', '139', '969', '394', '179', '236', '302', '40', '97', '304', '880', '501', '1119', '1102', '254', '192', '220', 'URL nonmatch', '492', '190', '178', '202', '33', '106', '1309', '213', '304', '948', '1254', '347', 'URL nonmatch', '360', '354', '264', '436', '462', '1279', '1161', '405', '311', '300', '354', '100', '375', 'URL nonmatch', '881', '448', '321', '364', '449', '287', '264', '500', '389', '823', '845', '191', '101', '726', '1102', '945', '1055', '208', '943', '206', '394', '351', '824', '298', '267', '369', 'URL nonmatch', '261', '1104', '267',
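
The ID extraction in this step keeps every digit that appears after `productId =`, so any later number in the same block of text would be appended to the ID. A minimal sketch of a stricter alternative, assuming the page text contains something of the form `productId = <digits>`, captures only the first run of digits. The function name `extract_product_id` and the sample text are invented for illustration; the real page layout may differ.

```python
import re
from typing import Optional

# Match "productId", optional spaces, "=", optional spaces, then one run of digits.
PRODUCT_ID = re.compile(r"productId\s*=\s*(\d+)")


def extract_product_id(page_text: str) -> Optional[str]:
    """Return the first product ID found in page_text, or None if absent."""
    match = PRODUCT_ID.search(page_text)
    return match.group(1) if match else None


# Made-up page text, used only to exercise the function.
sample_text = "var culture = 'en-us'; var productId = 143; var pageSize = 7;"

print(extract_product_id(sample_text))       # 143
print(extract_product_id("no id in here"))   # None
```

Returning `None` instead of raising lets the caller decide when to fall back to the next URL or record `"URL nonmatch"`.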

5. Now we add the product ID list to a new column in our dataframe and, at the same time, create a separate dataframe for the products for which we could not find a product ID. Looking over that list, it mostly contains discontinued products.

In [5]:
pn2["erg_id"]= list5
redo = pn2[pn2.erg_id.str.contains("URL nonmatch")]
pn3 = pn2[~pn2.erg_id.str.contains("URL nonmatch")]
#pn3['Part Number'] = pn3['Part Number'].str[4:]
print (pn3)
print (redo)

         Product Name   Part Number  PN_len Part Short erg_id
0      ERG-33-397-085    33-397-085      10     33-397    143
1      ERG-97-487-800    97-487-800      10     97-487    331
2      ERG-45-478-026    45-478-026      10     45-478   1055
3      ERG-24-313-026    24-313-026      10     24-313     48
4      ERG-60-610-062    60-610-062      10     60-610    273
5          ERG-98-017        98-017       6        NaN    536
6          ERG-97-617        97-617       6        NaN    385
7      ERG-47-058-200    47-058-200      10     47-058    222
8      ERG-24-383-026    24-383-026      10     24-383     54
10     ERG-97-487-800    97-487-800      10     97-487    331
11    ERG-SV43-1120-0   SV43-1120-0      11  SV43-1120    886
12     ERG-YES24GMPW4    YES24GMPW4      10     YES24G   1174
14     ERG-47-092-800    47-092-800      10     47-092    223
15     ERG-80-063-200    80-063-200      10     80-063    300
16         ERG-90-011        90-011       6        NaN    310
17     E

6. This is step 2 of our web scraping process: using the previously gathered product ID numbers, we query the JSON file on the target site, which holds several values for each product. We extract two of them: "SSP", the suggested sale price for the Ergotron product (the "msrPrice" field in the response), and "part name", a short description of the product (the "partName" field).

In [6]:

item4 = pn3['erg_id']
item3 = pn3['Part Number']
headers = {'User-Agent': 'Mozilla/5.0'}
numbers = re.compile(r'\d+(?:\.\d+)?')
ssp = []
part_name = []
for part, num in zip(item4, item3):
    url = "https://partner.ergotron.com/DesktopModules/Ergotron/API/Client/GetProductDetails?culture=en-us&productId={}".format(part)
    delay()
    r = requests.get(url, headers=headers).json()
    s = str(r)
    # Reset so a failed lookup cannot reuse the previous product's text
    # when extracting the part name below.
    txt1 = None
    try:
        # Take the text after this part number, then the value after "msrPrice".
        txt1 = s.split("'partNumber': '{}'".format(num), 1)[1]
        txt2 = txt1.split("msrPrice", 1)[1]
        txt3 = txt2.split(',', 1)[0]
        val1 = numbers.findall(txt3)
        ssp.append(val1)
    except Exception:
        try:
            # Retry with a trailing space after the part number.
            txt1 = s.split("'partNumber': '{} '".format(num), 1)[1]
            txt2 = txt1.split("msrPrice", 1)[1]
            txt3 = txt2.split(',', 1)[0]
            val1 = numbers.findall(txt3)
            ssp.append(val1)
        except Exception:
            ssp.append("noprice")
    try:
        # The part name sits between "partName" and "msrPrice".
        txt4 = txt1.split("'partName': ", 1)[1]
        name1 = txt4.split(", 'msrPrice'", 1)[0]
        part_name.append(name1)
    except Exception:
        part_name.append("noname")
print(ssp)
print(part_name)

[['475.0'], ['114.0'], ['354.0'], ['409.0'], ['749.0'], ['17.0'], ['69.0'], ['93.0'], 'noprice', ['114.0'], ['2219.0'], ['1335.0'], ['56.0'], ['62.0'], ['29.0'], ['34.0'], ['432.0'], ['164.0'], ['899.0'], ['479.0'], ['465.0'], ['70.0'], ['465.0'], ['500.0'], ['143.0'], ['829.0'], ['359.0'], ['218.0'], ['629.0'], ['29.0'], ['749.0'], ['355.0'], ['41.0'], ['1038.0'], ['142.0'], ['945.0'], ['518.0'], ['93.0'], ['90.0'], ['829.0'], ['328.0'], ['100.0'], ['95.0'], ['62.0'], ['79.0'], ['93.0'], ['732.0'], ['831.0'], [], ['78.0'], ['577.0'], ['1645.0'], ['26.0'], ['226.0'], ['48.0'], ['56.0'], ['847.0'], ['251.0'], ['56.0'], ['4971.0'], ['110.0'], ['82.0'], ['214.0'], ['80.0'], ['176.0'], ['119.0'], 'noprice', 'noprice', ['170.0'], ['355.0'], 'noprice', ['318.0'], ['999.0'], ['380.0'], ['56.0'], ['4508.0'], ['593.0'], ['73.0'], ['141.0'], ['40.0'], ['559.0'], ['45.0'], ['31.0'], 'noprice', ['299.0'], ['27.0'], ['35.0'], ['66.0'], ['40.0'], ['48.0'], ['64.0'], [], ['56.0'], ['153.0'], ['24.0']
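
Since `.json()` already returns Python dicts and lists, the same lookup can be done on the parsed data instead of on its string form. The sketch below is one possible approach, not the method used in the cell above. It assumes only that some dict in the response carries the `partNumber`, `partName` and `msrPrice` keys seen in the string matching. The nesting of the sample payload, the helper names and all sample values are invented for illustration.

```python
from typing import Any, Iterator, Optional


def iter_dicts(node: Any) -> Iterator[dict]:
    """Yield every dict nested anywhere inside node (dicts and lists only)."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_dicts(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_dicts(item)


def find_part(payload: Any, part_number: str) -> Optional[dict]:
    """Return the first dict whose partNumber matches, ignoring surrounding spaces."""
    for candidate in iter_dicts(payload):
        value = candidate.get("partNumber")
        if isinstance(value, str) and value.strip() == part_number:
            return candidate
    return None


# Invented payload: the real response layout is not shown in this notebook.
sample = {"product": {"variants": [
    {"partNumber": "00-000-001 ", "partName": "Example arm", "msrPrice": 100.0},
    {"partNumber": "00-000-002", "partName": "Example cart", "msrPrice": 250.0},
]}}

hit = find_part(sample, "00-000-001")
print(hit["msrPrice"], hit["partName"])   # 100.0 Example arm
print(find_part(sample, "99-999-999"))    # None
```

Stripping whitespace before comparing covers the trailing-space case in one pass, and the price comes back as a number rather than a list of matched strings.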

7. Next let's create new columns in our main dataframe for these new values we've gathered.

In [7]:
# Work on an explicit copy so that adding columns does not raise
# pandas' SettingWithCopyWarning.
pn3 = pn3.copy()
pn3["ssp"] = ssp
pn3["Description"] = part_name
print (pn3)

         Product Name   Part Number  PN_len Part Short erg_id       ssp  \
0      ERG-33-397-085    33-397-085      10     33-397    143   [475.0]   
1      ERG-97-487-800    97-487-800      10     97-487    331   [114.0]   
2      ERG-45-478-026    45-478-026      10     45-478   1055   [354.0]   
3      ERG-24-313-026    24-313-026      10     24-313     48   [409.0]   
4      ERG-60-610-062    60-610-062      10     60-610    273   [749.0]   
5          ERG-98-017        98-017       6        NaN    536    [17.0]   
6          ERG-97-617        97-617       6        NaN    385    [69.0]   
7      ERG-47-058-200    47-058-200      10     47-058    222    [93.0]   
8      ERG-24-383-026    24-383-026      10     24-383     54   noprice   
10     ERG-97-487-800    97-487-800      10     97-487    331   [114.0]   
11    ERG-SV43-1120-0   SV43-1120-0      11  SV43-1120    886  [2219.0]   
12     ERG-YES24GMPW4    YES24GMPW4      10     YES24G   1174  [1335.0]   
14     ERG-47-092-800    

8. Next let's create two final dataframes, one for our successful product info updates and one for the products for which we could not find the information with this method. Then we export the dataframes to external CSV files.

In [8]:
disc = pn3[pn3.ssp.str.contains("noprice",na=False)]
fin = pn3[~pn3.ssp.str.contains("noprice",na=False)]
redo.to_csv(r'C:\Users\dcpst\Documents\Ergotron\erg_web_price_redo.csv', index = False)
fin.to_csv(r'C:\Users\dcpst\Documents\Ergotron\erg_web_price.csv', index = False)
disc.to_csv(r'C:\Users\dcpst\Documents\Ergotron\erg_web_price_disc.csv', index = False)
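
The `ssp` values collected earlier come in three shapes: a one-element list such as `['475.0']`, an empty list `[]`, and the string `'noprice'`. A minimal sketch of normalizing them to plain floats before export follows. The helper name `to_price` is hypothetical, and treating an empty list as "no price" is an assumption about what those entries mean.

```python
from typing import Any, Optional


def to_price(value: Any) -> Optional[float]:
    """Convert a scraped ssp entry to a float, or None when no price was found."""
    if isinstance(value, list) and value:
        return float(value[0])
    # Covers the "noprice" marker and empty lists.
    return None


# Shapes taken from the printed ssp list in step 6.
raw = [['475.0'], 'noprice', [], ['114.0']]

print([to_price(v) for v in raw])   # [475.0, None, None, 114.0]
```

With pandas this could be applied as `pn3['ssp'].apply(to_price)`, after which rows without a price can be selected with `.isna()` instead of string matching.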