## 1 -- WEB_SCRAPING

Web scraping is the extraction of data from a website. The collected data is then exported into a format that is more useful to the user, such as a spreadsheet or an API.
We use the Python programming language for web scraping and for every other task in this project.

### 1.1 -- Installing and Importing Dependencies

The Requests library handles HTTP calls in Python. It simplifies sending data in an HTTP request and reading the response that comes back, which makes it convenient for CRUD operations and other HTTP tasks such as data scraping.

In [1]:
!pip install requests

Defaulting to user installation because normal site-packages is not writeable


tqdm is a Python library for creating progress meters (progress bars).

In [2]:
!pip install tqdm

Defaulting to user installation because normal site-packages is not writeable


Pandas is an open-source library made mainly for working with relational or labeled data easily and intuitively. It provides data structures and operations for manipulating numerical data and time series. A pandas DataFrame can hold columns of different types under row and column labels, which makes it more versatile than a multidimensional NumPy array for tabular data.

In [3]:
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable


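A RegEx, or regular expression, is a sequence of characters that forms a search pattern. It can be used to check whether a string contains the specified pattern. Note that the code in this notebook imports Python's built-in re module, which needs no installation; the regex package installed in the next cell is a separate third-party library.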

In [4]:
!pip install regex

Defaulting to user installation because normal site-packages is not writeable
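
As a small illustration of the search-pattern idea, the sketch below uses Python's built-in re module to look for a size such as "150 ml" inside a product name. The product names and the pattern are made up for this example; the pattern is deliberately simpler than the one used later in this notebook.

```python
import re

# Hypothetical product names, used only to illustrate pattern matching
names = [
    "Onion Hair Oil 150 ml",
    "Vitamin C Face Wash 100ml",
    "Bamboo Toothbrush",
]

# One or more digits, optional whitespace, then a unit (case-insensitive)
pattern = re.compile(r"(?P<value>\d+)\s*(?P<unit>ml|g|kg)\b", flags=re.I)

sizes = []
for name in names:
    match = pattern.search(name)
    # search() returns None when the pattern does not occur in the string
    sizes.append(match.group("value") + match.group("unit") if match else None)

print(sizes)
```

Names without a recognizable size simply produce None, so they can be filtered out or inspected separately.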


Now we import the libraries installed above.

In [5]:
import requests
import json
import pandas as pd
from tqdm import tqdm
import re

### 1.2 -- Landing page product details

This code scrapes the first page of the website, where all the products are listed.

In [6]:
# Declaring Header and adding a user agent
headers = {'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
'Accept-Language': 'en-US, en;q=0.5'}

pro_name,price,url,sku,slug,Id=[],[],[],[],[],[]

#This for loop is to get the json data of a landing page
for page_num in tqdm(range(1,19)):
    json_page = requests.get('https://mmrth-nd-api.honasa-production.net/v1/products/shopAllProducts?pagenumber=' + str(page_num) + '&pagesize=20&categoryId=-1', headers=headers).json()
    #Again applying loop to extract particular information for above json data
    for i in json_page['response']['list']['entities']['products'].values():
        pro_name.append(i['name'])
        sku.append(i['sku'])
        price.append(i['price'])
        slug.append(i['slug'])
        Id.append(i['id'])
        url.append('https://mamaearth.in/product/'+i['slug'])

# Creating a dict object to store data
a_dict = {
    'ID':Id,
    'SKU':sku,
    'PRODUCT_NAME':pro_name,
    'PRICE':price,
    'SLUG':slug,
    'URL':url
}

# Creating DataFrame to make data easier to interpret
aa=pd.DataFrame(a_dict)

# Using regex to extract package size information
aa=aa.join(aa['PRODUCT_NAME'].str.extract(r'\b(?P<value>\d+|.\d+|\d+-\d+|\d+x\d+|\d+\*\d+|\d+\+\d+|\d+\.\d+)\s*(?P<unit>g|ml|kg|Litre)\b', flags=re.I))
aa['PACK_SIZE']=(aa['value'] + aa['unit']).str.strip('-').str.replace('x','*').str.strip(' ')
aa

100%|██████████████████████████████████████████████████████████████████████████████████| 18/18 [00:07<00:00,  2.33it/s]


Unnamed: 0,ID,SKU,PRODUCT_NAME,PRICE,SLUG,URL,value,unit,PACK_SIZE
0,639,8904417300048,ME White Musk Eau De Parfum For a Fragrance Cl...,699.00,me-white-musk-eau-de-parfum-for-a-fragrance-cl...,https://mamaearth.in/product/me-white-musk-eau...,50,ml,50ml
1,638,8904417300031,ME Floral Eau De Parfum - Live in the Moment -...,699.00,me-floral-eau-de-parfum-live-in-the-moment-50-ml,https://mamaearth.in/product/me-floral-eau-de-...,50,ml,50ml
2,636,8904417300024,ME Oud Eau De Parfum to Unleash Your Confidenc...,699.00,me-oud-eau-de-parfum-to-unleash-your-confidenc...,https://mamaearth.in/product/me-oud-eau-de-par...,50,ml,50ml
3,634,8904417300055,ME First Rain Eau De Parfum to Refresh Your Se...,699.00,first-rain-eau-de-parfum-to-refresh-your-sense...,https://mamaearth.in/product/first-rain-eau-de...,50,ml,50ml
4,626,8904417301540,Lash Care Volumizing Mascara with Castor Oil &...,499.00,mamaearth-lash-care-volumizing-mascara-with-ca...,https://mamaearth.in/product/mamaearth-lash-ca...,13,g,13g
...,...,...,...,...,...,...,...,...,...
355,253,8906087773306,Plant-Based Diaper Pants for Babies – 7-12 kg ...,699.00,plant-based-diaper-pants-for-babies-7-12kg,https://mamaearth.in/product/plant-based-diape...,7-12,kg,7-12kg
356,252,8906087773290,Plant-Based Diaper Pants for Babies – 4-6 kg (...,699.00,plant-based-diaper-pants-for-babies-4-6kg,https://mamaearth.in/product/plant-based-diape...,4-6,kg,4-6kg
357,251,8906087773269,Plant-Based Diaper Pants for Babies – 3-5 kg (...,699.00,plant-based-diaper-pants-for-babies-3-5kg,https://mamaearth.in/product/plant-based-diape...,3-5,kg,3-5kg
358,243,8906087773443,Onion Scalp Serum with Onion and Niacinamide f...,599.00,onion-scalp-serum-with-onion-niacinamide-for-h...,https://mamaearth.in/product/onion-scalp-serum...,50,ml,50ml
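
The two steps of the landing-page loop (build a paginated URL, then pull fields out of the nested products mapping) can be sketched without touching the network. In the sketch below, build_page_url and flatten_products are hypothetical helper names, and sample_page is a hand-written stand-in for the API response; only its nesting mirrors the keys used in the code above.

```python
from urllib.parse import urlencode

BASE_URL = "https://mmrth-nd-api.honasa-production.net/v1/products/shopAllProducts"


def build_page_url(page_num, page_size=20, category_id=-1):
    """Build the URL for one page, using the same query parameters as the loop above."""
    query = urlencode(
        {"pagenumber": page_num, "pagesize": page_size, "categoryId": category_id}
    )
    return f"{BASE_URL}?{query}"


def flatten_products(json_page):
    """Turn the nested 'products' mapping of one page into a list of flat rows."""
    products = json_page["response"]["list"]["entities"]["products"]
    return [
        {
            "ID": p["id"],
            "SKU": p["sku"],
            "PRODUCT_NAME": p["name"],
            "PRICE": p["price"],
            "SLUG": p["slug"],
            "URL": "https://mamaearth.in/product/" + p["slug"],
        }
        for p in products.values()
    ]


# Made-up payload: the values are placeholders, not real API data
sample_page = {
    "response": {"list": {"entities": {"products": {
        "1": {"id": 1, "sku": "0000000000001", "name": "Sample Product A",
              "price": "100.00", "slug": "sample-product-a"},
        "2": {"id": 2, "sku": "0000000000002", "name": "Sample Product B",
              "price": "200.00", "slug": "sample-product-b"},
    }}}}
}

rows = flatten_products(sample_page)
print(build_page_url(1))
print(rows[0]["URL"])
```

Collecting one dict per product, rather than six parallel lists, keeps each product's fields together and can be passed directly to pd.DataFrame.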


### 1.3 -- Individual product page details

This code will scrape the product page information of each product individually.

In [7]:
# Map each output column to the key it is read from in the product JSON
fields = {
    'ID': 'id',
    'ALGOLIA_OBJECT_ID': 'algoliaObjectID',
    'CATEGORIES': 'categories',
    'CONFIGURABLE_OPTION': 'configurableOption',
    'DATE_OF_CREATION': 'createdAt',
    'LAST_UPDATED_DATE': 'updatedAt',
    'IS_IN_STOCK': 'is_in_stock',
    'IS_SALEABLE': 'is_saleable',
    'STATUS': 'status',
    'TYPE': 'type',
    'VISIBILITY': 'visibility',
    'PARENT': 'parent',
    'SIBLINGS': 'siblings'
}

b_dict = []
# Using for loop to get the json data of every individual product page
for j in tqdm(slug):
    product_page = requests.get(f'https://mmrth-nd-api.honasa-production.net/v1/products/info/{j}', headers=headers).json()

    # Store one record per product (assumes the response is a JSON object);
    # keys missing from the response become 'Not found'
    b_dict.append({column: product_page.get(key, 'Not found') for column, key in fields.items()})

# Creating DataFrame to make data easier to interpret
bb=pd.DataFrame(b_dict)
bb

100%|████████████████████████████████████████████████████████████████████████████████| 360/360 [00:57<00:00,  6.25it/s]


Unnamed: 0,ID,ALGOLIA_OBJECT_ID,CATEGORIES,CONFIGURABLE_OPTION,DATE_OF_CREATION,LAST_UPDATED_DATE,IS_IN_STOCK,IS_SALEABLE,STATUS,TYPE,VISIBILITY,PARENT,SIBLINGS
0,639,1990186000,"[2, 21, 45, 197]","{'attributeId': 222, 'optionValue': '94', 'att...",2022-08-29 09:01:43,2022-09-02 10:20:55,1,True,1,simple,4,"{'id': '640', 'sku': '22222', 'name': 'Eau De ...","[{'id': '638', 'sku': '8904417300031', 'name':..."
1,638,1990184000,"[2, 21, 45, 197]","{'attributeId': 222, 'optionValue': '93', 'att...",2022-08-29 09:01:26,2022-09-02 10:29:17,1,True,1,simple,4,"{'id': '640', 'sku': '22222', 'name': 'Eau De ...","[{'id': '639', 'sku': '8904417300048', 'name':..."
2,636,1661771144595,"[2, 21, 45, 197]","{'attributeId': 221, 'optionValue': '91', 'att...",2022-08-29 08:49:25,2022-09-02 10:22:56,1,True,1,simple,4,"{'id': '637', 'sku': '11111', 'name': 'Eau De ...","[{'id': '635', 'sku': '8904417300017', 'name':..."
3,634,1575346001,"[2, 45, 197, 21]",Not found,2022-08-29 08:07:39,2022-09-02 10:27:41,1,True,1,simple,4,Not found,Not found
4,626,1953270000,"[2, 21, 195]",Not found,2022-08-22 07:32:08,2022-08-31 06:21:57,1,True,1,simple,4,Not found,Not found
...,...,...,...,...,...,...,...,...,...,...,...,...,...
355,253,2011575002,"[6, 8, 10, 92, 5]",Not found,2021-03-25 13:34:44,2022-09-02 14:39:27,0,True,1,simple,4,Not found,Not found
356,252,9237760000,"[6, 8, 10, 92, 5]",Not found,2021-03-25 13:34:19,2022-09-02 14:39:36,0,True,1,simple,4,Not found,Not found
357,251,8599527000,"[6, 8, 10, 92, 5]",Not found,2021-03-25 13:33:55,2022-09-02 14:39:44,0,True,1,simple,4,Not found,Not found
358,243,8599407000,"[27, 76, 21, 5, 62, 23, 30, 70, 72, 31, 178, 1...",Not found,2021-03-25 13:33:47,2022-09-05 02:48:58,0,True,1,simple,4,Not found,Not found
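
In the output above, PARENT holds either a dict or the placeholder 'Not found', and SIBLINGS holds either a list of dicts or the same placeholder. The sketch below shows one possible way to read IDs out of such mixed columns; parent_id and sibling_ids are hypothetical helper names, and the two sample records copy only the values visible in the table above.

```python
def parent_id(record):
    """Return the parent's id, or None when PARENT is the 'Not found' placeholder."""
    parent = record.get("PARENT")
    return parent.get("id") if isinstance(parent, dict) else None


def sibling_ids(record):
    """Return the ids of all siblings, or an empty list when there are none."""
    siblings = record.get("SIBLINGS")
    if not isinstance(siblings, list):
        return []
    return [s["id"] for s in siblings if isinstance(s, dict) and "id" in s]


# Sample records shaped like rows of the DataFrame above (fields trimmed)
records = [
    {"ID": 639,
     "PARENT": {"id": "640", "sku": "22222"},
     "SIBLINGS": [{"id": "638", "sku": "8904417300031"}]},
    {"ID": 634, "PARENT": "Not found", "SIBLINGS": "Not found"},
]

for record in records:
    print(record["ID"], parent_id(record), sibling_ids(record))
```

Checking the type before indexing avoids errors on rows where the placeholder string was stored instead of a nested structure.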


### 1.4 -- Product Category and custom attributes

This code scrapes the product category of each product, which is one of the most important elements we need.

In [8]:
Category = []
Id = []
customattributes = []

# Using for loop to get the json data of every individual product page
for k in tqdm(slug):
    product_page = requests.get(f'https://mmrth-nd-api.honasa-production.net/v1/products/info/{k}', headers=headers).json()
    product_id = product_page['id']
    attributes = product_page['customattributes']

    # Normalizing semi-structured data into a flat table
    ik = pd.json_normalize(attributes)

    # Loop over the table and keep the product only when it has a 'custom_product_type' attribute
    for l in range(len(ik)):
        if ik['attribute_code'][l] == 'custom_product_type':
            customattributes.append(attributes)
            Category.append(ik['value'][l])
            Id.append(product_id)

# Creating a dict object to store data
c_dict = {
    'ID':Id,
    'PRODUCT_CATEGORY':Category,
    'CUSTOM_ATTRIBUTES':customattributes,
}

# Creating DataFrame to make data easier to interpret
cc = pd.DataFrame(c_dict)
cc

100%|████████████████████████████████████████████████████████████████████████████████| 360/360 [00:59<00:00,  6.02it/s]


Unnamed: 0,ID,PRODUCT_CATEGORY,CUSTOM_ATTRIBUTES
0,623,be_skin,"[{'attribute_code': 'image', 'value': '/v/i/vi..."
1,620,be_skin,"[{'attribute_code': 'image', 'value': '/f/w/fw..."
2,618,be_skin,"[{'attribute_code': 'image', 'value': '/g/r/gr..."
3,616,be_skin,"[{'attribute_code': 'image', 'value': '/g/r/gr..."
4,615,be_skin,"[{'attribute_code': 'image', 'value': '/g/r/gr..."
...,...,...,...
323,253,ba_body,"[{'attribute_code': 'image', 'value': '/d/i/di..."
324,252,ba_body,"[{'attribute_code': 'image', 'value': '/d/i/di..."
325,251,ba_body,"[{'attribute_code': 'image', 'value': '/d/i/di..."
326,243,be_hair,"[{'attribute_code': 'image', 'value': '/o/n/on..."


### 1.5 -- Individual product reviews

Here we scrape the reviews for every product.

In [9]:
d_dict=[]
# Using for loop to get the json data of every individual product review section
for m in tqdm(Id):
    review_page = requests.get(f'https://mmrth-nd-api.honasa-production.net/v1/products/{m}/reviews', headers=headers).json()
    rev_count = int(review_page['count'])
    
    # Another for loop to extract the particular information from above json data
    for n in review_page['reviews']:
        created_at = n['created_at']
        nickname = n['nickname']
        # Each rating defaults to 0 when it is missing from the response
        try:
            price_rating = int(n['rating_votes'][0]['ratingValue'])
        except (KeyError, IndexError, TypeError, ValueError):
            price_rating = 0
        try:
            quality_rating = int(n['rating_votes'][1]['ratingValue'])
        except (KeyError, IndexError, TypeError, ValueError):
            quality_rating = 0
        try:
            value_rating = int(n['rating_votes'][2]['ratingValue'])
        except (KeyError, IndexError, TypeError, ValueError):
            value_rating = 0
        detail = n['detail']
        
        # Creating a dict object to store data
        d_dict.append({
            'ID':m,
            'REVIEW_COUNT':rev_count,
            'REVIEW_DATE/TIME':created_at,
            'REVIEWER_NAME':nickname,
            'PRICE_RATING':price_rating, 
            'QUALITY_RATING':quality_rating,
            'VALUE_RATING':value_rating,
            'REVIEW_CONTENT':detail
        })
    
# Creating DataFrame to make data easier to interpret
dd=pd.DataFrame(d_dict)
dd

100%|████████████████████████████████████████████████████████████████████████████████| 328/328 [00:56<00:00,  5.83it/s]


Unnamed: 0,ID,REVIEW_COUNT,REVIEW_DATE/TIME,REVIEWER_NAME,PRICE_RATING,QUALITY_RATING,VALUE_RATING,REVIEW_CONTENT
0,623,1,2022-08-29 16:38:37,Swapna Kulkarni,5,0,0,Mamaearth always wins my heart with new surpri...
1,620,47,2022-09-02 16:34:36,Zubia,5,0,0,"I've had acne my entire life, and this appears..."
2,620,47,2022-09-02 16:34:26,Shalini,5,0,0,"Great cleanser, gentle and makes my face fresh..."
3,620,47,2022-09-02 16:34:15,Hinakshi,5,0,0,I use Mamaearth green tea range and the result...
4,620,47,2022-09-02 16:34:01,Priya,4,0,0,"I have sensitive skin, and I did not experienc..."
...,...,...,...,...,...,...,...,...
28228,219,20,2020-11-23 10:38:25,Pooja,5,0,0,I use it as baby bottle cleanser and then wash...
28229,219,20,2020-11-23 10:38:01,Mamta,5,0,0,Best cleanser for baby toys ever! It's natural...
28230,219,20,2020-11-23 10:37:41,Sushma,5,0,0,I bought this baby liquid cleanser for toys an...
28231,219,20,2020-10-30 17:58:11,Niketa,5,0,0,I was recommended by one of my friend ..to use...


### 1.6 -- Merging all the DataFrames

In [10]:
# Merging all the DataFrames on ID to obtain a single DataFrame containing all the raw data scraped so far
df_1 = pd.merge(pd.merge(pd.merge(aa,bb, on='ID', how='left'),cc, on='ID', how='left'),dd,on='ID', how='left')
df_1

Unnamed: 0,ID,SKU,PRODUCT_NAME,PRICE,SLUG,URL,value,unit,PACK_SIZE,ALGOLIA_OBJECT_ID,...,SIBLINGS,PRODUCT_CATEGORY,CUSTOM_ATTRIBUTES,REVIEW_COUNT,REVIEW_DATE/TIME,REVIEWER_NAME,PRICE_RATING,QUALITY_RATING,VALUE_RATING,REVIEW_CONTENT
0,639,8904417300048,ME White Musk Eau De Parfum For a Fragrance Cl...,699.00,me-white-musk-eau-de-parfum-for-a-fragrance-cl...,https://mamaearth.in/product/me-white-musk-eau...,50,ml,50ml,1990186000,...,"[{'id': '638', 'sku': '8904417300031', 'name':...",,,,,,,,,
1,638,8904417300031,ME Floral Eau De Parfum - Live in the Moment -...,699.00,me-floral-eau-de-parfum-live-in-the-moment-50-ml,https://mamaearth.in/product/me-floral-eau-de-...,50,ml,50ml,1990184000,...,"[{'id': '639', 'sku': '8904417300048', 'name':...",,,,,,,,,
2,636,8904417300024,ME Oud Eau De Parfum to Unleash Your Confidenc...,699.00,me-oud-eau-de-parfum-to-unleash-your-confidenc...,https://mamaearth.in/product/me-oud-eau-de-par...,50,ml,50ml,1661771144595,...,"[{'id': '635', 'sku': '8904417300017', 'name':...",,,,,,,,,
3,634,8904417300055,ME First Rain Eau De Parfum to Refresh Your Se...,699.00,first-rain-eau-de-parfum-to-refresh-your-sense...,https://mamaearth.in/product/first-rain-eau-de...,50,ml,50ml,1575346001,...,Not found,,,,,,,,,
4,626,8904417301540,Lash Care Volumizing Mascara with Castor Oil &...,499.00,mamaearth-lash-care-volumizing-mascara-with-ca...,https://mamaearth.in/product/mamaearth-lash-ca...,13,g,13g,1953270000,...,Not found,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28263,219,8906087772927,Plant-Based Multipurpose Cleanser for Babies -...,499.00,plant-based-multipurpose-cleanser-for-babies-5...,https://mamaearth.in/product/plant-based-multi...,500,ml,500ml,8599427000,...,Not found,ba_body,"[{'attribute_code': 'image', 'value': '/p/l/pl...",20.0,2020-11-23 10:38:25,Pooja,5.0,0.0,0.0,I use it as baby bottle cleanser and then wash...
28264,219,8906087772927,Plant-Based Multipurpose Cleanser for Babies -...,499.00,plant-based-multipurpose-cleanser-for-babies-5...,https://mamaearth.in/product/plant-based-multi...,500,ml,500ml,8599427000,...,Not found,ba_body,"[{'attribute_code': 'image', 'value': '/p/l/pl...",20.0,2020-11-23 10:38:01,Mamta,5.0,0.0,0.0,Best cleanser for baby toys ever! It's natural...
28265,219,8906087772927,Plant-Based Multipurpose Cleanser for Babies -...,499.00,plant-based-multipurpose-cleanser-for-babies-5...,https://mamaearth.in/product/plant-based-multi...,500,ml,500ml,8599427000,...,Not found,ba_body,"[{'attribute_code': 'image', 'value': '/p/l/pl...",20.0,2020-11-23 10:37:41,Sushma,5.0,0.0,0.0,I bought this baby liquid cleanser for toys an...
28266,219,8906087772927,Plant-Based Multipurpose Cleanser for Babies -...,499.00,plant-based-multipurpose-cleanser-for-babies-5...,https://mamaearth.in/product/plant-based-multi...,500,ml,500ml,8599427000,...,Not found,ba_body,"[{'attribute_code': 'image', 'value': '/p/l/pl...",20.0,2020-10-30 17:58:11,Niketa,5.0,0.0,0.0,I was recommended by one of my friend ..to use...
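
The merged table has one row per review rather than one row per product, and products without a match keep NaN in the right-hand columns. The toy example below, with made-up data, shows how a left merge on ID produces both effects.

```python
import pandas as pd

# Made-up product and review tables sharing an ID column
products = pd.DataFrame({"ID": [1, 2], "PRODUCT_NAME": ["Sample A", "Sample B"]})
reviews = pd.DataFrame({"ID": [1, 1], "REVIEWER_NAME": ["Reviewer X", "Reviewer Y"]})

# how='left' keeps every product; a product with several reviews is repeated,
# and a product with no reviews gets NaN in the review columns
merged = pd.merge(products, reviews, on="ID", how="left")
print(merged)
```

Product 1 appears twice (once per review) and product 2 appears once with a missing reviewer name, which matches the pattern seen in the first and last rows of the output above.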


In [11]:
# In case you want separate data file
df_1.to_csv('01_Scraping_Unstructured_Data.csv',index=False)

This completes the web scraping code. We have collected all the raw data, which should be very helpful in our further projects.