#**Case Study: Watches⌚ in Online E-commerce Platform Shopee (PART 1)**
A total of 173 watches were extracted from Shopee on 15 November 2021. The main objective of this study is to identify the best-selling watch on Shopee by number of sales. The later part of the final project will identify the best-selling watches on Shopee by brand, style, rating and price range, supported by the seller's response rate and the watch's functions and warranty period.
---
**Author:** Tang Jia Hui<br>
**Matric Number:** A176297<br>
**In charge of:** Watch Name, Style and Brand<br>
**File Submission:**
A176297 - Group 6 - Individual Assignment 1.ipynb and a176297_watch_style_brand_name_2021.csv

---
Note: There are 4 CSV files in the A176297_TC3213_IA1_CSV.zip file submitted. 
1. **watch_product_link.csv** stores the links of all 173 watches. It is produced in Part 1 and read in Part 2 of this project.
2. **backup_fullcsv.csv** stores the uncleaned information scraped from Shopee for all 173 watches. It is produced in Parts 2-4 and read in Part 5 of this project.
3. **a176297_watch_style_brand_name.csv** stores the data for the 173 scraped watches, without converting the categorical data to quantitative data.
4. **a176297_watch_style_brand_name_2021.csv** is the final CSV file, which stores the cleaned data for the 173 scraped watches.

---
To run the code, upload the **watch_product_link.csv** file and start running from Part 2 of this project. By the end of this project, three additional CSV files will have been generated: **backup_fullcsv.csv**, **a176297_watch_style_brand_name.csv** and **a176297_watch_style_brand_name_2021.csv**.


##**Step 1: Import Library**
Import the necessary libraries to fulfill the work of this project

In [None]:
!pip install urllib3
!pip install folium
!pip install albumentations
!pip install -q gwpy
# %%capture 



In [None]:
%%capture
# Install and update library (the %%capture cell magic must be the first line of the cell)
!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver

from selenium import webdriver
from selenium.webdriver.common.by import By
# !cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
import pandas as pd


#**PART 1: SELENIUM EXTRACT WATCH LINKS**
Selenium, an open-source umbrella project consisting of a range of tools and libraries for web browser automation, is used to scrape the product (watch) links. Collecting the links first simplifies scraping the watch information in later parts of this project.

##**Step 1: Extract Anchor Link of Products**
A function named get_product_links is defined to extract the link of each watch on Shopee. The function is then called to start the extraction, and the extracted links are stored in a list named product_link.

In [None]:
%%capture
# define function to extract product links
def get_product_links(page_num, product_link_lst):
    # Find link that access product page
    url = "https://shopee.com.my/mall/Watches-cat.11001724/popular?pageNumber=" + str(page_num)
    #Define Chrome WebDriver
    sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
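    # chromedriver is looked up on PATH; recent Selenium releases no longer accept the path as a positional argument
    wd = webdriver.Chrome(options=chrome_options)
    # get url
    wd.get(url)
    # wait up to 10 seconds for elements to appear when they are looked up
    wd.implicitly_wait(10)
    # the find_element_by_* helpers were removed in Selenium 4.3; use the By locators instead
    container = wd.find_element(By.XPATH, '/html/body/div[1]/div[1]/div[2]/div[2]').find_elements(By.TAG_NAME, 'a')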
    for link in container:
        product_link_lst.append(link.get_attribute('href'))
    wd.quit()

# Call function get_product_links for the first three result pages
product_link = list()
for page_num in range(1, 4):
    get_product_links(page_num, product_link)


In [None]:
# filter list to ensure there is no repeated watch links
product_link = list(set(product_link))
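
Note that `set` does not preserve the order in which the links were scraped, so the row order of the saved CSV can differ between runs. A minimal sketch of an order-preserving alternative, assuming the scraping order matters; `dedupe_keep_order` and the sample URLs are hypothetical and used only for illustration:

```python
def dedupe_keep_order(links):
    """Drop repeated links while keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(links))


sample_links = [
    "https://example.com/watch-a",
    "https://example.com/watch-b",
    "https://example.com/watch-a",  # duplicate of the first entry
]
unique_links = dedupe_keep_order(sample_links)
print(unique_links)
```

Either approach yields the same set of links; only the ordering guarantee differs.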

##**Step 2: Save product links in a CSV file**
The product links are saved to a CSV file to work with the extracted links in the future.

In [None]:
# Save the list of product links
link_dict = {'product_link': product_link}
df_link = pd.DataFrame(link_dict)
df_link.to_csv('watch_product_link.csv')

In [None]:
len(df_link) # total number of products

173

#**PART 2: EXTRACT WATCH STYLE WITH SELENIUM**
The watch styles are also extracted with Selenium. Before extraction begins, the CSV file saved in Part 1 is read and the links are restored to a list called product_link.

In [None]:
import pandas as pd
productlink = pd.read_csv('/content/watch_product_link.csv', index_col=['Unnamed: 0']).values.tolist()
product_link = [link[0] for link in productlink]

In [None]:
product_link

['https://shopee.com.my/OLEVS-Jam-Tangan-Lelaki-Original-Watch-Men-Waterproof-Stainless-Steel-Quartz-Luminous-Authentic-Dragon-Design-Gold-Watch-For-Men-i.56027601.8111224520',
 'https://shopee.com.my/NAVIFORCE-2021-Bussiness-Watch-Men-Sport-Quartz-Watches-Luxury-Brand-Leather-Waterproof-LED-Digital-Wristwatch-i.50942824.6602794197',
 'https://shopee.com.my/Casio-General-W-218H-3A-Black-Resin-Band-Men-Youth-Watch-i.33853391.8945228241',
 "https://shopee.com.my/SKMEI-new-men's-outdoor-waterproof-sports-large-dial-electronic-watch-multi-function-dual-display-luminous-electronic-watch-i.83591286.10709989665",
 "https://shopee.com.my/Kids-Cute-Girl-Cat-Children's-Kids-Watch-Gel-Digital-i.83591286.2464347922",
 'https://shopee.com.my/Carlo-Rino-Jade-of-The-Orient-Dark-Green-i.126606645.5468599987',
 'https://shopee.com.my/SKMEI-Smart-Watch-Fashion-Full-Touch-Screen-Mens-Sport-Fitness-Watches-IP68-Waterproof-Bluetooth-Luxury-Connection-For-Android-ios-SmartWatch-for-men-i.83591286.7686939805
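
Each link in the output ends with `-i.` followed by two numeric fields. Assuming these are shop and item identifiers (an assumption based only on the URL pattern above, not something this notebook verifies), a small sketch can pull them out, for example to check for duplicates independently of the product title. `parse_ids` is a hypothetical helper:

```python
import re

# Matches the trailing "-i.<digits>.<digits>" part of a product link.
ID_PATTERN = re.compile(r"-i\.(\d+)\.(\d+)$")


def parse_ids(link):
    """Return the two numeric fields as strings, or None if the pattern is absent."""
    match = ID_PATTERN.search(link)
    return match.groups() if match else None


link = "https://shopee.com.my/Casio-General-W-218H-3A-Black-Resin-Band-Men-Youth-Watch-i.33853391.8945228241"
print(parse_ids(link))
```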

##**Step 1: Define Function to Extract Watch Style**
The function get_style extracts the style of each watch.

In [None]:
def get_style(links, watch_style_lst):
    # webdriver and By are imported in the Import Library step
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    # chromedriver is looked up on PATH
    wd = webdriver.Chrome(options=options)
    wd.get(links)
    wd.implicitly_wait(10)
    watch_styles_sub = wd.find_elements(By.CLASS_NAME, '_3QRNmL')
    # the third matching element holds the watch style
    watch_style_lst.append(watch_styles_sub[2].get_attribute('innerHTML'))
    wd.quit()

##**Step 2: Extraction of all watches' style**
The function get_style is called for every link, and the styles collected in watch_style_lst are listed. Before the extracted styles are saved to a dataframe called watch_df, the unique styles are stored in a variable named unique_styles and displayed.

In [None]:
watch_style_lst = []
for i in product_link:
    get_style(i, watch_style_lst)
print('Scraping Watch Style -- Done!')

Scraping Watch Style -- Done!
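
In the loop above, a page that lacks the expected element raises an exception and stops the whole run. A hedged sketch of one way to keep going while keeping the result list aligned with `product_link` is to record `None` for failures. `collect_aligned` and `fake_extract` are hypothetical; `fake_extract` merely stands in for a real scraping call:

```python
def collect_aligned(links, extract):
    """Call extract(link) for every link; store None when extraction fails."""
    results = []
    for link in links:
        try:
            results.append(extract(link))
        except Exception:  # e.g. IndexError when the element is missing
            results.append(None)
    return results


def fake_extract(link):
    # Stand-in for a scraper: fails on links that contain "broken".
    if "broken" in link:
        raise IndexError("expected element not found")
    return "style for " + link


styles = collect_aligned(["a", "broken-b", "c"], fake_extract)
print(styles)
```

Because every link produces exactly one entry, the style, brand and name lists would stay the same length and could still be combined row by row.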


In [None]:
len(watch_style_lst)

173

In [None]:
watch_style_lst

['Business &amp; Casual',
 'Business &amp; Casual',
 'Business &amp; Casual',
 'Business &amp; Casual',
 "Women's Sports",
 "Women's Casual",
 'Business &amp; Casual',
 'Business &amp; Casual',
 'Straps &amp; Clasps',
 'Business &amp; Casual',
 'Business &amp; Casual',
 'Business &amp; Casual',
 'Business Set',
 'Business &amp; Casual',
 'Business &amp; Casual',
 'Straps &amp; Clasps',
 'Business &amp; Casual',
 "Women's Business",
 'Business &amp; Casual',
 'Others',
 'Business &amp; Casual',
 "Women's Sports",
 "Women's Sports",
 'Others',
 "Women's Sports",
 'Business &amp; Casual',
 "Women's Sports",
 'Business Set',
 'Business &amp; Casual',
 "Women's Casual",
 "Women's Business",
 'Business &amp; Casual',
 'Business &amp; Casual',
 'Business &amp; Casual',
 'Business &amp; Casual',
 'Business &amp; Casual',
 "Women's Casual",
 'Display &amp; Storage',
 "Women's Sports",
 "Women's Sports",
 'Business &amp; Casual',
 "Women's Sports",
 "Women's Sports",
 'Business &amp; Casual',
 "

In [None]:
unique_styles = set(watch_style_lst)
unique_styles

{'(GWP) CATEPILLAR Limited Edition Sport Towel (Not For Sale)',
 'Business &amp; Casual',
 'Business Set',
 'Casual Set',
 'Display &amp; Storage',
 'Others',
 'Straps &amp; Clasps',
 "Women's Business",
 "Women's Casual",
 "Women's Sports"}

In [None]:
watch_df = pd.DataFrame(watch_style_lst, columns=['Watch_Style'])
watch_df

Unnamed: 0,Watch_Style
0,Business &amp; Casual
1,Business &amp; Casual
2,Business &amp; Casual
3,Business &amp; Casual
4,Women's Sports
...,...
168,Business &amp; Casual
169,Business &amp; Casual
170,Women's Casual
171,Women's Casual


#**PART 3: EXTRACT WATCH BRAND WITH SELENIUM**
The watch brands are extracted with Selenium and the product_link list, in the same way as in Part 2.

##**Step 1: Define function to Extract Watch Brand**
The function get_brand extracts the brand of each watch.

In [None]:
def get_brand(links, watch_brand_lst):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    # chromedriver is looked up on PATH
    wd = webdriver.Chrome(options=options)
    wd.get(links)
    wd.implicitly_wait(10)
    brand_elements = wd.find_elements(By.CLASS_NAME, '_3uf2ae')
    # the first matching element holds the value used here as the brand
    watch_brand_lst.append(brand_elements[0].get_attribute('innerHTML'))
    wd.quit()

##**Step 2: Extraction of all watches' brand**
The function get_brand is called for every link, and the brands collected in watch_brand_lst are listed. The unique brands are stored in a variable named unique_brands and displayed before the extracted brands are added to the watch_df dataframe.

In [None]:
watch_brand_lst = []
for i in product_link:
    get_brand(i, watch_brand_lst)
print('Scraping Watch Brand -- Done!')

Scraping Watch Brand -- Done!


In [None]:
len(watch_brand_lst)

173

In [None]:
watch_brand_lst

['olevs.os',
 'naviforce.os',
 'casio.os',
 'skmei.os',
 'skmei.os',
 'carlorino.os',
 'skmei.os',
 'sanda.os',
 'uniq_my',
 'skmei.os',
 'curren.os',
 'naviforce.os',
 'bostantenwatch.os',
 'casio.os',
 'sanda.os',
 'bostanten.os',
 'bostantenwatch.os',
 'curren.os',
 'bostantenwatch.os',
 'uniq_my',
 'skmei.os',
 'skmei.os',
 'skmei.os',
 'skmei.os',
 'sanda.os',
 'skmei.os',
 'skmei.os',
 'skmei.os',
 'wishdoitwatch.wh.os',
 'naviforce.os',
 'olevs.os',
 'casio.os',
 'sanda.os',
 'skmei.os',
 'bostanten_watchlocal.os',
 'naviforce.os',
 'qqmalaysia.os',
 'skmei.os',
 'sanda.os',
 'qqmalaysia.os',
 'casio.os',
 'qqmalaysia.os',
 'skmei.os',
 'olevs.os',
 'sanda.os',
 'sanda.os',
 'casio.os',
 'qqmalaysia.os',
 'qqmalaysia.os',
 'olevs.os',
 'skmei.os',
 'curren.os',
 'skmei.os',
 'skmei.os',
 'skmei.os',
 'julius.os',
 'naviforce.os',
 'milliotandco.os',
 'sanda.os',
 'qqmalaysia.os',
 'wishdoitwatch.wh.os',
 'sanda.os',
 'skmei.os',
 'skmei.os',
 'bostanten_watchlocal.os',
 'olevs.o

In [None]:
unique_brands = set(watch_brand_lst)
unique_brands

{'adshops.os',
 'amazfit.os',
 'billion.os',
 'bostanten.os',
 'bostanten_watchlocal.os',
 'bostantenwatch.os',
 'carlorino.os',
 'casio.os',
 'catwatch.os',
 'curren.os',
 'danielwellington.os',
 'fossil.os',
 'herjewellery123',
 'icewatch.os',
 'julius.os',
 'ligewatch.os',
 'milliotandco.os',
 'naviforce.os',
 'olevs.os',
 'qqmalaysia.os',
 'sanda.os',
 'skmei.os',
 'submarinewatch.os',
 'uniq_my',
 'voarch.os',
 'wishdoitwatch.wh.os',
 'wsdwatch.os'}

In [None]:
# Add the brands as a new column of the dataframe created for the watch styles
watch_df['Watch_Brand'] = watch_brand_lst
watch_df

Unnamed: 0,Watch_Style,Watch_Brand
0,Business &amp; Casual,olevs.os
1,Business &amp; Casual,naviforce.os
2,Business &amp; Casual,casio.os
3,Business &amp; Casual,skmei.os
4,Women's Sports,skmei.os
...,...,...
168,Business &amp; Casual,wsdwatch.os
169,Business &amp; Casual,adshops.os
170,Women's Casual,qqmalaysia.os
171,Women's Casual,sanda.os


#**PART 4: EXTRACT WATCH NAME WITH SELENIUM**
The watch names are extracted with Selenium and the product_link list, in the same way as in Parts 2 and 3.

##**Step 1: Define Function to Extract Watch Name**
The function get_name extracts the name of each watch.

In [None]:
def get_name(links, watch_name_lst):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    # chromedriver is looked up on PATH
    wd = webdriver.Chrome(options=options)
    wd.get(links)
    wd.implicitly_wait(10)
    name_elements = wd.find_elements(By.XPATH, '/html/body/div[1]/div/div[2]/div[2]/div/div[2]/div[3]/div/div[1]/span')
    # the first matching element holds the product name
    watch_name_lst.append(name_elements[0].get_attribute('innerHTML'))
    wd.quit()

##**Step 2: Extraction of all watches' name**
The function get_name is called for every link, and the names collected in watch_name_lst are listed. The extracted names are then added to the watch_df dataframe.

In [None]:
watch_name_lst = []
for i in product_link:
    get_name(i, watch_name_lst)
print('Scraping Watch Name -- Done!')

Scraping Watch Name -- Done!


In [None]:
len(watch_name_lst)

173

In [None]:
watch_name_lst

['OLEVS Jam Tangan Lelaki Original Watch Men Waterproof Stainless Steel Quartz Luminous Authentic Dragon Design Gold Watch For Men',
 'NAVIFORCE 2021 Bussiness Watch Men Sport Quartz Watches Luxury Brand Leather Waterproof LED Digital Wristwatch',
 'Casio General W-218H-3A Black Resin Band Men Youth Watch',
 "SKMEI new men's outdoor waterproof sports large dial electronic watch multi-function dual display luminous electronic watch",
 "Kids Cute  Girl Cat Children's Kids Watch Gel Digital",
 'Carlo Rino Jade of The Orient - Dark Green',
 'SKMEI Smart Watch Fashion Full Touch Screen Mens Sport Fitness Watches IP68 Waterproof Bluetooth Luxury Connection SmartWatch',
 'Sanda Military Waterproof Quartz LED  Multi-function Luxury Fashion Men Watch',
 'Uniq Aspen Strap for Apple Watch  - Grey Blue Green Series 1/2/3/4/5/6/SE/7 (38/40/41/42/44/45mm)',
 "SKMEI  Fashion watch Men's Waterproof Sport Quartz Stainless Steel Steel Watch Wristwatches",
 "CURREN Men's g shock Chronograph Waterproof Wa

In [None]:
# Add the names as a new column of the dataframe created for the watch styles
watch_df['Watch_Name'] = watch_name_lst
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,Business &amp; Casual,olevs.os,OLEVS Jam Tangan Lelaki Original Watch Men Wat...
1,Business &amp; Casual,naviforce.os,NAVIFORCE 2021 Bussiness Watch Men Sport Quart...
2,Business &amp; Casual,casio.os,Casio General W-218H-3A Black Resin Band Men Y...
3,Business &amp; Casual,skmei.os,SKMEI new men's outdoor waterproof sports larg...
4,Women's Sports,skmei.os,Kids Cute Girl Cat Children's Kids Watch Gel ...
...,...,...,...
168,Business &amp; Casual,wsdwatch.os,【Official original】WISHDOIT jam tangan lelaki ...
169,Business &amp; Casual,adshops.os,GCHOK New Men's Watch Outdoor Sport Waterproof...
170,Women's Casual,qqmalaysia.os,Q&amp;Q Japan by Citizen Ladies Rubber Analogu...
171,Women's Casual,sanda.os,Sanda Sports Watch Women Waterproof Multifunc...


In [None]:
watch_df.to_csv('backup_fullcsv.csv')

#**PART 5: DATA CLEANING / TRANSFORMATION**
From Part 2 to 4, some of the data extracted from Shopee is irrelevant to the case study. Referring to Part 2 - Step 2, unique_styles showed 10 different styles, some of which are irrelevant to watches. In addition, the data contains noise, such as redundant or unnecessary information in the watch name and style, which lowers the quality of the extracted data. Examples of noise are the apostrophe s in the watch style and mixed languages in the watch name; these are explained further in the following sections.

##**Step 1: Import Libraries**
Import necessary libraries to perform the data cleaning

In [None]:
# %%capture
import re
import string
!pip install malaya
!pip install fasttext
!pip install fuzzywuzzy
import malaya
import fasttext
import fuzzywuzzy
from fuzzywuzzy import process
import chardet
import pandas as pd
import numpy as np
np.random.seed(0)



  'Cannot import beam_search_ops from Tensorflow Addons, `deep_model` for stemmer will not available to use, make sure Tensorflow Addons version >= 0.12.0'


##**Step 2: Read backup_fullcsv.csv file**
Read in the extracted watch data (watch style, watch brand and watch name) that was saved at the end of Part 4.


In [None]:
watch_df = pd.read_csv('backup_fullcsv.csv', index_col=['Unnamed: 0'])
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,Business &amp; Casual,olevs.os,OLEVS Jam Tangan Lelaki Original Watch Men Wat...
1,Business &amp; Casual,naviforce.os,NAVIFORCE 2021 Bussiness Watch Men Sport Quart...
2,Business &amp; Casual,casio.os,Casio General W-218H-3A Black Resin Band Men Y...
3,Business &amp; Casual,skmei.os,SKMEI new men's outdoor waterproof sports larg...
4,Women's Sports,skmei.os,Kids Cute Girl Cat Children's Kids Watch Gel ...
...,...,...,...
168,Business &amp; Casual,wsdwatch.os,【Official original】WISHDOIT jam tangan lelaki ...
169,Business &amp; Casual,adshops.os,GCHOK New Men's Watch Outdoor Sport Waterproof...
170,Women's Casual,qqmalaysia.os,Q&amp;Q Japan by Citizen Ladies Rubber Analogu...
171,Women's Casual,sanda.os,Sanda Sports Watch Women Waterproof Multifunc...


###**List of the unique values of each column that represents the watches attributes in the dataframe**



In [None]:
# Watch Styles
watch_df['Watch_Style'].unique()

array(['Business &amp; Casual', "Women's Sports", "Women's Casual",
       'Straps &amp; Clasps', 'Business Set', "Women's Business",
       'Others', 'Display &amp; Storage', 'Casual Set',
       '(GWP) CATEPILLAR Limited Edition Sport Towel (Not For Sale)'],
      dtype=object)

In [None]:
# Watch Brand
watch_df['Watch_Brand'].unique()

array(['olevs.os', 'naviforce.os', 'casio.os', 'skmei.os', 'carlorino.os',
       'sanda.os', 'uniq_my', 'curren.os', 'bostantenwatch.os',
       'bostanten.os', 'wishdoitwatch.wh.os', 'bostanten_watchlocal.os',
       'qqmalaysia.os', 'julius.os', 'milliotandco.os', 'billion.os',
       'danielwellington.os', 'catwatch.os', 'wsdwatch.os',
       'ligewatch.os', 'amazfit.os', 'herjewellery123', 'fossil.os',
       'icewatch.os', 'voarch.os', 'submarinewatch.os', 'adshops.os'],
      dtype=object)
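
Most of the values above end in `.os`. If the brand alone is needed, the suffix can be stripped, but the dot must be escaped: in a regular expression an unescaped `.` matches any character, so the pattern `".os"` (as used in the commented-out code in Step 3.6.3) also removes substrings such as `bos`. A small sketch, with `strip_os_suffix` as a hypothetical helper:

```python
import re


def strip_os_suffix(handle):
    """Remove a literal trailing '.os' from a brand value."""
    return re.sub(r"\.os$", "", handle)


handles = ["bostanten.os", "fossil.os", "uniq_my"]
stripped = [strip_os_suffix(h) for h in handles]
print(stripped)

# For comparison, the unescaped pattern removes more than the suffix.
mangled = re.sub(".os", "", "bostanten.os")
print(mangled)
```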

In [None]:
# Watch Name
watch_df['Watch_Name'].unique()

array(['OLEVS Jam Tangan Lelaki Original Watch Men Waterproof Stainless Steel Quartz Luminous Authentic Dragon Design Gold Watch For Men',
       'NAVIFORCE 2021 Bussiness Watch Men Sport Quartz Watches Luxury Brand Leather Waterproof LED Digital Wristwatch',
       'Casio General W-218H-3A Black Resin Band Men Youth Watch',
       "SKMEI new men's outdoor waterproof sports large dial electronic watch multi-function dual display luminous electronic watch",
       "Kids Cute  Girl Cat Children's Kids Watch Gel Digital",
       'Carlo Rino Jade of The Orient - Dark Green',
       'SKMEI Smart Watch Fashion Full Touch Screen Mens Sport Fitness Watches IP68 Waterproof Bluetooth Luxury Connection SmartWatch',
       'Sanda Military Waterproof Quartz LED  Multi-function Luxury Fashion Men Watch',
       'Uniq Aspen Strap for Apple Watch  - Grey Blue Green Series 1/2/3/4/5/6/SE/7 (38/40/41/42/44/45mm)',
       "SKMEI  Fashion watch Men's Waterproof Sport Quartz Stainless Steel Steel Watch Wri

##**Step 3: Text / Data Cleaning**

###**Step 3.1: Convert all words to lowercase**

In [None]:
# change all words in the dataframe to lowercase
# Reason: To ease later data cleaning by preventing identical words from being treated as
#         different words because of their letter case
watch_df = watch_df.applymap(lambda x: x.lower())

In [None]:
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,business &amp; casual,olevs.os,olevs jam tangan lelaki original watch men wat...
1,business &amp; casual,naviforce.os,naviforce 2021 bussiness watch men sport quart...
2,business &amp; casual,casio.os,casio general w-218h-3a black resin band men y...
3,business &amp; casual,skmei.os,skmei new men's outdoor waterproof sports larg...
4,women's sports,skmei.os,kids cute girl cat children's kids watch gel ...
...,...,...,...
168,business &amp; casual,wsdwatch.os,【official original】wishdoit jam tangan lelaki ...
169,business &amp; casual,adshops.os,gchok new men's watch outdoor sport waterproof...
170,women's casual,qqmalaysia.os,q&amp;q japan by citizen ladies rubber analogu...
171,women's casual,sanda.os,sanda sports watch women waterproof multifunc...


###**Step 3.2: Replace ampersand with the word 'and'**

In [None]:
# replace the HTML-escaped ampersand (&amp;) with the word 'and' throughout the dataframe
# Reason: To avoid different representations of the same meaning (&amp; means the same as the word 'and')
watch_df = watch_df.applymap(lambda x: re.sub('&amp;', 'and', x))

In [None]:
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,business and casual,olevs.os,olevs jam tangan lelaki original watch men wat...
1,business and casual,naviforce.os,naviforce 2021 bussiness watch men sport quart...
2,business and casual,casio.os,casio general w-218h-3a black resin band men y...
3,business and casual,skmei.os,skmei new men's outdoor waterproof sports larg...
4,women's sports,skmei.os,kids cute girl cat children's kids watch gel ...
...,...,...,...
168,business and casual,wsdwatch.os,【official original】wishdoit jam tangan lelaki ...
169,business and casual,adshops.os,gchok new men's watch outdoor sport waterproof...
170,women's casual,qqmalaysia.os,qandq japan by citizen ladies rubber analogue ...
171,women's casual,sanda.os,sanda sports watch women waterproof multifunc...
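
The `&amp;` sequences appear because `innerHTML` returns HTML-escaped text. As an alternative sketch (not what this notebook does), Python's standard `html.unescape` decodes every HTML entity, after which a plain `&` can be replaced. `decode_and_replace` is a hypothetical helper:

```python
import html


def decode_and_replace(text):
    """Decode HTML entities, then spell out '&' as the word 'and'."""
    decoded = html.unescape(text)            # '&amp;' becomes '&'
    replaced = decoded.replace("&", " and ")
    return " ".join(replaced.split())        # collapse any doubled spaces


style = decode_and_replace("business &amp; casual")
name = decode_and_replace("q&amp;q japan")
print(style)
print(name)
```

Note that this variant produces `q and q` rather than `qandq`, so the replacement in Step 3.3 would need to be adjusted if it were adopted.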


###**Step 3.3: Convert 'qandq' to 'qqmalaysia'**

In [None]:
# replace 'qandq' wherever it appears in the dataframe (it occurs in the Watch_Name column) with 'qqmalaysia'
# Reason: To ease later cleaning and avoid the brand name being eliminated from column Watch_Name
watch_df = watch_df.applymap(lambda x: re.sub('qandq', 'qqmalaysia', x))

In [None]:
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,business and casual,olevs.os,olevs jam tangan lelaki original watch men wat...
1,business and casual,naviforce.os,naviforce 2021 bussiness watch men sport quart...
2,business and casual,casio.os,casio general w-218h-3a black resin band men y...
3,business and casual,skmei.os,skmei new men's outdoor waterproof sports larg...
4,women's sports,skmei.os,kids cute girl cat children's kids watch gel ...
...,...,...,...
168,business and casual,wsdwatch.os,【official original】wishdoit jam tangan lelaki ...
169,business and casual,adshops.os,gchok new men's watch outdoor sport waterproof...
170,women's casual,qqmalaysia.os,qqmalaysia japan by citizen ladies rubber anal...
171,women's casual,sanda.os,sanda sports watch women waterproof multifunc...


###**Step 3.4: Solve inconsistency in data**
As shown in Part 5 Step 2, the same brand appears under several different names. For example, the brand Bostanten is entered as bostantenwatch.os, bostanten.os and bostanten_watchlocal.os. Similarly, wishdoitwatch.wh.os is the same brand as wsdwatch.os. Therefore, fuzzywuzzy is applied to resolve these inconsistencies in the data.

In [None]:
matches = fuzzywuzzy.process.extract("bostantenwatch.os", watch_df['Watch_Brand'].unique(), limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
print(matches)
matches2 = fuzzywuzzy.process.extract("wishdoitwatch.wh.os", watch_df['Watch_Brand'].unique(), limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
print(matches2)

[('bostantenwatch.os', 100), ('bostanten_watchlocal.os', 85), ('bostanten.os', 83), ('catwatch.os', 64), ('icewatch.os', 64), ('ligewatch.os', 62), ('submarinewatch.os', 59), ('milliotandco.os', 50), ('wsdwatch.os', 50), ('wishdoitwatch.wh.os', 44)]
[('wishdoitwatch.wh.os', 100), ('wsdwatch.os', 73), ('voarch.os', 50), ('submarinewatch.os', 50), ('sanda.os', 44), ('bostantenwatch.os', 44), ('catwatch.os', 40), ('icewatch.os', 40), ('ligewatch.os', 39), ('bostanten_watchlocal.os', 38)]
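
To give an intuition for these scores, here is a rough standard-library sketch of a token-sort style comparison: split each string into tokens, sort them, rejoin, and score the similarity of the results. It uses `difflib.SequenceMatcher` as a stand-in, so the numbers are not guaranteed to match those of fuzzywuzzy; `token_sort_similarity` is a hypothetical name.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a, b):
    """Score 0-100 after sorting the whitespace-separated tokens of each string."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


target = "bostantenwatch.os"
for candidate in ["bostanten.os", "bostanten_watchlocal.os", "casio.os"]:
    print(candidate, token_sort_similarity(target, candidate))
```

A threshold such as `min_ratio` then decides which candidates count as the same brand, which is why the thresholds in the next cells are chosen from the printed scores.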


In [None]:
# Replace values in a column that closely match a target string with that target string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 90):
    strings = df[column].unique()
    # top 10 closest matches to input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    # keep only the matched strings whose score is at least min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]
    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)
    # replace the value in every row with a close match by the input string
    df.loc[rows_with_matches, column] = string_to_match
    print("Done manipulation!")

In [None]:
replace_matches_in_column(watch_df, column='Watch_Brand', string_to_match="bostantenwatch.os", min_ratio=80)
replace_matches_in_column(watch_df, column='Watch_Brand', string_to_match="wishdoitwatch.wh.os", min_ratio=70)

Done manipulation!
Done manipulation!


In [None]:
# From the output, each of the brands Wishdoit and Bostanten now has a single representation: wishdoitwatch.wh.os and bostantenwatch.os respectively
watch_df['Watch_Brand'].unique()

array(['olevs.os', 'naviforce.os', 'casio.os', 'skmei.os', 'carlorino.os',
       'sanda.os', 'uniq_my', 'curren.os', 'bostantenwatch.os',
       'wishdoitwatch.wh.os', 'qqmalaysia.os', 'julius.os',
       'milliotandco.os', 'billion.os', 'danielwellington.os',
       'catwatch.os', 'ligewatch.os', 'amazfit.os', 'herjewellery123',
       'fossil.os', 'icewatch.os', 'voarch.os', 'submarinewatch.os',
       'adshops.os'], dtype=object)

###**Step 3.5: Remove delimiters and symbols**

In [None]:
# remove apostrophe s and bracket symbols that exist in the dataframe
# Reason: Avoid possible redundancy (example: women's and women are semantically the same in this case study)
#         Besides, symbols are not important for analysis.
watch_df = watch_df.applymap(lambda x: re.sub("('s|【|】|（|《|》)", ' ', x))

In [None]:
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,business and casual,olevs.os,olevs jam tangan lelaki original watch men wat...
1,business and casual,naviforce.os,naviforce 2021 bussiness watch men sport quart...
2,business and casual,casio.os,casio general w-218h-3a black resin band men y...
3,business and casual,skmei.os,skmei new men outdoor waterproof sports large...
4,women sports,skmei.os,kids cute girl cat children kids watch gel d...
...,...,...,...
168,business and casual,wishdoitwatch.wh.os,official original wishdoit jam tangan lelaki ...
169,business and casual,adshops.os,gchok new men watch outdoor sport waterproof ...
170,women casual,qqmalaysia.os,qqmalaysia japan by citizen ladies rubber anal...
171,women casual,sanda.os,sanda sports watch women waterproof multifunc...


###**Step 3.6: Remove/Replace Malay language words**
Malay words in Watch_Name are removed because the English words in the name already convey the necessary information about the watch. The Malay words are therefore redundant, and removing them keeps the names uniform and clear. This step is separated into several smaller steps, discussed below.

####**Step 3.6.1: Remove punctuations**

In [None]:
watch_df['Watch_Name'] = watch_df['Watch_Name'].map(lambda x: ''.join(character for character in x if character not in list(string.punctuation)))

In [None]:
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,business and casual,olevs.os,olevs jam tangan lelaki original watch men wat...
1,business and casual,naviforce.os,naviforce 2021 bussiness watch men sport quart...
2,business and casual,casio.os,casio general w218h3a black resin band men you...
3,business and casual,skmei.os,skmei new men outdoor waterproof sports large...
4,women sports,skmei.os,kids cute girl cat children kids watch gel d...
...,...,...,...
168,business and casual,wishdoitwatch.wh.os,official original wishdoit jam tangan lelaki ...
169,business and casual,adshops.os,gchok new men watch outdoor sport waterproof ...
170,women casual,qqmalaysia.os,qqmalaysia japan by citizen ladies rubber anal...
171,women casual,sanda.os,sanda sports watch women waterproof multifunc...


####**Step 3.6.2: Remove extra spaces between words**

In [None]:
# Reason: To assist the smooth removal of certain words without worrying that extra spaces will incur a difference between the same word.
# Example: 'jam tangan' and 'jam  tangan'
watch_df['Watch_Name'] = watch_df['Watch_Name'].map(lambda x: ' '.join(x.split()))

In [None]:
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,business and casual,olevs.os,olevs jam tangan lelaki original watch men wat...
1,business and casual,naviforce.os,naviforce 2021 bussiness watch men sport quart...
2,business and casual,casio.os,casio general w218h3a black resin band men you...
3,business and casual,skmei.os,skmei new men outdoor waterproof sports large ...
4,women sports,skmei.os,kids cute girl cat children kids watch gel dig...
...,...,...,...
168,business and casual,wishdoitwatch.wh.os,official original wishdoit jam tangan lelaki o...
169,business and casual,adshops.os,gchok new men watch outdoor sport waterproof m...
170,women casual,qqmalaysia.os,qqmalaysia japan by citizen ladies rubber anal...
171,women casual,sanda.os,sanda sports watch women waterproof multifunct...
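
Steps 3.1, 3.5, 3.6.1 and 3.6.2 can be summarised for a single product name as a small pipeline. This is a sketch for illustration only: `clean_name` is a hypothetical helper, it leaves out the replacements from Steps 3.2 and 3.3, and the notebook itself applies each step to whole columns.

```python
import re
import string


def clean_name(name):
    """Lowercase, drop 's and bracket symbols, strip punctuation, tidy spaces."""
    text = name.lower()
    text = re.sub("('s|【|】|（|《|》)", " ", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


first = clean_name("Casio General W-218H-3A Black Resin Band Men Youth Watch")
second = clean_name("Kids Cute  Girl Cat Children's Kids Watch Gel Digital")
print(first)
print(second)
```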


In [None]:
watch_df['Watch_Name'].unique()

array(['olevs jam tangan lelaki original watch men waterproof stainless steel quartz luminous authentic dragon design gold watch for men',
       'naviforce 2021 bussiness watch men sport quartz watches luxury brand leather waterproof led digital wristwatch',
       'casio general w218h3a black resin band men youth watch',
       'skmei new men outdoor waterproof sports large dial electronic watch multifunction dual display luminous electronic watch',
       'kids cute girl cat children kids watch gel digital',
       'carlo rino jade of the orient dark green',
       'skmei smart watch fashion full touch screen mens sport fitness watches ip68 waterproof bluetooth luxury connection smartwatch',
       'sanda military waterproof quartz led multifunction luxury fashion men watch',
       'uniq aspen strap for apple watch grey blue green series 123456se7 384041424445mm',
       'skmei fashion watch men waterproof sport quartz stainless steel steel watch wristwatches',
       'curren men g

####**Step 3.6.3: Remove Malay language words with Malaya and Fasttext library**

In [None]:
# Remove words that are detected to belong to words of the Malay language.
fast_text = malaya.language_detection.fasttext()
watch_df['Watch_Name'] = watch_df['Watch_Name'].map(lambda x: ' '.join(word for word in x.split() if (fast_text.predict([word])[0] != 'malay' or (word.isalnum() and not word.isalpha() and not word.isdigit()) ) ))

# An approach that removes Malay words by checking whether each word is an English word was also tested.
# It was not adopted because it tends to remove valid English words and so shortens the watch names.
# This is not ideal, as the watch name provides useful information about the attributes of the watch and helps to reflect which characteristics buyers favour.
# Steps of the mentioned approach
# brand_name_lst = list(watch_df['Watch_Brand'].unique())
# brand_name_lst = [(lambda x: re.sub(".os", '', x))(x) for x in brand_name_lst]
# watch_df['Watch_Name'] = watch_df['Watch_Name'].map(lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in words or not w.isalpha() or w in brand_name_lst))




In [None]:
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,business and casual,olevs.os,olevs jam tangan original watch men waterproof...
1,business and casual,naviforce.os,naviforce 2021 bussiness watch men sport quart...
2,business and casual,casio.os,casio general w218h3a black resin band men you...
3,business and casual,skmei.os,skmei new men outdoor waterproof sports large ...
4,women sports,skmei.os,kids cute girl cat children kids watch gel
...,...,...,...
168,business and casual,wishdoitwatch.wh.os,official original wishdoit jam tangan original...
169,business and casual,adshops.os,gchok new men watch outdoor sport waterproof m...
170,women casual,qqmalaysia.os,qqmalaysia japan by citizen ladies rubber anal...
171,women casual,sanda.os,sanda sports watch women waterproof multifunct...


In [None]:
watch_df['Watch_Name'].unique()

array(['olevs jam tangan original watch men waterproof stainless steel quartz authentic dragon design gold watch for men',
       'naviforce 2021 bussiness watch men sport quartz watches luxury brand leather waterproof led wristwatch',
       'casio general w218h3a black resin band men youth watch',
       'skmei new men outdoor waterproof sports large dial electronic watch multifunction dual display electronic watch',
       'kids cute girl cat children kids watch gel',
       'carlo rino jade of the orient dark green',
       'skmei smart watch fashion full touch screen mens sport fitness watches ip68 waterproof bluetooth luxury connection',
       'sanda military waterproof quartz led multifunction luxury fashion men watch',
       'uniq aspen strap for apple watch grey blue green series 123456se7 384041424445mm',
       'skmei fashion watch men waterproof sport quartz stainless steel steel watch wristwatches',
       'curren men g shock waterproof watch fashion sports waterproof qu
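
The filter rule in Step 3.6.3 can be sketched in isolation as follows. This is a minimal illustration, not the notebook's exact code: the detector is passed in as a function so the rule can be checked without downloading the model, and `is_model_code`, `drop_malay_tokens`, `FAKE_MALAY` and `fake_predict` are hypothetical names that are not part of Malaya. Comparing the outputs before and after this step suggests the detector also drops some English words (for example 'luminous' and 'digital'), so it may be worth inspecting a sample of removed tokens.

```python
def is_model_code(token: str) -> bool:
    """True for tokens that mix letters and digits, e.g. 'w218h3a' or 'ip68'."""
    return token.isalnum() and not token.isalpha() and not token.isdigit()


def drop_malay_tokens(name: str, predict_language) -> str:
    """Keep a token unless it is predicted as Malay and is not a model code."""
    kept = [
        tok for tok in name.split()
        if predict_language(tok) != 'malay' or is_model_code(tok)
    ]
    return ' '.join(kept)


# Hypothetical stand-in for the real detector, used only to make the sketch runnable.
FAKE_MALAY = {'lelaki', 'w218h3a'}


def fake_predict(token: str) -> str:
    return 'malay' if token in FAKE_MALAY else 'other'


# 'lelaki' is dropped; 'w218h3a' survives because it mixes letters and digits.
print(drop_malay_tokens('casio w218h3a jam tangan lelaki watch', fake_predict))
```

Passing the detector as an argument keeps the rule testable; in the notebook the same role is played by `fast_text.predict([word])[0]`.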

####**Step 3.6.4: Refine the removal of Malay language words**
Although the library removes some words that it detects as Malay, several Malay words that commonly appear in the watch names are not removed, such as "jam tangan", "wanita" and "perempuan". Examples are shown below:




#####**Example of Malay language words that are not removed**

In [None]:
watch_df['Watch_Name'].iloc[0] # 'jam tangan' is not needed because 'watch' is already present and the two are semantically the same

'olevs jam tangan original watch men waterproof stainless steel quartz authentic dragon design gold watch for men'

In [None]:
watch_df['Watch_Name'].iloc[34] # additional example

'bostanten men sports shock watch 3time chrono alarm date week display led light big dial military men sports watches jam tangan'
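
Rather than spotting leftovers row by row, the remaining Malay words can be counted across the whole column. The sketch below is only an illustration on a toy series: `names` holds shortened versions of the names shown above, and `leftovers` is an assumed, non-exhaustive word list.

```python
import pandas as pd

# Toy sample standing in for watch_df['Watch_Name'] (shortened, for illustration only).
names = pd.Series([
    'olevs jam tangan original watch men waterproof',
    'casio general w218h3a black resin band men youth watch',
    'bostanten men sports shock watch military men sports watches jam tangan',
])

# Assumed list of Malay words to look for; extend it as new leftovers are found.
leftovers = ['jam tangan', 'wanita', 'perempuan']

# Count how many names still contain each word as a whole word or phrase.
counts = {
    word: int(names.str.contains(rf'\b{word}\b', regex=True).sum())
    for word in leftovers
}
print(counts)
```

Words with a non-zero count are candidates for the refinement function that follows.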

#####**Refinement process**

In [None]:
# Define refine_and_remove_repeated_info to finish removing Malay words,
# map synonyms to a single English term and drop repeated words.
def refine_and_remove_repeated_info(sentence):
    # \b word boundaries stop a pattern from matching inside a longer word.
    sentence = re.sub(r'\b(jam tangan|jam)\b', '', sentence)
    sentence = re.sub(r'\b(perempuan|wanita|female|ladies)\b', 'women', sentence)
    sentence = re.sub(r'\b(lelaki|male)\b', 'men', sentence)
    sentence = re.sub(r'\b(budak|kanak|children|kids)\b', 'kid', sentence)
    sentence = re.sub(r'\bpasangan\b', 'couple', sentence)
    sentence = re.sub(r'\brasmi\b', 'official', sentence)
    sentence = re.sub(r'\basal\b', 'original', sentence)
    sentence = re.sub(r'\bmurah\b', 'cheap', sentence)

    # Keep only the first occurrence of each word, preserving word order.
    final_lst = []
    for w in sentence.split():
        if w not in final_lst:
            final_lst.append(w)
    return ' '.join(final_lst)

In [None]:
watch_df['Watch_Name'] = watch_df['Watch_Name'].apply(refine_and_remove_repeated_info)

In [None]:
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,business and casual,olevs.os,olevs original watch men waterproof stainless ...
1,business and casual,naviforce.os,naviforce 2021 bussiness watch men sport quart...
2,business and casual,casio.os,casio general w218h3a black resin band men you...
3,business and casual,skmei.os,skmei new men outdoor waterproof sports large ...
4,women sports,skmei.os,kid cute girl cat watch gel
...,...,...,...
168,business and casual,wishdoitwatch.wh.os,official original wishdoit men watch pu leathe...
169,business and casual,adshops.os,gchok new men watch outdoor sport waterproof m...
170,women casual,qqmalaysia.os,qqmalaysia japan by citizen women rubber analo...
171,women casual,sanda.os,sanda sports watch women waterproof multifunct...
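
The duplicate-removal part of the refinement keeps the first occurrence of each word. A compact way to express the same idea, shown here as a sketch with the hypothetical helper name `dedupe_words`, relies on `dict.fromkeys`, which preserves insertion order:

```python
def dedupe_words(sentence: str) -> str:
    """Drop repeated words, keeping the first occurrence of each in order."""
    return ' '.join(dict.fromkeys(sentence.split()))


# The repeated 'steel' and the second 'watch' are removed.
print(dedupe_words(
    'skmei fashion watch men waterproof sport quartz stainless steel steel watch wristwatches'
))
```

One side effect to keep in mind is that any word repeated on purpose is also collapsed, so the cleaned name no longer reflects how often a term appeared in the listing.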


####**Review: Final unique values of each watch attribute column in the dataframe**

In [None]:
watch_df['Watch_Name'].unique()

array(['olevs original watch men waterproof stainless steel quartz authentic dragon design gold for',
       'naviforce 2021 bussiness watch men sport quartz watches luxury brand leather waterproof led wristwatch',
       'casio general w218h3a black resin band men youth watch',
       'skmei new men outdoor waterproof sports large dial electronic watch multifunction dual display',
       'kid cute girl cat watch gel',
       'carlo rino jade of the orient dark green',
       'skmei smart watch fashion full touch screen mens sport fitness watches ip68 waterproof bluetooth luxury connection',
       'sanda military waterproof quartz led multifunction luxury fashion men watch',
       'uniq aspen strap for apple watch grey blue green series 123456se7 384041424445mm',
       'skmei fashion watch men waterproof sport quartz stainless steel wristwatches',
       'curren men g shock waterproof watch fashion sports quartz',
       'naviforce mens watch top brand luxury fashion quartz men watc

In [None]:
watch_df['Watch_Style'].unique()

array(['business and casual', 'women  sports', 'women  casual',
       'straps and clasps', 'business set', 'women  business', 'others',
       'display and storage', 'casual set',
       '(gwp) catepillar limited edition sport towel (not for sale)'],
      dtype=object)

In [None]:
watch_df['Watch_Brand'].unique()

array(['olevs.os', 'naviforce.os', 'casio.os', 'skmei.os', 'carlorino.os',
       'sanda.os', 'uniq_my', 'curren.os', 'bostantenwatch.os',
       'wishdoitwatch.wh.os', 'qqmalaysia.os', 'julius.os',
       'milliotandco.os', 'billion.os', 'danielwellington.os',
       'catwatch.os', 'ligewatch.os', 'amazfit.os', 'herjewellery123',
       'fossil.os', 'icewatch.os', 'voarch.os', 'submarinewatch.os',
       'adshops.os'], dtype=object)

###**Step 3.7: Remove irrelevant data**
Subcategories such as *straps and clasps*, *display and storage*, *others*, and *(gwp) catepillar limited edition sport towel (not for sale)* were also extracted, but they are irrelevant to the case study. The rows that belong to them are therefore removed from the dataframe.

In [None]:
watch_df[watch_df.Watch_Style == "straps and clasps"].index.tolist()

[8, 15, 94, 153, 155]

In [None]:
watch_df.loc[[8, 15, 94, 153, 155]]

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
8,straps and clasps,uniq_my,uniq aspen strap for apple watch grey blue gre...
15,straps and clasps,bostantenwatch.os,bostanten men genuine leather ratchet belt wit...
94,straps and clasps,amazfit.os,band 5 strap free gift
153,straps and clasps,skmei.os,b016 stainless steel watch strap band time fun...
155,straps and clasps,skmei.os,rubber strap watch band sports multi color tou...


In [None]:
watch_df[watch_df.Watch_Style == "display and storage"].index.tolist()

[37, 45, 47, 53, 138, 139]

In [None]:
watch_df.loc[[37, 45, 47, 53, 138, 139]]

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
37,display and storage,skmei.os,skmei logo original brand metal box gift
45,display and storage,sanda.os,sanda official original luxury watch gift box
47,display and storage,qqmalaysia.os,box upgrade qqmalaysia qc225 rm15 only please ...
53,display and storage,skmei.os,skmei logo box06 watch box gift package
138,display and storage,bostantenwatch.os,bostanten fashion gift bag
139,display and storage,skmei.os,skmei watch box small light weight simple pack...


In [None]:

watch_df[watch_df.Watch_Style == "others"].index.tolist()

[19, 23]

In [None]:
watch_df.loc[[19, 23]]

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
19,others,uniq_my,uniq case for apple watch torres black white b...
23,others,skmei.os,skmei replace watchbands only band not watch


In [None]:
watch_df[watch_df.Watch_Style == "(gwp) catepillar limited edition sport towel (not for sale)"].index.tolist()

[72]

In [None]:
watch_df.loc[[72]]

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
72,(gwp) catepillar limited edition sport towel (...,catwatch.os,gwp catepillar limited edition sport towel not...


In [None]:
# Drop the 14 irrelevant rows found above, then renumber the index from 0.
watch_df.drop(index=[8, 15, 94, 153, 155, 37, 45, 47, 53, 72, 138, 139, 19, 23], inplace=True)
watch_df = watch_df.reset_index(drop=True)

In [None]:
watch_df

Unnamed: 0,Watch_Style,Watch_Brand,Watch_Name
0,business and casual,olevs.os,olevs original watch men waterproof stainless ...
1,business and casual,naviforce.os,naviforce 2021 bussiness watch men sport quart...
2,business and casual,casio.os,casio general w218h3a black resin band men you...
3,business and casual,skmei.os,skmei new men outdoor waterproof sports large ...
4,women sports,skmei.os,kid cute girl cat watch gel
...,...,...,...
154,business and casual,wishdoitwatch.wh.os,official original wishdoit men watch pu leathe...
155,business and casual,adshops.os,gchok new men watch outdoor sport waterproof m...
156,women casual,qqmalaysia.os,qqmalaysia japan by citizen women rubber analo...
157,women casual,sanda.os,sanda sports watch women waterproof multifunct...
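
The removal in Step 3.7 lists row indices by hand. An alternative that avoids copying indices is to filter on the style labels directly with a boolean mask. The sketch below runs on a toy dataframe; `toy_df`, `irrelevant_styles` and `filtered_df` are illustrative names, and the style strings are assumed to match the values in `Watch_Style` exactly.

```python
import pandas as pd

# Toy stand-in for watch_df with a few representative rows.
toy_df = pd.DataFrame({
    'Watch_Style': ['business and casual', 'straps and clasps', 'others', 'women  casual'],
    'Watch_Name': ['casio general', 'band 5 strap free gift',
                   'skmei replace watchbands', 'sanda sports watch'],
})

# Styles that are not watches and should be excluded from the study.
irrelevant_styles = [
    'straps and clasps',
    'display and storage',
    'others',
    '(gwp) catepillar limited edition sport towel (not for sale)',
]

# Keep rows whose style is NOT in the list, then renumber the index from 0.
filtered_df = toy_df[~toy_df['Watch_Style'].isin(irrelevant_styles)].reset_index(drop=True)
print(filtered_df)
```

Because the mask is computed from the data, it stays correct if the scrape is repeated and the row positions change.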


###**Step 3.8: Convert categorical variables into quantitative variables**
Statistical models cannot take categorical data stored as objects or strings, so these columns must be converted to quantitative data. In this project, the pandas *get_dummies* method is used to convert the dataframe's Watch_Style and Watch_Brand columns.
**Note: Watch Name is not converted to quantitative variables, as this is unnecessary and the column will not be used by a statistical model for any analysis. However, it is crucial to the objective of this project, which is to identify the best selling watch in Shopee.**

In [None]:
df_style_dummy = pd.get_dummies(watch_df['Watch_Style'])
df_brand_dummy = pd.get_dummies(watch_df['Watch_Brand'])
cleaned_df = pd.concat([df_style_dummy, df_brand_dummy, watch_df['Watch_Name']], axis=1)
cleaned_df

Unnamed: 0,business and casual,business set,casual set,women business,women casual,women sports,adshops.os,billion.os,bostantenwatch.os,carlorino.os,casio.os,curren.os,danielwellington.os,fossil.os,herjewellery123,icewatch.os,julius.os,ligewatch.os,milliotandco.os,naviforce.os,olevs.os,qqmalaysia.os,sanda.os,skmei.os,submarinewatch.os,voarch.os,wishdoitwatch.wh.os,Watch_Name
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,olevs original watch men waterproof stainless ...
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,naviforce 2021 bussiness watch men sport quart...
2,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,casio general w218h3a black resin band men you...
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,skmei new men outdoor waterproof sports large ...
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,kid cute girl cat watch gel
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
154,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,official original wishdoit men watch pu leathe...
155,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,gchok new men watch outdoor sport waterproof m...
156,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,qqmalaysia japan by citizen women rubber analo...
157,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,sanda sports watch women waterproof multifunct...
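
As a variation on the conversion above, `get_dummies` can encode both columns in one call and prefix the new column names, which makes it easier to tell style columns from brand columns. This is a sketch on a toy dataframe; `toy_df`, `encoded` and the `style`/`brand` prefixes are illustrative choices. Passing `dtype=int` requests 0/1 values explicitly, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Toy stand-in for watch_df.
toy_df = pd.DataFrame({
    'Watch_Style': ['business and casual', 'women  sports', 'business and casual'],
    'Watch_Brand': ['olevs.os', 'skmei.os', 'casio.os'],
    'Watch_Name': ['olevs original watch', 'kid cute girl cat watch gel', 'casio general'],
})

# Encode the two categorical columns; Watch_Name is left untouched.
encoded = pd.get_dummies(
    toy_df,
    columns=['Watch_Style', 'Watch_Brand'],
    prefix=['style', 'brand'],
    dtype=int,
)
print(encoded.columns.tolist())
```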


#**Part 6: EXPORT CLEANED DATA AS A CSV FILE**
Now that the extracted watch data has been cleaned, it is exported to a csv file named 'a176297_watch_style_brand_name_2021.csv'. <br> **Note:** As a backup, the data whose categorical columns have not been converted to quantitative data is saved to a csv file named 'a176297_watch_style_brand_name.csv'.


In [None]:
watch_df.to_csv('a176297_watch_style_brand_name.csv')
print('Done exporting dataframe as csv file!')

Done exporting dataframe as csv file!


In [None]:
cleaned_df.to_csv('a176297_watch_style_brand_name_2021.csv')
print('Done exporting dataframe as csv file!')

Done exporting dataframe as csv file!
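
Since `to_csv` is called without `index=False`, the row index is written as an unnamed first column. When the file is read back, that column appears as `Unnamed: 0` unless it is declared as the index. The sketch below shows both cases using an in-memory buffer instead of a file; `toy_df`, `buffer`, `with_unnamed` and `with_index` are illustrative names.

```python
import io

import pandas as pd

# Toy stand-in for the exported dataframe.
toy_df = pd.DataFrame({
    'Watch_Brand': ['olevs.os', 'casio.os'],
    'Watch_Name': ['olevs original watch', 'casio general'],
})

# Write the CSV the same way as the export cells (index included).
buffer = io.StringIO()
toy_df.to_csv(buffer)

# Plain read: the saved index comes back as an 'Unnamed: 0' column.
buffer.seek(0)
with_unnamed = pd.read_csv(buffer)

# Declaring the first column as the index restores the original layout.
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)

print(with_unnamed.columns.tolist())
print(with_index.columns.tolist())
```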


**END OF PROJECT**