# Summary:

                                                     
- We used BeautifulSoup, a Python library for parsing HTML, to extract watch data from the Jaquet Droz website.

- The raw scrape produced 131 watch entries, each with 40 features. After removing entries that share a reference number, 67 unique watches remain.

- Some features were missing for certain watches: jewels, frequency, clasp type, bracelet color, bracelet material, dial color, and water resistance.

- We could not retrieve the prices of the watches or their currency, because this information is not available on the website.

                                                   

## Steps we took to scrape the data from the website:
- First, we visited the website and explored its structure to understand where the desired data is located.

- Since the website was dynamic and required interaction, we initially attempted to use a tool called Selenium. We programmed Selenium to navigate through the pages, click on watches, and access their details. However, we encountered difficulties with extracting all the necessary information, particularly with the technical data section. The website utilized JavaScript scrolling to display additional details, which Selenium couldn't handle effectively within our limited timeframe.

- To overcome this challenge, we opted for an alternative approach using BeautifulSoup. With BeautifulSoup, we scraped the required data in a more straightforward manner.

- We began by scraping the links, titles, and image sources of each watch from the initial page. We stored this information in a list.

- Next, we implemented a loop to visit each watch's individual page by following the links in the list.

- Each watch's page contained additional watches, so we used NavigableString to search through the XML/HTML structure.

- We implemented another loop to extract the image sources, links, and titles of the watches on the secondary page and added them to the existing lists.

- We further iterated through each watch's link and accessed the page containing the technical details.

- In this step, we retrieved the title and description of each watch and then extracted the technical details from the table. We parsed the rows and columns of the table and stored the information in a list.

- We encountered another challenge when some watches had different color variations with distinct details. To address this, we revisited the HTML of the page to identify additional color variations, extract their links, and add them to the list of links.

- By looping through these links, we were able to scrape the details of each color variation.

- For some columns, we couldn't find the corresponding details on the website. In these cases, we marked those columns as null since the watches themselves didn't provide that information.

- After collecting all the necessary details, we organized the data into a dataframe and saved it as a CSV file.

- We then read the data from the CSV file and examined its structure.

- The dataset consisted of 131 rows, numbered from 0 to 130, and contained 40 columns.

- However, these 131 rows (watches) are not all unique: some watches appear in more than one collection with the same details and the same reference number. Because the client wants one row per reference number, we kept only one entry for each.

- The final dataset therefore has 67 unique rows (watches) and 40 columns.

- As we explored the data, we noticed that certain columns had missing values. These missing values indicate that the corresponding watches did not have those specific details available.

- To investigate further, we analyzed the watches with missing values and discovered that those specific watches did not provide the information for those columns.
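
The color-variation step above can be sketched without any network access. This is only an illustration: the `variants` mapping stands in for the links found on each page, and the paths and the `splice_after` helper are made up for the example. It relies on the fact that a Python `for` loop over a list also visits items inserted after the current position.

```python
def splice_after(urls, current, candidates):
    """Insert candidates not already in `urls` right after `current`."""
    new_links = [link for link in candidates if link not in urls]
    position = urls.index(current) + 1
    urls[position:position] = new_links
    return urls

# Hypothetical data: '/watch-a' has two color variations, '/watch-b' has none.
variants = {"/watch-a": ["/watch-a-blue", "/watch-a-red"]}
urls = ["/watch-a", "/watch-b"]
visited = []

for url in urls:
    visited.append(url)
    # Newly found variation links are visited in later iterations of this loop.
    splice_after(urls, url, variants.get(url, []))

print(visited)
```

Inserting right after the current link keeps each variation next to its base watch in the visiting order.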


### Challenges we faced:
- The website is complicated to explore, and as a user it was hard to find the details we wanted.
- The website is dynamic and relies heavily on JavaScript, which made it harder for us to get the information.
- Getting the details of the color variations of some watches was difficult, because the website uses JavaScript to switch between their pages.
- Learning web scraping in a short period of time.


### Things we noticed:
- The watches do not have prices.
- There are not a lot of details about each watch.
- There are watches that have the same reference number and the same details in two different collections.
- We kept only the unique watches in this file, which leaves 67 watches.
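
Keeping one entry per reference number can be sketched as follows. This is a minimal illustration with made-up records and a hypothetical helper name, not the notebook's own code, which performs a similar check inside its scraping loop. A set makes each membership test cheap.

```python
def keep_first_per_reference(records):
    """Return the records, keeping only the first entry for each reference number."""
    seen = set()
    unique = []
    for record in records:
        reference = record["reference_number"]
        if reference in seen:
            continue
        seen.add(reference)
        unique.append(record)
    return unique

# Hypothetical records: the same reference listed under two collections.
records = [
    {"reference_number": "REF-001", "parent_model": "COLLECTION A"},
    {"reference_number": "REF-002", "parent_model": "COLLECTION A"},
    {"reference_number": "REF-001", "parent_model": "COLLECTION B"},
]
unique_records = keep_first_per_reference(records)
print(len(unique_records))
```

With this approach the collection recorded for a duplicated watch is whichever one was scraped first.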

<hr>

### Importing the necessary libraries

In [None]:
# Import necessary libraries and modules for data manipulation, visualization, web scraping, and regular expressions.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import requests
from bs4 import BeautifulSoup, NavigableString
import re

### Extracting Data using BeautifulSoup

In [None]:
def extract_asoup(url, parser='lxml'):
  # Send a GET request to the specified 'url' and retrieve the page content
  page = requests.get(url)

  # Create a BeautifulSoup object by parsing the 'page.content' using the specified 'parser'
  soup = BeautifulSoup(page.content, parser)

  # Return the resulting BeautifulSoup object
  return soup

In [None]:
#Assigning the link to the website to a variable
watches_url = ""

In [None]:
# Extract the HTML content from the specified 'watches_url' and parse it using 'html.parser'
soup = extract_asoup(watches_url, 'html.parser')

### Extract Watch URLs

In [None]:
# Find all the <a> tags (hyperlinks) in the parsed HTML using BeautifulSoup
links = soup.find_all('a', href=True)

# Extract the URLs from the <a> tags that start with ""
watches_urls = [link['href'] for link in links if link['href'].startswith("")]

watches_urls.reverse()
#Let's see the links we have extracted
watches_urls

['https://www.jaquet-droz.com/en/watches/astrale',
 'https://www.jaquet-droz.com/en/watches/lady-8',
 'https://www.jaquet-droz.com/en/watches/petite-heure-minute',
 'https://www.jaquet-droz.com/en/watches/sw',
 'https://www.jaquet-droz.com/en/watches/grande-seconde',
 'https://www.jaquet-droz.com/en/watches/ateliers-d-art',
 'https://www.jaquet-droz.com/en/watches/automata',
 'https://www.jaquet-droz.com/en/watches/timepieces']
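
The filtering step above keeps only the links that begin with a given prefix. Below is a minimal sketch of that idea using the standard library only. The prefix, the base URL on `example.com`, and the hrefs are placeholders, since the prefix used in the notebook is not shown. Unlike the cell above, the sketch also resolves relative links and drops repeated links while preserving their order.

```python
from urllib.parse import urljoin

def collection_links(hrefs, base_url, prefix):
    """Resolve hrefs against base_url, keep those under prefix, drop repeats."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    kept = [url for url in absolute if url.startswith(prefix)]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(kept))

# Placeholder inputs for illustration only.
hrefs = [
    "/en/watches/astrale",
    "https://example.com/en/watches/lady-8",
    "/en/contact",
    "/en/watches/astrale",
]
links = collection_links(
    hrefs,
    base_url="https://example.com/en/",
    prefix="https://example.com/en/watches/",
)
print(links)
```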

In [None]:
#Creating an empty list to save the data of the watches in it
watches_data = []
# Loop through each watch URL
for watch_url in watches_urls:
    # Extract the watch details from the URL
    watch_soup = extract_asoup(watch_url, 'html.parser')
    content = watch_soup.find('div', class_ = 'block block-system')

    # Skip to the next iteration if content is None
    if content is None:
        continue

    # Extract the parent model information
    parent_model = content.find('h1').text


    # Extract the URLs of related watches
    item_list = content.find_all('a', href=True)
    urls =[link['href'] for link in item_list]

    # Loop through each related watch URL
    for watch in urls:
        watch_URL = watch

        # Extract the watch information from the related watch URL
        watch_infos = extract_asoup(watch, 'html.parser')
        watch_info = watch_infos.find('div', class_ = 'watch-infos')


        # Extract various details of the watch
        specific_model = watch_info.find('h1', class_ = 'title-node').text
        description = watch_info.find('div', class_ = 'description').text.strip()
        watch_spec = watch_info.find('div', class_ = 'watch-spec')
        reference_number = watch_spec.find('th', text='Reference').find_next_sibling('td').text
        if any(item.get("reference_number") == reference_number for item in watches_data):
            continue
        features = watch_spec.find('th', text='Indications').find_next_sibling('td').text if watch_spec.find('th', text='Indications') else None
        jewels = watch_spec.find('th', text='Jewelling').find_next_sibling('td').text \
            if watch_spec.find('th', text='Jewelling') else None
        frequency = watch_spec.find('th', text='Frequency').find_next_sibling('td').text if watch_spec.find('th', text='Frequency') else None
        power_reserve = watch_spec.find('th', text='Power reserve').find_next_sibling('td').text if watch_spec.find('th', text='Power reserve') else None
        caliber = watch_spec.find('th', text='Movement').find_next_sibling('td').text if watch_spec.find('th', text='Movement') else None
        movement = caliber
        clasp_type = watch_spec.find('th', text='Buckle').find_next_sibling('td').text if watch_spec.find('th', text='Buckle') else None
        bracelet_color = watch_spec.find('th', text='Strap').find_next_sibling('td').text if watch_spec.find('th', text='Strap') else None
        bracelet_material = bracelet_color
        dial_color = watch_spec.find('th', text='Dial').find_next_sibling('td').text if watch_spec.find('th', text='Dial') else None
        water_resistance = watch_spec.find('th', text=lambda text: text and 'resistance' in text).find_next_sibling('td').text if watch_spec.find('th', text=lambda text: text and 'resistance' in text) else None
        case_thickness = watch_spec.find('th', text='Case').find_next_sibling('td').text if watch_spec.find('th', text='Case') else None
        diameter = case_thickness
        case_material = case_thickness

        image_URL = watch_infos.find('div', class_='watch-picture').find('img').get('src')
        price =""
        currency = ""
        brand = "Jaquet Droz"


        # Update the list of URLs based on variant colors
        variant_links = watch_infos.select('div.variantes li.variante:not(.active)')
        if variant_links:
            list_colors = [li.find('a')['href'] for li in variant_links]

            index = urls.index(watch)+1
            urls[index:index] = [item for item in list_colors if item not in urls]


        # Append the extracted watch data to the watches_data list
        watches_data.append({
            "reference_number": reference_number,
            "watch_URL": watch_URL,
            "type": '',
            "brand": brand,
            "year_introduced": '',
            "parent_model": parent_model,
            "specific_model": specific_model,
            "nickname": '',
            "marketing_name": '',
            "style": '',
            "currency": currency,
            "price": price,
            "image_URL": image_URL,
            "made_in": '',
            "case_shape": '',
            "case_material": case_material,
            "case_finish": '',
            "caseback": '',
            "diameter": diameter,
            "between_lugs": '',
            "lug_to_lug": '',
            "case_thickness": case_thickness,
            "bezel_material": '',
            "bezel_color": '',
            "crystal": '',
            "water_resistance": water_resistance,
            "weight": '',
            "dial_color": dial_color,
            "numerals": '',
            "bracelet_material": bracelet_material,
            "bracelet_color": bracelet_color,
            "clasp_type": clasp_type,
            "movement": movement,
            "caliber": caliber,
            "power_reserve": power_reserve,
            "frequency": frequency,
            "jewels": jewels,
            "features": features,
            "description": description,
            "short_description": '',
        })
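
The cell above repeats the same "find the `th`, then read the next `td`" lookup for every field. One possible way to shorten it is a small helper, sketched here under stated assumptions: the name `spec_value` is hypothetical, and the sample HTML is hand-written to mimic a `watch-spec` table rather than copied from the website. The sketch uses `string=`, the current name of the `text=` argument in Beautiful Soup 4.4.0 and later, and it strips surrounding whitespace, which the original `.text` calls do not.

```python
from bs4 import BeautifulSoup

def spec_value(watch_spec, label):
    """Return the text of the <td> following the <th> equal to `label`, or None."""
    header = watch_spec.find('th', string=label)
    if header is None:
        return None
    cell = header.find_next_sibling('td')
    return cell.get_text(strip=True) if cell is not None else None

# Hand-written sample markup (an assumption about the table layout).
sample_html = """
<div class="watch-spec"><table>
<tr><th>Reference</th><td>J005024288</td></tr>
<tr><th>Power reserve</th><td> 68 hours </td></tr>
</table></div>
"""
watch_spec = BeautifulSoup(sample_html, 'html.parser').find('div', class_='watch-spec')

reference_number = spec_value(watch_spec, 'Reference')
power_reserve = spec_value(watch_spec, 'Power reserve')
jewels = spec_value(watch_spec, 'Jewelling')  # row absent in the sample
print(reference_number, power_reserve, jewels)
```

A missing row yields `None`, which matches how the cell above treats features that a watch page does not list.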


In [None]:
#Creating a dataframe (watches details) from the data we got from the website
df_watches_data = pd.DataFrame(watches_data)

#Seeing the dataframe (watches details)
df_watches_data

Unnamed: 0,reference_number,watch_URL,type,brand,year_introduced,parent_model,specific_model,nickname,marketing_name,style,...,bracelet_color,clasp_type,movement,caliber,power_reserve,frequency,jewels,features,description,short_description
0,J030033240,https://www.jaquet-droz.com/en/watches/sw/tour...,,Jaquet Droz,,SW,Tourbillon SW,,,,...,Rolled-edge hand made inserted black alligator...,18-carat red gold folding clasp with black PVD...,"Jaquet Droz 25JD-S, self-winding tourbillon mo...","Jaquet Droz 25JD-S, self-winding tourbillon mo...",7 days,"21,600 v.p.h",31 jewels,Hours and minutes at 6 o'clock\r\nTourbillon f...,Black dial with rubber treatment.\r\n18-carat ...,
1,J013523242,https://www.jaquet-droz.com/en/watches/grande-...,,Jaquet Droz,,GRANDE SECONDE,Tourbillon Skelet,,,,...,Rolled-edge hand-made black alligator,18-karat red gold folding clasp,"Jaquet Droz 2625SQ, self-winding skeleton tour...","Jaquet Droz 2625SQ, self-winding skeleton tour...",7 days,"21,600 v.p.h.",30 jewels,Off-centered hours and minutes at 6 o'clock. T...,Sapphire dial with metallic sapphire base. \r\...,
2,J0135240021,https://www.jaquet-droz.com/en/watches/grande-...,,Jaquet Droz,,GRANDE SECONDE,Tourbillon Skelet,,,,...,Rubber strap,18-karat white gold folding clasp,"Jaquet Droz 2625SQ, self-winding skeleton tour...","Jaquet Droz 2625SQ, self-winding skeleton tour...",7 days,"21,600 v.p.h.",30 jewels,Off-centered hours and minutes at 6 o'clock. T...,Transparent sapphire dial and blue sapphire ba...,
3,J0135230011,https://www.jaquet-droz.com/en/watches/grande-...,,Jaquet Droz,,GRANDE SECONDE,Tourbillon Skelet Skull,,,,...,Rubber strap,18-karat red gold folding clasp,"Jaquet Droz 2625SQ, self-winding skeleton tour...","Jaquet Droz 2625SQ, self-winding skeleton tour...",7 days,"21,600 v.p.h.",30 jewels,Off-centered hours and minutes at 6 o'clock. T...,Hand-engraved and hand-painted 18-karat gold d...,
4,J0135270221,https://www.jaquet-droz.com/en/watches/grande-...,,Jaquet Droz,,GRANDE SECONDE,Tourbillon Skelet Skull,,,,...,Rubber strap,Stainless steel and plasma ceramic folding clasp,"Jaquet Droz 2625SQ, self-winding skeleton tour...","Jaquet Droz 2625SQ, self-winding skeleton tour...",7 days,"21,600 v.p.h.",30 jewels,Off-centered hours and minutes at 6 o'clock. T...,"18-karat white gold dial, hand-engraved. \r\nS...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,J899.000.064,https://www.jaquet-droz.com/en/watches/automat...,,Jaquet Droz,,AUTOMATA,Signing Machine,,,,...,,,"JAQUET DROZ MAS, mechanical movement hand-woun...","JAQUET DROZ MAS, mechanical movement hand-woun...",2 signatures,,8 jewels,,Hand-polished and satin-brushed stainless stee...,
63,J899.000.070,https://www.jaquet-droz.com/en/watches/automat...,,Jaquet Droz,,AUTOMATA,Whistling Machine,,,,...,,,"JAQUET DROZ MASIF, hand-wound automaton moveme...","JAQUET DROZ MASIF, hand-wound automaton moveme...",2 minutes,,,,"Hand-wound mechanical automaton, hand-painted ...",
64,J005024288,https://www.jaquet-droz.com/en/watches/atelier...,,Jaquet Droz,,Timepieces,Relief Tiger,,,,...,Rolled-edge hand-made black alligator,18-karat white gold ardillon buckle,"Jaquet Droz 2653.Si, self-winding mechanical m...","Jaquet Droz 2653.Si, self-winding mechanical m...",68 hours,"28,800 v.p.h.",28 jewels,Off-centered hours and minutes,"18-karat white gold and opal dial, black onyx ...",
65,J005023300,https://www.jaquet-droz.com/en/watches/atelier...,,Jaquet Droz,,Timepieces,Relief Tiger,,,,...,Rolled-edge hand-made black alligator,18-karat red gold ardillon buckle,"Jaquet Droz 2653.Si, self-winding mechanical m...","Jaquet Droz 2653.Si, self-winding mechanical m...",68 hours,"28,800 v.p.h.",28 jewels,Off-centered hours and minutes,"18-karat red gold and opal dial, black onyx su...",


In [None]:
#Number of watches per parent model (collection)
df_watches_data['parent_model'].value_counts()

parent_model
AUTOMATA          39
GRANDE SECONDE    23
Timepieces         3
SW                 1
ATELIERS D'ART     1
Name: count, dtype: int64

In [None]:
#Saving the dataframe (watches data) to a csv file

df_watches_data.to_csv('df_watches_data_67.csv', index=False, encoding='utf-8-sig')

<hr>

In [None]:
#Reading the data from a csv file
df = pd.read_csv('df_watches_data_67.csv')

#Seeing the dataframe
df.head()

Unnamed: 0,reference_number,watch_URL,type,brand,year_introduced,parent_model,specific_model,nickname,marketing_name,style,...,bracelet_color,clasp_type,movement,caliber,power_reserve,frequency,jewels,features,description,short_description
0,J030033240,https://www.jaquet-droz.com/en/watches/sw/tour...,,Jaquet Droz,,SW,Tourbillon SW,,,,...,Rolled-edge hand made inserted black alligator...,18-carat red gold folding clasp with black PVD...,"Jaquet Droz 25JD-S, self-winding tourbillon mo...","Jaquet Droz 25JD-S, self-winding tourbillon mo...",7 days,"21,600 v.p.h",31 jewels,Hours and minutes at 6 o'clock\r\nTourbillon f...,Black dial with rubber treatment.\r\n18-carat ...,
1,J013523242,https://www.jaquet-droz.com/en/watches/grande-...,,Jaquet Droz,,GRANDE SECONDE,Tourbillon Skelet,,,,...,Rolled-edge hand-made black alligator,18-karat red gold folding clasp,"Jaquet Droz 2625SQ, self-winding skeleton tour...","Jaquet Droz 2625SQ, self-winding skeleton tour...",7 days,"21,600 v.p.h.",30 jewels,Off-centered hours and minutes at 6 o'clock. T...,Sapphire dial with metallic sapphire base. \r\...,
2,J0135240021,https://www.jaquet-droz.com/en/watches/grande-...,,Jaquet Droz,,GRANDE SECONDE,Tourbillon Skelet,,,,...,Rubber strap,18-karat white gold folding clasp,"Jaquet Droz 2625SQ, self-winding skeleton tour...","Jaquet Droz 2625SQ, self-winding skeleton tour...",7 days,"21,600 v.p.h.",30 jewels,Off-centered hours and minutes at 6 o'clock. T...,Transparent sapphire dial and blue sapphire ba...,
3,J0135230011,https://www.jaquet-droz.com/en/watches/grande-...,,Jaquet Droz,,GRANDE SECONDE,Tourbillon Skelet Skull,,,,...,Rubber strap,18-karat red gold folding clasp,"Jaquet Droz 2625SQ, self-winding skeleton tour...","Jaquet Droz 2625SQ, self-winding skeleton tour...",7 days,"21,600 v.p.h.",30 jewels,Off-centered hours and minutes at 6 o'clock. T...,Hand-engraved and hand-painted 18-karat gold d...,
4,J0135270221,https://www.jaquet-droz.com/en/watches/grande-...,,Jaquet Droz,,GRANDE SECONDE,Tourbillon Skelet Skull,,,,...,Rubber strap,Stainless steel and plasma ceramic folding clasp,"Jaquet Droz 2625SQ, self-winding skeleton tour...","Jaquet Droz 2625SQ, self-winding skeleton tour...",7 days,"21,600 v.p.h.",30 jewels,Off-centered hours and minutes at 6 o'clock. T...,"18-karat white gold dial, hand-engraved. \r\nS...",


In [None]:
reference_unique = df['reference_number'].unique()
reference_unique

array(['J030033240', 'J013523242', 'J0135240021', 'J0135230011',
       'J0135270221', 'J0135250101', 'J0135250061', 'J0135250071',
       'J0135250081', 'J0135250011', 'J0135250091', 'J0135270021',
       'J0135270011', 'J0135270031', 'J0135270061', 'J0135270231',
       'J011033202', 'J013033200', 'J013034240', 'J013013200',
       'J013014270', 'J013014271', 'J013013580', 'J013013281',
       'J013033243', 'J0328330011', 'J0328330151', 'J031034205',
       'J031033202', 'J031034203', 'J031033206', 'J031034210',
       'J031033211', 'J0327370011', 'J0327330011', 'J0327330031',
       'J0327330041', 'J0327330061', 'J0327370021', 'J031534240',
       'J031534203', 'J031533240', 'J031533200', 'J031534200',
       'J031533202', 'J032533275', 'J032534270', 'J032533271',
       'J032534271', 'J032633270', 'J032634270', 'J033033202',
       'J033034200', 'J033033206', 'J033033205', 'J032003200',
       'J032004220', 'J032004270', 'J032003220', 'J032004221',
       'J032003271', 'J032004201'

In [None]:
#Seeing the info of the watches
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 40 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   reference_number   67 non-null     object 
 1   watch_URL          67 non-null     object 
 2   type               0 non-null      float64
 3   brand              67 non-null     object 
 4   year_introduced    0 non-null      float64
 5   parent_model       67 non-null     object 
 6   specific_model     67 non-null     object 
 7   nickname           0 non-null      float64
 8   marketing_name     0 non-null      float64
 9   style              0 non-null      float64
 10  currency           0 non-null      float64
 11  price              0 non-null      float64
 12  image_URL          67 non-null     object 
 13  made_in            0 non-null      float64
 14  case_shape         0 non-null      float64
 15  case_material      67 non-null     object 
 16  case_finish        0 non-nul

In [None]:
#Checking for the null (missing) features in our watches' data
df.isna().sum()
#Some watches are missing certain features: the website omits them instead of listing them as absent,
#so they show up as missing values here

reference_number      0
watch_URL             0
type                 67
brand                 0
year_introduced      67
parent_model          0
specific_model        0
nickname             67
marketing_name       67
style                67
currency             67
price                67
image_URL             0
made_in              67
case_shape           67
case_material         0
case_finish          67
caseback             67
diameter              0
between_lugs         67
lug_to_lug           67
case_thickness        0
bezel_material       67
bezel_color          67
crystal              67
water_resistance      4
weight               67
dial_color            3
numerals             67
bracelet_material     3
bracelet_color        3
clasp_type            3
movement              0
caliber               0
power_reserve         0
frequency             3
jewels                2
features              3
description           0
short_description    67
dtype: int64
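
The placeholder columns report 67 missing values even though the scraping cell filled them with empty strings. A likely reason is that an empty string becomes an empty CSV field, which `pd.read_csv` reads back as `NaN` by default. The sketch below illustrates this with two small records and an in-memory buffer instead of a file; the record contents are trimmed down for the example.

```python
import io
import pandas as pd

# Two trimmed-down records: 'price' is an empty-string placeholder,
# and the second watch has no jewels value.
records = [
    {"reference_number": "J005024288", "price": "", "jewels": "28 jewels"},
    {"reference_number": "J899.000.070", "price": "", "jewels": None},
]
frame = pd.DataFrame(records)

# Round-trip through CSV text, as the notebook does through a file.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

missing_before = frame.isna().sum()
missing_after = reloaded.isna().sum()
print(missing_after)
```

Before the round trip the empty strings do not count as missing; after it, they do. Passing `keep_default_na=False` to `pd.read_csv` would keep them as empty strings if that distinction matters.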

<hr>