## Capstone Project - 2
**Scrape data from the URL** https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
<br>**You need to pull the following:**
<br>a. Product name
<br>b. Price
<br>c. Description
<br>d. Rating
<br>e. Review count
<br>**Load the scraped data into a DataFrame, then create the following charts/graphs:**
<br>1. **Top 10 products by price**
<br>2. **Top 10 products by rating**
<br>3. **Top 10 products by number of reviews**

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

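url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')

# Extract each field into its own list. The lists line up row by row only if
# every product card contains all five elements.
names = [a.text.strip() for a in soup.select('a.title')]
price = [h4.text.strip() for h4 in soup.select('h4.price')]
desc = [p.text.strip() for p in soup.select('p.description')]
# The rating is stored in the data-rating attribute, not in the element text
ratings = [p['data-rating'] for p in soup.select('p[data-rating]')]
# The review text starts with the count; keep only that first token
reviews = [p.text.split()[0] for p in soup.select('p.review-count')]

# Create a DataFrame
df = pd.DataFrame({'name': names, 'price': price, 'desc': desc, 'ratings': ratings, 'reviews': reviews})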

In [2]:
print(df)

                  name    price  \
0     Asus VivoBook...  $295.99   
1    Prestigio Smar...     $299   
2    Prestigio Smar...     $299   
3        Aspire E1-510  $306.99   
4    Lenovo V110-15...  $321.94   
..                 ...      ...   
112   Lenovo Legion...    $1399   
113  Asus ROG Strix...    $1399   
114  Asus ROG Strix...    $1769   
115  Asus ROG Strix...    $1769   
116  Asus ROG Strix...    $1799   

                                                  desc ratings reviews  
0    Asus VivoBook X441NA-GA190 Chocolate Black, 14...       3      14  
1    Prestigio SmartBook 133S Dark Grey, 13.3" FHD ...       2       8  
2    Prestigio SmartBook 133S Gold, 13.3" FHD IPS, ...       4      12  
3      15.6", Pentium N3520 2.16GHz, 4GB, 500GB, Linux       3       2  
4    Lenovo V110-15IAP, 15.6" HD, Celeron N3350 1.1...       3       5  
..                                                 ...     ...     ...  
112  Lenovo Legion Y720, 15.6" FHD IPS, Core i7-770...       3      

In [3]:
df

,name,price,desc,ratings,reviews
0,Asus VivoBook...,$295.99,"Asus VivoBook X441NA-GA190 Chocolate Black, 14...",3,14
1,Prestigio Smar...,$299,"Prestigio SmartBook 133S Dark Grey, 13.3"" FHD ...",2,8
2,Prestigio Smar...,$299,"Prestigio SmartBook 133S Gold, 13.3"" FHD IPS, ...",4,12
3,Aspire E1-510,$306.99,"15.6"", Pentium N3520 2.16GHz, 4GB, 500GB, Linux",3,2
4,Lenovo V110-15...,$321.94,"Lenovo V110-15IAP, 15.6"" HD, Celeron N3350 1.1...",3,5
...,...,...,...,...,...
112,Lenovo Legion...,$1399,"Lenovo Legion Y720, 15.6"" FHD IPS, Core i7-770...",3,8
113,Asus ROG Strix...,$1399,"Asus ROG Strix GL702VM-GC146T, 17.3"" FHD, Core...",3,10
114,Asus ROG Strix...,$1769,"Asus ROG Strix GL702ZC-GC154T, 17.3"" FHD, Ryze...",4,7
115,Asus ROG Strix...,$1769,"Asus ROG Strix GL702ZC-GC209T, 17.3"" FHD IPS, ...",1,8


In [4]:
# Clean the price column: strip the literal '$' and convert to float.
# regex=False makes '$' a plain character rather than the regex end-of-string anchor.
df['price'] = df['price'].str.replace('$', '', regex=False).astype(float)

# Convert ratings and reviews to int
df['ratings'] = df['ratings'].astype(int)
df['reviews'] = df['reviews'].astype(int)


# Sort by price, ratings, and reviews in descending order and keep the first 10 rows of each
top_10_products_by_price = df.sort_values('price', ascending=False).head(10)
top_10_products_by_ratings = df.sort_values('ratings', ascending=False).head(10)
top_10_products_by_reviews = df.sort_values('reviews', ascending=False).head(10)

<strong>1. Top 10 Products By Price</strong>

In [5]:
print("Top 10 Products by Price:")
print(top_10_products_by_price[['name', 'price']])

Top 10 Products by Price:
                  name    price
116  Asus ROG Strix...  1799.00
114  Asus ROG Strix...  1769.00
115  Asus ROG Strix...  1769.00
112   Lenovo Legion...  1399.00
113  Asus ROG Strix...  1399.00
111  Asus ASUSPRO B...  1381.13
110  Toshiba Porteg...  1366.32
109  Lenovo ThinkPa...  1362.24
108  Lenovo ThinkPa...  1349.23
107   Apple MacBook...  1347.78


In [6]:
import plotly.express as px

fig = px.bar(top_10_products_by_price, x = 'price', y = 'name', text = 'price', color= 'price', hover_data= ['name', 'price', 'desc', 'ratings', 'reviews'])
fig.update_layout(title = 'Top 10 Products By Price', xaxis_title = 'Price ($)', yaxis_title = 'Name of Products', yaxis={'categoryorder':'total ascending'})
fig.show()

In [7]:
fig2 = px.treemap(top_10_products_by_price, path=['name', 'price'], values='price', color='price')
fig2.update_layout(title = 'Top 10 Products By Price in Treemap')
fig2.show()

<strong>2. Top 10 Products By Ratings</strong>

In [8]:
print("Top 10 Products by Ratings:")
print(top_10_products_by_ratings[['name', 'ratings']])

Top 10 Products by Ratings:
                 name  ratings
2   Prestigio Smar...        4
13   Asus VivoBook...        4
15   Acer Aspire 3...        4
9   Acer Aspire ES...        4
10   Acer Aspire 3...        4
17  Acer Aspire ES...        4
18   Asus VivoBook...        4
35   Acer Aspire 3...        4
23  Acer Extensa 1...        4
46            ProBook        4


In [9]:
import plotly.express as px

fig1 = px.bar(top_10_products_by_ratings, x ='name', y = 'ratings', text = 'ratings', color = 'name', hover_data = ['name', 'price', 'desc', 'ratings', 'reviews'])
fig1.update_layout(title = 'Top 10 Products By Ratings', xaxis_title = 'Product Name', yaxis_title = 'Rating (out of 5)', xaxis={'categoryorder':'total descending'})
fig1.show()

<strong>3. Top 10 Products By Number of Reviews</strong>

In [10]:
print("Top 10 Products by Reviews:")
print(top_10_products_by_reviews[['name', 'reviews']])

Top 10 Products by Reviews:
                 name  reviews
0    Asus VivoBook...       14
17  Acer Aspire ES...       14
34     Dell Vostro 15       14
68  Acer Nitro 5 A...       14
53  Lenovo ThinkPa...       14
81   Acer Predator...       14
65   Dell Latitude...       14
91  Asus ZenBook U...       14
66   Lenovo Legion...       13
95   Dell Latitude...       13


In [11]:
import plotly.express as px

fig2 = px.bar(top_10_products_by_reviews, x = 'reviews', y = 'name', text = 'reviews', color = 'name', hover_data = ['name', 'price', 'desc', 'ratings', 'reviews'])
fig2.update_layout(title = 'Top 10 Products By Number of Reviews', xaxis_title = 'Number of Reviews', yaxis_title = 'Product Names', yaxis = {'categoryorder':'total ascending'})
fig2.show()

In [12]:
fig2 = px.treemap(top_10_products_by_reviews, path=['name', 'reviews'], values='reviews', color='reviews')
fig2.update_layout(title = 'Top 10 Products By Number of Reviews in Treemap')
fig2.show()

# Capstone Project 2: Web Scraping and Data Visualization

## Objective
The goal of this project is to scrape data from the e-commerce website [Web Scraper Test Site](https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops) and perform the following tasks:
1. Extract specific product details.
2. Create a DataFrame for analysis.
3. Visualize the data using charts and graphs.

---

## Data to Extract
The following details need to be scraped for each product:
- **Product Name**: The name of the product.
- **Price**: The price of the product.
- **Description**: A brief description of the product.
- **Rating**: The product's rating (out of 5).
- **Review Count**: The number of reviews for the product.

---

## Steps to Complete the Project
### 1. **Web Scraping**
- Use Python libraries like `requests` and `BeautifulSoup` to scrape the data from the given URL.
- Extract the required fields (name, price, description, rating, and review count) from the HTML structure of the webpage.
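
The extraction step can be sketched offline against a small HTML sample. Everything in `SAMPLE_HTML` is hypothetical: the product names, the `div.thumbnail` wrapper, and the `title` attribute on the link are assumptions made for illustration, while the inner selectors mirror the ones used in the notebook. The sketch parses one card at a time, so a card with a missing field fails at that card instead of silently shifting a column.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the notebook's selectors; not copied from the live page.
SAMPLE_HTML = """
<div class="thumbnail">
  <h4 class="price">$295.99</h4>
  <a class="title" title="Example Laptop Alpha 14">Example Lapt...</a>
  <p class="description">Example Laptop Alpha, 14 inch, 4GB, 500GB</p>
  <p class="review-count">14 reviews</p>
  <p data-rating="3"></p>
</div>
<div class="thumbnail">
  <h4 class="price">$299</h4>
  <a class="title" title="Example Laptop Beta 13">Example Lapt...</a>
  <p class="description">Example Laptop Beta, 13 inch, 4GB, 32GB</p>
  <p class="review-count">8 reviews</p>
  <p data-rating="2"></p>
</div>
"""


def parse_products(html):
    """Return one dict per product card (card selector is an assumption)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.thumbnail"):
        link = card.select_one("a.title")
        rows.append({
            # Prefer the full name in the title attribute, if present,
            # over the truncated link text.
            "name": link.get("title", link.get_text(strip=True)),
            "price": card.select_one("h4.price").get_text(strip=True),
            "desc": card.select_one("p.description").get_text(strip=True),
            "ratings": card.select_one("p[data-rating]")["data-rating"],
            "reviews": card.select_one("p.review-count").get_text(strip=True).split()[0],
        })
    return rows


products = parse_products(SAMPLE_HTML)
```

Passing `products` to `pd.DataFrame` yields the same five columns as the notebook. Whether the live page exposes full names through a `title` attribute should be checked in the browser's inspector before relying on it.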

### 2. **Data Cleaning**
- Convert the scraped data into a structured format using `pandas`.
- Clean the data:
    - Remove unnecessary characters (e.g., `$` from prices).
    - Convert data types (e.g., price to `float`, ratings and reviews to `int`).
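
A minimal sketch of the cleaning step, using the first three rows of the scraped output as input. The helper name `clean` is introduced here for illustration only. Passing `regex=False` matters because `$` is otherwise a regular-expression anchor, and the default for `regex` has changed between pandas versions.

```python
import pandas as pd

# First three rows of the scraped output, still stored as strings
raw = pd.DataFrame({
    "price": ["$295.99", "$299", "$299"],
    "ratings": ["3", "2", "4"],
    "reviews": ["14", "8", "12"],
})


def clean(df):
    """Return a copy with numeric price, ratings, and reviews columns."""
    out = df.copy()
    # Treat '$' as a literal character, then parse the remainder as float
    out["price"] = out["price"].str.replace("$", "", regex=False).astype(float)
    out["ratings"] = out["ratings"].astype(int)
    out["reviews"] = out["reviews"].astype(int)
    return out


cleaned = clean(raw)
```

Working on a copy leaves `raw` untouched, so the cell can be re-run without failing on values that have already been converted.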

### 3. **Data Analysis**
- Sort the data to identify:
    - Top 10 products by price.
    - Top 10 products by rating.
    - Top 10 products by the number of reviews.
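
Ratings and review counts contain many ties, so which tied rows land in the top 10 depends on the sort's internal ordering. One possible way to make the selection reproducible is to add a secondary sort key. The sketch below uses hypothetical data, and the helper `top_n` and its `tie_breaker` parameter are not part of the notebook.

```python
import pandas as pd

# Hypothetical cleaned data with ties in ratings and reviews
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "price": [300.0, 1799.0, 950.0, 1399.0],
    "ratings": [4, 3, 4, 2],
    "reviews": [5, 14, 9, 14],
})


def top_n(frame, column, n=10, tie_breaker="price"):
    """Top n rows by `column`, breaking ties by `tie_breaker` (both descending)."""
    keys = [column] if column == tie_breaker else [column, tie_breaker]
    return frame.sort_values(keys, ascending=False).head(n)


by_rating = top_n(df, "ratings", n=2)
by_reviews = top_n(df, "reviews", n=2)
by_price = top_n(df, "price", n=2)
```

Breaking ties by price is an arbitrary choice made for the example; any column that gives a meaningful order would serve.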

### 4. **Data Visualization**
- Use `plotly.express` to create interactive visualizations:
    - **Bar Charts**: Display the top 10 products for each category (price, rating, reviews).
    - **Treemaps**: Provide a hierarchical view of the top 10 products.

---

## Deliverables
1. **DataFrame**: A structured table containing the scraped and cleaned data.
2. **Visualizations**:
    - Bar charts for top 10 products by price, rating, and reviews.
    - Treemaps for top 10 products by price and reviews.
3. **Insights**: Key observations from the data analysis.

---

## Tools and Libraries
- **Web Scraping**: `requests`, `BeautifulSoup`
- **Data Manipulation**: `pandas`
- **Data Visualization**: `plotly.express`

---

## Summary
This project demonstrates the ability to:
1. Scrape and process data from a website.
2. Perform data cleaning and analysis.
3. Create meaningful visualizations to derive insights.