# Capstone: Topic Modelling on AMD vs Nvidia GPU

## Contents
- Data Extraction
- Data Cleaning
- [EDA](#EDA)
- [Prepare data for LDA Analysis](#Prepare-data-for-LDA-Analysis)
- [LDA Model Training](#LDA-Model-Training)
- Model creation
- Model Evaluation

In [40]:
# Common imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pprint import pprint
import os

import re
# NLTK Library
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Import PRAW package
import praw
from praw.models import MoreComments

# Gensim library
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora

import pyLDAvis.gensim
import pickle 
import pyLDAvis

# Detect non-english words
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0

# Detect non-english words using spacy
import spacy
from spacy_langdetect import LanguageDetector
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)


# Import the wordcloud library
from wordcloud import WordCloud

%matplotlib inline

In [3]:
# Set the max rows and columns for Pandas
pd.options.display.max_columns = 100
pd.options.display.max_rows = 200

In [4]:
# Set the style use
plt.style.use('ggplot')

# Data Extraction from Reddit using PRAW

In [5]:
 reddit = praw.Reddit(
     client_id="IR7Y4cUBrVAbGg",
     client_secret="podr43kzztn_CoVgtNQiNpDfjI5mjg",
     user_agent="gpu_scrapper"
 )

In [6]:
print(reddit.read_only)  # Output: True

True
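
The cells in this notebook pass the API credentials to `praw.Reddit` as string literals. A minimal sketch of one alternative is to read them from environment variables, so the notebook can be shared without exposing them. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `reddit_credentials` are assumptions made for this illustration; they are not part of PRAW.

```python
import os


def reddit_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from environment variables.

    The variable names below are hypothetical; pick whatever suits your setup.
    """
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
    }
    # Username and password are only needed for an authorized (non read-only) instance
    optional = {
        "username": "REDDIT_USERNAME",
        "password": "REDDIT_PASSWORD",
    }

    kwargs = {"user_agent": env.get("REDDIT_USER_AGENT", "gpu_scrapper")}
    for key, var in required.items():
        if var not in env:
            raise KeyError(f"missing environment variable {var}")
        kwargs[key] = env[var]
    for key, var in optional.items():
        if var in env:
            kwargs[key] = env[var]
    return kwargs


# Usage (requires praw and the variables above to be set):
# reddit = praw.Reddit(**reddit_credentials())
```

When only the client id and secret are set, the resulting instance is read-only; adding the username and password yields an authorized instance.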


## Obtain submissions from r/learnpython

In [10]:
# continued from code above

for submission in reddit.subreddit("learnpython").hot(limit=30):
    print(submission.title)

# Output: titles of up to 30 hot submissions

Ask Anything Monday - Weekly Thread
After months of learning, I finally was able to code a discord bot!
Suggestion for "elegant" and "efficient" coding guide
First project
Best IDE for python
Which one do you prefer in web scraping? BeautifulSoup or LXML?
What exactly is "in" in Python?
Pandas Data Cleaning
Multithreading on micro:bit
Lambda function seems stuck in loop.
cant find yahoo currency API Json File
[question] data analysis
Python install suggestion
How can I make this more pythonic?
Trying to install pyperclip and cant
Filtering data from 2gb csv over 5mil lines
What do I do here?
What's the difference between Modules and Packages in Python?
How to continue going through if statements, even if one fails?
Attempting to insert the right article ("a" or "an") before noun starting with a vowel
API response in JSON + additional external data - saving to a database (PostGRES?)
Reminder Program
File reading numbers
Creating and centering labels using PyQt5?
Assigning the contents o

## Authorized Reddit instances

In [11]:
 reddit = praw.Reddit(
     client_id="IR7Y4cUBrVAbGg",
     client_secret="podr43kzztn_CoVgtNQiNpDfjI5mjg",
     user_agent="gpu_scrapper",
     username="leader2345",
     password="rPLHgrS8"
 )

In [12]:
print(reddit.read_only)  # Output: False

False


## Obtain a subreddit

In [17]:
crypto_sub = reddit.subreddit("cryptocurrency")

print(crypto_sub.display_name)  # output: cryptocurrency
print(crypto_sub.title)         # output: Cryptocurrency News & Discussion
print(crypto_sub.description)   # output: the subreddit's sidebar text (Markdown)

cryptocurrency
Cryptocurrency News & Discussion
* [Open Mod Positions](http://bit.ly/2lpgEKX)
* [Rules](http://bit.ly/2mMN3wE)
* [Policies](http://bit.ly/2lpEAhv)
* **Flair Filters**
    1. [Adoption](https://goo.gl/DJQbeY)
    1. [Announcements](https://goo.gl/5DLv5e)
    1. [Clients](https://goo.gl/cRVehz)
    1. [Comedy](https://goo.gl/YP7E55)
    1. [Critical-Discussions](https://goo.gl/YDj2th)
    1. [Creative](https://goo.gl/kfgBwN)
    1. [Development](https://goo.gl/aOyxnC)
    1. [Educational](https://goo.gl/iGTkQQ)
    1. [Exchanges](https://goo.gl/GP6ppk)
    1. [Finance](https://goo.gl/KsVyST) 
    1. [Focused-Discussions](https://goo.gl/VUqmLc)
    1. [General-Discussions](https://goo.gl/CMyLFT)
    1. [General-News](https://goo.gl/EotdG2)
    1. [Innovations](https://goo.gl/le0scJ)
    1. [Legacy](https://goo.gl/EfwtTA)
    1. [Media](https://goo.gl/ItCfRS)
    1. [Meta](https://goo.gl/BkiVtj)
    1. [Metrics](https://goo.gl/VVBKa1)
    1. [Mining-Staking](https://goo.gl/

## Obtain `Submission` Instances from a subreddit

In [21]:
for submission in crypto_sub.hot(limit=10):
    print(submission.title)
    print(submission.score)
    print(submission.id)
    print(submission.url)

Join the Crypto Currency Discord
40
kth255
https://www.reddit.com/r/CryptoCurrency/comments/kth255/join_the_crypto_currency_discord/
Daily Discussion - January 10, 2021 (GMT+0)
69
ku2q5e
https://www.reddit.com/r/CryptoCurrency/comments/ku2q5e/daily_discussion_january_10_2021_gmt0/
Me after buying ethereum at $1190 and selling at $1200
7525
ktraow
https://i.redd.it/ma1eqgk6bba61.jpg
I just need like 2 more baby
396
ku03ya
https://i.redd.it/hvsrrg5kmda61.jpg
Whenever someone asks me for Crypto advice
177
ku3dyd
https://i.redd.it/0hm73q0ihea61.jpg
I'll jump back 10 years
2626
ktnusv
https://i.redd.it/u5xf3wgnz9a61.jpg
Elon wants to get paid in Bitcoin!
518
ktvcnp
https://i.redd.it/vslfdue3fca61.png
Don't daytrade or you end up like this guy
137
ku3csg
https://v.redd.it/t6txul42hea61
what do you guys think about my crypto pizza!
1077
ktqmfq
https://i.redd.it/lo7q6jlm2ba61.png
It's going to zero! The real shitcoin is the USD.
223
ktzv3a
https://i.redd.it/erhf2wa8kda61.jpg


In [24]:
# assume you have a Reddit instance bound to variable `reddit`
submission = reddit.submission(id="ktzv3a")
print(submission.title)  # Output: It's going to zero! The real shitcoin is the USD.

# or
# submission = reddit.submission(url='https://www.reddit.com/...')

It's going to zero! The real shitcoin is the USD.


## Obtain `Comment` Instances

In [25]:
# assume `submission` is the Submission instance fetched above
top_level_comments = list(submission.comments)
all_comments = submission.comments.list()

In [30]:
all_comments

[Comment(id='gip49q9'),
 Comment(id='gip4p5l'),
 Comment(id='gip4woq'),
 Comment(id='gip6wh2'),
 Comment(id='gipj5u0'),
 Comment(id='gipzwqx'),
 Comment(id='giqeott'),
 Comment(id='gipxa58'),
 Comment(id='gip1ob5'),
 Comment(id='gipae2w'),
 Comment(id='gipdf59'),
 Comment(id='gipnuk6'),
 Comment(id='gipolwg'),
 Comment(id='giq005q'),
 Comment(id='giq6j45'),
 Comment(id='giqad32'),
 Comment(id='giqaet0'),
 Comment(id='giqifop'),
 Comment(id='giqo043'),
 Comment(id='giqo4za'),
 Comment(id='gip2uuv'),
 Comment(id='gipsnqg'),
 Comment(id='gipcnc9'),
 Comment(id='gipmsoe'),
 Comment(id='giqgukx'),
 Comment(id='gip36ym'),
 Comment(id='gip3cl6'),
 Comment(id='gipnvra'),
 Comment(id='gipy9l7'),
 Comment(id='gip3ebd'),
 Comment(id='gip40dr'),
 Comment(id='gipajsr'),
 Comment(id='giq2psp'),
 Comment(id='gip4lgx'),
 Comment(id='gipbrad'),
 Comment(id='giqirvc'),
 Comment(id='gipcb0b'),
 Comment(id='gipdssw'),
 Comment(id='gipkvyx'),
 Comment(id='gipi8fe')]

In [32]:
# assume you have a Reddit instance bound to variable `reddit`
submission = reddit.submission(id="ktzv3a")
submission.comment_sort = "new"
top_level_comments = list(submission.comments)

In [33]:
top_level_comments

[Comment(id='giqo4za'),
 Comment(id='giqo043'),
 Comment(id='giqifop'),
 Comment(id='giqeott'),
 Comment(id='giqaet0'),
 Comment(id='giqad32'),
 Comment(id='giq6j45'),
 Comment(id='giq005q'),
 Comment(id='gipzwqx'),
 Comment(id='gipxa58'),
 Comment(id='gipsnqg'),
 Comment(id='gipolwg'),
 Comment(id='gipnuk6'),
 Comment(id='gipj5u0'),
 Comment(id='gipdf59'),
 Comment(id='gipae2w'),
 Comment(id='gip6wh2'),
 Comment(id='gip4woq'),
 Comment(id='gip4p5l'),
 Comment(id='gip49q9'),
 Comment(id='gip2uuv'),
 Comment(id='gip1ob5')]

In [36]:
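# `pprint` is already imported above via `from pprint import pprint`;
# re-importing the module here would shadow that function
# assume you have a Reddit instance bound to variable `reddit`
submission = reddit.submission(id="39zje0")
# print(submission.title) # to make it non-lazy
pprint(vars(submission))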

{'_comments_by_id': {},
 '_fetched': False,
 '_reddit': <praw.reddit.Reddit object at 0x00000256964B7700>,
 'comment_limit': 2048,
 'comment_sort': 'confidence',
 'id': '39zje0'}


## Extracting comments with PRAW

In [37]:
submission = reddit.submission(id="3g1jfi")

In [38]:
for top_level_comment in submission.comments:
    print(top_level_comment.body)

[deleted]
[Trusted] Download

[Fast] Download 

ALWAYS NEITHER OF THESE 
"i must have chosen correct cause i got a congratulation saying i was the millionth person to download the file and to click here to claim my free ipad" 
It is said that he who holds the Sacred Chalice of Ad-block shall find the One Button.
If you need help, just use the Ask toolbar!
This really bothers me.

In Indiana Jones and the Last Crusade the dude didn't get his face melted off after choosing poorly, he just started aging really fast until he shriveled up into dust.

Raiders is where dudes' faces melted off.
The one with the magnet.
http://i.imgur.com/vpM80kV.png
How about hovering your mouse over all of them to see where they lead you to? No? Noone does that?
It's fucking infuriating! I have gotten a lot better at recognizing the bullshit though.
unless it's one of those pages where the first click, no matter where on the page, will always open a new tab of ad. After that, the 2nd click will work... assumi

AttributeError: 'MoreComments' object has no attribute 'body'

In [41]:
for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    print(top_level_comment.body)

[deleted]
[Trusted] Download

[Fast] Download 

ALWAYS NEITHER OF THESE 
"i must have chosen correct cause i got a congratulation saying i was the millionth person to download the file and to click here to claim my free ipad" 
It is said that he who holds the Sacred Chalice of Ad-block shall find the One Button.
If you need help, just use the Ask toolbar!
This really bothers me.

In Indiana Jones and the Last Crusade the dude didn't get his face melted off after choosing poorly, he just started aging really fast until he shriveled up into dust.

Raiders is where dudes' faces melted off.
The one with the magnet.
http://i.imgur.com/vpM80kV.png
How about hovering your mouse over all of them to see where they lead you to? No? Noone does that?
It's fucking infuriating! I have gotten a lot better at recognizing the bullshit though.
unless it's one of those pages where the first click, no matter where on the page, will always open a new tab of ad. After that, the 2nd click will work... assumi

In [45]:
submission.comments.replace_more(limit=0)
for top_level_comment in submission.comments:
    print(top_level_comment.body)

[deleted]
[Trusted] Download

[Fast] Download 

ALWAYS NEITHER OF THESE 
"i must have chosen correct cause i got a congratulation saying i was the millionth person to download the file and to click here to claim my free ipad" 
It is said that he who holds the Sacred Chalice of Ad-block shall find the One Button.
If you need help, just use the Ask toolbar!
This really bothers me.

In Indiana Jones and the Last Crusade the dude didn't get his face melted off after choosing poorly, he just started aging really fast until he shriveled up into dust.

Raiders is where dudes' faces melted off.
The one with the magnet.
http://i.imgur.com/vpM80kV.png
How about hovering your mouse over all of them to see where they lead you to? No? Noone does that?
It's fucking infuriating! I have gotten a lot better at recognizing the bullshit though.
unless it's one of those pages where the first click, no matter where on the page, will always open a new tab of ad. After that, the 2nd click will work... assumi

In [44]:
submission.comments.replace_more(limit=None)
for top_level_comment in submission.comments:
    print(top_level_comment.body)

[deleted]
[Trusted] Download

[Fast] Download 

ALWAYS NEITHER OF THESE 
"i must have chosen correct cause i got a congratulation saying i was the millionth person to download the file and to click here to claim my free ipad" 
It is said that he who holds the Sacred Chalice of Ad-block shall find the One Button.
If you need help, just use the Ask toolbar!
This really bothers me.

In Indiana Jones and the Last Crusade the dude didn't get his face melted off after choosing poorly, he just started aging really fast until he shriveled up into dust.

Raiders is where dudes' faces melted off.
The one with the magnet.
http://i.imgur.com/vpM80kV.png
How about hovering your mouse over all of them to see where they lead you to? No? Noone does that?
It's fucking infuriating! I have gotten a lot better at recognizing the bullshit though.
unless it's one of those pages where the first click, no matter where on the page, will always open a new tab of ad. After that, the 2nd click will work... assumi

## Obtaining the replies of the top comments

In [46]:
submission.comments.replace_more(limit=None)
for top_level_comment in submission.comments:
    for second_level_comment in top_level_comment.replies:
        print(second_level_comment.body)

Like the holy grail. 
[This is not the button of a carpenter...](http://i.imgur.com/h8r0F.jpg)
But sometimes the real button is big and green, then I don't trust it.
The button of real be not a rounded rectangle
[deleted]
None of the buttons are real and it's actually a really small link in plain text on the bottom on the page saying "[download >>>](http://www.azlyrics.com/lyrics/rickastley/nevergonnagiveyouup.html)"
 Where many elders have tried, all have failed. There rises an entity. Legend tells Of a legendary tool. One that  allows the user to instantly select the right button. It can only be found in the depths of the chrome store zone. We call it... AdBlock
Usually it's just a little plaintext blue link in a sea of huge flashing buttons.
You mean the magnet button
Or the one that's left after adblock plus takes care of the rest.
Not on zippyshare though..
it's always the blue underlined anchor text
Its ways the blue color. 
Why would an advertising company use a plain text link?

#### Obtain all comments with a breadth-first traversal

In [48]:
submission.comments.replace_more(limit=None)
comment_queue = submission.comments[:]  # Seed with top-level
while comment_queue:
    comment = comment_queue.pop(0)
    print(comment.body)
    comment_queue.extend(comment.replies)

[deleted]
[Trusted] Download

[Fast] Download 

ALWAYS NEITHER OF THESE 
"i must have chosen correct cause i got a congratulation saying i was the millionth person to download the file and to click here to claim my free ipad" 
It is said that he who holds the Sacred Chalice of Ad-block shall find the One Button.
If you need help, just use the Ask toolbar!
This really bothers me.

In Indiana Jones and the Last Crusade the dude didn't get his face melted off after choosing poorly, he just started aging really fast until he shriveled up into dust.

Raiders is where dudes' faces melted off.
The one with the magnet.
http://i.imgur.com/vpM80kV.png
How about hovering your mouse over all of them to see where they lead you to? No? Noone does that?
It's fucking infuriating! I have gotten a lot better at recognizing the bullshit though.
unless it's one of those pages where the first click, no matter where on the page, will always open a new tab of ad. After that, the 2nd click will work... assumi
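
The queue-based loop above visits the comment tree breadth-first: every top-level comment first, then every reply to those, and so on. The sketch below reproduces that ordering on stand-in objects so it runs without Reddit access. `FakeComment` and `iter_breadth_first` are hypothetical names used only for this illustration, and `collections.deque` is swapped in for the list so that removing the front element does not shift the remaining items.

```python
from collections import deque


class FakeComment:
    """Stand-in for a PRAW comment: only `body` and `replies` are modelled."""

    def __init__(self, body, replies=()):
        self.body = body
        self.replies = list(replies)


def iter_breadth_first(top_level_comments):
    """Yield comments level by level, starting from the top-level ones."""
    queue = deque(top_level_comments)
    while queue:
        comment = queue.popleft()
        yield comment
        # Replies join the back of the queue, so they are visited after
        # every comment of the current level
        queue.extend(comment.replies)


thread = [
    FakeComment("a", [FakeComment("a1", [FakeComment("a1x")])]),
    FakeComment("b", [FakeComment("b1")]),
]
bodies = [comment.body for comment in iter_breadth_first(thread)]
print(bodies)  # ['a', 'b', 'a1', 'b1', 'a1x']
```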

In [47]:
submission.comments.replace_more(limit=None)
for comment in submission.comments.list():
    print(comment.body)

[deleted]
[Trusted] Download

[Fast] Download 

ALWAYS NEITHER OF THESE 
"i must have chosen correct cause i got a congratulation saying i was the millionth person to download the file and to click here to claim my free ipad" 
It is said that he who holds the Sacred Chalice of Ad-block shall find the One Button.
If you need help, just use the Ask toolbar!
This really bothers me.

In Indiana Jones and the Last Crusade the dude didn't get his face melted off after choosing poorly, he just started aging really fast until he shriveled up into dust.

Raiders is where dudes' faces melted off.
The one with the magnet.
http://i.imgur.com/vpM80kV.png
How about hovering your mouse over all of them to see where they lead you to? No? Noone does that?
It's fucking infuriating! I have gotten a lot better at recognizing the bullshit though.
unless it's one of those pages where the first click, no matter where on the page, will always open a new tab of ad. After that, the 2nd click will work... assumi

# Obtain the comments from the RTX 3080 submission

## Setting up the reddit instance

In [49]:
 reddit = praw.Reddit(
     client_id="IR7Y4cUBrVAbGg",
     client_secret="podr43kzztn_CoVgtNQiNpDfjI5mjg",
     user_agent="gpu_scrapper",
     username="leader2345",
     password="rPLHgrS8"
 )

In [50]:
print(reddit.read_only)  # Output: False

False


In [51]:
# assume you have a Reddit instance bound to variable `reddit`
submission = reddit.submission(id="itw87x")

### Top level comments only extraction

In [52]:
for top_level_comment in submission.comments:
    print(top_level_comment.body)

# I'm having issue editing the original post as Reddit says it's too long. Here are some additional information I meant to add above

# Aggregate Performance Summary

[From this aggregate post here](https://new.reddit.com/r/nvidia/comments/iu2wh5/nvidia_geforce_rtx_3080_meta_review_1910/)

|**RTX 3080 vs**|**4K % Improvement**|
|:-|:-|
|RTX 2080 Ti|\+32%|
|RTX 2080 Super|\+58%|
|RTX 2080|\+72%|
|RTX 2070 Super|\+83%|
|GTX 1080 Ti|\+88%|
|GTX 1080|\+150%|
|5700 XT|\+98%|
|Radeon VII|\+84%|
|Vega 64|\+142%|

# Written Reviews

[Forbes](https://www.forbes.com/sites/antonyleather/2020/09/16/nvidia-rtx-3080-review-just-how-fast-is-it/#56ba3148421c)

[Gamers Nexus Article](https://www.gamersnexus.net/hwreviews/3618-nvidia-rtx-3080-founders-edition-review-benchmarks)

[Jon Peddie Research](https://www.jonpeddie.com/reviews/testing-nvidia-rtx-3080)

[Puget Systems](https://www.pugetsystems.com/labs/articles/NVIDIA-GeForce-RTX-3080-10GB-Review-Roundup-1879/)

[Techspot](https://www.techspot.com

AttributeError: 'MoreComments' object has no attribute 'body'

### First and Second level comments

In [53]:
submission.comments.replace_more(limit=None)
for top_level_comment in submission.comments:
    for second_level_comment in top_level_comment.replies:
        print(second_level_comment.body)

It's probably safe to assume that the performance increase would be somewhere between the 4k and 1440p increase, right? Not a linear relationship because the aspect ratio is higher so you have not only more pixels but more objects to draw in the extra screen space.
I've got this link: https://www.pcgameshardware.de/Geforce-RTX-3080-Grafikkarte-276730/Tests/Test-Review-Founders-Edition-1357408/4/

It's in German, but there are drop down boxes where you can check-mark the frame rates you want to compare and an option for 3440x1440
https://wccftech.com/review/nvidia-geforce-rtx-3080-10-gb-ampere-graphics-card-review/amp/?__twitter_impression=true
https://youtu.be/bhBVJe3BASI 3440x1440p benchmarks. Although the YouTuber spends 11.30 minutes of a 15 minute video talking about g skills ram and building a pc. Last 4 minutes he plays horizon zero dawn, control and apex. 

Horrible review video but will give you some idea of performance
I saw somewhere that you can assume 60fps in 4k would equa

## Creating the function to scrape the data from Amazon

In [None]:
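import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


def scrape_amzn_gpu(no_page, no_gpu, no_review_page):
    """
    Scrape GPU information and customer reviews from the Amazon website.

    Expects a global DataFrame `GPU_df` to exist already; the columns used below are written into it.
    Uses the Selenium 3 style `find_element_by_xpath` / `find_elements_by_xpath` calls.

    no_page: Exclusive upper bound of the search pages to visit (pages 1 to no_page - 1), so the minimum is 2
    no_gpu: Number of GPUs to extract per page
    no_review_page: Number of review pages per GPU to extract
    """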
    # Create the Chrome Driver object
    driver = webdriver.Chrome()

    # Id for GPU tracking
    ids = 1

    # To keep track of the last entry appended for the review titles and body and the rating
    idx_title = 0
    idx_body = 0
    idx_star = 0

    for page in range(1,no_page):
        # Load the search results page with number `page`
        driver.get(f'https://www.amazon.com/s?k=Computer+Graphics+Cards&i=computers&rh=n:284822&page={page}&_encoding=UTF8&c=ts&qid=1608032958&ts_id=284822')
        main_url = driver.current_url

        # Check for sponsored post
        sponsored_posts = driver.find_elements_by_xpath('//div[@data-component-type="sp-sponsored-result"]/../../../..')
        lst_index_sponsored = []
        # Loop through the sponsored posts to find the index of the sponsored product
        for post in sponsored_posts:
            lst_index_sponsored.append(int(post.get_attribute('data-index')))

        n = 0 # index number

        # Scrape up to `no_gpu` GPUs on the current page
        while n < no_gpu:

            # If the index is in sponsored list
            while n in lst_index_sponsored:
                print(f'{n} index is a Sponsored Product, will skip to the next product')
                n += 1

            # Wait for 1 second
            time.sleep(1)

            try:
                # Click the link for the nth GPU
                driver.find_element_by_xpath(f'//div[@data-index={str(n)}]//a[@class="a-link-normal a-text-normal"]').click()
            except NoSuchElementException:
                break

            # Gets the url of the main page of the GPU
            gpu_url = driver.current_url

            # Click on the "See all reviews" link
            try:
                driver.find_element_by_xpath('//a[@data-hook="see-all-reviews-link-foot"]').click()
            except NoSuchElementException:
                n += 1
                # Go back to the main page
                driver.get(main_url)
                continue # Go back to the start of the while loop

            # Wait for 1 second
            time.sleep(1)


            """
            Loop through the review page and obtain the review title, review body, ratings
            """

            # Number of review pages to loop through for each GPU
            for review_page in range(no_review_page):

                # Gets the title of the reviews for each page, selects only the first span if there are multiple spans
                title_comment = driver.find_elements_by_xpath('//*[@data-hook = "review-title"]/span[1]')

                # Gets the customer reviews for each page
                review_body = driver.find_elements_by_xpath('//*[@data-hook = "review-body"]')


                # Loop through the title comments and append it to the Customer Review Title
                for title in title_comment:
                    GPU_df.loc[idx_title, 'Customer Review Title'] = title.text
                    idx_title += 1
                    #print(f'Customer review title is {title.text}')

                # Gets the review_bodies in the page and stores them in a list
                review_list = [review.text for review in review_body]


                # Loop through the review comments and append it to the Customer Review
                for review in review_list:
                    GPU_df.loc[idx_body, 'Customer Review'] = review
                    idx_body += 1

                # Sleep
                time.sleep(1)

                # Goes to the next review page   
                try:
                    driver.find_element_by_xpath('//li[@class="a-last"]/a').click()
                    # Sleep
                    time.sleep(3)
                # If not break out of the loop, and go back to the GPU main page
                except NoSuchElementException:
                    break




            """
            Fill up the null values with their respective attributes
            """

            # Go back to the GPU main page
            driver.get(gpu_url)

            # Wait for 2 seconds
            time.sleep(2)

            # Fill up the null values with the GPU name
            GPU_df['GPU Name'].fillna(driver.find_element_by_xpath('//*[@id="productTitle"]').text, inplace=True)

            # Fill up the null values with the Chipset Brand
            try:
                chipset = driver.find_element_by_xpath('//*[@id="productDetails_techSpec_section_1"]/tbody//text()[contains(.,"Chipset Brand")]/../../td').text
                GPU_df['Chipset Brand'].fillna(chipset,inplace=True)
            except NoSuchElementException:
                GPU_df['Chipset Brand'].fillna(np.nan,inplace=True)

            # Fill up the null values with the Memory Size
            try:
                memory_size = driver.find_element_by_xpath('//*[@id="productDetails_techSpec_section_1"]/tbody//text()[contains(.,"Graphics Card Ram Size")]/../../td').text
                GPU_df['Memory Size'].fillna(memory_size,inplace=True)
            except NoSuchElementException:
                GPU_df['Memory Size'].fillna(np.nan,inplace=True)

            # Fill up the null values with the Memory Speed(MHz)
            try:
                memory_speed = driver.find_element_by_xpath('//*[@id="productDetails_techSpec_section_1"]/tbody//text()[contains(.,"Memory Speed")]/../../td').text
                GPU_df['Memory Speed(MHz)'].fillna(memory_speed,inplace=True)
            except NoSuchElementException:
                GPU_df['Memory Speed(MHz)'].fillna(np.nan,inplace=True)

            # Fill up the null values with the manufacturer name
            try:
                manufacturer = driver.find_element_by_xpath('//*[@id="productDetails_techSpec_section_2"]/tbody//th[contains(text(),"Manufacturer")]/../td').text
                GPU_df['Manufacturer'].fillna(manufacturer, inplace=True)
            except NoSuchElementException:
                GPU_df['Manufacturer'].fillna(np.nan,inplace=True)

            # Fill up the null values with the Price
            try:
                GPU_df['Price'].fillna(driver.find_element_by_xpath('//*[@id="price_inside_buybox"]').text, inplace=True)
            except NoSuchElementException:
                GPU_df['Price'].fillna(np.nan, inplace=True)

            # Fill up the null values with the Customer ratings
            try:
                # Gets the overall customer ratings
                GPU_df['Overall Customer Rating'].fillna(driver.find_element_by_xpath('//div[@id="averageCustomerReviews"]//span[@id="acrPopover"]').get_attribute('title'), inplace=True)
            except NoSuchElementException:
                GPU_df['Overall Customer Rating'].fillna(np.nan, inplace=True)

            # Fill the id of the GPU for tracking
            GPU_df['id'].fillna(ids,inplace=True)
            ids += 1

            print(f'Completed scraping for {n} index in page {page}')

            # Increases the index for the next GPU
            n += 1

            # Go back to the main page
            driver.get(main_url)

        print('*'*30)
        print(f'Completed scraping for page {page}')
        print('*'*30)

    # Report the total number of GPUs scraped
    total_gpu = max(GPU_df['id'])
    print(f'Completed scraping {total_gpu} GPUs reviews for {no_page-1} pages')

    # Close the browser session
    driver.quit()
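
The function above appends review titles and bodies first and then back-fills the product columns with `fillna`, which only works if every earlier row is already complete. As a rough sketch of an alternative (not the approach used in this notebook), each review can be turned into a complete row before it reaches the DataFrame. `build_review_rows` and the sample values are hypothetical and exist only for this illustration.

```python
def build_review_rows(gpu_id, gpu_info, titles, bodies):
    """Return one complete row (dict) per review of a single GPU.

    gpu_info holds the product-level fields (name, price, ...) that are
    repeated on every review row of that GPU.
    """
    rows = []
    # zip stops at the shorter list, so an unmatched title or body is dropped
    for title, body in zip(titles, bodies):
        row = {
            "id": gpu_id,
            "Customer Review Title": title,
            "Customer Review": body,
        }
        row.update(gpu_info)
        rows.append(row)
    return rows


# Made-up values, purely to show the shape of the result
rows = build_review_rows(
    gpu_id=1,
    gpu_info={"GPU Name": "Example GPU", "Price": "$100.00"},
    titles=["Great card", "Runs hot"],
    bodies=["Works well at 1440p.", "Fans are loud under load."],
)
print(len(rows))  # 2
# With pandas available, pd.DataFrame(rows) gives one row per review
```

Because every row carries its own product fields, a missing value stays missing for that GPU only and cannot be filled in by a later product.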

In [None]:
scrape_amzn_gpu(no_page=10, no_gpu=8, no_review_page=15)
# GPU_df.shape

In [None]:
# Export to csv file
GPU_df.to_csv('./amazon dataset/gpu_df_1.csv')

# Data cleaning

## Removing the null values

In [None]:
# Read the existing csv file
GPU_df = pd.read_csv('./amazon dataset/gpu_df_1.csv')

In [None]:
# Check the dimensions of the data
GPU_df.shape

In [None]:
# Check for null values
GPU_df.isnull().sum()

In [None]:
# Inspect the rows with a null Customer Review; they will be dropped as there are only 3 of them
GPU_df[GPU_df['Customer Review'].isnull()]

In [None]:
GPU_df.dropna(subset=['Customer Review'],inplace=True)

In [None]:
# There are 19 null values related to price 
GPU_df.isnull().sum()

In [None]:
# Name of GPU with missing price
rx_5500XT = GPU_df[GPU_df['Price'].isnull()]['GPU Name'].unique()[0]
rx_5500XT

Only 1 GPU is missing a price. I'll try to find a similar GPU model and impute the missing value with its price.

In [None]:
GPU_df[GPU_df['GPU Name'] == rx_5500XT]['GPU Name'].duplicated().sum()

It seems that the rows with a null price are duplicates. I'll drop them, as they contain duplicated review titles and reviews.

In [None]:
GPU_df.dropna(subset=['Price'], inplace=True)

In [None]:
# All the null values are removed
GPU_df.isnull().sum()

## Drop the unnamed column

In [None]:
GPU_df.drop('Unnamed: 0',axis=1,inplace=True)

In [None]:
GPU_df.head()

## Removing null values in Memory Speed (MHz)

In [None]:
GPU_df.dropna(subset=['Memory Speed(MHz)'], inplace=True)

In [None]:
GPU_df.isnull().sum()

## Removing the empty values in Customer Review Title

In [None]:
(GPU_df['Customer Review Title'] == '').sum()

There are 2 empty values in Customer Review Title. As there are only 2 of them, I'll remove them.

In [None]:
GPU_df[GPU_df['Customer Review Title'] == '']

In [None]:
GPU_df.drop(GPU_df[GPU_df['Customer Review Title'] == ''].index, inplace=True)

## Removing the non-GPUs

In [None]:
GPU_df['GPU Name'].unique()

The following are non-GPUs and need to be dropped:
* LINKUP {75 cm} PCIE 3.0 16x Shielded Extreme High-Speed Riser Cable Premium PCI Express Port Extension Card┃90 Degree Socket
* LINKUP {35 cm} PCIE 3.0/4.0 16x Shielded Extreme High-Speed Riser Cable Port Extension PCIE Card┃Black┃Left Angle┃3.0 Gen3 Compatible
* GLOTRENDS Graphics Card GPU Brace Support Video Card Sag Holder/Holster Bracket for Computer Cases, Universal VGA Graphics Card Holder,Anodized Aerospace Aluminum (Black)
* Wendry Graphics Card GPU Brace Support, Computer Independent Graphics Aluminum Alloy Bracket, Independent Load-Bearing Bracket Can Safely Support PCB (red)
* New 2GB Graphics Video Card GPU Upgrade Replacement, for iMac 27 Inch Mid 2011 All-in-One Desktop Computer A1312 Core i7 3.4 MD063LL/A, AMD Radeon HD 6970M GDDR5, MXM VGA Board Repair Parts
* Arm Wall Mount Bracket,SD-200 Graphics Card Holder,Universal VGA Graphics Card Holder, DIY Adjustable,Graphics Card GPU Brace Support Holder,Jack Bracket Computer Video Card Support Pole(Black)
* GLOTRENDS Graphics Card GPU Brace Support Video Card Sag Holder/Holster Bracket for Computer Cases, Universal VGA Graphics Card Holder,Anodized Aerospace Aluminum (Red)
* Computer Graphics Card GPU Brace Support Bracket ,Verticle GPU Sag Stand, Video Card Sag Holder Verticle Stick Mount for Computer Cases
* icepc PCI Express x16 PCIe 3.0 Extension Cable High Shielding Property Flexible High Speed Riser Card Connector Port Adapter Compatible with GTX RTX Series, Radeon Series Graphics Card(30cm)
* LINKUP - Flexible SLI Bridge GPU Cable Extreme High-Speed Technology Premium Shielding 85 ohm Design for NVIDIA GPUs Graphic Cards - [60 cm]
* EVGA Hydro Copper Waterblock for GeForce RTX 2080 FTW3 400-HC-1289-B1
* Docooler Laptop External Independent Video Card Graphics Dock Mini PCI-E Version for V8.0 EXP GDC Beast
* Laptop External Independent Video Card Dock,for Mini PCI-E,Expresscard,6Pin+8Pin Interface Output,Without Power Supply

In [None]:
# Create a non-GPU list
non_gpu = ['LINKUP {75 cm} PCIE 3.0 16x Shielded Extreme High-Speed Riser Cable Premium PCI Express Port Extension Card┃90 Degree Socket',
          'LINKUP {35 cm} PCIE 3.0/4.0 16x Shielded Extreme High-Speed Riser Cable Port Extension PCIE Card┃Black┃Left Angle┃3.0 Gen3 Compatible',
           'GLOTRENDS Graphics Card GPU Brace Support Video Card Sag Holder/Holster Bracket for Computer Cases, Universal VGA Graphics Card Holder,Anodized Aerospace Aluminum (Black)',
           'Wendry Graphics Card GPU Brace Support, Computer Independent Graphics Aluminum Alloy Bracket, Independent Load-Bearing Bracket Can Safely Support PCB (red)',
           'New 2GB Graphics Video Card GPU Upgrade Replacement, for iMac 27 Inch Mid 2011 All-in-One Desktop Computer A1312 Core i7 3.4 MD063LL/A, AMD Radeon HD 6970M GDDR5, MXM VGA Board Repair Parts',
           'Arm Wall Mount Bracket,SD-200 Graphics Card Holder,Universal VGA Graphics Card Holder, DIY Adjustable,Graphics Card GPU Brace Support Holder,Jack Bracket Computer Video Card Support Pole(Black)',
           'GLOTRENDS Graphics Card GPU Brace Support Video Card Sag Holder/Holster Bracket for Computer Cases, Universal VGA Graphics Card Holder,Anodized Aerospace Aluminum (Red)',
           'Computer Graphics Card GPU Brace Support Bracket ,Verticle GPU Sag Stand, Video Card Sag Holder Verticle Stick Mount for Computer Cases',
           'icepc PCI Express x16 PCIe 3.0 Extension Cable High Shielding Property Flexible High Speed Riser Card Connector Port Adapter Compatible with GTX RTX Series, Radeon Series Graphics Card(30cm)',
           'LINKUP - Flexible SLI Bridge GPU Cable Extreme High-Speed Technology Premium Shielding 85 ohm Design for NVIDIA GPUs Graphic Cards - [60 cm]',
           'EVGA Hydro Copper Waterblock for GeForce RTX 2080 FTW3 400-HC-1289-B1',
           'Docooler Laptop External Independent Video Card Graphics Dock Mini PCI-E Version for V8.0 EXP GDC Beast',
           'Laptop External Independent Video Card Dock,for Mini PCI-E,Expresscard,6Pin+8Pin Interface Output,Without Power Supply'
          ]

A total of 150 rows need to be dropped

In [None]:
# Drop the non-GPUs
GPU_df = GPU_df[~GPU_df['GPU Name'].isin(non_gpu)].copy()
GPU_df.shape
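
As a cross-check on the hand-built list, a keyword test on the product name can flag accessories that may have been missed. This is only a sketch: the keyword tuple and the `looks_like_accessory` helper are my own assumptions, not something used elsewhere in this notebook, and a keyword match still needs a manual look before dropping anything.

```python
# Hypothetical accessory keywords, chosen for illustration only
ACCESSORY_KEYWORDS = ("riser", "brace", "bracket", "waterblock", "dock", "sli bridge")


def looks_like_accessory(product_name):
    """Return True if the product name contains any accessory keyword."""
    name = product_name.lower()
    return any(keyword in name for keyword in ACCESSORY_KEYWORDS)


names = [
    "EVGA Hydro Copper Waterblock for GeForce RTX 2080 FTW3 400-HC-1289-B1",
    "LINKUP - Flexible SLI Bridge GPU Cable Extreme High-Speed Technology Premium Shielding 85 ohm Design for NVIDIA GPUs Graphic Cards - [60 cm]",
    "Sapphire Nitro+ Radeon RX 580 8GB GDDR5 Graphics Card",
]

# One flag per product name, in the same order as `names`
flags = [looks_like_accessory(name) for name in names]
print(flags)
```

The first two names are accessories from the list above and the third is an actual graphics card, so only the first two should be flagged.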

## Removing non-gaming manufacturers

In [None]:
GPU_df['Manufacturer'].unique()

In [None]:
GPU_df['Manufacturer'].value_counts()

Certain manufacturers make cards for **workstations** rather than for computer gaming. These include:
* PNY
* PNY QUADRO
* Dell Computers
* Lenovo
* hp
* Hewlett Packard
* Dell Computer Corp.
* ATI Technologies

These manufacturers will be dropped.

In [None]:
non_gaming_manufacturer = ['PNY', 'PNY QUADRO', 'Dell Computers', 'Lenovo', 'hp', 'Hewlett Packard', 'Dell Computer Corp.',
                          'ATI Technologies']

GPU_df = GPU_df[~GPU_df['Manufacturer'].isin(non_gaming_manufacturer)]
GPU_df

### Group up duplicate manufacturers

The manufacturer names need to be replaced with a short version, e.g. ASUS Computer International Direct with Asus and ZOTAC with Zotac.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
GPU_df['Manufacturer'] = GPU_df['Manufacturer'].replace(['ASUS Computer International Direct', 'ZOTAC', 'MSI COMPUTER', 'VISIONTEK MASS STORAGE', 'NVIDIA Corporation'], ['Asus', 'Zotac', 'MSI', 'VisionTek', 'NVIDIA'])

### Compare manufacturers with GPU names to see if they're correct

In [None]:
GPU_df['Manufacturer'].unique()

In [None]:
GPU_df[GPU_df['Manufacturer'] == 'Gigabyte']['GPU Name'].unique()

Create a helper function to detect the misclassified Manufacturers for each GPU.

For each row, check whether the `Manufacturer` appears in the first word of the `GPU Name`. If it doesn't, replace the `Manufacturer` with the first word of the `GPU Name`.

In [None]:
def misclassified_manufacturer(df):
    manufacturer = df['Manufacturer']
    gpu_name = df['GPU Name']
    misclassified_dict = {}
    
    for i in range(len(df)):
        if manufacturer.iloc[i].lower() not in gpu_name.iloc[i].split()[0].lower():
            #print(f'Misclassified manufacturer for {df["GPU Name"].loc[i]} found in {df["Manufacturer"].loc[i]}')
            misclassified_dict[gpu_name.iloc[i]] = manufacturer.iloc[i].lower()
    return misclassified_dict
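
To make the check concrete, here is a minimal plain-Python sketch of the same first-word comparison on two toy rows. The `manufacturer_matches_name` helper and the manufacturer paired with each name are assumptions for illustration; only the comparison itself mirrors the function above.

```python
def manufacturer_matches_name(manufacturer, gpu_name):
    """True when the manufacturer appears inside the first word of the GPU name."""
    first_word = gpu_name.split()[0].lower()
    return manufacturer.lower() in first_word


# Toy (manufacturer, GPU name) pairs; the pairings are made up for this example
rows = [
    ("EVGA", "EVGA Hydro Copper Waterblock for GeForce RTX 2080 FTW3 400-HC-1289-B1"),
    ("Docooler", "Yeston Radeon RX550 Gaming Graphics Cards, 4GB Memory GDDR5 128Bit 6000MHz VGA + HD + DVI-D GPU"),
]

# Keep only the rows where the manufacturer is missing from the first word
mismatches = {
    name: maker for maker, name in rows
    if not manufacturer_matches_name(maker, name)
}
print(mismatches)
```

Because this is a case-insensitive substring test, 'Asus' matches a name starting with 'ASUS', but a brand written differently in the title (for example 'Genuine Dell ...') is reported as a mismatch even when the manufacturer is right. That is why the flagged rows still need a manual check.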

In [None]:
misclass_dict = misclassified_manufacturer(GPU_df)
misclass_dict

Checking the GPU manufacturers against the website shows that the following are incorrect:
* EVGA Geforce should be EVGA
* Yeston Radeon should be Docooler
* Genuine Dell ... should be Dell Computers
* Gigabyte... should be Gigabyte
* Pny ... should be PNY
* Aiposen should be Aiposen
* ASROCK should be ASROCK

Only Sapphire Radeon is correctly classified, so I'll drop it from the dictionary.

In [None]:
# Dropping Sapphire Radeon
print(misclass_dict.pop('Sapphire Radeon 11265-05-20G Pulse RX 580 8GB GDDR5 Dual HDMI/ DVI-D/ Dual DP OC with Backplate (UEFI) PCI-E Graphics Card Graphic Cards', None))
misclass_dict

In [None]:
for k, v in misclass_dict.items():
    misclass_dict[k] = k.split()[0]
misclass_dict

* Yeston has to be Docooler
* Sapphire has to be Althon Micro Inc.
* Genuine Dell has to be Dell Computers

In [None]:
misclass_dict['Yeston Radeon RX550 Gaming Graphics Cards, 4GB Memory GDDR5 128Bit 6000MHz VGA + HD + DVI-D GPU'] = 'Docooler'
misclass_dict['Sapphire Nitro+ Radeon RX 580 8GB GDDR5 Graphics Card'] = 'Althon Micro Inc.'
misclass_dict['Genuine Dell Fh868 Silicon Image Orion Pci-express Pci-e X16 DVI 1364a Add2-n Small Profile Video Graphics Card Compatible Part Numbers: 0fh868 Fh868 (For Desktops and Small for Factor Sff Computers)'] = 'Dell Computers'
misclass_dict

In [None]:
# Replacing the misclassified manufacturers with the correct manufacturers
for gpu_name, manufacturer in misclass_dict.items():
    GPU_df.loc[GPU_df.loc[:,'GPU Name'] == gpu_name,'Manufacturer'] = manufacturer
    # GPU_df.loc[:,GPU_df['GPU Name'] == gpu_name] = manufacturer

In [None]:
GPU_df['Manufacturer'].unique()

In [None]:
# Drop Pny and Dell Computers as they make workstation GPUs
GPU_df.drop(GPU_df[(GPU_df['Manufacturer'] == 'Pny') | (GPU_df['Manufacturer'] == 'Dell Computers')].index, inplace=True)

In [None]:
GPU_df['Manufacturer'].unique()

## Cleaning the Chipset Brands

In [None]:
GPU_df['Chipset Brand'].unique()

Some of the values are actually graphics card names or manufacturer names. The `Chipset Brand` should be either **AMD** or **Nvidia**.

These `Chipset Brand` values need to be replaced with `Amd`:
* 'AMD'
* 'AMD Radeon RX 580'
* 'AMD Radeon'
* 'RX 570'


These `Chipset Brand` values need to be replaced with `Nvidia`:
* 'NVIDIA'
* 'GTX 1050'
* 'RTX 3060 ti'
* 'RTX 3090'
* 'RTX 3070'
* 'Gigabyte'

In [None]:
# Replacement for Chipset Brand AMD
amd_to_replace = ['AMD','AMD Radeon RX 580', 'AMD Radeon', 'RX 570']
GPU_df['Chipset Brand'] = GPU_df['Chipset Brand'].replace(amd_to_replace, 'Amd')

In [None]:
# Replacement for Chipset Brand Nvidia
nvidia_to_replace = ['NVIDIA','GTX 1050', 'RTX 3060 ti', 'RTX 3090', 'RTX 3070', 'Gigabyte']
GPU_df['Chipset Brand'] = GPU_df['Chipset Brand'].replace(nvidia_to_replace, 'Nvidia')

In [None]:
GPU_df['Chipset Brand'].unique()
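
The two replacement lists can also be expressed as a single lookup table, which keeps every raw value next to its target. This is a sketch of that idea in plain Python; `CHIPSET_MAP` and `normalize_chipset` are names I made up, while the raw values are the ones listed above.

```python
# Raw Chipset Brand values (from the lists above) mapped to the two target labels
CHIPSET_MAP = {
    "AMD": "Amd",
    "AMD Radeon RX 580": "Amd",
    "AMD Radeon": "Amd",
    "RX 570": "Amd",
    "NVIDIA": "Nvidia",
    "GTX 1050": "Nvidia",
    "RTX 3060 ti": "Nvidia",
    "RTX 3090": "Nvidia",
    "RTX 3070": "Nvidia",
    "Gigabyte": "Nvidia",
}


def normalize_chipset(raw):
    """Map a raw value to 'Amd' or 'Nvidia'; unknown values are returned unchanged."""
    return CHIPSET_MAP.get(raw, raw)


print(normalize_chipset("RX 570"))       # Amd
print(normalize_chipset("RTX 3060 ti"))  # Nvidia
```

Returning unknown values unchanged means that any new, unexpected label still shows up in `unique()` rather than silently becoming missing.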

## Cleaning up the Memory Speed (MHz)

The 'MHz' and 'GHz' strings need to be removed and the values converted to floats.

Values given in 'GHz' need to be converted to 'MHz' by multiplying them by 1000.

In [None]:
GPU_df['Memory Speed(MHz)'].unique()

In [None]:
# Values with 'GHz'
GPU_df['Memory Speed(MHz)'].loc[GPU_df['Memory Speed(MHz)'].str.contains('GHz')].unique()

In [None]:
# Memory Speed(MHz) converted to their floating values
GPU_df['Memory Speed(MHz)'] = GPU_df['Memory Speed(MHz)'].map(lambda x:float(x.split()[0]) * 1000 if x.split()[1] == 'GHz' else float(x.split()[0]))
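
The same conversion written as a named function is easier to test on its own. This is a minimal sketch: `memory_speed_to_mhz` is a hypothetical helper, and the sample strings are made up, assuming every value looks like `'<number> <unit>'`.

```python
def memory_speed_to_mhz(raw):
    """Convert strings such as '7 GHz' or '2000 MHz' to a float in MHz."""
    value, unit = raw.split()[:2]
    number = float(value)
    if unit.lower() == "ghz":
        # 1 GHz = 1000 MHz
        return number * 1000
    if unit.lower() == "mhz":
        return number
    raise ValueError(f"Unrecognised unit in {raw!r}")


print(memory_speed_to_mhz("7 GHz"))     # 7000.0
print(memory_speed_to_mhz("2000 MHz"))  # 2000.0
```

Unlike the one-line lambda, this version raises an error on a unit other than MHz or GHz instead of quietly treating it as MHz.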

## Cleaning up the Memory Size

In [None]:
GPU_df['Memory Size'].unique()

Some values do not specify whether they're in GB or MB (**'4' and '8'**), and one value, '8192 GB', is incorrect as there are no GPUs with over 1000 GB of memory. The MB values need to be converted to GB.

### 8192 GB needs to be replaced with 8 GB, and 8000 MB, 6144 MB and 4096 MB with 8 GB, 6 GB and 4 GB respectively

In [None]:
GPU_df['Memory Size'] = GPU_df['Memory Size'].replace(['8192 GB', '8000 MB', '6144 MB', '4096 MB'],['8 GB', '8 GB', '6 GB', '4 GB'])

### Need to investigate the values that do not have either MB or GB

In [None]:
not_mb_gb = list(GPU_df[~GPU_df['Memory Size'].str.contains('GB|MB')]['Memory Size'].unique())
not_mb_gb

In [None]:
GPU_df.loc[(GPU_df['Memory Size'] == not_mb_gb[0]) | (GPU_df['Memory Size'] == not_mb_gb[1]),'GPU Name'].unique()

The values are in GB, so the 'GB' string will be added.

In [None]:
GPU_df['Memory Size'] = GPU_df['Memory Size'].replace(['4','8'],['4 GB', '8 GB'])

In [None]:
GPU_df['Memory Size'].unique()
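
The replacement rules in this section can be summarised as one function. This is a sketch under my own assumptions: `normalize_memory_size` is a hypothetical helper, 1 GB is taken as 1024 MB, a bare number is taken to be GB, and a 'GB' value of 1000 or more is treated as mislabelled MB.

```python
def normalize_memory_size(raw):
    """Return the memory size as an 'N GB' string, rounded to the nearest whole GB."""
    parts = raw.split()
    number = float(parts[0])
    # Bare numbers such as '4' are assumed to be in GB
    unit = parts[1].upper() if len(parts) > 1 else "GB"

    # Assumption: no card has 1000 GB or more, so such a value is really in MB
    if unit == "GB" and number >= 1000:
        unit = "MB"

    if unit == "MB":
        number = round(number / 1024)

    return f"{int(number)} GB"


for value in ["8192 GB", "8000 MB", "6144 MB", "4096 MB", "4", "8 GB"]:
    print(value, "->", normalize_memory_size(value))
```

Rounding to whole gigabytes is what maps 8000 MB to 8 GB here, matching the manual replacement above. It would not suit cards with less than 1 GB of memory, which would need separate handling.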

## Cleaning up the Price

In [None]:
GPU_df['Price'].unique()

The dollar sign needs to be removed from the prices and the values converted to floats.

In [None]:
GPU_df['Price'] = GPU_df['Price'].map(lambda x:float(x.replace('$','')))
GPU_df['Price']

In [None]:
GPU_df['Price'].unique()
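
A slightly more defensive version of the price conversion is sketched here. `parse_price` is a hypothetical helper and the sample prices are made up. The extra step, stripping thousands separators, is an assumption about how prices above 999 dollars might be formatted; `float()` raises a `ValueError` on a string such as `'1,299.99'`.

```python
def parse_price(raw):
    """Convert a price string such as '$1,299.99' to a float."""
    # Remove the dollar sign, any thousands separators and surrounding spaces
    return float(raw.replace("$", "").replace(",", "").strip())


print(parse_price("$249.99"))
print(parse_price("$1,299.99"))
```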

## Cleaning up the overall customer rating

The ratings will be converted to floats.

In [None]:
GPU_df['Overall Customer Rating'].unique()

In [None]:
GPU_df['Overall Customer Rating'] = GPU_df['Overall Customer Rating'].map(lambda x:float(x.split('out')[0]))
GPU_df['Overall Customer Rating'].unique()
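
For reference, this is how the split on `'out'` behaves, written as a stricter regular expression. It assumes the raw strings look like `'4.5 out of 5 stars'`, which is my reading of the `split('out')` call rather than something shown in the output; `parse_rating` is a hypothetical helper.

```python
import re

# Matches a leading number followed by "out of <number>"
RATING_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s+out of\s+\d+")


def parse_rating(raw):
    """Return the rating as a float, or None if the string has an unexpected format."""
    match = RATING_PATTERN.match(raw)
    return float(match.group(1)) if match else None


print(parse_rating("4.5 out of 5 stars"))  # 4.5
print(parse_rating("no rating yet"))       # None
```

Returning `None` for an unexpected format makes bad rows easy to find afterwards, whereas the lambda would raise an error on the first one.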

## Cleaning the customer review title and reviews

In [None]:
# Full function to clean the title and the post
def clean_post(df):
    """
    This function removes unnecessary characters and punctuation, removes stop words and lemmatizes the words
    in the posts and titles. Lemmatization is used to preserve the meaning of the words, as it maps each word to its dictionary form.
    """
    new_lst = []
    
    # Stop words
    stops = set(stopwords.words('english'))
    
    # Lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    for post in df:
        # Lowercase the text
        post = post.lower()

        # Find the https websites and removes them
        post = re.sub(r'\(https:.*?\)','',post)

        # Removes youtube links
        post = re.sub('https:.*?\\n','',post)

        # Removes uncaptured url links at the bottom of the text
        post = re.sub('https.*?[\\n|"]','',post)

        # Removes characters: \n\n&amp;#x200B;
        post = re.sub('\\n\\n&amp;#x200b;\\n\\n','',post)

        # Removing the special characters, like punctuation marks, periods
        post = re.sub(r'[^\w]',' ',post)
        
        # Removes digits and keeps the letters
        # post = re.sub(r'[^a-zA-Z]', ' ', post)

        # Removes underscores (the \w pattern above keeps them)
        post = re.sub('_', ' ',post)

        # Removes additional white spaces
        post = re.sub(' +', ' ',post)
        
        # Stores the words in a list 
        lst = [] 
        
        # If the word is not in the stop words, lemmatize it
        for word in post.split():
            if word not in stops:
                lst.append(lemmatizer.lemmatize(word))
            
        new_lst.append(" ".join(lst))
        
    return new_lst
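
To see what the regular expressions do to a single review, here is a reduced sketch that applies only the lowercase, link-removal and punctuation steps. It leaves out the stop word and lemmatization steps so that it runs without NLTK; the helper name, the sample text and the URL are all made up, and the final `strip()` is my addition.

```python
import re


def strip_links_and_punctuation(text):
    """Apply the lowercase, link-removal and punctuation steps to one string."""
    text = text.lower()
    # Drop links wrapped in parentheses, e.g. "(https://...)"
    text = re.sub(r'\(https:.*?\)', '', text)
    # Replace every non-word character with a space
    text = re.sub(r'[^\w]', ' ', text)
    # Collapse runs of spaces and trim the ends
    return re.sub(' +', ' ', text).strip()


sample = 'Great card!! Runs 1080p fine (https://example.com/review) -- no issues.'
cleaned = strip_links_and_punctuation(sample)
print(cleaned)  # great card runs 1080p fine no issues
```

Note that digits survive, so tokens such as '1080p' stay in the text, which is why they appear later in the word clouds.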

In [None]:
# Cleans the Customer Review column
GPU_df['Customer Review'] = clean_post(GPU_df['Customer Review'])
GPU_df['Customer Review']

In [None]:
# Spot-check a random row to see if it's cleaned properly (positional, as the index has gaps after dropping rows)
GPU_df['Customer Review'].iloc[np.random.randint(len(GPU_df))]

In [None]:
# Cleans the Customer Review Title column
GPU_df['Customer Review Title'] = clean_post(GPU_df['Customer Review Title'])
GPU_df['Customer Review Title']

In [None]:
# Spot-check a random row to see if it's cleaned properly (positional, as the index has gaps after dropping rows)
GPU_df['Customer Review Title'].iloc[np.random.randint(len(GPU_df))]

### Removing non-English reviews

In [None]:
# Helper function that flags whether a review is written in English
def isenglish(text):
    try:
        if nlp(text)._.language.get('language') == 'en':
            return 1
        else:
            return 0
    except Exception:
        # Treat reviews that the detector cannot process as non-English
        return 0

In [None]:
GPU_df['isenglish'] = GPU_df['Customer Review'].apply(isenglish)

In [None]:
GPU_df[GPU_df.loc[:,'isenglish'] == 0][['Customer Review']].count()

A total of 130 rows are non-English reviews. These have to be removed.

In [None]:
GPU_df.shape

In [None]:
GPU_df.drop(GPU_df[GPU_df['isenglish'] == 0].index, inplace=True)

## Checking for duplicates

In [None]:
GPU_df[['Customer Review Title', 'Customer Review']].loc[GPU_df['Customer Review'].duplicated()]

In [None]:
GPU_df[['Customer Review Title', 'Customer Review']].loc[GPU_df[['Customer Review Title']].duplicated()]

In [None]:
GPU_df[GPU_df['Customer Review Title'] == 'far good']

The duplicated values don't seem to be actual duplicates, just short reviews of a few words written by different customers.
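
One reason short titles such as 'far good' collide is that cleaning removes punctuation and stop words, so differently worded titles can end up as the same string. The sketch below illustrates this with a tiny stop word set that I picked by hand (it is not the full NLTK list) and three made-up titles.

```python
import re

# Small illustrative subset of English stop words, not the full NLTK list
STOPS = {"so", "it", "is", "this", "a", "the", "very"}


def simplify(title):
    """Lowercase, strip punctuation and drop stop words from a title."""
    words = re.sub(r"[^\w]", " ", title.lower()).split()
    return " ".join(word for word in words if word not in STOPS)


titles = ["So far so good", "So far, so good!", "Far so good."]
cleaned = [simplify(title) for title in titles]
print(cleaned)  # ['far good', 'far good', 'far good']
```

All three titles collapse to the same cleaned string, so `duplicated()` on the cleaned column flags them even though they come from different customers.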

In [None]:
# Save to csv file
GPU_df.to_csv('./amazon dataset/cleaned_gpu_df_1.csv',index=False)

# EDA

In [None]:
# Read the existing csv file
GPU_df = pd.read_csv('./amazon dataset/cleaned_gpu_df_1.csv')

In [None]:
GPU_df.shape

In [None]:
# Check for null values
GPU_df.isnull().sum()

In [None]:
# Drop the rows with null values
GPU_df.dropna(inplace=True)

In [None]:
# Remove the Review title and reviews
GPU_df_no_reviews = GPU_df.drop(['Customer Review Title', 'Customer Review'], axis=1)
GPU_df_no_reviews.head()

In [None]:
# Check if the ids match
list(GPU_df_no_reviews.drop_duplicates(['id'])['id'].unique()) == list(GPU_df['id'].unique())

In [None]:
# Remove the duplicate values in the GPU_df_no_reviews
GPU_df_no_reviews = GPU_df_no_reviews.drop_duplicates(['id'])
GPU_df_no_reviews.head()

In [None]:
# Drop the id column (by moving it to the index) and reset to a default integer index
GPU_df_no_reviews = GPU_df_no_reviews.set_index('id').reset_index(drop=True)
GPU_df_no_reviews

In [None]:
GPU_df_no_reviews.shape

### Price distribution

In [None]:
# plt.figure(figsize=(99,99))
sns.displot(GPU_df_no_reviews['Price'], bins=12, aspect=1.5, height=6, color='green')
# Label the lines directly so the legend entries cannot be attached to the wrong artist
plt.axvline(GPU_df_no_reviews['Price'].mean(), color='red', label='Mean sale price')
plt.axvline(GPU_df_no_reviews['Price'].median(), color='yellow', label='Median sale price')

plt.title('Distribution of sale price of GPUs', size=13)
plt.legend();

The distribution is right-skewed, with most of the GPUs falling below the 100 dollar range. The mean and the median prices are far apart, showing that there are some outliers in the price distribution, as seen in the 800 to 1000 dollar range.
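
The gap between the mean and the median is what a few expensive outliers produce. A small worked example with made-up prices (not taken from the dataset) shows the effect:

```python
from statistics import mean, median

# Made-up prices: six cheap cards and two expensive outliers
prices = [60, 75, 80, 90, 95, 110, 850, 1000]

mean_price = mean(prices)      # pulled upwards by the two outliers
median_price = median(prices)  # stays with the bulk of the cards

print(mean_price, median_price)  # 295 92.5
```

The two outliers roughly triple the mean while leaving the median below 100, which is the same pattern as the red and yellow lines in the plot.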

### Distribution of AMD and Nvidia Chipsets

In [None]:
GPU_df_no_reviews['Chipset Brand'].value_counts(normalize=True)

Most of the GPUs use Nvidia chipsets, with a proportion of about 70%, while Amd accounts for about 30%.

### Most popular brands by their rating

In [None]:
GPU_df_no_reviews['Manufacturer'].value_counts()

As NVIDIA, NVIDIA Corporation and Althon Micro Inc. have only 1 GPU each, I won't include them in the popular brand investigation.

In [None]:
manufacturer_list = ['AMD','ASRock','Aiposen','SAPPHIRE', 'Althon Micro Inc.', 'NVIDIA']
# Select the rating column before taking the mean so that non-numeric columns are not averaged
GPU_df_no_reviews.groupby('Manufacturer')['Overall Customer Rating'].mean().drop(manufacturer_list).sort_index(ascending=False).plot(
    kind='barh', title='Most popular brand by rating', figsize=(11,7), color='green')
plt.xlabel('Rating')
plt.ylabel('Brand', rotation=360);

In [None]:
GPU_df_no_reviews.groupby('Manufacturer')['Overall Customer Rating'].mean().drop(manufacturer_list).sort_values(ascending=False)

Excluding Nvidia and Althon Micro Inc., as they have only 1 type of GPU, Asus, EVGA and SAPPHIRE are the most popular brands given their high ratings.

A likely reason is that consumers usually prefer the 3rd party coolers fitted to these GPUs over Nvidia's own coolers, as they're more effective at controlling airflow and lowering the GPU temperature.

### Which Chipset Brand has a higher customer rating?

In [None]:
GPU_df_no_reviews.groupby('Chipset Brand')['Overall Customer Rating'].mean()

Nvidia is slightly ahead of AMD in terms of the Overall Customer rating.

### Which Manufacturer produces GPUs with higher Memory Speed and Size?

In [None]:
GPU_df_no_reviews.info()

In [None]:
GPU_df_no_reviews.groupby('Manufacturer')['Memory Speed(MHz)'].mean().sort_values().plot(kind='barh', figsize=(11,7))

plt.title('Memory speed of the GPUs produced by individual manufacturers')
plt.xlabel('Memory speed (MHz)')
plt.ylabel('Manufacturer',rotation=360);

In [None]:
GPU_df_no_reviews.groupby('Manufacturer')['Memory Speed(MHz)'].mean().sort_values(ascending=False)

ASRock, Gigabyte, XFX and EVGA produce GPUs with high memory speeds, which suggests that they're premium brands that produce 'Enthusiast Grade' GPUs.

### EDA on Customer Review Title

In [None]:
customer_review_title = " ".join(GPU_df['Customer Review Title'])

In [None]:
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, 
                      contour_width=5, contour_color='steelblue', width=700, height=500)
wordcloud.generate(customer_review_title)
# Visualize the word cloud
wordcloud.to_image()

Based on the word cloud, consumers seem mostly satisfied with their GPU purchase, with words such as 'good', 'great' and 'best' coming out on top. The consumers are mostly gamers, many of whom play at 1080p resolution, and they seem to be price sensitive, with phrases such as 'bang buck' and 'great value' appearing in a bigger size.

In [None]:
from collections import Counter

customer_review_title_list = customer_review_title.split()

# Count every word in a single pass instead of calling list.count() once per word
customer_review_title_dict = dict(Counter(customer_review_title_list))

customer_review_title_dict

In [None]:
df = {'words': customer_review_title_dict.keys(), 'freq': customer_review_title_dict.values()}
customer_review_title_df = pd.DataFrame(df)
customer_review_title_df.sort_values('freq', ascending=False).set_index('words').head(10).plot(kind='barh', figsize=(11,7),
                                                                                              title='Frequency of words in customer review title')
plt.xticks(fontsize=12)
plt.legend([]);

The graph shows consistency with the word cloud on the frequency of the words appearing in the customer review title.
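
The most frequent words can also be listed directly from the counts, which is a quick way to double-check the bar chart. This is a sketch on a made-up string; on the real data the input would be the joined titles.

```python
from collections import Counter

# Made-up text standing in for the joined, cleaned review titles
sample_titles = "great card great value good card great price"

word_counts = Counter(sample_titles.split())

# The two most frequent words with their counts, highest first
top_words = word_counts.most_common(2)
print(top_words)  # [('great', 3), ('card', 2)]
```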

In [None]:
# customer_review_title_df['freq'].hist(bins=150)
# plt.xlim(0,50)

### EDA on Customer Review

In [None]:
customer_review = " ".join(GPU_df['Customer Review'])

In [None]:
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, 
                      contour_width=5, contour_color='steelblue', width=700, height=500)
wordcloud.generate(customer_review)
# Visualize the word cloud
wordcloud.to_image()

Similar to the customer review title word cloud, consumers who purchase GPUs tend to be gamers who play at 1080p resolution. GPU fans appear to be an important factor when making a GPU purchase, as the word 'fan' is rather large. The words 'issue' and 'problem' also show up large, which suggests that consumers may have encountered issues with the GPUs they purchased. The presence of 'amd' and 'nvidia' reflects that these two are the major players in the GPU market. GPU drivers also seem to play an important role in making sure that the GPU functions properly.

In [None]:
from collections import Counter

customer_review_list = customer_review.split()

# Count every word in a single pass instead of calling list.count() once per word
customer_review_dict = dict(Counter(customer_review_list))

customer_review_dict

In [None]:
review_df = {'words': customer_review_dict.keys(), 'freq': customer_review_dict.values()}
customer_review_df = pd.DataFrame(review_df)
customer_review_df.sort_values('freq', ascending=False).set_index('words').head(10).plot(kind='barh', figsize=(11,7),
                                                                                              title='Frequency of words in customer review')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend([]);

The graph shows consistency with the word cloud on the frequency of the words appearing in the customer reviews.

## Prepare data for LDA Analysis

I'll be using only Customer Review to conduct the LDA Analysis as it makes up the bulk of the words.

In [None]:
# Convert the customer reviews from a Series to a list
data = GPU_df['Customer Review'].values.tolist()
data[600]

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes each sentence (passing deacc=True would also strip accent marks)
        yield(gensim.utils.simple_preprocess(sentence))

In [None]:
texts = list(sent_to_words(data))

In [None]:
# Print up to the first 30 words of the first document
print(texts[:1][0][:30])

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(texts)

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1][0][:30])
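
Conceptually, the dictionary maps each unique token to an integer id and each document becomes a list of `(id, count)` pairs. The plain-Python sketch below illustrates that idea on two toy documents. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the ids gensim assigns may differ.

```python
from collections import Counter


def build_vocab(tokenised_docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for doc in tokenised_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Turn one tokenised document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy documents, made up for illustration
docs = [["card", "run", "cool", "card"], ["driver", "issue", "card"]]

vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]

print(vocab)  # {'card': 0, 'run': 1, 'cool': 2, 'driver': 3, 'issue': 4}
print(bow)    # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1)]]
```

Word order is discarded at this point, which is the bag-of-words assumption that LDA relies on.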

## LDA Model Training

In [None]:
# number of topics
num_topics = 10

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                      passes=20, random_state=42)

# Print the keywords in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
# Build the path with os.path.join so that it works on any operating system
LDAvis_data_filepath = os.path.join(os.getcwd(), 'visualization', 'ldavis_prepared_' + str(num_topics))
# This is a bit time consuming - set the condition to True
# if you want to run the visualization prep yourself
if False:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
# Load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.save_html(LDAvis_prepared, LDAvis_data_filepath + '.html')
LDAvis_prepared
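
The manual `if False:` switch could be replaced by a check for the saved file, so the slow preparation only runs when nothing has been saved yet. This is a sketch of that pattern with a stand-in computation; `load_or_compute` and `expensive_step` are hypothetical names, and in the notebook the computation would be the `pyLDAvis.gensim.prepare(lda_model, corpus, id2word)` call.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it first if missing."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    # Create the target folder if it does not exist yet
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


# Demonstration with a stand-in computation instead of the real pyLDAvis call
calls = []


def expensive_step():
    calls.append(1)
    return {'num_topics': 10}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, 'visualization', 'ldavis_prepared_10')
    first = load_or_compute(cache_path, expensive_step)   # computes and saves
    second = load_or_compute(cache_path, expensive_step)  # loads from disk

print(len(calls))  # 1
```

The drawback is that a stale file is reused after the model changes, so the saved file has to be deleted (or the path has to include something that changes with the model) whenever the LDA model is retrained.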

# Others

## Testing function (Working for now)

In [None]:
# #def scrape_gpu(no_page, no_gpu, no_review_page):

# # Create the Chrome Driver object
# driver = webdriver.Chrome()

# # Id for GPU tracking
# ids = 1

# # To keep track of the last entry appended for the review titles and body and the rating
# idx_title = 0
# idx_body = 0
# idx_star = 0
# #print('Over sas')

# for page in range(1,3):
#     # Gets the first page of the website
#     #print('Over ss')
#     driver.get(f'https://www.amazon.com/s?k=Computer+Graphics+Cards&i=computers&rh=n:284822&page={page}&_encoding=UTF8&c=ts&qid=1608032958&ts_id=284822')
#     main_url = driver.current_url
#     #print('Over here')

#     # # Check for sponsored post
#     sponsored_posts = driver.find_elements_by_xpath('//div[@data-component-type="sp-sponsored-result"]/../../../..')
#     lst_index_sponsored = []
#     # Loop through the sponsored posts to find the index of the sponsored product
#     for post in sponsored_posts:
#         lst_index_sponsored.append(int(post.get_attribute('data-index')))

#     n = 0 # index number

#     # Scrap 3 GPUs in the first page
#     while n < 3:

#         # If the index is in sponsored list
#         while n in lst_index_sponsored:
#             print(f'{n} index is a Sponsored Product, will skip to the next product')
#             n += 1

#         # Wait for 1 seconds
#         time.sleep(1)
        
#         try:
#             # Click the link for the nth GPU
#             driver.find_element_by_xpath(f'//div[@data-index={str(n)}]//a[@class="a-link-normal a-text-normal"]').click()
#         except NoSuchElementException:
#             break

#         # Gets the url of the main page of the GPU
#         gpu_url = driver.current_url

#         # Click on the "See all reviews" link
#         try:
#             driver.find_element_by_xpath('//a[@data-hook="see-all-reviews-link-foot"]').click()
#         except NoSuchElementException:
#             n += 1
#             # Go back to the main page
#             driver.get(main_url)
#             continue # Go back to the start of the while loop

#         # Wait for 1 seconds
#         time.sleep(1)


#         """
#         Loop through the review page and obtain the review title, review body, ratings
#         """

#         # Number of review pages to loop through for each GPU
#         for review_page in range(2):

#             # Gets the title of the reviews for each page, selects only the first span if there are multiple spans
#             title_comment = driver.find_elements_by_xpath('//*[@data-hook = "review-title"]/span[1]')

#             # Gets the customer reviews for each page
#             review_body = driver.find_elements_by_xpath('//*[@data-hook = "review-body"]')


#             # Loop through the title comments and append it to the Customer Review Title
#             for title in title_comment:
#                 GPU_df.loc[idx_title, 'Customer Review Title'] = title.text
#                 idx_title += 1
#                 #print(f'Customer review title is {title.text}')

#             # Gets the review_bodies in the page and stores them in a list
#             review_list = [review.text for review in review_body]


#             # Loop through the review comments and append it to the Customer Review
#             for review in review_list:
#                 GPU_df.loc[idx_body, 'Customer Review'] = review
#                 idx_body += 1

#             # Sleep
#             time.sleep(1)

#             # Goes to the next review page   
#             try:
#                 driver.find_element_by_xpath('//li[@class="a-last"]/a').click()
#                 # Sleep
#                 time.sleep(3)
#             # If not break out of the loop, and go back to the GPU main page
#             except NoSuchElementException:
#                 break




#         """
#         Fill up the null values with their respective attributes
#         """

#         # Go back to the GPU main page
#         driver.get(gpu_url)

#         # Wait for 2 seconds
#         time.sleep(2)

#         # Fill up the null values with the GPU name
#         GPU_df['GPU Name'].fillna(driver.find_element_by_xpath('//*[@id="productTitle"]').text, inplace=True)

#         # Fill up the null values with the Chipset Brand
#         try:
#             chipset = driver.find_element_by_xpath('//*[@id="productDetails_techSpec_section_1"]/tbody//text()[contains(.,"Chipset Brand")]/../../td').text
#             GPU_df['Chipset Brand'].fillna(chipset,inplace=True)
#         except NoSuchElementException:
#             GPU_df['Chipset Brand'].fillna(np.nan,inplace=True)

#         # Fill up the null values with the Memory Size
#         try:
#             chipset = driver.find_element_by_xpath('//*[@id="productDetails_techSpec_section_1"]/tbody//text()[contains(.,"Graphics Card Ram Size")]/../../td').text
#             GPU_df['Memory Size'].fillna(chipset,inplace=True)
#         except NoSuchElementException:
#             GPU_df['Memory Size'].fillna(np.nan,inplace=True)

#         # Fill up the null values with the Memory Speed(MHz)
#         try:
#             chipset = driver.find_element_by_xpath('//*[@id="productDetails_techSpec_section_1"]/tbody//text()[contains(.,"Memory Speed")]/../../td').text
#             GPU_df['Memory Speed(MHz)'].fillna(chipset,inplace=True)
#         except NoSuchElementException:
#             GPU_df['Memory Speed(MHz)'].fillna(np.nan,inplace=True)

#         # Fill up the null values with the manufacturer name
#         try:
#             manufacturer = driver.find_element_by_xpath('//*[@id="productDetails_techSpec_section_2"]/tbody//th[contains(text(),"Manufacturer")]/../td').text
#             GPU_df['Manufacturer'].fillna(manufacturer, inplace=True)
#         except NoSuchElementException:
#             GPU_df['Manufacturer'].fillna(np.nan,inplace=True)

#         # Fill up the null values with the Price
#         try:
#             GPU_df['Price'].fillna(driver.find_element_by_xpath('//*[@id="price_inside_buybox"]').text, inplace=True)
#         except NoSuchElementException:
#              GPU_df['Price'].fillna(np.nan, inplace=True)

#         # Fill up the null values with the Customer ratings
#         try:
#             # Gets the overall customer ratings
#             GPU_df['Overall Customer Rating'].fillna(driver.find_element_by_xpath('//div[@id="averageCustomerReviews"]//span[@id="acrPopover"]').get_attribute('title'), inplace=True)
#         except NoSuchElementException:
#             GPU_df['Overall Customer Rating'].fillna(np.nan, inplace=True)

#         # Fill the id of the GPU for tracking
#         GPU_df['id'].fillna(ids,inplace=True)
#         ids += 1

#         print(f'Completed scraping for {n} index in page {page}')

#         # Increases the index for the next GPU
#         n += 1

#         # Go back to the main page
#         driver.get(main_url)

#     print('*'*30)
#     print(f'Completed scraping for page {page}')
#     print('*'*30)

# # Close the browser session
# total_gpu = max(GPU_df['id'])
# print(f'Completed scraping {total_gpu} GPUs reviews for {page} pages')
# driver.quit()

## Testing the review body for the page

In [None]:
# # Testing the review body for the page


# # Create the Chrome Driver object
# driver = webdriver.Chrome() 
# driver.get('https://www.amazon.com/Gigabyte-Radeon-Gaming-Graphic-GV-RX580GAMING-8GD/product-reviews/B0842VMKM5/ref=cm_cr_getr_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2')

# # Count the number of Spans in the "Review Body" class, if more than 1, put it in a list and use "".join()
# # before appending it to the dataframe
# more_than_1 = driver.find_elements_by_xpath('//span[@data-hook="review-body"][count(./span) > 1]/span')
# review_body = driver.find_elements_by_xpath('//*[@data-hook = "review-body"]/span')

# # If there is a review that is split into multiples span
# if more_than_1: 
#     # Holds the list of reviews
#     review_list = []
#     for review in more_than_1:
#         review_list.append(review.text) # Append it into a list
#     GPU_df.loc[idx_body, 'Customer Review'] = "".join(review_list)
#     idx_body += 1
#     for review in review_body[len(more_than_1)+1:]: # Reviews with one span
#         GPU_df.loc[idx_body, 'Customer Review'] = review.text
#         idx_body += 1
# else:
#     # Loop through the review comments and append it to the Customer Review
#     for review in review_body:
#         GPU_df.loc[idx_body, 'Customer Review'] = review.text
#         idx_body += 1
    
        
# # Wait for 2 seconds
# time.sleep(2)
# driver.quit()

## Testing the review body using for loop

In [None]:
# # Testing the review body for the page


# # Create the Chrome Driver object
# driver = webdriver.Chrome() 
# driver.get('https://www.amazon.com/MSI-GT-710-2GD3-LP/product-reviews/B01DOFD0G8/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews')


# review_body = driver.find_elements_by_xpath('//*[@data-hook = "review-body"]/span')

# # Get all the reviews 
# reviews = driver.find_elements_by_xpath('//span[@data-hook="review-body"]')
# review_list = [x.text for x in reviews]
# print([x.text for x in reviews])

# for review in review_list:
#     GPU_df.loc[idx_body, 'Customer Review'] = review
#     idx_body += 1
            
# # Wait for 2 seconds
# time.sleep(2)
# driver.quit()

In [None]:
# review_list[2]

In [None]:
# GPU_df

In [None]:
# GPU_df['Customer Review'].loc[41]

## Testing the star ratings of the comment for the page

In [None]:
# # Testing the star ratings of the comment for the page

# # Holds the list of reviews
# list_of_stars = []

# # Create the Chrome Driver object
# driver = webdriver.Chrome() 
# driver.get('https://www.amazon.com/XFX-Radeon-1386MHz-Graphics-RX-580P8DFD6/dp/B06Y66K3XD/ref=cm_cr_arp_d_bdcrb_top?ie=UTF8&th=1')


# star_ratings = driver.find_element_by_xpath('//div[@id="averageCustomerReviews"]//span[@id="acrPopover"]').get_attribute('title')

# # Get the Profile name 
# # star_ratings = driver.find_elements_by_xpath('//div[@data-hook="review"]//span[@class= "a-profile-name"]')


# # For individual customer reviews
# # for star in star_ratings:
# #     list_of_stars.append(star.get_attribute('title'))
    
# # Wait for 2 seconds
# time.sleep(2)
# driver.quit()

In [None]:
# star_ratings

In [None]:
# GPU_df.loc[6] = ['','','','','','']
# GPU_df.loc[7]