## OBJECTIVE OF THE PROJECT:

**This project scrapes Amazon customer reviews of the *OnePlus Nord 2T 5G (Jade Fog, 12GB RAM, 256GB Storage)* and studies what customers say about the product. We are also interested in how the algorithm separates negative reviews from positive ones.**

**DATA ACKNOWLEDGEMENT: Amazon**

In [5]:
# Importing Libraries 

from bs4 import BeautifulSoup       # for pulling data out of HTML and XML files
import requests
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import nltk

In [78]:
review_list = []
base_url = "https://www.amazon.in/OnePlus-Nord-Jade-256GB-Storage/product-reviews/B0B3D39RKV/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber="

# Loop over the review pages (i is appended to the URL as pageNumber)
for i in range(276):
    response = requests.get(base_url + str(i))

    # parsing
    data = BeautifulSoup(response.content, 'html.parser')

    # each review body sits in a <span data-hook="review-body">
    reviews = data.find_all('span', {'data-hook': 'review-body'})
    for j in reviews:
        review_list.append(j.text)
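
The extraction step can be checked offline, without sending any request to Amazon. The sketch below is not the notebook's BeautifulSoup code: it swaps in the standard library's `html.parser` and runs on a small hand-written HTML snippet. `ReviewBodyParser`, `extract_reviews` and `sample_html` are hypothetical names introduced only for this illustration, and the snippet only assumes that review text sits inside `<span data-hook="review-body">`, as in the cell above.

```python
from html.parser import HTMLParser


class ReviewBodyParser(HTMLParser):
    """Collect the text inside every <span data-hook="review-body">."""

    def __init__(self):
        super().__init__()
        self.reviews = []      # finished review texts
        self._depth = 0        # > 0 while inside a review-body span
        self._chunks = []      # text pieces of the review being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1   # nested span inside a review body
        elif dict(attrs).get("data-hook") == "review-body":
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # the outer review-body span just closed
                self.reviews.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_reviews(html):
    """Return the list of review texts found in one HTML page."""
    parser = ReviewBodyParser()
    parser.feed(html)
    return parser.reviews


# Hand-written stand-in for a downloaded page (an assumption, not real Amazon markup)
sample_html = """
<div>
  <span data-hook="review-title">Nice</span>
  <span data-hook="review-body"><span>Battery draining very fast</span></span>
  <span data-hook="review-body"> Camera is good </span>
</div>
"""
sample_reviews = extract_reviews(sample_html)
print(sample_reviews)
```

The title span is ignored because its `data-hook` differs, and the nested `<span>` is handled by counting depth, so only the two review bodies are returned.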

In [80]:
# Strip leading/trailing whitespace from every review
cust_review = [review.strip() for review in review_list]

cust_review

['Oneplus nord ce 2lite 👌',
 'I am using this phone from last 1 month almost everything is fine Except Battery capacity  it should be 5000 above. Battery backup is not very good as compared to the price. And the features of the phone are fine but not good according to cost.',
 'Smooth UI  ,No bloatware decent battery backup charging is very fast (35min) and great back camera Sony IMX 766 captures good quality photos,8 mp ultrawide has detailing issue, front camera is also very good',
 'Camera features should be more like Micro capture option,',
 'Bs 90 %',
 'Battery draining very fast',
 'Battery could be better. Front camera is not up to the mark. Rare camera is good. Everything else is good.',
 'I bought this mobile after researching a lot and then finalized it and here is my observation:Great things about this phone:- Super smooth to handle multiple apps- There is absolutely no lag- You can trust battery for entire day and even more if you are a normal user including streaming.- it 

In [85]:
dic = {'Review':cust_review,
      'Length of the Review':[len(i) for i in cust_review]
      }

dic

{'Review': ['Oneplus nord ce 2lite 👌',
  'I am using this phone from last 1 month almost everything is fine Except Battery capacity  it should be 5000 above. Battery backup is not very good as compared to the price. And the features of the phone are fine but not good according to cost.',
  'Smooth UI  ,No bloatware decent battery backup charging is very fast (35min) and great back camera Sony IMX 766 captures good quality photos,8 mp ultrawide has detailing issue, front camera is also very good',
  'Camera features should be more like Micro capture option,',
  'Bs 90 %',
  'Battery draining very fast',
  'Battery could be better. Front camera is not up to the mark. Rare camera is good. Everything else is good.',
  'I bought this mobile after researching a lot and then finalized it and here is my observation:Great things about this phone:- Super smooth to handle multiple apps- There is absolutely no lag- You can trust battery for entire day and even more if you are a normal user includi

In [89]:
rev = pd.DataFrame(dic)

rev

| | Review | Length of the Review |
|---|---|---|
| 0 | Oneplus nord ce 2lite 👌 | 23 |
| 1 | I am using this phone from last 1 month almost... | 244 |
| 2 | Smooth UI ,No bloatware decent battery backup... | 207 |
| 3 | Camera features should be more like Micro capt... | 57 |
| 4 | Bs 90 % | 7 |
| ... | ... | ... |
| 1625 | I give this review after 1 days of use this ph... | 182 |
| 1626 | ONE Plus nord 2t hanging issueYouTube hanging ... | 123 |
| 1627 | This mobile device is really awesome | 36 |
| 1628 | Finger print sensor work slow and fon is great... | 99 |


## Data pre-processing

In [91]:
import string
from nltk.stem import PorterStemmer

# Column-wise assignment avoids the chained indexing rev.Review[i] = ...,
# which pandas may apply to a temporary copy instead of the DataFrame.

#1 Converting the text to lower case
rev['Review'] = rev['Review'].str.lower()

#2 Removing punctuation (string.punctuation is ASCII only, so emoji remain)
remv_punc = str.maketrans("", "", string.punctuation)
rev['Review'] = rev['Review'].str.translate(remv_punc)

#3 Collapsing repeated white space
rev['Review'] = rev['Review'].str.replace(r"\s+", " ", regex=True).str.strip()

#4 Stemming
ps = PorterStemmer()
rev['Review'] = rev['Review'].apply(lambda text: " ".join(ps.stem(w) for w in text.split()))
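
One side effect of deleting punctuation is that words separated only by a comma get glued together, which is where tokens such as `photos8` in the noun list come from. The sketch below reproduces the cleaning steps (without stemming) on a single string using only the standard library, and compares them with a variant that turns punctuation into a space. `clean_delete`, `clean_space` and `sample` are hypothetical names, and the second variant is an assumption shown for comparison, not what the notebook does.

```python
import re
import string

# Translation tables: one deletes punctuation, the other maps it to a space
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_delete(text):
    """Lowercase, delete punctuation, collapse whitespace."""
    text = text.lower().translate(_STRIP_PUNCT)
    return re.sub(r"\s+", " ", text).strip()


def clean_space(text):
    """Variant for comparison: punctuation becomes a space before collapsing."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", text).strip()


sample = "good quality photos,8 mp ultrawide"
print(clean_delete(sample))   # good quality photos8 mp ultrawide
print(clean_space(sample))    # good quality photos 8 mp ultrawide
```

Which variant is preferable depends on the data; replacing punctuation with a space keeps `photos` and `8` apart but also splits contractions such as `don't` into two tokens.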

## Feature-based opinion mining

#### 1. Feature Extraction
#### A. Extraction of Frequent Nouns/Noun Phrases

In [92]:
#Word tokenization

from nltk import word_tokenize

In [93]:
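# Tokenize every review into words (the original loop kept only the last review's tokens)
tokenized_reviews = [word_tokenize(i) for i in rev.Review]
#print(tokenized_reviews[0])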

In [94]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [95]:
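# Tag every token with its part of speech (the original loop kept only the last review's tags)
tagged_reviews = [nltk.pos_tag(word_tokenize(i)) for i in rev.Review]
#print(tagged_reviews[0])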

In [96]:
#Save the NOUNS only
import numpy

nouns = []
for i in rev.Review:
    for word, speech in nltk.pos_tag(word_tokenize(i)):
        # 'NN' tags singular or mass nouns; plural and proper nouns are skipped
        if speech == 'NN':
            nouns.append(word)
print(nouns)

unique_noun = numpy.unique(numpy.array(nouns))
print("Unique nouns =", unique_noun)

['oneplu', 'nord', '👌', 'i', 'use', 'thi', 'phone', 'month', 'everyth', 'batteri', 'capac', 'batteri', 'backup', 'compar', 'price', 'featur', 'phone', 'accord', 'decent', 'batteri', 'backup', 'charg', 'camera', 'soni', 'imx', 'captur', 'qualiti', 'photos8', 'mp', 'ha', 'detail', 'issu', 'front', 'camera', 'camera', 'featur', 'micro', 'captur', 'option', 'bs', 'batteri', 'drain', 'fast', 'batteri', 'front', 'camera', 'mark', 'camera', 'everyth', 'i', 'thi', 'mobil', 'research', 'lot', 'thing', 'thi', 'phone', 'smooth', 'app', 'lag', 'batteri', 'day', 'user', 'includ', 'stream', 'hand', 'i', 'cover', 'handl', 'averag', 'charg', 'featur', 'becaus', 'charg', 'min', 'phone', 'charg', 'time', 'area', 'improv', 'improveif', 'storag', 'i', 'redmi', 'mobil', 'year', 'mobil', 'phone', 'i', 'charger', 'disappoint', 'case', 'camera', 'qualiti', 'ad', 'auto', 'system', 'card', 'sim', 'supportedbatteri', 'time', 'i', 'dislik', 'batteri', 'batteri', 'qualiti', 'ekadam', 'ghatiya', 'type', 'ka', 'hai'
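
`numpy.unique` keeps each noun once and drops how often it occurred. To see which nouns are actually frequent, the occurrences can be counted directly. A minimal sketch with `collections.Counter`, using a short hypothetical list `sample_nouns` in place of the full `nouns` list:

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be the `nouns` list built above
sample_nouns = ["batteri", "camera", "batteri", "phone", "camera", "batteri"]

noun_counts = Counter(sample_nouns)

# The two most frequent nouns with their counts
top_nouns = noun_counts.most_common(2)
print(top_nouns)   # [('batteri', 3), ('camera', 2)]
```

Note that this counts every occurrence, whereas the support used in the association-rule step counts each review at most once per noun.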

In [98]:
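#Binary Data (for association rules mining)

# Tokenize each review once; the reviews are already lower-cased,
# and set membership keeps the per-noun lookup cheap
review_tokens = [set(word_tokenize(review)) for review in rev.Review]

df = {}
for noun in unique_noun:
    # 1 if the noun occurs in the review, else 0
    df[noun] = [1 if noun in tokens else 0 for tokens in review_tokens]

#print(df)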

In [99]:
df = pd.DataFrame(df)
df

Unnamed: 0,a3,a52,aa,aagaya,aay,ab,abhi,abil,abl,abov,...,😭,🤠,🤣,🤩,🤩🤩,🤩🤩🤩🤩,🤳,🥰,🥰🙏,🥱
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1625,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1626,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1627,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1628,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [100]:
from mlxtend.frequent_patterns import apriori, association_rules

In [101]:
#Association Rules Mining to extract the frequent noun phrase

In [102]:
apr_df = apriori(df, min_support = 0.03, use_colnames = True, max_len=2, verbose = 0)



In [103]:
apr_df

| | support | itemsets |
|---|---|---|
| 0 | 0.034969 | (amazon) |
| 1 | 0.055828 | (awesom) |
| 2 | 0.041104 | (backup) |
| 3 | 0.236196 | (batteri) |
| 4 | 0.036196 | (better) |
| ... | ... | ... |
| 124 | 0.039877 | (thi, qualiti) |
| 125 | 0.047239 | (veri, qualiti) |
| 126 | 0.050920 | (thi, use) |
| 127 | 0.052761 | (veri, thi) |
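
The `support` column is the fraction of reviews that contain every word of the itemset, and `min_support = 0.03` keeps only itemsets present in at least 3% of the reviews. The sketch below illustrates that definition on four toy reviews with a brute-force count. It is not mlxtend's implementation and skips the candidate pruning that makes Apriori efficient; `frequent_itemsets` and `toy_reviews` are hypothetical names used only here.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_len=2):
    """Return {itemset: support} for all itemsets up to max_len, by brute force."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            # support = share of transactions containing every item of the candidate
            hits = sum(1 for t in transactions if set(candidate) <= t)
            support = hits / n
            if support >= min_support:
                result[candidate] = support
    return result


# Each toy review is the set of (stemmed) words it contains
toy_reviews = [
    {"batteri", "life", "good"},
    {"batteri", "drain"},
    {"camera", "qualiti", "good"},
    {"batteri", "life"},
]
toy_frequent = frequent_itemsets(toy_reviews, min_support=0.5)
print(toy_frequent)
```

With a threshold of 0.5, only `batteri` (3 of 4 reviews), `good`, `life` and the pair `(batteri, life)` (2 of 4 each) survive.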


#### B. Compactness Pruning

In [104]:
len_list = [tuple(i) for i in apr_df.itemsets if len(i)>=2]
len_list

[('backup', 'batteri'),
 ('batteri', 'camera'),
 ('charg', 'batteri'),
 ('day', 'batteri'),
 ('drain', 'batteri'),
 ('batteri', 'fast'),
 ('good', 'batteri'),
 ('batteri', 'i'),
 ('issu', 'batteri'),
 ('life', 'batteri'),
 ('one', 'batteri'),
 ('batteri', 'phone'),
 ('batteri', 'qualiti'),
 ('thi', 'batteri'),
 ('batteri', 'use'),
 ('veri', 'batteri'),
 ('buy', 'phone'),
 ('buy', 'thi'),
 ('charg', 'camera'),
 ('fast', 'camera'),
 ('good', 'camera'),
 ('i', 'camera'),
 ('life', 'camera'),
 ('one', 'camera'),
 ('phone', 'camera'),
 ('camera', 'qualiti'),
 ('thi', 'camera'),
 ('use', 'camera'),
 ('veri', 'camera'),
 ('charg', 'fast'),
 ('good', 'charg'),
 ('charg', 'i'),
 ('charg', 'phone'),
 ('thi', 'charg'),
 ('veri', 'charg'),
 ('day', 'phone'),
 ('dont', 'thi'),
 ('drain', 'fast'),
 ('good', 'fast'),
 ('fast', 'phone'),
 ('veri', 'fast'),
 ('go', 'phone'),
 ('good', 'i'),
 ('good', 'life'),
 ('good', 'mobil'),
 ('good', 'one'),
 ('good', 'phone'),
 ('good', 'product'),
 ('good', 'qua

In [107]:
# Compactness Determination
# A pair is compact in a review when its two words lie at most 3 positions apart.
# list.index() returns the first occurrence only, so later repeats of a word are not checked.
compact_count = {}
for tuple_1 in len_list:
    count = 0
    for review in rev.Review:
        review_1 = review.split()
        if tuple_1[0] in review_1 and tuple_1[1] in review_1:
            index_1 = review_1.index(tuple_1[0])
            index_2 = review_1.index(tuple_1[1])
            if abs(index_1 - index_2) <= 3:
                count += 1
                compact_count[tuple_1] = count
print(compact_count)

{('backup', 'batteri'): 47, ('batteri', 'camera'): 38, ('charg', 'batteri'): 31, ('day', 'batteri'): 5, ('drain', 'batteri'): 60, ('batteri', 'fast'): 43, ('good', 'batteri'): 65, ('batteri', 'i'): 2, ('issu', 'batteri'): 16, ('life', 'batteri'): 98, ('one', 'batteri'): 7, ('batteri', 'phone'): 20, ('batteri', 'qualiti'): 13, ('thi', 'batteri'): 11, ('batteri', 'use'): 6, ('veri', 'batteri'): 51, ('buy', 'phone'): 23, ('buy', 'thi'): 36, ('charg', 'camera'): 14, ('fast', 'camera'): 13, ('good', 'camera'): 81, ('i', 'camera'): 8, ('life', 'camera'): 14, ('one', 'camera'): 3, ('phone', 'camera'): 31, ('camera', 'qualiti'): 110, ('thi', 'camera'): 14, ('use', 'camera'): 2, ('veri', 'camera'): 24, ('charg', 'fast'): 60, ('good', 'charg'): 10, ('charg', 'i'): 5, ('charg', 'phone'): 8, ('thi', 'charg'): 2, ('veri', 'charg'): 13, ('day', 'phone'): 7, ('dont', 'thi'): 15, ('drain', 'fast'): 33, ('good', 'fast'): 8, ('fast', 'phone'): 6, ('veri', 'fast'): 32, ('go', 'phone'): 5, ('good', 'i'): 
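
The counts above compare only the first position of each word in a review, because `list.index` stops at the first match. If a word is repeated, a compact occurrence later in the review can be missed. The sketch below compares that first-occurrence check with a variant that looks at every pair of positions. It is an illustration on toy data under the same distance threshold of 3, not a replacement for the counts shown; all function names and the `toy` reviews are hypothetical.

```python
def is_compact_first_only(words, first, second, max_gap=3):
    """list.index-style check: only the first occurrence of each word is compared."""
    if first not in words or second not in words:
        return False
    return abs(words.index(first) - words.index(second)) <= max_gap


def is_compact(words, first, second, max_gap=3):
    """True if any occurrence of `first` lies within max_gap positions of `second`."""
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(a - b) <= max_gap for a in pos_first for b in pos_second)


def compact_support(reviews, pair, check=is_compact, max_gap=3):
    """Number of reviews in which the pair occurs compactly under `check`."""
    first, second = pair
    return sum(check(r.split(), first, second, max_gap) for r in reviews)


toy = [
    "batteri life is good",
    "camera is fine but after one week the batteri drain and batteri life is poor",
    "life is short and the phone has a weak batteri",
]
pair = ("life", "batteri")
all_positions = compact_support(toy, pair)
first_only = compact_support(toy, pair, check=is_compact_first_only)
print(all_positions, first_only)   # 2 1
```

In the second toy review the first `batteri` is four words away from `life`, so the first-occurrence check rejects it, while the second `batteri` sits right next to `life`.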

In [108]:
# Compactness pruned list: keep pairs that are compact in at least 2 reviews
compact_words = [(key, value) for key, value in compact_count.items() if value >= 2]
compact_words

[(('backup', 'batteri'), 47),
 (('batteri', 'camera'), 38),
 (('charg', 'batteri'), 31),
 (('day', 'batteri'), 5),
 (('drain', 'batteri'), 60),
 (('batteri', 'fast'), 43),
 (('good', 'batteri'), 65),
 (('batteri', 'i'), 2),
 (('issu', 'batteri'), 16),
 (('life', 'batteri'), 98),
 (('one', 'batteri'), 7),
 (('batteri', 'phone'), 20),
 (('batteri', 'qualiti'), 13),
 (('thi', 'batteri'), 11),
 (('batteri', 'use'), 6),
 (('veri', 'batteri'), 51),
 (('buy', 'phone'), 23),
 (('buy', 'thi'), 36),
 (('charg', 'camera'), 14),
 (('fast', 'camera'), 13),
 (('good', 'camera'), 81),
 (('i', 'camera'), 8),
 (('life', 'camera'), 14),
 (('one', 'camera'), 3),
 (('phone', 'camera'), 31),
 (('camera', 'qualiti'), 110),
 (('thi', 'camera'), 14),
 (('use', 'camera'), 2),
 (('veri', 'camera'), 24),
 (('charg', 'fast'), 60),
 (('good', 'charg'), 10),
 (('charg', 'i'), 5),
 (('charg', 'phone'), 8),
 (('thi', 'charg'), 2),
 (('veri', 'charg'), 13),
 (('day', 'phone'), 7),
 (('dont', 'thi'), 15),
 (('drain', '

> Experimental results indicate that the proposed techniques are very promising in performing their tasks. I believe that this problem will become increasingly important as more people are buying and expressing their opinions on the Web. Summarizing the reviews is not only useful to common shoppers, but also crucial to product manufacturers.