<img src="http://drive.google.com/uc?export=view&id=1tpOCamr9aWz817atPnyXus8w5gJ3mIts" width=500px>

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Mobile Phone Review Analysis

## Context

Product companies can use detailed review comments to gather insights from end users. Most products are sold on e-commerce sites such as Flipkart or Amazon, where customers can buy a product and post a review of it on the website. 
Product managers can identify the relevant reviews and run a sentiment analysis tool to understand customer sentiment. From that sentiment they can tell what users think of the current product: are they happy or discontented? 
They can also produce a document listing the features the team needs to focus on to make the product better. 

## Objective

Given the review data and its rating labels, we will use text analytics to gain insights into the various brands and their ratings, and build a model to predict the overall sentiment (derived from the rating). 


### Package version

- tensorflow==2.3.0
- scikit-learn==0.22.2.post1
- pandas==1.0.5
- numpy==1.18.5
- matplotlib==3.2.2
- google==2.0.3

### Data Dictionary 

product_data.csv - contains product details
- 'asin',  - Product ASIN
- 'brand', - Product Brand
- 'title', - Product Title
- 'url',  - Product URL
- 'image', - Product Image URL
- 'rating',- Product Avg. Rating
- 'reviewUrl' - Product Review Page URL
- 'totalReviews' - Product Total Reviews
- 'price' - Product Price ($)
- 'originalPrice' - Product Original Price ($)
 
reviews.csv  - contains user review details
 
- 'asin' - Product ASIN
- 'name' - Reviewer Name
- 'rating' - Reviewer Rating (scale 1 to 5)
- 'date'  - Review Date
- 'verified' - Valid Customer
- 'title'  - Review Title
- 'body'  - Review Content
- 'helpfulVotes' - Helpful Feedbacks


## Table of Contents

1. Import Libraries

2. Setting options

3. Read Data

4. Data Analysis and EDA

5. Text preprocessing and Vectorization

6. Model building

7. Conclusion and Interpretation

## 1. Import Libraries

Let us start by mounting the drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Let us import the required libraries and check the installed TensorFlow version.

In [None]:
# used to suppress display of warnings
import warnings

# os is used to provide a way of using operating system dependent functionality
# We use it for setting working folder
import os

# Pandas is used for data manipulation and analysis
import pandas as pd 

# Numpy is used for large, multi-dimensional arrays and matrices, along with mathematical operators on these arrays
import numpy as np

# Matplotlib is a data visualization library for 2D plots of arrays, built on NumPy arrays 
# and designed to work with the broader SciPy stack
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import pyplot

# Seaborn is based on matplotlib, which aids in drawing attractive and informative statistical graphics.
import seaborn as sns
import tensorflow 
print(tensorflow.__version__)

## 2. Setting Options

In [None]:
# suppress display of warnings
warnings.filterwarnings('ignore')

# display all dataframe columns
pd.options.display.max_columns = None

# display floats with 7 decimal places
pd.options.display.float_format = '{:.7f}'.format

# display all dataframe rows
pd.options.display.max_rows = None

## 3. Read Data

### 3.1 Read the provided CSVs and check 5 random samples and shape to understand the datasets

In [None]:

product_df = pd.read_csv('/content/drive/MyDrive/product_data.csv')
product_df.shape

In [None]:
product_df.sample(5)

In [None]:
reviews_df = pd.read_csv('/content/drive/MyDrive/reviews.csv')
reviews_df.shape

In [None]:
reviews_df.sample(5)

In [None]:
product_df.shape

In [None]:
reviews_df.shape

## 4.  Data Analysis and EDA

### 4.1 Drop unnecessary columns like 'url', 'image' from the product_data

In [None]:
product_df = product_df.drop(['url','image','reviewUrl'],axis=1)

In [None]:
product_df.shape

In [None]:
product_df.head(2)

### 4.2 Check statistical summary of both datasets. Note:- Include both numerical and object type columns.

In [None]:
product_df.describe()

In [None]:
product_df.describe(include='O')

In [None]:
reviews_df.describe()

In [None]:
reviews_df.describe(include='object')

In [None]:
reviews_df[reviews_df['helpfulVotes']==326]

### 4.3 From the above statistical summary, write inferences like count of unique products, top brand, top title, range of rating, price range, etc



*   There are 720 unique products.
*   Ratings range from 1 to 5, with a mean rating of 3.71.
*   There are 10 brands, of which Samsung is the most frequent.
*   Prices range from 0 to 999.99, with a mean price of about 235.
*   There are no duplicate ASINs.
*   The top title is Apple iPhone 6s.

### 4.4 Analyze the distribution of ratings and other categorical features like brand, etc

In [None]:
sns.displot(data= product_df,x='rating')

In [None]:
sns.displot(data= product_df,x='rating',kind='kde')

In [None]:
sns.barplot(data=product_df, x='rating', y='brand')

In [None]:
product_df['brand'].value_counts().plot(kind='pie',autopct='%1.0f%%',figsize=(12,10))

In [None]:
product_df['brand'].value_counts().plot(kind='bar')

In [None]:
sns.countplot(data=product_df, x='brand')

### 4.5 Display average rating per brand

In [None]:
product_df.head()

In [None]:
# Calculating average rating per brand

rating_per_brand = product_df.groupby(by='brand')['rating'].mean().sort_values(ascending=False)
rating_per_brand

In [None]:
rating_per_brand.plot(kind='barh')

### 4.6 Display average price per brand

In [None]:
# Calculating average price per brand

price_per_brand = product_df.groupby(by='brand')['price'].mean().sort_values(ascending=False)
price_per_brand

In [None]:
price_per_brand.plot(kind='barh')

### 4.7 Display average 'totalReviews' per brand

In [None]:
# Calculating average 'totalReviews' per brand
reviews_per_brand = product_df.groupby(by='brand')['totalReviews'].mean().sort_values(ascending=False)
reviews_per_brand

In [None]:
reviews_per_brand.plot(kind='barh')

### 4.8 Merge two datasets using 'asin' and check the shape of the final dataset

In [None]:
df = pd.merge(reviews_df, product_df, how='left', on='asin')
df.shape


In [None]:
df.head()

### 4.9 Rename important features with appropriate names.
Important features - "rating_x": "user_rating", "title_x": "review_title", "title_y": "item_title", "rating_y": "overall_rating"

In [None]:
df.rename(columns={"rating_x": "user_rating", "title_x": "review_title","title_y": "item_title","rating_y":"overall_rating"},inplace=True)

In [None]:
df.head()
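
The `_x` and `_y` names renamed above are the default suffixes pandas adds to columns that appear in both frames but are not merge keys. The following is a minimal sketch on invented toy rows (the frames `toy_reviews` and `toy_products` and their values are assumptions for illustration, not data from the CSVs):

```python
import pandas as pd

# Toy stand-ins for reviews_df and product_df (values are invented for illustration)
toy_reviews = pd.DataFrame({
    "asin": ["A1", "A1", "B2"],
    "rating": [5, 2, 4],
    "title": ["Great", "Bad battery", "Good value"],
})
toy_products = pd.DataFrame({
    "asin": ["A1", "B2"],
    "rating": [3.5, 4.0],
    "title": ["Phone A", "Phone B"],
    "brand": ["BrandA", "BrandB"],
})

# A left merge keeps every review; overlapping non-key columns get _x (left) and _y (right)
toy_merged = pd.merge(toy_reviews, toy_products, how="left", on="asin")
toy_merged = toy_merged.rename(columns={
    "rating_x": "user_rating", "title_x": "review_title",
    "title_y": "item_title", "rating_y": "overall_rating",
})
print(toy_merged.columns.tolist())
```

Because the merge is a left join on the reviews, the result has one row per review, with the product columns repeated for each review of the same product.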

### 4.10 Select rows having verified reviews and check the shape of the final dataset

In [None]:
df = df[df.verified==True]

In [None]:
df.shape

### 4.11 Check the number of reviews for various brands and report the brand that has the highest number of reviews

In [None]:
# Calculating  the number of reviews for various brands 
reviews_brand = df.groupby(by='brand')['totalReviews'].count().sort_values(ascending=False)
reviews_brand

In [None]:
reviews_brand.plot(kind='barh')

### 4.12 Drop irrelevant columns and keep important features like 'brand','body','price','user_rating','review_title' for further analysis

In [None]:
df.head(2)

In [None]:
df.columns

In [None]:
df_final = df.drop(['asin', 'totalReviews','originalPrice', 'name', 'overall_rating', 'date', 'verified','item_title',  'helpfulVotes'],axis=1)

In [None]:
df_final.head(2)

In [None]:
df_final.shape

### 4.13 Perform univariate analysis. Check distribution of price, user_rating

In [None]:
sns.displot(data=df_final,x='price')

In [None]:
sns.displot(data=df_final,x='price',kind='kde')

In [None]:
sns.displot(data=df_final,x='user_rating')

In [None]:
sns.displot(data=df_final,x='user_rating',kind='kde')

In [None]:
sns.countplot(data=df_final, x='user_rating')

In [None]:
df_final['user_rating'].value_counts()

### 4.14 Create a new column called "sentiment". It should have the value 1 (positive) if user_rating is greater than 3, and 0 (negative) if user_rating <= 3

In [None]:
df_final['sentiment'] = df_final['user_rating'].apply(lambda x : 1 if x > 3 else 0)

In [None]:
df_final.head()
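
The same labelling rule can also be written as a vectorized comparison. This is a small sketch on a made-up series (`toy_ratings` is a hypothetical name), which also makes explicit that a neutral rating of 3 falls into the negative class:

```python
import pandas as pd

# One toy rating per possible value on the 1-to-5 scale
toy_ratings = pd.Series([1, 2, 3, 4, 5], name="user_rating")

# Ratings above 3 map to 1 (positive); ratings of 3 or below map to 0 (negative)
toy_sentiment = (toy_ratings > 3).astype(int)
print(toy_sentiment.tolist())
```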

### 4.15 Check frequency distribution of the 'sentiment'

In [None]:
df_final['sentiment'].value_counts()

In [None]:
sns.countplot(data=df_final, x='sentiment')

### 4.16 Perform bivariate analysis. Check correlation/crosstab between features and write your inferences.

In [None]:
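# Correlation is only defined for numeric columns, so select them explicitly
df_final.select_dtypes(include='number').corr()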



*   user_rating and sentiment are strongly positively correlated, which is expected because sentiment is derived from user_rating.
*   price does not seem to have much influence on either user sentiment or user ratings.




In [None]:
sns.boxplot(x=df_final['sentiment'],y=df_final['price'])

In [None]:
pd.crosstab(df_final['brand'],df_final['sentiment'])

In [None]:
# Row-wise proportions: share of negative and positive reviews within each brand
pd.crosstab(df_final['brand'], df_final['sentiment'], normalize='index')

## 5. Text Preprocessing and Vectorization

We will analyze the 'body' and 'review_title' to gain more understanding.

We will perform the following tasks:

- Convert the text into lowercase
- Remove punctuation
- Remove stopwords (English, from nltk corpus)
- Remove other keywords like "phone" and brand name
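
The four steps above can be sketched as a single function. This is an illustrative sketch using only the standard library: `clean_review`, `TOY_STOP_WORDS` and `TOY_DOMAIN_WORDS` are hypothetical names, and the tiny word lists are assumptions standing in for the nltk stop-word list and the real brand names.

```python
import string

# Tiny illustrative stop-word list; the notebook uses nltk's English list instead
TOY_STOP_WORDS = {"the", "is", "a", "and", "i", "this", "it", "my"}
# Hypothetical domain keywords to drop (generic product word plus brand names)
TOY_DOMAIN_WORDS = {"phone", "samsung", "apple"}

def clean_review(text, stop_words=TOY_STOP_WORDS, domain_words=TOY_DOMAIN_WORDS):
    """Lowercase, strip punctuation, then drop stop words and domain keywords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    kept = [w for w in text.split() if w not in stop_words and w not in domain_words]
    return " ".join(kept)

print(clean_review("I love this Samsung phone, the battery is GREAT!"))
# prints: love battery great
```

With pandas, a function like this could be applied to the review column via `.apply`, which is the pattern the cells below follow step by step.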

### 5.1 Change the datatype of the 'body' column to 'str' and convert it into lowercase. Print any two samples and check the output.

In [None]:
df_final.info()

In [None]:
df_final['body'] = df_final['body'].astype('string')

In [None]:
df_final.info()

In [None]:
df_final.head(2)

In [None]:
df_final['keywords'] = df_final['body'].str.lower()

In [None]:
df_final['keywords'].head(2)

In [None]:
df_final.head()

### 5.2 Remove punctuations from the lowercased 'body' column and display at least two samples.

In [None]:
import string
df_final['keywords'] = df_final['keywords'].str.translate(str.maketrans('', '', string.punctuation))

In [None]:
df_final.head(2)

In [None]:
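# Alternative using a regular expression: drop every character that is not a word character or whitespace

df_final['keywords'] = df_final['keywords'].str.replace(r'[^\w\s]', '', regex=True)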

In [None]:
df_final.head(2)

### 5.3 Remove stop words from the above pre-processed 'body' column and display at least two samples.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

In [None]:
stop_words = set(stopwords.words('english'))
stop_words

In [None]:
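# Replace missing reviews with an empty string first, so they do not turn into the literal token '<NA>'
df_final['keywords'] = df_final['keywords'].fillna('').astype('str')
df_final['keywords'] = df_final['keywords'].apply(lambda words: ' '.join(w for w in words.split() if w not in stop_words))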
df_final.head()


### 5.4 Apply lemmatisation on the above preprocessed text and display a few samples

In [None]:
import nltk
nltk.download('wordnet')


In [None]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
def lemmetize_text(text):
  return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

In [None]:
df_final['lemm'] = df_final['keywords'].apply(lemmetize_text)

In [None]:
df_final.head()

In [None]:
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
  
print("replacement :", lemmatizer.lemmatize("replacement"))
print("corpora :", lemmatizer.lemmatize("corpora"))
  
# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos ="a"))


### 5.5 Check most common and rare words in the processed text
- We can also write a function to check word frequency of the text (Optional)

In [None]:
from collections import Counter
cnt = Counter()
for text in df_final["keywords"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)


In [None]:
rare_words = 10

cnt.most_common()[:-rare_words-1:-1]
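
The optional word-frequency function mentioned above could look like the following sketch. `word_frequencies` and the toy texts are hypothetical; the function assumes the text has already been cleaned and is whitespace-separated.

```python
from collections import Counter

def word_frequencies(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

toy_texts = ["great battery great screen", "battery died", "great value"]
freq = word_frequencies(toy_texts)
print(freq.most_common(2))         # the two most frequent words
print(freq.most_common()[:-3:-1])  # the two rarest words, rarest first
```

The same function can then be called on any subset of the data, for example only the positive or only the negative reviews.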

In [None]:
# Most common words in positive-sentiment reviews
df_pos = df_final[df_final["sentiment"]==1]

# Use a separate counter so the overall counts in cnt are not overwritten
cnt_pos = Counter()
for text in df_pos["keywords"].values:
    for word in text.split():
        cnt_pos[word] += 1
        
cnt_pos.most_common(10)


### 5.6 Initialize tf-idf vectorizer and transform the preprocessed body text

In [None]:

# Initialize TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

In [None]:
# Initialize a second TF-IDF vectorizer that uses bigrams only
tfidf_vectorizer1 = TfidfVectorizer(ngram_range=(2,2))

In [None]:
# Fit the unigram vectorizer. The result is a sparse matrix (rows = reviews, columns = terms),
# so it is kept in its own variable rather than stored as a DataFrame column
tfidf = tfidf_vectorizer.fit_transform(df_final["keywords"])

In [None]:
# Fit the bigram vectorizer on the same text
tfidf1 = tfidf_vectorizer1.fit_transform(df_final['keywords'])

In [None]:
tfidf.get_shape()

In [None]:
tfidf1.get_shape()
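
To make the weights in these matrices less opaque, here is a hand-rolled sketch of the weighting that, to my understanding of the scikit-learn documentation, `TfidfVectorizer` applies with its default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function `tfidf_rows` is hypothetical, and it tokenizes on whitespace, which is a simplification of the library's own tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalised TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = [tf[w] * idf[w] for w in vocab]
        # Scale each row to unit length; guard against an all-zero row
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["great battery", "great screen", "battery died fast"])
print(vocab)
print([round(v, 3) for v in rows[2]])
```

In the third toy document, "died" and "fast" receive a higher weight than "battery" because they occur in fewer documents. Each distinct term becomes one column, which is why the matrix shapes above have as many columns as there are distinct unigrams or bigrams.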

### 5.7 Segregate the data into dependent (sentiment) and independent (transformed body using tf-idf) features for building a classifier. 

In [None]:
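# Dependent feature: sentiment. The independent features are the TF-IDF matrices (tfidf, tfidf1) built above
y = df_final['sentiment']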

### 5.8 Split the data into Train & Test Sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y1_train, y1_test = train_test_split(tfidf, y, random_state = 50, stratify=y, test_size=0.3)

## 6. Model building

### 6.1 Build a random forest classifier to predict the 'sentiment'
### 6.2 Predict on test set
### 6.3 Check accuracy and confusion matrix

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_clf1 = RandomForestClassifier()
rf_clf1.fit(X_train, y1_train)

In [None]:
print("Train accuracy of the model is : ",rf_clf1.score(X_train, y1_train))
print("Test accuracy of the model is : ",rf_clf1.score(X_test, y1_test))

In [None]:
from sklearn.metrics import confusion_matrix
y1_pred = rf_clf1.predict(X_train)
confusion_matrix(y1_train, y1_pred)


In [None]:
from sklearn.metrics import confusion_matrix
y1_pred = rf_clf1.predict(X_test)
confusion_matrix(y1_test, y1_pred)
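
scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below derives accuracy, precision and recall for the positive class from such a matrix. `binary_metrics` is a hypothetical helper, and the counts are invented for illustration, not results from this notebook.

```python
def binary_metrics(cm):
    """Derive accuracy, precision and recall for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

toy_cm = [[30, 20], [10, 140]]  # invented counts, not results from this notebook
metrics = binary_metrics(toy_cm)
print(metrics)
```

When the classes are imbalanced, precision and recall computed this way are more informative than accuracy alone. Passing label 0 counts in the positive position gives the same metrics for the negative class.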


In [None]:
# Repeat the experiment with the bigram TF-IDF features (tfidf1); this overwrites the previous split and model
X_train, X_test, y1_train, y1_test = train_test_split(tfidf1, y, random_state = 50, stratify=y, test_size=0.3)
rf_clf1 = RandomForestClassifier()
rf_clf1.fit(X_train, y1_train)

In [None]:
print("Train accuracy of the model is : ",rf_clf1.score(X_train, y1_train))
print("Test accuracy of the model is : ",rf_clf1.score(X_test, y1_test))


In [None]:
from sklearn.metrics import confusion_matrix
y1_pred = rf_clf1.predict(X_train)
confusion_matrix(y1_train, y1_pred)


In [None]:
from sklearn.metrics import confusion_matrix
y1_pred = rf_clf1.predict(X_test)
confusion_matrix(y1_test, y1_pred)


## 7. Conclusion and Interpretation



*   The model appears to overfit (compare the train and test accuracy above), so we can tune the random forest's hyperparameters to improve test accuracy.
*   We can also tune the vectorizer's hyperparameters to reduce the dimensionality of the TF-IDF matrix.
*   For example, we can try setting `max_features`.
*   This notebook uses only the `body` attribute; adding `review_title` and `brand` alongside `body` may give deeper insights.
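
As a starting point for the tuning ideas above, the sketch below caps the vocabulary with `max_features` and restrains each tree with `max_depth` and `min_samples_leaf`. The toy reviews, the labels and all parameter values are assumptions chosen only to make the example runnable; suitable values for the real data would need to be found by validation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented toy reviews and labels, only to make the sketch runnable
toy_texts = ["great battery", "love screen", "bad battery", "terrible screen",
             "great value", "bad value", "love camera", "terrible camera"]
toy_labels = [1, 1, 0, 0, 1, 0, 1, 0]

# max_features caps the vocabulary size; max_depth and min_samples_leaf restrain each tree
toy_model = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=2, random_state=50),
)
toy_model.fit(toy_texts, toy_labels)
print(toy_model.predict(["great camera", "bad screen"]))
```

Wrapping the vectorizer and the classifier in one pipeline also means the vocabulary is learned from the training data only when the pipeline is fitted on a training split.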



