• **DOMAIN**: Smartphone, Electronics

• **CONTEXT**: India is the second-largest smartphone market in the world after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to increase to about 442 million in 2022. India ranked second in average time spent on the mobile web by smartphone users across Asia Pacific. The combination of very high sales volumes and typical smartphone consumer behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they are buying a product, versus 15% who turn to social media. If a seller succeeds in presenting smartphones that match a user's behaviour or choice at the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on an individual consumer's behaviour or choice.

• **DATA DESCRIPTION**:
- **author** : name of the person who gave the rating
- **country** : country the person who gave the rating belongs to
- **date** : date of the rating
- **domain**: website from which the rating was taken
- **extract**: rating content
- **language**: language in which the rating was given
- **product**: name of the product/mobile phone for which the rating was given
- **score**: average rating for the phone
- **score_max**: highest rating given for the phone
- **source**: source from where the rating was taken

• **PROJECT OBJECTIVE**: We will build a recommendation system using popularity-based and collaborative filtering methods, to recommend the most popular mobile phones and personalised mobile phones to a user, respectively.

### Steps and tasks:

### 1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps.

In [None]:
#Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from collections import defaultdict
from surprise import SVD
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

In [None]:
data1 = pd.read_csv('D:/Nikhila/PGP - AIML/2. Projects/6. Recommendation Systems/phone_user_review_file_1.csv')
data2 = pd.read_csv('D:/Nikhila/PGP - AIML/2. Projects/6. Recommendation Systems/phone_user_review_file_2.csv')
data3 = pd.read_csv('D:/Nikhila/PGP - AIML/2. Projects/6. Recommendation Systems/phone_user_review_file_3.csv')
data4 = pd.read_csv('D:/Nikhila/PGP - AIML/2. Projects/6. Recommendation Systems/phone_user_review_file_4.csv')
data5 = pd.read_csv('D:/Nikhila/PGP - AIML/2. Projects/6. Recommendation Systems/phone_user_review_file_5.csv')
data6 = pd.read_csv('D:/Nikhila/PGP - AIML/2. Projects/6. Recommendation Systems/phone_user_review_file_6.csv')

#### A. Merge all the provided CSVs into one dataFrame.

In [None]:
print('Shape of the data1', data1.shape)
print('Shape of the data2', data2.shape)
print('Shape of the data3', data3.shape)
print('Shape of the data4', data4.shape)
print('Shape of the data5', data5.shape)
print('Shape of the data6', data6.shape)

print()

print(f'Total rows: {data1.shape[0]+data2.shape[0]+data3.shape[0]+data4.shape[0]+data5.shape[0]+data6.shape[0]}')

#### Check whether the column names are same in all the dataframes

In [None]:
frames = [data1, data2, data3, data4, data5, data6]
# True only if every dataframe has exactly the same set of column names as data1
all(set(df.columns) == set(data1.columns) for df in frames)

In [None]:
data = pd.concat([data1,data2,data3,data4,data5,data6], ignore_index=True)

print('Shape of the dataframe', data.shape)
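
The six separate reads and the concatenation can also be written as one loop. The following is a minimal sketch, assuming the files sit in one folder and follow the `phone_user_review_file_*.csv` naming seen above; the helper name `load_review_files` and its `data_dir` argument are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_review_files(data_dir, pattern="phone_user_review_file_*.csv"):
    """Read every CSV matching `pattern` in `data_dir` and stack them row-wise."""
    paths = sorted(Path(data_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir}")
    frames = [pd.read_csv(p) for p in paths]
    # Fail early if the files do not share the same set of columns.
    expected = set(frames[0].columns)
    for path, frame in zip(paths, frames):
        if set(frame.columns) != expected:
            raise ValueError(f"unexpected columns in {path.name}")
    return pd.concat(frames, ignore_index=True)
```

Checking the columns inside the loader keeps the "same column names" test and the merge in one place, so a mismatched file is reported by name instead of silently producing extra columns.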

#### B. Explore, understand the Data and share at least 2 observations.

In [None]:
data.info()

**Observation** -
- 1) The non-null counts are lower for "score", "score_max", "extract", "author" and "product", **indicating missing values**.
- 2) "score" and "score_max" are **stored as float**; the other features are of object type.
- 3) "date" should be of **datetime** type.

In [None]:
data.describe().T

In [None]:
print(data['score_max'].nunique())

**Observation** -
- 4) The "score_max" value for all the observations is 10.

#### We will see the distribution of "product" and "author" since we will be dealing with it later

In [None]:
product = data['product'].value_counts()[:10]
print('Distribution of number of products: \n',product)
sns.barplot(y=product.index,x=product)
plt.tight_layout()
plt.show()

In [None]:
users = data['author'].value_counts(dropna=False)[:10]
print('Distribution of number of author: \n',users)
users.index = users.index.map(str)
sns.barplot(y=users.index,x=users)
plt.tight_layout()
plt.show()

**Observation** -
- 5) We have "nan" values.
- 6) We see authors like 'Anonymous' and 'unknown'.
- 7) Authors like "Amazon customer", "Cliente Amazon" and "Client d'Amazon" are the same placeholder name in different languages.

We will remove the "nan", 'Anonymous' and 'unknown' entries in the next steps.
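
The placeholder names in observations 6 and 7 could be filtered with a small lookup. This is only a sketch: the set `GENERIC_AUTHORS` is assembled from labels that appear in this notebook and is certainly not exhaustive, and the helper name `drop_generic_authors` is hypothetical.

```python
import pandas as pd

# Placeholder labels seen in this notebook; illustrative, not exhaustive.
GENERIC_AUTHORS = {
    "anonymous", "unknown", "amazon customer",
    "cliente amazon", "client d'amazon", "amazon kunde",
}


def drop_generic_authors(df, column="author"):
    """Remove rows whose author is missing or a known placeholder label."""
    # Compare on a trimmed, lower-cased copy so 'Anonymous ' also matches.
    labels = df[column].str.strip().str.lower()
    keep = labels.notna() & ~labels.isin(GENERIC_AUTHORS)
    return df[keep]
```

Matching on a normalised copy leaves the original author strings untouched in the rows that are kept.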

#### C. Round off scores to the nearest integers.

In [None]:
data['score'] = data['score'].round(0).astype('Int64')
print(list(data.score.unique()))

**Observation** - We see 0 and `<NA>` values here.
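
One detail worth knowing about `round`: pandas (via NumPy) rounds exact halves to the nearest even integer, so 8.5 becomes 8 rather than 9. The sketch below contrasts this with half-up rounding on made-up values; whether half-up is what the task intends is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up scores, including two exact halves and a missing value.
scores = pd.Series([7.5, 8.5, 9.4, None])

half_even = scores.round(0).astype("Int64")       # 7.5 -> 8, 8.5 -> 8
half_up = np.floor(scores + 0.5).astype("Int64")  # 7.5 -> 8, 8.5 -> 9

print(pd.DataFrame({"score": scores, "half_even": half_even, "half_up": half_up}))
```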

#### D. Check for missing values. Impute the missing values, if any.

In [None]:
missing_val=data.isna().sum().round(2)
missing_val1 = (missing_val*100/data.shape[0]).round(2)
print('Missing count and percentages for each column are: \n',missing_val.astype('str') +' ('+ missing_val1.astype('str')+'%)')

del missing_val, missing_val1

**'score'** and **'score_max'** have exactly the same number of missing values.

a) Impute the **"score"** column with the median

In [None]:
data['score'] = data['score'].fillna(data['score'].median())

print('Shape of the dataframe after imputing "score" with median', data.shape)

b) We will not change the **"score_max"** column, since it has a single unique value of 10 and is an irrelevant feature.

c) **"extract"**, **"author"** and **"product"** - We will remove all rows with null values, as well as "Anonymous" and "unknown" authors.

In [None]:
#To remove null values (without `subset`, this also drops rows where only "score_max" is missing)
data = data.dropna()

#To remove "Anonymous" and "unknown" authors
unknowns = ['Anonymous','unknown','Anonymous ']
data = data[~data['author'].isin(unknowns)]

print('Shape of the dataframe after removing null values and "Anonymous" values', data.shape)

#### E. Check for duplicate values and remove them, if any.

In [None]:
duplicate = data[data.duplicated()]
duplicate

In [None]:
data = data.drop_duplicates()

print('Shape of the dataframe after removing duplicate values', data.shape)

#### F. Keep only 1 Million data samples. Use random state=612.

In [None]:
data = data.sample(n=1000000, random_state=612)

print('Shape of the dataframe after keeping only 1 Million data samples', data.shape)

#### G. Drop irrelevant features. Keep features like Author, Product, and Score.

In [None]:
data.drop(['phone_url','date','lang','country','source','domain','score_max','extract'], axis = 1, inplace = True)

print('Shape of the dataframe after dropping irrelevant features', data.shape)

-----

### 2. Answer the following questions

#### A. Identify the most rated products.

In [None]:
#product which has received most number of ratings - "count"
data.groupby('product')['score'].count().reset_index().sort_values('score', ascending=False)[:10]

In [None]:
#product which has highest mean score - "mean"
data.groupby('product')['score'].mean().reset_index().sort_values('score', ascending=False)[:10]

In [None]:
#product with its "mean score" and "count of score"
product_mean_count = pd.DataFrame(data.groupby('product')['score'].mean())
product_mean_count['score_counts'] = pd.DataFrame(data.groupby('product')['score'].count()) 
product_mean_count.head(10)

**Observation** - Although some products have a mean score of 10, their score_counts is 1, so the mean is simply that single rating and is less significant. We will explore this further in section 3.

In [None]:
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
product_mean_count['score'].hist(bins=50)

**Observation** - 
- The bars at integer values are taller than those at fractional values, since products with very few ratings have integer means.
- The distribution only weakly resembles a normal distribution.

- Products with a higher number of ratings often have a high average rating as well, since a good product is normally well known. Let's see if this is also the case with the products in our dataset. We will plot average ratings against the number of ratings.

In [None]:
plt.rcParams['patch.force_edgecolor'] = True
# jointplot creates its own figure, so the size is set through `height`
sns.jointplot(x='score', y='score_counts', data=product_mean_count, alpha=0.4, height=8)

**Observation** - The graph shows that, in general, products with higher average ratings tend to have more ratings than products with lower average ratings.

#### B. Identify the users with most number of reviews.

In [None]:
#author who has given the most number of ratings - "count"
data.groupby('author')['score'].count().reset_index().sort_values('score', ascending=False)[:10]

In [None]:
#author which has highest mean score - "mean"
data.groupby('author')['score'].mean().reset_index().sort_values('score', ascending=False)[:10]

In [None]:
#author with its "mean score" and "count of score"
author_mean_count = pd.DataFrame(data.groupby('author')['score'].mean())
author_mean_count['score_counts'] = pd.DataFrame(data.groupby('author')['score'].count()) 
author_mean_count.head(10)

**Observation** - Although some authors have a mean score of 10, their score_counts is 1, so the mean is simply that single rating and is less significant.

In [None]:
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
author_mean_count['score'].hist(bins=50)

#### C. Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final dataset.

In [None]:
#Select the data with products having more than 50 ratings
min_phone_ratings = 50
filter_products = data['product'].value_counts() > min_phone_ratings
filter_products = filter_products[filter_products].index.tolist()
print('Number of products with >50 rating: ', len(filter_products))

#users who have given more than 50 ratings
min_user_ratings = 50
filter_users = data['author'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()
print('Number of authors who have given >50 rating: ', len(filter_users))

print()

data_new = data[(data['product'].isin(filter_products)) & (data['author'].isin(filter_users))]

print('Shape of the dataframe after selecting the data with products having more than 50 ratings and users who have given more than 50 ratings', data_new.shape)

print()

data_new.head(10)

-------

### 3. Build a popularity based model and recommend top 5 mobile phones.

In [None]:
#product with its mean score.
ratings_mean_count = pd.DataFrame(data.groupby('product')['score'].mean())

# product with its score count
ratings_mean_count['rating_counts'] = pd.DataFrame(data.groupby('product')['score'].count())  

#top 5 mobile phones.
ratings_mean_count.sort_values(by=['score','rating_counts'], ascending=[False,False]).head(5)
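
Sorting by mean score first tends to put products with a single rating of 10 at the top. One common remedy is to require a minimum number of ratings before ranking. The sketch below assumes the `product` and `score` columns used above; the helper name `top_n_popular` and the default threshold of 50 (borrowed from task 2C) are assumptions.

```python
import pandas as pd


def top_n_popular(ratings, n=5, min_ratings=50):
    """Rank products by mean score, ignoring products with few ratings."""
    stats = (
        ratings.groupby("product")["score"]
        .agg(mean_score="mean", rating_counts="count")
    )
    # Keep only products rated more than `min_ratings` times.
    eligible = stats[stats["rating_counts"] > min_ratings]
    return eligible.sort_values(
        ["mean_score", "rating_counts"], ascending=[False, False]
    ).head(n)
```

With the notebook's dataframe this could be called as `top_n_popular(data, n=5, min_ratings=50)`; a lower threshold admits more niche products at the cost of noisier means.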

----

### 4. Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch (Note: in case you're building it from scratch, you can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using KNNWithMeans from surprise. You can try both user-based and item-based models.

In [None]:
# arranging columns in the order of user id,item id and score
columns_titles = ['author','product','score']
data_model = data.reindex(columns=columns_titles)

In [None]:
# Keep only 5000 data samples. Use random state=612
data_model = data_model.sample(n=5000, random_state=612)

#### a) Build a collaborative filtering model using SVD

In [None]:
# Rearrange columns for SVD and prepare train and testsets
data_svd = Dataset.load_from_df(data[['author','product','score']], Reader(rating_scale=(1, 10)))
trainset_svd, testset_svd = train_test_split(data_svd, test_size=.25,random_state=612)

In [None]:
# fit and predict using svd
def svd_func(train, test):
    algo_svm = SVD(random_state=612)  # note: despite the name, this is an SVD model
    algo_svm.fit(train)
    test_pred_svd = algo_svm.test(test)
    return test_pred_svd, algo_svm

test_pred_svd, algo_svm = svd_func(trainset_svd, testset_svd)
print('First few prediction values: \n',test_pred_svd[0:2])
# compute RMSE once (verbose=False avoids printing it a second time)
svd_rmse = round(accuracy.rmse(test_pred_svd, verbose=False),2)
print('\nRMSE value(test-set): ',svd_rmse,'\n')

#### b) Build a collaborative filtering model using KNN With Means from surprise using user-based model.

In [None]:
reader = Reader(rating_scale=(1, 10))
data_Knn_user = Dataset.load_from_df(data_model,reader = reader)

In [None]:
trainset_knn_user = data_Knn_user.build_full_trainset()

In [None]:
algo_u = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': True})
algo_u.fit(trainset_knn_user)

In [None]:
testset_knn_user = trainset_knn_user.build_anti_testset()

In [None]:
test_pred_knn_user = algo_u.test(testset_knn_user)
test_pred_knn_user

#### c) Build a collaborative filtering model using KNN With Means from surprise using Item-based model.

In [None]:
reader = Reader(rating_scale=(1, 10))
data_Knn_item = Dataset.load_from_df(data_model,reader = reader)

In [None]:
trainset_knn_item = data_Knn_item.build_full_trainset()

In [None]:
algo_i = KNNWithMeans(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False})
algo_i.fit(trainset_knn_item)

In [None]:
testset_knn_item = trainset_knn_item.build_anti_testset()

In [None]:
test_pred_knn_item = algo_i.test(testset_knn_item)
test_pred_knn_item

-----------------

### 5. Evaluate the collaborative model. Print RMSE value.

#### a) SVD Model

In [None]:
print("SVD Model : Test data")
accuracy.rmse(test_pred_svd, verbose=True)

#### b) User-based Model

In [None]:
print("User-based Model : Test Set")
accuracy.rmse(test_pred_knn_user, verbose=True)

#### c) Item-based Model

In [None]:
print("Item-based Model :Test Set")
accuracy.rmse(test_pred_knn_item, verbose=True)
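
For reference, RMSE is the square root of the mean squared difference between the true rating and the estimate, and the "average prediction error" reported in the next section is the mean absolute error (MAE). The sketch below reproduces that arithmetic on plain `(true rating, estimate)` pairs; the toy numbers are made up for illustration.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Made-up (true rating, estimate) pairs, for illustration only.
toy = [(8, 7.0), (10, 10.0), (6, 8.0)]
print(round(rmse(toy), 4))  # squared errors 1, 0, 4 -> sqrt(5/3)
print(mae(toy))             # absolute errors 1, 0, 2 -> mean of 1.0
```

Because the errors are squared before averaging, RMSE weighs large misses more heavily and is never smaller than MAE on the same predictions.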

-------------

### 6. Predict score (average rating) for test users

#### a) SVD Model

In [None]:
svd_pred = pd.DataFrame(test_pred_svd, columns=['uid', 'iid', 'rui', 'est', 'details'])
print('average prediction for test users for SVD Model: ',svd_pred['est'].mean())
print('average rating by test users for SVD Model: ',svd_pred['rui'].mean())
print('average prediction error for test users for SVD Model: ',(svd_pred['rui']-svd_pred['est']).abs().mean())

#### b) User-based Model

In [None]:
knn_u_pred=pd.DataFrame(test_pred_knn_user, columns=['uid', 'iid', 'rui', 'est', 'details'])
print('average prediction for test users for User-based Model: ',knn_u_pred['est'].mean())
print('average rating by test users for User-based Model: ',knn_u_pred['rui'].mean())
print('average prediction error for test users for User-based Model: ',(knn_u_pred['rui']-knn_u_pred['est']).abs().mean())

#### c) Item-based Model

In [None]:
knn_i_pred=pd.DataFrame(test_pred_knn_item, columns=['uid', 'iid', 'rui', 'est', 'details'])
print('average prediction for test users for Item-based Model: ',knn_i_pred['est'].mean())
print('average rating by test users for Item-based Model: ',knn_i_pred['rui'].mean())
print('average prediction error for test users for Item-based Model: ',(knn_i_pred['rui']-knn_i_pred['est']).abs().mean())

--------------

### 7. Report your findings and inferences.

**Top 5 most rated products are** -
1. Lenovo Vibe K4 Note (White,16GB)
2. Lenovo Vibe K4 Note (Black, 16GB)  
3. OnePlus 3 (Graphite, 64 GB)            
4. OnePlus 3 (Soft Gold, 64 GB)         
5. Huawei P8lite zwart / 16 GB           

**Authors with the most reviews** -
- Although we cleaned the data of "nan", "Anonymous" and "unknown" values, there are author names in other languages that also mean "unknown" (not cleaned).
- Overall, the data is highly skewed towards 'Amazon customers' from different countries.
- "Amazon" has the most reviews, although the actual user names from Amazon should have been used.

**Products having more than 50 ratings and Users who have given more than 50 ratings.**
1. Denni - Apple iPhone 6 Space Grau 128GB SIM-Free Smart...
2. Amazon Kunde - Samsung Galaxy S7 Smartphone (5,1 Zoll (12,9 c...
3. Amazon Customer - Apple iPhone 3GS 16GB (White) - AT&T
4. Amazon Customer - OnePlus 3T (Gunmetal, 6GB RAM + 64GB memory)
5. Amazon Customer - Lenovo Vibe K5 (Silver, 16GB)
------------------------------------------------------------------------------------------------------------------------
**SVD Model**

SVD Model : Test data
RMSE: 2.5267

- average prediction for test users:  8.017424546767772
- average rating by test users:  8.001028
- average prediction error for test users:  1.9522398766493418
--------------------------------------------------------------------------------------------------------------------------
**User-based Model**

User-based Model : Test Set
RMSE: 2.5424

- average prediction for test users:  8.06401472612044
- average rating by test users:  8.016599998590907
- average prediction error for test users:  1.9390806236844527
--------------------------------------------------------------------------------------------------------------------------
**Item-based Model**

Item-based Model :Test Set
RMSE: 2.5288

- average prediction for test users:  7.980277888766065
- average rating by test users:  8.016599998590907
- average prediction error for test users:  1.9407905319252805
------------------------------------------------------------------------------------------------------------------------
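**Conclusion** - 
- RMSE is approximately the same for all 3 models, with the SVD model lowest at 2.5267.
- The average prediction error for test users is lowest for the User-based Model (1.9391).
- The comparison is only indicative: the KNN predictions are scored on `build_anti_testset()`, where the "true" rating of each unseen pair is filled with the global mean (8.0166 here), whereas the SVD model is scored on held-out ratings.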

--------------------

### 8. Try and recommend top 5 products for test users.

In [None]:
def get_top_n(predictions, n=5):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

#### a) SVD Model

In [None]:
top_5_SVD = get_top_n(test_pred_svd, n=5)

In [None]:
print('Top 5 recommendations:SVD \n')
for key,value in top_5_SVD.items(): print(key,'-> ',value,'\n')

#### b) User-based Model

In [None]:
top_5_knn_u = get_top_n(test_pred_knn_user,5)

In [None]:
print('Top 5 recommendations:User \n')
for key,value in top_5_knn_u.items(): print(key,'-> ',value,'\n')

#### c) Item-based Model

In [None]:
top_5_knn_i = get_top_n(test_pred_knn_item ,5)

In [None]:
print('Top 5 recommendations:Item \n')
for key,value in top_5_knn_i.items(): print(key,'-> ',value,'\n')
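
Printing the dictionaries works, but a table is easier to filter and compare across models. The following sketch flattens a top-N dictionary of the shape returned by `get_top_n` into one row per recommendation; the helper name `top_n_to_frame` and the output column names are hypothetical.

```python
import pandas as pd


def top_n_to_frame(top_n):
    """Flatten {user: [(item, est), ...]} into one row per recommendation."""
    rows = [
        {"author": uid, "rank": rank, "product": iid, "est": est}
        for uid, recs in top_n.items()
        for rank, (iid, est) in enumerate(recs, start=1)
    ]
    return pd.DataFrame(rows, columns=["author", "rank", "product", "est"])
```

For example, `top_n_to_frame(top_5_SVD).query("rank == 1")` would list each user's first recommendation.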

--------------------

### 9. Try other techniques (Example: cross validation) to get better results.

#### a) SVD Model

In [None]:
svm_cv = cross_validate(algo_svm, data_svd, measures=['RMSE'], cv=5, verbose=False)
print('\n Mean of svm_cv score:', round(svm_cv['test_rmse'].mean(),2),'\n')
svm_cv

#### b) User-based Model

In [None]:
knn_u_cv = cross_validate(algo_u, data_Knn_user, measures=['RMSE'], cv=5, verbose=False)
print('\n Mean of knn_u_cv score:', round(knn_u_cv['test_rmse'].mean(),2),'\n')
knn_u_cv

#### c) Item-based Model

In [None]:
knn_i_cv = cross_validate(algo_i, data_Knn_item, measures=['RMSE'], cv=5, verbose=False)
print('\n Mean of knn_i_cv score:', round(knn_i_cv['test_rmse'].mean(),2),'\n')
knn_i_cv

------------------

**Conclusion** - After applying the cross-validation technique -
- Mean of svm_cv score: 2.52 (essentially the same as the 2.53 from our SVD model)
- Mean of knn_u_cv score: 2.63 (the value has increased from 2.54 to 2.63)
- Mean of knn_i_cv score: 2.65 (the value has increased from 2.53 to 2.65)
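
To put the cross-validated scores in context, it can help to compare them with a trivial baseline that always predicts the mean rating of the training folds. The sketch below shows the k-fold mechanics with NumPy only; it is not how `cross_validate` is implemented, and the function name `kfold_rmse_global_mean` is hypothetical.

```python
import numpy as np


def kfold_rmse_global_mean(scores, k=5, seed=612):
    """k-fold RMSE of a baseline that always predicts the training mean."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Shuffle the row indices once, then cut them into k test folds.
    folds = np.array_split(rng.permutation(len(scores)), k)
    fold_rmse = []
    for test_idx in folds:
        train_mask = np.ones(len(scores), dtype=bool)
        train_mask[test_idx] = False
        prediction = scores[train_mask].mean()  # fitted on training folds only
        errors = scores[test_idx] - prediction
        fold_rmse.append(np.sqrt(np.mean(errors ** 2)))
    return np.array(fold_rmse)
```

Calling `kfold_rmse_global_mean(data_model['score']).mean()` would give a reference value: a model whose cross-validated RMSE is not clearly below it is adding little beyond the global mean.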

### 10. In what business scenario should you use popularity-based recommendation systems?

A popularity-based recommendation system works on the **principle of popularity and/or trend**: it recommends the most popular items directly to every user.


It can be used when we have no user preferences, for example for new users. It does not suffer from the cold-start problem, which means that **even on day 1 of the business it can recommend products under various filters, without requiring the user's historical data.**


For example, if a product is purchased by most people, the system treats it as the most popular and recommends it to every new user who has just signed up; **the chances are high that the new user will also purchase it**.


Examples - 
- **YouTube**: Trending videos.
- **Google News**: News filtered by trending and most popular stories.
- **Twitter**: Trending hashtags.
- **Music apps**: Discovering trending music from different catalogs.



-------------

### 11. In what business scenario should you use CF-based recommendation systems?

Collaborative filtering is a recommender technique that works on the **similarity between different users and also between items**, and it is widely used on e-commerce and online movie websites. It looks at the **tastes of similar users** to make recommendations. It is a **personalised recommender system**: recommendations are based on the past behaviour of users.


It is suited to a catalogue with different types of items, for example a supermarket's inventory, where items of various categories can be added. For a set of similar items, such as a bookstore's, known features like writers and genres are useful, and content-based or hybrid approaches might work better.


Collaborative filtering can help a recommender avoid overspecialising in a user's profile, letting it recommend items that are completely different from what the user has seen before. If you want your recommender to **not suggest a pair of sneakers to someone who just bought a similar pair, try adding collaborative filtering to it**.


Examples - Most websites like **Amazon, YouTube, and Netflix** use collaborative filtering as a part of their sophisticated recommendation systems.
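
To make "taste of similar users" concrete, here is a minimal NumPy sketch of a mean-centred, user-based prediction in the spirit of KNNWithMeans: the target user's mean plus a similarity-weighted average of how far each neighbour's rating sits from that neighbour's own mean. It is an illustration under simplifying assumptions (Pearson similarity on co-rated items, positive similarities only), not surprise's implementation, and the function name is hypothetical.

```python
import numpy as np


def predict_user_based(R, user, item, k=2):
    """Mean-centred user-based prediction; np.nan in R marks a missing rating."""
    means = np.nanmean(R, axis=1)  # each user's average rating
    neighbours = []
    for v in range(R.shape[0]):
        if v == user or np.isnan(R[v, item]):
            continue  # skip the user themselves and users who have not rated the item
        common = ~np.isnan(R[user]) & ~np.isnan(R[v])  # co-rated items
        if common.sum() < 2:
            continue
        a = R[user, common] - R[user, common].mean()
        b = R[v, common] - R[v, common].mean()
        denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
        sim = (a * b).sum() / denom if denom > 0 else 0.0
        if sim > 0:
            neighbours.append((sim, R[v, item] - means[v]))
    if not neighbours:
        return means[user]  # fall back to the user's own mean
    top = sorted(neighbours, reverse=True)[:k]  # k most similar users
    weights = np.array([s for s, _ in top])
    deviations = np.array([d for _, d in top])
    return means[user] + (weights * deviations).sum() / weights.sum()


# Made-up ratings: rows are users, columns are phones, np.nan = not rated.
R = np.array([
    [8.0, 6.0, np.nan],
    [9.0, 7.0, 10.0],
    [4.0, 8.0, 2.0],
])
print(predict_user_based(R, user=0, item=2))
```

In the toy matrix, the second user rates the first two phones in the same pattern as the first user, so their above-average rating of the third phone lifts the prediction above the first user's mean of 7; the third user has the opposite pattern and is ignored.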

-----------

### 12. What other possible methods can you think of that could further improve the recommendations for different users?

Other methods, such as a hybrid recommendation system, which combines content-based and collaborative filtering, can be considered. Combining the two may help to overcome the shortcomings of using them separately, and can be more effective in some cases.

Some of the approaches are -

**a) Weighted recommendation system** - This takes the outputs of each model and combines them with static weights, i.e. the weights do not change between the train and test sets.

For example, we can combine a content-based model and an item-item collaborative filtering model, with each contributing a weight of 50% to the final prediction.

**b) Switching hybrid recommendation system** - This selects a single recommendation system depending on the situation. The selection criteria should be set based on the user profile or other features, which suits datasets that are sensitive at the item level. The switching approach adds a layer on top of the recommendation models that selects the appropriate model to use.

**c) Mixed hybrid recommendation system** - This approach first uses the user profile and features to generate different sets of candidate data, feeds each set to the appropriate recommendation model, and combines the predictions to produce the final recommendations. It can make a large number of recommendations simultaneously, and fits each partial dataset to the appropriate model for better performance.

**d) Feature combination** - Here we inject features from a collaborative recommendation model into a content-based recommendation model. The hybrid model can then take the collaborative data from the sub-system into account without relying on one model exclusively.
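
The weighted and switching approaches can be sketched in a few lines. Both functions below are hypothetical illustrations: they assume each model already produces a score on the same 1 to 10 scale, and the switching rule (use collaborative filtering only for users with at least `min_ratings` ratings) is one possible criterion among many.

```python
def weighted_hybrid(cf_score, content_score, cf_weight=0.5):
    """Blend two model scores with a fixed weight (approach a)."""
    return cf_weight * cf_score + (1.0 - cf_weight) * content_score


def switching_hybrid(cf_score, content_score, n_user_ratings, min_ratings=5):
    """Use the collaborative score only for users with enough history (approach b)."""
    return cf_score if n_user_ratings >= min_ratings else content_score


# Made-up scores for one user-phone pair.
print(weighted_hybrid(8.0, 6.0))                       # 50/50 blend
print(switching_hybrid(8.0, 6.0, n_user_ratings=1))    # new user: content-based score
print(switching_hybrid(8.0, 6.0, n_user_ratings=10))   # active user: collaborative score
```

The switching rule also gives a simple answer to cold start: new users are served by the model that does not need their rating history.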

------------