![bookstore](bookstore.jpg)


Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.

You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:

* `price`
* `popularity` (target variable)
* `review/summary`
* `review/text`
* `review/helpfulness`
* `authors`
* `categories`

You'll need to build a model that predicts whether a book will be rated as popular or not.

They have high expectations of you, so they have set a target of at least 70% accuracy! You are free to use as many features as you like, and you will need to engineer new features to achieve this level of performance.

In [None]:
# Import some required packages
import pandas as pd

# Read in the dataset
original = pd.read_csv("books.csv")

# Preview the first five rows
original.head()

In [None]:
original.info()

In [None]:
original.describe()

The dataset appears to be a book review dataset rather than a book dataset. The same book (identified by its title, price, authors, categories, and description) appears once for every review it has received.
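
One way to check this kind of repetition is to count rows per title. The sketch below uses a tiny made-up frame (the titles, prices, and summaries are illustrative and not taken from `books.csv`) to show what "one row per review" looks like.

```python
import pandas as pd

# Toy stand-in for the real data: one row per review, not per book (values are made up)
toy = pd.DataFrame({
    "title": ["Book A", "Book A", "Book A", "Book B"],
    "price": [9.99, 9.99, 9.99, 4.50],
    "review/summary": ["Loved it", "Too long", "Great read", "Not for me"],
})

# Number of review rows attached to each title
reviews_per_title = toy.groupby("title").size()

# Rows whose book-level columns repeat an earlier row
n_repeated = toy.duplicated(subset=["title", "price"]).sum()

print(reviews_per_title)
print(n_repeated)
```

A title with several rows that differ only in the review columns is a sign that the row unit is the review, not the book.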

In [None]:
books_df = original[['title', 'price', 'authors', 'categories', 'description', 'popularity']]
books_df.duplicated().sum()

In [None]:
targets = original[['title', 'popularity']]
targets = targets.pivot_table(columns='popularity', index='title', aggfunc='size', fill_value=0)
targets['pop_diff'] = targets['Popular'] - targets['Unpopular']
targets['pop'] = targets['pop_diff']>0
targets

We now have a new `targets` dataset in which each title (sometimes a specific edition) appears only once. The new target is `pop`, which is True if the title has more 'Popular' entries than 'Unpopular' ones.
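
To make the majority-vote rule concrete, here is a minimal sketch on made-up rows. It counts labels with `pd.crosstab`, which is swapped in here as another way to get a per-title count table; the toy titles are hypothetical.

```python
import pandas as pd

# Made-up review rows; the labels mirror the 'Popular'/'Unpopular' values used above
toy = pd.DataFrame({
    "title": ["Book A", "Book A", "Book A", "Book B", "Book B", "Book C", "Book C"],
    "popularity": ["Popular", "Popular", "Unpopular",
                   "Unpopular", "Unpopular",
                   "Popular", "Unpopular"],
})

# One row per title, one column per label, cells hold the number of reviews
counts = pd.crosstab(toy["title"], toy["popularity"])

# Majority vote: True only when 'Popular' strictly outnumbers 'Unpopular'
counts["pop"] = (counts["Popular"] - counts["Unpopular"]) > 0

print(counts)
```

Note that a tie (Book C in this toy data) is labelled False, because the comparison is strict.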

Now let's add two types of features:
- Book features (price, description, category and authors)
- Reviews (number, positives vs negatives)
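
As a rough illustration of the review-level features, the sketch below parses helpfulness strings of the form `helpful/total` and rolls them up to one row per title. The toy values and the `n_reviews` and `help_ratio` columns are hypothetical additions for illustration only.

```python
import pandas as pd

# Made-up review rows with 'helpful/total' vote strings
toy = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B"],
    "review/helpfulness": ["3/5", "0/0", "7/10"],
})

# Split 'helpful/total' into two integer columns
parts = toy["review/helpfulness"].str.split("/", expand=True).astype(int)
toy["rev_helps"] = parts[0]
toy["rev_totals"] = parts[1]

# Aggregate to one row per title
per_book = toy.groupby("title").agg(
    n_reviews=("rev_totals", "count"),
    rev_helps=("rev_helps", "sum"),
    rev_totals=("rev_totals", "sum"),
)

# Share of helpful votes, set to 0 when a title has no votes at all
safe_totals = per_book["rev_totals"].where(per_book["rev_totals"] > 0)
per_book["help_ratio"] = (per_book["rev_helps"] / safe_totals).fillna(0.0)

print(per_book)
```

Aggregating to the title level keeps the review features on the same index as the book features, so the two groups can later be joined on `title`.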

In [None]:
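def unique_concat(values):
    # Join the distinct non-missing values; dropna() avoids a TypeError on NaN entries
    return ','.join(sorted(set(values.dropna().astype(str))))


book_feat = original[['title', 'description', 'price', 'categories', 'authors']]
book_feat = book_feat.groupby('title').agg({
    'price': 'mean',
    # The description repeats on every review row, so keep one copy per title
    'description': unique_concat,
    'authors': unique_concat,
    'categories': unique_concat
})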


In [None]:
# One-hot encode the 'authors' and 'categories' columns
authors = book_feat['authors'].str.get_dummies(sep=',')
categories = book_feat['categories'].str.get_dummies(sep=',')

# Concatenate the original DataFrame with the one-hot encoded columns
book_feat = pd.concat([book_feat, authors, categories], axis=1)

del authors, categories

# Drop the original 'authors' and 'categories' columns
book_feat.drop(['authors', 'categories'], inplace=True, axis=1)

book_feat.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

X_tfidf = tfidf_vectorizer.fit_transform(book_feat['description'])

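# Reuse the title index so the TF-IDF rows line up with book_feat when concatenated
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=book_feat.index)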

book_feat.drop('description', axis=1, inplace=True)

book_descs = pd.concat([book_feat, tfidf_df], axis=1)
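
`pd.concat(..., axis=1)` aligns rows by index label, not by position, so the two frames being joined need the same index. The sketch below uses a made-up two-book frame and a hand-written weight matrix (a stand-in for a dense TF-IDF matrix; the column names `w1` and `w2` are hypothetical) to show the difference.

```python
import pandas as pd

# Title-indexed book features (toy values)
book_toy = pd.DataFrame(
    {"price": [9.99, 4.50]},
    index=pd.Index(["Book A", "Book B"], name="title"),
)

# Stand-in for a dense TF-IDF matrix: one row per book, one column per term
weights = [[0.1, 0.9], [0.7, 0.3]]

# Default RangeIndex (0, 1) shares no labels with the titles, so rows do not line up
misaligned = pd.concat(
    [book_toy, pd.DataFrame(weights, columns=["w1", "w2"])], axis=1
)

# Reusing the title index keeps each weight row next to its book
aligned = pd.concat(
    [book_toy, pd.DataFrame(weights, columns=["w1", "w2"], index=book_toy.index)],
    axis=1,
)

print(misaligned)
print(aligned)
```

In the misaligned result every row has missing values on one side, which a later `fillna(0)` would silently turn into zeros.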



In [None]:
book_descs.head()

In [None]:
# Copy the slice so new columns can be added without a SettingWithCopyWarning
revs = original[['title', 'review/helpfulness', 'review/summary', 'review/text']].copy()
revs.head()

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

nltk.download('vader_lexicon')

In [None]:
revs[['rev_helps', 'rev_totals']] = revs['review/helpfulness'].str.split('/', expand=True)

# Convert the new columns to integers so they can be summed per title
revs['rev_helps'] = revs['rev_helps'].astype(int)
revs['rev_totals'] = revs['rev_totals'].astype(int)

In [None]:
sia = SentimentIntensityAnalyzer()
# Treat missing review text as empty so polarity_scores always receives a string
revs['sentiment'] = revs['review/text'].fillna('').apply(lambda x: sia.polarity_scores(x)['compound'])

In [None]:
revs.head()

In [None]:
revs_grouped = revs.groupby('title').agg({
    'rev_helps': 'sum',
    'rev_totals': 'sum',
    'sentiment': 'sum'
})

In [None]:
books = pd.merge(book_descs, targets['pop'], left_index=True, right_index=True, how='inner')
books = pd.merge(books, revs_grouped, left_index=True, right_index=True, how='inner')
books.head()

In [None]:
books = books.fillna(value=0)
books.head()

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

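# The target is named 'pop_y' only if a TF-IDF token 'pop' collided during the merge; otherwise it is 'pop'
target_col = 'pop_y' if 'pop_y' in books.columns else 'pop'
X = books.drop(target_col, axis=1)
y = books[target_col]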

# del books, book_feat, authors, categories, targets

# Fix the seed so the split (and the reported accuracy) is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [None]:
rf = RandomForestClassifier(n_estimators=120, max_depth=50, min_samples_split=5, random_state=42, class_weight="balanced")
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

print('train_acc: {}'.format(rf.score(X_train, y_train)))
print('test_acc: {}'.format(accuracy_score(y_test, y_pred)))
print(confusion_matrix(y_test, y_pred))

In [None]:
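from sklearn.linear_model import LogisticRegression

# max_iter is raised above the default of 100 to give the solver more room to converge on this wide feature matrix
lr = LogisticRegression(max_iter=1000)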
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

print('train_acc: {}'.format(lr.score(X_train, y_train)))
print('test_acc: {}'.format(accuracy_score(y_test, y_pred)))
print(confusion_matrix(y_test, y_pred))

In [None]:
books.head()

In [None]:
y.value_counts(normalize=True)