# May Menachem's Project


## Intro

The coronavirus took the entire world by surprise, changing everyone's daily routine. City dwellers no longer spent their free time outside, going to cafes and malls; more people were home, reading books. That attracted the attention of startups that rushed to develop new apps for book lovers.
You've been given a database of one of the services competing in this market. It contains data on books, publishers, authors, and customer ratings and reviews of books. This information will be used to generate a value proposition for a new product.

**Tasks:**
- Find the number of books released after January 1, 2000.
- Find the number of user reviews and the average rating for each book.
- Identify the publisher that has released the greatest number of books with more than 50 pages.
- Identify the author with the highest average book rating (look only at books with at least 50 ratings).
- Find the average number of text reviews among users who rated more than 50 books.

## Description of the data

- books: Contains data on books:
    - book_id
    - author_id
    - title
    - num_pages — number of pages
    - publication_date
    - publisher_id

- authors: Contains data on authors:
    - author_id
    - author

- publishers: Contains data on publishers:
    - publisher_id
    - publisher

- ratings: Contains data on user ratings:
    - rating_id
    - book_id
    - username — the name of the user who rated the book
    - rating

- reviews: Contains data on customer reviews:
    - review_id
    - book_id
    - username — the name of the user who reviewed the book
    - text — the text of the review

## Initialization and loading the data

In [1]:
# import libraries
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sqlalchemy import create_engine

In [2]:
# Set the maximum number of rows to display
pd.options.display.max_rows = 1000
# Set the maximum number of columns to display
pd.set_option('display.max_columns', 50)

In [3]:
# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [4]:
# connecting to the database
db_config = {'user': 'practicum_student',                          # username
             'pwd': 's65BlTKV3faNIGhmvJVzOqhs',                    # password
             'host': 'rc1b-wcoijxj3yxfsf3fs.mdb.yandexcloud.net',
             'port': 6432,                                         # connection port
             'db': 'data-analyst-final-project-db'}                # the name of the database

connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(db_config['user'],
                                                         db_config['pwd'],
                                                         db_config['host'],
                                                         db_config['port'],
                                                         db_config['db'])

engine = create_engine(connection_string, connect_args={'sslmode': 'require'})

In [5]:
# defining a function that returns the result of an SQL query as a DataFrame
def get_q_result(sql_query):
    return pd.read_sql(sql_query, con=engine)

## Tasks

### Printing the first rows of each table

In [6]:
# creating a list of the tables
tables_list = ['books', 'authors', 'publishers', 'ratings', 'reviews']

# creating a query string
get_first_rows_q = 'SELECT * FROM {} LIMIT 10'

# printing the 10 first rows of each table
for table in tables_list:
    print("First 10 rows of the table:",table)
    get_q_result(get_first_rows_q.format(table))

First 10 rows of the table: books


|  | book_id | author_id | title | num_pages | publication_date | publisher_id |
|---|---|---|---|---|---|---|
| 0 | 1 | 546 | 'Salem's Lot | 594 | 2005-11-01 | 93 |
| 1 | 2 | 465 | 1 000 Places to See Before You Die | 992 | 2003-05-22 | 336 |
| 2 | 3 | 407 | 13 Little Blue Envelopes (Little Blue Envelope... | 322 | 2010-12-21 | 135 |
| 3 | 4 | 82 | 1491: New Revelations of the Americas Before C... | 541 | 2006-10-10 | 309 |
| 4 | 5 | 125 | 1776 | 386 | 2006-07-04 | 268 |
| 5 | 6 | 257 | 1st to Die (Women's Murder Club #1) | 424 | 2005-05-20 | 116 |
| 6 | 7 | 258 | 2nd Chance (Women's Murder Club #2) | 400 | 2005-05-20 | 116 |
| 7 | 8 | 260 | 4th of July (Women's Murder Club #4) | 448 | 2006-06-01 | 318 |
| 8 | 9 | 563 | A Beautiful Mind | 461 | 2002-02-04 | 104 |
| 9 | 10 | 445 | A Bend in the Road | 341 | 2005-04-01 | 116 |


First 10 rows of the table: authors


|  | author_id | author |
|---|---|---|
| 0 | 1 | A.S. Byatt |
| 1 | 2 | Aesop/Laura Harris/Laura Gibbs |
| 2 | 3 | Agatha Christie |
| 3 | 4 | Alan Brennert |
| 4 | 5 | Alan Moore/David Lloyd |
| 5 | 6 | Alan Paton |
| 6 | 7 | Albert Camus/Justin O'Brien |
| 7 | 8 | Aldous Huxley |
| 8 | 9 | Aldous Huxley/Christopher Hitchens |
| 9 | 10 | Aleksandr Solzhenitsyn/H.T. Willetts |


First 10 rows of the table: publishers


|  | publisher_id | publisher |
|---|---|---|
| 0 | 1 | Ace |
| 1 | 2 | Ace Book |
| 2 | 3 | Ace Books |
| 3 | 4 | Ace Hardcover |
| 4 | 5 | Addison Wesley Publishing Company |
| 5 | 6 | Aladdin |
| 6 | 7 | Aladdin Paperbacks |
| 7 | 8 | Albin Michel |
| 8 | 9 | Alfred A. Knopf |
| 9 | 10 | Alfred A. Knopf Books for Young Readers |


First 10 rows of the table: ratings


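|  | rating_id | book_id | username | rating |
|---|---|---|---|---|
| 0 | 1 | 1 | ryanfranco | 4 |
| 1 | 2 | 1 | grantpatricia | 2 |
| 2 | 3 | 1 | brandtandrea | 5 |
| 3 | 4 | 2 | lorichen | 3 |
| 4 | 5 | 2 | mariokeller | 2 |
| 5 | 6 | 3 | johnsonamanda | 4 |
| 6 | 7 | 3 | scotttamara | 5 |
| 7 | 8 | 3 | lesliegibbs | 5 |
| 8 | 9 | 4 | abbottjames | 5 |
| 9 | 10 | 4 | valenciaanne | 4 |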


First 10 rows of the table: reviews


|  | review_id | book_id | username | text |
|---|---|---|---|---|
| 0 | 1 | 1 | brandtandrea | Mention society tell send professor analysis. ... |
| 1 | 2 | 1 | ryanfranco | Foot glass pretty audience hit themselves. Amo... |
| 2 | 3 | 2 | lorichen | Listen treat keep worry. Miss husband tax but ... |
| 3 | 4 | 3 | johnsonamanda | Finally month interesting blue could nature cu... |
| 4 | 5 | 3 | scotttamara | Nation purpose heavy give wait song will. List... |
| 5 | 6 | 3 | lesliegibbs | Analysis no several cause international. |
| 6 | 7 | 4 | valenciaanne | One there cost another. Say type save. With pe... |
| 7 | 8 | 4 | abbottjames | Within enough mother. There at system full rec... |
| 8 | 9 | 5 | npowers | Thank now focus realize economy focus fly. Ite... |
| 9 | 10 | 5 | staylor | Game push lot reduce where remember. Including... |


### Finding the number of books released after January 1, 2000

In [7]:
print("The number of books released after January 1, 2000 is:",
      get_q_result(
          '''SELECT COUNT(DISTINCT book_id)
             FROM books
             WHERE publication_date > '2000-01-01'
          '''
      ).iloc[0, 0])

The number of books released after January 1, 2000 is: 819


### Finding the number of user reviews and the average rating for each book.

In [8]:
get_q_result(
 
     '''SELECT    books.book_id, title, COUNT(review_id) AS n_reviews, COUNT(rating_id) AS n_rating, AVG(rating) AS avg_rating
        FROM  
                  books 
                  LEFT OUTER JOIN ratings ON books.book_id = ratings.book_id
                  LEFT OUTER JOIN reviews ON books.book_id = reviews.book_id      
        GROUP BY  books.book_id, title
        ORDER BY  n_reviews DESC, avg_rating DESC
     '''
             )

|  | book_id | title | n_reviews | n_rating | avg_rating |
|---|---|---|---|---|---|
| 0 | 948 | Twilight (Twilight #1) | 1120 | 1120 | 3.6625 |
| 1 | 750 | The Hobbit or There and Back Again | 528 | 528 | 4.125 |
| 2 | 673 | The Catcher in the Rye | 516 | 516 | 3.825581 |
| 3 | 302 | Harry Potter and the Prisoner of Azkaban (Harr... | 492 | 492 | 4.414634 |
| 4 | 299 | Harry Potter and the Chamber of Secrets (Harry... | 480 | 480 | 4.2875 |
| 5 | 75 | Angels & Demons (Robert Langdon #1) | 420 | 420 | 3.678571 |
| 6 | 301 | Harry Potter and the Order of the Phoenix (Har... | 375 | 375 | 4.186667 |
| 7 | 779 | The Lightning Thief (Percy Jackson and the Oly... | 372 | 372 | 4.080645 |
| 8 | 722 | The Fellowship of the Ring (The Lord of the Ri... | 370 | 370 | 4.391892 |
| 9 | 79 | Animal Farm | 370 | 370 | 3.72973 |


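In this output, `n_reviews` and `n_rating` are equal for every book shown. That is an artifact of the query rather than a property of the data: joining both `ratings` and `reviews` to `books` pairs every rating of a book with every review of it, so for a book that has both, each `COUNT` returns the number of ratings multiplied by the number of reviews. Both columns therefore overstate the real counts; counting `DISTINCT review_id` and `DISTINCT rating_id` would give the per-book figures. `avg_rating` is not affected, because every rating is repeated the same number of times.

With that caveat, `Twilight (Twilight #1)` tops the list by a wide margin, with a count more than twice that of the second book.

Books with an average rating of 4 and above appear across the whole range of `n_reviews` (from 0 to 528) and `n_rating` (from 3 to 528), so there is not necessarily a correlation between how much attention a book receives and its average rating.

The books at the top of the list are mostly fantasy series such as Twilight, Harry Potter, and The Lord of the Rings.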
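The effect of joining `ratings` and `reviews` to `books` in a single query can be reproduced on a toy dataset. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the project's PostgreSQL database, and the rows are hypothetical. It compares plain `COUNT` with `COUNT(DISTINCT ...)` over the same two joins.

```python
import sqlite3

# In-memory SQLite database standing in for the real one (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books   (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE ratings (rating_id INTEGER PRIMARY KEY, book_id INTEGER, rating INTEGER);
    CREATE TABLE reviews (review_id INTEGER PRIMARY KEY, book_id INTEGER, text TEXT);
    INSERT INTO books   VALUES (1, 'Book A'), (2, 'Book B');
    -- Book A: 3 ratings and 2 reviews; Book B: 1 rating and no reviews.
    INSERT INTO ratings VALUES (1, 1, 5), (2, 1, 4), (3, 1, 3), (4, 2, 2);
    INSERT INTO reviews VALUES (1, 1, 'good'), (2, 1, 'fine');
""")

# Plain COUNT: every rating is paired with every review of the same book.
naive_q = """
    SELECT books.book_id, COUNT(review_id), COUNT(rating_id), AVG(rating)
    FROM books
    LEFT JOIN ratings ON books.book_id = ratings.book_id
    LEFT JOIN reviews ON books.book_id = reviews.book_id
    GROUP BY books.book_id
    ORDER BY books.book_id
"""

# COUNT(DISTINCT ...) collapses the repeated ids back to the real counts.
distinct_q = """
    SELECT books.book_id,
           COUNT(DISTINCT review_id),
           COUNT(DISTINCT rating_id),
           AVG(rating)
    FROM books
    LEFT JOIN ratings ON books.book_id = ratings.book_id
    LEFT JOIN reviews ON books.book_id = reviews.book_id
    GROUP BY books.book_id
    ORDER BY books.book_id
"""

naive_rows = conn.execute(naive_q).fetchall()
distinct_rows = conn.execute(distinct_q).fetchall()
print(naive_rows)     # Book A shows 6 and 6 (3 ratings x 2 reviews)
print(distinct_rows)  # Book A shows 2 reviews and 3 ratings
```

The average rating is the same in both versions, since each rating is duplicated equally often; only the counts change.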

### Identifying the publisher that has released the greatest number of books with more than 50 pages 

This will help to exclude brochures and similar publications from the analysis.

In [9]:
get_q_result(
 
     ''' SELECT      publishers.*, COUNT(book_id) AS n_books
         FROM 
                     publishers 
                     INNER JOIN books ON publishers.publisher_id = books.publisher_id
         WHERE       num_pages > 50      
         GROUP BY    publishers.publisher_id
         ORDER BY    COUNT(book_id) DESC
    
     '''
             )

|  | publisher_id | publisher | n_books |
|---|---|---|---|
| 0 | 212 | Penguin Books | 42 |
| 1 | 309 | Vintage | 31 |
| 2 | 116 | Grand Central Publishing | 25 |
| 3 | 217 | Penguin Classics | 24 |
| 4 | 33 | Ballantine Books | 19 |
| 5 | 35 | Bantam | 19 |
| 6 | 45 | Berkley | 17 |
| 7 | 46 | Berkley Books | 14 |
| 8 | 284 | St. Martin's Press | 14 |
| 9 | 333 | William Morrow Paperbacks | 13 |


Penguin Books has released the greatest number of books with more than 50 pages (42), ahead of Vintage (31) and Grand Central Publishing (25). Most publishers have released only a few such books; only 16 publishers have more than 10.

### Identifying the author with the highest average book rating

Looking only at books with at least 50 ratings.

In [10]:
get_q_result(
 
     ''' SELECT authors.*, COUNT(rating_id) AS n_ratings, AVG(rating) AS avg_rating 
         FROM 
                   authors
                   INNER JOIN books ON authors.author_id = books.author_id
                   INNER JOIN ratings ON books.book_id = ratings.book_id     
         GROUP BY  authors.author_id
         HAVING    COUNT(rating_id) >= 50
         ORDER BY  AVG(rating)  DESC
    
     '''
             )

|  | author_id | author | n_ratings | avg_rating |
|---|---|---|---|---|
| 0 | 130 | Diana Gabaldon | 50 | 4.3 |
| 1 | 236 | J.K. Rowling/Mary GrandPré | 312 | 4.288462 |
| 2 | 3 | Agatha Christie | 53 | 4.283019 |
| 3 | 402 | Markus Zusak/Cao Xuân Việt Khương | 53 | 4.264151 |
| 4 | 240 | J.R.R. Tolkien | 166 | 4.240964 |
| 5 | 499 | Roald Dahl/Quentin Blake | 62 | 4.209677 |
| 6 | 376 | Louisa May Alcott | 54 | 4.203704 |
| 7 | 498 | Rick Riordan | 84 | 4.130952 |
| 8 | 39 | Arthur Golden | 56 | 4.107143 |
| 9 | 542 | Stephen King | 106 | 4.009434 |


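Among authors whose books have received at least 50 ratings in total, Diana Gabaldon has the highest average rating (4.3). Note that the `HAVING` clause applies the 50-rating threshold per author, summed over all of the author's books, rather than per book as the task states, so an author can qualify without any single book reaching 50 ratings. Here, too, there is not necessarily a correlation between the number of ratings and the average rating.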
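The difference between filtering per author and filtering per book can be sketched on toy data. The example below is an assumption-laden illustration, not the notebook's query: it uses an in-memory SQLite database with hypothetical rows, a hypothetical threshold `MIN_RATINGS = 3` in place of 50, and it averages the per-book averages (averaging all individual ratings would be another reasonable choice).

```python
import sqlite3

# In-memory SQLite database standing in for the real one (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, author TEXT);
    CREATE TABLE books   (book_id INTEGER PRIMARY KEY, author_id INTEGER);
    CREATE TABLE ratings (rating_id INTEGER PRIMARY KEY, book_id INTEGER, rating INTEGER);
    INSERT INTO authors VALUES (1, 'Author A'), (2, 'Author B');
    INSERT INTO books   VALUES (1, 1), (2, 1), (3, 2);
    -- Author A: two books with 2 ratings each; Author B: one book with 3 ratings.
    INSERT INTO ratings VALUES (1, 1, 5), (2, 1, 5), (3, 2, 5), (4, 2, 5),
                               (5, 3, 4), (6, 3, 4), (7, 3, 5);
""")

MIN_RATINGS = 3  # hypothetical stand-in for the 50-rating threshold

# Threshold applied per author: ratings are summed over all of the author's books.
per_author_q = """
    SELECT author, COUNT(rating_id) AS n_ratings, AVG(rating) AS avg_rating
    FROM authors
    INNER JOIN books   ON authors.author_id = books.author_id
    INNER JOIN ratings ON books.book_id = ratings.book_id
    GROUP BY authors.author_id, author
    HAVING COUNT(rating_id) >= ?
    ORDER BY avg_rating DESC
"""

# Threshold applied per book: only books that reach it contribute to the author.
per_book_q = """
    SELECT author, AVG(book_avg) AS avg_rating
    FROM authors
    INNER JOIN books ON authors.author_id = books.author_id
    INNER JOIN (
        SELECT book_id, AVG(rating) AS book_avg
        FROM ratings
        GROUP BY book_id
        HAVING COUNT(rating_id) >= ?
    ) AS rated ON books.book_id = rated.book_id
    GROUP BY authors.author_id, author
    ORDER BY avg_rating DESC
"""

per_author_rows = conn.execute(per_author_q, (MIN_RATINGS,)).fetchall()
per_book_rows = conn.execute(per_book_q, (MIN_RATINGS,)).fetchall()
print(per_author_rows)  # Author A ranks first: 4 ratings in total
print(per_book_rows)    # only Author B remains: no book by Author A has 3 ratings
```

In this toy case the two filters pick different top authors, which is why the choice of filter matters for the task as stated.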

### Finding the average number of text reviews among users who rated more than 50 books


In [11]:
print("The average number of text reviews among users who rated more than 50 books is:",
      
      get_q_result(
 
     ''' SELECT AVG(n_text_reviews) AS avg_n_text_reviews
         FROM 
                     (
                     
                      SELECT  ratings.username,
                              COUNT(DISTINCT review_id) AS n_text_reviews,
                              COUNT(DISTINCT(ratings.book_id)) AS n_rated_books
                      FROM 
                              reviews
                      RIGHT OUTER JOIN  ratings ON reviews.username = ratings.username         
                      GROUP BY          ratings.username
                      HAVING            COUNT(DISTINCT ratings.book_id) > 50
                      
                      ) AS table1
    
     '''
             ).iloc[0,0])

The average number of text reviews among users who rated more than 50 books is: 24.333333333333332


Users who rated more than 50 books wrote about 24 text reviews each on average, so they review far fewer books than they rate, probably because writing a review takes more time than giving a rating.
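The same figure can also be computed without joining `ratings` to `reviews` row by row. The sketch below is one possible formulation, shown on an in-memory SQLite database with hypothetical rows and a hypothetical threshold `MIN_BOOKS = 2` in place of 50: it first selects the active raters, then left-joins their reviews so that users with no reviews still count as zero.

```python
import sqlite3

# In-memory SQLite database standing in for the real one (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ratings (rating_id INTEGER PRIMARY KEY, book_id INTEGER,
                          username TEXT, rating INTEGER);
    CREATE TABLE reviews (review_id INTEGER PRIMARY KEY, book_id INTEGER,
                          username TEXT, text TEXT);
    -- alice and bob rated 3 books each; carol rated only 1.
    INSERT INTO ratings VALUES
        (1, 1, 'alice', 5), (2, 2, 'alice', 4), (3, 3, 'alice', 3),
        (4, 1, 'bob',   2), (5, 2, 'bob',   4), (6, 3, 'bob',   5),
        (7, 1, 'carol', 5);
    -- alice wrote 2 reviews, bob none, carol 3.
    INSERT INTO reviews VALUES
        (1, 1, 'alice', 'a'), (2, 2, 'alice', 'b'),
        (3, 1, 'carol', 'c'), (4, 2, 'carol', 'd'), (5, 3, 'carol', 'e');
""")

MIN_BOOKS = 2  # hypothetical stand-in for the 50-book threshold

query = """
    SELECT AVG(n_text_reviews)
    FROM (
        SELECT heavy.username, COUNT(reviews.review_id) AS n_text_reviews
        FROM (
            -- users who rated more than MIN_BOOKS distinct books
            SELECT username
            FROM ratings
            GROUP BY username
            HAVING COUNT(DISTINCT book_id) > ?
        ) AS heavy
        LEFT JOIN reviews ON reviews.username = heavy.username
        GROUP BY heavy.username
    ) AS per_user
"""

avg_reviews = conn.execute(query, (MIN_BOOKS,)).fetchone()[0]
print(avg_reviews)  # alice has 2 reviews and bob has 0; carol is excluded
```

Because each active rater appears once before the join, a plain `COUNT` of review ids is enough here and no `DISTINCT` is needed.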

## Summary and conclusions

- The books at the top of the review ranking are mostly fantasy series such as Twilight, Harry Potter, and The Lord of the Rings.

- Penguin Books has released the greatest number of books with more than 50 pages (42), ahead of Vintage (31). Most publishers have released only a few books with more than 50 pages.

- Books with an average rating of 4 and above span the whole range of `n_reviews` (0 to 528) and `n_rating` (3 to 528), so there is not necessarily a correlation between these counts and the average rating. Likewise, there is not necessarily a correlation between the number of ratings an author's books receive and their average rating. Hence popularity does not necessarily indicate quality or how enjoyable a book is to read.

- Users who rated more than 50 books wrote about 24 text reviews on average, fewer than half as many reviews as ratings, probably because writing a review takes more time than giving a rating. The per-book query cannot show whether books receive as many reviews as ratings: it joins both tables to `books`, so its two counts are equal by construction whenever a book has both.

Ratings make it easy to scan for books, whereas reading enough reviews to decide which book to buy takes a lot of time, and potential customers may still find it hard to tell whether they would like a book.

Hence, it is suggested to build a book recommendation system for book lovers that is based not solely on ratings but also on reviews, using sentiment analysis to determine the sentiment of each review, and additionally on personal preferences (books the customer liked in the past). Hopefully this would help readers find more enjoyable books.
 