# Exploratory Data Analysis (EDA)

In [1]:
import pandas as pd

## Books Rating

In [2]:
books_rating = pd.read_csv('Books_rating.csv')

In [3]:
books_rating.columns

Index(['Id', 'Title', 'Price', 'User_id', 'profileName', 'score', 'time',
       'summary', 'text'],
      dtype='object')

Renaming the columns:

In [4]:
books_rating = (
    books_rating
    .rename(columns=lambda x:
        'profile_name' if x == 'profileName'
        else x.lower()
    )
)

Checking for duplicated rows:

In [5]:
books_rating.duplicated().any()

np.True_

Removing the duplicated rows:

In [6]:
books_rating = books_rating.drop_duplicates()

In [7]:
books_rating.shape

(2978897, 9)

In [8]:
books_rating.head()

Unnamed: 0,id,title,price,user_id,profile_name,score,time,summary,text
0,1882931173,Its Only Art If Its Well Hung!,,AVCGYZL8FQQTD,"Jim of Oz ""jim-of-oz""",4.0,940636800,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,826414346,Dr. Seuss: American Icon,,A30TK6U7DNS82R,Kevin Killian,5.0,1095724800,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,,A3UH4UZ4RSVO82,John Granger,5.0,1078790400,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,826414346,Dr. Seuss: American Icon,,A2MVUWT453QH61,"Roy E. Perry ""amateur philosopher""",4.0,1090713600,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,826414346,Dr. Seuss: American Icon,,A22X4XUPKF66MR,"D. H. Richards ""ninthwavestore""",4.0,1107993600,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


Data types:

In [9]:
books_rating.dtypes

id               object
title            object
price           float64
user_id          object
profile_name     object
score           float64
time              int64
summary          object
text             object
dtype: object

Checking for NaN values:

In [10]:
books_rating.isna().any()

id              False
title            True
price            True
user_id          True
profile_name     True
score           False
time            False
summary          True
text             True
dtype: bool

Count of non-null values:

In [11]:
books_rating.count()

id              2978897
title           2978689
price            478827
user_id         2424538
profile_name    2424420
score           2978897
time            2978897
summary         2978490
text            2978889
dtype: int64

Trying to understand the ```id``` column:

In [12]:
books_rating['id'].duplicated().any()

np.True_

In [13]:
(
    books_rating[['id', 'title']]
    .dropna(subset='title')
    .drop_duplicates()
    .apply(lambda col: col.duplicated().any())
)

id       False
title     True
dtype: bool

For each id there is only one title. However, there are different ids for the same title, e.g.:

In [14]:
(
    books_rating[['id', 'title']]
    .dropna(subset='title')
    .drop_duplicates()
    .groupby('title')['id'].count()
    .sort_values(ascending=False)
    .head(5)
)

title
Poems                           14
Persuasion                      13
Sermons on several occasions    12
Great Expectations              11
Pride and Prejudice             11
Name: id, dtype: int64

In [15]:
books_rating.query('title == "Poems"').head()

Unnamed: 0,id,title,price,user_id,profile_name,score,time,summary,text
40221,B0006E8QBU,Poems,,A2CPT04DPD2KKC,"Judy Matthews ""ikanu""",5.0,946425600,Poems by James Russell Lowell,The poetry in this book covers the full spectr...
104797,1852244925,Poems,,A2I6V7JNUR85MQ,"N. Dorward ""obsessive reviewer""",5.0,961545600,A major poet,It's hard to know how to review this book: Pry...
104798,1852244925,Poems,,A2PMCVAQC09I51,"""lexo-2x""",5.0,956880000,Do ya like good music?,I personally think that being a lyric poet is ...
104799,1852244925,Poems,,A14OJS0VWMOSWO,Midwest Book Review,5.0,1146528000,A highly recommended read for all dedicated po...,Poems is an inspired and inspirational collect...
104800,1852244925,Poems,,A1JL75766ADDFU,"""k-command""",5.0,1043625600,blow to the head,Shocking that only two reviews up for this. I ...


Duplicated values in the ```summary``` and ```text``` columns:

In [16]:
# (
#     books_rating
#     .pipe(lambda df: df[df['text'].duplicated(keep=False)])
#     .sort_values('text')
#     .head()
# )

Number of rows that are duplicated in every column except ```id```:

In [17]:
(
    books_rating.shape[0]
    - books_rating.drop(columns='id').drop_duplicates().shape[0]
)

323170

Example:

In [18]:
(
    books_rating[
        books_rating.drop(columns='id').duplicated(keep=False)
    ]
    .sort_values('text')
    .iloc[:2]
)

Unnamed: 0,id,title,price,user_id,profile_name,score,time,summary,text
1322581,B000QEAR2Q,The Orthodox Church,,A2D9IEFJGB483Q,Kendal B. Hunter,5.0,1006300800,Eastern Orthodoxy for Ignorants like me,!!!The book and what I found inside!!!This boo...
215598,B000EEW0LE,The Orthodox Church,,A2D9IEFJGB483Q,Kendal B. Hunter,5.0,1006300800,Eastern Orthodoxy for Ignorants like me,!!!The book and what I found inside!!!This boo...


In [19]:
books_rating.loc[1322581].compare(books_rating.loc[215598])

Unnamed: 0,self,other
id,B000QEAR2Q,B000EEW0LE


The same review appears under two different ids.

Since the ```id``` column does not identify a unique review, it is not useful here. Let's drop it.

In [20]:
books_rating = (
    books_rating
    .drop(columns='id')
    .drop_duplicates()
)

In [21]:
(
    books_rating
    .pipe(lambda df: df[df['text'].duplicated(keep=False)])
    .sort_values('text')
    .head()
)

Unnamed: 0,title,price,user_id,profile_name,score,time,summary,text
2875916,The Great Physician's Rx for Health and Wellne...,,A2CVUD1KWW2TUT,Lv2Read,1.0,1144713600,You can't get in trouble over a book?,"!!!March 9, 2006 Rubin's company was charged, ..."
667463,The Great Physician's Rx for Health and Wellness,,A2CVUD1KWW2TUT,Lv2Read,1.0,1144713600,You can't get in trouble over a book?,"!!!March 9, 2006 Rubin's company was charged, ..."
1688387,The electric kool-aid acid test,,A25TJD77EBERPD,The Concise Critic:,3.0,1207180800,On The Road (part two),"!!Freeeeeeaky!! (Almost)::::datedSo, Tom Wolfe..."
834032,The Electric Kool-Aid Acid Test,,A25TJD77EBERPD,The Concise Critic:,3.0,1207180800,On The Road (part two),"!!Freeeeeeaky!! (Almost)::::datedSo, Tom Wolfe..."
1098529,"The adventures of Pinocchio, (Illustrated juni...",,,,5.0,1003190400,A Book Review of &quot;The Adventures of Pinno...,!!WOW!! Was that book great! !!WOW!! It deserv...


As we can see above, some rows are almost identical and differ only in the title.
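
One way to surface such near-duplicates is to compare titles after a light normalization. The cell below is a minimal sketch on toy data, assuming that letter case and whitespace are the only differences we want to ignore; ```normalize_title``` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd


def normalize_title(titles: pd.Series) -> pd.Series:
    """Lowercase, trim, and collapse inner whitespace (hypothetical helper)."""
    return (
        titles
        .str.lower()
        .str.strip()
        .str.replace(r'\s+', ' ', regex=True)
    )


# Toy data standing in for books_rating
toy = pd.DataFrame({
    'title': [
        'The Electric Kool-Aid Acid Test',
        'The electric kool-aid acid test',
        'Camilla',
    ],
    'text': ['same review', 'same review', 'another review'],
})

toy['title_norm'] = normalize_title(toy['title'])

# Number of distinct raw titles behind each normalized title
variants = toy.groupby('title_norm')['title'].nunique()
print(variants[variants > 1])
```

This would not catch titles that differ by more than case or whitespace, which seems to be the case for the two "The Great Physician's Rx..." rows above.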

```user_id```, ```profile_name``` and ```time``` columns:

In [22]:
(
    books_rating[['user_id', 'profile_name', 'time']]
    .drop_duplicates()
    .groupby(['user_id', 'time'])['profile_name']
    .count()
    .sort_values(ascending=False)
    .head()
)

user_id         time      
A1OQOAP4TLOTBE  1349740800    3
AZB0DL867TNYW   1353369600    3
A2K55QI7ZCXZGO  1358035200    2
A3LW3QPSFICFQZ  1361491200    2
AIGSF17EDUOQA   1353974400    2
Name: profile_name, dtype: int64

Some user ids have more than one profile name at the same timestamp (strange...):

In [23]:
(
    books_rating
    .query('user_id == "A1OQOAP4TLOTBE" and time == 1349740800')
    .drop_duplicates(subset='profile_name')
)

Unnamed: 0,title,price,user_id,profile_name,score,time,summary,text
269723,"Uglies (Uglies Trilogy, Book 1)",,A1OQOAP4TLOTBE,Avanelle Perry,5.0,1349740800,Uglies,"Like Matched and The Hunger Games, Uglies take..."
1268035,Camilla,,A1OQOAP4TLOTBE,"Lila Marantz ""Martien Marantz""",4.0,1349740800,Camilla,Camilla Dickinson is a fifteen year old girl w...
2132767,Messenger,,A1OQOAP4TLOTBE,Martine,5.0,1349740800,Messenger,Messenger incorporates the characters of The G...


In [24]:
(
    books_rating[['user_id', 'profile_name', 'time']]
    .drop_duplicates()
    .groupby(['profile_name', 'time'])['user_id']
    .count()
    .sort_values(ascending=False)
    .head()
)

profile_name  time      
A Customer    1358035200    13
              1295308800     9
              1295222400     9
              950745600      8
              1293667200     8
Name: user_id, dtype: int64

Likewise, some profile names have more than one user id at the same timestamp. In the top results this is the generic name ```A Customer```.

Open question: what defines a unique review?

In [25]:
books_rating.duplicated().any()

np.False_

In [26]:
books_rating.drop(columns='profile_name').duplicated().any()

np.True_

In [27]:
(
    books_rating[
        books_rating.drop(columns='profile_name').duplicated(keep=False)
    ]
    .sort_values('text')
    .iloc[:2]
)

Unnamed: 0,title,price,user_id,profile_name,score,time,summary,text
2849859,Girl in Hyacinth Blue,,A2EGK0YRDF4ZZB,Sarah,5.0,1026604800,"""Girl in Hyacinth Blue"" an unexpected treat","""Girl in Hyacinth Blue,"" like ""Girl With A Pea..."
421458,Girl in Hyacinth Blue,,A2EGK0YRDF4ZZB,S. Hodge,5.0,1026604800,"""Girl in Hyacinth Blue"" an unexpected treat","""Girl in Hyacinth Blue,"" like ""Girl With A Pea..."


As the cells above show, there are rows that only differ in the profile name. Let's drop the ```profile_name``` column.

In [28]:
books_rating = (
    books_rating
    .drop(columns='profile_name')
    .drop_duplicates()
)

```summary``` and ```text``` columns:

In [29]:
books_rating['summary'].isna().eq(books_rating['text'].isna()).all()

np.False_

There are rows with null values in the summary column but non-null values in the text column, e.g.:

In [30]:
(
    books_rating[
        books_rating['summary'].isna()
        & books_rating['text'].notna()
    ]
    .sort_values('text')
    .head()
)

Unnamed: 0,title,price,user_id,score,time,summary,text
2104919,Domain (Domain Trilogy),7.99,A3ERU38QO67XON,4.0,982368000,,"(Alten) singlehandedly out does Benchley, Cric..."
871047,The Innocent Man: Murder and Injustice in a Sm...,,A1V9WHAVZBTISF,4.0,1340668800,,6-26-12: I'm about halfway through and trying ...
532719,Witness of Gor,,A257E6TFB6FSPA,2.0,1356652800,,80% commentary and 20% story line. I have read...
2109710,"The Nerve of Foley, and Other Railroad Stories",,A3F18AV35OSSN0,5.0,1329177600,,A excellent series of short stories on how rai...
2245299,The Book of Garnishes,,AC4XXR47TP883,2.0,1127692800,,"A good book for the beginner, nothing too comp..."


There are also rows with null values in the text column but non-null values in the summary column, e.g.:

In [31]:
(
    books_rating[
        books_rating['summary'].notna()
        & books_rating['text'].isna()
    ]
    .sort_values('text')
    .head()
)

Unnamed: 0,title,price,user_id,score,time,summary,text
469860,The Lord of the Rings - Boxed Set,,,5.0,938563200,have only one word to say read ths book,
1139849,The Lord of the Rings Box Set,,,5.0,938563200,have only one word to say read ths book,
1368032,The Lord of the Rings (3 Volume Set),,,5.0,938563200,have only one word to say read ths book,
1598329,The Lord of the Rings Trilogy (The Fellowship ...,,,5.0,938563200,have only one word to say read ths book,
1771566,The Drive,,A32VJTCIVOG88D,5.0,1136678400,Beautiful and Honest - read it 4 times so far,
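
The two checks above can be summarized in a single table. A minimal sketch on toy data, assuming we only care whether each value is missing or not (the names ```summary_missing``` and ```text_missing``` are made up for the example):

```python
import pandas as pd

# Toy data standing in for books_rating
toy = pd.DataFrame({
    'summary': ['a', None, 'b', None, 'c'],
    'text': ['x', 'y', None, None, 'z'],
})

# Rows: is summary missing? Columns: is text missing?
missing = pd.crosstab(
    toy['summary'].isna().rename('summary_missing'),
    toy['text'].isna().rename('text_missing'),
)
print(missing)
```

The off-diagonal cells count the rows where exactly one of the two columns is missing.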


In [32]:
# useful_cols_rating = ['title', 'user_id', 'score', 'time', 'summary', 'text']

In [33]:
books_rating.duplicated().any()

np.False_

In [34]:
books_rating.drop(columns='price').duplicated().any()

np.True_

There are rows that only differ in the price:

In [35]:
(
    books_rating[
        books_rating.drop(columns='price').duplicated(keep=False)
    ]
    .sort_values('text')
    .iloc[:2]
)

Unnamed: 0,title,price,user_id,score,time,summary,text
275311,Believing God,24.44,A2MN6QFG773V93,5.0,1191283200,I'm Believing God!!!!,""" Believing God"" teaches Christians to not jus..."
283433,Believing God,,A2MN6QFG773V93,5.0,1191283200,I'm Believing God!!!!,""" Believing God"" teaches Christians to not jus..."


In [36]:
_df = books_rating.dropna(subset='price')
(
    _df[
        _df.drop(columns='price').duplicated(keep=False)
    ]
    .sort_values('text')
    .iloc[:2]
)

Unnamed: 0,title,price,user_id,score,time,summary,text
1907621,The Awakening,32.95,A1DE875S68SSPX,4.0,1342137600,A Creole Mme Bovary,"""A Creole Bovary,"" is how Willa Cather describ..."
1730380,The Awakening,24.21,A1DE875S68SSPX,4.0,1342137600,A Creole Mme Bovary,"""A Creole Bovary,"" is how Willa Cather describ..."


In [37]:
books_rating.loc[1907621].compare(books_rating.loc[1730380])

Unnamed: 0,self,other
price,32.95,24.21


Let's drop the ```price``` column.

In [38]:
books_rating = (
    books_rating
    .drop(columns='price')
    .drop_duplicates()
)

Let's do some more checks on duplicates:

In [39]:
books_rating.duplicated().any()

np.False_

In [40]:
books_rating.drop(columns='score').duplicated().any()

np.True_

In [41]:
(
    books_rating[
        books_rating.drop(columns='score').duplicated(keep=False)
    ]
    .sort_values('text')
    .loc[[333322, 333321]]
)

Unnamed: 0,title,user_id,score,time,summary,text
333322,Out of the Dust,A2EEAHFEIX0MP6,3.0,979257600,Out of the Dust,"""The rain has brought back some grass and the ..."
333321,Out of the Dust,A2EEAHFEIX0MP6,4.0,979257600,Out of the Dust,"""The rain has brought back some grass and the ..."


Strangely, there are rows that differ only in the score. We can use the mean score for those.
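
A minimal sketch of that idea on toy data, assuming every column other than ```score``` is used as the grouping key. The first two rows reuse the values from the example above with a placeholder text; the third row is made up.

```python
import pandas as pd

# Toy data standing in for books_rating
toy = pd.DataFrame({
    'title': ['Out of the Dust', 'Out of the Dust', 'Camilla'],
    'user_id': ['A2EEAHFEIX0MP6', 'A2EEAHFEIX0MP6', None],
    'score': [3.0, 4.0, 5.0],
    'time': [979257600, 979257600, 1349740800],
    'summary': ['Out of the Dust', 'Out of the Dust', 'Camilla'],
    'text': ['same text', 'same text', 'other text'],
})

key_cols = [c for c in toy.columns if c != 'score']

# dropna=False keeps groups whose key contains NaN (e.g. a missing user_id)
averaged = (
    toy
    .groupby(key_cols, dropna=False, as_index=False)['score']
    .mean()
)
print(averaged)
```

Without ```dropna=False```, rows with a missing value in any key column would be silently dropped by the groupby.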

In [42]:
books_rating.drop(columns='user_id').duplicated().any()

np.True_

In [43]:
(
    books_rating[
        books_rating.drop(columns='user_id').duplicated(keep=False)
    ]
    .sort_values('text')
    .loc[[1762172, 1762175]]
)

Unnamed: 0,title,user_id,score,time,summary,text
1762172,Letters For Sarah,AOE6JPNQYZXFI,5.0,1078790400,Compelling,"""Letters For Sarah"" by Susan Kay is a wonderfu..."
1762175,Letters For Sarah,A14CNGB2LGS9WL,5.0,1078790400,Compelling,"""Letters For Sarah"" by Susan Kay is a wonderfu..."


It is also strange that there are rows that differ only in the user id, though it is possible... We will allow that.

In [44]:
books_rating.drop(columns='time').duplicated().any()

np.True_

In [45]:
(
    books_rating[
        books_rating.drop(columns='time').duplicated(keep=False)
    ]
    .sort_values('text')
    .loc[[1018327, 1018328]]
)

Unnamed: 0,title,user_id,score,time,summary,text
1018327,A Bed for the Night,A6MOYDXU7Y17P,4.0,1338854400,An inspirational shot in the arm,"""A Bed for the Night: The Story of the Wheelin..."
1018328,A Bed for the Night,A6MOYDXU7Y17P,4.0,1332460800,An inspirational shot in the arm,"""A Bed for the Night: The Story of the Wheelin..."


The same review at different times. That will not be allowed by default, but it can be a hyperparameter of the data preprocessing.
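
A possible shape for that hyperparameter, sketched on toy data. Both the function ```drop_repeated_reviews``` and the flag ```allow_time_variants``` are hypothetical names, and keeping the earliest copy is an assumption, not a decision made in this notebook.

```python
import pandas as pd


def drop_repeated_reviews(
    df: pd.DataFrame,
    allow_time_variants: bool = False,
) -> pd.DataFrame:
    """Hypothetical preprocessing step.

    When allow_time_variants is False, rows that match on every column
    except 'time' are collapsed, keeping the earliest one (an assumption).
    """
    if allow_time_variants:
        return df.drop_duplicates()
    other_cols = [c for c in df.columns if c != 'time']
    return (
        df
        .sort_values('time')
        .drop_duplicates(subset=other_cols, keep='first')
        .sort_index()
    )


# Toy data: the first two rows differ only in 'time'
toy = pd.DataFrame({
    'title': ['A Bed for the Night', 'A Bed for the Night', 'Camilla'],
    'user_id': ['A6MOYDXU7Y17P', 'A6MOYDXU7Y17P', 'A1OQOAP4TLOTBE'],
    'time': [1338854400, 1332460800, 1349740800],
    'text': ['same text', 'same text', 'other text'],
})

strict = drop_repeated_reviews(toy)
relaxed = drop_repeated_reviews(toy, allow_time_variants=True)
print(len(strict), len(relaxed))
```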

In [46]:
books_rating.drop(columns='summary').duplicated().any()

np.True_

In [47]:
(
    books_rating[
        books_rating.drop(columns='summary').duplicated(keep=False)
    ]
    .sort_values('text')
    .loc[[2865822, 2865823]]
)

Unnamed: 0,title,user_id,score,time,summary,text
2865822,Glacier pilot;: The story of Bob Reeve and the...,A2L7N2U5Z316ZE,5.0,1248048000,Bush Pilot Supreme,"""...the rousing story of Bob Reeve, Alaska's f..."
2865823,Glacier pilot;: The story of Bob Reeve and the...,A2L7N2U5Z316ZE,5.0,1248048000,Flying in a Wilderness,"""...the rousing story of Bob Reeve, Alaska's f..."


They only differ in the ```summary``` column.

In [48]:
books_rating.drop(columns='text').duplicated().any()

np.True_

In [49]:
(
    books_rating[
        books_rating.drop(columns='text').duplicated(keep=False)
    ]
    .iloc[:2]
)

Unnamed: 0,title,user_id,score,time,summary,text
162,History of Magic and the Occult,AMKC1EJBUXDS2,5.0,989107200,a bibliophiliac wedding of the SURREALIST &amp...,"Kurt Seligmann, Surrealist artist par excellen..."
164,History of Magic and the Occult,AMKC1EJBUXDS2,5.0,989107200,a bibliophiliac wedding of the SURREALIST &amp...,"Kurt Seligmann, Surrealist artist par excellen..."


In [50]:
books_rating.loc[162].compare(books_rating.loc[164]).values

array([["Kurt Seligmann, Surrealist artist par excellence, admitted &amp; unashamed bibliophile, has ravaged his occult library in a miraculous marriage giving birth to this classic historical account of Magic and Occultism; entirely written for the proverbial 'man about the street', and a very cosmic avenue is here tread in an admireable chronological ordering of the various mystical houses. Addressing all the various occult(semantic definition of &quot;hidden &amp; rejected knowledge&quot;) routes via the shortcut of this Art Historian's scholarly mind, as a whole this work is one of the best and most easily approachable &quot;magic histories&quot; for those who have no previous knowledge concerning occultism except an undeniable interest and fascination with the mysterious and the spiritual.Kurt Seligmann is darkly fascinated and explores endlessly the multitude of historic beliefs concerning the study of the nature of evil... such as black magic, necromancy, Elizabethian era conjur

They only differ in the ```text``` column.
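
The column-by-column checks above all follow the same pattern, so they could also be written as one loop. A minimal sketch on toy data:

```python
import pandas as pd

# Toy data: the first two rows differ only in 'score'
toy = pd.DataFrame({
    'title': ['A', 'A', 'B'],
    'user_id': ['u1', 'u1', 'u2'],
    'score': [3.0, 4.0, 5.0],
    'text': ['same', 'same', 'other'],
})

# For each column: how many rows become duplicates once that column is ignored?
differ_only_in = pd.Series({
    col: int(toy.drop(columns=col).duplicated().sum())
    for col in toy.columns
})
print(differ_only_in)
```

A non-zero entry means there are rows that differ only in that column.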

Intuitively, a user id should have only one review per title. We need to aggregate the other columns somehow.
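
One possible aggregation, sketched on toy data. The choices here are assumptions for illustration only: mean score, latest time, and the summary and text of the most recent review. Rows without a ```user_id``` are left untouched, since we cannot tell whether they come from the same person.

```python
import pandas as pd

# Toy data standing in for books_rating
toy = pd.DataFrame({
    'title': ['Out of the Dust', 'Out of the Dust', 'Camilla', 'Camilla'],
    'user_id': ['A2EEAHFEIX0MP6', 'A2EEAHFEIX0MP6', None, None],
    'score': [3.0, 4.0, 5.0, 4.0],
    'time': [979257600, 979257601, 1349740800, 1349740900],
    'summary': ['first', 'second', 'anon 1', 'anon 2'],
    'text': ['first text', 'second text', 'anon text 1', 'anon text 2'],
})

known = toy[toy['user_id'].notna()]
anonymous = toy[toy['user_id'].isna()]

# Sorting by time first makes 'last' pick the most recent non-null value
one_per_user = (
    known
    .sort_values('time')
    .groupby(['title', 'user_id'], as_index=False)
    .agg(
        score=('score', 'mean'),
        time=('time', 'max'),
        summary=('summary', 'last'),
        text=('text', 'last'),
    )
)

result = pd.concat([one_per_user, anonymous], ignore_index=True)
print(result)
```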

```score``` distribution:

In [51]:
# REDO after preprocessing

# books_rating['score'].value_counts().sort_index().plot(kind='bar')

## Books Data

In [52]:
books_data = pd.read_csv('books_data.csv')

In [53]:
books_data.columns

Index(['Title', 'description', 'authors', 'image', 'previewLink', 'publisher',
       'publishedDate', 'infoLink', 'categories', 'ratingsCount'],
      dtype='object')

Renaming the columns:

In [54]:
books_data = (
    books_data
    .rename(columns={
        'Title': 'title',
        'previewLink': 'preview_link',
        'publishedDate': 'published_date',
        'infoLink': 'info_link',
        'ratingsCount': 'ratings_count',
    })
)

Checking for duplicated rows:

In [55]:
books_data.duplicated().any()

np.False_

In [56]:
books_data.drop(columns='title').duplicated().sum()

np.int64(24304)

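The cells above indicate that there are rows that are almost identical and differ only in the title (not good...). Note that ```duplicated``` treats missing values as equal, so this count also includes any rows whose other columns are all missing.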
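
To see which titles share the same metadata, we can group by every column except the title. A minimal sketch on toy data with only three columns; rows whose other columns are all missing are excluded on the assumption that they are unrelated books rather than title variants.

```python
import pandas as pd

# Toy data standing in for books_data ('Book X', 'Book Y' and 'Some Author' are made up)
toy = pd.DataFrame({
    'title': [
        'The Electric Kool-Aid Acid Test',
        'The electric kool-aid acid test',
        'Book X',
        'Book Y',
        'Camilla',
    ],
    'authors': ["['Tom Wolfe']", "['Tom Wolfe']", None, None, "['Some Author']"],
    'ratings_count': [52.0, 52.0, None, None, 3.0],
})

other_cols = [c for c in toy.columns if c != 'title']

# Rows with at least one non-missing value outside the title
has_metadata = toy[other_cols].notna().any(axis=1)

# keep=False flags every member of a duplicated group, not just the repeats
same_metadata = toy[has_metadata & toy.duplicated(subset=other_cols, keep=False)]

title_variants = (
    same_metadata
    .groupby(other_cols, dropna=False)['title']
    .agg(list)
)
print(title_variants)
```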

In [57]:
books_data['title'].is_unique

True

In [58]:
books_data.shape

(212404, 10)

In [59]:
books_data.head()

Unnamed: 0,title,description,authors,image,preview_link,publisher,published_date,info_link,categories,ratings_count
0,Its Only Art If Its Well Hung!,,['Julie Strain'],http://books.google.com/books/content?id=DykPA...,http://books.google.nl/books?id=DykPAAAACAAJ&d...,,1996,http://books.google.nl/books?id=DykPAAAACAAJ&d...,['Comics & Graphic Novels'],
1,Dr. Seuss: American Icon,Philip Nel takes a fascinating look into the k...,['Philip Nel'],http://books.google.com/books/content?id=IjvHQ...,http://books.google.nl/books?id=IjvHQsCn_pgC&p...,A&C Black,2005-01-01,http://books.google.nl/books?id=IjvHQsCn_pgC&d...,['Biography & Autobiography'],
2,Wonderful Worship in Smaller Churches,This resource includes twelve principles in un...,['David R. Ray'],http://books.google.com/books/content?id=2tsDA...,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,,2000,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,['Religion'],
3,Whispers of the Wicked Saints,Julia Thomas finds her life spinning out of co...,['Veronica Haddon'],http://books.google.com/books/content?id=aRSIg...,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,iUniverse,2005-02,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,['Fiction'],
4,"Nation Dance: Religion, Identity and Cultural ...",,['Edward Long'],,http://books.google.nl/books?id=399SPgAACAAJ&d...,,2003-03-01,http://books.google.nl/books?id=399SPgAACAAJ&d...,,


Data types:

In [60]:
books_data.dtypes

title              object
description        object
authors            object
image              object
preview_link       object
publisher          object
published_date     object
info_link          object
categories         object
ratings_count     float64
dtype: object

Checking for NaN values:

In [61]:
books_data.isna().any()

title             True
description       True
authors           True
image             True
preview_link      True
publisher         True
published_date    True
info_link         True
categories        True
ratings_count     True
dtype: bool

Count of non-null values:

In [62]:
books_data.count()

title             212403
description       143962
authors           180991
image             160329
preview_link      188568
publisher         136518
published_date    187099
info_link         188568
categories        171205
ratings_count      49752
dtype: int64

In [63]:
useful_cols_data = ['title', 'description', 'authors', 'categories', 'ratings_count']

In [64]:
books_data[useful_cols_data].head()

Unnamed: 0,title,description,authors,categories,ratings_count
0,Its Only Art If Its Well Hung!,,['Julie Strain'],['Comics & Graphic Novels'],
1,Dr. Seuss: American Icon,Philip Nel takes a fascinating look into the k...,['Philip Nel'],['Biography & Autobiography'],
2,Wonderful Worship in Smaller Churches,This resource includes twelve principles in un...,['David R. Ray'],['Religion'],
3,Whispers of the Wicked Saints,Julia Thomas finds her life spinning out of co...,['Veronica Haddon'],['Fiction'],
4,"Nation Dance: Religion, Identity and Cultural ...",,['Edward Long'],,


Checking the titles:

In [65]:
(
    books_data
    [useful_cols_data]
    .dropna(subset='title')
    .pipe(lambda df: df[df['title'].str.startswith('The Great Physician')])
    # ['title'].values
)

Unnamed: 0,title,description,authors,categories,ratings_count
27238,The Great Physician's Rx for 7 Weeks of Wellne...,Expanding beyond the solely nutritionally base...,['Jordan Rubin'],['Health & Fitness'],
47088,The Great Physician's Rx for Health and Wellness,"At 19 years old, Jordan Rubin was a healthy 6'...",['Jordan Rubin'],['Health & Fitness'],12.0
123651,The Great Physician's Rx for Health and Wellne...,"At 19 years old, Jordan Rubin was a healthy 6'...",['Jordan Rubin'],['Health & Fitness'],12.0


In [66]:
(
    books_data
    [useful_cols_data]
    .dropna(subset='title')
    .pipe(lambda df:
        df[
            df['title']
            .str.lower()
            .str.startswith('the electric kool')
        ]
    )
    # ['title'].values
)

Unnamed: 0,title,description,authors,categories,ratings_count
59924,The Electric Kool-Aid Acid Test,Relates the escapades of Ken Kesey and the Mer...,['Tom Wolfe'],['Social Science'],52.0
119738,The electric kool-aid acid test,Relates the escapades of Ken Kesey and the Mer...,['Tom Wolfe'],['Social Science'],52.0


## Books Rating + Books Data

Checking the titles:

In [67]:
set(books_rating['title']) - set(books_data['title'])

set()

In [68]:
set(books_data['title']) - set(books_rating['title'])

set()

The two sets are equal.

Checking the ```ratings_count```:

In [69]:
ratings_count = (
    books_data
    .dropna(subset='title')
    .set_index('title')
    .sort_index()
    ['ratings_count']
)
ratings_count.to_frame()

Unnamed: 0_level_0,ratings_count
title,Unnamed: 1_level_1
""" Film technique, "" and, "" Film acting """,
""" We'll Always Have Paris"": The Definitive Guide to Great Lines from the Movies",
"""... And Poetry is Born ..."" Russian Classical Poetry",
"""A Titanic hero"" Thomas Andrews, shipbuilder",
"""A Truthful Impression of the Country"": British and American Travel Writing in China, 1880-1949",
...,...
with an everlasting love,4.0
work and Motivation,
www.whitbread.org/book,
"xBase Programming for the True Beginner: An Introduction to the xBase Language in the Context of dBASE III+, IV, 5, FoxPro, and Clipper",


In [70]:
books_rating.groupby('title')['text'].count().sort_index().to_frame()

Unnamed: 0_level_0,text
title,Unnamed: 1_level_1
""" Film technique, "" and, "" Film acting """,2
""" We'll Always Have Paris"": The Definitive Guide to Great Lines from the Movies",2
"""... And Poetry is Born ..."" Russian Classical Poetry",1
"""A Titanic hero"" Thomas Andrews, shipbuilder",8
"""A Truthful Impression of the Country"": British and American Travel Writing in China, 1880-1949",1
...,...
with an everlasting love,21
work and Motivation,1
www.whitbread.org/book,3
"xBase Programming for the True Beginner: An Introduction to the xBase Language in the Context of dBASE III+, IV, 5, FoxPro, and Clipper",1


In [71]:
_title = 'with an everlasting love'

books_rating.query('title == @_title').head(10)

Unnamed: 0,title,user_id,score,time,summary,text
1279120,with an everlasting love,A9YRGJ159Z30R,1.0,1233705600,Really?,I cannot echo the rave reviews of this book. T...
1279121,with an everlasting love,A2F030EH54F0JE,5.0,1355184000,I have loved this book for years,I purchased this because I keep giving away my...
1279122,with an everlasting love,AP96A4VJV5UCO,5.0,1354233600,Wonderful Book!,"This is one of the most basic stories ever, ex..."
1279123,with an everlasting love,AGPAX21BNZAD8,5.0,1350950400,Incredible,This is a outstanding book. You can take so mu...
1279124,with an everlasting love,A1H2HBZ2QZA8J9,5.0,1346889600,An Everlasting Love,"I remember when I first read this book,when I ..."
1279125,with an everlasting love,A268KCYCRLJ6YT,4.0,1346284800,Everlasting,Good and easy read based on unconditional love...
1279126,with an everlasting love,ARRG24PTR1K2F,5.0,1334188800,Easy Read with Exceptional Teaching,Just led a women's retreat where this was touc...
1279127,with an everlasting love,AZWC9XAY34IPW,5.0,1296432000,a parable,"Christianna, the heroine of this novel, is eng..."
1279128,with an everlasting love,A2GWS2SCT7FXS9,5.0,1295568000,4ydeits,Love this book and have bought several since I...
1279129,with an everlasting love,A12XJ4E97F9HLQ,5.0,1294099200,Unconditional Love,Ms. Arthur has written the ultimate love story...


The ```ratings_count``` column does not seem reliable: for this title it is 4.0, while ```books_rating``` contains 21 reviews.
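
The comparison can be put side by side for every title. A minimal sketch on toy data that reuses the two counts shown in the outputs above; ```n_reviews``` and ```mismatch``` are names made up for the example.

```python
import pandas as pd

# Toy data standing in for books_data and books_rating
books_data_toy = pd.DataFrame({
    'title': ['with an everlasting love', 'work and Motivation'],
    'ratings_count': [4.0, None],
})
books_rating_toy = pd.DataFrame({
    'title': ['with an everlasting love'] * 21 + ['work and Motivation'],
    'text': ['review'] * 22,
})

# Number of reviews per title in the ratings table
n_reviews = books_rating_toy.groupby('title')['text'].count().rename('n_reviews')

comparison = (
    books_data_toy
    .set_index('title')['ratings_count']
    .to_frame()
    .join(n_reviews)
)

# A missing ratings_count also counts as a mismatch
comparison['mismatch'] = comparison['ratings_count'].ne(comparison['n_reviews'])
print(comparison)
```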

We can match the two datasets on the ```title``` column.
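
A minimal sketch of that join on toy data, assuming a left join from the reviews to the book metadata. Since ```title``` is unique in ```books_data```, ```validate='many_to_one'``` can guard against accidental row multiplication.

```python
import pandas as pd

# Toy data standing in for books_rating and books_data
books_rating_toy = pd.DataFrame({
    'title': [
        'Dr. Seuss: American Icon',
        'Dr. Seuss: American Icon',
        'Its Only Art If Its Well Hung!',
    ],
    'score': [5.0, 4.0, 4.0],
})
books_data_toy = pd.DataFrame({
    'title': ['Dr. Seuss: American Icon', 'Its Only Art If Its Well Hung!'],
    'authors': ["['Philip Nel']", "['Julie Strain']"],
})

merged = books_rating_toy.merge(
    books_data_toy,
    on='title',
    how='left',
    validate='many_to_one',  # raises MergeError if a title repeats on the right
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)
print(merged)
```

pandas matches missing keys to each other in a merge, so rows with a missing title are probably best dropped from both tables first.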