# data information
Data:
- Goodreads books data: basic book metadata (dropped, not used)
- Amazon books data: more detailed book metadata
- Amazon reviews: detailed text reviews for each book
- (maybe) web scraping from Reddit could be useful

Link to download the original, uncleaned CSVs (download both):
- https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews 

Link to download the cleaned data:
- https://drive.google.com/drive/folders/1ZGq4Bl4NIP6XLAPWCaVVO6tRYqr4A_lU?usp=sharing

=> output: books_clean.csv, reviews_clean.csv, authors_clean.csv

We then add the variables needed to identify indie authors, along with other metrics that will be useful for our recommendation system.

* is_indie: flag indicating whether an author is indie. This is the initial working definition; the rule actually applied is given in the "adding is indie flag" section. An author is considered indie if any of the following holds:
    - no publisher
    - very few books (< 3 released), or books released only within the last 5 years, counted back from the latest year in the data set
    - very low visibility on social media (Reddit), or very few reviews (fewer than 5? to be checked against the median and average review count per book/author)


In [1]:
#loading the data
import pandas as pd

books=pd.read_csv('books_data_amazon.csv')

In [2]:
#checking missing vals
print(books.columns)

print(books.isnull().sum())
books.head(5)


Index(['Title', 'description', 'authors', 'image', 'previewLink', 'publisher',
       'publishedDate', 'infoLink', 'categories', 'ratingsCount'],
      dtype='object')
Title                 1
description       68442
authors           31413
image             52075
previewLink       23836
publisher         75886
publishedDate     25305
infoLink          23836
categories        41199
ratingsCount     162652
dtype: int64


Unnamed: 0,Title,description,authors,image,previewLink,publisher,publishedDate,infoLink,categories,ratingsCount
0,Its Only Art If Its Well Hung!,,['Julie Strain'],http://books.google.com/books/content?id=DykPA...,http://books.google.nl/books?id=DykPAAAACAAJ&d...,,1996,http://books.google.nl/books?id=DykPAAAACAAJ&d...,['Comics & Graphic Novels'],
1,Dr. Seuss: American Icon,Philip Nel takes a fascinating look into the k...,['Philip Nel'],http://books.google.com/books/content?id=IjvHQ...,http://books.google.nl/books?id=IjvHQsCn_pgC&p...,A&C Black,2005-01-01,http://books.google.nl/books?id=IjvHQsCn_pgC&d...,['Biography & Autobiography'],
2,Wonderful Worship in Smaller Churches,This resource includes twelve principles in un...,['David R. Ray'],http://books.google.com/books/content?id=2tsDA...,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,,2000,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,['Religion'],
3,Whispers of the Wicked Saints,Julia Thomas finds her life spinning out of co...,['Veronica Haddon'],http://books.google.com/books/content?id=aRSIg...,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,iUniverse,2005-02,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,['Fiction'],
4,"Nation Dance: Religion, Identity and Cultural ...",,['Edward Long'],,http://books.google.nl/books?id=399SPgAACAAJ&d...,,2003-03-01,http://books.google.nl/books?id=399SPgAACAAJ&d...,,


## cleaning the data

In [3]:
#rows with a missing title, authors or description are dropped
#(description and categories could maybe be kept when NA and filled in from the reviews)
import ast
books = books.dropna(subset=['Title', 'authors','description'])


#filling the remaining NAs; a missing publisher is treated as self-published
books['publisher'] = books['publisher'].fillna('self-published')
books['publishedDate'] = books['publishedDate'].fillna('Unknown')
books['image'] = books['image'].fillna('')
books['previewLink'] = books['previewLink'].fillna('')
books['infoLink'] = books['infoLink'].fillna('')
books['categories'] = books['categories'].fillna("['Uncategorized']")
#ratingsCount is missing for most rows (162652 NAs), so drop the column
books = books.drop('ratingsCount', axis=1)


print(books.shape)


(141755, 9)
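
The recency criterion mentioned in the intro (books released within the last 5 years of the latest year in the data set) needs a numeric year, but `publishedDate` comes in mixed formats (`1996`, `2005-02`, `2005-01-01`) plus the `'Unknown'` placeholder. A minimal sketch of one way to extract it, on toy values; the names `pub_year` and `is_recent` and the inclusive 5-year cutoff are assumptions, not part of the notebook:

```python
import pandas as pd

# Toy values mirroring the formats seen in publishedDate above,
# plus the 'Unknown' placeholder used when the date is missing.
dates = pd.Series(['1996', '2005-01-01', '2005-02', 'Unknown'])

# Pull out the first run of four digits; rows without one become missing.
years = dates.str.extract(r'(\d{4})', expand=False)
pub_year = pd.to_numeric(years, errors='coerce').astype('Int64')

# Hypothetical recency flag: published within 5 years of the latest year in the data.
latest_year = pub_year.max()
is_recent = (pub_year >= latest_year - 5).fillna(False).astype(bool)

print(pub_year.tolist())
print(is_recent.tolist())
```

Books with an unknown date are treated as not recent here; whether that is the right default depends on how the flag ends up being used.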


In [4]:
#convert authors and categories to lists 
books['authors'] = books['authors'].apply(ast.literal_eval)
books['categories'] = books['categories'].apply(ast.literal_eval)
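
`ast.literal_eval` raises on any cell that is not a valid Python literal, which would abort the whole `apply`. If that ever happens, a more forgiving parser could look roughly like the sketch below. `parse_list_cell` is a hypothetical helper, not something the notebook defines, and returning an empty list on failure is an assumption:

```python
import ast

def parse_list_cell(value):
    """Hypothetical helper: parse a stringified list, returning [] on failure."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []
    # Wrap a bare scalar (e.g. a plain string) so callers always get a list.
    return list(parsed) if isinstance(parsed, (list, tuple)) else [parsed]

print(parse_list_cell("['Philip Nel']"))
print(parse_list_cell("['A', 'B']"))
print(parse_list_cell("not a list"))
```

It could be dropped in with `books['authors'].apply(parse_list_cell)`, keeping in mind that an empty list would then break a later `x[0]` lookup.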


In [5]:
import re

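#author names also need normalizing so that spelling variants group together

def normalize_author_name(name):
    """Drop periods, collapse whitespace, join spaced initials and title-case the name."""
    name = str(name).strip()
    name = name.replace('.', '')      #"C.S. Lewis" -> "CS Lewis"
    name = re.sub(r'\s+', ' ', name)  #collapse repeated whitespace
    #join runs of single-letter initials: "R L Stine" -> "RL Stine"
    name = re.sub(r'\b([A-Z])\s+(?=[A-Z]\s|\b[A-Z]$)', r'\1', name)
    #note: title() lowercases the joined initials ("RL Stine" -> "Rl Stine")
    name = name.title()

    return name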

books['authors'] = books['authors'].apply(
    lambda author_list: [normalize_author_name(a) for a in author_list]
)

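#many books have several authors, so add a main_author column (first listed author)
books['main_author'] = books['authors'].apply(lambda x: x[0])
#and a main genre (first listed category)
books['genre'] = books['categories'].apply(lambda x: x[0])

print("Top 20 authors after normalization:")
print(books['main_author'].value_counts().head(20))

books.info()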

Top 20 authors after normalization:
main_author
William Shakespeare                     134
Agatha Christie                         131
Louis L'Amour                           115
Edgar Rice Burroughs                     72
Lonely Planet                            69
Ann M Martin                             68
Mark Twain                               66
Carolyn Keene                            65
Various                                  65
Rl Stine                                 64
Charles Dickens                          61
Dk                                       59
Cs Lewis                                 57
Nora Roberts                             56
Zane Grey                                54
Isaac Asimov                             53
John Steinbeck                           50
Georgette Heyer                          50
Rudyard Kipling                          50
Library Of Congress Copyright Office     50
Name: count, dtype: int64
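
Entries such as `Rl Stine`, `Cs Lewis` and `Dk` in the list above come from the normalization itself. The short check below re-declares the function from the cell above so it runs on its own, and applies it to a few made-up input spellings:

```python
import re

def normalize_author_name(name):
    # Same steps as the notebook cell above.
    name = str(name).strip()
    name = name.replace('.', '')
    name = re.sub(r'\s+', ' ', name)
    name = re.sub(r'\b([A-Z])\s+(?=[A-Z]\s|\b[A-Z]$)', r'\1', name)
    return name.title()

# Illustrative inputs only; the raw spellings in the dataset may differ.
for raw in ['R. L. Stine', 'R.L. Stine', 'C.S. Lewis', '  ann  m. martin ']:
    print(repr(raw), '->', normalize_author_name(raw))
```

Both spellings of the first name collapse to the same key, which is the point of the step. The lowercased initials are a side effect of `str.title()` and only matter if `main_author` is shown to users.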



## cleaning up reviews

In [6]:
reviews=pd.read_csv('books_reviews_amazon.csv')

In [7]:
reviews = reviews[['Id', 'Title', 'review/score', 'review/text']]


In [8]:
reviews = reviews.rename(columns={
    'Id': 'ISBN',
    'review/score': 'rating',
    'review/text': 'review_text'
})

In [9]:
#drop reviews without a title and fill missing review text
reviews = reviews.dropna(subset=['Title'])
reviews['review_text'] = reviews['review_text'].fillna('')
#keep only reviews for books that are still in our books dataset after cleaning
reviews = reviews[reviews['Title'].isin(books['Title'])]
reviews.to_csv('reviews_clean.csv', index=False)


In [10]:
print(f"Average reviews per book: {len(reviews) / reviews['Title'].nunique():.1f}")
print(f"Cleaned reviews: {len(reviews):,}")


Average reviews per book: 16.5
Cleaned reviews: 2,339,915
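
The mean alone can be misleading when a few popular books collect most of the reviews, which is why the intro suggests checking the median before fixing a review-count threshold. A minimal sketch on a made-up table (titles and counts are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy reviews table: one row per review.
toy_reviews = pd.DataFrame({
    'Title': ['A'] * 10 + ['B'] * 2 + ['C'] * 1 + ['D'] * 3,
})

# Number of reviews per book.
per_book = toy_reviews.groupby('Title').size()

print(f"Mean reviews per book:   {per_book.mean():.1f}")
print(f"Median reviews per book: {per_book.median():.1f}")
print(per_book.quantile([0.25, 0.5, 0.75]))
```

Running the same three lines on the real `reviews` frame would show how far the 16.5 average sits from the typical book.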


In [11]:
reviews.head(5)

Unnamed: 0,ISBN,Title,rating,review_text
1,826414346,Dr. Seuss: American Icon,5.0,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,5.0,"If people become the books they read and if ""t..."
3,826414346,Dr. Seuss: American Icon,4.0,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,826414346,Dr. Seuss: American Icon,4.0,Philip Nel - Dr. Seuss: American IconThis is b...
5,826414346,Dr. Seuss: American Icon,4.0,"""Dr. Seuss: American Icon"" by Philip Nel is a ..."


In [12]:

#adding avg rating and review counts to books
review_stats = reviews.groupby('Title').agg({
    'rating': ['count', 'mean']
}).reset_index()

review_stats.columns = ['Title', 'review_count', 'avg_rating']

books = books.merge(review_stats, on='Title', how='left')

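#books without any reviews get review_count 0
books['review_count'] = books['review_count'].fillna(0).astype(int)
#note: avg_rating 0 here means "no reviews", not an actual score of 0
books['avg_rating'] = books['avg_rating'].fillna(0)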
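
For comparison, here is a sketch of the same aggregate-then-merge step on toy data, written with pandas named aggregation so the column names come out flat. Leaving `avg_rating` as NaN for unreviewed books is an assumption made here for illustration and differs from the cell above, which fills it with 0:

```python
import pandas as pd

toy_books = pd.DataFrame({'Title': ['A', 'B', 'C']})
toy_reviews = pd.DataFrame({
    'Title': ['A', 'A', 'B'],
    'rating': [5.0, 4.0, 3.0],
})

# Named aggregation: one output column per keyword argument.
stats = (
    toy_reviews.groupby('Title')
    .agg(review_count=('rating', 'count'), avg_rating=('rating', 'mean'))
    .reset_index()
)

# Left merge keeps every book, including those with no reviews.
merged = toy_books.merge(stats, on='Title', how='left')
merged['review_count'] = merged['review_count'].fillna(0).astype(int)
# Assumption: keep avg_rating as NaN for unreviewed books so that
# "no reviews" cannot be confused with a rating of 0.
print(merged)
```

Either convention works as long as downstream code that averages or ranks by `avg_rating` knows which one is in use.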




In [13]:
books.to_csv('books_clean.csv', index=False)
books.head(3)


Unnamed: 0,Title,description,authors,image,previewLink,publisher,publishedDate,infoLink,categories,main_author,genre,review_count,avg_rating
0,Dr. Seuss: American Icon,Philip Nel takes a fascinating look into the k...,[Philip Nel],http://books.google.com/books/content?id=IjvHQ...,http://books.google.nl/books?id=IjvHQsCn_pgC&p...,A&C Black,2005-01-01,http://books.google.nl/books?id=IjvHQsCn_pgC&d...,[Biography & Autobiography],Philip Nel,Biography & Autobiography,9,4.555556
1,Wonderful Worship in Smaller Churches,This resource includes twelve principles in un...,[David R Ray],http://books.google.com/books/content?id=2tsDA...,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,self-published,2000,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,[Religion],David R Ray,Religion,4,5.0
2,Whispers of the Wicked Saints,Julia Thomas finds her life spinning out of co...,[Veronica Haddon],http://books.google.com/books/content?id=aRSIg...,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,iUniverse,2005-02,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,[Fiction],Veronica Haddon,Fiction,32,3.71875


In [14]:
books.shape

(141755, 13)

## adding is indie flag

An author is flagged as indie if they are self-published (at least one of their books has no publisher listed) AND at least one of the following holds:

- fewer than 3 books published
- fewer than 20 reviews in total

In [15]:
#per-author totals: number of books and total number of reviews
indie_author = books.groupby('main_author').agg({'Title': 'count','review_count': 'sum'}).reset_index()

indie_author.columns = ['main_author', 'total_books', 'total_reviews']

#an author counts as self-published if any of their books has the 'self-published' placeholder
self_pub = books.groupby('main_author')['publisher'].apply(lambda x: 'self-published' in x.str.lower().values).reset_index(name='is_self_published')

indie_author = indie_author.merge(self_pub, on='main_author')

#indie = self-published AND (few books OR few reviews)
indie_author['is_indie'] = (
    (indie_author['is_self_published']) & (
    (indie_author['total_books'] < 3) |
    (indie_author['total_reviews'] < 20))
)

print(f"Indie authors: {indie_author['is_indie'].sum()} / {len(indie_author)}")


Indie authors: 13408 / 94318
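
To make the combination of conditions concrete, the sketch below applies the same boolean expression to four made-up authors. The names and numbers are illustrative only:

```python
import pandas as pd

toy_authors = pd.DataFrame({
    'main_author': ['Author A', 'Author B', 'Author C', 'Author D'],
    'total_books': [1, 10, 2, 8],
    'total_reviews': [5, 500, 300, 12],
    'is_self_published': [True, True, False, True],
})

# Same rule as the cell above: self-published AND (few books OR few reviews).
toy_authors['is_indie'] = toy_authors['is_self_published'] & (
    (toy_authors['total_books'] < 3) | (toy_authors['total_reviews'] < 20)
)
print(toy_authors[['main_author', 'is_indie']])
```

Author C has few books but a listed publisher, so the flag stays False. Author B is self-published but has many books and many reviews, so it is also not flagged.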


In [16]:
#propagate the author-level flag to every book by that author
books['is_indie'] = books['main_author'].map(indie_author.set_index('main_author')['is_indie']).fillna(False)
books.info()


In [17]:
#overwrite books_clean.csv now that is_indie is included, and save the author-level table
books.to_csv('books_clean.csv', index=False)
indie_author.to_csv('authors_clean.csv', index=False)