# Goodreads Data Exploration
## A Journey from R to Python

### Background
Goodreads is an online book review and recommendation site. It's a great place to find new books to read and to keep track of what you've read. It's also a great place to find data on books and their reviews. This project is a personal exploration of the data available on Goodreads and a comparison of the differences between R and Python.

I initially began this project in R, out of a desire to develop my own skills and simply to see what I could accomplish with a bit of determination. It was admittedly a bit rough around the edges, and alongside polishing and finishing that project, I'd like to set out on another goal: teaching myself Python. R is commonly used throughout academia, and during my studies I have taken many courses and done many projects with it. However, the *"real world"* doesn't employ it as commonly. In the data space, that distinction goes by far to Python. Therefore, in preparation for graduating with my master's and job-seeking, I'm eager to learn as much as I can!

For my own understanding, I'll be making personal notes throughout this project on how the two languages differ in syntax and in the perceived utility and function of the operations used.

### Methodology
If we break down the process of going about a project like this, there are two key steps:
1. Data Visualization
2. Analysis & Extrapolation

#### Data Visualization
This step involves examining the data, breaking it down into smaller chunks, and visualizing it in a way that makes sense. This tells us the most obvious trends and patterns in the data and gives us a good idea of what we're working with. Here we can do things such as seeing who the most popularly quoted author is and what kinds of quotes and tags end up becoming popular. This is also where we can have some fun experimenting with abstract and creative visual representations of the data.
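
As a rough sketch of the "most quoted author" question, the counts can be read off with `value_counts`. The small data frame below is a made-up stand-in: the `author` and `likes` column names mirror the real file, but the rows and numbers are invented for illustration and are not results from the actual data.

```python
import pandas as pd

# Toy stand-in for the quotes data; authors and like counts are invented.
toy = pd.DataFrame({
    "author": ["Author A", "Author B", "Author A", "Author C", "Author A", "Author B"],
    "likes": [120, 80, 45, 300, 15, 20],
})

# How often each author is quoted, most frequent first.
quote_counts = toy["author"].value_counts()

# Total likes per author, a second way of looking at "popularity".
likes_by_author = toy.groupby("author")["likes"].sum().sort_values(ascending=False)

print(quote_counts.head(3))
print(likes_by_author.head(3))
```

Coming from R, this is roughly what `dplyr::count(author, sort = TRUE)` and a `group_by()` plus `summarise()` would do. Note that the two rankings need not agree: an author can be quoted often without collecting the most likes.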

#### Analysis & Extrapolation
This step involves making predictions, testing hypotheses, and drawing inferences. This tells us the less obvious trends and patterns in the data. This is where we can really dig into the data and find the hidden gems. Tenets from the realm of causal inference and econometrics come into play here. As an economist in training, I *hope* my years of education prove most useful here.

One preliminary idea is to investigate the **significance of the kinds of tags attached to a quote for its subsequent popularity**. For example, does a quote tagged with "love" have a higher chance of becoming popular than a quote tagged with "hate"? Does a quote tagged with "life" have a higher chance of becoming popular than a quote tagged with "death"? Does a quote tagged with "romance" have a higher chance of becoming popular than a quote tagged with "horror"? For curiosity's sake, can we extrapolate from this to say something about the kind of sentiment that we human beings seek out?
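
A first, purely descriptive pass at the tag question could compare the typical number of likes per tag. This is only a sketch on invented rows: it assumes the tags are stored as one `;`-separated string per quote (as they appear in the output further down), and the median is just one possible summary.

```python
import pandas as pd

# Toy stand-in; tags are ';'-separated strings, all values invented.
toy = pd.DataFrame({
    "quote": ["q1", "q2", "q3", "q4"],
    "tags": ["love;life", "hate", "love", "life;death"],
    "likes": [100, 10, 60, 30],
})

# Reshape to one row per (quote, tag) pair.
exploded = toy.assign(tag=toy["tags"].str.split(";")).explode("tag")

# Median likes per tag, highest first.
median_likes = exploded.groupby("tag")["likes"].median().sort_values(ascending=False)
print(median_likes)
```

A quote with several tags is counted once under each of them, so this is a comparison of averages rather than a causal estimate. Telling apart the effect of a tag from the effect of the author or the quote itself would need the econometric tools mentioned above.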

Another idea is to investigate the **boosting effects that the tags have on the authors themselves**. Are some authors more likely to get the popular tags, and how much of an effect does this have on the popularity of the quote itself?

Going further, we can break down the quotes themselves into different sub-groups based on sentiment and see the degree to which they correlate with the sentiment of their attached tags. Likewise, **is their popularity boosted by the mismatch of the sentiment of their tags?** 

Lastly, I can seek to break down the quotes into clusters based on devised and extrapolated features to create a more macroscopic overview of the quotes present on Goodreads, as a way of wrapping up the project.

### Preliminary Data Exploration
First, import the relevant packages and libraries, much as in R. One difference worth noting is that each library is given an alias on import (for example, `pd` for pandas), and that alias is what the later code uses to call it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import normalize

In [6]:
#Loading the data
quotes = pd.read_csv("quotes.csv")

In [5]:
#Let's see the dimension of the data
quotes.shape

(3001, 5)

In [7]:
#Now, let's see the first 10 rows of the data
quotes.head(10)

Unnamed: 0,index,quote,author,tags,likes
0,0,Be yourself; everyone else is already taken.,Oscar Wilde,attributed-no-source;be-yourself;honesty;inspi...,149270
1,1,You've gotta dance like there's nobody watching,William W. Purkey,dance;heaven;hurt;inspirational;life;love;sing,118888
2,2,Be the change that you wish to see in the world.,Mahatma Gandhi,action;change;inspirational;philosophy;wish,106749
3,3,No one can make you feel inferior without your...,"Eleanor Roosevelt,",confidence;inspirational;wisdom,85854
4,4,Live as if you were to die tomorrow. Learn as ...,Mahatma Gandhi,carpe-diem;education;inspirational;learning,73033
5,5,Darkness cannot drive out darkness: only light...,"Martin Luther King Jr.,",darkness;drive-out;hate;inspirational;light;lo...,72616
6,6,"Without music, life would be a mistake.","Friedrich Nietzsche,",inspirational;music;philosophy,67297
7,7,We accept the love we think we deserve.,"Stephen Chbosky,",inspirational;love,66047
8,8,"Imperfection is beauty, madness is genius and ...",Marilyn Monroe,attributed-no-source;be-yourself;inspirational,48176
9,9,There are only two ways to live your life. One...,Albert Einstein,attributed-no-source;inspirational;life;live;m...,47424


Each row of the dataset records one quote: an index, the quote text, its author, its tags (joined by semicolons), and the number of likes it has received. The data set is a bit messy (several author names carry a trailing comma, for instance) and has some columns that are not needed for this project. The columns that are not needed are dropped and the remaining columns are renamed to be more easily understood.
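
The dropping and renaming step might look something like the sketch below. Which columns to drop and what to call the rest are my own assumptions here: the `index` column is treated as redundant, `n_likes` is a hypothetical new name, and the rows are invented. The sketch also strips the trailing commas that show up in some author names in the output above.

```python
import pandas as pd

# Toy stand-in with the same column names as the real file; rows are invented.
toy = pd.DataFrame({
    "index": [0, 1],
    "quote": ["First quote.", "Second quote."],
    "author": ["Author A,", "Author B"],
    "tags": ["life;love", "music"],
    "likes": [10, 5],
})

cleaned = (
    toy
    # Assumption: 'index' only duplicates the DataFrame's own row index.
    .drop(columns=["index"])
    # Hypothetical new name, chosen only for illustration.
    .rename(columns={"likes": "n_likes"})
    # Remove trailing commas (and spaces) from author names.
    .assign(author=lambda d: d["author"].str.rstrip(", "))
)

print(cleaned.columns.tolist())
print(cleaned["author"].tolist())
```

Chaining the methods like this reads much like a `%>%` pipeline in R, and it leaves the original data frame untouched.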

In [None]:
#Separate the tags into a list; they are delimited by ";" (not ","), as seen in the output above
quotes['tags'] = quotes['tags'].str.split(";")