# Statistics
This notebook contains basic statistics for the datasets.


### Import Packages

In [1]:
import pandas as pd

### Import Data

In [2]:
df_books = pd.read_json("goodreads_books_poetry.json", lines=True)
df_interactions = pd.read_json("goodreads_interactions_poetry.json", lines=True)
df_reviews = pd.read_json("goodreads_reviews_poetry.json", lines=True)
df_authors = pd.read_json("goodreads_book_authors.json", lines=True)
df_series = pd.read_json("goodreads_book_series.json", lines=True)

### df_books

We print basic information about the dataframe: the number of rows and columns, the column names, the number of non-null values in each column, and the column data types.

In [8]:
df_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36514 entries, 0 to 36513
Data columns (total 29 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   isbn                  36514 non-null  object 
 1   text_reviews_count    36514 non-null  int64  
 2   series                36514 non-null  object 
 3   country_code          36514 non-null  object 
 4   language_code         36514 non-null  object 
 5   popular_shelves       36514 non-null  object 
 6   asin                  36514 non-null  object 
 7   is_ebook              36514 non-null  object 
 8   average_rating        36514 non-null  float64
 9   kindle_asin           36514 non-null  object 
 10  similar_books         36514 non-null  object 
 11  description           36514 non-null  object 
 12  format                36514 non-null  object 
 13  link                  36514 non-null  object 
 14  authors               36514 non-null  object 
 15  publisher          

From the output above, it may seem that every value is non-null. This is not the case: some missing values are stored as empty strings, for example, and `info()` counts those as non-null. We handle missing values column by column in the data preprocessing. The dataframe also contains columns that hold derived statistics, such as `text_reviews_count`, `average_rating`, and `ratings_count`. It would probably be better to compute these statistics with queries or views than to store them in the table, so we drop these columns in the preprocessing.
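
One way to check how many missing values hide behind empty strings is to count them per column. The sketch below is only an illustration: the helper name `count_empty_strings` and the toy dataframe are hypothetical, and treating whitespace-only strings as empty is an assumption.

```python
import pandas as pd


def count_empty_strings(df: pd.DataFrame) -> pd.Series:
    """Count, per column, the cells holding an empty or whitespace-only string."""
    counts = {}
    for col in df.columns:
        # Only str cells can be empty strings; lists, dicts and numbers are skipped.
        is_empty = df[col].map(lambda v: isinstance(v, str) and v.strip() == "")
        counts[col] = int(is_empty.sum())
    return pd.Series(counts, dtype="int64")


# Toy stand-in for df_books (hypothetical values).
toy_books = pd.DataFrame(
    {
        "isbn": ["0811201090", "", "0140424385"],
        "publisher": ["", "  ", "Penguin"],
        "text_reviews_count": [3, 0, 12],
    }
)

empty_counts = count_empty_strings(toy_books)
print(empty_counts)
```

On the real data, `count_empty_strings(df_books)` would give the per-column counts. `info()` alone cannot reveal them, because an empty string is a valid non-null object.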

We now print the total memory usage of the dataframe in bytes.

In [3]:
print(f"{df_books.memory_usage(deep=True).sum()} bytes")

101594785 bytes
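
To see which columns account for the total, the per-column figures can be inspected before summing. This is a minimal sketch on a hypothetical toy dataframe; `deep=True` asks pandas to measure the contents of `object` columns rather than only the references to them.

```python
import pandas as pd

# Toy dataframe (hypothetical values).
toy = pd.DataFrame(
    {
        "book_id": [1, 2, 3],
        "description": ["short", "a somewhat longer description", ""],
    }
)

# Per-column memory in bytes (the index is included as its own entry).
per_column = toy.memory_usage(deep=True).sort_values(ascending=False)
total_bytes = int(per_column.sum())

print(per_column)
print(f"{total_bytes} bytes ({total_bytes / 1024**2:.4f} MiB)")
```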


We now print descriptive statistics for the columns.

In [10]:
df_books.describe()

|       | text_reviews_count | average_rating | book_id | ratings_count | work_id |
|-------|--------------------|----------------|---------|---------------|---------|
| count | 36514.0 | 36514.0 | 36514.0 | 36514.0 | 36514.0 |
| mean | 14.690886 | 4.063838 | 10634520.0 | 279.6882 | 13401610.0 |
| std | 110.594374 | 0.399965 | 10353450.0 | 7633.414 | 17049740.0 |
| min | 0.0 | 0.0 | 234.0 | 0.0 | 166.0 |
| 25% | 2.0 | 3.84 | 1185514.0 | 9.0 | 930767.0 |
| 50% | 4.0 | 4.1 | 7223308.0 | 23.0 | 3284191.0 |
| 75% | 9.0 | 4.31 | 18218720.0 | 69.0 | 21655330.0 |
| max | 10403.0 | 5.0 | 36485480.0 | 1029527.0 | 58229640.0 |


### df_interactions

We print basic information about the dataframe: the number of rows and columns, the column names, and the column data types. For a dataframe of this size, `info()` omits the non-null counts by default, so they do not appear in the output.

In [4]:
df_interactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2734350 entries, 0 to 2734349
Data columns (total 10 columns):
 #   Column                  Dtype 
---  ------                  ----- 
 0   user_id                 object
 1   book_id                 int64 
 2   review_id               object
 3   is_read                 bool  
 4   rating                  int64 
 5   review_text_incomplete  object
 6   date_added              object
 7   date_updated            object
 8   read_at                 object
 9   started_at              object
dtypes: bool(1), int64(2), object(7)
memory usage: 190.4+ MB
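
The non-null counts can be requested explicitly with `show_counts=True` (available in recent pandas versions); pandas leaves them out by default when the row count exceeds the `display.max_info_rows` option. The sketch below uses a hypothetical toy dataframe and lowers that option artificially, purely to reproduce the behaviour at small scale.

```python
import io

import pandas as pd

# Toy stand-in for df_interactions (hypothetical values).
toy = pd.DataFrame({"user_id": ["a", "b", "c", "d", "e"], "rating": [0, 4, 5, 0, 3]})

# Lower the threshold so that this 5-row frame counts as "large" (illustration only).
with pd.option_context("display.max_info_rows", 2):
    default_buf = io.StringIO()
    toy.info(buf=default_buf)  # counts omitted, as in the output above

    counts_buf = io.StringIO()
    toy.info(buf=counts_buf, show_counts=True)  # counts forced on

default_text = default_buf.getvalue()
counts_text = counts_buf.getvalue()
print(counts_text)
```

Computing the counts requires a full pass over every column, which is presumably why pandas skips them for large dataframes.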


We now print the total memory usage of the dataframe in bytes.

In [5]:
print(f"{df_interactions.memory_usage(deep=True).sum()} bytes")

1542894381 bytes


We now print descriptive statistics for the columns.

In [13]:
df_interactions.describe()

|       | book_id | rating |
|-------|---------|--------|
| count | 2734350.0 | 2734350.0 |
| mean | 6808744.0 | 1.824787 |
| std | 9698381.0 | 2.123223 |
| min | 234.0 | 0.0 |
| 25% | 42040.0 | 0.0 |
| 50% | 592221.0 | 0.0 |
| 75% | 12193300.0 | 4.0 |
| max | 36485480.0 | 5.0 |


### df_reviews

We print basic information about the dataframe: the number of rows and columns, the column names, the number of non-null values in each column, and the column data types.

In [6]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154555 entries, 0 to 154554
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       154555 non-null  object
 1   book_id       154555 non-null  int64 
 2   review_id     154555 non-null  object
 3   rating        154555 non-null  int64 
 4   review_text   154555 non-null  object
 5   date_added    154555 non-null  object
 6   date_updated  154555 non-null  object
 7   read_at       154555 non-null  object
 8   started_at    154555 non-null  object
 9   n_votes       154555 non-null  int64 
 10  n_comments    154555 non-null  int64 
dtypes: int64(4), object(7)
memory usage: 13.0+ MB


As before, it may seem that every value is non-null. This is not necessarily the case, and we handle missing values column by column in the data preprocessing.

We now print the total memory usage of the dataframe in bytes.

In [7]:
print(f"{df_reviews.memory_usage(deep=True).sum()} bytes")

175832277 bytes


We now print descriptive statistics for the columns.

In [16]:
df_reviews.describe()

|       | book_id | rating | n_votes | n_comments |
|-------|---------|--------|---------|------------|
| count | 154555.0 | 154555.0 | 154555.0 | 154555.0 |
| mean | 10237820.0 | 3.815205 | 1.525632 | 0.252557 |
| std | 10261130.0 | 1.310501 | 7.232086 | 1.63836 |
| min | 234.0 | 0.0 | -1.0 | -1.0 |
| 25% | 522663.0 | 3.0 | 0.0 | 0.0 |
| 50% | 6928895.0 | 4.0 | 0.0 | 0.0 |
| 75% | 18222720.0 | 5.0 | 1.0 | 0.0 |
| max | 36485480.0 | 5.0 | 1065.0 | 168.0 |


We can see that `rating` lies in the range 0 to 5. There are also negative values for `n_votes` and `n_comments`, which should not be possible because these columns hold the number of votes and the number of comments, respectively. We address this in the data preprocessing.
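
A quick way to quantify the problem is to count the negative entries per column. The sketch below does this on a hypothetical toy dataframe and then masks the offending values as missing. The masking step is just one possible treatment and is an assumption, not necessarily what the preprocessing does.

```python
import pandas as pd

# Toy stand-in for df_reviews (hypothetical values).
toy_reviews = pd.DataFrame(
    {
        "n_votes": [0, 3, -1, 12],
        "n_comments": [-1, 0, -1, 2],
    }
)

count_cols = ["n_votes", "n_comments"]

# Number of rows with a negative value, per column.
negative_counts = (toy_reviews[count_cols] < 0).sum()
print(negative_counts)

# Assumed treatment: replace negative counts with a missing-value marker.
# The nullable "Int64" dtype keeps the remaining values as integers.
cleaned = toy_reviews[count_cols].astype("Int64").mask(toy_reviews[count_cols] < 0)
print(cleaned)
```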

### df_authors

We print basic information about the dataframe: the number of rows and columns, the column names, the number of non-null values in each column, and the column data types.

In [8]:
df_authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 829529 entries, 0 to 829528
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   average_rating      829529 non-null  float64
 1   author_id           829529 non-null  int64  
 2   text_reviews_count  829529 non-null  int64  
 3   name                829529 non-null  object 
 4   ratings_count       829529 non-null  int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 31.6+ MB


As before, it may seem that every value is non-null. This is not necessarily the case, and we handle missing values column by column in the data preprocessing. The dataframe also contains columns that hold derived statistics, such as `average_rating`, `text_reviews_count`, and `ratings_count`. It would probably be better to compute these statistics with queries or views than to store them in the table, so we drop these columns in the preprocessing.

We now print the total memory usage of the dataframe in bytes.

In [9]:
print(f"{df_authors.memory_usage(deep=True).sum()} bytes")

85650534 bytes


We now print descriptive statistics for the columns.

In [19]:
df_authors.describe()

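|       | average_rating | author_id | text_reviews_count | ratings_count |
|-------|----------------|-----------|--------------------|---------------|
| count | 829529.0 | 829529.0 | 829529.0 | 829529.0 |
| mean | 3.844779 | 5751610.0 | 106.865331 | 1595.326 |
| std | 0.603013 | 5129977.0 | 1770.225828 | 44796.69 |
| min | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 3.58 | 932718.0 | 2.0 | 8.0 |
| 50% | 3.9 | 4952564.0 | 6.0 | 31.0 |
| 75% | 4.17 | 7838936.0 | 20.0 | 131.0 |
| max | 5.0 | 17343370.0 | 448570.0 | 18532720.0 |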


### df_series

We print basic information about the dataframe: the number of rows and columns, the column names, the number of non-null values in each column, and the column data types.

In [10]:
df_series.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400390 entries, 0 to 400389
Data columns (total 7 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   numbered            400390 non-null  object
 1   note                400390 non-null  object
 2   description         400390 non-null  object
 3   title               400390 non-null  object
 4   series_works_count  400390 non-null  int64 
 5   series_id           400390 non-null  int64 
 6   primary_work_count  400390 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 21.4+ MB


As before, it may seem that every value is non-null. This is not necessarily the case, and we handle missing values column by column in the data preprocessing. The dataframe also contains columns that hold derived statistics, such as `series_works_count` and `primary_work_count`. It would probably be better to compute these statistics with queries or views than to store them in the table, so we drop these columns in the preprocessing.
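
To illustrate the idea of deriving such counts on demand instead of storing them, the sketch below recomputes a works-per-series count from a books table. It rests on assumptions: the toy data is hypothetical, the `series` column of `df_books` is assumed to hold a list of series identifiers per book, and `work_id` is assumed to identify a work across editions. The result covers only the books present in the table, so it need not match the stored `series_works_count`.

```python
import pandas as pd

# Toy stand-in for df_books (hypothetical values).
toy_books = pd.DataFrame(
    {
        "book_id": [10, 11, 12, 13],
        "work_id": [100, 100, 101, 102],  # books 10 and 11 are editions of one work
        "series": [["s1"], ["s1"], ["s1", "s2"], []],
    }
)

# One row per (book, series) pair; books without a series yield NaN and are dropped.
book_series = toy_books.explode("series").dropna(subset=["series"])

# Distinct works per series, computed on demand rather than stored.
works_per_series = book_series.groupby("series")["work_id"].nunique()
print(works_per_series)
```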

We now print the total memory usage of the dataframe in bytes.

In [11]:
print(f"{df_series.memory_usage(deep=True).sum()} bytes")

156037473 bytes


We now print descriptive statistics for the columns.

In [22]:
df_series.describe()

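|       | series_works_count | series_id | primary_work_count |
|-------|--------------------|-----------|--------------------|
| count | 400390.0 | 400390.0 | 400390.0 |
| mean | 21.588149 | 623045.0 | 19.771653 |
| std | 65.1031 | 294445.3 | 63.501377 |
| min | -14.0 | 144392.0 | 0.0 |
| 25% | 3.0 | 363737.2 | 3.0 |
| 50% | 6.0 | 615837.0 | 5.0 |
| 75% | 14.0 | 877564.8 | 12.0 |
| max | 893.0 | 1143859.0 | 893.0 |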
