# Data Preparation

First, we load the two provided CSV files and inspect their contents:

- **SPGC-metadata-2018-07-18.csv:** Contains book IDs and book-level attributes, including download counts.
- **KLDscores.csv:** Contains book IDs (in the `filename` column) and the corresponding “narrative revelation” scores.


In [2]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Loading the Data

In [4]:
import pandas as pd
import ast

# Load data
metadata = pd.read_csv('/content/drive/My Drive/Project 3/SPGC-metadata-2018-07-18.csv')
kld_scores = pd.read_csv('/content/drive/My Drive/Project 3/KLDscores.csv')

# Print column names to confirm that the merge keys 'id' and 'filename' exist
print("Metadata columns:", metadata.columns)
print("KLD scores columns:", kld_scores.columns)

Metadata columns: Index(['id', 'title', 'author', 'authoryearofbirth', 'authoryearofdeath',
       'language', 'downloads', 'subjects', 'type'],
      dtype='object')
KLD scores columns: Index(['filename', 'kld_values'], dtype='object')


# Processing KLD Scores

In [5]:
# Merge datasets using 'id' and 'filename' as keys
data = pd.merge(metadata, kld_scores, left_on='id', right_on='filename')

# Display merged data
data.head()

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,filename,kld_values
0,PG25,The 1991 CIA World Factbook,United States. Central Intelligence Agency,,,['en'],53.0,"{'Political statistics -- Handbooks, manuals, ...",Text,PG25,"[0.27372407402192245, 0.25156598695740634, 0.2..."
1,PG38,"The Jargon File, Version 2.9.10, 01 Jul 1992",,,,['en'],49.0,{'Electronic data processing -- Terminology --...,Text,PG38,"[0.2694090635592844, 0.2228887493335846, 0.262..."
2,PG48,The 1992 CIA World Factbook,United States. Central Intelligence Agency,,,['en'],35.0,"{'Political statistics -- Handbooks, manuals, ...",Text,PG48,"[0.24616216720028516, 0.2335522380106473, 0.25..."
3,PG50,Pi,"Hemphill, Scott",,,['en'],87.0,"{'Pi', 'Mathematics'}",Dataset,PG50,"[0.24114507546033045, 0.23104906018664845, 0.2..."
4,PG65,"The First 100,000 Prime Numbers",Unknown,,,['en'],48.0,"{'Numbers, Prime', 'Mathematics'}",Dataset,PG65,"[0.2313005764267122, 0.23104906018664845, 0.23..."
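The merge above is an inner join, so any book that appears in only one of the two files is silently dropped. A minimal sketch of how the match could be audited, using small made-up tables (`metadata_demo` and `kld_demo` are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for the two tables; the real ones come from the CSV files.
metadata_demo = pd.DataFrame(
    {"id": ["PG25", "PG38", "PG99"], "downloads": [53.0, 49.0, 12.0]}
)
kld_demo = pd.DataFrame(
    {
        "filename": ["PG25", "PG38", "PG77"],
        "kld_values": ["[0.27, 0.25]", "[0.26, 0.22]", "[0.24, 0.23]"],
    }
)

# An outer merge with indicator=True adds a '_merge' column that records
# whether each row was found in both tables or in only one of them.
check = pd.merge(
    metadata_demo,
    kld_demo,
    left_on="id",
    right_on="filename",
    how="outer",
    indicator=True,
)
merge_counts = check["_merge"].value_counts()
print(merge_counts)
```

Rows labelled `left_only` or `right_only` are the ones an inner join would discard, so their counts show how much data the merge loses.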


In [6]:
import ast
import numpy as np

# Compute summary features from one book's KLD series
def compute_kld_features(kld_list):
    # kld_values is stored as a string; parse it safely (no eval) into a float array
    kld_array = np.array(ast.literal_eval(kld_list), dtype=np.float64)
    avg_kld = np.mean(kld_array)  # average level of the series
    var_kld = np.var(kld_array)   # population variance (ddof=0)
    # Slope of a least-squares line fitted against position in the book
    slope, _ = np.polyfit(range(len(kld_array)), kld_array, 1)
    return avg_kld, var_kld, slope
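To make the three features concrete, here is a small sketch that runs the same calculations on a made-up four-point series. The helper name `kld_features_demo` and the input string are illustrative assumptions, and `ast.literal_eval` is used here as a safer way to parse the stored string than evaluating it as code:

```python
import ast

import numpy as np


def kld_features_demo(kld_text):
    """Return (mean, population variance, linear slope) of a KLD series."""
    # Parse the string form of the list without executing arbitrary code.
    values = np.array(ast.literal_eval(kld_text), dtype=np.float64)
    positions = np.arange(len(values))
    # Degree-1 polyfit returns (slope, intercept), highest power first.
    slope, _intercept = np.polyfit(positions, values, 1)
    return float(values.mean()), float(values.var()), float(slope)


# A steadily rising toy series: each step adds 0.1.
avg_demo, var_demo, slope_demo = kld_features_demo("[0.1, 0.2, 0.3, 0.4]")
print(avg_demo, var_demo, slope_demo)
```

For this rising series the slope comes out at roughly 0.1 per position, which is the sense in which a positive slope indicates increasing revelation over the course of a book.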


In [7]:
print(data.columns)

Index(['id', 'title', 'author', 'authoryearofbirth', 'authoryearofdeath',
       'language', 'downloads', 'subjects', 'type', 'filename', 'kld_values'],
      dtype='object')


In [14]:
import statsmodels.api as sm
data['avg_kld'], data['var_kld'], data['slope_kld'] = zip(*data['kld_values'].apply(compute_kld_features))

# Log-transform the download counts; log1p(x) = log(1 + x), so zero downloads map to 0
data['log_downloads'] = np.log1p(data['downloads'])

# Inspect rows with zero or negative download counts
zero_or_negative = data[data['downloads'] <= 0]
zero_or_negative.head()

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,filename,kld_values,avg_kld,var_kld,slope_kld,log_downloads
2529,PG7036,The Poorhouse Waif and His Divine Teacher: A T...,"Byrum, Isabel C. (Isabel Coston)",1870.0,1938.0,['en'],0.0,"{'United States -- History -- Civil War, 1861-...",Text,PG7036,"[0.1974195876353944, 0.24416724533941345, 0.22...",0.23375,0.001449,0.000911,0.0
2573,PG7117,Memoirs of Sir Wemyss Reid 1842-1885,"Reid, T. Wemyss (Thomas Wemyss)",1842.0,1905.0,['en'],0.0,"{'Reid, T. Wemyss (Thomas Wemyss), 1842-1905',...",Text,PG7117,"[0.20847339572972312, 0.25221563058321605, 0.2...",0.239706,0.003236,0.001346,0.0
2613,PG7212,Memories of Canada and Scotland — Speeches and...,"Argyll, John Douglas Sutherland Campbell, Duke of",1845.0,1914.0,['en'],0.0,"{'Canada -- Poetry', 'Speeches, addresses, etc...",Text,PG7212,"[0.31304457699600835, 0.24348200995569252, 0.2...",0.271015,0.007597,-3.1e-05,0.0
2639,PG7305,Memoir and Letters of Francis W. Newman,"Sieveking, I. Giberne (Isabel Giberne)",,,['en'],0.0,"{'Newman, Francis William, 1805-1897'}",Text,PG7305,"[0.255773469322627, 0.23861420558896848, 0.223...",0.237622,0.000866,0.000691,0.0
2734,PG7665,What Will He Do with It? — Volume 07,"Lytton, Edward Bulwer Lytton, Baron",1803.0,1873.0,['en'],0.0,{'English fiction -- 19th century'},Text,PG7665,"[0.21982475262901266, 0.2668938667859694, 0.24...",0.248827,0.000755,-7e-05,0.0
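The rows above all have `log_downloads` equal to 0.0 because the transform shifts the counts by 1 before taking the logarithm. A short sketch on made-up counts (`downloads_demo` is a hypothetical series) shows the equivalence with `np.log1p` and how zero-count rows can be tallied:

```python
import numpy as np
import pandas as pd

# Made-up download counts, including a zero.
downloads_demo = pd.Series([0.0, 1.0, 53.0])

# Both expressions compute log(1 + x); log1p is the dedicated NumPy function.
log_plus_one = np.log(downloads_demo + 1)
log1p_version = np.log1p(downloads_demo)

# Count how many books have no recorded downloads.
zero_rows = int((downloads_demo <= 0).sum())
print(log1p_version.tolist(), zero_rows)
```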


# Regression Analysis

In [16]:
# Prepare the regression variables
X = data[['avg_kld', 'var_kld', 'slope_kld']]
y = data['log_downloads']


# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:          log_downloads   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     21.80
Date:                Wed, 03 Jul 2024   Prob (F-statistic):           4.35e-14
Time:                        03:14:02   Log-Likelihood:                -28883.
No. Observations:               18988   AIC:                         5.777e+04
Df Residuals:                   18984   BIC:                         5.780e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.3044      0.117     28.360      0.0
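To clarify what the intercept column and the R-squared figure in the summary represent, the sketch below reproduces the calculation with plain NumPy on synthetic data. The data, seed, and variable names are assumptions made for illustration; none of the numbers relate to the notebook's results:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the three KLD features and the log-download target.
features = rng.normal(size=(n, 3))
target = 3.0 + 0.5 * features[:, 0] + rng.normal(size=n)

# Adding a constant means prepending a column of ones for the intercept.
design = np.column_stack([np.ones(n), features])

# Ordinary least squares: minimise the sum of squared residuals.
coefs, *_ = np.linalg.lstsq(design, target, rcond=None)

# R-squared is the share of the target's variance explained by the fit.
fitted = design @ coefs
ss_res = np.sum((target - fitted) ** 2)
ss_tot = np.sum((target - target.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(coefs, r_squared)
```

Read this way, an R-squared of 0.003 means the fitted model accounts for about 0.3% of the variation in log downloads, even when the F-test rejects the hypothesis that all slopes are zero.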

## Investigating Heterogeneity Across Genres Using LASSO

In [17]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.impute import SimpleImputer

# List the available columns; 'subjects' holds the genre-like labels
print(data.columns)

# Drop identifiers, the target, and text columns
# (note: 'subjects' is dropped too, so genre is not a predictor in this fit)
X = data.drop(columns=['id', 'title', 'downloads', 'log_downloads', 'filename', 'kld_values', 'author', 'language', 'subjects'])

# Keep only numeric columns (this also removes the non-numeric 'type' column)
X = X.select_dtypes(include=['number'])

# Handle missing values using imputation
imputer = SimpleImputer(strategy='mean') # Replace missing values with the mean of the column
X_imputed = imputer.fit_transform(X)

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

# Fit the LASSO model
lasso = LassoCV(cv=5).fit(X_scaled, y)

# Get the coefficients
coef = pd.Series(lasso.coef_, index=X.columns)
print(coef[coef != 0])

Index(['id', 'title', 'author', 'authoryearofbirth', 'authoryearofdeath',
       'language', 'downloads', 'subjects', 'type', 'filename', 'kld_values',
       'avg_kld', 'var_kld', 'slope_kld', 'log_downloads'],
      dtype='object')
authoryearofbirth   -0.007284
authoryearofdeath   -0.120880
slope_kld           -0.007699
dtype: float64
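The fit above uses only numeric columns, so genre never enters the model. One possible way to bring it in is to turn `subjects` into an indicator and interact it with a KLD metric. The sketch below assumes `subjects` is stored as the string form of a Python set, as the earlier output suggests; the helper `has_keyword`, the keyword "fiction", and the toy rows are illustrative choices:

```python
import ast

import pandas as pd

# Toy rows mimicking the 'subjects' and 'slope_kld' columns.
books_demo = pd.DataFrame(
    {
        "subjects": [
            "{'Pi', 'Mathematics'}",
            "{'English fiction -- 19th century'}",
            "{'Canada -- Poetry'}",
        ],
        "slope_kld": [0.0009, -0.00007, -0.00003],
    }
)


def has_keyword(subjects_text, keyword):
    """True if any subject label contains the keyword (case-insensitive)."""
    subjects = ast.literal_eval(subjects_text)  # string -> set of labels
    return any(keyword.lower() in label.lower() for label in subjects)


# Genre indicator, plus an interaction term that lets the slope effect
# differ between fiction and non-fiction in a later regression.
books_demo["is_fiction"] = books_demo["subjects"].apply(has_keyword, keyword="fiction")
books_demo["slope_x_fiction"] = books_demo["slope_kld"] * books_demo["is_fiction"]
print(books_demo[["is_fiction", "slope_x_fiction"]])
```

Columns built this way are numeric, so they would survive the `select_dtypes` step and could be passed to the LASSO alongside the other predictors.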


# Analysis of Information Revelation and Book Popularity

## Summary

### KLD Characteristics and Book Popularity
- **Overview:** Computed three summary metrics from each book's Kullback-Leibler divergence (KLD) series: the average KLD, its variance, and the slope of a linear trend fitted across the book.
- **Insights:** Together these metrics describe the level, steadiness, and direction of narrative revelation in a book.
- **Findings:**
  - The KLD metrics are jointly significant in the regression against log-transformed download counts (F = 21.80, p = 4.35e-14, 18,988 books), but the model explains only about 0.3% of the variance (R-squared = 0.003), so they are weak predictors of popularity on their own.
  - In the LASSO fit, `avg_kld` and `var_kld` shrink to zero and only `slope_kld` is retained, with a small negative standardized coefficient (-0.0077). The results shown therefore do not support a strong positive link between average KLD or narrative slope and popularity.

### Genre-Specific Effects and Predictive Variables
- **Heterogeneity Analysis:** Fitted a cross-validated LASSO regression on the standardized numeric predictors. The `subjects` column was dropped before fitting, so this step does not yet estimate genre-specific effects.
- **Key Predictive Variables:**
  - LASSO retained three predictors, all with negative standardized coefficients: `authoryearofdeath` (-0.121), `authoryearofbirth` (-0.007), and `slope_kld` (-0.008).
  - The author's era is the strongest retained signal, suggesting that books by earlier authors tend to have more downloads. `author` and `type` are non-numeric and were not included as predictors, so their effects were not tested.
- **Conclusion:** Narrative revelation metrics are only a small part of the picture. A genre-specific analysis, for example one that encodes `subjects` as indicator variables and interacts them with the KLD metrics, is still needed to understand how these relationships vary across genres.