# KindleUnlimited Subscription Analysis

## Introduction

In this notebook, I assess performance data for titles that have participated in Amazon's subscription program KindleUnlimited (KU for short). My goal is to find the following three items:

1. The 5 titles with the highest daily borrow rate from April 1, 2019 through June 30, 2019 and their borrow rates.

2. The mean and standard deviation of daily borrow rates for titles active in Prime Reading versus titles not active in Prime Reading, and whether the difference between the two groups is significant.

3. The correlation between genre and daily borrow rates.

Based on this exploration of the dataset and the three items above, I then make recommendations for the business team in charge of nominating titles.
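
A minimal sketch of how the first item could be computed, assuming the "daily borrow rate" is the mean of `ku_borrows` per day over the window (the notebook has not defined it at this point). The helper `top_daily_borrow_rates` and the toy values are hypothetical; only the column names `title_id`, `Date`, and `ku_borrows` come from the dataset.

```python
import pandas as pd

def top_daily_borrow_rates(borrows, start, end, n=5):
    """Return the n titles with the highest mean daily borrows in [start, end].

    Assumption: daily borrow rate = mean of ku_borrows over the days a title
    has rows inside the window. `borrows` needs a datetime 'Date' column.
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    in_window = borrows[borrows["Date"].between(start, end)]
    rates = in_window.groupby("title_id")["ku_borrows"].mean()
    return rates.nlargest(n)

# Toy data (hypothetical values, not taken from the real CSV files)
toy = pd.DataFrame({
    "title_id": [0, 0, 1, 1, 2, 2],
    "Date": pd.to_datetime([
        "2019-04-01", "2019-04-02",
        "2019-04-01", "2019-04-02",
        "2019-03-31", "2019-04-01",  # the first row falls outside the window
    ]),
    "ku_borrows": [3.0, 5.0, 10.0, 20.0, 100.0, 1.0],
})

top = top_daily_borrow_rates(toy, "2019-04-01", "2019-06-30", n=2)
print(top)
```

Note that `mean()` skips missing borrow counts by default, so days with a missing `ku_borrows` value would not count toward a title's rate under this definition.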

## The Data

I was given two CSV files, which I import and interpret below. There are over 2900 titles across 

### Dataset Interpretability

A borrow is a book read through the KU program, while a sale is a book purchased on Amazon. Most titles are on the platform for only one quarter, but others participate for multiple quarters.

For each title, there is supporting data:

1. whether the title was simultaneously in Prime Reading during the quarter of participation
2. average units sold on Amazon over the 90 days immediately before starting in KU
3. the genre
4. the first and last day of participation in KU
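
Because the Prime Reading flag is recorded per title and quarter, comparing borrow rates between the two groups first requires attaching each daily borrow row to its title-quarter row. The sketch below shows one way this might be done, assuming quarters are calendar quarters and that `title_id` plus `quarter_start_date` identify a title-quarter row. The toy values are hypothetical, and Welch's t statistic is used here only as one possible significance check, not necessarily the test applied later in the analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two dataframes (hypothetical values)
borrows = pd.DataFrame({
    "title_id": [0, 0, 0, 0, 1, 1],
    "Date": pd.to_datetime([
        "2018-04-21", "2018-04-22", "2018-07-01",
        "2018-07-02", "2018-04-21", "2018-04-22",
    ]),
    "ku_borrows": [3.0, 6.0, 1.0, 0.0, 8.0, 10.0],
})
titles_data = pd.DataFrame({
    "title_id": [0, 0, 1],
    "quarter_start_date": pd.to_datetime(["2018-04-01", "2018-07-01", "2018-04-01"]),
    "active_in_prime_reading": [0, 1, 1],
})

# Map each borrow date to the first day of its calendar quarter
borrows["quarter_start_date"] = borrows["Date"].dt.to_period("Q").dt.start_time

# Attach the per-quarter Prime Reading flag to every daily borrow row
merged = borrows.merge(titles_data, on=["title_id", "quarter_start_date"], how="left")

# Mean, sample standard deviation, and row count per group
summary = merged.groupby("active_in_prime_reading")["ku_borrows"].agg(["mean", "std", "count"])
print(summary)

# Welch's t statistic for the difference in group means (unequal variances)
g0, g1 = summary.loc[0], summary.loc[1]
welch_t = (g1["mean"] - g0["mean"]) / np.sqrt(
    g1["std"] ** 2 / g1["count"] + g0["std"] ** 2 / g0["count"]
)
print(welch_t)
```

A left merge keeps borrow rows that have no matching title-quarter row; those rows would carry a missing flag and drop out of the `groupby`, which makes unmatched records easy to spot.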

In [2]:
# import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [6]:
# import the CSV files into dataframes

borrows = pd.read_csv('data_science_challenge_ku_borrows[3][1][2][2].csv')
titles_data = pd.read_csv('data_science_challenge_ku_titles_data[3][1][2][1].csv')

# print the first five rows of each dataframe

print('1st dataset: KU Borrows')
print(borrows.head())

print('2nd dataset: Title Information')
titles_data.head()

1st dataset: KU Borrows
   title_id     Date  ku_borrows
0         0  4/21/18         3.0
1         0  4/22/18         6.0
2         0  4/23/18         1.0
3         0  4/24/18         0.0
4         0  4/25/18         3.0
2nd dataset: Title Information


   title_id quarter_start_date first_day_in_KU last_day_in_KU  avg_units_sold_in_preceding_90_days  active_in_prime_reading       genre
0         0             4/1/18          4/1/18        6/30/19                             1.422222                        0  nonfiction
1         0             7/1/18          4/1/18        6/30/19                             1.145455                        1  nonfiction
2         0            10/1/18          4/1/18        6/30/19                             0.855556                        1  nonfiction
3         0             1/1/19          4/1/18        6/30/19                             0.755556                        1  nonfiction
4         0             4/1/19          4/1/18        6/30/19                             0.600000                        1  nonfiction


In [7]:
#display info

print(borrows.info())
print(titles_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222942 entries, 0 to 222941
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   title_id    222942 non-null  int64  
 1   Date        222942 non-null  object 
 2   ku_borrows  222932 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 5.1+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2905 entries, 0 to 2904
Data columns (total 7 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   title_id                             2905 non-null   int64  
 1   quarter_start_date                   2905 non-null   object 
 2   first_day_in_KU                      2905 non-null   object 
 3   last_day_in_KU                       2905 non-null   object 
 4   avg_units_sold_in_preceding_90_days  2902 non-null   float64
 5   active_in_prime_reading              290
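
The `info()` output above reports 222932 non-null `ku_borrows` values out of 222942 rows (10 missing) and 2902 non-null `avg_units_sold_in_preceding_90_days` values out of 2905 rows (3 missing). A small sketch of how such gaps can be counted and inspected directly, using a hypothetical toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame with two missing borrow counts (hypothetical values)
toy = pd.DataFrame({
    "title_id": [0, 0, 1, 1],
    "ku_borrows": [3.0, np.nan, 1.0, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The rows where the borrow count is missing
rows_with_missing = toy[toy["ku_borrows"].isna()]
print(rows_with_missing)
```

On the real dataframes the same calls would be `borrows.isna().sum()` and `titles_data.isna().sum()`.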

In [19]:
#convert date columns from object (string) to datetime
#errors='coerce' turns values that do not match the format into NaT

borrows['Date'] = pd.to_datetime(borrows['Date'], format='%m/%d/%y', errors='coerce')

titles_data['quarter_start_date'] = pd.to_datetime(titles_data['quarter_start_date'], format='%m/%d/%y', errors='coerce')
titles_data['first_day_in_KU'] = pd.to_datetime(titles_data['first_day_in_KU'], format='%m/%d/%y', errors='coerce')
titles_data['last_day_in_KU'] = pd.to_datetime(titles_data['last_day_in_KU'], format='%m/%d/%y', errors='coerce')
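
To illustrate what `errors='coerce'` does in the conversion above, here is a small sketch on a hypothetical series of raw strings: values that match `%m/%d/%y` become timestamps, and anything else becomes `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values: two valid dates and two that cannot be parsed
raw = pd.Series(["4/21/18", "6/30/19", "not a date", "13/45/18"])

parsed = pd.to_datetime(raw, format="%m/%d/%y", errors="coerce")

print(parsed)               # invalid entries show up as NaT
print(parsed.dtype)         # a datetime64 dtype confirms the conversion
print(parsed.isna().sum())  # number of values that failed to parse
```

Checking `.dtype` on the converted column is a direct way to confirm that it now holds datetimes.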



In [20]:
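#check the Date column; type() only reports the container class (Series),
#so the datetime conversion itself would be confirmed with borrows['Date'].dtype

print(type(borrows['Date']))

#count invalid dates (NaT) in each date column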

print("Borrows invalid dates #:", borrows['Date'].isna().value_counts())
print("Quarter_start_date invalid dates #:", titles_data['quarter_start_date'].isna().value_counts())
print("first_day_in_KU invalid dates #:", titles_data['first_day_in_KU'].isna().value_counts())
print("last_day_in_KU invalid dates #:", titles_data['last_day_in_KU'].isna().value_counts())

<class 'pandas.core.series.Series'>
Borrows invalid dates #: False    222933
True          9
Name: Date, dtype: int64
Quarter_start_date invalid dates #: False    2904
True        1
Name: quarter_start_date, dtype: int64
first_day_in_KU invalid dates #: False    2905
Name: first_day_in_KU, dtype: int64
last_day_in_KU invalid dates #: False    2903
True        2
Name: last_day_in_KU, dtype: int64
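
The counts above show 9 borrow rows, 1 `quarter_start_date`, and 2 `last_day_in_KU` values that became `NaT`. Since the conversion overwrote the original strings, looking at the offending raw values requires keeping (or re-reading) the unparsed column. A minimal sketch of that idea on hypothetical toy data, separating values that failed to parse from values that were already missing:

```python
import pandas as pd

# Hypothetical raw frame: one valid date, one malformed string, one missing value
raw = pd.DataFrame({
    "title_id": [0, 0, 1],
    "Date": ["4/21/18", "bad", None],
})

# Parse into a separate series so the raw strings stay available
parsed = pd.to_datetime(raw["Date"], format="%m/%d/%y", errors="coerce")

# Rows that had a raw value but could not be parsed
failed = raw[parsed.isna() & raw["Date"].notna()]
print(failed)

# Rows where the raw value was missing to begin with
already_missing = raw[raw["Date"].isna()]
print(already_missing)
```

Seeing the raw strings helps decide whether the affected rows can be repaired or should be dropped before computing borrow rates.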
