# Meta
https://www.python-graph-gallery.com/324-map-a-color-to-network-nodes 


# Audible Data Analysis

#### Data Scraped on 5/10/21

### Introduction
This notebook documents the analysis of the title and category information scraped from Audible.com and Amazon.com. The focus is on book length, price, ratings, and number of listens.


### Requirements
Static images will be included in the presentation accompanying this notebook. **Confirm requirements.txt is met**, then run the web scraping programs below before analysis:
- books_scrapy_audible -> category_spider
- books_scrapy_audible -> titles_spider

Running these programs should result in the following CSV files:
- books_scrapy_audible -> category_hierarchy_n_urls.csv
- books_scrapy_audible -> title_information.csv

In [1]:
import numpy as np
import seaborn
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

### Data Import
Due to the nature of the scraping program, which is described in README.md, most titles have been scraped multiple times. Fortunately, each product has a single unique URL, so `title_url` can be used to filter for unique titles, and the categories of the repeated rows can be condensed into a list that acts like a list of tags.
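The deduplication described above could be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's final cleaning step: the `demo` frame, its placeholder values, and the `categories` column name are assumptions, while `title_url` and `title_category` are the columns from `title_information.csv`.

```python
import pandas as pd

# Hypothetical rows standing in for title_information.csv:
# the same title_url appears once for each category it was scraped under.
demo = pd.DataFrame({
    "title": ["Title A", "Title A", "Title B"],
    "title_url": ["url_a", "url_a", "url_b"],
    "title_category": ["Category X", "Category Y", "Category X"],
})

# Collect every category seen for a URL into one sorted list (the "tags").
tags = (
    demo.groupby("title_url")["title_category"]
    .agg(lambda s: sorted(set(s)))
    .rename("categories")
    .reset_index()
)

# Keep one row per URL and attach its list of categories.
unique_titles = (
    demo.drop_duplicates(subset="title_url")
    .drop(columns="title_category")
    .merge(tags, on="title_url")
)
print(unique_titles)
```

Grouping on the URL rather than on the title avoids merging two different products that happen to share a name.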

In [30]:
cat_data   = pd.read_csv("books_scrapy_audible/category_hierarchy_n_urls.csv")
title_data = pd.read_csv("books_scrapy_audible/title_information.csv")

In [31]:
cat_data.head()

Unnamed: 0,category_name,category_numb_title,parent_category,self_url,title_list_url
0,Travel & Tourism,9433,Audible,https://www.audible.com/cat/Travel-Tourism-Aud...,https://www.audible.com/search?node=1858109501...
1,Teen,20201,Audible,https://www.audible.com/cat/Teen-Audiobooks/18...,https://www.audible.com/search?node=1858071501...
2,Sports & Outdoors,24533,Audible,https://www.audible.com/cat/Sports-Outdoors-Au...,https://www.audible.com/search?node=1858064801...
3,Science Fiction & Fantasy,57300,Audible,https://www.audible.com/cat/Science-Fiction-Fa...,https://www.audible.com/search?node=1858060601...
4,Science & Engineering,19058,Audible,https://www.audible.com/cat/Science-Engineerin...,https://www.audible.com/search?node=1858054001...


In [25]:
title_data.head()

Unnamed: 0,author,count_rating,language,length,narrator,pod_flag,price,release_date,star_rating,subtitle,title,title_category,title_url
0,Yuval Noah Harari,43059.0,English,917,Derek Perkins,False,34.22,08-15-17,4.5,A Brief History of Humankind,Sapiens,Biological Sciences,https://www.audible.com/pd/Sapiens-Audiobook/B...
1,Walter Isaacson,1622.0,English,964,"Kathe Mazur,Walter Isaacson",False,28.34,03-09-21,4.5,"Jennifer Doudna, Gene Editing, and the Future ...",The Code Breaker,Biological Sciences,https://www.audible.com/pd/The-Code-Breaker-Au...
2,James Nestor,4300.0,English,438,James Nestor,False,24.5,05-26-20,5.0,The New Science of a Lost Art,Breath,Biological Sciences,https://www.audible.com/pd/Breath-Audiobook/05...
3,Robin Wall Kimmerer,4346.0,English,1004,Robin Wall Kimmerer,False,34.99,12-27-15,5.0,"Indigenous Wisdom, Scientific Knowledge and th...",Braiding Sweetgrass,Biological Sciences,https://www.audible.com/pd/Braiding-Sweetgrass...
4,Suzanne Simard,5.0,English,733,Suzanne Simard,False,31.5,05-04-21,5.0,Discovering the Wisdom of the Forest,Finding the Mother Tree,Biological Sciences,https://www.audible.com/pd/Finding-the-Mother-...


### Category Structure
Audible has a variety of categories that let the user filter their search as desired. Interestingly, the category system seems to function more like tags, with content belonging to multiple unconnected categories. Additionally, since only 1200 results are displayed in the search pane, categories containing more than 1200 titles are generally broken down into sub-categories.
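One way to check the 1200-title observation against the scraped data is to compare each category's size with whether it appears as a parent of another category. The sketch below uses a made-up `demo` frame with the same column names as `category_hierarchy_n_urls.csv`; the `SEARCH_CAP`, `over_cap`, and `has_children` names are assumptions introduced here. Matching on the plain category name is only approximate, because the same name can occur under more than one parent.

```python
import pandas as pd

SEARCH_CAP = 1200  # result limit quoted in the text above

# Hypothetical miniature of the category table.
demo = pd.DataFrame({
    "category_name": ["Science", "Physics", "Botany"],
    "parent_category": ["Audible", "Science", "Science"],
    "category_numb_title": [19058, 900, 300],
})

# Flag categories that exceed the search cap.
demo["over_cap"] = demo["category_numb_title"] > SEARCH_CAP

# A category "has children" if its name shows up as some row's parent.
demo["has_children"] = demo["category_name"].isin(demo["parent_category"])

# Share of categories with sub-categories, split by whether they exceed the cap.
share_with_children = demo.groupby("over_cap")["has_children"].mean()
print(share_with_children)
```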

In [32]:
# Add a placeholder row for the root "Audible" node to make assignments easier.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
root_row = pd.DataFrame([{'category_name': "Audible",
                          'category_numb_title': 361480,
                          'parent_category': "",
                          'self_url': "",
                          'title_list_url': ""}])
cat_data = pd.concat([cat_data, root_row], ignore_index=True)

# Create a column, intended as a unique label, that concatenates the parent category and the category name.
cat_data['unique_cat_name'] = cat_data['parent_category'] + "_" + cat_data['category_name']


In [34]:
cat_data['unique_cat_name']

0                Audible_Travel & Tourism
1                            Audible_Teen
2               Audible_Sports & Outdoors
3       Audible_Science Fiction & Fantasy
4           Audible_Science & Engineering
                      ...                
1045           Social Studies_Law & Crime
1046            Social Studies_Government
1047             Social Studies_Economics
1048               Social Studies_Careers
1049                             _Audible
Name: unique_cat_name, Length: 1050, dtype: object

In [27]:
cat_data.head()

Unnamed: 0,category_name,category_numb_title,parent_category,self_url,title_list_url,unique_cat_name
0,Travel & Tourism,9433,Audible,https://www.audible.com/cat/Travel-Tourism-Aud...,https://www.audible.com/search?node=1858109501...,Audible_Travel & Tourism
1,Teen,20201,Audible,https://www.audible.com/cat/Teen-Audiobooks/18...,https://www.audible.com/search?node=1858071501...,Audible_Teen
2,Sports & Outdoors,24533,Audible,https://www.audible.com/cat/Sports-Outdoors-Au...,https://www.audible.com/search?node=1858064801...,Audible_Sports & Outdoors
3,Science Fiction & Fantasy,57300,Audible,https://www.audible.com/cat/Science-Fiction-Fa...,https://www.audible.com/search?node=1858060601...,Audible_Science Fiction & Fantasy
4,Science & Engineering,19058,Audible,https://www.audible.com/cat/Science-Engineerin...,https://www.audible.com/search?node=1858054001...,Audible_Science & Engineering


In [33]:
graph = nx.from_pandas_edgelist(cat_data, 'parent_category', 'unique_cat_name', create_using=nx.DiGraph())

# Align the per-category title counts with the graph's node order so they can be used as node colors.
node_weight = cat_data.set_index('unique_cat_name').reindex(graph.nodes())

nx.draw(graph, with_labels=False, arrows=True, node_color=node_weight['category_numb_title'])
plt.title("Audible Categories")
plt.show()

ValueError: cannot reindex from a duplicate axis
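
pandas raises this error when the index being reindexed contains duplicate labels, so at least one `unique_cat_name` value must occur more than once in `cat_data`. A possible workaround, sketched here on a made-up `cat_demo` frame and not verified against the real data, is to drop the repeated labels before calling `set_index`. The sketch also fills missing values with 0, because the parent nodes are added under their plain `parent_category` names, which do not appear among the `unique_cat_name` labels and therefore come back as `NaN` after `reindex`.

```python
import pandas as pd
import networkx as nx

# Hypothetical miniature of cat_data, including one repeated row.
cat_demo = pd.DataFrame({
    "category_name": ["Science", "Physics", "Physics", "Audible"],
    "parent_category": ["Audible", "Science", "Science", ""],
    "category_numb_title": [19058, 500, 500, 361480],
})
cat_demo["unique_cat_name"] = cat_demo["parent_category"] + "_" + cat_demo["category_name"]

# Drop repeated labels so the index is unique and reindex no longer fails.
cat_unique = cat_demo.drop_duplicates(subset="unique_cat_name")

# Skip the placeholder root row, whose parent is the empty string.
edges = cat_unique[cat_unique["parent_category"] != ""]
graph = nx.from_pandas_edgelist(edges, "parent_category", "unique_cat_name",
                                create_using=nx.DiGraph())

# Align title counts with the node order; nodes without a match get 0.
node_weight = (
    cat_unique.set_index("unique_cat_name")["category_numb_title"]
    .reindex(list(graph.nodes()))
    .fillna(0)
)
print(node_weight)
```

The resulting `node_weight` has one value per node and could be passed to `nx.draw` as `node_color`. Whether dropping duplicates is appropriate depends on why the labels repeat in the scraped data, which this sketch does not investigate.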

In [18]:


# Rebuild the graph using the plain category names as node labels.
# NOTE: cat_data_2 is not defined in the cells shown in this notebook.
graph = nx.from_pandas_edgelist(cat_data_2,
                                'parent_category',
                                'category_name',
                                create_using=nx.DiGraph())

# Index the per-category data by category name; reindexing to graph.nodes() is left out for now.
node_weight = cat_data_2.set_index('category_name')

In [41]:
# Suffix each category name with its row position so that repeated names
# (for example "Philosophy") become unique labels.
unique_cat = [f"{cat}_{i}" for i, cat in enumerate(cat_data_2['category_name'])]
unique_cat

['Travel & Tourism_0',
 'Teen_1',
 'Sports & Outdoors_2',
 'Science Fiction & Fantasy_3',
 'Science & Engineering_4',
 'Romance_5',
 'Religion & Spirituality_6',
 'Relationships, Parenting & Personal Development_7',
 'Politics & Social Sciences_8',
 'Australia & Oceania_9',
 'Biographies_10',
 'Winter Sports_11',
 'Science Fiction_12',
 'Science_13',
 'Contemporary_14',
 'Other Religions, Practices & Sacred Texts_15',
 'Relationships_16',
 'Personal Development_17',
 'Philosophy_18',
 'Archaeology_19',
 'Time Travel_20',
 'Biological Sciences_21',
 'Physics_22',
 'History & Philosophy_23',
 'Conflict Resolution_24',
 'Society_25',
 'Greek & Roman_26',
 'Consciousness & Thought_27',
 'Biotechnology_28',
 'Philosophy_29',
 'Meditation_30',
 'History_31',
 'Philosophy_32',
 'Animals_33',
 'Evolution & Genetics_34',
 'Biology_35',
 'Ecology_36',
 'Anatomy & Physiology_37',
 'Botany & Plants_38',
 'Modern_39',
 'Ethics & Morality_40',
 'Genetics_41',
 'Evolution_42',
 'Metaphysics_43',
 'Ea

In [37]:
cat_data_2

Unnamed: 0,category_name,category_numb_title,parent_category,self_url,title_list_url,unique_cat_name
0,Travel & Tourism,9433,Audible,https://www.audible.com/cat/Travel-Tourism-Aud...,https://www.audible.com/search?node=1858109501...,"Travel & Tourism_[0, 1, 2, 3, 4, 5, 6, 7, 8, 9..."
1,Teen,20201,Audible,https://www.audible.com/cat/Teen-Audiobooks/18...,https://www.audible.com/search?node=1858071501...,"Teen_[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12..."
2,Sports & Outdoors,24533,Audible,https://www.audible.com/cat/Sports-Outdoors-Au...,https://www.audible.com/search?node=1858064801...,"Sports & Outdoors_[0, 1, 2, 3, 4, 5, 6, 7, 8, ..."
3,Science Fiction & Fantasy,57300,Audible,https://www.audible.com/cat/Science-Fiction-Fa...,https://www.audible.com/search?node=1858060601...,"Science Fiction & Fantasy_[0, 1, 2, 3, 4, 5, 6..."
4,Science & Engineering,19058,Audible,https://www.audible.com/cat/Science-Engineerin...,https://www.audible.com/search?node=1858054001...,"Science & Engineering_[0, 1, 2, 3, 4, 5, 6, 7,..."
...,...,...,...,...,...,...
1045,Law & Crime,40,Social Studies,https://www.audible.com/cat/Social-Studies/Law...,https://www.audible.com/search?node=1857225401...,"Law & Crime_[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,..."
1046,Government,105,Social Studies,https://www.audible.com/cat/Social-Studies/Gov...,https://www.audible.com/search?node=1857225301...,"Government_[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..."
1047,Economics,71,Social Studies,https://www.audible.com/cat/Social-Studies/Eco...,https://www.audible.com/search?node=1857225201...,"Economics_[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1..."
1048,Careers,72,Social Studies,https://www.audible.com/cat/Social-Studies/Car...,https://www.audible.com/search?node=1857225101...,"Careers_[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,..."
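
The `unique_cat_name` values in the output above end in the full list of row indices (`_[0, 1, 2, ...`), which suggests that the whole index was converted to a single string and appended to every row. The sketch below reproduces that effect on a made-up `demo` frame and contrasts it with an element-wise conversion; the variable names are assumptions introduced here.

```python
import pandas as pd

# Hypothetical frame with a repeated category name.
demo = pd.DataFrame({"category_name": ["Philosophy", "Physics", "Philosophy"]})

# str() of the whole index list yields one long string, repeated on every row.
whole_index = demo["category_name"] + "_" + str(demo.index.tolist())

# Converting the index element by element gives one suffix per row.
per_row = demo["category_name"] + "_" + demo.index.to_series().astype(str)

print(whole_index.tolist())
print(per_row.tolist())
```

The element-wise version gives the same labels as suffixing each name with its position from `enumerate`, as long as the frame has a default integer index.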


# Some EDA about the categories

# Title information

- clean duplicates, compressing categories into a list
- most prolific author, narrator
- author in most categories?
- narrator by category
- length vs price vs star/count
- podcasts?
- space for audio dramas?
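
As a starting point for the "most prolific author, narrator" item, the counts could be computed along these lines. This is a sketch on made-up rows that assumes the titles have already been deduplicated by `title_url`. The `narrator` column is split on commas because the scraped data stores multiple narrators in one comma-separated string.

```python
import pandas as pd

# Hypothetical deduplicated titles (one row per title_url).
demo = pd.DataFrame({
    "title_url": ["url_a", "url_b", "url_c"],
    "author": ["Author A", "Author A", "Author B"],
    "narrator": ["Narrator X,Narrator Y", "Narrator X", "Narrator Z"],
})

# Number of titles per author, most prolific first.
author_counts = demo["author"].value_counts()

# Split multi-narrator strings into one row per narrator before counting.
narrator_counts = (
    demo["narrator"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(author_counts.head())
print(narrator_counts.head())
```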