# Project Wattpad

## Getting Wattpad Data
This Jupyter notebook uses the Wattpad API to download data from Wattpad. The main content we analyze is Wattpad stories. The stories have categories and languages associated with them, and the category and language data are also available via the API.

Our main focus here is to get all the raw data from the API, clean it up, and save it to CSV files that we will use for analysis later.

In [1]:
# Import Dependencies
import requests
import json
import numpy as np
import pandas as pd
import csv
import yaml
import os
from pandas import json_normalize  # moved out of pandas.io.json in pandas 1.0

### Set up for API calls
We first need to set up the details required to make the API calls and define placeholders for our data files and other variables.

In [2]:
# Load the config.yaml file to get the API keys and other parameters
with open("./config.yaml") as y:
    cfg = yaml.safe_load(y)  # safe_load parses plain data without constructing arbitrary objects

header = {
    "Authorization": "Basic {}".format(cfg["keys"]["API_KEY"]),
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
}

# Files to save our data
# Use the same "Data" directory name as the later cells (matters on case-sensitive filesystems)
categories_file_name = "Data/categories.csv"
languages_file_name = "Data/languages.csv"
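
The `cfg["keys"]["API_KEY"]` lookup above implies a `config.yaml` with a top-level `keys` mapping. A minimal sketch of that layout, and of building the header from it, might look like this. The `build_header` helper, the shortened user agent, and the placeholder key value are illustrative assumptions, not part of the original notebook.

```python
# Assumed layout of config.yaml, inferred from cfg["keys"]["API_KEY"]:
#
#   keys:
#     API_KEY: "<your key here>"

def build_header(cfg, user_agent="Mozilla/5.0"):
    """Build the request headers from a parsed config dict (hypothetical helper)."""
    try:
        api_key = cfg["keys"]["API_KEY"]
    except (KeyError, TypeError) as exc:
        # Fail with a clear message when the key is missing or the file is empty
        raise ValueError("config.yaml must define keys.API_KEY") from exc
    return {
        "Authorization": "Basic {}".format(api_key),
        "Content-Type": "application/json",
        "User-Agent": user_agent,
    }

# Example with an in-memory dict standing in for the parsed YAML file
example_header = build_header({"keys": {"API_KEY": "not-a-real-key"}})
```

Validating the config up front gives a readable error at setup time rather than a `KeyError` in the middle of the first API call.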

### Getting the Categories from the Wattpad API
The Wattpad API provides a call that returns the list of categories used to classify stories. 
We will get this list and store it as a CSV file for use later.

In [3]:
################################################################################
# This function makes a Wattpad api call to get a list of all the categories
# It writes all the categories data into a csv file to be used later
################################################################################
def get_categories():
    category_url = "https://www.wattpad.com/v4/categories"
    
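    # Make the api call
    req = requests.get(category_url, headers=header)
    req.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
    category_response = req.json()
    
    # Write to the csv file (newline='' prevents blank rows on Windows)
    with open(categories_file_name, 'w', newline='') as csvfile:
        write = csv.writer(csvfile, delimiter=',')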
        
        # Write the header row
        write.writerow(["ID","NAME"])
        
        # Loop through the data and write
        for category in category_response["categories"]:
            write.writerow([category["id"],category["name"]])
            

In [4]:
# Call the function to get all the categories from Wattpad and then view the data 
# from the csv file that is created to make sure we have usable data
get_categories()

# Open the csv file and read its contents to see if we got all the data right
with open(categories_file_name) as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    for row in reader:
        print(row)

['ID', 'NAME']
['4', 'Romance']
['5', 'Science Fiction']
['3', 'Fantasy']
['7', 'Humor']
['12', 'Paranormal']
['8', 'Mystery / Thriller']
['9', 'Horror']
['11', 'Adventure']
['23', 'Historical Fiction']
['1', 'Teen Fiction']
['6', 'Fanfiction']
['2', 'Poetry']
['17', 'Short Story']
['21', 'General Fiction']
['24', 'ChickLit']
['14', 'Action']
['18', 'Vampire']
['22', 'Werewolf']
['13', 'Spiritual']
['16', 'Non-Fiction']
['10', 'Classics']
['19', 'Random']
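
The fetch-and-write step above can be hard to test because it needs a live API call. One possible variation, sketched here under the assumption that the response has the `{"categories": [{"id": ..., "name": ...}]}` shape used by `get_categories`, separates the CSV writing from the request. The `write_categories` name and the sample payload are hypothetical.

```python
import csv
import io

def write_categories(category_response, csvfile):
    """Write ID/NAME rows from a parsed categories response to an open file."""
    writer = csv.writer(csvfile, delimiter=",")
    writer.writerow(["ID", "NAME"])
    rows_written = 0
    for category in category_response.get("categories", []):
        writer.writerow([category["id"], category["name"]])
        rows_written += 1
    return rows_written

# Made-up payload mirroring the shape the notebook reads from the API
sample = {"categories": [{"id": 4, "name": "Romance"},
                         {"id": 5, "name": "Science Fiction"}]}
buffer = io.StringIO()  # in-memory stand-in for the csv file
count = write_categories(sample, buffer)
```

With the writer isolated like this, the same function could serve any file-like object, and the network call can be checked separately.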


### Getting Languages from Wattpad
The Wattpad API provides a call that returns the list of languages used across all stories. We will get this list and store it as a CSV file for use later.

In [5]:
################################################################################
# This function makes a Wattpad api call to get a list of all the languages
# It writes all the language code data into a csv file to be used later
################################################################################
def get_languages():
    language_url = "https://www.wattpad.com/v4/languages"
    
    # Make the api call
    req = requests.get(language_url, headers=header)
    req.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
    language_response = req.json()
    
    # Write to the csv file (newline='' prevents blank rows on Windows)
    with open(languages_file_name, 'w', newline='') as csvfile:
        write = csv.writer(csvfile, delimiter=',')
        
        # Write the header row
        write.writerow(["LANGUAGE_CODE"])
        
        # Loop through the data and write 
        for language in language_response["languages"]:
            write.writerow([language["code"]])

In [6]:
# Make the call to get the languages and then view the data from the csv file that
# is created to make sure we have usable data
get_languages()

# Open the csv file and read its contents to see if we got all the data right
with open(languages_file_name) as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    for row in reader:
        print(row)

['LANGUAGE_CODE']
['en']
['fr']
['it']
['de']
['es']
['pt-PT']
['pt-BR']
['ru']
['zh-TW']
['ja']
['ko']
['zh-CN']
['nl']
['pl']
['ro']
['ar']
['he']
['tl']
['vi']
['id']
['hi']
['ms']
['tr']
['cs']
['ml']
['sv']
['nn']
['hu']
['da']
['el']
['fa']
['th']
['is']
['fi']
['et']
['lv']
['lt']
['ca']
['bs']
['sr']
['hr']
['sl']
['bg']
['sk']
['be']
['uk']
['bn']
['ur']
['ta']
['sw']
['af']
['gu']
['or']
['pa']
['as']
['mr']


### Getting Stories from Wattpad
The main content we work with is Wattpad stories. The API returns lists of user-written stories, 100 at a time, which we page through and use for our analysis.

In [7]:
def get_stories(x):
    # Fetch one page of up to 100 of the newest stories, starting at offset x
    stories_url = "https://www.wattpad.com/v4/stories"
    params = {"limit": 100, "offset": int(x), "filter": "new"}

    req = requests.get(stories_url, params=params, headers=header)
    json_response = req.json()
    return json_response

In [8]:
# number of stories to request (fetched 100 per call)
N = 10000
json_list = []
for x in np.arange(0, N, 100):
    json_list.append(get_stories(x))

In [9]:
pages_of_stories = [x['stories'] for x in json_list]
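
The paging loop above always issues `N / 100` requests, even if the API runs out of stories earlier. A minimal sketch of a loop that stops on the first empty page is shown below. It assumes each response is a dict with a `"stories"` list, as the cell above does. The `fetch_all_pages` helper and the `fake_fetch` stand-in are hypothetical, used so the example runs without network access.

```python
def fetch_all_pages(fetch_page, total, page_size=100):
    """Collect stories page by page, stopping early on an empty page.

    fetch_page is any callable that takes an offset and returns a dict
    with a "stories" list, such as get_stories in this notebook.
    """
    stories = []
    for offset in range(0, total, page_size):
        page = fetch_page(offset).get("stories", [])
        if not page:
            break  # no more results, so skip the remaining requests
        stories.extend(page)
    return stories

# Stand-in for the real API call: pretends only 250 stories exist
def fake_fetch(offset):
    ids = range(offset, min(offset + 100, 250))
    return {"stories": [{"id": i} for i in ids]}

all_stories = fetch_all_pages(fake_fetch, total=10000)
```

This version also returns one flat list directly, so the separate flattening step becomes unnecessary when it is used.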

In [10]:
################################################################################
# Creates a single array of all stories downloaded, parses each json element
# into its own column, then changes the values of the categories column to be
# a single integer instead of an array.
################################################################################

flat_list=[x for y in pages_of_stories for x in y]

stories_df = json_normalize(flat_list)

for i in range(len(stories_df['categories'])):
    stories_df.loc[i, 'categories'] = stories_df['categories'][i][0]
    
stories_df.to_csv(os.path.join('Data', 'stories_3_13_2018_new.csv'))
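
To illustrate what the flattening step does, here is a small sketch on made-up records. The field values are invented, and only the nested shape (a `categories` list and a nested `user` object) follows what the cell above relies on. It uses `pd.json_normalize`, which is available in pandas 1.0 and later.

```python
import pandas as pd

# Two made-up records with the nested shape the notebook works with
records = [
    {"id": 1, "categories": [6, 0], "user": {"name": "alice"}},
    {"id": 2, "categories": [17, 0], "user": {"name": "bob"}},
]

# Nested keys are flattened into dotted column names such as "user.name"
demo_df = pd.json_normalize(records)

# Keep only the first category id from each list
demo_df["categories"] = demo_df["categories"].str[0]
```

The `.str[0]` accessor indexes into each list element-wise, which is a vectorized alternative to the row-by-row loop.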

In [11]:
stories_df.categories.unique()

array([6, 19, 4, 1, 17, 3, 16, 2, 8, 7, 24, 11, 9, 12, 23, 14, 5, 22, 18,
       13, 21], dtype=object)

In [12]:
thr_df = pd.read_csv(os.path.join('Data', 'stories_3_09_2018.csv'))
fri_df = pd.read_csv(os.path.join('Data', 'stories_3_10_2018.csv'))
sat_df = pd.read_csv(os.path.join('Data', 'stories.csv'))
mon_new_df = pd.read_csv(os.path.join('Data', 'stories_3_12_2018_new.csv'))
mon_df = pd.read_csv(os.path.join('Data', 'stories_3_12_2018.csv'))

In [13]:
frames = [thr_df, fri_df, sat_df, mon_df, mon_new_df]
stories_concat = pd.concat(frames)
stories_df = stories_concat.drop_duplicates(subset='id')
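
When the same story appears in several daily downloads, `drop_duplicates` keeps the first occurrence by default, which here is the oldest snapshot. The sketch below, on invented data, shows the alternative of keeping the most recent row instead. Whether that is preferable depends on the analysis, so treat `keep="last"` as an assumption rather than the notebook's method.

```python
import pandas as pd

# Made-up snapshots from two days; story 2 appears in both
day1 = pd.DataFrame({"id": [1, 2], "readCount": [10, 20]})
day2 = pd.DataFrame({"id": [2, 3], "readCount": [25, 30]})

# ignore_index avoids repeated index labels in the combined frame
combined = pd.concat([day1, day2], ignore_index=True)

# keep="last" retains the most recently concatenated row for each id
latest = combined.drop_duplicates(subset="id", keep="last").reset_index(drop=True)
```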

In [14]:
stories_df

In [15]:
stories_df.to_csv(os.path.join('Data', 'stories_merge.csv'))

## Data Munging

### Removing unwanted rows

In [16]:
# read in data
df = pd.read_csv(os.path.join('Data', 'stories_3_12_2018_new.csv'))

In [17]:
# remove rows without tags
df = df.loc[df['tags'] != '[]']

In [18]:
# remove rows whose description is missing (NaN) or at most 4 characters long
df = df.loc[df['description'].astype(str).str.len() > 4]

In [19]:
df.head()

Unnamed: 0.1,Unnamed: 0,categories,commentCount,completed,copyright,cover,cover_timestamp,createDate,deleted,description,...,parts,rating,readCount,tags,title,url,user.avatar,user.fullname,user.name,voteCount
0,0,6,12,False,1,https://a.wattpad.com/cover/129173744-256-k151...,2017-12-02T02:06:21Z,2017-11-18T19:52:12Z,False,[ coming soon ]\n\nIn which a regular 19 year ...,...,"[{'id': 496588471, 'title': 'Mr Country Club |...",0,63,"['collinskey', 'devankey', 'keybros', 'keypers...",Mr Country Club | devan key [ coming soon ],https://www.wattpad.com/story/129173744-mr-cou...,https://a.wattpad.com/useravatar/voidtube.128....,chelsea,voidtube,5
1,1,17,0,False,1,https://a.wattpad.com/cover/141691720-256-k874...,2018-03-13T00:15:43Z,2018-03-13T00:15:42Z,False,swallow!,...,"[{'id': 546886656, 'title': 'impudence!', 'url...",0,0,"['aesthetic', 'blood', 'death', 'detective', '...",the offing,https://www.wattpad.com/story/141691720-the-of...,https://a.wattpad.com/useravatar/sunfully.128....,♡,sunfully,0
2,2,17,0,False,1,https://a.wattpad.com/cover/141267339-256-k611...,2018-03-10T00:08:20Z,2018-03-09T00:26:38Z,False,I don't see many of these and I'm not sure if ...,...,"[{'id': 545128060, 'title': 'Oneshot 1: Rulers...",1,32,"['lesbian', 'oneshot', 'readerxcrush']",Female Crush X Female Reader,https://www.wattpad.com/story/141267339-female...,https://a.wattpad.com/useravatar/Dem0g0rgon.12...,,Dem0g0rgon,4
3,3,7,0,False,1,https://a.wattpad.com/cover/141129404-256-k400...,2018-03-07T13:31:33Z,2018-03-07T13:31:32Z,False,Read some of these funny memes! * I did not ma...,...,"[{'id': 544550562, 'title': 'Part 1', 'url': '...",0,8,"['fun', 'funny', 'humor', 'memes']",F̺͆u̺͆n̺͆n̺͆y̺͆ M̺͆e̺͆m̺͆e̺͆s̺͆,https://www.wattpad.com/story/141129404-f%CD%8...,https://a.wattpad.com/useravatar/_Daily_News_....,,_Daily_News_,0
4,4,19,200,False,1,https://a.wattpad.com/cover/136339277-256-k619...,2018-03-10T02:23:45Z,2018-01-24T06:04:34Z,False,unpopular opinions.\n\n❝ you have married an i...,...,"[{'id': 526165586, 'title': 'burn.', 'url': 'h...",1,629,"['gender', 'movies', 'opinions', 'pennywise', ...",BURN | unpopular opinions,https://www.wattpad.com/story/136339277-burn-u...,https://a.wattpad.com/useravatar/dacrethotgome...,benny,dacrethotgomery,95
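
The two row filters above can also be written as named boolean masks, which makes each condition explicit. This is a sketch on made-up rows. The column names match the notebook, and the threshold of 4 characters is taken from the filter above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "tags": ["['fun', 'memes']", "[]", "['poetry']", "['x']"],
    "description": ["Read some of these funny memes!", "A story", np.nan, "hi"],
})

# Tags are stored as text in the CSV, so an empty list is the string "[]"
has_tags = demo["tags"] != "[]"

# A usable description is present and longer than 4 characters
has_description = demo["description"].notna() & (demo["description"].str.len() > 4)

filtered = demo[has_tags & has_description]
```

Only the first row survives: the second has no tags, the third has no description, and the fourth has a description that is too short.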


### Removing unwanted columns
Some columns are not needed for the analysis, so we remove them and create a stripped-down version of the data frame to work with. 
The following columns are removed:
* Unnamed: 0
* copyright
* cover
* cover_timestamp
* firstPartId
* firstPublishedPart.createDate
* firstPublishedPart.id
* lastPublishedPart.createDate
* lastPublishedPart.id
* parts
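
A list like the one above can also be dropped by name rather than by selecting every column to keep. The sketch below uses a made-up frame and a shortened version of the list. It illustrates `DataFrame.drop` and is not the approach the notebook itself takes.

```python
import pandas as pd

# Made-up frame with a mix of wanted and unwanted columns
demo = pd.DataFrame({"id": [1], "title": ["t"], "cover": ["c.jpg"], "parts": ["[]"]})

unwanted = ["Unnamed: 0", "copyright", "cover", "cover_timestamp", "parts"]

# errors="ignore" skips names that are absent instead of raising KeyError
slim = demo.drop(columns=unwanted, errors="ignore")
```

Dropping by name keeps any new columns the API might add later, while selecting by name (as the notebook does) guarantees a fixed set of columns.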

In [20]:
df.columns

Index(['Unnamed: 0', 'categories', 'commentCount', 'completed', 'copyright',
       'cover', 'cover_timestamp', 'createDate', 'deleted', 'description',
       'firstPartId', 'firstPublishedPart.createDate', 'firstPublishedPart.id',
       'id', 'language.id', 'language.name', 'lastPublishedPart.createDate',
       'lastPublishedPart.id', 'length', 'mature', 'modifyDate', 'numParts',
       'parts', 'rating', 'readCount', 'tags', 'title', 'url', 'user.avatar',
       'user.fullname', 'user.name', 'voteCount'],
      dtype='object')

In [21]:
# remove the unwanted columns to minimize clutter
stripped_df = df[["categories","commentCount","completed",
                  "createDate","deleted","description","id", 
                  "language.id","language.name","length","mature",
                  "modifyDate","numParts","rating","readCount",
                  "tags","title","url","user.avatar","user.fullname",
                  "user.name","voteCount"]]

### Merging with the categories data
The stories list identifies each category by id, so we need to add the corresponding category names to this data frame. We will also rename and reorder the columns so that the names are consistent and related fields sit together.

In [22]:
categories_df = pd.read_csv('Data/categories.csv')
categories_df

Unnamed: 0,ID,NAME
0,4,Romance
1,5,Science Fiction
2,3,Fantasy
3,7,Humor
4,12,Paranormal
5,8,Mystery / Thriller
6,9,Horror
7,11,Adventure
8,23,Historical Fiction
9,1,Teen Fiction


In [23]:
# Add a new column to put in the corresponding category names
stripped_df["categoryName"] = stripped_df["categories"]
stripped_df["categoryName"].unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


array([ 6, 17,  7, 19,  8,  2, 23,  4, 14,  1, 16, 12,  3,  5, 21, 11, 18,
       13, 22,  9, 24])
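
The warning shown above appears because `stripped_df` was created by selecting columns from `df`, so pandas cannot tell whether the new column is being written to a copy. One common way to avoid it, sketched here on made-up data, is to take an explicit `.copy()` when creating the subset.

```python
import pandas as pd

stories = pd.DataFrame({"categories": [6, 17],
                        "title": ["a", "b"],
                        "cover": ["x", "y"]})

# .copy() makes the subset an independent DataFrame, so the assignment
# below clearly targets the subset and not the original frame
subset = stories[["categories", "title"]].copy()
subset["categoryName"] = subset["categories"]
```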

In [24]:
# Function to get the category name from the categories_df given the category id
def get_category(x):
    name = categories_df.loc[categories_df["ID"]==int(x),"NAME"]
    return name.iloc[0]

# Replace the id with name in the categoryName column
stripped_df["categoryName"] = stripped_df["categoryName"].apply(get_category)
stripped_df["categoryName"].unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


array(['Fanfiction', 'Short Story', 'Humor', 'Random',
       'Mystery / Thriller', 'Poetry', 'Historical Fiction', 'Romance',
       'Action', 'Teen Fiction', 'Non-Fiction', 'Paranormal', 'Fantasy',
       'Science Fiction', 'General Fiction', 'Adventure', 'Vampire',
       'Spiritual', 'Werewolf', 'Horror', 'ChickLit'], dtype=object)
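
The lookup above filters `categories_df` once per story. As an alternative sketch, the id-to-name table can be built once and applied with `Series.map`. The frames below are made up, and the id `99` is an invented value included to show what happens to an unknown category.

```python
import pandas as pd

categories = pd.DataFrame({"ID": [4, 6, 17],
                           "NAME": ["Romance", "Fanfiction", "Short Story"]})
stories = pd.DataFrame({"id": [101, 102, 103], "categories": [6, 17, 99]})

# Build an id -> name lookup once, then map it over the whole column
id_to_name = categories.set_index("ID")["NAME"]
stories["categoryName"] = stories["categories"].map(id_to_name)
```

With `map`, an id that is missing from the lookup becomes `NaN`, whereas the `get_category` function above would raise an `IndexError` on `name.iloc[0]`.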

In [25]:
# rename some columns to keep the column name format consistent
stripped_df = stripped_df.rename(columns={"categories":"categoryId",
                            "language.id": "languageId",
                            "language.name": "languageName",
                            "user.avatar": "userAvatar",
                            "user.fullname":"userFullname",
                            "user.name":"userName"})

# reorder the columns
stripped_df = stripped_df[["id","title","description","url","createDate",
                          "modifyDate","completed","numParts","deleted","length",
                          "categoryId","categoryName","languageId","languageName","mature","rating",
                          "tags","commentCount","readCount","voteCount","userAvatar","userFullname","userName"]]
stripped_df.head()

Unnamed: 0,id,title,description,url,createDate,modifyDate,completed,numParts,deleted,length,...,languageName,mature,rating,tags,commentCount,readCount,voteCount,userAvatar,userFullname,userName
0,129173744,Mr Country Club | devan key [ coming soon ],[ coming soon ]\n\nIn which a regular 19 year ...,https://www.wattpad.com/story/129173744-mr-cou...,2017-11-18T19:52:12Z,2018-03-13T23:54:10Z,False,3,False,7590,...,English,False,0,"['collinskey', 'devankey', 'keybros', 'keypers...",12,63,5,https://a.wattpad.com/useravatar/voidtube.128....,chelsea,voidtube
1,141691720,the offing,swallow!,https://www.wattpad.com/story/141691720-the-of...,2018-03-13T00:15:42Z,2018-03-13T23:54:11Z,False,3,False,1676,...,English,False,0,"['aesthetic', 'blood', 'death', 'detective', '...",0,0,0,https://a.wattpad.com/useravatar/sunfully.128....,♡,sunfully
2,141267339,Female Crush X Female Reader,I don't see many of these and I'm not sure if ...,https://www.wattpad.com/story/141267339-female...,2018-03-09T00:26:38Z,2018-03-13T23:54:08Z,False,3,False,24132,...,English,False,1,"['lesbian', 'oneshot', 'readerxcrush']",0,32,4,https://a.wattpad.com/useravatar/Dem0g0rgon.12...,,Dem0g0rgon
3,141129404,F̺͆u̺͆n̺͆n̺͆y̺͆ M̺͆e̺͆m̺͆e̺͆s̺͆,Read some of these funny memes! * I did not ma...,https://www.wattpad.com/story/141129404-f%CD%8...,2018-03-07T13:31:32Z,2018-03-13T23:54:08Z,False,4,False,203,...,English,False,0,"['fun', 'funny', 'humor', 'memes']",0,8,0,https://a.wattpad.com/useravatar/_Daily_News_....,,_Daily_News_
4,136339277,BURN | unpopular opinions,unpopular opinions.\n\n❝ you have married an i...,https://www.wattpad.com/story/136339277-burn-u...,2018-01-24T06:04:34Z,2018-03-13T23:54:07Z,False,15,False,17044,...,English,False,1,"['gender', 'movies', 'opinions', 'pennywise', ...",200,629,95,https://a.wattpad.com/useravatar/dacrethotgome...,benny,dacrethotgomery


In [26]:
# save the data frame to a csv file to be used for visualizations later
stripped_df.to_csv(os.path.join('Data', 'stories_for_viz.csv'), index=False)
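
Because the `tags` column is written to CSV as the text of a Python list (which is why the earlier filter compares against the string `'[]'`), anything that reads this file back gets strings rather than lists. A minimal sketch of converting a cell back, using the standard library's `ast.literal_eval`, is shown below. The `parse_tags` helper is hypothetical.

```python
import ast

def parse_tags(cell):
    """Turn a stringified list such as "['fun', 'memes']" back into a list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []  # unparseable cells are treated as having no tags
    return value if isinstance(value, list) else []

tags = parse_tags("['fun', 'funny', 'humor', 'memes']")
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that originated from user-supplied story data.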

In [27]:
len(stripped_df)

1428