# Scrape Forum Data
This notebook scrapes the Youbemom special-needs forum, along with the additional subforums listed under Variables, and inserts the results into a SQLite database.

## Data Sources
- Youbemom forum: https://www.youbemom.com/forum/special-needs

## Changes
- 2020-08-11: Started project
- 2020-08-18: Updated forum crawl
- 2020-08-22: Updated database structure
- 2020-10-22: Added additional subforums
- 2020-10-24: Added deleted post column
- 2020-10-24: Added dne and family_id columns to the threads table

## Database Structure
- threads
  - id: automatically assigned
  - family_id: automatically assigned
  - url: URL of the top post
  - subforum: subforum name
  - dne: 1 if the thread does not exist (this scraping method should not produce any 1s)
- posts
  - id: automatically assigned
  - family_id: id of the thread (threads.id) the post belongs to
  - message_id: unique id of the message, taken from the HTML
  - parent_id: id of the post this post replies to; 0 for the top post
  - date_recorded: date the post was fetched
  - date_created: date the post was created
  - title: title of the post
  - body: body of the post
  - subforum: subforum name
  - deleted: whether the post has been deleted
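The structure above could be expressed as SQLite tables roughly as follows. This is only a sketch: the column names come from the list, but the column types, defaults, and the `REFERENCES` clause are assumptions, and `build_sketch_db` is a hypothetical helper rather than the `set_up_db` function imported from `youbemom`.

```python
import sqlite3

# Hypothetical DDL: column names follow the list above; types, defaults,
# and constraints are assumptions, not the schema used by set_up_db.
SCHEMA = """
CREATE TABLE IF NOT EXISTS threads (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    family_id INTEGER,
    url       TEXT NOT NULL,
    subforum  TEXT NOT NULL,
    dne       INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS posts (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    family_id     INTEGER NOT NULL REFERENCES threads(id),
    message_id    TEXT,
    parent_id     INTEGER NOT NULL DEFAULT 0,
    date_recorded TEXT,
    date_created  TEXT,
    title         TEXT,
    body          TEXT,
    subforum      TEXT,
    deleted       INTEGER NOT NULL DEFAULT 0
);
"""


def build_sketch_db(path=":memory:"):
    """Open a SQLite database and create both tables if they are missing."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


# Insert one placeholder thread and its top post (parent_id = 0).
demo_conn = build_sketch_db()
cur = demo_conn.execute(
    "INSERT INTO threads (url, subforum) VALUES (?, ?)",
    ("https://example.invalid/thread/1", "special-needs"),
)
demo_thread_id = cur.lastrowid
demo_conn.execute(
    "INSERT INTO posts (family_id, message_id, parent_id, title, subforum) "
    "VALUES (?, ?, 0, ?, ?)",
    (demo_thread_id, "m1", "Example title", "special-needs"),
)
demo_conn.commit()
```

Using an in-memory database keeps the sketch from touching `youbemomTables.db`; the URL and message id are placeholders, not scraped values.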

## TODO
- 

## Imports

In [2]:
import sqlite3
from pathlib import Path
from youbemom import loop_threads, loop_posts, create_connection, set_up_db
from datetime import datetime

## File Locations

In [3]:
p = Path.cwd()
path_parent = p.parents[0]

In [4]:
path_db = path_parent / "database" / "youbemomTables.db"
path_db = str(path_db)

## Variables
The toddler forum is much larger than the other forums, so it is scraped separately (see below).

In [5]:
forum_list = ["special-needs", "newborn", "preschool", "elementary", "tween-teen"]

## Scrape the Thread URLs
Connect to the database and create the tables.
NOTE: The thread crawl walks each listed subforum back to the `earliest` date (2014-01-01).

In [6]:
conn = create_connection(path_db)
set_up_db(conn)

In [6]:
earliest = datetime(2014, 1, 1, 0, 0, 0)
for subforum in forum_list:
    loop_threads(conn, path_db, subforum, earliest)
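How `loop_threads` applies the `earliest` cutoff is not shown in this notebook. One way such a cutoff could work is sketched below, assuming listing pages arrive newest first and every entry carries a creation timestamp. `fake_listing_pages` and `collect_until` are hypothetical names used for illustration and are not part of the `youbemom` module.

```python
from datetime import datetime


def fake_listing_pages():
    """Hypothetical stand-in for paged subforum listings, newest first."""
    yield [("/thread/3", datetime(2014, 3, 1)), ("/thread/2", datetime(2014, 1, 15))]
    yield [("/thread/1", datetime(2013, 12, 31)), ("/thread/0", datetime(2013, 6, 1))]


def collect_until(pages, earliest):
    """Collect thread URLs until the first thread older than `earliest`."""
    urls = []
    for page in pages:
        for url, created in page:
            if created < earliest:
                # Assumes newest-first ordering, so everything after this is older.
                return urls
            urls.append(url)
    return urls


demo_earliest = datetime(2014, 1, 1, 0, 0, 0)
demo_urls = collect_until(fake_listing_pages(), demo_earliest)
```

Stopping at the first too-old entry avoids requesting the rest of the archive, but it is only correct if the listing really is sorted by date.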

## Scrape Thread Posts
For each subforum in `forum_list`, pull the thread URLs stored in the threads table for that subforum in batches of 100, and scrape every post in each thread.
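The batching idea can be sketched with plain `sqlite3`. This is a sketch under assumptions: `iter_url_batches` is a hypothetical helper, not the `loop_posts` implementation, and it pages by `id` over a minimal in-memory `threads` table filled with placeholder URLs.

```python
import sqlite3


def iter_url_batches(db, subforum, batch_size=100):
    """Yield lists of thread URLs for one subforum, batch_size rows at a time."""
    last_id = 0
    while True:
        rows = db.execute(
            "SELECT id, url FROM threads "
            "WHERE subforum = ? AND id > ? ORDER BY id LIMIT ?",
            (subforum, last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        # Remember the last id seen so the next query resumes after it.
        last_id = rows[-1][0]
        yield [url for _, url in rows]


# Minimal in-memory table with 250 placeholder rows for one subforum.
batch_conn = sqlite3.connect(":memory:")
batch_conn.execute(
    "CREATE TABLE threads (id INTEGER PRIMARY KEY, url TEXT, subforum TEXT)"
)
batch_conn.executemany(
    "INSERT INTO threads (url, subforum) VALUES (?, ?)",
    [(f"/thread/{i}", "newborn") for i in range(250)],
)

batch_sizes = []
for n, batch in enumerate(iter_url_batches(batch_conn, "newborn"), start=1):
    print(f"batch: {n}")  # mirrors the progress output of the cell below
    batch_sizes.append(len(batch))
```

Paging on `id > last_id` rather than `OFFSET` keeps each query cheap as the table grows, which matters when a subforum holds many thousands of threads.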

In [7]:
for subforum in forum_list:
    loop_posts(conn, path_db, subforum)

batch: 1
batch: 2
...
batch: 100
batch: 1