## Using BeautifulSoup to Scrape Workouts
We will scrape the [CrossFit Weightlifting Blog](http://crossfitweightlifting.com/category/wod/weightlifter/), following this [BeautifulSoup tutorial](https://www.dataquest.io/blog/web-scraping-tutorial-python/).

In [8]:
import requests
from bs4 import BeautifulSoup

url= 'http://crossfitweightlifting.com/category/wod/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'
}
page = requests.get(url, headers = headers)

soup = BeautifulSoup(page.content, 'html.parser')

#### 403 Forbidden Error
According to [this Stack Overflow question](https://stackoverflow.com/questions/38489386/python-requests-403-forbidden), the site appears to reject `GET` requests that lack a `User-Agent` header, which is why the request above sets one.
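
A quick way to catch a rejected request is to check the response status before parsing. The helper below is a minimal sketch, not part of the original notebook: `fetch_html` and its `get` parameter are hypothetical names, and `get` exists only so the function can be tried without network access.

```python
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'
}

def fetch_html(url, get=requests.get):
    """Request url with a browser-like User-Agent and return the body bytes.

    raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    so a 403 fails loudly instead of being parsed as if it were the page.
    """
    response = get(url, headers=HEADERS)
    response.raise_for_status()
    return response.content
```

With this in place, `BeautifulSoup(fetch_html(url), 'html.parser')` would only ever receive a successful response.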

### View the Nicely Formatted HTML
Use the `prettify` method to print the `BeautifulSoup` object as indented HTML.

In [12]:
print(soup.prettify())

<!DOCTYPE doctype html>
<!--[if lt IE 7]><html lang="en-US" prefix="og: http://ogp.me/ns#" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if (IE 7)&!(IEMobile)]><html lang="en-US" prefix="og: http://ogp.me/ns#" class="no-js lt-ie9 lt-ie8"><![endif]-->
<!--[if (IE 8)&!(IEMobile)]><html lang="en-US" prefix="og: http://ogp.me/ns#" class="no-js lt-ie9"><![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-US" prefix="og: http://ogp.me/ns#">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <!-- Google Chrome Frame for IE -->
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <title>
   WOD Archives - CrossFit Weightlifting
  </title>
  <!-- mobile meta (hooray!) -->
  <meta content="True" name="HandheldFriendly"/>
  <meta content="320" name="MobileOptimized"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport">
   <link href="http://crossfitweightlifting.com/xmlrpc.php" rel="pingback"/>
   <!-- wordpress head functions -->
   <!--

### Get the Title and Link from Each `<h2 class="entry-title">`

In [26]:
titles = soup.find_all('h2', class_ = 'entry-title')
title_dict = {}
for entry in titles:
    title_dict[entry.find('a').text] = entry.find('a').get('href')
title_dict

{'Friday, October 19, 2018- Week Three- Day Five': 'http://crossfitweightlifting.com/2018/10/friday-october-19-2018-week-three-day-five/',
 'Friday, October 26, 2018- Week Four- Day Five': 'http://crossfitweightlifting.com/2018/10/friday-october-26-2018-week-four-day-five/',
 'Monday, October 22, 2018- Week Four- Day One- Slight Deload': 'http://crossfitweightlifting.com/2018/10/monday-october-22-2018-week-four-day-one-slight-deload/',
 'Saturday, October 20, 2018- Week Three- Day Six': 'http://crossfitweightlifting.com/2018/10/saturday-october-20-2018-week-three-day-six/',
 'Sunday, October 21, 2018- Week Three- Day Seven': 'http://crossfitweightlifting.com/2018/10/sunday-october-21-2018-week-three-day-seven/',
 'Thursday, October 18, 2018- Week Three- Day Four': 'http://crossfitweightlifting.com/2018/10/thursday-october-18-2018-week-three-day-four/',
 'Thursday, October 24, 2018- Week Four- Day Four': 'http://crossfitweightlifting.com/2018/10/thursday-october-24-2018-week-four-day-fo
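
The same extraction can be tried offline on a small HTML string. This is a sketch under assumptions: `SAMPLE_HTML` is made up to mimic the structure described above, and `get_title_links` is a hypothetical helper. It returns a list of pairs rather than a dictionary, which keeps posts that happen to share a title instead of letting the later one overwrite the earlier one.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the heading structure; not copied from the site.
SAMPLE_HTML = """
<h2 class="entry-title"><a href="http://example.com/day-one/">Day One</a></h2>
<h2 class="entry-title"><a href="http://example.com/day-two/">Day Two</a></h2>
<h2 class="other"><a href="http://example.com/skip/">Not a post</a></h2>
"""

def get_title_links(soup):
    """Return a list of (title, url) tuples for every h2.entry-title heading."""
    pairs = []
    for heading in soup.find_all('h2', class_='entry-title'):
        link = heading.find('a')
        if link is None:  # heading without a link: nothing to record
            continue
        pairs.append((link.get_text(strip=True), link.get('href')))
    return pairs

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(get_title_links(sample_soup))
# [('Day One', 'http://example.com/day-one/'), ('Day Two', 'http://example.com/day-two/')]
```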

### Then Find the Link to the Next Page in `<li class='bpn-next-link'>`

In [28]:
next_page_url = soup.find('li', class_ = 'bpn-next-link').find('a').get('href')
next_page_url

'http://crossfitweightlifting.com/category/wod/page/2/'
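
`soup.find` returns `None` when nothing matches, so chaining `.find('a')` onto it raises an `AttributeError` on any page without a next-page link. A defensive variant might look like the sketch below; `get_next_page_url` is a hypothetical name and both HTML snippets are made up for illustration.

```python
from bs4 import BeautifulSoup

def get_next_page_url(soup):
    """Return the href of the next-page link, or None when there is none."""
    item = soup.find('li', class_='bpn-next-link')
    if item is None:
        return None
    link = item.find('a')
    return link.get('href') if link is not None else None

# Made-up snippets; the real page markup may differ.
with_next = BeautifulSoup(
    '<ul><li class="bpn-next-link"><a href="http://example.com/page/2/">Next</a></li></ul>',
    'html.parser')
last_page = BeautifulSoup(
    '<ul><li class="bpn-prev-link"><a href="#">Prev</a></li></ul>',
    'html.parser')

print(get_next_page_url(with_next))   # http://example.com/page/2/
print(get_next_page_url(last_page))   # None
```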

### Tie it all together!
Write functions that loop through up to `pageLim` pages and collect all the workout titles and URLs in a dictionary.

In [61]:
def GetTitleAndLinks( in_soup ):
    titles = in_soup.find_all('h2', class_ = 'entry-title')
    title_dict = {}
    for entry in titles:
        title_dict[entry.find('a').text] = entry.find('a').get('href')
    return title_dict

def GetNextPageUrl( in_soup ):
    # find() returns None when the page has no next-page link (e.g. the last page)
    next_item = in_soup.find('li', class_ = 'bpn-next-link')
    if next_item is None:
        return None
    return next_item.find('a').get('href')

def GetWorkouts( pageLim = 10):
    url= 'http://crossfitweightlifting.com/category/wod/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'
    }
    
    outDict = {}
    
    print(f'Begin parsing {url}...\n')
    for i in range(pageLim):
        # request each page only when it is about to be parsed
        page = requests.get(url, headers = headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        outDict.update(GetTitleAndLinks( soup ))
        print( f'page {i+1} of {pageLim} done.\n')

        # stop early when there is no further page to follow
        url = GetNextPageUrl( soup )
        if url is None:
            break

    return outDict

def GetWorkoutText(in_soup):
    return in_soup.find('section', class_ = 'entry-content').text

def PrintWorkouts( lim = 100, pageLim = 10):
    # defined locally so the function does not rely on a notebook-level global
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'
    }
    woddict = GetWorkouts( pageLim)
    print(f'Found {len(woddict)} workouts, returning text for {min(len(woddict), lim)}\n')

    # keep only the first `lim` workouts
    lim_keys = list(woddict.keys())[:min(lim, len(woddict))]
    lim_dict = { k: woddict[k] for k in lim_keys}
    out_json = []
    for wod in lim_dict:
        # fetch each workout page and pull the text of its content section
        wod_url = lim_dict[wod]
        wod_page = requests.get(wod_url, headers = headers )
        wod_txt = GetWorkoutText( BeautifulSoup(wod_page.content, 'html.parser'))
        print(f'{wod}\n---\n{wod_txt}\n\n')
        wod_dict = {'title': wod, 'url': lim_dict[wod], 'wod': wod_txt, 'source': 'crossfit weightlifting'}
        out_json.append(wod_dict)

    return out_json

data = PrintWorkouts(lim = 2, pageLim = 1)

Begin parsing http://crossfitweightlifting.com/category/wod/...

page 1 of 1 done.

Found 10 workouts, returning text for 2

Friday, October 26, 2018- Week Four- Day Five
---

Zotts Press: 3×3
Snatch Balance: 3,2,1,1,1
Behind the Neck Jerk: 5×1
Snatch Deadlifts: 3×3
Russian Twists: 3×30. Slow and Controlled. 
Hip Extensions: 3×10



Wednesday, October 24, 2018- Week Four- Day Three
---

Hang Snatch + Power Snatch: (1+1) x 3
Clean Pull + Clean (1+1) x 4. 
Back Squats: 3,2,1,3,2,1
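
The pagination logic can be exercised without touching the site by passing the fetch step in as a parameter. The sketch below is an illustration under assumptions: `collect_posts`, its `fetch` argument, and the two-page `FAKE_SITE` are all hypothetical. It requests a page only when it is about to parse it and stops as soon as no next link is found.

```python
from bs4 import BeautifulSoup

def collect_posts(start_url, fetch, page_limit=10):
    """Follow next-page links from start_url and return {title: url}.

    fetch is any callable that takes a URL and returns HTML (str or bytes).
    """
    posts = {}
    url = start_url
    pages_read = 0
    while url is not None and pages_read < page_limit:
        soup = BeautifulSoup(fetch(url), 'html.parser')
        for heading in soup.find_all('h2', class_='entry-title'):
            link = heading.find('a')
            if link is not None:
                posts[link.get_text(strip=True)] = link.get('href')
        # Look for the next page; url becomes None when there is no such link.
        next_item = soup.find('li', class_='bpn-next-link')
        next_link = next_item.find('a') if next_item is not None else None
        url = next_link.get('href') if next_link is not None else None
        pages_read += 1
    return posts

# Two made-up pages standing in for the real site.
FAKE_SITE = {
    'http://example.com/wod/': (
        '<h2 class="entry-title"><a href="http://example.com/a/">Workout A</a></h2>'
        '<ul><li class="bpn-next-link">'
        '<a href="http://example.com/wod/page/2/">Next</a></li></ul>'
    ),
    'http://example.com/wod/page/2/': (
        '<h2 class="entry-title"><a href="http://example.com/b/">Workout B</a></h2>'
    ),
}

posts = collect_posts('http://example.com/wod/', FAKE_SITE.__getitem__, page_limit=5)
print(posts)
# {'Workout A': 'http://example.com/a/', 'Workout B': 'http://example.com/b/'}
```

For the real site, `fetch` could be a small function that calls `requests.get` with the `User-Agent` header and returns `page.content`.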





## Inserting the Workouts into an mLab-Hosted MongoDB Database
[This MongoDB page](https://www.mongodb.com/json-and-bson) talks about NoSQL and JSON.  
[This guide](https://marcobonzanini.com/2015/09/07/getting-started-with-mongodb-and-python/) talks about Python and MongoDB.  
[This MongoDB blog post](https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb) is another introduction to Python and MongoDB.

In [68]:
from pymongo import MongoClient
dbuser = 'jho'
dbpwd = 'databaseM1ab'
strconnect = f'mongodb://{dbuser}:{dbpwd}@ds052837.mlab.com:52837/'
client = MongoClient(strconnect)

db = client['xccelerate_ftds3']
coll = db['crossfit_wods']

In [63]:
data = PrintWorkouts(10, 2)

Begin parsing http://crossfitweightlifting.com/category/wod/...

page 1 of 2 done.

page 2 of 2 done.

Found 20 workouts, returning text for 10

Friday, October 26, 2018- Week Four- Day Five
---

Zotts Press: 3×3
Snatch Balance: 3,2,1,1,1
Behind the Neck Jerk: 5×1
Snatch Deadlifts: 3×3
Russian Twists: 3×30. Slow and Controlled. 
Hip Extensions: 3×10



Wednesday, October 24, 2018- Week Four- Day Three
---

Hang Snatch + Power Snatch: (1+1) x 3
Clean Pull + Clean (1+1) x 4. 
Back Squats: 3,2,1,3,2,1



Thursday, October 24, 2018- Week Four- Day Four
---

 
REST



Tuesday, October 23, 2018- Week Four- Day Two
---

 
Push Press: 3×5. Light and Perfect!
Push Jerk: 3,2,1,1,1
Power Clean + Jerk (1+1)x2x 3 working sets. 
Clean Pulls: 3×3
3 Sets:
Weighted Planks x 30 seconds
KB Side Bends x 20 each side



Monday, October 22, 2018- Week Four- Day One- Slight Deload
---

Muscle Snatch: 3,3,2,2,1,1
SN PP + OHS (2+2) x 3
Snatch: 3×3, 3×2, 3×1. Work consistency and keep the weight where you can f

In [69]:
coll.insert_many(data)
# Collection.count() is deprecated; count_documents({}) counts every document
print(coll.count_documents({}))

OperationFailure: Authentication failed.
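
One possible cause, offered as an assumption rather than a diagnosis: the connection string ends at `/` without naming a database. When a MongoDB URI names no database and sets no `authSource`, the driver authenticates against `admin`, whereas a hosted database user is often defined on the application database itself. The sketch below builds a URI that includes the database name. `build_mongo_uri` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import quote_plus

def build_mongo_uri(user, password, host, port, db_name):
    """Build a MongoDB URI that names the database to authenticate against.

    quote_plus percent-escapes characters such as '@' or ':' in the
    credentials, which would otherwise break URI parsing.
    """
    return f'mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db_name}'

# Placeholder credentials for illustration only.
uri = build_mongo_uri('some_user', 'p@ss:word', 'ds052837.mlab.com', 52837, 'xccelerate_ftds3')
print(uri)
# mongodb://some_user:p%40ss%3Aword@ds052837.mlab.com:52837/xccelerate_ftds3
```

The resulting string can be passed to `MongoClient` in place of `strconnect`. If authentication still fails, the username and password themselves are the next thing to check.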

### Export to JSON

In [75]:
jdata = PrintWorkouts(lim = 10, pageLim = 2)


In [77]:
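import json
import os

fname = 'data/crossfit_wod.json'
# create the output folder if needed; open() does not create missing directories
os.makedirs(os.path.dirname(fname), exist_ok=True)
with open(fname, 'w') as f:
    # jdata is already a list of plain dicts, so it can be dumped directly
    json.dump(jdata, f)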
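
To confirm that the export worked, the file can be read back and compared with the in-memory records. This is a minimal sketch with made-up data: `save_and_reload` is a hypothetical helper and `sample` stands in for the scraped workouts. Passing `ensure_ascii=False` keeps characters such as `×` readable in the file instead of writing them as `\u00d7` escapes.

```python
import json
import os
import tempfile

def save_and_reload(records, path):
    """Write records to path as JSON, then read the file back and return it."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    with open(path, encoding='utf-8') as f:
        return json.load(f)

# Made-up record with the same keys the scraper produces.
sample = [{
    'title': 'Day One',
    'url': 'http://example.com/day-one/',
    'wod': 'Snatch Deadlifts: 3×3',
    'source': 'crossfit weightlifting',
}]

with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_and_reload(sample, os.path.join(tmp, 'crossfit_wod.json'))

print(reloaded == sample)  # True
```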