In [35]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
import re
from tqdm.auto import tqdm
import urllib.request
from selectolax.parser import HTMLParser

In [8]:
df = pd.read_pickle('kickstarter.pkl')

In [9]:
sample = df.loc[:200]

Testing which script runs faster for the full dataset.
There are a few parts to change/improve

- How we get the HTML 
- How we parse the HTML 
- How we clean the HTML

# Cleaning the HTML
At the end of it, we know we want to have only the p tags and if possible h1+ tags and ul tags. We can do this by reading each and appending to a list and joining on that list to create a string for the description. 

According to https://waymoot.org/home/python_string/, a list comprehension is the most effective method to joining a list of strings. Thus we'll be implementing this into our pull function for all tests. 

# Acquiring the HTML
Before we process and parse HTML, we have to get the HTML first. The two ways I'm focusing on are utilizing the requests library and the urllib library, r and u for function respectively. 

In [10]:
## Requests & BeautifulSoup
def r_bs_pull(url):
    try:
        result = requests.get(url)
        soup = BeautifulSoup(result.content)
        
        textfield = '\n'.join([item.text for item in soup.find('div', 'full-description').find_all('p')])

        return(textfield)
    except AttributeError:
        return 'Missing Description'
    
## URLLIB & BeautifulSoup
def u_bs_pull(url):
    try:
        result = urllib.request.urlopen(url)
        soup = BeautifulSoup(result.read())
    
        textfield = '\n'.join([item.text for item in soup.find('div', 'full-description').find_all('p')])
    
        return(textfield)
    except AttributeError:
        return 'Missing Description'

In [11]:
%%time
sample['r_bs_pull'] = sample['web_url'].apply(r_bs_pull)
#7 Min 54s for 201 files

CPU times: user 5min 25s, sys: 2.68 s, total: 5min 27s
Wall time: 7min 54s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [39]:
%%time
sample['u_bs_pull'] = sample['web_url'].apply(u_bs_pull)
#3min 17s for 201 files
#3m 32s for 201 files

CPU times: user 29.1 s, sys: 712 ms, total: 29.8 s
Wall time: 3min 32s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


From this experiment, it looks like using urllib in this case is much faster. 

# Parsing HTML
We've optimized the HTML content so now we need to optimize parsing through the HTML. This comparison will be between BeautifulSoup and selectolax.

In [36]:
# def u_bs_pull(url):
#     try:
#         result = urllib.request.urlopen(url)
#         soup = BeautifulSoup(result.read())
    
#         textfield = '\n'.join([item.text for item in soup.find('div', 'full-description').find_all('p')])
    
#         return(textfield)
#     except AttributeError:
#         return 'Missing Description'
    
def u_sl_pull(url):
    try:
        result = urllib.request.urlopen(url)
        parse = HTMLParser(result.read())
        
        textfield = '\n'.join([node.text() for node in parse.css_first("div.full-description".css('p'))])
        
        return(textfield)
    except AttributeError:
        return 'Missing Description'
    
## Since we've seen u_bs run, we'll just comapre it with u_sl

In [40]:
%%time
sample['u_sl_pull'] = sample['web_url'].apply(u_sl_pull)
# 2min 59s
# 2min 59s

CPU times: user 4.23 s, sys: 922 ms, total: 5.15 s
Wall time: 2min 59s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


The improvement is very noticable, saving .2 seconds per iteration on average.

I want to apply the p tags and also the ul tag so we'll work with BS and see how fast this'll work. 

In [41]:
def u_bs_pull_plus(url):
    try:
        result = urllib.request.urlopen(url)
        soup = BeautifulSoup(result.read())
    
        textfield = '\n'.join([item.text for item in soup.find('div', 'full-description').find_all(['p','ul'])])
    
        return(textfield)
    except AttributeError:
        return 'Missing Description'

In [42]:
%%time
sample['u_bs_pull_plus'] = sample['web_url'].apply(u_bs_pull_plus)
# 3min 15s.

CPU times: user 29.1 s, sys: 677 ms, total: 29.8 s
Wall time: 3min 15s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Since adding the 'ul' tag isn't much different in BS, I'll be using urllibrequest and the beautifulsoup library together. 

In [44]:
from multiprocessing import  Pool
from functools import partial
import numpy as np

def parallelize(data, func, num_of_processes=8):
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, data_subset):
    return data_subset.apply(func, axis=1)

def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)

In [None]:
parallelize_on_rows(df, u_bs_pull_plus) 

Process ForkPoolWorker-3:
Process ForkPoolWorker-2:
Process ForkPoolWorker-8:
Process ForkPoolWorker-6:
Process ForkPoolWorker-4:
Process ForkPoolWorker-1:
Process ForkPoolWorker-7:
Process ForkPoolWorker-5:
Traceback (most recent call last):
  File "/Users/Matt/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/Matt/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Matt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/Users/Matt/anaconda3/lib/python3.7/multiprocessing/queues.py", line 354, in get
    return _ForkingPickler.loads(res)
  File "/Users/Matt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 152, in _new_Index
    def _new_Index(cls, d):
KeyboardInterrupt
Process ForkPoolWorker-9:
Process ForkPoolWorker-10:

KeyboardInterrupt

Traceback (most recent call last):
Traceback 