### Get links from Basketball Intelligence
- Parse HTML to pull links
- Save lists of links by date, title, author, etc. (pick metadata)

#### Part 1
- Ray's site goes back to 2013. We will use the Python standard library `calendar` module to enumerate every day. It looks like we can then build one URL per date, iterate over them, and pull the respective `a` tags from each page.

#### Part 2
- Download each article. We will try the Python `newspaper` module first and see how it does; otherwise we will look at `Scrapy`.

In [13]:
import requests
from bs4 import BeautifulSoup, Tag, NavigableString
import re
import pandas as pd
import time
import random
import os

In [23]:
try:
    os.mkdir('downloads')
except FileExistsError:
    # only swallow the "already there" case; other OS errors should surface
    print("Directory Already Exists")

Directory Already Exists


In [4]:
BASE_URL = "http://basketballintelligence.net/2018/12/"

In [5]:
# years and months to scrape, newest first
years = list(range(2019, 2012, -1))
months = list(range(12, 0, -1))
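
The two lists above feed the month archive URLs requested later in the notebook. As a minimal sketch, the same pairs can be turned into URLs up front with `itertools.product`. The names `ARCHIVE_ROOT` and `month_archive_urls` are hypothetical helpers introduced here for illustration; the URL shape is the one used for `BASE_URL` in this notebook.

```python
from itertools import product

# Same host as BASE_URL in this notebook; the constant name is hypothetical.
ARCHIVE_ROOT = "http://basketballintelligence.net"


def month_archive_urls(years, months):
    """Return one archive URL per (year, month) pair, in the order given."""
    return [
        f"{ARCHIVE_ROOT}/{year}/{month:02d}/"
        for year, month in product(years, months)
    ]


urls = month_archive_urls(range(2019, 2012, -1), range(12, 0, -1))
print(len(urls))  # 7 years x 12 months = 84
print(urls[0])    # http://basketballintelligence.net/2019/12/
```

Building the list first makes it easy to inspect or slice the URLs before any request is sent.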

Some Regex Tutorials
- https://docs.python.org/3.3/howto/regex.html
- https://regexone.com/

In [7]:
def extract_link(text):
    # regex: match "http://" or "https://" followed by everything up to the
    # next whitespace character
    extractor = re.compile(r'https?://\S+')
    match = extractor.search(text)
    # return None when the text holds no link
    return match.group(0) if match else None

In [8]:
extract_link("bobs homes http://basketballintelligence.net/2018/12/")
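
A single block of text can hold more than one link, and `re.search` only reports the first match. The sketch below shows one possible way to collect all of them with `re.findall`. `URL_PATTERN` and `extract_all_links` are hypothetical names, and the pattern (a scheme followed by non-whitespace characters) is an assumption about what counts as a link, not the notebook's own rule.

```python
import re

# Assumed pattern: "http://" or "https://" followed by a run of non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_all_links(text):
    """Return every http(s) URL found in text, in order of appearance."""
    return URL_PATTERN.findall(text)


sample = "bobs homes http://basketballintelligence.net/2018/12/ and https://regexone.com/"
print(extract_all_links(sample))
```

Because the pattern stops at whitespace, trailing punctuation glued to a URL would be kept; that trade-off may need revisiting for real page text.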

### If you need to scrape by actual Day instead of Year-Month

You can use this:

```python

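import calendar


def flatten_list(lists):
    # recursively flatten nested lists into one flat list;
    # a local accumulator keeps repeated calls independent of each other
    flat = []
    for elem in lists:
        if isinstance(elem, list):
            flat.extend(flatten_list(elem))
        else:
            flat.append(elem)
    return flat


all_days = []
for year in range(2013, 2020):
    print(year)
    # yeardatescalendar nests month rows -> months -> weeks -> datetime.date
    for month_row in calendar.Calendar().yeardatescalendar(year):
        all_days.append(month_row)

# drop the first 105 and the last 3 entries of the flattened list
days_to_parse = flatten_list(all_days)[105:-3]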

```
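
One thing to watch with `yeardatescalendar`: each week is padded with days from the neighbouring months, so the flattened list contains repeated dates. A possible alternative, sketched below, walks the range one day at a time with `datetime.timedelta`. The start date of 1 January 2013 and the `day_archive_url` URL shape are assumptions made for illustration; the notebook only confirms the year/month archive form.

```python
from datetime import date, timedelta


def iter_days(start, end):
    """Yield every date from start to end, inclusive, exactly once."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def day_archive_url(day):
    """Hypothetical per-day archive URL (assumed /year/month/day/ layout)."""
    return f"http://basketballintelligence.net/{day.year}/{day.month:02d}/{day.day:02d}/"


days = list(iter_days(date(2013, 1, 1), date(2019, 12, 31)))
print(len(days))                 # 7 * 365 + 1 leap day (2016) = 2556
print(day_archive_url(days[0]))
```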

In [10]:
def scrap_daily_links(soup):
    daily_links = []
    # each post body lives in a <div class="entry-content">
    entries = soup.find_all('div', {"class": "entry-content"})
    print(len(entries))

    # pages with zero or one entry-content div are treated as having no posts
    if len(entries) > 1:
        for group in entries:
            for j in group.find_all('div'):
                if 'http' in j.text:
                    link = extract_link(j.text)
                    # skip divs where the regex found nothing
                    if link is not None:
                        daily_links.append(link)
    else:
        print('<<<<<<<<<< ---------- >>>>>>>>>>>')
        print('No Div Entry-Content in Blog')

    # de-duplicate before returning
    return list(set(daily_links))
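
Part 1 talks about pulling the `a` tags, while the function above reads links out of the text of each `div`. As a rough sketch of the `a` tag route, the example below collects `href` values from anchors nested inside `entry-content` blocks. It deliberately swaps BeautifulSoup for the standard library `html.parser` so it runs with no installs. The class name `EntryLinkParser` and the sample HTML (including the `example.com` URLs) are made up for illustration.

```python
from html.parser import HTMLParser


class EntryLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <div class="entry-content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth while inside an entry-content block
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # start counting at an entry-content div, keep counting nested divs
            if self.depth or "entry-content" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


# Made-up sample markup, not copied from the real site.
sample_html = """
<div class="entry-content">
  <div><a href="https://example.com/a">A</a></div>
  <div><a href="https://example.com/b">B</a></div>
</div>
<div class="sidebar"><a href="https://example.com/ignored">X</a></div>
"""

parser = EntryLinkParser()
parser.feed(sample_html)
print(parser.links)
```

Reading `href` attributes avoids the regex entirely, but it only helps if the site wraps its links in anchors, which would need checking against the real pages.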

In [12]:
for year in years:
    for month in months:
        page = 1
        print(f'{year} - {month}')
        
        BASE_URL = f"http://basketballintelligence.net/{year}/{str(month).zfill(2)}/"
        
        r = requests.get(BASE_URL)
        soup = BeautifulSoup(r.text, 'lxml')
        daily_links = scrap_daily_links(soup)
        
        # save DF, also add some metadata to enrich data set
        df = pd.DataFrame(daily_links, columns=['daily_links'])
        df.insert(0, 'page', page)
        df.insert(0, 'month', month)
        df.insert(0, 'year', year)
        filename = f"downloads/articles{str(month).zfill(2)}{year}.csv"
        df.to_csv(filename,index=False)
        
        print(f"DF Rows: {len(df)}")
        try:
            while soup.find('div',{"class":"nav-previous"}).find('a')['href'] is not None:
                print("----- Older Posts - Paginating")
                page +=1
                URL = soup.find('div',{"class":"nav-previous"}).find('a')['href']
                r = requests.get(URL)
                soup = BeautifulSoup(r.text, 'lxml')
                daily_links = scrap_daily_links(soup)

                df = pd.DataFrame(daily_links, columns=['daily_links'])
                df.insert(0, 'page', page)
                df.insert(0, 'month', month)
                df.insert(0, 'year', year)
                filename = f"downloads/articles{str(month).zfill(2)}{year}_{page}.csv"
                df.to_csv(filename,index=False)
                
                print(f"DF Rows: {len(df)}")
                
                print(f"Page: {page}")
                wait = random.uniform(1.2, 2)
                print(wait)
                time.sleep(wait)

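        except (AttributeError, TypeError, KeyError):
            # the "nav-previous" div, its <a> tag, or its href is missing,
            # so there is no older page left to follow
            print("**** No More Pages")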
        wait = random.uniform(1.4, 2.5)
        print(wait)
        time.sleep(wait)


2019 - 12
5
DF Rows: 115
----- Older Posts - Paginating
5
Page: 2
1.3055428737733152
----- Older Posts - Paginating
5
Page: 3
1.2116270149668293
----- Older Posts - Paginating
6
Page: 4
1.4942431406337786
----- Older Posts - Paginating
5
Page: 5
1.5169882806222779
----- Older Posts - Paginating
8
Page: 6
1.8546663020288428
----- Older Posts - Paginating
1
<<<<<<<<<< ---------- >>>>>>>>>>>
No Div Entry-Content in Blog
Page: 7
1.5487424981384328
**** No More Pages
2.258583252681861
2019 - 11
5
DF Rows: 181
----- Older Posts - Paginating
5
Page: 2
1.9971343069031628
----- Older Posts - Paginating
5
Page: 3
1.3337195230043517
----- Older Posts - Paginating
5
Page: 4
1.3163910006286395
----- Older Posts - Paginating
5
Page: 5
1.8234947619598716
----- Older Posts - Paginating
5
Page: 6
1.6274813142080689
**** No More Pages
2.287227836392683
2019 - 10
5
DF Rows: 163
----- Older Posts - Paginating
5
Page: 2
1.9295983863492363
----- Older Posts - Paginating
5
Page: 3
1.415523572172862
----- Older Posts - Paginating
...
5
Page: 6
1.604192637821623
**** No More Pages
1.7394971310483218
2018 - 5
5
DF Rows: 102
----- Older Posts - Paginating
5
Page: 2
1.9636896462459772
----- Older Posts - Paginating
5
Page: 3
1.2547025741363305
----- Older Posts - Paginating
5
Page: 4
1.5517859660808484
----- Older Posts - Paginating
5
Page: 5
1.6608205784931938
----- Older Posts - Paginating
5
Page: 6
1.409296317669884
----- Older Posts - Paginating
3
Page: 7
1.9853314979193613
**** No More Pages
1.9323798558289167
2018 - 4
5
DF Rows: 126
----- Older Posts - Paginating
5
Page: 2
1.7410290671147135
----- Older Posts - Paginating
5
Page: 3
1.7418690419327176
----- Older Posts - Paginating
5
Page: 4
1.6475666669327897
----- Older Posts - Paginating
5
Page: 5
1.5367300833865751
----- Older Posts - Paginating
5
Page: 6
1.9649155655478392
----- Older Posts - Paginating
3
Page: 7
1.45450487449188
**** No More Pages
2.34784061932424
2018 - 3
5
DF Rows: 77
----- Older Posts - Paginating
5
Page: 2
1.3526215682077012
----- Older Posts - Paginating
...
----- Older Posts - Paginating
2
Page: 7
1.9420276416141427
**** No More Pages
2.061409702524792
2016 - 11
5
DF Rows: 13
----- Older Posts - Paginating
5
Page: 2
1.9509794480944875
----- Older Posts - Paginating
5
Page: 3
1.948291554070491
----- Older Posts - Paginating
5
Page: 4
1.3359001056308988
----- Older Posts - Paginating
5
Page: 5
1.6590304526065542
----- Older Posts - Paginating
5
Page: 6
1.3808493850414285
----- Older Posts - Paginating
1
<<<<<<<<<< ---------- >>>>>>>>>>>
No Div Entry-Content in Blog
Page: 7
1.2471166811051706
**** No More Pages
1.5709662616514837
2016 - 10
5
DF Rows: 23
----- Older Posts - Paginating
5
Page: 2
1.7213473596647013
----- Older Posts - Paginating
5
Page: 3
1.3574755368480083
----- Older Posts - Paginating
5
Page: 4
1.2411608163568406
----- Older Posts - Paginating
5
Page: 5
1.7642243822371482
----- Older Posts - Paginating
5
Page: 6
1.350867152462852
----- Older Posts - Paginating
1
<<<<<<<<<< ---------- >>>>>>>>>>>
No Div Entry-Content in Blog


2015 - 4
5
DF Rows: 1
----- Older Posts - Paginating
5
Page: 2
1.289358741028024
----- Older Posts - Paginating
5
Page: 3
1.2315743035835973
----- Older Posts - Paginating
5
Page: 4
1.5104960033052264
----- Older Posts - Paginating
5
Page: 5
1.4643932315514778
----- Older Posts - Paginating
5
Page: 6
1.7976002225844159
----- Older Posts - Paginating
1
<<<<<<<<<< ---------- >>>>>>>>>>>
No Div Entry-Content in Blog
Page: 7
1.7089223928304496
**** No More Pages
2.3950829104634535
2015 - 3
5
DF Rows: 7
----- Older Posts - Paginating
5
Page: 2
1.297504890986529
----- Older Posts - Paginating
5
Page: 3
1.3187541071743367
----- Older Posts - Paginating
5
Page: 4
1.917430783274011
----- Older Posts - Paginating
5
Page: 5
1.5579244576029756
----- Older Posts - Paginating
5
Page: 6
1.2577643103955225
----- Older Posts - Paginating
1
<<<<<<<<<< ---------- >>>>>>>>>>>
No Div Entry-Content in Blog
Page: 7
1.9137280677450215
**** No More Pages
2.2591456547933433
2015 - 2
5
DF Rows: 7
----- Older Posts - Paginating
...
5
Page: 4
1.8232831457937198
**** No More Pages
2.431670902190796
2013 - 7
5
DF Rows: 0
----- Older Posts - Paginating
5
Page: 2
1.347319665159316
----- Older Posts - Paginating
5
Page: 3
1.7405768629081981
----- Older Posts - Paginating
5
Page: 4
1.7490354727837176
----- Older Posts - Paginating
5
Page: 5
1.3955343257407713
----- Older Posts - Paginating
5
Page: 6
1.423741103839377
----- Older Posts - Paginating
5
Page: 7
1.5049415922158036
----- Older Posts - Paginating
5
Page: 8
1.5085431186098577
**** No More Pages
1.4697963848487197
2013 - 6
5
DF Rows: 0
----- Older Posts - Paginating
5
Page: 2
1.4168973471851627
----- Older Posts - Paginating
5
Page: 3
1.4110156532790488
----- Older Posts - Paginating
5
Page: 4
1.5939853415464127
----- Older Posts - Paginating
5
Page: 5
1.9607384368460488
----- Older Posts - Paginating
5
Page: 6
1.2028087001558447
----- Older Posts - Paginating
5
Page: 7
1.620062484356332
----- Older Posts - Paginating
4
Page: 8
1.4959428274337367
**** No More Pages
...
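
The scraping cell repeats the fetch, parse, and save steps once for the first page and again inside the pagination loop. One way to express "keep following the Older Posts link until there is none" is a small generator, sketched below. `iter_pages`, `fetch`, `find_next`, and the `FAKE_SITE` data (including its `/page/2/` path) are all hypothetical stand-ins so the example runs without network access.

```python
def iter_pages(start_url, fetch, find_next):
    """Yield (page_number, page) pairs, following 'next' links until none is left."""
    url, page_number, seen = start_url, 1, set()
    # the seen set guards against a page that links back to an earlier one
    while url and url not in seen:
        seen.add(url)
        page = fetch(url)
        yield page_number, page
        url = find_next(page)
        page_number += 1


# Stand-in data: each fake page lists its links and the URL of the older page.
FAKE_SITE = {
    "/2019/12/": {"links": ["a", "b"], "older": "/2019/12/page/2/"},
    "/2019/12/page/2/": {"links": ["c"], "older": None},
}

pages = list(iter_pages("/2019/12/", FAKE_SITE.get, lambda page: page["older"]))
print([(number, page["links"]) for number, page in pages])
```

In the notebook, `fetch` would wrap `requests.get` plus `BeautifulSoup`, and `find_next` would read the `href` of the `nav-previous` link, returning `None` when it is absent.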

In [11]:
pd.read_csv("z_downloads/articles012019.csv")

Unnamed: 0,year,month,page,daily_links
0,2019,1,1,https://www.washingtonpost.com/sports/2019/01/...
1,2019,1,1,http://www.startribune.com/well-prepared-wolve...
2,2019,1,1,https://www.doubleclutch.uk/how-a-bland-offens...
3,2019,1,1,https://hashtagbasketball.com/golden-state-war...
4,2019,1,1,https://theathletic.com/787992/2019/01/28/deng...
...,...,...,...,...
121,2019,1,1,https://www.thedreamshake.com/2019/1/28/182010...
122,2019,1,1,https://fansided.com/2019/01/28/big-man-defens...
123,2019,1,1,https://theathletic.com/785614/2019/01/25/morn...
124,2019,1,1,https://sportsday.dallasnews.com/dallas-maveri...
