This notebook demonstrates how to scrape search results from [The Sun newspaper](https://www.thesun.co.uk) and download the text of the articles found.

Import the two required libraries.

In [6]:
import pandas as pd
from requests_html import HTMLSession

Start a `requests_html` session.

In [7]:
session = HTMLSession()

This is the function we wrote during class to scrape a single search results page.

In [8]:
def scrape_sun_search(url):
    '''Scrapes a Sun search page and returns a df of article URLs and titles'''
    r = session.get(url)  # download the search results page
    parsed_html = r.html
    # each search result sits in an element with this class
    article_info = parsed_html.find('.text-anchor-wrap')
    article_list = []
    for article in article_info:
        # absolute_links is a set; take one link from it
        article_url = list(article.absolute_links)[0]
        text = article.text  # headline text

        article_dict = {'url': article_url,
                        'text': text}

        article_list.append(article_dict)
    article_df = pd.DataFrame(article_list)
    return article_df
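
One fragile spot in the function above is `list(article.absolute_links)[0]`: it raises an `IndexError` if a matched element contains no links, and because `absolute_links` is a set, the link it picks is arbitrary when there are several. Below is a minimal sketch of a guard. `first_link` is a hypothetical helper name, not part of `requests_html`, and plain Python sets stand in for `absolute_links`.

```python
def first_link(links):
    """Return one link from a set of links, or None if the set is empty.

    Hypothetical helper, not part of requests_html. Taking the
    alphabetically smallest link makes the choice deterministic.
    """
    return min(links) if links else None


# Plain sets of strings stand in for article.absolute_links here.
print(first_link({'https://www.thesun.co.uk/b/', 'https://www.thesun.co.uk/a/'}))
print(first_link(set()))
```

Inside the loop, rows where the helper returns `None` could then be skipped instead of crashing the whole scrape.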

This is the loop we created in class. It scrapes search results pages 1 through 49.

In [9]:
sun_dfs = []
url_template = 'https://www.thesun.co.uk/page/{}/?s=brexit'

for page in range(1, 50):  # pages 1 to 49
    url = url_template.format(page)  # fill in the page number
    df = scrape_sun_search(url)
    sun_dfs.append(df)

The last step is to merge all of the collected dataframes and store the results as a CSV file.

In [10]:
df = pd.concat(sun_dfs)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 490 entries, 0 to 9
Data columns (total 2 columns):
text    490 non-null object
url     490 non-null object
dtypes: object(2)
memory usage: 11.5+ KB


In [12]:
df.to_csv('sun_brexit.csv')

Below is a more flexible version. Note that this function calls our earlier function!

In [24]:
def search_sun(searchterm, pages):
    '''Searches The Sun.
       Returns a df of titles and URLs'''
    
    results_dfs = []
    
    for page in range(1, pages+1):
        url = 'https://www.thesun.co.uk/page/'
        url = url + str(page) + '/?s=' + searchterm
        df = scrape_sun_search(url)
        results_dfs.append(df)
    result_df = pd.concat(results_dfs)
    return result_df
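
`search_sun` pastes the search term straight into the URL. That works for a single word like `snakes`, but a term containing spaces or characters such as `&` would produce a malformed query string. Here is a small sketch of how the URL could be built with the standard library instead. `build_search_url` is a hypothetical helper name, and the URL pattern is the one used above. I have not checked how the site's search treats encoded multi-word terms.

```python
from urllib.parse import quote_plus


def build_search_url(searchterm, page):
    """Build the search URL for one results page (hypothetical helper)."""
    # quote_plus turns spaces into '+' and escapes characters like '&'
    return 'https://www.thesun.co.uk/page/{}/?s={}'.format(page, quote_plus(searchterm))


print(build_search_url('snakes', 2))
print(build_search_url('theresa may', 1))
```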


In [25]:
snake_df = search_sun('snakes', 2)

In [26]:
snake_df.head()

Unnamed: 0,text,url
0,BONKOSAURUS SEX\nDinosaurs 'would be with us t...,https://www.thesun.co.uk/tech/8998777/dinosaur...
1,SWEET LITTLE LIES\nThe 'natural' foods that ar...,https://www.thesun.co.uk/fabulous/food/9005828...
2,MR RIGHT?\nJess Wright cosies up new mystery m...,https://www.thesun.co.uk/tvandshowbiz/9001998/...
3,SUFFOCATED ON STAGE\nCircus performer 'strangl...,https://www.thesun.co.uk/news/8999226/distress...
4,LOOK ON THE TIGHT SIDE\nWe test 'spray-on' PVC...,https://www.thesun.co.uk/fabulous/8992765/we-t...


You might also want the text of newspaper articles. For this, I use the `newspaper3k` library, which is imported as `newspaper`.

In [27]:
from newspaper import Article

def get_article_info(url):
    article = Article(url)
    article.download()
    article.parse()
    
    article_details = {'title'       : article.title,
                       'text'        : article.text,
                       'url'         : article.url,
                       'date'        : article.publish_date}

    return article_details

In [17]:
test_url = 'https://www.thesun.co.uk/news/8931618/theresa-may-leadership-challenge-sir-graham-brady/'

In [18]:
get_article_info(test_url)

{'title': 'Theresa May is safe until December as party chiefs refuse to change rules so she can be kicked out earlier',
 'url': 'https://www.thesun.co.uk/news/8931618/theresa-may-leadership-challenge-sir-graham-brady/',
 'date': datetime.datetime(2019, 4, 24, 17, 4, 1, tzinfo=tzutc())}

To be polite to the server, I want to pause between requests using `sleep`.

In [19]:
from time import sleep

I extract the URLs I want from the dataframe.

In [32]:
urls_to_get = snake_df['url'].values
urls_to_get[:5]

array(['https://www.thesun.co.uk/tech/8998777/dinosaurs-alive-today-sex-lakes/',
       'https://www.thesun.co.uk/fabulous/food/9005828/natural-foods-unhealthy-sugar-misleading/',
       'https://www.thesun.co.uk/tvandshowbiz/9001998/jess-wright-mystery-man-holiday/',
       'https://www.thesun.co.uk/news/8999226/distressing-video-circus-trainer-strangled-by-large-snake-during-performance-russia/',
       'https://www.thesun.co.uk/fabulous/8992765/we-test-spray-on-pvc-trousers-loved-by-celebs-for-how-much-squeak-and-sweat-they-cause/'],
      dtype=object)

In [33]:
article_texts = []  # empty list to store the article info
for url in urls_to_get:  # loop over each url
    article_details = get_article_info(url)  # download the article info (a dict)
    article_texts.append(article_details)  # add the article info to our list
    sleep(1)  # pause one second
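
With a long list of URLs, a single failed download would stop the loop above and lose everything collected so far. Below is a minimal sketch of one way to keep going, assuming it is acceptable to skip failed articles and review them afterwards. `fetch_all` is a hypothetical helper, and `fetch` can be any function that takes a URL and returns a dict, such as `get_article_info`.

```python
from time import sleep


def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing between requests.

    Hypothetical helper. Returns the successful results and a list of
    (url, error message) pairs for the URLs that failed.
    """
    results, failed = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as error:  # e.g. a download or parse failure
            failed.append((url, str(error)))
        sleep(delay)  # stay polite even after a failure
    return results, failed


# Usage with the names from this notebook:
# article_texts, failed = fetch_all(urls_to_get, get_article_info)
```

Catching every `Exception` is deliberately broad for a sketch. In practice you may want to catch only the errors your download library actually raises.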

In [34]:
text_df = pd.DataFrame(article_texts) # turn our list of dicts into a dataframe

In [135]:
len(text_df)

138

In [35]:
text_df.head()

Unnamed: 0,date,text,title,url
0,2019-05-05 14:35:42+00:00,THE DINOSAURS would still walk among us today ...,Dinosaurs ‘would be with us today’ if their ‘s...,https://www.thesun.co.uk/tech/8998777/dinosaur...
1,2019-05-04 23:47:25+00:00,THE Sun on Sunday can reveal claims by some br...,The Sun exposes foods branded ‘natural’ that a...,https://www.thesun.co.uk/fabulous/food/9005828...
2,2019-05-04 08:29:06+00:00,JESS Wright has confirmed she's all loved up a...,Jess Wright cosies up new mystery man on a lux...,https://www.thesun.co.uk/tvandshowbiz/9001998/...
3,2019-05-03 17:23:27+00:00,DISTRESSING video shows a circus trainer slump...,Distressing video 'shows circus trainer strang...,https://www.thesun.co.uk/news/8999226/distress...
4,2019-05-02 22:44:11+00:00,SPRAY-ON trousers are one of spring’s hottest ...,We test ‘spray-on’ PVC trousers loved by celeb...,https://www.thesun.co.uk/fabulous/8992765/we-t...


In [124]:
text_df.to_json('text_df.json', orient='records')
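
With `orient='records'` the file holds a JSON list with one object per article, and with pandas' default settings datetimes are written as epoch milliseconds rather than readable dates. The sketch below reads such a file back using only the standard library. `load_records` is a hypothetical helper, and it assumes the `date` field holds either epoch milliseconds or `null`.

```python
import json
from datetime import datetime, timezone


def load_records(path):
    """Load a records-oriented JSON file and convert 'date' to datetimes.

    Hypothetical helper: assumes 'date' holds epoch milliseconds (the
    pandas to_json default) or null when no date was found.
    """
    with open(path, encoding='utf-8') as f:
        records = json.load(f)
    for record in records:
        if isinstance(record.get('date'), (int, float)):
            # milliseconds -> seconds, interpreted as UTC
            record['date'] = datetime.fromtimestamp(record['date'] / 1000, tz=timezone.utc)
    return records


# Usage with the file written above:
# records = load_records('text_df.json')
```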