# Clarkesworld

Source: https://clarkesworldmagazine.com/

Description: Sci-fi/fantasy magazine

**Topics:**

* Scraping

## Scrape the Clarkesworld homepage `4 points`

I want a CSV file with one row per story and the following columns:

* Title
* Byline
* URL to story
* Category (fiction/non-fiction/cover art)
* Issue number (e.g. 180)
* Publication date (e.g. September 2021)

### Importing

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

response = requests.get("https://clarkesworldmagazine.com/")
soup = bs(response.text, 'html.parser')

### Getting the information

In [141]:
results = soup.select('div.index-col1,div.index-col2')

rows = []

for r in results:
    print('----------')
    row = {}
    row['Title'] = r.select_one('.story').text
    row['Byline'] = r.select_one('.byline').text
    row['URL to story'] = r.select_one('a').get('href')
    # Equivalent: r.select_one('a')['href']
    # The category label is the first <p> inside the enclosing element.
    row['Category'] = r.parent.p.text.strip()
    # The issue heading looks like "ISSUE 186 – March 2022"; split it on the en dash.
    issue = r.parent.parent.h1.text.split('–')
    row['Issue number'] = issue[0].replace('ISSUE','').strip()
    row['Publication date'] = issue[1].strip()
    rows.append(row)
    print(row)

----------
{'Title': 'The Dragon Project', 'Byline': 'by Naomi Kritzer', 'URL to story': 'https://clarkesworldmagazine.com/kritzer_03_22', 'Category': 'FICTION', 'Issue number': '186', 'Publication date': 'March 2022'}
----------
{'Title': 'Saturn Devouring His Son', 'Byline': 'by EA Mylonas', 'URL to story': 'https://clarkesworldmagazine.com/mylonas_03_22', 'Category': 'FICTION', 'Issue number': '186', 'Publication date': 'March 2022'}
----------
{'Title': 'Rain of Days', 'Byline': 'by Ray Nayler', 'URL to story': 'https://clarkesworldmagazine.com/nayler_03_22', 'Category': 'FICTION', 'Issue number': '186', 'Publication date': 'March 2022'}
----------
{'Title': 'The Memory of Water', 'Byline': 'by Tegan Moore', 'URL to story': 'https://clarkesworldmagazine.com/moore_03_22', 'Category': 'FICTION', 'Issue number': '186', 'Publication date': 'March 2022'}
----------
{'Title': 'Wanting Things', 'Byline': 'by Cal Ritterhoff', 'URL to story': 'https://clarkesworldmagazine.com/ritterhoff
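
The issue heading can also be parsed with a regular expression instead of `split` and `replace`. This is only a sketch: the heading shape `ISSUE 186 – March 2022` is inferred from the output above, and `parse_issue_heading` is a hypothetical helper name, not part of the original notebook.

```python
import re

# Assumed heading shape, inferred from the scraped output: "ISSUE 186 – March 2022".
# The separator may be an en dash or a plain hyphen.
HEADING_RE = re.compile(r"ISSUE\s+(\d+)\s*[–-]\s*(.+)", re.IGNORECASE)

def parse_issue_heading(text):
    """Return (issue_number, publication_date), or (None, None) if the text does not match."""
    match = HEADING_RE.search(text)
    if match is None:
        return None, None
    return match.group(1), match.group(2).strip()

print(parse_issue_heading("ISSUE 186 – March 2022"))  # ('186', 'March 2022')
```

Compared with `split('–')`, this returns a clear "no match" result instead of raising an `IndexError` when the heading has an unexpected shape.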

### Saving to CSV

In [142]:
df = pd.DataFrame(rows)
df.to_csv(path_or_buf='clarkesworld.csv')
df

Unnamed: 0,Title,Byline,URL to story,Category,Issue number,Publication date
0,The Dragon Project,by Naomi Kritzer,https://clarkesworldmagazine.com/kritzer_03_22,FICTION,186,March 2022
1,Saturn Devouring His Son,by EA Mylonas,https://clarkesworldmagazine.com/mylonas_03_22,FICTION,186,March 2022
2,Rain of Days,by Ray Nayler,https://clarkesworldmagazine.com/nayler_03_22,FICTION,186,March 2022
3,The Memory of Water,by Tegan Moore,https://clarkesworldmagazine.com/moore_03_22,FICTION,186,March 2022
4,Wanting Things,by Cal Ritterhoff,https://clarkesworldmagazine.com/ritterhoff_03_22,FICTION,186,March 2022
5,It Takes a Village,by Priya Chand,https://clarkesworldmagazine.com/chand_03_22,FICTION,186,March 2022
6,Meddling Fields,by R.T. Ester,https://clarkesworldmagazine.com/ester_03_22,FICTION,186,March 2022
7,Commencement Address,"by Arthur Liu, translated by Stella Jiayue Zhu",https://clarkesworldmagazine.com/liu_03_22,FICTION,186,March 2022
8,Validating Rage: Women in Horror,by Carrie Sessarego,https://clarkesworldmagazine.com/sessarego_03_22,NON-FICTION,186,March 2022
9,Breaking the Gender Barrier: A Conversation wi...,by Arley Sorg,https://clarkesworldmagazine.com/wang-chen_int...,NON-FICTION,186,March 2022
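
A note on the `Unnamed: 0` column: pandas typically produces that name when a CSV that was written with its index is read back without `index_col`. The sketch below reproduces this with a toy DataFrame (the data is made up for illustration) and shows two common ways to avoid it.

```python
import io
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"Title": ["The Dragon Project"], "Issue number": [186]})

# By default, to_csv writes the index as an unnamed first column.
with_index = df.to_csv()

# Reading that back without index_col names the column "Unnamed: 0".
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))  # ['Unnamed: 0', 'Title', 'Issue number']

# Option 1: tell read_csv that the first column is the index.
reread_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

# Option 2: do not write the index at all.
without_index = df.to_csv(index=False)
```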


## Auto-updating scraper `3 points`

Using GitHub Actions, implement a scraper that will keep track of everything posted to the Clarkesworld homepage. For example, when issue 181 comes out it should *add to the CSV* instead of just replacing it.

> Tip: `drop_duplicates` might save you a lot of effort at one point or another.
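
To illustrate the tip, here is a minimal sketch of the "add instead of replace" step on toy data. `merge_stories` is a hypothetical helper name and the `example.com` URLs are placeholders; the column name `URL to story` matches the CSV from the first part.

```python
import pandas as pd

def merge_stories(existing, scraped):
    """Append newly scraped rows to the existing ones, keeping one row per URL."""
    combined = pd.concat([existing, scraped], ignore_index=True)
    # keep='first' means rows already in the CSV win over re-scraped copies.
    return combined.drop_duplicates(subset="URL to story", keep="first", ignore_index=True)

# Placeholder data: one story already saved, one re-scraped, one new.
existing = pd.DataFrame([{"Title": "Old story", "URL to story": "https://example.com/old"}])
scraped = pd.DataFrame([
    {"Title": "Old story", "URL to story": "https://example.com/old"},
    {"Title": "New story", "URL to story": "https://example.com/new"},
])

merged = merge_stories(existing, scraped)
print(merged["Title"].tolist())  # ['Old story', 'New story']
```

Deduplicating on the URL rather than on the whole row means a story is not added twice even if its title or byline is edited on the homepage later.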

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
import os.path

response = requests.get("https://clarkesworldmagazine.com/")
soup = bs(response.text, 'html.parser')
results = soup.select('div.index-col1,div.index-col2')

rows = []

for r in results:
    row = {}
    # select_one returns None when an element is missing, so accessing
    # .text or .get raises AttributeError; fall back to 'unknown' then.
    try:
        row['Title'] = r.select_one('.story').text
    except AttributeError:
        row['Title'] = 'unknown'

    try:
        row['Byline'] = r.select_one('.byline').text
    except AttributeError:
        row['Byline'] = 'unknown'

    try:
        row['URL to story'] = r.select_one('a').get('href')
    except AttributeError:
        row['URL to story'] = 'unknown'

    try:
        row['Category'] = r.parent.p.text.strip()
    except AttributeError:
        row['Category'] = 'unknown'

    # IndexError covers a heading without the en dash separator.
    try:
        issue = r.parent.parent.h1.text.split('–')
        row['Issue number'] = issue[0].replace('ISSUE','').strip()
        row['Publication date'] = issue[1].strip()
    except (AttributeError, IndexError):
        row['Issue number'] = 'unknown'
        row['Publication date'] = 'unknown'
    rows.append(row)
  
# Append to the existing CSV if there is one; otherwise start a new one.
if os.path.isfile('clarkesworld.csv'):
    df1 = pd.read_csv('clarkesworld.csv', index_col=0)
    df2 = pd.DataFrame(rows)
    df3 = pd.concat([df1, df2])
else:
    df3 = pd.DataFrame(rows)

# Keep one row per story URL; keep='first' preserves rows already in the CSV.
df3.drop_duplicates(subset='URL to story', keep='first', inplace=True, ignore_index=True)
df3.to_csv('clarkesworld.csv')
df3




Unnamed: 0,Title,Byline,URL to story,Category,Issue number,Publication date
0,The Dragon Project,by Naomi Kritzer,https://clarkesworldmagazine.com/kritzer_03_22,FICTION,186,March 2022
1,Saturn Devouring His Son,by EA Mylonas,https://clarkesworldmagazine.com/mylonas_03_22,FICTION,186,March 2022
2,Rain of Days,by Ray Nayler,https://clarkesworldmagazine.com/nayler_03_22,FICTION,186,March 2022
3,The Memory of Water,by Tegan Moore,https://clarkesworldmagazine.com/moore_03_22,FICTION,186,March 2022
4,Wanting Things,by Cal Ritterhoff,https://clarkesworldmagazine.com/ritterhoff_03_22,FICTION,186,March 2022
5,It Takes a Village,by Priya Chand,https://clarkesworldmagazine.com/chand_03_22,FICTION,186,March 2022
6,Meddling Fields,by R.T. Ester,https://clarkesworldmagazine.com/ester_03_22,FICTION,186,March 2022
7,Commencement Address,"by Arthur Liu, translated by Stella Jiayue Zhu",https://clarkesworldmagazine.com/liu_03_22,FICTION,186,March 2022
8,Validating Rage: Women in Horror,by Carrie Sessarego,https://clarkesworldmagazine.com/sessarego_03_22,NON-FICTION,186,March 2022
9,Breaking the Gender Barrier: A Conversation wi...,by Arley Sorg,https://clarkesworldmagazine.com/wang-chen_int...,NON-FICTION,186,March 2022
