
# **Module 4 - Basics**

Here, I scrape https://news.asu.edu and search its current news articles for a set of keywords, as part of my SDS course.

In [1]:
#importing libraries
import requests
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd
import csv

Defining Functions and Creating a CSV Shell

In [2]:
#defining function to grab the article data prior to iterating through the 'potentialarticles' list.
def grabinfo(url):
  #adding a timeout so a stalled request cannot hang the notebook.
  thispage = requests.get(url, timeout=30)
  #setting a variable to tell beautiful soup what page to use and to parse through the html
  bsinfo = BeautifulSoup(thispage.text, 'html.parser')
  #finding all instances of 'p' where the article text is. (I initially tried to use bsinfo.find('view-content').find_all('p'), however that resulted in just one url being returned)
  paragraphs = bsinfo.find_all('p')
  article = '\n\n'.join(x.text for x in paragraphs)
  #getting the headline of each article, if needed! (falls back to '' when the page has no 'h1', since .find() returns None in that case)
  h1 = bsinfo.find('h1')
  title = h1.text if h1 else ''
  return {'url':url,'title':title, 'article body':article}
#creating empty csv with just a header to append to later with filtered articles.
#newline='' keeps the csv module from writing blank rows between records on Windows.
with open('ASU News1.csv','w', newline='') as f:
  writer = csv.writer(f)
  writer.writerow(['url', 'keyword_found'])
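
The header-only CSV is meant to be appended to later. Below is a minimal sketch of what that append step could look like with only the standard library. The helpers `create_shell`, `append_matches`, and `read_rows`, the demo file name, and the example URL are all hypothetical and are not part of this notebook, which instead builds up a DataFrame and rewrites the whole file at the end.

```python
import csv
import os
import tempfile

FIELDS = ['url', 'keyword_found']

def create_shell(path):
  # write only the header row; newline='' avoids blank rows on Windows
  with open(path, 'w', newline='') as f:
    csv.DictWriter(f, fieldnames=FIELDS).writeheader()

def append_matches(path, rows):
  # hypothetical helper: append one row per match to an existing csv
  with open(path, 'a', newline='') as f:
    csv.DictWriter(f, fieldnames=FIELDS).writerows(rows)

def read_rows(path):
  # read the file back as a list of dicts keyed by the header row
  with open(path, newline='') as f:
    return list(csv.DictReader(f))

# demo in a temporary folder so the real 'ASU News1.csv' is not touched
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
create_shell(demo_path)
# the url below is a made-up placeholder, not a real article
append_matches(demo_path, [{'url': 'https://news.asu.edu/example', 'keyword_found': 'medicine'}])
print(read_rows(demo_path))
```

Appending row by row like this means partial results survive if the scrape fails halfway, at the cost of opening the file once per batch of rows.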

In [3]:
#showing that the csv was created correctly with just a header row for now.
df = pd.read_csv('ASU News1.csv')
df

Unnamed: 0,url,keyword_found


Defining Variables

In [4]:
#setting the desired page to scrape
asupage = requests.get('https://news.asu.edu', timeout=30)

In [5]:
#telling beautifulsoup to parse the html for 'asupage'
bs = BeautifulSoup(asupage.text, 'html.parser')

In [6]:
#defining a keyword list for searching purposes
keywordlist = ['medicine', 'breakthrough', 'upcoming']

In [7]:
#attempt 2 - gathering potential links and placing them in a list; leaves duplicates.
potentialarticles = []
#telling bs to find all instances of 'a' that have an 'href'
links = bs.find_all('a', href=True)
for link in links:
  #defining href and retrieving the hyperlink
  href = link.get('href')
  #absolute links already include 'https://news.asu.edu', so they are appended as-is.
  if href.startswith('https://news.asu.edu') and '/202' in href:
    #appending the links to the list
    potentialarticles.append(href)
    #print(href) #used to check that the right articles were being pulled.
  #relative links are missing the 'https://news.asu.edu' portion, so it is added to build the full url.
  elif href.startswith('/') and '/202' in href:
    full_url = 'https://news.asu.edu' + href
    #print(full_url) #used to check that the right articles were being pulled.
    #appending the links to the list
    potentialarticles.append(full_url)
  #no sleep is needed in this loop: it only reads the html that was already downloaded and makes no new requests.
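
The two branches above (absolute links and relative links) can also be handled in one step. Here is a minimal sketch using `urljoin` from the standard library; the `article_links` helper and the sample hrefs are hypothetical and only for illustration. Note that this version checks that the url path starts with '/202', which is slightly stricter than the `'/202' in href` test in the notebook.

```python
from urllib.parse import urljoin, urlparse

BASE = 'https://news.asu.edu'

def article_links(hrefs, base=BASE):
  # hypothetical helper: turn raw href values into full article urls
  found = []
  for href in hrefs:
    # urljoin leaves absolute urls alone and resolves relative ones against the base
    full_url = urljoin(base, href)
    parsed = urlparse(full_url)
    # keep only links on the news site whose path starts with a 202x date
    if parsed.netloc == 'news.asu.edu' and parsed.path.startswith('/202'):
      found.append(full_url)
  return found

# made-up hrefs: one relative article, one absolute article, two non-articles
sample = [
  '/20251103-example-story',
  'https://news.asu.edu/20251107-another-story',
  'https://www.asu.edu/about',
  '/about',
]
print(article_links(sample))
```

Like the loop in the notebook, this sketch keeps duplicates, so the de-duplication step is still needed afterwards.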


In [8]:
#length of list before cleaning of duplicates
len(potentialarticles) #57 articles atm but keep in mind there are duplicates currently.
#len(set(potentialarticles)) #used to check the unique count! (a list has no .nunique(); that is a pandas method)

57

In [9]:
#cleaning list of duplicates by creating a dictionary and then converting it back into a list to preserve the order, if necessary!
potentialarticles = list(dict.fromkeys(potentialarticles))
len(potentialarticles)
#potentialarticles #used to check!

37
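
To show why `dict.fromkeys` is used for this step, here is a small sketch on a made-up list. The `dedupe_keep_order` name and the sample values are hypothetical. Converting to a `set` would also remove duplicates, but a set does not promise to keep the original order, while dictionary keys keep insertion order (guaranteed since Python 3.7).

```python
def dedupe_keep_order(items):
  # dict keys are unique and remember the order in which they were first seen
  return list(dict.fromkeys(items))

# made-up values standing in for scraped links
sample_links = ['/b', '/a', '/b', '/c', '/a']
print(dedupe_keep_order(sample_links))  # ['/b', '/a', '/c']
```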

Filtering and Appending the Data Frame

In [10]:
#setting up for filtering by keyword
#creating list to easily check if there are errors when appending the dataframe.
filteredarticles=[]
for url in potentialarticles:

  articleinfo = grabinfo(url)
  #defining 'articletext' to search through the body text, lowercased once so the search ignores case
  articletext = articleinfo['article body'].lower()
  #adding sleep to be nice to the servers between requests.
  sleep(5)

  #df=pd.concat([df, pd.DataFrame([articleinfo])], ignore_index=True)

  for keyword in keywordlist:
    #telling it to check for keyword in article's body; this is a substring check, so 'medicine' also matches inside longer words
    if keyword.lower() in articletext:
      #appending the list with which keyword was found and at which url (same keys as the df columns).
      filteredarticles.append({'keyword_found': keyword, 'url':url})
      ##used pd.concat instead of df.append, which was deprecated and then removed in pandas 2.0
      df=pd.concat([df, pd.DataFrame([{'keyword_found': keyword, 'url':url}])], ignore_index=True)
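
The keyword check in the loop is a plain substring test, so a keyword such as 'medicine' is also found inside a longer word such as 'biomedicine'. If only whole-word matches are wanted (that is an assumption, the substring behavior may be exactly what is intended here), a minimal sketch with the standard `re` module could look like this. The `keywords_in_text` helper and the sample sentence are hypothetical.

```python
import re

def keywords_in_text(text, keywords):
  # hypothetical helper: return the keywords that appear as whole words, ignoring case
  found = []
  for keyword in keywords:
    # \b marks a word boundary; re.escape protects keywords that contain regex characters
    pattern = r'\b' + re.escape(keyword) + r'\b'
    if re.search(pattern, text, flags=re.IGNORECASE):
      found.append(keyword)
  return found

# made-up sentence: 'biomedicine' contains 'medicine' but is not a whole-word match
sample_text = 'A Breakthrough in biomedicine was announced today.'
print(keywords_in_text(sample_text, ['medicine', 'breakthrough', 'upcoming']))  # ['breakthrough']
```

With the substring test from the notebook, the same sentence would match both 'medicine' and 'breakthrough'.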

Updating the CSV

In [11]:
#writing the updated df back to the csv, replacing the header-only file with the newly discovered results.
df.to_csv('ASU News1.csv', index=False)
df

Unnamed: 0,url,keyword_found
0,https://news.asu.edu/20251103-sun-devil-commun...,medicine
1,https://news.asu.edu/20251107-health-and-medic...,medicine
2,https://news.asu.edu/20251104-local-national-a...,medicine
3,https://news.asu.edu/20251103-science-and-tech...,breakthrough
4,https://news.asu.edu/20250807-environment-and-...,breakthrough
5,https://news.asu.edu/20251022-university-news-...,medicine


In [12]:
#checking that the list holds the same number of matches as the df.
len(filteredarticles)

6

In [13]:
print(f'During my scrape of the ASU News site, I found {len(filteredarticles)} keyword matches for the following keywords: {keywordlist}')

During my scrape of the ASU News site, I found 6 keyword matches for the following keywords: ['medicine', 'breakthrough', 'upcoming']
