RE: OSINT Monitoring

Dear Data Consultant,

Your assistance thus far has helped us in our investigations. I'm afraid we need your help once again. The Hamburgler has been captured, but her associates are now attempting to fence explosives and drugs in places like Craigslist garage sales. They seem to be using terms like "mattress," "cabinet," and "wrench," though there may be others. We need a tool that we can schedule to check daily for these keywords showing up in local markets so we can then investigate them more fully.

We know about Craigslist, but we suspect they are also using Parler and Gab--not just to sell goods but also to coordinate events and make connections. We aren't sure which terms they are using there, so you'll need to use your best judgment.

- Smith, Agent-in-Charge

Write a Python script that, when executed once, goes to the Phoenix Craigslist site, crawls each listing on the "Garage & Moving Sales" page looking for mentions of the words above, and generates a CSV that lists each keyword and the URL at which it was found.
Pass the Web Scraping Basics quiz with a score of 85% or better.
This project is a bit more complicated than what we've done in the tutorials, but only a little bit. It will require you to extract the links (found in "a" tags) and sort through them to make sure you are only getting the ones in the gallery. You may need to look through the BeautifulSoup documentation to figure out how to get the hrefs from those "a" tags. Note that Craigslist seems to be blocking Google IPs, so you'll need to do this one from your own machine. Once you have a list of URLs, you'll need to crawl them one by one and search the text of each ad for each of your keywords (see above). Note that while Craigslist has an API, you need to scrape the pages directly here to show your scraping skills. Remember to build in a sizeable "sleep" between requests.
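The link-extraction and keyword-search steps described above might look something like the following minimal sketch. It runs against an inline HTML snippet rather than the live site. The `result-row` class name, the example.org URLs, and the helper names `extract_listing_links` and `find_keywords` are assumptions for illustration, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

KEYWORDS = ["mattress", "cabinet", "wrench"]

# Hypothetical markup standing in for a results page (not real Craigslist HTML).
SAMPLE_HTML = """
<a href="/about">about</a>
<ul id="search-results">
  <li class="result-row"><a href="https://example.org/gms/1.html">Moving sale</a></li>
  <li class="result-row"><a href="https://example.org/gms/2.html">Garage sale</a></li>
</ul>
"""


def extract_listing_links(html):
    """Return the href of the first <a> tag inside each result row."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("li", class_="result-row"):
        anchor = row.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


def find_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur anywhere in text, ignoring case."""
    lowered = text.lower()
    return [keyword for keyword in keywords if keyword in lowered]


print(extract_listing_links(SAMPLE_HTML))
print(find_keywords("Queen MATTRESS and a socket wrench set"))
```

Restricting the search to result rows skips navigation links such as `/about`. Substring matching also catches plurals like "mattresses", at the cost of occasional false positives. A real crawl would fetch each extracted URL in turn, sleeping between requests, and pass the ad text to `find_keywords`.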

In [12]:
pip install selenium

Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [302]:
import requests
from time import sleep
import pandas as pd
from selenium import webdriver
import csv
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
# this imports the path to get to selenium from a separate
# python file
from infos import PATH

In [305]:
def find_hamburgler():
    
    # this opens Chrome from the kernel
    driver = webdriver.Chrome(executable_path=PATH)
    
    sleep(2)
    
    # this takes me to craigslist
    craigslist = driver.get('https://phoenix.craigslist.org/')
    
    sleep(4)
    
    # this clicks on the garage & moving sales hyperlink
    garagesales = driver.find_element_by_class_name('gms').click()
    
    sleep(4)
    
    #this uses the send keys function to search all the listings to find ones with the word "mattress"
    driver.find_element_by_id('query').send_keys('mattress',Keys.RETURN)
    
    sleep(2)
    
    # this grabs the page info so Beautiful soup can be used to
    # find specific tags
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    
    sleep(2)
    
    # this narrows down the tags so I can get to href. This was done this way
    # because I didn't want to grab ALL the links on the pages, just the 
    # ones that had the relevant search results.
    mattress = soup.find('ul', {'id': 'search-results', 'class': 'rows'})
    listings_mattress = mattress.find_all('li', 'result-row')
    
    sleep(2)
    
    # this grabs the relevant url under the href tags
    # and then labels the url accordingly
    for listings_m in listings_mattress:
        link_m = listings_m.a['href']
        link_m += ' - mattress'
        
        # this prints a status update
        print(f'mattress search results: {link_m} ')
        
    # this clears the search box...    
    driver.find_element_by_id('query').clear()
    
    sleep(2)
    
    # so the function can search for the next code word
    driver.find_element_by_id('query').send_keys('cabinet', Keys.RETURN)
    
    sleep(3)
    
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    
    cabinet = soup.find('ul', {'id': 'search-results', 'class': 'rows'})
    listings_cabinet = cabinet.find_all('li', 'result-row')
    
    for listings_c in listings_cabinet:
        link_c = listings_c.a['href']
    
        link_c += ' - cabinet'
    
        print(f'cabinet search results: {link_c} ')
        
        sleep(2)
    
    driver.find_element_by_id('query').clear()
    
    sleep(2)
    
    driver.find_element_by_id('query').send_keys('wrench', Keys.RETURN)
    
    sleep(3)
    
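    # re-reads the page so the soup reflects the wrench results
    # rather than the earlier cabinet search
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    
    wrench = soup.find('ul', {'id': 'search-results', 'class': 'rows'})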
    
    listings_wrench = wrench.find_all('li', 'result-row')
    
    for listings_w in listings_wrench:
        link_w = listings_w.a['href']
    
        link_w += ' - wrench'
        print(f'wrench search results: {link_w} ')
     
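    # creating an empty list so every search result can be added
    allposts = []
    
    # adds one (keyword, url) row per listing, instead of only the
    # last link found by each loop above
    for keyword, listings in (('mattress', listings_mattress),
                              ('cabinet', listings_cabinet),
                              ('wrench', listings_wrench)):
        for listing in listings:
            allposts.append([keyword, listing.a['href']])
    
    # creating a data frame using pandas
    df = pd.DataFrame(allposts, columns=['keyword', 'url'])
    
    # creates a new csv and adds the information from the list
    df.to_csv('modfour.csv', index=False)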
    
    # closes the webdriver and ends the function
    driver.quit()
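
The assignment asks for a CSV that lists each keyword with the URL at which it was found. One possible layout, sketched here with the standard-library `csv` module rather than pandas, is one row per (keyword, URL) pair. The helper name `write_matches` and the sample rows are hypothetical.

```python
import csv
import io


def write_matches(matches, handle):
    """Write (keyword, url) pairs to an open text handle as CSV with a header."""
    writer = csv.writer(handle)
    writer.writerow(["keyword", "url"])
    for keyword, url in matches:
        writer.writerow([keyword, url])


# Hypothetical results, for illustration only.
sample_matches = [
    ("mattress", "https://example.org/gms/1.html"),
    ("wrench", "https://example.org/gms/2.html"),
]

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_matches(sample_matches, buffer)
print(buffer.getvalue())
```

To write a real file instead, the same function could be called with a handle from `open('modfour.csv', 'w', newline='')`.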
    
        

In [308]:
pip install schedule

Collecting schedule
  Downloading schedule-1.1.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: schedule
Successfully installed schedule-1.1.0
Note: you may need to restart the kernel to use updated packages.


In [311]:
import schedule
import time

schedule documentation link for future reference: https://schedule.readthedocs.io/en/stable/

In [313]:
# this schedules the find_hamburgler function to run every day at 1:00 p.m.

def job():
    print("I'm working...")
    
schedule.every().day.at("13:00").do(find_hamburgler)


while True:
    schedule.run_pending()
    time.sleep(1)

KeyboardInterrupt: 
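
The loop above polls once per second until it is interrupted. To illustrate what "every day at 13:00" amounts to, here is a minimal standard-library sketch that computes the wait until the next run. It is not how the `schedule` library is implemented, and `seconds_until_next_run` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def seconds_until_next_run(now, hour=13, minute=0):
    """Return the seconds from now until the next hour:minute wall-clock time."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if target <= now:
        target += timedelta(days=1)
    return (target - now).total_seconds()


# Example timestamps chosen for illustration.
print(seconds_until_next_run(datetime(2021, 3, 1, 12, 30)))
print(seconds_until_next_run(datetime(2021, 3, 1, 14, 0)))
```

A script could sleep for that many seconds and then call `find_hamburgler`, although an operating-system scheduler such as cron may be a sturdier choice for a daily job than a long-running notebook cell.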