# Craigslist Scraping Operationalization

## Raw Data Examples

SQL query results keyed on the foreign key 'Zip Code' for:

1. Count of Businesses in Same Zip Code
2. Count of Eviction Notices in Same Zip Code (ignore)
3. Count of Different Types of Schools in Same Zip Code
4. Average Home Price in Same Zip Code


In [1]:
zip_query = {'businesses': {'94102': 1, '11111': 5, '94104': 10, '94105': 10 }, 
             'evictions': {'94102': 7, '11111': 4, '94104': 10, '94105': 9 },
             'schools_k9': {'94102': 10, '11111': 5, '94104': 2, '94105': 8 },
             'schools_hs': {'94102': 9, '11111': 6, '94104': 6, '94105': 10 },
             'home_prices': {'94102': 8, '11111': 4, '94104': 1, '94105': 4 }}

Raw Hive Table for each of: 

1. Bicycle Parking Locations
2. Bicycle Share Locations
3. Parking Lots / Spaces
4. SFPD Incidents 2016
5. Trees

with GeoTag Foreign Key. 

In [2]:
bike_parking = {'location 1': (37.7606289177, -122.410647009), 
                'location 2': (37.7855355791102, -122.411302813025),
                'location 3': (37.7759676911831, -122.441396661871),
                'location 4': (37.7518243814, -122.426627114),
                'location 5': (37.75182438, -122.4266271)}

bike_share =   {'location 1': (37.7606289177, -122.410647009), 
                'location 2': (37.7855355791102, -122.411302813025),
                'location 3': (37.7759676911831, -122.441396661871),
                'location 4': (37.7518243814, -122.426627114),
                'location 5': (37.75182438, -122.4266271)}

parking =      {'location 1': (37.7606289177, -122.410647009), 
                'location 2': (37.7855355791102, -122.411302813025),
                'location 3': (37.7759676911831, -122.441396661871),
                'location 4': (37.7518243814, -122.426627114),
                'location 5': (37.75182438, -122.4266271)}

SFPD =         {'location 1': (37.7606289177, -122.410647009),
                'location 2': (37.7855355791102, -122.411302813025),
                'location 3': (37.7759676911831, -122.441396661871),
                'location 4': (37.7518243814, -122.426627114),
                'location 5': (37.75182438, -122.4266271)}

trees =        {'location 1': (37.7606289177, -122.410647009),
                'location 2': (37.7855355791102, -122.411302813025),
                'location 3': (37.7759676911831, -122.441396661871),
                'location 4': (37.7518243814, -122.426627114),
                'location 5': (37.75182438, -122.4266271)}
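Each table above maps a location to a (latitude, longitude) pair, so comparing a listing's geotag against a table comes down to measuring distances between coordinates. The following is a minimal sketch of one way to do that with the haversine great-circle formula. It is not part of the original pipeline: the helper names `haversine_m`, `nearest_distance_m`, and `count_within_m` are hypothetical, and `sample_locations` simply repeats the placeholder points shown above.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def nearest_distance_m(geotag, locations):
    """Distance in metres from geotag to the closest point in a {name: (lat, lon)} dict."""
    return min(haversine_m(geotag, point) for point in locations.values())

def count_within_m(geotag, locations, radius_m):
    """Number of points in a {name: (lat, lon)} dict within radius_m metres of geotag."""
    return sum(1 for point in locations.values()
               if haversine_m(geotag, point) <= radius_m)

# Hypothetical sample data, copied from the placeholder tables above
sample_locations = {'location 1': (37.7606289177, -122.410647009),
                    'location 2': (37.7855355791102, -122.411302813025),
                    'location 3': (37.7759676911831, -122.441396661871),
                    'location 4': (37.7518243814, -122.426627114),
                    'location 5': (37.75182438, -122.4266271)}

listing_geotag = (37.790788, -122.419036)  # geotag of a sample listing
print(nearest_distance_m(listing_geotag, sample_locations))
print(count_within_m(listing_geotag, sample_locations, 500))
```

How a distance in metres maps onto "Short", "Medium", or "Long" (or a count onto "Low", "Medium", "High") is left open here, since the thresholds are not defined in this notebook.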

Neighbourhood Bounding Boxes for Neighbourhood Lookups (delete?)

In [3]:
neighbourhood_boxes = {
    "adams_point": [
        [37.80789, -122.25000],
        [37.81589, -122.26081],
    ],
    "piedmont": [
        [37.82240, -122.24768],
        [37.83237, -122.25386],
    ],
    "example1": [
        [37.76, -122.3],
        [38.0, -122.4],
    ],
    "example2": [
        [37.76, -122.4],
        [38.0, -122.5],
    ],
}
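A bounding-box lookup of this kind can be sketched as a simple containment test. The code below is an illustration only and is not the contents of the `neighbourhood_lookup` module used later; the names `normalise_box` and `lookup_neighbourhood` are hypothetical. Because the two corners of a box are not stored in a fixed order (in `adams_point` the first corner has the smaller latitude but the larger longitude), the sketch sorts each axis before comparing.

```python
def normalise_box(box):
    """Turn two opposite corners into (lat_min, lat_max, lon_min, lon_max)."""
    (lat_a, lon_a), (lat_b, lon_b) = box
    return (min(lat_a, lat_b), max(lat_a, lat_b),
            min(lon_a, lon_b), max(lon_a, lon_b))

def lookup_neighbourhood(geotag, boxes):
    """Return the name of the first box containing geotag, or None if none does."""
    lat, lon = geotag
    for name, box in boxes.items():
        lat_min, lat_max, lon_min, lon_max = normalise_box(box)
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return name
    return None

# Hypothetical sample data: two of the boxes defined above
sample_boxes = {
    "adams_point": [[37.80789, -122.25000], [37.81589, -122.26081]],
    "piedmont": [[37.82240, -122.24768], [37.83237, -122.25386]],
}

print(lookup_neighbourhood((37.81, -122.255), sample_boxes))
print(lookup_neighbourhood((0.0, 0.0), sample_boxes))
```

When boxes overlap or share an edge, as `example1` and `example2` do at longitude -122.4, this sketch returns whichever box comes first in the dict.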

List of Zip Codes (delete?)

In [4]:
zipcodes = [94102,94103,94104,94105,94107,94108,94109,94110,94111,94112,94114,94115,94116,94117,94118,94119,94120,
            94121,94122,94123,94124,94125,94126,94127,94128,94129,94130,94131,94132,94133,94134,94137,94139,94140,
            94141,94142,94143,94144,94145,94146,94147,94151,94158,94159,94160,94161,94163,94164,94172,94177,94188]

### Sample Inputs from User

In [5]:
max_rent = 4000
min_rent = 1000
category = 'apa'
min_rank_businesses = None
min_rank_evictions = 8
min_rank_schools_k9 = 8
min_rank_schools_hs = None
min_rank_home_prices = 5
distance_to_bike_parking = "Short"      # Short, Medium, Long
distance_to_bike_share = "Short"        # Short, Medium, Long
density_of_parking_spots_500m = "Low"   # Low, Medium, High Density within 500m
density_of_SFPD_Incidents = "Low"       # Low, Medium, High Density in 2016
density_of_trees_100m = "High"          # Low, Medium, High Density within 100m

### Sample Raw Outputs from Craigslist Scraper

In [6]:
output = {'id': '6060895324', 'has_map': True, 'price': '$1600', 'url': 'http://sfbay.craigslist.org/sfc/apa/6060895324.html',
          'name': 'Furnished Room', 'has_image': True, 'datetime': '2017-03-26 09:33', 'where': 'nob hill', 'geotag': (37.790788, -122.419036)}
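Note that `price` arrives as a string such as `'$1600'`, whereas `max_rent` and `min_rent` are integers. If a price ever needs to be compared locally rather than through the scraper's own filters, it has to be converted first. A minimal sketch, with `parse_price` as a hypothetical helper that is not part of the original code:

```python
def parse_price(price):
    """Convert a price string such as '$1600' to an int; return None if it cannot be parsed."""
    if not price:
        return None
    digits = price.replace('$', '').replace(',', '').strip()
    return int(digits) if digits.isdigit() else None

print(parse_price('$1600'))
```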


# Scraper Function with Filters

In [7]:
def scrape_craigslist(max_rent=None, min_rent=None, cat=category):

    from craigslist import CraigslistHousing
    import filtering_functions
    import zip_lookup
    import neighbourhood_lookup

    cl = CraigslistHousing(site='sfbay', area='sfc', category=cat,
                           filters={'max_price': max_rent, 'min_price': min_rent})

    results = cl.get_results(sort_by='newest', geotagged=True, limit=100) #do we need to set a reasonable limit?
    tentative_rental = []
    valid_rentals = []
    for result in results:
        
        ################################################################################################
        ### INITIALIZE RESULT
        
        # skip listings without a geotag: every lookup below needs coordinates
        geotag = result['geotag']
        if geotag is None:
            continue
        
        zipcode = zip_lookup.zip_lookup_by_geotag(geotag)
        tentative_rental.append(result)
        
        ### Get Approximate Neighbourhood by Geotag
        result["area"] = neighbourhood_lookup.neighbourhood_lookup(geotag)
        
        
        ################################################################################################
        ### APPLY FILTERS
        
        
        #-----------------------------------------------------------------------------------------------
        ## ZIPCODE BASED FILTERS
        #-----------------------------------------------------------------------------------------------
        
        # NUMBER OF BUSINESSES
        if zipcode not in filtering_functions.check_businesses(min_rank_businesses, zip_query):
            continue
        
        # EVICTION NOTICES
        if zipcode not in filtering_functions.check_evictions(min_rank_evictions, zip_query):
            continue
        
        #-----------------------------------------------------------------------------------------------
        ## DISTANCE BASED FILTERS
        #-----------------------------------------------------------------------------------------------
        
        # TODO: no distance-based filters are applied yet

        ################################################################################################
        # Made it to the end of the filters intact? Rental is Valid for this User's Query!
        valid_rentals.append(result)
        tentative_rental = [] # reset the tentative rental, continue loop
        
        ################################################################################################
        ### DISPLAY VALID RESULTS
        print(result["area"], result["price"], result["name"], result["url"])
        
        
    return valid_rentals
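The zip-code filters above rely on `filtering_functions.check_businesses` and `filtering_functions.check_evictions`, whose source is not shown here. From the way they are called, each appears to return the collection of zip codes that satisfy a minimum rank. The sketch below is a guess at that behaviour, not the actual module: `zips_meeting_min_rank` is a hypothetical name, and it assumes that a higher rank is better and that a `min_rank` of `None` means the filter is switched off.

```python
def zips_meeting_min_rank(min_rank, ranks):
    """Return the set of zip codes whose rank is at least min_rank.

    Assumption: a min_rank of None disables the filter, so every zip code passes.
    """
    if min_rank is None:
        return set(ranks)
    return {zipcode for zipcode, rank in ranks.items() if rank >= min_rank}

# Hypothetical sample data: the 'evictions' entry of zip_query
sample_ranks = {'94102': 7, '11111': 4, '94104': 10, '94105': 9}

print(sorted(zips_meeting_min_rank(8, sample_ranks)))
```

The keys of `zip_query` are strings while `zipcodes` holds integers, so whatever `zip_lookup_by_geotag` returns needs to match the key type for the membership test to succeed.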


In [8]:
scrape_craigslist(max_rent = max_rent, min_rent = min_rent)

South Beach $3195 Stop in Today and Lease this Spacious Studio! http://sfbay.craigslist.org/sfc/apa/6070937189.html
South Beach $2500 Luxury Rincon Hill Studio (399 Fremont), short or long-term lease http://sfbay.craigslist.org/sfc/apa/6047926965.html


[{'area': 'South Beach',
  'datetime': '2017-04-02 09:49',
  'geotag': (37.78723, -122.391465),
  'has_image': True,
  'has_map': True,
  'id': '6070937189',
  'name': 'Stop in Today and Lease this Spacious Studio!',
  'price': '$3195',
  'url': 'http://sfbay.craigslist.org/sfc/apa/6070937189.html',
  'where': 'SOMA / south beach'},
 {'area': 'South Beach',
  'datetime': '2017-04-02 09:44',
  'geotag': (37.786765, -122.392053),
  'has_image': True,
  'has_map': True,
  'id': '6047926965',
  'name': 'Luxury Rincon Hill Studio (399 Fremont), short or long-term lease',
  'price': '$2500',
  'url': 'http://sfbay.craigslist.org/sfc/apa/6047926965.html',
  'where': 'SOMA / south beach'}]