# Source A

A notebook for scraping, processing, and storing (to Airtable) the listings in [Source A](https://www.mhvillage.com/parks).

## Scrape Data

For this trial, I'll be scraping listings specifically from the state of California.

In [3]:
from bs4 import BeautifulSoup
import requests
import time

In [4]:
# Restrict the scrape to California (i.e. /parks/ca)
main_url = 'https://www.mhvillage.com'
url = 'https://www.mhvillage.com/parks/ca'

In [5]:
response = requests.get(url)
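# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(response.text, 'html.parser')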

Get the sublink for each city.

In [6]:
city_list = soup.find_all('app-state-item', attrs={'class': 'ng-star-inserted'})
city_name = [c.select_one('a > strong').text.strip() for c in city_list]
links_per_city = [main_url + c.select_one('a')['href'] for c in city_list]
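
The links above are built by concatenating `main_url` with each `href`, which assumes every `href` is a root-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with an `href` that is already absolute. The helper name `absolute_link` and the example paths are illustrative assumptions, not values confirmed from the site.

```python
from urllib.parse import urljoin

MAIN_URL = 'https://www.mhvillage.com'


def absolute_link(href, base=MAIN_URL):
    """Resolve an href against the site root (hypothetical helper)."""
    return urljoin(base, href)


# Hypothetical hrefs, used only to illustrate the behavior
print(absolute_link('/parks/ca/acampo'))
print(absolute_link('https://www.mhvillage.com/parks/2334'))
```

A root-relative path is attached to the base, while an absolute URL passes through unchanged.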


Loop over each city sublink and collect its listings. Note that a city page does not show only that city's listings; it also includes "nearby" ones. We therefore check the text of each listing's `ui-city-state-zip-widget` element to confirm that the listing belongs to that exact city.
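
The loop below tests membership with `city in widget_text`, which is a substring match. A stricter check can be sketched on the widget text alone, assuming it follows the `City, ST ZIP` pattern that the later parsing relies on. The helper name `is_in_city` and the sample strings are hypothetical.

```python
def is_in_city(widget_text, city):
    """Return True only when the text before the first comma equals `city`.

    Assumes the widget text looks like 'City, ST ZIP'.
    """
    listed_city = widget_text.split(',')[0].strip()
    return listed_city.casefold() == city.strip().casefold()


# Hypothetical widget text, used only for illustration
sample = 'Alta Loma, CA 91701'
print(is_in_city(sample, 'Alta Loma'))
print(is_in_city(sample, 'Alta'))
```

With a substring test, a hypothetical city named "Alta" would also match "Alta Loma"; comparing the full city portion avoids that.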

In [7]:
listings_dct = {}
total_items = 0

for city, link in zip(city_name, links_per_city):
    response_per_city = requests.get(link)
    soup_per_city = BeautifulSoup(response_per_city.text, 'html.parser')
    
    listings = soup_per_city.find_all('ui-park-card', attrs={'class': 'ng-star-inserted'})
    
    # Only retain the listings exactly within the city
    listings_dct[city] = [l for l in listings if city in l.find('ui-city-state-zip-widget').text]
    
    total_items += len(listings_dct[city])
    print(f'Currently stored {total_items} listings ...')
    
    # Stop once at least 50 listings have been collected
    if total_items >= 50:
        break
    
    time.sleep(5)

Currently stored 3 listings ...
Currently stored 8 listings ...
Currently stored 15 listings ...
Currently stored 16 listings ...
Currently stored 18 listings ...
Currently stored 19 listings ...
Currently stored 20 listings ...
Currently stored 24 listings ...
Currently stored 25 listings ...
Currently stored 26 listings ...
Currently stored 33 listings ...
Currently stored 34 listings ...
Currently stored 36 listings ...
Currently stored 40 listings ...
Currently stored 41 listings ...
Currently stored 46 listings ...
Currently stored 65 listings ...


Generate the sublink and a `BeautifulSoup` object for each listing.

In [8]:
listings_link_dct = {}

for k in listings_dct.keys():
    listings_link_dct[k] = [main_url + i.find('a')['href'] for i in listings_dct[k]]

In [9]:
# Create a separate BeautifulSoup object per listing so the pages are not re-requested on every parsing attempt
listings_soup_dct = {}

for city in listings_link_dct.keys():
    listing_links = listings_link_dct[city]
    listings_soup_dct[city] = []
    
    for link in listing_links:
        response_per_listing = requests.get(link)
        listings_soup_dct[city].append(BeautifulSoup(response_per_listing.text, 'html.parser'))
        
        time.sleep(5)
    
    print(f'Finished extracting for {city}')

Finished extracting for Acampo
Finished extracting for Acton
Finished extracting for Adelanto
Finished extracting for Adin
Finished extracting for Aerial Acres
Finished extracting for Agoura Hills
Finished extracting for Agua Dulce
Finished extracting for Aguanga
Finished extracting for Ahwahnee
Finished extracting for Albion
Finished extracting for Alpine
Finished extracting for Alta Loma
Finished extracting for Altaville
Finished extracting for Alturas
Finished extracting for Alviso
Finished extracting for American Canyon
Finished extracting for Anaheim


In [10]:
# Park name = h1
# Street address = ui-street-address-widget (need to strip spaces on both sides and comma on the end)
# City = city variable
# State = ui-city-state-zip-widget (split with ", " then split with space - first one is state)
# ZIP = ui-city-state-zip-widget (split with ", " then split with space - second one is ZIP)
# Phone number = there seems to be none (N/A)
# Website = https://www.mhvillage.com/
# Total lots = Number of sites
# Rent range = 
# Pet policy = 
# Amenities = 
# Latitude = google-map-link attrs={'data-test-id': 'view-on-map'} its `a` child -> href -> split 'query=' then split at ',' get the first
# Longitude = google-map-link attrs={'data-test-id': 'view-on-map'} its `a` child -> href -> split 'query=' then split at ',' get the second
# Source website URL = lnk
# Listing Price = N/A
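
The state and ZIP rules in the notes above can be sketched with a single regular expression instead of chained `split` calls. This assumes the widget text has the form `City, ST ZIP` with a two-letter state and a 5-digit (or ZIP+4) code; the function name `parse_city_state_zip` is hypothetical.

```python
import re

# Assumed layout: 'City, ST 12345' or 'City, ST 12345-6789'
CITY_STATE_ZIP = re.compile(
    r'^\s*(?P<city>.+?),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$'
)


def parse_city_state_zip(text):
    """Return (city, state, zip) or None when the text does not fit the pattern."""
    match = CITY_STATE_ZIP.match(text)
    if match is None:
        return None
    return match.group('city'), match.group('state'), match.group('zip')


print(parse_city_state_zip('Acampo, CA 95220'))
```

Returning `None` for unexpected text makes a malformed widget visible instead of raising an `IndexError` in the middle of the loop.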

In [11]:
listings_details = {}

for city in listings_soup_dct.keys():
    listings_soup_lst = listings_soup_dct[city]
    listings_link_lst = listings_link_dct[city]
    listings_details[city] = []
    
    for sp, lnk in zip(listings_soup_lst, listings_link_lst):
        # Basic info
        park_name = sp.find('h1').text.strip()
        street_address = sp.find('ui-street-address-widget').text.strip().rstrip(',')
        state = sp.find('ui-city-state-zip-widget').text.split(', ')[1].split(' ')[0]
        zipcode = sp.find('ui-city-state-zip-widget').text.split(', ')[1].split(' ')[1]
        phone_number = 'N/A'
        website = 'https://www.mhvillage.com/'
        
        # Total lots
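        try:
            # select_one returns a single element; select returns a list, which has no .text
            total_lots = sp.select_one('div#forSaleCount').text.strip()
        except AttributeError:
            total_lots = 'N/A'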
        
        # Average monthly rent
        try:
            rent_range_ul = sp.select_one('strong[translate="parks.averageMonthlyRent"] + ul')
            rent_range_bullets = [li.get_text(strip=True) for li in rent_range_ul.find_all('li')]
            rent_range = '\n'.join(rent_range_bullets)
        except AttributeError:
            rent_range = 'N/A'
        
        # Get pet policy text
        try:
            pet_policy_ul = sp.select_one('strong[translate="parks.petPolicies"] + ul')
            pet_policy_bullets = [li.get_text(strip=True) for li in pet_policy_ul.find_all('li')]
            pet_policy = '\n'.join(pet_policy_bullets)
        except AttributeError:
            pet_policy = 'N/A'
        
        # Get amenities text
        try:
            amenities_ul = sp.select_one('strong[translate="parks.amenities"] + ul')
            amenities_bullets = [li.get_text(strip=True) for li in amenities_ul.find_all('li')]
            amenities = '\n'.join(amenities_bullets)
            amenities = amenities.rstrip(':')
        except AttributeError:
            amenities = 'N/A'
        
        # Get latitude and longitude
        gmaps_link = sp.find('google-map-link', attrs={'data-test-id': 'view-on-map'}).find('a')['href']
        latitude = gmaps_link.split('query=')[1].split(',')[0]
        longitude = gmaps_link.split('query=')[1].split(',')[1]
        
        # Other details
        source_url = lnk
        listing_price = 'N/A'
        
        listings_details[city].append(
            {
                'Park Name': park_name,
                'Street Address': street_address,
                'City': city,
                'State': state,
                'ZIP': zipcode,
                'Phone Number': phone_number,
                'Website': website,
                'Total Lots': total_lots,
                'Rent Range': rent_range,
                'Pet Policy': pet_policy,
                'Amenities': amenities,
                'Latitude': latitude,
                'Longitude': longitude,
                'Source Website URL': source_url,
                'Listing Price': listing_price
            }
        )
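
The latitude and longitude above are taken by splitting the map link on `query=`. A slightly more defensive sketch parses the query string with `urllib.parse` and returns floats. The example URL is an assumption about what the link looks like (only the `query=LAT,LON` part is implied by the code above), and `parse_coordinates` is a hypothetical name.

```python
from urllib.parse import parse_qs, urlparse


def parse_coordinates(maps_href):
    """Return (latitude, longitude) as floats, or None if they cannot be read."""
    query_values = parse_qs(urlparse(maps_href).query).get('query')
    if not query_values:
        return None
    lat_text, _, lon_text = query_values[0].partition(',')
    try:
        return float(lat_text), float(lon_text)
    except ValueError:
        return None


# Assumed link shape; the coordinates are those of the first listing shown in the output
href = 'https://www.google.com/maps/search/?api=1&query=38.19398,-121.19886'
print(parse_coordinates(href))
```

Unlike the string split, this also handles a percent-encoded comma (`%2C`) and ignores any parameters that follow `query`.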
    

In [12]:
listings_details

{'Acampo': [{'Park Name': 'Arbor Mobile Home Park',
   'Street Address': '19690 North Highway 99',
   'City': 'Acampo',
   'State': 'CA',
   'ZIP': '95220',
   'Phone Number': 'N/A',
   'Website': 'https://www.mhvillage.com/',
   'Total Lots': 'N/A',
   'Rent Range': 'N/A',
   'Pet Policy': 'Pets Allowed: Yes\nPet Restrictions: No Pit Bulls',
   'Amenities': 'Spa\nBasketball Court\nBilliard Room\nLibrary\nWiFi',
   'Latitude': '38.19398',
   'Longitude': '-121.19886',
   'Source Website URL': 'https://www.mhvillage.com/parks/2334',
   'Listing Price': 'N/A'},
  {'Park Name': 'Mokelumne Beach RV Park',
   'Street Address': '18450 Calfornia 99',
   'City': 'Acampo',
   'State': 'CA',
   'ZIP': '95220',
   'Phone Number': 'N/A',
   'Website': 'https://www.mhvillage.com/',
   'Total Lots': 'N/A',
   'Rent Range': 'N/A',
   'Pet Policy': 'N/A',
   'Amenities': 'WiFi',
   'Latitude': '38.14962',
   'Longitude': '-121.26043',
   'Source Website URL': 'https://www.mhvillage.com/parks/122348',


## Pass to Airtable

In [13]:
from pyairtable import Api, Base
from dotenv import load_dotenv
import os

Load environment variables from the `.env` file.

In [14]:
load_dotenv()

True

In [15]:
AIRTABLE_PAT = os.getenv('AIRTABLE_PAT')
BASE_ID = os.getenv('BASE_ID')
TABLE_NAME = 'Listings'

In [16]:
api = Api(AIRTABLE_PAT)
base = Base(api, BASE_ID)
table = base.table(TABLE_NAME)

Flatten the per-city dictionary of listing details into a single list of records.

In [17]:
records = []

for city in listings_details.keys():
    listings_details_lst = listings_details[city]
    
    for listing in listings_details_lst:
        records.append(listing)
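
The same flattening can be written as one expression with `itertools.chain`. This is only an equivalent sketch of the loop above; the function name `flatten_listings` and the sample data are hypothetical.

```python
from itertools import chain


def flatten_listings(details_by_city):
    """Concatenate the per-city lists into one flat list, keeping city order."""
    return list(chain.from_iterable(details_by_city.values()))


# Hypothetical miniature of listings_details
sample = {
    'Acampo': [{'Park Name': 'A'}, {'Park Name': 'B'}],
    'Acton': [{'Park Name': 'C'}],
}
print(len(flatten_listings(sample)))
```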

In [18]:
for r in records:
    table.create(r)
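
Creating records one at a time issues one request per listing. As a possible alternative, pyairtable tables expose `batch_create`, which accepts the whole list and sends it in batches. The wrapper below is a minimal sketch: `upload_records` is a hypothetical name, and it only assumes that `table` has a `batch_create(records)` method.

```python
def upload_records(table, records):
    """Send all records through table.batch_create and return the created rows.

    `table` is assumed to behave like a pyairtable Table; nothing is sent
    when `records` is empty.
    """
    if not records:
        return []
    return table.batch_create(records)
```

With the objects defined earlier, the call would be `upload_records(table, records)`.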

## Export with Pandas

In [19]:
import pandas as pd

In [21]:
source_A_df = pd.DataFrame(records)

In [23]:
# index=False keeps the DataFrame's row index out of the CSV
source_A_df.to_csv('source_A_listings_only.csv', index=False)