<a href="https://colab.research.google.com/github/miguel-peralta/cars_ista322/blob/main/cars.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cars Relational Databases
ISTA 322 Final Project, Spring 2024 <br>
Miguel Candido Aurora Peralta <br>
## Extract
### KBB Web Scraping
KBB separates cars into new and used categories and lists each category on its own page. The two lists will be combined into one for this project.
#### Used Cars



In [2]:
# Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

In [3]:
def get_html_doc(url):
  '''Returns the HTML document for the given URL as a string'''
  # Request the page at the URL
  response = requests.get(url)
  # Return the response body as text (HTML, not JSON)
  return response.text

In [4]:
def get_models_df(page_type):
  '''
  Returns a dataframe containing the url, make, model, and year from the
  relative URLs listed on the car models list.
  Args:
    page_type (string): either "new" or "used"
  Returns:
    car_info (DataFrame): make, model, year, and url of each model
  '''
  # Create HTML object from url
  url = f'https://www.kbb.com/car-make-model-list/{page_type}/view-all/model/'
  html = get_html_doc(url)
  soup = BeautifulSoup(html, 'html.parser')
  # Create list to store relative URLs from the page
  links = []
  # Get all links from the page (they all have the attribute target="_self")
  for link in soup.find_all('a', attrs={'target': '_self'}):
    # Add links to the list
    links.append(link.get('href'))

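  # Drop the navigation and footer links around the car list by position;
  # the new cars page has one extra footer link
  if page_type == "used":
    car_links = links[40:-17]
  elif page_type == "new":
    car_links = links[40:-18]
  else:
    raise ValueError('page_type must be "new" or "used"')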

  # Create dataframe and lists to store info from split links
  car_info = pd.DataFrame(columns=['make', 'model', 'year'])
  urls = []
  make = []
  model = []
  year = []
  base_url = 'https://www.kbb.com'

  # Split links using / as delimiter and add information to lists
  for car in car_links:
    urls.append(base_url+car)
    link_split = car.split('/')
    make.append(link_split[1])
    model.append(link_split[2])
    year.append(link_split[3])

  # Use lists to populate dataframe
  car_info['url'] = urls
  car_info['make'] = make
  car_info['model'] = model
  car_info['year'] = year

  return car_info
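
The link-splitting step inside `get_models_df` relies on every relative URL having the shape `/<make>/<model>/<year>/`. As a minimal sketch of that assumption, the helper below (`parse_car_link` and the `CAR_LINK` pattern are hypothetical names, not part of the notebook) parses one link with a regular expression and returns `None` for anything that does not end in a four-digit year, such as a footer link.

```python
import re

# Assumed shape of a model link: "/<make>/<model>/<4-digit year>/"
CAR_LINK = re.compile(r'^/(?P<make>[^/]+)/(?P<model>[^/]+)/(?P<year>\d{4})/?$')

def parse_car_link(href):
    """Return a dict with make, model, and year, or None if href is not a model link."""
    match = CAR_LINK.match(href)
    if match is None:
        return None
    info = match.groupdict()
    info['year'] = int(info['year'])  # store the year as an integer
    return info

print(parse_car_link('/porsche/911/2024/'))
print(parse_car_link('/faq/new-cars/'))
```

Validating each link this way would be one alternative to trimming the list by fixed positions, since it does not depend on how many navigation or footer links the page happens to contain.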


In [5]:
used_cars = get_models_df('used')
new_cars = get_models_df('new')
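
Since the two lists are meant to end up in a single table, one possible way to combine them is sketched below. This is an assumption about the approach, not the notebook's own code: `combine_listings` and the `listing` column are hypothetical names, and the small frames are stand-ins with the same columns that `get_models_df` returns (the used-car row is illustrative only).

```python
import pandas as pd

def combine_listings(used, new):
    """Stack the two listings and tag each row with the page it came from."""
    return pd.concat(
        [used.assign(listing='used'), new.assign(listing='new')],
        ignore_index=True,
    )

# Stand-in data; the used-car row is illustrative, not scraped
used_sample = pd.DataFrame({
    'make': ['porsche'], 'model': ['911'], 'year': ['2019'],
    'url': ['https://www.kbb.com/porsche/911/2019/'],
})
new_sample = pd.DataFrame({
    'make': ['porsche', 'bmw'], 'model': ['911', '8-series'], 'year': ['2024', '2023'],
    'url': ['https://www.kbb.com/porsche/911/2024/',
            'https://www.kbb.com/bmw/8-series/2023/'],
})

all_cars = combine_listings(used_sample, new_sample)
print(all_cars)
```

Keeping a `listing` column preserves which page each row came from after the frames are stacked.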

In [7]:
new_cars.tail()

| | make | model | year | url |
|---|---|---|---|---|
| 1034 | bmw | 8-series | 2023.0 | https://www.kbb.com/bmw/8-series/2023/ |
| 1035 | ferrari | 812-gts | 2023.0 | https://www.kbb.com/ferrari/812-gts/2023/ |
| 1036 | porsche | 911 | 2024.0 | https://www.kbb.com/porsche/911/2024/ |
| 1037 | porsche | 911 | 2023.0 | https://www.kbb.com/porsche/911/2023/ |
| 1038 | faq | new-cars | | https://www.kbb.com/faq/new-cars/ |
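
The last row above (`faq`, `new-cars`) is a site link rather than a car model, and it has no year. A minimal sketch of one way to drop such rows follows, assuming that a missing or non-numeric year is a reliable sign of a non-model link; `drop_non_model_rows` is a hypothetical helper, and the sample frame reuses rows from the output above.

```python
import pandas as pd

def drop_non_model_rows(cars):
    """Keep rows whose year parses as a number and store the year as an integer."""
    # Non-numeric or empty years become NaN instead of raising an error
    years = pd.to_numeric(cars['year'], errors='coerce')
    cleaned = cars.loc[years.notna()].assign(year=years.dropna().astype(int))
    return cleaned.reset_index(drop=True)

sample = pd.DataFrame({
    'make': ['porsche', 'porsche', 'faq'],
    'model': ['911', '911', 'new-cars'],
    'year': ['2024', '2023', ''],
    'url': ['https://www.kbb.com/porsche/911/2024/',
            'https://www.kbb.com/porsche/911/2023/',
            'https://www.kbb.com/faq/new-cars/'],
})

cleaned = drop_non_model_rows(sample)
print(cleaned)
```

The same function works whether the year column holds strings, as produced by splitting the URL, or floats with missing values, as in the output shown above.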
