## Web Scraping Ford Mustangs from Cars.com

In [2]:
import requests
from bs4 import BeautifulSoup

There are 11 pages of Mustangs for sale on Cars.com, so we have to scrape all 11. First, loop over the page numbers and collect each page's URL in a list.

In [3]:
pages = []

for i in range(1, 12):
    url = 'https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=21712&mkId=20015&page=' + str(i) + '&perPage=20&rd=30&searchSource=PAGINATION&sort=relevance&stkTypId=28881&zc=20814'
    pages.append(url)
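
Concatenating the page number into a long query string works, but it is easy to mistype. A minimal sketch of the same idea using `urllib.parse.urlencode` from the standard library follows. The helper name `build_page_url` is hypothetical, and the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.cars.com/for-sale/searchresults.action/'

def build_page_url(page_number):
    """Return the search-results URL for one page (hypothetical helper)."""
    # Parameter names and values are taken from the URL used in the notebook
    params = {
        'dealerType': 'all',
        'mdId': 21712,
        'mkId': 20015,
        'page': page_number,
        'perPage': 20,
        'rd': 30,
        'searchSource': 'PAGINATION',
        'sort': 'relevance',
        'stkTypId': 28881,
        'zc': 20814,
    }
    return BASE_URL + '?' + urlencode(params)

# Pages are numbered 1 through 11
pages = [build_page_url(n) for n in range(1, 12)]
```

Keeping the parameters in a dictionary makes it easier to change the ZIP code or page size later without editing the string by hand.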

In [4]:
# Let's check that the URLs look correct

pages[:5]

['https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=21712&mkId=20015&page=1&perPage=20&rd=30&searchSource=PAGINATION&sort=relevance&stkTypId=28881&zc=20814',
 'https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=21712&mkId=20015&page=2&perPage=20&rd=30&searchSource=PAGINATION&sort=relevance&stkTypId=28881&zc=20814',
 'https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=21712&mkId=20015&page=3&perPage=20&rd=30&searchSource=PAGINATION&sort=relevance&stkTypId=28881&zc=20814',
 'https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=21712&mkId=20015&page=4&perPage=20&rd=30&searchSource=PAGINATION&sort=relevance&stkTypId=28881&zc=20814',
 'https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=21712&mkId=20015&page=5&perPage=20&rd=30&searchSource=PAGINATION&sort=relevance&stkTypId=28881&zc=20814']

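Now fetch a page and parse it with BeautifulSoup so the HTML is easier to read and the data is easier to extract. Note that `url` still holds the last URL built in the loop above.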

In [8]:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Won't print entire html for the purpose of this notebook
# print(soup.prettify())

<!DOCTYPE doctype html>
<html lang="en">
 <head>
  <meta content="app-id=353263352" name="apple-itunes-app"/>
  <link href="/cldstatic/header/0.1.32/manifest.json" rel="manifest"/>
  <link href="/cldstatic/header/0.1.32/header.component.css" rel="stylesheet">
   <script defer="" src="/cldstatic/header/0.1.32/header.component.js">
   </script>
   <meta charset="utf-8"/>
   <meta content="ie=edge" http-equiv="x-ua-compatible"/>
   <meta content="width=device-width, initial-scale=1, viewport-fit=cover" name="viewport"/>
   <title>
    Cars.com
   </title>
   <meta content="noindex, follow" name="robots"/>
   <meta content="telephone=no" name="format-detection"/>
   <link href="/static/www/fb628e116171/manifest.json" rel="manifest"/>
   <meta content="app-id=353263352" name="apple-itunes-app"/>
   <link href="https://www.cstatic-images.com" rel="preconnect"/>
   <link href="https://fonts.gstatic.com" rel="preconnect"/>
   <link href="https://assets.adobedtm.com" rel="preconnect"/>
   <link

Now, we can see where the data we want is located

In [9]:
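# Loop over all eleven pages and append each matching element to a new list

l = []

for page_url in pages:
    page = requests.get(page_url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # The phone blocks carry the listing details in their attributes
    phone_divs = soup.find_all(class_='listing-row__phone obscure')
    # Use distinct loop variable names so the inner loop does not shadow the outer one
    for div in phone_divs:
        l.append(div)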

In [10]:
# Let's see if the results look okay

l[:5]

[<div class="listing-row__phone obscure">
 <div class="fake-number"> (<span></span>) <span></span>-<span></span></div>
                                             (301) 945-9727
                                         </div>,
 <div class="listing-row__phone obscure" data-customlink="page-state" data-linkname="dealer-phone-srp" data-phone-click=' {"listingId":776385613,"mkId":20015,"mkNm":"Ford","mdId":21712,"mdNm":"Mustang","trimId":35423,"trimName":"V6 Premium","modelYearId":51683,"modelYear":2014,"stkTyp":"Used","state":"MD","zipcode":"20814","phone":"3019459727","sellerId":467378,"apigeeHost":"https://api.cars.com","shadowSettingsProxy":"/consumer-shadow-settings/","srpAPIKey":"Fi1DINVeB0SQhnD1FvtAfUX0KuwHF8Al","bodystyleName":"Coupe","price":14850,"dealerName":"Kensington Auto Sales Inc","customerId":188866,"stockType":"Used","cpo":false}'>
 <div class="fake-number"> (<span></span>) <span></span>-<span></span></div>
                                             (301) 945-9727
    

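Each listing appears twice: once as a bare phone block and once with a `data-phone-click` attribute that holds the listing details. The detailed copies sit at the odd positions, so we keep every other item, starting from index 1.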

In [11]:
all_cars = l[1::2]
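
The slice `l[1::2]` starts at index 1 and takes every second item. The toy list below (made-up strings standing in for the scraped elements) shows the effect, along with an alternative sketch that filters on the attribute itself rather than on position, which would still work if the elements ever stopped alternating.

```python
# Made-up stand-ins for the scraped elements: bare blocks alternate with detailed ones
items = [
    '<div class="listing-row__phone obscure">',
    '<div class="listing-row__phone obscure" data-phone-click="{}">',
    '<div class="listing-row__phone obscure">',
    '<div class="listing-row__phone obscure" data-phone-click="{}">',
]

# Positional approach: start at index 1, step by 2
by_position = items[1::2]

# Content-based approach: keep only items that carry the attribute
by_content = [item for item in items if 'data-phone-click' in item]
```

With the real BeautifulSoup elements, the same check could be written with `has_attr('data-phone-click')` on each element.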

Now, transform the list into a pandas DataFrame so it can be cleaned

In [13]:
import pandas as pd
import numpy as np

In [14]:
df = pd.Series(all_cars)

In [15]:
data = df.to_frame('text')

In [16]:
# Check the data just in case
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 1 columns):
text    220 non-null object
dtypes: object(1)
memory usage: 1.8+ KB


We can start cleaning the data

In [17]:
# Convert each element to a string so we can manipulate it easily
data['text2'] = data['text'].astype('str')

In [21]:
data['text2'][1]

'<div class="listing-row__phone obscure" data-customlink="page-state" data-linkname="dealer-phone-srp" data-phone-click=\' {"listingId":773415394,"mkId":20015,"mkNm":"Ford","mdId":21712,"mdNm":"Mustang","trimId":29632,"trimName":"","modelYearId":35797618,"modelYear":2018,"stkTyp":"Used","state":"VA","zipcode":"20814","phone":"8778877910","sellerId":56915476,"apigeeHost":"https://api.cars.com","shadowSettingsProxy":"/consumer-shadow-settings/","srpAPIKey":"Fi1DINVeB0SQhnD1FvtAfUX0KuwHF8Al","bodystyleName":"Coupe","price":23497,"dealerName":"Ourisman Honda of Tysons Corner","customerId":5386115,"stockType":"Used","cpo":false}\'>\n<div class="fake-number"> (<span></span>) <span></span>-<span></span></div>\n                                            (877) 887-7910\n                                        </div>'

In [22]:
# We are only interested in the text between the curly braces
data['text3'] = data['text2'].apply(lambda st: st[st.find("{")+1:st.find("}")])
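
The slice above relies on `str.find`, which returns the index of the first match. The sketch below walks through it on a shortened, made-up sample string. It works here because the JSON payload contains no nested braces.

```python
# Shortened, made-up sample in the same shape as the scraped text
sample = '<div data-phone-click=\' {"mkNm":"Ford","price":14850}\'>(301) 945-9727</div>'

start = sample.find('{')  # index of the first opening brace
end = sample.find('}')    # index of the first closing brace

# Text between the braces, as in the cell above
inner = sample[start + 1:end]

# The same span with the braces kept, which is valid JSON
with_braces = sample[start:end + 1]
```

If a string has no braces, `find` returns -1 and the slice silently yields unexpected text, so checking `start != -1` first is a reasonable safeguard.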

In [23]:
data['text3'].head(15)

0     "listingId":776385613,"mkId":20015,"mkNm":"For...
1     "listingId":773415394,"mkId":20015,"mkNm":"For...
2     "listingId":776082126,"mkId":20015,"mkNm":"For...
3     "listingId":773039496,"mkId":20015,"mkNm":"For...
4     "listingId":776639811,"mkId":20015,"mkNm":"For...
5     "listingId":769175771,"mkId":20015,"mkNm":"For...
6     "listingId":772129975,"mkId":20015,"mkNm":"For...
7     "listingId":771098935,"mkId":20015,"mkNm":"For...
8     "listingId":770461485,"mkId":20015,"mkNm":"For...
9     "listingId":776811460,"mkId":20015,"mkNm":"For...
10    &quot;listingId&quot;:775766626,&quot;mkId&quo...
11    "listingId":777053169,"mkId":20015,"mkNm":"For...
12    "listingId":773040943,"mkId":20015,"mkNm":"For...
13    "listingId":775135983,"mkId":20015,"mkNm":"For...
14    "listingId":776332904,"mkId":20015,"mkNm":"For...
Name: text3, dtype: object

Not all entries look the same: row 10 uses the HTML entity `&quot;` instead of a literal quotation mark. We can replace it.

In [24]:
data['text3'] = data['text3'].str.replace('&quot;', '"')
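
`&quot;` is only one of several HTML entities that could show up in scraped text. As an alternative sketch, the standard library's `html.unescape` decodes all of them at once. This is a swapped-in technique, not what the notebook uses.

```python
import html

# Start of row 10 as it appeared before cleaning
escaped = '&quot;listingId&quot;:775766626,&quot;mkId&quot;:20015'

# Decode every HTML entity, not just &quot;
unescaped = html.unescape(escaped)
```

On the DataFrame, this could be applied with `data['text3'].apply(html.unescape)`.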

In [25]:
data['text3'].head(15)

0     "listingId":776385613,"mkId":20015,"mkNm":"For...
1     "listingId":773415394,"mkId":20015,"mkNm":"For...
2     "listingId":776082126,"mkId":20015,"mkNm":"For...
3     "listingId":773039496,"mkId":20015,"mkNm":"For...
4     "listingId":776639811,"mkId":20015,"mkNm":"For...
5     "listingId":769175771,"mkId":20015,"mkNm":"For...
6     "listingId":772129975,"mkId":20015,"mkNm":"For...
7     "listingId":771098935,"mkId":20015,"mkNm":"For...
8     "listingId":770461485,"mkId":20015,"mkNm":"For...
9     "listingId":776811460,"mkId":20015,"mkNm":"For...
10    "listingId":775766626,"mkId":20015,"mkNm":"For...
11    "listingId":777053169,"mkId":20015,"mkNm":"For...
12    "listingId":773040943,"mkId":20015,"mkNm":"For...
13    "listingId":775135983,"mkId":20015,"mkNm":"For...
14    "listingId":776332904,"mkId":20015,"mkNm":"For...
Name: text3, dtype: object

It seems to be correct now

Let's split the text into multiple new columns for each category

In [26]:
splitt = data['text3'].str.split(",", expand = True)

In [27]:
data['Make'] = splitt[2].str.split(':').str[1]
data['Model'] = splitt[4].str.split(':').str[1]
data['Trim'] = splitt[6].str.split(':').str[1]
data['Year'] = splitt[8].str.split(':').str[1]
data['Condition'] = splitt[9].str.split(':').str[1]
data['State'] = splitt[10].str.split(':').str[1]
data['ZipCode'] = splitt[11].str.split(':').str[1]
data['Price'] = splitt[18].str.split(':').str[1]
data['Dealer'] = splitt[19].str.split(':').str[1]
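
Splitting on commas depends on every field staying in the same position and on no value containing a comma (a dealer name with a comma in it would shift the columns). Because the attribute holds JSON, one alternative is to parse it with `json.loads` and look fields up by key. The sketch below uses a sample abbreviated from the first listing shown earlier; the helper name `parse_listing` is hypothetical.

```python
import json

def parse_listing(text3):
    """Parse the text between the braces into a dict (hypothetical helper)."""
    # text3 had its braces stripped earlier, so put them back before parsing
    return json.loads('{' + text3 + '}')

# Abbreviated from the first listing's attribute
sample = ('"mkNm":"Ford","mdNm":"Mustang","trimName":"V6 Premium",'
          '"modelYear":2014,"stkTyp":"Used","state":"MD","zipcode":"20814",'
          '"price":14850,"dealerName":"Kensington Auto Sales Inc"')

record = parse_listing(sample)

# Map the JSON keys to the column names used in this notebook
row = {
    'Make': record['mkNm'],
    'Model': record['mdNm'],
    'Trim': record['trimName'],
    'Year': record['modelYear'],
    'Condition': record['stkTyp'],
    'State': record['state'],
    'ZipCode': record['zipcode'],
    'Price': record['price'],
    'Dealer': record['dealerName'],
}
```

Values parsed this way arrive without surrounding quotation marks, and the numeric fields are already integers.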

In [28]:
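# Keep only the new columns; .copy() avoids a SettingWithCopyWarning when we modify them later
data = data[['Make', 'Model', 'Trim', 'Year', 'Condition',
       'State', 'ZipCode', 'Price', 'Dealer']].copy()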

Check on the data

In [29]:
data.head()

| | Make | Model | Trim | Year | Condition | State | ZipCode | Price | Dealer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | "Ford" | "Mustang" | "V6 Premium" | 2014 | "Used" | "MD" | "20814" | 14850 | "Kensington Auto Sales Inc" |
| 1 | "Ford" | "Mustang" | "" | 2018 | "Used" | "VA" | "20814" | 23497 | "Ourisman Honda of Tysons Corner" |
| 2 | "Ford" | "Mustang" | "GT" | 2017 | "Used" | "MD" | "20814" | 29970 | "DARCARS Volkswagen of Silver Spring" |
| 3 | "Ford" | "Mustang" | "GT" | 2018 | "Used" | "MD" | "20814" | 31500 | "Sheehy Ford Lincoln of Gaithersburg" |
| 4 | "Ford" | "Mustang" | "EcoBoost Premium" | 2016 | "Used" | "MD" | "20814" | 19900 | "Academy Ford" |


In [31]:
# Get rid of the quotation marks in every text column
for col in ['Make', 'Model', 'Trim', 'Condition', 'State', 'ZipCode', 'Dealer']:
    data[col] = data[col].astype('str').str.replace('"', '')

In [32]:
data.head()

| | Make | Model | Trim | Year | Condition | State | ZipCode | Price | Dealer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Ford | Mustang | V6 Premium | 2014 | Used | MD | 20814 | 14850 | Kensington Auto Sales Inc |
| 1 | Ford | Mustang | | 2018 | Used | VA | 20814 | 23497 | Ourisman Honda of Tysons Corner |
| 2 | Ford | Mustang | GT | 2017 | Used | MD | 20814 | 29970 | DARCARS Volkswagen of Silver Spring |
| 3 | Ford | Mustang | GT | 2018 | Used | MD | 20814 | 31500 | Sheehy Ford Lincoln of Gaithersburg |
| 4 | Ford | Mustang | EcoBoost Premium | 2016 | Used | MD | 20814 | 19900 | Academy Ford |
