In [1]:
# Import libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml

### Step 1: Create a soup object from the home page

- Fetch the page below with `requests.get` and parse the returned HTML with `BeautifulSoup` from `bs4`.
- After confirming that `status_code` is `200`, I parse the response with the `lxml` parser and display the parsed HTML.
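
As a hedged sketch of the status check described above, `raise_for_status()` raises an exception on a 4xx/5xx response, so a failed request cannot be parsed by accident. The helper name `fetch_soup` and its `session` and `timeout` parameters are hypothetical, and `html.parser` stands in for `lxml` only so the sketch runs without an extra dependency.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Fetch `url` and return its parsed HTML (hypothetical helper)."""
    http = session or requests  # a requests.Session can be passed in for reuse
    res = http.get(url, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    return BeautifulSoup(res.text, 'html.parser')
```

With this helper, the cell below could be reduced to `soup = fetch_soup(url)`.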

In [2]:
url = 'https://pages.git.generalassemb.ly/rldaggie/for-scraping/'
res = requests.get(url)

res.status_code

soup = BeautifulSoup(res.text, 'lxml') # parse html for python
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Nutrition Information</title>
<link crossorigin="anonymous" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" rel="stylesheet"/>
</head>
<body>
<header>
<section class="container">
<nav class="navbar navbar-expand-lg navbar-light bg-light" role="navigation">
<a class="navbar-brand" href="/">Nutrition Information</a> </nav>
</section>
</header>
<main class="container" role="main">
<br/>
<div class="alert alert-danger">
        NOTE: This data is super old and rife with errors. It's meant for scraping practice only.
      </div>
<table class="table" id="restaurants">
<thead>
<tr>
<th>Name</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="restaurants/1.html">A&amp;W Restauran

### Step 2: Scrape the home page soup for every restaurant

Note: Your best bet is to create a list of dictionaries, one for each restaurant. Each dictionary contains the restaurant's name and path from the `href`. The result of your scrape should look something like this:

```python
restaurants = [
    {'name': 'A&W Restaurants', 'href': 'restaurants/1.html'}, 
    {'name': "Applebee's", 'href': 'restaurants/2.html'},
    ...
]
```

- From the parsed HTML, we can see that each restaurant's name and `href` are stored inside `<table class="table" id="restaurants">`.
- We can use `find('table', id='restaurants')` to extract just that table.

In [3]:
table = soup.find('table',id='restaurants')
table

<table class="table" id="restaurants">
<thead>
<tr>
<th>Name</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="restaurants/1.html">A&amp;W Restaurants</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/2.html">Applebee's</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/3.html">Arby's</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/4.html">Atlanta Bread Company</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/5.html">Bojangle's Famous Chicken 'n Biscuits</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/6.html">Buffalo Wild Wings</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/7.html">Burger King</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/8.html">Captain D's</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/9.html">Carl's Jr.</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/10.html">Charley's Grilled Subs</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/11.html">Chick-fil-A</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/12.html">Chili's</a> </td>
</tr>
<tr>
<td>
<a href="restaurants/13.html">Chi

- The `table` variable now holds every restaurant's name and `href` as HTML.
- We need to extract them into a variable `restaurants`: a list of dictionaries, each containing a `name` and an `href`.
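
The extraction can be sketched on a small, made-up snippet that has the same shape as the home page table. This variant iterates over the `<a>` tags directly rather than over rows, so it does not depend on where the header row sits. The names `html_demo`, `table_demo`, and `restaurants_demo` are illustrative, and `html.parser` is used only so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up HTML with the same shape as the home page table
html_demo = """
<table class="table" id="restaurants">
  <thead><tr><th>Name</th><th></th></tr></thead>
  <tbody>
    <tr><td><a href="restaurants/1.html">A&amp;W Restaurants</a> </td></tr>
    <tr><td><a href="restaurants/2.html">Applebee's</a> </td></tr>
  </tbody>
</table>
"""

table_demo = BeautifulSoup(html_demo, 'html.parser').find('table', id='restaurants')

# One dictionary per link inside the table
restaurants_demo = [
    {'name': a.text.strip(), 'href': a['href']}
    for a in table_demo.find_all('a')
]
print(restaurants_demo)
```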

In [4]:
# Initialize an empty list to store dictionaries
restaurants = []

# Iterate through each row in the table, skipping the header row
for row in table.find_all('tr')[1:]:
    restaurants_name = row.find('a').text
    restaurants_href = row.find('a')['href']
    restaurants.append({'name':restaurants_name,'href':restaurants_href})

restaurants

[{'name': 'A&W Restaurants', 'href': 'restaurants/1.html'},
 {'name': "Applebee's", 'href': 'restaurants/2.html'},
 {'name': "Arby's", 'href': 'restaurants/3.html'},
 {'name': 'Atlanta Bread Company', 'href': 'restaurants/4.html'},
 {'name': "Bojangle's Famous Chicken 'n Biscuits",
  'href': 'restaurants/5.html'},
 {'name': 'Buffalo Wild Wings', 'href': 'restaurants/6.html'},
 {'name': 'Burger King', 'href': 'restaurants/7.html'},
 {'name': "Captain D's", 'href': 'restaurants/8.html'},
 {'name': "Carl's Jr.", 'href': 'restaurants/9.html'},
 {'name': "Charley's Grilled Subs", 'href': 'restaurants/10.html'},
 {'name': 'Chick-fil-A', 'href': 'restaurants/11.html'},
 {'name': "Chili's", 'href': 'restaurants/12.html'},
 {'name': 'Chipotle Mexican Grill', 'href': 'restaurants/13.html'},
 {'name': "Church's", 'href': 'restaurants/14.html'},
 {'name': 'Corner Bakery Cafe', 'href': 'restaurants/15.html'},
 {'name': 'Dairy Queen', 'href': 'restaurants/16.html'},
 {'name': "Denny's", 'href': 'res

### Step 3: Using the `href`, scrape each restaurant's page and create a single list of food dictionaries.

Your list of foods should look something like this:
```python
foods = [
    {
        'calories': '0',
        'carbs': '0',
        'category': 'Drinks',
        'fat': '0',
        'name': 'A&W® Diet Root Beer',
        'restaurant': 'A&W Restaurants'
    },
    {
        'calories': '0',
        'carbs': '0',
        'category': 'Drinks',
        'fat': '0',
        'name': 'A&W® Diet Root Beer',
        'restaurant': 'A&W Restaurants'
    },
    ...
]
```

**Note**: Remove extra white space from each category

- Before scraping every restaurant, I scrape only the first restaurant in `restaurants` to see what its HTML looks like.
- We combine the original `url` with the first restaurant's `href`, then scrape the HTML from that page.
- Each food item sits in its own `tr`, with its fields in separate `td` elements.
- We store the data from each row's `td` elements in a dictionary, then append it to `foods_sample`.
- `foods_sample` then holds every food from the first restaurant.
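
Concatenating `url` and `href` works here because `url` ends with `/` and each `href` is relative. As an optional alternative, the standard library's `urljoin` resolves the relative path against the base. The variable names below are illustrative.

```python
from urllib.parse import urljoin

base_url = 'https://pages.git.generalassemb.ly/rldaggie/for-scraping/'
href = 'restaurants/1.html'

# A relative href is resolved against the base URL's directory
joined = urljoin(base_url, href)
print(joined)

# Without the trailing slash, the last path segment is treated as a file
# name and gets replaced, so the base URL should keep its trailing slash
joined_no_slash = urljoin(base_url.rstrip('/'), href)
print(joined_no_slash)
```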

In [28]:
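# Build the first restaurant's URL from the base URL and its relative href
full_url_sample = url + restaurants[0]['href']
res_sample = requests.get(full_url_sample)
soup_sample = BeautifulSoup(res_sample.text, 'lxml')
restaurants[0]['name'].strip()  # confirm which restaurant is being sampled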

'A&W Restaurants'

In [None]:
table_sample = soup_sample.find_all('tr')
table_sample 

In [20]:
foods_sample = []
for food in table_sample[1:]:  # Skipping the header row
    td_elements = food.find_all('td')  # Find all td elements in the row

    foods_sample_details = {
        'name': td_elements[0].text.strip(),
        'category': td_elements[1].text.strip(),
        'calories': td_elements[2].text.strip(),
        'fat': td_elements[3].text.strip(),
        'carbs': td_elements[4].text.strip(),
    }
    foods_sample.append(foods_sample_details)


- The example above covers scraping, parsing, accessing the data, and appending it to a list.
- We repeat the same steps in a `for` loop over all restaurants.
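
One way to keep that loop short is to move the row parsing into a function that can be tried on a small, made-up page first. This is a sketch, not the notebook's method: `parse_food_rows`, `FIELDS`, and `sample_html` are hypothetical names, the column order is taken from the sample cell above, and `html.parser` is used only so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Column order as used in the sample cell above
FIELDS = ['name', 'category', 'calories', 'fat', 'carbs']


def parse_food_rows(html, restaurant_name):
    """Return one dictionary per table row that has a cell for every field."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        # strip() removes the extra white space around each value
        cells = [td.text.strip() for td in tr.find_all('td')]
        if len(cells) < len(FIELDS):
            continue  # skip rows without enough <td> cells, such as a header
        row = {'restaurant': restaurant_name.strip()}
        row.update(zip(FIELDS, cells))
        rows.append(row)
    return rows


# Made-up page with the same column layout as a restaurant page
sample_html = """
<table>
  <tr><th>Name</th><th>Category</th><th>Calories</th><th>Fat</th><th>Carbs</th></tr>
  <tr><td>Chili Fries</td><td> French Fries </td><td>370</td><td>15</td><td>49</td></tr>
</table>
"""
parsed_demo = parse_food_rows(sample_html, 'A&W Restaurants')
print(parsed_demo)
```

Inside the loop over `restaurants`, each page's rows could then be added with `foods.extend(parse_food_rows(page_html, restaurant['name']))`, where `page_html` is the fetched page text.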

In [36]:
foods = []

for restaurant in restaurants:
    full_url = url + restaurant['href']
    food_from_website = requests.get(full_url)
    if food_from_website.status_code == 200:
        soup = BeautifulSoup(food_from_website.text, 'lxml')
        for food in soup.find_all('tr')[1:]:  # skip the header row
            cells = food.find_all('td')  # look up the row's cells once
            food_details = {
                'restaurant': restaurant['name'].strip(),
                'name': cells[0].text.strip(),
                'category': cells[1].text.strip(),
                'calories': cells[2].text.strip(),
                'fat': cells[3].text.strip(),
                'carbs': cells[4].text.strip(),
            }
            foods.append(food_details)

In [37]:
foods

[{'restaurant': 'A&W Restaurants',
  'name': 'Original Bacon Double Cheeseburger',
  'category': 'Burgers',
  'calories': '760',
  'fat': '45',
  'carbs': '45'},
 {'restaurant': 'A&W Restaurants',
  'name': 'Coney (Chili) Dog',
  'category': 'Entrees',
  'calories': '340',
  'fat': '20',
  'carbs': '26'},
 {'restaurant': 'A&W Restaurants',
  'name': 'Chili Fries',
  'category': 'French Fries',
  'calories': '370',
  'fat': '15',
  'carbs': '49'},
 {'restaurant': 'A&W Restaurants',
  'name': 'Strawberry Milkshake (small)',
  'category': 'Shakes',
  'calories': '670',
  'fat': '29',
  'carbs': '90'},
 {'restaurant': 'A&W Restaurants',
  'name': 'A&WÂ® Root Beer Freeze (large)',
  'category': 'Shakes',
  'calories': '820',
  'fat': '18',
  'carbs': '150'},
 {'restaurant': 'A&W Restaurants',
  'name': 'Caramel Sundae',
  'category': 'Desserts',
  'calories': '340',
  'fat': '9',
  'carbs': '57'},
 {'restaurant': 'A&W Restaurants',
  'name': 'Strawberry Banana Smoothee',
  'category': 'Shak

### Step 4: Create a pandas DataFrame from your list of foods

**Note**: Your DataFrame should have 5,131 rows

In [38]:
df = pd.DataFrame(foods)

In [45]:
df.shape

(5131, 6)

- The DataFrame has 5,131 rows and 6 columns, matching the expected row count.
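
Every scraped value is a string, including `calories`, `fat`, and `carbs`. The steps in this notebook do not require converting them, but if numeric analysis is planned later, a conversion could look like the sketch below. `df_numeric_demo` and `numeric_cols` are illustrative names, and the two rows are copied from the scraped output above.

```python
import pandas as pd

# Two rows copied from the scraped output; every value is a string
df_numeric_demo = pd.DataFrame([
    {'name': 'Chili Fries', 'calories': '370', 'fat': '15', 'carbs': '49'},
    {'name': 'Caramel Sundae', 'calories': '340', 'fat': '9', 'carbs': '57'},
])

numeric_cols = ['calories', 'fat', 'carbs']
# errors='coerce' turns any value that cannot be parsed into NaN instead of raising
df_numeric_demo[numeric_cols] = df_numeric_demo[numeric_cols].apply(
    pd.to_numeric, errors='coerce'
)

print(df_numeric_demo.dtypes)
```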

In [46]:
df.head()

| | restaurant | name | category | calories | fat | carbs |
|---|---|---|---|---|---|---|
| 0 | A&W Restaurants | Original Bacon Double Cheeseburger | Burgers | 760 | 45 | 45 |
| 1 | A&W Restaurants | Coney (Chili) Dog | Entrees | 340 | 20 | 26 |
| 2 | A&W Restaurants | Chili Fries | French Fries | 370 | 15 | 49 |
| 3 | A&W Restaurants | Strawberry Milkshake (small) | Shakes | 670 | 29 | 90 |
| 4 | A&W Restaurants | A&WÂ® Root Beer Freeze (large) | Shakes | 820 | 18 | 150 |
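
The `A&WÂ®` in the food names above looks like an encoding mismatch rather than a problem in the source data: it is the pattern produced when UTF-8 bytes are decoded as Latin-1. The sketch below reproduces and reverses that pattern on a single string. It only illustrates the likely cause and has not been checked against the site's actual response headers.

```python
original = 'A&W® Root Beer Freeze (large)'

# UTF-8 bytes read as Latin-1 reproduce the garbled text seen in the output
garbled = original.encode('utf-8').decode('latin-1')
print(garbled)   # A&WÂ® Root Beer Freeze (large)

# Reversing the wrong decoding recovers the intended text
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)  # A&W® Root Beer Freeze (large)
```

If this is the cause, two options worth trying in the scraping loop are setting `food_from_website.encoding = 'utf-8'` before reading `.text`, or passing `food_from_website.content` (raw bytes) to `BeautifulSoup` so the parser can detect the charset declared in the page.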


### Step 5: Export to csv

**Note:** Don't export the index column from your DataFrame

In [41]:
df.to_csv('foods.csv', index=False)
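
A quick way to see what `index=False` changes is to write a small frame to a string and read it back. This is a sketch on a one-row, made-up frame; `df_csv_demo` and the other names are illustrative.

```python
import io

import pandas as pd

df_csv_demo = pd.DataFrame([
    {'restaurant': 'A&W Restaurants', 'name': 'Chili Fries', 'calories': '370'},
])

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = df_csv_demo.to_csv(index=False)
header = csv_text.splitlines()[0]
print(header)  # restaurant,name,calories

# Reading it back yields only the data columns
round_trip_cols = list(pd.read_csv(io.StringIO(csv_text)).columns)
print(round_trip_cols)

# Exporting the index as well shows up as an extra unnamed column on re-import
with_index_cols = list(pd.read_csv(io.StringIO(df_csv_demo.to_csv())).columns)
print(with_index_cols)
```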