# Level 2 - Beautiful Soup

---

# The Mission

Your company `SpiderLegion` has just signed a contract with an Analytics Company called `DashItUp`.
`DashItUp` is well known for its dashboarding capabilities, specializing in monitoring website metrics such as views, content shares, new users, and database errors!

While dashboards are nice, `DashItUp` now wants to spend some time on a new `summarize` feature.
`DashItUp` wants to run web crawlers against their dashboards to fetch the `key metrics` and print them as a single report.

## Key Metrics

* User Count
* Any _system errors_, how recent?
    * System errors can be one of the following: `Database error`, `CPU overload`, or `Out of memory`
* Bounce Rate
* Top and bottom countries by utility
* Most recent user names with links to their profiles
* Name of the user that owns the dashboard

`DashItUp` has _many_ websites that use the same template (they all look the same). 
They believe that if you can write a web crawler for one, they should be able to apply the same code to the other dashboards they own to get similar results.

---

## Fetch The Website Contents

`DashItUp` was kind enough to give us a website to test against.
The website content is in a file called `website.html` in the `assets` folder.
We already have some code that is responsible for opening that file, reading it, and saving the contents to a variable called `website_contents`.

(Source HTML code is from the Analytics Template from the website https://www.w3schools.com/w3css/w3css_templates.asp)

In [1]:
with open("../assets/website.html") as website_file:
    website_contents = website_file.read()
    
website_contents

What a jumbled mess!
It is nearly impossible to understand what is going on here without a deep understanding of `HTML`.
Unless we visualize it!

In [2]:
# In Jupyter, you can visualize raw HTML using these two functions!
# It is essentially "embedding" the website content within the notebook
# (importing them from IPython.core.display is deprecated)
from IPython.display import display, HTML

display(HTML(website_contents))

---

## Get to Work!

### Import the tools needed

In [3]:
from bs4 import BeautifulSoup
import pandas as pd

---

## Create the Soup!

In [4]:
# code here

soup = BeautifulSoup(website_contents, "html.parser")
soup.title

<title>DashItUp, A Dashboard</title>

---

## User Count

In [5]:
# code here

# The answer we are looking for lives within the 
# "da-dashboardCards" section of the website, so target that first
dashboard_cards_soup = soup.find("div", attrs={"class": "da-dashboardCards"})

dashboard_cards_soup

<div class="w3-row-padding w3-margin-bottom da-dashboardCards">
<div class="w3-quarter">
<div class="w3-container w3-red w3-padding-16">
<div class="w3-left"><i class="fa fa-comment w3-xxxlarge"></i></div>
<div class="w3-right">
<h3 class="da-dashboardCardMetric">52</h3>
</div>
<div class="w3-clear"></div>
<h4 class="da-dashboardCardLabel">Messages</h4>
</div>
</div>
<div class="w3-quarter">
<div class="w3-container w3-blue w3-padding-16">
<div class="w3-left"><i class="fa fa-eye w3-xxxlarge"></i></div>
<div class="w3-right">
<h3 class="da-dashboardCardMetric">99</h3>
</div>
<div class="w3-clear"></div>
<h4 class="da-dashboardCardLabel">Views</h4>
</div>
</div>
<div class="w3-quarter">
<div class="w3-container w3-teal w3-padding-16">
<div class="w3-left"><i class="fa fa-share-alt w3-xxxlarge"></i></div>
<div class="w3-right">
<h3 class="da-dashboardCardMetric">23</h3>
</div>
<div class="w3-clear"></div>
<h4 class="da-dashboardCardLabel">Shares</h4>
</div>
</div>
<div class="w3-quarter"

In [6]:
# grab all of the direct children
dashboard_cards = dashboard_cards_soup.find_all(recursive=False)

dashboard_cards

[<div class="w3-quarter">
 <div class="w3-container w3-red w3-padding-16">
 <div class="w3-left"><i class="fa fa-comment w3-xxxlarge"></i></div>
 <div class="w3-right">
 <h3 class="da-dashboardCardMetric">52</h3>
 </div>
 <div class="w3-clear"></div>
 <h4 class="da-dashboardCardLabel">Messages</h4>
 </div>
 </div>,
 <div class="w3-quarter">
 <div class="w3-container w3-blue w3-padding-16">
 <div class="w3-left"><i class="fa fa-eye w3-xxxlarge"></i></div>
 <div class="w3-right">
 <h3 class="da-dashboardCardMetric">99</h3>
 </div>
 <div class="w3-clear"></div>
 <h4 class="da-dashboardCardLabel">Views</h4>
 </div>
 </div>,
 <div class="w3-quarter">
 <div class="w3-container w3-teal w3-padding-16">
 <div class="w3-left"><i class="fa fa-share-alt w3-xxxlarge"></i></div>
 <div class="w3-right">
 <h3 class="da-dashboardCardMetric">23</h3>
 </div>
 <div class="w3-clear"></div>
 <h4 class="da-dashboardCardLabel">Shares</h4>
 </div>
 </div>,
 <div class="w3-quarter">
 <div class="w3-container w3

In [7]:
# Users is the LAST child. Since the result is a list, we can index it directly!
user_card = dashboard_cards[-1]

user_card

<div class="w3-quarter">
<div class="w3-container w3-orange w3-text-white w3-padding-16">
<div class="w3-left"><i class="fa fa-users w3-xxxlarge"></i></div>
<div class="w3-right">
<h3 class="da-dashboardCardMetric">50</h3>
</div>
<div class="w3-clear"></div>
<h4 class="da-dashboardCardLabel">Users</h4>
</div>
</div>

In [8]:
user_card.find("h3", attrs={"class": "da-dashboardCardMetric"}).text

'50'
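
Relying on the card position works for this page, but it would break if another dashboard ordered its cards differently. The sketch below looks the card up by its label instead. The HTML is a trimmed-down sample modeled on the output above, and `metric_for_label` is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# A trimmed-down stand-in for the "da-dashboardCards" section (sample data)
sample_html = """
<div class="da-dashboardCards">
  <div class="w3-quarter">
    <h3 class="da-dashboardCardMetric">23</h3>
    <h4 class="da-dashboardCardLabel">Shares</h4>
  </div>
  <div class="w3-quarter">
    <h3 class="da-dashboardCardMetric">50</h3>
    <h4 class="da-dashboardCardLabel">Users</h4>
  </div>
</div>
"""


def metric_for_label(cards_soup, label):
    """Return the metric of the card whose label matches, or None."""
    for card in cards_soup.find_all("div", recursive=False):
        label_tag = card.find(class_="da-dashboardCardLabel")
        metric_tag = card.find(class_="da-dashboardCardMetric")
        if label_tag and metric_tag and label_tag.get_text(strip=True) == label:
            return int(metric_tag.get_text(strip=True))
    return None


sample_soup = BeautifulSoup(sample_html, "html.parser")
cards = sample_soup.find("div", class_="da-dashboardCards")
user_count = metric_for_label(cards, "Users")
print(user_count)
```

Returning `None` for a missing label lets the caller decide how to report a dashboard that has no such card.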

---

## Any _system errors_, how recent?
System errors can be one of the following: 

* `Database error`
* `CPU overload`
* `Out of memory`

In [9]:
# code here

# The content we want lives within the "da-feeds" section of the website,
# so we can target that first!
feeds = soup.find("div", attrs={"class": "da-feeds"})

feeds

<div class="w3-twothird da-feeds">
<h5>Feeds</h5>
<table class="w3-table w3-striped w3-white">
<tr>
<td><i class="fa fa-user w3-text-blue w3-large"></i></td>
<td>New record, over 90 views.</td>
<td><i>10 mins</i></td>
</tr>
<tr>
<td><i class="fa fa-bell w3-text-red w3-large"></i></td>
<td>Database error.</td>
<td><i>15 mins</i></td>
</tr>
<tr>
<td><i class="fa fa-users w3-text-yellow w3-large"></i></td>
<td>New record, over 40 users.</td>
<td><i>17 mins</i></td>
</tr>
<tr>
<td><i class="fa fa-comment w3-text-red w3-large"></i></td>
<td>New comments.</td>
<td><i>25 mins</i></td>
</tr>
<tr>
<td><i class="fa fa-bookmark w3-text-blue w3-large"></i></td>
<td>Check transactions.</td>
<td><i>28 mins</i></td>
</tr>
<tr>
<td><i class="fa fa-laptop w3-text-red w3-large"></i></td>
<td>CPU overload.</td>
<td><i>35 mins</i></td>
</tr>
<tr>
<td><i class="fa fa-share-alt w3-text-green w3-large"></i></td>
<td>New shares.</td>
<td><i>39 mins</i></td>
</tr>
</table>
</div>

In [10]:
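# since the result lives in a table, we can use pandas to extract it
# and put it in a dataframe for us!
# pandas 2.1+ deprecates passing a raw HTML string to read_html,
# so we wrap the markup in a file-like StringIO object
from io import StringIO

dataframes = pd.read_html(StringIO(str(feeds)))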
dataframes

[    0                           1        2
 0 NaN  New record, over 90 views.  10 mins
 1 NaN             Database error.  15 mins
 2 NaN  New record, over 40 users.  17 mins
 3 NaN               New comments.  25 mins
 4 NaN         Check transactions.  28 mins
 5 NaN               CPU overload.  35 mins
 6 NaN                 New shares.  39 mins]

In [11]:
feed_dataframe = dataframes[0]
feed_dataframe

    0                           1        2
0 NaN  New record, over 90 views.  10 mins
1 NaN             Database error.  15 mins
2 NaN  New record, over 40 users.  17 mins
3 NaN               New comments.  25 mins
4 NaN         Check transactions.  28 mins
5 NaN               CPU overload.  35 mins
6 NaN                 New shares.  39 mins


In [12]:
# It may be easier if we clean up this dataframe a little first
feed_dataframe.columns = ["icon", "message", "minutes"]
feed_dataframe = feed_dataframe.drop(columns=["icon"])

feed_dataframe

                      message  minutes
0  New record, over 90 views.  10 mins
1             Database error.  15 mins
2  New record, over 40 users.  17 mins
3               New comments.  25 mins
4         Check transactions.  28 mins
5               CPU overload.  35 mins
6                 New shares.  39 mins


In [13]:
# now, all we need to do is filter it down!
error_messages = [
    "Database error.",
    "CPU overload.",
    "Out of memory."
]

error_dataframe = feed_dataframe[feed_dataframe["message"].isin(error_messages)]
error_dataframe

           message  minutes
1  Database error.  15 mins
5    CPU overload.  35 mins
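
The filtered table tells us which errors occurred, but "how recent?" still needs the `minutes` text turned into a number. A minimal sketch, assuming every entry has the form `"<number> mins"` as in the feed above (hours or days would need extra handling). The sample rows are copied from the filtered output, and the `minutes_ago` column name is made up for illustration.

```python
import pandas as pd

# Sample rows copied from the filtered error output
sample_errors = pd.DataFrame({
    "message": ["Database error.", "CPU overload."],
    "minutes": ["15 mins", "35 mins"],
})

# Pull the leading number out of strings such as "15 mins"
sample_errors["minutes_ago"] = (
    sample_errors["minutes"].str.extract(r"(\d+)", expand=False).astype(int)
)

# The smallest "minutes_ago" value belongs to the most recent error
most_recent_error = sample_errors.sort_values("minutes_ago").iloc[0]
print(most_recent_error["message"], most_recent_error["minutes_ago"])
```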


---

## Bounce Rate

In [14]:
# code here

# This one is a bit simpler since the element has an "id" attribute
# We can use that ("da-bounceRateStat") to target the element directly
soup.find(id="da-bounceRateStat").text

'\n75%\n'

In [15]:
# To get rid of the newlines and the percent sign:
soup.find(id="da-bounceRateStat").text \
    .replace("\n", "") \
    .replace("%", "")

'75'
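
The result above is still a string. If the report needs a number, one possible approach is a small helper that strips the surrounding whitespace and the trailing `%` before converting. `parse_percent` is a hypothetical name, and the input is the raw text returned earlier.

```python
def parse_percent(raw_text):
    """Turn percentage text (with optional surrounding whitespace) into a float."""
    return float(raw_text.strip().rstrip("%"))


# The raw text returned by soup.find(id="da-bounceRateStat").text
bounce_rate = parse_percent("\n75%\n")
print(bounce_rate)
```

Using `strip()` instead of `replace("\n", "")` also removes spaces and tabs around the value.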

---

## Top and bottom countries by utility

In [16]:
# code here

# This one will be similar to the feeds! Pandas to the rescue.
# First, let's target the HTML that includes our table
country_utility_soup = soup.find(attrs={"class": "da-countryUtility"})
country_utility_soup

<div class="w3-container da-countryUtility">
<h5>Countries</h5>
<table class="w3-table w3-striped w3-bordered w3-border w3-hoverable w3-white">
<tr>
<th>Country</th>
<th>Utility</th>
</tr>
<tr>
<td>France</td>
<td>1.5%</td>
</tr>
<tr>
<td>UK</td>
<td>15.7%</td>
</tr>
<tr>
<td>United States</td>
<td>65%</td>
</tr>
<tr>
<td>Russia</td>
<td>5.6%</td>
</tr>
<tr>
<td>Spain</td>
<td>2.1%</td>
</tr>
<tr>
<td>India</td>
<td>1.9%</td>
</tr>
</table><br/>
<button class="w3-button w3-dark-grey">More Countries  <i class="fa fa-arrow-right"></i></button>
</div>

In [17]:
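# pandas 2.1+ deprecates passing a raw HTML string, so wrap it in StringIO
from io import StringIO

country_utility_tables = pd.read_html(StringIO(str(country_utility_soup)))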
country_utility_table = country_utility_tables[0]
country_utility_table

         Country  Utility
0         France     1.5%
1             UK    15.7%
2  United States      65%
3         Russia     5.6%
4          Spain     2.1%
5          India     1.9%


In [18]:
# Now, we want to grab the top and bottom country by utility
# Let's see what the datatypes are!
country_utility_table.dtypes

Country    object
Utility    object
dtype: object

In [19]:
# Ok, both columns are "object" (text) types.
# We need to convert "Utility" to a number before sorting.
country_utility_table["Utility"] = country_utility_table["Utility"] \
    .str.replace("%", "") \
    .astype("float64")

country_utility_table.dtypes

Country     object
Utility    float64
dtype: object

In [20]:
# Now, we can sort and reset the index
country_utility_table = country_utility_table \
    .sort_values(by="Utility", ascending=False) \
    .reset_index(drop=True)

country_utility_table

         Country  Utility
0  United States     65.0
1             UK     15.7
2         Russia      5.6
3          Spain      2.1
4          India      1.9
5         France      1.5


In [21]:
# Grab the top..
country_utility_table.head(1)

         Country  Utility
0  United States     65.0


In [22]:
# Grab the bottom
country_utility_table.tail(1)

   Country  Utility
5   France      1.5
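
Sorting is not the only way to get the extremes. As an alternative sketch, `idxmax` and `idxmin` return the index labels of the largest and smallest values, which can then be passed to `.loc`. The sample frame below reuses the numbers from the table above, already converted to floats.

```python
import pandas as pd

# Sample data: the country table after the "%" signs were removed
sample_countries = pd.DataFrame({
    "Country": ["France", "UK", "United States", "Russia", "Spain", "India"],
    "Utility": [1.5, 15.7, 65.0, 5.6, 2.1, 1.9],
})

# idxmax/idxmin give the row labels of the highest and lowest utility
top_country = sample_countries.loc[sample_countries["Utility"].idxmax()]
bottom_country = sample_countries.loc[sample_countries["Utility"].idxmin()]

print(top_country["Country"], top_country["Utility"])
print(bottom_country["Country"], bottom_country["Utility"])
```

This leaves the original row order untouched, which can be handy if the table is reused later.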


---

## Most recent user names with links to their profiles

In [23]:
# code here

# This one is a little trickier since it is not structured as a table, 
# and we are also trying to grab 2 things!
# This sounds like a job for the trusty for-loop

# Start by narrowing down the HTML to the area of interest, "da-recentUsers"
recent_users = soup.find(attrs={"class": "da-recentUsers"})

print(recent_users.prettify())

<div class="w3-container da-recentUsers">
 <h5>
  Recent Users
 </h5>
 <ul class="w3-ul w3-card-4 w3-white">
  <li class="w3-padding-16">
   <a href="#/profile/mike">
    <img class="w3-left w3-circle w3-margin-right" src="../assets/mike.png" style="width:35px"/>
    <span class="w3-xlarge">
     Mike
    </span>
    <br/>
   </a>
  </li>
  <li class="w3-padding-16">
   <a href="#/profile/jill">
    <img class="w3-left w3-circle w3-margin-right" src="../assets/jill.png" style="width:35px"/>
    <span class="w3-xlarge">
     Jill
    </span>
    <br/>
   </a>
  </li>
  <li class="w3-padding-16">
   <a href="#/profile/jane">
    <img class="w3-left w3-circle w3-margin-right" src="../assets/jane.png" style="width:35px"/>
    <span class="w3-xlarge">
     Jane
    </span>
    <br/>
   </a>
  </li>
 </ul>
</div>



In [24]:
# Luckily, we can see a pattern here! 
# Let's assess the structure
"""

<div class="... da-recentUsers">
  ...
  <ul ...>  --------------------------------- start of the loop
    <li>  ----------------------------------- element within loop
      <a href="PROFILE URL HERE!!"> --------- element url!!
        ...
        <span ...>PROFILE NAME HERE!</span> - element name!!
      </a>
    </li>
    <li>...</li> ---------------------------- next element
    <li>...</li> ---------------------------- next element
  </ul>
</div>

"""

# SO!! The <ul> element makes up the "base" of our structure.
# Every child element within (<li>) represents a different recent user.
# Therefore, if we loop over the elements of the <ul> element,  
# we should be able to extract the URL and name per user.
print()




In [25]:
# Let's start by just trying to loop over the <li> elements, 
# then we can build off of that.
for list_element in recent_users.find("ul").find_all("li"):
    print("\n", str(list_element.prettify()))


 <li class="w3-padding-16">
 <a href="#/profile/mike">
  <img class="w3-left w3-circle w3-margin-right" src="../assets/mike.png" style="width:35px"/>
  <span class="w3-xlarge">
   Mike
  </span>
  <br/>
 </a>
</li>


 <li class="w3-padding-16">
 <a href="#/profile/jill">
  <img class="w3-left w3-circle w3-margin-right" src="../assets/jill.png" style="width:35px"/>
  <span class="w3-xlarge">
   Jill
  </span>
  <br/>
 </a>
</li>


 <li class="w3-padding-16">
 <a href="#/profile/jane">
  <img class="w3-left w3-circle w3-margin-right" src="../assets/jane.png" style="width:35px"/>
  <span class="w3-xlarge">
   Jane
  </span>
  <br/>
 </a>
</li>



In [26]:
# Ok, now let's try to grab the href from the <a> tag
for list_element in recent_users.find("ul").find_all("li"):
    a_tag = list_element.find("a")
    print(a_tag["href"])

#/profile/mike
#/profile/jill
#/profile/jane


In [27]:
# Let's try to grab the user name from the <span> tag
for list_element in recent_users.find("ul").find_all("li"):
    span_tag = list_element.find("span")
    print(span_tag.text)

Mike
Jill
Jane


In [28]:
# Now, put them together
# Print the user name and the profile link side by side
for list_element in recent_users.find("ul").find_all("li"):
    a_tag = list_element.find("a")
    span_tag = list_element.find("span")
    
    print(span_tag.text, a_tag["href"])

Mike #/profile/mike
Jill #/profile/jill
Jane #/profile/jane


In [29]:
# And finally, add them to a list to be used later!
recent_user_info = []

for list_element in recent_users.find("ul").find_all("li"):
    a_tag = list_element.find("a")
    span_tag = list_element.find("span")
    
    recent_user_info.append([span_tag.text, a_tag["href"]])
    
recent_user_info  

[['Mike', '#/profile/mike'],
 ['Jill', '#/profile/jill'],
 ['Jane', '#/profile/jane']]
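
The links collected above are fragments (`#/profile/...`), so they only make sense relative to the dashboard's own address. A hedged sketch of how they could be turned into full links with `urllib.parse.urljoin`: the `BASE_URL` below is a made-up placeholder, since the real dashboard address is not given, and the name/link pairs are copied from the output above.

```python
from urllib.parse import urljoin

# Hypothetical dashboard address; replace with the real one when known
BASE_URL = "https://dashitup.example/dashboard"

# Name/link pairs copied from the list built in the previous cell
recent_user_pairs = [
    ["Mike", "#/profile/mike"],
    ["Jill", "#/profile/jill"],
    ["Jane", "#/profile/jane"],
]

# Named fields are easier to read in a report than positional lists
recent_user_records = [
    {"name": name, "profile_url": urljoin(BASE_URL, href)}
    for name, href in recent_user_pairs
]

for record in recent_user_records:
    print(record["name"], record["profile_url"])
```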

---

## Name of the user that owns the dashboard

In [30]:
# code here

# This is a trick question! Try opening the "website.html" in your browser
# and see the "responsiveness" of the website.
# When the website gets below a certain width, the "menu" gets hidden!
# We can't see it in the notebook,
# but that doesn't mean it is not there.

# The info that we need lives within the "da-welcomeMenu" element
welcome_menu = soup.find(attrs={"class": "da-welcomeMenu"})
welcome_menu

<div class="w3-col s8 w3-bar da-welcomeMenu">
<span>Welcome, <strong>Mike</strong></span><br/>
<a class="w3-bar-item w3-button" href="#"><i class="fa fa-envelope"></i></a>
<a class="w3-bar-item w3-button" href="#"><i class="fa fa-user"></i></a>
<a class="w3-bar-item w3-button" href="#"><i class="fa fa-cog"></i></a>
</div>

In [31]:
welcome_menu.find("strong").text

'Mike'
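
Since `DashItUp` wants to reuse the same code across dashboards, the individual lookups can be collected into one function. The following is only a sketch under assumptions: `summarize` is a hypothetical helper, the sample page is a tiny stand-in that reuses the class and id names seen in this notebook, and only three of the key metrics are covered.

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one dashboard page (sample data, not the real file)
sample_page = """
<html><body>
  <div class="da-welcomeMenu"><span>Welcome, <strong>Mike</strong></span></div>
  <div class="da-dashboardCards">
    <div>
      <h3 class="da-dashboardCardMetric">50</h3>
      <h4 class="da-dashboardCardLabel">Users</h4>
    </div>
  </div>
  <div id="da-bounceRateStat">
75%
</div>
</body></html>
"""


def summarize(html):
    """Collect a few key metrics from one dashboard page into a dict."""
    page = BeautifulSoup(html, "html.parser")

    # Same assumption as earlier in the notebook: the Users card is the last card
    cards = page.find(class_="da-dashboardCards")
    user_card = cards.find_all(recursive=False)[-1]
    user_metric = user_card.find(class_="da-dashboardCardMetric")

    owner_tag = page.find(class_="da-welcomeMenu").find("strong")
    bounce_text = page.find(id="da-bounceRateStat").get_text(strip=True)

    return {
        "owner": owner_tag.get_text(strip=True),
        "user_count": int(user_metric.get_text(strip=True)),
        "bounce_rate": float(bounce_text.rstrip("%")),
    }


report = summarize(sample_page)
print(report)
```

Each dashboard's HTML can then be passed to the same function, as long as it follows the shared template.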