# Web scraping the NBA website to follow the games live without your colleagues knowing

### Part 1 of 3

Every now and then I like to follow some sports to distract myself from the usual routine and to get a mental rest from heavy thought processes at work or when developing my personal projects. So in colloquial terms, one may say that I like to procrastinate occasionally. 

NBA games are one of the distractions I use to relax my mind. Unfortunately for me, the games tend to be played in the middle of office hours, which makes it a little difficult to follow them live. Luckily, the company I work for understands that we are human and allows us to access sites like the NBA's, so we can keep up with the things that interest us while always keeping in mind the projects assigned to us.

However, some workplaces are more restrictive, so getting the live score of a Lakers vs Clippers game while your boss is breathing down your neck may be challenging. Fortunately for us, with a little bit of Python and web scraping magic you can run some scripts to get the latest score of the games without people knowing that you are :).

In these 3 tutorials I will take you from the basics of web scraping a JavaScript-rendered website (in this case the NBA website) to a more extensive review of how to write code that retrieves the information we need to build a database of your favourite team's results over the season.

For this, you will use Python and Selenium, a popular browser automation tool that is often used for web scraping, to explore and extract the information you need from a website. A quick disclaimer here: always check that the website you want to scrape allows it and is OK with you doing so. Also, if a website has an API that you can query for the information, it will be much more efficient and practical to use that channel rather than scraping its HTML tree.
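
One common way to check what a site permits is to read its robots.txt file. The sketch below uses Python's standard `urllib.robotparser` for that. The robots.txt content and the `example.com` URLs are made up for illustration and are not taken from any real website, and `is_allowed` is a hypothetical helper name. A site's terms of use may add restrictions that robots.txt does not express, so treat this as a first check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It is NOT the actual robots.txt of any real website.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Check one path that the sample rules allow and one that they forbid
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/teams/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

Against a live site, you would instead point the parser at the site's robots.txt with `set_url(...)` and call `read()` before asking `can_fetch`.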


### If you want to jump straight to the application.... 

Please click here:   . However, if you are not familiar with web scraping and reading HTML, I strongly recommend that you follow along with the series.


### First things first

As I said before, the idea is to cover the basics and build from there. So the first thing you will need to do to follow along (other than having Python installed on your machine) is to install and set up Selenium in your workspace. The best way to do this is to follow the Selenium instructions for your OS and the web browser you regularly use. You can find all this information here: https://selenium-python.readthedocs.io/installation.html

### Once you have set up the basics, we can start...

As usual, the first thing to do is to import the libraries that we will be using for this particular project:

In [3]:
import sys
sys.path.insert(0, "C:/Users/luisa/Documents/GitHub/Web-Spider---Sentiment-Analysis/venv/env/Lib/site-packages")

sys.path.insert(0, "C:/Users/RojasL/PycharmProjects/untitled3/venv/Lib/site-packages")

In [4]:
# Selenium web driver and pandas
from selenium import webdriver
import pandas as pd
# Some HTML and display functions for flashy displays ;)
from IPython.display import display, HTML, Image
# Python widgets
from ipywidgets import interact, interactive, fixed, interact_manual, Box, VBox, Label, Layout, HBox
import ipywidgets as widgets

Then, we define the URL that we would like Selenium to open and extract the information from. If you find yourself having to fill in a form or press some buttons on the website before getting to the place you want to extract the information from, worry not my friend!! We can do that with Selenium too, and I will show you how later!

In [6]:
url_logos = 'https://stats.nba.com/teams/'

For this first exercise we will extract the logos and names of all the NBA teams and create a little interactive display using Python widgets. To do so, we initialize our Selenium web driver (the cell below uses Chrome, with the Firefox call left commented out; you can use another browser by changing the driver call a bit, e.g. webdriver.Firefox()). Later on we will pass the URL that we want to query using the get function.

In [7]:
#driver = webdriver.Firefox()
driver = webdriver.Chrome('C:/Users/RojasL/Downloads/chromedriver.exe')

This will open a web browser that is controllable via Python, so you can automate tasks or write code to iterate over and interact with the given website however you like. Now, before jumping into the NBA logos, we have to understand one of the basic extraction tasks: how to get a specific element from an HTML structure. Imagine we have the following structure:

In [8]:
html_structure= '''
<html>
    <head></head>
    <body>
        <div>
            <h1 id="header"> This is my header</h1>
        </div>
        <div>
            <p id="paragraph1"> This is my paragraph 1</p>
            <div>
                <p id="subparagraph1"> this is my subparagraph 1</p>
            </div>
        </div>
        <div>
            <p id="paragraph2"> This is my paragraph 2</p>
            <div>
                <p id="subparagraph2"> this is my subparagraph 2</p>
            </div>
        </div>
    </body>      
</html>
'''

Which in a web browser will look something like this:

In [9]:
HTML(html_structure)

Now imagine you would like to extract the content of subparagraph 1. You can use a very handy method called find_elements_by_xpath, which lets you specify the path of an element and then extract different features of that element (in our case, its content). For that, we first pass the HTML structure to our Selenium web driver:

In [10]:
driver.get("data:text/html;charset=utf-8,{0}".format(html_structure))
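
Passing raw markup in a `data:` URL works for the sample above because it contains no characters with a special meaning in URLs. As far as I know, a `#` in the markup (for example in a CSS colour) would be read by the browser as the start of a URL fragment and cut the page short. A minimal sketch of a safer variant, assuming a hypothetical helper named `to_data_url`, percent-encodes the markup first:

```python
from urllib.parse import quote


def to_data_url(html):
    """Build a data: URL, percent-encoding characters such as '#' and '%'."""
    return "data:text/html;charset=utf-8," + quote(html)


# A snippet containing characters that are not safe inside a URL
snippet = "<p id='note'>Score #1 is 100% live</p>"
print(to_data_url(snippet))
```

The result could then be handed to the driver in the same way as before, i.e. `driver.get(to_data_url(html_structure))`.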

We then apply our handy method by specifying the path of what we want to extract. We can either specify the full path or abbreviate its location, and it will return a list with all the elements that match the given description. This is why we take element [0] from that list and then ask for the text of that element:

In [11]:
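# Note: find_elements_by_xpath was removed in Selenium 4.3; on newer versions the
# equivalent call is driver.find_elements(By.XPATH, ...) after
# `from selenium.webdriver.common.by import By`.
# Full path
display(driver.find_elements_by_xpath("html/body/div/div/p[@id = 'subparagraph1']")[0].text)
# Abbreviated version
display(driver.find_elements_by_xpath("//p[@id = 'subparagraph1']")[0].text)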

'this is my subparagraph 1'

'this is my subparagraph 1'
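
If you want to experiment with full versus abbreviated paths without opening a browser, the standard library's `xml.etree.ElementTree` understands a small subset of XPath. The sketch below is only an illustration, not part of the Selenium workflow. It uses a trimmed copy of the sample markup and relies on that markup being well-formed XML, which real-world HTML usually is not.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample markup. ElementTree only accepts well-formed
# XML, so this approach would not work on arbitrary real-world HTML.
sample = """
<html>
    <body>
        <div>
            <p id="paragraph1"> This is my paragraph 1</p>
            <div>
                <p id="subparagraph1"> this is my subparagraph 1</p>
            </div>
        </div>
    </body>
</html>
"""

root = ET.fromstring(sample.strip())  # root is the <html> element

# Full path, written relative to the <html> root
full = root.findall("./body/div/div/p[@id='subparagraph1']")
# Abbreviated path: any <p> descendant with the given id
short = root.findall(".//p[@id='subparagraph1']")

# Strip the leading space, as Selenium's .text did in the output above
print(full[0].text.strip())
print(short[0].text.strip())
```

Both queries return a list, just like the Selenium call, so the `[0]` indexing plays the same role here.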

What if I want to extract all the existing subparagraphs and build a dataframe with the contents of all these elements? Easy, use exactly the same method. Just change the search criteria a bit by using contains in the XPath expression and we are all done!!!

In [12]:
# All elements in the HTML structure that satisfy the criteria
list_of_elements = driver.find_elements_by_xpath("//p[contains(@id, 'subparagraph')]")
# Printing all the returned elements
for i in list_of_elements:
    print(i.text)

this is my subparagraph 1
this is my subparagraph 2


We have just successfully scraped our first HTML! Hurray!!! More specifically, we told the Selenium driver to extract all the 'p' elements whose 'id' attribute contains the word subparagraph.
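
To make explicit what `contains(@id, 'subparagraph')` selects, here is a plain-Python equivalent. It is a sketch for illustration only: it uses the standard library's `xml.etree.ElementTree` rather than Selenium, on a shortened version of the sample markup, and it spells out the substring test by hand because ElementTree's limited XPath support does not include `contains`.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the sample markup
sample = """
<body>
    <p id="paragraph1">This is my paragraph 1</p>
    <p id="subparagraph1">this is my subparagraph 1</p>
    <p id="subparagraph2">this is my subparagraph 2</p>
</body>
"""

root = ET.fromstring(sample.strip())

# Equivalent of //p[contains(@id, 'subparagraph')]:
# keep every <p> whose id attribute has 'subparagraph' as a substring.
matches = [p for p in root.iter("p") if "subparagraph" in p.get("id", "")]

for p in matches:
    print(p.get("id"), "->", p.text)
```

Because contains is a substring test, searching for 'paragraph' instead would also match paragraph1, so it pays to pick a search term that is specific enough.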

Reader: Ehem... But you didn't create any dataframe

Luis: Oooppsss... True true

In [13]:
#combining it all together in a shorter version
pd.DataFrame(data = [i.text for i in driver.find_elements_by_xpath("//p[contains(@id, 'subparagraph')]")], 
            columns = ['my_query'])


| | my_query |
|---|---|
| 0 | this is my subparagraph 1 |
| 1 | this is my subparagraph 2 |


Now let's move to the real thing. For that, we have our Selenium driver read the NBA logos URL:

In [14]:
driver.get(url_logos)

Once we have done so, a good way to identify the items that we want to extract from the website is to place the cursor on top of the element, right-click and select Inspect. Something like this will pop up:

In [15]:
Image("C:/Users/luisa/Documents/GitHub/Untitled Folder/teamlist.png")

<IPython.core.display.Image object>

We want to extract the logos and the team names. It turns out that both are enclosed in a table element, followed by several other nested elements, until we reach the 'a' tag that holds the hyperlink and the text with the team's name. Nested right underneath that 'a' tag is an image (< img >) whose 'src' attribute holds the URL of the logo picture. Using the Selenium and Python magic that we have just learnt, we will try to find a pattern common to all the teams so we can get them in a list and extract the features that we want from their HTML elements. First, the team names:

In [18]:
team_list = driver.find_elements_by_xpath("//a[contains(@class, 'stats-team-list')]")
# Print the first 6 elements of the list
for v in team_list[:6]:
    print(v.text)

Boston Celtics
Brooklyn Nets
New York Knicks
Philadelphia 76ers
Toronto Raptors
Chicago Bulls


What have we just done??? We told the Selenium driver to find all the 'a' tags that have a class whose name contains 'stats-team-list' and, once those elements were found, to print the text contained in each 'a' tag. For the logos we can do something similar, except that the logo is not held in the text of the 'img' tag but in its 'src' attribute. Therefore, we have to slightly modify the code so that, rather than extracting the text, we take the src attribute.

In [19]:
logos = driver.find_elements_by_xpath("//img[contains(@class, 'team-logo')]")
# Print the source URL of the first 6 logos
for v in logos[:6]:
    print(v.get_attribute('src'))

https://stats.nba.com/media/img/teams/logos/BOS_logo.svg
https://stats.nba.com/media/img/teams/logos/BKN_logo.svg
https://stats.nba.com/media/img/teams/logos/NYK_logo.svg
https://stats.nba.com/media/img/teams/logos/PHI_logo.svg
https://stats.nba.com/media/img/teams/logos/TOR_logo.svg
https://stats.nba.com/media/img/teams/logos/CHI_logo.svg
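
The difference between an element's text and its attributes can also be seen without a browser. The sketch below parses a hypothetical, heavily simplified team entry with the standard library. The class names are borrowed from the queries in this tutorial, but the structure is an assumption for illustration and the real page is more deeply nested.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one team entry (not the real page)
entry = """
<a class="stats-team-list">
    <img class="team-logo" src="https://stats.nba.com/media/img/teams/logos/BOS_logo.svg" />
    Boston Celtics
</a>
"""

link = ET.fromstring(entry.strip())
image = link.find("img")

# The visible text lives between the tags...
team_name = "".join(link.itertext()).strip()
# ...while the logo address lives in an attribute of the <img> tag
logo_url = image.get("src")

print(team_name)
print(logo_url)
```

Here `itertext()` plays the role of Selenium's `.text`, and `get("src")` plays the role of `get_attribute('src')`.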


As you just saw, because the logos are stored in an attribute of the tag we are querying, we use Selenium's get_attribute function to get the source URL of each logo. Now, in the same way that we made the dataframe out of the subparagraphs, we will do the same with the logos and team names:

In [20]:
logos = pd.DataFrame(columns = ['logo', 'team_name'])
team_list = driver.find_elements_by_xpath("//a[contains(@class, 'stats-team-list')]")
# As they both have the same length, we use the counter of one to index the other and then fill the dataframe
for i, v in enumerate(driver.find_elements_by_xpath("//img[contains(@class, 'team-logo')]")):
    logos.loc[i, 'logo'] = v.get_attribute('src')
    logos.loc[i, 'team_name'] = team_list[i].text
    
display(logos.head())

| | logo | team_name |
|---|---|---|
| 0 | https://stats.nba.com/media/img/teams/logos/BO... | Boston Celtics |
| 1 | https://stats.nba.com/media/img/teams/logos/BK... | Brooklyn Nets |
| 2 | https://stats.nba.com/media/img/teams/logos/NY... | New York Knicks |
| 3 | https://stats.nba.com/media/img/teams/logos/PH... | Philadelphia 76ers |
| 4 | https://stats.nba.com/media/img/teams/logos/TO... | Toronto Raptors |
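
Filling the dataframe by index relies on both queries returning the same number of elements in the same order. A possible alternative, sketched here with plain lists, is to pair the two with `zip` after checking their lengths. The sample values are copied from the outputs above, and `pair_up` is a hypothetical helper name.

```python
# Sample values copied from the outputs above; in the notebook these lists
# would come from get_attribute('src') and .text on the Selenium elements.
logo_urls = [
    "https://stats.nba.com/media/img/teams/logos/BOS_logo.svg",
    "https://stats.nba.com/media/img/teams/logos/BKN_logo.svg",
    "https://stats.nba.com/media/img/teams/logos/NYK_logo.svg",
]
team_names = ["Boston Celtics", "Brooklyn Nets", "New York Knicks"]


def pair_up(urls, names):
    """Pair each logo URL with its team name, refusing mismatched lists."""
    if len(urls) != len(names):
        raise ValueError(
            "expected equal lengths, got {0} logos and {1} names".format(len(urls), len(names))
        )
    return [{"logo": url, "team_name": name} for url, name in zip(urls, names)]


records = pair_up(logo_urls, team_names)
for record in records:
    print(record["team_name"], "->", record["logo"])
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` to get the same two columns.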


I know, I know.... you are right. These are not the logos, just a bunch of web addresses pointing to the logos. But hey!! If we use a little bit of HTML in our Python notebook we can display the logo:

In [21]:
team = 3
HTML("<h3>{0}</h3><img src={1} style='max-height:250px;'></img>".format(logos.loc[team, 'team_name'], logos.loc[team, 'logo']))
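
When HTML strings are built from scraped values, it can help to escape the values and quote the attributes, so that an unexpected character in a team name or URL does not break the markup. A minimal sketch using the standard library's `html.escape` follows; `team_card` is a hypothetical helper name and the 250px default simply mirrors the cell above.

```python
from html import escape


def team_card(team_name, logo_url, max_height=250):
    """Return an HTML snippet with the team name as a heading and its logo."""
    return '<h3>{name}</h3><img src="{src}" style="max-height:{height}px;">'.format(
        name=escape(team_name),             # escapes &, <, > and quotes
        src=escape(logo_url, quote=True),   # safe inside a quoted attribute
        height=int(max_height),
    )


print(team_card(
    "Philadelphia 76ers",
    "https://stats.nba.com/media/img/teams/logos/PHI_logo.svg",
))
```

In the notebook, the returned string could be wrapped in `HTML(...)` in the same way as the cell above.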

Then if we use it with an interactive widget, we get:

In [22]:
def displayer(team):
    location = logos['team_name'] == team
    html_str = "<h3>{0}</h3><img src={1} style='max-height:250px;'></img>" \
         .format(logos.loc[location, 'team_name'].values[0], logos.loc[location, 'logo'].values[0])
    print("This is the html code used to print what's below: " + html_str)
    return(HTML(html_str))
    
widget = widgets.Dropdown(
    options= logos['team_name'],
    value=None,
    description='Team:')

interact_manual(displayer, team = widget)

interactive(children=(Dropdown(description='Team:', options=('Boston Celtics', 'Brooklyn Nets', 'New York Knic…

<function __main__.displayer(team)>

### Up next

We will explore how to extract the results of the NBA games played so far.