<a href="https://colab.research.google.com/github/gvierneza/misc/blob/master/VerUno.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Page Scraper for *Veritas Uno Challenge 2019*


---



(Hit Ctrl-F9 while connected to the Interwebs to run all of this. It should all run as long as you can reach the URL held in `score_page`.)<br />This application scrapes the current Veritas Uno Challenge results from the free service that hosts our leaderboard, reformats them, and calculates the percentage of wins for each player.

## Imports

In [0]:
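try:
  import urllib.request as urllib2 #Python 3: the old urllib2 now lives in urllib.request
except ImportError:
  import urllib2 #Python 2 library for reading from the Interwebs.
from bs4 import BeautifulSoup #hugely popular Python parsing library, in our case for HTML
import re #regular expressions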

Set up the specific source page.

In [0]:
score_page = 'https://keepthescore.co/game/trdfqrmttse/'

Get the source. _N.B.: this user agent (or at least some user agent) is required_ for the keepthescore site; without one it returns 403s.

In [0]:
try:
  #the site answers 403 unless the request carries a browser-like User-Agent
  headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
  request = urllib2.Request(score_page,None,headers=headers)
  page = urllib2.urlopen(request)
  soup = BeautifulSoup(page,'html.parser') #docs at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
except Exception as err: #a bare except would also swallow Ctrl-C
  print('Something evil this way has already come.  You are probably not connected to the Interwebs. ({})'.format(err))
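
A rough sketch of how the fetch could tell a 403 apart from a dead connection, written for Python 3 (where the old urllib2 functionality lives in `urllib.request` and `urllib.error`). The names `fetch_html` and `USER_AGENT` and the `opener` parameter are made up for this illustration and are not part of the notebook above; take it as one possible starting point, not the definitive way to do it.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Any browser-like string seems to do; this particular value is an assumption.
USER_AGENT = 'Mozilla/5.0'


def fetch_html(url, opener=urlopen):
    """Return (html_bytes, None) on success or (None, message) on failure.

    `opener` is a hypothetical hook so the function can be tried offline.
    """
    request = Request(url, None, headers={'User-Agent': USER_AGENT})
    try:
        response = opener(request)
        return response.read(), None
    except HTTPError as err:
        # The server answered, but with an error status such as 403.
        # HTTPError subclasses URLError, so it has to be caught first.
        return None, 'HTTP error {}'.format(err.code)
    except URLError as err:
        # No answer at all: no network, DNS failure, refused connection...
        return None, 'Could not reach the server: {}'.format(err.reason)
```

The bytes that come back can be handed to `BeautifulSoup(html, 'html.parser')` just like `page` is above, and the message can be printed when the first item is `None`.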


Parse the source. There are no really good markers to use; I just pulled these out because they seemed as logical as any.<br />
The `find_all` call grabs the players from their row, and the `find(...).find_all('th')` chain grabs the scores from the top scoring row. Note that the player row has a blank `td` in it, but that doesn't matter: it has no `title` attribute either, so it isn't found. The scores row DOES have a value for the first item, but it isn't a real score, so we have to start grabbing the scores from the `[1]` item in the `scores` array. The `iter_counter` keeps track of that for us.<br />The named capture group in the regex grabs the score, which is converted to an int so that it can be added to the total number of games.


In [0]:
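#every player name is a link carrying this title attribute
player_box = soup.find_all('a',attrs={'title': 'Edit or delete player'})
#header cells of the top scoring row; item [0] is not a real score
scores = soup.find('tr', attrs={'class': 'info'}).find_all('th')
players = {}
iter_counter = game_total = 0
for player in player_box:
  iter_counter+=1 #incremented first, so the first player reads scores[1]
  m = re.match(r"^(?P<score>\d+).*$",scores[iter_counter].text.strip())
  int_score = int(m.group('score'))
  players[player.text.strip()] = int_score
  game_total+=int_score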
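
To see the named capture group on its own, here is a small standalone sketch. The `parse_score` helper and the sample cell texts are invented for illustration (I am guessing at what the real cells might contain); the pattern is the same one used above. Unlike the loop above, it returns `None` instead of blowing up when a cell doesn't start with digits.

```python
import re

# Leading digits are the score; whatever follows on the line is ignored.
SCORE_PATTERN = re.compile(r"^(?P<score>\d+).*$")


def parse_score(cell_text):
    """Return the leading integer in cell_text, or None if there is none."""
    match = SCORE_PATTERN.match(cell_text.strip())
    if match is None:
        return None
    return int(match.group('score'))


# Hypothetical cell texts; the real page's wording may differ.
samples = ['135', ' 97 wins ', 'Total', '']
print([parse_score(text) for text in samples])
# prints [135, 97, None, None]
```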



Sort the items in the dictionary by their values and print them out, calculating each percentage along the way.

In [0]:
s_dict = sorted(players.items(), key=lambda x: x[1], reverse=True) #highest win count first
print("{} games played".format(game_total))
print("{:<15} {:<5} % of total".format('Name', 'wins'))
for k, v in s_dict:
  percentage_of_total = float(v)/float(game_total) * 100
  print("{:<15} {:<5}  {:.2f}%".format(k, v, percentage_of_total))

481 games played
Name            wins  % of total
Matthew         135    28.07%
Eric whatever   106    22.04%
Chad            97     20.17%
Pawel           84     17.46%
Anjula          35     7.28%
Wild Bill       5      1.04%
Abby            5      1.04%
Talia           4      0.83%
John            4      0.83%
Tim             3      0.62%
Bill 2.X        2      0.42%
Josh k          1      0.21%
Scott           0      0.00%
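
The same sort-and-percentage step can be wrapped in a function, which makes it easy to try on sample data and to guard against an empty leaderboard (where the total is 0 and the division would fail). This is only a sketch: `win_percentages` is a made-up name, and the sample dictionary reuses three rows from the output above, so the percentages it prints are relative to those three players only.

```python
def win_percentages(players):
    """Return (name, wins, percent) tuples sorted by wins, highest first."""
    total = sum(players.values())
    ranked = sorted(players.items(), key=lambda item: item[1], reverse=True)
    return [
        # With no games played, report 0.0 rather than dividing by zero.
        (name, wins, 100.0 * wins / total if total else 0.0)
        for name, wins in ranked
    ]


# Three rows borrowed from the output above, as sample data.
sample = {'Chad': 97, 'Matthew': 135, 'Pawel': 84}
for name, wins, percent in win_percentages(sample):
    print("{:<15} {:<5}  {:.2f}%".format(name, wins, percent))
```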


# Coding Challenge for those who might accept it:
Extend this to read in all the wins by each player over time and plot them using something like matplotlib and [maybe] numpy.  Given what I have written above and any documentation you find online, it should be pretty easy to scrape the data out of the page, as it is just a big HTML table.<br />Plotting it with matplotlib wouldn't be hard either; [https://howtothink.readthedocs.io/en/latest/PvL_H.html](https://howtothink.readthedocs.io/en/latest/PvL_H.html) gives one way.  The other would be to use numpy ([https://numpy.org/devdocs/](https://numpy.org/devdocs/)) together with matplotlib or another lib.<br />Oh, and also add some error handling, please! :)
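
If you want a nudge on the plotting half, here is one possible shape for it, sketched in Python 3. Everything in it is an assumption: I am pretending the scraped history has already been boiled down to a list of winner names, one per game, oldest first, and `cumulative_wins`, `plot_wins`, and the sample history are invented for this example. Getting the real data out of the table is still up to you.

```python
from collections import defaultdict


def cumulative_wins(winners):
    """Map each player to a running win count after every game.

    `winners` is a hypothetical input: one winner name per game, oldest first.
    """
    players = sorted(set(winners))
    totals = defaultdict(int)
    series = {name: [] for name in players}
    for winner in winners:
        totals[winner] += 1
        # Record everyone's total after this game so all lines share an x-axis.
        for name in players:
            series[name].append(totals[name])
    return series


def plot_wins(series):
    """Draw one line per player; needs matplotlib installed."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it

    for name, counts in series.items():
        plt.plot(range(1, len(counts) + 1), counts, label=name)
    plt.xlabel('Game number')
    plt.ylabel('Cumulative wins')
    plt.legend()
    plt.show()


# Made-up game history, purely for illustration.
history = ['Matthew', 'Chad', 'Matthew', 'Pawel']
print(cumulative_wins(history))
# prints {'Chad': [0, 1, 1, 1], 'Matthew': [1, 1, 2, 2], 'Pawel': [0, 0, 0, 1]}
```

Calling `plot_wins(cumulative_wins(history))` in a notebook cell should then draw the chart inline.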

### For those who question "What this is":
This is a Jupyter Notebook that just happens to be embedded in the Google Suite, which allows us to use the normal RRD credentials to access it.  Here is a decent intro to the tool: [https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook).  This tool gives you pretty much all the power of Python in a browser (the code runs on a remote server).  I have used it for all of the machine learning courses I have taken on Coursera, and it is pretty much the _de facto_ way of disseminating Python and data science learning nowadays.<br />The text cells are written in Markdown, which is one of the goals for many of you.  Here is a great intro if you are still struggling with this: [https://colab.research.google.com/notebooks/markdown_guide.ipynb](https://colab.research.google.com/notebooks/markdown_guide.ipynb)

### For those who question "Why Python":
Python is extremely easy to use, yet powerful enough to run many of the AI and machine learning algorithms in use today.  It is also one of the top languages in use and in demand ([https://insights.dice.com/2019/10/08/python-java-top-languages-employers/](https://insights.dice.com/2019/10/08/python-java-top-languages-employers/)).  With its libraries, there is almost nothing that it cannot do.  This fact alone, never mind that Evil Paul loves the language, makes it worth digging into a bit.  As I said, it is very easy to learn; the fact that I wrote the above code in less than an hour while waiting for my son is proof enough of that, as I am no Python guru.<br />Expanding our horizons and thinking outside the box of what we do every day are part of what we do here.  I encourage you to embrace the opportunity.
