# Structured Text

## The Problem

I've followed NCAA Wrestling for many years, and lately have taken an interest in talent identification and athlete development.

Consider, for example, the problem of recruiting collegiate wrestlers. Potential scholarship athletes will commonly be screened from high school teams (in rare cases, international wrestlers may come from club teams); thus, there is considerable interest in high school rankings. See, for example,

- [Intermat](https://intermatwrestle.com/rankings/high_school)
- [The Open Mat](https://news.theopenmat.com/category/high-school-wrestling/high-school-wrestling-rankings)
- [WIN](https://www.win-magazine.com/category/hs-rankings/)

It might be useful to compare these different ranking services to determine which are best at predicting collegiate success. To that end, I've decided to compare the 2015 high school class with the results from the 2018 NCAA tournament.

##  Manual Solution

- Copied table from [The Open Mat](https://news.theopenmat.com/high-school-wrestling/high-school-wrestling-rankings/final-2015-clinch-gear-national-high-school-wrestling-individual-rankings/57136) into Excel, edited and saved as [CSV](./openmat2015.csv)
- Copied data from the 2018 NCAA Tournament from [PDF source](https://i.turner.ncaa.com/sites/default/files/external/gametool/brackets/wrestling_d1_2018.pdf) and from [FloArena](https://arena.flowrestling.org/event/8f1c1320-e1ac-31b5-c401-e7dda525e4b3) and compiled into [CSV](./ncaa2018.csv). These data also include final rankings from the Coaches Poll and [WrestleStat](https://www.wrestlestat.com/season/2018/rankings/starters) and results from various conference tournaments.


Can we merge these two tables to determine how top-ranked 2015 high school wrestlers performed in 2018?

Read tables

In [None]:
import pandas

ncaa18_dat = pandas.read_csv("./ncaa2018.csv")
hs2015_dat = pandas.read_csv("./openmat2015.csv")

Remove the non-qualifiers from the NCAA data set.

In [None]:
ncaa18_dat = ncaa18_dat[ncaa18_dat.Finish != 'NQ']

Process wrestler names in the high school set to match the NCAA format.

In [None]:
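# Split each name on spaces into a list of tokens.
names = hs2015_dat['Name'].apply(lambda x: x.split(' '))
# Transpose the token lists into columns; zip() stops at the shortest name,
# so extra tokens in longer names are dropped.
names_dat = pandas.DataFrame(list(zip(*names)))
names_dat = names_dat.T
names_dat.rename(columns = {0:'First', 1:'Last'}, inplace = True)
hs2015_dat = pandas.concat([hs2015_dat, names_dat], axis=1)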
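
Names with more than two words (a two-word surname, say) need some care when split on spaces. One possible alternative, sketched here with made-up names, is pandas' own `str.split` with `n=1`, which keeps everything after the first space together as the last name. Whether that matches how the NCAA file stores such names is an assumption that would need checking.

```python
import pandas

# Made-up names for illustration; not taken from either data set.
names = pandas.Series(["Alex Smith", "Jordan Van Dyke"])

# Split on the first space only, returning one column per piece.
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["First", "Last"]
print(parts)
```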

To simplify analysis, group ranks into quartiles:

In [None]:
import math
max_rank = max(hs2015_dat.Rank)
hs2015_dat['Quartile'] = hs2015_dat['Rank'].apply(lambda x: math.ceil(4*x/max_rank))
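
As a quick check of this rule, the sketch below applies the same `math.ceil(4*x/max_rank)` mapping to a handful of ranks. The list length of 20 and the helper name `rank_to_quartile` are assumptions for illustration, not values taken from the data.

```python
import math

def rank_to_quartile(rank, max_rank):
    """Map a 1-based rank to a quartile from 1 (top quarter) to 4 (bottom quarter)."""
    return math.ceil(4 * rank / max_rank)

max_rank = 20  # assumed list length, for illustration only
sample_ranks = [1, 5, 6, 10, 11, 15, 16, 20]
quartiles = [rank_to_quartile(r, max_rank) for r in sample_ranks]
print(quartiles)  # [1, 1, 2, 2, 3, 3, 4, 4]
```

Note that quartile 1 holds the best-ranked wrestlers, so a lower number means a higher ranking.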


### How well does high school ranking predict participation in the NCAA tournament?


In [None]:
merged_dat = pandas.merge(hs2015_dat, ncaa18_dat, on=['First', 'Last'],how='inner')
pandas.crosstab(merged_dat.Quartile,merged_dat.Finish)

In [None]:
from statsmodels.graphics.mosaicplot import mosaic
plo = mosaic(merged_dat, ['Quartile', 'Finish'])

### What is the relationship between high school rank and NCAA place?

In [None]:
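# An outer merge keeps wrestlers that appear in only one of the two tables.
# (pandas raises a MergeError if `on` is combined with left_index/right_index.)
merged_dat = pandas.merge(hs2015_dat, ncaa18_dat, on=['First', 'Last'], how='outer')
# NCAA wrestlers missing from the high school rankings go in a fifth group.
merged_dat['Quartile'] = merged_dat.Quartile.fillna(5)
# Ranked wrestlers with no NCAA result are treated as non-qualifiers.
merged_dat['Finish'] = merged_dat.Finish.fillna('NQ')
pandas.crosstab(merged_dat.Quartile, merged_dat.Finish)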
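
To see which rows an outer merge takes from each side, `pandas.merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch on made-up rows (the names and finish codes are placeholders, not real results):

```python
import pandas

# Made-up toy rows; these are not real wrestlers or results.
hs = pandas.DataFrame({
    "First": ["Alex", "Blake", "Casey"],
    "Last": ["Smith", "Jones", "Brown"],
    "Quartile": [1, 2, 4],
})
ncaa = pandas.DataFrame({
    "First": ["Alex", "Drew"],
    "Last": ["Smith", "Green"],
    "Finish": ["1", "DNP"],
})

# indicator=True labels each row as 'both', 'left_only' or 'right_only'.
merged = pandas.merge(hs, ncaa, on=["First", "Last"], how="outer", indicator=True)
merged["Quartile"] = merged["Quartile"].fillna(5)   # unranked in high school
merged["Finish"] = merged["Finish"].fillna("NQ")    # no NCAA result
counts = merged["_merge"].value_counts()
print(merged)
print(counts)
```

Rows marked `left_only` are ranked wrestlers with no NCAA match, which may also include names that simply failed to match on spelling.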

In [None]:
from statsmodels.graphics.mosaicplot import mosaic
plt = mosaic(merged_dat, ['Quartile', 'Finish'])

# Web Scraping Solution

The data in `openmat2015.csv` were copied from a table in

https://news.theopenmat.com/high-school-wrestling/high-school-wrestling-rankings/final-2015-clinch-gear-national-high-school-wrestling-individual-rankings/57136

Can we 'scrape' this table directly?

In [None]:
path = 'https://news.theopenmat.com/high-school-wrestling/high-school-wrestling-rankings/final-2015-clinch-gear-national-high-school-wrestling-individual-rankings/57136'

In [None]:
from lxml import html
import requests
page = requests.get(path)
tree = html.fromstring(page.content)

print(type(tree))

print(page.content[:100])

The HtmlElement class gives us access to the HTML structure. See https://lxml.de/api/lxml.html.HtmlElement-class.html

In [None]:
dir(tree)[40:50]

We can navigate the HTML tree using `xpath`

In [None]:
table = tree.xpath('//table')
print(table)

The xpath syntax also includes node tests that look like function calls, such as `node()` (any child node) and `text()` (text nodes only); consider


In [None]:
table_nodes = tree.xpath('//table/node()')
print(table_nodes)
table_text = tree.xpath('//table/text()')
print(table_text)

We extract the table headers with

In [None]:
table_head = table[0].xpath('//th/text()')
print(table_head)

and the body of the table

In [None]:
table_body = table[0].xpath('//tbody')
print(table_body[0])

An xpath expression can also specify nested elements.

In [None]:
table_body_rows = table[0].xpath('//tbody/node()')
print(table_body_rows[0:9])
table_body_rows = table[0].xpath('//tbody/tr')
print(table_body_rows[0:9])

Accessing elements via HtmlElement is, to my thinking, very non-standard. Consider

In [None]:
print(table_body_rows[0])
print(table_body_rows[0].xpath('//td/text()')[0:8])
print(table_body_rows[1].xpath('//td/text()')[0:8])
print(table_body_rows[2].xpath('//td/text()')[0:8])

In [None]:
table_body_row = table[0].xpath('//tbody/tr[1]/td')
print(table_body_row)

In [None]:
table_cell = table[0].xpath('//tbody/tr[2]/td[1]/text()')
print(table_cell)
table_cell = table[0].xpath('//tbody/tr[2]/td[2]/text()')
print(table_cell)
table_cell = table[0].xpath('//tbody/tr[2]/td[3]/text()')
print(table_cell)
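
The reason the three rows above print the same cells is that an xpath starting with `//` is evaluated from the document root, even when it is called on a single row element. Prefixing the path with `.` makes it relative to that element. A minimal sketch on an inline, made-up table:

```python
from lxml import html

# Inline stand-in for a downloaded page; the rows are made up.
doc = html.fromstring("""
<html><body><table><tbody>
<tr><td>1</td><td>Alex Smith</td></tr>
<tr><td>2</td><td>Blake Jones</td></tr>
</tbody></table></body></html>
""")
rows = doc.xpath("//tbody/tr")

# '//td' searches the whole document, ignoring the row it is called on.
absolute = rows[1].xpath("//td/text()")
# './/td' searches only below the second row.
relative = rows[1].xpath(".//td/text()")

print(absolute)  # ['1', 'Alex Smith', '2', 'Blake Jones']
print(relative)  # ['2', 'Blake Jones']
```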

Fortunately, pandas gives us a simpler interface:

In [None]:
import pandas
openmat2015_dat = pandas.read_html(path)
print(openmat2015_dat)
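
`pandas.read_html` returns a list of DataFrames, one per `<table>` it finds, so a single table still has to be picked out by index. A small offline sketch, with an inline HTML snippet standing in for the downloaded page (the rows are made up):

```python
from io import StringIO

import pandas

# Inline stand-in for a downloaded page; the rows are made up.
snippet = """
<table>
  <thead><tr><th>Rank</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Alex Smith</td></tr>
    <tr><td>2</td><td>Blake Jones</td></tr>
  </tbody>
</table>
"""

# read_html accepts a URL or a file-like object and returns a list.
tables = pandas.read_html(StringIO(snippet))
first = tables[0]
print(len(tables))
print(first)
```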

I get inconsistent behaviour with this code:

In [None]:
path = 'https://www.wrestlestat.com/nationaltourneyresult/2018/individual/125'
ncaa2018_125 = pandas.read_html(path)
print(ncaa2018_125)

## HTML Attributes

Not only can we parse HTML nodes, we can also examine their attributes. Consider the links to national tournament results at WrestleStat:

In [None]:
path = 'https://www.wrestlestat.com/nationaltourneyresult'
page = requests.get(path)
tree = html.fromstring(page.content)
link_nodes = tree.xpath('//a[@href]')
print(link_nodes[0:10])
for i in range(20, 30):
    print(link_nodes[i].attrib)

We access individual attributes by key, as with a dictionary

In [None]:
for i in range(20, 30):
    print(link_nodes[i].attrib['href'])

and we can subset in the `xpath` syntax

In [None]:
link_nodes = tree.xpath('//a[contains(@href, "2018")]')
print(link_nodes)
for i in range(0, 11):
    print(link_nodes[i].attrib['href'])

In [None]:
link_nodes = tree.xpath('//a[contains(@href, "2018") and contains(@href, "individual")]')
print(link_nodes)
for i in range(0, 9):
    print(link_nodes[i].attrib['href'])
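
XPath can also return the attribute values themselves by ending the expression with `/@href`, and `urllib.parse.urljoin` turns site-relative links into full URLs. The sketch below uses an inline page with made-up links; that WrestleStat's links are site-relative is an assumption.

```python
from urllib.parse import urljoin

from lxml import html

base = "https://www.wrestlestat.com/nationaltourneyresult"

# Inline stand-in for the downloaded page; the link targets are illustrative.
doc = html.fromstring("""
<html><body>
<a href="/nationaltourneyresult/2018/individual/125">125</a>
<a href="/nationaltourneyresult/2018/team">Team</a>
<a href="/nationaltourneyresult/2017/individual/125">125</a>
<a name="no-href">anchor</a>
</body></html>
""")

# Ending the path with /@href returns the attribute strings directly.
hrefs = doc.xpath('//a[contains(@href, "2018") and contains(@href, "individual")]/@href')
# Resolve each site-relative link against the page URL.
urls = [urljoin(base, h) for h in hrefs]
print(urls)
```

Iterating over the returned list directly also avoids hard-coding how many links the page contains.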

# Exercises

# 1

The data table we read directly from The Open Mat has weight classes as headings. Can we process the HTML to create a table with weight class in columns?

# 2

Can these be read into tables compatible with the analysis at the top of this page?

- https://intermatwrestle.com/rankings/high_school
- https://news.theopenmat.com/high-school-wrestling/high-school-wrestling-rankings/adidas-national-high-school-wrestling-individual-rankings-january-2nd-2020/76034
- https://www.flowrestling.org/rankings/6448067-2019-20-high-school-rankings/35060-pound-for-pound

# 3

Go back to
https://www.itl.nist.gov/div898/strd/anova/SiRstv.html

Can you write code to read the data linked on this page, then iterate over the linked data sets by following the `Next Dataset` links?

# 4

Iterate over links in
https://www.win-magazine.com/category/hs-rankings/
and parse individual pages, i.e.
https://www.win-magazine.com/2019/12/wins-december-2019-high-school-rankings/
