# Version history
Natalia Vélez, January 2020

This script scrapes the [version history](https://onehouronelife.gamepedia.com/Version_history) from the OHOL wiki. Future analyses will use this table to partition the data: in general, we want to aggregate data from the same release and keep data from different releases separate.

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

Download HTML document:

In [5]:
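url = 'https://onehouronelife.gamepedia.com/Version_history'
# Time out instead of hanging, and fail loudly on HTTP errors
url_request = requests.get(url, timeout=30)
url_request.raise_for_status()
html = url_request.content
print(html)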

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Version history - Official One Hour One Life Wiki</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Version_history","wgTitle":"Version history","wgCurRevisionId":12550,"wgRevisionId":12550,"wgArticleId":3490,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Stubs","Version History"],"wgBreakFrames":true,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September"

Find table within HTML document:

In [19]:
soup = BeautifulSoup(html, 'lxml')
all_tables = soup.find_all('table')
ver_table = all_tables[1]

Column names:

In [34]:
print(ver_table.find_all('th'))
ver_cols = ['version_no', 'name', 'release_date', 'onetech_link']

[<th>Version number
</th>, <th>Name
</th>, <th>Release Date
</th>, <th>Onetech link
</th>]
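
Selecting the table by position (`all_tables[1]`) depends on the page layout staying the same. As a minimal sketch, assuming the header labels printed above remain stable, the table could instead be selected by its header text. The helper `find_table_by_headers` and the toy HTML are hypothetical and only stand in for the real wiki markup.

```python
from bs4 import BeautifulSoup


def find_table_by_headers(soup, expected_headers):
    """Return the first <table> whose <th> texts include all expected headers, else None."""
    expected = {h.lower() for h in expected_headers}
    for table in soup.find_all('table'):
        headers = {th.get_text(strip=True).lower() for th in table.find_all('th')}
        if expected <= headers:
            return table
    return None


# Toy document standing in for the wiki page (not the real markup)
sample_html = """
<table><tr><th>Unrelated</th></tr><tr><td>x</td></tr></table>
<table>
  <tr><th>Version number</th><th>Name</th><th>Release Date</th><th>Onetech link</th></tr>
  <tr><td>254</td><td></td><td>July 28, 2019</td>
      <td><a href="https://onetech.info/versions/254">onetech</a></td></tr>
</table>
"""

# html.parser ships with Python, so this sketch does not need lxml
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = find_table_by_headers(sample_soup, ['Version number', 'Release Date'])
```

If no table matches, the helper returns `None`, which makes a layout change easier to notice than a silently wrong index.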


Retrieve data from HTML table:

In [47]:
data = []
rows = ver_table.find_all('tr')
for row in rows:
    raw_cols = row.find_all('td')
    cols = []
    for ele in raw_cols:
        txt = ele.text.strip()
        link = ele.find('a')
        # For the last column, store the onetech URL instead of the link label
        if 'onetech' in txt and link is not None:
            txt = link.get('href')
        cols.append(txt)
    data.append(cols)

# Remove empty rows
data = [r for r in data if len(r) > 0]
data[:6]

[['254', '', 'July 28, 2019', 'https://onetech.info/versions/254'],
 ['253', '', 'July 26, 2019', 'https://onetech.info/versions/253'],
 ['252', 'Grand Arc[1]', 'July 26, 2019', 'https://onetech.info/versions/252'],
 ['250', '', 'July 20, 2019', 'https://onetech.info/versions/250'],
 ['249',
  'New Brothers[2]',
  'July 19, 2019',
  'https://onetech.info/versions/249'],
 ['247', 'More Fixes[3]', 'July 6, 2019', 'https://onetech.info/versions/247']]
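
Some names in the output above end in bracketed numbers such as `[1]`, which appear to be the wiki's reference markers rather than part of the release name. A minimal sketch of one way to strip them, assuming a marker is always a trailing run of digits in square brackets; the helper `strip_footnote` is hypothetical and not part of the original notebook.

```python
import re

# Matches one trailing marker such as "[1]" or "[12]", plus any space before it
FOOTNOTE_RE = re.compile(r'\s*\[\d+\]$')


def strip_footnote(name):
    """Remove a trailing reference marker from a version name, if present."""
    return FOOTNOTE_RE.sub('', name)


# Names copied from the output above
names = ['', 'Grand Arc[1]', 'New Brothers[2]', 'More Fixes[3]']
clean_names = [strip_footnote(n) for n in names]
print(clean_names)  # ['', 'Grand Arc', 'New Brothers', 'More Fixes']
```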

Save data as pandas dataframe:

In [51]:
ver_df = pd.DataFrame(data, columns=ver_cols)
ver_df.head()

   version_no             name   release_date                       onetech_link
0         254                   July 28, 2019  https://onetech.info/versions/254
1         253                   July 26, 2019  https://onetech.info/versions/253
2         252     Grand Arc[1]  July 26, 2019  https://onetech.info/versions/252
3         250                   July 20, 2019  https://onetech.info/versions/250
4         249  New Brothers[2]  July 19, 2019  https://onetech.info/versions/249


Reformat dates:

In [52]:
ver_df.release_date = pd.to_datetime(ver_df.release_date)
ver_df.release_date = ver_df.release_date.dt.strftime('%Y-%m-%d')
ver_df.head()

   version_no             name  release_date                       onetech_link
0         254                     2019-07-28  https://onetech.info/versions/254
1         253                     2019-07-26  https://onetech.info/versions/253
2         252     Grand Arc[1]    2019-07-26  https://onetech.info/versions/252
3         250                     2019-07-20  https://onetech.info/versions/250
4         249  New Brothers[2]    2019-07-19  https://onetech.info/versions/249
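
The conversion above lets `pd.to_datetime` infer the date format. As a minimal sketch, assuming the wiki always writes dates like `July 28, 2019`, the format can be passed explicitly so that unexpected entries surface as missing values instead of being guessed at. The sample series, including its deliberately malformed entry, is made up for illustration.

```python
import pandas as pd

# Two dates in the wiki's style plus one malformed entry (illustrative only)
raw_dates = pd.Series(['July 28, 2019', 'July 6, 2019', 'not a date'])

# errors='coerce' turns anything that does not match the format into NaT
parsed = pd.to_datetime(raw_dates, format='%B %d, %Y', errors='coerce')
iso_dates = parsed.dt.strftime('%Y-%m-%d')

# Count entries that failed to parse so they can be inspected by hand
n_bad = int(parsed.isna().sum())
```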


Save to file:

In [53]:
ver_df.to_csv('outputs/version_history.tsv', sep='\t', index=False)
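
One possible way to use the saved table for partitioning, sketched under assumptions: each observation is labeled with the most recent release on or before its timestamp using `pd.merge_asof`. The `events` frame, its column names, and its timestamps are hypothetical; the version rows are copied from the output above.

```python
import pandas as pd

# Stand-in for the saved table (three rows from the output above)
versions = pd.DataFrame({
    'version_no': ['254', '253', '250'],
    'release_date': pd.to_datetime(['2019-07-28', '2019-07-26', '2019-07-20']),
})

# Hypothetical observations; a real analysis would supply its own data
events = pd.DataFrame({
    'event_id': [1, 2, 3],
    'timestamp': pd.to_datetime(
        ['2019-07-21 10:00', '2019-07-26 00:00', '2019-07-30 08:30']
    ),
})

# merge_asof needs both frames sorted by their keys; direction='backward'
# picks the latest release_date that is <= each timestamp
labeled = pd.merge_asof(
    events.sort_values('timestamp'),
    versions.sort_values('release_date'),
    left_on='timestamp',
    right_on='release_date',
    direction='backward',
)
```

Because the table only records dates, releases that share a day (versions 252 and 253 are both listed on July 26, 2019) cannot be told apart this way and would need extra information.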