# Scraping

This notebook contains step-by-step instructions for scraping together an opening database from games played at webDiplomacy and vDiplomacy.

## Imports

In [1]:
import pandas as pd
import glob, os

from cleaning import discard_game, discard_short_games, manuals
from games import scrape_games, load_year
from search import scrape_search_pages
from variants import add_variant

## Meta-information

First let us check what databases are currently available.

In [16]:
for database in glob.glob('data/*.csv'):
    print(database)

The table `variants.csv` stores meta-information such as the aliases of the variants on different webpages, which dictionary should be used for translating province names to short forms, and which user IDs to ignore when scraping (read: bots).

In [17]:
variants = pd.read_csv('data/variants.csv')
variants

You should use the function `add_variant` from the `variants` package if you want to create a database for a new variant. It will modify the `variants.csv` table, create the table file for the variant, and create the necessary folders.

`add_variant(name, powers, years=2, overwrite=False)`

The parameters are:

`name (string)`: Your alias for the variant.

`powers (list of strings)`: The list of powers appearing in the variant.

`years (integer)`: The number of game _years_ to be included in the opening analysis.

`overwrite (boolean)`: Whether an existing table file should be overwritten.

Below is an example of how to call `add_variant` to create an opening database for Germany vs. Italy.

In [18]:
# ClassicGvI
add_variant('ClassicGvI', ['Germany', 'Italy'], years=2, overwrite=True)

variants = pd.read_csv('data/variants.csv')
variants

Some information remains to be filled in. You'll have to enter the information manually (i.e., using pandas); see the code below. Notice that `ClassicGvI` is the variant with index 1 in the above table.

The meaning of some entries is clear.

`webDiplomacy` holds the alias for the variant on webDiplomacy.

`vDiplomacy` holds the alias for the variant on vDiplomacy.

`Ignore` holds a list of user IDs to ignore (used to filter out bots).

`Start` holds the starting year of the variant.

The remaining two parameters are special. If you want to replace full-length province names with abbreviations, then you should provide a dictionary, stored as a `.json` file in the `abbreviations` folder. There is a dictionary for the classic map already. The `Dictionary` column should contain the name of the dictionary file (without the file extension). You can leave this field empty if you wish, in which case full-length province names will be used.

The `Messaging` parameter can be set to `'All'` or `'Gunboat'`.

In [19]:
variants = pd.read_csv('data/variants.csv')

# The variant 'ClassicGvI' has index 1 in the dataframe.
index = 1

variants.loc[index, 'webDiplomacy'] = 'ClassicGvI'
variants.loc[index, 'vDiplomacy'] = 'ClassicGvI'
variants.loc[index, 'Dictionary'] = 'Classic'
variants.loc[index, 'Ignore'] = "[108388]"
variants.loc[index, 'Start'] = 1901
variants.loc[index, 'Messaging'] = 'All'

variants.to_csv('data/variants.csv', index=False)
variants
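To illustrate what the `Dictionary` entry refers to, here is a minimal sketch of how an abbreviation dictionary could be applied to an order. It assumes that the `.json` file maps full-length province names to short forms; the actual layout of the files in the `abbreviations` folder may differ, and both the sample entries and the helper `abbreviate` are hypothetical.

```python
import json

# Hypothetical excerpt of an abbreviation dictionary. The real file layout
# in the `abbreviations` folder is an assumption here and may differ.
ABBREVIATIONS_JSON = '{"Munich": "Mun", "Venice": "Ven", "Tyrolia": "Tyr"}'


def abbreviate(order, abbreviations):
    """Replace every full-length province name in `order` with its short form."""
    # Replace longer names first, so that a name contained in a longer
    # name cannot clobber it.
    for name in sorted(abbreviations, key=len, reverse=True):
        order = order.replace(name, abbreviations[name])
    return order


abbreviations = json.loads(ABBREVIATIONS_JSON)
short_order = abbreviate("A Munich - Tyrolia", abbreviations)
print(short_order)  # A Mun - Tyr
```

With an empty dictionary the order is returned unchanged, which mirrors the behaviour described above for an empty `Dictionary` field.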

There might be other parameters that you want to filter by: phase length, pot size, etc. Most such parameters will be stored in the database, allowing you to filter the database at a later stage. If this is not the case, then you may submit a ticket on GitHub (or code it yourself).

## Scraping lists of games
The first step is to scrape lists of all games that we are interested in, which we do by scraping the search page, using

`scrape_search_pages(variant, webpage, first, last)`

Here, `variant` is the variant name and `webpage` is the name of the host website. The arguments `first` and `last` are integers marking the first and last search result pages (SERPs) to be scraped; they allow the list of all games to be scraped in batches.

There is no issue with duplicates: if you scrape a SERP which includes a game already in your database, then that game will be ignored. Hence, if you want to scrape in batches _and_ make sure that you don't miss any games, then you should start with the first SERP. 

If you plan on keeping the database up to date, then you should make a note of the lowest GameID of an _active_ game; that'll be the game to look for when determining how many SERPs to scrape at your next visit. That's something which you will have to do manually.

There is a 3s courtesy delay between scrapes, so as not to disturb the server too much.

In [7]:
scrape_search_pages('ClassicGvI', 'webDiplomacy', 1, 5)
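The courtesy delay can be pictured with the following sketch. It is not the implementation used by `scrape_search_pages`; the names `scrape_politely` and `fake_fetch` are hypothetical, and the HTTP request is replaced by a stand-in so that the example runs offline.

```python
import time


def scrape_politely(pages, fetch, delay=3.0, sleep=time.sleep):
    """Call `fetch(page)` for each page, pausing `delay` seconds between calls."""
    results = []
    for i, page in enumerate(pages):
        if i > 0:
            sleep(delay)  # courtesy pause before every request except the first
        results.append(fetch(page))
    return results


def fake_fetch(page):
    """Stand-in for a real HTTP request to a search page (hypothetical)."""
    return f"search page {page}"


# Record the pauses instead of actually sleeping, to keep the demo fast.
pauses = []
fetched = scrape_politely(range(1, 6), fake_fetch, sleep=pauses.append)
print(len(fetched), "pages fetched with", len(pauses), "pauses")
```

Passing `sleep` as a parameter is a design choice made for this sketch only: it lets you check the pacing without waiting for the real delays.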

## Cleaning, part 1
If you decided to have columns for _k_ in-game years in the database, then you will have to discard all games that ended before year _k_.
You may of course discard games that ended before year _j_, as long as _j_ is at least as big as _k_.

You can use the function `discard_short_games(variant, years, dataframe)`. We pass in the `variants` dataframe from above, which contains the starting year of the variant.

In [8]:
discard_short_games('ClassicGvI', 2, variants)
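The rule behind this step can be sketched as follows. This is an assumption about the criterion, not the code of `discard_short_games`: it supposes that we know the last in-game year each game reached, and that a variant starting in year `start` needs games that reached year `start + years - 1`. The game IDs and years below are made up for illustration.

```python
def reached_year(last_year, start, years):
    """True if a game that ended in `last_year` reached in-game year number `years`."""
    return last_year >= start + years - 1


# Hypothetical mapping of GameID -> last in-game year reached.
games = {101: 1901, 102: 1902, 103: 1907}

# Keep only games that reached the second in-game year (1902 for Start = 1901).
kept = {game_id for game_id, last in games.items() if reached_year(last, 1901, 2)}
print(sorted(kept))  # [102, 103]
```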

## Scraping game files
The next step is to scrape the HTML files with orders for each game. Use the function `scrape_games(variant, webpage, m=m, verbose=verbose)`. The parameter `m` is the batch size. You can turn off the printed comments with the `verbose` parameter.

There is a 3s courtesy delay between scrapes, so as not to disturb the server too much.

In [20]:
scrape_games('ClassicGvI', 'webDiplomacy', m=100, verbose=True)

## Cleaning, part 2
The order page scraped from vDip only shows the last 10 in-game years. That is, we cannot automatically load the first two years' orders into the database if a game went on for more than 10 years. Therefore, the dataframe has a `Manual` column. N.B., the current version of this function is slow and could be improved.

**You must run this line even if you did not scrape from vDiplomacy.**

In [11]:
manuals('ClassicGvI', variants)

Let us check how many games we have which require manual entries. Ideally, you should take the time to enter them manually. But the world is not an ideal place... If the variant you have chosen typically lasts more than 10 years, then you might just have to stick with only scraping the webDip games.

In [22]:
data = pd.read_csv('data/ClassicGvI.csv')
data.Manual.value_counts()

## Import orders into the database
The following function reads in all orders from in-game year `year` into the database:

`load_year(year, variant, webpage, variants, m=m)`

Here, `m` is the batch size. Loading the database, and comparing with available games (i.e., files in the games folders), is the step which takes the most time. The number of games loaded at once does not make a big difference. That is, you might as well load all available games at once. Default is 1000 games per batch.

Any `SettingWithCopyWarning` can be ignored.

In [23]:
load_year(1, 'ClassicGvI', 'webDiplomacy', variants)
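For readers unfamiliar with batch sizes, the sketch below shows one generic way of splitting a list of games into batches of `m`. It is only an illustration under the assumption that games are processed as a plain list; `load_year` may organise its batches differently, and the helper `batches` is hypothetical.

```python
def batches(items, m=1000):
    """Yield consecutive slices of `items`, each holding at most `m` elements."""
    for start in range(0, len(items), m):
        yield items[start:start + m]


# Hypothetical list of 2500 game IDs, split using the default batch size.
game_ids = list(range(2500))
sizes = [len(batch) for batch in batches(game_ids)]
print(sizes)  # [1000, 1000, 500]
```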

One comment is in order. If we look at the explicit values in the column for Spring 01 orders, then the entries _seem_ to be lists of strings. In fact, they are strings. This is an artefact of the database being stored as a `.csv` file. If you want to break the strings apart into lists, to analyse separate orders, then you'll have to do some preprocessing.

N.B., if you see a `NaN` when running the code below, then it is because one of your first five games has been discarded.

In [24]:
example = pd.read_csv('data/ClassicGvI.csv')
example.ItalyS1[:5]
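The preprocessing mentioned above could look like the following sketch. It assumes that each cell holds the text of a Python list of strings (as the printed values suggest) and that discarded games show up as empty or `NaN` cells. The sample rows and the `GameID` column name are made up for illustration; only `ItalyS1` is taken from the database above.

```python
import ast
import csv
import io

# Made-up sample in the assumed storage format: a list written out as text.
CSV_TEXT = """\
GameID,ItalyS1
1,"['A Ven - Tyr', 'A Rom - Ven', 'F Nap - Ion']"
2,
"""


def parse_orders(cell):
    """Turn a stringified list of orders into a real list; missing cells give []."""
    # Empty strings (csv module) and NaN floats (pandas) both count as missing.
    if not isinstance(cell, str) or not cell.strip():
        return []
    # literal_eval only parses Python literals, so it does not execute code.
    return ast.literal_eval(cell)


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
orders = [parse_orders(row["ItalyS1"]) for row in rows]
print(orders[0][0])  # A Ven - Tyr
```

With pandas, the same helper can be applied to a whole column, e.g. `example.ItalyS1.apply(parse_orders)`.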

## Cleaning, part 3
There are a few things (for example, bugs with the `order.php` page at webDip/vDip) which can break some assumptions of the functions that we use in this notebook. One example is if the game ends with only units belonging to one power. This is not a big problem: such 'outlier' games tend to be 'non-competitive'. For example, the players might deliberately try to create a funny-looking map. You'll probably find a few such games when analysing the data. The most reasonable thing is to discard those games. Use the function `discard_game(index, variant)` to discard the game which has index `index` in the database.

A second problem can be the custom win conditions on vDiplomacy. This usually does not affect the functions in this notebook, but you might want to exclude such games anyway. There is no code to do this automatically at the moment.

In [65]:
discard_game(0, 'ClassicGvI')