# Create the input features and output labels for the Machine Learning classification training

## Overall goal:
This project is an attempt to classify a Civ 6 starting city location based solely on the tile/plot information available when settling.

This workbook is where we start selecting the data we are going to use to train the classifier.

## Labels:
These are the values we wish to predict. In this version of the model, a "good" city is defined by the cumulative yields it produces during the 50 turns after it is settled. Each yield is graded into 5 quantiles scored 0 through 4. These scores are then added together and graded again into 5 quantiles scored 0 through 4 to give the city's "goodness".

That is, all the food the city produces is summed, then graded by the quantile it falls into across ALL the cities. This process is repeated for production, gold, science, and culture to get 5 different scores for the city in question. These 5 scores are then summed to give a potential maximum score of 20 and a potential minimum of 0.

We will use the final score (0, 1, 2, 3, 4), stored in the cityScore column, as the value we wish to predict as part of the classification.
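
The grading step described above can be sketched on a toy table. This is a minimal illustration, not the notebook's actual cell: the six cities and their totals are made up, and only two yields are used. It applies the same bin edges as the code further down. Note that five edges passed to pd.qcut produce four bins, so with these edges the per-yield scores run from 0 to 3.

```python
import pandas as pd

# Hypothetical 50-turn cumulative totals for six cities (made-up numbers).
totals = pd.DataFrame({
    "cityId": [1, 2, 3, 4, 5, 6],
    "foodTotal": [100.0, 150.0, 200.0, 250.0, 300.0, 350.0],
    "goldTotal": [90.0, 80.0, 70.0, 60.0, 50.0, 40.0],
})

# Bin edges expressed as cumulative probabilities; n + 1 edges give n bins.
edges = [0, 0.5, 0.75, 0.9, 1]

scores = pd.DataFrame({"cityId": totals["cityId"]})
for yld in ["food", "gold"]:
    # labels=False returns the integer bin number instead of an interval
    scores[yld + "Score"] = pd.qcut(totals[yld + "Total"], edges, labels=False)

# Sum the per-yield scores to get the combined city total.
scores["cityTotal"] = scores[["foodScore", "goldScore"]].sum(axis=1)
print(scores)
```

Grading the summed total a second time works the same way. Because summed scores contain many ties, pd.qcut may complain about duplicate bin edges on small samples; its duplicates="drop" option is one way around that.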

## Features:
These are the values we have in order to make the classification. This version of the model uses a combined terrain, feature, and resource key, determined by looking at all the known plots in the database, to determine the plot frequency percentage.

That is, each plot in the first 2 rings around the city centre (19 in total, including the centre) is assigned to one of the keys. The count for each key is then divided by 19 to get the percentage of tiles in the city that are, for example, "Plains (Hills) with Woods and a luxury resource". Hint: this is a pretty rare occurrence in the data I've gathered.

Likewise, the percentage of tiles with a river is calculated, along with a binary yes/no flag for whether the city has a river.
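
As a rough sketch of the frequency idea, pd.crosstab can count the keys per city and divide by the number of plots in one step. The table below is hypothetical (four plots per city rather than 19), and the column names recordedCityId and catTf are borrowed from the code later in this notebook.

```python
import pandas as pd

# Hypothetical plots for two cities; a real city would contribute 19 rows.
plots = pd.DataFrame({
    "recordedCityId": [1, 1, 1, 1, 2, 2, 2, 2],
    "catTf": ["Plains", "Plains", "PlainsHills", "Grassland",
              "Desert", "Desert", "Desert", "Grassland"],
})

# normalize="index" divides every row by its row total, so each city's
# category counts become fractions that sum to 1.
fractions = pd.crosstab(plots["recordedCityId"], plots["catTf"],
                        normalize="index")
print(fractions)
```

Categories a city does not contain show up as 0, so every city ends up with the same set of columns.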

## Observations:
The following questions and assumptions still need to be reviewed and reflected on to improve the model.

### What is a "good" city?
There are many other factors that could determine a good city, especially later in the game when you have more options to utilise tiles better. Assuming that we are only interested in the first 50-55 turns of the game, focussing on the lifetime yield seems acceptable.

### Which tiles should be used as input?
With the exception of Peter (Russia), most cities cap out (in 50 turns) at the city centre, the first ring, and 2-3 tiles of the second ring. That is 8-9 tiles. Also, the population usually can't work all of these tiles in any case. There is also the possibility of buying tiles. In the end I decided to use all the tiles in the first 2 rings.

### How do you summarise the input to manageable levels without losing too much accuracy?
When I started this, I thought to break the inputs down per tile, that is, have input columns for each tile with features, resources, bonuses, next to a river, has a worker, etc. This very quickly leads to an input feature explosion. E.g. assume 19 tiles, 15+ terrains, 7+ features, 44+ resources, and has/hasn't river: that already gives >175k input columns. Assuming you need 20 samples per feature for any semblance of accuracy, I'd need to have >3.5 million city records.
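
The estimate above is a straight multiplication. The sketch below reproduces it using the lower bounds quoted in the text; the variable names are only for illustration.

```python
# Lower-bound counts quoted in the text above.
tiles = 19
terrains = 15
features = 7
resources = 44
river_states = 2  # has / hasn't river

# One input column per tile and per terrain/feature/resource/river combination.
input_columns = tiles * terrains * features * resources * river_states

# Rule of thumb used in the text: 20 samples per input feature.
samples_per_feature = 20
required_records = input_columns * samples_per_feature

print(input_columns)     # 175560
print(required_records)  # 3511200
```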

The terrain, feature, resource key percentage reduces the input requirement to around 70, which is manageable. I only have around 10-20% of the recommended number of samples, but let's see where this gets us.

### What else should be used to answer the question?
The complexity becomes unmanageable pretty quickly if you try to include, for example, the build order in the city. The same goes for district placement considerations.

In [1]:
import sqlite3
import pandas as pd

In [2]:
cnx = sqlite3.connect('Database/Civ6CitySettledData.db')
cur = cnx.cursor()
print(cnx)
print(cur)

<sqlite3.Connection object at 0x0000019B4D0A3E30>
<sqlite3.Cursor object at 0x0000019B4F17F5E0>


## Labels:

Use the specifically created database view to retrieve the per-turn data collected. This view also aligns timelines, that is, it takes care of cities settled in turn 2, or even turn 3. I was surprised to learn the AI actually moves settlers before settling.

In [3]:
sqlSelect = 'SELECT * FROM cityPerTurnView WHERE turns >= 1 and turns <= 50'
cityPt = pd.read_sql_query(sqlSelect, cnx)
print(cityPt.shape)

(6400, 29)


I have excluded Faith as it was simply too variable to use. Also, it doesn't appear to be a core yield in general, although it is situationally very useful.

In [4]:
yields = ['food', 'production', 'gold', 'science', 'culture']
#yields = ['food']
cityIds = list(cityPt['cityId'].unique())
# A list keeps the column order deterministic (a set is unordered, and newer pandas versions reject it)
labelsDf = pd.DataFrame(columns=['foodTotal', 'foodScore', 'productionTotal', 'productionScore',
                                 'goldTotal', 'goldScore', 'scienceTotal', 'scienceScore',
                                 'cultureTotal', 'cultureScore', 'cityTotal', 'cityScore'],
                        index=cityIds)
labelsDf.reset_index(level=0,inplace=True)
labelsDf.rename(columns={'index':'cityId'}, inplace=True)
labelsDf.sort_values(by='cityId', inplace=True)
labelsDf.index = pd.RangeIndex(len(labelsDf.index))
labelsDf.fillna(0, inplace=True)

quantiles = [0, .5, .75, .9, 1]
#quantiles = [0, .75, 1]

for yld in yields:
    columnName = "{}PerTurn".format(yld)
    cumulativeDf = cityPt[['cityId', 'turns', columnName]].pivot(index='turns',
                                                                 columns='cityId',
                                                                 values=columnName).cumsum()
    t50 = cumulativeDf.loc[50].to_frame()
    t50.reset_index(level=0, inplace=True)
    t50.rename(columns={50:"{}Total".format(yld)}, inplace=True)
    # join here via boolean map as using cityID, not index - see df.update below
    labelsDf.loc[labelsDf.cityId.isin(t50.cityId), "{}Total".format(yld)] = t50["{}Total".format(yld)]
    
    quantilesDf = pd.qcut(labelsDf["{}Total".format(yld)], quantiles, labels=False).to_frame()
    quantilesDf.rename(columns={"{}Total".format(yld):"{}Score".format(yld)}, inplace=True)
    # update works cause we have index alignment
    labelsDf.update(quantilesDf)
    labelsDf['cityTotal'] = labelsDf['cityTotal'] + labelsDf["{}Score".format(yld)]

quantilesDf = pd.qcut(labelsDf['cityTotal'], quantiles, labels=False).to_frame()
quantilesDf.rename(columns={'cityTotal':'cityScore'}, inplace=True)
# update works cause we have index alignment
labelsDf.update(quantilesDf)

print(labelsDf)

     cityId  cityScore  goldTotal  cultureTotal  productionScore  \
0         1          0     289.80        101.97                2   
1         2          1     660.70        164.75                2   
2         3          0     251.50        153.38                0   
3         4          0     262.50        140.42                0   
4         5          0     250.00        141.82                0   
5         6          1     305.00        111.22                1   
6         7          1     452.30        141.93                1   
7         8          1     406.05         99.73                1   
8         9          1     338.00        169.84                1   
9        10          1     275.52        211.89                0   
10       11          0     252.50        161.56                1   
11       12          0     250.95        129.82                0   
12       13          0     278.55        105.20                0   
13       14          0     257.50        143.70 

In [5]:
labelsDf.to_csv('ModelInput/labels.csv', encoding='utf-8', index=False)

## Features:

Use the cityPlotsSettled data collected to prepare the features we intend using as input into the model. The category key idea is explained in the introduction.

We also need to add the percentage of tiles that have a river, as well as the cityHasRiver input.

In [6]:
sqlSelect = 'SELECT * FROM cityPlotsSettled'
cityPs = pd.read_sql_query(sqlSelect, cnx)
print(cityPs.shape)
#print(cityPs.dtypes)

(2432, 15)


In [7]:
# Create category key for plot by combining terrain and feature
cityPs['catTf'] = cityPs['terrain'] + cityPs['feature']
# Remove None, whitespace, and brackets
cityPs['catTf'] = cityPs['catTf'].apply(lambda s: s.replace('None', '').replace(' ', '').replace('(','').replace(')', ''))

# And, a category key for resources...
# Start from the raw resource name
cityPs['catR'] = cityPs['resource']
# Remove all "hidden" strategic resources
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Uranium', 'None').replace('Oil', 'None').replace('Niter', 'None'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Aluminum', 'None').replace('Coal', 'None').replace('Iron', 'None'))
# Consolidate Luxuries
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Dyes', 'Lux').replace('Silver', 'Lux').replace('Diamonds', 'Lux'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Tea', 'Lux').replace('Salt', 'Lux').replace('Olives', 'Lux'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Ivory', 'Lux').replace('Sugar', 'Lux').replace('Coffee', 'Lux'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Cotton', 'Lux').replace('Furs', 'Lux').replace('Whales', 'Lux'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Marble', 'Lux').replace('Jade', 'Lux').replace('Turtles', 'Lux'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Gypsum', 'Lux').replace('Mercury', 'Lux').replace('Tobacco', 'Lux'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Wine', 'Lux').replace('Truffles', 'Lux').replace('Incense', 'Lux'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Silk', 'Lux').replace('Citrus', 'Lux').replace('Spices', 'Lux'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Cocoa', 'Lux').replace('Pearls', 'Lux'))
# Consolidate bonus resources
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Sheep', 'Bonus').replace('Bananas', 'Bonus'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Fish', 'Bonus').replace('Deer', 'Bonus'))
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace('Crabs', 'Bonus').replace('Copper', 'Bonus'))
# Remove whitespace and brackets (the 'None' column is dropped later)
cityPs['catR'] = cityPs['catR'].apply(lambda s: s.replace(' ', '').replace('(','').replace(')', ''))

# Still have 35 different "categories" here!?
#print(cityPs[['plotId','catTf']].groupby('catTf').count().sort_values(by='catTf', ascending=True).count())
#print(cityPs[['plotId','catR']].groupby('catR').count().sort_values(by='catR', ascending=True).count())

#print(list(cityPs['catTf'].unique()))
#print(list(cityPs['catR'].unique()))

cols = ['cityId'] + list(cityPs['catTf'].unique()) + list(cityPs['catR'].unique())
#list(filter(None, list(cityPs['catR'].unique())))
featuresDf = pd.DataFrame(columns = cols)
print(featuresDf.shape)
print(featuresDf.head)

(0, 36)
<bound method NDFrame.head of Empty DataFrame
Columns: [cityId, PlainsHills, PlainsMountain, PlainsRainforest, Plains, PlainsHillsWoods, DesertHills, PlainsWoods, PlainsHillsRainforest, Grassland, CoastandLake, GrasslandHills, DesertMountain, Desert, GrasslandWoods, GrasslandMarsh, Ocean, Tundra, Snow, GrasslandHillsWoods, GrasslandMountain, CoastandLakeReef, DesertFloodplains, TundraWoods, TundraMountain, TundraHills, DesertOasis, TundraHillsWoods, None, Bonus, Lux, Wheat, Stone, Cattle, Horses, Rice]
Index: []

[0 rows x 36 columns]>
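
The resource consolidation above could also be driven by a lookup table. The sketch below is one possible alternative, not the notebook's method: the helper name consolidate_resource and the three list names are made up for illustration. It matches whole resource names exactly, whereas the chained str.replace calls match substrings, so the two only agree as long as every resource value is one of the plain names listed.

```python
# Resource groups taken from the replace() chains in the cell above.
HIDDEN_STRATEGIC = ["Uranium", "Oil", "Niter", "Aluminum", "Coal", "Iron"]
LUXURIES = ["Dyes", "Silver", "Diamonds", "Tea", "Salt", "Olives", "Ivory",
            "Sugar", "Coffee", "Cotton", "Furs", "Whales", "Marble", "Jade",
            "Turtles", "Gypsum", "Mercury", "Tobacco", "Wine", "Truffles",
            "Incense", "Silk", "Citrus", "Spices", "Cocoa", "Pearls"]
BONUS = ["Sheep", "Bananas", "Fish", "Deer", "Crabs", "Copper"]

# Build one name -> group lookup table.
RESOURCE_GROUPS = {}
RESOURCE_GROUPS.update({name: "None" for name in HIDDEN_STRATEGIC})
RESOURCE_GROUPS.update({name: "Lux" for name in LUXURIES})
RESOURCE_GROUPS.update({name: "Bonus" for name in BONUS})


def consolidate_resource(name):
    """Map a raw resource name to its group; unlisted names pass through."""
    grouped = RESOURCE_GROUPS.get(name, name)
    # Same clean-up as above: drop whitespace and brackets.
    return grouped.replace(" ", "").replace("(", "").replace(")", "")


print(consolidate_resource("Silk"))   # Lux
print(consolidate_resource("Wheat"))  # Wheat
```

With a helper like this, the column could be built in one call, for example cityPs['catR'] = cityPs['resource'].apply(consolidate_resource).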


In [8]:
# category percentages
for cityId in cityPs['recordedCityId'].unique():
    plotsDf = cityPs[cityPs['recordedCityId'] == cityId]

    catTfSeries = plotsDf.groupby(['catTf'])['plotId'].count()
    catRSeries = plotsDf.groupby(['catR'])['plotId'].count()
                                
    featuresDf.loc[cityId, 'cityId'] = cityId
    # Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
    for cat, val in catTfSeries.items():
        featuresDf.loc[cityId, cat] = val
    for cat, val in catRSeries.items():
        featuresDf.loc[cityId, cat] = val
featuresDf.fillna(0, inplace=True)
print(featuresDf.head)

<bound method NDFrame.head of      cityId  PlainsHills  PlainsMountain  PlainsRainforest  Plains  \
1         1            3               1                 1       6   
2         2            1               0                 3       2   
3         3            1               0                 1       6   
4         4            1               0                 2       4   
5         5            0               0                 1       0   
6         6            0               0                 0       0   
7         7            4               2                 4       5   
8         8            0               0                 1       2   
9         9            1               1                 4       4   
10       10            5               3                 0       2   
11       11            3               1                 6       4   
12       12            2               0                 2       6   
13       13            0               0                 0  

In [9]:
# cityHasRiver: take the hasRiver flag of the city-centre plot (no river percentage is computed in this cell)
for cityId in cityPs['recordedCityId'].unique():
    plotsDf = cityPs[cityPs['recordedCityId'] == cityId]
    
    featuresDf.loc[cityId, 'cityHasRiver'] = plotsDf[plotsDf['isCity'] == True].iloc[0,:].hasRiver

# Drop the "none" now
del featuresDf['None']
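
The river percentage mentioned in the introduction could be derived with a groupby. The sketch below runs on a hypothetical plots table and assumes hasRiver and isCity are stored as booleans (or 0/1); the riverFraction column name is made up for illustration.

```python
import pandas as pd

# Hypothetical plots for two cities; a real city would contribute 19 rows.
plots = pd.DataFrame({
    "recordedCityId": [1, 1, 1, 1, 2, 2, 2, 2],
    "isCity": [True, False, False, False, True, False, False, False],
    "hasRiver": [True, True, False, False, False, False, False, True],
})

# The mean of a 0/1 column is the fraction of plots with a river.
river_fraction = (plots["hasRiver"].astype(float)
                  .groupby(plots["recordedCityId"]).mean())

# The flag on the city-centre plot says whether the city itself is on a river.
city_has_river = (plots[plots["isCity"].astype(bool)]
                  .set_index("recordedCityId")["hasRiver"].astype(int))

# Both Series are indexed by city id, so they align when combined.
river_features = pd.DataFrame({
    "riverFraction": river_fraction,
    "cityHasRiver": city_has_river,
})
print(river_features)
```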

In [10]:
print(featuresDf.head())

   cityId  PlainsHills  PlainsMountain  PlainsRainforest  Plains  \
1       1            3               1                 1       6   
2       2            1               0                 3       2   
3       3            1               0                 1       6   
4       4            1               0                 2       4   
5       5            0               0                 1       0   

   PlainsHillsWoods  DesertHills  PlainsWoods  PlainsHillsRainforest  \
1                 2            1            3                      1   
2                 1            2            2                      0   
3                 0            0            2                      0   
4                 1            0            1                      0   
5                 0            0            0                      0   

   Grassland      ...       DesertOasis  TundraHillsWoods  Bonus  Lux  Wheat  \
1          1      ...                 0                 0      1    2      1  

In [11]:
featuresDf.to_csv('ModelInput/features.csv', encoding='utf-8', index=False)