# Palpite
Cartola FC tips.

## Goal
Develop a model that predicts how many points an athlete will score in Cartola FC.

## Data Sources
### Cartola FC
The historical Cartola FC dataset used in this model is available at [github.com/henriquepgomide/caRtola](https://github.com/henriquepgomide/caRtola/tree/master/data).

### Betting Lines
The historical betting lines for Brasileirão Série A are available at [football-data.co.uk/brazil.php](https://www.football-data.co.uk/brazil.php).

Many thanks to the maintainers of both projects for making the data freely available to the world.

## Time Interval
The Cartola FC API changed its format in 2018. Datasets from earlier years do not match the expected format and therefore won't be used.

The seasons used to build the model will be:

* 2018
* 2019
* First half of 2020

That amounts to:

* 2.5 seasons
* 95 rounds
* 950 matches

That seems to be enough, so there is no need to reformat datasets from earlier years.
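The totals above can be reproduced with a few lines of arithmetic. This is only a sanity check, assuming 38 rounds per full season, 10 matches per round, and 19 rounds for the partial 2020 season; the constant names are illustrative.

```python
# Assumptions: a full season has 38 rounds of 10 matches each,
# and the partial 2020 season contributes 19 rounds.
ROUNDS_PER_SEASON = 38
MATCHES_PER_ROUND = 10

rounds_per_year = {
    2018: ROUNDS_PER_SEASON,
    2019: ROUNDS_PER_SEASON,
    2020: ROUNDS_PER_SEASON // 2,
}

total_rounds = sum(rounds_per_year.values())
total_matches = total_rounds * MATCHES_PER_ROUND
total_seasons = total_rounds / ROUNDS_PER_SEASON

print(total_seasons, total_rounds, total_matches)  # 2.5 95 950
```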

## Imports

In [12]:
import datetime
import itertools
import json
import os

import numpy as np
import pandas as pd

## Load data
### Cartola FC
Load data from 2018, 2019 and 2020 Cartola FC seasons.

In [13]:
def load_all_rounds(season):
    """ 
    Load all rounds data from a season to a pandas DataFrame.
    
    Also rename columns to a shorter name and introduce a year feature.
    """
    file_path_list = [f"data/cartola/{season}/rodada-{i + 1}.csv" for i in range(38)]
    rounds = [
        pd.read_csv(file_path, index_col=0) for file_path in file_path_list if os.path.exists(file_path)
        ]
    data = pd.concat(rounds, ignore_index=True)

    data.columns = [col.replace("atletas.", "") for col in data.columns]

    data["year"] = season
    return data


def load_seasons(seasons):
    """ Load specified seasons into a same pandas DataFrame. """
    return pd.concat([load_all_rounds(year) for year in seasons], ignore_index=True)
    

cartola_data = load_seasons([2018, 2019, 2020])
cartola_data

Unnamed: 0,nome,slug,apelido,foto,atleta_id,rodada_id,clube_id,posicao_id,status_id,pontos_num,...,FT,GS,CV,GC,PP,DP,year,jogos_num,PI,DS
0,Matheus Ferraz Pereira,matheus-ferraz,Matheus Ferraz,https://s.glbimg.com/es/sde/f/2018/03/17/6d461...,38632,1,AME,zag,Nulo,0.0,...,,,,,,,2018,,,
1,Willian Lanes de Lima,lima,Lima,https://s.glbimg.com/es/sde/f/2018/03/17/3d9ef...,38506,1,AME,zag,Nulo,0.0,...,,,,,,,2018,,,
2,Rómulo Otero Vásquez,otero,Otero,https://s.glbimg.com/es/sde/f/2017/04/03/9fe40...,83004,1,ATL,mei,Provável,16.5,...,,,,,,,2018,,,
3,Diego Ribas da Cunha,diego,Diego,https://s.glbimg.com/es/sde/f/2017/08/16/3ba37...,38909,1,FLA,mei,Provável,0.8,...,,,,,,,2018,,,
4,Rodrigo Eduardo Costa Marinho,rodriguinho,Rodriguinho,https://s.glbimg.com/es/sde/f/2018/03/20/c125f...,61033,1,COR,mei,Provável,16.5,...,,,,,,,2018,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75341,Luiz Henrique Araújo Silva,luiz-henrique,Luiz Henrique,https://s.glbimg.com/es/sde/f/2020/03/06/809a2...,101595,19,356,mei,Nulo,0.0,...,,,,,,,2020,1.0,,1.0
75342,Alex de Oliveira Nascimento,alex,Alex,https://s.glbimg.com/es/sde/f/2020/08/25/c7ba1...,101600,19,277,zag,Nulo,0.0,...,,,,,,,2020,5.0,15.0,5.0
75343,Ramon Ramos Lima,ramon,Ramon,https://s.glbimg.com/es/sde/f/2020/08/07/26587...,101597,19,262,lat,Nulo,0.0,...,,,,,,,2020,3.0,3.0,
75344,Ulisses Wilson Jeronymo Rocha,ulisses,Ulisses,https://s.glbimg.com/es/sde/f/2020/08/17/0e567...,101647,19,267,zag,Nulo,0.0,...,,,,,,,2020,0.0,,


### Keep relevant data
This model does not aim to predict an athlete's scouts (stats), so they are all removed.

Redundant identifiers, such as the athlete's full name and slug, are also removed; only the nickname is kept.

Price-related features are removed since I want the model to completely disregard prices.

In [14]:
to_keep = [
    'year',
    'apelido',  # Athlete nickname.
    'rodada_id',  # Round number.
    'clube.id.full.name',  # Club full name.
    'posicao_id',  # Position ID.
    'status_id',  # Status ID.
    'media_num',  # Average points.
    'pontos_num',  # Points number.
]

cartola_data = cartola_data[to_keep]
cartola_data

Unnamed: 0,year,apelido,rodada_id,clube.id.full.name,posicao_id,status_id,media_num,pontos_num
0,2018,Matheus Ferraz,1,América-MG,zag,Nulo,0.0,0.0
1,2018,Lima,1,América-MG,zag,Nulo,0.0,0.0
2,2018,Otero,1,Atlético-MG,mei,Provável,16.5,16.5
3,2018,Diego,1,Flamengo,mei,Provável,0.8,0.8
4,2018,Rodriguinho,1,Corinthians,mei,Provável,16.5,16.5
...,...,...,...,...,...,...,...,...
75341,2020,Luiz Henrique,19,356,mei,Nulo,0.5,0.0
75342,2020,Alex,19,Santos,zag,Nulo,2.4,0.0
75343,2020,Ramon,19,Flamengo,lat,Nulo,-0.1,0.0
75344,2020,Ulisses,19,267,zag,Nulo,0.0,0.0


### Translate
Code should preferably be written in English. 
Even though Cartola FC is a Brazilian fantasy game with a Brazilian target audience, 
you never know when someone else may want to read your code, for whatever purpose. 
So it is a good idea to always keep it in English.

In [15]:
cartola_data = cartola_data.rename({
    "apelido": "name",
    "rodada_id": "round",
    "clube.id.full.name": "club",
    "posicao_id": "position",
    "status_id": "status",
    "pontos_num": "points",
    "media_num": "mean",
}, axis=1)

cartola_data["position"] = cartola_data["position"].replace({
    "zag": "defender",
    "mei": "midfielder",
    "ata": "forward",
    "gol": "goalkeeper",
    "lat": "fullback",
    "tec": "coach",
})

cartola_data["status"] = cartola_data["status"].replace({
    "Nulo": "null",
    "Provável": "expected",
    "Contundido": "injured",
    "Dúvida": "doubt",
    "Suspenso": "suspended",
})

cartola_data.head()

Unnamed: 0,year,name,round,club,position,status,mean,points
0,2018,Matheus Ferraz,1,América-MG,defender,,0.0,0.0
1,2018,Lima,1,América-MG,defender,,0.0,0.0
2,2018,Otero,1,Atlético-MG,midfielder,expected,16.5,16.5
3,2018,Diego,1,Flamengo,midfielder,expected,0.8,0.8
4,2018,Rodriguinho,1,Corinthians,midfielder,expected,16.5,16.5
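Because `replace` silently leaves unknown values untouched, a misspelled or missing key in a translation dictionary goes unnoticed. A minimal sketch of a check that could be run before translating is shown here; `find_untranslated` is a hypothetical helper and the data is a toy example, not the real dataset.

```python
import pandas as pd


def find_untranslated(series, mapping):
    """Return the sorted unique values of `series` that are not keys of `mapping`."""
    values = series.dropna().unique()
    return sorted(value for value in values if value not in mapping)


# Toy data: the mapping deliberately lacks an entry for "Contundido".
toy_status = pd.Series(["Nulo", "Provável", "Contundido", "Dúvida", "Nulo"])
toy_mapping = {"Nulo": "null", "Provável": "expected", "Dúvida": "doubt"}

missing = find_untranslated(toy_status, toy_mapping)
print(missing)  # ['Contundido']
```

An empty list means every value in the column has a translation.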


### Club ID
The `club_id` values are inconsistent.

The 2018 data uses team abbreviations, while the later seasons use numeric club IDs. I'll convert everything to club IDs for consistency.

To do so, I'll use `times_ids.csv`.

In [16]:
clubs = pd.read_csv("data/cartola/times_ids.csv")
clubs.head()

Unnamed: 0,nome.cbf,nome.cartola,nome.completo,cod.older,cod.2017,cod.2018,id,abreviacao,escudos.60x60,escudos.45x45,escudos.30x30
0,América - MG,América-MG,America MG,327,327,327,327,AME,https://s.glbimg.com/es/sde/f/organizacoes/201...,https://s.glbimg.com/es/sde/f/organizacoes/201...,https://s.glbimg.com/es/sde/f/organizacoes/201...
1,America - RN,Atlético-RN,America RN,200,200,1,200,OUT,,,
2,Atlético - GO,Atlético-GO,Atletico GO,201,373,373,373,ATL,,,
3,Atlético - MG,Atlético-MG,Atletico Mineiro,282,282,282,282,ATL,https://s.glbimg.com/es/sde/f/equipes/2017/11/...,https://s.glbimg.com/es/sde/f/equipes/2017/11/...,https://s.glbimg.com/es/sde/f/equipes/2017/11/...
4,Atlético - PR,Atlético-PR,Atletico Paranaense,293,293,293,293,ATL,https://s.glbimg.com/es/sde/f/equipes/2015/06/...,https://s.glbimg.com/es/sde/f/equipes/2015/06/...,https://s.glbimg.com/es/sde/f/equipes/2015/06/...


In [17]:
# # Create a dictionary from the dataset.
# club_name_mapping = {row["nome.cartola"]: row["id"] for _, row in clubs.iterrows()}


# cartola_data["club_id"] = cartola_data["club_name"].apply(
#     uniformize_club_name, mapping=club_name_mapping
#     )

# cartola_data.head()
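The cell above is disabled, so the conversion to club IDs is not actually performed. The following is only a rough sketch of how it might look on toy data, assuming the `club` column mixes Cartola names with numeric IDs stored as strings; the toy frames and the `club_id` column are illustrative, not the notebook's real objects.

```python
import pandas as pd

# Toy stand-ins for `times_ids.csv` and `cartola_data`.
toy_clubs = pd.DataFrame({
    "nome.cartola": ["Flamengo", "Santos"],
    "id": [262, 277],
})
toy_data = pd.DataFrame({"club": ["Flamengo", "277", "Santos", "262"]})

# Map known names to IDs; unknown values become NaN.
name_to_id = dict(zip(toy_clubs["nome.cartola"], toy_clubs["id"]))
mapped = toy_data["club"].map(name_to_id)

# Assumption: anything that is not a known name is already a numeric ID.
already_ids = pd.to_numeric(toy_data["club"], errors="coerce")
toy_data["club_id"] = mapped.fillna(already_ids).astype("Int64")

print(toy_data)
```

Values that are neither a known name nor a number end up as missing, which makes them easy to spot afterwards.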

### Data Transformation
There is still one big problem.

The following columns vary from game to game. 

* `status`
* `points`
* `mean`

Status is available to the user before the match, while the mean is only updated after it. In the next round, the user sees an updated status and the mean as of the previous match.

So, to be useful, the model should consider the current match's status and the previous match's mean.

Naturally, the mean is null for the first match. Also, note that points is not transformed, since it is the target feature.

In [18]:
# Copy the data and relabel each row one round earlier, so that a row for round r
# lines up with the original row for round r - 1 in the merge below.
anticipated_data = cartola_data.copy()
anticipated_data["round"] = anticipated_data["round"] - 1

# Join on features that don't change between rounds.
join_on = ['year', 'name', 'round', 'club', 'position']
cartola_data = pd.merge(
    left=anticipated_data, 
    right=cartola_data, 
    on=join_on, 
    how="left", 
    suffixes=("_current", "_previous")  # Left rows are round r, right rows are round r - 1.
    )

# Restore the original round numbers.
cartola_data["round"] = cartola_data["round"] + 1

# Remove columns that are no longer needed.
cartola_data = cartola_data.drop(columns=["status_previous", "points_previous", "mean_current"])

# Remove suffixes.
cartola_data.columns = [col.replace("_previous", "").replace("_current", "") for col in cartola_data.columns]

# Replace missing means (no previous round) with 0.
cartola_data["mean"] = cartola_data["mean"].fillna(0)

cartola_data

Unnamed: 0,year,name,round,club,position,status,points,mean
0,2018,Matheus Ferraz,1,América-MG,defender,,0.0,0.0
1,2018,Lima,1,América-MG,defender,,0.0,0.0
2,2018,Otero,1,Atlético-MG,midfielder,expected,16.5,0.0
3,2018,Diego,1,Flamengo,midfielder,expected,0.8,0.0
4,2018,Rodriguinho,1,Corinthians,midfielder,expected,16.5,0.0
...,...,...,...,...,...,...,...,...
75388,2020,Luiz Henrique,19,356,midfielder,,0.0,0.5
75389,2020,Alex,19,Santos,defender,,0.0,2.4
75390,2020,Ramon,19,Flamengo,fullback,,0.0,-0.1
75391,2020,Ulisses,19,267,defender,,0.0,0.0
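The effect of this transformation is easier to see on a toy frame. The sketch below uses a different technique from the self-merge above, namely `groupby(...).shift(1)`, and assumes each athlete has one row per consecutive round. Unlike the merge, `shift` takes the previous available row even when a round is missing. All values are made up.

```python
import pandas as pd

# Toy data: one athlete over three consecutive rounds.
toy = pd.DataFrame({
    "year": [2018, 2018, 2018],
    "name": ["Player A"] * 3,
    "round": [1, 2, 3],
    "status": ["expected", "doubt", "expected"],
    "points": [5.0, 1.0, 3.0],
    "mean": [5.0, 3.0, 3.0],  # Running average after each round.
})

toy = toy.sort_values(["year", "name", "round"])

# Keep the current status and points, but use the mean known before the match,
# i.e. the value published after the previous round (0 when there is none).
toy["mean"] = toy.groupby(["year", "name"])["mean"].shift(1).fillna(0)

print(toy[["round", "status", "points", "mean"]])
```

After the shift, round 1 has a mean of 0, round 2 carries the mean from round 1, and so on, while `status` and `points` still refer to the current round.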


### Dates
We need a date feature so that the data can later be merged with the betting lines dataset.

I'll concatenate the matches datasets from each year.

In [19]:
def load_date_dataset(year):
    """ Load dates dataset. """
    data = pd.read_csv(f"data/cartola/{year}/{year}_partidas.csv")
    data["year"] = year
    return data


# Load dates datasets.
years = [2018, 2019, 2020]
matches_data = pd.concat([load_date_dataset(year) for year in years], ignore_index=True)

matches_data

Unnamed: 0,game,round,date,home_team,score,away_team,arena,year,home_score,away_score
0,1.0,1,14/04/2018 - 16:00,Cruzeiro - MG,0 x 1,Grêmio - RS,Mineirão - Belo Horizonte - MG,2018,,
1,2.0,1,15/04/2018 - 19:00,Atlético - PR,5 x 1,Chapecoense - SC,Arena da Baixada - Curitiba - PR,2018,,
2,3.0,1,15/04/2018 - 11:00,América - MG,3 x 0,Sport - PE,Independência - Belo Horizonte - MG,2018,,
3,4.0,1,14/04/2018 - 19:00,Vitória - BA,2 x 2,Flamengo - RJ,Manoel Barradas - Salvador - BA,2018,,
4,5.0,1,15/04/2018 - 16:00,Vasco da Gama - RJ,2 x 1,Atlético - MG,São Januário - Rio de Janeiro - RJ,2018,,
...,...,...,...,...,...,...,...,...,...,...
935,,18,2020-10-25,285,,262,,2020,,
936,,18,2020-10-24,265,,356,,2020,,
937,,18,2020-10-25,293,,284,,2020,,
938,,18,2020-10-24,354,,294,,2020,,


This dataset also has some problems that should be fixed:

* Dates are in different formats depending on the year.
* Team names are in various formats.


In [20]:
def uniformize_club_name(name, mapping):
    """ Replace a club name or ID with its mapped value; unknown values are returned unchanged. """
    if name in mapping:
        return mapping[name]
    return name


def interpret_date(string, patterns):
    """ Interpret some specific patterns of date. """
    for pattern in patterns:
        try:
            return datetime.datetime.strptime(string, pattern).date()
        except ValueError:
            pass
    raise ValueError("Couldn't match any date pattern.")


# Build lookup tables from the clubs dataset: CBF name -> Cartola name and club ID -> Cartola name.
club_cbf_name_mapping = dict(zip(clubs["nome.cbf"], clubs["nome.cartola"]))
club_id_mapping = dict(zip(clubs["id"], clubs["nome.cartola"]))

# Apply uniformize_club_name function.
for mapping in [club_cbf_name_mapping, club_id_mapping]:
    matches_data["home_team"] = matches_data["home_team"].apply(uniformize_club_name, mapping=mapping)
    matches_data["away_team"] = matches_data["away_team"].apply(uniformize_club_name, mapping=mapping)

# Parse date strings into date objects.
matches_data["date"] = matches_data["date"].apply(interpret_date, patterns=["%d/%m/%Y - %H:%M", "%Y-%m-%d"])

matches_data

Unnamed: 0,game,round,date,home_team,score,away_team,arena,year,home_score,away_score
0,1.0,1,2018-04-14,Cruzeiro,0 x 1,Grêmio,Mineirão - Belo Horizonte - MG,2018,,
1,2.0,1,2018-04-15,Atlético-PR,5 x 1,Chapecoense,Arena da Baixada - Curitiba - PR,2018,,
2,3.0,1,2018-04-15,América-MG,3 x 0,Sport,Independência - Belo Horizonte - MG,2018,,
3,4.0,1,2018-04-14,Vitória,2 x 2,Flamengo,Manoel Barradas - Salvador - BA,2018,,
4,5.0,1,2018-04-15,Vasco,2 x 1,Atlético-MG,São Januário - Rio de Janeiro - RJ,2018,,
...,...,...,...,...,...,...,...,...,...,...
935,,18,2020-10-25,Internacional,,Flamengo,,2020,,
936,,18,2020-10-24,Bahia,,Fortaleza,,2020,,
937,,18,2020-10-25,Athlético-PR,,Grêmio,,2020,,
938,,18,2020-10-24,Ceará-SC,,Coritiba,,2020,,
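The two date formats handled above can be checked in isolation. The snippet below is a self-contained variation of `interpret_date` (named `parse_match_date` here to mark it as illustrative) that also reports the offending string when no pattern matches, which helps when a new format shows up.

```python
import datetime

# The two patterns seen in the matches datasets.
PATTERNS = ["%d/%m/%Y - %H:%M", "%Y-%m-%d"]


def parse_match_date(text, patterns=PATTERNS):
    """Try each pattern in order and return the first successful parse as a date."""
    for pattern in patterns:
        try:
            return datetime.datetime.strptime(text, pattern).date()
        except ValueError:
            continue
    raise ValueError(f"Couldn't match any date pattern: {text!r}")


print(parse_match_date("14/04/2018 - 16:00"))  # 2018-04-14
print(parse_match_date("2020-10-25"))          # 2020-10-25
```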


### Merge Cartola FC and matches datasets
Now that everything looks fine with the matches dataset, it is ready to be merged into the Cartola FC dataset.

In [21]:
# Rename the matches dataset columns to match the Cartola dataset's column names. This is required for merging.
matches_renamed = matches_data.rename({
    "home_team": "club_home",
    "away_team": "club_away"
    }, axis=1)

# We need to do it in two steps. First merge searching for the club in home teams. 
# Then do it again for away teams.

for club_type in ["home", "away"]:
    matches_renamed["club"] = matches_renamed[f"club_{club_type}"]

    # Filter to merge only desired features.
    matches_renamed_filtered = matches_renamed[["round", "year", "club", "date"]]

    merge_on = ["round", "year", "club"]
    cartola_data = pd.merge(cartola_data, matches_renamed_filtered, on=merge_on, how="left")

# Unify both new columns in a single column.
cartola_data["date"] = np.where(
    pd.isna(cartola_data["date_x"]),
    cartola_data["date_y"],
    cartola_data["date_x"]
    )

# Remove date_x and date_y.
cartola_data = cartola_data.drop(columns=["date_x", "date_y"])

cartola_data.tail()

Unnamed: 0,year,name,round,club,position,status,points,mean,date
75427,2020,Luiz Henrique,19,356,midfielder,,0.0,0.5,
75428,2020,Alex,19,Santos,defender,,0.0,2.4,
75429,2020,Ramon,19,Flamengo,fullback,,0.0,-0.1,
75430,2020,Ulisses,19,267,defender,,0.0,0.0,
75431,2020,Kaio Magno,19,267,forward,,0.0,0.0,
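An alternative to merging twice (once for home teams, once for away teams) is to reshape the matches into one row per club and merge once. The sketch below shows the idea on toy frames with `DataFrame.melt`; the player names are made up and this is not the approach used in the cell above.

```python
import pandas as pd

toy_matches = pd.DataFrame({
    "year": [2018, 2018],
    "round": [1, 1],
    "date": ["2018-04-14", "2018-04-15"],
    "home_team": ["Cruzeiro", "América-MG"],
    "away_team": ["Grêmio", "Sport"],
})
toy_players = pd.DataFrame({
    "year": [2018, 2018],
    "round": [1, 1],
    "name": ["Player A", "Player B"],
    "club": ["Grêmio", "América-MG"],
})

# Stack home and away teams into a single "club" column: one row per (match, club).
long_matches = toy_matches.melt(
    id_vars=["year", "round", "date"],
    value_vars=["home_team", "away_team"],
    value_name="club",
).drop(columns="variable")

# A single left merge now attaches the match date to every athlete row.
merged = toy_players.merge(long_matches, on=["year", "round", "club"], how="left")
print(merged)
```

This avoids the intermediate `date_x` and `date_y` columns, since each club appears at most once per round in the long table.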


In [22]:
cartola_data.shape[0] - cartola_data.count()

year           0
name           0
round          0
club           0
position       0
status         0
points         0
mean           0
date        9037
dtype: int64
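To find out which rows have no date, one option is to repeat the merge with `indicator=True` and list the keys that found no match. The sketch below uses toy frames and does not claim to explain the 9037 missing values; it only shows how the unmatched `(year, round, club)` combinations could be listed.

```python
import pandas as pd

toy_players = pd.DataFrame({
    "year": [2020, 2020, 2020],
    "round": [18, 19, 18],
    "club": ["Flamengo", "Flamengo", "267"],
})
toy_schedule = pd.DataFrame({
    "year": [2020],
    "round": [18],
    "club": ["Flamengo"],
    "date": ["2020-10-25"],
})

# indicator=True adds a "_merge" column: "both" for matched rows, "left_only" otherwise.
checked = toy_players.merge(
    toy_schedule, on=["year", "round", "club"], how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", ["year", "round", "club"]]

print(unmatched.drop_duplicates())
```

In this toy case the unmatched keys are a round absent from the schedule and a club value that was never translated to a name.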