### raw data

the data is the partially scraped database of the [soccerway](https://int.soccerway.com) website.

there are a number of tables connected by ids:

- match:
  - matches with id, timestamp, result summary and participating teams
- match_info:
  - more detailed info about the matches
- lineup:
  - players starting games
- goals:
  - goals scored at matches, with possibly an assisting player
- subs:
  - who replaced whom at a given match, and at what time
  - also other players who just sat on the bench but did not make an appearance
- coaches:
  - coaches of teams at games
- sidelined:
  - reasons why some players missed matches
- events:
  - yellow and red cards received during matches, own goals and penalty misses
- players:
  - general info about players
- areas, competitions, seasons, rounds:
  - there is a hierarchy: match -> round -> season -> competition -> area
  - there are additional tables describing this hierarchy
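
the hierarchy above can be walked with a chain of left merges. this is only a sketch on toy tables: every column name (`round_id`, `season_id`, `competition_id`, `area_id`) is a hypothetical stand-in, so check the real column names before reusing it.

```python
import pandas as pd

# toy tables; all column names are assumptions, not the real schema
match_df = pd.DataFrame({"match_id": [1, 2, 3], "round_id": [10, 10, 11]})
round_df = pd.DataFrame({"round_id": [10, 11], "season_id": [100, 101]})
season_df = pd.DataFrame({"season_id": [100, 101], "competition_id": [7, 8]})
competition_df = pd.DataFrame({"competition_id": [7, 8], "area_id": [1, 2]})

# walk up one level at a time; validate="m:1" makes pandas raise if a
# parent id is duplicated, which would otherwise silently multiply matches
match_hierarchy = (
    match_df.merge(round_df, on="round_id", how="left", validate="m:1")
    .merge(season_df, on="season_id", how="left", validate="m:1")
    .merge(competition_df, on="competition_id", how="left", validate="m:1")
)
```

with `how="left"` every match is kept, and a match whose parent is missing from the scrape simply gets NaN in the higher-level columns.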


#### download the data like this

- initially, a sample dataset will be provided; the full dataset will be available 1 week before the deadline (it can be updated the same way)
- please save the data to disk so that you don't have to download it every time you start this notebook

In [None]:
import pandas as pd

# public bucket holding one pickled DataFrame per table
BASE_URL = "https://borza-hotelcom-data.s3.eu-central-1.amazonaws.com"

# hierarchy tables
region_df = pd.read_pickle(f"{BASE_URL}/soccerway-region_df.pkl")
competition_df = pd.read_pickle(f"{BASE_URL}/soccerway-competition_df.pkl")
season_df = pd.read_pickle(f"{BASE_URL}/soccerway-season_df.pkl")
round_df = pd.read_pickle(f"{BASE_URL}/soccerway-round_df.pkl")

# match-level tables
match_df = pd.read_pickle(f"{BASE_URL}/soccerway-match_df.pkl")
goal_df = pd.read_pickle(f"{BASE_URL}/soccerway-goal_df.pkl")
match_info_df = pd.read_pickle(f"{BASE_URL}/soccerway-match_info_df.pkl")
lineup_df = pd.read_pickle(f"{BASE_URL}/soccerway-lineup_df.pkl")
coach_df = pd.read_pickle(f"{BASE_URL}/soccerway-coach_df.pkl")
sidelined_df = pd.read_pickle(f"{BASE_URL}/soccerway-sidelined_df.pkl")
sub_df = pd.read_pickle(f"{BASE_URL}/soccerway-sub_df.pkl")
event_df = pd.read_pickle(f"{BASE_URL}/soccerway-event_df.pkl")

# player table
player_df = pd.read_pickle(f"{BASE_URL}/soccerway-player_df.pkl")
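
one possible way to follow the "save the data to disk" request is a small caching helper. this is a sketch, not a required approach: the function name `load_table` and the folder `soccerway-data` are made up for illustration, and only the bucket URL comes from the cell above.

```python
from pathlib import Path

import pandas as pd

BASE_URL = "https://borza-hotelcom-data.s3.eu-central-1.amazonaws.com"
CACHE_DIR = Path("soccerway-data")  # hypothetical local folder


def load_table(name, cache_dir=CACHE_DIR, base_url=BASE_URL):
    """Return table `name` (e.g. "match_df"), downloading it only once."""
    local_path = Path(cache_dir) / f"soccerway-{name}.pkl"
    if not local_path.exists():
        # first call: fetch the pickle and keep a local copy
        local_path.parent.mkdir(parents=True, exist_ok=True)
        remote = f"{base_url}/soccerway-{name}.pkl"
        pd.read_pickle(remote).to_pickle(local_path)
    # later calls never touch the network
    return pd.read_pickle(local_path)
```

usage would look like `match_df = load_table("match_df")`. once the full dataset replaces the sample, the cached files need to be deleted so that they are downloaded again.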

### questions

it's difficult to answer all the questions, so allocate your resources wisely

#### individual answers

one number:

- how many players scored at least 20 goals in 4 different areas
- how many players were sent off twice by the same referee
- how many goals did teams score after having a player sent off
- what is the highest number of left-footed players fielded by one team in the same match
- number of players sent off after being substituted
- number of games with at least 2 missed penalties
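
the "one number" questions mostly reduce to filter, group, count. as a sketch for the last one, on a toy events table: the column names `match_id` and `event_type` and the label `"penalty_miss"` are assumptions, and the real events table may encode this differently.

```python
import pandas as pd

# toy events table; column names and event labels are hypothetical
event_df = pd.DataFrame(
    {
        "match_id": [1, 1, 1, 2, 3, 3],
        "event_type": [
            "penalty_miss",
            "yellow",
            "penalty_miss",
            "penalty_miss",
            "red",
            "own_goal",
        ],
    }
)

# keep only missed penalties, then count them per match
missed = event_df[event_df["event_type"] == "penalty_miss"]
misses_per_match = missed.groupby("match_id").size()

# a single number: matches with at least 2 missed penalties
n_games = int((misses_per_match >= 2).sum())
```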

top 5:

- top 5 players with number of assists in one season (player name, season id and assist count)
- top 5 players with highest number of goals as a substitute
- top 5 players who sat on the bench for an entire game the most times
- top 5 players with number of goals in first half
- top 5 players with most different types of reasons for being sidelined
- top 5 players with highest number of games where they both scored and assisted
- top 5 teams with most penalties missed
- top 5 teams with most penalties missed against
- top 5 teams with highest average number of yellow cards per game (at least 20 games)
- top 5 teams with lowest average number of substitutions per game (at least 20 games)
- top 5 teams with earliest average time of first substitution (at least 20 games)
- top 5 teams with highest ratio of wins after being behind at half time (at least 20 games)
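
the per-team averages with an "at least 20 games" condition share one pattern: count games from the match side, count events separately, divide, filter, take the largest. a sketch for the yellow card question on toy data; the function name, the column names (`team_id`, `match_id`, `event_type`) and the label `"yellow"` are all assumptions.

```python
import pandas as pd


def top_teams_by_yellow_rate(team_match_df, event_df, min_games=20, n=5):
    """Average yellow cards per game, for teams with >= `min_games` games."""
    # denominator: every game a team played, including games without a card
    games = team_match_df.groupby("team_id")["match_id"].nunique()
    yellows = (
        event_df[event_df["event_type"] == "yellow"].groupby("team_id").size()
    )
    # teams that never got a yellow card count as 0 instead of dropping out
    rate = yellows.reindex(games.index, fill_value=0) / games
    return rate[games >= min_games].nlargest(n)


# toy data: one row per (team, match); team "C" has too few games
team_match_df = pd.DataFrame(
    {"team_id": ["A", "A", "B", "B", "C"], "match_id": [1, 2, 1, 3, 2]}
)
event_df = pd.DataFrame(
    {
        "team_id": ["A", "A", "A", "B"],
        "match_id": [1, 2, 2, 1],
        "event_type": ["yellow", "yellow", "yellow", "red"],
    }
)
top = top_teams_by_yellow_rate(team_match_df, event_df, min_games=2)
```

taking the number of games from the events table instead would overstate the average, because games without any card would be missing from the denominator.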


#### features

(wherever a value is impossible to calculate, leave it NaN)

in a table, where each row is a match, add the following features:
- for each team:
    - the number of games they played in the last 21 days
    - the number of different competitions they played in during the last 21 days
    - number of goals scored in last 21 days by players sidelined for this match
    - days since their last match
    - goal difference in the last 10 matches
    - win rate in last 10 matches
    - average number of bookings in the last 10 matches
    - average time of the earliest substitution in the last 10 matches
    - number of different nationalities in starting lineup

- for the game:
  - how many times the 2 teams have met previously
  - how many times the 2 coaches have met previously
  - how many previous games in the season
  - number of different competitions in the region
  - draw rate in the season so far
  - average number of yellow cards in a game in the season so far
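
all of these features may only use information from before the match itself. a sketch of three of them on a toy match table: the column names are hypothetical, and "the last 21 days" is assumed to include the day exactly 21 days earlier.

```python
import numpy as np
import pandas as pd

# toy match table; all column names are hypothetical stand-ins
match_df = pd.DataFrame(
    {
        "match_id": [1, 2, 3, 4],
        "date": pd.to_datetime(
            ["2021-01-01", "2021-01-10", "2021-01-20", "2021-03-01"]
        ),
        "home_team_id": [1, 2, 1, 2],
        "away_team_id": [2, 1, 3, 1],
    }
).sort_values("date", ignore_index=True)

# one row per (match, team), so per-team history is a plain groupby
team_games = match_df.melt(
    id_vars=["match_id", "date"],
    value_vars=["home_team_id", "away_team_id"],
    var_name="side",
    value_name="team_id",
).sort_values(["team_id", "date"], ignore_index=True)

# days since the team's previous match; NaN for its first match
team_games["days_since_last"] = (
    team_games.groupby("team_id")["date"].diff().dt.days
)


def games_in_window(dates, days=21):
    """For sorted dates, count earlier rows within `days` days of each row."""
    values = dates.to_numpy()
    window_start = values - np.timedelta64(days, "D")
    # position of the first match that is still inside the window
    first = np.searchsorted(values, window_start, side="left")
    return pd.Series(np.arange(len(values)) - first, index=dates.index)


parts = [games_in_window(d) for _, d in team_games.groupby("team_id")["date"]]
team_games["games_last_21d"] = pd.concat(parts)

# earlier meetings of the same two teams, regardless of who was at home
match_df["team_lo"] = np.minimum(
    match_df["home_team_id"], match_df["away_team_id"]
)
match_df["team_hi"] = np.maximum(
    match_df["home_team_id"], match_df["away_team_id"]
)
match_df["previous_meetings"] = match_df.groupby(
    ["team_lo", "team_hi"]
).cumcount()
```

the team-level columns can then be merged back onto the match table twice, once for the home side and once for the away side. `cumcount` starts at 0, so a first meeting gets 0 rather than NaN, while `days_since_last` stays NaN where there is no previous match.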

### solution

a solution can consist of 2 parts:
- a .py file of not more than 300 lines + import lines
  - formatted with `black -l 79`
- a notebook containing
  - an import cell, importing any necessary libraries
  - a data reading cell, reading the raw data
  - a computation cell of at most 80 lines, formatted with the [black](../support-notebooks/black.ipynb) nbextension, that answers as many questions as possible

the notebook needs to produce results in less than 90 seconds on a 4-core machine with 8 GB of RAM, after loading all the data