# Data Processing
While we have the raw data needed to begin visualizing it,
some transformations are necessary to get the best result.

## 1.0 Opening & Exploring the Data

In [22]:
# IMPORTS
import numpy as np
import pandas as pd


In [23]:
Raw = pd.read_csv(r'Tot_2020_partial.csv').drop(columns=['Unnamed: 0'])
Raw.head()

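|  | assists | bonus | bps | clean_sheets | creativity | element | fixture | goals_conceded | goals_scored | ict_index | ... | total_points | transfers_balance | transfers_in | transfers_out | value | was_home | yellow_cards | full | team | ppm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0.0 | 543 | 56 | 0 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 4.0 | False | 0 | Alfie Whiteman | Tottenham | 0.0 |
| 1 | 0 | 0 | 0 | 0 | 0.0 | 543 | 69 | 0 | 0 | 0.0 | ... | 0 | 7377 | 8129 | 752 | 4.0 | True | 0 | Alfie Whiteman | Tottenham | 0.0 |
| 2 | 0 | 0 | 0 | 0 | 0.0 | 543 | 72 | 0 | 0 | 0.0 | ... | 0 | 3735 | 4819 | 1084 | 4.0 | False | 0 | Alfie Whiteman | Tottenham | 0.0 |
| 3 | 0 | 0 | 0 | 0 | 0.0 | 543 | 89 | 0 | 0 | 0.0 | ... | 0 | 2501 | 3859 | 1358 | 4.0 | True | 0 | Alfie Whiteman | Tottenham | 0.0 |
| 4 | 0 | 0 | 0 | 0 | 0.0 | 543 | 94 | 0 | 0 | 0.0 | ... | 0 | 720 | 1398 | 678 | 4.0 | False | 0 | Alfie Whiteman | Tottenham | 0.0 |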


To better understand our data, it helps to see how it is organized into columns
and what data type each column holds.
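
As a point of comparison, here is a minimal sketch of what `DataFrame.dtypes` reports on its own. The three-column frame is a hypothetical stand-in built from values in the preview above, not the full dataset, and `sample` and `dtype_names` are names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical three-column sample mirroring a few columns from the preview above
sample = pd.DataFrame({
    "creativity": [0.0, 0.0],
    "element": [543, 543],
    "full": ["Alfie Whiteman", "Alfie Whiteman"],
})

# dtypes reports the low-level storage type of each column,
# which is what the grouping into broader categories aims to simplify
dtype_names = sample.dtypes.astype(str).to_dict()
print(dtype_names)
```

The exact label shown for the text column depends on the pandas version, which is one reason a broader category such as 'Text' can be easier to read.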

In [24]:
def get_cat(series: pd.Series, empty_flag: bool) -> str:
    """
    :param series: a pandas Series of data (should be of one data type)
    :param empty_flag: a boolean to call out an empty column (containing all NaNs)
    :return: a string that is the 'category of the datatype'

    This function takes a given series and returns a generalized category for its datatype.
    This is necessary for readability, as df.dtypes returns ungrouped types (e.g. float64, int64)
    as well as some options that do not provide much insight (e.g. object).
    """
    if empty_flag:  # Catch Empty Columns
        return 'Empty'
    elif pd.api.types.is_numeric_dtype(series):
        # pandas counts bool as numeric, so boolean columns are also caught here
        # and the 'Boolean' branch below is never reached for them
        return 'Numerical'
    elif pd.api.types.is_datetime64_dtype(series):
        return 'Date'
    elif pd.api.types.is_bool_dtype(series):
        return 'Boolean'
    elif isinstance(series.dtype, pd.CategoricalDtype):
        # Replaces is_categorical_dtype, which is deprecated in recent pandas versions
        return 'Categorical'
    elif pd.api.types.is_string_dtype(series):
        return 'Text'
    else:  # Catches all others for review
        return 'Unknown'


def col_categories(data: pd.DataFrame) -> None:
    """
    :param data: a pandas DataFrame whose column categories you want to explore
    :return: None
    Prints the column name and generalized data type category of each column in the dataframe
    """
    # how='all' only drops rows that are entirely NaN; the default would also drop
    # any row with a single NaN and could wrongly flag a populated frame as empty
    if data.dropna(how='all').empty:
        raise ValueError('DataFrame provided is either empty or only contains NaNs')
    for col in data.columns:  # For Each Column
        empty = data[col].dropna().empty  # Flag columns containing only NaNs
        print(col, ': ', get_cat(data[col], empty))  # Print Column Name and Category

In [25]:
col_categories(Raw)

assists :  Numerical
bonus :  Numerical
bps :  Numerical
clean_sheets :  Numerical
creativity :  Numerical
element :  Numerical
fixture :  Numerical
goals_conceded :  Numerical
goals_scored :  Numerical
ict_index :  Numerical
influence :  Numerical
kickoff_time :  Text
minutes :  Numerical
opponent_team :  Text
own_goals :  Numerical
penalties_missed :  Numerical
penalties_saved :  Numerical
red_cards :  Numerical
round :  Numerical
saves :  Numerical
selected :  Numerical
team_a_score :  Numerical
team_h_score :  Numerical
threat :  Numerical
total_points :  Numerical
transfers_balance :  Numerical
transfers_in :  Numerical
transfers_out :  Numerical
value :  Numerical
was_home :  Numerical
yellow_cards :  Numerical
full :  Text
team :  Text
ppm :  Numerical
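
Note that `was_home` holds True/False values in the preview yet is reported as Numerical. A minimal sketch of why, assuming the column is read with a bool dtype: pandas treats bool as a numeric dtype, so a numeric check matches before a boolean check gets the chance. `get_cat_bool_first` is a hypothetical variant written for this illustration, not the function used above.

```python
import pandas as pd
from pandas.api import types as ptypes


def get_cat_bool_first(series: pd.Series) -> str:
    """Hypothetical variant of get_cat that tests for booleans before numerics."""
    if series.dropna().empty:  # Column holds nothing but NaNs
        return "Empty"
    if ptypes.is_bool_dtype(series):  # Checked first so bool is not absorbed below
        return "Boolean"
    if ptypes.is_numeric_dtype(series):
        return "Numerical"
    if ptypes.is_datetime64_any_dtype(series):
        return "Date"
    if isinstance(series.dtype, pd.CategoricalDtype):
        return "Categorical"
    if ptypes.is_string_dtype(series):
        return "Text"
    return "Unknown"


# Values taken from the was_home column in the preview
was_home = pd.Series([False, True, False], name="was_home")
print(ptypes.is_numeric_dtype(was_home))  # bool counts as numeric in pandas
print(get_cat_bool_first(was_home))
```

Whether the distinction matters depends on how `was_home` will be used later; for plotting it as a category, the 'Boolean' label may be the more informative one.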


As we can see, quite a few columns need a clearer definition.
I will have to check back with the source of this data to expand on these definitions.

This is also a good time to identify which variables we may want to aggregate on
through a simple grouping process.

- assists :  Numerical
    - Number of Assists by Player
- bonus :  Numerical
    - Number of Bonus Points (Fantasy Football)
    - They are devised and allocated post-match according to a Bonus Points System (BPS)
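- bps :  Numerical
    - Appears to be the raw Bonus Points System (BPS) score used to allocate `bonus` above; not yet confirmed with the source
    - **Without a confirmed definition, we should probably remove it**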
- clean_sheets :  Numerical
    - Fantasy Football
    - 0 or 1, with 1 indicating a clean sheet was kept while the player was on the field
- creativity :  Numerical
    - Creativity Index (Fantasy Football)
    - Creativity assesses player performance in terms of producing goalscoring opportunities for others.
- element :  Numerical
    - Player ID
- fixture :  Numerical
    - Fixture ID
- goals_conceded :  Numerical
    - Goals Conceded While Player Was on the Field
- goals_scored :  Numerical
    - Goals Scored By Player
- ict_index :  Numerical
    - Fantasy Football
    - Includes Influence, Creativity, and Threat
- influence :  Numerical
    - Fantasy Football
    - Influence Score
- kickoff_time :  Text
    - Match Time
- minutes :  Numerical
    - Minutes Played
- opponent_team :  Text
    - Opposing Team
- own_goals :  Numerical
    - Own Goals Scored by Player
- penalties_missed :  Numerical
    - Penalties Missed
- penalties_saved :  Numerical
    - Penalties Saved
- red_cards :  Numerical
    - Red Cards to Player
- round :  Numerical
    - Gameweek
    - **Need to Aggregate By**
- saves :  Numerical
    - Total Saves
- selected :  Numerical
    - Fantasy Football - Number of Times Player Was Selected
- team_a_score :  Numerical
    - Away Team Goals
- team_h_score :  Numerical
    - Home Team Goals
- threat :  Numerical
    - Fantasy Football
    - Threat Index
- total_points :  Numerical
    - Fantasy Football Total Points for Game
- transfers_balance :  Numerical
    - Fantasy Football
    - Transfers In - Transfers Out
- transfers_in :  Numerical
    - Number of FPL users that transferred player in this gameweek
- transfers_out :  Numerical
    - Number of FPL users that transferred player out this gameweek
- value :  Numerical
    - Price in Fantasy Football
- was_home :  Numerical
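    - Boolean (True/False) flag for whether the player's team was at home; reported as Numerical because pandas treats bool as a numeric dtype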
- yellow_cards :  Numerical
    - Yellow Cards Received by Player
- full :  Text
    - Player Name
    - **Need to Aggregate By**
- team :  Text
    - Team Name
- ppm :  Numerical
    - Fantasy Football
    - Points Per Million Index - Points Per Game Per Cost in Millions
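
The two columns flagged **Need to Aggregate By** suggest grouping on player (`full`) and gameweek (`round`). A minimal sketch of that grouping, assuming summed points and minutes are the aggregates of interest. The rows below are made-up placeholder values, not records from the dataset, and `games` and `per_round` are names introduced only for this illustration.

```python
import pandas as pd

# Placeholder rows (hypothetical values) using column names from the list above
games = pd.DataFrame({
    "full": ["Player A", "Player A", "Player A", "Player B"],
    "round": [1, 1, 2, 1],
    "total_points": [2, 6, 1, 3],
    "minutes": [90, 45, 90, 60],
})

# One row per player per gameweek, summing the per-fixture statistics
per_round = (
    games.groupby(["full", "round"], as_index=False)[["total_points", "minutes"]]
    .sum()
)
print(per_round)
```

Summing suits count-like columns such as points and minutes. Columns such as `value` or `ppm` would likely need a different aggregate (for example the mean or the last value), which is a choice to settle once the definitions above are confirmed.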