# Introduction
In this final notebook, I analyse the data I scraped from various sources to find out which factors influence a player's transfer fee. First, I load the data from the repository and apply some final transformations. Then I run several linear regressions to examine different relationships with the transfer fee. Finally, I summarise the findings.

In [None]:
#import necessary packages
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Data Import and Transformation
In the following, I will import the necessary data and make the final transformations. The main transformation is to express the variables (e.g. goals) per 90 minutes instead of as absolute values. This makes players with different amounts of playing time comparable and mitigates multicollinearity, since absolute counts all grow with minutes played.
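The per-90 transformation described above can be sketched as a small helper. This is only an illustration under assumptions: the function name `add_per_90` and the toy numbers are made up, and the notebook itself writes the divisions out column by column.

```python
import numpy as np
import pandas as pd

def add_per_90(df, cols, minutes_col="90s"):
    """Return a copy of df with a '<col>/90' column for every column in cols."""
    out = df.copy()
    for col in cols:
        out[f"{col}/90"] = out[col] / out[minutes_col]
    # Zero playing time yields inf (x/0) or NaN (0/0); mark both as missing
    return out.replace([np.inf, -np.inf], np.nan)

# Toy data (made up for illustration)
toy = pd.DataFrame(
    {"Sh": [30, 12, 4], "SoT": [12, 3, 0], "90s": [10.0, 6.0, 0.0]},
    index=["Player A", "Player B", "Player C"],
)
per_90 = add_per_90(toy, ["Sh", "SoT"])
print(per_90[["Sh/90", "SoT/90"]])
```

Turning infinite values into `NaN` lets `sm.OLS(..., missing='drop')` discard those rows later instead of failing on them.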
## FBref Data
Here I will import and transform the data I scraped from [FBref.com](https://fbref.com/en/):

In [None]:
positions=["DF","MF","FW"]#field position to iterate through

In [None]:
#since the name and surname are separated by a "-", I need to replace the "-" with a space so I can merge this data with the other data sources
def change_index(df):
    names=[]
    for name in list(df.index):
        names.append(name.replace("-"," "))
    df.index=names
    return df
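
As a possible alternative to the loop in `change_index`, the same replacement can be done with pandas string methods. The name `change_index_vectorized` and the player names below are hypothetical.

```python
import pandas as pd

def change_index_vectorized(df):
    """Hypothetical variant of change_index: same result, no explicit loop."""
    df.index = df.index.str.replace("-", " ", regex=False)
    return df

# Made-up names in the "Firstname-Lastname" style of the scraped index
toy = pd.DataFrame({"xG/90": [0.41, 0.12]}, index=["Max-Mustermann", "Jean-Pierre-Dupont"])
change_index_vectorized(toy)
print(list(toy.index))
```

Both versions also remove hyphens that belong to a name (as in "Jean-Pierre"), so such players only match in a later merge if the other source spells the name without the hyphen.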

### Standard

In [None]:
#Since I can calculate most things (except expected goals/assists/...) more accurately with the data from Transfermarkt,
#I will only use the data from here that I cannot calculate with the Transfermarkt data
for position in positions:
    file=position+"/standard.csv"
    standard=pd.read_csv(file,index_col=0)
    standard=standard[['xG/90', 'xA/90', 'xG+xA/90', 'npxG/90', 'npxG+xA/90']]
    change_index(standard)
    exec(f"{position}_standard=standard")
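
The `exec` call above creates variables such as `DF_standard` on the fly. A dictionary keyed by position is one possible alternative that keeps the frames inspectable. This is a sketch only: the in-memory CSV text stands in for the real files, and `standard_by_position` is a hypothetical name that the rest of the notebook does not use.

```python
import io
import pandas as pd

# Stand-ins for the per-position CSV files (contents made up for illustration)
fake_files = {
    "DF": "Player,xG/90,xA/90\nPlayer-A,0.05,0.08\n",
    "MF": "Player,xG/90,xA/90\nPlayer-B,0.15,0.22\n",
    "FW": "Player,xG/90,xA/90\nPlayer-C,0.48,0.17\n",
}

standard_by_position = {}
for position, csv_text in fake_files.items():
    frame = pd.read_csv(io.StringIO(csv_text), index_col=0)
    # Same clean-up as change_index: hyphens in the index become spaces
    frame.index = frame.index.str.replace("-", " ", regex=False)
    standard_by_position[position] = frame[["xG/90", "xA/90"]]

print(standard_by_position["FW"])
```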

### Shooting

In [None]:
#Since I can calculate most things (except expected goals/assists/...) more accurately with the data from Transfermarkt,
#I will only use the data from here that I cannot calculate with the Transfermarkt data
for position in positions:
    file=position+"/shooting.csv"
    shooting=pd.read_csv(file,index_col=0)
    columns=list(shooting.columns)
    shooting["Sh/90"]=shooting["Sh"]/shooting["90s"]
    shooting["SoT/90"]=shooting["SoT"]/shooting["90s"]
    shooting["PK%"]=shooting["PK"]/shooting["PKatt"]
    shooting["G-xG/90"]=shooting["G-xG"]/shooting["90s"]
    shooting["np:G-xG/90"]=shooting["np:G-xG"]/shooting["90s"]
    columns = [e for e in columns if e not in ("SoT%","Sh/90","SoT/90","G/Sh","G/SoT")]
    shooting.drop(columns,axis=1,inplace=True)
    change_index(shooting)
    exec(f"{position}_shooting=shooting")

### Passing

In [None]:
#Since I can calculate most things (except expected goals/assists/...) more accurately with the data from Transfermarkt,
#I will only use the data from here that I cannot calculate with the Transfermarkt data
for position in positions:
    file=position+"/passing.csv"
    passing=pd.read_csv(file,index_col=0)
    columns=list(passing.columns)
    passing["Cmp (Total)/90"]=passing["Cmp (Total)"]/passing["90s"]
    passing["TotDist/90"]=passing["TotDist"]/passing["90s"]
    passing["PrgDist/90"]=passing["PrgDist"]/passing["90s"]
    passing["Cmp (Short)/90"]=passing["Cmp (Short)"]/passing["90s"]
    passing["Cmp (Medium)/90"]=passing["Cmp (Medium)"]/passing["90s"]
    passing["Cmp (Long)/90"]=passing["Cmp (Long)"]/passing["90s"]
    passing["A-xA/90"]=passing["A-xA"]/passing["90s"]
    passing["KP/90"]=passing["KP"]/passing["90s"]
    passing["Final 1/3 /90"]=passing["1/3"]/passing["90s"]
    passing["PPA/90"]=passing["PPA"]/passing["90s"]
    passing["CrsPA/90"]=passing["CrsPA"]/passing["90s"]
    passing["Prog/90"]=passing["Prog"]/passing["90s"]
    columns = [e for e in columns if e not in ("Cmp% (Total)","Cmp% (Short)","Cmp% (Medium)","Cmp% (Long)")]
    passing.drop(columns,axis=1,inplace=True)
    passing=passing.dropna()
    change_index(passing)
    exec(f"{position}_passing=passing")

### Pass Types

In [None]:
#Since I can calculate most things (except expected goals/assists/...) more accurately with the data from Transfermarkt,
#I will only use the data from here that I cannot calculate with the Transfermarkt data
for position in positions:
    file=position+"/pass_types.csv"
    pass_types=pd.read_csv(file,index_col=0)
    columns=list(pass_types.columns)
    pass_types["Live/90"]=pass_types["Live"]/pass_types["90s"]
    pass_types["Dead/90"]=pass_types["Dead"]/pass_types["90s"]
    pass_types["FK/90"]=pass_types["FK"]/pass_types["90s"]
    pass_types["TB/90"]=pass_types["TB"]/pass_types["90s"]
    pass_types["Press/90"]=pass_types["Press"]/pass_types["90s"]
    pass_types["Sw/90"]=pass_types["Sw"]/pass_types["90s"]
    pass_types["Crs/90"]=pass_types["Crs"]/pass_types["90s"]
    pass_types["CK/90"]=pass_types["CK"]/pass_types["90s"]
    pass_types["In (CK)/90"]=pass_types["In (CK)"]/pass_types["90s"]
    pass_types["Out (CK)/90"]=pass_types["Out (CK)"]/pass_types["90s"]
    pass_types["Str (CK)/90"]=pass_types["Str (CK)"]/pass_types["90s"]
    pass_types["Ground (Height)/90"]=pass_types["Ground (Height)"]/pass_types["90s"]
    pass_types["Low (Height)/90"]=pass_types["Low (Height)"]/pass_types["90s"]
    pass_types["High (Height)/90"]=pass_types["High (Height)"]/pass_types["90s"]
    pass_types["Right/Left"]=pass_types["Right Foot"]/pass_types["Left Foot"]
    pass_types["TI/90"]=pass_types["TI"]/pass_types["90s"]
    pass_types["Off/90"]=pass_types["Off"]/pass_types["90s"]
    pass_types["Out/90"]=pass_types["Out"]/pass_types["90s"]
    pass_types["Int/90"]=pass_types["Int"]/pass_types["90s"]
    pass_types["Blocks/90"]=pass_types["Blocks"]/pass_types["90s"]
    pass_types.drop(columns,axis=1,inplace=True)
    change_index(pass_types)
    exec(f"{position}_pass_types=pass_types")

### Goal and Shot Creation

In [None]:
#Since I can calculate most things (except expected goals/assists/...) more accurately with the data from Transfermarkt,
#I will only use the data from here that I cannot calculate with the Transfermarkt data
for position in positions:
    file=position+"/Goals_and_Shooting_Creation.csv"
    gasc=pd.read_csv(file,index_col=0)
    columns=list(gasc.columns)
    gasc["PassLive (SCA)/90"]=gasc["PassLive (SCA)"]/gasc["90s"]
    gasc["PassDead (SCA)/90"]=gasc["PassDead (SCA)"]/gasc["90s"]
    gasc["Drib (SCA)/90"]=gasc["Drib (SCA)"]/gasc["90s"]
    gasc["Sh (SCA)/90"]=gasc["Sh (SCA)"]/gasc["90s"]
    gasc["Fld (SCA)/90"]=gasc["Fld (SCA)"]/gasc["90s"]
    gasc["Def (SCA)/90"]=gasc["Def (SCA)"]/gasc["90s"]
    gasc["PassLive (GCA)/90"]=gasc["PassLive (GCA)"]/gasc["90s"]
    gasc["PassDead (GCA)/90"]=gasc["PassDead (GCA)"]/gasc["90s"]
    gasc["Drib (GCA)/90"]=gasc["Drib (GCA)"]/gasc["90s"]
    gasc["Sh (GCA)/90"]=gasc["Sh (GCA)"]/gasc["90s"]
    gasc["Fld (GCA)/90"]=gasc["Fld (GCA)"]/gasc["90s"]
    gasc["Def (GCA)/90"]=gasc["Def (GCA)"]/gasc["90s"]
    columns = [e for e in columns if e not in ("SCA90","GCA90")]
    gasc.drop(columns,axis=1,inplace=True)
    change_index(gasc)
    exec(f"{position}_gasc=gasc")

### Defensive Actions

In [None]:
#Since I can calculate most things (except expected goals/assists/...) more accurately with the data from Transfermarkt,
#I will only use the data from here that I cannot calculate with the Transfermarkt data
for position in positions:
    file=position+"/defensive.csv"
    defensive=pd.read_csv(file,index_col=0)
    columns=list(defensive.columns)
    defensive["Tkl (Total)/90"]=defensive["Tkl (Total)"]/defensive["90s"]
    defensive["TklW/90"]=defensive["TklW (Total)"]/defensive["90s"]
    defensive["TklW (Total)%"]=defensive["TklW (Total)"]/defensive["Tkl (Total)"]
    defensive["Def 3rd (Tkl)/90"]=defensive["Def 3rd (Tkl)"]/defensive["90s"]
    defensive["Mid 3rd (Tkl)/90"]=defensive["Mid 3rd (Tkl)"]/defensive["90s"]
    defensive["Att 3rd (Tkl)/90"]=defensive["Att 3rd (Tkl)"]/defensive["90s"]
    defensive["Tkl (vs. dribbles)/90"]=defensive["Tkl (vs. dribbles)"]/defensive["90s"]
    defensive["Past (vs. dribbles)/90"]=defensive["Past (vs. dribbles)"]/defensive["90s"]
    defensive["Succ (Press)/90"]=defensive["Succ (Press)"]/defensive["90s"]
    defensive["Def 3rd (Press)/90"]=defensive["Def 3rd (Press)"]/defensive["90s"]
    defensive["Mid 3rd (Press)/90"]=defensive["Mid 3rd (Press)"]/defensive["90s"]
    defensive["Att 3rd (Press)/90"]=defensive["Att 3rd (Press)"]/defensive["90s"]
    defensive["Blocks/90"]=defensive["Blocks"]/defensive["90s"]
    defensive["ShSv/90"]=defensive["ShSv"]/defensive["90s"]
    defensive["Pass Blocked/90"]=defensive["Pass"]/defensive["90s"]
    defensive["Int/90"]=defensive["Int"]/defensive["90s"]
    defensive["Tkl + Int/90"]=defensive["Tkl+Int"]/defensive["90s"]
    defensive["Clr/90"]=defensive["Clr"]/defensive["90s"]
    defensive["Err/90"]=defensive["Err"]/defensive["90s"]
    columns = [e for e in columns if e not in ("Succ % (Press)","Tkl% (vs. dribbles)")]
    defensive.drop(columns,axis=1,inplace=True)
    change_index(defensive)
    exec(f"{position}_defensive=defensive")

### Possession

In [None]:
#Since I can calculate most things (except expected goals/assists/...) more accurately with the data from Transfermarkt,
#I will only use the data from here that I cannot calculate with the Transfermarkt data
for position in positions:
    file=position+"/possession.csv"
    possession=pd.read_csv(file,index_col=0)
    columns=list(possession.columns)
    trans_col=[e for e in columns if e not in ("90s","Succ% (Dribbles)","Att (Dribbles)","Target of pass","Rec%")]
    for col in trans_col:
        new_col=col+"/90s"
        possession[new_col]=possession[col]/possession["90s"]
    columns = [e for e in columns if e not in ("Succ% (Dribbles)","Rec%")]
    possession.drop(columns,axis=1,inplace=True)
    change_index(possession)
    exec(f"{position}_possession=possession")

### Playing Time

In [None]:
#Since I can calculate most things (except expected goals/assists/...) more accurately with the data from Transfermarkt,
#I will only use the data from here that I cannot calculate with the Transfermarkt data
for position in positions:
    file=position+"/playing_time.csv"
    playing_time=pd.read_csv(file,index_col=0)
    playing_time=playing_time[["+/-90","xG+/-90"]]
    change_index(playing_time)
    exec(f"{position}_playing_time=playing_time")

### Miscellaneous

In [None]:
#Since I can calculate most things (except expected goals/assists/...) more accurately with the data from Transfermarkt,
#I will only use the data from here that I cannot calculate with the Transfermarkt data
for position in positions:
    file=position+"/miscellaneous.csv"
    miscellaneous=pd.read_csv(file,index_col=0)
    columns=list(miscellaneous.columns)
    trans_col=[e for e in columns if e not in ("90s","CrdY","CrdR","2CrdY","Crs","Int","TklW","OG","Won% (Aerial Duels)","Lost (Aerial Duels)")]
    for col in trans_col:
        new_col=col+"/90s"
        miscellaneous[new_col]=miscellaneous[col]/miscellaneous["90s"]
    columns = [e for e in columns if e not in ("Won% (Aerial Duels)",)]#trailing comma makes this a tuple; without it "not in" does a substring check on the string
    miscellaneous.drop(columns,axis=1,inplace=True)
    change_index(miscellaneous)
    exec(f"{position}_miscellaneous=miscellaneous")

### Goalkeeper

In [None]:
GK_standard=pd.read_csv("GK/GK_standard.csv", index_col=0)
GK_advanced=pd.read_csv("GK/GK_advanced.csv", index_col=0)
change_index(GK_standard)
change_index(GK_advanced)

## Transfermarkt Data
### Player stats
Here, I will separate the data into field players and goalkeepers, since different characteristics matter for each group.

In [None]:
#download stats_tm data
stats_tm=pd.read_pickle("stats_tm.pkl")
stats_tm

In [None]:
#create df without GK and of only GK (.copy() so the later in-place edits do not act on a view of stats_tm)
stats_tm_field=stats_tm[stats_tm["Position (GK: 1, Other: 0)"]==0].copy()
stats_tm_GK=stats_tm[stats_tm["Position (GK: 1, Other: 0)"]==1].copy()

In [None]:
#Drop unneeded columns
stats_tm_field.drop("Position (GK: 1, Other: 0)",axis=1,inplace=True)
stats_tm_GK.drop("Position (GK: 1, Other: 0)",axis=1,inplace=True)

#### Transform Transfermarkt data for field players:

In [None]:
#make necessary transformations
stats_tm_field["Minutes"]=stats_tm_field["Minutes Field"]
stats_tm_field["90s"]=stats_tm_field["Minutes"]/90
stats_tm_field["Starts"]=stats_tm_field["Games Played"]-stats_tm_field["Games subbed on"]
stats_tm_field["Starts%"]=stats_tm_field["Starts"]/stats_tm_field["Games Played"]
stats_tm_field["G-PK"]=stats_tm_field["Goals"]-stats_tm_field["Penalty Goals"]
stats_tm_field["Gls/90"]=stats_tm_field["Goals"]/stats_tm_field["90s"]
stats_tm_field["Ast/90"]=stats_tm_field["Assists"]/stats_tm_field["90s"]
stats_tm_field["OG/90"]=stats_tm_field["Own goals"]/stats_tm_field["90s"]
stats_tm_field["G-PK/90"]=stats_tm_field["G-PK"]/stats_tm_field["90s"]
stats_tm_field["Yellow/90"]=stats_tm_field["Yellow Cards"]/stats_tm_field["90s"]
stats_tm_field["Red/90"]=stats_tm_field["Red Cards"]/stats_tm_field["90s"]
stats_tm_field["Min/MP"]=stats_tm_field["Minutes"]/stats_tm_field["Games Played"]
stats_tm_field["Completed"]=stats_tm_field["Starts"]-stats_tm_field["Games subbed off"]
stats_tm_field["Completed%"]=stats_tm_field["Completed"]/stats_tm_field["Games Played"]
stats_tm_field["On Bench for 90min%"]=(stats_tm_field["Games in Squad"]-stats_tm_field["Games Played"])/stats_tm_field["Games in Squad"]
stats_tm_field["Subbed on%"]=stats_tm_field["Games subbed on"]/stats_tm_field["Games Played"]
stats_tm_field["Subbed off%"]=stats_tm_field["Games subbed off"]/stats_tm_field["Games Played"]
stats_tm_field

In [None]:
#drop columns that are not needed
stats_tm_field.drop(['ID', 'Current Team (2020/21)','Games in Squad','Goals', 'Assists', 'Own goals',
       'Games subbed on', 'Games subbed off', 'Yellow Cards', 'Red Cards',
       'Penalty Goals',"G-PK","Completed","Starts","Two Yellow Cards","Minutes Field","Minutes GK","Minutes"], axis=1,inplace=True)
stats_tm_field

In [None]:
#drop duplicates
stats_tm_field.drop_duplicates(inplace=True)

#### Transform Transfermarkt data for Goalkeepers:

In [None]:
#Rename the columns so they are correct (for goalkeepers they were wrong till now)
stats_tm_GK=stats_tm_GK.rename(columns={"Assists":"Own Goals","Own goals":"Games subbed on",
                                        "Games subbed on":"Games subbed off","Games subbed off":"Yellow Cards",
                                        "Yellow Cards":"Two Yellow Cards","Red Cards":"Red Cards","Two Yellow Cards":"Goals conceded",
                                        "Penalty Goals":"Clean Sheets"})
stats_tm_GK

In [None]:
#Make the necessary data transformations for goalkeepers
stats_tm_GK["Minutes"]=stats_tm_GK["Minutes GK"]
stats_tm_GK["90s"]=stats_tm_GK["Minutes"]/90
stats_tm_GK["Starts"]=stats_tm_GK["Games Played"]-stats_tm_GK["Games subbed on"]
stats_tm_GK["Starts%"]=stats_tm_GK["Starts"]/stats_tm_GK["Games Played"]
stats_tm_GK["Yellow/90"]=stats_tm_GK["Yellow Cards"]/stats_tm_GK["90s"]
stats_tm_GK["Red/90"]=stats_tm_GK["Red Cards"]/stats_tm_GK["90s"]
stats_tm_GK["Min/MP"]=stats_tm_GK["Minutes"]/stats_tm_GK["Games Played"]
stats_tm_GK["Completed"]=stats_tm_GK["Starts"]-stats_tm_GK["Games subbed off"]
stats_tm_GK["Completed%"]=stats_tm_GK["Completed"]/stats_tm_GK["Games Played"]
stats_tm_GK["On Bench for 90min%"]=(stats_tm_GK["Games in Squad"]-stats_tm_GK["Games Played"])/stats_tm_GK["Games in Squad"]
stats_tm_GK["Subbed on%"]=stats_tm_GK["Games subbed on"]/stats_tm_GK["Games Played"]
stats_tm_GK["Subbed off%"]=stats_tm_GK["Games subbed off"]/stats_tm_GK["Games Played"]
stats_tm_GK["GA/90"]=stats_tm_GK["Goals conceded"]/stats_tm_GK["90s"]
stats_tm_GK["CS%"]=stats_tm_GK["Clean Sheets"]/stats_tm_GK["Games Played"]
stats_tm_GK

In [None]:
#drop columns that are not needed
stats_tm_GK.drop(['ID', 'Current Team (2020/21)','Games in Squad','Goals',  'Own Goals',
       'Games subbed on', 'Games subbed off', 'Yellow Cards', 'Red Cards',"Completed","Starts","Two Yellow Cards",
                  "Minutes Field","Minutes GK","Minutes","Goals conceded","Clean Sheets"], axis=1,inplace=True)
stats_tm_GK

In [None]:
#drop duplicates
stats_tm_GK.drop_duplicates(inplace=True)

### Transfer Fees

In [None]:
transfers=pd.read_pickle("Transfers.pkl")

## Contracts Data from FIFA 19

In [None]:
contracts=pd.read_pickle("contracts.pkl")

# Statistical Analysis
In the following, I will conduct the different analyses:

In [None]:
def regress_res(df):
    Y=df["Fee"]#Defining the dependent variable
    X=sm.add_constant(df.drop("Fee",axis=1))#Defining the regressors and adding a constant (the intercept B0) with the sm.add_constant method
    regression = sm.OLS(Y,X, missing='drop')#Initializing the OLS regression
    regresults = regression.fit()#Fit the model by calling the OLS object's fit() method
    return regresults

## Transfermarkt data and contract length (Field Players)

In [None]:
#merge field player data with transfer data
transfermarkt_reg_field=pd.merge(stats_tm_field,transfers,how="inner", left_index=True, right_index=True).drop(["Left","Joined","ID"],axis=1)
#merge with contract data
transfermarkt_regXcontracts_field=pd.merge(transfermarkt_reg_field,contracts,how="inner", left_index=True, right_index=True).drop(["Left","Joined","ID","Player","Age_y"],axis=1)
transfermarkt_regXcontracts_field

In [None]:
regresresults=regress_res(transfermarkt_regXcontracts_field)
#summary of regression
regresresults.summary()

In [None]:
#Setting significance level
significance=0.075

# For-loop to get all parameters below certain p-value
for item in regresresults.pvalues.items(): #Iterates over (variable, p-value) pairs; Series.iteritems() was removed in pandas 2.0
    if item[1]<significance: #Checks if variables are significant or not
        print(item) #prints out all significant variables

The model seems to have little predictive power overall, with an adjusted $R^2$ of 0.214. However, some variables are significant at the chosen level (p < 0.075). The contract length and the points per game won in the previous season are highly significant. This indicates that playing for a successful club and having a long time remaining on the contract are associated with a significantly higher transfer fee. Furthermore, the age of a player is marginally significant, which indicates that being younger is associated with a higher transfer fee.
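Since `regresresults.pvalues` is a pandas Series, the filtering loop can also be written as a boolean mask. The sketch below uses a hypothetical helper `significant_terms` and made-up p-values, not the values from the regression above.

```python
import pandas as pd

def significant_terms(pvalues, alpha):
    """Return the entries of a p-value Series that fall below alpha, smallest first."""
    return pvalues[pvalues < alpha].sort_values()

# Hypothetical p-values, not taken from the regression above
toy_pvalues = pd.Series(
    {"const": 0.40, "Contract": 0.001, "Points per game": 0.004, "Age_x": 0.06, "Gls/90": 0.55}
)
result = significant_terms(toy_pvalues, alpha=0.075)
print(result)
```

Sorting by p-value puts the strongest relationships first, which makes the list easier to compare with the summary table.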

## Transfermarkt data and contract length (Goalkeepers)

In [None]:
#merge transfermarkt goalkeeper stats with transfer data
transfermarkt_reg_GK=pd.merge(stats_tm_GK,transfers,how="inner", left_index=True, right_index=True).drop(["Left","Joined","ID"],axis=1)
#merge that with contracts data
transfermarkt_regXcontracts_GK=pd.merge(transfermarkt_reg_GK,contracts,how="inner", left_index=True, right_index=True).drop(["Left","Joined","ID","Player","Age_y"],axis=1)
transfermarkt_regXcontracts_GK.shape

Since there are 17 variables and 17 observations, I will have to drop some variables.
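Why the variables have to be dropped can be illustrated with synthetic data. The sketch assumes that the 17 columns include the fee, which leaves 16 regressors plus the intercept, i.e. as many parameters as observations. The fit is done with `np.linalg.lstsq` rather than statsmodels to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_regressors = 17, 16  # 16 regressors + intercept = 17 parameters

# Design matrix with an intercept column and random regressors
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, n_regressors))])
y = rng.normal(size=n_obs)  # pure noise, unrelated to X

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual_ss = float(np.sum((y - X @ beta) ** 2))
residual_dof = n_obs - X.shape[1]

print(residual_dof)  # 0
print(residual_ss)   # close to zero, up to floating-point error
```

With zero residual degrees of freedom the model reproduces even pure noise exactly, so standard errors and p-values cannot be estimated.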

In [None]:
#only keep needed variables
transfermarkt_regXcontracts_GK=transfermarkt_regXcontracts_GK[["Points per game","90s","GA/90","CS%","Age_x","Contract","Fee"]]

In [None]:
regresresults=regress_res(transfermarkt_regXcontracts_GK)

#summary of regression
regresresults.summary()

The model seems to have no explanatory power, since it has an adjusted $R^2$ of 0.014. Since we only have 17 observations and the model has 6 degrees of freedom (one per regressor), it is difficult to make any inference from this sample. Furthermore, none of the variables are significant (p < 0.05).

## FBref Goalkeeper

In [None]:
#only keep relevant columns
GK_regres=GK_advanced[["PSxG+/-/90","#OPA/90"]]
#merge with transfer data
GK_regres=pd.merge(GK_regres,transfers,how="inner", left_index=True, right_index=True).drop(["ID","Left","Joined"],axis=1)

In [None]:
regresresults=regress_res(GK_regres)

#summary of regression
regresresults.summary()

While the adjusted $R^2$ of this model is comparatively high (0.538), little can be inferred from it, since there are only 11 observations for 3 variables. Moreover, none of the variables are significant, which means that neither the post-shot expected goals minus goals allowed per 90 minutes nor the number of times per 90 minutes the goalkeeper acted outside his penalty area has a significant effect on the transfer value of a goalkeeper.

## FBref Defender (Defensive Actions Data)

In [None]:
#merge defensive data with transfer data
defensive_reg=pd.merge(DF_defensive,transfers,how="inner", left_index=True, right_index=True).drop(["ID","Left","Joined"],axis=1)
defensive_reg

In [None]:
regresresults=regress_res(defensive_reg)

#summary of regression
regresresults.summary()

In [None]:
#Setting significance level
significance=0.05

# For-loop to get all parameters below certain p-value
for item in regresresults.pvalues.items(): #Iterates over (variable, p-value) pairs; Series.iteritems() was removed in pandas 2.0
    if item[1]<significance: #Checks if variables are significant or not
        print(item) #prints out all significant variables

Overall, this model seems to have little explanatory power, since the adjusted $R^2$ is 0.186. There are three significant variables: the share of attempted tackles that were successful, the number of successful tackles per 90 minutes, and the number of times an opposing player successfully dribbled past the defender per 90 minutes. While these relationships appear to be significant, the directions this regression postulates are highly questionable: it postulates that the more successful tackles a defender makes per 90 minutes, the smaller his transfer value. Furthermore, it postulates that the more often opposing players were able to dribble past the defender per 90 minutes, the higher his transfer value. Additionally, it is hard to make any inference from this sample, since there are 17 variables but only 37 observations.
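One possible explanation for implausible signs is that several regressors measure almost the same thing (for example tackles attempted and tackles won). Whether that is the case here is not verified in this notebook. The sketch below shows, on synthetic data only, how variance inflation factors could be used as a check. The helper `vif_table` is hypothetical and computed with plain NumPy; statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """Variance inflation factor per column: 1 / (1 - R^2) from regressing it on the others."""
    vifs = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic data: tackles won are built as a noisy share of tackles attempted
rng = np.random.default_rng(1)
tkl = rng.uniform(1.0, 4.0, size=40)
toy = pd.DataFrame({
    "Tkl (Total)/90": tkl,
    "TklW/90": 0.6 * tkl + rng.normal(scale=0.1, size=40),
    "Clr/90": rng.uniform(2.0, 6.0, size=40),
})
vifs = vif_table(toy)
print(vifs.round(1))
```

A large value for a column means that the other regressors nearly determine it, which makes the sign and size of its coefficient unstable.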

## FBref Defenders (Passing)

In [None]:
#only keep progressive distance column
passing_reg=DF_passing["PrgDist/90"]
#merge with transfer data
passing_reg=pd.merge(passing_reg,transfers,how="inner", left_index=True, right_index=True).drop(["ID","Left","Joined"],axis=1)
passing_reg

In [None]:
regresresults=regress_res(passing_reg)
#summary of regression
regresresults.summary()

Since the adjusted $R^2$ of this model is only 0.202, it does not seem to have good explanatory power. However, the age of the defender and the progressive distance of his passes per 90 minutes both have a significant (p < 0.05) effect on the transfer fee of a defender. This indicates that if a player can play passes that progress the game, this significantly increases his transfer value.

## FBref Midfielders (Goals and Shot Creation data)
Since there are only 24 observations in this dataset, I will only look at two variables: Shot Creating Actions per 90 minutes and Goal Creating Actions per 90 minutes.

In [None]:
#get needed columns
gasc_reg=MF_gasc[["SCA90","GCA90"]]
#merge with transfer data
gasc_reg=pd.merge(gasc_reg,transfers,how="inner", left_index=True, right_index=True).drop(["ID","Left","Joined"],axis=1)
gasc_reg

In [None]:
regresresults=regress_res(gasc_reg)

#summary of regression
regresresults.summary()

This model seems to have comparatively good explanatory power. It has an adjusted $R^2$ of 0.410, which means that roughly 41.0% of the variance in the transfer fee of midfielders can be explained by this model. Additionally, this model has two variables that are significant: age and goal creating actions per 90 minutes. This means that increasing age significantly decreases the value of a midfielder, while a higher number of goal creating actions per 90 minutes significantly increases it. The number of shot creating actions per 90 minutes seems to be insignificant for the value of a player.

## FBref Striker (Shooting data)

In [None]:
#get needed columns
shooting_reg=FW_shooting[["SoT/90","G/SoT", "G-xG/90"]]
#merge with transfer data
shooting_reg=pd.merge(shooting_reg,transfers,how="inner", left_index=True, right_index=True).drop(["ID","Left","Joined"],axis=1)
shooting_reg.replace([np.inf, -np.inf], np.nan, inplace=True)

In [None]:
regresresults=regress_res(shooting_reg)
#summary of regression
regresresults.summary()

Since the adjusted $R^2$ is negative, we can conclude that the model has no explanatory power. The shots on target per 90 minutes, the goals per shot on target, and the goals scored above expected goals per 90 minutes all have a highly insignificant effect on the transfer value of a striker.
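A negative value is possible because the adjusted $R^2$ penalises the number of regressors: for $n$ observations and $k$ regressors it is $1-(1-R^2)\frac{n-1}{n-k-1}$. A minimal sketch with hypothetical numbers (not the values from the regression above):

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R^2 for a model with an intercept and n_regressors slope terms."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Hypothetical numbers: a weak fit on a small sample
example = adjusted_r2(0.05, n_obs=20, n_regressors=3)
print(example)
```

With a small sample, a low plain $R^2$ is pushed below zero by the penalty, which signals that the regressors explain less than would be expected by chance.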

## FBref Striker (Goal and Shot Creation data)

In [None]:
#get needed columns
gasc_reg_FW=FW_gasc[["SCA90","GCA90"]]
#merge with transfer data
gasc_reg_FW=pd.merge(gasc_reg_FW,transfers,how="inner", left_index=True, right_index=True).drop(["ID","Left","Joined"],axis=1)
gasc_reg_FW

In [None]:
regresresults=regress_res(gasc_reg_FW)#use the striker frame built above, not the midfielder frame gasc_reg
#summary of regression
regresresults.summary()

Overall, this model seems to have little explanatory power, since the adjusted $R^2$ is only 0.198. None of the variables seem to be significant (p < 0.05), which means that neither the Shot Creating Actions nor the Goal Creating Actions nor the age seem to significantly influence the transfer fee.

## Striker Transfermarkt

In [None]:
# merge transfermarkt data with a random forward data set from fbref to get all strikers
striker_tm_reg=pd.merge(stats_tm_field,FW_standard,how="inner", left_index=True, right_index=True)
# merge that with transfer data to get transfer fee
striker_tm_reg=pd.merge(striker_tm_reg,transfers,how="inner", left_index=True, right_index=True)
#only keep needed columns
striker_tm_reg=striker_tm_reg[["Height (cm)","Points per game","Gls/90","Ast/90","Fee","Age"]]

In [None]:
regresresults=regress_res(striker_tm_reg)
#summary of regression
regresresults.summary()

Overall, this model seems to have little explanatory power, since the adjusted $R^2$ is only 0.117. None of the variables seem to be significant (p < 0.05), which means that neither the height, the points per game, the age, nor the goals and assists per 90 minutes seem to significantly influence the transfer fee of a striker.

## FBref Playing Time (Field Players)

In [None]:
#Merge all field player data from fbref together
playing_time_reg=pd.concat([DF_playing_time,MF_playing_time,FW_playing_time],axis=0)
#merge with transfer data
playing_time=pd.merge(playing_time,transfers,how="inner", left_index=True, right_index=True).drop(["ID","Left","Joined"],axis=1)

In [None]:
regresresults=regress_res(playing_time)
#summary of regression
regresresults.summary()

Overall, this model seems to have no explanatory power, since the adjusted $R^2$ is negative. None of the variables seem to be significant (p < 0.05), which means that neither the age, nor the surplus of goals per 90 minutes a team scored while the player was on the pitch, nor the expected surplus of goals per 90 minutes for a team while the player was on the pitch significantly influences the transfer fee. This indicates that whether the team was successful when the player was on the pitch has no significant effect on the transfer fee of the player.

# Conclusion
To sum up, I have conducted ten linear regressions with the data I obtained through web scraping. Through this, I was able to find certain significant relationships (and certain relationships that were not significant) between player statistics and the transfer fee of a player. Three factors that seem to significantly affect the transfer fee of a field player, regardless of position, are the points per game the player has won with his team, the age of the player, and the remaining contract length of the player. This indicates that players who play for a successful team, are young, or have a long-term contract should be more expensive than other players. Furthermore, two non-relationships in the data were striking to me. Firstly, stats like goals or assists per 90 minutes did not significantly influence the transfer fee. This was surprising to me, since scoring goals is arguably the most important thing in football. Secondly, whether a team was successful with a player on the pitch did not seem to significantly influence the transfer fee for that player.

If we look at what drives (or does not drive) the value of a player based on his position, there are also some surprising findings. For defenders, for instance, I was not able to identify any plausible significant relationship between their defending ability and the transfer fee. However, I was able to identify a significant relationship between passes that progress the game and the transfer fee, which to me indicates that a defender who can play a pass and build up the game from the back is more valuable than a defender who can (just) tackle well. Furthermore, for midfielders, I was able to establish a positive relationship between their number of Goal Creating Actions per 90 minutes and their transfer value, which means that a midfielder who produces many Goal Creating Actions is very valuable. For strikers, I was not able to establish any relationship between a certain attribute and their transfer value. However, I was able to identify certain (surprising) non-relationships for strikers. For instance, the number of goals and assists per 90 minutes a striker produces does not significantly influence his transfer value. Similarly, the difference between goals and expected goals per 90 minutes, where a positive value indicates an above-average ability to score goals (or better luck), did not seem to significantly influence the transfer value. Likewise, the goals per shot on target did not significantly influence the transfer fee, which indicates that the goal-scoring ability of a striker does not influence his transfer value. Finally, the number of Goal and Shot Creating Actions per 90 minutes also does not significantly influence the value of a striker.

What has to be noted in this analysis, though, is that for some players there was no data available on [fbref.com](https://fbref.com/en/). Because of this, for some players and stats, there is only very little data available. For goalkeepers, for instance, I was unable to identify any relationships, since I had fewer than 20 observations available. Because of this, conclusions from this analysis should be treated cautiously.
