# Football Player Price Visualization & Prediction

## Football Player Price Visualization

In the English Premier League, May to July is a lull period because no club football is played.
What makes up for it is the intense speculation that surrounds every major player transfer.
An important part of any negotiation is estimating a fair market price for the player.
The task here is to predict a player's market value from the data described below.
The data set consists of the following attributes:
- name: name of the player
- club: club of the player
- age: age of the player
- position: the usual position on the pitch
- position_cat:
  - 1 for attackers
  - 2 for midfielders
  - 3 for defenders
  - 4 for goalkeepers
- market_value: as on transfermrkt.com on July 20th, 2017
- page_views: average daily Wikipedia page views from September 1, 2016 to May 1, 2017
- fpl_value: value in Fantasy Premier League as on July 20th, 2017
- fpl_sel: % of FPL players who have selected that player in their team
- fpl_points: FPL points accumulated over the previous season
- region:
  - 1 for England
  - 2 for EU
  - 3 for Americas
  - 4 for Rest of World
- nationality
- new_foreign: whether a new signing from a different league, for 2017/18 (till 20th July)
- age_cat
- club_id
- big_club: whether one of the top 6 clubs
- new_signing: whether a new signing for 2017/18 (till 20th July)

TASK:
- Investigate the data with seaborn, NumPy and pandas and present the findings
- Build a linear regression model to predict market_value

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
fb=pd.read_csv('//Users//guneetkohli//Desktop//Football Player Price Prediction//football 8 sem.csv')

In [None]:
fb.head()

### Distribution Of Market Value


In [None]:
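# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(fb.market_value, kde=True)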
plt.xlabel('Market Value')
plt.ylabel('count')
plt.show()

Clearly not a normal distribution, but this was expected. Teams tend to have few elite players, and a large number of low + mid value players in their squads. An analysis of a team’s 1st 15 would probably look more like a normal distribution, since we’d be excluding low value fringe / youth players.
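
The skew described above can also be quantified rather than only judged by eye. The sketch below is a minimal illustration that uses a synthetic right-skewed sample as a stand-in for `fb.market_value` (the numbers are made up, not taken from the data set) and compares the skewness before and after a log transform.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for fb.market_value: a synthetic right-skewed sample
rng = np.random.default_rng(0)
market_value = pd.Series(rng.lognormal(mean=1.5, sigma=1.0, size=500))

raw_skew = market_value.skew()            # positive value means a long right tail
log_skew = np.log1p(market_value).skew()  # log1p compresses the right tail

print(f"skewness (raw):   {raw_skew:.2f}")
print(f"skewness (log1p): {log_skew:.2f}")
```

With the real data, the same two lines could be run on `fb.market_value` to check whether a log transform brings the distribution closer to symmetric.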

### Distribution Of Popularity

The distribution is similar to that of market value, except for a few outliers with close to 8,000 page views.

In [None]:
sns.histplot(fb.page_views)
plt.xlabel('Popularity')
plt.ylabel('count')
plt.show()

### FPL Valuation


In [None]:
g = sns.scatterplot(x= fb.fpl_value,y = fb.market_value)

g.figure.set_size_inches(5,5)

plt.title("Impact of FPL Value On Market Value")
plt.show()

There seems to be nice agreement between the FPL value and market value, 
despite the fact that FPL valuation is decidedly shorter term, so age would be less of a factor. 

### Market Value with Age
It is fairly intuitive that older players will, on average, have lower market values. A rough illustration:

In [None]:

g = sns.swarmplot(x = "age",
              y= 'market_value', 
              data = fb,
              size = 7,
              hue="age", legend=True)

g.figure.set_size_inches(7,4)

plt.title("Impact of Age On Market Value")
plt.show()

### Relationship between Region and Market Value

In [None]:
sns.jointplot( x= fb.region,y=fb.market_value,data=fb)
plt.show()

### Relationship between Player Position and Market Value
- 1: Attackers
- 2: Midfielders
- 3: Defenders
- 4: Goalkeepers

In [None]:
sns.catplot(x="position_cat", y="market_value", kind="bar", data=fb,hue="position_cat", legend=True)

Attackers and midfielders have higher market values.
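
The bar plot shows the mean per category. A small group-by table makes the same comparison numerically. The sketch below uses a hypothetical eight-row sample with the same column names as `fb`; the values are invented for illustration only.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as fb (values are made up)
sample = pd.DataFrame({
    "position_cat": [1, 1, 2, 2, 3, 3, 4, 4],
    "market_value": [50.0, 20.0, 35.0, 15.0, 12.0, 8.0, 10.0, 4.0],
})
labels = {1: "Attackers", 2: "Midfielders", 3: "Defenders", 4: "Goalkeepers"}

# Mean, median and group size of market_value for each position category
summary = (
    sample.groupby("position_cat")["market_value"]
    .agg(["mean", "median", "count"])
    .rename(index=labels)
)
print(summary)
```

Replacing `sample` with `fb` should give the per-position figures behind the bar plot, with the median as a check against the influence of a few very expensive players.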

### Market Value of Players from Big Clubs

In [None]:
# hue is omitted: market_value is continuous, so using it as hue would create one group per distinct value
sns.jointplot(x=fb.big_club, y=fb.market_value)
plt.show()

### Top 5 Most Valuable Players by Market Value

In [None]:
fb.nlargest(5,'market_value')

### Simple Correlation

In [None]:

numeric_columns = fb.select_dtypes(include=['float64', 'int64'])

corr = numeric_columns.corr()


g = sns.heatmap(corr, vmax=.3, center=0,square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, fmt='.2f', cmap='coolwarm')

sns.despine()
g.figure.set_size_inches(14,10)

plt.show()

As the heatmap shows, market value correlates most strongly with page_views, fpl_value and fpl_points (the pair plots shown below depict the same relationships).
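
Reading the strongest correlations off a heatmap can be error-prone, so it may help to rank them explicitly. The following sketch builds a synthetic DataFrame (hypothetical values, with the target deliberately constructed from `fpl_value`) and sorts the features by the absolute value of their correlation with `market_value`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns of fb (values are made up)
rng = np.random.default_rng(1)
n = 200
fpl_value = rng.uniform(4, 12, n)
synthetic = pd.DataFrame({
    "fpl_value": fpl_value,
    "page_views": rng.uniform(100, 5000, n),
    "age": rng.integers(18, 36, n),
    # Target built mostly from fpl_value, so the ranking is predictable
    "market_value": 5 * fpl_value + rng.normal(0, 3, n),
})

# Correlation of every feature with the target, strongest first
ranking = (
    synthetic.corr()["market_value"]
    .drop("market_value")
    .sort_values(key=abs, ascending=False)
)
print(ranking)
```

Applied to the real data, `numeric_columns.corr()["market_value"]` could be ranked the same way.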

# Subset data and create Stacked Plot between club and nationality

In [None]:
fb.club.unique()

In [None]:
fb.nationality.unique()

In [None]:
sns.set(style="ticks")

filtered_fb = fb[
    (fb['club'].isin(['Arsenal', 'Bournemouth', 'Brighton+and+Hove', 'Burnley',
       'Chelsea', 'Crystal+Palace', 'Everton', 'Huddersfield',
       'Leicester+City', 'Liverpool', 'Manchester+City',
       'Manchester+United', 'Newcastle+United', 'Southampton',
       'Stoke+City', 'Swansea', 'Tottenham', 'Watford', 'West+Brom',
       'West+Ham']) & 
                      (fb['nationality'].isin(['Chile', 'Germany', 'Czech Republic', 'England', 'France', 'Spain',
       'Nigeria', 'Switzerland', 'Wales', 'Brazil', 'Egypt', 'Argentina',
       'Colombia', 'Bosnia', 'Norway', 'Poland', 'Scotland', 'Congo DR',
       'Ireland', 'Netherlands', 'Australia', "Cote d'Ivoire", 'Finland',
       'Cameroon', 'Austria', 'Israel', 'Northern Ireland', 'Canada',
       'Belgium', 'Iceland', 'Serbia', 'Portugal', 'Ghana', 'South Korea',
       'Mali', 'Senegal', 'Curacao', 'Denmark', 'Slovenia',
       'Trinidad and Tobago', 'Bermuda', 'Benin', 'Algeria', 'Jamaica',
       'Japan', 'Tunisia', 'Croatia', 'Estonia', 'Ecuador', 'Armenia',
       'Italy', 'Sweden', 'United States', 'Morocco', 'The Gambia',
       'Kenya', 'Greece', 'Uruguay', 'Romania', 'Venezuela',
       'New Zealand'])))]

#print(filtered_fb)

df_plot=filtered_fb.groupby(['club', 'nationality']).size().reset_index().pivot(columns='club', index='nationality', values=0).reset_index()

g = df_plot.set_index('nationality').T.plot(kind='bar', stacked=True, color=sns.color_palette())
sns.despine()
g.figure.set_size_inches(15,15) 
plt.legend(loc = 2, bbox_to_anchor = (1,1))
plt.title('Relationship between Club and Nationality')
plt.show()

# Pairplots

In [None]:
filtered_fb.head()

In [None]:
g = sns.pairplot(filtered_fb[['fpl_value','fpl_sel','fpl_points','page_views','market_value']])

In [None]:
g = sns.swarmplot(x = "club",y= 'market_value', hue="club",data = filtered_fb, size = 7, alpha=1)

sns.despine()

# Set the figure size
g.figure.set_size_inches(10, 10)
# Rotate x-axis labels to prevent overlap
plt.xticks(rotation=90)

# Set the title of the plot
plt.title("Impact of Club On Market Value")

# Show the plot
plt.show()

In [None]:
g2 = sns.stripplot(x="club", y="market_value", hue="club", data=filtered_fb, jitter=True, dodge=True)

sns.despine()

# Set the figure size
g2.figure.set_size_inches(10, 10)
# Rotate x-axis labels to prevent overlap
plt.xticks(rotation=90)

# Set the title of the plot
plt.title("Impact of Club On Market Value")

# Show the plot
plt.show()

In [None]:
g = sns.boxplot(x = "club",
              y = 'market_value', hue="club",
              data = filtered_fb, whis=np.inf)

g.figure.set_size_inches(12,12)
plt.xticks(rotation=90)
plt.title("5 Number Summary--Market Values of Various clubs")
plt.show()

# INSIGHTS


Factors associated with a higher market value:
1. Attackers and midfielders
2. Players aged 22 to 28
3. European players
4. Higher fpl_value, higher fpl_points and more page_views

Other observations:
1. Mostly Armenians are part of different clubs
2. Market values at non-big clubs are more densely clustered, but the highest market values belong to players at big clubs

In [None]:
fb.head()

# Football Player Price Prediction

### LINEAR REGRESSION


As the heatmap shows, market_value and fpl_value have the highest correlation, 0.79.

The model therefore uses fpl_value as the single feature for predicting market_value.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error  # accuracy_score is a classification metric and is not needed here

### Normalizing the data

In [None]:
X = fb['fpl_value'].values
Y = fb['market_value'].values

After the cell below runs, the input feature (X) and the target variable (Y) are each scaled to the range [0, 1] and stored back in X and Y. The scaled values are then used to train the model.

In [None]:
x_scaler = MinMaxScaler() #initializes a MinMaxScaler object for scaling the input features (X)
X = x_scaler.fit_transform(X.reshape(-1,1))
X = X[ : , -1]
y_scaler = MinMaxScaler()
Y = y_scaler.fit_transform(Y.reshape(-1,1))
Y = Y[ : , -1]
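
For reference, min-max scaling maps each value v to (v - min) / (max - min). The sketch below checks this by hand against `MinMaxScaler` on a few hypothetical numbers (not taken from the data set) and shows that `inverse_transform` restores the original values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical fpl_value-like numbers, for illustration only
values = np.array([4.5, 6.0, 7.5, 12.5])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(values.reshape(-1, 1)).ravel()

# The same result computed by hand: (v - min) / (max - min)
manual = (values - values.min()) / (values.max() - values.min())

# inverse_transform undoes the scaling
restored = scaler.inverse_transform(scaled.reshape(-1, 1)).ravel()

print(scaled)
print(restored)
```

Keeping separate scaler objects for X and Y, as the notebook does, is what allows predictions to be mapped back to the original units later.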

### Splitting the Data

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=0.2)

### Error Function
The error function is the mean squared error (MSE), halved, between the values predicted by the linear model and the target values: the squared residuals are summed and divided by 2N. It quantifies the overall fit of the model to the data, with lower values indicating a better fit.
- m: slope (rate of change of the target t with respect to x)
- x: independent variable or input values (the scaled fpl_value)
- c: constant term, i.e. the predicted value when x is zero
- t: the target variable being predicted (the scaled market value)

In [None]:
def error(m, x, c, t):
    N = x.size
    # Vectorized sum of squared residuals
    e = np.sum(((m * x + c) - t) ** 2)
    return e / (2 * N)
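
Because of the division by 2N, this error is half of what scikit-learn's `mean_squared_error` reports, which matters when the two numbers are compared later. The sketch below illustrates the relationship on three hand-picked points; `half_mse` is a hypothetical name for a function that computes the same quantity as `error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def half_mse(m, x, c, t):
    """Sum of squared residuals divided by 2N (same quantity as error())."""
    residuals = (m * x + c) - t
    return np.sum(residuals ** 2) / (2 * x.size)

# Three hand-picked points, for illustration only
x = np.array([0.0, 0.5, 1.0])
t = np.array([0.1, 0.4, 0.9])
m, c = 1.0, 0.0

e = half_mse(m, x, c, t)
mse = mean_squared_error(t, m * x + c)

print(e, mse)  # the second value is twice the first
```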

### Update Function
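Updates the parameters m (slope) and c (intercept) of the linear model with one step of gradient descent. Note that the gradients are taken of the un-normalized sum of squared residuals, so they are 2N times the gradient of the error function above; the learning rate has to be small enough to absorb that factor.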

In [None]:
def update(m, x, c, t, learning_rate):
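    grad_m = np.sum(2 * ((m * x + c) - t) * x)  # d/dm of the sum of squared residuals
    grad_c = np.sum(2 * ((m * x + c) - t))      # d/dc of the sum of squared residuals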
    m = m - grad_m * learning_rate
    c = c - grad_c * learning_rate
    return m,c
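
One way to gain confidence in hand-derived gradients is to compare them with central finite differences. The sketch below does this on synthetic data, under the assumption that the objective being differentiated is the plain sum of squared residuals (which is what the expressions in `update` correspond to). The helper names `sse` and `analytic_grads` are hypothetical.

```python
import numpy as np

def sse(m, c, x, t):
    """Sum of squared residuals, the objective whose gradient update() uses."""
    return np.sum(((m * x + c) - t) ** 2)

def analytic_grads(m, c, x, t):
    """Same gradient expressions as in update()."""
    residuals = (m * x + c) - t
    return np.sum(2 * residuals * x), np.sum(2 * residuals)

# Synthetic data in [0, 1], for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
t = 0.8 * x + 0.1 + rng.normal(0, 0.05, 50)
m, c, h = 0.9, 0.0, 1e-5

# Central finite differences in m and c
num_grad_m = (sse(m + h, c, x, t) - sse(m - h, c, x, t)) / (2 * h)
num_grad_c = (sse(m, c + h, x, t) - sse(m, c - h, x, t)) / (2 * h)
grad_m, grad_c = analytic_grads(m, c, x, t)

print(grad_m, num_grad_m)
print(grad_c, num_grad_c)
```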

### Gradient Descent Function

In [None]:
def gradient_descent(init_m, init_c, x, t, learning_rate, iterations, error_threshold):
    m = init_m
    c = init_c
    error_values = []
    mc_values = []
    for i in range(iterations):
        e = error(m,x,c,t)
        if e < error_threshold:
            print("Error less than the threshold. Stopping the gradient descent.")
            break
        error_values.append(e)
        m,c = update(m,x,c,t,learning_rate)
        mc_values.append((m,c))
    return m, c, error_values, mc_values

In [None]:
init_m = 0.9
init_c = 0
learning_rate = 0.001
iterations = 250
error_threshold = 0.001
m, c, error_values, mc_values = gradient_descent(init_m, init_c, xtrain, ytrain, learning_rate, iterations, error_threshold)
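
For a single-feature linear model, the least-squares solution is also available in closed form, which gives a useful reference for the gradient descent result. The sketch below compares the two on synthetic data (not the football data set), using the same update rule and learning rate but 2000 iterations, an assumption made here so that the estimates have time to converge.

```python
import numpy as np

# Synthetic data in [0, 1], for illustration only
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 200)
t = 0.6 * x + 0.05 + rng.normal(0, 0.05, 200)

# Closed-form least-squares fit: polyfit returns [slope, intercept] for deg=1
m_ls, c_ls = np.polyfit(x, t, deg=1)

# Gradient descent with the same un-normalized update rule as update()
m_gd, c_gd = 0.9, 0.0
learning_rate = 0.001
for _ in range(2000):
    residuals = (m_gd * x + c_gd) - t
    m_gd -= learning_rate * np.sum(2 * residuals * x)
    c_gd -= learning_rate * np.sum(2 * residuals)

print(f"least squares:    m={m_ls:.4f}, c={c_ls:.4f}")
print(f"gradient descent: m={m_gd:.4f}, c={c_gd:.4f}")
```

If the values of m and c from the notebook differ noticeably from `np.polyfit(xtrain, ytrain, 1)`, that may indicate that 250 iterations were not enough for full convergence.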

### Visualization

In [None]:
mc_values_anim = mc_values[0:250:5]
#contains a subset of (m, c) tuples from the first 250 iterations of gradient descent, sampled at every 5th iteration. 
#This subset can be used to visualize the progression of parameter values during the optimization process.

In [None]:
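from matplotlib.animation import FuncAnimation

# Figure, axes and line object that the two callbacks below rely on
fig, ax = plt.subplots()
ln, = ax.plot([], [], color='r')  # regression line, redrawn on every frame

def init():
    ax.scatter(xtest, ytest, color='g')
    ax.set_xlim(0, 1.0)
    ax.set_ylim(0, 1.0)
    return ln,

def update_frame(frame):
    m, c = mc_values_anim[frame]
    x1, y1 = -0.5, m * -0.5 + c
    x2, y2 = 1.5, m * 1.5 + c
    ln.set_data([x1, x2], [y1, y2])
    return ln,

# Depending on the notebook backend, the animation may need explicit rendering, e.g. via anim.to_jshtml()
anim = FuncAnimation(fig, update_frame, frames=len(mc_values_anim), init_func=init, blit=True)
plt.show()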

### Visualization of the learning process

In [None]:
sns.scatterplot(x=xtrain, y=ytrain)
plt.plot(xtrain, (m * xtrain + c), color='r')

### Plotting Error Values

In [None]:
plt.plot(np.arange(len(error_values)), error_values)
plt.ylabel('Errors')
plt.xlabel('Iterations')

### Prediction

### Calculate the predictions

In [None]:
predicted = (m * xtest + c)

In [None]:
# Calculate MSE for the predicted value on the testing set
mean_squared_error(ytest, predicted)

In [None]:
# Putting xtest, ytest and predicted values into a single DataFrame
p = pd.DataFrame(list(zip(xtest, ytest, predicted)), columns=['x', 'target_y', 'predicted_y'])
p.head()

### Reversing the Normalization

In [None]:
predicted  = predicted.reshape(-1,1)
xtest  = xtest.reshape(-1,1)
ytest  = ytest.reshape(-1,1)

xtest_scaled = x_scaler.inverse_transform(xtest)
ytest_scaled = y_scaler.inverse_transform(ytest)
predicted_scaled = y_scaler.inverse_transform(predicted)

xtest_scaled = xtest_scaled[ : , -1]
ytest_scaled = ytest_scaled[ : , -1]
predicted_scaled = predicted_scaled[ : , -1]

p = pd.DataFrame(list(zip(xtest_scaled, ytest_scaled, predicted_scaled)), columns=['x', 'target_y', 'predicted_y'])
p = p.round(decimals = 2)
p.head(10)

In [None]:
# Create a scatter plot for actual target values
plt.scatter(p['x'], p['target_y'], color='blue', label='Actual')

# Create a scatter plot for predicted target values
plt.scatter(p['x'], p['predicted_y'], color='red', label='Predicted')

# Add labels and title
plt.xlabel('FPL Value (original scale)')
plt.ylabel('Market Value')
plt.title('Comparison of Actual vs Predicted Target Values')
plt.legend()

# Show the plot
plt.show()


In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Assuming fb is your DataFrame containing 'market_value' and 'fpl_value' columns

# Split data into train and test sets
X = fb[['fpl_value']]  # Assuming 'fpl_value' is the feature
y = fb['market_value']  # Assuming 'market_value' is the target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


In [None]:
import numpy as np

# Assuming fb is your DataFrame containing 'market_value' column

# Calculate variance
market_value_variance = np.var(fb['market_value'])

print("Variance of market_value:", market_value_variance)


As a general rule of thumb, it's helpful to compare your MSE to the variance of your target variable. 
If the MSE is much smaller than the variance, it suggests that your model is performing well. 
Conversely, if the MSE is close to or larger than the variance, it indicates that your model is not capturing the variability in the data.

In [None]:
if mse < market_value_variance:
    print("The model does better than a mean-only baseline: the mean squared error is smaller than the variance of the target variable 'market_value'.")
else:
    print("The model's performance is relatively poor: the mean squared error is at least as large as the variance of the target variable 'market_value'.")
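
The comparison of MSE with variance can be turned into a single number: the coefficient of determination, R² = 1 - MSE / Var, when both are computed on the same samples. The sketch below checks this identity against scikit-learn's `r2_score` on four hypothetical values. In the notebook the variance is taken over the full data set while the MSE comes from the test split, so the ratio there is only an approximation of R².

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

mse_demo = mean_squared_error(y_true, y_hat)
var_demo = np.var(y_true)  # population variance (ddof=0), the np.var default
r2_manual = 1 - mse_demo / var_demo

print("R^2 by hand:   ", r2_manual)
print("R^2 by sklearn:", r2_score(y_true, y_hat))
```

For the fitted model, `r2_score(y_test, y_pred)` would give the corresponding figure on the test split.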