Q1: What features/columns had a relatively even or normal distribution? Which features/columns did not?

A1: The current rank column had a fairly even distribution. The earnings column was spread over a wide range and contained some outliers.

Q2: How did you handle missing values? Why did you do this method as opposed to others?

A2: I handled the missing values by dropping the column that contained them. All of the missing values were in the previous year rank column, and that column was not needed as an input to the model.
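As a rough illustration of the trade-off described in A2, the sketch below uses a small made-up frame (the column names mirror the dataset, the values are invented) to contrast dropping the column with dropping the affected rows.

```python
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "Current Rank": [1, 2, 3, 4],
    "Previous Year Rank": [None, "3", None, "5"],
    "earnings ($ million)": [28.6, 26.0, 13.0, 10.0],
})

# Approach used in this notebook: drop the whole column, keeping every row.
dropped_col = toy.drop(columns=["Previous Year Rank"])

# Alternative: drop only the rows with a missing value (loses observations).
dropped_rows = toy.dropna(subset=["Previous Year Rank"])

print(dropped_col.shape)   # (4, 2)
print(dropped_rows.shape)  # (2, 3)
```

Dropping the column keeps all rows available for training, which matters more when the column is not used as a feature anyway.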

Q3: How did you encode your categorical data? Why did you do this method as opposed to others?

A3: I used label encoding for the year and nationality columns because each value can be mapped to an integer in a fixed order: years chronologically, and nationalities in order of how often they appear in the data. I used one-hot encoding for the sport column because sports have no meaningful order relative to one another; the values act more like labels than quantities.
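The difference between the two encodings in A3 can be sketched on a tiny invented frame. This is only an illustration using `pd.get_dummies` and a dictionary map; the notebook itself uses `replace` and scikit-learn's `OneHotEncoder`.

```python
import pandas as pd

# Invented rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Year": [1990, 1991, 1990, 2020],
    "Sport": ["boxing", "golf", "boxing", "tennis"],
})

# Label encoding: map each year to its position in sorted order,
# so the integer keeps the chronological ordering.
years = sorted(toy["Year"].unique())
toy["YearLabel"] = toy["Year"].map({y: i for i, y in enumerate(years)})

# One-hot encoding: one 0/1 column per sport, with no order implied.
one_hot = pd.get_dummies(toy["Sport"], dtype=int)

encoded = pd.concat([toy[["YearLabel"]], one_hot], axis=1)
print(encoded)
```

Note that this sketch numbers only the years present in the toy data, whereas the notebook numbers every year from 1990 to 2020.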

Q4: How did you handle removing outliers? Why did you use this method as opposed to others?

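A4: I used the interquartile range (IQR) rule on the earnings column: rows with earnings below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR were removed. I chose it because it is based on quartiles, so it works on data that is not normally distributed.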
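A minimal sketch of the IQR rule from A4, on invented earnings values. The helper name `iqr_bounds` is hypothetical, and it uses NumPy's default percentile interpolation rather than the `midpoint` method used later in the notebook.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier thresholds using the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented values with one obvious outlier.
earnings = np.array([10.0, 12.0, 13.0, 15.0, 18.0, 20.0, 300.0])

low, high = iqr_bounds(earnings)

# Keep only the values strictly inside the thresholds.
kept = earnings[(earnings > low) & (earnings < high)]
print(low, high)  # 2.75 28.75
print(kept)
```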

Q5: How did you normalize/standardize the data? Why did you use this method as opposed to others?

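A5: I used min-max normalization, which rescales the earnings values to the range 0 to 1. Because the outliers were removed before scaling, extreme values do not compress the rest of the data. Normalization is also a reasonable choice for distributions that are not even.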
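Min-max normalization computes x' = (x - min) / (max - min) for each value. The plain-Python sketch below (the function name is made up for illustration) shows the arithmetic and why a large value left in the data would squeeze everything else toward 0.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Invented values: the largest one pushes the others into the low end.
scaled = min_max_scale([10.0, 20.0, 30.0, 110.0])
print(scaled)  # [0.0, 0.1, 0.2, 1.0]
```

By default, scikit-learn's `MinMaxScaler` applies the same formula to each column.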

Q6: How did each model perform? Which performed the best?

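A6: The decision tree regressor performed best, with an R2 score of around 0.94. The MLP regressor did worst, with an R2 score of 0.73. The train/test split is random and not seeded, so the scores change from run to run; in the run recorded in the cells below, the decision tree scores about 0.89 and the MLP about 0.67.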
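For reference, the R2 score compared in A6 is 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. A small plain-Python sketch with invented numbers (the function name `r2` is illustrative, not the scikit-learn API):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

print(r2(y_true, y_true))                # 1.0 for perfect predictions
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0 for always predicting the mean
```

A score near 1 means the model explains most of the variation in the target; a score near 0 means it does no better than predicting the mean.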

Q7: Did any models seem to have a relatively high amount of bias (underfitting)? Variance (overfitting)?

A7: The decision tree regressor appears to fit the data well, which suggests low bias and low variance. The MLP regressor had the lowest R2 score, which suggests a poorer fit: more bias, leading to some underfitting.
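One common way to check for the bias and variance discussed in A7 is to compare the score on the training data with the score on the test data. The notebook reports test scores only, so the sketch below is an assumption about how such a check could look, using synthetic data rather than the athlete dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve, not the athlete dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

# Fixing random_state makes the split, and so the scores, reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# .score() returns R2 for regressors.
train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"train R2 = {train_r2:.2f}, test R2 = {test_r2:.2f}")
```

A large gap between a high training score and a lower test score points to variance (overfitting); low scores on both point to bias (underfitting).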



In [1]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# load the data
rawDF = pd.read_csv("athlete.csv")

rawDF

Unnamed: 0,S.NO,Name,Nationality,Current Rank,Previous Year Rank,Sport,Year,earnings ($ million)
0,1,Mike Tyson,USA,1,,boxing,1990,28.6
1,2,Buster Douglas,USA,2,,boxing,1990,26.0
2,3,Sugar Ray Leonard,USA,3,,boxing,1990,13.0
3,4,Ayrton Senna,Brazil,4,,auto racing,1990,10.0
4,5,Alain Prost,France,5,,auto racing,1990,9.0
...,...,...,...,...,...,...,...,...
296,297,Stephen Curry,USA,6,9,Basketball,2020,74.4
297,298,Kevin Durant,USA,7,10,Basketball,2020,63.9
298,299,Tiger Woods,USA,8,11,Golf,2020,62.3
299,300,Kirk Cousins,USA,9,>100,American Football,2020,60.5


In [3]:
for col in rawDF:
    naCount = rawDF[col].isna().sum()
    print(f"The number of na values in the {col} col is {naCount}")

The number of na values in the S.NO col is 0
The number of na values in the Name col is 0
The number of na values in the Nationality col is 0
The number of na values in the Current Rank col is 0
The number of na values in the Previous Year Rank col is 24
The number of na values in the Sport col is 0
The number of na values in the Year col is 0
The number of na values in the earnings ($ million) col is 0


In [4]:
rawDF = rawDF.drop("Previous Year Rank", axis=1)

In [5]:
rawDF = rawDF.drop("Name", axis=1)

In [6]:
rawDF["Nationality"].value_counts()

Nationality
USA                 206
UK                   13
Germany              13
Switzerland          12
Portugal             10
Brazil                9
Argentina             9
Canada                6
Italy                 4
Finland               3
France                3
Philippines           3
Russia                1
Australia             1
Dominican             1
Austria               1
Filipino              1
Spain                 1
Serbia                1
Northern Ireland      1
Ireland               1
Mexico                1
Name: count, dtype: int64

In [7]:
rawDF["Sport"].value_counts()

Sport
Basketball                      54
Boxing                          29
basketball                      27
Golf                            24
Soccer                          22
golf                            20
Tennis                          18
boxing                          17
American Football               17
soccer                          11
Auto Racing                     10
F1 racing                        8
auto racing                      7
tennis                           5
F1 Motorsports                   5
motorcycle gp                    4
NFL                              3
Baseball                         3
NASCAR                           3
baseball                         3
Ice Hockey                       2
Auto Racing (Nascar)             2
cycling                          1
American Football / Baseball     1
Hockey                           1
ice hockey                       1
NBA                              1
Auto racing                      1
MMA           
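
The counts above show the same sport under different spellings (for example "Basketball" and "basketball"), and each spelling becomes its own column after one-hot encoding. A possible cleanup step, sketched on invented values, is to normalize case and whitespace first. The alias map is hypothetical; which labels should be merged is a judgment call.

```python
import pandas as pd

# Invented sample of labels in the style of the Sport column.
sports = pd.Series(["Boxing", "boxing", "Basketball", "basketball", "NBA", " Golf"])

# Lowercase and strip whitespace so "Boxing" and "boxing" become one label.
cleaned = sports.str.strip().str.lower()

# Hypothetical alias map for labels that name the same sport differently.
aliases = {"nba": "basketball"}
cleaned = cleaned.replace(aliases)

print(cleaned.value_counts())
```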

In [8]:
rawDF["Year"].value_counts()

Year
2002    11
1990    10
2007    10
2019    10
2018    10
2017    10
2016    10
2015    10
2014    10
2013    10
2012    10
2011    10
2010    10
2009    10
2008    10
2006    10
1991    10
2005    10
2004    10
2003    10
2000    10
1999    10
1998    10
1997    10
1996    10
1995    10
1994    10
1993    10
1992    10
2020    10
Name: count, dtype: int64

In [9]:
# label encoding

def getLabelList(n):
    # return the integer labels 0..n-1
    return list(range(n))

In [10]:
yearList = [1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
    1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
    2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]

In [11]:
rawDF['Year'] = rawDF['Year'].replace(yearList, getLabelList(len(yearList)))

In [12]:
rawDF.tail()

Unnamed: 0,S.NO,Nationality,Current Rank,Sport,Year,earnings ($ million)
296,297,USA,6,Basketball,30,74.4
297,298,USA,7,Basketball,30,63.9
298,299,USA,8,Golf,30,62.3
299,300,USA,9,American Football,30,60.5
300,301,USA,10,American Football,30,59.1


In [13]:
countryList = ['USA', 'UK', 'Germany', 'Switzerland', 'Portugal', 'Brazil', 'Argentina', 'Canada', 'Italy',
            'Finland', 'France', 'Philippines', 'Russia', 'Australia', 'Dominican', 'Austria',
              'Filipino', 'Spain', 'Serbia', 'Northern Ireland', 'Ireland', 'Mexico' ]

In [14]:
rawDF['Nationality'] = rawDF['Nationality'].replace(countryList, getLabelList(len(countryList)))

In [15]:
rawDF.head()

Unnamed: 0,S.NO,Nationality,Current Rank,Sport,Year,earnings ($ million)
0,1,0,1,boxing,0,28.6
1,2,0,2,boxing,0,26.0
2,3,0,3,boxing,0,13.0
3,4,5,4,auto racing,0,10.0
4,5,10,5,auto racing,0,9.0


In [16]:
from sklearn.preprocessing import OneHotEncoder

# create an instance of the one-hot encoder
# sparse_output=False returns a dense array that can be put in a DataFrame later
# (scikit-learn versions before 1.2 call this parameter sparse)
encoder = OneHotEncoder(sparse_output=False)

# perform one-hot encoding on the 'Sport' column
encodedData = encoder.fit_transform(rawDF[['Sport']])

encoder.categories_




[array(['American Football', 'American Football / Baseball', 'Auto Racing',
        'Auto Racing (Nascar)', 'Auto racing', 'Baseball', 'Basketball',
        'Boxing', 'F1 Motorsports', 'F1 racing', 'Golf', 'Hockey',
        'Ice Hockey', 'MMA', 'NASCAR', 'NBA', 'NFL', 'Soccer', 'Tennis',
        'auto racing', 'baseball', 'basketball', 'boxing', 'cycling',
        'golf', 'ice hockey', 'motorcycle gp', 'soccer', 'tennis'],
       dtype=object)]

In [17]:
oneHotDF = pd.DataFrame(encodedData, columns=encoder.categories_)

oneHotDF

Unnamed: 0,American Football,American Football / Baseball,Auto Racing,Auto Racing (Nascar),Auto racing,Baseball,Basketball,Boxing,F1 Motorsports,F1 racing,...,auto racing,baseball,basketball,boxing,cycling,golf,ice hockey,motorcycle gp,soccer,tennis
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
297,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
298,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
299,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
cleanedDF = pd.concat([rawDF, oneHotDF], axis=1).drop(columns = "Sport")

cleanedDF.head()

Unnamed: 0,S.NO,Nationality,Current Rank,Year,earnings ($ million),"(American Football,)","(American Football / Baseball,)","(Auto Racing,)","(Auto Racing (Nascar),)","(Auto racing,)",...,"(auto racing,)","(baseball,)","(basketball,)","(boxing,)","(cycling,)","(golf,)","(ice hockey,)","(motorcycle gp,)","(soccer,)","(tennis,)"
0,1,0,1,0,28.6,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0,2,0,26.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0,3,0,13.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,5,4,0,10.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,10,5,0,9.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
oneHotDF.columns = oneHotDF.columns.get_level_values(0)

oneHotDF.head()

Unnamed: 0,American Football,American Football / Baseball,Auto Racing,Auto Racing (Nascar),Auto racing,Baseball,Basketball,Boxing,F1 Motorsports,F1 racing,...,auto racing,baseball,basketball,boxing,cycling,golf,ice hockey,motorcycle gp,soccer,tennis
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
cleanedDF = pd.concat([rawDF, oneHotDF], axis=1).drop(columns = "Sport")

cleanedDF.head()

Unnamed: 0,S.NO,Nationality,Current Rank,Year,earnings ($ million),American Football,American Football / Baseball,Auto Racing,Auto Racing (Nascar),Auto racing,...,auto racing,baseball,basketball,boxing,cycling,golf,ice hockey,motorcycle gp,soccer,tennis
0,1,0,1,0,28.6,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0,2,0,26.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0,3,0,13.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,5,4,0,10.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,10,5,0,9.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
# IQR definition
import numpy as np

# NOTE: on newer version of numpy, interpolation is replaced with method:
Q1 = np.percentile(cleanedDF['earnings ($ million)'], 25, method='midpoint')
Q3 = np.percentile(cleanedDF['earnings ($ million)'], 75, method='midpoint')

IQR = Q3 - Q1

IQR

35.4

In [22]:
maxThreshold = Q3+1.5*IQR
minThreshold = Q1-1.5*IQR

iqrDF = cleanedDF[cleanedDF["earnings ($ million)"] < maxThreshold]
iqrDF = iqrDF[iqrDF["earnings ($ million)"] > minThreshold]

iqrDF.head()

Unnamed: 0,S.NO,Nationality,Current Rank,Year,earnings ($ million),American Football,American Football / Baseball,Auto Racing,Auto Racing (Nascar),Auto racing,...,auto racing,baseball,basketball,boxing,cycling,golf,ice hockey,motorcycle gp,soccer,tennis
0,1,0,1,0,28.6,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0,2,0,26.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0,3,0,13.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,5,4,0,10.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,10,5,0,9.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
Q1 = np.percentile(cleanedDF['Current Rank'], 25, method='midpoint')
Q3 = np.percentile(cleanedDF['Current Rank'], 75, method='midpoint')

IQR = Q3 - Q1

IQR

5.0

In [24]:
# normalization
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# initialize the DF as a copy so that iqrDF is not modified in place
NormalizedDF = iqrDF.copy()

NormalizedDF[["earnings ($ million)"]] = scaler.fit_transform(iqrDF[["earnings ($ million)"]])

NormalizedDF.head()

Unnamed: 0,S.NO,Nationality,Current Rank,Year,earnings ($ million),American Football,American Football / Baseball,Auto Racing,Auto Racing (Nascar),Auto racing,...,auto racing,baseball,basketball,boxing,cycling,golf,ice hockey,motorcycle gp,soccer,tennis
0,1,0,1,0,0.199223,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0,2,0,0.173955,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0,3,0,0.047619,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,5,4,0,0.018465,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,10,5,0,0.008746,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
xDF = NormalizedDF.drop(columns=["earnings ($ million)"])

xDF.tail()

Unnamed: 0,S.NO,Nationality,Current Rank,Year,American Football,American Football / Baseball,Auto Racing,Auto Racing (Nascar),Auto racing,Baseball,...,auto racing,baseball,basketball,boxing,cycling,golf,ice hockey,motorcycle gp,soccer,tennis
296,297,0,6,30,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
297,298,0,7,30,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
298,299,0,8,30,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
299,300,0,9,30,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
300,301,0,10,30,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
yDF = pd.DataFrame(NormalizedDF["earnings ($ million)"])

yDF.head()

Unnamed: 0,earnings ($ million)
0,0.199223
1,0.173955
2,0.047619
3,0.018465
4,0.008746


In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

xTrain, xTest, yTrain, yTest = train_test_split(xDF, 
                                                yDF, 
                                                test_size=0.30)

model = LinearRegression().fit(xTrain, yTrain)

In [28]:
xTrain.head()

Unnamed: 0,S.NO,Nationality,Current Rank,Year,American Football,American Football / Baseball,Auto Racing,Auto Racing (Nascar),Auto racing,Baseball,...,auto racing,baseball,basketball,boxing,cycling,golf,ice hockey,motorcycle gp,soccer,tennis
53,54,0,4,5,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
61,62,0,2,6,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
46,47,0,7,4,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,23,5,3,2,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,97,0,7,9,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
yTrain.tail()

Unnamed: 0,earnings ($ million)
224,0.50243
221,0.680272
191,0.941691
270,0.368319
219,0.334305


In [30]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

xTrain, xTest, yTrain, yTest = train_test_split(xDF, 
                                                yDF, 
                                                test_size=0.30)

model = LinearRegression().fit(xTrain, yTrain)

In [31]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

preds = model.predict(xTest)

print(r2_score(yTest, preds))
print(mean_absolute_error(yTest, preds))
print(mean_squared_error(yTest, preds))

0.8314505139675434
0.08051891202642529
0.010746838644799285


In [33]:
from sklearn import svm

# pass the target as a 1-D array, which is what LinearSVR expects
model = svm.LinearSVR().fit(xTrain, yTrain.values.ravel())

preds = model.predict(xTest)

print(r2_score(yTest, preds))
print(mean_absolute_error(yTest, preds))
print(mean_squared_error(yTest, preds))

0.6689032800935042
0.09723493978033988
0.021110969297006475


In [34]:
from sklearn import tree

model = tree.DecisionTreeRegressor().fit(xTrain, yTrain)

preds = model.predict(xTest)

print(r2_score(yTest, preds))
print(mean_absolute_error(yTest, preds))
print(mean_squared_error(yTest, preds))

0.8885928783991103
0.05516428080060274
0.0071033996478383965


In [35]:
from sklearn.neural_network import MLPRegressor

# pass the target as a 1-D array, which is what MLPRegressor expects
model = MLPRegressor().fit(xTrain, yTrain.values.ravel())

preds = model.predict(xTest)

print(r2_score(yTest, preds))
print(mean_absolute_error(yTest, preds))
print(mean_squared_error(yTest, preds))


0.6651203150610538
0.11619614630563327
0.02135217391743963


