Tennis Ace
Overview

This project contains a series of open-ended requirements which describe the project you’ll be building. There are many possible ways to correctly fulfill all of these requirements, and you should expect to use the internet, Codecademy, and other resources when you encounter a problem.
Project Goals

You will create a linear regression model that predicts the outcome for a tennis player based on their playing habits. By analyzing and modeling the Association of Tennis Professionals (ATP) data, you will determine what it takes to be one of the best tennis players in the world.
Setup Instructions

If you choose to do this project on your computer instead of Codecademy, you can download what you’ll need by clicking the “Download” button below. If you need help setting up your computer, be sure to check out our setup guide.
Tasks
Prerequisites
1.

In order to complete this project, you should have completed the Linear Regression and Multiple Linear Regression lessons in the Machine Learning Course. This content is also covered in the Data Scientist Career Path.
Project Requirements
2.

“Game, Set, Match!”

No three words are sweeter to hear as a tennis player than those, which indicate that a player has beaten their opponent. While you can head down to your nearest court and aim to overcome your challenger across the net without much practice, a league of professionals spends day and night, month after month practicing to be among the best in the world. Today you will put your linear regression knowledge to the test to better understand what it takes to be an all-star tennis player.

Provided in tennis_stats.csv is data from the men’s professional tennis league, which is called the ATP (Association of Tennis Professionals). The file contains data from the top 1500 ranked players in the ATP over the span of 2009 to 2017. The statistics recorded for each player in each year include service game (offensive) statistics, return game (defensive) statistics, and outcomes. Load the CSV into a DataFrame and investigate it to gain familiarity with the data.
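
A minimal sketch of that first look at the data. So that it runs without the project file, it reads a few made-up rows from an in-memory string; the player names and values are placeholders, not real ATP statistics. With the real file, pd.read_csv('tennis_stats.csv') would take the place of the StringIO input.

```python
import io

import pandas as pd

# Made-up rows standing in for tennis_stats.csv (placeholder values, not real ATP data).
sample_csv = io.StringIO(
    "Player,Year,Aces,DoubleFaults,Wins,Losses,Winnings,Ranking\n"
    "Player A,2016,120,45,20,15,500000,60\n"
    "Player B,2017,30,12,3,9,60000,310\n"
    "Player C,2015,250,80,41,18,1900000,12\n"
)

df = pd.read_csv(sample_csv)

print(df.head())        # first rows of the table
print(df.dtypes)        # type of each column
print(df.describe())    # summary statistics for the numeric columns
print(df.isna().sum())  # count of missing values per column
```

Checking the column types and missing values early helps avoid surprises when the columns are later passed to a regression model.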

Open the hint for more information about each column of the dataset.

The ATP men’s tennis dataset includes a wide array of tennis statistics, which are described below:
Identifying Data

    Player: name of the tennis player
    Year: year data was recorded

Service Game Columns (Offensive)

    Aces: number of serves by the player where the receiver does not touch the ball
    DoubleFaults: number of times player missed both first and second serve attempts
    FirstServe: % of first-serve attempts made
    FirstServePointsWon: % of first-serve attempt points won by the player
    SecondServePointsWon: % of second-serve attempt points won by the player
    BreakPointsFaced: number of times the receiver could have won the player’s service game
    BreakPointsSaved: % of the time the player stopped the receiver from winning the player’s service game when the receiver had the chance
    ServiceGamesPlayed: total number of games where the player served
    ServiceGamesWon: total number of games where the player served and won
    TotalServicePointsWon: % of points in games where the player served that they won

Return Game Columns (Defensive)

    FirstServeReturnPointsWon: % of the opponent’s first-serve points the player was able to win
    SecondServeReturnPointsWon: % of the opponent’s second-serve points the player was able to win
    BreakPointsOpportunities: number of times the player could have won the opponent’s service game
    BreakPointsConverted: % of the time the player was able to win their opponent’s service game when they had the chance
    ReturnGamesPlayed: total number of games where the player’s opponent served
    ReturnGamesWon: total number of games where the player’s opponent served and the player won
    ReturnPointsWon: total number of points where the player’s opponent served and the player won
    TotalPointsWon: % of points won by the player

Outcomes

    Wins: number of matches won in a year
    Losses: number of matches lost in a year
    Winnings: total winnings in USD($) in a year
    Ranking: ranking at the end of year

3.

Perform exploratory analysis on the data by plotting different features against the different outcomes. What relationships do you find between the features and outcomes? Do any of the features seem to predict the outcomes?

We used matplotlib’s .scatter() method to plot different features against different outcomes. See the matplotlib documentation for a refresher on how to use it.

We found a strong relationship between the BreakPointsOpportunities feature and the Winnings outcome.
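
Scatter plots can be complemented with a quick numeric check. The sketch below ranks features by their correlation with Winnings. It is only an illustration: the DataFrame is synthetic, built so that Winnings depends on BreakPointsOpportunities while the other two columns are pure noise, so the numbers say nothing about the real ATP data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ATP data (assumption for illustration only):
# Winnings is constructed from BreakPointsOpportunities plus noise.
rng = np.random.default_rng(0)
n = 200
bpo = rng.integers(0, 500, size=n)
df = pd.DataFrame({
    "BreakPointsOpportunities": bpo,
    "FirstServe": rng.uniform(0.4, 0.8, size=n),
    "DoubleFaults": rng.integers(0, 100, size=n),
    "Winnings": 1000 * bpo + rng.normal(0, 20000, size=n),
})

# Pearson correlation of every feature with the outcome column.
correlations = df.corr()["Winnings"].drop("Winnings")

# Order the features by absolute correlation, strongest first.
order = correlations.abs().sort_values(ascending=False).index
ranked = correlations.reindex(order)
print(ranked)
```

A high correlation only indicates a linear association; it is still worth looking at the scatter plot to confirm the shape of the relationship.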
4.

Use one feature from the dataset to build a single feature linear regression model on the data. Your model, at this point, should use only one feature and predict one of the outcome columns. Before training the model, split your data into training and test datasets so that you can evaluate your model on the test set. How does your model perform? Plot your model’s predictions on the test set against the actual outcome variable to visualize the performance.

Our first single feature linear regression model used 'FirstServeReturnPointsWon' as our feature and Winnings as our outcome.

features = data[['FirstServeReturnPointsWon']]
outcome = data[['Winnings']]

We utilized scikit-learn’s train_test_split function to split our data into training and test sets:

features_train, features_test, outcome_train, outcome_test = train_test_split(features, outcome, train_size = 0.8)

We then created a linear regression model and trained it on the training data:

model = LinearRegression()
model.fit(features_train,outcome_train)

To score the model on the test data, we used our LinearRegression object’s .score() method.

model.score(features_test,outcome_test)

We then found the predicted outcome based on our model and plotted it against the actual outcome:

prediction = model.predict(features_test)
plt.scatter(outcome_test,prediction, alpha=0.4)
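
The value returned by .score() for a LinearRegression model is the coefficient of determination, R². The sketch below recomputes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data is synthetic and the variable names are chosen for illustration; it is not the project dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic single-feature data (assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=1
)
model = LinearRegression().fit(X_train, y_train)
prediction = model.predict(X_test)

# R^2 = 1 - SS_res / SS_tot, computed on the test set.
ss_res = np.sum((y_test - prediction) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

print(r2_by_hand, model.score(X_test, y_test))
```

A score of 1 means the predictions match the test outcomes exactly, 0 means the model does no better than always predicting the mean, and the score can be negative for a model that does worse than that.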

5.

Create a few more linear regression models that use one feature to predict one of the outcomes. Which model that you create is the best?

We found that our best single feature linear regression model came from using 'BreakPointsOpportunities' as the feature to predict 'Winnings'.
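
One way to compare several single-feature models is to loop over candidate features and score each on the same train/test split. This is a sketch on synthetic data in which Winnings is constructed mostly from BreakPointsOpportunities, so the winner here is by design and not a finding about the real dataset. The fixed random_state is an assumption added so that every model sees the same split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ATP data (assumption for illustration only).
rng = np.random.default_rng(2)
n = 300
bpo = rng.integers(0, 500, size=n).astype(float)
data = pd.DataFrame({
    "BreakPointsOpportunities": bpo,
    "FirstServeReturnPointsWon": rng.uniform(0.2, 0.4, size=n),
    "Aces": rng.integers(0, 400, size=n).astype(float),
})
data["Winnings"] = 1000 * bpo + 50 * data["Aces"] + rng.normal(0, 20000, size=n)

candidates = ["BreakPointsOpportunities", "FirstServeReturnPointsWon", "Aces"]

# Split once so every candidate is evaluated on the same test rows.
train, test = train_test_split(data, train_size=0.8, random_state=2)

scores = {}
for feature in candidates:
    model = LinearRegression().fit(train[[feature]], train["Winnings"])
    scores[feature] = model.score(test[[feature]], test["Winnings"])

best_feature = max(scores, key=scores.get)
print(scores)
print("best:", best_feature)
```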
6.

Create a few linear regression models that use two features to predict yearly earnings. Which set of two features results in the best model?

We followed the same steps as in the last exercise to create a linear regression model with 'BreakPointsOpportunities' and 'FirstServeReturnPointsWon' as our features to predict 'Winnings'.

features = data[['BreakPointsOpportunities',
'FirstServeReturnPointsWon']]
outcome = data[['Winnings']]
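
To search for the best pair of features, every two-feature combination can be fitted and scored in turn. The sketch below uses itertools.combinations for this. The data is synthetic, with Winnings constructed from BreakPointsOpportunities and Aces, so the winning pair is by design and says nothing about the real dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ATP data (assumption for illustration only).
rng = np.random.default_rng(3)
n = 300
data = pd.DataFrame({
    "BreakPointsOpportunities": rng.integers(0, 500, size=n).astype(float),
    "FirstServeReturnPointsWon": rng.uniform(0.2, 0.4, size=n),
    "Aces": rng.integers(0, 400, size=n).astype(float),
})
data["Winnings"] = (
    1000 * data["BreakPointsOpportunities"]
    + 500 * data["Aces"]
    + rng.normal(0, 20000, size=n)
)

train, test = train_test_split(data, train_size=0.8, random_state=3)
feature_names = [column for column in data.columns if column != "Winnings"]

# Fit one model per pair of features and record its test-set R^2.
pair_scores = {}
for pair in combinations(feature_names, 2):
    columns = list(pair)
    model = LinearRegression().fit(train[columns], train["Winnings"])
    pair_scores[pair] = model.score(test[columns], test["Winnings"])

best_pair = max(pair_scores, key=pair_scores.get)
print(best_pair, pair_scores[best_pair])
```

With the 18 feature columns listed in the dataset description there are 153 possible pairs, which is still cheap to fit exhaustively.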

7.

Create a few linear regression models that use multiple features to predict yearly earnings. Which set of features results in the best model?

Head to the Codecademy forums and share your set of features that resulted in the highest test score for predicting your outcome. What features are most important for being a successful tennis player?

We created a linear regression model with the below features to predict 'Winnings':

features = data[['FirstServe','FirstServePointsWon','FirstServeReturnPointsWon',
'SecondServePointsWon','SecondServeReturnPointsWon','Aces',
'BreakPointsConverted','BreakPointsFaced','BreakPointsOpportunities',
'BreakPointsSaved','DoubleFaults','ReturnGamesPlayed','ReturnGamesWon',
'ReturnPointsWon','ServiceGamesPlayed','ServiceGamesWon','TotalPointsWon',
'TotalServicePointsWon']]
outcome = data[['Winnings']]
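
To get a rough sense of which features matter most in a multiple-feature model, one option is to standardize the features before fitting so that the coefficients are on a comparable scale. The sketch below does this with scikit-learn's StandardScaler. This is an added technique, not part of the sample solution, and the data is synthetic with only three of the columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ATP data (assumption for illustration only).
rng = np.random.default_rng(4)
n = 400
data = pd.DataFrame({
    "BreakPointsOpportunities": rng.integers(0, 500, size=n).astype(float),
    "Aces": rng.integers(0, 400, size=n).astype(float),
    "FirstServe": rng.uniform(0.4, 0.8, size=n),
})
data["Winnings"] = (
    1000 * data["BreakPointsOpportunities"]
    + 200 * data["Aces"]
    + rng.normal(0, 20000, size=n)
)

feature_names = ["BreakPointsOpportunities", "Aces", "FirstServe"]
X_train, X_test, y_train, y_test = train_test_split(
    data[feature_names], data["Winnings"], train_size=0.8, random_state=4
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)
test_score = model.score(scaler.transform(X_test), y_test)

# Coefficients per standardized feature, largest magnitude first.
coefficients = pd.Series(model.coef_, index=feature_names)
order = coefficients.abs().sort_values(ascending=False).index
ranked = coefficients.reindex(order)
print(test_score)
print(ranked)
```

Treat such rankings with caution: several of the real columns are likely to be strongly correlated with one another, and correlated features can make individual coefficients unstable.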

Solution
8.

Great work! Visit our forums to compare your project to our sample solution code. You can also learn how to host your own solution on GitHub so you can share it with other learners! Your solution might look different from ours, and that’s okay! There are multiple ways to solve these projects, and you’ll learn more by seeing others’ code.


In [None]:
import codecademylib3_seaborn  # Codecademy-only plotting helper; remove this line when running locally
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

mlr = LinearRegression()

# load and investigate the data here:
df = pd.read_csv('tennis_stats.csv')
print(df.head())

# perform exploratory analysis here:
plt.scatter(df.DoubleFaults, df.Wins, alpha=0.4, color = 'Salmon')
plt.xlabel("Double Faults")
plt.ylabel("Wins")
plt.title("DoubleFaults vs Wins")
plt.show()
plt.clf()

plt.scatter(df.BreakPointsOpportunities, df.Wins, alpha=0.4, color = 'LightGreen')
plt.xlabel("Break Points Opportunities")
plt.ylabel("Wins")
plt.title("Break Points Opportunities vs Wins")
plt.show()
plt.clf()

plt.scatter(df.ReturnGamesPlayed, df.Wins, alpha=0.4, color = 'LightBlue')
plt.xlabel("Return Games Played")
plt.ylabel("Wins")
plt.title("Return Games Played vs Wins")
plt.show()
plt.clf()


## Double Faults vs Wins:
plt.scatter(df.DoubleFaults, df.Wins, alpha=0.4, color = 'LightSalmon')
plt.xlabel("Double Faults")
plt.ylabel("Wins")
plt.title("DoubleFaults vs Wins")

X = df.DoubleFaults
X = X.values.reshape(-1, 1)
y = df.Wins

x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

mlr.fit(x_train, y_train)
y_predicted = mlr.predict(x_train)
print(mlr.score(x_train, y_train))

plt.scatter(x_train, y_predicted, alpha=0.4, c='MidnightBlue', label = 'Training')

# evaluate on the held-out test set; do not refit the model on test data
y_predicted_test = mlr.predict(x_test)
print(mlr.score(x_test, y_test))

plt.scatter(x_test, y_predicted_test, alpha=0.4, c='OrangeRed', label = 'Test')

plt.legend()

plt.show()
plt.clf()



## BreakPointsOpportunities vs Wins:
plt.scatter(df.BreakPointsOpportunities, df.Wins, alpha=0.4, color = 'LightGreen')
plt.xlabel("Break Points Opportunities")
plt.ylabel("Wins")
plt.title("Break Points Opportunities vs Wins")

X = df.BreakPointsOpportunities
X = X.values.reshape(-1, 1)
y = df.Wins

x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

mlr.fit(x_train, y_train)
y_predicted = mlr.predict(x_train)
print(mlr.score(x_train, y_train))

plt.scatter(x_train, y_predicted, alpha=0.4, c='DarkSlateGray', label = 'Training')

# evaluate on the held-out test set; do not refit the model on test data
y_predicted_test = mlr.predict(x_test)
print(mlr.score(x_test, y_test))

plt.scatter(x_test, y_predicted_test, alpha=0.4, c='MediumVioletRed', label = 'Test')

plt.legend()
plt.show()
plt.clf()


## ReturnGamesPlayed vs Wins:
plt.scatter(df.ReturnGamesPlayed, df.Wins, alpha=0.4, color = 'LightBlue')
plt.xlabel("Return Games Played")
plt.ylabel("Wins")
plt.title("Return Games Played vs Wins")

X = df.ReturnGamesPlayed
X = X.values.reshape(-1, 1)
y = df.Wins

x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

mlr.fit(x_train, y_train)
y_predicted = mlr.predict(x_train)
print(mlr.score(x_train, y_train))

plt.scatter(x_train, y_predicted, alpha=0.4, c='Purple', label = 'Training')


# evaluate on the held-out test set; do not refit the model on test data
y_predicted_test = mlr.predict(x_test)
print(mlr.score(x_test, y_test))

plt.scatter(x_test, y_predicted_test, alpha=0.4, c='Gold', label = 'Test')

plt.legend()
plt.show()

## perform two feature linear regressions here:

## perform multiple feature linear regressions here:
