# Introduction to Machine Learning in Sports Analytics
This notebook contains the notes taken from the 5th and final course in the Coursera Sports Analytics Program.

## Week 1
### What is Machine Learning?
- Think of machine learning as a paradigm of building computational models based on historical data without explicitly creating rules
- The models are built iteratively (at least conceptually):
    - Some data is collected about a phenomenon
    - Statistical methods are applied to the data to organize it
    - The resulting model can be used for new data

#### Breaking Down Machine Learning
There are a few main branches of Machine Learning depending upon the task:

##### Supervised Learning
- The task is to learn the relationship between <ins>historical data</ins> and some <ins>labels</ins> which already exist
- The labels are usually provided by humans, and the goal of the models created is to predict the labels on new data which does not have a label
- There are two broad categories of supervised approaches
    - **Regression**, where the label is a <ins>target value</ins> such as the draft pick position of a player which you might predict from their previous performance
    - **Classification**, where the label is a <ins>class value</ins> and is categorical in nature, such as predicting the kind of activity based on sensor data from wearables
- This class will focus entirely on this form of machine learning
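
The difference between the two categories can be sketched in a few lines of plain Python. This is only an illustrative sketch: the data values, the feature names, and the helper functions `fit_line` and `nearest_neighbour_label` are all made up for this note and are not the course's code. Least squares and 1-nearest-neighbour are used here simply because they are short to write out.

```python
# Hypothetical toy data, invented purely for illustration.

def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbour_label(point, examples):
    """1-nearest-neighbour: copy the label of the closest labelled example."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    _, label = min(examples, key=lambda ex: sq_dist(point, ex[0]))
    return label


# Regression: the label is a number (draft pick) predicted from points per game.
points_per_game = [0.5, 1.0, 1.5, 2.0]
draft_pick = [40, 30, 20, 10]
slope, intercept = fit_line(points_per_game, draft_pick)
predicted_pick = slope * 1.2 + intercept

# Classification: the label is a category (activity) predicted from
# assumed wearable readings of (acceleration, heart rate).
labelled_readings = [
    ((0.1, 60), "walking"),
    ((0.9, 150), "running"),
    ((0.0, 55), "resting"),
]
predicted_activity = nearest_neighbour_label((0.8, 140), labelled_readings)
```

In both cases the model is fit on examples that already carry a label and is then asked about a new, unlabeled example; only the type of label differs.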

##### Unsupervised Learning
- These approaches do not require the label.  The historical data is used to identify common features about the data which could be used to understand new data
- The most common task is <ins>clustering</ins> of data, which is sometimes done purely statistically and sometimes with visualization of the data in mind
- In sports analytics there are numerous great examples:
    - Which players are similar based on their stats?
    - Which physical activities share similar sensor data?
    - Which teams have similar playing patterns?
- Despite not having a label, human decision making is still an important part of the process in determining which features you are clustering on
    - A model clustering NHL players on goal scoring and location on ice will differentiate players by position, while one trained on time on ice and salary will differentiate between team investment choices
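
To see how much the choice of features matters, here is a small sketch. It is not a full clustering algorithm, just a nearest-neighbour similarity lookup, and the players, stats, and the `most_similar` helper are all hypothetical. The same player ends up "closest" to a different player depending on which features are compared.

```python
import math

# Hypothetical player stats, invented for illustration only.
players = {
    "Forward A": {"goals": 40, "time_on_ice": 20.0, "salary_musd": 9.0},
    "Forward B": {"goals": 35, "time_on_ice": 14.0, "salary_musd": 1.0},
    "Defender C": {"goals": 5, "time_on_ice": 21.0, "salary_musd": 8.5},
}


def most_similar(name, features):
    """Return the other player closest to `name` using only `features`."""
    def scaled(player, feature):
        # Min-max scale each feature so no single unit dominates the distance.
        values = [p[feature] for p in players.values()]
        low, high = min(values), max(values)
        return (players[player][feature] - low) / (high - low)

    def distance(other):
        return math.sqrt(
            sum((scaled(name, f) - scaled(other, f)) ** 2 for f in features)
        )

    return min((p for p in players if p != name), key=distance)


by_scoring = most_similar("Forward A", ["goals"])
by_investment = most_similar("Forward A", ["time_on_ice", "salary_musd"])
```

With these made-up numbers, `by_scoring` is `"Forward B"` (a positional grouping) while `by_investment` is `"Defender C"` (a grouping by how much the team invests in the player).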

##### Semi-Supervised Learning
- Involves a mixture of supervised and unsupervised learning approaches where the human labeling of data is expensive or incomplete
- For instance, imagine you've scraped the web and have a collection of thousands of pictures of athletes and you want to analyse what pair of shoes they are each wearing using computer vision and machine learning
    - Individual shoe identification may be difficult and error prone
    - **Seeding** the system with a few classifications (supervised approach) can be used to identify features of the images which can be used to cluster unlabeled images (unsupervised approach) which then can be used for more labeling

##### Reinforcement Learning
- A method of training a model, similar to supervised learning, where a human does not provide the labels but the machine can sense them in its environment
- Most commonly this is done by providing some function which rewards the machine for correctly classifying data in real time, without human intervention
- An example in sports analytics might be amateur athlete training, where some broad objective is known and can be measured (eg compliance with a training program, which could be read from wearable data) and the machine is able to take some interventions to try to make this happen (eg email or phone app nudges)
- The relationship between when and how often to send emails is then learned from watching the effect they have on compliance with the training program
- This is similar to an A/B test or a Randomised Controlled Trial (RCT), but the allocation is driven by machine learning instead of assigning equal proportions of subjects to each condition
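
One simple way to picture the nudging example is an epsilon-greedy multi-armed bandit. That technique is my choice for this sketch, not something the course names, and the nudge strategies, compliance rates, and the `run_nudge_bandit` function are hypothetical. The learner only sees simulated rewards (complied or not) and gradually sends the better-performing nudge more often, rather than splitting athletes evenly as an A/B test would.

```python
import random


def run_nudge_bandit(compliance_rates, rounds=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over nudge strategies.

    compliance_rates holds the hypothetical true probability that each nudge
    leads to compliance. The learner never reads these directly; they are
    only used to simulate the reward signal from the environment.
    """
    rng = random.Random(seed)
    arms = list(compliance_rates)
    counts = {arm: 0 for arm in arms}
    estimates = {arm: 0.0 for arm in arms}

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.choice(arms)              # explore: try a random nudge
        else:
            arm = max(arms, key=estimates.get)  # exploit: best estimate so far
        reward = 1 if rng.random() < compliance_rates[arm] else 0
        counts[arm] += 1
        # Incremental update of the running mean reward for this nudge.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = run_nudge_bandit(
    {"no_nudge": 0.2, "weekly_email": 0.35, "daily_app_nudge": 0.5}
)
```

`counts` shows how often each nudge was tried and `estimates` holds the learned compliance rate per nudge; the proportions are unequal by design.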

#### The Machine Learning Space
- Supervised (classification, regression)
- Unsupervised (clustering)
- Semisupervised
- Reinforcement

### The Machine Learning Workflow
- **Process Data**
    - Determine features likely to be of significance to the task
    - Acquire and clean data to create these features
    - Label data

- **Create Models**
    - Identify model choice and evaluation strategy
    - Separate data into training, validation, and testing data sets
    - Train and tune models using training/validation data
    - Evaluate model performance on testing data

- **Deploy Model**
    - Make predictions on unseen data and evaluate in the wild
    - Expand through iterations as needed
    
#### Processing Data: Defining Your ML Problem
- Starts with thinking about the problem: what is it you want to model and predict?
    - Game score, match outcome, player salary, movement result, etc.
- The details matter in the prediction:
    - Do you just care about the accuracy of the model? Or do you want the model to be interpretable?
    - How generalizable do you expect the model to be? Where will you use this model?
    - These start to inform your data collection, modeling, and evaluation strategies
- What are the ideal features (or attributes) you think would be useful?
    - Break this list down into those you have high confidence in and those that you are less sure of (lean towards breadth when doing this)
- Be as explicit as possible and be aware of potential scope creep

#### Processing Data: Acquiring Data
A common challenge! There are three broad categories:<br>

- You purchase the data from a third party
    - There are numerous data vendors specifically set up to provide sports outcomes data largely with an eye towards gambling and risk management markets
    - Pricing depends on a few aspects of the data: historical size, accuracy, specific features, frequency
- Web Scraping
    - Complex and wonderful space
    - Lots of technical, ethical, and legal considerations around the access of web data
    - Web scraping can be a very fragile way to obtain data: are you building a proof of concept, or a longer-lived service?
- First party data collection
    - Especially common with wearable technologies and data scientists embedded in sports teams
    - Can be integral in some tasks, and highly valuable in competitive tasks
    
#### Processing Data: Labeling Data
- Core to your question: what are you trying to predict?
- Several different approaches:
    - Often there is a ground truth which can be objectively observed, such as the score of a match or the outcome of a tournament
    - Sometimes the label must be added by a human expert as it isn't found in the data you have (eg the MVP or stars of the game might be announced on TV but not found in web-scraped data)
    - Sometimes you want to engage a group of experts to help label your data
        - Commonly called crowdwork, with the general idea being that you can speed up the labeling of data, collect diverse opinions, or achieve a consensus in difficult tasks
        - Lots of important considerations on accuracy when labeling data
    - When classifying data, is your data balanced among classes? If not, are you able to collect more for minority classes?
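
Two of the points above, reaching a consensus among several labellers and checking class balance, can be sketched with the standard library. The votes, item names, and the helpers `consensus_label` and `class_shares` are hypothetical, and a simple majority vote is assumed as the consensus rule.

```python
from collections import Counter


def consensus_label(votes):
    """Majority vote across labellers; returns (label, share of votes).

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def class_shares(labels):
    """Fraction of examples that fall in each class."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


# Hypothetical votes from three labellers on four plays.
crowd_votes = {
    "play_1": ["goal", "goal", "no_goal"],
    "play_2": ["no_goal", "no_goal", "no_goal"],
    "play_3": ["no_goal", "goal", "no_goal"],
    "play_4": ["no_goal", "no_goal", "no_goal"],
}
labels = {item: consensus_label(votes)[0] for item, votes in crowd_votes.items()}
shares = class_shares(list(labels.values()))
```

Here `shares` comes out as 25% `goal` and 75% `no_goal`, which is the kind of imbalance that might prompt collecting more examples of the minority class.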
    
#### Create Models: Choosing a Model
- Choosing the right modeling technique could be a course in itself
    - Some techniques result in a model which is more interpretable than others (eg Decision Trees)
    - Some require large amounts of data to work well (eg Neural Networks)
    - Some require significant computational resources (eg Deep Learning)
    - And of course, some will just work better for the particular problem and data that you have
- Start simple instead of going for the *latest and greatest*
    - In this course we'll demonstrate some specific fundamental models which have good results and can work well with moderate sized datasets: Decision Trees, Support Vector Machines (SVMs), and Regression Trees
    - But we'll go a bit further, and talk about how you can bring multiple different models together to improve accuracy through a process called ensembling
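
As a rough picture of what an ensemble does, the sketch below combines three models by majority vote. Voting is only one of several ways to build an ensemble, and the three "models" here are hypothetical hand-written rules standing in for trained models; the feature names are made up as well.

```python
from collections import Counter


# Three hypothetical stand-in models; in practice these would be trained.
def always_home(game):
    return "home"


def recent_form(game):
    if game["home_wins_last_10"] >= game["away_wins_last_10"]:
        return "home"
    return "away"


def goal_differential(game):
    if game["home_goal_diff"] > game["away_goal_diff"]:
        return "home"
    return "away"


def ensemble_predict(models, game):
    """Return the prediction that most models agree on."""
    votes = Counter(model(game) for model in models)
    return votes.most_common(1)[0][0]


game = {
    "home_wins_last_10": 4,
    "away_wins_last_10": 7,
    "home_goal_diff": -3,
    "away_goal_diff": 12,
}
prediction = ensemble_predict([always_home, recent_form, goal_differential], game)
```

Two of the three models pick the away team, so the ensemble predicts `"away"` even though one model disagrees. Using an odd number of models avoids ties in a two-class problem.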

#### Create Models: Partitioning Your Data
- When training a model you want to ensure it is generalizable to new data so that the predictive power is high
- To do this, we partition the data into three sets:
    - **Training data**: the data the learning algorithm sees to learn from and create a model
    - **Validation data**: the data you use to evaluate the quality of the model as you are training it
    - **Test data**: the data your client uses to understand how well your model performs
- Conceptually you and your client might be the same person!
    - The more your learning algorithm can observe evidence from your validation or test sets the more likely it is to overtrain to that data
- When partitioning your data it is common to use an 80/20 rule; however, this is not always appropriate, and we'll go through examples later
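
A three-way partition can be sketched with the standard library alone. The function name `three_way_split` and the 60/20/20 fractions are assumptions for this example, not values from the course. A random shuffle is also an assumption: for data with a time order, such as games in a season, splitting chronologically may be the better choice.

```python
import random


def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle a copy of rows and split it into train, validation and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out
    return train, validation, test


games = list(range(100))  # stand-in for 100 game records
train, validation, test = three_way_split(games)
```

The test portion is set aside once and only consulted at the end, so that tuning decisions made on the validation data do not leak into the final evaluation.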

#### Create Models: Evaluating Your Model
- This is a place where your goal really matters
    - Do you want to predict who isn't going to win a tournament with high accuracy? That's easy
- There are different metrics and each helps inform us about how a model performs within the context of a question
    - Let's say we are predicting who is going to be in the NCAA March Madness tournament
    - There are 68 teams who make it out of 350
    - We can naively get an accuracy of about 80% (282 of 350 correct) just by predicting that no team will make it!
- Accuracy is almost always an inappropriate measure, and there are many better measures depending on your goal
    - **Kappa**: chance corrected accuracy
    - **Precision**: true positives divided by the sum of true positives and false positives
    - **Recall**: true positives divided by the sum of true positives and false negatives
    - **F1 Score**: the harmonic mean of precision and recall
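
The March Madness example can be worked through from the four confusion-matrix counts. This is a sketch under two assumptions: kappa is taken to mean Cohen's kappa, and a metric whose denominator is zero is reported as 0.0. The function name `classification_metrics` is made up for this note.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Cohen's kappa: agreement beyond what chance alone would produce,
    # given how often each class is predicted and how often it occurs.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Naive model: predict that none of the 350 teams make the 68-team field.
naive = classification_metrics(tp=0, fp=0, fn=68, tn=282)
```

For this naive baseline, accuracy is about 0.81 while recall, F1, and kappa are all 0 (precision is 0 only by the convention above, since no positive predictions were made). That is why accuracy alone is misleading here.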
    
#### Deploy Model
- Once you have built and evaluated your model you are ready to deploy it in production
- This is where engineering comes in
    - Pipelines of data and modeling, resulting in continually improving systems
    - Timeliness of the model and predictions, especially if you are predicting in live settings
    - Feasibility of predictions: do you need to apply models on an embedded device?
- One challenge is including hard-to-measure information in the process
    - When does a significant event (eg covid) invalidate expectations of generalizability?
    - When do new data sources provide an opportunity to improve the model?
    - How can judgement from humans be integrated in a probabilistic way?
- Deployment is very specific to the goals of the problem being solved

### Our First Model: NHL Game Outcomes
Follow the [hockey_wins](./hockey_wins-1.ipynb) notebook for an overview of this process

#### Reflection on NHL Game Outcomes
- We did a lot! Throughout the process we:
    - Acquired data, through APIs and lightweight web-scraping
    - Cleaned the data, aligning values throughout
    - Made choices on features, putting our knowledge of the sport into our analysis
    - Made decisions on how to represent missing data
    - Ran a fair analysis, building a model on 800 observations and evaluating its accuracy on the remaining ~500 items
- But, let's throw up a few flags
    - Lots of our choices were pretty arbitrary and naive
    - Our features from the previous season were not inspected deeply
    - We don't have a sense as to where this model will likely be good and where it will be bad
    
### Considerations in Deploying The Model
#### Considerations in Using the Game Predictor Model
- There are a lot of considerations when applying this kind of model in practice
    - Does your model generalize well?
    - Which features are important to the accuracy of the model?
    - Is there bias in the model?
- There are techniques to detect these issues as well as techniques to mitigate the problems that might arise from them
- Let's explore our hockey game data a little bit more, with the goal of *sensemaking*, that is, learning how the model actually works

#### Example of Deploying our NHL Game Outcomes Model
Follow the [hockey_wins2](./hockey_wins-2.ipynb) notebook for an example.