# Paper outline

The purpose of this notebook is to outline our paper, including responsibilities.

## Similar papers

List similar (published) papers to help guide our writing:
- https://arxiv.org/abs/1912.11762
    - This would be a good paper to cite in the introduction. It provides evidence that people care about this problem (plenty of papers already exist). We can also reference the accuracies and types of models they used to inform our choices.
    - Their paper also compares accuracy of neural nets to non-neural nets, which we should also discuss.
- http://www.acsij.org/acsij/article/view/246
    - This could be referenced briefly when discussing our data collection, but I don't see a ton of useful stuff for us here.
- https://content.iospress.com/download/journal-of-sports-analytics/jsa0018?id=journal-of-sports-analytics%2Fjsa0018
    - Obviously we should reference this when discussing our variables.
    - This could also be used as one simple baseline to compare our results. If we simply compare the Pythagorean expectation for each team, what would our accuracy be?
- https://doi.org/10.1080/08839514.2018.1442991
    - PDF: https://mustang.cec.miamioh.edu/Resources/Publication/Baseball_Machine_Learning.pdf
    - This seems like a fantastic paper for our introduction. Since it already summarizes lots of past work, we can draw on what they found.
    - It would be really nice if this had a table showing the accuracy of various models. Perhaps we can pull out the relevant info from this paper and make our own.
- https://www.athensjournals.gr/sports/2016-3-4-1-Tolbert.pdf
    - This paper focuses on World Series outcomes. That's obviously very different from what we're doing.
    - They focus entirely on using SVMs.
    - This paper has some seemingly very impressive results, but their data is imbalanced, so accuracy is a problematic metric. They're also predicting something different than we are.
    - This paper can be referenced, but only briefly.
- https://content.sciendo.com/view/journals/ijcss/15/2/article-p91.xml?language=en
    - They get "nearly 60% accuracy" (58.92%) using an SVM, and also try DTs, KNN, and NN.
    - Their paper seems fairly similar to what we're doing (but we got better results), so this is worth referencing to show that our work is an improvement on past work. We're also just better across the board with all of our models.
- https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwj0s8qIm7DuAhUOQ60KHU-ZBHwQFjACegQIAxAC&url=http%3A%2F%2Fwww.jds-online.com%2Ffile_download%2F39%2Fjds-142.pdf&usg=AOvVaw16MAtgR6u_aF9eG3viSXw9
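    - This link is broken for me. The Google redirect decodes to http://www.jds-online.com/file_download/39/jds-142.pdf, which may be worth trying directly.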
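
A minimal sketch of the Pythagorean-expectation baseline suggested above, assuming we have season-to-date runs scored and allowed for each team. The function names, the tuple layout, and the toy numbers are made up for illustration. The exponent of 2 is Bill James's original formula; other exponents (1.83 is a common choice) could be swapped in.

```python
def pythagorean_expectation(runs_scored, runs_allowed, exponent=2.0):
    """Estimated win fraction from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def predict_home_win(home_rs, home_ra, away_rs, away_ra, exponent=2.0):
    """Predict a home win when the home team's expectation is strictly higher."""
    home = pythagorean_expectation(home_rs, home_ra, exponent)
    away = pythagorean_expectation(away_rs, away_ra, exponent)
    return home > away


def baseline_accuracy(games, exponent=2.0):
    """Fraction of games where the baseline matches the actual outcome.

    `games` is an iterable of (home_rs, home_ra, away_rs, away_ra, home_won).
    """
    games = list(games)
    correct = sum(
        predict_home_win(h_rs, h_ra, a_rs, a_ra, exponent) == bool(home_won)
        for h_rs, h_ra, a_rs, a_ra, home_won in games
    )
    return correct / len(games)


# Toy data only (not real results): runs scored/allowed per team plus outcome.
toy_games = [
    (700, 650, 600, 700, 1),
    (600, 700, 720, 640, 1),
    (650, 650, 640, 700, 0),
    (800, 600, 610, 690, 1),
]
print(baseline_accuracy(toy_games))
```

Ties go to the away team here, which is an arbitrary choice. Since we are predicting home wins, we may prefer to break ties toward the home team.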

## Sections

List the sections we plan to include in the paper:
- Abstract
    - Why is this topic important?
    - What do we do differently?
    - Talk about the creation of the data loader for future research
    - Results
    - Brief Discussion
- Introduction
    - Why does predicting wins and win probabilities matter?
        - Helps identify stats correlating to winning
        - Betting
        - Gives insight into the mysteries of baseball
    - Literature Overview
        - See above section
- Methodology
    - Data Collection
        - Discuss that we are looking at predicting whether the home team wins or loses
        - Baseball Reference
        - Lahman's Dataset
        - Batting Covariates
        - Data Loader GitHub
            - Talk about how it works and why future researchers can use it
    - Features
        - Discuss each feature (or group of features) and why it is important
        - Possibly only focus on the percent difference columns
    - Models
        - Baseline Model
        - KNN
        - XGB
        - NN
        - Ensemble?
    - Model Evaluation
        - Accuracy
        - Probabilities
        - **We were discussing how we want to do this; that discussion is in the other notebook**
- Results
    - Show how our models compare at predicting high difference probability games
        - This is a good place to show probabilities graph
    - Show accuracy results on our models compared to other papers
        - Give a table showing compared accuracies
    - Also show the models' probabilities vs. the games' predicted probabilities
        - This is where we can show that when one team has a 0.7 win probability and the other has 0.3, the models reflect that gap. For close games (e.g., 0.56 vs. 0.44), we want to see how each model does and whether it gets the outcome right.
- Discussion
    - Discuss why the reader should care that we are able to have good accuracy on high difference probability games for most models
    - Talk about why the KNN doesn't perform well across the board for probabilities
        - Possibly because it predicts differently than the XGB and NN
    - Talk about how our models across the board outperform all other models in literature
- Future work
    - Adding in more data on specific players for each team
    - Try to find more features that will increase accuracy
    - When put together, the KNN, XGB, and NN reach a theoretical accuracy of around 80%, so we can assume they predict in different ways. How does each model predict, and is there any way to build a better ensemble?
- Conclusion
    - Restate the main points, similar to what is found in the abstract
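
Two of the comparisons in this outline can be sketched in a few lines of plain Python. The first groups games by the gap between the predicted home and away win probabilities and reports accuracy per group, which is one way to back up the "high difference probability games" claim. The second treats the "theoretical accuracy" of the combined models as the share of games that at least one model gets right; that reading is an assumption on my part, so adjust it if we computed the number differently. Function names, bucket edges, and argument layouts are all hypothetical.

```python
def accuracy_by_gap(p_home, home_won, edges=(0.0, 0.1, 0.2, 0.4, 1.0)):
    """Accuracy per bucket of |P(home win) - P(away win)|.

    Returns a list of (low, high, n_games, accuracy) tuples; accuracy is None
    for empty buckets. The last bucket includes its upper edge.
    """
    rows = []
    for i, (low, high) in enumerate(zip(edges[:-1], edges[1:])):
        is_last = i == len(edges) - 2
        hits = []
        for p, won in zip(p_home, home_won):
            # Gap between the two teams, assuming P(home) + P(away) = 1.
            gap = abs(2 * p - 1)
            in_bucket = low <= gap < high or (is_last and gap == high)
            if in_bucket:
                hits.append((p > 0.5) == bool(won))
        accuracy = sum(hits) / len(hits) if hits else None
        rows.append((low, high, len(hits), accuracy))
    return rows


def oracle_accuracy(model_predictions, home_won):
    """Share of games that at least one model predicts correctly.

    `model_predictions` holds one sequence of 0/1 predictions per model.
    """
    per_game = zip(*model_predictions)  # one tuple of model votes per game
    hits = [
        any(bool(vote) == bool(won) for vote in votes)
        for votes, won in zip(per_game, home_won)
    ]
    return sum(hits) / len(hits)
```

Calling `accuracy_by_gap` once per model on the same test set would give the rows for a per-model comparison table. Note that `oracle_accuracy` is an upper bound that no real ensemble is guaranteed to reach, since it assumes we always know which model to trust.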

## Work still remaining

List the work that still needs to be done:
- Compute class probabilities for KNN model
- Compute class probabilities for ensembled (majority vote) model
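
A rough sketch of how both remaining items could be computed, written in plain Python so that it doesn't depend on whichever library we end up using. It assumes Euclidean distance with unweighted neighbors for the KNN, and it shows two options for the ensemble: the share of hard votes, or the average of the models' probabilities (soft voting). All function names are hypothetical. If our KNN is scikit-learn's `KNeighborsClassifier` with uniform weights, its `predict_proba` should give the same neighbor-share estimate.

```python
import math


def knn_home_win_proba(train_X, train_y, x, k=5):
    """Share of the k nearest training games (Euclidean) that were home wins."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x)
    )[:k]
    return sum(label for _, label in nearest) / len(nearest)


def vote_share_proba(votes):
    """Share of models voting for a home win (hard votes, each 0 or 1)."""
    return sum(votes) / len(votes)


def mean_proba(probas):
    """Soft-voting alternative: average the models' home-win probabilities."""
    return sum(probas) / len(probas)
```

With only three models, the vote share can only be 0, 1/3, 2/3, or 1, so it may be too coarse for the probability plots. Averaging the three models' probabilities gives a smoother estimate, but it is no longer strictly a majority vote.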
