
# 01. Problem definition


## Define the business problem you are trying to solve and the objectives you want to accomplish.
The main goal is to hone ML skills, particularly when dealing with "relatively" big data files (a SQL database of a few GB, with multiple tables and 10-15 million rows). "Relatively" means that the dataset cannot be converted to a pandas DataFrame or, at least, that doing so takes a lot of time.

We will use the NBA dataset for this goal.
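
One way to work with a table that does not fit comfortably in a pandas DataFrame is to stream it from SQLite in fixed-size batches and aggregate as you go. The sketch below uses only the standard-library `sqlite3` module. The table `demo_events` and its columns are hypothetical placeholders, and a small in-memory database stands in for the real file.

```python
import sqlite3
from collections import Counter


def count_by_key(conn, query, batch_size=50_000):
    """Stream the rows of `query` in batches and count values of the first column."""
    counts = Counter()
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)  # at most `batch_size` rows in memory
        if not rows:
            break
        counts.update(row[0] for row in rows)
    return counts


# Demo on an in-memory database; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_events (game_id TEXT, event_type INTEGER)")
conn.executemany(
    "INSERT INTO demo_events VALUES (?, ?)",
    [("g1", 1), ("g1", 2), ("g2", 1), ("g2", 1), ("g2", 3)],
)
events_per_game = count_by_key(conn, "SELECT game_id FROM demo_events", batch_size=2)
print(dict(events_per_game))
conn.close()
```

When pandas is still wanted for the per-batch work, `pandas.read_sql_query` accepts a `chunksize` argument and then yields DataFrames one batch at a time. Simple aggregations are often cheaper still when pushed into SQL with `GROUP BY`.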

The objectives are to explore the following questions:

- Efficiency (EFF) Prediction: Use such features as points, assists, rebounds, age, etc. to predict EFF using different ML methods.

- Player Performance Prediction: Use linear regression to predict a player's performance in terms of points, assists, rebounds, or any other statistical category based on their previous games, their age, position, minutes played, etc. You can use data from the play_by_play and common_player_info tables for this.

- Team Performance Prediction: Predict the number of wins a team might have in a season based on parameters such as the team's average points, rebounds, and assists per game. Data from the team_info_common table can be used here.

- Player Improvement Over Time: Analyze how a player's performance (points, rebounds, assists, etc.) improves or declines over time. This may depend on variables such as age, experience (number of seasons played), team changes, etc. This would require data from the play_by_play, common_player_info, and potentially team_history tables.

- Effect of Draft Pick on Career: Examine how a player's draft position (from draft_history) affects their overall career statistics or longevity in the NBA. This analysis could reveal if higher draft picks generally lead to better careers.

- Player Attributes and Performance: Investigate relationships between player physical attributes (height, weight, wingspan from draft_combine_stats) and their on-court performance. You can study whether players with certain physical attributes are more likely to excel in specific areas (e.g., taller players and rebounding).

- Impact of Home/Away Games: Assess the impact of playing at home vs. away on a team's performance. The play_by_play table might include the necessary data to investigate this aspect.

- Etc.
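
For the EFF objective, it helps to pin down the target first. The sketch below assumes EFF means the standard NBA efficiency rating: points, rebounds, assists, steals, and blocks, minus missed field goals, missed free throws, and turnovers, divided by games played. The function and parameter names are illustrative and do not come from the database schema.

```python
def efficiency(pts, reb, ast, stl, blk, fga, fgm, fta, ftm, tov, games=1):
    """NBA efficiency rating (EFF) per game, assuming the standard formula."""
    if games <= 0:
        raise ValueError("games must be positive")
    missed_fg = fga - fgm  # missed field goals
    missed_ft = fta - ftm  # missed free throws
    return (pts + reb + ast + stl + blk - missed_fg - missed_ft - tov) / games


# Single-game example with made-up box-score numbers.
eff = efficiency(pts=30, reb=10, ast=5, stl=2, blk=1,
                 fga=20, fgm=12, fta=8, ftm=6, tov=3)
print(eff)  # 35.0
```

Because EFF is an exact linear combination of box-score statistics, a model given all of these inputs for the same games will reproduce it trivially. The prediction task is more informative when features come from earlier games or from non-box-score attributes such as age.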

## Identify the target audience and stakeholders.
It is just me &#x1F603; and other enthusiasts who love basketball and ML.

## Determine the data sources required to solve the problem.
There are three data sources.

1. We will use [the NBA Database](https://github.com/wyattowalsh/nba-db):
> This repository contains the associated code base for the creation and updating of [the Kaggle NBA Database](https://www.kaggle.com/datasets/wyattowalsh/basketball). The nba-api is utilized as the API client for stats.nba.com and numerous endpoints are extracted to produce the database tables. .SQLite is the database format of choice for this project. The database is updated daily and monthly via cron scheduled Kaggle Notebooks.

    This SQLite DB is about 3 GB in size and contains 16 tables. One of the tables has more than 13 million rows.  

    Unfortunately, the dataset does not provide a detailed description of all tables and columns. Some names are self-explanatory, while others are documented on the project's GitHub page:

    [User Guide](https://github.com/wyattowalsh/nba-db/blob/main/docs/user_guide/endpoints.md)

    In addition, it may be useful to visit the following resources to better understand abbreviations and names:
    - [The nba_api project](https://github.com/swar/nba_api), especially: [Examples](https://github.com/swar/nba_api/tree/master/docs/examples) and [Endpoints](https://github.com/swar/nba_api/tree/master/docs/nba_api/stats)
    - [Developer Portal](https://gom-uat.ngss.nba.com/ui/developer) and [NBA Game Distribution API](https://developer.geniussports.com/nbangss/rest/index_central.html)


2. We will use [the balldontlie project](https://www.balldontlie.io/home.html?shell#introduction) to collect additional data, especially advanced statistics.

3. We will use [basketball-reference.com](https://www.basketball-reference.com/) to collect additional data, especially common player info.
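
Since the SQLite database ships without a full description of its tables, a quick first step is to list every table with its row count. This is a minimal sketch using the standard-library `sqlite3` module. The demo tables are hypothetical, and for the real file you would pass the path to the downloaded database to `sqlite3.connect` instead of `":memory:"`.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    # Names are read from the database's own catalog and double-quoted as identifiers.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Demo on an in-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team (id INTEGER)")
conn.execute("CREATE TABLE player (id INTEGER)")
conn.executemany("INSERT INTO player VALUES (?)", [(1,), (2,), (3,)])
row_counts = table_row_counts(conn)
print(row_counts)
conn.close()
```

On a table with millions of rows, `COUNT(*)` performs a full scan, so expect the largest table to take noticeably longer than the rest.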

   
## Define the success criteria and metrics for evaluating the model's performance.
The success criteria and performance metrics depend on the task and will be established during the research process.

# Memo with the main stages


### Problem Definition:

- Define the business problem you are trying to solve and the objectives you want to accomplish.
- Identify the target audience and stakeholders.
- Determine the data sources required to solve the problem.
- Define the success criteria and metrics for evaluating the model's performance.
### Data Collection:

- Collect and gather the data needed to solve the problem.
- Identify the relevant data sources and acquire the data.
- Perform data quality checks to ensure the data is accurate and complete.
- Store the data in a format that can be easily accessed and analyzed.
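
As an illustration of a data quality check that runs inside SQLite rather than in memory, the sketch below counts missing values per column. The table `combine_demo` and its columns are hypothetical stand-ins. `COUNT(column)` counts only non-NULL values, so the difference from `COUNT(*)` gives the number of NULLs.

```python
import sqlite3


def null_counts(conn, table):
    """Return {column_name: number_of_NULLs} for one table."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    result = {}
    for column in columns:
        non_null = conn.execute(
            f'SELECT COUNT("{column}") FROM "{table}"'
        ).fetchone()[0]
        result[column] = total - non_null
    return result


# Demo on an in-memory database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE combine_demo (id INTEGER, height REAL, weight REAL)")
conn.executemany(
    "INSERT INTO combine_demo VALUES (?, ?, ?)",
    [(1, 198.0, None), (2, None, None), (3, 201.0, 95.0)],
)
missing = null_counts(conn, "combine_demo")
print(missing)
conn.close()
```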
### Data Exploration:

- Perform data profiling to understand the structure, size, and quality of the data.
- Use data visualization to explore the data and identify patterns or trends.
- Use statistical analysis and hypothesis testing to gain insights into the data.
### Data Preparation:

- Prepare the data for modeling.
- Perform data cleaning to remove missing or inconsistent data, and correct any errors.
- Perform feature engineering to extract relevant features from the data.
- Transform the data into a format that can be used by the machine learning algorithm.
### Model Building:

- Develop a machine learning model that can solve the problem.
- Select an appropriate algorithm that is suited to the data and the problem.
- Split the data into training and testing sets.
- Train the model using the training data.
- Tune the model by adjusting the hyperparameters to improve its performance.
- Test the model using the testing data to evaluate its performance.
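
The split, train, and test steps above can be sketched with the standard library alone. This is a toy example under stated assumptions: the data is synthetic (points as an exact linear function of minutes played), the model is a one-feature least-squares line, and all function names are illustrative. Hyperparameter tuning is not shown because this model has none.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(pairs, slope, intercept):
    """Average absolute difference between actual and predicted y."""
    return sum(abs(y - (slope * x + intercept)) for x, y in pairs) / len(pairs)


# Synthetic data: points = 0.5 * minutes + 2, with no noise.
data = [(float(m), 0.5 * m + 2.0) for m in range(10, 40)]
train, test = train_test_split(data)
slope, intercept = fit_line(train)
test_mae = mean_absolute_error(test, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, test MAE={test_mae:.6f}")
```

For real work, a library such as scikit-learn provides the same steps with many more models. The point here is only the order of operations: split first, fit on the training part, and measure error on data the model has not seen.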
### Model Evaluation:

- Evaluate the performance of the model and assess its ability to solve the problem.
- Measure the model's accuracy, precision, recall, F1 score, and other performance metrics.
- Visualize the results to gain insights into the model's performance.
- Compare the model's performance against the success criteria defined in the problem definition stage.
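
Accuracy, precision, recall, and F1 score all follow from the four confusion-matrix counts. The sketch below computes them for binary labels, where 1 is the positive class. The labels are made up for illustration, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

These metrics apply to classification tasks. Regression targets such as EFF or points per game are usually evaluated with error measures like MAE, RMSE, or R² instead.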
### Model Deployment:

- Deploy the model into production.
- Integrate the model into the business process and make it available for use by end-users.
- Develop a user interface or API to enable end-users to interact with the model.
- Test the deployment to ensure that the model is working correctly in the production environment.
### Model Maintenance:

- Monitor and maintain the model's performance over time.
- Track the model's performance using data from the production environment.
- Identify any changes in the data or business environment that affect the model's accuracy and update the model as necessary.
- Continuously improve the model by retraining it with new data and updating the algorithm or hyperparameters as needed.