Notebook: Project Folder Structure
1. Introduction

A well-organized project folder structure is crucial for collaboration, reproducibility, and efficient management of a data science project. This notebook proposes a folder structure adapted from the one DrivenData publishes in its Cookiecutter Data Science project.

For more information, see the [Cookiecutter Data Science project](https://cookiecutter-data-science.drivendata.org/).

2. Project Folder Structure

Here's the recommended folder structure:

```text
project/
├── LICENSE     <- Open-source license if one is chosen
├── Makefile    <- Makefile with convenience commands like `make data` or `make train`
├── README.md   <- The top-level README for developers using this project.
├── data
│   ├── external    <- Data from third party sources.
│   ├── interim     <- Intermediate data that has been transformed.
│   ├── processed   <- The final, canonical data sets for modeling.
│   └── raw         <- The original, immutable data dump.
│
├── envs
│   ├── .dev_env        <- Base environment for development
│   ├── .prod_env       <- Environment for production
│   ├── .dev2_env       <- Extra environment for development, if needed
│   ├── dev_req.txt     <- The requirements file for reproducing the base development environment
│   ├── prod_req.txt    <- The requirements file for reproducing the production environment
│   └── dev2_req.txt    <- The requirements file for reproducing the extra development environment
│
├── docs               <- A default mkdocs (www.mkdocs.org) or Sphinx (https://www.sphinx-doc.org)
│                         project.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
└── src                <- Source code for use in this project.
    │
    ├── __init__.py             <- Makes src a Python package
    ├── config.py               <- Store useful variables and configuration
    ├── dataset.py              <- Scripts to download or generate data
    ├── features.py             <- Code to create features for modeling
    ├── modeling                
    │   ├── __init__.py 
    │   ├── predict.py          <- Code to run model inference with trained models          
    │   └── train.py            <- Code to train models
    └── plots.py                <- Code to create visualizations   
```
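The notebook naming convention in the tree above (a number for ordering, the creator's initials, and a short `-` delimited description) can be checked mechanically. The following is a minimal sketch and not part of the Cookiecutter template: the function name `is_valid_notebook_name` and the exact regular expression are assumptions about how strictly the convention should be read.

```python
import re

# Assumed reading of the convention: "<major>.<minor>-<initials>-<description>",
# optionally followed by the ".ipynb" extension.
NOTEBOOK_NAME = re.compile(
    r"^\d+\.\d+"                   # ordering number, e.g. 1.0
    r"-[a-z]{2,4}"                 # creator's initials, e.g. jqp
    r"-[a-z0-9]+(?:-[a-z0-9]+)*"   # dash-delimited description
    r"(?:\.ipynb)?$"               # optional file extension
)


def is_valid_notebook_name(name):
    """Return True if `name` follows the assumed naming convention."""
    return NOTEBOOK_NAME.match(name) is not None


print(is_valid_notebook_name("1.0-jqp-initial-data-exploration"))  # True
print(is_valid_notebook_name("exploration_final_v2.ipynb"))        # False
```

A check like this could run over the contents of `notebooks/` before a commit; loosen the pattern if your team uses uppercase initials or a different numbering scheme.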

---
3. Folder Structure Details
- `LICENSE`: The open-source license, if one is chosen.
- `Makefile`: Convenience commands such as `make data` or `make train` to streamline project operations.
- `README.md`: The top-level README, with an overview and instructions for developers using the project.
- `data/`: All data-related files and folders.
    - `external/`: Data from third-party sources.
    - `interim/`: Intermediate data that has been transformed.
    - `processed/`: The final, canonical datasets for modeling.
    - `raw/`: The original, immutable data dump.
- `envs/`: Environments and the requirements files needed to reproduce them.
    - `.dev_env`, `.prod_env`, `.dev2_env`: The base development environment, the production environment, and an extra development environment if needed.
    - `dev_req.txt`, `prod_req.txt`, `dev2_req.txt`: The requirements files for reproducing each of those environments.
- `docs/`: A default mkdocs (www.mkdocs.org) or Sphinx (https://www.sphinx-doc.org) project.
- `models/`: Trained and serialized models, model predictions, or model summaries.
- `notebooks/`: Jupyter notebooks. The naming convention is a number (for ordering), the creator's initials, and a short `-` delimited description, e.g. `1.0-jqp-initial-data-exploration`.
- `pyproject.toml`: Project configuration file with package metadata for `src` and configuration for tools like black.
- `references/`: Data dictionaries, manuals, and all other explanatory materials.
- `reports/`: Generated analysis as HTML, PDF, LaTeX, etc.
    - `figures/`: Generated graphics and figures to be used in reporting.
- `requirements.txt`: A single requirements file for reproducing the analysis environment, e.g. generated with `pip freeze > requirements.txt`; an alternative to the per-environment files in `envs/`.
- `setup.cfg`: Configuration file for flake8.
- `src/`: Source code for use in this project.
    - `__init__.py`: Makes `src` a Python package.
    - `config.py`: Stores useful variables and configuration.
    - `dataset.py`: Scripts to download or generate data.
    - `features.py`: Code to create features for modeling.
    - `modeling/`:
        - `__init__.py`
        - `predict.py`: Code to run model inference with trained models.
        - `train.py`: Code to train models.
    - `plots.py`: Code to create visualizations.
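As an illustration of what `config.py` might hold, the sketch below centralizes the data and output locations of the layout so that other modules do not hard-code them. The helper name `project_paths` and the dictionary keys are assumptions chosen for this example, not requirements of the layout.

```python
from pathlib import Path


def project_paths(root="."):
    """Return the standard locations of the layout, relative to `root`."""
    root = Path(root)
    data = root / "data"
    return {
        "raw": data / "raw",
        "interim": data / "interim",
        "processed": data / "processed",
        "external": data / "external",
        "models": root / "models",
        "figures": root / "reports" / "figures",
    }


paths = project_paths("project")
print(paths["raw"].as_posix())      # project/data/raw
print(paths["figures"].as_posix())  # project/reports/figures
```

Code in `dataset.py` or `plots.py` could then import these paths instead of building strings by hand, which keeps a later rename of a folder to a one-line change.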


4. Example Folder Structure

Let's create an example folder structure for a data science project using this layout.

```python
import os


def create_folder_structure(base_path):
    """Create the project folders and empty placeholder files under base_path."""
    # The source package is named "src" to match the tree in Section 2.
    folders = [
        "data/external", "data/interim", "data/processed", "data/raw",
        "docs",
        "envs",
        "models",
        "notebooks",
        "references",
        "reports/figures",
        "src",
        "src/modeling",
    ]

    # exist_ok=True makes the script safe to re-run on an existing project.
    for folder in folders:
        os.makedirs(os.path.join(base_path, folder), exist_ok=True)

    files = [
        "LICENSE", "Makefile", "README.md",
        "pyproject.toml", "requirements.txt", "setup.cfg",
        "envs/dev_req.txt", "envs/prod_req.txt",
        "src/__init__.py",
        "src/config.py",
        "src/dataset.py",
        "src/features.py",
        "src/modeling/__init__.py",
        "src/modeling/predict.py",
        "src/modeling/train.py",
        "src/plots.py",
    ]

    # Mode "a" creates a missing file without truncating an existing one.
    for file_name in files:
        with open(os.path.join(base_path, file_name), "a"):
            pass


# Create the folder structure
base_path = "project"
create_folder_structure(base_path)
```
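To confirm that a layout like this exists on disk, a small check can list whatever is missing. This is a sketch only: the helper name `missing_paths` and the list of expected paths are illustrative assumptions and should be adjusted to your project.

```python
import os
import tempfile


def missing_paths(base_path, expected):
    """Return the entries of `expected` that do not exist under `base_path`."""
    return [p for p in expected if not os.path.exists(os.path.join(base_path, p))]


# Demonstration in a temporary directory so nothing in your project is touched.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "data", "raw"))
    expected = ["data/raw", "data/processed", "models"]
    print(missing_paths(tmp, expected))  # ['data/processed', 'models']
```

An empty list means every expected folder or file is present.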

5. Exercise

- Create a similar folder structure for your own data science project using the provided script.
- Add some initial scripts or data files to each folder.
- Document the purpose of each folder in your `README.md` file.
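For the documentation step, one possible approach is to generate the folder descriptions from a single mapping so the README stays in sync with the layout. The sketch below assumes a hypothetical helper `folder_section` and a hand-written `FOLDERS` dictionary; the descriptions are copied from Section 3.

```python
# Hypothetical mapping of folder -> purpose; extend it to cover your project.
FOLDERS = {
    "data/raw": "The original, immutable data dump.",
    "data/processed": "The final, canonical datasets for modeling.",
    "models": "Trained and serialized models, model predictions, or model summaries.",
    "reports/figures": "Generated graphics and figures to be used in reporting.",
}


def folder_section(folders):
    """Render a mapping of folder -> purpose as Markdown bullet lines."""
    lines = ["Folder overview", ""]
    lines += [f"- `{name}/`: {purpose}" for name, purpose in folders.items()]
    return "\n".join(lines)


print(folder_section(FOLDERS))
```

The returned text can be pasted into `README.md` or appended to it by a script.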