# **Chapter 4:** Project Structure

## Introduction

A well-organized project structure is a cornerstone of project success, particularly in data science, where projects can become complex with various datasets, scripts, notebooks, and dependencies. 

Structuring your project effectively from the start can save you time and trouble later on.

## Chapter outline

4.1 Best Practices for Structuring Python Projects

4.2 A Typical Python Project Layout

4.3 Importance of Other Project Files

---

## **Chapter 4.1:** Best Practices for Structuring Python Projects

The organization of your Python project should promote ease of maintenance, scalability, and collaboration. 

Here are some best practices:


* **Use a Consistent Directory Structure:** A predictable layout allows team members and contributors to find files and directories quickly.

* **Isolate Dependencies:** Utilize virtual environments to keep your project's dependencies separate from the global Python installation and other projects.

* **Automate Repetitive Tasks:** Use scripts for tasks like setting up environments, running tests, and packaging your project.

* **Keep Configuration Files at the Root:** Place files like `.gitignore`, `setup.py`, and `requirements.txt` at the root of your project for easy access.

* **Separate Source Code from Data:** Especially in data science projects, separating code from data can help in managing large datasets and models.

* **Version Control:** Use a version control system like Git from the start of your project, even if you're the only contributor (we will see this in more detail in chapter 5).
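
To illustrate the "Isolate Dependencies" practice, here is a minimal sketch that checks whether the current interpreter runs inside a virtual environment. The helper name `in_virtual_environment` is made up for this example. The check assumes an environment created with `venv` or a recent `virtualenv`; Conda environments are not detected this way.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the interpreter runs inside a venv/virtualenv environment."""
    # Inside a virtual environment, sys.prefix points to the environment,
    # while sys.base_prefix still points to the base Python installation.
    # Assumption: Conda environments are not covered by this comparison.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; pip would install into the base Python.")
```

Running such a check at the start of a setup script is one way to avoid accidentally installing project dependencies into the global Python installation.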

## **Chapter 4.2:** A Typical Python Project Layout

A typical Python data science project might have the following structure:

```text
data_science_project/
│
├── .gitignore                  # Specifies intentionally untracked files to ignore
├── LICENSE                     # Contains the licensing information
├── README.md                   # The top-level description of the project
├── requirements.txt            # The dependencies file for reproducing the analysis environment
├── setup.py                    # Makes the project pip installable (setuptools)
│
├── data/
│   ├── processed/              # Final, canonical datasets for modeling
│   └── raw/                    # The original, immutable data dump
│
├── docs/                       # A default Sphinx project for documentation
│
├── notebooks/                  # Jupyter notebooks for exploration and presentation
│
├── references/                 # Data dictionaries, manuals, and all other explanatory materials
│
├── reports/                    # Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures/                # Graphics and figures to be used in reporting
│
└── src/                        # Source code for use in this project
    ├── __init__.py             # Makes src a Python package
    │
    ├── data/                   # Scripts to download or generate data
    │   └── make_dataset.py
    │
    ├── features/               # Scripts to turn raw data into features for modeling
    │   └── build_features.py
    │
    ├── models/                 # Scripts to train models and then use trained models to make predictions
    │   ├── predict_model.py
    │   └── train_model.py
    │
    └── visualization/          # Scripts to create exploratory and results-oriented visualizations
        └── visualize.py
```

This structure is not exhaustive or mandatory but serves as a guideline. Adapt it based on the specific needs of your project.
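
If you want to create such a skeleton by hand, the layout above can be sketched with `pathlib`. This is only an illustration: the helper name `create_project_skeleton` and the two constants are hypothetical, and the files are created empty so that you can fill them in yourself.

```python
from pathlib import Path

# Directories taken from the example layout above.
DIRECTORIES = [
    "data/raw",
    "data/processed",
    "docs",
    "notebooks",
    "references",
    "reports/figures",
    "src/data",
    "src/features",
    "src/models",
    "src/visualization",
]

# Top-level files and the package marker from the example layout.
FILES = [
    ".gitignore",
    "LICENSE",
    "README.md",
    "requirements.txt",
    "setup.py",
    "src/__init__.py",
]


def create_project_skeleton(root: Path) -> None:
    """Create the example directory layout below `root` (hypothetical helper)."""
    for directory in DIRECTORIES:
        # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
        (root / directory).mkdir(parents=True, exist_ok=True)
    for file in FILES:
        # touch() creates an empty file and does not change existing content.
        (root / file).touch(exist_ok=True)
```

Calling `create_project_skeleton(Path("data_science_project"))` creates the folders and empty files relative to the current working directory. Because existing files and folders are left in place, the function can be run again after adding a new entry to one of the lists.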

Read more about standardized project structures in Python: [here](https://drivendata.github.io/cookiecutter-data-science/). The package described there can create the whole structure with a single command. You answer a few prompts, such as *Select your environment manager*, by typing a number (e.g. 2 for Conda). Practice Tasks 4.3 (see below) walks through creating a project structure with that package.


## **Chapter 4.3:** Importance of Other Project Files

* **requirements.txt:** Lists all Python dependencies for your project, allowing anyone to recreate your development environment. 
    * Use `pip freeze > requirements.txt` to generate this file (chapter 2). Read more: [here](https://pip.pypa.io/en/stable/cli/pip_freeze/). 

* **README.md:** Provides an overview of your project, its structure, how to set it up, and how to use it. This is the first file users and contributors will look at, so make it informative.
    * Read more about writing a useful README: [here](https://www.makeareadme.com/).

* **License:** Including a license in your project is crucial as it tells others what they can and cannot do with your code. Choose a license that aligns with how you wish your project to be used. 
    * Read more about finding the right license: [here](https://choosealicense.com/). 
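
To show how a `requirements.txt` file supports reproducibility, the following sketch compares pinned versions with what is installed in the current environment. It is a simplified illustration, not a replacement for pip: the helper names `parse_pinned_requirements` and `find_mismatches` are hypothetical, and only lines of the form `name==version` (as written by `pip freeze`) are handled.

```python
from __future__ import annotations

from importlib import metadata


def parse_pinned_requirements(text: str) -> dict[str, str]:
    """Map package names to versions for lines of the form `name==version`."""
    pinned = {}
    for line in text.splitlines():
        # Drop trailing comments and environment markers (simplification).
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pinned[name.strip()] = version.strip()
    return pinned


def find_mismatches(pinned: dict[str, str]) -> dict[str, str | None]:
    """Return packages whose installed version differs (None means not installed)."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    # Example content; the package versions are made up for illustration.
    example = "numpy==1.26.4\npandas==2.2.2  # data frames\n"
    print(find_mismatches(parse_pinned_requirements(example)))
```

An empty result means that every pinned package is installed in exactly the listed version.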

By adhering to these practices and structuring your project effectively, you create a solid foundation for successful project development, facilitating clear communication with collaborators and ensuring your data science projects are robust, reproducible, and scalable.

---

## 👨‍💻 **Practice Tasks 4.3:** Creating a Project Structure with CookieCutter Data Science (CCDS)

Here's a step-by-step guide to installing *CookieCutter Data Science (CCDS)* and creating a project structure with one command. Try it out:

- Navigate into a folder where you want your project structure:  
    e.g. `C:\Repo\python-for-engineers\programs>cd project_structure`

- Create and **activate** a new virtual environment (or just activate an existing virtual environment):  
    see practical exercise 2.2

- Check whether Python and pip are already installed:  
    ```bash
    python --version
    pip --version
    ```

- Install Git (needed for CCDS), here on Windows with winget:  
    ```bash
    winget install --id Git.Git -e --source winget
    ```

- Install CCDS:  
    ```bash
    pip install cookiecutter-data-science
    ```

- Run ccds:
    ```bash
    ccds
    ```  
    - If you have used ccds before, you will be asked whether to re-download the template.
    ```cmd 
    You've downloaded C:\Users\User\.cookiecutters\cookiecutter-data-science
    before. Is it okay to delete and re-download it? [y/n] (y): y
    ```
    - Enter "y" to confirm and use the latest version of ccds. Enter "n" to use the previously downloaded version of ccds.

- Enter the requested names and answer the questions that follow. For the next chapter, it is recommended to use the same project, repo, and module names as in this example:  
    * project-name: unit1_chapter4
    * repo-name: python_for_engineers
    * module-name: program_code

- Depending on your answers, ccds will create the directory template.
- You will find the generated template in the folder from which you ran ccds.


![image.png](attachment:image.png)
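
The prerequisite checks from the steps above can also be sketched in Python. This is only an illustration: the helper name `check_prerequisites` is hypothetical, and it assumes that the tools are available on your `PATH` under the command names `git`, `pip`, and `ccds` used in the steps.

```python
import shutil
import sys


def check_prerequisites(commands=("git", "pip", "ccds")) -> dict:
    """Return, for each command name, the path found on PATH or None."""
    # shutil.which returns the full path of an executable, or None if it is missing.
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, path in check_prerequisites().items():
        status = path if path else "not found on PATH"
        print(f"{name}: {status}")
```

If `ccds` is reported as not found although it was installed, the virtual environment from the second step is probably not activated.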