# 1. Project structure

When a data science project is about to kick off, there are several templates to choose from, and a number of them are publicly available on GitHub. <br>
A well-known example is the Cookie Cutter [template](https://drivendata.github.io/cookiecutter-data-science/), which is my favourite on account of the name (Ioannis here!).

For EasyJet, the data science team has already developed a personalised project template after several discussions between team members. <br>
You can find the project template at the following GitLab [link](http://code.europe.easyjet.local/data_science/project-template/tree/master).

What follows below is an overview of the DS project template and its different folders, i.e. what they are and what type of files you should put within them.

_Note:_ The project files are in Python, but the same project template is expected to be used for R if needed (with the relevant adaptations).

The project root is always "/repos/_my-project-name_" <br>
In our case the project root is "/repos/my-awesome-project". This notebook runs from the _src/notebooks_ subfolder, so that is the current working directory printed below.

In [1]:
import os

print(f'My Current working directory is: {os.getcwd()}')

My Current working directory is: /repos/my-awesome-project/src/notebooks


The project template contains the following folders/files:
- **src/** _folder_
- **main.py** script <br>
The main Python script that runs, in sequence, the functions that create the model or run the complete process; this will be the productionised script
- **main.ipynb** notebook (optional but recommended) <br>
Mirrors the main.py script and can be used for debugging
- **.gitlab-ci.yml** YAML file used for TDD (hidden file) <br>
Configuration of the TDD that triggers the CI/CD pipeline

The main interest is in the subfolders of the src/ folder and how these subfolders are used. <br>
The following tree-constructed visualisation shows how the src directory is broken down into different sub-directories:

## 1.1 Folder structure

<b>src/</b> <br>
│ <br>
├────────────── <b>config/</b> <br>
├────────────── <b>data/</b> <br>
│&emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp;├──────────────── <b>assets/</b> <br>
│&emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp;├──────────────── <b>json/</b> <br>
│&emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp;└──────────────── <b>models/</b> <br>
├────────────── <b>etl/</b> <br>
├────────────── <b>ml/</b> <br>
├────────────── <b>notebooks/</b> <br>
├────────────── <b>opt/</b> <br>
├────────────── <b>sql/</b> <br>
└────────────── <b>tests/</b> <br>
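
The tree above can also be written down as a small scaffolding script. This is only a minimal sketch and not part of the project template itself: the `SRC_LAYOUT` list and the `scaffold_src` helper are hypothetical names, and the demo runs in a temporary directory so that no real project is touched.

```python
import tempfile
from pathlib import Path

# Sub-folders of src/ exactly as listed in the tree above
SRC_LAYOUT = [
    "config",
    "data/assets",
    "data/json",
    "data/models",
    "etl",
    "ml",
    "notebooks",
    "opt",
    "sql",
    "tests",
]


def scaffold_src(project_root):
    """Create the src/ sub-folders under project_root and return their paths."""
    src = Path(project_root) / "src"
    created = []
    for rel in SRC_LAYOUT:
        folder = src / rel
        # parents=True also creates src/ and data/; exist_ok=True makes re-runs safe
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    folders = scaffold_src(tmp)
    created_names = sorted(p.relative_to(Path(tmp) / "src").as_posix() for p in folders)
    print(created_names)
```

In practice you would clone the GitLab template rather than generate the folders, but the list makes it easy to check that a project still follows the agreed layout.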

## 1.2 Folder explanation

* **config/** <br>
Contains config files (e.g. _config.ini_)

* **data/**
    * **assets/** <br>
    Used to store .csv, .xlsx, or other similar files. By default .csv and .xlsx are blocked from being uploaded to GitLab
    * **json/** <br>
    Used to store lists or dictionaries in an easy to edit way <br>
    * **models/** <br>
    Used to store serialised machine learning models. For instance, a trained model could be stored here as a pickle file (e.g. .pkl)
    
* **etl/** <br>
ETL stands for Extract, Transform, Load. This folder contains any pipeline scripts that are used in the project. <br>
For instance, a machine learning project can have scripts that pre-process the data. Such scripts should be placed in the _etl_ folder

* **ml/** <br>
Contains any machine learning scripts that are used to generate, update, or predict from machine learning models.

* **notebooks/** <br>
The notebooks folder can be used to store any Python notebooks that were used for research-based work, such as data exploration, feature engineering, or model creation. <br>
Notebooks are critical for recording the developer's rationale, so that new people can be on-boarded into the project easily.

* **opt/** <br>
Optional folder for flexibility. For instance, for the Customer Segmentation project, this folder was renamed to "Processes" due to the size of the project.<br>
You can see the folder structure of customer segmentation from the corresponding [GitLab repo](http://code.europe.easyjet.local/data_science/customer-segmentation/tree/master/src)

* **sql/** <br>
Can be used to store SQL code (i.e. _.sql_ files) to query different databases and tables, e.g. DataHub

* **tests/** <br>
Contains the tests required for the project as per our Test Driven Development (TDD). <br>
E.g. unit, integration, and validation tests
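
To make the _data/json/_ and _data/models/_ descriptions more concrete, here is a minimal sketch of writing to and reading from both folders. The file names (`feature_groups.json`, `model.pkl`) and the dictionary contents are hypothetical, and a plain dictionary stands in for a trained model; any picklable object would be stored the same way.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Work in a temporary directory that mimics src/data/
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "src" / "data"
    (data_dir / "json").mkdir(parents=True)
    (data_dir / "models").mkdir(parents=True)

    # data/json/: a small, hand-editable dictionary (hypothetical content)
    feature_groups = {"acidity": ["fixed acidity", "volatile acidity", "citric acid"]}
    json_path = data_dir / "json" / "feature_groups.json"
    json_path.write_text(json.dumps(feature_groups, indent=2))
    reloaded_groups = json.loads(json_path.read_text())

    # data/models/: a serialised object; this dict is a stand-in for a trained model
    stand_in_model = {"name": "placeholder", "coefficients": [0.1, -0.2]}
    model_path = data_dir / "models" / "model.pkl"
    with open(model_path, "wb") as fh:
        pickle.dump(stand_in_model, fh)
    with open(model_path, "rb") as fh:
        reloaded_model = pickle.load(fh)

print(reloaded_groups == feature_groups, reloaded_model == stand_in_model)
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.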

# 2. ConfigReader (ezTools)

The ConfigReader is a class in ezTools that enables you to read the config file created in the _src/config/_ folder. <br>
The class converts a _config.ini_ into a tuple or dictionary to make the config parameters easily accessible.

There are multiple config file formats that can be used in Python, and you are welcome to explore them at the following [link](https://martin-thoma.com/configuration-files-in-python/#:~:text=configuration%20handling%3A%20cfg_load-,Python%20Configuration%20File,to%20avoid%20uploading%20it%20accidentally.). <br>
However, the ConfigReader class works with [.ini](https://en.wikipedia.org/wiki/INI_file#:~:text=The%20name%20of%20these%20configuration,this%20method%20of%20software%20configuration.) config files, where "ini" stands for initialisation.

There are two main parameters that should be taken into account for the purpose of the demonstration:
1. **config_path** <br>
The path to the .ini config file
2. **config_tuple** <br>
If True -> Returns tuple <br>
If False -> Returns dictionary

The difference between the tuple and the dictionary is that the latter is mutable, which means you can add parameters as you go. This is really useful when values computed in one step need to be passed on to later steps.

_Example:_ In customer segmentation we keep the number of principal components in the config so that the test data can be transformed, and we even store dataframes to be passed to the next process.
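
The tuple-versus-dictionary behaviour can be sketched with the standard library alone. This is not the ezTools implementation: `read_ini` is a hypothetical helper, the use of `namedtuple` is an assumption, and the `.ini` text is inferred from the config shown in this notebook, so the real _config.ini_ may differ.

```python
import configparser
from collections import namedtuple

# Inferred from the config printed in this notebook; the real file may differ
INI_TEXT = """
[settings]
repos_path = /repos/my-awesome-project
mnt_path = /mnt

[data]
data_path = /mnt/df_wine.csv
response = wine_colour
positive_class = white
target_names = red,white
"""


def read_ini(text, as_tuple=True):
    """Parse .ini text into a nested namedtuple (immutable) or a nested dict (mutable)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    as_dict = {section: dict(parser[section]) for section in parser.sections()}
    if not as_tuple:
        return as_dict
    # One namedtuple per section, then one namedtuple holding all sections
    sections = {
        name: namedtuple("Struct", list(values))(**values)
        for name, values in as_dict.items()
    }
    return namedtuple("Struct", list(sections))(**sections)


cfg_tuple = read_ini(INI_TEXT, as_tuple=True)
cfg_dict = read_ini(INI_TEXT, as_tuple=False)

print(cfg_tuple.settings.repos_path)      # attribute access on the tuple
cfg_dict["settings"]["success"] = "yes"   # dictionaries accept new parameters

try:
    cfg_tuple.settings.success = "yes"    # namedtuples reject new attributes
except AttributeError as err:
    print("tuple is read-only:", type(err).__name__)
```

Note that every value is read as a string, so a list such as `target_names` has to be split explicitly, e.g. `cfg_tuple.data.target_names.split(",")`.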

## 2.1 Read config as tuple

In [2]:
# Load config reader
from eztools.operations import ConfigReader

In [3]:
# Read config.ini
config_path = '/repos/my-awesome-project/src/config/config.ini'
config = ConfigReader(config_path, config_tuple = True).read_config()

# Print config
config

Struct(settings=Struct(repos_path='/repos/my-awesome-project', mnt_path='/mnt'), data=Struct(data_path='/mnt/df_wine.csv', response='wine_colour', positive_class='white', target_names='red,white'))

In [4]:
# Accessing parameters of config
# Settings -> repos_path
config.settings.repos_path

'/repos/my-awesome-project'

## 2.2 Read config as dictionary

In [5]:
# Read config.ini
config_path = '/repos/my-awesome-project/src/config/config.ini'
config = ConfigReader(config_path, config_tuple = False).read_config()

# Print config
config

{'settings': {'repos_path': '/repos/my-awesome-project', 'mnt_path': '/mnt'},
 'data': {'data_path': '/mnt/df_wine.csv',
  'response': 'wine_colour',
  'positive_class': 'white',
  'target_names': 'red,white'}}

In [6]:
# Accessing parameters of config
# Settings -> repos_path
config['settings']['repos_path']

'/repos/my-awesome-project'

In [7]:
# Update config
config['settings']['success'] = 'yes'

# Print config
config

{'settings': {'repos_path': '/repos/my-awesome-project',
  'mnt_path': '/mnt',
  'success': 'yes'},
 'data': {'data_path': '/mnt/df_wine.csv',
  'response': 'wine_colour',
  'positive_class': 'white',
  'target_names': 'red,white'}}

If you would like to know more about how the ConfigReader was built (i.e. dive into the code), please refer to the ezTools [documentation](http://code.europe.easyjet.local/data_science/ezTools/blob/master/eztools/operations.py).

# 3. Logger (ezTools)

The Logger() class is located in ezTools and creates a logger object from the Python logging library. <br>
Logging provides a flexible framework for emitting log messages from Python processes.

Python has its own logging capabilities, which you can read about at the following [link](https://www.loggly.com/ultimate-guide/python-logging-basics/#:~:text=Standard%20Library%20Logging%20Module,log%20messages%20from%20Python%20programs.&text=Internally%2C%20the%20message%20is%20turned,object%20registered%20for%20this%20logger.).

Let's now discuss some of the main parameters of the Logger() class:

1. **logger_path** <br>
The path to the folder where the log files will be stored
2. **logger_name** <br>
The name of the logger
3. **log_in_console** <br>
Whether the log message is printed in the console
4. **log_in_file** <br>
Whether the log messages are also written to an output file
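
As a rough illustration of what these four parameters control, here is a minimal sketch built directly on the standard `logging` module. It is not the ezTools code: `build_logger`, the message format, the DEBUG level, and the `<logger_name>.log` file name are all assumptions made for the example.

```python
import logging
import os
import tempfile


def build_logger(logger_path, logger_name, log_in_console=True, log_in_file=False):
    """Return a standard-library logger with optional console and file handlers."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers when a notebook cell is re-run
    formatter = logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")

    if log_in_console:
        console = logging.StreamHandler()
        console.setFormatter(formatter)
        logger.addHandler(console)

    if log_in_file:
        os.makedirs(logger_path, exist_ok=True)
        file_handler = logging.FileHandler(os.path.join(logger_path, f"{logger_name}.log"))
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

    return logger


# Demo in a temporary folder instead of a real log location
with tempfile.TemporaryDirectory() as tmp:
    demo_logger = build_logger(tmp, "demo", log_in_console=True, log_in_file=True)
    demo_logger.info("Data loaded")
    handler_types = sorted(type(h).__name__ for h in demo_logger.handlers)
    file_written = os.path.exists(os.path.join(tmp, "demo.log"))
    # Close the handlers before the temporary folder is removed
    for handler in list(demo_logger.handlers):
        handler.close()
        demo_logger.removeHandler(handler)

print(handler_types, file_written)
```

Clearing existing handlers first is a deliberate choice for notebooks, where the same cell is often executed several times and would otherwise print each message more than once.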

In [8]:
# Load logger
from eztools.operations import Logger

# Import logging to list the registered loggers
import logging

In [9]:
# Add our report logger
logger = Logger(logger_path = '/mnt/logs/', logger_name = 'L&L', log_in_console = True, log_in_file = False).get_logger()

In [10]:
# List our loggers
loggers = [logging.getLogger(name) for name in logging.root.manager.loggerDict]
loggers

 <Logger eztools.operations (INFO)>,
 <Logger L&L (DEBUG)>]

# 4. Demo

In [11]:
import pandas as pd

In [12]:
DATA_PATH = '/mnt/df_wine.csv'

df = pd.read_csv(DATA_PATH)
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,wine_colour
0,7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6,white
1,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6,white
2,8.1,0.280,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6,white
3,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6,white
4,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6,white
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5,red
6493,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,red
6494,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,red
6495,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,red
