# Data Science Project Structure and Life Cycle

- Data Science Project Structure 
- Data Science Life Cycle and Processes

## Data Science Project Structure

### Data science project template

https://github.com/makcedward/ds_project_template

#### Experiment Quran Dataset

See my Quran Dataset experiment, which uses this data science project template: https://github.com/langsari/quran-dataset

<img src="images/ds-project-structure.png">

### Cookiecutter Data Science

A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.

https://drivendata.github.io/cookiecutter-data-science/

Install:

`pip install cookiecutter`

Start a new project:

`cookiecutter https://github.com/drivendata/cookiecutter-data-science`

<img src="images/cookiecutter-data-science-project-template.png"> 
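
For readers who want to see what such a template produces without running the generator, here is a minimal sketch that creates a comparable folder skeleton by hand. The directory names in `SKELETON` and the helper `create_skeleton` are assumptions for illustration, loosely modeled on the Cookiecutter Data Science layout; the real template generates more files than this.

```python
from pathlib import Path

# Assumed directory names, loosely modeled on the Cookiecutter Data Science
# layout. The actual template may differ between versions.
SKELETON = [
    "data/raw",
    "data/interim",
    "data/processed",
    "notebooks",
    "models",
    "reports/figures",
    "src",
]


def create_skeleton(root):
    """Create the project directories under `root` and return their paths."""
    root = Path(root)
    created = []
    for relative in SKELETON:
        path = root / relative
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    # An empty README marks the project root.
    (root / "README.md").touch()
    return created
```

Calling `create_skeleton("my_ds_project")` would create the folders under a new `my_ds_project` directory. Keeping raw and processed data in separate folders makes it easier to rerun the preparation step from untouched inputs.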

## Data Science Life Cycle and Processes

This life cycle follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) process.

More details:
1. https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
2. https://www.datascience-pm.com/crisp-dm-2/
3. https://datacubeth.ai/crisp-dm/

### Step 1: Business Understanding
<strong>Target</strong>: What does the business need?<br>
<strong>Phases</strong>: 
- Determine business objectives
- Assess the situation
- Determine data mining goals
- Produce a project plan

<strong>Responsible</strong>: Product Owner, Business Analyst, Data Expert

<strong>Example</strong>: the business needs a prediction of shipping prices.

### Step 2: Data Understanding
<strong>Target</strong>: What data do we have / need? Is it clean?<br>
<strong>Phases</strong>: 
- Collect initial data
- Describe data
- Explore data
- Verify data quality

<strong>Responsible</strong>: Data Engineer, Data Scientist, Data Analyst

In [1]:
import pandas as pd

data = pd.read_excel('dataset/Superstore.xlsx')

In [2]:
data.head(10)

| | Row ID | Order Priority | Discount | Unit Price | Shipping Cost | Customer ID | Customer Name | Ship Mode | Customer Segment | Product Category | ... | Region | State or Province | City | Postal Code | Order Date | Ship Date | Profit | Quantity ordered new | Sales | Order ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18606 | Not Specified | 0.01 | 2.88 | 0.5 | 2 | Janice Fletcher | Regular Air | Corporate | Office Supplies | ... | Central | Illinois | Addison | 60101 | 2012-05-28 | 2012-05-30 | 1.32 | 2 | 5.9 | 88525 |
| 1 | 20847 | High | 0.01 | 2.84 | 0.93 | 3 | Bonnie Potter | Express Air | Corporate | Office Supplies | ... | West | Washington | Anacortes | 98221 | 2010-07-07 | 2010-07-08 | 4.56 | 4 | 13.01 | 88522 |
| 2 | 23086 | Not Specified | 0.03 | 6.68 | 6.15 | 3 | Bonnie Potter | Express Air | Corporate | Office Supplies | ... | West | Washington | Anacortes | 98221 | 2011-07-27 | 2011-07-28 | -47.64 | 7 | 49.92 | 88523 |
| 3 | 23087 | Not Specified | 0.01 | 5.68 | 3.6 | 3 | Bonnie Potter | Regular Air | Corporate | Office Supplies | ... | West | Washington | Anacortes | 98221 | 2011-07-27 | 2011-07-28 | -30.51 | 7 | 41.64 | 88523 |
| 4 | 23088 | Not Specified | 0.0 | 205.99 | 2.5 | 3 | Bonnie Potter | Express Air | Corporate | Technology | ... | West | Washington | Anacortes | 98221 | 2011-07-27 | 2011-07-27 | 998.2023 | 8 | 1446.67 | 88523 |
| 5 | 23597 | Medium | 0.09 | 55.48 | 14.3 | 3 | Bonnie Potter | Express Air | Corporate | Office Supplies | ... | West | Washington | Anacortes | 98221 | 2011-11-09 | 2011-11-11 | 1388.0523 | 37 | 2011.67 | 88524 |
| 6 | 25549 | Low | 0.08 | 120.97 | 26.3 | 3 | Bonnie Potter | Delivery Truck | Corporate | Technology | ... | West | Washington | Anacortes | 98221 | 2013-07-01 | 2013-07-08 | 1001.4453 | 12 | 1451.37 | 88526 |
| 7 | 20228 | Not Specified | 0.02 | 500.98 | 26.0 | 5 | Ronnie Proctor | Delivery Truck | Home Office | Furniture | ... | West | California | San Gabriel | 91776 | 2010-12-13 | 2010-12-15 | 4390.3665 | 12 | 6362.85 | 90193 |
| 8 | 19483 | Low | 0.08 | 6.48 | 6.81 | 5 | Ronnie Proctor | Regular Air | Home Office | Office Supplies | ... | West | California | San Gabriel | 91776 | 2012-05-12 | 2012-05-21 | -141.26 | 18 | 113.25 | 90197 |
| 9 | 24782 | High | 0.01 | 90.24 | 0.99 | 6 | Dwight Hwang | Regular Air | Home Office | Office Supplies | ... | West | California | San Jose | 95123 | 2011-05-26 | 2011-05-26 | 1045.4673 | 16 | 1515.17 | 90194 |


In [3]:
data.describe()

| | Row ID | Discount | Unit Price | Shipping Cost | Customer ID | Product Base Margin | Postal Code | Profit | Quantity ordered new | Sales | Order ID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9426.0 | 9426.0 | 9426.0 | 9426.0 | 9426.0 | 9354.0 | 9426.0 | 9426.0 | 9426.0 | 9426.0 | 9426.0 |
| mean | 20241.015277 | 0.049628 | 88.303686 | 12.795142 | 1738.422236 | 0.512189 | 52446.327286 | 139.23641 | 13.79843 | 949.706272 | 82318.489073 |
| std | 6101.890965 | 0.031798 | 281.540982 | 17.181203 | 979.167197 | 0.135229 | 29374.597802 | 998.486483 | 15.107688 | 2598.019818 | 19149.448857 |
| min | 2.0 | 0.0 | 0.99 | 0.49 | 2.0 | 0.35 | 1001.0 | -16476.838 | 1.0 | 1.32 | 6.0 |
| 25% | 19330.25 | 0.02 | 6.48 | 3.1925 | 898.0 | 0.38 | 29406.0 | -74.017375 | 5.0 | 61.2825 | 86737.25 |
| 50% | 21686.5 | 0.05 | 20.99 | 6.05 | 1750.0 | 0.52 | 52302.0 | 2.5676 | 10.0 | 203.455 | 88344.5 |
| 75% | 24042.75 | 0.08 | 85.99 | 13.99 | 2578.75 | 0.59 | 78516.0 | 140.24385 | 17.0 | 776.4025 | 89987.75 |
| max | 26399.0 | 0.25 | 6783.02 | 164.73 | 3403.0 | 0.85 | 99362.0 | 16332.414 | 170.0 | 100119.16 | 91591.0 |
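
The `describe()` output reports 9354 values for Product Base Margin against 9426 for the other columns, which suggests some entries are missing. The "verify data quality" task could be sketched as below. The helper `quality_report` and the small `sample` frame are illustrative assumptions with made-up values, not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in for the Superstore data. Column names follow the dataset,
# but the values are made up for illustration.
sample = pd.DataFrame(
    {
        "Row ID": [1, 2, 2, 4],
        "Shipping Cost": [0.5, 0.93, 0.93, 6.15],
        "Product Base Margin": [0.38, 0.52, 0.52, None],
    }
)


def quality_report(df):
    """Summarize row count, missing values per column, and duplicate rows."""
    return {
        "rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


report = quality_report(sample)
print(report)
```

On the real data, the same function could be called as `quality_report(data)` after loading the spreadsheet.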


### Step 3: Data Preparation
<strong>Target</strong>: How do we organize the data for modeling?<br>
<strong>Phases</strong> (generally the most time-consuming step): 
- Select data
- Clean data
- Construct data
- Integrate data
- Format data

<strong>Responsible</strong>: Data Engineer, Data Analyst
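
As a rough illustration of the select, clean, format, and construct tasks, the sketch below prepares a few made-up rows that reuse column names from the Superstore data. The `prepare` function and the derived `Days to Ship` column are assumptions for this example. Integrating data from other sources is not shown.

```python
import pandas as pd

# Made-up rows that reuse column names from the Superstore data.
raw = pd.DataFrame(
    {
        "Order Date": ["2012-05-28", "2010-07-07", "2010-07-07", "2011-07-27"],
        "Ship Date": ["2012-05-30", "2010-07-08", "2010-07-08", "2011-07-28"],
        "Ship Mode": ["Regular Air", "Express Air", "Express Air", "Regular Air"],
        "Shipping Cost": [0.5, 0.93, 0.93, None],
    }
)


def prepare(df):
    """Apply select, clean, format, and construct steps to a raw frame."""
    # Select: keep only the columns needed for the shipping-cost question.
    out = df[["Order Date", "Ship Date", "Ship Mode", "Shipping Cost"]].copy()
    # Clean: drop exact duplicate rows and rows without a shipping cost.
    out = out.drop_duplicates().dropna(subset=["Shipping Cost"])
    # Format: parse the date strings into datetime values.
    out["Order Date"] = pd.to_datetime(out["Order Date"])
    out["Ship Date"] = pd.to_datetime(out["Ship Date"])
    # Construct: derive a new feature from two existing columns.
    out["Days to Ship"] = (out["Ship Date"] - out["Order Date"]).dt.days
    return out.reset_index(drop=True)


prepared = prepare(raw)
print(prepared)
```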

### Step 4: Modeling
<strong>Target</strong>: What modeling techniques should we apply?<br>
<strong>Phases</strong>: 
- Select modeling technique
- Generate test design
- Build model
- Assess model

<strong>Responsible</strong>: Data Scientist
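
The four tasks above can be sketched with a simple linear regression. This is a minimal example on synthetic data, assuming NumPy is installed. The relationship between quantity and shipping cost is invented for illustration and is not derived from the Superstore data, and linear least squares is only one possible technique.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): cost grows linearly with quantity, plus noise.
quantity = rng.integers(1, 50, size=200).astype(float)
cost = 2.0 + 0.5 * quantity + rng.normal(0.0, 1.0, size=200)

# Test design: hold out the last 25% of rows for assessment.
split = int(0.75 * len(quantity))
X_train = np.column_stack([np.ones(split), quantity[:split]])
X_test = np.column_stack([np.ones(len(quantity) - split), quantity[split:]])
y_train, y_test = cost[:split], cost[split:]

# Build model: ordinary least squares fit of intercept and slope.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Assess model: mean absolute error on the held-out rows.
predictions = X_test @ coef
mae = float(np.mean(np.abs(predictions - y_test)))
print("intercept, slope:", coef)
print("test MAE:", mae)
```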

### Step 5: Evaluation
<strong>Target</strong>: Which model best meets the business objectives?<br>

<strong>Phases</strong>: 
- Evaluate results
- Review process
- Determine next steps

<strong>Responsible</strong>: Business Analyst, Data Scientist
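
One way to make "best meets the business objectives" concrete is to score each candidate model with the same metric and compare the winner against a threshold agreed with the business. The sketch below does this with mean absolute error. The numbers, the candidate names, and `MAX_ACCEPTABLE_MAE` are all hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out shipping costs and predictions from two candidates.
actual = [5.0, 10.0, 15.0, 20.0]
candidates = {
    "baseline_mean": [12.5, 12.5, 12.5, 12.5],
    "linear_model": [6.0, 9.0, 16.0, 19.0],
}

# Hypothetical business requirement: average error of at most 2.0.
MAX_ACCEPTABLE_MAE = 2.0

scores = {name: mean_absolute_error(actual, preds) for name, preds in candidates.items()}
best = min(scores, key=scores.get)
meets_objective = scores[best] <= MAX_ACCEPTABLE_MAE

print(scores)
print(best, meets_objective)
```

Including a trivial baseline makes it clear whether a more complex model adds anything at all.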

### Step 6: Deployment
<strong>Target</strong>: How do stakeholders access the results?<br>
<strong>Phases</strong>: 
- Plan deployment
- Plan monitoring and maintenance
- Produce final report
- Review project

<strong>Responsible</strong>: Software Developer, Business Analyst
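
A very small deployment could consist of saving the fitted parameters to a file and loading them wherever predictions are needed. The sketch below assumes a linear model with one intercept and one slope. The file name, the parameter values, and the helpers `save_model` and `load_model` are hypothetical, and a real deployment would also need the monitoring and maintenance listed above.

```python
import json
import tempfile
from pathlib import Path


def save_model(path, intercept, slope):
    """Write the model parameters to a JSON file."""
    Path(path).write_text(json.dumps({"intercept": intercept, "slope": slope}))


def load_model(path):
    """Read the parameters back and return a prediction function."""
    params = json.loads(Path(path).read_text())
    return lambda quantity: params["intercept"] + params["slope"] * quantity


# A temporary directory keeps the example free of leftover files.
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "shipping_model.json"
    save_model(model_file, intercept=2.0, slope=0.5)
    predict = load_model(model_file)

print(predict(10))  # 2.0 + 0.5 * 10
```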