===========================================


Gebil Jibul


Description: This program demonstrates building a directory structure for a small retail business' data lake. 


=========================================== 

# Acquiring and Storing Data

Assume that you are the owner of a small but growing retail business, *Datums R Us*. Your store sells technology, tools, and clothing for the discerning data scientist. You currently have stores in the following five locations. 

- Bellevue, Nebraska
- Columbus, Ohio
- Denver, Colorado
- San Francisco, California
- Baltimore, Maryland

You have been tasked with creating a data lake for the company using a [directory structure based on Cookiecutter Data Science recommendations](https://drivendata.github.io/cookiecutter-data-science/#directory-structure). This basic directory structure works well both for small, self-contained data science projects and for organizing large-scale data warehouses.

```
├── data
│   ├── external       <- Data from third-party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling and reports.
│   └── raw            <- The original, immutable data dump.
```

You have identified the following items for initial inclusion in the data lake. 

**External Data Sets**

- Census (Updated Yearly)
- Weather Forecasts (Updated Daily)

**Raw Data Dumps**

- Sales (Updated Hourly)
- Inventory (Updated Daily)
- Expenses (Updated Daily)

**Processed Data Sets and Reports**

*Weekly*

- Modeling Data Set

*Monthly*

- Inventory Update Request

*Quarterly*

- Quarterly Financial Report

In the first part, I will describe the directory structure for the data lake. For the most part, this directory structure will not depend on the technical details of how I store the data. I could be storing the data in a local filesystem, a distributed filesystem such as HDFS, or object storage, such as Amazon S3. 

I will only be creating the directory structures and not populating actual content. Real-world data lakes store data in a variety of formats, including Apache Parquet, Google Protocol Buffers, Apache Avro, JSONL, and CSV.

I will use Python's built-in [calendar library](https://docs.python.org/3/library/calendar.html) and [datetime library](https://docs.python.org/3/library/datetime.html) to work with the dates and times required for this assignment. I will use the [PurePosixPath](https://docs.python.org/3/library/pathlib.html#pathlib.PurePosixPath) class from Python's built-in [pathlib library](https://docs.python.org/3/library/pathlib.html) to represent locations on the data lake.
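
As a minimal sketch of why a pure path suits this task: `PurePosixPath` only manipulates path strings and never touches a filesystem, so the same code can describe locations on a local disk, HDFS, or an object store. The `sales_hour_dir` name and the `oma` example below are illustrative assumptions, not part of the assignment code.

```python
from pathlib import PurePosixPath

# Pure paths are plain path arithmetic: nothing is created on disk
root = PurePosixPath('/data')

# joinpath and the / operator are equivalent ways to extend a path
raw_dir = root.joinpath('raw')
sales_hour_dir = root / 'raw' / 'sales' / 'oma' / '2020' / '01' / '01' / '00'

print(raw_dir)                        # /data/raw
print(sales_hour_dir.parts[-4:])      # ('2020', '01', '01', '00')
print(raw_dir in sales_hour_dir.parents)  # True
```

Because no I/O happens, building tens of thousands of these paths is cheap and safe to run anywhere.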

I will generate the output directories for an entire year's worth of data starting on January 1st of this year. All times will be in Coordinated Universal Time (UTC).

In [1]:
# Imports the required Python libraries and 
# sets global variables for the assignment
import calendar
import datetime
from pathlib import PurePosixPath

today = datetime.date.today()
current_year = today.year
days_in_year = 365

if calendar.isleap(current_year):
    days_in_year += 1

hours_in_year = days_in_year * 24

In [2]:
# Creates paths for the external, interim, processed, and raw directories
# Use these paths when creating new paths

root_data_dir = PurePosixPath('/data')
external_data_dir = root_data_dir.joinpath('external')
interim_data_dir = root_data_dir.joinpath('interim')
processed_data_dir = root_data_dir.joinpath('processed')
raw_data_dir = root_data_dir.joinpath('raw')

print('Root Data Directory: {}'.format(root_data_dir))
print('External Data Directory: {}'.format(external_data_dir))
print('Interim Data Directory: {}'.format(interim_data_dir))
print('Processed Data Directory: {}'.format(processed_data_dir))
print('Raw Data Directory: {}'.format(raw_data_dir))

Root Data Directory: /data
External Data Directory: /data/external
Interim Data Directory: /data/interim
Processed Data Directory: /data/processed
Raw Data Directory: /data/raw


I will be using three Census data sets as examples of external data updated yearly. These data sets are:

- [American Community Survey (ACS) Summary File](https://www.census.gov/programs-surveys/acs/data/summary-file.html)
- [American Community Survey (ACS) Public Use Microdata Sample (PUMS)](https://www.census.gov/programs-surveys/acs/microdata.html)
- [Tiger/Line Shapefiles](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html)

If you are curious, you can find the actual data sets at the following locations: 

- [ACS Summary File](https://www2.census.gov/programs-surveys/acs/summary_file/)
- [PUMS](https://www2.census.gov/programs-surveys/acs/data/pums/)
- [Tiger](https://www2.census.gov/geo/tiger/)

For this project, we use the following naming convention for external data sets:

```
/data/external/<source>/<data-set>/<year>/
```
where *source* is the organization providing the data, *data-set* is the specific data set, and *year* is the year. 

```
data
├── external
│   ├── census
│   │   ├── acs-summaryfile
│   │   │   ├── 2015
│   │   │   ├── 2016
│   │   │   ...
│   │   │   ...
│   │   │   └── 2019
│   │   ├── pums
│   │   │   ├── 2015
│   │   │   ├── 2016
│   │   │   ...
│   │   │   ...
│   │   │   └── 2020
│   │   └── tiger
│   │       ├── 2015
│   │       ├── 2016
│   │       ...
│   │       ...
│   │       └── 2020
│   └── nwc-wpc
├── interim
├── processed
└── raw
```

Create and add the paths for these data sets. Verify that you have added the paths correctly. 


In [2]:
acs_summary_file_dirs = set()
pums_dirs = set()
tiger_dirs = set()

#Create and add the paths for this data set

In [3]:
# Generates directories to reduce repeated code
# This function is used repeatedly throughout the assignment
def DirGenerator(root, dir_names, add_to=None):
    paths = [root.joinpath(directory) for directory in dir_names]
    # Adds each path, as a string, to a set for reference
    if add_to is not None:
        for path in paths:
            add_to.add(str(path))

In [5]:
# Creates initial directories
census_data_dir = external_data_dir.joinpath('census')
acs_summary_data_dir = census_data_dir.joinpath('acs-summaryfile')
pums_data_dir = census_data_dir.joinpath('pums')
tiger_data_dir = census_data_dir.joinpath('tiger')

# Generates the directories
years = ['2015', '2016', '2017', '2018', '2019']       
DirGenerator(acs_summary_data_dir, years, add_to=acs_summary_file_dirs)

years = ['2015', '2016', '2017', '2018', '2019', '2020']
DirGenerator(pums_data_dir, years, add_to=pums_dirs)
DirGenerator(tiger_data_dir, years, add_to=tiger_dirs)

# Should output sorted directories from 2015 to present 
sorted(list(acs_summary_file_dirs)), sorted(list(pums_dirs)), sorted(list(tiger_dirs))

(['/data/external/census/acs-summaryfile/2015',
  '/data/external/census/acs-summaryfile/2016',
  '/data/external/census/acs-summaryfile/2017',
  '/data/external/census/acs-summaryfile/2018',
  '/data/external/census/acs-summaryfile/2019'],
 ['/data/external/census/pums/2015',
  '/data/external/census/pums/2016',
  '/data/external/census/pums/2017',
  '/data/external/census/pums/2018',
  '/data/external/census/pums/2019',
  '/data/external/census/pums/2020'],
 ['/data/external/census/tiger/2015',
  '/data/external/census/tiger/2016',
  '/data/external/census/tiger/2017',
  '/data/external/census/tiger/2018',
  '/data/external/census/tiger/2019',
  '/data/external/census/tiger/2020'])

Finally, I will create directories for a daily data set based on the [National Weather Service's (NWS) Weather Prediction Center's (WPC) daily forecasts](https://www.wpc.ncep.noaa.gov/kml/kmlproducts.php). 

For this part, I use the following naming convention:

```
/data/external/nwc-wpc/forecasts/<year>/<month>/<day>/
```
where *year* is the year, *month* is the two-digit month, and *day* is the two-digit day. We use this convention when working with date-based data as the directories are naturally in date order. 
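
To illustrate the ordering claim, here is a small sketch that builds daily forecast paths with `datetime` and checks that plain string sorting matches date order. The helper name `daily_dirs` is hypothetical, and the year 2020 is chosen only to match the example tree.

```python
import datetime
from pathlib import PurePosixPath

def daily_dirs(root, year):
    """Yield one <root>/<year>/<month>/<day> path per day of the year."""
    day = datetime.date(year, 1, 1)
    while day.year == year:
        # %Y/%m/%d zero-pads month and day (e.g., 2020/01/09)
        yield root.joinpath(day.strftime('%Y/%m/%d'))
        day += datetime.timedelta(days=1)

forecast_root = PurePosixPath('/data/external/nwc-wpc/forecasts')
paths = [str(path) for path in daily_dirs(forecast_root, 2020)]

# Zero-padded names sort lexicographically in date order
assert paths == sorted(paths)

# Without padding, month 10 would sort before month 2
unpadded = ['2020/2/1', '2020/10/1']
print(sorted(unpadded))  # ['2020/10/1', '2020/2/1']
print(len(paths))        # 366, because 2020 is a leap year
```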

```
data
├── external
│   ├── census
│   └── nwc-wpc
│       └── forecasts
│           └── 2020
│               ├── 01
│               │   ├── 01
│               │   ├── 02
│               │   ├── 03
│               │   ...
│               │   ...
│               │   ├── 30
│               │   └── 31
│               ├── 02
│               │   ├── 01
│               │   ├── 02
│               │   ...
│               │   ...
│               │   ├── 28
│               │   └── 29
│               ├── 03
│               ...
│               ...
│               ├── 11
│               └── 12
│                   ├── 01
│                   ├── 02
│                   ...
│                   ...
│                   ├── 29
│                   ├── 30
│                   └── 31
├── interim
├── processed
└── raw
```

Create and add the paths for these data sets. Verify that you have added the paths correctly. 

In [4]:
# Builds a dict {month: [days]} of zero-padded strings
# Helpful for passing to function DirGenerator
import calendar

# Collapses the nested lists of calendar dates into a flat stream
def flatten(dates):
    for i in dates:
        if isinstance(i, list):
            yield from flatten(i)
        else:
            yield i


cal = calendar.Calendar()
year = 2020

# yeardatescalendar pads each week with days from neighboring months
# and years, so keep only the dates that fall in the target year
dates = [date for date in flatten(cal.yeardatescalendar(year))
         if date.year == year]

# Creates {month: [days]} pairs, padded with 0s (e.g., 1 to 01)
month_days = {}
for date in dates:
    month = str(date.month).zfill(2)
    day = str(date.day).zfill(2)
    if day not in month_days.setdefault(month, []):
        month_days[month].append(day)

# Zero-padded month names in calendar order
months = sorted(month_days)

In [5]:
forecast_dirs = set()

# TODO: Create and add the paths for this data set

# Creates initial directories
nwc_wpc_data_dir = external_data_dir.joinpath('nwc-wpc')
forecasts_data_dir = nwc_wpc_data_dir.joinpath('forecasts')
for2020_data_dir = forecasts_data_dir.joinpath('2020')

NameError: name 'external_data_dir' is not defined

In [6]:
# Creates directories for months
DirGenerator(for2020_data_dir, months)

# Creates directories for days in months dirs
for month, days in month_days.items():
    child = for2020_data_dir.joinpath(month)
    DirGenerator(child, days, add_to=forecast_dirs)

# Should have 365 directories (366 if leap year)
len(forecast_dirs)

NameError: name 'for2020_data_dir' is not defined

In the second part, I will create the structure for the raw source data. We will use the following directory naming convention. 

```
/data/raw/inventory/<location>/<year>/<month>/<day>/
/data/raw/expenses/<location>/<year>/<month>/<day>/
/data/raw/sales/<location>/<year>/<month>/<day>/<hour>/
```
For *location*, we will use the three-letter IATA code for the airport nearest to the location.  We will use the same year, month, and day convention from the previous example. For *hour*, we will use the two-digit hour value based on a 24-hour clock set to UTC. 
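
The convention above can be sketched as a small helper. The store-to-airport mapping uses the codes that appear in the example trees (`bwi`, `cmh`, `den`, `oma`, `sfo`); which store maps to which code is my assumption based on the nearest major airport, and the names `STORE_AIRPORTS` and `sales_hour_dir` are hypothetical.

```python
import datetime
from pathlib import PurePosixPath

# Assumed store-to-airport mapping, using the codes from the example tree
STORE_AIRPORTS = {
    'Bellevue, Nebraska': 'oma',
    'Columbus, Ohio': 'cmh',
    'Denver, Colorado': 'den',
    'San Francisco, California': 'sfo',
    'Baltimore, Maryland': 'bwi',
}

def sales_hour_dir(store, timestamp):
    """Return the hourly sales directory for a store and an aware datetime."""
    # Normalize to UTC so every store shares the same clock
    utc = timestamp.astimezone(datetime.timezone.utc)
    return PurePosixPath('/data/raw/sales', STORE_AIRPORTS[store],
                         utc.strftime('%Y'), utc.strftime('%m'),
                         utc.strftime('%d'), utc.strftime('%H'))

# 5:30 p.m. on December 31 at UTC-7 is already January 1 in UTC
local_zone = datetime.timezone(datetime.timedelta(hours=-7))
local_time = datetime.datetime(2019, 12, 31, 17, 30, tzinfo=local_zone)
print(sales_hour_dir('Denver, Colorado', local_time))
# /data/raw/sales/den/2020/01/01/00
```

Converting to UTC before formatting keeps a sale from landing in a different day's directory depending on which store recorded it.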

The following is an example of the directory structure for daily data dumps. 

```
data
├── external
├── interim
├── processed
└── raw
    ├── expenses
    ├── inventory
    │   ├── bwi
    │   ├── cmh
    │   ├── den
    │   ├── oma
    │   │   └── 2020
    │   │       ├── 01
    │   │       │   ├── 01
    │   │       │   ├── 02
    │   │       │   ...    
    │   │       │   └── 31
    │   │       ├── 02
    │   │       │   ├── 01
    │   │       │   ...
    │   │       │   └── 29
    │   │       ├── 03
    │   │       ... 
    │   │       ├── 11
    │   │       └── 12
    │   │           ├── 01
    │   │           ├── 02
    │   │           ...  
    │   │           └── 31
    │   └── sfo
    └── sales
```

Create and add the paths for these data sets. Verify that you have added the paths correctly.

In [7]:
inventory_dirs = set()
expenses_dirs = set()

#Create and add the paths for this data set

In [8]:
# Creates initial directories
raw_expenses_data_dir = raw_data_dir.joinpath('expenses')
raw_inventory_data_dir = raw_data_dir.joinpath('inventory')

in_location_dirs = [raw_inventory_data_dir.joinpath('bwi'),
                    raw_inventory_data_dir.joinpath('cmh'),
                    raw_inventory_data_dir.joinpath('den'),
                    raw_inventory_data_dir.joinpath('oma'),
                    raw_inventory_data_dir.joinpath('sfo')]

ex_location_dirs = [raw_expenses_data_dir.joinpath('bwi'),
                    raw_expenses_data_dir.joinpath('cmh'),
                    raw_expenses_data_dir.joinpath('den'),
                    raw_expenses_data_dir.joinpath('oma'),
                    raw_expenses_data_dir.joinpath('sfo')]

NameError: name 'raw_data_dir' is not defined

In [9]:
# Creates directories for months in the 2020 directory of each location
for location_dir in in_location_dirs + ex_location_dirs:
    DirGenerator(location_dir.joinpath('2020'), months)

# Creates directories for days in inventory
for location_dir in in_location_dirs:
    year_dir = location_dir.joinpath('2020')
    for month, days in month_days.items():
        DirGenerator(year_dir.joinpath(month), days, add_to=inventory_dirs)

# Creates directories for days in expenses
for location_dir in ex_location_dirs:
    year_dir = location_dir.joinpath('2020')
    for month, days in month_days.items():
        DirGenerator(year_dir.joinpath(month), days, add_to=expenses_dirs)

NameError: name 'in_location_dirs' is not defined

In [10]:
# Should have 1825 directories (1830 if leap year)
len(inventory_dirs), len(expenses_dirs) 

(0, 0)

Finally, I create the paths for the hourly sales data. The following is an example of the directory structure for the sales data.

```
├── external
├── interim
├── processed
└── raw
    ├── expenses
    ├── inventory
    └── sales
        ├── bwi
        ├── cmh
        ├── den
        ├── oma
        │   └── 2020
        │       ├── 01
        │       │   └── 01
        │       │       ├── 00
        │       │       ├── 01   
        │       │       ├── 02
        │       │       ...     
        │       │       ├── 22
        │       │       └── 23
        │       ├── 02
        │       ...
        │       └── 12
        └── sfo
```

In [11]:
sales_dirs = set()

#Create and add the paths for this data set

In [12]:
# Creates initial directories
raw_sales_data_dir = raw_data_dir.joinpath('sales')

sales_location_dirs = [raw_sales_data_dir.joinpath('bwi'),
                       raw_sales_data_dir.joinpath('cmh'),
                       raw_sales_data_dir.joinpath('den'),
                       raw_sales_data_dir.joinpath('oma'),
                       raw_sales_data_dir.joinpath('sfo')]

NameError: name 'raw_data_dir' is not defined

In [13]:
# Creates directories for months in the 2020 directory of each location
for location_dir in sales_location_dirs:
    DirGenerator(location_dir.joinpath('2020'), months)

# Creates directories for days in sales
for location_dir in sales_location_dirs:
    year_dir = location_dir.joinpath('2020')
    for month, days in month_days.items():
        DirGenerator(year_dir.joinpath(month), days, add_to=sales_dirs)

# Lists hours in a day
hours = [str(hour).zfill(2) for hour in range(24)]

# Creates directories for hours in sales
hourly_sales_dirs = set()
for sales_day_dir in sales_dirs:
    parent = PurePosixPath(sales_day_dir)
    DirGenerator(parent, hours, add_to=hourly_sales_dirs) 

NameError: name 'sales_location_dirs' is not defined

In [14]:
sales_dirs = hourly_sales_dirs

# Should have 43,800 directories (43,920 if leap year)
len(sales_dirs) 

NameError: name 'hourly_sales_dirs' is not defined

I have two choices for structuring the weekly data set. I can use the following naming convention where the date is based on the first day of the week. 

```
/data/processed/modeling/<year>/<month>/<day>/
```

Otherwise, I could use a naming convention where *week* is the number of weeks it has been since the beginning of the year. 
 
```
/data/processed/modeling/<year>/<week>/
```

I will use the first option for the naming convention. Python's *calendar* library can iterate over the dates in a month together with their weekdays, which makes it possible to pick out the first day of each week.
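
As a rough comparison of the two options, the sketch below computes both directory names for a single date. It assumes weeks start on Monday and that *week* counts whole seven-day blocks elapsed since January 1. The helper names are hypothetical, and other week-numbering schemes (for example, ISO 8601 weeks) would give different values.

```python
import datetime
from pathlib import PurePosixPath

MODELING_ROOT = PurePosixPath('/data/processed/modeling')

def week_start_dir(day):
    """Option 1: name the directory after the Monday that starts the week."""
    # date.weekday() is 0 for Monday, so subtracting it lands on Monday
    monday = day - datetime.timedelta(days=day.weekday())
    return MODELING_ROOT.joinpath(monday.strftime('%Y/%m/%d'))

def week_number_dir(day):
    """Option 2: name the directory after the weeks elapsed since January 1."""
    # tm_yday is 1 for January 1, so days 1 to 7 fall in week 00
    week = (day.timetuple().tm_yday - 1) // 7
    return MODELING_ROOT.joinpath(str(day.year), str(week).zfill(2))

day = datetime.date(2020, 3, 18)   # a Wednesday
print(week_start_dir(day))    # /data/processed/modeling/2020/03/16
print(week_number_dir(day))   # /data/processed/modeling/2020/11
```

Under option 1, the first few days of January can belong to a week that started in the previous year, so a per-year count of week-start directories only includes Mondays that fall inside that year.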

In [15]:
modeling_data_dirs = set()

#Create and add the paths for this data set

In [16]:
year = 2020
week_starter = 0

# Creates dict {month: [week_start_date]} for year
month_weeks = {str(month+1): [] for month in range(12)}
for i in range(12):
    month = i+1
    for date in cal.itermonthdays4(year, month):
        if date[0] == year and date[3] == week_starter:
            if str(date[2]) not in month_weeks[str(date[1])]:
                month_weeks[str(date[1])].append(str(date[2]))
            
# Creates padding with 0s (e.g., 1 to 01)
padded_dict = {}
for month, weeks in month_weeks.items():
    weeks = [str(week.zfill(2)) for week in weeks]
    padded_dict[str(month.zfill(2))] = list(weeks)
    
month_weeks = padded_dict

In [17]:
# Creates initial directories
modeling_data_dir = processed_data_dir.joinpath('modeling')
modeling2020_data_dir = modeling_data_dir.joinpath('2020')

# Creates directories for months
DirGenerator(modeling2020_data_dir, months)

# Creates directories for the first day of each week in months dirs
for month, weeks in month_weeks.items():
    child = modeling2020_data_dir.joinpath(month)
    DirGenerator(child, weeks, add_to=modeling_data_dirs)

NameError: name 'processed_data_dir' is not defined

In [18]:
# Should have 52 directories
len(modeling_data_dirs)

0

Next, I create the monthly inventory requests using the following convention. 

```
/data/processed/inventory/requests/<year>/<month>/
```

In [19]:
inventory_request_dirs = set()

#Create and add the paths for this data set

In [20]:
# Creates initial directories
inventory_data_dir = processed_data_dir.joinpath('inventory')
requests_data_dir = inventory_data_dir.joinpath('requests')
requests2020_data_dir = requests_data_dir.joinpath('2020')

# Generates the month directories
DirGenerator(requests2020_data_dir, month_days.keys(), add_to=inventory_request_dirs)

NameError: name 'processed_data_dir' is not defined

In [21]:
# Should output 12 directories
sorted(list(inventory_request_dirs))

[]

Finally, I create the quarterly financial reports using the following convention. 

```
/data/processed/financials/quarterly/<year>/<quarter>/
```
While it does not matter for this assignment, the following are the typical dates associated with financial quarters. 
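
As a sketch, assuming the fiscal year follows the calendar year (many companies use a different fiscal calendar), the quarter boundaries can be computed with the `calendar` library. The helper name `quarter_dates` is hypothetical.

```python
import calendar
import datetime

def quarter_dates(year, quarter):
    """Return the (first day, last day) of a calendar-year quarter, 1 to 4."""
    first_month = 3 * (quarter - 1) + 1
    last_month = first_month + 2
    # monthrange returns (weekday of the 1st, number of days in the month)
    last_day = calendar.monthrange(year, last_month)[1]
    return (datetime.date(year, first_month, 1),
            datetime.date(year, last_month, last_day))

for quarter in range(1, 5):
    start, end = quarter_dates(2020, quarter)
    print(f'{quarter:02d}: {start} to {end}')
# 01: 2020-01-01 to 2020-03-31
# 02: 2020-04-01 to 2020-06-30
# 03: 2020-07-01 to 2020-09-30
# 04: 2020-10-01 to 2020-12-31
```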

In [22]:
financials_dirs = set()

#Create and add the paths for this data set

In [23]:
# Creates initial directory
financials_data_dir = processed_data_dir.joinpath('financials')
quarterly_data_dir = financials_data_dir.joinpath('quarterly')
quarterly2020_data_dir = quarterly_data_dir.joinpath('2020')

quarters = ['01', '02', '03', '04']

# Generates the quarterly directories
DirGenerator(quarterly2020_data_dir, quarters, add_to=financials_dirs)

NameError: name 'processed_data_dir' is not defined

In [24]:
# Should output four quarterly directories
sorted(list(financials_dirs)) 

[]