# Project Title
### Data Engineering Capstone Project

#### Project Summary
Pull AT&T's cell site throughput and user-count data from the Vertica database. Calculate weighted 5th-percentile throughput values aggregated by Nation, Market, Submarket, and Vendor. Push the new KPI into an Oracle database for visualization on a PowerBI front end. This new KPI will be used to measure customer experience.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

The raw data consist of daily nationwide cell site data (>10 million rows) with user-count and throughput KPIs. I wrote a new Airflow Vertica operator to pull the data and save it to a flat file. Then, an Airflow Python operator computes weighted percentile throughput values for each vendor, market, and submarket. These new KPIs are then pushed into an Oracle database using an Airflow Oracle operator. Other teams will visualize this data on the front end using PowerBI.
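As a rough illustration of the weighted percentile step, the sketch below uses one common definition: sort the cells by throughput and return the first value at which the cumulative user weight reaches the target share. The function name `weighted_percentile`, the sample numbers, and the choice of user count as the weight are assumptions made for illustration; the actual operator may use a different definition or interpolate between values.

```python
from typing import Sequence


def weighted_percentile(values: Sequence[float],
                        weights: Sequence[float],
                        q: float) -> float:
    """Return the smallest value whose cumulative weight share reaches q."""
    if not 0 < q <= 1:
        raise ValueError("q must be in the interval (0, 1]")
    if len(values) != len(weights) or len(values) == 0:
        raise ValueError("values and weights must be non-empty and equal in length")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    cumulative = 0.0
    # Walk the cells from lowest to highest throughput, accumulating user weight.
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= q * total:
            return value
    return max(values)  # fallback in case of floating-point shortfall


# Hypothetical cells: throughput in arbitrary units, weighted by average users.
throughput = [5.0, 10.0, 20.0, 40.0]
users = [1.0, 9.0, 40.0, 50.0]
p05 = weighted_percentile(throughput, users, 0.05)
print(p05)
```

In this made-up sample the slowest cell carries only 1% of the users, so the weighted 5th percentile lands on the next cell (10.0) rather than on 5.0. Weighting by users keeps nearly empty cells from dominating the low end of the distribution.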

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

The raw data consist of daily nationwide cell site data (>10 million rows) with user-count and throughput KPIs. Two queries are run for each vendor's data.

In [15]:
# Read in the data here

In [2]:
# Sample set of input from both vendors concatenated 
# Real data is > 10 million rows
df = pd.read_csv('ran_raw_2020-09-30.csv')
df.head()

Unnamed: 0,DATETIME,MARKET,SUBMARKET,VENDOR,EUTRANCELL,AVG_RRC_CONN_NUM,AVG_RRC_CONN_DEN,AVG_RRC_CONN,DL_DRB_TP_NUM,DL_DRB_TP_DEN,...,QCI8_DL_TPUT_DEN,QCI9_DL_TPUT_DEN,DL_DRB_QCI6_TP,DL_DRB_QCI7_TP,DL_DRB_QCI8_TP,DL_DRB_QCI9_TP,QCI6_ERAB_SETUP_SUCC,QCI7_ERAB_SETUP_SUCC,QCI8_ERAB_SETUP_SUCC,QCI9_ERAB_SETUP_SUCC
0,9/30/2020,Arizona/New Mexico,Arizona,N,AZL00001_2A_1,4412642.0,86400.0,51.072245,636020600.0,47316.72,...,18286.504,26331.513,19967.70581,17944.4644,13583.05348,12876.62122,328,8334,125337,74173
1,9/30/2020,Arizona/New Mexico,Arizona,N,AZL00001_2A_2,2076938.0,86400.0,24.038634,469536700.0,29041.877,...,9571.206,16912.516,24660.84797,15649.42302,15314.72823,16692.21627,301,7896,45728,58074
2,9/30/2020,Arizona/New Mexico,Arizona,N,AZL00001_2B_1,4484860.0,86400.0,51.908102,753076900.0,75357.443,...,27936.866,43743.291,6709.192382,15281.22019,10346.02089,9344.148545,532,11146,163010,105202
3,9/30/2020,Arizona/New Mexico,Arizona,N,AZL00001_2B_2,3157864.0,86400.0,36.549352,709031800.0,63232.53,...,19337.307,37233.0,6147.454488,12462.38724,11735.65095,10741.09161,252,10174,59485,76442
4,9/30/2020,Arizona/New Mexico,Arizona,N,AZL00001_2C_1,6332621.0,86400.0,73.294225,1165714000.0,88547.795,...,23224.261,62315.488,10365.02955,18113.14694,13794.43928,12708.57053,365,8399,180504,154635


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

# Performing cleaning tasks here
The SQL statements provided in Airflow/plugins/sql perform all of the necessary data cleaning.




### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

The model consists of flat raw files and one final table in the Oracle database.

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

start_operator >> vertica_to_file >> file_transform >> file_to_oracle

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [10]:
# Run Airflow DAG
# The initial data pull is replaced with a flat input file because the AT&T Vertica database is private
# The result is written out as a flat file because the AT&T Oracle database is private
# However, the custom operators and tasks are included in the Airflow directory

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [5]:
# Check there is data
df.shape[0] > 0

True

In [12]:
# Check that data from both vendors is available
len(df['VENDOR'].unique()) == 2

True
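Inside an Airflow task, these two checks would need to fail loudly rather than just display `True` or `False`. A minimal sketch, assuming a hypothetical helper named `run_quality_checks` and placeholder vendor codes (neither comes from the project code):

```python
import pandas as pd


def run_quality_checks(df: pd.DataFrame, expected_vendors: int = 2) -> None:
    """Raise ValueError if the raw extract fails either notebook check."""
    # Check 1: the extract contains at least one row.
    if df.shape[0] == 0:
        raise ValueError("Quality check failed: the extract contains no rows")

    # Check 2: the expected number of distinct vendors is present.
    n_vendors = df["VENDOR"].nunique(dropna=False)
    if n_vendors != expected_vendors:
        raise ValueError(
            f"Quality check failed: expected {expected_vendors} vendors, "
            f"found {n_vendors}"
        )


# Tiny hypothetical frame; the vendor codes and cell names are placeholders.
sample = pd.DataFrame({"VENDOR": ["N", "E"], "EUTRANCELL": ["cell_a", "cell_b"]})
run_quality_checks(sample)  # passes silently when both checks hold
```

Raising an exception lets the scheduler mark the task as failed and trigger its retry and alerting behavior.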

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

In [5]:
df2 = pd.read_csv('sample_output.csv')
df2.head()

Unnamed: 0,DATETIME,MARKET,SUBMARKET,VENDOR,QCI,KPI,Value
0,9/27/2020,Arizona/New Mexico,Arizona,N,QCI 6+7+8+9,TRAFFIC,3023060000000.0
1,9/27/2020,Arizona/New Mexico,Arizona,N,QCI 6,TRAFFIC,43442050000.0
2,9/27/2020,Arizona/New Mexico,Arizona,N,QCI 7,TRAFFIC,228150000000.0
3,9/27/2020,Arizona/New Mexico,Arizona,N,QCI 8,TRAFFIC,1080600000000.0
4,9/27/2020,Arizona/New Mexico,Arizona,N,QCI 9,TRAFFIC,1670900000000.0


In [7]:
# Output
data_dictionary = {
    'DATETIME': 'Date of the cell site data',
    'MARKET': 'AT&T Market',
    'SUBMARKET': 'AT&T Submarket',
    'VENDOR': 'Wireless RAN vendor',
    'QCI': 'Quality of service class identifier',
    'KPI': 'The KPI name for the row',
    'Value': 'The value of the specified KPI'
}
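The output table is stored in long format, with one row per combination of QCI and KPI. As a hedged illustration of how a consumer might reshape it, the sketch below rebuilds the five sample rows shown in the `sample_output.csv` preview and pivots them so that each QCI becomes its own column. The variable names `long_df` and `wide_df` are hypothetical, and nothing here is taken from the project's own code.

```python
import pandas as pd

# The five sample rows from the sample_output.csv preview.
columns = ["DATETIME", "MARKET", "SUBMARKET", "VENDOR", "QCI", "KPI", "Value"]
rows = [
    ("9/27/2020", "Arizona/New Mexico", "Arizona", "N", "QCI 6+7+8+9", "TRAFFIC", 3023060000000.0),
    ("9/27/2020", "Arizona/New Mexico", "Arizona", "N", "QCI 6", "TRAFFIC", 43442050000.0),
    ("9/27/2020", "Arizona/New Mexico", "Arizona", "N", "QCI 7", "TRAFFIC", 228150000000.0),
    ("9/27/2020", "Arizona/New Mexico", "Arizona", "N", "QCI 8", "TRAFFIC", 1080600000000.0),
    ("9/27/2020", "Arizona/New Mexico", "Arizona", "N", "QCI 9", "TRAFFIC", 1670900000000.0),
]
long_df = pd.DataFrame(rows, columns=columns)

# One row per date/market/submarket/vendor/KPI, one column per QCI.
wide_df = long_df.pivot_table(
    index=["DATETIME", "MARKET", "SUBMARKET", "VENDOR", "KPI"],
    columns="QCI",
    values="Value",
    aggfunc="first",
).reset_index()
print(wide_df)
```

Keeping the stored table long means a new KPI or QCI class adds rows rather than columns, so the Oracle schema does not have to change.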


# Step 5: Complete Project Write Up
## What's the goal? What queries will you want to run? How would Spark or Airflow be incorporated? Why did you choose the model you chose?
The goal of my project is to migrate my crontab jobs at AT&T into Airflow. The outputs are flat files of the pulled data and a single table of the transformed data pushed into an Oracle database. The flat files are needed to archive the data for troubleshooting or ad hoc deep dives. The Oracle table allows other teams to access the data using PowerBI dashboards.


## Clearly state the rationale for the choice of tools and technologies for the project.
I chose to migrate my crontab jobs into Airflow for my work at AT&T. Airflow handles the automation, logging, and alarms. It also makes it easy to modify my data pipeline or add additional data quality checks. Finally, the UI is a big plus!
 
The Vertica and Oracle databases were requirements from other teams, not choices of mine.

## Document the steps of the process.
The code pulls AT&T's cell site throughput and user-count data from the Vertica database. Two queries are run for each of the vendors. Then, it calculates weighted percentile throughput values aggregated by Nation, Market, Submarket, and Vendor. Finally, it pushes the new KPI into the Oracle database for visualization on the PowerBI front end. This new KPI will be used to measure customer experience.

## Propose how often the data should be updated and why.
The pipeline is run daily during the maintenance window (1AM-5AM) because the output is aggregated at a daily level.

## Post your write-up and final data model in a GitHub repo.
The project can't be published because it contains sensitive proprietary AT&T data.

## Include a description of how you would approach the problem differently under the following scenarios:
### If the data was increased by 100x.
Use Postgres instead of the Oracle database for faster write speeds. Store flat files in S3 instead of locally. Transform the data on a Hadoop cluster to maximize parallel compute.

### If the pipelines were run on a daily basis by 7am.
This pipeline currently runs daily. It runs during the maintenance window from 12am-4am with multiple retries. If all retries fail, the newest date will not be visible on the front end. I would then have to troubleshoot, debug, and deploy quality checks to ensure the pipeline does not fail again.
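Retry behavior in Airflow is usually set through the `default_args` dictionary passed to the DAG, using the standard operator arguments `retries`, `retry_delay`, and `email_on_failure`. The sketch below shows the shape of such a configuration; the specific values are hypothetical and are not the project's actual settings, and `email_on_failure` also needs an `email` recipient list to be configured.

```python
from datetime import timedelta

# Hypothetical values; tune them so all attempts fit before the 7am deadline.
default_args = {
    "retries": 3,                          # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=30),  # wait between attempts
    "email_on_failure": True,              # alert once the task finally fails
}

# Worst-case time one task spends waiting between attempts.
max_retry_wait = default_args["retries"] * default_args["retry_delay"]
print(max_retry_wait)
```

Comparing `max_retry_wait` plus the task's typical run time against the deadline is a quick way to confirm that the retry budget still leaves the data ready by 7am.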

### If the database needed to be accessed by 100+ people.
I would migrate the Oracle database to AWS RDS to handle the scaling. 

