# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [2]:
# Do all imports and installs here
import pandas as pd
from pyspark.sql import functions as F

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What is your end solution look like? What tools did you use? etc>

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included?   
[Kaggle Berkeley Earth Climate Change: Earth Surface Temperature Data
Exploring global temperatures since 1750](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)   
Source: [GlobalLandTemperaturesByCity.csv](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data?select=GlobalLandTemperaturesByCity.csv)
- dt (object)  
- AverageTemperature (float64)  
- AverageTemperatureUncertainty (float64)  
- City (object)  
- Country (object)  
- Latitude (object)  
- Longitude (object)

In [12]:
# Read in the data here
file_name = '../../data2/GlobalLandTemperaturesByCity.csv'
df = pd.read_csv(file_name)

In [13]:
df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8599212 entries, 0 to 8599211
Data columns (total 7 columns):
dt                               object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
City                             object
Country                          object
Latitude                         object
Longitude                        object
dtypes: float64(2), object(5)
memory usage: 459.2+ MB


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.
- The data contain a lot NaN value   

#### Cleaning Steps
Document steps necessary to clean the data
- perform dropna() to clean the data

In [6]:
df.isna().sum()

dt                                    0
AverageTemperature               364130
AverageTemperatureUncertainty    364130
City                                  0
Country                               0
Latitude                              0
Longitude                             0
dtype: int64

In [14]:
# Performing cleaning tasks here
df.dropna(inplace=True)
df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
5,1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
7,1744-06-01,14.051,1.347,Århus,Denmark,57.05N,10.33E
8,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E


In [8]:
df.describe()

Unnamed: 0,AverageTemperature,AverageTemperatureUncertainty
count,8235082.0,8235082.0
mean,16.72743,1.028575
std,10.35344,1.129733
min,-42.704,0.034
25%,10.299,0.337
50%,18.831,0.591
75%,25.21,1.349
max,39.651,15.396


In [None]:
df.to_csv("cleanData/clean_GlobalLandTemperaturesByCity.csv", index=False)

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model
Source: [GlobalLandTemperaturesByCity.csv](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data?select=GlobalLandTemperaturesByCity.csv)
- dt (object)  
- AverageTemperature (float64)  
- AverageTemperatureUncertainty (float64)  
- City (object)  
- Country (object)  
- Latitude (object)  
- Longitude (object) 

GlobalLandTemperaturesByCountry.csv
- dt
- AverageTemperature
- AverageTemperatureUncertainty
- Country   

GlobalLandTemperaturesByMajorCity.csv  
- dt (object)
- AverageTemperature (float64)
- AverageTemperatureUncertainty (float64)
- City (object)
- Country (object)
- Latitude (object)  
- Longitude (object)

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data mode

Create dimension table

In [15]:
df['year']=df['dt'].apply(lambda x: x[:4])
df['month']=df['dt'].apply(lambda x: x[5:7])
df.drop('dt',axis=1,inplace=True)
df=df[['year','month','AverageTemperature','City','Country','Latitude','Longitude']]
df['Latitude']=df['Latitude'].str.strip('N')
df['Longitude']=df['Longitude'].str.strip('E')
df.head()

Unnamed: 0,year,month,AverageTemperature,City,Country,Latitude,Longitude
0,1743,11,6.068,Århus,Denmark,57.05,10.33
5,1744,4,5.788,Århus,Denmark,57.05,10.33
6,1744,5,10.644,Århus,Denmark,57.05,10.33
7,1744,6,14.051,Århus,Denmark,57.05,10.33
8,1744,7,16.082,Århus,Denmark,57.05,10.33


Create Fact Table

In [16]:
df_globalTempeeratureByCountry = df
df.drop(['Latitude', 'Longitude','City'],axis=1,inplace=True)
df_globalTempeeratureByCountry=df_globalTempeeratureByCountry[['year','month','AverageTemperature','Country']]
df = df.groupby(["month"]).mean()
df = df.sort_values(["AverageTemperature"])
df_globalTempeeratureByCountry.head()                                                        

Unnamed: 0,year,month,AverageTemperature,Country
0,1743,11,6.068,Denmark
5,1744,4,5.788,Denmark
6,1744,5,10.644,Denmark
7,1744,6,14.051,Denmark
8,1744,7,16.082,Denmark


Save Clean Data into CSV

In [17]:
df_globalTempeeratureByCountry.to_csv("cleanData/GlobalLandTemperaturesByCountry.csv", index=False)

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [30]:
# Write code here
import pyspark

spark = pyspark.sql.SparkSession.builder.appName("Capstone").getOrCreate()

Import clean_data using pyspark

In [31]:
df = spark.read.option("header",True).csv("cleanData/clean_GlobalLandTemperaturesByCity.csv", inferSchema = True)
df_globalTempeeratureByCountry = spark.read.option("header",True).csv("cleanData/GlobalLandTemperaturesByCountry.csv",inferSchema = True)

In [32]:
df_globalTempeeratureByCountry

DataFrame[year: int, month: int, AverageTemperature: double, Country: string]

In [33]:
df.limit(5).toPandas()

Unnamed: 0,year,month,AverageTemperature,Country
0,1743,11,6.068,Denmark
1,1744,4,5.788,Denmark
2,1744,5,10.644,Denmark
3,1744,6,14.051,Denmark
4,1744,7,16.082,Denmark


In [21]:
def read_data(filename):
    file = '{}'.format(filename)
    return spark.read.format('com.github.saurfang.sas.spark').load(file)

def clean_df(df):
    return df.dropna(inplace=True)

def create_dimension_table(df):
    df['year']=df['dt'].apply(lambda x: x[:4])
    df['month']=df['dt'].apply(lambda x: x[5:7])
    df.drop('dt',axis=1,inplace=True)
    df=df[['year','month','AverageTemperature','City','Country','Latitude','Longitude']]
    df['Latitude']=df['Latitude'].str.strip('N')
    df['Longitude']=df['Longitude'].str.strip('E')
    return df

def create_fact_table(df):
    df_globalTempeeratureByCountry = df
    df.drop(['Latitude', 'Longitude','City'],axis=1,inplace=True)
    df_globalTempeeratureByCountry=df_globalTempeeratureByCountry[['year','month','AverageTemperature','Country']]
    df = df.groupby(["Month"]).mean()
    df = df.sort_values(["AverageTemperature"])
    return df

In [37]:
def etl(files):
    df = read_data(file)
    df = clean_(df)
    df = create_dimension_table(df)
    df.to_csv('dimension_table.csv', index=False)
    df = create_fact_table(df)
    df.to_csv('fact_table.csv', index=False)

In [38]:
etl(file_name)

Data clean_GlobalLandTemperaturesByCity.csv: 1.
Data clean_GlobalLandTemperaturesByMajorCity.csv: 2.


#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks  
- Checks if the dataframe is not empty.
- Checks if the given keys are a unique key in the dataframe.
- Checks if the dataframe has the expected dtypes.
- Checks if the dataframe has the expected ammount of columns.

In [None]:
# Perform quality checks here
columns_num = 
keys = 
expect_dtype = 

if df.shape()[0] > 0:
    print("The dataframe is not empty.")
else:
    raise ValueError("DataFrame is empty!")
        
if df.select(F.countDistinct(*keys)).first()[0] == df.shape()[0]:
    print("The given keys are a unique key in the dataframe.")
else:
    raise ValueError("The given keys are not unique key!")
 
 if df.dtypes == expect_dtype:
    print("The dataframe has the expected dtypes.")
else:
    raise ValueError("The dataframe do not match with expected dtypes!")    
    
if df.shape()[1] == columns_num:
    print("The dataframe has the expected ammount of columns.")
else:
    raise ValueError("The dataframe does not has the expected ammount of columns!")

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
     - Let us imagine we are deploying a AWS EMR Start Cluster instance .xlarge . if the data increased by 100x, we can save data to S3 and switch from a .xlarg to .10xlarge. This one has 640 GiB ashould be be able to handle it. Another way is to split the data into smaller chunks that can run in parallel thus make the process faster and more efficient.  
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
     - The Airflow helps us monitored and scheduler the runtime to avoid running past the 7am. The output data must be stored in an accessible database or accessible in S3 allow the data to search every day  if there is new data coming. Assume the pipeline takes approximate 1h to run. Then a schedule should be set for running the pipeline every night at 5am leave a buffer of 1h. 
 * The database needed to be accessed by 100+ people.
     -  Amazon Redshift Clusters are scaleable with elastic resize such that when ever the database or data warehouse runs the risk of not response the requests anymore, its performance could be increased to handle requests of the authorized 100+ people. Another way is with Amazon RDS, we can deploy scalable PostgreSQL DBs