# World Temperature vs CO2 and SO2 Emissions
### Data Engineering Capstone Project

#### Project Summary
This project builds a data-lake-style ETL pipeline that processes, cleans, and stores data on world temperature and emissions. The data can then be used to analyse whether country-level emissions have an impact on world temperature.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
from pyspark.sql import SparkSession
import os
import configparser
from datetime import datetime

In [2]:
# Read config file
config = configparser.ConfigParser()
config.read_file(open('dl.cfg'))

os.environ["AWS_ACCESS_KEY_ID"]= config['AWS']['AWS_ACCESS_KEY_ID']
os.environ["AWS_SECRET_ACCESS_KEY"]= config['AWS']['AWS_SECRET_ACCESS_KEY']

# NOTE: Use these if using AWS S3 as a storage
INPUT_DATA = config['AWS']['INPUT_DATA']
OUTPUT_DATA = config['AWS']['OUTPUT_DATA']

# NOTE: Use these if using local storage
INPUT_DATA_WEATHER_LOCAL = config['LOCAL']['INPUT_DATA_WEATHER_LOCAL']
INPUT_DATA_CC_LOCAL      = config['LOCAL']['INPUT_DATA_CC_LOCAL']
INPUT_DATA_CO2_LOCAL     = config['LOCAL']['INPUT_DATA_CO2_LOCAL']
INPUT_DATA_SO2_LOCAL     = config['LOCAL']['INPUT_DATA_SO2_LOCAL']

OUTPUT_DATA_LOCAL        = config['LOCAL']['OUTPUT_DATA_LOCAL']

DATA_LOCAL               = config['COMMON']['DATA_LOCAL']
DATA_STORAGE             = config['COMMON']['DATA_STORAGE']

#print(AWS_ACCESS_KEY_ID)
#print(AWS_SECRET_ACCESS_KEY)

print(INPUT_DATA_WEATHER_LOCAL)
print(INPUT_DATA_CC_LOCAL)
print(INPUT_DATA_CO2_LOCAL)
print(INPUT_DATA_SO2_LOCAL)
print(OUTPUT_DATA_LOCAL)
print(DATA_LOCAL)
print(DATA_STORAGE)

data/GlobalLandTemperaturesByCity.csv
data/iso-3166-country-codes.json
data/CO2_Emissions_Capita-historical.xlsx
data/SO2_Emissions_Capita-historical.xlsx
data/output_data/
yes
parquet
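
For orientation, here is a minimal sketch of what `dl.cfg` might look like, parsed from a string instead of a file. The section and key names are the ones the cell above reads, and the `LOCAL` and `COMMON` values are the ones printed above. Every `AWS` value is a placeholder, and the names `SAMPLE_CFG` and `sample_config` are hypothetical.

```python
import configparser

# Hypothetical dl.cfg contents. Section and key names follow the notebook;
# LOCAL and COMMON values match the printed output, AWS values are placeholders.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = <your-access-key-id>
AWS_SECRET_ACCESS_KEY = <your-secret-access-key>
INPUT_DATA = <your-input-location>
OUTPUT_DATA = <your-output-location>

[LOCAL]
INPUT_DATA_WEATHER_LOCAL = data/GlobalLandTemperaturesByCity.csv
INPUT_DATA_CC_LOCAL = data/iso-3166-country-codes.json
INPUT_DATA_CO2_LOCAL = data/CO2_Emissions_Capita-historical.xlsx
INPUT_DATA_SO2_LOCAL = data/SO2_Emissions_Capita-historical.xlsx
OUTPUT_DATA_LOCAL = data/output_data/

[COMMON]
DATA_LOCAL = yes
DATA_STORAGE = parquet
"""

sample_config = configparser.ConfigParser()
sample_config.read_string(SAMPLE_CFG)

# Keys are looked up the same way as in the notebook cell above.
print(sample_config["LOCAL"]["INPUT_DATA_WEATHER_LOCAL"])
print(sample_config["COMMON"]["DATA_STORAGE"])
```

Keeping real credentials out of the notebook and in a separate, untracked config file is the reason for this layout.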


### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use? etc.

The scope of the project is to create an ETL pipeline that processes, cleans, and stores data on world temperature, per-country CO2 and SO2 emissions, and country codes.
* Output of the ETL pipeline: processed data stored as parquet files in a star schema model.
* Tools: Python, pandas, PySpark, (Amazon AWS S3)

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 
* **data/GlobalLandTemperaturesByCity.csv**: World land temperatures by city

  * Source: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data/
  * Temperature-data example:
  * ![Temperature-data example](./Udacity-DEND-Project-Capstone-TemperatureData-20190804-1.png)

  * NOTE: Please unzip climate-change-earth-surface-temperature-data.zip and move the files to the /data folder. GlobalLandTemperaturesByCity.csv is the primary file used in this ETL pipeline.

* **data/CO2_Emissions_Capita-historical.xlsx**: CO2 emission data by country

  * Source: https://clio-infra.eu/Indicators/CO2EmissionsperCapita.html
  * => https://datasets.socialhistory.org/dataset.xhtml?persistentId=hdl:10622/DG654S
  * CO2 Emissions per Capita (1500-2000)
  * Last update: 2012-09-01
  * CO2-emission example:
  * ![CO2-data example](./Udacity-DEND-Project-Capstone-CO2Data-20190804-2.png)

* **data/SO2_Emissions_Capita-historical.xlsx**: SO2 emission data by country

  * Source: https://clio-infra.eu/Indicators/SO2EmissionsperCapita.html
  * => https://datasets.socialhistory.org/dataset.xhtml?persistentId=hdl:10622/IRT0YU
  * SO2 Emissions per Capita (1850-2000)
  * Last update: 2013-05-18
  * SO2-emission example:
  * ![SO2-data example](./Udacity-DEND-Project-Capstone-SO2Data-20190804-3.png)

* **data/iso-3166-country-codes.json**: country codes (ISO-3166)

  * Source: https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes
    ISO-3166-1 and ISO-3166-2 Country and Dependent Territories Lists with UN Regional Codes
  * ISO-3166: https://www.iso.org/iso-3166-country-codes.html
  * Country Code example:
  * ![CountryCode-data example](./Udacity-DEND-Project-Capstone-CountryCodesData-20190804-4.png)

### 1.1 Define config

In [34]:
# Read in the data here
if DATA_LOCAL == "yes":
    temp_by_city      = INPUT_DATA_WEATHER_LOCAL
    co2_data          = INPUT_DATA_CO2_LOCAL
    so2_data          = INPUT_DATA_SO2_LOCAL
    country_codes     = INPUT_DATA_CC_LOCAL
    output_data       = OUTPUT_DATA_LOCAL
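elif DATA_LOCAL == "no":
    # NOTE: INPUT_DATA_WEATHER, INPUT_DATA_CO2, INPUT_DATA_SO2 and INPUT_CC are
    # not defined in the config cell above. Define them (for example by reading
    # them from dl.cfg) before running with DATA_LOCAL = no.
    input_data_bucket = INPUT_DATA
    temp_by_city      = INPUT_DATA_WEATHER
    co2_data          = INPUT_DATA_CO2
    so2_data          = INPUT_DATA_SO2
    country_codes     = INPUT_CC
    output_data       = OUTPUT_DATA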

if DATA_STORAGE == "postgresql":
    pass
elif DATA_STORAGE == "parquet":
    data_storage      = DATA_STORAGE
    
# Read Global Land Temperature by City data:
temp_by_city_df = pd.read_csv(temp_by_city, header=0, sep=",")

# Read Global CO2 emission data:
co2_data_df = pd.read_excel(co2_data, header=2)

# Read Global SO2 emission data:
so2_data_df = pd.read_excel(so2_data, header=2)

# Read Global country codes data:
country_codes_df = pd.read_json(country_codes, orient="records")

### 1.2 Show data snippets

In [6]:
temp_by_city_df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [7]:
co2_data_df.head()

Unnamed: 0,Webmapper code,Webmapper numeric code,ccode,country name,start year,end year,1500,1501,1502,1503,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,geacron/1,1,,Aceh,1520,1539,,,,,...,,,,,,,,,,
1,geacron/2,2,,Aceh,1540,1619,,,,,...,,,,,,,,,,
2,geacron/3,3,,Aceh,1620,1639,,,,,...,,,,,,,,,,
3,geacron/4,4,,Aceh,1640,1679,,,,,...,,,,,,,,,,
4,geacron/5,5,,Aceh,1680,1879,,,,,...,,,,,,,,,,


In [8]:
so2_data_df.head()

Unnamed: 0,Webmapper code,Webmapper numeric code,ccode,country name,start year,end year,1500,1501,1502,1503,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,geacron/1,1,,Aceh,1520,1539,,,,,...,,,,,,,,,,
1,geacron/2,2,,Aceh,1540,1619,,,,,...,,,,,,,,,,
2,geacron/3,3,,Aceh,1620,1639,,,,,...,,,,,,,,,,
3,geacron/4,4,,Aceh,1640,1679,,,,,...,,,,,,,,,,
4,geacron/5,5,,Aceh,1680,1879,,,,,...,,,,,,,,,,
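
Both emission tables above are wide, with one column per year. Relating them to temperature by year will likely require a long layout with one row per country and year. The following is a minimal pandas sketch of that reshaping, not the notebook's own cleaning step. The frame `wide_df`, its country labels and values, and the column name `co2_per_capita` are invented for illustration.

```python
import pandas as pd

# Tiny hand-made frame that mimics the wide emissions layout (values invented).
wide_df = pd.DataFrame({
    "country name": ["Country A", "Country B"],
    "1999": [1.5, None],
    "2000": [1.7, 0.4],
})

# Turn the year columns into rows: one row per (country, year) pair.
long_df = wide_df.melt(
    id_vars=["country name"],
    var_name="year",
    value_name="co2_per_capita",
)
long_df["year"] = long_df["year"].astype(int)

# Drop pairs that have no measurement.
long_df = long_df.dropna(subset=["co2_per_capita"]).reset_index(drop=True)
print(long_df)
```

The long layout also removes most of the empty cells, since the majority of year columns hold no value for a given row.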


In [9]:
country_codes_df.head()

Unnamed: 0,alpha-2,alpha-3,country-code,intermediate-region,intermediate-region-code,iso_3166-2,name,region,region-code,sub-region,sub-region-code
0,AF,AFG,4,,,ISO 3166-2:AF,Afghanistan,Asia,142,Southern Asia,34
1,AX,ALA,248,,,ISO 3166-2:AX,Åland Islands,Europe,150,Northern Europe,154
2,AL,ALB,8,,,ISO 3166-2:AL,Albania,Europe,150,Southern Europe,39
3,DZ,DZA,12,,,ISO 3166-2:DZ,Algeria,Africa,2,Northern Africa,15
4,AS,ASM,16,,,ISO 3166-2:AS,American Samoa,Oceania,9,Polynesia,61
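
The emission tables identify countries by `country name`, while the country code table uses `name`. One possible way to attach ISO codes is a left join on those two columns, which also reveals names that have no ISO match. The sketch below uses small hand-made frames and assumes an exact string match is acceptable; the names `emissions`, `codes`, `matched` and `unmatched` are hypothetical.

```python
import pandas as pd

# Hand-made samples; the names and codes are taken from the outputs above.
emissions = pd.DataFrame({"country name": ["Albania", "Algeria", "Aceh"]})
codes = pd.DataFrame({
    "name": ["Albania", "Algeria"],
    "alpha-3": ["ALB", "DZA"],
})

# Left join keeps every emissions row; indicator=True adds a _merge column
# that says whether a matching country code was found.
matched = emissions.merge(
    codes,
    how="left",
    left_on="country name",
    right_on="name",
    indicator=True,
)

# Names without an ISO match (for example historical entities).
unmatched = matched.loc[matched["_merge"] == "left_only", "country name"].tolist()
print(unmatched)
```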


### 1.3 Create Spark session

In [10]:
	

spark = SparkSession.builder\
                     .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:2.7.0")\
                     .getOrCreate()

### 1.4 Read Temperature data to Spark

In [15]:

temp_df_spark = spark.read.csv(temp_by_city, header=True, sep=',')

In [16]:
temp_df_spark.printSchema()
temp_df_spark.show(5, truncate=False)

root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: string (nullable = true)
 |-- AverageTemperatureUncertainty: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)

+----------+------------------+-----------------------------+-----+-------+--------+---------+
|dt        |AverageTemperature|AverageTemperatureUncertainty|City |Country|Latitude|Longitude|
+----------+------------------+-----------------------------+-----+-------+--------+---------+
|1743-11-01|6.068             |1.7369999999999999           |Århus|Denmark|57.05N  |10.33E   |
|1743-12-01|null              |null                         |Århus|Denmark|57.05N  |10.33E   |
|1744-01-01|null              |null                         |Århus|Denmark|57.05N  |10.33E   |
|1744-02-01|null              |null                         |Århus|Denmark|57.05N  |10.33E   |
|1744-03-01|null            
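
The schema above shows that every column is read as a string. A later cleaning step would probably need typed values. The sketch below illustrates one way to do that, using pandas on a hand-made sample instead of Spark. The helper `parse_coordinate` is hypothetical, and it assumes the usual convention that southern latitudes and western longitudes are negative.

```python
import pandas as pd


def parse_coordinate(value: str) -> float:
    """Convert e.g. '57.05N' or '10.33E' to a signed float (S and W negative)."""
    magnitude = float(value[:-1])
    hemisphere = value[-1].upper()
    return -magnitude if hemisphere in ("S", "W") else magnitude


# Sample rows copied from the output above, with all values as strings.
sample = pd.DataFrame({
    "dt": ["1743-11-01", "1743-12-01"],
    "AverageTemperature": ["6.068", None],
    "Latitude": ["57.05N", "57.05N"],
    "Longitude": ["10.33E", "10.33E"],
})

sample["dt"] = pd.to_datetime(sample["dt"])
sample["AverageTemperature"] = pd.to_numeric(sample["AverageTemperature"])
sample["Latitude"] = sample["Latitude"].map(parse_coordinate)
sample["Longitude"] = sample["Longitude"].map(parse_coordinate)
print(sample.dtypes)
```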

### 1.5 Read CO2 data to Spark

In [57]:
# Six descriptive columns followed by one column per year, 1500-2015 inclusive.
co2_schema = ['webmapper_code', 'webmapper_numeric_code', 'ccode', 'country_name', 'start_year', 'end_year'] \
             + [str(year) for year in range(1500, 2016)]
co2_df_spark = spark.createDataFrame(co2_data_df, schema=co2_schema)

In [58]:
#years = list(range(1500, 2015, 1))
#print(years)

In [59]:
#years_ = []
#for year in years:
#    year_ = str(year)
#    years_.append(year_)
#print(years_)

In [60]:
co2_df_spark.printSchema()
co2_df_spark.show(5, truncate=False)

root
 |-- webmapper_code: string (nullable = true)
 |-- webmapper_numeric_code: long (nullable = true)
 |-- ccode: double (nullable = true)
 |-- country_name: string (nullable = true)
 |-- start_year: long (nullable = true)
 |-- end_year: long (nullable = true)
 |-- 1500: double (nullable = true)
 |-- 1501: double (nullable = true)
 |-- 1502: double (nullable = true)
 |-- 1503: double (nullable = true)
 |-- 1504: double (nullable = true)
 |-- 1505: double (nullable = true)
 |-- 1506: double (nullable = true)
 |-- 1507: double (nullable = true)
 |-- 1508: double (nullable = true)
 |-- 1509: double (nullable = true)
 |-- 1510: double (nullable = true)
 |-- 1511: double (nullable = true)
 |-- 1512: double (nullable = true)
 |-- 1513: double (nullable = true)
 |-- 1514: double (nullable = true)
 |-- 1515: double (nullable = true)
 |-- 1516: double (nullable = true)
 |-- 1517: double (nullable = true)
 |-- 1518: double (nullable = true)
 |-- 1519: double (nullable = true)
 |-- 1520: double 


### 1.6 Read SO2 data to Spark

In [64]:
# Six descriptive columns followed by one column per year, 1500-2015 inclusive.
so2_schema = ['webmapper_code', 'webmapper_numeric_code', 'ccode', 'country_name', 'start_year', 'end_year'] \
             + [str(year) for year in range(1500, 2016)]


so2_df_spark = spark.createDataFrame(so2_data_df, schema=so2_schema)

In [65]:
so2_df_spark.printSchema()
so2_df_spark.show(5, truncate=False)

root
 |-- webmapper_code: string (nullable = true)
 |-- webmapper_numeric_code: long (nullable = true)
 |-- ccode: double (nullable = true)
 |-- country_name: string (nullable = true)
 |-- start_year: long (nullable = true)
 |-- end_year: long (nullable = true)
 |-- 1500: double (nullable = true)
 |-- 1501: double (nullable = true)
 |-- 1502: double (nullable = true)
 |-- 1503: double (nullable = true)
 |-- 1504: double (nullable = true)
 |-- 1505: double (nullable = true)
 |-- 1506: double (nullable = true)
 |-- 1507: double (nullable = true)
 |-- 1508: double (nullable = true)
 |-- 1509: double (nullable = true)
 |-- 1510: double (nullable = true)
 |-- 1511: double (nullable = true)
 |-- 1512: double (nullable = true)
 |-- 1513: double (nullable = true)
 |-- 1514: double (nullable = true)
 |-- 1515: double (nullable = true)
 |-- 1516: double (nullable = true)
 |-- 1517: double (nullable = true)
 |-- 1518: double (nullable = true)
 |-- 1519: double (nullable = true)
 |-- 1520: double 


### 1.7 Read Country Code data to Spark

In [31]:
country_code_schema = []
country_codes_df_spark = spark.createDataFrame(country_codes_df)

In [32]:
country_codes_df_spark.printSchema()
country_codes_df_spark.show(5, truncate=False)

root
 |-- alpha-2: string (nullable = true)
 |-- alpha-3: string (nullable = true)
 |-- country-code: long (nullable = true)
 |-- intermediate-region: string (nullable = true)
 |-- intermediate-region-code: string (nullable = true)
 |-- iso_3166-2: string (nullable = true)
 |-- name: string (nullable = true)
 |-- region: string (nullable = true)
 |-- region-code: string (nullable = true)
 |-- sub-region: string (nullable = true)
 |-- sub-region-code: string (nullable = true)

+-------+-------+------------+-------------------+------------------------+-------------+--------------+-------+-----------+---------------+---------------+
|alpha-2|alpha-3|country-code|intermediate-region|intermediate-region-code|iso_3166-2   |name          |region |region-code|sub-region     |sub-region-code|
+-------+-------+------------+-------------------+------------------------+-------------+--------------+-------+-----------+---------------+---------------+
|AF     |AFG    |4           |                  

### 1.8 Write Spark DataFrames to parquet files

In [39]:
# Write Temperature data to parquet file:
temp_df_path = output_data + "temperature_staging.parquet" + "_" + str(datetime.now())
print(f"OUTPUT: {temp_df_path}")
temp_df_spark.write.mode("overwrite").parquet(temp_df_path)

# Read parquet file back to Spark:
temp_df_spark = spark.read.parquet(temp_df_path)

data/output_data/temperature_staging.parquet_2019-08-04 18:18:43.918157
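
The output path above embeds `str(datetime.now())`, so it contains a space and colons. Such characters can be awkward on some file systems and in some path-handling tools. A possible alternative is a compact timestamp built with `strftime`. This is only a sketch: the helper `staging_path` and the timestamp format are assumptions, not part of the notebook.

```python
from datetime import datetime


def staging_path(base: str, name: str, now: datetime) -> str:
    """Build an output path whose timestamp has no spaces or colons."""
    stamp = now.strftime("%Y%m%dT%H%M%S")
    return f"{base}{name}_{stamp}.parquet"


# Fixed timestamp so the result is reproducible; in practice pass datetime.now().
print(staging_path("data/output_data/", "temperature_staging",
                   datetime(2019, 8, 4, 18, 18, 43)))
```

Passing the timestamp in as an argument also lets all four staging outputs of one run share the same suffix.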


In [42]:
#temp_df_spark.printSchema()
#temp_df_spark.show(5, truncate=False)

In [61]:
# Write CO2 data to parquet file:
co2_df_path = output_data + "co2_staging.parquet" + "_" + str(datetime.now())
print(f"OUTPUT: {co2_df_path}")
co2_df_spark.write.mode("overwrite").parquet(co2_df_path)

# Read parquet file back to Spark:
co2_df_spark = spark.read.parquet(co2_df_path)

OUTPUT: data/output_data/co2_staging.parquet_2019-08-04 18:58:10.757250


In [63]:
#co2_df_spark.printSchema()
#co2_df_spark.show(5, truncate=False)

In [66]:
# Write SO2 data to parquet file:
so2_df_path = output_data + "so2_staging.parquet" + "_" + str(datetime.now())
print(f"OUTPUT: {so2_df_path}")
so2_df_spark.write.mode("overwrite").parquet(so2_df_path)

# Read parquet file back to Spark:
so2_df_spark = spark.read.parquet(so2_df_path)

OUTPUT: data/output_data/so2_staging.parquet_2019-08-04 19:13:35.799744


In [68]:
#so2_df_spark.printSchema()
#so2_df_spark.show(5, truncate=False)

In [70]:
# Write Country Code data to parquet file:
country_code_df_path = output_data + "country_code_staging.parquet" + "_" + str(datetime.now())
print(f"OUTPUT: {country_code_df_path}")
country_codes_df_spark.write.mode("overwrite").parquet(country_code_df_path)

# Read parquet file back to Spark:
country_codes_df_spark = spark.read.parquet(country_code_df_path)

OUTPUT: data/output_data/country_code_staging.parquet_2019-08-04 19:16:34.183426


In [73]:
#country_codes_df_spark.printSchema()
#country_codes_df_spark.show(5, truncate=False)

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data
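
As a starting point, the checks named above (missing values and duplicates) could look like the following pandas sketch. It runs on a hand-made sample in which the third row is an invented duplicate. Treating `dt` plus `City` as the key, and dropping rows without a temperature, are assumptions made for illustration only.

```python
import pandas as pd

# Sample in the shape of the temperature data; the third row is an invented duplicate.
df = pd.DataFrame({
    "dt": ["1743-11-01", "1743-12-01", "1743-12-01"],
    "City": ["Århus", "Århus", "Århus"],
    "AverageTemperature": [6.068, None, None],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Count rows that repeat an earlier (dt, City) combination.
duplicate_rows = int(df.duplicated(subset=["dt", "City"]).sum())

# One possible cleaning step: drop rows without a temperature, then drop duplicates.
cleaned = (
    df.dropna(subset=["AverageTemperature"])
      .drop_duplicates(subset=["dt", "City"])
)

print(missing_per_column)
print(duplicate_rows)
print(len(cleaned))
```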

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
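
Two of the checks listed above, a count check and a unique-key check, can be sketched as small functions that raise an error when a table fails. The sketch uses pandas so that it runs without a Spark session. The function names and the table name `country_dim` are hypothetical, since the data model is not defined yet.

```python
import pandas as pd


def check_not_empty(df: pd.DataFrame, table_name: str) -> None:
    """Count check: fail if the table has no rows."""
    if len(df) == 0:
        raise ValueError(f"{table_name}: table is empty")


def check_unique_key(df: pd.DataFrame, key_columns: list, table_name: str) -> None:
    """Integrity check: fail if the key columns contain duplicate combinations."""
    duplicates = int(df.duplicated(subset=key_columns).sum())
    if duplicates > 0:
        raise ValueError(f"{table_name}: {duplicates} duplicate key(s) in {key_columns}")


# Hand-made sample standing in for a finished dimension table.
good = pd.DataFrame({"alpha-3": ["ALB", "DZA"], "name": ["Albania", "Algeria"]})
check_not_empty(good, "country_dim")
check_unique_key(good, ["alpha-3"], "country_dim")
print("quality checks passed")
```

Raising an exception, as opposed to printing a warning, stops the pipeline run as soon as a check fails.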
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
  * The data was increased by 100x.
  * The data populates a dashboard that must be updated on a daily basis by 7am every day.
  * The database needed to be accessed by 100+ people.