# AdventureWorksDW Dataset

### Simple SQL operations

The **AdventureWorksDW** database (https://github.com/microsoft/sql-server-samples/) is well known in the Microsoft data world.

Let's work through a few SparkSQL exercises, framed as challenges.

### Loading PySpark

In [None]:
# !pip install pyspark

In [1]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

In [2]:
conf = SparkConf().setMaster('local').setAppName('PySpark SQL')
sc = SparkContext.getOrCreate(conf = conf)

Next we create the SparkSQL context object, which is responsible for running Spark *queries* written in SQL.

In [3]:
sql = SQLContext(sc)

In [4]:
sql

<pyspark.sql.context.SQLContext at 0x7f84a2173b00>
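On Spark 2.0 and later, `SparkSession` is the unified entry point and exposes the same `read` and `sql` methods used in this notebook, while `SQLContext` is kept for backward compatibility. A minimal sketch of an equivalent setup, assuming PySpark is installed (the helper name `make_session` is hypothetical):

```python
def make_session(app_name="PySpark SQL", master="local"):
    """Return a SparkSession configured like the SparkConf above."""
    # Imported lazily so that defining the helper does not require
    # PySpark to be importable.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master)
        .appName(app_name)
        .getOrCreate()
    )

# Usage (needs PySpark and Java available):
# spark = make_session()
# spark.sql("SELECT 1 AS ok").show()
```

With a session in hand, `spark.read` and `spark.sql` can stand in for `sql.read` and `sql.sql` in the cells that follow.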

Here we create one DataFrame for each CSV file we read. We go through the SparkSQL context, but each result is still a regular DataFrame.

In [14]:
FactInternetSales_Spark = sql.read.format("csv").options(header='true').load('AdventureWorksDW/FactInternetSales.csv')
DimSalesTerritory_Spark = sql.read.format("csv").options(header='true').load('AdventureWorksDW/DimSalesTerritory.csv')
DimProductSubcategory_Spark = sql.read.format("csv").options(header='true').load('AdventureWorksDW/DimProductSubcategory.csv')
DimProductCategory_Spark = sql.read.format("csv").options(header='true').load('AdventureWorksDW/DimProductCategory.csv')
DimProduct_Spark = sql.read.format("csv").options(header='true').load('AdventureWorksDW/DimProduct.csv')
DimCustomer_Spark = sql.read.format("csv").options(header='true').load('AdventureWorksDW/DimCustomer.csv')
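The six reads differ only in the table name, so they can also be written as a loop. A sketch, assuming the same folder layout as in the cell above; `TABLES`, `csv_path`, and `load_tables` are illustrative names, not part of PySpark:

```python
# Table names taken from the cell above.
TABLES = [
    "FactInternetSales",
    "DimSalesTerritory",
    "DimProductSubcategory",
    "DimProductCategory",
    "DimProduct",
    "DimCustomer",
]


def csv_path(table, folder="AdventureWorksDW"):
    """Build the relative CSV path used in this notebook for a table name."""
    return f"{folder}/{table}.csv"


def load_tables(sql_context, tables=TABLES):
    """Read each CSV with a header row and return {table name: DataFrame}."""
    frames = {}
    for table in tables:
        frames[table] = (
            sql_context.read.format("csv")
            .options(header="true")
            .load(csv_path(table))
        )
    return frames

# Usage in this notebook:
# frames = load_tables(sql)
# frames["FactInternetSales"].printSchema()
```

Keeping the DataFrames in a dictionary keyed by table name avoids six near-identical variables, at the cost of a lookup each time one is needed.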

In [9]:
FactInternetSales_Spark.printSchema()

root
 |-- ProductKey: string (nullable = true)
 |-- OrderDateKey: string (nullable = true)
 |-- DueDateKey: string (nullable = true)
 |-- ShipDateKey: string (nullable = true)
 |-- CustomerKey: string (nullable = true)
 |-- PromotionKey: string (nullable = true)
 |-- CurrencyKey: string (nullable = true)
 |-- SalesTerritoryKey: string (nullable = true)
 |-- SalesOrderNumber: string (nullable = true)
 |-- SalesOrderLineNumber: string (nullable = true)
 |-- RevisionNumber: string (nullable = true)
 |-- OrderQuantity: string (nullable = true)
 |-- UnitPrice: string (nullable = true)
 |-- ExtendedAmount: string (nullable = true)
 |-- UnitPriceDiscountPct: string (nullable = true)
 |-- DiscountAmount: string (nullable = true)
 |-- ProductStandardCost: string (nullable = true)
 |-- TotalProductCost: string (nullable = true)
 |-- SalesAmount: string (nullable = true)
 |-- TaxAmt: string (nullable = true)
 |-- Freight: string (nullable = true)
 |-- CarrierTrackingNumber: string (nullable = true)
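Every column comes back as `string` because the CSV reader was not asked to infer types. One option is to pass `inferSchema='true'` alongside `header='true'` when reading. Another is to cast in SQL. The sketch below builds such a query for a few columns from the schema above; `cast_select` is a hypothetical helper, and the target types are assumptions about the data:

```python
def cast_select(table, casts):
    """Return a SELECT that casts columns; `casts` maps column name to SQL type."""
    columns = ", ".join(
        f"CAST({column} AS {sql_type}) AS {column}"
        for column, sql_type in casts.items()
    )
    return f"SELECT {columns} FROM {table}"


# Assumed types: keys and quantities as integers, costs as doubles.
query = cast_select(
    "FactInternetSales",
    {"ProductKey": "INT", "OrderQuantity": "INT", "TotalProductCost": "DOUBLE"},
)
print(query)
```

Once `FactInternetSales` is registered as a temporary table (further down), `sql.sql(query).printSchema()` should report the cast types instead of `string`.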

In [15]:
FactInternetSales_Spark.select(["ProductKey","TotalProductCost"]).show(10)

+----------+----------------+
|ProductKey|TotalProductCost|
+----------+----------------+
|       310|       2171.2942|
|       346|       1912.1544|
|       346|       1912.1544|
|       336|        413.1463|
|       346|       1912.1544|
|       311|       2171.2942|
|       310|       2171.2942|
|       351|       1898.0944|
|       344|       1912.1544|
|       312|       2171.2942|
+----------+----------------+
only showing top 10 rows



From the DataFrames loaded above, we register one SQL temporary table per DataFrame, each named after its source table (for example, **FactInternetSales**).

In [16]:
# createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0
FactInternetSales_Spark.createOrReplaceTempView("FactInternetSales")
DimSalesTerritory_Spark.createOrReplaceTempView("DimSalesTerritory")
DimProductSubcategory_Spark.createOrReplaceTempView("DimProductSubcategory")
DimProductCategory_Spark.createOrReplaceTempView("DimProductCategory")
DimProduct_Spark.createOrReplaceTempView("DimProduct")
DimCustomer_Spark.createOrReplaceTempView("DimCustomer")
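Registration follows the same pattern for every DataFrame, so it can be driven by a dictionary of view names. A sketch using `createOrReplaceTempView`, the method that superseded `registerTempTable` in Spark 2.0; `register_views` is a hypothetical helper:

```python
def register_views(frames):
    """Register each DataFrame under its dictionary key and return the names."""
    for name, frame in frames.items():
        frame.createOrReplaceTempView(name)
    return sorted(frames)

# Usage with the DataFrames loaded above:
# register_views({
#     "FactInternetSales": FactInternetSales_Spark,
#     "DimProductCategory": DimProductCategory_Spark,
# })
```

Returning the sorted names gives a quick way to confirm which tables are available to SQL queries.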

Using the **SparkSQL** engine, we write a SQL *query* that joins **DimProductSubcategory** to **DimProductCategory** and returns every subcategory column for the category whose key is 2.

In [18]:
sql.sql("""
    SELECT Sub.*
    FROM DimProductSubcategory AS Sub
    INNER JOIN DimProductCategory AS Cat
        ON Sub.ProductCategoryKey = Cat.ProductCategoryKey
    WHERE Cat.ProductCategoryKey = 2
""").show()

+---------------------+------------------------------+-----------------------------+-----------------------------+----------------------------+------------------+
|ProductSubcategoryKey|ProductSubcategoryAlternateKey|EnglishProductSubcategoryName|SpanishProductSubcategoryName|FrenchProductSubcategoryName|ProductCategoryKey|
+---------------------+------------------------------+-----------------------------+-----------------------------+----------------------------+------------------+
|                    4|                             4|                   Handlebars|                        Barra|               Barre d'appui|                 2|
|                    5|                             5|              Bottom Brackets|              Eje de pedalier|             Axe de pédalier|                 2|
|                    6|                             6|                       Brakes|                       Frenos|                      Freins|                 2|
|                    7
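As one possible follow-up challenge, the same join can be aggregated to count subcategories per category. A sketch that uses only the tables and columns shown above; `SUBCATEGORY_COUNT_QUERY` and `count_subcategories` are hypothetical names:

```python
# Same join as the query above, grouped by category instead of filtered.
SUBCATEGORY_COUNT_QUERY = """
    SELECT Cat.ProductCategoryKey,
           COUNT(*) AS SubcategoryCount
    FROM DimProductSubcategory AS Sub
    INNER JOIN DimProductCategory AS Cat
        ON Sub.ProductCategoryKey = Cat.ProductCategoryKey
    GROUP BY Cat.ProductCategoryKey
    ORDER BY SubcategoryCount DESC
"""


def count_subcategories(sql_context):
    """Run the aggregation on a context whose temp tables are already registered."""
    return sql_context.sql(SUBCATEGORY_COUNT_QUERY)

# Usage in this notebook:
# count_subcategories(sql).show()
```

The function returns a DataFrame rather than printing it, so the result can be filtered or joined further before calling `show()`.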

In [None]:
sql.sql("").show()

In [None]:
sql.sql("").show()

In [None]:
sql.sql("").show()

In [None]:
sql.sql("").show()