-
Notifications
You must be signed in to change notification settings - Fork 0
/
6-SparkSQL.py
266 lines (202 loc) · 8.78 KB
/
6-SparkSQL.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
# Databricks notebook source
# MAGIC %md
# MAGIC In this exercise we will use Spark to explore the hypothesis stated below:
# MAGIC
# MAGIC **The number of farmer's markets in a given zip code can be predicted from the income and taxes paid in a given area.**
# MAGIC
# MAGIC We explore the process for discovering whether or not we can accurately predict the number of farmer's markets. We will use two datasets.
# MAGIC
# MAGIC ![img](http://training.databricks.com/databricks_guide/USDA_logo.png)
# MAGIC
# MAGIC The [**Farmers Markets Directory and Geographic Data**](http://catalog.data.gov/dataset/farmers-markets-geographic-data/resource/cca1cc8a-9670-4a27-a8c7-0c0180459bef) dataset contains information on the longitude and latitude, state, address, name, and zip code of Farmers Markets in the United States. The raw data is published by the Department of Agriculture.
# MAGIC
# MAGIC ![img](http://training.databricks.com/databricks_guide/irs-logo.jpg)
# MAGIC
# MAGIC The [**SOI Tax Stats - Individual Income Tax Statistics - ZIP Code Data (SOI)**](http://catalog.data.gov/dataset/zip-code-data) study provides detailed tabulations of individual income tax return data at the state and ZIP code level and is provided by the IRS. The data includes items, such as:
# MAGIC
# MAGIC - Number of returns, which approximates the number of households
# MAGIC - Number of personal exemptions, which approximates the population
# MAGIC - Adjusted gross income
# MAGIC - Wages and salaries
# MAGIC - Dividends before exclusion
# MAGIC - Interest received
# COMMAND ----------
# MAGIC %md Read in the data
# MAGIC
# MAGIC This data is located in in csv files and Apache Spark 2.0 can read the data in directly.
# COMMAND ----------
taxes2013 = spark.read\
.option("header", "true")\
.csv("dbfs:/databricks-datasets/data.gov/irs_zip_code_data/data-001/2013_soi_zipcode_agi.csv")
# COMMAND ----------
markets = spark.read\
.option("header", "true")\
.csv("dbfs:/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/market_data.csv")
# COMMAND ----------
# MAGIC %md Use the `display` command to take a quick look at the dataframes you created.
# COMMAND ----------
display(taxes2013)
# COMMAND ----------
# MAGIC %md Now register the DataFrames as Spark SQL tables so that we can use our SQL skills to manipulate the data. The lifetime of this temporary table is tied to the Spark Context that was used to create this DataFrame. When you shutdown the SQLContext associated with a cluster the temporary table disappears as well.
# COMMAND ----------
taxes2013.createOrReplaceTempView("taxes2013")
markets.createOrReplaceTempView("markets")
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC ## Running SQL Commands
# MAGIC
# MAGIC This is a python notebook. To use another language in a python notebook we prefix the cell with the language identifier. To use SQL in your python notebook, prefix the cell with `%sql`
# MAGIC
# MAGIC Use SQL to `show tables`.
# COMMAND ----------
# MAGIC %sql show tables
# COMMAND ----------
# MAGIC %md Take a quick look at the tables using `SELECT *`. This is very similar to calling `display` on the DataFrame.
# COMMAND ----------
# MAGIC %sql SELECT * FROM taxes2013
# COMMAND ----------
# MAGIC %md Next, cleanup the data using SQL
# MAGIC 1. Create a new table called `cleaned_taxes` from the `taxes2013` temp table
# MAGIC 1. All the values are currently strings, convert them to approriate data types as needed
# MAGIC 1. zipcode => int
# MAGIC 1. mars1 => int
# MAGIC 1. mars2 => int
# MAGIC 1. numdep => int
# MAGIC 1. A02650 => double
# MAGIC 1. A00300 => double
# MAGIC 1. a01000 => double
# MAGIC 1. a00900 => double
# MAGIC 1. Rename columns for clarity as needed
# MAGIC 1. state => state
# MAGIC 1. zipcode => zipcode
# MAGIC 1. mars1 => single_returns
# MAGIC 1. mars2 => joint_returns
# MAGIC 1. numdep => numdep
# MAGIC 1. A02650 => total_income_amount
# MAGIC 1. A00300 => taxable_interest_amount
# MAGIC 1. a01000 => net_capital_gains
# MAGIC 1. a00900 => biz_net_income
# MAGIC 1. Shorten each zip code to 4 digits instead of 5, to group nearby areas together
# COMMAND ----------
# MAGIC %sql
# MAGIC DROP TABLE IF EXISTS cleaned_taxes;
# MAGIC
# MAGIC CREATE TABLE cleaned_taxes AS
# MAGIC SELECT
# MAGIC state,
# MAGIC int(zipcode / 10) as zipcode,
# MAGIC int(mars1) as single_returns,
# MAGIC int(mars2) as joint_returns,
# MAGIC int(numdep) as numdep,
# MAGIC double(A02650) as total_income_amount,
# MAGIC double(A00300) as taxable_interest_amount,
# MAGIC double(a01000) as net_capital_gains,
# MAGIC double(a00900) as biz_net_income
# MAGIC FROM taxes2013
# COMMAND ----------
# MAGIC %md Now that the data is cleaned up, create some nice plots
# MAGIC
# MAGIC Explore the average total income per zip code per state.
# MAGIC
# MAGIC Which states have the highest income per zip code?
# COMMAND ----------
# MAGIC %sql SELECT state, total_income_amount FROM cleaned_taxes
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC New Jersey and California have higher average incomes per zip code.
# COMMAND ----------
# MAGIC %md Next let's explore some specifics of this particular dataset.
# MAGIC
# MAGIC Use SQL to `describe` the dataset so that you can see the schema.
# COMMAND ----------
# MAGIC %sql describe cleaned_taxes
# COMMAND ----------
# MAGIC %md Let's look at the set of zip codes with the lowest total capital gains and plot the results.
# COMMAND ----------
# MAGIC %sql
# MAGIC SELECT zipcode, SUM(net_capital_gains) AS cap_gains
# MAGIC FROM cleaned_taxes
# MAGIC WHERE NOT (zipcode = 0000 OR zipcode = 9999)
# MAGIC GROUP BY zipcode
# MAGIC ORDER BY cap_gains ASC
# MAGIC LIMIT 10
# COMMAND ----------
# MAGIC %md Let's look at a combination of capital gains and business net income to see what we find.
# MAGIC
# MAGIC Build a `combo` metric that represents the total capital gains and business net income by zip code.
# COMMAND ----------
# MAGIC %sql
# MAGIC SELECT zipcode,
# MAGIC SUM(biz_net_income) as business_net_income,
# MAGIC SUM(net_capital_gains) as capital_gains,
# MAGIC SUM(net_capital_gains) + SUM(biz_net_income) as capital_and_business_income
# MAGIC FROM cleaned_taxes
# MAGIC WHERE NOT (zipcode = 0000 OR zipcode = 9999)
# MAGIC GROUP BY zipcode
# MAGIC ORDER BY capital_and_business_income DESC
# MAGIC LIMIT 50
# COMMAND ----------
# MAGIC %md
# MAGIC We can also get a peek at what will happen when we execute the query.
# MAGIC
# MAGIC Use the `EXPLAIN` keyword in SQL.
# COMMAND ----------
# MAGIC %sql
# MAGIC EXPLAIN
# MAGIC SELECT zipcode,
# MAGIC SUM(biz_net_income) as net_income,
# MAGIC SUM(net_capital_gains) as cap_gains,
# MAGIC SUM(net_capital_gains) + SUM(biz_net_income) as combo
# MAGIC FROM cleaned_taxes
# MAGIC WHERE NOT (zipcode = 0000 OR zipcode = 9999)
# MAGIC GROUP BY zipcode
# MAGIC ORDER BY combo desc
# MAGIC limit 50
# COMMAND ----------
# MAGIC %md We can see above that we're fetching the data from `dbfs:/user/hive/warehouse/cleaned_taxes` which is where the data is stored when we registered it as a temporary table.
# MAGIC
# MAGIC Let's `cache` the data
# COMMAND ----------
# MAGIC %sql CACHE TABLE cleaned_taxes
# COMMAND ----------
# MAGIC %md When we cache data using SQL, Spark caches *eagerly*--right away. This differs from `cacheTable` on the `SqlContext` which caches the data only when it is needed.
# MAGIC
# MAGIC Let's run the exact same query again. You'll notice that it takes just a fraction of the time because the data is stored in memory.
# COMMAND ----------
# MAGIC %sql
# MAGIC SELECT zipcode,
# MAGIC SUM(biz_net_income) as net_income,
# MAGIC SUM(net_capital_gains) as cap_gains,
# MAGIC SUM(net_capital_gains) + SUM(biz_net_income) as combo
# MAGIC FROM cleaned_taxes
# MAGIC WHERE NOT (zipcode = 0000 OR zipcode = 9999)
# MAGIC GROUP BY zipcode
# MAGIC ORDER BY combo desc
# MAGIC limit 50
# COMMAND ----------
# MAGIC %md Now `EXPLAIN` the plan
# COMMAND ----------
# MAGIC %sql
# MAGIC EXPLAIN
# MAGIC SELECT zipcode,
# MAGIC SUM(biz_net_income) as net_income,
# MAGIC SUM(net_capital_gains) as cap_gains,
# MAGIC SUM(net_capital_gains) + SUM(biz_net_income) as combo
# MAGIC FROM cleaned_taxes
# MAGIC WHERE NOT (zipcode = 0000 OR zipcode = 9999)
# MAGIC GROUP BY zipcode
# MAGIC ORDER BY combo desc
# MAGIC limit 50
# COMMAND ----------
# MAGIC %md
# MAGIC Instead of going down to the source data it performs a `InMemoryColumnarTableScan`. This means that it has all of the information that it needs in memory.
# MAGIC
# MAGIC Now let's look at the Farmer's Market Data.
# MAGIC
# MAGIC Start with a total summation of farmer's markets per state.
# COMMAND ----------
# MAGIC %sql SELECT State, COUNT(State) as Sum
# MAGIC FROM markets
# MAGIC GROUP BY State