# Introduction to DataFrames Lab - Solution
## Distinct Articles

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Instructions

In the cell provided below, write the code necessary to count the number of distinct articles in our data set.
0. Copy and paste all you like from the previous notebook.
0. Read in our parquet files.
0. Apply the necessary transformations.
0. Assign the count to the variable `totalArticles`.
0. Run the last cell to verify that the data was loaded correctly.
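
To make the goal concrete before touching Spark, here is a minimal plain-Python sketch of what "count the distinct articles" means. The sample rows and the name `total_articles` are made up for illustration and are not taken from the real data set; the actual solution uses the DataFrame API instead.

```python
# Hypothetical sample rows (project, article, requests); NOT real data.
rows = [
    ("en", "Apache_Spark", 12),
    ("en", "Parquet", 3),
    ("en", "Apache_Spark", 7),   # same article appears a second time
    ("en", "Main_Page", 150),
]

# Rough equivalent of .select("article"): keep only the article column.
articles = [article for (_project, article, _requests) in rows]

# Rough equivalent of .distinct().count(): drop duplicates, then count.
total_articles = len(set(articles))

print("Distinct Articles: {0:,}".format(total_articles))  # prints "Distinct Articles: 3"
```

Four rows go in, but only three different article names come out, which is the number the lab asks for on the full data set.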

**Bonus**

Recall from the beginning of the previous notebook that reading our parquet files without a schema triggers a job, because Spark has to read the schema from the files themselves. To avoid that job:
0. Define a schema that matches the data we are working with.
0. Update the read operation to use the schema.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [4]:
%run "../Includes/Classroom-Setup"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Show Your Work

In [6]:
(source, sasEntity, sasToken) = getAzureDataSource()
spark.conf.set(sasEntity, sasToken)

path = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"

In [7]:
# ANSWER

# This version does not include the bonus, so it needs no schema types.

parquetDir = "/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/"

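df = (spark
  .read
  .parquet(parquetDir)   # read every parquet file in the directory
  .select("article")     # keep only the article column
  .distinct()            # drop duplicate article names
)
totalArticles = df.count()  # action: runs the job that computes the count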

print("Distinct Articles: {0:,}".format( totalArticles ))

In [8]:
# ANSWER

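# This version DOES include the bonus: the schema is supplied up front,
# so Spark does not have to read it from the parquet files.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType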

schema = StructType([
  StructField("project", StringType(), False),
  StructField("article", StringType(), False),
  StructField("requests", IntegerType(), False),
  StructField("bytes_served", LongType(), False)
])

totalArticles = (spark.read
  .schema(schema)
  .parquet(parquetDir)
  .select("article")
  .distinct()
  .count()
)

print("Distinct Articles: {0:,}".format( totalArticles ))
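
Both answers print the result with the `{0:,}` format spec, which inserts thousands separators. A quick plain-Python check of that formatting, using the expected count from the verification cell (the name `count_value` is only for illustration):

```python
# The expected distinct-article count from the verification cell.
count_value = 1783138

# "{0:,}" formats argument 0 with a comma as the thousands separator.
formatted = "Distinct Articles: {0:,}".format(count_value)

print(formatted)  # prints "Distinct Articles: 1,783,138"
```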

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Verify Your Work
Run the following cell to verify that `totalArticles` matches the expected count.

In [10]:
expected = 1783138
assert totalArticles == expected, "Expected the total to be " + str(expected) + " but found " + str(totalArticles)
