# Analyzing Wikipedia Clickstream Data
* [View Solution Notebook](./solution.html)

* [Project Page Link](https://www.codecademy.com/courses/big-data-pyspark/projects/analyzing-wikipedia-pyspark)

### Import Libraries

In [1]:
from pyspark.sql import SparkSession

## Task Group 1 - Introduction to Clickstream Data

### Task 1
Create a new `SparkSession` and assign it to a variable named `spark`.

In [2]:
# Create a new SparkSession
spark = SparkSession\
    .builder\
    .config('spark.app.name', 'learning_spark_sql')\
    .getOrCreate()

### Task 2

Create an RDD from a list of sample clickstream counts and save it as `clickstream_counts_rdd`.

In [3]:
# Sample clickstream counts
sample_clickstream_counts = [
    ["other-search", "Hanging_Gardens_of_Babylon", "external", 47000],
    ["other-empty", "Hanging_Gardens_of_Babylon", "external", 34600],
    ["Wonders_of_the_World", "Hanging_Gardens_of_Babylon", "link", 14000],
    ["Babylon", "Hanging_Gardens_of_Babylon", "link", 2500]
]

# Create RDD from sample data
clickstream_counts_rdd = spark.sparkContext.parallelize(sample_clickstream_counts)

### Task 3

Using the RDD from the previous step, create a DataFrame named `clickstream_sample_df`.

In [None]:
# Create a DataFrame from the RDD of sample clickstream counts
clickstream_sample_df = clickstream_counts_rdd.toDF(['source_page', 'target_page', 'link_category', 'link_count'])

# Display the DataFrame to the notebook


## Task Group 2 - Inspecting Clickstream Data

### Task 4

Read the files in `./cleaned/clickstream/` into a new Spark DataFrame named `clickstream` and display the first few rows of the DataFrame in the notebook.

In [None]:
# Read the target directory (`./cleaned/clickstream/`) into a DataFrame (`clickstream`)
clickstream = 

# Display the DataFrame to the notebook


### Task 5

Print the schema of the DataFrame in the notebook.

In [None]:
# Display the schema of the `clickstream` DataFrame to the notebook


### Task 6

Drop the `language_code` column from the DataFrame and display the new schema in the notebook.

In [None]:
# Drop target columns
clickstream = 

# Display the first few rows of the DataFrame

# Display the new schema in the notebook


### Task 7

Rename `referrer` and `resource` to `source_page` and `target_page`, respectively.

In [None]:
# Rename `referrer` and `resource` to `source_page` and `target_page`
clickstream = 
  
# Display the first few rows of the DataFrame

# Display the new schema in the notebook


## Task Group 3 - Querying Clickstream Data

### Task 8

Add the `clickstream` DataFrame as a temporary view named `clickstream` to make the data queryable with `SparkSession.sql()`.

In [None]:
# Create a temporary view in the metadata for this `SparkSession` 


### Task 9

Filter the dataset to entries with `Hanging_Gardens_of_Babylon` as the `target_page` and order the result by `click_count` using PySpark DataFrame methods.

In [None]:
# Filter and sort the DataFrame using PySpark DataFrame methods


### Task 10

Perform the same analysis as the previous exercise using a SQL query. 

In [None]:
# Filter and sort the DataFrame using SQL


### Task 11

Calculate the sum of `click_count` grouped by `link_category` using PySpark DataFrame methods.

In [None]:
# Aggregate the DataFrame using PySpark DataFrame Methods 


### Task 12

Perform the same analysis as the previous exercise using a SQL query.

In [None]:
# Aggregate the DataFrame using SQL


## Task Group 4 - Saving Results to Disk

### Task 13

Create a new DataFrame named `internal_clickstream` that contains only article pairs where `link_category` is `link`. Use `filter()` to select the rows that match a condition and `select()` to choose which columns to return.

In [None]:
# Create a new DataFrame named `internal_clickstream`
internal_clickstream = 

# Display the first few rows of the DataFrame in the notebook


### Task 14

Using `DataFrame.write.csv()`, save the `internal_clickstream` DataFrame as CSV files in a directory called `./results/article_to_article_csv/`.

In [None]:
# Save the `internal_clickstream` DataFrame to a series of CSV files


### Task 15

Using `DataFrame.write.parquet()`, save the `internal_clickstream` DataFrame as Parquet files in a directory called `./results/article_to_article_pq/`.

In [None]:
# Save the `internal_clickstream` DataFrame to a series of parquet files


### Task 16

Close the `SparkSession` and underlying `SparkContext`. What happens if we call `clickstream.show()` after closing the `SparkSession`?

In [None]:
# Stop the notebook's `SparkSession` and `SparkContext`


In [None]:
# The SparkSession and SparkContext are stopped; the following line will raise an error:
clickstream.show()