### EMR Notebooks Demo

* Installing notebook-scoped Python libraries on a running cluster directly via an EMR Notebook.
* Visualizing Spark DataFrames by plotting a variety of charts using the `%matplot` and `%%display` magics.

Reference: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/

#### Let us first start the Spark session from the notebook

In [None]:
print("Welcome to my EMR Notebook!")

#### Benefits of using notebook-scoped libraries:

 * Runtime Installation
 * Handles Transitive Dependencies
 * Dependency Isolation
 * Portable Library Environment

#### Before we install and import libraries on the cluster, let us list the packages that are already installed and available on it

In [None]:
sc.list_packages()

#### Now let us load the Amazon customer reviews data for books into a Spark DataFrame

In [None]:
df = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')

#### Let's determine the schema and number of available columns in the dataset

In [None]:
print(f'Total Columns: {len(df.dtypes)}')
df.printSchema()

#### Let's check the total number of rows and the number of distinct books in the dataset

In [None]:
print(f'Total Rows: {df.count():,}')
num_of_books = df.select('product_id').distinct().count()
print(f'Number of Books: {num_of_books:,}')

#### Let's install Python libraries from the PyPI repository
Let's analyze the number of book reviews by year and find the distribution of customer ratings. To do this, install version 0.25.1 of the `pandas` library and the latest version of the `matplotlib` library from the public PyPI repository. Both are installed on the cluster attached to your notebook using the `install_pypi_package` API.

In [None]:
sc.install_pypi_package("pandas==0.25.1")  # Install pandas version 0.25.1
sc.install_pypi_package("matplotlib", "https://pypi.org/simple")  # Install the latest matplotlib from the given PyPI repository
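
When several libraries are needed, the same two call shapes shown above (a requirement string alone, or a requirement string plus a repository URL) can be wrapped in a small helper. The following is a minimal sketch, not part of the EMR Notebooks API: `format_requirement` and `install_all` are hypothetical names, and a stand-in installer is used so that the cell also runs outside EMR. In the notebook you would pass `sc.install_pypi_package` as the installer.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def format_requirement(name: str, version: Optional[str] = None) -> str:
    """Build a pip-style requirement string such as 'pandas==0.25.1'."""
    return f"{name}=={version}" if version else name


def install_all(installer: Callable[..., None],
                packages: Iterable[Tuple[str, Optional[str]]],
                repo: Optional[str] = None) -> List[str]:
    """Call `installer` once per (name, version) pair; return the specs used."""
    specs = []
    for name, version in packages:
        spec = format_requirement(name, version)
        if repo is None:
            installer(spec)        # same shape as installer("pandas==0.25.1")
        else:
            installer(spec, repo)  # same shape as installer("matplotlib", repo_url)
        specs.append(spec)
    return specs


# Stand-in installer (an assumption for illustration) that only records its
# arguments. On EMR, pass sc.install_pypi_package instead.
recorded_calls = []
installed_specs = install_all(
    lambda *args: recorded_calls.append(args),
    [("pandas", "0.25.1"), ("matplotlib", None)],
)
print(installed_specs)
```

Passing the installer as an argument keeps the version-pinning logic separate from the Spark context, so it can be checked without a running cluster.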

#### Let's verify that the packages have been installed successfully

In [None]:
sc.list_packages()

#### Let's find the trend in the number of reviews across years

In [None]:
num_of_reviews_by_year = df.groupBy('year').count().orderBy('year').toPandas()
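
To make the aggregation above concrete, here is a rough plain-Python analogue of `groupBy('year').count().orderBy('year')`. It is only an illustration of what the query computes: the `sample_reviews` rows are hypothetical and are not taken from the reviews dataset, and no Spark session is involved.

```python
from collections import Counter

# Hypothetical sample rows standing in for the reviews DataFrame.
sample_reviews = [
    {"product_id": "B001", "year": 2013},
    {"product_id": "B002", "year": 2012},
    {"product_id": "B001", "year": 2013},
    {"product_id": "B003", "year": 2014},
    {"product_id": "B002", "year": 2013},
]

# groupBy('year').count(): tally how many rows fall into each year.
counts = Counter(row["year"] for row in sample_reviews)

# orderBy('year'): sort the (year, count) pairs by year.
reviews_by_year = sorted(counts.items())

for year, count in reviews_by_year:
    print(year, count)
```

In the notebook, Spark performs the same tally in a distributed fashion, and `toPandas()` then brings the small aggregated result back as a pandas DataFrame for plotting.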

#### Let's visualize the trend using the `%matplot` magic

In [None]:
import matplotlib.pyplot as plt
plt.clf()
num_of_reviews_by_year.plot(kind='area', x='year', y='count', rot=70, color='#bc5090', legend=None, figsize=(8, 6))
plt.xticks(num_of_reviews_by_year.year)
plt.xlim(1995, 2015)
plt.title('Number of reviews across years')
plt.xlabel('Year')
plt.ylabel('Number of Reviews')

In [None]:
%matplot plt

#### Now let's uninstall the `pandas` package using the `uninstall_package` PySpark API

In [None]:
sc.uninstall_package('pandas')

In [None]:
sc.list_packages()

#### Exploring dataframes using `%%display` magic 
Let's analyze the distribution of star ratings and visualize it using a pie chart.

In [None]:
%%display
df.groupBy('star_rating').count().orderBy('count')
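
A pie chart shows each rating's share of the total rather than its raw count. As a minimal sketch of that conversion, the snippet below turns rating counts into percentages in plain Python. The `sample_star_ratings` values and the `rating_shares` helper are hypothetical and used only for illustration; they are not part of the dataset or of the `%%display` magic.

```python
from collections import Counter

# Hypothetical star ratings standing in for the 'star_rating' column.
sample_star_ratings = [5, 4, 5, 3, 5, 1, 4, 5, 2, 5]


def rating_shares(ratings):
    """Return each rating's percentage share of all ratings, keyed by rating."""
    counts = Counter(ratings)
    total = sum(counts.values())
    return {star: 100.0 * n / total for star, n in sorted(counts.items())}


shares = rating_shares(sample_star_ratings)
for star, pct in shares.items():
    print(f"{star} stars: {pct:.1f}%")
```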