## Creating Spark DataFrames

A PySpark SQL DataFrame is a distributed collection of data organized into named columns. Under the hood, DataFrames are built on top of RDDs.

### `rdd.toDF()`
The `toDF()` method converts an RDD to a DataFrame. It works on an RDD of `Row` objects, tuples, or lists, and it accepts an optional list of column names.

In [None]:
# Create an RDD from a list
hrly_views_rdd = spark.sparkContext.parallelize([
    ["Betty_White", 288886],
    ["Main_Page", 139564],
    ["New_Year's_Day", 7892],
    ["ABBA", 8154]
])

# Convert RDD to DataFrame
hrly_views_df = hrly_views_rdd\
    .toDF(["article_title", "view_count"])

### `DataFrame.show()`

The `show()` method is used to display the content of the DataFrame. By default, it shows the first 20 rows.

In [None]:
hrly_views_df.show(4, truncate=False)

```text
+--------------+-----------+
| article_title| view_count|
+--------------+-----------+
|   Betty_White|     288886|
|     Main_Page|     139564|
|New_Year's_Day|       7892|
|          ABBA|       8154|
+--------------+-----------+
```

### `DataFrame.rdd`

The `rdd` attribute returns the content of a DataFrame as an RDD of `Row` objects.

In [None]:
# Access DataFrame's underlying RDD
hrly_views_df_rdd = hrly_views_df.rdd

# Check object type
print(type(hrly_views_df_rdd)) 
# <class 'pyspark.rdd.RDD'>

## Spark DataFrames from External Data Sources

In [None]:
print(type(spark.read)) 
# <class 'pyspark.sql.readwriter.DataFrameReader'>

# Read a space-delimited file with a header row into a DataFrame
hrly_views_df = spark.read\
    .option('header', True)\
    .option('delimiter', ' ')\
    .option('inferSchema', True)\
    .csv('views_2022_01_01_000000.csv')

## Inspecting and Cleaning Data with PySpark

### `DataFrame.printSchema()`

The `printSchema()` method is used to print the schema of the DataFrame.

In [None]:
# Display DataFrame schema
hrly_views_df.printSchema()

```text
root
 |-- language_code: string (nullable = true)
 |-- article_title: string (nullable = true)
 |-- hourly_count: integer (nullable = true)
 |-- monthly_count: integer (nullable = true)
```

### `DataFrame.describe()`

The `describe()` method computes summary statistics (count, mean, standard deviation, minimum, and maximum) for the numeric and string columns of a DataFrame.

In [None]:
hrly_views_df_desc = hrly_views_df.describe()
hrly_views_df_desc.show(truncate=False)

```text
+-------+-------------+-------------+------------+-------------+
|summary|language_code|article_title|hourly_count|monthly_count|
+-------+-------------+-------------+------------+-------------+
|  count|      4654091|      4654091|     4654091|      4654091|
|   mean|         null|         null|     4.52417|          0.0|
| stddev|         null|         null|   182.92502|          0.0|
|    min|           aa|            -|           1|            0|
|    max|       zu.m.d|            -|      288886|            0|
+-------+-------------+-------------+------------+-------------+
```

### `DataFrame.drop()`

The `drop()` method returns a new DataFrame without the specified column(s).

In [None]:
# Drop `monthly_count` and display new DataFrame
hrly_views_df = hrly_views_df.drop('monthly_count')
hrly_views_df.show(5, truncate=False)

```text
+-------------+---------------------------+------------+
|language_code|article_title              |hourly_count|
+-------------+---------------------------+------------+
|en           |Cividade_de_Terroso        |           2|
|en           |Peel_Session_(Autechre_EP) |           2|
|en           |Young_Street_Bridge        |           1|
|en           |Troy,_Alabama              |           1|
|en           |Charlotte_Johnson_Wahl     |          10|
+-------------+---------------------------+------------+
```

### `DataFrame.withColumnRenamed()`

The `withColumnRenamed()` method returns a new DataFrame with an existing column renamed.

In [None]:
# Rename `article_title` to `page_title`
hrly_views_df = hrly_views_df\
    .withColumnRenamed('article_title', 'page_title')

# Display DataFrame schema
hrly_views_df.printSchema()

```text
root
 |-- language_code: string (nullable = true)
 |-- page_title: string (nullable = true)
 |-- hourly_count: integer (nullable = true)
```