# Spark DataFrames

We'll be covering some basic methods used when working with Spark DataFrames.

First, use Seaborn to load the Titanic dataset and convert it to a Spark DataFrame:

```python
titanic_df = sns.load_dataset('titanic')
df_t = spark.createDataFrame(titanic_df)
```

Alternatively, we can load a DataFrame from a file. If you've uploaded the Titanic dataset (`train.csv`), use the path to that file, for example `/FileStore/tables/train.csv`.


Here's the code for loading `train.csv` into a DataFrame:

```python
df = spark.read.csv("/FileStore/tables/train.csv", header=True, inferSchema=True)
```
  
This is analogous to pandas' `pd.read_csv()` function, albeit a bit more involved:
- `spark`: The `SparkSession`, Spark's entry point for working with structured data (like a CSV)
- `read.csv("/FileStore/tables/train.csv")`: We're loading a CSV and specifying its location
- `header=True`: Specifies that the first row of our CSV is a header row
- `inferSchema=True`: Unlike pandas, Spark does not infer the data type of each column by default; without this option, every column is read as a string. We either have to set the data types manually or enable this option

Run the sample code above in the following cell to load in our DataFrame.

In [None]:
import seaborn as sns
import pyspark
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.getOrCreate()

In [None]:
# df_t for Seaborn Titanic Data


In [None]:
# df for train.csv Titanic Data



# `df.show()`

`df.show()` is similar to pandas' `df.head()` method with a couple of exceptions:
1. `.show()` will print the first 20 rows
2. **It is only a print statement**. `.show()` returns `None`, so if you assign its result to a variable, you will not get a DataFrame.

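`.show()` has a few parameters:
1. `n` (int): the number of rows to show; defaults to 20
2. `truncate` (boolean): whether to cut off strings longer than 20 characters; defaults to `True`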

You can view the `df_t` DataFrame the same way with `df_t.show()`.

What do you notice about `df` and `df_t`?

# Column Data

`df_t.columns` & `df.columns` return a list of strings, one for each column's name.

`df_t.printSchema()` & `df.printSchema()` allow us to see each column and its associated data type.

`df_t.dtypes` & `df.dtypes` return a list of tuples, each tuple containing the column's name and data type.

# EXTRA CHALLENGE:

How would you compare, contrast, or combine the data from these two dataframes?

# Challenge
Use `df.dtypes` & `df_t.dtypes` to create a list of the numerical columns (names only).

# Summary statistics
`df.describe()` provides summary statistics, similar to its pandas counterpart. `.describe()` returns a DataFrame, so you'll need to use `.show()` if you want to see the results.

# Select

Rather than looking at the entire dataset, `df.select()` allows us to view a subset of the columns. Simply pass in comma-separated strings as arguments.

# Filtering

`df.filter()` accepts a SQL expression string as a parameter. For example, to get all 3rd class passengers who embarked from Queenstown, we'd write the following:
```python
df.filter("Pclass = 3 AND Embarked = 'Q'")
```

Create a filter that returns all **survivors** from 3rd class.

# Filtering: Null values

Recall in SQL we use `WHERE col IS NULL` when looking for null values. Use that same logic to find everyone whose port of embarkation is null.

# Imputing null values

`df.na.fill()` is similar to `df.fillna()` in pandas. `.fill()` accepts either a string or a number, and will impute all string or numerical columns respectively.

In [None]:
# Impute both string and numerical null values

# Renaming columns

`df.withColumnRenamed()` is used for renaming columns:
```python
df.withColumnRenamed("OldCol", "NewCol")
```

We'll use this method quite a bit. By default, Spark's machine learning models expect two columns in order to work: `"features"` and `"label"`. In the cell below, rename the `"Survived"` column to `"label"`.


# Creating new columns

`df.withColumn()` is used for creating new columns. The first parameter is the name of the new column, and the second is a column expression that defines its values.

```python
df.withColumn("NewCol", df.columnAsProperty)
```

In the cell below, create a column called "FamilyCount" which is equal to `df.SibSp + df.Parch`


# SQL functions

SQL functions are for performing some sort of calculation or transformation on our DataFrames. For example, finding the survival rate in our dataset requires that we first import the `avg` function:

```python
from pyspark.sql.functions import avg, count, corr
```

And then apply it to the label column:
```python
df.select(avg("label"))
```

# Challenge

Calculate the [correlation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.corr) between the label column and Pclass.

# Group By

Spark DataFrames have a `df.groupBy()` method similar to the one in pandas. It is typically chained with an aggregate function (see previous section) like so:

```python
df.groupBy("Pclass").agg(avg("label")).show()
```

In the cell below, use the `.groupBy()` method to calculate two things:
1. The survival rate for each port of embarkation
2. How many passengers (regardless of survival) were from each port

# Challenges

Using what we've learned, see how many challenges you can tackle:

1. How many people were under the age of 30?
2. What is the survival rate from 2nd class?
3. How many null cabins are there?
4. What is the correlation between FamilyCount and label?
5. Create a list of all string columns

In [None]:
# How many people were under the age of 30?

In [None]:
# What is the survival rate from 2nd class?

In [None]:
# How many null cabins are there?

In [None]:
# What is the correlation between FamilyCount and label?

In [None]:
# Create a list of all string columns