# PySpark DataFrames
For each instruction, type the appropriate code into the cell below the instruction. Then, run the code by pressing the `Run` button above.

1. Import the SparkSession class:

```
from pyspark.sql import SparkSession
```

In [2]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
Note: you may need to restart the kernel to use updated packages.

In [3]:
from pyspark.sql import SparkSession

2. Use this class to instantiate a Spark session:

```
spark = SparkSession \
    .builder \
    .appName("My First PySpark App") \
    .getOrCreate()
```

In [4]:
spark = SparkSession \
    .builder \
    .appName("My First PySpark App") \
    .getOrCreate()

3. Take a look at the session object:

```
spark
```

In [5]:
spark

In [10]:
sc = spark.sparkContext

4. Read the contents of a CSV file into a DataFrame named `accounts`:

```
accounts = spark.read.option('header', 'true').csv('./data/accounts.csv')
```

In [6]:
accounts = spark.read.option('header', 'true').csv('./accounts.csv')

5. Take a look at the DataFrame's schema:

```
accounts.printSchema()
```

In [13]:
accounts.printSchema()

root
 |-- account_number: string (nullable = true)
 |-- aba: string (nullable = true)
 |-- bic: string (nullable = true)
 |-- opened: string (nullable = true)
 |-- balance: string (nullable = true)



In [12]:
numbers = sc.parallelize(list(range(15)))

6. Read the contents of a Parquet file into a DataFrame named `transactions`:

```
transactions = spark.read.parquet('./data/transactions.parquet')
```

In [14]:
transactions = spark.read.parquet('./transactions.parquet')

7. See how many rows are in the new DataFrame:

```
transactions.count()
```

In [15]:
transactions.count()

1000000

8. Make a new DataFrame by grouping the transactions by account number and summing each group. This combines the transactions per account:

```
account_transactions = transactions.groupby('account_number').sum()
```

9. Combine the accounts with the summed transaction values:

```
with_sum = accounts.join(account_transactions, 'account_number', 'inner')
```

In [17]:
with_sum = accounts.join(transactions, 'account_number', 'inner')

In [21]:
with_sum.columns

['account_number', 'aba', 'bic', 'opened', 'balance', 'amount', 'datetime']

10. Get the current balance per account by adding the summed transactions to the initial account balance:

```
accounts = with_sum.withColumn('new_balance', sum([with_sum.balance, with_sum['sum(amount)']]))
```

In [23]:
accounts = with_sum.withColumn('new_balance', sum([with_sum.balance, with_sum.amount]))

11. Get accounts with negative current balances:

```
neg_balance = accounts.filter(accounts.new_balance < 0)
```

In [24]:
neg_balance = accounts.filter(accounts.new_balance < 0)

12. Read client data from a JSON file:

```
clients = spark.read.json('./data/clients.json')
```

In [25]:
clients = spark.read.json('./clients.json')

13. Get the clients with a negative balance:

```
clients = clients.join(neg_balance, 'account_number', 'inner')
```

In [27]:
clients = clients.join(neg_balance,'account_number','inner')

14. Look at the first five clients with negative balances:

```
clients.select(['first_name', 'last_name', 'account_number', 'new_balance']).show(5)
```

In [28]:
clients.select(['first_name', 'last_name', 'account_number', 'new_balance']).show(5)

+----------+---------+------------------+-----------+
|first_name|last_name|    account_number|new_balance|
+----------+---------+------------------+-----------+
|    Jeremy|     Lane|KIJB69632401909582|    -1945.0|
|     Donna|    Johns|SSPR63654758238509|     -493.0|
|     Riley|    Green|VWRD72378313231126|    -3097.0|
|    Andres|     Leon|BNBJ15185037618068|     -403.0|
|   Jeffery|   Weaver|OIFO27569232151018|    -4699.0|
+----------+---------+------------------+-----------+
only showing top 5 rows

