# Fictitious Names

### Introduction:

In this exercise you will again create DataFrames from raw data, then combine them with unions and joins.

Special thanks to [Chris Albon](http://chrisalbon.com/) for sharing the dataset and materials.
All credit for this exercise belongs to him.  

For a visual explanation of how SQL joins work, see [this article](https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/).

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # namespaced import; a wildcard would shadow builtins such as sum and max

In [2]:
spark = SparkSession.builder.appName("PySparkPrac").getOrCreate()


### Step 2. Create the 3 DataFrames based on the following raw data

In [4]:
raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

### Step 3. Assign each to a variable called data1, data2, data3

In [6]:
rows = list(zip(raw_data_1['subject_id'], raw_data_1['first_name'], raw_data_1['last_name']))
data1 = spark.createDataFrame(rows, ["subject_id","first_name", "last_name"])

In [7]:
print(rows)

[('1', 'Alex', 'Anderson'), ('2', 'Amy', 'Ackerman'), ('3', 'Allen', 'Ali'), ('4', 'Alice', 'Aoni'), ('5', 'Ayoung', 'Atiches')]


In [8]:
rows = list(zip(raw_data_2['subject_id'], raw_data_2['first_name'], raw_data_2['last_name']))
data2 = spark.createDataFrame(rows, ["subject_id","first_name", "last_name"])

In [9]:
rows = list(zip(raw_data_3['subject_id'], raw_data_3['test_id']))
data3 = spark.createDataFrame(rows, ["subject_id","test_id"])
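
The three cells above repeat the same `zip` pattern with the column names typed out each time. As a minimal sketch, the pattern can be wrapped in a small helper; `columns_to_rows` and `raw_example` are hypothetical names introduced here for illustration and are not part of the original exercise.

```python
def columns_to_rows(raw):
    """Turn a column-oriented dict into (column_names, row_tuples)."""
    names = list(raw.keys())
    # zip(*columns) transposes the per-column lists into one tuple per row
    rows = list(zip(*(raw[name] for name in names)))
    return names, rows


# Hypothetical two-row sample in the same shape as raw_data_1
raw_example = {
    'subject_id': ['1', '2'],
    'first_name': ['Alex', 'Amy'],
    'last_name': ['Anderson', 'Ackerman'],
}

names, rows = columns_to_rows(raw_example)
print(names)  # ['subject_id', 'first_name', 'last_name']
print(rows)   # [('1', 'Alex', 'Anderson'), ('2', 'Amy', 'Ackerman')]
```

The returned pair has the same shape as the arguments already used in this step, so it could be passed on as `spark.createDataFrame(rows, names)`.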

In [10]:
data1.show()
data2.show()
data3.show()

+----------+----------+---------+
|subject_id|first_name|last_name|
+----------+----------+---------+
|         1|      Alex| Anderson|
|         2|       Amy| Ackerman|
|         3|     Allen|      Ali|
|         4|     Alice|     Aoni|
|         5|    Ayoung|  Atiches|
+----------+----------+---------+

+----------+----------+---------+
|subject_id|first_name|last_name|
+----------+----------+---------+
|         4|     Billy|   Bonder|
|         5|     Brian|    Black|
|         6|      Bran|  Balwner|
|         7|     Bryce|    Brice|
|         8|     Betty|   Btisan|
+----------+----------+---------+

+----------+-------+
|subject_id|test_id|
+----------+-------+
|         1|     51|
|         2|     15|
|         3|     15|
|         4|     61|
|         5|     16|
|         7|     14|
|         8|     15|
|         9|      1|
|        10|     61|
|        11|     16|
+----------+-------+



### Step 4. Join data1 and data2 along rows and assign the result to all_data

In [11]:
all_data = data1.unionByName(data2)
all_data.show()

+----------+----------+---------+
|subject_id|first_name|last_name|
+----------+----------+---------+
|         1|      Alex| Anderson|
|         2|       Amy| Ackerman|
|         3|     Allen|      Ali|
|         4|     Alice|     Aoni|
|         5|    Ayoung|  Atiches|
|         4|     Billy|   Bonder|
|         5|     Brian|    Black|
|         6|      Bran|  Balwner|
|         7|     Bryce|    Brice|
|         8|     Betty|   Btisan|
+----------+----------+---------+
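
`unionByName` lines columns up by name rather than by position, and it keeps every row, which is why subject_id 4 and 5 each appear twice above. The plain-Python sketch below models only the by-name alignment, with no Spark involved; `union_by_name` is a hypothetical helper and the shuffled column order of the second input is an assumption made up for the illustration.

```python
def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Stack rows_b under rows_a, reordering each rows_b tuple to cols_a's order."""
    if set(cols_a) != set(cols_b):
        raise ValueError("column names differ")
    # For each target column, find where it sits in the second input
    order = [cols_b.index(c) for c in cols_a]
    return rows_a + [tuple(row[i] for i in order) for row in rows_b]


cols_a = ['subject_id', 'first_name', 'last_name']
rows_a = [('5', 'Ayoung', 'Atiches')]
cols_b = ['last_name', 'subject_id', 'first_name']  # hypothetical shuffled order
rows_b = [('Bonder', '4', 'Billy')]

stacked = union_by_name(cols_a, rows_a, cols_b, rows_b)
print(stacked)  # [('5', 'Ayoung', 'Atiches'), ('4', 'Billy', 'Bonder')]
```

Here data1 and data2 already share the same column order, so a positional `union` would give the same result; matching by name is simply the safer default when that is not guaranteed.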



### Step 5. Join data1 and data2 along columns and assign the result to all_data_col

In [13]:
all_data_col = data1.join(data2, "subject_id", "inner")
all_data_col.show()

+----------+----------+---------+----------+---------+
|subject_id|first_name|last_name|first_name|last_name|
+----------+----------+---------+----------+---------+
|         4|     Alice|     Aoni|     Billy|   Bonder|
|         5|    Ayoung|  Atiches|     Brian|    Black|
+----------+----------+---------+----------+---------+
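
Both inputs carry `first_name` and `last_name`, so the joined result has two pairs of identically named columns, and referring to either one by name afterwards is ambiguous. The sketch below reproduces the same inner join in plain Python and tells the columns apart with suffixes. The helper `inner_join_suffixed` and the `_1`/`_2` suffixes are assumptions chosen for illustration, not something the original exercise uses.

```python
# Same records as data1 and data2, keyed by subject_id
left = {'1': ('Alex', 'Anderson'), '2': ('Amy', 'Ackerman'), '3': ('Allen', 'Ali'),
        '4': ('Alice', 'Aoni'), '5': ('Ayoung', 'Atiches')}
right = {'4': ('Billy', 'Bonder'), '5': ('Brian', 'Black'), '6': ('Bran', 'Balwner'),
         '7': ('Bryce', 'Brice'), '8': ('Betty', 'Btisan')}


def inner_join_suffixed(left, right, suffixes=('_1', '_2')):
    """Inner join on the key, suffixing column names so they stay distinct."""
    rows = []
    # Only keys present on both sides survive an inner join
    for key in sorted(left.keys() & right.keys()):
        rows.append({
            'subject_id': key,
            'first_name' + suffixes[0]: left[key][0],
            'last_name' + suffixes[0]: left[key][1],
            'first_name' + suffixes[1]: right[key][0],
            'last_name' + suffixes[1]: right[key][1],
        })
    return rows


joined = inner_join_suffixed(left, right)
for row in joined:
    print(row)
```

In PySpark, one way to get a similar effect is to rename the overlapping columns with `withColumnRenamed` on one or both DataFrames before calling `join`.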



### Step 6. Print data3

In [14]:
data3.show()

+----------+-------+
|subject_id|test_id|
+----------+-------+
|         1|     51|
|         2|     15|
|         3|     15|
|         4|     61|
|         5|     16|
|         7|     14|
|         8|     15|
|         9|      1|
|        10|     61|
|        11|     16|
+----------+-------+



### Step 7. Merge all_data and data3 along the subject_id value

In [15]:
merge3 = all_data.join(data3, "subject_id", "inner")
merge3.show()

+----------+----------+---------+-------+
|subject_id|first_name|last_name|test_id|
+----------+----------+---------+-------+
|         1|      Alex| Anderson|     51|
|         2|       Amy| Ackerman|     15|
|         3|     Allen|      Ali|     15|
|         4|     Alice|     Aoni|     61|
|         4|     Billy|   Bonder|     61|
|         5|    Ayoung|  Atiches|     16|
|         5|     Brian|    Black|     16|
|         7|     Bryce|    Brice|     14|
|         8|     Betty|   Btisan|     15|
+----------+----------+---------+-------+
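
The row count of this result can be checked by hand: an inner join emits one row for every pairing of a left row and a right row that share a key. The sketch below uses plain Python on the subject_id values shown earlier; variable names such as `expected_rows` are introduced here only for the illustration.

```python
from collections import Counter

# subject_id values of all_data (Step 4) and data3 (Step 6)
all_data_ids = ['1', '2', '3', '4', '5', '4', '5', '6', '7', '8']
data3_ids = ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11']

left_counts = Counter(all_data_ids)
right_counts = Counter(data3_ids)

# Each shared key contributes left_count * right_count rows; a Counter
# returns 0 for a missing key, so unmatched keys contribute nothing.
expected_rows = sum(left_counts[k] * right_counts[k] for k in left_counts)

# Keys that exist on only one side are dropped by the inner join
dropped_left = sorted(k for k in left_counts if k not in right_counts)
dropped_right = sorted((k for k in right_counts if k not in left_counts), key=int)

print(expected_rows)   # 9
print(dropped_left)    # ['6']
print(dropped_right)   # ['9', '10', '11']
```

This matches the nine rows shown above: subject_id 4 and 5 each appear twice because all_data holds two people under each of those ids, while 6, 9, 10 and 11 have no partner on the other side.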



### Step 8. Merge only the data that has the same 'subject_id' in both data1 and data2

In [16]:
all_data_col = data1.join(data2, "subject_id", "inner")
all_data_col.show()

+----------+----------+---------+----------+---------+
|subject_id|first_name|last_name|first_name|last_name|
+----------+----------+---------+----------+---------+
|         4|     Alice|     Aoni|     Billy|   Bonder|
|         5|    Ayoung|  Atiches|     Brian|    Black|
+----------+----------+---------+----------+---------+



### Step 9. Merge all values in data1 and data2, with matching records from both sides where available.

In [17]:
all_data_mer = data1.join(data2, "subject_id", "outer")
all_data_mer.show()


+----------+----------+---------+----------+---------+
|subject_id|first_name|last_name|first_name|last_name|
+----------+----------+---------+----------+---------+
|         1|      Alex| Anderson|      NULL|     NULL|
|         2|       Amy| Ackerman|      NULL|     NULL|
|         3|     Allen|      Ali|      NULL|     NULL|
|         4|     Alice|     Aoni|     Billy|   Bonder|
|         5|    Ayoung|  Atiches|     Brian|    Black|
|         6|      NULL|     NULL|      Bran|  Balwner|
|         7|      NULL|     NULL|     Bryce|    Brice|
|         8|      NULL|     NULL|     Betty|   Btisan|
+----------+----------+---------+----------+---------+
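
The NULL pattern in this result follows directly from which side each subject_id comes from. A short set-based sketch in plain Python makes the three groups explicit; the variable names are chosen here for illustration.

```python
ids_1 = {'1', '2', '3', '4', '5'}  # subject_id values in data1
ids_2 = {'4', '5', '6', '7', '8'}  # subject_id values in data2

left_only = sorted(ids_1 - ids_2)   # data2 columns are NULL for these rows
both = sorted(ids_1 & ids_2)        # fully populated rows
right_only = sorted(ids_2 - ids_1)  # data1 columns are NULL for these rows

# Keys are unique within each input, so a full outer join yields one row per key
total_rows = len(ids_1 | ids_2)

print(left_only, both, right_only, total_rows)
# ['1', '2', '3'] ['4', '5'] ['6', '7', '8'] 8
```

The `both` group is exactly the result of the inner join in Step 8, so the outer join can be read as that result plus the unmatched rows from each side.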

