---

<center><h1>  Creating Dataframes  </h1></center>

---

In this notebook, we will learn how to create dataframes. A dataframe can be created using any of the following methods:

 - Using an RDD
 - Using Collections
 - Using a CSV File


---


#### `Importing the Required Libraries`




---

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.getOrCreate()
spark

----
---

#### `Spark DataFrame from an RDD`


----

In [3]:
sc = spark.sparkContext

In [4]:
# Load the data
students_data_rdd = sc.textFile('data/module_8_students_data.txt')

In [5]:
# Check the data type
type(students_data_rdd)

pyspark.rdd.RDD

In [6]:
students_data_rdd.collect()

['101 A Rohit Gurgaon 65 77 43 66 87',
 '102 B Akansha Delhi 55 46 24 66 77',
 '103 A Himanshu Faridabad 75 38 84 38 58',
 '104 A Ekta Delhi 85 84 39 58 85',
 '105 B Deepanshu Gurgaon 34 55 56 23 66',
 '106 B Ayush Delhi 66 62 98 74 87',
 '107 B Aditi Delhi 76 83 75 38 58',
 '108 A Sahil Faridabad 55 32 43 56 66',
 '109 A Krati Delhi 34 53 25 67 75']

In [7]:
# Tokenize records
students_data_rdd = students_data_rdd.map(lambda x: x.split(' '))

In [8]:
students_data_rdd.collect()

[['101', 'A', 'Rohit', 'Gurgaon', '65', '77', '43', '66', '87'],
 ['102', 'B', 'Akansha', 'Delhi', '55', '46', '24', '66', '77'],
 ['103', 'A', 'Himanshu', 'Faridabad', '75', '38', '84', '38', '58'],
 ['104', 'A', 'Ekta', 'Delhi', '85', '84', '39', '58', '85'],
 ['105', 'B', 'Deepanshu', 'Gurgaon', '34', '55', '56', '23', '66'],
 ['106', 'B', 'Ayush', 'Delhi', '66', '62', '98', '74', '87'],
 ['107', 'B', 'Aditi', 'Delhi', '76', '83', '75', '38', '58'],
 ['108', 'A', 'Sahil', 'Faridabad', '55', '32', '43', '56', '66'],
 ['109', 'A', 'Krati', 'Delhi', '34', '53', '25', '67', '75']]

In [9]:
# Create dataframe from rdd
students_data_dataframe = students_data_rdd.toDF(["roll_no",
                                                  "section",
                                                  "name",
                                                  "city",
                                                  "subject1",
                                                  "subject2",
                                                  "subject3",
                                                  "subject4",
                                                  "subject5"])

In [10]:
# Display dataframe
students_data_dataframe.show()

+-------+-------+---------+---------+--------+--------+--------+--------+--------+
|roll_no|section|     name|     city|subject1|subject2|subject3|subject4|subject5|
+-------+-------+---------+---------+--------+--------+--------+--------+--------+
|    101|      A|    Rohit|  Gurgaon|      65|      77|      43|      66|      87|
|    102|      B|  Akansha|    Delhi|      55|      46|      24|      66|      77|
|    103|      A| Himanshu|Faridabad|      75|      38|      84|      38|      58|
|    104|      A|     Ekta|    Delhi|      85|      84|      39|      58|      85|
|    105|      B|Deepanshu|  Gurgaon|      34|      55|      56|      23|      66|
|    106|      B|    Ayush|    Delhi|      66|      62|      98|      74|      87|
|    107|      B|    Aditi|    Delhi|      76|      83|      75|      38|      58|
|    108|      A|    Sahil|Faridabad|      55|      32|      43|      56|      66|
|    109|      A|    Krati|    Delhi|      34|      53|      25|      67|      75|
+-------+-------+---------+---------+--------+--------+--------+--------+--------+

---
---

#### `Spark DataFrame from Collections`

---

In [11]:
# Collection
sample_data = [
    (101, "A", "Rohit",    "Gurugram"),
    (102, "B", "Akansha",  "Delhi"),
    (103, "A", "Himanshu", "Faridabad"),
    (104, "A", "Ekta",     "Delhi"),
    (105, "B", "Ayush",    "Delhi")
]

In [12]:
# Create dataframe from collection
dataframe_from_collections = spark.createDataFrame(data=sample_data,
                                                   schema=["roll_no",
                                                           "section",
                                                           "name",
                                                           "city"])

In [13]:
# Display dataframe
dataframe_from_collections.show()

+-------+-------+--------+---------+
|roll_no|section|    name|     city|
+-------+-------+--------+---------+
|    101|      A|   Rohit| Gurugram|
|    102|      B| Akansha|    Delhi|
|    103|      A|Himanshu|Faridabad|
|    104|      A|    Ekta|    Delhi|
|    105|      B|   Ayush|    Delhi|
+-------+-------+--------+---------+



---
---

#### `Spark DataFrame from CSV file`


---

In [14]:
# Import required libraries
import pyspark.sql.types as tp

In [15]:
# Define the schema for the dataframe
my_schema = tp.StructType([
    tp.StructField(name= "roll_no", dataType= tp.IntegerType()),
    tp.StructField(name= "section", dataType= tp.StringType()),
    tp.StructField(name= "name",    dataType= tp.StringType()),
    tp.StructField(name= "city",    dataType= tp.StringType()),
    tp.StructField(name= "subject1",dataType= tp.IntegerType()),
    tp.StructField(name= "subject2",dataType= tp.IntegerType()),
    tp.StructField(name= "subject3",dataType= tp.IntegerType()),
    tp.StructField(name= "subject4",dataType= tp.IntegerType()),
    tp.StructField(name= "subject5",dataType= tp.IntegerType()),
])

In [16]:
# Create dataframe
df_csv_schema = spark.read.csv("data/module_8_students_data.csv",
                               header=False,
                               schema=my_schema)

In [17]:
# Check type
type(df_csv_schema)

pyspark.sql.dataframe.DataFrame

In [18]:
# Display dataframe
df_csv_schema.show()

+-------+-------+---------+---------+--------+--------+--------+--------+--------+
|roll_no|section|     name|     city|subject1|subject2|subject3|subject4|subject5|
+-------+-------+---------+---------+--------+--------+--------+--------+--------+
|    101|      A|    Rohit|  Gurgaon|      65|      77|      43|      66|      87|
|    102|      B|  Akansha|    Delhi|      55|      46|      24|      66|      77|
|    103|      A| Himanshu|Faridabad|      75|      38|      84|      38|      58|
|    104|      A|     Ekta|    Delhi|      85|      84|      39|      58|      85|
|    105|      B|Deepanshu|  Gurgaon|      34|      55|      56|      23|      66|
|    106|      B|    Ayush|    Delhi|      66|      62|      98|      74|      87|
|    107|      B|    Aditi|    Delhi|      76|      83|      75|      38|      58|
|    108|      A|    Sahil|Faridabad|      55|      32|      43|      56|      66|
|    109|      A|    Krati|    Delhi|      34|      53|      25|      67|      75|
+-------+-------+---------+---------+--------+--------+--------+--------+--------+

---
---

#### `Spark DataFrame from CSV file - with inferSchema`


---

In [19]:
# Create dataframe with inferSchema
df_csv_infer = spark.read.csv("data/module_8_students_data.csv",
                              header=False,
                              inferSchema=True)

In [20]:
# Check type
type(df_csv_infer)

pyspark.sql.dataframe.DataFrame

In [21]:
# Display dataframe
df_csv_infer.show()

+---+---+---------+---------+---+---+---+---+---+
|_c0|_c1|      _c2|      _c3|_c4|_c5|_c6|_c7|_c8|
+---+---+---------+---------+---+---+---+---+---+
|101|  A|    Rohit|  Gurgaon| 65| 77| 43| 66| 87|
|102|  B|  Akansha|    Delhi| 55| 46| 24| 66| 77|
|103|  A| Himanshu|Faridabad| 75| 38| 84| 38| 58|
|104|  A|     Ekta|    Delhi| 85| 84| 39| 58| 85|
|105|  B|Deepanshu|  Gurgaon| 34| 55| 56| 23| 66|
|106|  B|    Ayush|    Delhi| 66| 62| 98| 74| 87|
|107|  B|    Aditi|    Delhi| 76| 83| 75| 38| 58|
|108|  A|    Sahil|Faridabad| 55| 32| 43| 56| 66|
|109|  A|    Krati|    Delhi| 34| 53| 25| 67| 75|
+---+---+---------+---------+---+---+---+---+---+



---
---

#### `Rename DataFrame columns`
---

In [22]:
# Rename columns method 1
df_csv_infer2 = df_csv_infer.withColumnRenamed("_c0","roll_no")\
                            .withColumnRenamed("_c1","section")\
                            .withColumnRenamed("_c2","name")\
                            .withColumnRenamed("_c3","city")\
                            .withColumnRenamed("_c4","section_1")\
                            .withColumnRenamed("_c5","section_2")\
                            .withColumnRenamed("_c6","section_3")\
                            .withColumnRenamed("_c7","section_4")\
                            .withColumnRenamed("_c8","section_5")

In [23]:
# Display dataframe
df_csv_infer2.show()

+-------+-------+---------+---------+---------+---------+---------+---------+---------+
|roll_no|section|     name|     city|section_1|section_2|section_3|section_4|section_5|
+-------+-------+---------+---------+---------+---------+---------+---------+---------+
|    101|      A|    Rohit|  Gurgaon|       65|       77|       43|       66|       87|
|    102|      B|  Akansha|    Delhi|       55|       46|       24|       66|       77|
|    103|      A| Himanshu|Faridabad|       75|       38|       84|       38|       58|
|    104|      A|     Ekta|    Delhi|       85|       84|       39|       58|       85|
|    105|      B|Deepanshu|  Gurgaon|       34|       55|       56|       23|       66|
|    106|      B|    Ayush|    Delhi|       66|       62|       98|       74|       87|
|    107|      B|    Aditi|    Delhi|       76|       83|       75|       38|       58|
|    108|      A|    Sahil|Faridabad|       55|       32|       43|       56|       66|
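|    109|      A|    Krati|    Delhi|       34|       53|       25|       67|       75|
+-------+-------+---------+---------+---------+---------+---------+---------+---------+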

In [24]:
# Rename columns method 2
column_names = ["roll_no","section","name","city","section_1","section_2","section_3","section_4","section_5"]

In [25]:
# Rename columns
df_csv_infer3 = df_csv_infer.toDF(*column_names)

In [26]:
# Display dataframe
df_csv_infer3.show()

+-------+-------+---------+---------+---------+---------+---------+---------+---------+
|roll_no|section|     name|     city|section_1|section_2|section_3|section_4|section_5|
+-------+-------+---------+---------+---------+---------+---------+---------+---------+
|    101|      A|    Rohit|  Gurgaon|       65|       77|       43|       66|       87|
|    102|      B|  Akansha|    Delhi|       55|       46|       24|       66|       77|
|    103|      A| Himanshu|Faridabad|       75|       38|       84|       38|       58|
|    104|      A|     Ekta|    Delhi|       85|       84|       39|       58|       85|
|    105|      B|Deepanshu|  Gurgaon|       34|       55|       56|       23|       66|
|    106|      B|    Ayush|    Delhi|       66|       62|       98|       74|       87|
|    107|      B|    Aditi|    Delhi|       76|       83|       75|       38|       58|
|    108|      A|    Sahil|Faridabad|       55|       32|       43|       56|       66|
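|    109|      A|    Krati|    Delhi|       34|       53|       25|       67|       75|
+-------+-------+---------+---------+---------+---------+---------+---------+---------+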