## Topics covered in the video
* PySpark Dataframe
* Reading the dataset
* Checking the data types of the columns (schema)
* Selecting Columns and indexing
* Using describe() for summary statistics, similar to Pandas
* Adding Columns
* Dropping Columns

(Starts at 15:21 in the video)

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark=SparkSession.builder.appName('Dataframe').getOrCreate()

In [4]:
spark

In [5]:
## Read the dataset
### inferSchema=True makes Spark infer each column's data type from the data.
### Without it (the default is False), every column is read as a string.
df_pyspark = spark.read.option('header','true').csv('test1.csv',inferSchema=True)

In [6]:
### Check the schema
### Stopped at 20:14
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



In [7]:
## List the column names
df_pyspark.columns

['Name', 'Age', 'Experience']

In [15]:
# picking out one or multiple columns
df_pyspark.select(['Name','Age']).show()

+------+---+
|  Name|Age|
+------+---+
|Graeme| 26|
|  Rach| 25|
|  Noah| 28|
|Taylor| 32|
+------+---+



In [17]:
# Jupyter tip: Esc then B inserts a new cell below the selected one
# Indexing by name returns a Column object, not the column's data
df_pyspark['Name']

Column<'Name'>

In [18]:
# list the data type of each column
df_pyspark.dtypes

[('Name', 'string'), ('Age', 'int'), ('Experience', 'int')]

In [19]:
# describe() returns summary statistics, much like pandas (mean and stddev are NULL for string columns)
df_pyspark.describe().show()

+-------+------+-----------------+-----------------+
|summary|  Name|              Age|       Experience|
+-------+------+-----------------+-----------------+
|  count|     4|                4|                4|
|   mean|  NULL|            27.75|              6.0|
| stddev|  NULL|3.095695936834452|3.366501646120693|
|    min|Graeme|               25|                2|
|    max|Taylor|               32|               10|
+-------+------+-----------------+-----------------+



### Adding columns to a data frame

In [22]:
# withColumn is not an in-place operation: it returns a new DataFrame, so assign the result to keep it
df_pyspark = df_pyspark.withColumn('Experience After 2 year',df_pyspark['Experience'] + 2)

In [23]:
df_pyspark.show()

+------+---+----------+-----------------------+
|  Name|Age|Experience|Experience After 2 year|
+------+---+----------+-----------------------+
|Graeme| 26|        10|                     12|
|  Rach| 25|         2|                      4|
|  Noah| 28|         5|                      7|
|Taylor| 32|         7|                      9|
+------+---+----------+-----------------------+



### Dropping columns from a data frame

In [25]:
dropped_df = df_pyspark.drop('Experience After 2 year')

In [26]:
dropped_df.show()

+------+---+----------+
|  Name|Age|Experience|
+------+---+----------+
|Graeme| 26|        10|
|  Rach| 25|         2|
|  Noah| 28|         5|
|Taylor| 32|         7|
+------+---+----------+



### Renaming Columns

In [29]:
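# withColumnRenamed returns a new DataFrame; the result is not assigned here, so dropped_df itself keeps 'Name'
dropped_df.withColumnRenamed('Name', 'New Name')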

DataFrame[New Name: string, Age: int, Experience: int]

In [30]:
dropped_df.show()

+------+---+----------+
|  Name|Age|Experience|
+------+---+----------+
|Graeme| 26|        10|
|  Rach| 25|         2|
|  Noah| 28|         5|
|Taylor| 32|         7|
+------+---+----------+

