<a href="https://colab.research.google.com/github/naman-DA/PySpark_Project/blob/main/PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pyspark



In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('Pyspark').getOrCreate()

In [4]:
spark

In [10]:
df = spark.read.csv('/content/sample_data/marks.csv', header = True, inferSchema = True)

1. Display Top 3 Rows of the Dataset

In [11]:
df.show(3)

+-------+-----+------+
|   Name|Marks|Gender|
+-------+-----+------+
|Priyang|   98|  Male|
| Aadhya|   89|Female|
| Krisha|   99|Female|
+-------+-----+------+
only showing top 3 rows



2. Display Datatype of Each Column

In [12]:
df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Marks: integer (nullable = true)
 |-- Gender: string (nullable = true)



3. Display Column Names

In [13]:
df.columns

['Name', 'Marks', 'Gender']

4. Count the Number of Rows and Columns in the Dataset

In [14]:
df.count()

7

In [15]:
len(df.columns)

3

5. Get Overall Statistics about the dataset

In [16]:
df.describe().show()

+-------+------+------------------+------+
|summary|  Name|             Marks|Gender|
+-------+------+------------------+------+
|  count|     7|                 7|     7|
|   mean|  NULL| 89.71428571428571|  NULL|
| stddev|  NULL|6.6761836831702395|  NULL|
|    min|Aadhya|                82|Female|
|    max|Vedant|                99|  Male|
+-------+------+------------------+------+



6. Find unique values available in the Gender column

In [17]:
df.toPandas()['Gender'].unique()

array(['Male', 'Female'], dtype=object)

7. Find the total number of unique values available in the Gender column

In [18]:
len(df.toPandas()['Gender'].unique())

2

In [19]:
df.show()

+-------+-----+------+
|   Name|Marks|Gender|
+-------+-----+------+
|Priyang|   98|  Male|
| Aadhya|   89|Female|
| Krisha|   99|Female|
| Vedant|   87|  Male|
| Parshv|   90|  Male|
| Mittal|   83|  Male|
|Archana|   82|Female|
+-------+-----+------+



8. How to select a Single column?

In [21]:
df.select('Name').show()

+-------+
|   Name|
+-------+
|Priyang|
| Aadhya|
| Krisha|
| Vedant|
| Parshv|
| Mittal|
|Archana|
+-------+



9. How to select Multiple columns?

In [22]:
df.select(['Name', 'Gender']).show()

+-------+------+
|   Name|Gender|
+-------+------+
|Priyang|  Male|
| Aadhya|Female|
| Krisha|Female|
| Vedant|  Male|
| Parshv|  Male|
| Mittal|  Male|
|Archana|Female|
+-------+------+



10. Create a new column with Marks+1 and update the existing DataFrame


In [25]:
df = df.withColumn('New_Marks', df.Marks+1)

In [26]:
df.show()

+-------+-----+------+---------+
|   Name|Marks|Gender|New_Marks|
+-------+-----+------+---------+
|Priyang|   98|  Male|       99|
| Aadhya|   89|Female|       90|
| Krisha|   99|Female|      100|
| Vedant|   87|  Male|       88|
| Parshv|   90|  Male|       91|
| Mittal|   83|  Male|       84|
|Archana|   82|Female|       83|
+-------+-----+------+---------+



11. Rename the Name column to "Student_Name"

In [29]:
df = df.withColumnRenamed('Name', 'Student_Name')

In [30]:
df.show()

+------------+-----+------+---------+
|Student_Name|Marks|Gender|New_Marks|
+------------+-----+------+---------+
|     Priyang|   98|  Male|       99|
|      Aadhya|   89|Female|       90|
|      Krisha|   99|Female|      100|
|      Vedant|   87|  Male|       88|
|      Parshv|   90|  Male|       91|
|      Mittal|   83|  Male|       84|
|     Archana|   82|Female|       83|
+------------+-----+------+---------+



In [31]:
df.filter(df['Marks']>90).select('Student_Name')

DataFrame[Student_Name: string]

In [34]:
df.filter(df['Marks']>90).select(['Student_Name', 'Gender']).show()

+------------+------+
|Student_Name|Gender|
+------------+------+
|     Priyang|  Male|
|      Krisha|Female|
+------------+------+



In [36]:
df.filter((df['Marks']>90) & (df['Gender'] == 'Female')).select('Student_Name').show()

+------------+
|Student_Name|
+------------+
|      Krisha|
+------------+



In [37]:
df.filter((df['Marks']>90) & (df['Gender'] == 'Male')).select('Student_Name').show()

+------------+
|Student_Name|
+------------+
|     Priyang|
+------------+



In [39]:
df.groupby('Gender').mean().select(['Gender', 'avg(Marks)']).show()

+------+----------+
|Gender|avg(Marks)|
+------+----------+
|Female|      90.0|
|  Male|      89.5|
+------+----------+



In [40]:
df.columns

['Student_Name', 'Marks', 'Gender', 'New_Marks']

In [43]:
df.orderBy(df['Marks'].desc()).show()

+------------+-----+------+---------+
|Student_Name|Marks|Gender|New_Marks|
+------------+-----+------+---------+
|      Krisha|   99|Female|      100|
|     Priyang|   98|  Male|       99|
|      Parshv|   90|  Male|       91|
|      Aadhya|   89|Female|       90|
|      Vedant|   87|  Male|       88|
|      Mittal|   83|  Male|       84|
|     Archana|   82|Female|       83|
+------------+-----+------+---------+



In [44]:
df.show()

+------------+-----+------+---------+
|Student_Name|Marks|Gender|New_Marks|
+------------+-----+------+---------+
|     Priyang|   98|  Male|       99|
|      Aadhya|   89|Female|       90|
|      Krisha|   99|Female|      100|
|      Vedant|   87|  Male|       88|
|      Parshv|   90|  Male|       91|
|      Mittal|   83|  Male|       84|
|     Archana|   82|Female|       83|
+------------+-----+------+---------+



In [45]:
from pyspark.sql.functions import mean

In [46]:
mean1 = df.select(mean(df['Marks'])).collect()

In [48]:
mean1[0][0]

89.71428571428571

In [49]:
df.fillna(mean1[0][0]).show()

+------------+-----+------+---------+
|Student_Name|Marks|Gender|New_Marks|
+------------+-----+------+---------+
|     Priyang|   98|  Male|       99|
|      Aadhya|   89|Female|       90|
|      Krisha|   99|Female|      100|
|      Vedant|   87|  Male|       88|
|      Parshv|   90|  Male|       91|
|      Mittal|   83|  Male|       84|
|     Archana|   82|Female|       83|
+------------+-----+------+---------+

