
Find the top N most frequent words in a large text file
Problem Explanation:
You are given a large text file. Your task is to count how frequently each word appears and
return the top N most common words.

In [48]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, explode, split

# Create (or reuse) the Spark session
spark = SparkSession.builder.appName("TopNWords").getOrCreate()

In [49]:
# Load the text file; each line becomes one row in a single string column named `value`
df = spark.read.text('/Mar 03 4_42 PM.txt')
df.show()

+--------------------+
|               value|
+--------------------+
|   🔴Start zookeeper|
|                    |
|                    |
|Go inside extract...|
|Go inside bin folder|
|Open cmd in that ...|
|execute below com...|
|                    |
|          .\zkserver|
|                    |
|                    |
|                    |
|     🔴Start   Kafka|
|                    |
|Extract and Go In...|
|            Open cmd|
+--------------------+
only showing top 20 rows



In [50]:
# Split each line on runs of whitespace and emit one row per word
df = df.withColumn("Words", explode(split("value", r"\s+"))).drop("value")


In [51]:
df.show()

+--------------------+
|               Words|
+--------------------+
|             🔴Start|
|           zookeeper|
|                    |
|                    |
|                  Go|
|              inside|
|           extracted|
|           zookeeper|
|                  Go|
|              inside|
|                 bin|
|              folder|
|                Open|
|                 cmd|
|                  in|
|                that|
|              Folder|
|             execute|
+--------------------+
only showing top 20 rows



In [55]:
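# Drop empty tokens, count each word, and sort by frequency.
# The result gets its own name so that re-running this cell does not
# aggregate an already aggregated DataFrame.
word_counts = (
    df.filter(col("Words") != "")
    .groupBy("Words")
    .agg(count("*").alias("count"))
    .orderBy(col("count").desc())
)

In [56]:
word_counts.show(10)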

+--------------------+-----+
|               Words|count|
+--------------------+-----+
|              folder|    1|
|             --topic|    1|
|               using|    1|
|             Extract|    1|
|         --zookeeper|    1|
|                 for|    1|
|                 bin|    1|
|               below|    1|
|      localhost:9092|    1|
|kafka-console-con...|    1|
+--------------------+-----+
only showing top 10 rows
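
Every count in the table above is 1, even though the earlier Words listing shows zookeeper, Go and inside more than once. One plausible cause (an assumption; the notebook does not record it) is that the aggregation was applied to a DataFrame that had already been aggregated, for example by re-running a cell that reassigns df to its own grouped result. The effect can be sketched in plain Python, with collections.Counter standing in for groupBy and count; the word list is a hypothetical sample:

```python
from collections import Counter

# Hypothetical sample of exploded words (not the original file's contents).
words = ["Go", "inside", "zookeeper", "Go", "inside", "zookeeper", "bin"]

# First aggregation: one entry per distinct word with its true frequency.
first_pass = Counter(words)

# Second aggregation over the already grouped keys: every word now occurs once.
second_pass = Counter(first_pass.keys())

print(first_pass.most_common(3))
print(sorted(second_pass.values()))
```

Keeping the exploded words and the aggregated counts under separate variable names avoids this, because the aggregation then always starts from the raw words.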



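Explanation:
•  split breaks each line into an array of words on whitespace.
•  explode turns each element of that array into its own row.
•  groupBy("Words") with count("*") computes each word's frequency; groupBy("Words").count() is an equivalent shorthand.
•  orderBy(col("count").desc()) sorts the words by frequency, most frequent first.
•  limit(n) keeps only the top N rows; the cells above use show(10) to display the top 10 instead.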