## Count Distinct from DataFrame
You can get the distinct count of a PySpark DataFrame either by chaining `distinct().count()` on the DataFrame or by using the `countDistinct()` SQL function.

`distinct()` eliminates duplicate records (rows that match on all columns) from the DataFrame, and `count()` returns the number of records in it. Chaining the two gives the distinct count of a PySpark DataFrame.

`countDistinct()` is a SQL function that returns the distinct count over the selected columns.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
         .appName('SparkByExamples.com') \
         .getOrCreate()

data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)
    ]
    
columns = ["Name","Dept","Salary"]
df = spark.createDataFrame(data=data,schema=columns)
df.show()


+-------+---------+------+
|   Name|     Dept|Salary|
+-------+---------+------+
|  James|    Sales|  3000|
|Michael|    Sales|  4600|
| Robert|    Sales|  4100|
|  Maria|  Finance|  3000|
|  James|    Sales|  3000|
|  Scott|  Finance|  3300|
|    Jen|  Finance|  3900|
|   Jeff|Marketing|  3000|
|  Kumar|Marketing|  2000|
|   Saif|    Sales|  4100|
+-------+---------+------+



### Using DataFrame distinct() and count()
The DataFrame above has 10 rows in total, one of which duplicates another row in every column, so a distinct count (`distinct().count()`) on it should return 9.

In [2]:
print(f"Distinct Count: {df.distinct().count()}")

Distinct Count: 9


### Using countDistinct() SQL Function
DataFrame `distinct()` returns a new DataFrame after eliminating duplicate rows (distinct on all columns). If you want the distinct count on selected columns, use the PySpark SQL function `countDistinct()`. This function returns the number of distinct combinations of the given columns within a group; when there is no `groupBy()`, the whole DataFrame is treated as a single group.

To use this function, import it first.

In [3]:
from pyspark.sql.functions import countDistinct

df2=df.select(countDistinct("Dept", "salary"))
df2.show()

+----------------------------+
|count(DISTINCT Dept, salary)|
+----------------------------+
|                           8|
+----------------------------+



Note that `countDistinct()` returns a Column expression, so `select()` gives back a one-row DataFrame rather than a plain number; you need to collect that row to get the value out. The function accepts any number of columns, so it can return the distinct count of a selection of columns or of all of them.

In [4]:
print(f"Distinct Count of Dept & Salary: {df2.collect()[0][0]}")

Distinct Count of Dept & Salary: 8


### Using SQL to get Count Distinct
The same count is available through Spark SQL: register the DataFrame as a temporary view, then query that view with `count(distinct ...)`.

In [5]:
df.createOrReplaceTempView("EMP")
df.show()
spark.sql("select count(distinct(*)) from EMP").show()

+-------+---------+------+
|   Name|     Dept|Salary|
+-------+---------+------+
|  James|    Sales|  3000|
|Michael|    Sales|  4600|
| Robert|    Sales|  4100|
|  Maria|  Finance|  3000|
|  James|    Sales|  3000|
|  Scott|  Finance|  3300|
|    Jen|  Finance|  3900|
|   Jeff|Marketing|  3000|
|  Kumar|Marketing|  2000|
|   Saif|    Sales|  4100|
+-------+---------+------+

+----------------------------------+
|count(DISTINCT Name, Dept, Salary)|
+----------------------------------+
|                                 9|
+----------------------------------+

