# Advanced type-checking using linters
## Functions that do not affect the schema

Several `DataSet` methods do not affect the schema. For example:

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [3]:
from typedspark import Column, Schema, DataSet, create_partially_filled_dataset
from pyspark.sql.types import StringType

class A(Schema):
    a: Column[StringType]

df = create_partially_filled_dataset(spark, A, {A.a: ["a", "b", "c"]})
res = df.filter(A.a == "a")

In the above example, `filter()` does not change the schema, so `DataSet.filter()` is annotated to return a `DataSet` of the same `Schema` that you started with. In other words, a linter will see that `res` is of the type `DataSet[A]`.

This allows you to skip casting steps in many cases and instead define functions as:

In [4]:
def foo(df: DataSet[A]) -> DataSet[A]:
    return df.filter(A.a == "a")

The functions for which this is currently implemented include:

* `filter()`
* `distinct()`
* `orderBy()`
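
To illustrate the idea behind these return types, here is a minimal sketch using only the standard library. It is not typedspark's implementation: `MiniDataSet`, its `rows` attribute, and the marker class `A` are hypothetical stand-ins, and rows are plain dictionaries rather than Spark rows. It only shows how a generic class can carry its schema parameter through methods that leave the columns untouched.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")  # stands in for the schema type parameter


@dataclass
class MiniDataSet(Generic[T]):
    """Hypothetical stand-in for a typed dataset; not typedspark's DataSet."""

    rows: List[dict]

    def filter(self, predicate: Callable[[dict], bool]) -> "MiniDataSet[T]":
        # Rows may be dropped, but the columns are unchanged, so T is preserved.
        return MiniDataSet(rows=[r for r in self.rows if predicate(r)])

    def distinct(self) -> "MiniDataSet[T]":
        # Remove duplicate rows while keeping first-seen order.
        seen: List[dict] = []
        for row in self.rows:
            if row not in seen:
                seen.append(row)
        return MiniDataSet(rows=seen)


class A:
    """Hypothetical schema marker with a single column `a`."""


df: MiniDataSet[A] = MiniDataSet(rows=[{"a": "a"}, {"a": "b"}, {"a": "a"}])

# A type checker infers MiniDataSet[A] for `res` without any cast.
res = df.filter(lambda row: row["a"] == "a").distinct()
```

Because both methods are annotated with the same `T` they were called on, chained calls keep the schema type all the way through.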

## Functions applied to two DataSets of the same schema

Similarly, some functions return a `DataSet[A]` when they take two `DataSet[A]` objects as input. For example, here a linter will see that `res` is of the type `DataSet[A]`.

In [5]:
df_a = create_partially_filled_dataset(spark, A, {A.a: ["a", "b", "c"]})
df_b = create_partially_filled_dataset(spark, A, {A.a: ["d", "e", "f"]})

res = df_a.unionByName(df_b)

The functions in this category include:

* `unionByName()`
* `join(..., how="semi")`
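
The two-operand case can be sketched in the same spirit. Again, this is an assumption-laden illustration rather than typedspark's code: `MiniDataSet`, `union_by_name()`, and `semi_join()` are hypothetical names, and the operations work on lists of dictionaries. The point is only that annotating both operands and the result with the same type variable lets a linter carry the schema through.

```python
from typing import Generic, List, TypeVar

T = TypeVar("T")  # stands in for the schema type parameter


class MiniDataSet(Generic[T]):
    """Hypothetical stand-in for a typed dataset; not typedspark's DataSet."""

    def __init__(self, rows: List[dict]) -> None:
        self.rows = rows

    def union_by_name(self, other: "MiniDataSet[T]") -> "MiniDataSet[T]":
        # Both operands share T, so the combined result carries T as well.
        return MiniDataSet(self.rows + other.rows)

    def semi_join(self, other: "MiniDataSet[T]", on: str) -> "MiniDataSet[T]":
        # Keep only left-hand rows that have a match on the right.
        # No columns are added, so the left-hand schema is preserved.
        keys = {row[on] for row in other.rows}
        return MiniDataSet([row for row in self.rows if row[on] in keys])


class A:
    """Hypothetical schema marker with a single column `a`."""


df_a: MiniDataSet[A] = MiniDataSet([{"a": "a"}, {"a": "b"}])
df_b: MiniDataSet[A] = MiniDataSet([{"a": "b"}, {"a": "c"}])

unioned = df_a.union_by_name(df_b)      # inferred as MiniDataSet[A]
matched = df_a.semi_join(df_b, on="a")  # inferred as MiniDataSet[A]
```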

## Transformations

Finally, the `transform()` function can also be typed: its return type follows the return type of the function you pass in. In the following example, a linter will see that `res` is of the type `DataSet[B]`.

In [6]:
from typedspark import transform_to_schema
from pyspark.sql.functions import lit

class B(A):
    b: Column[StringType]

def foo(df: DataSet[A]) -> DataSet[B]:
    return transform_to_schema(
        df,
        B,
        {
            B.b: lit("hi")
        }
    )

res = (
    create_partially_filled_dataset(spark, A, {A.a: ["a", "b", "c"]})
    .transform(foo)
)
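
One way such a `transform()` signature could be expressed is with a second type variable for the callback's return type. The sketch below is standard-library only and makes no claim about how typedspark declares it: `MiniDataSet`, `add_b()`, and the marker classes `A` and `B` are hypothetical.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")  # schema of the dataset that transform() is called on
R = TypeVar("R")  # whatever the supplied function returns


class MiniDataSet(Generic[T]):
    """Hypothetical stand-in for a typed dataset; not typedspark's DataSet."""

    def __init__(self, rows: List[dict]) -> None:
        self.rows = rows

    def transform(self, func: "Callable[[MiniDataSet[T]], R]") -> R:
        # The result type is taken from func's own return annotation.
        return func(self)


class A:
    """Hypothetical schema marker with column `a`."""


class B(A):
    """Hypothetical schema marker with columns `a` and `b`."""


def add_b(df: MiniDataSet[A]) -> MiniDataSet[B]:
    # Add a constant column `b` to every row.
    return MiniDataSet([{**row, "b": "hi"} for row in df.rows])


df: MiniDataSet[A] = MiniDataSet([{"a": "a"}, {"a": "b"}])
res = df.transform(add_b)  # a type checker infers MiniDataSet[B]
```

With this shape, changing the return annotation of `add_b()` is enough to change what the linter reports for `res`.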

## Did we miss anything?

There are likely more functions that we have not covered yet. Feel free to open an issue when you find one!