# Schema

```{note}
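For large datasets, we generally do not let Spark infer column types automatically; instead, we define the structure of the data explicitly with a schema.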
```

## Spark data types

To define a schema, you first need to know which data types Spark supports.

Basic data types:

![jupyter](../images/type1.jpg)

Complex data types:

![jupyter](../images/type2.jpg)

## Defining a schema

In [1]:
# Create our static data
data = [["Xia", "Deep Play", 99],
        ["Ronaldo", "Basic Football", 9999],
        ["Wang", "How to Earn 100M", 73281]]

In [2]:
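from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# First way: build the schema programmatically
# Three fields (author, title, pages) of type string, string and integer
# The third argument, `False`, means the field cannot contain null values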
schema = StructType([StructField("author", StringType(), False),
                     StructField("title", StringType(), False),
                     StructField("pages", IntegerType(), False)])

In [3]:
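# Second way: a DDL-style string, as in a database
# Note: this reassigns `schema`, so the cells below use this definition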
schema = "author STRING, title STRING, pages INT"

## Using a schema to specify a DataFrame's structure

In [4]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = (SparkSession
         .builder
         .appName("schema_example")
         .getOrCreate())
# Create a DataFrame using the schema defined above
df = spark.createDataFrame(data, schema)
# Show the DataFrame; it should reflect the data defined above
df.show()

+-------+----------------+-----+
| author|           title|pages|
+-------+----------------+-----+
|    Xia|       Deep Play|   99|
|Ronaldo|  Basic Football| 9999|
|   Wang|How to Earn 100M|73281|
+-------+----------------+-----+



In [5]:
# Print the schema (fields show as nullable because the string schema from In [3] is in effect)
df.printSchema()

root
 |-- author: string (nullable = true)
 |-- title: string (nullable = true)
 |-- pages: integer (nullable = true)

