# Columns and Rows of a DataFrame

```{note}
This section introduces the basic concepts and operations for columns and rows in a Spark DataFrame.
```

## 列

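A column in a Spark DataFrame is conceptually similar to a column in a pandas DataFrame. In PySpark it is represented by a `Column` object, which describes an expression over the data rather than holding the values themselves.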

In [1]:
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("ColumnAndRow")
         .getOrCreate())
# First, read in the data
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("../data/mnm_dataset.csv"))

In [2]:
# List the column names
df.columns

['State', 'Color', 'Count']

In [3]:
# Access a particular column
df['State']

Column<b'State'>

In [4]:
from pyspark.sql.functions import expr, col

# Compute on a column using expr()
df.select(expr("Count * 2")).show(2)

+-----------+
|(Count * 2)|
+-----------+
|         40|
|        132|
+-----------+
only showing top 2 rows



In [5]:
# Compute on a column using col()
df.select(col("Count") * 2).show(2)

+-----------+
|(Count * 2)|
+-----------+
|         40|
|        132|
+-----------+
only showing top 2 rows



## 行

A row in Spark is a generic Row object containing one or more columns. Its fields can be accessed by position, starting at index 0.

In [6]:
from pyspark.sql import Row

blog_row = Row(6, "Reynold", "Xin", ["twitter", "LinkedIn"])
# access using index for individual items
blog_row[1]

'Reynold'

In [7]:
rows = [Row("Matei Zaharia", "CA"), Row("Reynold Xin", "CA")]
# Create a DataFrame from a list of Row objects
authors_df = spark.createDataFrame(rows, ["Authors", "State"])
authors_df.show()

+-------------+-----+
|      Authors|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   CA|
+-------------+-----+

