# **Manipulating Data**

When manipulating data in Spark, one could register a DataFrame as a temporary view and then use `spark.sql()` to run SQL queries against it. However, the Spark DataFrame object offers much of the same functionality through its own methods.

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

In [2]:
sc = SparkContext()
spark = SparkSession.builder.appName('example app').getOrCreate()

In [3]:
df = spark.read.csv("data/countries.csv", header=True, inferSchema=True)
df.show(5)

+--------------+---------+----+------------+-------------+
|          name|continent|code|surface_area|geosize_group|
+--------------+---------+----+------------+-------------+
|   Afghanistan|     Asia| AFG|      652090|       medium|
|   Netherlands|   Europe| NLD|       41526|        small|
|       Albania|   Europe| ALB|       28748|        small|
|       Algeria|   Africa| DZA|     2381740|        large|
|American Samoa|  Oceania| ASM|         199|        small|
+--------------+---------+----+------------+-------------+
only showing top 5 rows



## Select Data
To select columns, use the `.select()` method. It accepts either
* strings containing the names of the columns to select
* Spark `Column` objects representing the columns to select

In [4]:
# these two operations are equivalent
data = df.select(df.name, df.continent)
data = df.select('name', 'continent')

data.show(5)

+--------------+---------+
|          name|continent|
+--------------+---------+
|   Afghanistan|     Asia|
|   Netherlands|   Europe|
|       Albania|   Europe|
|       Algeria|   Africa|
|American Samoa|  Oceania|
+--------------+---------+
only showing top 5 rows



## Filter Data
To filter rows, use the `.filter()` method. It accepts either
* a string containing an expression that would normally follow the `WHERE` clause in SQL
* a Spark `Column` expression that evaluates to a boolean

In [5]:
# these two operations are equivalent
filtered_data = df.filter('continent = "Asia"')
filtered_data = df.filter(df.continent == 'Asia')

filtered_data.show(5)

+--------------------+---------+----+------------+-------------+
|                name|continent|code|surface_area|geosize_group|
+--------------------+---------+----+------------+-------------+
|         Afghanistan|     Asia| AFG|      652090|       medium|
|United Arab Emirates|     Asia| ARE|       83600|        small|
|             Armenia|     Asia| ARM|       29800|        small|
|          Azerbaijan|     Asia| AZE|       86600|        small|
|             Bahrain|     Asia| BHR|         694|        small|
+--------------------+---------+----+------------+-------------+
only showing top 5 rows



## Adding or Updating Columns

### Using `.withColumn()`
To add a column, use the `.withColumn()` method. The first argument is the name of the new column; the second is a Spark `Column` object.

If a column with that name already exists in the DataFrame, it is replaced with the new values.

Note that the returned DataFrame always contains every column of the original DataFrame, the only change being the new or updated column.

In [6]:
# update existing column
df.withColumn('surface_area', df.surface_area+1).show(5)

# add new column
df.withColumn('adjusted_surface_area', df.surface_area+1).show(5)

+--------------+---------+----+------------+-------------+
|          name|continent|code|surface_area|geosize_group|
+--------------+---------+----+------------+-------------+
|   Afghanistan|     Asia| AFG|    652091.0|       medium|
|   Netherlands|   Europe| NLD|     41527.0|        small|
|       Albania|   Europe| ALB|     28749.0|        small|
|       Algeria|   Africa| DZA|   2381741.0|        large|
|American Samoa|  Oceania| ASM|       200.0|        small|
+--------------+---------+----+------------+-------------+
only showing top 5 rows

+--------------+---------+----+------------+-------------+---------------------+
|          name|continent|code|surface_area|geosize_group|adjusted_surface_area|
+--------------+---------+----+------------+-------------+---------------------+
|   Afghanistan|     Asia| AFG|      652090|       medium|             652091.0|
|   Netherlands|   Europe| NLD|       41526|        small|              41527.0|
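|       Albania|   Europe| ALB|       28748|        small|              28749.0|
|       Algeria|   Africa| DZA|     2381740|        large|            2381741.0|
|American Samoa|  Oceania| ASM|         199|        small|                200.0|
+--------------+---------+----+------------+-------------+---------------------+
only showing top 5 rows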

### Using `.select()`
Alternatively, the `.select()` method may be used in combination with `.alias()`. In this case `.select()` takes the new Spark `Column` expression as an argument, and `.alias()` takes the name of the new column.

Note that with this approach only the columns passed to `.select()` are returned. This can be advantageous, since columns that are no longer needed are not carried along.

In [7]:
df.select(
    'name',
    'continent',
    (df.surface_area+1).alias('adjusted_surface_area')
).show(5)

+--------------+---------+---------------------+
|          name|continent|adjusted_surface_area|
+--------------+---------+---------------------+
|   Afghanistan|     Asia|             652091.0|
|   Netherlands|   Europe|              41527.0|
|       Albania|   Europe|              28749.0|
|       Algeria|   Africa|            2381741.0|
|American Samoa|  Oceania|                200.0|
+--------------+---------+---------------------+
only showing top 5 rows



### Using `.selectExpr()`
Finally, the same result can be achieved with the `.selectExpr()` method, which takes one or more SQL expressions as strings, written as they would appear in the column list of a SQL `SELECT` statement.

In [8]:
df.selectExpr(
    'name',
    'continent',
    'surface_area + 1 AS adjusted_surface_area'
).show(5)

+--------------+---------+---------------------+
|          name|continent|adjusted_surface_area|
+--------------+---------+---------------------+
|   Afghanistan|     Asia|             652091.0|
|   Netherlands|   Europe|              41527.0|
|       Albania|   Europe|              28749.0|
|       Algeria|   Africa|            2381741.0|
|American Samoa|  Oceania|                200.0|
+--------------+---------+---------------------+
only showing top 5 rows



## Aggregating Data

As in SQL and pandas, aggregate values are computed by first grouping the data. The `.groupBy()` method returns a `GroupedData` object, which provides common aggregation methods such as
* `sum()`
* `min()`
* `max()`
* `count()`

In [9]:
from pyspark.sql.functions import avg

# prepare data
continents = [
    'Europe', 
    'Asia', 
    'Africa', 
    'Oceania', 
    'North America', 
    'South America'
]
data = (
    df
    .withColumn('surface_area', df.surface_area.cast('double'))
    .filter(df.continent.isin(continents))
)

# these two expressions are equivalent
result = (
    data
    .groupBy('continent')
    .avg('surface_area')
    .withColumnRenamed('avg(surface_area)', 'average_surface_area')
)
result = (
    data
    .groupBy('continent')
    .agg(
        avg('surface_area').alias('average_surface_area')  # use imported avg() function
    )
)

result.show()


+-------------+--------------------+
|    continent|average_surface_area|
+-------------+--------------------+
|       Europe|   539193.9880952381|
|       Africa|   531466.1923076923|
|North America|   834824.9655172414|
|South America|  1480229.0833333333|
|      Oceania|   475701.8888888889|
|         Asia|   649590.7346938775|
+-------------+--------------------+



## Joining Data

Two DataFrames can be joined using the `.join()` method. It takes three arguments:
* `other`: the other DataFrame
* `on`: what to join on, given as a column name (string), a list of column names, or a join expression (a Spark `Column`)
* `how`: the type of join (inner, outer, full, cross, etc.)

In [10]:
countries = spark.read.csv("data/countries.csv", header=True, inferSchema=True)
cities = spark.read.csv("data/cities.csv", header=True, inferSchema=True)

countries.show(3)
cities.show(3)

+-----------+---------+----+------------+-------------+
|       name|continent|code|surface_area|geosize_group|
+-----------+---------+----+------------+-------------+
|Afghanistan|     Asia| AFG|      652090|       medium|
|Netherlands|   Europe| NLD|       41526|        small|
|    Albania|   Europe| ALB|       28748|        small|
+-----------+---------+----+------------+-------------+
only showing top 3 rows

+---------+------------+---------------+-------------+-------------+
|     name|country_code|city_proper_pop|metroarea_pop|urbanarea_pop|
+---------+------------+---------------+-------------+-------------+
|  Abidjan|         CIV|        4765000|         null|      4765000|
|Abu Dhabi|         ARE|        1145000|         null|      1145000|
|    Abuja|         NGA|        1235880|      6000000|      1235880|
+---------+------------+---------------+-------------+-------------+
only showing top 3 rows



In [11]:
countries = countries.withColumnRenamed('name', 'country_name')
cities = cities.withColumnRenamed('name', 'city_name')

result = cities.join(
    countries, 
    on=(countries.code == cities.country_code),
    how='inner',
).select('city_name', 'country_name', 'continent')

result.show(5)

+-----------+--------------------+---------+
|  city_name|        country_name|continent|
+-----------+--------------------+---------+
|    Abidjan|       Cote d'Ivoire|   Africa|
|  Abu Dhabi|United Arab Emirates|     Asia|
|      Abuja|             Nigeria|   Africa|
|      Accra|               Ghana|   Africa|
|Addis Ababa|            Ethiopia|   Africa|
+-----------+--------------------+---------+
only showing top 5 rows

