# Dataset

In this notebook we explore the [Climate Change: Earth Surface Temperature Data](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data/) dataset from [Kaggle](https://www.kaggle.com). It records average temperatures by country, city, or state, with measurements going back to the mid-18th century.

The dataset ships as several CSV files, each providing a different view. We use the file "GlobalLandTemperaturesByState.csv". The original file contains **645675** records; after keeping only the US states, the dataset shrinks to **149745** records.
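The selection step described above can be sketched with the standard library. This is only an illustration, not the preprocessing code actually used: the function name `select_us_rows` is hypothetical, the country label `"United States"` is an assumption about how the original file spells it, and the sample rows are made up.

```python
import csv
import io

# Made-up rows for illustration only; they are NOT real measurements.
SAMPLE_CSV = """dt,AverageTemperature,AverageTemperatureUncertainty,State,Country
2000-01-01,1.0,0.1,Alabama,United States
2000-01-01,2.0,0.2,Acre,Brazil
2000-02-01,3.0,0.3,Wyoming,United States
"""


def select_us_rows(csv_text, country="United States"):
    """Keep the rows for one country and drop the Country column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        if row["Country"] == country:
            # Country is constant after filtering, so it carries no information.
            kept.append({k: v for k, v in row.items() if k != "Country"})
    return kept


us_rows = select_us_rows(SAMPLE_CSV)
print(len(us_rows), [row["State"] for row in us_rows])
```

On the full file, the same filter would presumably be applied with Spark rather than the `csv` module; the logic is the same.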


| Field                         | Type    | Description                                    |
|-------------------------------|---------|------------------------------------------------|
| dt                            | Date    | Date of measurement, in YYYY-MM-DD format      |
| AverageTemperature            | Float   | The measured average temperature               |
| AverageTemperatureUncertainty | Float   | The statistical uncertainty of the measurement |
| State                         | String  | US State                                       |
| part_year                     | Integer | Parquet partition key                          |
| part_month                    | Integer | Parquet partition key                          |

Please note that, as highlighted in `README.md`, the data has already been through an initial preprocessing step. That step introduced helper fields that are not available in the original dataset (`part_year` and `part_month`) and dropped at least one field (`Country`).
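The table lists `part_year` and `part_month` only as Parquet partition keys. One plausible reading, which is an assumption and not something this notebook states, is that they are simply the year and month of `dt`. A minimal sketch of that derivation, using a hypothetical helper named `partition_keys`:

```python
from datetime import datetime


def partition_keys(dt):
    """Return (part_year, part_month) for a YYYY-MM-DD date string.

    Assumption: the partition keys are the year and month of `dt`.
    """
    parsed = datetime.strptime(dt, "%Y-%m-%d")
    return parsed.year, parsed.month


print(partition_keys("1743-11-01"))  # (1743, 11)
```

In PySpark the equivalent columns could be derived with `fn.year` and `fn.month` from `pyspark.sql.functions`.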


# Imports 

In [1]:
import sys
sys.path.append("../config")

import config

In [2]:
import pyspark.sql.functions as fn

# Load data 

In [3]:
df_globaltemp = spark.read.parquet(f"{config.ARTIFACTS}/sample_global_temperatures/")

In [4]:
df = df_globaltemp

# Summary

In [5]:
df.describe().toPandas().transpose()
# df.summary().toPandas().transpose()

| summary                       | count  | mean               | stddev             | min     | max     |
|-------------------------------|--------|--------------------|--------------------|---------|---------|
| AverageTemperature            | 141930 | 10.701555371823831 | 10.225132186418435 | -28.788 | 32.905  |
| AverageTemperatureUncertainty | 141930 | 1.269450264340304  | 1.406615631149933  | 0.036   | 10.354  |
| State                         | 149745 |                    |                    | Alabama | Wyoming |
| part_year                     | 149745 | 1888.1068015626565 | 74.68817799577329  | 1743    | 2013    |
| part_month                    | 149745 | 6.49834719022338   | 3.451493899095775  | 1       | 12      |
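In the summary, `count` for `AverageTemperature` is lower than `count` for `State`. Since `describe()` counts only non-null values, the difference is the number of rows with a missing temperature. The arithmetic, using the figures from the output above:

```python
# Counts copied from the `count` column of the summary output.
total_rows = 149745       # count for State, which has no nulls
non_null_temps = 141930   # count for AverageTemperature

missing_temps = total_rows - non_null_temps
missing_pct = 100 * missing_temps / total_rows

print(missing_temps)          # 7815
print(f"{missing_pct:.2f}%")  # 5.22%
```

The same count matches for `AverageTemperatureUncertainty`, which suggests both fields are missing on the same rows, though the summary alone does not prove it.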


# SQL 

In [6]:
df.createOrReplaceTempView("globaltemp")

In [13]:
q = spark.sql("""
    SELECT
        State,
        COUNT(*) AS count
    FROM
        globaltemp
    GROUP BY
        State
    ORDER BY
        State ASC
""")

In [14]:
q.show(n=51)

+--------------------+-----+
|               State|count|
+--------------------+-----+
|             Alabama| 3239|
|              Alaska| 2229|
|             Arizona| 2145|
|            Arkansas| 3067|
|          California| 1977|
|            Colorado| 2325|
|         Connecticut| 3239|
|            Delaware| 3239|
|District Of Columbia| 3239|
|             Florida| 3239|
|     Georgia (State)| 3239|
|              Hawaii| 1569|
|               Idaho| 2303|
|            Illinois| 3239|
|             Indiana| 3239|
|                Iowa| 3239|
|              Kansas| 2941|
|            Kentucky| 3239|
|           Louisiana| 3067|
|               Maine| 3239|
|            Maryland| 3239|
|       Massachusetts| 3239|
|            Michigan| 3239|
|           Minnesota| 3239|
|         Mississippi| 3067|
|            Missouri| 3239|
|             Montana| 2941|
|            Nebraska| 2941|
|              Nevada| 2224|
|       New Hampshire| 3239|
|          New Jersey| 3239|
|          New

In [17]:
q = spark.sql("""
    SELECT
        State,
        MIN(AverageTemperature) AS MinAvgTemperature,
        MAX(AverageTemperature) AS MaxAvgTemperature,
        AVG(AverageTemperature) AS AvgAvgTemperature
    FROM
        globaltemp
    GROUP BY
        State
    ORDER BY
        State
""")

In [18]:
q.show()

+--------------------+-----------------+-----------------+------------------+
|               State|MinAvgTemperature|MaxAvgTemperature| AvgAvgTemperature|
+--------------------+-----------------+-----------------+------------------+
|             Alabama|            0.137|           32.289| 17.06613786076651|
|              Alaska|          -28.788|           14.112| -4.89073757418071|
|             Arizona|           -0.829|           29.006|15.381526118352898|
|            Arkansas|           -3.176|           29.833| 15.57396252189778|
|          California|             0.53|           26.279|14.327677270349493|
|            Colorado|          -10.962|           22.484| 6.931333906096791|
|         Connecticut|           -8.917|            28.21| 9.020079509066683|
|            Delaware|           -5.586|           30.454|11.895237571683838|
|District Of Columbia|           -6.386|           30.833| 11.91847451144824|
|             Florida|           10.077|           32.905|21.501