NB: loading data with pandas does not work from a local IDE through databricks-connect; pandas works only in a notebook in the cloud, inside the Databricks workspace. P.S. We load the data into pandas first, rather than straight into a Spark DataFrame, because the resulting data distribution differs. A DataFrame created from pandas is split into the default number of partitions, whereas Spark's own file reader packs up to 128 MB into a single partition. In our case, given the cluster type selected, the default number of partitions is 4 (the number of cores).

In [0]:
# Load via pandas so the data is distributed across the default 4 partitions (the cluster has 4 cores).
# Otherwise Spark would load up to 128 MB into a single partition.
import pandas as pd
insurance_pandas_df = pd.read_csv("/dbfs/FileStore/datasets/insurance.csv")
insurance_df = spark.createDataFrame(insurance_pandas_df)
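
The partitioning claim can be checked directly in the notebook. The helper below is a minimal sketch: `partition_count` is a hypothetical name introduced here, and it assumes a cluster where the DataFrame's `.rdd` attribute is accessible (this may not hold on every cluster access mode).

```python
def partition_count(df):
    """Return the number of partitions backing a Spark DataFrame."""
    # getNumPartitions() is a method of the underlying RDD
    return df.rdd.getNumPartitions()

# In the notebook, with `spark` and `insurance_df` defined as in the cell above:
# print(partition_count(insurance_df))
# print(spark.sparkContext.defaultParallelism)
```

If the reasoning above holds for the selected cluster, both commented-out calls should report the same number.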

In [0]:
insurance_df.groupBy('sex').count().display()

sex,count
female,662
male,676


In [0]:
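# Import under an alias so pyspark's round does not shadow Python's built-in round
from pyspark.sql import functions as F

# Count the rows once and reuse the total in the expression
total_rows = insurance_df.count()

gender_data_counts = insurance_df.groupBy('sex')\
                                   .count()\
                                   .withColumnRenamed('count', 'total')

# Keep the DataFrame in the variable and display it separately;
# chaining .display() into the assignment would not store the DataFrame
gender_data_properties = gender_data_counts\
                            .withColumn('proportions', F.round(F.col('total') / total_rows * 100, 2))\
                            .drop('total')
gender_data_properties.display()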

sex,proportions
female,49.48
male,50.52
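
As a quick sanity check, the proportions can be reproduced from the counts in the earlier output with plain Python. This is only a sketch: the `counts` dictionary is typed in by hand from that output rather than read from `insurance_df`.

```python
import builtins

# Counts copied by hand from the groupBy('sex').count() output (assumption: same data)
counts = {'female': 662, 'male': 676}
total = sum(counts.values())

# Same formula as the Spark expression: count / total * 100, rounded to 2 decimals.
# builtins.round is used explicitly in case pyspark's round was imported over it.
proportions = {sex: builtins.round(n / total * 100, 2) for sex, n in counts.items()}
print(proportions)
```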


In [0]:
# .agg({column : type of aggregation})
charges_by_smokinghabit = insurance_df.groupBy('smoker')\
                                      .agg({'charges': 'avg'})\
                                      .withColumnRenamed('avg(charges)', 'average_charges')
charges_by_smokinghabit.display()

smoker,average_charges
no,8434.2682978562
yes,32050.23183153284


In [0]:
# multiple aggregations in a single .agg({})
# note that both 'avg' and 'average' are accepted as the aggregation name

charges_by_smokinghabit = insurance_df.groupBy('smoker')\
                                      .agg({'charges': 'avg', 'bmi': 'average', 'sex': 'count'})\
                                      .withColumnRenamed('avg(charges)', 'average_charges')
charges_by_smokinghabit.display()

smoker,average_charges,count(sex),avg(bmi)
no,8434.2682978562,1064,30.651795112781954
yes,32050.23183153284,274,30.70844890510949


In [0]:
insurance_df.agg({'charges':'sum'}).display()

sum(charges)
17755824.990759
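
The overall sum can be cross-checked against the per-smoker averages and counts from the earlier output, since each group's sum equals its average times its row count. A minimal sketch, with all figures copied by hand from the outputs in this notebook:

```python
import math

# Values copied by hand from the outputs above (assumption: same dataset)
avg_charges = {'no': 8434.2682978562, 'yes': 32050.23183153284}
row_counts = {'no': 1064, 'yes': 274}
reported_total = 17755824.990759

# Group sum = average * count; adding the groups gives the overall sum
reconstructed_total = sum(avg_charges[k] * row_counts[k] for k in avg_charges)
print(reconstructed_total)
print(math.isclose(reconstructed_total, reported_total, abs_tol=1e-3))
```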


In [0]:
insurance_df.groupBy('region')\
            .agg({'charges':'sum'})\
            .withColumnRenamed('sum(charges)', 'region_revenue')\
            .display()

region,region_revenue
northwest,4035711.9965399993
southeast,5363689.76329
northeast,4343668.583308999
southwest,4012754.64762


In [0]:
insurance_df.groupBy('region')\
            .agg({'charges':'sum'})\
            .withColumnRenamed('sum(charges)', 'region_revenue')\
            .orderBy('region_revenue')\
            .display()

region,region_revenue
southwest,4012754.64762
northwest,4035711.9965399993
northeast,4343668.583308999
southeast,5363689.76329


In [0]:
insurance_df.groupBy('region')\
            .agg({'charges':'avg'})\
            .withColumnRenamed('avg(charges)', 'average_charges')\
            .orderBy('average_charges', ascending = False)\
            .display()

region,average_charges
southeast,14735.41143760989
northeast,13406.3845163858
northwest,12417.575373969228
southwest,12346.937377292308
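
Dividing each region's total charges by its average charge gives the implied number of rows per region, which is a handy consistency check between the sum and average outputs. A minimal sketch, with figures copied by hand from the outputs in this notebook:

```python
import builtins

# Values copied by hand from the region outputs above (assumption: same dataset)
region_revenue = {
    'southeast': 5363689.76329,
    'northeast': 4343668.583308999,
    'northwest': 4035711.9965399993,
    'southwest': 4012754.64762,
}
average_charges = {
    'southeast': 14735.41143760989,
    'northeast': 13406.3845163858,
    'northwest': 12417.575373969228,
    'southwest': 12346.937377292308,
}

# rows = sum / average; round to the nearest integer to absorb floating-point noise.
# builtins.round is used explicitly in case pyspark's round was imported over it.
implied_rows = {r: builtins.round(region_revenue[r] / average_charges[r])
                for r in region_revenue}
print(implied_rows)
print(sum(implied_rows.values()))
```

If the figures are consistent, the implied counts should add up to the 1338 rows given by the sex counts (662 + 676).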
