# PRONOVO DATA ANALYSIS

This notebook investigates, cleans, and aggregates the Pronovo data to produce one record per Swiss municipality.

In [3]:
# import libraries
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [4]:
# load data 
datapath = "../GWR_PV_Daten.csv"
pronovo_df = pd.read_csv(datapath, low_memory=False)
pronovo_df.head()

Unnamed: 0,EGID,GDEKT,GGDENR,GGDENAME,EGRID,LGBKR,LPARZ,LPARZSX,LTYP,GEBNR,...,PV_Pot_reco,Roof_area,Facade_area,FPV_Pot,FPV_Pot_reco,BeginningOfOperation,InitialPower,TotalPower,PlantCategory,TotalEnergy
0,11513432,ZH,1,Aeugst am Albis,CH540120777857,0.0,1573,,,1199,...,11040.259333,230.578943,317.346632,8237.377869,0.0,,,,,
1,11513433,ZH,1,Aeugst am Albis,CH587820017717,0.0,1543,,,1198,...,9525.977213,151.208898,228.503458,8979.194682,5247.683171,,,,,
2,11517090,ZH,1,Aeugst am Albis,CH707701872012,0.0,1361,,,1017,...,,,,,,,,,,
3,1600000,ZH,1,Aeugst am Albis,CH487820017751,0.0,1631,,,164,...,19417.187289,266.831234,531.288763,22112.221833,15430.840027,,,,,
4,1600001,ZH,1,Aeugst am Albis,CH567720017811,0.0,1579,,,154,...,22857.860524,459.94039,458.462737,15636.491774,4575.180087,,,,,


### Interpretation of Column Names from the Swiss Communes Solar Energy Potential Dataset

1. **EGID**: "Eidgenössischer Gebäudeidentifikator" (Federal Building Identifier), a unique identifier for buildings in Switzerland.
2. **GDEKT**: Canton abbreviation ("Kantonskürzel"), the canton to which the municipality belongs, e.g. ZH.
3. **GGDENR**: "Gemeindenummer" (Municipality Number), the official numeric code of each municipality.
4. **GGDENAME**: "Gemeindename" (Municipality Name), the name of the commune.
5. **EGRID**: "Eidgenössischer Grundstücksidentifikator" (Federal Property Identifier), which identifies the land parcel the building stands on.
6. **LGBKR**: "Grundbuchkreisnummer" (Land Registry District Number).
7. **LPARZ**: "Grundstücksnummer" (Parcel Number), the plot of land.
8. **LPARZSX**: Suffix of the parcel number.
9. **LTYP**: "Typ des Grundstücks" (Parcel Type).
10. **GEBNR**: "Amtliche Gebäudenummer" (Official Building Number), an identifier for the building.
11. **GBEZ**: "Name des Gebäudes" (Building Name).
12. **GKODE**: East coordinate (easting) of the building in the Swiss national coordinate system.
13. **GKODN**: North coordinate (northing) of the building in the Swiss national coordinate system; geographic latitude and longitude are stored separately in Coord_lat and Coord_long.
14. **GKSCE**: "Koordinatenherkunft" (Coordinate Source), how the building coordinates were determined.
15. **GSTAT**: Possibly "Gebäudestatus" (Building Status), which might indicate if the building is in use or abandoned.
16. **GKAT**: Could be "Gebäudekategorie" (Building Category), referring to the classification of the building.
17. **GKLAS**: Likely "Gebäude Klasse" (Building Class), another classification system.
18. **GBAUJ**: Short for "Gebäude Baujahr" (Year of Construction), referring to when the building was constructed.
19. **GBAUM**: Possibly "Gebäude Bau Monat" (Month of Construction).
20. **GBAUP**: Likely "Gebäude Bau Periode" (Construction Period), referring to a time range.
21. **GABBJ**: Could be "Gebäude Abbruchjahr" (Year of Demolition), indicating the year the building was demolished.
22. **GAREA**: Likely "Gebäudefläche" (Building Area), the area covered by the building.
23. **GVOL**: Probably "Gebäudevolumen" (Building Volume), the total volume of the building.
24. **GVOLNORM**: The norm (measurement standard) according to which the building volume was determined.
25. **GVOLSCE**: Information source for the building volume.
26. **GASTW**: Could refer to "Anzahl Stockwerke" (Number of Floors).
27. **GANZWHG**: Likely "Anzahl Wohnungen" (Number of Apartments), indicating the number of housing units in the building.
28. **GAZZI**: "Anzahl separate Wohnräume" (Number of Separate Rooms).
29. **GSCHUTZR**: "Zivilschutzraum" (Civil Defence Shelter), indicating whether the building has a shelter.
30. **GEBF**: "Energiebezugsfläche" (Energy Reference Area).
31. **GWAERZH1**: "Wärmeerzeuger Heizung 1" (Heat Generator for Heating 1), the primary space-heating system.
32. **GENH1**: "Energie-/Wärmequelle Heizung 1" (Energy/Heat Source for Heating 1).
33. **GWAERSCEH1**: Information source for Heating 1.
34. **GWAERDATH1**: Date of the last update for Heating 1.
35. **GWAERZH2**: Heat generator for Heating 2 (secondary space-heating system).
36. **GENH2**: Energy/heat source for Heating 2.
37. **GWAERSCEH2**: Information source for Heating 2.
38. **GWAERDATH2**: Date of the last update for Heating 2.
39. **GWAERZW1**: "Wärmeerzeuger Warmwasser 1" (Heat Generator for Hot Water 1).
40. **GENW1**: "Energie-/Wärmequelle Warmwasser 1" (Energy/Heat Source for Hot Water 1).
41. **GWAERSCEW1**: Information source for Hot Water 1.
42. **GWAERDATW1**: Date of the last update for Hot Water 1.
43. **GWAERZW2**: Heat generator for Hot Water 2.
44. **GENW2**: Energy/heat source for Hot Water 2.
45. **GWAERSCEW2**: Information source for Hot Water 2.
46. **GWAERDATW2**: Date of the last update for Hot Water 2.
47. **GEXPDAT**: "Datum des Exports" (Export Date), the date on which the record was exported from the register.
48. **Coord_lat**: Latitude coordinate.
49. **Coord_long**: Longitude coordinate.
50. **WSTAT**: Likely "Wohnungsstatus" (Dwelling Status), though this is less clear without more context.
51. **BEDARF_HEIZUNG**: "Bedarf für Heizung" (Heating Demand), indicating the required heating energy.
52. **PV_Pot**: "Photovoltaik Potenzial" (Photovoltaic Potential), solar energy potential for PV panels.
53. **PV_Pot_reco**: "Empfohlenes Photovoltaik Potenzial" (Recommended PV Potential), optimized or recommended potential for solar energy.
54. **Roof_area**: Area of the roof, potentially for calculating solar panel coverage.
55. **Facade_area**: Area of the building’s facade, possibly for solar panels or insulation.
56. **FPV_Pot**: "Fassaden-Photovoltaik Potenzial" (Facade Photovoltaic Potential), referring to solar potential on the building facade.
57. **FPV_Pot_reco**: "Empfohlenes Fassaden-Photovoltaik Potenzial" (Recommended Facade PV Potential).
58. **BeginningOfOperation**: The date the building or solar installation began operation.
59. **InitialPower**: The initial power output, possibly of the solar installation.
60. **TotalPower**: The total power output, again likely related to solar energy.
61. **PlantCategory**: Likely refers to the category of the solar plant or energy generation facility.
62. **TotalEnergy**: The total energy generated, likely from solar power.


In [5]:
# Get a sense of the dataframe 
pronovo_df_test = pronovo_df.copy()

pronovo_df_test.replace(["NaN", "nan", " N/A", "Na"], np.nan, inplace=True)
#pronovo_df_test = pronovo_df_test.dropna(subset=["InitialPower"])
#empty_cols_df = pronovo_df_test.isnull().sum()
#empty_cols_df[empty_cols_df==0]
print(pronovo_df_test.shape, pronovo_df.shape)

# select rows where InitialPower is present (not NaN)
df_test = pronovo_df_test[pronovo_df_test["InitialPower"].notna()].copy()

# sanity check: no NaN values should remain in InitialPower
df_test["InitialPower"].isna().sum()

(3082589, 62) (3082589, 62)


0
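
A minimal sketch of the non-missing filter on a toy frame, using `notna()`. The frame `toy` and its values are invented for illustration and are not taken from the dataset; the share computed at the end is one way to see how many buildings carry an `InitialPower` entry.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pronovo_df; the values are invented for illustration.
toy = pd.DataFrame({
    "EGID": [1, 2, 3],
    "InitialPower": [np.nan, 9.8, np.nan],
})

# Keep only rows that have an InitialPower entry.
with_pv = toy[toy["InitialPower"].notna()].copy()

# No NaN should remain after filtering.
n_missing = int(with_pv["InitialPower"].isna().sum())

# Fraction of rows that carry an InitialPower entry.
share_with_pv = len(with_pv) / len(toy)
```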

In [8]:
# take a look at all the columns of the dataframe 
pronovo_df.columns

Index(['EGID', 'GDEKT', 'GGDENR', 'GGDENAME', 'EGRID', 'LGBKR', 'LPARZ',
       'LPARZSX', 'LTYP', 'GEBNR', 'GBEZ', 'GKODE', 'GKODN', 'GKSCE', 'GSTAT',
       'GKAT', 'GKLAS', 'GBAUJ', 'GBAUM', 'GBAUP', 'GABBJ', 'GAREA', 'GVOL',
       'GVOLNORM', 'GVOLSCE', 'GASTW', 'GANZWHG', 'GAZZI', 'GSCHUTZR', 'GEBF',
       'GWAERZH1', 'GENH1', 'GWAERSCEH1', 'GWAERDATH1', 'GWAERZH2', 'GENH2',
       'GWAERSCEH2', 'GWAERDATH2', 'GWAERZW1', 'GENW1', 'GWAERSCEW1',
       'GWAERDATW1', 'GWAERZW2', 'GENW2', 'GWAERSCEW2', 'GWAERDATW2',
       'GEXPDAT', 'Coord_lat', 'Coord_long', 'WSTAT', 'BEDARF_HEIZUNG',
       'PV_Pot', 'PV_Pot_reco', 'Roof_area', 'Facade_area', 'FPV_Pot',
       'FPV_Pot_reco', 'BeginningOfOperation', 'InitialPower', 'TotalPower',
       'PlantCategory', 'TotalEnergy'],
      dtype='object')

In [6]:
# Count the NaN values in each column
pronovo_df.replace(["NaN", "nan", " N/A", "Na"], np.nan, inplace=True)
for column in pronovo_df.columns:
    print(f"{column}: {pronovo_df[column].isna().sum()}")

EGID: 0
GDEKT: 0
GGDENR: 0
GGDENAME: 0
EGRID: 61127
LGBKR: 84
LPARZ: 84
LPARZSX: 3082589
LTYP: 3082589
GEBNR: 1220721
GBEZ: 2392920
GKODE: 2
GKODN: 2
GKSCE: 2
GSTAT: 0
GKAT: 4
GKLAS: 163261
GBAUJ: 1256467
GBAUM: 2730280
GBAUP: 251549
GABBJ: 3082582
GAREA: 14924
GVOL: 2941401
GVOLNORM: 2941401
GVOLSCE: 2941573
GASTW: 1069464
GANZWHG: 1297732
GAZZI: 3020189
GSCHUTZR: 1997400
GEBF: 2860849
GWAERZH1: 1083583
GENH1: 1071473
GWAERSCEH1: 1113062
GWAERDATH1: 1070377
GWAERZH2: 2625957
GENH2: 2625241
GWAERSCEH2: 2727665
GWAERDATH2: 2625266
GWAERZW1: 998116
GENW1: 997837
GWAERSCEW1: 1035774
GWAERDATW1: 992466
GWAERZW2: 1939396
GENW2: 1938167
GWAERSCEW2: 2040676
GWAERDATW2: 1938023
GEXPDAT: 0
Coord_lat: 2
Coord_long: 2
WSTAT: 1289086
BEDARF_HEIZUNG: 1414992
PV_Pot: 1414992
PV_Pot_reco: 1414992
Roof_area: 1414992
Facade_area: 1429077
FPV_Pot: 1429077
FPV_Pot_reco: 1429077
BeginningOfOperation: 2872012
InitialPower: 2872012
TotalPower: 2872012
PlantCategory: 2876009
TotalEnergy: 2872012
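
Raw counts are hard to compare at a glance. For instance, `InitialPower` is missing in 2872012 of 3082589 rows, about 93%. The sketch below shows one possible way to get the missing share per column, sorted from most to least empty. The frame `toy` is a made-up stand-in, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in with invented values.
toy = pd.DataFrame({
    "EGID": [1, 2, 3, 4],
    "GKAT": [1020.0, 1020.0, np.nan, 1030.0],
    "TotalPower": [np.nan, 5.0, np.nan, np.nan],
})

# isna() gives a boolean frame; its column mean is the fraction of NaN values.
nan_share = toy.isna().mean().sort_values(ascending=False)
print(nan_share)
```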


In [7]:
# Keeps only the columns of potential interest  
columns_to_keep = [
    "EGID", 
    "GDEKT", 
    "GGDENR", 
    "GGDENAME", 
    "GSTAT", 
    "GKAT",
    "GKLAS",
    "Coord_lat", 
    "Coord_long",
    "PV_Pot", 
    "PV_Pot_reco", 
    "Roof_area", 
    "Facade_area",
    "FPV_Pot", 
    "FPV_Pot_reco",
    "BeginningOfOperation",
    "InitialPower",
    "TotalPower",
    "PlantCategory",
    "TotalEnergy",
]
df = pronovo_df[columns_to_keep].copy()
df.shape

(3082589, 20)

In [201]:
# Aggregate the building-level values to one row per municipality

Municipal_df = df.groupby(["GGDENR", "GGDENAME", "GDEKT"]).agg(
    pv_pot=('PV_Pot', 'sum'),  # Maximal potential PV production in each municipality
    pv_pot_reco=('PV_Pot_reco', 'sum'),  # Recommended potential PV production in each municipality
    fpv_pot=('FPV_Pot', 'sum'),  # Maximal potential facade PV production in each municipality
    fpv_pot_reco=('FPV_Pot_reco', 'sum'),  # Recommended potential facade PV production in each municipality
    roof_area=('Roof_area', 'sum'),
    facade_area=('Facade_area', 'sum'),
    total_power=('TotalPower', 'sum'),
    total_energy=('TotalEnergy', 'sum'),
    initial_power=('InitialPower', 'sum')
).reset_index()


# Convert all column names to lowercase snake_case
Municipal_df.columns = Municipal_df.columns.str.lower().str.replace(' ', '_')

# Rename the grouping columns and use municipality id and name as the index
Municipal_df = (
    Municipal_df
    .rename(columns={'ggdenr': 'mun_id', 'ggdename': 'mun_name', 'gdekt': 'canton_abr'})
    .set_index(['mun_id', 'mun_name'])
)


In [202]:
Municipal_df.head(10)

mun_id,mun_name,canton_abr,pv_pot,pv_pot_reco,fpv_pot,fpv_pot_reco,roof_area,facade_area,total_power,total_energy,initial_power
1,Aeugst am Albis,ZH,19107140.0,13485430.0,8666142.0,5140205.0,177869.927003,184175.446768,1514.66,1514660.0,1482.66
2,Affoltern am Albis,ZH,67274110.0,52868010.0,37962680.0,22113730.0,653710.145217,800320.664453,3988.31,3988310.0,3828.16
3,Bonstetten,ZH,28078090.0,20862920.0,14428540.0,7962732.0,271607.933081,316012.443845,1620.22,1620220.0,1574.84
4,Hausen am Albis,ZH,36201220.0,25837820.0,17470770.0,10675740.0,342510.943327,371456.902241,3073.58,3073580.0,2684.31
5,Hedingen,ZH,26598540.0,19539730.0,12479460.0,7409044.0,250280.021025,267799.223428,2708.86,2708860.0,2439.94
6,Kappel am Albis,ZH,15916880.0,11201920.0,7174595.0,4457193.0,143761.555547,147121.491785,1002.71,1002710.0,979.46
7,Knonau,ZH,20407820.0,15146200.0,9577837.0,5719468.0,188035.848579,197854.115252,3770.61,3770610.0,2936.59
8,Maschwanden,ZH,9644733.0,6872408.0,4132000.0,2519524.0,88930.93269,85657.247218,706.71,706710.0,705.51
9,Mettmenstetten,ZH,48457130.0,34510260.0,22222700.0,13497120.0,445394.213132,458726.069092,3562.86,3562860.0,3458.42
10,Obfelden,ZH,36089470.0,26439660.0,18730640.0,10956960.0,347901.858653,400476.579147,2494.17,2494170.0,2358.42
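
One caveat of aggregating with `'sum'`: pandas skips NaN values, so a group whose values are all missing is reported as 0 rather than NaN. The sketch below illustrates this on an invented two-municipality frame and shows the `min_count=1` option of the groupby `sum`, which keeps such groups as NaN. Whether that distinction matters for this analysis is an open choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in: municipality 2 has no PV_Pot value at all (invented data).
toy = pd.DataFrame({
    "GGDENR": [1, 1, 2],
    "PV_Pot": [100.0, np.nan, np.nan],
})

# Default behaviour: NaN is skipped, an all-NaN group sums to 0.
default_sum = toy.groupby("GGDENR")["PV_Pot"].sum()

# With min_count=1 a group needs at least one valid value, otherwise NaN.
strict_sum = toy.groupby("GGDENR")["PV_Pot"].sum(min_count=1)
```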


In [205]:
# Check that the municipality IDs are unique
# plt.plot(Municipal_df["GGDENR"].value_counts())
# plt.show()

# Alternative: verify directly that the index entries are unique
Municipal_df.index.is_unique

True
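
Note that `index.is_unique` on a two-level index tests the (`mun_id`, `mun_name`) pairs, so an id that appears with two different names would still pass. A minimal sketch, on an invented three-row frame, of additionally checking the `mun_id` level on its own:

```python
import pandas as pd

# Toy stand-in for Municipal_df with the same index layout (invented values).
idx = pd.MultiIndex.from_tuples(
    [(1, "Aeugst am Albis"), (2, "Affoltern am Albis"), (3, "Bonstetten")],
    names=["mun_id", "mun_name"],
)
toy = pd.DataFrame({"pv_pot": [1.0, 2.0, 3.0]}, index=idx)

# Uniqueness of the (mun_id, mun_name) pairs.
pair_unique = toy.index.is_unique

# Uniqueness of the id level alone, which is the stricter check.
id_unique = toy.index.get_level_values("mun_id").is_unique
```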

In [207]:
Municipal_df['total_potential'] = Municipal_df.pv_pot + Municipal_df.fpv_pot
Municipal_df['recommended_potential'] = Municipal_df.pv_pot_reco + Municipal_df.fpv_pot_reco
Municipal_df.head()

mun_id,mun_name,canton_abr,pv_pot,pv_pot_reco,fpv_pot,fpv_pot_reco,roof_area,facade_area,total_power,total_energy,initial_power,total_potential,recommended_potential
1,Aeugst am Albis,ZH,19107140.0,13485430.0,8666142.0,5140205.0,177869.927003,184175.446768,1514.66,1514660.0,1482.66,27773280.0,18625640.0
2,Affoltern am Albis,ZH,67274110.0,52868010.0,37962680.0,22113730.0,653710.145217,800320.664453,3988.31,3988310.0,3828.16,105236800.0,74981740.0
3,Bonstetten,ZH,28078090.0,20862920.0,14428540.0,7962732.0,271607.933081,316012.443845,1620.22,1620220.0,1574.84,42506630.0,28825660.0
4,Hausen am Albis,ZH,36201220.0,25837820.0,17470770.0,10675740.0,342510.943327,371456.902241,3073.58,3073580.0,2684.31,53672000.0,36513560.0
5,Hedingen,ZH,26598540.0,19539730.0,12479460.0,7409044.0,250280.021025,267799.223428,2708.86,2708860.0,2439.94,39078000.0,26948770.0


In [209]:
# tests

# Assert that total_potential is calculated correctly for the first row
assert Municipal_df['total_potential'].iloc[0] == Municipal_df['pv_pot'].iloc[0] + Municipal_df['fpv_pot'].iloc[0], "Incorrect calculation for total_potential in first row"
# Assert that recommended_potential is calculated correctly for the first row
assert Municipal_df['recommended_potential'].iloc[0] == Municipal_df['pv_pot_reco'].iloc[0] + Municipal_df['fpv_pot_reco'].iloc[0], "Incorrect calculation for recommended_potential in first row"
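
The assertions above inspect only the first row. A possible extension, sketched here on an invented frame, compares every row at once with `np.allclose`, which also tolerates floating-point rounding:

```python
import numpy as np
import pandas as pd

# Toy stand-in with invented values.
toy = pd.DataFrame({
    "pv_pot": [10.0, 20.0, 0.0],
    "fpv_pot": [1.0, 2.0, 0.0],
})
toy["total_potential"] = toy["pv_pot"] + toy["fpv_pot"]

# Compare the derived column against its definition for all rows.
all_rows_ok = np.allclose(toy["total_potential"], toy["pv_pot"] + toy["fpv_pot"])
assert all_rows_ok, "total_potential differs from pv_pot + fpv_pot in at least one row"
```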

In [211]:
Municipal_df["achieved_rp"] = Municipal_df.total_energy / Municipal_df.recommended_potential  # share of the recommended potential already achieved
Municipal_df["achieved_tp"] = Municipal_df.total_energy / Municipal_df.total_potential  # share of the total potential already achieved
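
These ratios are undefined where the potential is zero: pandas returns NaN for 0/0 and `inf` for a positive value divided by 0, without raising an error. A minimal sketch on invented values, showing one way to map zero denominators to NaN so that no `inf` appears in later sorting or statistics:

```python
import numpy as np
import pandas as pd

# Toy stand-in with invented values; rows 2 and 3 have zero potential.
toy = pd.DataFrame({
    "total_energy": [50.0, 0.0, 5.0],
    "recommended_potential": [100.0, 0.0, 0.0],
})

# Plain division: 0/0 gives NaN, 5/0 gives inf.
ratio = toy["total_energy"] / toy["recommended_potential"]

# Replacing zero denominators by NaN makes both cases NaN.
safe_ratio = toy["total_energy"] / toy["recommended_potential"].replace(0, np.nan)
```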

In [213]:
# tests

# Assert that achieved_rp is calculated correctly for the first row
assert Municipal_df['achieved_rp'].iloc[0] == Municipal_df['total_energy'].iloc[0] / Municipal_df['recommended_potential'].iloc[0], "Incorrect calculation for achieved_rp in first row"
# Assert that achieved_tp is calculated correctly for the first row
assert Municipal_df['achieved_tp'].iloc[0] == Municipal_df['total_energy'].iloc[0] / Municipal_df['total_potential'].iloc[0], "Incorrect calculation for achieved_tp in first row"

In [215]:
Municipal_df['achieved_tp']/Municipal_df['achieved_rp']

mun_id  mun_name          
1       Aeugst am Albis       0.670631
2       Affoltern am Albis    0.712505
3       Bonstetten            0.678145
4       Hausen am Albis       0.680309
5       Hedingen              0.689615
                                ...   
6808    Clos du Doubs         0.673181
6809    Haute-Ajoie           0.729627
6810    La Baroche            0.698079
6811    Damphreux-Lugnez      0.700716
6812    Basse-Vendline        0.724654
Length: 2133, dtype: float64

### Comment
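Because both shares have the same numerator, achieved_tp / achieved_rp equals recommended_potential / total_potential. In the rows displayed this ratio lies between about 0.67 and 0.73, so the recommended potential seems to be roughly 2/3 of the maximum potential.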
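
The identity behind this comment can be checked directly. For Aeugst am Albis, 18625640 / 27773280 ≈ 0.6706, which matches the first value in the output above. The sketch below verifies the same identity on an invented two-row frame; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in with invented values.
toy = pd.DataFrame({
    "total_energy": [100.0, 300.0],
    "total_potential": [1000.0, 2000.0],
    "recommended_potential": [700.0, 1000.0],
})
toy["achieved_rp"] = toy["total_energy"] / toy["recommended_potential"]
toy["achieved_tp"] = toy["total_energy"] / toy["total_potential"]

# total_energy cancels out, leaving the ratio of the two potentials.
ratio_of_shares = toy["achieved_tp"] / toy["achieved_rp"]
ratio_of_potentials = toy["recommended_potential"] / toy["total_potential"]
same = np.allclose(ratio_of_shares, ratio_of_potentials)
```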

In [221]:
# Sort so that the municipalities with the highest achieved share of the recommended potential come first
# The sort is done in place, so no reassignment is needed
Municipal_df.sort_values(by='achieved_rp', ascending=False, inplace=True)
Municipal_df

mun_id,mun_name,canton_abr,pv_pot,pv_pot_reco,fpv_pot,fpv_pot_reco,roof_area,facade_area,total_power,total_energy,initial_power,total_potential,recommended_potential,achieved_rp,achieved_tp
2043,Sévaz,FR,4.670673e+06,4.180890e+06,1.929158e+06,1.425327e+06,39036.054397,33832.332298,3693.98,3693980.0,3693.98,6.599830e+06,5.606217e+06,0.658908,0.559708
5565,Onnens (VD),VD,1.305776e+07,1.177943e+07,3.235058e+06,2.103798e+06,112876.590943,63996.552432,8615.12,8615120.0,8607.17,1.629282e+07,1.388323e+07,0.620542,0.528768
6452,Cressier (NE),NE,1.679586e+07,1.422929e+07,7.305761e+06,4.541674e+06,146427.208998,153540.786324,9229.40,9229400.0,8890.75,2.410162e+07,1.877096e+07,0.491685,0.382937
5629,Clarmont,VD,1.777467e+06,1.489369e+06,9.465443e+05,6.455402e+05,13871.650234,16173.060807,986.84,986840.0,859.72,2.724012e+06,2.134909e+06,0.462240,0.362275
5073,Giornico,TI,1.495146e+07,5.614214e+06,5.010771e+06,2.016368e+06,150042.723782,137640.972544,3393.20,3393200.0,3376.56,1.996223e+07,7.630582e+06,0.444684,0.169981
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3713,Ferrera,GR,1.881119e+06,1.338627e+06,1.155161e+06,6.540965e+05,18165.699377,24885.535927,0.00,0.0,0.00,3.036280e+06,1.992723e+06,0.000000,0.000000
3711,Rongellen,GR,6.321874e+05,2.848915e+05,2.847475e+05,1.276220e+05,6435.525421,7305.033094,0.00,0.0,0.00,9.169349e+05,4.125135e+05,0.000000,0.000000
5304,Bosco/Gurin,TI,1.573302e+06,1.088195e+06,1.040593e+06,6.231462e+05,15738.040974,22894.051664,0.00,0.0,0.00,2.613895e+06,1.711341e+06,0.000000,0.000000
2391,Staatswald Galm,FR,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000,0.000000,0.00,0.0,0.00,0.000000e+00,0.000000e+00,,
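
The last row shows that municipalities with an undefined ratio (NaN) end up at the bottom, which is the default `na_position="last"` of `sort_values`. A minimal sketch on invented values, making that choice explicit and showing `nlargest` as a shortcut when only the top entries are needed:

```python
import numpy as np
import pandas as pd

# Toy stand-in with invented municipality labels and values.
toy = pd.DataFrame(
    {"achieved_rp": [0.10, np.nan, 0.65, 0.0]},
    index=pd.Index(["A", "B", "C", "D"], name="mun_name"),
)

# Descending sort with NaN rows placed explicitly at the end.
ranked = toy.sort_values(by="achieved_rp", ascending=False, na_position="last")

# Top two municipalities only; NaN values are ignored by nlargest.
top2 = toy["achieved_rp"].nlargest(2)
```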
