In [25]:
import pandas as pd
from IPython.display import display

# Load the "Data" sheet of the Global Wind Power Tracker workbook.
turbinedata = pd.read_excel(r"C:\Users\prabr\Desktop\GroupProjectWork\WindTurbines\Global-Wind-Power-Tracker-June-2024.xlsx", sheet_name="Data")
# Keep only the columns under consideration for the analysis.
turbinedata = turbinedata[["Date Last Researched", "Country/Area", "Capacity (MW)", "Installation Type",
                           "Status", "Start year", "Retired year", "Latitude", "Longitude", "Location accuracy"]]

display(turbinedata.head())
display(turbinedata.columns)

Unnamed: 0,Date Last Researched,Country/Area,Capacity (MW),Installation Type,Status,Start year,Retired year,Latitude,Longitude,Location accuracy
0,2023/07/03,Algeria,10.0,Onshore,operating,2014.0,,28.4624,-0.0576,exact
1,2023/07/03,Algeria,20.0,Onshore,cancelled - inferred 4 y,,,35.1689,7.1055,approximate
2,2023/07/03,Algeria,50.0,Unknown,cancelled - inferred 4 y,,,29.2356,0.4569,approximate
3,2023/07/06,Egypt,1100.0,Onshore,construction,2026.0,,26.254,29.2675,approximate
4,2023/07/06,Egypt,10000.0,Unknown,announced,,,26.5583,31.6773,approximate


Index(['Date Last Researched', 'Country/Area', 'Capacity (MW)',
       'Installation Type', 'Status', 'Start year', 'Retired year', 'Latitude',
       'Longitude', 'Location accuracy'],
      dtype='object')

Installation Type can be removed, as it will be clear from Latitude and Longitude.

Date Last Researched can be omitted.

Country/Area won't be needed, since the Latitude and Longitude coordinates already give the location.

Status is required to differentiate standing structures from non-standing structures.

Capacity is needed to determine the size/classification of the turbine.

Retired year is not important, as we are not interested in turbines that are no longer standing.

In [26]:
turbinedata["Location accuracy"].unique()

array(['exact', 'approximate'], dtype=object)

Location accuracy is either exact or approximate. Both are good enough for the general whereabouts of wind turbines on a global scale, so "Location accuracy" can be omitted.

In [27]:
turbinedata["Status"].unique()

array(['operating', 'cancelled - inferred 4 y', 'construction',
       'announced', 'pre-construction', 'shelved',
       'shelved - inferred 2 y', 'cancelled', 'retired', 'mothballed'],
      dtype=object)

We are not interested in cancelled, retired or shelved projects: they are not operating and are not expected to be built, so they have little to no potential to kill birds.

Mothballed turbines should stay in: although they are not in use, they have not been dismantled and so still have the potential to kill birds.

Operating, construction and announced projects have the potential to kill birds and can support forecasting of future wind turbine placements.
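
A minimal sketch of the status selection described in these notes, run on a small made-up frame rather than the real dataset. The `KEEP_STATUSES` name and the sample rows are assumptions for illustration. The list includes "mothballed" as the notes describe; whether "pre-construction" belongs in it is left open.

```python
import pandas as pd

# Hypothetical sample rows; the status strings match those returned by unique() above.
sample = pd.DataFrame({
    "Status": ["operating", "cancelled", "mothballed", "announced",
               "retired", "shelved - inferred 2 y"],
    "Capacity (MW)": [10.0, 20.0, 5.0, 100.0, 2.0, 50.0],
})

# Assumed keep-list: structures that stand now or are planned.
KEEP_STATUSES = ["operating", "construction", "announced", "mothballed"]

# isin() builds a boolean mask; rows with any other status are dropped.
kept = sample[sample["Status"].isin(KEEP_STATUSES)]
print(kept["Status"].tolist())
```

Keeping the list in one named variable makes it easy to compare against the notes when the selection changes.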
 

In [28]:
# Keep only rows whose status is operating, construction or announced.
turbinedata = turbinedata[turbinedata["Status"].isin(["operating", "construction", "announced"])]
# Drop the columns ruled out above and tidy the "Start year" header.
turbinedata = turbinedata[["Capacity (MW)", "Status", "Start year", "Latitude", "Longitude"]]
turbinedata = turbinedata.rename(columns={"Start year": "Start Year"})


In [29]:
display(turbinedata.isna().sum(axis=0))

print("Percentage of unknown build dates = ",100*round(turbinedata["Start Year"].isna().sum()/turbinedata["Start Year"].size,2),"%")

Capacity (MW)       0
Status              0
Start Year       3407
Latitude            0
Longitude           0
dtype: int64

Percentage of unknown build dates =  17.0 %
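
As a side note, a hedged alternative for the percentage: `isna().mean()` gives the missing fraction directly, and scaling before rounding avoids the floating-point noise that multiplying an already rounded fraction can reintroduce (for example `100 * 0.29`). The series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical start years with two gaps out of seven entries.
start_year = pd.Series([2014.0, np.nan, 2025.0, np.nan, 2020.0, 2019.0, 2021.0])

# isna().mean() is the fraction of missing entries; scale first, then round.
pct_missing = round(100 * start_year.isna().mean(), 1)
print(f"Percentage of unknown build dates = {pct_missing} %")
```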


There are gaps in the dataset where the start year is unknown. These are real turbines, but their build date was not recorded.

The start year will be important for later analysis, but for now position matters more. The gaps should be kept in mind; in the meantime they can be substituted with -1 as a placeholder.
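
One way to keep the gaps visible after substitution, sketched on made-up rows: fill only the "Start Year" column and record which rows had a real value. The "Start Year Known" column name is an assumption introduced here, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing start year.
sample = pd.DataFrame({
    "Capacity (MW)": [10.0, 160.0, 502.0],
    "Start Year": [2014.0, np.nan, 2025.0],
})

# Record which rows had a real start year before overwriting the gaps.
sample["Start Year Known"] = sample["Start Year"].notna()

# Fill only the Start Year column so no other column is touched.
sample["Start Year"] = sample["Start Year"].fillna(-1)

# Later analysis can exclude the placeholder rows through the flag.
known_mean = sample.loc[sample["Start Year Known"], "Start Year"].mean()
print(known_mean)
```

With the flag in place, statistics on start years can skip the -1 placeholders without having to remember the sentinel value.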

In [30]:
turbinedata = turbinedata.fillna(-1)

Next, check for outliers. Wind turbine locations won't follow any specific trend and can reasonably be very far apart, and there can be large discrepancies in build dates (large-scale deployment is a recent development driven by green activism).

We can do a simple check for outliers, but each flagged value should be considered case by case, as it may fall outside these ranges for perfectly good reasons.

In [31]:
upper_o = turbinedata[["Start Year","Latitude","Longitude"]].mean()+3*turbinedata[["Start Year","Latitude","Longitude"]].std()
lower_o = turbinedata[["Start Year","Latitude","Longitude"]].mean()-3*turbinedata[["Start Year","Latitude","Longitude"]].std()


display("Upper threshold for outliers",(turbinedata[["Start Year","Latitude","Longitude"]]>=upper_o).sum())
display("Lower threshold for outliers",(turbinedata[["Start Year","Latitude","Longitude"]]<=lower_o).sum())

'Upper threshold for outliers'

Start Year    0
Latitude      0
Longitude     0
dtype: int64

'Lower threshold for outliers'

Start Year      0
Latitude      536
Longitude       0
dtype: int64
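
One caveat for the Start Year thresholds: because the gaps were filled with -1 beforehand, the placeholder values enter the mean and standard deviation and widen the band considerably. A minimal sketch, on made-up values, of the same three-standard-deviation check with the placeholder masked out first:

```python
import pandas as pd

# Hypothetical start years, with -1 standing in for unknown dates.
sample = pd.DataFrame({"Start Year": [2014.0, -1.0, 2025.0, 2020.0, -1.0]})

# Naive band: the -1 placeholders drag the mean down and inflate the spread.
naive = sample["Start Year"]
naive_lower = naive.mean() - 3 * naive.std()

# Treat -1 as missing again; mean() and std() skip NaN by default.
years = sample["Start Year"].where(sample["Start Year"] != -1)
upper = years.mean() + 3 * years.std()
lower = years.mean() - 3 * years.std()

# Comparisons with NaN are False, so placeholder rows are never flagged.
flagged = (years >= upper) | (years <= lower)
print(naive_lower, lower, int(flagged.sum()))
```

In this toy example the naive lower bound falls far below zero, while the masked one stays close to the real years.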

In [32]:
turbinedata[turbinedata["Start Year"]>=upper_o["Start Year"]]

Unnamed: 0,Capacity (MW),Status,Start Year,Latitude,Longitude


No rows exceed the upper threshold for Start Year. A wide spread in start years would make sense in any case, since wind turbines have only recently been deployed at scale and financing and planning permission are lengthy processes.
Positions can also span a very wide range, with a lot of land having no wind turbines at all: the stretch from Europe to America to China covers large areas that won't necessarily contain many turbines. This may explain the latitude values that fall below the lower threshold.
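
Since coordinates are not expected to cluster around a mean, a range check may be a more telling sanity test for positions than a standard-deviation band. A minimal sketch on made-up coordinates, using the standard geographic limits (latitude from -90 to 90, longitude from -180 to 180):

```python
import pandas as pd

# Hypothetical coordinates; the last row is deliberately impossible.
sample = pd.DataFrame({
    "Latitude": [28.4624, -33.9, 95.0],
    "Longitude": [-0.0576, 18.4, 200.0],
})

# between() is inclusive on both ends by default.
valid = sample["Latitude"].between(-90, 90) & sample["Longitude"].between(-180, 180)

# Rows outside the valid ranges would need a closer look.
invalid_rows = sample[~valid]
print(len(invalid_rows))
```

A southern-hemisphere site passes this check even though it may sit far from the bulk of the data.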

The "Status" values can be capitalised to make them more presentable.

In [33]:
# Capitalise the status labels (only the Status column is touched).
turbinedata["Status"] = turbinedata["Status"].replace(
    {"operating": "Operating", "announced": "Announced", "construction": "Construction"}
)

In [34]:
turbinedata.head()

Unnamed: 0,Capacity (MW),Status,Start Year,Latitude,Longitude
0,10.0,Operating,2014.0,28.4624,-0.0576
3,1100.0,Construction,2026.0,26.254,29.2675
4,10000.0,Announced,-1.0,26.5583,31.6773
5,160.0,Construction,-1.0,29.6607,32.3314
6,502.0,Construction,2025.0,28.1338,33.2602


Export data

In [35]:
turbinedata.to_csv(r"C:\Users\prabr\Desktop\GroupProjectWork\CleanedData\WindTurbineData.csv",index=False)