
# Data Cleaning Process

This notebook outlines the data cleaning process for the solar panel degradation analysis project. 
The steps include handling missing values, correcting data types, and dealing with outliers.



## Step 1: Handling Missing Values

Missing values can bias the analysis or shrink the usable sample.
The right strategy depends on the nature of the data and on how each column is used in the study.
When a column is crucial for the analysis, it might be more appropriate to remove the rows that lack a value in it.
For other columns, imputing the gaps or retaining the missing values could be considered.
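
To make that decision concrete, it can help to look at the share of missing values per column before choosing a strategy. The following is a minimal sketch, not the project's actual code: `missing_report` is a hypothetical helper, and the toy values stand in for the real dataset.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most-missing first."""
    report = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "frac_missing": frame.isna().mean(),
    })
    return report.sort_values("frac_missing", ascending=False)

# Toy stand-in for the project data (values are made up for illustration).
toy = pd.DataFrame({
    "technology1": ["mono-Si", None, "CdTe", "multi-Si"],
    "plr_median": [-0.5, -0.7, np.nan, np.nan],
    "tracking": [True, False, True, False],
})
report = missing_report(toy)
print(report)
```

Columns near the top of the report are the ones where dropping rows costs the most data, so they deserve the closest look.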


In [None]:

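import pandas as pd

# Assumes `df` already holds the raw dataset as a pandas DataFrame.
# Count the missing values in each column and show only the columns that have any
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

# Decision: drop rows missing 'technology1' or 'technology2' if they are crucial for analysis.
# .copy() makes df_cleaned an independent DataFrame, so the column assignments
# in later steps do not trigger SettingWithCopyWarning.
df_cleaned = df.dropna(subset=['technology1', 'technology2']).copy()

# Impute or further handle other missing values as deemed necessary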
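
One possible way to handle the remaining gaps is simple imputation. The sketch below is an illustration under stated assumptions, not the project's chosen method: it fills numeric columns with the column median and text columns with the label `'unknown'`. The helper `impute_simple` and the `site_note` column are hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes

def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and text gaps with 'unknown'."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    for col in out.columns:
        if ptypes.is_bool_dtype(out[col]):
            continue  # plain boolean columns cannot hold missing values
        if ptypes.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        elif ptypes.is_object_dtype(out[col]) or ptypes.is_string_dtype(out[col]):
            out[col] = out[col].fillna("unknown")
    return out

# Toy data; 'site_note' is a made-up column used only for illustration.
toy = pd.DataFrame({
    "length_years_rounded": [5.0, np.nan, 10.0, 7.0],
    "site_note": ["roof", None, "ground", "roof"],
})
filled = impute_simple(toy)
print(filled)
```

Median imputation keeps the sample size but narrows the spread of the column, so it is worth recording which values were filled if the column feeds into later statistics.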



## Step 2: Correcting Data Types

Correct data types are essential for effective analysis. 
This step ensures that numerical data is treated as such and categorical data is recognized by analysis tools.


In [None]:

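# Correcting data types
# Convert 'length_years_rounded' to a numeric type; values that cannot be parsed become NaN
df_cleaned['length_years_rounded'] = pd.to_numeric(df_cleaned['length_years_rounded'], errors='coerce')

# Convert 'tracking' to boolean
# Caution: astype('bool') uses Python truthiness, so any non-empty string
# (including 'False') and NaN all become True. This is only safe if the column
# already holds real booleans or 0/1 values with no gaps.
df_cleaned['tracking'] = df_cleaned['tracking'].astype('bool')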
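
If `'tracking'` is stored as text, an explicit mapping is safer than a plain boolean cast, because any non-empty string is truthy in Python. The sketch below is a hedged example: the accepted labels are assumptions about how the raw column might be encoded, and `to_nullable_bool` is a hypothetical helper. It uses the pandas nullable `boolean` dtype so that missing and unrecognised values stay missing.

```python
import pandas as pd

# Assumed encodings; adjust to whatever the raw 'tracking' column contains.
TRUE_LABELS = {"true", "yes", "y", "1", "1.0"}
FALSE_LABELS = {"false", "no", "n", "0", "0.0"}

def to_nullable_bool(series: pd.Series) -> pd.Series:
    """Map text or number encodings to the pandas nullable boolean dtype."""
    def convert(value):
        if pd.isna(value):
            return pd.NA
        text = str(value).strip().lower()
        if text in TRUE_LABELS:
            return True
        if text in FALSE_LABELS:
            return False
        return pd.NA  # unrecognised label: keep as missing rather than guess

    return series.map(convert).astype("boolean")

raw = pd.Series(["True", "False", "no", None, "maybe"])
tracking = to_nullable_bool(raw)
print(tracking)
```

Unrecognised labels end up as `<NA>`, which makes them easy to count and inspect before deciding how to treat them.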



## Step 3: Handling Outliers

Outliers can skew the analysis and may need to be treated separately. 
This step involves identifying outliers and deciding on a strategy to handle them, such as removal or further investigation.


In [None]:

# Identifying outliers in 'plr_median' using IQR
Q1 = df_cleaned['plr_median'].quantile(0.25)
Q3 = df_cleaned['plr_median'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filtering out the outliers
# Note: rows where 'plr_median' is NaN fail both comparisons and are dropped here as well.
df_cleaned = df_cleaned[(df_cleaned['plr_median'] >= lower_bound) & (df_cleaned['plr_median'] <= upper_bound)]
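
When outliers call for further investigation rather than removal, the same IQR rule can be used to flag rows instead of dropping them. This is a minimal sketch on made-up values: `iqr_bounds` and the `plr_outlier` column are hypothetical names, and the multiplier `k=1.5` mirrors the rule used in the cell above.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values for illustration only.
toy = pd.DataFrame({"plr_median": [-0.4, -0.5, -0.6, -0.7, -0.8, -5.0]})
low, high = iqr_bounds(toy["plr_median"])

# Flag rather than drop; comparisons with NaN are False, so missing values are not flagged.
toy["plr_outlier"] = toy["plr_median"].lt(low) | toy["plr_median"].gt(high)
print(toy)
```

Keeping the flag as a column leaves the full dataset intact, so the analysis can be run with and without the flagged rows to see how much they influence the result.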
