# Data Cleaning - Weather Dataset
## Weather in France

This dataset contains daily historical values of the peak power (in MW) required to cover French gross electricity consumption, together with the average and reference temperatures. It is produced by RTE and Météo-France and is updated monthly. It covers 2012 to the present at a daily resolution and at the national level (France). The dataset is relevant for analyzing trends and patterns in peak power consumption and for identifying the factors that drive electricity demand in France.

In our project, we will examine whether energy consumption is correlated with the weather.

## Libraries

In [52]:
# importing libraries for data cleaning and analysis :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Exploring the data

In [95]:
# importing the dataset :

df = pd.read_csv('pic-journalier-consommation-brute.csv', sep=';')
df.head()

,Date,Pic journalier consommation (MW),Température moyenne (°C),Température référence (°C)
0,2012-01-03,76698.0,8.4,4.6
1,2012-01-08,66520.0,7.9,4.7
2,2012-01-12,80367.0,5.1,4.7
3,2012-01-15,75776.0,1.1,4.8
4,2012-01-17,86516.0,1.0,4.9


In [96]:
# checking the shape of the dataset :
df.shape

(3987, 4)

In [97]:
# checking the columns of the dataset :
df.columns

Index(['Date', 'Pic journalier consommation (MW)', 'Température moyenne (°C)',
       'Température référence (°C)'],
      dtype='object')

In [98]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3987 entries, 0 to 3986
Data columns (total 4 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Date                              3987 non-null   object 
 1   Pic journalier consommation (MW)  3987 non-null   float64
 2   Température moyenne (°C)          3987 non-null   float64
 3   Température référence (°C)        3987 non-null   float64
dtypes: float64(3), object(1)
memory usage: 124.7+ KB


## Let's begin by fixing the datetime, because it is the most important feature

The 'Date' column needs to be converted to datetime format. We also need to set it as the index.

In [99]:
# Convert the "Date" column to a datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Set the "Date" column as the index of the DataFrame
df.set_index('Date', inplace=True)
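The conversion above can be made stricter. The following is a minimal sketch, assuming the dates are stored as `YYYY-MM-DD` strings (as the `head()` output suggests); the `sample` DataFrame is hypothetical and only mimics the raw file.

```python
import pandas as pd

# Hypothetical three-row sample mimicking the raw file (deliberately unsorted)
sample = pd.DataFrame({
    "Date": ["2012-01-08", "2012-01-03", "2012-01-12"],
    "Température moyenne (°C)": [7.9, 8.4, 5.1],
})

# An explicit format makes parsing strict: a malformed date raises an error
# instead of being guessed
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Index by date and sort chronologically so later date-based slicing is well defined
sample = sample.set_index("Date").sort_index()

print(sample.index.is_monotonic_increasing)
print(sample.index.min(), sample.index.max())
```

Sorting is optional here, but the first rows of the notebook's index are not consecutive days, so a sorted index is easier to inspect.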

In [100]:
# Check the first few rows of the DataFrame
print(df.head())

            Pic journalier consommation (MW)  Température moyenne (°C)  \
Date                                                                     
2012-01-03                           76698.0                       8.4   
2012-01-08                           66520.0                       7.9   
2012-01-12                           80367.0                       5.1   
2012-01-15                           75776.0                       1.1   
2012-01-17                           86516.0                       1.0   

            Température référence (°C)  
Date                                    
2012-01-03                         4.6  
2012-01-08                         4.7  
2012-01-12                         4.7  
2012-01-15                         4.8  
2012-01-17                         4.9  


In [101]:
df.index

DatetimeIndex(['2012-01-03', '2012-01-08', '2012-01-12', '2012-01-15',
               '2012-01-17', '2012-01-20', '2012-01-21', '2012-01-23',
               '2012-01-28', '2012-01-29',
               ...
               '2022-10-16', '2022-10-17', '2022-10-20', '2022-10-21',
               '2022-11-04', '2022-11-12', '2022-11-15', '2022-11-16',
               '2022-11-19', '2022-11-26'],
              dtype='datetime64[ns]', name='Date', length=3987, freq=None)

In [102]:
# Creating "Year", "Month", "Day", "Weekday" columns : 
df['Year'] = df.index.year
df['Month'] = df.index.month
df['Day'] = df.index.day
df['Weekday'] = df.index.weekday

In [103]:
# Putting the "Year", "Month", "Day", "Weekday" columns at the beginning of the DataFrame :
df = df.reindex(columns=['Year', 'Month', 'Day','Weekday'] + list(df.columns[:-4]))
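The `Weekday` column is numeric, so it helps to know the encoding. As a small sketch using the first two dates shown in the notebook output: pandas counts weekdays from Monday = 0 to Sunday = 6, and `day_name()` gives readable labels for checking.

```python
import pandas as pd

# First two dates from the notebook's index
dates = pd.DatetimeIndex(["2012-01-03", "2012-01-08"])

# .weekday counts from Monday = 0 to Sunday = 6
weekday_numbers = [int(d) for d in dates.weekday]

# day_name() returns readable labels, handy for checking the numeric encoding
weekday_names = list(dates.day_name())

print(weekday_numbers, weekday_names)
```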

In [104]:
df.shape

(3987, 7)

In [105]:
df.head(5)

Date,Year,Month,Day,Weekday,Pic journalier consommation (MW),Température moyenne (°C),Température référence (°C)
2012-01-03,2012,1,3,1,76698.0,8.4,4.6
2012-01-08,2012,1,8,6,66520.0,7.9,4.7
2012-01-12,2012,1,12,3,80367.0,5.1,4.7
2012-01-15,2012,1,15,6,75776.0,1.1,4.8
2012-01-17,2012,1,17,1,86516.0,1.0,4.9


1) We need to explore the data and drop unneeded columns to make the analysis easier.
2) We need to translate the column names to English.
3) We need to create a calculated field, "Temperature Deviation (°C)", to help us with the analysis.


## Dropping Columns

In [106]:
# Dropping the "Pic journalier consommation (MW)" column because we don't need it :

df.drop('Pic journalier consommation (MW)', axis=1, inplace=True)

In [107]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3987 entries, 2012-01-03 to 2022-11-26
Data columns (total 6 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Year                        3987 non-null   int64  
 1   Month                       3987 non-null   int64  
 2   Day                         3987 non-null   int64  
 3   Weekday                     3987 non-null   int64  
 4   Température moyenne (°C)    3987 non-null   float64
 5   Température référence (°C)  3987 non-null   float64
dtypes: float64(2), int64(4)
memory usage: 218.0 KB


In [108]:
df.head(5)

Date,Year,Month,Day,Weekday,Température moyenne (°C),Température référence (°C)
2012-01-03,2012,1,3,1,8.4,4.6
2012-01-08,2012,1,8,6,7.9,4.7
2012-01-12,2012,1,12,3,5.1,4.7
2012-01-15,2012,1,15,6,1.1,4.8
2012-01-17,2012,1,17,1,1.0,4.9


## Translation of column names

In [109]:
# Translating the French column names to English :

# Define a dictionary to map French column names to English column names
translation_dict = {
    'Température moyenne (°C)': 'Average temperature (°C)',
    'Température référence (°C)': 'Reference temperature (°C)'
}

# Rename columns using the dictionary
df.rename(columns=translation_dict, inplace=True)

# Print the updated column names
print(df.columns)

Index(['Year', 'Month', 'Day', 'Weekday', 'Average temperature (°C)',
       'Reference temperature (°C)'],
      dtype='object')
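By default, `rename` silently ignores dictionary keys that match no column, so a typo in a French name would go unnoticed. A minimal sketch of a stricter variant, using a hypothetical one-row `sample` and the `errors="raise"` option of `DataFrame.rename`:

```python
import pandas as pd

# Hypothetical one-row sample with the original French column names
sample = pd.DataFrame({
    "Température moyenne (°C)": [8.4],
    "Température référence (°C)": [4.6],
})

translation_dict = {
    "Température moyenne (°C)": "Average temperature (°C)",
    "Température référence (°C)": "Reference temperature (°C)",
}

# errors="raise" makes rename fail with a KeyError if a key matches no column
renamed = sample.rename(columns=translation_dict, errors="raise")

# A misspelled key (missing accent) is now detected instead of silently ignored
try:
    sample.rename(columns={"Temperature moyenne (°C)": "Average temperature (°C)"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns), typo_detected)
```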


In [110]:
df.head(5)

Date,Year,Month,Day,Weekday,Average temperature (°C),Reference temperature (°C)
2012-01-03,2012,1,3,1,8.4,4.6
2012-01-08,2012,1,8,6,7.9,4.7
2012-01-12,2012,1,12,3,5.1,4.7
2012-01-15,2012,1,15,6,1.1,4.8
2012-01-17,2012,1,17,1,1.0,4.9


## Missing Values

In [111]:
# Checking the number of null values in each column :
df.isnull().sum()

Year                          0
Month                         0
Day                           0
Weekday                       0
Average temperature (°C)      0
Reference temperature (°C)    0
dtype: int64

## Checking Duplicates

In [112]:
# Checking duplicates in the dataset :

df.duplicated().sum()

0
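`df.duplicated()` compares column values only, not the index. Because the date is the index here, a complementary check on the index can catch a date that appears twice with different measurements. The sketch below uses a hypothetical `sample` to show the difference:

```python
import pandas as pd

# Hypothetical sample: the same date appears twice with different temperature values
sample = pd.DataFrame(
    {"Average temperature (°C)": [8.4, 8.6, 7.9]},
    index=pd.to_datetime(["2012-01-03", "2012-01-03", "2012-01-08"]),
)

# Compares column values only: the two 2012-01-03 rows differ, so nothing is flagged
row_duplicates = int(sample.duplicated().sum())

# Compares index labels: the repeated date is flagged
date_duplicates = int(sample.index.duplicated().sum())

print(row_duplicates, date_duplicates)
```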

## Let's create a new calculated field called "Temperature Deviation (°C)" 

In [113]:
# Creating a new column 'Temperature Deviation (°C)' which is the difference between the 'Average temperature (°C)' and the 'Reference temperature (°C)' :

df['Temperature Deviation (°C)'] = df['Average temperature (°C)'] - df['Reference temperature (°C)']
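As a worked example of this subtraction, the sketch below uses the first three rows from the `head()` output. The rounding to one decimal is an assumption added here (the source values carry one decimal); it only hides floating-point noise such as `0.39999999999999947`.

```python
import pandas as pd

# First three rows of the notebook's head() output
sample = pd.DataFrame(
    {
        "Average temperature (°C)": [8.4, 7.9, 5.1],
        "Reference temperature (°C)": [4.6, 4.7, 4.7],
    },
    index=pd.to_datetime(["2012-01-03", "2012-01-08", "2012-01-12"]),
)

# Deviation = average minus reference; rounding is an assumption, not part of the notebook
deviation = (
    sample["Average temperature (°C)"] - sample["Reference temperature (°C)"]
).round(1)
sample["Temperature Deviation (°C)"] = deviation

# A positive deviation means the day was warmer than the reference temperature
warmer_than_reference = sample["Temperature Deviation (°C)"] > 0

print(sample)
```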

In [114]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3987 entries, 2012-01-03 to 2022-11-26
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Year                        3987 non-null   int64  
 1   Month                       3987 non-null   int64  
 2   Day                         3987 non-null   int64  
 3   Weekday                     3987 non-null   int64  
 4   Average temperature (°C)    3987 non-null   float64
 5   Reference temperature (°C)  3987 non-null   float64
 6   Temperature Deviation (°C)  3987 non-null   float64
dtypes: float64(3), int64(4)
memory usage: 249.2 KB


## Keeping only the time range that we want for our project

In [115]:
# Keeping 2013-05-21 to 2020-08-01 (both inclusive); reindexing on a daily range also orders the rows by date :
start_date = '2013-05-21'
end_date = '2020-08-01'

date_range = pd.date_range(start=start_date, end=end_date, freq='D')
df = df.reindex(date_range)
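`reindex` on a full daily range works here because every day in the range is present in the data (the `info()` output shows no nulls). If a day were missing, `reindex` would add a NaN row for it. A minimal sketch comparing it with label slicing, using a hypothetical `sample` that lacks 2013-05-22:

```python
import pandas as pd

# Hypothetical unsorted sample with one missing day (2013-05-22)
sample = pd.DataFrame(
    {"Average temperature (°C)": [15.2, 14.1, 16.0, 13.5]},
    index=pd.to_datetime(["2013-05-23", "2013-05-21", "2013-05-24", "2013-05-20"]),
)
sample.index.name = "Date"

start_date, end_date = "2013-05-21", "2013-05-24"

# Option A: reindex on a full daily range; absent days become NaN rows
reindexed = sample.reindex(pd.date_range(start=start_date, end=end_date, freq="D"))

# Option B: sort, then slice by label (both ends inclusive); only existing rows are kept
sliced = sample.sort_index().loc[start_date:end_date]

print(len(reindexed), int(reindexed["Average temperature (°C)"].isna().sum()))
print(len(sliced), sliced.index.name)
```

Which option fits depends on the goal: reindexing guarantees a regular daily grid, while slicing never introduces missing values.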

In [116]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2630 entries, 2013-05-21 to 2020-08-01
Freq: D
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Year                        2630 non-null   int64  
 1   Month                       2630 non-null   int64  
 2   Day                         2630 non-null   int64  
 3   Weekday                     2630 non-null   int64  
 4   Average temperature (°C)    2630 non-null   float64
 5   Reference temperature (°C)  2630 non-null   float64
 6   Temperature Deviation (°C)  2630 non-null   float64
dtypes: float64(3), int64(4)
memory usage: 164.4 KB


## Exporting the Dataset to CSV

In [117]:
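# Exporting the weather dataset to a CSV file (index=False omits the date index; the date remains available through the Year, Month and Day columns) :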
df.to_csv('weather.csv', index=False)
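If the date index should survive the export, one option is to write it as a named column and parse it back on load. This is a sketch of that alternative, not what the notebook does; the `sample` data is hypothetical and an in-memory buffer stands in for the file.

```python
import io
import pandas as pd

# Hypothetical two-day sample with a named date index
sample = pd.DataFrame(
    {"Average temperature (°C)": [15.2, 14.1]},
    index=pd.date_range("2013-05-21", periods=2, freq="D", name="Date"),
)

# index=True writes the dates as a 'Date' column (in-memory buffer instead of a file)
buffer = io.StringIO()
sample.to_csv(buffer, index=True)
buffer.seek(0)

# Parse the 'Date' column back into a datetime index when reloading
restored = pd.read_csv(buffer, parse_dates=["Date"], index_col="Date")

print(restored)
```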