# Data Cleaning - Weather Dataset
## Weather in France

This dataset contains daily historical values of the maximum power required to cover the peaks of French gross electricity consumption (reported in MW in this file), along with the average and reference temperatures (in °C). It is produced by RTE and Météo-France, the French national weather service, and is updated monthly. It covers France at the national level, at daily resolution, from 2012 to the present. The dataset is useful for analyzing trends and patterns in peak power consumption and for identifying the factors that drive electricity demand in France.

In this project, we investigate whether energy consumption is correlated with the weather.

## Libraries

In [176]:
# importing libraries for data cleaning and analysis :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Exploring the data

In [194]:
# importing the dataset :

df = pd.read_csv('pic-journalier-consommation-brute.csv', sep=';')
df.head()

Unnamed: 0,Date,Pic journalier consommation (MW),Température moyenne (°C),Température référence (°C)
0,2012-01-03,76698.0,8.4,4.6
1,2012-01-08,66520.0,7.9,4.7
2,2012-01-12,80367.0,5.1,4.7
3,2012-01-15,75776.0,1.1,4.8
4,2012-01-17,86516.0,1.0,4.9


In [195]:
# checking the shape of the dataset :
df.shape

(3987, 4)

In [196]:
# checking the columns of the dataset :
df.columns

Index(['Date', 'Pic journalier consommation (MW)', 'Température moyenne (°C)',
       'Température référence (°C)'],
      dtype='object')

In [197]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3987 entries, 0 to 3986
Data columns (total 4 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Date                              3987 non-null   object 
 1   Pic journalier consommation (MW)  3987 non-null   float64
 2   Température moyenne (°C)          3987 non-null   float64
 3   Température référence (°C)        3987 non-null   float64
dtypes: float64(3), object(1)
memory usage: 124.7+ KB


## Fixing the datetime column first, since it is the most important feature

The `Date` column is read as a plain object (string) column, so it needs to be converted to datetime format. We keep it as a regular column for now rather than setting it as the index.

In [198]:
# Convert the "Date" column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Setting "Date" as the index is not needed for now, so we keep it as a column:
# df.set_index('Date', inplace=True)
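
As a minimal sketch of a stricter conversion, the snippet below passes an explicit format and coerces unparseable entries to `NaT` so they can be counted. The sample series is made up for illustration (the last entry is deliberately malformed), and the `%Y-%m-%d` format is an assumption based on how the dates appear in the `head()` output.

```python
import pandas as pd

# Hypothetical sample; the last entry is deliberately malformed.
raw = pd.Series(["2012-01-03", "2012-01-08", "not a date"])

# An explicit format avoids any day/month ambiguity, and errors="coerce"
# turns unparseable entries into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Number of entries that could not be parsed.
n_bad = int(parsed.isna().sum())
print(n_bad)
```

If the count is zero on the real column, the conversion lost nothing.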

In [199]:
# Check the first few rows of the DataFrame
print(df.head())

        Date  Pic journalier consommation (MW)  Température moyenne (°C)  \
0 2012-01-03                           76698.0                       8.4   
1 2012-01-08                           66520.0                       7.9   
2 2012-01-12                           80367.0                       5.1   
3 2012-01-15                           75776.0                       1.1   
4 2012-01-17                           86516.0                       1.0   

   Température référence (°C)  
0                         4.6  
1                         4.7  
2                         4.7  
3                         4.8  
4                         4.9  


In [200]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3987 entries, 0 to 3986
Data columns (total 4 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Date                              3987 non-null   datetime64[ns]
 1   Pic journalier consommation (MW)  3987 non-null   float64       
 2   Température moyenne (°C)          3987 non-null   float64       
 3   Température référence (°C)        3987 non-null   float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 124.7 KB


In [201]:
# Creating "Year", "Month", "Day", "Weekday" columns : 
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Weekday'] = df['Date'].dt.weekday

In [202]:
# Check the first few rows of the DataFrame
df.head()

Unnamed: 0,Date,Pic journalier consommation (MW),Température moyenne (°C),Température référence (°C),Year,Month,Day,Weekday
0,2012-01-03,76698.0,8.4,4.6,2012,1,3,1
1,2012-01-08,66520.0,7.9,4.7,2012,1,8,6
2,2012-01-12,80367.0,5.1,4.7,2012,1,12,3
3,2012-01-15,75776.0,1.1,4.8,2012,1,15,6
4,2012-01-17,86516.0,1.0,4.9,2012,1,17,1
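
The `Weekday` values follow the pandas convention where Monday is 0 and Sunday is 6. The sketch below, using three dates from the output above, shows how the integer code relates to the day name. The `is_weekend` and `day_name` variables are illustrative additions, not columns of the notebook.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2012-01-03", "2012-01-08", "2012-01-12"]))

# dt.weekday encodes Monday as 0 and Sunday as 6.
weekday = dates.dt.weekday

# Hypothetical helper columns: weekend flag and human-readable day name.
is_weekend = weekday >= 5
day_name = dates.dt.day_name()

print(pd.DataFrame({"Date": dates, "Weekday": weekday,
                    "is_weekend": is_weekend, "day_name": day_name}))
```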


In [203]:
# Putting the "Year", "Month", "Day" and "Weekday" columns right after the "Date" column:
df = df[['Date', 'Year', 'Month', 'Day', 'Weekday', 'Pic journalier consommation (MW)', 'Température moyenne (°C)', 'Température référence (°C)']]

In [204]:
df.head(5)

Unnamed: 0,Date,Year,Month,Day,Weekday,Pic journalier consommation (MW),Température moyenne (°C),Température référence (°C)
0,2012-01-03,2012,1,3,1,76698.0,8.4,4.6
1,2012-01-08,2012,1,8,6,66520.0,7.9,4.7
2,2012-01-12,2012,1,12,3,80367.0,5.1,4.7
3,2012-01-15,2012,1,15,6,75776.0,1.1,4.8
4,2012-01-17,2012,1,17,1,86516.0,1.0,4.9


In [205]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3987 entries, 0 to 3986
Data columns (total 8 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Date                              3987 non-null   datetime64[ns]
 1   Year                              3987 non-null   int64         
 2   Month                             3987 non-null   int64         
 3   Day                               3987 non-null   int64         
 4   Weekday                           3987 non-null   int64         
 5   Pic journalier consommation (MW)  3987 non-null   float64       
 6   Température moyenne (°C)          3987 non-null   float64       
 7   Température référence (°C)        3987 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(4)
memory usage: 249.3 KB


1) We need to explore the data and drop unneeded columns to make the analysis easier.
2) We need to translate the column names into English.
3) We need to create a calculated field, "Temperature Deviation (°C)", to help with the analysis.


## Dropping Columns

In [209]:
# Dropping the "Pic journalier consommation (MW)" column because we don't need it here:

df.drop('Pic journalier consommation (MW)', axis=1, inplace=True)
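
Running an in-place `drop` cell a second time raises a `KeyError`, because the column is already gone. One possible alternative, sketched here on a one-row frame named `demo` (an assumption for illustration), is to reassign the result and pass `errors="ignore"` so the cell can be re-run safely.

```python
import pandas as pd

# One-row stand-in for the real DataFrame.
demo = pd.DataFrame({
    "Date": ["2012-01-03"],
    "Pic journalier consommation (MW)": [76698.0],
    "Température moyenne (°C)": [8.4],
})

# drop returns a new frame; errors="ignore" skips labels that are absent.
demo = demo.drop(columns="Pic journalier consommation (MW)", errors="ignore")

# Second run: the column no longer exists, but no KeyError is raised.
demo = demo.drop(columns="Pic journalier consommation (MW)", errors="ignore")

print(list(demo.columns))
```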

In [208]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3987 entries, 0 to 3986
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Date                        3987 non-null   datetime64[ns]
 1   Year                        3987 non-null   int64         
 2   Month                       3987 non-null   int64         
 3   Day                         3987 non-null   int64         
 4   Weekday                     3987 non-null   int64         
 5   Température moyenne (°C)    3987 non-null   float64       
 6   Température référence (°C)  3987 non-null   float64       
dtypes: datetime64[ns](1), float64(2), int64(4)
memory usage: 218.2 KB


In [210]:
df.head(5)

Unnamed: 0,Date,Year,Month,Day,Weekday,Température moyenne (°C),Température référence (°C)
0,2012-01-03,2012,1,3,1,8.4,4.6
1,2012-01-08,2012,1,8,6,7.9,4.7
2,2012-01-12,2012,1,12,3,5.1,4.7
3,2012-01-15,2012,1,15,6,1.1,4.8
4,2012-01-17,2012,1,17,1,1.0,4.9


## Translating column names

In [211]:
# Translating the French column names to English:

# Define a dictionary to map French column names to English column names
translation_dict = {
    'Température moyenne (°C)': 'Average temperature (°C)',
    'Température référence (°C)': 'Reference temperature (°C)'
}

# Rename columns using the dictionary
df.rename(columns=translation_dict, inplace=True)

# Print the updated column names
print(df.columns)

Index(['Date', 'Year', 'Month', 'Day', 'Weekday', 'Average temperature (°C)',
       'Reference temperature (°C)'],
      dtype='object')
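
By default, `rename` silently ignores dictionary keys that do not match any column, so a missing accent in a French name would go unnoticed. A small sketch of a stricter variant, using `errors="raise"` on a made-up one-column frame:

```python
import pandas as pd

demo = pd.DataFrame({"Température moyenne (°C)": [8.4]})

# errors="raise" makes rename fail loudly if a key is not an existing column.
renamed = demo.rename(
    columns={"Température moyenne (°C)": "Average temperature (°C)"},
    errors="raise",
)

# A key without the accent matches no column, so this raises KeyError.
try:
    demo.rename(
        columns={"Temperature moyenne (°C)": "Average temperature (°C)"},
        errors="raise",
    )
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns), typo_detected)
```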


In [212]:
df.head(5)

Unnamed: 0,Date,Year,Month,Day,Weekday,Average temperature (°C),Reference temperature (°C)
0,2012-01-03,2012,1,3,1,8.4,4.6
1,2012-01-08,2012,1,8,6,7.9,4.7
2,2012-01-12,2012,1,12,3,5.1,4.7
3,2012-01-15,2012,1,15,6,1.1,4.8
4,2012-01-17,2012,1,17,1,1.0,4.9


## Missing Values

In [213]:
# Checking the number of null values in each column :
df.isnull().sum()

Date                          0
Year                          0
Month                         0
Day                           0
Weekday                       0
Average temperature (°C)      0
Reference temperature (°C)    0
dtype: int64
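
`isnull()` only finds empty cells. For a daily series, a missing day shows up as an absent row instead. The following sketch compares the dates against a complete daily calendar. The three sample dates are made up, with 2012-01-03 left out on purpose.

```python
import pandas as pd

# Hypothetical sample with one day (2012-01-03) missing.
dates = pd.Series(pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-04"]))

# Complete daily calendar between the first and last observed date.
expected = pd.date_range(dates.min(), dates.max(), freq="D")

# Days in the calendar that have no row in the data.
missing_days = expected.difference(pd.DatetimeIndex(dates))

print(missing_days)
```

An empty result on the real `Date` column would indicate that every day in the range is present.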

## Checking Duplicates

In [214]:
# Checking duplicates in the dataset :

df.duplicated().sum()

0
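
`duplicated()` without arguments only flags rows that are identical in every column. Two rows for the same day with slightly different temperatures would pass that check, so it can be worth testing the `Date` column on its own. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical sample: the same date appears twice with different values.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2012-01-03", "2012-01-03", "2012-01-08"]),
    "Average temperature (°C)": [8.4, 8.5, 7.9],
})

# Fully identical rows: none here, because the temperatures differ.
full_dupes = int(demo.duplicated().sum())

# Rows repeating an already-seen date: one here.
date_dupes = int(demo.duplicated(subset="Date").sum())

print(full_dupes, date_dupes)
```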

## Let's create a new calculated field called "Temperature Deviation (°C)" 

In [215]:
# Creating a new column 'Temperature Deviation (°C)' which is the difference between the 'Average temperature (°C)' and the 'Reference temperature (°C)' :

df['Temperature Deviation (°C)'] = df['Average temperature (°C)'] - df['Reference temperature (°C)']
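
As a worked example of the deviation, take two rows from the `head()` output: 8.4 - 4.6 = 3.8 (warmer than the reference) and 1.1 - 4.8 = -3.7 (colder than the reference). The sketch below reproduces this on a two-row frame. The rounding to one decimal is an optional addition, assumed here only to hide floating-point residue such as 3.8000000000000007.

```python
import pandas as pd

demo = pd.DataFrame({
    "Average temperature (°C)": [8.4, 1.1],
    "Reference temperature (°C)": [4.6, 4.8],
})

# Positive values: warmer than the reference; negative values: colder.
demo["Temperature Deviation (°C)"] = (
    demo["Average temperature (°C)"] - demo["Reference temperature (°C)"]
).round(1)

print(demo)
```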

In [216]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3987 entries, 0 to 3986
Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Date                        3987 non-null   datetime64[ns]
 1   Year                        3987 non-null   int64         
 2   Month                       3987 non-null   int64         
 3   Day                         3987 non-null   int64         
 4   Weekday                     3987 non-null   int64         
 5   Average temperature (°C)    3987 non-null   float64       
 6   Reference temperature (°C)  3987 non-null   float64       
 7   Temperature Deviation (°C)  3987 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(4)
memory usage: 249.3 KB


## Keeping only the time range that we want for our project

In [217]:
# Checking the time period of the dataset :
df['Date'].min(), df['Date'].max()

(Timestamp('2012-01-01 00:00:00'), Timestamp('2022-11-30 00:00:00'))

In [218]:
df.head(5)

Unnamed: 0,Date,Year,Month,Day,Weekday,Average temperature (°C),Reference temperature (°C),Temperature Deviation (°C)
0,2012-01-03,2012,1,3,1,8.4,4.6,3.8
1,2012-01-08,2012,1,8,6,7.9,4.7,3.2
2,2012-01-12,2012,1,12,3,5.1,4.7,0.4
3,2012-01-15,2012,1,15,6,1.1,4.8,-3.7
4,2012-01-17,2012,1,17,1,1.0,4.9,-3.9


In [219]:
# '2012-01-01' to '2022-05-31':
# Select the rows that are within the time range
df = df.loc[(df['Date'] >= '2012-01-01') & (df['Date'] <= '2022-05-31')]

# Reset the index to the default integer index
df.reset_index(drop=True, inplace=True)

# Print the updated dataframe
print(df.head())

        Date  Year  Month  Day  Weekday  Average temperature (°C)  \
0 2012-01-03  2012      1    3        1                       8.4   
1 2012-01-08  2012      1    8        6                       7.9   
2 2012-01-12  2012      1   12        3                       5.1   
3 2012-01-15  2012      1   15        6                       1.1   
4 2012-01-17  2012      1   17        1                       1.0   

   Reference temperature (°C)  Temperature Deviation (°C)  
0                         4.6                         3.8  
1                         4.7                         3.2  
2                         4.7                         0.4  
3                         4.8                        -3.7  
4                         4.9                        -3.9  


In [220]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3804 entries, 0 to 3803
Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Date                        3804 non-null   datetime64[ns]
 1   Year                        3804 non-null   int64         
 2   Month                       3804 non-null   int64         
 3   Day                         3804 non-null   int64         
 4   Weekday                     3804 non-null   int64         
 5   Average temperature (°C)    3804 non-null   float64       
 6   Reference temperature (°C)  3804 non-null   float64       
 7   Temperature Deviation (°C)  3804 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(4)
memory usage: 237.9 KB
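
The same selection can be sketched with `Series.between`, which is inclusive on both ends by default. The sample frame is made up. The last lines count the calendar days in the chosen range, which gives a quick sanity check against the row count reported by `df.info()`.

```python
import pandas as pd

# Hypothetical sample straddling the end of the chosen range.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2022-05-30", "2022-05-31", "2022-06-01"]),
    "Average temperature (°C)": [17.0, 18.0, 19.0],
})

# between() keeps both boundary dates by default.
mask = demo["Date"].between("2012-01-01", "2022-05-31")
subset = demo.loc[mask].reset_index(drop=True)

# Number of calendar days from 2012-01-01 to 2022-05-31, both included.
expected_days = len(pd.date_range("2012-01-01", "2022-05-31", freq="D"))

print(len(subset), expected_days)
```

Here `expected_days` is 3804, the same as the number of entries in the filtered DataFrame, which is consistent with one row per day over the range.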


In [221]:
df.head(5)

Unnamed: 0,Date,Year,Month,Day,Weekday,Average temperature (°C),Reference temperature (°C),Temperature Deviation (°C)
0,2012-01-03,2012,1,3,1,8.4,4.6,3.8
1,2012-01-08,2012,1,8,6,7.9,4.7,3.2
2,2012-01-12,2012,1,12,3,5.1,4.7,0.4
3,2012-01-15,2012,1,15,6,1.1,4.8,-3.7
4,2012-01-17,2012,1,17,1,1.0,4.9,-3.9


In [223]:
# Checking the time period of the dataset :
df['Date'].min(), df['Date'].max()

(Timestamp('2012-01-01 00:00:00'), Timestamp('2022-05-31 00:00:00'))

## Exporting the Dataset to CSV

In [225]:
# Exporting the weather dataset to a csv file :
df.to_csv('weather.csv', index=False)
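
CSV files do not store dtypes, so the `Date` column comes back as text unless it is parsed again on load. A minimal round-trip sketch, writing to an in-memory buffer instead of `weather.csv` so that it runs without touching the disk (the one-row frame is for illustration only):

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "Date": pd.to_datetime(["2012-01-03"]),
    "Average temperature (°C)": [8.4],
})

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the file back.
reloaded = pd.read_csv(buffer, parse_dates=["Date"])

print(reloaded.dtypes)
```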