# Title <a id='back'></a>

## Table of Contents
- [Project Introduction](#project-introduction)
    - [Analysis Outline](#analysis-outline)
    - [Results](#results)
- [Importing Libraries and Opening Data Files](#importing-libraries-and-opening-data-files)
- [Pre-Processing Data](#pre-processing-data)
    - [Header Style](#header-style)
    - [Duplicates](#duplicates)
    - [Missing Values](#missing-values)
    - [Data Usage and Formatting](#data-usage-and-formatting)
    - [Data Wrangling](#data-wrangling)
- [Exploratory Data Analysis](#exploratory-data-analysis)
- [Conclusions and Recommendations](#conclusions-and-reccomendations)
- [Dataset Citation](#dataset-citation)

<a name='headers'></a>

## Project Introduction

[project intro]

### Analysis Outline

[Analysis Outline]

### Results

[Results]


[Back to Table of Contents](#back)

## Importing Libraries and Opening Data Files

In [1]:
# Importing the needed libraries for this assignment
import pandas as pd
import numpy as np
from datetime import datetime as dt
from matplotlib import pyplot as plt
import seaborn as sns

In [2]:
# Importing the data files; fall back to the /datasets/ directory
# when a file is not found in the working directory
try:
    pg_1 = pd.read_csv('Plant_1_Generation_Data.csv', sep=',')
except FileNotFoundError:
    pg_1 = pd.read_csv('/datasets/Plant_1_Generation_Data.csv', sep=',')

try:
    ws_1 = pd.read_csv('Plant_1_Weather_Sensor_Data.csv', sep=',')
except FileNotFoundError:
    ws_1 = pd.read_csv('/datasets/Plant_1_Weather_Sensor_Data.csv', sep=',')

try:
    pg_2 = pd.read_csv('Plant_2_Generation_Data.csv', sep=',')
except FileNotFoundError:
    pg_2 = pd.read_csv('/datasets/Plant_2_Generation_Data.csv', sep=',')

try:
    ws_2 = pd.read_csv('Plant_2_Weather_Sensor_Data.csv', sep=',')
except FileNotFoundError:
    ws_2 = pd.read_csv('/datasets/Plant_2_Weather_Sensor_Data.csv', sep=',')
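
The four `try`/`except` blocks above follow one pattern: look in the working directory first, then in `/datasets/`. A minimal sketch of that pattern as a reusable helper is shown below. The names `read_csv_with_fallback`, `load_all`, and `FILES` are illustrative and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def read_csv_with_fallback(filename, fallback_dir="/datasets"):
    """Read `filename` from the working directory, else from `fallback_dir`."""
    local_path = Path(filename)
    path = local_path if local_path.exists() else Path(fallback_dir) / filename
    # pd.read_csv raises FileNotFoundError if neither location has the file.
    return pd.read_csv(path)


# File names copied from the cell above; the dictionary keys are illustrative.
FILES = {
    "pg_1": "Plant_1_Generation_Data.csv",
    "ws_1": "Plant_1_Weather_Sensor_Data.csv",
    "pg_2": "Plant_2_Generation_Data.csv",
    "ws_2": "Plant_2_Weather_Sensor_Data.csv",
}


def load_all(files=FILES, fallback_dir="/datasets"):
    """Return a dict of DataFrames keyed by the short names in `files`."""
    return {
        name: read_csv_with_fallback(filename, fallback_dir)
        for name, filename in files.items()
    }
```

Calling `frames = load_all()` would then give `frames["pg_1"]`, `frames["ws_1"]`, and so on, provided the files exist in one of the two locations.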

[Back to Table of Contents](#back)

## Pre-Processing Data

### Header Style

In [3]:
# Getting general information about the dataset
pg_1.info()
pg_1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68778 entries, 0 to 68777
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    68778 non-null  object 
 1   PLANT_ID     68778 non-null  int64  
 2   SOURCE_KEY   68778 non-null  object 
 3   DC_POWER     68778 non-null  float64
 4   AC_POWER     68778 non-null  float64
 5   DAILY_YIELD  68778 non-null  float64
 6   TOTAL_YIELD  68778 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.7+ MB


|   | DATE_TIME | PLANT_ID | SOURCE_KEY | DC_POWER | AC_POWER | DAILY_YIELD | TOTAL_YIELD |
|---|---|---|---|---|---|---|---|
| 0 | 15-05-2020 00:00 | 4135001 | 1BY6WEcLGh8j5v7 | 0.0 | 0.0 | 0.0 | 6259559.0 |
| 1 | 15-05-2020 00:00 | 4135001 | 1IF53ai7Xc0U56Y | 0.0 | 0.0 | 0.0 | 6183645.0 |
| 2 | 15-05-2020 00:00 | 4135001 | 3PZuoBAID5Wc2HD | 0.0 | 0.0 | 0.0 | 6987759.0 |
| 3 | 15-05-2020 00:00 | 4135001 | 7JYdWkrLSPkdwr4 | 0.0 | 0.0 | 0.0 | 7602960.0 |
| 4 | 15-05-2020 00:00 | 4135001 | McdE0feGgRqW7Ca | 0.0 | 0.0 | 0.0 | 7158964.0 |


In [4]:
# Checking whether the column names are in snake_case
pg_1.columns

Index(['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'DC_POWER', 'AC_POWER',
       'DAILY_YIELD', 'TOTAL_YIELD'],
      dtype='object')

In [5]:
# Renaming column names to snake_case format
pg_1 = pg_1.rename(columns={'DATE_TIME': 'date_time',
                            'PLANT_ID': 'plant_id',
                            'SOURCE_KEY': 'source_key',
                            'DC_POWER': 'dc_power',
                            'AC_POWER': 'ac_power',
                            'DAILY_YIELD': 'daily_yield',
                            'TOTAL_YIELD': 'total_yield'})
pg_1.columns

Index(['date_time', 'plant_id', 'source_key', 'dc_power', 'ac_power',
       'daily_yield', 'total_yield'],
      dtype='object')

In [6]:
# Getting general information about the dataset
ws_1.info()
ws_1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3182 non-null   object 
 1   PLANT_ID             3182 non-null   int64  
 2   SOURCE_KEY           3182 non-null   object 
 3   AMBIENT_TEMPERATURE  3182 non-null   float64
 4   MODULE_TEMPERATURE   3182 non-null   float64
 5   IRRADIATION          3182 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 149.3+ KB


|   | DATE_TIME | PLANT_ID | SOURCE_KEY | AMBIENT_TEMPERATURE | MODULE_TEMPERATURE | IRRADIATION |
|---|---|---|---|---|---|---|
| 0 | 2020-05-15 00:00:00 | 4135001 | HmiyD2TTLFNqkNe | 25.184316 | 22.857507 | 0.0 |
| 1 | 2020-05-15 00:15:00 | 4135001 | HmiyD2TTLFNqkNe | 25.084589 | 22.761668 | 0.0 |
| 2 | 2020-05-15 00:30:00 | 4135001 | HmiyD2TTLFNqkNe | 24.935753 | 22.592306 | 0.0 |
| 3 | 2020-05-15 00:45:00 | 4135001 | HmiyD2TTLFNqkNe | 24.84613 | 22.360852 | 0.0 |
| 4 | 2020-05-15 01:00:00 | 4135001 | HmiyD2TTLFNqkNe | 24.621525 | 22.165423 | 0.0 |


In [7]:
# Checking whether the column names are in snake_case
ws_1.columns

Index(['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'AMBIENT_TEMPERATURE',
       'MODULE_TEMPERATURE', 'IRRADIATION'],
      dtype='object')

In [8]:
# Renaming column names to snake_case format
ws_1 = ws_1.rename(columns={'DATE_TIME': 'date_time',
                            'PLANT_ID': 'plant_id',
                            'SOURCE_KEY': 'source_key',
                            'AMBIENT_TEMPERATURE': 'ambient_temp',
                            'MODULE_TEMPERATURE': 'module_temp',
                            'IRRADIATION': 'irradiation'})
ws_1.columns

Index(['date_time', 'plant_id', 'source_key', 'ambient_temp', 'module_temp',
       'irradiation'],
      dtype='object')

In [9]:
# Getting general information about the dataset
pg_2.info()
pg_2.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67698 entries, 0 to 67697
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    67698 non-null  object 
 1   PLANT_ID     67698 non-null  int64  
 2   SOURCE_KEY   67698 non-null  object 
 3   DC_POWER     67698 non-null  float64
 4   AC_POWER     67698 non-null  float64
 5   DAILY_YIELD  67698 non-null  float64
 6   TOTAL_YIELD  67698 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.6+ MB


|   | DATE_TIME | PLANT_ID | SOURCE_KEY | DC_POWER | AC_POWER | DAILY_YIELD | TOTAL_YIELD |
|---|---|---|---|---|---|---|---|
| 0 | 2020-05-15 00:00:00 | 4136001 | 4UPUqMRk7TRMgml | 0.0 | 0.0 | 9425.0 | 2429011.0 |
| 1 | 2020-05-15 00:00:00 | 4136001 | 81aHJ1q11NBPMrL | 0.0 | 0.0 | 0.0 | 1215279000.0 |
| 2 | 2020-05-15 00:00:00 | 4136001 | 9kRcWv60rDACzjR | 0.0 | 0.0 | 3075.333333 | 2247720000.0 |
| 3 | 2020-05-15 00:00:00 | 4136001 | Et9kgGMDl729KT4 | 0.0 | 0.0 | 269.933333 | 1704250.0 |
| 4 | 2020-05-15 00:00:00 | 4136001 | IQ2d7wF4YD8zU1Q | 0.0 | 0.0 | 3177.0 | 19941530.0 |


In [10]:
# Checking whether the column names are in snake_case
pg_2.columns

Index(['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'DC_POWER', 'AC_POWER',
       'DAILY_YIELD', 'TOTAL_YIELD'],
      dtype='object')

In [11]:
# Renaming column names to snake_case format
pg_2 = pg_2.rename(columns={'DATE_TIME': 'date_time',
                            'PLANT_ID': 'plant_id',
                            'SOURCE_KEY': 'source_key',
                            'DC_POWER': 'dc_power',
                            'AC_POWER': 'ac_power',
                            'DAILY_YIELD': 'daily_yield',
                            'TOTAL_YIELD': 'total_yield'})
pg_2.columns

Index(['date_time', 'plant_id', 'source_key', 'dc_power', 'ac_power',
       'daily_yield', 'total_yield'],
      dtype='object')

In [12]:
# Getting general information about the dataset
ws_2.info()
ws_2.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3259 non-null   object 
 1   PLANT_ID             3259 non-null   int64  
 2   SOURCE_KEY           3259 non-null   object 
 3   AMBIENT_TEMPERATURE  3259 non-null   float64
 4   MODULE_TEMPERATURE   3259 non-null   float64
 5   IRRADIATION          3259 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 152.9+ KB


|   | DATE_TIME | PLANT_ID | SOURCE_KEY | AMBIENT_TEMPERATURE | MODULE_TEMPERATURE | IRRADIATION |
|---|---|---|---|---|---|---|
| 0 | 2020-05-15 00:00:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 27.004764 | 25.060789 | 0.0 |
| 1 | 2020-05-15 00:15:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.880811 | 24.421869 | 0.0 |
| 2 | 2020-05-15 00:30:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.682055 | 24.42729 | 0.0 |
| 3 | 2020-05-15 00:45:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.500589 | 24.420678 | 0.0 |
| 4 | 2020-05-15 01:00:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.596148 | 25.08821 | 0.0 |


In [13]:
# Checking whether the column names are in snake_case
ws_2.columns

Index(['DATE_TIME', 'PLANT_ID', 'SOURCE_KEY', 'AMBIENT_TEMPERATURE',
       'MODULE_TEMPERATURE', 'IRRADIATION'],
      dtype='object')

In [14]:
# Renaming column names to snake_case format
ws_2 = ws_2.rename(columns={'DATE_TIME': 'date_time',
                            'PLANT_ID': 'plant_id',
                            'SOURCE_KEY': 'source_key',
                            'AMBIENT_TEMPERATURE': 'ambient_temp',
                            'MODULE_TEMPERATURE': 'module_temp',
                            'IRRADIATION': 'irradiation'})
ws_2.columns

Index(['date_time', 'plant_id', 'source_key', 'ambient_temp', 'module_temp',
       'irradiation'],
      dtype='object')
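
The rename cells above repeat almost the same dictionary four times. One way to keep them in sync, sketched here under the assumption that the short names chosen above are the desired ones, is a single shared mapping applied to every frame. `SNAKE_CASE_NAMES`, `to_snake_case`, and the demo frames are illustrative names with made-up values, not part of the original notebook.

```python
import pandas as pd

# One mapping covering the headers of both the generation and the weather files.
# The short names mirror the ones chosen in the cells above.
SNAKE_CASE_NAMES = {
    "DATE_TIME": "date_time",
    "PLANT_ID": "plant_id",
    "SOURCE_KEY": "source_key",
    "DC_POWER": "dc_power",
    "AC_POWER": "ac_power",
    "DAILY_YIELD": "daily_yield",
    "TOTAL_YIELD": "total_yield",
    "AMBIENT_TEMPERATURE": "ambient_temp",
    "MODULE_TEMPERATURE": "module_temp",
    "IRRADIATION": "irradiation",
}


def to_snake_case(df):
    """Return a renamed copy of `df`; columns missing from the mapping are left unchanged."""
    return df.rename(columns=SNAKE_CASE_NAMES)


# Tiny stand-in frames (made-up values) so the sketch runs without the CSV files.
demo_generation = pd.DataFrame({"DATE_TIME": ["15-05-2020 00:00"], "DC_POWER": [0.0]})
demo_weather = pd.DataFrame(
    {"DATE_TIME": ["2020-05-15 00:00:00"], "AMBIENT_TEMPERATURE": [25.0]}
)

demo_frames = {"generation": demo_generation, "weather": demo_weather}
renamed = {name: to_snake_case(frame) for name, frame in demo_frames.items()}

print(list(renamed["generation"].columns))
print(list(renamed["weather"].columns))
```

Because every frame goes through the same function, a frame cannot be skipped or renamed twice by a copy-paste slip.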

[Back to Table of Contents](#back)

### Duplicates

In [15]:
# Checking for duplicates
pg_1.duplicated().sum()

0

In [16]:
# Checking for duplicates
ws_1.duplicated().sum()

0

In [17]:
# Checking for duplicates
pg_2.duplicated().sum()

0

In [18]:
# Checking for duplicates
ws_2.duplicated().sum()

0
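
`duplicated()` with no arguments only flags rows that match in every column. If each reading is expected to be unique per timestamp and source key (an assumption about the data, not something the cells above verify), a key-based check can catch conflicting rows that a full-row check misses. The frame below uses made-up values.

```python
import pandas as pd

# Made-up readings: rows 0 and 2 share a timestamp and source key but differ in dc_power.
demo = pd.DataFrame({
    "date_time": ["15-05-2020 00:00", "15-05-2020 00:15", "15-05-2020 00:00"],
    "source_key": ["inv_a", "inv_a", "inv_a"],
    "dc_power": [0.0, 0.0, 1.5],
})

# Full-row check, as in the cells above: no two rows are identical.
full_row_duplicates = demo.duplicated().sum()

# Key-based check: assumes one reading per (date_time, source_key) pair.
key_columns = ["date_time", "source_key"]
key_duplicates = demo.duplicated(subset=key_columns).sum()

# keep=False marks every row in a duplicated group, which helps when inspecting them.
conflicting_rows = demo[demo.duplicated(subset=key_columns, keep=False)]

print(full_row_duplicates, key_duplicates, len(conflicting_rows))
```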

[Back to Table of Contents](#back)

### Missing Values

In [19]:
# Checking for null values
pg_1.isna().sum()

date_time      0
plant_id       0
source_key     0
dc_power       0
ac_power       0
daily_yield    0
total_yield    0
dtype: int64

In [20]:
# Checking for null values
ws_1.isna().sum()

date_time       0
plant_id        0
source_key      0
ambient_temp    0
module_temp     0
irradiation     0
dtype: int64

In [21]:
# Checking for null values
pg_2.isna().sum()

date_time      0
plant_id       0
source_key     0
dc_power       0
ac_power       0
daily_yield    0
total_yield    0
dtype: int64

In [22]:
# Checking for null values
ws_2.isna().sum()

date_time       0
plant_id        0
source_key      0
ambient_temp    0
module_temp     0
irradiation     0
dtype: int64
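
The four null checks above can also be collected in one pass. This is a small sketch on made-up frames; `demo_frames` and `null_counts` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up frames; the generation frame has one missing dc_power value.
demo_frames = {
    "generation": pd.DataFrame({"dc_power": [0.0, np.nan, 3.2], "ac_power": [0.0, 1.0, 3.1]}),
    "weather": pd.DataFrame({"irradiation": [0.0, 0.1, 0.2]}),
}

# Total number of missing cells in each frame.
null_counts = {name: int(frame.isna().sum().sum()) for name, frame in demo_frames.items()}

# Names of the columns that contain at least one missing value, per frame.
columns_with_nulls = {
    name: frame.columns[frame.isna().any()].tolist() for name, frame in demo_frames.items()
}

print(null_counts)
print(columns_with_nulls)
```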

[Back to Table of Contents](#back)

### Data Usage and Formatting

In [23]:
pg_1.info()
pg_1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68778 entries, 0 to 68777
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date_time    68778 non-null  object 
 1   plant_id     68778 non-null  int64  
 2   source_key   68778 non-null  object 
 3   dc_power     68778 non-null  float64
 4   ac_power     68778 non-null  float64
 5   daily_yield  68778 non-null  float64
 6   total_yield  68778 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.7+ MB


|   | date_time | plant_id | source_key | dc_power | ac_power | daily_yield | total_yield |
|---|---|---|---|---|---|---|---|
| 0 | 15-05-2020 00:00 | 4135001 | 1BY6WEcLGh8j5v7 | 0.0 | 0.0 | 0.0 | 6259559.0 |
| 1 | 15-05-2020 00:00 | 4135001 | 1IF53ai7Xc0U56Y | 0.0 | 0.0 | 0.0 | 6183645.0 |
| 2 | 15-05-2020 00:00 | 4135001 | 3PZuoBAID5Wc2HD | 0.0 | 0.0 | 0.0 | 6987759.0 |
| 3 | 15-05-2020 00:00 | 4135001 | 7JYdWkrLSPkdwr4 | 0.0 | 0.0 | 0.0 | 7602960.0 |
| 4 | 15-05-2020 00:00 | 4135001 | McdE0feGgRqW7Ca | 0.0 | 0.0 | 0.0 | 7158964.0 |


In [24]:
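# Parsing the day-first timestamps into datetime values
pg_1['date_time'] = pd.to_datetime(pg_1['date_time'], format='%d-%m-%Y %H:%M')
# Lower-casing the source key and storing it as a category
pg_1['source_key'] = pg_1['source_key'].str.lower()
pg_1['source_key'] = pg_1['source_key'].astype('category')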
pg_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68778 entries, 0 to 68777
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date_time    68778 non-null  datetime64[ns]
 1   plant_id     68778 non-null  int64         
 2   source_key   68778 non-null  category      
 3   dc_power     68778 non-null  float64       
 4   ac_power     68778 non-null  float64       
 5   daily_yield  68778 non-null  float64       
 6   total_yield  68778 non-null  float64       
dtypes: category(1), datetime64[ns](1), float64(4), int64(1)
memory usage: 3.2 MB
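
The drop in reported memory after the conversion comes from the `category` dtype, which stores each distinct key once and keeps small integer codes per row. A rough sketch of the effect on made-up keys (the key names and counts are invented for illustration):

```python
import pandas as pd

# 20 made-up source keys, each repeated 1,000 times.
keys = pd.Series([f"source_{i:02d}" for i in range(20)] * 1000)
as_category = keys.astype("category")

# deep=True counts the bytes of the strings themselves, not just the pointers.
string_bytes = keys.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
```

The exact byte counts depend on the pandas version and platform, so only the relative size is meaningful here.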


In [25]:
ws_1.info()
ws_1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date_time     3182 non-null   object 
 1   plant_id      3182 non-null   int64  
 2   source_key    3182 non-null   object 
 3   ambient_temp  3182 non-null   float64
 4   module_temp   3182 non-null   float64
 5   irradiation   3182 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 149.3+ KB


|   | date_time | plant_id | source_key | ambient_temp | module_temp | irradiation |
|---|---|---|---|---|---|---|
| 0 | 2020-05-15 00:00:00 | 4135001 | HmiyD2TTLFNqkNe | 25.184316 | 22.857507 | 0.0 |
| 1 | 2020-05-15 00:15:00 | 4135001 | HmiyD2TTLFNqkNe | 25.084589 | 22.761668 | 0.0 |
| 2 | 2020-05-15 00:30:00 | 4135001 | HmiyD2TTLFNqkNe | 24.935753 | 22.592306 | 0.0 |
| 3 | 2020-05-15 00:45:00 | 4135001 | HmiyD2TTLFNqkNe | 24.84613 | 22.360852 | 0.0 |
| 4 | 2020-05-15 01:00:00 | 4135001 | HmiyD2TTLFNqkNe | 24.621525 | 22.165423 | 0.0 |


In [27]:
# Parsing the year-first timestamps into datetime values; a single call is enough,
# so no round trip through strftime is needed
ws_1['date_time'] = pd.to_datetime(ws_1['date_time'], format='%Y-%m-%d %H:%M:%S')
# Lower-casing the source key and storing it as a category
ws_1['source_key'] = ws_1['source_key'].str.lower()
ws_1['source_key'] = ws_1['source_key'].astype('category')
ws_1.info()
ws_1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date_time     3182 non-null   datetime64[ns]
 1   plant_id      3182 non-null   int64         
 2   source_key    3182 non-null   category      
 3   ambient_temp  3182 non-null   float64       
 4   module_temp   3182 non-null   float64       
 5   irradiation   3182 non-null   float64       
dtypes: category(1), datetime64[ns](1), float64(3), int64(1)
memory usage: 127.6 KB


|   | date_time | plant_id | source_key | ambient_temp | module_temp | irradiation |
|---|---|---|---|---|---|---|
| 0 | 2020-05-15 00:00:00 | 4135001 | hmiyd2ttlfnqkne | 25.184316 | 22.857507 | 0.0 |
| 1 | 2020-05-15 00:15:00 | 4135001 | hmiyd2ttlfnqkne | 25.084589 | 22.761668 | 0.0 |
| 2 | 2020-05-15 00:30:00 | 4135001 | hmiyd2ttlfnqkne | 24.935753 | 22.592306 | 0.0 |
| 3 | 2020-05-15 00:45:00 | 4135001 | hmiyd2ttlfnqkne | 24.84613 | 22.360852 | 0.0 |
| 4 | 2020-05-15 01:00:00 | 4135001 | hmiyd2ttlfnqkne | 24.621525 | 22.165423 | 0.0 |
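
The generation and weather files for plant 1 write timestamps in different layouts (day-first without seconds versus year-first with seconds). Once each is parsed with its own `format` string, both end up as the same kind of datetime value, so no text reformatting is needed to make them comparable. A minimal sketch with hand-written sample strings:

```python
import pandas as pd

# The same instant written in the two layouts seen in the files above.
generation_style = pd.Series(["15-05-2020 00:15"])
weather_style = pd.Series(["2020-05-15 00:15:00"])

parsed_generation = pd.to_datetime(generation_style, format="%d-%m-%Y %H:%M")
parsed_weather = pd.to_datetime(weather_style, format="%Y-%m-%d %H:%M:%S")

# Both strings resolve to the same timestamp once parsed.
same_instant = parsed_generation.iloc[0] == parsed_weather.iloc[0]

# An ambiguous date: the explicit day-first format reads it as 5 June, not 6 May.
ambiguous = pd.to_datetime("05-06-2020 00:00", format="%d-%m-%Y %H:%M")

print(same_instant, ambiguous.day, ambiguous.month)
```

Passing `format` explicitly avoids relying on pandas to guess whether the day or the month comes first. The `pg_2` and `ws_2` previews above show the year-first layout, so the second format string would presumably apply to them as well.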


[Back to Table of Contents](#back)

### Data Wrangling

[Back to Table of Contents](#back)

## Exploratory Data Analysis

[Back to Table of Contents](#back)

## Conclusions and Recommendations <a id='conclusions-and-reccomendations'></a>

[Back to Table of Contents](#back)

## Dataset Citation

syntax:
[Dataset creator's name]. ([Year & Month of dataset creation]). [Name of the dataset], [Version of the dataset]. Retrieved [Date Retrieved] from [URL of the dataset].

example:
Tatman, R. (2017, November). R vs. Python: The Kitchen Gadget Test, Version 1. Retrieved December 20, 2017 from https://www.kaggle.com/rtatman/r-vs-python-the-kitchen-gadget-test.

[Back to Table of Contents](#back)