## Data Quality Check

In [1]:
# Import pandas and load the three datasets
import pandas as pd 
 
benin = pd.read_csv('../../data/benin-malanville.csv')
sierraleone = pd.read_csv('../../data/sierraleone-bumbuna.csv')
togo = pd.read_csv('../../data/togo-dapaong_qc.csv')

### Check for missing values 

In [4]:
print("Missing values in Benin dataset:\n", benin.isnull().sum())


Missing values in Benin dataset:
 Timestamp             0
GHI                   0
DNI                   0
DHI                   0
ModA                  0
ModB                  0
Tamb                  0
RH                    0
WS                    0
WSgust                0
WSstdev               0
WD                    0
WDstdev               0
BP                    0
Cleaning              0
Precipitation         0
TModA                 0
TModB                 0
Comments         525600
dtype: int64


In [5]:
print("Missing values in Sierra Leone dataset:\n", sierraleone.isnull().sum())

Missing values in Sierra Leone dataset:
 Timestamp             0
GHI                   0
DNI                   0
DHI                   0
ModA                  0
ModB                  0
Tamb                  0
RH                    0
WS                    0
WSgust                0
WSstdev               0
WD                    0
WDstdev               0
BP                    0
Cleaning              0
Precipitation         0
TModA                 0
TModB                 0
Comments         525600
dtype: int64


In [6]:
print("Missing values in Togo dataset:\n", togo.isnull().sum())

Missing values in Togo dataset:
 Timestamp             0
GHI                   0
DNI                   0
DHI                   0
ModA                  0
ModB                  0
Tamb                  0
RH                    0
WS                    0
WSgust                0
WSstdev               0
WD                    0
WDstdev               0
BP                    0
Cleaning              0
Precipitation         0
TModA                 0
TModB                 0
Comments         525600
dtype: int64
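
The three printouts above can be condensed into a single table. The following is only a sketch: `missing_summary` and `drop_empty_columns` are hypothetical helper names, and the `demo` frame is a toy stand-in rather than the real data. Dropping `Comments` assumes the column is empty in every row, which the count of 525600 suggests but which should be confirmed against each dataset's row count.

```python
import numpy as np
import pandas as pd

def missing_summary(frames):
    """Return one table of missing-value counts, one column per dataset."""
    return pd.DataFrame({name: df.isnull().sum() for name, df in frames.items()})

def drop_empty_columns(df):
    """Return a copy of df without columns that contain no values at all."""
    return df.dropna(axis=1, how="all")

# Toy stand-in for a dataset with an entirely empty Comments column.
demo = pd.DataFrame({"GHI": [1.0, 2.0], "Comments": [np.nan, np.nan]})

summary = missing_summary({"demo": demo})
cleaned = drop_empty_columns(demo)
print(summary)
print(cleaned.columns.tolist())
```

With the real data, the call would look like `missing_summary({"benin": benin, "sierraleone": sierraleone, "togo": togo})`.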


### Check for negative values in GHI, DNI, and DHI columns

In [7]:
print("Negative values in Benin dataset:\n", benin[(benin['GHI'] < 0) | (benin['DNI'] < 0) | (benin['DHI'] < 0)])
print("Negative values in Sierra Leone dataset:\n", sierraleone[(sierraleone['GHI'] < 0) | (sierraleone['DNI'] < 0) | (sierraleone['DHI'] < 0)])
print("Negative values in Togo dataset:\n", togo[(togo['GHI'] < 0) | (togo['DNI'] < 0) | (togo['DHI'] < 0)])

Negative values in Benin dataset:
                Timestamp  GHI  DNI  DHI  ModA  ModB  Tamb    RH   WS  WSgust  \
0       2021-08-09 00:01 -1.2 -0.2 -1.1   0.0   0.0  26.2  93.4  0.0     0.4   
1       2021-08-09 00:02 -1.1 -0.2 -1.1   0.0   0.0  26.2  93.6  0.0     0.0   
2       2021-08-09 00:03 -1.1 -0.2 -1.1   0.0   0.0  26.2  93.7  0.3     1.1   
3       2021-08-09 00:04 -1.1 -0.1 -1.0   0.0   0.0  26.2  93.3  0.2     0.7   
4       2021-08-09 00:05 -1.0 -0.1 -1.0   0.0   0.0  26.2  93.3  0.1     0.7   
...                  ...  ...  ...  ...   ...   ...   ...   ...  ...     ...   
525595  2022-08-08 23:56 -5.5 -0.1 -5.9   0.0   0.0  23.1  98.3  0.3     1.1   
525596  2022-08-08 23:57 -5.5 -0.1 -5.8   0.0   0.0  23.1  98.3  0.2     0.7   
525597  2022-08-08 23:58 -5.5 -0.1 -5.8   0.0   0.0  23.1  98.4  0.6     1.1   
525598  2022-08-08 23:59 -5.5 -0.1 -5.8   0.0   0.0  23.1  98.3  0.9     1.3   
525599  2022-08-09 00:00 -5.5 -0.1 -5.7   0.0   0.0  23.1  98.3  1.2     1.6   
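
Printing every affected row produces very long output, so a per-column count can be easier to read. The sketch below is an assumption about how that might be done, not part of the original notebook: `count_negatives` and `clip_negatives` are hypothetical helpers, and `demo` is toy data. The rows shown above have night-time timestamps, where irradiance should be close to zero, so replacing negative readings with 0 is one possible treatment; whether it is appropriate depends on the analysis.

```python
import pandas as pd

IRRADIANCE_COLUMNS = ("GHI", "DNI", "DHI")

def count_negatives(df, columns=IRRADIANCE_COLUMNS):
    """Return the number of strictly negative readings in each column."""
    return (df[list(columns)] < 0).sum()

def clip_negatives(df, columns=IRRADIANCE_COLUMNS):
    """Return a copy of df in which negative readings in `columns` are set to 0."""
    out = df.copy()
    out[list(columns)] = out[list(columns)].clip(lower=0)
    return out

# Toy stand-in: one night-time row with small negative readings.
demo = pd.DataFrame({
    "GHI": [-1.2, 0.0, 350.0],
    "DNI": [-0.2, 0.0, 120.5],
    "DHI": [-1.1, 0.0, 200.0],
})

negative_counts = count_negatives(demo)
clipped = clip_negatives(demo)
print(negative_counts)
```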

    

### Check for outliers in sensor readings (ModA, ModB) and wind speed data (WS, WSgust)
> Approach used to detect outliers

- The function `detect_outliers` identifies and prints outliers in the specified columns of a given DataFrame.
- It uses the Interquartile Range (IQR) method. For each column in the `columns` list, the function calculates the first quartile (Q1) and the third quartile (Q3), then computes the IQR as `Q3 - Q1`.
- Outliers are data points that fall below the lower bound (`Q1 - 1.5 * IQR`) or above the upper bound (`Q3 + 1.5 * IQR`). For each column, the function prints every row whose value lies outside these bounds.
- The code then applies this function to the three datasets (Benin, Sierra Leone, and Togo), checking the columns 'ModA', 'ModB', 'WS', and 'WSgust'.
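
To make the IQR bounds concrete, here is a small worked example on made-up numbers (the `readings` series is purely illustrative and not taken from the datasets). It relies on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

# Illustrative values only: five ordinary readings and one extreme one.
readings = pd.Series([1, 2, 3, 4, 5, 100], name="toy")

q1 = readings.quantile(0.25)   # 2.25
q3 = readings.quantile(0.75)   # 4.75
iqr = q3 - q1                  # 2.5
lower_bound = q1 - 1.5 * iqr   # -1.5
upper_bound = q3 + 1.5 * iqr   # 8.5

# Keep only the values outside [lower_bound, upper_bound].
outliers = readings[(readings < lower_bound) | (readings > upper_bound)]
print(outliers.tolist())       # [100]
```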

In [9]:
def detect_outliers(df, columns):
    for column in columns:
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        print(f"Outliers in {column}:\n", outliers)

print("Outliers in Benin dataset:")
detect_outliers(benin, ['ModA', 'ModB', 'WS', 'WSgust'])

Outliers in Benin dataset:
Outliers in ModA:
                Timestamp     GHI    DNI    DHI    ModA    ModB  Tamb    RH  \
671     2021-08-09 11:12  1274.0  698.8  615.2  1210.3  1210.3  30.1  69.6   
674     2021-08-09 11:15  1349.0  771.8  618.0  1281.5  1281.5  30.9  67.1   
676     2021-08-09 11:17  1334.0  751.9  620.0  1267.3  1267.3  31.1  67.0   
850     2021-08-09 14:11  1324.0  813.0  532.3  1218.0  1217.0  31.0  62.9   
5019    2021-08-12 11:40  1324.0  675.6  659.6  1271.0  1271.0  29.1  75.4   
...                  ...     ...    ...    ...     ...     ...   ...   ...   
514922  2022-08-01 14:03  1311.0  698.0  628.4  1182.0  1171.0  29.7  70.2   
517777  2022-08-03 13:38  1268.0  652.5  612.1  1163.0  1150.0  30.1  70.7   
520659  2022-08-05 13:40  1280.0  778.5  497.2  1168.0  1161.0  29.0  70.3   
522074  2022-08-06 13:15  1262.0  772.4  475.8  1166.0  1153.0  31.1  66.0   
522075  2022-08-06 13:16  1289.0  758.1  519.2  1191.0  1179.0  31.4  65.9   

         WS  WSgu

In [10]:
print("Outliers in Sierra Leone dataset:")
detect_outliers(sierraleone, ['ModA', 'ModB', 'WS', 'WSgust'])

Outliers in Sierra Leone dataset:
Outliers in ModA:
                Timestamp    GHI    DNI    DHI    ModA    ModB  Tamb    RH  \
655     2021-10-30 10:56  851.0  285.0  605.9   912.0   890.0  26.0  84.2   
658     2021-10-30 10:59  932.0  397.7  589.4  1002.0   978.0  26.2  85.6   
659     2021-10-30 11:00  995.0  492.6  571.1  1065.0  1039.0  26.4  83.4   
660     2021-10-30 11:01  988.0  504.2  552.7  1050.0  1025.0  26.5  83.6   
661     2021-10-30 11:02  978.0  506.0  541.7  1049.0  1025.0  26.4  82.7   
...                  ...    ...    ...    ...     ...     ...   ...   ...   
524844  2022-10-29 11:25  888.0  362.7  559.4   932.0   910.0  30.6  77.0   
524845  2022-10-29 11:26  986.0  474.4  556.7  1035.0  1009.0  30.7  76.0   
524862  2022-10-29 11:43  928.0  400.5  556.8   974.0   949.0  30.0  76.4   
524863  2022-10-29 11:44  925.0  396.3  556.9   972.0   946.0  30.2  76.2   
524864  2022-10-29 11:45  905.0  385.8  547.0   952.0   927.0  30.3  77.0   

         WS  WSgust  W

In [11]:
print("Outliers in Togo dataset:")
detect_outliers(togo, ['ModA', 'ModB', 'WS', 'WSgust'])

Outliers in Togo dataset:
Outliers in ModA:
                Timestamp     GHI    DNI    DHI    ModA    ModB  Tamb    RH  \
674     2021-10-25 11:15  1094.0  706.7  490.8  1140.0  1126.0  31.2  67.9   
675     2021-10-25 11:16  1085.0  696.1  491.6  1133.0  1118.0  31.2  66.5   
681     2021-10-25 11:22  1028.0  578.4  528.4  1063.0  1049.0  31.0  67.3   
707     2021-10-25 11:48  1036.0  690.2  434.8  1068.0  1056.0  31.6  66.3   
708     2021-10-25 11:49  1040.0  715.7  411.5  1068.0  1056.0  31.4  66.3   
...                  ...     ...    ...    ...     ...     ...   ...   ...   
522079  2022-10-22 13:20  1022.0  662.3  425.8  1070.1  1026.0  34.0  36.0   
522080  2022-10-22 13:21  1026.0  695.2  400.0  1070.1  1027.0  34.0  36.7   
522081  2022-10-22 13:22  1060.0  733.3  405.0  1107.0  1062.0  34.2  35.2   
522087  2022-10-22 13:28  1040.0  725.8  399.7  1086.5  1043.0  34.4  34.6   
522088  2022-10-22 13:29  1024.0  727.0  381.9  1069.1  1025.0  34.5  34.8   

         WS  WSgus
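
Because the full-row printouts are long and get truncated, a compact summary of bounds and counts per column may be more practical. The sketch below applies the same IQR rule as `detect_outliers`; the function name `iqr_outlier_summary` and the `demo` data are assumptions made for illustration.

```python
import pandas as pd

def iqr_outlier_summary(df, columns):
    """Return the IQR bounds and the number of outliers for each column."""
    rows = []
    for column in columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        mask = (df[column] < lower) | (df[column] > upper)
        rows.append({
            "column": column,
            "lower_bound": lower,
            "upper_bound": upper,
            "n_outliers": int(mask.sum()),
        })
    return pd.DataFrame(rows).set_index("column")

# Toy stand-in: WS contains one extreme value, WSgust contains none.
demo = pd.DataFrame({"WS": [1, 2, 3, 4, 5, 100], "WSgust": [2, 3, 4, 5, 6, 7]})
summary = iqr_outlier_summary(demo, ["WS", "WSgust"])
print(summary)
```

On the real data, this would be called as `iqr_outlier_summary(benin, ['ModA', 'ModB', 'WS', 'WSgust'])`, and likewise for the other two datasets.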

### Z-Score Analysis

- The Z-score measures how many standard deviations a data point lies from the mean. It is used to identify data points that differ significantly from the mean.
- The function `calculate_z_scores` calculates the Z-scores for the specified columns of a given DataFrame, using `scipy.stats.zscore`.
- For each column in the `columns` list, the Z-score of a data point is computed as (data point - mean) / standard deviation and stored in a new `<column>_zscore` column of the DataFrame.
- The function prints the rows whose Z-score is greater than `threshold` or less than `-threshold` (3 by default).
- The code then applies this function to the three datasets (Benin, Sierra Leone, and Togo), checking the columns 'ModA', 'ModB', 'WS', and 'WSgust'.


In [3]:
from scipy.stats import zscore

def calculate_z_scores(df, columns, threshold=3):
    for column in columns:
        df[f'{column}_zscore'] = zscore(df[column])
        outliers = df[(df[f'{column}_zscore'] > threshold) | (df[f'{column}_zscore'] < -threshold)]
        print(f"Data points with Z-scores greater than {threshold} or less than {-threshold} in {column}:\n", outliers)
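
Note that `calculate_z_scores` adds a `<column>_zscore` column to the DataFrame it receives. If that side effect is unwanted, a variant that leaves the input untouched could look like the sketch below. The name `zscore_outliers` and the `demo` data are hypothetical; the standard deviation uses `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

def zscore_outliers(df, column, threshold=3.0):
    """Return the rows of df whose absolute Z-score in `column` exceeds `threshold`.

    The input DataFrame is not modified.
    """
    values = df[column]
    # Population standard deviation (ddof=0), as in scipy.stats.zscore's default.
    z = (values - values.mean()) / values.std(ddof=0)
    return df[z.abs() > threshold]

# Toy stand-in: twenty identical readings and one extreme value.
demo = pd.DataFrame({"WS": [1.0] * 20 + [50.0]})
flagged = zscore_outliers(demo, "WS")
print(flagged)
```

Returning the filtered rows instead of printing them also makes it possible to count or inspect the flagged rows afterwards, for example with `len(flagged)`.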

In [4]:
print("Z-Score Analysis for Benin dataset:")
calculate_z_scores(benin, ['ModA', 'ModB', 'WS', 'WSgust'])

Z-Score Analysis for Benin dataset:
Data points with Z-scores greater than 3 or less than -3 in ModA:
                Timestamp     GHI    DNI    DHI    ModA    ModB  Tamb    RH  \
674     2021-08-09 11:15  1349.0  771.8  618.0  1281.5  1281.5  30.9  67.1   
676     2021-08-09 11:17  1334.0  751.9  620.0  1267.3  1267.3  31.1  67.0   
850     2021-08-09 14:11  1324.0  813.0  532.3  1218.0  1217.0  31.0  62.9   
5019    2021-08-12 11:40  1324.0  675.6  659.6  1271.0  1271.0  29.1  75.4   
5024    2021-08-12 11:45  1360.0  827.0  543.5  1305.6  1305.6  29.6  71.3   
8021    2021-08-14 13:42  1309.0  880.0  427.1  1220.0  1222.0  30.6  68.4   
12245   2021-08-17 12:06  1390.0  860.0  518.3  1320.5  1320.5  28.8  75.9   
12246   2021-08-17 12:07  1413.0  880.0  523.9  1342.3  1342.3  28.8  75.0   
12270   2021-08-17 12:31  1381.0  878.0  481.2  1312.0  1312.0  29.3  72.7   
12271   2021-08-17 12:32  1391.0  869.0  500.4  1321.5  1321.5  29.4  74.8   
12272   2021-08-17 12:33  1375.0  818.0

In [5]:
print("Z-Score Analysis for Sierra Leone dataset:")
calculate_z_scores(sierraleone, ['ModA', 'ModB', 'WS', 'WSgust'])

Z-Score Analysis for Sierra Leone dataset:
Data points with Z-scores greater than 3 or less than -3 in ModA:
                Timestamp     GHI    DNI    DHI    ModA    ModB  Tamb    RH  \
662     2021-10-30 11:03  1071.0  616.0  539.1  1142.0  1116.0  26.4  82.5   
663     2021-10-30 11:04  1119.0  673.5  538.4  1193.0  1167.0  26.3  83.1   
670     2021-10-30 11:11  1092.0  667.1  507.6  1163.0  1137.0  27.0  83.4   
671     2021-10-30 11:12  1154.0  755.7  491.3  1237.0  1208.0  27.0  82.4   
672     2021-10-30 11:13  1063.0  665.2  478.0  1132.0  1105.0  27.1  80.1   
...                  ...     ...    ...    ...     ...     ...   ...   ...   
517751  2022-10-24 13:12  1091.0  657.7  471.9  1123.0  1098.0  30.2  75.8   
517808  2022-10-24 14:09  1085.0  737.7  447.3  1114.0  1092.0  30.4  74.1   
517811  2022-10-24 14:12  1087.0  745.7  445.6  1112.0  1089.0  30.7  73.8   
517823  2022-10-24 14:24  1109.0  762.0  470.3  1130.0  1109.0  30.6  73.1   
520614  2022-10-26 12:55  1068.0

In [6]:
print("Z-Score Analysis for Togo dataset:")
calculate_z_scores(togo, ['ModA', 'ModB', 'WS', 'WSgust'])

Z-Score Analysis for Togo dataset:
Data points with Z-scores greater than 3 or less than -3 in ModA:
                Timestamp     GHI    DNI    DHI    ModA    ModB  Tamb    RH  \
248418  2022-04-15 12:19  1267.0  666.5  593.3  1194.0  1164.0  35.3  45.3   
248425  2022-04-15 12:26  1263.0  653.3  602.0  1185.0  1154.0  35.3  45.8   
255471  2022-04-20 09:52  1262.0  733.6  631.7  1184.0  1145.0  34.2  48.6   
255593  2022-04-20 11:54  1263.0  756.6  492.9  1185.0  1152.0  35.5  42.4   
269959  2022-04-30 11:20  1301.0  733.9  562.1  1195.0  1178.0  31.0  54.5   
...                  ...     ...    ...    ...     ...     ...   ...   ...   
510411  2022-10-14 10:52  1162.0  742.9  453.3  1196.2  1179.0  30.1  69.3   
510412  2022-10-14 10:53  1285.0  862.8  460.8  1309.4  1290.0  30.1  67.8   
510413  2022-10-14 10:54  1202.0  748.9  494.9  1242.3  1225.0  30.1  68.0   
510414  2022-10-14 10:55  1267.0  790.9  514.5  1299.4  1280.0  30.1  68.6   
510416  2022-10-14 10:57  1214.0  699.2 