# 07a - Prepare data for `Deepchecks`, and run some `Deepchecks` tests

__Goal__:

1. Read a raw CSV file into a dataframe, e.g. `weather_dataset_raw_development.csv`.
2. Preprocess it and prepare it for Deepchecks. In particular:
   - Rename the columns.
   - Remove useless columns: `S_No`, `Location`, `Apparent_temperature`.
   - Preprocess `Timestamp`:
     - Convert the `Timestamp` type to timezone-aware `datetime` in UTC.
     - Check if `Timestamp` is in ascending order; if not, sort `df` by `Timestamp`.
     - Remove `Timestamp` duplicates.
     - Set `Timestamp` as the index of the dataframe (currently disabled).
     - Add rows of `NaN`s to `df` when a timestamp is missing (currently disabled).
   - Add columns `Year` and `Month`.
   - Replace categories `"snow"` and `"clear"` of the target variable `Weather` with `"no_rain"`.
3. Give the preprocessed dataset a name, e.g. `dev`, and save it, e.g. in `deepchecks/dev/dev.pkl`.

### Import

In [1]:
import pandas as pd
from pathlib import Path

In [2]:
data_dir = Path('../data')
deepchecks_dir = Path('../deepchecks')
deepchecks_dir.mkdir(exist_ok=True)

# 1. Read a raw `CSV` file into a dataframe

In [3]:
csv_file_name = 'weather_dataset_raw_development.csv'

df = pd.read_csv(data_dir / csv_file_name)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43848 entries, 0 to 43847
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   S_No                    43848 non-null  int64  
 1   Timestamp               43848 non-null  object 
 2   Location                43848 non-null  object 
 3   Temperature_C           43848 non-null  float64
 4   Apparent_Temperature_C  43848 non-null  float64
 5   Humidity                43848 non-null  float64
 6   Wind_speed_kmph         43848 non-null  float64
 7   Wind_bearing_degrees    43848 non-null  int64  
 8   Visibility_km           43848 non-null  float64
 9   Pressure_millibars      43848 non-null  float64
 10  Weather_conditions      43843 non-null  object 
dtypes: float64(6), int64(2), object(3)
memory usage: 3.7+ MB


# 2. Preprocess the dataframe 

## A. Rename the columns

In [4]:
df.rename(columns={"Temperature_C": "Temperature", 
                   "Apparent_Temperature_C": "Apparent_temperature",
                   "Wind_speed_kmph": "Wind_speed",
                   "Wind_bearing_degrees": "Wind_bearing",
                   "Visibility_km": "Visibility",
                   "Pressure_millibars": "Pressure",
                   "Weather_conditions": "Weather"}, inplace=True)
df.head(1)

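|   | S_No | Timestamp | Location | Temperature | Apparent_temperature | Humidity | Wind_speed | Wind_bearing | Visibility | Pressure | Weather |
|---|------|-----------|----------|-------------|----------------------|----------|------------|--------------|------------|----------|---------|
| 0 | 2881 | 2006-01-01 00:00:00+00:00 | Port of Turku, Finland | 1.161111 | -3.238889 | 0.85 | 16.6152 | 139 | 9.9015 | 1016.15 | rain |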


## B. Remove useless columns

In [5]:
# Drop the three unused columns in a single call
df.drop(columns=['S_No', 'Location', 'Apparent_temperature'], inplace=True)
df.head(1)

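|   | Timestamp | Temperature | Humidity | Wind_speed | Wind_bearing | Visibility | Pressure | Weather |
|---|-----------|-------------|----------|------------|--------------|------------|----------|---------|
| 0 | 2006-01-01 00:00:00+00:00 | 1.161111 | 0.85 | 16.6152 | 139 | 9.9015 | 1016.15 | rain |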


## C. Preprocess `Timestamp`

#### C.1. Convert the `Timestamp` type to timezone-aware `datetime` in UTC

In [6]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], utc=True)
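The difference between a UTC-aware and an offset-free (naive) timestamp column can be sketched on toy data. The values below are illustrative and not taken from the dataset. The `tz_localize(None)` step is an optional extra, shown only in case a column without any offset is wanted; the notebook itself keeps the UTC-aware column.

```python
import pandas as pd

# Toy timestamps with mixed UTC offsets (illustrative values, not from the dataset)
raw = pd.Series(["2006-01-01 00:00:00+02:00", "2006-01-01 01:00:00+00:00"])

# utc=True converts every value to UTC and yields a timezone-aware column
aware = pd.to_datetime(raw, utc=True)
print(aware.dt.tz)

# Optional: tz_localize(None) drops the timezone while keeping the UTC wall-clock time
naive = aware.dt.tz_localize(None)
print(naive.dt.tz)
```

With `utc=True` alone, the values still print with a `+00:00` suffix, as in the `df.head(1)` outputs of this notebook.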

#### C.2. Check if `Timestamp` is in ascending order, if not sort `df` by `Timestamp`

In [7]:
if df["Timestamp"].is_monotonic_increasing:  # False if the column is unsorted or contains NaT
    print("The dataset `df` is already sorted by `Timestamp`.")
else:
    df.sort_values(by="Timestamp", inplace=True)  # Rows with a missing `Timestamp` (NaT) are placed last
df.dropna(subset=["Timestamp"], inplace=True)     # Remove rows with a missing `Timestamp`
print(f"Length of `df`: {len(df)}")

The dataset `df` is already sorted by `Timestamp`.
Length of `df`: 43848


#### C.3. Remove `Timestamp` duplicates

In [8]:
df = df.drop_duplicates(subset=["Timestamp"], keep="last") # We keep the last row of the subset of duplicated timestamps
print(f"Length of `df`: {len(df)}")

Length of `df`: 43824
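Before dropping duplicates, it can help to look at the rows that share a timestamp. The following is a minimal sketch on a toy frame (the values are made up for illustration); `keep=False` flags every member of a duplicated group, whereas `keep="last"` matches the choice made in the cell above.

```python
import pandas as pd

# Toy frame with one duplicated timestamp (illustrative values)
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2006-01-01 00:00", "2006-01-01 01:00", "2006-01-01 01:00"], utc=True
    ),
    "Temperature": [1.0, 2.0, 2.5],
})

# keep=False flags every row that shares its timestamp with another row
dupes = toy[toy.duplicated(subset=["Timestamp"], keep=False)]
print(dupes)

# keep="last" retains only the final row of each duplicated group
deduped = toy.drop_duplicates(subset=["Timestamp"], keep="last")
print(f"{len(toy) - len(deduped)} row(s) removed")
```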


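#### C.4. Set `Timestamp` as the index of `df`

This step is currently disabled (the code is commented out), so `Timestamp` stays a regular column.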

In [9]:
#df.set_index('Timestamp', inplace=True)

#### C.5. Add rows of `NaN`s to `df` when a timestamp is missing

This step is also disabled, because it relies on `Timestamp` being the index of `df` (step C.4).

In [10]:
#df_min_timestamp = df.index.min()
#df_max_timestamp = df.index.max()
#print(f'Minimum index of "df": {df_min_timestamp} \nMaximum index of "df": {df_max_timestamp}')

In [11]:
# regular_timestamp_range = pd.date_range(start=df_min_timestamp, end=df_max_timestamp,freq='H')
# print(f"Length of `df`: {len(df)}\nLength of `regular_timestamp_range`: {len(regular_timestamp_range)}")

# diff = len(regular_timestamp_range)-len(df)
# if diff == 0:
#     print("\nThere were no missing timestamps in `df`!")
# else:
#     print(f"\nThere were {diff} missing timestamps in `df`!")

In [12]:
# df = df.reindex(regular_timestamp_range, copy=True) # The resulting df has rows of "NaN"s where a timestamp is missing
# print(f"Length of `df`: {len(df)}")
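Since the cells of this step are commented out, here is a small sketch of the same idea on a toy frame that is indexed by timestamp. The data is made up for illustration. The frequency is passed as `pd.Timedelta(hours=1)` rather than a string alias, an assumption made here to avoid differences in alias spelling between pandas versions.

```python
import pandas as pd

# Toy hourly series with the 02:00 reading missing (illustrative values)
toy = pd.DataFrame(
    {"Temperature": [1.0, 2.0, 4.0]},
    index=pd.to_datetime(
        ["2006-01-01 00:00", "2006-01-01 01:00", "2006-01-01 03:00"], utc=True
    ),
)

# Regular hourly grid spanning the first to the last timestamp
full_range = pd.date_range(
    start=toy.index.min(), end=toy.index.max(), freq=pd.Timedelta(hours=1)
)

# reindex inserts a row of NaNs for every timestamp absent from the original index
filled = toy.reindex(full_range)
missing = filled.index[filled["Temperature"].isna()]
print(f"{len(full_range) - len(toy)} missing timestamp(s): {list(missing)}")
```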

## D. Add columns `Year` and `Month`

In [13]:
# df["Year"] = pd.Series(df.index).dt.year.to_numpy()
# df["Month"] = pd.Series(df.index).dt.month.to_numpy()
df["Year"] = df["Timestamp"].dt.year
df["Month"] = df["Timestamp"].dt.month
df.head(1)

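|   | Timestamp | Temperature | Humidity | Wind_speed | Wind_bearing | Visibility | Pressure | Weather | Year | Month |
|---|-----------|-------------|----------|------------|--------------|------------|----------|---------|------|-------|
| 0 | 2006-01-01 00:00:00+00:00 | 1.161111 | 0.85 | 16.6152 | 139 | 9.9015 | 1016.15 | rain | 2006 | 1 |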


## E. Replace categories `"snow"` and `"clear"` of the target variable `Weather` with `"no_rain"`

In [14]:
df["Weather"] = df["Weather"].replace({"snow": "no_rain", "clear": "no_rain"})
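A quick way to confirm the effect of the replacement is to count the resulting categories. The sketch below uses a toy series (illustrative values) that includes one missing label, since `df.info()` reports a few null values in the raw `Weather_conditions` column; `replace` leaves such missing values untouched.

```python
import pandas as pd

# Toy target column with one missing label (illustrative values)
weather = pd.Series(["rain", "snow", "clear", None, "rain"])

# Map both non-rain categories to a single label; missing values are left as they are
binary = weather.replace({"snow": "no_rain", "clear": "no_rain"})

# dropna=False makes any remaining missing labels visible in the counts
print(binary.value_counts(dropna=False))
```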

# 3. Give the preprocessed dataset a name and save it

In [15]:
preprocessed_dataset_name = "dev"
deepchecks_subdir = deepchecks_dir / preprocessed_dataset_name
deepchecks_subdir.mkdir(exist_ok=True)

df.to_pickle(deepchecks_subdir / f"{preprocessed_dataset_name}.pkl")
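To check that the pickle preserves dtypes (in particular the timezone-aware `Timestamp`), a round trip can be sketched as follows. The example writes a toy frame to a temporary directory instead of `../deepchecks`, so the paths and values here are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the preprocessed dataset (illustrative values)
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2006-01-01 00:00"], utc=True),
    "Weather": ["rain"],
})

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the <name>/<name>.pkl layout used above, inside a throwaway directory
    pkl_path = Path(tmp) / "dev" / "dev.pkl"
    pkl_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# The restored frame should match the original, including the timezone-aware dtype
print(restored.dtypes)
```

Unlike CSV, the pickle format keeps the `datetime` dtype, so `Timestamp` does not need to be parsed again when the file is loaded.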