
# Exploratory Data Analysis

In [25]:
# Importing Libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
import lightgbm as lgb
from prophet import Prophet
from statsmodels.tsa.statespace.sarimax import SARIMAX
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

In [26]:
import warnings
warnings.filterwarnings('ignore')

In [27]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:
# Load data
train_df = pd.read_csv('/content/drive/MyDrive/DataCrunchCompetitionDatasets/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/DataCrunchCompetitionDatasets/test.csv')
submission_df = pd.read_csv('/content/drive/MyDrive/DataCrunchCompetitionDatasets/sample_submission.csv')

In [29]:
# preview the dataset
train_df.head()

Unnamed: 0,ID,Year,Month,Day,kingdom,latitude,longitude,Avg_Temperature,Avg_Feels_Like_Temperature,Temperature_Range,Feels_Like_Temperature_Range,Radiation,Rain_Amount,Rain_Duration,Wind_Speed,Wind_Direction,Evapotranspiration
0,1,1,4,1,Arcadia,24.280002,-37.22998,25.5,30.5,8.5,10.3,22.52,58.89,16,8.6,283,1.648659
1,2,1,4,1,Atlantis,22.979999,-37.32999,299.65,305.15,5.9,8.2,22.73,11.83,12,15.8,161,1.583094
2,3,1,4,1,Avalon,22.88,-37.130006,26.3,31.5,5.2,6.4,22.73,11.83,12,15.8,161,1.593309
3,4,1,4,1,Camelot,24.180003,-36.929994,24.0,28.4,8.2,10.7,22.67,75.27,16,6.4,346,1.638997
4,5,1,4,1,Dorne,25.780002,-37.53,28.0,32.8,5.7,10.2,22.35,4.81,8,16.7,185,1.719189


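*   The temperature columns appear to mix units: the Atlantis row reads 299.65 for `Avg_Temperature`, while the other previewed rows read between 24.0 and 28.0. Some temperature measurements may therefore be in **Celsius or Kelvin** depending on the kingdom.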
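
One way to handle the mixed units is sketched below. It is not part of the original notebook: the helper name `kelvin_to_celsius` and the cut-off of 100 are assumptions made for illustration, on the reasoning that no plausible Celsius reading exceeds 100 while every Kelvin reading for this climate does. The range columns are left alone because a difference between two temperatures is the same in both units.

```python
import pandas as pd

KELVIN_OFFSET = 273.15
KELVIN_THRESHOLD = 100.0  # assumption: readings above this value are in Kelvin


def kelvin_to_celsius(df, columns, threshold=KELVIN_THRESHOLD):
    """Return a copy of df where Kelvin-scale readings are converted to Celsius."""
    out = df.copy()
    for col in columns:
        is_kelvin = out[col] > threshold  # boolean mask of suspected Kelvin rows
        out.loc[is_kelvin, col] = out.loc[is_kelvin, col] - KELVIN_OFFSET
    return out


# Three rows taken from the train_df.head() preview above
sample = pd.DataFrame({
    "kingdom": ["Arcadia", "Atlantis", "Avalon"],
    "Avg_Temperature": [25.5, 299.65, 26.3],
    "Avg_Feels_Like_Temperature": [30.5, 305.15, 31.5],
})

converted = kelvin_to_celsius(
    sample, ["Avg_Temperature", "Avg_Feels_Like_Temperature"]
)
print(converted)
```

Before applying something like this to the full data, it would be worth confirming per kingdom that the suspected Kelvin values really sit in a separate band, for example with `train_df.groupby('kingdom')['Avg_Temperature'].describe()`.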




In [30]:
# preview the dataset
test_df.head()

Unnamed: 0,ID,Year,Month,Day,kingdom
0,84961,9,1,1,Arcadia
1,84962,9,1,1,Atlantis
2,84963,9,1,1,Avalon
3,84964,9,1,1,Camelot
4,84965,9,1,1,Dorne


In [31]:
# preview the dataset
submission_df.head()

Unnamed: 0,ID,Avg_Temperature,Radiation,Rain_Amount,Wind_Speed,Wind_Direction
0,84961,0,0,0,0,0
1,84962,0,0,0,0,0
2,84963,0,0,0,0,0
3,84964,0,0,0,0,0
4,84965,0,0,0,0,0


In [32]:
train_df.shape

(84960, 17)

We can see that there are **84960 instances** and **17 variables** in the train data set.

In [33]:
test_df.shape

(4530, 5)

We can see that there are **4530 instances** and **5 variables** in the test data set.

In [34]:
submission_df.shape

(4530, 6)

We can see that there are **4530 instances** and **6 variables** in the sample submission data set.

In [35]:
train_col_names = train_df.columns
train_col_names

Index(['ID', 'Year', 'Month', 'Day', 'kingdom', 'latitude', 'longitude',
       'Avg_Temperature', 'Avg_Feels_Like_Temperature', 'Temperature_Range',
       'Feels_Like_Temperature_Range', 'Radiation', 'Rain_Amount',
       'Rain_Duration', 'Wind_Speed', 'Wind_Direction', 'Evapotranspiration'],
      dtype='object')

Here we can clearly see that some column names start with uppercase letters. To keep the column naming convention consistent, we convert all column names to lowercase.

In [36]:
train_df.columns = train_df.columns.str.lower()
train_df.columns

Index(['id', 'year', 'month', 'day', 'kingdom', 'latitude', 'longitude',
       'avg_temperature', 'avg_feels_like_temperature', 'temperature_range',
       'feels_like_temperature_range', 'radiation', 'rain_amount',
       'rain_duration', 'wind_speed', 'wind_direction', 'evapotranspiration'],
      dtype='object')

In [37]:
# view summary of dataset
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84960 entries, 0 to 84959
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            84960 non-null  int64  
 1   year                          84960 non-null  int64  
 2   month                         84960 non-null  int64  
 3   day                           84960 non-null  int64  
 4   kingdom                       84960 non-null  object 
 5   latitude                      84960 non-null  float64
 6   longitude                     84960 non-null  float64
 7   avg_temperature               84960 non-null  float64
 8   avg_feels_like_temperature    84960 non-null  float64
 9   temperature_range             84960 non-null  float64
 10  feels_like_temperature_range  84960 non-null  float64
 11  radiation                     84960 non-null  float64
 12  rain_amount                   84960 non-null  float64
 13  r

**Types of variables**

---


In this section, we separate the dataset into categorical and numerical variables. The dataset contains a mixture of both: categorical variables have data type `object`, and numerical variables have data type `float64` or `int64`.

---
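
As an alternative to checking each column's dtype by hand, the same split can be sketched with pandas' `DataFrame.select_dtypes`. The small `sample` frame and the variable names `categorical_cols` and `numerical_cols` are illustrative assumptions, not names from the notebook.

```python
import pandas as pd

# Toy frame with one text column and three numeric columns,
# using column names and values from the training data preview
sample = pd.DataFrame({
    "id": [1, 2],
    "kingdom": ["Arcadia", "Atlantis"],
    "latitude": [24.280002, 22.979999],
    "rain_duration": [16, 12],
})

# "number" matches all integer and float dtypes
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical here
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Selecting by "not numeric" rather than by `object` keeps the split stable if the text column is stored with a dedicated string dtype instead of `object`.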



In [38]:
# find categorical variables
categorical = [var for var in train_df.columns if train_df[var].dtype=='O']
print('There are {} categorical variables\n'.format(len(categorical)))
print('The categorical variables are :', categorical)

There are 1 categorical variables

The categorical variables are : ['kingdom']


## Explore problems within categorical variables

In [39]:
# check missing values in categorical variables
train_df[categorical].isnull().sum()

Unnamed: 0,0
kingdom,0


In [40]:
# check for cardinality in categorical variables
for var in categorical:
    print(var, ' contains ', len(train_df[var].unique()), ' labels')

kingdom  contains  30  labels


High cardinality can cause serious problems for a machine learning model, so we check the number of distinct labels in each categorical variable. Here, `kingdom` contains 30 labels.

In [42]:
train_df.kingdom.unique()

array(['Arcadia', 'Atlantis', 'Avalon', 'Camelot', 'Dorne', 'Eden',
       'El Dorado', 'Elysium', 'Emerald City', 'Helios', 'Krypton',
       'Metropolis', 'Midgar', 'Midgard', 'Mordor', 'Neo-City',
       'Neo-Tokyo', 'Nirvana', 'Olympus', 'Pandora', 'Rapture',
       'Rivendell', 'Serenity', 'Shangri-La', 'Solara', 'Solstice',
       'Sunspear', 'Utopia', 'Valyria', 'Winterfell'], dtype=object)

In [45]:
train_df.kingdom.value_counts()

kingdom,count
Arcadia,2832
Atlantis,2832
Avalon,2832
Camelot,2832
Dorne,2832
Eden,2832
El Dorado,2832
Elysium,2832
Emerald City,2832
Helios,2832


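We can see that every label has the same number of rows: each kingdom shown has 2832, and 30 × 2832 = 84960 matches the size of the train data set.
We will apply one-hot encoding to the `kingdom` variable in the data preprocessing step, since it has 30 unordered labels.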
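
A minimal sketch of that encoding step, assuming `pd.get_dummies` is used (the notebook does not say which encoder it will apply). The four-row `sample` frame is built from the previewed rows and stands in for `train_df`.

```python
import pandas as pd

# Four rows from the training data preview, reduced to two columns
sample = pd.DataFrame({
    "kingdom": ["Arcadia", "Atlantis", "Avalon", "Camelot"],
    "avg_temperature": [25.5, 299.65, 26.3, 24.0],
})

# Replace the kingdom column with one 0/1 indicator column per label
encoded = pd.get_dummies(sample, columns=["kingdom"], prefix="kingdom", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

On the full training data this would produce 30 indicator columns. The test set would need to be encoded with the same set of columns so that both frames line up at prediction time.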

## Explore problems within numerical variables

In [46]:
# find numerical variables
numerical = [var for var in train_df.columns if train_df[var].dtype!='O']
print('There are {} numerical variables\n'.format(len(numerical)))
print('The numerical variables are :', numerical)

There are 16 numerical variables

The numerical variables are : ['id', 'year', 'month', 'day', 'latitude', 'longitude', 'avg_temperature', 'avg_feels_like_temperature', 'temperature_range', 'feels_like_temperature_range', 'radiation', 'rain_amount', 'rain_duration', 'wind_speed', 'wind_direction', 'evapotranspiration']


In [48]:
# view the numerical variables
train_df[numerical].head()

Unnamed: 0,id,year,month,day,latitude,longitude,avg_temperature,avg_feels_like_temperature,temperature_range,feels_like_temperature_range,radiation,rain_amount,rain_duration,wind_speed,wind_direction,evapotranspiration
0,1,1,4,1,24.280002,-37.22998,25.5,30.5,8.5,10.3,22.52,58.89,16,8.6,283,1.648659
1,2,1,4,1,22.979999,-37.32999,299.65,305.15,5.9,8.2,22.73,11.83,12,15.8,161,1.583094
2,3,1,4,1,22.88,-37.130006,26.3,31.5,5.2,6.4,22.73,11.83,12,15.8,161,1.593309
3,4,1,4,1,24.180003,-36.929994,24.0,28.4,8.2,10.7,22.67,75.27,16,6.4,346,1.638997
4,5,1,4,1,25.780002,-37.53,28.0,32.8,5.7,10.2,22.35,4.81,8,16.7,185,1.719189


In [49]:
# check missing values in numerical variables
train_df[numerical].isnull().sum()

Unnamed: 0,0
id,0
year,0
month,0
day,0
latitude,0
longitude,0
avg_temperature,0
avg_feels_like_temperature,0
temperature_range,0
feels_like_temperature_range,0
