# ML Data Cleaning and Feature Selection

# About Dataset

This dataset contains about 10 years of daily weather observations from numerous Australian weather stations.

**Target Variable:**<br>

*   RainTomorrow - Whether it rained the next day (Yes or No). It is label-encoded to 0/1 later in this notebook and used as the response variable.

**Independent Variables:**<br>
* Date - Date of observation<br>
* Location - The common name of the location of the weather station<br>
* MinTemp - The minimum temperature in degrees celsius<br>
* MaxTemp - The maximum temperature in degrees celsius<br>
* Rainfall - The amount of rainfall recorded for the day in mm<br>
* Evaporation - The so-called Class A pan evaporation (mm) in the 24 hours to 9am<br>
* Sunshine - The number of hours of bright sunshine in the day.<br>
* WindGustDir - The direction of the strongest wind gust in the 24 hours to midnight<br>
* WindGustSpeed - The speed (km/h) of the strongest wind gust in the 24 hours to midnight<br>
* WindDir9am - Direction of the wind at 9am<br>
* WindDir3pm - Direction of the wind at 3pm<br>
* WindSpeed9am - Wind speed (km/hr) averaged over 10 minutes prior to 9am<br>
* WindSpeed3pm - Wind speed (km/hr) averaged over 10 minutes prior to 3pm<br>
* Humidity9am - Humidity (percent) at 9am<br>
* Humidity3pm - Humidity (percent) at 3pm<br>
* Pressure9am - Atmospheric pressure (hPa) reduced to mean sea level at 9am<br>
* Pressure3pm - Atmospheric pressure (hPa) reduced to mean sea level at 3pm<br>
* Cloud9am - Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud<br>
* Cloud3pm - Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values<br>
* Temp9am - Temperature (degrees C) at 9am<br>
* Temp3pm - Temperature (degrees C) at 3pm<br>
* RainToday - Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0<br>

[Link to Kaggle Dataset](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package/discussion)

# Aim of project

* What are the data types? (Only numeric and categorical)

* Are there missing values?

* What are the likely distributions of the numeric variables?

* Which independent variables are useful to predict a target (dependent variable)? (Use at least three methods)

* Which independent variables have missing data? How much?

* Do the training and test sets have the same data?

* Are the predictor variables independent of all the other predictor variables?

* Which predictor variables are the most important?

* Do the ranges of the predictor variables make sense?

* What are the distributions of the predictor variables?   

* Remove outliers and keep outliers (does it have an effect on the final predictive model)?

* Remove 1%, 5%, and 10% of your data randomly and impute the values back using at least 3 imputation methods. How well did the methods recover the missing values? That is, remove some data, check the % error on residuals for numeric data, and check for bias and variance of the error.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE # Recursive Feature Selection
from sklearn.metrics import confusion_matrix
from sklearn.impute import KNNImputer,SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

import random, math
from sklearn.metrics import r2_score, mean_squared_error

from scipy import stats
import statsmodels.api as sm

In [3]:
!pip install kaggle
from google.colab import files

# Upload your Kaggle API credentials JSON file that you downloaded earlier
files.upload()




ModuleNotFoundError: No module named 'google'

In [3]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


The syntax of the command is incorrect.
'mv' is not recognized as an internal or external command,
operable program or batch file.
'chmod' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
!kaggle datasets download -d jsphyg/weather-dataset-rattle-package


Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\saksh\AppData\Local\Programs\Python\Python312\Scripts\kaggle.exe\__main__.py", line 4, in <module>
  File "C:\Users\saksh\AppData\Local\Programs\Python\Python312\Lib\site-packages\kaggle\__init__.py", line 23, in <module>
    api.authenticate()
  File "C:\Users\saksh\AppData\Local\Programs\Python\Python312\Lib\site-packages\kaggle\api\kaggle_api_extended.py", line 403, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in C:\Users\saksh\.kaggle. Or use the environment method.


In [None]:
!unzip -o weather-dataset-rattle-package.zip -d /content

Load the dataset

In [None]:
df = pd.read_csv('/content/weatherAUS.csv')
df.head()

In [None]:
#print total number of columns and rows present in the dataset
print('The Dataset has', df.shape[0], 'rows and', df.shape[1],'columns')

# What are the data types?

In [None]:
#print datatype of each column to find categorical and numerical variable
print('Column Name      Datatype')
print("")
print(df.dtypes)

This weather dataset comprises 7 categorical columns with data stored as 'Object' datatype and 16 numerical columns represented as 'float64' datatype.

In [None]:
#Look for concise summary of dataset using info()
df.info()

## EDA on categorical variables

In [None]:
# Look at the categorical columns, i.e. those with the 'object' datatype
cat_cols = df.select_dtypes(include=['object']).columns
cat_cols

In [None]:
# Summary statistics of the categorical columns
df[cat_cols].describe()

In [None]:
#Look for concise summary of dataset using info()
df[cat_cols].info()

Now, we will impute categorical variable with mode

In [None]:
# Impute categorical var with Mode
df['WindGustDir'] = df['WindGustDir'].fillna(df['WindGustDir'].mode()[0])
df['WindDir9am'] = df['WindDir9am'].fillna(df['WindDir9am'].mode()[0])
df['WindDir3pm'] = df['WindDir3pm'].fillna(df['WindDir3pm'].mode()[0])
df['RainToday'] = df['RainToday'].fillna(df['RainToday'].mode()[0])

In [None]:
# Check Categorical columns again for the null values
df[cat_cols].info()

In [None]:
# plot distribution of 'RainToday' variable
d = df['RainToday'].value_counts()
labels = list(d.index)
d
plt.pie(d, labels=labels, autopct='%1.3f%%')
plt.show()

In [None]:
# plot distribution of 'RainTomorrow' target variable
df.dropna(subset=['RainTomorrow'], inplace=True)
d = df['RainTomorrow'].value_counts()
labels = list(d.index)
d
plt.pie(d, labels=labels, autopct='%1.3f%%')
plt.show()

We will split the Date column into year, month and day, which can then be treated as categorical columns.

In [None]:
# Convert Date object to datetime
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
# Split Date to 'Year', 'Month' & 'Day'
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day


In [None]:
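# Drop 'Date' from df (inplace=True returns None, so the result is not assigned)
df.drop('Date', axis=1, inplace=True)

# Keep only rows with a known target; copy so later column assignments do not touch df
data = df.dropna(axis=0, how='any', subset=["RainTomorrow"]).copy()

df.head()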

In [None]:
# Check unique location
df['Location'].unique()

In [None]:
# cat_features is a list of column names representing categorical features in a dataset
cat_features = ['Year', 'Month', 'Day', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm',
       'RainToday']

print(data.shape)

In [None]:
# check top 5
df[cat_features].head()

In [None]:
lencoders = {}
features = ['Year', 'Month', 'Day', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm',
       'RainToday', 'RainTomorrow']
for col in data[features].columns:
    lencoders[col] = LabelEncoder()
    data[col] = lencoders[col].fit_transform(data[col])

In [None]:
encoded_data = data.copy()
encoded_data.head()

## EDA on continuous variables

In [None]:
# Look at the numerical columns, i.e. those whose datatype is not 'object'
num_cols = df.select_dtypes(exclude=['object']).columns
df[num_cols].head()

In [None]:
# Summary statistics of the numerical columns
df[num_cols].describe()

In [None]:
#Look for concise summary of dataset using info()
df[num_cols].info()

In [None]:
# Pandas profiling before data preprocessing
!pip install pandas-profiling
from pandas_profiling import ProfileReport

profile = ProfileReport(df[num_cols], title='Pandas profiling before data preprocessing', minimal=True)
profile.to_notebook_iframe()

In [None]:
# num_features is a list of column names representing numerical features in a dataset
num_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm']

df[num_features]

# Are there missing values?

In [None]:
# Let's find the columns with missing values
missing_values = encoded_data[num_features].isnull().sum()
missing_values

Let's express the missing values as a percentage of all rows

In [None]:
# Missing values in percent
missing_values_pct = encoded_data[num_features].isnull().sum()/encoded_data.shape[0]*100
missing_values_pct.sort_values(ascending=False)

Now we will impute values to missing data

In [None]:
# Impute data with MICE imputer
imputed_data = encoded_data.copy(deep=True)
mice_imputer = IterativeImputer()
imputed_data.iloc[:, :] = mice_imputer.fit_transform(encoded_data)

In [None]:
# Check for missing values
imputed_data.isnull().sum()

# Which independent variables have missing data? How much?

In [None]:
#!pip install pandas-profiling
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title='Pandas profiling before data preprocessing', minimal=True)
profile.to_notebook_iframe()

# What are the likely distributions of the numeric variables?

In [None]:
# Plot Histogram
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (20,15))
ax = fig.gca()
imputed_data.hist(ax=ax)
plt.show()

## Likely distribution of numerical variables

In [None]:
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

# Loop through each numeric column in the DataFrame
for column in df.select_dtypes(include=['float64']):
#    data = df[column].dropna()  # Remove missing values if any
    sm.qqplot(imputed_data[column], line='q')
    plt.title(f'QQ Plot for {column}')
    plt.show()



*   **Bimodal Distributions:** Sunshine, Cloud9am, Cloud3pm
*   **Skewed Normal Distributions:** MinTemp, MaxTemp, WindGustSpeed, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Temp9am, Temp3pm
*   **Exponential Distributions:** Rainfall, Evaporation, WindSpeed9am, WindSpeed3pm
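
The groupings above come from reading the histograms and Q-Q plots by eye. A rough numeric cross-check is sample skewness. The sketch below uses synthetic data rather than the weather columns, and the column names `roughly_normal` and `exponential_like` are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(loc=20.0, scale=5.0, size=5000),
    "exponential_like": rng.exponential(scale=2.0, size=5000),
})

# Sample skewness: near 0 for symmetric data, clearly positive for a long right tail
skewness = demo.skew()
print(skewness.round(2))
```

Applied to the real data, the same call would be `imputed_data[num_features].skew()`. Skewness alone cannot detect bimodality, so the plots remain necessary for columns such as Sunshine and the cloud variables.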


# What are the distributions of the predictor variables?   

## Likely frequency distribution of categorical variables

In [None]:
for column in imputed_data[cat_features]:
  # Calculate the frequency of each category
  freq = imputed_data[column].value_counts()
  print(column)
  # Print the frequency of each category
  print(freq)
  print("")

In [None]:
normalized_data = imputed_data.copy()
for column in normalized_data[cat_features]:
    freq = normalized_data[column].value_counts(normalize=True)
    print(f"{column}\n{freq}\n")

In [None]:
normalized_data.info()

In [None]:
for i, column in enumerate(normalized_data[cat_features]):
    print(column)
    plt.figure(i)
    normalized_data[column].hist()
    plt.show()

In [None]:
# check Countplot for "RainTomorrow" target variable
sns.countplot(x='RainTomorrow', data = normalized_data, palette = "Set1")

In [None]:
from sklearn.utils import resample

# Separate the majority and minority classes
majority_class = normalized_data[normalized_data.RainTomorrow == 0]
minority_class = normalized_data[normalized_data.RainTomorrow == 1]

# Upsample the minority class to match the majority class
minority_upsampled = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=42)

# Combine the majority class and upsampled minority class
balanced_df = pd.concat([majority_class, minority_upsampled])

# Now, balanced_df contains a balanced dataset
sns.countplot(x='RainTomorrow', data = balanced_df, palette = "Set1")

# Which independent variables are useful to predict a target (dependent variable)? (Use at least three methods)

In [None]:
# Normalize the data: min-max feature scaling to the 0-1 range

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))

#assign scaler to column:
df_scaled = pd.DataFrame(scaler.fit_transform(normalized_data), columns=normalized_data.columns)

df_scaled.head()

## 1) Using SelectKBest feature selection technique

In [None]:
# Selection of the most important features using SelectKBest
from sklearn.feature_selection import SelectKBest, chi2

X = df_scaled.loc[:,df_scaled.columns!='RainTomorrow']
y = df_scaled['RainTomorrow']  # 1-D Series, the shape scikit-learn expects for a target

selector = SelectKBest(chi2, k=5)
selector.fit(X, y)

X_new = selector.transform(X)
print("The 5 most important predictor variables are:\n", X.columns[selector.get_support(indices=True)])

## 2) Using heatmap

In [None]:
X = df_scaled.drop('RainTomorrow', axis=1)  # predictors only; the target is excluded
y = df_scaled['RainTomorrow']

# Calculate the correlation matrix
correlation_matrix = X.corr()

# Create a heatmap
plt.figure(figsize=(20, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap of Predictor Variables")
plt.show()


1. **MinTemp and MaxTemp**: These variables have a strong positive correlation with a coefficient of 0.74.

2. **MinTemp and Temp3pm**: There is a notable positive correlation of 0.71 between MinTemp and Temp3pm.

3. **MinTemp and Temp9am**: The correlation between MinTemp and Temp9am is exceptionally strong, with a coefficient of 0.90.

4. **MaxTemp and Temp9am**: MaxTemp and Temp9am also display a robust positive correlation, having a coefficient of 0.89.

5. **MaxTemp and Temp3pm**: MaxTemp and Temp3pm exhibit a remarkably strong positive correlation, with a coefficient of 0.98.

6. **WindGustSpeed and WindSpeed3pm**: These variables are highly positively correlated, with a coefficient of 0.69.

7. **Pressure9am and Pressure3pm**: Pressure9am and Pressure3pm show a very strong positive correlation, with a coefficient of 0.96.

8. **Temp9am and Temp3pm**: The correlation between Temp9am and Temp3pm is quite strong, with a coefficient of 0.86.

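In summary, these pairs of variables are strongly positively correlated, so the predictor variables are not all independent of one another. Near-duplicate pairs such as MaxTemp/Temp3pm (0.98) and Pressure9am/Pressure3pm (0.96) carry largely redundant information.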
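
Reading pairs off a large annotated heatmap is error-prone, so it can help to list them programmatically. The following is a minimal sketch on synthetic data. The helper `correlated_pairs` and the 0.7 threshold are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.7):
    """Return (col_a, col_b, r) for each column pair with |r| >= threshold."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 2)) for (a, b), r in pairs.items()]

rng = np.random.default_rng(1)
base = rng.normal(size=1000)
demo = pd.DataFrame({
    "temp_a": base,
    "temp_b": base + rng.normal(scale=0.2, size=1000),  # nearly a copy of temp_a
    "unrelated": rng.normal(size=1000),
})
print(correlated_pairs(demo))
```

On the notebook's data the call would be `correlated_pairs(X)`, with `X` being the scaled predictors used for the heatmap.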

In [None]:
for column in cat_features:
    plt.figure(figsize=(8, 6))
    sns.countplot(x=column, hue='RainTomorrow', data=normalized_data)
    plt.title(f'{column} vs RainTomorrow')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.show()


## 3) Using FeatureCorrelation visualizer from Yellowbrick

In [None]:
X = normalized_data.drop(['RainTomorrow'],axis=1)
y = normalized_data['RainTomorrow']

X.head()



In [None]:
from yellowbrick.target import FeatureCorrelation
feature_names = list(X.columns)

visualizer = FeatureCorrelation(labels = feature_names)
visualizer.fit(X, y)
visualizer.show()  # poof() was renamed to show() in Yellowbrick 1.0

Observation:


*   **RainToday**, **Cloud3pm**, **Cloud9am**, **Humidity3pm**, **Humidity9am**, **WindGustSpeed** and **Rainfall** are positively associated with the target.
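*   **Sunshine**, **Pressure3pm** and **Pressure9am** are negatively associated with the target. A negative association is still predictive signal (for example, less sunshine going with rain the next day), so these variables should not be discarded on sign alone.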
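
To separate the sign of an association from its strength, one common convention is to rank features by the absolute value of their correlation with the target. The sketch below does this with plain pandas on synthetic data. The feature names are hypothetical, and it is not a reproduction of the Yellowbrick visualizer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "pos_feature": target + rng.normal(scale=0.5, size=n),   # rises with the target
    "neg_feature": -target + rng.normal(scale=0.5, size=n),  # falls with the target
    "noise": rng.normal(size=n),
})

# Pearson correlation of each column with the target, ranked by magnitude
corr_with_target = demo.corrwith(pd.Series(target))
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked.round(2))
```

In this toy example the negatively correlated feature ranks alongside the positively correlated one, while the pure-noise column comes last.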







## Which predictor variables are the most important?

**RainToday**, **Cloud3pm**, **Cloud9am**, **Humidity3pm**, **Humidity9am**, **WindGustSpeed** and **Rainfall** are the most important variables for predicting the target.

## Do the ranges of the predictor variables make sense?

In [None]:
plt.figure(figsize=(50,25))
sns.boxplot(data=df_scaled[num_features])

Rainfall, Evaporation, WindGustSpeed, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, Pressure9am and Pressure3pm have substantial outliers.
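
The boxplot whiskers follow the usual 1.5 × IQR rule, so the visual impression can be quantified as the share of values beyond the whiskers. This is a sketch on synthetic data, and `iqr_outlier_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def iqr_outlier_share(frame, k=1.5):
    """Fraction of values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.mean()

rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "heavy_tail": rng.exponential(size=5000),
})
share = iqr_outlier_share(demo)
print(share.round(3))
```

For the notebook's data, `iqr_outlier_share(df_scaled[num_features])` would give one number per numeric column. Min-max scaling does not change which points fall outside the whiskers.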

# Do the training and test sets have the same data?

In [None]:
X = df_scaled.drop(['RainTomorrow'], axis=1)

y = df_scaled['RainTomorrow']

In [None]:
# split X and y into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
print(X_train.shape)

In [None]:
print(X_test.shape)

No. The training and test sets contain different rows: train_test_split assigns each observation to exactly one of the two sets, here in an 80:20 ratio, and the shapes above confirm the split sizes.
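
Shapes only show the split sizes. Whether the two sets look alike statistically is a separate question, and one way to probe it is a two-sample Kolmogorov-Smirnov test per column. The sketch below assumes SciPy is available and uses synthetic columns, not the weather data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=4000),
    "feature_b": rng.exponential(size=4000),
})
train, test = train_test_split(demo, test_size=0.2, random_state=0)

# A small KS statistic (large p-value) gives no evidence that the
# train and test samples follow different distributions
for col in demo.columns:
    result = ks_2samp(train[col], test[col])
    print(f"{col}: KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3f}")
```

With a random split the distributions are expected to be similar. A large statistic on real data would suggest an unlucky or non-random split.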

In [None]:
# Initializing Random Forest Classifier
RandForest_RFE = RandomForestClassifier()
# Initializing the RFE object, one of the most important arguments is the estimator, in this case is RandomForest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit on the training set
rfe = rfe.fit(X_train, y_train)

In [None]:
print("Best features chosen by RFE: \n")

for i in X_train.columns[rfe.support_]:
    print(i)

In [None]:
y_pred = rfe.predict(X_test)

In [None]:
# Create the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
# scikit-learn layout: rows are actual classes, columns are predicted classes, ordered [0, 1]
print("Confusion Matrix:\n", cm)
print('True Negatives(TN) = ', cm[0,0])
print('True Positives(TP) = ', cm[1,1])
print('False Positives(FP) = ', cm[0,1])
print('False Negatives(FN) = ', cm[1,0])

# visualize confusion matrix with seaborn heatmap
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'],
                                  index=['Actual Negative:0', 'Actual Positive:1'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

# print classification accuracy
classification_accuracy = (cm[0,0] + cm[1,1]) / float(cm[0,0] + cm[1,1] + cm[0,1] + cm[1,0])
print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))

# print classification error
classification_error = (cm[0,1] + cm[1,0]) / float(cm[0,0] + cm[1,1] + cm[0,1] + cm[1,0])
print('Classification error : {0:0.4f}'.format(classification_error))
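
Because the cell indices of a confusion matrix are easy to mix up, here is a small hand-checkable example of scikit-learn's layout. The label lists are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes, both ordered [0, 1]
cm_demo = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Unpacking with `ravel()` in the order `tn, fp, fn, tp` avoids indexing mistakes for the binary case.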

Logistic regression for comparison (trained on all features in X_train)

In [None]:
from sklearn.linear_model import LogisticRegression
# Initialize the logistic regression model
logistic_reg_model = LogisticRegression()

# Train the model on the training data
logistic_reg_model.fit(X_train, y_train)


In [None]:
y_pred = logistic_reg_model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Generate a classification report
report = classification_report(y_test, y_pred)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)


The RandomForestClassifier (wrapped in RFE) is expected to achieve the higher accuracy, approximately 88%, compared with around 86% for the Logistic Regression model.

# Remove outliers and keep outliers (does it have an effect on the final predictive model)?

In [None]:
df_Oclean = df_scaled.copy()

In [None]:
def remove_out(df_Oclean, num_cols, lbv=0.25, hbv=0.75):
    Q1 = df_Oclean[num_cols].quantile(lbv)
    Q3 = df_Oclean[num_cols].quantile(hbv)
    IQR = Q3-Q1
    lb = Q1-1.5*IQR
    hb = Q3+1.5*IQR
    for i in num_cols:
        df_Oclean = df_Oclean[(df_Oclean[i]>=lb[i]) & (df_Oclean[i]<=hb[i])]
    return df_Oclean

In [None]:
cols_outliers = ['MinTemp', 'MaxTemp', 'Rainfall',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm', 'Year', 'Month', 'Day', 'Evaporation', 'Sunshine','RainToday', 'RainTomorrow']

In [None]:
df_clean = remove_out(df_Oclean, cols_outliers, lbv=0.10, hbv=0.90)
df_clean.shape

In [None]:
X_new = df_clean.drop(['RainTomorrow'], axis=1)

y_new = df_clean['RainTomorrow']

In [None]:
X_new.shape

In [None]:
# split X_new and y_new into training and testing sets

from sklearn.model_selection import train_test_split

X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size = 0.2, random_state = 0)

In [None]:
print(X_train_new.shape)

In [None]:
print(X_test_new.shape)

In [None]:
# Initializing Random Forest Classifier
RandForest_RFE = RandomForestClassifier()
# Initializing the RFE object, one of the most important arguments is the estimator, in this case is RandomForest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit on the training set
rfe = rfe.fit(X_train_new, y_train_new)

In [None]:
y_pred_new = rfe.predict(X_test_new)

In [None]:
# Create the confusion matrix
cm1 = confusion_matrix(y_test_new, y_pred_new)

# Print the confusion matrix
# scikit-learn layout: rows are actual classes, columns are predicted classes, ordered [0, 1]
print("Confusion Matrix:\n", cm1)
print('True Negatives(TN) = ', cm1[0,0])
print('True Positives(TP) = ', cm1[1,1])
print('False Positives(FP) = ', cm1[0,1])
print('False Negatives(FN) = ', cm1[1,0])

# visualize confusion matrix with seaborn heatmap
cm_matrix1 = pd.DataFrame(data=cm1, columns=['Predict Negative:0', 'Predict Positive:1'],
                                  index=['Actual Negative:0', 'Actual Positive:1'])
sns.heatmap(cm_matrix1, annot=True, fmt='d', cmap='YlGnBu')

# print classification accuracy
classification_accuracy1 = (cm1[0,0] + cm1[1,1]) / float(cm1[0,0] + cm1[1,1] + cm1[0,1] + cm1[1,0])
print('Classification accuracy : {0:0.4f}'.format(classification_accuracy1))

# print classification error
classification_error1 = (cm1[0,1] + cm1[1,0]) / float(cm1[0,0] + cm1[1,1] + cm1[0,1] + cm1[1,0])
print('Classification error : {0:0.4f}'.format(classification_error1))

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Calculate accuracy
accuracy2 = accuracy_score(y_test_new, y_pred_new)
print(f'Accuracy: {accuracy2:.2f}')

# Generate a classification report
report2 = classification_report(y_test_new, y_pred_new)
print('Classification Report:\n', report2)

# Create a confusion matrix
cm2 = confusion_matrix(y_test_new, y_pred_new)
print('Confusion Matrix:\n', cm2)

After removing outliers, the accuracy score rose to about 89%. This suggests that outliers affect the final model's performance, although the gain over the earlier result is small and the two models are evaluated on different test sets, so the comparison is indicative rather than conclusive.

# Remove 1%, 5%, and 10% of your data randomly and impute the values back using at least 3 imputation methods. How well did the methods recover the missing values? That is, remove some data, check the % error on residuals for numeric data, and check for bias and variance of the error.

In [None]:
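# remove 1% of the rows at random (whole rows are dropped; no NaN values are introduced)
n=round(0.99*len(df_scaled))
red_data1 = df_scaled.sample(n=n)

# remove 5% of the rows at random
n=round(0.95*len(df_scaled))
red_data5 = df_scaled.sample(n=n)

# remove 10% of the rows at random
n=round(0.90*len(df_scaled))
red_data10 = df_scaled.sample(n=n)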
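
Sampling rows as above shrinks the dataset but leaves no gaps to fill. To measure how well an imputer recovers values, individual cells have to be hidden while the true values are kept for comparison. The following is a minimal sketch of that idea on synthetic data. The helpers `mask_values` and `recovery_error` are hypothetical names, and only two imputers are shown for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(5)
n = 1000
x = rng.normal(size=n)
truth = pd.DataFrame({"a": x, "b": 2 * x + rng.normal(scale=0.3, size=n)})

def mask_values(frame, frac, rng):
    """Return a copy with a random `frac` of cells set to NaN, plus the boolean mask."""
    mask = rng.random(frame.shape) < frac
    return frame.mask(mask), mask

def recovery_error(truth, imputed, mask):
    """Bias (mean error), variance of the error, and RMSE over the masked cells only."""
    err = (imputed.to_numpy() - truth.to_numpy())[mask]
    return {"bias": err.mean(), "variance": err.var(), "rmse": np.sqrt((err ** 2).mean())}

for frac in (0.01, 0.05, 0.10):
    holed, mask = mask_values(truth, frac, rng)
    for name, imputer in [("median", SimpleImputer(strategy="median")),
                          ("knn", KNNImputer(n_neighbors=5))]:
        filled = pd.DataFrame(imputer.fit_transform(holed), columns=truth.columns)
        print(frac, name, recovery_error(truth, filled, mask))
```

The same loop could be run over the notebook's `df_scaled` with the mode, median and MICE imputers used below. Excluding the target column from masking is advisable so the labels stay intact.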

## Data imputation using mode

Data imputation using the mode involves replacing missing values in a dataset with the most frequently occurring value (mode) in the respective column. This imputation method is commonly used for categorical variables and discrete data. It keeps every imputed value inside the set of observed categories, but it increases the share of the most common category.
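
As a small illustration of what mode imputation does to a categorical column, the sketch below fills two gaps in a made-up wind-direction series and compares the share of the modal category before and after. The values are invented for the example.

```python
import pandas as pd

wind_dir = pd.Series(["N", "N", "SE", None, "N", "W", None, "SE"], name="wind_dir")

mode_value = wind_dir.mode()[0]   # most frequent observed category
filled = wind_dir.fillna(mode_value)

# Share of the modal category among observed values vs. after imputation
share_before = (wind_dir.dropna() == mode_value).mean()
share_after = (filled == mode_value).mean()
print(mode_value, share_before, share_after)
```

Every gap receives the same category, so the modal share grows with the fraction of missing values.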

In [None]:
imputed_data1 = red_data1.copy(deep=True)
for col in imputed_data1.columns:
  imputed_data1[col] = imputed_data1[col].fillna(imputed_data1[col].mode()[0])

imputed_data5 = red_data5.copy(deep=True)
for col in imputed_data5.columns:
  imputed_data5[col] = imputed_data5[col].fillna(imputed_data5[col].mode()[0])


imputed_data10 = red_data10.copy(deep=True)
for col in imputed_data10.columns:
  imputed_data10[col] = imputed_data10[col].fillna(imputed_data10[col].mode()[0])

In [None]:
X = imputed_data1.drop(['RainTomorrow'], axis=1)

y = imputed_data1['RainTomorrow']

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initializing Random Forest Classifier
RandForest_RFE = RandomForestClassifier()
# Initializing the RFE object, one of the most important arguments is the estimator, in this case is RandomForest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit on the training set
rfe = rfe.fit(X_train_red, y_train_red)

y_pred_red = rfe.predict(X_test_red)

# Calculate accuracy
accuracy = accuracy_score(y_test_red, y_pred_red)
print(f'Accuracy: {accuracy:}')

# Generate a classification report
report = classification_report(y_test_red, y_pred_red)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test_red, y_pred_red)
print('Confusion Matrix:\n', cm)

In [None]:
X = imputed_data5.drop(['RainTomorrow'], axis=1)

y = imputed_data5['RainTomorrow']

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initializing Random Forest Classifier
RandForest_RFE = RandomForestClassifier()
# Initializing the RFE object, one of the most important arguments is the estimator, in this case is RandomForest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit on the training set
rfe = rfe.fit(X_train_red, y_train_red)

y_pred_red = rfe.predict(X_test_red)

# Calculate accuracy
accuracy = accuracy_score(y_test_red, y_pred_red)
print(f'Accuracy: {accuracy:}')

# Generate a classification report
report = classification_report(y_test_red, y_pred_red)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test_red, y_pred_red)
print('Confusion Matrix:\n', cm)

In [None]:
X = imputed_data10.drop(['RainTomorrow'], axis=1)

y = imputed_data10['RainTomorrow']

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initializing Random Forest Classifier
RandForest_RFE = RandomForestClassifier()
# Initializing the RFE object, one of the most important arguments is the estimator, in this case is RandomForest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit on the training set
rfe = rfe.fit(X_train_red, y_train_red)

y_pred_red = rfe.predict(X_test_red)

# Calculate accuracy
accuracy = accuracy_score(y_test_red, y_pred_red)
print(f'Accuracy: {accuracy:}')

# Generate a classification report
report = classification_report(y_test_red, y_pred_red)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test_red, y_pred_red)
print('Confusion Matrix:\n', cm)

## Data imputation using median

Data imputation using the median involves replacing missing values in a dataset with the median value of the respective column. This imputation method is commonly used for continuous numerical variables and is robust to outliers. It helps maintain the central tendency of the data while filling in missing entries with a representative value.
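
The robustness claim can be seen on a tiny example: one extreme reading moves the mean a long way but barely affects the median. The rainfall values below are invented for illustration.

```python
import numpy as np
import pandas as pd

rainfall = pd.Series([0.0, 0.2, 0.0, 1.4, np.nan, 0.6, 95.0, np.nan])

# One extreme reading (95.0) pulls the mean far more than the median
mean_value = rainfall.mean()
median_value = rainfall.median()
median_fill = rainfall.fillna(median_value)
print(round(mean_value, 2), round(median_value, 2))
```

Filling the gaps with the mean would insert a value larger than five of the six observed readings, whereas the median stays close to a typical day.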

In [None]:
imputed_data1 = red_data1.copy(deep=True)
for col in imputed_data1.columns:
  imputed_data1[col] = imputed_data1[col].fillna(imputed_data1[col].median())

imputed_data5 = red_data5.copy(deep=True)
for col in imputed_data5.columns:
  imputed_data5[col] = imputed_data5[col].fillna(imputed_data5[col].median())


imputed_data10 = red_data10.copy(deep=True)
for col in imputed_data10.columns:
  imputed_data10[col] = imputed_data10[col].fillna(imputed_data10[col].median())

In [None]:
X = imputed_data1.drop(['RainTomorrow'], axis=1)

y = imputed_data1['RainTomorrow']

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initializing Random Forest Classifier
RandForest_RFE = RandomForestClassifier()
# Initializing the RFE object, one of the most important arguments is the estimator, in this case is RandomForest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit on the training set
rfe = rfe.fit(X_train_red, y_train_red)

y_pred_red = rfe.predict(X_test_red)

# Calculate accuracy
accuracy = accuracy_score(y_test_red, y_pred_red)
print(f'Accuracy: {accuracy:}')

# Generate a classification report
report = classification_report(y_test_red, y_pred_red)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test_red, y_pred_red)
print('Confusion Matrix:\n', cm)

In [None]:
X = imputed_data5.drop(['RainTomorrow'], axis=1)

y = imputed_data5['RainTomorrow']

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initializing Random Forest Classifier
RandForest_RFE = RandomForestClassifier()
# Initializing the RFE object, one of the most important arguments is the estimator, in this case is RandomForest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit on the training set
rfe = rfe.fit(X_train_red, y_train_red)

y_pred_red = rfe.predict(X_test_red)

# Calculate accuracy
accuracy = accuracy_score(y_test_red, y_pred_red)
print(f'Accuracy: {accuracy:}')

# Generate a classification report
report = classification_report(y_test_red, y_pred_red)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test_red, y_pred_red)
print('Confusion Matrix:\n', cm)

In [None]:
X = imputed_data10.drop(['RainTomorrow'], axis=1)

y = imputed_data10['RainTomorrow']

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initializing Random Forest Classifier
RandForest_RFE = RandomForestClassifier()
# Initializing the RFE object, one of the most important arguments is the estimator, in this case is RandomForest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit on the training set
rfe = rfe.fit(X_train_red, y_train_red)

y_pred_red = rfe.predict(X_test_red)

# Calculate accuracy
accuracy = accuracy_score(y_test_red, y_pred_red)
print(f'Accuracy: {accuracy:}')

# Generate a classification report
report = classification_report(y_test_red, y_pred_red)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test_red, y_pred_red)
print('Confusion Matrix:\n', cm)

## Data imputation using MICE

Data imputation using the Multiple Imputation by Chained Equations (MICE) iterative method is a technique for handling missing data by imputing values through a series of predictive models. It's particularly useful when dealing with datasets where missing values are not completely at random and when you want to preserve the relationships between variables.
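
To illustrate how a chained-equations imputer exploits relationships between variables, the sketch below hides some values of one column that is strongly tied to another and checks how closely they are recovered. It uses scikit-learn's `IterativeImputer`, which is inspired by MICE but returns a single imputation. The data and column names are synthetic assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(6)
n = 500
temp_9am = rng.normal(loc=17.0, scale=5.0, size=n)
temp_3pm = temp_9am + 5.0 + rng.normal(scale=1.0, size=n)  # strongly tied to temp_9am
truth = pd.DataFrame({"temp_9am": temp_9am, "temp_3pm": temp_3pm})

# Hide roughly 10% of temp_3pm; the imputer regresses it on temp_9am
holed = truth.copy()
hidden = rng.random(n) < 0.10
holed.loc[hidden, "temp_3pm"] = np.nan

imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(imputer.fit_transform(holed), columns=truth.columns)

diff = filled.loc[hidden, "temp_3pm"] - truth.loc[hidden, "temp_3pm"]
rmse = np.sqrt((diff ** 2).mean())
print(f"RMSE on hidden temp_3pm values: {rmse:.2f}")
```

The recovery error here is far below the spread of `temp_3pm` itself, which a column-wise median or mode fill could not achieve because it ignores `temp_9am`.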

In [None]:
imputed_data1 = red_data1.copy(deep=True)
mice_imputer = IterativeImputer()
imputed_data1.iloc[:, :] = mice_imputer.fit_transform(red_data1)

imputed_data5 = red_data5.copy(deep=True)
imputed_data5.iloc[:, :] = mice_imputer.fit_transform(red_data5)

imputed_data10 = red_data10.copy(deep=True)
imputed_data10.iloc[:, :] = mice_imputer.fit_transform(red_data10)

In [None]:
X = imputed_data1.drop(['RainTomorrow'], axis=1)

y = imputed_data1['RainTomorrow']

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initializing Random Forest Classifier
RandForest_RFE = RandomForestClassifier()
# Initializing the RFE object, one of the most important arguments is the estimator, in this case is RandomForest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit on the training set
rfe = rfe.fit(X_train_red, y_train_red)

y_pred_red = rfe.predict(X_test_red)

# Calculate accuracy
accuracy = accuracy_score(y_test_red, y_pred_red)
print(f'Accuracy: {accuracy:}')

# Generate a classification report
report = classification_report(y_test_red, y_pred_red)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test_red, y_pred_red)
print('Confusion Matrix:\n', cm)

In [None]:
X = imputed_data5.drop(['RainTomorrow'], axis=1)

y = imputed_data5['RainTomorrow']

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initialize the Random Forest classifier
RandForest_RFE = RandomForestClassifier()
# Initialize the RFE object; its key argument is the estimator, here the Random Forest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit RFE on the training split
rfe = rfe.fit(X_train_red, y_train_red)

# Predict on the test split using only the selected features
y_pred_red = rfe.predict(X_test_red)

# Calculate accuracy
accuracy = accuracy_score(y_test_red, y_pred_red)
print(f'Accuracy: {accuracy}')

# Generate a classification report
report = classification_report(y_test_red, y_pred_red)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test_red, y_pred_red)
print('Confusion Matrix:\n', cm)

In [None]:
X = imputed_data10.drop(['RainTomorrow'], axis=1)

y = imputed_data10['RainTomorrow']

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Initialize the Random Forest classifier
RandForest_RFE = RandomForestClassifier()
# Initialize the RFE object; its key argument is the estimator, here the Random Forest
rfe = RFE(estimator=RandForest_RFE, n_features_to_select=10, step=1)
# Fit RFE on the training split
rfe = rfe.fit(X_train_red, y_train_red)

# Predict on the test split using only the selected features
y_pred_red = rfe.predict(X_test_red)

# Calculate accuracy
accuracy = accuracy_score(y_test_red, y_pred_red)
print(f'Accuracy: {accuracy}')

# Generate a classification report
report = classification_report(y_test_red, y_pred_red)
print('Classification Report:\n', report)

# Create a confusion matrix
cm = confusion_matrix(y_test_red, y_pred_red)
print('Confusion Matrix:\n', cm)

# Questions

* Which independent variables are useful to predict a target (dependent variable)?
>As observed earlier, every variable shows a significant relationship with the target variable. For predictive modeling, however, the following ten variables were identified as particularly valuable:

1) MinTemp
2) MaxTemp
3) Sunshine
4) WindGustSpeed
5) Humidity3pm
6) Pressure9am
7) Pressure3pm
8) Cloud9am
9) Cloud3pm
10) Temp3pm
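
A list like the one above can be read directly from a fitted RFE object. The following is a minimal sketch on synthetic data, assuming hypothetical column names `f0` to `f7` in place of the weather features. `support_` and `ranking_` are standard attributes of scikit-learn's `RFE`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the weather features (hypothetical column names)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

rfe_demo = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=3,
    step=1,
)
rfe_demo.fit(X_demo, y_demo)

# support_ is a boolean mask of kept features; ranking_ is 1 for every kept feature
selected = X_demo.columns[rfe_demo.support_].tolist()
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()

print(selected)
print(ranking)
```

Features eliminated earlier receive a higher rank number, so `ranking` also shows the order in which the discarded variables were dropped.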

* Which independent variables have missing data? How much?
>Pressure9am, Pressure3pm, and Sunshine have the highest rates of missing data, ranging from 30% to 50%. The remaining variables have comparatively few missing values.
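
Missing-data rates of this kind can be computed per column with pandas. This is a minimal sketch on a made-up four-row frame, so the values are illustrative and not taken from the weather data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the weather data
demo = pd.DataFrame({
    "Sunshine": [np.nan, 7.5, np.nan, 9.1],
    "Pressure9am": [1012.3, np.nan, 1008.7, 1015.0],
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
})

# isna() marks missing cells; the column mean of that mask is the missing fraction
missing_pct = (demo.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```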

* Do the training and test sets have the same data?
>No. The data was partitioned into training and test sets in an 80:20 ratio, so the two sets contain different observations.
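
One way to confirm that a split shares no rows is to intersect the row indices of the two parts. This is a minimal sketch on a made-up 100-row frame, using the same `test_size` and `random_state` as the cells above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with a default integer index
demo_X = pd.DataFrame({"Humidity3pm": range(100)})
demo_y = pd.Series([i % 2 for i in range(100)], name="RainTomorrow")

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

# Row labels present in both parts; an empty result means the parts are disjoint
overlap = X_tr.index.intersection(X_te.index)
print(len(X_tr), len(X_te), len(overlap))  # 80 20 0
```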

* Are the predictor variables independent of all the other predictor variables?
>No. Several predictor variables are strongly correlated with one another.
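
Strongly correlated pairs can be listed from the correlation matrix. The following is a minimal sketch on simulated columns, assuming a threshold of 0.8 chosen only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp9am = rng.normal(17, 6, size=500)

# Simulated columns: Temp3pm is built to track Temp9am, WindGustSpeed is independent
demo = pd.DataFrame({
    "Temp9am": temp9am,
    "Temp3pm": temp9am + rng.normal(5, 1, size=500),
    "WindGustSpeed": rng.normal(40, 13, size=500),
})

corr = demo.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper.stack().loc[lambda s: s.abs() > 0.8]
print(strong_pairs)
```

Pairs that pass the threshold are candidates for dropping one member, since they carry largely the same information.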

* Which predictor variables are the most important?
>RainToday, Cloud3pm, Cloud9am, Humidity3pm, Humidity9am, WindGustSpeed, and Rainfall are the most important variables for predicting the target.
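
An importance ranking can be obtained from a fitted Random Forest through its impurity-based `feature_importances_` attribute. This is a minimal sketch on synthetic data with hypothetical column names, and not necessarily the method that produced the ranking above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; with shuffle=False the two informative columns come first
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; larger means more impurity reduction
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```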

* Do the ranges of the predictor variables make sense?
>Yes. The boxplots show that the ranges of the predictor variables make sense.

* What are the distributions of the predictor variables?
>* Bimodal distributions: Sunshine, Cloud9am, Cloud3pm
>* Exponential distributions: Rainfall, Evaporation, WindSpeed9am, WindSpeed3pm
>* Skewed normal distributions: MinTemp, MaxTemp, WindGustSpeed, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Temp9am, Temp3pm
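
Visual impressions of shape can be backed by a numeric skewness check. This is a minimal sketch on simulated columns whose names are hypothetical, comparing an exponential sample with a normal one:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated columns: one exponential (like rainfall), one symmetric (like pressure)
demo = pd.DataFrame({
    "Rainfall_like": rng.exponential(scale=2.0, size=2000),
    "Pressure_like": rng.normal(loc=1015, scale=7, size=2000),
})

# Sample skewness: near 0 for symmetric data, clearly positive for a long right tail
skewness = demo.skew()
print(skewness)
```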


# References

[Machine Learning A-Z™: AI, Python & R + ChatGPT Bonus [2023] - Udemy](https://www.udemy.com/course/machinelearning/)

[Data Cleaning and EDA - YouTube](https://www.youtube.com/watch?v=VCt7UaIr64I)

[ML Data Cleaning and Feature Selection (Abalone) - aiskunks/YouTube, GitHub](https://github.com/aiskunks/YouTube/blob/main/A_Crash_Course_in_Statistical_Learning/ML_Data_Cleaning_and_Feature_Selection/ML_Data_Cleaning_and_Feature_Selection_Abalone.ipynb)

[How to Calculate Feature Importance With Python - Machine Learning Mastery](https://machinelearningmastery.com/calculate-feature-importance-with-python/)



# Copyright

Copyright (c) 2023 Asawari Anant Kadam

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.