# Project: Wild Blueberry Yield Prediction

## Index
1. [Description](#description)
2. [Problem Statement](#problem-statement)
3. [Loading necessary libraries](#loading-necessary-libraries)
4. [Function Definitions](#function-definations)
5. [Reading in the Dataset](#reading-in-the-dataset)
6. [Exploratory Descriptive Analysis(EDA)](#exploratory-descriptive-analysiseda)
    1. [Basic Data Inspection](#basic-data-inspection)
    2. [Descriptive Statistics](#descriptive-statistics)
    3. [Duplicate Values](#duplicate-values)
    4. [Missing Values](#missing-values)
    5. [Unique Values](#unique-values)
    6. [Exploring Target Variable](#exploring-target-variable)
    7. [Data Visualization](#data-visualization)
7. [Insights](#insights)
8. [Preprocessing](#preprocessing)
9. [Feature Engineering]()
10. [Modeling]()
11. [Hyperparameter Tuning]()
12. [Explainable AI(XAI)]()
13. [Dataframe Pipeline](#dataframe-pipeline)

## Description

The dataset used for predictive modelling was generated by the Wild Blueberry Pollination Simulation Model, an open-source, spatially explicit computer simulation program. The model enables exploration of how various factors, including plant spatial arrangement, outcrossing and self-pollination, bee species composition, and weather conditions, affect the pollination efficiency and yield of the wild blueberry agro-ecosystem, both in isolation and in combination. It has been validated against field observations and experimental data collected in Maine, USA, and the Canadian Maritimes over the last 30 years, and it is now a useful tool for hypothesis testing and theory development in wild blueberry pollination research. The simulated data is useful both to researchers who have actual field-observation data and to those who want to test how machine learning algorithms respond to real data and to simulation-generated data as inputs for crop yield prediction models.

## Problem Statement

The target feature is `yield`, a continuous variable. The task is to predict it from the other 17 features, which makes this a regression problem rather than a classification one. The evaluation metric is **RMSE** (root mean squared error).
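
For reference, RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same units as `yield`. A minimal NumPy sketch follows; the function name `rmse` and the sample numbers are illustrative and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only (not real yields)
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(rmse(observed, predicted))
```

Lower values are better, and a perfect prediction gives an RMSE of 0.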

## Loading necessary libraries
- NumPy: A library for numerical operations in Python.
- Pandas: A powerful library for data manipulation and analysis.
- Matplotlib: A library for creating static, interactive, and animated visualizations in Python.
- Seaborn: A data visualization library based on Matplotlib for making attractive and informative statistical graphics.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Checking the version of the library installed in the Environment. 
print(f"Numpy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {plt.matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")

# Matplotlib plots appear directly within the notebook, enhancing the interactivity
%matplotlib inline

# Dark visual theme and 'pastel' color palette for Seaborn plots.
sns.set_theme(style="dark")
sns.set_palette("pastel")


# Suppressing non-critical warnings to keep the output clean and readable.
import warnings
warnings.filterwarnings('ignore')

# Display all columns without truncation
pd.set_option('display.max_columns', None)

## Function Definations
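The helper functions below are used throughout the EDA:
- `columns_in_a_dataframe(dataframe)`: prints the number of columns and lists their names.
- `distinguish_column_ac_to_datatype(dataframe)`: prints the numerical, categorical, and datetime columns.
- `missing_values_table(dataframe)`: returns the count and percentage of missing values for each column with at least one missing value.
- `unique_values_per_column(dataframe, max_unique_values)`: prints the unique values of each column that has at most `max_unique_values` of them.
- `unique_value_with_count(dataframe)`: prints the value counts of each column in descending order.
- `plot_numerical_histogram_boxplot(dataframe)`: plots a histogram and a boxplot for each numerical column.
- `plot_categorical_countplots(dataframe, unique_value_limit=10)`: plots countplots for categorical columns with at most `unique_value_limit` unique values.
- `plot_single_categorical_countplot(dataframe, column_name)`: plots a countplot for one column if it has at most 20 unique values.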

In [None]:
# 1. Function prints the number of columns in a DataFrame and lists their names in square brackets, separated by commas. 
def columns_in_a_dataframe(dataframe):

    dataframe_columns = dataframe.columns.tolist()
    print(f"Number of columns: {len(dataframe_columns)}")
    print(f"{dataframe_columns}\n")


# 2. Function distinguishes columns in a DataFrame based on their data types
def distinguish_column_ac_to_datatype(dataframe):
    data_types = {
        'Numerical': ['number'],
        'Categorical': ['category', 'object'],
        'Datetime': ['datetime', 'datetime64[ns]']
    }
    
    for dtype, dtype_list in data_types.items():
        columns = dataframe.select_dtypes(include=dtype_list).columns.tolist()
        if columns:
            print(f"Number of {dtype} columns: {len(columns)}")
            print(f"{columns}\n")
        else:
            print(f"No {dtype} column in your dataframe")


# 3. Function prints a short summary and returns a table with the number and percentage of missing values for each column having at least one missing value
def missing_values_table(dataframe):
        
    mis_val = dataframe.isnull().sum()
    mis_val_percent = 100 * mis_val / len(dataframe)
        
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
    columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
    # Print some summary information
    print(f"Your selected dataframe has {dataframe.shape[1]} columns.\n"
          f"There are {mis_val_table_ren_columns.shape[0]} columns that have missing values.")
        
    # Return the dataframe with missing information
    return mis_val_table_ren_columns


# 4. Function prints the unique values of each column that has at most max_unique_values unique values.
def unique_values_per_column(dataframe, max_unique_values):
    for column in dataframe.columns:
        unique_values = dataframe[column].unique()
        
        if len(unique_values) <= max_unique_values:
            print(f"Column: {column}")
            print(f"Number of Unique values: {len(unique_values)}")
            print(f"Unique values: {', '.join(map(str, unique_values))}\n")


# 5. Function prints the count of each unique value per column, sorted in descending order of occurrence
def unique_value_with_count(dataframe):
    for column in dataframe.columns:
        print(f"Column: {column}")
        # value_counts() already returns the counts in descending order
        print(f"{dataframe[column].value_counts()}\n")


# 6. Function plots histograms and boxplots for numerical columns in a DataFrame
def plot_numerical_histogram_boxplot(dataframe):
    numerical_columns = dataframe.select_dtypes(include=['number'])
    
    for column in numerical_columns.columns:
        plt.figure(figsize=(15,6))

        axes = plt.subplot(1,2,1)
        sns.histplot(data=dataframe, x=column, bins=20, color='steelblue', edgecolor='black')
        axes.set_xlabel(column)
        axes.set_ylabel('frequency')
        axes.set_title(f'Histogram of {column}')

        axes = plt.subplot(1,2,2)
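        # Pass the dataframe by keyword so the call does not depend on seaborn's positional-argument order
        sns.boxplot(data=dataframe, y=column, color='steelblue')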
        axes.set_ylabel(column)
        axes.set_title(f'Boxplot of {column}')

        # Adjust layout
        plt.tight_layout()
        plt.show()


# 7. Function plots countplots, in ascending order of count, for all categorical columns whose number of unique values is at most unique_value_limit
def plot_categorical_countplots(dataframe, unique_value_limit=10):
    categorical_columns = dataframe.select_dtypes(include=['object'])

    for column in categorical_columns.columns:
        unique_count = len(dataframe[column].unique())
        if unique_count <= unique_value_limit:
            plt.figure(figsize=(15, 6))
            ax = sns.countplot(data=dataframe, x=column,order= dataframe[column].value_counts(ascending=True).index)
            plt.xticks(rotation=90)
            plt.title(f'Countplot of {column}')
            plt.xlabel('')

            total_count = len(dataframe[column])
            
            # Add text annotations for the count at the top and percentage in the middle of each bar
            for p in ax.patches:
                height = p.get_height()
                height = int(height)
                percentage = (height / total_count) * 100
                x = p.get_x() + p.get_width() / 2.
                y_top = height + 5  # Adjust the vertical position for the count
                y_middle = height / 2
                ax.annotate(f'{height}', 
                            (x, y_top),
                            ha='center', va='bottom', fontsize=12, color='black')
                ax.annotate(f'{percentage:.2f}%', 
                            (x, y_middle),
                            ha='center', va='center', fontsize=12, color='black')

            plt.show()


# 8. Function plots a countplot of a single categorical column, in descending order of count, if the column has at most 20 unique values
def plot_single_categorical_countplot(dataframe, column_name):
    if column_name not in dataframe.columns:
        print(f"Column '{column_name}' not found in the DataFrame.")
        return

    unique_count = len(dataframe[column_name].unique())

    if unique_count <= 20:
        plt.figure(figsize=(15, 6))
        ax = sns.countplot(data=dataframe, x=column_name, edgecolor='black', order=dataframe[column_name].value_counts(ascending=False).index)

        ax.patch.set_edgecolor('black')
        ax.patch.set_linewidth(2)

        plt.xticks(rotation=90)
        plt.title(f'Countplot of {column_name}')
        plt.xlabel('')

        total_count = len(dataframe[column_name])

        # Add text annotations for the count and percentage
        for p in ax.patches:
            height = p.get_height()
            height = int(height)
            percentage = (height / total_count) * 100
            x = p.get_x() + p.get_width() / 2.

            if percentage < 5:
                y_top_percentage = height + 10
                ax.annotate(f'{percentage:.2f}%',
                            (x, y_top_percentage),
                            ha='center', va='bottom', fontsize=12, color='black')
            else:
                y_top = height + 10
                ax.annotate(f'{height}',
                            (x, y_top),
                            ha='center', va='bottom', fontsize=12, color='black')
                y_middle = height / 2
                ax.annotate(f'{percentage:.2f}%',
                            (x, y_middle),
                            ha='center', va='center', fontsize=12, color='black')
        plt.show()

## Reading in the Dataset

In [None]:
df = pd.read_csv('')
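
Once a path is supplied, a quick check that the target column is present can catch a wrong file early. The sketch below reads a tiny in-memory CSV instead of a real file. The column names other than `yield`, the variable names, and all values are made up for illustration.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (values are invented)
toy_csv = io.StringIO(
    "clonesize,honeybee,yield\n"
    "12.5,0.25,4500.0\n"
    "25.0,0.50,5200.0\n"
)
toy_df = pd.read_csv(toy_csv)

target = "yield"  # target name taken from the problem statement
if target not in toy_df.columns:
    raise KeyError(f"Expected target column '{target}' is missing")

print(toy_df.shape)
```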

## Exploratory Descriptive Analysis(EDA)

### Basic Data Inspection

In [None]:
df_eda = df.copy(deep=True)

In [None]:
df_eda.head()

In [None]:
df_eda.tail()

In [None]:
df_eda.sample(5)

In [None]:
print(f"Shape of dataframe: {df_eda.shape}")

In [None]:
print(f"Columns of dataframe:\n{df_eda.columns}")

In [None]:
distinguish_column_ac_to_datatype(df_eda)

In [None]:
df_eda.info()

### Descriptive Statistics

In [None]:
# Numerical Columns Statistics
df_eda.describe().T

In [None]:
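# Categorical Columns Statistics
# describe(include='O') raises a ValueError when the dataframe has no object columns, so guard the call
if not df_eda.select_dtypes(include='object').columns.empty:
    display(df_eda.describe(include='O').T)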

### Duplicate Values

In [None]:
print(f"Number of duplicate instances in the dataset: {df_eda.duplicated().sum()}")
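
If the count above is non-zero, it can help to look at the duplicated rows before deciding whether to drop them. A minimal sketch on a toy frame (the data and the names `toy` and `deduplicated` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 1, 2, 3], "b": [10, 10, 20, 30]})

# keep=False marks every member of a duplicate group, not just the repeats
print(toy[toy.duplicated(keep=False)])

# Keep the first occurrence of each duplicated row and reset the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(toy)}, after: {len(deduplicated)}")
```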

### Missing Values

In [None]:
df_eda.isnull().sum().sort_values(ascending=False)

In [None]:
missing_values_table(df_eda)
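
To illustrate the kind of table `missing_values_table` is meant to return, the same count-and-percentage computation is sketched below on a toy frame. The frame and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, np.nan, 1.0, 2.0],
    "z": [1, 2, 3, 4],
})

missing = toy.isnull().sum()
summary = pd.DataFrame({
    "Missing Values": missing,
    "% of Total Values": (100 * missing / len(toy)).round(1),
})
# Keep only columns with at least one missing value, worst first
summary = summary[summary["Missing Values"] > 0].sort_values(
    "% of Total Values", ascending=False
)
print(summary)
```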

### Unique Values

In [None]:
df_eda.nunique().sort_values(ascending=False)

In [None]:
unique_values_per_column(df_eda,20) #Function defined above

### Exploring Target Variable

In [None]:
# Enter the Target variable
print(df_eda[''].value_counts())

# If the target variable is categorical
plot_single_categorical_countplot(df_eda,'') 
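
Because `yield` is continuous (see the problem statement), `value_counts` and a countplot tend to be less informative than summary statistics of the distribution. A hedged sketch on synthetic data follows; the values are randomly generated stand-ins, not real yields.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the target column (location and scale are arbitrary)
toy = pd.DataFrame({"yield": rng.normal(loc=6000, scale=1300, size=500)})

target = toy["yield"]
print(target.describe())
print(f"Skewness: {target.skew():.3f}")
print(f"Kurtosis: {target.kurt():.3f}")
```

A strongly skewed target may be worth transforming before modeling, which is a judgment to make on the real data.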

### Data Visualization 

- [x] Histogram
- [x] Boxplot
- [x] Heatmap
- [x] Countplot

In [None]:
# Heatmap
plt.figure(figsize =(15,5))
sns.heatmap(df_eda.corr(numeric_only = True),annot = True,cmap = "YlGnBu")
plt.show()
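
A heatmap over many columns can be hard to read, so sorting the correlations with the target gives a more compact ranking. This sketch uses synthetic columns; `feature_a` and `feature_b` are hypothetical names, and the relationship between them and `yield` is constructed for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
feature_a = rng.normal(size=n)
feature_b = rng.normal(size=n)
toy = pd.DataFrame({
    "feature_a": feature_a,
    "feature_b": feature_b,
    # Target depends strongly on feature_a and otherwise only on noise
    "yield": 3.0 * feature_a + rng.normal(scale=0.5, size=n),
})

# Pearson correlation of every numeric column with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)["yield"]
    .drop("yield")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value does not rule out a useful feature.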

In [None]:
plot_numerical_histogram_boxplot(df_eda)

In [None]:
plot_categorical_countplots(df_eda)

## Insights

## Preprocessing

## Dataframe Pipeline

This section provides a clear view of the data processing pipeline.

1. **Initial Dataframe**

- **Name:** `df`
- **Description:** This is the initial dataframe containing the raw data.

2. **Exploratory Data Analysis (EDA)**

- **Name:** `df_eda`
- **Description:** A deep copy of the initial dataframe `df`, used for exploratory data analysis.

3. **Heading**

- **Name:** `df_`
- **Description:** Description of the dataset


4. **Final Dataset for Modeling**

- **Name:** `X_train_final`, `X_test_final`, `X_val_final`
- **Description:** The final datasets used for machine learning modeling, including preprocessing steps like feature scaling, encoding, and feature selection.
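
How the final frames are produced is not specified above. One possible way to derive train, validation, and test sets, with scaling statistics fitted on the training set only, is sketched below in plain pandas and NumPy. The 70/15/15 proportions, the random seed, and the toy features are assumptions; the names `X_train_final`, `X_val_final`, and `X_test_final` mirror the list above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.uniform(size=100),
    "yield": rng.normal(size=100),
})

X = toy.drop(columns="yield")
y = toy["yield"]

# Shuffle row positions once, then slice roughly 70% / 15% / 15%
order = rng.permutation(len(X))
n_train = int(0.70 * len(X))
n_val = int(0.15 * len(X))
train_idx = order[:n_train]
val_idx = order[n_train:n_train + n_val]
test_idx = order[n_train + n_val:]

X_train, X_val, X_test = X.iloc[train_idx], X.iloc[val_idx], X.iloc[test_idx]
y_train, y_val, y_test = y.iloc[train_idx], y.iloc[val_idx], y.iloc[test_idx]

# Standardize with statistics from the training set only, to avoid leakage
mean, std = X_train.mean(), X_train.std()
X_train_final = (X_train - mean) / std
X_val_final = (X_val - mean) / std
X_test_final = (X_test - mean) / std

print(X_train_final.shape, X_val_final.shape, X_test_final.shape)
```

Fitting the scaler on the training rows only keeps information from the validation and test sets out of the preprocessing step.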