# 2403 PT_DS Regression Project

### Project Title: Regression Project
Analyse and predict average temperature using agri-food sector emissions data.
#### Done By: Regression Project Team (K Ebrahim, J Sithole, J Maleka, S Tlhale, N Mhlophe & M Majola)

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#nine>6. Conclusion and Future Work</a>

<a href=#ten>7. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>
### Objective of the Project:
The aim is to analyze and predict average temperature from the agri-food sector, using data from the FAO and IPCC, to understand climate impacts and develop sustainable strategies for stakeholders including policymakers and agricultural businesses.

### Project Purpose:
This project draws on data from the FAO (Food and Agriculture Organization) and the IPCC (Intergovernmental Panel on Climate Change) to examine how climate impacts affect the agri-food sector and to predict average temperatures. The resulting insights are intended to help policymakers and agricultural businesses develop sustainable strategies that mitigate climate-related risks and improve agricultural productivity and sustainability.

### Project Details:
**Data Source**: The project utilizes data from internationally recognized organizations like FAO and IPCC, ensuring reliable and comprehensive climate and agricultural data.

**Focus**: The project centers on understanding the relationship between average temperatures and various factors in the agri-food sector, potentially including crop yields, soil conditions, and climate variability.

**Goal**: Develop predictive models that can forecast future average temperatures in agricultural regions, which will help stakeholders plan for climate resilience.

**Impact**: This information will support sustainable agricultural practices, improve decision-making for businesses, and help policymakers in creating adaptation strategies for climate-related challenges.

### Additional Details:
**Methodology**:
1. **Data Collection**: Gather data from FAO and IPCC databases.
2. **Data Cleaning**: Preprocess the data to handle missing values, outliers, and inconsistencies.
3. **Feature Engineering**: Create relevant features that can impact average temperature predictions.
4. **Model Development**: Train multiple predictive models (e.g., Linear Regression, Random Forest, XGBoost) and select the best-performing model.
5. **Validation**: Validate the model using a separate test dataset to ensure accuracy.
6. **Deployment**: Deploy the model for continuous temperature prediction and monitoring.
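
Steps 4 and 5 of the methodology can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual modelling code: the feature matrix, target, and hyperparameters are assumptions, and the candidate models are limited to ones the notebook already imports (Random Forest and XGBoost would slot into the same dictionary).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the emission features and temperature target (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + rng.normal(scale=0.1, size=500)

# Hold out a separate test set for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models; the linear ones are scaled inside a pipeline.
candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

# Fit each candidate and score it on the held-out data.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": float(r2_score(y_test, pred)),
    }

# Pick the model with the lowest test RMSE.
best_name = min(results, key=lambda n: results[n]["rmse"])
print(results)
print("best model:", best_name)
```

Selecting on test RMSE keeps the comparison simple; with real data, cross-validation on the training split would give a less noisy ranking.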

**Tools and Technologies**:
- **Python**: Primary programming language for data analysis and model development.
- **Pandas**: For data manipulation and analysis.
- **NumPy**: For numerical operations.
- **Scikit-learn**: For machine learning model development.
- **XGBoost**: For advanced gradient boosting models.
- **Matplotlib and Seaborn**: For data visualization.
- **Jupyter Notebook**: For interactive data analysis and model building.


---

### ***Dataset Features***:
**Area**: The data collection covers a vast portion of the globe (236 countries). The dataset includes data from a diverse range of countries across all continents, capturing different regions, climates, and agricultural practices.

**Year**: 1990 to 2020.

### Emission Sources:
- **Savanna fires**: CO2 emissions from fires in savanna regions.
- **Forest fires**: CO2 emissions from forest fires.
- **Crop Residues**: Emissions from the burning or decomposition of crop residues.
- **Rice Cultivation**: CO2 emissions related to rice farming, often associated with methane emissions as well.
- **Drained organic soils (CO2)**: Emissions from drained peatlands or other organic soils.
- **Pesticides Manufacturing**: Emissions from the production of pesticides.
- **Food Transport**: Emissions from the transportation of food products.
- **Forestland**: CO2 sequestration or emissions related to forested areas.
- **Net Forest conversion**: Net CO2 emissions from deforestation or reforestation.

### Food System Activities:
- **Food Household Consumption**: Emissions related to household food consumption.
- **Food Retail**: Emissions from retail activities related to food.
- **On-farm Electricity Use**: Emissions from electricity used on farms.
- **Food Packaging**: Emissions from the packaging of food products.
- **Agrifood Systems Waste Disposal**: Emissions from waste disposal in food systems.
- **Food Processing**: Emissions from the processing of food.
- **Fertilizers Manufacturing**: Emissions from the production of fertilizers.

### Other Activities and Emission Sources:
- **IPPU**: Emissions from industrial processes and product use.
- **Manure applied to Soils**: Emissions from the application of manure to soils.
- **Manure left on Pasture**: Emissions from manure left on pastures.
- **Manure Management**: Emissions from the management of manure.
- **Fires in organic soils**: CO2 emissions from fires in organic-rich soils.
- **Fires in humid tropical forests**: Emissions from fires in tropical forests.

### Energy Use and Demographics:
- **On-farm energy use**: Emissions from energy consumption on farms.
- **Rural population**: The rural population count.
- **Urban population**: The urban population count.
- **Total Population - Male**: Total male population.
- **Total Population - Female**: Total female population.

### Total Emission and Climate Data:
- **total_emission**: The total CO2 emissions from all sources combined.
- **Average Temperature °C**: Average temperature in degrees Celsius.


---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [1]:
# Import the packages used throughout the notebook
import pickle                                        # For saving and loading Python objects.
# import joblib                                      # For saving and loading large NumPy arrays and Python objects efficiently.
import warnings                                      # For controlling warnings generated during code execution.

import numpy as np                                   # For numerical operations.
import pandas as pd                                  # For data manipulation and analysis.
import matplotlib.pyplot as plt                      # For data visualization.
import seaborn as sns                                # For enhanced statistical data visualization.
import statsmodels.api as sm                         # For statistical modeling, including linear regression.
from scipy.stats import ttest_ind                    # For independent two-sample t-tests.
from sklearn import metrics                          # For evaluation metrics like mean squared error, R-squared, etc.
from sklearn.linear_model import LinearRegression, Ridge, Lasso  # For linear regression models, including Ridge and Lasso.
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error  # For individual regression metrics.
from sklearn.model_selection import train_test_split # For splitting data into training and testing sets.
from sklearn.pipeline import make_pipeline           # For chaining preprocessing and modeling steps.
from sklearn.preprocessing import StandardScaler     # For scaling features to zero mean and unit variance.
from sklearn.tree import DecisionTreeRegressor       # For decision tree models.

warnings.filterwarnings('ignore')                    # Suppress warnings in the notebook output.

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
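
A general description of size, scope, and data types can be produced along these lines. The sketch below uses a tiny stand-in frame built from a few values shown in the notebook's `df.head()` output; `describe_dataset` is a hypothetical helper, not part of the project.

```python
import pandas as pd

# Tiny stand-in built from values shown in the notebook's df.head() output.
sample = pd.DataFrame({
    "Area": ["Afghanistan"] * 5,
    "Year": [1990, 1991, 1992, 1993, 1994],
    "Crop Residues": [205.6077, 209.4971, 196.5341, 230.8175, 242.0494],
    "Average Temperature °C": [0.536167, 0.020667, -0.259583, 0.101917, 0.37225],
})


def describe_dataset(frame: pd.DataFrame) -> dict:
    """Summarise size, scope, and column types (hypothetical helper)."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "areas": frame["Area"].nunique(),
        "year_range": (int(frame["Year"].min()), int(frame["Year"].max())),
        "numeric_columns": frame.select_dtypes(include="number").columns.tolist(),
        "categorical_columns": frame.select_dtypes(exclude="number").columns.tolist(),
    }


summary = describe_dataset(sample)
print(summary)
```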
---

In [2]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [103]:
# Loading co2 emissions dataset using pandas

df = pd.read_csv('co2_emissions_from_agri.csv')
df.head()

| | Area | Year | Savanna fires | Forest fires | Crop Residues | Rice Cultivation | Drained organic soils (CO2) | Pesticides Manufacturing | Food Transport | Forestland | ... | Manure Management | Fires in organic soils | Fires in humid tropical forests | On-farm energy use | Rural population | Urban population | Total Population - Male | Total Population - Female | total_emission | Average Temperature °C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1990 | 14.7237 | 0.0557 | 205.6077 | 686.0 | 0.0 | 11.807483 | 63.1152 | -2388.803 | ... | 319.1763 | 0.0 | 0.0 | NaN | 9655167.0 | 2593947.0 | 5348387.0 | 5346409.0 | 2198.963539 | 0.536167 |
| 1 | Afghanistan | 1991 | 14.7237 | 0.0557 | 209.4971 | 678.16 | 0.0 | 11.712073 | 61.2125 | -2388.803 | ... | 342.3079 | 0.0 | 0.0 | NaN | 10230490.0 | 2763167.0 | 5372959.0 | 5372208.0 | 2323.876629 | 0.020667 |
| 2 | Afghanistan | 1992 | 14.7237 | 0.0557 | 196.5341 | 686.0 | 0.0 | 11.712073 | 53.317 | -2388.803 | ... | 349.1224 | 0.0 | 0.0 | NaN | 10995568.0 | 2985663.0 | 6028494.0 | 6028939.0 | 2356.304229 | -0.259583 |
| 3 | Afghanistan | 1993 | 14.7237 | 0.0557 | 230.8175 | 686.0 | 0.0 | 11.712073 | 54.3617 | -2388.803 | ... | 352.2947 | 0.0 | 0.0 | NaN | 11858090.0 | 3237009.0 | 7003641.0 | 7000119.0 | 2368.470529 | 0.101917 |
| 4 | Afghanistan | 1994 | 14.7237 | 0.0557 | 242.0494 | 705.6 | 0.0 | 11.712073 | 53.9874 | -2388.803 | ... | 367.6784 | 0.0 | 0.0 | NaN | 12690115.0 | 3482604.0 | 7733458.0 | 7722096.0 | 2500.768729 | 0.37225 |


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [72]:
# Find count of nulls
null_counts = df.isnull().sum()

# Determine % of nulls in columns
null_percentage = (df.isnull().sum() / len(df)) * 100
null_percentage = null_percentage[null_percentage > 0]

# Add to dataframe
null_df = pd.DataFrame({
    'null_count': null_counts,
    'null_column_percentage': null_percentage
})

# Filter df for nulls only
null_df = null_df[null_df["null_count"]> 0]
null_df

| | null_count | null_column_percentage |
|---|---|---|
| Crop Residues | 1389 | 19.94257 |
| Fires in humid tropical forests | 155 | 2.225413 |
| Food Household Consumption | 473 | 6.791098 |
| Forest fires | 93 | 1.335248 |
| Forestland | 493 | 7.078248 |
| IPPU | 743 | 10.667624 |
| Manure Management | 928 | 13.323762 |
| Manure applied to Soils | 928 | 13.323762 |
| Net Forest conversion | 493 | 7.078248 |
| On-farm energy use | 956 | 13.725772 |


- Next step: decide which null values to fill and which to remove.
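
One possible way to fill the nulls is sketched below. It is an assumption rather than the team's chosen approach: each gap is filled with the median for the same `Area`, falling back to the column-wide median when an area has no observed values. The toy frame and the `fill_with_group_median` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the null pattern: area "B" has no Crop Residues at all.
toy = pd.DataFrame({
    "Area": ["A", "A", "A", "B", "B", "B"],
    "Crop Residues": [1.0, np.nan, 3.0, np.nan, np.nan, np.nan],
    "Forest fires": [0.5, 0.7, np.nan, 2.0, 4.0, np.nan],
})


def fill_with_group_median(frame, group_col, value_cols):
    """Fill NaNs with the per-group median, then the overall column median."""
    filled = frame.copy()
    for col in value_cols:
        overall_median = filled[col].median()
        group_median = filled.groupby(group_col)[col].transform("median")
        filled[col] = filled[col].fillna(group_median).fillna(overall_median)
    return filled


filled = fill_with_group_median(toy, "Area", ["Crop Residues", "Forest fires"])
print(filled)
```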
---

In [50]:
# Find duplicates
print("Number of duplicate rows:", df.duplicated().sum())

Number of duplicate rows: 0


In [91]:
# Find outliers using scipy.stats
from scipy import stats

# Select numeric datatypes
numeric_cols = df.select_dtypes(include="number")

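# Calculate absolute z-scores per column.
# Note: with scipy's default nan_policy="propagate", a column containing NaN
# yields NaN z-scores, so columns with nulls never flag outliers here.
zscores = np.abs(stats.zscore(numeric_cols))

# Keep rows where at least one column has |z| > 3
outliers = numeric_cols[(zscores > 3).any(axis=1)]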

# Filter df with indices of outliers
df.loc[outliers.index]

| | Area | Year | Savanna fires | Forest fires | Crop Residues | Rice Cultivation | Drained organic soils (CO2) | Pesticides Manufacturing | Food Transport | Forestland | ... | Manure Management | Fires in organic soils | Fires in humid tropical forests | On-farm energy use | Rural population | Urban population | Total Population - Male | Total Population - Female | total_emission | Average Temperature °C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 262 | Argentina | 2004 | 2850.2854 | 690.3494 | 4144.1416 | 1326.3555 | 5210.0713 | 4707.0 | 3709.2319 | -35904.1100 | ... | 2177.2094 | 0.0 | 297.7794 | 11889.3405 | 3927028.0 | 34801668.0 | 19077001.0 | 19591796.0 | 1.839589e+05 | 0.583583 |
| 263 | Argentina | 2005 | 2117.7708 | 583.9524 | 4771.5106 | 1246.0739 | 5197.0239 | 4951.0 | 3989.6423 | -35904.1100 | ... | 2219.8865 | 0.0 | 251.7815 | 15883.5885 | 3902354.0 | 35243134.0 | 19279145.0 | 19791355.0 | 1.896272e+05 | 0.228417 |
| 264 | Argentina | 2006 | 2191.8526 | 515.5174 | 4475.2982 | 1325.1560 | 5198.8975 | 5356.0 | 4421.3458 | -35904.1100 | ... | 2305.7266 | 0.0 | 160.4116 | 11520.1690 | 3876846.0 | 35682044.0 | 19484104.0 | 19992747.0 | 1.895432e+05 | 0.669083 |
| 265 | Argentina | 2007 | 2170.6834 | 505.2612 | 5277.9343 | 1290.7384 | 5198.0361 | 6090.0 | 4464.8701 | -35904.1100 | ... | 2320.7661 | 0.0 | 247.2355 | 10121.1439 | 3850764.0 | 36119460.0 | 19685381.0 | 20190731.0 | 1.909674e+05 | -0.195750 |
| 266 | Argentina | 2008 | 2767.1404 | 574.2262 | 5430.2776 | 1430.4864 | 5212.2518 | 6737.0 | 4743.0185 | -35904.1100 | ... | 2286.3380 | 0.0 | 156.1026 | 14333.5399 | 3824321.0 | 36558068.0 | 19886051.0 | 20387718.0 | 1.977888e+05 | 0.660083 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6619 | United States of America | 2018 | 1540.0409 | 2457.9409 | 26846.6988 | 11540.9700 | 50091.9803 | 11207.0 | 67945.7650 | -331408.0367 | ... | 54262.4288 | 0.0 | 3.4785 | 51007.8455 | 57980034.0 | 268786714.0 | 164538320.0 | 167601717.0 | 1.034913e+06 | 1.243167 |
| 6620 | United States of America | 2019 | 988.5763 | 1190.9796 | 24747.3192 | 9823.7160 | 50226.9948 | 10884.0 | 67843.5764 | -331408.0367 | ... | 55231.7961 | 0.0 | 0.4175 | 43785.9965 | 57727196.0 | 271365914.0 | 165698830.0 | 168620840.0 | 1.022096e+06 | 1.053917 |
| 6621 | United States of America | 2020 | 2031.3179 | 5405.3003 | 26151.6287 | 11846.3380 | 50220.7454 | 10884.0 | 60066.4841 | -331408.0367 | ... | 54764.8665 | 0.0 | 0.2783 | 40802.7314 | 57456395.0 | 273975139.0 | 166504407.0 | 169437597.0 | 1.023694e+06 | 1.322083 |
| 6684 | USSR | 1990 | 8405.2264 | 7262.4148 | 14854.7660 | 4813.7600 | 131838.2352 | 1169.0 | 32821.6716 | -605722.9991 | ... | 60407.5004 | 0.0 | 0.0000 | 248879.1769 | 99158922.0 | 188910774.0 | 136777703.0 | 153126995.0 | 5.244739e+05 | 1.158250 |


- Decide how to handle the outliers: either remove them or use robust regression, which is less sensitive to outliers.
- The dataset can be split to cover both scenarios: without outliers for `LinearRegression`, and with outliers for `HuberRegressor` from sklearn.
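
The difference between the two options can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not a result from the project dataset: the true slope of 2.0, the noise level, and the injected outliers are all made up to show how `HuberRegressor` reacts compared with `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic linear relationship: y = 2x + 1 plus small noise (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=200)

# Inflate the targets of the 10 largest-x points to mimic extreme observations.
y_outliers = y.copy()
outlier_idx = np.argsort(X[:, 0])[-10:]
y_outliers[outlier_idx] += 100.0

# Ordinary least squares is pulled towards the outliers...
ols = LinearRegression().fit(X, y_outliers)
# ...while the Huber loss down-weights large residuals.
huber = HuberRegressor(max_iter=1000).fit(X, y_outliers)

print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])
```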
---

In [75]:
# Check data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6965 entries, 0 to 6964
Data columns (total 31 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Area                             6965 non-null   object 
 1   Year                             6965 non-null   int64  
 2   Savanna fires                    6934 non-null   float64
 3   Forest fires                     6872 non-null   float64
 4   Crop Residues                    5576 non-null   float64
 5   Rice Cultivation                 6965 non-null   float64
 6   Drained organic soils (CO2)      6965 non-null   float64
 7   Pesticides Manufacturing         6965 non-null   float64
 8   Food Transport                   6965 non-null   float64
 9   Forestland                       6472 non-null   float64
 10  Net Forest conversion            6472 non-null   float64
 11  Food Household Consumption       6492 non-null   float64
 12  Food Retail         

- The only object-type column is Area, which holds country names; it can be one-hot encoded before fitting models.
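
One-hot encoding of the area column could look like the sketch below. It uses `pd.get_dummies` on a toy frame; the `area` prefix and the choice to drop the first category are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with two of the country names that appear in the dataset.
toy = pd.DataFrame({
    "Area": ["Afghanistan", "Argentina", "Afghanistan"],
    "Year": [1990, 2004, 1991],
})

# drop_first=True removes one category to avoid perfectly collinear columns
# in linear models; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy, columns=["Area"], prefix="area", drop_first=True, dtype=int
)
print(encoded)
```

With the 236 areas described in the dataset features, this adds a large number of sparse columns, which is worth weighing against simply dropping the column or grouping areas into regions.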

In [99]:
# Clean column names
def clean_column_names(columns):
    # Initialize empty list
    cleaned_cols = []
    for col in columns:
        # Apply lower function and replace spaces and characters
        col = col.lower().replace(" ", "_").replace("(", "").replace(")", "").replace("-", "").replace("°", "").replace("__", "_")
        # Append each cleaned column to list
        cleaned_cols.append(col)
    return cleaned_cols

# Apply function to df
df.columns = clean_column_names(df.columns)
df.columns

Index(['area', 'year', 'savanna_fires', 'forest_fires', 'crop_residues',
       'rice_cultivation', 'drained_organic_soils_co2',
       'pesticides_manufacturing', 'food_transport', 'forestland',
       'net_forest_conversion', 'food_household_consumption', 'food_retail',
       'onfarm_electricity_use', 'food_packaging',
       'agrifood_systems_waste_disposal', 'food_processing',
       'fertilizers_manufacturing', 'ippu', 'manure_applied_to_soils',
       'manure_left_on_pasture', 'manure_management', 'fires_in_organic_soils',
       'fires_in_humid_tropical_forests', 'onfarm_energy_use',
       'rural_population', 'urban_population', 'total_population_male',
       'total_population_female', 'total_emission', 'average_temperature_c'],
      dtype='object')

---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
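
A first pass at the correlation analysis might look like the sketch below. The data are synthetic and the trend built into them is an assumption, so the numbers say nothing about the real dataset; only the column names follow the cleaned names from the previous section.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in; the year/temperature trend is built in by construction.
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "year": rng.integers(1990, 2021, size=n),
    "total_emission": rng.lognormal(mean=8, sigma=1, size=n),
})
toy["average_temperature_c"] = (
    0.03 * (toy["year"] - 1990) + rng.normal(scale=0.3, size=n)
)

# Correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["average_temperature_c"]
    .drop("average_temperature_c")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)

# Mean temperature per year, a natural input for a line plot.
yearly_mean = toy.groupby("year")["average_temperature_c"].mean()
print(yearly_mean.head())
```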
---


In [5]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [6]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [7]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Collaborators: 
  - Kyle Ebrahim
  - Selogilwe Tlhale
  - Josia Sithole
  - Mgcini Emmanuel Majola
  - Jerry Maleka
  - Nhlokomo Mhlophe
