# 2403 PT_DS Regression Project

### Project Title: Regression Project
Analyse and predict average temperature from the agri-food sector.
#### Done By: Regression Project Team (K Ebrahim, J Sithole, J Maleka, S Tlhale, N Mhlophe & M Majola)

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#nine>6. Conclusion and Future Work</a>

<a href=#ten>7. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>
* **Objective of the Project:** 

    The aim is to analyse and predict average temperature from agri-food sector emissions data published by the FAO and IPCC, in order to understand climate impacts and support sustainable strategies for stakeholders, including policymakers and agricultural businesses.


* **Project Purpose:** 

     The purpose of this project is to analyse and predict average temperatures in the agri-food sector. Using data from the FAO (Food and Agriculture Organization) and the IPCC (Intergovernmental Panel on Climate Change), the project aims to show how climate impacts affect the agricultural industry. The resulting insights are intended to help policymakers and agricultural businesses develop sustainable strategies that mitigate climate-change risks and improve agricultural productivity and sustainability.

* **Project Details:** 

     **Data Source**: The project utilizes data from internationally recognized organizations like FAO and IPCC, ensuring reliable and comprehensive climate and agricultural data.

     **Focus**: The project centers on understanding the relationship between average temperatures and various factors in the agri-food sector, potentially including crop yields, soil conditions, and climate variability.

     **Goal**: Develop predictive models that can forecast future average temperatures in agricultural regions, which will help stakeholders plan for climate resilience.
    
     **Impact**: This information will support sustainable agricultural practices, improve decision-making for businesses, and help policymakers in creating adaptation strategies for climate-related challenges.
---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [2]:
# Import all packages used in the project
import pickle                                       # For saving and loading Python objects.
#import joblib                                      # For saving and loading large NumPy arrays and Python objects efficiently.
from sklearn import metrics                         # For evaluation metrics such as mean squared error and R-squared.
import statsmodels.api as sm                        # For statistical modeling, including linear regression.
from sklearn.datasets import load_diabetes          # For loading a built-in sample dataset.
from sklearn.pipeline import make_pipeline          # For chaining preprocessing and modeling steps into one pipeline.
from sklearn.tree import DecisionTreeRegressor      # For decision tree models.
from sklearn.preprocessing import StandardScaler    # For standardizing features (zero mean, unit variance).
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets.
from sklearn.linear_model import LinearRegression, Ridge, Lasso  # For linear, Ridge and Lasso regression models.
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error  # Individual regression metrics.
import pandas as pd                                 # For data manipulation and analysis.
import numpy as np                                  # For numerical operations.
import matplotlib.pyplot as plt                     # For data visualization.
import seaborn as sns                               # For statistical data visualization built on Matplotlib.
import warnings                                     # For controlling warnings raised during code execution.
warnings.filterwarnings('ignore')                   # Suppress all warnings in the notebook output.
from scipy.stats import ttest_ind                   # For the independent two-sample t-test.

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---


---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

*  `pandas` is a Python library for data manipulation and analysis.
*  `pd.read_csv` loads a CSV (comma-separated values) file into a DataFrame, a tabular data structure.
*  `index_col=False` tells pandas not to use any column of the file as the row index, so a default integer index is created instead.
*  `df.head(4)` displays the first 4 rows of the DataFrame, a quick check that the data was loaded correctly.

In [3]:
# Load the CSV data into a pandas DataFrame; 'index_col=False' means no column is used as the row index
df = pd.read_csv('co2_emissions_from_agri.csv', index_col=False)

# Display the first 4 rows of the DataFrame to get a quick preview of the data
df.head(4)

Unnamed: 0,Area,Year,Savanna fires,Forest fires,Crop Residues,Rice Cultivation,Drained organic soils (CO2),Pesticides Manufacturing,Food Transport,Forestland,...,Manure Management,Fires in organic soils,Fires in humid tropical forests,On-farm energy use,Rural population,Urban population,Total Population - Male,Total Population - Female,total_emission,Average Temperature °C
0,Afghanistan,1990,14.7237,0.0557,205.6077,686.0,0.0,11.807483,63.1152,-2388.803,...,319.1763,0.0,0.0,,9655167.0,2593947.0,5348387.0,5346409.0,2198.963539,0.536167
1,Afghanistan,1991,14.7237,0.0557,209.4971,678.16,0.0,11.712073,61.2125,-2388.803,...,342.3079,0.0,0.0,,10230490.0,2763167.0,5372959.0,5372208.0,2323.876629,0.020667
2,Afghanistan,1992,14.7237,0.0557,196.5341,686.0,0.0,11.712073,53.317,-2388.803,...,349.1224,0.0,0.0,,10995568.0,2985663.0,6028494.0,6028939.0,2356.304229,-0.259583
3,Afghanistan,1993,14.7237,0.0557,230.8175,686.0,0.0,11.712073,54.3617,-2388.803,...,352.2947,0.0,0.0,,11858090.0,3237009.0,7003641.0,7000119.0,2368.470529,0.101917
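
The effect of `index_col=False` can be seen on a small in-memory example. This is only a sketch: the two-row `csv_text` below is a hypothetical stand-in for the real file, and `index_col=0` is shown purely for contrast.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for 'co2_emissions_from_agri.csv'
csv_text = (
    "Area,Year,total_emission\n"
    "Afghanistan,1990,2198.96\n"
    "Afghanistan,1991,2323.88\n"
)

# index_col=False: no column becomes the index; rows are labelled 0, 1, ...
default_index = pd.read_csv(io.StringIO(csv_text), index_col=False)

# index_col=0: the first column ('Area') is used as the row index instead
area_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.columns.tolist())  # 'Area' stays a regular column
print(area_index.index.name)           # 'Area' has become the index
```

Keeping `Area` as a regular column is convenient here because it is later treated as a categorical feature rather than a row label.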


*  `pd.set_option` is the pandas function for changing settings.
*  `'display.max_columns'` controls how many columns are shown when a DataFrame is printed.
*  `None` means "no limit": pandas displays all columns, however many there are.
*  Purpose: the code ensures that every column is visible when the DataFrame is viewed. By default, pandas may hide some columns to keep the output compact.

In [4]:
# Set pandas option to display all columns in the output (without hiding any)
pd.set_option('display.max_columns', None)
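
`pd.set_option` changes the setting for the rest of the session. If the wider display is only wanted for a single output, `pd.option_context` is a possible alternative; the sketch below (not part of the original notebook) shows that the previous value is restored once the `with` block ends.

```python
import pandas as pd

# Remember whatever the current setting is
before = pd.get_option('display.max_columns')

# Inside the block, all columns are shown; the change is temporary
with pd.option_context('display.max_columns', None):
    inside = pd.get_option('display.max_columns')

# After the block, the earlier value is back in place
after = pd.get_option('display.max_columns')

print(inside, after == before)
```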

In [5]:
# The following line creates a separate copy of the original dataframe 'df' 
df_copy = df.copy()
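
The reason for working on a copy is that plain assignment only creates a second name for the same DataFrame. A minimal sketch with a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical one-row frame standing in for df
original = pd.DataFrame({'Area': ['Afghanistan'], 'Year': [1990]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

# Changing the copy leaves the original untouched
independent.loc[0, 'Year'] = 2020

print(alias is original)        # True: both names refer to one DataFrame
print(original.loc[0, 'Year'])  # still 1990
```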

In [6]:
# Display the shape of the dataframe
df_copy.shape

(6965, 31)

The `shape` attribute shows that the dataframe has 6,965 rows and 31 columns.

In [7]:
# Display a summary of the dataframe
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6965 entries, 0 to 6964
Data columns (total 31 columns):
Area                               6965 non-null object
Year                               6965 non-null int64
Savanna fires                      6934 non-null float64
Forest fires                       6872 non-null float64
Crop Residues                      5576 non-null float64
Rice Cultivation                   6965 non-null float64
Drained organic soils (CO2)        6965 non-null float64
Pesticides Manufacturing           6965 non-null float64
Food Transport                     6965 non-null float64
Forestland                         6472 non-null float64
Net Forest conversion              6472 non-null float64
Food Household Consumption         6492 non-null float64
Food Retail                        6965 non-null float64
On-farm Electricity Use            6965 non-null float64
Food Packaging                     6965 non-null float64
Agrifood Systems Waste Disposal    6965 n

**Results** :
- **Data Types** : The dataframe contains 3 data types: `float64`, `object` and `int64`.
- **Column Types** : There are 29 columns with the `float64` data type, indicating they hold numerical values. The `Year` column is of type `int64`, meaning it contains integer values, while the `Area` column is of type `object`, likely representing categorical data (such as names of regions or countries).
- **Non-Null Count** : Most columns have 6,965 non-null values, indicating they are fully populated. However, some columns have missing values, as shown by non-null counts below 6,965. These columns are `Savanna fires`, `Forest fires`, `Crop Residues`, `Forestland`, `Net Forest conversion`, `Food Household Consumption`, `IPPU`, `Manure applied to Soils`, `Manure Management`, `Fires in humid tropical forests`, and `On-farm energy use`.
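
The same summary can be derived programmatically instead of being read off the `info()` output. The sketch below uses a small hypothetical frame (`toy`) in place of `df_copy`; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_copy
toy = pd.DataFrame({
    'Area': ['A', 'B', 'C'],
    'Year': [1990, 1991, 1992],
    'Savanna fires': [14.7, np.nan, 3.2],
    'Rice Cultivation': [686.0, 678.2, 686.0],
})

# Number of columns per data type
dtype_counts = toy.dtypes.astype(str).value_counts()

# Names of the columns that contain at least one missing value
missing_cols = toy.columns[toy.isna().any()].tolist()

print(dtype_counts)
print(missing_cols)
```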

In [8]:
# Generates descriptive statistics of the dataframe
df_copy.describe()

Unnamed: 0,Year,Savanna fires,Forest fires,Crop Residues,Rice Cultivation,Drained organic soils (CO2),Pesticides Manufacturing,Food Transport,Forestland,Net Forest conversion,Food Household Consumption,Food Retail,On-farm Electricity Use,Food Packaging,Agrifood Systems Waste Disposal,Food Processing,Fertilizers Manufacturing,IPPU,Manure applied to Soils,Manure left on Pasture,Manure Management,Fires in organic soils,Fires in humid tropical forests,On-farm energy use,Rural population,Urban population,Total Population - Male,Total Population - Female,total_emission,Average Temperature °C
count,6965.0,6934.0,6872.0,5576.0,6965.0,6965.0,6965.0,6965.0,6472.0,6472.0,6492.0,6965.0,6965.0,6965.0,6965.0,6965.0,6965.0,6222.0,6037.0,6965.0,6037.0,6965.0,6810.0,6009.0,6965.0,6965.0,6965.0,6965.0,6965.0,6965.0
mean,2005.12491,1188.390893,919.302167,998.706309,4259.666673,3503.228636,333.418393,1939.58176,-17828.285678,17605.64,4847.580384,2043.210539,1626.68146,1658.629808,6018.444633,3872.724461,3035.723356,19991.5,923.225603,3518.026573,2263.344946,1210.315532,668.452931,3008.982252,17857740.0,16932300.0,17619630.0,17324470.0,64091.24,0.872989
std,8.894665,5246.287783,3720.078752,3700.34533,17613.825187,15861.445678,1429.159367,5616.748808,81832.210543,101157.5,25789.143619,8494.24926,9343.182193,11481.343725,22156.742542,19838.216846,11693.029064,111420.9,3226.992039,9103.556202,7980.542461,22669.84776,3264.879486,12637.86443,89015210.0,65743620.0,76039930.0,72517110.0,228313.0,0.55593
min,1990.0,0.0,0.0,0.0002,0.0,0.0,0.0,0.0001,-797183.079,0.0,0.0,0.0,0.0,0.0,0.34,0.0001,0.0019,0.0,0.049,0.0007,0.4329,0.0,0.0,0.0319,0.0,0.0,250.0,270.0,-391884.1,-1.415833
25%,1997.0,0.0,0.0,11.006525,181.2608,0.0,6.0,27.9586,-2848.35,0.0,11.39995,26.8185,8.0376,67.631366,86.6805,209.587728,360.358799,39.03153,16.303,139.6699,37.6321,0.0,0.0,13.2919,97311.0,217386.0,201326.0,207890.0,5221.244,0.511333
50%,2005.0,1.65185,0.5179,103.6982,534.8174,0.0,13.0,204.9628,-62.92,44.44,155.4711,172.0426,29.1207,74.018133,901.2757,344.7602,1115.0524,803.7066,120.4439,972.5674,269.8563,0.0,0.0,141.0963,1595322.0,2357581.0,2469660.0,2444135.0,12147.65,0.8343
75%,2013.0,111.0814,64.950775,377.640975,1536.64,690.4088,116.325487,1207.0009,0.0,4701.746,1377.15195,1075.9991,499.9447,281.791,3006.4421,1236.9134,2024.8699,6155.175,460.1202,2430.7926,1126.8189,0.0,9.577875,1136.9254,8177340.0,8277123.0,9075924.0,9112588.0,35139.73,1.20675
max,2020.0,114616.4011,52227.6306,33490.0741,164915.2556,241025.0696,16459.0,67945.765,171121.076,1605106.0,466288.2007,133784.0653,165676.299,175741.3061,213289.7016,274253.5125,170826.4233,1861641.0,34677.3603,92630.7568,70592.6465,991717.5431,51771.2568,248879.1769,900099100.0,902077800.0,743586600.0,713341900.0,3115114.0,3.558083


**Results** :
- **High Variability** : For many columns, such as `Savanna fires`, `Forest fires`, and `Crop Residues`, the standard deviation is several times the mean (for example, `Savanna fires` has a mean of about 1,188 and a standard deviation of about 5,246), indicating large differences in emissions across regions and years.
- **Missing Values** : The `count` row is below 6,965 for several columns, confirming missing values that need to be addressed before further analysis.
- **Negative Values** : `Forestland` and `total_emission` have negative minimum values, which might indicate net carbon uptake by forests and net negative emissions in some regions. `Average Temperature °C` also has a negative minimum (about -1.42).
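
These observations can be checked directly from the `describe()` table. The sketch below is one possible approach on a hypothetical frame (`toy`): it computes the coefficient of variation (standard deviation divided by the absolute mean) and lists the columns whose minimum is negative. The cut-off of 1 for "high variability" is an assumption chosen for illustration, not a value from the project.

```python
import pandas as pd

# Hypothetical miniature of df_copy
toy = pd.DataFrame({
    'Savanna fires': [0.0, 0.0, 1.0, 100.0],
    'Forestland': [-2388.8, -62.9, 0.0, 150.0],
    'Year': [1990, 1991, 1992, 1993],
})

# One row per column, with count, mean, std, min, ... as columns
stats = toy.describe().T

# Coefficient of variation: spread relative to the size of the mean
stats['cv'] = stats['std'] / stats['mean'].abs()

# Assumed threshold: std larger than the mean counts as "high variability"
high_variability = stats.index[stats['cv'] > 1].tolist()

# Columns whose smallest value is below zero
has_negative = stats.index[stats['min'] < 0].tolist()

print(high_variability)
print(has_negative)
```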

In [10]:
# Display the total count of missing values for each column
df_copy.isna().sum()

Area                                  0
Year                                  0
Savanna fires                        31
Forest fires                         93
Crop Residues                      1389
Rice Cultivation                      0
Drained organic soils (CO2)           0
Pesticides Manufacturing              0
Food Transport                        0
Forestland                          493
Net Forest conversion               493
Food Household Consumption          473
Food Retail                           0
On-farm Electricity Use               0
Food Packaging                        0
Agrifood Systems Waste Disposal       0
Food Processing                       0
Fertilizers Manufacturing             0
IPPU                                743
Manure applied to Soils             928
Manure left on Pasture                0
Manure Management                   928
Fires in organic soils                0
Fires in humid tropical forests     155
On-farm energy use                  956


**Results** : In summary, most columns have no missing values, but some columns have a varying number of missing entries, ranging from 31 to 1,389.
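
Raw counts are easier to compare when expressed as a share of all rows. The sketch below shows one way to build such a table; `toy` is a hypothetical stand-in for `df_copy`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_copy
toy = pd.DataFrame({
    'Area': ['A', 'B', 'C', 'D'],
    'Crop Residues': [205.6, np.nan, np.nan, 230.8],
    'Forest fires': [0.05, 0.05, np.nan, 0.05],
})

# Missing count per column and the same figure as a percentage of all rows
missing = toy.isna().sum()
summary = pd.DataFrame({
    'missing': missing,
    'percent': (missing / len(toy) * 100).round(2),
})

# Keep only the affected columns, worst first
summary = summary[summary['missing'] > 0].sort_values('missing', ascending=False)

print(summary)
```

Applied to the counts above, `Crop Residues` has 1,389 of 6,965 values missing, which is roughly 19.9% of the rows.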

---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---
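
One possible way to handle the missing values found earlier is sketched below. It is an illustration under stated assumptions, not the project's chosen method: missing entries are filled with the median of the same `Area`, and with the overall median when an area has no observed values at all. The helper `fill_with_group_median` and the frame `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_copy; area 'B' has no observed values
toy = pd.DataFrame({
    'Area': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Crop Residues': [10.0, np.nan, 30.0, np.nan, np.nan, 100.0],
})


def fill_with_group_median(frame, column, group='Area'):
    """Fill NaNs with the group's median, then with the overall median."""
    group_median = frame.groupby(group)[column].transform('median')
    filled = frame[column].fillna(group_median)
    # Fallback for groups in which every value is missing
    return filled.fillna(frame[column].median())


cleaned = toy.copy()
cleaned['Crop Residues'] = fill_with_group_median(cleaned, 'Crop Residues')

print(cleaned)
```

The median is used rather than the mean because the descriptive statistics show strongly skewed columns, where a few very large values would pull the mean upwards.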


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---
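
As a starting point for the correlation analysis mentioned above, the sketch below ranks numeric features by the strength of their correlation with the target column. The values in `toy` are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical miniature of df_copy with invented values
toy = pd.DataFrame({
    'Area': ['A', 'A', 'A', 'A', 'A'],
    'Year': [1990, 1991, 1992, 1993, 1994],
    'total_emission': [2199.0, 2324.0, 2356.0, 2368.0, 2500.0],
    'Forestland': [-10.0, -30.0, -20.0, -50.0, -40.0],
    'Average Temperature °C': [0.1, 0.2, 0.3, 0.4, 0.5],
})

target = 'Average Temperature °C'

# Pearson correlation of every numeric column with the target
corr = toy.select_dtypes('number').corr()[target].drop(target)

# Order by absolute strength so strong negative relationships rank high too
correlations = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(correlations)
```

The resulting Series could then be passed to a bar plot, or the full correlation matrix to `sns.heatmap`, to visualize the relationships.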



---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---



---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---


## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Collaborators: 
  - Kyle Ebrahim
  - Selogilwe Tlhale
  - Josia Sithole
  - Mgcini Emmanuel Majola
  - Jerry Maleka
  - Nhlokomo Mhlophe
