# Agricultural Emissions Regression Project


![agri_image](agri_image.png)

<a id="cont"></a>

## Table of Contents
* <b>[1. Project Overview](#chapter1)</b>
* <b>[2. Importing Packages](#chapter2)</b>
* <b>[3. Loading Data](#chapter3)</b>
* <b>[4. Data Cleaning](#chapter4)</b>
* <b>[5. Exploratory Data Analysis (EDA)](#chapter5)</b>
* <b>[6. Regression Models](#chapter6)</b>
* <b>[7. Conclusion](#chapter7)</b>

## 1. Project Overview <a class="anchor" id="chapter1"></a>

## 2. Importing Packages <a class="anchor" id="chapter2"></a>

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression


import csv
import seaborn as sns


# Displays output inline
%matplotlib inline

# Library for handling warnings (all warning messages are suppressed below)
import warnings
warnings.filterwarnings('ignore')

**_Insights_**<br>
We imported several libraries to assist with data manipulation and analysis:<br>
* NumPy - to work with arrays,
* Pandas - to load and analyse our data,
* Matplotlib - for data visualization,
* Seaborn - for statistical graphics; it works seamlessly with Pandas DataFrames,
* scikit-learn (`sklearn`) - a machine learning library that we use for our regression tasks,
* csv - Python's built-in module for reading and writing CSV files,
* warnings - used here to suppress warning messages.


## 3. Loading Data <a class="anchor" id="chapter3"></a>

In [4]:
df = pd.read_csv("co2_emissions_from_agri.csv", index_col=False)

**_Insights_**<br>
The `pd.read_csv` function was used to load the CSV file into a DataFrame named `df`.<br>
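
As a minimal sketch of what this call does, the snippet below reads a small in-memory CSV rather than the project file. The area name and values are made up for illustration, and `sample_df` is a hypothetical name. With `index_col=False`, pandas does not use the first column as the index, so the frame keeps a default `RangeIndex`.

```python
import io

import pandas as pd

# Hypothetical stand-in for co2_emissions_from_agri.csv (made-up values)
csv_text = io.StringIO(
    "Area,Year,total_emission\n"
    "Exampleland,1990,10.5\n"
    "Exampleland,1991,11.0\n"
)

# index_col=False: do not use the first column as the index
sample_df = pd.read_csv(csv_text, index_col=False)

print(sample_df.shape)          # (rows, columns)
print(list(sample_df.columns))  # every column stays a regular column
print(sample_df.index)          # default RangeIndex
```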

In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6965 entries, 0 to 6964
Data columns (total 31 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Area                             6965 non-null   object 
 1   Year                             6965 non-null   int64  
 2   Savanna fires                    6934 non-null   float64
 3   Forest fires                     6872 non-null   float64
 4   Crop Residues                    5576 non-null   float64
 5   Rice Cultivation                 6965 non-null   float64
 6   Drained organic soils (CO2)      6965 non-null   float64
 7   Pesticides Manufacturing         6965 non-null   float64
 8   Food Transport                   6965 non-null   float64
 9   Forestland                       6472 non-null   float64
 10  Net Forest conversion            6472 non-null   float64
 11  Food Household Consumption       6492 non-null   float64
 12  Food Retail         

**_Insights_**<br>
We used the `.info()` method to view the total number of columns (31) and the total number of entries in our dataset (6965). The `.info()` method also shows the following datatypes in our dataset:
+ Objects = 1 column
+ Int64 = 1 column
+ Float64 = 29 columns<br>
We can also detect a number of null values in various columns of the dataset.<br><br>
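
The same summary can be computed directly instead of being read off the `.info()` printout. The sketch below uses a small toy frame with made-up values; `toy`, `dtype_counts`, `null_counts` and `columns_with_nulls` are illustrative names, not part of the project code.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (made-up values)
toy = pd.DataFrame({
    "Area": ["A", "B", "C"],
    "Year": [1990, 1991, 1992],
    "Forest fires": [0.5, np.nan, 1.2],
})

# Number of columns per datatype
dtype_counts = toy.dtypes.value_counts()

# Number of missing values per column, then only the columns that have any
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]

print(dtype_counts)
print(columns_with_nulls)
```

On the real DataFrame, replacing `toy` with `df` should list the columns whose non-null count is below 6965.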

In [11]:
pd.set_option("display.max_columns", None)
df.head()

Unnamed: 0,Area,Year,Savanna fires,Forest fires,Crop Residues,Rice Cultivation,Drained organic soils (CO2),Pesticides Manufacturing,Food Transport,Forestland,Net Forest conversion,Food Household Consumption,Food Retail,On-farm Electricity Use,Food Packaging,Agrifood Systems Waste Disposal,Food Processing,Fertilizers Manufacturing,IPPU,Manure applied to Soils,Manure left on Pasture,Manure Management,Fires in organic soils,Fires in humid tropical forests,On-farm energy use,Rural population,Urban population,Total Population - Male,Total Population - Female,total_emission,Average Temperature °C
0,Afghanistan,1990,14.7237,0.0557,205.6077,686.0,0.0,11.807483,63.1152,-2388.803,0.0,79.0851,109.6446,14.2666,67.631366,691.7888,252.21419,11.997,209.9778,260.1431,1590.5319,319.1763,0.0,0.0,,9655167.0,2593947.0,5348387.0,5346409.0,2198.963539,0.536167
1,Afghanistan,1991,14.7237,0.0557,209.4971,678.16,0.0,11.712073,61.2125,-2388.803,0.0,80.4885,116.6789,11.4182,67.631366,710.8212,252.21419,12.8539,217.0388,268.6292,1657.2364,342.3079,0.0,0.0,,10230490.0,2763167.0,5372959.0,5372208.0,2323.876629,0.020667
2,Afghanistan,1992,14.7237,0.0557,196.5341,686.0,0.0,11.712073,53.317,-2388.803,0.0,80.7692,126.1721,9.2752,67.631366,743.6751,252.21419,13.4929,222.1156,264.7898,1653.5068,349.1224,0.0,0.0,,10995568.0,2985663.0,6028494.0,6028939.0,2356.304229,-0.259583
3,Afghanistan,1993,14.7237,0.0557,230.8175,686.0,0.0,11.712073,54.3617,-2388.803,0.0,85.0678,81.4607,9.0635,67.631366,791.9246,252.21419,14.0559,201.2057,261.7221,1642.9623,352.2947,0.0,0.0,,11858090.0,3237009.0,7003641.0,7000119.0,2368.470529,0.101917
4,Afghanistan,1994,14.7237,0.0557,242.0494,705.6,0.0,11.712073,53.9874,-2388.803,0.0,88.8058,90.4008,8.3962,67.631366,831.9181,252.21419,15.1269,182.2905,267.6219,1689.3593,367.6784,0.0,0.0,,12690115.0,3482604.0,7733458.0,7722096.0,2500.768729,0.37225


**_Insights_**<br>
We executed `pd.set_option("display.max_columns", None)` so that pandas displays every column instead of truncating wide output. This lets `df.head()` show all the columns and gives us a better understanding of our dataset.<br><br>
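
If the setting should apply to a single display only, one possible alternative is `pd.option_context`, which restores the previous value when the block ends. This is a sketch with a made-up 40-column frame (`wide` is a hypothetical name), not what the notebook itself does.

```python
import pandas as pd

# One-row frame with 40 made-up columns, wide enough to be truncated by default limits
wide = pd.DataFrame([range(40)], columns=[f"col_{i}" for i in range(40)])

before = pd.get_option("display.max_columns")

# Inside the block the column limit is lifted; it is restored on exit
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    shown = repr(wide)

after = pd.get_option("display.max_columns")

print(inside)             # None while the context is active
print(after == before)    # the earlier setting is back
print("col_39" in shown)  # the last column is part of the rendered text
```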

## 4. Data Cleaning <a class="anchor" id="chapter4"></a>

In [7]:
# Missing values per column (uncomment to inspect)
# df.isnull().sum()

# Number of duplicated rows; the result is 0, so there are none
df.duplicated().sum()

0

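**_Insights_**<br>
The duplicate check returned 0, so the dataset contains no duplicated rows. Missing values, already visible in the `.info()` output, can be counted per column with `df.isnull().sum()`.<br>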
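
The missing-value and duplicate checks from the cell above can be tried on a small toy frame. The values below are made up so that one row is repeated and one value is missing; all names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 0 and 1 are identical, and one value is missing
toy = pd.DataFrame({
    "Area": ["A", "A", "B"],
    "Year": [1990, 1990, 1991],
    "Crop Residues": [1.5, 1.5, np.nan],
})

# Share of missing values per column (between 0.0 and 1.0)
missing_share = toy.isnull().mean()

# duplicated() flags every repeat of an earlier row
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(missing_share)
print(n_duplicates)
print(len(deduplicated))
```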

In [12]:
def check_for_conditional_values(df, condition, value):
    '''
    Print, for each numeric column, the number of values that match the given condition and value.
    Used to identify columns that contain unexpected values, e.g. -1 where the value should be NaN.
    For example, check_for_conditional_values(df, "==", -1) prints the number of values equal to -1 per column.

    Parameters:
    df (pandas.DataFrame): The DataFrame whose numeric (float64/int64) columns are checked.
    condition (str): The comparison operator to apply: "<", "<=", "==", ">=" or ">".
    value (int or float): The value to compare each column against.

    Returns:
    None. The count of matching values per column is printed to the screen.
    '''
    print(f"Checking columns with values {condition} {value}")
    for col in df.columns:
        if df[col].dtype in ["float64", "int64"]:
            if condition == "<":
                matching_values = df[col] < value
            elif condition == "<=":
                matching_values = df[col] <= value
            elif condition == "==":
                matching_values = df[col] == value
            elif condition == ">=":
                matching_values = df[col] >= value
            elif condition == ">":
                matching_values = df[col] > value
            else:
                print("Invalid conditional operator specified")
                return
            count_matches = matching_values.sum()

            if count_matches > 0:
                print(f"{col} has {count_matches} values matching condition {condition} {value}")
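
A more compact variant of the same idea is sketched below. It is an alternative, not the function used in this notebook: `COMPARATORS` and `count_matching_values` are hypothetical names, it returns the counts instead of printing them, and it selects every numeric dtype rather than only `float64` and `int64`. The toy values are made up.

```python
import operator

import pandas as pd

# Hypothetical lookup table mapping operator strings to comparison functions
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


def count_matching_values(df, condition, value):
    """Return {column: count} for numeric columns with at least one match."""
    if condition not in COMPARATORS:
        raise ValueError(f"Invalid conditional operator: {condition!r}")
    compare = COMPARATORS[condition]
    numeric = df.select_dtypes(include="number")
    # Comparing the whole frame gives a boolean frame; sum() counts True per column
    counts = compare(numeric, value).sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


toy = pd.DataFrame({
    "Area": ["A", "B", "C"],
    "Forestland": [-2388.8, 0.0, 12.0],
    "Rice Cultivation": [686.0, -1.0, 0.0],
})

print(count_matching_values(toy, "<", 0))
print(count_matching_values(toy, "==", -1))
```

Returning a dictionary makes the result easy to reuse, for example to decide which columns need cleaning, while the printing version is convenient for a quick look in a notebook.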

In [13]:
# Look for unexpected values: negatives, -1 placeholders and zeros
# (NaN values are counted separately with df.isnull().sum())

# check_for_conditional_values(df, "<", 0)
check_for_conditional_values(df, "==", -1)
# check_for_conditional_values(df, "==", 0)

Checking columns with values == -1
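
One caveat for comparison-based checks like these: a comparison with NaN always evaluates to False, so missing values cannot be found with `==`, `<` or `>`. The sketch below illustrates this on a made-up Series; `values` and the result names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, -1.0, 0.0])

# Equality with NaN never matches, even for the NaN entry itself
nan_by_equality = (values == np.nan).sum()

# isna() is the reliable way to count missing values
nan_by_isna = values.isna().sum()

# NaN is also skipped by ordering comparisons
negatives = (values < 0).sum()

print(nan_by_equality, nan_by_isna, negatives)
```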


## 5. Exploratory Data Analysis (EDA) <a class="anchor" id="chapter5"></a>

## 6. Regression Models <a class="anchor" id="chapter6"></a>

## 7. Conclusion <a class="anchor" id="chapter7"></a>