### Introduction 


Every year, people demand more from nature than it can regenerate. Individuals, communities, and government leaders use Ecological Footprint data to better manage limited resources, reduce economic risk, and improve well-being. The dataset provides Ecological Footprint per capita data for the years 1961-2016. The Ecological Footprint measures how much biologically productive land and water an individual, population, or activity requires to produce all the resources it consumes and to absorb the waste it generates, using prevailing technology and resource management practices. It is expressed in global hectares (gha). Since trade is global, an individual's or country's Footprint includes area from all over the world.

Apart from predicting numeric values, another important supervised machine learning task is classification, which involves predicting classes (either binary or multinomial). In this section, we will cover how to measure the performance of class predictions, linear classification methods, and non-linear/tree-based methods. We'll also focus on strategies for applying a classification model successfully, such as the interpretability-accuracy trade-off and handling class imbalance.

*The National Footprint and Biocapacity Accounts (NFAs) measure the ecological resource use and resource capacity of nations from 1961 to 2016. The calculations in the National Footprint and Biocapacity Accounts are primarily based on United Nations data sets, including those published by the Food and Agriculture Organization, the United Nations Commodity Trade Statistics Database, and the UN Statistics Division, as well as the International Energy Agency. In this project, we will use this data to classify and predict the quality score (`QScore`) of the ecological footprint data for the different countries. The data includes total and per capita national biocapacity, the ecological footprint of consumption, the ecological footprint of production, and total area in hectares.*

Data source: [NFA 2019 Edition on data.world](https://data.world/footprint/nfa-2019-edition)

### Linear classification and Logistic Regression

__Linear classifiers__ 
For simplicity, we define a linear classifier as a binary classifier that separates two classes (a positive and a negative class) with a linear separator: it computes a linear combination of the features and compares the result against a fixed threshold.
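The definition above can be sketched in a few lines of plain Python. This is only an illustration of the decision rule; the function name `linear_classify`, the weights, and the threshold are hypothetical and are not fitted to the NFA data.

```python
def linear_classify(features, weights, bias=0.0, threshold=0.0):
    """Return 1 (positive class) if the weighted sum exceeds the threshold, else 0."""
    # Linear combination of the features: w1*x1 + w2*x2 + ... + bias
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > threshold else 0


# Hypothetical weights for two features (illustration only).
weights = [0.8, -0.5]

print(linear_classify([2.0, 1.0], weights))  # score = 0.8*2.0 - 0.5*1.0 = 1.1
print(linear_classify([0.5, 2.0], weights))  # score = 0.8*0.5 - 0.5*2.0 = -0.6
```

Changing the threshold moves the separator without changing its orientation, which is one way to trade false positives against false negatives.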

__Logistic regression__ is a linear algorithm that can be used for binary or multiclass classification. It is a discriminative classifier that estimates the probability that an instance belongs to a class using an S-shaped curve called the sigmoid function.
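As a minimal sketch of how the sigmoid turns a linear score into a probability, the snippet below uses only the standard library. The helper names `sigmoid` and `predict_proba` and the example weights are assumptions made for illustration; a fitted model would learn the weights from the training data.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)


def predict_proba(features, weights, bias=0.0):
    """Estimated probability of the positive class for one instance."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical weights (illustration only, not fitted to the NFA data).
weights = [0.8, -0.5]
p = predict_proba([2.0, 1.0], weights)
print(f"P(positive) = {p:.3f}")
print("predicted class:", 1 if p >= 0.5 else 0)
```

A score of zero maps to a probability of exactly 0.5, so thresholding the probability at 0.5 is the same as thresholding the linear score at zero.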


In [1]:
import pandas as pd
import numpy as np

> Collecting the data

In [2]:
df = pd.read_csv('NFA 2019 Public_data.csv', low_memory=False)

In [3]:
df.shape

(72186, 12)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72186 entries, 0 to 72185
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         72186 non-null  object 
 1   year            72186 non-null  int64  
 2   country_code    72186 non-null  int64  
 3   record          72186 non-null  object 
 4   crop_land       51714 non-null  float64
 5   grazing_land    51714 non-null  float64
 6   forest_land     51714 non-null  object 
 7   fishing_ground  51713 non-null  float64
 8   built_up_land   51713 non-null  float64
 9   carbon          51713 non-null  float64
 10  total           72177 non-null  float64
 11  QScore          72185 non-null  object 
dtypes: float64(6), int64(2), object(4)
memory usage: 6.6+ MB


In [5]:
df

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
0,Armenia,1992,1,AreaPerCap,1.402924e-01,1.995463e-01,0.097188051,3.688847e-02,2.931995e-02,0.000000e+00,5.032351e-01,3A
1,Armenia,1992,1,AreaTotHA,4.830000e+05,6.870000e+05,334600,1.270000e+05,1.009430e+05,0.000000e+00,1.732543e+06,3A
2,Armenia,1992,1,BiocapPerCap,1.598044e-01,1.352610e-01,0.084003213,1.374213e-02,3.339780e-02,0.000000e+00,4.262086e-01,3A
3,Armenia,1992,1,BiocapTotGHA,5.501762e+05,4.656780e+05,289207.1078,4.731155e+04,1.149823e+05,0.000000e+00,1.467355e+06,3A
4,Armenia,1992,1,EFConsPerCap,3.875102e-01,1.894622e-01,1.26E-06,4.164833e-03,3.339780e-02,1.114093e+00,1.728629e+00,3A
...,...,...,...,...,...,...,...,...,...,...,...,...
72181,World,2016,5001,BiocapTotGHA,3.984702e+09,1.504757e+09,5111762779,1.095445e+09,4.726163e+08,0.000000e+00,1.216928e+10,3A
72182,World,2016,5001,EFConsPerCap,5.336445e-01,1.402092e-01,0.273495416,8.974253e-02,6.329435e-02,1.646235e+00,2.746619e+00,3A
72183,World,2016,5001,EFConsTotGHA,3.984702e+09,1.046937e+09,2042179333,6.701039e+08,4.726163e+08,1.229237e+10,2.050891e+10,3A
72184,World,2016,5001,EFProdPerCap,5.336445e-01,1.402092e-01,0.273495416,8.974253e-02,6.329435e-02,1.646235e+00,2.746619e+00,3A


#### The target variable is QScore, so let's check its class distribution

In [6]:
df.QScore.value_counts()

3A    51481
2A    10576
2B    10096
1B       16
1A       16
Name: QScore, dtype: int64

#### Let's also check for missing data

In [7]:
df.isnull().sum()

country               0
year                  0
country_code          0
record                0
crop_land         20472
grazing_land      20472
forest_land       20472
fishing_ground    20473
built_up_land     20473
carbon            20473
total                 9
QScore                1
dtype: int64

In [8]:
df.dropna(inplace=True)
df.isna().sum()

country           0
year              0
country_code      0
record            0
crop_land         0
grazing_land      0
forest_land       0
fishing_ground    0
built_up_land     0
carbon            0
total             0
QScore            0
dtype: int64

#### Checking the target variable again

In [9]:
df.QScore.value_counts()

3A    51473
2A      224
1A       16
Name: QScore, dtype: int64

#### The class counts have changed

Dropping the rows with missing values removed every 2B and 1B row and reduced 2A from 10576 rows to 224.

For this exercise we will do binary classification, i.e. classification between two classes.

We will use 2A and 3A as the two classes.

In [10]:
# Merge the 16 rows labelled 1A into 2A so that only two classes remain
df['QScore'] = df['QScore'].replace('1A', '2A')
df.QScore.value_counts()

3A    51473
2A      240
Name: QScore, dtype: int64
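With 51473 rows of 3A against 240 rows of 2A, plain accuracy is a misleading score. The short sketch below, which only reuses the counts printed above, shows what a trivial model that always predicts the majority class would achieve. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"3A": 51473, "2A": 240}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# A "classifier" that always predicts the majority class is right
# on every majority row and wrong on every minority row.
baseline_accuracy = counts[majority_class] / total
minority_recall = 0 / counts["2A"]  # it never finds a single 2A row

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"recall on 2A: {minority_recall:.1f}")
```

The baseline scores above 99% accuracy while never detecting the minority class, which is why the classes are rebalanced before a model is trained.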

> Separating the data

In [11]:
# Keep every 2A row and undersample 3A to 350 randomly chosen rows
df_2A = df[df['QScore'] == '2A']
df_3A = df[df['QScore'] == '3A'].sample(350)

In [12]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([df_2A, df_3A])

In [13]:
data.reset_index(drop=True)

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
0,Algeria,2016,4,AreaPerCap,2.072989e-01,8.112722e-01,0.048357265,2.258528e-02,2.998367e-02,0.000000e+00,1.119497e+00,2A
1,Algeria,2016,4,AreaTotHA,8.417600e+06,3.294260e+07,1963600,9.171000e+05,1.217520e+06,0.000000e+00,4.545842e+07,2A
2,Algeria,2016,4,BiocapPerCap,2.021916e-01,2.636077e-01,0.027166736,7.947991e-03,2.924496e-02,0.000000e+00,5.301590e-01,2A
3,Algeria,2016,4,BiocapTotGHA,8.210214e+06,1.070408e+07,1103135.245,3.227369e+05,1.187524e+06,0.000000e+00,2.152769e+07,2A
4,Algeria,2016,4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,1.472910e-02,2.924496e-02,1.391455e+00,2.407316e+00,2A
...,...,...,...,...,...,...,...,...,...,...,...,...
585,Paraguay,2003,169,EFProdTotGHA,8.822898e+06,1.092258e+07,5202430.629,4.485957e+04,5.465968e+05,1.377708e+06,2.691707e+07,3A
586,Mexico,2007,138,EFProdTotGHA,3.977095e+07,2.907084e+07,19612353.41,8.042884e+06,4.728035e+06,1.628276e+08,2.640527e+08,3A
587,Cameroon,2005,32,EFConsPerCap,4.428059e-01,1.257607e-01,0.253316867,5.534297e-02,4.106505e-02,1.446133e-01,1.062905e+00,3A
588,Mozambique,1973,144,BiocapTotGHA,3.682331e+06,2.490601e+07,17405682.46,4.328439e+06,5.312982e+05,0.000000e+00,5.085376e+07,3A


In [14]:
# Rebuild the index and make it 1-based
data.reset_index(drop=True, inplace=True)
data.index = data.index + 1

In [15]:
data

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
1,Algeria,2016,4,AreaPerCap,2.072989e-01,8.112722e-01,0.048357265,2.258528e-02,2.998367e-02,0.000000e+00,1.119497e+00,2A
2,Algeria,2016,4,AreaTotHA,8.417600e+06,3.294260e+07,1963600,9.171000e+05,1.217520e+06,0.000000e+00,4.545842e+07,2A
3,Algeria,2016,4,BiocapPerCap,2.021916e-01,2.636077e-01,0.027166736,7.947991e-03,2.924496e-02,0.000000e+00,5.301590e-01,2A
4,Algeria,2016,4,BiocapTotGHA,8.210214e+06,1.070408e+07,1103135.245,3.227369e+05,1.187524e+06,0.000000e+00,2.152769e+07,2A
5,Algeria,2016,4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,1.472910e-02,2.924496e-02,1.391455e+00,2.407316e+00,2A
...,...,...,...,...,...,...,...,...,...,...,...,...
586,Paraguay,2003,169,EFProdTotGHA,8.822898e+06,1.092258e+07,5202430.629,4.485957e+04,5.465968e+05,1.377708e+06,2.691707e+07,3A
587,Mexico,2007,138,EFProdTotGHA,3.977095e+07,2.907084e+07,19612353.41,8.042884e+06,4.728035e+06,1.628276e+08,2.640527e+08,3A
588,Cameroon,2005,32,EFConsPerCap,4.428059e-01,1.257607e-01,0.253316867,5.534297e-02,4.106505e-02,1.446133e-01,1.062905e+00,3A
589,Mozambique,1973,144,BiocapTotGHA,3.682331e+06,2.490601e+07,17405682.46,4.328439e+06,5.312982e+05,0.000000e+00,5.085376e+07,3A


In [17]:
from sklearn.utils import shuffle

ImportError: No module named 'sklearn.__check_build._check_build'
___________________________________________________________________________
Contents of C:\Users\ADELEKE OLADAPO\anaconda3\lib\site-packages\sklearn\__check_build:
setup.py                  _check_build.cp38-win_amd64.pyd__init__.py
__pycache__
___________________________________________________________________________
It seems that scikit-learn has not been built correctly.

If you have installed scikit-learn from source, please do not forget
to build the package before using it: run `python setup.py install` or
`make` in the source directory.

If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.

In [None]:
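# Shuffle the rows so the 2A and 3A samples are mixed
data = shuffle(data)
# drop=True discards the old index instead of adding it as an extra 'index' column
data.reset_index(drop=True, inplace=True)
data.index = data.index + 1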

In [None]:
data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x = data.drop(columns=['country_code', 'QScore', 'country', 'year'])
y = data.QScore

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    random_state=0,
                                                    test_size=0.3)

In [None]:
y_train.value_counts()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder()
# Fit the encoder on the training data only, then reuse the same mapping on the test data
x_train['record'] = encoder.fit_transform(x_train['record'])
x_test['record'] = encoder.transform(x_test['record'])

In [None]:
x_test.record.value_counts()

In [None]:
x_test

In [None]:
from imblearn.over_sampling import SMOTE

# SMOTE oversamples the minority class with synthetic samples
smote = SMOTE(random_state=1)

In [None]:
# Resample only the training split so the test set keeps its original distribution
ytrain = pd.Series(y_train)
x_balanced, y_balanced = smote.fit_resample(x_train, ytrain)
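SMOTE balances the training set by creating synthetic minority-class samples that lie on the line segment between a minority sample and one of its nearest minority-class neighbours. The snippet below is a simplified, standard-library illustration of that interpolation step only; it is not imblearn's implementation, and the function name `interpolate` and the two example rows are hypothetical.

```python
import random


def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(1)

# Two hypothetical minority-class rows with two features each.
sample = [1.0, 4.0]
neighbor = [3.0, 2.0]

synthetic = interpolate(sample, neighbor, rng)
print(synthetic)
```

Each coordinate of the synthetic point falls between the corresponding coordinates of the two real rows, so new samples stay inside the region already occupied by the minority class.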

In [None]:
x_train

In [None]:
import sys

for i in sys.path:
    print(i)