### Introduction 


Every year, people demand more from nature than it can regenerate. Individuals, communities and government leaders use Ecological Footprint data to better manage limited resources, reduce economic risk, and improve well-being. The dataset provides Ecological Footprint per capita data for the years 1961-2016 in global hectares (gha). The Ecological Footprint measures how much biologically productive land and water an individual, population, or activity requires to produce all the resources it consumes and to absorb the waste it generates, using prevailing technology and resource management practices. Since trade is global, an individual's or country's Footprint tracks area from all over the world.

Apart from predicting numeric values, another important supervised machine learning method is classification, which involves predicting classes (either binary or multinomial). In this section, we will cover how to measure the performance of class predictions, linear classification methods, and non-linear/tree-based methods. We’ll also focus on strategies for applying a classification model successfully, such as the interpretability-accuracy trade-off and class imbalance.

*The National Footprint and Biocapacity Accounts (NFAs) measure the ecological resource use and resource capacity of nations from 1961 to 2016. The calculations in the National Footprint and Biocapacity Accounts are primarily based on United Nations data sets, including those published by the Food and Agriculture Organization, the United Nations Commodity Trade Statistics Database, and the UN Statistics Division, as well as the International Energy Agency. In this project, we will use this data to classify and predict the quality metric (QScore) of the ecological footprint data for the different countries. The data includes total and per capita national biocapacity, the ecological footprint of consumption, the ecological footprint of production and total area in hectares.*

Data Source [here](https://data.world/footprint/nfa-2019-edition)

### Linear classification and Logistic Regression

__Linear classifiers__ 
For simplicity, we define a linear classifier as a binary classifier that separates two classes (a positive and a negative class) with a linear separator: it computes a linear combination of the features and compares the result against a set threshold.
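
The definition above can be sketched in a few lines of plain Python. The weights, bias and threshold below are hypothetical values chosen only for illustration; in practice they are learned from the training data.

```python
# Hypothetical parameters, not learned from the footprint data.
weights = [0.8, -0.5]
bias = 0.1
threshold = 0.0

def linear_score(features):
    """Linear combination of the features plus a bias term."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def classify(features):
    """Return 1 (positive class) if the score exceeds the threshold, else 0."""
    return 1 if linear_score(features) > threshold else 0

print(classify([1.0, 0.5]))  # score = 0.8 - 0.25 + 0.1 = 0.65
print(classify([0.2, 1.0]))  # score = 0.16 - 0.5 + 0.1 = -0.24
```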

__Logistic regression__ is a linear algorithm that can be used for binary or multiclass classification. It is a discriminative classifier that estimates the probability that an instance belongs to a class using an S-shaped curve called the sigmoid function.
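
As a rough illustration of the sigmoid, the sketch below maps a few made-up linear scores to probabilities and then to class labels with a 0.5 cut-off. The scores and the cut-off are assumptions for the example only.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up linear scores (w . x + b) for three instances.
linear_scores = [-4.0, 0.0, 4.0]
probabilities = [sigmoid(z) for z in linear_scores]

# Assign the positive class when the probability is at least 0.5.
predicted_classes = [int(p >= 0.5) for p in probabilities]
print(probabilities, predicted_classes)
```

Large negative scores give probabilities close to 0, large positive scores give probabilities close to 1, and a score of 0 sits exactly on the decision boundary.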


In [1]:
import pandas as pd
import numpy as np

>collecting the data

In [2]:
df = pd.read_csv('NFA 2019 Public_data.csv', low_memory =False)

In [3]:
df.shape

(72186, 12)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72186 entries, 0 to 72185
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         72186 non-null  object 
 1   year            72186 non-null  int64  
 2   country_code    72186 non-null  int64  
 3   record          72186 non-null  object 
 4   crop_land       51714 non-null  float64
 5   grazing_land    51714 non-null  float64
 6   forest_land     51714 non-null  object 
 7   fishing_ground  51713 non-null  float64
 8   built_up_land   51713 non-null  float64
 9   carbon          51713 non-null  float64
 10  total           72177 non-null  float64
 11  QScore          72185 non-null  object 
dtypes: float64(6), int64(2), object(4)
memory usage: 6.6+ MB


In [5]:
df

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
0,Armenia,1992,1,AreaPerCap,1.402924e-01,1.995463e-01,0.097188051,3.688847e-02,2.931995e-02,0.000000e+00,5.032351e-01,3A
1,Armenia,1992,1,AreaTotHA,4.830000e+05,6.870000e+05,334600,1.270000e+05,1.009430e+05,0.000000e+00,1.732543e+06,3A
2,Armenia,1992,1,BiocapPerCap,1.598044e-01,1.352610e-01,0.084003213,1.374213e-02,3.339780e-02,0.000000e+00,4.262086e-01,3A
3,Armenia,1992,1,BiocapTotGHA,5.501762e+05,4.656780e+05,289207.1078,4.731155e+04,1.149823e+05,0.000000e+00,1.467355e+06,3A
4,Armenia,1992,1,EFConsPerCap,3.875102e-01,1.894622e-01,1.26E-06,4.164833e-03,3.339780e-02,1.114093e+00,1.728629e+00,3A
...,...,...,...,...,...,...,...,...,...,...,...,...
72181,World,2016,5001,BiocapTotGHA,3.984702e+09,1.504757e+09,5111762779,1.095445e+09,4.726163e+08,0.000000e+00,1.216928e+10,3A
72182,World,2016,5001,EFConsPerCap,5.336445e-01,1.402092e-01,0.273495416,8.974253e-02,6.329435e-02,1.646235e+00,2.746619e+00,3A
72183,World,2016,5001,EFConsTotGHA,3.984702e+09,1.046937e+09,2042179333,6.701039e+08,4.726163e+08,1.229237e+10,2.050891e+10,3A
72184,World,2016,5001,EFProdPerCap,5.336445e-01,1.402092e-01,0.273495416,8.974253e-02,6.329435e-02,1.646235e+00,2.746619e+00,3A


#### The target variable is QScore, so let's check what it is made of

In [6]:
df.QScore.value_counts()

3A    51481
2A    10576
2B    10096
1A       16
1B       16
Name: QScore, dtype: int64

#### Let's also check the missing data

In [7]:
df.isnull().sum()

country               0
year                  0
country_code          0
record                0
crop_land         20472
grazing_land      20472
forest_land       20472
fishing_ground    20473
built_up_land     20473
carbon            20473
total                 9
QScore                1
dtype: int64

In [8]:
df.dropna(inplace=True)
df.isna().sum()

country           0
year              0
country_code      0
record            0
crop_land         0
grazing_land      0
forest_land       0
fishing_ground    0
built_up_land     0
carbon            0
total             0
QScore            0
dtype: int64

### Checking the target variable again

In [9]:
df.QScore.value_counts()

3A    51473
2A      224
1A       16
Name: QScore, dtype: int64

#### We can see the difference in the values now

Dropping the rows with missing values removed every 2B and 1B record and most of the 2A records. For this exercise we will do a binary classification, i.e. between two classes, for now.

So we will use 2A and 3A as the two classes.

In [10]:
# treat the 1A records as 2A (assign back instead of replacing in place on a column)
df['QScore'] = df['QScore'].replace('1A', '2A')
df.QScore.value_counts()

3A    51473
2A      240
Name: QScore, dtype: int64

>Separating the data

In [11]:
df_2A = df[df.QScore =='2A']
df_3A = df[df['QScore']=='3A'].sample(350)

In [12]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([df_2A, df_3A])

In [13]:
data.reset_index(drop=True)

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
0,Algeria,2016,4,AreaPerCap,2.072989e-01,8.112722e-01,0.048357265,2.258528e-02,2.998367e-02,0.000000e+00,1.119497e+00,2A
1,Algeria,2016,4,AreaTotHA,8.417600e+06,3.294260e+07,1963600,9.171000e+05,1.217520e+06,0.000000e+00,4.545842e+07,2A
2,Algeria,2016,4,BiocapPerCap,2.021916e-01,2.636077e-01,0.027166736,7.947991e-03,2.924496e-02,0.000000e+00,5.301590e-01,2A
3,Algeria,2016,4,BiocapTotGHA,8.210214e+06,1.070408e+07,1103135.245,3.227369e+05,1.187524e+06,0.000000e+00,2.152769e+07,2A
4,Algeria,2016,4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,1.472910e-02,2.924496e-02,1.391455e+00,2.407316e+00,2A
...,...,...,...,...,...,...,...,...,...,...,...,...
585,Gambia,1965,75,EFConsTotGHA,1.900320e+05,1.460567e+05,99036.98759,1.759882e+04,3.237194e+04,2.095853e+04,5.060550e+05,3A
586,Japan,2015,110,AreaTotHA,3.484000e+06,1.012000e+06,24958000,4.403510e+07,2.474530e+06,0.000000e+00,7.596363e+07,3A
587,Romania,1996,183,BiocapPerCap,7.180163e-01,1.065143e-01,1.063636126,8.041483e-02,9.752131e-02,0.000000e+00,2.066103e+00,3A
588,Spain,2014,203,EFConsTotGHA,3.693141e+07,6.800395e+06,9881020.263,1.742582e+07,1.779262e+06,1.024287e+08,1.752466e+08,3A


In [14]:
data.reset_index(drop=True, inplace=True)
data.index=data.index +1

In [15]:
data

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
1,Algeria,2016,4,AreaPerCap,2.072989e-01,8.112722e-01,0.048357265,2.258528e-02,2.998367e-02,0.000000e+00,1.119497e+00,2A
2,Algeria,2016,4,AreaTotHA,8.417600e+06,3.294260e+07,1963600,9.171000e+05,1.217520e+06,0.000000e+00,4.545842e+07,2A
3,Algeria,2016,4,BiocapPerCap,2.021916e-01,2.636077e-01,0.027166736,7.947991e-03,2.924496e-02,0.000000e+00,5.301590e-01,2A
4,Algeria,2016,4,BiocapTotGHA,8.210214e+06,1.070408e+07,1103135.245,3.227369e+05,1.187524e+06,0.000000e+00,2.152769e+07,2A
5,Algeria,2016,4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,1.472910e-02,2.924496e-02,1.391455e+00,2.407316e+00,2A
...,...,...,...,...,...,...,...,...,...,...,...,...
586,Gambia,1965,75,EFConsTotGHA,1.900320e+05,1.460567e+05,99036.98759,1.759882e+04,3.237194e+04,2.095853e+04,5.060550e+05,3A
587,Japan,2015,110,AreaTotHA,3.484000e+06,1.012000e+06,24958000,4.403510e+07,2.474530e+06,0.000000e+00,7.596363e+07,3A
588,Romania,1996,183,BiocapPerCap,7.180163e-01,1.065143e-01,1.063636126,8.041483e-02,9.752131e-02,0.000000e+00,2.066103e+00,3A
589,Spain,2014,203,EFConsTotGHA,3.693141e+07,6.800395e+06,9881020.263,1.742582e+07,1.779262e+06,1.024287e+08,1.752466e+08,3A


In [16]:
from sklearn.utils import shuffle

In [17]:
data = shuffle(data)
data.reset_index(inplace=True)
data.index = data.index+1

In [18]:
data

Unnamed: 0,index,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
1,258,Japan,1967,110,BiocapPerCap,2.416488e-01,1.824299e-02,0.442648678,1.306813e-01,9.355754e-02,0.000000e+00,9.267792e-01,3A
2,463,"Korea, Republic of",2011,117,EFProdTotGHA,6.602709e+06,4.480363e+04,2832996.489,1.778800e+07,2.853351e+06,2.197558e+08,2.498776e+08,3A
3,74,Gabon,2016,74,AreaTotHA,4.950000e+05,4.665000e+06,23200000,4.657200e+06,7.718950e+04,0.000000e+00,3.309439e+07,2A
4,308,Brazil,1995,21,AreaTotHA,6.206094e+07,1.964111e+08,533989500,8.664460e+07,5.352560e+06,0.000000e+00,8.844587e+08,3A
5,133,Kyrgyzstan,2016,113,EFConsPerCap,4.271284e-01,1.977978e-01,0.059306959,5.479879e-03,7.480447e-02,8.907022e-01,1.655220e+00,2A
...,...,...,...,...,...,...,...,...,...,...,...,...,...
586,390,Togo,1997,217,EFProdPerCap,3.498585e-01,7.455842e-02,0.511467665,9.801566e-03,1.836928e-02,6.709183e-02,1.031147e+00,3A
587,356,Myanmar,1980,28,EFProdTotGHA,1.393919e+07,3.126307e+05,8344046.733,9.158921e+05,1.398550e+06,1.745669e+06,2.665598e+07,3A
588,519,Oman,1985,221,AreaTotHA,4.488843e+04,1.016112e+06,2000,5.420600e+06,6.342530e+04,0.000000e+00,6.547025e+06,3A
589,297,"Korea, Republic of",1969,117,EFConsPerCap,4.155237e-01,1.514948e-02,0.15235596,1.373962e-01,6.199689e-02,6.079419e-01,1.390364e+00,3A


In [19]:
from sklearn.model_selection import train_test_split

In [20]:
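# NOTE: x keeps the 'index' column from reset_index; since the 2A rows were
# concatenated first, it tracks the class label and should probably be dropped too.
x = data.drop(columns=['country_code', 'QScore', 'country', 'year'])
y = data.QScore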

In [21]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, random_state=0, test_size=0.3)

In [22]:
y_train.value_counts()

3A    244
2A    169
Name: QScore, dtype: int64

In [23]:
from sklearn.preprocessing import LabelEncoder

In [24]:
encoder = LabelEncoder()
# work on copies so the assignments below do not trigger SettingWithCopyWarning
x_train, x_test = x_train.copy(), x_test.copy()
x_train['record'] = encoder.fit_transform(x_train['record'])
x_test['record'] = encoder.transform(x_test['record'])


In [25]:
x_test.record.value_counts()

2    27
6    26
4    24
1    24
7    23
3    20
5    18
0    15
Name: record, dtype: int64

In [26]:
x_test

Unnamed: 0,index,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total
226,122,1,2.150000e+05,229000.000000,334820,1.356100e+06,6.290070e+04,0.000000e+00,2.197821e+06
15,442,7,5.112596e+06,38138.322530,1871759.644,6.077964e+05,7.860495e+05,1.844748e+07,2.686382e+07
86,538,6,2.351383e-01,0.865862,0.36910777,2.072443e-02,2.336825e-02,4.814289e-03,1.519015e+00
419,544,2,2.797492e-01,0.149253,0.05591369,1.215640e-01,4.780631e-02,0.000000e+00,6.542865e-01
133,239,6,2.017831e+00,0.005768,0.216720935,3.223480e-03,8.893945e-02,1.665767e+00,3.998251e+00
...,...,...,...,...,...,...,...,...,...
347,339,6,1.512432e+00,0.328729,1.222317358,1.372638e-01,6.631180e-02,6.816763e+00,1.008382e+01
370,403,6,4.094529e-01,0.999447,0.203250145,3.659261e-02,4.544675e-02,7.917214e-02,1.773361e+00
141,310,7,1.237290e+04,922.973851,24587.32574,3.114107e+04,9.156912e+03,5.351648e+05,6.133460e+05
534,340,3,1.927591e+07,470756.774400,10234626.14,1.424796e+07,4.227508e+06,0.000000e+00,4.845676e+07


In [28]:
import sklearn
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=1)

In [29]:
ytrain= pd.Series(y_train)
x_balanced, y_balanced = smote.fit_resample(x_train,ytrain)

AttributeError: 'SMOTE' object has no attribute '_validate_data'
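
An `AttributeError` like the one above often points to mismatched imbalanced-learn and scikit-learn versions, although that is an assumption about this environment. As a fallback that needs neither library, the sketch below balances the classes by plain random oversampling with NumPy. This is not SMOTE: it duplicates existing minority rows instead of synthesizing new ones, and the toy arrays are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 8 majority-class rows ("3A") and 2 minority-class rows ("2A").
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array(["3A"] * 8 + ["2A"] * 2)

minority_idx = np.flatnonzero(y == "2A")
majority_idx = np.flatnonzero(y == "3A")

# Draw minority rows with replacement until both classes have the same count.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
balanced_idx = np.concatenate([majority_idx, minority_idx, extra_idx])

X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]
labels, counts = np.unique(y_balanced, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

As with SMOTE, any resampling of this kind should be applied to the training split only, so that the test split keeps the original class distribution.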

In [None]:
x_train

In [None]:
import sys

for i in sys.path:
    print(i)

# Lesson 2

### CROSS VALIDATION
>KFold
- This technique is called K-Fold because the data is split into K groups of (approximately) equal size. If k = 5, a 5-fold cross validation splits the data into k1, k2, k3, k4 and k5. The model is trained on k2 to k5 and evaluated on k1; the process is repeated k times so that each group is used exactly once as the test set.
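
To make the splitting concrete, the sketch below runs `KFold` on ten toy samples (made up for illustration) and prints which row indices land in the training and test part of each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples; only their positions matter here.
X_toy = np.arange(10).reshape(10, 1)

kf_demo = KFold(n_splits=5)  # no shuffling, so folds are consecutive blocks
folds = list(kf_demo.split(X_toy))

for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every sample appears in exactly one test fold and in the training part of the other four.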

In [33]:
from sklearn.model_selection import KFold , cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression

In [63]:
help(cross_val_score)

Help on function cross_val_score in module sklearn.model_selection._validation:

cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
    Evaluate a score by cross-validation
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.
    
    X : array-like
        The data to fit. Can be for example a list, or an array.
    
    y : array-like, optional, default: None
        The target variable to try to predict in the case of
        supervised learning.
    
    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set. Only used in conjunction with a "Group" :term:`cv`
        instance (e.g., :class:`GroupKFold`).
    
    scoring : string, callable or 

In [64]:
help(KFold)

Help on class KFold in module sklearn.model_selection._split:

class KFold(_BaseKFold)
 |  KFold(n_splits=5, shuffle=False, random_state=None)
 |  
 |  K-Folds cross-validator
 |  
 |  Provides train/test indices to split data in train/test sets. Split
 |  dataset into k consecutive folds (without shuffling by default).
 |  
 |  Each fold is then used once as a validation while the k - 1 remaining
 |  folds form the training set.
 |  
 |  Read more in the :ref:`User Guide <cross_validation>`.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int, default=5
 |      Number of folds. Must be at least 2.
 |  
 |      .. versionchanged:: 0.22
 |          ``n_splits`` default value changed from 3 to 5.
 |  
 |  shuffle : boolean, optional
 |      Whether to shuffle the data before splitting into batches.
 |  
 |  random_state : int, RandomState instance or None, optional, default=None
 |      If int, random_state is the seed used by the random number generator;
 |      If RandomState instanc

In [47]:
# forest_land was read as an object (string) column, so convert it to float;
# values that cannot be parsed become 0.0
l = pd.to_numeric(x.forest_land, errors='coerce').fillna(0.0).tolist()

2

In [50]:
x.forest_land= l

In [52]:
# encode the record column, since the model needs numeric features

x.record = encoder.transform(x.record)

In [55]:
log_reg= LogisticRegression()

In [53]:
maxscaler = MinMaxScaler()


norm_df = pd.DataFrame(maxscaler.fit_transform(x), columns=x.columns)

In [56]:
scores = cross_val_score(log_reg, norm_df, y, cv=5, scoring='f1_macro')
scores

array([0.98244048, 0.99124824, 0.99119074, 0.99119074, 0.99119074])
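
In the cell above the scaler is fitted on all rows before cross validation, so each test fold influences the scaling. One possible alternative, sketched here on synthetic data from `make_classification` rather than the footprint data, is to wrap the scaler and the model in a pipeline so that scaling is refitted on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; the real features would be used in practice.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler is fitted inside each fold, on the training part only.
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression())
demo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="f1_macro")
print(demo_scores.mean())
```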

In [57]:
from sklearn.metrics import f1_score

In [59]:
kf = KFold(n_splits=5)
f_scores = []
for train_index,test_index in kf.split(norm_df):
    xtrain,xtest = norm_df.iloc[train_index],norm_df.iloc[test_index]
    ytrain,ytest = y.iloc[train_index],y.iloc[test_index]
    model = LogisticRegression().fit(xtrain,ytrain)
    f_scores.append(f1_score(ytest, model.predict(xtest), pos_label='2A')*100)

In [60]:
f_scores

[97.72727272727273,
 96.84210526315789,
 99.009900990099,
 100.0,
 99.08256880733944]
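
The per-fold values above are F1 scores with `'2A'` as the positive class, scaled to percentages. As a small worked example on made-up labels, F1 can be derived by hand from precision and recall:

```python
# Made-up true and predicted labels, with '2A' as the positive class.
y_true_demo = ["2A", "2A", "2A", "3A", "3A", "3A", "3A", "2A"]
y_pred_demo = ["2A", "2A", "3A", "3A", "3A", "2A", "3A", "2A"]
positive = "2A"

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```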

In [61]:
help(f1_score)

Help on function f1_score in module sklearn.metrics._classification:

f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
    Compute the F1 score, also known as balanced F-score or F-measure
    
    The F1 score can be interpreted as a weighted average of the precision and
    recall, where an F1 score reaches its best value at 1 and worst score at 0.
    The relative contribution of precision and recall to the F1 score are
    equal. The formula for the F1 score is::
    
        F1 = 2 * (precision * recall) / (precision + recall)
    
    In the multi-class and multi-label case, this is the average of
    the F1 score of each class with weighting depending on the ``average``
    parameter.
    
    Read more in the :ref:`User Guide <precision_recall_f_measure_metrics>`.
    
    Parameters
    ----------
    y_true : 1d array-like, or label indicator array / sparse matrix
        Ground truth (correct) target values.
    


In [65]:
model.predict(xtest).shape

(118,)

>StratifiedKFold CV
- Although similar to the technique described above, Stratified K-Fold cross validation ensures that every fold preserves the class proportions of the full dataset, which gives a good representation of the data and avoids imbalanced, biased folds. For example, if there are two target classes t1 and t2 with equal distribution in the data, each fold will also contain t1 and t2 in equal proportion.



In [None]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
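
A quick way to see the stratification at work is to count the classes in each test fold. The sketch below uses made-up labels with a 1:4 ratio of 2A to 3A; the features are placeholders because only the labels drive the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 10 x '2A' and 40 x '3A'.
y_toy = np.array(["2A"] * 10 + ["3A"] * 40)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_counts = []
for train_idx, test_idx in skf_demo.split(X_toy, y_toy):
    n_2a = int((y_toy[test_idx] == "2A").sum())
    n_3a = int((y_toy[test_idx] == "3A").sum())
    fold_counts.append((n_2a, n_3a))

print(fold_counts)
```

Each test fold keeps the same 1:4 ratio as the full label vector, which a plain `KFold` without shuffling would not guarantee.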

In [6]:
import numpy as np

a = np.array([1 for i in range(6)])
b = a

In [7]:
a+= np.array([2 for i in range(6)])

In [8]:
b

array([3, 3, 3, 3, 3, 3])

In [10]:
c = np.array([3 for i in range(7)])
d = c

In [11]:
c = c + np.array([4 for i in range(7)])

In [12]:
d

array([3, 3, 3, 3, 3, 3, 3])

In [13]:
e = d[:4]
e[0]= 22

In [14]:
d

array([22,  3,  3,  3,  3,  3,  3])
