# 3. Spot Check Algorithms

From https://machinelearningmastery.com/process-for-working-through-machine-learning-problems/

I use 10-fold cross-validation in my test harnesses by default. All experiments (algorithm and dataset combinations) are repeated 10 times, and the mean and standard deviation of the accuracy are collected and reported. I also use statistical significance tests to separate meaningful results from noise. Box plots are very useful for summarizing the distribution of accuracy results for each algorithm and dataset pair.

I spot check algorithms, which means loading up a bunch of standard machine learning algorithms into my test harness and performing a formal experiment. I typically run 10-20 standard algorithms from all the major algorithm families across all the transformed and scaled versions of the dataset I have prepared.

The goal of spot checking is to flush out the types of algorithms and dataset combinations that are good at picking out the structure of the problem so that they can be studied in more detail with focused experiments.

More focused experiments with well-performing families of algorithms may be performed in this step, but algorithm tuning is left for the next step.
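
The harness described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic, the two candidate "algorithms" (`predict_mean` and `predict_linear`) are hypothetical stand-ins for real model families, and mean squared error replaces accuracy because the problem in this notebook is a regression.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, rng):
    """Yield (train_idx, test_idx) pairs for one shuffled k-fold split."""
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

def predict_mean(x_train, y_train, x_test):
    """Baseline: always predict the training mean."""
    return np.full(len(x_test), y_train.mean())

def predict_linear(x_train, y_train, x_test):
    """Least-squares straight line fitted with np.polyfit."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return slope * x_test + intercept

def spot_check(models, x, y, n_folds=10, n_repeats=10, seed=0):
    """Return {name: (mean MSE, std MSE)} over repeated k-fold CV."""
    rng = np.random.default_rng(seed)
    scores = {name: [] for name in models}
    for _ in range(n_repeats):
        for train_idx, test_idx in k_fold_indices(len(x), n_folds, rng):
            # Every model sees exactly the same split, so scores are comparable.
            for name, model in models.items():
                pred = model(x[train_idx], y[train_idx], x[test_idx])
                scores[name].append(np.mean((pred - y[test_idx]) ** 2))
    return {name: (np.mean(s), np.std(s)) for name, s in scores.items()}

# Synthetic data (assumption): a noisy linear trend.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

results = spot_check({"mean": predict_mean, "linear": predict_linear}, x, y)
for name, (mean_mse, std_mse) in results.items():
    print(f"{name}: MSE = {mean_mse:.3f} +/- {std_mse:.3f}")
```

The per-fold scores collected in `scores` are also what a box plot or a significance test would be run on.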

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import get_cmap  # matplotlib.cm.get_cmap was removed in Matplotlib 3.9
%matplotlib inline

## Training data

In [None]:
df = pd.read_csv('data/TrainingSet.csv', index_col=0)
df.columns = [year[:4] for year in df.columns][:-3] + [col.replace(' ', '_') for col in df.columns.values[-3:]]

## Submission data

In [None]:
# read the data containing the rows we need to predict
df_submission = pd.read_csv('data/SubmissionRows.csv', index_col=0)

In [None]:
df_submission_in_data = df.loc[df_submission.index]

# What are we trying to achieve?

 * We have 737 indicators from 206 countries with data from 1972 to 2007.
 * We would like to predict what these indicators will be in 2008 and 2012.

A very simplistic way of predicting the future values of these indicators is to fit a simple linear regression to every indicator that has at least two data points between 1972 and 2007, and to reuse the only available value for indicators that have a single data point.

**Let's try to code this simplistic version**

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
def make_prediction(row):
    data = row.loc['1972':'2007']
    nbr_data_points = data.count()
    if nbr_data_points < 2:
        pred_2008 = data.dropna().values
        pred_2012 = pred_2008
    
    else:
        years = data.dropna().index.values.astype(int).reshape(-1, 1)  # np.int was removed in NumPy 1.24
        values = data.dropna().values
        
        # linear regression
        regr = LinearRegression()
        regr.fit(years, values)
        
        # predictions
        pred_2008 = regr.predict(np.array([2008]).reshape(-1, 1))
        pred_2012 = regr.predict(np.array([2012]).reshape(-1, 1))
        
    return pred_2008[0], pred_2012[0]

In [None]:
df_simple_preds = pd.DataFrame(df_submission_in_data.apply(make_prediction, axis=1).tolist(), \
                               index=df_submission_in_data.index, columns=['2008','2012'])

In [None]:
df_simple_preds.head()

In [None]:
def plot_predictions(df_train, df_pred, nbr_rows):
    np.random.seed(3)
    rows_to_plot = np.random.choice(df_train.index.values, nbr_rows, replace=False)
    
    cmap = get_cmap('Set1')
    colors = cmap.colors
        
    fig, ax = plt.subplots(figsize=(16,10))
    for i,j in zip(rows_to_plot, range(nbr_rows)):
        j %= len(colors)  # wrap around when there are more rows than colors
        ax.plot(df_train.loc[i, '1972':'2007'].dropna().index.astype(int), 
                df_train.loc[i, '1972':'2007'].dropna().values, 
                label=df_train.loc[i, 'Country_Name']+ '/' + df_train.loc[i, 'Series_Name'],
                marker='o',
                linewidth=4,
                alpha=0.5,
                color=colors[j])
                
        ax.plot(df_pred.loc[i].index.astype(int), 
                df_pred.loc[i].values,
                marker='s',
                linewidth=4,
                markersize=10,
                color=colors[j])

    plt.legend(loc=2)

In [None]:
plot_predictions(df_submission_in_data, df_simple_preds, 16)

### Plotting one target indicator [Environmental Sustainability (7.8)] for Afghanistan

In [None]:
df_afghanistan_7_8 = df[ (df["Country_Name"] == "Afghanistan") & (df["Series_Code"] == "7.8")]
df_afghanistan_7_8_1972_to_2007 = df_afghanistan_7_8.loc[:, "1972":"2007"]
df_afghanistan_7_8_1972_to_2007.T.plot(marker="o", title="Indicator Environmental Sustainability (7.8) for Afghanistan");

### Plotting all indicators except Environmental Sustainability (7.8) for Afghanistan

In [None]:
df_afghanistan_not_7_8 = df[ (df["Country_Name"] == "Afghanistan") & (df["Series_Code"] != "7.8")]
df_afghanistan_not_7_8_1972_to_2007 = df_afghanistan_not_7_8.loc[:, "1972":"2007"]
df_afghanistan_not_7_8_1972_to_2007.T.plot(marker="o", title="Indicator (all other indicators) for Afghanistan", legend=False);

### Plotting all indicators except Environmental Sustainability (7.8) for Afghanistan from 2001 to 2007

In [None]:
df_afghanistan_not_7_8 = df[ (df["Country_Name"] == "Afghanistan") & (df["Series_Code"] != "7.8")]
df_afghanistan_not_7_8_2001_to_2007 = df_afghanistan_not_7_8.loc[:, "2001":"2007"]
df_afghanistan_not_7_8_2001_to_2007.T.plot(marker="o", title="Indicator (all other indicators) for Afghanistan", legend=False);

### Listing the features most correlated with the target feature [Environmental Sustainability (7.8)]

In [None]:
df_afghanistan = df[ df["Country_Name"] == "Afghanistan" ]

df_2000_2007 = df_afghanistan.loc[:, "2000":"2007"]
df_2000_2007_clean_index = df_2000_2007.count(axis=1) >= 4


data = df_afghanistan[df_2000_2007_clean_index].set_index('Series_Code').loc[:, "2000":"2007"].T

coeff = data.corr().loc["7.8"].abs()
coeff.sort_values(inplace=True, ascending=False)
coeff.iloc[0:20]
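
The ranking in the cell above boils down to three steps: correlate every series with the target, take absolute values, and sort in descending order. Here is a minimal NumPy sketch of those steps on made-up series; the codes `"A"`, `"B"`, `"C"` and all values are hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical yearly values for a target and three candidate series.
target = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
candidates = {
    "A": np.array([2.0, 4.1, 5.9, 8.2, 10.0]),  # rises with the target
    "B": np.array([9.0, 7.2, 4.8, 3.1, 1.0]),   # falls as the target rises
    "C": np.array([3.0, 1.0, 4.0, 1.0, 5.0]),   # weak linear relation
}

# Absolute Pearson correlation of each candidate with the target.
abs_corr = {
    code: abs(np.corrcoef(target, values)[0, 1])
    for code, values in candidates.items()
}

# Sort from most to least correlated.
ranking = sorted(abs_corr, key=abs_corr.get, reverse=True)
print(ranking)
```

Taking the absolute value keeps strongly anti-correlated series such as `"B"` near the top. In the notebook's cell the target series is itself a column of `data`, so it should appear first with a correlation of 1.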

### Plotting the top correlated indicators for Afghanistan between 2000 and 2007

In [None]:
series_code_correlated_to_7_8 = coeff.iloc[0:20].index
df_afghanistan_indicators_correlated_to_7_8 = df_afghanistan[df_afghanistan.Series_Code.isin(series_code_correlated_to_7_8)]
df_afghanistan_indicators_correlated_to_7_8.set_index('Series_Code', inplace=True)
plt.rcParams["figure.figsize"] = (14,7)
df_afghanistan_indicators_correlated_to_7_8.loc[:, "2000":"2007"].T.plot(marker="o", legend=True)
plt.legend(loc=5);

In [None]:
from sklearn.preprocessing import normalize, scale, MinMaxScaler

In [None]:
scaled_df = df_afghanistan_indicators_correlated_to_7_8.loc[:,'1972':'2007'].T
# print(scaled_df.shape)
# display(scaled_df.head(20))
scaled_df = scaled_df.dropna()
# scaled2_norm_df = pd.DataFrame(normalize(scaled_df, axis=0), columns=scaled_df.columns, index=scaled_df.index)
# scaled2_scale_df = pd.DataFrame(scale(scaled_df, axis=0), columns=scaled_df.columns, index=scaled_df.index)
scaled2_MinMax_df = pd.DataFrame(MinMaxScaler().fit_transform(scaled_df), columns=scaled_df.columns, index=scaled_df.index)
# display(scaled_df.head(20))
# scaled2_norm_df.plot(marker="o", legend=True)
# scaled2_scale_df.plot(marker="o", legend=True)
scaled2_MinMax_df.plot(marker="o", legend=True);
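
`MinMaxScaler` rescales each column independently to the range [0, 1] using `(x - min) / (max - min)`. The sketch below reproduces that formula with plain NumPy on hypothetical values; the helper `min_max_scale_columns` and its guard for constant columns are choices made for this illustration only.

```python
import numpy as np

def min_max_scale_columns(values):
    """Rescale each column of a 2-D array to [0, 1]."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Guard against constant columns (zero range) to avoid division by zero.
    col_range[col_range == 0] = 1.0
    return (values - col_min) / col_range

# Hypothetical table: rows are years, columns are two indicators.
table = np.array([[10.0, 200.0],
                  [20.0, 400.0],
                  [30.0, 300.0]])
scaled = min_max_scale_columns(table)
print(scaled)
```

Because the scaling is column-wise, the orientation of the table matters: with years as rows (as in `scaled_df`) each indicator is scaled over time, whereas with years as columns each year is scaled across indicators.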

In [None]:
pred_columns = [str(item) + "_pred" for item in np.array(range(2002,2008))]
true_columns = [str(item) for item in range(2002,2008)]

In [None]:
def make_prediction(row):
    # Label slices are inclusive: end training at 2001 so that 2002 is not in both sets.
    training_data = row.loc['1972':'2001']
    test_data = row.loc['2002':'2007']
    
    nbr_data_points = training_data.count()
    if test_data.count()<6 or training_data.count() < 6 :
        return  [None]*6
    else:
        years = training_data.dropna().index.values.astype(int).reshape(-1, 1)  # np.int was removed in NumPy 1.24
        values = training_data.dropna().values
        
        #linear regression
        regr = LinearRegression()
        regr.fit(years, values)
        
        #predictions
        return regr.predict(np.array(range(2002,2008)).reshape(-1, 1))

In [None]:
%pdb 0
from pdb import set_trace
from sklearn.metrics import mean_squared_error

In [None]:
def count_nas(df):
    return df.isna().sum().sum()

In [None]:
df = df_submission_in_data.dropna(subset=true_columns)
df_simple_preds_true_columns = pd.DataFrame(df.apply(make_prediction, axis=1).tolist(), \
                               index=df.index, columns=true_columns)

In [None]:
df_simple_preds.shape

In [None]:
df_simple_preds_true_columns.shape

In [None]:
scaled_df.shape

In [None]:
scaler = MinMaxScaler()
scaled_df_submission = df_submission_in_data.loc[:,'1972':'2007']
scaled_df_submission = scaled_df_submission.dropna()
scaled2_MinMax_df_submission = pd.DataFrame(scaler.fit_transform(scaled_df_submission), \
                                            columns=scaled_df_submission.columns, index=scaled_df_submission.index)

In [None]:
scaled2_MinMax_df_submission.shape

In [None]:
scaled2_MinMax_df_dropna = scaled2_MinMax_df_submission.dropna(subset=true_columns)
scaled2_MinMax_df_preds = pd.DataFrame(scaled2_MinMax_df_dropna.apply(make_prediction, axis=1).tolist(), \
                               index=scaled2_MinMax_df_dropna.index, columns=true_columns)

In [None]:
scaled2_MinMax_df_true_columns = scaled2_MinMax_df_dropna[true_columns]
scaled2_MinMax_df_true_columns.shape

In [None]:
scaled2_MinMax_df_preds.shape

In [None]:
count_nas(scaled2_MinMax_df_true_columns)

In [None]:
scaled2_MinMax_df_true_columns.head(2)

In [None]:
count_nas(scaled2_MinMax_df_preds)

In [None]:
scaled2_MinMax_df_preds.head(2)

In [None]:
def assert_all_finite(X):
    X = np.asanyarray(X)
    return (X.dtype.char in np.typecodes['AllFloat'] and np.isfinite(X.sum())
            and np.isfinite(X).all())

In [None]:
def validate(y_true, y_pred):
    y_true_df = y_true.copy()
    y_pred_df = y_pred.dropna()
    validate = y_true_df.loc[y_pred_df.index][true_columns]
    assert(assert_all_finite(validate.dropna()))
    assert(assert_all_finite(y_pred_df))
    return mean_squared_error(validate.dropna(), y_pred_df)
# should return the dispersion of the errors as well

print(validate(scaled2_MinMax_df_true_columns, scaled2_MinMax_df_preds))
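
The comment in `validate` asks for the dispersion of the errors as well as their mean. One possible way to get both, sketched here with NumPy on hypothetical arrays, is to compute the squared error per row and report its mean and standard deviation. The helper name `mse_with_dispersion` is made up for this illustration.

```python
import numpy as np

def mse_with_dispersion(y_true, y_pred):
    """Return (overall MSE, std of the per-row MSE)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # One MSE per series (row), averaged over the predicted years (columns).
    per_row_mse = ((y_true - y_pred) ** 2).mean(axis=1)
    return per_row_mse.mean(), per_row_mse.std()

# Hypothetical true values and predictions for two series over three years.
y_true = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
y_pred = np.array([[1.0, 2.0, 3.0],
                   [5.0, 6.0, 7.0]])
mse, spread = mse_with_dispersion(y_true, y_pred)
print(mse, spread)
```

A large spread relative to the mean would suggest that a few series dominate the error, which a single MSE figure hides.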

In [None]:
retained_columns = ['Country_Name', 'Series_Code', 'Series_Name']

In [None]:
df_submission_in_data_dropna = df_submission_in_data.dropna(subset=true_columns)

In [None]:
df_submission_in_data_no_years = df_submission_in_data_dropna[retained_columns]

In [None]:
df_merged = pd.merge(scaled2_MinMax_df_true_columns, df_submission_in_data_no_years, left_index=True, \
                     right_index=True, how='outer', suffixes=('',''))

In [None]:
df_merged.dropna(inplace=True)
df_merged.shape

In [None]:
assert(df_submission_in_data.shape[0] == df_simple_preds.shape[0])

In [None]:
# plot_predictions(df_submission_in_data, df_simple_preds, 16)

In [None]:
assert(df_merged.shape[0] == scaled2_MinMax_df_preds.shape[0])

In [None]:
plot_predictions(df_merged, scaled2_MinMax_df_preds, 8)

## 12/06/2019 Set up polynomial model

In [None]:
from sklearn.model_selection import train_test_split
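# NOTE: `scaled2_norm_df` is only created in a commented-out line earlier in this
# notebook and `submission_codes` is not defined anywhere in it, so the split
# below is left commented out until both are available.
# X = scaled2_norm_df
# y = submission_codes
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)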

In [None]:
pred_columns = [str(item) + "_pred" for item in np.array(range(2002,2008))]
true_columns = [str(item) for item in range(2002,2008)]

In [None]:
def make_prediction(row, model):
    # Label slices are inclusive: end training at 2001 so that 2002 is not in both sets.
    training_data = row.loc['1972':'2001']
    test_data = row.loc['2002':'2007']
    
    nbr_data_points = training_data.count()
    if test_data.count()<6 or training_data.count() < 6 :
        return  [None]*6
    else:
        years = training_data.dropna().index.values.astype(int).reshape(-1, 1)  # np.int was removed in NumPy 1.24
        values = training_data.dropna().values
        
        model.fit(years, values)
        
        #predictions
        return model.predict(np.array(range(2002,2008)).reshape(-1, 1))

### Linear Regression baseline

In [None]:
#linear regression
model = LinearRegression()

In [None]:
df = df_submission_in_data.dropna(subset=true_columns)
df_simple_preds_true_columns = pd.DataFrame(df.apply(make_prediction, args=(model,),axis=1).tolist(), \
                               index=df.index, columns=true_columns)

In [None]:
scaler = MinMaxScaler()
scaled_df_submission = df_submission_in_data.loc[:,'1972':'2007']
scaled_df_submission = scaled_df_submission.dropna()
scaled_df_submission = scaled_df_submission.dropna(subset=true_columns)
scaled_df_submission = pd.DataFrame(scaler.fit_transform(scaled_df_submission), \
                                    columns=scaled_df_submission.columns, index=scaled_df_submission.index)

In [None]:
scaled_df_submission_preds = pd.DataFrame(scaled_df_submission.apply(make_prediction, args=(model,),axis=1).tolist(),\
                               index=scaled_df_submission.index, columns=true_columns)

In [None]:
print(validate(scaled_df_submission, scaled_df_submission_preds))

In [None]:
df_submission_in_data_dropna = df_submission_in_data.dropna(subset=true_columns)
df_submission_in_data_no_years = df_submission_in_data_dropna[retained_columns]
df_merged = pd.merge(scaled_df_submission, df_submission_in_data_no_years, left_index=True, right_index=True, \
                     how='outer', suffixes=('',''))
df_merged.dropna(subset=true_columns,inplace=True)

In [None]:
# df_merged = pd.merge(scaled_df_submission, df_submission_in_data_no_years, left_index=True, right_index=True, how='outer', suffixes=('',''))
plot_predictions(df_merged, scaled_df_submission_preds, 8)

### Polynomial regression baseline

In [None]:
pred_columns = [str(item) + "_pred" for item in np.array(range(2002,2008))]
true_columns = [str(item) for item in range(2002,2008)]

In [None]:
def make_prediction(row, model):
    # Label slices are inclusive: end training at 2001 so that 2002 is not in both sets.
    training_data = row.loc['1972':'2001']
    test_data = row.loc['2002':'2007']
    
    nbr_data_points = training_data.count()
    if test_data.count()<6 or training_data.count() < 6 :
        return  [None]*6
    else:
        years = training_data.dropna().index.values.astype(int).reshape(-1, 1)  # np.int was removed in NumPy 1.24
        values = training_data.dropna().values
        
        model.fit(years, values)
        
        #predictions
        return model.predict(np.array(range(2002,2008)).reshape(-1, 1))

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

In [None]:
model = LinearRegression()

In [None]:
df = df_submission_in_data.dropna(subset=true_columns)
# df.loc[:,'1972':'2007'].head()
df=df.loc[:,'2005':'2007']
df=df.iloc[0:2,:]
display(df.head())
transformer = PolynomialFeatures(degree=1)
df_poly=transformer.fit_transform(df.T)
print(df_poly)
transformer = PolynomialFeatures(degree=2,interaction_only=False)
df_poly=transformer.fit_transform(df.T)
print(df_poly)
transformer = PolynomialFeatures(degree=2,interaction_only=True)
df_poly=transformer.fit_transform(df.T)
print(df_poly)
# df_simple_preds_true_columns = pd.DataFrame(df.apply(make_prediction, args=(model,),axis=1).tolist(),\
#                                index=df.index, columns=true_columns)
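
The three `print` calls above show how `PolynomialFeatures` expands its input. For two input features `a` and `b`, degree 2 yields the columns `1, a, b, a^2, a*b, b^2`, and `interaction_only=True` drops the pure powers, leaving `1, a, b, a*b`. The NumPy sketch below builds the same columns by hand for hypothetical values, which may help when reading the printed arrays.

```python
import numpy as np

# Hypothetical input: three samples with two features a and b.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
a, b = X[:, 0], X[:, 1]
ones = np.ones_like(a)

# Degree-2 expansion: bias, linear terms, then all degree-2 products.
degree_2 = np.column_stack([ones, a, b, a**2, a * b, b**2])

# Interaction-only expansion: the squared terms are left out.
interaction_only = np.column_stack([ones, a, b, a * b])

print(degree_2)
print(interaction_only)
```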

In [None]:
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
# tried degree 3 but did not look better
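
Rather than judging the degree by eye, one option is to compare degrees on the held-out years. The sketch below does this with `np.polyfit` on a hypothetical noisy series; the data, the noise level, and the choice to centre the years are all assumptions made for illustration, and the numbers it prints say nothing about the real indicators.

```python
import numpy as np

# Hypothetical indicator: a linear trend with noise, observed from 1972 to 2007.
rng = np.random.default_rng(0)
years = np.arange(1972, 2008)
values = 0.5 * (years - 1972) + rng.normal(scale=1.0, size=years.size)

train = years <= 2001  # fit on 1972-2001
test = ~train          # evaluate on 2002-2007

# Centre the years so that higher-degree fits stay well conditioned.
offset = years[train].mean()

holdout_mse = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(years[train] - offset, values[train], deg=degree)
    pred = np.polyval(coeffs, years[test] - offset)
    holdout_mse[degree] = np.mean((pred - values[test]) ** 2)

for degree, mse in holdout_mse.items():
    print(f"degree {degree}: hold-out MSE = {mse:.3f}")
```

Higher degrees always fit the training years at least as well, so only the hold-out error is informative when choosing between them.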

In [None]:
scaler = MinMaxScaler()
scaled_df_submission = df_submission_in_data.loc[:,'1972':'2007']
scaled_df_submission = scaled_df_submission.dropna()
scaled_df_submission = scaled_df_submission.dropna(subset=true_columns)
scaled_df_submission = pd.DataFrame(scaler.fit_transform(scaled_df_submission), \
                                    columns=scaled_df_submission.columns, index=scaled_df_submission.index)

In [None]:
scaled_df_submission_preds = pd.DataFrame(scaled_df_submission.apply(make_prediction, args=(model,),axis=1).tolist(),\
                               index=scaled_df_submission.index, columns=true_columns)

In [None]:
print(validate(scaled_df_submission, scaled_df_submission_preds))

In [None]:
df_submission_in_data_dropna = df_submission_in_data.dropna(subset=true_columns)
df_submission_in_data_no_years = df_submission_in_data_dropna[retained_columns]
df_merged = pd.merge(scaled_df_submission, df_submission_in_data_no_years, left_index=True, \
                     right_index=True, how='outer', suffixes=('',''))
df_merged.dropna(subset=true_columns,inplace=True)

In [None]:
# df_merged = pd.merge(scaled_df_submission, df_submission_in_data_no_years, left_index=True, right_index=True, how='outer', suffixes=('',''))
plot_predictions(df_merged, scaled_df_submission_preds, 8)