## Description 

Aluminum (Al), gallium (Ga), and indium (In) sesquioxides are some of the most promising transparent conductors because they combine large bandgap energies, which lead to optical transparency over the visible range, with high conductivities.

These alloys are described by the formula $(\mathrm{Al}_x \mathrm{Ga}_y \mathrm{In}_z)_{2N}\mathrm{O}_{3N}$, where x, y, and z can vary but are limited by the constraint x + y + z = 1. The total number of atoms in the unit cell, Ntotal = 2N + 3N (where N is an integer), is typically between 5 and 100.

However, the main limitation in the design of compounds is that identification and discovery of novel materials for targeted applications requires an examination of enormous compositional and configurational degrees of freedom (i.e., many combinations of x, y, and z).
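To make the composition formula concrete, here is a minimal sketch that splits a unit cell into atom counts. The helper name `atom_counts` and the example numbers are illustrative assumptions, not part of the competition data.

```python
def atom_counts(n_total, x, y, z, tol=1e-6):
    """Split a unit cell of n_total atoms into Al, Ga, In and O counts.

    Hypothetical helper: assumes n_total = 2N + 3N and x + y + z = 1.
    """
    if abs(x + y + z - 1.0) > tol:
        raise ValueError("x + y + z must equal 1")
    if n_total % 5 != 0:
        raise ValueError("n_total must be a multiple of 5 (2N cations + 3N oxygens)")
    n = n_total // 5
    n_cations, n_oxygen = 2 * n, 3 * n
    return {
        "N": n,
        "Al": x * n_cations,
        "Ga": y * n_cations,
        "In": z * n_cations,
        "O": n_oxygen,
    }


# Example: an 80-atom cell with x = 0.625, y = 0.375, z = 0
counts = atom_counts(80, 0.625, 0.375, 0.0)
print(counts)
```

With 80 atoms, N = 16, so there are 32 cation sites and 48 oxygen atoms; the fractions x, y, z only redistribute the 32 cation sites.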

The following information has been included:

 * Spacegroup (a label identifying the symmetry of the material)
 * Total number of Al, Ga, In and O atoms in the unit cell (Ntotal)
 * Relative compositions of Al, Ga, and In (x, y, z)
 * Lattice vectors: lv1, lv2, lv3 (lengths given in units of angstroms, 10^−10 meters)
 * Lattice angles: α, β, γ (angles in degrees between 0° and 360°)
 
The task for this competition is to predict two target properties:

  1. Formation energy (an important indicator of the stability of a material)
  2. Bandgap energy (an important property for optoelectronic applications)
  
Since both targets are continuous variables, this is a supervised regression problem.

# Import libraries and load data

We will train and tune several regression models with GridSearchCV and then pick the best-performing regressor.

In [None]:
# import libraries and Load data  
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.metrics import r2_score, mean_squared_error  #, mean_squared_log_error, mean_absolute_error

from sklearn.linear_model import LinearRegression, Ridge,  RANSACRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, BaggingRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost.sklearn import XGBRegressor
#from scipy.stats import randint
#import scipy.stats as st

# load data
train_data = pd.read_csv('../input/nomad2018-predict-transparent-conductors/train.csv')
test_data = pd.read_csv('../input/nomad2018-predict-transparent-conductors/test.csv')
train_data.head()


In [None]:
train_data.shape

In [None]:
test_data.head()

In [None]:
test_data.shape

In [None]:
train_data.columns[(train_data == 0).all()]

From the output of head() and shape we can see that train_data and test_data have the same features (columns), except that the two target features 'formation_energy_ev_natom' and 'bandgap_energy_ev' are missing in test_data. Our goal is to predict these target columns for test_data.

train_data and test_data have 2400 and 600 records respectively.
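The column comparison can also be done programmatically rather than by eye. The sketch below uses two tiny made-up frames (`toy_train`, `toy_test`, with invented values) in place of the real CSV files.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv; the values are invented.
toy_train = pd.DataFrame({
    "id": [1, 2],
    "spacegroup": [33, 194],
    "formation_energy_ev_natom": [0.07, 0.25],
    "bandgap_energy_ev": [3.4, 2.9],
})
toy_test = pd.DataFrame({"id": [1, 2], "spacegroup": [12, 206]})

# Columns present in one frame but not in the other
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
only_in_test = sorted(set(toy_test.columns) - set(toy_train.columns))
print("only in train:", only_in_train)
print("only in test :", only_in_test)
```

On the real data the same two set differences should list only the two target columns for the training set and nothing for the test set.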

# Let's begin programming  

In brief : 

    As we observed, there are not many columns, so there are few irrelevant features to reduce or take out. 
    At an initial glance we can say :
           
           1. All features are numeric.
           2. The target features are also continuous numeric values, so this is a regression problem.
           3. The two features that can be removed before prediction are 'id' and 'number_of_total_atoms'.
    
    More exploratory data analysis is possible; I will do that in another post.
    As mentioned in "https://www.kaggle.com/c/nomad2018-predict-transparent-conductors", many features need domain knowledge and are provided in the data set for the purpose of data mining. 
    
    In my understanding of the features, 'id' is a simple series of numbers that identifies each row. 'number_of_total_atoms' is a combination of 'percent_atom_al', 'percent_atom_ga', and 'percent_atom_in', which makes it less important for prediction. I have little clarity on the lattice vectors and lattice angles. I believe they depend on the constituents (Al, Ga and In with oxygen) and on the process followed to mix these elements in the required quantities, which is probably what the different 'spacegroup' values describe, such that the new transparent conductor acquires a specific lattice structure (vectors and angles).
    
    The code below gives some more information about each feature. For example, 'spacegroup' takes 6 distinct values, and 'number_of_total_atoms' can similarly be grouped into 6 categories. The other features have many distinct values, so I do not think they should be treated as categories. 
     

In [None]:
unique_values_distribution = []
def unique_col_values(df):
    for column in df:
        unique_values_distribution.append ((df[column].name, len(df[column].unique()), df[column].dtype ))
        
unique_col_values(train_data)

columns_heading  = ['Header Name','Unique Count','Data Type']

data_distribution = pd.DataFrame.from_records(unique_values_distribution, columns=columns_heading)
data_distribution

In [None]:
train_data["spacegroup"].unique()

In [None]:
train_data["number_of_total_atoms"].unique()

In [None]:
#correlation matrix
corrmat = train_data.corr()
plt.figure(figsize=(10, 10))
sns.heatmap(corrmat, cmap='viridis');

The correlation heat map shows almost all features are negatively correlated. Let's filter it further. 

In [None]:
#correlation matrix
corrmat = train_data.corr()

plt.figure(figsize=(12, 12))

sns.heatmap(corrmat[(corrmat >= 0.5) | (corrmat <= -0.4)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 11}, square=True);

From the correlation matrix we can observe that the features are not strongly positively correlated, except to some extent 'lattice_angle_beta_degree' with 'lattice_vector_1_ang' and 'percent_atom_al' with 'bandgap_energy_ev'.
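A heat map is easy to misread, so it can help to list the strongly correlated pairs explicitly. The sketch below does this on synthetic columns `a`, `b`, `c` (assumed names and data, not the competition features), reusing the 0.5 cut-off from the filtered heat map.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),  # built to track a closely
    "c": rng.normal(size=200),                        # independent noise
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs whose absolute correlation reaches the threshold
strong = pairs[pairs.abs() >= 0.5]
print(strong)
```

Applied to `train_data.corr()`, the same steps would give a short table of feature pairs instead of a colour grid.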

The description at "https://www.kaggle.com/c/nomad2018-predict-transparent-conductors" says that the fractions of Al, Ga and In sum to 1.
Let's check the sum of these fractions. 
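Because the fractions are stored as floating-point numbers, an exact comparison with 1 can fail for rows that are correct up to rounding. A tolerance-based check is sketched below on a toy frame; the values and the tolerance `atol=1e-3` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows: the first two sum to 1 (up to rounding), the third does not.
toy = pd.DataFrame({
    "percent_atom_al": [0.625, 0.1, 0.3],
    "percent_atom_ga": [0.375, 0.2, 0.3],
    "percent_atom_in": [0.0, 0.7, 0.3],
})

total = toy[["percent_atom_al", "percent_atom_ga", "percent_atom_in"]].sum(axis=1)

# True where the row sum is within the tolerance of 1.0
sums_to_one = np.isclose(total.to_numpy(), 1.0, atol=1e-3)
print(sums_to_one)
print("rows violating the constraint:", int((~sums_to_one).sum()))
```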

In [None]:
(train_data["percent_atom_al"] + train_data["percent_atom_ga"] +train_data["percent_atom_in"]).unique()

In [None]:
print ((train_data["percent_atom_al"] + train_data["percent_atom_ga"]).where (train_data["percent_atom_in"] ==0.0 ).count())
print ((train_data["percent_atom_al"] + train_data["percent_atom_in"]).where (train_data["percent_atom_ga"] ==0.0 ).count())
print ((train_data["percent_atom_ga"] + train_data["percent_atom_in"]).where (train_data["percent_atom_al"] ==0.0 ).count())


In [None]:
# Create a total combined element variables that actually constitutes the transparent conductor.
train_data["total_al_ga"] = (train_data["percent_atom_al"] + train_data["percent_atom_ga"]).where (train_data["percent_atom_in"] ==0.0 )
train_data["total_al_in"] = (train_data["percent_atom_al"] + train_data["percent_atom_in"]).where (train_data["percent_atom_ga"] ==0.0 )
train_data["total_ga_in"] = (train_data["percent_atom_ga"] + train_data["percent_atom_in"]).where (train_data["percent_atom_al"] ==0.0 )

test_data["total_al_ga"] = (test_data["percent_atom_al"] + test_data["percent_atom_ga"]).where (test_data["percent_atom_in"] ==0.0 )
test_data["total_al_in"] = (test_data["percent_atom_al"] + test_data["percent_atom_in"]).where (test_data["percent_atom_ga"] ==0.0 )
test_data["total_ga_in"] = (test_data["percent_atom_ga"] + test_data["percent_atom_in"]).where (test_data["percent_atom_al"] ==0.0 )


In [None]:
print (train_data["total_al_ga"].count())
print (train_data["total_al_in"].count())
print (train_data["total_ga_in"].count())

In [None]:
train_data[['percent_atom_al','percent_atom_ga', 'percent_atom_in', 'total_al_ga', 'total_al_in',
       'total_ga_in']].head(10)

In [None]:
train_data.fillna(0, inplace = True )
test_data.fillna(0, inplace = True )

In [None]:
train_data[['percent_atom_al','percent_atom_ga', 'percent_atom_in', 'total_al_ga', 'total_al_in',
       'total_ga_in']].head(10)

After filling the missing values with 0, some rows have 0 in all three pair columns: these materials contain all three elements, as we can see in rows 6 and 8.
Let's create a new column for the rows that contain all three atoms.
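An alternative way to express the same idea is a single label that records which cations are present in each row. This is only a sketch: the column `cation_mix` is a hypothetical name and the toy fractions are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "percent_atom_al": [0.625, 0.0, 0.25, 1.0],
    "percent_atom_ga": [0.375, 0.5, 0.25, 0.0],
    "percent_atom_in": [0.0, 0.5, 0.5, 0.0],
})

names = ["al", "ga", "in"]
cols = [f"percent_atom_{n}" for n in names]

# Join the names of all cations with a non-zero fraction, row by row
toy["cation_mix"] = [
    "_".join(n for n, frac in zip(names, row) if frac > 0)
    for row in toy[cols].to_numpy()
]
print(toy)
```

Unlike the pairwise columns, one label also distinguishes single-cation rows (for example `al`), which would otherwise match two pair conditions at once.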

In [None]:
train_data["total_al_ga_in"] = ( train_data["percent_atom_al"] + train_data["percent_atom_ga"] +train_data["percent_atom_in"] ).where ((train_data["total_al_ga"] + train_data["total_al_in"] + train_data["total_ga_in"]) ==0.0 )

test_data["total_al_ga_in"] = ( test_data["percent_atom_al"] + test_data["percent_atom_ga"] +test_data["percent_atom_in"] ).where ((test_data["total_al_ga"] + test_data["total_al_in"] + test_data["total_ga_in"]) ==0.0 )


In [None]:
train_data[['percent_atom_al','percent_atom_ga', 'percent_atom_in', 'total_al_ga', 'total_al_in',
       'total_ga_in',"total_al_ga_in"]].head(10)

In [None]:
train_data.fillna(0, inplace = True )
test_data.fillna(0, inplace = True )

In [None]:
train_data[['percent_atom_al','percent_atom_ga', 'percent_atom_in', 'total_al_ga', 'total_al_in',
       'total_ga_in',"total_al_ga_in"]].head(10)

In [None]:
train_data.describe().T


### The basic overall logic :

###### A. Identify and Select important features for modeling
###### B. train and test various regressor on the selected features.

  1. Split selected features into train and test data.
  2. Apply GridSearchCV and Pipeline, as this functionality reduces coding and makes the code simpler and more modular. It also lets us run several regressors at once together with cross-validation, hyperparameter grids, scoring and more.
  3. Apply the trained model to the test split from step 1.
  4. Calculate and collect the 'Mean Square Error' (MSE), 'R Squared' and 'Root Mean Square Log Error' (RMSLE) for all the trained models.
  5. Compare the above error scores and select the best model.
  6. Using this selected model get the predicted output for formation and bandgap energy for the test data set.
  7. Finally, create the submission.csv file with 'id', 'formation_energy_ev_natom'and 'bandgap_energy_ev'.
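Steps 1 to 5 can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the notebook's actual run: the feature matrix, the target formula and the small grid are made up, and only one regressor is tried.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumption: 4 features, smooth target plus noise)
rng = np.random.default_rng(101)
X = rng.uniform(size=(200, 4))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.05, size=200)

# Step 1: hold out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=101)

# Step 2: scaler + regressor in one pipeline, tuned by cross-validated grid search
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=101))
grid = {
    "randomforestregressor__n_estimators": [20, 40],
    "randomforestregressor__max_depth": [3, 5],
}
search = GridSearchCV(pipe, grid, cv=3)
search.fit(X_tr, y_tr)

# Steps 3 and 4: predict on the held-out rows and score the predictions
pred = search.predict(X_te)
mse = mean_squared_error(y_te, pred)
r2 = r2_score(y_te, pred)
print(search.best_params_, mse, r2)
```

Repeating the fit for several regressors and comparing the collected scores gives step 5.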


### Data Cleanup

As mentioned, let's remove 'id' and 'number_of_total_atoms' from train_data.

In [None]:
# 1. define the columns for train_data

train_data = train_data[[ 'spacegroup', #'number_of_total_atoms', 
                         'percent_atom_al', 'percent_atom_ga',    'percent_atom_in', 
                         'lattice_vector_1_ang',     'lattice_vector_2_ang','lattice_vector_3_ang',
                         'lattice_angle_alpha_degree','lattice_angle_beta_degree','lattice_angle_gamma_degree',
                         'formation_energy_ev_natom','bandgap_energy_ev','total_al_ga', 'total_al_in',
       'total_ga_in',"total_al_ga_in"]]

train_data.columns = [ 'spacegroup', #'number_of_total_atoms',                        
                       'percent_atom_al', 'percent_atom_ga',     'percent_atom_in',  
                       'lattice_vector_1_ang',     'lattice_vector_2_ang', 'lattice_vector_3_ang', 
                       'lattice_angle_alpha_degree', 'lattice_angle_beta_degree', 'lattice_angle_gamma_degree', 
                       'formation_energy_ev_natom','bandgap_energy_ev','total_al_ga', 'total_al_in',
       'total_ga_in',"total_al_ga_in"]



### Data Slicing

Split the training data into train and test data for formation energy and bandgap energy.
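Since both targets share the same feature matrix, the split can also be done in a single call: `train_test_split` accepts several arrays and splits them along the same rows. The sketch below uses small made-up arrays and checks that the result matches a separate call with the same `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i holds (2i, 2i+1); the targets are simple functions of i
X = np.arange(40).reshape(20, 2)
y_form = np.arange(20) * 0.01
y_band = np.arange(20) * 0.1

# One call, three arrays: every output uses the same row selection
X_tr, X_te, f_tr, f_te, b_tr, b_te = train_test_split(
    X, y_form, y_band, test_size=0.25, random_state=101
)

# A separate call with the same random_state selects the same rows
X_tr2, X_te2, b_tr2, b_te2 = train_test_split(
    X, y_band, test_size=0.25, random_state=101
)

row_ids = X_te[:, 0] // 2  # recover the original row index of each test row
print(row_ids, f_te, b_te)
```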

In [None]:
# 2. Separate the target from train_data and split the train_data into training and testing data.
X_train = train_data.drop([ "formation_energy_ev_natom", "bandgap_energy_ev"], axis = 1)

Y_formation_energy = train_data['formation_energy_ev_natom']
Y_bandgap_energy   = train_data['bandgap_energy_ev']

# 
fX_train_data, fX_test_data, fy_train_target, fy_test_target  = train_test_split(X_train, Y_formation_energy, 
                                                                                 test_size=0.25, random_state=101)
bX_train_data, bX_test_data, by_train_target, by_test_target  = train_test_split(X_train, Y_bandgap_energy, 
                                                                                 test_size=0.25, random_state=101)

### Data Training

In [None]:
lrg = LinearRegression()
svr = SVR()
#rrg = RANSACRegressor()
#rid = Ridge()
dtr = DecisionTreeRegressor()
rfr = RandomForestRegressor()
gbr = GradientBoostingRegressor()
abr = AdaBoostRegressor()        
bgr = BaggingRegressor()
etr = ExtraTreesRegressor()
xgr = XGBRegressor(nthread=-1)


regressors = {  'DTR' : dtr, 'SVR': svr,  'ABR' : abr, 'XGR' : xgr,
                'BGR' : bgr, 'RFR' : rfr, 'ETR': etr, 'GBR' : gbr, 
                'DTRD' : dtr, 'SVRD': svr,  'ABRD' : abr, 'XGRD' : xgr,
                'BGRD' : bgr, 'RFRD' : rfr, 'ETRD': etr, 'GBRD' : gbr } 

The function "predict_evaluate" below does the following:

 * Evaluates the regressors using a pipeline, the hyperparameter grids and GridSearchCV
 * Fits each grid search on the training split from "train_test_split" (it is called once for "formation energy" and once for "bandgap energy")
 * Predicts on the test split from "train_test_split"
 * Collects MSE, R-Squared and Root Mean Square Log Error on the test split for every regressor, for comparison
 * Returns the test error scores and the fitted GridSearchCV objects; the best regressor is selected from these afterwards

The score tables built later use labels = ['Regr', 'mean square error', 'R Squared', 'Root Mean Sq Log Error'].
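The hyperparameter grids rely on a naming rule: `make_pipeline` names each step after its lowercased class name, and a grid key is written as `<step name>__<parameter>`. The short sketch below shows how to inspect those names; the choice of DecisionTreeRegressor here is just an example.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

pipe = make_pipeline(StandardScaler(), DecisionTreeRegressor())

# Step names generated by make_pipeline
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Any key accepted in a GridSearchCV grid appears in get_params()
all_params = pipe.get_params()
print("decisiontreeregressor__max_depth" in all_params)
```

Checking a key against `pipe.get_params()` before running a long grid search is a cheap way to catch a misspelled parameter name.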

In [None]:
param ={}
random_state = 101
def hyper_parameters(var):
    
    if var == 'SVR':
        param = { 'svr__gamma': ['auto'],                                  #[0.0001, 0.001, 0.005, 0.01, 0.1]    
                  'svr__epsilon': [0.1],      
                  'svr__tol': [0.001],       
                  'svr__cache_size': [200,250,300,500] 
                }
   
    
    elif var == 'DTR':
        param = { 'decisiontreeregressor__criterion': ['mse','mae'],
                  'decisiontreeregressor__max_depth': [7],
                  'decisiontreeregressor__max_features': ['auto', 'sqrt', 'log2'],   
                  'decisiontreeregressor__max_leaf_nodes': [200] ,
                  'decisiontreeregressor__min_samples_split':  [20],
                  'decisiontreeregressor__min_samples_leaf': [7, 10,50 ],
                  'decisiontreeregressor__random_state': [random_state]
                } 

    elif var == 'RFR':
        param = {'randomforestregressor__max_features' : ['auto'],
                 'randomforestregressor__max_depth': [7],
                 'randomforestregressor__n_estimators': [90,100],
                 'randomforestregressor__min_samples_split':  [6,7,8,9],
                 'randomforestregressor__random_state' :[random_state]
                }
        
    elif var == 'GBR':
        param = {'gradientboostingregressor__n_estimators': [90],
                 'gradientboostingregressor__learning_rate': [0.1],
                 'gradientboostingregressor__max_depth': [3,7],
                 #'gradientboostingregressor__loss': ['ls']
                 'gradientboostingregressor__random_state' :[random_state]
                } 
        
    elif var == 'ABR':
        param = {  'adaboostregressor__random_state': [random_state],  
                  #'adaboostregressor__base_estimator': [None],   
                   'adaboostregressor__n_estimators': [160,170,180,190,200] , 
                   'adaboostregressor__loss': ['exponential'],   #['linear', 'square', 'exponential']  
                   'adaboostregressor__learning_rate': [0.1] 
                    }
    elif var == 'BGR':
        param = { 'baggingregressor__n_estimators': [50,51,52], 
                  #'baggingregressor__max_features': [9], #[7,8,9],
                  'baggingregressor__random_state': [random_state],
                  'baggingregressor__max_samples': [300]
                 }
    elif var == 'ETR':
        param =  { 'extratreesregressor__random_state': [random_state],   
                   'extratreesregressor__criterion': ['mse'],
                   'extratreesregressor__max_features': ['auto', 'sqrt', 'log2'], 
                   'extratreesregressor__n_estimators':[70,80,90]
                     }    
    elif var == 'XGR':     
        param = { 'xgbregressor__max_depth': [3],
                  'xgbregressor__learning_rate': [0.1],
                  'xgbregressor__n_estimators': [80],
                  #'xgbregressor__n_jobs': [dep]
                  'xgbregressor__reg_lambda': [0.5],
                  'xgbregressor__max_delta_step': [0.3],
                  #'xgbregressor__min_child_weight': [1,2],
                 'xgbregressor__random_state':  [random_state]
                 }
    # regressor with default parameters   
    elif var == 'SVRD':
            param = { }
    #elif var == 'RIDD':
    #        param = { }
    elif var == 'DTRD':
            param = { }
    elif var == 'RFRD':
            param = { }
    elif var == 'GBRD':
            param = { }
    elif var == 'ABRD':
            param = { }
    elif var == 'BGRD':
            param = { }
    elif var == 'ETRD':
            param =  { }
    elif var == 'XGRD':     
            param = { }
       
    return param

In [None]:
import  time
def root_mean_squared_log_error(h, y): 
    """
    Compute the Root Mean Squared Log Error for hypothesis h and targets y
    Args:
        h - numpy array containing predictions with shape (n_samples, n_targets)
        y - numpy array containing targets with shape (n_samples, n_targets)
    """
    return np.sqrt(np.square(np.log(h + 1) - np.log(y + 1)).mean())


def collect_error_score(target, prediction):
    meansquare_error = mean_squared_error (target, prediction)                 # Mean Squared Error  
    r2square_error = r2_score(target, prediction)                              # R Squared  
    rmslog_error = root_mean_squared_log_error(prediction, target)             # Root Mean Square Log Error  
    #meanabsolute_error = mean_absolute_error (target, prediction)              # Absolute Mean Error 
    #msle = mean_squared_log_error(target, prediction)
    
    return ( meansquare_error, r2square_error, rmslog_error)
    
########    
def predict_evaluate(train_feature, train_target, test_feature, test_target):
    
    train_reg = []           # to collect the trained regressors
    test_error_scores = []   # to collect the error scores
    
    print ("==== Start training  Regressors ====")
    t = time.time()
    for i, model in regressors.items():
       
        pipe = make_pipeline(preprocessing.StandardScaler(), model)
        hyperparameters = hyper_parameters(i)
        trainedmodel = GridSearchCV(pipe, hyperparameters, cv=15)
    
        # Fit and predict train data
        #---------------------------
        trainedmodel.fit(train_feature, train_target)
        
        print (i,' trained best score :: ',trainedmodel.best_score_)
        print (":::::::::::::::::::::::::::")
        
        #print (i,' - ',trainedclfs.best_params_)
        #print (trainedmodel.best_estimator_)
        
        # predict test data
        pred_test = trainedmodel.predict(test_feature)
        
        # Get error scores on test data
        mse, r2, rmsle = collect_error_score(test_target, pred_test)

        test_error_scores.append ((i,  mse, r2, rmsle))
        train_reg.append ((i, trainedmodel))
        
    print ("==== Finished training  Regressors ====")    
    print (" Total training time :  ({0:.3f} s)\n".format(time.time() - t) )
    return ( test_error_scores, train_reg)
    
def error_table (score, labels, sort_col ):
    #labels  = ['Clf','mean absolute error','mean square error','R2 squared', 'Mean Sq Log Error', 'Root Mean Sq Log Error']
    scored_df = pd.DataFrame.from_records(score, columns=labels, index = None)
    sorted_scored = scored_df.sort_values(by = sort_col, ascending=True)
    return sorted_scored
    


## Prediction and Evaluation for Formation Energy

In [None]:
# Call "predict_evaluate" for Formation Energy
# pass training and test data for Formation energy to "predict_evaluate"
# "predict_evaluate" will return 
#      1. test data error scores, one tuple per regressor: the short initials ( like 'ETR' for ExtraTreesRegressor(),
#         'DTR' for DecisionTreeRegressor() etc... ), mean square error, R squared and Root Mean Sq Log Error
#      2. the list of (short initials, fitted GridSearchCV) pairs

form_error_scores, trained_pred_form = predict_evaluate(fX_train_data, fy_train_target, fX_test_data, fy_test_target)   


### Prediction error for Formation Energy

In [None]:
labels  = ['Regr','mean square error', 'R Squared', 'Root Mean Sq Log Error']
#############
print("Formation Energy scores : on test data - ordered by Mean Square Error : \n")
formation_energy_score = error_table (form_error_scores, labels,  'mean square error' )
formation_energy_score


There is little point in evaluating models with poor scores, so let's select the top 10 scores.

In [None]:
# Select top 10 scores
formation_energy_score_10 = formation_energy_score[0:10]
formation_energy_score_10

The table above gives the mean square error (MSE), R Squared and Root Mean Square Log Error (RMSLE) in ascending order of MSE, with the lowest-MSE model at the top. A model near the top has a lower MSE than the others, and if its R Squared is also higher than the others, it is probably the better model.
 
Note : A mean squared error (MSE) of zero indicates perfect skill, or no error. In our case the MSE of the XGR regressor is the lowest and close to zero. Similarly, the closer R Squared is to 1, the better the model.
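As a small worked example of the three metrics, the sketch below computes them by hand with NumPy on three made-up target values. These toy numbers are unrelated to the score table.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])   # only the last prediction is off, by 1

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R squared: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# RMSLE: root mean squared difference of log(1 + value)
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(mse, r2, rmsle)
```

Here the residual sum of squares is 1 and the total sum of squares is 2, so MSE is 1/3 and R squared is 0.5; RMSLE works on log(1 + value), which makes it sensitive to relative rather than absolute differences.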

Let's visualise the above scores using bar graph.

#### Error Score distribution for Formation Energy for top 10 regressor

In [None]:
formation_energy_score_10.plot(kind='bar', ylim=(-0.20,1.0), figsize=(12,4), align='center', colormap="tab20")
plt.xticks(np.arange(10), formation_energy_score_10.Regr)
plt.ylabel('Error Score')
plt.title('Error Score Distribution by Regressor')
plt.legend(bbox_to_anchor=(1.3, 0.9), loc=5, borderaxespad=0.5)

In [None]:
error_score_df = pd.DataFrame.from_records(formation_energy_score_10, columns=['Regr','mean square error','R Squared', 'Root Mean Sq Log Error' ], index = None)

error_score_df

#### Root Mean Square Log Error for Formation Energy for top 10 regressor

In [None]:
RMSLEscore = error_score_df[['Regr','Root Mean Sq Log Error']]

RMSLEscore.plot(kind='bar', ylim=None, figsize=(10,4), align='center', colormap="jet") 
plt.xticks(np.arange(10), RMSLEscore.Regr) 
plt.ylabel('Error Score') 
plt.title('Root Mean Square Log Error - Distribution by Regressor') 
plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)


#### Mean Square Error for Formation Energy for top 10 regressor

In [None]:
MSEscore = error_score_df[['Regr','mean square error']]

MSEscore.plot(kind='bar', ylim=None, figsize=(10,4), align='center', colormap="rainbow") 
plt.xticks(np.arange(10), MSEscore.Regr) 
plt.ylabel('Error Score') 
plt.title('Mean Square Error - Distribution by Regressor') 
plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)

#### R2 Square for Formation Energy for top 10 regressor

In [None]:
RSquarescore =error_score_df[['Regr','R Squared']]

RSquarescore.plot(kind='bar', ylim=None, figsize=(10,4), align='center', colormap="Spectral")
plt.xticks(np.arange(10), RSquarescore.Regr)
plt.ylabel('Error Score')
plt.title('R Squared - Distribution by Regressor')
plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)

In [None]:
MSE_RMSLEscore =error_score_df[['Regr','mean square error','Root Mean Sq Log Error']]

MSE_RMSLEscore.plot(kind='bar', ylim=None, figsize=(10,4), align='center', colormap="copper")
plt.xticks(np.arange(10), MSE_RMSLEscore.Regr)
plt.ylabel('Error Score')
plt.title('MSE and RMSLE - Distribution by Regressor')
plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)

## Select Best Predictor for formation Energy

Select the best trained regressor model which we will use to get the formation energy's predicted value for our original test data from test.csv.
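One detail matters when picking the best row from a sorted score table: after `sort_values` the index labels are no longer 0, 1, 2, ..., so a positional result must not be fed to the label-based `.loc`. The sketch below illustrates this with a made-up table (regressor names `AAA`, `BBB`, `CCC` are placeholders).

```python
import pandas as pd

scores = pd.DataFrame({
    "Regr": ["AAA", "BBB", "CCC"],
    "mean square error": [0.30, 0.10, 0.20],
    "R Squared": [0.55, 0.90, 0.80],
})
sorted_scores = scores.sort_values(by="mean square error")  # index is now 1, 2, 0

# Label-based lookup: idxmax() returns an index label, which .loc understands
best_label = sorted_scores["R Squared"].idxmax()
best_name = sorted_scores.loc[best_label, "Regr"]

# Position-based lookup: a NumPy argmax gives a position, which needs .iloc
best_pos = sorted_scores["R Squared"].to_numpy().argmax()
same_name = sorted_scores["Regr"].iloc[best_pos]

# Mixing the two (position passed to .loc) silently picks the wrong row
wrong_name = sorted_scores.loc[best_pos, "Regr"]

print(best_name, same_name, wrong_name)
```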



In [None]:
# select the best regressor
def select_regressor(score_data, predictor):
    """
    Return the fitted regressor with the highest R Squared in score_data.

    score_data : score table with 'Regr' and 'R Squared' columns (e.g. formation_energy_score)
    predictor  : list of (short initials, fitted GridSearchCV) pairs from predict_evaluate
    """
    print (" The best regressor, selected by highest R Squared \n ")
    print (" --------------------------------------------------------- \n ")
    # idxmax() returns the index label of the row with the highest 'R Squared'.
    # score_data is sorted by MSE, so its labels are not 0..n-1 and a positional argmax() must not be used with .loc
    #val = score_data.loc[score_data['mean square error'].idxmin(), 'Regr']
    val = score_data.loc[score_data['R Squared'].idxmax(), 'Regr']

    # iterate through the (initials, model) pairs and pick the one whose initials match val
    selected_reg = None
    for name, model in predictor:
        if name == val:
            print ((name, model))
            selected_reg = model
    return selected_reg


### Get Predicted Formation energy

In [None]:

X_test_data = test_data.drop(['id', 'number_of_total_atoms'], axis = 1)

In [None]:
# Call the select_regressor function to get the regressor and
# predict the formation energy for our original test_data (test.csv)

selected_regressor_form = select_regressor(formation_energy_score, trained_pred_form)

test_pred_form = selected_regressor_form.predict(X_test_data)

In [None]:
len (test_pred_form), test_pred_form.size 

# Evaluation for Bandgap Energy

In [None]:
##############
# Call "predict_evaluate" for Bandgap Energy
# pass training and test data for Bandgap energy to "predict_evaluate"
# "predict_evaluate" will return 
#      1. test data error scores, one tuple per regressor: the short initials,
#         mean square error, R squared and Root Mean Sq Log Error
#      2. the list of (short initials, fitted GridSearchCV) pairs

test_error_scores, trained_pred_band   = predict_evaluate(bX_train_data, by_train_target, bX_test_data, by_test_target  )

In [None]:
labels  = ['Regr','mean square error', 'R Squared', 'Root Mean Sq Log Error']

print("bandgap energy error scores on test data - ordered by mean square error : \n")
bandgap_energy_score = error_table (test_error_scores, labels, 'mean square error')
bandgap_energy_score

In [None]:
# Select top 10 scores
bandgap_energy_score_10 = bandgap_energy_score[0:10]
bandgap_energy_score_10



#### Error Score distribution for Bandgap Energy for top 10 regressor

In [None]:
bandgap_energy_score_10.plot(kind='bar', ylim=None, figsize=(12,4), align='center', colormap="tab20")
plt.xticks(np.arange(10), bandgap_energy_score_10.Regr)
plt.ylabel('Error Score')
plt.title('Error Score Distribution by Regressor')
plt.legend(bbox_to_anchor=(1.3, 0.9), loc=5, borderaxespad=0.5)

In [None]:
error_score_df = pd.DataFrame.from_records(bandgap_energy_score_10, columns=['Regr','mean square error','R Squared', 'Root Mean Sq Log Error' ], index = None)

error_score_df

#### Root Mean Square Log Error for Bandgap Energy for top 10 regressor

In [None]:
RMSLEscore = error_score_df[['Regr','Root Mean Sq Log Error']]

RMSLEscore.plot(kind='bar', ylim=None, figsize=(10,3), align='center', colormap="jet") 
plt.xticks(np.arange(10), RMSLEscore.Regr) 
plt.ylabel('Error Score') 
plt.title('Root Mean Square Log Error - Distribution by Regressor') 
plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)


#### Mean Square Error for Bandgap Energy for top 10 regressor

In [None]:
print("bandgap energy error scores on test data - ordered by mean square error : \n")
MSEscore = error_score_df[['Regr','mean square error']]

MSEscore.plot(kind='bar', ylim=None, figsize=(10,4), align='center', colormap="rainbow") 
plt.xticks(np.arange(10), MSEscore.Regr) 
plt.ylabel('Error Score') 
plt.title('Mean Square Error - Distribution by Regressor') 
plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)

### Get Predicted Bandgap energy

In [None]:
# Call the select_regressor function to get the regressor and
# predict the bandgap energy for our original test_data (test.csv)
selected_regressor_band = select_regressor(bandgap_energy_score, trained_pred_band)
#
test_pred_band = selected_regressor_band.predict(X_test_data)


In [None]:
len(test_pred_band)

# The Submission file
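The submission table needs three columns in a fixed order: 'id', 'formation_energy_ev_natom' and 'bandgap_energy_ev'. As a compact sketch, the frame can be built in one call; the ids and predictions below are made-up placeholders, and the CSV is written to an in-memory buffer rather than to disk.

```python
import io

import numpy as np
import pandas as pd

# Placeholder ids and predictions (assumptions for illustration only)
ids = pd.Series([1, 2, 3], name="id")
pred_form = np.array([0.1, 0.2, 0.3])
pred_band = np.array([1.5, 2.5, 3.5])

# Dict order fixes the column order of the resulting frame
submission = pd.DataFrame({
    "id": ids,
    "formation_energy_ev_natom": pred_form,
    "bandgap_energy_ev": pred_band,
})

# Write without the index column, as a submission file would be
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
print(header)
```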

In [None]:
## Save the output to "submission.csv" with columns: id, formation_energy_ev_natom, bandgap_energy_ev

test_ids = test_data['id']           # 'id' from test_data of test.csv (named test_ids to avoid shadowing the built-in id)

submission_id = pd.DataFrame({ 'id' : test_ids})
submission_form = pd.DataFrame({ 'formation_energy_ev_natom': test_pred_form})  # dataframe for predict formation energy
submission_band = pd.DataFrame({ 'bandgap_energy_ev': test_pred_band})          # dataframe for predict bandgap energy
submission_df =  pd.concat([submission_form,submission_band],axis=1)            
submission_df =  pd.concat([submission_id,submission_df],axis=1)                #dataframe with 'id', formation and bandgap energy 
# save into submission.csv
#submission_df.to_csv('D:\DataScienceCourse\TensorFlow Bootcamp\Kaggle_Predicting_Transparent_Conductors\submission.csv', index=False)