### Sales Prediction for Big Mart Outlets
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.

Using this model, BigMart will try to understand the properties of products and outlets which play a key role in increasing sales.

##### First, import basic packages such as pandas and NumPy

In [79]:
import pandas as pd
import numpy as np

Import the data, which is already split into training and test datasets.

In [80]:
pd_train=pd.read_csv('E:\\Dataset\\Big Mar Sales Prediction\\train.csv')
pd_test=pd.read_csv('E:\\Dataset\\Big Mar Sales Prediction\\test.csv')

Now check the column properties of both datasets.

In [81]:
pd_train.info(),pd_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIn

(None, None)

Check count of NULL values by columns

In [82]:
pd_train.isnull().sum(),pd_test.isnull().sum()

(Item_Identifier                 0
 Item_Weight                  1463
 Item_Fat_Content                0
 Item_Visibility                 0
 Item_Type                       0
 Item_MRP                        0
 Outlet_Identifier               0
 Outlet_Establishment_Year       0
 Outlet_Size                  2410
 Outlet_Location_Type            0
 Outlet_Type                     0
 Item_Outlet_Sales               0
 dtype: int64,
 Item_Identifier                 0
 Item_Weight                   976
 Item_Fat_Content                0
 Item_Visibility                 0
 Item_Type                       0
 Item_MRP                        0
 Outlet_Identifier               0
 Outlet_Establishment_Year       0
 Outlet_Size                  1606
 Outlet_Location_Type            0
 Outlet_Type                     0
 dtype: int64)
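The raw counts above are easier to judge as a share of the rows: in the training data, `Item_Weight` is missing in about 17% of rows (1463 of 8523) and `Outlet_Size` in about 28% (2410 of 8523). A minimal sketch of a helper that reports both figures follows; the name `missing_summary` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns that actually contain gaps, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(pd_train)` and `missing_summary(pd_test)` would give the same view for the real tables.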

Add a marker column to each dataset before merging them, and add an `Item_Outlet_Sales` column to the test data so that both tables have the same columns.

In [83]:
pd_train['type']='train'
pd_test['type']='test'
pd_test['Item_Outlet_Sales']=np.nan  # placeholder: the test set has no sales values; NaN keeps the column numeric

Merge both tables vertically so that null values and other impurities can be treated in one pass.

In [84]:
pd_all=pd.concat([pd_train,pd_test],axis=0)
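The tag, merge and split-back pattern used here can be sketched on a toy example. The frames and the `ignore_index=True` argument are assumptions added for illustration; the notebook's own call keeps the original row labels, which therefore repeat in the combined frame.

```python
import pandas as pd

# Toy stand-ins for the training and test tables (values are made up).
train = pd.DataFrame({"Item_MRP": [100.0, 200.0],
                      "Item_Outlet_Sales": [1500.0, 3200.0]})
test = pd.DataFrame({"Item_MRP": [150.0]})

train["type"] = "train"
test["type"] = "test"

# ignore_index=True gives the combined frame a fresh 0..n-1 index;
# without it the row labels of the two frames would repeat.
combined = pd.concat([train, test], axis=0, ignore_index=True)

# ... shared cleaning steps would go here ...

# Split back on the marker column and drop what each part does not need.
train_back = combined[combined["type"] == "train"].drop(columns="type")
test_back = combined[combined["type"] == "test"].drop(
    columns=["type", "Item_Outlet_Sales"])
```

A column that exists in only one of the frames is filled with `NaN` for the rows of the other frame, so the test rows get a missing target automatically.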

Replace null values with the mode (for the categorical column) or the mean (for the numeric column).

In [87]:
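# Assign the result back instead of calling fillna(inplace=True) on a selected column
pd_all['Outlet_Size']=pd_all['Outlet_Size'].fillna(pd_all['Outlet_Size'].mode()[0])
pd_all['Item_Weight']=pd_all['Item_Weight'].fillna(pd_all['Item_Weight'].mean())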
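To see what the two fill rules do, here is a small sketch on made-up values. Note that in the notebook the mode and the mean are computed over the combined training and test rows; the toy frame below is only an illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Size": ["Medium", None, "Medium", "Small"],
    "Item_Weight": [10.0, np.nan, 20.0, 30.0],
})

# Categorical column: use the most frequent label (the mode).
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Size"].mode()[0])
# Numeric column: use the mean of the observed values, here (10 + 20 + 30) / 3.
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].mean())
print(toy)
```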

Replace the categorical values in each column with numbers (assigned sequentially, starting from 0).

In [89]:
pd_all['Item_Fat_Content']=pd_all['Item_Fat_Content'].map({'Low Fat':0,'Regular':1,'LF':0,'reg':1,'low fat':0})
pd_all['Item_Type']=pd_all['Item_Type'].map({'Seafood':15,'Breakfast':14,'Starchy Foods':13,'Others':12,'Hard Drinks':11,'Breads':10,'Soft Drinks':9,'Meat':8,'Health and Hygiene':7,'Canned':6,'Baking Goods':5,'Dairy':4,'Frozen Foods':3,'Household':2,'Snack Foods':1,'Fruits and Vegetables':0})

pd_all['Outlet_Location_Type']=pd_all['Outlet_Location_Type'].map({'Tier 1':0,'Tier 2':1,'Tier 3':2})
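# Outlet_Establishment_Year is an int64 column, so the mapping keys must be integers (string keys would map every row to NaN)
pd_all['Outlet_Establishment_Year']=pd_all['Outlet_Establishment_Year'].map({1998:8,2007:7,2009:6,2002:5,2004:4,1997:3,1999:2,1987:1,1985:0})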


In [90]:
pd_all['Outlet_Size']=pd_all['Outlet_Size'].map({'Medium':0,'Small':1,'High':2})

In [91]:
pd_all['Outlet_Type']=pd_all['Outlet_Type'].map({'Supermarket Type1':0,'Grocery Store':1,'Supermarket Type3':2,'Supermarket Type2':3})
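`Series.map` with a dictionary returns `NaN` for every value that has no matching key, and it does so silently. Keys must also match the column's dtype: `Outlet_Establishment_Year` is listed as `int64` in the `info()` output, so string keys such as `'1998'` never match. A minimal sketch of a stricter helper follows; `map_strict` is a hypothetical name, not a pandas function.

```python
import pandas as pd

def map_strict(series, mapping):
    """Map labels to codes and fail loudly if any label is missing from the mapping."""
    unmapped = sorted(set(series.dropna().unique()) - set(mapping))
    if unmapped:
        raise KeyError(f"no code defined for: {unmapped}")
    return series.map(mapping)

years = pd.Series([1999, 2009, 1985])
int_codes = map_strict(years, {1985: 0, 1999: 2, 2009: 6})

# String keys never match integer values, so plain .map() returns only NaN.
silent_nan = years.map({"1985": 0, "1999": 2, "2009": 6})
print(int_codes.tolist(), silent_nan.isna().all())
```

A quick `pd_all.isnull().sum()` after the mapping cells is a cheaper way to catch the same problem.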

Rescale the column values to a common 0 to 1 range using min-max scaling.

In [92]:
for i in pd_all.iloc[:,[1,2,3,4,5,7,9,10]]:
    pd_all[i]=(pd_all[i]-pd_all[i].min())/(pd_all[i].max()-pd_all[i].min())
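The loop applies the min-max formula `(x - min) / (max - min)` column by column. A small worked sketch on made-up prices is shown below; the function name `min_max_scale` and the guard for constant columns are additions for illustration.

```python
import pandas as pd

def min_max_scale(column):
    """Rescale a numeric column linearly so its minimum maps to 0 and its maximum to 1."""
    span = column.max() - column.min()
    if span == 0:
        # A constant column has no spread; return zeros instead of dividing by zero.
        return column * 0.0
    return (column - column.min()) / span

mrp = pd.Series([50.0, 150.0, 250.0])
scaled = min_max_scale(mrp)
print(scaled.tolist())  # [0.0, 0.5, 1.0]
```

Because the notebook scales the combined frame, the minimum and maximum are taken over training and test rows together.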

Delete the irrelevant column.

In [93]:
del pd_all['Outlet_Establishment_Year']

Separate the two datasets again.

In [94]:
sd_train=pd_all[pd_all['type']=='train'].copy()  # copy so the in-place drops below do not act on a slice of pd_all

In [95]:
sd_test=pd_all[pd_all['type']=='test'].copy()

Delete the irrelevant columns that are not required for modeling.

In [None]:
sd_train.drop(['Item_Identifier','Outlet_Identifier','type'],axis=1,inplace=True)

In [98]:
sd_test.drop(['Item_Outlet_Sales','type','Item_Identifier','Outlet_Identifier'],axis=1,inplace=True)


### Regression
Import the packages used in this section and split the training data.

In [100]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hold out 20% of the training rows for validation
sd_train1,sd_train2=train_test_split(sd_train,test_size=0.2,random_state=2)

Now that the data is split into a training part and a validation part, create two datasets from the training part: the predictors and the target variable.

In [101]:
sd_train11=sd_train1.drop('Item_Outlet_Sales',axis=1)
sd_train12=sd_train1['Item_Outlet_Sales']
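With 8523 training rows and `test_size=0.2`, `train_test_split` rounds the held-out part up, so the two parts contain 6818 and 1705 rows. The sketch below reproduces the sizes on a placeholder frame with random values; only the row count is taken from the notebook, and the names `fit_part` and `holdout_part` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the same number of rows as the training data (8523).
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "Item_MRP": rng.uniform(0, 1, 8523),
    "Item_Outlet_Sales": rng.uniform(30, 13000, 8523),
})

fit_part, holdout_part = train_test_split(frame, test_size=0.2, random_state=2)

# Predictors and target are separated only after the row split.
X_fit = fit_part.drop("Item_Outlet_Sales", axis=1)
y_fit = fit_part["Item_Outlet_Sales"]
print(fit_part.shape, holdout_part.shape)
```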

Now fit the model on the training part separated earlier.

In [102]:
lm=LinearRegression()
lm.fit(sd_train11,sd_train12)

LinearRegression()

Now predict the values for the validation part split off earlier (20%).

In [106]:
predict=lm.predict(sd_train2.drop('Item_Outlet_Sales',axis=1))

Check the error of the predicted values against their actual values.

In [107]:
mean_absolute_error(sd_train2['Item_Outlet_Sales'],predict)

1069.7329678766089
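The mean absolute error is the average of `|actual - predicted|`, so it is expressed in the same units as `Item_Outlet_Sales`: a value of about 1070 means the predictions are off by roughly 1070 sales units on average. A short sketch with made-up numbers shows that the manual calculation and `mean_absolute_error` agree.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values, only to illustrate the calculation.
actual = np.array([1000.0, 2500.0, 400.0])
predicted = np.array([1200.0, 2000.0, 700.0])

# Absolute errors are 200, 500 and 300; their mean is 1000 / 3.
by_hand = np.mean(np.abs(actual - predicted))
via_sklearn = mean_absolute_error(actual, predicted)
print(by_hand, via_sklearn)
```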

Now predict the values for the actual test data provided.

In [83]:
pred=lm.predict(sd_test)

In [84]:
pd.DataFrame(pred).to_csv("sd_test.csv",index=False)

#### Ridge Regression
Import Ridge, Lasso and GridSearchCV for further optimization of the model.

In [108]:
from sklearn.linear_model import Ridge,Lasso
from sklearn.model_selection import GridSearchCV

Define the lambdas: 1000 evenly spaced values from 0.001 to 100.

In [109]:
lambdas=np.linspace(.001,100,1000)

In [110]:
lambdas

array([1.00000000e-03, 1.01099099e-01, 2.01198198e-01, 3.01297297e-01,
       4.01396396e-01, 5.01495495e-01, 6.01594595e-01, 7.01693694e-01,
       8.01792793e-01, 9.01891892e-01, 1.00199099e+00, 1.10209009e+00,
       1.20218919e+00, 1.30228829e+00, 1.40238739e+00, 1.50248649e+00,
       1.60258559e+00, 1.70268468e+00, 1.80278378e+00, 1.90288288e+00,
       2.00298198e+00, 2.10308108e+00, 2.20318018e+00, 2.30327928e+00,
       2.40337838e+00, 2.50347748e+00, 2.60357658e+00, 2.70367568e+00,
       2.80377477e+00, 2.90387387e+00, 3.00397297e+00, 3.10407207e+00,
       3.20417117e+00, 3.30427027e+00, 3.40436937e+00, 3.50446847e+00,
       3.60456757e+00, 3.70466667e+00, 3.80476577e+00, 3.90486486e+00,
       4.00496396e+00, 4.10506306e+00, 4.20516216e+00, 4.30526126e+00,
       4.40536036e+00, 4.50545946e+00, 4.60555856e+00, 4.70565766e+00,
       4.80575676e+00, 4.90585586e+00, 5.00595495e+00, 5.10605405e+00,
       5.20615315e+00, 5.30625225e+00, 5.40635135e+00, 5.50645045e+00,
      

Define the Ridge model.

In [111]:
model=Ridge(fit_intercept=True)

In [113]:
params={'alpha':lambdas}

Define the elements of the grid search, including the number of cross-validation folds (CV) and the scoring type.

In [114]:
grid_search=GridSearchCV(model,
                         param_grid=params,
                         cv=10,
                         scoring='neg_mean_absolute_error',
                        verbose=20,n_jobs=-1)

Define a function to report the top validation scores.

In [112]:
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.6f} (std: {1:.6f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")
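An alternative to printing is to load the results into a DataFrame and sort by rank. The sketch below is an assumption-labelled variant, not the notebook's method: `top_candidates` and the `fake_results` dictionary are made up, and the dictionary only mimics the keys of `cv_results_` that `report` reads.

```python
import numpy as np
import pandas as pd

def top_candidates(results, n_top=3):
    """Return the n_top best-ranked rows of a cv_results_-style dict as a DataFrame."""
    table = pd.DataFrame({
        "rank": results["rank_test_score"],
        "mean_score": results["mean_test_score"],
        "std_score": results["std_test_score"],
        "params": results["params"],
    })
    return table.sort_values("rank").head(n_top).reset_index(drop=True)

# Hand-built stand-in for search.cv_results_ (values are made up).
fake_results = {
    "rank_test_score": np.array([2, 1, 3]),
    "mean_test_score": np.array([-1010.0, -1003.0, -1050.0]),
    "std_test_score": np.array([30.0, 32.0, 28.0]),
    "params": [{"alpha": 0.1}, {"alpha": 10.0}, {"alpha": 100.0}],
}
best = top_candidates(fake_results, n_top=2)
print(best)
```

One difference in behavior: `report` prints every candidate that shares a rank, while `head(n_top)` always returns exactly `n_top` rows.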

Fitting the split train data

In [115]:
grid_search.fit(sd_train11,sd_train12)

Fitting 10 folds for each of 1000 candidates, totalling 10000 fits


GridSearchCV(cv=10, estimator=Ridge(), n_jobs=-1,
             param_grid={'alpha': array([1.00000000e-03, 1.01099099e-01, 2.01198198e-01, 3.01297297e-01,
       4.01396396e-01, 5.01495495e-01, 6.01594595e-01, 7.01693694e-01,
       8.01792793e-01, 9.01891892e-01, 1.00199099e+00, 1.10209009e+00,
       1.20218919e+00, 1.30228829e+00, 1.40238739e+00, 1.50248649e+00,
       1.60258559e+00, 1.70268468e+00, 1.8027...
       9.76977207e+01, 9.77978198e+01, 9.78979189e+01, 9.79980180e+01,
       9.80981171e+01, 9.81982162e+01, 9.82983153e+01, 9.83984144e+01,
       9.84985135e+01, 9.85986126e+01, 9.86987117e+01, 9.87988108e+01,
       9.88989099e+01, 9.89990090e+01, 9.90991081e+01, 9.91992072e+01,
       9.92993063e+01, 9.93994054e+01, 9.94995045e+01, 9.95996036e+01,
       9.96997027e+01, 9.97998018e+01, 9.98999009e+01, 1.00000000e+02])},
             scoring='neg_mean_absolute_error', verbose=20)

Report the top 5 candidates.

In [116]:
report(grid_search.cv_results_,5)

Model with rank: 1
Mean validation score: -1003.765548 (std: 32.846900)
Parameters: {'alpha': 10.211108108108109}

Model with rank: 2
Mean validation score: -1003.765553 (std: 32.845858)
Parameters: {'alpha': 10.111009009009008}

Model with rank: 3
Mean validation score: -1003.765689 (std: 32.844772)
Parameters: {'alpha': 10.010909909909909}

Model with rank: 4
Mean validation score: -1003.765774 (std: 32.847933)
Parameters: {'alpha': 10.311207207207207}

Model with rank: 5
Mean validation score: -1003.765904 (std: 32.843642)
Parameters: {'alpha': 9.91081081081081}
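The scores are negative because `scoring='neg_mean_absolute_error'` reports the MAE with its sign flipped, so that a higher score is always better; a mean validation score of -1003.77 corresponds to an MAE of about 1003.77. The sketch below shows the sign convention on synthetic data; the data, the three-value grid and `cv=5` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (not the BigMart data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 1.0, 100.0]},
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)

best_mae = -search.best_score_  # flip the sign to read it as an ordinary MAE
print(search.best_params_, round(best_mae, 3))
```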



### Lasso
Defining the model with Lasso

In [117]:
model1=Lasso(fit_intercept=True)

Defining different elements in grid search

In [118]:
grid_search1=GridSearchCV(model1,
                         param_grid=params,
                         cv=30,
                         scoring='neg_mean_absolute_error',
                        verbose=20,n_jobs=-1)

Fitting the data

In [119]:
grid_search1.fit(sd_train11,sd_train12)

Fitting 30 folds for each of 1000 candidates, totalling 30000 fits


GridSearchCV(cv=30, estimator=Lasso(), n_jobs=-1,
             param_grid={'alpha': array([1.00000000e-03, 1.01099099e-01, 2.01198198e-01, 3.01297297e-01,
       4.01396396e-01, 5.01495495e-01, 6.01594595e-01, 7.01693694e-01,
       8.01792793e-01, 9.01891892e-01, 1.00199099e+00, 1.10209009e+00,
       1.20218919e+00, 1.30228829e+00, 1.40238739e+00, 1.50248649e+00,
       1.60258559e+00, 1.70268468e+00, 1.8027...
       9.76977207e+01, 9.77978198e+01, 9.78979189e+01, 9.79980180e+01,
       9.80981171e+01, 9.81982162e+01, 9.82983153e+01, 9.83984144e+01,
       9.84985135e+01, 9.85986126e+01, 9.86987117e+01, 9.87988108e+01,
       9.88989099e+01, 9.89990090e+01, 9.90991081e+01, 9.91992072e+01,
       9.92993063e+01, 9.93994054e+01, 9.94995045e+01, 9.95996036e+01,
       9.96997027e+01, 9.97998018e+01, 9.98999009e+01, 1.00000000e+02])},
             scoring='neg_mean_absolute_error', verbose=20)

In [120]:
report(grid_search1.cv_results_,5)

Model with rank: 1
Mean validation score: -1003.340453 (std: 69.510578)
Parameters: {'alpha': 2.8037747747747748}

Model with rank: 2
Mean validation score: -1003.341142 (std: 69.505908)
Parameters: {'alpha': 2.7036756756756755}

Model with rank: 3
Mean validation score: -1003.341262 (std: 69.514425)
Parameters: {'alpha': 2.9038738738738736}

Model with rank: 4
Mean validation score: -1003.342457 (std: 69.501659)
Parameters: {'alpha': 2.6035765765765766}

Model with rank: 5
Mean validation score: -1003.343292 (std: 69.517721)
Parameters: {'alpha': 3.003972972972973}



In [99]:
predict=grid_search1.predict(sd_test)  # sd_train21 is not defined anywhere; the test features are used, matching the output file name

In [100]:
pd.DataFrame(predict).to_csv("sd_test_lasso.csv",index=False)

### Decision Tree
Import the required packages: decision tree, random forest, extra trees and randomized search CV.

In [122]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

Defining different parameters of Decision Tree

In [123]:
paramss={
    'max_depth':[None,5,10,15,20,30,50,70],
    'min_samples_leaf':[1,2,5,10,15,20],
    'min_samples_split':[2,5,10,15,20]
}

In [124]:
reg=DecisionTreeRegressor()

Defining random search CV with parameter and fitting the model

In [125]:
random_search=RandomizedSearchCV(reg,
                                 cv=10,
                                 param_distributions=paramss,
                                 scoring='neg_mean_absolute_error',
                                 n_iter=240,n_jobs=-1,verbose=20
                                    )

In [126]:
random_search.fit(sd_train11,sd_train12)

Fitting 10 folds for each of 240 candidates, totalling 2400 fits


RandomizedSearchCV(cv=10, estimator=DecisionTreeRegressor(), n_iter=240,
                   n_jobs=-1,
                   param_distributions={'max_depth': [None, 5, 10, 15, 20, 30,
                                                      50, 70],
                                        'min_samples_leaf': [1, 2, 5, 10, 15,
                                                             20],
                                        'min_samples_split': [2, 5, 10, 15,
                                                              20]},
                   scoring='neg_mean_absolute_error', verbose=20)

In [127]:
report(random_search.cv_results_,5)

Model with rank: 1
Mean validation score: -755.039529 (std: 23.877146)
Parameters: {'min_samples_split': 2, 'min_samples_leaf': 20, 'max_depth': 5}

Model with rank: 1
Mean validation score: -755.039529 (std: 23.877146)
Parameters: {'min_samples_split': 5, 'min_samples_leaf': 20, 'max_depth': 5}

Model with rank: 1
Mean validation score: -755.039529 (std: 23.877146)
Parameters: {'min_samples_split': 10, 'min_samples_leaf': 20, 'max_depth': 5}

Model with rank: 1
Mean validation score: -755.039529 (std: 23.877146)
Parameters: {'min_samples_split': 15, 'min_samples_leaf': 20, 'max_depth': 5}

Model with rank: 1
Mean validation score: -755.039529 (std: 23.877146)
Parameters: {'min_samples_split': 20, 'min_samples_leaf': 20, 'max_depth': 5}
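All five entries share rank 1 with identical scores because they differ only in `min_samples_split`. With `min_samples_leaf=20`, a node needs at least 40 samples before it can be split into two valid leaves, so any `min_samples_split` of 20 or less never becomes the binding constraint and the fitted trees are the same. A small sketch on synthetic data illustrates this; the data and the fixed `random_state` are assumptions for reproducibility.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (not the BigMart data).
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
y = 10 * X[:, 0] + rng.normal(size=500)

preds = []
for split in [2, 5, 10, 15, 20]:
    tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=20,
                                 min_samples_split=split, random_state=0)
    preds.append(tree.fit(X, y).predict(X))

# Every setting yields exactly the same predictions.
identical = all(np.array_equal(preds[0], p) for p in preds[1:])
print(identical)
```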



### RandomForest
Define different parameters for the random forest.

In [107]:
param_dist = {
    "n_estimators":[50,100,200],
    "max_features":[2,4,5,6,8],
    "bootstrap":[True,False],
    "max_depth":[None,5,10,15,20,30,50,70],
    "min_samples_leaf":[1,2,5,10,15,20],
    "min_samples_split":[2,5,10,15,20]
}

In [108]:
regg=RandomForestRegressor()

In [109]:
n_iter_search = 15
random_search=RandomizedSearchCV(regg,
                                 cv=10,
                                 param_distributions=param_dist,
                                 scoring='neg_mean_absolute_error',
                                 n_iter=n_iter_search,n_jobs=-1,verbose=20
                                 )

random_search.fit(sd_train11,sd_train12)

Fitting 10 folds for each of 15 candidates, totalling 150 fits


RandomizedSearchCV(cv=10, estimator=RandomForestRegressor(), n_iter=15,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [None, 5, 10, 15, 20, 30,
                                                      50, 70],
                                        'max_features': [2, 4, 5, 6, 8],
                                        'min_samples_leaf': [1, 2, 5, 10, 15,
                                                             20],
                                        'min_samples_split': [2, 5, 10, 15, 20],
                                        'n_estimators': [50, 100, 200]},
                   scoring='neg_mean_absolute_error', verbose=20)

In [110]:
report(random_search.cv_results_,5)

Model with rank: 1
Mean validation score: -751.994162 (std: 25.882417)
Parameters: {'n_estimators': 200, 'min_samples_split': 20, 'min_samples_leaf': 20, 'max_features': 5, 'max_depth': 20, 'bootstrap': True}

Model with rank: 2
Mean validation score: -753.586753 (std: 25.205017)
Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 15, 'max_features': 4, 'max_depth': 30, 'bootstrap': True}

Model with rank: 3
Mean validation score: -754.928286 (std: 26.905674)
Parameters: {'n_estimators': 50, 'min_samples_split': 2, 'min_samples_leaf': 15, 'max_features': 4, 'max_depth': 10, 'bootstrap': False}

Model with rank: 4
Mean validation score: -755.825096 (std: 25.021806)
Parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 10, 'max_features': 5, 'max_depth': 30, 'bootstrap': True}

Model with rank: 5
Mean validation score: -756.548975 (std: 26.342801)
Parameters: {'n_estimators': 200, 'min_samples_split': 15, 'min_samples_leaf': 15, 'max_feat

In [111]:
predictrf=random_search.predict(sd_test)

In [112]:
pd.DataFrame(predictrf).to_csv("sd_test_rgggg.csv",index=False)

### ExtraTrees

In [216]:
param_dist = {
    "n_estimators":[50,100,200],
    "max_features":[2,4,5,6,8],
    "bootstrap":[True,False],
    "max_depth":[None,5,10,15,20,30,50,70],
    "min_samples_leaf":[1,2,5,10,15,20],
    "min_samples_split":[2,5,10,15,20]
}

In [217]:
reget=ExtraTreesRegressor()

In [238]:
n_iter_search = 30  # matches the 30 candidates reported in the search output
random_searchet=RandomizedSearchCV(reget,
                                 cv=15,
                                 param_distributions=param_dist,
                                 scoring='neg_mean_absolute_error',
                                 n_iter=n_iter_search,n_jobs=-1,verbose=20
                                 )

random_searchet.fit(sd_train11,sd_train12)

Fitting 15 folds for each of 30 candidates, totalling 450 fits


RandomizedSearchCV(cv=15, estimator=ExtraTreesRegressor(), n_iter=30, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [None, 5, 10, 15, 20, 30,
                                                      50, 70],
                                        'max_features': [2, 4, 5, 6, 8],
                                        'min_samples_leaf': [1, 2, 5, 10, 15,
                                                             20],
                                        'min_samples_split': [2, 5, 10, 15, 20],
                                        'n_estimators': [50, 100, 200]},
                   scoring='neg_mean_absolute_error', verbose=20)

In [239]:
report(random_searchet.cv_results_,5)

Model with rank: 1
Mean validation score: -749.818653 (std: 36.771817)
Parameters: {'n_estimators': 200, 'min_samples_split': 15, 'min_samples_leaf': 2, 'max_features': 6, 'max_depth': 10, 'bootstrap': True}

Model with rank: 2
Mean validation score: -750.835697 (std: 36.589455)
Parameters: {'n_estimators': 50, 'min_samples_split': 15, 'min_samples_leaf': 10, 'max_features': 6, 'max_depth': 10, 'bootstrap': False}

Model with rank: 3
Mean validation score: -750.875832 (std: 37.821535)
Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 8, 'max_depth': 10, 'bootstrap': True}

Model with rank: 4
Mean validation score: -751.077803 (std: 35.906078)
Parameters: {'n_estimators': 100, 'min_samples_split': 20, 'min_samples_leaf': 15, 'max_features': 6, 'max_depth': 10, 'bootstrap': False}

Model with rank: 5
Mean validation score: -751.140933 (std: 37.384566)
Parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 10, 'max_feat

### Gradient Boost

In [None]:
import xgboost
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor

In [162]:
import xgboost
print(xgboost.__version__)

1.5.1


In [250]:
gbm_params={'n_estimators':[1,10,20,50,100,200,500,700,900,1100],
            'learning_rate':[0.0001,0.001,0.01,.05,0.1,0.4,0.8,1],
            # max_depth=0 is not a valid tree depth; candidates that draw it fail to fit
            'max_depth':[0,1,2,3,4,5,6,7,8,9],
#             'min_samples_split':[2,5,10,20],
#             'min_samples_leaf':[2,5,10,20],
            'subsample':[0.001,0.01,0.1,0.3,0.5,0.8,1],
            # max_features cannot exceed the 8 predictor columns; larger values also fail to fit
            'max_features':[1,3,5,10,15,20,30,45,55,65,75,85,95]
           }
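Some values in this grid cannot be fitted: `max_depth=0` is not a valid tree depth, and `max_features` may not exceed the number of predictor columns, which is 8 after the earlier drops. The sketch below counts how many `(max_depth, max_features)` pairs remain usable. It assumes the two rules just stated and treats the two parameters as independent draws.

```python
from itertools import product

n_features = 8  # assumption: predictors left after the drops earlier in the notebook

max_depth_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
max_features_values = [1, 3, 5, 10, 15, 20, 30, 45, 55, 65, 75, 85, 95]

def is_valid(max_depth, max_features):
    """A pair is usable only if the depth is positive and the feature
    count does not exceed the number of available columns."""
    return max_depth > 0 and max_features <= n_features

pairs = list(product(max_depth_values, max_features_values))
valid = [p for p in pairs if is_valid(*p)]
invalid_share = 1 - len(valid) / len(pairs)
print(len(valid), len(pairs), round(invalid_share, 3))  # 27 130 0.792
```

Roughly 79% of randomly drawn candidates are therefore expected to fail, which is in line with the 6480 of 8000 failed fits (81 of 100 candidates) in the search log. Restricting `max_depth` to positive values and `max_features` to at most 8 would avoid spending the search budget on them.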

In [251]:
gbm=GradientBoostingRegressor()

In [264]:
random_searchgb=RandomizedSearchCV(gbm,
                                 scoring='neg_mean_absolute_error',
                                 param_distributions=gbm_params,
                                 cv=80,
                                 n_iter=100,
                                 n_jobs=-1,
                                verbose=20)

In [265]:
random_searchgb.fit(sd_train11,sd_train12)

Fitting 80 folds for each of 100 candidates, totalling 8000 fits


6480 fits failed out of a total of 8000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5840 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Cyntexia\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Cyntexia\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 586, in fit
    n_stages = self._fit_stages(
  File "C:\Users\Cyntexia\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 663, in _fit_stages
    raw_predictions = self._fit_stage(
  File "C:\Users\Cyntexia\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 246, in _fit_stage
    tree.fit(X, residual, sam

RandomizedSearchCV(cv=80, estimator=GradientBoostingRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'learning_rate': [0.0001, 0.001, 0.01,
                                                          0.05, 0.1, 0.4, 0.8,
                                                          1],
                                        'max_depth': [0, 1, 2, 3, 4, 5, 6, 7, 8,
                                                      9],
                                        'max_features': [1, 3, 5, 10, 15, 20,
                                                         30, 45, 55, 65, 75, 85,
                                                         95],
                                        'n_estimators': [1, 10, 20, 50, 100,
                                                         200, 500, 700, 900,
                                                         1100],
                                        'subsample': [0.001, 0.01, 0.1, 0.3,
                  

In [266]:
report(random_searchgb.cv_results_,5)

Model with rank: 1
Mean validation score: -764.163091 (std: 84.429838)
Parameters: {'subsample': 1, 'n_estimators': 100, 'max_features': 5, 'max_depth': 3, 'learning_rate': 0.1}

Model with rank: 2
Mean validation score: -774.296668 (std: 85.972917)
Parameters: {'subsample': 0.3, 'n_estimators': 700, 'max_features': 1, 'max_depth': 7, 'learning_rate': 0.01}

Model with rank: 3
Mean validation score: -798.489683 (std: 86.889095)
Parameters: {'subsample': 0.1, 'n_estimators': 900, 'max_features': 3, 'max_depth': 3, 'learning_rate': 0.05}

Model with rank: 4
Mean validation score: -802.940034 (std: 80.157372)
Parameters: {'subsample': 0.1, 'n_estimators': 50, 'max_features': 3, 'max_depth': 7, 'learning_rate': 0.05}

Model with rank: 5
Mean validation score: -811.348527 (std: 91.807627)
Parameters: {'subsample': 0.8, 'n_estimators': 500, 'max_features': 3, 'max_depth': 5, 'learning_rate': 0.1}



In [267]:
predictgb=random_searchgb.predict(sd_test)

In [269]:
pd.DataFrame(predictgb).to_csv("sd_test_gb.csv",index=False)

### XGBoost Regressor

In [210]:
xgb_params = {  
                "learning_rate":[0.0001,0.001,0.01,0.05,0.1,0.9],
                "gamma":[i/10.0 for i in range(0,5)],
                "max_depth": [2,3,4,5,6,7,8],
                "min_child_weight":[1,2,5,10],
                "max_delta_step":[0,1,2,5,10],
                "subsample":[i/10.0 for i in range(1,10)],
                "colsample_bytree":[i/10.0 for i in range(1,10)],
                "colsample_bylevel":[i/10.0 for i in range(1,10)],
                "reg_lambda":[1e-5, 1e-2, 0.1, 1, 100], 
                "reg_alpha":[1e-5, 1e-2, 0.1, 1, 100],
                "scale_pos_weight":[1,2,3,4,5,6,7,8,9],
                "n_estimators":[100,500,700,1000]
             }

In [211]:
xgb=xgboost.XGBRegressor(objective='reg:squarederror')  # 'reg:linear' is the deprecated name of this objective

In [212]:
random_searchxgb=RandomizedSearchCV(xgb,n_jobs=-1,cv=15,n_iter=10,
                                    scoring='neg_mean_absolute_error',
                                     param_distributions=xgb_params)

In [None]:
random_searchxgb.fit(sd_train11,sd_train12)

In [182]:
report(random_searchxgb.cv_results_,5)

Model with rank: 1
Mean validation score: -783.847936 (std: 16.002134)
Parameters: {'subsample': 0.8, 'scale_pos_weight': 5, 'reg_lambda': 100, 'reg_alpha': 1, 'n_estimators': 700, 'min_child_weight': 1, 'max_depth': 8, 'max_delta_step': 0, 'learning_rate': 0.01, 'gamma': 0.1, 'colsample_bytree': 0.5, 'colsample_bylevel': 0.9}

Model with rank: 2
Mean validation score: -824.063335 (std: 20.681460)
Parameters: {'subsample': 0.5, 'scale_pos_weight': 9, 'reg_lambda': 0.1, 'reg_alpha': 100, 'n_estimators': 100, 'min_child_weight': 10, 'max_depth': 3, 'max_delta_step': 0, 'learning_rate': 0.5, 'gamma': 0.4, 'colsample_bytree': 0.5, 'colsample_bylevel': 0.6}

Model with rank: 3
Mean validation score: -1735.718952 (std: 28.387156)
Parameters: {'subsample': 0.7, 'scale_pos_weight': 3, 'reg_lambda': 100, 'reg_alpha': 100, 'n_estimators': 500, 'min_child_weight': 2, 'max_depth': 4, 'max_delta_step': 2, 'learning_rate': 0.5, 'gamma': 0.0, 'colsample_bytree': 0.8, 'colsample_bylevel': 0.9}

Model 

### Neural Network

In [128]:
from sklearn.neural_network import MLPRegressor

In [129]:
parameters={
'learning_rate': ["constant", "invscaling", "adaptive"],
'hidden_layer_sizes': [(5,10,5),(20,10),(10,20),(150,100,50), (120,80,40), (100,50,30)],
'alpha': [0.001,.01,0.1,0.2,0.4,0.8,0.9,0.3,0.1,0.01,0.001,0.0001,1],
'activation': ["relu", "logistic", "tanh"]
}

In [130]:
clf=MLPRegressor()

In [131]:
random_searchnn=RandomizedSearchCV(clf,n_iter=702,cv=50,
                                 param_distributions=parameters,
                                 scoring='neg_mean_absolute_error',random_state=2,
                                 n_jobs=-1,verbose=20)
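The value `n_iter=702` equals the number of combinations in `parameters`, so this randomized search visits every grid point. Because the `alpha` list repeats some values, part of that budget goes to duplicate candidates. The sketch below counts both figures using only the standard library; the dictionary is copied from the cell above.

```python
from math import prod

parameters = {
    "learning_rate": ["constant", "invscaling", "adaptive"],
    "hidden_layer_sizes": [(5, 10, 5), (20, 10), (10, 20),
                           (150, 100, 50), (120, 80, 40), (100, 50, 30)],
    "alpha": [0.001, 0.01, 0.1, 0.2, 0.4, 0.8, 0.9, 0.3, 0.1, 0.01, 0.001, 0.0001, 1],
    "activation": ["relu", "logistic", "tanh"],
}

# Number of grid points with the lists as written, and with duplicates removed.
grid_size = prod(len(values) for values in parameters.values())
unique_size = prod(len(set(values)) for values in parameters.values())
print(grid_size, unique_size)  # 702 540
```

In addition, scikit-learn documents `learning_rate` as being used only when `solver='sgd'`, and `MLPRegressor()` defaults to `solver='adam'`. Under that default the three `learning_rate` settings should behave alike, which would leave 180 distinct configurations.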

In [132]:
random_searchnn.fit(sd_train11,sd_train12)

Fitting 50 folds for each of 702 candidates, totalling 35100 fits




RandomizedSearchCV(cv=50, estimator=MLPRegressor(), n_iter=702, n_jobs=-1,
                   param_distributions={'activation': ['relu', 'logistic',
                                                       'tanh'],
                                        'alpha': [0.001, 0.01, 0.1, 0.2, 0.4,
                                                  0.8, 0.9, 0.3, 0.1, 0.01,
                                                  0.001, 0.0001, 1],
                                        'hidden_layer_sizes': [(5, 10, 5),
                                                               (20, 10),
                                                               (10, 20),
                                                               (150, 100, 50),
                                                               (120, 80, 40),
                                                               (100, 50, 30)],
                                        'learning_rate': ['constant',
                                    

In [133]:
report(random_searchnn.cv_results_,5)





Model with rank: 1
Mean validation score: -746.792425 (std: 72.238538)
Parameters: {'learning_rate': 'adaptive', 'hidden_layer_sizes': (150, 100, 50), 'alpha': 0.4, 'activation': 'relu'}

Model with rank: 2
Mean validation score: -747.247263 (std: 72.837236)
Parameters: {'learning_rate': 'invscaling', 'hidden_layer_sizes': (150, 100, 50), 'alpha': 0.9, 'activation': 'relu'}

Model with rank: 3
Mean validation score: -747.569384 (std: 72.568108)
Parameters: {'learning_rate': 'invscaling', 'hidden_layer_sizes': (150, 100, 50), 'alpha': 0.01, 'activation': 'relu'}

Model with rank: 4
Mean validation score: -748.075442 (std: 70.101223)
Parameters: {'learning_rate': 'constant', 'hidden_layer_sizes': (150, 100, 50), 'alpha': 0.8, 'activation': 'relu'}

Model with rank: 5
Mean validation score: -748.158291 (std: 73.674458)
Parameters: {'learning_rate': 'invscaling', 'hidden_layer_sizes': (150, 100, 50), 'alpha': 0.3, 'activation': 'relu'}



In [134]:
predictrff=random_searchnn.predict(sd_test)

In [135]:
pd.DataFrame(predictrff).to_csv("sd_test_nn.csv",index=False)