# Machine Learning Pipeline - Scoring New Data

Let's imagine that a colleague from the business department asks us to score the data from last month's customers. They want to be sure that our model works appropriately on the most recent data that the organization has.

**How would you go about scoring the new data?** Give it a go. There is more than one way of doing it.

Below we present one potential solution.

What could we have done better?

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# for the yeo-johnson transformation
import scipy.stats as stats

# to save the model
import joblib

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

In [2]:
target = 'charges'

In [3]:
# load the unseen / new dataset
#data = pd.read_csv('test.csv', index_col='id')
data = pd.read_csv('test.csv')

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()

(800, 7)


Unnamed: 0,id,age,sex,bmi,children,smoker,region
0,0,28,female,32.694647,3,no,northeast
1,3,22,female,29.606817,0,no,northeast
2,6,38,female,33.567011,2,yes,northwest
3,7,22,female,29.216607,0,no,northwest
4,8,47,male,32.982643,3,yes,northwest


In [4]:
# drop the id variable

#data.drop('id', axis=1, inplace=True)
data.drop(['id'], axis=1, inplace=True)
#data.drop(['City'], axis=1, inplace=True)

print(data.shape)
data.head()

(800, 6)


Unnamed: 0,age,sex,bmi,children,smoker,region
0,28,female,32.694647,3,no,northeast
1,22,female,29.606817,0,no,northeast
2,38,female,33.567011,2,yes,northwest
3,22,female,29.216607,0,no,northwest
4,47,male,32.982643,3,yes,northwest


# Feature Engineering

First we need to transform the data. Below is the list of transformations that we applied during the Feature Engineering phase:

1. Missing values
2. Temporal variables
3. Non-Gaussian distributed variables
4. Categorical variables: remove rare labels
5. Categorical variables: convert strings to numbers
6. Put the variables in a similar scale

## Missing values

### Categorical variables

- Replace missing values with the string "Missing" in those variables with a lot of missing data.
- Replace missing values with the most frequent category in those variables with only a few missing observations.

In [5]:
# first we needed to cast MSSubClass as object

# data['MSSubClass'] = data['MSSubClass'].astype('O')

In [6]:
# # list of different groups of categorical variables

# with_string_missing = ['Alley', 'FireplaceQu',
#                        'PoolQC', 'Fence', 'MiscFeature']

# # ==================
# # we copy this dictionary from the Feature-engineering notebook
# # note that we needed to hard-code this by hand

# # the key is the variable and the value is its most frequent category

# # what if we re-train the model and the below values change?
# # ==================

# with_frequent_category = {
#     'MasVnrType': 'None',
#     'BsmtQual': 'TA',
#     'BsmtCond': 'TA',
#     'BsmtExposure': 'No',
#     'BsmtFinType1': 'Unf',
#     'BsmtFinType2': 'Unf',
#     'Electrical': 'SBrkr',
#     'GarageType': 'Attchd',
#     'GarageFinish': 'Unf',
#     'GarageQual': 'TA',
#     'GarageCond': 'TA',
# }

In [7]:
# replace missing values with new label: "Missing"

# data[with_string_missing] = data[with_string_missing].fillna('Missing')

In [8]:
# # replace missing values with the most frequent category

# for var in with_frequent_category.keys():
#     data[var].fillna(with_frequent_category[var], inplace=True)

### Numerical variables

To engineer missing values in numerical variables, we will:

- add a binary missing value indicator variable
- and then replace the missing values in the original variable with the mean

In [9]:
# this is the dictionary of numerical variable with missing data
# and its mean, as determined from the training set in the
# Feature Engineering notebook

# note how we needed to hard code the values

vars_with_na = {
#     'LotFrontage': 69.87974098057354,
#     'MasVnrArea': 103.7974006116208,
#     'GarageYrBlt': 1978.2959677419356,
}

In [10]:
# replace missing values as we described above

for var in vars_with_na.keys():

    # add binary missing indicator to the new data
    data[var + '_na'] = np.where(data[var].isnull(), 1, 0)

    # replace missing values with the mean learned from the train set
    data[var] = data[var].fillna(vars_with_na[var])

# check that no missing values remain in these variables
data[list(vars_with_na)].isnull().sum()

Series([], dtype: float64)
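
One way to avoid hard-coding these means is to store them in a file when they are learned and to load that file at scoring time. The sketch below is only an illustration: the small training frame, the `bmi` variable with missing values and the file name `imputation_means.json` are all assumptions, not part of the original notebooks.

```python
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# hypothetical training data with missing values in 'bmi'
train = pd.DataFrame({"age": [20, 30, 40], "bmi": [22.0, np.nan, 30.0]})

# learn the mean of every variable with missing data (train set only)
means = {
    var: float(train[var].mean())
    for var in train.columns
    if train[var].isnull().any()
}

# persist the learned parameters (the file name is an assumption)
path = Path(tempfile.mkdtemp()) / "imputation_means.json"
path.write_text(json.dumps(means))

# --- at scoring time ---
vars_with_na = json.loads(path.read_text())

new_data = pd.DataFrame({"age": [25, 35], "bmi": [np.nan, 28.0]})

for var, mean_value in vars_with_na.items():
    # add the binary missing indicator, then impute with the train mean
    new_data[var + "_na"] = new_data[var].isnull().astype(int)
    new_data[var] = new_data[var].fillna(mean_value)
```

If the model is retrained, the file is regenerated and the scoring code does not need to change.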

In [11]:
# # check the binary missing indicator variables

# data[['LotFrontage_na', 'MasVnrArea_na', 'GarageYrBlt_na']].head()

## Temporal variables

### Capture elapsed time

We need to capture the time elapsed between those variables and the year in which the house was sold:

In [12]:
# def elapsed_years(df, var):
#     # capture difference between the year variable
#     # and the year in which the house was sold
#     df[var] = df['YrSold'] - df[var]
#     return df

In [13]:
# for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
#     data = elapsed_years(data, var)

In [14]:
# # now we drop YrSold
# data.drop(['YrSold'], axis=1, inplace=True)

## Numerical variable transformation

### Logarithmic transformation

We will apply a logarithmic transformation to the positive numerical variables in order to get a more Gaussian-like distribution.

In [15]:
# for var in ["LotFrontage", "1stFlrSF", "GrLivArea"]:
#     data[var] = np.log(data[var])

### Yeo-Johnson transformation

We will apply the Yeo-Johnson transformation to LotArea.

In [16]:
# note how we use the lambda that we learned from the train set
# in the notebook on Feature Engineering.

# Note that we need to hard code this value

yeo_johnson_vars = [
#     "co_cnt", "co_max",
#     "o3_cnt", "o3_min", "o3_max", "o3_var",
#     "so2_cnt", "so2_max",
#     "no2_cnt", "no2_max",
#     "temperature_cnt", "temperature_min", "temperature_var",
#     "humidity_cnt", "humidity_var",
#     "pressure_cnt", "pressure_var",
#     "ws_cnt", "ws_min", "ws_mid", "ws_max", "ws_var",
#     "dew_cnt", "dew_var"    
]

In [17]:
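for var in yeo_johnson_vars:
    # caution: without the lmbda argument, scipy estimates lambda from
    # the new data instead of reusing the value learned on the train set
    data[var], param = stats.yeojohnson(data[var])

    print(param)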
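
To reuse the lambda learned on the train set, `scipy.stats.yeojohnson` accepts it through the `lmbda` argument. The following is a minimal sketch with made-up data: the variable values and the name `learned_lambda` are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# hypothetical skewed variable from the train set
train_values = rng.exponential(scale=2.0, size=500)

# without lmbda, scipy estimates lambda and returns (transformed, lambda)
train_transformed, learned_lambda = stats.yeojohnson(train_values)

# at scoring time, pass the stored lambda;
# in this case only the transformed array is returned
new_values = np.array([0.5, 1.0, 4.0])
new_transformed = stats.yeojohnson(new_values, lmbda=learned_lambda)
```

With this approach the stored lambda would have to be saved per variable, for example in a dictionary keyed by variable name.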

### Binarize skewed variables

A few variables were very skewed, so we transform them into binary variables.

In [18]:
# skewed = [
#     'BsmtFinSF2', 'LowQualFinSF', 'EnclosedPorch',
#     '3SsnPorch', 'ScreenPorch', 'MiscVal'
# ]

# for var in skewed:
    
#     # map the variable values into 0 and 1
#     data[var] = np.where(data[var]==0, 0, 1)

## Categorical variables

### Apply mappings

We remap variables with specific meanings into a numerical scale.

In [19]:
# # re-map strings to numbers, which determine quality

# qual_mappings = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

# qual_vars = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
#              'HeatingQC', 'KitchenQual', 'FireplaceQu',
#              'GarageQual', 'GarageCond',
#             ]

# for var in qual_vars:
#     data[var] = data[var].map(qual_mappings)

In [20]:
# exposure_mappings = {'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

# var = 'BsmtExposure'

# data[var] = data[var].map(exposure_mappings)

In [21]:
# finish_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}

# finish_vars = ['BsmtFinType1', 'BsmtFinType2']

# for var in finish_vars:
#     data[var] = data[var].map(finish_mappings)

In [22]:
# garage_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

# var = 'GarageFinish'

# data[var] = data[var].map(garage_mappings)

In [23]:
# fence_mappings = {'Missing': 0, 'NA': 0, 'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv': 4}

# var = 'Fence'

# data[var] = data[var].map(fence_mappings)

In [24]:
# check absence of na in the data set

with_null = [var for var in data.columns if data[var].isnull().sum() > 0]

with_null

[]

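**Result**

Here the list is empty: none of the variables in the new data contains missing values. In other datasets, this check can reveal quite a few variables with missing data.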

In [25]:
# did those have missing data in the train set?

# note: with_frequent_category and with_string_missing are commented out above,
# so we can only compare against vars_with_na

[var for var in with_null if var in vars_with_na]

[]

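**IMPORTANT**

New data can contain missing values in variables that had none in the train set, and our hard-coded dictionaries would not handle them. In this dataset the check returns an empty list, so no such variables appear.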
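
A check like this can be wrapped in a small helper and run every time new data arrives. The sketch below is hypothetical: the function name `unexpected_nulls`, the example frame and the assumption that only `bmi` had missing values in the train set are introduced for illustration.

```python
import numpy as np
import pandas as pd


def unexpected_nulls(df, expected_na_vars):
    """Return columns with missing values that had no missing data in training."""
    with_null = [var for var in df.columns if df[var].isnull().any()]
    return [var for var in with_null if var not in expected_na_vars]


# hypothetical new data with missing values in 'age' and 'bmi'
new_data = pd.DataFrame({
    "age": [28, 22, np.nan],
    "bmi": [32.7, np.nan, 33.6],
    "smoker": ["no", "no", "yes"],
})

# assumption: only 'bmi' had missing values in the train set
problem_vars = unexpected_nulls(new_data, expected_na_vars=["bmi"])
print(problem_vars)
```

Any variable returned by the helper has no imputation rule yet, so it needs investigating before the data is scored.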

### Removing Rare Labels

For the remaining categorical variables, we group the categories that are present in fewer than 1% of the observations into a "Rare" category.

In [26]:
# # create a dictionary with the most frequent categories per variable

# # note the amount of hard coding that I need to do.

# # Can you think of an alternative? Perhaps we could have saved this as a pickle
# # and loaded it here, instead of hard-coding.

# # But that means that we need to go back to the Feature Engineering notebook and change
# # the code so that we store the pickle. So there are still some code changes that we need to make.

# frequent_ls = {
#     'MSZoning': ['FV', 'RH', 'RL', 'RM'],
#     'Street': ['Pave'],
#     'Alley': ['Grvl', 'Missing', 'Pave'],
#     'LotShape': ['IR1', 'IR2', 'Reg'],
#     'LandContour': ['Bnk', 'HLS', 'Low', 'Lvl'],
#     'Utilities': ['AllPub'],
#     'LotConfig': ['Corner', 'CulDSac', 'FR2', 'Inside'],
#     'LandSlope': ['Gtl', 'Mod'],
#     'Neighborhood': ['Blmngtn', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr', 'Crawfor',
#                      'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel', 'NAmes', 'NWAmes',
#                      'NoRidge', 'NridgHt', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW',
#                      'Somerst', 'StoneBr', 'Timber'],

#     'Condition1': ['Artery', 'Feedr', 'Norm', 'PosN', 'RRAn'],
#     'Condition2': ['Norm'],
#     'BldgType': ['1Fam', '2fmCon', 'Duplex', 'Twnhs', 'TwnhsE'],
#     'HouseStyle': ['1.5Fin', '1Story', '2Story', 'SFoyer', 'SLvl'],
#     'RoofStyle': ['Gable', 'Hip'],
#     'RoofMatl': ['CompShg'],
#     'Exterior1st': ['AsbShng', 'BrkFace', 'CemntBd', 'HdBoard', 'MetalSd', 'Plywood',
#                     'Stucco', 'VinylSd', 'Wd Sdng', 'WdShing'],

#     'Exterior2nd': ['AsbShng', 'BrkFace', 'CmentBd', 'HdBoard', 'MetalSd', 'Plywood',
#                     'Stucco', 'VinylSd', 'Wd Sdng', 'Wd Shng'],

#     'MasVnrType': ['BrkFace', 'None', 'Stone'],
#     'Foundation': ['BrkTil', 'CBlock', 'PConc', 'Slab'],
#     'Heating': ['GasA', 'GasW'],
#     'CentralAir': ['N', 'Y'],
#     'Electrical': ['FuseA', 'FuseF', 'SBrkr'],
#     'Functional': ['Min1', 'Min2', 'Mod', 'Typ'],
#     'GarageType': ['Attchd', 'Basment', 'BuiltIn', 'Detchd'],
#     'PavedDrive': ['N', 'P', 'Y'],
#     'PoolQC': ['Missing'],
#     'MiscFeature': ['Missing', 'Shed'],
#     'SaleType': ['COD', 'New', 'WD'],
#     'SaleCondition': ['Abnorml', 'Family', 'Normal', 'Partial'],
#     'MSSubClass': ['20', '30', '50', '60', '70', '75', '80', '85', '90', '120', '160', '190'],
# }

In [27]:
# for var in frequent_ls.keys():
    
#     # replace rare categories by the string "Rare"
#     data[var] = np.where(data[var].isin(
#         frequent_ls[var]), data[var], 'Rare')

### Encoding of categorical variables

Next, we need to transform the strings of the categorical variables into numbers. 

In [28]:
# # we need the mappings learned from the train set. Otherwise, our model is going
# # to produce inaccurate results

# # note the amount of hard coding that we need to do.

# # Can you think of an alternative?

# # Perhaps we could have saved this as a pickle
# # and loaded it here, instead of hard-coding.

# # But that means that we need to go back to the Feature Engineering notebook and change
# # the code so that we store the pickle. So there are still some code changes that we need to make.

ordinal_mappings = {
    'region': {
        'southeast': 0, 'southwest': 1, 'northeast': 2, 'northwest': 3},
    'sex': {
        'female': 0, 'male': 1},
    'smoker': {
        'no': 0, 'yes': 1}    
}

In [29]:
for var in ordinal_mappings.keys():

    ordinal_label = ordinal_mappings[var]

    # use the dictionary to replace the categorical strings by integers
    data[var] = data[var].map(ordinal_label)
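
Note that `Series.map` returns `NaN` for any label that has no entry in the dictionary, so a category that appears only in the new data silently becomes a missing value. A minimal sketch of how to detect this before encoding is shown below. The example frame and the label `'other'` are made up for illustration.

```python
import pandas as pd

ordinal_mappings = {
    "sex": {"female": 0, "male": 1},
    "smoker": {"no": 0, "yes": 1},
}

# hypothetical new data with a label that was absent from the train set
new_data = pd.DataFrame({
    "sex": ["female", "male", "other"],
    "smoker": ["no", "yes", "no"],
})

# collect the labels that have no entry in the mappings
unseen = {
    var: sorted(set(new_data[var].dropna()) - set(mapping))
    for var, mapping in ordinal_mappings.items()
}
unseen = {var: labels for var, labels in unseen.items() if labels}

# encode; unseen labels become NaN
encoded = new_data.copy()
for var, mapping in ordinal_mappings.items():
    encoded[var] = encoded[var].map(mapping)
```

If `unseen` is not empty, we need to decide how to handle those labels before scoring, since the scaler and the model will not accept missing values.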

In [30]:
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,28,0,32.694647,3,0,2
1,22,0,29.606817,0,0,2
2,38,0,33.567011,2,1,3
3,22,0,29.216607,0,0,3
4,47,1,32.982643,3,1,3


In [31]:
# # check absence of na in the data set

# with_null = [var for var in data.columns if data[var].isnull().sum() > 0]

# len(with_null)

In [32]:
# # there is missing data in a lot of the variables.

# # unfortunately, the scaler will not work with missing data, so
# # we need to fill those values

# # in the real world, we would try to understand where they are coming from
# # and why they were not present in the training set

# # here I will just fill them in quickly to proceed with the demo

# data.fillna(0, inplace=True)

## Feature Scaling

We will scale the features using the minimum and maximum values learned from the train set:

In [33]:
# load the scaler we saved in the notebook on Feature Engineering

# fortunately, we were smart and saved it, but this is an easy step
# to forget

scaler = joblib.load('minmax_scaler.joblib') 

data = pd.DataFrame(
    scaler.transform(data),
    columns=data.columns
)

In [34]:
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,0.222222,0.0,0.408169,0.6,0.0,0.666667
1,0.088889,0.0,0.278196,0.0,0.0,0.666667
2,0.444444,0.0,0.444889,0.4,1.0,1.0
3,0.088889,0.0,0.261771,0.0,0.0,1.0
4,0.644444,1.0,0.420292,0.6,1.0,1.0


In [35]:
# load the pre-selected features
# ==============================

features = pd.read_csv('selected_features.csv')
features = features['0'].to_list() 

# reduce the new data to the selected features
data = data[features]

print(data.shape)
data.head()

(800, 6)


Unnamed: 0,age,sex,bmi,children,smoker,region
0,0.222222,0.0,0.408169,0.6,0.0,0.666667
1,0.088889,0.0,0.278196,0.0,0.0,0.666667
2,0.444444,0.0,0.444889,0.4,1.0,1.0
3,0.088889,0.0,0.261771,0.0,0.0,1.0
4,0.644444,1.0,0.420292,0.6,1.0,1.0


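Note that in this case all 6 engineered variables are among the selected features, so all of them are fed to the model. In a dataset with many more variables, we would engineer a lot of variables and feed only a subset of them to the model.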

**What could we do differently?**

We could have, of course, engineered only the variables that we are going to use in the model. But that means:

- identifying which variables we need
- identifying which transformation we need per variable
- redefining our dictionaries accordingly
- retraining the MinMaxScaler only on the selected variables (at the moment, it is trained on the entire dataset)

That means that we need extra code to train the scaler only on the selected variables, probably by removing the scaler from the Feature Engineering notebook and moving it to the Feature Selection one.

We need to be really careful when rewriting the code here, to make sure we do not forget any of the variables or engineer them incorrectly.

In [36]:
# now let's load the trained model

model = joblib.load('decisiontree_classifier.joblib') 

# let's obtain the predictions
pred = model.predict(data)

pred

array([0, 0, 1, 0, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,
       0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 1,
       2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 2, 0, 0, 0, 2, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2,
       0, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 1, 0, 1, 2, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [37]:
submission = pd.read_csv('sample_submit.csv', header=None)
submission[1] = pred
submission.to_csv('submit_01_decisiontree.csv', header=False, index=False)

# pd.DataFrame(pred).to_csv('submit.csv', header=False, index=True)
# submit = pd.concat([data, pd.DataFrame(pred)], axis=0)
# submit
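
An alternative to relying on the row order of a separate file is to keep the ids from the new data and build the output from them. The sketch below is only an assumption about how this could look: the column names, the tiny frame and the stand-in predictions are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical new data; 'id' identifies each customer
new_data = pd.DataFrame({"id": [0, 3, 6], "age": [28, 22, 38]})

# keep the ids before dropping the column from the features
ids = new_data["id"]
features = new_data.drop(columns=["id"])

# stand-in for model.predict(features)
pred = np.array([0, 0, 1])

# fail early if the number of predictions does not match the number of rows
assert len(pred) == len(ids)

submission = pd.DataFrame({"id": ids.to_numpy(), "prediction": pred})
```

The length check catches the case where rows were dropped or duplicated somewhere between loading and scoring.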


In [38]:
submission

Unnamed: 0,0,1
0,0,0
1,3,0
2,6,1
3,7,0
4,8,1
...,...,...
795,1979,0
796,1983,0
797,1985,0
798,1989,1


In [70]:
submission[1].value_counts()

0    669
1     74
2     57
Name: 1, dtype: int64

In [None]:
# let's plot the distribution of the predicted classes
# pd.Series(pred).hist(bins=50)
# plt.show()

What shortcomings, inconveniences and problems did you find when scoring new data?

# List of problems

- re-wrote a lot of code ==> repetitive
- hard coded a lot of parameters ==> if these change we need to re-write them again
- engineered a lot of variables that we actually do not need for the model
- additional variables may present missing data, and we do not know what to do with them

We can minimize these hurdles by using open-source libraries, as we will see in the next videos.
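
As one possible illustration of the idea, and not necessarily the approach taken later in the course, the sketch below chains encoding, scaling and the model in a single scikit-learn `Pipeline`. The data, the class labels and the step names are made up. The learned parameters live inside the fitted object instead of being hard-coded.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# hypothetical train set and class labels
train = pd.DataFrame({
    "age": [19, 33, 45, 52, 27, 61],
    "bmi": [27.9, 22.7, 30.1, 35.2, 24.3, 29.8],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
})
y_train = [2, 0, 0, 2, 0, 1]

categorical = ["smoker"]
numerical = ["age", "bmi"]

preprocess = ColumnTransformer([
    # labels unseen during fit are encoded as -1 instead of raising an error
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                           unknown_value=-1), categorical),
    ("num", "passthrough", numerical),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("scale", MinMaxScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(train, y_train)

# scoring new data is a single call; no parameters are copied by hand
new_data = pd.DataFrame({
    "age": [30, 58],
    "bmi": [26.0, 33.1],
    "smoker": ["no", "yes"],
})
pred = pipe.predict(new_data)
```

The fitted pipeline can be saved with `joblib.dump` and loaded with `joblib.load`, in the same way as the scaler and the model in this notebook.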