# Intermediate Machine Learning - Kaggle Course
This notebook is just for **reference sake.**

**The course and all content here is provided by Alexis Cook [here](https://www.kaggle.com/learn/intermediate-machine-learning)**

**Note that every code block will raise an error because I haven't plugged in any data; only the syntax is written down.**

## 1. Intro
### What is covered in this course?
- handling missing values
- categorical variables
- ML pipelines
- cross validation
- XGBoost
- Data leakage



## 2. Missing values
There are 3 approaches to dealing with missing values:

**1. Drop columns with missing values**

This approach is risky, as a potentially useful column with just a few missing values could be dropped.

In [20]:
# Get names of columns with at least one missing value
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)


NameError: name 'X_train' is not defined

**2. A better option : imputation**

Imputation fills each missing value with some number, for example the mean of the column.
The filled-in values won't be exact, but this usually yields better results than dropping the column.

In [4]:
import pandas as pd
from sklearn.impute import SimpleImputer

# The default strategy fills missing values with the column mean
myimputer = SimpleImputer()
pd.DataFrame(myimputer.fit_transform(X_train))

**3. An extension to imputation**

Imputation is the standard approach, but the imputed values could be higher or lower than the actual values. So adding an extra column that flags which rows were originally missing (True/False) could be useful.

**Side note -**

How do you keep only the numerical predictors in a dataset?

In [39]:
# like this
X = X_full.select_dtypes(exclude=['object'])


NameError: name 'X_full' is not defined

**Full code -**

In [9]:

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

**Judging which method to apply**

If there are only a few missing values in the data, it is not advisable to drop the complete column.
Impute the missing values instead.

### Use score_dataset() to compare the effects of different missing values handling approaches

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

**Side note**: In one of the examples, we found that imputation performed worse than dropping, despite the low number of missing values.

What could be the reason?

- Some fields, like GarageYrBlt, hold values for which the mean might not be the best fill value.
- There are other criteria such as median or min; however, it is not clear in advance which would be the best to choose.
- After cross-checking with the MAE score, median did produce a better result.
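A small sketch of why the mean can be a poor fill value for a year-like field. The values below are made up for illustration (they are not taken from the course data): two very old garages pull the mean well below the typical year, while the median stays close to the bulk of the values.

```python
from statistics import mean, median

# Hypothetical GarageYrBlt values: mostly recent, with two very old garages
garage_years = [1900, 1910, 1998, 2001, 2003, 2005, 2006, 2007]

mean_fill = mean(garage_years)      # pulled down by the two old outliers
median_fill = median(garage_years)  # middle of the sorted values

print("mean fill:  ", mean_fill)
print("median fill:", median_fill)
```

Whether the median actually scores better still has to be checked on the real data, for example with score_dataset().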

In [13]:
myimputer = SimpleImputer(strategy='median')
myimputer

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='median', verbose=0)

## 3. Categorical variables
Categorical data needs to be preprocessed before it can be plugged into a model.

There are 3 approaches

We will use score_dataset() to test quality of each approach

In [None]:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)


**1. Drop categorical variables**

This approach only works well if the variables do not contain any useful information.

In [None]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

**2. Label encoding**

- Assigns each unique value to a different integer
- This works well with ordinal data (data which have ranking or order)
- eg: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).
- works well with tree-based models (decision tree,random forest)
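A minimal sketch of the ordinal example above, written in plain Python. The mapping is built by hand so that the integers follow the ranking; an encoder that simply sorts the labels (scikit-learn's LabelEncoder orders its classes this way) would assign them alphabetically instead. The sample responses are made up.

```python
# Explicit ordinal mapping for the example above (order chosen by hand)
frequency_order = ["Never", "Rarely", "Most days", "Every day"]
frequency_to_int = {label: rank for rank, label in enumerate(frequency_order)}

responses = ["Rarely", "Every day", "Never", "Most days", "Rarely"]
encoded = [frequency_to_int[r] for r in responses]
print(encoded)

# Alphabetical order (what a sort-based encoder would produce) breaks the ranking
alphabetical = {label: i for i, label in enumerate(sorted(frequency_order))}
print(alphabetical)
```

With the hand-written mapping, "Never" < "Rarely" < "Most days" < "Every day" holds for the integers as well; with the alphabetical one, "Every day" ends up as the smallest value.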



In [28]:
from sklearn.preprocessing import LabelEncoder

# Make copies to avoid changing the original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply a label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE from Approach 2 (Label Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

NameError: name 'X_train' is not defined

**3. One hot encoding**

- Creates a new column for each unique value in the original data.
- For example, a column containing "red", "yellow", "green" is split up into 3 columns.
- Each column holds 1 or 0, marking the presence or absence of that color.
- Good for variables without ranking (nominal variables).
- Does not perform well if the categorical variable takes on a large number of values.

Some parameters - 
- We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data, and
- setting sparse=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix). In scikit-learn 1.2 and later this parameter is named sparse_output.
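To make the idea concrete, here is a hand-written sketch of one-hot encoding in plain Python. It is not how scikit-learn implements it; the helper names fit_categories and one_hot are hypothetical. It mimics the handle_unknown='ignore' behaviour by giving an unseen category a row of zeros.

```python
def fit_categories(values):
    """Learn the category list from training values (sorted for a stable column order)."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode each value as a 0/1 row; unseen categories become an all-zero row."""
    return [[1 if v == c else 0 for c in categories] for v in values]

train_colors = ["red", "yellow", "green", "red"]
valid_colors = ["green", "blue"]  # "blue" never appears in training

categories = fit_categories(train_colors)
train_encoded = one_hot(train_colors, categories)
valid_encoded = one_hot(valid_colors, categories)

print(categories)
print(train_encoded)
print(valid_encoded)
```

The categories are learned from the training values only, and the same column layout is then reused for the validation values.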

In [26]:
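from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# (scikit-learn 1.2+ renames the sparse parameter to sparse_output)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed the index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index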

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

NameError: name 'X_train' is not defined

**Best approach?**

Dropping the columns performs the worst.
Of the remaining two methods, one-hot encoding usually performs best, but it depends on the case.


**Side note**

Sometimes the values in a categorical column differ between the training data and the validation data, in which case you can divide the columns into good_label_cols and bad_label_cols,
and drop the bad label columns.

In [None]:
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if 
                   set(X_train[col]) == set(X_valid[col])]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

**cardinality of categorical variable**

cardinality is the number of unique labels for each column


In [38]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

NameError: name 'object_cols' is not defined

We can use this information to figure out which columns can be one-hot encoded.

We do not one-hot encode high-cardinality columns. We will use 10 as the cardinality threshold.

In [40]:
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

NameError: name 'object_cols' is not defined

## 4. Pipelines

"Pipelines are a simple way to keep your data preprocessing and modeling code organized. 

Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step."

Pros - 
- cleaner code
- fewer bugs
- easy to productionise
- more options for model validation

**1. Define preprocessing steps**

Just as a Pipeline bundles all the steps together, a ColumnTransformer bundles together the different preprocessing steps.


In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [None]:
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]


In [16]:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='median')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

NameError: name 'numerical_cols' is not defined

**2. Define the model**

In [20]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=200, random_state=0)

**3. Create and evaluate the pipeline**

Use a pipeline to bundle the preprocessing and model steps.

- With the pipeline, we preprocess the training data and fit the model in a single line of code
- With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions

In [22]:
from sklearn.metrics import mean_absolute_error

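# Bundle preprocessing and modelling code in a pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocess the training data and fit the model
my_pipeline.fit(X_train, y_train)

# Preprocess the validation data and get predictions
preds = my_pipeline.predict(X_valid)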

score = mean_absolute_error(y_valid,preds) 

print('MAE:',score)

NameError: name 'preprocessor' is not defined

- Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.
- You can also experiment with the model parameters and the numerical and categorical transformers to get the lowest MAE score.

**4. Generate test predictions**

In [25]:
# Preprocess the test data and generate predictions with the fitted pipeline
preds_test = my_pipeline.predict(X_test)

# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

NameError: name 'my_pipeline' is not defined

## 5. Cross validation

![cv](./cv.png)

What is cross-validation?
- In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality.
- For example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset. In this case, we say that we have broken the data into 5 "folds"

- In Experiment 1, we use the first fold as a validation (or holdout) set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set.
- In Experiment 2, we hold out data from the second fold (and use everything except the second fold for training the model). The holdout set is then used to get a second estimate of model quality.

and so on

Cross-validation gives a more accurate measure of model quality.

However, it can take a long time to run, because it fits one model per fold.

What is the tradeoff?
- For small datasets (less than 2 min to run), where the extra computational burden isn't a big deal, you should run cross-validation.
- For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to re-use some of it for holdout.
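The fold logic described above can be sketched in a few lines of plain Python. This is only an illustration of how the experiments rotate the holdout set; the helper k_fold_indices is hypothetical, does no shuffling, and is not the scikit-learn implementation.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each fold is the holdout set exactly once."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # Spread any leftover rows over the first `remainder` folds
        stop = start + fold_size + (1 if fold < remainder else 0)
        valid_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, valid_idx
        start = stop

folds = list(k_fold_indices(n_samples=10, n_folds=5))
for experiment, (train_idx, valid_idx) in enumerate(folds, start=1):
    print(f"Experiment {experiment}: holdout rows {valid_idx}")
```

Every row lands in the holdout set exactly once, so each of the 5 experiments is scored on a different 20% of the data.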

In [29]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor',SimpleImputer()),
                              ('model',RandomForestRegressor(n_estimators=50,random_state=0))
])



In [30]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
print('MAE scores:\n', scores)

**We can create a function to test out different n_estimators value to find the best**

In [None]:
def get_score(n_estimators):
    my_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators, random_state=0))
    ])
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()


In [None]:
# test for different n value
results = {}
n = [50,100,150,200,250,300,350,400]
for i in n:
    results[i] = get_score(i)

In [31]:
# Plot the MAE score for each n_estimators value
import matplotlib.pyplot as plt

plt.plot(list(results.keys()), list(results.values()))
plt.xlabel('n_estimators')
plt.ylabel('MAE')
plt.show()

Using cross-validation, the best n_estimators value was found to be 200,
although upon submission the best value was still 250 (which was found during the pipeline stage).

It has been suggested to use GridSearchCV() to find the best parameters.
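A minimal sketch of what such a search could look like. The data here is synthetic (X_demo and y_demo are stand-ins, not the course dataset), and the grid is kept tiny so that it runs quickly; the step names match the pipeline used in get_score().

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (the course uses the housing dataset instead)
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 3)
y_demo = 10 * X_demo[:, 0] + rng.rand(60)

pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'model__n_estimators': [10, 25, 50]}

search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring='neg_mean_absolute_error')
search.fit(X_demo, y_demo)

print('Best n_estimators:', search.best_params_['model__n_estimators'])
print('Best MAE:', -search.best_score_)
```

This does the same job as the manual loop over get_score(), but the cross-validation and the bookkeeping of the best parameter are handled by GridSearchCV.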

## 6. XGBoost

Gradient boosting dominates many Kaggle competitions and achieves state-of-the-art results on a variety of datasets.

It is an ensemble method which goes through cycles to iteratively add models into an ensemble.


![gradient boosting](./gradient_boosting.png)

Steps - 
- First, we start the ensemble with a single naive model and make predictions.
- Then we calculate the loss using a loss function (e.g. mean squared error).
- Then we use the loss function to fit a new model, chosen so that adding it to the ensemble reduces the loss.
- Then we add the new model to the ensemble and make predictions again.
- Repeat the process.

XGBoost (extreme gradient boosting) is an implementation of gradient boosting with several additional features focused on performance and speed.
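The cycle described above can be sketched from scratch on toy data. This is a deliberately simplified illustration and not how XGBoost works internally: the data is made up, the weak models are one-split "stumps" on a single feature, the loss is squared error, and the helper names fit_stump and mse are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

# Toy 1-D regression data (made up for illustration)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.5, 2.0, 4.0, 4.5, 7.0, 7.5, 9.0]

learning_rate = 0.5
n_estimators = 20

# Start with a naive model: predict the mean everywhere
base = sum(y) / len(y)
preds = [base] * len(y)
history = [mse(y, preds)]

for _ in range(n_estimators):
    # For squared loss, the direction that reduces the loss is the residual
    residuals = [yi - pi for yi, pi in zip(y, preds)]
    stump = fit_stump(x, residuals)
    # Add the new model to the ensemble, scaled by the learning rate
    preds = [pi + learning_rate * stump(xi) for pi, xi in zip(preds, x)]
    history.append(mse(y, preds))

print(f"MSE before boosting: {history[0]:.3f}")
print(f"MSE after {n_estimators} rounds: {history[-1]:.3f}")
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error shrinks step by step. The number of rounds plays the role of n_estimators.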

In [1]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

NameError: name 'X_train' is not defined

In [2]:
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))

NameError: name 'X_valid' is not defined

**parameter tuning**

n_estimators - number of cycles/number of models
- too low causes underfitting
- too high causes overfitting
- usual value range - (100-1000)


early_stopping_rounds (**important**)

early_stopping_rounds offers a way to automatically find the ideal value for n_estimators

- When using early_stopping_rounds, you also need to set aside some data for calculating the validation scores - this is done by setting the eval_set parameter.
- Setting early_stopping_rounds=5 is a reasonable choice. In this case, we stop after 5 straight rounds of deteriorating validation scores.
- If you later want to fit a model with all of your data, set n_estimators to whatever value you found to be optimal when run with early stopping.
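The stopping rule can be illustrated with a short plain-Python sketch. It is a simplified view of the idea, not XGBoost's actual bookkeeping; the validation scores are hypothetical and the function name is made up for this example.

```python
def best_round_with_early_stopping(valid_scores, early_stopping_rounds):
    """Return (best_round, rounds_run) for a list of per-round validation errors.

    Training stops once the error has failed to improve for
    `early_stopping_rounds` consecutive rounds (lower error is better).
    """
    best_score = float("inf")
    best_round = -1
    for current_round, score in enumerate(valid_scores):
        if score < best_score:
            best_score, best_round = score, current_round
        elif current_round - best_round >= early_stopping_rounds:
            return best_round, current_round + 1
    return best_round, len(valid_scores)

# Hypothetical validation MAE after each boosting round
valid_mae = [30.0, 25.0, 22.0, 21.0, 21.5, 21.2, 21.8, 22.0, 22.5, 23.0, 24.0]

best_round, rounds_run = best_round_with_early_stopping(valid_mae, early_stopping_rounds=5)
print("best round (0-based):", best_round)
print("rounds actually run:", rounds_run)
```

In this toy run the best score is reached at round 3 (0-based), so a later fit on all of the data would use 4 estimators.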

learning rate

- instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number (known as the learning rate) before adding them in.
- This means each tree we add to the ensemble helps us less. So, we can set a higher value for n_estimators without overfitting. If we use early stopping, the appropriate number of trees will be determined automatically.
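- In general, a small learning rate and large number of estimators will yield more accurate XGBoost models, though it will also take the model longer to train since it does more iterations through the cycle. The course gives learning_rate=0.1 as the XGBoost default; that matched older releases of the scikit-learn wrapper, while recent releases fall back to the core library default of 0.3, so it is safer to set the value explicitly.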


n_jobs

- On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It's common to set the parameter n_jobs equal to the number of cores on your machine. On smaller datasets, this won't help.
- The resulting model won't be any better, so micro-optimizing for fitting time is typically nothing but a distraction. But, it's useful in large datasets where you would otherwise spend a long time waiting during the fit command.

**Code-**

In [7]:
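# early_stopping_rounds is a constructor argument in recent xgboost releases (1.6+);
# older releases took it as an argument to fit() instead
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                        early_stopping_rounds=5)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)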

NameError: name 'X_train' is not defined

XGBoost is a leading software library for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). With careful parameter tuning, you can train highly accurate models.

## 7. Data leakage

- Data leakage (or leakage) happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even the validation data), but the model will perform poorly in production.
- In other words, leakage causes a model to look accurate until you start making decisions with the model, and then the model becomes very inaccurate.

There are 2 types - target leakage and train-test contamination

**target leakage**

Target leakage occurs when your predictors include data that will not be available at the time you make predictions.
- It is important to think about target leakage in terms of the timing or chronological order that data becomes available, not merely whether a feature helps make good predictions.
- **Think of it like this - If you do not have access to that feature when making a new prediction, then that feature shouldn't be there in the first place.**

Example: the target variable is got_pneumonia (True/False) and there is a column named took_antibiotic_medicine (True/False).
- Antibiotics are taken after the patient is diagnosed with pneumonia.
- So if this model is deployed in the real world, when doctors predict whether a patient will get pneumonia, the took_antibiotic_medicine field will not yet be available, as it is only filled in after the diagnosis is made.
- To prevent this type of data leakage, any variable updated (or created) after the target value is realized should be excluded.



**Train-test contamination**

- This occurs when the validation data influences the preprocessing or the training, even in subtle ways, because the two sets were not properly kept apart.
- For example, imagine you run preprocessing (like fitting an imputer for missing values) before calling train_test_split().
- If your validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps.
- This is easier if you use scikit-learn pipelines. 
- When using cross-validation, it's even more critical that you do your preprocessing inside the pipeline!
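A toy illustration of the imputer example above, in plain Python. The numbers are made up, and the helpers observed_mean and impute are hypothetical; the point is only that a fill value computed before the split has already "seen" the validation rows.

```python
from statistics import mean

# Hypothetical feature column with missing entries (None); the last three rows are validation
feature = [10.0, None, 12.0, 11.0, None, 13.0, 100.0, None, 120.0]
train, valid = feature[:6], feature[6:]

def observed_mean(values):
    """Mean of the non-missing values."""
    return mean(v for v in values if v is not None)

def impute(values, fill):
    """Replace missing entries with the given fill value."""
    return [fill if v is None else v for v in values]

# Contaminated: the fill value is computed before splitting, so it "sees" validation rows
leaky_fill = observed_mean(feature)

# Clean: the fill value is learned from training rows only, then reused on validation rows
clean_fill = observed_mean(train)
valid_imputed = impute(valid, clean_fill)

print("leaky fill:", leaky_fill)
print("clean fill:", clean_fill)
print("validation rows, clean imputation:", valid_imputed)
```

Putting the imputer inside a scikit-learn pipeline gives the clean behaviour automatically, because the pipeline fits every preprocessing step on the training rows only.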

**Example** - credit card acceptance


In [18]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Since there is no preprocessing, we don't need a pipeline (used anyway as best practice!)
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())

"""
output - 
Cross-validation accuracy: 0.979525
"""

NameError: name 'X' is not defined

In [16]:
#data details - 
"""
card: 1 if credit card application accepted, 0 if not
reports: Number of major derogatory reports
age: Age in years plus twelfths of a year
income: Yearly income (divided by 10,000)
share: Ratio of monthly credit card expenditure to yearly income
expenditure: Average monthly credit card expenditure
owner: 1 if owns home, 0 if rents
selfempl: 1 if self-employed, 0 if not
dependents: 1 + number of dependents
months: Months living at current address
majorcards: Number of major credit cards held
active: Number of active credit accounts
"""



A few variables look suspicious. For example, does expenditure mean expenditure on this card or on cards used before applying?

At this point, basic data comparisons can be very helpful:



In [17]:
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f' \
      %((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f' \
      %(( expenditures_cardholders == 0).mean()))

"""
output -
Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02
"""

NameError: name 'X' is not defined

- As shown above, everyone who did not receive a card had no expenditures, while only 2% of those who received a card had no expenditures. It's not surprising that our model appeared to have a high accuracy. But this also seems to be a case of target leakage, where expenditures probably means expenditures on the card they applied for.
- Since share is partially determined by expenditure, it should be excluded too. The variables active and majorcards are a little less clear, but from the description, they sound concerning. In most situations, it's better to be safe than sorry if you can't track down the people who created the data to find out more.
- We would run a model without target leakage as follows:

In [None]:
#Drop leaky predictors from dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())

"""
output - 
Cross-val accuracy: 0.827139
"""

This accuracy is quite a bit lower, which might be disappointing. However, we can expect it to be right about 80% of the time when used on new applications, whereas the leaky model would likely do much worse than that (in spite of its higher apparent score in cross-validation)

Data leakage can be a multi-million dollar mistake in many data science applications. Careful separation of training and validation data can prevent train-test contamination, and pipelines can help implement this separation. Likewise, a combination of caution, common sense, and data exploration can help identify target leakage.

## Data leakage example scenarios

### 1. The Data Science of Shoelaces

Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to review a model one of their employees built to predict how many shoelaces they'll need each month. The features going into the machine learning model include:
- The current month (January, February, etc)
- Advertising expenditures in the previous month
- Various macroeconomic features (like the unemployment rate) as of the beginning of the current month
- The amount of leather they ended up using in the current month

The results show the model is almost perfectly accurate if you include the feature about how much leather they used. But it is only moderately accurate if you leave that feature out. You realize this is because the amount of leather they use is a perfect indicator of how many shoes they produce, which in turn tells you how many shoelaces they need.

Do you think the _leather used_ feature constitutes a source of data leakage? If your answer is "it depends," what does it depend on?

After you have thought about your answer, check it against the solution below.

**Solution:** This is tricky, and it depends on details of how data is collected (which is common when thinking about leakage). Would you at the beginning of the month decide how much leather will be used that month? If so, this is ok. But if that is determined during the month, you would not have access to it when you make the prediction. If you have a guess at the beginning of the month, and it is subsequently changed during the month, the actual amount used during the month cannot be used as a feature (because it causes leakage)

### 2. Return of the Shoelaces

You have a new idea. You could use the amount of leather Nike ordered (rather than the amount they actually used) leading up to a given month as a predictor in your shoelace model.

Does this change your answer about whether there is a leakage problem? If you answer "it depends," what does it depend on?

**Solution:** This could be fine, but it depends on whether they order shoelaces first or leather first. If they order shoelaces first, you won't know how much leather they've ordered when you predict their shoelace needs. If they order leather first, then you'll have that number available when you place your shoelace order, and you should be ok.

### 3. Getting Rich With Cryptocurrencies?

You saved Nike so much money that they gave you a bonus. Congratulations.

Your friend, who is also a data scientist, says he has built a model that will let you turn your bonus into millions of dollars. Specifically, his model predicts the price of a new cryptocurrency (like Bitcoin, but a newer one) one day ahead of the moment of prediction. His plan is to purchase the cryptocurrency whenever the model says the price of the currency (in dollars) is about to go up.

The most important features in his model are:
- Current price of the currency
- Amount of the currency sold in the last 24 hours
- Change in the currency price in the last 24 hours
- Change in the currency price in the last 1 hour
- Number of new tweets in the last 24 hours that mention the currency

The value of the cryptocurrency in dollars has fluctuated up and down by over \$100 in the last year, and yet his model's average error is less than \$1. He says this is proof his model is accurate, and you should invest with him, buying the currency whenever the model says it is about to go up.

Is he right? If there is a problem with his model, what is it?

**Solution:** There is no source of leakage here. These features should be available at the moment you want to make a prediction, and they're unlikely to be changed in the training data after the prediction target is determined. But, the way he describes accuracy could be misleading if you aren't careful. If the price moves gradually, today's price will be an accurate predictor of tomorrow's price, but it may not tell you whether it's a good time to invest. For instance, if it is \$100 today, a model predicting a price of \$100 tomorrow may seem accurate, even if it can't tell you whether the price is going up or down from the current price. A better prediction target would be the change in price over the next day. If you can consistently predict whether the price is about to go up or down (and by how much), you may have a winning investment opportunity.
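The point about misleading accuracy can be checked with a tiny sketch. The prices are made up, and the "model" is simply the naive rule that tomorrow's price equals today's price.

```python
# Hypothetical daily closing prices in dollars (made up for illustration)
prices = [100.0, 100.6, 100.1, 100.9, 101.3, 100.8, 101.5, 102.0]

# "Model": predict that tomorrow's price equals today's price
predictions = prices[:-1]
actuals = prices[1:]

mean_abs_error = sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)
print(f"MAE of the naive price model: ${mean_abs_error:.2f}")

# The same model predicts a change of exactly 0 every day,
# so it says nothing about whether the price will go up or down
predicted_changes = [p - today for p, today in zip(predictions, prices[:-1])]
actual_changes = [a - today for a, today in zip(actuals, prices[:-1])]
print("predicted changes:", predicted_changes)
print("actual changes:   ", [round(c, 1) for c in actual_changes])
```

The average error is well under a dollar, yet the predicted change is always zero, which is useless for deciding when to buy.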

### 4. Preventing Infections

An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?

You have a clever idea. 
1. Take all surgeries by each surgeon and calculate the infection rate among those surgeons.
2. For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as a feature.

Does this pose any target leakage issues?
Does it pose any train-test contamination issues?

**Solution:**  This poses a risk of both target leakage and train-test contamination (though you may be able to avoid both if you are careful).

You have target leakage if a given patient's outcome contributes to the infection rate for his surgeon, which is then plugged back into the prediction model for whether that patient becomes infected. You can avoid target leakage if you calculate the surgeon's infection rate by using only the surgeries before the patient we are predicting for. Calculating this for each surgery in your training data may be a little tricky.

You also have a train-test contamination problem if you calculate this using all surgeries a surgeon performed, including those from the test-set. The result would be that your model could look very accurate on the test set, even if it wouldn't generalize well to new patients after the model is deployed. This would happen because the surgeon-risk feature accounts for data in the test set. Test sets exist to estimate how the model will do when seeing new data. So this contamination defeats the purpose of the test set.
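One way to compute the surgeon feature using only earlier surgeries is sketched below. The surgery log is hypothetical and assumed to be sorted by date, and the function name prior_infection_rates is made up for this example.

```python
from collections import defaultdict

# Hypothetical surgery log, already sorted by date: (surgeon, infected)
surgeries = [
    ("A", 0), ("B", 1), ("A", 1), ("A", 0), ("B", 0), ("A", 1),
]

def prior_infection_rates(rows, default_rate=None):
    """For each surgery, the surgeon's infection rate over EARLIER surgeries only."""
    counts = defaultdict(int)      # surgeries seen so far per surgeon
    infections = defaultdict(int)  # infections seen so far per surgeon
    features = []
    for surgeon, infected in rows:
        if counts[surgeon] == 0:
            features.append(default_rate)  # no history yet for this surgeon
        else:
            features.append(infections[surgeon] / counts[surgeon])
        # Update the running totals only AFTER the feature has been recorded
        counts[surgeon] += 1
        infections[surgeon] += infected
    return features

rates = prior_infection_rates(surgeries)
print(rates)
```

Because the totals are updated after the feature is recorded, a patient's own outcome never feeds into that patient's feature. To also avoid train-test contamination, the running totals would need to be built from training rows only.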

### 5. Housing Prices

You will build a model to predict housing prices.  The model will be deployed on an ongoing basis, to predict the price of a new house when a description is added to a website.  Here are four features that could be used as predictors.
1. Size of the house (in square meters)
2. Average sales price of homes in the same neighborhood
3. Latitude and longitude of the house
4. Whether the house has a basement

You have historic data to train and validate the model.

Which of the features is most likely to be a source of leakage?

**Solution:** potential_leakage_feature=2 (target leakage). Analysis for each feature - 
1. The size of a house is unlikely to be changed after it is sold (though technically it's possible). But typically this will be available when we need to make a prediction, and the data won't be modified after the home is sold. So it is pretty safe.
2. We don't know the rules for when this is updated. If the field is updated in the raw data after a home was sold, and the home's sale is used to calculate the average, this constitutes a case of target leakage. At an extreme, if only one home is sold in the neighborhood, and it is the home we are trying to predict, then the average will be exactly equal to the value we are trying to predict. In general, for neighborhoods with few sales, the model will perform very well on the training data. But when you apply the model, the home you are predicting won't have been sold yet, so this feature won't work the same as it did in the training data.
3. These don't change, and will be available at the time we want to make a prediction. So there's no risk of target leakage here.
4. This also doesn't change, and it is available at the time we want to make a prediction. So there's no risk of target leakage here.


Other resources :
- [Scoring parameters - docs](https://scikit-learn.org/stable/modules/model_evaluation.html)