<a href="https://colab.research.google.com/github/mostaf7583/IEEE-tasks/blob/main/IEEE_Session_2_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Welcome to the second session's task!</h1>

This task's objectives are to:



*   Apply end-to-end machine learning steps on a house prices dataset
*   Use a hands-on approach to learning scikit-learn and its uses in machine learning



<h1><b>End-to-end Machine Learning</b></h1>

The steps are to:



1.   Download the data
2.   Understand and Clean the Data
3.   Preprocess and Feature Engineer the Data
4.   Train a model using the data
5.   Measure the model's performance using cross validation
6.   Do hyperparameter tuning on the model and/or try different models
7.   Use the final selected model to make predictions on the test set



## Download the Data

This part really depends on the problem and how it's presented to you. 

For example, I have uploaded the data to Kaggle so you can download it from there. 

Google Colab also lets you upload files manually, but those are deleted once the Colab session ends. It's better to download the data using code, so you can simply re-run that cell whenever you need the data again.

For this task, however, I have uploaded the data to Google Drive and it can be downloaded like so:

<h2>Extra information in case you're wondering what the next cell does:
</h2>


The exclamation mark at the start of a line lets us run shell (bash) commands from a notebook cell. The shell is a way to use the operating system from the command line.

The most common and useful commands are:

1.   pwd (short for print working directory): prints the directory we are executing commands in
2.   ls (short for list): lists a directory's contents
3.   cd (short for change directory): changes our current directory
4.   mkdir (short for make directory): creates a new directory
5.   mv (short for move): moves a file or folder from one location to another
6.   cp (short for copy): copies a file from one location to another
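
If you'd rather stay in Python, most of these commands have rough standard-library equivalents. The sketch below is only an illustration: it works inside a temporary directory, and the file and folder names (sample.txt, data, copy.txt, moved.txt) are made up for the example.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing outside it is touched
workdir = Path(tempfile.mkdtemp())

print(Path.cwd())                       # pwd: print the current working directory
# os.chdir(workdir) would be the equivalent of cd (not run here)

data_dir = workdir / "data"
data_dir.mkdir()                        # mkdir: create a new directory

sample = workdir / "sample.txt"         # hypothetical file, for illustration only
sample.write_text("hello")

shutil.copy(str(sample), str(data_dir / "copy.txt"))    # cp: copy a file
shutil.move(str(sample), str(data_dir / "moved.txt"))   # mv: move a file

listing = sorted(os.listdir(data_dir))  # ls: list directory contents
print(listing)
```

After the move, sample.txt no longer exists in its original location, while the copy leaves its source untouched until the move happens.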


gdown is a program that comes pre-installed with Colab. It lets us download a file from Google Drive using the file's ID.

In [None]:
!gdown 1lXJTBP4mOVnS2nMTusEcUHjqWgZGNkLW
!gdown 1U2IWwN78Td1XQz38_u60pXsU_WgQzCtB
!gdown 1RwXmgcY6rh-JPgGHWLRn5lzom8YMIIFC

Downloading...
From: https://drive.google.com/uc?id=1lXJTBP4mOVnS2nMTusEcUHjqWgZGNkLW
To: /content/house_prices_train.csv
100% 1.06M/1.06M [00:00<00:00, 24.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1U2IWwN78Td1XQz38_u60pXsU_WgQzCtB
To: /content/house_prices_test.csv
100% 208k/208k [00:00<00:00, 19.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1RwXmgcY6rh-JPgGHWLRn5lzom8YMIIFC
To: /content/sample_submission.csv
100% 49.8k/49.8k [00:00<00:00, 35.4MB/s]


<h2>Our goal for this task is to use the house_prices_train.csv dataset to create a model, then use that model to predict prices for house_prices_test.csv.

We will then submit those price predictions to Kaggle in the same format as the file sample_submission.csv.</h2>

In [None]:
import pandas as pd
# Show floats with two decimal places so printed numbers are easier to read
pd.options.display.float_format = '{:.2f}'.format

train = pd.read_csv('house_prices_train.csv')

## Take a Quick Look at the Data Structure

<h1><b>Q1: Use some pandas built-in functions like head, describe, value_counts, etc. to take an initial look at the data</b></h1>
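
As a rough sketch of what these functions return, here they are on a tiny toy DataFrame. The values are invented for illustration and the real dataset's columns may behave differently; the same calls should work on train.

```python
import pandas as pd

# Toy data for illustration only; the real columns and values may differ
toy = pd.DataFrame({
    "type": ["Apartment", "Villa", "Apartment", "Duplex"],
    "area": [120.0, 350.0, 95.0, 210.0],
    "price": [1_500_000, 9_000_000, 1_100_000, 3_200_000],
})

print(toy.head(2))        # first two rows
toy.info()                # column dtypes and non-null counts
summary = toy.describe()  # summary statistics for the numeric columns
print(summary)

type_counts = toy["type"].value_counts()  # how often each category appears
print(type_counts)
```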

# Discover and Visualize the Data to Gain Insights

<h1><b>Q2: Use any visualization library you like to visualize some of the dataset's attributes. Hint: You can use the pandas DataFrame's built-in hist and box plot methods, or pandas scatter_matrix. Other solutions are also welcome.</b></h1>

## Looking for Correlations

<h1><b>Q3: Look for correlations using the pandas df.corr function, which outputs a correlation matrix. Select the price column from that correlation matrix and sort its values in descending order using a built-in pandas function.</b></h1>
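
A minimal sketch of the idea on toy numeric data (the numbers are invented for illustration). Note that in recent pandas versions df.corr raises an error if the DataFrame contains non-numeric columns unless you pass numeric_only=True, so you may need that on the real dataset.

```python
import pandas as pd

# Toy numeric data, invented for illustration
toy = pd.DataFrame({
    "area": [80, 100, 120, 150, 200],
    "bedrooms": [3, 2, 3, 2, 4],
    "price": [800, 1000, 1300, 1500, 2100],
})

corr_matrix = toy.corr()  # pairwise Pearson correlations between columns
# Take the price column and sort from most to least positively correlated
price_corr = corr_matrix["price"].sort_values(ascending=False)
print(price_corr)
```

The price column always correlates perfectly with itself, so it appears first; the interesting entries are the ones below it.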

# Prepare the Data for Machine Learning Algorithms

## Data Cleaning

<h1><b>Q4: Drop the ad_id column from the training set.</b></h1>

<h1><b>Q5: Check how many NaN values there are in the dataset. Demonstrate the three ways we've covered to deal with NaN values</b></h1>

<h1><b>Q5.1: Demonstrate how to drop the furnished column entirely (don't alter the train dataframe)</b></h1>

<h1><b>Q5.2: Demonstrate how to drop the rows which have a furnished value of NaN (don't alter the train dataframe)</b></h1>

<h1><b>Q5.3: Demonstrate how to count the most frequent value in the furnished column and how to fill na values with this value using pandas fillna function (don't alter the train dataframe)</b></h1>
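
One possible way to approach Q5 through Q5.3, sketched on a toy DataFrame. Only the furnished column name comes from the task; the values are invented. Each call returns a new object, so the original DataFrame is left unchanged.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value, invented for illustration
toy = pd.DataFrame({
    "area": [100, 120, 90, 150],
    "furnished": ["no", np.nan, "yes", "no"],
})

nan_counts = toy.isna().sum()                      # NaN count per column

without_column = toy.drop(columns=["furnished"])   # option 1: drop the column
without_rows = toy.dropna(subset=["furnished"])    # option 2: drop rows with NaN

most_frequent = toy["furnished"].mode()[0]         # most frequent value
filled = toy.fillna({"furnished": most_frequent})  # option 3: fill NaN values

print(nan_counts)
print(filled)
```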

The next cell illustrates how you could solve Q5.3 using sklearn's SimpleImputer. You can use this as a reference for Q7.

In [None]:
from sklearn.impute import SimpleImputer

columns_to_impute = ['furnished']

imputer = SimpleImputer(strategy="most_frequent")

imputer.fit_transform(train[columns_to_impute])


array([['no'],
       ['no'],
       ['no'],
       ...,
       ['no'],
       ['no'],
       ['no']], dtype=object)

# Feature Engineering

<h1><b>Q6: You're free to experiment here with adding any features that could help improve the model's performance. Make sure to add them within the feature_engineer function</b></h1>

In [None]:

def feature_engineer(df):
  new_df = df.copy()
  new_df['area_to_bathroom_ratio'] = new_df.area / new_df.bathrooms
  return new_df

train = feature_engineer(train)

# Preprocessing

<h1><b>Q6: Demonstrate how to use sklearn's OneHotEncoder on the categorical columns. Use the cat_columns and num_columns lists to quickly specify categorical or numerical attributes.</b></h1>

In [None]:
#Don't forget to add any column names you created during feature engineering here
cat_columns = ['type', 'bedrooms', 'bathrooms', 'level', 'furnished', 'rent', 'city', 'region']

num_columns = ['area', 'area_to_bathroom_ratio']

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = ...  # TODO: call the OneHotEncoder constructor here
train_cat_one_hot = ...  # TODO: call cat_encoder's fit_transform function on the train set's cat columns here

If the previous cell was correct, you should see the one-hot encoded outputs as a DataFrame by running the next cell.

In [None]:
cat_one_hot_df = pd.DataFrame(train_cat_one_hot.toarray(), 
                              columns= cat_encoder.get_feature_names_out(), 
                              index = train.index)
cat_one_hot_df

<h1><b>Q7: Demonstrate how to use sklearn's MinMaxScaler on the numerical columns. Use the cat_columns and num_columns lists to quickly specify categorical or numerical attributes. Use the same process as OneHotEncoder but for num_columns</b></h1>
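
For intuition, MinMaxScaler with its default settings rescales each column to the range 0 to 1 by computing (x - min) / (max - min). The sketch below checks this on a toy column of areas; the numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single toy numeric column, invented for illustration
areas = np.array([[50.0], [100.0], [150.0], [250.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(areas)

# The same computation done by hand
manual = (areas - areas.min()) / (areas.max() - areas.min())

print(scaled.ravel())
```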

All previous preprocessing steps were for demonstration purposes. We'll actually be using sklearn's pipeline feature to construct our preprocessing pipeline like so:

## Transformation Pipelines

The next cell is an example of how to create a pipeline for categorical features only. It lets us execute all the preprocessing steps with a single call to cat_pipeline.

In [None]:
from sklearn.pipeline import make_pipeline

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))

cat_pipeline.fit_transform(train[cat_columns])

In [None]:
cat_pipeline.steps

[('simpleimputer', SimpleImputer(strategy='most_frequent')),
 ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]

<h1><b>Q8: Define num_pipeline in the same way as cat_pipeline above but using SimpleImputer(strategy="median") and MinMaxScaler() instead</b></h1>

The next cell shows how to combine the two pipelines for numerical and categorical features to create our preprocessing pipeline for all features

In [None]:
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_columns),
    ("cat", cat_pipeline, cat_columns),
])

<h1><b>Q9: Define a variable X equal to all columns in the train dataset except price. Define another variable y that's just the price column</b></h1>

# Select and Train a Model

## Training and Evaluating on the Training Set

The following cell demonstrates how to import a simple LinearRegression model. It also shows that you can append a model to an existing preprocessing pipeline, producing a single pipeline that handles both preprocessing and training.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = make_pipeline(preprocessing, LinearRegression())
# Fit the model with X as the input and y as the target
lin_reg.fit(X, y)
# Predicting on the training set gives an overly optimistic error estimate;
# it's only done here to demonstrate how to use the model to make predictions
predictions = lin_reg.predict(X)
# Let's see the RMSE for the model on the training set.
# Look at the documentation for mean_squared_error for more info.
# We take the square root ourselves because the squared=False argument
# is deprecated in newer scikit-learn versions
lin_rmse = mean_squared_error(y, predictions) ** 0.5
lin_rmse

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('min_max_scaling',
                                                                   MinMaxScaler())]),
                                                  ['area']),
                                                 ('cat',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                 

<h1><b>Q10: Import the RandomForestRegressor model and create a pipeline like the previous cell but with RandomForestRegressor instead of LinearRegression. Don't forget to set the random_state parameter to any number to get reproducible results</b></h1>

## Better Evaluation Using Cross-Validation

<h1><b>Q11: Now import and use the cross_val_score function from sklearn. You need to pass in your model's pipeline, the X and y, and then specify the scoring metric to be "neg_root_mean_squared_error". </b></h1>

Note: This might take a few minutes to run!

Another Note: The cross_val_score function already splits the training data into training and validation data and reports the validation score.
There's no test data for you to split because I had already split it and won't make it available to you till later on :P
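
One detail worth knowing: with "neg_root_mean_squared_error" the scores come back negative, because sklearn's scoring convention is that higher is better. The sketch below shows this on synthetic data with a plain LinearRegression; the data and the choice of cv=5 are assumptions made for illustration, not part of the task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: price grows roughly linearly with area, plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(50, 300, size=(100, 1))
y_toy = 10_000 * X_toy[:, 0] + rng.normal(0, 50_000, size=100)

scores = cross_val_score(LinearRegression(), X_toy, y_toy,
                         scoring="neg_root_mean_squared_error", cv=5)

# Flip the sign to get the usual (positive) RMSE for each validation fold
rmse_per_fold = -scores
print(rmse_per_fold)
print(rmse_per_fold.mean())
```

The mean of the per-fold RMSE values is a reasonable single number for comparing models.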

# Fine-Tune Your Model

## Grid Search

The following demonstrates how to use grid search to run an exhaustive hyperparameter search using the full pipeline (preprocessing + model).

It tries all combinations of the provided values and selects the model with the best score.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
param_grid = [
    {'random_forest__n_estimators': [16, 32],
     'random_forest__max_features': [4, 6, 8]},
    {'random_forest__n_estimators': [64, 128],
     'random_forest__max_features': [6, 8, 10]},
]
grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(X, y)
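
To get a feel for how much work the grid above implies, you can count its combinations with sklearn's ParameterGrid. This sketch copies the same param_grid and assumes the same cv=3.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search cell
param_grid = [
    {'random_forest__n_estimators': [16, 32],
     'random_forest__max_features': [4, 6, 8]},
    {'random_forest__n_estimators': [64, 128],
     'random_forest__max_features': [6, 8, 10]},
]

# Each dict contributes the product of its list lengths: 2*3 + 2*3
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 3  # one fit per candidate per cross-validation fold

print(n_candidates, n_fits)
```

With its default refit=True, GridSearchCV then trains the best combination once more on the full training data.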

The best hyperparameter combination found:

In [None]:
grid_search.best_params_

The model with the best score will be available as the best_estimator_ attribute of the grid_search object.

Since we are using pipelines, the final model is itself a pipeline: it applies the preprocessing steps to its input and then runs the model's prediction step.
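
The same behaviour can be seen on a much smaller scale. The toy pipeline below is an assumption made for illustration (a MinMaxScaler followed by LinearRegression, with invented data and step names), not the task's actual model. It shows that predict accepts raw, unscaled input and that individual steps can be retrieved by name.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data where price is exactly 10 times the area
toy_X = pd.DataFrame({"area": [50.0, 100.0, 150.0, 200.0]})
toy_y = [500.0, 1000.0, 1500.0, 2000.0]

# Hypothetical step names, chosen for this example
toy_pipeline = Pipeline([
    ("scaling", MinMaxScaler()),
    ("model", LinearRegression()),
])
toy_pipeline.fit(toy_X, toy_y)

# Raw input goes in; the pipeline scales it before the model sees it
new_house = pd.DataFrame({"area": [125.0]})
prediction = toy_pipeline.predict(new_house)

# Individual steps stay accessible by name
model_step = toy_pipeline.named_steps["model"]
print(prediction, model_step)
```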

In [None]:
final_model = grid_search.best_estimator_

In [None]:
final_model

Let's look at the score of each hyperparameter combination tested during the grid search:

In [None]:
cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)

cv_res.head()

## Evaluate Your System on the Test Set

Now we'll load the test set, run our feature engineering function on it, and then pass it to our model's predict function (which does the preprocessing steps before predicting because it's a pipeline object)

In [None]:
test = pd.read_csv('house_prices_test.csv')
test = feature_engineer(test)
test_predictions = final_model.predict(test)

We have a sample submission file showing the format in which we should submit our predictions. We'll just replace the price column with our predictions here and save the file.

In [None]:
submission = pd.read_csv('sample_submission.csv')
# Overwrite the existing price column with the model's predictions
submission['price'] = test_predictions

submission

This is the file you should submit to Kaggle. You can download it from the menu on the left of the Colab website.

In [None]:
submission.to_csv('submission.csv', index=False)

<h1><b>You made it to the end. Awesome. 🔥🔥🔥</b></h1>

You're of course free to continue tinkering with the task after finishing it to try and get a better score (which will be reflected on the Kaggle leaderboard!)

You can try playing with different features in the feature engineering step, or importing and trying different models and/or hyperparameters, or maybe different preprocessing steps.

The possibilities are endless!
