# Notebook for implementing Kaggle Machine Learning Education Course

## **Level 1**

## 1. The Data
### 1.1 Loading dataset

In [2]:
import pandas as pd

main_file_path = '../input/train.csv'
data = pd.read_csv(main_file_path)

### 1.2 Inspecting the dataset
### 1.2.1 Using Describe
#### describe returns summary statistics for the dataset's numerical columns: count, mean, standard deviation, min, max, and the 25%, 50% and 75% quartiles. It is a method, so .describe() must be called on the loaded dataframe.

In [4]:
data.describe()

### 1.2.2 Using columns
#### columns lists every column in the dataset. Notice that the output of the describe method covers fewer columns than are listed below. That is because statistics such as mean, min and max cannot be computed for non-numeric data, so describe leaves those columns out. Unlike describe, columns is an attribute and not a method, so it is accessed without parentheses.

#### Now that we know there are other columns to examine, we will check their datatypes with the .info() method in the next section.

In [10]:
data.columns
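
#### To make the difference between describe and columns concrete, here is a minimal sketch on a small made-up dataframe. The name toy and its values are assumptions for illustration, not the real housing data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],   # non-numeric column
    "SalePrice": [208500, 181500, 223500],
})

# Columns summarized by describe() versus all columns in the frame
described = list(toy.describe().columns)
all_columns = list(toy.columns)

# Whatever describe() left out is non-numeric
skipped = [c for c in all_columns if c not in described]
print(skipped)
```

#### Running the same comparison on data should list the non-numeric columns that describe left out.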

### 1.2.3 Using Info
#### info is another method for getting to know the dataset. It shows the total number of entries and the number of non-null entries in each column. Any column whose count is lower than the total has missing values that will need to be filled in or otherwise handled.

#### It also shows the datatype of each column, so we know there are different types of data to handle. The column names and datatypes together already let us infer something about each column.

In [6]:
data.info()
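
#### The gaps reported by info can also be counted directly. This is a small sketch on made-up values (toy is a hypothetical frame); isnull().sum() is one common way to do it, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing LotFrontage values
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Number of missing entries per column
missing = toy.isnull().sum()

# Show only the columns that actually have gaps
print(missing[missing > 0])
```

#### On the real dataset, the same two lines applied to data would show which columns need attention.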

### 1.2.4 Using head

#### Like the Linux command of the same name, head shows the first rows of the loaded dataset. The number of rows to show can be passed as a parameter; the default is 5.

In [11]:
data.head()

### 1.2.5 Using correlation matrix or corr in short
#### Correlation is a statistical measure of the linear relationship between two variables. Its values range from -1 to +1.
#### The magnitude shows how tightly coupled the variables are: values near 1 or -1 indicate a strong relationship, and values near 0 a weak one. The sign shows the direction. Under positive correlation the variables move together, so when one increases the other tends to increase too. Under negative correlation, one tends to decrease as the other increases.

#### This is better understood by plotting these attributes against each other. As this is still level 1, let's skip that for now.

##### Note: The full correlation matrix can look intimidating, but it becomes much clearer if we inspect one column at a time. Usually the target's column is chosen; here that is SalePrice.

In [15]:
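# Restrict to numeric columns; recent pandas versions raise an error on non-numeric data
corr = data.select_dtypes(include='number').corr()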
corr

In [20]:
# Correlations with SalePrice in descending order. SalePrice correlates perfectly with itself, so it comes first with a value of 1.
corr.SalePrice.sort_values(ascending=False)
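
#### To see what positive and negative correlation look like in numbers, here is a minimal sketch on made-up columns. The names area, price and age are assumptions chosen so that one relationship rises perfectly and the other falls perfectly.

```python
import pandas as pd

# Hypothetical data: price rises with area, age falls as area rises
toy = pd.DataFrame({
    "area": [1, 2, 3, 4],
    "price": [10, 20, 30, 40],
    "age": [40, 30, 20, 10],
})

toy_corr = toy.corr()

# Inspect a single column of the matrix, as suggested in the note above
price_corr = toy_corr["price"].sort_values(ascending=False)
print(price_corr)
```

#### Here price against area comes out at (approximately) +1 and price against age at (approximately) -1. Real data rarely gives such clean values.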

## 2. The Model

### 2.1 Selecting columns for training the model
#### Not all columns are useful. We can start with a few, then add or drop columns based on the results. The correlations from section 1.2.5 are one way to choose which columns to try.

#### The target attribute must be kept out of the predictors before training; otherwise the model would simply be handed the answer. X holds the predictor values and y holds the target values.

In [22]:
y = data.SalePrice
prediction_columns = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
X = data[prediction_columns]

### 2.2 Splitting the data to train and test set
#### Before fitting any model, we need to guard against the model tuning itself too closely to the dataset. A model that fits its training data perfectly will often fail on new data. To detect this, we hold back part of the data as a test (validation) set and evaluate the model on it after training. This tells us how the model performs on data it has not seen.

In [23]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X,y,random_state=0)
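
#### A quick sketch of what train_test_split does by default, using made-up arrays (X_toy and y_toy are assumptions for illustration). When no size is given, scikit-learn holds out 25% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, and a target equal to the row number
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=0)

# Sizes of the two parts
print(len(tr_X), len(va_X))

# No row ends up in both parts
overlap = set(tr_y) & set(va_y)
print(len(overlap))
```

#### Passing random_state fixes the shuffle, so the same rows land in each part on every run.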

### 2.3 DecisionTreeRegressor as the model
#### We have the train and test sets and the columns to use, so the next step is to choose a model. For now, we will use DecisionTreeRegressor, as instructed by the instructor, DanB.

### 2.3.1 Defining and fitting the model
#### A nice thing about scikit-learn is that all its models follow a similar interface:
* #### fit trains the model. We pass the inputs as X and y.
* #### predict returns the predicted values for a new X.
* #### transform is another method, used in pipelines when the data has to go through preprocessing steps before it is fed to the model. Let's ignore this method for now.

In [25]:
from sklearn.tree import DecisionTreeRegressor
dec_tree_reg = DecisionTreeRegressor()
dec_tree_reg.fit(train_X,train_y)

### 2.3.2 Predicting using the model trained
#### Training the model with the fit method is of no use unless we put it to work. We test the model on the validation data we created with the train_test_split function from sklearn's model_selection module.

In [26]:
prediction_y = dec_tree_reg.predict(val_X)

### 2.3.3 Evaluating the model's predictions
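#### We cannot inspect every predicted value by hand, so we need a metric that goes through all the results and summarizes how the model performed. Imagine checking 1000 predictions manually: it would be time-consuming and prone to human error. Here we use the mean absolute error (MAE), the average of the absolute differences between the actual and predicted values.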

In [27]:
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(val_y, prediction_y))
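
#### To show what the metric computes, here is the mean absolute error worked out by hand with numpy on three made-up prices. The arrays actual and predicted are assumptions for illustration.

```python
import numpy as np

# Hypothetical actual and predicted sale prices
actual = np.array([200000, 150000, 300000])
predicted = np.array([210000, 140000, 330000])

# Absolute error of each prediction: 10000, 10000, 30000
abs_errors = np.abs(actual - predicted)

# Mean of the absolute errors: 50000 / 3, about 16666.67
mae_by_hand = abs_errors.mean()
print(mae_by_hand)
```

#### The result is in the same unit as the target, so here it reads as "off by about 16,667 on average".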

### 2.3.4 Exploring options available for model's performance improvement
#### DecisionTreeRegressor has a max_leaf_nodes parameter that caps the number of leaves in the tree, which indirectly limits how deep the tree can grow. We explore this option by passing different values of max_leaf_nodes.

In [29]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [30]:
for max_leaf_nodes in [5,50,500,5000]:
    mae_res = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, mae_res))
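
#### Instead of reading the best value off the printed lines, the scores can be collected and the smallest one picked programmatically. This is a sketch on synthetic data; X_toy, y_toy, toy_mae and the candidate list are assumptions, not part of the course.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: the target depends mostly on the first feature
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(300, 2))
y_toy = 3 * X_toy[:, 0] + rng.normal(0, 1, size=300)
tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=0)

def toy_mae(n_leaves):
    """Validation MAE of a tree limited to n_leaves leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))

# Score every candidate, then keep the one with the lowest error
candidates = [5, 50, 500]
scores = {n: toy_mae(n) for n in candidates}
best_leaves = min(scores, key=scores.get)
print(best_leaves, scores[best_leaves])
```

#### The same pattern should work with get_mae and the housing data by swapping in train_X, val_X, train_y and val_y.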

### 2.4 Trying out other models. 
#### On an online shopping site, it is always a good idea to check a deal's authenticity and compare prices before purchasing. Similarly, experimenting with other models helps us gauge a model's performance and its scope for improvement. Here we use RandomForestRegressor. As the name suggests, a random forest is an ensemble of decision trees whose predictions are averaged.

In [31]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rand_forest_reg = RandomForestRegressor()
rand_forest_reg.fit(train_X, train_y)
pred_y = rand_forest_reg.predict(val_X)
print("Mean Absolute Error using the RandomForestRegressor is: %d" % mean_absolute_error(val_y, pred_y))
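
#### The "group of decision trees" idea can be checked directly: a fitted forest exposes its individual trees in estimators_, and the forest's prediction is the average of theirs. A minimal sketch on synthetic data follows (X_toy, y_toy and n_estimators=10 are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data with a simple relationship between features and target
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 2 * X_toy[:, 0] + X_toy[:, 1]

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_toy, y_toy)

# Ask every tree in the forest for its own prediction
new_rows = X_toy[:5]
per_tree = np.array([tree.predict(new_rows) for tree in forest.estimators_])

# Averaging the trees reproduces the forest's prediction
averaged = per_tree.mean(axis=0)
matches = np.allclose(averaged, forest.predict(new_rows))
print(per_tree.shape, matches)
```

#### Averaging many trees, each trained on a different random sample of the rows, usually smooths out the mistakes of any single tree.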

## 3. The Submission

### 3.1 Navigating the files - The Linux Way

In [None]:
ll ../input/

### 3.2 The whole thing again.
#### We are doing the same thing again (imports, reading the data, training), but this time we use the train and test files provided, so we don't have to split the data ourselves.

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

#reading the train.csv and test.csv file
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

#fetching the target value and predictors value based on columns for training data
predictor_cols = ['LotArea', 'OverallQual', 'YearBuilt', 'TotRmsAbvGrd']
train_X = train_data[predictor_cols]
train_y = train_data.SalePrice

#fetching the predictors value based on columns for test data. will be the same columns as training data
test_X = test_data[predictor_cols]

#defining the RandomForestRegressor
rand_forest_reg = RandomForestRegressor()

In [None]:
#fitting the training predictors and target to the model
rand_forest_reg.fit(train_X, train_y)

#predicting using the test data predictors
predicted_prices = rand_forest_reg.predict(test_X)
print(predicted_prices)

### 3.3 The Final Call
#### We create a dataframe with pandas, with one row per row of the test data provided. We pass the Id field of the test data and our predictions to the DataFrame constructor.

#### to_csv is the pandas way of writing to csv. Ideally, it would have been named write_csv to mirror the read_csv method used in section 1.1, but it's to_csv and we should stick with it.

In [None]:
my_submission = pd.DataFrame({'Id': test_data.Id, 'SalePrice': predicted_prices})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)
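
#### Before uploading, it can help to read the file back and check its shape. This sketch uses an in-memory buffer and made-up values (toy_submission is hypothetical) so that it runs without the competition files. It also shows why index=False is passed.

```python
import io
import pandas as pd

# Hypothetical predictions for three test rows
toy_submission = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [120000.0, 155000.5, 180000.0],
})

# Write without the index, then read the result back
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
read_back = pd.read_csv(buffer)
print(list(read_back.columns), len(read_back))

# Leaving out index=False writes the row index as an extra first column
buffer_with_index = io.StringIO()
toy_submission.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)
print(len(with_index.columns))
```

#### For the real file, the same check would be pd.read_csv('submission.csv'), comparing the number of rows with len(test_data).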

## **Level 1 - The End**