
In [None]:
"""
We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. 
This step of capturing patterns from data is called fitting or training the model. 
The data used to fit the model is called the training data.
After the model has been fit, you can apply it to new data to predict prices of additional homes.
"""
#Basic Data Exploration - using pandas
"""pandas is the primary tool data scientists use for exploring and manipulating data.
The most important part of the pandas library is the DataFrame.
A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel or a table in a SQL database.
"""

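To make the table analogy concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The `homes` name and its values are hypothetical toy data, not rows from the Melbourne file.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Melbourne dataset.
homes = pd.DataFrame(
    {
        "Rooms": [2, 3, 4],
        "Price": [500000, 750000, 1200000],
    }
)

print(homes.shape)           # (number of rows, number of columns)
print(list(homes.columns))   # column names, like headers in a spreadsheet
print(homes["Price"].max())  # a single column can be summarised on its own
```

Each dictionary key becomes a column and each list position becomes a row, which is the same layout `pd.read_csv` produces from a file.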
In [2]:
import pandas as pd
# save filepath to variable for easier access
melbourne_file_path = './Kaggle/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [None]:
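#Interpreting Data Description
"""The results show 8 numbers for each numeric column in your original dataset.
count shows how many rows have non-missing values.
mean is the average.
std is the standard deviation, which measures how numerically spread out the values are.
To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value.
The first (smallest) value is the min. A quarter of the way through the list you reach a value that is bigger than 25% of the values and smaller than 75% of them: that is the 25% value. The 50% and 75% values are defined the same way, and max is the largest value.
"""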

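The sketch below shows how these summary numbers behave on a small column with one missing value. The `prices` Series is a hypothetical example chosen so the statistics are easy to check by hand.

```python
import pandas as pd

# Hypothetical column with one missing value (None is stored as NaN).
prices = pd.Series([100.0, 200.0, 300.0, 400.0, None], name="Price")

summary = prices.describe()
print(summary)

# count ignores the missing entry, so it is 4 rather than 5.
# 50% is the median of the sorted non-missing values,
# halfway between 200 and 300.
print(summary["count"], summary["50%"])
```

This is also why `count` differs between columns in the Melbourne summary: columns such as `BuildingArea` have fewer non-missing rows than the others.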
###First Machine Learning Model

In [3]:
#We'll start by picking a few variables using our intuition
#To choose variables/columns, we'll need to see a list of all columns in the dataset. 
#That is done with the columns property of the DataFrame
import pandas as pd

melbourne_file_path = './Kaggle/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
print(melbourne_data.columns)

# dropna drops missing values (think of na as "not available").
# axis=0 (or 'index') drops rows that contain a missing value; axis=1 (or 'columns') drops columns that contain one.
melbourne_data = melbourne_data.dropna(axis=0)

#Selecting The Prediction Target
#We'll use the dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y.
y = melbourne_data.Price

#Choosing "Features"
#The columns that are fed into our model (and later used to make predictions) are called "features."
#We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]  #By convention, this data is called X
#Let's quickly review the data we'll be using to predict house prices using the describe method and the head method, which shows the top few rows.
X.describe()
X.head()


Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')


,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954
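
The `dropna` call and the target/feature selection from the cell above can be sketched on a toy table. Everything here (`toy`, `rows_kept`, `cols_kept`, `X_toy`, `y_toy`) is a hypothetical stand-in for the Melbourne data.

```python
import pandas as pd

# Hypothetical toy table with one missing Price value.
toy = pd.DataFrame(
    {
        "Rooms": [2, 3, 4],
        "Landsize": [156.0, 134.0, 120.0],
        "Price": [1035000.0, None, 1600000.0],
    }
)

rows_kept = toy.dropna(axis=0)  # drops the row that contains NaN
cols_kept = toy.dropna(axis=1)  # drops the Price column instead

y_toy = rows_kept.Price                   # prediction target (a Series)
X_toy = rows_kept[["Rooms", "Landsize"]]  # features (a DataFrame)

print(rows_kept.shape, cols_kept.shape)
print(list(X_toy.index))  # original row labels survive, so gaps can appear
```

Because dropped rows keep their original labels out of the index, the remaining labels are not consecutive, which matches the 1, 2, 4, 6, 7 labels in the `X.head()` output above.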


In [None]:
#Building Your Model
#Use the scikit-learn library to create your models. In code, the library is imported as sklearn.
#Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.
"""The steps to building and using a model are:

  Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
  Fit: Capture patterns from provided data. This is the heart of modeling.
  Predict: Just what it sounds like
  Evaluate: Determine how accurate the model's predictions are.


  Many machine learning models allow some randomness in model training.
  Specifying a number for random_state ensures you get the same results in each run.
  This is considered a good practice.
  You can use any number, and model quality won't depend meaningfully on exactly what value you choose.
"""

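The four steps and the effect of `random_state` can be sketched on a toy dataset. The data and the names `X_toy`, `y_toy`, `toy_model`, and `repeat_model` are hypothetical; the feature rows are all distinct so the tree can fit them exactly.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data; feature rows are all distinct.
X_toy = pd.DataFrame({"Rooms": [1, 2, 3, 4], "Landsize": [100, 150, 200, 250]})
y_toy = pd.Series([300000, 450000, 600000, 800000])

# Define: choose the model type and fix the random seed.
toy_model = DecisionTreeRegressor(random_state=1)
# Fit: learn patterns from the features and target.
toy_model.fit(X_toy, y_toy)
# Predict: here on the same rows the model was fitted on.
toy_predictions = toy_model.predict(X_toy)
# Evaluate: compare predictions with the known targets.
toy_mae = mean_absolute_error(y_toy, toy_predictions)

# A second model with the same seed gives identical predictions.
repeat_model = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)
same = (repeat_model.predict(X_toy) == toy_predictions).all()
print(toy_mae, same)
```

Evaluating on the rows used for fitting is only meant to show the mechanics; it says little about how the model would do on unseen houses.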
#example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [4]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

In [9]:
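#We now have a fitted model that we can use to make predictions.
#In practice, you'll want to make predictions for new houses coming on the market rather than for houses we already have prices for.
#Here we predict on the training data to see how the predict function works.
#Note: X.head() prints only the first 5 rows, but predict(X) returns a prediction for every row in X, so the printed array is abbreviated.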
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. ...  385000.  560000. 2450000.]


In [None]:
#The next step is to measure how accurate the model's predictions are and how to improve them.
#Model Validation
"""
You've built a model. But how good is it?
Use model validation to measure the quality of your model. Measuring model quality is the key to iteratively improving your models.
A first approach is to make predictions with the training data and compare those predictions to the target values in the training data.
We need to summarize the model quality into a single metric.
1) Mean Absolute Error (also called MAE)
    Start with the last word, error. The prediction error for each house is:   error = actual - predicted
    So, if a house cost $150,000 and you predicted it would cost $100,000, the error is $50,000.
    With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as:

    On average, our predictions are off by about X.
    To calculate MAE, we first need a model. Here we use the decision tree model fitted above.
"""

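The MAE arithmetic described above can be worked through by hand in plain Python. The `actual` and `predicted` lists are hypothetical; the first pair reuses the $150,000 versus $100,000 example.

```python
# Hypothetical actual and predicted prices for three houses.
actual = [150000, 200000, 320000]
predicted = [100000, 230000, 320000]

# error = actual - predicted, one value per house.
errors = [a - p for a, p in zip(actual, predicted)]
# Absolute values stop positive and negative errors cancelling out.
absolute_errors = [abs(e) for e in errors]
# MAE is the average of the absolute errors.
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -30000, 0]
print(mae)
```

scikit-learn's `mean_absolute_error(actual, predicted)` computes the same quantity without the intermediate lists.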
In [10]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

In [None]:
"""
The measure computed above is an "in-sample" score: the same data was used to build the model and to evaluate it.
Imagine that, in the sample of data you used to build the model, all homes with green doors were very expensive.
The model's job is to find patterns that predict home prices, so it will see this pattern,
and it will always predict high prices for homes with green doors.
Since this pattern was derived from the training data, the model will appear accurate in the training data.
But if the pattern does not hold in new data, the model would be very inaccurate when used on a new dataset.
Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model.

The most straightforward way to do this is to exclude some data from the model-building process, 
and then use those to test the model's accuracy on data it hasn't seen before. 
This data is called validation data.

The scikit-learn library has a function train_test_split to break up the data into two pieces. 
We'll use some of that data as training data to fit the model, 
and we'll use the other data as validation data to calculate mean_absolute_error.
"""

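As a rough illustration of the gap between in-sample and validation error, the sketch below splits a toy dataset and scores the same tree on both pieces. The data and the names `X_toy`, `y_toy`, `in_sample_mae`, and `validation_mae` are hypothetical, and the numbers it prints say nothing about the Melbourne results.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: 8 houses with distinct features and prices.
X_toy = pd.DataFrame({"Rooms": [1, 2, 2, 3, 3, 4, 4, 5],
                      "Landsize": [90, 120, 150, 180, 210, 260, 300, 400]})
y_toy = pd.Series([250, 310, 340, 420, 455, 560, 600, 790]) * 1000

# By default, 25% of the rows are held out for validation.
train_X, val_X, train_y, val_y = train_test_split(X_toy, y_toy, random_state=0)

toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(train_X, train_y)

# Score the same fitted model on seen and unseen rows.
in_sample_mae = mean_absolute_error(train_y, toy_model.predict(train_X))
validation_mae = mean_absolute_error(val_y, toy_model.predict(val_X))
print(len(train_X), len(val_X))
print(in_sample_mae, validation_mae)
```

An unrestricted tree can memorise distinct training rows, so its in-sample error is zero here while the validation error is not. That is the same pattern, at a much smaller scale, as the gap between the two MAE values in this notebook.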
In [11]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

276791.51775338926


In [None]:
#Your mean absolute error for the in-sample data was about 1,100 dollars. Out-of-sample it is more than 275,000 dollars.

In [None]:
#Code assignment
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = './input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

#Step 1: Split Your Data
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X,y,random_state=1)

#Step 2: Specify and Fit the Model
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)

#Step 3: Make Predictions with Validation data
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)
# print the top few validation predictions
print(val_predictions[:5])
# print the top few actual prices from validation data
print(val_y.head())

#Step 4: Calculate the Mean Absolute Error in Validation Data
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)
print(val_mae)

###Underfitting and Overfitting