<a href="https://colab.research.google.com/github/nuageklow/sld_machinelearning/blob/main/MachineLearningBeginner_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Introduction

In this step-by-step tutorial, you will use the California Housing data to predict median house values with linear regression. 

[Part I - Import Libraries and Understand the Data](#import-libraries-and-understand-the-data) 
1. Import Python libraries 
2. Get your data
3. Data Exploration (Breakout Room)


[Part II - Build The Linear Regression Algorithm in Python](#develop-the-linear-regression-algorithm-in-python)
1. Create Basic Statistics Functions for Linear Regression (Breakout Room)  
2. Build the algorithm for Linear Regression  
3. Build Functions to handle dataset and results  
4. Execute 

[Part III - Linear Regression using libraries](#linear-regression-using-libraries) 





# <font color="darkorchid">Import Libraries and Understand the Data</font>

## Required Python Libraries  
[__numpy__](https://numpy.org/) - provides N-dimensional arrays and the functions that operate on them, including basic math functions  

[__pandas__](https://pandas.pydata.org/) - provides tabular data structures and convenient tools to manipulate and analyze data  

[__scikit-learn__](https://scikit-learn.org/) - contains the standard machine learning algorithms for regression, classification, clustering, and more

[__matplotlib__](https://matplotlib.org/) - a popular Python library for creating charts and other plots  


In [None]:
# install the required libraries (pip skips any that are already installed)
!pip install numpy
!pip install pandas
!pip install scikit-learn
!pip install matplotlib

In [11]:
# import required python libraries
import numpy as np
import pandas as pd
import sklearn
import matplotlib
from random import seed, randrange
from math import sqrt
from sklearn import datasets

## Importing data  

Machine learning libraries such as scikit-learn provide functions that load datasets commonly used for training.  

We will be using a dataset called 'California Housing'.

In [None]:
from sklearn import datasets
# get data from datasets
data_df = datasets.fetch_california_housing(as_frame=True).frame




## Understanding the Data (Breakout Session)  


* find the data size 
* check the column names  
* ensure there are no null values 

In [None]:
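# find out the shape of the data (rows, columns)
print(data_df.shape)

# find out the total number of elements (rows x columns)
print(data_df.size)

# list the column names
print(data_df.columns.tolist())

# take a peek at the data - first 5 rows
print(data_df.head(5))

# count the missing values in each column
print(data_df.isnull().sum())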
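
To see what these checks report when a value is actually missing, here is a minimal sketch on a tiny hand-made DataFrame. The name `toy_df`, its column names, and its values are invented for illustration; the California Housing frame can be inspected with the same calls.

```python
import numpy as np
import pandas as pd

# invented toy data with one missing value in the "rooms" column
toy_df = pd.DataFrame({
    "rooms": [3, 4, np.nan, 5],
    "price": [1.2, 1.8, 2.1, 2.6],
})

print(toy_df.shape)             # (4, 2)
print(toy_df.columns.tolist())  # ['rooms', 'price']
print(toy_df.isnull().sum())    # rooms has 1 missing value, price has 0
print(toy_df.describe())        # summary statistics for each column
```

A non-zero count from `isnull().sum()` would mean the column needs cleaning before it is used for regression.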

# <font color="hotpink">Build The Linear Regression Algorithm in Python</font>

## Create Statistics Functions For Linear Regression (Breakout Session)

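mean 
  * the average of a list of values  

variance 
  * the sum of the squared differences between each value and the mean (this tutorial skips the usual division by the number of values, because that factor cancels out when the slope is calculated)  

covariance  
  * the sum of the products of the paired deviations of two groups of numbers from their own means, which measures how the two groups vary together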
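
Written as formulas, the three quantities look as follows. This is a sketch of the definitions as the functions in this tutorial compute them; the symbols $n$, $S_{xx}$, and $S_{xy}$ are notation introduced here for illustration, and both sums are left undivided by $n$.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```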



In [None]:
# create a mean function to find out the mean value from a list of items
def mean(values_list):
  # calculate the sum of the values_list
  sum_values_list = sum(values_list)

  # get the length of the values_list
  length_values_list = len(values_list)

  # find out the mean (convert length to float as datatype to keep the decimals)
  mean = sum_values_list / float(length_values_list)
  
  return mean

In [None]:
# create a variance function   
def variance(values_list, mean):
  # create an empty list to store the values
  variance_list = []

  # use a for loop to calculate each value's squared difference from the mean and add it to the list
  for x in values_list:
    mean_sqdiff = (x - mean) ** 2
    variance_list.append(mean_sqdiff)
    
  # sum the values
  variance = sum(variance_list)
  return variance

In [None]:
# create a covariance function  
def covariance(x, x_mean, y, y_mean):
  # set covariance = 0
  covar = 0.0 

  # add covariance for each data point (x as list)
  for i in range(len(x)):
    covar += (x[i] - x_mean) * (y[i] - y_mean)

  return covar
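
As a quick sanity check, the sketch below recomputes the same quantities on a tiny made-up dataset and compares them with NumPy. The names `sum_sq_dev` and `sum_cross_dev` and the sample values are assumptions for illustration; they are condensed stand-ins for the `variance` and `covariance` functions above, not replacements for them.

```python
import numpy as np

# tiny made-up dataset (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# condensed equivalents of the tutorial's variance and covariance functions
sum_sq_dev = sum((xi - x_mean) ** 2 for xi in x)
sum_cross_dev = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

# NumPy divides by n here, so multiply back by n to compare
np_sum_sq_dev = np.var(x) * n
np_sum_cross_dev = np.cov(x, y, bias=True)[0, 1] * n

print(x_mean, y_mean)                               # 3.0 4.0
print(sum_sq_dev, sum_cross_dev)                    # 10.0 6.0
print(np.isclose(sum_sq_dev, np_sum_sq_dev))        # True
print(np.isclose(sum_cross_dev, np_sum_cross_dev))  # True
```

The only difference from NumPy's results is the division by `n`, which does not affect the slope because it appears in both the numerator and the denominator.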

## Develop the algorithm for Linear Regression 

Once we've created the basic statistical functions above, we can combine them into an algorithm.  

coefficient  
* simple linear regression estimates two coefficients: the intercept `b0` and the slope `b1`  

linear regression algorithm (simple)  
* implement the prediction equation to make predictions on the test dataset
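
In formula form, the slope is the covariance term divided by the variance term, and the intercept places the line through the point of means. The notation below is introduced here for illustration: $\bar{x}$ and $\bar{y}$ are the means, $b_1$ is the slope, $b_0$ is the intercept, and $\hat{y}$ is the prediction for an input $x$.

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\qquad
b_0 = \bar{y} - b_1 \bar{x}
\qquad
\hat{y} = b_0 + b_1 x
```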

In [None]:
# create a coefficients function 
def coefficients(x, y):
  # calculate mean 
  x_mean, y_mean = mean(x), mean(y)

  # covariance divided by variance
  b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)

  # find the intercept to control the starting point of the regression line
  b0 = y_mean - b1 * x_mean
  
  return [b0, b1]
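
One way to check a from-scratch fit is to compare it with `numpy.polyfit`, which fits a least-squares polynomial (degree 1 here). The data values and the helper name `fit_line` are made up for this sketch; `fit_line` condenses the steps of `coefficients` so that the block runs on its own.

```python
import numpy as np

def fit_line(x, y):
    """Return [b0, b1] for simple linear regression (condensed sketch)."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    # slope: sum of cross deviations over sum of squared deviations
    b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    # intercept: forces the line through (x_mean, y_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

# tiny made-up dataset (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
slope, intercept = np.polyfit(x, y, 1)  # coefficients, highest degree first

print(round(b0, 3), round(b1, 3))                 # 2.2 0.6
print(np.allclose([b0, b1], [intercept, slope]))  # True
```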

In [None]:
def linear_regression(train, test):
  # create an empty list called predictions to store the predicted values
  predictions = list()

  # separate the training rows into x (feature) and y (target) lists
  x = [row[0] for row in train]
  y = [row[1] for row in train]

  # get the coefficients from training data
  b0, b1 = coefficients(x, y)

  # apply the b0 and b1 values on testing data
  for row in test:
      yhat = b0 + b1 * row[0]
      predictions.append(yhat)
      
  return predictions

## Build Functions to handle dataset and results

train_test_split  
* split up the dataset into a specific ratio  

rmse_metric  
* calculate root mean squared error

evaluate_algorithm
* evaluate an algorithm using a train/test split





In [None]:
# create a function to split the dataset (a list of rows) into train and test sets
def train_test_split(dataset, split):
  # create an empty list called train
  train = list()

  # number of rows for training (rounded to avoid floating-point surprises)
  train_size = round(split * len(dataset))
  dataset_copy = list(dataset) 

  # move randomly chosen rows from the copy into train
  while len(train) < train_size:
    index = randrange(len(dataset_copy))
    train.append(dataset_copy.pop(index))
  return train, dataset_copy
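
The effect of seeding can be seen on a toy list. The sketch below uses `random.shuffle` as a compact stand-in for the loop above (an assumption made to keep the block self-contained; the helper name `split_rows` is hypothetical), and shows that the same seed reproduces the same split.

```python
from random import seed, shuffle

def split_rows(dataset, split):
    """Shuffle a copy of the rows, then cut it at the requested ratio."""
    rows = list(dataset)
    shuffle(rows)
    cut = round(split * len(rows))
    return rows[:cut], rows[cut:]

toy = list(range(10))

seed(1)
train_a, test_a = split_rows(toy, 0.7)
seed(1)
train_b, test_b = split_rows(toy, 0.7)

print(len(train_a), len(test_a))        # 7 3
print(train_a == train_b)               # True: same seed, same split
print(sorted(train_a + test_a) == toy)  # True: every row lands in exactly one set
```

Without a fixed seed, each run would produce a different split and therefore a slightly different error score.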


In [None]:
# set up rmse_metric as a measure to evaluate the result
def rmse_metric(actual, predicted):
  # set error as 0 as starting point
  sum_error = 0.0

  
  for i in range(len(actual)):
      prediction_error = predicted[i] - actual[i]
      sum_error += (prediction_error ** 2)
  mean_error = sum_error / float(len(actual))
  return sqrt(mean_error)
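
To see what the metric reports, here is a small worked example with made-up numbers, computed once step by step and once with NumPy. The values are illustrative assumptions, not results from the housing data.

```python
import numpy as np
from math import sqrt

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

# errors are -1, 0, 1, 2, so the squared errors are 1, 0, 1, 4
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
rmse_manual = sqrt(sum(squared_errors) / len(actual))

# the same calculation, vectorized with NumPy
rmse_numpy = np.sqrt(np.mean((np.array(predicted) - np.array(actual)) ** 2))

print(round(rmse_manual, 3))                # 1.225
print(np.isclose(rmse_manual, rmse_numpy))  # True
```

RMSE is expressed in the same units as the target, so in this example a typical miss is about 1.2 units.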

In [None]:
# evaluate an algorithm using a train/test split
def evaluate_algorithm(dataset, algorithm, split, *args):
  train, test = train_test_split(dataset, split)

  # copy the test rows and hide the target so the algorithm cannot see it
  test_set = list()
  for row in test:
      row_copy = list(row)
      row_copy[-1] = None
      test_set.append(row_copy)

  # predict on the hidden test set, then compare against the true targets
  predicted = algorithm(train, test_set, *args)
  actual = [row[-1] for row in test]
  rmse = rmse_metric(actual, predicted)
  return rmse

## Execute 



In [None]:
seed(1) # seed the random number generator so the split is reproducible
split = 0.7
df = data_df  # data_df is already a DataFrame
dataset = [[x,y] for x,y in zip(df['HouseAge'], df['MedHouseVal'])]
rmse = evaluate_algorithm(dataset, linear_regression, split)
print('RMSE: %.3f' % (rmse))  # the lower, the better

# <font color="moccasin">Linear Regression using libraries </font>

In [None]:

# import libraries
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression


# get data
housing = datasets.fetch_california_housing()

# separate the data: X holds the 8 features, Y holds the median house value
X = housing.data
Y = housing.target


# use a new name so the seed() function imported earlier is not overwritten
random_seed = 7

# 10-fold cross-validation with shuffled folds
kfold = KFold(n_splits=10, random_state=random_seed, shuffle=True)
model = LinearRegression()
scoring = 'r2'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()))
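
For a score that is directly comparable with the RMSE from Part II, a single train/test split can also be done with scikit-learn. The sketch below uses synthetic data from `make_regression` as a stand-in so that it runs without downloading anything; the sample size, feature count, and noise level are arbitrary assumptions, and `X` and `y` would be replaced by the housing features and target in practice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# aliased so it does not replace the train_test_split function written in Part II
from sklearn.model_selection import train_test_split as sk_train_test_split

# synthetic stand-in data: 200 rows, 3 features, some noise
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=7)

# hold out 30% of the rows for testing, mirroring split = 0.7 above
X_train, X_test, y_train, y_test = sk_train_test_split(
    X, y, test_size=0.3, random_state=7
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: %.3f" % rmse)
```

Unlike the cross-validated R^2 above, this reports one error value from one split, so the number depends on the chosen `random_state`.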