# Machine Learning
## Linear Regression using Scikit-Learn
<p>Using a set of data points, linear regression attempts to find a 'line of best fit' through the data. The 'line of best fit' summarises the linear relationship between the variables, and it can then be used to predict the value of one variable from the others. Linear regression works best when the data are strongly linearly correlated to begin with.
Linear equation: y = mx + b, where m is the gradient of the line and b is the intercept (the value of y when x = 0). For a given x, gradient and intercept, y can be predicted.</p>
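
As a minimal sketch of the equation above, the snippet below fits a gradient and intercept to a handful of points and then uses them to predict a new value. The numbers are toy data invented for illustration (they are not from the forest fire dataset), and `predict_y` is a hypothetical helper name.

```python
import numpy as np

# Toy data invented for illustration: points lying close to y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Least-squares fit of a degree-1 polynomial; polyfit returns [m, b]
m, b = np.polyfit(x, y, 1)

def predict_y(x_new):
    """Predict y from x using the fitted line y = mx + b."""
    return m * x_new + b

print(f"gradient m = {m:.3f}, intercept b = {b:.3f}")
print(f"predicted y at x = 5: {predict_y(5.0):.3f}")
```

With a single input variable this is the same least-squares fit that scikit-learn's `LinearRegression` performs. With several inputs, as in the notebook below, there is one gradient per input variable.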

#### Analysis of Results
<p>My example of linear regression in the code below aims to predict the area of forest burnt by a fire based on a number of input variables such as temperature, wind and humidity.
My code functions well, but the R-squared score it prints is low. This indicates that the model explains only a small portion of the variance in the data. In other words, the linear regression model has limited effectiveness in predicting the area burned in the forest fire dataset. This is most likely because the input variables have little linear correlation with the area burned, rather than because of a fault in the code.</p>
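
To make the r squared score more concrete, here is a small sketch that computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The values are toy numbers made up for illustration, and `r_squared` is a hypothetical helper. For a `LinearRegression` model, `score` returns this same quantity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained by the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy values invented for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.8, 5.1, 7.2, 8.9]   # predictions close to the true values
mean_pred = [6.0, 6.0, 6.0, 6.0]   # always predicting the mean of y_true

print(r_squared(y_true, good_pred))  # close to 1: most of the variance is explained
print(r_squared(y_true, mean_pred))  # 0: no better than predicting the mean
```

A score near 0 therefore means the model does little better than always predicting the average area burned. On held-out test data the score can even be negative.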

In [None]:
#importing the libraries and packages required for our project
import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection #makes sklearn.model_selection.train_test_split available further down
import matplotlib.pyplot as pyplot
import pickle
from sklearn import linear_model
from sklearn.utils import shuffle
from matplotlib import style


In [None]:
#import our data into the data variable
data = pd.read_csv("forestfires.csv", sep=",") #our data is separated by commas

print(data.head()) #print the first five rows of the data

In [None]:
#keep only the columns (variables) we want to use
data = data[["temp", "RH", "wind", "area", "FFMC"]]
print(data.head()) #print the first 5 rows of this data set
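
The analysis above attributes the low score to weak correlation between the inputs and the area burned. One way to check a claim like that, sketched here under assumptions, is to look at the Pearson correlation of each input column with the target. `correlation_with_target` is a hypothetical helper, and the small DataFrame is toy data invented for illustration, not values from `forestfires.csv`.

```python
import pandas as pd

def correlation_with_target(df, target):
    """Pearson correlation of every other numeric column with `target`,
    ordered from strongest to weakest absolute correlation."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data invented for illustration only
toy = pd.DataFrame({
    "temp": [10, 15, 20, 25, 30],
    "wind": [5, 3, 6, 2, 4],
    "area": [1.0, 2.0, 3.0, 4.0, 5.0],
})

result = correlation_with_target(toy, "area")
print(result)

# On the notebook's DataFrame this would be called as:
# correlation_with_target(data, "area")
```

Values close to 0 for every input would be consistent with the explanation given above. Pearson correlation only measures linear relationships, so a low value does not rule out a non-linear one.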

In [None]:
predict = "area" #defining the variable which we will predict 

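x = np.array(data.drop(predict, axis=1)) #input variables (every column except the one we predict)
y = np.array(data[predict]) #target variable (the area burned)

#split the data into a training set and a test set (10% of the rows are held back for testing)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)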

In [None]:
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)  #fit the line of best fit to the training data
acc = linear.score(x_test, y_test)  #R-squared score on the test data

print(acc) #prints the R-squared score of our model

In [None]:
with open("fireprediction.pickle", "wb") as f: # Opens a file in binary write mode
    pickle.dump(linear, f)  # Save the 'linear' model into the file


with open("fireprediction.pickle", "rb") as pickle_in: # Open the file in binary read mode; the with block closes it afterwards
    linear = pickle.load(pickle_in) # Load the model from the file

print('Coefficients: \n', linear.coef_) # Print the coefficients of the model (one per input variable)
print('Intercept: \n',  linear.intercept_) # Print the intercept of the model


In [None]:
predictions = linear.predict(x_test) # Making predictions using the test data

for i in range(len(predictions)): # Loop through the predictions and compare with actual test data (i is used so the x array is not overwritten)
    print(predictions[i], x_test[i], y_test[i])  # Print prediction, input data, and actual outcome

In [None]:
#plot the data
style.use("ggplot")

#set up the scatter plot
p="temp"
pyplot.scatter(data[p], data["area"]) 
pyplot.xlabel("Temperature") #set the label on the x-axis
pyplot.ylabel("Area burned (hectares)") #set the label on the y-axis
pyplot.title("Predicted Forest Burned Area") #set the title of the chart
#the model uses several inputs, so its predictions do not lie on one straight line against temperature; plot them as red points
pyplot.scatter(x_test[:, 0], linear.predict(x_test), color='red')
pyplot.show() #display the graph