# For the data manipulation parts and code, see:

[ML BTK-1 Colab](https://colab.research.google.com/drive/1R8bJJypAHdpol59H9bgMTDRWCKzTgNoC#scrollTo=UIl8pb7dUvGS)

Next:
SVR Algorithm
[SVR Colab](https://colab.research.google.com/drive/1QWtItyz5mVSwWiGtQdMzI6OD3_JwuGzO#scrollTo=ZsEukbjNIUvm)

## **PREDICTION ALGORITHMS / importing libraries**

We will use classification for categorical data and prediction for numerical data.

Forecasting and prediction are different. Forecasting estimates future values, whereas prediction can also estimate past (or otherwise unobserved) values.

In [6]:
import io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import files

In [None]:
uploaded = files.upload()

# SIMPLE LINEAR REGRESSION

We will try to find the line of best fit, *y = a + bx*, where *a* is the intercept and *b* is the slope.

Example of simple linear regression:

*sales = a + b (months) + e*, where *e* is the error term.
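
To make the formula concrete, here is a minimal sketch that computes the intercept and slope of a best-fit line by hand with NumPy. The monthly figures are made up for illustration and are not taken from `satislar.txt`; the closed-form expressions are the standard least-squares solution for a single independent variable.

```python
import numpy as np

# Hypothetical data: 8 months of made-up sales figures (not from satislar.txt).
months = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
sales = np.array([12, 15, 19, 22, 24, 29, 31, 36], dtype=float)

# Closed-form least squares for sales = a + b * months:
#   b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   a = mean_y - b * mean_x
x_mean, y_mean = months.mean(), sales.mean()
b = np.sum((months - x_mean) * (sales - y_mean)) / np.sum((months - x_mean) ** 2)
a = y_mean - b * x_mean

# Cross-check against NumPy's degree-1 polynomial fit (returns slope first).
b_check, a_check = np.polyfit(months, sales, deg=1)

print(f"a (intercept) = {a:.3f}, b (slope) = {b:.3f}")
```

`LinearRegression` from scikit-learn solves the same least-squares problem, so on the same data its `intercept_` and `coef_` should match `a` and `b`.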


In [None]:
data = pd.read_csv(io.BytesIO(uploaded['satislar.txt']))
print("Done")

In [None]:
months = data[['Aylar']]
sales = data[['Satislar']]
print(sales)
# alternatively, instead of data[['Satislar']] we can use:
# sales2 = data.iloc[:,1:2].values
# here months is the independent variable and sales is the dependent variable.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(months, sales, 
                                                    test_size=0.33,
                                                    random_state=0)

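# standardize data (optional): fit each scaler on the training set only,
# then reuse it to transform the test set

# from sklearn.preprocessing import StandardScaler
# sc_x = StandardScaler()
# X_train = sc_x.fit_transform(x_train)
# X_test = sc_x.transform(x_test)

# sc_y = StandardScaler()
# Y_train = sc_y.fit_transform(y_train)
# Y_test = sc_y.transform(y_test)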
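
When standardizing, the usual practice is to learn the mean and standard deviation from the training set only and to reuse them on the test set, so that no information from the test set leaks into preprocessing. A minimal sketch with made-up numbers (the arrays below are hypothetical, not the sales data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test features (made-up numbers, one column).
x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
x_test = np.array([[6.0], [7.0]])

sc = StandardScaler()
X_train = sc.fit_transform(x_train)  # learn mean/std from the training set
X_test = sc.transform(x_test)        # reuse the same mean/std on the test set

print(X_train.ravel())
print(X_test.ravel())
```

Calling `fit_transform` on the test set instead would rescale it with its own mean and standard deviation, so train and test values would no longer be on the same scale.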


In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)
predict = lr.predict(x_test)

Plot the data using matplotlib.pyplot.

In [None]:
# first sort values by index
x_train = x_train.sort_index()
y_train = y_train.sort_index()

plt.plot(x_train, y_train)
plt.plot(x_test, lr.predict(x_test))

plt.title("Sale Prediction")
plt.xlabel("Months")
plt.ylabel("Sales")

# MULTIPLE LINEAR REGRESSION

*y = (beta0) + (beta1)(x1) + (beta2)(x2) + (beta3)(x3) + epsilon*

Example of multiple linear regression:

*height = a + b(weight) + c(age) + d(foot size) + e*


---

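> Be aware of the dummy variable trap. Some algorithms are affected by it more than others.

> It happens when we convert categorical data to numerical data, for example with one-hot encoding: the encoded columns always sum to 1, so any one of them can be derived from the others.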

---
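
The dummy variable trap can be checked numerically. The sketch below, assuming made-up integer-encoded country labels, builds a one-hot matrix with NumPy and shows that combining all dummy columns with an intercept column gives a rank-deficient design matrix, while dropping one dummy column restores full column rank.

```python
import numpy as np

# Hypothetical country labels, already integer-encoded (0 = fr, 1 = tr, 2 = us).
labels = np.array([0, 1, 2, 1, 0, 2, 1])

one_hot = np.eye(3)[labels]          # one column per category
const = np.ones((len(labels), 1))    # intercept column

# With all three dummies plus an intercept, the dummy columns sum to the
# intercept column, so one column is redundant.
full = np.hstack([const, one_hot])
# Dropping one dummy removes the redundancy.
reduced = np.hstack([const, one_hot[:, 1:]])

print(full.shape[1], np.linalg.matrix_rank(full))        # 4 3
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 3
```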

What the p-value tells us:

The p-value is the probability of observing data at least as extreme as ours if the null hypothesis were true. We compare it against a significance level, generally taken as 0.05 (5%).

H0 is the null hypothesis and H1 is the alternative hypothesis. The smaller the p-value, the stronger the evidence against H0 and in favor of H1. Conversely, a large p-value means the data give little evidence against H0, which is not the same as proving H0 true.
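
One way to make the idea of a p-value concrete is a permutation test. The sketch below uses synthetic data and asks how often a shuffled sample (where H0, "no relationship between x and y", holds by construction) produces a correlation at least as strong as the observed one. This is an illustration only; it is not the t-test based p-value that regression summaries usually report, and the data, seed, and number of permutations are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends linearly on x plus noise.
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(scale=3.0, size=x.size)

observed = abs(np.corrcoef(x, y)[0, 1])

# Under H0, shuffling y should not matter. Count how often a shuffled
# sample gives a correlation at least as strong as the observed one.
n_perm = 2000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(y)
    if abs(np.corrcoef(x, shuffled)[0, 1]) >= observed:
        count += 1

# Add 1 to numerator and denominator so the estimate is never exactly 0.
p_value = (count + 1) / (n_perm + 1)
print(f"observed |r| = {observed:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```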


In the code below, we first prepare the data.

In [None]:
data = pd.read_csv(io.BytesIO(uploaded['veriler.txt']))
print("Done")

In [None]:
# encode sex column to 1s and 0s
c = data.iloc[:,-1:].values

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
c[:,-1] = le.fit_transform(data.iloc[:,-1])

ohe = preprocessing.OneHotEncoder()
c = ohe.fit_transform(c).toarray()

# the code below is reused from earlier, slightly modified
country = data.iloc[:,0:1].values

le = preprocessing.LabelEncoder()
country[:,0] = le.fit_transform(data.iloc[:,0])

ohe = preprocessing.OneHotEncoder()
country = ohe.fit_transform(country).toarray()



age = data.iloc[:,1:4].values

dataLength = len(data)

result = pd.DataFrame(data=country, index = range(dataLength),
                      columns = ['fr', 'tr', 'us'])

result2 = pd.DataFrame(data = age, index = range(dataLength),
                       columns=['boy', 'kilo', 'yas'])
sex = data.iloc[:,-1].values
result3 = pd.DataFrame(data = c[:,:1], index = range(dataLength),
                       columns=['cinsiyet'])


s = pd.concat([result, result2], axis=1) 
s2 = pd.concat([s, result3], axis=1)
print(s2)
print(s)
print(result3)
# split sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(s, result3, 
                                                    test_size=0.33,
                                                    random_state=0)

Train a linear regression model on the data and predict sex (the `cinsiyet` column).

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

y_predict = regressor.predict(x_test)
print(y_predict)  # compare with y_test
print("-----Real Results-----")
print(y_test)
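
Because linear regression returns continuous values, comparing its output with 0/1 labels is easier after thresholding. The sketch below assumes a cut-off of 0.5 and uses made-up numbers rather than the actual output of the cell above.

```python
import numpy as np

# Hypothetical regression outputs and true 0/1 labels (made-up numbers).
y_predict = np.array([0.91, 0.12, 0.55, -0.08, 0.47])
y_true = np.array([1, 0, 1, 0, 1])

# Treat values of 0.5 or more as class 1, everything else as class 0.
y_label = (y_predict >= 0.5).astype(int)
accuracy = (y_label == y_true).mean()

print(y_label)   # [1 0 1 0 0]
print(accuracy)  # 0.8
```

For categorical targets like this one, a classification algorithm is usually the more natural choice, as noted at the start of this notebook.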

Predict height.

In [None]:
# first prepare data:
height = s2.iloc[:,3:4].values

left_side_of_height_column = s2.iloc[:,:3]
right_side_of_height_column = s2.iloc[:,4:]

without_height = pd.concat([left_side_of_height_column, right_side_of_height_column],
                           axis = 1)

x_train, x_test, y_train, y_test = train_test_split(without_height, height, 
                                                    test_size=0.33,
                                                    random_state=0)

r2 = LinearRegression()
r2.fit(x_train, y_train)

y_predict = r2.predict(x_test)
# compare with y_test
print(y_predict)
print("-----Real Results-----")
print(y_test)

# Backward Elimination

In the previous section we used every column to predict height, but we may not need all of them. In this section we use backward elimination to remove variables that contribute little to the model.

We calculate p-values to evaluate how much each variable contributes to the prediction (continuing with the veriler.txt data).

In [None]:
# we test the variables by removing them one at a time and re-evaluating the model:
# remove the one with the highest p-value, refit, then continue with the next highest

# our model has no constant (beta0) term, so we build an array that
# prepends a column of ones (one per row) to the data

import statsmodels.api as sm

beta0 = np.append(arr = np.ones((len(data),1)).astype(int), values=data, axis=1)
# note: beta0 is not passed to the models below; they are fit on X_l
# print(beta0)

# X_l holds the independent variables; height is the dependent variable
X_l = without_height.iloc[:,[0,1,2,3,4,5]].values
X_l = np.array(X_l, dtype=float)

# statistical results:
model = sm.OLS(height, X_l).fit()
print(model.summary())

# x5 has the highest p-value, so we drop it (x5 is the column at index 4)
X_l = without_height.iloc[:,[0,1,2,3,5]].values
X_l = np.array(X_l, dtype=float)
model = sm.OLS(height, X_l).fit()
print(model.summary())

# Polynomial Regression

*y = (beta0) + (beta1)(x) + (beta2)(x^2) + ... + (betah)(x^h) + epsilon*


or it can be **multivariate**:

*y = (beta0) + (beta1)(x1) + (beta2)(x2) + (beta11)(x1^2) + (beta22)(x2^2) + (beta12)(x1)(x2) + epsilon*
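
The terms of the multivariate formula correspond to the columns that scikit-learn's `PolynomialFeatures` generates. A minimal sketch for degree 2 with one made-up sample (x1 = 2, x2 = 3):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with x1 = 2 and x2 = 3.
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Columns are: 1, x1, x2, x1^2, x1*x2, x2^2
print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]
```

A linear regression fitted on these columns then estimates one beta coefficient per column, which is exactly the polynomial model written above.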

In [None]:
data = pd.read_csv(io.BytesIO(uploaded['maaslar.txt']))
print(data)

In [None]:
# import data and linear regression
x = data.iloc[:,1:2]
y = data.iloc[:,2:]
# convert to numpy array
X = x.values
Y = y.values

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, Y)

plt.scatter(X, Y, color='red')
plt.plot(x, lin_reg.predict(X), color='blue')
plt.show()

In [None]:
# Polynomial regression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
x_poly = poly_reg.fit_transform(X)
print(x_poly)

lin_reg2 = LinearRegression()
lin_reg2.fit(x_poly, y)
plt.scatter(X,Y, color='green')
# reuse the already fitted transformer; transform() is enough here
plt.plot(X, lin_reg2.predict(poly_reg.transform(X)), color='blue')
plt.show()

In [None]:
# Predict Unknown Values

# for linear regression
print(lin_reg.predict([[11]]))
print(lin_reg.predict([[6.6]]))

# for polynomial regression (reuse the fitted transformer with transform())
print(lin_reg2.predict(poly_reg.transform([[11]])))
print(lin_reg2.predict(poly_reg.transform([[6.6]])))
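
As an alternative to expanding the features by hand before every prediction, the two steps can be chained in a scikit-learn pipeline. The sketch below assumes synthetic data that follows y = x^2 exactly (not the `maaslar.txt` data) and degree 2, so the predictions are easy to verify by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data following y = x^2 exactly (not the maaslar.txt data).
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = X.ravel() ** 2

# The pipeline expands the features and fits the regression in one object,
# so predict() applies the same expansion to new inputs automatically.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

pred = model.predict([[11.0], [6.6]])
print(pred)  # approximately [121.  43.56]
```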
