# Linear-Regression

## Example: Predicting Bicycle Traffic

> As an example, let's take a look at whether we can predict the number of bicycle trips across Seattle's Fremont Bridge based on weather, season, and other factors.
We have seen this data already in [Working With Time Series](03.11-Working-with-Time-Series.ipynb).


> In this section, we will join the bike data with another dataset, and try to determine the extent to which weather and seasonal factors—temperature, precipitation, and daylight hours—affect the volume of bicycle traffic through this corridor.
Fortunately, the NOAA makes available their daily [weather station data](http://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND) (I used station ID USW00024233) and we can easily use Pandas to join the two data sources.
We will perform a simple linear regression to relate weather and other information to bicycle counts, in order to estimate how a change in any one of these parameters affects the number of riders on a given day.


> In particular, this is an example of how the tools of Scikit-Learn can be used in a statistical modeling framework, in which the parameters of the model are assumed to have interpretable meaning.
As discussed previously, this is not a standard approach within machine learning, but such interpretation is possible for some models. Let's start by loading the two datasets, indexing by date:


In [None]:
# !curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD

In [None]:
import pandas as pd

counts = pd.read_csv("input/FremontBridge.csv", index_col="Date", parse_dates=True)
weather = pd.read_csv("input/BicycleWeather.csv", index_col="DATE", parse_dates=True)

> We will compute the total daily bicycle traffic, and put this in its own dataframe:


In [None]:
daily = counts.resample("d").sum()
daily["Total"] = daily.sum(axis=1)
daily = daily[["Total"]]  # remove other columns

> We saw previously that the patterns of use generally vary from day to day; let's account for this in our data by adding binary columns that indicate the day of the week:


In [None]:
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
for i in range(7):
    daily[days[i]] = (daily.index.dayofweek == i).astype(float)

> Similarly, we might expect riders to behave differently on holidays; let's add an indicator of this as well:


In [None]:
from pandas.tseries.holiday import USFederalHolidayCalendar

cal = USFederalHolidayCalendar()
holidays = cal.holidays("2012", "2016")
daily = daily.join(pd.Series(1, index=holidays, name="holiday"))
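# Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas
daily["holiday"] = daily["holiday"].fillna(0)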
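
The holiday flag can also be built without a join. The following is a minimal sketch on a hypothetical one-week frame named `toy` (not the bridge data), assuming only that the index is a daily `DatetimeIndex`; it marks the dates that appear in the federal holiday calendar.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Hypothetical one-week frame standing in for the daily counts
idx = pd.date_range("2014-07-01", "2014-07-07", freq="D")
toy = pd.DataFrame({"Total": range(len(idx))}, index=idx)

# Dates of US federal holidays in 2014
holidays = USFederalHolidayCalendar().holidays("2014-01-01", "2014-12-31")

# 1.0 where the index date is a holiday, 0.0 elsewhere
toy["holiday"] = toy.index.isin(holidays).astype(float)
print(toy)
```

Because `isin` returns a value for every row, no `fillna` step is needed in this variant.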

> We also might suspect that the hours of daylight would affect how many people ride; the function below computes this value for each date and adds it to the data:


In [None]:
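from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np


def hours_of_daylight(date, axis=23.44, latitude=47.61):
    """Compute the hours of daylight for the given date.

    axis: axial tilt of the Earth in degrees (23.44)
    latitude: latitude in degrees (47.61 for Seattle)
    """
    # 21 Dec 2000 is a winter solstice, the day with the fewest daylight hours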
    days = (date - datetime(2000, 12, 21)).days
    m = 1.0 - np.tan(np.radians(latitude)) * np.tan(
        np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)
    )
    return 24.0 * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.0


daily["daylight_hrs"] = list(map(hours_of_daylight, daily.index))
daily[["daylight_hrs"]].plot()
plt.ylim(8, 17)
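
As a quick sanity check on the daylight formula, the sketch below repeats it under a hypothetical name, `daylight_hours`, so that the block runs on its own, and evaluates it at the two solstices of 2014. The dates are chosen purely for illustration.

```python
from datetime import datetime

import numpy as np


def daylight_hours(date, axis=23.44, latitude=47.61):
    """Same formula as hours_of_daylight above, repeated so this block runs alone."""
    days = (date - datetime(2000, 12, 21)).days
    m = 1.0 - np.tan(np.radians(latitude)) * np.tan(
        np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)
    )
    return 24.0 * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.0


winter = daylight_hours(datetime(2014, 12, 21))
summer = daylight_hours(datetime(2014, 6, 21))
print(f"21 Dec: {winter:.2f} h, 21 Jun: {summer:.2f} h")
```

Both values should fall inside the `plt.ylim(8, 17)` range used for the plot, with the December value near the bottom and the June value near the top.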

> We can also add the average temperature and total precipitation to the data.
In addition to the inches of precipitation, let's add a flag that indicates whether a day is dry (has zero precipitation):


In [None]:
# Temperatures are in 1/10 deg C; convert to deg C and average
weather["TMIN"] /= 10
weather["TMAX"] /= 10
weather["Temp (C)"] = 0.5 * (weather["TMIN"] + weather["TMAX"])

# Precipitation is in 1/10 mm; convert to inches
weather["PRCP"] /= 254
weather["dry day"] = (weather["PRCP"] == 0).astype(int)

daily = daily.join(weather[["PRCP", "Temp (C)", "dry day"]])

> Finally, let's add a counter that increases from day 1, and measures how many years have passed.
This will let us measure any observed annual increase or decrease in daily crossings:


In [None]:
daily["annual"] = (daily.index - daily.index[0]).days / 365.0
daily.head()

> With this in place, we can choose the columns to use, and fit a linear regression model to our data.
We will set ``fit_intercept = False``, because the daily flags essentially operate as their own day-specific intercepts. We then compare the total and predicted bicycle traffic visually:


In [None]:
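from sklearn.linear_model import LinearRegression

# Drop any rows with null values
daily.dropna(axis=0, how="any", inplace=True)

# Features: day of week, holiday flag, daylight hours, precipitation,
# dry-day flag, temperature, and the years-elapsed counter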
column_names = [
    "Mon",
    "Tue",
    "Wed",
    "Thu",
    "Fri",
    "Sat",
    "Sun",
    "holiday",
    "daylight_hrs",
    "PRCP",
    "dry day",
    "Temp (C)",
    "annual",
]
X = daily[column_names]
y = daily["Total"]

model = LinearRegression(fit_intercept=False)
model.fit(X, y)
daily["predicted"] = model.predict(X)
daily[["Total", "predicted"]].plot(alpha=0.5);
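
The reason for `fit_intercept=False` can be illustrated on synthetic data. This is a minimal sketch, assuming ten hypothetical weeks of made-up counts (the names `day`, `D`, and `y_toy` are illustrative only): with a full set of seven day-of-week indicators and no other features, the least-squares coefficients are simply the per-day means, and adding a constant column makes the design matrix rank-deficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 weeks of daily counts, higher on weekdays (days 0-4)
day = np.arange(70) % 7
D = (day[:, None] == np.arange(7)).astype(float)  # seven indicator columns
y_toy = 100.0 * (day < 5) + rng.normal(size=70)

# Least squares on the indicators alone: each coefficient is that day's mean
coef, *_ = np.linalg.lstsq(D, y_toy, rcond=None)
group_means = np.array([y_toy[day == k].mean() for k in range(7)])

# An extra intercept column is the sum of the seven indicators,
# so the 8-column matrix still has rank 7
with_intercept = np.column_stack([np.ones(70), D])
rank = np.linalg.matrix_rank(with_intercept)
print(coef.round(1), rank)
```

In the bicycle model the other features shift these values, but the seven day flags still play the role of day-specific intercepts.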

> It is evident that we have missed some key features, especially during the summer time.
Either our features are not complete (i.e., people decide whether to ride to work based on more than just these) or there are some nonlinear relationships that we have failed to take into account (e.g., perhaps people ride less at both high and low temperatures).
Nevertheless, our rough approximation is enough to give us some insights, and we can take a look at the coefficients of the linear model to estimate how much each feature contributes to the daily bicycle count:


In [None]:
params = pd.Series(model.coef_, index=X.columns)
params

> These numbers are difficult to interpret without some measure of their uncertainty.
We can compute these uncertainties quickly using bootstrap resamplings of the data:


In [None]:
import numpy as np
from sklearn.utils import resample

np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_ for i in range(1000)], 0)

print(pd.DataFrame({"effect": params.round(0), "error": err.round(0)}))
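
The bootstrap idea used above can be sketched with NumPy alone. The example below is an illustration on synthetic data with a known slope of 2, not the bicycle data: rows are resampled with replacement, the line is refit each time, and the spread of the refitted slopes serves as the error estimate. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.linspace(0, 10, n)
y_syn = 2.0 * x + rng.normal(scale=1.0, size=n)  # synthetic data, true slope 2

slopes = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    slope, intercept = np.polyfit(x[idx], y_syn[idx], deg=1)
    slopes.append(slope)

slope_hat = np.polyfit(x, y_syn, deg=1)[0]  # fit on the full sample
slope_err = np.std(slopes)                  # bootstrap standard error
print(f"slope = {slope_hat:.2f} +/- {slope_err:.2f}")
```

`sklearn.utils.resample(X, y)` performs the same row resampling in one call while keeping `X` and `y` aligned.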

> We first see that there is a relatively stable trend in the weekly baseline: there are many more riders on weekdays than on weekends and holidays.
We see that for each additional hour of daylight, 129 ± 9 more people choose to ride; a temperature increase of one degree Celsius encourages 65 ± 4 people to grab their bicycle; a dry day means an average of 548 ± 33 more riders, and each inch of precipitation means 665 ± 62 more people leave their bike at home.
Once all these effects are accounted for, we see a modest increase of 27 ± 18 new daily riders each year.

> Our model is almost certainly missing some relevant information. For example, nonlinear effects (such as effects of precipitation *and* cold temperature) and nonlinear trends within each variable (such as disinclination to ride at very cold and very hot temperatures) cannot be accounted for in this model.
Additionally, we have thrown away some of the finer-grained information (such as the difference between a rainy morning and a rainy afternoon), and we have ignored correlations between days (such as the possible effect of a rainy Tuesday on Wednesday's numbers, or the effect of an unexpected sunny day after a streak of rainy days).
These are all potentially interesting effects, and you now have the tools to begin exploring them if you wish!




## Example

In [None]:
import numpy as np
from numpy.testing import assert_almost_equal

np.set_printoptions(precision=2)  # print two decimal places
rng = np.random.RandomState(1)

# Build the experimental data and subtract the mean
W = rng.rand(2, 2)
X_normal = rng.normal(scale=5, size=(2, 20))
X_orig = W @ X_normal  # @ is matrix multiplication
X_mean = X_orig.mean(axis=1)[:, np.newaxis]
X = X_orig - X_mean
mean = X.mean(axis=1)

# Check numerically that the sample mean is 0; this matters when implementing the algorithm
assert_almost_equal(0, mean)
print("X.shape:", X.shape, "\n")
print(X)

In [None]:
"""
Each sample is a column vector, indexed from 0.
The first ":" selects all rows.
"""
X[:, 0]

In [None]:
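"""
NumPy aside, readers who have done matrix operations with scikit-learn, PyTorch, or TensorFlow
will know that these libraries usually expect the data matrix X transposed, with shape
(n_samples, n_features). The benefit is that each row vector corresponds directly to one sample,
which makes it easier to access a particular sample:
"""
# sanity check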
assert_almost_equal(X[:, 0], X.T[0])
X.T[0]

In [None]:
"""
array([[ 2.89,  0.32,  5.8 , -6.52,  3.94, -4.21],    <- 2 rows: n_features = 2
       [ 1.52,  0.91,  1.52, -0.88, -0.03, -1.26]])
                        ^
              6 columns: n_samples = 6
"""
X[:, :6]

In [None]:
X[:, 0]

In [None]:
X[:, 1]

In [None]:
X[:, 2]

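### Cartesian coordinate system

Now imagine you excitedly rush to see your advisor, eager to present the freshly collected data X. The professor takes one look and says:

- Two features is a bit much. Can you find a way to describe these samples with just one feature?

You nod hastily and leave the professor's office. Back at your screen, you stare at the numbers in X, and the more you think about it the more uneasy you feel. How can each of these 2-dimensional vectors x be represented by a single new value while preserving the characteristics of the samples? Simply by plotting the seemingly patternless data X in this coordinate system, our innate geometric intuition lets us guess that some degree of linear relationship exists between the two features f1 and f2. This is a major win for the geometric view, and it brings us much closer to the goal of dimensionality reduction.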

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X[0], X[1])

In [None]:
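import matplotlib.pyplot as plt

# The bare "seaborn" style name was removed in Matplotlib 3.8;
# "seaborn-v0_8" is the equivalent name from Matplotlib 3.6 onward
plt.style.use("seaborn-v0_8")

# First argument: all x values; second argument: all y values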
plt.scatter(X[0, :], X[1, :])
plt.axis("equal");

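Projecting vectors onto a lower-dimensional subspace is, in effect, linearly reducing their dimension. This is the core idea behind linear dimensionality reduction and PCA: decompose the original data into more representative principal components and use them as a new basis to re-describe the data.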

In [None]:
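# Unit vector along the line (printed to two decimal places)
v = np.array([0.9691344, 0.246533])
print("v       :", v)  # shape: (2,)
assert_almost_equal(1, np.linalg.norm(v))

# Build the projection matrix P1 from v
# P projects X down to 1 dimension, hence the trailing 1
P1 = v[np.newaxis, :]  # shape: (1, 2)
print("P1      :", P1)

# Use P1 to project the data X onto the subspace spanned by v
L = P1 @ X

# New feature L for the first 4 samples; matches the result shown in the animation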
print("L[:, :4]:", L[:, :4])
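
The unit vector `v` above is hard-coded. One way such a direction could be obtained is from the singular value decomposition of the centered data. The sketch below is an assumption about how a first principal direction may be computed, not necessarily how `v` was derived; the names `X_demo`, `v_demo`, and `L_demo` are hypothetical. It also checks that the variance of the projected values equals the largest eigenvalue of the sample covariance matrix.

```python
import numpy as np

rng = np.random.RandomState(1)

# Hypothetical centered data with shape (n_features, n_samples)
W_demo = rng.rand(2, 2)
X_demo = W_demo @ rng.normal(scale=5, size=(2, 20))
X_demo = X_demo - X_demo.mean(axis=1, keepdims=True)

# Left singular vectors are the principal directions; the first has the largest variance
U, S, Vt = np.linalg.svd(X_demo, full_matrices=False)
v_demo = U[:, 0]                          # unit vector, shape (2,)
L_demo = v_demo[np.newaxis, :] @ X_demo   # projection, shape (1, 20)

# Largest eigenvalue of the sample covariance matrix
cov = X_demo @ X_demo.T / (X_demo.shape[1] - 1)
top_eigenvalue = np.linalg.eigvalsh(cov)[-1]
print(v_demo, L_demo.var(ddof=1), top_eigenvalue)
```

The sign of a singular vector is arbitrary, so a direction found this way may point opposite to a hand-picked one while spanning the same line.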

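Break out of habitual thinking: the x-axis does not have to run horizontally. If you wish, any line in the world can be your x-axis, and any vector can be the new basis for describing your data. PCA is a technique for decomposing and re-expressing data.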

In [None]:
from sklearn.decomposition import PCA

random_state = 9527  # fixed seed for reproducibility

pca_1d = PCA(1, random_state=random_state)
L_sk = pca_1d.fit_transform(X.T).T

print("L_sk.shape:", L_sk.shape)
print("L_sk:", L_sk[:, :4])

# The scikit-learn API gives the same result as the manual computation
assert_almost_equal(L_sk, L)

In [None]:
"""
Most Python machine learning libraries expect inputs with n_samples first.
This is why X is transposed to (n_samples, n_features) before calling scikit-learn,
and the result is transposed back to (n_transformed_features, n_samples):
"""
L_sk = pca_1d.fit_transform(X.T).T
data = X.T
L_transpose = pca_1d.transform(data)
assert_almost_equal(L.T, L_transpose)
L

## Project2

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

data = pd.read_excel("input/insurance.xlsx")
data.head()  # data.shape #data.info()

In [None]:
# Label Encode Object Types
d_types = dict(data.dtypes)
for name, type_ in d_types.items():
    if str(type_) == "object":
        print(f"<======== {name} ===========>")
        print(data[name].value_counts())

In [None]:
from sklearn.preprocessing import LabelEncoder

for name, type_ in d_types.items():
    if str(type_) == "object":
        Le = LabelEncoder()
        data[name] = Le.fit_transform(data[name])

In [None]:
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder()
part = onehotencoder.fit_transform(data["region"].values.reshape(-1, 1)).toarray()
# Name each new column after the category it encodes; the encoder orders
# its output columns by categories_, not by frequency
for e, val in enumerate(onehotencoder.categories_[0]):
    data["region_" + str(val)] = part[:, e]

data = data.drop(["region"], axis=1)
data.head()
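
One-hot encoding itself can be sketched in a few lines of NumPy. The region labels below are hypothetical values used only for illustration. The point is that the output columns follow the sorted category order, which is why column names should be taken from the categories rather than from a frequency count.

```python
import numpy as np

# Hypothetical region labels
regions = np.array(["southwest", "northeast", "southwest", "northwest"])

# categories are sorted; codes gives each row's position in that sorted list
categories, codes = np.unique(regions, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]
print(categories)
print(one_hot)
```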

In [None]:
remaining_columns = list(data.columns)
remaining_columns.remove("expenses")

X = data[remaining_columns].values
Y = data["expenses"].values

from sklearn.model_selection import train_test_split

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=4)

from sklearn.preprocessing import StandardScaler

Scaler = StandardScaler()
Xtrain = Scaler.fit_transform(Xtrain)
Xtest = Scaler.transform(Xtest)
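
The scaling step above fits on the training split and reuses those statistics on the test split. A minimal NumPy sketch of the same idea, on synthetic data with hypothetical names, shows the consequence: the training columns end up with mean 0 and unit variance exactly, while the test columns are only close to those values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two splits
train = rng.normal(loc=50.0, scale=10.0, size=(800, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Statistics come from the training split only
mu = train.mean(axis=0)
sigma = train.std(axis=0)

train_std = (train - mu) / sigma
test_std = (test - mu) / sigma  # reuses the training mean and std

print(train_std.mean(axis=0).round(3), test_std.mean(axis=0).round(3))
```

Fitting the scaler on the training data alone keeps information from the test split out of the preprocessing.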

### Standardized features (mean should be close to 0, variance close to 1)

In [None]:
means = []

plt.ylim(-1, 1)
for i in range(X.shape[1]):
    means.append(np.mean(Xtest[:, i]))

plt.plot(means, scaley=False)

In [None]:
variances = []  # renamed from `vars` to avoid shadowing the built-in

plt.ylim(0, 2)
for i in range(X.shape[1]):
    variances.append(np.var(Xtest[:, i]))

plt.plot(variances)

### LinearRegression (ordinary linear regression)

- `model.coef_`: the fitted coefficients
- `model.intercept_`: the fitted intercept

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(Xtrain, Ytrain)

In [None]:
# Y = W.X + c
model.coef_.dot(Xtest[10, :]) + model.intercept_

In [None]:
model.predict(Xtest[10, :].reshape(1, -1))

In [None]:
# rfecv.support_    # boolean mask of the selected features
# rfecv.ranking_    # feature ranking (selected features have rank 1)
# model.coef_       # coefficients
# model.intercept_  # intercept

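### Recursive Feature Elimination (RFECV): feature selection

Cross-validation is used to find the optimal number of features. If removing a feature would hurt performance, no feature is removed. This method works well for selecting features for a single model, but it has two drawbacks: first, it is computationally expensive; second, the best feature subset changes with the learner (estimator), which can sometimes be detrimental.

- Build the machine learning model to be trained.
- Decide how many features to eliminate in each iteration (`step`).
- Because RFECV iterates recursively, it needs an explicit stopping point (`min_features_to_select`).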
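
The elimination loop described by the steps above can be sketched in plain NumPy. Note that this is ordinary recursive feature elimination without the cross-validation that `RFECV` adds, and it assumes features on comparable scales so that coefficient magnitudes can be compared. The function name `rfe_sketch` and the synthetic data are hypothetical.

```python
import numpy as np


def rfe_sketch(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        # Least-squares fit with an intercept column appended
        A = np.column_stack([X[:, active], np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[:-1])))  # ignore the intercept
        del active[weakest]
    return active


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only features 0 and 3 influence the target
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 3] + 0.1 * rng.normal(size=200)

kept = rfe_sketch(X_toy, y_toy, n_keep=2)
print(kept)
```

Here one feature is removed per pass (the analogue of `step=1`) and `n_keep` plays the role of the stopping point.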

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Feature selection
model = LinearRegression()
# step: features removed per iteration; min_features_to_select: lower bound on kept features
rfecv = RFECV(model, step=1, min_features_to_select=4, n_jobs=-1)
rfecv.fit(Xtrain, Ytrain)
model.fit(Xtrain, Ytrain)

In [None]:
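selected_features = np.where(rfecv.support_)[0]
print(selected_features)

# Keep only the selected columns in both splits and refit,
# so that the model and the test data use the same features
Xtrain_sel = Xtrain[:, selected_features]
Xtest_sel = Xtest[:, selected_features]
model.fit(Xtrain_sel, Ytrain)

In [None]:
# Y = W.X + c
model.coef_.dot(Xtest_sel[10, :]) + model.intercept_

In [None]:
model.predict(Xtest_sel[10, :].reshape(1, -1))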

### Apparent Temperature Prediction

In [None]:
import datetime as dt

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv("input/weather_data.csv")
data.head()

In [None]:
data.describe(include="all")

## DATA CLEANING

In [None]:
cols = [
    "Summary",
    "Precip Type",
    "Daily Summary",
    "Wind Bearing (degrees)",
    "Visibility (km)",
    "Loud Cover",
]
data = data.drop(cols, axis=1)

In [None]:
# Converting Formatted Date from Object to DateTimeObject.
# utc=True yields a single timezone-aware dtype even if the offsets in the column differ
data["Formatted Date"] = pd.to_datetime(data["Formatted Date"], utc=True)
data.info()

In [None]:
# Indexing according to date and time and Setting Index.
idata = data.sort_values(by=["Formatted Date"])
idata = idata.set_index("Formatted Date")
idata.index

In [None]:
# Remove rows whose index value is duplicated, keeping the first occurrence
idata = idata[~idata.index.duplicated(keep="first")]
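
To see what removing duplicate index values involves, here is a small sketch on a hypothetical three-row frame. `Index.drop_duplicates` returns a new index and leaves the DataFrame untouched, whereas a boolean mask built from `Index.duplicated` actually filters the rows.

```python
import pandas as pd

# Hypothetical frame with one repeated timestamp
toy = pd.DataFrame(
    {"t": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2006-01-01", "2006-01-01", "2006-01-02"]),
)

# Returns a new Index; `toy` itself still has three rows
unique_index = toy.index.drop_duplicates(keep="first")

# Filters the rows, keeping the first occurrence of each timestamp
deduped = toy[~toy.index.duplicated(keep="first")]
print(len(toy), len(unique_index), len(deduped))
```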

## EXPLORATORY DATA ANALYSIS

In [None]:
# Data after cleaning: assign the result, since dropna() returns a new DataFrame
idata = idata.dropna()
idata.head()
idata.plot(y="Temperature (C)", figsize=(20, 10))

In [None]:
# resampling the data into day format
idata.index = pd.to_datetime(idata.index, utc=True)
idata2 = idata.resample(rule="D").mean()
idata2.head()
idata2.plot(y="Temperature (C)", figsize=(20, 10))

In [None]:
# extracting data from the month of January
jan2006 = idata["2006-01-01":"2006-01-31"]
jan2006.head()
jan2006.plot(
    y=["Apparent Temperature (C)", "Temperature (C)"], kind="line", figsize=(20, 10)
)

In [None]:
# resampling
JAN = jan2006.resample(rule="D").mean()
JAN.head()
JAN.plot(
    y=["Apparent Temperature (C)", "Temperature (C)"], kind="line", figsize=(20, 10)
)

In [None]:
winter = idata2["2006-01-01":"2006-02-28"]
winter2 = idata2["2006-12-01":"2006-12-31"]
spring = idata2["2006-03-01":"2006-05-31"]
summer = idata2["2006-06-01":"2006-08-31"]
winter.plot(y=["Temperature (C)", "Apparent Temperature (C)"], figsize=(20, 10))
winter2.plot(y=["Temperature (C)", "Apparent Temperature (C)"], figsize=(20, 10))

In [None]:
spring.plot(y=["Temperature (C)", "Apparent Temperature (C)"], figsize=(20, 10))

In [None]:
summer.plot(y=["Temperature (C)", "Apparent Temperature (C)"], figsize=(20, 10))

## Correlation in data

In [None]:
# checking correlation between the cols
data.corr(numeric_only=True)  # skip the non-numeric date column

In [None]:
# plotting the correlation
plt.figure(figsize=(10, 10))
sns.heatmap(data.corr(numeric_only=True), annot=True)
plt.show()

## Data Visualization

In [None]:
sns.jointplot(x="Temperature (C)", y="Apparent Temperature (C)", kind="reg", data=data)

In [None]:
sns.jointplot(kind="reg", y=data["Humidity"], x=data["Temperature (C)"])

In [None]:
sns.jointplot(kind="reg", y=data["Pressure (millibars)"], x=data["Temperature (C)"])

In [None]:
sns.jointplot(x="Apparent Temperature (C)", y="Pressure (millibars)", kind="reg", data=data)

In [None]:
sns.jointplot(kind="reg", y=data["Wind Speed (km/h)"], x=data["Temperature (C)"])

In [None]:
sns.jointplot(x="Apparent Temperature (C)", y="Wind Speed (km/h)", kind="reg", data=data)

In [None]:
sns.jointplot(x="Apparent Temperature (C)", y="Humidity", kind="reg", data=data)

In [None]:
sns.jointplot(kind="hex", y=data["Humidity"], x=data["Pressure (millibars)"])

In [None]:
sns.jointplot(kind="hex", y=data["Humidity"], x=data["Temperature (C)"])

In [None]:
sns.jointplot(kind="hex", y=data["Humidity"], x=data["Apparent Temperature (C)"])

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() call only adds an empty one
sns.pairplot(data)
plt.show()

In [None]:
X = idata["2006-01-01":"2006-07-20"]  # training set: 1 Jan to 20 Jul 2006 (roughly 55% of the year's days)
X1 = idata["2006-07-21":"2006-12-31"]  # test set: 21 Jul to 31 Dec 2006

In [None]:
# taking cols for training the model
X_train = X[
    ["Temperature (C)", "Humidity", "Wind Speed (km/h)", "Pressure (millibars)"]
]
Y_train = X["Apparent Temperature (C)"]
Y_train

In [None]:
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, Y_train)

In [None]:
# taking cols for testing
X_test = X1[
    ["Temperature (C)", "Humidity", "Wind Speed (km/h)", "Pressure (millibars)"]
]
X_test.head()

In [None]:
Y_test = X1["Apparent Temperature (C)"]
Y_test.head()

In [None]:
# predicting the Apparent Temperature
y_pred = regr.predict(X_test)
regr.score(X_test, Y_test)

### Showing the Actual Apparent Temperature and the predicted Apparent Temperature 

In [None]:
df = pd.DataFrame({"Actual": Y_test, "Predicted": y_pred})
df.head()

### Calculating the error in prediction 

In [None]:
from sklearn import metrics

print("Mean Absolute Error: ", metrics.mean_absolute_error(Y_test, y_pred))
print("Mean Squared Error: ", metrics.mean_squared_error(Y_test, y_pred))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(Y_test, y_pred)))
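
The error measures printed above, together with the R² value returned by `regr.score`, follow directly from the residuals. The sketch below computes them with NumPy on a hypothetical four-point example so that each formula is visible.

```python
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.5])

residuals = y_true - y_hat
mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals**2)        # mean squared error
rmse = np.sqrt(mse)                # root mean squared error
# R^2: one minus residual sum of squares over total sum of squares
r2 = 1.0 - np.sum(residuals**2) / np.sum((y_true - y_true.mean()) ** 2)
print(mae, mse, rmse, r2)
```

RMSE is in the same units as the target (degrees Celsius in the weather example), which often makes it easier to read than MSE.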