An application for predicting the closing price of Google stock from its opening price and its daily low and high.

Downloading the dataset from Kaggle.

In [2]:
from google.colab import files

In [3]:
uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [4]:
!mkdir -p ~/.kaggle


In [5]:
!cp kaggle.json ~/.kaggle/

In [6]:
!chmod 600 /root/.kaggle/kaggle.json
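
The three shell cells above can also be expressed in Python. The following is a minimal sketch, not part of the original notebook: the helper name `install_kaggle_token` is hypothetical, and the demo runs against a temporary directory with a dummy token instead of the real `/root/.kaggle`. Creating the directory with `exist_ok=True` means rerunning the cell does not fail when the directory already exists.

```python
import os
import shutil
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home_dir):
    """Copy a kaggle.json token into <home_dir>/.kaggle with owner-only access."""
    target_dir = Path(home_dir) / ".kaggle"
    # exist_ok=True makes the step safe to rerun
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    # Same effect as `chmod 600`: read/write for the owner only
    os.chmod(target, 0o600)
    return target


# Demo with a throwaway directory and a placeholder token (assumed values)
with tempfile.TemporaryDirectory() as tmp:
    dummy_token = Path(tmp) / "kaggle.json"
    dummy_token.write_text('{"username": "example", "key": "placeholder"}')
    installed = install_kaggle_token(dummy_token, Path(tmp) / "home")
    mode = oct(installed.stat().st_mode & 0o777)
    print(installed.name, mode)
```

In Colab the call would presumably be `install_kaggle_token("kaggle.json", Path.home())`.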

In [7]:
import kaggle

In [8]:
from kaggle.api.kaggle_api_extended import KaggleApi

In [9]:
api = KaggleApi()

In [10]:
api.authenticate()

In [12]:
api.dataset_list_files('surajjoshi26/google-stock-price2004-2023').files

[google_stock_price.csv]

In [13]:
api.dataset_download_files('surajjoshi26/google-stock-price2004-2023', path='.')

In [20]:
!unzip -q google-stock-price2004-2023.zip

Installing the PySpark environment.

In [22]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [23]:
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz

In [24]:
!tar xf spark-3.1.1-bin-hadoop3.2.tgz

In [25]:
pip install -q findspark

In [26]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

Preparing the data. The dataset is split into training and test sets in an 80/20 ratio.


In [27]:
import findspark
findspark.init()

In [28]:
from pyspark.sql import SparkSession

In [29]:
spark = SparkSession.builder.appName('Google Stock Close Prediction').getOrCreate()

In [36]:
dataFrame = spark.read.csv('google_stock_price.csv', header=True, inferSchema=True)

In [37]:
dataFrame.show()

+----------+------------------+------------------+------------------+------------------+------------------+---------+
|      Date|              Open|              High|               Low|             Close|         Adj Close|   Volume|
+----------+------------------+------------------+------------------+------------------+------------------+---------+
|2004-08-19| 2.490664005279541| 2.591784954071045|2.3900420665740967|2.4991331100463867|2.4991331100463867|897427216|
|2004-08-20| 2.515820026397705|2.7168169021606445|2.5031180381774902| 2.697638988494873| 2.697638988494873|458857488|
|2004-08-23| 2.758410930633545|2.8264060020446777|2.7160699367523193|2.7247869968414307|2.7247869968414307|366857939|
|2004-08-24|2.7706151008605957| 2.779581069946289|2.5795810222625732| 2.611959934234619| 2.611959934234619|306396159|
|2004-08-25| 2.614201068878174| 2.689918041229248|2.5873019695281982| 2.640104055404663| 2.640104055404663|184645512|
|2004-08-26|2.6139519214630127|2.6886720657348633| 2.606

In [38]:
from pyspark.ml.feature import VectorAssembler

In [40]:
assembler = VectorAssembler(inputCols=['Open','High','Low'], outputCol='features')

In [41]:
data = assembler.transform(dataFrame)

In [43]:
data.show(truncate=False)

+----------+------------------+------------------+------------------+------------------+------------------+---------+----------------------------------------------------------+
|Date      |Open              |High              |Low               |Close             |Adj Close         |Volume   |features                                                  |
+----------+------------------+------------------+------------------+------------------+------------------+---------+----------------------------------------------------------+
|2004-08-19|2.490664005279541 |2.591784954071045 |2.3900420665740967|2.4991331100463867|2.4991331100463867|897427216|[2.490664005279541,2.591784954071045,2.3900420665740967]  |
|2004-08-20|2.515820026397705 |2.7168169021606445|2.5031180381774902|2.697638988494873 |2.697638988494873 |458857488|[2.515820026397705,2.7168169021606445,2.5031180381774902] |
|2004-08-23|2.758410930633545 |2.8264060020446777|2.7160699367523193|2.7247869968414307|2.7247869968414307|36685793

In [44]:
prepared_data = data.select('features','Close')

In [45]:
prepared_data.show(truncate=False)

+----------------------------------------------------------+------------------+
|features                                                  |Close             |
+----------------------------------------------------------+------------------+
|[2.490664005279541,2.591784954071045,2.3900420665740967]  |2.4991331100463867|
|[2.515820026397705,2.7168169021606445,2.5031180381774902] |2.697638988494873 |
|[2.758410930633545,2.8264060020446777,2.7160699367523193] |2.7247869968414307|
|[2.7706151008605957,2.779581069946289,2.5795810222625732] |2.611959934234619 |
|[2.614201068878174,2.689918041229248,2.5873019695281982]  |2.640104055404663 |
|[2.6139519214630127,2.6886720657348633,2.606729030609131] |2.687675952911377 |
|[2.6924080848693848,2.705359935760498,2.632383108139038]  |2.6438400745391846|
|[2.622170925140381,2.6274020671844482,2.540726900100708]  |2.540726900100708 |
|[2.547950029373169,2.5830678939819336,2.5444629192352295] |2.5496931076049805|
|[2.5579121112823486,2.5646369457244873,

In [46]:
train_data, test_data = prepared_data.randomSplit([0.8, 0.2], seed=42)
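
As far as I understand, `randomSplit` assigns each row to a split independently at random, so the 80/20 proportion holds only approximately and the seed makes the assignment repeatable. The sketch below mimics that idea in plain Python. It is an illustration under that assumption, not Spark's implementation or its random number generator, and the name `bernoulli_split` is hypothetical.

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))  # stand-in for the rows of prepared_data
train_rows, test_rows = bernoulli_split(rows, 0.8, seed=42)

# The sizes are close to 800/200 but not guaranteed to match exactly
print(len(train_rows), len(test_rows))
```

In the notebook itself, `train_data.count()` and `test_data.count()` show the actual split sizes.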

Creating and training the model using linear regression in PySpark.

In [47]:
from pyspark.ml.regression import LinearRegression

In [49]:
lr = LinearRegression(featuresCol='features', labelCol='Close', predictionCol='predicted_Close')

In [50]:
lr_model = lr.fit(train_data)
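
To make the fitting step more concrete: with no regularization configured, as in the cell above, fitting a linear regression amounts to ordinary least squares. The sketch below is a simplification under stated assumptions. It fits a single feature (Open) instead of the three used in the notebook, uses only the first five rows shown earlier, and the helper name `fit_simple_ols` is hypothetical. It is not how Spark solves the problem internally.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Open and Close values copied from the first five rows of the dataset
opens = [2.490664005279541, 2.515820026397705, 2.758410930633545,
         2.7706151008605957, 2.614201068878174]
closes = [2.4991331100463867, 2.697638988494873, 2.7247869968414307,
          2.611959934234619, 2.640104055404663]

slope, intercept = fit_simple_ols(opens, closes)
print("slope:", slope, "intercept:", intercept)
```

For the fitted Spark model, the corresponding values are available as `lr_model.coefficients` and `lr_model.intercept`.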

In [51]:
predictions = lr_model.transform(test_data)

In [52]:
from pyspark.ml.evaluation import RegressionEvaluator

In [54]:
evaluator = RegressionEvaluator(labelCol='Close',predictionCol='predicted_Close', metricName='rmse')

In [55]:
rmse = evaluator.evaluate(predictions)

In [56]:
print("Root mean squared error (RMSE) on the test data: {:.3f}".format(rmse))

Root mean squared error (RMSE) on the test data: 0.392


In [57]:
evaluator_r2 = RegressionEvaluator(labelCol='Close', predictionCol='predicted_Close', metricName='r2')

In [58]:
r2 = evaluator_r2.evaluate(predictions)

In [59]:
print("R-squared on the test data: {:.3f}".format(r2))

R-squared on the test data: 1.000
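
For reference, both metrics can be computed by hand. The sketch below is an illustration only. It uses the nine complete rows from the `prepared_data` output shown earlier and, as a stand-in assumption, predicts Close as the midpoint of High and Low instead of using the fitted model. The helper names `rmse_of` and `r2_of` are hypothetical. The formulas are the standard definitions, which I believe match what `RegressionEvaluator` reports for `rmse` and `r2`.

```python
import math

# (High, Low, Close) copied from the prepared_data rows shown earlier
rows = [
    (2.591784954071045, 2.3900420665740967, 2.4991331100463867),
    (2.7168169021606445, 2.5031180381774902, 2.697638988494873),
    (2.8264060020446777, 2.7160699367523193, 2.7247869968414307),
    (2.779581069946289, 2.5795810222625732, 2.611959934234619),
    (2.689918041229248, 2.5873019695281982, 2.640104055404663),
    (2.6886720657348633, 2.606729030609131, 2.687675952911377),
    (2.705359935760498, 2.632383108139038, 2.6438400745391846),
    (2.6274020671844482, 2.540726900100708, 2.540726900100708),
    (2.5830678939819336, 2.5444629192352295, 2.5496931076049805),
]


def rmse_of(labels, preds):
    """Square root of the mean squared difference between labels and predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))


def r2_of(labels, preds):
    """1 - SS_res / SS_tot, where SS_tot is measured around the label mean."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


labels = [close for _, _, close in rows]
# Stand-in predictions (assumption): midpoint of High and Low, not the model output
preds = [(high + low) / 2 for high, low, _ in rows]

sample_rmse = rmse_of(labels, preds)
sample_r2 = r2_of(labels, preds)
print("RMSE: {:.3f}".format(sample_rmse))
print("R-squared: {:.3f}".format(sample_r2))
```

RMSE is expressed in the same units as Close, while R-squared is relative to the variance of Close in the evaluated rows, so the two values are best read together.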
