<a href="https://colab.research.google.com/github/holictoweb/spark_deep_dive/blob/main/sparksql/pyspark_table_datasource.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### colab + yfinance + google drive + pyspark

1. Install pyspark
2. Collect data with yfinance
3. Save the data as parquet on Google Drive
4. Register the data as a table with pyspark
5. Analyze the data

### Datasource

[Spark official documentation on data sources (3.1.1 at the time of writing)](https://spark.apache.org/docs/latest/sql-data-sources.html)



In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql.types import *
from pyspark.sql.functions import lit
 
spark = SparkSession.builder.appName('test_spark').getOrCreate()

In [13]:
spark.sql("create database IF NOT EXISTS stocklab")
spark.sql("show databases").show()
spark.sql("describe database stocklab").show(100,False)

+---------+
|namespace|
+---------+
|  default|
| stocklab|
+---------+

+-------------------------+-----------------------------------------+
|database_description_item|database_description_value               |
+-------------------------+-----------------------------------------+
|Database Name            |stocklab                                 |
|Comment                  |                                         |
|Location                 |file:/content/spark-warehouse/stocklab.db|
|Owner                    |                                         |
+-------------------------+-----------------------------------------+



In [None]:
# Fetch the data for each ticker through yfinance
import yfinance as yf
 
#data = yf.download("SPY AAPL", start="2017-01-01", end="2017-04-30") #sample code 
 
ticker_list = ["005930.KS", ]
 
for ticker in ticker_list:
  pdf = yf.download(ticker, start='2020-01-01')  # yfinance takes the start date as `start`
 
  '''
  df_schema = StructType([
    StructField("open", DoubleType(), True),
    StructField("high", DoubleType(), True),
    StructField("low", DoubleType(), True),
    StructField("close", DoubleType(), True),
    StructField("adjclose", DoubleType(), True),
    StructField("volume", LongType(), True),
  ])
  '''
 
 
  df = spark.createDataFrame(pdf)  #df.show()
  
  #df.write.format('parquet').save('drive/MyDrive/data-warehouse/test')  # fails because of the "Adj Close" column name
  # AnalysisException: Attribute name "Adj Close" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it. 
 
  ticker = ticker.split('.')[0]
  print(ticker)
  
  df = df.withColumnRenamed("Adj Close", "AdjClose").withColumn("Code", lit(ticker))
 
  path = 'drive/MyDrive/data-warehouse/stock_day'
  df.write.format('parquet').mode("overwrite").save(path)
  #df.show()
  create_table_sql = 'create table if not exists stocklab.stock_day using org.apache.spark.sql.parquet options (path "'+ path +'")'
  print(create_table_sql)
  spark.sql(create_table_sql)
  
  
  '''
  df.write.saveAsTable("stock_day_test")
 
  df_read = spark.read.format("parquet").load(path)
  df_read.show()
  '''

[*********************100%***********************]  1 of 1 completed
005930
create table if not exists stocklab.stock_day using org.apache.spark.sql.parquet options (path "drive/MyDrive/data-warehouse/stock_day")
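
The cell above renames only "Adj Close" by hand. As a minimal sketch, assuming you want to strip every character listed in the AnalysisException message from all column names at once, a small helper could look like the following. The name `sanitize_column_name` is hypothetical and not part of pyspark; the column list mirrors the columns returned by yfinance in this notebook.

```python
# Characters that the AnalysisException quoted in the cell above reports as invalid.
INVALID_CHARS = ' ,;{}()\n\t='


def sanitize_column_name(name):
    """Drop every character that Parquet rejects in a column name (hypothetical helper)."""
    return "".join(ch for ch in name if ch not in INVALID_CHARS)


columns = ["Open", "High", "Low", "Close", "Adj Close", "Volume"]
clean_columns = [sanitize_column_name(c) for c in columns]
print(clean_columns)

# With a Spark DataFrame at hand, the cleaned names could then be applied in one call:
# df = df.toDF(*clean_columns)
```

Removing the characters, rather than replacing them with an underscore, keeps the result consistent with the `AdjClose` name used in the rest of the notebook.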


In [None]:
spark.sql("drop table stocklab.stock_day")

DataFrame[]

In [None]:
spark.sql("select * from stocklab.stock_day").show()
 
spark.sql("use stocklab")
spark.sql("describe table extended stock_day").show(100,False)

+----+----+---+-----+--------+------+----+
|Open|High|Low|Close|AdjClose|Volume|Code|
+----+----+---+-----+--------+------+----+
+----+----+---+-----+--------+------+----+

+----------------------------+--------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                       |comment|
+----------------------------+--------------------------------------------------------------------------------+-------+
|Open                        |double                                                                          |null   |
|High                        |double                                                                          |null   |
|Low                         |double                                                                          |null   |
|Close                       |double                                                                       
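
The select above returns no rows even though data was written to the path. One possible cause, stated here as an unverified assumption, is that the relative path passed in `options (path ...)` is resolved against a different base directory than the notebook's working directory. A minimal sketch that rules this out by always passing an absolute path is shown below. The helper name `build_create_table_sql` and the default base directory `/content` (the usual Colab working directory) are assumptions, not something the notebook defines.

```python
import posixpath


def build_create_table_sql(database, table, path, base_dir="/content"):
    """Build a CREATE TABLE statement that points at an absolute Parquet path.

    base_dir is an assumption: the working directory the relative path was
    written from (Colab usually starts in /content).
    """
    if posixpath.isabs(path):
        absolute_path = path
    else:
        absolute_path = posixpath.join(base_dir, path)
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} "
        f"USING parquet OPTIONS (path '{absolute_path}')"
    )


sql = build_create_table_sql(
    "stocklab", "stock_day", "drive/MyDrive/data-warehouse/stock_day"
)
print(sql)

# With an active SparkSession, the statement could then be executed:
# spark.sql(sql)
```

If the table then shows rows, the relative path was the culprit; if it is still empty, the cause lies elsewhere.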

In [None]:
spark.sql("drop table stocklab.stock_day_test")

## Run Spark SQL directly on files

https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#run-sql-on-files-directly

In [None]:
# df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
# The path must be wrapped in backticks; wrapping it in single quotes raised a ParseException.
spark.sql("select * from parquet.`" + path + "/*.parquet`")
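
The run-SQL-on-files syntax in the linked guide wraps the file path in backticks, not single quotes. As a small sketch, a helper (the name `parquet_file_query` is hypothetical) can build such a query string so the quoting is written only once:

```python
def parquet_file_query(path):
    """Return a query that reads a Parquet path directly (hypothetical helper).

    The path is wrapped in backticks, as in the Spark SQL guide's example.
    """
    return f"SELECT * FROM parquet.`{path}`"


query = parquet_file_query("drive/MyDrive/data-warehouse/stock_day")
print(query)

# With an active SparkSession, the query could then be executed:
# spark.sql(query).show()
```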

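## How to use pyspark
- After running pip install pyspark, check the location reported on the "Stored in directory" line of the output:
  
/root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f

- Set SPARK_HOME. A pip-installed pyspark normally locates its own Spark home, so this step is usually optional.

PYSPARK_PYTHON=python3 SPARK_HOME=/root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f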
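
The "Stored in directory" path is pip's wheel cache, which may differ from the directory the package is actually imported from. As a minimal sketch using only the standard library, the import location of a package can be looked up without importing it. The helper name `package_location` is hypothetical.

```python
import importlib.util
import os


def package_location(name):
    """Return the directory a top-level package would be imported from, or None.

    Hypothetical helper: it only inspects the import system and does not
    import the package itself.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# Prints the pyspark install directory, or None when pyspark is not installed.
print(package_location("pyspark"))
```

If SPARK_HOME has to be set explicitly, the directory printed here is a more likely candidate than the wheel cache path, although that is an assumption worth checking against your own environment.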

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=1f259f52331909a02366411bf61b0efe62634e1b18960787dc4418cf0d72eaca
  Stored in directory: /root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.1


In [2]:
!PYSPARK_PYTHON=python3 SPARK_HOME=/root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f

In [3]:
!pip install yfinance

Collecting yfinance
  Downloading https://files.pythonhosted.org/packages/a7/ee/315752b9ef281ba83c62aa7ec2e2074f85223da6e7e74efb4d3e11c0f510/yfinance-0.1.59.tar.gz
Collecting lxml>=4.5.1
  Downloading https://files.pythonhosted.org/packages/cf/4d/6537313bf58fe22b508f08cf3eb86b29b6f9edf68e00454224539421073b/lxml-4.6.3-cp37-cp37m-manylinux1_x86_64.whl (5.5MB)
Building wheels for collected packages: yfinance
  Building wheel for yfinance (setup.py) ... done
  Created wheel for yfinance: filename=yfinance-0.1.59-py2.py3-none-any.whl size=23442 sha256=58497a280ed2fd197e30babade6ff1f3121a967901d7e2918dffa02bd9addeba
  Stored in directory: /root/.cache/pip/wheels/f8/2a/0f/4b5a86e1d52e451757eb6bc17fd899629f0925c777741b6d04
Successfully built yfinance
Installing collected packages: lxml, yfinance
  Found existing installation: lxml 4.2.6
    Uninstalling lxml-4.2.6:
      Successfully uninstalled lxml-4.2.6
Successfull

In [4]:
!ls drive/MyDrive/data-warehouse/stock_day

ls: cannot access 'drive/MyDrive/data-warehouse/stock_day': No such file or directory
