# Week 7: Computing z-scores and CDF values
The grade data has a small n, but assume it follows a normal distribution.

* 1-1 Create a DataFrame from the grade data.

* 1-2 Create a zscore column.

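To compute a z-score, you need the mean and the standard deviation of the marks.

Using the aggregate F functions (`F.mean`, `F.stddev`) directly inside the row-wise formula raises an error. Compute the mean and standard deviation separately first, then use those values in the formula.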
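As a cross-check of the formula z = (x - mean) / std, the following minimal sketch recomputes the statistics in plain Python rather than Spark. It assumes the sample standard deviation (n - 1 denominator), which is what `F.stddev` reports as `stddev_samp` in the output further down; the variable names are illustrative.

```python
import statistics

# Marks from the problem statement, in the same order.
marks = [100.0, 80.0, 70.0, 100.0, 82.3, 98.5]

mean = statistics.mean(marks)
std = statistics.stdev(marks)   # sample standard deviation (n - 1)

# z-score: how many standard deviations each mark lies from the mean.
zscores = [(m - mean) / std for m in marks]

for m, z in zip(marks, zscores):
    print(f"{m:6.1f} -> {z:+.4f}")
```

The printed values should agree with the Spark results to several decimal places; small differences come from Spark storing the marks as 32-bit floats.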



* 1-3 Create a cdf column.

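`scipy.stats.norm.cdf()` returns a NumPy floating-point value, so the result must be converted to a plain Python `float` before it is returned from the UDF.

By default, `cdf` computes the cumulative probability for a normal distribution with mean 0 and standard deviation 1, so it is applied to the z-score rather than to the raw mark.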
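To illustrate the default parameters, here is a minimal sketch of the normal CDF written with `math.erf` from the standard library instead of `scipy.stats.norm.cdf`. The helper name `normal_cdf` and its `loc`/`scale` arguments are assumptions made for this example. The point is that evaluating the CDF of a raw mark with its own mean and standard deviation gives the same value as evaluating the standard normal CDF of the z-score.

```python
import math

def normal_cdf(x: float, loc: float = 0.0, scale: float = 1.0) -> float:
    """Cumulative probability P(X <= x) for a normal distribution."""
    z = (x - loc) / scale
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean, std = 88.4667, 12.7862     # rounded statistics of the marks
z = (100.0 - mean) / std         # z-score of a mark of 100

p_from_z = normal_cdf(z)                    # standard normal: loc=0, scale=1
p_from_raw = normal_cdf(100.0, mean, std)   # same distribution, raw mark

print(p_from_z, p_from_raw)
```

Both calls print the same probability, which is why the exercise standardizes the marks first and then calls `cdf` with its defaults.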

```
marks=[
    "김하나, English, 100",
    "김하나, Math, 80",
    "임하나, English, 70",
    "임하나, Math, 100",
    "김갑돌, English, 82.3",
    "김갑돌, Math, 98.5"
]
```

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark

myConf=pyspark.SparkConf()
spark = pyspark.sql.SparkSession.builder\
    .master("local")\
    .appName("7")\
    .config(conf=myConf)\
    .getOrCreate()

In [3]:
import os
import sys

# Machine-specific paths; adjust them to your own installation.
# Settings like these are normally applied before the SparkSession is created.
os.environ["PYSPARK_PYTHON"]="C:\\Users\\SW\\anaconda3\\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"]="C:\\Users\\SW\\anaconda3\\python.exe"
os.environ["JAVA_HOME"]="C:\\Program Files\\Java\\jdk-11.0.11\\bin"

os.environ["SPARK_HOME"]="C:\\spark\\spark-3.1.2-bin-hadoop3.2"
os.environ["PYLIB"]="C:\\spark\\spark-3.1.2-bin-hadoop3.2\\python\\lib"

# Make the bundled py4j and pyspark archives importable.
sys.path.insert(0,os.path.join(os.environ["PYLIB"],"py4j-0.10.9-src.zip"))
sys.path.insert(0,os.path.join(os.environ["PYLIB"],"pyspark.zip"))

In [4]:
marks=[
    "김하나, English, 100",
    "김하나, Math, 80",
    "임하나, English, 70",
    "임하나, Math, 100",
    "김갑돌, English, 82.3",
    "김갑돌, Math, 98.5"
]

### Problem 1-1: Create a DataFrame from the grade data

In [5]:
_marksRdd=spark.sparkContext.parallelize(marks).map(lambda x: x.split(','))
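
Note that `split(',')` keeps the space that follows each comma, so the fields arrive as `' English'` and `' 100'`. A small sketch of a parser that trims each field is shown here in plain Python. The function name `parse_mark` is hypothetical, and the same function could be passed to `map` in place of the lambda.

```python
def parse_mark(line: str) -> list:
    """Split a 'name, subject, mark' record and trim surrounding whitespace."""
    return [field.strip() for field in line.split(",")]

raw = "김하나, English, 100"
print(raw.split(","))      # fields keep their leading spaces
print(parse_mark(raw))     # fields are trimmed
```

Trimming matters mostly for the `subject` column, where a later filter on `'English'` would not match `' English'`.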

In [6]:
_marksDf=spark.createDataFrame(_marksRdd, schema=["name", "subject", "mark"])

In [7]:
_marksDf.printSchema()

root
 |-- name: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- mark: string (nullable = true)



In [8]:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

# Cast the string column 'mark' to float in a new column, then drop the original.
_marksDf = _marksDf.withColumn('score', _marksDf['mark'].cast(FloatType()))
_marksDf=_marksDf.drop('mark')

In [9]:
# Rename the numeric column back to 'mark'.
_marksDf=_marksDf.withColumnRenamed('score','mark')
_marksDf.printSchema()

root
 |-- name: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- mark: float (nullable = true)



In [10]:
_marksDf.show()

+------+--------+-----+
|  name| subject| mark|
+------+--------+-----+
|김하나| English|100.0|
|김하나|    Math| 80.0|
|임하나| English| 70.0|
|임하나|    Math|100.0|
|김갑돌| English| 82.3|
|김갑돌|    Math| 98.5|
+------+--------+-----+



## Problem 1-2: Create a zscore column

In [16]:
# Aggregate once to get the mean and the sample standard deviation of 'mark'.
marksScore = _marksDf.select(
    F.mean('mark'),
    F.stddev('mark')
).collect()

In [17]:
marksScore

[Row(avg(mark)=88.46666717529297, stddev_samp(mark)=12.786190172956093)]

In [18]:
meanM = marksScore[0][0]
stvM = marksScore[0][1]
print(meanM, stvM)

88.46666717529297 12.786190172956093


In [20]:
# Apply z = (x - mean) / std row by row; the result is stored in 'markF'.
# No returnType is given, so F.udf produces a string column.
zscoreUdf = F.udf(lambda x : (x-meanM)/stvM)
_marksDf = _marksDf.withColumn("markF",zscoreUdf(_marksDf.mark))
_marksDf.printSchema()

root
 |-- name: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- mark: float (nullable = true)
 |-- markF: string (nullable = true)
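
The schema above lists `markF` as `string` because `F.udf` falls back to `StringType` when no return type is given. One possible alternative, sketched here under the assumption that a numeric column is wanted, is to build the z-score function as a plain Python closure and declare the return type when wrapping it. The name `make_zscore` is hypothetical.

```python
from typing import Callable

def make_zscore(mean: float, std: float) -> Callable[[float], float]:
    """Return a function that standardizes a value with the given statistics."""
    def zscore(x: float) -> float:
        return (x - mean) / std
    return zscore

# Statistics taken from the printed output of the aggregation cell.
zscore = make_zscore(88.46666717529297, 12.786190172956093)
print(zscore(100.0))

# With Spark available, the closure could be wrapped with an explicit type:
#   from pyspark.sql.types import DoubleType
#   zscoreUdf = F.udf(zscore, DoubleType())
```

Keeping the function separate from the UDF wrapper also makes it easy to test without starting a Spark session.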



### Answer 1-2

In [21]:
_marksDf.show()

+------+--------+-----+-------------------+
|  name| subject| mark|              markF|
+------+--------+-----+-------------------+
|김하나| English|100.0|  0.902014804151829|
|김하나|    Math| 80.0| -0.662172786480269|
|임하나| English| 70.0| -1.444266581796318|
|임하나|    Math|100.0|  0.902014804151829|
|김갑돌| English| 82.3|-0.4822909748814927|
|김갑돌|    Math| 98.5| 0.7847007348544217|
+------+--------+-----+-------------------+



## 1-3 Create a cdf column

In [23]:
markScore2=_marksDf.select("mark").collect()

In [24]:
print(markScore2)

[Row(mark=100.0), Row(mark=80.0), Row(mark=70.0), Row(mark=100.0), Row(mark=82.30000305175781), Row(mark=98.5)]
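
The value `82.30000305175781` appears because `FloatType` is a 32-bit float, and 82.3 has no exact 32-bit representation. The sketch below reproduces the rounding with the standard `struct` module. `to_float32` is a hypothetical helper, and casting to `DoubleType` instead would be one way to keep more precision.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

print(to_float32(82.3))    # stored value after the FloatType cast
print(to_float32(98.5))    # exactly representable, so unchanged
```

The same rounding explains why the mean is reported as 88.46666717529297 rather than 88.4666...67.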


In [29]:
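from scipy.stats import norm

# markF is a string column, so convert it to float before calling norm.cdf;
# the outer float() turns the NumPy result into a plain Python float.
markCdf = F.udf(lambda x :float(norm.cdf(float(x))))
_marksDf = _marksDf.withColumn("cdf",markCdf(_marksDf.markF))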
_marksDf.printSchema()

root
 |-- name: string (nullable = true)
 |-- subject: string (nullable = true)
 |-- mark: float (nullable = true)
 |-- markF: string (nullable = true)
 |-- cdf: string (nullable = true)



### Answer 1-3

In [30]:
_marksDf.show()

+------+--------+-----+-------------------+-------------------+
|  name| subject| mark|              markF|                cdf|
+------+--------+-----+-------------------+-------------------+
|김하나| English|100.0|  0.902014804151829| 0.8164754981807292|
|김하나|    Math| 80.0| -0.662172786480269| 0.2539302463290559|
|임하나| English| 70.0| -1.444266581796318| 0.0743320011235712|
|임하나|    Math|100.0|  0.902014804151829| 0.8164754981807292|
|김갑돌| English| 82.3|-0.4822909748814927|0.31479962882028223|
|김갑돌|    Math| 98.5| 0.7847007348544217| 0.7836854740814176|
+------+--------+-----+-------------------+-------------------+

