### Example scalability problems in ML tasks:

CPU bound: The data fits in RAM, but training takes too long, e.g. when many combinations of model parameters or many different models have to be evaluated.

Memory bound: The data is too large to fit in RAM.


#### Pipeline

![](https://github.com/kornisch/ds-notebooks/blob/main/img/ml-Pipeline.png?raw=1)

#### Pipeline Model

![](https://github.com/kornisch/ds-notebooks/blob/main/img/ml-PipelineModel.png?raw=1)

### ML processing pipeline

* <b>DataFrame</b>: The ML API uses DataFrames, which can hold a variety of data types. For example, a DataFrame can have separate columns storing text, feature vectors, true labels, and predictions.

* <b>Transformer</b>: A transformer is an algorithm that converts one DataFrame into another. For example, an ML model is a transformer that turns a DataFrame with features into a DataFrame with predictions. Another example is the model obtained by fitting StringIndexer, which encodes string columns as numeric indices.

* <b>Estimator</b>: An estimator is an algorithm that can be fitted on a DataFrame to produce a transformer. For example, a learning algorithm is an estimator that trains on a DataFrame and produces a model.

* <b>Pipeline</b>: A pipeline chains multiple transformers and estimators together to define an ML workflow.

* <b>Parameter</b>: All transformers and estimators share a common API for specifying parameters.
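
The relationship between these abstractions can be illustrated with a toy sketch in plain Python. The classes below (`ToyIndexerEstimator`, `ToyIndexerModel`, `ToyPipeline`, `ToyPipelineModel`) are hypothetical and only mimic the fit/transform contract; they are not part of the Spark API, and rows are plain dictionaries rather than DataFrames.

```python
class ToyIndexerModel:
    """Transformer: adds a column with integer codes learned during fit."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        # Return new rows; the input rows are left untouched.
        return [{**r, self.output_col: self.mapping[r[self.input_col]]} for r in rows]


class ToyIndexerEstimator:
    """Estimator: learns a value -> code mapping and returns a transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = sorted({r[self.input_col] for r in rows})
        mapping = {v: i for i, v in enumerate(values)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Pipeline: fits estimators one by one, feeding each the transformed data."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted, current = [], rows
        for stage in self.stages:
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            current = model.transform(current)
            fitted.append(model)
        return ToyPipelineModel(fitted)


rows = [{"os": "Windows"}, {"os": "Linux"}, {"os": "Windows"}]
pipeline_model = ToyPipeline([ToyIndexerEstimator("os", "os_idx")]).fit(rows)
result = pipeline_model.transform(rows)
print(result)
```

The point of the sketch is the division of labour: `fit` learns something from the data and returns an object whose only job is `transform`.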

### Loading the data

In [1]:
import os
# user_name = os.environ.get('USER')
user_name = 'kornisch'
print(user_name)

kornisch


In [2]:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.config('spark.driver.memory','1g') \
.config('spark.executor.memory', '2g') \
.getOrCreate()

In [3]:
semester = '2024l'
user_id = 7042

In [10]:
gs_path = '/content/sample_data/survey_2020_survey_results_public.csv'

In [11]:
db_name = user_name.replace('-','_')

In [12]:
spark.sql(f'DROP DATABASE IF EXISTS {db_name} CASCADE')
spark.sql(f'CREATE DATABASE {db_name}')
spark.sql(f'USE {db_name}')

DataFrame[]

In [13]:
table_name = "survey_2020"

In [14]:
spark.sql(f'DROP TABLE IF EXISTS {table_name}')

spark.sql(f'CREATE TABLE IF NOT EXISTS {table_name} \
          USING csv \
          OPTIONS (HEADER true, INFERSCHEMA true, NULLVALUE "NA") \
          LOCATION "{gs_path}"')

DataFrame[]

In [15]:
spark.sql(f'describe {table_name}').show(100)

+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|          Respondent|      int|   NULL|
|          MainBranch|   string|   NULL|
|            Hobbyist|   string|   NULL|
|                 Age|   double|   NULL|
|          Age1stCode|   string|   NULL|
|            CompFreq|   string|   NULL|
|           CompTotal|   double|   NULL|
|       ConvertedComp|   double|   NULL|
|             Country|   string|   NULL|
|        CurrencyDesc|   string|   NULL|
|      CurrencySymbol|   string|   NULL|
|DatabaseDesireNex...|   string|   NULL|
|  DatabaseWorkedWith|   string|   NULL|
|             DevType|   string|   NULL|
|             EdLevel|   string|   NULL|
|          Employment|   string|   NULL|
|           Ethnicity|   string|   NULL|
|              Gender|   string|   NULL|
|          JobFactors|   string|   NULL|
|              JobSat|   string|   NULL|
|             JobSeek|   string|   NULL|
|LanguageDesireN

### Preparing the data for analysis

In this task we want to build a classifier that predicts whether a respondent earns more than 60000 USD per year.

In [16]:
spark_df= spark.sql(f'SELECT *, CAST((convertedComp > 60000) AS STRING) AS compAboveAvg \
                    FROM {table_name} where convertedComp IS NOT NULL ')
spark_df.limit(5).toPandas()

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro,compAboveAvg
0,8,I am a developer by profession,Yes,36.0,12,Yearly,116000.0,116000.0,United States,United States dollar,...,Appropriate in length,No,"Computer science, computer engineering, or sof...",Django;React.js;Vue.js,Flask,Just as welcome now as I felt last year,39.0,17,13,True
1,10,I am a developer by profession,Yes,22.0,14,Yearly,25000.0,32315.0,United Kingdom,Pound sterling,...,Appropriate in length,No,Mathematics or statistics,Flask;jQuery,Flask;jQuery,Somewhat more welcome now than last year,36.0,8,4,False
2,11,I am a developer by profession,Yes,23.0,13,Yearly,31000.0,40070.0,United Kingdom,Pound sterling,...,Appropriate in length,No,"Computer science, computer engineering, or sof...",Angular;Django;React.js,Angular;Angular.js;Django;React.js,Just as welcome now as I felt last year,40.0,10,2,False
3,12,I am a developer by profession,No,49.0,42,Monthly,1100.0,14268.0,Spain,European Euro,...,Appropriate in length,No,Mathematics or statistics,ASP.NET;jQuery,ASP.NET;jQuery,Just as welcome now as I felt last year,40.0,7,7,False
4,13,"I am not primarily a developer, but I write co...",Yes,53.0,14,Monthly,3000.0,38916.0,Netherlands,European Euro,...,Too long,No,,,,A lot less welcome now than last year,36.0,35,20,False


<B>The goal is to end up with a single feature vector and a single label column.</B>

We encode the text columns as numeric indices and then encode those indices with a one-hot representation (OneHotEncoder). Finally, we assemble everything into a single vector.

In [17]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
# we want to predict compAboveAvg
y = 'compAboveAvg'
# based on:
feature_columns = ['OpSys', 'EdLevel', 'MainBranch' , 'Country', 'JobSeek', 'YearsCode']

In [18]:
# We start with the StringIndexer stage, which converts string values to numeric indices
# for the features that will be used for prediction

##### first a plain for loop, then we refactor it into a list comprehension
stringindexer_stages_1 = []
for c in feature_columns:
    stringindexer_stages_1.append (StringIndexer(inputCol=c, outputCol='stringindexed_' + c).setHandleInvalid("keep"))

# and for the target variable
stringindexer_stages_1.append(StringIndexer(inputCol=y, outputCol='label').setHandleInvalid("keep"))


<b>handleInvalid</b>: how StringIndexer handles invalid data (unseen labels or NULL values). Options are 'error' (throw an error; the default), 'skip' (filter out rows with invalid data), and 'keep' (put invalid data in an extra category at index numLabels).
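
The effect of these options can be sketched in plain Python. The helpers `fit_labels` and `index_value` are hypothetical names introduced for illustration; they imitate the documented behaviour (labels ordered by descending frequency, invalid values sent to index `numLabels` under 'keep') and the alphabetical tie-breaking is an assumption of this sketch.

```python
from collections import Counter


def fit_labels(values):
    """Order distinct non-null values by descending frequency (ties: alphabetical)."""
    counts = Counter(v for v in values if v is not None)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


def index_value(value, labels, handle_invalid="error"):
    """Map a value to its index; nulls and unseen values count as invalid."""
    if value in labels:
        return float(labels.index(value))
    if handle_invalid == "keep":
        return float(len(labels))  # extra bucket at index numLabels
    if handle_invalid == "skip":
        return None  # stands in for "this row is filtered out"
    raise ValueError(f"Unseen label: {value!r}")


labels = fit_labels(["false", "true", "false", "false"])
possible_indices_with_keep = len(labels) + 1

print(labels)
print(index_value(None, labels, handle_invalid="keep"))
print(possible_indices_with_keep)
```

Note that with 'keep' a column with two distinct values is described as having three possible indices. Applied to the label column, this is probably why the decision tree fitted later in this notebook reports `numClasses=3` and why `BinaryClassificationEvaluator` rejects its `rawPrediction` vectors of length 3. A possible remedy, not applied in this notebook, would be to leave the label indexer at the default 'error' setting, since the label column contains no NULL values after the `IS NOT NULL` filter.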

In [19]:
# Refactored into a list comprehension
stringindexer_stages = [StringIndexer(inputCol=c, outputCol='stringindexed_' + c).setHandleInvalid("keep") for c in feature_columns]

# and for the target variable
stringindexer_stages += [StringIndexer(inputCol=y, outputCol='label').setHandleInvalid("keep")]
stringindexer_stages

[StringIndexer_2517e1474885,
 StringIndexer_5a02c0722c0c,
 StringIndexer_8d48f7b49921,
 StringIndexer_8192c7d4058e,
 StringIndexer_f3fb1e2da366,
 StringIndexer_59c30bda5889,
 StringIndexer_f22e8ad6e9d4]

In [20]:
# After this transformation 7 new columns are added to the DataFrame: six with the prefix "stringindexed_" plus "label"
Pipeline(stages=stringindexer_stages).fit(spark_df).transform(spark_df).toPandas()

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,YearsCode,YearsCodePro,compAboveAvg,stringindexed_OpSys,stringindexed_EdLevel,stringindexed_MainBranch,stringindexed_Country,stringindexed_JobSeek,stringindexed_YearsCode,label
0,8,I am a developer by profession,Yes,36.0,12,Yearly,116000.0,116000.0,United States,United States dollar,...,17,13,true,2.0,0.0,0.0,0.0,0.0,17.0,1.0
1,10,I am a developer by profession,Yes,22.0,14,Yearly,25000.0,32315.0,United Kingdom,Pound sterling,...,8,4,false,0.0,1.0,0.0,2.0,0.0,1.0,0.0
2,11,I am a developer by profession,Yes,23.0,13,Yearly,31000.0,40070.0,United Kingdom,Pound sterling,...,10,2,false,0.0,0.0,0.0,2.0,2.0,0.0,0.0
3,12,I am a developer by profession,No,49.0,42,Monthly,1100.0,14268.0,Spain,European Euro,...,7,7,false,0.0,2.0,0.0,10.0,0.0,3.0,0.0
4,13,"I am not primarily a developer, but I write co...",Yes,53.0,14,Monthly,3000.0,38916.0,Netherlands,European Euro,...,35,20,false,1.0,3.0,1.0,7.0,1.0,24.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34751,65619,"I am not primarily a developer, but I write co...",Yes,,19,Monthly,30000.0,984.0,Nigeria,Nigerian naira,...,3,2,false,0.0,3.0,1.0,38.0,3.0,13.0,0.0
34752,65625,I am a developer by profession,Yes,,17,Monthly,5500000.0,19428.0,Colombia,Colombian peso,...,12,5,false,4.0,0.0,0.0,39.0,1.0,7.0,0.0
34753,65629,I am a developer by profession,Yes,41.0,15,Yearly,200.0,200.0,United States,United States dollar,...,25,20,false,1.0,2.0,0.0,0.0,0.0,14.0,0.0
34754,65630,I am a developer by profession,Yes,,17,Monthly,1000000.0,15048.0,Chile,Chilean peso,...,7,3,false,4.0,0.0,0.0,48.0,0.0,3.0,0.0


In [21]:
onehotencoder_stages = [OneHotEncoder(inputCol='stringindexed_' + c, outputCol='onehot_' + c) for c in feature_columns]

We extend the pipeline:

After this transformation (StringIndexer + OneHotEncoder) 6 more columns, with the prefix "onehot_", are added to the DataFrame.

In [22]:

pa = Pipeline(stages=stringindexer_stages + onehotencoder_stages).fit(spark_df).transform(spark_df).toPandas()

In [23]:
pa.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
       'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
       'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
       'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'MiscTechDesireNextYear', 'MiscTechWorkedWith',
       'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
       'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
       'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
       'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
       'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
       'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
       'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
       'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
  

The new columns contain SparseVector values that hold the one-hot bitmap.


In [24]:
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://miro.medium.com/max/2400/1*ggtP4a5YaRx6l09KQaYOnw.png")

In [25]:
print("Original values:")
print(pa['OpSys'].unique())
print ("---------")
print("StringIndexed values:")
print(pa['stringindexed_OpSys'].unique())
print ("---------")
print("OneHot values:")
print(pa['onehot_OpSys'].unique())

Original values:
['Linux-based' 'Windows' 'MacOS' None 'BSD']
---------
StringIndexed values:
[2. 0. 1. 4. 3.]
---------
OneHot values:
[SparseVector(4, {2: 1.0}) SparseVector(4, {0: 1.0})
 SparseVector(4, {1: 1.0}) SparseVector(4, {}) SparseVector(4, {3: 1.0})]
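
The output above shows five indexed categories encoded into vectors of size 4, with the last category (the extra bucket for the missing value) represented as an all-zero vector. This matches the default `dropLast=True` behaviour of `OneHotEncoder`. A minimal plain-Python sketch of that encoding follows; `one_hot_drop_last` is a hypothetical helper and it returns dense lists instead of SparseVector objects.

```python
def one_hot_drop_last(index, num_categories):
    """Dense one-hot vector of length num_categories - 1.

    The last category has no slot of its own and is encoded as all zeros.
    """
    size = num_categories - 1
    vector = [0.0] * size
    if index < size:
        vector[int(index)] = 1.0
    return vector


# Indices taken from the StringIndexed output above (None is the 'keep' bucket).
indexed = {"Windows": 0, "MacOS": 1, "Linux-based": 2, "BSD": 3, None: 4}
encoded = {name: one_hot_drop_last(i, len(indexed)) for name, i in indexed.items()}

for name, vector in encoded.items():
    print(name, vector)
```

Dropping the last slot avoids a redundant column: the final category is already implied when all other slots are zero.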


#### <B>Assembly</B> - combining all predictor columns into a single one (the features column)

In [26]:
extracted_columns = ['onehot_' + c for c in feature_columns]
vectorassembler_stage = VectorAssembler(inputCols=extracted_columns, outputCol='features')
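
Conceptually, the assembler concatenates the input vectors in the order given by `inputCols`. The plain-Python sketch below uses dense lists and a hypothetical `assemble` helper; the three example vectors are taken from the first row of the transformed data shown in this notebook, while the real stage works on sparse vectors and all six one-hot columns.

```python
def assemble(*vectors):
    """Concatenate several vectors into one flat feature vector."""
    features = []
    for vector in vectors:
        features.extend(vector)
    return features


onehot_opsys = [0.0, 0.0, 1.0, 0.0]   # Linux-based
onehot_mainbranch = [1.0, 0.0]        # developer by profession
onehot_jobseek = [1.0, 0.0, 0.0]      # not actively looking

features = assemble(onehot_opsys, onehot_mainbranch, onehot_jobseek)
print(len(features), features)
```

The length of the assembled vector is the sum of the input lengths, which is how the full pipeline arrives at a few hundred features from only six source columns.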

### Combining all data preparation steps into a single processing pipeline

In [27]:
# select the columns for the final DataFrame
# besides the features and label columns (which are used to build the model)
# we also keep, among others, the original columns (feature_columns)
final_columns = [y] + feature_columns + extracted_columns + ['features', 'label']
final_columns

['compAboveAvg',
 'OpSys',
 'EdLevel',
 'MainBranch',
 'Country',
 'JobSeek',
 'YearsCode',
 'onehot_OpSys',
 'onehot_EdLevel',
 'onehot_MainBranch',
 'onehot_Country',
 'onehot_JobSeek',
 'onehot_YearsCode',
 'features',
 'label']

In [28]:
transformed_df = Pipeline(stages=stringindexer_stages + \
                          onehotencoder_stages + \
                          [vectorassembler_stage]).fit(spark_df).transform(spark_df).select(final_columns)

transformed_df.limit(5).toPandas()

Unnamed: 0,compAboveAvg,OpSys,EdLevel,MainBranch,Country,JobSeek,YearsCode,onehot_OpSys,onehot_EdLevel,onehot_MainBranch,onehot_Country,onehot_JobSeek,onehot_YearsCode,features,label
0,True,Linux-based,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,United States,"I’m not actively looking, but I am open to new...",17,"(0.0, 0.0, 1.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",1.0
1,False,Windows,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",I am a developer by profession,United Kingdom,"I’m not actively looking, but I am open to new...",8,"(1.0, 0.0, 0.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",0.0
2,False,Windows,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,United Kingdom,I am actively looking for a job,10,"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 1.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0
3,False,Windows,Some college/university study without earning ...,I am a developer by profession,Spain,"I’m not actively looking, but I am open to new...",7,"(1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...",0.0
4,False,MacOS,"Secondary school (e.g. American high school, G...","I am not primarily a developer, but I write co...",Netherlands,I am not interested in new job opportunities,35,"(0.0, 1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 1.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...","(0.0, 1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...",0.0


### Train/test split

In [47]:
training, test = transformed_df.randomSplit([0.8, 0.2], seed=666)

In [48]:
training.count()

27901
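
The training set holds 27901 of the 34755 rows, which is close to but not exactly 80%. This is expected if each row is assigned to a split independently at random, so the weights only hold approximately. The sketch below illustrates the idea in plain Python; `random_split` is a hypothetical helper and will not reproduce the same row assignment as Spark, even with the same seed.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    total = sum(weights)
    bounds, cumulative = [], 0.0
    for weight in weights:
        cumulative += weight / total  # weights are normalized to sum to 1
        bounds.append(cumulative)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = random_split(range(34755), [0.8, 0.2], seed=666)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which matters when comparing models trained in different runs.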

### Training the model - model.fit()

In [49]:
# to begin with we pick a decision tree; we do not have to set any parameters
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')

In [50]:
simple_model = Pipeline(stages=[dt]).fit(training)

In [51]:
simple_model.stages[0]

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_edb1b88876c2, depth=5, numNodes=55, numClasses=3, numFeatures=229

### Prediction - model.transform()

In [52]:
pred_simple = simple_model.transform(test)

In [53]:
show_columns = final_columns + ['prediction', 'rawPrediction', 'probability']
pred_simple.limit(5).select(show_columns).toPandas()

Unnamed: 0,compAboveAvg,OpSys,EdLevel,MainBranch,Country,JobSeek,YearsCode,onehot_OpSys,onehot_EdLevel,onehot_MainBranch,onehot_Country,onehot_JobSeek,onehot_YearsCode,features,label,prediction,rawPrediction,probability
0,False,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,India,"I’m not actively looking, but I am open to new...",3,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[12462.0, 3265.0, 0.0]","[0.7923952438481592, 0.2076047561518408, 0.0]"
1,False,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,United Arab Emirates,"I’m not actively looking, but I am open to new...",5,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[12462.0, 3265.0, 0.0]","[0.7923952438481592, 0.2076047561518408, 0.0]"
2,False,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",I am a developer by profession,United States,I am not interested in new job opportunities,1,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,1.0,"[406.0, 5030.0, 0.0]","[0.07468727005150846, 0.9253127299484916, 0.0]"
3,False,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","I am not primarily a developer, but I write co...",Afghanistan,I am actively looking for a job,10,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 1.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 1.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[12462.0, 3265.0, 0.0]","[0.7923952438481592, 0.2076047561518408, 0.0]"
4,False,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","I am not primarily a developer, but I write co...",Russian Federation,"I’m not actively looking, but I am open to new...",3,"(0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 1.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,"[12462.0, 3265.0, 0.0]","[0.7923952438481592, 0.2076047561518408, 0.0]"
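
For a decision tree, the `rawPrediction` values in the table above look like per-class counts of training rows in the leaf that a test row falls into, and `probability` is the same vector normalized to sum to 1. The short check below reproduces the numbers from the table; the helper name `normalize` is hypothetical.

```python
def normalize(raw):
    """Scale a vector of non-negative counts so that it sums to 1."""
    total = sum(raw)
    return [value / total for value in raw]


raw_prediction = [12462.0, 3265.0, 0.0]  # copied from the table above
probability = normalize(raw_prediction)
prediction = float(probability.index(max(probability)))

print(probability)
print(prediction)
```

The predicted class is simply the index with the highest probability, so every row that lands in the same leaf receives the same prediction.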


## Evaluation

In [54]:
# confusion matrix
label_and_pred = pred_simple.select('label', 'prediction')
label_and_pred.groupBy('label', 'prediction').count().toPandas()

Unnamed: 0,label,prediction,count
0,1.0,1.0,2209
1,0.0,1.0,626
2,1.0,0.0,862
3,0.0,0.0,3158
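
The usual classification metrics can be derived by hand from the four counts above. The sketch below treats label 1.0 as the positive class (an assumption about which class is of interest) and computes accuracy, precision, recall, and F1 in plain Python.

```python
# (label, prediction) -> count, copied from the confusion matrix above
counts = {
    (1.0, 1.0): 2209,  # true positives
    (0.0, 1.0): 626,   # false positives
    (1.0, 0.0): 862,   # false negatives
    (0.0, 0.0): 3158,  # true negatives
}

tp = counts[(1.0, 1.0)]
fp = counts[(0.0, 1.0)]
fn = counts[(1.0, 0.0)]
tn = counts[(0.0, 0.0)]
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way should agree with the value returned by `MulticlassClassificationEvaluator` with `metricName="accuracy"` on the same predictions.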


In [55]:
# Evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")
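
The `areaUnderROC` metric can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it by comparing all positive/negative pairs on a tiny made-up dataset. This is not how Spark computes the metric internally, and the function name `area_under_roc` and the example data are hypothetical.

```python
def area_under_roc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up scores for the positive class
auroc = area_under_roc(labels, scores)
print(auroc)  # 0.75
```

Because the definition relies on exactly two classes (positive and negative), the metric is only meaningful when the model outputs a score for a binary problem.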

In [56]:
auroc_simple = evaluator.evaluate(pred_simple)
auroc_simple

IllegalArgumentException: requirement failed: rawPredictionCol vectors must have length=2, but got 3

In [57]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator_m = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator_m.evaluate(pred_simple)
accuracy

0.7829321663019694

## Adding hyperparameters

In [58]:
# Which values of the maxDepth hyperparameter should be tested:
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(dt.maxDepth, [2,3,4,5,6]).\
    build()

In [59]:
# Cross-validation used to optimize the hyperparameters
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)
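
A grid search with k-fold cross-validation fits one model per combination of parameter setting and fold, then refits the best setting on the full training set. The plain-Python sketch below shows how the grid and the folds are formed and how many fits that implies here. The helpers `build_grid` and `k_fold_indices` are hypothetical, and folds are assigned round-robin for simplicity, whereas Spark assigns rows to folds at random.

```python
from itertools import product


def build_grid(**param_values):
    """Cartesian product of all parameter value lists, as a list of dicts."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]


def k_fold_indices(n_rows, num_folds):
    """Return (train_indices, validation_indices) pairs, one per fold."""
    folds = [list(range(i, n_rows, num_folds)) for i in range(num_folds)]
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


grid = build_grid(maxDepth=[2, 3, 4, 5, 6])
splits = k_fold_indices(n_rows=12, num_folds=4)

fits_during_search = len(grid) * len(splits)
print(grid)
print(fits_during_search)  # 20 fits, plus one final refit of the best setting
```

The cost grows multiplicatively with every extra parameter added to the grid, which is the CPU-bound scenario mentioned at the start of this notebook.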

In [60]:
# Fit the model on the training data
cv_model = cv.fit(training)

IllegalArgumentException: requirement failed: rawPredictionCol vectors must have length=2, but got 3

In [None]:
cv_model.bestModel

## Prediction with the new model

In [61]:
# What do the predictions on the test set look like?
pred_cv = cv_model.transform(test)
show_columns = final_columns + ['prediction', 'rawPrediction', 'probability']
pred_cv.limit(5).select(show_columns).toPandas()

NameError: name 'cv_model' is not defined

In [None]:
# Confusion matrix
label_and_pred = pred_cv.select('label', 'prediction')
label_and_pred.groupBy('label', 'prediction').count().toPandas()

In [None]:
auroc_cv = evaluator.evaluate(pred_cv)
auroc_cv

In [None]:
acc_cv = evaluator_m.evaluate(pred_cv)
acc_cv

## Classification with Gradient Boosted Trees

In [None]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)
model = gbt.fit(training)

In [None]:
evaluator.evaluate(model.transform(test))

## Exercises:

* Can the prediction quality be improved further by:
    * a) adding features
    * b) changing the model
    * c) choosing better model parameters?

In [None]:
# R code
#library(data.table)
#srv <- fread("survey_results_public.csv")
#srv$OpSys2 <- srv$OpSys == "Windows"
#library(rpart)
#srv$CompAboveAvg <- srv$ConvertedComp > 60e3
#dt_fit = rpart(CompAboveAvg ~ Age + EdLevel + JobSeek + OpSys + YearsCode , data = srv, method = 'class')
#pred_y = predict(dt_fit, type = 'class')
#table(predict(dt_fit, srv[,c("Age" , "EdLevel", "JobSeek", "OpSys", "YearsCode")], type = "class"), srv$CompAboveAvg)
#srv(cor)
