## Spark ML
- [https://spark.apache.org/docs/latest/ml-guide.html] - DataFrame-based (user-friendlier API)
- [https://spark.apache.org/docs/latest/mllib-guide.html] - maintenance mode; RDD-based

```
ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
Persistence: saving and loading algorithms, models, and Pipelines
Utilities: linear algebra, statistics, data handling, etc.
```


## Spark SQL [https://spark.apache.org/docs/latest/sql-programming-guide.html]
## Pandas API on Spark [https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html]
## GraphX [https://spark.apache.org/docs/latest/graphx-programming-guide.html] 
## Structured Streaming [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]


pip install pyspark plotly

## Spark data types and built-in functions

pyspark.pandas.frame.DataFrame

### Pandas -> Pandas-on-Spark datatypes
pyspark.pandas.from_pandas()

### Spark DataFrame -> Pandas-on-Spark DataFrame
(data frame object).pandas_api()

### parquet (efficient, compact file format)

(data object).to_parquet('')
pyspark.pandas.read_parquet('')

### Spark IO (Spark has various datasources including ORC and external datasources)

.to_spark_io("filename.orc", format="orc")
pyspark.pandas.read_spark_io("filename.orc", format="orc")

### other data sources
Metastore table, Delta Lake, Parquet, ORC, Generic Spark IO, file, CSV, clipboard, Excel, JSON, HTML, SQL


### built-in functions
spark.createDataFrame()
.head() # first 5 rows
.index
.columns
.to_numpy()
.describe()
.T

.sort_index(ascending={True,False})
.sort_values(by='column name')

.reindex(index=..., columns=...)

.loc[row condition, column condition] = value to assign
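
The sorting, reindexing, and `.loc` assignment calls above can be tried on a tiny frame. The sketch below uses plain pandas so it runs without a Spark session, on the assumption that pandas-on-Spark follows the same semantics for these calls; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [3, 1, 2], "b": [30, 10, 20]},
    index=[20, 10, 30],
)

# Sort by index labels, descending
by_index = df.sort_index(ascending=False)

# Sort rows by the values in column "a"
by_value = df.sort_values(by="a")

# reindex selects and reorders labels; labels missing from the original become NaN
reindexed = df.reindex(index=[10, 20, 40], columns=["a"])

# .loc[row condition, column] = value assigns only where the condition holds
df.loc[df["a"] > 1, "b"] = 0

print(by_index, by_value, reindexed, df, sep="\n\n")
```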


.dropna(axis=, how={'any','all'}, thresh=(int, minimum number of non-NA values needed to keep a row), subset=column label(s), inplace={False, True}, ignore_index={False, True})

.fillna(value={scalar,dict,Series,DataFrame}, method={None,'backfill','bfill','ffill'}, axis=None, inplace=False, limit=None, downcast=_NoDefault.no_default)
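
The effect of `how`, `thresh`, and `subset` is easiest to see on a small frame with known gaps. This sketch uses plain pandas, assuming pandas-on-Spark behaves the same for these parameters; the data is illustrative. `ffill()` is used instead of `fillna(method='ffill')` because recent pandas versions deprecate the `method` argument.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, np.nan, 6.0, 8.0],
    "z": [np.nan, np.nan, 1.0, 1.0],
})

any_na = df.dropna(how="any")      # keep only rows with no NA at all
all_na = df.dropna(how="all")      # drop only rows where every value is NA
at_least_2 = df.dropna(thresh=2)   # keep rows with at least 2 non-NA values
only_x = df.dropna(subset=["x"])   # consider column "x" only

# Per-column fill values; columns not in the dict ("z") are left untouched
filled = df.fillna(value={"x": 0.0, "y": -1.0})

# Propagate the last valid value downward within each column
forward = df.ffill()

print(any_na, all_na, at_least_2, only_x, filled, forward, sep="\n\n")
```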

.mean()
.cummax()

.groupby(column name(s)) + aggregate func
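
A grouped aggregation, sketched with plain pandas on illustrative data, assuming the same calls work on a pandas-on-Spark DataFrame. The trailing `sort_index()` is redundant in pandas but makes the row order explicit, which matters more on Spark, where the order of a grouped result should not be relied on.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "b", "a", "b", "a"],
    "score": [10, 20, 30, 60, 20],
})

# Single aggregate on one column
mean_by_team = df.groupby("team")["score"].mean().sort_index()

# Dict form: column name -> aggregate function
totals = df.groupby("team").agg({"score": "sum"}).sort_index()

print(mean_by_team)
print(totals)
```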


.plot()


# testing pyspark session

from pyspark.sql import SparkSession

logFile = "README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()



24/04/01 14:02:06 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Lines with a: 58, lines with b: 18


import pandas as pd
import numpy as np
import pyspark.pandas as ps

# Pandas-on-Spark Series
s = ps.Series([1, 3, 5, np.nan, 6, 8])

# Pandas-on-Spark DataFrame
psdf = ps.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

print(s)
print(psdf)

# ps.from_pandas()


0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
    a    b      c
10  1  100    one
20  2  200    two
30  3  300  three
40  4  400   four
50  5  500   five
60  6  600    six
