<a href="https://colab.research.google.com/github/harnalashok/hadoop/blob/main/spark_dataframe_expts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Last amended: 12th June, 2021
# Myfolder: github/hadoop
# Objectives:
#             i) Install pyspark on colab
#             ii) Install koalas on colab
#
#
# Java 8 install: https://stackoverflow.com/a/58191107
# Hadoop install: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
# Spark install:  https://stackoverflow.com/a/64183749
#                 https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/

# Full spark install

### 1.0 Libraries

In [1]:
# 1.0 os: to set environment variables; time: to time the installation
import os  
import time  

## 2.0 Define some functions

#### ssh_install()

In [2]:
# 2.0 Function to install ssh client and sshd (Server)
def ssh_install():
  print("\n--1. Download and install ssh server----\n")
  ! sudo apt-get remove openssh-client openssh-server
  ! sudo apt install openssh-client openssh-server
  
  print("\n--2. Restart ssh server----\n")
  ! service ssh restart

#### Java install

In [3]:
# 3.0 Function to download and install java 8
def install_java():
  ! rm -rf /usr/java

  print("\n--Download and install Java 8----\n")
  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null        # install openjdk
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"     # set environment variable

  !update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
  !update-alternatives --set javac /usr/lib/jvm/java-8-openjdk-amd64/bin/javac
  
  !mkdir -p /usr/java
  ! ln -s "/usr/lib/jvm/java-8-openjdk-amd64"  "/usr/java"
  ! mv "/usr/java/java-8-openjdk-amd64"  "/usr/java/latest"
  
  !java -version       #check java version
  !javac -version

#### setup ssh passphrase

In [4]:
# 6.0 Function to set up passphrase-less ssh keys
def set_keys():
  print("\n---22. Generate SSH keys----\n")
  ! cd ~ ; pwd 
  ! cd ~ ; ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  ! cd ~ ; cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  ! cd ~ ; chmod 0600 ~/.ssh/authorized_keys


#### Set environment

In [5]:
# 7.0 Function to set up environmental variables
def set_env():
  print("\n---23. Set Environment variables----\n")
  # 'export' command does not work in colab
  # https://stackoverflow.com/a/57240319
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"     #set environment variable
  os.environ["JRE_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/jre"   
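
The note about `export` can be checked with a small experiment. Each `!` line in a notebook runs in its own short-lived shell, so a variable exported there is gone by the next line, whereas entries written to `os.environ` are inherited by every child process started afterwards. The sketch below is an illustration only: `DEMO_HOME` is a made-up variable name, `read_in_child` is a hypothetical helper, and the child process is an ordinary Python interpreter rather than Java or Spark.

```python
import os
import subprocess
import sys

# Hypothetical variable, used only for this demonstration
os.environ["DEMO_HOME"] = "/opt/demo"

def read_in_child(name):
    """Start a fresh Python process and return the value it sees for `name`."""
    code = "import os; print(os.environ.get({!r}, ''))".format(name)
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# The child process inherits the value set through os.environ
child_value = read_in_child("DEMO_HOME")
print(child_value)
```

This is why `JAVA_HOME` and `JRE_HOME` are set through `os.environ` here: the Java and Spark processes launched later start as children of the notebook kernel.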
  

#### function to install prerequisites
java and ssh<br>


In [6]:
# 8.0 Function to call all functions
def install_components():
  print("\n--Install java----\n")
  ssh_install()
  install_java()  
  #set_keys()
  set_env()


## 3.0 Install components
Download, install and configure the components. This takes around 2 minutes.<br>
Your <u>input *'y'* is required</u> at one place, while overwriting earlier ssh keys.

In [7]:
# 9.0 Start installation
start = time.time()
install_components()
end = time.time()
print("\n---Time taken----\n")
print((end- start)/60)


--Install java----


--1. Download and install ssh server----

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package 'openssh-server' is not installed, so not removed
The following packages will be REMOVED:
  openssh-client
0 upgraded, 0 newly installed, 1 to remove and 39 not upgraded.
After this operation, 4,162 kB disk space will be freed.
(Reading database ... 160772 files and directories currently installed.)
Removing openssh-client (1:7.6p1-4ubuntu0.3) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  ncurses-term openssh-sftp-server python3-certifi python3-chardet
  python3-idna python3-pkg-resources python3-requests python3-six
  python3-urllib3 ssh-import-id
Suggested packages:
  keychain libpam-ssh monkeysphere ssh-askpass molly-guard rssh ufw
  python3-setuptools pytho

## 4.0 Install spark
koalas will also be installed

### Define functions

`findspark`: PySpark isn't on `sys.path` by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding `pyspark` to `sys.path` at runtime. `findspark` does the latter.
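
A rough sketch of the second option, adding `pyspark` to `sys.path` at runtime, is shown below. It assumes the usual layout of a Spark binary distribution, with the Python sources under `python/` and a bundled py4j zip under `python/lib/`. `add_pyspark_to_path` is a hypothetical helper written for illustration and is not findspark's actual code.

```python
import glob
import os
import sys

def add_pyspark_to_path(spark_home):
    """Prepend Spark's bundled Python sources to sys.path (illustrative helper)."""
    spark_python = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip file under python/lib
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    candidates = [spark_python] + py4j_zips
    # Only add paths that exist and are not already on sys.path
    added = [p for p in candidates if os.path.exists(p) and p not in sys.path]
    sys.path[:0] = added
    return added

# Falls back to the install location used in this notebook
spark_home = os.environ.get("SPARK_HOME", "/opt/spark-3.1.2-bin-hadoop3.2")
print(add_pyspark_to_path(spark_home))
```

Once these paths are on `sys.path`, `import pyspark` resolves like any other package import.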

In [8]:
# 1.0 Function to download and unzip spark
def spark_koalas_install():
  print("\n--1.1 Install findspark----\n")
  !pip install -q findspark

  print("\n--1.2 Install databricks Koalas----\n")
  !pip install koalas

  print("\n--1.3 Download Apache tar.gz----\n")
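  # If this mirror is unavailable, the same file is kept at
  # https://archive.apache.org/dist/spark/spark-3.1.2/
  ! wget -c https://apachemirror.wuchna.com/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz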

  print("\n--1.4 Transfer downloaded content and unzip tar.gz----\n")
  !  mv /content/spark*   /opt/
  ! tar -xzf /opt/spark-3.1.2-bin-hadoop3.2.tgz  --directory /opt/

  print("\n--1.5 Check folder for files----\n")
  ! ls -la /opt


In [9]:
# 1.1 Function to set environment
def set_spark_env():
  print("\n---2. Set Environment variables----\n")
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" 
  os.environ["JRE_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/jre" 
  os.environ["SPARK_HOME"] = "/opt/spark-3.1.2-bin-hadoop3.2"     
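  # Use .get() so that a missing LD_LIBRARY_PATH does not raise KeyError
  os.environ["LD_LIBRARY_PATH"] = os.environ.get("LD_LIBRARY_PATH", "") + ":/opt/spark-3.1.2-bin-hadoop3.2/lib/native"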
  os.environ["PATH"] += ":/opt/spark-3.1.2-bin-hadoop3.2/bin:/opt/spark-3.1.2-bin-hadoop3.2/sbin"
  print("\n---2.1. Check Environment variables----\n")
  # Check
  ! echo $PATH
  ! echo $LD_LIBRARY_PATH

In [10]:
# 1.2 Function to configure spark 
def spark_conf():
  print("\n---3. Configure spark to access hadoop----\n")
  !mv /opt/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh.template  /opt/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh
  #!echo "HADOOP_CONF_DIR=/opt/hadoop-3.2.2/etc/hadoop/" >> /opt/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh
  print("\n---3.1 Check ----\n")
  #!cat /opt/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh

### Install spark


In [11]:
# 2.0 Call all the three functions
def install_spark():
  spark_koalas_install()
  set_spark_env()
  spark_conf()


In [12]:
# 2.1 
install_spark()


--1.1 Install findspark----


--1.2 Install databricks Koalas----

Collecting koalas
  Downloading https://files.pythonhosted.org/packages/b8/6f/d0454b8b7a8ac4cd9838f510ceff0d9eb20d64245c4627f425c06ca6b685/koalas-1.8.0-py3-none-any.whl (720kB)
     |████████████████████████████████| 727kB 3.9MB/s 
Installing collected packages: koalas
Successfully installed koalas-1.8.0

--1.3 Download Apache tar.gz----

--2021-06-14 23:46:39--  https://apachemirror.wuchna.com/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
Resolving apachemirror.wuchna.com (apachemirror.wuchna.com)... 143.110.177.196
Connecting to apachemirror.wuchna.com (apachemirror.wuchna.com)|143.110.177.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 228834641 (218M) [application/x-gzip]
Saving to: ‘spark-3.1.2-bin-hadoop3.2.tgz’


2021-06-14 23:47:02 (10.1 MB/s) - ‘spark-3.1.2-bin-hadoop3.2.tgz’ saved [228834641/228834641]


--1.4 Transfer downloaded content and unzip tar.gz----


--1.5 Ch

# Test spark


Call some libraries

In [13]:
# 3.0 Just call some libraries to test
import pandas as pd
import numpy as np
import os

# 3.1 Get spark in sys.path
import findspark
findspark.init()

# 3.2 Call other spark libraries
#     Just to test
from pyspark.sql import SparkSession
import databricks.koalas as ks
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression



### Understanding SparkSession
For Spark configuration options, see [here](http://spark.apache.org/docs/latest/configuration.html#spark-properties)

In [14]:
# 3.1 Build spark session
#     with certain configuration options
#     .master => Which Spark master URL to connect to: "local" to run locally,
#                "local[4]" to run locally with 4 cores,
#                "local[*]" to run locally with all available cores,
#                or "spark://master:7077" to run on a Spark standalone cluster.
#
spark = SparkSession. \
                    builder. \
                    master("local[*]"). \
                    config("spark.driver.memory", "1g"). \
                    getOrCreate()
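
The master URL forms listed in the comments can be summarised in a few lines of plain Python. This is only a reading aid: `describe_master` is a hypothetical function, it does not call Spark, and it covers just the forms mentioned here.

```python
import os
import re

def describe_master(url):
    """Describe where Spark would run for a given master URL (illustrative only)."""
    if url == "local":
        return "local, 1 worker thread"
    match = re.fullmatch(r"local\[(\*|\d+)\]", url)
    if match:
        n = match.group(1)
        if n == "*":
            # local[*] asks for one worker thread per logical core
            return "local, {} worker threads (all cores)".format(os.cpu_count())
        return "local, {} worker threads".format(n)
    if url.startswith("spark://"):
        return "standalone cluster at " + url[len("spark://"):]
    return "not covered by this sketch"

for url in ["local", "local[4]", "local[*]", "spark://master:7077"]:
    print(url, "->", describe_master(url))
```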


In [15]:
# 3.1.1
# Get spark configuration
spark.conf.get("spark.driver.memory")

'1g'

In [16]:
# 3.1.2 
# Get the existing spark session (getOrCreate() returns the one built above)
abc = spark.builder.getOrCreate()

### Creating spark dataframe

#### From pandas dataframe

In [17]:
# 4.0 Pandas DataFrame
pdf = pd.DataFrame({
        'x1': ['a','a','b','b', 'b', 'c', 'd','d'],
        'x2': ['apple', 'orange', 'orange','orange', 'peach', 'peach','apple','orange'],
        'x3': [1, 1, 2, 2, 2, 4, 1, 2],
        'x4': [2.4, 2.5, 3.5, 1.4, 2.1,1.5, 3.0, 2.0],
        'y1': [1, 0, 1, 0, 0, 1, 1, 0],
        'y2': ['yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes']
    })

# 4.1
pdf

  x1      x2  x3   x4  y1   y2
0  a   apple   1  2.4   1  yes
1  a  orange   1  2.5   0   no
2  b  orange   2  3.5   1   no
3  b  orange   2  1.4   0  yes
4  b   peach   2  2.1   0  yes
5  c   peach   4  1.5   1  yes
6  d   apple   1  3.0   1   no
7  d  orange   2  2.0   0  yes


In [18]:
# 4.2 Transform to Spark DataFrame
#     and print
df = spark.createDataFrame(pdf)
df.show()

+---+------+---+---+---+---+
| x1|    x2| x3| x4| y1| y2|
+---+------+---+---+---+---+
|  a| apple|  1|2.4|  1|yes|
|  a|orange|  1|2.5|  0| no|
|  b|orange|  2|3.5|  1| no|
|  b|orange|  2|1.4|  0|yes|
|  b| peach|  2|2.1|  0|yes|
|  c| peach|  4|1.5|  1|yes|
|  d| apple|  1|3.0|  1| no|
|  d|orange|  2|2.0|  0|yes|
+---+------+---+---+---+---+



In [19]:
df1 = abc.createDataFrame(pdf)
df1.show()

+---+------+---+---+---+---+
| x1|    x2| x3| x4| y1| y2|
+---+------+---+---+---+---+
|  a| apple|  1|2.4|  1|yes|
|  a|orange|  1|2.5|  0| no|
|  b|orange|  2|3.5|  1| no|
|  b|orange|  2|1.4|  0|yes|
|  b| peach|  2|2.1|  0|yes|
|  c| peach|  4|1.5|  1|yes|
|  d| apple|  1|3.0|  1| no|
|  d|orange|  2|2.0|  0|yes|
+---+------+---+---+---+---+



In [20]:
############

# Your experiments

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [36]:
pathToFolder = "/content/drive/MyDrive/healthcare-analytics/breast_cancer_wisconsin/"
pathToFolder = "/content/drive/MyDrive/Colab_data_files/census/"

In [23]:
# Get existing spark session using builder object
abc = SparkSession.builder.getOrCreate()

In [24]:
abc.conf.get("spark.driver.memory")

'1g'

In [25]:
df = spark.read.csv(
                     pathToFolder+"breast_cancer.csv",
                     header = True
                     )

In [37]:
df = spark.read.csv(
                     pathToFolder+"adultdata_modified.csv",
                     header = True
                     )

In [38]:
df.show(5)

+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|education_num|    marital_status|       occupation| relationship| race|   sex|capital_gain|capital_loss|hours_per_week|native_country|target|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|          Divorced|Handlers-cleaners|Not-i

In [40]:
len(df.columns)
df.count()

32561

In [41]:
df1 = spark.read.csv(
                     pathToFolder+"adultdata_modified_na.csv",
                     header = True
                     )

In [42]:
df1.show()

+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|   education|education_num|      marital_status|       occupation| relationship|              race|   sex|capital_gain|capital_loss|hours_per_week|native_country|target|
+---+----------------+------+------------+-------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|   Bachelors|           13|       Never-married|     Adm-clerical|Not-in-family|             White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|   Bachelors|           13|  Married-civ-spouse|  Exec-managerial|      Husband|             White|  Male|           0|           0|            13| United-States| <=50K|
| 38|

In [27]:
df1 = df.sample(fraction = 0.5)
df2 = df.sample(fraction = 0.5)
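
`sample(fraction = 0.5)` does not promise exactly half of the rows; Spark's documentation notes that the requested fraction is not guaranteed to be met exactly. One way to picture sampling without replacement is as an independent coin toss per row. The plain-Python sketch below only imitates that idea: `bernoulli_sample` and the 1000 dummy rows are assumptions made for illustration, and this is not Spark's sampler.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))            # stand-in for DataFrame rows
part1 = bernoulli_sample(rows, 0.5, seed=1)
part2 = bernoulli_sample(rows, 0.5, seed=2)

# Sizes land near 500 but usually not exactly on it, and the two
# samples share rows because they are drawn independently.
print(len(part1), len(part2), len(set(part1) & set(part2)))
```

So `df1` and `df2` above are two independent samples, not a split of `df` into two halves. For disjoint parts, `randomSplit` is the DataFrame method to look at.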

In [28]:
df1.head(4)

[Row(id='842517', diagnosis='M', radius_mean='20.57', texture_mean='17.77', perimeter_mean='132.9', area_mean='1326', smoothness_mean='0.08474', compactness_mean='0.07864', concavity_mean='0.0869', concave points_mean='0.07017', symmetry_mean='0.1812', fractal_dimension_mean='0.05667', radius_se='0.5435', texture_se='0.7339', perimeter_se='3.398', area_se='74.08', smoothness_se='0.005225', compactness_se='0.01308', concavity_se='0.0186', concave points_se='0.0134', symmetry_se='0.01389', fractal_dimension_se='0.003532', radius_worst='24.99', texture_worst='23.41', perimeter_worst='158.8', area_worst='1956', smoothness_worst='0.1238', compactness_worst='0.1866', concavity_worst='0.2416', concave points_worst='0.186', symmetry_worst='0.275', fractal_dimension_worst='0.08902', _c32=None),
 Row(id='84358402', diagnosis='M', radius_mean='20.29', texture_mean='14.34', perimeter_mean='135.1', area_mean='1297', smoothness_mean='0.1003', compactness_mean='0.1328', concavity_mean='0.198', concav

In [29]:
df1.collect()

[Row(id='842517', diagnosis='M', radius_mean='20.57', texture_mean='17.77', perimeter_mean='132.9', area_mean='1326', smoothness_mean='0.08474', compactness_mean='0.07864', concavity_mean='0.0869', concave points_mean='0.07017', symmetry_mean='0.1812', fractal_dimension_mean='0.05667', radius_se='0.5435', texture_se='0.7339', perimeter_se='3.398', area_se='74.08', smoothness_se='0.005225', compactness_se='0.01308', concavity_se='0.0186', concave points_se='0.0134', symmetry_se='0.01389', fractal_dimension_se='0.003532', radius_worst='24.99', texture_worst='23.41', perimeter_worst='158.8', area_worst='1956', smoothness_worst='0.1238', compactness_worst='0.1866', concavity_worst='0.2416', concave points_worst='0.186', symmetry_worst='0.275', fractal_dimension_worst='0.08902', _c32=None),
 Row(id='84358402', diagnosis='M', radius_mean='20.29', texture_mean='14.34', perimeter_mean='135.1', area_mean='1297', smoothness_mean='0.1003', compactness_mean='0.1328', concavity_mean='0.198', concav

In [None]:
len(df1.columns)

31

In [None]:
df1.summary().show()

+-------+--------------------+---------+------------------+------------------+------------------+-----------------+--------------------+-------------------+-------------------+-------------------+-------------------+----------------------+-------------------+------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+-----------------+------------------+-----------------+-------------------+-------------------+-------------------+--------------------+-------------------+-----------------------+----+
|summary|                  id|diagnosis|       radius_mean|      texture_mean|    perimeter_mean|        area_mean|     smoothness_mean|   compactness_mean|     concavity_mean|concave points_mean|      symmetry_mean|fractal_dimension_mean|          radius_se|        texture_se|     perimeter_se|           area_se|       smoothness_se|      compactness_

In [None]:
print(df1.schema)

StructType(List(StructField(diagnosis,StringType,true),StructField(radius_mean,StringType,true),StructField(texture_mean,StringType,true),StructField(perimeter_mean,StringType,true),StructField(area_mean,StringType,true),StructField(smoothness_mean,StringType,true),StructField(compactness_mean,StringType,true),StructField(concavity_mean,StringType,true),StructField(concave_points_mean,StringType,true),StructField(symmetry_mean,StringType,true),StructField(fractal_dimension_mean,StringType,true),StructField(radius_se,StringType,true),StructField(texture_se,StringType,true),StructField(perimeter_se,StringType,true),StructField(area_se,StringType,true),StructField(smoothness_se,StringType,true),StructField(compactness_se,StringType,true),StructField(concavity_se,StringType,true),StructField(concave_points_se,StringType,true),StructField(symmetry_se,StringType,true),StructField(fractal_dimension_se,StringType,true),StructField(radius_worst,StringType,true),StructField(texture_worst,StringT

In [None]:
df1.describe().show()

+-------+-------------------+------------------+-----------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+----------------------+-------------------+------------------+------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+------------------+------------------+-----------------+-------------------+-------------------+-------------------+--------------------+-------------------+-----------------------+
|summary|          diagnosis|       radius_mean|     texture_mean|    perimeter_mean|         area_mean|    smoothness_mean|   compactness_mean|     concavity_mean|concave_points_mean|      symmetry_mean|fractal_dimension_mean|          radius_se|        texture_se|      perimeter_se|          area_se|       smoothness_se|      compactness_se|        concavity_se|   c

In [None]:
df1.toPandas()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,_c32
0,84300903,M,19.69,21.25,130,1203,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
1,844359,M,18.25,19.98,119.6,1040,0.09463,0.109,0.1127,0.074,0.1794,0.05742,0.4467,0.7732,3.18,53.91,0.004314,0.01382,0.02254,0.01039,0.01369,0.002179,22.88,27.66,153.2,1606,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,
2,84458202,M,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,0.5835,1.377,3.856,50.96,0.008805,0.03029,0.02488,0.01448,0.01486,0.005412,17.06,28.14,110.6,897,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,
3,844981,M,13,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,0.3063,1.002,2.406,24.32,0.005731,0.03502,0.03553,0.01226,0.02143,0.003749,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,
4,84501001,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,0.2976,1.599,2.039,23.94,0.007149,0.07217,0.07743,0.01432,0.01789,0.01008,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
275,925311,B,11.2,29.37,70.67,386,0.07449,0.03558,0,0,0.106,0.05502,0.3141,3.896,2.041,22.81,0.007594,0.008878,0,0,0.01989,0.001773,11.92,38.3,75.19,439.6,0.09267,0.05494,0,0,0.1566,0.05905,
276,925622,M,15.22,30.62,103.4,716.9,0.1048,0.2087,0.255,0.09429,0.2128,0.07152,0.2602,1.205,2.362,22.65,0.004625,0.04844,0.07359,0.01608,0.02137,0.006142,17.52,42.79,128.7,915,0.1417,0.7917,1.17,0.2356,0.4089,0.1409,
277,926125,M,20.92,25.09,143,1347,0.1099,0.2236,0.3174,0.1474,0.2149,0.06879,0.9622,1.026,8.758,118.8,0.006399,0.0431,0.07845,0.02624,0.02057,0.006213,24.29,29.41,179.1,1819,0.1407,0.4186,0.6599,0.2542,0.2929,0.09873,
278,926424,M,21.56,22.39,142,1479,0.111,0.1159,0.2439,0.1389,0.1726,0.05623,1.176,1.256,7.673,158.7,0.0103,0.02891,0.05198,0.02454,0.01114,0.004239,25.45,26.4,166.1,2027,0.141,0.2113,0.4107,0.2216,0.206,0.07115,


In [None]:
df1.count()

284

In [None]:
df1.crosstab("radius_mean", "texture_mean").show()

+------------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+----+-----+-----+----+-----+-----+-----+----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----

In [None]:
cols = ["radius_mean", "texture_mean"]
df1.select(*cols).show()

+-----------+------------+
|radius_mean|texture_mean|
+-----------+------------+
|      17.99|       10.38|
|      20.57|       17.77|
|      11.42|       20.38|
|      18.25|       19.98|
|      13.71|       20.83|
|         13|       21.82|
|      12.46|       24.04|
|      16.02|       23.24|
|      15.78|       17.89|
|      16.13|       20.68|
|      19.81|       22.15|
|      13.54|       14.36|
|      13.08|       15.71|
|      9.504|       12.44|
|      18.61|       20.25|
|      17.57|       15.05|
|      19.27|       26.47|
|      14.25|       21.72|
|      13.03|       18.42|
|      13.48|       20.82|
+-----------+------------+
only showing top 20 rows



In [None]:
df1.selectExpr("radius_mean * 2" , "texture_mean/3").show()

+-----------------+------------------+
|(radius_mean * 2)|(texture_mean / 3)|
+-----------------+------------------+
|            35.98|3.4600000000000004|
|            41.14| 5.923333333333333|
|            22.84| 6.793333333333333|
|             36.5|              6.66|
|            27.42|6.9433333333333325|
|             26.0| 7.273333333333333|
|            24.92| 8.013333333333334|
|            32.04| 7.746666666666666|
|            31.56| 5.963333333333334|
|            32.26|6.8933333333333335|
|            39.62| 7.383333333333333|
|            27.08| 4.786666666666666|
|            26.16| 5.236666666666667|
|           19.008|4.1466666666666665|
|            37.22|              6.75|
|            35.14| 5.016666666666667|
|            38.54| 8.823333333333332|
|             28.5| 7.239999999999999|
|            26.06| 6.140000000000001|
|            26.96|              6.94|
+-----------------+------------------+
only showing top 20 rows



In [None]:
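# Note: .alias("abcd") here names the DataFrame, not the column, so the
#       header below is unchanged. To rename the column, write
#       "radius_mean * texture_mean AS abcd" inside selectExpr().
df1.selectExpr("radius_mean * texture_mean").alias("abcd").show()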

+----------------------------+
|(radius_mean * texture_mean)|
+----------------------------+
|                    186.7362|
|                    365.5289|
|                    232.7396|
|                     364.635|
|                    285.5793|
|                      283.66|
|                    299.5384|
|          372.30479999999994|
|                    282.3042|
|                    333.5684|
|           438.7914999999999|
|          194.43439999999998|
|          205.48680000000002|
|          118.22975999999998|
|          376.85249999999996|
|          264.42850000000004|
|          510.07689999999997|
|                      309.51|
|          240.01260000000002|
|          280.65360000000004|
+----------------------------+
only showing top 20 rows



In [None]:
cols = ["radius_mean", "texture_mean"]
df1.sort(*cols).show()

+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+
|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave_points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave_points_se|symmetry_se|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave_points_worst|symmetry_worst|fractal_dimension_worst|
+---------+-----------+------------+--------------+---------+-----

In [None]:
df1.take(5)

[Row(diagnosis='1', radius_mean='17.99', texture_mean='10.38', perimeter_mean='122.8', area_mean='1001', smoothness_mean='0.1184', compactness_mean='0.2776', concavity_mean='0.3001', concave_points_mean='0.1471', symmetry_mean='0.2419', fractal_dimension_mean='0.07871', radius_se='1.095', texture_se='0.9053', perimeter_se='8.589', area_se='153.4', smoothness_se='0.006399', compactness_se='0.04904', concavity_se='0.05373', concave_points_se='0.01587', symmetry_se='0.03003', fractal_dimension_se='0.006193', radius_worst='25.38', texture_worst='17.33', perimeter_worst='184.6', area_worst='2019', smoothness_worst='0.1622', compactness_worst='0.6656', concavity_worst='0.7119', concave_points_worst='0.2654', symmetry_worst='0.4601', fractal_dimension_worst='0.1189'),
 Row(diagnosis='1', radius_mean='20.57', texture_mean='17.77', perimeter_mean='132.9', area_mean='1326', smoothness_mean='0.08474', compactness_mean='0.07864', concavity_mean='0.0869', concave_points_mean='0.07017', symmetry_mea

In [None]:
df1.tail(5)

NameError: ignored

In [None]:
# Ask for help on the method itself, not on the list it returns
help(df1.take)

In [None]:
fd = df1.to_koalas()

In [None]:
fd['diagnosis'].value_counts()

0    176
1    108
Name: diagnosis, dtype: int64

In [None]:
grd = fd.groupby('diagnosis')

In [None]:
grd['radius_mean'].aggregate(sum)

In [None]:
grd['radius_mean'].apply(sum)

diagnosis
0    13.5413.089.50413.0313.4911.7613.058.6189.0291...
1    17.9920.5711.4218.2513.711312.4616.0215.7816.1...
Name: radius_mean, dtype: object
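
The result above is a long string rather than a number because the CSV was read without a schema, so `radius_mean` holds strings and adding them concatenates. A plain-Python sketch of the same effect, using the first three values visible in the output for diagnosis `0`:

```python
values = ["13.54", "13.08", "9.504"]    # strings, as read from the CSV

as_text = "".join(values)                   # what adding strings amounts to
as_number = sum(float(v) for v in values)   # what was probably intended

print(as_text)
print(as_number)
```

Casting the column to a numeric type before grouping, or reading the file with `inferSchema = True`, avoids the concatenation.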

In [None]:
df1.agg({'radius_mean': 'sum'}).show()

+------------------+
|  sum(radius_mean)|
+------------------+
|4030.0239999999994|
+------------------+



In [None]:
grd = df1.groupby('diagnosis')

In [None]:
grd.agg({'radius_mean': 'sum'}).show()

+---------+-----------------+
|diagnosis| sum(radius_mean)|
+---------+-----------------+
|        0|2144.744000000001|
|        1|          1885.28|
+---------+-----------------+



In [None]:
df1.limit(10).select('radius_mean').show()

+-----------+
|radius_mean|
+-----------+
|      20.57|
|      11.42|
|      12.45|
|      18.25|
|         13|
|      19.17|
|      14.54|
|      16.13|
|      19.81|
|      13.08|
+-----------+



In [None]:
df1.columns

['id',
 'diagnosis',
 'radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst',
 '_c32']

In [None]:
df1.withColumnRenamed("symmetry_se", "symmeterySE").limit(10).show()

+--------+---------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+----------------------+---------+----------+------------+-------+-------------+--------------+------------+-----------------+-----------+--------------------+------------+-------------+---------------+----------+----------------+-----------------+---------------+--------------------+--------------+-----------------------+----+
|      id|diagnosis|radius_mean|texture_mean|perimeter_mean|area_mean|smoothness_mean|compactness_mean|concavity_mean|concave points_mean|symmetry_mean|fractal_dimension_mean|radius_se|texture_se|perimeter_se|area_se|smoothness_se|compactness_se|concavity_se|concave points_se|symmeterySE|fractal_dimension_se|radius_worst|texture_worst|perimeter_worst|area_worst|smoothness_worst|compactness_worst|concavity_worst|concave points_worst|symmetry_worst|fractal_dimension_worst|_c32|
+--------+---------+-----------+------

In [None]:
c = [ i.replace("_", "") for i in df1.columns]

In [None]:
c

In [None]:
df1.toDF(*c).limit(10).show()

+--------+---------+----------+-----------+-------------+--------+--------------+---------------+-------------+------------------+------------+--------------------+--------+---------+-----------+------+------------+-------------+-----------+----------------+----------+------------------+-----------+------------+--------------+---------+---------------+----------------+--------------+-------------------+-------------+---------------------+----+
|      id|diagnosis|radiusmean|texturemean|perimetermean|areamean|smoothnessmean|compactnessmean|concavitymean|concave pointsmean|symmetrymean|fractaldimensionmean|radiusse|texturese|perimeterse|arease|smoothnessse|compactnessse|concavityse|concave pointsse|symmetryse|fractaldimensionse|radiusworst|textureworst|perimeterworst|areaworst|smoothnessworst|compactnessworst|concavityworst|concave pointsworst|symmetryworst|fractaldimensionworst| c32|
+--------+---------+----------+-----------+-------------+--------+--------------+---------------+-------

In [None]:
df1.select(*df1.columns[:4]).sort("radius_mean",ascending = False).show()

+---------+---------+-----------+------------+
|       id|diagnosis|radius_mean|texture_mean|
+---------+---------+-----------+------------+
|   862980|        B|      9.876|        19.4|
|   862261|        B|      9.787|       19.94|
|  8910996|        B|      9.742|       15.67|
|   875099|        B|       9.72|       18.22|
|  9113514|        B|      9.668|        18.1|
|901034301|        B|      9.436|       18.32|
|   905978|        B|      9.405|        21.7|
|   924342|        B|      9.333|       21.94|
|   915186|        B|      9.268|       12.87|
|   859196|        B|      9.173|       13.86|
|   859471|        B|      9.029|       17.33|
|    89346|        B|          9|        14.4|
|   864496|        B|      8.726|       15.83|
|   872113|        B|      8.671|       14.45|
|   858981|        B|      8.598|       20.98|
|    91805|        B|      8.571|        13.1|
| 85713702|        B|      8.196|       16.84|
|   921092|        B|      7.729|       25.49|
|   921362|  
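
The ordering above looks wrong for a descending numeric sort: `9.876` comes first although values such as `20.57` exist in the column. Since `radius_mean` was read as a string, values are compared character by character. The same behaviour in plain Python, with a few values taken from the outputs in this notebook:

```python
values = ["20.57", "9.876", "13", "17.99", "9"]

as_strings = sorted(values, reverse=True)             # lexicographic order
as_numbers = sorted(values, key=float, reverse=True)  # numeric order

print(as_strings)   # ['9.876', '9', '20.57', '17.99', '13']
print(as_numbers)   # ['20.57', '17.99', '13', '9.876', '9']
```

In Spark the usual remedies are to read the file with `inferSchema = True`, or to cast the column to a numeric type (for example `double`) before sorting.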

In [34]:
from pyspark.sql.functions import mean, sumDistinct

In [35]:
df1.select(sumDistinct("radius_mean")).show()

+-------------------------+
|sum(DISTINCT radius_mean)|
+-------------------------+
|       3577.8190000000004|
+-------------------------+
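
`sumDistinct` adds each distinct value once, so repeated values are not counted again and the result can be smaller than a plain `sum`. A plain-Python sketch of the difference, with made-up numbers:

```python
values = [13.0, 13.0, 9.5, 20.5, 9.5]   # made-up values with repeats

plain_sum = sum(values)          # every row counts
distinct_sum = sum(set(values))  # each distinct value counts once

print(plain_sum, distinct_sum)   # 65.5 43.0
```

In PySpark 3.2 and later the same function is spelled `sum_distinct`, and `sumDistinct` is deprecated; with Spark 3.1.2 as installed here, `sumDistinct` is the available name.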

