<a href="https://colab.research.google.com/github/marreapato/Medium_Tutorials_and_Articles/blob/main/pyspark/pyspark_tutorial_newbies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installing the library

In [2]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=66d05ed8c180723c751acfdbd7494eb56c6ad96efe91f600b5f82b3a4ea054c4
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


Loading the Italy Wine Dataset


In [3]:
from sklearn.datasets import load_wine
import pandas as pd

wine = load_wine()
df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df['target'] = wine.target


The Italy Wine dataset, more commonly called the Wine dataset, is a classic benchmark for classification tasks in machine learning.

#### Overview:
- **Purpose**: The Wine dataset is primarily used for classification tasks where the goal is to predict the type of wine based on chemical properties.
- **Source**: The dataset comes from the UCI Machine Learning Repository and consists of the results of a chemical analysis of wines grown in the same region in Italy. The wines are derived from three different cultivars.

#### Features:
The dataset contains 13 continuous features, each representing a specific chemical property of the wine:
1. **Alcohol**: The alcohol content in the wine.
2. **Malic acid**: Amount of malic acid.
3. **Ash**: The ash content.
4. **Alcalinity of ash**: Measure of the alkalinity of ash.
5. **Magnesium**: The magnesium content.
6. **Total phenols**: The total amount of phenols.
7. **Flavanoids**: The amount of flavanoids (the dataset's spelling of flavonoids, a type of phenolic compound).
8. **Nonflavanoid phenols**: Amount of nonflavanoid phenols.
9. **Proanthocyanins**: The amount of proanthocyanins (a class of flavonoids).
10. **Color intensity**: The intensity of the color of the wine.
11. **Hue**: The hue of the wine.
12. **OD280/OD315 of diluted wines**: The ratio of the optical density (absorbance) of the diluted wine at 280 nm to that at 315 nm.
13. **Proline**: The amount of proline (an amino acid).

#### Target:
- The target variable is the **wine class** (or cultivar), represented by three different categories:
  1. **Class 0**: Wine from cultivar 0.
  2. **Class 1**: Wine from cultivar 1.
  3. **Class 2**: Wine from cultivar 2.

#### Dataset Size:
- **Number of Instances**: 178 samples (instances of wine).
- **Number of Features**: 13 continuous features.

#### Use Cases:
- **Classification**: This dataset is commonly used for testing and demonstrating classification algorithms. The goal is to classify a wine sample into one of the three classes based on its chemical properties.
- **Feature Selection**: The dataset is also useful for feature selection exercises, given the variety of chemical properties measured.

#### Applications:
- The Wine dataset is widely used in educational settings for demonstrating machine learning algorithms, especially for:
  - Decision Trees
  - K-Nearest Neighbors (KNN)
  - Support Vector Machines (SVM)
  - Neural Networks

In [4]:
df.head(5)

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


### Spark Pipeline

In [5]:
import pyspark
from pyspark.sql import SparkSession


In [6]:
spark = SparkSession.builder.appName("Italy Wine Dataset Analysis").getOrCreate()

In [7]:
spark

In [9]:
df_spark = spark.createDataFrame(df)
df_spark.show()

+-------+----------+----+-----------------+---------+-------------+----------+--------------------+---------------+---------------+----+----------------------------+-------+------+
|alcohol|malic_acid| ash|alcalinity_of_ash|magnesium|total_phenols|flavanoids|nonflavanoid_phenols|proanthocyanins|color_intensity| hue|od280/od315_of_diluted_wines|proline|target|
+-------+----------+----+-----------------+---------+-------------+----------+--------------------+---------------+---------------+----+----------------------------+-------+------+
|  14.23|      1.71|2.43|             15.6|    127.0|          2.8|      3.06|                0.28|           2.29|           5.64|1.04|                        3.92| 1065.0|     0|
|   13.2|      1.78|2.14|             11.2|    100.0|         2.65|      2.76|                0.26|           1.28|           4.38|1.05|                         3.4| 1050.0|     0|
|  13.16|      2.36|2.67|             18.6|    101.0|          2.8|      3.24|                 

Checking the type of each variable

In [10]:
df_spark.printSchema()

root
 |-- alcohol: double (nullable = true)
 |-- malic_acid: double (nullable = true)
 |-- ash: double (nullable = true)
 |-- alcalinity_of_ash: double (nullable = true)
 |-- magnesium: double (nullable = true)
 |-- total_phenols: double (nullable = true)
 |-- flavanoids: double (nullable = true)
 |-- nonflavanoid_phenols: double (nullable = true)
 |-- proanthocyanins: double (nullable = true)
 |-- color_intensity: double (nullable = true)
 |-- hue: double (nullable = true)
 |-- od280/od315_of_diluted_wines: double (nullable = true)
 |-- proline: double (nullable = true)
 |-- target: long (nullable = true)



In [12]:
# Select two columns and show the first 5 rows of data
df_spark.select(["alcohol", "target"]).show(5)

+-------+------+
|alcohol|target|
+-------+------+
|  14.23|     0|
|   13.2|     0|
|  13.16|     0|
|  14.37|     0|
|  13.24|     0|
+-------+------+
only showing top 5 rows



### Brief Data Analysis

In [13]:
df_spark.describe().show()

+-------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+--------------------+------------------+------------------+-------------------+----------------------------+------------------+------------------+
|summary|           alcohol|        malic_acid|               ash| alcalinity_of_ash|        magnesium|     total_phenols|        flavanoids|nonflavanoid_phenols|   proanthocyanins|   color_intensity|                hue|od280/od315_of_diluted_wines|           proline|            target|
+-------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+--------------------+------------------+------------------+-------------------+----------------------------+------------------+------------------+
|  count|               178|               178|               178|               178|              178|               178|              
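
`describe()` reports five summary rows for each numeric column: count, mean, stddev, min, and max. As a rough illustration of what those rows mean, the sketch below recomputes them in plain Python for the five alcohol values shown by `df.head(5)` earlier. It covers only those five rows, so the numbers will not match the 178-row summary, and it assumes `stddev` is the sample standard deviation.

```python
import statistics

# Alcohol values copied from the five rows shown by df.head(5) earlier.
alcohol = [14.23, 13.2, 13.16, 14.37, 13.24]

# The same five summary rows that describe() reports for each column.
summary = {
    "count": len(alcohol),
    "mean": statistics.fmean(alcohol),
    # Assumption: stddev is the sample standard deviation (n - 1 denominator).
    "stddev": statistics.stdev(alcohol),
    "min": min(alcohol),
    "max": max(alcohol),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```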

Adding the Ratio of Alcohol to Magnesium

In [14]:
df_spark = df_spark.withColumn("Alcohol Per Magnesium Rate", df_spark['alcohol']/df_spark['magnesium'])
df_spark.show(4)

+-------+----------+----+-----------------+---------+-------------+----------+--------------------+---------------+---------------+----+----------------------------+-------+------+--------------------------+
|alcohol|malic_acid| ash|alcalinity_of_ash|magnesium|total_phenols|flavanoids|nonflavanoid_phenols|proanthocyanins|color_intensity| hue|od280/od315_of_diluted_wines|proline|target|Alcohol Per Magnesium Rate|
+-------+----------+----+-----------------+---------+-------------+----------+--------------------+---------------+---------------+----+----------------------------+-------+------+--------------------------+
|  14.23|      1.71|2.43|             15.6|    127.0|          2.8|      3.06|                0.28|           2.29|           5.64|1.04|                        3.92| 1065.0|     0|        0.1120472440944882|
|   13.2|      1.78|2.14|             11.2|    100.0|         2.65|      2.76|                0.26|           1.28|           4.38|1.05|                         3.4| 10
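
The new column is a row-wise division of `alcohol` by `magnesium`. As a quick check on the arithmetic, the plain-Python sketch below repeats that division for the first two rows of the table. The first ratio should agree with the value Spark prints for the first row (about 0.1120). The variable names are illustrative only.

```python
# (alcohol, magnesium) pairs copied from the first two rows shown above.
rows = [(14.23, 127.0), (13.2, 100.0)]

# The same element-wise division that withColumn applies to every row.
ratios = [alcohol / magnesium for alcohol, magnesium in rows]

for (alcohol, magnesium), ratio in zip(rows, ratios):
    print(f"{alcohol} / {magnesium} = {ratio:.4f}")
```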

In [16]:
# Drop rows that contain missing values, if any
df_spark = df_spark.na.drop()
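
Called without arguments, `na.drop()` removes every row that has a missing value in at least one column. The plain-Python sketch below mimics that rule on a few made-up rows, with `None` standing in for a Spark null. It only illustrates the semantics and is not how Spark implements the operation.

```python
# Made-up rows for illustration only; None stands in for a Spark null.
rows = [
    {"alcohol": 14.23, "magnesium": 127.0},
    {"alcohol": None, "magnesium": 100.0},
    {"alcohol": 13.16, "magnesium": None},
]

# Keep a row only if none of its values is missing, which mirrors
# the default behaviour of na.drop() (how="any").
complete_rows = [
    row for row in rows
    if all(value is not None for value in row.values())
]

print(len(rows), "rows before,", len(complete_rows), "rows after")
```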