<a href="https://colab.research.google.com/github/hsabaghpour/PySpark_MLlib_repo/blob/main/PySpark_Data_sources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Image data source

The image data source loads image files from a directory. It can decode compressed images (JPEG, PNG, etc.) into a raw image representation via ImageIO in the Java library. The loaded DataFrame has one StructType column, “image”, which holds the image data in the image schema. The schema of the image column is:


- origin: StringType (represents the file path of the image)
- height: IntegerType (height of the image)
- width: IntegerType (width of the image)
- nChannels: IntegerType (number of image channels)
- mode: IntegerType (OpenCV-compatible type)
- data: BinaryType (image bytes in OpenCV-compatible order: row-wise BGR in most cases)
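
To make the "row-wise BGR" layout of the data field concrete, here is a minimal sketch in plain Python that reads one pixel out of a raw byte buffer. It assumes 8-bit channels and a buffer of exactly height * width * nChannels bytes; the helper name `pixel_at` and the hand-built 2x2 `sample` buffer are hypothetical and used only for illustration.

```python
def pixel_at(height, width, n_channels, data, row, col):
    """Return the channel values of one pixel as a tuple.

    Assumes 8-bit channels stored row by row, so the pixel at (row, col)
    starts at byte offset (row * width + col) * n_channels.
    """
    if not (0 <= row < height and 0 <= col < width):
        raise IndexError("pixel coordinates out of range")
    if len(data) != height * width * n_channels:
        raise ValueError("data length does not match height * width * nChannels")
    start = (row * width + col) * n_channels
    return tuple(data[start:start + n_channels])


# Hypothetical 2x2 image with 3 channels in BGR order, built by hand.
sample = bytes([
    255, 0, 0,    0, 255, 0,     # row 0: a blue pixel, then a green pixel
    0, 0, 255,    10, 20, 30,    # row 1: a red pixel, then an arbitrary pixel
])

blue, green, red = pixel_at(2, 2, 3, sample, 1, 1)
print(blue, green, red)  # prints: 10 20 30
```

With a DataFrame loaded through the image data source, the same arguments would presumably come from the height, width, nChannels and data fields of a collected "image" row.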

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('corr').getOrCreate()


In [4]:
# dropInvalid=True leaves out files that cannot be read as valid images
df = spark.read.format("image").option("dropInvalid", True).load("/content/")

In [5]:
df.select("image.origin", "image.width", "image.height").show(truncate=False)


+---------------------------------------+-----+------+
|origin                                 |width|height|
+---------------------------------------+-----+------+
|file:///content/54893.jpg              |300  |311   |
|file:///content/DP802813.jpg           |199  |313   |
|file:///content/29.5.a_b_EGDP022204.jpg|300  |200   |
|file:///content/DP153539.jpg           |300  |296   |
+---------------------------------------+-----+------+



In [6]:
df.show()

+--------------------+
|               image|
+--------------------+
|{file:///content/...|
|{file:///content/...|
|{file:///content/...|
|{file:///content/...|
+--------------------+



# **LIBSVM data source**

The **LIBSVM** data source loads files in LIBSVM format from a directory. The loaded DataFrame has two columns: label, containing labels stored as doubles, and features, containing feature vectors stored as Vectors. The schemas of the columns are:

**label: DoubleType (represents the instance label)**

**features: VectorUDT (represents the feature vector)**
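
As a rough illustration of how one line of a LIBSVM file maps onto the label and features columns, the sketch below parses a single line in plain Python. It is not Spark's reader: the function name `parse_libsvm_line` and the `example` line are hypothetical. It relies on the LIBSVM convention that each line is `label index:value ...` with one-based indices, which are shifted to zero-based here to match the sparse vectors Spark displays.

```python
def parse_libsvm_line(line, num_features):
    """Parse one LIBSVM line into (label, size, indices, values).

    Indices in the file are one-based; they are shifted to zero-based here.
    """
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        index_text, value_text = item.split(":")
        index = int(index_text) - 1  # one-based in the file, zero-based here
        if not 0 <= index < num_features:
            raise ValueError(
                f"feature index {index + 1} outside 1..{num_features}"
            )
        indices.append(index)
        values.append(float(value_text))
    return label, num_features, indices, values


# Hypothetical line: label 1, three non-zero features out of 10.
example = "1 3:0.5 7:2.0 10:1.5"
print(parse_libsvm_line(example, 10))
# prints: (1.0, 10, [2, 6, 9], [0.5, 2.0, 1.5])
```

The returned size, indices and values correspond to the three parts of a sparse vector; features missing from the line are implicitly zero.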

In [18]:
# numFeatures sets the vector size up front, so Spark does not need an extra pass to infer it
df = spark.read.format("libsvm").option("numFeatures", "780").load("/content/sample_libsvm_data.txt")
df.show(10)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows

