[![Open Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1e-EtboixUBLzfIhxD4RAdfJi6iyAkdP9?usp=sharing)

# Comment:
Because cloud vendor services such as AWS or Azure Databricks were not available, this notebook mimics a Databricks environment by running PySpark in Google Colab. PySpark offers a scalable environment for handling big data, similar to what Databricks provides on top of Spark. The goal is to perform dimensionality reduction on a cinema ticket dataset: we initialize a Spark session, preprocess the data, and then apply Principal Component Analysis (PCA) from Spark's ML library. This approach lets us use Spark in a local environment without access to cloud-based Databricks services.


In [1]:
# Install PySpark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.9/316.9 MB 3.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=575c271f08280cdf3586a22863e343165263e3fcea4d914620c04edf3d46e2e8
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [3]:
# Import Necessary Libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.sql.types import IntegerType, DoubleType, FloatType

In [4]:
# Initialize Spark Session
spark = SparkSession.builder.master("local[*]").appName("PCA_Cinema_Ticket_Sales").getOrCreate()


In [6]:
# Load the Data
file_path = '/content/cinemaTicket_Ref.csv'  # Replace with your file path
sdf = spark.read.csv(file_path, header=True, inferSchema=True)


In [14]:
# Displaying the schema to confirm correct data types
sdf.printSchema()

root
 |-- film_code: integer (nullable = true)
 |-- cinema_code: integer (nullable = true)
 |-- total_sales: integer (nullable = true)
 |-- tickets_sold: integer (nullable = true)
 |-- tickets_out: integer (nullable = true)
 |-- show_time: integer (nullable = true)
 |-- occu_perc: double (nullable = true)
 |-- ticket_price: double (nullable = true)
 |-- ticket_use: integer (nullable = true)
 |-- capacity: double (nullable = true)
 |-- date: date (nullable = true)
 |-- month: integer (nullable = true)
 |-- quarter: integer (nullable = true)
 |-- day: integer (nullable = true)



In [8]:
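# Fill missing values if any (sdf_numerical is reconstructed here, assuming it held only the numeric columns)
numeric_types = (IntegerType, DoubleType, FloatType)
sdf_numerical = sdf.select([f.name for f in sdf.schema.fields if isinstance(f.dataType, numeric_types)])
sdf_filled = sdf_numerical.na.fill(0)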

In [10]:
# Perform PCA
# Assembling the features into a single vector
vec_assembler = VectorAssembler(inputCols=sdf_filled.columns, outputCol="features")
sdf_assembled = vec_assembler.transform(sdf_filled)


In [11]:
# Performing PCA
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")  # k is the number of principal components to keep
model = pca.fit(sdf_assembled)
sdf_pca = model.transform(sdf_assembled)

In [12]:
# Examine the Results
# Selecting and displaying the PCA features
sdf_pca.select("pcaFeatures").show()

+--------------------+
|         pcaFeatures|
+--------------------+
|[-3900044.1462103...|
|[-3360023.4913592...|
|[-2560023.5219989...|
|[-1200029.4885063...|
|[-1200023.5804174...|
|[-1050044.2718581...|
|[-1020030.0874049...|
|[-750044.28030764...|
|[-750020.12150391...|
|[-600044.28614579...|
|[-480023.62667965...|
|[-480035.42809373...|
|[-400023.62525466...|
|[-300044.30558242...|
|[-240035.43816157...|
|[-1.6500042804953...|
|[-1.3950043710614...|
|[-1.0200043873347...|
|[-6600044.0295966...|
|[-3360031.8720071...|
+--------------------+
only showing top 20 rows



In [13]:
# Explained Variance
print("Explained Variance:", model.explainedVariance)

Explained Variance: [0.9999989108412257,1.0882862864747461e-06,8.251056539257937e-10]
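
The ratios printed above show the first component alone accounting for more than 99.999% of the variance. A small sketch of how these ratios could be turned into a cumulative view to help choose `k`; the values are copied from the output above, while the helper `components_needed` and the 95% threshold are illustrative assumptions, not part of the original notebook.

```python
from itertools import accumulate

# Values copied from the "Explained Variance" output above
explained_variance = [0.9999989108412257, 1.0882862864747461e-06, 8.251056539257937e-10]

# Running total of the variance ratios, one entry per component
cumulative = list(accumulate(explained_variance))


def components_needed(cumulative_ratios, threshold):
    """Return the smallest number of components whose cumulative ratio
    reaches the threshold, or None if the threshold is never reached."""
    for k, total in enumerate(cumulative_ratios, start=1):
        if total >= threshold:
            return k
    return None


for k, (ratio, total) in enumerate(zip(explained_variance, cumulative), start=1):
    print(f"PC{k}: ratio={ratio:.3e}, cumulative={total:.10f}")

# 0.95 is an arbitrary example threshold
print("Components needed for 95%:", components_needed(cumulative, 0.95))
```

In a live session the same list can be obtained with `model.explainedVariance.toArray().tolist()` instead of pasting the numbers.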
