# Data Loading and Inspection

In this notebook, we'll focus on loading the cell phones dataset and conducting an initial inspection to understand its structure, content, and potential challenges for analysis.


### Initializing SparkSession

To work with Spark in this notebook, we need to initialize a `SparkSession`, which serves as the entry point for any functionality in Spark.

- `appName("Cell_Phones_Analysis")`: This assigns a name to our SparkSession, which is helpful for identifying it in the Spark UI.
- `getOrCreate()`: This method returns the active SparkSession if one already exists; otherwise it creates a new one.


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Cell_Phones_Analysis") \
    .getOrCreate()


23/10/30 12:34:03 WARN Utils: Your hostname, MacBook-Air-de-Ivan.local resolves to a loopback address: 127.0.0.1; using 192.168.0.10 instead (on interface en0)
23/10/30 12:34:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/30 12:34:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Loading the Dataset

In this section:

- We specify the path to our JSON data file, which contains information about cell phones and accessories.
- We then use Spark's `read.json` method to load this data into a DataFrame, `df`, for further processing and analysis.


In [2]:
# Path to the JSON file
path_to_json = "../data_amazon/Cell_Phones_and_Accessories.json"

# Load the dataset into a DataFrame (the schema is inferred from the data)
df = spark.read.json(path_to_json)



23/10/30 12:34:52 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
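
Without an explicit schema, `spark.read.json` scans the input to infer column types, which adds a pass over a file of this size. The following is a minimal sketch of supplying a schema up front, not the approach used in this notebook. It assumes `spark` is the session created earlier; the names `FLAT_REVIEW_SCHEMA` and `load_flat_reviews` are hypothetical, and the six column types are copied from the `printSchema()` output shown in this notebook. Fields that are not listed in the schema would not be loaded.

```python
# Hypothetical DDL schema: six flat columns with the types reported by printSchema().
FLAT_REVIEW_SCHEMA = (
    "asin STRING, overall DOUBLE, reviewText STRING, "
    "reviewTime STRING, reviewerID STRING, reviewerName STRING"
)


def load_flat_reviews(spark, path):
    """Read JSON with a fixed schema instead of inferring one.

    `spark` is an existing SparkSession. Fields absent from the schema are
    not loaded, so the result is narrower than spark.read.json(path).
    """
    return spark.read.schema(FLAT_REVIEW_SCHEMA).json(path)
```

A call such as `load_flat_reviews(spark, path_to_json)` would return a DataFrame with only these six columns, which may be enough for a first look at ratings and review text.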



### Previewing the Dataset

In this section, we use the `show()` method to:
- Display the first few rows of our DataFrame, `df`.
- Gain a quick overview of the data's structure and content.


In [3]:
# Display the first few rows of the DataFrame
df.show()



+----------+-----+-------+--------------------+-----------+--------------+--------------------+-----+--------------------+--------------+--------+----+
|      asin|image|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|style|             summary|unixReviewTime|verified|vote|
+----------+-----+-------+--------------------+-----------+--------------+--------------------+-----+--------------------+--------------+--------+----+
|098949232X| null|    5.0|If your into spac...|11 19, 2014|A1GG51FWU0XQYH|       Paul Williams| null|          Five Stars|    1416355200|   false|null|
|098949232X| null|    5.0|   Awesome pictures!|11 19, 2014| AVFIDS9RK38E0|         Sean Powell| null|          Five Stars|    1416355200|   false|null|
|098949232X| null|    5.0|Great wall art an...|11 19, 2014|A2S4AVR5SJ7KMI|           Tom Davis| null|          Five Stars|    1416355200|   false|null|
|098949232X| null|    5.0|As always, it is ...|11 19, 2014| AEMMMVOR9BFLI|            Kw

### DataFrame Preview

The table above shows the first few rows of our dataset. It gives a snapshot of the data structure: the columns available and the type of information stored in each one:

- `asin`: Amazon Standard Identification Number.
- `image`: Array of image URLs associated with the review (`null` when none are available).
- `overall`: Rating given to the product.
- `reviewText`: Text of the review provided by the user.
- `reviewTime`: Date when the review was written.
- `reviewerID`: Unique identifier for the reviewer.
- `reviewerName`: Name of the reviewer.
- `style`: Details about the product's style or variant.
- `summary`: A summarized version or title of the review.
- `unixReviewTime`: Review time in UNIX timestamp format.
- `verified`: Indicates if the review is verified.
- `vote`: Number of helpful votes the review received.
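
`reviewTime` and `unixReviewTime` appear to carry the same date in two forms. The sketch below checks this in plain Python for the first row of the preview. It assumes that `reviewTime` follows a month, day, year layout and that `unixReviewTime` counts seconds at midnight UTC; the helper name `review_time_to_unix` is hypothetical.

```python
from datetime import datetime, timezone


def review_time_to_unix(review_time: str) -> int:
    """Convert a reviewTime string such as '11 19, 2014' to UNIX seconds (UTC)."""
    parsed = datetime.strptime(review_time, "%m %d, %Y")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


# Values taken from the first row of the preview table.
print(review_time_to_unix("11 19, 2014") == 1416355200)  # prints True
```

If the two columns agree across the dataset, one of them is redundant for later analysis.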

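Note: By default, `show()` prints only the first 20 rows and truncates strings longer than 20 characters, which is why long values such as `reviewText` end in `...`.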
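
`show()` accepts optional arguments that help with a wide table like this one: `n` sets the number of rows, `truncate=False` prints full strings, and `vertical=True` prints one field per line. A minimal sketch follows; `preview` is a hypothetical helper, and `df` is assumed to be the DataFrame loaded earlier.

```python
def preview(df, n=5, full_text=False):
    """Print the first `n` rows of a Spark DataFrame, one field per line.

    With full_text=True, long strings such as reviewText are not cut off.
    """
    # truncate=True keeps Spark's default 20-character limit per value.
    df.show(n=n, truncate=not full_text, vertical=True)
```

For example, `preview(df, n=3, full_text=True)` would print three complete reviews without the column layout wrapping.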


### Displaying the Dataset Schema

To understand the structure of our data and the types of values each column can hold, we print the schema of our DataFrame. This provides a hierarchical view of the dataset's structure, indicating column names, data types, and any nested structures (if present).


In [4]:
# Print the schema of the DataFrame
df.printSchema()


root
 |-- asin: string (nullable = true)
 |-- image: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- style: struct (nullable = true)
 |    |-- Color Name:: string (nullable = true)
 |    |-- Color:: string (nullable = true)
 |    |-- Design:: string (nullable = true)
 |    |-- Edition:: string (nullable = true)
 |    |-- Flavor Name:: string (nullable = true)
 |    |-- Flavor:: string (nullable = true)
 |    |-- Format:: string (nullable = true)
 |    |-- Hand Orientation:: string (nullable = true)
 |    |-- Item Display Length:: string (nullable = true)
 |    |-- Item Package Quantity:: string (nullable = true)
 |    |-- Length:: string (nullable = true)
 |    |-- Material Type:: string (nullable = true)
 |    |-- Material:: string (nullable = true)
 

### Displayed Dataset Overview

The outputs above give a first look at the rows and structure of our dataset:

- Each record is a product review with attributes such as the product ID (`asin`), the overall rating (`overall`), the text of the review (`reviewText`), and the date the review was made (`reviewTime`), among others.
- The schema printed above breaks down the structure of our data, detailing each column's name, data type, and any nested structures.
- Note the `style` column, which contains nested attributes describing product features such as color, design, and material.
- This overview helps us understand the nature of the data we're working with and sets the stage for subsequent analyses.
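
The field names inside `style` end with a colon and may contain spaces (for example `Color Name:`), which makes dot-style access awkward. The sketch below pulls one nested field into a top-level column with `Column.getField`. The helper `select_style_field` and its alias-cleaning rule are hypothetical, and `df` is assumed to be the DataFrame loaded earlier.

```python
def select_style_field(df, field="Color:"):
    """Return `asin` plus one nested `style` attribute as a top-level column.

    The alias drops the trailing colon and replaces spaces, so
    'Color Name:' becomes 'color_name'.
    """
    alias = field.rstrip(":").strip().lower().replace(" ", "_")
    return df.select("asin", df["style"].getField(field).alias(alias))
```

Calling `select_style_field(df, "Color Name:").show(5)` would then display the product ID next to a `color_name` column.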


### Count the Total Records

To get an understanding of the dataset size, we'll count the total number of records in the DataFrame.


In [5]:
# Count the number of records
count = df.count()
print(f"Total records: {count}")




Total records: 10063255

### Dataset Size

The dataset contains 10,063,255 records, a large collection of reviews and related information from the Amazon Cell Phones and Accessories category.
