# Storage Solutions for Big Data - CA1


The assessment CA 1 by **Yulianna Tsaruk** \
Programme Title: Higher Diploma in Science in AI Applications \
Module Title: Storage Solutions for Big Data



## Code contents:
1. **Exploratory Data Analysis & Processing (this file)**
2. **[Training model and Usage Example](./2_training.ipynb)**



## Introduction

For this project, I use HDFS (Hadoop Distributed File System) as the primary storage system and Apache Spark for processing, accessed through PySpark, the Python interface for Apache Spark.

In this file, I load several files from the selected dataset, process them, and store the result in Apache Parquet, a column-oriented data storage format widely used in the Apache Hadoop ecosystem.
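
To make "column-oriented" concrete, here is a purely conceptual sketch in plain Python. It does not reproduce Parquet's actual on-disk format; the sample records and their values are hypothetical. It only shows why an aggregate over one column needs to touch only that column's values when data is laid out by column.

```python
# Row-oriented layout: one complete record per listing-day (hypothetical values)
rows = [
    {"listing_id": 1, "date": "2024-01-01", "price": 12000.0},
    {"listing_id": 1, "date": "2024-01-02", "price": 15000.0},
    {"listing_id": 2, "date": "2024-01-01", "price": 9000.0},
]

# Column-oriented layout: all values of each column are stored together
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over 'price' reads a single list and ignores the other columns
mean_price = sum(columns["price"]) / len(columns["price"])
print(mean_price)
```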

## Preparation

In [None]:
# import spark instances
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import to_date, dayofmonth, month, year, col, explode, \
                unix_timestamp, when, regexp_replace, mean, concat_ws, \
                dayofweek, udf, min, max, desc, count
from pyspark.sql.types import FloatType, BooleanType, StringType

# import additional libraries
import pandas as pd
import matplotlib.pyplot as plt
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

try:
    import holidays
except ImportError:
    # install the library, then import it
    !pip install holidays
    import holidays

In [None]:
# Creating Spark session with configurations
spark = (SparkSession.builder
    .appName("Tokyo Airbnb Processing and Analysis")
    # hardware-related configs, comment it if not needed for your machine.
    .config("spark.driver.memory", "6g")
    .config("spark.executor.memory", "6g")  
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.network.timeout", "600s") 
    .config("spark.executor.heartbeatInterval", "120s")
    
    # to output more
    .config("spark.sql.debug.maxToStringFields", 100)
    .getOrCreate())

In [None]:
# emulate the output of the pandas.DataFrame.info() method
def print_dataframe_info(df: DataFrame):
    """
    Print basic information about a Spark DataFrame: column names, null counts, and data types.

    Args:
    df (DataFrame): The Spark DataFrame to be analyzed.
    """
    # DataFrame shape
    total_rows = df.count()
    total_cols = len(df.columns)
    

    # Collect column names and their data types
    schema_info = [(field.name, field.dataType) for field in df.schema.fields]
    out_ = []
    for column, dtype in schema_info:
        null_count = df.filter(col(column).isNull()).count()
        out_.append({'Column': column, 'Nulls': null_count, 'Type': dtype.simpleString()})
    
    print(pd.DataFrame(out_))
    print()
    print(f"\tA dataset shape: {total_rows} rows, {total_cols} columns.")

pd.set_option('display.max_rows', None) # show all rows for pandas df
pd.set_option('display.float_format', lambda x: '%.3f' % x) # avoid scientific notation

## Load 1st dataset

In [None]:
# Set path to folder with dataset on HDFS
dataset_path_hdfs = '/user1/dataset/' # must end with /

In [None]:
# location of 1st file in Hadoop
dataset_path = dataset_path_hdfs + "calendar.csv" 

# load data
df_calendar = spark.read.csv(dataset_path, header=True, # 1st line is a header
                             inferSchema=True           # detect data types automatically
                            )
df_calendar.show(5)

### Explore and Process the data

In [None]:
# Nulls and types summary
print_dataframe_info(df_calendar)

In [None]:
# Statistical summary
df_calendar.describe().show()

Some variables have the wrong data type. For example, the summary shows no mean for the price column because its values are strings.

In [None]:
# Check the number of unique values in the 'listing_id' column
listing_gr = df_calendar.groupBy("listing_id").count()
unique_ids = listing_gr.count()
print('There are', unique_ids, 'properties in this dataset.')

pandas_df_listing = listing_gr.orderBy(col("count")).toPandas()
pandas_df_listing['days_count'] = pandas_df_listing['count']
result = pandas_df_listing.groupby('days_count').size().reset_index(name='properties_count')
pandas_df_listing.head()

In [None]:
result

In [None]:
listing_to_drop = list(pandas_df_listing[pandas_df_listing['count'] <= 339]['listing_id'])
listing_to_drop

In [None]:
df_calendar.filter(col("listing_id").isin(listing_to_drop)).count()

One property has data for only 33 days, while most other properties have data for a whole year (365 days). I will drop this property and the 5 others that have data for only 339 days, since together they account for only a small number of rows (counted in the cell above).
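
The selection logic can be illustrated in plain Python, independent of Spark. This is a minimal sketch on hypothetical data: the listing ids, the `MIN_DAYS` name, and the row counts are assumptions for illustration, with the cut-off mirroring the `count <= 339` filter used in this notebook.

```python
from collections import Counter

# Hypothetical (listing_id, day) rows; the real calendar has one row per listing per day
rows = (
    [(101, day) for day in range(365)]    # full year
    + [(102, day) for day in range(339)]  # partial year
    + [(103, day) for day in range(33)]   # about one month
)

MIN_DAYS = 340  # assumed cut-off: listings with 339 days or fewer are dropped

# Count days per listing, then pick the listings below the cut-off
days_per_listing = Counter(listing_id for listing_id, _ in rows)
listings_to_drop = sorted(lid for lid, n in days_per_listing.items() if n < MIN_DAYS)

# How many rows are lost by dropping them
rows_lost = sum(days_per_listing[lid] for lid in listings_to_drop)

print(listings_to_drop)  # [102, 103]
print(rows_lost)         # 372
```

Checking `rows_lost` against the total row count is a quick way to confirm that the dropped listings are a small share of the data.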

In [None]:
# drop rows for ids because too much data is missing
df_calendar_clean = df_calendar.filter(~col("listing_id").isin(listing_to_drop))


In [None]:
print_dataframe_info(df_calendar_clean)

In [None]:
# Options in col 'available'
df_calendar_clean.select('available').distinct().show()

It is worth noting that, although the price carries a US dollar sign, it is in Japanese yen, and the sign must be removed before the values can be converted to float.
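
As a small illustration of the cleaning step, the sketch below applies the same character class (`[$,]`) with Python's `re` module. The helper name `parse_price` and the sample strings are hypothetical; in the notebook itself the equivalent work is done by Spark's `regexp_replace` followed by a cast.

```python
import re

def parse_price(raw):
    """Strip '$' and thousands separators, then convert to float; None stays None."""
    if raw is None:
        return None
    # Inside a character class, '$' is a literal dollar sign
    return float(re.sub(r"[$,]", "", raw))

print(parse_price("$12,500.00"))  # 12500.0
print(parse_price("$980.00"))     # 980.0
print(parse_price(None))          # None
```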

In [None]:
# check min/max nights values
nights_df = df_calendar_clean.select(col('minimum_nights'), col('maximum_nights')).toPandas()
nights_df.describe()

In [None]:
# Data Preprocessing (starts from the cleaned DataFrame, so the dropped listings stay excluded)
df_calendar_new = df_calendar_clean \
    .withColumn("available", when(col("available") == "t", 1).otherwise(0)) \
    .withColumn("price", regexp_replace(col("price"), r"[\$,]", "").cast(FloatType())) \
    .withColumn("adjusted_price", regexp_replace(col("adjusted_price"), r"[\$,]", "").cast(FloatType())) \
    .withColumn("date_unix", unix_timestamp(col("date")))

In [None]:
# check price col
df_calendar_new.select(col('price')).describe().toPandas()

In [None]:
# Analyse price distribution
price_data = df_calendar_new.select('price').toPandas()
price_dist = price_data.groupby('price').size().reset_index(name='count').sort_values('price')

In [None]:
price_dist.plot(kind='scatter', x='price', y='count', 
                legend=False, title='Distribution of values in column "price"')

In [None]:
# Find out if there's a difference in cols "price" and "adjusted_price"
df_with_diff = df_calendar_new.withColumn("price_difference", col("price") - col("adjusted_price"))

# Filter rows where price_difference is not zero
rows_with_difference = df_with_diff.filter(col("price_difference") != 0)

# count how many rows with differences
rows_with_difference.count()

In [None]:
rows_with_difference.filter(col('adjusted_price')>col('price')).count()

In [None]:
rows_with_difference.filter(col('adjusted_price')<col('price')).count()

In [None]:
df_calendar_new.filter(col('adjusted_price')==col('price')).count()

Since there is no data dictionary, I cannot be sure what the 'adjusted_price' column represents, but I will treat it as the main price and mark the 'price' column to be dropped later.

In [None]:
# also drop 'minimum_nights' and 'maximum_nights', whose values don't look correct
col_to_drop = ['price', 'minimum_nights', 'maximum_nights']

In [None]:
# release unused DataFrames (unpersist only has an effect if they were cached)
df_with_diff.unpersist()
rows_with_difference.unpersist()

## Load 2nd dataset

In [None]:
df_list = spark.read.csv(dataset_path_hdfs + "listings.csv",
    header=True, # 1st line is a header
    quote='"',  
    escape='"', 
    multiLine=True,  # Handles new lines in fields
    inferSchema=True,  # detect data types automatically
    ignoreLeadingWhiteSpace=True,  # Ignoring white space in a line
    ignoreTrailingWhiteSpace=True)

In [None]:
df_list.show(2)

In [None]:
# The output above is messy, so let's display it as a pandas DataFrame
df_list.limit(5).toPandas()

### Explore and Process the data

In [None]:
# check through the schema that everything loaded correctly
df_list.printSchema()

From this dataset I'll take some information to enrich the calendar data. Potentially useful columns are:
* neighbourhood_cleansed
* host_identity_verified
* property_type
* instant_bookable

In [None]:
df_list.select('property_type').distinct().toPandas()

In [None]:
df_list.select('room_type').distinct().toPandas()

After checking the unique values, I see that the feature I want is called 'room_type', while 'property_type' consists of marketing names.

In [None]:
print_dataframe_info(df_list)

In [None]:
# check unique values in the 'id' column
unique_ids_list = df_list.select("id").distinct()
n_listing_ids = unique_ids_list.count()  # count once and reuse

print(f'Unique IDs: {n_listing_ids},', n_listing_ids - unique_ids, 'properties more than in calendar data.')

In [None]:
# select columns that I want to use to expand calendar df
selected_cols = [
    'id',
    'neighbourhood_cleansed',
    'room_type',
    'host_identity_verified',
    'instant_bookable',
]
new_df = df_list.select(selected_cols)
# Merge new df with selected_cols and df_calendar on col id and listing_id
merged_df = new_df.join(df_calendar_new, new_df.id == df_calendar_new.listing_id, "inner")

In [None]:
merged_df.take(1)

In [None]:
merged_df = merged_df.drop('listing_id') # duplicated col

In [None]:
#fixing dtypes
df = merged_df \
    .withColumn("host_identity_verified", when(col("host_identity_verified") == "t", True).otherwise(False).cast(BooleanType())) \
    .withColumn("instant_bookable", when(col("instant_bookable") == "t", True).otherwise(False).cast(BooleanType()))

merged_df.unpersist()
new_df.unpersist()

In [None]:
# re-check dtypes
df.dtypes

## Analysis

In [None]:
# 'available' was encoded as 1/0 above, so 0 means the property is occupied
df_busy_times = df.where(col("available") == 0)\
                  .withColumn("year_month", concat_ws("-", year("date"), month("date"))) \
                  .select('year_month') \
                  .groupBy("year_month").count().toPandas()

df_busy_times['year_month'] = df_busy_times['year_month'].astype('period[M]')
df_busy_times.sort_values(['year_month'], ascending=True, inplace=True)

In [None]:
df_busy_times.head()

In [None]:
df_busy_times.plot(x='year_month', y='count', kind='bar', legend=False,
                      title="Occupied properties by month")

In [None]:
# build a 'year_month' column, calculate the mean of 'price' per month,
# and sort the results by 'year_month'

df_price = df.where(col("price") > 0) \
                  .withColumn("year_month", concat_ws("-", year("date"), month("date"))) \
                  .groupBy("year_month") \
                  .agg(mean("price").alias("mean")) \
                  .toPandas()

df_price['year_month'] = df_price['year_month'].astype('period[M]')
df_price.sort_values('year_month', ascending=True, inplace=True)
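
The cell above converts `year_month` to a pandas period before sorting. The reason can be sketched in plain Python: labels built as `year-month` without zero padding do not sort chronologically as strings. The sample labels and the `month_key` helper below are hypothetical and only illustrate the ordering problem.

```python
# Labels in the same shape as concat_ws("-", year, month) produces, e.g. "2023-9"
labels = ["2023-10", "2023-9", "2024-1", "2023-12"]

def month_key(label):
    """Turn 'YYYY-M' into a (year, month) tuple that sorts chronologically."""
    year, month = label.split("-")
    return int(year), int(month)

# Plain string sorting puts September after December
print(sorted(labels))                 # ['2023-10', '2023-12', '2023-9', '2024-1']

# Sorting on the parsed (year, month) key gives calendar order
print(sorted(labels, key=month_key))  # ['2023-9', '2023-10', '2023-12', '2024-1']
```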

In [None]:
df_price.head()

In [None]:
df_price.plot(kind='line', 
                x='year_month', y='mean', 
                legend=False,
                title="Mean price per night by Month")

In [None]:
# clean memory
df_list.unpersist()
df_calendar.unpersist()

In [None]:
df.columns

## Feature Selection and Engineering

I'd like to add new date-based features to help the algorithm find dependencies:
- if date is a weekend
- if date is a holiday
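
One detail matters for the weekend flag: Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), so a Saturday/Sunday weekend corresponds to the values 7 and 1. The sketch below reproduces that numbering with the standard library; the helper names `spark_dayofweek` and `is_weekend` are hypothetical, and treating only Saturday and Sunday as the weekend is an assumption.

```python
from datetime import date

def spark_dayofweek(d):
    """Return the day number in Spark's convention: 1 = Sunday ... 7 = Saturday."""
    # date.isoweekday() uses Monday = 1 ... Sunday = 7, so shift by one day
    return d.isoweekday() % 7 + 1

def is_weekend(d):
    """True for Saturday (7) and Sunday (1) in Spark's numbering."""
    return spark_dayofweek(d) in (1, 7)

print(spark_dayofweek(date(2024, 1, 5)), is_weekend(date(2024, 1, 5)))  # Friday: 6 False
print(spark_dayofweek(date(2024, 1, 6)), is_weekend(date(2024, 1, 6)))  # Saturday: 7 True
print(spark_dayofweek(date(2024, 1, 7)), is_weekend(date(2024, 1, 7)))  # Sunday: 1 True
```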

In [None]:
# if date is a weekend
# Spark's dayofweek: 1 = Sunday ... 7 = Saturday
df_date = df.withColumn("weekends", dayofweek(col("date")).isin([1, 7]))

In [None]:
# for holiday detection I use the holidays module and a user-defined function (UDF)
jp_holidays = holidays.Japan()

def is_holiday(date):
    return date in jp_holidays

holiday_udf = udf(is_holiday, BooleanType())

df_date = df_date.withColumn("holiday", holiday_udf(col("date"))).sort('date')

In [None]:
print_dataframe_info(df_date)

In [None]:
col_to_drop += ['id', 'date']
col_to_drop

In [None]:
# Cleaning up 
# deleting cols that won't be used for training
df_model = df_date.drop(*col_to_drop)
# rename col
df_model = df_model.withColumnRenamed('adjusted_price', 'price')

In [None]:
print_dataframe_info(df_model)

In [None]:
# Save DataFrame to HDFS in Parquet format
df_model.write.parquet(dataset_path_hdfs +"db",
                       # for re-running this code
                       mode="overwrite")

In [None]:
# Terminate spark session
spark.stop()