<p style="text-align:center">
        <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo">
</p>


### Analyse search terms on the e-commerce web server


##### In this assignment you will download the search term data set for the e-commerce web server and run analytic queries on it.


In [12]:
# Install PySpark and findspark
!pip install pyspark==3.1.2 -q
!pip install findspark -q

In [13]:
# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()

# Suppress warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

In [14]:
# Import libraries
from pyspark.sql import SparkSession

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler

from pyspark.ml import Pipeline
from pyspark.ml.pipeline import PipelineModel

from pyspark.ml.evaluation import RegressionEvaluator

In [15]:
# Create a spark session
spark = SparkSession \
    .builder \
    .appName('Capstone Project - Spark_MLOps') \
    .getOrCreate()

24/02/14 03:19:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/14 03:19:54 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [16]:
# Download the search term dataset from the URL below
# https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/searchterms.csv
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/searchterms.csv

--2024-02-14 03:20:00--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/searchterms.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 233457 (228K) [text/csv]
Saving to: ‘searchterms.csv’


2024-02-14 03:20:01 (42.6 MB/s) - ‘searchterms.csv’ saved [233457/233457]



In [17]:
# Load the CSV file into a Spark DataFrame
df = spark.read.csv('searchterms.csv', header=True, inferSchema=True)
df.show(5)

+---+-----+----+--------------+
|day|month|year|    searchterm|
+---+-----+----+--------------+
| 12|   11|2021| mobile 6 inch|
| 12|   11|2021| mobile latest|
| 12|   11|2021|   tablet wifi|
| 12|   11|2021|laptop 14 inch|
| 12|   11|2021|     mobile 5g|
+---+-----+----+--------------+
only showing top 5 rows



In [18]:
# Print the number of rows and columns
row = df.count()
col = len(df.columns)
print(f'Dimension of the Dataframe is: {(row,col)}')
print(f'Number of rows are: {row}')
print(f'Number of columns are: {col}')

Dimension of the Dataframe is: (10000, 4)
Number of rows are: 10000
Number of columns are: 4


In [21]:
# Print the top 5 rows
df.show(5)

+---+-----+----+--------------+
|day|month|year|    searchterm|
+---+-----+----+--------------+
| 12|   11|2021| mobile 6 inch|
| 12|   11|2021| mobile latest|
| 12|   11|2021|   tablet wifi|
| 12|   11|2021|laptop 14 inch|
| 12|   11|2021|     mobile 5g|
+---+-----+----+--------------+
only showing top 5 rows



In [24]:
# Find the data type of the searchterm column
df.printSchema()

root
 |-- day: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- searchterm: string (nullable = true)



In [27]:
# How many times was the term `gaming laptop` searched?
# Create a temporary view so the DataFrame can be queried with SQL
df.createOrReplaceTempView("temp")
spark.sql("SELECT COUNT(*) FROM temp WHERE searchterm = 'gaming laptop'").show()

+--------+
|count(1)|
+--------+
|     499|
+--------+



In [30]:
# Print the top 5 most frequently used search terms
# (Take a screenshot of the code and name it top5terms.jpg)
spark.sql("SELECT searchterm, count(searchterm) FROM temp GROUP BY searchterm ORDER BY count(searchterm) DESC LIMIT 5").show()



+-------------+-----------------+
|   searchterm|count(searchterm)|
+-------------+-----------------+
|mobile 6 inch|             2312|
|    mobile 5g|             2301|
|mobile latest|             1327|
|       laptop|              935|
|  tablet wifi|              896|
+-------------+-----------------+
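
The `GROUP BY ... ORDER BY ... LIMIT` query above is a frequency count followed by a top-N selection. As a rough cross-check of that logic outside Spark, the same computation can be sketched in plain Python with `collections.Counter`. The `sample_terms` list below is a small made-up sample, not the real data set, and `top_terms` is a hypothetical helper name.

```python
from collections import Counter

def top_terms(terms, n=5):
    """Return the n most frequent terms as (term, count) pairs, most frequent first."""
    return Counter(terms).most_common(n)

# Made-up sample for illustration; the real data lives in searchterms.csv
sample_terms = [
    "mobile 6 inch", "mobile 5g", "mobile 6 inch",
    "laptop", "mobile 6 inch", "mobile 5g", "tablet wifi",
]

result = top_terms(sample_terms, n=2)
print(result)  # [('mobile 6 inch', 3), ('mobile 5g', 2)]
```

For the 10,000-row data set this approach would work on the collected `searchterm` values, but for larger data the Spark query remains the better fit because the aggregation stays distributed.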



In [31]:
# The pretrained sales forecasting model is available at the URL below
# https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/model.tar.gz
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/model.tar.gz

--2024-02-14 03:51:22--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0321EN-SkillsNetwork/Bigdata%20and%20Spark/model.tar.gz
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1490 (1.5K) [application/x-tar]
Saving to: ‘model.tar.gz’


2024-02-14 03:51:22 (12.1 MB/s) - ‘model.tar.gz’ saved [1490/1490]



In [33]:
# Extract the model archive into the current directory
import tarfile
with tarfile.open('model.tar.gz') as archive:
    archive.extractall()
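
Before extracting, it can help to list what the archive contains, since the next cell assumes a directory named `sales_prediction.model`. The following is a minimal sketch using only the standard library. It builds a small throwaway archive in a temporary directory so that it runs anywhere; the names `demo.model`, `metadata.txt`, and `list_archive_members` are hypothetical stand-ins, not part of the course's `model.tar.gz`.

```python
import os
import tarfile
import tempfile

def list_archive_members(archive_path):
    """Return the member names stored in a tar archive without extracting it."""
    with tarfile.open(archive_path) as archive:
        return archive.getnames()

# Build a throwaway archive so the sketch is self-contained.
# 'demo.model' and 'metadata.txt' are hypothetical names for illustration only.
workdir = tempfile.mkdtemp()
model_dir = os.path.join(workdir, "demo.model")
os.makedirs(model_dir)
with open(os.path.join(model_dir, "metadata.txt"), "w") as fh:
    fh.write("placeholder")

archive_path = os.path.join(workdir, "demo.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    # arcname keeps the paths inside the archive relative
    archive.add(model_dir, arcname="demo.model")

members = list_archive_members(archive_path)
print(members)
```

Calling `list_archive_members('model.tar.gz')` on the downloaded file would show the top-level directory name to pass to `LinearRegressionModel.load`.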

In [37]:
# Load the sales forecast model
from pyspark.ml.regression import LinearRegressionModel
model = LinearRegressionModel.load('sales_prediction.model')

In [39]:
# Using the sales forecast model, predict the sales for the year 2023.
# Take a screenshot of the code and name it forecast.jpg
def predict(year):
    # Assemble the single input column into the 'features' vector used by the model
    assembler = VectorAssembler(inputCols=["year"], outputCol="features")
    # 'sales' is only a placeholder value of 0; the prediction is computed from 'features'
    data = [[year, 0]]
    columns = ["year", "sales"]
    input_df = spark.createDataFrame(data, columns)
    features_df = assembler.transform(input_df).select('features', 'sales')
    predictions = model.transform(features_df)
    predictions.select('prediction').show()

In [40]:
predict(2023)

+------------------+
|        prediction|
+------------------+
|175.16564294006457|
+------------------+



24/02/14 04:22:59 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
24/02/14 04:22:59 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
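
A fitted linear regression model predicts by taking a weighted sum of the features plus an intercept. In PySpark these parameters are exposed as `model.coefficients` and `model.intercept`. The sketch below reproduces that arithmetic in plain Python; the coefficient and intercept values are hypothetical placeholders, not the values stored in `sales_prediction.model`, and `linear_predict` is an illustrative helper name.

```python
def linear_predict(coefficients, intercept, features):
    """Compute intercept + sum(coefficient_i * feature_i) for one input row."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical parameters for illustration only; read the real ones
# from model.coefficients and model.intercept after loading the model.
example_coefficients = [2.0]
example_intercept = -4000.0

prediction = linear_predict(example_coefficients, example_intercept, [2023])
print(prediction)  # 46.0
```

Substituting the loaded model's own coefficient and intercept for the placeholders should reproduce the value that `predict(2023)` prints, which is a quick way to sanity-check the forecast.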
