![Lancaster University](https://www.lancaster.ac.uk/media/lancaster-university/content-assets/images/fst/logos/SCC-Logo.svg)

# SCC.454: Large Scale Platforms for AI and Data Analysis
## Practice Quiz

**Duration:** 1 Hour  
**Total Marks:** 100  

---

### Instructions

1. This is a **practice quiz** to help you prepare for the actual assessment.
2. **Write your code** in the designated code cells below each question.
3. **All questions are independent** — if you cannot answer one question, move on to the next.
4. **Run your code** to verify correctness.

### API Documentation

- **NumPy:** [https://numpy.org/doc/stable/reference/](https://numpy.org/doc/stable/reference/)
- **Pandas:** [https://pandas.pydata.org/docs/reference/](https://pandas.pydata.org/docs/reference/)
- **Scikit-learn:** [https://scikit-learn.org/stable/api/](https://scikit-learn.org/stable/api/)
- **PySpark SQL Functions:** [https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.sql/functions.html](https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.sql/functions.html)
- **PySpark DataFrame:** [https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.sql/dataframe.html](https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.sql/dataframe.html)
- **PySpark ML Feature:** [https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.ml.html](https://spark.apache.org/docs/3.5.0/api/python/reference/pyspark.ml.html)

---

| Section | Topic | Marks |
|---------|-------|-------|
| **A** | Python, NumPy, Pandas & Scikit-learn | **30** |
| **B** | Apache Spark (RDDs, DataFrames, SQL) | **35** |
| **C** | Data Preprocessing & Similarity Search | **35** |
| | **Total** | **100** |


---
# Section A: Python, NumPy, Pandas & Scikit-learn (30 marks)
---


## Question 1 — NumPy Array Operations [10 marks]

Consider the following 3×4 matrix **M**:

```
      Col0  Col1  Col2  Col3
Row0    4    12     7     3
Row1    8     5    14    10
Row2    6    11     2     9
```

**API Reference:** [numpy.org/doc/stable/reference](https://numpy.org/doc/stable/reference/)

**(a)** Create the matrix `M` as a NumPy array exactly as shown above. Print its shape and data type. **[2 marks]**

**(b)** Extract and print: (i) the second row, (ii) the third column, and (iii) the element at row 1, column 2. **[2 marks]**

**(c)** Compute and print the **sum** of each row and the **mean** of each column. **[3 marks]**

**(d)** Using boolean indexing, find and print all elements in `M` that are **greater than 7**. Then, create a copy of `M` and replace all elements greater than 7 with `0`. Print the modified matrix. **[3 marks]**


In [1]:
# Q1 — Write your code here
import numpy as np

# (a) Create matrix M, print shape and dtype
matrix = np.array([[4,12,7,3],[8,5,14,10],[6,11,2,9]])
print(matrix.shape)
print(matrix.dtype)

# (b) Extract second row, third column, element at [1,2]
print(matrix[1])
print(matrix[:,2])
print(matrix[1,2])

# (c) Sum of each row, mean of each column
print(np.sum(matrix,axis=1))
print(np.mean(matrix,axis =0))

# (d) Elements > 7, then replace > 7 with 0
elements = matrix[matrix > 7]
print(elements)

matrixcopy = matrix.copy()
matrixcopy[matrixcopy > 7] =0
print(matrixcopy)

(3, 4)
int64
[ 8  5 14 10]
[ 7 14  2]
14
[26 37 28]
[6.         9.33333333 7.66666667 7.33333333]
[12  8 14 10 11  9]
[[4 0 7 3]
 [0 5 0 0]
 [6 0 2 0]]


## Question 2 — Pandas Data Manipulation [10 marks]

A shop has recorded the following sales data:

```
order_id  product      category     price   quantity  date
1001      Laptop       Electronics  999.99  1         2025-03-01
1002      Mouse        Electronics  29.99   3         2025-03-01
1003      Notebook     Stationery   5.99    10        2025-03-02
1004      Keyboard     Electronics  79.99   2         2025-03-02
1005      Pen Set      Stationery   12.99   5         2025-03-03
1006      Monitor      Electronics  349.99  1         2025-03-03
1007      Stapler      Stationery   8.99    NaN       2025-03-04
1008      Headphones   Electronics  149.99  2         2025-03-04
```

**API Reference:** [pandas.pydata.org/docs/reference](https://pandas.pydata.org/docs/reference/)

**(a)** Create this DataFrame in pandas exactly as shown (use `np.nan` for the missing value). Print the DataFrame and its info. **[2 marks]**

**(b)** Fill the missing `quantity` value with the **median** quantity of all products. Print the updated DataFrame. **[2 marks]**

**(c)** Add a new column called `total` computed as `price × quantity`. Then filter and display only rows where `total > 100`. **[3 marks]**

**(d)** Using `groupby`, calculate the **total revenue** (sum of `total`) and the **number of orders** per category. Sort by total revenue descending. **[3 marks]**


In [13]:
# Q2 — Write your code here
import pandas as pd
import numpy as np

# (a) Create DataFrame and print info
test_df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'product': ['Laptop', 'Mouse', 'Notebook', 'Keyboard', 'Pen Set', 'Monitor', 'Stapler', 'Headphones'],
    'category': ['Electronics', 'Electronics', 'Stationery', 'Electronics', 'Stationery', 'Electronics', 'Stationery', 'Electronics'],
    'price': [999.99, 29.99, 5.99, 79.99, 12.99, 349.99, 8.99, 149.99],
    'quantity': [1, 3, 10, 2, 5, 1, np.nan, 2],
    'date': ['2025-03-01', '2025-03-01', '2025-03-02', '2025-03-02', '2025-03-03', '2025-03-03', '2025-03-04', '2025-03-04']
}
)
print(test_df)

# (b) Fill missing quantity with median
quantity_median = test_df['quantity'].median()
test_filled_df = test_df.copy()
test_filled_df['quantity']=test_df['quantity'].fillna(quantity_median)
print(test_filled_df)
# (c) Add total column, filter where total > 100
test_df_total = test_filled_df.copy()
test_df_total['total'] =test_df_total['price'] * test_df_total['quantity']
print(test_df_total[test_df_total['total']>100])

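# (d) Groupby category: total revenue and order count, sorted by revenue descending
print(test_df_total.groupby('category').agg(
    total_revenue = ('total',"sum"),
    order_count = ('order_id',"count"),  # count the orders; summing the IDs is meaningless
).sort_values('total_revenue', ascending=False))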


   order_id     product     category   price  quantity        date
0      1001      Laptop  Electronics  999.99       1.0  2025-03-01
1      1002       Mouse  Electronics   29.99       3.0  2025-03-01
2      1003    Notebook   Stationery    5.99      10.0  2025-03-02
3      1004    Keyboard  Electronics   79.99       2.0  2025-03-02
4      1005     Pen Set   Stationery   12.99       5.0  2025-03-03
5      1006     Monitor  Electronics  349.99       1.0  2025-03-03
6      1007     Stapler   Stationery    8.99       NaN  2025-03-04
7      1008  Headphones  Electronics  149.99       2.0  2025-03-04
   order_id     product     category   price  quantity        date
0      1001      Laptop  Electronics  999.99       1.0  2025-03-01
1      1002       Mouse  Electronics   29.99       3.0  2025-03-01
2      1003    Notebook   Stationery    5.99      10.0  2025-03-02
3      1004    Keyboard  Electronics   79.99       2.0  2025-03-02
4      1005     Pen Set   Stationery   12.99       5.0  2025-0

## Question 3 — Scikit-learn Classification [10 marks]

You will use the Iris dataset for this question. **Run the setup cell first.**

**API Reference:** [scikit-learn.org/stable/api](https://scikit-learn.org/stable/api/)


In [None]:
# === RUN THIS CELL FIRST ===
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target

print(f"Dataset shape: {df_iris.shape}")
print(f"Target classes: {list(iris.target_names)}")
print(f"Class distribution:\n{df_iris['target'].value_counts().sort_index()}")
df_iris.head()


**(a)** Split the data into training (80%) and testing (20%) sets with `random_state=42` and stratified sampling. Print the shapes. **[2 marks]**

**(b)** Apply `StandardScaler` to the features. Fit on training data only, then transform both sets. **[2 marks]**

**(c)** Train a **K-Nearest Neighbours** classifier with `n_neighbors=3`. Print the accuracy on the test set. **[3 marks]**

**(d)** Print the **confusion matrix** and the **classification report** for the KNN model. **[3 marks]**
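
One possible shape for parts (a) to (d) is sketched below. It is a minimal sketch rather than a model answer: it reloads Iris so that it runs on its own, and the variable names (`X_train`, `knn`, `cm` and so on) are illustrative choices, not names required by the quiz.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Reload the data so this sketch does not depend on the setup cell.
iris = load_iris()
X, y = iris.data, iris.target

# (a) 80/20 split; stratify=y keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)

# (b) Fit the scaler on the training features only, then transform both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# (c) Train KNN with k=3 and score it on the held-out test set.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")

# (d) Per-class breakdown of correct and incorrect predictions.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into the model, which is the point of part (b).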


In [None]:
# Q3 — Write your code here
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# (a) Train-test split with stratification


# (b) StandardScaler - fit on train, transform both


# (c) Train KNN with n_neighbors=3, print accuracy


# (d) Confusion matrix and classification report



---
# Section B: Apache Spark — RDDs, DataFrames & SQL (35 marks)
---

### ⚙️ Spark Setup

Run the two setup cells below before attempting the Spark questions.


In [None]:
# === SETUP CELL 1: Install PySpark and Java ===
!pip install pyspark==3.5.0 -q
!apt-get install openjdk-11-jdk-headless -qq > /dev/null 2>&1

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
print("PySpark and Java installed successfully!")


In [1]:
# === SETUP CELL 2: Create SparkSession ===
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SCC454-Practice") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

sc = spark.sparkContext
print(f"Spark version: {spark.version}")


Spark version: 4.0.2


## Question 4 — RDD Transformations and Actions [12 marks]

**Run the setup cell first**, then answer the questions below.


In [3]:
# === RUN THIS CELL FIRST ===
sentences = [
    "Apache Spark is fast",
    "Spark is used for big data",
    "Big data processing is important",
    "Spark and Hadoop are popular",
    "Data science uses Spark",
]

sentences_rdd = sc.parallelize(sentences, 2)
print(f"RDD created with {sentences_rdd.count()} sentences")


RDD created with 5 sentences


**(a)** Using `flatMap`, split each sentence into words (lowercase) and collect all words as a list. Print the total number of words. **[3 marks]**

**(b)** Using `map` and `reduceByKey`, count the occurrences of each word. Print all word counts. **[3 marks]**

**(c)** Find the **top 5 most frequent words** using `sortBy`. Print them with their counts. **[3 marks]**

**(d)** Using `filter`, find all words that contain the letter `'a'`. Print the count and the list of words. **[3 marks]**
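
As a Spark-free cross-check of the expected values, the same four steps can be mimicked on a plain Python list: a nested comprehension plays the role of `flatMap`, and `collections.Counter` stands in for `map` plus `reduceByKey`. This is only an analogy for checking results, not how Spark executes the job, and part (d) is applied here to the distinct words, which is one reading of the question.

```python
from collections import Counter

sentences = [
    "Apache Spark is fast",
    "Spark is used for big data",
    "Big data processing is important",
    "Spark and Hadoop are popular",
    "Data science uses Spark",
]

# (a) flatMap analogue: one flat list of lowercase words.
words = [w for s in sentences for w in s.lower().split()]
print(len(words))

# (b) map + reduceByKey analogue: count occurrences per word.
word_counts = Counter(words)
print(dict(word_counts))

# (c) Top 5 by count; ties keep first-seen order here, which Spark does not guarantee.
top5 = word_counts.most_common(5)
print(top5)

# (d) filter analogue: distinct words containing the letter 'a'.
words_with_a = sorted(w for w in word_counts if "a" in w)
print(len(words_with_a), words_with_a)
```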


In [24]:
from operator import add
# Q4 — Write your code here

# (a) Split sentences into words, count total words
sentences_flat_mapped = sentences_rdd.flatMap(lambda s:s.lower().split())
sentences_flat_mapped.collect()
print(sentences_flat_mapped.count())

# (b) Word count using map and reduceByKey
sentences_mapped = sentences_flat_mapped.map(lambda word:(word,1))
# Named word_counts so the original `sentences` list from the setup cell is not overwritten
word_counts = sentences_mapped.reduceByKey(add)
# print(word_counts.collect())
# (c) Top 5 most frequent words
sorted_S = word_counts.sortBy(lambda x: -x[1]).take(5)
print(sorted_S)
# (d) Words containing letter 'a'
sentences_with_a = word_counts.filter(lambda x: 'a' in x[0])
print(sentences_with_a.collect())


24
[('spark', 4), ('is', 3), ('data', 3), ('big', 2), ('apache', 1)]
[('apache', 1), ('fast', 1), ('important', 1), ('and', 1), ('hadoop', 1), ('are', 1), ('popular', 1), ('spark', 4), ('data', 3)]


## Question 5 — Spark DataFrame Operations [12 marks]

Consider the following student grades data:

```
student_id  name      subject     score   semester
S001        Alice     Maths       85      Fall
S001        Alice     Physics     78      Fall
S002        Bob       Maths       92      Fall
S002        Bob       Physics     88      Fall
S003        Carol     Maths       76      Fall
S003        Carol     Physics     82      Fall
S001        Alice     Maths       88      Spring
S001        Alice     Physics     84      Spring
S002        Bob       Maths       90      Spring
S002        Bob       Physics     91      Spring
```

**Run the setup cell first.**


In [25]:
# === RUN THIS CELL FIRST ===
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

grades_data = [
    ("S001", "Alice", "Maths", 85, "Fall"),
    ("S001", "Alice", "Physics", 78, "Fall"),
    ("S002", "Bob", "Maths", 92, "Fall"),
    ("S002", "Bob", "Physics", 88, "Fall"),
    ("S003", "Carol", "Maths", 76, "Fall"),
    ("S003", "Carol", "Physics", 82, "Fall"),
    ("S001", "Alice", "Maths", 88, "Spring"),
    ("S001", "Alice", "Physics", 84, "Spring"),
    ("S002", "Bob", "Maths", 90, "Spring"),
    ("S002", "Bob", "Physics", 91, "Spring"),
]

grades_schema = StructType([
    StructField("student_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("subject", StringType(), True),
    StructField("score", IntegerType(), True),
    StructField("semester", StringType(), True),
])

grades_df = spark.createDataFrame(grades_data, grades_schema)
print("Grades DataFrame created:")
grades_df.show()


Grades DataFrame created:
+----------+-----+-------+-----+--------+
|student_id| name|subject|score|semester|
+----------+-----+-------+-----+--------+
|      S001|Alice|  Maths|   85|    Fall|
|      S001|Alice|Physics|   78|    Fall|
|      S002|  Bob|  Maths|   92|    Fall|
|      S002|  Bob|Physics|   88|    Fall|
|      S003|Carol|  Maths|   76|    Fall|
|      S003|Carol|Physics|   82|    Fall|
|      S001|Alice|  Maths|   88|  Spring|
|      S001|Alice|Physics|   84|  Spring|
|      S002|  Bob|  Maths|   90|  Spring|
|      S002|  Bob|Physics|   91|  Spring|
+----------+-----+-------+-----+--------+



**(a)** Select only `name`, `subject`, and `score` columns. Then filter to show only rows where `score >= 85`. **[3 marks]**

**(b)** Add a new column `grade` based on score: `'A'` if score >= 90, `'B'` if score >= 80, `'C'` otherwise. Show the result. **[3 marks]**

**(c)** Using `groupBy`, calculate the **average score** per student (by `name`). Order by average score descending. **[3 marks]**

**(d)** Using `groupBy`, calculate the **average score** per subject per semester. Show the result ordered by semester then subject. **[3 marks]**


In [34]:
# Q5 — Write your code here
from pyspark.sql.functions import col, when, avg, round as spark_round
from pyspark.sql import functions as f
# (a) Select columns and filter score >= 85
filtered_grades = grades_df.select("name","subject","score").filter(col("score")>=85)
# print(filtered_grades.show())

# (b) Add grade column (A/B/C based on score)
letter_score = grades_df.withColumn(
    "grade",
    when(col("score") >= 90,'A').when(col("score") >= 80,'B').otherwise('C')
)
# print(letter_score.show())
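# (c) Average score per student, ordered by average score descending
avg_score = grades_df.groupBy("name").agg(
    avg(col("score"))
).orderBy(col("avg(score)").desc())
avg_score.show()  # show() prints the table itself and returns None, so it is not wrapped in print()
# (d) Average score per subject per semester, ordered by semester then subject
avg_score_subject = grades_df.groupBy("semester","subject").agg(avg("score")).orderBy("semester","subject")
avg_score_subject.show()


+-----+----------+
| name|avg(score)|
+-----+----------+
|  Bob|     90.25|
|Alice|     83.75|
|Carol|      79.0|
+-----+----------+
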
+--------+-------+-----------------+
|semester|subject|       avg(score)|
+--------+-------+-----------------+
|    Fall|  Maths|84.33333333333333|
|    Fall|Physics|82.66666666666667|
|  Spring|  Maths|             89.0|
|  Spring|Physics|             87.5|
+--------+-------+-----------------+



## Question 6 — Spark SQL [11 marks]

Register the grades DataFrame as a temporary view and answer using **Spark SQL**.


In [35]:
# Register the DataFrame as a temp view
grades_df.createOrReplaceTempView("grades")
print("View 'grades' registered.")


View 'grades' registered.


**(a)** Write a SQL query to find all students who scored **above 85** in **Maths**. Return: `name`, `score`, `semester`. **[3 marks]**

**(b)** Write a SQL query to calculate the **average score per subject**. Return: `subject`, `avg_score` (rounded to 2 decimals). **[3 marks]**

**(c)** Write a SQL query to find the **highest score** achieved by each student across all subjects and semesters. Return: `name`, `max_score`. Order by `max_score` descending. **[5 marks]**


In [43]:
# Q6 — Write your SQL queries here

# (a) Students scoring above 85 in Maths
result_a = spark.sql("""
    select name, score, semester from grades where score > 85 and subject = 'Maths'
""")
result_a.show()


# (b) Average score per subject
result_b = spark.sql("""
    select subject, round(avg(score),2) as avg_score from grades group by subject
""")
result_b.show()


# (c) Highest score per student
result_c = spark.sql("""
    select name, max(score) as max_score from grades group by name order by max_score desc
""")
result_c.show()


+-----+-----+--------+
| name|score|semester|
+-----+-----+--------+
|  Bob|   92|    Fall|
|Alice|   88|  Spring|
|  Bob|   90|  Spring|
+-----+-----+--------+

+-------+---------+
|subject|avg_score|
+-------+---------+
|Physics|     84.6|
|  Maths|     86.2|
+-------+---------+

+-----+---------+
| name|max_score|
+-----+---------+
|  Bob|       92|
|Alice|       88|
|Carol|       82|
+-----+---------+



---
# Section C: Data Preprocessing & Similarity Search (35 marks)
---


## Question 7 — Text Preprocessing & Regular Expressions [12 marks]

Consider the following product data with messy text:

```
id   raw_text
1    "Product: LAPTOP-2025 | Price: $999.99 | Stock: 50"
2    "Product: mouse-2024 | Price: $29.50 | Stock: 200"
3    "Product: KEYBOARD-2025 | Price: $79.00 | Stock: 75"
4    "Product: Monitor-2023 | Price: $349.99 | Stock: 30"
5    "Product: HEADSET-2025 | Price: $149.00 | Stock: 100"
```

**Run the setup cell first.**


In [44]:
# === RUN THIS CELL FIRST ===
product_data = [
    (1, "Product: LAPTOP-2025 | Price: $999.99 | Stock: 50"),
    (2, "Product: mouse-2024 | Price: $29.50 | Stock: 200"),
    (3, "Product: KEYBOARD-2025 | Price: $79.00 | Stock: 75"),
    (4, "Product: Monitor-2023 | Price: $349.99 | Stock: 30"),
    (5, "Product: HEADSET-2025 | Price: $149.00 | Stock: 100"),
]

products_df = spark.createDataFrame(product_data, ["id", "raw_text"])
print("Products DataFrame created:")
products_df.show(truncate=False)


Products DataFrame created:
+---+---------------------------------------------------+
|id |raw_text                                           |
+---+---------------------------------------------------+
|1  |Product: LAPTOP-2025 | Price: $999.99 | Stock: 50  |
|2  |Product: mouse-2024 | Price: $29.50 | Stock: 200   |
|3  |Product: KEYBOARD-2025 | Price: $79.00 | Stock: 75 |
|4  |Product: Monitor-2023 | Price: $349.99 | Stock: 30 |
|5  |Product: HEADSET-2025 | Price: $149.00 | Stock: 100|
+---+---------------------------------------------------+



**(a)** Using `regexp_extract`, extract the **product name** (e.g., "LAPTOP-2025") into a new column called `product_name`. Show the result. **[3 marks]**

**(b)** Using `regexp_extract`, extract the **price** (the numeric value after $, e.g., "999.99") into a column called `price`. Cast it to `DoubleType`. **[3 marks]**

**(c)** Using `lower()`, convert the `product_name` to lowercase. Then use `regexp_replace` to remove the year part (e.g., "-2025") from the product name. **[3 marks]**

**(d)** Using `rlike`, filter to show only products from year **2025** (i.e., product name contains "2025"). **[3 marks]**
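
Before wiring patterns into `regexp_extract`, it can help to try them on a single string with Python's `re` module. The sketch below does that for the first row. The patterns are one possible choice, and although Spark uses Java regular expressions, these simple constructs should behave the same way in both engines.

```python
import re

raw_text = "Product: LAPTOP-2025 | Price: $999.99 | Stock: 50"

# (a) Capture the token after "Product:" up to the next space or pipe.
product_name = re.search(r"Product:\s+([^\s|]+)", raw_text).group(1)

# (b) Capture the digits after the literal dollar sign, then cast to float.
price = float(re.search(r"Price:\s+\$(\d+\.\d+)", raw_text).group(1))

# (c) Lowercase, then strip a trailing "-<4 digits>" year suffix.
clean_name = re.sub(r"-\d{4}$", "", product_name.lower())

# (d) rlike analogue: does the name contain "2025" anywhere?
is_2025 = re.search(r"2025", product_name) is not None

print(product_name, price, clean_name, is_2025)
```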


In [54]:
# Q7 — Write your code here
from pyspark.sql.functions import regexp_extract, regexp_replace, lower, col
from pyspark.sql.types import DoubleType

# (a) Extract product name
product_name = products_df.withColumn(
    "product_name",
    regexp_extract(col("raw_text"),r'Product:\s+([^\s|]+)',idx=1)
)
# product_name.show()
# (b) Extract price and cast to Double
product_price = products_df.withColumn(
    "price",
    regexp_extract(col("raw_text"),r'Price:\s+\$([^\s|]+)',idx=1).cast(DoubleType())
)
# product_price.show()

# (c) Lowercase product name and remove year
lower_cased = product_name.withColumn(
    "product_name",
    lower(regexp_replace("product_name",r"\-\d+",""))
)
# lower_cased.show()
# (d) Filter products from 2025
products_2025 = product_name.filter(col("product_name").rlike(r'2025'))
products_2025.show()


+---+--------------------+-------------+
| id|            raw_text| product_name|
+---+--------------------+-------------+
|  1|Product: LAPTOP-2...|  LAPTOP-2025|
|  3|Product: KEYBOARD...|KEYBOARD-2025|
|  5|Product: HEADSET-...| HEADSET-2025|
+---+--------------------+-------------+



## Question 8 — Shingling & Jaccard Similarity [12 marks]

Consider these three short documents:

```
Doc A: "the cat sat on the mat"
Doc B: "the cat sat on the hat"
Doc C: "the dog ran in the park"
```


**(a)** Write a Python function `word_shingles(text, n)` that returns a **set** of word n-grams. Apply it to all three documents with `n=2`. Print the shingle sets for each document. **[3 marks]**

**(b)** Write a function `jaccard_similarity(set_a, set_b)` that computes Jaccard similarity. Calculate and print the similarity between: (A, B), (A, C), and (B, C). **[3 marks]**

**(c)** Based on your results, which pair of documents is **most similar**? Which is **least similar**? **[2 marks]**

**(d)** Write a simple `MinHash` function that takes a set and `num_hashes` parameter, and returns a signature (list of minimum hash values). Use Python's built-in `hash()` function with different salts. Compare the estimated Jaccard (from signatures) with the true Jaccard for documents A and B using `num_hashes=50`. **[4 marks]**
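
The four parts can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the helper names `minhash_signature` and `estimate_jaccard` are illustrative, the salting scheme `hash((salt, item))` is just one way to derive several hash functions from the built-in `hash()`, and because Python randomises string hashes per process, the MinHash estimate changes between runs unless `PYTHONHASHSEED` is fixed.

```python
def word_shingles(text, n):
    """Return the set of word n-grams (as space-joined strings) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(set_a, set_b):
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def minhash_signature(items, num_hashes):
    """For each salt, keep the minimum salted hash over a non-empty set of items."""
    return [min(hash((salt, item)) for item in items) for salt in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimums agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


docs = {
    "A": "the cat sat on the mat",
    "B": "the cat sat on the hat",
    "C": "the dog ran in the park",
}

# (a) Bigram shingle sets for each document.
shingles = {name: word_shingles(text, 2) for name, text in docs.items()}
for name, s in shingles.items():
    print(name, sorted(s))

# (b) True Jaccard similarity for every pair.
sim_ab = jaccard_similarity(shingles["A"], shingles["B"])
sim_ac = jaccard_similarity(shingles["A"], shingles["C"])
sim_bc = jaccard_similarity(shingles["B"], shingles["C"])
print(f"A-B: {sim_ab:.3f}  A-C: {sim_ac:.3f}  B-C: {sim_bc:.3f}")

# (d) MinHash estimate for A and B with 50 salted hash functions.
sig_a = minhash_signature(shingles["A"], 50)
sig_b = minhash_signature(shingles["B"], 50)
est_ab = estimate_jaccard(sig_a, sig_b)
print(f"Estimated A-B: {est_ab:.3f} (true: {sim_ab:.3f})")
```

With `n=2`, A and B share four of six distinct shingles (about 0.67), while C shares none with either document, so (A, B) is the most similar pair and (A, C) and (B, C) tie as least similar.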


In [None]:
# Q8 — Write your code here

# Documents
doc_a = "the cat sat on the mat"
doc_b = "the cat sat on the hat"
doc_c = "the dog ran in the park"

# (a) Word shingles function, apply with n=2


# (b) Jaccard similarity function and compute for all pairs


# (c) Most similar and least similar pairs (print as comments or text)


# (d) Simple MinHash function and comparison



## Question 9 — LSH with Spark ML [11 marks]

Using the same three documents from Question 8, build an LSH pipeline with Spark ML.


**(a)** Create a Spark DataFrame with columns `id` and `text` for the three documents. Use `Tokenizer` to split into words, then `CountVectorizer` with `binary=True` to create feature vectors. Show the schema. **[3 marks]**

**(b)** Fit a `MinHashLSH` model with `numHashTables=3`. Transform the data and show the hash values. **[3 marks]**

**(c)** Use `approxSimilarityJoin` with threshold `0.6` to find similar document pairs. Display the results. **[3 marks]**

**(d)** Use `approxNearestNeighbors` to find the 2 nearest neighbours of document A. Print their IDs and distances. **[2 marks]**
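
Spark's `MinHashLSH` works with Jaccard distance (one minus Jaccard similarity) over the non-zero entries of each feature vector, so with `binary=True` each document behaves like its set of distinct tokens. A plain-Python check of the exact distances, sketched below, indicates what parts (c) and (d) should roughly report. The LSH results are approximate, so a qualifying pair can occasionally be missed, and the helper name `jaccard_distance` is illustrative.

```python
from itertools import combinations

docs = {
    "A": "the cat sat on the mat",
    "B": "the cat sat on the hat",
    "C": "the dog ran in the park",
}

# Distinct lowercase tokens per document, mirroring Tokenizer + binary CountVectorizer.
token_sets = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two non-empty sets."""
    return 1.0 - len(a & b) / len(a | b)


# Exact distance for every unordered pair of documents.
distances = {
    (x, y): jaccard_distance(token_sets[x], token_sets[y])
    for x, y in combinations(sorted(token_sets), 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))

# Pairs an exact similarity join would keep at threshold 0.6.
threshold = 0.6
similar_pairs = [pair for pair, d in distances.items() if d < threshold]
print(similar_pairs)
```

Only (A, B) falls under the 0.6 threshold (distance about 0.33); both pairs involving C sit at about 0.89.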


In [None]:
# Q9 — Write your code here
from pyspark.ml.feature import Tokenizer, CountVectorizer, MinHashLSH

# (a) Create DataFrame, tokenize, vectorize


# (b) Fit MinHashLSH and show hashes


# (c) approxSimilarityJoin with threshold 0.6


# (d) approxNearestNeighbors for document A



---
## Cleanup


In [None]:
# Stop Spark session
spark.stop()
print("Spark session stopped. Practice quiz complete!")


---
### End of Practice Quiz

**Review your answers and check:**
- [ ] All code cells execute without errors
- [ ] Outputs match what you expect
- [ ] You understand the concepts tested

---
*SCC.454: Large Scale Platforms for AI and Data Analysis — Lancaster University*
