<a href="https://colab.research.google.com/github/guilhermelaviola/IntegrativePracticeInDataScience/blob/main/Class04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Science at Scale**
Data science at scale means analyzing large datasets, often too large for a single machine, to uncover patterns and insights for applications such as process optimization and trend discovery. Doing so requires robust infrastructure: cluster systems that spread work across many machines for parallel processing, and vector architectures that speed up the vector and matrix operations behind most analyses. Big Data refers to large, complex datasets, typically processed in batches, whereas Fast Data refers to data that must be analyzed in real time as it arrives. Tools such as Hadoop and Spark are widely used for Big Data, while Fast Data applications include fraud detection and social media analysis. Cluster scalability is crucial for large-scale analysis, but it does not remove challenges such as data quality and interpretability, which still have to be addressed. Machine learning, data mining, and visualization are the main techniques for extracting insights, so proficiency in them is vital for professionals in the field.
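
The parallel processing idea behind cluster systems can be sketched in plain Python: split the data into partitions, aggregate each partition independently, then merge the partial results. This is only an illustration under stated assumptions. The `rows` records and the helper names `count_partition` and `count_by_user` are hypothetical, and threads on one machine stand in for the worker nodes of a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition(rows):
    """Count requests per user within one partition (the 'map' side)."""
    return Counter(user for user, _ in rows)


def count_by_user(rows, num_partitions=4):
    """Split rows into partitions, count each one concurrently, then merge."""
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge the partial counts (the 'reduce' side)
    return dict(total)


# Hypothetical (user, value) records, for illustration only
rows = [('bob', 1), ('sue', 2), ('bob', 3), ('cheryl', 4), ('sue', 5), ('bob', 6)]
counts = count_by_user(rows)
print(counts)
```

Because each partition is counted without looking at the others, the same pattern keeps working when partitions live on different machines; only the merge step needs to see all partial results.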

In [19]:
# Importing all the necessary libraries and resources:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, stddev

## **Example: Scaling with Spark**
In the following example, we start Spark and load a dataset (a small sample file that stands in for a large one). We then run basic aggregations, flag 'slow' requests as anomalies, and save the results.

In [20]:
# Starting Spark:
spark = SparkSession.builder \
    .appName('LargeScaleLogAnalysis') \
    .getOrCreate()

In [21]:
# Loading the dataset (a small sample that stands in for a large one):
import requests

url = 'https://raw.githubusercontent.com/sidsriv/Introduction-to-Data-Science-in-python/refs/heads/master/log.csv'
local_file_path = 'log.csv' # Define a local path to save the file

# Download the file
print(f"Downloading file from {url} to {local_file_path}...")
response = requests.get(url)
response.raise_for_status() # Raise an exception for HTTP errors (4xx or 5xx)
with open(local_file_path, 'wb') as f:
    f.write(response.content)
print("Download complete.")

# Reading the local file with Spark. At real scale, the path would point to
# distributed storage such as HDFS or S3 instead of a local download:
df = spark.read.csv(local_file_path, header=True, inferSchema=True)

Downloading file from https://raw.githubusercontent.com/sidsriv/Introduction-to-Data-Science-in-python/refs/heads/master/log.csv to log.csv...
Download complete.


In [22]:
# Counting how many requests each user made:
requests_per_user = (
    df.groupBy('user')
      .agg(count('*').alias('num_requests'))
)

requests_per_user.show(10)

+------+------------+
|  user|num_requests|
+------+------------+
|cheryl|          11|
|   sue|          11|
|   bob|          11|
+------+------------+



In [23]:
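# Simple Fast Data-style analytics: flag rows whose `time` is more than three
# standard deviations above the mean (this treats `time` as a response time):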
stats = df.select(
    avg(col('time')).alias('avg_rt'),
    stddev(col('time')).alias('std_rt')
).collect()[0]

threshold = stats['avg_rt'] + 3 * stats['std_rt']

anomalies = df.filter(col('time') > threshold)

print('Anomalies detected:')
anomalies.show(10)

Anomalies detected:
+----+----+-----+-----------------+------+------+
|time|user|video|playback position|paused|volume|
+----+----+-----+-----------------+------+------+
+----+----+-----+-----------------+------+------+
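
The cell above computes the mean and standard deviation over the whole dataset before filtering, which is a batch approach. In a Fast Data setting, values arrive one at a time, so the statistics have to be updated incrementally. A minimal sketch of that idea uses Welford's online algorithm, which is a swapped-in technique and not what the Spark code does. The class `RunningStats`, the function `detect_anomalies`, the `warmup` parameter, and the sample response times are all assumptions made for illustration.

```python
import math


class RunningStats:
    """Welford's online algorithm for mean and sample standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; undefined for fewer than two values
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else float('nan')


def detect_anomalies(stream, k=3.0, warmup=5):
    """Yield values above mean + k * std of the values seen before them."""
    stats = RunningStats()
    for x in stream:
        # Only start flagging once enough values have been observed
        if stats.n >= warmup and x > stats.mean + k * stats.std():
            yield x
        stats.update(x)


# Hypothetical response times in milliseconds, for illustration only
response_times = [100, 102, 98, 101, 99, 103, 500, 100, 97]
flagged = list(detect_anomalies(response_times))
print(flagged)
```

Each value is compared against statistics built only from earlier values, so the detector never needs the full dataset in memory. The trade-off is that early thresholds rest on few observations, which is why the sketch waits for a short warm-up period before flagging anything.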



In [24]:
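# Saving the results as CSV (Spark writes each output as a directory of part files):
requests_per_user.write.mode('overwrite').csv('output/requests_per_user')
anomalies.write.mode('overwrite').csv('output/anomalies')

# Stopping the Spark session to release its resources:
spark.stop()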