# **Data Science at Scale**
Data science at scale means analyzing large datasets to uncover patterns and insights, for purposes such as process optimization and trend discovery. The main challenge is infrastructure: cluster systems distribute work across many machines for parallel processing, while vector architectures accelerate the vector and matrix operations that dominate computation on massive data. Scalable cluster systems are crucial for large-scale analysis, which must also deal with issues such as data quality and interpretability.

Big Data refers to large, complex datasets, whereas Fast Data refers to data that must be analyzed in real time and therefore requires rapid processing. Tools such as Hadoop and Spark are widely used to handle Big Data, while Fast Data applications include fraud detection and social media analysis. Machine learning, data mining, and visualization are the main techniques for extracting insights, so proficiency in them is vital for professionals in the field.

In [1]:
# Importing all the necessary libraries and resources:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, stddev

## **Example: Scaling with Spark**
In the following example, we start Spark and load a large dataset. We then run a basic aggregation at scale and flag unusually slow requests as anomalies. Finally, we save the results.

In [2]:
# Starting Spark:
spark = SparkSession.builder \
    .appName('LargeScaleLogAnalysis') \
    .getOrCreate()
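
The loading step in this example expects a file named `logs.csv` with the columns `user_id`, `timestamp`, and `response_time`. For a local trial run, a small synthetic file can stand in for real logs. The sketch below is only an illustration: the helper name `write_synthetic_logs`, the row count, the user ID format, and the response-time distribution are all assumptions, not properties of any real dataset.

```python
import csv
import random


def write_synthetic_logs(path='logs.csv', n_rows=1000, n_users=20, seed=42):
    """Write a small, made-up request log for trying out the example."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['user_id', 'timestamp', 'response_time'])
        for i in range(n_rows):
            user_id = f'user_{rng.randrange(n_users):03d}'
            timestamp = 1_700_000_000 + i  # hypothetical Unix timestamps
            # Assumed distribution: mostly ~200 ms, about 1% very slow requests.
            if rng.random() < 0.01:
                response_time = rng.uniform(3000, 5000)
            else:
                response_time = rng.gauss(200, 30)
            writer.writerow([user_id, timestamp, round(max(response_time, 1.0), 2)])
    return path


write_synthetic_logs()
```

With real data, this step is unnecessary; the path passed to `spark.read.csv` would point to the actual log location instead.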

In [3]:
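# Loading a large dataset.
# This could be a huge file in HDFS, S3, etc.
# inferSchema=True makes Spark read response_time as a number rather than
# a string, at the cost of an extra pass over the data:
df = spark.read.csv('logs.csv', header=True, inferSchema=True)

# Expected columns:
# user_id, timestamp, response_time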

In [None]:
# Counting how many requests each user made:
requests_per_user = (
    df.groupBy('user_id')
      .agg(count('*').alias('num_requests'))
)

requests_per_user.show(10)
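
Conceptually, a grouped count like this parallelizes well: each worker counts the rows in its own partition, and the partial counts are then merged. The plain-Python sketch below illustrates that idea only. It is a simplification, not a description of how Spark is implemented internally, and the partitions and function names are hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: each list stands in for the user_id values
# held by one worker in the cluster.
partitions = [
    ['u1', 'u2', 'u1'],
    ['u3', 'u1'],
    ['u2', 'u2', 'u3'],
]


def count_partition(user_ids):
    """Map step: count requests per user within a single partition."""
    return Counter(user_ids)


def merge_counts(left, right):
    """Reduce step: combine two partial results by adding their counts."""
    return left + right


partial_counts = [count_partition(p) for p in partitions]
num_requests = reduce(merge_counts, partial_counts, Counter())
print(dict(num_requests))
```

Because addition is associative and commutative, the partial counts can be merged in any order, which is what lets the work be spread across machines.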

In [None]:
# Simple Fast Data-style analytics:
stats = df.select(
    avg(col('response_time')).alias('avg_rt'),
    stddev(col('response_time')).alias('std_rt')
).collect()[0]

# A request is 'slow' if it exceeds the mean by more than 3 standard deviations:
threshold = stats['avg_rt'] + 3 * stats['std_rt']

anomalies = df.filter(col('response_time') > threshold)

print('Anomalies detected:')
anomalies.show(10)
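
The cell above computes the mean and standard deviation in a single batch pass over the whole dataset. In a true Fast Data setting, where requests arrive continuously, the statistics would have to be updated incrementally. The sketch below shows one way to do that in plain Python using Welford's online algorithm, a technique brought in here for illustration rather than taken from this notebook. The names `RunningStats` and `flag_slow`, the `warmup` parameter, and the sample values are assumptions. Like Spark's `stddev`, it uses the sample standard deviation.

```python
import math


class RunningStats:
    """Welford's online algorithm: running mean and sample std in one pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; defined as 0.0 until two values are seen.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_slow(stream, k=3.0, warmup=5):
    """Yield values above mean + k * std of the values seen before them."""
    stats = RunningStats()
    for x in stream:
        # Only flag once enough values have been seen to trust the estimate.
        if stats.n >= warmup and x > stats.mean + k * stats.std:
            yield x
        stats.update(x)


# Made-up response times in milliseconds, with one obvious outlier.
response_times = [100, 102, 98, 101, 99, 103, 100, 950, 97, 101]
slow = list(flag_slow(response_times))
print(slow)
```

One design choice to be aware of: this sketch folds flagged values back into the running statistics, so a burst of slow requests raises the threshold for later ones. Whether to exclude outliers from the update depends on the application.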

In [None]:
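# Saving the results. Spark writes each path as a directory of part files;
# mode='overwrite' lets the cell be rerun without failing on existing output:
requests_per_user.write.csv('output/requests_per_user', header=True, mode='overwrite')
anomalies.write.csv('output/anomalies', header=True, mode='overwrite')

# Stopping the session to release its resources:
spark.stop()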