
# **Big Data Analytics**
Big Data refers to datasets whose scale and complexity call for specialized analysis tools. It is commonly characterized by the 5 Vs: Volume, Variety, Velocity, Veracity, and Value. Two key frameworks are Apache Spark, often favored for in-memory and near-real-time processing, and Apache Hadoop, often preferred for batch analysis of historical data. A Big Data analysis typically involves data collection, preparation, analysis, visualization, and storage, with the goal of revealing patterns that drive decision-making. Forecasting in this domain applies statistical techniques and machine learning to historical data in order to predict future values. Hybrid data analysis combines structured data (such as tables) with unstructured data (such as free text), using techniques like text mining to turn the unstructured part into usable features. Familiarity with these tools and methods helps professionals put data to practical use.

In [40]:
# Importing all the necessary libraries and resources:
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, VectorAssembler
from pyspark.sql.functions import count
from pyspark.ml.regression import LinearRegression
from pyspark.sql import Row
import urllib.request
import os

## **Example: Sales Forecasting Using Machine Learning**

In [41]:
# Initializing Spark session:
spark = SparkSession.builder \
    .appName('BigDataHybridExample') \
    .getOrCreate()

In [42]:
dummy_structured_data = [
    Row(id=1, feature_a=10, feature_b=100.5, category='A'),
    Row(id=2, feature_a=15, feature_b=101.2, category='B'),
    Row(id=3, feature_a=12, feature_b=99.8, category='A'),
    Row(id=4, feature_a=20, feature_b=105.1, category='C'),
    Row(id=5, feature_a=8, feature_b=98.0, category='B')
]

dummy_structured_df = spark.createDataFrame(dummy_structured_data)

# Loading Structured Big Data (using the dummy DataFrame):
structured_df = dummy_structured_df
structured_df.show(5)

+---+---------+---------+--------+
| id|feature_a|feature_b|category|
+---+---------+---------+--------+
|  1|       10|    100.5|       A|
|  2|       15|    101.2|       B|
|  3|       12|     99.8|       A|
|  4|       20|    105.1|       C|
|  5|        8|     98.0|       B|
+---+---------+---------+--------+



In [43]:
# Loading Unstructured Big Data:
unstructured_url = 'https://raw.githubusercontent.com/apache/spark/master/README.md' # Example raw text file

local_unstructured_path = '/tmp/unstructured_data.txt' # Temporary local path

# Downloading the file locally first:
urllib.request.urlretrieve(unstructured_url, local_unstructured_path)

# Reading the local file with Spark, and naming it 'reviews_df' as expected by subsequent cells:
reviews_df = spark.read.text(local_unstructured_path)
reviews_df.show(5)

# Cleaning up the local file after use if desired:
# os.remove(local_unstructured_path)

+--------------------+
|               value|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
+--------------------+
only showing top 5 rows


In [44]:
# Cleaning and Preparing the Text Data (Text Mining):
tokenizer = Tokenizer(inputCol='value', outputCol='words')
words_df = tokenizer.transform(reviews_df)

remover = StopWordsRemover(inputCol='words', outputCol='filtered')
filtered_df = remover.transform(words_df)

vectorizer = CountVectorizer(inputCol='filtered', outputCol='features')
text_model = vectorizer.fit(filtered_df)
vectorized_df = text_model.transform(filtered_df)
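
To make the three stages above easier to follow, here is a minimal plain-Python sketch of roughly what tokenizing, stop-word removal, and count vectorization do. It is an approximation, not Spark's implementation: the `STOP_WORDS` set is a hypothetical, deliberately tiny list, the helper names are made up for illustration, and the vectors are dense lists, whereas Spark's `CountVectorizer` produces sparse vectors.

```python
from collections import Counter

# Hypothetical, very small stop-word list (Spark's default list is much longer).
STOP_WORDS = {"a", "an", "the", "is", "for", "and", "of", "in", "it"}


def tokenize(line):
    """Lowercase the line and split it on whitespace."""
    return line.lower().split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def count_vectorize(docs):
    """Build a vocabulary ordered by corpus frequency and count terms per document."""
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    vocab = [term for term, _ in corpus_counts.most_common()]
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok in doc:
            vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors


# Two illustrative lines of text (not read from the downloaded file).
lines = [
    "Spark is a unified analytics engine",
    "Spark supports general computation graphs",
]
filtered = [remove_stop_words(tokenize(line)) for line in lines]
vocab, vectors = count_vectorize(filtered)

print(vocab)
print(vectors)
```

Each position in a vector corresponds to one vocabulary term, so a document becomes a fixed-length numeric row that downstream models can consume.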

In [45]:
# Aggregating Text Features (Convert to Structured Form):
# Example: Counting the number of text lines (each line stands in for a review) as a proxy feature:
text_feature_df = filtered_df.agg(count('*').alias('review_count'))
text_feature_df.show()

+------------+
|review_count|
+------------+
|         180|
+------------+
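
Because `count('*')` counts every row, blank lines in the text file are included in the total. The plain-Python sketch below shows, under that reading, how a few alternative proxy features could be derived from the same lines. The function name `text_features` and the sample lines are illustrative assumptions, not part of the notebook.

```python
def text_features(lines):
    """Summarize a list of text lines as a small dictionary of numeric features."""
    return {
        # Every row, blank or not (what a plain row count gives).
        "row_count": len(lines),
        # Rows that contain at least one non-whitespace character.
        "non_empty_count": sum(1 for line in lines if line.strip()),
        # Total number of whitespace-separated tokens.
        "token_count": sum(len(line.split()) for line in lines),
    }


# Illustrative sample, loosely modeled on the first lines shown earlier.
sample_lines = ["# Apache Spark", "", "Spark is a unified analytics engine", ""]
features = text_features(sample_lines)
print(features)
```

Any of these counts could play the role of `review_count`; which one is appropriate depends on what a "review" is taken to be in the real data.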



In [46]:
# Combining Structured + Unstructured Data (Hybrid Analysis):
# Creating a dummy sales_df for demonstration, then attaching the single-row
# review count to every sales row with a cross join:

sales_data = [
    Row(units_sold=100, price=10.5),
    Row(units_sold=120, price=11.0),
    Row(units_sold=90, price=9.8),
    Row(units_sold=150, price=12.0),
    Row(units_sold=80, price=9.5)
]
sales_df = spark.createDataFrame(sales_data)

combined_df = sales_df.crossJoin(text_feature_df)
combined_df.show(5)

+----------+-----+------------+
|units_sold|price|review_count|
+----------+-----+------------+
|       100| 10.5|         180|
|       120| 11.0|         180|
|        90|  9.8|         180|
|       150| 12.0|         180|
|        80|  9.5|         180|
+----------+-----+------------+
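
The cross join works here because the aggregated text table has exactly one row, so every sales row is paired with that single value. A minimal plain-Python sketch of the same idea follows; the `cross_join` helper and the dictionary-based rows are illustrative assumptions rather than Spark's API.

```python
from itertools import product


def cross_join(left_rows, right_rows):
    """Pair every left row with every right row and merge their columns."""
    return [{**left, **right} for left, right in product(left_rows, right_rows)]


sales_rows = [
    {"units_sold": 100, "price": 10.5},
    {"units_sold": 120, "price": 11.0},
]
text_feature_rows = [{"review_count": 180}]  # single-row aggregate

combined = cross_join(sales_rows, text_feature_rows)
for row in combined:
    print(row)

# With more than one row on the right side, the result grows multiplicatively.
doubled = cross_join(sales_rows, [{"review_count": 180}, {"review_count": 200}])
print(len(doubled))
```

The result has `len(left) * len(right)` rows, which is why a cross join is only a reasonable way to attach a feature when one side is a single-row summary.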



In [47]:
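# Simple Forecasting Using Machine Learning:
# Fitting a basic linear regression to illustrate the API.
# Caveat: 'units_sold' is both an input feature and the label, so the model
# learns the identity mapping (a coefficient of about 1 on units_sold), and the
# constant 'review_count' column carries no signal (coefficient 0).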
assembler = VectorAssembler(
    inputCols=['units_sold', 'price', 'review_count'],
    outputCol='features'
)

training_data = assembler.transform(combined_df)

lr = LinearRegression(featuresCol='features', labelCol='units_sold')
model = lr.fit(training_data)

print('Coefficients:', model.coefficients)
print('Intercept:', model.intercept)

Coefficients: [1.0000000003467537,-9.870161948491333e-09,0.0]
Intercept: 6.67795025895803e-08
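
The coefficient of about 1 on `units_sold` arises because the label is also one of the input features. As a hedged illustration of a fit without that leakage, the plain-Python sketch below regresses `units_sold` on `price` alone using the closed-form least-squares solution. The helper `fit_simple_ols` is a hypothetical name, the data are the five dummy sales rows, and the constant `review_count` is left out because a constant column cannot explain any variation. In the PySpark cell, the analogous change would be to drop `'units_sold'` from `inputCols`.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The five dummy sales rows from the notebook.
prices = [10.5, 11.0, 9.8, 12.0, 9.5]
units_sold = [100, 120, 90, 150, 80]

slope, intercept = fit_simple_ols(prices, units_sold)
fitted = [intercept + slope * p for p in prices]
residuals = [u - f for u, f in zip(units_sold, fitted)]

print("slope:", slope)
print("intercept:", intercept)
print("residuals:", residuals)
```

Unlike the fit above, the residuals here are not all near zero, which is what one would expect from a model that has to infer sales from price rather than read them from its own input.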


In [48]:
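# Generating Predictions (here on the training rows themselves, for illustration;
# a real forecast would score data the model has not seen):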
predictions = model.transform(training_data)
predictions.select('units_sold', 'prediction').show(10)

+----------+------------------+
|units_sold|        prediction|
+----------+------------------+
|       100| 99.99999999781818|
|       120|119.99999999981817|
|        90| 90.00000000125975|
|       150|150.00000000035064|
|        80| 80.00000000075326|
+----------+------------------+
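
One common way to summarize how close predictions are to the actual values is the root mean squared error (RMSE). The sketch below computes it in plain Python from the values printed in the table above; the `rmse` helper is an illustrative assumption, not a call from the notebook. The near-zero result reflects that these predictions were made on the training rows with the label among the features, so it should not be read as forecasting accuracy.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Values copied from the prediction table above.
actual = [100, 120, 90, 150, 80]
predicted = [
    99.99999999781818,
    119.99999999981817,
    90.00000000125975,
    150.00000000035064,
    80.00000000075326,
]

in_sample_rmse = rmse(actual, predicted)
print("In-sample RMSE:", in_sample_rmse)
```

For a meaningful estimate, the same function would be applied to rows held out from training.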

