<a href="https://colab.research.google.com/github/guilhermelaviola/IntegrativePracticeInDataScience/blob/main/Class05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Big Data Analytics**
Big Data refers to datasets whose size and complexity exceed what conventional tools can handle. It is commonly characterized by the 5Vs: Volume, Variety, Velocity, Veracity, and Value. Two widely used frameworks are Apache Spark, whose in-memory engine suits fast, iterative, and near-real-time processing, and Apache Hadoop, whose disk-based MapReduce model suits batch analysis of large historical datasets. A Big Data analysis involves data collection, preparation, analysis, visualization, and storage, with the goal of revealing patterns that inform decisions. Forecasting applies statistical techniques and machine learning to historical data in order to predict future values. Hybrid data analysis combines structured data (such as sales tables) with unstructured data (such as customer reviews), using techniques like text mining to turn free text into usable features. Professionals who want to put data to work need a working command of these tools and methods.

In [None]:
# Importing all the necessary libraries and resources:
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, VectorAssembler
from pyspark.sql.functions import count
from pyspark.ml.regression import LinearRegression

## **Example: Sales Forecasting Using Machine Learning**

In [None]:
# Initializing Spark session:
spark = SparkSession.builder \
    .appName('BigDataHybridExample') \
    .getOrCreate()
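
The loading cells in this notebook expect `sales_history.csv` and `customer_reviews.txt` in the working directory. When those files are not available, a minimal sketch like the one below can create small placeholder versions so the rest of the notebook has something to read. Everything here is an assumption made for illustration: the helper name `write_sample_inputs`, the `date` column, and all row values are made up; only the `units_sold` and `price` column names come from the notebook itself.

```python
import csv
from pathlib import Path

# Hypothetical placeholder rows. The 'units_sold' and 'price' columns mirror the
# names used later in this notebook; 'date' and every value are invented.
SAMPLE_SALES = [
    {"date": "2024-01-01", "units_sold": 120, "price": 9.99},
    {"date": "2024-01-02", "units_sold": 95, "price": 11.49},
    {"date": "2024-01-03", "units_sold": 130, "price": 9.49},
    {"date": "2024-01-04", "units_sold": 80, "price": 12.99},
]

# Hypothetical review lines, one review per line of the text file.
SAMPLE_REVIEWS = [
    "Great product and fast delivery",
    "The price is a bit high but the quality is good",
    "Delivery was slow",
]


def write_sample_inputs(directory="."):
    """Write small placeholder input files and return their paths."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)

    # Structured data: a CSV with a header row, as spark.read.csv(header=True) expects.
    sales_path = target / "sales_history.csv"
    with sales_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["date", "units_sold", "price"])
        writer.writeheader()
        writer.writerows(SAMPLE_SALES)

    # Unstructured data: plain text, one review per line.
    reviews_path = target / "customer_reviews.txt"
    reviews_path.write_text("\n".join(SAMPLE_REVIEWS) + "\n", encoding="utf-8")

    return sales_path, reviews_path
```

Calling `write_sample_inputs()` before the loading cells writes both files into the current directory. Four rows are far too few for a meaningful model; the point is only to let the pipeline run end to end.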

In [None]:
# Loading Structured Big Data:
sales_df = spark.read.csv('sales_history.csv', header=True, inferSchema=True)
sales_df.show(5)

Output when `sales_history.csv` has not been uploaded to the working directory:
AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/content/sales_history.csv.

In [None]:
# Loading Unstructured Big Data:
reviews_df = spark.read.text('customer_reviews.txt')
reviews_df.show(5)

In [None]:
# Cleaning and Preparing the Text Data (Text Mining):
tokenizer = Tokenizer(inputCol='value', outputCol='words')
words_df = tokenizer.transform(reviews_df)

remover = StopWordsRemover(inputCol='words', outputCol='filtered')
filtered_df = remover.transform(words_df)

vectorizer = CountVectorizer(inputCol='filtered', outputCol='features')
text_model = vectorizer.fit(filtered_df)
vectorized_df = text_model.transform(filtered_df)
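
To make the three text-mining steps above concrete, here is a plain-Python sketch of the same idea: tokenize, drop stop words, then count terms against a vocabulary. It is an illustration only, not Spark's implementation. The stop-word list, the sample reviews, and the alphabetical tie-break are assumptions, and Spark returns sparse vectors where this sketch uses plain lists.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; Spark's default list is much longer).
STOP_WORDS = {"the", "a", "and", "is", "was", "it", "this"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def build_vocabulary(documents):
    """Order terms by total frequency, most frequent first (ties broken alphabetically)."""
    totals = Counter(token for doc in documents for token in doc)
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked]


def vectorize(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Made-up reviews for illustration.
reviews = [
    "The delivery was fast",
    "Fast delivery and great price",
    "The price is great",
]

filtered = [remove_stop_words(tokenize(review)) for review in reviews]
vocabulary = build_vocabulary(filtered)
vectors = [vectorize(doc, vocabulary) for doc in filtered]

print(vocabulary)  # ['delivery', 'fast', 'great', 'price']
print(vectors)     # [[1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1]]
```

Each review becomes a fixed-length row of counts, which is the structured form a downstream model can consume.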

In [None]:
# Aggregating Text Features (Convert to Structured Form):
# Example: Counting number of reviews (acting as proxy feature):
text_feature_df = filtered_df.agg(count('*').alias('review_count'))
text_feature_df.show()

In [None]:
# Combining Structured + Unstructured Data (Hybrid Analysis):
# Adding review count to each row of sales_df for demonstration:
combined_df = sales_df.crossJoin(text_feature_df)
combined_df.show(5)
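
A cross join pairs every row of one table with every row of the other. Because the aggregated text table has exactly one row, the result has the same number of rows as the sales table, each carrying the same `review_count`. The plain-Python sketch below mirrors that behavior with made-up rows; it is an illustration of the join semantics, not of Spark internals.

```python
from itertools import product

# Made-up sales rows standing in for sales_df.
sales_rows = [
    {"units_sold": 120, "price": 9.99},
    {"units_sold": 95, "price": 11.49},
]

# A single aggregated row standing in for text_feature_df.
text_feature_rows = [{"review_count": 3}]

# Cartesian product: every sales row is merged with every text-feature row.
combined = [
    {**sale, **feature}
    for sale, feature in product(sales_rows, text_feature_rows)
]

for row in combined:
    print(row)
```

Since `review_count` ends up identical in every row, it cannot help a model tell one row from another. In a fuller analysis the text features would more likely be aggregated per product or per period and joined on that key.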

In [None]:
# Simple Forecasting Using Machine Learning:
# Using a basic regression model to predict units sold from historical features.
# The label ('units_sold') is left out of the inputs: including it would leak
# the target into the features and make the fit trivially perfect.
assembler = VectorAssembler(
    inputCols=['price', 'review_count'],
    outputCol='features'
)

training_data = assembler.transform(combined_df)

lr = LinearRegression(featuresCol='features', labelCol='units_sold')
model = lr.fit(training_data)

print('Coefficients:', model.coefficients)
print('Intercept:', model.intercept)
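
The printed coefficients hold one weight per input column, and the intercept is the predicted value when every input is zero. For a single feature and no regularization, the least-squares fit has a closed form, sketched below in plain Python. The function name `fit_simple_ols` and the price and unit figures are made up for illustration; this is not how Spark computes its solution internally.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Spread of x around its mean, and co-movement of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: units sold fall by 5 for every 1.0 increase in price.
prices = [10.0, 12.0, 14.0, 16.0]
units = [100.0, 90.0, 80.0, 70.0]

slope, intercept = fit_simple_ols(prices, units)
print("Coefficient:", slope)     # -5.0
print("Intercept:", intercept)   # 150.0
```

Read this way, a negative coefficient on `price` would mean that higher prices go with lower predicted sales, holding the other inputs fixed.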

In [None]:
# Predicting Sales (in-sample: these are the training rows, not unseen future periods):
predictions = model.transform(training_data)
predictions.select('units_sold', 'prediction').show(10)
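
The predictions above are computed on the same rows used to fit the model, so they say little about how well it would forecast unseen periods. A common remedy is to hold out the most recent rows and score the predictions on them with a metric such as root mean squared error (RMSE). The plain-Python sketch below shows both pieces. The helper names `chronological_split` and `rmse` and all numbers are assumptions made for illustration.

```python
import math


def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows into an earlier training part and a later test part."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up, time-ordered row identifiers: the last 20% are held out for testing.
rows = list(range(10))
train_rows, test_rows = chronological_split(rows)
print(train_rows)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test_rows)   # [8, 9]

# Made-up actual and predicted unit sales for a held-out period.
actual_units = [100.0, 90.0, 80.0, 70.0]
predicted_units = [98.0, 93.0, 80.0, 74.0]
print(rmse(actual_units, predicted_units))
```

Splitting by time rather than at random keeps future rows out of the training set, which matches how a forecast would actually be used.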