<a href="https://colab.research.google.com/github/guilhermelaviola/BusinessIntelligenceAndBigDataArchitectureWithAppliedDataScience/blob/main/Class13.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Big Data**
Big Data is reshaping how organizations make decisions. Mobile devices, IoT, and digital platforms generate data at massive scale, and a data-driven culture has become essential for optimizing operations, personalizing experiences, and uncovering new opportunities. The five Vs (Volume, Velocity, Variety, Veracity, and Value) help explain both the potential and the challenges of Big Data, which requires skilled professionals, strong data literacy, and careful interpretation to avoid costly mistakes. Specialized tools and technologies enable efficient storage, processing, and analysis of large datasets: Python’s Pandas, the Parquet file format, distributed search engines such as Elasticsearch, and open-source frameworks including Hadoop, Spark, and Kafka. Cloud computing, edge computing, and IPv6 further support scalability, real-time analytics, and the growth in connected devices, while the open-source movement continues to drive innovation and accessibility. Mastering Big Data concepts, tools, and trends is therefore critical for professionals and organizations that want to succeed in an increasingly data-driven world.

In [3]:
# Importing all the necessary resources:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import uuid

## **Example: Big Data Workflow**
The following example introduces essential Big Data techniques for managing large datasets efficiently, using Python’s Pandas library and a synthetic dataset of property transactions. The dataset has only 1,500 rows so that it runs quickly, but it stands in for real datasets with millions of rows, which are slow or impossible to load directly into memory. The Parquet file format is an effective solution because of its columnar storage and its compatibility with the Hadoop ecosystem. Converting a large CSV file into smaller Parquet files in manageable chunks lets the data be processed piece by piece without overwhelming system resources; the code below applies the same idea to an in-memory DataFrame, writing 500 rows per file. As data volumes grow, simple search methods also become inefficient and distributed search systems become necessary, which sets the stage for exploring Elasticsearch as a tool for fast searches in Big Data environments.


In [4]:
# Generating the dataset for study:
# Number of rows:
n_rows = 1500

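# Helper function to generate random dates:
def random_dates(start, end, n):
    # pd.Timestamp treats naive datetimes as UTC, so the bounds do not shift
    # with the machine's local time zone (datetime.timestamp() uses local time).
    start_u = pd.Timestamp(start).timestamp()
    end_u = pd.Timestamp(end).timestamp()
    return pd.to_datetime(np.random.uniform(start_u, end_u, n), unit='s')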

# Possible values:
property_types = ['House', 'Apartment', 'Detached', 'Semi-Detached', 'Terraced']
old_new = ['Old', 'New']
durations = ['Freehold', 'Leasehold']
cities = ['Roma', 'Milano', 'Torino', 'Napoli', 'Bologna', 'Firenze']
districts = ['Central', 'North', 'South', 'East', 'West']
counties = ['Lazio', 'Lombardia', 'Piemonte', 'Campania', 'Emilia-Romagna', 'Toscana']
ppd_categories = ['A', 'B']
record_statuses = ['Added', 'Changed', 'Deleted']

# Creating the dataset:
data = {
    'Transaction': [str(uuid.uuid4()) for _ in range(n_rows)],
    'Price': np.random.randint(80000, 1200000, size=n_rows),
    'Date of Transfer': random_dates(
        datetime(2015, 1, 1),
        datetime(2025, 1, 1),
        n_rows
    ),
    'Property Type': np.random.choice(property_types, n_rows),
    'Old/New': np.random.choice(old_new, n_rows),
    'Duration': np.random.choice(durations, n_rows),
    'Town/City': np.random.choice(cities, n_rows),
    'District': np.random.choice(districts, n_rows),
    'County': np.random.choice(counties, n_rows),
    'PPDCategory': np.random.choice(ppd_categories, n_rows),
    'Record Status': np.random.choice(record_statuses, n_rows)
}

# Creating the DataFrame:
df = pd.DataFrame(data)

# Displaying the data:
print(df.head())

                            Transaction    Price  \
0  f8df6a25-7d30-4321-8d8e-15d73a63870b   160949   
1  4e4ce644-8e32-4224-aaee-07ea0dddd891  1157098   
2  b51d434e-7547-474e-85de-c8526a823406  1105439   
3  b3666683-2203-43ce-b32f-66a98c73a8d1   198948   
4  38c7c851-774c-4df8-9dd9-b3105bcb1d39   413790   

               Date of Transfer  Property Type Old/New   Duration Town/City  \
0 2019-03-03 16:08:57.645555735       Terraced     New   Freehold    Napoli   
1 2021-04-20 11:28:55.066234350  Semi-Detached     Old   Freehold   Bologna   
2 2016-09-03 21:48:52.370244026      Apartment     New  Leasehold    Torino   
3 2020-05-12 22:24:01.558418751  Semi-Detached     Old  Leasehold   Firenze   
4 2018-08-03 20:29:18.300802469      Apartment     New  Leasehold    Napoli   

  District     County PPDCategory Record Status  
0    North   Campania           B         Added  
1  Central   Piemonte           B         Added  
2    North   Piemonte           A         Added  
3  Central  Lombardia           B       Changed
4     East   Campania           A       Deleted
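To make the memory limitation concrete, the sketch below measures how many bytes a small DataFrame occupies and extrapolates linearly to a hypothetical 10 million rows. The sample columns and the 10-million figure are assumptions for illustration. Converting a low-cardinality text column to the `category` dtype is one possible mitigation; it is not part of the original workflow.

```python
import numpy as np
import pandas as pd

n = 1500
sample = pd.DataFrame({
    "Price": np.random.randint(80000, 1200000, size=n),
    "Town/City": np.random.choice(["Roma", "Milano", "Torino", "Napoli"], n),
})

# deep=True also counts the memory held by the string values themselves.
bytes_total = int(sample.memory_usage(deep=True).sum())
bytes_per_row = bytes_total / len(sample)

# Linear extrapolation to a hypothetical 10 million rows.
projected_gb = bytes_per_row * 10_000_000 / 1024**3
print(f"{bytes_per_row:.0f} bytes per row, roughly {projected_gb:.2f} GB for 10M rows")

# A column with few distinct values is stored more compactly as a categorical.
compact = sample.copy()
compact["Town/City"] = compact["Town/City"].astype("category")
compact_bytes = int(compact.memory_usage(deep=True).sum())
print(f"{bytes_total} bytes as text, {compact_bytes} bytes as category")
```

The real dataset has eleven columns, several of them text, so its per-row footprint would be larger than this two-column sample suggests.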

In [5]:
# Saving the DataFrame to a single parquet file:
df.to_parquet('output.parquet')

In [7]:
# Calculating the total number of chunks:
chunk_size = 500
num_chunks = (len(df) + chunk_size - 1) // chunk_size

for i in range(num_chunks):
    start_idx = i * chunk_size
    end_idx = min((i + 1) * chunk_size, len(df))
    chunk = df.iloc[start_idx:end_idx]
    chunk.to_parquet(f'output_chunk_{i}.parquet')

In [9]:
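# Reading the first chunk (rows 0-499) back from its Parquet file.
# Note that this replaces the full 1,500-row DataFrame held in df: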
df = pd.read_parquet('output_chunk_0.parquet')
df

,Transaction,Price,Date of Transfer,Property Type,Old/New,Duration,Town/City,District,County,PPDCategory,Record Status
0,f8df6a25-7d30-4321-8d8e-15d73a63870b,160949,2019-03-03 16:08:57.645555735,Terraced,New,Freehold,Napoli,North,Campania,B,Added
1,4e4ce644-8e32-4224-aaee-07ea0dddd891,1157098,2021-04-20 11:28:55.066234350,Semi-Detached,Old,Freehold,Bologna,Central,Piemonte,B,Added
2,b51d434e-7547-474e-85de-c8526a823406,1105439,2016-09-03 21:48:52.370244026,Apartment,New,Leasehold,Torino,North,Piemonte,A,Added
3,b3666683-2203-43ce-b32f-66a98c73a8d1,198948,2020-05-12 22:24:01.558418751,Semi-Detached,Old,Leasehold,Firenze,Central,Lombardia,B,Changed
4,38c7c851-774c-4df8-9dd9-b3105bcb1d39,413790,2018-08-03 20:29:18.300802469,Apartment,New,Leasehold,Napoli,East,Campania,A,Deleted
...,...,...,...,...,...,...,...,...,...,...,...
495,8868294e-ad92-4939-b4d8-970b95d263c2,661261,2024-04-28 01:48:00.417154312,House,New,Leasehold,Napoli,West,Lazio,B,Added
496,0d923ca0-fec4-451a-b46f-4d8bae1ff9fc,787067,2018-09-15 01:39:55.855512142,Terraced,Old,Freehold,Torino,Central,Emilia-Romagna,B,Changed
497,d820508d-2607-4297-b77e-026b99d42cd6,137024,2019-04-03 05:50:35.948067188,House,Old,Leasehold,Roma,South,Lombardia,A,Deleted
498,ba421555-6f48-4a5c-ba82-865a2f4a0b59,391637,2022-03-08 21:34:01.407039881,Terraced,Old,Freehold,Torino,South,Campania,B,Changed
