# Big Data Analytics - Fall 2025

## Course Instructor: Associate Professor Mahdi Belcaid

### Academic Affiliations:
* Information and Computer Science (ICS)
* Hawaii Institute of Marine Biology (HIMB)
* Hawaii Data Science Institute (HI-DSI)



### Research Focus
Specializing in the application of big data techniques to:
1. Computational Biology
   - Analyzing large-scale genomic and proteomic datasets
   - Modeling complex biological systems
   - Bioinformatics pipeline development

2. Text Data Analysis
   - Natural Language Processing (NLP) for large corpora
   - Information retrieval from unstructured text data

## Why Big Data Analytics?

`We are drowning in data but starving for insight`


* This quote encapsulates the challenge of the modern data landscape. 

* While we have access to unprecedented amounts of data, extracting meaningful insights remains a significant challenge.


## About the Course

- **Focus**: Practical big-data analytics
- **Emphasis**: Tools and applications for big data
- **Primary Language**: Python
  - Why Python? It's a leading language in data science, offering a balance of simplicity and power

**Course Objective**: By the end of this course, you'll have the knowledge and skills to:
  * Apply big data analytics to real-world projects.
  * Confidently discuss data science concepts in job interviews and 


## Course Logistics

- **Communication**: Slack Workspace (ics438f25.slack.com)
  - All important announcements will be made here. Check regularly!
  - Join here: https://join.slack.com/t/slack-nfp6154/shared_invite/zt-2pkzkplkh-qkp3VH53FjzreIRq2xtJGA
- **Course Website**: https://mahdi-b.github.io/ICS438_F25
  - Contains schedule, references, and additional resources
- **Instructor Office Hours**:
  - Brief questions: Post-class on Thursdays
  - Detailed discussions: By appointment (recommended)


## Grading Breakdown

- **Assignments (25%)**
  - Topics: Data wrangling, analytics, and Natural Language Processing (NLP)
  - Python-based
  - Submission via GitHub (familiarize yourself with Git!)
- **In-Class Quizzes (40%)**
  - Comprehensive; tests overall understanding
- **Final Exam (30%)**
  - Held during exam week
  - Covers all course material
- **Attendance (5%)**
  - Max. 4 absences allowed
  - Complete attendance form at the start of each session
    - https://www.cognitoforms.com/MahdiBelcaid/AttendanceICS438F25

## Prerequisites

- A solid understanding of database concepts is crucial
- **Python**: 
  - Prior knowledge is highly beneficial
  - If you're new to Python, be prepared to learn quickly
  - Numerous free resources are available online (e.g., Codecademy, Python.org tutorials)
- **Machine Learning**: Basic familiarity is advantageous but not required


## Data Analytics vs. Data Science: Key Differences

* While sometimes used interchangeably, these fields have distinct focuses and methodologies.

1. **Scope**: Analytics solves defined problems; Science explores open-ended questions
2. **Complexity**: Analytics uses established methods; Science develops new ones
3. **Time Orientation**: Analytics is often retrospective; Science is more predictive
4. **Data Types**: Analytics primarily uses structured data; Science handles both structured and unstructured
5. **Technical Depth**: Science requires deeper mathematical and computational knowledge


## Data Analyst vs. Data Scientist: According to QlikView
<img src="attachment:ab2d392f-ceba-48e2-abf3-421687661ac1.png" width=900>


## Data Analyst vs. Data Scientist: Task Comparison

| Task                   | Data Analytics | Data Science |
|------------------------|----------------|--------------|
| Data wrangling         | LOW/MEDIUM     | MEDIUM/HIGH  |
| Exploratory Analyses   | HIGH           | MEDIUM       |
| Descriptive Statistics | HIGH           | LOW          |
| Statistical modeling   | LOW            | HIGH         |
| Prescriptive analytics | MEDIUM/HIGH    | LOW          |
| Predictive Analysis    | LOW/MEDIUM     | HIGH         |
| Application Development| LOW            | MEDIUM       |
| Data Engineering       | LOW            | LOW/MEDIUM   |

Note: This is a subjective comparison of the most common tasks in each field.


## Flavors of Analytics

1. **Descriptive Analytics**: 
   - What happened?
   - Tools: Dashboards, reports, data visualization

2. **Diagnostic Analytics**: 
   - Why did it happen?
   - Techniques: Data mining, correlations, root cause analysis

3. **Predictive Analytics**: 
   - What might happen in the future?
   - Methods: Machine learning, regression analysis, time series forecasting

4. **Prescriptive Analytics**: 
   - What should we do about it?
   - Approaches: Optimization, simulation, decision analysis




## The Big Data Revolution

### Driving Factors:
1. **Technological Advancements**: Ability to capture and store vast amounts of data
2. **Datafication**: Trend of turning aspects of life into quantifiable data
   - IoT devices "datafying" household appliances
   - Wearables "datafying" health metrics
   - LinkedIn "datafying" employees and job hiring


### Outcomes of "datafication"
- Unprecedented scale and complexity of data available for analysis
- New challenges in data integration and interpretation
- Opportunities for insights previously impossible due to data limitations


<img src="attachment:67c3b097-3885-4f04-8909-7d4b37f8bd32.png" width=700>

## Demystifying Data Units


| Abbreviation | Unit | Size |
|------|-------|------|
| b    | bit   | 0 or 1, 1/8 of a byte |
| B    | byte  | 8 bits, 1 byte |
| KB   | kilobyte  | 1,000 bytes |
| MB   | megabyte  | 1,000,000 bytes |
| GB   | gigabyte  | 1,000,000,000 bytes |
| TB   | terabyte  | 1,000,000,000,000 bytes |
| PB   | petabyte  | 1,000,000,000,000,000 bytes |
| EB   | exabyte   | 1,000,000,000,000,000,000 bytes |
| ZB   | zettabyte | 1,000,000,000,000,000,000,000 bytes |
| YB   | yottabyte | 1,000,000,000,000,000,000,000,000 bytes |

* Note: lowercase “b” is used as an abbreviation for bits, while uppercase “B” represents bytes.
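
As a quick illustration of the table, the sketch below converts a raw byte count into the largest decimal unit and shows why the bit/byte distinction matters when estimating transfer times. The function names `human_readable` and `transfer_seconds` and the 100 Mb/s link speed are illustrative assumptions, not part of the course material.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest decimal (power-of-1000) unit."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

def transfer_seconds(num_bytes: int, megabits_per_second: float) -> float:
    """Time to move num_bytes over a link rated in megabits (Mb) per second."""
    bits = num_bytes * 8  # 1 byte = 8 bits
    return bits / (megabits_per_second * 1_000_000)

print(human_readable(2_400_000_000))         # 2.4 GB
print(transfer_seconds(1_000_000_000, 100))  # 80.0
```

A 1 GB file on a 100 Mb/s link takes 80 seconds, not 10, because link speeds are quoted in bits while file sizes are quoted in bytes.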

## What Constitutes "Big Data"?

- Not just about volume (exceeding RAM)
- Characteristics requiring specialized tools or architectures:
  1. **Volume**: Sheer size of data
  2. **Variety**: Diverse data types and sources
  3. **Velocity**: Speed of data generation and capture
    * Does not need to be real-time or near real-time


### Types of Big Data:
- **Tall Data**: Many observations, few variables
  - Example: Traffic data from all NYC intersections every 5 minutes for a week
- **Wide Data**: Few observations, many variables
  - Example: Genetic data for 10,000 patients across 30,000 genes
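
The two examples can be sized with simple arithmetic. The sketch below assumes each value is stored as an 8-byte float, and the number of intersections is a hypothetical placeholder because the example does not state it.

```python
def matrix_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Memory needed for a dense n_rows x n_cols table of fixed-size values."""
    return n_rows * n_cols * bytes_per_value

# Wide data: 10,000 patients x 30,000 genes, assuming 8 bytes per value.
wide = matrix_bytes(10_000, 30_000)
print(wide / 1e9, "GB")  # 2.4 GB

# Tall data: one reading every 5 minutes for a week, per intersection.
readings_per_week = (60 // 5) * 24 * 7  # 2016 readings per intersection
n_intersections = 10_000                # hypothetical count, not from the source
tall_rows = readings_per_week * n_intersections
print(tall_rows)                        # 20160000
```

Even the "few observations" case can reach gigabytes once the number of variables is large.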

## Challenges in Big Data Analytics: Beyond Technical Limitations

1. **Technical Limitations**
   - Processing power and storage constraints
   - Need for distributed computing solutions

2. **Data Quality and Validity**
   - Distinguishing genuine patterns from noise
   - Importance of statistical principles (e.g., A/B testing, Bonferroni correction)

3. **Actionable Insights**
   - Balancing insight value against implementation costs
   - Example: Commuter pattern analysis vs. cost of new transit routes

4. **Interpretability**
   - Making complex patterns understandable for decision-makers
   - Balancing sophistication with clarity
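
The Bonferroni correction mentioned under Data Quality and Validity guards against false discoveries when many hypotheses are tested at once: with m tests, each p-value is compared against alpha / m instead of alpha. The sketch below is a minimal illustration; the function name and the p-values are made up.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return booleans: True where p <= alpha / m, with m the number of tests."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]

p_values = [0.001, 0.02, 0.04, 0.30]     # made-up p-values from 4 tests
print([p <= 0.05 for p in p_values])     # [True, True, True, False]
print(bonferroni_significant(p_values))  # [True, False, False, False]
```

With four tests the per-test threshold drops to 0.0125, so two results that looked significant in isolation no longer pass.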



## Strategies for Technical Big Data Challenges

An important aspect of this course is understanding how to technically manage large data.
1. **Subsampling**: Analyze a representative subset
2. **Data Partitioning**: Parallel processing across multiple machines
3. **Probabilistic Algorithms**: Fast, approximate answers for non-critical applications
4. **In-Memory Databases**: Rapid data processing and analytics
5. **Data Compression**: Reduce data size, but may introduce complexity
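
As a rough sketch of the data partitioning idea, the example below splits a stream into chunks, computes a small partial result for each chunk, and combines the partial results into a global mean. Everything runs sequentially in one process here; in a real deployment each chunk could be handled by a separate worker or machine. All function names are hypothetical.

```python
from itertools import islice

def partitions(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk

def partial_stats(chunk):
    """Work done independently on each partition: a (sum, count) pair."""
    return sum(chunk), len(chunk)

def combined_mean(iterable, size=1000):
    """Combine per-partition results into a global mean."""
    total, count = 0, 0
    for chunk_sum, chunk_count in map(partial_stats, partitions(iterable, size)):
        total += chunk_sum
        count += chunk_count
    return total / count

print(combined_mean(range(1, 10_001), size=1000))  # 5000.5
```

Only one chunk is held in memory at a time, and the per-chunk results are tiny, which is what makes the combine step cheap.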


## Strategies for Non-Technical Big Data Challenges

* Data Quality and Validity: Data wrangling techniques, data visualization, and small sample analysis.
* Interpretability: Data visualization and storytelling.
* Actionable Insights: Requires domain expertise.


## Python: The Language of Choice for Big Data Analytics

### Advantages:
- Versatile: From quick scripts to comprehensive applications
- Rich ecosystem of data science libraries
- User-friendly syntax with gentle learning curve
- Cross-platform compatibility
- Strong community support

### Potential Drawbacks:
- Interpreted rather than compiled to optimized machine code
- Less fine-grained control compared to lower-level languages

### Mitigating Factors:
- Libraries tap into efficient, low-level implementations
- Foreign function interfaces allow integration with other languages
- Many core libraries (e.g., NumPy, TensorFlow) implemented in C/C++
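
The effect of delegating work to low-level implementations can be seen without any third-party library. In CPython, the built-in `sum` is implemented in C, so the sketch below compares it with an equivalent loop executed by the interpreter. The helper name `python_loop_sum` is hypothetical, and the timings depend on the machine and Python version, so no specific numbers are claimed.

```python
import timeit

def python_loop_sum(values):
    """Sum with an explicit loop executed by the Python interpreter."""
    total = 0
    for v in values:
        total += v
    return total

values = list(range(1_000_000))
assert python_loop_sum(values) == sum(values)  # same answer either way

loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)
print(f"explicit loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```

Libraries such as NumPy apply the same principle to whole arrays.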

## Tools of the Trade: Jupyter Notebook

- Web-based interactive development environment
- Supports multiple programming languages
- Ideal for data exploration, visualization, and sharing analyses
- Integrates well with big data tools and frameworks
- Can be installed as part of the Anaconda Distribution: https://docs.anaconda.com/anaconda/install/


## Next Week's Challenge: Efficient Data Subsetting

Problem: Sample data from a massive dataset that doesn't fit in RAM.

Considerations:
- Data can only be read one line at a time
- Aim for a representative sample without loading entire dataset

Dataset: https://www.dropbox.com/scl/fi/m52rcbdf58jwchqav09zu/users.tsv.gz?rlkey=ofk5u19xrv4vl8d803mycrzij&dl=1
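
One well-known technique that fits these constraints is reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample while reading items one at a time. The sketch below is a starting point for discussion and not necessarily the approach the course intends. The function name and parameters are hypothetical, and the demo uses a generated stream rather than the actual dataset.

```python
import random

def reservoir_sample(lines, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)   # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                reservoir[j] = line  # keep the new item with probability k / (i + 1)
    return reservoir

# Demo on a generated stream; only k items are ever held in memory.
sample = reservoir_sample((f"row {n}" for n in range(100_000)), k=5, seed=42)
print(len(sample))  # 5
```

Assuming the dataset has been downloaded locally as `users.tsv.gz`, the same function could be fed a file handle opened with `gzip.open("users.tsv.gz", "rt")`, since iterating over the handle yields one line at a time.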



### Tasks for the Next Session:

* Assemble a team (4-5 students recommended)
* Review Assessment A01
* Brainstorm ideas




