# Big Data Analytics - Fall 2025

## Course Instructor: Associate Professor Mahdi Belcaid

### Academic Affiliations:
* Information and Computer Science (ICS)
* Hawaii Institute of Marine Biology (HIMB)
* Hawaii Data Science Institute (HI-DSI)



### Research Focus
Specializing in the application of big data techniques to:
1. Computational Biology
   - Analyzing large-scale genomic and proteomic datasets
   - Modeling complex biological systems
   - Bioinformatics pipeline development

2. Text Data Analysis
   - Natural Language Processing (NLP) for large corpora
   - Information retrieval from unstructured text data

## Why Big Data Analytics?

`We are drowning in data but starving for insight`


* This quote encapsulates the challenge of the modern data landscape. 

* While we have access to unprecedented amounts of data, extracting meaningful insights remains a significant challenge.


## About the Course

- **Focus**: Practical big-data analytics
- **Emphasis**: Tools and applications for big data
- **Primary Language**: Python
  - Why Python? It's a leading language in data science, offering a balance of simplicity and power

**Course Objective**: By the end of this course, you'll have the knowledge and skills to:
  * Apply big data analytics to real-world projects and scenarios.
  * Confidently discuss data science concepts in job interviews and 


## Course Logistics

- **Communication**: Slack Workspace (ics438f23.slack.com)
  - All important announcements will be made here. Check regularly!
- **Course Website**: mahdi-b.github.io/ics438f23
  - Contains schedule, references, and additional resources
- **Instructor Office Hours**:
  - Brief questions: Post-class on Thursdays
  - Detailed discussions: By appointment (recommended)

---

## Grading Breakdown

- **Assignments (25%)**
  - Topics: Data wrangling, analytics, and Natural Language Processing (NLP)
  - Python-based
  - Submission via GitHub (familiarize yourself with Git!)
- **In-Class Quizzes (40%)**
  - Comprehensive; tests overall understanding
- **Final Exam (30%)**
  - Held during exam week
  - Covers all course material
- **Attendance (5%)**
  - Max. 4 absences allowed
  - Complete attendance form at the start of each session

---

## Prerequisites

- **Database Concepts**: A solid understanding is crucial
- **Python**: 
  - Prior knowledge is highly beneficial
  - If you're new to Python, be prepared to learn quickly
  - Numerous free resources available online (e.g., Codecademy, Python.org tutorials)
- **Probability & Statistics**: We'll cover the basics, but prior exposure helps
- **Machine Learning**: Basic familiarity is advantageous but not required

---

## Data Analytics vs. Data Science

While these terms are often used interchangeably, there are subtle differences:

### Data Analytics
- **Focus**: Examining existing data for insights
- **Typical Tasks**:
  - Identify trends and patterns
  - Create visualizations
  - Transform findings into actionable business decisions

### Data Science
- **Focus**: Solving complex problems using data
- **Typical Tasks**:
  - Design custom data processes and models
  - Develop new algorithms and prototypes
  - Build predictive models for future-oriented analysis

---

## Flavors of Analytics

1. **Descriptive Analytics**: 
   - What happened?
   - Tools: Dashboards, reports, data visualization

2. **Diagnostic Analytics**: 
   - Why did it happen?
   - Techniques: Data mining, correlations, root cause analysis

3. **Predictive Analytics**: 
   - What might happen in the future?
   - Methods: Machine learning, regression analysis, time series forecasting

4. **Prescriptive Analytics**: 
   - What should we do about it?
   - Approaches: Optimization, simulation, decision analysis

---

## Data Analyst vs. Data Scientist: Task Comparison

| Task                    | Data Analytics | Data Science |
|-------------------------|----------------|--------------|
| Data wrangling          | LOW/MEDIUM     | MEDIUM/HIGH  |
| Exploratory analysis    | HIGH           | MEDIUM       |
| Descriptive statistics  | HIGH           | LOW          |
| Statistical modeling    | LOW            | HIGH         |
| Prescriptive analytics  | MEDIUM/HIGH    | LOW          |
| Predictive analytics    | LOW            | HIGH         |
| Application development | LOW            | MEDIUM       |

Note: This is a subjective comparison of the most common tasks in each field.

---

## Real-world Data Analytics Examples

1. **Global Supply Chain Optimization**
   - Scenario: Analyzing global supply chain data for a multinational corporation
   - Goal: Identify bottlenecks, disruptions, and correlations between geopolitical events and supply delays
   - Outcome: Optimize supply chain efficiency and resilience

2. **E-commerce Marketing ROI Maximization**
   - Scenario: Analyzing multi-channel marketing data for an e-commerce platform
   - Goal: Compare customer acquisition channels (e.g., social media vs. email marketing) in terms of Customer Lifetime Value (LTV)
   - Outcome: Optimize marketing budget allocation and launch targeted promotions

3. **Document Relationship Analysis**
   - Scenario: Analyzing a vast collection of documents
   - Goal: Determine inter-relationships and groupings among documents
   - Outcome: Improve document categorization and information retrieval

4. **Streaming Content Investment Optimization**
   - Scenario: Analyzing viewing habits of Netflix users
   - Goal: Predict promising movie genre investments for different viewer demographics
   - Outcome: Maximize returns on content investment

---

## Data Analytics: A Multifaceted Approach

1. **Descriptive Analytics**
   - Purpose: Understand past events
   - Tools: Statistics, data visualization, probability distributions

2. **Diagnostic Analytics**
   - Purpose: Uncover reasons behind past events
   - Tools: Statistical tests, correlation analysis, interpretable models

3. **Prescriptive Analytics**
   - Purpose: Recommend future actions
   - Tools: A/B testing, market basket analysis, cluster analysis

4. **Predictive Analytics**
   - Purpose: Forecast future events
   - Tools: Historical data models, often simple and interpretable

Note: Predictive analytics provides probabilities of future events, while prescriptive analytics offers specific recommendations for action.
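
To make the distinction in the note concrete, here is a minimal sketch. Every number in it (the delay probability, the costs, and the two candidate actions) is made up for illustration; a real project would take the probability from a fitted model.

```python
# Predictive step: a (hypothetical) model's estimate that a shipment is delayed.
p_delay = 0.30
delay_cost = 10_000  # made-up cost of one delay, in dollars

# Candidate actions: (upfront cost, delay probability if we take the action).
# These figures are illustrative assumptions, not real data.
actions = {
    "do nothing": (0, p_delay),
    "expedite shipping": (2_000, 0.05),
}


def expected_cost(upfront, prob_delay):
    """Upfront cost plus the probability-weighted cost of a delay."""
    return upfront + prob_delay * delay_cost


# Prescriptive step: recommend the action with the lowest expected cost.
recommendation = min(actions, key=lambda name: expected_cost(*actions[name]))

for name, params in actions.items():
    print(f"{name}: expected cost {expected_cost(*params):,.0f}")
print("Recommended action:", recommendation)
```

The predictive step stops at a probability; the prescriptive step combines that probability with costs to pick an action.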

---

## The Evolution of Data Analytics and Data Science

- These fields have roots in academic and industry research spanning decades
- Many scientific disciplines have long relied on complex data analysis
- Modern "data science" often rebrands existing practices with new tools and scale
- Interdisciplinary fields emerging:
  - Ecoinformatics, Chemoinformatics, Social Informatics
  - Computational Biology, Quantitative Finance, Computational Linguistics

These fields represent the intersection of traditional disciplines with data science and analytics techniques.

---

## The Big Data Revolution

### Driving Factors:
1. **Technological Advancements**: Ability to capture and store vast amounts of data
2. **Datafication**: Trend of turning aspects of life into quantifiable data
   - Example: IoT devices "datafying" household appliances
   - Example: Wearables "datafying" health metrics

### Impact:
- Unprecedented scale and complexity of data available for analysis
- New challenges in data integration and interpretation
- Opportunities for insights previously impossible due to data limitations

---

## What Constitutes "Big Data"?

- Not just about volume (exceeding RAM)
- Characteristics requiring specialized tools or architectures:
  1. **Volume**: Sheer size of data
  2. **Variety**: Diverse data types and sources
  3. **Velocity**: Speed of data generation and capture

### Types of Big Data:
- **Tall Data**: Many observations, few variables
  - Example: Traffic data from all NYC intersections every 5 minutes for a week
- **Wide Data**: Few observations, many variables
  - Example: Genetic data for 10,000 patients across 30,000 genes
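
As a rough back-of-the-envelope check on the two examples above, the sketch below estimates row counts and in-memory size. The number of intersections (`n_intersections`) and the 8 bytes per value (a dense table of 64-bit floats) are illustrative assumptions, not figures from the course material.

```python
def readings_per_week(interval_minutes: int) -> int:
    """Number of readings one sensor produces in a week at a fixed interval."""
    return 7 * 24 * 60 // interval_minutes


def dense_size_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Size in GB of a dense table where every cell takes bytes_per_value bytes."""
    return n_rows * n_cols * bytes_per_value / 1e9


# Tall data: one row per (intersection, timestamp) reading.
n_intersections = 10_000  # hypothetical count, for illustration only
tall_rows = n_intersections * readings_per_week(5)

# Wide data: 10,000 patients (rows) by 30,000 genes (columns).
wide_gb = dense_size_gb(10_000, 30_000)

print(f"tall: {tall_rows:,} rows")
print(f"wide: {wide_gb:.1f} GB as dense 64-bit floats")
```

Even the "few observations" case can strain a laptop's memory once intermediate copies are made during analysis.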

---

## Challenges in Big Data Analytics

1. **Data Quality and Validity**
   - Distinguishing genuine patterns from noise
   - Importance of A/B testing and statistical principles (e.g., Bonferroni correction for multiple comparisons)

2. **Actionable Insights**
   - Balancing insight value against implementation costs
   - Example: Commuter pattern analysis vs. cost of new transit routes

3. **Interpretability**
   - Making complex patterns understandable for decision-makers
   - Balancing sophistication with clarity

4. **Technical Limitations**
   - Processing power and storage constraints
   - Need for distributed computing solutions
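
The Bonferroni correction mentioned under Data Quality and Validity can be sketched in a few lines. With many tests, the chance of at least one false positive grows quickly, so each test is held to the stricter threshold `alpha / n_tests`. The p-values below are made up for illustration, and the family-wise formula assumes independent tests.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Flag a p-value as significant only if it is below alpha / n_tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Uncorrected: 20 tests at alpha = 0.05 give a high chance of a spurious hit.
print(f"{familywise_error_rate(0.05, 20):.2f}")

# Corrected: the threshold for 3 tests is 0.05 / 3, so only 0.001 passes.
p_values = [0.001, 0.02, 0.04]  # made-up values for illustration
print(bonferroni_reject(p_values))
```

The correction is conservative: it reduces false positives at the cost of missing some real effects.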

---

## Strategies for Big Data Challenges

1. **Subsampling**: Analyze a representative subset
2. **Data Partitioning**: Parallel processing across multiple machines
3. **Probabilistic Algorithms**: Fast, approximate answers for non-critical applications
4. **In-Memory Databases**: Rapid data processing and analytics
5. **Data Compression**: Reduce data size, at the cost of extra CPU time to compress and decompress
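
As one illustration of the data partitioning strategy, the sketch below computes a mean by summarizing each partition separately and then merging the summaries. It runs in a single process; the helper names (`partition`, `partial_stats`, `combine`) are hypothetical, and in practice each partition would be handled by a separate worker or machine.

```python
from functools import reduce


def partition(values, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_stats(chunk):
    """Map step: summarize one partition as (sum, count)."""
    return sum(chunk), len(chunk)


def combine(a, b):
    """Reduce step: merge two partial summaries."""
    return a[0] + b[0], a[1] + b[1]


values = list(range(1, 101))  # stand-in for a dataset too large for one machine
partials = map(partial_stats, partition(values, 4))  # could run on separate workers
total, count = reduce(combine, partials)
mean = total / count
print(mean)
```

This works because sums and counts can be merged in any order. Statistics such as the median do not decompose this simply, which is one reason approximate algorithms are attractive.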

---

## Python: The Language of Choice for Big Data Analytics

### Advantages:
- Versatile: From quick scripts to comprehensive applications
- Rich ecosystem of data science libraries
- User-friendly syntax with gentle learning curve
- Cross-platform compatibility
- Strong community support

### Potential Drawbacks:
- No ahead-of-time optimizing compiler in the standard interpreter (CPython), so pure-Python loops run slower than compiled code
- Less fine-grained control compared to lower-level languages

### Mitigating Factors:
- Libraries tap into efficient, low-level implementations
- Foreign function interfaces allow integration with other languages
- Many core libraries (e.g., NumPy, TensorFlow) are implemented in C/C++
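
A small experiment, using only the standard library, hints at why low-level implementations matter. In CPython the built-in `sum` runs its loop in C, while the hand-written function below is interpreted step by step. Timings depend on the machine and Python version, so treat the printed numbers as indicative only.

```python
import timeit


def python_loop_sum(values):
    """Sum with an explicit Python-level loop (each iteration is interpreted)."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(1_000_000))

# Both approaches give the same answer.
assert python_loop_sum(values) == sum(values)

# Time five runs of each; the built-in is typically faster, but results vary.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)
print(f"Python loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```

Libraries such as NumPy apply the same idea at larger scale by moving whole-array operations into compiled code.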

---

## Tools of the Trade: Jupyter Notebook

- Web-based interactive development environment
- Supports multiple programming languages
- Ideal for data exploration, visualization, and sharing analyses
- Integrates well with big data tools and frameworks

---

## Next Week's Challenge: Efficient Data Subsetting

Problem: Sample data from a massive dataset that doesn't fit in RAM.

Considerations:
- Data can only be read one line at a time
- Aim for a representative sample without loading entire dataset

Dataset: https://www.dropbox.com/scl/fi/m52rcbdf58jwchqav09zu/users.tsv.gz?rlkey=ofk5u19xrv4vl8d803mycrzij&dl=1

Prepare to discuss potential approaches and their trade-offs in our next session!
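
One possible starting point for the discussion is a minimal sketch of line-by-line Bernoulli sampling, where each line is kept independently with a fixed probability. The function names, the local path passed to `sample_gzip_file`, and the UTF-8 encoding are assumptions; this is a baseline to critique, not the intended solution.

```python
import gzip
import random


def bernoulli_sample(lines, keep_prob, seed=None):
    """Yield each line independently with probability keep_prob.

    Reads one line at a time, so memory use does not grow with the input.
    """
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < keep_prob:
            yield line


def sample_gzip_file(path, keep_prob, seed=None):
    """Stream a gzip-compressed text file and return the sampled lines.

    Assumes the file is UTF-8 text; a header row, if any, is sampled like
    any other line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(bernoulli_sample(handle, keep_prob, seed))


# Demo on an in-memory stand-in for the real file (made-up content).
fake_lines = [f"user_{i}\t{i % 7}\n" for i in range(10_000)]
sample = list(bernoulli_sample(fake_lines, keep_prob=0.01, seed=42))
print(len(sample))  # about 100 on average; the exact size is random
```

Trade-offs worth discussing: the sample size is not fixed in advance, and choosing `keep_prob` for a target size requires knowing the number of lines. Techniques such as reservoir sampling address the fixed-size case.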