- Technological progress
- storage capacity
- communication bandwidth
- computing power
- Reduction of ICT costs
- Digital Universe
- Integration of digital technologies in every human activity
- Scientific research (produces a lot of data)
- Exponential growth of data
- Data can be either structured (e.g., database records) or unstructured (e.g., textual data)
- The analysis of large datasets arises in:
- Retailing: product improvement, recommendation systems
- Banking/Finance: fraud detection...
- Telecommunications: user profiling
- Science: validation methods
- Medicine: diagnosis/therapy
- Social studies: IoT
- Volume
- the size of the data poses several computational challenges and requires a data-centric perspective
- Velocity
- the data arrive at such a high rate that they cannot be stored and processed offline, but must be processed in streaming (see the sketch after this list)
- Variety
- large datasets often come unstructured and may relate to very different scenarios
- Veracity
- large datasets coming from real-world applications are likely to contain noisy, uncertain data
- All points above require a paradigm shift with respect to traditional computing
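
To make the velocity point concrete, here is a minimal, hypothetical Python sketch (not course material): a single-pass streaming computation inspects each item once and then discards it, so memory stays constant no matter how long the stream is.

```python
# Minimal sketch of streaming (one-pass) processing, assuming the stream
# is any Python iterable; names and data here are purely illustrative.
def running_mean(stream):
    """Consume items one at a time, keeping only O(1) state."""
    count, total = 0, 0.0
    for x in stream:          # each item is seen once, then discarded
        count += 1
        total += x
        yield total / count   # current estimate after each arrival

# Usage: the stream could be arbitrarily long; memory use stays constant.
for estimate in running_mean(iter([3.0, 1.0, 4.0, 1.0, 5.0])):
    print(estimate)
```
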
- Novel computing/programming frameworks for big data processing: theory and practice
- Spark
- A sample of key primitives for data analysis
- Rigorous setting (being able to analytically predict what is going to happen)
- Algorithmic solutions with focus on large inputs
- Computational Frameworks: MapReduce, Apache Spark (see the word-count sketch after this list)
- Clustering primitives (Professor's focus)
- Graph analysis primitives
- Association analysis primitives (Data mining)
- Data stream processing
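
As a small foretaste of the MapReduce/Spark paradigm listed above, here is a minimal word-count sketch in PySpark. It is an assumption-laden illustration, not the course's reference solution: the input file `input.txt` is hypothetical, and the assignments may use a different setup.

```python
# Word count in the MapReduce style, expressed with Spark RDD operations.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
counts = (sc.textFile("input.txt")                 # one RDD element per line
            .flatMap(lambda line: line.split())    # map: line -> words
            .map(lambda word: (word, 1))           # map: word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))      # reduce: sum counts per word
print(counts.collect())
sc.stop()
```

Note the structure: the `map`/`flatMap` steps transform records independently (and hence in parallel), while `reduceByKey` aggregates all pairs sharing a key, which is exactly the map and reduce phases of the MapReduce model.
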
- Written exam (26 points)
- Homeworks (6+1 points)
- groups of at most 3-4 students
- 4 assignments, one every 2-3 weeks
- Use of Apache Spark on individual PCs (assignments 1-3) and on CloudVeneto (assignment 4)
- Moodle: forum, evaluation of homeworks and of written exams
- Uniweb: written exam lists, official final grades
- Course website: http://www.dei.unipd.it/~capri/BDC/