pyspark

This repository is an introduction to using PySpark for Big Data. The tutorial was taken from guru99.

Notes and Definitions:

Big Data
Apache Spark
What is PySpark?

Tutorial Coverage

Introduction

Spark is designed to work with Python, Java, Scala and SQL. It ships with a large set of built-in libraries (e.g. MLlib) and can read a wide range of file formats.

PySpark provides an API that spares the developer or data scientist from writing parallel code by hand, which tends to be complex and error-prone. In effect, it handles distributing the work across processes for you.
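As a rough illustration of what "handling the multiprocessing" can look like, the sketch below computes a sum of squares twice: once in plain Python and once through Spark. The function names, the app name, the `local[*]` master and the partition count are assumptions made for this example, not part of the original tutorial.

```python
from operator import add


def square(x):
    """Pure function applied to each element; Spark ships it to the workers."""
    return x * x


def sum_of_squares_local(values):
    """Single-process reference implementation."""
    return sum(square(v) for v in values)


def sum_of_squares_spark(values, num_partitions=4):
    """Same computation, with Spark splitting the data across partitions."""
    # Imported lazily so this module still loads where pyspark is missing.
    from pyspark import SparkConf, SparkContext

    # Assumption: "local[*]" runs Spark on all local cores, no cluster needed.
    conf = SparkConf().setAppName("parallel-sketch").setMaster("local[*]")
    sc = SparkContext.getOrCreate(conf=conf)
    try:
        rdd = sc.parallelize(list(values), num_partitions)
        # map runs per partition in parallel; reduce combines partial results.
        return rdd.map(square).reduce(add)
    finally:
        sc.stop()
```

With PySpark and a Java runtime installed, `sum_of_squares_spark(range(1, 6))` should agree with `sum_of_squares_local(range(1, 6))`. No threads, processes or locks appear in the user code.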

Spark works closely with structured data and allows real-time querying of the data.
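A minimal sketch of querying structured data with SQL follows. The sample rows, the `people` view name and the query are hypothetical, chosen only for illustration. It shows an ad hoc query over a small in-memory table rather than a streaming setup.

```python
# Hypothetical sample data: (name, age) pairs.
PEOPLE = [("Ana", 34), ("Ben", 17), ("Cy", 52)]
QUERY = "SELECT name FROM people WHERE age >= 18 ORDER BY name"


def adults_local(rows):
    """Plain-Python equivalent of QUERY, for comparison."""
    return sorted(name for name, age in rows if age >= 18)


def adults_spark(rows):
    """Run QUERY through Spark SQL on a DataFrame built from rows."""
    # Imported lazily so this module still loads where pyspark is missing.
    from pyspark.sql import SparkSession

    # Assumption: a local session is enough for this small example.
    spark = (
        SparkSession.builder.master("local[*]").appName("sql-sketch").getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["name", "age"])
        # Register the DataFrame under a name that SQL statements can refer to.
        df.createOrReplaceTempView("people")
        return [row["name"] for row in spark.sql(QUERY).collect()]
    finally:
        spark.stop()
```

With PySpark available, `adults_spark(PEOPLE)` is expected to return the same names as `adults_local(PEOPLE)`.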

Installation

`conda install pyspark` should work, provided you are working in a Python environment. Otherwise see instructions

A SparkContext must also be initialized. SparkContext is the internal engine that connects your application to a cluster, and it must exist before any operation can run.
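One possible way to initialize a SparkContext is sketched below. The helper names, the default app name and the `local[*]` master are assumptions for a single-machine setup. A real cluster would use its own master URL.

```python
def build_settings(app_name="pyspark-intro", master="local[*]"):
    """Collect the two settings Spark needs before a context can start."""
    return {"spark.app.name": app_name, "spark.master": master}


def make_context(app_name="pyspark-intro", master="local[*]"):
    """Create a SparkContext, or reuse the one already running."""
    # Imported lazily so this module still loads where pyspark is missing.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAll(list(build_settings(app_name, master).items()))
    # getOrCreate avoids the error raised when a second context is started
    # in the same Python process.
    return SparkContext.getOrCreate(conf=conf)
```

A typical session would call `sc = make_context()`, run operations such as `sc.parallelize([1, 2, 3]).count()`, and finish with `sc.stop()` to release resources.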

Last Updated: February 27th, 2022
