# Cleaning Data with PySpark

Working with data is tricky, and working with millions or even billions of rows is harder still. Have you been handed data processing code that was written on a laptop against fairly pristine data? If so, you have probably been asked to move a basic data process from prototype to production. You may already have worked with real-world datasets that have missing fields, bizarre formatting, and orders of magnitude more data. Even if this is all new to you, this course teaches what you need to prepare data processes using Python with Apache Spark. You'll learn the terminology, methods, and some best practices for building a performant, maintainable, and understandable data processing platform.

## Table of Contents

- [Introduction](#intro)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

path = "data/dc33/"

---
<a id='intro'></a>

## Intro to data cleaning with Apache Spark

<img src="images/spark3_001.png" alt="" style="width: 800px;"/>

<img src="images/spark3_002.png" alt="" style="width: 800px;"/>

<img src="images/spark3_003.png" alt="" style="width: 800px;"/>

<img src="images/spark3_004.png" alt="" style="width: 800px;"/>

<img src="images/spark3_005.png" alt="" style="width: 800px;"/>

## Defining a schema

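Defining a schema up front helps with data quality and import performance: the declared types document what each column should contain, and Spark does not have to infer column types from the data. As mentioned during the lesson, we'll create a simple schema to read in the following columns: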

- Name
- Age
- City

The Name and City columns are `StringType()` and the Age column is an `IntegerType()`.

- Import `*` from the `pyspark.sql.types` module.
- Define a new schema by creating a `StructType`.
- Define a `StructField` for each of name, age, and city. Each field should use the correct data type and should not be nullable.

In [2]:
# Import the pyspark.sql.types library
from pyspark.sql.types import *

# Define a new schema as a StructType holding a list of fields
people_schema = StructType([
  # Define a StructField for each field
  StructField('name', StringType(), False),
  StructField('age', IntegerType(), False),
  StructField('city', StringType(), False)
])

## Immutability and lazy processing



<img src="images/spark3_006.png" alt="" style="width: 800px;"/>
