# PySpark: Cleaning Data and Getting Insights from It


- This notebook has three parts: installing Spark, loading and cleaning data, and getting insights from the data. These steps are the starting point of any data analytics project.

- My earlier notebooks cover installing Spark and uploading data to Colab in detail. This notebook focuses on data cleansing.

- Data cleansing is the process of analyzing the quality of the data in a data source, approving or rejecting the corrections a system suggests, and changing the data accordingly. The quality of the data determines how useful the information drawn from it can be.
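
As a rough illustration of what "analyzing the quality of data" can mean in practice, the sketch below counts missing values per column. It is plain Python rather than Spark, and the rows are made up for the example; only the column names are borrowed from the data set used later in this notebook.

```python
from collections import Counter

# Made-up sample rows (an assumption for illustration); None marks a missing value.
rows = [
    {"Country": "United States", "Region": "NY", "Latitude": 40.7},
    {"Country": "United States", "Region": None, "Latitude": None},
    {"Country": None, "Region": "ON", "Latitude": 43.7},
    {"Country": "Canada", "Region": None, "Latitude": 45.4},
]

def count_missing(rows):
    """Return {column: number of rows where the value is None}."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # True counts as 1 and False as 0, so every column gets an entry.
            missing[column] += value is None
    return dict(missing)

missing_per_column = count_missing(rows)
print(missing_per_column)
```

A profile like this helps decide, column by column, whether to fill the gaps (as done for Region, Latitude and Longitude below) or to drop the affected rows (as done for Country).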



# PART 1. Configure PySpark environment

Copy and paste the code below.

Read more: https://github.com/kyramichel/Pyspark_Cloud/blob/master/PySpark_GoogleColab.ipynb


In [None]:
#update the package lists on the machine
!apt-get update

#install java 
!apt-get install openjdk-8-jdk-headless -qq > /dev/null


#install spark: get the file
!wget -q https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
    
#unzip the file
!tar xf spark-2.4.1-bin-hadoop2.7.tgz

#set up the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
#SPARK_HOME must point to the Spark version extracted above (2.4.1)
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"

#install findspark
!pip install -q findspark

#findspark.init() adds pyspark to sys.path so that it can be imported like any other Python library
import findspark
findspark.init("/content/spark-2.4.1-bin-hadoop2.7")

import pyspark

#SparkContext is the entry point to Spark functionality: it represents the connection to a Spark cluster
from pyspark import SparkContext, SparkConf


#import a spark session
from pyspark.sql import SparkSession
#create a session
spark = SparkSession.builder.getOrCreate()
spark

#test the installation
df0 = spark.sql("select 'PySpark' as Hello")
df0.show()

# PART 2. Upload, Load & Clean Data

- To upload data, click Upload and select your data file. Read more about getting data into Colab: https://github.com/kyramichel/Pyspark_Cloud/blob/master/DataPysparkCloudColab.ipynb

- Load the data and create a data frame df.

- To get insights from the data, we can query a Spark data frame using both Python and SQL.

In [None]:
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df.show()  

In [None]:
df.printSchema()

In [None]:
#Import the pyspark sql functions used to clean the data
#(explicit names avoid shadowing Python built-ins such as sum, min and max)
from pyspark.sql.functions import col, when

In [None]:
#clean the Region column - withColumn returns a new data frame because data frames are immutable
df1 = df.withColumn("RegionCleaned", when(df.Region.isNull(), 'unknown').otherwise(df.Region))
df1.show()

In [None]:
df1.select("Region", "RegionCleaned").show()

In [None]:
#drop returns a new data frame, so the result must be assigned
df1 = df1.drop("Region")
df1.show()

In [None]:
df1 = df1.withColumnRenamed("RegionCleaned","Region")
df1.show()

In [None]:
#Use filter to remove every row where Country is null
#(filter df1, not df, so the cleaned Region column is kept)

df1 = df1.filter(df1.Country.isNotNull())
df1.show()

In [None]:
df2 = df1.withColumn("PriceCleaned",
                     when(col("Product") == "Product1","1200")
                     .when(col("Product") == "Product2","3600")
                     .otherwise("7500"))
df2.show()

In [None]:
df2.printSchema()

In [None]:
df2 = df2.withColumn("PriceNum", df2["PriceCleaned"].cast("float"))
df2.printSchema()

In [None]:
df3 = df2.drop("Price", "PriceCleaned")
df3.show()

In [None]:
df3 = df3.withColumnRenamed("PriceNum","Price")
df3.show()

In [None]:
df3.dtypes

### To fill missing Latitude and Longitude values, I use two imputation techniques: mean and median imputation

In [None]:
#clean the Latitude column - replace null with 0 in a helper column Lat1
df4 = df3.withColumn("Lat1", when(df3.Latitude.isNull(), 0).otherwise(df3.Latitude))
df4.printSchema()

#### Calculate mean(Latitude) grouped by Country: 

In [None]:
from pyspark.sql.functions import avg, col, when
from pyspark.sql.window import Window
w = Window().partitionBy('Country')

In [None]:
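#avg ignores nulls, so average Latitude itself; averaging Lat1 would count the 0 placeholders and bias the mean
df5 = df4.withColumn('Latitude', when(col('Latitude').isNull(), avg(col('Latitude')).over(w)).otherwise(col('Latitude')))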
df5.show()
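
One point worth keeping in mind with mean imputation: if missing values are first replaced by a placeholder such as 0 and the placeholder is then included in the average, the mean is pulled toward 0. The plain-Python sketch below (made-up numbers, not the notebook's data) imputes each missing latitude with the mean of the known latitudes for the same country, and computes the placeholder-biased mean for comparison.

```python
from collections import defaultdict

# Made-up (country, latitude) pairs; None marks a missing latitude.
records = [
    ("United States", 40.0),
    ("United States", 30.0),
    ("United States", None),
    ("Canada", 50.0),
    ("Canada", None),
]

def impute_group_mean(records):
    """Fill missing latitudes with the mean of the known values per country."""
    known = defaultdict(list)
    for country, lat in records:
        if lat is not None:
            known[country].append(lat)
    means = {country: sum(vals) / len(vals) for country, vals in known.items()}
    # A country with no known latitude at all stays None.
    return [(c, lat if lat is not None else means.get(c)) for c, lat in records]

imputed = impute_group_mean(records)

# For comparison: the US mean when the missing value is counted as 0.
us_zero_filled = [40.0, 30.0, 0.0]
biased_us_mean = sum(us_zero_filled) / len(us_zero_filled)

print(imputed)
print(biased_us_mean)
```

Here the known United States latitudes average 35.0, while counting the placeholder gives roughly 23.3, a value that lies outside the range of the real observations.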

### For Longitude, I apply imputation using a median strategy

In [None]:
df6 = df5.withColumn("Long1", when(df5.Longitude.isNull(), 0).otherwise(df5.Longitude))
df6.show()

In [None]:
longCol = df6.select("Long1")
longCol.show()

In [None]:
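#Use the NumPy library to compute the median of the known (non-null) Longitude values;
#the 0 placeholders in Long1 would otherwise pull the median toward 0
import numpy as np
known_long = df6.filter(col("Longitude").isNotNull()).select("Longitude").collect()
median = float(np.median([row["Longitude"] for row in known_long]))
median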

In [None]:
#replace missing Longitude values with median
from pyspark.sql.functions import lit

df7 = df6.withColumn('Longitude', when(col('Longitude').isNull(), lit(median)).otherwise(col('Longitude')))
df7.show()

In [None]:
#drop the helper columns from df7 (dropping from df6 would discard the median imputation)
df7 = df7.drop("Lat1", "Long1")
df7.show()
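
The median strategy can be summarized in a few lines of plain Python. This is a minimal sketch on made-up longitudes, using the standard-library `statistics` module instead of NumPy so that it runs without extra packages; the variable names are chosen for the example only.

```python
from statistics import median

# Made-up longitudes; None marks a missing value.
longitudes = [-120.0, -80.0, None, -75.0, None, 10.0]

# Compute the median from the known values only.
known = [value for value in longitudes if value is not None]
fill_value = median(known)  # even count: average of the two middle values

# Replace each missing value with that median.
filled = [fill_value if value is None else value for value in longitudes]

print(fill_value)
print(filled)
```

Compared with the mean, the median is less sensitive to a few extreme values, which is one common reason for choosing it when a column contains outliers.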

# PART 3. Getting insights from the data

## Q: Which Product has the highest sales?

In [None]:
group_data = df7.groupBy("Product").agg({'Product':'count'})
group_data.show()

In [None]:
#Product1 has the highest sales
group_data.agg({'count(Product)':'max'}).show()
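
Note that a max aggregate over the counts returns only the largest count, not the product it belongs to. To read off both at once, the counts can be ranked, which is what the later cells do with `.sort(col("count(Product)").desc())`. The same idea in plain Python, on a made-up list of product names:

```python
from collections import Counter

# Made-up sales records, one product name per sale.
products = ["Product1", "Product2", "Product1", "Product3", "Product1", "Product2"]

# Equivalent of groupBy("Product") followed by a count.
counts = Counter(products)

# most_common(1) returns the (product, count) pair with the highest count.
top_product, top_count = counts.most_common(1)[0]

print(counts)
print(top_product, top_count)
```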

## Q: Which Country sells the most (all products)?

In [None]:
group_data2 = df7.groupBy("Country").agg({'Product':'count'})
group_data2.show()

In [None]:
#The US sells the most: 461
group_data2 = df7.groupBy("Country").agg({'Product':'count'}).sort(col("count(Product)").desc())
group_data2.show()

## Q: Which Country sells the most per product?

In [None]:
#Breakdown by product
group_data3 = df7.groupBy("Country", "Product").agg({'Product':'count'}).sort(col("count(Product)").desc())
group_data3.show()

## Breakdown by Region (state) per Country

In [None]:
group_data4 = df7.groupBy("Country", "Region", "Product").agg({'Product':'count'}).orderBy("Country")
group_data4.show()