Coursebook: python-data-analysis-3rd-edition.pdf
Reference: chapter 5 of the coursebook
Dataset for experiments: choose any dataset you like
-
Lesson 1: Install PySpark and set up a Spark environment on Anaconda
References:
Pyspark on Anaconda: https://sparkbyexamples.com/pyspark/install-pyspark-in-anaconda-jupyter-notebook/?fbclid=IwAR3RNBGx4c6eKrxLTK9OP6PLX4TXflgIXbw9QYiYL8Icnrw6lb3Cc81dGNU
Spark installation on Windows: https://sparkbyexamples.com/spark/apache-spark-installation-on-windows/?fbclid=IwAR1k0Qu9ggAWIBkSRU9Q33pDCpp3nG8HQtoUPKnvK0NvilIj8ntP7IdKtvo
-
Lesson 2: Read, write and validate data for CSV and JSON files
Create spark session
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('csvReader').getOrCreate()
Read data
path = 'dataset/'  # dataset path based on your definition; note the trailing slash so path + filename forms a valid path

# read a csv file
# inferSchema tells Spark to infer the column types,
# header tells it that the first line of the csv file is the header
students = spark.read.csv(path + 'students.csv', inferSchema=True, header=True)

# read a json file
people = spark.read.json(path + 'people.json')
Define data structure
from pyspark.sql.types import StructField, StringType, DateType, StructType

list_schema = [StructField("name", StringType(), True),
               StructField("timestamp", DateType(), True)]
data_struct = StructType(fields=list_schema)

# read the json file again, this time with the explicitly defined schema
people = spark.read.json(path + 'people.json', schema=data_struct)
- ex1.ipynb for csv data
- ex2.ipynb for json data
- dataset for homework: https://drive.google.com/file/d/1bw7pEgXSVLyMuaI_s3FPa5smNKHsu7-c/view?fbclid=IwAR1XUrTk0Oj0k26f2mS889ZkQEGx3FCI4i7rdO3zNVi5ZM-DpahqUCX8aN4
- homework (completed): exercises 1, 2, 3, 4, 5, 10 at Read_Write_and_Validate_HW.ipynb
-
Lesson 3: Data manipulation
Data types in PySpark
# import the module
import pyspark.sql.types
# or
from pyspark.sql.types import *

# classes available in the pyspark.sql.types module:
# DataType, NullType, StringType, BinaryType, BooleanType,
# DateType, TimestampType, DecimalType, DoubleType, FloatType,
# ByteType, IntegerType, LongType, ShortType,
# ArrayType, MapType, StructField, StructType
How to change datatype of a column
# cast is a method on Column objects, so it is not imported from pyspark.sql.functions
# a column in a dataframe can be referenced in two ways
# for example, if we have a dataframe df and a column called views:
df.views
# or
from pyspark.sql.functions import col
col('views')

# if the views column has string type and we want to change it to integer type:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df = df.withColumn('views', col('views').cast(IntegerType()))
# or
df = df.withColumn('views', df.views.cast(IntegerType()))
- dataset for this lesson: https://www.kaggle.com/datasets/datasnaek/youtube-new?fbclid=IwAR1GafFaK6Pm1-voK-LRwJGG8Lgk1QWEd09UE661dVNZcAHqfnR_5-4ybN8#USvideos.csv
- ex2-1.ipynb for lesson 1 revision
- ex2-2.ipynb for Youtube trending dataset from above URL
-
Lesson 4: Data manipulation (cont)
Regular expression (regex)
# example: a regex that matches an http/https URL
regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
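The pattern can be tried out in plain Python with the re module before using it in Spark. A quick sketch (the sample text is invented):

```python
import re

# the URL regex from above
pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# findall returns every substring matching the pattern
text = 'see https://spark.apache.org/docs and http://example.com/a%20b for details'
urls = re.findall(pattern, text)
print(urls)
```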
Assignment files:
- Questions 1-6 (completed): Manipulating_Data_in_DataFrames_HW_Q1-6.ipynb
- Questions 7-9 (completed): Manipulating_Data_in_DataFrames_HW_Q7-9.ipynb
Dataset file for assignments
# dataset for assignment questions 1-6
dataset_path = r'dataset\TweetHandle\ExtractedTweets.csv'
regexp_extract method
Example
from pyspark.sql.functions import regexp_extract, col

# group 2 of the pattern captures the mention itself;
# (.*) rather than (.) lets the mention appear anywhere in the tweet
pattern = '(.*)(@LatinoLeader)(.*)'
df = data.withColumn('Latino_mentions', regexp_extract(col('Tweet'), pattern, 2))
regexp_replace method
Example
from pyspark.sql.functions import regexp_replace

# regular expression pattern to match URLs
pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

clean = data.withColumn('clean', regexp_replace(data.Tweet, pattern, '')).select('clean')
clean.show(truncate=False)
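regexp_replace behaves like Python's re.sub, so the cleaning step can be sanity-checked locally. A small sketch with an invented tweet:

```python
import re

# same URL pattern used with regexp_replace above
pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# re.sub replaces every match with the given string, here an empty string
tweet = 'great read https://example.com/post everyone should see it'
clean = re.sub(pattern, '', tweet)
print(clean)
```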
-
Lesson 5: Search and filter dataframe
Dataset path for this lesson
path = 'dataset/fifa19.csv'
groupBy method
# groupBy must be used together with an aggregate function.
# the simplest case counts the rows in each group:
group_df = df.groupBy("Nationality").count()
group_df.show()

# the mean and standard deviation of each group can be computed with agg:
from pyspark.sql.functions import mean, stddev
df.groupBy("Nationality").agg(mean("Age"), stddev("Age")).show()
orderBy method
from pyspark.sql.functions import desc, asc

df.orderBy(desc("Age")).show()
df.orderBy(asc("Age")).show()
Select with where clause method
# select data by condition using a where clause
fifa_df.select("Name", "Club").where(fifa_df.Club.like("%Barcelona%")).show(truncate=False)
substr method
# substr(-3, 3) takes the last three characters of the Photo column
photo_ext = fifa_df.select("Photo", fifa_df.Photo.substr(-3, 3).alias("File_extension"))
photo_ext.show()
isin, startswith, endswith methods
# isin keeps rows whose column value appears in the given list
df1 = fifa_df.select("Name", "Club").where(fifa_df.Club.isin(["Barcelona", "Juventus"]))

# startswith / endswith filter on the beginning / end of a string column
df2 = fifa_df.select("Name").where(fifa_df.Name.startswith("B")).where(fifa_df.Name.endswith("a"))
filter method
from pyspark.sql import functions as F

filtered_df = fifa_df.filter(F.col("Name").isin(["L. Messi", "Cristiano Ronaldo"]))
Homework (completed): Search and Filter DataFrames in PySpark-HW.ipynb file
-
Lesson 6: Aggregating dataframes
Dataset for this lesson
dataset_path = 'dataset/nyc_air_bnb.csv'
Assignment: Aggregating_DataFrames_in_PySpark_HW.ipynb file
agg method
from pyspark.sql import functions as F

grouped_df = airbnb.groupBy("host_id")

# total number of reviews for each host; sum adds up the review counts,
# whereas count would only count the number of listings per host
total_reviews_per_host = grouped_df.agg(F.sum("number_of_reviews").alias("total_reviews"))
Differences between the agg and summary methods
- agg applies aggregate functions to specific columns, usually per group after groupBy
- summary computes descriptive statistics (count, mean, stddev, min, quartiles, max) over the whole dataframe
- Reference: https://docs.google.com/document/d/1Yc8x1z35s85CVD9MD7Dzq1g3H5DhkXtX/edit?fbclid=IwAR0uhm3RTzgNFhyKLdjZYQP6f5H2ACVPIa8DnrtrqFBnl7vnQjR40CHQT88#heading=h.fvb53qjbsunw
pivot function
pivot example:
from pyspark.sql.functions import col, avg, round

# filter for private and shared room types (replace with actual names if different)
filtered_df = airbnb.filter(col("room_type").isin(["Private room", "Shared room"]))

# filter for Manhattan and Brooklyn only (replace with actual names if different)
filtered_df = filtered_df.filter(
    col("neighbourhood_group").isin(["Manhattan", "Brooklyn"])
)

# list the room types to include (avoids duplicates and an extra scan for distinct values)
room_types_list = ["Private room", "Shared room"]

# use pivot to create a two-by-two table
avg_price_per_listing = filtered_df.groupBy("neighbourhood_group") \
    .pivot("room_type", room_types_list) \
    .agg(round(avg("price"), 2).alias("avg_price"))

# display the results
avg_price_per_listing.show()
Pivot table exercise with Excel: https://docs.google.com/document/d/1ba0nUB39PFDT98mcFNbcUwQcP9QEioGHt2eT41ZT020/edit?fbclid=IwZXh0bgNhZW0CMTAAAR3falVwqutUmP9BI3THRc26uSi_gHbXvB8WZ09_AeSkyRBMtmDuv8m8yEc_aem_AZPPutD_di3mu-QesDCKG1H4u0Q7yhx7LNf0PA_jp5m6sWhDGE9tVx9e-fKj4Vo5HzoOZ0fqYBbgZOX8YHFRJGqQ#heading=h.dklr7czbvy76
Pivot table exercises: folder pivot_ex/dataset (all Excel files, .xlsx)
-
Lesson 7: Joining and appending dataframes
SQL Join
Reference: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
dataset for this lesson:
dataset_path = '/dataset/uw-madison-courses/'
Assignment for the lesson: Joining and Appending dataframe.ipynb
-
Lesson 8: Joining and appending dataframes (cont)
-
Lesson 9: Handling missing data
- Dataset for sample code: https://www.kaggle.com/datasets/himanshupoddar/zomato-bangalore-restaurants
- Sample code file: Handling_Missing_Data_in_Pyspark.ipynb
- Assignment file: Handling_Missing_Data_in_Pyspark_HW.ipynb
- Assignment dataset: https://www.kaggle.com/datasets/meinertsen/new-york-city-taxi-trip-hourly-weather-data
-
Lesson 10: Statistics with Excel
coursebook: https://drive.google.com/file/d/1Xh_j_l350RghLNXyRAaBfRP9J8ViQuKD/view?fbclid=IwZXh0bgNhZW0CMTEAAR3an0jxeAX0Tjf8oI43NuyMrVSzii0TTHRJxkX64vSTdchx4W4YV7kkZng_aem_AR1ZNkaMbYUHfz5fffr2RJlfH5lQ_lvZah_2kg-QALq4REoZh1PZdkBnB0-4xCmsdJZMtPeyXKvOC3rPOUtB2v-C
Exercise: draw a bar plot, a pie plot and a Pareto plot
-
Lesson 11: Statistics with Excel (cont)
# exercise files in the statistic-with-excel folder
# lesson content:
# mean = sum of the values / number of values
# median = the middle value of a sorted list of values
# mode = the most frequent value in a list
# variance:
#   population variance: σ² = Σ(xᵢ - μ)² / N
#   sample variance:     s² = Σ(xᵢ - x̄)² / (n - 1)
# standard deviation:
#   population standard deviation: σ = sqrt(Σ(xᵢ - μ)² / N)
#   sample standard deviation:     s = sqrt(Σ(xᵢ - x̄)² / (n - 1))
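The same statistics can be cross-checked against Excel with Python's standard statistics module (the sample data below is invented):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # sum of the values / number of values
median = statistics.median(data)  # middle value of the sorted list
mode = statistics.mode(data)      # most frequent value

pop_var = statistics.pvariance(data)  # σ² = Σ(xᵢ - μ)² / N
samp_var = statistics.variance(data)  # s² = Σ(xᵢ - x̄)² / (n - 1)
pop_std = statistics.pstdev(data)     # σ, the square root of the population variance
samp_std = statistics.stdev(data)     # s, the square root of the sample variance

print(mean, median, mode, pop_var, samp_var, pop_std, samp_std)
```

Note the divisor: pvariance/pstdev divide by N, while variance/stdev divide by n - 1, matching Excel's VAR.P/STDEV.P versus VAR.S/STDEV.S.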