Coursebook: python-data-analysis-3rd-edition.pdf
Reference: chapter 5 of the coursebook
Dataset for experiments: choose any dataset you like
-
Lesson 1: Install PySpark and set up a Spark environment on Anaconda
References:
Pyspark on Anaconda: https://sparkbyexamples.com/pyspark/install-pyspark-in-anaconda-jupyter-notebook/?fbclid=IwAR3RNBGx4c6eKrxLTK9OP6PLX4TXflgIXbw9QYiYL8Icnrw6lb3Cc81dGNU
Spark installation on Windows: https://sparkbyexamples.com/spark/apache-spark-installation-on-windows/?fbclid=IwAR1k0Qu9ggAWIBkSRU9Q33pDCpp3nG8HQtoUPKnvK0NvilIj8ntP7IdKtvo
-
Lesson 2: Read, write and validate data for CSV and JSON files
Create spark session
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('csvReader').getOrCreate()
Read data
path = 'dataset/'  # dataset path based on your definition; note the trailing slash so path + filename forms a valid path

# read a csv file
# inferSchema tells Spark to infer the column types,
# header tells it that the first line of the csv file is the header
students = spark.read.csv(path + 'students.csv', inferSchema=True, header=True)

# read a json file
people = spark.read.json(path + 'people.json')
Define data structure
from pyspark.sql.types import StructField, StringType, DateType, StructType

list_schema = [StructField("name", StringType(), True),
               StructField("timestamp", DateType(), True)]
data_struct = StructType(fields=list_schema)

# read the json file again, this time with the explicitly defined schema
people = spark.read.json(path + 'people.json', schema=data_struct)
- ex1.ipynb for csv data
- ex2.ipynb for json data
- dataset for homework: https://drive.google.com/file/d/1bw7pEgXSVLyMuaI_s3FPa5smNKHsu7-c/view?fbclid=IwAR1XUrTk0Oj0k26f2mS889ZkQEGx3FCI4i7rdO3zNVi5ZM-DpahqUCX8aN4
- homework (completed): exercises 1, 2, 3, 4, 5, 10 at Read_Write_and_Validate_HW.ipynb
-
Lesson 3: Data manipulation
Data types in PySpark
# import the module
import pyspark.sql.types
# or
from pyspark.sql.types import *

# classes available in the pyspark.sql.types module:
# DataType, NullType, StringType, BinaryType, BooleanType,
# DateType, TimestampType, DecimalType, DoubleType, FloatType,
# ByteType, IntegerType, LongType, ShortType,
# ArrayType, MapType, StructField, StructType
How to change datatype of a column
# cast is a method on Column objects, so it is not imported from pyspark.sql.functions
# a column in a dataframe can be referenced in two ways
# for example, if we have a dataframe df and a column called views:
df.views
# or
from pyspark.sql.functions import col
col('views')

# if the views column has string type and we want to change it to integer type:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df = df.withColumn('views', col('views').cast(IntegerType()))
# or
df = df.withColumn('views', df.views.cast(IntegerType()))
- dataset for this lesson: https://www.kaggle.com/datasets/datasnaek/youtube-new?fbclid=IwAR1GafFaK6Pm1-voK-LRwJGG8Lgk1QWEd09UE661dVNZcAHqfnR_5-4ybN8#USvideos.csv
- ex2-1.ipynb for lesson 1 revision
- ex2-2.ipynb for Youtube trending dataset from above URL
-
Lesson 4: Data manipulation (cont)
Regular expression (regex)
# example: a regex that matches an http/https URL
regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
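The pattern can be tried out in plain Python with the re module before using it in Spark. A quick sketch (the sample text is invented):

```python
import re

# the URL regex from above
pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# findall returns every substring matching the pattern
text = 'see https://spark.apache.org/docs and http://example.com/a%20b for details'
urls = re.findall(pattern, text)
print(urls)
```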
Assignment files:
- Questions 1-6 (completed): Manipulating_Data_in_DataFrames_HW_Q1-6.ipynb
- Questions 7-9 (completed): Manipulating_Data_in_DataFrames_HW_Q7-9.ipynb
Dataset file for assignments
# dataset for assignment questions 1-6
dataset_path = r'dataset\TweetHandle\ExtractedTweets.csv'
regexp_extract method
Example
from pyspark.sql.functions import regexp_extract, col

# group 2 of the pattern captures the mention itself;
# (.*) rather than (.) lets the mention appear anywhere in the tweet
pattern = '(.*)(@LatinoLeader)(.*)'
df = data.withColumn('Latino_mentions', regexp_extract(col('Tweet'), pattern, 2))
regexp_replace method
Example
from pyspark.sql.functions import regexp_replace

# regular expression pattern to match URLs
pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

clean = data.withColumn('clean', regexp_replace(data.Tweet, pattern, '')).select('clean')
clean.show(truncate=False)
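regexp_replace behaves like Python's re.sub, so the cleaning step can be sanity-checked locally. A small sketch with an invented tweet:

```python
import re

# same URL pattern used with regexp_replace above
pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# re.sub replaces every match with the given string, here an empty string
tweet = 'great read https://example.com/post everyone should see it'
clean = re.sub(pattern, '', tweet)
print(clean)
```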
-
Lesson 5: Search and filter dataframe
Dataset path for this lesson
path = 'dataset/fifa19.csv'
groupBy method
# groupBy must be used together with an aggregate function.
# the simplest case counts the rows in each group:
group_df = df.groupBy("Nationality").count()
group_df.show()

# the mean and standard deviation of each group can be computed with agg:
from pyspark.sql.functions import mean, stddev
df.groupBy("Nationality").agg(mean("Age"), stddev("Age")).show()
orderBy method
from pyspark.sql.functions import desc, asc

df.orderBy(desc("Age")).show()
df.orderBy(asc("Age")).show()
Select with where clause method
# select data by condition using a where clause
fifa_df.select("Name", "Club").where(fifa_df.Club.like("%Barcelona%")).show(truncate=False)
substr method
# substr(-3, 3) takes the last three characters of the Photo column
photo_ext = fifa_df.select("Photo", fifa_df.Photo.substr(-3, 3).alias("File_extension"))
photo_ext.show()
isin, startswith, endswith methods
# isin keeps rows whose column value appears in the given list
df1 = fifa_df.select("Name", "Club").where(fifa_df.Club.isin(["Barcelona", "Juventus"]))

# startswith / endswith filter on the beginning / end of a string column
df2 = fifa_df.select("Name").where(fifa_df.Name.startswith("B")).where(fifa_df.Name.endswith("a"))
filter method
from pyspark.sql import functions as F

filtered_df = fifa_df.filter(F.col("Name").isin(["L. Messi", "Cristiano Ronaldo"]))
Homework (completed): Search and Filter DataFrames in PySpark-HW.ipynb file
-
Lesson 6: Aggregating dataframes
Dataset for this lesson
dataset_path = 'dataset/nyc_air_bnb.csv'
Assignment: Aggregating_DataFrames_in_PySpark_HW.ipynb file
agg method
from pyspark.sql import functions as F

grouped_df = airbnb.groupBy("host_id")

# total number of reviews for each host; sum adds up the review counts,
# whereas count would only count the number of listings per host
total_reviews_per_host = grouped_df.agg(F.sum("number_of_reviews").alias("total_reviews"))
Differences between the agg and summary methods
- agg applies aggregate functions to specific columns, usually per group after groupBy
- summary computes descriptive statistics (count, mean, stddev, min, quartiles, max) over the whole dataframe
- Reference: https://docs.google.com/document/d/1Yc8x1z35s85CVD9MD7Dzq1g3H5DhkXtX/edit?fbclid=IwAR0uhm3RTzgNFhyKLdjZYQP6f5H2ACVPIa8DnrtrqFBnl7vnQjR40CHQT88#heading=h.fvb53qjbsunw
pivot function
pivot example:
from pyspark.sql.functions import col, avg, round

# filter for private and shared room types (replace with actual names if different)
filtered_df = airbnb.filter(col("room_type").isin(["Private room", "Shared room"]))

# filter for Manhattan and Brooklyn only (replace with actual names if different)
filtered_df = filtered_df.filter(
    col("neighbourhood_group").isin(["Manhattan", "Brooklyn"])
)

# list the room types to include (avoids duplicates and an extra scan for distinct values)
room_types_list = ["Private room", "Shared room"]

# use pivot to create a two-by-two table
avg_price_per_listing = filtered_df.groupBy("neighbourhood_group") \
    .pivot("room_type", room_types_list) \
    .agg(round(avg("price"), 2).alias("avg_price"))

# display the results
avg_price_per_listing.show()
Pivot table exercise with Excel: https://docs.google.com/document/d/1ba0nUB39PFDT98mcFNbcUwQcP9QEioGHt2eT41ZT020/edit?fbclid=IwZXh0bgNhZW0CMTAAAR3falVwqutUmP9BI3THRc26uSi_gHbXvB8WZ09_AeSkyRBMtmDuv8m8yEc_aem_AZPPutD_di3mu-QesDCKG1H4u0Q7yhx7LNf0PA_jp5m6sWhDGE9tVx9e-fKj4Vo5HzoOZ0fqYBbgZOX8YHFRJGqQ#heading=h.dklr7czbvy76
Pivot table exercises: folder pivot_ex/dataset (all Excel files, .xlsx)
-
Lesson 7: Joining and appending dataframes
SQL Join
Reference: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
dataset for this lesson:
dataset_path = '/dataset/uw-madison-courses/'
Assignment for the lesson: Joining and Appending dataframe.ipynb
-
Lesson 8: Joining and appending dataframes (cont)
-
Lesson 9: Handling missing data
- Dataset for sample code: https://www.kaggle.com/datasets/himanshupoddar/zomato-bangalore-restaurants
- Sample code file: Handling_Missing_Data_in_Pyspark.ipynb
- Assignment file: Handling_Missing_Data_in_Pyspark_HW.ipynb
- Assignment dataset: https://www.kaggle.com/datasets/meinertsen/new-york-city-taxi-trip-hourly-weather-data
-
Lesson 10: Statistics with Excel
coursebook: https://drive.google.com/file/d/1Xh_j_l350RghLNXyRAaBfRP9J8ViQuKD/view?fbclid=IwZXh0bgNhZW0CMTEAAR3an0jxeAX0Tjf8oI43NuyMrVSzii0TTHRJxkX64vSTdchx4W4YV7kkZng_aem_AR1ZNkaMbYUHfz5fffr2RJlfH5lQ_lvZah_2kg-QALq4REoZh1PZdkBnB0-4xCmsdJZMtPeyXKvOC3rPOUtB2v-C
Exercise: draw a bar plot, a pie plot and a Pareto plot
-
Lesson 11: Statistics with Excel (cont)
# exercise files in the statistic-with-excel folder
# lesson content:
# mean = sum of the values / number of values
# median = the middle value of a sorted list of values
# mode = the most frequent value in a list
# variance:
#   population variance: σ² = Σ(xᵢ - μ)² / N
#   sample variance:     s² = Σ(xᵢ - x̄)² / (n - 1)
# standard deviation:
#   population standard deviation: σ = sqrt(Σ(xᵢ - μ)² / N)
#   sample standard deviation:     s = sqrt(Σ(xᵢ - x̄)² / (n - 1))
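The same statistics can be cross-checked against Excel with Python's standard statistics module (the sample data below is invented):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # sum of the values / number of values
median = statistics.median(data)  # middle value of the sorted list
mode = statistics.mode(data)      # most frequent value

pop_var = statistics.pvariance(data)  # σ² = Σ(xᵢ - μ)² / N
samp_var = statistics.variance(data)  # s² = Σ(xᵢ - x̄)² / (n - 1)
pop_std = statistics.pstdev(data)     # σ, the square root of the population variance
samp_std = statistics.stdev(data)     # s, the square root of the sample variance

print(mean, median, mode, pop_var, samp_var, pop_std, samp_std)
```

Note the divisor: pvariance/pstdev divide by N, while variance/stdev divide by n - 1, matching Excel's VAR.P/STDEV.P versus VAR.S/STDEV.S.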