<a href="https://colab.research.google.com/github/jalorenzo/SparkNotebookColab/blob/master/BDF_04_Operations_on_DataFrames.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 00 - Configuration of Apache Spark on Colaboratory


### Installing Java, Spark, and Findspark


---


This code installs Apache Spark 3.5.0, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [None]:
import os

os.environ["SPARK_VERSION"] = "spark-3.5.0"
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget  http://apache.osuosl.org/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop3.tgz
!tar xf $SPARK_VERSION-bin-hadoop3.tgz
!echo $SPARK_VERSION-bin-hadoop3.tgz
!rm $SPARK_VERSION-bin-hadoop3.tgz
!pip install -q findspark

### Set Environment Variables
Set the locations where Spark and Java are installed.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark/"
os.environ["DRIVE_DATA"] = "/content/gdrive/My Drive/Enseignement/2023-2024/ING3/HPDA/BigDataFrameworks/data/"

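!rm -f /content/spark
!ln -s /content/$SPARK_VERSION-bin-hadoop3 /content/spark
# Each "!" command runs in its own subshell, so an "export" there would be lost; extend PATH from Python instead
os.environ["PATH"] += ":" + os.path.join(os.environ["SPARK_HOME"], "bin") + ":" + os.path.join(os.environ["SPARK_HOME"], "sbin")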
!echo $SPARK_HOME
!env |grep  "DRIVE_DATA"

### Start a SparkSession
This will start a local Spark session.

In [None]:
!python -V

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Example: shows the PySpark version
print("PySpark version {0}".format(sc.version))

# Example: parallelise a list and show its first 2 elements
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)

In [None]:
from pyspark.sql import SparkSession
# We create a SparkSession object (or we retrieve it if it is already created)
spark = SparkSession \
.builder \
.appName("My application") \
.config("spark.some.config.option", "some-value") \
.master("local[4]") \
.getOrCreate()
# We get the SparkContext
sc = spark.sparkContext

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')



---


# 04 - Operations with DataFrames

We are going to see different operations that can be performed with DataFrames:

  - Row filtering
  - Sorting and grouping
  - Joins
  - Scalar functions and aggregations
  - Using them with complex types
  - Window functions
  - User-defined functions

We will finish by seeing how to run SQL queries on DataFrames.


As with reading, Spark can save DataFrames in multiple formats:

- CSV, JSON, Parquet, Hadoop...

It can also write them to a database.

In [None]:
#Retrieve a DataFrame reading it from the Parquet format
dfSE = spark.read\
            .format("parquet")\
            .option("mode", "FAILFAST")\
            .load(os.environ["DRIVE_DATA"] + "dfSE.parquet")
dfSE.cache()

In [None]:
dfSE.show(5)
dfSE.printSchema()

## Filter operations

In [None]:
# Select those posts that contain the word 'Italiano' in their body
from pyspark.sql.functions import col

colBody = col("body")
dfItaliano = dfSE.filter(colBody.like('%Italiano%'))

print("Number of posts with the word Italiano: {0}\n".format(dfItaliano.count()))

print("Show the first line")
dfItaliano.take(1)

In [None]:
# Retrieve the questions (postType == 1) which have an accepted reply (acceptedAnswerID != null)
# Note: where() is an alias of filter()

colPostType = col("postType")
colAcceptedReplyId = col("acceptedAnswerId")

dfQuestionWithAcceptedReply = dfSE\
                    .where((colPostType == 1) & (colAcceptedReplyId.isNotNull()))\
                    .withColumnRenamed("Creation_date", "Date_of_creation")

print("Number of questions with an accepted reply: {0}"\
      .format(dfQuestionWithAcceptedReply.count()))

dfQuestionWithAcceptedReply.cache()

dfQuestionWithAcceptedReply\
        .select("Date_of_creation", colPostType.alias("Post Type"), colAcceptedReplyId)\
        .show(truncate=False)

In [None]:
# Keep the entries corresponding to June 2014
from datetime import date

colCreationDate = col("Date_of_creation")

# Half-open range: if the column holds timestamps, this also keeps posts created during 30 June
dfQuestionWithAcceptedReplyJun14 = dfQuestionWithAcceptedReply\
                    .filter((colCreationDate >= date(2014,6,1)) &
                            (colCreationDate < date(2014,7,1)))

dfQuestionWithAcceptedReplyJun14.select(colCreationDate, colPostType, colAcceptedReplyId).show(truncate=False)
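When a creation-date column holds timestamps rather than plain dates, the upper bound of a date-range filter deserves some care. The plain-Python sketch below (no Spark involved) assumes that a date literal compared with a timestamp is treated as midnight of that day; the sample timestamp is made up for the illustration.

```python
from datetime import datetime

# Hypothetical post created in the afternoon of 30 June 2014
created = datetime(2014, 6, 30, 15, 30)

# Closed upper bound: "30 June" means 30 June at 00:00, so the post is excluded
closed_upper = datetime(2014, 6, 1) <= created <= datetime(2014, 6, 30)

# Half-open range [1 June, 1 July): the whole month is covered
half_open = datetime(2014, 6, 1) <= created < datetime(2014, 7, 1)

print(closed_upper, half_open)
```

A half-open range avoids having to reason about the last instant of the month.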

In [None]:
# Add a column with the ratio between the number of views and the score of the question
colNumViews = col("numViewed")
colPoints = col("score")
dfQuestionWithAcceptedReplyRatio = dfQuestionWithAcceptedReply.withColumn("ratio", colNumViews/colPoints)

# Shows some columns with ratio > 35
colRatio = col("ratio")
dfQuestionWithAcceptedReplyRatio.filter(colRatio > 35)\
                        .select(colCreationDate, colNumViews, colPoints, colRatio)\
                        .show(truncate=False)

## Sorting and grouping operations

In [None]:
# Sort by number of views (numViewed), in descending order
dfQuestionWithAcceptedReply.orderBy(colNumViews, ascending=False)\
                  .select(colCreationDate, colNumViews)\
                  .show(10, truncate=False)

In [None]:
# Grouping by the userId column
colUserId = col("userId")
groupByUser = dfQuestionWithAcceptedReply.groupBy(colUserId)
print(type(groupByUser))

In [None]:
print("DataFrame with the number of posts by user")
dfPostsByUser = groupByUser.count()
dfPostsByUser.printSchema()

colNPosts = col("count")
dfPostsByUser.select(colUserId.alias("User number"),
                        colNPosts.alias("Number of posts"))\
                .orderBy(colUserId).show(10)

In [None]:
print("DataFrame with the average number of views per user")
dfAvgPerUser = groupByUser.avg("numViewed")
dfAvgPerUser.orderBy(colUserId).show(10)

In [None]:
# The 'agg' method allows grouping operations expressed as a dictionary {column_name:operation}
print("Obtain the previous tables with a single operation")
dfCountAvg = groupByUser.agg({"userId":"count", "numViewed":"avg"})
dfCountAvg.printSchema()

colCount = col("count(userId)")
colMedia = col("avg(numViewed)")
dfCountAvg.select(colUserId.alias("User number"),
                   colCount.alias("Number of posts"),
                   colMedia.alias("Views average"))\
                  .orderBy(colUserId).show(10)

In [None]:
# Grouping on two columns
dfSE.groupBy(colUserId, colPostType)\
    .count()\
    .sort(colUserId, colPostType)\
    .show()

A description of the functions used with GroupedData can be found on https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#grouping

### Advanced grouping

Besides plain grouping on several columns, Spark can also compute subtotals across the grouping columns with `rollup` and `cube`.

#### Rollups

A rollup groups by multiple columns and adds hierarchical subtotals: one per value of the first column, plus a grand total.

In [None]:
# For each user, count the number of questions (postType == 1) and the number of replies (postType == 2)
rollupPerUserAndPostType = dfSE.rollup("userId", "postType")
print(type(rollupPerUserAndPostType))

In [None]:
# DataFrame with the number of posts per user and post type
# Null fields are aggregation fields. For example:
# null null = all posts
# 4    null = all posts from user with id 4
# 4    1    = all posts of type 1 from user with id 4
# NOTE: disregard posts with types 4 and 5.
dfPostPerUserAndType = rollupPerUserAndPostType.count()
dfPostPerUserAndType.printSchema()
dfPostPerUserAndType.select(colUserId.alias("User number"),
                             colPostType.alias("Post type"),
                             colNPosts.alias("Number of posts"))\
                     .orderBy(colUserId,colPostType)\
                     .show(100)

#### Cubes

Similar to rollups, but subtotals are computed for every combination of the grouping columns, not only along the hierarchy.

In [None]:
groupByUserAndPostType = dfSE.cube("userId", "postType")
print(type(groupByUserAndPostType))

In [None]:
# DataFrame with the number of posts per user and post type
# Null fields are aggregation fields. For example:
# null null = all posts
# null 1    = all posts of type 1
# 4    null = all posts from user with id 4
# 4    1    = all posts of type 1 from user with id 4
# NOTE: disregard posts with types 4 and 5.
dfPostPerUserAndType = groupByUserAndPostType.count()
dfPostPerUserAndType.printSchema()
dfPostPerUserAndType.select(colUserId.alias("User number"),
                             colPostType.alias("Post type"),
                             colNPosts.alias("Number of posts"))\
                     .orderBy(colUserId,colPostType)\
                     .show(100)
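To make the difference between the two concrete, here is a plain-Python sketch (no Spark involved) of which grouping sets a rollup and a cube produce. The sample rows and the helper names `rollup_sets`, `cube_sets` and `count_by` are made up for this illustration; `None` plays the role of the null aggregation fields.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample rows: (userId, postType)
posts = [(4, 1), (4, 2), (4, 2), (7, 1)]

def rollup_sets(n):
    """Grouping sets of a rollup over n columns: every prefix of the column list."""
    return [tuple(range(k)) for k in range(n, -1, -1)]

def cube_sets(n):
    """Grouping sets of a cube over n columns: every subset of the columns."""
    return [c for k in range(n, -1, -1) for c in combinations(range(n), k)]

def count_by(rows, grouping_sets):
    """Count rows once per grouping set; columns outside the set become None."""
    counts = Counter()
    for row in rows:
        for kept in grouping_sets:
            key = tuple(v if i in kept else None for i, v in enumerate(row))
            counts[key] += 1
    return counts

rollup_counts = count_by(posts, rollup_sets(2))
cube_counts = count_by(posts, cube_sets(2))

print(sorted(rollup_counts.items(), key=str))
print(sorted(cube_counts.items(), key=str))
```

In this sketch the cube contains every rollup row plus the `(None, postType)` subtotals, which matches the extra `null 1` line in the comments above.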

## Joins
Spark offers the possibility of performing multiple types of joins (as in SQL)

  - inner, outer, left outer, right outer, left semi, left anti, cross

In [None]:
# We want to join each question that has an accepted reply with the actual reply chosen as the accepted answer
# We join the colAcceptedReplyId field from the questions with the id field from the answers
dfQuestions = dfQuestionWithAcceptedReply\
                .select(colUserId, colBody, colAcceptedReplyId)\
                .withColumnRenamed("userId", "User question")\
                .withColumnRenamed("body", "Question")\
                .withColumnRenamed("acceptedAnswerId", "ID Accepted Reply")

colId = col("id")
dfReplies = dfSE\
                .select(colId, colUserId, colBody)\
                .where(colPostType == 2)\
                .withColumnRenamed("id", "ID Reply")\
                .withColumnRenamed("userId", "User reply")\
                .withColumnRenamed("body", "Reply")

nQuestions = dfQuestions.count()
nReplies = dfReplies.count()
print("Number of questions with an accepted reply = {0}".format(nQuestions))
print("Number of replies = {0}".format(nReplies))

In [None]:
dfQuestions.show()
dfReplies.show()

In [None]:
# Join expression
joinExpression = dfQuestions["ID Accepted Reply"] == dfReplies["ID Reply"]

In [None]:
# Inner join
# Include only rows for which the joinExpression is true
joinType = "inner"
dfInner = dfQuestions.join(dfReplies, joinExpression, joinType)
nRows = dfInner.count()
print("Number of rows = {0}".format(nRows))
dfInner.show(nRows)

In [None]:
# Outer join
# Include all rows from both DataFrames.
# If a row has no match in the other DataFrame, the missing columns are filled with null.
joinType = "outer"
dfOuter = dfQuestions.join(dfReplies, joinExpression, joinType)
nRows = dfOuter.count()
print("Number of rows = {0}".format(nRows))
dfOuter.show(nRows)

In [None]:
# Left Outer join
# Include all rows from the left DataFrame (first DataFrame)
# If there are no matching values on the right DataFrame, give a null value.
joinType = "left_outer"
dfLOuter = dfQuestions.join(dfReplies, joinExpression, joinType)
nRows = dfLOuter.count()
print("Number of rows = {0}".format(nRows))
dfLOuter.show(nRows)

In [None]:
# Right Outer join
# Include all rows from the right DataFrame (second DataFrame)
# If there are no matching values on the left DataFrame, give a null value.
joinType = "right_outer"
dfROuter = dfQuestions.join(dfReplies, joinExpression, joinType)
nRows = dfROuter.count()
print("Number of rows = {0}".format(nRows))
dfROuter.show(nRows)

In [None]:
# Left Semi join
# The result includes all values from the first DataFrame that also exist in the second one.
joinType = "left_semi"
dfLSemi = dfReplies.join(dfQuestions, joinExpression, joinType)
nRows = dfLSemi.count()
print("Number of rows = {0}".format(nRows))
dfLSemi.show(nRows)

In [None]:
# Left Anti join
# The result includes all values from the first DataFrame that DO NOT exist in the second one.
joinType = "left_anti"
dfLAnti = dfReplies.join(dfQuestions, joinExpression, joinType)
nRows = dfLAnti.count()
print("Number of rows = {0}".format(nRows))
dfLAnti.show(nRows)

In [None]:
# Cross join
# Cartesian product, joins each row from the first DataFrame with all rows from the second one.
# IT IS STRONGLY ADVISED NOT TO USE IT, BECAUSE IT IS EXTREMELY COSTLY
dfCross = dfReplies.crossJoin(dfQuestions)
nRows = dfCross.count()
print("Number of rows = {0}".format(nRows))
dfCross.show(100)
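The join types seen above can be summarised on a toy example. This is a plain-Python sketch of their semantics only (no Spark involved, and Spark does not guarantee any row order); the two small lists are hypothetical stand-ins for the questions and replies DataFrames.

```python
# Hypothetical toy data
questions = [("q1", 10), ("q2", 11), ("q3", 99)]   # (question, accepted reply id)
replies = [(10, "a"), (11, "b"), (12, "c")]        # (reply id, text)

reply_ids = {r[0] for r in replies}
accepted_ids = {q[1] for q in questions}

# inner: only pairs that satisfy the join expression
inner = [(q, r) for q in questions for r in replies if q[1] == r[0]]
# left/right outer: unmatched rows of one side are kept, padded with None
left_outer = inner + [(q, None) for q in questions if q[1] not in reply_ids]
right_outer = inner + [(None, r) for r in replies if r[0] not in accepted_ids]
# (full) outer: unmatched rows of both sides are kept
full_outer = left_outer + [(None, r) for r in replies if r[0] not in accepted_ids]
# left semi / left anti: only the columns of the left side (here: replies) are returned
left_semi = [r for r in replies if r[0] in accepted_ids]
left_anti = [r for r in replies if r[0] not in accepted_ids]
# cross: every row of one side paired with every row of the other
cross = [(r, q) for r in replies for q in questions]

for name, rows in [("inner", inner), ("left_outer", left_outer),
                   ("right_outer", right_outer), ("outer", full_outer),
                   ("left_semi", left_semi), ("left_anti", left_anti),
                   ("cross", cross)]:
    print(name, len(rows))
```

The cross join grows as the product of both sizes, which is why it becomes so costly on real data.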

## Scalar functions and aggregations

Spark offers a wide range of functions to operate on DataFrames:
- Mathematical functions: ``abs``, ``log``, ``hypot``, etc.
- Operations with strings: ``length``, ``concat``, etc.
- Operations with dates: ``year``, ``date_add``, etc.
- Aggregation operations: ``min``, ``max``, ``count``, ``avg``, ``sum``, ``sum_distinct``, ``stddev``, ``variance``, ``kurtosis``, ``skewness``, ``first``, ``last``, ``window``, etc.

A detailed description of those functions can be found on  https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions

In [None]:
from pyspark.sql.functions import datediff, col
colLastActivity = col("lastActivity")
colCreationDate = col("Date_of_creation")

# Search for the question with an accepted answer that was active the longest time
# (i.e. with the highest difference between the LastActivity -"lastActivity"- and Creation_date)

mostActive = dfQuestionWithAcceptedReply.withColumn("ActiveTime",datediff(colLastActivity,colCreationDate))\
            .orderBy("ActiveTime", ascending=False)\
            .head()

print("The question \n\n{0}\n\nhas been active {1} days".\
      format(mostActive.body.replace("&lt;", "<").replace("&gt;", ">"), mostActive.ActiveTime))

In [None]:
from pyspark.sql.functions import window
# Obtain the number of posts per week from each user
# Group by userId and a date-of-creation window of one week
dfQuestionWithAcceptedReply.groupBy(
                   colUserId, window(colCreationDate, "1 week").alias("Week"))\
                  .count()\
                  .sort("count", ascending=False)\
                  .show(20,False)

In [None]:
import pyspark.sql.functions as F

# Compute the average and maximum of the "score" column over all rows, as well as the number of non-null scores
dfSE.select(F.avg(colPoints), F.max(colPoints), F.count(colPoints)).show()

In [None]:
# Again, but using 'describe'
dfSE.select(colPoints).describe().show()

In [None]:
# Score histogram
import matplotlib.pyplot as plt
plt.rcdefaults()

# Obtain a histogram with 20 evenly spaced buckets:
# x holds the 21 bucket boundaries, y the number of scores in each bucket
x, y = dfSE.select(colPoints).rdd.flatMap(lambda row: row).histogram(20)

# Clean the graph
plt.gcf().clear()

# Draw each bar from its bucket's left boundary, as wide as the bucket
plt.bar(x[:-1], y, width=x[1] - x[0], align='edge')
plt.xlabel('Score')
plt.ylabel('Number of occurrences')
plt.title('Histogram')

plt.show()
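Calling `histogram` with a number of buckets returns two lists, the bucket boundaries and the count per bucket, which is why the plotting code drops the last boundary with `x[:-1]`. The plain-Python sketch below mimics the evenly spaced case under the assumption that every bucket is open on the right except the last one, which also includes the maximum. The helper name `even_histogram` and the sample values are hypothetical, and the sketch assumes the values are not all equal.

```python
def even_histogram(values, n_buckets):
    """Return (edges, counts) for n_buckets evenly spaced buckets between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets          # assumes hi > lo
    edges = [lo + i * width for i in range(n_buckets + 1)]
    counts = [0] * n_buckets
    for v in values:
        # The maximum would fall just past the last bucket, so clamp it into it
        i = min(int((v - lo) / width), n_buckets - 1)
        counts[i] += 1
    return edges, counts

edges, counts = even_histogram([0, 1, 2, 3, 4, 10], 5)
print(edges)   # one more boundary than there are buckets
print(counts)
```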


## Complex types

Spark works with three types of complex data: `structs`, `arrays` and `maps`

### Structs

A struct groups several named fields into a single column, like a DataFrame nested inside a DataFrame.

In [None]:
from pyspark.sql.functions import struct
# Create a new DF with a column that combines two existing columns
colNumViews = col("numViewed")
colNReplies = col("nAnswers")
dfStruct = dfSE.select(colId, colNumViews, colNReplies, struct(colNumViews, colNReplies)\
               .alias("Viewed_Replied"))
dfStruct.show(5)

In [None]:
# Obtain a field of the combined column
dfStruct.select(col("Viewed_Replied").getField("numViewed")).show(5)


### Arrays

Array columns let us work with the data as if they were Python lists.

*Example*

Obtain the number of *tags* for each question with an accepted reply and replace ``&lt;`` and ``&gt;`` with ''<'' and ''>''

  - The "tags" of each question are stored concatenated, delimited by ''<'' and ''>'', which are encoded as ``&lt;`` and ``&gt;``

`&lt;english-comparison&gt;&lt;translation&gt;&lt;phrase-request&gt;`
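Before doing this with Spark, it can help to see the string manipulation on the sample value alone. The following plain-Python sketch (no Spark involved) only illustrates the idea: splitting on the boundary between two tags leaves the outermost entities in place, so a second clean-up step is needed.

```python
tags = "&lt;english-comparison&gt;&lt;translation&gt;&lt;phrase-request&gt;"

# Splitting on the "&gt;&lt;" boundary between tags keeps the leading "&lt;"
# of the first tag and the trailing "&gt;" of the last one
raw_parts = tags.split("&gt;&lt;")

# Removing the remaining entities yields the clean tag names
clean_parts = [p.replace("&lt;", "").replace("&gt;", "") for p in raw_parts]

print(raw_parts)
print(clean_parts, len(clean_parts))
```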

In [None]:
# First, obtain a DataFrame without null tags
dfSE.show(10)
dfNotNullTags = dfSE.dropna("any", subset=["tags"])
dfNotNullTags.show(10)

In [None]:
# Add a column with the tags split into an array
from pyspark.sql.functions import split
colTags = col("tags")
dfTags = dfNotNullTags.withColumn("tag_array", split(colTags, "&gt;&lt;"))
dfTags.select(colTags, col("tag_array")).show(10, False)

In [None]:
from pyspark.sql.functions import size
# Show the number of tags of each entry
colTag_array = col("tag_array")
dfTags.select(colTag_array, size(colTag_array)).show(5, False)

In [None]:
# Show the second tag of each entry
dfTags.selectExpr("tag_array", "tag_array[1]").show(5, False)

In [None]:
from pyspark.sql.functions import array_contains
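# Look up whether "usage" appears in the tags.
# The array elements still carry the outer entities, so "&lt;usage" only matches
# entries where usage is the first of several tags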
dfTags.withColumn("With_usage", array_contains(colTag_array, "&lt;usage"))\
      .select(colTag_array, col("With_usage")).show(5, False)

In [None]:
from pyspark.sql.functions import explode
# Turn each tag into its own row
dfTagsRows = dfTags.withColumn("Tags2", explode(colTag_array))
dfTagsRows.select(colTags, col("Tags2")).show(10, False)

In [None]:
# Remove the symbols &lt; and &gt;
from pyspark.sql.functions import regexp_replace
dfTags = dfTagsRows.withColumn("Tags_splitted", regexp_replace("Tags2", "&[lg]t;", ""))\
                   .drop("Tags2")
dfTags.select(colTags, col("Tags_splitted")).show(10, False)

In [None]:
# Number of entries with the "word-choice" tag
print("Number of entries with the word-choice tag = {0}"
      .format(dfTags
      .filter(col("Tags_splitted") == "word-choice")
      .count()))

### Maps

They are created from columns that work as key-value pairs.

In [None]:
from pyspark.sql.functions import create_map
# Create a column with a key-value map
# key: Creation_date, value: lastActivity
dfMap = dfSE.select(create_map(col("Creation_date"), col("lastActivity"))\
            .alias("Dates"))
dfMap.show(5, False)

In [None]:
# We can conduct a search using the key
dfMap.selectExpr("Dates['2013-11-10 19:58:02.1']").show(5, False)

## Window functions

Like aggregation functions, window functions operate on groups of rows, but they return a value for every input row instead of a single value per group. Among other things, this makes it possible:

  - To obtain moving averages
  - To calculate cumulative sums
  - To access values from rows before or after the current row

Basically, a window function calculates a value for each input row from a table based on a group of rows, called *frame*.

As window functions we can use the aggregation functions previously seen as well as other additional functions (``cume_dist``, ``dense_rank``, ``lag``, ``lead``, ``ntile``, ``percent_rank``, ``rank``, ``row_number``) specified as *Window function* in https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#window
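As a rough illustration of the frame idea, the plain-Python sketch below (no Spark involved) computes, for every row, a per-user maximum, a cumulative sum and the previous and next values within each user's rows. The sample rows and their layout are made up for this example; the comments name the Spark concept each step loosely corresponds to.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (userId, day, score)
rows = [(1, 1, 5), (1, 4, 9), (1, 6, 2), (2, 3, 7)]

result = []
# Partition by user: each user's rows are processed as a separate group.
# Sorting the tuples also orders every group by day.
for user, group in groupby(sorted(rows), key=itemgetter(0)):
    frame = list(group)
    max_score = max(score for _, _, score in frame)   # like max(...) over the partition
    running = 0
    for i, (_, day, score) in enumerate(frame):
        running += score                               # cumulative sum up to the current row
        prev_day = frame[i - 1][1] if i > 0 else None               # like lag(day, 1)
        next_day = frame[i + 1][1] if i + 1 < len(frame) else None  # like lead(day, 1)
        result.append((user, day, score, max_score, running, prev_day, next_day))

for row in result:
    print(row)
```

Note that one output row is produced per input row, unlike a grouped aggregation, which would return one row per user.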

#### Example 1
From the ``dfQuestionWithAcceptedReply`` DataFrame, show the maximum score (column "score") per user and, for each question, the difference between the question's score and the user's maximum score.


In [None]:
from pyspark.sql.window import Window

# Specify the window that partitions the rows by the userId column
window = Window.partitionBy(colUserId)
print(type(window))


In [None]:
# Create a column with the maximum score per user
colMaxPoints = F.max(colPoints).over(window)
print(type(colMaxPoints))

In [None]:
# Obtain a new DataFrame including the maximum score per user and the difference
# between this maximum and each question score
dfQuestionWithAcceptedReply.select(colUserId, colId.alias("Question"),
                          colPoints, colMaxPoints.alias("maxPerUser"))\
                  .withColumn("Difference", colMaxPoints-colPoints)\
                  .orderBy(colUserId, colId)\
                  .show(30)

#### Example 2
For each user and question in the ``dfQuestionWithAcceptedReply`` DataFrame, show the number of days elapsed since the user's previous question and the number of days until the user's next one.

In [None]:
# Specify the window to partition the rows by the userId column and sort them by creation day
window = Window.partitionBy(colUserId).orderBy(colCreationDate)

In [None]:
# Create a column to reference the previous question (in date)
PreviousCol = F.lag(colCreationDate, 1).over(window)
# Create a column to reference the following question (in date)
FollowingCol = F.lead(colCreationDate, 1).over(window)

# Show for each user and question the days since the previous question and the days until the following one
dfQuestionWithAcceptedReply.select(colUserId, colId, colCreationDate.alias("Creation Date"),
                          F.datediff(colCreationDate,PreviousCol).alias("Days from"),
                          F.datediff(FollowingCol,colCreationDate).alias("Days until"))\
                  .orderBy(colUserId, colId)\
                  .show(30, truncate=False)

## User-Defined Functions (UDFs)

If we need a function that is not implemented, we can create our own function to operate on columns.

**Note:**
  - UDFs in Python may be quite inefficient, because the data has to be serialised between the JVM and the Python process
  - It is recommended to code them in Scala or Java (and then call them from Python)


#### Example

Use UDFs to obtain the number of *tags* for each question and replace ``&lt;`` and ``&gt;`` with ''<'' and ''>''

  - The "tags" from each question are stored concatenated, separated by  ''<'' and ''>'', and coded as ``&lt;`` and ``&gt;``

`&lt;english-comparison&gt;&lt;translation&gt;&lt;phrase-request&gt;`

To count the number of tags, it is enough to count the number of times ``&lt;`` appears in the string.

In [None]:
colTags = col("tags")
# Obtain a DataFrame without null tags
dfNoNullTags = dfSE.na.drop("any", subset=["tags"])

In [None]:
from pyspark.sql.functions import udf

# Define a function that returns the number of &lt; in a string
def countTags(tags):
    return tags.count('&lt;')

# Define a function that replaces &lt; and &gt; by < and >
def replaceTags(tags):
    return tags.replace('&lt;', '<').replace('&gt;', '>')

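# Create udfs from these functions
# (udf() returns strings by default, so declare the integer return type explicitly)
from pyspark.sql.types import IntegerType
udfCountTags = udf(countTags, IntegerType())
udfReplaceTags = udf(replaceTags)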

In [None]:
dfNoNullTags.select(udfReplaceTags(colTags).alias("Tags"),\
                          udfCountTags(colTags).alias("nTags"))\
                  .show(truncate=False)


**NOTE:** Only Python and Swift are officially supported languages on Colaboratory. To create the UDFs in Scala using Colaboratory, please follow [these instructions](https://medium.com/@shadaj/machine-learning-with-scala-in-google-colaboratory-e6f1661f1c88) to install and configure a Scala kernel. Otherwise, the following two code blocks will not work.

In [None]:
// Create the previous functions in Scala
def countTagsSc(tags:String):Int = tags.split("&lt;").size - 1
def replaceTagsSc(tags:String):String = tags.replace("&lt;", "<").replace("&gt;", ">")

// Register those functions as a Spark SQL function
spark.udf.register("udfCountTagsSc", countTagsSc(_:String):Int)
spark.udf.register("udfReplaceTagsSc", replaceTagsSc(_:String):String)

In [None]:
dfNoNullTags.printSchema()
# Call the Scala UDFs using an expression
dfNoNullTags.selectExpr("udfReplaceTagsSc(tags) AS Tags",
                              "udfCountTagsSc(tags) AS nTags")\
                  .show(truncate=False)

## Using SQL commands

SQL commands executed from Spark are converted to operations on DataFrames

 - It is possible to run remote commands using the JDBC/ODBC server [Thrift](https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine)
 - It can also work with stored data in [Apache Hive](https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)

To use SQL commands on a DataFrame, the DataFrame must be registered as a *table* or *view*.

 - The view can be created as a temporary one (it is dropped when the session ends) or as a global temporary one (shared by all sessions of the same Spark application and dropped when the application ends).


In [None]:
# Registers the dfQuestionWithAcceptedReply DataFrame as a temporary view
dfQuestionWithAcceptedReply.createOrReplaceTempView("table_QuestionWithAcceptedReply")

# Create a table with the data stored in Parquet
spark.sql("CREATE TABLE table_SE USING PARQUET OPTIONS (path '"+os.environ["DRIVE_DATA"] + "dfSE.parquet" + "')")


In [None]:
spark.sql("SELECT * FROM table_SE").printSchema()

In [None]:
# Run a SQL command on the table contents
dfUser100 = spark.sql("""SELECT userId,id FROM table_SE
                         WHERE userId >= 100""")
dfUser100.show(5)

In [None]:
# Show the created tables
spark.sql("SHOW TABLES").show()

In [None]:
# Create a new DataFrame from one of the tables
dfFromTable = spark.sql("SELECT * FROM table_QuestionWithAcceptedReply")
dfFromTable.show(5)

In [None]:
spark.sql("DROP VIEW IF EXISTS table_QuestionWithAcceptedReply")
spark.sql("DROP TABLE IF EXISTS table_SE")

spark.sql("SHOW TABLES").show()



---

# Exercises


## Exercise 4.1: Pi Estimation

Using the Monte Carlo method, estimate the value of Pi. Use the random() function from the random module.

In [None]:
import random
import numpy as np



## Exercise 4.2: Inspect a log file

Upload the file /var/log/syslog from your computer to this notebook. Then, select only the "bad lines": WARNING and ERROR messages.