In [20]:
%%spark
println("Application Id: " + spark.sparkContext.applicationId )
println("Application Name: " + spark.sparkContext.appName)

## Set up variables

These will need to be changed for each environment.

In [None]:
val storageAccountName = "mgdcag" // replace with your storage account name
val storageContainerName = "sites-dashboard" // replace with your container name

// Storage path
val adls_path = s"abfss://$storageContainerName@$storageAccountName.dfs.core.windows.net"

// Sites path
val latestSitesPath = adls_path + s"/sites/latest"


spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

## Read the datasets into DFs (Data Frames)
These are the files created by the MGDC copy tool.

In [None]:
val sitesDF =
    spark
      .read
      .format("json")
      .option("recursiveFileLookup", "false")
      .load(latestSitesPath)

In [None]:
val sitesCount = sitesDF.count()
println(s"The number of sites: $sitesCount")

# Enrich the Data

Pretty sure this is called feature engineering 

This is where we add value as SharePoint CSAs. The PG have provided the pizza base, now we need to add those toppings

## Add columns
### Using UDFs (User-Defined Functions):
You can define custom UDFs and use them to create new columns based on your specific logic.

We will use this to add a boolean column for OneDrive sites.
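
A minimal sketch of the check behind this column, written as a plain Scala function so it can be tried without a Spark session. It assumes that `Url` could be null for some rows, which is an assumption and not something the dataset is known to do; in that case a bare `siteUrl.contains(...)` inside a UDF would throw a `NullPointerException`. The name `isOneDriveUrl` and the sample URLs are hypothetical.

```scala
// Hypothetical helper: true when the URL points at a OneDrive site.
// Option(...) turns a null URL into None, so null rows give false instead of throwing.
def isOneDriveUrl(siteUrl: String): Boolean =
  Option(siteUrl).exists(_.contains("-my.sharepoint.com"))

// Made-up sample URLs
println(isOneDriveUrl("https://contoso-my.sharepoint.com/personal/user_contoso_com"))
println(isOneDriveUrl("https://contoso.sharepoint.com/sites/hr"))
println(isOneDriveUrl(null))
```

In the notebook this could probably be wrapped with `udf(isOneDriveUrl _)` and applied to `$"Url"` in the same way as the UDF in the next cell.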

In [None]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// returns true if site is OneDrive
// Slightly different from the example above as I was getting Scala errors
val isOneDrive = udf((siteUrl: String) => siteUrl.contains("-my.sharepoint.com"))

val sitesDFOD = sitesDF.withColumn("OneDriveSite", isOneDrive($"Url"))

## Enrich DF with API data

We want to call an API and then append data to the DF based on the response.

The function below makes an API call and returns the response as a string. It is used further down in the notebook.

In [None]:
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.HttpClients
import org.apache.http.util.EntityUtils

// Makes a GET request and returns the response body as a string.
// The client and the response are closed even if the call fails.
def makeAPICall(apiUrl: String): String = {
  val httpClient = HttpClients.createDefault()
  try {
    val response = httpClient.execute(new HttpGet(apiUrl))
    try {
      EntityUtils.toString(response.getEntity)
    } finally {
      response.close()
    }
  } finally {
    httpClient.close()
  }
}

We will first call the last activity API to get details around file and site activity

There is a C# Azure function that contains a number of endpoints that this solution uses

In [None]:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.json4s._
import org.json4s.jackson.JsonMethods._

// This is an Azure function that is hooked up to my tenant. Call it if you want :)
val apiUrl = "https://site-function-ag.azurewebsites.net/api/GetLastUserActivityForSites?timePeriod=D180"

// Call the API
val apiResponseJson = makeAPICall(apiUrl)

// Now you can parse apiResponseJson and process it
val jsonRDD = spark.sparkContext.parallelize(Seq(apiResponseJson))

// Load JSON data into a DataFrame without specifying the schema
val jsonDF = spark.read.json(jsonRDD)
    .withColumnRenamed("SiteId", "Id") // rename the "SiteId" column as we use it for the join
    .select("Id", "lastActivityDate", "activeFileCount", "pageViewCount", "fileCount") // using select as we don't want to add all columns

// Join with existing dataset
val sitesDFODLA = sitesDFOD.join(jsonDF, "Id")


//sitesDFODLA.show()
display(sitesDFODLA.filter("Id == '9b88c7ff-6b3f-4df0-9f64-ec6ec52bbb54'"))


The next step is to call the API for each item in the DF.

We need to do this because we want further details for each site, so that we can enrich the dataset with site-specific data.

To do that, extract a list of Id values from the DataFrame and then iterate through that list, performing the actions for each Id. A general outline:

1. Extract a list of Id values from the DataFrame.
2. Iterate through the list of Id values.
3. For each Id, perform the desired actions.

```scala
import org.apache.spark.sql.functions._

// Assuming you have a DataFrame named "df" with a column "Id"
val idList: Array[String] = df.select("Id").distinct().collect().map(row => row.getString(0))

// Iterate through the list of Id values
for (id <- idList) {
  // Perform actions for each Id
  println(s"Processing Id: $id")

  // You can call your API or perform other actions here
}
```
This may not be the most efficient approach, but for our use case it makes sense. Speed is not a concern at this point.
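
The per-site calls later in this notebook put the site URL into a query string. A minimal sketch of building such a request URL with the parameter values percent-encoded, assuming the endpoint accepts standard form-encoded query parameters (what the function actually requires is not shown here). `buildRequestUrl`, the placeholder host, and the sample values are hypothetical; the parameter names are the ones used in this notebook.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper: appends form-encoded query parameters to a base URL.
def buildRequestUrl(baseUrl: String, params: Seq[(String, String)]): String = {
  def enc(value: String): String = URLEncoder.encode(value, StandardCharsets.UTF_8.name())
  val query = params.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")
  s"$baseUrl?$query"
}

val requestUrl = buildRequestUrl(
  "https://example.invalid/api/GetAdditionalSiteInfo", // placeholder host
  Seq(
    "siteId"  -> "00000000-0000-0000-0000-000000000000",        // made-up Id
    "siteUrl" -> "https://contoso.sharepoint.com/sites/hr team" // made-up URL with a space
  )
)
println(requestUrl)
```

`URLEncoder` applies form encoding, so a space becomes `+` and characters such as `:` and `/` are percent-encoded. Without encoding, a site URL containing `&` or a space would change the meaning of the query string.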


## First item to initialise the DFs
There may be a way to do this without using the first item to initialise the DFs, but I'm still learning.
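
One possible way to avoid seeding the DataFrames with the first item is to collect the raw JSON responses into a plain Scala collection first and hand them to Spark in a single read. The sketch below shows only the collection step, in plain Scala; `fetchSiteInfo` is a hypothetical stand-in for the real API call and returns canned JSON, and the Ids are placeholders.

```scala
// Hypothetical stand-in for the API call; returns a canned JSON document per Id.
def fetchSiteInfo(id: String): String =
  s"""{"SiteId":"$id","Lists":[]}"""

val ids = Seq("site-a", "site-b", "site-c") // placeholder Ids

// One response string per Id, with no special handling for the first item.
val responses: Seq[String] = ids.map(fetchSiteInfo)

responses.foreach(println)

// In the notebook, the collected strings could then be read in one go, e.g.
//   val allResponsesDF = spark.read.json(spark.createDataset(responses))
// (assumes spark.implicits._ is in scope to supply the String encoder),
// so Spark infers a single schema across all responses.
```

Reading everything at once also sidesteps the repeated `union` calls, which match columns by position and so rely on every response producing the same schema.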

In [None]:
import org.apache.spark.sql.functions._

// Collect the distinct Id values from sitesDFODLA to the driver
val idList: Array[String] = sitesDFODLA.select("Id").distinct().collect().map(row => row.getString(0))




In [None]:
// Pop the first item from the list to use as the schema
val firstItem = idList.head

// Perform actions for first Id
println(s"Processing Id: $firstItem")

var itemResult = sitesDFODLA.filter(s"Id == '$firstItem'").select("url", "Owner.AadObjectId").collect()
//var primaryAdminId = sitesDFODLA.filter(s"Id == '$firstItem'").select("Owner.AadObjectId")

var url = itemResult(0).getString(0)
var primaryAdminId = itemResult(0).getString(1)

val baseApiUrl = "https://site-function-ag.azurewebsites.net/api/GetAdditionalSiteInfo"

var requestUrl = s"$baseApiUrl?siteId=$firstItem&siteUrl=$url&primaryAdminId=$primaryAdminId"   

// You can call your API or perform other actions here
val apiResponseJson = makeAPICall(s"$requestUrl")

// Create a new row with ApiResponse and append it to the DataFrame
val newRowRDD = spark.sparkContext.parallelize(Seq(apiResponseJson))
val newRowDF = spark.read.json(newRowRDD)

// Check if list column had Values
if (newRowDF.select("Lists").first().get(0) == null) {
  println("No lists found")
} else {
  println("Lists found")
}

// Explode the lists
val explodedDF = newRowDF.select(col("Lists")).withColumn("exploded_data", explode(col("Lists")))

// List DF (also mutable)
var listDF = explodedDF.select(col("*"), col("exploded_data.*")).drop("Lists", "exploded_data")

var apiResponseDF = newRowDF.drop("Lists")

display(apiResponseDF)
display(listDF)

In [None]:

val remainingItems = idList.tail
//val remainingItems = idList.tail.take(5) // using 5 as we are in dev - don't want to call the API 100s of times

// This is another Azure function that is hooked up to my tenant. Call it if you want :)
val baseApiUrl = "https://site-function-ag.azurewebsites.net/api/GetAdditionalSiteInfo"

for (id <- remainingItems) {
    // Perform actions for each Id
    println(s"Processing Id: $id")

    var itemResult = sitesDFODLA.filter(s"Id == '$id'").select("url", "Owner.AadObjectId").collect()
    //var primaryAdminId = sitesDFODLA.filter(s"Id == '$firstItem'").select("Owner.AadObjectId")

    var url = itemResult(0).getString(0)
    var primaryAdminId = itemResult(0).getString(1)

    var requestUrl = s"$baseApiUrl?siteId=$id&siteUrl=$url&primaryAdminId=$primaryAdminId"   
    
    // You can call your API or perform other actions here
    val apiResponseJson = makeAPICall(s"$requestUrl")

    // Create a new row with ApiResponse and append it to the DataFrame
    val newRowRDD = spark.sparkContext.parallelize(Seq(apiResponseJson))
    // Create a DataFrame from the new row
    val newRowDF = spark.read.json(newRowRDD)

    // Explode the lists
    val explodedDF = newRowDF.select(col("Lists")).withColumn("exploded_data", explode(col("Lists")))
    var listRowDF = explodedDF.select(col("*"), col("exploded_data.*")).drop("Lists", "exploded_data")

    // Add the Lists to the listDF (created in above cell)
    listDF = listDF.union(listRowDF)

    // Add the APIresponse but drop the lists
    apiResponseDF = apiResponseDF.union(newRowDF.drop("Lists"))

}

display(apiResponseDF)
display(listDF)

## Join back with the main dataset
We kind of want to have one master dataset, as it will make the Power BI task easier.


In [None]:
val sitesMoreDetails = apiResponseDF
    .withColumnRenamed("SiteId", "Id")

// Join with existing dataset
// val sitesDFODLAMORE = sitesDFODLA.join(apiResponseDF, "Id")

// We are going to join backwards as we only have 6 items in debug - Prod would use the above
val sitesDFODLAMORE = sitesMoreDetails.join(sitesDFODLA, "Id")

display(sitesDFODLAMORE)

## Big Value Data Points

This is where the real magic happens. With the data in the DF it's possible to work out previous version storage. What an insight, and we haven't even iterated over every object.

In the MGDC sites data set we can make the following assumption

`PreviousVersionSize = TotalSize - TotalFileStreamSize - MetadataSize`

With the additional data we now have, we can make a far better assumption. We calculate the storage used in drives by calling the drives and getting the size used. We could probably even remove the metadata size.

`PreviousVersionSize = storageUsedInDrives - TotalFileStreamSize`
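
A worked sketch of the two estimates above in plain Scala. The byte counts are made up for illustration and are not taken from any tenant, and the function names are hypothetical.

```scala
// First estimate, from the MGDC sites dataset alone.
def previousVersionSizeFromSite(totalSize: Long, totalFileStreamSize: Long, metadataSize: Long): Long =
  totalSize - totalFileStreamSize - metadataSize

// Second estimate, using the storage reported by the drives.
def previousVersionSizeFromDrives(storageUsedInDrives: Long, totalFileStreamSize: Long): Long =
  storageUsedInDrives - totalFileStreamSize

// Illustrative values in bytes (made up).
val totalSize = 1000L
val totalFileStreamSize = 600L
val metadataSize = 50L
val storageUsedInDrives = 900L

println(previousVersionSizeFromSite(totalSize, totalFileStreamSize, metadataSize)) // 350
println(previousVersionSizeFromDrives(storageUsedInDrives, totalFileStreamSize))   // 300
```

With these numbers the two estimates differ by 50 bytes, which is whatever else `TotalSize` counts beyond drive storage and metadata.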

This is just one example of what we can do with just a few extra toppings added to this marvellous MGDC-flavoured pizza.

We will use a UDF again, as we did at the start.

In [None]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// returns the estimated previous version size
// (storage used in drives minus the total file stream size)
val previousVersionSize = udf((storageUsedInDrives: BigInt, totalFileStreamSize: BigInt) => 
    storageUsedInDrives - totalFileStreamSize
)
// Add the PreviousVersionSize column to the joined dataset
// TotalSize - TotalFileStreamSize - MetadataSize - storageUsedPreservationHold
val sitesDFODLAMOREPV = sitesDFODLAMORE
    .withColumn("PreviousVersionSize", previousVersionSize($"storageUsedInDrives", $"StorageMetrics.TotalFileStreamSize"))

val pvColumns: DataFrame = sitesDFODLAMOREPV.select("Id", "OneDriveSite", "PreviousVersionSize", "lastActivityDate")
// using the truncate = false parameter to see full column values
pvColumns.show(20, truncate = false)

## Write back to blob storage

We need to write our new dataset back to blob storage. We will drop it in another location.

We also need to write out the list DF from earlier.

In [None]:
val latestSitesEnhanced = adls_path + s"/sitesenhanced/latest/"
sitesDFODLAMOREPV
    .repartition(1)
    .write
    .format("json")
    .mode("overwrite")
    .save(latestSitesEnhanced)


val latestLists= adls_path + s"/lists/latest/"
listDF
    .repartition(1)
    .write
    .format("json")
    .mode("overwrite")
    .save(latestLists)
