# NSS Tutorial

Field	| Type
--------|-------
ukprn	| varchar(8)
institution	| varchar(100)
subject	| varchar(60)
level	| varchar(50)
question	| varchar(10)
A_STRONGLY_DISAGREE	| int(11)
A_DISAGREE	| int(11)
A_NEUTRAL	| int(11)
A_AGREE	| int(11)
A_STRONGLY_AGREE	| int(11)
A_NA	| int(11)
CI_MIN	| int(11)
score	| int(11)
CI_MAX	| int(11)
response	| int(11)
sample	| int(11)
aggregate	| char(1)

National Student Survey 2012

The National Student Survey <http://www.thestudentsurvey.com/> is presented to thousands of graduating students in UK Higher Education. The survey asks 22 questions; students can respond to each with STRONGLY DISAGREE, DISAGREE, NEUTRAL, AGREE or STRONGLY AGREE. The values in the corresponding columns are PERCENTAGES of the students who responded with that answer.

The table `nss` has one row per institution, subject and question.

In [1]:
import $ivy.`org.apache.spark::spark-sql:3.4.0`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val spark = {
    NotebookSparkSession.builder()
    .progress(false)
    .appName("app08+")
    // .master("spark://192.168.31.31:7077")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", 
            "hdfs://192.168.31.31:9000/user/hive/warehouse") 
    .config("spark.cores.max", "4") 
    .config("spark.executor.instances", "1") 
    .config("spark.executor.cores", "2") 
    .config("spark.executor.memory", "10g") 
    .config("spark.shuffle.service.enabled", "false") 
    .config("spark.dynamicAllocation.enabled", "false") 
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .config("spark.driver.allowMultipleContexts", "true")
    .getOrCreate()
}

Loading spark-stubs, spark-hive
Adding Hive conf dir /opt/hive/conf to classpath
Creating SparkSession


SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.


import $ivy.$

import org.apache.log4j.{Level, Logger}

import org.apache.spark._

import org.apache.spark.sql._

import org.apache.spark.sql.functions._

spark: SparkSession = org.apache.spark.sql.SparkSession@69e0463e

In [2]:
import spark.implicits._
def sc = spark.sparkContext
val hiveCxt = new org.apache.spark.sql.hive.HiveContext(sc)

import spark.implicits._

defined function sc
hiveCxt: sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@1342b565

In [3]:
// Credit to Aivean
implicit class RichDF(val ds:DataFrame) {
    def showHTML(limit: Int = 50, truncate: Int = 100) = {
        import xml.Utility.escape
        val data = ds.take(limit)
        val header = ds.schema.fieldNames.toSeq        
        val rows: Seq[Seq[String]] = data.map { row =>
          row.toSeq.map {cell =>
            val str = cell match {
              case null => "null"
              case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
              case array: Array[_] => array.mkString("[", ", ", "]")
              case seq: Seq[_] => seq.mkString("[", ", ", "]")
              case _ => cell.toString
            }
            if (truncate > 0 && str.length > truncate) {
              // do not show ellipses for strings shorter than 4 characters.
              if (truncate < 4) str.substring(0, truncate)
              else str.substring(0, truncate - 3) + "..."
            } else {
              str
            }
          }: Seq[String]
        }
    publish.html(s""" <table>
                <tr>
                 ${header.map(h => s"<th>${escape(h)}</th>").mkString}
                </tr>
                ${rows.map {row =>
                  s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
                }.mkString}
            </table>
        """)
    }
}

defined class RichDF

In [4]:
val nss = hiveCxt.table("sqlzoo.nss")

nss: DataFrame = [ukprn: string, institution: string ... 15 more fields]

## 1. Check out one row

The example shows the number who responded for:

- question 1
- at 'Edinburgh Napier University'
- studying '(8) Computer Science'

**Show the percentage who STRONGLY AGREE**

In [5]:
(nss.filter((nss("question")==="Q01") && 
            (nss("institution")==="Edinburgh Napier University") && 
            (nss("subject")==="(8) Computer Science"))
 .select("A_STRONGLY_AGREE")
 .showHTML())

A_STRONGLY_AGREE
23


## 2. Calculate how many agree or strongly agree

**Show the institution and subject where the score is at least 100 for question 15.**

In [6]:
(nss.filter((nss("question")==="Q15") && (nss("score")>=100))
 .select("institution", "subject")
 .showHTML())

institution,subject
Kingston College,(I) Education
"Royal Holloway, University of London",(L) Geographical Studies
Solihull College,(I) Education
Stafford College,(D) Business and Administrative studies
University of Southampton,(E) Mass Communications and Documentation
University of Wolverhampton,(7) Mathematical Sciences
University of Leicester,(2) Subjects allied to Medicine
University of Newcastle upon Tyne,(E) Mass Communications and Documentation
"Bishop Grosseteste University College, Lincoln",(F) Languages
Universities of East Anglia and Essex; Joint Provision at University Campus Suffolk,(G) Historical and Philosophical studies


## 3. Unhappy Computer Students

**Show the institution and score where the score for '(8) Computer Science' is less than 50 for question 'Q15'**

In [7]:
(nss.filter((nss("question")==="Q15") && 
           (nss("subject")==="(8) Computer Science") && 
           (nss("score")<50))
 .select("institution", "score")
 .showHTML())

institution,score
Blackburn College,45
North Lindsey College,30
Plymouth College of Art,47
Somerset College of Arts and Technology,48
"University of Wales, Newport",30
Universities of East Anglia and Essex; Joint Provision at University Campus Suffolk,45


## 4. More Computing or Creative Students?

**Show the subject and total number of students who responded to question 22 for each of the subjects '(8) Computer Science' and '(H) Creative Arts and Design'.**

> _HINT_    
> You will need to use SUM over the response column and GROUP BY subject
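
The hint's "filter, then SUM grouped by subject" step can be sketched with plain Scala collections before running it on Spark. The `ResponseRow` case class and every number below are made up for illustration; they are not taken from the `nss` table.

```scala
// Hypothetical rows: made-up numbers, NOT values from the nss table.
final case class ResponseRow(subject: String, question: String, response: Int)

val responseRows = Seq(
  ResponseRow("(8) Computer Science", "Q22", 40),
  ResponseRow("(8) Computer Science", "Q22", 60),
  ResponseRow("(H) Creative Arts and Design", "Q22", 150),
  ResponseRow("(H) Creative Arts and Design", "Q01", 999), // other question: filtered out
  ResponseRow("(7) Mathematical Sciences", "Q22", 30)       // other subject: filtered out
)

val wantedSubjects = Set("(8) Computer Science", "(H) Creative Arts and Design")

// WHERE question = 'Q22' AND subject IN (...), then GROUP BY subject with SUM(response)
val totalBySubject: Map[String, Int] =
  responseRows
    .filter(r => r.question == "Q22" && wantedSubjects(r.subject))
    .groupBy(_.subject)
    .map { case (subject, rs) => subject -> rs.map(_.response).sum }
```

With these made-up rows, `totalBySubject` maps Computer Science to 100 and Creative Arts and Design to 150. The DataFrame version has the same shape: `filter`, `groupBy("subject")`, then `agg(sum("response"))`.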

In [8]:
(nss.filter((nss("question")==="Q22") && 
            (nss("subject").isin(List(
                "(8) Computer Science", "(H) Creative Arts and Design"): _*)))
 .groupBy("subject")
 .agg(sum("response"))
 .showHTML())

subject,sum(response)
(8) Computer Science,10252
(H) Creative Arts and Design,33336


## 5. Strongly Agree Numbers

**Show the subject and total number of students who A_STRONGLY_AGREE to question 22 for each of the subjects '(8) Computer Science' and '(H) Creative Arts and Design'.**

> _HINT_    
> The A_STRONGLY_AGREE column is a percentage. To work out the total number of students who strongly agree you must multiply this percentage by the number who responded (response) and divide by 100 - take the SUM of that.
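
As a rough illustration of the hint's arithmetic, the sketch below converts each row's percentage into an estimated head count (`response * percentage / 100`) and sums those per subject. `AgreeRow` and its values are hypothetical, chosen so the products are easy to check by hand.

```scala
// Hypothetical rows: made-up numbers, NOT values from the nss table.
final case class AgreeRow(subject: String, response: Int, stronglyAgreePct: Int)

val agreeRows = Seq(
  AgreeRow("(8) Computer Science", 40, 25),          // 25% of 40  = 10 students
  AgreeRow("(8) Computer Science", 60, 50),          // 50% of 60  = 30 students
  AgreeRow("(H) Creative Arts and Design", 150, 40)  // 40% of 150 = 60 students
)

// SUM(response * A_STRONGLY_AGREE / 100) grouped by subject.
// Dividing by 100.0 keeps the fractional part instead of truncating per row.
val stronglyAgreeBySubject: Map[String, Double] =
  agreeRows
    .groupBy(_.subject)
    .map { case (subject, rs) =>
      subject -> rs.map(r => r.response * r.stronglyAgreePct / 100.0).sum
    }
```

Here the totals come out as 40.0 and 60.0. On real data the sum is usually fractional, because the stored percentages are whole numbers and each per-row count is therefore only an estimate.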

In [9]:
(nss.withColumn("n_strongly_agree", nss("response")*nss("A_STRONGLY_AGREE")/lit(100))
     .filter((nss("question")==="Q22") &&
             (nss("subject").isin(List(
                 "(8) Computer Science", "(H) Creative Arts and Design"): _*)))
    .select($"subject", $"n_strongly_agree")
    .groupBy("subject")
    .sum()
    .showHTML())

subject,sum(n_strongly_agree)
(8) Computer Science,3421.22
(H) Creative Arts and Design,12107.539999999988


## 6. Strongly Agree, Percentage

**Show the percentage of students who A_STRONGLY_AGREE to question 22 for the subject '(8) Computer Science', and show the same figure for the subject '(H) Creative Arts and Design'.**

Use the **ROUND** function to show the percentage without decimal places.

In [10]:
(nss.withColumn("n_sa", nss("A_STRONGLY_AGREE")*nss("response"))
    .filter((nss("question")==="Q22") &&
            (nss("subject").isin(List(
                "(8) Computer Science", "(H) Creative Arts and Design"): _*)))
    .select("subject", "n_sa", "response")
    .groupBy("subject")
    .sum()
    .withColumn("pct", round(col("sum(n_sa)")/col("sum(response)"), 0))
    .select("subject", "pct")
    .showHTML())

subject,pct
(8) Computer Science,33.0
(H) Creative Arts and Design,36.0


## 7. Scores for Institutions in Manchester

**Show the average scores for question 'Q22' for each institution that includes 'Manchester' in the name.**

The column **score** is a percentage - you must use the method outlined above to multiply the percentage by the **response** and divide by the total response. Give your answer rounded to the nearest whole number.
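
The response-weighted average described here can be sketched in plain Scala as `sum(response * score) / sum(response)`. The `ScoreRow` class, the institution names and the numbers are all hypothetical; the unweighted mean is included only to show why the weighting matters.

```scala
// Hypothetical rows: made-up institutions and numbers, NOT values from the nss table.
final case class ScoreRow(institution: String, response: Int, score: Int)

val scoreRows = Seq(
  ScoreRow("Hypothetical University", 10, 90), // small course, high score
  ScoreRow("Hypothetical University", 90, 70), // large course, lower score
  ScoreRow("Example College", 50, 80)
)

// Weighted mean, rounded to the nearest whole number.
// Assumes rs is non-empty and the responses sum to a positive number.
def weightedScore(rs: Seq[ScoreRow]): Long = {
  val weighted = rs.map(r => r.response.toLong * r.score).sum
  val total    = rs.map(_.response.toLong).sum
  math.round(weighted.toDouble / total)
}

val scoreByInstitution: Map[String, Long] =
  scoreRows.groupBy(_.institution).map { case (name, rs) => name -> weightedScore(rs) }

// Unweighted mean of the two Hypothetical University scores, for comparison.
val naiveMean: Double = (90 + 70) / 2.0
```

For "Hypothetical University" the weighted score is (10 × 90 + 90 × 70) / 100 = 72, while the unweighted mean of 90 and 70 is 80: the larger course dominates, which is the behaviour the question asks for.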

In [11]:
(nss.withColumn("score", nss("response")*nss("score"))
    .filter((nss("question")==="Q22") && 
            (nss("institution").contains("Manchester")))
    .select("institution", "score", "response")
    .groupBy("institution")
    .sum()
    .withColumn("score", round(col("sum(score)")/col("sum(response)"), 0))
    .select("institution", "score")
    .showHTML())

institution,score
Manchester Metropolitan University,81.0
University of Manchester,83.0
The Manchester College,72.0


## 8. Number of Computing Students in Manchester

**Show the institution, the total sample size and the number of computing students for institutions in Manchester for 'Q01'.**
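
One way to read this question is as two sums over the same group: the full `sample`, and a conditional sum that counts `sample` only for computing rows and 0 otherwise. A minimal plain-Scala sketch of that idea follows; `SampleRow`, the institution names and the numbers are made up for illustration.

```scala
// Hypothetical rows: made-up institutions and numbers, NOT values from the nss table.
final case class SampleRow(institution: String, subject: String, sample: Int)

val sampleRows = Seq(
  SampleRow("Hypothetical University", "(8) Computer Science", 30),
  SampleRow("Hypothetical University", "(H) Creative Arts and Design", 70),
  SampleRow("Example College", "(H) Creative Arts and Design", 20)
)

// Per institution: (total sample, sample counted only for computing rows).
val sampleByInstitution: Map[String, (Int, Int)] =
  sampleRows.groupBy(_.institution).map { case (name, rs) =>
    val total = rs.map(_.sample).sum
    // Equivalent of SUM(CASE WHEN subject = '(8) Computer Science' THEN sample ELSE 0 END)
    val comp = rs.map(r => if (r.subject == "(8) Computer Science") r.sample else 0).sum
    (name, (total, comp))
  }
```

An institution with no computing rows still appears in the result with a computing count of 0 (as "Example College" does here), which a plain filter on subject would not give.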

In [12]:
(nss.filter(($"question"==="Q01") && 
            ($"institution".contains("Manchester")))
 .select($"institution", $"sample", $"subject",
         when($"subject"==="(8) Computer Science", nss("sample"))
         .otherwise(lit(0)).alias("comp"))
 .groupBy("institution")
 .sum()
 .showHTML())

institution,sum(sample),sum(comp)
Manchester Metropolitan University,6994,310
University of Manchester,8065,180
The Manchester College,537,46


In [13]:
spark.stop()