# Homework 2

## Description

### Data
[News Popularity in Multiple Social Media Platforms Data Set](https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms) - 13 CSV files, 155MB in total  

This dataset contains a large set of news items and their respective social feedback on Facebook, Google+ and LinkedIn.


### Format
One CSV file with news data records and 12 CSV files with social feedback.
Each social feedback file contains the feedback from one of the social platforms {Facebook, Google+, LinkedIn} on one of the topics {Economy, Microsoft, Palestine, Obama}, giving 3 x 4 = 12 files.

#### News Data Variables
Each record contains 11 attributes

1. IDLink (numeric): Unique identifier of news items 
2. Title (string): Title of the news item according to the official media sources 
3. Headline (string): Headline of the news item according to the official media sources 
4. Source (string): Original news outlet that published the news item 
5. Topic (string): Query topic used to obtain the items in the official media sources 
6. PublishDate (timestamp): Date and time of the news item's publication
7. SentimentTitle (numeric): Sentiment score of the text in the news item's title
8. SentimentHeadline (numeric): Sentiment score of the text in the news item's headline
9. Facebook (numeric): Final value of the news item's popularity according to the social media source Facebook
10. GooglePlus (numeric): Final value of the news item's popularity according to the social media source Google+
11. LinkedIn (numeric): Final value of the news item's popularity according to the social media source LinkedIn

#### Social Feedback Variables
Each record contains 145 attributes

1. IDLink (numeric): Unique identifier of news items 
2. TS1 (numeric): Level of popularity in time slice 1 (0-20 minutes after publication)
3. TS2 (numeric): Level of popularity in time slice 2 (20-40 minutes after publication)
4. TS... (numeric): Level of popularity in time slice ...
5. TS144 (numeric): Final level of popularity, 2 days after publication
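
Since each time slice covers 20 minutes, 144 slices span exactly 48 hours, so three consecutive slices make up one hour and 72 make up one day. The mapping from a slice number to its hour and day can be sketched as below. The helper names `hourOfSlice` and `dayOfSlice` are hypothetical and only illustrate the arithmetic; they are not part of the dataset or of the notebook.

```scala
// Hypothetical helpers: map a 1-based time-slice number (TS1..TS144)
// to a 0-based hour and day index, assuming 20-minute slices.
val SlicesPerHour = 3
val HoursPerDay = 24

def hourOfSlice(ts: Int): Int = (ts - 1) / SlicesPerHour
def dayOfSlice(ts: Int): Int = hourOfSlice(ts) / HoursPerDay
```

With this mapping, TS1 to TS3 fall in hour 0, TS144 falls in hour 47, and TS73 is the first slice of day 1.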


### Task
4 subtasks:
+ (20pt) In social feedback data, calculate the average popularity of each news item by hour, and by day, respectively
+ (20pt) In news data, calculate the sum and average sentiment score of each topic, respectively
+ (30pt) In news data, count the words in two fields: ‘Title’ and ‘Headline’ respectively, and list the most frequent words according to the term frequency in descending order, in total, per day, and per topic, respectively
+ (30pt) From the previous subtask, for the top-100 frequent words per topic in titles and headlines, calculate their co-occurrence matrices (100x100), respectively. Each entry in the matrix will contain the co-occurrence frequency in all news titles and headlines, respectively
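
The last two subtasks can be illustrated on a toy example with plain Scala collections. This is only a sketch under several assumptions that the task text does not fix: the hypothetical `tokenize` lower-cases the text and splits on non-letters, co-occurrence is counted as the number of texts in which both words appear, the diagonal is left at zero, and the top 3 words stand in for the top 100. The sample titles are made up.

```scala
// Hypothetical tokenizer: lower-case, then split on anything that is not a-z.
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

// The n most frequent words, by descending count, ties broken alphabetically.
def topWords(docs: Seq[String], n: Int): Vector[String] = {
  val counts = docs.flatMap(tokenize).groupBy(identity).map { case (w, ws) => (w, ws.size) }
  counts.toVector.sortBy { case (w, c) => (-c, w) }.take(n).map(_._1)
}

// matrix(i)(j) = number of texts that contain both vocab(i) and vocab(j), i != j.
def cooccurrence(docs: Seq[String], vocab: Vector[String]): Array[Array[Int]] = {
  val index = vocab.zipWithIndex.toMap
  val matrix = Array.ofDim[Int](vocab.size, vocab.size)
  for (doc <- docs) {
    val present = tokenize(doc).distinct.flatMap(w => index.get(w))
    for (i <- present; j <- present if i != j) matrix(i)(j) += 1
  }
  matrix
}

// Made-up sample titles, for illustration only.
val docs = Seq(
  "Obama meets Microsoft",
  "Microsoft economy grows",
  "Obama economy plan economy"
)
val vocab = topWords(docs, 3)
val matrix = cooccurrence(docs, vocab)
```

The resulting matrix is symmetric because each pair is counted in both directions. On the full dataset the same logic would run once per topic and once per field (title, headline).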

### Implementation Issues
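+ Each social feedback record has a large number of attributes (145), so every record is flattened into one key-value pair per time slice before aggregation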

## Implementation

In [1]:
// Pre-configured SparkSession is available as spark

println("Spark Entity:       " + spark)
println("Spark version:      " + spark.version)
println("Spark master:       " + spark.sparkContext.master)
println("Running 'locally'?: " + spark.sparkContext.isLocal)

Spark Entity:       org.apache.spark.sql.SparkSession@3fb4a07d
Spark version:      2.3.0
Spark master:       local[*]
Running 'locally'?: true


### Task 1 - Average Popularity (By Hour, By Day)

In [2]:
import java.io.File

val inputFile = new File("./data/social/test/")
var inputList = List[File]()
if(inputFile.isDirectory){
    inputList = inputFile.listFiles.filter(_.isFile).toList
}else if(inputFile.isFile){
    inputList = List[File](inputFile)
}
inputList

List(./data/social/test/GooglePlus_Palestine.csv, ./data/social/test/LinkedIn_Palestine.csv)

In [3]:
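// Accumulates ((UID, hour), (popularity, count)) pairs from every input file
var flattenSocialData = spark.sparkContext.emptyRDD[((String, Int), (Double, Int))]

inputList.foreach{ input =>
    val data = spark.sparkContext.textFile(input.toString)
    val header = data.first
    // Drop the header row, then emit one pair per time-slice column
    val flattenData = data.filter(l => l != header).
    flatMap{ dataString =>
        val attr = dataString.split(",")
        attr.zipWithIndex.
        filter{
            case (value,index) => index >= 1    // skip column 0 (IDLink)
        }.map{
            // three 20-minute slices per hour: TS1..TS3 -> hour 0, TS4..TS6 -> hour 1, ...
            case (value,index) => ((attr(0),(index-1)/3),(value.toDouble,1))
        }
    }
    flattenSocialData = flattenSocialData.union(flattenData)
}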

print("\nLoaded Data Sample: ")
flattenSocialData.take(1).foreach(print)
print(" -- ((UID, Hour), (Popularity, Count))")


Loaded Data Sample: ((61974,0),(-1.0,1)) -- ((UID, Hour), (Popularity, Count))

In [4]:
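// Cache the flattened data: it is reused for both the hourly and the daily averages
flattenSocialData.persist()

// Sum popularity and count per (UID, hour); the average is computed later as sum / count
val pop_by_hour = flattenSocialData.
    reduceByKey{case ((ia, ib), (ja, jb)) => (ia+ja, ib+jb)}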

print("\nIntermediate Data Sample: ")
pop_by_hour.take(1).foreach(print)
print(" -- ((UID, Hour), (Popularity, Count))")


Intermediate Data Sample: ((79184,9),(-6.0,6)) -- ((UID, Hour), (Popularity, Count))
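
The reduction above carries a (sum, count) pair instead of a running average, because sums and counts can be merged in any order while averages cannot. A minimal sketch of the same pattern with plain Scala collections, using made-up records in place of the RDD:

```scala
// Made-up sample in the same shape as the flattened data:
// ((uid, hour), (popularity, count))
val records = Seq(
  (("a", 0), (1.0, 1)),
  (("a", 0), (2.0, 1)),
  (("a", 0), (3.0, 1)),
  (("a", 1), (4.0, 1))
)

// Merge values per key by adding sums and counts (what reduceByKey does per key).
val sumCount = records.groupBy(_._1).map { case (key, rows) =>
  (key, rows.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2)))
}

// Divide only once, after all partial results have been merged.
val averages = sumCount.map { case (key, (sum, count)) => (key, sum / count) }
```

Here key `("a", 0)` ends up with sum 6.0 and count 3, so its average is 2.0.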

In [5]:
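// Every flattened record is a single time slice with count 1, so it is re-keyed by day
// (24 hours per day) and then averaged over all slices of that day
val pop_by_day = flattenSocialData.
    map{case((uid, hr), (pop, _)) => ((uid, hr/24), (pop, 1))}.
    reduceByKey{case ((ia, ib), (ja, jb)) => (ia+ja, ib+jb)}.
    map{case((uid, day), (sum, count)) => (uid,("Day", day, sum / count))}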
    

print("\nPopularity by Day Data Sample: ")
pop_by_day.take(1).foreach(print)
print(" -- (UID, ('Day', No., Popularity Average))")


Popularity by Day Data Sample: (84459,(Day,1,0.0)) -- (UID, ('Day', No., Popularity Average))

In [6]:
val all_pop = pop_by_hour.
    map{case((uid, hr), (sum, count)) => (uid,("Hour", hr, sum / count))}.
    union(pop_by_day).
    groupByKey.
    sortByKey(ascending = false)
    
print("\nAll Popularity Data Sample: ")
all_pop.take(1).foreach(println)


All Popularity Data Sample: (99997,CompactBuffer((Hour,22,1.6666666666666667), (Hour,4,0.0), (Hour,3,0.0), (Hour,12,0.0), (Hour,8,0.0), (Hour,1,0.0), (Hour,6,0.0), (Hour,9,0.0), (Hour,10,0.0), (Hour,11,0.0), (Hour,46,2.0), (Hour,38,2.0), (Hour,41,2.0), (Hour,15,0.0), (Hour,25,2.0), (Hour,24,2.0), (Hour,20,0.6666666666666666), (Hour,13,0.0), (Hour,17,0.16666666666666666), (Hour,28,2.0), (Hour,2,0.0), (Hour,0,0.0), (Hour,42,2.0), (Hour,7,0.0), (Hour,32,2.0), (Hour,39,2.0), (Hour,34,2.0), (Hour,29,2.0), (Hour,43,2.0), (Hour,5,0.0), (Hour,33,2.0), (Hour,27,2.0), (Hour,31,2.0), (Hour,26,2.0), (Hour,37,2.0), (Hour,36,2.0), (Hour,40,2.0), (Hour,18,0.5), (Hour,19,0.5), (Hour,30,2.0), (Hour,44,2.0), (Hour,23,2.0), (Hour,21,1.1666666666666667), (Hour,47,2.0), (Hour,16,0.0), (Hour,45,2.0), (Hour,14,0.0), (Hour,35,2.0), (Day,1,2.0), (Day,0,0.2777777777777778)))


In [7]:
val test_tuple = all_pop.first()
print("\nSorted Data Values for "+test_tuple._1+": ")
print(test_tuple._2.toVector.sortBy { tup => (tup._1, tup._2) })


Sorted Data Values for 99997: Vector((Day,0,0.2777777777777778), (Day,1,2.0), (Hour,0,0.0), (Hour,1,0.0), (Hour,2,0.0), (Hour,3,0.0), (Hour,4,0.0), (Hour,5,0.0), (Hour,6,0.0), (Hour,7,0.0), (Hour,8,0.0), (Hour,9,0.0), (Hour,10,0.0), (Hour,11,0.0), (Hour,12,0.0), (Hour,13,0.0), (Hour,14,0.0), (Hour,15,0.0), (Hour,16,0.0), (Hour,17,0.16666666666666666), (Hour,18,0.5), (Hour,19,0.5), (Hour,20,0.6666666666666666), (Hour,21,1.1666666666666667), (Hour,22,1.6666666666666667), (Hour,23,2.0), (Hour,24,2.0), (Hour,25,2.0), (Hour,26,2.0), (Hour,27,2.0), (Hour,28,2.0), (Hour,29,2.0), (Hour,30,2.0), (Hour,31,2.0), (Hour,32,2.0), (Hour,33,2.0), (Hour,34,2.0), (Hour,35,2.0), (Hour,36,2.0), (Hour,37,2.0), (Hour,38,2.0), (Hour,39,2.0), (Hour,40,2.0), (Hour,41,2.0), (Hour,42,2.0), (Hour,43,2.0), (Hour,44,2.0), (Hour,45,2.0), (Hour,46,2.0), (Hour,47,2.0))

In [8]:
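// Sorting by (label, number) puts "Day" before "Hour", so each output line is:
// UID, Day 0, Day 1, Hour 0, ..., Hour 47
val all_tuple = all_pop.map { case(uid, iterable) => 
    val vect = iterable.toVector.sortBy { tup => 
        (tup._1, tup._2)
    }.map{ case (doh, no, value) => value }
    uid+","+vect.mkString(",")
}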
print("\nFinalized Data Sample: ")
all_tuple.take(1).foreach(println)


Finalized Data Sample: 99997,0.2777777777777778,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.16666666666666666,0.5,0.5,0.6666666666666666,1.1666666666666667,1.6666666666666667,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
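
The column order of this line follows from sorting the `(label, number, value)` tuples: the label is compared first as a string, so "Day" sorts before "Hour", and the number orders entries within each label. A small sketch with made-up values illustrates the ordering; the `header` line is a hypothetical addition that the notebook itself does not write.

```scala
// Made-up (label, number, average) entries for one news item, in arbitrary order.
val values = Vector(("Hour", 1, 0.5), ("Day", 1, 2.0), ("Hour", 0, 0.0), ("Day", 0, 0.25))

// Tuples are ordered by label first ("Day" < "Hour"), then by number.
val sorted = values.sortBy { case (label, no, _) => (label, no) }

// Same line layout as the notebook output: UID followed by the values.
val line = "42," + sorted.map(_._3).mkString(",")

// Hypothetical header naming each column; not produced by the notebook.
val header = "IDLink," + sorted.map { case (label, no, _) => label + no }.mkString(",")
```

A header like this would make the saved files self-describing, since the raw lines give no hint that the two daily averages come before the 48 hourly ones.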


In [12]:
all_tuple.saveAsTextFile("./output")

103820004 Michael Fu