# Homework 2

## Description

### Data
[News Popularity in Multiple Social Media Platforms Data Set](https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms) - 13 CSV files, 155 MB in total  

This dataset contains a large set of news items and their respective social feedback on Facebook, Google+, and LinkedIn.


### Format
One CSV file holds the news data records, and 12 CSV files hold the social feedback.
Each social feedback file contains the feedback from one social platform {Facebook, Google+, LinkedIn} on one topic {Economy, Microsoft, Palestine, Obama}, giving 3 x 4 = 12 files.

#### News Data Variables
Each record contains 11 attributes:

1. IDLink (numeric): Unique identifier of news items 
2. Title (string): Title of the news item according to the official media sources 
3. Headline (string): Headline of the news item according to the official media sources 
4. Source (string): Original news outlet that published the news item 
5. Topic (string): Query topic used to obtain the items in the official media sources 
6. PublishDate (timestamp): Date and time of the news items' publication 
7. SentimentTitle (numeric): Sentiment score of the text in the news items' title 
8. SentimentHeadline (numeric): Sentiment score of the text in the news items' headline 
9. Facebook (numeric): Final value of the news items' popularity according to the social media source Facebook 
10. GooglePlus (numeric): Final value of the news items' popularity according to the social media source Google+ 
11. LinkedIn (numeric): Final value of the news items' popularity according to the social media source LinkedIn 

#### Social Feedback Variables
Each record contains 145 attributes (IDLink plus 144 time slices):

1. IDLink (numeric): Unique identifier of news items 
2. TS1 (numeric): Level of popularity in time slice 1 (0-20 minutes after publication) 
3. TS2 (numeric): Level of popularity in time slice 2 (20-40 minutes after publication) 
4. TS... (numeric): Level of popularity in time slice ... 
5. TS144 (numeric): Final level of popularity, 2 days after publication
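
Since each time slice covers 20 minutes, three slices make one hour and 72 make one day. The sketch below shows that index arithmetic in plain Scala; the object and method names are hypothetical and only illustrate the mapping.

```scala
object TimeSlices {
  // Each time slice covers 20 minutes, so 3 slices make an hour.
  val SlicesPerHour = 3
  val HoursPerDay = 24

  // Zero-based hour index for the 1-based slice number n in TSn.
  def hourOf(slice: Int): Int = (slice - 1) / SlicesPerHour

  // Zero-based day index for the 1-based slice number n in TSn.
  def dayOf(slice: Int): Int = hourOf(slice) / HoursPerDay
}
```

With this mapping, TS1 to TS3 fall in hour 0, TS144 falls in hour 47, and TS73 is the first slice of day 1.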


### Task
There are four subtasks:
+ (20pt) In the social feedback data, calculate the average popularity of each news item by hour and by day
+ (20pt) In the news data, calculate the sum and the average of the sentiment scores for each topic
+ (30pt) In the news data, count the words in the ‘Title’ and ‘Headline’ fields separately, and list the most frequent words in descending order of term frequency: in total, per day, and per topic
+ (30pt) Using the previous subtask's results, take the top-100 most frequent words per topic in titles and in headlines, and calculate a 100x100 co-occurrence matrix for each. Each entry holds how often the two words co-occur across all news titles or headlines, respectively
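
The word-count and co-occurrence subtasks can be sketched on plain Scala collections, without Spark. This is only an illustration under stated assumptions, not the assignment's reference solution: the tokenizer (lowercase, split on non-alphanumeric characters), the choice to count a word pair once per text, and all names (`WordStats`, `tokenize`, `termFrequencies`, `cooccurrence`) are hypothetical.

```scala
object WordStats {
  // Assumed tokenizer: lowercase, then split on anything that is not a letter or digit.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // Term frequencies over all texts, most frequent first (ties broken alphabetically).
  def termFrequencies(texts: Seq[String]): Seq[(String, Int)] =
    texts.flatMap(t => tokenize(t))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy { case (word, count) => (-count, word) }

  // Co-occurrence matrix for a vocabulary: entry (i, j) counts the texts that
  // contain both vocab(i) and vocab(j). The diagonal is left at 0 (an assumption).
  def cooccurrence(texts: Seq[String], vocab: Seq[String]): Array[Array[Int]] = {
    val index = vocab.zipWithIndex.toMap
    val matrix = Array.fill(vocab.size, vocab.size)(0)
    for (text <- texts) {
      // Vocabulary positions of the distinct words present in this text.
      val present = tokenize(text).distinct.filter(index.contains).map(index)
      for (i <- present; j <- present if i != j) matrix(i)(j) += 1
    }
    matrix
  }
}
```

For the assignment, `vocab` would be the top-100 words of a topic taken from `termFrequencies`, which yields a symmetric 100x100 matrix.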

## Implementation

In [1]:
// A pre-configured SparkSession is available as spark; its SparkContext is spark.sparkContext

println("Spark Entity:       " + spark)
println("Spark version:      " + spark.version)
println("Spark master:       " + spark.sparkContext.master)
println("Running 'locally'?: " + spark.sparkContext.isLocal)

Spark Entity:       org.apache.spark.sql.SparkSession@7dc75eac
Spark version:      2.3.0
Spark master:       local[*]
Running 'locally'?: true


### Task 1 - Average Popularity (By Hour, By Day)

In [2]:
val folder = "./data/social/"
//val topics = Seq("Economy", "Microsoft", "Obama", "Palestine")
val topics = Seq("Economy")
//val platforms = Seq("Facebook", "GooglePlus", "LinkedIn")
val platforms = Seq("Facebook")

// Records have the shape ((UID, Hour), (Popularity, Count))
var flattenSocialData = spark.sparkContext.emptyRDD[((String, Int), (Int, Double))]

println("Loaded Files: ")
for(topic <- topics; platform <- platforms){
    val loc = folder+platform+"_"+topic+".csv"
    val testData = spark.sparkContext.textFile(loc)
    val header = testData.first
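    // Drop the header row, then emit one record per time slice column
    val flattenData = testData.filter(l => l != header).
    flatMap{ dataString =>
        val attr = dataString.split(",")
        attr.zipWithIndex.
        filter{
            // Column 0 is IDLink; columns 1..144 are TS1..TS144
            case (value,index) => index >= 1
        }.map{
            // Three 20-minute slices make one hour, so (index-1)/3 is the hour (0..47)
            case (value,index) => ((attr(0),(index-1)/3),(value.toInt,1.0))
        }
    }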
    flattenSocialData = flattenSocialData.union(flattenData)
    println("     "+loc)
}

print("\nLoaded Data Sample: ")
flattenSocialData.take(1).foreach(print)
print(" -- ((UID, Hour), (Popularity, Count))")

Loaded Files: 
     ./data/social/Facebook_Economy.csv

Loaded Data Sample: ((1,0),(-1,1.0)) -- ((UID, Hour), (Popularity, Count))

In [3]:
flattenSocialData.persist()

// Sum popularity and slice counts per (UID, Hour); dividing later gives the hourly average
val pop_by_hour = flattenSocialData.
    reduceByKey{case ((ia, ib), (ja, jb)) => (ia+ja, ib+jb)}

pop_by_hour.take(5).foreach(print)

((10974,1),(-3,3.0))((48028,27),(267,3.0))((9730,14),(0,3.0))((14772,23),(170,3.0))((11423,7),(6,3.0))
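
The `(sum, count)` pairs above become hourly averages once the sum is divided by the count. A minimal sketch of the same merge-then-divide pattern on plain Scala collections (no Spark) follows; `HourlyAverage` and `averageByKey` are hypothetical names used only for illustration.

```scala
object HourlyAverage {
  // One record per time slice: ((uid, hour), (popularity, count)), as in flattenSocialData.
  type Record = ((String, Int), (Int, Double))

  // Sum popularity and count per (uid, hour), the same pairwise merge that
  // reduceByKey performs, then divide the sum by the count.
  def averageByKey(records: Seq[Record]): Map[(String, Int), Double] =
    records
      .groupBy { case (key, _) => key }
      .map { case (key, group) =>
        val (sum, count) = group
          .map { case (_, pair) => pair }
          .reduce { (a, b) => (a._1 + b._1, a._2 + b._2) }
        key -> (sum / count)
      }
}
```

Carrying the count alongside the sum keeps the merge associative, which is what lets Spark combine partial results from different partitions in any order.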

In [4]:
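// Each input record is a single time slice (count is always 1.0), so key it by
// (UID, Day) with Day = Hour/24 and average the slice values per day
val pop_by_day = flattenSocialData.
    map{case((uid, hr), (value, _)) => ((uid, hr/24), (value.toDouble, 1))}.
    reduceByKey{case ((ia, ib), (ja, jb)) => (ia+ja, ib+jb)}.
    map{case((uid, day), (sum, count)) => (uid,("d", day, sum / count))}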
    
pop_by_day.take(5).foreach(print)

(21508,(d,0,2.486111111111111))(783,(d,0,47.02777777777778))(10974,(d,1,24.166666666666668))(46034,(d,0,4.0))(51147,(d,1,23.055555555555557))

In [5]:
// Tag hourly averages with "h", merge them with the daily ones ("d"), and group by UID
val all_pop = pop_by_hour.
    map{case((uid, hr), (sum, count)) => (uid,("h", hr, sum / count))}.
    union(pop_by_day).
    groupByKey.
    sortByKey(ascending = true)
    
all_pop.take(1).foreach(print)

(1,CompactBuffer((h,32,12.0), (h,3,7.0), (h,6,8.0), (h,37,12.0), (h,25,12.0), (h,9,8.0), (h,0,-1.0), (h,4,7.666666666666667), (h,34,12.0), (h,29,12.0), (h,13,9.333333333333334), (h,26,12.0), (h,44,13.0), (h,41,12.0), (h,30,12.0), (h,5,8.0), (h,10,8.0), (h,15,10.666666666666666), (h,16,11.0), (h,23,12.0), (h,42,13.0), (h,8,8.0), (h,33,12.0), (h,20,12.0), (h,45,13.0), (h,2,1.6666666666666667), (h,22,12.0), (h,17,11.333333333333334), (h,38,12.0), (h,36,12.0), (h,18,12.0), (h,43,13.0), (h,19,12.0), (h,7,8.0), (h,11,9.0), (h,24,12.0), (h,1,-1.0), (h,35,12.0), (h,39,12.0), (h,47,13.0), (h,21,12.0), (h,12,9.0), (h,28,12.0), (h,40,12.0), (h,27,12.0), (h,31,12.0), (h,46,13.0), (h,14,10.0), (d,0,8.527777777777779), (d,1,12.25)))
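
One detail worth noting: the UID is kept as a `String` taken from the CSV, so `sortByKey` orders the keys lexicographically rather than numerically. The sketch below shows the difference on a plain Scala sequence; the sample IDs and the `UidOrdering` name are made up for illustration.

```scala
object UidOrdering {
  // Made-up sample IDs, stored as strings like the notebook's UID keys.
  val uids = Seq("1", "2", "10", "100")

  // Default String ordering compares character by character.
  val lexicographic: Seq[String] = uids.sorted

  // Converting to Int first gives numeric order; assumes every UID parses as an integer.
  val numeric: Seq[String] = uids.sortBy(_.toInt)
}
```

Here `lexicographic` is `1, 10, 100, 2` while `numeric` is `1, 2, 10, 100`. If numeric order matters for the output file, one option is to map the key to `uid.toInt` before calling `sortByKey`.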

In [6]:
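// Build one CSV row per UID: "d" sorts before "h", so the row is
// UID, day 0, day 1, hour 0, ..., hour 47
val all_tuple = all_pop.map { case(uid, iterable) => 
    val vect = iterable.toVector.sortBy { case (kind, index, _) => 
        (kind, index)
    }.map{ case (_, _, value) => value }
    uid+","+vect.mkString(",")
}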

all_tuple.take(1).foreach(println)

1,8.527777777777779,12.25,-1.0,-1.0,1.6666666666666667,7.0,7.666666666666667,8.0,8.0,8.0,8.0,8.0,8.0,9.0,9.0,9.333333333333334,10.0,10.666666666666666,11.0,11.333333333333334,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,13.0,13.0,13.0,13.0,13.0,13.0


In [8]:
all_tuple.saveAsTextFile("./data/output")

103820004 Michael Fu