# Homework 2

## Description

### Data
[News Popularity in Multiple Social Media Platforms Data Set](https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms) - 13 CSV files, 155MB in total  

This dataset contains a large set of news items and their respective social feedback on Facebook, Google + and LinkedIn.


### Format
One CSV File with News Data Records and 12 CSV Files with Social Feedback.
The Social Feedback File contains the feedback from one of the social platforms {Facebook, Google+, LinkedIn} on one of the topics {Economy, Microsoft, Palestine, Obama}.

#### News Data Variables
Each record contains 11 attributes

1. IDLink (numeric): Unique identifier of news items 
2. Title (string): Title of the news item according to the official media sources 
3. Headline (string): Headline of the news item according to the official media sources 
4. Source (string): Original news outlet that published the news item 
5. Topic (string): Query topic used to obtain the items in the official media sources 
6. PublishDate (timestamp): Date and time of the news items' publication 
7. SentimentTitle (numeric): Sentiment score of the text in the news items' title 
8. SentimentHeadline (numeric): Sentiment score of the text in the news items' headline 
9. Facebook (numeric): Final value of the news items' popularity according to the social media source Facebook 
10. GooglePlus (numeric): Final value of the news items' popularity according to the social media source Google+ 
11. LinkedIn (numeric): Final value of the news items' popularity according to the social media source LinkedIn 

#### Social Feedback Variables
Each record contains 145 attributes

1. IDLink (numeric): Unique identifier of news items 
2. TS1 (numeric): Level of popularity in time slice 1 (0-20 minutes upon publication) 
3. TS2 (numeric): Level of popularity in time slice 2 (20-40 minutes upon publication) 
4. TS... (numeric): Level of popularity in time slice ... 
5. TS144 (numeric): Final level of popularity after 2 days upon publication


### Task
4 subtasks:
+ (20pt) In social feedback data, calculate the average popularity of each news by hour, and by day, respectively
+ (20pt) In news data, calculate the sum and average sentiment score of each topic, respectively
+ (30pt) In news data, count the words in two fields: ‘Title’ and ‘Headline’ respectively, and list the most frequent words according to the term frequency in descending order, in total, per day, and per topic, respectively
+ (30pt) From the previous subtask, for the top-100 frequent words per topic in titles and headlines, calculate their co-occurrence matrices (100x100), respectively. Each entry in the matrix will contain the co-occurrence frequency in all news titles and headlines, respectively

### Implementation Issues
+ Large number of Attributes for each record

## Implementation

In [1]:
// Pre-Configured Spark Context in sc

println("Spark Entity:       " + spark)
println("Spark version:      " + spark.version)
println("Spark master:       " + spark.sparkContext.master)
println("Running 'locally'?: " + spark.sparkContext.isLocal)

Spark Entity:       org.apache.spark.sql.SparkSession@782d2260
Spark version:      2.2.0
Spark master:       local[*]
Running 'locally'?: true


### Task 1 - Average Popularity (By Hour, By Day)

In [2]:
import org.apache.hadoop.fs._
import org.apache.hadoop.conf.Configuration

val inputBuffer = scala.collection.mutable.ArrayBuffer.empty[String]

val inputPath = new Path("./data/social/test")
val iterator = inputPath.getFileSystem(new Configuration()).listFiles(inputPath, true)

inputBuffer = ArrayBuffer()
inputPath = data/social/test
iterator = org.apache.hadoop.fs.FileSystem$6@3578b767


org.apache.hadoop.fs.FileSystem$6@3578b767

In [3]:
while(iterator.hasNext()){
    val fileStatus = iterator.next()
    inputBuffer += fileStatus.getPath().toString()
}
inputBuffer.toArray.foreach(println)

file:/home/micky/big_data/hw2/data/social/test/Facebook_Palestine.csv
file:/home/micky/big_data/hw2/data/social/test/GooglePlus_Palestine.csv
file:/home/micky/big_data/hw2/data/social/test/GooglePlus_Microsoft.csv


In [4]:
var flattenSocialData = spark.sparkContext.emptyRDD[((String, Int), (Double, Int))]

inputBuffer.toArray.foreach{ input =>
    val data = spark.sparkContext.textFile(input)
    val header = data.first
    val flattenData = data.filter(l => l != header).
    flatMap{ dataString =>
        val attr = dataString.split(",")
        attr.zipWithIndex.
        filter{
            case (value,index) => index >= 1
        }.map{
            case (value,index) => ((attr(0),(index-1)/3),(value.toDouble,1))
        }
    }
    flattenSocialData = flattenSocialData.union(flattenData)
}

print("\nLoaded Data Sample: ")
flattenSocialData.take(1).foreach(print)
print(" -- ((UID, Hour), (Popularity, Count))")


Loaded Data Sample: ((61974,0),(-1.0,1)) -- ((UID, Hour), (Popularity, Count))

flattenSocialData = UnionRDD[15] at union at <console>:48


UnionRDD[15] at union at <console>:48

In [5]:
flattenSocialData.persist()

val pop_by_hour = flattenSocialData.
    reduceByKey{case ((ia, ib), (ja, jb)) => (ia+ja, ib+jb)}

print("\nIntermediate Data Sample: ")
pop_by_hour.take(1).foreach(print)
print(" -- ((UID, Hour), (Popularity, Count))")


((16768,38),(10.0,3)) -- ((UID, Hour), (Popularity, Count))                     

pop_by_hour = ShuffledRDD[16] at reduceByKey at <console>:36


ShuffledRDD[16] at reduceByKey at <console>:36

In [6]:
val pop_by_day = flattenSocialData.
    map{case((uid, hr), (sum, count)) => ((uid, hr/24), (sum/count, 1))}.
    reduceByKey{case ((ia, ib), (ja, jb)) => (ia+ja, ib+jb)}.
    map{case((uid, day), (sum, count)) => (uid,("Day", day, sum / count))};
    

print("\nPopularity by Day Data Sample: ")
pop_by_day.take(1).foreach(print)
print(" -- ((UID, ('Day', No., Popularity Average))")


(18258,(Day,1,1.0)) -- ((UID, ('Day', No., Popularity Average))                 

pop_by_day = MapPartitionsRDD[19] at map at <console>:36


MapPartitionsRDD[19] at map at <console>:36

In [7]:
val all_pop = pop_by_hour.
    map{case((uid, hr), (sum, count)) => (uid,("Hour", hr, sum / count))}.
    union(pop_by_day).
    groupByKey.
    sortByKey(ascending = false)
    
print("\nAll Popularity Data Sample: ")
all_pop.take(1).foreach(println)



all_pop = ShuffledRDD[25] at sortByKey at <console>:41


ShuffledRDD[25] at sortByKey at <console>:41

In [8]:
val test_tuple = all_pop.first()
print("\nSorted Data Values for "+test_tuple._1+": ")
print(test_tuple._2.toVector.sortBy { tup => (tup._1, tup._2) })


Sorted Data Values for 99997: Vector((Day,0,23.65277777777778), (Day,1,118.08333333333333), (Hour,0,0.0), (Hour,1,0.0), (Hour,2,1.0), (Hour,3,2.6666666666666665), (Hour,4,3.8333333333333335), (Hour,5,4.5), (Hour,6,5.5), (Hour,7,5.5), (Hour,8,6.666666666666667), (Hour,9,12.333333333333334), (Hour,10,16.666666666666668), (Hour,11,20.0), (Hour,12,21.333333333333332), (Hour,13,24.0), (Hour,14,26.166666666666668), (Hour,15,27.833333333333332), (Hour,16,28.0), (Hour,17,32.166666666666664), (Hour,18,38.5), (Hour,19,44.666666666666664), (Hour,20,52.666666666666664), (Hour,21,58.666666666666664), (Hour,22,64.66666666666667), (Hour,23,70.33333333333333), (Hour,24,74.5), (Hour,25,77.33333333333333), (Hour,26,80.83333333333333), (Hour,27,88.16666666666667), (Hour,28,95.83333333333333), (Hour,29,100.66666666666667), (Hour,30,104.5), (Hour,31,108.83333333333333), (Hour,32,110.83333333333333), (Hour,33,111.0), (Hour,34,111.0), (Hour,35,111.0), (Hour,36,111.0), (Hour,37,130.16666666666666), (Hour,38,

test_tuple = (99997,CompactBuffer((Hour,0,0.0), (Hour,15,27.833333333333332), (Hour,18,38.5), (Hour,21,58.666666666666664), (Hour,14,26.166666666666668), (Hour,22,64.66666666666667), (Hour,32,110.83333333333333), (Hour,9,12.333333333333334), (Hour,31,108.83333333333333), (Hour,34,111.0), (Hour,13,24.0), (Hour,5,4.5), (Hour,26,80.83333333333333), (Hour,4,3.8333333333333335), (Hour,25,77.33333333333333), (Hour,47,148.0), (Hour,24,74.5), (Hour,41,140.33333333333334), (Hour,20,52.666666666666664), (Hour,10,16.666666666666668), (Hour,17,32.166666666666664), (Hour,37,130.16666666666666), (Hour,29,100.66666666666667), (Hour,6,5.5), (Hour,46,146.66666666666666), (Hour,38,132.16666666666666), (Hour,3,2.6666666666666665), (Hour,36,111.0), (Hour,30,104.5)...


(99997,CompactBuffer((Hour,0,0.0), (Hour,15,27.833333333333332), (Hour,18,38.5), (Hour,21,58.666666666666664), (Hour,14,26.166666666666668), (Hour,22,64.66666666666667), (Hour,32,110.83333333333333), (Hour,9,12.333333333333334), (Hour,31,108.83333333333333), (Hour,34,111.0), (Hour,13,24.0), (Hour,5,4.5), (Hour,26,80.83333333333333), (Hour,4,3.8333333333333335), (Hour,25,77.33333333333333), (Hour,47,148.0), (Hour,24,74.5), (Hour,41,140.33333333333334), (Hour,20,52.666666666666664), (Hour,10,16.666666666666668), (Hour,17,32.166666666666664), (Hour,37,130.16666666666666), (Hour,29,100.66666666666667), (Hour,6,5.5), (Hour,46,146.66666666666666), (Hour,38,132.16666666666666), (Hour,3,2.6666666666666665), (Hour,36,111.0), (Hour,30,104.5), (Hour,8,6.666666666666667), (Hour,16,28.0), (Hour,35,111.0), (Hour,11,20.0), (Hour,2,1.0), (Hour,42,143.0), (Hour,19,44.666666666666664), (Hour,43,144.66666666666666), (Hour,40,138.5), (Hour,39,134.5), (Hour,44,145.0), (Hour,45,145.5), (Hour,23,70.333333333

In [9]:
val all_tuple = all_pop.map { case(uid, iterable) => 
    val vect = iterable.toVector.sortBy { tup => 
        (tup._1, tup._2)
    }.map{ case (doh, no, value) => value }
    uid+","+vect.mkString(",")
}
print("\nFinalized Data Sample: ")
all_tuple.take(1).foreach(println)


Finalized Data Sample: 99997,23.65277777777778,118.08333333333333,0.0,0.0,1.0,2.6666666666666665,3.8333333333333335,4.5,5.5,5.5,6.666666666666667,12.333333333333334,16.666666666666668,20.0,21.333333333333332,24.0,26.166666666666668,27.833333333333332,28.0,32.166666666666664,38.5,44.666666666666664,52.666666666666664,58.666666666666664,64.66666666666667,70.33333333333333,74.5,77.33333333333333,80.83333333333333,88.16666666666667,95.83333333333333,100.66666666666667,104.5,108.83333333333333,110.83333333333333,111.0,111.0,111.0,111.0,130.16666666666666,132.16666666666666,134.5,138.5,140.33333333333334,143.0,144.66666666666666,145.0,145.5,146.66666666666666,148.0


all_tuple = MapPartitionsRDD[26] at map at <console>:39


MapPartitionsRDD[26] at map at <console>:39

In [10]:
all_tuple.saveAsTextFile("./output_popularity")



### Task 2 - Sum and Average Sentiment Score For Each Topic

In [11]:
val newsData_string = sc.textFile("./data/news.csv")
val header = newsData_string.first()

newsData_string = ./data/news.csv MapPartitionsRDD[29] at textFile at <console>:31
header = "IDLink","Title","Headline","Source","Topic","PublishDate","SentimentTitle","SentimentHeadline","Facebook","GooglePlus","LinkedIn"


"IDLink","Title","Headline","Source","Topic","PublishDate","SentimentTitle","SentimentHeadline","Facebook","GooglePlus","LinkedIn"

In [12]:
import scala.collection.mutable.ListBuffer
var regex = """,([\d.-]+)$|^([\d.-]+)(?=,)|(?:,)([\d.-]+)(?=,)|(?:,"+)((?:[^"]+"{3,})*[^"]*)(?="+,)""".r
var newsParse = newsData_string.filter(x => x!=header).map{
    string =>
    var list =  ListBuffer[String]()
    for(m <- regex.findAllIn(string).matchData;
      e <- m.subgroups)
      if(e!=null) list+=e
    list.toSeq
}
newsParse.first().foreach(println)

99248
Obama Lays Wreath at Arlington National Cemetery
Obama Lays Wreath at Arlington National Cemetery. President Barack Obama has laid a wreath at the Tomb of the Unknowns to honor
USA TODAY
obama
2002-04-02 00:00:00
0
-0.0533001790889026
-1
-1
-1


regex = ,([\d.-]+)$|^([\d.-]+)(?=,)|(?:,)([\d.-]+)(?=,)|(?:,"+)((?:[^"]+"{3,})*[^"]*)(?="+,)
newsParse = MapPartitionsRDD[31] at map at <console>:36


MapPartitionsRDD[31] at map at <console>:36

In [13]:
val flattenSentientScore = newsParse.
  map{ attr =>
      (attr(4),(attr(6).toString.toDouble, attr(7).toString.toDouble, 1))
  }

print("\nFlatten Sentient Score Sample: ")
flattenSentientScore.take(1).foreach(print)
print(" -- (Topic, (Title Sentient Score, Headline Sentient Score, Count))\n\n")


Flatten Sentient Score Sample: (obama,(0.0,-0.0533001790889026,1)) -- (Topic, (Title Sentient Score, Headline Sentient Score, Count))



flattenSentientScore = MapPartitionsRDD[32] at map at <console>:39


MapPartitionsRDD[32] at map at <console>:39

In [14]:
var smallerSampleSize = sc.parallelize(flattenSentientScore.takeSample(false, 30, System.nanoTime.toInt))
var reducedSentientScore = smallerSampleSize.reduceByKey(
    (a,b) =>
    (a._1 + b._1, a._2+ b._2, a._3 + b._3)
)

print("\nReduced Sentient Score Sample: ")
reducedSentientScore.take(1).foreach(print)
print(" -- (Topic, (Title Sentient Score, Headline Sentient Score, Count))\n\n")


Reduced Sentient Score Sample: (economy,(0.09550339823478982,-0.3221036949470573,9)) -- (Topic, (Title Sentient Score, Headline Sentient Score, Count))



smallerSampleSize = ParallelCollectionRDD[34] at parallelize at <console>:40
reducedSentientScore = ShuffledRDD[35] at reduceByKey at <console>:41


ShuffledRDD[35] at reduceByKey at <console>:41

In [15]:
var finalSentientScore = reducedSentientScore.map{
    case (topic,(titleSum, headSum, count)) =>
    (topic, titleSum, headSum, titleSum/count, headSum/count, count)
}

def myprint(s: Tuple6[Any, Double, Double, Double, Double, Any]): Unit = {
        println("For "+s._1+" ("+s._6+" entries): ")
        println("        Sum of Title Sentient Score: " + s._2)
        println("        Sum of Headline Sentient Score: " + s._3)
        println("        Average of Title Sentient Score: " + s._4)
        println("        Average of Headline Sentient Score: " + s._5 + "\n")
    }

println("\nFinal Sentient Score: \n")
finalSentientScore.collect().foreach(myprint)


Final Sentient Score: 

For economy (9 entries): 
        Sum of Title Sentient Score: 0.09550339823478982
        Sum of Headline Sentient Score: -0.3221036949470573
        Average of Title Sentient Score: 0.010611488692754424
        Average of Headline Sentient Score: -0.035789299438561926

For obama (10 entries): 
        Sum of Title Sentient Score: -0.3022366713719931
        Sum of Headline Sentient Score: -0.4927365148886257
        Average of Title Sentient Score: -0.03022366713719931
        Average of Headline Sentient Score: -0.04927365148886257

For microsoft (9 entries): 
        Sum of Title Sentient Score: -0.2767157306946107
        Sum of Headline Sentient Score: -0.43691497050841
        Average of Title Sentient Score: -0.03074619229940119
        Average of Headline Sentient Score: -0.04854610783426778

For palestine (2 entries): 
        Sum of Title Sentient Score: -0.0947711880323037
        Sum of Headline Sentient Score: 0.2604403359164371
        Average of

finalSentientScore = MapPartitionsRDD[36] at map at <console>:43


myprint: (s: (Any, Double, Double, Double, Double, Any))Unit


MapPartitionsRDD[36] at map at <console>:43

### Task 3 - Title/Headline Word Count in Descending Order (In Total, Per Day, and Per Topic)

In [10]:
import scala.collection.mutable.ListBuffer

val newsData_string = sc.textFile("./data/news.csv")
val header = newsData_string.first()
var regex = """,([\d.-]+)$|^([\de+.-]+)(?=,)|(?:,)([\d.-]+)(?=,)|(?:,"+)((?:[^"]+"{3,})*[^"]*)(?="+,)""".r
var newsParse = newsData_string.filter(x => x!=header).map{
    string =>
    var list =  ListBuffer[String]()
    for(m <- regex.findAllIn(string).matchData;
      e <- m.subgroups)
      if(e!=null) list+=e
    list.toSeq
}
newsParse.first()

newsData_string = ./data/news.csv MapPartitionsRDD[12] at textFile at <console>:35
header = "IDLink","Title","Headline","Source","Topic","PublishDate","SentimentTitle","SentimentHeadline","Facebook","GooglePlus","LinkedIn"
regex = ,([\d.-]+)$|^([\de+.-]+)(?=,)|(?:,)([\d.-]+)(?=,)|(?:,"+)((?:[^"]+"{3,})*[^"]*)(?="+,)
newsParse = MapPartitionsRDD[14] at map at <console>:38


List(99248, Obama Lays Wreath at Arlington National Cemetery, Obama Lays Wreath at Arlington National Cemetery. President Barack Obama has laid a wreath at the Tomb of the Unknowns to honor, USA TODAY, obama, 2002-04-02 00:00:00, 0, -0.0533001790889026, -1, -1, -1)

In [3]:
def wordPreProcess(input: Any): Array[String] = {
        var matchRegex = """([$]?(?:[\w]+(?:[\w',]*[\w]+)+|[\w]))""".r
        var list =  ListBuffer[String]()
        for(m <- matchRegex.findAllIn(input.toString.toLowerCase).matchData;
          e <- m.subgroups)
          list+=e
        list.toArray
    }

wordPreProcess("This is a test string, yrs' I'm hoping that this would work!! :) 103 yrs $3,000 \n'This is a book;'")

wordPreProcess: (input: Any)Array[String]


[this, is, a, test, string, yrs, i'm, hoping, that, this, would, work, 103, yrs, $3,000, this, is, a, book]

In [4]:
var smallerSampleSize = sc.parallelize(newsParse.takeSample(false, 30, System.currentTimeMillis().toInt))
var flattenWordTuples = smallerSampleSize.
  flatMap{ attr =>
          
      var titleWords = wordPreProcess(attr(1)).map(
          word =>
          (word, "title", attr(4).toString, attr(5).toString.split("\\s")(0))
      )
      
      var headlineWords = wordPreProcess(attr(2)).map(
          word =>
          (word, "headine", attr(4).toString, attr(5).toString.split("\\s")(0))
      )
      
      titleWords ++ headlineWords
  }
flattenWordTuples.persist()

print("\nFlatten Word Sample: ")
flattenWordTuples.take(1).foreach(print)
print(" -- (Word, 'Title'/'Headline', Topic, Date)\n\n")


Flatten Word Sample: (watch,title,palestine,2015-10-21) -- (Word, 'Title'/'Headline', Topic, Date)



smallerSampleSize = ParallelCollectionRDD[5] at parallelize at <console>:35
flattenWordTuples = MapPartitionsRDD[6] at flatMap at <console>:37


MapPartitionsRDD[6] at flatMap at <console>:37

In [5]:
var perTopicTuple = flattenWordTuples.
  map{ case (word, toh, topic, date) =>
    ((word, toh, topic), 1)
  }.reduceByKey{
      (j, k) =>
      j+k
  }

print("\nFlatten Word Sample (Per Topic): \n")
perTopicTuple.take(10).foreach(println)
print(" -- ((Word, 'Title'/'Headline', Topic), Count)\n\n")


Flatten Word Sample (Per Topic): 
((premier,headine,economy),1)
((process,headine,obama),1)
((jewell,headine,obama),1)
((is,title,microsoft),1)
((companies,headine,microsoft),1)
((langdana's,headine,economy),1)
(($43b,title,economy),1)
((bullard,headine,economy),1)
((other,headine,obama),1)
((shove,title,microsoft),1)
 -- ((Word, 'Title'/'Headline', Topic), Count)



perTopicTuple = ShuffledRDD[8] at reduceByKey at <console>:41


ShuffledRDD[8] at reduceByKey at <console>:41

In [5]:
import java.io.PrintWriter

var perTopicOutput = perTopicTuple.
  map{ case ((word, toh, topic), count) =>
    ((toh, topic), (word, count))
  }.groupByKey.mapValues{ iterator =>
      iterator.toVector.sortBy { case(word, count) => 
          (-count, word)
      }.map{
          case(word,count) => word + ","+ count
      }.mkString("\n")
  }

perTopicOutput.foreach{ case ((toh, topic), vect_string) =>
   new PrintWriter("./output/topic/"+topic+"_"+toh+".csv") { try {write(vect_string)} finally {close()} }
}

print("\nSample Output (Per Topic): \n")
perTopicOutput.first()._2.split("\n").take(10).foreach(println)
print(""" -- " Word, Count " """+"\n\n")


Sample Output (Per Topic): 
to,12
the,5
and,4
microsoft,4
10,3
is,3
will,3
windows,3
a,2
edge,2
 -- " Word, Count " 



perTopicOutput = MapPartitionsRDD[11] at mapValues at <console>:45


MapPartitionsRDD[11] at mapValues at <console>:45

In [6]:
var perDayTuple = flattenWordTuples.
  map{ case (word, toh, topic, date) =>
    ((word, toh, date), 1)
  }.reduceByKey{
      (j, k) =>
      j+k
  }

print("\nFlatten Word Sample (Per Day): \n")
perDayTuple.take(10).foreach(println)
print(" -- ((Word, 'Title'/'Headline', Date), Count)\n\n")


Flatten Word Sample (Per Day): 
((embargo,title,2016-05-22),1)
((as,headine,2016-03-15),2)
((bank,title,2015-12-29),1)
((directed,headine,2016-03-15),1)
((press,headine,2016-05-22),1)
((advisors,headine,2016-03-16),1)
((different,headine,2016-06-20),1)
((to,title,2016-01-13),1)
((29,title,2015-12-29),1)
((between,title,2016-05-02),1)
 -- ((Word, 'Title'/'Headline', Date), Count)



perDayTuple = ShuffledRDD[13] at reduceByKey at <console>:42


ShuffledRDD[13] at reduceByKey at <console>:42

In [7]:
import java.io.PrintWriter

var perDayOutput = perDayTuple.
  map{ case ((word, toh, date), count) =>
    ((toh, date), (word, count))
  }.groupByKey.mapValues{ iterator =>
      iterator.toVector.sortBy { case(word, count) => 
          (-count, word)
      }.map{
          case(word,count) => word + ","+ count
      }.mkString("\n")
  }

perDayOutput.foreach{ case ((toh, date), vect_string) =>
   new PrintWriter("./output/date/"+date+"_"+toh+".csv") { try {write(vect_string)} finally {close()} }
}

print("\nSample Output (Per Day): \n")
perTopicOutput.first()._2.split("\n").take(10).foreach(println)
print(""" -- " Word, Count " """+"\n\n")


Sample Output (Per Day): 
to,12
the,5
and,4
microsoft,4
10,3
is,3
will,3
windows,3
a,2
edge,2
 -- " Word, Count " 



perDayOutput = MapPartitionsRDD[16] at mapValues at <console>:50


MapPartitionsRDD[16] at mapValues at <console>:50

In [8]:
var totalTuple = perTopicTuple.
  map{ case ((word, toh, topic), count) =>
    ((word, toh), 1)
  }.reduceByKey{
      (j, k) =>
      j+k
  }

print("\nFlatten Word Sample (In Total): \n")
totalTuple.take(10).foreach(println)
print(" -- ((Word, 'Title'/'Headline'), Count)\n\n")


Flatten Word Sample (In Total): 
((service,headine),1)
((calls,headine),1)
((maher's,title),1)
((settlers,title),1)
((crises,title),1)
((backing,headine),1)
((jaitley,headine),1)
((solution,headine),1)
((out,headine),1)
((slower,headine),1)
 -- ((Word, 'Title'/'Headline'), Count)



totalTuple = ShuffledRDD[18] at reduceByKey at <console>:45


ShuffledRDD[18] at reduceByKey at <console>:45

In [9]:
import java.io.PrintWriter

var totalOutput = totalTuple.
  map{ case ((word, toh), count) =>
    (toh, (word, count))
  }.groupByKey.mapValues{ iterator =>
      iterator.toVector.sortBy { case(word, count) => 
          (-count, word)
      }.map{
          case(word,count) => word + ","+ count
      }.mkString("\n")
  }

totalOutput.foreach{ case (toh, vect_string) =>
   new PrintWriter("./output/total_"+toh+".csv") { try {write(vect_string)} finally {close()} }
}

totalOutput = MapPartitionsRDD[21] at mapValues at <console>:49


MapPartitionsRDD[21] at mapValues at <console>:49

### Task 4 - Co-occurance Matrices for the Top-100 Frequent Words in Headline and Title in each Topic

In [8]:
var sizeOfTop = 15

var perTopicTop100 = perTopicTuple.
  map{ case ((word, toh, topic), count) =>
    ((word, topic), count)
  }.reduceByKey{
      case (i, j) => i + j
  }.
  map{ case ((word, topic), count) =>
    (topic, (word, count))
  }.groupByKey.mapValues{ iterator =>
      iterator.toVector.sortBy { case(word, count) => 
          (-count, word)
      }.take(sizeOfTop).map{case(word, count) => word}
  }.collectAsMap()

print("\nSample Top "+sizeOfTop+" Words (Per Topic): \n")
perTopicTop100.take(4).foreach(println)
print(" -- (topic, Vector[String]) "+"\n\n")


Sample Top 15 Words (Per Topic): 
(obama,Vector(obama, to, the, in, president, a, for, has, on, and, atlantic, been, blame, clinton, drilling))
(microsoft,Vector(the, microsoft, to, a, is, of, on, corporate, could, in, nadella, quot, that, windows, 10))
(economy,Vector(the, economy, of, and, in, to, with, a, as, for, is, outlook, this, about, growth))
(palestine,Vector(liberation, organization, palestine, the, and, appeared, chief, cnn, criticized, fatally, for, how, israeli, of, on))
 -- (topic, Vector[String]) 



sizeOfTop = 15
perTopicTop100 = Map(obama -> Vector(obama, to, the, in, president, a, for, has, on, and, atlantic, been, blame, clinton, drilling), microsoft -> Vector(the, microsoft, to, a, is, of, on, corporate, could, in, nadella, quot, that, windows, 10), economy -> Vector(the, economy, of, and, in, to, with, a, as, for, is, outlook, this, about, growth), palestine -> Vector(liberation, organization, palestine, the, and, appeared, chief, cnn, criticized, fatally, for, how, israeli, of, on))


Map(obama -> Vector(obama, to, the, in, president, a, for, has, on, and, atlantic, been, blame, clinton, drilling), microsoft -> Vector(the, microsoft, to, a, is, of, on, corporate, could, in, nadella, quot, that, windows, 10), economy -> Vector(the, economy, of, and, in, to, with, a, as, for, is, outlook, this, about, growth), palestine -> Vector(liberation, organization, palestine, the, and, appeared, chief, cnn, criticized, fatally, for, how, israeli, of, on))

In [10]:
import java.io.PrintWriter

var perTopicOccurance = smallerSampleSize.
  map{ attr =>
      var titleAndHeadline = wordPreProcess(attr(1).toString + attr(2).toString)
      var list =  ListBuffer[Int]()
      perTopicTop100(attr(4)).foreach{ word =>
          list += (if (titleAndHeadline.contains(word)) 1 else 0)
      }
      (attr(4),list.toSeq)
  }.groupByKey

var exportMatrix = perTopicOccurance.map{
      case (topic, iterator) =>
      val matrix = Array.ofDim[Int](sizeOfTop,sizeOfTop)
      val occur_matrix = Array.ofDim[Int](sizeOfTop,sizeOfTop)
      val result_matrix = Array.ofDim[Double](sizeOfTop,sizeOfTop)
      iterator.map{
          list =>
          val indexWithValue = list.zipWithIndex.filter(_._1 != 0).map(_._2)
          for( x <- 0 until indexWithValue.length ; y <- 0 until indexWithValue.length ){
              matrix(indexWithValue(x))(indexWithValue(y)) += 1
          }
          
      }
      val resultString = ","+perTopicTop100(topic).mkString(",") +"\n" + 
                         matrix.zipWithIndex.map{ case(x,i) => perTopicTop100(topic)(i)+
                         ","+x.mkString(",")}.mkString("\n")
      (topic,resultString)
  }

exportMatrix.foreach{
    case (topic, resultString) =>
    new PrintWriter("./output/"+topic+"_co-occurance.csv") { try {write(resultString)} finally {close()} }
}

println("Occurance of Top "+sizeOfTop+" Words in "+perTopicOccurance.first()._1+" (Headline & Title):")
println(perTopicOccurance.first()+"\n")

var df_2 = spark.read.
  format("csv").
  option("header", "true").
  csv("./output/"+perTopicOccurance.first()._1+"_co-occurance.csv")

println("Co-Occurance Matrix for Top "+sizeOfTop+" Words in "+perTopicOccurance.first()._1+" (Headline & Title):")
df_2.show

Occurance of Top 15 Words in economy (Headline & Title):
(economy,CompactBuffer(List(1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), List(0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), List(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0), List(1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0), List(1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1), List(1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0), List(0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0), List(1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1), List(1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1), List(1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0), List(1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0), List(1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0), List(1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0), List(1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0)))

Co-Occurance Matrix for Top 15 Words in economy (Headline & Title):
+-------+---+-------+---+---+---+---+----+---+---+---+---+-------+----+-----+------+
|    _c0|the|economy| of|and| in| to|with|  a| as|f

perTopicOccurance = ShuffledRDD[48] at groupByKey at <console>:60
exportMatrix = MapPartitionsRDD[49] at map at <console>:62
df_2 = [_c0: string, the: string ... 14 more fields]


[_c0: string, the: string ... 14 more fields]

103820004 Michael Fu