Clustering emoticons based on tweets
====================================

In this notebook we look at the symbols in the Unicode block
*Emoticons*, which contains 80 commonly used emojis. The goal is to find
out which emoticons are related to each other and, hopefully, to find
clusters that correspond roughly to the sentiment an emoticon expresses.
We do this in a fairly naïve way, since our focus was on learning
streaming, Spark and Scala. First let's have a look at the emojis in
question; they are presented in the table from Wikipedia below.

  

In the following two cells we create a list of these emoticons and load
the previously collected dataset of tweets.

In [None]:
val emoticonsList = List(
  "😀", "😁", "😂",	"😃", "😄",	"😅", "😆",	"😇", "😈",	"😉", "😊",	"😋", "😌", "😍", "😎", "😏",
  "😐",	"😑", "😒",	"😓", "😔", "😕", "😖", "😗", "😘", "😙", "😚", "😛", "😜", "😝", "😞", "😟",
  "😠",	"😡", "😢", "😣", "😤", "😥", "😦", "😧", "😨", "😩", "😪", "😫", "😬", "😭", "😮", "😯",
  "😰",	"😱", "😲", "😳", "😴", "😵", "😶", "😷", "😸", "😹", "😺", "😻", "😼", "😽", "😾", "😿",
  "🙀",	"🙁", "🙂", "🙃", "🙄", "🙅", "🙆", "🙇", "🙈", "🙉", "🙊", "🙋", "🙌", "🙍", "🙎", "🙏"
)

val emoticonsMap = emoticonsList.zipWithIndex.toMap
val nbrEmoticons = emoticonsList.length

In [None]:
val fullDF = sqlContext.read.parquet("/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/{2020,2021,continuous_22_12}/*/*/*/*/*")
println(fullDF.count)

  

>     2013137
>     fullDF: org.apache.spark.sql.DataFrame = [CurrentTweetDate: timestamp, CurrentTwID: bigint ... 7 more fields]

  

### How to cluster emoticons

We could just look at the descriptions and appearances of the various
emoticons and cluster them into broad categories based on that. However,
instead we will try to use our collected tweet dataset to create a
clustering. Then we will use the intuition based on the descriptions and
appearances to judge how successful this approach was.

We will use the Jaccard distance between emoticons to try to cluster
them. The Jaccard distance between emoticons $e_1$ and $e_2$ is given by

\[ d(e_1, e_2) = 1 - \frac{\#(e_1 \wedge e_2)}{\#(e_1) + \#(e_2) - \#(e_1 \wedge e_2)}, \]

where $\#(e)$ is the number of tweets collected containing the
emoticon $e$, and $\#(e_1 \wedge e_2)$ is the number of tweets
collected containing both $e_1$ and $e_2$.
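
As a small illustration of this definition, the distance can be computed directly from the sets of tweets that contain each emoticon. This is only a sketch on made-up tweets, not the collected dataset; the function name `jaccard_distance`, the toy tweets, and the choice to return 1.0 when neither emoticon occurs are assumptions made for the example.

```python
def jaccard_distance(tweets, e1, e2):
    """Jaccard distance between emoticons e1 and e2 over a list of tweet strings."""
    with_e1 = {i for i, t in enumerate(tweets) if e1 in t}  # tweets containing e1
    with_e2 = {i for i, t in enumerate(tweets) if e2 in t}  # tweets containing e2
    both = len(with_e1 & with_e2)                # #(e1 and e2)
    union = len(with_e1) + len(with_e2) - both   # #(e1) + #(e2) - #(e1 and e2)
    if union == 0:
        # assumption: emoticons that never occur are treated as maximally distant
        return 1.0
    return 1.0 - both / union


# made-up tweets, for illustration only
toy_tweets = ["so funny 😂😭", "lol 😂", "sad 😭", "ok 😀", "😂😂 crying 😭"]

print(jaccard_distance(toy_tweets, "😂", "😭"))  # 2 shared tweets out of 4 -> 0.5
print(jaccard_distance(toy_tweets, "😂", "😀"))  # never together -> 1.0
```

Two emoticons that always appear in the same tweets have distance 0, and two that never appear together have distance 1.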

In order to find the Jaccard distances between emoticons, we must create
a matrix containing for each pair of emoticons how often they appear
together in the dataset of tweets and also in how many tweets each
emoticon appears individually. First we define a function to create such
a matrix for an individual tweet. Then we will sum these matrices for
all the tweets. The matrices will be represented by a 1D array following
a certain indexing scheme and containing the entries of the upper
triangular part of the matrix (there would be redundancy in finding the
whole matrix since it will be symmetric).
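
The packed layout and the per-tweet summation described above can be sketched in Python on a toy alphabet of three emoticons. The names `pair_to_index` and `tweet_matrix` and the example tweets are hypothetical and chosen for illustration; the notebook's own implementation is the Scala code in the next cell.

```python
import re

emoticons = ["😀", "😂", "😭"]  # toy alphabet (the notebook uses all 80)
n = len(emoticons)
index_of = {e: i for i, e in enumerate(emoticons)}
pattern = re.compile("[" + "".join(emoticons) + "]")


def pair_to_index(a, b):
    """Position of the unordered pair (a, b) in the packed upper triangle."""
    i, j = min(a, b), max(a, b)
    return i * n - (i * (i + 1)) // 2 + j


def tweet_matrix(tweet):
    """Packed 0/1 matrix with a 1 for every pair of emoticons present in the tweet."""
    m = [0] * (n * (n + 1) // 2)
    present = {index_of[e] for e in pattern.findall(tweet)}
    for a in present:
        for b in present:
            m[pair_to_index(a, b)] = 1
    return m


# made-up tweets, for illustration only
toy_tweets = ["so funny 😂😭", "lol 😂", "😀😂"]

# elementwise sum of the per-tweet matrices
total = [sum(col) for col in zip(*(tweet_matrix(t) for t in toy_tweets))]

# layout for n = 3: (0,0) (0,1) (0,2) (1,1) (1,2) (2,2)
print(total)  # [1, 1, 0, 3, 1, 1]
```

The diagonal entries (positions 0, 3 and 5 here) count the tweets containing each emoticon, and the off-diagonal entries count co-occurrences.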

In [None]:
def emoticonPairToIndex (a : Int, b : Int) : Int = { // helper function for indexing
  val i = if (a < b) a else b // makes sure j >= i
  val j = if (a < b) b else a
  return i*nbrEmoticons - (i * (i+1))/2 + j 
}

def createEmoticonIterator (s : String) : scala.util.matching.Regex.MatchIterator = { // helper function for iterating through the emoticons in a string s
  return s"""[${emoticonsList.mkString}]""".r.findAllIn(s)
}

def createEmoticonMatrix (s : String) : Array[Int] = { // The pair (i, j) with i <= j is stored at index i*nbrEmoticons - (i*(i+1))/2 + j, i.e. the upper triangle (diagonal included) in row-major order (the Python cells later use the same indexing)
  val m = Array.fill((nbrEmoticons*nbrEmoticons + nbrEmoticons)/2)(0) // there are 80 emoticons and thus 80^2 / 2 + 80 / 2 = 3240 emoticon pairs including pairs of the same emoticon
  val present = createEmoticonIterator(s).map(emoticonsMap).toSet // indices of the distinct emoticons occurring in s
  // sets m to 1 for each index corresponding to a pair of emoticons present in the string s
  for (a <- present; b <- present) m(emoticonPairToIndex(a, b)) = 1
  return m
}

  

>     emoticonPairToIndex: (a: Int, b: Int)Int
>     createEmoticonIterator: (s: String)util.matching.Regex.MatchIterator
>     createEmoticonMatrix: (s: String)Array[Int]

  

In the cell below we sum all the occurrence matrices and print the
diagonal of the summed matrix, i.e. the number of tweets containing each
individual emoticon. It is clear that some emoticons are used far more
often than others.

In [None]:
val emoticonsMatrix = fullDF.select("CurrentTweet")
                            .filter($"CurrentTweet".rlike(emoticonsList.mkString("|"))) // filters tweets with emoticons
                            .map(row => createEmoticonMatrix(row.mkString)) // creates an "adjacency matrix" for each tweet
                            .reduce((_, _).zipped.map(_ + _)) // sums the matrices elementwise

emoticonsList.zipWithIndex.foreach({case (e, i) => println(e + ", " + Integer.toString(emoticonsMatrix(emoticonPairToIndex(i, i))) + " occurences")})

  

>     😀, 3510 occurences
>     😁, 8679 occurences
>     😂, 65997 occurences
>     😃, 3603 occurences
>     😄, 2851 occurences
>     😅, 10402 occurences
>     😆, 4817 occurences
>     😇, 2550 occurences
>     😈, 2127 occurences
>     😉, 6846 occurences
>     😊, 11300 occurences
>     😋, 3729 occurences
>     😌, 3664 occurences
>     😍, 18604 occurences
>     😎, 5412 occurences
>     😏, 3181 occurences
>     😐, 1417 occurences
>     😑, 1303 occurences
>     😒, 2678 occurences
>     😓, 1264 occurences
>     😔, 5475 occurences
>     😕, 973 occurences
>     😖, 805 occurences
>     😗, 416 occurences
>     😘, 8414 occurences
>     😙, 381 occurences
>     😚, 797 occurences
>     😛, 892 occurences
>     😜, 3070 occurences
>     😝, 1405 occurences
>     😞, 1743 occurences
>     😟, 401 occurences
>     😠, 815 occurences
>     😡, 2853 occurences
>     😢, 3563 occurences
>     😣, 1064 occurences
>     😤, 1778 occurences
>     😥, 1181 occurences
>     😦, 101 occurences
>     😧, 170 occurences
>     😨, 353 occurences
>     😩, 5055 occurences
>     😪, 1828 occurences
>     😫, 1594 occurences
>     😬, 2828 occurences
>     😭, 23494 occurences
>     😮, 483 occurences
>     😯, 328 occurences
>     😰, 612 occurences
>     😱, 2205 occurences
>     😲, 708 occurences
>     😳, 3712 occurences
>     😴, 1906 occurences
>     😵, 483 occurences
>     😶, 577 occurences
>     😷, 1568 occurences
>     😸, 230 occurences
>     😹, 933 occurences
>     😺, 143 occurences
>     😻, 1177 occurences
>     😼, 178 occurences
>     😽, 173 occurences
>     😾, 58 occurences
>     😿, 127 occurences
>     🙀, 141 occurences
>     🙁, 417 occurences
>     🙂, 3283 occurences
>     🙃, 2935 occurences
>     🙄, 7233 occurences
>     🙅, 737 occurences
>     🙆, 887 occurences
>     🙇, 1829 occurences
>     🙈, 2492 occurences
>     🙉, 193 occurences
>     🙊, 641 occurences
>     🙋, 2220 occurences
>     🙌, 6954 occurences
>     🙍, 21 occurences
>     🙎, 37 occurences
>     🙏, 25801 occurences
>     emoticonsMatrix: Array[Int] = Array(3510, 163, 149, 188, 115, 81, 91, 33, 4, 51, 64, 32, 7, 75, 44, 5, 6, 6, 6, 1, 2, 0, 1, 11, 41, 13, 10, 15, 36, 14, 8, 1, 2, 2, 6, 0, 1, 15, 0, 0, 2, 3, 0, 2, 10, 17, 10, 4, 3, 15, 9, 15, 1, 4, 1, 18, 9, 0, 3, 3, 1, 2, 1, 1, 0, 6, 26, 17, 27, 0, 2, 4, 11, 1, 3, 15, 34, 0, 0, 92, 8679, 476, 135, 171, 225, 180, 44, 14, 161, 227, 75, 39, 161, 114, 39, 17, 8, 17, 7, 19, 9, 5, 15, 132, 14, 50, 36, 104, 36, 10, 1, 3, 13, 22, 2, 12, 14, 0, 0, 0, 22, 5, 6, 30, 49, 6, 6, 4, 30, 17, 27, 7, 7, 8, 37, 3, 0, 3, 6, 2, 2, 1, 0, 1, 6, 53, 31, 63, 5, 9, 36, 47, 5, 13, 34, 74, 0, 0, 244, 65997, 139, 151, 977, 348, 104, 39, 365, 206, 117, 118, 437, 222, 115, 37, 28, 95, 20, 84, 20, 21, 14, 285, 17, 26, 46, 295, 108, 19, 9, 12, 42, 73, 22, 40, 26, 4, 5, 5, 363, 33, 72, 174, 2074, 26, 15, 11, 93, 43, 203, 38, 21, 17, 50, 5, 65, 2, 39, 2, 0, 2, 0, 1, 15, 110, 119, 355, 24, 95, 19, 306, 14, 60, 73, 183, 0, 1, 493, 3603, 176, 131, 117, 22, 3, 86, 189, 33, 16, 80, 30, 10, 5, 2, 4, 6, 8, 5, 4, 11, 59, 10, 13, 18, 26, 20, 2, 2, 2, 5, 12, 5, 3, 3, 0, 0, 8, 2, 2, 2, 10, 34, 9, 7, 3, 14, 13, 14, 3, 3, 1, 17, 9, 0, 6, 4, 2, 4, 0, 0, 2, 2, 30, 15, 12, 1, 22, 82, 4, 1, 1, 16, 27, 0, 0, 102, 2851, 130, 122, 22, 6, 58, 104, 29, 20, 66, 37, 9, 4, 5, 4, 2, 8, 4, 2, 10, 35, 10, 12, 15, 34, 20, 2, 2, 0, 3, 7, 4, 4, 3, 0, 0, 3, 1, 6, 3, 11, 25, 8, 5, 2, 6, 11, 16, 3, 1, 0, 10, 5, 3, 7, 5, 0, 4, 1, 0, 0, 2, 31, 14, 21, 1, 10, 23, 16, 0, 1, 19, 20, 0, 0, 69, 10402, 222, 53, 21, 124, 169, 71, 54, 118, 52, 62, 18, 15, 23, 20, 31, 5, 7, 18, 79, 13, 14, 23, 70, 36, 13, 4, 4, 6, 39, 17, 5, 6, 2, 1, 6, 44, 11, 12, 57, 157, 10, 7, 13, 45, 19, 40, 20, 11, 3, 17, 7, 13, 2, 3, 1, 1, 0, 1, 0, 5, 39, 48, 77, 11, 12, 67, 64, 4, 12, 19, 34, 0, 1, 135, 4817, 20, 5, 87, 160, 61, 27, 89, 36, 14, 5, 3, 7, 7, 10, 4, 3, 10, 52, 10, 13, 14, 43, 27, 4, 7, 2, 3, 8, 4, 8, 15, 1, 3, 2, 11, 1, 5, 19, 51, 10, 6, 2, 15, 15, 36, 6, 4, 1, 8, 7, 12, 4, 4, 1, 3, 1, 0, 2, 2, 23, 19, 43, 0, 9, 58, 
23, 4, 2, 20, 49, 0, 0, 68, 2550, 24, 29, 99, 28, 20, 86, 46, 9, 1, 3, 3, 0, 7, 2, 0, 6, 57, 10, 13, 10, 14, 9, 4, 1, 2, 3, 10, 1, 2, 1, 0, 0, 3, 2, 2, 1, 17, 37, 4, 5, 1, 12, 15, 16, 9, 11, 0, 22, 3, 0, 2, 3, 2, 3, 0, 0, 1, 1, 22, 11, 16, 0, 3, 6, 5, 0, 0, 10, 40, 0, 0, 329, 2127, 46, 15, 45, 4, 40, 32, 57, 2, 1, 5, 16, 2, 1, 0, 0, 26, 1, 1, 11, 37, 6, 2, 0, 5, 6, 3, 2, 5, 0, 0, 0, 3, 11, 3, 2, 8, 12, 3, 0, 2, 8, 0, 4, 2, 0, 0, 3, 2, 2, 0, 4, 5, 2, 0, 0, 0, 0, 3, 6, 5, 1, 1, 0, 12, 0, 4, 10, 8, 0, 0, 11, 6846, 223, 86, 42, 127, 98, 74, 8, 9, 9, 11, 14, 4, 0, 18, 244, 15, 27, 26, 77, 20, 6, 3, 3, 2, 12, 2, 5, 5, 0, 2, 1, 7, 2, 1, 23, 18, 6, 6, 2, 14, 13, 24, 8, 0, 1, 25, 5, 4, 3, 13, 0, 0, 0, 1, 0, 3, 45, 46, 46, 7, 0, 25, 23, 5, 6, 25, 75, 0, 0, 173, 11300, 102, 58, 221, 82, 36, 23, 18, 14, 18, 26, 10, 9, 21, 254, 24, 48, 16, 39, 20, 10, 3, 0, 5, 35, 10, 2, 7, 0, 2, 6, 23, 8, 10, 28, 86, 6, 5, 5, 22, 12, 31, 23, 4, 8, 39, 8, 4, 4, 17, 1, 6, 0, 0, 1, 11, 65, 37, 50, 3, 11, 179, 20, 0, 2, 75, 99, 0, 0, 612, 3729, 49, 241, 45, 39, 10, 6, 13, 3, 14, 9, 4, 22, 106, 26, 37, 58, 54, 33, 13, 8, 4, 2, 15, 5, 5, 2, 0, 0, 3, 18, 2, 10, 12, 28, 4, 7, 0, 12, 8, 21, 6, 1, 1, 10, 6, 1, 3, 18, 1, 6, 0, 0, 0, 12, 22, 19, 25, 2, 4, 26, 25, 4, 7, 13, 20, 0, 0, 76, 3664, 87, 28, 47, 6, 5, 7, 3, 21, 12, 1, 3, 35, 11, 12, 6, 19, 11, 6, 5, 1, 2, 11, 3, 5, 4, 0, 2, 1, 18, 10, 9, 8, 40, 1, 0, 1, 7, 2, 10, 14, 4, 2, 6, 1, 3, 0, 1, 2, 0, 0, 0, 0, 3, 29, 14, 20, 1, 4, 15, 9, 1, 23, 6, 38, 0, 0, 101, 18604, 100, 53, 13, 13, 18, 10, 33, 11, 8, 35, 745, 44, 64, 47, 86, 40, 8, 7, 17, 15, 29, 4, 11, 10, 0, 1, 2, 102, 16, 22, 24, 324, 20, 9, 8, 54, 23)

  

In the following two cells we create the Jaccard distance matrix which
we want to use to cluster the emoticons.

In [None]:
def jaccardDistance (e1 : Int, e2 : Int) : Double = { // specify the emojis in terms of their indices in the list
  return 1.0 - 1.0 * emoticonsMatrix(emoticonPairToIndex(e1, e2)) / 
    (emoticonsMatrix(emoticonPairToIndex(e1, e1)) +  emoticonsMatrix(emoticonPairToIndex(e2, e2)) - emoticonsMatrix(emoticonPairToIndex(e1, e2)))
}

  

>     jaccardDistance: (e1: Int, e2: Int)Double

In [None]:
var jaccardMatrix = Array.fill(emoticonsMatrix.length)(1.0)
(0 until nbrEmoticons).foreach(i => (i until nbrEmoticons).foreach(j => (jaccardMatrix(emoticonPairToIndex(i, j)) = jaccardDistance(i, j))))

  

>     jaccardMatrix: Array[Double] = Array(0.0, 0.98644603359388, 0.997851725828311, 0.9728519855595668, 0.9815882164585334, 0.9941435904851421, 0.9889509470616804, 0.9945246391239423, 0.9992898988105805, 0.9950509461426492, 0.9956598399565983, 0.9955598723463299, 0.9990233012418027, 0.9965969417850175, 0.9950439288127957, 0.9992521687107389, 0.9987807356228409, 0.9987518202621177, 0.999029440310579, 0.9997904881625812, 0.9997773572303239, 1.0, 0.9997681965693093, 0.9971902937420178, 0.9965496928385088, 0.9966477565755544, 0.9976727949732371, 0.9965808069295646, 0.9944987775061125, 0.9971434401142624, 0.9984747378455672, 0.9997442455242966, 0.9995373583159842, 0.9996855840276686, 0.9991509834441772, 1.0, 0.9998108568186117, 0.9967921300256629, 1.0, 1.0, 0.9994819994819995, 0.9996496145760336, 1.0, 0.999607996863975, 0.9984197218710493, 0.9993700670693296, 0.9974893296510168, 0.9989567031820553, 0.9992716678805535, 0.9973684210526316, 0.9978617248752673, 0.9979186901623421, 0.9998153277931672, 0.9989972424166458, 0.9997552618697993, 0.9964426877470356, 0.997587778075583, 1.0, 0.9991780821917808, 0.9993595217762596, 0.9997287767832926, 0.9994566693833198, 0.9997196523689375, 0.9997249724972497, 1.0, 0.9984697781178271, 0.9961578247376977, 0.997355320472931, 0.9974804031354984, 1.0, 0.9995449374288965, 0.9992502343017807, 0.9981639125354699, 0.9997298757428417, 0.9992767598842816, 0.9973753280839895, 0.9967401725790987, 1.0, 1.0, 0.9968513638385982, 0.0, 0.9935849056603774, 0.9888861447270931, 0.9849458579100273, 0.9880674586338566, 0.9864824271553019, 0.9960661600357622, 0.998702742772424, 0.9895209580838323, 0.9885074929121102, 0.993918754560934, 0.9968302990897269, 0.9940638595973749, 0.9918437432925521, 0.9967007867354708, 0.9983133247345967, 0.9991979145779025, 0.9985008818342151, 0.9992954911433173, 0.9986558188892819, 0.9990666804936223, 0.9994725181981222, 0.9983480176211453, 0.9922174400094335, 0.9984523546318815, 0.99469552302143, 0.9962244362873623, 
0.9910691283812795, 0.9964171974522293, 0.9990395697272378, 0.9998898557109814, 0.9996839110736487, 0.9988714298116156, 0.9981996726677578, 0.9997946822708141, 0.9988511249401627, 0.998578102782856, 1.0, 1.0, 1.0, 0.9983955659276547, 0.9995239002094839, 0.9994156033895003, 0.9973860765008278, 0.9984746606898269, 0.9993446920052425, 0.9993334073991779, 0.9995692904059438, 0.9972360420121614, 0.9981856990394877, 0.9978162406988029, 0.9993382491964454, 0.9992353904969962, 0.9991349480968859, 0.9963761018609206, 0.9996631484392544, 1.0, 0.9996598253770269, 0.9993908629441625, 0.9997741389045737, 0.999774011299435, 0.9998855311355311, 1.0, 0.999886608459009, 0.9993399339933994, 0.9955495843479721, 0.9973236639903307, 0.9960249858035207, 0.9994687068324302, 0.9990582818876217, 0.9965622612681436, 0.9957749011147069, 0.9994361114243825, 0.9986032018910498, 0.996870685687989, 0.9952439102770101, 1.0, 1.0, 0.9928729991821474, 0.0, 0.9979988770677071, 0.9978019418606344, 0.9870462199358278, 0.9950614480742486, 0.9984804874128838, 0.9994271866049791, 0.9949639890725461, 0.9973278333398192, 0.9983191828642848, 0.9983032080870828, 0.9948077562853477, 0.9968814530742973, 0.9983348536843172, 0.99945085118067, 0.9995837792841004, 0.9986147564887722, 0.9997025624247111, 0.9988233316523786, 0.9997012696041823, 0.9996855393001003, 0.9997891534511062, 0.9961551952081591, 0.9997438254396408, 0.9996105919003115, 0.9993118202354772, 0.9957104635607514, 0.9983951020893393, 0.9997194371022282, 0.9998644353733299, 0.9998203592814371, 0.9993896058597838, 0.998949443780851, 0.9996718328137353, 0.9994094633498192, 0.9996128186800095, 0.9999394801343541, 0.9999244279193494, 0.9999246363704876, 0.9948648304545261, 0.9995132168987492, 0.9989336334957567, 0.9974654411443388, 0.9762746376562911, 0.9996087519186204, 0.9997737897752978, 0.999834829874771, 0.9986345416905255, 0.9993549548468392, 0.9970793888297413, 0.9994400648345981, 0.9996840157089333, 0.9997445798338267, 0.999259423831741, 
0.9999244963909275, 0.9990278920212369, 0.9999697601983731, 0.9994190809562821, 0.999969776192707, 1.0, 0.9999697212844231, 1.0, 0.9999848798705717, 0.9997740929833281, 0.9984097151944484, 0.9982706755990873, 0.9951286449399657, 0.9996402338479988, 0.9985776100854932, 0.9997197929417316, 0.9955120777906517, 0.9997884429400387, 0.9990988014058698, 0.9989287391406434, 0.9974851583113457, 1.0, 0.9999848560568201, 0.994600514758228, 0.0, 0.9719655941382606, 0.9905578780452645, 0.9859087076960135, 0.9964116783558963, 0.9994761655316919, 0.991701244813278, 0.9871550903901046, 0.9954788327168105, 0.9977934078058199, 0.9963845076151309, 0.996661101836394, 0.9985237673457337, 0.9990029910269193, 0.9995921696574225, 0.9993627529074398, 0.9987656860728246, 0.9991179713340683, 0.9989061474513236, 0.9990917347865577, 0.997255489021956, 0.9950660645592908, 0.9974836436839456, 0.997036699338956, 0.9959794505249051, 0.9960884609598315, 0.9959903769045709, 0.999625748502994, 0.9995002498750625, 0.9995471014492754, 0.999224926368005, 0.99832261671792, 0.9989274989274989, 0.9994421718110822, 0.9993725162099979, 1.0, 1.0, 0.9979736575481256, 0.9997689463955638, 0.999631608030945, 0.9996150144369587, 0.9984426101853294, 0.9987436721723386, 0.9977924944812362, 0.9982161060142711, 0.9992877492877493, 0.9975837072833966, 0.9969753373662168, 0.9980824544582934, 0.9994551398474392, 0.9992652461425422, 0.9997607083034219, 0.9967015909972836, 0.9976464435146444, 1.0, 0.9983957219251337, 0.9991624790619765, 0.9994707594601746, 0.9989395546129375, 1.0, 1.0, 0.9994655264564404, 0.9995022399203584, 0.9956242707117853, 0.9977004445807144, 0.9988913525498891, 0.999769532150265, 0.9950760966875559, 0.9846728971962617, 0.9993432933836809, 0.9997364953886693, 0.9997643176997407, 0.9972447046667815, 0.9974358974358974, 1.0, 1.0, 0.9965190089413692, 0.0, 0.9900937285681628, 0.9838324940365757, 0.9959100204498977, 0.998793242156074, 0.9939827782965038, 0.9925962839040364, 0.9955731949320714, 
0.9969207082371054, 0.9969143017438871, 0.9955020666180403, 0.9985057280425037, 0.99906191369606, 0.9987948903350204, 0.9992760180995475, 0.99951373693168, 0.9990382303438327, 0.9989528795811519, 0.9994526546250684, 0.9969296898986798, 0.9968833481745325, 0.9968963376784605, 0.9966996699669967, 0.9959763948497854, 0.9942245625955495, 0.9952785646836638, 0.9995644599303136, 0.9993846153846154, 1.0, 0.9994737765304332, 0.9989074449820509, 0.9989772436716953, 0.9991351351351352, 0.9992553983618764, 1.0, 1.0, 0.9990627928772259, 0.9998734977862113, 0.9987160282473786, 0.9993246285457001, 0.998059280169372, 0.9990501519756839, 0.9975947083583885, 0.9984247006931317, 0.9994221323316961, 0.9988118811881188, 0.9968996617812852, 0.9975561325798076, 0.9993689524610854, 0.9996999699969997, 1.0, 0.9977319119981856, 0.9983745123537061, 0.9992065591113463, 0.9976565115500502, 0.9987571464081532, 1.0, 0.9986754966887417, 0.999656121045392, 1.0, 1.0, 0.9993876301285977, 0.9949205308864493, 0.9975744975744976, 0.9979131471728113, 0.9997212155004181, 0.9973175965665236, 0.9950611981962637, 0.9969964332645016, 1.0, 0.9997135491263248, 0.9962391132224862, 0.99795605518651, 1.0, 1.0, 0.99758597767904, 0.0, 0.9851970394078816, 0.9958911543530506, 0.9983210745123121, 0.9927587012380285, 0.9921515812938281, 0.994950213371266, 0.9961461604339138, 0.995915258931044, 0.996700926278391, 0.995414540344649, 0.9984747055334293, 0.9987168520102652, 0.9982384927625029, 0.9982826721621157, 0.9980436703268963, 0.9995602462620933, 0.999375, 0.9983333333333333, 0.9957837433954209, 0.9987929433611885, 0.9987483236477425, 0.9979593647413717, 0.994776898970303, 0.9969416362246198, 0.9989284536762282, 0.9996295953329012, 0.9996432712030678, 0.999547135632878, 0.9971994829814735, 0.9985151541619356, 0.9995893223819302, 0.9994817310183985, 0.9998095419483859, 0.9999054015703339, 0.9994418085403294, 0.9971452669824175, 0.9990997626647025, 0.9989986648865153, 0.9956729674333865, 0.9953466314947094, 
0.999080459770115, 0.9993471976126084, 0.9988182892464321, 0.996417767871358, 0.9982868992877107, 0.9971578797783146, 0.9983723958333334, 0.9989884127276072, 0.9997266763848397, 0.9985777629047101, 0.9993411764705883, 0.99885179296944, 0.9998103006734326, 0.9997408431237043, 0.999905473107099, 0.9999054284093059, 1.0, 0.9999050151975684, 1.0, 0.9995376363972628, 0.9971420196394548, 0.9963879900669727, 0.9956145346850439, 0.9990115025161754, 0.9989358872040436, 0.994491943439658, 0.9950116913484022, 0.9996223208384477, 0.9989121566494424, 0.9984924224391019, 0.9980371781549474, 1.0, 0.9999041962061698, 0.9962570699789287, 0.0, 0.9972778004627739, 0.9992794350771005, 0.9924844505874223, 0.9899730525788055, 0.9928108426635238, 0.9968062455642299, 0.996185496314075, 0.9964681644265673, 0.9982464929859719, 0.9991973029378712, 0.9995095635115253, 0.9990651709401709, 0.998847546921304, 0.9990274265707061, 0.9993086761147598, 0.999466097170315, 0.9980853915374306, 0.9960543288565141, 0.9980724749421742, 0.9976789858953758, 0.9975417032484636, 0.994518103008669, 0.9956416464891041, 0.9993898718730934, 0.9986566877758588, 0.9996447602131439, 0.9996087126646668, 0.9990444338270426, 0.9993193806363792, 0.9987854865644451, 0.9974928965401972, 0.9997966239576977, 0.9993980738362761, 0.9996130030959752, 0.9988844944731772, 0.9998494882600842, 0.9992194817358726, 0.9975085234723315, 0.9981953290870488, 0.998109640831758, 0.9988324576765908, 0.9996314722682882, 0.9978592835735692, 0.9972776769509982, 0.9957612151183327, 0.9991067440821796, 0.9992447129909365, 0.999814574448359, 0.9987454916104751, 0.9986111111111111, 0.9979086789822238, 0.9991928974979822, 0.9993322203672788, 0.999799759711654, 0.9993984359334269, 0.9997948297086582, 1.0, 0.9995964487489911, 0.9996177370030581, 0.9971524080723041, 0.9975429975429976, 0.9964187557258266, 1.0, 0.9984196663740122, 0.9911961141469339, 0.9968432610485863, 0.9992009588493808, 0.999633431085044, 0.9971497791078808, 0.9958198259682648, 
1.0, 1.0, 0.9977741407528642, 0.0, 0.9948420373952289, 0.9969040247678018, 0.9928005235982837, 0.9955207166853304, 0.9967710687762351, 0.9959179798746914, 0.9941889843355229, 0.9984271233834323, 0.9997478567826525, 0.9992207792207792, 0.9994258373205742, 1.0, 0.9991269643302569, 0.9994319795512638, 1.0, 0.9979729729729729, 0.9947739983496837, 0.9965765148921603, 0.9961007798440312, 0.997086247086247, 0.9975026757046022, 0.9977192093258996, 0.9990673816740498, 0.9996610169491525, 0.9994052928932501, 0.9994444444444445, 0.9983614615762739, 0.9997232216994187, 0.9995376791493297, 0.9997319034852546, 1.0, 1.0, 0.9989655172413793, 0.9997369459423912, 0.9995429616087751, 0.9997586290127927, 0.9968289498227942, 0.9985773061098935, 0.998679432155827, 0.998259658893143, 0.999683644416324, 0.9974699557242251, 0.9953746530989824, 0.9974383605507525, 0.9979761637058692, 0.9963600264725347, 1.0, 0.99462890625, 0.9989196975153043, 1.0, 0.9992567818654775, 0.9991944146079484, 0.9992663242846662, 0.9988970588235294, 1.0, 1.0, 0.9996282527881041, 0.9996628455832771, 0.9962140767509895, 0.9979905005480453, 0.9983618306542439, 1.0, 0.999126383226558, 0.9986279442030642, 0.9990073456422474, 1.0, 1.0, 0.9978991596638656, 0.9957734573119188, 1.0, 1.0, 0.9882592248947256, 0.0, 0.9948470930883836, 0.9988815985684462, 0.9922560660815695, 0.9993087955762917, 0.9980667923251655, 0.9957373118422805, 0.9891449247762331, 0.9994353472614342, 0.99970836978711, 0.9989583333333333, 0.9952592592592593, 0.9997368421052631, 0.9996773152629881, 1.0, 1.0, 0.9975273418925344, 0.9996011168727563, 0.9996578857338351, 0.996343085106383, 0.9928294573643411, 0.998298355076574, 0.999482936918304, 1.0, 0.9982975825672454, 0.9987937273823885, 0.9994724810972393, 0.9993728441517717, 0.9987179487179487, 1.0, 1.0, 1.0, 0.9987888574888979, 0.9984660437874774, 0.9992408906882592, 0.9994622210271579, 0.9983828582979584, 0.9995314147370066, 0.998849252013809, 1.0, 0.999269272926562, 0.9981498612395929, 1.0, 
0.9993144815766923, 0.9995038451997023, 1.0, 1.0, 0.9991874322860238, 0.9991507430997877, 0.999345977763244, 1.0, 0.9987878787878788, 0.9978260869565218, 0.999129677980853, 1.0, 1.0, 1.0, 1.0, 0.9994451636767153, 0.9988132911392406, 0.9994655264564404, 0.9996507160321342, 0.9996681048788583, 1.0, 0.9973952680703277, 1.0, 0.9985528219971056, 0.9976942587041734, 0.9991182629780668, 1.0, 1.0, 0.9996059748540316, 0.0, 0.9875578865145344, 0.9918009343121366, 0.995987772258311, 0.9949847964301228, 0.9919407894736842, 0.9925650557620818, 0.999030890369473, 0.9988943488943489, 0.9990541250656858, 0.9986418076305716, 0.9988624360120256, 0.999488163787588, 1.0, 0.9975151849806737, 0.9837506659563132, 0.9979201331114809, 0.9964548319327731, 0.9966286307053942, 0.9921740014229088, 0.9975701615842546, 0.9993009437259699, 0.9995858641634456, 0.9996082528075215, 0.9997937506445292, 0.9988458209098778, 0.9997470915528579, 0.9994198862977144, 0.9993767140363999, 1.0, 0.9997148560022812, 0.9998610725201444, 0.9994114679670422, 0.9997693726937269, 0.9998815025476953, 0.9976168272717854, 0.9994063716113712, 0.9991806636624334, 0.9991629464285714, 0.9997317596566524, 0.9984508133230054, 0.9982760907041507, 0.9977216631858743, 0.9990850869167429, 1.0, 0.9998652654271086, 0.9970199070210991, 0.9992928864375619, 0.9994855305466238, 0.9995705697108502, 0.9983770287141074, 1.0, 1.0, 1.0, 0.9998565691336776, 1.0, 0.9995867768595041, 0.9955374851249504, 0.9952747817154597, 0.9967220123993444, 0.9990760295670539, 1.0, 0.9971098265895953, 0.9975308641975309, 0.999289166903611, 0.9991979681860714, 0.9972348191571728, 0.994535519125683, 1.0, 1.0, 0.9946726612058878, 0.0, 0.993166744824814, 0.9961089494163424, 0.9925546609170232, 0.9950691521346964, 0.9975077881619938, 0.9981881203718292, 0.998569725864124, 0.9989974219421369, 0.9985652797704447, 0.9984476685175234, 0.9991845388567234, 0.9992559523809523, 0.9982043608379649, 0.9869475847893114, 0.9979411512395985, 0.996016266910117, 
0.9986859395532195, 0.9972786267531923, 0.9984233346472211, 0.9992327169492826, 0.9997435459052829, 1.0, 0.9996465931580435, 0.9976396007553278, 0.9991905455722843, 0.9998470480269196, 0.999438832772166, 1.0, 0.999825601674224, 0.99948484588306, 0.998591721773206, 0.999390243902439, 0.999223843526855, 0.9980141843971632, 0.9975221850870116, 0.9994905323936486, 0.9995698184633915, 0.9995800789451583, 0.998368315656753, 0.9989996665555185, 0.9979307122354983, 0.9982553288325874, 0.9996604125986925, 0.9993259752295897, 0.9969600124717437, 0.9993056760978997, 0.9996729086597432, 0.999650319083836, 0.9986356340288924, 0.9999128692166943, 0.9994767593965291, 1.0, 1.0, 0.9999125874125874, 0.9990603109516487, 0.9955227992836478, 0.9973939991548105, 0.997294811448358, 0.9997507063320592, 0.9990965834428384, 0.9861776061776062, 0.9985477781004938, 1.0, 0.9998324817823938, 0.9944217181108219, 0.9945469567612228, 1.0, 1.0, 0.9832278220833676, 0.0, 0.9933278867102396, 0.9890910736918341, 0.9950527704485488, 0.9943239703099985, 0.9980529595015576, 0.9988062077198567, 0.9979668439161714, 0.9993987975951903, 0.9984766050054407, 0.9980822501598124, 0.9991169977924945, 0.994664079553723, 0.9911938190579048, 0.9936336924583742, 0.9917576297616396, 0.9872890642121411, 0.9919940696812454, 0.9935306802587728, 0.9976186114673017, 0.9980591945657448, 0.9991189427312775, 0.9996960486322188, 0.9979387110072833, 0.9989557226399332, 0.9990912395492548, 0.9995925020374898, 1.0, 1.0, 0.9992645256190242, 0.997946611909651, 0.9996399639963996, 0.9981178242047807, 0.9981665393430099, 0.998970398970399, 0.9990494296577946, 0.9982716049382716, 1.0, 0.9979736575481256, 0.9981937231880785, 0.9971698113207547, 0.9989340913128442, 0.9997625267157445, 0.9997677119628339, 0.9981085681861169, 0.9984821654439666, 0.9997854537652864, 0.9992246058413027, 0.9963175122749591, 0.9997439836149513, 0.9984599589322382, 1.0, 1.0, 1.0, 0.9970972423802612, 0.9968526466380544, 0.9971407072987208, 0.9977141812197129, 
0.9995519713261649, 0.9991326973113617, 0.9953000723065799, 0.9959651387992253, 0.9989790709545686, 0.9983955993582397, 0.9978099730458221, 0.9981243552471162, 1.0, 1.0, 0.9974197053031846, 0.0, 0.9960777241783508, 0.9969053934571176, 0.9930862018240659, 0.9988177339901478, 0.9989923417976623, 0.9988950276243094, 0.9993908629441625, 0.9976968633472253, 0.9974054054054055, 0.9997761862130707, 0.9992641648270787, 0.9970937474051316, 0.9972731779871096, 0.9973027646662171, 0.9986813186813187, 0.9971705137751303, 0.9978252273625939, 0.9988890946121088, 0.9987684729064039, 0.9997766860205449, 0.9996930161166538, 0.9984756097560976, 0.9993650793650793, 0.9990803752069156, 0.9991737244370997, 1.0, 0.9994780793319415, 0.9997509960159362, 0.9979312722675554, 0.9981758482305728, 0.9982853876928939, 0.9987661937075879, 0.998524964967918, 0.9997588036661843, 1.0, 0.9997660818713451, 0.998805868304333, 0.9995423340961098, 0.9986424110779256, 0.9974802015838733, 0.9990345160511707, 0.9995281906109932, 0.9988518943742825, 0.999743128692525, 0.9993469743143231, 1.0, 0.9997933884297521, 0.9994791666666667, 1.0, 1.0, 1.0, 1.0, 0.9992643452672879, 0.9958080370049147, 0.9978739559605163, 0.9981612576997334, 0.9997727272727273, 0.9991202990983066, 0.9972617743702081, 0.9985358711566618, 0.9997406639004149, 0.9946286781877627, 0.9989792446410344, 0.9964083175803403, 1.0, 1.0, 0.9965604141125187, 0.0, 0.99581869877906, 0.9975612000736241, 0.9993502598960415, 0.9993465366442144, 0.9991534988713319, 0.9994964246147648, 0.9986276303751144, 0.9994378002657671, 0.9995876501211278, 0.9981564392941796, 0.9716438929699691, 0.9976769969906552, 0.9966902828773854, 0.9975834233122526, 0.9960163053548268, 0.9979968951875406, 0.9996066669944442, 0.9996315401621223, 0.999123801669931, 0.999300438391941, 0.9986900352335351, 0.9997965825874695, 0.9994600166903932, 0.9994943109987358, 1.0, 0.999946732008736, 0.9998944869427592, 0.995670076834911, 0.9992163009404389, 0.9989095955590801, 
0.9988789237668162, 0.9922439795087854, 0.9989510672890334, 0.9995243883105216, 0.9995835068721366, 0.9973982172970368, 0.9988076105552387)
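
As a worked check of the formula against the numbers printed above, take $e_1$ = 😂 and $e_2$ = 😭 (indices 2 and 45 in the list). The printed counts are $\#(e_1) = 65997$ and $\#(e_2) = 23494$, and the entry of `emoticonsMatrix` at position $2 \cdot 80 - 3 + 45 = 202$ gives $\#(e_1 \wedge e_2) = 2074$, so

\[ d(e_1, e_2) = 1 - \frac{2074}{65997 + 23494 - 2074} = 1 - \frac{2074}{87417} \approx 0.9763, \]

which agrees with the value 0.9762746376562911 that appears in the printed `jaccardMatrix`.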

  

Finally we write the Jaccard distance matrix and the emoticon list to
file, so that we don't have to keep rerunning the above cells and so that
we can load them into Python cells next.

In [None]:
scala.tools.nsc.io.File("/dbfs/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/emoticonsList.txt").writeAll(emoticonsList.mkString("\n"))
scala.tools.nsc.io.File("/dbfs/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/jMatrix.txt").writeAll(jaccardMatrix.mkString("\n"))

  

  

Clustering using Python
-----------------------

We now switch to Python cells in order to use various clustering methods
implemented in SciPy and scikit-learn. First we install and import some
packages for later use, then we load the previously saved Jaccard matrix
and emoticons list.

In [None]:
%pip install pycountry

  

>     Python interpreter will be restarted.
>     Collecting pycountry
>       Downloading pycountry-20.7.3.tar.gz (10.1 MB)
>     Building wheels for collected packages: pycountry
>       Building wheel for pycountry (setup.py): started
>       Building wheel for pycountry (setup.py): finished with status 'done'
>       Created wheel for pycountry: filename=pycountry-20.7.3-py2.py3-none-any.whl size=10746863 sha256=870cc02de7d6d11499effd59f2b73acdbffc04146079c4a4f9a841ca30f2b587
>       Stored in directory: /root/.cache/pip/wheels/57/e8/3f/120ccc1ff7541c108bc5d656e2a14c39da0d824653b62284c6
>     Successfully built pycountry
>     Installing collected packages: pycountry
>     Successfully installed pycountry-20.7.3
>     Python interpreter will be restarted.

In [None]:
import json
import os
from matplotlib import font_manager as fm, pyplot as plt, rcParams
import numpy as np
import pandas as pd
import pycountry
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.manifold import locally_linear_embedding, TSNE
from sklearn.neighbors import NearestNeighbors

In [None]:
jMatrix = np.loadtxt("/dbfs/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/jMatrix.txt")

emoticonsList = []

with open("/dbfs/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/emoticonsList.txt", 'r') as filehandle:
    for line in filehandle:
        e = line.strip() #remove line break
        emoticonsList.append(e)

nbrEmoticons = len(emoticonsList)
print(emoticonsList)

  

>     ['😀', '😁', '😂', '😃', '😄', '😅', '😆', '😇', '😈', '😉', '😊', '😋', '😌', '😍', '😎', '😏', '😐', '😑', '😒', '😓', '😔', '😕', '😖', '😗', '😘', '😙', '😚', '😛', '😜', '😝', '😞', '😟', '😠', '😡', '😢', '😣', '😤', '😥', '😦', '😧', '😨', '😩', '😪', '😫', '😬', '😭', '😮', '😯', '😰', '😱', '😲', '😳', '😴', '😵', '😶', '😷', '😸', '😹', '😺', '😻', '😼', '😽', '😾', '😿', '🙀', '🙁', '🙂', '🙃', '🙄', '🙅', '🙆', '🙇', '🙈', '🙉', '🙊', '🙋', '🙌', '🙍', '🙎', '🙏']

  

The scikit-learn methods used below require a full distance matrix,
rather than the condensed representation consisting of only the upper
triangular part (including the diagonal) which we have been using thus
far. So we create a full matrix in the cell below. In the cell after
that we define a helper function for plotting 2D embeddings of
emoticons. Note that this function loads the unifont-upper font for
emoticon rendering, which can be downloaded from
`http://unifoundry.com/unifont/index.html`.

In [None]:
def emoticonPairToIndex(a, b): # same helper function as already defined in scala previously
  i = min(a, b) # makes sure j >= i
  j = max(a, b)
  return i * nbrEmoticons - (i * (i+1))//2 + j 

fullDistanceMatrix = np.zeros([nbrEmoticons, nbrEmoticons])
for r in range(nbrEmoticons):
  for c in range(nbrEmoticons):
    fullDistanceMatrix[r, c] = jMatrix[emoticonPairToIndex(r, c)]
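
As a quick check of the packed layout, the following sketch runs the same index formula on a toy case with `n = 4`. The names `pairToIndex`, `upperPairs` and `packed` are illustrative and not part of the notebook. It confirms that the formula enumerates the upper triangle including the diagonal row by row, and that filling a full matrix through it gives a symmetric result.

```python
# Illustrative sketch on a toy 4x4 case; all names here are hypothetical.
def pairToIndex(a, b, n):
    i, j = min(a, b), max(a, b)  # makes sure j >= i
    return i * n - (i * (i + 1)) // 2 + j

n = 4
# All (i, j) with j >= i, listed row by row.
upperPairs = [(i, j) for i in range(n) for j in range(i, n)]
packedIndices = [pairToIndex(i, j, n) for i, j in upperPairs]

# Packed vector whose values encode the pair they belong to (10*i + j).
packed = [10 * i + j for i, j in upperPairs]

# Fill the full matrix exactly as in the cell above, entry by entry.
full = [[packed[pairToIndex(r, c, n)] for c in range(n)] for r in range(n)]

print(packedIndices)
print(full[1][3], full[3][1])
```

Both printed matrix entries come from the packed slot of the pair (1, 3), which is why the lookup works regardless of the order of the two arguments.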

In [None]:
def scatterEmojis(emoticonsEmbedded):
  # This function plots a scatter plot of emoticons.
  # emoticonsEmbedded should be an 80x2 array 
  # containing 2D coordinates for each of the
  # 80 emoticons in the unicode emoticon block (in the correct order).
  
  # standardize the embedding for nicer plotting:
  emoticonsEmbedded = emoticonsEmbedded - np.mean(emoticonsEmbedded)
  emoticonsEmbedded = emoticonsEmbedded/np.std(emoticonsEmbedded) 

  # for proper emoji rendering change the font
  # (the font path is absolute, so it does not need to be joined with matplotlib's data path)
  fpath = "/dbfs/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/unifont_upper-13.0.05.ttf"
  prop = fm.FontProperties(fname=fpath, size=50)

  fig = plt.figure(figsize=(14, 14))
  for i, label in enumerate(emoticonsList):
      plt.text(emoticonsEmbedded[i, 0], emoticonsEmbedded[i, 1], label, fontproperties=prop)
  plt.setp(plt.gca(), frame_on=False, xticks=(), yticks=())

  

  

### Locally linear embedding

First we will look at ways of embedding the emoticons into 2D that
respect the Jaccard distances at least to some degree.

Locally linear embedding (LLE) is one such method. It focuses on
preserving local neighborhoods. You can read more about it in the
scikit-learn documentation embedded below.

In [None]:
nbrNeighbors = 8
emoticonsNeighbors = NearestNeighbors(n_neighbors=nbrNeighbors, metric="precomputed").fit(fullDistanceMatrix)
emoticonsEmbedded, err = locally_linear_embedding(emoticonsNeighbors, n_neighbors=nbrNeighbors, n_components=2)

  

  

As we can see in the scatter plot below, LLE succeeds in separating the
emoticons broadly into happy (down to the left), sad (up) and animal
(down to the right) categories. Also some local clusters can be spotted,
such as the three emoticons sticking out their tongues, close to the
lower left corner.

In [None]:
scatterEmojis(emoticonsEmbedded)

  

### t-distributed Stochastic Neighbor Embedding (t-SNE)

Another approach for embedding the distances into 2D is t-SNE. You can
read more about this method in the scikit-learn documentation below.

In [None]:
emoticonsEmbedded = TSNE(n_components=2, perplexity=20.0, early_exaggeration=12.0, learning_rate=2.0, n_iter=10000,
                         metric='precomputed', angle=0.01).fit_transform(fullDistanceMatrix)


  

  

t-SNE also does a good job of showing a separation between happy and sad
emojis, but the result is not as convincing as in the LLE case. One could
spend more time on tuning the hyperparameters and would probably find a
better embedding here.

In [None]:
scatterEmojis(emoticonsEmbedded)

  

### Hierarchical clustering

Instead of trying to embed the distances into 2D, we can also perform a
hierarchical clustering and represent it graphically in the form of a
dendrogram. For this we need to process the distance matrix once more
in the following cell.

In [None]:
# remove diagonal from jMatrix, as this is expected by the scipy linkage function:
diagonalIndices = {emoticonPairToIndex(i, i) for i in range(nbrEmoticons)}  # set for fast membership tests
jMatrixUpper = jMatrix[[i for i in range((nbrEmoticons**2 + nbrEmoticons)//2) if i not in diagonalIndices]]
assert len(jMatrixUpper) == len(jMatrix) - nbrEmoticons, "the upper matrix should have exactly 80 elements fewer than the upper+diagonal"

# creating a linkage matrix
Z = linkage(jMatrixUpper, 'complete', optimal_ordering=True)
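
The sketch below illustrates, on a toy case with `n = 4`, why dropping the diagonal entries is enough: what remains is the strict upper triangle listed row by row, which is the ordering SciPy documents for condensed distance vectors. The names `packedWithDiag`, `diagPositions` and `scipyCondensedIndex` are hypothetical helpers introduced for this check only.

```python
n = 4

# Packed upper triangle *including* the diagonal, row by row.
# Each entry is labelled with the (i, j) pair it stores.
packedWithDiag = [(i, j) for i in range(n) for j in range(i, n)]

# Positions of the diagonal entries (i, i) in the packed vector.
diagPositions = {i * n - (i * (i + 1)) // 2 + i for i in range(n)}

# Drop the diagonal, keeping the remaining entries in their original order.
condensed = [pair for k, pair in enumerate(packedWithDiag) if k not in diagPositions]

def scipyCondensedIndex(i, j, m):
    # Position of the pair i < j in a condensed distance vector of m points.
    return m * i + j - ((i + 2) * (i + 1)) // 2

print(condensed)
print(all(scipyCondensedIndex(i, j, n) == k for k, (i, j) in enumerate(condensed)))
```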

  

  

Hierarchical clustering works by starting off with clusters of size one,
which are just the emoticons, and then iteratively joining those clusters
which are closest together. The distance between clusters can be defined
in various ways. Here we somewhat arbitrarily choose so-called "complete
linkage", which means that the distance between clusters $a$ and $b$ is
given by the maximum Jaccard distance over all pairs consisting of one
emoticon in $a$ and one emoticon in $b$.
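
To make the procedure concrete, here is a naive sketch of agglomerative clustering with complete linkage on four made-up items. The distances are hypothetical and not taken from the tweet data, and the loop is written for readability rather than speed. It is not how SciPy implements `linkage`.

```python
from itertools import combinations

# Hypothetical toy distances between four items.
labels = ["a", "b", "c", "d"]
dist = {
    frozenset(("a", "b")): 0.2,
    frozenset(("a", "c")): 0.9,
    frozenset(("a", "d")): 0.8,
    frozenset(("b", "c")): 0.7,
    frozenset(("b", "d")): 0.95,
    frozenset(("c", "d")): 0.3,
}

def completeLinkage(clusterA, clusterB):
    # Largest pairwise distance between a member of each cluster.
    return max(dist[frozenset((x, y))] for x in clusterA for y in clusterB)

clusters = [(label,) for label in labels]  # start with singleton clusters
merges = []
while len(clusters) > 1:
    # Join the two clusters with the smallest complete-linkage distance.
    a, b = min(combinations(clusters, 2), key=lambda pair: completeLinkage(*pair))
    merges.append((a, b, completeLinkage(a, b)))
    clusters = [c for c in clusters if c not in (a, b)] + [a + b]

for a, b, height in merges:
    print(a, b, height)
```

The recorded heights are the levels at which branches would join in a dendrogram: the two tight pairs merge first, and the final merge happens at the largest distance between any two items from the different pairs.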

We can use dendrograms to neatly represent hierarchical clusterings
graphically. The closer two emoticons (or rather emoticon clusters) are
to each other, the further down in the dendrogram their branches merge.

The interested WASP PhD student could consider taking the WASP
Topological Data Analysis course to learn more about hierarchical
clustering.

In [None]:
# plotting a dendrogram
fig = plt.figure(figsize=(40, 8))
dn = dendrogram(Z, labels=emoticonsList, leaf_rotation=0, color_threshold=1.)
ax = plt.gca()

# for proper emoji rendering change the font
fpath = "/dbfs/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/unifont_upper-13.0.05.ttf"  # absolute path to the font file
prop = fm.FontProperties(fname=fpath, size=28)
x_labels = ax.get_xmajorticklabels()
for x in x_labels:
    x.set_fontproperties(prop)

ax.set_ylim([.85, 1.01])

  

We identify six main clusters in the dendrogram above. From left to
right:

-   The green "prayer" cluster (🙌🙏😊🙇🙋😷) which also contains the mask
    emoji and a common smile emoji,
-   the teal "happy" cluster (😝😛😜😎😉😏😈😌😋😍😘😇😀😃😄😁😆😅🙂🙃),
-   the magenta "cat" cluster (😹😸😽😺😻😿😾🙀😼),
-   the yellow "shocked and kisses" or "SK" cluster (😶😬😲😮😯😗😙😚),
-   a combined "not happy" cluster consisting of the next black, green,
    red, teal and magenta clusters (😵😧😦😨😰😱😳😂😭😩😔😢😞😥😓😪😴😫😖😣😟🙁😕😐😑😒🙄😤😡😠),
-   finally the yellow "monkey" cluster (🙈🙊🙉).

We proceed with these clusters as they appeal sufficiently to our
intuition to seem worthwhile. The observant reader will, however, have
noted some curiosities, such as the fact that the "not happy" cluster
contains the crying laughing emoji 😂, which is the most popular emoticon
in our tweet dataset and which might be used in both happy and not so
happy contexts.

Next, we finish the clustering part of this notebook by saving the
clusters to file.

In [None]:
monkeyEmoticons = dn["leaves"][76:79]
prayerEmoticons = dn["leaves"][0:6]
shockedAndKissesEmoticons = dn["leaves"][38:46]
happyEmoticons = dn["leaves"][9:29]
notHappyEmoticons = dn["leaves"][46:76]
catEmoticons = dn["leaves"][29:38]
emoticonsDict = {"monkey" : monkeyEmoticons,
                 "prayer" : prayerEmoticons,
                 "SK" : shockedAndKissesEmoticons,
                "happy" : happyEmoticons,
                "notHappy" : notHappyEmoticons,
                "cat" : catEmoticons}
print(emoticonsDict)


  

>     {'monkey': [72, 74, 73], 'prayer': [76, 79, 10, 71, 75, 55], 'SK': [54, 44, 50, 46, 47, 23, 25, 26], 'happy': [29, 27, 28, 14, 9, 15, 8, 12, 11, 13, 24, 7, 0, 3, 4, 1, 6, 5, 66, 67], 'notHappy': [53, 39, 38, 40, 48, 49, 51, 2, 45, 41, 20, 34, 30, 37, 19, 42, 52, 43, 22, 35, 31, 65, 21, 16, 17, 18, 68, 36, 33, 32], 'cat': [57, 56, 61, 58, 59, 63, 62, 64, 60]}
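
As a sanity check on the printed dictionary, the sketch below verifies that the clusters are disjoint and lists the emoticons that were not assigned to any cluster. It assumes that `emoticonsList` is in code point order (U+1F600 to U+1F64F), as in the list at the top of the notebook. The names `allEmoticons`, `assigned` and `unassigned` are introduced here for illustration.

```python
# Cluster indices copied from the printed dictionary above.
emoticonsDict = {
    "monkey": [72, 74, 73],
    "prayer": [76, 79, 10, 71, 75, 55],
    "SK": [54, 44, 50, 46, 47, 23, 25, 26],
    "happy": [29, 27, 28, 14, 9, 15, 8, 12, 11, 13, 24, 7, 0, 3, 4, 1, 6, 5, 66, 67],
    "notHappy": [53, 39, 38, 40, 48, 49, 51, 2, 45, 41, 20, 34, 30, 37, 19,
                 42, 52, 43, 22, 35, 31, 65, 21, 16, 17, 18, 68, 36, 33, 32],
    "cat": [57, 56, 61, 58, 59, 63, 62, 64, 60],
}

# Assumption: the emoticon list is the Emoticons block in code point order.
allEmoticons = [chr(0x1F600 + i) for i in range(80)]

assigned = [i for members in emoticonsDict.values() for i in members]
isDisjoint = len(assigned) == len(set(assigned))  # no emoticon in two clusters
unassigned = sorted(set(range(80)) - set(assigned))

print(isDisjoint, len(assigned))
print(unassigned, "".join(allEmoticons[i] for i in unassigned))
```

The check finds 76 assigned emoticons and four unassigned ones (indices 69, 70, 77 and 78). This matches the slices above, which skip the dendrogram leaves at positions 6 to 8 and 79.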

In [None]:
with open('/dbfs/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/emoticonClusters.json', 'w+') as f:
    json.dump(emoticonsDict, f)

  

  

Filtering the tweets by cluster
-------------------------------

We return to Scala cells to filter the original dataset by which
emoticons are present in each tweet. First we load the clusters from the
JSON file we just created.

In [None]:
import org.json4s.jackson.JsonMethods.parse
val jsonString = scala.io.Source.fromFile("/dbfs/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/emoticonClusters.json").mkString 
val emoticonClusters = parse(jsonString).values.asInstanceOf[Map[String, List[BigInt]]]
emoticonClusters.foreach({case (key, list) => println(key + ": " + list.map(i => emoticonsList(i.toInt)).mkString)})

  

>     prayer: 🙌🙏😊🙇🙋😷
>     monkey: 🙈🙊🙉
>     happy: 😝😛😜😎😉😏😈😌😋😍😘😇😀😃😄😁😆😅🙂🙃
>     SK: 😶😬😲😮😯😗😙😚
>     cat: 😹😸😽😺😻😿😾🙀😼
>     notHappy: 😵😧😦😨😰😱😳😂😭😩😔😢😞😥😓😪😴😫😖😣😟🙁😕😐😑😒🙄😤😡😠
>     import org.json4s.jackson.JsonMethods.parse
>     jsonString: String = {"monkey": [72, 74, 73], "prayer": [76, 79, 10, 71, 75, 55], "SK": [54, 44, 50, 46, 47, 23, 25, 26], "happy": [29, 27, 28, 14, 9, 15, 8, 12, 11, 13, 24, 7, 0, 3, 4, 1, 6, 5, 66, 67], "notHappy": [53, 39, 38, 40, 48, 49, 51, 2, 45, 41, 20, 34, 30, 37, 19, 42, 52, 43, 22, 35, 31, 65, 21, 16, 17, 18, 68, 36, 33, 32], "cat": [57, 56, 61, 58, 59, 63, 62, 64, 60]}
>     emoticonClusters: Map[String,List[BigInt]] = Map(prayer -> List(76, 79, 10, 71, 75, 55), monkey -> List(72, 74, 73), happy -> List(29, 27, 28, 14, 9, 15, 8, 12, 11, 13, 24, 7, 0, 3, 4, 1, 6, 5, 66, 67), SK -> List(54, 44, 50, 46, 47, 23, 25, 26), cat -> List(57, 56, 61, 58, 59, 63, 62, 64, 60), notHappy -> List(53, 39, 38, 40, 48, 49, 51, 2, 45, 41, 20, 34, 30, 37, 19, 42, 52, 43, 22, 35, 31, 65, 21, 16, 17, 18, 68, 36, 33, 32))

  

Next, we create a dataframe `emoticonDF` with a row for each tweet
containing at least one emoticon. We add a column for each cluster
indicating if the cluster is represented by some emoticon in the tweet.
This dataframe is saved to file to be used in the next notebook 03, which
focuses more on data visualization. We will finish this notebook by
using the Databricks `display` function to plot geographic information.

In [None]:
val emoticonDF = fullDF.filter($"CurrentTweet".rlike(emoticonsList.mkString("|"))) // filter tweets with emoticons
                       .select(($"countryCode" :: // select the countryCode column
                                $"CurrentTweetDate" :: // and the timestamp
                                (for {(name, cluster) <- emoticonClusters.toList} yield  // also create a new column for each emoticon cluster indicating if the tweet contains an emoticon of that cluster
                                 $"CurrentTweet".rlike(cluster.map(i => emoticonsList(i.toInt)).mkString("|"))
                                                .alias(name))) // rename new column
                                                : _*) // expand list
      
emoticonDF.show(3)

  

>     +-----------+-------------------+------+------+-----+-----+-----+--------+
>     |countryCode|   CurrentTweetDate|prayer|monkey|happy|   SK|  cat|notHappy|
>     +-----------+-------------------+------+------+-----+-----+-----+--------+
>     |         EG|2020-12-31 15:59:54| false| false|false|false|false|    true|
>     |         SA|2020-12-31 15:59:54|  true| false|false|false|false|   false|
>     |         DO|2020-12-31 15:59:54| false| false| true|false|false|   false|
>     +-----------+-------------------+------+------+-----+-----+-----+--------+
>     only showing top 3 rows
>
>     emoticonDF: org.apache.spark.sql.DataFrame = [countryCode: string, CurrentTweetDate: timestamp ... 6 more fields]
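
The same filtering idea can be sketched outside Spark with Python's `re` module: one alternation pattern over all emoticons decides whether a tweet is kept, and one pattern per cluster produces the boolean columns. The tweets and the two-cluster excerpt below are made up for illustration, and `allEmoticons` assumes the emoticon list is in code point order.

```python
import re

# The 80 code points of the Emoticons block, U+1F600 to U+1F64F.
allEmoticons = [chr(0x1F600 + i) for i in range(80)]

# Hypothetical excerpt of two clusters, given as indices into allEmoticons.
clusters = {"monkey": [72, 74, 73], "cat": [57, 56, 61]}

# One pattern matching any emoticon, and one pattern per cluster.
anyPattern = re.compile("|".join(allEmoticons))
clusterPatterns = {
    name: re.compile("|".join(allEmoticons[i] for i in members))
    for name, members in clusters.items()
}

# Made-up tweets for illustration only.
tweets = [
    "good morning " + allEmoticons[72],
    "no emoji here",
    "my cat " + allEmoticons[57] + allEmoticons[73],
    "so happy " + allEmoticons[0],
]

rows = []
for tweet in tweets:
    if not anyPattern.search(tweet):
        continue  # drop tweets without any emoticon
    flags = {name: bool(p.search(tweet)) for name, p in clusterPatterns.items()}
    rows.append((tweet, flags))

for tweet, flags in rows:
    print(flags)
```

Note that a tweet can belong to several clusters at once, and that a tweet whose only emoticons fall outside the listed clusters is kept with all flags set to false.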

In [None]:
// save to file
// emoticonDF.write.format("parquet").mode("overwrite").save("/datasets/ScaDaMaLe/twitter/student-project-10_group-Geosmus/processedEmoticonClusterParquets/emoticonCluster.parquet")

  

  

The goal for the last part of this notebook is to display, for each
country, what proportion of its emoticon tweets falls into a certain
cluster. First we create a dataframe `emoticonCCDF` which contains the
total number of tweets with some emoticon for each country. Using that
dataframe we create dataframes containing the described proportions for
each cluster and transfer these dataframes from Scala to Python by using
the `createOrReplaceTempView` function.

In [None]:
val emoticonCCDF = emoticonDF.groupBy($"countryCode")
                             .count
emoticonCCDF.show(3)

  

>     +-----------+-----+
>     |countryCode|count|
>     +-----------+-----+
>     |         DZ|  157|
>     |         MM|   44|
>     |         TC|    5|
>     +-----------+-----+
>     only showing top 3 rows
>
>     emoticonCCDF: org.apache.spark.sql.DataFrame = [countryCode: string, count: bigint]

In [None]:
def createPropClusterDF (cluster : org.apache.spark.sql.Column) : org.apache.spark.sql.DataFrame = {  
  // This function filters the emoticonDF by a cluster-column and then
  // creates a dataframe with a row per country and columns for the countryCode and proportion
  // of tweets from that country that fall into the cluster as well as the count of tweets
  // falling into the cluster.
  val clusterDF = emoticonDF.filter(cluster)
                            .groupBy($"countryCode")
                            .count
  // inner join: countries without any tweet in the cluster do not appear in the result
  val propDF = emoticonCCDF.alias("total")
                           .join(clusterDF.alias("cluster"), "countryCode")
                           .select($"countryCode", $"cluster.count".alias("count"), ($"cluster.count" / $"total.count").alias("proportion"))
  propDF
}

  

>     createPropClusterDF: (cluster: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
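
The group-count-join logic of `createPropClusterDF` can be mirrored in a few lines of plain Python. The rows below are hypothetical, and the names `totalCounts`, `clusterCounts` and `propRows` are only for this sketch.

```python
from collections import Counter

# Hypothetical rows: (countryCode, does the tweet fall into the cluster?)
rows = [
    ("DZ", True), ("DZ", False), ("DZ", True),
    ("MM", False),
    ("CI", True), ("CI", False),
]

# Tweets with some emoticon per country (the role of emoticonCCDF).
totalCounts = Counter(cc for cc, _ in rows)

# Tweets in the cluster per country (the role of clusterDF).
clusterCounts = Counter(cc for cc, inCluster in rows if inCluster)

# Inner join on the country code: keep only countries present in both.
propRows = {
    cc: (clusterCounts[cc], clusterCounts[cc] / totalCounts[cc])
    for cc in totalCounts
    if cc in clusterCounts
}

print(propRows)
```

In this toy data the country "MM" has no tweet in the cluster, so it is absent from the result rather than listed with proportion zero.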

  

Below we see an example of the dataframes generated by
`createPropClusterDF`.

In [None]:
val clusterColumn = $"notHappy"
val propClusterDF = createPropClusterDF(clusterColumn)
propClusterDF.show(3)

  

>     +-----------+-----+------------------+
>     |countryCode|count|        proportion|
>     +-----------+-----+------------------+
>     |         DZ|   80|0.5095541401273885|
>     |         MM|   14|0.3181818181818182|
>     |         CI|  164|0.6507936507936508|
>     +-----------+-----+------------------+
>     only showing top 3 rows
>
>     clusterColumn: org.apache.spark.sql.ColumnName = notHappy
>     propClusterDF: org.apache.spark.sql.DataFrame = [countryCode: string, count: bigint ... 1 more field]

In [None]:
def createPropClusterDFAndCreateTmpView (clusterName : String) = {
  // function for creating proportion dataframes for each cluster and making them available for later python code
  val propClusterDF = createPropClusterDF(org.apache.spark.sql.functions.col(clusterName))
  // make df available to python/sql etc
  propClusterDF.createOrReplaceTempView(clusterName)
}

  

>     createPropClusterDFAndCreateTmpView: (clusterName: String)Unit

In [None]:
// create proportion dataframes for each cluster and make them available for later python code
emoticonClusters.keys.foreach(createPropClusterDFAndCreateTmpView _)

  

  

Now we turn to Python and use the `pycountry` package to translate the
country codes into another standard (three letters instead of two),
which makes plotting with the built-in Databricks `display` a breeze.
The cell below contains some functions that read the dataframes from the
temporary views created in Scala and translate them to pandas dataframes
with the three-letter country codes. We also filter out countries for
which there are fewer than 100 tweets with emoticons.

In [None]:
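def add_iso_a3_col(df_cc):
  # build a lookup from two-letter (alpha-2) to three-letter (alpha-3) country codes
  cc_dict = {}
  for country in pycountry.countries:
    cc_dict[country.alpha_2] = country.alpha_3

  # map the column once, after the lookup is complete
  df_cc["iso_a3"] = df_cc["countryCode"].map(cc_dict)
  return df_cc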

def cc_df_from_spark_to_pandas_and_process(df_cc, columnOfInterest):
  #df_cc should be a dataframe with a column "countryCode" and a column columnOfInterest which has some interesting numerical data
  df_cc = df_cc.toPandas()
  add_iso_a3_col(df_cc)
  df_cc = df_cc[["iso_a3", columnOfInterest]]  #reorder to have iso_a3 as first column (required in order to use the map view in display), and select the useful columns
  return df_cc

from pyspark.sql.functions import col
def createProps(clusterName):
  df = sql("select * from " + clusterName)
  # count / proportion recovers the country's total number of emoticon tweets; keep only countries with at least 100
  return cc_df_from_spark_to_pandas_and_process(df.filter(col("count")/col("proportion") >= 100), "proportion")
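
The two steps above, translating the country codes and filtering by the recovered total, can be sketched in plain Python using the rows from the `notHappy` example output earlier in this notebook. The three-entry lookup `ccDict` is a hand-written excerpt standing in for the dictionary built from `pycountry`, and `minTweets` is a name introduced for this sketch.

```python
# Hand-written excerpt of an alpha-2 -> alpha-3 lookup (assumption: the
# notebook builds the complete one from pycountry).
ccDict = {"DZ": "DZA", "MM": "MMR", "CI": "CIV"}

# (countryCode, count, proportion) rows from the notHappy example output.
rows = [
    ("DZ", 80, 0.5095541401273885),
    ("MM", 14, 0.3181818181818182),
    ("CI", 164, 0.6507936507936508),
]

minTweets = 100
kept = []
for cc, count, proportion in rows:
    total = count / proportion  # recovers the country's total tweet count
    if total >= minTweets:
        # dict.get returns None for codes missing from the lookup
        kept.append((ccDict.get(cc), proportion))

print([round(count / proportion) for _, count, proportion in rows])
print(kept)
```

Since the total is recovered through a floating point division, a country sitting exactly at the threshold could in principle be affected by rounding. Carrying the total count along as its own column would avoid that.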

  

  

Finally, we can show the proportion of tweets in each country that fall
into each cluster. Make sure that the plot type is set to `map` in the
outputs from the cells below. It is possible to hover over the countries
to see the precise values.

Whether anything interesting can actually be read from these plots we
leave for the reader to decide.

In [None]:
display(createProps("happy"))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(createProps("notHappy"))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(createProps("monkey"))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(createProps("cat"))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(createProps("SK"))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(createProps("prayer"))

  

[TABLE]

Truncated to 30 rows