# SESSION 1

### STEP 2 - Persona

#### Profession / Type of user interested in this data
A couples therapist or marital psychologist would be a highly relevant profile. These professionals seek to understand the factors that influence marital satisfaction and fidelity within a couple.

The available variables (age, years of marriage, presence of children, religiousness, marital satisfaction, etc.) can help them identify at-risk profiles or better tailor their support.

#### Persona: Dr. Clara Morel, marital psychologist
__Age:__ 42

__Profession:__ Psychologist specializing in couples therapy

__Place of practice:__ Private practice in Lyon

__Context:__ She mainly works with married couples aged 25 to 55. She is interested in how years of marriage, religiousness, education level, and marital satisfaction relate to fidelity.

### STEP 3 - Loading, preprocessing

In [60]:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.sql.hive.HiveContext

In [61]:
case class Affair(
    id: Int,
    affairs: Int,
    gender: String,
    age: Double,
    yearsMarried: Double,
    children: String,
    religiousness: Int,
    education: Int,
    occupation: Int,
    rating: Int
)

defined class Affair


In [62]:
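// Read the raw CSV and drop the header row (index 0).
val data = sc.textFile("data/Affairs.csv")
    .zipWithIndex().filter { case (_, idx) => idx > 0 }.map(_._1)

// Split each line on commas and map the ten fields onto the Affair case class.
val affairsRDD: RDD[Affair] = data
  .map(_.split(","))
  .map(fields => Affair(
    fields(0).toInt,
    fields(1).toInt,
    fields(2),          // gender, already a String
    fields(3).toDouble,
    fields(4).toDouble,
    fields(5),          // children, already a String
    fields(6).toInt,
    fields(7).toInt,
    fields(8).toInt,
    fields(9).toInt
  ))

// Print the first ten parsed records as a sanity check.
affairsRDD.take(10).foreach(println)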

Affair(4,0,male,37.0,10.0,no,3,18,7,4)
Affair(5,0,female,27.0,4.0,no,4,14,6,4)
Affair(11,0,female,32.0,15.0,yes,1,12,1,4)
Affair(16,0,male,57.0,15.0,yes,5,18,6,5)
Affair(23,0,male,22.0,0.75,no,2,17,6,3)
Affair(29,0,female,32.0,1.5,no,2,17,5,5)
Affair(44,0,female,22.0,0.75,no,2,12,1,3)
Affair(45,0,male,57.0,15.0,yes,2,14,4,4)
Affair(47,0,female,32.0,15.0,yes,4,16,1,2)
Affair(49,0,male,22.0,1.5,no,4,14,4,5)


data = MapPartitionsRDD[290] at map at <console>:60
affairsRDD = MapPartitionsRDD[292] at map at <console>:64


MapPartitionsRDD[292] at map at <console>:64
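
The parsing cell above assumes every line has exactly ten well-formed fields, so a single malformed line would make the job fail. As a minimal sketch of a more defensive variant, the helper below returns `None` for lines it cannot parse. The name `parseAffair` is hypothetical and not part of the original notebook, and the sketch assumes a plain comma-separated file with no quoted fields.

```scala
import scala.util.Try

case class Affair(
    id: Int, affairs: Int, gender: String, age: Double, yearsMarried: Double,
    children: String, religiousness: Int, education: Int, occupation: Int, rating: Int
)

// Hypothetical helper (not in the original notebook): returns None instead of
// throwing when a line has the wrong field count or a non-numeric value.
def parseAffair(line: String): Option[Affair] = {
  // limit = -1 keeps trailing empty fields, so the length check stays reliable
  val f = line.split(",", -1).map(_.trim)
  if (f.length != 10) None
  else Try(Affair(
    f(0).toInt, f(1).toInt, f(2), f(3).toDouble, f(4).toDouble,
    f(5), f(6).toInt, f(7).toInt, f(8).toInt, f(9).toInt
  )).toOption
}

// With the RDD from the cell above, the two map calls could become:
// val affairsRDD: RDD[Affair] = data.flatMap(line => parseAffair(line))
```

Silently dropping bad lines is a design choice; counting the `None` results first would show how many records are lost.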

# SESSION 2

### STEP 1 - RDD to a DataFrame

In [63]:
val affairsDF = affairsRDD.toDF()

affairsDF.show(10)
affairsDF.printSchema()
println(s"Total rows: ${affairsDF.count()}")
affairsDF.summary().show()

+---+-------+------+----+------------+--------+-------------+---------+----------+------+
| id|affairs|gender| age|yearsMarried|children|religiousness|education|occupation|rating|
+---+-------+------+----+------------+--------+-------------+---------+----------+------+
|  4|      0|  male|37.0|        10.0|      no|            3|       18|         7|     4|
|  5|      0|female|27.0|         4.0|      no|            4|       14|         6|     4|
| 11|      0|female|32.0|        15.0|     yes|            1|       12|         1|     4|
| 16|      0|  male|57.0|        15.0|     yes|            5|       18|         6|     5|
| 23|      0|  male|22.0|        0.75|      no|            2|       17|         6|     3|
| 29|      0|female|32.0|         1.5|      no|            2|       17|         5|     5|
| 44|      0|female|22.0|        0.75|      no|            2|       12|         1|     3|
| 45|      0|  male|57.0|        15.0|     yes|            2|       14|         4|     4|
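| 47|      0|female|32.0|        15.0|     yes|            4|       16|         1|     2|
| 49|      0|  male|22.0|         1.5|      no|            4|       14|         4|     5|
+---+-------+------+----+------------+--------+-------------+---------+----------+------+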

affairsDF = [id: int, affairs: int ... 8 more fields]


+-------+------------------+------------------+------+-----------------+-----------------+--------+------------------+------------------+------------------+------------------+
|summary|                id|           affairs|gender|              age|     yearsMarried|children|     religiousness|         education|        occupation|            rating|
+-------+------------------+------------------+------+-----------------+-----------------+--------+------------------+------------------+------------------+------------------+
|  count|               601|               601|   601|              601|              601|     601|               601|               601|               601|               601|
|   mean|1059.7221297836938|1.4559068219633944|  NULL|32.48752079866888| 8.17769550748752|    NULL|3.1164725457570714| 16.16638935108153| 4.194675540765391|3.9317803660565724|
| stddev| 914.9046112352131|  3.29875772849468|  NULL| 9.28876170487667|5.571303149963793|    NULL|1.1675094016730692|2.

[id: int, affairs: int ... 8 more fields]

### STEP 2 - Extracting dimensions

In [64]:
// String-valued dimensions: children and gender.
// monotonically_increasing_id() yields unique, increasing IDs
// (not necessarily consecutive across partitions).

val dfChildren = affairsDF
    .select("children")
    .distinct
    .withColumn("id_children", monotonically_increasing_id())

dfChildren.show(10)

val dfGenders = affairsDF
    .select("gender")
    .distinct
    .withColumn("id_gender", monotonically_increasing_id())

dfGenders.show(10)

+--------+-----------+
|children|id_children|
+--------+-----------+
|      no|          0|
|     yes|          1|
+--------+-----------+



dfChildren = [children: string, id_children: bigint]
dfGenders = [gender: string, id_gender: bigint]


+------+---------+
|gender|id_gender|
+------+---------+
|female|        0|
|  male|        1|
+------+---------+



[gender: string, id_gender: bigint]
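
The IDs above happen to come out as 0 and 1, but `monotonically_increasing_id` only guarantees unique, increasing values, not consecutive ones, and the result can depend on how the data is partitioned. For dimensions this small, one alternative is to derive the keys deterministically from the sorted distinct values. The sketch below shows the idea in plain Scala; `buildDimension` is a hypothetical helper, not something the notebook defines.

```scala
// Hypothetical helper: assigns consecutive keys 0, 1, 2, ... to the sorted
// distinct values, so the mapping does not depend on partitioning or row order.
def buildDimension(values: Seq[String]): Map[String, Long] =
  values.distinct.sorted.zipWithIndex.map { case (v, i) => v -> i.toLong }.toMap

// Gender values of the first four rows printed earlier in the notebook.
val genders = Seq("male", "female", "female", "male")
val genderDim = buildDimension(genders)

// Look up the surrogate key for a literal value.
val femaleKey = genderDim("female")
val maleKey = genderDim("male")
```

Inside Spark, a comparable effect could probably be obtained with `row_number()` over a window ordered by the dimension column; that variant is not shown or tested here.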

In [65]:
// Replace the literal gender and children columns with their surrogate keys.
val affairsNormalized = affairsDF
    .join(dfGenders, Seq("gender"))
    .join(dfChildren, Seq("children"))
    .drop("gender", "children")

affairsNormalized.printSchema()

root
 |-- id: integer (nullable = false)
 |-- affairs: integer (nullable = false)
 |-- age: double (nullable = false)
 |-- yearsMarried: double (nullable = false)
 |-- religiousness: integer (nullable = false)
 |-- education: integer (nullable = false)
 |-- occupation: integer (nullable = false)
 |-- rating: integer (nullable = false)
 |-- id_gender: long (nullable = false)
 |-- id_children: long (nullable = false)



affairsNormalized = [id: int, affairs: int ... 8 more fields]


[id: int, affairs: int ... 8 more fields]

### STEP 3 - Hive tables, SQL

In [66]:
dfGenders.write.mode("overwrite").saveAsTable("gender")
dfChildren.write.mode("overwrite").saveAsTable("children")
affairsNormalized.write.mode("overwrite").saveAsTable("affairs")

In [74]:
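// HiveContext is deprecated since Spark 2.0; a SparkSession with Hive support is the current entry point.
val hc = new HiveContext(sc)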

hc.sql("SHOW TABLES").show()
hc.sql("SELECT * FROM gender").show()
hc.sql("SELECT * FROM children").show()
hc.sql("SELECT * FROM affairs LIMIT 10").show()

+---------+------------+-----------+
|namespace|   tableName|isTemporary|
+---------+------------+-----------+
|  default|     affairs|      false|
|  default|    children|      false|
|  default|dim_children|      false|
|  default|  dim_gender|      false|
|  default|fact_affairs|      false|
|  default|      gender|      false|
+---------+------------+-----------+

+------+---------+
|gender|id_gender|
+------+---------+
|female|        0|
|  male|        1|
+------+---------+

+--------+-----------+
|children|id_children|
+--------+-----------+
|      no|          0|
|     yes|          1|
+--------+-----------+



hc = org.apache.spark.sql.hive.HiveContext@5ad872c7




+---+-------+----+------------+-------------+---------+----------+------+---------+-----------+
| id|affairs| age|yearsMarried|religiousness|education|occupation|rating|id_gender|id_children|
+---+-------+----+------------+-------------+---------+----------+------+---------+-----------+
|  5|      0|27.0|         4.0|            4|       14|         6|     4|        0|          0|
| 11|      0|32.0|        15.0|            1|       12|         1|     4|        0|          1|
| 29|      0|32.0|         1.5|            2|       17|         5|     5|        0|          0|
| 44|      0|22.0|        0.75|            2|       12|         1|     3|        0|          0|
| 47|      0|32.0|        15.0|            4|       16|         1|     2|        0|          1|
| 80|      0|22.0|         1.5|            2|       17|         5|     4|        0|          0|
| 86|      0|27.0|         4.0|            4|       14|         5|     4|        0|          0|
| 93|      0|37.0|        15.0|         

org.apache.spark.sql.hive.HiveContext@5ad872c7
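
Once the fact table and the two dimension tables are saved, the kind of question the persona would ask can be written as a single SQL query. The sketch below is one possible example, not part of the original notebook: it assumes the `affairs`, `gender`, and `children` tables created above exist in the default database and that Hive support is available. The aliases `respondents`, `avg_affairs`, and `avg_rating`, as well as the name `byProfile`, are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// In a notebook, getOrCreate() returns the session that is already running.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Average number of affairs and average marriage rating per gender and
// presence of children, resolved through the two dimension tables.
val byProfile = spark.sql("""
  SELECT g.gender,
         c.children,
         COUNT(*)       AS respondents,
         AVG(a.affairs) AS avg_affairs,
         AVG(a.rating)  AS avg_rating
  FROM affairs a
  JOIN gender   g ON a.id_gender   = g.id_gender
  JOIN children c ON a.id_children = c.id_children
  GROUP BY g.gender, c.children
  ORDER BY g.gender, c.children
""")

byProfile.show()
```

Joining back through the dimension tables keeps the literal labels (`male`/`female`, `yes`/`no`) in the result while the fact table stores only the numeric keys.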