<a href="https://colab.research.google.com/github/hamel-amir/Projet_Big_Data/blob/main/Projet_Big_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Big Data Project: Apache Spark

Authors:

* Amir  Hamel
* Noor Khalal
* Zahra Alliche

## Lab objectives
This lab groups text documents so that documents sharing the same theme end up
in the same group, and documents on very different subjects end up in
different groups.

## 2 Setting up the working environment

### 2.2 Installing Spark

**Installing Java**

In [None]:
! apt-get install openjdk-8-jdk-headless -qq > /dev/null

**Installing Spark**

In [None]:
! wget -q https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz


In [None]:
! tar xf spark-3.3.2-bin-hadoop3.tgz

**Installing PySpark**

In [None]:
# install findspark and pyspark
! pip install -q findspark
! pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### 2.3/4 Setting the environment variables and creating the SparkContext object

In [None]:
import os
# set the environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # Java location
os.environ["SPARK_HOME"] = "/content/spark-3.3.2-bin-hadoop3" # Spark location
# the spark-avro version must match the installed Spark release (3.3.2, built for Scala 2.12)
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.12:3.3.2 pyspark-shell' 

import findspark
findspark.init(os.environ["SPARK_HOME"])

from pyspark import SparkContext, SparkConf

# run Spark locally (on a single machine, here the Colab virtual machine) with 4 worker threads
configuration = SparkConf().setAppName("name").setMaster("local[4]").set('spark.jars.packages', 'org.apache.spark:spark-avro_2.12:3.3.2')
# "name" is the name given to the application
sc = SparkContext(conf=configuration) # the SparkContext object

In [None]:
sc

**Creating the SparkSession object**

In [None]:
# The SparkSession object gives access to the Spark SQL API
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(conf=configuration).getOrCreate()

## 3 Data

### 3.1 Downloading the data

In [None]:
# Download the archive of documents
! wget -q http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz

### 3.2 Decompressing the data

In [None]:
# extract the archive
! tar xf /content/20news-19997.tar.gz

### 3.3 Loading the data into two RDD variables



In [None]:
rdd1 = sc.wholeTextFiles("/content/20_newsgroups/alt.atheism")
rdd2 = sc.wholeTextFiles("/content/20_newsgroups/rec.sport.baseball")

In [None]:
x=rdd2.take(2)

### 3.4 Separating the header

In [None]:
def separateur(x):
  l=x[1].split("\n\n")
  return (x[0],l)

#separateur(x[1])

rdd1=rdd1.map(separateur)
rdd2=rdd2.map(separateur)

In [None]:
# inspect the first element
test=rdd1.take(1)
test

[('file:/content/20_newsgroups/alt.atheism/53179',
  ['Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!uunet!news.claremont.edu!nntp-server.caltech.edu!keith\nFrom: keith@cco.caltech.edu (Keith Allan Schneider)\nNewsgroups: alt.atheism\nSubject: Re: <<Pompous ass\nDate: 16 Apr 1993 02:45:05 GMT\nOrganization: California Institute of Technology, Pasadena\nLines: 28\nMessage-ID: <1ql6jiINN5df@gap.caltech.edu>\nReferences: <1q0e4iINNa30@gap.caltech.edu> <1q52q8INN6pi@gap.caltech.edu> <93099.234144MVS104@psuvm.psu.edu> <1q8lk3INNitq@gap.caltech.edu> <C5CFLo.FzH@blaze.cs.jhu.edu>\nNNTP-Posting-Host: punisher.caltech.edu',
   'arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee) writes:',
   '>>Look, I\'m not the one that made those Nazi comparisons.  Other people\n>>compared what the religious people are doing now to Nazi Germany.  They\n>>have said that it started out with little things (but no one really knew\n>>about any of these "little" things, strangely enough) and grew to 

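Here a document is represented as a tuple `(Path, List)`,

where `List = [header, first paragraph, second paragraph, ...]`: `split("\n\n")` cuts at every blank line, so the header is the first item and the body is spread over the remaining items.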
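Because `split("\n\n")` cuts at every blank line, a message with several paragraphs produces a list with more than two items. A minimal sketch, assuming one wants exactly `[header, body]`, passes `maxsplit=1` so that only the first blank line is used. `split_header_body` and the sample message are hypothetical and not part of the notebook:

```python
def split_header_body(element):
    """Split a (path, text) pair into (path, [header, body]).

    Only the first blank line is used as a separator, so the body keeps
    all of its paragraphs.
    """
    path, text = element
    parts = text.split("\n\n", 1)
    if len(parts) == 1:
        # No blank line found: treat the whole text as header, with an empty body.
        parts.append("")
    return (path, parts)


# Hypothetical sample message with a two-paragraph body.
sample = ("file:/tmp/doc", "From: a@b.c\nLines: 2\n\nfirst paragraph\n\nsecond paragraph")
path, (header, body) = split_header_body(sample)
print(header)
print(body)
```

With Spark, a function of this shape could be passed to `rdd.map` in place of `separateur`.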

### 3.5 Extracting the header fields

In [None]:
# Extracts the header fields.
# Takes the header of a document as parameter.
# Returns a dictionary {field name: field content}.

def extraction_informations_entete(entete):
  # each line holds one field
  l=entete.split("\n")
  d=dict()

  for x in l:
    # the field name is separated from its content by ':'
    z=x.split(":")
    # lines without ':' are skipped, and so are lines containing more than one ':'
    if len(z)==2:
       d[z[0]]=z[1]

  return d

#### First extraction example

In [None]:
# extract information from the first document of rdd1 (alt.atheism)

# using the first action
doc1=rdd1.first()
entete=doc1[1][0]

# Dictionary of the header information
dict1_info=extraction_informations_entete(entete)
dict1_info

{'Path': ' cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!uunet!news.claremont.edu!nntp-server.caltech.edu!keith',
 'From': ' keith@cco.caltech.edu (Keith Allan Schneider)',
 'Newsgroups': ' alt.atheism',
 'Organization': ' California Institute of Technology, Pasadena',
 'Lines': ' 28',
 'Message-ID': ' <1ql6jiINN5df@gap.caltech.edu>',
 'References': ' <1q0e4iINNa30@gap.caltech.edu> <1q52q8INN6pi@gap.caltech.edu> <93099.234144MVS104@psuvm.psu.edu> <1q8lk3INNitq@gap.caltech.edu> <C5CFLo.FzH@blaze.cs.jhu.edu>',
 'NNTP-Posting-Host': ' punisher.caltech.edu'}
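The dictionary above has no `Subject` or `Date` entry: those header lines contain more than one `:`, so `split(":")` returns more than two pieces and the line is skipped. A minimal sketch of an alternative uses `str.partition` to cut at the first `:` only. `parse_header` is a hypothetical name, and unlike the notebook's function it also strips the leading space from each value:

```python
def parse_header(header):
    """Return a {field name: field value} dictionary for a news header."""
    fields = {}
    for line in header.split("\n"):
        # partition cuts at the first ':' only; sep is '' when no ':' is present
        name, sep, value = line.partition(":")
        if sep:
            fields[name] = value.strip()
    return fields


header = "Subject: Re: <<Pompous ass\nDate: 16 Apr 1993 02:45:05 GMT\nLines: 28"
info = parse_header(header)
print(info["Subject"])
print(info["Date"])
```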

In [None]:
# select Organization and Newsgroups
print(dict1_info['Newsgroups'])
print(dict1_info['Organization'])

 alt.atheism
 California Institute of Technology, Pasadena


#### Second extraction example

In [None]:
# using the take action
doc2=rdd2.take(2)
entete2=doc2[1][1][0]
# Dictionary of the header information
dict2_info=extraction_informations_entete(entete2)
dict2_info

{'Newsgroups': ' rec.sport.baseball',
 'Path': ' cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!usc!cs.utexas.edu!utnut!torn!newshub.ccs.yorku.ca!newshub.ariel.yorku.ca!cs902060',
 'From': ' cs902060@ariel.yorku.ca (GEOFFREY E DIAS)',
 'Subject': ' How does a pitcher get a save?',
 'Message-ID': ' <1993Apr23.135139.18749@newshub.ariel.yorku.ca>',
 'Sender': ' news@newshub.ariel.yorku.ca (USENET News System)',
 'Organization': ' York University, Toronto, Canada',
 'Lines': ' 4'}

In [None]:
# select Organization and Newsgroups
print(dict2_info['Newsgroups'])
print(dict2_info['Organization'])

 rec.sport.baseball
 York University, Toronto, Canada


Here we create a new document-structuring function that produces RDD elements of the form 
`(Path, [header dictionary, document content])`

This makes the later conversion to pyspark.sql.Row easier when choosing the columns: some header fields do not exist in every document, and testing for their existence is simpler with a dictionary.
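The existence test can be sketched with `dict.get`, which returns `None` when a key is absent. The field list and the sample dictionary below are illustrative assumptions, not data from the notebook:

```python
# Fields we want as columns; a header may lack some of them.
WANTED_FIELDS = ["Newsgroups", "Organization", "Subject", "Date", "Lines", "From"]


def select_fields(header_dict):
    """Return one value per wanted field, None when the header lacks it."""
    return {field: header_dict.get(field) for field in WANTED_FIELDS}


# Hypothetical header without Subject, Date and Organization.
header_dict = {"Newsgroups": " alt.atheism", "Lines": " 28", "From": " someone@example.org"}
record = select_fields(header_dict)
print(record)
```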

In [None]:
# Takes one RDD element (doc) of the form (path, [header, paragraph 1, ...])
# and returns the tuple (path, [header dictionary, first body paragraph]).
def structuration_document(doc):
  if doc is not None:
    d=dict()
    for x in doc[1][0].split("\n"):
      # split on the first ':' only, so values that contain ':' (Subject, Date) are kept
      z=x.split(":",1)
      if len(z)==2:
        d[z[0]]=z[1]

    return (doc[0],[d,doc[1][1]])

In [None]:
new_rdd1=rdd1.map(structuration_document)


In [None]:
# display the result for new_rdd1
new_rdd1.take(2)

[('file:/content/20_newsgroups/alt.atheism/53179',
  [{'Path': ' cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!uunet!news.claremont.edu!nntp-server.caltech.edu!keith',
    'From': ' keith@cco.caltech.edu (Keith Allan Schneider)',
    'Newsgroups': ' alt.atheism',
    'Subject': ' Re: <<Pompous ass',
    'Date': ' 16 Apr 1993 02:45:05 GMT',
    'Organization': ' California Institute of Technology, Pasadena',
    'Lines': ' 28',
    'Message-ID': ' <1ql6jiINN5df@gap.caltech.edu>',
    'References': ' <1q0e4iINNa30@gap.caltech.edu> <1q52q8INN6pi@gap.caltech.edu> <93099.234144MVS104@psuvm.psu.edu> <1q8lk3INNitq@gap.caltech.edu> <C5CFLo.FzH@blaze.cs.jhu.edu>',
    'NNTP-Posting-Host': ' punisher.caltech.edu'},
   'arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee) writes:']),
 ('file:/content/20_newsgroups/alt.atheism/54164',
  [{'Xref': ' cantaloupe.srv.cs.cmu.edu alt.atheism:54164 alt.atheism.moderated:786 news.answers:7924 alt.answers:228',
    'Path': ' cantaloupe.srv.cs.cmu.ed

In [None]:
# apply the same function to rdd2
new_rdd2=rdd2.map(structuration_document)

In [None]:
new_rdd2.take(2)

[('file:/content/20_newsgroups/rec.sport.baseball/104373',
  [{'Newsgroups': ' rec.sport.baseball',
    'Path': ' cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!zaphod.mps.ohio-state.edu!malgudi.oar.net!news.ans.net!newsgate.watson.ibm.com!yktnews.watson.ibm.com!bones!kja',
    'From': ' kja@watson.ibm.com ( Kenneth J. Arbeitman)',
    'Subject': ' Missing subject header',
    'Sender': ' news@watson.ibm.com (NNTP News Poster)',
    'Message-ID': ' <1993Apr15.175316.15300@watson.ibm.com>',
    'Date': ' Thu, 15 Apr 1993 17:53:16 GMT',
    'Reply-To': ' kja@bones.fishkill.ibm.com ( Kenneth J. Arbeitman)',
    'Disclaimer': " This posting represents the poster's views, not necessarily those of IBM",
    'References': '  <93095@hydra.gatech.EDU>',
    'Nntp-Posting-Host': ' bones.fishkill.ibm.com',
    'Organization': ' IBM East Fishkill                                                        Subject: Re: Torre: The worst manager?',
    'Lines': ' 39'},


### 3.6 Merging the two new RDDs

The two RDDs are merged with `union`.

In [None]:
# Merge the two RDDs (union)
fusion=new_rdd1.union(new_rdd2)

In [None]:
fusion.take(2)

[('file:/content/20_newsgroups/alt.atheism/53179',
  [{'Path': ' cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!uunet!news.claremont.edu!nntp-server.caltech.edu!keith',
    'From': ' keith@cco.caltech.edu (Keith Allan Schneider)',
    'Newsgroups': ' alt.atheism',
    'Subject': ' Re: <<Pompous ass',
    'Date': ' 16 Apr 1993 02:45:05 GMT',
    'Organization': ' California Institute of Technology, Pasadena',
    'Lines': ' 28',
    'Message-ID': ' <1ql6jiINN5df@gap.caltech.edu>',
    'References': ' <1q0e4iINNa30@gap.caltech.edu> <1q52q8INN6pi@gap.caltech.edu> <93099.234144MVS104@psuvm.psu.edu> <1q8lk3INNitq@gap.caltech.edu> <C5CFLo.FzH@blaze.cs.jhu.edu>',
    'NNTP-Posting-Host': ' punisher.caltech.edu'},
   'arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee) writes:']),
 ('file:/content/20_newsgroups/alt.atheism/54164',
  [{'Xref': ' cantaloupe.srv.cs.cmu.edu alt.atheism:54164 alt.atheism.moderated:786 news.answers:7924 alt.answers:228',
    'Path': ' cantaloupe.srv.cs.cmu.ed

### 3.7 Transforming the merged RDD so that each element is a pyspark.sql.Row

In [None]:
from pyspark.sql import Row

# ToRow converts an element of the merged RDD into a Row, so that the data can be loaded into a DataFrame and queried with SQL.
def ToRow(x):
  entete=x[1][0]
  
  # Some fields (such as Subject or Date) do not exist in every document.
  # The dictionary built by structuration_document() makes the existence test easy:
  # dict.get returns None when the key is missing.
  row = Row(Path=x[0], Newsgroups=entete.get('Newsgroups'),
            Organization=entete.get('Organization'),
            Subject=entete.get('Subject'),
            Date=entete.get('Date'),
            Lines=entete.get('Lines'),
            From=entete.get('From'),
            Contenu=x[1][1])
  
  return row

rddF=fusion.map(ToRow)


In [None]:
rddF.take(2)

[Row(Path='file:/content/20_newsgroups/alt.atheism/53179', Newsgroups=' alt.atheism', Organization=' California Institute of Technology, Pasadena', Subject=' Re: <<Pompous ass', Date=' 16 Apr 1993 02:45:05 GMT', Lines=' 28', From=' keith@cco.caltech.edu (Keith Allan Schneider)', Contenu='arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee) writes:'),
 Row(Path='file:/content/20_newsgroups/alt.atheism/54164', Newsgroups=' alt.atheism,alt.atheism.moderated,news.answers,alt.answers', Organization=' Mantis Consultants, Cambridge. UK.', Subject=' Alt.Atheism FAQ: Overview for New Readers', Date=' Mon, 26 Apr 1993 14:08:03 GMT', Lines=' 146', From=' mathew <mathew@mantis.co.uk>', Contenu='Archive-name: atheism/overview\nAlt-atheism-archive-name: overview\nLast-modified: 20 April 1993\nVersion: 1.3')]

### 3.8 Creating a DataFrame from the merged RDD

In [None]:
# Create the Spark DataFrame
df=spark.createDataFrame(rddF)
df.printSchema()
df.show()
df.count()

root
 |-- Path: string (nullable = true)
 |-- Newsgroups: string (nullable = true)
 |-- Organization: string (nullable = true)
 |-- Subject: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Lines: string (nullable = true)
 |-- From: string (nullable = true)
 |-- Contenu: string (nullable = true)

+--------------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+
|                Path|          Newsgroups|        Organization|             Subject|                Date|Lines|                From|             Contenu|
+--------------------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+
|file:/content/20_...|         alt.atheism| California Insti...|   Re: <<Pompous ass| 16 Apr 1993 02:4...|   28| keith@cco.caltec...|arromdee@jyusenky...|
|file:/content/20_...| alt.atheism,alt....| Mantis Consultan...| 

2000

### 3.9 Saving the DataFrame in Avro format

In [None]:
!pip install pandavro

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import numpy as np
import pandas as pd
import pandavro as pdx
from pyspark.sql.avro.functions import from_avro, to_avro
filename = "df.avro"
# we go through a pandas DataFrame, which collects all the data on the driver
df2=df.toPandas()
pdx.to_avro(filename, df2)
#df.write.partitionBy("Path",).avro("output_dir")
#avroDf = df.select(to_avro(df.Path).alias("avro"))
#avroDf.collect()

In [None]:
# read back the Avro file that was just written
saved = pdx.read_avro(filename)


### 3.10 Saving the DataFrame in Parquet format

In [None]:
# save the DataFrame in Parquet format
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df2)
pq.write_table(table, 'df.parquet')

In [None]:
# test: read the DataFrame back from the Parquet file
table2 = pq.read_table('df.parquet')
table2.to_pandas()

Unnamed: 0,Path,Newsgroups,Organization,Subject,Date,Lines,From,Contenu
0,file:/content/20_newsgroups/alt.atheism/53179,alt.atheism,"California Institute of Technology, Pasadena",Re: <<Pompous ass,16 Apr 1993 02:45:05 GMT,28,keith@cco.caltech.edu (Keith Allan Schneider),arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee) ...
1,file:/content/20_newsgroups/alt.atheism/54164,"alt.atheism,alt.atheism.moderated,news.answer...","Mantis Consultants, Cambridge. UK.",Alt.Atheism FAQ: Overview for New Readers,"Mon, 26 Apr 1993 14:08:03 GMT",146,mathew <mathew@mantis.co.uk>,Archive-name: atheism/overview\nAlt-atheism-ar...
2,file:/content/20_newsgroups/alt.atheism/53589,alt.atheism,"Technical University Braunschweig, Germany",Re: Yet more Rushdie [Re: ISLAMIC LAW],"Wed, 21 Apr 1993 11:41:07 GMT",32,I3150101@dbstu1.rz.tu-bs.de (Benedikt Rosenau),In article <116172@bu.edu>\njaeger@buphy.bu.ed...
3,file:/content/20_newsgroups/alt.atheism/53657,alt.atheism,IBM Advanced Workstation Division,Re: some thoughts.,"Wed, 21 Apr 1993 15:46:14 GMT",21,karner@austin.ibm.com (F. Karner),"\nIn article <1993Apr20.195907.10765@mks.com>,..."
4,file:/content/20_newsgroups/alt.atheism/54137,alt.atheism,AT&T,Re: some thoughts.,"Mon, 26 Apr 1993 13:28:52 GMT",26,decay@cbnewsj.cb.att.com (dean.kaflowitz),"In article <kmr4.1718.735827952@po.CWRU.edu>, ..."
...,...,...,...,...,...,...,...,...
1995,file:/content/20_newsgroups/rec.sport.baseball...,rec.sport.baseball,Informix Software,Seeking All Star game Info,20 Apr 93 17:00:45 GMT,10,mcole@miracle.informix.com (Mary Cole),"\nOK, OK, OK. First, my apologies for perhaps ..."
1996,file:/content/20_newsgroups/rec.sport.baseball...,rec.sport.baseball,"University of Colorado, Boulder",Re: Bosox go down in smoke II (Seattle 7-0) ...,"Fri, 23 Apr 1993 19:25:16 GMT",10,franjion@spot.Colorado.EDU (John Franjione),dietz@parody.Data-IO.COM (Kent Dietz) writes:
1997,file:/content/20_newsgroups/rec.sport.baseball...,rec.sport.baseball,"Computer Science & Engineering, U. of Washing...",Mea Culpa -- Bosio no-no,"Fri, 23 Apr 93 17:08:59 GMT",46,barring@cs.washington.edu (David Barrington),"Like Clinton and Reno, I accept full responsib..."
1998,file:/content/20_newsgroups/rec.sport.baseball...,rec.sport.baseball,Pomona College,Dodgers newsletter?,6 Apr 93 02:03:00 GMT,3,jaufrecht@pomona.claremont.edu,Could somebody please tell me if there is a Do...


## 4 Descriptive analysis with the Spark SQL API

### 4.1 Checking that there are two distinct categories of documents

In [None]:
# number of documents per Newsgroups value
df.groupBy("Newsgroups").count().show()

+--------------------+-----+
|          Newsgroups|count|
+--------------------+-----+
| talk.religion.mi...|    2|
| sci.skeptic,alt....|    7|
| alt.atheism,soc....|    1|
| alt.atheism,rec....|    3|
| alt.atheism,talk...|   92|
| rec.scouting,soc...|    1|
| soc.culture.arab...|    2|
| alt.atheism,talk...|    5|
| alt.drugs,alt.at...|    2|
| alt.atheism,alt....|    1|
| alt.atheism,talk...|    1|
| alt.atheism,talk...|    2|
| alt.atheism,soc....|    8|
|         alt.atheism|  734|
| alt.atheism,alt....|    7|
| alt.atheism,talk...|   20|
| alt.slack,talk.r...|    1|
| talk.religion.mi...|    4|
| talk.abortion,al...|   94|
| alt.atheism,soc....|    2|
+--------------------+-----+
only showing top 20 rows



Note that the `Newsgroups` field can list other groups besides the two main categories, "alt.atheism" and "rec.sport.baseball".

To check that the documents do fall into these two distinct categories, we create two views:

* A `sport` view containing every row in which `rec.sport.baseball` appears.
* A `religion` view containing every row in which `alt.atheism` appears.

We then count the rows in each view and check whether the intersection of the two views is empty.

If the intersection is empty and the two counts add up to the total number of rows in the merged data, the documents are cleanly split between the two main categories "alt.atheism" and "rec.sport.baseball".
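The same check can be sketched with plain Python sets: if the two groups are disjoint and their sizes add up to the total, every document belongs to exactly one group. The short list of `Newsgroups` values is made up for illustration:

```python
# Made-up Newsgroups values, one per document.
newsgroups = [
    " alt.atheism",
    " alt.atheism,talk.religion.misc",
    " rec.sport.baseball",
    " rec.sport.baseball",
]

# Indices of the documents whose Newsgroups field mentions each category.
religion = {i for i, value in enumerate(newsgroups) if "alt.atheism" in value}
sport = {i for i, value in enumerate(newsgroups) if "rec.sport.baseball" in value}

disjoint = religion.isdisjoint(sport)
covers_everything = len(religion) + len(sport) == len(newsgroups)
print(disjoint, covers_everything)
```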

In [None]:
# Register the DataFrame as an SQL table (Document)

df.createOrReplaceTempView("Document")

In [None]:
# If the views already exist
#spark.sql("DROP VIEW sport")
#spark.sql("DROP VIEW religion")

In [None]:
# Create a view (sport) containing all documents in the rec.sport.baseball category
sqlDF = spark.sql("CREATE TEMP VIEW sport AS SELECT Newsgroups FROM Document WHERE Newsgroups LIKE '%rec.sport.baseball%'")

++
||
++
++



In [None]:
# Create a view (religion) containing all documents in the alt.atheism category
sqlDF = spark.sql("CREATE TEMP VIEW religion AS SELECT Newsgroups FROM Document WHERE Newsgroups LIKE '%alt.atheism%'")

++
||
++
++



In [None]:
# display the sport view
sqlDF2 = spark.sql("SELECT * FROM sport")
sqlDF2.show()

+-------------------+
|         Newsgroups|
+-------------------+
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
| rec.sport.baseball|
+-------------------+
only showing top 20 rows



In [None]:
# display the religion view
sqlDF3 = spark.sql("SELECT * FROM religion")
sqlDF3.show()

+--------------------+
|          Newsgroups|
+--------------------+
|         alt.atheism|
| alt.atheism,alt....|
|         alt.atheism|
|         alt.atheism|
|         alt.atheism|
|         alt.atheism|
|         alt.atheism|
|         alt.atheism|
| talk.abortion,al...|
|         alt.atheism|
|         alt.atheism|
|         alt.atheism|
| talk.abortion,al...|
|         alt.atheism|
| alt.atheism,talk...|
|         alt.atheism|
|         alt.atheism|
| talk.abortion,al...|
|         alt.atheism|
|         alt.atheism|
+--------------------+
only showing top 20 rows



In [None]:
# number of documents whose category contains rec.sport.baseball
sqlDF3 = spark.sql("SELECT count(s.Newsgroups) as nbr_rec_baseball FROM sport s ")
sqlDF3.show()

+----------------+
|nbr_rec_baseball|
+----------------+
|            1000|
+----------------+



In [None]:
# number of documents whose category contains alt.atheism
sqlDF3 = spark.sql("SELECT count(Newsgroups) as nb_alt_atheism FROM religion")
sqlDF3.show()
 

+--------------+
|nb_alt_atheism|
+--------------+
|          1000|
+--------------+



In [None]:
# Show that the two categories do not intersect.
# The religion view has 1000 rows and the sport view has 1000 rows
# (and the Document table has 2000 rows in total)
sqlDF3 = spark.sql("SELECT * FROM religion INTERSECT (SELECT * FROM sport) ")
sqlDF3.show()

+----------+
|Newsgroups|
+----------+
+----------+



**Conclusion:** there are indeed two distinct categories of documents.

### 4.2 Number of distinct organizations

In [None]:
# number of distinct organizations
from pyspark.sql.functions import countDistinct
df2=df.select(countDistinct("Organization"))
df2.show()

+----------------------------+
|count(DISTINCT Organization)|
+----------------------------+
|                         485|
+----------------------------+



### 4.3 Other descriptive statistics

Maximum, minimum and mean of the `Lines` attribute (stored as a string)

In [None]:
df.describe("Lines").show()

+-------+-----------------+
|summary|            Lines|
+-------+-----------------+
|  count|             1994|
|   mean|36.48294884653962|
| stddev|48.79137273450092|
|    min|                1|
|    max|               99|
+-------+-----------------+
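`Lines` was read as a string, so the `min` and `max` rows above are most likely string comparisons rather than numeric ones: the second document has 146 lines, yet `max` is 99. The difference can be reproduced in plain Python; the values below are a made-up sample in the same format:

```python
# Made-up sample of the Lines field as stored in the DataFrame (strings).
lines = [" 28", " 146", " 99", " 4", " 1"]

# Strings are compared character by character, so " 99" > " 146".
print(max(lines), min(lines))

# Converting to int first gives the numeric extremes.
numeric = [int(value) for value in lines]
print(max(numeric), min(numeric))
```

In Spark, casting the column first, for example `df.select(df["Lines"].cast("int"))`, should give numeric statistics.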



Number of documents per organization

In [None]:
df.groupBy("Organization").count().show()

+--------------------+-----+
|        Organization|count|
+--------------------+-----+
|           SunSelect|    2|
| NASA Goddard Spa...|    2|
| Case Western Res...|   39|
| Princeton Univer...|   24|
| University of Ne...|    2|
| S-CUBED, A Divis...|    1|
| Harlequin Ltd. C...|    1|
| IBM T.J. Watson ...|    1|
| Penn State Engin...|    1|
|   Cured, discharged|    3|
| University of Ne...|    7|
| National Optical...|    2|
| Dept. of Electro...|    1|
| Cabletron System...|    6|
| Welch Medical Li...|    2|
| Johns Hopkins Un...|    7|
|  Indiana University|   17|
| Okcforum Unix Us...|   32|
| Wesleyan University|    7|
| Tektronix, Inc.,...|   11|
+--------------------+-----+
only showing top 20 rows



Number of documents written by each author

In [None]:
df.groupBy("FROM").count().show()

+--------------------+-----+
|                FROM|count|
+--------------------+-----+
| dpw@sei.cmu.edu ...|    4|
| mccullou@snake2....|    6|
| livesey@solntze....|   70|
| darice@yoyo.cc.m...|   14|
| kax@cs.nott.ac.u...|    2|
| cust_ts@klaava.H...|    2|
| acooper@mac.cc.m...|    4|
| pzimmerm@mail.sa...|    1|
| "James F. Tims" ...|    1|
| kutluk@ccl.umist...|    1|
| <MVS104@psuvm.ps...|    1|
| halat@pooh.bears...|   15|
| davec@silicon.cs...|    1|
| magney@cco.calte...|    1|
| nancyo@shnext15....|    1|
| David O Hunt <bl...|    1|
| mikec@sail.LABS....|    2|
| sbradley@scic.in...|    2|
| dewey@risc.sps.m...|    3|
| adpeters@sunflow...|    3|
+--------------------+-----+
only showing top 20 rows



## 5.A. Text transformation: whole document
In what follows, we create a new RDD of pyspark.sql.Row elements, then a DataFrame with a single column containing the whole document (header values followed by the content). This serves as a first test of the KMeans algorithm.


In [None]:
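def ToRow_all_document(x):
  entete=x[1][0]
  # join all header values into a single string
  val=" ".join(entete.values())

  # one Row per document: header values followed by the content
  row = Row(document=val+x[1][1])
  return row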

all_document=fusion.map(ToRow_all_document)

In [None]:
# create the Spark DataFrame containing the whole document
all_document_df=spark.createDataFrame(all_document)
all_document_df.printSchema()

root
 |-- document: string (nullable = true)



In [None]:
# Text transformation
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

### 5.A.2 Splitting the documents into word lists with Tokenizer

In [None]:
tokenizer = Tokenizer(inputCol="document", outputCol="words")
#d=df.select("Contenu")
wordsData = tokenizer.transform(all_document_df)
wordsData.show()

+--------------------+--------------------+
|            document|               words|
+--------------------+--------------------+
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| alt.atheism  can...|[, alt.atheism, ,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| alt.atheism  can...|[, alt.ath
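The empty first token in the `words` column comes from the leading space left in each header value. A minimal sketch, assuming `Tokenizer` lowercases the text and splits on single whitespace characters, reproduces the effect with the standard library. `simple_tokenize` is a hypothetical helper and the sample text is made up:

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split on every single whitespace character."""
    return re.split(r"\s", text.lower())


text = " alt.atheism  Cantaloupe.srv.cs.cmu.edu"
tokens = simple_tokenize(text)
print(tokens)

# str.split() without arguments drops the empty tokens instead.
clean_tokens = text.lower().split()
print(clean_tokens)
```

Stripping the header values before joining them would be one way to avoid the empty tokens.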

### 5.A.3 Creating a vector representation of the documents with HashingTF

In [None]:
# Create a vector representation of the documents with HashingTF
hashingTF = HashingTF(inputCol="words", outputCol="features_hache", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
featurizedData.show()

+--------------------+--------------------+--------------------+
|            document|               words|      features_hache|
+--------------------+--------------------+--------------------+
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,1,2,3,5,6,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,1,2,3,4,5,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[1,2,4,5,6,7,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,4,5,6,...|
| cantaloupe.srv.c...|[, 
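`HashingTF` maps each word to one of `numFeatures` buckets and counts the occurrences per bucket, so with only 20 buckets many unrelated words share a position. A minimal sketch of this hashing trick in plain Python follows. `zlib.crc32` is a stand-in chosen for reproducibility and is not the hash function Spark uses, so the bucket indices will differ from those in the table:

```python
import zlib


def hashing_tf(words, num_features=20):
    """Return a term-frequency vector of length num_features."""
    vector = [0] * num_features
    for word in words:
        # Deterministic hash of the word, reduced to a bucket index.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


words = ["the", "pitcher", "gets", "a", "save", "the", "save"]
vector = hashing_tf(words)
print(vector)
```

A larger `num_features` lowers the chance that two different words collide in the same bucket.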

## 6-7.A. Grouping documents with similar vector representations

### A.1. KMeans with Tf-Idf weighting

#### Weighting words with the Tf-Idf formula

In [None]:

idf = IDF(inputCol="features_hache", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.show()

+--------------------+--------------------+--------------------+--------------------+
|            document|               words|      features_hache|            features|
+--------------------+--------------------+--------------------+--------------------+
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,1,2,3,5,6,...|(20,[0,1,2,3,5,6,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,4,5,6,...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,1,2,
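As a rough sketch of what the `IDF` stage computes, assuming the smoothed formula `idf = log((m + 1) / (df + 1))` documented for Spark ML (with `m` the number of documents and `df` the number of documents in which the feature is non-zero), the weighting can be written in plain Python. The three small count vectors are made up:

```python
import math


def idf_weights(count_vectors):
    """Return one IDF weight per feature: log((m + 1) / (df + 1))."""
    m = len(count_vectors)
    size = len(count_vectors[0])
    weights = []
    for j in range(size):
        # df: number of documents in which feature j appears at least once
        df = sum(1 for vector in count_vectors if vector[j] > 0)
        weights.append(math.log((m + 1) / (df + 1)))
    return weights


def tf_idf(count_vectors):
    """Multiply every term frequency by the IDF weight of its feature."""
    weights = idf_weights(count_vectors)
    return [[tf * w for tf, w in zip(vector, weights)] for vector in count_vectors]


# Made-up term-frequency vectors for three documents and four features.
counts = [[2, 0, 1, 1], [1, 0, 0, 1], [3, 1, 0, 1]]
weights = idf_weights(counts)
rescaled = tf_idf(counts)
print(weights)
```

Under this formula a feature present in every document gets weight 0, so it no longer contributes to the distance between documents.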

#### KMeans with Tf-Idf weighting

Since we will test KMeans in several settings (whole document or header only, Tf-Idf or normalisation), we define a function `K_means` that takes a DataFrame with a `features` column, runs the clustering on it, and prints the Silhouette score, the number of clusters, their centers and the size of each cluster.

In [None]:
# KMeans
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt


def K_means(Datavector):
    # Train a k-means model with 2 clusters.
    kmeans = KMeans().setK(2).setSeed(5)
    model = kmeans.fit(Datavector)
    # Assign each document to a cluster.
    predictions = model.transform(Datavector)
    # Evaluate the clustering by computing the Silhouette score.
    evaluator = ClusteringEvaluator()

    # The Silhouette score measures the quality of a clustering; its value lies between -1 and 1.
    #  1: the clusters are compact and well separated
    #  0: the clusters overlap, i.e. the distance between clusters is not significant
    # -1: points are assigned to the wrong cluster
    # s = (b - a) / max(a, b), where a is the mean intra-cluster distance
    # and b is the mean distance to the nearest other cluster

    silhouette = evaluator.evaluate(predictions)
    print("Silhouette with squared euclidean distance = " + str(silhouette))
    # Show the cluster centers.
    centers = model.clusterCenters()
    print("Cluster Centers: ", len(centers))
    for center in centers:
        print(center)

    # Show the size of each cluster.
    predictions.groupBy('prediction').count().show()

In [None]:
# KMeans with Tf-Idf
K_means(rescaledData)

Silhouette with squared euclidean distance = 0.9613129638840032
Cluster Centers:  2


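***THIS PART NEEDS TO BE REVISITED***

The Silhouette score ranges from -1 to 1, so the score obtained is close to the maximum.

The model found two well-separated clusters, but almost all of the predictions (99.75%) were assigned to cluster 0. The high score therefore reflects a handful of outlying documents being isolated rather than a split between the two newsgroups.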
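A high Silhouette score does not imply a balanced or meaningful split. A minimal sketch, assuming the squared Euclidean distance named in the printed message and the usual definition `s = (b - a) / max(a, b)`, shows two far-away points forming their own cluster and still yielding a score close to 1. The one-dimensional points are made up, and `silhouette` is a hypothetical helper rather than Spark's implementation:

```python
def silhouette(points, labels):
    """Mean silhouette of 1-D points, using squared Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum((point - q) ** 2 for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum((point - q) ** 2 for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Eight points near 0 and two outliers near 10: an 80% / 20% split.
points = [0.0, 0.1, 0.2, 0.0, 0.1, 0.2, 0.1, 0.0, 10.0, 10.1]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
score = silhouette(points, labels)
print(round(score, 3))
```

Checking the cluster sizes alongside the score, as `K_means` does with `groupBy('prediction').count()`, helps to spot this situation.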

### A.2 Normalising the vectors that represent the documents

#### Normalisation

In [None]:
# normalisation
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors
normalizer = Normalizer(inputCol="features_hache", outputCol="features", p=1.0)
l1NormData = normalizer.transform(featurizedData)
l1NormData.show()

+--------------------+--------------------+--------------------+--------------------+
|            document|               words|      features_hache|            features|
+--------------------+--------------------+--------------------+--------------------+
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,1,2,3,5,6,...|(20,[0,1,2,3,5,6,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,4,5,6,...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,1,2,
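
As an illustration of what `Normalizer` with `p=1.0` computes for each row, the following pure-Python sketch divides a vector of term counts by its L1 norm. The function name `l1_normalize`, the sample counts and the handling of the all-zero vector are assumptions made for this example.

```python
def l1_normalize(vector):
    """Divide every component by the L1 norm (sum of absolute values)."""
    norm = sum(abs(value) for value in vector)
    if norm == 0:
        # Choice made in this sketch: leave an all-zero vector unchanged.
        return list(vector)
    return [value / norm for value in vector]


# Hypothetical term counts for one document.
term_counts = [2.0, 0.0, 1.0, 1.0]
normalized = l1_normalize(term_counts)
print(normalized)  # [0.5, 0.0, 0.25, 0.25]
```

After this step the components of each document vector sum to 1, so long and short documents become comparable.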

#### KMeans with normalization

In [None]:
# KMeans on the L1-normalized features
K_means(l1NormData)

Silhouette with squared euclidean distance = 0.3599999179688524
Cluster Centers:  2


##5.B. Text transformation: headers only

Since the clustering did not give satisfactory results, we repeat the same steps on the document headers only.

We start by creating a new RDD of `pyspark.sql.Row` objects, then a DataFrame with a single column that contains only the header. This gives a second test of the KMeans algorithm.

In [None]:
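def ToRow_entete(x):
  # x[1][0] holds the document header as a dict-like object
  entete = x[1][0]
  # Concatenate all header field values into a single string
  val = " ".join(entete.values())
  return Row(entete=val)

# Build one Row per document, keeping only the header
entete_document_row = fusion.map(ToRow_entete)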

In [None]:
Entete_df=spark.createDataFrame(entete_document_row)
Entete_df.printSchema()

root
 |-- entete: string (nullable = true)
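
The header extraction can be tried out on a single record without Spark. The sketch below assumes that each element of `fusion` looks like `(key, (header_dict, body))`, and it uses a `namedtuple` as a stand-in for `pyspark.sql.Row`. The field names, the values and the leading spaces in the values are hypothetical and only serve the illustration.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row with a single "entete" field (assumption).
HeaderRow = namedtuple("HeaderRow", ["entete"])


def to_header_row(record):
    """Keep only the header of a record and join its values into one string."""
    header = record[1][0]
    return HeaderRow(entete=" ".join(header.values()))


# Hypothetical record: (key, (header_dict, body)).
record = (
    "doc-1",
    ({"Newsgroups": " alt.atheism", "Path": " cantaloupe.srv.cs.cmu.edu"},
     "body text"),
)

row = to_header_row(record)
print(repr(row.entete))
```

If the header values start with a space, as assumed here, the joined string starts with a space and contains double spaces between fields.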



###5.B.2 Splitting the documents into lists of words with Tokenizer

In [None]:
tokenizer2 = Tokenizer(inputCol="entete", outputCol="words")
#d=df.select("Contenu")
wordsData_entete = tokenizer2.transform(Entete_df)
wordsData_entete.show()

+--------------------+--------------------+
|              entete|               words|
+--------------------+--------------------+
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| alt.atheism  can...|[, alt.atheism, ,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| alt.atheism  can...|[, alt.atheism, ,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|
| alt.atheism  can...|[, alt.ath
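
The empty tokens visible in the `words` column (`[, alt.atheism, ,...`) are consistent with a tokenizer that lowercases the text and splits on every single whitespace character, which is how `Tokenizer` is assumed to behave here. The following pure-Python sketch reproduces that behaviour on a hypothetical header string; `whitespace_tokenize` and `non_empty_tokens` are illustrative helpers, not Spark functions.

```python
import re


def whitespace_tokenize(text):
    """Lowercase the text and split on each single whitespace character."""
    return re.split(r"\s", text.lower())


def non_empty_tokens(text):
    """Same split, but drop the empty strings produced by repeated spaces."""
    return [token for token in whitespace_tokenize(text) if token]


# Hypothetical header with a leading space and a double space.
header = " alt.atheism  Cantaloupe.srv.cs.cmu.edu"

tokens = whitespace_tokenize(header)
print(tokens)  # ['', 'alt.atheism', '', 'cantaloupe.srv.cs.cmu.edu']
print(non_empty_tokens(header))  # ['alt.atheism', 'cantaloupe.srv.cs.cmu.edu']
```

In Spark, one possible way to avoid the empty tokens would be `RegexTokenizer` with a pattern such as `\\s+`; that variant was not tested in this notebook.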

###5.B.3 Building a vector representation of the documents with HashingTF

In [None]:
# Build a vector representation of the documents with HashingTF
hashingTF = HashingTF(inputCol="words", outputCol="features_hache", numFeatures=20)
featurizedData_entete = hashingTF.transform(wordsData_entete)
featurizedData_entete.show()

+--------------------+--------------------+--------------------+
|              entete|               words|      features_hache|
+--------------------+--------------------+--------------------+
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,5,6,7,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,2,4,5,6,7,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,2,3,4,5,7,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[1,2,3,5,6,9,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[1,2,4,5,6,7,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,4,5,6,...|
| cantaloupe.srv.c...|[, 
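
With `numFeatures=20`, every word is mapped to one of only 20 positions, so different words necessarily share positions as soon as the vocabulary exceeds 20 words. The sketch below illustrates this hashing trick in pure Python. It uses `zlib.crc32` as the hash function, which is an assumption made to keep the example deterministic (Spark uses a different hash), and `hashing_tf` is a hypothetical helper name.

```python
import zlib


def hashing_tf(tokens, num_features=20):
    """Count tokens in a fixed-size vector indexed by hash(token) % num_features."""
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


# 25 distinct hypothetical words hashed into 20 buckets.
tokens = [f"word{i}" for i in range(25)]
vector = hashing_tf(tokens)

print(vector)
print("largest bucket:", max(vector))
```

Because 25 distinct words are placed into 20 buckets, at least one bucket counts two or more different words. A larger `numFeatures` reduces such collisions.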

##6-7.B. Grouping documents with similar vector representations

###B.1. KMeans with Tf-Idf weighting

#### Weighting words with the Tf-Idf formula

In [None]:
idf = IDF(inputCol="features_hache", outputCol="features")
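# Note: the IDF weights are fitted on featurizedData (full documents) and then applied to the header features;
# fitting on featurizedData_entete instead would derive the weights from the headers themselves.
idfModel = idf.fit(featurizedData)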
rescaledData_entete = idfModel.transform(featurizedData_entete)
rescaledData_entete.show()

+--------------------+--------------------+--------------------+--------------------+
|              entete|               words|      features_hache|            features|
+--------------------+--------------------+--------------------+--------------------+
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,5,6,7,...|(20,[0,2,3,5,6,7,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,4,5,6,...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,2,3,4,5,6,...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,2,4,5,6,7,...|(20,[0,2,4,5,6,7,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,2,3,4,5,7,...|(20,[0,2,3,4,5,7,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[1,2,3,5,6,9,...|(20,[1,2,3,5,6,9,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,
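
The weighting step can be illustrated on a tiny example. The sketch below assumes the smoothed formula idf = log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents in which a feature appears, and multiplies each term count by that weight. The helper names `idf_weights` and `tf_idf` and the count vectors are hypothetical.

```python
import math


def idf_weights(count_vectors):
    """IDF weight per feature: log((N + 1) / (df + 1))."""
    num_docs = len(count_vectors)
    num_features = len(count_vectors[0])
    weights = []
    for j in range(num_features):
        # df: number of documents in which feature j occurs at least once.
        doc_freq = sum(1 for vector in count_vectors if vector[j] > 0)
        weights.append(math.log((num_docs + 1) / (doc_freq + 1)))
    return weights


def tf_idf(count_vectors):
    """Multiply every term count by the IDF weight of its feature."""
    weights = idf_weights(count_vectors)
    return [[count * w for count, w in zip(vector, weights)]
            for vector in count_vectors]


# Hypothetical term counts: 3 documents, 3 features.
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [3, 0, 1],
]

weights = idf_weights(counts)
print(weights)
print(tf_idf(counts))
```

Feature 0 occurs in every document, so its weight is 0 and it no longer influences the distance between documents. The two rarer features keep a positive weight.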

#### KMeans with Tf-Idf weighting

In [None]:
K_means(rescaledData_entete)

Silhouette with squared euclidean distance = 0.20173668245025458
Cluster Centers:  2


###B.2. KMeans with normalization

####Normalization

In [None]:
# L1 normalization (p=1.0) of the header vectors
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors
normalizer = Normalizer(inputCol="features_hache", outputCol="features", p=1.0)
l1NormData_entete = normalizer.transform(featurizedData_entete)
l1NormData_entete.show()

+--------------------+--------------------+--------------------+--------------------+
|              entete|               words|      features_hache|            features|
+--------------------+--------------------+--------------------+--------------------+
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,5,6,7,...|(20,[0,2,3,5,6,7,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,4,5,6,...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,2,3,4,5,6,...|(20,[0,2,3,4,5,6,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,2,4,5,6,7,...|(20,[0,2,4,5,6,7,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,1,2,3,4,5,...|(20,[0,1,2,3,4,5,...|
| alt.atheism  can...|[, alt.atheism, ,...|(20,[0,2,3,4,5,7,...|(20,[0,2,3,4,5,7,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[1,2,3,5,6,9,...|(20,[1,2,3,5,6,9,...|
| cantaloupe.srv.c...|[, cantaloupe.srv...|(20,[0,2,3,

####KMeans with normalization

In [None]:
K_means(l1NormData_entete)

Silhouette with squared euclidean distance = 0.14923718008341955
Cluster Centers:  2
