<h1 style="text-align: center;">&nbsp;<img style="font-size: 0.9em;" src="https://www.hospitalitynet.org/picture/153007157/travelers-push-tripadvisor-past-1-billion-reviews-opinions.jpg?t=1587981992" alt="" width="300" height="100" /><span style="font-family: tahoma, arial, helvetica, sans-serif; font-size: large;"><span style="font-size: x-large;"> Data preprocessing with PySpark</span></span><span style="font-family: tahoma, arial, helvetica, sans-serif; font-size: large;">&nbsp; &nbsp; &nbsp;&nbsp;</span>&nbsp;<img src="https://i0.wp.com/mosefparis1.fr/wp-content/uploads/2022/10/cropped-image-1.png?fit=532%2C540&amp;ssl=1" alt="" width="150" height="150" />&nbsp;</h1>
<p style="text-align: center;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;by Lucie Gabagnou and Yanis Rehoune</p>

In this second notebook, we build a data preprocessing pipeline. The raw data extracted by web scraping is not necessarily in the expected format, so we need to clean it and extract relevant features.




# Processing with PySpark
In this part, we process our data: we clean it (the text in particular) and, where useful, create new features from the raw scraped dataset. We need to make sure that the types are correct, that the text is usable for NLP, and that every feature is exploitable (typically, the address must be turned into geographic coordinates).




In our case, PySpark is convenient for applying functions to a large number of rows.
At the end of this step, the data will be ready for exploratory analysis and machine learning.
Note: intermediate views of the data are kept to a minimum, because `.show()` calls consume memory and are slow.



#### Environment setup

In [1]:

import findspark
findspark.init()
findspark.find()

'/Users/luciegabagnou/opt/anaconda3/envs/scrap/lib/python3.9/site-packages/pyspark'

In [2]:
import os
os.getcwd()

'/Users/luciegabagnou/Documents/MOSEF/PYTHON/projet_trip_advisor/sentiment-analysis-tripadvisor/notebooks'

In [3]:
current_path=os.path.dirname(os.getcwd())
os.chdir(current_path)
print(os.getcwd())

/Users/luciegabagnou/Documents/MOSEF/PYTHON/projet_trip_advisor/sentiment-analysis-tripadvisor


In [4]:
os.getcwd()

'/Users/luciegabagnou/Documents/MOSEF/PYTHON/projet_trip_advisor/sentiment-analysis-tripadvisor'

### Creating a Spark DataFrame


In [5]:
!pip install -r requirements.txt



In [6]:

from pyspark import  HiveContext , SparkContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, DoubleType
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
import re
from scripts.utils import get_digits



First, we create a Spark session, in which we load the DataFrame and apply our transformations. In this notebook we load the DataFrame from a JSON file, but in the development part the same steps will be run against the MySQL database.

In [7]:
spark = SparkSession.builder.appName("Load JSON").getOrCreate()
df = spark.read.option("multiline","true").json("data/fetch_data.json")
df.show()


+------------+--------------------+--------------------+--------------+--------------------+---------+--------------------+--------------------+
|average_note|            location|                name|number_reviews|  price_and_cuisines|  ranking|             reviews|                 url|
+------------+--------------------+--------------------+--------------+--------------------+---------+--------------------+--------------------+
|         4,5|149 boulevard Vol...|        Cafe Leopard|            97|[€€-€€€, Français...|    Nº 39|[Café Leopard, al...|https://www.tripa...|
|         4,0|68 Rue de Grenell...|       Cuillier Café|             7|           [€, Café]|   Nº 708|[Bon petit goûter...|https://www.tripa...|
|         4,0|40 rue Gregoire d...|          Oenosteria|           138|[€€-€€€, Italienn...| Nº 2 485|[A éviter absolum...|https://www.tripa...|
|         4,5|9 Rue Joseph de M...|           La Bossue|           480|[€€-€€€, Fran

First, we check that the types are correct:

In [8]:
df.printSchema()

root
 |-- average_note: string (nullable = true)
 |-- location: string (nullable = true)
 |-- name: string (nullable = true)
 |-- number_reviews: string (nullable = true)
 |-- price_and_cuisines: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ranking: string (nullable = true)
 |-- reviews: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- url: string (nullable = true)



We can see that "average_note" is a string although it should be a decimal number. The same applies to the number of reviews and to the ranking, which should both be numeric.


#### Types
We fix the types that were flagged as problematic above:

In [9]:
from scripts.utils import get_digits

In [10]:
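def get_digits(string_to_digit: str):
    """
    Concatenate every numeric group found in the string and return the result
    as a float (e.g. "Nº 2 485" -> 2485.0).
    Returns None for an empty or missing string, and 0.0 if nothing can be parsed.
    """
    if string_to_digit:
        try:
            # Joining the matches drops separators such as the space in "2 485"
            return float("".join(re.findall(r'[+-]?\d*\.\d+|\d+', string_to_digit)))
        except ValueError:
            return 0.0
    return None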
    

In [11]:
# Create a user-defined function (UDF) from get_digits, the helper also available in scripts.utils
# Applied to the ranking column, it extracts the number from the string (e.g. "n°1" => 1.0)
# The same UDF is then applied to the "average_note" and "number_reviews" columns
from pyspark.sql.functions import col, lit, regexp_replace
get_digits_udf = udf(get_digits, DoubleType())
# Replace the decimal comma with a dot so that the note can be parsed as a float
df = df.withColumn("average_note", regexp_replace(col("average_note"), ",", "."))
df = df.withColumn("ranking", get_digits_udf(df["ranking"]))
df = df.withColumn("average_note", get_digits_udf(df["average_note"]))
df = df.withColumn("number_reviews", get_digits_udf(df["number_reviews"]))
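To see what this parsing step does to the raw strings, the standalone sketch below applies the same regular expression outside Spark. `parse_number` is a hypothetical name used only for this illustration; it mirrors the `get_digits` logic, and the sample strings are modelled on the values printed by the first `df.show()`.

```python
import re

# Same pattern as in get_digits: a signed decimal number, or a plain run of digits
NUMBER_PATTERN = re.compile(r"[+-]?\d*\.\d+|\d+")


def parse_number(raw):
    """Hypothetical stand-in for get_digits, for illustration only."""
    if not raw:
        return None  # a missing value stays missing
    groups = NUMBER_PATTERN.findall(raw)
    try:
        # Joining the groups drops separators such as the space in "2 485"
        return float("".join(groups))
    except ValueError:
        return 0.0  # nothing parseable was found


# "4,5" gets its decimal comma replaced first, as regexp_replace does in the notebook
samples = ["Nº 39", "Nº 2 485", "4,5".replace(",", "."), "97", "", "no digits"]
parsed = [parse_number(s) for s in samples]
print(parsed)
```

Note that a string without any digit yields `0.0` while an empty string yields `None`, which is why both zeros and nulls can appear in the converted columns.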


In [12]:
df.printSchema()

root
 |-- average_note: double (nullable = true)
 |-- location: string (nullable = true)
 |-- name: string (nullable = true)
 |-- number_reviews: double (nullable = true)
 |-- price_and_cuisines: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ranking: double (nullable = true)
 |-- reviews: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- url: string (nullable = true)



In [13]:
df.show()


+------------+--------------------+--------------------+--------------+--------------------+-------+--------------------+--------------------+
|average_note|            location|                name|number_reviews|  price_and_cuisines|ranking|             reviews|                 url|
+------------+--------------------+--------------------+--------------+--------------------+-------+--------------------+--------------------+
|         4.5|149 boulevard Vol...|        Cafe Leopard|          97.0|[€€-€€€, Français...|   39.0|[Café Leopard, al...|https://www.tripa...|
|         4.0|68 Rue de Grenell...|       Cuillier Café|           7.0|           [€, Café]|  708.0|[Bon petit goûter...|https://www.tripa...|
|         4.0|40 rue Gregoire d...|          Oenosteria|         138.0|[€€-€€€, Italienn...| 2485.0|[A éviter absolum...|https://www.tripa...|
|         4.5|9 Rue Joseph de M...|           La Bossue|         480.0|[€€-€€€, Français...|  174.0|[Su

                                                                                

The types have been corrected.

### Feature engineering:
- We split the content of the "price_and_cuisines" column, which mixes the price range and the cuisine types. We did not do this during web scraping because separating the two was not straightforward.
- Geolocation: we want to retrieve geographic coordinates for the app.

##### Price and cuisines

Indeed, there is no simple splitting rule, because the number of elements varies from one restaurant to another (sometimes no information at all, sometimes four items, sometimes a price, sometimes none, etc.):

In [14]:
from scripts.preprocessor.global_processor import geocode_address, separate_price_and_cuisine

# Impose a schema so that the output has the desired final format: arrays of strings
udf_separate_price_and_cuisine = udf(separate_price_and_cuisine, StructType([
    StructField("price", ArrayType(StringType())),
    StructField("cuisine", ArrayType(StringType()))
]))  # The schema is passed to the UDF as its return type

# Apply the UDF, then drop the original combined column
df = df.withColumn("price", udf_separate_price_and_cuisine("price_and_cuisines").price)
df = df.withColumn("cuisine", udf_separate_price_and_cuisine("price_and_cuisines").cuisine)
df = df.drop("price_and_cuisines")
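The body of `separate_price_and_cuisine` lives in `scripts.preprocessor.global_processor` and is not shown in this notebook. As a rough idea of what such a splitter could look like, here is a minimal sketch that assumes price entries are the ones containing a euro sign. The function name `split_price_and_cuisine` and this rule are assumptions for illustration, not the project's actual implementation.

```python
from typing import List, Optional, Tuple


def split_price_and_cuisine(items: Optional[List[str]]) -> Tuple[List[str], List[str]]:
    """Hypothetical splitter: entries containing a euro sign are treated as prices."""
    price, cuisine = [], []
    for item in items or []:  # a missing list yields two empty lists
        if "€" in item:
            price.append(item)
        else:
            cuisine.append(item)
    # A 2-tuple maps onto a (price, cuisine) struct when used as a UDF return value
    return price, cuisine


print(split_price_and_cuisine(["€€-€€€", "Française", "Européenne"]))
print(split_price_and_cuisine(["Italienne", "Pizza"]))
print(split_price_and_cuisine(None))
```

Returning two lists in every case, even empty ones, keeps the output compatible with a fixed two-field schema regardless of how many elements the restaurant page provided.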


In [15]:
#df.select(["price","cuisine"]).show() 

##### Geolocation

In [16]:
from geopy.geocoders import Nominatim
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, DoubleType
from scripts.preprocessor.global_processor import geocode_address

geocode_udf = udf(
            geocode_address,
            returnType = StructType([
    StructField("latitude", DoubleType()),
    StructField("longitude",DoubleType())]))

df = df.repartition(20) 
df = df.withColumn("longitude", geocode_udf("location").longitude)
df = df.withColumn("latitude", geocode_udf("location").latitude)
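The `geocode_address` helper is imported from the project's scripts and its body is not shown here. The sketch below illustrates one way an address-to-coordinates function could be written with geopy's `Nominatim` geocoder. The names `make_nominatim_geocoder`, `geocode_address_sketch` and the `user_agent` string are hypothetical, and the stub geocoder with its coordinates is made up so the example runs without network access.

```python
from typing import Callable, Optional, Tuple


def make_nominatim_geocoder(user_agent: str = "tripadvisor-preprocessing-demo") -> Callable:
    """Build a rate-limited Nominatim lookup (requires geopy and network access)."""
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Space out the calls to stay polite with the public Nominatim service
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def geocode_address_sketch(
    address: Optional[str], geocode: Callable
) -> Tuple[Optional[float], Optional[float]]:
    """Return (latitude, longitude), or (None, None) when the lookup fails."""
    if not address:
        return None, None
    try:
        location = geocode(address)
    except Exception:  # network errors, timeouts, service refusals
        return None, None
    if location is None:  # address not found
        return None, None
    return location.latitude, location.longitude


class StubLocation:
    """Stand-in for a geopy Location, with made-up coordinates."""
    latitude = 48.85
    longitude = 2.35


def stub_geocode(address):
    """Offline replacement for the real geocoder, used only in this demo."""
    return StubLocation() if "Paris" in address else None


print(geocode_address_sketch("1 Example Street, Paris", stub_geocode))
print(geocode_address_sketch("Unknown place", stub_geocode))
```

The returned pair follows the (latitude, longitude) field order of the struct declared for `geocode_udf`. Passing the geocoder as an argument makes the function easy to test offline; with the real service one would pass `make_nominatim_geocoder()` instead of the stub.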

In [17]:
 
df.show()


+------------+--------------------+--------------------+--------------+-------+--------------------+--------------------+--------+--------------------+-----------------+-----------+
|average_note|            location|                name|number_reviews|ranking|             reviews|                 url|   price|             cuisine|        longitude|   latitude|
+------------+--------------------+--------------------+--------------+-------+--------------------+--------------------+--------+--------------------+-----------------+-----------+
|        null|14 Rue Saint-Marc...|Domino's Pizza Pa...|           0.0|   null|                  []|https://www.tripa...|     [€]|         [Française]|        2.3406323| 48.8704412|
|         5.0|54 rue Piat, 7502...|  God Bless Broccoli|          29.0| 4346.0|[Déjeuner:Nous av...|https://www.tripa...|[€€-€€€]|[Italienne, Pizza...|        2.3836598| 48.8731146|
|         2.5|55 57 Cours Saint...|               Kaori|         155.0|13813

                                                                                

##### Reviews

In [18]:
!python -m spacy download fr_core_news_md

Collecting fr-core-news-md==3.5.0
  Using cached https://github.com/explosion/spacy-models/releases/download/fr_core_news_md-3.5.0/fr_core_news_md-3.5.0-py3-none-any.whl (45.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_md')


In [19]:
import nltk
import re
import spacy
from unidecode import unidecode
from nltk.corpus import stopwords


In [20]:
from scripts.preprocessor.text_processor import clean_text_sentiment_analysis

In [21]:
udf_text_cleaning = udf(clean_text_sentiment_analysis, StructType([
    StructField("reviews", ArrayType(StringType())),
    StructField("ratings", ArrayType(StringType()))]))


df=df.withColumn("clean_reviews", udf_text_cleaning("reviews").reviews)
df=df.withColumn("ratings", udf_text_cleaning("reviews").ratings)
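The cleaning function `clean_text_sentiment_analysis` is defined in `scripts.preprocessor.text_processor` and is not reproduced in this notebook. To give an idea of the kind of normalization involved, here is a simplified sketch using only the standard library: lowercasing, accent removal, punctuation removal and stopword filtering. The function names and the tiny stopword set are assumptions for illustration; judging by its imports, the project relies on NLTK and spaCy instead, and lemmatization is not covered here.

```python
import re
import unicodedata

# Tiny illustrative stopword set (already accent-free); not the list used by the project
FRENCH_STOPWORDS = {"le", "la", "les", "un", "une", "des", "de", "du", "et", "est", "tres"}


def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_review_sketch(review: str) -> str:
    """Lowercase, remove accents and punctuation, then drop stopwords."""
    text = strip_accents(review.lower())
    text = re.sub(r"[^a-z\s]", " ", text)  # keep ASCII letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in FRENCH_STOPWORDS]
    return " ".join(tokens)


print(clean_review_sketch("Dîner très agréable, le service est parfait !"))
```

Stripping accents before filtering means the stopword set must be accent-free as well, which is why "tres" rather than "très" appears in it.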

In [22]:
df.show()


+------------+--------------------+--------------------+--------------+-------+--------------------+--------------------+--------+--------------------+-----------------+-----------+--------------------+--------------------+
|average_note|            location|                name|number_reviews|ranking|             reviews|                 url|   price|             cuisine|        longitude|   latitude|       clean_reviews|             ratings|
+------------+--------------------+--------------------+--------------+-------+--------------------+--------------------+--------+--------------------+-----------------+-----------+--------------------+--------------------+
|        null|14 Rue Saint-Marc...|Domino's Pizza Pa...|           0.0|   null|                  []|https://www.tripa...|     [€]|         [Française]|        2.3406323| 48.8704412|                  []|                  []|
|         5.0|54 rue Piat, 7502...|  God Bless Broccoli|          29.0| 4346.0|[Déjeuner:Nous av...|

                                                                                

In [23]:
clean_data = df.toPandas()  # Takes about 10 minutes


In [None]:
clean_data

Unnamed: 0,average_note,location,name,number_reviews,ranking,reviews,url,price,cuisine,longitude,latitude,clean_reviews,ratings
0,,"225 boulevard Voltaire, 75011 Paris France",Shinzzo Paris 75011,0.0,,[],https://www.tripadvisor.fr/Restaurant_Review-g...,[],"[Chinoise, Japonaise, Asiatique]",2.390974,48.851526,[],[]
1,3.0,"33 Rue Saint Jacques, 75005 Paris France",Soho Trattoria,76.0,1430.0,[Bravo Maria Meilleur service:Dîner très agréa...,https://www.tripadvisor.fr/Restaurant_Review-g...,[],"[Italienne, Pizza]",2.345867,48.851156,[[bravo marier meilleur service diner agreabl ...,"[[5.0], [2.0], [3.0], [3.0], [5.0], [1.0], [4...."
2,4.0,"16 rue Guillaume Bertrand, 75011 Paris France",Restaurant El Camino,41.0,74.0,[Restaurant correct mais sans plus :Un restaur...,https://www.tripadvisor.fr/Restaurant_Review-g...,[€],"[Latino, Sud-américaine]",2.380629,48.863593,[[restaurer correct restaurant chilien correct...,"[[3.0], [5.0], [5.0], [4.0], [5.0], [2.0], [5...."
3,4.5,"44 rue des Vinaigriers, 75010 Paris France",Gravity Bar,143.0,2258.0,[Superbe adresse 🔥:L'expérience était vraiment...,https://www.tripadvisor.fr/Restaurant_Review-g...,[€€-€€€],"[Végétariens bienvenus, Plats sans gluten]",2.360821,48.873218,[[superbe adresse experience vraiment top equi...,"[[5.0], [5.0], [5.0], [5.0], [5.0], [5.0], [4...."
4,4.5,"5 rue du Nil, 75002 Paris France",Frenchie - Rue du Nil,1342.0,551.0,"[Pas mal, sans plus.:Des associations de saveu...",https://www.tripadvisor.fr/Restaurant_Review-g...,[€€€€],"[Française, Végétariens bienvenus, Plats sans ...",2.347922,48.867744,[[mal plus.:de association saveur original reu...,"[[3.0], [4.0], [5.0], [4.0], [1.0], [3.0], [5...."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,3.5,"7 Rue Boulle, 75011 Paris France",L'Échappée Belle,148.0,182.0,[Un service qui se plante et donne comme excus...,https://www.tripadvisor.fr/Restaurant_Review-g...,[€€-€€€],"[Française, Européenne, Moderne]",2.372001,48.857000,[[service plante donne excus ordinateur plat p...,"[[3.0], [1.0], [5.0], [4.0], [1.0], [5.0], [5...."
996,5.0,"31 rue Saint-Lazare, 75009 Paris France",Mizupoke,1.0,12244.0,[Un excellent petit restaurant pour manger des...,https://www.tripadvisor.fr/Restaurant_Review-g...,[],[],2.336122,48.876699,[[excellent petit restaurer manger poke emport...,[[5.0]]
997,4.0,"156 boulevard Voltaire, 75011 Paris France",PHO 156,86.0,200.0,[Excellent:Au détour d'une rue nous avons déc...,https://www.tripadvisor.fr/Restaurant_Review-g...,[€],"[Asiatique, Vietnamienne, Végétariens bienvenus]",2.382831,48.856004,[[excellent detour rue decouvert restaurer cui...,"[[5.0], [5.0], [1.0], [5.0], [5.0], [5.0], [2...."
998,,"Rue Gervex, Paris France",Le Riad,0.0,,[],https://www.tripadvisor.fr/Restaurant_Review-g...,[],"[Marocaine, Méditerranéenne]",2.298243,48.888137,[],[]


In [None]:
clean_data.to_json("data/clean_data.json")