This project uses **PySpark** to analyze the monthly revenue of shops located in several French cities. The data is read from multiple files, consolidated, and explored to produce financial insights such as:

*   Average monthly revenue overall and per city.
*   Total annual revenue per city and per store.
*   The best-performing store in each month.

This project illustrates the use of PySpark for large-scale data processing, aggregation, and advanced analysis in a retail context.

In [None]:
pip install pyspark

Note: you may need to restart the kernel to use updated packages.





In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType
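# Note: sum and max imported here are Spark column functions; they shadow Python's built-ins in this notebook.
from pyspark.sql.functions import col, split, avg, sum, max, lit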
from functools import reduce

In [None]:
spark = SparkSession.builder.appName("Shop Revenue Analysis").getOrCreate()

In [None]:
schema = StructType([
    StructField("Month", StringType(), True),
    StructField("Revenue", FloatType(), True)
])

In [None]:
files = {
    "anger": r"C:\Users\Mehdi Katana\Desktop\Spark\content\anger.txt",
    "lyon": r"C:\Users\Mehdi Katana\Desktop\Spark\content\lyon.txt",
    "nice": r"C:\Users\Mehdi Katana\Desktop\Spark\content\nice.txt",
    "orlean": r"C:\Users\Mehdi Katana\Desktop\Spark\content\orlean.txt",
    "paris_1": r"C:\Users\Mehdi Katana\Desktop\Spark\content\paris_1.txt",
    "paris_2": r"C:\Users\Mehdi Katana\Desktop\Spark\content\paris_2.txt",
    "paris_3": r"C:\Users\Mehdi Katana\Desktop\Spark\content\paris_3.txt",
    "rennes": r"C:\Users\Mehdi Katana\Desktop\Spark\content\rennes.txt",
    "toulouse": r"C:\Users\Mehdi Katana\Desktop\Spark\content\toulouse.txt",
    "troyes": r"C:\Users\Mehdi Katana\Desktop\Spark\content\troyes.txt",
    "marseilles_1": r"C:\Users\Mehdi Katana\Desktop\Spark\content\marseilles_1.txt",
    "marseilles_2": r"C:\Users\Mehdi Katana\Desktop\Spark\content\marseilles_2.txt",
    "nantes": r"C:\Users\Mehdi Katana\Desktop\Spark\content\nantes.txt"
}
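The mapping above repeats the same folder for every store. As a minimal sketch, the same dictionary could be generated from one base directory; `build_file_map` and `STORE_NAMES` are hypothetical names introduced here for illustration, and `PureWindowsPath` is used only because the original paths are Windows paths.

```python
from pathlib import PureWindowsPath

# Store keys copied from the mapping above.
STORE_NAMES = [
    "anger", "lyon", "nice", "orlean", "paris_1", "paris_2", "paris_3",
    "rennes", "toulouse", "troyes", "marseilles_1", "marseilles_2", "nantes",
]

def build_file_map(base_dir, names=STORE_NAMES):
    """Hypothetical helper: map each store name to '<name>.txt' inside base_dir."""
    base = PureWindowsPath(base_dir)
    return {name: str(base / f"{name}.txt") for name in names}

files = build_file_map(r"C:\Users\Mehdi Katana\Desktop\Spark\content")
```

Changing the data folder then means editing a single string rather than thirteen paths.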

In [None]:
def load_data(file_path, store_name):
    """Read one space-separated revenue file and tag every row with its store name."""
    df = spark.read.csv(file_path, schema=schema, sep=" ")
    df = df.withColumn("Store", lit(store_name))
    return df
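To make the expected file layout concrete, here is a plain-Python sketch of what the reader does for a single line. It assumes each line looks like `JAN 13` (a month code, one space, a number), which is inferred from the schema and the `sep=" "` argument rather than confirmed by the source files; `parse_line` is a hypothetical helper, not part of the notebook.

```python
def parse_line(line, store_name):
    """Split one 'MONTH REVENUE' line into a (month, revenue, store) tuple."""
    month, revenue = line.strip().split(" ")
    return month, float(revenue), store_name

print(parse_line("JAN 13", "anger"))  # ('JAN', 13.0, 'anger')
```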

In [None]:
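# Load every file, then stack the per-store DataFrames into one (union matches columns by position).
df_list = [load_data(path, name) for name, path in files.items()]
df = reduce(lambda x, y: x.union(y), df_list)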

In [None]:
df.printSchema()
df.show()


root
 |-- Month: string (nullable = true)
 |-- Revenue: float (nullable = true)
 |-- Store: string (nullable = false)

+-----+-------+-----+
|Month|Revenue|Store|
+-----+-------+-----+
|  JAN|   13.0|anger|
|  FEB|   12.0|anger|
|  MAR|   14.0|anger|
|  APR|   15.0|anger|
|  MAY|   12.0|anger|
|  JUN|   15.0|anger|
|  JUL|   19.0|anger|
|  AUG|   15.0|anger|
|  SEP|   13.0|anger|
|  OCT|    8.0|anger|
|  NOV|   14.0|anger|
|  DEC|   16.0|anger|
|  JAN|   13.0| lyon|
|  FEB|   12.0| lyon|
|  MAR|   14.0| lyon|
|  APR|   15.0| lyon|
|  MAY|   12.0| lyon|
|  JUN|   15.0| lyon|
|  JUL|   19.0| lyon|
|  AUG|   25.0| lyon|
+-----+-------+-----+
only showing top 20 rows



# 1. Average monthly income of the shop (all branches) in France

In [None]:
average_monthly_income_france = df.groupBy("Month").agg(avg("Revenue").alias("Average_Monthly_Income_France"))


# 2. Average monthly income of the shop in each city

In [None]:
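# Grouped by Store, so branches such as paris_1 and paris_2 are reported separately rather than merged per city.
average_monthly_income_city = df.groupBy("Store", "Month").agg(avg("Revenue").alias("Average_Monthly_Income_City"))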
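Some store keys carry a branch suffix (`paris_1`, `marseilles_2`), so a true per-city figure needs the suffix removed first. The plain-Python sketch below shows that logic on toy values; `store_to_city` and `average_by_city_month` are hypothetical helpers, and the sample numbers are illustrative, not taken from the data files. In Spark, the corresponding column expression would be `split(col("Store"), "_").getItem(0)`.

```python
from collections import defaultdict

def store_to_city(store):
    """Drop a trailing branch suffix: 'paris_1' -> 'paris', 'lyon' -> 'lyon'."""
    return store.split("_")[0]

def average_by_city_month(rows):
    """Average revenue per (city, month) from (month, revenue, store) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # (city, month) -> [sum, count]
    for month, revenue, store in rows:
        bucket = totals[(store_to_city(store), month)]
        bucket[0] += revenue
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Toy values for illustration only, not taken from the data files.
sample = [("JAN", 10.0, "paris_1"), ("JAN", 20.0, "paris_2"), ("JAN", 13.0, "lyon")]
print(average_by_city_month(sample))
```

Running the same check on a few real rows is a quick way to confirm that a Spark aggregation groups branches the way you intend.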

# 3. Total revenue per city per year

In [None]:
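# Derive the city by dropping the branch suffix (paris_1 -> paris), then sum revenue per city.
total_revenue_city_year = (
    df.withColumn("City", split(col("Store"), "_").getItem(0))
    .groupBy("City")
    .agg(sum("Revenue").alias("Total_Revenue_City_Year"))
)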

# 4. Total revenue per store per year

In [None]:
total_revenue_store_year = df.groupBy("Store").agg(sum("Revenue").alias("Total_Revenue_Store_Year"))

# 5. The store that achieves the best performance in each month

In [None]:
max_revenue_monthly = df.groupBy("Month").agg(max("Revenue").alias("Max_Revenue"))

In [None]:
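# Join the monthly maxima back to the full data to find which store(s) reached them.
# Ties are kept, so a month can yield more than one row.
best_performance_store_monthly = max_revenue_monthly.alias("max_rev").join(
    df.alias("data"),
    (col("data.Revenue") == col("max_rev.Max_Revenue")) & (col("data.Month") == col("max_rev.Month")),
    "inner"
).select(col("data.Month"), col("data.Store"), col("data.Revenue")).distinct()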
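The join above keeps every store that matches the monthly maximum, so ties produce several rows for one month. The plain-Python sketch below reproduces that behavior on four rows copied from the `df.show()` output earlier (only `anger` and `lyon`, so it is not the full result); `best_stores_per_month` is a hypothetical helper for checking the logic, not the notebook's method.

```python
def best_stores_per_month(rows):
    """Map each month to (max revenue, [stores reaching it]) from (month, revenue, store) tuples."""
    best = {}
    for month, revenue, store in rows:
        top_revenue, stores = best.get(month, (float("-inf"), []))
        if revenue > top_revenue:
            best[month] = (revenue, [store])   # new maximum for this month
        elif revenue == top_revenue:
            stores.append(store)               # tie: keep both stores
    return best

# Rows taken from the df.show() output above (anger and lyon only).
rows = [
    ("JAN", 13.0, "anger"), ("JAN", 13.0, "lyon"),
    ("AUG", 15.0, "anger"), ("AUG", 25.0, "lyon"),
]
print(best_stores_per_month(rows))
```

On this subset, `JAN` is a tie between `anger` and `lyon`, while `AUG` goes to `lyon` alone.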

# Results

In [None]:
average_monthly_income_france.show()
average_monthly_income_city.show()
total_revenue_city_year.show()
total_revenue_store_year.show()
best_performance_store_monthly.show()


+-----+-----------------------------+
|Month|Average_Monthly_Income_France|
+-----+-----------------------------+
|  APR|            20.23076923076923|
|  OCT|            26.53846153846154|
|  NOV|            24.53846153846154|
|  FEB|           19.153846153846153|
|  SEP|            25.53846153846154|
|  JAN|            20.76923076923077|
|  AUG|           23.076923076923077|
|  MAR|            17.53846153846154|
|  DEC|                         29.0|
|  JUN|           27.846153846153847|
|  JUL|           21.692307692307693|
|  MAY|            22.46153846153846|
+-----+-----------------------------+

+-----+-----+---------------------------+
|Store|Month|Average_Monthly_Income_City|
+-----+-----+---------------------------+
|anger|  JAN|                       13.0|
|anger|  MAY|                       12.0|
|anger|  AUG|                       15.0|
|anger|  JUL|                       19.0|
|anger|  FEB|                       12.0|
|anger|  MAR|                       14.0|
|anger|  NOV|

In [None]:
spark.stop()