# Find the best PER contract

PER (Plan d’Epargne Retraite) is a French investment account that lets you save for retirement and reduce your taxes. However, many banks and insurance companies offer a wide variety of PER contracts.

The objective of this challenge is to find the `best PER contract of 2022` on the market.

The raw data comes from https://www.francetransactions.com/per-plan-epargne-retraite-206/comparatif-per.html


## Data description


- Management fees on unit-linked funds (unités de compte) for life-insurance-based products. For other products (securities accounts, etc.), management fees charged on the contract's outstanding balance.

- Rate published by the insurers, net of management fees and net of social contributions (prélèvements sociaux).

- Net rate for savers, net of social contributions.


In [53]:
from pyspark.sql import SparkSession
import pandas as ps
from pyspark.sql.functions import col, desc
import os

In [2]:
local=True
if local:
    spark=SparkSession.builder.master("local[4]") \
                  .appName("PER_challenge")\
                  .getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("PER_challenge") \
                      .config("spark.kubernetes.container.image",os.environ['IMAGE_NAME']) \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config('spark.jars.packages','com.crealytics:spark-excel_2.12:3.1.2_0.17.1') \
                      .getOrCreate()

22/08/10 20:21:46 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.184.146 instead (on interface ens33)
22/08/10 20:21:46 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/08/10 20:21:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [17]:
file_path="../../data/per.xls"

In [18]:
pdf=ps.read_excel(file_path, sheet_name='per', index_col=[0])

*** No CODEPAGE record, no encoding_override: will use 'iso-8859-1'


In [19]:
pdf.head()

| PER | Assureur/Support | Avis sur 5 | Frais Vers. | Frais Gestion Fonds ? | Frais Gestion UC | Frais/rente | Fonds euros | Taux brut | Nombre SCPI | Nombre SCI | Nombre OPCI | Nombre ETF | Nombre UC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ABEILLE RETRAITE PLURIELLE | ABEILLE RETRAITE PROFESSIONNELLE | NaN | 5.0 | 1.0 | 1.0 | NaN | ABEILLE EURO PERP | NaN | 0 | 0 | 0 | 0 | 80 |
| AFER RETRAITE INDIVIDUELLE | ABEILLE | NaN | 3.0 | 1.0 | 1.0 | 0.0 | ABEILLE RP SECURITE RETRAITE | NaN | 0 | 0 | 0 | 0 | 80 |
| ALLIANZ PER HORIZON | ALLIANZ | NaN | 4.8 | 0.85 | 0.85 | NaN | ALLIANZ RETRAITE | NaN | 0 | 0 | 0 | 0 | 92 |
| AMBITION RETRAITE INDIVIDUELLE | LA MONDIALE | NaN | 3.9 | 0.7 | 0.7 | 0.0 | FONDS EUROS RETRAITE | NaN | 0 | 0 | 0 | 0 | 0 |
| AMPLI-PER LIBERTE | AMPLI-MUTUELLE | NaN | 0.0 | 0.5 | 0.4 | 0.0 | AMPLI PER EUROS | NaN | 2 | 0 | 0 | 3 | 4 |


In [22]:
pdf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 55 entries, ABEILLE RETRAITE PLURIELLE to YOMONI RETRAITE
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Assureur/Support       55 non-null     object 
 1   Avis sur 5             0 non-null      float64
 2   Frais Vers.            55 non-null     float64
 3   Frais Gestion Fonds ?  55 non-null     float64
 4   Frais Gestion UC       55 non-null     float64
 5   Frais/rente            27 non-null     float64
 6   Fonds euros            47 non-null     object 
 7   Taux brut              0 non-null      float64
 8   Nombre SCPI            55 non-null     int64  
 9   Nombre SCI             55 non-null     int64  
 10  Nombre OPCI            55 non-null     int64  
 11  Nombre ETF             55 non-null     int64  
 12  Nombre UC              55 non-null     int64  
dtypes: float64(6), int64(5), object(2)
memory usage: 6.0+ KB


# Convert pandas dataframe to spark dataframe

Converting a pandas DataFrame to a Spark DataFrame may raise an error when Spark cannot infer a single data type for a column because the column holds values of mixed types.

pandas produces mixed types when the data has missing values: in a text column, the present values are strings while the missing ones are stored as `NaN`, which is a float.

In this case, you just need to tell Spark explicitly which data types to use by building a schema and passing it to `createDataFrame()`.

In [27]:
df = spark.createDataFrame(pdf)
df.show()

TypeError: field Fonds euros: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>
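
The error can be reproduced in miniature without Spark. The sketch below is a simplified stand-in for schema inference, not Spark's actual implementation: it assumes a hypothetical helper `infer_column_type` that maps each Python value to a type name and refuses to merge two different names. That is the situation a text column ends up in when present values are strings and missing ones are float `NaN`. The sample values are taken from the `pdf.head()` output.

```python
def infer_column_type(values):
    """Return a single type name for a column, or raise if values disagree.

    Hypothetical helper for illustration only; Spark's real inference
    is more elaborate.
    """
    seen = set()
    for v in values:
        if isinstance(v, str):
            seen.add("StringType")
        elif isinstance(v, float):
            seen.add("DoubleType")
        else:
            seen.add(type(v).__name__)
    if len(seen) > 1:
        raise TypeError("Can not merge type " + " and ".join(sorted(seen)))
    return seen.pop()

# pandas stores a missing cell in a text column as a float NaN
fonds_euros = ["ABEILLE EURO PERP", float("nan"), "ALLIANZ RETRAITE"]
frais_vers = [5.0, 3.0, 4.8]

print(infer_column_type(frais_vers))
try:
    infer_column_type(fonds_euros)
except TypeError as exc:
    print(exc)
```

A purely numeric column passes, while the text column with a missing value fails, which is why an explicit schema is needed for `Fonds euros`.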

The functions below build an explicit schema from the pandas dtypes and use it to convert a pandas DataFrame to a Spark DataFrame.

In [48]:
from pyspark.sql.types import (
    DoubleType, FloatType, IntegerType, LongType,
    StringType, StructField, StructType, TimestampType,
)

# Map a pandas dtype to the matching Spark column type.
# Anything not listed (e.g. object columns) falls back to StringType.
def build_type(pandas_col_type):
    # Compare on the dtype's string form so that any dtype object is handled
    dtype_name = str(pandas_col_type)
    if dtype_name == 'datetime64[ns]': return TimestampType()
    elif dtype_name == 'int64': return LongType()
    elif dtype_name == 'int32': return IntegerType()
    elif dtype_name == 'float64': return DoubleType()
    elif dtype_name == 'float32': return FloatType()
    else: return StringType()

# Build the Spark column name; these rules need to be adapted for each dataset
def build_col_name(col_name: str) -> str:
    # remove surrounding "?", "." and spaces
    col_name = col_name.strip("?").strip(".").strip()
    # replace / by _
    col_name = col_name.replace("/", "_")
    # replace space by _
    col_name = col_name.replace(" ", "_")
    return col_name

# Build one nullable schema field from a pandas column name and dtype
def build_struct_field(col_name: str, pandas_col_type):
    spark_col_type = build_type(pandas_col_type)
    return StructField(build_col_name(col_name), spark_col_type, nullable=True)

# Given a pandas dataframe, return a Spark dataframe with an explicit schema.
# Relies on the `spark` session created earlier in the notebook.
def pandas_to_spark(pandas_df):
    struct_list = [
        build_struct_field(col_name, col_type)
        for col_name, col_type in zip(pandas_df.columns, pandas_df.dtypes)
    ]
    spark_schema = StructType(struct_list)
    return spark.createDataFrame(pandas_df, spark_schema)

In [49]:
df=pandas_to_spark(pdf)

In [50]:
df.show()

+--------------------+----------+----------+-------------------+----------------+-----------+--------------------+---------+-----------+----------+-----------+----------+---------+
|    Assureur_Support|Avis_sur_5|Frais_Vers|Frais_Gestion_Fonds|Frais_Gestion_UC|Frais_rente|         Fonds_euros|Taux_brut|Nombre_SCPI|Nombre_SCI|Nombre_OPCI|Nombre_ETF|Nombre_UC|
+--------------------+----------+----------+-------------------+----------------+-----------+--------------------+---------+-----------+----------+-----------+----------+---------+
|ABEILLE RETRAITE ...|       NaN|       5.0|                1.0|             1.0|        NaN|   ABEILLE EURO PERP|      NaN|          0|         0|          0|         0|       80|
|             ABEILLE|       NaN|       3.0|                1.0|             1.0|        0.0|ABEILLE RP SECURI...|      NaN|          0|         0|          0|         0|       80|
|             ALLIANZ|       NaN|       4.8|               0.85|            0.85|        NaN|  

In [51]:
df.printSchema()

root
 |-- Assureur_Support: string (nullable = true)
 |-- Avis_sur_5: double (nullable = true)
 |-- Frais_Vers: double (nullable = true)
 |-- Frais_Gestion_Fonds: double (nullable = true)
 |-- Frais_Gestion_UC: double (nullable = true)
 |-- Frais_rente: double (nullable = true)
 |-- Fonds_euros: string (nullable = true)
 |-- Taux_brut: double (nullable = true)
 |-- Nombre_SCPI: long (nullable = true)
 |-- Nombre_SCI: long (nullable = true)
 |-- Nombre_OPCI: long (nullable = true)
 |-- Nombre_ETF: long (nullable = true)
 |-- Nombre_UC: long (nullable = true)



In [52]:
print(df.columns)

['Assureur_Support', 'Avis_sur_5', 'Frais_Vers', 'Frais_Gestion_Fonds', 'Frais_Gestion_UC', 'Frais_rente', 'Fonds_euros', 'Taux_brut', 'Nombre_SCPI', 'Nombre_SCI', 'Nombre_OPCI', 'Nombre_ETF', 'Nombre_UC']
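
As a quick sanity check, the cleaning rules can be restated as a standalone function and run against the original pandas column names listed by `pdf.info()`. This is a compact restatement for illustration (the name `clean_col_name` is hypothetical), and it assumes the only characters to handle are spaces, `.`, `?` and `/`, as in this dataset.

```python
def clean_col_name(col_name: str) -> str:
    # Trim surrounding spaces, dots and question marks, then
    # replace "/" and inner spaces with underscores.
    return col_name.strip(" .?").replace("/", "_").replace(" ", "_")

raw_names = ["Assureur/Support", "Avis sur 5", "Frais Vers.",
             "Frais Gestion Fonds ?", "Frais/rente", "Fonds euros"]
cleaned = [clean_col_name(name) for name in raw_names]
print(cleaned)
```

The resulting names contain no spaces or punctuation, so they can be used directly in `col(...)` expressions and SQL.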


# Column definitions
After the cleaning, we have the following columns:

- Assureur_Support: Name of the organization behind the contract.
- Avis_sur_5: Rating given by users (1 to 5).
- Frais_Vers: Maximum fee on deposits (frais sur versements). Depending on the intermediary, savers may be offered significant reductions, up to a complete waiver of these fees (0%). This is what you pay each time you transfer money into your PER account: with a 5% fee, a 100 euro deposit leaves only 95 euros invested. With a 2% gain per year, it takes a long time just to recover your capital. **So I highly recommend avoiding all contracts with a frais de versement above 2%.**
- Frais_Gestion_Fonds: Management fees on the euro fund (fonds euros) for life-insurance-based products. They are charged every year on your investment.
     For example, if the fee is 1% and you have 100 euros in your PER account, you lose 1 euro each year.
- Frais_Gestion_UC: Management fees on unit-linked funds (unités de compte), or account management fees for other products (securities accounts, etc.). Charged every year, same as above.
- Frais_rente: Fee charged once you retire, when the insurance company pays you a small amount each month. For example, if the monthly payment is 100 euros and frais_rente is 3%, you only receive 97 euros.
- Fonds_euros: The euro fund in which the company invests your money.
- Taux_brut: The return on your invested capital. Brut (gross) means you must subtract the management fees to get the real gain.
- Nombre_SCPI: Société Civile de Placement Immobilier, a structure that lets you invest in real estate. This is the number of different SCPI that the contract lets you choose from.
- Nombre_SCI: Société Civile Immobilière. Same as above.
- Nombre_OPCI: Organisme de Placement Collectif Immobilier. Same as above.
- Nombre_ETF: Exchange-traded fund. Same as above.
- Nombre_UC: Unités de compte (unit-linked funds). Same as above.
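
The fee arithmetic in the list above can be checked with a few lines of plain Python. This is a sketch under simplifying assumptions: a single deposit, a constant annual gain, and no other fees. The function names are illustrative, and the 5%, 2% and 3% figures are the ones used in the examples above.

```python
import math

def invested_after_entry_fee(deposit: float, frais_vers_pct: float) -> float:
    """Amount actually invested once the entry fee is taken."""
    return deposit * (1 - frais_vers_pct / 100)

def years_to_recover(frais_vers_pct: float, annual_gain_pct: float) -> float:
    """Years of compounding needed to grow back to the original deposit."""
    return -math.log(1 - frais_vers_pct / 100) / math.log(1 + annual_gain_pct / 100)

def monthly_annuity_received(payment: float, frais_rente_pct: float) -> float:
    """Monthly payment left after the annuity fee."""
    return payment * (1 - frais_rente_pct / 100)

print(round(invested_after_entry_fee(100, 5), 2))   # 95.0
print(round(years_to_recover(5, 2), 2))             # 2.59
print(round(monthly_annuity_received(100, 3), 2))   # 97.0
```

Under these assumptions, a 5% entry fee costs roughly two and a half years of a 2% annual gain before the account is back to the amount deposited.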

So our objective is to find a PER contract with low fees and a high rate of return.

First, let's establish a baseline.

In [69]:
target_col=["Assureur_Support","Frais_Vers","Frais_Gestion_Fonds","Frais_Gestion_UC","Frais_rente","Taux_brut"]

feature_col=["Frais_Vers","Frais_Gestion_Fonds","Frais_Gestion_UC","Frais_rente","Taux_brut"]

In [70]:
df_target_col=df.select(target_col)
df_target_col.select(feature_col).summary().show()

+-------+-----------------+-------------------+-------------------+-----------+---------+
|summary|       Frais_Vers|Frais_Gestion_Fonds|   Frais_Gestion_UC|Frais_rente|Taux_brut|
+-------+-----------------+-------------------+-------------------+-----------+---------+
|  count|               55|                 55|                 55|         55|       55|
|   mean|2.067272727272727| 0.8170909090909091| 0.7503636363636363|        NaN|      NaN|
| stddev|1.932955379287504| 0.3530642199860295|0.24142297012722402|        NaN|      NaN|
|    min|              0.0|                0.0|                0.0|        0.0|      NaN|
|    25%|              0.0|               0.65|                0.6|        0.8|      NaN|
|    50%|              2.5|                0.8|               0.84|        NaN|      NaN|
|    75%|              3.9|                0.9|               0.96|        2.0|      NaN|
|    max|              5.0|                2.0|                1.2|        NaN|      NaN|
+-------+-
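
The `NaN` entries in the `mean` row, and the count of 55 for `Taux_brut` even though pandas reported 0 non-null values, suggest that missing values reached Spark as float `NaN` rather than as nulls, so they are counted and propagate through the arithmetic. The effect is easy to see in plain Python, outside Spark. The sample values below are made up for illustration and are not taken from the dataset.

```python
import math

# Hypothetical fee column in which one value is missing and stored as NaN
frais_rente = [0.0, 0.8, float("nan"), 2.0]

# NaN is counted like any other value and turns the mean into NaN
naive_count = len(frais_rente)
naive_mean = sum(frais_rente) / naive_count
print(naive_count, naive_mean)

# Dropping NaN first gives a usable count and mean
present = [v for v in frais_rente if not math.isnan(v)]
clean_mean = sum(present) / len(present)
print(len(present), round(clean_mean, 3))
```

In Spark, one possible way to get the same effect would be to turn `NaN` into null before calling `summary()`, for instance with `pyspark.sql.functions.isnan` inside a `when` expression.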

In [61]:

df.orderBy(col("Frais_Vers"),col("Frais_Gestion_Fonds"),col("Frais_Gestion_UC"),col("Frais_rente"),col("Taux_brut").desc()).select(target_col).show(20)

+--------------------+----------+-------------------+----------------+-----------+---------+
|    Assureur_Support|Frais_Vers|Frais_Gestion_Fonds|Frais_Gestion_UC|Frais_rente|Taux_brut|
+--------------------+----------+-------------------+----------------+-----------+---------+
|     CREDIT AGRICOLE|       0.0|                0.0|             0.3|        0.8|      NaN|
|     CREDIT AGRICOLE|       0.0|                0.0|             0.4|        0.8|      NaN|
|          ORADEA VIE|       0.0|                0.5|            0.22|        0.0|      NaN|
|      AMPLI-MUTUELLE|       0.0|                0.5|             0.4|        0.0|      NaN|
|             ABEILLE|       0.0|                0.6|             0.6|        0.0|      NaN|
|                 MIF|       0.0|                0.6|             0.6|        0.6|      NaN|
|SWISSLIFE ASSURAN...|       0.0|                0.6|             0.6|        NaN|      NaN|
|           SWISSLIFE|       0.0|               0.65|            0.84|