<img src="http://naonedia.fr/wp-content/uploads/2018/10/logoNaonedia.png" style="height:100px">
<h1>
    <center>Housing Experiment</center>
    <center>-</center>
    <center>Data Preparation</center>
</h1>


<h3> Introduction </h3>

In this final notebook, we finalize our dataset; by the end, we will have a dataset that is ready to use.
It can then be fed to a neural network.


<h3>Information</h3>

Using Python 3.x is strongly recommended.
If any of the required libraries are missing, install them with the following command.
<br>
<code>pip3 install ....</code>

<h3>Authors:</h3>
<cite>Thibault Brocherieux - Ippon Technologies</cite>

## Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
from os import path
import json
from shapely.geometry import shape, Point

import plotly.offline as py
import plotly.figure_factory as ff
import plotly.graph_objs as go

## Loading the dataset to finalize

In [2]:
df = pd.read_csv('data/datasets/origin_enriched_v3_lonlat.csv', delimiter=',', header=0)
df.head()

Unnamed: 0,date_mutation,valeur_fonciere,code_postal,nom_commune,surface_carrez,type_local,surface_reelle_bati,nombre_pieces_principales,surface_terrain,mois,...,0-3-taux_epargne,0-3-taux_epargne_financier,0-4-rdb,0-4-rdb_uc,0-4-pa_rdb,0-4-pa_rdb_uc,0-4-taux_epargne,0-4-taux_epargne_financier,longitude,latitude
0,2014-01-02,107000.0,44800,ST-HERBLAIN,45.8,Appartement,46.0,2,46.0,1,...,177111.631901,1320.383769,149.129082,133.560584,122.508389,109.783356,201021.702207,1371.878736,-1.644254,47.207462
1,2014-01-02,111500.0,44000,NANTES,44.31,Appartement,17.0,1,17.0,1,...,177111.631901,1320.383769,149.129082,133.560584,122.508389,109.783356,201021.702207,1371.878736,-1.548747,47.220961
2,2014-01-02,111500.0,44000,NANTES,44.31,Appartement,45.0,2,45.0,1,...,177111.631901,1320.383769,149.129082,133.560584,122.508389,109.783356,201021.702207,1371.878736,-1.548747,47.220961
3,2014-01-02,130000.0,44300,NANTES,49.76,Appartement,50.0,2,50.0,1,...,177111.631901,1320.383769,149.129082,133.560584,122.508389,109.783356,201021.702207,1371.878736,-1.566879,47.234022
4,2014-01-02,190000.0,44300,NANTES,63.31,Appartement,62.0,3,62.0,1,...,177111.631901,1320.383769,149.129082,133.560584,122.508389,109.783356,201021.702207,1371.878736,-1.54512,47.251066


We start by checking whether our dataset has any missing data.
The code below returns the names of the columns that contain one or more missing values.

In [3]:
[i for i in df.columns if df[i].isnull().any()]

[]
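When some columns do contain gaps, a per-column count is often more useful than the bare list of names. The sketch below illustrates this on a small made-up frame; `toy` and its values are assumptions for illustration only and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "valeur_fonciere": [107000.0, np.nan, 130000.0],
    "surface_reelle_bati": [46.0, 17.0, np.nan],
    "nom_commune": ["NANTES", "REZE", "NANTES"],
})

# Count missing values per column, then keep only the columns that have some
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

An empty result here corresponds to the empty list returned by the cell above.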

In our case, there are no missing values. We now look at the data types.

In [4]:
list(set(df.dtypes))

[dtype('float64'), dtype('int64'), dtype('O')]

### Removing duplicated geolocations and outliers

We set a minimum price of €1,600 per m².  
We keep properties priced strictly between €50,000 and €1,000,000.  
We keep only properties larger than 25 m².  
We remove duplicated properties (same longitude, latitude, sale date, and surface area); every copy of a duplicated property is discarded.


In [5]:
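# Keep rows above 1600 EUR/m², priced between 50k and 1M EUR, and larger than 25 m²
df = df.loc[(df['valeur_fonciere'] > df['surface_reelle_bati'] * 1600) & (df['valeur_fonciere'] > 50000) & (df['valeur_fonciere'] < 1000000) & (df['surface_reelle_bati'] > 25)]
# Collect every row that shares its location, sale date and surface with another row
dfToDrop = df[df.duplicated(subset=['longitude','latitude','date_mutation','surface_reelle_bati'], keep=False)]
# Appending dfToDrop twice makes those rows appear three times, so keep=False removes them all
df = pd.concat([df, dfToDrop, dfToDrop]).drop_duplicates(keep=False)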
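To make the effect of these rules easier to follow, here is a minimal sketch on a made-up five-row frame. The data and the names `toy`, `filtered`, `cleaned` and `concat_way` are assumptions for illustration. It also suggests that, on this toy data, calling `drop_duplicates` with `subset` and `keep=False` gives the same rows as the concatenation approach used above.

```python
import pandas as pd

# Toy data: rows 1 and 2 share the same location, sale date and surface
toy = pd.DataFrame({
    "valeur_fonciere": [107000.0, 200000.0, 210000.0, 40000.0, 130000.0],
    "surface_reelle_bati": [46.0, 60.0, 60.0, 30.0, 50.0],
    "longitude": [-1.64, -1.55, -1.55, -1.50, -1.57],
    "latitude": [47.21, 47.22, 47.22, 47.23, 47.23],
    "date_mutation": ["2014-01-02"] * 5,
})

price = toy["valeur_fonciere"]
surface = toy["surface_reelle_bati"]

# Same thresholds as in the text above
mask = (price > surface * 1600) & (price > 50000) & (price < 1000000) & (surface > 25)
filtered = toy.loc[mask]

keys = ["longitude", "latitude", "date_mutation", "surface_reelle_bati"]

# keep=False discards every member of a duplicated group, not only the extra copies
cleaned = filtered.drop_duplicates(subset=keys, keep=False)

# The concatenation approach, reproduced for comparison
to_drop = filtered[filtered.duplicated(subset=keys, keep=False)]
concat_way = pd.concat([filtered, to_drop, to_drop]).drop_duplicates(keep=False)

print(cleaned.index.tolist())
print(concat_way.index.tolist())
```

Row 3 fails the price thresholds, and rows 1 and 2 are dropped together because they describe the same sale.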

In [6]:
df

Unnamed: 0,date_mutation,valeur_fonciere,code_postal,nom_commune,surface_carrez,type_local,surface_reelle_bati,nombre_pieces_principales,surface_terrain,mois,...,0-3-taux_epargne,0-3-taux_epargne_financier,0-4-rdb,0-4-rdb_uc,0-4-pa_rdb,0-4-pa_rdb_uc,0-4-taux_epargne,0-4-taux_epargne_financier,longitude,latitude
0,2014-01-02,107000.00,44800,ST-HERBLAIN,45.80,Appartement,46.0,2,46.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.644254,47.207462
2,2014-01-02,111500.00,44000,NANTES,44.31,Appartement,45.0,2,45.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.548747,47.220961
3,2014-01-02,130000.00,44300,NANTES,49.76,Appartement,50.0,2,50.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.566879,47.234022
4,2014-01-02,190000.00,44300,NANTES,63.31,Appartement,62.0,3,62.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.545120,47.251066
5,2014-01-02,194400.00,44100,NANTES,84.00,Appartement,84.0,4,84.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.603261,47.209692
6,2014-01-02,335000.00,44300,NANTES,118.00,Maison,118.0,5,562.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.506824,47.231122
8,2014-01-03,84950.00,44400,REZE,50.00,Appartement,50.0,2,50.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.547729,47.189657
9,2014-01-03,94500.00,44000,NANTES,39.02,Appartement,39.0,2,39.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.550883,47.224574
10,2014-01-03,96500.00,44300,NANTES,33.00,Maison,33.0,1,464.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.542793,47.254146
11,2014-01-03,96500.00,44300,NANTES,47.00,Maison,47.0,2,464.0,1,...,1.771116e+05,1320.383769,149.129082,133.560584,122.508389,109.783356,2.010217e+05,1371.878736,-1.542793,47.254146


In [7]:
# load GeoJSON file containing sectors
with open('./data/maps/nantes.geojson') as f:
    nantes_metropolis_geojson = json.load(f)


def checkNameAndPostalCode(longitude, latitude, postal_code, town_name):
    # construct point based on lon/lat returned by geocoder
    point = Point(longitude, latitude)

    # check each polygon to see if it contains the point
    for feature in nantes_metropolis_geojson['features']:
        polygon = shape(feature['geometry'])
        if polygon.contains(point):
            if feature['properties']['code_postal'] == postal_code:
                if feature['properties']['nom'] == town_name:
                    return (True, "OK")
                else:
                    return (False, "coordinate != town_name")
            else:
                return (False, "coordinate != postal_code")
    
    return (False, "not found", longitude, latitude, postal_code, town_name)

def checkDatasetLocation(dataset):
    ok = 0
    error = 0
    not_found_error = 0
    postal_code_error = 0
    town_name_error = 0

    town_list = []
    postal_list = []
    res = []  # details of rows whose coordinates match no polygon
    for i, row in dataset.iterrows():
        response = checkNameAndPostalCode(row['longitude'], row['latitude'], row['code_postal'], row['nom_commune'])
        
        if not response[0]:
            if response[1] == "coordinate != town_name":
                town_name_error += 1
                town_list.append(i)
            elif response[1] == "coordinate != postal_code":
                postal_code_error += 1
                postal_list.append(i)
            else:
                not_found_error += 1
                res.append(response[2:])

            error += 1
        else:
            ok += 1
    
    print("Town error : {}".format(town_name_error))
    print("Postal error : {}".format(postal_code_error))
    print("Not found error : {}".format(not_found_error))
    print("Total error : {}".format(error))
    print("Ok : {}".format(ok))
    
    return town_list, postal_list


def fixNameAndPostalCode(dataset):
    df = dataset.copy()

    for i, row in dataset.iterrows():
        # construct point based on lon/lat returned by geocoder
        point = Point(row['longitude'], row['latitude'])
        
        for feature in nantes_metropolis_geojson['features']:
            polygon = shape(feature['geometry'])
            if polygon.contains(point):
                df.loc[i,'code_postal'] = feature['properties']['code_postal']
                df.loc[i,'nom_commune'] = feature['properties']['nom']
    return df
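The functions above all rely on the same point-in-polygon lookup. The sketch below isolates that step on a hypothetical GeoJSON made of two unit squares: the property names `nom` and `code_postal` mirror the ones used above, while the geometries, the values and the helper `locate` are made up for illustration. It also builds each polygon once up front, which is one possible way to avoid calling `shape` again for every row.

```python
from shapely.geometry import Point, shape

# Hypothetical GeoJSON with two square sectors (made-up geometries and values)
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"nom": "WEST", "code_postal": 44100},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
            },
        },
        {
            "type": "Feature",
            "properties": {"nom": "EAST", "code_postal": 44300},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]],
            },
        },
    ],
}

# Build each polygon once instead of on every lookup
sectors = [(shape(f["geometry"]), f["properties"]) for f in toy_geojson["features"]]


def locate(longitude, latitude):
    """Return the properties of the first polygon containing the point, or None."""
    point = Point(longitude, latitude)  # shapely expects (x, y) = (longitude, latitude)
    for polygon, properties in sectors:
        if polygon.contains(point):
            return properties
    return None


print(locate(1.5, 0.5))  # inside the second square
print(locate(5.0, 5.0))  # outside both squares
print(locate(1.0, 0.5))  # on the shared edge: contains() excludes boundaries
```

Because `contains` ignores points lying exactly on a boundary, a coordinate on the border between two sectors is reported as not found.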


In [8]:
print(checkNameAndPostalCode(df.iloc[0]['longitude'], df.iloc[0]['latitude'], df.iloc[0]['code_postal'], df.iloc[0]['nom_commune']))
print(checkNameAndPostalCode(df.iloc[1]['longitude'], df.iloc[1]['latitude'], df.iloc[1]['code_postal'], df.iloc[1]['nom_commune']))

(True, 'OK')
(False, 'coordinate != town_name')


In [9]:
index_town_list, index_postal_list = checkDatasetLocation(df)

Town error : 12761
Postal error : 1237
Not found error : 0
Total error : 13998
Ok : 16104


In [10]:
df[df.index.isin(index_town_list)]['nom_commune'].value_counts()

NANTES    12761
Name: nom_commune, dtype: int64

In [11]:
df[df.index.isin(index_postal_list)]['nom_commune'].value_counts()

NANTES    1237
Name: nom_commune, dtype: int64

In [12]:
df = fixNameAndPostalCode(df)

In [13]:
index_town_list, index_postal_list = checkDatasetLocation(df)

Town error : 0
Postal error : 0
Not found error : 0
Total error : 0
Ok : 30102


### Encodage des colonnes

We only have three types, so let us look more closely at type 'O' (object) by selecting all the columns concerned.

In [14]:
df.select_dtypes(exclude=['float64', 'int64'])

Unnamed: 0,date_mutation,nom_commune,type_local
0,2014-01-02,ST-HERBLAIN,Appartement
2,2014-01-02,MALAKOFF-SAINT_DONATIEN,Appartement
3,2014-01-02,BREIL-BARBERIE,Appartement
4,2014-01-02,NANTES_NORD,Appartement
5,2014-01-02,BELLEVUE-CHANTENAY-SAINTE_ANNE,Appartement
6,2014-01-02,DOULON-BOTTIERE,Maison
8,2014-01-03,REZE,Appartement
9,2014-01-03,MALAKOFF-SAINT_DONATIEN,Appartement
10,2014-01-03,NANTES_NORD,Maison
11,2014-01-03,NANTES_NORD,Maison


We do not need the `date_mutation` column, since the year and month have already been extracted into other columns. We therefore only need to encode the `nom_commune` and `type_local` columns. Looking back at the first preview, we can also spot the `code_postal` column: although it is stored as an integer, it is a categorical label and must be encoded as well.

In [15]:
# Dropping the `date_mutation` column
df.drop('date_mutation', axis = 1, inplace=True)

# One hot encoding `nom_commune` column
one_hot = pd.get_dummies(df['nom_commune'])
df.drop('nom_commune', axis = 1, inplace=True)
df = df.join(one_hot)

# One hot encoding `type_local` column
one_hot = pd.get_dummies(df['type_local'])
df.drop('type_local', axis = 1, inplace=True)
df = df.join(one_hot)

# One hot encoding `code_postal` column
one_hot = pd.get_dummies(df['code_postal'])
df.drop('code_postal', axis = 1, inplace=True)
df = df.join(one_hot)
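As a compact illustration of the encoding step, the sketch below one-hot encodes two columns of a made-up frame in a single `pd.get_dummies` call. The frame `toy` and its values are assumptions. Note that, unlike the cell above, passing `columns=` prefixes each new column with the source column name, so the resulting names differ from those shown in this notebook's output.

```python
import pandas as pd

# Toy frame with one integer-coded category and one string category
toy = pd.DataFrame({
    "valeur_fonciere": [107000.0, 111500.0, 335000.0],
    "code_postal": [44800, 44000, 44000],
    "type_local": ["Appartement", "Appartement", "Maison"],
})

# Listing code_postal explicitly matters: integer columns are not encoded by default
encoded = pd.get_dummies(toy, columns=["code_postal", "type_local"], dtype=int)

print(encoded.columns.tolist())
print(encoded["code_postal_44000"].tolist())
```

The prefix keeps postal-code columns distinguishable from other numeric column names, at the cost of longer headers.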

In [16]:
df.head()

Unnamed: 0,valeur_fonciere,surface_carrez,surface_reelle_bati,nombre_pieces_principales,surface_terrain,mois,annee,mois_sin,mois_cos,foot/vie_courante/townhall,...,44620,44640,44700,44710,44800,44830,44840,44860,44880,44980
0,107000.0,45.8,46.0,2,46.0,1,2014,0.5,0.866025,1.0,...,0,0,0,0,1,0,0,0,0,0
2,111500.0,44.31,45.0,2,45.0,1,2014,0.5,0.866025,2.0,...,0,0,0,0,0,0,0,0,0,0
3,130000.0,49.76,50.0,2,50.0,1,2014,0.5,0.866025,0.0,...,0,0,0,0,0,0,0,0,0,0
4,190000.0,63.31,62.0,3,62.0,1,2014,0.5,0.866025,0.0,...,0,0,0,0,0,0,0,0,0,0
5,194400.0,84.0,84.0,4,84.0,1,2014,0.5,0.866025,1.0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
df.to_csv('data/datasets/dataset_ready_v2_lonlat.csv', index=False)

In [20]:
df.drop(['longitude','latitude','mois'], axis=1).to_csv('data/datasets/dataset_ready_v2.csv', index=False)
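The `index=False` argument used when saving is worth a short illustration. The sketch below round-trips a made-up frame through an in-memory CSV (using `io.StringIO` instead of a file, an assumption made so the example stays self-contained) and compares the columns read back with and without the index.

```python
import io

import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({"valeur_fonciere": [107000.0, 111500.0], "mois": [1, 1]})

# With the index written out, reading back yields an extra "Unnamed: 0" column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# With index=False, the columns survive the round trip unchanged
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Saving without the index avoids carrying a meaningless row counter into the training features.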