# MongoDB

In this lab we will use the pymongo library to create a MongoDB database from Python and then query it.

### Data
Here is some data in CSV format. It is a subset of a dataset recording crimes in the United States between 1984 and 2014.

https://drive.google.com/file/d/10z7kUXDO4BHffJ6ZfVc42CgIs5558vGd/view?usp=sharing

### Creating the database

In [1]:
from pymongo import MongoClient
import json
import streamlit as st
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Write a Python function that loads this CSV into a MongoDB database. Each row must become one document.
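One possible shape for such a function is sketched below. It parses the CSV with the standard `csv` module and passes the rows to `insert_many` in batches. The names `insert_in_batches` and `load_csv_into_collection` and the `batch_size` parameter are illustrative assumptions, and `collection` is expected to be a pymongo collection such as the `table` connector created further down.

```python
import csv


def insert_in_batches(collection, lines, batch_size=1000):
    """Insert one document per CSV row and return the number inserted.

    `lines` is any iterable of CSV lines (an open file works), and
    `collection` is assumed to expose pymongo's `insert_many(list_of_dicts)`.
    """
    inserted, batch = 0, []
    for row in csv.DictReader(lines):
        # DictReader yields every value as a string, keyed by the header names.
        batch.append(dict(row))
        if len(batch) == batch_size:
            collection.insert_many(batch)
            inserted += len(batch)
            batch = []
    if batch:  # flush the last, partially filled batch
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


def load_csv_into_collection(collection, path, batch_size=1000):
    """Open the CSV file at `path` and insert its rows into `collection`."""
    with open(path, newline="", encoding="utf-8") as handle:
        return insert_in_batches(collection, handle, batch_size)
```

Batching keeps memory use bounded on large files. For a small file, a single `insert_many` call on the full list would work just as well.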

### Querying the database

Create a connector for your database

In [8]:
client = MongoClient(
    'mongo',
    port=27017,
    username='root',
    password='root',
    authMechanism='SCRAM-SHA-256'
)
db = client.spotify
table = db.documents

In [3]:
# Read the CSV; inferSchema is needed to get the numeric column types listed below
df = spark.read.format('csv').options(header=True, inferSchema=True).load('data/Spotify/genres_v2.csv')
df.dtypes

[('danceability', 'double'),
 ('energy', 'double'),
 ('key', 'int'),
 ('loudness', 'double'),
 ('mode', 'int'),
 ('speechiness', 'double'),
 ('acousticness', 'double'),
 ('instrumentalness', 'double'),
 ('liveness', 'double'),
 ('valence', 'double'),
 ('tempo', 'double'),
 ('type', 'string'),
 ('id', 'string'),
 ('uri', 'string'),
 ('track_href', 'string'),
 ('analysis_url', 'string'),
 ('duration_ms', 'int'),
 ('time_signature', 'int'),
 ('genre', 'string'),
 ('song_name', 'string'),
 ('Unnamed: 0', 'string'),
 ('title', 'string')]

In [20]:
# Write to HDFS
df.write.csv("hdfs://hadoop-master:9000/spotify.csv")

In [None]:
# Read from HDFS

In [15]:
# Insert everything in one go, only if the collection is still empty
if table.count_documents({}) == 0:
    records = df.toPandas().to_dict(orient='records')
    table.insert_many(records)

df.toPandas()

[{'danceability': '0.831',
  'energy': '0.8140000000000001',
  'key': '2',
  'loudness': '-7.364',
  'mode': '1',
  'speechiness': '0.42',
  'acousticness': '0.0598',
  'instrumentalness': '0.0134',
  'liveness': '0.0556',
  'valence': '0.389',
  'tempo': '156.985',
  'type': 'audio_features',
  'id': '2Vc6NJ9PW9gD9q343XFRKx',
  'uri': 'spotify:track:2Vc6NJ9PW9gD9q343XFRKx',
  'track_href': 'https://api.spotify.com/v1/tracks/2Vc6NJ9PW9gD9q343XFRKx',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/2Vc6NJ9PW9gD9q343XFRKx',
  'duration_ms': '124539',
  'time_signature': '4',
  'genre': 'Dark Trap',
  'song_name': 'Mercury: Retrograde',
  'Unnamed: 0': None,
  'title': None,
  '_id': ObjectId('601baf55e8ee705b7805f13c')},
 {'danceability': '0.7190000000000001',
  'energy': '0.493',
  'key': '8',
  'loudness': '-7.23',
  'mode': '1',
  'speechiness': '0.0794',
  'acousticness': '0.401',
  'instrumentalness': '0.0',
  'liveness': '0.11800000000000001',
  'valence': '0.124',
  't

###### Correlation test for danceability

In [22]:
{
    'energy': df.corr('danceability','energy'),
    'loudness': df.corr('danceability','loudness'),
    'valence': df.corr('danceability','valence'),
    'tempo': df.corr('danceability','tempo'),
    'liveness': df.corr('danceability','liveness'),
    'instrumentalness': df.corr('danceability','instrumentalness'),
    'speechiness': df.corr('danceability','speechiness'),
    'acousticness': df.corr('danceability','acousticness'),
}

{'energy': -0.32324758938982884,
 'loudness': -0.21677558073253783,
 'valence': 0.36984469824063193,
 'tempo': -0.16592868259426302,
 'liveness': -0.1967022785784031,
 'instrumentalness': -0.06711359015412229,
 'speechiness': 0.18217692430188007,
 'acousticness': 0.06990976582404693}
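`DataFrame.corr` in PySpark computes the Pearson correlation coefficient, so each value lies between -1 and 1, and values close to 0 indicate little linear relationship. As a reminder of what that number measures, here is a minimal pure-Python sketch. The function name `pearson` and the sample lists are illustrative and are not taken from the lab data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if one of the sequences is constant.
    return cov / sqrt(var_x * var_y)


# A perfectly increasing linear relationship, then a perfectly decreasing one.
up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
down = pearson([1, 2, 3], [3, 2, 1])
```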

###### Correlation test for valence

In [23]:
{
    'energy': df.corr('valence','energy'),
    'loudness': df.corr('valence','loudness'),
    'tempo': df.corr('valence','tempo'),
    'liveness': df.corr('valence','liveness'),
    'instrumentalness': df.corr('valence','instrumentalness'),
    'speechiness': df.corr('valence','speechiness'),
    'acousticness': df.corr('valence','acousticness'),
}

{'energy': -0.013519692624507104,
 'loudness': 0.08091618276919113,
 'tempo': 0.05837414725462354,
 'liveness': -0.025155613532274863,
 'instrumentalness': -0.2570680360340621,
 'speechiness': 0.21882903805612033,
 'acousticness': 0.0993276847644048}

##### Which weapons are used by the criminals?
Return the unique values of the `Weapon` key across all documents.

In [16]:
table.distinct("Weapon")

list(table.aggregate([
    { "$group": {
        "_id": "$Weapon", "count" : {"$sum": 1}
    }},
    { "$sort": {"count": -1}}
]))

[{'_id': 'Handgun', 'count': 45748},
 {'_id': 'Knife', 'count': 19586},
 {'_id': 'Blunt Object', 'count': 11150},
 {'_id': 'Shotgun', 'count': 7419},
 {'_id': 'Rifle', 'count': 4968},
 {'_id': 'Firearm', 'count': 3629},
 {'_id': 'Unknown', 'count': 3134},
 {'_id': 'Strangulation', 'count': 1742},
 {'_id': 'Fire', 'count': 1254},
 {'_id': 'Suffocation', 'count': 600},
 {'_id': 'Drowning', 'count': 265},
 {'_id': 'Gun', 'count': 212},
 {'_id': 'Drugs', 'count': 101},
 {'_id': 'Poison', 'count': 80},
 {'_id': 'Explosives', 'count': 61},
 {'_id': 'Fall', 'count': 50}]
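The pipeline above groups the documents by `Weapon`, adds 1 per document, and sorts by descending count. For readers new to the aggregation framework, the same logic can be sketched in plain Python on a list of dicts. The helper `count_by_key` and the sample documents are made up for illustration.

```python
from collections import Counter


def count_by_key(documents, key):
    """Mimic {"$group": {"_id": "$key", "count": {"$sum": 1}}} then a descending sort."""
    # A missing field is grouped under None, much like MongoDB groups it under null.
    counts = Counter(doc.get(key) for doc in documents)
    # most_common() returns (value, count) pairs sorted by descending count.
    return [{"_id": value, "count": n} for value, n in counts.most_common()]


sample = [
    {"Weapon": "Knife", "State": "Texas"},
    {"Weapon": "Handgun", "State": "Alaska"},
    {"Weapon": "Knife", "State": "Ohio"},
]
result = count_by_key(sample, "Weapon")
```

On real data the aggregation should stay in MongoDB, which avoids pulling every document into Python.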

##### How many crimes were committed in 1980?

In [None]:
table.count_documents({"Year":1980})

##### How many crimes were committed by men in Texas?

In [None]:
table.count_documents({"Perpetrator Sex": "Male", "State": "Texas"})

##### How many crimes were committed by each sex in Alaska?

In [None]:
maleAndFemalePerpetratorsInAlaska = table.aggregate([
    { "$match": { "State": "Alaska" }},
    { "$group": {"_id": "$Perpetrator Sex", "crimes commis": {"$sum": 1}}},
])

list(maleAndFemalePerpetratorsInAlaska)

##### How many victims were there in each state?

In [None]:
victimInEachState = table.aggregate([
    { "$group" : {
        "_id": "$State", "victimes": {"$sum": "$Victim Count"}
    }},
    { "$sort": { "victimes": -1}}
])

list(victimInEachState)
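Note that `$sum` ignores non-numeric values, so this query only gives meaningful totals if `Victim Count` is stored as a number. When documents are built from CSV text, every value is a string unless it is converted before insertion. A minimal sketch of such a conversion follows. `cast_int_fields` is a hypothetical helper, and the default field names are the ones used in the queries of this lab.

```python
def cast_int_fields(document, int_fields=("Year", "Victim Count")):
    """Return a copy of `document` with the listed string fields converted to int."""
    converted = dict(document)
    for field in int_fields:
        value = converted.get(field)
        if isinstance(value, str):
            try:
                converted[field] = int(value)
            except ValueError:
                pass  # leave values such as "" or "Unknown" untouched
    return converted


row = {"State": "Texas", "Year": "1980", "Victim Count": "2"}
clean = cast_int_fields(row)
```

The same conversion also matters for filters such as `{"Year": 1980}`, which match the integer 1980 but not the string `"1980"`.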

### Bonus

Install the streamlit library.
Create a dashboard that lets the user select a year and returns a bar plot of the number of crimes committed with each weapon.
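A possible starting point for the dashboard is sketched below, under several assumptions: the function names `crimes_per_weapon_pipeline` and `render_dashboard` are invented for this example, `table` is the pymongo collection holding the crime documents, and `Year` is stored as a number. The sketch counts documents per weapon, which is one way to read "number of crimes".

```python
def crimes_per_weapon_pipeline(year):
    """Aggregation pipeline counting the documents per weapon for one year."""
    return [
        {"$match": {"Year": year}},
        {"$group": {"_id": "$Weapon", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]


def render_dashboard(st, table):
    """Draw a year selector and the matching bar chart.

    `st` is the imported streamlit module and `table` a pymongo collection.
    """
    import pandas as pd  # imported lazily; streamlit itself depends on pandas

    years = sorted(table.distinct("Year"))
    year = st.selectbox("Year", years)
    rows = list(table.aggregate(crimes_per_weapon_pipeline(year)))
    if not rows:
        st.write("No crimes recorded for this year.")
        return
    # One bar per weapon: the index holds the weapon, the column the count.
    st.bar_chart(pd.DataFrame(rows).set_index("_id"))


# In a script launched with `streamlit run dashboard.py` (file name assumed):
#   import streamlit as st
#   render_dashboard(st, table)   # `table` built with MongoClient as in the lab
```

Passing `st` and `table` as arguments keeps the pipeline helper importable and testable without a running MongoDB server or Streamlit session.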

In [None]:
victimWithEachWeapon = table.aggregate([
    {"$match": {"Year": 1980}},
    { "$group" : {
        "_id": "$Weapon", "victimes": {"$sum": "$Victim Count"}
    }}
])
    
list(victimWithEachWeapon)

In [None]:
# Quiz question 2: number of victims per victim age, most frequent first
victimWithEachAge = table.aggregate([
    { "$group" : {
        "_id": "$Victim Age", "victimes": {"$sum": 1}
    }},
    { "$sort" : {
        "victimes": -1
    }}
])
    
list(victimWithEachAge)



In [None]:
# Quiz question 3: victim ages for drownings, most frequent first
victimWithEachAgeByVictim = table.aggregate([
    {"$match": { "Weapon": "Drowning"}},
    { "$group" : {
        "_id": "$Victim Age", "victimes": {"$sum": 1}
    }},
    { "$sort" : {
        "victimes": -1
    }}
])
    
list(victimWithEachAgeByVictim)

In [None]:
# Quiz question 4: number of crimes with the weapon "Fall" per state
victimWithEachAgeByVictim = table.aggregate([
    {"$match": { "Weapon": "Fall"}},
    { "$group" : {
        "_id": "$State", "victimes": {"$sum": 1}
    }},
    { "$sort" : {
        "victimes": -1
    }}
])
    
list(victimWithEachAgeByVictim)

In [None]:
# Quiz question 5: average "Victim Count" per state, highest first
victimWithEachAgeByVictim = table.aggregate([
    { "$group" : {
        "_id": "$State", "Moyenne": {"$avg": "$Victim Count"}
    }},
    { "$sort" : {
        "Moyenne": -1
    }}
])
    
list(victimWithEachAgeByVictim)