# Big Data Modeling and Management Assignment
## EU Procurements Explorer Dashboard Competition

For the final project, we will continue to work with the European public procurement notices database to analyse contracts and spending within the European Union.

This time, the goal is to feed data to a dashboard where we can explore the contracts in three different ways: per procurement code, per country, and per company.

#### Problem description
1. Explore the code in the zip file shared with this project description.
2. Spin up the dashboard and check that there are no errors.
3. Open `queries.py` and replace the exercises with actual queries, following the example in `ex0_cpv_example`. Check that the dashboard charts are now working.
4. Run the `Performance test` on the front page of the dashboard and try to optimize the speed of your queries with the material taught in class, such as indexes and data modelling.
5. Confirm that your queries are fast and the dashboard is working.
6. Submit `queries.py` on Moodle.

#### Connection details to the MongoDB database
Each group should have received by email the credentials to connect to the group's Mongo database. They are the same ones used for homework 2.
```
Connection example: mongodb://username:password@host:port
```
These credentials should be added to the file `DB.py`, in the `backend` folder of the project.
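
If a password contains characters such as `@` or `/`, pasting it directly into the connection string can break the URI. The sketch below shows one way to percent-escape the credentials first. The helper name `build_mongo_uri` and the sample values are hypothetical and are not part of the project files.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, port, protocol="mongodb"):
    """Build a connection string with percent-escaped credentials.

    Hypothetical helper: the project only shows the plain
    mongodb://username:password@host:port form.
    """
    return f"{protocol}://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}"


# Sample values only; use the credentials your group received by email.
print(build_mongo_uri("group01", "p@ss/word", "example.host", 27017))
```

The resulting string can be passed to `MongoClient` in the same way as the plain form.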

#### Project structure

* apps - All the Dash dashboard code
* assets - Web assets for the dashboard
* backend 
    * `DB.py` - File with the database connection; change it to your group's database connection
    * `queries.py` - File where the queries will go (the one to be submitted)
    * `performance_evaluation.py` - Code used to run the performance evaluation
* `index.py` and `app.py` - Basic Dash files. To start the app, run `python3 index.py` (or `docker-compose up` if you are familiar with Docker)
* `test_insert_document.json` - Example document to insert from the dashboard to measure the time taken

### Questions
#### Procurement codes (CPV)
1. 5 descriptive metrics of the contracts related to the CPV, the average of:
    1. Each CPV’s division contracts average spending (‘VALUE_EURO’)
    2. Each CPV’s division contract count
    3. Each CPV’s division contracts average number of offers (‘NUMBER_OFFERS’)
    4. Each CPV’s division contracts average spending (‘VALUE_EURO’) with European funds (‘B_EU_FUNDS’)
    5. Each CPV’s division contracts average spending (‘VALUE_EURO’) without European funds (‘B_EU_FUNDS’)
2. The count of contracts for each CPV Division
3. Per CPV Division, get the average spending (‘VALUE_EURO’) and return the highest 5 cpvs
4. Per CPV Division, get the average spending (‘VALUE_EURO’) and return the lowest 5 cpvs
5. Per CPV Division, get the average spending (‘VALUE_EURO’) and return the highest 5 cpvs for contracts which received European funds (‘B_EU_FUNDS’)
6. Per CPV Division, get the average spending (‘VALUE_EURO’) and return the highest 5 cpvs for contracts which did not receive European funds (‘B_EU_FUNDS’)
7. The highest CPV Division on average spending (‘VALUE_EURO’) per country (‘ISO_COUNTRY_CODE’)
8. Returns bucketed data with the contract counts of a particular cpv in a given range of values (bucket) according to spending (‘VALUE_EURO’)
9. The average time and value difference for each CPV, return the highest 5 cpvs

#### Countries
10. 5 descriptive metrics of the contracts related to the Country, the average of:
    1. Each Country’s contracts average spending (‘VALUE_EURO’)
    2. Each Country’s contract count
    3. Each Country’s contracts average ‘NUMBER_OFFERS’
    4. Each Country’s contracts average ‘VALUE_EURO’ with ‘B_EU_FUNDS’
    5. Each Country’s contracts average ‘VALUE_EURO’ without ‘B_EU_FUNDS’
11. The count of contracts per country (‘ISO_COUNTRY_CODE’)
12. Returns the average ‘VALUE_EURO’ for each country, return the highest 5 countries
13. Returns the average ‘VALUE_EURO’ for each country, return the lowest 5 countries
14. For each country get the sum of the respective contracts 'VALUE_EURO' which received european funds 'B_EU_FUNDS' 


#### Companies
15. 5 descriptive metrics of the contracts related to the Company, the average of:
    1. Each Company’s contracts average spending (‘VALUE_EURO’)
    2. Each Company’s contract count
    3. Each Company’s contracts average ‘NUMBER_OFFERS’
    4. Each Company’s contracts average ‘VALUE_EURO’ with ‘B_EU_FUNDS’
    5. Each Company’s contracts average ‘VALUE_EURO’ without ‘B_EU_FUNDS’
16. Returns the average ‘VALUE_EURO’ for each company (‘CAE_NAME’), return the highest 5 companies
17. Returns the average ‘VALUE_EURO’ for each company (‘CAE_NAME’), return the lowest 5 companies
18. Returns the count of contracts for each company (‘CAE_NAME’), for the 15 companies with the most contracts
19. For each country, get the highest company (‘CAE_NAME’) in terms of total contract spending (sum of ‘VALUE_EURO’)
20. Returns the top 5 most frequent co-occurring companies (‘CAE_NAME’ and ‘WIN_NAME’)


**All resulting documents should allow filtering by the min and max of the contract year field, as well as by issuer country (ISO_COUNTRY_CODE) (see example 0 for more details).**
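
One way to satisfy this requirement is to put a shared `$match` stage at the start of every pipeline. The sketch below is only an illustration: the function name, its parameters and the optional `prefix` argument are hypothetical, and the signature actually used by example 0 is not reproduced here. It assumes the documents carry `YEAR` and `ISO_COUNTRY_CODE` fields, either at the top level or nested under a key such as `_id`.

```python
def year_country_match(year_min, year_max, countries=None, prefix=""):
    """Return a $match stage restricting year range and, optionally, countries.

    prefix lets the same helper target nested keys, e.g. "_id." for a
    precomputed collection grouped by year and country (an assumption).
    """
    match = {f"{prefix}YEAR": {"$gte": year_min, "$lte": year_max}}
    if countries:  # an empty or missing list means "all countries"
        match[f"{prefix}ISO_COUNTRY_CODE"] = {"$in": list(countries)}
    return {"$match": match}


stage = year_country_match(2008, 2020, ["PT", "ES"], prefix="_id.")
print(stage)
```

Placing this stage first keeps the filter logic in one place, and lets an index on the filtered fields be used when one exists.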

**Insert query**  
21. In `queries.py` there is a working function that inserts documents into the contracts database.  
If any precomputed table is generated, it should be recomputed with the new data in this insert method.
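
Recomputing a whole precomputed table on every insert would hurt the insertion benchmark, so one option is to update only the affected row. The sketch below is a minimal illustration, not the reference solution. The counter names mirror the precomputed collection built later in this notebook, the 100,000,000 cutoff mirrors the `$lt` filter used there, and the helper name is hypothetical. It assumes the inserted document already has a `CPV_Division` field.

```python
VALUE_CAP = 100_000_000  # assumed cutoff, mirroring the $lt filter of the precomputation


def incremental_update(doc):
    """Return (query, update) for a single upsert into the precomputed collection."""
    key = {
        "YEAR": doc["YEAR"],
        "ISO_COUNTRY_CODE": doc["ISO_COUNTRY_CODE"],
        "CPV_Division": doc["CPV_Division"],
    }
    inc = {"Count_Contracts_Total": 1}
    value = doc.get("VALUE_EURO")
    if value is not None and value < VALUE_CAP:
        inc["Count_Contracts_with_Value_Euro"] = 1
        inc["Sum_Value_Euro"] = value
    # The whole _id subdocument is matched, so the key order must equal
    # the order used when the collection was grouped.
    return {"_id": key}, {"$inc": inc}


query, update = incremental_update(
    {"YEAR": 2019, "ISO_COUNTRY_CODE": "PT", "CPV_Division": "45", "VALUE_EURO": 250000.0}
)
print(query, update)
# With a live connection: db.cpv_divisions_all_data.update_one(query, update, upsert=True)
```

Storing sums and counts instead of averages is what makes this possible: an average cannot be incremented, but its numerator and denominator can.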


### Group  

This project assumes groups to be the same as in the previous project. Any copying detected by the professors will lead to a grade of zero on the project and/or other disciplinary actions!


### Submission      
Submit the `queries.py` file with all the queries (running on the group's own database) on Moodle.

Delivery date: Until **23:59 of June 19th** (as there will be no exam, the due date was extended)


### Evaluation    


This will be 30% of the final grade.   
1. The queries run and generate the desired visualizations. (60%)
1. The speed of the query. (This will be benchmarked for all groups) (20%)
1. The document insertion speed. (This will be benchmarked for all groups) (10%)
1. The simplicity of the query. (10%)

The queries will be run against each group's database, so any index or extra table created will be used.

All code will go through automated plagiarism checks. Groups with the same code will undergo investigation.

### Extra information

**Rounding** of numbers can be performed with any function; it will not be an evaluation criterion.  
_Hint:_ To speed up the queries, two suggestions are indexes and precomputed tables.

### Connection to the Database

In [1]:
from pymongo import MongoClient
import warnings
warnings.filterwarnings('ignore')
import pprint
import credentials as cred

In [2]:
host= cred.mongo_host
port= cred.mongo_port
user= cred.mongo_user
password= cred.mongo_pass
protocol= "mongodb"
client = MongoClient(f"{protocol}://{user}:{password}@{host}:{port}")

In [3]:
db = client.contracts

eu = db.eu
cpv = db.cpv
cpv_codes = db.cpv_codes
iso = db.iso_codes

### CPV Codes Transformation

Create two new fields in every document:
- CPV_Corrected: the CPV code converted to an 8-character string; codes with 7 digits get a zero added on the left.
- CPV_Division: the first 2 digits of the corresponding CPV code
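
The same rule can be checked on a few values in plain Python before running it against the database. This is only a mirror of the logic described above (and of the `$cond` expression in the update cell below); the function names are hypothetical.

```python
def correct_cpv(cpv):
    """Return the CPV code as a string, left-padding 7-digit codes with one zero."""
    text = str(int(cpv))
    # Same condition as the $cond in the update: codes >= 10000000 already have 8 digits.
    return text if cpv >= 10_000_000 else "0" + text


def cpv_division(cpv):
    """Return the first two characters of the corrected code."""
    return correct_cpv(cpv)[:2]


for code in (45000000, 3000000):
    print(code, correct_cpv(code), cpv_division(code))
```

Note that, like the original rule, this adds exactly one zero, so it only yields 8 characters for codes that have 7 or 8 digits.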

In [4]:
# DELETE THIS CELL WHEN IT IS NO LONGER USEFUL
# DELETE UNWANTED FIELDS
# db.cpv_divisions_all_data.update_many(
#     {'_id.YEAR': 2018, 
#      "_id.ISO_COUNTRY_CODE" : "DE", 
#      "_id.CPV_Division" : '72'},
#    { '$unset': {'Count_Contracts_B_EU_Y': "", 'Sum_Value_Euro_B_EU_Y': "", 
#                 'Count_Contracts_B_EU_N': "", 'Sum_Value_Euro_B_EU_N': ""} }
# )

In [5]:
# SET THE FIELDS CPV_Corrected, CPV_Division AND ADDR_TOWN IN THE ORIGINAL COLLECTION
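# eu.update_many(
#     {'CPV': {'$exists': True}, '_id': {'$in': inserted_ids}},
#     [
#         # Stage 1: build the 8-character CPV string and the lower-cased address + town
#         {"$set": {"CPV_Corrected": {'$cond': [{'$gte': [ "$CPV", 10000000 ] }, # condition
#                                               {'$toString': "$CPV"},  # true case
#                                               {"$concat": [ "0", {'$toString': "$CPV" }]} # false case
#                                              ]
#                                    },
#                   "ADDR_TOWN": {'$toLower': {
#                       '$concat': [{'$toString': '$CAE_ADDRESS'}, ' ', {'$toString': '$CAE_TOWN'}]
#                                 }}
#                  }
#         },
#         # Stage 2: a $set stage cannot read a field created in the same stage,
#         # so CPV_Division is derived from CPV_Corrected in a separate stage
#         {"$set": {"CPV_Division": {'$substr': ['$CPV_Corrected', 0, 2]}}}
#     ]
# )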

Some documents do not have CPV codes, which is odd because the code has been mandatory since 2006.
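
To see whether the missing codes are concentrated in particular years, a small aggregation can count them per `YEAR`. This is a sketch that assumes the field is simply absent from those documents (rather than stored as `null`); the variable name is hypothetical.

```python
# Count documents without a CPV field, grouped by contract year.
missing_cpv_by_year = [
    {"$match": {"CPV": {"$exists": False}}},
    {"$group": {"_id": "$YEAR", "missing": {"$sum": 1}}},
    {"$sort": {"_id": 1}},
]

print(missing_cpv_by_year)
# With the connection opened above: list(eu.aggregate(missing_cpv_by_year))
```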

### Queries

##### Procurement Codes (CPV)
###### Pre-computed tables
B_EU_FUNDS

In [6]:
# count contracts and sum VALUE_EURO for B_EU_FUNDS = 'Y', per year, country and CPV division; write to a new collection

pipeline = [
    { '$match': {
        'CPV_Division': {'$exists': True},
        'VALUE_EURO': {'$lt': 100000000},
        'B_EU_FUNDS': {'$eq': 'Y'}
    } },
    { "$group": {
        "_id": {
            "YEAR": "$YEAR",
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            "CPV_Division": "$CPV_Division"
        },
        "Count_Contracts_B_EU_Y" : { "$sum": 1 },
        "Sum_Value_Euro_B_EU_Y" : { "$sum": "$VALUE_EURO" }
    } },
    { '$out': "cpv_divisions_all_data" }
     
 ]

eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f2b3f10>

In [7]:
# count contracts and sum VALUE_EURO for B_EU_FUNDS = 'N', per year, country and CPV division; merge into the same collection

pipeline = [
    { '$match': {
        'CPV_Division': {'$exists': True},
        'VALUE_EURO': {'$lt': 100000000},
        'B_EU_FUNDS': {'$eq': 'N'}
    } },
    { "$group": {
        "_id": {
            "YEAR": "$YEAR",
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            "CPV_Division": "$CPV_Division"
        },
        "Count_Contracts_B_EU_N" : { "$sum": 1 },
        "Sum_Value_Euro_B_EU_N" : { "$sum": "$VALUE_EURO" }
    } },
    { '$merge' : { 'into' : "cpv_divisions_all_data", 'on': "_id" } }
     
 ]

eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f558ac0>
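
The two cells above build one collection in two passes: `$out` creates it from the 'Y' contracts, and `$merge` then adds the 'N' fields to documents with the same `_id`, inserting a new document when no match exists (these are the defaults of `$merge`). The plain-Python sketch below only imitates that behaviour on toy rows to show the resulting shape; the helper name and the data are made up.

```python
def merge_on_id(target, rows):
    """Imitate $merge with whenMatched='merge' and whenNotMatched='insert'."""
    for row in rows:
        key = tuple(row["_id"].items())  # order-sensitive, like an embedded _id
        target.setdefault(key, {}).update(row)
    return target


with_funds = [{"_id": {"YEAR": 2018, "CPV_Division": "45"}, "Count_Contracts_B_EU_Y": 2}]
without_funds = [
    {"_id": {"YEAR": 2018, "CPV_Division": "45"}, "Count_Contracts_B_EU_N": 7},
    {"_id": {"YEAR": 2018, "CPV_Division": "72"}, "Count_Contracts_B_EU_N": 1},
]

collection = merge_on_id({}, with_funds)             # plays the role of $out
collection = merge_on_id(collection, without_funds)  # plays the role of $merge
for doc in collection.values():
    print(doc)
```

Division 72 ends up without any `..._B_EU_Y` field, which is why the queries further down start with a `$match` on `$exists` before dividing a sum by a count.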

TOTAL CONTRACTS WITH VALUE EURO AND SUM OF VALUE EURO

In [8]:
# get total avg_spending and total count of contracts by cpv division and put in newly created collection          
pipeline = [
    { '$match': {
         'CPV_Division': {'$exists': True},
         'VALUE_EURO': {'$lt': 100000000}
    } },
    { '$project': { 
        'CPV_Division': 1,
        'YEAR': 1,
        'ISO_COUNTRY_CODE': 1,
        'VALUE_EURO':1
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            "CPV_Division": "$CPV_Division"
        },
        'Count_Contracts_with_Value_Euro': {'$sum': 1},
        'Sum_Value_Euro': {'$sum': '$VALUE_EURO'}
    } },
    { '$project': {
        '_id': 1,
        'Count_Contracts_with_Value_Euro': 1,
        'Sum_Value_Euro': 1
    } },
    { '$merge' : { 'into' : "cpv_divisions_all_data", 'on': "_id" } }
]
              
eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f2d6640>

NUMBER_OFFERS

In [9]:
# get avg_nr_offers by cpv_division and put information in the previously computed collection                 
 
pipeline = [
    { '$match': {
        'CPV_Division': {'$exists': True},
        'NUMBER_OFFERS': {'$exists': True}
    } },
    { '$project': { 
        'CPV_Division': 1,
        'YEAR': 1,
        'ISO_COUNTRY_CODE': 1,
        'NUMBER_OFFERS':1
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            "CPV_Division": "$CPV_Division"
        },
        'Count_Contracts_with_Nr_Offers': {'$sum': 1},
        'Sum_Nr_Offers': {'$sum': '$NUMBER_OFFERS'}
    } },
    { '$project': {
        '_id': 1,
        'Count_Contracts_with_Nr_Offers': 1,
        'Sum_Nr_Offers': 1
    } },
    { '$merge' : { 'into' : "cpv_divisions_all_data", 'on': "_id" } }
]

eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f577700>

TOTAL CONTRACTS

In [10]:
# get total contracts by cpv division and put in newly created collection                             
pipeline = [
    { '$match': {
         'CPV_Division': {'$exists': True}
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            "CPV_Division": "$CPV_Division"
        },
        'Count_Contracts_Total': {'$sum': 1}
    } },
    { '$merge' : { 'into' : "cpv_divisions_all_data", 'on': "_id" } }
]
              
eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5771f0>

In [11]:
cpv_div_all_data = db.cpv_divisions_all_data

CPV DESCRIPTION

In [12]:
# get the respective description for each CPV division replacing the previous collection                          
pipeline = [
    { '$lookup': {
        'from': 'cpv',
        'localField': '_id.CPV_Division',   
        'foreignField': 'cpv_division',  
        'as': 'cpv_division_description'
    } },
    { '$project': {
        '_id': 1,
        "Count_Contracts_B_EU_Y" : 1,
        "Sum_Value_Euro_B_EU_Y": 1,
        "Count_Contracts_B_EU_N" : 1,
        "Sum_Value_Euro_B_EU_N": 1,
        'CPV_Description': {'$arrayElemAt': ['$cpv_division_description.cpv_division_description', 0]},
        'Count_Contracts_with_Value_Euro': 1,
        'Sum_Value_Euro': 1, 
        'Count_Contracts_with_Nr_Offers': 1,
        'Sum_Nr_Offers': 1, 
        'Count_Contracts_Total': 1
    } },
    { '$out': "cpv_divisions_all_data" }
]
    
cpv_div_all_data.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f2b3ca0>

#### INSERT TESTS

In [13]:
# value_euro=23
# year=1991
# country='Pt'
# cpv_div='03'
# db.results.insert_one({
#     '_id': {'YEAR': year,'ISO_COUNTRY_CODE': country, 'CPV_Division': cpv_div},
#     'Sum_Value_Euro': value_euro,
#     'Funds_conditioned_data': [{'Funds_bol': 'Y', 'Count_Contracts_B_EU': 3, 'Sum_Value_Euro_B_EU': 5}, 
#                                {'Funds_bol': 'N', 'Count_Contracts_B_EU': 2, 'Sum_Value_Euro_B_EU': 3}]
# }
# )

In [14]:
# value_euro=None
# year=1991
# country='Pt'
# cpv_div='03'
# result = db.results.update_one(
#                 {'_id.YEAR': {'$eq': year},
#                  '_id.ISO_COUNTRY_CODE': {'$eq': country},
#                  '_id.CPV_Division': {'$eq': cpv_div},
#                  'Funds_conditioned_data.Funds_bol': {'$eq': 'Y'}
#                 },
#                 [{'$set': {
#                     'Funds_conditioned_data.Count_Contracts_B_EU': {
#                         '$cond': [{'$or': [{'$eq': [value_euro, None]}, {'$gte': [value_euro, 100000000]}]},
#                                   '$Funds_conditioned_data.Count_Contracts_B_EU',
#                                   {'$sum': ['$Funds_conditioned_data.Count_Contracts_B_EU', 1]}]
#                     },
#                     'Funds_conditioned_data.Sum_Value_Euro_B_EU': {
#                         '$cond': [{'$or': [{'$eq': [value_euro, None]}, {'$gte': [value_euro, 100000000]}]},
#                                   '$Funds_conditioned_data.Sum_Value_Euro_B_EU',
#                                   {'$sum': ['$Funds_conditioned_data.Sum_Value_Euro_B_EU', value_euro]}]
#                     }
#                 }}], 
#                 upsert=True
#                 )

# if result.modified_count > 0:
#     new = False
# else:
#     new = True
    
# print('new', new)
# print('modified_count', result.modified_count)
# print('upserted_id', result.upserted_id)

In [15]:
# if result.upserted_id:
#     # get cpv description
#     pipeline = [
#          { '$match': {
#             '_id': {'$eq': result.upserted_id}
#         }},
#         { '$lookup': {
#             'from': 'cpv',
#             'localField': '_id.CPV_Division',   
#             'foreignField': 'cpv_division',  
#             'as': 'cpv_division_description'
#         }},
#         { '$project': {
#             '_id': 1,
#             'CPV_Description': {'$arrayElemAt': ['$cpv_division_description.cpv_division_description', 0]},
#         }}
#     ]

# #     agg = db.results.aggregate(pipeline, allowDiskUse=True)
#     cpv_desc = list(db.results.aggregate(pipeline))[0].get('CPV_Description')
#     print(cpv_desc)

In [16]:
# db.results.update_one(
#     {'_id': {'$eq': result.upserted_id}},
#     [
#         {"$set": {"CPV_Corrected": cpv_desc }
#         }
#     ]
# )

ISO 3 FORMAT CODE

In [17]:
# get the respective Country Name for each ISO_COUNTRY_CODE replacing the previous collection                   
pipeline = [
    { '$lookup': {
        'from': 'iso_codes',
        'localField': '_id.ISO_COUNTRY_CODE',   
        'foreignField': 'alpha-2',  
        'as': 'iso_codes'
    } },
    { '$project': {
        '_id': 1,
        "Count_Contracts_B_EU_Y" : 1,
        "Sum_Value_Euro_B_EU_Y": 1,
        "Count_Contracts_B_EU_N" : 1,
        "Sum_Value_Euro_B_EU_N": 1,
        'CPV_Description': 1, 
        'Count_Contracts_with_Value_Euro': 1,
        'Sum_Value_Euro': 1, 
        'Count_Contracts_with_Nr_Offers': 1,
        'Sum_Nr_Offers': 1, 
        'Count_Contracts_Total': 1,
        'Alpha3': {'$arrayElemAt': ['$iso_codes.alpha-3', 0]}
    } },
    { '$out': "cpv_divisions_all_data" }
]
    
cpv_div_all_data.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5ab220>

VALUE DIFFERENCE

In [18]:
# get the value difference by cpv_division and put information in the previously computed collection                
 
pipeline = [
    { '$match': {
        'CPV_Division': {'$exists': True},
        'VALUE_EURO': {'$lt': 100000000},
        'AWARD_VALUE_EURO': {'$exists': True}
    } },
    { '$project': { 
        'CPV_Division': 1,
        'YEAR': 1,
        'ISO_COUNTRY_CODE': 1,
        'Difference_Euro': {'$subtract': ['$AWARD_VALUE_EURO', '$VALUE_EURO']}
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            "CPV_Division": "$CPV_Division"
        },
        'Count_Contracts_with_Difference_Euro': {'$sum': 1},
        'Total_Difference_Euro': {'$sum': '$Difference_Euro'}
    } },
    { '$merge' : { 'into' : "cpv_divisions_all_data", 'on': "_id" } }
]

eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5772e0>

TIME DIFFERENCE

In [19]:
# get the time difference by cpv_division and put information in the previously computed collection                
pipeline = [
    { '$match': {
        'CPV_Division': {'$exists': True},
        'DT_DISPATCH': {'$exists': True},
        'DT_AWARD': {'$exists': True}
    } },
    { '$project' : {
        '_id': 0,
        'CPV_Division': 1,
        'YEAR': 1,
        'ISO_COUNTRY_CODE': 1,
        'DT_DISPATCH': {'$dateFromString': {'dateString': '$DT_DISPATCH'} },
        'DT_AWARD': {'$dateFromString': {'dateString': '$DT_AWARD'} }
    } },
    { '$project': { 
        'CPV_Division': 1,
        'YEAR': 1,
        'ISO_COUNTRY_CODE': 1,
        'Difference_Time': {'$subtract': ['$DT_DISPATCH', '$DT_AWARD']}
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            "CPV_Division": "$CPV_Division"
        },
        'Count_Contracts_with_Difference_Time': {'$sum': 1},
        'Total_Difference_Time': {'$sum': '$Difference_Time'}
    } },
    { '$merge' : { 'into' : "cpv_divisions_all_data", 'on': "_id" } }
]

eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f58bc10>
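
Subtracting two dates with `$subtract` yields a difference in milliseconds, so `Total_Difference_Time` is a sum of milliseconds. A minimal sketch of turning the stored sum and count into an average number of days follows; the function name and the sample numbers are hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def average_days(total_difference_ms, contract_count):
    """Average time difference in days from a summed millisecond total."""
    return total_difference_ms / contract_count / MS_PER_DAY


# 864,000,000 ms is 10 days in total; spread over 4 contracts that is 2.5 days each.
print(average_days(864_000_000, 4))
```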

##### Questions
`Question 1` 5 descriptive metrics of the contracts related to the CPV, the average of:

    A. Each CPV's division contracts average spending ('VALUE_EURO')

In [20]:
# %%timeit
pipeline = [
    { '$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True}
    } },
    { '$group': {
        '_id': '$_id.CPV_Division',
        'Sum_CPV_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_CPV_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    } },
    { '$project': {
        'Avg_CPV_Spending': {'$divide': ['$Sum_CPV_Spending', '$Count_contracts_CPV_Spending']}
    }},
    { '$group': {
        '_id': False,
        'Avg_Spending_Total': {'$avg': '$Avg_CPV_Spending'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Spending_Total': 1
    } }
]

r1A = cpv_div_all_data.aggregate(pipeline)

result1A = list(r1A)
result1A

[{'Avg_Spending_Total': 2914879.8230346916}]

The value seems very high, which suggests there may be some outliers/errors.
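
When judging this figure, it helps to remember that the pipeline averages the per-division averages, so a small division weighs as much as a large one and the result can differ from the plain average over all contracts. The sketch below reproduces the same two steps in plain Python on made-up rows shaped like the precomputed collection; the function name is hypothetical.

```python
from collections import defaultdict


def average_of_division_averages(rows):
    """Sum/count per CPV division first, then take the mean of those averages."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        if "Count_Contracts_with_Value_Euro" not in row:
            continue  # same role as the $match on $exists
        acc = totals[row["_id"]["CPV_Division"]]
        acc[0] += row["Sum_Value_Euro"]
        acc[1] += row["Count_Contracts_with_Value_Euro"]
    averages = [total / count for total, count in totals.values()]
    return sum(averages) / len(averages)


rows = [
    {"_id": {"YEAR": 2018, "CPV_Division": "45"}, "Sum_Value_Euro": 900.0, "Count_Contracts_with_Value_Euro": 3},
    {"_id": {"YEAR": 2019, "CPV_Division": "45"}, "Sum_Value_Euro": 300.0, "Count_Contracts_with_Value_Euro": 1},
    {"_id": {"YEAR": 2018, "CPV_Division": "72"}, "Sum_Value_Euro": 100.0, "Count_Contracts_with_Value_Euro": 1},
]

print(average_of_division_averages(rows))  # 200.0: mean of 300.0 and 100.0
print(1300.0 / 5)                          # 260.0: plain average over all 5 contracts
```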

    B. Each CPV's division contract count

In [21]:
pipeline = [
    { '$group': {
        '_id': '$_id.CPV_Division',
        'Count_contracts_CPV': {'$sum': '$Count_Contracts_Total'}
    } },
    { '$group': {
        '_id': False,
        'Avg_Count_Contracts_Total': {'$avg': '$Count_contracts_CPV'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Count_Contracts_Total': 1
    } }
]

r1B = cpv_div_all_data.aggregate(pipeline)

result1B = list(r1B)
result1B

[{'Avg_Count_Contracts_Total': 123558.42222222222}]

    C. Each CPV's division contracts average number of offers ('NUMBER_OFFERS')

In [22]:
pipeline = [
    { '$match': {
        'Count_Contracts_with_Nr_Offers': {'$exists': True}
    } },
    { '$group': {
        '_id': '$_id.CPV_Division',
        'Sum_CPV_Offers': {'$sum': '$Sum_Nr_Offers'},
        'Count_contracts_CPV_Offers': {'$sum': '$Count_Contracts_with_Nr_Offers'}
    } },
    { '$project': {
        'Avg_CPV_Offers': {'$divide': ['$Sum_CPV_Offers', '$Count_contracts_CPV_Offers']}
    }},
    { '$group': {
        '_id': False,
        'Avg_NR_Offers_Total': {'$avg': '$Avg_CPV_Offers'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_NR_Offers_Total': 1
    } }
]

r1C = cpv_div_all_data.aggregate(pipeline)

result1C = list(r1C)
print(result1C)

[{'Avg_NR_Offers_Total': 7.561289595334899}]


    D. Each CPV's division contracts average spending ('VALUE_EURO') with european funds ('B_EU_FUNDS')

In [23]:
pipeline = [
    { '$match': {
        'Count_Contracts_B_EU_Y': {'$exists': True}
    } },
    { '$group': {
        '_id': '$_id.CPV_Division',
        'Sum_CPV_Funds': {'$sum': '$Sum_Value_Euro_B_EU_Y'},
        'Count_contracts_CPV_Funds': {'$sum': '$Count_Contracts_B_EU_Y'}
    } },
    { '$project': {
        'Avg_CPV_Spending_Funds': {'$divide': ['$Sum_CPV_Funds', '$Count_contracts_CPV_Funds']}
    }},
    { '$group': {
        '_id': False,
        'Avg_Spending_Funds_Total': {'$avg': '$Avg_CPV_Spending_Funds'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Spending_Funds_Total': 1
    } }
]

r1D = cpv_div_all_data.aggregate(pipeline)

result1D = list(r1D)
print(result1D)

[{'Avg_Spending_Funds_Total': 2749237.10552714}]


    E. Each CPV's division contracts average spending ('VALUE_EURO') without european funds ('B_EU_FUNDS')

In [24]:
#%%timeit
pipeline = [
    { '$match': {
        'Count_Contracts_B_EU_N': {'$exists': True}
    } },
    { '$group': {
        '_id': '$_id.CPV_Division',
        'Sum_CPV_No_Funds': {'$sum': '$Sum_Value_Euro_B_EU_N'},
        'Count_contracts_CPV_No_Funds': {'$sum': '$Count_Contracts_B_EU_N'}
    } },
    { '$project': {
        'Avg_CPV_Spending_No_Funds': {'$divide': ['$Sum_CPV_No_Funds', '$Count_contracts_CPV_No_Funds']}
    }},
    { '$group': {
        '_id': False,
        'Avg_Spending_No_Funds_Total': {'$avg': '$Avg_CPV_Spending_No_Funds'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Spending_No_Funds_Total': 1
    } }
]

r1E = cpv_div_all_data.aggregate(pipeline)

result1E = list(r1E)
print(result1E)

[{'Avg_Spending_No_Funds_Total': 2771144.478866394}]


`Question 2` The count of contracts for each CPV Division

In [25]:
pipeline = [
    {'$group': {
        '_id': {'CPV_Division': '$_id.CPV_Division' , 'CPV_Description': '$CPV_Description'},
        'Count_contracts_CPV': {'$sum': '$Count_Contracts_Total'}
    }},
    {'$project': {
        '_id': 0,
        'cpv': '$_id.CPV_Description',
        'count': '$Count_contracts_CPV'
    }}
]

r2 = cpv_div_all_data.aggregate(pipeline)

result2 = list(r2)
print(result2)

[{'cpv': 'Office and computing machinery, equipment and supplies except furniture and software packages', 'count': 128660}, {'cpv': 'Other community, social and personal services', 'count': 25391}, {'cpv': 'Collected and purified water', 'count': 1463}, {'cpv': 'Agricultural, farming, fishing, forestry and related products', 'count': 23717}, {'cpv': 'Agricultural, forestry, horticultural, aquacultural and apicultural services', 'count': 142785}, {'cpv': 'Chemical products', 'count': 41502}, {'cpv': 'Medical equipments, pharmaceuticals and personal care products', 'count': 1775459}, {'cpv': 'Real estate services', 'count': 11924}, {'cpv': 'Postal and telecommunications services', 'count': 42213}, {'cpv': 'Public utilities', 'count': 13465}, {'cpv': 'Furniture (incl. office furniture), furnishings, domestic appliances (excl. lighting) and cleaning products', 'count': 109786}, {'cpv': 'Petroleum products, fuel, electricity and other sources of energy', 'count': 102064}, {'cpv': 'Musical i

`Question 3` Per CPV Division get the average spending ('VALUE_EURO') and return the highest 5 cpvs 

In [26]:
pipeline = [
    { '$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True}
    } },
    { '$group': {
        '_id': {'CPV_Division': '$_id.CPV_Division' , 'CPV_Description': '$CPV_Description'},
        'Sum_CPV_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_CPV_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    } },
    { '$project': {
        '_id': 0,
        'cpv': '$_id.CPV_Description',
        'avg': {'$divide': ['$Sum_CPV_Spending', '$Count_contracts_CPV_Spending']}
    } },
    { '$sort': {
        'avg': -1
    } },
    { '$limit': 5 }
]

r3 = cpv_div_all_data.aggregate(pipeline)

result3 = list(r3)
print(result3)

[{'cpv': 'Construction work', 'avg': 6029728.806856219}, {'cpv': 'Services related to the oil and gas industry', 'avg': 6024557.646283141}, {'cpv': 'Transport services (excl. Waste transport)', 'avg': 5777134.932820617}, {'cpv': 'Hotel, restaurant and retail trade services', 'avg': 4541651.970869796}, {'cpv': 'Architectural, construction, engineering and inspection services', 'avg': 4458947.163794739}]


`Question 4` Per CPV Division get the average spending ('VALUE_EURO') and return the lowest 5 cpvs

In [27]:
pipeline = [
    { '$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True}
    } },
    { '$group': {
        '_id': {'CPV_Division': '$_id.CPV_Division' , 'CPV_Description': '$CPV_Description'},
        'Sum_CPV_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_CPV_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    } },
    { '$project': {
        '_id': 0,
        'cpv': '$_id.CPV_Description',
        'avg': {'$divide': ['$Sum_CPV_Spending', '$Count_contracts_CPV_Spending']}
    } },
    { '$sort': {
        'avg': 1
    } },
    { '$limit': 5 }
]

r4 = cpv_div_all_data.aggregate(pipeline)

result4 = list(r4)
print(result4)

[{'cpv': 'Agricultural, farming, fishing, forestry and related products', 'avg': 1261617.4059277952}, {'cpv': 'Clothing, footwear, luggage articles and accessories', 'avg': 1384645.0949917191}, {'cpv': 'Laboratory, optical and precision equipments (excl. glasses)', 'avg': 1461057.6466721345}, {'cpv': 'Musical instruments, sport goods, games, toys, handicraft, art materials and accessories', 'avg': 1498052.4052218625}, {'cpv': 'Recreational, cultural and sporting services', 'avg': 1527093.0988754546}]


`Question 5` Per CPV Division get the average spending ('VALUE_EURO') and return the highest 5 cpvs for contracts which received european funds ('B_EU_FUNDS')

In [28]:
pipeline = [
    { '$match': {
        'Count_Contracts_B_EU_Y': {'$exists': True},
    } },
    { '$group': {
        '_id': {'CPV_Division': '$_id.CPV_Division' , 'CPV_Description': '$CPV_Description'},
        'Sum_CPV_Funds': {'$sum': '$Sum_Value_Euro_B_EU_Y'},
        'Count_contracts_CPV_Funds': {'$sum': '$Count_Contracts_B_EU_Y'}
    } },
    { '$project': {
        '_id': 0,
        'cpv': '$_id.CPV_Description',
        'avg': {'$divide': ['$Sum_CPV_Funds', '$Count_contracts_CPV_Funds']}
    }},
    { '$sort': {
        'avg': -1
    } },
    { '$limit': 5 }
]

r5 = cpv_div_all_data.aggregate(pipeline)

result5 = list(r5)
print(result5)

[{'cpv': 'Food, beverages, tobacco and related products', 'avg': 16257910.058471283}, {'cpv': 'Repair and maintenance services', 'avg': 10419752.572405463}, {'cpv': 'Transport services (excl. Waste transport)', 'avg': 8774581.195029901}, {'cpv': 'Chemical products', 'avg': 7059021.661101975}, {'cpv': 'Architectural, construction, engineering and inspection services', 'avg': 6433034.064154094}]


`Question 6` Per CPV Division, get the average spending ('VALUE_EURO') and return the highest 5 cpvs for contracts which did not receive european funds ('B_EU_FUNDS')

In [29]:
pipeline = [
    { '$match': {
        'Count_Contracts_B_EU_N': {'$exists': True},
    } },
    { '$group': {
        '_id':  {'CPV_Division': '$_id.CPV_Division' , 'CPV_Description': '$CPV_Description'},
        'Sum_CPV_No_Funds': {'$sum': '$Sum_Value_Euro_B_EU_N'},
        'Count_contracts_CPV_No_Funds': {'$sum': '$Count_Contracts_B_EU_N'}
    } },
    { '$project': {
        '_id': 0,
        'cpv': '$_id.CPV_Description',
        'avg': {'$divide': ['$Sum_CPV_No_Funds', '$Count_contracts_CPV_No_Funds']}
    } },
    { '$sort': {
        'avg': -1
    } },
    { '$limit': 5 }
]

r6 = cpv_div_all_data.aggregate(pipeline)

result6 = list(r6)
print(result6)

[{'cpv': 'Services related to the oil and gas industry', 'avg': 6222658.775468085}, {'cpv': 'Construction work', 'avg': 5376853.174595483}, {'cpv': 'Transport services (excl. Waste transport)', 'avg': 5286866.797021913}, {'cpv': 'Hotel, restaurant and retail trade services', 'avg': 4984443.18677356}, {'cpv': 'Business services: law, marketing, consulting, recruitment, printing and security', 'avg': 4466233.639456683}]


`Question 7` The highest CPV Division on average spending (‘VALUE_EURO’) per country (‘ISO_COUNTRY_CODE’)

In [30]:
pipeline = [ 
    # not all documents have VALUE_EURO and in this case, as the group by is more partitioned, they lead to a division by zero
    {'$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True}
    }},
    {'$group': {
        '_id': {
            'ISO_CODE': '$Alpha3',
            'CPV_Division': '$_id.CPV_Division',
            'CPV_Description': '$CPV_Description'
        },
        'Sum_CPV_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_CPV_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    }},
    {'$project': {
        '_id': 0,
        'cpv': '$_id.CPV_Description',
        'avg': {'$divide': ['$Sum_CPV_Spending', '$Count_contracts_CPV_Spending']},
        'country': '$_id.ISO_CODE'
    }},
    {'$sort': {
        'country': -1,
        'avg': -1
    }},
    {'$group': {
        '_id': "$country",
        'winner': {
            '$push': {
                'cpv': "$cpv",
                'avg': "$avg",
            }
        }
    }},
    {'$project': {
        'winner': {
            '$slice': ["$winner", 1]
        }
    }},
    {'$project': {
        '_id': 0,
        'cpv': {'$arrayElemAt': ['$winner.cpv', 0]},
        'avg': {'$arrayElemAt': ['$winner.avg', 0]},
        'country': '$_id'
    }}
]

r7 = cpv_div_all_data.aggregate(pipeline)

result7 = list(r7)
print(result7)

[{'cpv': 'Construction work', 'avg': 7907815.163451467, 'country': 'CZE'}, {'cpv': 'Architectural, construction, engineering and inspection services', 'avg': 23192041.33883759, 'country': 'LTU'}, {'cpv': 'Radio, television, communication, telecommunication and related equipment', 'avg': 12101718.21486607, 'country': 'NLD'}, {'cpv': 'Installation services (except software)', 'avg': 9347683.672892056, 'country': 'POL'}, {'cpv': 'Public utilities', 'avg': 7056595.069396986, 'country': 'FRA'}, {'cpv': 'Petroleum products, fuel, electricity and other sources of energy', 'avg': 28067633.570987653, 'country': 'AUT'}, {'cpv': 'Construction work', 'avg': 8446919.175904762, 'country': 'PRT'}, {'cpv': 'Construction work', 'avg': 5156362.670718954, 'country': 'EST'}, {'cpv': 'Services related to the oil and gas industry', 'avg': 24927159.89736842, 'country': 'ESP'}, {'cpv': 'Education and training services', 'avg': 37287103.236710235, 'country': 'GRC'}, {'cpv': 'Services related to the oil and gas
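
Since query simplicity is part of the evaluation, the last four stages of the pipeline above (`$sort`, `$group` with `$push`, `$slice`, `$arrayElemAt`) could arguably be shortened. The sketch below is an alternative, not the version used in this notebook: after sorting by `avg` in descending order, `$first` keeps the top division of each country directly. It assumes the preceding stages already produce `cpv`, `avg` and `country` fields, as the `$project` above does.

```python
# Alternative tail for the Question 7 pipeline (replaces everything from $sort on).
top_division_per_country = [
    {"$sort": {"avg": -1}},
    {"$group": {
        "_id": "$country",
        "cpv": {"$first": "$cpv"},   # first document per country = highest avg
        "avg": {"$first": "$avg"},
    }},
    {"$project": {"_id": 0, "cpv": 1, "avg": 1, "country": "$_id"}},
]

print(top_division_per_country)
```

This avoids building an array of every division per country only to keep its first element.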

`Question 8` Returns bucketed data with the contract counts of a particular cpv in a given range of values (bucket) according to spending ('VALUE_EURO')

In [31]:
# get the contracts with VALUE_EURO and CPV Division to speed up the query into a new collection
pipeline = [
    { '$match': { # YEAR and ISO_COUNTRY_CODE always exist
        'CPV_Division': {'$exists': True},
        'VALUE_EURO': {'$lt': 100000000}
    } },
    { '$project': {
        '_id': 0,
        'YEAR': 1,
        'ISO_COUNTRY_CODE': 1,
        'CPV_Division': 1,
        'VALUE_EURO': 1
    } },
    { '$out' : 'contracts_value_euro'}
]
              
eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5a8820>

In [32]:
contracts_value_euro = db.contracts_value_euro

**USE SOMETHING SO THIS DOES NOT TAKE SO LONG TO RUN (INDEXES ON VALUE_EURO)**
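The note above suggests an index to speed up this query. A minimal sketch of what that could look like, assuming `contracts_value_euro` is the pymongo collection created in the previous cell. The helper `ensure_bucket_index` and the index name are hypothetical. The `$match` on `CPV_Division` is an equality filter, so that field goes first; `VALUE_EURO` is added as the second key following the note. Whether this helps measurably should be checked with the dashboard's performance test.

```python
# Sketch only: `ensure_bucket_index` and the index name are hypothetical.
ASCENDING = 1  # same value as pymongo.ASCENDING

BUCKET_INDEX_KEYS = [("CPV_Division", ASCENDING), ("VALUE_EURO", ASCENDING)]


def ensure_bucket_index(collection, name="cpv_division_value_euro"):
    """Create (if missing) a compound index usable by the $match on CPV_Division."""
    return collection.create_index(BUCKET_INDEX_KEYS, name=name)


# Usage against the collection from the cell above:
# ensure_bucket_index(contracts_value_euro)
```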

In [33]:
# %%timeit
cpv = '50'

pipeline = [ 
    { '$match': {
        'CPV_Division': {'$eq': cpv}
    } },
    { '$bucketAuto': {
        'groupBy': "$VALUE_EURO",
        'buckets': 10,
        'granularity': 'R20'
    } },
    { '$project': {
        'bucket': '$_id.min',
        '_id': 0,
        'count': 1
    } }
]

r8 = contracts_value_euro.aggregate(pipeline)

result8 = list(r8)
result8

[{'count': 9596, 'bucket': 0.009000000000000001},
 {'count': 10152, 'bucket': 45000.0},
 {'count': 11231, 'bucket': 125000.0},
 {'count': 10546, 'bucket': 224000.0},
 {'count': 9349, 'bucket': 355000.0},
 {'count': 10525, 'bucket': 560000.0},
 {'count': 9281, 'bucket': 1000000.0},
 {'count': 9201, 'bucket': 2000000.0},
 {'count': 9195, 'bucket': 5000000.0},
 {'count': 1817, 'bucket': 25000000.0}]
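The output above keeps only each bucket's lower bound (`_id.min`). If the dashboard needs explicit ranges, they can be rebuilt by pairing each lower bound with the next one, as in this sketch. The function name and the `upper_limit` argument are assumptions of the example; the value 100000000 mirrors the `VALUE_EURO` filter used when the collection was created.

```python
def bucket_ranges(buckets, upper_limit):
    """Pair each bucket's lower bound with the next bucket's lower bound.

    `buckets` is a list shaped like result8 (dicts with 'bucket' and 'count').
    `upper_limit` closes the last range and is an assumption of this sketch.
    """
    ordered = sorted(buckets, key=lambda b: b["bucket"])
    ranges = []
    for i, b in enumerate(ordered):
        upper = ordered[i + 1]["bucket"] if i + 1 < len(ordered) else upper_limit
        ranges.append({"min": b["bucket"], "max": upper, "count": b["count"]})
    return ranges


# Last two rows of the output above, used as sample input.
sample = [
    {"count": 9195, "bucket": 5000000.0},
    {"count": 1817, "bucket": 25000000.0},
]
for row in bucket_ranges(sample, upper_limit=100000000):
    print(row)
```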

`Question 9` The average time difference and average value difference for each CPV division, return the highest 5 cpvs sorted by average time

In [34]:
pipeline = [ 
    { '$group': {
        '_id': {'CPV_Division': '$_id.CPV_Division', 'CPV_Description': '$CPV_Description'},
        'Count_Contracts_with_Difference_Time_Total': {'$sum': '$Count_Contracts_with_Difference_Time'},
        'Sum_Difference_Time_Total': {'$sum': '$Total_Difference_Time'},
        'Count_Contracts_with_Difference_Euro_Total': {'$sum': '$Count_Contracts_with_Difference_Euro'},
        'Sum_Difference_Euro_Total': {'$sum': '$Total_Difference_Euro'}
    } },
    { '$project': {
        '_id': 0,
        'cpv': '$_id.CPV_Description',
        'time_difference': {'$divide': ['$Sum_Difference_Time_Total', '$Count_Contracts_with_Difference_Time_Total']},
        'value_difference': {'$divide': ['$Sum_Difference_Euro_Total', '$Count_Contracts_with_Difference_Euro_Total']}
    } },
    { '$sort': {
        'time_difference': -1
    } },
    { '$limit': 5}
]

r9 = cpv_div_all_data.aggregate(pipeline)

result9 = list(r9)
print(result9)

[{'cpv': 'Agricultural, farming, fishing, forestry and related products', 'time_difference': 7238269693.423824, 'value_difference': -975537.2121059563}, {'cpv': 'Services related to the oil and gas industry', 'time_difference': 6524291803.278688, 'value_difference': -2661017.8293143203}, {'cpv': 'Education and training services', 'time_difference': 6075951869.6554165, 'value_difference': -959285.731239647}, {'cpv': 'Public utilities', 'time_difference': 6015179103.730285, 'value_difference': -1838453.5639014537}, {'cpv': 'Health and social work services', 'time_difference': 5901637586.244661, 'value_difference': -1485746.403433414}]
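The `time_difference` values above are large because they appear to be in milliseconds, which is the unit MongoDB produces when two dates are subtracted. That is an assumption here, since the step that computed `Total_Difference_Time` is not shown in this section. Under that assumption, a small conversion makes the numbers easier to read; the helper names are hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def ms_to_days(milliseconds):
    """Convert a duration in milliseconds to days."""
    return milliseconds / MS_PER_DAY


def with_days(rows):
    """Return copies of result9-style rows with an extra 'time_difference_days' key."""
    return [
        {**row, "time_difference_days": round(ms_to_days(row["time_difference"]), 1)}
        for row in rows
    ]


# One row taken from the output above (assumed to be milliseconds).
sample = [{"cpv": "Public utilities", "time_difference": 6015179103.730285}]
print(with_days(sample))
```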


##### Countries
###### Precomputed Tables

The previously computed collection cpv_div_all_data was generated from documents where CPV_Division exists. Because some documents have ISO_COUNTRY_CODE but no CPV_Division, we cannot reuse that collection and have to create a new one instead.

B_EU_FUNDS

In [35]:
# get sum_spending and count of contracts with value euro by B_EU_FUNDS and by Country and put in new collection       
pipeline = [
    { '$match': {
        'ISO_COUNTRY_CODE': {'$exists': True},
        'B_EU_FUNDS': {'$eq': 'Y'},
        'VALUE_EURO': {'$lt': 100000000}
    } },
    { "$group": {
        "_id": {
            "YEAR": "$YEAR",
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE"
        },
        "Count_Contracts_B_EU_Y" : { "$sum": 1 },
        "Sum_Value_Euro_B_EU_Y" : { "$sum": "$VALUE_EURO" }
    } },
    { '$out': "countries_all_data" }
     
 ]

eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5ab1f0>

In [36]:
# get sum_spending and count of contracts with value euro without EU funds (B_EU_FUNDS = 'N') by Country and merge into countries_all_data
pipeline = [
    { '$match': {
        'ISO_COUNTRY_CODE': {'$exists': True},
        'B_EU_FUNDS': {'$eq': 'N'},
        'VALUE_EURO': {'$lt': 100000000}
    } },
    { "$group": {
        "_id": {
            "YEAR": "$YEAR",
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE"
        },
        "Count_Contracts_B_EU_N" : { "$sum": 1 },
        "Sum_Value_Euro_B_EU_N" : { "$sum": "$VALUE_EURO" }
    } },
    { '$merge' : { 'into' : "countries_all_data", 'on': "_id" } }
     
 ]

eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5abf40>
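The first pipeline creates `countries_all_data` with `$out`, and the following ones add fields to it with `$merge` on `_id`. With the default options (`whenMatched: 'merge'`, `whenNotMatched: 'insert'`), a document with a matching `_id` gains the new fields and an unmatched one is inserted. The pure-Python sketch below mimics that behaviour on made-up sample values; `merge_on_id` is a hypothetical helper, not part of pymongo.

```python
def merge_on_id(target, incoming):
    """Mimic $merge with on='_id' and the default whenMatched / whenNotMatched."""
    # The key keeps the field order of _id, as embedded-document equality does.
    by_id = {tuple(doc["_id"].items()): dict(doc) for doc in target}
    for doc in incoming:
        key = tuple(doc["_id"].items())
        by_id.setdefault(key, {"_id": doc["_id"]}).update(doc)
    return list(by_id.values())


# Made-up numbers, for illustration only.
funded = [{"_id": {"YEAR": 2015, "ISO_COUNTRY_CODE": "PT"}, "Count_Contracts_B_EU_Y": 3}]
not_funded = [
    {"_id": {"YEAR": 2015, "ISO_COUNTRY_CODE": "PT"}, "Count_Contracts_B_EU_N": 7},
    {"_id": {"YEAR": 2015, "ISO_COUNTRY_CODE": "MT"}, "Count_Contracts_B_EU_N": 2},
]
for doc in merge_on_id(funded, not_funded):
    print(doc)
```

Because the match is on the whole `_id` document, every pipeline that merges into the collection has to build `_id` with the same fields in the same order, which the cells in this section do.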

TOTALS CONTRACTS WITH VALUE EURO AND VALUE_EURO

In [37]:
# get total count of contracts with value euro and total value euro by country and put in newly created collection  
pipeline = [
    { '$match': {
         'ISO_COUNTRY_CODE': {'$exists': True},
         'VALUE_EURO': {'$lt': 100000000}
    } },
    { '$project': { 
        'YEAR': 1,
        'ISO_COUNTRY_CODE': 1,
        'VALUE_EURO':1
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE"
        },
        'Count_Contracts_with_Value_Euro': {'$sum': 1},
        'Sum_Value_Euro': {'$sum': '$VALUE_EURO'}
    } },
    { '$project': {
        '_id': 1,
        'Count_Contracts_with_Value_Euro': 1,
        'Sum_Value_Euro': 1
    } },
    { '$merge' : { 'into' : "countries_all_data", 'on': "_id" } }
]
              
eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5ab4f0>

NUMBER_OFFERS

In [38]:
# get total_nr_offers and contracts with nr_offers by country and put information in the previously computed collection   
 
pipeline = [
    { '$match': {
        'ISO_COUNTRY_CODE': {'$exists': True},
        'NUMBER_OFFERS': {'$exists': True}
    } },
    { '$project': { 
        'YEAR': 1,
        'ISO_COUNTRY_CODE': 1,
        'NUMBER_OFFERS':1
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE"
        },
        'Count_Contracts_with_Nr_Offers': {'$sum': 1},
        'Sum_Nr_Offers': {'$sum': '$NUMBER_OFFERS'}
    } },
    { '$project': {
        '_id': 1,
        'Count_Contracts_with_Nr_Offers': 1,
        'Sum_Nr_Offers': 1
    } },
    { '$merge' : { 'into' : "countries_all_data", 'on': "_id" } }
]

eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5dfb20>

TOTAL CONTRACTS

In [39]:
# get total contracts by country and put in newly created collection
pipeline = [
    { '$match': {
         'ISO_COUNTRY_CODE': {'$exists': True}
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE"
        },
        'Count_Contracts_Total': {'$sum': 1}
    } },
    { '$merge' : { 'into' : "countries_all_data", 'on': "_id" } }
]
              
eu.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5abca0>

In [40]:
countries_all_data = db.countries_all_data

COUNTRY NAME & ISO-3 CODE

The 'UK' ISO_COUNTRY_CODE in the eu collection has no corresponding alpha-2 code in the iso_codes collection. On investigation, we found that 'UK' is not a real alpha-2 code and that the code for the United Kingdom is in fact 'GB'. Although we believe the best solution would be to replace 'UK' with 'GB' in the eu collection documents, this would cause problems in the dashboard, so we decided to replace 'GB' with 'UK' in the iso_codes collection document instead.

In [41]:
# db.iso_codes.update_many(
#     {'alpha-2': {'$exists': True}},
#     [
#         {"$set": {
#             "alpha-2": {'$cond': [{'$eq': [ "$alpha-2", 'GB' ] }, # condition
#                                            'UK',  # true case
#                                            "$alpha-2" # false case
#                                           ] 
#                                 }
#         } }
#     ]
# )
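An alternative that leaves both collections untouched would be to normalise the code on the application side before matching it against `iso_codes`. This is only a sketch of that idea; `NON_ISO_ALIASES` and `to_alpha2` are hypothetical names, and the approach was not the one used in this notebook.

```python
# Sketch only: NON_ISO_ALIASES and to_alpha2 are hypothetical names.
NON_ISO_ALIASES = {"UK": "GB"}  # 'UK' appears in the eu collection, 'GB' is the ISO alpha-2


def to_alpha2(code):
    """Map a country code from the eu collection to its ISO 3166-1 alpha-2 form."""
    return NON_ISO_ALIASES.get(code, code)


print(to_alpha2("UK"), to_alpha2("PT"))
```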

In [42]:
# get the respective country name for each ISO_CODE replacing the previous collection    
pipeline = [
    { '$lookup': {
        'from': 'iso_codes',
        'localField': '_id.ISO_COUNTRY_CODE',   
        'foreignField': 'alpha-2',  
        'as': 'iso_codes'
    } },
    { '$project': {
        '_id': 1,
        "Count_Contracts_B_EU_Y" : 1,
        "Sum_Value_Euro_B_EU_Y": 1,
        "Count_Contracts_B_EU_N" : 1,
        "Sum_Value_Euro_B_EU_N": 1,
        'Country_Name': {'$arrayElemAt': ['$iso_codes.name', 0]},
        'ISO3': {'$arrayElemAt': ['$iso_codes.alpha-3', 0]},
        'Count_Contracts_with_Value_Euro': 1,
        'Sum_Value_Euro': 1, 
        'Count_Contracts_with_Nr_Offers': 1,
        'Sum_Nr_Offers': 1, 
        'Count_Contracts_Total': 1
    } },
    { '$out': "countries_all_data" }
]
    
countries_all_data.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f5e2be0>
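When `$lookup` finds no match, `iso_codes` is an empty array and `$arrayElemAt` returns no value, so `Country_Name` and `ISO3` are simply missing from the document. A quick way to spot such codes is to compare the two sets of codes, as in the sketch below. The function name and the sample rows are assumptions for illustration.

```python
def unmatched_codes(country_codes, iso_rows):
    """Return the codes that have no 'alpha-2' match in the iso_codes rows."""
    known = {row["alpha-2"] for row in iso_rows}
    return sorted(set(country_codes) - known)


# Sample rows for illustration; real input would come from the two collections.
iso_rows = [{"alpha-2": "GB", "alpha-3": "GBR"}, {"alpha-2": "PT", "alpha-3": "PRT"}]
print(unmatched_codes(["UK", "PT", "PT"], iso_rows))
```

With pymongo, the inputs could come from `countries_all_data.distinct('_id.ISO_COUNTRY_CODE')` and `db.iso_codes.find({}, {'alpha-2': 1})`.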

##### Questions
`Question 10` 5 descriptive metrics of the contracts related to the Country, the average of:

    A. Each Country's contracts average spending ('EURO_VALUE'), (int)   

In [43]:
#%%timeit
pipeline = [
    {'$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True},
    }},
    { '$group': {
        '_id': '$_id.ISO_COUNTRY_CODE',
        'Sum_Country_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_Country_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    } },
    { '$project': {
        'Avg_Country_Spending': {'$divide': ['$Sum_Country_Spending', '$Count_contracts_Country_Spending']}
    }},
    { '$group': {
        '_id': False,
        'Avg_Spending_Total': {'$avg': '$Avg_Country_Spending'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Spending_Total': 1
    } }
]

r10A = countries_all_data.aggregate(pipeline)

result10A = list(r10A)
result10A

[{'Avg_Spending_Total': 3367787.4575050077}]
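The pipeline above first computes one average per country and then averages those values, so every country carries the same weight regardless of how many contracts it has. That is not the same number as the average over all contracts. The sketch below shows the difference on made-up numbers; the function names are hypothetical.

```python
def mean_of_country_means(rows):
    """Average of per-country averages: every country weighs the same."""
    per_country = [r["sum"] / r["count"] for r in rows]
    return sum(per_country) / len(per_country)


def pooled_mean(rows):
    """Average over all contracts: countries with more contracts weigh more."""
    return sum(r["sum"] for r in rows) / sum(r["count"] for r in rows)


# Made-up numbers, for illustration only.
rows = [{"sum": 1000.0, "count": 10}, {"sum": 9000.0, "count": 1}]
print(mean_of_country_means(rows), pooled_mean(rows))
```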

    B. Each Country's contract count, (int)

In [44]:
pipeline = [
    { '$group': {
        '_id': '$_id.ISO_COUNTRY_CODE',
        'Count_contracts_Country': {'$sum': '$Count_Contracts_Total'}
    } },
    { '$group': {
        '_id': False,
        'Avg_Count_Contracts_Total': {'$avg': '$Count_contracts_Country'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Count_Contracts_Total': 1
    } }
]

r10B = countries_all_data.aggregate(pipeline)

result10B = list(r10B)
result10B

[{'Avg_Count_Contracts_Total': 168587.0303030303}]

    C. Each Country's contracts average 'NUMBER_OFFERS', (int)

In [45]:
pipeline = [
    {'$match': {
        'Count_Contracts_with_Nr_Offers': {'$exists': True},
    }},
    { '$group': {
        '_id': '$_id.ISO_COUNTRY_CODE',
        'Sum_Country_Offers': {'$sum': '$Sum_Nr_Offers'},
        'Count_contracts_Country_Offers': {'$sum': '$Count_Contracts_with_Nr_Offers'}
    } },
    { '$project': {
        'Avg_Country_Offers': {'$divide': ['$Sum_Country_Offers', '$Count_contracts_Country_Offers']}
    }},
    { '$group': {
        '_id': False,
        'Avg_NR_Offers_Total': {'$avg': '$Avg_Country_Offers'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_NR_Offers_Total': 1
    } }
]

r10C = countries_all_data.aggregate(pipeline)

result10C = list(r10C)
print(result10C)

[{'Avg_NR_Offers_Total': 12.580563334769861}]


    D. Each Country's contracts average 'EURO_VALUE' with 'B_EU_FUNDS', (int)

In [46]:
pipeline = [
    { '$match': {
        'Count_Contracts_B_EU_Y': {'$exists': True},
    } },
    { '$group': {
        '_id': '$_id.ISO_COUNTRY_CODE',
        'Sum_Country_Funds': {'$sum': '$Sum_Value_Euro_B_EU_Y'},
        'Count_Contracts_Country_Funds': {'$sum': '$Count_Contracts_B_EU_Y'}
    } },
    { '$project': {
        'Avg_Country_Spending_Funds': {'$divide': ['$Sum_Country_Funds', '$Count_Contracts_Country_Funds']}
    }},
    { '$group': {
        '_id': False,
        'Avg_Spending_Funds_Total': {'$avg': '$Avg_Country_Spending_Funds'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Spending_Funds_Total': 1
    } }
]

r10D = countries_all_data.aggregate(pipeline)

result10D = list(r10D)
print(result10D)

[{'Avg_Spending_Funds_Total': 3315399.0563988388}]


    E. Each Country's contracts average 'EURO_VALUE' without 'B_EU_FUNDS' (int)

In [47]:
#%%timeit
pipeline = [
    { '$match': {
        'Count_Contracts_B_EU_N': {'$exists': True},
    } },
    { '$group': {
        '_id': '$_id.ISO_COUNTRY_CODE',
        'Sum_Country_No_Funds': {'$sum': '$Sum_Value_Euro_B_EU_N'},
        'Count_contracts_Country_No_Funds': {'$sum': '$Count_Contracts_B_EU_N'}
    } },
    { '$project': {
        'Avg_Country_Spending_No_Funds': {'$divide': ['$Sum_Country_No_Funds', '$Count_contracts_Country_No_Funds']}
    }},
    { '$group': {
        '_id': False,
        'Avg_Spending_No_Funds_Total': {'$avg': '$Avg_Country_Spending_No_Funds'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Spending_No_Funds_Total': 1
    } }
]

r10E = countries_all_data.aggregate(pipeline)

result10E = list(r10E)
print(result10E)

[{'Avg_Spending_No_Funds_Total': 3363544.1983129242}]
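Parts A, C, D and E of Question 10 share the same shape and differ only in which sum and count fields they read. A small builder could remove the repetition, as sketched below. `country_average_pipeline` and the intermediate field names `total`, `n` and `avg` are hypothetical; the stages mirror the cells above.

```python
def country_average_pipeline(sum_field, count_field, out_field):
    """Build the Question 10 pattern: per-country average, then the mean of those."""
    return [
        # Keep only documents that carry the counter, so the divisor is never missing.
        {"$match": {count_field: {"$exists": True}}},
        {"$group": {
            "_id": "$_id.ISO_COUNTRY_CODE",
            "total": {"$sum": f"${sum_field}"},
            "n": {"$sum": f"${count_field}"},
        }},
        {"$project": {"avg": {"$divide": ["$total", "$n"]}}},
        {"$group": {"_id": None, out_field: {"$avg": "$avg"}}},
        {"$project": {"_id": 0, out_field: 1}},
    ]


pipeline_10e = country_average_pipeline(
    "Sum_Value_Euro_B_EU_N", "Count_Contracts_B_EU_N", "Avg_Spending_No_Funds_Total"
)
# r10E = countries_all_data.aggregate(pipeline_10e)
```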


`Question 11` The count of contracts per country ('ISO_COUNTRY_CODE')

In [48]:
pipeline = [
    {'$group': {
        '_id': {'ISO_COUNTRY_CODE': '$_id.ISO_COUNTRY_CODE' , 'Country_Name': '$Country_Name'},
        'Count_contracts_Country': {'$sum': '$Count_Contracts_Total'}
    }},
    {'$project': {
        '_id': 0,
        'country': '$_id.Country_Name',
        'count': '$Count_contracts_Country'
    }}
]

r11 = countries_all_data.aggregate(pipeline)

result11 = list(r11)
print(result11)

[{'country': 'Germany', 'count': 421613}, {'country': 'Austria', 'count': 37616}, {'country': 'Ireland', 'count': 32686}, {'country': 'Denmark', 'count': 64496}, {'country': 'Estonia', 'count': 26992}, {'country': 'Slovakia', 'count': 40809}, {'country': 'Belgium', 'count': 77642}, {'country': 'Slovenia', 'count': 150639}, {'country': 'Iceland', 'count': 1950}, {'country': 'Greece', 'count': 48069}, {'country': 'Portugal', 'count': 35912}, {'country': 'Luxembourg', 'count': 9961}, {'country': 'Finland', 'count': 70955}, {'country': 'Malta', 'count': 4548}, {'country': 'Lithuania', 'count': 117109}, {'country': 'Hungary', 'count': 72794}, {'country': 'Czechia', 'count': 128369}, {'country': 'Netherlands', 'count': 79061}, {'country': 'France', 'count': 1263013}, {'country': 'Spain', 'count': 251927}, {'country': 'United Kingdom of Great Britain and Northern Ireland', 'count': 386501}, {'country': 'Sweden', 'count': 112423}, {'country': 'Cyprus', 'count': 8865}, {'country': 'Romania', 'c

`Question 12` Returns the average 'EURO_VALUE' for each country; return the 5 countries with the highest average

In [49]:
pipeline = [
    { '$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True},
    } },
    { '$group': {
        '_id': {'ISO_COUNTRY_CODE': '$_id.ISO_COUNTRY_CODE' , 'Country_Name': '$Country_Name'},
        'Sum_Country_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_Country_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    } },
    { '$project': {
        '_id': 0,
        'country': '$_id.Country_Name',
        'avg': {'$divide': ['$Sum_Country_Spending', '$Count_contracts_Country_Spending']}
    } },
    { '$sort': {
        'avg': -1
    } },
    { '$limit': 5 }
]

r12 = countries_all_data.aggregate(pipeline)

result12 = list(r12)
print(result12)

[{'country': 'United Kingdom of Great Britain and Northern Ireland', 'avg': 10803061.393965786}, {'country': 'Ireland', 'avg': 9847262.351122826}, {'country': 'Denmark', 'avg': 7685766.528293353}, {'country': 'Norway', 'avg': 6330150.5453541335}, {'country': 'Luxembourg', 'avg': 5263316.019982798}]


`Question 13` Returns the average 'EURO_VALUE' for each country; return the 5 countries with the lowest average

In [50]:
pipeline = [
    { '$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True},
    } },
    { '$group': {
        '_id': {'ISO_COUNTRY_CODE': '$_id.ISO_COUNTRY_CODE' , 'Country_Name': '$Country_Name'},
        'Sum_Country_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_Country_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    } },
    { '$project': {
        '_id': 0,
        'country': '$_id.Country_Name',
        'avg': {'$divide': ['$Sum_Country_Spending', '$Count_contracts_Country_Spending']}
    } },
    { '$sort': {
        'avg': 1
    } },
    { '$limit': 5 }
]

r13 = countries_all_data.aggregate(pipeline)

result13 = list(r13)
print(result13)

[{'country': 'North Macedonia', 'avg': 622732.6235172235}, {'country': 'Slovenia', 'avg': 1123124.187401788}, {'country': 'Bulgaria', 'avg': 1148939.245934335}, {'country': 'Cyprus', 'avg': 1185984.9630768115}, {'country': 'Malta', 'avg': 1350866.011550476}]
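Questions 12 and 13 run the same grouping and differ only in the sort direction. Since there are only a few dozen countries, one option is to compute the per-country averages once and take both ends in Python, as in this sketch. It assumes the dashboard allows the two answers to share one query result, which may not be the case; `top_and_bottom` is a hypothetical helper and the sample values are made up.

```python
def top_and_bottom(rows, n=5, key="avg"):
    """Return (highest n, lowest n) from an unsorted list of per-country rows."""
    ordered = sorted(rows, key=lambda r: r[key], reverse=True)
    highest = ordered[:n]
    lowest = ordered[-n:][::-1]  # lowest first, matching an ascending $sort
    return highest, lowest


# Made-up values, for illustration only.
rows = [
    {"country": "A", "avg": 4.0},
    {"country": "B", "avg": 1.0},
    {"country": "C", "avg": 3.0},
    {"country": "D", "avg": 2.0},
]
highest, lowest = top_and_bottom(rows, n=2)
print(highest)
print(lowest)
```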


`Question 14` For each country get the sum of the respective contracts 'VALUE_EURO' which received european funds 'B_EU_FUNDS' 

In [51]:
pipeline = [
    { '$match': {
        'Sum_Value_Euro_B_EU_Y': {'$exists': True},
    } },
    { '$group': {
        '_id': {'ISO_COUNTRY_CODE': '$_id.ISO_COUNTRY_CODE' , 'ISO3': '$ISO3'},
        'Sum_Country_Funds': {'$sum': '$Sum_Value_Euro_B_EU_Y'}
    } },
    { '$project': {
        '_id': 0,
        'country': '$_id.ISO3',
        'sum': '$Sum_Country_Funds'
    }}
]

r14 = countries_all_data.aggregate(pipeline)

result14 = list(r14)
#print(result14)

##### Companies
##### Pre-computed Tables (all the documents have CAE_NAME)
B_EU_FUNDS

In [52]:
# get sum_spending and count of contracts with value euro by B_EU_FUNDS and by Company and put in new collection       
pipeline = [
    { '$match': {
        'B_EU_FUNDS': {'$eq': 'Y'},
        'VALUE_EURO': {'$lt': 100000000}
    } },
    { "$group": {
        "_id": {
            "YEAR": "$YEAR",
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            "CAE_NAME": "$CAE_NAME",
            "ADDR_TOWN": "$ADDR_TOWN"
        },
        "Count_Contracts_B_EU_Y" : { "$sum": 1 },
        "Sum_Value_Euro_B_EU_Y" : { "$sum": "$VALUE_EURO" }
    } },
    { '$out': "companies_all_data" }
 ]

agg = eu.aggregate(pipeline, allowDiskUse=True)

In [53]:
# get sum_spending and count of contracts with value euro without EU funds (B_EU_FUNDS = 'N') by Company and merge into companies_all_data
pipeline = [
    { '$match': {
        'B_EU_FUNDS': {'$eq': 'N'},
        'VALUE_EURO': {'$lt': 100000000}
    } },
    { "$group": {
        "_id": {
            "YEAR": "$YEAR",
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            "CAE_NAME": "$CAE_NAME",
            "ADDR_TOWN": "$ADDR_TOWN"
        },
        "Count_Contracts_B_EU_N" : { "$sum": 1 },
        "Sum_Value_Euro_B_EU_N" : { "$sum": "$VALUE_EURO" }
    } },
    { '$merge' : { 'into' : "companies_all_data", 'on': "_id" } }
 ]

agg = eu.aggregate(pipeline, allowDiskUse=True)

TOTALS CONTRACTS WITH VALUE EURO AND VALUE_EURO

In [54]:
# get total count of contracts with value euro and total value euro by Company and put in newly created collection  
pipeline = [
    { '$match': {
        'VALUE_EURO': {'$lt': 100000000}
    } },
    { '$project': { 
        'YEAR': 1,
        'CAE_NAME': 1,
        'ISO_COUNTRY_CODE': 1,
        'VALUE_EURO':1,
        "ADDR_TOWN": 1
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            'CAE_NAME': '$CAE_NAME',
            "ADDR_TOWN": "$ADDR_TOWN"
        },
        'Count_Contracts_with_Value_Euro': {'$sum': 1},
        'Sum_Value_Euro': {'$sum': '$VALUE_EURO'}
    } },
    { '$project': {
        '_id': 1,
        'Count_Contracts_with_Value_Euro': 1,
        'Sum_Value_Euro': 1
    } },
    { '$merge' : { 'into' : "companies_all_data", 'on': "_id" } }
]
              
agg = eu.aggregate(pipeline, allowDiskUse=True)

NUMBER_OFFERS

In [55]:
# get total_nr_offers and contracts with nr_offers by Company and put information in the previously computed collection   
 
pipeline = [
    { '$match': {
        'NUMBER_OFFERS': {'$exists': True}
    } },
    { '$project': { 
        'YEAR': 1,
        'ISO_COUNTRY_CODE': 1,
        'CAE_NAME': 1,
        'NUMBER_OFFERS':1,
        "ADDR_TOWN": 1
    } },
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE",
            'CAE_NAME': '$CAE_NAME',
            "ADDR_TOWN": "$ADDR_TOWN"
        },
        'Count_Contracts_with_Nr_Offers': {'$sum': 1},
        'Sum_Nr_Offers': {'$sum': '$NUMBER_OFFERS'}
    } },
    { '$project': {
        '_id': 1,
        'Count_Contracts_with_Nr_Offers': 1,
        'Sum_Nr_Offers': 1
    } },
    { '$merge' : { 'into' : "companies_all_data", 'on': "_id" } }
 ]

agg2 = eu.aggregate(pipeline, allowDiskUse=True)

TOTAL CONTRACTS

In [56]:
# get total contracts by Company and put in newly created collection
pipeline = [
    { '$group': {
        '_id': {
            "YEAR": "$YEAR", 
            "ISO_COUNTRY_CODE": "$ISO_COUNTRY_CODE", 
            "CAE_NAME": "$CAE_NAME",
            "ADDR_TOWN": "$ADDR_TOWN"
        },
        'Count_Contracts_Total': {'$sum': 1}
    } },
    { '$merge' : { 'into' : "companies_all_data", 'on': "_id" } }
]
              
agg3 = eu.aggregate(pipeline, allowDiskUse=True)

In [57]:
companies_all_data = db.companies_all_data

ALPHA-3

In [58]:
# get the respective alpha-3 code for each ISO_COUNTRY_CODE, replacing the previous collection
pipeline = [
    { '$lookup': {
        'from': 'iso_codes',
        'localField': '_id.ISO_COUNTRY_CODE',   
        'foreignField': 'alpha-2',  
        'as': 'iso_codes'
    } },
    { '$project': {
        '_id': 1,
        "Count_Contracts_B_EU_Y" : 1,
        "Sum_Value_Euro_B_EU_Y": 1,
        "Count_Contracts_B_EU_N" : 1,
        "Sum_Value_Euro_B_EU_N": 1,
        'Count_Contracts_with_Value_Euro': 1,
        'Sum_Value_Euro': 1, 
        'Count_Contracts_with_Nr_Offers': 1,
        'Sum_Nr_Offers': 1, 
        'Count_Contracts_Total': 1,
        'Alpha3': {'$arrayElemAt': ['$iso_codes.alpha-3', 0]}
    } },
    { '$out': "companies_all_data" }
]
    
companies_all_data.aggregate(pipeline)

<pymongo.command_cursor.CommandCursor at 0x23b1f609490>

In [59]:
# "COMPANIES ID"

# pipeline = [
#     { "$project": {
#        '_id': 0,
#        'CAE_NAME': {'$toString': '$CAE_NAME'},
#     } },
#     {'$group': { 
#         "_id": "$CAE_NAME", 
#         "doc" : {"$first": "$$ROOT"}
#     }},
#     {'$replaceRoot': { 
#         "newRoot": "$doc"
#     }},
#    { '$out': "companies"}
# ]

# agg = eu.aggregate(pipeline, allowDiskUse=True)

In [60]:
# from bson.objectid import ObjectId

# for doc in db.companies.find({'CAE_NAME_ID': {'$exists': False}}):
#     id_doc = doc.get('_id')
    
#     db.companies.update_one(
#         {'_id': id_doc},
#         {"$set": {"CAE_NAME_ID": counter}}
#     )
    
#     counter+=1

In [61]:
# cae_name_ids = list(db.companies.find())

In [62]:
# db.companies_all_data.create_index(
#     [('_id.CAE_NAME', pymongo.TEXT)],
#     name='_id.CAE_NAME'
# )
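The commented-out cell above creates a text index. Text indexes serve `$text` searches, whereas the update loop further down filters with an equality on `_id.CAE_NAME`, which a regular ascending index would support. A minimal sketch of that variant follows; the helper and index names are hypothetical, and the benefit should be measured before relying on it.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def ensure_company_name_index(collection, name="company_name_asc"):
    """Create a regular ascending index on the company name inside the compound _id."""
    return collection.create_index([("_id.CAE_NAME", ASCENDING)], name=name)


# Usage, assuming `db` is the pymongo database handle used in this notebook:
# ensure_company_name_index(db.companies_all_data)
```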

In [63]:
# from tqdm import tqdm

In [64]:
# for doc in tqdm(cae_name_ids):
#     id_doc = doc.get('CAE_NAME_ID')
#     doc_name = doc.get('CAE_NAME')
    
#     db.companies_all_data.update_many(
#         {'_id.CAE_NAME': doc_name},
#         [
#             {"$set": {"CAE_NAME_ID": id_doc}}
#         ]
#     )

In [65]:
# pipeline = [
#     { '$lookup': {
#         'from': 'companies',
#         'localField': '_id.CAE_NAME',   
#         'foreignField': 'CAE_NAME',  
#         'as': 'companies'
#     } }, 
#     { '$project': {
#         '_id.YEAR': '$_id.YEAR',
#         '_id.ISO_COUNTRY_CODE': '$_id.ISO_COUNTRY_CODE',
#         '_id.CAE_NAME_ID': {'$arrayElemAt': ['$companies._id', 0]},
#         '_id.CAE_NAME': '$_id.CAE_NAME',
#         '_id.ADDR_TOWN': '$_id.ADDR_TOWN',
#         "Count_Contracts_B_EU_Y" : 1,
#         "Sum_Value_Euro_B_EU_Y": 1,
#         "Count_Contracts_B_EU_N" : 1,
#         "Sum_Value_Euro_B_EU_N": 1,
#         'Count_Contracts_with_Value_Euro': 1,
#         'Sum_Value_Euro': 1, 
#         'Count_Contracts_with_Nr_Offers': 1,
#         'Sum_Nr_Offers': 1, 
#         'Count_Contracts_Total': 1,
#         'Alpha3': 1,
#     } }, 
#     { '$out': "companies_all_data" }
# ]
    
# companies_all_data.aggregate(pipeline, allowDiskUse=True)
# r15A = companies_all_data.aggregate(pipeline)

# result15A = list(r15A)
# result15A

##### Questions
`Question 15` 5 descriptive metrics of the contracts related to the Company, the average of:

    A. Each Company's contracts average spending ('EURO_VALUE'), (int)  

In [66]:
# %%timeit
pipeline = [
    # some companies have no contracts with a euro value; without this filter Count_contracts_Company_Spending would be 0 and the division would fail
    { '$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True}
    } },
    { '$group': {
        '_id': '$_id.CAE_NAME',
        'Sum_Company_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_Company_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    } },
    { '$project': {
        'Avg_Company_Spending': {'$divide': ['$Sum_Company_Spending', '$Count_contracts_Company_Spending']}
    }},
    { '$group': {
        '_id': False,
        'Avg_Spending_Total': {'$avg': '$Avg_Company_Spending'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Spending_Total': 1
    } }
]

r15A = companies_all_data.aggregate(pipeline)

result15A = list(r15A)
result15A

[{'Avg_Spending_Total': 2326545.0956983697}]

    B. Each Company's contract count, (int) 

In [67]:
pipeline = [
    { '$group': {
        '_id': '$_id.CAE_NAME',
        'Count_contracts_Company': {'$sum': '$Count_Contracts_Total'}
    } },
    { '$group': {
        '_id': False,
        'Avg_Count_Contracts_Total': {'$avg': '$Count_contracts_Company'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Count_Contracts_Total': 1
    } }
]

r15B = companies_all_data.aggregate(pipeline)

result15B = list(r15B)
result15B

[{'Avg_Count_Contracts_Total': 24.281476955307262}]

    C. Each Company's contracts average 'NUMBER_OFFERS', (int)

In [68]:
pipeline = [
    # some companies have no contracts with a number of offers; without this filter Count_contracts_Company_Offers would be 0 and the division would fail
    { '$match': {
        'Count_Contracts_with_Nr_Offers': {'$exists': True}
    } },
    { '$group': {
        '_id': '$_id.CAE_NAME',
        'Sum_Company_Offers': {'$sum': '$Sum_Nr_Offers'},
        'Count_contracts_Company_Offers': {'$sum': '$Count_Contracts_with_Nr_Offers'}
    } },
    { '$project': {
        'Avg_Company_Offers': {'$divide': ['$Sum_Company_Offers', '$Count_contracts_Company_Offers']}
    }},
    { '$group': {
        '_id': False,
        'Avg_NR_Offers_Total': {'$avg': '$Avg_Company_Offers'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_NR_Offers_Total': 1
    } }
]

r15C = companies_all_data.aggregate(pipeline)

result15C = list(r15C)
print(result15C)

[{'Avg_NR_Offers_Total': 5.679427404625429}]


    D. Each Company's contracts average 'EURO_VALUE' with 'B_EU_FUNDS', (int)

In [69]:
pipeline = [
    { '$match': {
        'Count_Contracts_B_EU_Y': {'$exists': True},
    } },
    { '$group': {
        '_id': '$_id.CAE_NAME',
        'Sum_Company_Funds': {'$sum': '$Sum_Value_Euro_B_EU_Y'},
        'Count_Contracts_Company_Funds': {'$sum': '$Count_Contracts_B_EU_Y'}
    } },
    { '$project': {
        'Avg_Company_Spending_Funds': {'$divide': ['$Sum_Company_Funds', '$Count_Contracts_Company_Funds']}
    }},
    { '$group': {
        '_id': False,
        'Avg_Spending_Funds_Total': {'$avg': '$Avg_Company_Spending_Funds'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Spending_Funds_Total': 1
    } }
]

r15D = companies_all_data.aggregate(pipeline)

result15D = list(r15D)
print(result15D)

[{'Avg_Spending_Funds_Total': 2200235.258682528}]


    E. Each Company's contracts average 'EURO_VALUE' without 'B_EU_FUNDS' (int)

In [70]:
#%%timeit
pipeline = [
    { '$match': {
        'Count_Contracts_B_EU_N': {'$exists': True},
    } },
    { '$group': {
        '_id': '$_id.CAE_NAME',
        'Sum_Company_No_Funds': {'$sum': '$Sum_Value_Euro_B_EU_N'},
        'Count_contracts_Company_No_Funds': {'$sum': '$Count_Contracts_B_EU_N'}
    } },
    { '$project': {
        'Avg_Company_Spending_No_Funds': {'$divide': ['$Sum_Company_No_Funds', '$Count_contracts_Company_No_Funds']}
    }},
    { '$group': {
        '_id': False,
        'Avg_Spending_No_Funds_Total': {'$avg': '$Avg_Company_Spending_No_Funds'}
    } }, 
    { '$project': {
        '_id': 0,
        'Avg_Spending_No_Funds_Total': 1
    } }
]

r15E = companies_all_data.aggregate(pipeline)

result15E = list(r15E)
print(result15E)

[{'Avg_Spending_No_Funds_Total': 2315455.980122112}]


`Question 16` Returns the average 'EURO_VALUE' for each company ('CAE_NAME'); return the 5 companies with the highest average

In [71]:
pipeline = [
    # some companies have no contracts with a euro value; without this filter Count_contracts_Company_Spending would be 0 and the division would fail
    { '$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True}
    } },
    { '$group': {
        '_id': '$_id.CAE_NAME',
        'Sum_Company_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_Company_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    } },
    { '$project': {
        '_id': 0,
        'company': '$_id',
        'avg': {'$divide': ['$Sum_Company_Spending', '$Count_contracts_Company_Spending']}
    } },
    { '$sort': {
        'avg': -1
    } },
    { '$limit': 5 }
]

r16 = companies_all_data.aggregate(pipeline)

result16 = list(r16)
print(result16)

[{'company': 'CH Cote d&apos;Argent — Dax---CH Robert Boulin de Libourne---CH Saint Cyr---CH Sud Gironde la Reole — Langon---CH d&apos;Arcachon pole de sante d&apos;Arcachon---CH d&apos;Oloron Sainte Marie---CH d&apos;Orthez---CH de Bergerac---CH de Haute-Gironde — Blaye---CH de Mont de Marsan---CH de Pau---CH de Périgueux---CH de la Cote Basque — Bayonne---CH intercommunal Marmande — Tonneins---CHU de Bordeaux — DAEE---Centre hospitalier général Agen---Groupement aquitaine', 'avg': 99999999.0}, {'company': 'CH Robert Boulin de Libourne---CH Saint-Cyr---CH Sud Gironde la Réole — Langon---CH cote d&apos;Argent — Dax---CH d&apos;Arcachon pole de sante d&apos;Arcachon---CH d&apos;Oloron-Sainte-Marie---CH d&apos;Orthez---CH de Bergerac---CH de Haute-Gironde — Blaye---CH de Mont de Marsan---CH de Pau---CH de Périgueux---CH de la Cote Basque — Bayonne---CH intercommunal Marmande — Tonneins---CHU de Bordeaux---Centre hospitalier Général Agen---Groupement aquitaine', 'avg': 99999999.0}, {'comp
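The company names printed above contain HTML entities such as `&apos;`, and `---` appears to separate several buyers that share one notice (the latter is an assumption based on the output). If readable labels are wanted in the dashboard, a small clean-up step could decode and split them, as sketched here with a hypothetical helper.

```python
import html


def clean_company_name(raw):
    """Decode HTML entities and split joint buyers written as 'A---B---C'."""
    return [part.strip() for part in html.unescape(raw).split("---")]


print(clean_company_name("CH d&apos;Orthez---CH de Pau"))
```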

`Question 17` Returns the average 'EURO_VALUE' for each company ('CAE_NAME'); return the 5 companies with the lowest average

In [72]:
pipeline = [
    # some companies have no contracts with a euro value; without this filter Count_contracts_Company_Spending would be 0 and the division would fail
    { '$match': {
        'Count_Contracts_with_Value_Euro': {'$exists': True}
    } },
    { '$group': {
        '_id': '$_id.CAE_NAME',
        'Sum_Company_Spending': {'$sum': '$Sum_Value_Euro'},
        'Count_contracts_Company_Spending': {'$sum': '$Count_Contracts_with_Value_Euro'}
    } },
    { '$project': {
        '_id': 0,
        'company': '$_id',
        'avg': {'$divide': ['$Sum_Company_Spending', '$Count_contracts_Company_Spending']}
    } },
    { '$sort': {
        'avg': 1
    } },
    { '$limit': 5 }
]

r17 = companies_all_data.aggregate(pipeline)

result17 = list(r17)
print(result17)

# these values are extremely low, but they might be acceptable for "just for paper" transactions

[{'company': 'Hallesche Wasser und Stadtwirtschaft GmbH Abteilung Einkauf', 'avg': 0.009999999999999998}, {'company': 'Landeshauptstadt München, Direktorium – HA II, Vergabestelle 1, Abt. 1/3', 'avg': 0.009999999999999998}, {'company': 'Land Baden-Württemberg vertreten durch die Universität Stuttgart', 'avg': 0.01}, {'company': 'Abfallverwertungsgesellschaft des Landkreises Ludwigsburg mbH---Landkreis Enzkreis, Amt für Abfallwirtschaft---Technischen Diensten (Abfallwirtschaft) der Stadt Pforzheim', 'avg': 0.01}, {'company': 'Jobcenter Lippe Anstalt des öffentlichen Rechts', 'avg': 0.01}]
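
The average in Questions 16 and 17 is a weighted one: the pre-aggregated sums and counts are added up per company first, and only then divided. A minimal plain-Python sketch of that arithmetic follows, assuming flattened rows with hypothetical values (the real documents keep `CAE_NAME` under `_id`). It also skips rows whose count is zero, which is an extra guard that the `$exists` match above does not provide by itself.

```python
from collections import defaultdict

def average_per_company(rows):
    """Sum values and counts per company, then divide once per company."""
    totals = defaultdict(lambda: [0.0, 0])  # company -> [sum of values, count]
    for row in rows:
        count = row.get('Count_Contracts_with_Value_Euro', 0)
        if count <= 0:
            # No contract with a euro value: skipping avoids a division by zero.
            continue
        acc = totals[row['CAE_NAME']]
        acc[0] += row['Sum_Value_Euro']
        acc[1] += count
    return {name: total / count for name, (total, count) in totals.items()}

# Hypothetical pre-aggregated rows (one per company/country/year bucket).
rows = [
    {'CAE_NAME': 'A', 'Sum_Value_Euro': 100.0, 'Count_Contracts_with_Value_Euro': 1},
    {'CAE_NAME': 'A', 'Sum_Value_Euro': 500.0, 'Count_Contracts_with_Value_Euro': 4},
    {'CAE_NAME': 'B', 'Sum_Value_Euro': 0.0, 'Count_Contracts_with_Value_Euro': 0},
]
averages = average_per_company(rows)
print(averages)
```

For company `A` this gives 600 / 5 = 120, whereas averaging the two bucket averages (100 and 125) would give 112.5. That difference is why the pipeline sums before dividing.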


`Question 18` Returns the count of contracts for each company 'CAE_NAME', for the 15 companies with the most contracts

In [73]:
pipeline = [
    {'$group': {
        '_id': '$_id.CAE_NAME',
        'Count_contracts_Company': {'$sum': '$Count_Contracts_Total'}
    }},
    {'$project': {
        '_id': 0,
        'company': '$_id',
        'count': '$Count_contracts_Company'
    }},
    { '$sort': {
        'count': -1
    } },
    { '$limit': 15 }
]

r18 = companies_all_data.aggregate(pipeline)

result18 = list(r18)
print(result18)

[{'company': 'Uniwersyteckie Centrum Kliniczne', 'count': 14061}, {'company': 'Szpital Kliniczny Przemienienia Pańskiego Uniwersytetu Medycznego im. Karola Marcinkowskiego w Poznaniu', 'count': 12779}, {'company': 'Viešoji įstaiga Centrinė projektų valdymo agentūra (126125624)', 'count': 9956}, {'company': 'Lesy Slovenskej republiky, štátny podnik', 'count': 9795}, {'company': 'Szpital Uniwersytecki w Krakowie', 'count': 9624}, {'company': 'Wojewódzki Szpital Zespolony', 'count': 9457}, {'company': 'Klinički bolnički centar Rijeka', 'count': 9144}, {'company': 'Spitalul Clinic Judetean de Urgenta Sf. Spiridon Iasi', 'count': 8925}, {'company': 'Wojskowy Instytut Medyczny', 'count': 8322}, {'company': 'Kompania Węglowa S.A.', 'count': 8282}, {'company': 'Institutul Oncologic Prof. Dr. I. Chiricuta Cluj-Napoca', 'count': 8088}, {'company': 'Sabiedrība ar ierobežotu atbildību “Ogres rajona slimnīca”', 'count': 8049}, {'company': 'Valsts reģionālās attīstības aģentūra', 'count': 8035}, {'c

`Question 19` For each country, get the company ('CAE_NAME') with the highest sum of 'EURO_VALUE'

In [74]:
#%%timeit
pipeline = [ 
    {'$group': {
        '_id': {
            'ISO_CODE': '$Alpha3',
            'CAE_NAME': '$_id.CAE_NAME',
            'ADDR_TOWN': '$_id.ADDR_TOWN'
        },
        'Sum_Company_Spending': {'$sum': '$Sum_Value_Euro'},
    }},
    {'$project': {
        '_id': 0,
        'company': '$_id.CAE_NAME',
        'sum': '$Sum_Company_Spending',
        'country': '$_id.ISO_CODE',
        'address': '$_id.ADDR_TOWN'
    }},
    {'$sort': {
        'country': -1,
        'sum': -1
    }},
    {'$group': {
        '_id': "$country",
        'winner': {
            '$push': {
                'company': "$company",
                'address': '$address',
                'sum': "$sum",
            }
        }
    }},
    {'$project': {
        'winner': {
            '$slice': ["$winner", 1]
        }
    }},
    {'$project': {
        '_id': 0,
        'company': {'$arrayElemAt': ['$winner.company', 0]},
        'sum': {'$arrayElemAt': ['$winner.sum', 0]},
        'country': '$_id',
        'address': {'$arrayElemAt': ['$winner.address', 0]}
    }}
]

r19 = companies_all_data.aggregate(pipeline, allowDiskUse=True)

result19 = list(r19)
result19

[{'company': 'Auftraggeber gemäß den Ausschreibungsunterlagen beiliegender Drittkundenliste der Bundesbeschaffung GmbH',
  'sum': 13995849335.64,
  'country': 'AUT',
  'address': 'lassallestraße 9b wien'},
 {'company': 'De Lijn — Exploitatie',
  'sum': 8952983174.75,
  'country': 'BEL',
  'address': 'motstraat 20 mechelen'},
 {'company': 'УМБАЛСМ „Н. И. Пирогов“ ЕАД',
  'sum': 9024207791.19,
  'country': 'BGR',
  'address': 'бул. „Тотлебен“ № 21 София'},
 {'company': 'Kanton Zürich, Baudirektion, Immobilienamt, Telematik',
  'sum': 390387084.92,
  'country': 'CHE',
  'address': 'walcheplatz 1 zürich'},
 {'company': 'Φαρμακευτικές Υπηρεσίες',
  'sum': 1629064655.98,
  'country': 'CYP',
  'address': 'Λεωφόρος Λάρνακας 7, block b, 4ος όροφος Λευκωσία'},
 {'company': 'Technická zpráva komunikací hlavního města Prahy',
  'sum': 17093433183.06,
  'country': 'CZE',
  'address': 'Řásnovka 770/8 praha 1'},
 {'company': 'Landschaftsverband Rheinland',
  'sum': 5702951520,
  'country': 'DEU',
  '
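
The `$push` + `$slice` + `$arrayElemAt` sequence in Question 19 builds a full array of companies per country only to keep its first element. A possible alternative, sketched below with the same field names as the pipeline above, is to sort and then take `$first` inside the second `$group`. It has not been timed against the original, so treat it as something to try rather than a confirmed improvement.

```python
# Sketch: top company per country using $sort followed by $group with $first.
pipeline_first = [
    # Total spending per (country, company, town), as in the original pipeline.
    {'$group': {
        '_id': {
            'ISO_CODE': '$Alpha3',
            'CAE_NAME': '$_id.CAE_NAME',
            'ADDR_TOWN': '$_id.ADDR_TOWN',
        },
        'sum': {'$sum': '$Sum_Value_Euro'},
    }},
    # Within each country, put the largest sum first.
    {'$sort': {'_id.ISO_CODE': 1, 'sum': -1}},
    # $first keeps the values of the first (largest) document of each country.
    {'$group': {
        '_id': '$_id.ISO_CODE',
        'company': {'$first': '$_id.CAE_NAME'},
        'address': {'$first': '$_id.ADDR_TOWN'},
        'sum': {'$first': '$sum'},
    }},
    {'$project': {'_id': 0, 'country': '$_id', 'company': 1, 'address': 1, 'sum': 1}},
]

# Needs the live collection, so it is left commented out here:
# result19_first = list(companies_all_data.aggregate(pipeline_first, allowDiskUse=True))
```

The order of the groups in the output is not guaranteed, so add a final `$sort` on `country` if the dashboard expects one.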

`Question 20` Returns the top 5 most frequent co-occurring companies ('CAE_NAME' and 'WIN_NAME')

In [75]:
# count CAE_NAME-WIN_NAME co-occurrences by country and year (regardless of the order of the two companies)
pipeline = [
    { "$match": {
        'CAE_NAME': {'$exists': True},
        'WIN_NAME': {'$exists': True}
    }},
    { "$project": {
        '_id': 0,
        'CAE_NAME': {'$toString': '$CAE_NAME'},
        'WIN_NAME': {'$toString': '$WIN_NAME'},
        'ISO_COUNTRY_CODE': 1,
        'YEAR': 1
    } },
    { "$project": {
        '_id': 0,
        'CAE_WIN': {'$cond': [{'$gte': [ "$CAE_NAME", '$WIN_NAME' ] }, # condition
                                {'$concat': [ "$WIN_NAME", ' with ', '$CAE_NAME' ]},  # true case
                                {"$concat": [ "$CAE_NAME", ' with ', '$WIN_NAME' ]} # false case
                                 ] 
                       },
        'ISO_COUNTRY_CODE': 1,
        'YEAR': 1
    } },
    {'$group': {
        '_id': {
            'ISO_COUNTRY_CODE': '$ISO_COUNTRY_CODE',
            'YEAR': '$YEAR',
            'CAE_WIN': '$CAE_WIN',
        },
        'Count': {'$sum': 1}
    }}, 
    { '$out': "companies_occurrences" }
 ]

agg3 = eu.aggregate(pipeline, allowDiskUse=True)

In [76]:
companies_occurrences = db.companies_occurrences

In [77]:
# %%timeit
pipeline = [
    { '$group': {
        '_id': '$_id.CAE_WIN',
        'Co-occurrences': {'$sum': '$Count'}
    } },
    { '$project': {
        '_id': 0,
        'companies': '$_id',
        'count': '$Co-occurrences'
    }},
    { '$sort': {'count': -1}},
    { '$limit': 5}
]

r20 = companies_occurrences.aggregate(pipeline, allowDiskUse = True)

result20 = list(r20)
result20

[{'companies': 'UAB "Limedika" (134056779) with Viešoji įstaiga Centrinė projektų valdymo agentūra (126125624)',
  'count': 1741},
 {'companies': 'SIA “Magnum Medical” with Sabiedrība ar ierobežotu atbildību “Ogres rajona slimnīca”',
  'count': 1643},
 {'companies': "UAB 'Limedika' (134056779) with Viešoji įstaiga Centrinė projektų valdymo agentūra (126125624)",
  'count': 1626},
 {'companies': 'C SYSTEM CZ a.s. with Masarykova univerzita', 'count': 1321},
 {'companies': 'UAB "ARMILA" (123813957) with Viešoji įstaiga Centrinė projektų valdymo agentūra (126125624)',
  'count': 1220}]
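
The `$cond` stage in Question 20 makes the pair key independent of which company is the buyer and which is the winner, by always putting the smaller string first. The same idea in plain Python is sketched below. The function name `pair_key` is hypothetical, and Python's string ordering may differ from MongoDB's for some non-ASCII names, so this only illustrates the logic.

```python
def pair_key(cae_name, win_name):
    """Build an order-independent key: the smaller string always comes first."""
    first, second = sorted([str(cae_name), str(win_name)])
    return f"{first} with {second}"

# Both orders map to the same key, so they are counted together.
key_ab = pair_key('Masarykova univerzita', 'C SYSTEM CZ a.s.')
key_ba = pair_key('C SYSTEM CZ a.s.', 'Masarykova univerzita')
print(key_ab)
```

The key is only as good as the raw names. In the output above, `UAB "Limedika"` and `UAB 'Limedika'` differ only in their quote characters and are therefore counted as two separate pairs.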

### INDEXES

In [78]:
import pymongo

# FOR INSERT LOOKUP
db.iso_codes.create_index(
    [('alpha-2', 1)],
    collation = {'locale': "en"},
    name='alpha-2',
)

db.iso_codes.create_index(
    [('name', pymongo.TEXT)],
    name='name',
)

db.cpv_codes.create_index(
    [('cpv_division_description', pymongo.TEXT)],
    name='cpv_division_description',
)

'cpv_division_description'

In [79]:
# FOR FILTERINGS
companies_all_data.create_index(
    [('_id.ISO_COUNTRY_CODE', 1), ('_id.YEAR', 1)],
    collation = {'locale': "en"},
    name='filter',
)

companies_occurrences.create_index(
    [('_id.ISO_COUNTRY_CODE', 1), ('_id.YEAR', 1)],
    collation = {'locale': "en"},
    name='filter',
)

contracts_value_euro.create_index(
    [('_id.ISO_COUNTRY_CODE', 1), ('_id.YEAR', 1)],
    collation = {'locale': "en"},
    name='filter',
)

countries_all_data.create_index(
    [('_id.ISO_COUNTRY_CODE', 1), ('_id.YEAR', 1)],
    collation = {'locale': "en"},
    name='filter',
)

cpv_div_all_data.create_index(
    [('_id.ISO_COUNTRY_CODE', 1), ('_id.YEAR', 1)],
    collation = {'locale': "en"},
    name='filter',
)

'filter'
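
One caveat about the `filter` indexes above: they are created with a collation, and as far as the MongoDB documentation describes it, a query that compares strings can only use such an index if it specifies the same collation. The sketch below shows how a dashboard filter might pass it. The helper `filtered_pipeline` and the values `'PT'` and `2015` are hypothetical, and the `aggregate` call is commented out because it needs the live database.

```python
# Same collation document as the one used in create_index above.
INDEX_COLLATION = {'locale': 'en'}

def filtered_pipeline(country, year, stages):
    """Prepend a $match on the indexed fields to the given pipeline stages."""
    match = {'$match': {'_id.ISO_COUNTRY_CODE': country, '_id.YEAR': year}}
    return [match, *stages]

# Example: count contracts per company for one country and year (hypothetical values).
pipeline = filtered_pipeline('PT', 2015, [
    {'$group': {'_id': '$_id.CAE_NAME', 'count': {'$sum': '$Count_Contracts_Total'}}},
    {'$sort': {'count': -1}},
    {'$limit': 15},
])

# Passing the collation lets the planner consider the 'filter' index:
# cursor = companies_all_data.aggregate(pipeline, collation=INDEX_COLLATION)
```

Keeping the `$match` as the first stage matters too, since only the leading stages of an aggregation can be served by an index.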

In [80]:
# IMPORTANT FOR QUERY 8
contracts_value_euro.create_index(
    [('CPV_Division', 1)],
    name='CPV_Division',
)

'CPV_Division'

In [81]:
# FOR MATCHES AND PROJECTS
# companies_all_data.create_index(
#     [('_id.CAE_NAME', 1)],
#     name='_id.CAE_NAME',
# #     sparse=True
# )

In [82]:
# db.companies.create_index(
#     [('CAE_NAME', pymongo.TEXT)],
#     name='CAE_NAME'
# )