# Use Pandas for ETL

Now it’s time to write some simple ETL jobs for data analysis. Our goal is to create a denormalized fact table in our presentation area.

In this notebook we discuss the process steps (divided into extract, transform and load) needed to clean the source data, aggregate the records and, finally, load them into our document store.

Overview of our ETL steps:

![picture](https://drive.google.com/uc?id=1h60hvtzWmZYHJyuOaONpYiyNLmsQTlje)

## Load and extract the source file

First of all we need to load raw data (from CSV files) into our environment.

In [None]:
from google.colab import files

uploaded = files.upload()

Saving 1_ds_project_details_full.csv to 1_ds_project_details_full.csv


In [None]:
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

User uploaded file "1_ds_project_details_full.csv" with length 4400323 bytes


Most of our ETL code revolves around the following functions:
- drop_duplicates
- dropna
- replace / fillna
- df[df['column'] != value]: filtering
- apply: transform, or adding new column
- merge: SQL like inner, left, or right join
- groupby
- read_csv / to_csv

Functions like drop_duplicates and dropna are nice abstractions that can replace many SQL statements.
Likewise, replace / fillna is a typical step for cleaning up the values in the data.

All these features are available from pandas.
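
As a rough illustration of how these functions chain together, here is a minimal sketch on toy data. The `projects` and `locations` frames, their columns and their values are made up for the example and are not taken from the project files.

```python
import pandas as pd

# Toy data (hypothetical, not from the project files)
projects = pd.DataFrame({
    "project_id": [1, 2, 2, 3, 4],
    "title": ["A", "B", "B", "C", None],
    "category": ["Film", "Art", "Art", "Film", "Music"],
    "funds": [100.0, 50.0, 50.0, None, 20.0],
})
locations = pd.DataFrame({
    "project_id": [1, 2, 3],
    "city": ["Rome", "Milan", "Turin"],
})

cleaned = (
    projects
    .drop_duplicates(subset=["project_id"])  # keep the first row per project_id
    .dropna(subset=["title"])                # drop rows without a title
    .fillna({"funds": 0.0})                  # replace missing funds with 0
)

# Filtering: keep only rows that raised something
cleaned = cleaned[cleaned["funds"] > 0]

# apply: derive a new column from an existing one
cleaned = cleaned.assign(funds_k=cleaned["funds"].apply(lambda v: v / 1000))

# merge: SQL-like left join on project_id
joined = cleaned.merge(locations, on="project_id", how="left")

# groupby: total funds per category
totals = joined.groupby("category")["funds"].sum()
print(totals)
```

Each step returns a new DataFrame, which is why the calls can be chained and inspected one at a time in a notebook.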


In [2]:
import pandas as pd
import io

In [3]:
ds_project_details_full = pd.read_csv('/content/ds_project_details_full.csv')
# SQL equivalent of a groupby count: pd.read_sql("select field, count(*) from table group by field", connection)

In [None]:
# ds_project_details_full = pd.read_csv(io.BytesIO(uploaded['ds_project_details_full.csv']))


In [4]:
ds_project_details_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            10000 non-null  int64  
 1   bullet_point          1 non-null      object 
 2   category              10000 non-null  object 
 3   category_url          10000 non-null  object 
 4   clickthrough_url      10000 non-null  object 
 5   close_date            9999 non-null   object 
 6   currency              10000 non-null  object 
 7   funds_raised_amount   10000 non-null  int64  
 8   funds_raised_percent  10000 non-null  float64
 9   image_url             10000 non-null  object 
 10  is_indemand           10000 non-null  bool   
 11  is_pre_launch         10000 non-null  bool   
 12  is_proven             10000 non-null  bool   
 13  offered_by            0 non-null      float64
 14  open_date             9999 non-null   object 
 15  perk_goal_percentage

Data processing is often exploratory.
We need to see the shape of the data, and write our next line of code based on our previous output. So the process is iterative.

A tool that pairs well with Python and pandas for this kind of work is Jupyter Notebook or Google Colab.

In [5]:
ds_project_details_full.head()

Unnamed: 0.1,Unnamed: 0,bullet_point,category,category_url,clickthrough_url,close_date,currency,funds_raised_amount,funds_raised_percent,image_url,...,perk_goal_percentage,perks_claimed,price_offered,price_retail,product_stage,project_id,project_type,tagline,tags,title
0,0,,Video Games,/explore/video-games,/projects/odin-the-ultimate-gaming-handheld,2021-10-03T23:59:59-07:00,HKD,29696921,49.70425,https://c1.iggcdn.com/indiegogo-media-prod-cld...,...,,,,,,2685187,campaign,"Flagship gaming handheld. FHD 1080p 6"" touch s...","['computers', 'pc', 'laptops']",Odin: The Ultimate Gaming Handheld
1,1,,Video Games,/explore/video-games,/projects/g-case-all-in-one-gaming-case-for-sw...,2022-03-11T23:59:59-08:00,HKD,5388665,30.820762,https://c1.iggcdn.com/indiegogo-media-prod-cld...,...,,,,,,2739227,campaign,Modular Battery | Interchangeable Grips | Deta...,"['bluetooth', 'batteries', 'design']",G-Case: All-In-One Gaming Case for Switch & OLED
2,2,,Film,/explore/film,/projects/super-troopers-2,2015-04-24T23:59:59-07:00,USD,4617223,2.081839,https://c1.iggcdn.com/indiegogo-media-prod-cld...,...,,,,,,1166581,campaign,"The #SuperTroopers2 campaign is over, but the ...",['other'],Super Troopers 2
3,3,,Web Series & TV Shows,/explore/web-series-tv-shows,/projects/con-man,2015-04-10T23:59:59-07:00,USD,3156178,7.347459,https://c1.iggcdn.com/indiegogo-media-prod-cld...,...,,,,,,1143140,campaign,A new comedy from Alan Tudyk and Nathan Fillio...,['other'],Con Man
4,4,,Art,/explore/art,/projects/artbook-that-photographed-gods-who-d...,2022-02-18T23:59:59-08:00,JPY,3114937,3.082077,https://c1.iggcdn.com/indiegogo-media-prod-cld...,...,,,,,,2735280,campaign,This concept is coming from teaching of Shinto...,"['books', 'design', 'other', 'professional']",ArtBook that photographed Gods who dwell in na...


In [6]:
number_of_records = ds_project_details_full.shape[0]
print(f"Number of records loaded {number_of_records}")

Number of records loaded 10000


## Transform

After loading the raw data, let's carry out the initial cleaning tasks.

Since we want to upload the data to MongoDB, we should immediately add a unique identifier (_id on MongoDB).

To build our staging table **st_projects** (stored in MongoDB as the `st_project_cleaned` collection), we:
- drop **duplicates**
- select only the **necessary columns**
- remove **anomalous records**

In [7]:
# Add the id
ds_project_details_full['_id'] = ds_project_details_full['project_id']
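
Because MongoDB requires `_id` to be unique within a collection, it can be worth checking the candidate column before relying on it. The following is a minimal sketch on a made-up frame; `df` and its values are hypothetical stand-ins for `ds_project_details_full`.

```python
import pandas as pd

# Hypothetical sample; the real data comes from ds_project_details_full
df = pd.DataFrame({"project_id": [10, 11, 11], "title": ["a", "b", "b"]})

# An _id candidate should have no missing values and no repeats
has_missing = df["project_id"].isna().any()
is_unique = df["project_id"].is_unique

# List the ids that occur more than once, to inspect them before loading
duplicated_ids = (
    df.loc[df["project_id"].duplicated(keep=False), "project_id"]
    .unique()
    .tolist()
)
print(has_missing, is_unique, duplicated_ids)
```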

In [8]:
# Remove duplicates
ds_project_no_duplicates = ds_project_details_full.drop_duplicates(subset=['title'])
ds_project_no_duplicates = ds_project_no_duplicates.drop_duplicates(subset=['tagline'])

In [9]:
number_of_records_without = ds_project_no_duplicates.shape[0]
print(f"-- Number of records without duplicates {number_of_records_without}")

-- Number of records without duplicates 9907
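
Note that dropping duplicates on `title` and then on `tagline` is stricter than dropping rows where both columns repeat together. A small sketch on invented data shows the difference:

```python
import pandas as pd

# Invented rows to compare the two strategies
df = pd.DataFrame({
    "title":   ["A", "A", "B", "C"],
    "tagline": ["x", "y", "x", "z"],
})

# Sequential: drop on title, then on tagline (as in the cells above)
sequential = (
    df.drop_duplicates(subset=["title"])
      .drop_duplicates(subset=["tagline"])
)

# Combined: drop only rows where BOTH title and tagline repeat
combined = df.drop_duplicates(subset=["title", "tagline"])

print(len(sequential), len(combined))
```

With the sequential approach, a row is removed as soon as either its title or its tagline has already been seen, so it discards more records than the combined subset.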


In [10]:
ds_project_no_duplicates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9907 entries, 0 to 9999
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            9907 non-null   int64  
 1   bullet_point          1 non-null      object 
 2   category              9907 non-null   object 
 3   category_url          9907 non-null   object 
 4   clickthrough_url      9907 non-null   object 
 5   close_date            9906 non-null   object 
 6   currency              9907 non-null   object 
 7   funds_raised_amount   9907 non-null   int64  
 8   funds_raised_percent  9907 non-null   float64
 9   image_url             9907 non-null   object 
 10  is_indemand           9907 non-null   bool   
 11  is_pre_launch         9907 non-null   bool   
 12  is_proven             9907 non-null   bool   
 13  offered_by            0 non-null      float64
 14  open_date             9906 non-null   object 
 15  perk_goal_percentage 

In [13]:
# Select only some features
ds_project_no_duplicates["project_url"] = ds_project_no_duplicates["clickthrough_url"]
ds_project_features = ds_project_no_duplicates[['_id', 'project_id', 'title', 'project_url',
                                                'tags', 'tagline', 'open_date', 'funds_raised_amount',
                                                'funds_raised_percent', 'currency', 'close_date', 'category']]

In [14]:
ds_project_features.head()

Unnamed: 0,_id,project_id,title,project_url,tags,tagline,open_date,funds_raised_amount,funds_raised_percent,currency,close_date,category
0,2685187,2685187,Odin: The Ultimate Gaming Handheld,/projects/odin-the-ultimate-gaming-handheld,"['computers', 'pc', 'laptops']","Flagship gaming handheld. FHD 1080p 6"" touch s...",2021-08-19T00:00:00-07:00,29696921,49.70425,HKD,2021-10-03T23:59:59-07:00,Video Games
1,2739227,2739227,G-Case: All-In-One Gaming Case for Switch & OLED,/projects/g-case-all-in-one-gaming-case-for-sw...,"['bluetooth', 'batteries', 'design']",Modular Battery | Interchangeable Grips | Deta...,2022-03-10T23:59:59-08:00,5388665,30.820762,HKD,2022-03-11T23:59:59-08:00,Video Games
2,1166581,1166581,Super Troopers 2,/projects/super-troopers-2,['other'],"The #SuperTroopers2 campaign is over, but the ...",2015-03-24T10:00:57-07:00,4617223,2.081839,USD,2015-04-24T23:59:59-07:00,Film
3,1143140,1143140,Con Man,/projects/con-man,['other'],A new comedy from Alan Tudyk and Nathan Fillio...,2015-03-10T14:48:01-07:00,3156178,7.347459,USD,2015-04-10T23:59:59-07:00,Web Series & TV Shows
4,2735280,2735280,ArtBook that photographed Gods who dwell in na...,/projects/artbook-that-photographed-gods-who-d...,"['books', 'design', 'other', 'professional']",This concept is coming from teaching of Shinto...,2022-02-17T23:59:59-08:00,3114937,3.082077,JPY,2022-02-18T23:59:59-08:00,Art


In [15]:
# Remove noise
ds_project_cleaned = ds_project_features[(ds_project_features['funds_raised_percent'] > 0) & (ds_project_features['funds_raised_percent'] < 1000)]

In [16]:
# Remove null values in tagline
ds_project_cleaned = ds_project_cleaned[ds_project_cleaned.tagline.notnull()]

In [17]:
number_of_records_without_noise = ds_project_cleaned.shape[0]
print(f"-- Number of records without noise {number_of_records_without_noise}")

-- Number of records without noise 9903
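
When cleaning rules pile up, it can help to name each boolean mask and count what it removes before applying it. This is a sketch on made-up values that uses the same thresholds as above; the frame `df` is hypothetical.

```python
import pandas as pd

# Made-up values covering each case
df = pd.DataFrame({
    "funds_raised_percent": [0.0, 0.5, 12.0, 1000.0, 2500.0],
    "tagline": ["a", None, "c", "d", "e"],
})

# One named mask per cleaning rule
in_range = (df["funds_raised_percent"] > 0) & (df["funds_raised_percent"] < 1000)
has_tagline = df["tagline"].notna()

# How many rows each rule would reject on its own
dropped_by_range = int((~in_range).sum())
dropped_by_tagline = int((~has_tagline).sum())

# Keep only the rows that pass both rules
cleaned = df[in_range & has_tagline]
print(dropped_by_range, dropped_by_tagline, len(cleaned))
```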


# Load data in MongoDB

Now that the records are ready, we follow a Big Data approach:
- we load the raw data, as is, into a collection that holds all the source records (`sc_project`)
- we load the cleaned data into the staging collection (`st_project_cleaned`)

For the connection to MongoDB we will use the **pymongo** library.

In [18]:
!pip install pymongo

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [20]:
import pymongo
import json
from pymongo import UpdateOne

In [21]:
# Connection string redacted: uncomment the next line and fill in your own credentials before running.
#client = pymongo.MongoClient("mongodb://xxxxx:xxxx@xxxxx:27017,xxxxx:27017,xxxx:27017/myFirstDatabase?ssl=true&replicaSet=atlas-14k1wg-shard-0&authSource=admin&retryWrites=true&w=majority")
db = client.indiegogo

MongoDB stores data in BSON (**binary JSON**) format.

We convert our pandas DataFrame to JSON and build the list of upsert operations (update or insert) for our collection.

In [22]:
records = json.loads(ds_project_details_full.T.to_json()).values()
upserts=[UpdateOne({'_id':x['_id']}, {'$setOnInsert':x}, upsert=True) for x in records]
db.sc_project.bulk_write(upserts)

<pymongo.results.BulkWriteResult at 0x7f0fa10dd790>

In [23]:
records = json.loads(ds_project_cleaned.T.to_json()).values()
upserts=[UpdateOne({'_id':x['_id']}, {'$setOnInsert':x}, upsert=True) for x in records]
db.st_project_cleaned.bulk_write(upserts)

<pymongo.results.BulkWriteResult at 0x7f0fa21a5590>
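
To make the effect of `$setOnInsert` with `upsert=True` concrete, here is a plain-Python simulation. It does not use pymongo: the dict `collection` and the helper `upsert_set_on_insert` are hypothetical stand-ins that only mimic the documented behaviour, where the fields are written when no document matches the filter and an existing document is left untouched.

```python
def upsert_set_on_insert(collection, record):
    """Mimic UpdateOne({'_id': ...}, {'$setOnInsert': record}, upsert=True).

    `collection` is a plain dict keyed by _id, standing in for a MongoDB
    collection. Returns True if the record was inserted, False if a document
    with the same _id already existed and was left unchanged.
    """
    key = record["_id"]
    if key in collection:
        return False
    collection[key] = dict(record)
    return True


collection = {}
first_run = [{"_id": 1, "title": "Odin"}, {"_id": 2, "title": "Con Man"}]
second_run = [{"_id": 1, "title": "Odin (edited)"}, {"_id": 3, "title": "G-Case"}]

inserted_first = sum(upsert_set_on_insert(collection, r) for r in first_run)
inserted_second = sum(upsert_set_on_insert(collection, r) for r in second_run)

print(inserted_first, inserted_second, collection[1]["title"])
```

Re-running the load is therefore safe, but it never refreshes documents that already exist, and documents from earlier runs stay in the collection. This may be one reason why the document counts read back from MongoDB later in the notebook differ from the row counts printed here.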

# Extract and load image details

We now do the same work on the list of concepts extracted from the images with the API.

In [24]:
from google.colab import files
uploaded = files.upload()

In [25]:
#ds_img_details_full = pd.read_csv(io.BytesIO(uploaded['ds_img_details_full.csv']))
ds_img_details_full = pd.read_csv('/content/ds_img_details_full.csv')

In [26]:
ds_img_details_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   20000 non-null  int64  
 1   project_id   20000 non-null  int64  
 2   project_url  20000 non-null  object 
 3   image        20000 non-null  object 
 4   name         20000 non-null  object 
 5   value        20000 non-null  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 937.6+ KB


In [27]:
number_of_records = ds_img_details_full.shape[0]
print(f"Number of records loaded {number_of_records}")

Number of records loaded 20000


In [29]:
ds_img_details_full.head()

Unnamed: 0.1,Unnamed: 0,project_id,project_url,image,name,value
0,0,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,telephone,0.998494
1,1,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,technology,0.996249
2,2,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,screen,0.99502
3,3,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,cellular telephone,0.993292
4,4,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,portable,0.992356


In [30]:
records = json.loads(ds_img_details_full.T.to_json()).values()
db.sc_images.insert_many(records)

<pymongo.results.InsertManyResult at 0x7f0f9c2e1350>

In [31]:
ds_img_details_full['concepts'] = ds_img_details_full. \
  apply(lambda row: {'name': row['name'], 'value': row['value']}, axis=1)

In [33]:
ds_img_details_full.head()

Unnamed: 0.1,Unnamed: 0,project_id,project_url,image,name,value,concepts
0,0,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,telephone,0.998494,"{'name': 'telephone', 'value': 0.998494267463684}"
1,1,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,technology,0.996249,"{'name': 'technology', 'value': 0.996249377727..."
2,2,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,screen,0.99502,"{'name': 'screen', 'value': 0.9950199127197266}"
3,3,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,cellular telephone,0.993292,"{'name': 'cellular telephone', 'value': 0.9932..."
4,4,2685187,/projects/odin-the-ultimate-gaming-handheld,img_1.jpg,portable,0.992356,"{'name': 'portable', 'value': 0.992356300354004}"


In [36]:
ds_images_aggregate = ds_img_details_full.groupby('project_url')['concepts'].apply(list).reset_index(name="concepts")

In [37]:
ds_images_aggregate.head()

Unnamed: 0,project_url,concepts
0,/projects/1-618-beauty-unearthed-the-golden-ra...,"[{'name': 'no person', 'value': 0.989616751670..."
1,/projects/100k-for-7-am-documentary-promotion,"[{'name': 'time', 'value': 0.9851144552230836}..."
2,/projects/13-fanboy,"[{'name': 'abstract', 'value': 0.9790816307067..."
3,/projects/1804-the-hidden-history-of-haiti,"[{'name': 'people', 'value': 0.980213224887848..."
4,/projects/198x,"[{'name': 'typography', 'value': 0.99401372671..."
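
The two steps above (one dict per row, then one list per project) can be sketched on a tiny made-up frame. As an alternative to the row-wise `apply`, this version builds the dicts with `to_dict("records")`. The `img` frame is hypothetical and only shaped like `ds_img_details_full`.

```python
import pandas as pd

# Hypothetical rows shaped like ds_img_details_full
img = pd.DataFrame({
    "project_url": ["/projects/a", "/projects/a", "/projects/b"],
    "name": ["screen", "portable", "people"],
    "value": [0.99, 0.98, 0.97],
})

# One {'name': ..., 'value': ...} dict per row, without a row-wise apply
img["concepts"] = img[["name", "value"]].to_dict("records")

# One row per project_url, with its concepts gathered into a list
aggregate = (
    img.groupby("project_url")["concepts"]
       .apply(list)
       .reset_index(name="concepts")
)
print(aggregate)
```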


Document databases also support complex data types, so we can load our records as they are, each with an array of concepts.

In [38]:
ds_images_aggregate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   project_url  1000 non-null   object
 1   concepts     1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


In [39]:
ds_images_aggregate['_id'] = ds_images_aggregate['project_url']

In [40]:
records = json.loads(ds_images_aggregate.T.to_json()).values()
upserts=[UpdateOne({'_id':x['_id']}, {'$setOnInsert':x}, upsert=True) for x in records]
db.st_concepts.bulk_write(upserts)

<pymongo.results.BulkWriteResult at 0x7f0f9bbc53d0>

# Location data

In [41]:
ds_location = pd.read_csv('/content/ds_project_location_full.csv')

In [44]:
number_of_records = ds_location.shape[0]
print(f"Number of records loaded {number_of_records}")

Number of records loaded 10000


In [45]:
records = json.loads(ds_location.T.to_json()).values()
db.sc_location.insert_many(records)

<pymongo.results.InsertManyResult at 0x7f0f9a6ad490>

In [46]:
ds_location.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   10000 non-null  int64  
 1   project_url  10000 non-null  object 
 2   project_id   10000 non-null  int64  
 3   lat          9991 non-null   float64
 4   lng          9991 non-null   float64
dtypes: float64(2), int64(2), object(1)
memory usage: 390.8+ KB


Next, we clean the location data, keeping only the records with a non-null `project_url`.

In [47]:
ds_location_cleaned = ds_location[ds_location["project_url"].notnull()]

In [48]:
number_of_records = ds_location_cleaned.shape[0]
print(f"Number of records cleaned {number_of_records}")

Number of records cleaned 10000


In [49]:
records = json.loads(ds_location_cleaned.T.to_json()).values()
upserts=[UpdateOne({'_id':x['project_url']}, {'$setOnInsert':x}, upsert=True) for x in records]
db.st_locations.bulk_write(upserts)

<pymongo.results.BulkWriteResult at 0x7f0f9bbdb710>

# Data to presentation layer

Let's now build the final fact table: the goal is to create a denormalized table ready for analysis.

In [1]:
import pymongo
import json
from pymongo import UpdateOne
import pandas as pd

In [2]:
# Connection string redacted: uncomment the next line and fill in your own credentials before running.
#client = pymongo.MongoClient("mongodb://xxxx:xx@xxxx:27017,xxxx:27017,xxx:27017/myFirstDatabase?ssl=true&replicaSet=atlas-14k1wg-shard-0&authSource=admin&retryWrites=true&w=majority")
db = client.indiegogo

In [3]:
st_project_cleaned = db.st_project_cleaned
st_concepts = db.st_concepts
st_locations = db.st_locations

In [4]:
st_project_cleaned.count_documents({})

9922

In [5]:
st_concepts.count_documents({})

1108

In [6]:
st_locations.count_documents({})

10187

In [7]:
# db.collection.find({}).forEach(function(x) {
#    t = db.collection2.findOne({key: x.key})
# })

In [8]:
### Example MongoDB -- NOT RUN!!!!

In [9]:
result = db.sc_images.aggregate([
    {
        '$match': {
            'value': {
                '$gt': 0.95
            }
        }
    }, {
        '$group': {
            '_id': '$project_id', 
            'count': {
                '$sum': 1
            }
        }
    }, {
        '$out': 'st_after_aggregate'
    }
])
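
For comparison, the `$match` and `$group` stages of this pipeline have a close pandas counterpart. The sketch below runs on made-up rows (`img` is hypothetical) and skips the `$out` stage, which writes the result to a collection.

```python
import pandas as pd

# Made-up rows with the two fields used by the pipeline
img = pd.DataFrame({
    "project_id": [1, 1, 1, 2, 2],
    "value": [0.99, 0.96, 0.90, 0.97, 0.50],
})

counts = (
    img[img["value"] > 0.95]        # $match: value > 0.95
    .groupby("project_id")          # $group: _id = $project_id
    .size()                         # count: {$sum: 1}
    .reset_index(name="count")
    .rename(columns={"project_id": "_id"})
)
print(counts)
```

As in the MongoDB pipeline, projects with no concept above the threshold do not appear in the result at all.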

In [10]:
# Join collections

In [12]:
df_concepts =  pd.DataFrame(list(st_concepts.find({}))).drop(columns=['_id'])
df_concepts = df_concepts[df_concepts["project_url"].notnull()]
df_concepts

Unnamed: 0,concepts,project_id,project_url
108,"[{'name': 'no person', 'value': 0.9896167517},...",,/projects/1-618-beauty-unearthed-the-golden-ra...
109,"[{'name': 'time', 'value': 0.9851144552}, {'na...",,/projects/100k-for-7-am-documentary-promotion
110,"[{'name': 'abstract', 'value': 0.9790816307}, ...",,/projects/13-fanboy
111,"[{'name': 'people', 'value': 0.9802132249}, {'...",,/projects/1804-the-hidden-history-of-haiti
112,"[{'name': 'typography', 'value': 0.9940137267}...",,/projects/198x
...,...,...,...
1103,"[{'name': 'illustration', 'value': 0.998546540...",,/projects/zhelter-pixel-action-survival-game
1104,"[{'name': 'illustration', 'value': 0.991601288...",,/projects/zombie-tsunami-the-board-game
1105,"[{'name': 'illustration', 'value': 0.976869404...",,/projects/zombies-20th-anniversary-edition-lat...
1106,"[{'name': 'danger', 'value': 0.9972725511}, {'...",,/projects/zore-a-new-generation-of-gun-storage--2


In [13]:
df_projects_cleaned =  pd.DataFrame(list(st_project_cleaned.find({})))
df_projects_cleaned

Unnamed: 0,_id,category,close_date,currency,funds_raised_amount,funds_raised_percent,open_date,project_id,tagline,tags,title,project_url
0,2394811,Tabletop Games,2018-10-22T23:59:59-07:00,USD,462432,42.744458,2018-10-21T23:59:59-07:00,2394811,Survive against all odds facing monsters & per...,"['fantasy', 'indie']",Unbroken: a solo game of survival and revenge,
1,400689,Video Games,2013-07-05T23:59:59-07:00,USD,644301,2.684588,2013-05-21T04:11:15-07:00,400689,Help bring the first Tobuscus Adventures Game ...,['other'],Help Build the 'Tobuscus Adventures: THE GAME'...,
2,1759114,Writing & Publishing,2016-05-25T23:59:59-07:00,USD,1287686,16.890350,2016-05-24T23:59:59-07:00,1759114,A book that inspires girls with the stories of...,"['kids', 'books', 'female founders', 'social i...",Good Night Stories for Rebel Girls,
3,2406069,Film,2018-10-04T23:59:59-07:00,EUR,171792,1.347060,2018-08-24T00:00:00-07:00,2406069,"""Psychomagic, an art to heal"" has been shot, h...",['documentary'],"PSYCHOMAGIC, AN ART TO HEAL",
4,2632142,Video Games,2020-10-06T23:59:59-07:00,USD,892729,20.053769,2020-10-05T23:59:59-07:00,2632142,Get Fit Fighting Your Way Through A Fantasy Wo...,"['wireless', 'apps', 'computers', 'design', 's...",QUELL: Real Fitness. Real Gaming.,
...,...,...,...,...,...,...,...,...,...,...,...,...
9917,201730,Music,2012-09-29T23:59:59-07:00,USD,10728,1.072800,2012-08-15T11:02:54-07:00,201730,MUSIC HEALS. MUSIC FEELS. MUSIC SAVES.\r\nA NE...,['food'],Julie Neumark: NEU album!!!,/projects/julie-neumark-neu-album
9918,1192654,Music,2015-05-13T23:59:59-07:00,CAD,10726,1.072600,2015-04-13T08:22:46-07:00,1192654,Help Montreal's Disadvantaged Youth Find Their...,['other'],Sing Montréal Chante,/projects/sing-montreal-chante
9919,854401,Dance & Theater,2014-09-18T23:59:59-07:00,USD,10725,0.268125,2014-07-20T08:02:43-07:00,854401,Support us in making a difference - promote Co...,['other'],First exposure of Contemporary Israeli Circus,/projects/first-exposure-of-contemporary-israe...
9920,1526606,Art,2016-01-08T23:59:59-08:00,USD,10725,1.165625,2015-12-09T21:20:02-08:00,1526606,A book to display the work and art of the trad...,['other'],Chikara: The Art of Horimitsu,/projects/chikara-the-art-of-horimitsu


In [14]:
df_locations_cleaned =  pd.DataFrame(list(st_locations.find({}))).drop(columns=['_id'])
df_locations_cleaned

Unnamed: 0.1,Unnamed: 0,lat,lng,project_id,project_url
0,0.0,22.544267,114.054533,2685187.0,
1,1.0,22.264412,114.167061,2739227.0,
2,10.0,34.052238,-118.243344,2656903.0,
3,24.0,34.052238,-118.243344,1759114.0,
4,15.0,37.869058,-122.270455,712200.0,
...,...,...,...,...,...
10182,9995.0,34.090684,-118.371751,201730.0,/projects/julie-neumark-neu-album
10183,9996.0,45.509062,-73.553363,1192654.0,/projects/sing-montreal-chante
10184,9997.0,32.436990,34.919826,854401.0,/projects/first-exposure-of-contemporary-israe...
10185,9998.0,49.263566,-123.138572,1526606.0,/projects/chikara-the-art-of-horimitsu


In [15]:
df_projects_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9922 entries, 0 to 9921
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   _id                   9922 non-null   int64  
 1   category              9922 non-null   object 
 2   close_date            9920 non-null   object 
 3   currency              9922 non-null   object 
 4   funds_raised_amount   9922 non-null   int64  
 5   funds_raised_percent  9922 non-null   float64
 6   open_date             9920 non-null   object 
 7   project_id            9922 non-null   int64  
 8   tagline               9921 non-null   object 
 9   tags                  9919 non-null   object 
 10  title                 9922 non-null   object 
 11  project_url           100 non-null    object 
dtypes: float64(1), int64(3), object(8)
memory usage: 930.3+ KB


In [16]:
df_concepts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 108 to 1107
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   concepts     1000 non-null   object
 1   project_id   0 non-null      object
 2   project_url  1000 non-null   object
dtypes: object(3)
memory usage: 31.2+ KB


In [17]:
df_ft_projects = df_projects_cleaned \
  .merge(df_locations_cleaned, on='project_url', how='left')

In [18]:
df_ft_projects = df_ft_projects \
  .merge(df_concepts, on='project_url', how='left')

In [19]:
df_ft_projects.head()

Unnamed: 0.1,_id,category,close_date,currency,funds_raised_amount,funds_raised_percent,open_date,project_id_x,tagline,tags,title,project_url,Unnamed: 0,lat,lng,project_id_y,concepts,project_id
0,2394811,Tabletop Games,2018-10-22T23:59:59-07:00,USD,462432,42.744458,2018-10-21T23:59:59-07:00,2394811,Survive against all odds facing monsters & per...,"['fantasy', 'indie']",Unbroken: a solo game of survival and revenge,,0.0,22.544267,114.054533,2685187.0,,
1,2394811,Tabletop Games,2018-10-22T23:59:59-07:00,USD,462432,42.744458,2018-10-21T23:59:59-07:00,2394811,Survive against all odds facing monsters & per...,"['fantasy', 'indie']",Unbroken: a solo game of survival and revenge,,1.0,22.264412,114.167061,2739227.0,,
2,2394811,Tabletop Games,2018-10-22T23:59:59-07:00,USD,462432,42.744458,2018-10-21T23:59:59-07:00,2394811,Survive against all odds facing monsters & per...,"['fantasy', 'indie']",Unbroken: a solo game of survival and revenge,,10.0,34.052238,-118.243344,2656903.0,,
3,2394811,Tabletop Games,2018-10-22T23:59:59-07:00,USD,462432,42.744458,2018-10-21T23:59:59-07:00,2394811,Survive against all odds facing monsters & per...,"['fantasy', 'indie']",Unbroken: a solo game of survival and revenge,,24.0,34.052238,-118.243344,1759114.0,,
4,2394811,Tabletop Games,2018-10-22T23:59:59-07:00,USD,462432,42.744458,2018-10-21T23:59:59-07:00,2394811,Survive against all odds facing monsters & per...,"['fantasy', 'indie']",Unbroken: a solo game of survival and revenge,,15.0,37.869058,-122.270455,712200.0,,


In [20]:
df_ft_projects.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1836814 entries, 0 to 1836813
Data columns (total 18 columns):
 #   Column                Dtype  
---  ------                -----  
 0   _id                   int64  
 1   category              object 
 2   close_date            object 
 3   currency              object 
 4   funds_raised_amount   int64  
 5   funds_raised_percent  float64
 6   open_date             object 
 7   project_id_x          int64  
 8   tagline               object 
 9   tags                  object 
 10  title                 object 
 11  project_url           object 
 12  Unnamed: 0            float64
 13  lat                   float64
 14  lng                   float64
 15  project_id_y          float64
 16  concepts              object 
 17  project_id            object 
dtypes: float64(5), int64(3), object(10)
memory usage: 266.3+ MB
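
The jump to 1,836,814 rows is worth a closer look. In a pandas merge, rows whose key is null on both sides are matched against each other, unlike in a SQL join. Here `project_url` is null in most project records, so every such project is paired with every location that has no URL. The following minimal sketch reproduces the effect on invented data and shows one possible guard; the frames and values are hypothetical.

```python
import pandas as pd

# Invented frames with null join keys on both sides
projects = pd.DataFrame({
    "project_id": [1, 2, 3],
    "project_url": [None, None, "/projects/c"],
})
locations = pd.DataFrame({
    "project_url": [None, None, "/projects/c"],
    "lat": [22.5, 34.0, 45.5],
})

# Null keys match each other: 2 x 2 rows for the nulls, plus 1 real match
naive = projects.merge(locations, on="project_url", how="left")

# Dropping null keys on the right side keeps one output row per project
safe = projects.merge(
    locations.dropna(subset=["project_url"]),
    on="project_url",
    how="left",
)
print(len(naive), len(safe))
```

The observed row count is consistent with this explanation: 9,822 projects without a URL (9,922 minus the 100 non-null values) times 187 locations without a URL, plus 100 real matches, gives 1,836,814. Assuming `project_id` is populated on both sides, joining on it instead of `project_url` would be another option to consider.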


In [None]:
records = json.loads(df_ft_projects.T.to_json()).values()
upserts=[UpdateOne({'_id':x['_id']}, {'$setOnInsert':x}, upsert=True) for x in records]
db.ft_projects.bulk_write(upserts)