# Recommendations with IBM
## Data Wrangling

In [1]:
import pandas as pd

### Articles

In [2]:
articles = pd.read_csv("../data/raw/articles_community.csv")
articles.head()

,Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
0,3,Skip navigation Sign in SearchLoading...\r\n\r...,Detect bad readings in real time using Python ...,Detect Malfunctioning IoT Sensors with Streami...,Live,0
1,5,No Free Hunch Navigation * kaggle.com\r\n\r\n ...,"See the forest, see the trees. Here lies the c...",Communicating data science: A guide to present...,Live,1
2,7,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat...,Here’s this week’s news in Data Science and Bi...,"This Week in Data Science (April 18, 2017)",Live,2
3,8,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA...",Learn how distributed DBs solve the problem of...,DataLayer Conference: Boost the performance of...,Live,3
4,12,Skip navigation Sign in SearchLoading...\r\n\r...,This video demonstrates the power of IBM DataS...,Analyze NY Restaurant data using Spark in DSX,Live,4


In [3]:
articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1056 entries, 0 to 1055
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       1056 non-null   int64 
 1   doc_body         1042 non-null   object
 2   doc_description  1053 non-null   object
 3   doc_full_name    1056 non-null   object
 4   doc_status       1056 non-null   object
 5   article_id       1056 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 49.6+ KB


> NOTES

> Remove `Unnamed: 0`

> Reorder/rename columns (`article_id`, `name`, `description`, `body`)

In [4]:
articles[articles.doc_body.isna()]

,Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
206,570,,Watch how to convert XML data to CSV format to...,Load XML data into dashDB,Live,206
276,818,,Love to work in Microsoft Excel? Watch how to ...,Integrate dashDB with Excel,Live,276
484,1488,,See how to evaluate and convert your DDL and S...,Convert IBM Puredata for Analytics to dashDB,Live,483
508,1561,,Watch how to generate SQL-based reports for Cl...,Use dashDB with IBM Embeddable Reporting Service,Live,507
540,1660,,Need to move some data to the cloud for wareho...,Convert data from Oracle to dashDB,Live,539
638,1965,,See how to create a new dashDB instance and po...,Load JSON from Cloudant database into dashDB,Live,637
667,2041,,"See how to connect dashDB, as a source and tar...",Integrate dashDB and Informatica Cloud,Live,666
706,2165,,Aginity Workbench is a free application known ...,Use Aginity Workbench for IBM dashDB,Live,704
842,2593,,Learn how to configure a dashDB connection in ...,Leverage dashDB in Cognos Business Intelligence,Live,839
876,2693,,See how to populate data into a table in your ...,Load data from the desktop into dashDB,Live,873


In [5]:
articles[articles.doc_description.isna()]

,Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
354,1068,The search index lets you create flexible quer...,,Build the search index in Cloudant,Live,354
768,2351,Compose The Compose logo Articles Sign in Free...,,Announcing the Data Browser for JanusGraph,Live,765
919,2833,Cloudant Query is a powerful declarative JSON ...,,Use the new Cloudant query,Live,916


> NOTES

> The description and body are not that important as long as the name/title exists. Leaving as is.
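
The per-column filters above can also be summarized in one pass. The following is a minimal sketch on a toy frame; the rows are hypothetical stand-ins, not the real articles data, and it assumes the title column is the one that must be complete.

```python
import pandas as pd

# Toy stand-in for the articles table (hypothetical rows, not the real data).
toy_articles = pd.DataFrame(
    {
        "doc_full_name": ["title a", "title b", "title c"],
        "doc_description": ["desc a", None, "desc c"],
        "doc_body": [None, "body b", "body c"],
    }
)

# Count missing values per column in a single call.
null_counts = toy_articles.isna().sum()
print(null_counts)

# The title is the only field the notes rely on, so it must have no gaps.
assert null_counts["doc_full_name"] == 0
```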

In [6]:
articles.doc_status.value_counts()

Live    1056
Name: doc_status, dtype: int64

> NOTES

> Drop `doc_status`, as its only value is `Live`
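
Rather than inspecting `value_counts()` column by column, constant columns could be detected programmatically. This is a sketch on a hypothetical toy frame; `constant_columns` is an illustrative name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical): one constant column, one informative column.
toy = pd.DataFrame(
    {
        "doc_status": ["Live", "Live", "Live"],
        "article_id": [0, 1, 2],
    }
)

# dropna=False counts NaN as its own value, so a column that mixes
# "Live" with missing entries would not be flagged as constant.
n_unique = toy.nunique(dropna=False)
constant_columns = n_unique[n_unique <= 1].index.tolist()
print(constant_columns)
```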

In [7]:
articles[articles.article_id.duplicated(keep=False)].sort_values(by="article_id")

,Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
50,129,Follow Sign in / Sign up Home About Insight Da...,Community Detection at Scale,Graph-based machine learning,Live,50
365,1103,Follow Sign in / Sign up Home About Insight Da...,During the seven-week Insight Data Engineering...,Graph-based machine learning,Live,50
221,623,* United States\r\n\r\nIBM® * Site map\r\n\r\n...,When used to make sense of huge amounts of con...,How smart catalogs can turn the big data flood...,Live,221
692,2123,Homepage Follow Sign in / Sign up Homepage * H...,One of the earliest documented catalogs was co...,How smart catalogs can turn the big data flood...,Live,221
232,672,Homepage Follow Sign in Get started Homepage *...,"If you are like most data scientists, you are ...",Self-service data preparation with IBM Data Re...,Live,232
971,3017,Homepage Follow Sign in Get started * Home\r\n...,"If you are like most data scientists, you are ...",Self-service data preparation with IBM Data Re...,Live,232
399,1186,Homepage Follow Sign in Get started * Home\r\n...,Today’s world of data science leverages data f...,Using Apache Spark as a parallel processing fr...,Live,398
761,2324,Homepage Follow Sign in Get started Homepage *...,Today’s world of data science leverages data f...,Using Apache Spark as a parallel processing fr...,Live,398
578,1803,This video shows you how to construct queries ...,This video shows you how to construct queries ...,Use the Primary Index,Live,577
970,3016,This video shows you how to construct queries ...,This video shows you how to construct queries ...,Use the Primary Index,Live,577


> NOTES

> Each duplicated `article_id` refers to the same title, so it is safe to drop duplicates on `article_id`
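
The cleaning cells in this notebook do not show a deduplication step, so the following is only a sketch of how it could be done, using a toy frame with hypothetical rows. Keeping the first occurrence is an assumption; the notes do not say which copy should be preferred.

```python
import pandas as pd

# Toy stand-in (hypothetical rows): article_id 50 appears twice.
toy_articles = pd.DataFrame(
    {
        "article_id": [50, 50, 221, 300],
        "doc_full_name": ["title a", "title a", "title b", "title c"],
    }
)

# Keep the first row for each article_id and renumber the index.
deduped = (
    toy_articles
    .drop_duplicates(subset="article_id", keep="first")
    .reset_index(drop=True)
)

# After this step every article_id should occur exactly once.
assert deduped["article_id"].is_unique
print(len(toy_articles), "->", len(deduped))
```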

**Perform Cleaning**

In [8]:
# Remove `Unnamed: 0` and `doc_status`

articles = articles.drop(columns=["Unnamed: 0", "doc_status"])
articles.head()

,doc_body,doc_description,doc_full_name,article_id
0,Skip navigation Sign in SearchLoading...\r\n\r...,Detect bad readings in real time using Python ...,Detect Malfunctioning IoT Sensors with Streami...,0
1,No Free Hunch Navigation * kaggle.com\r\n\r\n ...,"See the forest, see the trees. Here lies the c...",Communicating data science: A guide to present...,1
2,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat...,Here’s this week’s news in Data Science and Bi...,"This Week in Data Science (April 18, 2017)",2
3,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA...",Learn how distributed DBs solve the problem of...,DataLayer Conference: Boost the performance of...,3
4,Skip navigation Sign in SearchLoading...\r\n\r...,This video demonstrates the power of IBM DataS...,Analyze NY Restaurant data using Spark in DSX,4


In [9]:
# Reorder/rename columns (`article_id`, `name`, `description`, `body`)

articles = articles.rename(columns={ "doc_full_name": "name", "doc_description": "description", "doc_body": "body" })
articles = articles[["article_id", "name", "description", "body"]]
articles

,article_id,name,description,body
0,0,Detect Malfunctioning IoT Sensors with Streami...,Detect bad readings in real time using Python ...,Skip navigation Sign in SearchLoading...\r\n\r...
1,1,Communicating data science: A guide to present...,"See the forest, see the trees. Here lies the c...",No Free Hunch Navigation * kaggle.com\r\n\r\n ...
2,2,"This Week in Data Science (April 18, 2017)",Here’s this week’s news in Data Science and Bi...,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat...
3,3,DataLayer Conference: Boost the performance of...,Learn how distributed DBs solve the problem of...,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA..."
4,4,Analyze NY Restaurant data using Spark in DSX,This video demonstrates the power of IBM DataS...,Skip navigation Sign in SearchLoading...\r\n\r...
...,...,...,...,...
1051,1046,A look under the covers of PouchDB-find,PouchDB uses MapReduce as its default search m...,PouchDB-find is a new API and syntax that allo...
1052,1047,A comparison of logistic regression and naive ...,We compare discriminative and generative learn...,We compare discriminative and generative learn...
1053,1048,What I Learned Implementing a Classifier from ...,In order to demystify some of the magic behind...,"Essays about data, building products and boots..."
1054,1049,Use dashDB with Spark,Learn how to use IBM dashDB as data store for ...,


In [10]:
articles.to_csv("../data/processed/articles.csv", index=False)

### User Interaction

In [11]:
interactions = pd.read_csv("../data/raw/user-item-interactions.csv")
interactions.head()

,Unnamed: 0,article_id,title,email
0,0,1430.0,"using pixiedust for fast, flexible, and easier...",ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1,1,1314.0,healthcare python streaming application demo,083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2,2,1429.0,use deep learning for image classification,b96a4f2e92d8572034b1e9b28f9ac673765cd074
3,3,1338.0,ml optimization using cognitive assistant,06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4,4,1276.0,deploy your python model as a restful api,f01220c46fc92c6e6b161b1849de11faacd7ccb2


In [12]:
interactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45993 entries, 0 to 45992
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  45993 non-null  int64  
 1   article_id  45993 non-null  float64
 2   title       45993 non-null  object 
 3   email       45976 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 1.4+ MB


> NOTES

> Drop `Unnamed: 0`

> Drop `title`, since titles can be looked up in the articles table

> Convert article_id to `int64`
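
Dropping `title` assumes that every `article_id` in the interactions can be found in the articles table. The sample rows above show IDs such as 1430, while the articles table has 1056 rows, so it may be worth checking this on the real data first. Below is a minimal sketch on toy frames; the rows and the name `missing_ids` are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows, not the real data).
toy_articles = pd.DataFrame(
    {"article_id": [0, 1, 2], "name": ["a", "b", "c"]}
)
toy_interactions = pd.DataFrame(
    {"article_id": [1.0, 2.0, 1430.0, 1430.0], "title": ["b", "c", "x", "x"]}
)

# Distinct interaction IDs, cast to int so they compare cleanly.
interaction_ids = (
    toy_interactions["article_id"].astype("int64").drop_duplicates()
)

# IDs that appear in interactions but have no row in articles.
known = interaction_ids.isin(toy_articles["article_id"])
missing_ids = interaction_ids[~known].tolist()
print(missing_ids)
```

If `missing_ids` is non-empty on the real data, the `title` column is the only place those titles are recorded, and dropping it would lose them.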

In [13]:
interactions[interactions.email.isna()]

,Unnamed: 0,article_id,title,email
25131,25146,1016.0,why you should master r (even if it might even...,
29758,30157,1393.0,the nurse assignment problem,
29759,30158,20.0,working interactively with rstudio and noteboo...,
29760,30159,1174.0,breast cancer wisconsin (diagnostic) data set,
29761,30160,62.0,data visualization: the importance of excludin...,
35264,36016,224.0,"using apply, sapply, lapply in r",
35276,36029,961.0,beyond parallelize and collect,
35277,36030,268.0,sector correlations shiny app,
35278,36031,268.0,sector correlations shiny app,
35279,36032,268.0,sector correlations shiny app,


> NOTES

> Some interactions have a null user (`email`). This likely does not matter when we only fetch the "popular" articles, so the rows are left as is.
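
Whether null emails matter depends on how "popular" is defined. The sketch below, on hypothetical toy rows, contrasts two possible definitions; neither is confirmed as the one used later in this project.

```python
import pandas as pd

# Toy interactions (hypothetical): some rows have no user email.
toy_interactions = pd.DataFrame(
    {
        "article_id": [268, 268, 268, 20, 20, 62],
        "email": [None, None, "u1", "u2", None, "u3"],
    }
)

# Popularity as raw interaction count: rows with a null email still count.
top_by_views = toy_interactions["article_id"].value_counts()

# Popularity as distinct users: nunique() skips null emails by default.
top_by_users = toy_interactions.groupby("article_id")["email"].nunique()

print(top_by_views)
print(top_by_users)
```

With the first definition the null rows are harmless; with the second they are silently ignored, which changes the ranking in this toy example.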

**Perform Cleaning**

In [14]:
# Drop `Unnamed: 0` and `title`

interactions = interactions.drop(columns=["Unnamed: 0", "title"])
interactions.head()

,article_id,email
0,1430.0,ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1,1314.0,083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2,1429.0,b96a4f2e92d8572034b1e9b28f9ac673765cd074
3,1338.0,06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4,1276.0,f01220c46fc92c6e6b161b1849de11faacd7ccb2


In [15]:
# Convert article_id to `int64`

interactions.article_id = interactions.article_id.astype("int64")
interactions.head()

,article_id,email
0,1430,ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1,1314,083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2,1429,b96a4f2e92d8572034b1e9b28f9ac673765cd074
3,1338,06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4,1276,f01220c46fc92c6e6b161b1849de11faacd7ccb2
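
A plain cast to `int64` can discard a fractional part without warning and fails on missing values, so a guard before the cast can be useful. This is a sketch on a toy series with values taken from the sample rows above; the check itself is an addition, not part of the original cell.

```python
import pandas as pd

# Toy series using IDs from the sample rows shown above.
toy_ids = pd.Series([1430.0, 1314.0, 1429.0], name="article_id")

# Only cast when there are no missing values and every value is whole.
is_whole = toy_ids.notna().all() and (toy_ids % 1 == 0).all()
assert is_whole, "article_id has missing or fractional values"

article_ids = toy_ids.astype("int64")
print(article_ids.dtype)
```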


In [16]:
interactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45993 entries, 0 to 45992
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   article_id  45993 non-null  int64 
 1   email       45976 non-null  object
dtypes: int64(1), object(1)
memory usage: 718.8+ KB


In [17]:
interactions.to_csv("../data/processed/interactions.csv", index=False)