## **Data Ingestion and Exploratory Analysis**

The New York Times has a wide audience and plays a prominent role in shaping public opinion on current affairs and in setting the tone of public discourse, especially in the USA. The comment sections on its articles are very active and give a glimpse of readers' views on the topics the articles cover.

The data covers comments made on articles published in The New York Times in Jan-May 2017 and Jan-April 2018. Each month is given as two CSV files: one for the articles that received comments and one for the comments themselves. In total, the comment files contain over 2 million comments with 34 features, and the article files contain 16 features for more than 9,000 articles.


In [2]:
#pip install pandarallel
import multiprocessing

num_processors = multiprocessing.cpu_count()
print(f'Available CPUs: {num_processors}')

from pandarallel import pandarallel

# Leave one core free for the notebook itself, but always keep at least one worker
pandarallel.initialize(nb_workers=max(1, num_processors - 1), use_memory_fs=False)

Available CPUs: 8
INFO: Pandarallel will run on 7 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [3]:
import pandas as pd
import numpy as np

from textblob import TextBlob

import warnings 
warnings.filterwarnings('ignore')

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from plotly.offline import init_notebook_mode, iplot
import plotly.figure_factory as ff
init_notebook_mode(connected=True)
plt.style.use('fivethirtyeight')
%matplotlib inline

import gc   # For memory management

import re   # Regex

----
### **Part0: Data Ingestion**

In [5]:
# Importing all comments

curr_dir = '/Users/kshitijmittal/Documents/UChicago Acad/03 Quarter 3/01 ML/NYT_Reader_Feedback/00_Data/'
df1 = pd.read_csv(curr_dir + 'CommentsJan2017.csv')
df2 = pd.read_csv(curr_dir + 'CommentsFeb2017.csv')
df3 = pd.read_csv(curr_dir + 'CommentsMarch2017.csv')
df4 = pd.read_csv(curr_dir + 'CommentsApril2017.csv')
df5 = pd.read_csv(curr_dir + 'CommentsMay2017.csv')
df6 = pd.read_csv(curr_dir + 'CommentsJan2018.csv')
df7 = pd.read_csv(curr_dir + 'CommentsFeb2018.csv')
df8 = pd.read_csv(curr_dir + 'CommentsMarch2018.csv')
df9 = pd.read_csv(curr_dir + 'CommentsApril2018.csv')

comments_all = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9])
comments_all.drop_duplicates(subset='commentID', inplace=True)
comments_all.head(3)

Unnamed: 0,approveDate,articleID,articleWordCount,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,...,status,timespeople,trusted,updateDate,userDisplayName,userID,userLocation,userTitle,userURL,typeOfMaterial
0,1483455908,58691a5795d0e039260788b9,1324.0,For all you Americans out there --- still rejo...,20969730.0,20969730.0,<br/>,comment,1483426000.0,1.0,...,approved,1.0,0.0,1483455908,N. Smith,64679318.0,New York City,,,News
1,1483455656,58691a5795d0e039260788b9,1324.0,Obamas policies may prove to be the least of t...,20969325.0,20969325.0,<br/>,comment,1483417000.0,1.0,...,approved,1.0,0.0,1483455656,Kilocharlie,69254188.0,Phoenix,,,News
2,1483455655,58691a5795d0e039260788b9,1324.0,Democrats are comprised of malcontents who gen...,20969855.0,20969855.0,<br/>,comment,1483431000.0,1.0,...,approved,1.0,0.0,1483455655,Frank Fryer,76788711.0,Florida,,,News
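
The nine `read_csv` calls above follow one naming pattern, so the same load can be sketched as a loop. This is a minimal sketch rather than the notebook's own code: `load_monthly`, `MONTHS`, and the parameter names are hypothetical, and it assumes the files are named `<kind><Month><Year>.csv` as in the cell above. Unlike the cell above, it passes `ignore_index=True`, so the combined frame gets a fresh 0..n-1 index instead of repeating each file's row labels.

```python
from pathlib import Path

import pandas as pd

# Month labels as they appear in the file names used above.
MONTHS = ["Jan2017", "Feb2017", "March2017", "April2017", "May2017",
          "Jan2018", "Feb2018", "March2018", "April2018"]


def load_monthly(data_dir, kind, id_col, months=MONTHS):
    """Read <kind><month>.csv for each month and stack the results.

    Hypothetical helper: `kind` is 'Comments' or 'Articles' and
    `id_col` is the column used to drop repeated rows.
    """
    frames = [pd.read_csv(Path(data_dir) / f"{kind}{m}.csv") for m in months]
    combined = pd.concat(frames, ignore_index=True)
    # Keep the first occurrence of each ID, then renumber the rows.
    return combined.drop_duplicates(subset=id_col).reset_index(drop=True)
```

With the notebook's variables this would be called as `load_monthly(curr_dir, 'Comments', 'commentID')` and `load_monthly(curr_dir, 'Articles', 'articleID')`. Because the index changes, row labels shown by later `sample()` calls would differ from the ones printed in this notebook.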


In [6]:
# Importing all articles

df1 = pd.read_csv(curr_dir + 'ArticlesJan2017.csv')
df2 = pd.read_csv(curr_dir + 'ArticlesFeb2017.csv')
df3 = pd.read_csv(curr_dir + 'ArticlesMarch2017.csv')
df4 = pd.read_csv(curr_dir + 'ArticlesApril2017.csv')
df5 = pd.read_csv(curr_dir + 'ArticlesMay2017.csv')
df6 = pd.read_csv(curr_dir + 'ArticlesJan2018.csv')
df7 = pd.read_csv(curr_dir + 'ArticlesFeb2018.csv')
df8 = pd.read_csv(curr_dir + 'ArticlesMarch2018.csv')
df9 = pd.read_csv(curr_dir + 'ArticlesApril2018.csv')

articles_all = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9])
articles_all.drop_duplicates(subset='articleID', inplace=True)
articles_all.head(3)

Unnamed: 0,articleID,abstract,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,58691a5795d0e039260788b9,,By JENNIFER STEINHAUER,article,G.O.P. Leadership Poised to Topple Obama’s Pi...,"['United States Politics and Government', 'Law...",1,National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led...,The New York Times,News,https://www.nytimes.com/2017/01/01/us/politics...,1324
1,586967bf95d0e03926078915,,By MARK LANDLER,article,Fractured World Tested the Hope of a Young Pre...,"['Obama, Barack', 'Afghanistan', 'United State...",1,Foreign,1,2017-01-01 20:34:00,Asia Pacific,A strategy that went from a “good war” to the ...,The New York Times,News,https://www.nytimes.com/2017/01/01/world/asia/...,2836
2,58698a1095d0e0392607894a,,By CAITLIN LOVINGER,article,Little Troublemakers,"['Crossword Puzzles', 'Boxing Day', 'Holidays ...",1,Games,0,2017-01-01 23:00:24,Unknown,Chuck Deodene puts us in a bubbly mood.,The New York Times,News,https://www.nytimes.com/2017/01/01/crosswords/...,445


In [7]:
# Memory management
del df1, df2, df3, df4, df5, df6, df7, df8, df9
gc.collect()

656

In [8]:
print(f"We have {comments_all.shape[0]:,} comments")
print(f"On {articles_all.shape[0]:,} articles")

We have 2,118,617 comments
On 9,298 articles


----
### **Part1: First Look at Comments dataset**

In [9]:
print("shape:",comments_all.shape)
comments_all.head()

shape: (2118617, 34)


Unnamed: 0,approveDate,articleID,articleWordCount,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,...,status,timespeople,trusted,updateDate,userDisplayName,userID,userLocation,userTitle,userURL,typeOfMaterial
0,1483455908,58691a5795d0e039260788b9,1324.0,For all you Americans out there --- still rejo...,20969730.0,20969730.0,<br/>,comment,1483426000.0,1.0,...,approved,1.0,0.0,1483455908,N. Smith,64679318.0,New York City,,,News
1,1483455656,58691a5795d0e039260788b9,1324.0,Obamas policies may prove to be the least of t...,20969325.0,20969325.0,<br/>,comment,1483417000.0,1.0,...,approved,1.0,0.0,1483455656,Kilocharlie,69254188.0,Phoenix,,,News
2,1483455655,58691a5795d0e039260788b9,1324.0,Democrats are comprised of malcontents who gen...,20969855.0,20969855.0,<br/>,comment,1483431000.0,1.0,...,approved,1.0,0.0,1483455655,Frank Fryer,76788711.0,Florida,,,News
3,1483455653,58691a5795d0e039260788b9,1324.0,The picture in this article is the face of con...,20969407.0,20969407.0,<br/>,comment,1483419000.0,1.0,...,approved,1.0,0.0,1483455653,James Young,72718862.0,Seattle,,,News
4,1483455216,58691a5795d0e039260788b9,1324.0,Elections have consequences.,20969274.0,20969274.0,,comment,1483417000.0,1.0,...,approved,1.0,0.0,1483455216,M.,7529267.0,Seattle,,,News


In [10]:
pd.options.display.max_colwidth=None
comments_all['commentBody'].sample(5, random_state=42)

187423    Thank you Rep. Mast. Once a hero, always a hero.  Many commenters want amendments to your proposals but as they stand are very reasonable efforts to deal with a multifaceted problem. When the 2nd Amendment was written “mass shootings” existed only in close quarters on the battlefield, and a civilian bombing would require a horse drawn carriage with kegs of gunpowder. Eighteenth century thinking will not resolve 21st century problems. I’m not a constituent, nor even a Republican voter, but If you stick to your guns on gun control, I will put your name on my political donation list. 
75386

In [11]:
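# Cleaning comments body text

def preprocess(commentBody):
    # Strip HTML line breaks (adjacent sentences end up joined without a space)
    commentBody = commentBody.str.replace('(<br/>)', '', regex=True)
    # Strip anchor tags together with their link text
    # (greedy .*: on a line with several links, the text between them is removed as well)
    commentBody = commentBody.str.replace('(<a).*(>).*(</a>)', '', regex=True)
    # Strip HTML entities; the trailing ';' is not part of these patterns,
    # so '&amp;' leaves a lone ';' behind
    commentBody = commentBody.str.replace('(&amp)', '', regex=True)
    commentBody = commentBody.str.replace('(&gt)', '', regex=True)
    commentBody = commentBody.str.replace('(&lt)', '', regex=True)
    # Replace non-breaking spaces with regular spaces
    commentBody = commentBody.str.replace('(\xa0)', ' ', regex=True)  
    return commentBody

comments_all.commentBody = preprocess(comments_all.commentBody)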
comments_all['commentBody'].sample(5, random_state=42)

187423    Thank you Rep. Mast. Once a hero, always a hero.  Many commenters want amendments to your proposals but as they stand are very reasonable efforts to deal with a multifaceted problem. When the 2nd Amendment was written “mass shootings” existed only in close quarters on the battlefield, and a civilian bombing would require a horse drawn carriage with kegs of gunpowder. Eighteenth century thinking will not resolve 21st century problems. I’m not a constituent, nor even a Republican voter, but If you stick to your guns on gun control, I will put your name on my political donation list. 
75386
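
The `preprocess` function above deletes entity names such as `&amp` but not the `;` that follows them, and it removes `<br/>` without leaving a space. Both effects are visible in later output, for example `'Checks ; Balances'` and `land.Beware.`. A possible alternative, sketched here for a single string, decodes entities with the standard library's `html.unescape` instead of stripping them. The name `clean_comment` and the two compiled patterns are hypothetical and not part of the notebook.

```python
import html
import re

# Non-greedy match, so two links on one line are removed separately.
_ANCHOR = re.compile(r"<a\b[^>]*>.*?</a>", flags=re.IGNORECASE | re.DOTALL)
_BREAK = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def clean_comment(text):
    """Hypothetical alternative to preprocess() for one comment string."""
    text = _BREAK.sub(" ", text)      # keep a space where a line break was
    text = _ANCHOR.sub("", text)      # drop links together with their link text
    text = html.unescape(text)        # '&amp;' -> '&', '&lt;' -> '<', ...
    text = text.replace("\xa0", " ")  # non-breaking space -> plain space
    return text


print(clean_comment("Checks &amp; Balances.<br/>Beware."))
```

On a pandas column this could be applied with `comments_all['commentBody'].map(clean_comment, na_action='ignore')`, which leaves missing values untouched. Switching to it would change the cleaned text, so the outputs shown in the rest of this notebook would no longer match exactly.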

In [12]:
# Removing mentions and URLs
# Note: str(x) also turns missing values into the string 'nan'
comments_all['commentBody']=comments_all['commentBody'].apply(lambda x: re.sub(r'(?:@|https?://|www)\S+','',str(x)))

# Removing new lines
comments_all['commentBody']=comments_all['commentBody'].apply(lambda x: re.sub(r'(?:\n)','',str(x)))

# Removing hashtags
comments_all['commentBody']=comments_all['commentBody'].apply(lambda x: re.sub(r'(?:\B#\w*[a-zA-Z]+\w*)','',str(x)))

comments_all['commentBody'].sample(5, random_state=123)

167418
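
To see what the three substitutions in the cell above do, they can be run step by step on a made-up string. The sample text is invented for illustration; the patterns mirror the ones in the cell, with the URL alternatives written as a single `https?://`.

```python
import re

MENTION_OR_URL = re.compile(r"(?:@|https?://|www)\S+")
NEWLINE = re.compile(r"\n")
HASHTAG = re.compile(r"\B#\w*[a-zA-Z]+\w*")

sample = "Thanks @nytimes, see https://example.com/a?b=1 now.\n#Resist is not #1"

step1 = MENTION_OR_URL.sub("", sample)  # mention and URL removed up to the next space
step2 = NEWLINE.sub("", step1)          # newline removed without inserting a space
step3 = HASHTAG.sub("", step2)          # hashtags containing a letter removed

print(repr(step1))
print(repr(step2))
print(repr(step3))
```

Two details stand out in this sketch. Because `\S+` runs to the next whitespace, the comma after `@nytimes` is removed along with the mention. The hashtag pattern requires at least one letter, so `#1` is kept while `#Resist` is dropped.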

In [13]:
comments_all.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2118617 entries, 0 to 264923
Data columns (total 34 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   approveDate            int64  
 1   articleID              object 
 2   articleWordCount       float64
 3   commentBody            object 
 4   commentID              float64
 5   commentSequence        float64
 6   commentTitle           object 
 7   commentType            object 
 8   createDate             float64
 9   depth                  float64
 10  editorsSelection       object 
 11  inReplyTo              float64
 12  newDesk                object 
 13  parentID               float64
 14  parentUserDisplayName  object 
 15  permID                 object 
 16  picURL                 object 
 17  printPage              float64
 18  recommendations        float64
 19  recommendedFlag        float64
 20  replyCount             float64
 21  reportAbuseFlag        float64
 22  sectionName            o

In [14]:
pd.options.display.max_columns=None
comments_all.head()

Unnamed: 0,approveDate,articleID,articleWordCount,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,editorsSelection,inReplyTo,newDesk,parentID,parentUserDisplayName,permID,picURL,printPage,recommendations,recommendedFlag,replyCount,reportAbuseFlag,sectionName,sharing,status,timespeople,trusted,updateDate,userDisplayName,userID,userLocation,userTitle,userURL,typeOfMaterial
0,1483455908,58691a5795d0e039260788b9,1324.0,"For all you Americans out there --- still rejoicing over the majority win of Republicans over the Legislature of this land.Beware.Just like you would have been, if there were any other kind of majority.The Founding Fathers had something like this in mind wheh they formed our Great Nation.It's part of the natural 'Checks ; Balances' system that keeps this country on an even keel.But this is now being threatened -- with the majority of Executive, Legislative and Judidical power all in the hands of one political party.See to it that you remember our U.S. Constitution, and our Bill of Rights.Remember that ""We"", are still ""the People"".America belongs to all of us.And God help us all.",20969730.0,20969730.0,<br/>,comment,1483426000.0,1.0,0,0.0,National,0.0,,20969730,https://graphics8.nytimes.com/images/apps/timespeople/none.png,1.0,5.0,,0.0,,Politics,0,approved,1.0,0.0,1483455908,N. Smith,64679318.0,New York City,,,News
1,1483455656,58691a5795d0e039260788b9,1324.0,Obamas policies may prove to be the least of this countrys worries. The GOP has been emboldened to actually cut the OCE in the night before the new president or congress is sworn in following in their leaders footsteps. Everyone is crying foul including the GOP leadership and conservative think tanks. How about a twitter regarding the swamp draining tonight?,20969325.0,20969325.0,<br/>,comment,1483417000.0,1.0,0,0.0,National,0.0,,20969325,https://graphics8.nytimes.com/images/apps/timespeople/none.png,1.0,3.0,,0.0,,Politics,0,approved,1.0,0.0,1483455656,Kilocharlie,69254188.0,Phoenix,,,News
2,1483455655,58691a5795d0e039260788b9,1324.0,Democrats are comprised of malcontents who generally have never worked a day in their lives and thrive on handouts from the Govt; fire 1/2 of the Govt. non workers and save billions Trust is earned and our elected officials have not earned our trust. The don't adhere to the U.S. Constitution as it restricts their movement. Thank goodness for the 2nd Amendment and the NRA or they would have already enslaved us.,20969855.0,20969855.0,<br/>,comment,1483431000.0,1.0,0,0.0,National,0.0,,20969855,https://graphics8.nytimes.com/images/apps/timespeople/none.png,1.0,3.0,,0.0,,Politics,0,approved,1.0,0.0,1483455655,Frank Fryer,76788711.0,Florida,,,News
3,1483455653,58691a5795d0e039260788b9,1324.0,"The picture in this article is the face of congressional leadership. It would seem that only Harry Reid seemed to know that his time has passed. Conversely, so should Mitch McConnell, his Times has come and gone, he forces people his own age to retire while he goes on being a Senator.The Senator's ideas, and ways of thinking are no longer mainstream, he hasn't the political will or the intestinal fortitude that it will take to keep his president from running this country into the ground. It's time that our elected representatives have term limits imposed. Mitch McConnell was raised in a white mans world, where women were excluded. Even though he never throughly supported Trump. Yet, his wife gets a cabinet appointment, the GOP leadership lead by McConnell, strips down the only department that makes any real sense was the independent ethics board. Under the guise of how much better, how much more stream line, the ethics complaints will be handled. Essentially, setting it up for their president to enrich himself as he always had, and does, only no ethics rules won't apply anymore. More and more this isn't a country by the people and for the people. No it's becoming a country of the have and have nots, and us being the have nots, will shortly have lots less.The Democrats would be right in using the emolument clause and the constitutional issues Trump will bring, against the GOP, offer them as much help as they offered Obama, and by extension us the citizens of this country.",20969407.0,20969407.0,<br/>,comment,1483419000.0,1.0,0,0.0,National,0.0,,20969407,https://graphics8.nytimes.com/images/apps/timespeople/none.png,1.0,3.0,,0.0,,Politics,0,approved,1.0,0.0,1483455653,James Young,72718862.0,Seattle,,,News
4,1483455216,58691a5795d0e039260788b9,1324.0,Elections have consequences.,20969274.0,20969274.0,,comment,1483417000.0,1.0,0,0.0,National,0.0,,20969274,https://graphics8.nytimes.com/images/apps/timespeople/none.png,1.0,3.0,,0.0,,Politics,0,approved,1.0,0.0,1483455216,M.,7529267.0,Seattle,,,News


In [15]:
# Keeping only top-level comments, not replies

print(comments_all.commentType.value_counts())
comments_all_v2=comments_all[comments_all.commentType=='comment']
print("reduced data set shape:",comments_all_v2.shape)

commentType
comment          1550485
userReply         567819
reporterReply        313
Name: count, dtype: int64
reduced data set shape: (1550485, 34)


In [16]:
# memory management
del comments_all
gc.collect()

0

In [17]:
# Comments span many sections; just over half carry the section name 'Unknown'
comments_all_v2['sectionName'].value_counts()

sectionName
Unknown                801341
Politics               353785
Sunday Review          109860
Europe                  32064
Middle East             22257
                        ...  
Entertainment               6
Real Estate                 5
Fashion & Beauty            3
Cricket                     1
Opinion | The World         1
Name: count, Length: 63, dtype: int64

In [18]:
comments_all_v2['newDesk'].value_counts()

newDesk
OpEd               574613
National           265991
Washington         147128
Editorial          124079
Foreign             88090
Business            70954
Learning            32425
Magazine            31412
Upshot              25047
Well                19715
Metro               19153
Culture             18847
Science             17795
Sports              16808
Games               12507
Dining              12010
Politics             7449
Investigative        7387
Styles               5749
RealEstate           4904
Arts&Leisure         4880
SundayBusiness       4706
Insider              4610
Travel               4558
Climate              4428
Unknown              4187
Metropolitan         3739
Weekend              3577
Express              2780
Obits                2217
BookReview           2043
NewsDesk             2035
EdLife               1281
SpecialSections       895
Letters               871
Smarter Living        642
Video                 256
Photo                 241
TSty

In [19]:
# Only keeping relevant columns
relevant_cols = ['articleID','articleWordCount','commentID','commentBody','sectionName',
                 'newDesk','userLocation','editorsSelection','recommendations','trusted']
comments_all_v3=comments_all_v2[relevant_cols]
comments_all_v3.head()

Unnamed: 0,articleID,articleWordCount,commentID,commentBody,sectionName,newDesk,userLocation,editorsSelection,recommendations,trusted
0,58691a5795d0e039260788b9,1324.0,20969730.0,"For all you Americans out there --- still rejoicing over the majority win of Republicans over the Legislature of this land.Beware.Just like you would have been, if there were any other kind of majority.The Founding Fathers had something like this in mind wheh they formed our Great Nation.It's part of the natural 'Checks ; Balances' system that keeps this country on an even keel.But this is now being threatened -- with the majority of Executive, Legislative and Judidical power all in the hands of one political party.See to it that you remember our U.S. Constitution, and our Bill of Rights.Remember that ""We"", are still ""the People"".America belongs to all of us.And God help us all.",Politics,National,New York City,0,5.0,0.0
1,58691a5795d0e039260788b9,1324.0,20969325.0,Obamas policies may prove to be the least of this countrys worries. The GOP has been emboldened to actually cut the OCE in the night before the new president or congress is sworn in following in their leaders footsteps. Everyone is crying foul including the GOP leadership and conservative think tanks. How about a twitter regarding the swamp draining tonight?,Politics,National,Phoenix,0,3.0,0.0
2,58691a5795d0e039260788b9,1324.0,20969855.0,Democrats are comprised of malcontents who generally have never worked a day in their lives and thrive on handouts from the Govt; fire 1/2 of the Govt. non workers and save billions Trust is earned and our elected officials have not earned our trust. The don't adhere to the U.S. Constitution as it restricts their movement. Thank goodness for the 2nd Amendment and the NRA or they would have already enslaved us.,Politics,National,Florida,0,3.0,0.0
3,58691a5795d0e039260788b9,1324.0,20969407.0,"The picture in this article is the face of congressional leadership. It would seem that only Harry Reid seemed to know that his time has passed. Conversely, so should Mitch McConnell, his Times has come and gone, he forces people his own age to retire while he goes on being a Senator.The Senator's ideas, and ways of thinking are no longer mainstream, he hasn't the political will or the intestinal fortitude that it will take to keep his president from running this country into the ground. It's time that our elected representatives have term limits imposed. Mitch McConnell was raised in a white mans world, where women were excluded. Even though he never throughly supported Trump. Yet, his wife gets a cabinet appointment, the GOP leadership lead by McConnell, strips down the only department that makes any real sense was the independent ethics board. Under the guise of how much better, how much more stream line, the ethics complaints will be handled. Essentially, setting it up for their president to enrich himself as he always had, and does, only no ethics rules won't apply anymore. More and more this isn't a country by the people and for the people. No it's becoming a country of the have and have nots, and us being the have nots, will shortly have lots less.The Democrats would be right in using the emolument clause and the constitutional issues Trump will bring, against the GOP, offer them as much help as they offered Obama, and by extension us the citizens of this country.",Politics,National,Seattle,0,3.0,0.0
4,58691a5795d0e039260788b9,1324.0,20969274.0,Elections have consequences.,Politics,National,Seattle,0,3.0,0.0


In [30]:
# Checking uniqueness of comments database
print("length of unique comment ids:",len(comments_all_v3['commentID'].unique()))
print("shape of comments database",comments_all_v3.shape[0])

# All comment ids are unique, so no further duplicate removal is needed

length of unique comment ids: 1550485
shape of comments database 1550485


-----
### **Part2: First Look at Articles dataset**

In [20]:
articles_all.head(2)

Unnamed: 0,articleID,abstract,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,58691a5795d0e039260788b9,,By JENNIFER STEINHAUER,article,G.O.P. Leadership Poised to Topple Obama’s Pillars,"['United States Politics and Government', 'Law and Legislation', 'House of Representatives', 'Senate', 'Patient Protection and Affordable Care Act (2010)', 'Trump, Donald J', 'McConnell, Mitch']",1,National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led Congress in 20 years plans quick action on several priorities — most notably to clear a path for the repeal of President Obama’s health care law.,The New York Times,News,https://www.nytimes.com/2017/01/01/us/politics/with-new-congress-poised-to-convene-obamas-policies-are-in-peril.html,1324
1,586967bf95d0e03926078915,,By MARK LANDLER,article,Fractured World Tested the Hope of a Young President,"['Obama, Barack', 'Afghanistan', 'United States Defense and Military Forces', 'Afghanistan War (2001-14)']",1,Foreign,1,2017-01-01 20:34:00,Asia Pacific,A strategy that went from a “good war” to the shorthand “Afghan good enough” reflects the president’s coming to terms with what was possible in Afghanistan.,The New York Times,News,https://www.nytimes.com/2017/01/01/world/asia/obama-afghanistan-war.html,2836


In [21]:
articles_all.isna().sum()

articleID              0
abstract            9131
byline                 0
documentType           0
headline               0
keywords               0
multimedia             0
newDesk                0
printPage              0
pubDate                0
sectionName            0
snippet                0
source                 0
typeOfMaterial         0
webURL                 0
articleWordCount       0
dtype: int64

In [22]:
articles_all['sectionName'].value_counts()

sectionName
Unknown               6355
Politics               636
Sunday Review          345
Television             261
Asia Pacific           173
                      ... 
Rugby                    1
Paying for College       1
Insider Events           1
Cricket                  1
Learning                 1
Name: count, Length: 62, dtype: int64

In [23]:
articles_all_v2=articles_all.drop(['abstract','byline','multimedia','source'], axis=1)

In [24]:
articles_all_v2.head(3)

Unnamed: 0,articleID,documentType,headline,keywords,newDesk,printPage,pubDate,sectionName,snippet,typeOfMaterial,webURL,articleWordCount
0,58691a5795d0e039260788b9,article,G.O.P. Leadership Poised to Topple Obama’s Pillars,"['United States Politics and Government', 'Law and Legislation', 'House of Representatives', 'Senate', 'Patient Protection and Affordable Care Act (2010)', 'Trump, Donald J', 'McConnell, Mitch']",National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led Congress in 20 years plans quick action on several priorities — most notably to clear a path for the repeal of President Obama’s health care law.,News,https://www.nytimes.com/2017/01/01/us/politics/with-new-congress-poised-to-convene-obamas-policies-are-in-peril.html,1324
1,586967bf95d0e03926078915,article,Fractured World Tested the Hope of a Young President,"['Obama, Barack', 'Afghanistan', 'United States Defense and Military Forces', 'Afghanistan War (2001-14)']",Foreign,1,2017-01-01 20:34:00,Asia Pacific,A strategy that went from a “good war” to the shorthand “Afghan good enough” reflects the president’s coming to terms with what was possible in Afghanistan.,News,https://www.nytimes.com/2017/01/01/world/asia/obama-afghanistan-war.html,2836
2,58698a1095d0e0392607894a,article,Little Troublemakers,"['Crossword Puzzles', 'Boxing Day', 'Holidays and Special Occasions']",Games,0,2017-01-01 23:00:24,Unknown,Chuck Deodene puts us in a bubbly mood.,News,https://www.nytimes.com/2017/01/01/crosswords/little-troublemakers.html,445


In [31]:
# Checking uniqueness of articles database
print("length of unique article ids:",len(articles_all_v2['articleID'].unique()))
print("shape of articles database",articles_all_v2.shape[0])

# All article ids are unique, so no further duplicate removal is needed

length of unique article ids: 9298
shape of articles database 9298


### **Part3: Making an articles-comments dataset**

In [33]:
# What are the common columns between comments and articles dataset?
set(articles_all_v2.columns).intersection(set(comments_all_v3.columns))

{'articleID', 'articleWordCount', 'newDesk', 'sectionName'}

In [None]:
comments_all_v3.head(2)
comments_all_v3[comments_all_v3.articleID=='58691a5795d0e039260788b9']

In [41]:
df_merged=pd.merge(comments_all_v3, articles_all_v2, how='inner', on='articleID')
df_merged.head(2)

Unnamed: 0,articleID,articleWordCount_x,commentID,commentBody,sectionName_x,newDesk_x,userLocation,editorsSelection,recommendations,trusted,documentType,headline,keywords,newDesk_y,printPage,pubDate,sectionName_y,snippet,typeOfMaterial,webURL,articleWordCount_y
0,58691a5795d0e039260788b9,1324.0,20969730.0,"For all you Americans out there --- still rejoicing over the majority win of Republicans over the Legislature of this land.Beware.Just like you would have been, if there were any other kind of majority.The Founding Fathers had something like this in mind wheh they formed our Great Nation.It's part of the natural 'Checks ; Balances' system that keeps this country on an even keel.But this is now being threatened -- with the majority of Executive, Legislative and Judidical power all in the hands of one political party.See to it that you remember our U.S. Constitution, and our Bill of Rights.Remember that ""We"", are still ""the People"".America belongs to all of us.And God help us all.",Politics,National,New York City,0,5.0,0.0,article,G.O.P. Leadership Poised to Topple Obama’s Pillars,"['United States Politics and Government', 'Law and Legislation', 'House of Representatives', 'Senate', 'Patient Protection and Affordable Care Act (2010)', 'Trump, Donald J', 'McConnell, Mitch']",National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led Congress in 20 years plans quick action on several priorities — most notably to clear a path for the repeal of President Obama’s health care law.,News,https://www.nytimes.com/2017/01/01/us/politics/with-new-congress-poised-to-convene-obamas-policies-are-in-peril.html,1324
1,58691a5795d0e039260788b9,1324.0,20969325.0,Obamas policies may prove to be the least of this countrys worries. The GOP has been emboldened to actually cut the OCE in the night before the new president or congress is sworn in following in their leaders footsteps. Everyone is crying foul including the GOP leadership and conservative think tanks. How about a twitter regarding the swamp draining tonight?,Politics,National,Phoenix,0,3.0,0.0,article,G.O.P. Leadership Poised to Topple Obama’s Pillars,"['United States Politics and Government', 'Law and Legislation', 'House of Representatives', 'Senate', 'Patient Protection and Affordable Care Act (2010)', 'Trump, Donald J', 'McConnell, Mitch']",National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led Congress in 20 years plans quick action on several priorities — most notably to clear a path for the repeal of President Obama’s health care law.,News,https://www.nytimes.com/2017/01/01/us/politics/with-new-congress-poised-to-convene-obamas-policies-are-in-peril.html,1324
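
The merge above relies on each `articleID` appearing once in the articles data, and `how='inner'` silently drops comments whose article is missing. Both assumptions can be checked with the `validate` and `indicator` options of `pd.merge`. The following is a minimal sketch on toy data: the two small frames and the `_comment` / `_article` suffixes are made up for illustration and stand in for `comments_all_v3` and `articles_all_v2`.

```python
import pandas as pd

# Toy stand-ins for comments_all_v3 and articles_all_v2.
comments = pd.DataFrame({
    "articleID": ["a1", "a1", "a2", "a9"],
    "commentID": [1, 2, 3, 4],
    "newDesk": ["National", "National", "OpEd", "Unknown"],
})
articles = pd.DataFrame({
    "articleID": ["a1", "a2", "a3"],
    "newDesk": ["National", "Editorial", "Foreign"],
})

merged = pd.merge(
    comments, articles,
    how="left", on="articleID",
    suffixes=("_comment", "_article"),  # clearer than the default _x / _y
    validate="many_to_one",             # raises if an articleID repeats in articles
    indicator=True,                     # adds a _merge column
)

# Comments without a matching article; how="inner" would drop these rows.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
```

If no rows come back as `left_only` on the real data, the inner merge used in this notebook loses nothing.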


In [50]:
# For some article ids, the columns shared by both data sets (word count, news desk, section name)
# hold different values in the comments data and in the articles data.
# Business rule: use the values from the articles data set, since they describe the article itself.

# Flag = 1 where both sources agree, 0 otherwise (vectorized comparison instead of a row-wise apply)
df_merged['flg_word_count'] = (df_merged['articleWordCount_x'] == df_merged['articleWordCount_y']).astype(int)
df_merged['flg_newDesk'] = (df_merged['newDesk_x'] == df_merged['newDesk_y']).astype(int)
df_merged['flg_sectionName'] = (df_merged['sectionName_x'] == df_merged['sectionName_y']).astype(int)

df_merged[df_merged['flg_newDesk']==0].head()
df_merged[df_merged['flg_sectionName']==0].head()
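
Before choosing which side to keep, it can help to know how often the two sources disagree. A minimal sketch on toy data follows: the `toy` frame and the helper `mismatch_share` are hypothetical, and the column names simply follow the `_x` / `_y` suffixes produced by the merge above.

```python
import pandas as pd

# Toy frame with the same column layout as the shared columns in df_merged.
toy = pd.DataFrame({
    "newDesk_x": ["National", "OpEd", "Unknown"],
    "newDesk_y": ["National", "Editorial", "Foreign"],
    "sectionName_x": ["Politics", "Unknown", "Unknown"],
    "sectionName_y": ["Politics", "Unknown", "Europe"],
})


def mismatch_share(df, base):
    """Fraction of rows where <base>_x and <base>_y differ."""
    return (df[f"{base}_x"] != df[f"{base}_y"]).mean()


for col in ["newDesk", "sectionName"]:
    print(col, round(mismatch_share(toy, col), 3))
```

On the real data the same helper could be called as `mismatch_share(df_merged, 'newDesk')` and so on. Note that a missing value on either side counts as a mismatch with this comparison.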

In [54]:
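# Drop the comment-side copies of the shared columns (business rule above) and the temporary flag columns
df_merged.drop(['articleWordCount_x','newDesk_x','sectionName_x'], axis=1, inplace=True)
df_merged.drop(['flg_word_count','flg_newDesk','flg_sectionName'], axis=1, inplace=True)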
df_merged.head(3)

Unnamed: 0,articleID,commentID,commentBody,userLocation,editorsSelection,recommendations,trusted,documentType,headline,keywords,newDesk_y,printPage,pubDate,sectionName_y,snippet,typeOfMaterial,webURL,articleWordCount_y
0,58691a5795d0e039260788b9,20969730.0,"For all you Americans out there --- still rejoicing over the majority win of Republicans over the Legislature of this land.Beware.Just like you would have been, if there were any other kind of majority.The Founding Fathers had something like this in mind wheh they formed our Great Nation.It's part of the natural 'Checks ; Balances' system that keeps this country on an even keel.But this is now being threatened -- with the majority of Executive, Legislative and Judidical power all in the hands of one political party.See to it that you remember our U.S. Constitution, and our Bill of Rights.Remember that ""We"", are still ""the People"".America belongs to all of us.And God help us all.",New York City,0,5.0,0.0,article,G.O.P. Leadership Poised to Topple Obama’s Pillars,"['United States Politics and Government', 'Law and Legislation', 'House of Representatives', 'Senate', 'Patient Protection and Affordable Care Act (2010)', 'Trump, Donald J', 'McConnell, Mitch']",National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led Congress in 20 years plans quick action on several priorities — most notably to clear a path for the repeal of President Obama’s health care law.,News,https://www.nytimes.com/2017/01/01/us/politics/with-new-congress-poised-to-convene-obamas-policies-are-in-peril.html,1324
1,58691a5795d0e039260788b9,20969325.0,Obamas policies may prove to be the least of this countrys worries. The GOP has been emboldened to actually cut the OCE in the night before the new president or congress is sworn in following in their leaders footsteps. Everyone is crying foul including the GOP leadership and conservative think tanks. How about a twitter regarding the swamp draining tonight?,Phoenix,0,3.0,0.0,article,G.O.P. Leadership Poised to Topple Obama’s Pillars,"['United States Politics and Government', 'Law and Legislation', 'House of Representatives', 'Senate', 'Patient Protection and Affordable Care Act (2010)', 'Trump, Donald J', 'McConnell, Mitch']",National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led Congress in 20 years plans quick action on several priorities — most notably to clear a path for the repeal of President Obama’s health care law.,News,https://www.nytimes.com/2017/01/01/us/politics/with-new-congress-poised-to-convene-obamas-policies-are-in-peril.html,1324
2,58691a5795d0e039260788b9,20969855.0,Democrats are comprised of malcontents who generally have never worked a day in their lives and thrive on handouts from the Govt; fire 1/2 of the Govt. non workers and save billions Trust is earned and our elected officials have not earned our trust. The don't adhere to the U.S. Constitution as it restricts their movement. Thank goodness for the 2nd Amendment and the NRA or they would have already enslaved us.,Florida,0,3.0,0.0,article,G.O.P. Leadership Poised to Topple Obama’s Pillars,"['United States Politics and Government', 'Law and Legislation', 'House of Representatives', 'Senate', 'Patient Protection and Affordable Care Act (2010)', 'Trump, Donald J', 'McConnell, Mitch']",National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led Congress in 20 years plans quick action on several priorities — most notably to clear a path for the repeal of President Obama’s health care law.,News,https://www.nytimes.com/2017/01/01/us/politics/with-new-congress-poised-to-convene-obamas-policies-are-in-peril.html,1324


In [56]:
# Create a lighter DataFrame that keeps only the essential columns
df_merged_lite = df_merged[['articleID','pubDate','keywords','snippet','articleWordCount_y','commentID','commentBody','userLocation']]
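
To check how much lighter a column subset really is, the two frames can be compared with `memory_usage(deep=True)`. The sketch below uses a made-up `toy` frame and a hypothetical `essential` column list; it only illustrates the measurement, not the actual sizes in this project.

```python
import pandas as pd

# Toy stand-in for the merged data; all values are invented for illustration.
toy = pd.DataFrame({
    'articleID': ['a1', 'a2'],
    'commentBody': ['first comment', 'second comment'],
    'webURL': ['https://example.com/1', 'https://example.com/2'],
    'snippet': ['s1', 's2'],
})

# Hypothetical list of columns to keep.
essential = ['articleID', 'commentBody']
toy_lite = toy[essential]

# deep=True counts the actual string payloads, not just the object pointers.
full_bytes = toy.memory_usage(deep=True).sum()
lite_bytes = toy_lite.memory_usage(deep=True).sum()
print(f'full: {full_bytes} bytes, lite: {lite_bytes} bytes')
```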

----
### This will be our consolidated dataset for the next steps of the analysis

In [57]:
# Save both DataFrames to CSV; index=False avoids writing the row index as an extra unnamed column
write_dir='/Users/kshitijmittal/Documents/UChicago Acad/03 Quarter 3/01 ML/NYT_Reader_Feedback/00_Data/01_Processed_Data'
df_merged.to_csv(write_dir + '/' + 'comments_articles_df.csv', index=False)
df_merged_lite.to_csv(write_dir + '/' + 'comments_articles_df_lite.csv', index=False)
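
When a DataFrame is written with `to_csv` and default settings, its row index is stored as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference on a toy frame written to a temporary directory; the frame, file names and directory are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({'articleID': ['a1', 'a2'], 'commentID': [1.0, 2.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, 'with_index.csv')
    without_index = os.path.join(tmp_dir, 'without_index.csv')

    toy.to_csv(with_index)                   # default: the index is written too
    toy.to_csv(without_index, index=False)   # the index is left out

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)      # includes 'Unnamed: 0'
print(cols_without_index)   # only the real columns
```

If a file has already been written with the index, reading it with `pd.read_csv(path, index_col=0)` restores the index instead of creating the extra column.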