## **Data Ingestion and Exploratory Analysis**
We will use the articles and comments from April 2018 for the first iteration, and later scale our findings to the entire data set.

In [1]:
#pip install pandarallel
import multiprocessing

num_processors = multiprocessing.cpu_count()
print(f'Available CPUs: {num_processors}')

from pandarallel import pandarallel

# Leave one core free for the notebook and OS, but never request fewer than one worker
pandarallel.initialize(nb_workers=max(1, num_processors - 1), use_memory_fs=False)

Available CPUs: 8
INFO: Pandarallel will run on 7 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [9]:
import pandas as pd
import numpy as np

from textblob import TextBlob

import warnings 
warnings.filterwarnings('ignore')

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from plotly.offline import init_notebook_mode, iplot
import plotly.figure_factory as ff
init_notebook_mode(connected=True)
plt.style.use('fivethirtyeight')
%matplotlib inline

import gc   # For memory management

----
### **Part0: Data Ingestion**

In [None]:
# Load the comments on New York Times articles published in April 2018, along with the articles themselves
curr_dir = '../input/'
comments = pd.read_csv(curr_dir + 'CommentsApril2018.csv')
articles = pd.read_csv(curr_dir + 'ArticlesApril2018.csv')

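# Two helper functions that are used often:
def print_largest_values(s, n=5):
    """Print the n largest distinct values of Series s, largest first."""
    s = sorted(s.dropna().unique())  # drop NaN so the sort order is well defined
    for v in s[-1:-(n+1):-1]:
        print(v)
    print()

def print_smallest_values(s, n=5):
    """Print the n smallest distinct values of Series s, smallest first."""
    s = sorted(s.dropna().unique())  # drop NaN so the sort order is well defined
    for v in s[:n]:
        print(v)
    print()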
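
The two helpers above only print their results. When the extreme values need to be reused or checked programmatically, a returning variant can be handy. The following is a minimal sketch, not part of the original notebook: the names `largest_values` and `smallest_values` and the toy `word_counts` series are hypothetical.

```python
import pandas as pd

def largest_values(s: pd.Series, n: int = 5) -> list:
    """Return the n largest distinct non-null values of s, largest first."""
    distinct = s.dropna().drop_duplicates().tolist()
    return sorted(distinct, reverse=True)[:n]

def smallest_values(s: pd.Series, n: int = 5) -> list:
    """Return the n smallest distinct non-null values of s, smallest first."""
    distinct = s.dropna().drop_duplicates().tolist()
    return sorted(distinct)[:n]

# Toy data (hypothetical): duplicate and missing values are ignored
word_counts = pd.Series([1324, 2836, 445, 1324, None, 445])
top_two = largest_values(word_counts, n=2)
bottom_two = smallest_values(word_counts, n=2)
print(top_two)
print(bottom_two)
```

Returning a list keeps the helper usable in assertions and in later cells, while printing stays a one-line call on the result.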

In [6]:
# Importing all comments

curr_dir = '/Users/kshitijmittal/Documents/UChicago Acad/03 Quarter 3/01 ML/NYT_Reader_Feedback/00_Data/'
df1 = pd.read_csv(curr_dir + 'CommentsJan2017.csv')
df2 = pd.read_csv(curr_dir + 'CommentsFeb2017.csv')
df3 = pd.read_csv(curr_dir + 'CommentsMarch2017.csv')
df4 = pd.read_csv(curr_dir + 'CommentsApril2017.csv')
df5 = pd.read_csv(curr_dir + 'CommentsMay2017.csv')
df6 = pd.read_csv(curr_dir + 'CommentsJan2018.csv')
df7 = pd.read_csv(curr_dir + 'CommentsFeb2018.csv')
df8 = pd.read_csv(curr_dir + 'CommentsMarch2018.csv')
df9 = pd.read_csv(curr_dir + 'CommentsApril2018.csv')

comments_all = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9])
comments_all.drop_duplicates(subset='commentID', inplace=True)
comments_all.head(3)

Unnamed: 0,approveDate,articleID,articleWordCount,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,...,status,timespeople,trusted,updateDate,userDisplayName,userID,userLocation,userTitle,userURL,typeOfMaterial
0,1483455908,58691a5795d0e039260788b9,1324.0,For all you Americans out there --- still rejo...,20969730.0,20969730.0,<br/>,comment,1483426000.0,1.0,...,approved,1.0,0.0,1483455908,N. Smith,64679318.0,New York City,,,News
1,1483455656,58691a5795d0e039260788b9,1324.0,Obamas policies may prove to be the least of t...,20969325.0,20969325.0,<br/>,comment,1483417000.0,1.0,...,approved,1.0,0.0,1483455656,Kilocharlie,69254188.0,Phoenix,,,News
2,1483455655,58691a5795d0e039260788b9,1324.0,Democrats are comprised of malcontents who gen...,20969855.0,20969855.0,<br/>,comment,1483431000.0,1.0,...,approved,1.0,0.0,1483455655,Frank Fryer,76788711.0,Florida,,,News
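
The nine `read_csv` calls follow one pattern (a prefix, a month label, and `.csv`), so the same load can be sketched as a loop. This is an illustrative alternative under stated assumptions, not the code that produced the output above: the helper name `load_monthly` and the toy files written to a temporary directory are hypothetical, while the month labels and ID column names are taken from the cells in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Month labels as they appear in the file names used in this notebook
MONTHS = ('Jan2017', 'Feb2017', 'March2017', 'April2017', 'May2017',
          'Jan2018', 'Feb2018', 'March2018', 'April2018')

def load_monthly(data_dir, prefix, id_col, months=MONTHS):
    """Read <prefix><month>.csv for each month, stack them, and drop repeated IDs."""
    frames = [pd.read_csv(Path(data_dir) / f'{prefix}{month}.csv') for month in months]
    combined = pd.concat(frames, ignore_index=True)
    # Keep the first occurrence of each ID, as drop_duplicates does by default
    return combined.drop_duplicates(subset=id_col).reset_index(drop=True)

# Toy demonstration with two tiny files that share one commentID
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'commentID': [1, 2], 'commentBody': ['a', 'b']}).to_csv(
        Path(tmp) / 'CommentsJan2017.csv', index=False)
    pd.DataFrame({'commentID': [2, 3], 'commentBody': ['b', 'c']}).to_csv(
        Path(tmp) / 'CommentsFeb2017.csv', index=False)
    demo = load_monthly(tmp, 'Comments', 'commentID', months=['Jan2017', 'Feb2017'])

print(demo['commentID'].tolist())  # [1, 2, 3]
```

With the real data this would be called as `load_monthly(curr_dir, 'Comments', 'commentID')` and `load_monthly(curr_dir, 'Articles', 'articleID')`. Because the per-month frames are local to the function, they become eligible for garbage collection as soon as it returns.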


In [17]:
# Importing all articles

df1 = pd.read_csv(curr_dir + 'ArticlesJan2017.csv')
df2 = pd.read_csv(curr_dir + 'ArticlesFeb2017.csv')
df3 = pd.read_csv(curr_dir + 'ArticlesMarch2017.csv')
df4 = pd.read_csv(curr_dir + 'ArticlesApril2017.csv')
df5 = pd.read_csv(curr_dir + 'ArticlesMay2017.csv')
df6 = pd.read_csv(curr_dir + 'ArticlesJan2018.csv')
df7 = pd.read_csv(curr_dir + 'ArticlesFeb2018.csv')
df8 = pd.read_csv(curr_dir + 'ArticlesMarch2018.csv')
df9 = pd.read_csv(curr_dir + 'ArticlesApril2018.csv')

articles_all = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9])
articles_all.drop_duplicates(subset='articleID', inplace=True)
articles_all.head(3)

Unnamed: 0,articleID,abstract,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,58691a5795d0e039260788b9,,By JENNIFER STEINHAUER,article,G.O.P. Leadership Poised to Topple Obama’s Pi...,"['United States Politics and Government', 'Law...",1,National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led...,The New York Times,News,https://www.nytimes.com/2017/01/01/us/politics...,1324
1,586967bf95d0e03926078915,,By MARK LANDLER,article,Fractured World Tested the Hope of a Young Pre...,"['Obama, Barack', 'Afghanistan', 'United State...",1,Foreign,1,2017-01-01 20:34:00,Asia Pacific,A strategy that went from a “good war” to the ...,The New York Times,News,https://www.nytimes.com/2017/01/01/world/asia/...,2836
2,58698a1095d0e0392607894a,,By CAITLIN LOVINGER,article,Little Troublemakers,"['Crossword Puzzles', 'Boxing Day', 'Holidays ...",1,Games,0,2017-01-01 23:00:24,Unknown,Chuck Deodene puts us in a bubbly mood.,The New York Times,News,https://www.nytimes.com/2017/01/01/crosswords/...,445


In [18]:
# Memory management
del df1, df2, df3, df4, df5, df6, df7, df8, df9
gc.collect()

779

In [22]:
print(f"We have {comments_all.shape[0]:,} comments")
print(f"On {articles_all.shape[0]:,} articles")

We have 2,118,617 comments
On 9,298 articles
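
These totals work out to roughly 228 comments per article on average (2,118,617 / 9,298). Before starting the EDA it may be worth checking how the two tables line up on `articleID`. The sketch below shows one way to do that on toy stand-ins; `comments_demo`, `articles_demo`, and the derived names are hypothetical, and no result for the real data is implied.

```python
import pandas as pd

# Toy stand-ins for articles_all and comments_all (hypothetical values)
articles_demo = pd.DataFrame({'articleID': ['a1', 'a2', 'a3']})
comments_demo = pd.DataFrame({
    'commentID': [1, 2, 3, 4],
    'articleID': ['a1', 'a1', 'a2', 'zz'],
})

# Comments whose articleID does not appear in the articles table
is_orphan = ~comments_demo['articleID'].isin(articles_demo['articleID'])
n_orphans = int(is_orphan.sum())

# Number of comments per article, for comments that do match an article
per_article = comments_demo.loc[~is_orphan].groupby('articleID').size()

# Articles that received no comments at all
articles_without_comments = sorted(
    set(articles_demo['articleID']) - set(comments_demo['articleID'])
)

print(f'Orphan comments: {n_orphans}')
print(per_article)
print(f'Articles without comments: {articles_without_comments}')
```

Swapping in `comments_all` and `articles_all` would run the same checks on the full data set.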


----
### **Part1: Comments EDA**