In [1]:
import pandas as pd

## Load Data

In [2]:
df = pd.read_csv(filepath_or_buffer='input/dsjVoxArticles.tsv', sep='\t',
                 header=0, index_col=None, lineterminator='\n')

## Data Sense-making

In [3]:
df.iloc[0]

title             Bitcoin is down 60 percent this year. Here's w...
author                                               Timothy B. Lee
category                                         Business & Finance
published_date                                  2014-03-31 14:01:30
updated_on                                      2014-12-16 16:37:36
slug              http://www.vox.com/2014/3/31/5557170/bitcoin-b...
blurb             Bitcoins have lost more than 60 percent of the...
body              <p>The markets haven't been kind to<span> </sp...
Name: 0, dtype: object

In [4]:
print(df['category'].unique().shape)
df['category'].unique()

(186,)


array(['Business & Finance', 'War on Drugs', 'Criminal Justice',
       'Health Care', 'Explainers', 'Life', 'Science & Health',
       'Neuroscience', 'Apple', 'Politics & Policy', 'Culture',
       'Human Rights', 'The Latest', 'World', 'Marriage Equality',
       'Almanac', 'Transportation', 'Space', 'Emmy Awards', 'Xpress',
       'Identities', 'Marijuana Legalization', 'Joe Biden', 'Star Wars',
       'Sports', 'North Korea', 'On Instagram', 'Race in America',
       'Media', 'Education', 'Gender-Based Violence', 'Supreme Court ',
       'Gender Equality', 'Orange Is the New Black', 'Immigration',
       'Hillary Clinton', 'Gun Violence', 'Politics', 'ISIS', 'NFL',
       'Science of Everyday Life', 'Infectious Disease', 'Congress',
       'College Football', 'Campaign Finance', 'Books', 'Vox', 'Music',
       '2016 Presidential Election', 'LGBTQ', 'Interviews', 'Ted Cruz',
       'Energy & Environment', 'Genetics', 'True Detective',
       'Officer-Involved Shootings', 'Jeb Bush'

In [5]:
print(df.iloc[0]['body'])
print(df.shape)

<p>The markets haven't been kind to<span> </span><a href="http://www.vox.com/cards/bitcoin/" style="font-size: 17px; line-height: 28.4624996185303px;">Bitcoin</a> in 2014. The currency reached a high of nearly $1,000 in January before falling to around $350 this month, a plunge of more than 60 percent. It would be easy to write Bitcoin off as a fad whose novelty has worn off.</p> \n<p>After all, dollars seem superior in almost every respect. T<span>hey're accepted everywhere, they're convenient to use, and they have a stable value. Bitcoin is an inferior currency on all three counts.</span></p> \n<p><q class="right"><span>Bitcoin's detractors are making the same mistake as many Bitcoin fans</span> </q></p> \n<p><span>Yet it would be foolish to write Bitcoin off. The currency has had months-long slumps in the past, only to bounce back. </span>More importantly, it's a mistake to think about Bitcoin as a new kind of currency. W<span>hat makes Bitcoin potentially revolutionary is that it's

## Data Preprocessing

We will use the `Beautiful Soup` library to strip the HTML markup from each article body. `Beautiful Soup` is a Python library for pulling data out of HTML and XML files. See <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">the Beautiful Soup documentation</a> for more information.

In [6]:
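from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 guesses one from what is installed and emits a warning
soup = BeautifulSoup(df.iloc[0]['body'], 'html.parser')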
print(soup.get_text())

The markets haven't been kind to Bitcoin in 2014. The currency reached a high of nearly $1,000 in January before falling to around $350 this month, a plunge of more than 60 percent. It would be easy to write Bitcoin off as a fad whose novelty has worn off. \nAfter all, dollars seem superior in almost every respect. They're accepted everywhere, they're convenient to use, and they have a stable value. Bitcoin is an inferior currency on all three counts. \nBitcoin's detractors are making the same mistake as many Bitcoin fans  \nYet it would be foolish to write Bitcoin off. The currency has had months-long slumps in the past, only to bounce back. More importantly, it's a mistake to think about Bitcoin as a new kind of currency. What makes Bitcoin potentially revolutionary is that it's the world's first completely open financial network. \nHistory suggests that open platforms like Bitcoin often become fertile soil for innovation. Think about the internet. It didn't seem like a very practica
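
`get_text()` concatenates the text nodes exactly as they appear, so whitespace between tags matters. The following is a minimal sketch on a made-up HTML string (not taken from the dataset), assuming `bs4` is installed; it shows how adjacent paragraphs with no whitespace between them get glued together, and how the `separator` and `strip` arguments can avoid that.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration: two paragraphs with no whitespace between them
html = "<p>One</p><p>Two</p>"
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: text nodes are joined with nothing in between
glued = soup.get_text()

# Join the stripped text nodes with a single space instead
spaced = soup.get_text(separator=" ", strip=True)

print(repr(glued))
print(repr(spaced))
```

In the Vox bodies each paragraph is followed by a space, so the default call reads fine here; the `separator` variant is only worth considering if words start running together.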

#### Remove missing values and all literal `\n` sequences
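
The `\n` that shows up in the printed bodies above is the two-character sequence backslash + `n` stored in the text, not a newline character, which is why the replacement pattern has to escape the backslash. A small sketch with a made-up string (hypothetical, not from the dataset) illustrates the difference:

```python
# Hypothetical body text containing a literal backslash followed by "n"
raw = "<p>first</p> \\n<p>second</p>"

# The string holds no actual newline character
has_real_newline = "\n" in raw

# "\\n" is two characters (backslash, n); "\n" is one (newline)
literal_len = len("\\n")
newline_len = len("\n")

# Removing the literal sequence leaves the surrounding text untouched
cleaned = raw.replace("\\n", "")

print(has_real_newline, literal_len, newline_len)
print(cleaned)
```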

In [7]:
data_df = df.dropna().copy()  # explicit copy, so the assignment below does not raise SettingWithCopyWarning

In [8]:
# Remove the literal two-character sequence backslash + n (not real newlines)
data_df['body'] = data_df['body'].str.replace("\\n", "", regex=False)


In [14]:
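# Draw a 30% sample; replace=True samples with replacement, so the same article can appear more than once
data_ready = data_df.sample(frac=0.3, replace=True, random_state=1)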

In [15]:
data_ready.shape

(6904, 8)
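
Because the sample is drawn with `replace=True`, the 6904 rows are not guaranteed to be distinct articles. The sketch below uses a small toy frame (hypothetical data standing in for `data_df`) to compare sampling with and without replacement; whether duplicates matter depends on what the sample is used for downstream.

```python
import pandas as pd

# Toy frame standing in for data_df (hypothetical data)
toy = pd.DataFrame({"body": [f"article {i}" for i in range(10)]})

# Same fraction and seed, with and without replacement
with_repl = toy.sample(frac=0.3, replace=True, random_state=1)
without_repl = toy.sample(frac=0.3, replace=False, random_state=1)

# Both samples have the same size
print(len(with_repl), len(without_repl))

# With replacement, an index label may occur more than once
print(with_repl.index.duplicated().any())

# Without replacement, every sampled row is distinct
print(without_repl.index.is_unique)
```

On the real data, `data_ready.index.duplicated().sum()` would report how many sampled rows repeat an earlier one.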

In [16]:
# Strip the HTML markup from every body, naming the parser explicitly
data_ready['body'] = data_ready['body'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

data_ready.iloc[0]

title             Every year of a prison term makes a couple 32 ...
author                                                    Dara Lind
category                                           Criminal Justice
published_date                                  2014-05-29 12:30:05
updated_on                                      2014-05-29 12:30:07
slug              http://www.vox.com/2014/5/29/5756646/every-yea...
blurb             But even a short jail stay can strain a marria...
body              A new study by criminologists Sonja Siennick a...
Name: 235, dtype: object

In [17]:
data_ready.iloc[0]['body']

'A new study by criminologists Sonja Siennick and Eric Stewart of Florida State University and Jeremy Staff of Penn State takes a hard look at the effects of incarceration on marriage. Here\'s what we already know from other research, what this study says, and the questions that remain unanswered. What we already knew about incarceration and divorce Incarceration increases divorce rates. Studies consistently show that incarceration during marriage is correlated with higher divorce rates. When one spouse has been incarcerated before getting married, the couple isn\'t any more likely to split up — but when a spouse is incarcerated during the marriage, the odds of divorce increase. Even after the inmate is released from prison, the marriage is still at risk. In fact, inmates and their spouses can be optimistic about their marriages before the inmate is released — but that optimism often falls apart.\xa0One study of Dutch men found that their odds of divorce increased over the ten years af

In [18]:
data_ready.to_csv('VoxData.csv', sep=',', index=False)