<a href="https://colab.research.google.com/github/mlfa19/assignments/blob/master/Module%202/05/All_the_News_Sample_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# All The News Dataset

If you want to spend more time on the modeling, experimentation, and interpretation portions of the mini-project, consider using this dataset: we have cleaned up and tokenized the data so that you can use it with Naïve Bayes.

Here is the description of the dataset from the person who put it together.

> I wanted to see how articles clustered together if the articles were rendered into document-term matrices---would there be greater affinity among political affiliations, or medium, subject matter, etc. The data was scraped using BeautifulSoup and stored in Sqlite, but I've chopped it up into three separate CSVs here, because the entire Sqlite database came out to about 1.2 gb, beyond Kaggle's max.
>
> The publications include the New York Times, Breitbart, CNN, Business Insider, the Atlantic, Fox News, Talking Points Memo, Buzzfeed News, National Review, New York Post, the Guardian, NPR, Reuters, Vox, and the Washington Post. Sampling wasn't quite scientific; I chose publications based on my familiarity of the domain and tried to get a range of political alignments, as well as a mix of print and digital publications. By count, the publications break down accordingly:
>
> The data primarily falls between the years of 2016 and July 2017, although there is a not-insignificant number of articles from 2015, and a possibly insignificant number from before then.

## Downloading and Parsing the Data

We have put one of the three CSV files on Google Drive so that it is quick and easy to download. If you want the rest of the data, we leave it up to you to modify the notebook to handle that.

In [3]:
import pandas as pd
import gdown

gdown.download('https://drive.google.com/uc?authuser=0&id=1T8V87Hdz2IvhKjzwzKyLWA4vI6sA2wTX&export=download',
               'articles1.csv',
               quiet=False)
df = pd.read_csv('articles1.csv')
df

Downloading...
From: https://drive.google.com/uc?authuser=0&id=1T8V87Hdz2IvhKjzwzKyLWA4vI6sA2wTX&export=download
To: /content/articles1.csv
204MB [00:01, 199MB/s]


| Unnamed: 0.1 | Unnamed: 0 | id | title | publication | author | date | year | month | url | content |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 17283 | House Republicans Fret About Winning Their Hea... | New York Times | Carl Hulse | 2016-12-31 | 2016.0 | 12.0 | | WASHINGTON — Congressional Republicans have... |
| 1 | 1 | 17284 | Rift Between Officers and Residents as Killing... | New York Times | Benjamin Mueller and Al Baker | 2017-06-19 | 2017.0 | 6.0 | | After the bullet shells get counted, the blood... |
| 2 | 2 | 17285 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... | New York Times | Margalit Fox | 2017-01-06 | 2017.0 | 1.0 | | When Walt Disney’s “Bambi” opened in 1942, cri... |
| 3 | 3 | 17286 | Among Deaths in 2016, a Heavy Toll in Pop Musi... | New York Times | William McDonald | 2017-04-10 | 2017.0 | 4.0 | | Death may be the great equalizer, but it isn’t... |
| 4 | 4 | 17287 | Kim Jong-un Says North Korea Is Preparing to T... | New York Times | Choe Sang-Hun | 2017-01-02 | 2017.0 | 1.0 | | SEOUL, South Korea — North Korea’s leader, ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 53287 | 73465 | Rex Tillerson Says Climate Change Is Real, but … | Atlantic | Robinson Meyer | 2017-01-11 | 2017.0 | 1.0 | | As chairman and CEO of ExxonMobil, Rex Tillers... |
| 49996 | 53288 | 73466 | The Biggest Intelligence Questions Raised by t... | Atlantic | Amy Zegart | 2017-01-11 | 2017.0 | 1.0 | | I’ve spent nearly 20 years looking at intellig... |
| 49997 | 53289 | 73467 | Trump Announces Plan That Does Little to Resol... | Atlantic | Jeremy Venook | 2017-01-11 | 2017.0 | 1.0 | | Donald Trump will not be taking necessary st... |
| 49998 | 53290 | 73468 | Dozens of For-Profit Colleges Could Soon Close | Atlantic | Emily DeRuy | 2017-01-11 | 2017.0 | 1.0 | | Dozens of colleges could be forced to close ... |
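If you do decide to work with all three files, one possible way to combine them is sketched below. The helper name `load_article_csvs` is hypothetical, and the file names `articles2.csv` and `articles3.csv` in the usage comment are assumptions about how the remaining files are named; you would need to download them yourself first.

```python
import pandas as pd

def load_article_csvs(paths):
    """Read several CSV files that share the same columns and stack them into one DataFrame."""
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers the rows 0..N-1 instead of repeating each file's own index
    return pd.concat(frames, ignore_index=True)

# Hypothetical usage (file names are assumed, not provided by this notebook):
# df = load_article_csvs(['articles1.csv', 'articles2.csv', 'articles3.csv'])
# df['publication'].value_counts()   # number of articles per publication
```

Renumbering the index keeps row labels unique, which avoids surprises later if you select rows by label.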


## Tokenizing Articles to Words

To tokenize each article into words, this example uses [NLTK](https://www.nltk.org/)'s built-in word tokenizer (you can use your own method, such as the one we used in assignment 4, if you like). NLTK stands for the Natural Language Toolkit, and it has some really cool things built in for dealing with text data.

In [5]:
# NLTK has a built-in module for extracting words from text.
# This takes a few minutes to run, so be patient.

import nltk

nltk.download('punkt')

def remove_punctuation(article):
    # substitute a regular apostrophe for '’' so contractions work with word_tokenize
    article = article.replace('’', "'")
    tokens = nltk.tokenize.word_tokenize(article)
    # keep only tokens that contain at least one letter (drops punctuation and pure numbers)
    words = [w for w in tokens if any(x.isalpha() for x in w)]
    return " ".join(words)

df['content_no_punctuation'] = df['content'].map(remove_punctuation)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Here's what the article text looks like after the words have all been separated and punctuation has been removed.

In [8]:
df['content_no_punctuation']

0        WASHINGTON Congressional Republicans have a ne...
1        After the bullet shells get counted the blood ...
2        When Walt Disney 's Bambi opened in critics pr...
3        Death may be the great equalizer but it is n't...
4        SEOUL South Korea North Korea 's leader Kim sa...
                               ...                        
49995    As chairman and CEO of ExxonMobil Rex Tillerso...
49996    I 've spent nearly years looking at intelligen...
49997    Donald Trump will not be taking necessary step...
49998    Dozens of colleges could be forced to close in...
49999    The force of gravity can be described using a ...
Name: content_no_punctuation, Length: 50000, dtype: object
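Notice that numbers such as "1942" and "20" have disappeared along with the punctuation, while pieces like `'s` and `n't` remain. That follows from the filter in `remove_punctuation`, which keeps any token containing at least one letter. The sketch below isolates just that filter; the function name `keep_word_tokens` is hypothetical, and the token list is written by hand to resemble tokenizer output rather than taken from an actual NLTK run.

```python
def keep_word_tokens(tokens):
    """Keep only the tokens that contain at least one alphabetic character."""
    return [tok for tok in tokens if any(ch.isalpha() for ch in tok)]

# Hand-written tokens, roughly what a word tokenizer might produce (not real NLTK output)
tokens = ["When", "Walt", "Disney", "'s", "Bambi", "opened", "in", "1942", ",", "critics"]
print(" ".join(keep_word_tokens(tokens)))
# prints: When Walt Disney 's Bambi opened in critics
```

If years or other numbers matter for your analysis, you may want to adjust this filter rather than use it as-is.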

At this point you can take the dataset in many different directions. If you want to use n-grams to analyze the data, you can use sklearn's `CountVectorizer` as we did in the previous notebooks.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True, ngram_range=(1,2), min_df=20, tokenizer=lambda x: x.split(' '))
X = vectorizer.fit_transform(df['content_no_punctuation'])

This would then allow you to fit a model with Naïve Bayes. Sample code for this is available upon request, but we think there is enough guidance in previous assignments, and writing the code on your own would be a good exercise.
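To make the `CountVectorizer` settings above more concrete, here is a small pure-Python sketch of what `binary=True`, `ngram_range=(1,2)`, and `min_df` mean on a toy corpus. It is an illustration only, not how sklearn is implemented: the helper names `ngrams` and `binary_features` are hypothetical, the three documents are made up, and `min_df=2` is used instead of 20 so that the tiny example keeps some features.

```python
from collections import Counter

def ngrams(words, n_min=1, n_max=2):
    """Return all contiguous n-grams of length n_min..n_max as space-joined strings."""
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]

def binary_features(docs, min_df=1):
    """Build 0/1 presence features over n-grams that occur in at least min_df documents."""
    doc_terms = [set(ngrams(doc.split(" "))) for doc in docs]
    # document frequency: in how many documents does each n-gram appear?
    doc_freq = Counter(term for terms in doc_terms for term in terms)
    vocab = sorted(term for term, count in doc_freq.items() if count >= min_df)
    # binary features record presence (1) or absence (0), not how often a term occurs
    rows = [[1 if term in terms else 0 for term in vocab] for terms in doc_terms]
    return vocab, rows

docs = ["the senate voted", "the senate adjourned", "the court voted"]
vocab, rows = binary_features(docs, min_df=2)
print(vocab)  # ['senate', 'the', 'the senate', 'voted']
print(rows)   # [[1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 0, 1]]
```

Each row is one document and each column is one unigram or bigram; rare n-grams such as "court" are dropped because they appear in fewer than `min_df` documents.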