# Unit 4 Capstone - News Article Analysis & Classification

## John A. Fonte

---
---

### Instructions

1. Find 100 different entries from at least 10 different authors (articles?)
2. Reserve 25% for the test set
3. Cluster the vectorized data (try a few clustering methods)
4. Perform unsupervised feature generation and selection
5. Perform supervised modeling by classifying by author
6. Comment on your 25% holdout group. Did the clusters for the holdout group change dramatically, or were they consistent with the training groups? Is the performance of the model consistent? If not, why?
7. Conclude with which models (clustering or not) work best for classifying texts.

---
---

### About the Dataset

__Source:__ https://archive.ics.uci.edu/ml/datasets/Reuter_50_50#

__Description:__ This is a subset of the [Reuters Corpus Volume 1 (RCV1)](https://scikit-learn.org/0.17/datasets/rcv1.html). It covers the 50 authors with the most articles in the corpus, with 100 articles per author across the combined training and testing sets (50 in each).

---
---
# 1. Data Load and Cleaning

In [1]:
import pandas as pd

In [2]:
'''
Loading Data from Local Computer
Each author is a subfolder, and within each folder is a series of .txt files
The goal of this cell is to load all the contents of every subfolder into the 
DataFrame, while retaining the author designation for those works.
'''

from os import listdir
from os.path import join

def multiple_file_load(file_directory):

    # parallel lists: one author label and one article text per file
    authorlist = []
    textlist = []

    # every subfolder of file_directory is named after an author
    for author in listdir(file_directory):
        author_sub_directory = join(file_directory, author)  # author folder path

        # reading every .txt file within the author's subfolder
        for filename in listdir(author_sub_directory):
            if filename.lower().endswith('.txt'):
                text_file_path = join(author_sub_directory, filename)  # text file path

                # the with-block closes the file automatically, even if reading fails
                with open(text_file_path, 'r') as textfile:
                    textlist.append(textfile.read())
                authorlist.append(author)

    # pushing the two lists into a dataframe
    df = pd.DataFrame({'Author': authorlist, 'Text': textlist})

    return df
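
For comparison, here is a minimal sketch of the same folder-per-author loading idea written with `pathlib`. The function name `load_author_folders`, the UTF-8 encoding, and the throwaway demo corpus are assumptions made for illustration; they are not part of the original notebook or the C50 data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_author_folders(root):
    """Read every <root>/<author>/*.txt file into an Author/Text DataFrame."""
    rows = []
    for author_dir in sorted(Path(root).iterdir()):
        if not author_dir.is_dir():
            continue  # skip stray files next to the author folders
        for txt_path in sorted(author_dir.glob("*.txt")):
            rows.append({"Author": author_dir.name,
                         "Text": txt_path.read_text(encoding="utf-8")})
    return pd.DataFrame(rows, columns=["Author", "Text"])


# Tiny throwaway corpus so the sketch runs without the real C50 folders.
with tempfile.TemporaryDirectory() as tmp:
    demo_corpus = {"AuthorA": ["first", "second"], "AuthorB": ["third"]}
    for author, texts in demo_corpus.items():
        folder = Path(tmp) / author
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"{i}.txt").write_text(text, encoding="utf-8")
    demo_df = load_author_folders(tmp)

print(demo_df)
```

Sorting the folders and files makes the row order reproducible, which matters later when rows are selected by position.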
                

In [3]:
# loading training data (note the file path)
df_train = multiple_file_load('D:/Github/Data-Science-Bootcamp/CAPSTONE - Unsupervised Learning/C50/C50train')

In [4]:
df_train.head()

Unnamed: 0,Author,Text
0,AaronPressman,The Internet may be overflowing with new techn...
1,AaronPressman,The U.S. Postal Service announced Wednesday a ...
2,AaronPressman,Elementary school students with access to the ...
3,AaronPressman,An influential Internet organisation has backe...
4,AaronPressman,An influential Internet organisation has backe...


In [5]:
len(df_train)

2500

In [6]:
# adding a space between first and last name in the Author column (purely cosmetic)
import re

In [7]:
author_split = [re.findall('[A-Z][a-z]*', i) for i in df_train.Author]

In [8]:
author_split[:5]

[['Aaron', 'Pressman'],
 ['Aaron', 'Pressman'],
 ['Aaron', 'Pressman'],
 ['Aaron', 'Pressman'],
 ['Aaron', 'Pressman']]

In [9]:
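# joining the name parts back together with a space
# NOTE: only the first two parts are kept, and the regex also splits at apostrophes,
# so such a name is cut short (it appears as "Lynne O" in later output)
author_join = []

for couple in author_split:
    joined_string = couple[0] + ' ' + couple[1]
    author_join.append(joined_string)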
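
One way to avoid cutting names short is to insert the space directly instead of splitting and rejoining. The following is only a sketch; `add_space_to_name` is a hypothetical helper, not something the notebook defines.

```python
import re


def add_space_to_name(name):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)


names = ["AaronPressman", "JoWinterbottom", "LynneO'Donnell"]
spaced = [add_space_to_name(n) for n in names]
print(spaced)
```

Because the pattern only matches the boundary between a lowercase and an uppercase letter, the apostrophe stays inside the surname.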

In [10]:
df_train['Author'] = pd.Series(author_join)
df_train.tail()

Unnamed: 0,Author,Text
2495,William Kazer,China's central bank chief has said that infla...
2496,William Kazer,"China ushered in 1997, a year it has hailed as..."
2497,William Kazer,China issued tough new rules on the handling o...
2498,William Kazer,China will avoid bold moves in tackling its ai...
2499,William Kazer,Communist Party chief Jiang Zemin has put his ...


In [11]:
# Before I begin adding features, assignment asks for 25% data split, NOT 50/50
# Going to have to concat some of those testing articles
df_test = multiple_file_load('D:/Github/Data-Science-Bootcamp/CAPSTONE - Unsupervised Learning/C50/C50test')
df_test.head()

Unnamed: 0,Author,Text
0,AaronPressman,U.S. Senators on Tuesday sharply criticized a ...
1,AaronPressman,Two members of Congress criticised the Federal...
2,AaronPressman,Commuters stuck in traffic on the Leesburg Pik...
3,AaronPressman,A broad coalition of corporations went to Capi...
4,AaronPressman,"On the Internet, where new products come and g..."


In [12]:
#another fix to Author column

author_split = [re.findall('[A-Z][a-z]*', i) for i in df_test.Author]

author_join = []

for couple in author_split:
    joined_string = couple[0] + ' ' + couple[1]
    author_join.append(joined_string)    
    
df_test['Author'] = pd.Series(author_join)

In [13]:
'''GOAL:
Trying to get half of the datapoints OF EACH AUTHOR
in the testing set into a new DataFrame, which
will be concatenated onto the training set.
I will delete that from the testing set later.

Doing this instead of combining both and splitting 75/25 later 
ensures balanced data between the authors.
'''

def appendingdataframe(dataframe):
    # NOTE: the rows are always sliced from the global df_test;
    # `dataframe` only supplies the author names to loop over
    author_frames = []

    for item in dataframe.Author.unique():
        print("Looping through ", item)
        df_testauthor = df_test[df_test['Author'] == item].copy()
        author_frames.append(df_testauthor[25:])  # second half of this author's 50 test articles

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    return pd.concat(author_frames, ignore_index=True)
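
The same "keep the first N articles of each author, move the rest" idea can also be expressed without a loop. This is a sketch on toy data; `split_by_author_position` and `n_keep` are hypothetical names, and the toy frame stands in for `df_test`.

```python
import pandas as pd


def split_by_author_position(df, n_keep, author_col="Author"):
    """Split df into (kept, moved) by each row's position within its author group."""
    position = df.groupby(author_col).cumcount()  # 0, 1, 2, ... restarting per author
    kept = df[position < n_keep]
    moved = df[position >= n_keep]
    return kept, moved


toy_test = pd.DataFrame({
    "Author": ["A"] * 4 + ["B"] * 4,
    "Text": [f"article {i}" for i in range(8)],
})

kept, moved = split_by_author_position(toy_test, n_keep=2)
print(kept["Author"].value_counts().to_dict())
print(moved["Author"].value_counts().to_dict())
```

With the real data, `n_keep=25` would correspond to the `[25:]` slice used above, and both halves come out of a single pass, so no later de-duplication step is needed to find the rows that stayed behind.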
    

In [14]:
# building the extra rows in a separate DataFrame to avoid altering the original data
# This is deliberately inefficient, in exchange for being cautious

df_train2 = pd.concat([df_train, appendingdataframe(df_train)], ignore_index=True)

Looping through  Aaron Pressman
Looping through  Alan Crosby
Looping through  Alexander Smith
Looping through  Benjamin Kang
Looping through  Bernard Hickey
Looping through  Brad Dorfman
Looping through  Darren Schuettler
Looping through  David Lawder
Looping through  Edna Fernandes
Looping through  Eric Auchard
Looping through  Fumiko Fujisaki
Looping through  Graham Earnshaw
Looping through  Heather Scoffield
Looping through  Jane Macartney
Looping through  Jan Lopatka
Looping through  Jim Gilchrist
Looping through  Joe Ortiz
Looping through  John Mastrini
Looping through  Jonathan Birt
Looping through  Jo Winterbottom
Looping through  Karl Penhaul
Looping through  Keith Weir
Looping through  Kevin Drawbaugh
Looping through  Kevin Morrison
Looping through  Kirstin Ridley
Looping through  Kourosh Karimkhany
Looping through  Lydia Zajc
Looping through  Lynne O
Looping through  Lynnley Browning
Looping through  Marcel Michelson
Looping through  Mark Bendeich
Looping through  Martin Wolk

In [15]:
# checking if the appending worked
len(df_train2)

3750

In [16]:
df_train = df_train2.copy()

In [17]:
# doing same for df_test

df_test2 = pd.concat([df_test, appendingdataframe(df_train)], ignore_index=True)

# checking if the appending worked
len(df_test2)

Looping through  Aaron Pressman
Looping through  Alan Crosby
Looping through  Alexander Smith
Looping through  Benjamin Kang
Looping through  Bernard Hickey
Looping through  Brad Dorfman
Looping through  Darren Schuettler
Looping through  David Lawder
Looping through  Edna Fernandes
Looping through  Eric Auchard
Looping through  Fumiko Fujisaki
Looping through  Graham Earnshaw
Looping through  Heather Scoffield
Looping through  Jane Macartney
Looping through  Jan Lopatka
Looping through  Jim Gilchrist
Looping through  Joe Ortiz
Looping through  John Mastrini
Looping through  Jonathan Birt
Looping through  Jo Winterbottom
Looping through  Karl Penhaul
Looping through  Keith Weir
Looping through  Kevin Drawbaugh
Looping through  Kevin Morrison
Looping through  Kirstin Ridley
Looping through  Kourosh Karimkhany
Looping through  Lydia Zajc
Looping through  Lynne O
Looping through  Lynnley Browning
Looping through  Marcel Michelson
Looping through  Mark Bendeich
Looping through  Martin Wolk

3750

In [18]:
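# and now to drop the rows added to df_train from df_test
# keep=False removes every copy of a duplicated row, so the moved rows disappear entirely
# (an article that is naturally duplicated within df_test would be dropped as well)

df_test2.drop_duplicates(keep=False, inplace=True)
len(df_test2)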

1250
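
The corpus contains some repeated articles (see the identical rows in the `head()` output above), so it is worth knowing how `drop_duplicates(keep=False)` treats them. The sketch below uses made-up toy rows, not the real data, and compares the append-then-deduplicate approach with dropping the moved rows by index label.

```python
import pandas as pd

toy_test = pd.DataFrame({
    "Author": ["A", "A", "A", "A"],
    "Text": ["same story", "same story", "story two", "story three"],
})
moved = toy_test.iloc[2:]  # rows that were copied into the training set

# Approach used in the notebook: append the moved rows again, then drop all duplicates.
stacked = pd.concat([toy_test, moved], ignore_index=True)
by_dedup = stacked.drop_duplicates(keep=False)

# Alternative: drop the moved rows by their index labels.
by_index = toy_test.drop(index=moved.index)

print(len(by_dedup), len(by_index))
```

In the toy frame the de-duplication route also discards the two "same story" rows that were never moved, while the index-based route keeps them.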

In [21]:
df_test = df_test2.copy()

# Adding Features

Just some fun numerical features!

In [20]:
# adding some numerical features for text analysis

df_train['Raw Character Count'] = df_train['Text'].apply(lambda x: len(x))
df_train['Raw Word Count'] = df_train['Text'].apply(lambda x: len(x.split()))

In [22]:
# doing same for df_test

df_test['Raw Character Count'] = df_test['Text'].apply(lambda x: len(x))
df_test['Raw Word Count'] = df_test['Text'].apply(lambda x: len(x.split()))
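
The two counts can also be computed with pandas' vectorized string methods. A small sketch on toy sentences (the sentences are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Text": ["China issued new rules.", "Two words"]})

# .str.len() counts characters; .str.split().str.len() counts whitespace-separated tokens
toy["Raw Character Count"] = toy["Text"].str.len()
toy["Raw Word Count"] = toy["Text"].str.split().str.len()

print(toy)
```

Both versions split on whitespace only, so punctuation stays attached to the neighbouring word ("rules." counts as one word).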

In [23]:
# creating numerical class labels for authors:
# factorize gives one integer label per author, which is what a classifier target needs;
# one-hot encoding would instead spread the single target over 50 indicator columns

df_train['AuthorNum'] = pd.factorize(df_train.Author)[0]
df_train['AuthorNum'] = df_train['AuthorNum'].astype("category")

In [24]:
df_train.tail()

Unnamed: 0,Author,Text,Raw Character Count,Raw Word Count,AuthorNum
3745,William Kazer,China has scored new successes in its fight ag...,2473,411,49
3746,William Kazer,China has scored new successes in its fight ag...,2473,411,49
3747,William Kazer,China is on target with plans to to promote 10...,1742,287,49
3748,William Kazer,China may need to adjust the mix of its treasu...,3263,546,49
3749,William Kazer,A Chinese ideologue known for his strictly ort...,3026,483,49


In [25]:
# and same for df_test...
# NOTE: the codes match df_train only because both frames list the authors in the same order

df_test['AuthorNum'] = pd.factorize(df_test.Author)[0]
df_test['AuthorNum'] = df_test['AuthorNum'].astype("category")
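
Factorizing the two frames separately only yields matching labels when the authors appear in the same order in both. A more defensive variant, sketched here on toy frames, learns the mapping once from the training set and reuses it for the test set.

```python
import pandas as pd

train = pd.DataFrame({"Author": ["Aaron Pressman", "Alan Crosby", "Aaron Pressman"]})
test = pd.DataFrame({"Author": ["Alan Crosby", "Aaron Pressman"]})  # different order

# Learn the code <-> author mapping once, from the training set.
codes, authors = pd.factorize(train["Author"])
train["AuthorNum"] = codes

# Reuse the same mapping for the test set instead of factorizing it separately.
test["AuthorNum"] = pd.Categorical(test["Author"], categories=authors).codes

# For contrast: factorizing the test set on its own numbers authors by first appearance.
separate_codes = pd.factorize(test["Author"])[0]

print(test)
print(list(separate_codes))
```

An author that never appears in the training set would receive the code -1 under this scheme, which makes such cases easy to spot.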

In [26]:
df_test.tail()

Unnamed: 0,Author,Text,Raw Character Count,Raw Word Count,AuthorNum
2470,William Kazer,China's Foreign Minister Qian Qichen on Friday...,1827,299,49
2471,William Kazer,China blamed criminal elements on Sunday for a...,3156,516,49
2472,William Kazer,An unemployed Taiwanese journalist on Monday d...,3000,492,49
2473,William Kazer,China moved ahead on Wednesday with plans to h...,3762,590,49
2474,William Kazer,Premier Li Peng said on Friday China wanted a ...,2417,408,49


In [None]:
# Doing a "Meaningful Word Count" via spacy implementation

import spacy

# 'en' is a shortcut link that only works on spaCy 2.x;
# on spaCy 3+ load the model by its full name, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')

df_train['Spacy-ed Text'] = pd.Series([nlp(text) for text in df_train.Text])

# Parsing every article with spaCy is slow (about five minutes for df_train). Be patient!

In [None]:
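# assigning a plain list (by position): df_test's index has gaps after drop_duplicates,
# so a pd.Series with a fresh 0..n-1 index would be misaligned and leave NaN behind
df_test['Spacy-ed Text'] = [nlp(text) for text in df_test.Text]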

In [None]:
from collections import Counter # good to know that it exists

def lemma_frequencies(text, include_stop=True):
    
    # Build a list of lemmas.
    # Strip out punctuation and, optionally, stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_)
            
    # Build and return a Counter object containing word counts.
    return Counter(lemmas)
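
To see what the `include_stop` flag changes without loading a spaCy model, the sketch below runs the same filtering logic on hand-made tokens. `FakeToken` is a hypothetical stand-in that only mimics the three token attributes the function reads (`lemma_`, `is_punct`, `is_stop`); real spaCy tokens carry much more.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the three token attributes used by the function."""
    lemma_: str
    is_punct: bool = False
    is_stop: bool = False


def lemma_frequencies(text, include_stop=True):
    # Same filter as in the notebook: drop punctuation, optionally drop stop words.
    lemmas = [token.lemma_ for token in text
              if not token.is_punct and (not token.is_stop or include_stop)]
    return Counter(lemmas)


doc = [
    FakeToken("the", is_stop=True),
    FakeToken("bank"),
    FakeToken("raise"),
    FakeToken("the", is_stop=True),
    FakeToken("rate"),
    FakeToken(".", is_punct=True),
]

with_stops = lemma_frequencies(doc)
without_stops = lemma_frequencies(doc, include_stop=False)
print(sum(with_stops.values()), sum(without_stops.values()))
```

With the default `include_stop=True` the stop words are counted too, so a count of "meaningful" words needs `include_stop=False`.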

In [None]:
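# NOTE: each cell holds a Counter of lemma frequencies; include_stop=False leaves stop words out
# plain lists are assigned by position, which avoids index misalignment in df_test
df_train['Meaningful Word Count'] = [lemma_frequencies(text, include_stop=False) for text in df_train['Spacy-ed Text']]
df_test['Meaningful Word Count'] = [lemma_frequencies(text, include_stop=False) for text in df_test['Spacy-ed Text']]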

In [None]:
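# df_testna is not defined above; assumed here to be the df_test rows that contain NaN
df_testna = df_test[df_test.isna().any(axis=1)].copy()
df_testna['raw char count'] = df_testna['Text'].str.len()  # per-row length, not len(Series)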

# Vectorizing! Changing Text to Numbers

Once everything is numerical, we can feed the data into the clustering algorithms for analysis.
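
As a rough illustration of what "changing text to numbers" means, here is a hand-rolled bag-of-words sketch. It is a simplified stand-in for a library vectorizer, not the method this notebook goes on to use; `bag_of_words` and the toy sentences are assumptions for illustration.

```python
from collections import Counter

import pandas as pd


def bag_of_words(texts):
    """Return a document-by-word count matrix (one row per text, one column per word)."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return pd.DataFrame(rows, columns=vocabulary)


toy_texts = ["China issued new rules", "new rules new bank"]
matrix = bag_of_words(toy_texts)
print(matrix)
```

Each article becomes a fixed-length numeric row, which is the shape a clustering algorithm expects as input.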

In [None]:
df_test[df_test.isna().any(axis=1)]
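
One plausible source of missing values in `df_test` is index alignment: its index has gaps after `drop_duplicates` (the `tail()` output above ends at 2474 for a 1250-row frame), and a column assigned from a freshly built `pd.Series` is matched on index labels rather than on position. A toy sketch of the effect:

```python
import pandas as pd

# Toy frame whose index has a gap, like df_test after drop_duplicates.
toy = pd.DataFrame({"Text": ["a", "b", "c", "d"]}, index=[0, 1, 50, 51])
values = ["A", "B", "C", "D"]

# A new Series gets labels 0..3, so only rows 0 and 1 find a match.
toy["via_series"] = pd.Series(values)

# A plain list is assigned by position, so every row is filled.
toy["via_list"] = values

print(toy)
```

Calling `reset_index(drop=True)` on the frame before adding columns would be another way to avoid the mismatch.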