# 20 News Group Feature Extraction

---

## Overview:

**This dataset is a collection of documents from 20 newsgroups. It has been processed for the purpose of feature extraction.**

*This is a list of the 20 newsgroups:*

- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- misc.forsale
- talk.politics.misc
- talk.politics.guns
- talk.politics.mideast
- talk.religion.misc
- alt.atheism
- soc.religion.christian

#### Download Link: [20news-bydate.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz) - 20 Newsgroups sorted by date; duplicates and some headers removed (18846 documents)

The 20 newsgroups dataset was transformed using the **Bag of Words** and **Term Frequency-Inverse Document Frequency (TF-IDF)** methods. The code below keeps only the newsgroups whose path matches one of four main classes:

 - Computer (`comp`)
 - Recreational (`rec`)
 - Science (`sci`)
 - Talk (`talk`)

For each of these classes, the data is split into a train set and a test set, and each set is written to CSV files.
 
 
 **Used Methods**
 
 - [x] Bag of Words (counts) - `CountVectorizer`
 - [x] Bag of Words (binary) - `CountVectorizer` with `binary=True`
 - [x] TF-IDF - `TfidfVectorizer`
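
The three weightings can be illustrated without scikit-learn. The following is a minimal pure-Python sketch on a made-up three-document corpus. The helper names (`count_vector`, `binary_vector`, `tfidf_vector`) are hypothetical, tokenization is a plain whitespace split, and the idf uses the smoothed form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents as its default. It is not expected to reproduce `TfidfVectorizer` output exactly, since that class also applies its own tokenization.

```python
import math
from collections import Counter

# toy corpus, made up for illustration
docs = ["the cat sat", "the cat saw the dog", "a dog barked"]
tokenized = [d.split() for d in docs]

# fixed, sorted vocabulary shared by all three representations
vocab = sorted({w for doc in tokenized for w in doc})

def count_vector(tokens):
    """Bag of words: how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def binary_vector(tokens):
    """Binary bag of words: 1 if the word occurs at all, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

# smoothed inverse document frequency (assumed to mirror the sklearn default)
n_docs = len(tokenized)
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Term count times idf, scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

count_rows = [count_vector(t) for t in tokenized]
binary_rows = [binary_vector(t) for t in tokenized]
tfidf_rows = [tfidf_vector(t) for t in tokenized]
```

Rare words such as `barked` receive a larger idf than words that appear in several documents, which is the reason TF-IDF is often preferred over raw counts.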




#### Generated CSV Files:

`comp_train_freq.csv` `comp_test_freq.csv` `comp_train_binary.csv` `comp_test_binary.csv` `comp_train_tfidf.csv` `comp_test_tfidf.csv`

`rec_train_freq.csv` `rec_test_freq.csv` `rec_train_binary.csv` `rec_test_binary.csv` `rec_train_tfidf.csv` `rec_test_tfidf.csv`

`sci_train_freq.csv` `sci_test_freq.csv` `sci_train_binary.csv` `sci_test_binary.csv` `sci_train_tfidf.csv` `sci_test_tfidf.csv`

`talk_train_freq.csv` `talk_test_freq.csv` `talk_train_binary.csv` `talk_test_binary.csv` `talk_train_tfidf.csv` `talk_test_tfidf.csv`

---

# Code Implementation

### Import required modules

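Import the required modules from `os`, `pandas`, `sklearn`, and `nltk`. The NLTK `stopwords` and `wordnet` corpora and the `punkt` tokenizer models must also be available locally, for example via `nltk.download()`.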

In [1]:
import os
import pandas as pd
from string import punctuation

# libraries importing from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# natural language toolkit packages
import nltk
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer
from nltk.stem.snowball import  EnglishStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

### Initialize the methods

In [2]:
# get the set of stop words
_stop_words = set(stopwords.words('english') + list(punctuation))

# initialize treebank word tokenizer and detokenizer
tokenizer = TreebankWordTokenizer() 
detokenizer = TreebankWordDetokenizer()

# initialize english stemmer
stemmer = EnglishStemmer()

# initialize word net lemmatizer
lemmatizer = WordNetLemmatizer()

# set of words known to WordNet for the English language
_word_list = set(wn.words(lang='eng'))

### Get all the file list

Walk the directory tree with `os.walk()` and **return the paths of all files** found under the given directory.

In [3]:
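# method to collect the paths of all files under a directory into a list
def read_files(directory):
    paths = []
    # os.walk yields (current directory, subdirectories, file names) for every level
    for root, dirs, files in os.walk(directory):
        for file in files:
            paths.append(os.path.join(root, file))
    return paths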
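
To see what the directory walk returns, here is a small self-contained sketch that builds a temporary directory tree and lists it. The function name `list_files` and the two sample folders are made up for the demonstration; only `os.walk` and `tempfile` from the standard library are used.

```python
import os
import tempfile

def list_files(root):
    """Return the full path of every file below `root`."""
    paths = []
    for current_dir, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(current_dir, filename))
    return paths

# build a tiny, hypothetical dataset layout in a temporary directory
with tempfile.TemporaryDirectory() as root:
    for group, name in [("comp.graphics", "1"), ("sci.space", "2")]:
        os.makedirs(os.path.join(root, group))
        with open(os.path.join(root, group, name), "w", encoding="utf-8") as handle:
            handle.write("sample post")

    # paths relative to the root, sorted for a stable order
    found = sorted(os.path.relpath(p, root) for p in list_files(root))

print(found)
```

Note that the walk returns every file it finds, so pointing it at a directory that also holds notebooks or previously generated CSV files will include those paths too.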

### Open and Categorize files

Open and read all the files and categorize them into `comp`, `rec`, `sci`, and `talk` based on their paths.

In [4]:
# method to read the files and group them by top-level category
def open_files(file_paths):
    data = {'comp': [], 'rec': [], 'sci': [], 'talk': []}

    for path in file_paths:
        for category, docs in data.items():
            # substring match on the path; the first matching category wins
            if category in path:
                # the context manager closes each file after reading
                with open(path, encoding='utf-8', errors='ignore') as f:
                    docs.append(f.read())
                break
    return data['comp'], data['rec'], data['sci'], data['talk']
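
Matching a category name anywhere in the path is simple, but it can also pick up unrelated files whose names happen to contain `comp`, `rec`, `sci`, or `talk` (for instance a generated CSV file in the same directory). A stricter variant, sketched below under the assumption that paths use forward slashes and that each post lives in a folder named after its newsgroup, looks only at the prefix of a dotted folder name. The function `category_of` is hypothetical and not part of the original code.

```python
from pathlib import PurePosixPath

CATEGORIES = ("comp", "rec", "sci", "talk")

def category_of(path):
    """Return the top-level category of a newsgroup path, or None."""
    for part in PurePosixPath(path).parts:
        # a newsgroup folder looks like "comp.graphics": take the text before the first dot
        prefix = part.split(".", 1)[0]
        if "." in part and prefix in CATEGORIES:
            return prefix
    return None

print(category_of("./20news-bydate-train/comp.graphics/38758"))  # comp
print(category_of("./comp_train_freq.csv"))                      # None
```

Newsgroups outside the four categories, such as `misc.forsale`, return `None` and can be skipped by the caller.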

### Get list of unique set of words

In [5]:
def get_unique_word_set(data_list):
    word_set = set()
    for docs in data_list:
        for doc in docs:
            for word in word_tokenize(doc):
                # normalize each token: stem first, then lemmatize
                word = lemmatizer.lemmatize(stemmer.stem(word))
                # keep alphabetic dictionary words that are not stop words
                if word not in _stop_words and word in _word_list and word.isalpha():
                    word_set.add(word)
    return list(word_set)

### Clean and Preprocess Data

**Tokenize** and **normalize** the text, then **detokenize** the cleaned tokens. The `stemmer` and `lemmatizer` normalize word forms, for example:

>running, runs -> run
>
>eating, eats -> eat

The cleaned data contains no numbers, only alphabetic tokens that appear in the English word list.

In [6]:
# method to clean data
def clean_data(data):
    cleaned_data = []
    for text in data:
        words = []
        for word in tokenizer.tokenize(text):
            # normalize each token: stem first, then lemmatize
            word = lemmatizer.lemmatize(stemmer.stem(word))
            # keep only alphabetic tokens found in the English word list
            if word in _word_list and word.isalpha():
                words.append(word)
        # rebuild a single string so the vectorizers can tokenize it again
        cleaned_data.append(detokenizer.detokenize(words))
    return cleaned_data
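
The filtering step at the end of `clean_data` can be looked at in isolation. The sketch below uses a tiny made-up word list in place of the WordNet vocabulary and assumes the tokens have already been stemmed; `keep_valid_words` is a hypothetical helper. It also shows a side effect of checking stems against a dictionary: a stem such as `comput` is dropped because it is not itself a dictionary word.

```python
def keep_valid_words(tokens, word_list):
    """Keep tokens that are purely alphabetic and present in the word list."""
    return [t for t in tokens if t in word_list and t.isalpha()]

# stand-in for the WordNet word list (illustration only)
toy_word_list = {"run", "fast", "dog"}

# already-normalized tokens, including a number, a mixed token, and a bare stem
tokens = ["run", "fast", "42", "dog", "x11", "comput", "dog"]

cleaned = " ".join(keep_valid_words(tokens, toy_word_list))
print(cleaned)  # run fast dog dog
```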

### Vectorize data
---

#### Convert a collection of text documents to a matrix of token counts

In [7]:
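def count_vectorize_csv(cleaned_data, name, vocabulary_set):
    # recent scikit-learn releases expect stop_words as a list, not a set
    vectorizer = CountVectorizer(stop_words=list(_stop_words), vocabulary=vocabulary_set)
    x = vectorizer.fit_transform(cleaned_data)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    y = vectorizer.get_feature_names_out()

    # one row per document, one column per vocabulary word
    df = pd.DataFrame(data=x.toarray(), columns=y)

    df.to_csv(name+'_freq.csv')

    print(name+'_freq.csv successfully created!')
    return df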

#### Convert a collection of text documents to a matrix of binary values

In [8]:
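def binary_vectorize_csv(cleaned_data, name, vocabulary_set):
    # binary=True records presence (1) or absence (0) instead of counts
    vectorizer = CountVectorizer(stop_words=list(_stop_words), binary=True, vocabulary=vocabulary_set)
    x = vectorizer.fit_transform(cleaned_data)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    y = vectorizer.get_feature_names_out()

    # one row per document, one column per vocabulary word
    df = pd.DataFrame(data=x.toarray(), columns=y)

    df.to_csv(name+'_binary.csv')

    print(name+'_binary.csv successfully created!')
    return df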

#### Convert a collection of text documents to a matrix of token weights

In [9]:
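def tfidf_vectorize_csv(cleaned_data, name, vocabulary_set):
    # recent scikit-learn releases expect stop_words as a list, not a set
    vectorizer = TfidfVectorizer(stop_words=list(_stop_words), vocabulary=vocabulary_set)
    # note: the idf weights are fitted on whichever data set is passed in
    x = vectorizer.fit_transform(cleaned_data)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    y = vectorizer.get_feature_names_out()

    # one row per document, one column per vocabulary word
    df = pd.DataFrame(data=x.toarray(), columns=y)

    df.to_csv(name+'_tfidf.csv')

    print(name+'_tfidf.csv successfully created!')
    return df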

### Vectorize data using `CountVectorizer()` and `TfidfVectorizer()` and return `train/test` set

In [10]:
def get_feature_vectors(data, name, vocabulary_set):
    cleaned_data = clean_data(data)

    # the same function serves train and test sets; `name` selects the output files
    count_df = count_vectorize_csv(cleaned_data, name, vocabulary_set)
    binary_df = binary_vectorize_csv(cleaned_data, name, vocabulary_set)
    tfidf_df = tfidf_vectorize_csv(cleaned_data, name, vocabulary_set)
    return count_df, binary_df, tfidf_df

# Program Execution

---

#### Call `read_files()` method to get all the files to the `files` variable

In [11]:
files = read_files('./')

#### Call `open_files()` method to *read and categorize files* into `comp_data` `rec_data` `sci_data` `talk_data`

In [12]:
# open_files returns one list of document strings per category
comp_data, rec_data, sci_data, talk_data = open_files(files)

#### Split train and test set

Call `train_test_split()` on the data of each category.

**Train data = 70%, Test data = 30%**

In [13]:
comp_data_train, comp_data_test = train_test_split(comp_data, train_size=0.7, test_size=0.3, random_state=42)
rec_data_train, rec_data_test = train_test_split(rec_data, train_size=0.7, test_size=0.3, random_state=42)
sci_data_train, sci_data_test = train_test_split(sci_data, train_size=0.7, test_size=0.3, random_state=42)
talk_data_train, talk_data_test = train_test_split(talk_data, train_size=0.7, test_size=0.3, random_state=42)
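
For readers unfamiliar with what a seeded 70/30 split does, the idea can be sketched with the standard library alone. This is a stand-in, not scikit-learn's implementation: the function `split_70_30` is hypothetical, it will not reproduce the exact ordering produced by `train_test_split`, and rounding the test share up is a choice made here.

```python
import random

def split_70_30(items, seed=42):
    """Shuffle with a fixed seed, then cut off roughly 30% as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # integer ceiling of 30% of the items
    n_test = (3 * len(shuffled) + 9) // 10
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_70_30(range(10))
print(len(train), len(test))  # 7 3
```

Fixing the seed makes the split repeatable, which is the role `random_state=42` plays in the calls above.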

**Train and Test data sets**

In [14]:
train_data = [comp_data_train, rec_data_train, sci_data_train, talk_data_train]
test_data = [comp_data_test, rec_data_test, sci_data_test, talk_data_test]

#### Get unique word set

Call `get_unique_word_set()` to build the word dictionary from `comp_data_train` `rec_data_train` `sci_data_train` `talk_data_train`

**We use this word dictionary as the `vocabulary` parameter of `CountVectorizer` and `TfidfVectorizer`.**

In [15]:
unique_word_set = get_unique_word_set(train_data)
unique_word_set.sort()
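
Building the vocabulary once from the training data and reusing it everywhere means every generated matrix has the same columns in the same order. A minimal pure-Python sketch of that idea follows; the `vectorize` helper and the sample sentences are made up, and tokenization is a plain whitespace split.

```python
from collections import Counter

def vectorize(doc, vocabulary):
    """Count how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

train_docs = ["space probe launch", "probe data"]
test_docs = ["launch window data", "unknown word"]

# the vocabulary comes from the training documents only
vocabulary = sorted({w for d in train_docs for w in d.split()})

train_matrix = [vectorize(d, vocabulary) for d in train_docs]
test_matrix = [vectorize(d, vocabulary) for d in test_docs]

print(vocabulary)    # ['data', 'launch', 'probe', 'space']
print(test_matrix)   # [[1, 1, 0, 0], [0, 0, 0, 0]]
```

Words that appear only in the test documents, such as `window`, have no column and are silently ignored, so a model trained on the train matrix can be applied to the test matrix directly.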

### Call `get_feature_vectors` method on `comp_data_train` and `comp_data_test`

It generates the train and test CSV files for the **computer category**

**Vectorize `comp_data_train`**

In [16]:
comp_f_train, comp_b_train, comp_t_train = get_feature_vectors(comp_data_train, 'comp_train', unique_word_set)

comp_train_freq.csv successfully created!
comp_train_binary.csv successfully created!
comp_train_tfidf.csv successfully created!


**Vectorize `comp_data_test`**

In [17]:
comp_f_test, comp_b_test, comp_t_test = get_feature_vectors(comp_data_test, 'comp_test', unique_word_set)

comp_test_freq.csv successfully created!
comp_test_binary.csv successfully created!
comp_test_tfidf.csv successfully created!


### Call `get_feature_vectors` method on `rec_data_train` and `rec_data_test`

It generates the train and test CSV files for the **recreation category**

**Vectorize `rec_data_train`**

In [18]:
rec_f_train, rec_b_train, rec_t_train = get_feature_vectors(rec_data_train, 'rec_train', unique_word_set)

rec_train_freq.csv successfully created!
rec_train_binary.csv successfully created!
rec_train_tfidf.csv successfully created!


**Vectorize `rec_data_test`**

In [19]:
rec_f_test, rec_b_test, rec_t_test = get_feature_vectors(rec_data_test, 'rec_test', unique_word_set)

rec_test_freq.csv successfully created!
rec_test_binary.csv successfully created!
rec_test_tfidf.csv successfully created!


### Call `get_feature_vectors` method on `sci_data_train` and `sci_data_test`

It generates the train and test CSV files for the **science category**

**Vectorize `sci_data_train`**

In [20]:
sci_f_train, sci_b_train, sci_t_train = get_feature_vectors(sci_data_train, 'sci_train', unique_word_set)

sci_train_freq.csv successfully created!
sci_train_binary.csv successfully created!
sci_train_tfidf.csv successfully created!


**Vectorize `sci_data_test`**

In [21]:
sci_f_test, sci_b_test, sci_t_test = get_feature_vectors(sci_data_test, 'sci_test', unique_word_set)

sci_test_freq.csv successfully created!
sci_test_binary.csv successfully created!
sci_test_tfidf.csv successfully created!


### Call `get_feature_vectors` method on `talk_data_train` and `talk_data_test`

It generates the train and test CSV files for the **talk category**

**Vectorize `talk_data_train`**

In [22]:
talk_f_train, talk_b_train, talk_t_train = get_feature_vectors(talk_data_train, 'talk_train', unique_word_set)

talk_train_freq.csv successfully created!
talk_train_binary.csv successfully created!
talk_train_tfidf.csv successfully created!


**Vectorize `talk_data_test`**

In [23]:
talk_f_test, talk_b_test, talk_t_test = get_feature_vectors(talk_data_test, 'talk_test', unique_word_set)

talk_test_freq.csv successfully created!
talk_test_binary.csv successfully created!
talk_test_tfidf.csv successfully created!
