# 20 Newsgroups dataset feature extraction

---

## Overview

**The 20 Newsgroups dataset is a collection of newsgroup documents spread across 20 newsgroups. Here it has been processed for the purpose of feature extraction.**

*This is a list of the 20 newsgroups:*

- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- misc.forsale
- talk.politics.misc
- talk.politics.guns
- talk.politics.mideast
- talk.religion.misc
- alt.atheism
- soc.religion.christian

#### Download Link: [20news-bydate.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz) - 20 Newsgroups sorted by date; duplicates and some headers removed (18846 documents)

The 20 Newsgroups dataset was transformed using the **Bag of Words** and **Term Frequency-Inverse Document Frequency (TF-IDF)** methods. After transformation, the documents are grouped into five main classes:

- Computer (`comp`)
- Recreational (`rec`)
- Science (`sci`)
- Talk (`talk`)
- Other (`other`: the `alt`, `misc` and `soc` groups)

For each class and each method, a **`train.csv`** and a **`test.csv`** file are generated from a 70/30 split.
 
 
 **Used Methods**
 
 - [x] Bag of words - CountVectorizer
 - [x] Bag of Words (binary) - CountVectorizer with `binary=True`
 - [x] TF-IDF - TfidfVectorizer
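
To make the difference between the three representations concrete, here is a minimal sketch on a two-document toy corpus. The corpus and the variable names are made up for illustration; only `CountVectorizer` and `TfidfVectorizer` are taken from the pipeline itself.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (an assumption for illustration, not taken from the dataset).
docs = ["space space shuttle", "shuttle launch"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

# Column order follows the learned vocabulary (term -> column index).
terms = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# binary=True records presence/absence instead of the raw count.
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# TF-IDF down-weights terms that occur in many documents; with the default
# settings each row is scaled to unit Euclidean length.
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

print(terms)
print(counts)
print(binary)
print(tfidf.round(3))
```

In the first document "space" occurs twice, so its count is 2 while its binary value is 1. Under TF-IDF, "space" also outweighs "shuttle" in that document because "shuttle" appears in both documents.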




#### Generated CSV Files:

`comp_count_train.csv` `comp_count_test.csv` `comp_binary_train.csv` `comp_binary_test.csv` `comp_tfidf_train.csv` `comp_tfidf_test.csv`

`rec_count_train.csv` `rec_count_test.csv` `rec_binary_train.csv` `rec_binary_test.csv` `rec_tfidf_train.csv` `rec_tfidf_test.csv`

`sci_count_train.csv` `sci_count_test.csv` `sci_binary_train.csv` `sci_binary_test.csv` `sci_tfidf_train.csv` `sci_tfidf_test.csv`

`talk_count_train.csv` `talk_count_test.csv` `talk_binary_train.csv` `talk_binary_test.csv` `talk_tfidf_train.csv` `talk_tfidf_test.csv`

`other_count_train.csv` `other_count_test.csv` `other_binary_train.csv` `other_binary_test.csv` `other_tfidf_train.csv` `other_tfidf_test.csv`
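
The file names follow a `<class>_<method>_<split>.csv` pattern. The sketch below rebuilds the list from that pattern; the variable names are illustrative only.

```python
from itertools import product

classes = ["comp", "rec", "sci", "talk", "other"]
methods = ["count", "binary", "tfidf"]
splits = ["train", "test"]

# One file per (class, method, split) combination.
csv_files = [f"{c}_{m}_{s}.csv" for c, m, s in product(classes, methods, splits)]

print(len(csv_files))
print(csv_files[:2])
```

Because the files are written with `DataFrame.to_csv` and its default settings, the first column holds the original row index; `pd.read_csv(name, index_col=0)` should restore it when loading.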

### Import modules


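- The **os module** in Python provides access to **operating-system-dependent functionality**. Here `os.walk` is used to traverse the directory tree and collect every file path.
- **pandas** is a library for **data manipulation and analysis**. Here it holds the feature matrices and writes them to CSV.
- **scikit-learn** provides `CountVectorizer`, `TfidfVectorizer` and `train_test_split`.
- **NLTK** provides the stop word list, tokenizer, detokenizer, stemmer, lemmatizer and the WordNet word list.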

In [1]:
import os
import pandas as pd
from string import punctuation

# vectorizers and train/test splitting from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# natural language toolkit packages
# (the 'stopwords' and 'wordnet' corpora must be available,
#  e.g. via nltk.download('stopwords') and nltk.download('wordnet'))
import nltk
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer
from nltk.stem.snowball import EnglishStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn

In [2]:
# English stop words plus all punctuation characters
_stop_words = set(stopwords.words('english') + list(punctuation))

# initialize treebank word tokenizer
tokenizer = TreebankWordTokenizer()

# initialize treebank word detokenizer
detokenizer = TreebankWordDetokenizer()

# initialize english stemmer
stemmer = EnglishStemmer()

# initialize word net lemmatizer
lemmatizer = WordNetLemmatizer()

# set of English words known to WordNet
_word_list = set(wn.words(lang='eng'))

In [3]:
# collect the paths of all files below a directory (recursively)
def read_files(directory):
    paths = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            paths.append(os.path.join(root, file))
    return paths
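
As a quick check of how `os.walk` behaves, the following sketch builds a throwaway directory tree and lists it. The helper name `list_files`, the folder names and the document ids are hypothetical; they only mimic a `<split>/<newsgroup>/<document>` layout.

```python
import os
import tempfile

def list_files(directory):
    """Return the paths of all files below `directory`, recursively."""
    paths = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            paths.append(os.path.join(root, name))
    return paths

# Build a small temporary tree (hypothetical names) and walk it.
with tempfile.TemporaryDirectory() as tmp:
    for group, doc_id in [("sci.space", "60804"), ("rec.autos", "101551")]:
        group_dir = os.path.join(tmp, "20news-bydate-train", group)
        os.makedirs(group_dir, exist_ok=True)
        with open(os.path.join(group_dir, doc_id), "w") as fh:
            fh.write("dummy text")

    # Store paths relative to the temporary root so they stay readable.
    found = sorted(os.path.relpath(p, tmp) for p in list_files(tmp))

print(found)
```

Note that `os.walk` returns every file it finds, so anything else stored under the walked directory is listed as well.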

In [4]:
# method to categorize and read the files
def open_files(file_paths):
    comp, rec, sci, talk, other = [], [], [], [], []

    for path in file_paths:
        # pick the target list from the newsgroup prefix found in the path
        if "comp" in path:
            target = comp
        elif "rec" in path:
            target = rec
        elif "sci" in path:
            target = sci
        elif "talk" in path:
            target = talk
        elif "alt" in path or "misc" in path or "soc" in path:
            target = other
        else:
            continue
        # a context manager closes every file handle after reading
        with open(path, encoding='utf-8', errors='ignore') as w:
            target.append(w.read())
    return comp, rec, sci, talk, other
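
`open_files` picks the class by checking whether a substring such as "comp" occurs anywhere in the path. Since the directory walk can also return files that are not part of the dataset (for example CSV files from an earlier run), a stricter variant could look only at the newsgroup directory name. The sketch below is one possible way to do that, not the method used above; `group_of` and `PREFIX_TO_CLASS` are hypothetical names, and the layout `.../<newsgroup>/<document>` is an assumption.

```python
import os

# Map the first component of a newsgroup name to one of the five classes.
PREFIX_TO_CLASS = {
    "comp": "comp", "rec": "rec", "sci": "sci", "talk": "talk",
    "alt": "other", "misc": "other", "soc": "other",
}

def group_of(path):
    """Return the class for a document path, or None if it is not recognized."""
    newsgroup = os.path.basename(os.path.dirname(path))  # e.g. "sci.space"
    prefix = newsgroup.split(".")[0]                      # e.g. "sci"
    return PREFIX_TO_CLASS.get(prefix)

print(group_of("./20news-bydate-train/sci.space/60804"))
print(group_of("./20news-bydate-test/misc.forsale/76000"))
print(group_of("./comp_count_train.csv"))
```

The last call returns `None`, whereas a plain substring test for "comp" would have placed that CSV file in the computer class.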

In [5]:
# clean each document: tokenize, stem, lemmatize, then keep only
# alphabetic tokens that appear in the WordNet word list
def clean_data(data):
    cleaned_data = []
    for text in data:
        words = []
        for word in tokenizer.tokenize(text):
            word = stemmer.stem(word)
            word = lemmatizer.lemmatize(word)
            words.append(word)
        kept = [word for word in words if word.isalpha() and word in _word_list]
        cleaned_data.append(detokenizer.detokenize(kept))
    return cleaned_data
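
The effect of the cleaning step is easiest to see on a single sentence. The sketch below follows the same tokenize, stem, filter and detokenize order, with two simplifications that are assumptions of this example: the lemmatizer is left out and a tiny hand-written `vocabulary` set stands in for the WordNet word list, so that no corpus download is needed.

```python
from nltk.stem.snowball import EnglishStemmer
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
detokenizer = TreebankWordDetokenizer()
stemmer = EnglishStemmer()

# Hypothetical stand-in for the WordNet word list.
vocabulary = {"the", "are", "run", "computer"}

text = "The computers are running."

# Stemming lowercases each token and strips suffixes.
stems = [stemmer.stem(token) for token in tokenizer.tokenize(text)]

# Keep alphabetic stems that are known words, then join them back together.
kept = [w for w in stems if w.isalpha() and w in vocabulary]
cleaned = detokenizer.detokenize(kept)

print(stems)
print(cleaned)
```

Here "computers" is stemmed to "comput", which is not in the vocabulary and is therefore dropped. Stems that are not dictionary words are removed by the word-list filter in the same way, which is worth keeping in mind when inspecting the generated feature columns.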

In [6]:
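# build raw term-count features, split them 70/30 and write both parts to CSV
def count_vectorize_csv(cleaned_data, name):
    # recent scikit-learn releases expect stop_words as a list, not a set
    vectorizer = CountVectorizer(stop_words=list(_stop_words))
    x = vectorizer.fit_transform(cleaned_data)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from 1.0 onwards
    y = vectorizer.get_feature_names_out()

    # one row per document, one column per term
    df = pd.DataFrame(data=x.toarray(), columns=y)

    train, test = train_test_split(df, train_size=0.7, test_size=0.3, random_state=42)

    train.to_csv(name+'_count_train.csv')
    test.to_csv(name+'_count_test.csv')

    print('Count vectorize train and test csv for {0} is successfully created!'.format(name))
    return train, test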
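
The split itself can be tried in isolation. This sketch uses a dummy ten-row DataFrame (an assumption for illustration) with the same `train_test_split` arguments as above, and shows that a fixed `random_state` reproduces the same partition.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten dummy rows standing in for ten vectorized documents.
df = pd.DataFrame({"feature": range(10)})

train, test = train_test_split(df, train_size=0.7, test_size=0.3, random_state=42)
train_again, _ = train_test_split(df, train_size=0.7, test_size=0.3, random_state=42)

print(len(train), len(test))
print(train.index.equals(train_again.index))
```

The rows are shuffled before splitting, and the original row index is kept, which is why it also appears as the first column of the written CSV files.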

In [7]:
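# build binary (presence/absence) features, split them 70/30 and write both parts to CSV
def binary_vectorize_csv(cleaned_data, name):
    # recent scikit-learn releases expect stop_words as a list, not a set
    vectorizer = CountVectorizer(stop_words=list(_stop_words), binary=True)
    x = vectorizer.fit_transform(cleaned_data)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from 1.0 onwards
    y = vectorizer.get_feature_names_out()

    # one row per document, one column per term
    df = pd.DataFrame(data=x.toarray(), columns=y)

    train, test = train_test_split(df, train_size=0.7, test_size=0.3, random_state=42)

    train.to_csv(name+'_binary_train.csv')
    test.to_csv(name+'_binary_test.csv')

    print('Binary vectorize train and test csv for {0} is successfully created!'.format(name))
    return train, test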

In [8]:
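# build TF-IDF features, split them 70/30 and write both parts to CSV
def tfidf_vectorize_csv(cleaned_data, name):
    # recent scikit-learn releases expect stop_words as a list, not a set
    vectorizer = TfidfVectorizer(stop_words=list(_stop_words))
    x = vectorizer.fit_transform(cleaned_data)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from 1.0 onwards
    y = vectorizer.get_feature_names_out()

    # one row per document, one column per term
    df = pd.DataFrame(data=x.toarray(), columns=y)

    train, test = train_test_split(df, train_size=0.7, test_size=0.3, random_state=42)

    train.to_csv(name+'_tfidf_train.csv')
    test.to_csv(name+'_tfidf_test.csv')

    print('Tfidf vectorize train and test csv for {0} is successfully created!'.format(name))
    return train, test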
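
All three functions fit the vectorizer on the full class corpus before splitting, so the vocabulary (and, for TF-IDF, the IDF weights) also reflect documents that end up in the test file. If that matters for a downstream experiment, one alternative is to split the texts first and fit on the training part only. The following is a sketch of that alternative with a made-up corpus, not the approach used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up documents for illustration only.
docs = [
    "space shuttle launch", "hockey game tonight", "graphics card driver",
    "shuttle orbit", "baseball game score", "windows driver update",
    "encryption key length", "motorcycle engine oil", "car engine repair",
    "medical study results",
]

train_docs, test_docs = train_test_split(
    docs, train_size=0.7, test_size=0.3, random_state=42
)

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
x_test = vectorizer.transform(test_docs)        # reuses them unchanged

print(x_train.shape, x_test.shape)
```

Both matrices share the same columns, and terms that occur only in the test documents are ignored by `transform`.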

In [9]:
# clean the raw texts of one class and create all three feature sets for it
def get_feature_vectors(data, name):
    cleaned_data = clean_data(data)
    
    c_train, c_test = count_vectorize_csv(cleaned_data, name)
    b_train, b_test = binary_vectorize_csv(cleaned_data, name)
    t_train, t_test = tfidf_vectorize_csv(cleaned_data, name)
    return c_train, c_test, b_train, b_test, t_train, t_test

In [10]:
# walk the current directory, which is expected to contain the extracted archive
files = read_files('./')

In [11]:
# read every file and group the raw texts by class
comp_data, rec_data, sci_data, talk_data, other_data = open_files(files)

In [12]:
comp_c_train, comp_c_test, comp_b_train, comp_b_test, comp_t_train, comp_t_test = get_feature_vectors(comp_data, 'comp')

Count vectorize train and test csv for comp is successfully created!
Binary vectorize train and test csv for comp is successfully created!
Tfidf vectorize train and test csv for comp is successfully created!


In [13]:
rec_c_train, rec_c_test, rec_b_train, rec_b_test, rec_t_train, rec_t_test = get_feature_vectors(rec_data, 'rec')

Count vectorize train and test csv for rec is successfully created!
Binary vectorize train and test csv for rec is successfully created!
Tfidf vectorize train and test csv for rec is successfully created!


In [14]:
sci_c_train, sci_c_test, sci_b_train, sci_b_test, sci_t_train, sci_t_test = get_feature_vectors(sci_data, 'sci')

Count vectorize train and test csv for sci is successfully created!
Binary vectorize train and test csv for sci is successfully created!
Tfidf vectorize train and test csv for sci is successfully created!


In [15]:
talk_c_train, talk_c_test, talk_b_train, talk_b_test, talk_t_train, talk_t_test = get_feature_vectors(talk_data, 'talk')

Count vectorize train and test csv for talk is successfully created!
Binary vectorize train and test csv for talk is successfully created!
Tfidf vectorize train and test csv for talk is successfully created!


In [16]:
other_c_train, other_c_test, other_b_train, other_b_test, other_t_train, other_t_test = get_feature_vectors(other_data, 'other')

Count vectorize train and test csv for other is successfully created!
Binary vectorize train and test csv for other is successfully created!
Tfidf vectorize train and test csv for other is successfully created!
