# 20Newsgroups corpus

As in the [previous](1.0_Amazon_corpus_to_pandas.ipynb) notebook, this notebook transforms a raw corpus into the Pandas-based format required by the subsequent components.

The 20NEWSGROUPS data set was downloaded from [this](http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz) location.  
The `tar.gz` file contains two folders (`20news-bydate-test` and `20news-bydate-train`), each containing 20 subfolders named after their corresponding category (e.g. `alt.atheism`, `comp.graphics` and `comp.os.ms-windows.misc`), which in turn contain one file per text.  
For example, the file `20news-bydate-train/rec.autos/103532` looks like this:
```
From: hkon@mit.edu (Henry Kon)
Subject: sunroof leaks - I'm all wet
Organization: MIT
Lines: 8
NNTP-Posting-Host: msiegel.mit.edu

My sunroof leaks.  I've always thought those things were a royal pain.

Can anyone provide any insight ?

I know the seal isn't great.  Maybe I could weld the stupid thing shut.

hk

```
In other words, the text itself is stored in the files, whereas the category of each text is encoded in the path to its file.
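
To illustrate how the category can be recovered from a path, here is a minimal sketch. The helper name `category_and_id` is hypothetical and only serves this example; it assumes the file sits directly inside its category folder, as in the layout described above.

```python
import os


def category_and_id(path):
    """Split a corpus file path into (category, document id)."""
    directory, doc_id = os.path.split(path)
    # The category is the name of the folder that directly contains the file.
    category = os.path.basename(directory)
    return category, doc_id


# The example file from above, written with the platform's path separator.
example = os.path.join('20news-bydate-train', 'rec.autos', '103532')
print(category_and_id(example))  # ('rec.autos', '103532')
```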

__Note:__ We ignore the default train/test split because we use our own cross-validation strategy.
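
The code below expects the unpacked archive under `data/20NEWSGROUPS/corpus`. One possible way to get there is sketched here, using the standard-library `tarfile` module. The archive location `data/20NEWSGROUPS/20news-bydate.tar.gz` and the helper name `extract_corpus` are assumptions made for this example, not part of the original setup. Because both folders end up under the same directory, the later directory walk reads the train and test halves together.

```python
import os
import tarfile


def extract_corpus(archive_path, target_dir):
    """Unpack the archive into target_dir and return the top-level entries."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(target_dir)
    return sorted(os.listdir(target_dir))


# Assumed location of the downloaded archive; adjust to your setup.
archive_path = 'data/20NEWSGROUPS/20news-bydate.tar.gz'
if os.path.exists(archive_path):
    print(extract_corpus(archive_path, 'data/20NEWSGROUPS/corpus'))
```

Only extract archives from sources you trust, since `extractall` writes whatever paths the archive contains.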

In [None]:
raw_corpus_path = 'data/20NEWSGROUPS/corpus'
pd_corpus_path = 'data/20NEWSGROUPS/dataframes'

In [None]:
import os
import pandas as pd
from tqdm import tqdm

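documents = []
# Walk the whole corpus; both the train and the test folder are visited.
for dirpath, dirnames, filenames in tqdm(os.walk(raw_corpus_path)):
    # The category is the name of the folder that contains the files.
    category = os.path.basename(dirpath)
    for f in filenames:
        # Some posts are not valid UTF-8, so undecodable bytes are dropped.
        with open(os.path.join(dirpath, f), encoding='utf-8', errors='ignore') as fh:
            text = fh.read()
        # Skip files that are (nearly) empty.
        if len(text) < 10:
            continue
        documents.append({'id': f, 'text': text, 'category': category})

print('loading data has finished. compiling dataframe now')
df = pd.DataFrame(documents)
# Make sure the output folder exists before writing the pickle.
os.makedirs(pd_corpus_path, exist_ok=True)
df.to_pickle(os.path.join(pd_corpus_path, '20NEWSGROUPS.pkl'))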
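
As a quick sanity check, the pickled frame can be loaded again and the documents counted per category. The following is a minimal sketch: the helper name `summarize_corpus` is hypothetical, and the three rows are made-up stand-ins with the same columns (`id`, `text`, `category`) as the real corpus, so the example runs without the data set.

```python
import os
import tempfile

import pandas as pd


def summarize_corpus(pickle_path):
    """Load a pickled corpus and return the number of documents per category."""
    corpus = pd.read_pickle(pickle_path)
    return corpus['category'].value_counts()


# Made-up toy rows that mimic the column layout of the real corpus.
toy = pd.DataFrame([
    {'id': '1', 'text': 'My sunroof leaks.', 'category': 'rec.autos'},
    {'id': '2', 'text': 'Which card supports this mode?', 'category': 'comp.graphics'},
    {'id': '3', 'text': 'Looking for a used hatchback.', 'category': 'rec.autos'},
])

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy.pkl')
    toy.to_pickle(toy_path)
    counts = summarize_corpus(toy_path)

print(counts.to_dict())
```

For the real corpus, pass the path of the pickle written above instead of the temporary toy file.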