# Topic Extraction and Classification from News Data (part 1)

In this notebook we will introduce some tools to analyse the topics of a collection of news documents from the BBC.

Here is the description of the dataset we will be using:

http://mlg.ucd.ie/datasets/bbc.html

In [None]:
from urllib.request import urlretrieve
from pathlib import Path

BBC_DATASET_URL = "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"
archive_filepath = Path(BBC_DATASET_URL.rsplit("/", 1)[1])

if not archive_filepath.exists():
    print(f"Downloading {BBC_DATASET_URL} to {archive_filepath}...")
    urlretrieve(BBC_DATASET_URL, archive_filepath)
    print("done.")
else:
    print(f"{archive_filepath} exists.")

In [None]:
from zipfile import ZipFile

zf = ZipFile(archive_filepath)

In [None]:
zf.filelist[:10]

In [None]:
zf.extractall(path=".")

In [None]:
bbc_folder_path = Path("bbc")
bbc_folder_path.is_dir()

In [None]:
list(bbc_folder_path.iterdir())

In [None]:
print((bbc_folder_path / "README.TXT").read_text(encoding="utf-8"))

In [None]:
text_filepaths = sorted(bbc_folder_path.glob("*/*.txt"))

In [None]:
text_filepaths[:10]

In [None]:
text_filepaths[-10:]

In [None]:
len(text_filepaths)

In [None]:
first_filepath = text_filepaths[0]
first_filepath

In [None]:
print(first_filepath.read_text(encoding="utf-8"))

In [None]:
print(first_filepath.read_text(encoding="iso-8859-1"))

In [None]:
for path in text_filepaths:
    try:
        path.read_text(encoding="utf-8")
    except Exception as e:
        print(path)
        print(type(e), e)

In [None]:
b"\xa3".decode("utf-8")

In [None]:
b"\xa3".decode("cp1252")  # Western Europe (Windows code page)

In [None]:
b"\xa3".decode("cp1251")  # Cyrillic (Windows code page)

In [None]:
b"\xa3".decode("cp932")  # Japanese (Windows code page)

In [None]:
b"\xa3".decode("iso-8859-1")  # also known as latin-1

In [None]:
b"\xa3".decode("iso-8859-15")  # also known as latin-9

In [None]:
problematic_filepath = Path("bbc/sport/199.txt")
print(problematic_filepath.read_text(encoding="iso-8859-1"))

In [None]:
print(problematic_filepath.read_text(encoding="cp1251"))

In the context of an English-speaking news site, a Western European code page makes more sense. However, the first article is clearly UTF-8 and the `bbc/sport/199.txt` article is clearly not UTF-8.

This means that the articles were not all encoded with the same encoding. The documentation of the dataset does not give us any information on which encoding was used.
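To make the ambiguity concrete, here is a small self-contained sketch (the strings are illustrative, not taken from the dataset) showing how the same character maps to different bytes in the two families of encodings, and what happens when the wrong one is used for decoding:

```python
pound = "£"

# One byte in latin-1 (and cp1252), two bytes in UTF-8.
latin1_bytes = pound.encode("iso-8859-1")
utf8_bytes = pound.encode("utf-8")
print(latin1_bytes)  # b'\xa3'
print(utf8_bytes)    # b'\xc2\xa3'

# Decoding UTF-8 bytes with a single-byte codec never fails,
# but silently produces garbled text ("mojibake").
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)  # Â£

# The reverse direction fails loudly: a lone 0xa3 byte is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False

# Plain ASCII text is encoded identically in both encodings.
ascii_is_identical = "news".encode("utf-8") == "news".encode("iso-8859-1")
print(ascii_is_identical)  # True
```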

In this case we could try to guess, for instance using the `chardet.detect()` function, which relies on statistical heuristics to guess the encoding of each document:

https://pypi.org/project/chardet/

In [None]:
!pip install chardet

In [None]:
import chardet

chardet.detect(problematic_filepath.read_bytes())

This seems to agree with our manual inspection of this file. However, if we try chardet on the first document, it gives a bad answer:

In [None]:
chardet.detect(first_filepath.read_bytes())

So we cannot trust this tool for this dataset: there is too much ambiguity. Since all the documents are in English, most characters are plain ASCII, which is encoded identically in UTF-8 and in the Western European single-byte encodings. Let's just assume that UTF-8 was used everywhere and ignore (skip) the bytes that cannot be decoded as UTF-8:

In [None]:
print(problematic_filepath.read_text(encoding="utf-8", errors="ignore"))

In [None]:
texts = [path.read_text(encoding="utf-8", errors="ignore") for path in text_filepaths]
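Note that `errors="ignore"` silently drops the undecodable bytes, so a price such as "£5m" becomes "5m". A possible alternative, sketched below, is to try UTF-8 first and fall back to another encoding. The helper name `decode_with_fallback` is hypothetical, and the choice of `cp1252` as the fallback is an assumption, since the dataset documentation does not state which encoding was used:

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: same behaviour as above, drop the undecodable bytes.
    return raw_bytes.decode("utf-8", errors="ignore"), "utf-8 (ignore)"


# Illustrative samples (not taken from the dataset).
sample_utf8 = "Shares rose £5m".encode("utf-8")
sample_cp1252 = "Shares rose £5m".encode("cp1252")

print(decode_with_fallback(sample_utf8))    # ('Shares rose £5m', 'utf-8')
print(decode_with_fallback(sample_cp1252))  # ('Shares rose £5m', 'cp1252')

# For comparison, errors="ignore" loses the pound sign.
print(sample_cp1252.decode("utf-8", errors="ignore"))  # Shares rose 5m
```

With the file paths above, this could be applied as `decode_with_fallback(path.read_bytes())` for each `path` in `text_filepaths`. Whether the extra fidelity matters depends on the downstream task; for word-level topic analysis of English text the difference is likely small.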

Now that we have loaded all the text documents in memory, we can load the target labels (categories) of those documents by looking at the name of their parent folder:

In [None]:
def extract_label_from_path(filepath):
    return filepath.parent.name


extract_label_from_path(first_filepath)

In [None]:
categories = [extract_label_from_path(path) for path in text_filepaths]

In [None]:
len(categories)

In [None]:
from collections import Counter

counter = Counter(categories)
counter.most_common()

## Vectorizing Text Data: Bag of Words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
count_vectorizer = CountVectorizer()
count_vectorizer

In [None]:
test_sentence = "C'est l'été au Brésil!"

In [None]:
word_analyzer = CountVectorizer().build_analyzer()
word_analyzer(test_sentence)

In [None]:
word_analyzer = CountVectorizer(strip_accents="unicode", ngram_range=(2, 2)).build_analyzer()
word_analyzer(test_sentence)

In [None]:
word_analyzer = CountVectorizer(strip_accents="unicode", ngram_range=(1, 2)).build_analyzer()
word_analyzer(test_sentence)

In [None]:
word_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()
word_analyzer(test_sentence)

In [None]:
char_analyzer = CountVectorizer(analyzer="char").build_analyzer()
char_analyzer(test_sentence)

In [None]:
char_analyzer = CountVectorizer(analyzer="char", ngram_range=(1, 3)).build_analyzer()
char_analyzer(test_sentence)
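To build some intuition for what the word analyzers above do, here is a rough standard-library approximation. It is only a sketch, not scikit-learn's actual implementation: it assumes the default settings of `CountVectorizer` (lowercasing, and tokens made of at least two word characters), and the names `simple_word_analyzer` and `strip_accents` are hypothetical helpers introduced for illustration:

```python
import re
import unicodedata
from collections import Counter

# Tokens are runs of 2 or more word characters, so "c" and "l" are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_word_analyzer(text, strip=False, ngram_range=(1, 1)):
    """Lowercase, optionally strip accents, tokenize, then build word n-grams."""
    text = text.lower()
    if strip:
        text = strip_accents(text)
    tokens = TOKEN_PATTERN.findall(text)
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


sentence = "C'est l'été au Brésil!"
unigrams = simple_word_analyzer(sentence)
print(unigrams)  # ['est', 'été', 'au', 'brésil']

uni_and_bigrams = simple_word_analyzer(sentence, strip=True, ngram_range=(1, 2))
print(uni_and_bigrams)
# ['est', 'ete', 'au', 'bresil', 'est ete', 'ete au', 'au bresil']

# A bag of words is then just the count of each n-gram in a document.
bag_of_words = Counter(simple_word_analyzer("The cat sat on the mat"))
print(bag_of_words["the"])  # 2
```

The vectorizer additionally maps each distinct n-gram to a column index, so that a collection of documents becomes a sparse matrix of counts.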

## Supervised Text Classification Pipelines

## Unsupervised Text Clustering

## Visualization of High Dimensional Data

## Semantic Similarity in a Low-rank Latent Space