# Scikit-learn datasets

Scikit-learn is a Python library that bundles many useful tools for machine learning tasks. In the previous exercises, you have probably already used several of its components, such as implementations of machine learning algorithms (e.g. SVM or Naive Bayes), feature extraction tools (e.g. CountVectorizer) or metric calculations (e.g. accuracy scores). <br>

Additionally, scikit-learn contains a variety of prepared datasets that are useful for testing and exploring machine learning concepts and approaches. For our next exercise, we will take a look at the "20 newsgroups" dataset. As "20 newsgroups" is a text-based dataset that comprises around 18,000 newsgroup posts on 20 topics, it is well suited for NLP experiments.

### Getting started

As usual, we first need to import the required element of scikit-learn, which in this case is the _fetch_\__20newsgroups_ function:

In [None]:
from sklearn.datasets import fetch_20newsgroups

Internally, the "20 newsgroups" set is already divided into a training and a test split. By passing 'train', 'test' or 'all' to the parameter "subset", we can fetch the desired split (the first call downloads the data, so this might take a while):

In [None]:
newsgroups = fetch_20newsgroups(subset='train')

The return value of the function, which is now stored in the variable "newsgroups", is a container-like object that gives us access to the news data via its _data_ attribute:

In [None]:
# list of all entries
news_list = newsgroups.data

# lots of data to explore
print('A total of {} entries!\n'.format(len(news_list)))

In [None]:
# show first entry
print(news_list[0])

### Class labels

As "20 newsgroups" is a labeled dataset, we can access the targets or class labels for each text entry in a similar fashion by using the attributes _target_\__names_ and _target_ as shown below. You can see that the entries belong to very different real-world topics:  

In [None]:
# show the names of all targets/labels that occur in the newsgroup container 
print(newsgroups.target_names)

In [None]:
# get the targets/labels for each news entry
news_targets = newsgroups.target

# we can access the corresponding label for each news entry
for i in range(0,3):
    print('News entry {} belongs to category {}.'.format(i, news_targets[i])) 

We see that only the indices of the targets are stored. Try to use the list of target names to find out which category the fifth entry belongs to. Afterwards, print the text of the fifth entry - does the content match the category?

In [None]:
# let's see to which category the fifth news entry belongs
print("The fifth entry in the list belongs to category '{}'.\n".format(newsgroups.target_names[news_targets[4]]))

# take a look at the text of the fifth news entry
print(news_list[4])
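
To get a quick feel for how the entries are distributed over the categories, the index-to-name lookup from above can be combined with a counter. The following is a minimal sketch on made-up toy data: `toy_target_names` and `toy_targets` are hypothetical stand-ins for `newsgroups.target_names` and `newsgroups.target`, so the cell runs without downloading anything.

```python
from collections import Counter

# hypothetical stand-ins for newsgroups.target_names and newsgroups.target
toy_target_names = ['alt.atheism', 'rec.autos', 'sci.space']
toy_targets = [2, 0, 2, 1, 2, 0]

# translate each label index into its category name, then count occurrences
label_counts = Counter(toy_target_names[idx] for idx in toy_targets)

# list the categories from most to least frequent
for name, count in label_counts.most_common():
    print('{}: {} entries'.format(name, count))
```

With the real data, replacing the toy variables with `newsgroups.target_names` and `news_targets` should work in the same way.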

### Parameter options

When fetching data with the introduced function, we can adjust the outcome through several useful parameters. Besides the "subset" parameter that we have already seen, we can use "categories" to retrieve only a subset of targets and entries:

In [None]:
# extract only the first three category names
target_subset = newsgroups.target_names[:3]

# fetch only news entries from training data that belong to these categories
ng_first_three = fetch_20newsgroups(subset='train', categories=target_subset)
print('Fetched categories:', ng_first_three.target_names)
print('Number of fetched entries:', len(ng_first_three.data))

Now you can try to retrieve only entries that belong to sports-related categories:

In [None]:
# your turn: try to get only the sports-related entries and targets
sports_targets = [name for name in newsgroups.target_names if 'sport' in name]
ng_sports = fetch_20newsgroups(subset='train', categories=sports_targets)
print('Fetched categories:', ng_sports.target_names)
print('Number of fetched news entries:', len(ng_sports.data))
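
One detail to keep in mind: when only some categories are fetched, the label indices in _target_ refer to the reduced list of target names and not to the original list of 20 names. The following sketch imitates this behavior on toy data. `select_categories` is a hypothetical helper written for illustration; it is not part of scikit-learn, and the toy lists are stand-ins for the _data_, _target_ and _target_\__names_ attributes.

```python
def select_categories(docs, targets, target_names, wanted):
    """Keep only entries whose category is in `wanted` and re-index the labels.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # keep the original ordering of the category names
    kept_names = [name for name in target_names if name in wanted]
    # map each old label index to its position in the reduced name list
    new_index = {target_names.index(name): i for i, name in enumerate(kept_names)}

    kept_docs, kept_targets = [], []
    for doc, old_idx in zip(docs, targets):
        if old_idx in new_index:
            kept_docs.append(doc)
            kept_targets.append(new_index[old_idx])
    return kept_docs, kept_targets, kept_names


# toy stand-ins for the data, target and target_names attributes
toy_names = ['comp.graphics', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.space']
toy_docs = ['pixels', 'home run', 'slap shot', 'orbit', 'pitcher']
toy_targets = [0, 1, 2, 3, 1]

sport_names = [name for name in toy_names if 'sport' in name]
sub_docs, sub_targets, sub_names = select_categories(
    toy_docs, toy_targets, toy_names, sport_names)

print('Kept categories:', sub_names)
print('Re-indexed labels:', sub_targets)
print('Kept entries:', sub_docs)
```

In practice this means that labels should always be looked up in the target names of the same container they were fetched with.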

<br>Another useful option is given by the "remove" parameter. When looking at the news entries, we see a lot of meta-information around the actual message such as e-mail headers, user-defined footers or quotes of other messages: 

In [None]:
# original entries contain a lot of meta-information
print(news_list[3])

Depending on the use case, we might want to get rid of some (or all) of this meta-information. The "remove" parameter accepts any combination of 'headers', 'footers' and 'quotes' and allows us to retrieve preprocessed variants of the entries where the specified elements are already removed for us:

In [None]:
# fetch only messages without headers and footers
ng_cleaned = fetch_20newsgroups(subset='train', remove=('headers', 'footers'))
print(ng_cleaned.data[3])
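
To make the effect of removing headers more tangible, here is a simplified sketch of the idea on a made-up message. It assumes that the header block is separated from the message body by the first blank line, as is the usual convention for e-mail and newsgroup posts. `strip_header` is a hypothetical helper for illustration only; the cleaning performed by the "remove" parameter may differ in detail.

```python
def strip_header(text):
    """Return everything after the first blank line (hypothetical helper).

    If the text contains no blank line, an empty string is returned.
    """
    _header, _blank, body = text.partition('\n\n')
    return body


# made-up message: two header lines, a blank line, then the body
toy_post = (
    'From: someone@example.com\n'
    'Subject: Launch window\n'
    '\n'
    'When is the next launch?\n'
)

print(strip_header(toy_post))
```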

<br>Lastly, the fetch function allows us to retrieve already shuffled data via the Boolean parameter "shuffle". To guarantee a deterministic shuffling process, we can additionally specify a seed by passing an int value to the parameter "random_state".<br>
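
The effect of a fixed seed can be illustrated with Python's built-in `random` module. This is only an analogy on toy data: it does not use scikit-learn's own random number handling, and `shuffled_copy` is a hypothetical helper.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of `items` using a dedicated, seeded generator."""
    rng = random.Random(seed)
    result = list(items)
    rng.shuffle(result)
    return result


# toy stand-in for a list of news entries
toy_docs = ['doc A', 'doc B', 'doc C', 'doc D', 'doc E']

first_run = shuffled_copy(toy_docs, seed=123)
second_run = shuffled_copy(toy_docs, seed=123)

# the same seed yields the same order on every run
print('Same order:', first_run == second_run)
# shuffling only reorders the entries, nothing is lost or added
print('Same content:', sorted(first_run) == toy_docs)
```

A fixed seed is what makes experiments repeatable: anyone who runs the cell again gets the entries in the same order.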

Now it's your turn! Try to combine the previous information to fetch and review a customized set of news entries. It should include only texts from the categories _rec.motorcycles_ and _sci.space_ of the training subset. Furthermore, it should be shuffled with a random seed of 123, and headers and footers should be removed. Store the resulting text entries in a variable _docs_ and the corresponding labels in a separate variable _targets_.

In [None]:
# fetch custom corpus 
ng_custom = fetch_20newsgroups(subset='train', categories=['rec.motorcycles', 'sci.space'], 
                               shuffle=True, random_state=123, remove=('headers', 'footers'))

# store text entries and corresponding labels for further processing
docs = ng_custom.data
targets = ng_custom.target

# print some docs and their categories
names = ng_custom.target_names
for i in range(0, 2):
    print("==================================\n")
    print("Category: '{}'\n\n{}\n".format(names[targets[i]], docs[i]))