# Text Visualization Lab
In this lab, you are going to analyse a dataset containing Vox articles published before March 21, 2017. After your analysis, you should be able to answer:
#### - What are the topics of Vox articles?
#### - How are these topics distributed across the articles (how many documents does each topic have)?

### To answer the questions above, please follow the steps listed below, and demonstrate your results for each of them.
1. Explore your data: What do you learn from the data sense-making process? 
2. Pre-process your data: What pre-processing methods are you using?
3. Analyse your data.<br/>
If you use any Machine Learning Models in your analysis, please answer:
   - How did you choose the machine learning model(s) for this task?<br/>
   - How did you select the parameters?<br/><br/>
   Hint: 
   - If you use LDA, you need to set `corpus`, `id2word`, and `num_topics`. The model may run faster if you leave the other parameters at their defaults. If you do change other parameters, state your reasons for each change in your notebook.
   - There can be many topics. Run your model with several values of `num_topics` and choose the result you find best. If some topics do not make sense to you, you can ignore them in the later analysis; you do not need to rerun your model.
   - You can use the `get_document_topics()` method of the LDA model in the `gensim.models.ldamodel` library to get the probability of each topic for a given article. More information can be found in <a href="https://radimrehurek.com/gensim/models/ldamodel.html">this document</a>.
   
4. Display your answers to `What are the topics of Vox articles?` and `How are these topics distributed across the articles? (How many documents does each topic have)` with <strong>static</strong> visualizations.

### The data loading part is provided below. Please add your code and explanations after the data loading part.
If you are new to Jupyter Notebook, you may need to know the following things:
1. You can click `Cell`->`Cell Type`->`Markdown` to turn a cell into Markdown text. Search for `Markdown` online to learn the syntax.

In [71]:
import numpy as np, pandas as pd
import gensim, spacy
import gensim.corpora as corpora
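# `lemmatize` was removed from gensim.utils in gensim 4.0; spaCy (imported above) can be used for lemmatization instead.
from gensim.utils import simple_preprocess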
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt
import nltk

## Load Data

In [72]:
data_loaded = pd.read_csv(filepath_or_buffer='VoxData.csv', sep=',', header=None, index_col=None)

## Data Sense-making

In [73]:
data = data_loaded.values
data = data.reshape(data.shape[0],)
data.tolist()[:1]

['The markets haven\'t been kind to Bitcoin in 2014. The currency reached a high of nearly $1,000 in January before falling to around $350 this month, a plunge of more than 60 percent. It would be easy to write Bitcoin off as a fad whose novelty has worn off. \\nAfter all, dollars seem superior in almost every respect. They\'re accepted everywhere, they\'re convenient to use, and they have a stable value. Bitcoin is an inferior currency on all three counts. \\nBitcoin\'s detractors are making the same mistake as many Bitcoin fans  \\nYet it would be foolish to write Bitcoin off. The currency has had months-long slumps in the past, only to bounce back. More importantly, it\'s a mistake to think about Bitcoin as a new kind of currency. What makes Bitcoin potentially revolutionary is that it\'s the world\'s first completely open financial network. \\nHistory suggests that open platforms like Bitcoin often become fertile soil for innovation. Think about the internet. It didn\'t seem like a

In [74]:
data.shape

(23011,)

## Data Preprocessing

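The articles contain the literal two-character sequence `\n` (a backslash followed by `n`, displayed as `\\n` in the output above) instead of real line breaks. Replace every such sequence with a space character `' '` or a real newline character before tokenizing. Otherwise, you may get words like 'nthe' and 'nbut' where the text actually says 'the' and 'but'.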
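
The replacement described above can be sketched as follows. This is a minimal illustration, not the required solution: the helper name `clean_newlines` and the sample string are made up for this example, and collapsing repeated whitespace is an extra step that the lab does not ask for.

```python
import re

def clean_newlines(text: str) -> str:
    """Replace literal backslash-n sequences and real newlines with a space.

    `clean_newlines` is a hypothetical helper name, not part of the lab material.
    """
    # "\\n" in Python source is the two-character sequence backslash + n.
    text = text.replace("\\n", " ").replace("\n", " ")
    # Optional extra step (an assumption): collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

# Short excerpt in the style of the first article, with a literal backslash-n.
sample = "novelty has worn off. \\nAfter all, dollars seem superior"
print(clean_newlines(sample))
```

To apply it to every article, a list comprehension such as `[clean_newlines(doc) for doc in data]` would work on the `data` array loaded earlier.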

## What are the topics of Vox articles?

## How are topics distributed?
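
One possible way to count documents per topic is to assign each article to its most probable topic. This is an assumption about how to read the question, not the only valid interpretation. The sketch below takes per-document lists of `(topic_id, probability)` pairs, the shape returned by `get_document_topics()`; the function name `dominant_topic_counts` and the example values are hypothetical.

```python
from collections import Counter

def dominant_topic_counts(doc_topics):
    """Count how many documents have each topic as their most probable topic.

    doc_topics: iterable of lists of (topic_id, probability) pairs,
    one list per document. Hypothetical helper, not part of gensim.
    """
    counts = Counter()
    for topics in doc_topics:
        if not topics:
            continue  # skip documents with no topic assignment
        best_topic, _ = max(topics, key=lambda pair: pair[1])
        counts[best_topic] += 1
    return counts

# Made-up distributions for three documents and two topics.
example = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.7), (1, 0.3)],
]
counts = dominant_topic_counts(example)
print(sorted(counts.items()))  # [(0, 2), (1, 1)]
```

The resulting counts can be drawn as a static bar chart, for example with `plt.bar(list(counts.keys()), list(counts.values()))` using the `matplotlib.pyplot` import from the first cell.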