# DSI Summer Workshops Series


Peggy Lindner<br>
Center for Advanced Computing & Data Science (CACDS)<br>
Data Science Institute (DSI)<br>
University of Houston  
plindner@uh.edu 


This Notebook is available at:
http://bitly.com/UHDSInotebook1

You can already download it! Use the "Save Link as" method.

Please make sure you have a copy of R up and running, as well as a Python 3 installation (ideally from Anaconda).

## Goals for today

Understand basics of text analysis using R

(well enough so that you can Google your problems, find the answer, and implement it.)

#### More specifically

1. Up and running with R & IPython
2. Understand a basic exploratory data analysis workflow
3. Basics of R and Topic Modeling 

#### Why R and not Python?

It's good for data exploration! 

## Part 1: Getting yourself ready

### First: Install software on your computer

* R: [CRAN](https://cran.r-project.org/)
* Python: [Anaconda](https://www.anaconda.com/download/)




### Second: Prep your R environment
On a Mac: open a terminal and start R

![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/on-a-mac.png)

On Windows: open the Anaconda command line and start R

![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/anaconda-start.png)

![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/windows.png)

Now let's install some packages ...

```
> install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest'))
> devtools::install_github('IRkernel/IRkernel')
```

When you see "Please select a CRAN mirror", select one from the list.

![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/cran-repo.png)
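If you are unsure whether the installation succeeded, a quick check along these lines may help. This is only a sketch: `missing_packages` is a hypothetical helper name, not part of the workshop material, and the mirror URL in the comment is just one possible choice.

```r
# Hypothetical helper (not from the workshop material): return the names of
# the requested packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!(pkgs %in% installed)]
}

# The packages installed in the step above
needed <- c("repr", "IRdisplay", "evaluate", "crayon",
            "pbdZMQ", "devtools", "uuid", "digest")

to_install <- missing_packages(needed)
print(to_install)  # an empty result means everything is already installed

# To install only what is missing, without the interactive mirror prompt,
# one could pass a mirror explicitly (assumption: the cloud mirror is reachable):
# install.packages(to_install, repos = "https://cloud.r-project.org")
```

Passing `repos` explicitly is one way to skip the mirror dialog; choosing a mirror interactively works just as well.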

... one last step - installing the Kernel

```
> IRkernel::installspec()
```

Now we can close the R environment (but leave your terminal or console open).


```
> quit()
```

Say "N" (no) when asked to save the workspace.

### We will be using Jupyter Notebooks

We are now ready to start up our Jupyter Environment from the terminal or the console:

On Windows:

```
$ jupyter notebook --notebook-dir C:/Users/[your username]
```

or on a Mac:

```
$ jupyter notebook --notebook-dir /Users/[your username]
```

Your browser should then open at the address: http://localhost:8888/tree

![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/jupyter.png)

#### Open the downloaded notebook on your computer

![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/second-screen.png)


#### Quick intro to Jupyter notebooks

Cells can be Markdown (like this one) or code.


#### To start off with

Type an expression into a code cell and press `Shift-Enter` or `Ctrl-Enter` to run it.

```
2 + 2
```
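Beyond simple arithmetic, here are a few more lines you might try in a code cell. This is a minimal sketch of R basics; the `headlines` vector is made-up example data, not part of the workshop material.

```r
# Assign values with `<-`; c() combines values into a vector
headlines <- c("Oil prices rise", "Houston weather update", "Oil exports fall")

length(headlines)        # number of elements in the vector
nchar(headlines)         # number of characters in each headline
toupper(headlines[1])    # indexing starts at 1 in R
grepl("Oil", headlines)  # TRUE for each headline that contains "Oil"
```

Each line prints its result below the cell when run on its own; in a multi-line cell, Jupyter shows the results in order.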

## Part 2: The Exploratory Analysis Workflow

![](https://raw.github.com/peggylind/DSI_Summer_Workshops/master/images/data-science-workflow.png)
Image source: Hadley Wickham, R for Data Science

#### Our Example 

Media analysis of a set of articles downloaded from a database called "Factiva".



#### Frequently used R Packages in conjunction with text data

* [readr](https://cran.r-project.org/web/packages/readr/readr.pdf) Import data

Data analysis of text-based material

* [stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) Clean up text
* [SnowballC](https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf)  Stemming of words
* [tm](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) Text mining
* [Quanteda](https://quanteda.io/) Versatile text analysis tool

Visualization

* [ggplot2](http://ggplot2.tidyverse.org/) Modern R visualizations
* [wordcloud](https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf) Make some nice word clouds
* [RColorBrewer](https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf) Get color into your visualizations



```
# load all required libraries
library(readr)
library(stringr)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(tm)
```

```
Loading required package: RColorBrewer
Loading required package: NLP

Attaching package: ‘NLP’

The following object is masked from ‘package:ggplot2’:

    annotate
```
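To give a feel for how these packages fit together, here is a minimal sketch of an import, clean, count, and plot workflow. It is only an illustration under stated assumptions: the three `docs` sentences are made up (they are not the Factiva articles), the temporary file stands in for a real data file, and the cleaning steps are one reasonable choice among many.

```r
library(readr)
library(stringr)
library(SnowballC)
library(tm)
library(wordcloud)
library(RColorBrewer)

# Made-up example documents (hypothetical; not the workshop's Factiva data)
docs <- c("Oil prices are rising in Houston!",
          "Houston's energy companies reported rising profits.",
          "Prices of oil fell, analysts reported.")

# Import: write the examples to a temporary file and read them back with readr
tmp <- tempfile(fileext = ".txt")
writeLines(docs, tmp)
raw <- read_lines(tmp)

# Clean up with stringr: lower-case, keep only letters and spaces, squeeze whitespace
clean <- str_to_lower(raw)
clean <- str_replace_all(clean, "[^a-z ]", " ")
clean <- str_squish(clean)

# Text mining with tm: build a corpus, drop English stop words, stem the rest
corpus <- VCorpus(VectorSource(clean))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)      # stemming relies on SnowballC
corpus <- tm_map(corpus, stripWhitespace)

# Count how often each stemmed term occurs across all documents
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq)

# Visualize: min.freq = 1 because this toy corpus is tiny
wordcloud(names(freq), freq, min.freq = 1,
          colors = brewer.pal(8, "Dark2"))
```

With real articles you would replace the temporary file with your own data and probably raise `min.freq` so that only the more frequent terms are drawn.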



Analysis (Python packages)

* [NLTK](http://nltk.org) Natural Language Toolkit. There's a [book](http://nltk.org/book/)
* [scikit-learn](http://scikit-learn.org/stable/) Machine learning
* [pandas](http://pandas.pydata.org) Data management and analysis. There is a book.
* [gensim](http://radimrehurek.com/gensim/) Topic modeling
* [NetworkX](http://networkx.github.io) Network analysis
* [matplotlib](http://matplotlib.org) Plotting
* [NumPy](http://www.numpy.org) and [scipy](http://scipy.org) Computational backbone
* [rpy](http://rpy.sourceforge.net) Python bindings for R
